Patent classifications
G06F9/544
Remote validation and preview
Systems and methods are directed to remote validation and preview. An example system receives an indication of a portion of a data pipeline to be processed, generates a data pipeline configuration file describing operations in the portion of the data pipeline, causes a software framework to perform operations corresponding to the portion of the data pipeline, receives results of the operations corresponding to the portion of the data pipeline, and causes presentation of the results on a graphical user interface of a computing device.
Cooperative Group Arrays
- Greg PALMER
- Gentaro HIROTA
- Ronny Krashinsky
- Ze Long
- Brian Pharris
- Rajballav DASH
- Jeff TUCKEY
- Jerome F. Duluk, Jr.
- Lacky Shah
- Luke DURANT
- Jack Choquette
- Eric WERNESS
- Naman GOVIL
- Manan PATEL
- Shayani DEB
- SANDEEP NAVADA
- John Edmondson
- Prakash Bangalore Prabhakar
- Wish Gandhi
- Ravi MANYAM
- Apoorv PARLE
- Olivier GIROUX
- Shirish Gadre
- Steve HEINRICH
A new level of hierarchy, Cooperative Group Arrays (CGAs), and an associated new hardware-based work distribution/execution model are described. A CGA is a grid of thread blocks (also referred to as cooperative thread arrays (CTAs)). CGAs provide co-scheduling, e.g., control over where CTAs are placed/executed in a processor (such as a GPU), relative to the memory required by an application and relative to each other. Hardware support for such CGAs guarantees concurrency and enables applications to see more data locality, reduced latency, and better synchronization between all the threads in tightly cooperating collections of CTAs programmably distributed across different (e.g., hierarchical) hardware domains or partitions.
Distributed Shared Memory
- Prakash Bangalore Prabhakar
- Gentaro HIROTA
- Ronny Krashinsky
- Ze Long
- Brian Pharris
- Rajballav DASH
- Jeff TUCKEY
- Jerome F. Duluk, Jr.
- Lacky Shah
- Luke DURANT
- Jack Choquette
- Eric WERNESS
- Naman GOVIL
- Manan PATEL
- Shayani DEB
- SANDEEP NAVADA
- John Edmondson
- Greg PALMER
- Wish Gandhi
- Ravi MANYAM
- Apoorv PARLE
- Olivier GIROUX
- Shirish Gadre
- Steve HEINRICH
Distributed shared memory (DSMEM) comprises blocks of memory that are distributed or scattered across a processor (such as a GPU). Threads executing on a processing core local to one memory block are able to access a memory block local to a different processing core. In one embodiment, shared access to these DSMEM allocations distributed across a collection of processing cores is implemented by communications between the processing cores. Such distributed shared memory provides very low latency memory access for processing cores located in proximity to the memory blocks, and also provides a way for more distant processing cores to access the memory blocks in a manner and using interconnects that do not interfere with the processing cores' access to main or global memory such as that backed by an L2 cache. Such distributed shared memory supports cooperative parallelism and strong scaling across multiple processing cores by permitting data sharing and communications previously possible only within the same processing core.
Cache-based communication for trusted execution environments
A method executes inter-enclave communication via cache memory of a processor. The method includes: instantiating a first enclave such that it is configured to execute a first communication thread, which is configured to read/write data to the cache memory; instantiating a second enclave such that it is configured to execute a second communication thread, which is configured to read/write data to the cache memory; executing, by the first enclave, the first communication thread to send message data to the second enclave, executing the first communication thread comprising writing the message data to the cache memory; and executing, by the second enclave, the second communication thread to receive the message data. Executing the second communication thread can include: monitoring the cache memory to determine whether the message data is being sent; and based upon determining the message data is being sent, reading from the cache memory to receive the message data.
Efficient and reliable host distribution of totally ordered global state
An asynchronous distributed computing system with a plurality of computing nodes is provided. One of the computing nodes includes a sequencer service that receives updates from the plurality of computing nodes. The sequencer service maintains or annotates messages added to the global state of the system. Updates to the global state are published to the plurality of computing nodes. Monitoring services on the other computing nodes write the updates into a locally maintained copy of the global state that exists in shared memory on each one of the nodes. Client computer processes on the nodes may then subscribe to have updates “delivered” to the respective client computer processes.
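The ordering scheme in this abstract can be illustrated with a minimal Python sketch. It assumes that the sequencer stamps each update with a monotonically increasing sequence number and that each node's monitoring service applies updates strictly in that order, buffering any that arrive early. The `Sequencer` and `NodeMonitor` classes, their methods, and the in-process dictionaries standing in for shared memory are hypothetical names chosen for illustration, not details taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Update = Tuple[int, str, str]  # (sequence number, key, value)


@dataclass
class Sequencer:
    """Hypothetical sequencer: stamps every update with the next sequence number."""
    next_seq: int = 0

    def submit(self, key: str, value: str) -> Update:
        update = (self.next_seq, key, value)
        self.next_seq += 1
        return update


@dataclass
class NodeMonitor:
    """Hypothetical per-node monitor keeping a local copy of the global state."""
    state: Dict[str, str] = field(default_factory=dict)
    expected: int = 0  # next sequence number this node may apply
    pending: Dict[int, Tuple[str, str]] = field(default_factory=dict)
    subscribers: List[Callable[[int, str, str], None]] = field(default_factory=list)

    def on_publish(self, update: Update) -> None:
        seq, key, value = update
        if seq < self.expected:
            return  # duplicate of an update that was already applied
        self.pending[seq] = (key, value)
        # Apply buffered updates only while the next expected number is present.
        while self.expected in self.pending:
            k, v = self.pending.pop(self.expected)
            self.state[k] = v
            for deliver in self.subscribers:
                deliver(self.expected, k, v)
            self.expected += 1


sequencer = Sequencer()
updates = [
    sequencer.submit("price", "10"),
    sequencer.submit("price", "11"),
    sequencer.submit("status", "open"),
]

node_a, node_b = NodeMonitor(), NodeMonitor()
seen_a: List[int] = []
seen_b: List[int] = []
node_a.subscribers.append(lambda seq, k, v: seen_a.append(seq))
node_b.subscribers.append(lambda seq, k, v: seen_b.append(seq))

for u in updates:            # node A receives the updates in order
    node_a.on_publish(u)
for u in reversed(updates):  # node B receives them out of order
    node_b.on_publish(u)
```

Under these assumptions both nodes converge on the same local state and deliver updates to subscribers in the same order, regardless of arrival order.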
Distributing matrix multiplication processing among processing nodes
Based on a predetermined number of available processor sockets, a plurality of candidate matrix decompositions are identified, which correspond to a multiplication of matrices. Based on a first comparative relationship of a variation of first sizes of the plurality of candidate matrix decompositions along a first dimension and a second comparative relationship of a variation of second sizes of the plurality of candidate matrix decompositions along a second dimension, a given candidate matrix decomposition is selected. Processing of the multiplication among the processor sockets is distributed based on the given candidate matrix decomposition.
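One possible reading of this selection step is sketched below in Python. The abstract does not define the "comparative relationship", so the sketch makes several assumptions: candidates are the factor pairs of the socket count, each dimension is cut into fixed-width blocks with a possibly short or empty tail, "variation" is the difference between the largest and smallest block, and the candidate with the smallest combined variation wins, with ties broken toward more nearly square blocks. The function names are hypothetical.

```python
from math import ceil
from typing import List, Tuple


def block_sizes(extent: int, parts: int) -> List[int]:
    """Cut `extent` into `parts` blocks of width ceil(extent / parts).

    Trailing blocks may be shorter than the rest, or empty.
    """
    width = ceil(extent / parts)
    return [max(0, min(width, extent - i * width)) for i in range(parts)]


def candidate_grids(sockets: int) -> List[Tuple[int, int]]:
    """All (row_parts, col_parts) grids that use exactly `sockets` sockets."""
    return [(r, sockets // r) for r in range(1, sockets + 1) if sockets % r == 0]


def select_decomposition(m: int, n: int, sockets: int) -> Tuple[int, int]:
    """Pick the grid whose block sizes vary least along both dimensions.

    The scoring rule is an assumption made for illustration only.
    """
    def score(grid: Tuple[int, int]) -> Tuple[int, int]:
        row_parts, col_parts = grid
        rows = block_sizes(m, row_parts)
        cols = block_sizes(n, col_parts)
        row_variation = max(rows) - min(rows)
        col_variation = max(cols) - min(cols)
        squareness = abs(max(rows) - max(cols))  # tie-breaker
        return (row_variation + col_variation, squareness)

    return min(candidate_grids(sockets), key=score)


tall_choice = select_decomposition(1000, 10, 4)
square_choice = select_decomposition(9, 9, 4)
```

For a tall 1000 x 10 result on four sockets, this rule splits only the long dimension; for a 9 x 9 result it prefers the 2 x 2 grid, because the 1 x 4 and 4 x 1 grids leave one socket with an empty block.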
Dynamic type resolution for shared memory
A method and apparatus of a network device that allocates a shared memory buffer for an object is described. In an exemplary embodiment, the network device receives an allocation request for the shared memory buffer for the object. In addition, the network device allocates the shared memory buffer from shared memory of the network device, where the shared memory buffer is accessible by a writer and a plurality of readers. The network device further returns a writer pointer to the writer, where the writer pointer references a base address of the shared memory buffer. Furthermore, the network device stores the object in the shared memory buffer, wherein the writer accesses the shared memory using the writer pointer. The network device further shares the writer pointer with at least a first reader of the plurality of readers. The network device additionally translates the base address of the shared memory buffer to a reader pointer, where the reader pointer is expressed in a memory space of the first reader.
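The pointer translation in the last step can be sketched as simple offset arithmetic, assuming the same shared region is mapped at a different base address in the writer and in each reader. The `Mapping` class, the `translate` function, and the addresses below are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Where one process has the shared region mapped (illustrative addresses)."""
    base: int
    size: int


def translate(ptr: int, src: Mapping, dst: Mapping) -> int:
    """Re-express a pointer valid in `src` as a pointer valid in `dst`.

    Assumes both mappings cover the same shared region, so only the
    offset from the region base carries over between address spaces.
    """
    offset = ptr - src.base
    if not 0 <= offset < src.size:
        raise ValueError("pointer lies outside the source mapping")
    if offset >= dst.size:
        raise ValueError("offset lies outside the destination mapping")
    return dst.base + offset


writer_map = Mapping(base=0x7F00_0000_0000, size=4096)
reader_map = Mapping(base=0x5500_0000_1000, size=4096)

# The writer pointer references the buffer's base address in the writer's space.
writer_ptr = writer_map.base + 0x40
reader_ptr = translate(writer_ptr, writer_map, reader_map)
```

The reader pointer refers to the same byte of the shared buffer as the writer pointer, but is expressed in the reader's own address space.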
Scalable and accelerated function as a service calling architecture
Examples described herein relate to requesting execution of a workload by a next function with data transport overhead tailored based on memory sharing capability with the next function. In some examples, data transport overhead is one or more of: sending a memory address pointer, sending a virtual memory address pointer, or sending data to the next function. In some examples, the memory sharing capability with the next function is based on one or more of: whether the next function shares an enclave with a sender function, whether the next function shares a physical memory domain with a sender function, or whether the next function shares a virtual memory domain with a sender function. In some examples, selection of the next function from among multiple instances of the next function is based on one or more of: sharing of memory domain, throughput performance, latency, cost, load balancing, or service level agreement (SLA) requirements.
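A minimal Python sketch of tailoring the transport to the sharing capability follows. The abstract lists the transport options and the sharing conditions but not how they pair up, so the mapping here (a shared enclave or physical memory domain allows a plain memory pointer, a shared virtual memory domain allows a virtual address pointer, and anything else falls back to sending the data) is an assumption. `Transport`, `Sharing`, and `choose_transport` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Transport(Enum):
    MEMORY_POINTER = "send memory address pointer"
    VIRTUAL_POINTER = "send virtual memory address pointer"
    COPY_DATA = "send data"


@dataclass(frozen=True)
class Sharing:
    """What the sender function shares with the next function."""
    same_enclave: bool = False
    same_physical_domain: bool = False
    same_virtual_domain: bool = False


def choose_transport(sharing: Sharing) -> Transport:
    """Pick the cheapest transport the sharing capability allows (assumed order)."""
    if sharing.same_enclave or sharing.same_physical_domain:
        return Transport.MEMORY_POINTER
    if sharing.same_virtual_domain:
        return Transport.VIRTUAL_POINTER
    return Transport.COPY_DATA


local_call = choose_transport(Sharing(same_enclave=True))
virtual_call = choose_transport(Sharing(same_virtual_domain=True))
remote_call = choose_transport(Sharing())
```

Ordering the checks from cheapest to most expensive transport means the data is sent in full only when no form of memory sharing is available.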
Seamless access to a common physical disk in an AMP system without an external hypervisor
The present disclosure is directed to seamless access to a common physical disk in an AMP system without an external hypervisor, and includes one or more processors and one or more computer-readable non-transitory storage media comprising instructions that, when executed by the one or more processors, cause one or more components of the system to perform operations including instantiating, by a first instance, a second instance during a system upgrade, creating, in the first instance, a first disk abstraction for a block device of a physical disk, and attaching the block device under the first disk abstraction. The operations further include providing the second instance network-based access to the physical disk using the first disk abstraction of the first instance during the system upgrade.
Delivering immediate values by using program counter (PC)-relative load instructions to fetch literal data in processor-based devices
Delivering immediate values by using program counter (PC)-relative load instructions to fetch literal data in processor-based devices is disclosed. In this regard, a processing element (PE) of a processor-based device provides an execution pipeline circuit that comprises an instruction processing portion and a data access portion. Using a literal data access logic circuit, the PE detects a PC-relative load instruction within a fetch window that includes multiple fetched instructions. The PE determines that the PC-relative load instruction can be serviced using literal data that is available to the instruction processing portion of the execution pipeline circuit (e.g., located within the fetch window containing the PC-relative load instruction, or stored in a literal pool buffer). The PE then retrieves the literal data within the instruction processing portion of the execution pipeline circuit, and executes the PC-relative load instruction using the literal data.
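The decision described above can be modeled in software with a short Python sketch. It is a behavioral illustration only, not a description of the hardware: it assumes 4-byte words, a load target computed as `pc + offset`, a fetch window represented as a list of words starting at `window_base`, and a literal pool buffer represented as a dictionary keyed by address. All names and values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

WORD = 4  # assumed width of an instruction or literal, in bytes


def service_pc_relative_load(
    pc: int,
    offset: int,
    window_base: int,
    window: List[int],
    literal_pool: Dict[int, int],
) -> Tuple[Optional[int], str]:
    """Return (literal, source) for a PC-relative load.

    The literal is taken from the fetch window or the literal pool buffer
    when available there; otherwise the load is left to the data access
    portion and no literal is returned.
    """
    target = pc + offset
    delta = target - window_base
    if delta % WORD == 0 and 0 <= delta // WORD < len(window):
        return window[delta // WORD], "fetch_window"
    if target in literal_pool:
        return literal_pool[target], "literal_pool_buffer"
    return None, "data_access"


# Placeholder words: three instruction slots followed by one literal.
window = [0x11111111, 0x22222222, 0x33333333, 0xDEADBEEF]
pool = {0x1100: 42}

in_window = service_pc_relative_load(0x1000, 12, 0x1000, window, pool)
in_pool = service_pc_relative_load(0x1000, 0x100, 0x1000, window, pool)
fallback = service_pc_relative_load(0x1000, 0x200, 0x1000, window, pool)
```

The first two loads are satisfied without involving the data access portion, which is the case the abstract targets; the third falls through to an ordinary memory access.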