Patent classifications
G06F2209/521
High-performance remote atomic synchronization
One example method may be performed in an operating environment including distributed and/or disaggregated compute nodes that communicate with each other and with a shared computing resource by way of an RDMA fabric. The method may include obtaining, by a first one of the compute nodes, ownership of an atomic synchronization object that controls access to the shared computing resource, using, by the first compute node, the shared computing resource until the shared computing resource is no longer needed by the first compute node, and when the shared computing resource is no longer needed by the first compute node, relinquishing, by the first compute node, the ownership of the atomic synchronization object.
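The obtain, use, and relinquish cycle described above can be sketched with a compare-and-swap on a single ownership word. This is a minimal local model, not the patented method: `RemoteWord`, `FREE`, `acquire`, `release`, and `run_nodes` are hypothetical names, threads stand in for compute nodes, and a mutex stands in for the atomicity that an RDMA fabric would provide remotely.

```python
import threading
import time

FREE = 0  # assumed sentinel: no node currently owns the object


class RemoteWord:
    """Local stand-in for a word in fabric-attached memory.

    On real hardware the compare-and-swap would be executed remotely as an
    RDMA atomic; here a mutex models that atomicity.
    """

    def __init__(self):
        self._value = FREE
        self._guard = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Store `new` if the word equals `expected`; return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old


def acquire(word, node_id):
    """Spin until this node installs its id as the owner."""
    while word.compare_and_swap(FREE, node_id) != FREE:
        time.sleep(0)  # yield so the current owner can make progress


def release(word, node_id):
    """Give up ownership; refuse if the caller is not the owner."""
    if word.compare_and_swap(node_id, FREE) != node_id:
        raise RuntimeError("release attempted by a node that is not the owner")


def run_nodes(num_nodes=4, iterations=100):
    """Each simulated node increments a shared counter under the lock."""
    word = RemoteWord()
    shared = {"counter": 0}

    def node(node_id):
        for _ in range(iterations):
            acquire(word, node_id)
            shared["counter"] += 1  # use the shared resource
            release(word, node_id)

    # Node ids start at 1 so that none collides with the FREE sentinel.
    threads = [threading.Thread(target=node, args=(n,))
               for n in range(1, num_nodes + 1)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["counter"]
```

Storing the owner's id in the word, instead of a plain locked flag, lets `release` reject a call from a node that does not hold the object.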
Chained resource locking
Devices and techniques for chained resource locking are described herein. Threads form a last-in-first-out (LIFO) queue on a resource lock to create a chained lock on the resource. A data store representing the lock for the resource holds the previous thread's identifier, enabling a subsequent thread to wake the previous thread using the identifier when the subsequent thread releases the lock. Generally, the thread releasing the lock need not interact with the data store, reducing contention for the data store among many threads.
Shiftable memory supporting atomic operation
A shiftable memory supporting atomic operation employs built-in shifting capability to shift a contiguous subset of data from a first location to a second location within memory during an atomic operation. The shiftable memory includes the memory to store data. The memory has the built-in shifting capability. The shiftable memory further includes an atomic primitive defined on the memory to operate on the contiguous subset.
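As a rough software analogy only, an atomic shift of a contiguous subset might look like the sketch below. The abstract describes a capability built into the memory itself; here a mutex stands in for that atomicity. `ShiftableMemory`, `shift`, and `snapshot` are hypothetical names, and leaving vacated cells unchanged is an assumption.

```python
import threading


class ShiftableMemory:
    """Software model of a memory whose contiguous ranges can be shifted."""

    def __init__(self, cells):
        self._cells = list(cells)
        self._guard = threading.Lock()  # models the atomicity of the primitive

    def shift(self, start, length, offset):
        """Atomically copy cells[start:start+length] to start+offset.

        Cells vacated by the shift keep their old contents (an assumption).
        """
        with self._guard:
            end = start + length
            new_start = start + offset
            new_end = end + offset
            if (min(start, length, new_start) < 0
                    or max(end, new_end) > len(self._cells)):
                raise IndexError("shift would fall outside the memory")
            # The right-hand slice is copied first, so overlapping ranges are safe.
            self._cells[new_start:new_end] = self._cells[start:end]

    def snapshot(self):
        """Return a copy of the current contents."""
        with self._guard:
            return list(self._cells)
```

For instance, shifting the tail of a sorted array by one cell opens a gap for an insertion without any reader observing a half-moved range.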
Tokenized streams for concurrent execution between asymmetric multiprocessors
A method for executing an application program using streams. A device driver receives a first command within an application program and parses the first command to identify a first stream token that is associated with a first stream. The device driver checks a memory location associated with the first stream for a first semaphore, and determines whether the first semaphore has been released. Once the first semaphore has been released, a second command within the application program is executed. Advantageously, embodiments of the invention provide a technique for developers to take advantage of the parallel execution capabilities of a GPU.
Automatic dependency configuration for managed services
A container-orchestration system reads specification data associated with a third-party resource used by a managed resource. Based on the specification data, the system retrieves resource configuration data of the third-party resource and updates a dependency definition with the resource configuration data. The dependency definition is associated with the managed resource and the third-party resource. The system provides the dependency definition to the managed resource.
Method and apparatus for monitoring a PCIe NTB
A pair of compute nodes, each having a separate PCIe root complex, are interconnected by a PCIe Non-Transparent Bridge (NTB). An instance of an NTB monitoring process is started for each root complex, and the CPU affinity of each NTB monitoring process is set so that the process executes on the CPU resources of its respective CPU root complex. The NTB monitoring process on a given root complex is allowed to sleep until a triggering event occurs that causes the NTB monitoring process to wake and determine the state of the NTB. One such triggering event is a failure of an atomicity algorithm on the compute node to obtain a lock on peer memory in connection with implementing an atomic read operation on the peer memory over the NTB.
Compact NUMA-aware Locks
A computer comprising multiple processors and non-uniform memory implements multiple threads that perform a lock operation using a shared lock structure that includes a pointer to a tail of a first-in-first-out (FIFO) queue of threads waiting to acquire the lock. To acquire the lock, a thread allocates and appends a data structure to the FIFO queue. The lock is released by selecting and notifying a waiting thread to which control is transferred, with the thread selected executing on the same processor socket as the thread controlling the lock. A secondary queue of threads is managed for threads deferred during the selection process and maintained within the data structures of the waiting threads such that no memory is required within the lock structure. If no threads executing on the same processor socket are waiting for the lock, entries in the secondary queue are transferred to the FIFO queue preserving FIFO order.
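The hand-off policy at release time can be illustrated with a single-threaded model of the two queues. This sketch covers only the successor selection, not the lock itself: `pick_successor` is a hypothetical name, and waiters are modeled as `(thread_id, socket)` tuples. Placing the secondary-queue entries ahead of the remaining main-queue entries is an assumption made here because those waiters arrived earlier.

```python
from collections import deque


def pick_successor(main_queue, secondary_queue, owner_socket):
    """Choose the next lock holder, or return None if nobody is waiting.

    main_queue and secondary_queue are deques of (thread_id, socket) tuples.
    """
    # Prefer the first waiter running on the same socket as the current owner.
    for index, (_thread_id, socket) in enumerate(main_queue):
        if socket == owner_socket:
            # Waiters skipped over are deferred to the secondary queue, in order.
            for _ in range(index):
                secondary_queue.append(main_queue.popleft())
            return main_queue.popleft()

    # No same-socket waiter: move deferred waiters back, preserving their order
    # (assumed to go ahead of the remaining main-queue entries).
    while secondary_queue:
        main_queue.appendleft(secondary_queue.pop())
    return main_queue.popleft() if main_queue else None
```

With waiters `t1` (socket 1), `t2` (socket 0), and `t3` (socket 1) queued behind an owner on socket 0, the first call returns `t2` and defers `t1`; a second call from socket 0 finds no local waiter and hands the lock to `t1`.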
Atomic handling for disaggregated 3D structured SoCs
In a further embodiment, a system on a chip integrated circuit (SoC) is provided that includes an active base die including a first cache memory, a first die mounted on and coupled with the active base die, and a second die mounted on the active base die and coupled with the active base die and the first die. The first die includes an interconnect fabric, an input/output interface, and an atomic operation handler. The second die includes an array of graphics processing elements and an interface to the first cache memory of the active base die. At least one of the graphics processing elements is configured to perform, via the atomic operation handler, an atomic operation to a memory device.
System, apparatus and methods for performing shared memory operations
In an embodiment, an apparatus for memory access may include: a memory comprising at least one atomic memory region, and a control circuit coupled to the memory. The control circuit may be to: for each submission queue of a plurality of submission queues, identify an atomic memory location specified in a first entry of the submission queue, wherein each submission queue is to store access requests from a different requester; determine whether the atomic memory location includes existing requester information; and in response to a determination that the atomic memory location does not include existing requester information, perform an atomic operation for the atomic memory location based at least in part on the first entry of the submission queue. Other embodiments are described and claimed.
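One possible reading of the control flow in this abstract is sketched below as a single pass over the head entry of every submission queue. Everything beyond that flow is an assumption for illustration: the name `service_queues`, the `(location, operand)` entry layout, the use of addition as the atomic operation, recording the requester as part of the operation, and leaving a blocked entry at the head of its queue.

```python
from collections import deque


def service_queues(submission_queues, atomic_region):
    """Make one pass over the head entry of every submission queue.

    submission_queues: dict mapping requester id -> deque of (location, operand)
    atomic_region: dict mapping location -> {"requester": id or None, "value": int}
    Returns the (requester, location) pairs whose operation was performed.
    """
    performed = []
    for requester, queue in submission_queues.items():
        if not queue:
            continue
        location, operand = queue[0]  # first entry of this submission queue
        cell = atomic_region[location]
        if cell["requester"] is not None:
            continue  # location already holds requester information; skip for now
        cell["requester"] = requester  # assumed: the operation records the requester
        cell["value"] += operand       # assumed atomic operation: add
        queue.popleft()
        performed.append((requester, location))
    return performed
```

The abstract does not say when requester information is cleared, so the sketch leaves that step to the caller.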
Hardware accelerated synchronization with asynchronous transaction support
A new transaction barrier synchronization primitive enables executing threads and asynchronous transactions to synchronize across parallel processors. The asynchronous transactions may include, for example, transactions resulting from hardware data movement units such as direct memory units. A hardware synchronization circuit may provide for the synchronization primitive to be stored in a cache memory so that barrier operations may be accelerated by the circuit. A new wait mechanism reduces software overhead associated with waiting on a barrier.
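A software analogy for a barrier that tracks both thread arrivals and outstanding asynchronous transactions is sketched below. It is a single-use model built on a condition variable, not the hardware-accelerated primitive the abstract describes. `TransactionBarrier`, `arrive`, `complete_transaction`, and `demo` are hypothetical names, and a plain thread plays the role of the data movement unit.

```python
import threading


class TransactionBarrier:
    """Single-use barrier that waits for threads and asynchronous transactions."""

    def __init__(self, expected_arrivals, expected_transactions):
        self._cond = threading.Condition()
        self._pending_arrivals = expected_arrivals
        self._pending_transactions = expected_transactions

    def _done(self):
        return self._pending_arrivals <= 0 and self._pending_transactions <= 0

    def arrive(self):
        """Called by a thread when it reaches the barrier."""
        with self._cond:
            self._pending_arrivals -= 1
            if self._done():
                self._cond.notify_all()

    def complete_transaction(self):
        """Called when an asynchronous transaction (e.g. a copy) finishes."""
        with self._cond:
            self._pending_transactions -= 1
            if self._done():
                self._cond.notify_all()

    def wait(self, timeout=None):
        """Block until all arrivals and transactions are in; False on timeout."""
        with self._cond:
            return self._cond.wait_for(self._done, timeout)


def demo():
    """Two workers wait for each other and for one simulated copy."""
    barrier = TransactionBarrier(expected_arrivals=2, expected_transactions=1)
    buffer, seen = [], []

    def copy_engine():
        buffer.append("payload")        # the asynchronous data movement
        barrier.complete_transaction()

    def worker():
        barrier.arrive()
        if barrier.wait(timeout=5):
            seen.append(list(buffer))   # the data is visible past the barrier

    threads = [threading.Thread(target=worker) for _ in range(2)]
    threads.append(threading.Thread(target=copy_engine))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen
```

Counting transactions separately from arrivals means a worker that passes the barrier can rely on the copied data being present, without polling the copy engine itself.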