Patent classifications
G06F2209/521
PROCESSING OF PLURAL-REGISTER-LOAD INSTRUCTION
An apparatus comprises processing circuitry to issue load operations to load data from memory. In response to a plural-register-load instruction specifying at least two destination registers to be loaded with data from respective target addresses, the processing circuitry permits issuing of separate load operations corresponding to the plural-register-load instruction. Load tracking circuitry maintains tracking information for one or more issued load operations. When the plural-register-load instruction is subject to an atomicity requirement and the plurality of load operations are issued separately, the load tracking circuitry detects, based on the tracking information, whether a loss-of-atomicity condition has occurred for the load operations corresponding to the plural-register-load instruction, and requests re-processing of the plural-register-load instruction when the loss-of-atomicity condition is detected.
FUNCTION AS A SERVICE (FAAS) SYSTEM ENHANCEMENTS
- Mohammad R. Haghighat ,
- Kshitij Doshi ,
- Andrew J. Herdrich ,
- Anup Mohan ,
- Ravishankar R. Iyer ,
- Mingqiu Sun ,
- Krishna Bhuyan ,
- Teck Joo Goh ,
- Mohan J. Kumar ,
- Michael Prinke ,
- Michael Lemay ,
- Leeor Peled ,
- Jr-Shian Tsai ,
- David M. Durham ,
- Jeffrey D. Chamberlain ,
- Vadim A. Sukhomlinov ,
- Eric J. Dahlen ,
- Sara Baghsorkhi ,
- Harshad Sane ,
- Areg Melik-Adamyan ,
- Ravi Sahita ,
- Dmitry Yurievich Babokin ,
- Ian M. Steiner ,
- Alexander BACHMUTSKY ,
- Anil Rao ,
- Mingwei Zhang ,
- Nilesh K. Jain ,
- Amin Firoozshahian ,
- Baiju V. Patel ,
- Wenyong Huang ,
- Yeluri Raghuram
Embodiments of systems, apparatuses and methods provide enhanced function as a service (FaaS) to users, e.g., computer developers and cloud service providers (CSPs). A computing system configured to provide such enhanced FaaS service include one or more controls architectural subsystems, software and orchestration subsystems, network and storage subsystems, and security subsystems. The computing system executes functions in response to events triggered by the users in an execution environment provided by the architectural subsystems, which represent an abstraction of execution management and shield the users from the burden of managing the execution. The software and orchestration subsystems allocate computing resources for the function execution by intelligently spinning up and down containers for function code with decreased instantiation latency and increased execution scalability while maintaining secured execution. Furthermore, the computing system enables customers to pay only when their code gets executed with a granular billing down to millisecond increments.
BARRIERLESS AND FENCELESS SHARED MEMORY SYNCHRONIZATION
When communicating through shared memory, a producer thread generates a value that is written to a location in a shared memory. The value is read from the shared memory by a consumer thread. The challenge is to ensure that the consumer thread reads the location only after the value is written and is thereby synchronized. When a memory location is written by a producer thread, a flag that is simultaneously stored in the memory location along with the value is toggled. The consumer thread tracks information to determine whether the flag stored in the location indicates whether the producer has written the value to the location. The flag is read and written simultaneously with reading and writing the location in memory, thereby eliminating the need for a memory fence. After all of the consumer threads read the value, the location may be reused to write additional value(s) and simultaneously toggle the flag.
Apparatus and method for a range comparison, exchange, and add
An apparatus and method for executing an atomic test and update instruction. For example, one embodiment of a processor comprises: a decoder to decode an atomic test and update (ATU) instruction having a first operand specifying a first value in a first storage location, a second operand specifying a second value in a second storage location, a third operand specifying a third value in a third storage location, and an opcode specifying a condition to be tested relative to the first and second values; and execution circuitry to perform a load lock operation to load the first value from the first storage location, the load lock operation to prevent access by another instruction before a result of the ATU instruction is stored, the execution circuitry to test a condition related the first value and the second value, wherein if the condition is met then the execution circuitry is to add the first value and the third value to generate a sum and to store the sum to the first storage location.
Atomic operation predictor to predict if an atomic operation will successfully complete and a store queue to selectively forward data based on the predictor
In an embodiment, a processor comprises an atomic predictor circuit to predict whether or not an atomic operation will complete successfully. The prediction may be used when a subsequent load operation to the same memory location as the atomic operation is executed, to determine whether or not to forward store data from the atomic operation to the subsequent load operation. If the prediction is successful, the store data may be forwarded. If the prediction is unsuccessful, the store data may not be forwarded. In cases where an atomic operation has been failing (not successfully performing the store operation), the prediction may prevent the forwarding of the store data and thus may prevent a subsequent flush of the load.
Techniques for ordering atomic operations
In various embodiments, an ordered atomic operation enables a parallel processing subsystem to executes an atomic operation associated with a memory location in a specified order relative to other ordered atomic operations associated with the memory location. A level 2 (L2) cache slice includes an atomic processing circuit and a content-addressable memory (CAM). The CAM stores an ordered atomic operation specifying at least a memory address, an atomic operation, and an ordering number. In operation, the atomic processing circuit performs a look-up operation on the CAM, where the look-up operation specifies the memory address. After the atomic processing circuit determines that the ordering number is equal to a current ordering number associated with the memory address, the atomic processing circuit executes the atomic operation and returns the result to a processor executing an algorithm. Advantageously, the ordered atomic operation enables the algorithm to achieve a deterministic result while optimizing latency.
MEMORY BARRIER ELISION FOR MULTI-THREADED WORKLOADS
A system includes a memory, at least one physical processor in communication with the memory, and a plurality of hardware threads executing on the at least one physical processor. A first thread of the plurality of hardware threads is configured to execute a plurality of instructions that includes a restartable sequence. Responsive to a different second thread in communication with the first thread being pre-empted while the first thread is executing the restartable sequence, the first thread is configured to restart the restartable sequence prior to reaching a memory barrier.
ENHANCED ATOMICS FOR WORKGROUP SYNCHRONIZATION
A technique for synchronizing workgroups is provided. The techniques comprise detecting that one or more non-executing workgroups are ready to execute, placing the one or more non-executing workgroups into one or more ready queues based on the synchronization status of the one or more workgroups, detecting that computing resources are available for execution of one or more ready workgroups, and scheduling for execution one or more ready workgroups from the one or more ready queues in an order that is based on the relative priority of the ready queues.
REQUEST OF AN MCS LOCK BY GUESTS
In example implementations, a method include receiving a request for a lock in a Mellor-Crummey Scott (MCS) lock protocol from a guest user that is context free (e.g., a process that does not bring a queue node). The lock determines that it contains a null value. The lock is granted to the guest user. A pi value is received from the guest user to store in the lock. The pi value notifies subsequent users that the guest user has the lock.
Compact NUMA-aware locks
A computer comprising multiple processors and non-uniform memory implements multiple threads that perform a lock operation using a shared lock structure that includes a pointer to a tail of a first-in-first-out (FIFO) queue of threads waiting to acquire the lock. To acquire the lock, a thread allocates and appends a data structure to the FIFO queue. The lock is released by selecting and notifying a waiting thread to which control is transferred, with the thread selected executing on the same processor socket as the thread controlling the lock. A secondary queue of threads is managed for threads deferred during the selection process and maintained within the data structures of the waiting threads such that no memory is required within the lock structure. If no threads executing on the same processor socket are waiting for the lock, entries in the secondary queue are transferred to the FIFO queue preserving FIFO order.