Patent classifications
G06F2209/521
Contention blocking buffer
In response to a processor receiving data associated with a shared memory location, a contention blocking buffer stores a memory address of the shared memory location. In response to a probe seeking to take ownership of the shared memory location, the contention blocking buffer determines if the memory address indicated by the probe is stored at the contention blocking buffer. If so, the contention blocking buffer blocks the probe, thereby preventing another processor from taking ownership of the shared memory location.
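The blocking behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name, the fixed capacity, the oldest-first eviction, and the explicit `release` step are hypothetical and are not taken from the abstract, which only says that addresses are stored and matching probes are blocked.

```python
class ContentionBlockingBuffer:
    """Toy model: holds addresses whose ownership probes should be blocked."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # assumed fixed number of entries
        self.addresses = []           # oldest entry first

    def on_data_received(self, address):
        """Record the address of shared data the processor just received."""
        if address in self.addresses:
            return
        if len(self.addresses) == self.capacity:
            self.addresses.pop(0)     # assumed eviction policy: drop the oldest
        self.addresses.append(address)

    def on_probe(self, address):
        """Return True if an ownership probe for `address` is blocked."""
        return address in self.addresses

    def release(self, address):
        """Assumed step: drop the entry once the processor is done with it."""
        if address in self.addresses:
            self.addresses.remove(address)


cbb = ContentionBlockingBuffer()
cbb.on_data_received(0x1000)
blocked_first = cbb.on_probe(0x1000)    # address is buffered, so the probe is blocked
cbb.release(0x1000)
blocked_second = cbb.on_probe(0x1000)   # entry removed, so the probe proceeds
```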
Preemptible-RCU CPU hotplugging while maintaining real-time response
A grace period detection technique for a preemptible read-copy update (RCU) implementation that uses a combining tree for quiescent state tracking. When a leaf level bitmask indicating online/offline CPUs is fully cleared due to all of its assigned CPUs going offline as a result of hotplugging operations, the bitmask state is not immediately propagated to the root level of the combining tree as in prior art RCU implementations. Instead, propagation is deferred until all tasks are removed from an associated leaf level task list tracking tasks that were preempted inside an RCU read-side critical section. Deferring bitmask propagation obviates the need to migrate the task list to the combining tree root level in order to prevent premature grace period termination. The task list can remain at the leaf level. In this way, CPU hotplugging is accommodated while avoiding excessive degradation of real-time latency stemming from the now-eliminated task list migration.
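The deferred propagation rule can be sketched as a toy model of a single leaf node. The class and method names are hypothetical, and the `propagated` flag merely stands in for pushing the cleared bitmask toward the root; real RCU implementations involve locking and grace-period state that this sketch omits.

```python
class LeafNode:
    """Toy leaf of the combining tree: an online-CPU bitmask plus a task list."""

    def __init__(self, cpus):
        self.online_mask = 0
        for cpu in cpus:
            self.online_mask |= 1 << cpu
        self.blocked_tasks = []   # tasks preempted inside a read-side critical section
        self.propagated = False   # True once the cleared mask is reported upward

    def _maybe_propagate(self):
        # Propagate only when every CPU is offline AND no preempted reader remains.
        if self.online_mask == 0 and not self.blocked_tasks:
            self.propagated = True

    def cpu_offline(self, cpu):
        self.online_mask &= ~(1 << cpu)
        self._maybe_propagate()

    def task_preempted(self, task):
        self.blocked_tasks.append(task)

    def task_exits_critical_section(self, task):
        self.blocked_tasks.remove(task)
        self._maybe_propagate()


leaf = LeafNode([0, 1])
leaf.task_preempted("t1")
leaf.cpu_offline(0)
leaf.cpu_offline(1)
deferred = leaf.propagated            # still False: the task list is not empty
leaf.task_exits_critical_section("t1")
after_drain = leaf.propagated         # True: propagation happens only now
```

Because the task list stays attached to the leaf, nothing has to be moved to the root when the last CPU goes offline.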
Accelerating and offloading lock access over a network
Lock access is managed in a data network having an initiator node and a remote target by issuing a lock command from a first process to the remote target via an initiator network interface controller to establish a lock on a memory location and, prior to receiving a reply to the lock command, communicating a data access request to the memory location from the initiator network interface controller. Prior to receiving a reply to the data access request, an unlock command issues from the initiator network interface controller. The target network interface controller determines the lock content and, when permitted by the lock, accesses the memory location. After accessing the memory location, the target network interface controller executes the unlock command. When the lock prevents data access, the lock operation is retried a configurable number of times until data access is allowed or a threshold is exceeded.
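A minimal sketch of the target side may help picture the flow: the initiator sends lock, data access, and unlock back to back, and the target applies them in order, retrying when the lock is held. Everything here is hypothetical (the `TargetNic` class, the dictionary-based memory and lock table, and the `on_retry` hook used only to simulate another owner releasing the lock between attempts).

```python
class TargetNic:
    """Toy target NIC that applies a pipelined lock/write/unlock sequence."""

    def __init__(self, max_retries=3):
        self.memory = {}
        self.locks = {}                # address -> owner; absent when free
        self.max_retries = max_retries # assumed configurable retry threshold

    def handle_pipelined(self, owner, address, value, on_retry=None):
        """Return True if the write happened, False if retries ran out."""
        for attempt in range(self.max_retries + 1):
            if self.locks.get(address) is None:
                self.locks[address] = owner    # lock command
                self.memory[address] = value   # data access request
                del self.locks[address]        # unlock command
                return True
            if on_retry is not None:
                on_retry(self, attempt)        # simulation hook between attempts
        return False


nic = TargetNic()
ok = nic.handle_pipelined("p1", 0x10, 42)          # lock is free: succeeds at once

nic.locks[0x20] = "p2"                             # another process holds the lock
gave_up = nic.handle_pipelined("p1", 0x20, 7)      # every attempt finds it held


def release_on_second_attempt(target, attempt):
    if attempt == 1:
        target.locks.pop(0x20, None)               # "p2" unlocks in the meantime


eventually = nic.handle_pipelined("p1", 0x20, 7, on_retry=release_on_second_attempt)
```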
Method and apparatus for user-level thread synchronization with a monitor and MWAIT architecture
Instructions and logic provide user-level thread synchronization with MONITOR and MWAIT instructions. One or more model specific registers (MSRs) in a processor may be configured in a first execution state to specify support of a user-level thread synchronization architecture. Embodiments include multiple hardware threads or processing cores, corresponding monitored address state storage to store a last monitored address for each of a plurality of execution threads that issues a MONITOR request, cache memory to record MONITOR requests and associated states for addresses of memory storage locations, and responsive to receipt of an MWAIT request for the address, to record an associated wait-to-trigger state of monitored addresses for execution cores associated with an MWAIT request; wherein the execution core is to transition a requesting thread to an optimized sleep state responsive to the receipt of said MWAIT request when said one or more MSRs are configured in the first execution state.
System and method to convert lock-free algorithms to wait-free using a hardware accelerator
A method to convert a lock-free algorithm to a wait-free one using a hardware accelerator includes (i) executing a plurality of software threads by a plurality of processing units, the plurality of software threads being associated with at least one operation, (ii) generating at least one of a read request or a write request at the hardware accelerator based on the execution, (iii) generating, at the hardware accelerator, at least one operation that includes PARAM and the read request or the write request, (iv) checking an operation-specific condition of at least one software thread of the plurality of software threads, and (v) updating at least one read value or write value and at least one state variable upon the operation-specific condition being an operation success. The operation-specific condition is either an operation success or an operation failure, based on the PARAM, the read request, or the write request.
Hardware access counters and event generation for coordinating multithreaded processing
A computer system includes a hardware synchronization component (HSC). Multiple concurrent threads of execution issue instructions to update the state of the HSC. Multiple threads may update the state in the same clock cycle, and a thread does not need to receive control of the HSC prior to updating its state. Instructions referencing the state that are received during the same clock cycle are aggregated, and the state is updated according to the number of the instructions. The state is evaluated with respect to a threshold condition. If it is met, then the HSC outputs an event to a processor. The processor then identifies a thread impacted by the event and takes a predetermined action based on the event (e.g. blocking, branching, unblocking of the thread).
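The aggregation and threshold check can be sketched as a simple counter model. The names are hypothetical, and two details are assumptions not stated in the abstract: each instruction adds one to the state, and the counter resets after an event fires.

```python
class HardwareSyncCounter:
    """Toy HSC: aggregates same-cycle updates and fires an event at a threshold."""

    def __init__(self, threshold):
        self.state = 0
        self.threshold = threshold
        self.events = []              # events delivered to the processor

    def clock_cycle(self, updating_threads):
        """`updating_threads` lists the threads that issued an update this cycle."""
        # All updates in the cycle are applied in one step; no thread needs
        # exclusive control of the counter first.
        self.state += len(updating_threads)
        if self.state >= self.threshold:
            self.events.append(("threshold_reached", self.state))
            self.state = 0            # assumed: counter resets after the event


hsc = HardwareSyncCounter(threshold=3)
hsc.clock_cycle(["t0", "t1"])         # two updates in the same cycle
hsc.clock_cycle(["t2"])               # third update reaches the threshold
```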
Atomic operation predictor to predict whether an atomic operation will complete successfully
In an embodiment, a processor comprises an atomic predictor circuit to predict whether or not an atomic operation will complete successfully. The prediction may be used when a subsequent load operation to the same memory location as the atomic operation is executed, to determine whether or not to forward store data from the atomic operation to the subsequent load operation. If the prediction is successful, the store data may be forwarded. If the prediction is unsuccessful, the store data may not be forwarded. In cases where an atomic operation has been failing (not successfully performing the store operation), the prediction may prevent the forwarding of the store data and thus may prevent a subsequent flush of the load.
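One way to picture the predictor is as a per-address saturating counter. The abstract does not say how the prediction is formed, so the 2-bit counter, its initial value, and all names below are assumptions chosen only for illustration.

```python
class AtomicPredictor:
    """Toy predictor: per-address 2-bit saturating counter (an assumed design)."""

    def __init__(self):
        self.counters = {}            # address -> value in 0..3

    def predict_success(self, address):
        # Assumed default of 2 means an unseen address is predicted to succeed.
        return self.counters.get(address, 2) >= 2

    def update(self, address, succeeded):
        value = self.counters.get(address, 2)
        value = min(3, value + 1) if succeeded else max(0, value - 1)
        self.counters[address] = value

    def may_forward_store_data(self, address):
        # Forward store data to a later load only when success is predicted.
        return self.predict_success(address)


predictor = AtomicPredictor()
forward_before = predictor.may_forward_store_data(0x40)   # predicted success
predictor.update(0x40, succeeded=False)
predictor.update(0x40, succeeded=False)
forward_after = predictor.may_forward_store_data(0x40)    # repeated failures stop forwarding
```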
Device and method for communicating between cores
A device and method for communicating between cores are provided. The device comprises: a postbox component, configured to store a message sent from a message sending core to a message receiving core and notify the message receiving core to read the message; and a bus adapter component, connected between the postbox component and the message receiving core and the message sending core which communicate with each other, and configured to provide read/write interfaces between the postbox component and the message receiving core and the message sending core. By means of the disclosure, the problems of high complexity, poor timeliness and poor expandability of inter-core communication devices and methods in multi-core applications in the related art are solved, thereby significantly reducing the complexity of communication between cores, reducing communication delay and providing excellent expandability and scalability.
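The postbox idea can be sketched as a store-and-notify mailbox. The class and its fields are hypothetical; the `notified` list merely stands in for whatever notification mechanism the hardware uses, and the bus adapter is not modelled.

```python
class Postbox:
    """Toy postbox: stores messages per receiving core and records notifications."""

    def __init__(self):
        self.slots = {}       # receiving core id -> list of (sender, message)
        self.notified = []    # stand-in for the notification sent to the receiver

    def send(self, sender, receiver, message):
        self.slots.setdefault(receiver, []).append((sender, message))
        self.notified.append(receiver)

    def read(self, receiver):
        """Return and clear all pending messages for `receiver`."""
        return self.slots.pop(receiver, [])


box = Postbox()
box.send(sender=0, receiver=1, message="ping")
first_read = box.read(1)
second_read = box.read(1)     # nothing left after the first read
```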
Compact NUMA-aware locks
A computer comprising multiple processors and non-uniform memory implements multiple threads that perform a lock operation using a shared lock structure that includes a pointer to a tail of a first-in-first-out (FIFO) queue of threads waiting to acquire the lock. To acquire the lock, a thread allocates and appends a data structure to the FIFO queue. The lock is released by selecting and notifying a waiting thread to which control is transferred, with the thread selected executing on the same processor socket as the thread controlling the lock. A secondary queue of threads is managed for threads deferred during the selection process and maintained within the data structures of the waiting threads such that no memory is required within the lock structure. If no threads executing on the same processor socket are waiting for the lock, entries in the secondary queue are transferred to the FIFO queue preserving FIFO order.
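The hand-off policy can be sketched as a single-threaded model of the two queues. This is not a working lock: there are no atomics or spinning, the queues are ordinary deques rather than nodes linked through the waiters' own data structures, and placing the secondary queue ahead of the remaining main queue on transfer is an assumption. All names are hypothetical.

```python
from collections import deque


class CnaLockModel:
    """Toy model of lock hand-off that prefers waiters on the holder's socket."""

    def __init__(self):
        self.main = deque()        # FIFO queue of (thread, socket) waiters
        self.secondary = deque()   # waiters skipped during selection
        self.holder = None

    def acquire(self, thread, socket):
        if self.holder is None and not self.main and not self.secondary:
            self.holder = (thread, socket)
        else:
            self.main.append((thread, socket))

    def release(self):
        """Pass the lock on and return the new holder (or None)."""
        _, socket = self.holder
        index = next((i for i, w in enumerate(self.main) if w[1] == socket), None)
        if index is not None:
            # Waiters ahead of the same-socket thread are deferred, in order.
            for _ in range(index):
                self.secondary.append(self.main.popleft())
            self.holder = self.main.popleft()
            return self.holder
        # No same-socket waiter: move deferred waiters back, preserving FIFO order.
        self.main.extendleft(reversed(self.secondary))
        self.secondary.clear()
        self.holder = self.main.popleft() if self.main else None
        return self.holder


lock = CnaLockModel()
lock.acquire("A", 0)               # A takes the free lock on socket 0
lock.acquire("B", 1)
lock.acquire("C", 0)
lock.acquire("D", 1)
order = ["A"]
while True:
    nxt = lock.release()
    if nxt is None:
        break
    order.append(nxt[0])
```

In this run, C is served before B because it shares socket 0 with A; B and D then follow in their original relative order.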