Patent classifications
G06F9/52
POWER EFFICIENT MEMORY VALUE UPDATES FOR ARM ARCHITECTURES
Disclosed are various examples of providing efficient waiting for detection of memory value updates for Advanced RISC Machines (ARM) architectures. An ARM processor component instructs a memory agent to perform a processing action, and executes a waiting function. The waiting function ensures that the processing action is completed by the memory agent. The waiting function performs an exclusive load at a memory location and executes a wait for event (WFE) instruction that causes the ARM processor component to wait in a low-power mode for an event register to be set. Once the event register is set, the waiting function completes and a second processing action is executed by the ARM processor component.
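A minimal sketch of the waiting pattern described above, using Python's `threading.Event` as a stand-in for the ARM event register: `event.wait()` plays the role of WFE (the waiter sleeps rather than spinning), and `event.set()` by the memory agent plays the role of the event-generating update. All names here are illustrative, not from the patent.

```python
import threading

def memory_agent(shared, done):
    shared["value"] = 42      # the processing action: update the memory location
    done.set()                # analogous to the store/event that wakes the waiter

def waiting_function(shared, done):
    done.wait()               # analogous to WFE: sleep until the event fires
    return shared["value"]    # the second processing action can now proceed

shared = {}
done = threading.Event()
agent = threading.Thread(target=memory_agent, args=(shared, done))
agent.start()
result = waiting_function(shared, done)   # blocks until the agent signals
agent.join()
```

On real ARM hardware the exclusive load (e.g., LDAXR) arms the exclusive monitor so that a later store to the watched location generates the event that wakes the core from WFE; the sketch only models the blocking structure, not the monitor.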
THREAD SYNCHRONIZATION ACROSS MEMORY SYNCHRONIZATION DOMAINS
Various embodiments include a parallel processing computer system that provides multiple memory synchronization domains in a single parallel processor to reduce unneeded synchronization operations. During execution, one execution kernel may synchronize with one or more other execution kernels by processing outstanding memory references. The parallel processor tracks memory references for each domain to each portion of local and remote memory. During synchronization, the processor synchronizes the memory references for a specific domain while refraining from synchronizing memory references for other domains. As a result, synchronization operations between kernels complete in a reduced amount of time relative to prior approaches.
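The per-domain tracking described above can be sketched as follows. This is an illustrative software model, not the patented hardware: each synchronization domain keeps its own count of outstanding memory references, so synchronizing one domain does not wait on references issued by kernels in other domains. Class and method names are invented for illustration.

```python
class DomainTracker:
    """Tracks outstanding memory references separately per domain."""

    def __init__(self, num_domains):
        self.outstanding = [0] * num_domains

    def issue(self, domain):
        self.outstanding[domain] += 1   # a kernel issues a memory reference

    def retire(self, domain):
        self.outstanding[domain] -= 1   # the reference completes

    def synchronize(self, domain):
        # Only this domain's references must drain; other domains' counts
        # are ignored, which is the source of the speedup described above.
        return self.outstanding[domain] == 0

tracker = DomainTracker(num_domains=2)
tracker.issue(0)
tracker.issue(1)
tracker.retire(0)
# Domain 0 can synchronize even though domain 1 still has work in flight.
```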
MULTIPLE LOCKING OF RESOURCES AND LOCK SCALING IN CONCURRENT COMPUTING
Methods and systems are provided for dividing the process resources of running processes into individually locked partitions, and for indirectly mapping keys of process resources to the locks of each partition. In computing systems that implement concurrent processing, applications may create and destroy concurrently running processes at high frequency. Real-time security monitoring may cause the computing system to run monitoring processes that collect large volumes of data about system events occurring in the context of various other processes, causing threads of the security monitoring application to make frequent write and read accesses to resources of those processes in memory. Indirectly mapping lock acquisition across locks provides scalable relief from the lock contention and thread blocking that result from computational concurrency, while handling read and write requests that arrive at unpredictable times from kernel-space monitoring processes and that target unpredictable resources of monitored user-space processes.
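The indirect key-to-lock mapping described above resembles classic lock striping, which can be sketched as follows. This is a hedged illustration under that assumption, with invented names: resources are divided into N individually locked partitions, and a resource key is hashed to select which partition lock guards it, so unrelated keys rarely contend on the same lock.

```python
import threading

class StripedLockMap:
    """Resources split into partitions, each guarded by its own lock."""

    def __init__(self, num_partitions=16):
        self.locks = [threading.Lock() for _ in range(num_partitions)]
        self.partitions = [{} for _ in range(num_partitions)]

    def _index(self, key):
        # Indirect mapping: key -> partition -> lock.
        return hash(key) % len(self.locks)

    def write(self, key, value):
        i = self._index(key)
        with self.locks[i]:             # contend only within one partition
            self.partitions[i][key] = value

    def read(self, key):
        i = self._index(key)
        with self.locks[i]:
            return self.partitions[i].get(key)

m = StripedLockMap()
m.write("pid:1234/handles", 7)          # illustrative resource key
```

Scaling the number of partitions trades memory for reduced contention, which matches the "lock scaling" goal named in the title.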
HARDWARE ASSISTED FINE-GRAINED DATA MOVEMENT
A processor includes a task scheduling unit and a compute unit coupled to the task scheduling unit. The task scheduling unit performs a task dependency assessment of a task dependency graph and of the task data requirements that correspond to each task of a plurality of tasks. Based on the task dependency assessment, the task scheduling unit schedules a first task of the plurality of tasks and a second proxy object of a plurality of proxy objects specified by the task data requirements, such that a memory transfer of the second proxy object occurs while the first task is being executed.
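The overlap described above can be sketched in software, assuming a simple two-task schedule (this is an illustrative model, not the patented hardware unit): while the first task executes, the data required by the second task (its "proxy object") is transferred concurrently, hiding the transfer latency behind the compute. All names are invented for illustration.

```python
import threading

def run_task(name, data, results):
    results[name] = sum(data)            # stand-in for compute work

def transfer(proxy, staging):
    staging["task2_data"] = list(proxy)  # stand-in for a memory transfer

staging = {"task1_data": [1, 2, 3]}      # task 1's data already resident
results = {}
proxy_for_task2 = range(4, 7)            # task 2's data, not yet transferred

mover = threading.Thread(target=transfer, args=(proxy_for_task2, staging))
mover.start()                            # transfer overlaps with task 1
run_task("task1", staging["task1_data"], results)
mover.join()                             # dependency: task 2 needs its data
run_task("task2", staging["task2_data"], results)
```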
Managing partitions in a scalable environment
Systems and methods are provided that enable a general framework for partitioning application-defined jobs in a scalable environment. The general framework decouples partitioning of a job from the other aspects of the job. As a result, the effort required to define the application-defined job is reduced or minimized, as the user is not required to provide a partitioning algorithm. The general framework also facilitates management of masters and servers performing computations within the distributed environment.
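The decoupling described above can be sketched as follows, under the assumption that the framework owns a generic partitioning step while the application supplies only a per-partition work function. Function names are illustrative, not from the patent.

```python
def framework_partition(keys, num_partitions):
    # Generic partitioning owned by the framework, not the application.
    buckets = [[] for _ in range(num_partitions)]
    for k in keys:
        buckets[hash(k) % num_partitions].append(k)
    return buckets

def run_job(keys, work_fn, num_partitions=4):
    # The application-defined job supplies only work_fn; it never sees
    # or implements the partitioning algorithm.
    results = {}
    for partition in framework_partition(keys, num_partitions):
        for k in partition:
            results[k] = work_fn(k)
    return results

out = run_job(["a", "b", "c"], lambda k: k.upper())
```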
EFFICIENT MULTI-DEVICE SYNCHRONIZATION BARRIERS USING MULTICASTING
In various examples, a single notification (e.g., a request for a memory access operation) that a processing element (PE) has reached a synchronization barrier may be propagated to multiple physical addresses (PAs) and/or devices associated with multiple processing elements. Thus, the notification may allow an indication that the processing element has reached the synchronization barrier to be recorded at multiple targets. Each notification may access the PAs of each PE and/or device of a barrier group to update a corresponding counter. The PEs and/or devices may poll or otherwise use the counter to determine when each PE of the group has reached the synchronization barrier. When a corresponding counter indicates synchronization at the synchronization barrier, a PE may proceed with performing a compute task asynchronously with one or more other PEs until a subsequent synchronization barrier is reached.
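A minimal software sketch of the multicast barrier described above (illustrative only; names are invented): when a PE arrives at the barrier, one notification updates a counter replicated at every PE, as if written to each PE's physical address, and each PE then polls only its own local replica to learn that the whole group has arrived.

```python
import threading

class MulticastBarrier:
    """One arrival counter replicated per PE; arrivals multicast to all."""

    def __init__(self, group_size):
        self.group_size = group_size
        # One counter per PE, modeling a replica at each PE's PA.
        self.counters = [0] * group_size
        self.lock = threading.Lock()

    def arrive(self):
        # A single arrival notification is "multicast" to every replica.
        with self.lock:
            for i in range(self.group_size):
                self.counters[i] += 1

    def is_released(self, pe):
        # Each PE polls only its local counter replica.
        return self.counters[pe] >= self.group_size

barrier = MulticastBarrier(group_size=3)
for _ in range(3):
    barrier.arrive()
# Every PE now sees the full count in its local replica and may proceed.
```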