G06F9/522

Control barrier network for reconfigurable data processors

A processing system comprises a control bus and a plurality of logic units. The control bus is configurable by configuration data to form signal routes in a control barrier network coupled to processing units in an array of processing units. The plurality of logic units has inputs and outputs connected to the control bus and to the array of processing units. A logic unit in the plurality of logic units is operatively coupled to a processing unit in the array of processing units and is configurable by the configuration data to consume source tokens and a status signal from the processing unit on the inputs and to produce barrier tokens and an enable signal on the outputs based on the source tokens and the status signal on the inputs.

Method and system for performing parallel computations to generate multiple output feature maps
11579921 · 2023-02-14 · ·

Systems and methods for performing parallel computation are disclosed. The system can include: a task manager; and a plurality of cores coupled with the task manager and configured to respectively perform a set of parallel computation tasks based on instructions from the task manager, wherein each of the plurality of cores further comprises: a processing unit configured to generate a first output feature map corresponding to a first computation task among the set of parallel computation tasks; an interface configured to receive one or more instructions from the task manager to collect external output feature maps corresponding to the set of parallel computation tasks from other cores of the plurality of cores; a reduction unit configured to generate a reduced feature map based on the first output feature map and received external output feature maps.

Barrier synchronization circuit, barrier synchronization method, and parallel information processing apparatus

A barrier synchronization circuit that performs barrier synchronization of a plurality of processes executed in parallel by a plurality of processing circuits, the barrier synchronization circuit includes a first determination circuit configured to determine whether the number of first processing circuits among the plurality of the processing circuits is equal to or greater than a first threshold value, the first processing circuits having completed the process, and an instruction circuit configured to instruct a second processing circuit among the plurality of the processing circuits to forcibly stop the process when it is determined that the number is equal to or greater than the first threshold value by the first determination circuit, the second processing circuit having not completed the process.

COOPERATIVE GARBAGE COLLECTION BARRIER ELISION

Techniques are disclosed for eliding load and store barriers while maintaining garbage collection invariants. Embodiments described herein include techniques for identifying an instruction, such as a safepoint poll, that checks whether to pause a thread between execution of a dominant and dominated access to the same data field. If a poll instruction is identified between the two data accesses, then a pointer for the data field may be recorded in an entry associated with the poll instruction. When the thread is paused to execute a garbage collection operation, the recorded information may be used to update values associated with the data field in memory such that the dominated access may be executed without any load or store barriers.

Tracing Synchronisation Activity of a Processing Unit
20230024224 · 2023-01-26 ·

A device comprising: a processing unit comprising at least one processor configured to: participate in barrier synchronisations, each of which separates a compute phase of the at least one processor from an exchange phase for the at least one processor; and exchange sync messages with a sync controller hardware unit so as to co-ordinate each of the barrier synchronisations; and sync trace circuitry configured to: receive one or more of the sync messages; and in response to each of the one or more of the sync messages, provide sync trace information for output from the device, the sync trace information comprising timing information associated with the respective sync message.

THREAD SYNCHRONIZATION ACROSS MEMORY SYNCHRONIZATION DOMAINS

Various embodiments include a parallel processing computer system that provides multiple memory synchronization domains in a single parallel processor to reduce unneeded synchronization operations. During execution, one execution kernel may synchronize with one or more other execution kernels by processing outstanding memory references. The parallel processor tracks memory references for each domain to each portion of local and remote memory. During synchronization, the processor synchronizes the memory references for a specific domain while refraining from synchronizing memory references for other domains. As a result, synchronization operations between kernels complete in a reduced amount of time relative to prior approaches.

EFFICIENT MULTI-DEVICE SYNCHRONIZATION BARRIERS USING MULTICASTING
20230229524 · 2023-07-20 ·

In various examples, a single notification (e.g., a request for a memory access operation) that a processing element (PE) has reached a synchronization barrier may be propagated to multiple physical addresses (PAs) and/or devices associated with multiple processing elements. Thus, the notification may allow an indication that the processing element has reached the synchronization barrier to be recoded at multiple targets. Each notification may access the PAs of each PE and/or device of a barrier group to update a corresponding counter. The PEs and/or devices may poll or otherwise use the counter to determine when each PE of the group has reached the synchronization barrier. When a corresponding counter indicates synchronization at the synchronization barrier, a PE may proceed with performing a compute task asynchronously with one or more other PEs until a subsequent synchronization barrier may be reached.

Subscription to Sync Zones
20230016049 · 2023-01-19 ·

A set of configurable sync groupings (which may be referred to as sync zones) are defined. Any of the processors may belong to any of the sync zones. Each of the processor comprises a register indicating to which of the sync zones it belongs. If a processor does not belong to a sync zone, it continually asserts a sync request for that sync zone to the sync controller. If a processor does belong to a sync zone, it will only assert its sync request for that sync zone upon arriving at a synchronisation point for that sync zone indicated in its compiled code set.

MEMORY BARRIER ELISION FOR MULTI-THREADED WORKLOADS
20230019377 · 2023-01-19 ·

A system includes a memory, at least one physical processor in communication with the memory, and a plurality of threads executing on the at least one physical processor. A first thread of the plurality of threads is configured to execute a plurality of instructions that includes a restartable sequence. Responsive to a different second thread in communication with the first thread being pre-empted while the first thread is executing the restartable sequence, the first thread is configured to restart the restartable sequence prior to reaching a memory barrier.

Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure

While scheduled checkpoints are being taken of a cluster of active compute nodes distributively executing an application in parallel, a likelihood of failure of the active compute nodes is periodically and independently predicted. Responsive to the likelihood of failure of a given active compute node exceeding a threshold, the given active compute node is proactively migrated to a spare compute node of the cluster at a next scheduled checkpoint. Another spare compute node of the cluster can perform prediction and migration. Prediction can be based on both hardware events and software events regarding the active compute nodes.