G06F9/522

CONVERGENCE AMONG CONCURRENTLY EXECUTING THREADS

Convergence of threads executing common code sections is facilitated using instructions inserted at strategic locations in computer code sections. The inserted instructions enable the threads in a warp or other group to cooperate with a thread scheduler to promote thread convergence.

Memory-based software barriers

A mechanism is described for facilitating memory-based software barriers to emulate hardware barriers at graphics processors in computing devices. A method of embodiments, as described herein, includes facilitating converting thread scheduling at a processor from hardware barriers to software barriers, where the software barriers emulate the hardware barriers.
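As an illustration only, a barrier can be built from nothing more than a shared counter and a flag held in memory. The sketch below is a generic sense-reversing barrier, not the mechanism claimed in the abstract; the names `MemoryBarrier` and `run_demo` are hypothetical, and Python threads stand in for GPU threads.

```python
import threading


class MemoryBarrier:
    """Sense-reversing barrier that uses only a shared counter and a flag."""

    def __init__(self, parties):
        self.parties = parties      # number of threads that must arrive
        self.count = 0              # arrivals in the current phase
        self.sense = False          # flips each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            local_sense = not self.sense   # value the flag takes on release
            self.count += 1
            if self.count == self.parties:
                # Last arrival: reset the counter and release everyone.
                self.count = 0
                self.sense = local_sense
                self.cond.notify_all()
            else:
                while self.sense != local_sense:
                    self.cond.wait()


def run_demo(n_threads=4):
    """Each thread records how many arrivals it can see after the barrier."""
    barrier = MemoryBarrier(n_threads)
    arrived, seen_after = [], []
    lock = threading.Lock()

    def worker(i):
        with lock:
            arrived.append(i)
        barrier.wait()
        with lock:
            seen_after.append(len(arrived))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after
```

Because no thread passes `wait()` until all have arrived, every entry returned by `run_demo` equals the thread count.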

SCHEDULING TASKS USING SWAP FLAGS
20230097760 · 2023-03-30

A method of activating scheduling instructions within a parallel processing unit is described. The method comprises decoding, in an instruction decoder, an instruction in a scheduled task in an active state and checking, by an instruction controller, if a swap flag is set in the decoded instruction. If the swap flag in the decoded instruction is set, a scheduler is triggered to de-activate the scheduled task by changing the scheduled task from the active state to a non-active state.
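The decode-and-check step described above can be sketched as a small simulation. The `Instruction`, `Task`, and `step` names, the boolean `swap` field, and the state strings are assumptions made for illustration; the abstract does not specify an encoding.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    opcode: str
    swap: bool = False   # hypothetical encoding of the swap flag


@dataclass
class Task:
    name: str
    program: list
    pc: int = 0
    state: str = "active"


def step(task):
    """Decode one instruction and de-activate the task if its swap flag is set."""
    instr = task.program[task.pc]
    task.pc += 1
    if instr.swap:
        # Stand-in for the instruction controller triggering the scheduler.
        task.state = "non-active"
    return instr.opcode
```

For example, stepping a task through `[Instruction("add"), Instruction("load", swap=True)]` leaves it active after the first instruction and non-active after the second.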

Multi-processor simulation on a multi-core machine
11574087 · 2023-02-07

The invention relates to methods of simulation of a plurality of processors running on a plurality of cores, to multi-core microprocessor systems in which such methods may be carried out, and to computer program products configured to perform a simulation of a plurality of processors, running on a plurality of cores. According to a first aspect of the invention, there is provided a method of running a plurality of simulated processors on a plurality of cores, in which simulation of the processors is performed in parallel on the plurality of cores.

SYNCHRONIZING SCHEDULING TASKS WITH ATOMIC ALU
20230033355 · 2023-02-02

A method of synchronizing a group of scheduled tasks within a parallel processing unit into a known state is described. The method uses a synchronization instruction in a scheduled task which triggers, in response to decoding of the instruction, an instruction decoder to place the scheduled task into a non-active state and forward the decoded synchronization instruction to an atomic ALU for execution. When the atomic ALU executes the decoded synchronization instruction, the atomic ALU performs an operation and check on data assigned to the group ID of the scheduled task and if the check is passed, all scheduled tasks having the particular group ID are removed from the non-active state.
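A minimal sketch of this flow follows, assuming the "operation and check" is an increment of a per-group counter followed by a comparison against the group size. That choice, and the names `AtomicALU`, `execute_sync`, and `sync`, are assumptions and are not taken from the abstract.

```python
class AtomicALU:
    """Holds one counter per group ID (assumed form of the per-group data)."""

    def __init__(self, group_sizes):
        self.group_sizes = dict(group_sizes)
        self.counters = {group: 0 for group in group_sizes}

    def execute_sync(self, group_id):
        """Increment the group's counter; return True when the check passes."""
        self.counters[group_id] += 1
        if self.counters[group_id] == self.group_sizes[group_id]:
            self.counters[group_id] = 0   # reset for the next synchronization
            return True
        return False


def sync(task_states, task_groups, alu, task):
    """Model one task decoding a synchronization instruction."""
    task_states[task] = "non-active"          # decoder parks the task
    group = task_groups[task]
    if alu.execute_sync(group):               # ALU performs operation and check
        for other, other_group in task_groups.items():
            if other_group == group:
                task_states[other] = "active" # release the whole group
```

With two tasks in one group, the first call to `sync` parks its task and the second call releases both.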

Phased Parameterized Combinatoric Testing for a Data Storage System
20230094854 · 2023-03-30

Phased parameterized combinatoric testing for a data storage system is disclosed. A testing recipe can comprise operations. The operations can be performed according to different input arguments. Combinatoric testing of the data storage system can be based on different combinations of operations and arguments. The disclosed testing can employ a consistent integer index for arguments passed into the sequenced operations of the recipe. The recipe can be employed to generate a phased test tree that can enable testing based on a phase rather than loading an entire test suite into memory. The consistent integer index can be used to identify failed test cases such that the entire test can be reconstituted from stored failed test information. Distribution of test cases to worker processes can be based on the phased test tree to facilitate interning an operation. Stored failed test information can comprise human-readable failure information.
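One way a single integer can identify an argument combination is a mixed-radix decoding, sketched below. This illustrates the general idea of reconstituting a test case from a stored index; the abstract does not state that the disclosed index works this way, and `decode_case` and `total_cases` are hypothetical names.

```python
def total_cases(arg_lists):
    """Number of distinct argument combinations."""
    total = 1
    for args in arg_lists:
        total *= len(args)
    return total


def decode_case(index, arg_lists):
    """Map an integer index to one combination, last argument list varying fastest."""
    combo = []
    for args in reversed(arg_lists):
        index, position = divmod(index, len(args))
        combo.append(args[position])
    return tuple(reversed(combo))
```

With `arg_lists = [["create", "delete"], [1, 2, 3]]` there are six cases, and a failure stored as index 4 decodes back to `("delete", 2)` without the full suite being held in memory.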

Data through gateway

A gateway for use in a computing system to interface a host with a subsystem for acting as a work accelerator to the host, the gateway having: an accelerator interface for connection to the subsystem to enable transfer of batches of data between the subsystem and the gateway; a data connection interface for connection to external storage for exchanging data between the gateway and storage; a gateway interface for connection to at least one second gateway; a memory interface connected to a local memory associated with the gateway; and a streaming engine for controlling the streaming of batches of data into and out of the gateway in response to pre-compiled data exchange synchronisation points attained by the subsystem, wherein the streaming of batches of data is selectively via at least one of the accelerator interface, data connection interface, gateway interface and memory interface.

SYSTEM AND METHOD FOR IMPLEMENTING A NETWORK-INTERFACE-BASED ALLREDUCE OPERATION

An apparatus is provided that includes a network interface to transmit and receive data packets over a network; a memory including one or more buffers; an arithmetic logic unit to perform arithmetic operations for organizing and combining the data packets; and circuitry to receive, via the network interface, data packets from the network; aggregate, via the arithmetic logic unit, the received data packets in the one or more buffers at a network rate; and transmit, via the network interface, the aggregated data packets to one or more compute nodes in the network, thereby optimizing latency incurred in combining the received data packets and transmitting the aggregated data packets, and hence accelerating a bulk data allreduce operation. One embodiment provides a system and method for performing the allreduce operation. During operation, the system performs the allreduce operation by pacing network operations for enhancing performance of the allreduce operation.
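The result an allreduce produces can be shown with a host-side sketch, assuming the combining operation is an element-wise sum. The abstract does not name the operation, and `allreduce_sum` is a hypothetical helper that models only the result, not the network-rate hardware path.

```python
def allreduce_sum(packets_per_node):
    """Combine equal-length payloads element-wise and return the result per node."""
    length = len(next(iter(packets_per_node.values())))
    buffer = [0] * length                      # stand-in for the aggregation buffer
    for payload in packets_per_node.values():
        for i, value in enumerate(payload):
            buffer[i] += value                 # ALU combine step (assumed: addition)
    # Every node receives its own copy of the aggregated data.
    return {node: list(buffer) for node in packets_per_node}
```

For instance, payloads `[1, 2]` and `[3, 4]` from two nodes yield `[4, 6]` at both nodes.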

Processor with conditional-fence commands excluding designated memory regions
20230036954 · 2023-02-02

An apparatus includes a processor, configured to designate a memory region in a memory, and to issue (i) memory-access commands for accessing the memory and (ii) a conditional-fence command associated with the designated memory region. Memory-Access Control Circuitry (MACC) is configured, in response to identifying the conditional-fence command, to allow execution of the memory-access commands that access addresses within the designated memory region, and to defer the execution of the memory-access commands that access addresses outside the designated memory region, until completion of all the memory-access commands that were issued before the conditional-fence command.
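The decision the control circuitry makes for a command issued after the fence can be sketched as a single predicate. The half-open address range, the pending-command counter, and the name `may_execute` are assumptions made for illustration.

```python
def may_execute(address, region, pending_before_fence):
    """Decide whether a command issued after the conditional fence may run now.

    region is (start, end), treated here as a half-open range (assumption).
    pending_before_fence counts commands issued before the fence that have
    not yet completed.
    """
    start, end = region
    inside_region = start <= address < end
    # Accesses inside the designated region bypass the fence; all others
    # wait until every pre-fence command has completed.
    return inside_region or pending_before_fence == 0
```

With region `(0x100, 0x200)` and three pre-fence commands outstanding, an access to `0x105` proceeds while an access to `0x10` is deferred until the count reaches zero.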

Barrier synchronization system and parallel information processing apparatus

A barrier synchronization system, a parallel information processing apparatus, and the like are described in the embodiments. In an example, provided is a solution to reduce latency time and improve processing speed in barrier synchronization. The parallel information processing apparatus includes: a completion information storage configured to store completion information, wherein the completion information includes information relating to completion of processing of an own apparatus and information relating to completion of processing of a lower information processing apparatus located in the tree structure; and a control circuit configured to, in response to a determination result indicating that a current status amounts to a given condition, instruct a specified information processing apparatus to forcibly suspend processing, the specified information processing apparatus being an apparatus that has not yet completed processing before all of the plurality of information processing apparatuses have completed the processing.
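The completion information held at each point of the tree can be modelled as follows. This is a sketch under assumptions: the `Apparatus` class and its methods are hypothetical, and the "given condition" that triggers forced suspension is left to the caller because the abstract does not define it.

```python
class Apparatus:
    """One information processing apparatus in a barrier tree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)   # lower apparatuses in the tree
        self.own_done = False            # completion of this apparatus

    def subtree_done(self):
        """True when this apparatus and every lower apparatus have completed."""
        return self.own_done and all(c.subtree_done() for c in self.children)

    def unfinished(self):
        """Names of apparatuses that would be candidates for forced suspension."""
        names = [] if self.own_done else [self.name]
        for child in self.children:
            names.extend(child.unfinished())
        return names
```

A root whose own processing and one of two leaves have completed reports `subtree_done()` as false and lists only the remaining leaf in `unfinished()`.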