G06F15/17325

Sync groupings

A gateway implementing multiple independent sync networks. The independent sync networks can be used to allow for synchronisation between different synchronisation groups of accelerators. The independent sync networks allow synchronisations to be carried out asynchronously and simultaneously. The gateway has sync propagation circuitry that receives a first synchronisation request for an upcoming exchange phase and propagates this sync request through a first sync network. The first synchronisation request is a request for synchronisation between subsystems of a first synchronisation group. The sync propagation circuitry of the gateway also receives a second synchronisation request for a different exchange phase and propagates this sync request through the second sync network. The second synchronisation request is a request for synchronisation between subsystems of a second synchronisation group. The two exchange phases overlap in time. Therefore, the syncs are simultaneous and asynchronous.

Systems and methods for multi-architecture computing
11275709 · 2022-03-15 · ·

Disclosed herein are systems and methods for multi-architecture computing. For example, in some embodiments, a computing system may include: a processor system including at least one first processor core having a first instruction set architecture (ISA); a memory device coupled to the processor system, wherein the memory device has stored thereon a first binary representation of a program for the first ISA; and control logic to suspend execution of the program by the at least one first processor core and cause at least one second processor core to resume execution of the program, wherein the at least one second processor core has a second ISA different from the first ISA; wherein the program is to generate data having an in-memory representation compatible with both the first ISA and the second ISA.

INTERFACING WITH SYSTEMS, FOR PROCESSING DATA SAMPLES, AND RELATED SYSTEMS, METHODS AND APPARATUSES
20220092006 · 2022-03-24 ·

Disclosed examples include an apparatus. The apparatus may include first interfaces, second interfaces, a bus interface, and a buffer interface. The first interfaces may be to communicate at first data widths via first interconnects for operable coupling with data samples sources. The second interface may be to communicate at second data widths via second interconnects for operable coupling with data sinks. The buffer interface may be to communicate with a system to process data samples sampled using different sampling rates according to processing frame durations. The buffer interface may include an uplink channel handler and a downlink channel handler. The uplink channel handler may be to receive data samples from the first interfaces at first data widths and provide the data samples to the bus interface at third data widths. The downlink channel handler may be to receive processed data samples from the bus interface at third data widths and provide the processed data samples to the second interfaces at second data widths. The bus interface to communicate at a third data width via a third interconnect for operative coupling with allocated memory region utilized by the system to process data samples.

Parallel data transfer to increase bandwidth for accelerated processing devices

Techniques for improving data transfer in a system having multiple accelerated processing devices (“APDs”) are described herein. In such a system, multiple APDs are coupled to a processor (e.g., a central processing unit (“CPU”)) via a general interconnect fabric and to each other via a high speed interconnect. The techniques herein increase the effective bandwidth for transfer of data between the CPU and the APD by transmitting data to both APDs through the portion of the interconnect fabric coupled to each respective APD. Then, one of the APDs transfers data to the other APD or to the processor via the high speed inter-APD interconnect. Although data transferred “indirectly” through the helper APD takes slightly more time to be transferred than a direct transfer, the total effective bandwidth to the target is increased due to the high-speed inter-APD interconnect.

Loop execution control for a multi-threaded, self-scheduling reconfigurable computing fabric using a reenter queue
11288074 · 2022-03-29 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.

Processing system, inter-processor communication method, and shared resource management method
20220091884 · 2022-03-24 ·

The disclosure provides a processing system which comprises a plurality of processors independently executing instructions; a plurality of registers arranged in a manner corresponding to the plurality of processors. Each of the plurality of registers is configured to set first register bits for other processors except the corresponding processor, and the other processors can write the first register bits set for them to indicate event requests. And each processor of the plurality of processors is configured to learn the event requests from the remaining processors for it by reading the first register bits of the corresponding register. Therefore, the application establishes an inter-processor communication module for a plurality of processors in a processing system.

Data exchange pathways between pairs of processing units in columns in a computer

A time deterministic computer is architected so that exchange code compiled for one set of tiles, e.g., a column, can be reused on other sets. The computer comprises: a plurality of processing units each having an input interface with a set of input wires, and an output interface with a set of output wires; a switching fabric connected to each of the processing units by the respective set of output wires and connectable to each of the processing units by the respective set of output wires and connectable to each of the processing units by the respective input wires via switching circuitry controllable by its associated processing unit; the processing units arranged in columns, each column having a base processing unit proximate the switching fabric and multiple processing units one adjacent the other in respective positions in the direction of the column, wherein to implement exchange of data between the processing units at least one processing unit is configured to transmit at a transmit time a data packet intended for a recipient processing unit onto its output set of connection wires, the data packet having no destination identifier of the recipient processing unit but destined for receipt at the recipient processing unit with a predetermined delay relative to the transmit time, wherein the predetermined delay is dependent on an exchange pathway between the transmitting and recipient processing units, wherein the exchange pathway between any pair of transmitting and recipient processing unit at respective positions in one column has the same delay as the exchange pathway between each pair of transmitting and recipient processing units at corresponding respective positions in the other columns.

SYSTEM AND METHOD FOR MANAGING MULTI-CORE ACCESSES TO SHARED PORTS

A port is provided that utilized various techniques to manage contention for the same by controlling data that is written to and read from the port in multi-core assembly within a usable computing system. When the port is a sampling port, the assembly may include at least two cores, a plurality of buffers in operative communication with the at least one sampling ports, a non-blocking contention management unit comprising a plurality of pointers that collectively operate to manage contention of shared ports in a multi-core computing system. When the port is queuing port, the assembly may include buffers in communication with the queuing port and the buffers are configured to hold multiple messages in the queuing port. The assembly may manage contention of shared queuing ports in a multi-core computing system.

AC parallelization circuit, AC parallelization method, and parallel information processing device
11144317 · 2021-10-12 · ·

An AC parallelization circuit includes a transmitting circuit configured to transmit a stop signal to instruct a device for executing calculation in an iteration immediately preceding an iteration for which a concerned device is responsible to stop the calculation in loop-carried dependency calculation; and an estimating circuit configured to generate, as a result of executing the calculation in the preceding iteration, an estimated value to be provided to an arithmetic circuit when the transmitting circuit transmits the stop signal.

Update of Model Parameters in a Parallel Processing System
20210311807 · 2021-10-07 ·

A data processing system comprising a plurality of processing nodes that are arranged to update a model in a parallel manner Each of the processing nodes starts with a different set of updates to model parameters. Each of the processing nodes is configured to perform one or more reduce-scatter collectives so as to exchange and reduce the updates. Having done so, each processing node is configured to apply the reduced set of updates to obtain an updated set of model parameters. The processing nodes then exchange the updated model parameters using an all-gather so that each processing node ends up with the same model parameters at the end of the process.