G06F15/17375

Hierarchical ring-based interconnection network for symmetric multiprocessors

A symmetric multiprocessor includes with a hierarchical ring-based interconnection network is disclosed. The symmetric processor includes a plurality of buses comprised on the symmetric multiprocessor, wherein each of the buses are configured in a circular topology. The symmetric multiprocessor also includes a plurality of multi-processing nodes interconnected by the buses to make a hierarchical ring-based interconnection network for conveying commands between the multi-processing nodes. The interconnection network includes a command network configured to transport commands based on command tokens, wherein the tokens dictate a destination of the command, a partial response network configured to transport partial responses generated by the multi-processing nodes, and a combined response network configured to combine the partial responses generated by the multi-processing nodes using combined response tokens.

INTERCONNECT CIRCUIT
20230023021 · 2023-01-26 ·

A circuit having multiple inputs and multiple outputs the circuit being for switching signals received at any of the inputs to any of the outputs, the circuit comprising: a first switch matrix, the first switch matrix being capable of directing signals received at the inputs of the circuit to multiple first intermediate ports; a second switch matrix, the second switch matrix being capable of directing signals received at multiple second intermediate ports to multiple third intermediate ports, the number of the second intermediate ports being less than the number of the inputs of the circuit; one or more primary bypass links, each primary bypass link being capable of coupling one or more of the first intermediate ports to a respective one or more of the outputs of the circuit independently of the second switch matrix; a first redirection layer, the first redirection layer being capable of, for each first intermediate port, directing a signal received at that first intermediate port to a primary bypass link or to a second intermediate port; and a second redirection layer, the second redirection layer being capable of directing signals received at each of the primary bypass links to a respective one or more outputs of the circuit, and directing signals received at each of the third intermediate ports to a respective one or more outputs of the circuit.

INTERCONNECT FOR DIRECT MEMORY ACCESS CONTROLLERS

A computing device is provided, including a plurality of memory devices, a plurality of direct memory access (DMA) controllers, and an on-chip interconnect. The on-chip interconnect may be configured to implement control logic to convey a read request from a primary DMA controller of the plurality of DMA controllers to a source memory device of the plurality of memory devices. The on-chip interconnect may be further configured to implement the control logic to convey a read response from the source memory device to the primary DMA controller and one or more secondary DMA controllers of the plurality of DMA controllers.

Accelerating AI training by an all-reduce process with compression over a distributed system

According to various embodiments, methods and systems are provided to accelerate artificial intelligence (AI) model training with advanced interconnect communication technologies and systematic zero-value compression over a distributed training system. According to an exemplary method, during each iteration of a Scatter-Reduce process performed on a cluster of processors arranged in a logical ring to train a neural network model, a processor receives a compressed data block from a prior processor in the logical ring, performs an operation on the received compressed data block and a compressed data block generated on the processor to obtain a calculated data block, and sends the calculated data block to a following processor in the logical ring. A compressed data block calculated from corresponding data blocks from the processors can be identified on each processor and distributed to each other processor and decompressed therein for use in the AI model training.

Embedding rings on a toroid computer network
11531637 · 2022-12-20 · ·

A computer comprising a plurality of interconnected processing nodes arranged in a toroid configuration in which multiple layers of interconnected nodes are arranged along an axis; each layer comprising a plurality of processing nodes connected in a ring in a non-axial plane by at least an intralayer respective set of links between each pair of neighbouring processing nodes, the links in each set adapted to operate simultaneously; wherein each of the processing nodes in each layer is connected to a respective corresponding node in each adjacent layer by an interlayer link to form respective rings along the axis; the computer programmed to provide a plurality of embedded one-dimensional logical paths and to transmit data around each of the embedded one-dimensional paths in such a manner that the plurality of embedded one-dimensional logical paths operate simultaneously, each logical path using all processing nodes of the computer in a sequence.

Data transfer system, circuit, and method

Disclosed is a data transfer system capable of accelerating data transmission between two chips. The data transfer system includes: a master system-on-a-chip (SoC) including a master transmission circular buffer and a master reception circular buffer; and a slave SoC including a slave reception circular buffer and a slave transmission circular buffer. The slave/master reception circular buffer is a duplicate of the master/slave transmission circular buffer; accordingly, the write pointers of the two corresponding buffers are substantially synchronous and the read pointers of the two corresponding buffers are substantially synchronous as well. In light of the above, the read and write operations of the master/slave transmission circular buffer can be treated as the read and write operations of the slave/master reception circular buffer; therefore some conventional data reproducing procedure(s) for the data transmission can be omitted and the data transmission is accelerated.

Partitionable Networked Computer
20220382707 · 2022-12-01 ·

A computer is provided, including a plurality of processing nodes arranged two-dimensional arrays in respective front and rear layers. Each processing node has a set of activatable links. When activated, transmission of data items between the nodes connected via the activated link is enabled. When not activated, transmission of data items between said nodes is prevented. The set of activatable links includes a respective link which connects the processing node to each adjacent node in the array, and to a facing processing node in the other layer. An allocation engine is configured to receive an allocation instruction and connected to the processing nodes to selectively activate the links in a configuration in which: (i) links between adjacent nodes are activated; (ii) links between facing nodes are only activated for edge processing nodes; and (iii) links between processing nodes outside the group and adjacent processing nodes in the group are deactivated.

CONNECTIVITY IN COARSE GRAINED RECONFIGURABLE ARCHITECTURE
20230053439 · 2023-02-23 ·

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. The tiles can be arranged in an array or grid and can be communicatively coupled. In an example, the tiles can be arranged in a one-dimensional array and each tile can be coupled to its respective adjacent neighbor tiles using a direct bus coupling. Each tile can be further coupled to at least one non-adjacent neighbor tile that is one tile, or device space, away using a passthrough bus. The passthrough bus can extend through intervening tiles.

A Network Computer with External Memory
20230084132 · 2023-03-16 ·

A computer comprising a plurality of processor devices connected in a ring, wherein each of the processor devices is connected to each of two neighbouring ones of the processor devices by a respective physical inter-processor link. Each of a set of external memory device stores a local portion of the externally stored dataset. Each processor device executes instructions to: determine that a synchronisation point has been reached by the plurality of processor devices; responsive to the determination, access from its connected external memory device its local portion of the externally stored dataset stored; record a copy of its local portion of the externally stored dataset in its local memory; transmit its local portion of the externally stored dataset to at least one of its connected neighbouring processing devices; and receive an incoming portion of the externally stored dataset from at least one of its connected neighbouring processing devices.

MULTIPLY-ACCUMULATE WITH VARIABLE FLOATING POINT PRECISION

An integrated circuit including a multiplier-accumulator execution pipeline including a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: (i) a multiplier to multiply first input data, having a first floating point data format, by a filter weight data, having the first floating point data format, and generate and output a product data having a second floating point data format, and (ii) an accumulator, coupled to the multiplier of the associated MAC circuit, to add second input data and the product data output by the associated multiplier to generate sum data. The plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline may be connected in series and, in operation, perform a plurality of concatenated multiply and accumulate operations.