Patent classifications
G06F15/17318
Embedding Rings on a Toroid Computer Network
A computer comprising a plurality of interconnected processing nodes arranged in a toroid configuration in which multiple layers of interconnected nodes are arranged along an axis; each layer comprising a plurality of processing nodes connected in a ring in a non-axial plane by at least an intralayer respective set of links between each pair of neighbouring processing nodes, the links in each set adapted to operate simultaneously; wherein each of the processing nodes in each layer is connected to a respective corresponding node in each adjacent layer by an interlayer link to form respective rings along the axis; the computer programmed to provide a plurality of embedded one-dimensional logical paths and to transmit data around each of the embedded one-dimensional paths in such a manner that the plurality of embedded one-dimensional logical paths operate simultaneously, each logical path using all processing nodes of the computer in a sequence.
Networked computer with embedded rings field
One aspect of the invention provides a computer comprising a plurality of interconnected processing nodes arranged in a ladder configuration comprising a plurality of facing pairs of processing nodes. The processing nodes of each pair are connected to each other by two links. A processing node in each pair is connected to a corresponding processing node in an adjacent pair by at least one link. The processing nodes are programmed to operate the ladder configuration to transmit data around two embedded one-dimensional rings formed by respective sets of processing nodes and links, each ring using all processing nodes in the ladder once only.
METHOD FOR IMPLEMENTING A LINE SPEED INTERCONNECT STRUCTURE
A method and apparatus including a cache controller coupled to a cache memory, wherein the cache controller receives a plurality of cache access requests, performs a pre-sorting of the plurality of cache access requests by a first stage of the cache controller to order the plurality of cache access requests, wherein the first stage functions by performing a presorting and pre-clustering process on the plurality of cache access requests in parallel to map the plurality of cache access requests from a first position to a second position corresponding to ports or banks of a cache memory, performs the combining and splitting of the plurality of cache access request by a second stage of the cache controller, and applies the plurality of cache access requests to the cache memory at line speed.
Networked Computer With Multiple Embedded Rings
A network comprising interconnected first and second processors, each processor comprising one or more of: multiple processing units arranged on a chip configured to execute program code; an on-chip interconnect comprising groups of exchange paths connected to receive data from corresponding groups of the processing units; external interfaces configured to communicate data off-chip as packets, each having a destination address, external interfaces of the first and second processors being connected by an external link; multiple exchange blocks, each connected to groups of the exchange paths; a routing bus configured to route packets between the exchange blocks and the external interfaces. Processing units of the first processor generate off-chip packets such that the group of processing units serviced by the first exchange block on the first processor address off-chip packets to the group of processing units on the second processor serviced by the corresponding first exchange block of the second processor.
METHOD OF RING ALLREDUCE PROCESSING
A method of performing ring allreduce operations is disclosed. The method includes sending a chunk of a message in a receive buffer at a current index of a send buffer to a next node in a virtual ring of nodes, receiving a chunk of the message from a previous node in the virtual ring of nodes and store the chunk at the current index of the receive buffer, and reducing a chunk in a send buffer at a previous index of the receive buffer and a chunk in the receive buffer at a previous index of the receive buffer and storing a result at the previous index of the receive buffer. The method includes repeating the sending, receiving and storing, and reducing and storing steps until all chunks of the message are reduced, and sending reduced chunks to the next node and receive reduced chunks from the previous node.
Networked computer with multiple embedded rings
According to an aspect of the invention, there is provided a computer comprising a plurality of interconnected processing nodes arranged in a configuration with multiple stacked layers. Each layer comprises four processing nodes connected by respective links between the processing nodes. In end layers of the stack, the four processing nodes are interconnected in a ring formation by two links between the nodes, the two links adapted to operate simultaneously. Processing nodes in the multiple stacked layers provide four faces, each face comprising multiple layers, each layer comprising a pair of processing nodes. The processing nodes are programmed to operate a configuration to transmit data around embedded one-dimensional rings, each ring formed by processing nodes in two opposing faces.
Network-on-chip data processing method and device
The present application relates to a network-on-chip data processing method. The method is applied to a network-on-chip processing system, the network-on-chip processing system is used for executing machine learning calculation, and the network-on-chip processing system comprises a storage device and a calculation device. The method comprises: accessing the storage device in the network-on-chip processing system by means of a first calculation device in the network-on-chip processing system, and obtaining first operation data; performing an operation on the first operation data by means of the first calculation device to obtain a first operation result; and sending the first operation result to a second calculation device in the network-on-chip processing system. According to the method, operation overhead can be reduced and data read/write efficiency can be improved.
MASSIVELY PARALLEL IN-NETWORK COMPUTE
Efficient scaling of in-network compute operations to large numbers of compute nodes is disclosed. Each compute node is connected to a same plurality of network compute nodes, such as compute-enabled network switches. Compute processes at the compute nodes generate local gradients or other vectors by, for instance, performing a forward pass on a neural network. Each vector comprises values for a same set of vector elements. Each network compute node is assigned to, based on the local vectors, reduce vector data for a different a subset of the vector elements. Each network compute node returns a result chunk for the elements it processed back to each of the compute nodes, whereby each compute node receives the full result vector. This configuration may, in some embodiments, reduce buffering, processing, and/or other resource requirements for the network compute node or network at large.
Triggered operations for collective communication
Examples include a method of managing storage for triggered operations. The method includes receiving a request to allocate a triggered operation; if there is a free triggered operation, allocating the free triggered operation; if there is no free triggered operation, recovering one or more fired triggered operations, freeing one or more of the recovered triggered operations, and allocating one of the freed triggered operations; configuring the allocated triggered operation; and storing the configured triggered operation in a cache on an input/output (I/O) device for subsequent asynchronous execution of the configured triggered operation.
METHODS AND SYSTEMS FOR DATA TRANSMISSION
Methods and systems for transmitting data are presented. Data received from at least one data source is retained in at least one buffer. In one example, initial hierarchical data may be provided from the at least one buffer to a device, followed by additional hierarchical data. In one example, the data is received into the at least one buffer via a multicast connection, and the data is provided to the device via a point-to-point connection.