Patent classifications
G06F15/17318
MULTI-PROCESSING UNIT INTERCONNECTED ACCELERATOR SYSTEMS AND CONFIGURATION TECHNIQUES
A compute system providing hierarchical scaling can include one or more sets of parallel processing units. The parallel processing units in a set can be organized into subsets of parallel processing units. Each parallel processing unit can be configurably couplable to its two nearest-neighbor parallel processing units in the same subset by two communication links, and each parallel processing unit can be configurably couplable to the farthest-neighbor parallel processing unit in the same subset by one communication link. Furthermore, each parallel processing unit can be configurably couplable to a corresponding parallel processing unit in the other subset by two communication links. The compute system can be configured by configuring the communication links of a set of parallel processing units into one or more compute clusters, each including a corresponding number of communication rings, based on a specified compute parameter. Input data for computing on a given compute cluster can be divided and loaded onto the respective parallel processing units of the given compute cluster. A function can be computed on the loaded input data by the given compute cluster using a parallel communication ring algorithm of the function.
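As a rough illustration of the kind of parallel communication ring algorithm the abstract refers to, the following is a minimal single-process simulation of a sum ring-allreduce. The segment indexing, step structure, and function name are assumptions for illustration, not details taken from the patent.

```python
def ring_allreduce(chunks):
    """Sum-allreduce over n simulated ring-connected nodes.

    `chunks` holds one vector per node; each vector is split into n
    equal segments that circulate around the ring (vector length must
    be divisible by n)."""
    n = len(chunks)
    seg = len(chunks[0]) // n
    data = [list(c) for c in chunks]

    # Phase 1 (reduce-scatter): in step s, node i sends segment (i - s)
    # to its ring successor, which accumulates it. After n - 1 steps,
    # node i holds the fully reduced segment (i + 1) mod n.
    for s in range(n - 1):
        for i in range(n):
            k = (i - s) % n
            dst = (i + 1) % n
            for t in range(k * seg, (k + 1) * seg):
                data[dst][t] += data[i][t]

    # Phase 2 (allgather): in step s, node i forwards the completed
    # segment (i + 1 - s) mod n to its successor, which overwrites its
    # own copy. After n - 1 steps, every node holds the full sum.
    for s in range(n - 1):
        for i in range(n):
            k = (i + 1 - s) % n
            dst = (i + 1) % n
            for t in range(k * seg, (k + 1) * seg):
                data[dst][t] = data[i][t]
    return data
```

In a real system each inner loop iteration is a concurrent link transfer; the attraction of the ring schedule is that every link carries traffic in every step, so bandwidth use is balanced across the ring.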
System and method for supporting lazy deserialization of session information in a server cluster
A system and method can support in-memory session replication in a server cluster using a lazy deserialization approach. The server cluster can include a primary application server and a secondary application server. The primary application server operates to receive a request associated with a session from a client and maintains session information associated with the session. Based on the session information, the primary application server can respond to the client. The secondary application server operates to receive and maintain serialized session information from the primary application server. The secondary application server operates to update the serialized session information based on one or more session updates received from the primary application server. When the primary application server fails, the secondary application server can generate deserialized session information based on the updated serialized session information and respond to the client.
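The lazy-deserialization idea above can be sketched in a few lines: the secondary stores replicated sessions only as opaque bytes and pays the deserialization cost only on failover. The class and method names are illustrative assumptions, and `pickle` stands in for whatever serialization the servers actually use.

```python
import pickle


class PrimaryServer:
    """Holds live session objects and replicates them as serialized bytes."""

    def __init__(self):
        self.sessions = {}

    def handle_request(self, secondary, sid, key, value):
        session = self.sessions.setdefault(sid, {})
        session[key] = value
        # Replication update: the secondary only ever sees serialized bytes.
        secondary.replicate(sid, pickle.dumps(session))
        return session


class SecondaryServer:
    """Stores serialized session info; deserializes lazily on failover."""

    def __init__(self):
        self.blobs = {}

    def replicate(self, sid, blob):
        # No deserialization on the hot path; just keep the latest bytes.
        self.blobs[sid] = blob

    def take_over(self, sid):
        # Only when the primary fails is the blob actually deserialized.
        return pickle.loads(self.blobs[sid])
```

The design choice being modeled: replication stays cheap on every request because the secondary never reconstructs objects, trading a one-time deserialization cost at failover.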
System, method, and storage medium
A system includes a plurality of arithmetic devices configured to execute arithmetic processes in parallel. Each of the plurality of arithmetic devices is configured to: determine whether a time period from the start of collective communication to reception from another arithmetic device involved in the collective communication is equal to or shorter than a predetermined threshold; determine a target arithmetic device that is among the plurality of arithmetic devices and for which a waiting scheme involved in the collective communication is to be changed when the time period is determined to be equal to or shorter than the predetermined threshold; and transmit, to the target arithmetic device, an instruction to change the waiting scheme involved in the collective communication.
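The selection step described above can be sketched as a small helper: peers whose data arrives within the threshold of the collective's start are picked as targets for a waiting-scheme change. The function name and the example scheme names (sleeping wait vs. busy wait) are illustrative assumptions; only the threshold test itself comes from the abstract.

```python
def waiting_scheme_targets(start_time, arrival_times, threshold):
    """Select peers whose waiting scheme should be changed.

    `arrival_times` maps a peer device id to the time its contribution
    to the collective arrived. Peers arriving within `threshold` of
    `start_time` are fast enough that a cheaper waiting scheme (e.g. a
    busy wait instead of a sleeping wait) may pay off, so each selected
    peer would be sent a scheme-change instruction.
    """
    return [dev for dev, t in sorted(arrival_times.items())
            if t - start_time <= threshold]
```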
Embedding rings on a toroid computer network
A computer comprising a plurality of interconnected processing nodes arranged in a configuration with multiple layers, arranged along an axis, comprising first and second endmost layers and at least one intermediate layer between the first and second endmost layers is provided. Each layer comprises a plurality of processing nodes connected in a ring by a respective set of intralayer links between each pair of neighbouring processing nodes, the links adapted to operate simultaneously. Nodes in each layer are connected to respective corresponding nodes in each adjacent layer by an interlayer link. Each processing node in the first endmost layer is connected to a corresponding node in the second endmost layer. Data is transmitted around a plurality of embedded one-dimensional logical rings with an asymmetric bandwidth utilisation, each logical ring using all processing nodes of the computer in such a manner that the plurality of embedded one-dimensional logical rings operate simultaneously.
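To make the toroid structure concrete, the sketch below builds the link set of the described configuration (stacked rings with endmost layers joined) and embeds one logical ring that visits every node via a snake path. This shows a single embedded ring only; the patent's scheme embeds several such rings simultaneously. Node labels and the even-ring-size restriction are assumptions for illustration.

```python
def torus_links(num_layers, ring_size):
    """Link set of the configuration: num_layers layers, each a ring of
    ring_size nodes; corresponding nodes of adjacent layers are linked,
    and the endmost layers wrap around, giving a toroid."""
    links = set()
    for l in range(num_layers):
        for p in range(ring_size):
            links.add(frozenset({(l, p), (l, (p + 1) % ring_size)}))  # intralayer
            links.add(frozenset({(l, p), ((l + 1) % num_layers, p)})) # interlayer
    return links


def snake_ring(num_layers, ring_size):
    """One embedded logical ring visiting every node (ring_size must be
    even): walk down one axial column, step across, walk back up the
    next column, and so on; the final intralayer step closes the ring."""
    path = []
    for p in range(ring_size):
        layers = range(num_layers) if p % 2 == 0 else range(num_layers - 1, -1, -1)
        path.extend((l, p) for l in layers)
    return path
```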
Methods and apparatus for multiplexing data flows via a single data structure
Methods and apparatus for transacting multiple data flows between multiple processors. In one such implementation, multiple data pipes are aggregated over a common transfer data structure. Completion status information corresponding to each data pipe is provided over individual completion data structures. A common fixed pool of resources allocated for data transfer can be used in a variety of different load balancing and/or prioritization schemes; however, individualized completion status allows for individualized data pipe reclamation. Unlike prior art solutions, which dynamically create and pre-allocate memory space for each data pipe individually, the disclosed embodiments can only request resources from a fixed pool. In other words, outstanding requests are queued (rather than immediately serviced with a new memory allocation), so overall bandwidth remains constrained regardless of the number of data pipes that are opened and/or closed.
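A toy model of that scheme: many pipes share one fixed pool of transfer slots, excess requests queue instead of triggering new allocations, and each pipe has its own completion queue so it can be reaped individually. All names and the immediate-completion stand-in are illustrative assumptions.

```python
from collections import deque


class MultiplexedTransfer:
    """Pipes share a fixed slot pool; completions are tracked per pipe."""

    def __init__(self, pool_size):
        self.free = pool_size
        self.pending = deque()   # queued (pipe, payload) requests
        self.completions = {}    # pipe id -> its own completion queue

    def open_pipe(self, pid):
        self.completions[pid] = deque()

    def submit(self, pid, payload):
        if self.free > 0:
            self.free -= 1
            self._transfer(pid, payload)
        else:
            # No new allocation: requests beyond the pool are queued,
            # which is what bounds overall resource use.
            self.pending.append((pid, payload))

    def _transfer(self, pid, payload):
        # Stand-in for the actual transfer; completes immediately here.
        self.completions[pid].append(("done", payload))

    def reap(self, pid):
        """Reclaim this pipe's completed slots and service queued work."""
        done = list(self.completions[pid])
        self.completions[pid].clear()
        self.free += len(done)
        while self.pending and self.free > 0:
            self.free -= 1
            self._transfer(*self.pending.popleft())
        return done
```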
DATA TRANSMISSION CIRCUIT AND METHOD, CORE, CHIP, ELECTRONIC DEVICE AND STORAGE MEDIUM
A data transmission circuit and method, a core, a chip with a multi-core structure, an electronic device and a storage medium are provided. The data transmission circuit includes a receiver, a controller, a lookup table circuit and a selector. The receiver is configured to receive an original data packet from the Fabric; the controller is configured to determine, according to an original control bit, whether the original data packet needs to be relayed, and to enable a first input terminal of the selector in response to a determination that the original data packet needs to be relayed; the selector is configured to send a new data packet to the Fabric via the first input terminal, wherein the new data packet includes the original data and a new header acquired by the lookup table circuit according to an original index. In this way, power consumption of the data transmission circuit is reduced.
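Functionally, the data path above reduces to a small decision: check the control bit, and either repack the packet with a header fetched from the lookup table and send it back to the Fabric, or consume it locally. The field layout and return labels in this sketch are assumptions, not the patent's actual packet format.

```python
def handle_packet(packet, lookup_table):
    """Model of the circuit's per-packet data path.

    `packet` is (relay_bit, index, header, payload). When the original
    control bit marks the packet for relay, a new header is fetched from
    the lookup table using the original index and the repacked packet is
    sent back to the Fabric; otherwise the packet is delivered to the core.
    """
    relay_bit, index, header, payload = packet
    if relay_bit:
        new_header = lookup_table[index]
        return ("fabric", (relay_bit, index, new_header, payload))
    return ("core", packet)
```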
Networked computer with multiple embedded rings
A computer comprising a plurality of interconnected processing nodes arranged in multiple stacked layers forming a multi-face prism is provided. Each face of the prism comprises multiple stacked pairs of nodes. Said nodes are connected by at least two intralayer links. Each node is connected to a corresponding node in an adjacent pair by an interlayer link. The corresponding nodes are connected by respective interlayer links to form respective edges. Each pair forms part of a layer, each layer comprising multiple nodes, each node connected to its neighbouring nodes in the layer by at least one of the intralayer links to form a ring. Data is transmitted around paths formed by respective sets of nodes and links, each path having a first portion between first and second endmost layers, and a second portion provided between the second and first endmost layers and comprising one of the edges.
Topologies and algorithms for multi-processing unit interconnected accelerator systems
An accelerator system can include one or more clusters of eight processing units. The processing units can include seven communication ports. Each cluster of eight processing units can be organized into two subsets of four processing units. Each processing unit can be coupled to each of the other processing units in the same subset by a respective set of two bi-directional communication links. Each processing unit can also be coupled to a corresponding processing unit in the other subset by a respective single bi-directional communication link. Input data can be divided into one or more groups of four subsets of data. Each processing unit can be configured to sum corresponding subsets of the input data received on the two bi-directional communication links from the other processing units in the same subset with the input data of the respective processing unit to generate a respective set of intermediate data. Each processing unit can be configured to sum a corresponding set of intermediate data received on the one bi-directional communication link from the corresponding processing unit in the other subset with the intermediate data of the respective processing unit to generate respective sum data. Each processing unit can be configured to broadcast the sum data of the respective processing unit to the other processing units in the same subset on the respective sets of two bi-directional communication links.
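The three phases described above (intra-subset reduce of per-chunk data, a single cross-subset exchange, then an intra-subset broadcast) can be simulated sequentially as below. The chunk-to-device assignment and device numbering are assumptions for illustration; the phase structure follows the abstract.

```python
def cluster_allreduce(inputs):
    """Sum-allreduce over one simulated 8-unit cluster.

    `inputs` holds 8 equal-length vectors, each divisible into 4 chunks.
    Devices 0-3 form one subset, 4-7 the other; device i is paired with
    device (i + 4) mod 8 across the single inter-subset link."""
    n = len(inputs[0])
    c = n // 4                                   # chunk length
    chunk = lambda v, k: v[k * c:(k + 1) * c]

    # Phase 1: device i sums chunk (i % 4) received from the other three
    # devices in its subset with its own data -> intermediate data.
    inter = []
    for i in range(8):
        subset = range(4) if i < 4 else range(4, 8)
        k = i % 4
        inter.append([sum(chunk(inputs[j], k)[t] for j in subset)
                      for t in range(c)])

    # Phase 2: add the partner's intermediate data, received on the one
    # cross-subset link, giving the fully reduced chunk on each device.
    full = [[a + b for a, b in zip(inter[i], inter[(i + 4) % 8])]
            for i in range(8)]

    # Phase 3: each device broadcasts its reduced chunk to the others in
    # its subset, so all 8 devices end with the complete sum vector.
    out = [[0] * n for _ in range(8)]
    for i in range(8):
        subset = range(4) if i < 4 else range(4, 8)
        for j in subset:
            k = j % 4
            out[i][k * c:(k + 1) * c] = full[j]
    return out
```

Note how the schedule matches the port budget: the heavy intra-subset traffic rides the doubled links, while only the small intermediate data crosses the single inter-subset link.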
NETWORK-ON-CHIP DATA PROCESSING METHOD AND DEVICE
The present application relates to a network-on-chip data processing method. The method is applied to a network-on-chip processing system, the network-on-chip processing system is used for executing machine learning calculation, and the network-on-chip processing system comprises a storage device and a calculation device. The method comprises: accessing the storage device in the network-on-chip processing system by means of a first calculation device in the network-on-chip processing system, and obtaining first operation data; performing an operation on the first operation data by means of the first calculation device to obtain a first operation result; and sending the first operation result to a second calculation device in the network-on-chip processing system. According to the method, operation overhead can be reduced and data read/write efficiency can be improved.
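The claimed flow (access storage, operate, forward the result to a second calculation device) can be sketched as below. The class names and the dot-product stand-in for the machine-learning operation are illustrative assumptions.

```python
class Storage:
    """Stand-in for the storage device on the network-on-chip."""

    def __init__(self, data):
        self.data = data

    def read(self, addr):
        return self.data[addr]


class CalculationDevice:
    """Stand-in for a calculation device that receives forwarded results."""

    def __init__(self):
        self.inbox = []

    def receive(self, result):
        self.inbox.append(result)


def first_device_step(storage, addr_a, addr_b, second_device):
    # 1. Access the storage device and obtain the first operation data.
    a, b = storage.read(addr_a), storage.read(addr_b)
    # 2. Operate on the data to obtain the first operation result
    #    (a dot product here, purely for illustration).
    result = sum(x * y for x, y in zip(a, b))
    # 3. Send the first operation result to the second calculation device.
    second_device.receive(result)
    return result
```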
Network Computer with Two Embedded Rings
A computer comprising a plurality of interconnected processing nodes arranged in a configuration in which multiple layers of interconnected nodes are arranged along an axis, each layer comprising at least four processing nodes connected in a non-axial ring by at least one respective intralayer link between each pair of neighbouring processing nodes, wherein each of the at least four processing nodes in each layer is connected to a respective corresponding node in one or more adjacent layers by a respective interlayer link, the computer being programmed to provide in the configuration two embedded one-dimensional paths and to transmit data around each of the two embedded one-dimensional paths, each embedded one-dimensional path using all processing nodes of the computer in such a manner that the two embedded one-dimensional paths operate simultaneously without sharing links.