Patent classifications
G06F15/8015
Hierarchical ring-based interconnection network for symmetric multiprocessors
A symmetric multiprocessor includes with a hierarchical ring-based interconnection network is disclosed. The symmetric processor includes a plurality of buses comprised on the symmetric multiprocessor, wherein each of the buses are configured in a circular topology. The symmetric multiprocessor also includes a plurality of multi-processing nodes interconnected by the buses to make a hierarchical ring-based interconnection network for conveying commands between the multi-processing nodes. The interconnection network includes a command network configured to transport commands based on command tokens, wherein the tokens dictate a destination of the command, a partial response network configured to transport partial responses generated by the multi-processing nodes, and a combined response network configured to combine the partial responses generated by the multi-processing nodes using combined response tokens.
COMPUTATIONAL MEMORY WITH COOPERATION AMONG ROWS OF PROCESSING ELEMENTS AND MEMORY THEREOF
A computing device includes an array of processing elements mutually connected to perform single instruction multiple data (SIMD) operations, memory cells connected to each processing element to store data related to the SIMD operations, and a cache connected to each processing element to cache data related to the SIMD operations. Caches of adjacent processing elements are connected. The same or another computing device includes rows of mutually connected processing elements to share data. The computing device further includes a row arithmetic logic unit (ALU) at each row of processing elements. The row ALU of a respective row is configured to perform an operation with processing elements of the respective row.
Accelerating AI training by an all-reduce process with compression over a distributed system
According to various embodiments, methods and systems are provided to accelerate artificial intelligence (AI) model training with advanced interconnect communication technologies and systematic zero-value compression over a distributed training system. According to an exemplary method, during each iteration of a Scatter-Reduce process performed on a cluster of processors arranged in a logical ring to train a neural network model, a processor receives a compressed data block from a prior processor in the logical ring, performs an operation on the received compressed data block and a compressed data block generated on the processor to obtain a calculated data block, and sends the calculated data block to a following processor in the logical ring. A compressed data block calculated from corresponding data blocks from the processors can be identified on each processor and distributed to each other processor and decompressed therein for use in the AI model training.
High bandwidth memory system with distributed request broadcasting masters
A system comprises a processor and a plurality of memory units. The processor is coupled to each of the plurality of memory units by a plurality of network connections. The processor includes a plurality of processing elements arranged in a two-dimensional array and a corresponding two-dimensional communication network communicatively connecting each of the plurality of processing elements to other processing elements on same axes of the two-dimensional array. Each processing element that is located along a diagonal of the two-dimensional array is configured as a request broadcasting master for a respective group of processing elements located along a same axis of the two-dimensional array.
Synchronization amongst processor tiles
A processing system comprising an arrangement of tiles and an interconnect between the tiles. The interconnect comprises synchronization logic for coordinating a barrier synchronization to be performed between a group of the tiles. The instruction set comprises a synchronization instruction taking an operand which selects one of a plurality of available modes each specifying a different membership of the group. Execution of the synchronization instruction cause a synchronization request to be transmitted from the respective tile to the synchronization logic, and instruction issue to be suspended on the respective tile pending a synchronization acknowledgement being received back from the synchronization logic. In response to receiving the synchronization request from all the tiles in the group as specified by the operand of the synchronization instruction, the synchronization logic returns the synchronization acknowledgment to the tiles in the specified group.
HIERARCHICAL RING-BASED INTERCONNECTION NETWORK FOR SYMMETRIC MULTIPROCESSORS
A symmetric multiprocessor includes with a hierarchical ring-based interconnection network is disclosed. The symmetric processor includes a plurality of buses comprised on the symmetric multiprocessor, wherein each of the buses are configured in a circular topology. The symmetric multiprocessor also includes a plurality of multi-processing nodes interconnected by the buses to make a hierarchical ring-based interconnection network for conveying commands between the multi-processing nodes. The interconnection network includes a command network configured to transport commands based on command tokens, wherein the tokens dictate a destination of the command, a partial response network configured to transport partial responses generated by the multi-processing nodes, and a combined response network configured to combine the partial responses generated by the multi-processing nodes using combined response tokens.
HIERARCHICAL RING-BASED INTERCONNECTION NETWORK FOR SYMMETRIC MULTIPROCESSORS
A symmetric multiprocessor includes with a hierarchical ring-based interconnection network is disclosed. The symmetric processor includes a plurality of buses comprised on the symmetric multiprocessor, wherein each of the buses are configured in a circular topology. The symmetric multiprocessor also includes a plurality of multi-processing nodes interconnected by the buses to make a hierarchical ring-based interconnection network for conveying commands between the multi-processing nodes. The interconnection network includes a command network configured to transport commands based on command tokens, wherein the tokens dictate a destination of the command, a partial response network configured to transport partial responses generated by the multi-processing nodes, and a combined response network configured to combine the partial responses generated by the multi-processing nodes using combined response tokens.
Computational memory with cooperation among rows of processing elements and memory thereof
A computing device includes an array of processing elements mutually connected to perform single instruction multiple data (SIMD) operations, memory cells connected to each processing element to store data related to the SIMD operations, and a cache connected to each processing element to cache data related to the SIMD operations. Caches of adjacent processing elements are connected. The same or another computing device includes rows of mutually connected processing elements to share data. The computing device further includes a row arithmetic logic unit (ALU) at each row of processing elements. The row ALU of a respective row is configured to perform an operation with processing elements of the respective row.
SMID processing unit performing concurrent load/store and ALU operations
A computing device comprising: a plurality of ALUs; a set of registers; a memory; a memory interface between the registers and the memory; a control unit controlling the ALUs by generating: at least one cycle i including both implementing at least one first computing operation by way of an arithmetic logic unit and downloading a first dataset from the memory to at least one register; at least one cycle ii, following the at least one cycle i, including implementing a second computing operation by way of an arithmetic logic unit, for which second computing operation at least part of the first dataset forms at least one operand.
HYBRID HARDWARE ACCELERATOR AND PROGRAMMABLE ARRAY ARCHITECTURE
Techniques are disclosed for the use of a hybrid architecture that combines a programmable processing array and a hardware accelerator. The hybrid architecture dedicates the most computationally intensive blocks to the hardware accelerator, while maintaining flexibility for additional computations to be performed by the programmable processing array. An interface is also described for coupling the processing array to the hardware accelerator, which achieves a division of functionality and connects the programmable processing array components to the hardware accelerator components without sacrificing flexibility. This results in a balance between power/area and flexibility.