G06F15/8015

Architecture to support synchronization between core and inference engine for machine learning
11687837 · 2023-06-27 · ·

A system to support a machine learning (ML) operation comprises a core configured to receive and interpret commands into a set of instructions for the ML operation and a memory unit configured to maintain data for the ML operation. The system further comprises an inference engine having a plurality of processing tiles, each comprising an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform tasks of the ML operation on the data in the OCM. The system also comprises an instruction streaming engine configured to distribute the instructions to the processing tiles to control their operations and to synchronize data communication between the core and the inference engine so that data transmitted between them correctly reaches the corresponding processing tiles while ensuring coherence of data shared and distributed among the core and the OCMs.

Processor memory system
09836412 · 2017-12-05 · ·

A plurality of processing elements (PEs) include memory local to at least one of the processing elements in a data packet-switched network interconnecting the processing elements and the memory to enable any of the PEs to access the memory. The network consists of nodes arranged linearly or in a grid to connect the PEs and their local memories to a common controller. The processor performs memory accesses on data stored in the memory in response to control signals sent by the controller to the memory. The local memories share the same memory map or space. The packet-switched network supports multiple concurrent transfers between PEs and memory. Memory accesses include block and/or broadcast read and write operations, in which data can be replicated within the nodes and, according to the operation, written into the shared memory or into the local PE memory.

ARCHITECTURE TO SUPPORT SYNCHRONIZATION BETWEEN CORE AND INFERENCE ENGINE FOR MACHINE LEARNING
20220374774 · 2022-11-24 ·

A system to support a machine learning (ML) operation comprises a core configured to receive and interpret commands into a set of instructions for the ML operation and a memory unit configured to maintain data for the ML operation. The system further comprises an inference engine having a plurality of processing tiles, each comprising an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform tasks of the ML operation on the data in the OCM. The system also comprises an instruction streaming engine configured to distribute the instructions to the processing tiles to control their operations and to synchronize data communication between the core and the inference engine so that data transmitted between them correctly reaches the corresponding processing tiles while ensuring coherence of data shared and distributed among the core and the OCMs.

Dual vector arithmetic logic unit

A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general process register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.

Variable clocked serial array processor
09823689 · 2017-11-21 ·

A serial array processor may have an execution unit, which is comprised of a multiplicity of single bit arithmetic logic units (ALUs), and which may perform parallel operations on a subset of all the words in memory by serially accessing and processing them, one bit at a time, while an instruction unit of the processor is pre-fetching the next instruction, a word at a time, in a manner orthogonal to the execution unit.

Computational memory

A processing device includes an array of processing elements, each processing element including an arithmetic logic unit to perform an operation. The processing device further includes interconnections among the array of processing elements to provide direct communication among neighboring processing elements of the array of processing elements. A processing element of the array of processing elements may be connected to a first neighbor processing element that is immediately adjacent the processing element. The processing element may be further connected to a second neighbor processing element that is immediately adjacent the first neighbor processing element. A processing element of the array of processing elements may be connected to a neighbor processing element via an input selector to selectively take output of the neighbor processing element as input to the processing element. A computing device may include such processing devices in an arrangement of banks.

Merging and sorting arrays on an SIMD processor

Methods, systems, and articles of manufacture for merging and sorting arrays on a processor are provided herein. A method includes splitting an input array into multiple sub-arrays across multiple processing elements; merging the multiple sub-arrays into multiple vectors; and sorting the multiple vectors by comparing and swapping one or more vector elements among the multiple vectors.

Embedding rings on a toroid computer network
11372791 · 2022-06-28 · ·

A computer comprising a plurality of interconnected processing nodes arranged in a configuration with multiple layers, arranged along an axis, comprising first and second endmost layers and at least one intermediate layer between the first and second endmost layers is provided. Each layer comprises a plurality of processing nodes connected in a ring by an intralayer respective set of links between each pair of neighbouring processing nodes, the links adapted to operate simultaneously. Nodes in each layer are connected to respective corresponding nodes in each adjacent layer by an interlayer link. Each processing node in the first endmost layer is connected to a corresponding node in the second endmost layer. Data is transmitted around a plurality of embedded one-dimensional logical rings with an asymmetric bandwidth utilisation, each logical ring using all processing nodes of the computer in such a manner that the plurality of embedded one-dimensional logical rings operate simultaneously.

DUAL VECTOR ARITHMETIC LOGIC UNIT
20220188076 · 2022-06-16 ·

A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general process register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.

ONE-DIMENSIONAL ZERO PADDING IN A STREAM OF MATRIX ELEMENTS
20220147484 · 2022-05-12 ·

Software instructions are executed on a processor within a computer system to configure a steaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for a selected dimension of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When the selected dimension in the stream of vectors exceeds the specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.