G06F15/8046

Systolic neural network engine capable of backpropagation

A method of computer processing is disclosed comprising receiving a data packet at a processing node of a neural network, performing a calculation on the data packet at the processing node to create a processed data packet, attaching a tag to the processed data packet, transmitting the processed data packet from the processing node to a receiving node during a systolic pulse, receiving the processed data packet at the receiving node, performing a clockwise convolution on the processed data packet and a counter clockwise convolution on the processed data packet, performing an adding function, and backpropagating results of the performed sigmoid function to each of the processing nodes that originally processed the data packet.

MATRIX OPERATION WITH MULTIPLE TILES PER MATRIX DIMENSION

An embodiment of an apparatus comprises a systolic array to perform a matrix operation on two input tiles to produce an output tile result, and circuitry coupled to the systolic array to cause the systolic array to perform respective full matrix operations on more than one tile per matrix dimension in response to a single request. Other embodiments are disclosed and claimed.
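The idea of one request covering several tiles per matrix dimension can be sketched in software. This is a minimal illustration, not the claimed circuitry: `TILE`, `tile_matmul`, `get_tile`, and `multi_tile_matmul` are hypothetical names, and `tile_matmul` merely stands in for a single pass through a systolic array.

```python
TILE = 2  # hypothetical tile edge length (assumption, not from the abstract)

def tile_matmul(a, b):
    """Stand-in for one systolic-array pass: multiply two TILE x TILE tiles."""
    return [[sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
            for i in range(TILE)]

def get_tile(m, r, c):
    """Extract the tile at tile-row r, tile-column c from matrix m."""
    return [row[c * TILE:(c + 1) * TILE] for row in m[r * TILE:(r + 1) * TILE]]

def multi_tile_matmul(a, b):
    """Single request: iterate over every tile in each matrix dimension."""
    m, k, n = len(a) // TILE, len(b) // TILE, len(b[0]) // TILE
    out = [[0] * (n * TILE) for _ in range(m * TILE)]
    for ti in range(m):
        for tj in range(n):
            for tk in range(k):
                # One full tile operation, accumulated into the output tile.
                part = tile_matmul(get_tile(a, ti, tk), get_tile(b, tk, tj))
                for i in range(TILE):
                    for j in range(TILE):
                        out[ti * TILE + i][tj * TILE + j] += part[i][j]
    return out
```

The caller issues one call to `multi_tile_matmul`, and the loop nest plays the role of the circuitry that sequences the per-tile operations. Matrix dimensions are assumed to be multiples of `TILE`.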

Error Checking For Systolic Array Computation
20230036421 · 2023-02-02

Aspects of the disclosure are directed to a computation unit implementing a systolic array and configured for detecting errors while processing data on the systolic array. A checksum circuit in communication with the systolic array is configured to compute checksums and perform error detection while the systolic array processes input data. Instead of pre-generating checksums in input matrices, input matrices can be fed directly into the systolic array through the checksum circuit. On the output side, the checksum circuit can generate checksums and compare them with checksums in an output matrix generated by the systolic array. Error checking of the operations that generate the output matrix can be performed without delaying the operations of the systolic array, and without preprocessing the input matrices.
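A checksum comparison of this kind can be sketched with a column-sum identity. The abstract does not state which checksum scheme is used, so the sketch assumes the classic one from algorithm-based fault tolerance: for C = A·B, the column sums of C equal the column sums of A multiplied by B. The function names are hypothetical, and integer entries keep the comparison exact.

```python
def matmul(a, b):
    """Plain matrix product; stands in for the systolic array's output."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_sums(m):
    """Checksum vector: the sum of each column of m."""
    return [sum(col) for col in zip(*m)]

def check_product(a, b, c):
    """Return True if c passes the column-checksum test for a @ b."""
    expected = matmul([column_sums(a)], b)[0]  # checksum derived from the inputs
    return expected == column_sums(c)          # checksum taken from the output
```

Because the input-side checksum is a single extra row multiplied by B, it can in principle be computed alongside the main product rather than in a separate preprocessing pass. A corrupted entry in `c` changes one column sum and is flagged.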

EMULATION OF FLOATING POINT CALCULATION

Emulating floating point calculation using lower precision format calculations is described. An example of a processor includes a floating point unit (FPU) to provide a native floating point operation in a first precision format; and systolic array hardware including multiple data processing units, wherein the processor is to receive data for performance of a matrix multiplication operation in the first precision format; enable an emulated floating point multiplication operation using one or more values with a second precision format, the second precision format having a lower precision than the first precision format, the emulated floating point multiplication including operation of the systolic array hardware; and generate an emulated result for the matrix multiplication operation.
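One common way to emulate a higher-precision multiply with lower-precision values is to split each operand into a high part and a low part and sum the partial products. The abstract does not say which decomposition is used, so the following is a sketch under assumptions: binary64 plays the first precision format, binary32 plays the second, and Python's own float arithmetic stands in for a wide accumulator. All function names are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def split(x):
    """Split x into two binary32 values whose sum approximates x closely."""
    hi = to_f32(x)
    lo = to_f32(x - hi)  # residual that the high part could not represent
    return hi, lo

def emulated_mul(x, y):
    """Approximate x * y from partial products of the low-precision parts."""
    xh, xl = split(x)
    yh, yl = split(y)
    # Each partial product takes two binary32 operands; the sums are kept wide.
    return xh * yh + (xh * yl + xl * yh) + xl * yl
```

For operands such as 1/3 and 0.1, the single low-precision product `to_f32(x) * to_f32(y)` is off by roughly 1e-9, while the four-term sum recovers the product to near binary64 accuracy. In a matrix multiplication, each partial-product term would become its own low-precision matrix multiply.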

Optimizing memory bandwidth in spatial architectures
11481329 · 2022-10-25

A technique to facilitate efficient, parallelized execution of a program using a multiprocessor system having two or more processors includes detecting and, optionally, minimizing broadcast data communication between a shared memory and two or more processors. To this end, the broadcast space of a data structure is generated as an intersection of the reuse space of the data structure and the placement space of a statement accessing the data structure. A non-empty broadcast space implies broadcast data communication that can be minimized by rescheduling the statement accessing the data structure.

Systems and methods for systolic array design from a high-level program
11604758 · 2023-03-14

Systems and methods for automated systolic array design from a high-level program are disclosed. One implementation of a systolic array design supporting a convolutional neural network includes a two-dimensional array of reconfigurable processing elements arranged in rows and columns. Each processing element has an associated SIMD vector and is connected through a local connection to at least one other processing element. An input feature map buffer having a double buffer is configured to store input feature maps, and an interconnect system is configured to pass data to neighboring processing elements in accordance with a processing element scheduler. A CNN computation is mapped onto the two-dimensional array of reconfigurable processing elements using an automated system configured to determine suitable reconfigurable processing element parameters.

SYSTOLIC ARRAY-BASED DATA PROCESSING METHOD AND APPARATUS, MEDIUM, AND PROGRAM PRODUCT
20230070177 · 2023-03-09

The present disclosure provides a systolic array-based data processing method that includes determining an input splice quantity for the systolic array based on a target input depth and a standard input depth, and determining an output splice quantity for the systolic array based on a target output depth and a standard output depth; inputting the input data matching the input splice quantity to an input buffer of the systolic array in batches, without overlaps in the input data, and processing, by the systolic array, the input data in the input buffer to generate output data corresponding to each piece of input data; and in accordance with a determination that a quantity of output data received by an output buffer of the systolic array from the systolic array matches the output splice quantity, outputting, in the output buffer, output data having a quantity matching the output splice quantity in batches.

Low latency matrix multiply unit
11599601 · 2023-03-07

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply unit includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

Scalable sparse matrix multiply acceleration using systolic arrays with feedback inputs

Described herein is an accelerator device including a host interface, a fabric interconnect coupled with the host interface, and one or more hardware tiles coupled with the fabric interconnect, the one or more hardware tiles including sparse matrix multiply acceleration hardware including a systolic array with feedback inputs.

Systolic arithmetic on sparse data

Embodiments described herein provide an instruction and associated logic to enable a processing resource including a tensor accelerator to perform optimized computation of sparse submatrix operations. One embodiment provides hardware logic to apply a numerical transform to matrix data to increase the sparsity of the data. Increasing the sparsity may result in a higher compression ratio when the matrix data is compressed.
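How a numerical transform can raise sparsity is easiest to see with a small example. The abstract does not name the transform, so the sketch below assumes a simple, invertible delta encoding along a row, purely for illustration. All function names are hypothetical.

```python
def delta_transform(row):
    """Keep the first element; replace each later one with its difference
    from the previous element, so runs of equal values become zeros."""
    return row[:1] + [b - a for a, b in zip(row, row[1:])]

def inverse_delta(row):
    """Undo delta_transform with a running sum."""
    out, total = [], 0
    for v in row:
        total += v
        out.append(total)
    return out

def sparsity(row):
    """Fraction of entries that are exactly zero."""
    return row.count(0) / len(row)
```

A row such as `[5, 5, 5, 5, 7, 7, 7, 7]` contains no zeros, but its delta form `[5, 0, 0, 0, 2, 0, 0, 0]` is 75% zeros, which a zero-skipping compressor can exploit. The running sum restores the original values.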