G06F2207/3892

SPARSE MATRIX MULTIPLICATION ACCELERATION MECHANISM

An apparatus to facilitate acceleration of matrix multiplication operations. The apparatus comprises a systolic array including matrix multiplication hardware to perform multiply-add operations on received matrix data comprising data from a plurality of input matrices and sparse matrix acceleration hardware to detect zero values in the matrix data and perform one or more optimizations on the matrix data to reduce multiply-add operations to be performed by the matrix multiplication hardware.

Hardware accelerator for systolic matrix multiplication
10915297 · 2021-02-09 · ·

Computational apparatus includes a systolic array of processing elements. In each of a sequence of processing cycles, the processing elements in a first row of the array each receive a respective first plurality of first operands, while the processing elements in a first column of the array each receive a respective second plurality of second operands. Each processing element, except in the first row and first column, receives the respective first and second pluralities of the operands from adjacent processing elements in a preceding row and column of the array. Each processing element multiplies pairs of the first and second operands together to generate multiple respective products, and accumulates the products in accumulators. Synchronization logic loads a succession of first and second vectors of the operands into the array, and upon completion of processing triggers the processing elements to transfer respective data values from the accumulators out of the array.

DILATED CONVOLUTION USING SYSTOLIC ARRAY
20200410036 · 2020-12-31 ·

In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.

Compression-encoding scheduled inputs for matrix computations

A method of performing matrix computations includes receiving a compression-encoded matrix including a plurality of rows. Each row of the compression-encoded matrix has a plurality of defined element values and, for each such defined element value, a schedule tag indicating a schedule for using the defined element value in a scheduled matrix computation. The method further includes loading the plurality of rows of the compression-encoded matrix into a corresponding plurality of work memory banks, and providing decoded input data to a matrix computation module configured for performing the scheduled matrix computation. For each work memory bank, a next defined element value and a corresponding schedule tag are read. If the schedule tag meets a scheduling condition, the next defined element value is provided to the matrix computation module. Otherwise, a default element value is provided to the matrix computation module.

COMPRESSION-ENCODING SCHEDULED INPUTS FOR MATRIX COMPUTATIONS

A method of performing matrix computations includes receiving a compression-encoded matrix including a plurality of rows. Each row of the compression-encoded matrix has a plurality of defined element values and, for each such defined element value, a schedule tag indicating a schedule for using the defined element value in a scheduled matrix computation. The method further includes loading the plurality of rows of the compression-encoded matrix into a corresponding plurality of work memory banks, and providing decoded input data to a matrix computation module configured for performing the scheduled matrix computation. For each work memory bank, a next defined element value and a corresponding schedule tag are read. If the schedule tag meets a scheduling condition, the next defined element value is provided to the matrix computation module. Otherwise, a default element value is provided to the matrix computation module.

Pipelined cascaded digital signal processing structures and methods
10417004 · 2019-09-17 · ·

Circuitry operating under a floating-point mode or a fixed-point mode includes a first circuit accepting a first data input and generating a first data output. The first circuit includes a first arithmetic element accepting the first data input, a plurality of pipeline registers disposed in connection with the first arithmetic element, and a cascade register that outputs the first data output. The circuitry further includes a second circuit accepting a second data input and generating a second data output. The second circuit is cascaded to the first circuit such that the first data output is connected to the second data input via the cascade register. The cascade register is selectively bypassed when the first circuit is operated under the fixed-point mode.

Sparse matrix multiplication acceleration mechanism

An apparatus to facilitate acceleration of matrix multiplication operations. The apparatus comprises a systolic array including matrix multiplication hardware to perform multiply-add operations on received matrix data comprising data from a plurality of input matrices and sparse matrix acceleration hardware to detect zero values in the matrix data and perform one or more optimizations on the matrix data to reduce multiply-add operations to be performed by the matrix multiplication hardware.

Memory-Size- and Bandwidth-Efficient Method for Feeding Systolic Array Matrix Multipliers

Matrix multiplication systolic array feed methods and related processing element (PE) microarchitectures for efficiently implementing systolic array generic matrix multiplier (SGEMM) in integrated circuits is provided. A systolic array architecture may include a processing element array, a column feeder array, and a row feeder array. A bandwidth of external memory may be reduced by a factor of reduction based on interleaving of the matrix data via a feeding pattern of the column feeder array and the row feeder array.

Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range
12067375 · 2024-08-20 · ·

Systems and methods are provided to perform multiply-accumulate operations of at least one normalized number in a systolic array. The systolic array can obtain a first input and detect that the first input is denormal. Based on determining the first input is denormal, the systolic array can generate a first normalized number by normalizing the first input. Processing elements of the systolic array can include a multiplier and an adder. The multiplier can multiply the first normalized number by a second normal or normalized number to generate a multiplier product and the adder can add an input partial sum to the multiplier product to generate an addition result.

Calculation circuit and deep learning system including the same
12141545 · 2024-11-12 · ·

A calculation circuit may include a plurality of calculator groups constituting a systolic array composed of a plurality of rows and columns, wherein calculator groups included in each of the rows propagate a data value set through a single data path corresponding to the row in a data propagation direction, and propagate a plurality of drain value sets through a plurality of drain paths corresponding to the row in a drain propagation direction, and wherein a calculator group of the calculator groups included in each of the rows comprises a plurality of MAC (Multiplier-Accumulator) circuits, and the MAC circuits generate drain values respectively included in the drain value sets at the same time. The calculator groups included in each column may further propagate a weight value set corresponding to the column through a plurality of weight data paths corresponding to the column.