G06F7/527

Matrix Multiplication System, Apparatus and Method
20210124560 · 2021-04-29 · ·

The present disclosure advantageously provides a system, matrix multiply accelerator (MMA) and method for efficiently multiplying matrices. The MMA includes a vector register to store the row vectors of one input matrix, a vector register to store the column vectors of another input matrix, a vector register to store an output matrix, and an array of vector multiply and accumulate (VMAC) units coupled to the vector registers. Each VMAC unit is coupled to at least two row vector signal lines and at least two column vector signal lines, and is configured to calculate the dot product for one element i,j of the output matrix by multiplying each row vector formed from the i.sup.th row of the first matrix with a corresponding column vector formed from the j.sup.th column of the second matrix to generate intermediate products, and accumulate the intermediate products into a scalar value.

Systems and Methods for Low Latency Modular Multiplication
20210117157 · 2021-04-22 ·

An integrated circuit device includes multiplier circuitry configured to determine a plurality of columns of subproducts by multiplying a plurality of values. Each column of the plurality of columns includes one or more subproducts of a plurality of subproducts. The integrated circuit device also includes adder circuitry configured to determine a plurality of sums, each sum being a sum of one column of the plurality of columns. A first portion of the adder circuitry associated with a first column of the plurality of columns is configured to receive a first value and second value that are associated with the first column and a third value associated with a second column of the plurality of columns that differs from the first column. The third value is a carry-out value generated by a second portion of the adder circuitry associated with the second column of the plurality of columns.

APPARATUS AND METHOD FOR PERFORMING AN INDEX OPERATION

An apparatus and method are provided for performing an index operation. The apparatus has vector processing circuitry to perform an index operation in each of a plurality of lanes of parallel processing. The index operation requires an index value opm to be multiplied by a multiplier value e to produce a multiplication result. The number of lanes of parallel processing is dependent on a specified element size, and the multiplier value is different, but known, for each lane of parallel processing. The vector processing circuitry comprises mapping circuitry to perform, within each lane, mapping operations on the index value opm in order to generate a plurality of intermediate input values. The plurality of intermediate input values are such that the addition of the plurality of intermediate input values produces the multiplication result. Within each lane the mapping operations are determined by the multiplier value used for that lane. The vector processing circuitry also has vector adder circuitry to perform, within each lane, an addition of at least the plurality of intermediate input values, in order to produce a result vector providing a result value for the index operation performed in each lane. This provides a high performance, low latency, technique for vectorising index operations.

APPARATUS AND METHOD FOR PERFORMING AN INDEX OPERATION

An apparatus and method are provided for performing an index operation. The apparatus has vector processing circuitry to perform an index operation in each of a plurality of lanes of parallel processing. The index operation requires an index value opm to be multiplied by a multiplier value e to produce a multiplication result. The number of lanes of parallel processing is dependent on a specified element size, and the multiplier value is different, but known, for each lane of parallel processing. The vector processing circuitry comprises mapping circuitry to perform, within each lane, mapping operations on the index value opm in order to generate a plurality of intermediate input values. The plurality of intermediate input values are such that the addition of the plurality of intermediate input values produces the multiplication result. Within each lane the mapping operations are determined by the multiplier value used for that lane. The vector processing circuitry also has vector adder circuitry to perform, within each lane, an addition of at least the plurality of intermediate input values, in order to produce a result vector providing a result value for the index operation performed in each lane. This provides a high performance, low latency, technique for vectorising index operations.

Systems and Methods for Loading Weights into a Tensor Processing Block
20200326948 · 2020-10-15 ·

The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. In a first mode of operation, the first and second pluralities of values are received via a first portion of the plurality of inputs. In a second mode of operation, the first plurality of values is received via a second portion of the plurality of inputs, and the second plurality of values is received via the first portion of the plurality of inputs. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.

FPGA Specialist Processing Block for Machine Learning
20200327271 · 2020-10-15 ·

The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.

PROCESSING APPARATUS AND PROCESSING METHOD
20200117976 · 2020-04-16 ·

The present disclosure provides a processing device and method. The device includes: an input/output module, a controller module, a computing module, and a storage module. The input/output module is configured to store and transmit input and output data; the controller module is configured to decode a computation instruction into a control signal to control other modules to perform operation; the computing module is configured to perform four arithmetic operation, logical operation, shift operation, and complement operation on data; and the storage module is configured to temporarily store instructions and data. The present disclosure can execute a composite scalar instruction accurately and efficiently.

PROCESSING APPARATUS AND PROCESSING METHOD
20200097792 · 2020-03-26 ·

The present disclosure relates to a processing device including a memory configured to store data to be computed; a computational circuit configured to compute the data to be computed, which includes performing acceleration computations on the data to be computed by using an adder circuit and a multiplier circuit; and a control circuit configured to control the memory and the computational circuit, which includes performing acceleration computations according to the data to be computed. The present disclosure may have high flexibility, good configurability, fast computational speed, low power consumption, and other features.

PROCESSING APPARATUS AND PROCESSING METHOD
20200097794 · 2020-03-26 ·

The present disclosure provides a counting device and counting method. The device includes a storage unit, a counting unit, and a register unit, where the storage unit may be connected to the counting unit for storing input data to be counted and storing a number of elements satisfying a given condition in the input data after counting; the register unit may be configured to store an address where input data to be counted is stored in the storage unit; and the counting unit may be connected to the register unit, and may be configured to acquire a counting instruction, read a storage address of the input data to be counted in the register unit according to the counting instruction, acquire corresponding input data to be counted in the storage unit, perform statistical counting on a number of elements in the input data to be counted that satisfy the given condition, and obtain a counting result. The counting device and the method may improve the computation efficiency by writing an algorithm of counting a number of elements that satisfy a given condition in input data into an instruction form.

PROCESSING APPARATUS AND PROCESSING METHOD
20200097795 · 2020-03-26 ·

The present disclosure provides a computation device and method. The device may include an input module configured to acquire input data; a model generation module configured to construct an offline model according to an input network structure and weight data; a neural network operation module configured to generate a computation instruction based on the offline model and cache the computation instruction, and compute the data to be processed based on the computation instruction to obtain a computation result; and an output module configured to output a computation result. The device and method may avoid the overhead caused by running an entire software architecture, which is a problem in a traditional method.