Patent classifications
G06F7/527
In-Memory Near-Data Approximate Acceleration
A random access memory may include memory banks and arithmetic approximation units. Each arithmetic approximation unit may be dedicated to one or more of the memory banks and include a respective multiply-and-accumulate unit and a respective lookup-table unit. The respective multiply-and-accumulate unit is configured to iteratively perform shift and add operations with two inputs and to provide a result of the shift and add operations to the respective lookup-table unit. The result approximates or is a product of the two inputs. The respective lookup-table unit is configured produce an output by applying a pre-defined function to the result. The arithmetic approximation units are configured for parallel operation. The random access memory may also include a memory controller configured to receive instructions, from a processor, regarding locations within the memory banks from which to obtain the two inputs and in which to write the output.
Matrix multiplication system, apparatus and method
The present disclosure advantageously provides a system, matrix multiply accelerator (MMA) and method for efficiently multiplying matrices. The MMA includes a vector register to store the row vectors of one input matrix, a vector register to store the column vectors of another input matrix, a vector register to store an output matrix, and an array of vector multiply and accumulate (VMAC) units coupled to the vector registers. Each VMAC unit is coupled to at least two row vector signal lines and at least two column vector signal lines, and is configured to calculate the dot product for one element i,j of the output matrix by multiplying each row vector formed from the i.sup.th row of the first matrix with a corresponding column vector formed from the j.sup.th column of the second matrix to generate intermediate products, and accumulate the intermediate products into a scalar value.
Processing apparatus and processing method
The present disclosure provides a processing device and method. The device includes: an input/output module, a controller module, a computing module, and a storage module. The input/output module is configured to store and transmit input and output data; the controller module is configured to decode a computation instruction into a control signal to control other modules to perform operation; the computing module is configured to perform four arithmetic operation, logical operation, shift operation, and complement operation on data; and the storage module is configured to temporarily store instructions and data. The present disclosure can execute a composite scalar instruction accurately and efficiently.
MULTIPLE MULTIPLICATION ARRAYS
A data processing apparatus is provided. An A×B multiplier array has a group of logic gates clocked by a first clock signal, where A and B are both integers. A C×D multiplier array, separate from the A×B multiplier array, has second group of logic gates clocked by a second clock signal, where C and D are both integers. Addition circuitry performs an addition operation between a first at least partial product produced by the A×B multiplier array and a second at least partial product produced by the C×D multiplier array.
Counting elements in neural network input data
The present disclosure provides a counting device and counting method. The device includes a storage unit, a counting unit, and a register unit, where the storage unit may be connected to the counting unit for storing input data to be counted and storing a number of elements satisfying a given condition in the input data after counting; the register unit may be configured to store an address where input data to be counted is stored in the storage unit; and the counting unit may be connected to the register unit, and may be configured to acquire a counting instruction, read a storage address of the input data to be counted in the register unit according to the counting instruction, acquire corresponding input data to be counted in the storage unit, perform statistical counting on a number of elements in the input data to be counted that satisfy the given condition, and obtain a counting result. The counting device and the method may improve the computation efficiency by writing an algorithm of counting a number of elements that satisfy a given condition in input data into an instruction form.
Systems and methods for loading weights into a tensor processing block
The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. In a first mode of operation, the first and second pluralities of values are received via a first portion of the plurality of inputs. In a second mode of operation, the first plurality of values is received via a second portion of the plurality of inputs, and the second plurality of values is received via the first portion of the plurality of inputs. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.
Implementing large multipliers in tensor arrays
The present disclosure describes an integrated circuit device that includes a digital signal processing (DSP) block. The DSP block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Also, the first plurality of inputs, the second plurality of inputs, or both are derived from higher precision values. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.
BOOTH MULTIPLIER FOR COMPUTE-IN-MEMORY
A compute-in-memory device may include a Booth encoder configured to receive at least one input of first bits, a Booth decoder configured to receive at least one weight of second bits and to output a plurality of partial products of the at least one input and the at least one weight, an adder configured to add a first partial product of the plurality of the partial products and a second partial product of the plurality of partial products before the Booth decoder generates a third partial product of the plurality of the partial products and to generate a plurality of sums of partial products, and a carry-lookahead adder configured to add the plurality of sums of partial products and to generate a final sum.
Processing apparatus and processing method with dynamically configurable operation bit width
A processing device with dynamically configurable operation bit width, characterized by comprising: a memory for storing data, the data comprising data to be operated, intermediate operation result, final operation result, and data to be buffered in a neural network; a data width adjustment circuit for adjusting the width of the data to be operated, the intermediate operation result, the final operation result, and/or the data to be buffered; an operation circuit for operating the data to be operated, including performing operation on data to be operated of different bit widths by using an adder circuit and a multiplier; and a control circuit for controlling the memory, the data width adjustment circuit and the operation circuit. The device of the present disclosure can have the advantages of strong flexibility, high configurability, fast operation speed, low power consumption or the like.
FPGA specialist processing block for machine learning
The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.