Patent classifications
G06F7/527
Bit-serial linear algebra processor
The invention is notably directed to a computing system configured to perform linear algebraic operations. The computing system comprises a co-processing module comprising a co-processing unit. The co-processing unit comprises a parallel array of bit-serial processing units. The bit-serial processing units are adapted to perform the linear algebraic operations with variable precision. The invention further concerns a related computer implemented method and a related computer program product.
Bit-serial linear algebra processor
The invention is notably directed to a computing system configured to perform linear algebraic operations. The computing system comprises a co-processing module comprising a co-processing unit. The co-processing unit comprises a parallel array of bit-serial processing units. The bit-serial processing units are adapted to perform the linear algebraic operations with variable precision. The invention further concerns a related computer implemented method and a related computer program product.
MEMORY DEVICE AND OPERATION METHOD THEREOF
A memory device and an operation method thereof are provided. The memory device includes: a memory array including a plurality of memory cells for storing a plurality of weights; a multiplication circuit for performing bitwise multiplication on a plurality of input data and the weights to generate a plurality of multiplication results, wherein in performing bitwise multiplication, the memory cells generate a plurality of memory cell currents; a digital accumulating circuit for performing a digital accumulating on the multiplication results; an analog accumulating circuit for performing an analog accumulating on the memory cell currents to generate a first MAC operation result; and a decision unit for deciding whether to perform the analog accumulating; the digital accumulating or a hybrid accumulating, wherein in performing the hybrid accumulating, whether the digital accumulating circuit is triggered is based on the first MAC operation result.
DEVICE FOR COMPUTING AN INNER PRODUCT OF VECTORS
A device for computing an inner product of vectors includes a vector data arranger, a vector data pre-accumulator, a number converter, and a post-accumulator. The vector data arranger stores a first vector and sequentially outputs a plurality of vector data based on the first vector. The vector data pre-accumulator stores a second vector, receives each of the vector data, and pre-accumulates the second vector, so as to generate a plurality accumulation results. The number converter and the post-accumulator receive and process all the accumulation results corresponding to each of the vector data to generate an inner product value. The present invention implements a lookup table with the vector data pre-accumulator and the number converter to increase calculation speed and reduce power consumption.
Implementing Large Multipliers in Tensor Arrays
The present disclosure describes an integrated circuit device that includes a digital signal processing (DSP) block. The DSP block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Also, the first plurality of inputs, the second plurality of inputs, or both are derived from higher precision values. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.
Efficient Logic Blocks Architectures for Dense Mapping of Multipliers
An integrated circuit includes a logic block configured to perform multiplication operations. The logic block includes a plurality of lookup tables configured to receive a plurality of inputs and generate a first plurality of outputs. Additionally, the logic block includes adding circuitry configured to receive the first plurality of outputs and generate a second plurality of outputs. Furthermore, the logic block includes circuitry configured to receive a portion of the plurality of inputs, determine one or more partial products, and generate a third plurality of outputs.
FPGA Specialist Processing Block for Machine Learning
The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.
FPGA Specialist Processing Block for Machine Learning
The present disclosure describes a digital signal processing (DSP) block that includes a plurality of columns of weight registers and a plurality of inputs configured to receive a first plurality of values and a second plurality of values. The first plurality of values is stored in the plurality of columns of weight registers after being received. Additionally, the DSP block includes a plurality of multipliers configured to simultaneously multiply each value of the first plurality of values by each value of the second plurality of values.
Apparatus and method for performing an index operation
An apparatus and method are provided for performing an index operation. The apparatus has vector processing circuitry to perform an index operation in each of a plurality of lanes of parallel processing. The index operation requires an index value opm to be multiplied by a multiplier value e to produce a multiplication result. The number of lanes of parallel processing is dependent on a specified element size, and the multiplier value is different, but known, for each lane of parallel processing. The vector processing circuitry comprises mapping circuitry to perform, within each lane, mapping operations on the index value opm in order to generate a plurality of intermediate input values. The plurality of intermediate input values are such that the addition of the plurality of intermediate input values produces the multiplication result. Within each lane the mapping operations are determined by the multiplier value used for that lane. The vector processing circuitry also has vector adder circuitry to perform, within each lane, an addition of at least the plurality of intermediate input values, in order to produce a result vector providing a result value for the index operation performed in each lane. This provides a high performance, low latency, technique for vectorising index operations.
Apparatus and method for performing an index operation
An apparatus and method are provided for performing an index operation. The apparatus has vector processing circuitry to perform an index operation in each of a plurality of lanes of parallel processing. The index operation requires an index value opm to be multiplied by a multiplier value e to produce a multiplication result. The number of lanes of parallel processing is dependent on a specified element size, and the multiplier value is different, but known, for each lane of parallel processing. The vector processing circuitry comprises mapping circuitry to perform, within each lane, mapping operations on the index value opm in order to generate a plurality of intermediate input values. The plurality of intermediate input values are such that the addition of the plurality of intermediate input values produces the multiplication result. Within each lane the mapping operations are determined by the multiplier value used for that lane. The vector processing circuitry also has vector adder circuitry to perform, within each lane, an addition of at least the plurality of intermediate input values, in order to produce a result vector providing a result value for the index operation performed in each lane. This provides a high performance, low latency, technique for vectorising index operations.