G06F7/53

ARTIFICIAL INTELLIGENCE ACCELERATORS
20220374690 · 2022-11-24 · ·

An artificial intelligence (AI) accelerator includes memory circuits configured to output weight data and vector data, a multiplication circuit/adder tree performing a multiplying/adding calculation on the weight data and the vector data to generate multiplication/addition result data, a first accumulator synchronized with an odd clock signal to perform an accumulative adding calculation on odd-numbered multiplication/addition result data of the multiplication/addition result data and a first latched data, and a second accumulator synchronized with an even clock signal to perform an accumulative adding calculation on even-numbered multiplication/addition result data of the multiplication/addition result data and a second latched data.

PROCESSING ELEMENT AND NEURAL PROCESSING DEVICE INCLUDING SAME
20220374691 · 2022-11-24 ·

The present disclosure discloses a processing element and a neural processing device including the processing element. The processing element includes a weight register configured to store a weight, an input activation register configured to store an input activation, a flexible multiplier configured to receive a first sub-weight of a first precision included in the weight, receive a first sub-input activation of the first precision included in the input activation, and generate result data by performing multiplication calculation of the first sub-weight and the first sub-input activation as the first precision or a second precision different from the first precision according to the first sub-weight and the first sub-input activation and a saturating adder configured to generate a partial sum by using the result data.

PARALLEL MATRIX OPERATIONS IN A RECONFIGURABLE COMPUTE FABRIC
20230056246 · 2023-02-23 ·

A first set of multiple coordinate data structure elements describing non-zero values of an input matrix may be loaded to a compute element. A first set of input vector values having input vector row numbers corresponding to input matrix column numbers of the first set of multiple coordinate data structure elements may also be loaded to the compute element. Multiple parallel processing lanes of the compute element may be used to update multiple partial accumulation values, where each partial accumulation value corresponds to an output vector row and one of the multiple parallel processing lanes. At least a portion of the partial accumulation values corresponding to the first input matrix row may be summed across at least a portion of the parallel processing lanes to generate a first output vector row value.

COMPUTING APPARATUS AND METHOD FOR VECTOR INNER PRODUCT, AND INTEGRATED CIRCUIT CHIP
20220366006 · 2022-11-17 ·

The present disclosure relates to a computing apparatus, a method and an integrated circuit chip for a vector inner product, where the computing apparatus may be included in a combined processing apparatus. The combined processing apparatus may further include a general interconnection interface and other processing apparatus. The computing apparatus interacts with other processing apparatus to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus, where the storage apparatus is respectively connected to the computing apparatus and other processing apparatus, and the storage apparatus is used for storing data of the computing apparatus and other processing apparatus.

COMPRESSED WALLACE TREES IN FMA CIRCUITS

An embodiment of an apparatus comprises one or more fractional width fused multiply-accumulate (FMA) circuits configured as a shared Wallace tree, and circuitry coupled to the one or more fractional width FMA circuits to provide one or more fractional width FMA operations through the one or more fractional width FMA circuits. Other embodiments are disclosed and claimed.

Modular gated multiplier circuitry and multiplication technique

Various implementations described herein are related to a device having multiplier circuitry with an array of summation result cells that holds summation bit values for shifted arrays added together. The device may include latch circuitry having one or more gated elements disposed between the summation result cells, and the gated elements may be adapted to provide a portion of the summation bit values based on a gating signal.

Multi-partitioning for combination operations

Systems and methods are disclosed for processing and executing queries against one or more dataset. As part of processing the query, the system determines whether the query is susceptible to a significantly imbalanced partition. In the event, the query is susceptible to an imbalanced partition, the system monitors the query and determines whether to perform a multi-partitioning determination to avoid a significantly imbalanced partition.

DADDA ARCHITECTURE THAT SCALES WITH INCREASING OPERAND SIZE
20220357921 · 2022-11-10 ·

Aspects of the invention include physical design-optimal Dadda architectures that scale with increasing operand size. Partial product arrays can be generated for two n-bit operands and columns in the partial product arrays can be shifted to a first row. The number of partial products in each column can be iteratively reduced across one or more stages until each column has at most two partial products. At each stage a maximum column height is determined and each column having a height greater than the maximum column height is reduced using half-adders and full-adders. Result bits of the half-adders and the full-adders are placed at the bottom of the current column and carry bits of the half-adders and the full-adders are placed at the bottom of the next column.

Arithmetic circuit for performing product-sum arithmetic

An arithmetic circuit includes a LUT generation circuit (1) that, when coefficients c[n] (n=1, . . . , N) are paired two by two, outputs a value calculated for each of the pairs, and distributed arithmetic circuits (2-m) that calculate values z[m] that are sums of products of data x[m, n] of a data set X[m] containing M pairs of data x[m, n] and the coefficients c[n], in parallel for each of the M pairs. The distributed arithmetic circuit (2-m) includes binomial distributed arithmetic circuits that, for each of the pairs, calculate sums of products of a value obtained by pairing N data x[m, n] corresponding to the circuit two by two and a value obtained by pairing the coefficients c[n] two by two, and a figure matching circuit that matches a number of decimal figures of the sums with a predetermined number of decimal figures.

FOLDING COLUMN ADDER ARCHITECTURE FOR DIGITAL COMPUTE IN MEMORY
20230031841 · 2023-02-02 ·

Certain aspects provide an apparatus for performing machine learning tasks, and in particular, to computation-in-memory architectures. One aspect provides a circuit for in-memory computation. The circuit generally includes: a plurality of memory cells on each of multiple columns of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple columns are on different word-lines of the memory; multiple addition circuits, each coupled to a respective one of the multiple columns; a first adder circuit coupled to outputs of at least two of the multiple addition circuits; and an accumulator coupled to an output of the first adder circuit.