G06F7/5318

HARDWARE ACCELERATOR FOR PERFORMING COMPUTATIONS OF DEEP NEURAL NETWORK AND ELECTRONIC DEVICE INCLUDING THE SAME

A hardware accelerator includes a processing core including a plurality of multipliers configured to perform one-dimensional (1D) sub-word parallelism between a sign and a mantissa of a first tensor and a sign and a mantissa of a second tensor, a first processing device configured to operate in a two-dimensional (2D) operation mode in which results of computation by the plurality of multipliers are output, and a second processing device configured to operate in a three-dimensional (3D) operation mode in which results of computation by the plurality of multipliers are accumulated in a channel direction and then a result of accumulating the results of computation is output.
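As an illustrative sketch only (the function name and the nested-list tensor layout are assumptions, not from the abstract), the two output modes can be modeled as element-wise products that are either emitted directly (2D mode) or accumulated in the channel direction first (3D mode):

```python
def processing_core(a, b, mode="2d"):
    """Hypothetical model of the two operation modes in the abstract.

    a, b: nested lists shaped [channel][row][col]; the element-wise
    products stand in for the results of the multiplier array.
    """
    channels = len(a)
    rows, cols = len(a[0]), len(a[0][0])
    products = [[[a[c][r][k] * b[c][r][k] for k in range(cols)]
                 for r in range(rows)] for c in range(channels)]
    if mode == "2d":      # 2D mode: per-channel products output directly
        return products
    # 3D mode: accumulate the products in the channel direction, then output
    return [[sum(products[c][r][k] for c in range(channels))
             for k in range(cols)] for r in range(rows)]
```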

MULTI-CHIP ELECTRO-PHOTONIC NETWORK
20220405566 · 2022-12-22 ·

Various embodiments provide for computational systems including multiple circuit packages, each circuit package comprising an electronic integrated circuit having multiple processing elements and intra-chip bidirectional photonic channels connecting the processing elements into an electro-photonic network, with inter-chip bidirectional photonic channels connecting the processing elements across the electro-photonic networks of the multiple circuit packages into a larger electro-photonic network.

SUPPORT FOR DIFFERENT MATRIX MULTIPLICATIONS BY SELECTING ADDER TREE INTERMEDIATE RESULTS

A first group of elements is element-wise multiplied with a second group of elements using a plurality of multipliers belonging to a matrix multiplication hardware unit. Results of the plurality of multipliers are added together using a hierarchical tree of adders belonging to the matrix multiplication hardware unit and a final result of the hierarchical tree of adders or any of a plurality of intermediate results of the hierarchical tree of adders is selectively provided for use in determining an output result matrix. A control unit is used to instruct the matrix multiplication hardware unit to perform a plurality of different matrix multiplications in parallel by using a combined matrix that includes elements of a plurality of different operand matrices and utilize one or more selected ones of the intermediate results of the hierarchical tree of adders for use in determining the output result matrix that includes different groups of elements representing different multiplication results corresponding to different ones of the different operand matrices.
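A minimal sketch of the key idea (function name assumed, not from the abstract): an adder tree that retains every level of intermediate sums, so that tapping a sub-root level yields the partial dot products of operands packed side by side into one combined row.

```python
def adder_tree(products):
    """Hierarchical adder tree that keeps all intermediate results.

    Returns (final_result, levels), where levels[i] holds the sums at
    tree depth i. Selecting a non-root level yields independent partial
    sums, which is how one combined row can serve several smaller
    matrix multiplications in parallel. Assumes a power-of-two width.
    """
    levels = [list(products)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels[-1][0], levels
```

For example, packing two 4-element operand rows into one 8-wide tree, the root gives the sum of both dot products, while the level-2 intermediate results give each dot product separately.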

ARTIFICIAL INTELLIGENCE ACCELERATORS
20220374690 · 2022-11-24 ·

An artificial intelligence (AI) accelerator includes memory circuits configured to output weight data and vector data, a multiplication circuit/adder tree performing a multiplying/adding calculation on the weight data and the vector data to generate multiplication/addition result data, a first accumulator synchronized with an odd clock signal to perform an accumulative adding calculation on odd-numbered multiplication/addition result data of the multiplication/addition result data and a first latched data, and a second accumulator synchronized with an even clock signal to perform an accumulative adding calculation on even-numbered multiplication/addition result data of the multiplication/addition result data and a second latched data.
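A behavioral sketch of the interleaving (the function name and cycle counting are assumptions): odd-cycle results feed one accumulator and even-cycle results the other, so each accumulator's add-latch loop runs at half the MAC output rate.

```python
def dual_accumulate(mac_results):
    """Hypothetical model of the odd/even accumulator pair.

    Results arriving on odd clock cycles update the first latched
    value, even-cycle results the second; the grand total is the sum
    of the two accumulators.
    """
    acc_odd = acc_even = 0           # the two latched data values
    for cycle, r in enumerate(mac_results, start=1):
        if cycle % 2:                # odd clock: first accumulator
            acc_odd += r
        else:                        # even clock: second accumulator
            acc_even += r
    return acc_odd, acc_even
```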

COMPUTING APPARATUS AND METHOD FOR VECTOR INNER PRODUCT, AND INTEGRATED CIRCUIT CHIP
20220366006 · 2022-11-17 ·

The present disclosure relates to a computing apparatus, a method, and an integrated circuit chip for a vector inner product, where the computing apparatus may be included in a combined processing apparatus. The combined processing apparatus may further include a general interconnection interface and another processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus, which is connected to both the computing apparatus and the other processing apparatus and stores data for both.

COMPRESSED WALLACE TREES IN FMA CIRCUITS

An embodiment of an apparatus comprises one or more fractional width fused multiply-accumulate (FMA) circuits configured as a shared Wallace tree, and circuitry coupled to the one or more fractional width FMA circuits to provide one or more fractional width FMA operations through the one or more fractional width FMA circuits. Other embodiments are disclosed and claimed.
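As a word-level stand-in for the bit-level structure (names and framing are assumptions, not the claimed circuit), a Wallace-style tree repeatedly applies 3:2 carry-save compressors until only two addends remain, followed by one carry-propagate addition:

```python
def csa(x, y, z):
    """3:2 carry-save adder: three addends in, sum word and carry word out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def wallace_reduce(operands):
    """Reduce many addends with 3:2 compressors until two remain,
    then finish with a single carry-propagate add."""
    ops = list(operands)
    while len(ops) > 2:
        x, y, z = ops.pop(), ops.pop(), ops.pop()
        ops.extend(csa(x, y, z))
    return sum(ops)
```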

DADDA ARCHITECTURE THAT SCALES WITH INCREASING OPERAND SIZE
20220357921 · 2022-11-10 ·

Aspects of the invention include physical design-optimal Dadda architectures that scale with increasing operand size. Partial product arrays can be generated for two n-bit operands and columns in the partial product arrays can be shifted to a first row. The number of partial products in each column can be iteratively reduced across one or more stages until each column has at most two partial products. At each stage a maximum column height is determined and each column having a height greater than the maximum column height is reduced using half-adders and full-adders. Result bits of the half-adders and the full-adders are placed at the bottom of the current column and carry bits of the half-adders and the full-adders are placed at the bottom of the next column.
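The reduction described above can be sketched in software (a behavioral model with assumed names, not the physical-design optimization itself): partial-product columns are reduced stage by stage under the Dadda maximum-height schedule 2, 3, 4, 6, 9, ..., with result bits staying in the current column and carry bits moving to the next.

```python
def dadda_multiply(a, b, n):
    """Word-level sketch of Dadda reduction for two n-bit operands."""
    # Partial product array: cols[j] holds the bits of weight 2**j
    cols = [[] for _ in range(2 * n + 1)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append((a >> i) & (b >> j) & 1)
    heights = [2]                     # Dadda maximum-column-height schedule
    while heights[-1] < n:
        heights.append(heights[-1] * 3 // 2)
    for h in reversed(heights):       # reduce toward height h at each stage
        for j in range(2 * n):
            while len(cols[j]) > h:
                if len(cols[j]) - h >= 2:   # full adder: 3 bits -> sum + carry
                    x, y, z = cols[j].pop(), cols[j].pop(), cols[j].pop()
                    s, c = x ^ y ^ z, (x & y) | (x & z) | (y & z)
                else:                       # half adder: 2 bits -> sum + carry
                    x, y = cols[j].pop(), cols[j].pop()
                    s, c = x ^ y, x & y
                cols[j].append(s)           # result bit: bottom of this column
                cols[j + 1].append(c)       # carry bit: bottom of next column
    # Remaining rows (at most two bits per column) go to a final adder
    return sum(bit << j for j, col in enumerate(cols) for bit in col)
```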

FOLDING COLUMN ADDER ARCHITECTURE FOR DIGITAL COMPUTE IN MEMORY
20230031841 · 2023-02-02 ·

Certain aspects provide an apparatus for performing machine learning tasks, and in particular, to computation-in-memory architectures. One aspect provides a circuit for in-memory computation. The circuit generally includes: a plurality of memory cells on each of multiple columns of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple columns are on different word-lines of the memory; multiple addition circuits, each coupled to a respective one of the multiple columns; a first adder circuit coupled to outputs of at least two of the multiple addition circuits; and an accumulator coupled to an output of the first adder circuit.
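A functional sketch of the data flow (the bit-serial activation scheme and unsigned arithmetic are assumptions beyond the abstract): each column stores one bit-plane of the weights, a per-column addition circuit counts bitwise ANDs, an adder combines the column counts with their bit weights, and an accumulator folds the activation cycles together.

```python
def cim_dot(weights, activations, w_bits=4, a_bits=4):
    """Hypothetical model of the column-adder compute-in-memory flow.

    Column j of the memory holds bit j of every weight (one word-line
    per weight). Cycle t broadcasts activation bit t to all rows.
    Unsigned values only in this sketch.
    """
    acc = 0
    for t in range(a_bits):                      # bit-serial activation cycles
        act_bits = [(a >> t) & 1 for a in activations]
        cycle_sum = 0
        for j in range(w_bits):                  # one addition circuit per column
            count = sum(ab & ((w >> j) & 1)      # popcount of the bitwise ANDs
                        for ab, w in zip(act_bits, weights))
            cycle_sum += count << j              # first adder: combine columns
        acc += cycle_sum << t                    # accumulator across cycles
    return acc
```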

COMPUTING APPARATUS AND METHOD FOR NEURAL NETWORK OPERATION, INTEGRATED CIRCUIT, AND DEVICE
20220350569 · 2022-11-03 ·

The present disclosure relates to a computing apparatus, a method, an integrated circuit chip, and an integrated circuit device for performing a neural network operation. The computing apparatus may be included in a combined processing apparatus. The combined processing apparatus may further include a general interconnection interface and another processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete calculation operations specified by a user. The combined processing apparatus may further include a storage apparatus, which is connected to both the computing apparatus and the other processing apparatus and stores data for both. Solutions of the present disclosure may be widely applied to various floating-point data computations.

CONFIGURABLE LOGIC CELL
20230077881 · 2023-03-16 ·

Configurable circuits include an input selection region, a computation region, a switching region, and an output region. The input selection region includes a set of input multiplexers and selects and routes input signals. The computation region includes a set of lookup tables, each lookup table being coupled to selected signals from the input selection stage to generate a respective output signal. The switching region includes a set of output multiplexers, each output multiplexer being coupled to output signals from the set of lookup tables to provide circuit outputs responsive to respective output selection signals. The output region includes a domino logic stage, having a set of transistors, coupled to output signals from the set of lookup tables to provide circuit outputs that determine combinations of the signals output by the set of lookup tables.
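The first three regions can be sketched functionally (the domino output stage is omitted, and all names here are assumptions): input multiplexers route selected inputs to each lookup table, the tables produce output bits, and output multiplexers pick among them.

```python
def lut_cell(inputs, input_select, lut_tables, output_select):
    """Hypothetical model of the configurable cell's first three stages.

    input_select[i] lists which cell inputs feed LUT i;
    lut_tables[i] is the 2**k-entry truth table of a k-input LUT;
    output_select picks which LUT output each cell output exposes.
    """
    lut_outs = []
    for sel, table in zip(input_select, lut_tables):
        # Input-selection region: multiplexers route chosen inputs to the LUT
        addr = 0
        for bit, src in enumerate(sel):
            addr |= inputs[src] << bit
        # Computation region: the lookup table produces one output bit
        lut_outs.append(table[addr])
    # Switching region: output multiplexers pick among the LUT outputs
    return [lut_outs[s] for s in output_select]
```

For instance, with two 2-input LUTs programmed as XOR and AND, the cell computes both functions of its routed inputs and exposes them in any order.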