Patent classifications
G06F7/5318
FULL ADDER CELL WITH IMPROVED POWER EFFICIENCY
An adder circuit provides a first operand input and a second operand input to an XNOR cell. The XNOR cell transforms these inputs to a propagate signal that is applied to an OAT cell to produce a carry out signal. A third OAT cell transforms a third operand input and the propagate signal into a sum output signal.
FULL ADDER CELL WITH IMPROVED POWER EFFICIENCY
This disclosure relates to an adder circuit. The adder circuit comprises an operand input and a second operand input to an XNOR cell. The XNOR cell may be configured to provide the operand input and the second operand input to both a NAND gate and a first OAI cell. A second OAI cell may transform the output of the XNOR cell into a carry out signal.
Device and method for accelerating matrix multiply operations as a sum of outer products
A processing device is provided which includes memory and a processor comprising a plurality of processor cores in communication with each other via first and second hierarchical communication links. Each processor core in a group of the processor cores is in communication with each other via the first hierarchical communication links. Each processor core is configured to store, in the memory, one of a plurality of sub-portions of data of a first matrix, store, in the memory, one of a plurality of sub-portions of data of a second matrix, determine an outer product of the sub-portion of data of the first matrix and the sub-portion of data of the second matrix, receive, from another processor core of the group of processor cores, another sub-portion of data of the second matrix and determine another outer product of the sub-portion of data of the first matrix and the other sub-portion of data of the second matrix.
SYSTEMS AND METHODS FOR DATA PLACEMENT FOR IN-MEMORY-COMPUTE
According to one embodiment, a memory module includes: a memory die including a dynamic random access memory (DRAM) banks, each including: an array of DRAM cells arranged in pages; a row buffer to store values of one of the pages; an input/output (IO) module; and an in-memory compute (IMC) module including: an arithmetic logic unit (ALU) to receive operands from the row buffer or the IO module and to compute an output based on the operands and one of a plurality of ALU operations; and a result register to store the output of the ALU; and a controller to: receive, from a host processor, operands and an instruction; determine, based on the instruction, a data layout; supply the operands to the DRAM banks in accordance with the data layout; and control an IMC module to perform one of the ALU operations on the operands in accordance with the instruction.
PROGRAMMABLE MULTIPLY-ADD ARRAY HARDWARE
An integrated circuit including a data architecture including N adders and N multipliers configured to receive operands. The data architecture receives instructions for selecting a data flow between the N multipliers and the N adders of the data architecture. The selected data flow includes the options: (1) a first data flow using the N multipliers and the N adders to provide a multiply-accumulate mode and (2) a second data flow to provide a multiply-reduce mode.
SEMICONDUCTOR DEVICE
According to the embodiments, a semiconductor device includes: an adder configured to generate positive multiple data of the multiplicand which is used for a plurality of the multiplication in plurality and does not include a value of 2.sup.n (n is a positive integer) of the multiplicand; a Wallace tree circuit provided in each of the multiplier circuits and configured to operate a sum of a plurality of partial products by using a plurality of adders; and a selection circuit provided in each of the multiplier circuits and configured to select, according to a plurality of bits selected from the multiplier, data falling in a multiple of one of the multiplicand, data of 2.sup.n of the multiplicand, and the positive multiple data from the adder in order to output as one partial product of the plurality of partial products to the Wallace tree circuit.
Apparatus and Method of Fast Floating-Point Adder Tree for Neural Networks
A computing device to implement fast floating-point adder tree for the neural network applications is disclosed. The fast float-point adder tree comprises a data preparation module, a fast fixed-point Carry-Save Adder (CSA) tree, and a normalization module. The floating-point input data comprises a sign bit, exponent part and fraction part. The data preparation module aligns the fraction part of the input data and prepares the input data for subsequent processing. The fast adder uses a signed fixed-point CSA tree to quickly add a large number of fixed-point data into 2 output values and then uses a normal adder to add the 2 output values into one output value. The fast adder uses for a large number of operands is based on multiple levels of fast adders for a small number of operands. The output from the signed fixed-point Carry-Save Adder tree is converted to a selected floating-point format.
Programmable multiply-add array hardware
An integrated circuit including a data architecture including N adders and N multipliers configured to receive operands. The data architecture receives instructions for selecting a data flow between the N multipliers and the N adders of the data architecture. The selected data flow includes the options: (1) a first data flow using the N multipliers and the N adders to provide a multiply-accumulate mode and (2) a second data flow to provide a multiply-reduce mode.
DEVICE AND METHOD FOR ACCELERATING MATRIX MULTIPLY OPERATIONS AS A SUM OF OUTER PRODUCTS
A processing device is provided which includes memory and a processor comprising a plurality of processor cores in communication with each other via first and second hierarchical communication links. Each processor core in a group of the processor cores is in communication with each other via the first hierarchical communication links. Each processor core is configured to store, in the memory, one of a plurality of sub-portions of data of a first matrix, store, in the memory, one of a plurality of sub-portions of data of a second matrix, determine an outer product of the sub-portion of data of the first matrix and the sub-portion of data of the second matrix, receive, from another processor core of the group of processor cores, another sub-portion of data of the second matrix and determine another outer product of the sub-portion of data of the first matrix and the other sub-portion of data of the second matrix.
Counter-based multiplication using processing in memory
The present disclosure is directed to systems and methods for a memory device such as, for example, a Processing-In-Memory Device that is configured to perform multiplication operations in memory using a popcount operation. A multiplication operation may include a summation of multipliers being multiplied with corresponding multiplicands. The inputs may be arranged in particular configurations within a memory array. Sense amplifiers may be used to perform the popcount by counting active bits along bit lines. One or more registers may accumulate results for performing the multiplication operations.