Patent classifications
G06F7/5306
PARALLEL MATRIX OPERATIONS IN A RECONFIGURABLE COMPUTE FABRIC
A first set of multiple coordinate data structure elements describing non-zero values of an input matrix may be loaded to a compute element. A first set of input vector values having input vector row numbers corresponding to input matrix column numbers of the first set of multiple coordinate data structure elements may also be loaded to the compute element. Multiple parallel processing lanes of the compute element may be used to update multiple partial accumulation values, where each partial accumulation value corresponds to an output vector row and one of the multiple parallel processing lanes. At least a portion of the partial accumulation values corresponding to the first input matrix row may be summed across at least a portion of the parallel processing lanes to generate a first output vector row value.
FOLDING COLUMN ADDER ARCHITECTURE FOR DIGITAL COMPUTE IN MEMORY
Certain aspects provide an apparatus for performing machine learning tasks, and in particular, to computation-in-memory architectures. One aspect provides a circuit for in-memory computation. The circuit generally includes: a plurality of memory cells on each of multiple columns of a memory, the plurality of memory cells being configured to store multiple bits representing weights of a neural network, wherein the plurality of memory cells on each of the multiple columns are on different word-lines of the memory; multiple addition circuits, each coupled to a respective one of the multiple columns; a first adder circuit coupled to outputs of at least two of the multiple addition circuits; and an accumulator coupled to an output of the first adder circuit.
MASKED SHIFTED ADD OPERATION
A computer-implemented method includes receiving, by a processing unit, an instruction to perform a masked shift add operation with a set of operands. A logical AND operation is performed on a first pair of operands from the set of operands to obtain a first intermediate result. The first intermediate result is shifted by a first shift amount that is based on a first operand from the first pair of operands. A logical AND operation is performed on a second pair of operands from the set of operands to obtain a second intermediate result. The second intermediate result is shifted by a second shift amount that is based on a first operand from the second pair of operands. The shifted first intermediate result is added with the shifted second intermediate result. The method further includes outputting, as a result of the masked shift add operation, an output of the adding.
Compute-in-memory (CIM) binary multiplier
Certain aspects provide methods and apparatus for binary computation. An example circuit for such computation generally includes a memory cell having at least one of a bit-line or a complementary bit-line; a computation circuit coupled to a computation input node of the circuit and the bit-line or the complementary bit-line; and an adder coupled to the computation circuit, wherein the computation circuit comprises a first n-type metal-oxide-semiconductor (NMOS) transistor coupled to the memory cell, and a first p-type metal-oxide-semiconductor (PMOS) transistor coupled to the memory cell, drains of the first NMOS and PMOS transistors being coupled to the adder, wherein a source of the first PMOS transistor is coupled to a reference potential node, and wherein a source of the first NMOS transistor is coupled to the computation input node.
SEMICONDUCTOR DEVICE, DATA GENERATION METHODS USED FOR THE SAME, AND METHOD OF CONTROLLING THE SAME
A semiconductor device includes: a local memory outputting a plurality of pieces of weight data in parallel; a plurality of product-sum operation units corresponding to the plurality of pieces of weight data; and a plurality of unit selectors corresponding to the product-sum operations units, supplied with a plurality of pieces of input data in parallel, selecting the one piece of input data from the supplied plurality of pieces of input data according to a plurality of pieces of additional information each indicating a position of the input data to be calculated with the corresponding product-sum arithmetic unit calculator in the pieces of input data, and outputting the selected input data. Each of the plurality of product-sum arithmetic units performs a product-sum operation between the weight data different from each other in the plurality of pieces of weight data and the input data outputted from the corresponding unit selector.
Method and apparatus for performing field programmable gate array packing with continuous carry chains
A method for designing a system on a target device includes identifying a length for a carry chain that is supported by predefined quanta of a resource on the target device. A plurality of logical adders is mapped onto a single logical adder implemented on the carry chain subject to the identified length to increase logic utilization in a design for the system.
Programmable-logic-directed multiplier mapping
Multiplier circuitry includes first combinatorial circuitry configured to perform a combinatorial function, based at least in part on redundant form arithmetic, to generate a first subset of two or more partial products. The two or more partial products are based at least in part on a first input to the multiplier circuitry and a second input to the multiplier circuitry. The multiplier circuitry also includes a carry chain that includes a second combinatorial circuitry configured to generate a second subset of the two or more partial products based at least in part on the first input and the second input. Furthermore, the carry chain includes one or more binary ripple-carry adders configured to generate a product of the multiplier circuitry based at least in part on a sum of the two or more partial products.
Method and apparatus for performing synthesis for field programmable gate array embedded feature placement
A method for designing and configuring a system on a field programmable gate array (FPGA) is disclosed. A portion of the system that is implemented greater than a predetermined number of times is identified. A structural netlist that describes how to implement the portion of the system a plurality of times on the FPGA and that leverages a repetitive nature of implementing the portion is generated. The identifying and generating is performed prior to synthesizing and placing other portions of the system that are not implemented greater than the predetermined number of time. Synthesizing, placing, and routing the other portions of the system on the FPGA is performed in accordance with the structural netlist. The FPGA is configured with a configuration file that includes a design for the system that reflects the synthesizing, placing, and routing, wherein the configuring physically transforms resources on the FPGA to implement the system.
COMPUTE-IN-MEMORY (CIM) BINARY MULTIPLIER
Certain aspects provide methods and apparatus for binary computation. An example circuit for such computation generally includes a memory cell having at least one of a bit-line or a complementary bit-line; a computation circuit coupled to a computation input node of the circuit and the bit-line or the complementary bit-line; and an adder coupled to the computation circuit, wherein the computation circuit comprises a first n-type metal-oxide-semiconductor (NMOS) transistor coupled to the memory cell, and a first p-type metal-oxide-semiconductor (PMOS) transistor coupled to the memory cell, drains of the first NMOS and PMOS transistors being coupled to the adder, wherein a source of the first PMOS transistor is coupled to a reference potential node, and wherein a source of the first NMOS transistor is coupled to the computation input node.
Device and method for accelerating matrix multiply operations as a sum of outer products
A processing device is provided which includes memory and a processor comprising a plurality of processor cores in communication with each other via first and second hierarchical communication links. Each processor core in a group of the processor cores is in communication with each other via the first hierarchical communication links. Each processor core is configured to store, in the memory, one of a plurality of sub-portions of data of a first matrix, store, in the memory, one of a plurality of sub-portions of data of a second matrix, determine an outer product of the sub-portion of data of the first matrix and the sub-portion of data of the second matrix, receive, from another processor core of the group of processor cores, another sub-portion of data of the second matrix and determine another outer product of the sub-portion of data of the first matrix and the other sub-portion of data of the second matrix.