Patent classifications
G06F7/5443
Power efficient near memory analog multiply-and-accumulate (MAC)
A near memory system is provided for the calculation of a layer in a machine learning application. The near memory system includes an array of memory cells for storing an array of filter weights. A multiply-and-accumulate circuit couples to the columns of the array to perform the calculation of the layer.
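As a rough illustration of what such a column-wise MAC computes, the sketch below models the cell array as a weight matrix and the row inputs as an activation vector; the array sizes, variable names, and integer value ranges are all assumptions for the example, not details from the patent.

```python
import numpy as np

# Hypothetical dimensions: 4 input rows, 3 filter columns.
rng = np.random.default_rng(0)
weights = rng.integers(-8, 8, size=(4, 3))   # filter weights stored in the cell array
activations = rng.integers(0, 16, size=4)    # layer inputs driven onto the rows

# Each column's MAC circuit accumulates input * weight down the column,
# producing one output per column of the array.
layer_output = np.array([activations @ weights[:, col] for col in range(weights.shape[1])])

# The per-column MACs together amount to one matrix-vector product.
assert np.array_equal(layer_output, activations @ weights)
print(layer_output)
```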
Resistive matrix computation circuit
A resistive matrix computation circuit and methods for using the same are disclosed. In one embodiment, a resistive matrix computation circuit includes: a memory configured to store a first set of operands and a second set of operands, where both sets of operands are programmable by a controller and are cross-multiplied to form a plurality of product pairs; a plurality of resistive multiplier circuits configured to generate a plurality of output voltages according to the plurality of product pairs, the controller being configured to control the resistive multiplier circuits to perform multiplications using the first and second sets of operands; and an aggregator circuit configured to aggregate the plurality of output voltages from the resistive multiplier circuits, where the aggregated output voltages represent an aggregated value of the plurality of product pairs.
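The dataflow (cross-multiply, convert each product to a voltage, aggregate) can be sketched numerically as below; the gain constant k, the operand values, and the ideal linear multiplier behavior are all assumptions, since the abstract does not specify the resistive multipliers' transfer function.

```python
# Idealized model: each resistive multiplier outputs a voltage
# proportional to the product of its two operands (gain k is assumed).
k = 0.01  # volts per unit product (made-up constant)

first_set = [3, 5, 2]    # programmed by the controller
second_set = [4, 1, 6]

# Cross-multiply to form product pairs; one resistive multiplier per pair.
output_voltages = [k * a * b for a in first_set for b in second_set]

# Aggregator circuit sums the multiplier outputs into one aggregated value.
aggregated = sum(output_voltages)
print(f"{len(output_voltages)} product pairs, aggregate = {aggregated:.2f} V")
```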
Scalable matrix computation circuit
A scalable matrix computation circuit and methods for using the same are disclosed. In one embodiment, a matrix computation circuit includes: a plurality of first operand memories configured to store a first set of input operands of the matrix computation circuit; a plurality of second operand memories configured to store a second set of input operands of the matrix computation circuit, where the first and second sets of input operands are programmable by a controller; a plurality of multiplier circuits arranged in a plurality of rows and a plurality of columns, where each row receives a corresponding operand from the first set of operands, each column receives a corresponding operand from the second set of operands, and the operand received by each row is used multiple times by the multiplier circuits in that row to perform multiplications controlled by the controller; and a plurality of aggregator circuits configured to store charges produced by the plurality of multiplier circuits.
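A minimal sketch of the row/column operand reuse follows; the grid dimensions, operand values, and per-column aggregation are illustrative assumptions.

```python
row_operands = [2, 7, 1]      # first operand memories (one per row)
col_operands = [5, 3, 4, 6]   # second operand memories (one per column)

# Each row operand is reused by every multiplier in its row, and each
# column operand by every multiplier in its column.
products = [[r * c for c in col_operands] for r in row_operands]

# Aggregators store the charge produced by the multipliers
# (here modeled as a sum per column).
column_totals = [sum(products[i][j] for i in range(len(row_operands)))
                 for j in range(len(col_operands))]
print(column_totals)
```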
Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation, and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating-point operations, the floating-point unit being configured to process an instruction using a bfloat16 (BF16) format, with a multiplier to multiply second and third source operands while an accumulator adds a first source operand to the output of the multiplier.
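To make the operand roles concrete, here is a software sketch of a BF16 multiply with wider accumulation; the truncation-based to_bf16 helper is a simplification (hardware conversions typically round to nearest rather than truncate), and all names are invented for the example.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 to bfloat16 precision (keep the top 16 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def dp_accumulate(src0: float, src1: float, src2: float) -> float:
    # dst = src0 + bf16(src1) * bf16(src2): the multiplier takes the
    # second and third source operands, the accumulator adds the first.
    return src0 + to_bf16(src1) * to_bf16(src2)

print(dp_accumulate(1.0, 1.2345678, 2.3456789))
```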
SYSTEMS, METHODS, AND APPARATUSES FOR TILE LOAD
Embodiments detailed herein relate to matrix operations, in particular the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand.
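A plain-software sketch of what a strided tile load does is shown below; the tile_load helper, its parameters, and the flat list standing in for memory are illustrative assumptions, not the instruction's actual encoding or semantics.

```python
def tile_load(memory, base, stride, rows, cols):
    """Load rows*cols elements from a flat buffer into a tile,
    taking one group of cols contiguous elements per strided row."""
    tile = []
    for r in range(rows):
        start = base + r * stride           # each row is one strided group
        tile.append(memory[start:start + cols])
    return tile

memory = list(range(100))                   # stand-in for system memory
print(tile_load(memory, base=10, stride=16, rows=3, cols=4))
# [[10, 11, 12, 13], [26, 27, 28, 29], [42, 43, 44, 45]]
```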
NEURAL NETWORK DATA COMPUTATION USING MIXED-PRECISION
Techniques for mixed-precision data manipulation for neural network data computation are disclosed. A first left group comprising eight bytes of data and a first right group comprising eight bytes of data are obtained for computation using a processor. A second left group comprising eight bytes of data and a second right group comprising eight bytes of data are obtained. A sum of products is performed between the first left and right groups and the second left and right groups, on bytes of 8-bit integer data. A first result is based on a summation of eight values that are products of the first left group's eight bytes and the second left group's eight bytes. A second result is based on a summation of eight values that are products of the first left group's eight bytes and the second right group's eight bytes. Results are output.
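The two sums of products described above can be sketched as follows; the random int8 values are placeholders, and only the two results spelled out in the abstract are computed.

```python
import numpy as np

rng = np.random.default_rng(1)
# Four groups of eight signed 8-bit integers (values are illustrative).
first_left   = rng.integers(-128, 128, size=8, dtype=np.int8)
first_right  = rng.integers(-128, 128, size=8, dtype=np.int8)
second_left  = rng.integers(-128, 128, size=8, dtype=np.int8)
second_right = rng.integers(-128, 128, size=8, dtype=np.int8)

# Sum of products on int8 data, accumulated in a wider type to avoid
# overflow; first_right is obtained but not used by these two results.
first_result  = int(np.dot(first_left.astype(np.int32), second_left.astype(np.int32)))
second_result = int(np.dot(first_left.astype(np.int32), second_right.astype(np.int32)))
print(first_result, second_result)
```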
METHOD AND DEVICE FOR ADDITIVE CODING OF SIGNALS IN ORDER TO IMPLEMENT DIGITAL MAC OPERATIONS WITH DYNAMIC PRECISION
A computer-implemented method is provided for coding a digital signal quantized on a given number N_d of bits and intended to be processed by a digital computing system, the signal being coded on a predetermined number N_p of bits which is strictly less than N_d. The method includes the steps of: receiving a digital signal composed of a plurality of samples; decomposing each sample into a sum of k maximum values equal to 2^(N_p) − 1 and a residual value, with k being a positive or zero integer; and successively transmitting the values obtained after decomposition to an integration unit for carrying out a MAC operation between the sample and a weighting coefficient.
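A small sketch of the decomposition and the streamed MAC, under the assumption of unsigned samples, is given below; the function names and the example values of N_p, the sample, and the weight are illustrative.

```python
def decompose(sample: int, n_p: int):
    """Split a sample into k copies of the N_p-bit maximum plus a residual."""
    max_val = (1 << n_p) - 1          # 2^(N_p) - 1
    k, residual = divmod(sample, max_val)
    return [max_val] * k + [residual]

def mac(parts, weight, acc=0):
    # The integration unit accumulates part * weight for each transmitted part,
    # so the total equals sample * weight.
    for p in parts:
        acc += p * weight
    return acc

sample, weight, n_p = 200, 3, 4       # N_p = 4 bits -> parts of at most 15
parts = decompose(sample, n_p)        # 13 copies of 15, then residual 5
assert mac(parts, weight) == sample * weight
print(parts, mac(parts, weight))
```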
MEMORY ARRAY STRUCTURE
The present invention discloses a memory array structure comprising an array of multiple memory devices arranged in rows and columns. Each row is provided with a row leading-out wire and each column with a column leading-out wire, and the memory devices are positioned at the intersection points of the row and column leading-out wires. The first terminal of each memory device is connected to the row leading-out wire of its row, and the second terminal of each memory device is connected to a first terminal of a switch in the same column, whose second terminal is connected to the column leading-out wire of that column. Each row is provided with one or more such switches, and the first terminal of each switch is connected to one up to all of the second terminals of the memory devices in the same column. The advantage of the present invention is that the analog output currents corresponding to the input signals of different specified rows can be obtained simultaneously for each column, according to that column's multiply-accumulate requirements, so that multiply-accumulate operations can be performed on different input signals at different scales, which greatly improves the operating speed and utilization efficiency of the array.
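The simultaneous column readout can be approximated in software as below, assuming each cell acts as a conductance and each column current is the sum over the switch-selected rows; all values and names are illustrative.

```python
V = [0.3, 0.0, 0.5, 0.2]                         # row input voltages
G = [[1e-6, 2e-6], [4e-6, 1e-6],
     [2e-6, 3e-6], [5e-6, 2e-6]]                 # cell conductances (siemens)
sel = [[1, 0], [1, 1], [0, 1], [1, 1]]           # switch states per column

# Each column current is the multiply-accumulate of its selected rows:
# I[j] = sum_i sel[i][j] * G[i][j] * V[i], available for all columns at once.
I = [sum(sel[i][j] * G[i][j] * V[i] for i in range(len(V)))
     for j in range(len(G[0]))]
print([f"{i_col * 1e6:.2f} uA" for i_col in I])
```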
ELEMENTS FOR IN-MEMORY COMPUTE
A memory array is arranged in multiple columns and rows. Computation circuits each calculate a computation value from the cell values in a corresponding column. A column multiplexer cycles through multiple data lines, each of which corresponds to a computation circuit. Cluster cycle management circuitry determines a number of multiplexer cycles based on the number of columns storing data of a compute cluster. A sensing circuit obtains the computation values from the computation circuits via the column multiplexer as it cycles through the data lines, and combines the obtained computation values over the determined number of multiplexer cycles. A first clock may initiate the multiplexer to cycle through its data lines for the determined number of multiplexer cycles, and a second clock may initiate each individual cycle. The multiplexer or additional circuitry may be utilized to modify the order in which data is written to the columns.
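A behavioral sketch of the multiplexer cycling and combining follows; the mux width, cluster size, and computation values are assumed for the example, and the two nested loops only mimic the roles of the two clocks.

```python
computation_values = [5, -2, 7, 1, 9, 0, 3, 4]   # one per column circuit
mux_width = 4                                    # data lines on the column mux
n_cluster_cols = 8                               # columns holding the cluster

# Cluster cycle management: multiplexer cycles needed to visit every
# cluster column (ceiling division -> 2 here).
n_cycles = -(-n_cluster_cols // mux_width)

total = 0
for cycle in range(n_cycles):                    # first clock: start each pass
    for line in range(mux_width):                # second clock: each mux step
        col = cycle * mux_width + line
        if col < n_cluster_cols:
            total += computation_values[col]     # sensing circuit combines
print(total)                                     # 27
```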
HIGH THROUGHPUT MATRIX PROCESSOR WITH SUPPORT FOR CONCURRENTLY PROCESSING MULTIPLE MATRICES
A system comprises a data input vector unit, a weight input vector unit, and a plurality of calculation units. The data input vector unit is configured to concurrently receive elements of different rows of a first and second data matrix. The weight input vector unit is configured to receive a combined weight vector and at least in part concurrently provide obtained weight elements of a first and second weight matrix to a corresponding first and second group of calculation units. At least one calculation unit of each group of the first and second group of calculation units is configured to multiply elements from the data input vector unit with corresponding elements of the corresponding weight matrix from the weight input vector unit and sum together multiplication results of the corresponding calculation unit to at least in part determine a corresponding element in a first or second convolution result matrix.
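One way to picture the combined weight vector and the two groups of calculation units is the sketch below; the matrix shapes and the split of the combined vector are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
data_a, data_b = rng.integers(0, 5, (2, 3, 4))   # rows of two data matrices
w_a, w_b = rng.integers(-2, 3, (2, 4))           # vectors from two weight matrices

combined_weights = np.concatenate([w_a, w_b])    # one combined weight vector
group_a, group_b = combined_weights[:4], combined_weights[4:]

# Each group of calculation units multiplies its data rows with its own
# weight elements and sums the products, yielding elements of two
# convolution result matrices concurrently.
result_a = data_a @ group_a
result_b = data_b @ group_b
print(result_a, result_b)
```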