Patent classifications
G06F9/3893
ACCESSING TENSORS
Apparatuses, systems, and techniques to access a multidimensional tensor from memory while minimizing tile quantization are disclosed. In at least one embodiment, a processor includes one or more circuits to cause a first one or more portions of at least one tensor to be accessed from a memory using a first technique and a second one or more portions of the at least one tensor to be accessed from the memory using a second technique based, at least in part, on an input to combine the first and second techniques.
Efficient multiply-accumulation based on sparse matrix
Disclosed herein are techniques for improving the computational efficiency of a multiply-accumulate (MAC) operation. In one aspect, a computing device identifies a first vector including non-zero elements of a base matrix, and a second vector indicating a location of each of the non-zero elements of the base matrix. In one aspect, the device determines a first element and a second element of the first vector. In one aspect, the device determines a third element and a fourth element of the second vector. In one aspect, the device determines i) a fifth element of an input vector according to the third element of the second vector, and ii) a sixth element of the input vector according to the fourth element of the second vector. In one aspect, the device causes MAC circuitry to perform a dot product according to the first element, the second element, the fifth element, and the sixth element.
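The gather-then-multiply flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sparse_mac`, the `lanes` parameter (two elements per step, matching the "first and second element" wording), and the sample data are hypothetical and do not come from the patent.

```python
def sparse_mac(values, locations, input_vector, lanes=2):
    """Dot product of one sparse matrix row with a dense input vector.

    values    -- non-zero elements of the row (the "first vector")
    locations -- column index of each non-zero element (the "second vector")
    lanes     -- assumed number of elements fed to the MAC per step
    """
    acc = 0
    for start in range(0, len(values), lanes):
        weights = values[start:start + lanes]
        indices = locations[start:start + lanes]
        # Gather only the input elements that line up with non-zero weights.
        gathered = [input_vector[i] for i in indices]
        # One MAC step: multiply the gathered inputs by the weights, accumulate.
        acc += sum(w * x for w, x in zip(weights, gathered))
    return acc


# Hypothetical sample data: one row of a base matrix and a dense input.
row = [0, 3, 0, 0, 5, 0, 2, 0]
values = [v for v in row if v != 0]
locations = [i for i, v in enumerate(row) if v != 0]
x = [1, 2, 3, 4, 5, 6, 7, 8]
result = sparse_mac(values, locations, x)
```

Only three multiplications are performed here instead of eight, and the result matches the dense dot product of `row` and `x`.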
RANDOM SPARSITY HANDLING IN A SYSTOLIC ARRAY
Matrix multiply units can take advantage of input sparsity by zero gating ALUs, which reduces power consumption but does not increase compute throughput. To improve compute throughput from sparsity, processing resources in a matrix accelerator can skip computations that involve a zero in the input or output. If zeros in the input can be skipped, the processing units can focus calculations on generating meaningful non-zero output.
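The difference between zero gating and zero skipping can be illustrated with a toy cycle count. This is a simplified software model, not the systolic array described in the patent; the function names and the one-cycle-per-slot cost model are assumptions made for the example.

```python
def dot_zero_gated(a, b):
    """Zero gating: every operand pair still occupies a cycle,
    but the ALU is only active when both operands are non-zero."""
    acc, cycles = 0, 0
    for x, y in zip(a, b):
        cycles += 1  # the slot is consumed even when the ALU is gated off
        if x != 0 and y != 0:
            acc += x * y
    return acc, cycles


def dot_zero_skipped(a, b):
    """Zero skipping: pairs with a zero operand are removed up front,
    so only useful multiplications consume cycles."""
    useful = [(x, y) for x, y in zip(a, b) if x != 0 and y != 0]
    acc = sum(x * y for x, y in useful)
    return acc, len(useful)


# Hypothetical sparse operands.
a = [0, 2, 0, 0, 3, 0, 0, 1]
b = [5, 4, 0, 7, 0, 6, 1, 2]
gated = dot_zero_gated(a, b)
skipped = dot_zero_skipped(a, b)
```

Both functions return the same dot product, but in this model the gated version spends eight cycles while the skipping version spends two.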
Time Domain Unrolling Sparse Matrix Multiplication System and Method
A system and method for multiplying matrices are provided. The system includes a processor coupled to a memory and a matrix multiply accelerator (MMA) coupled to the processor. The MMA is configured to multiply, based on a bitmap, a compressed first matrix and a second matrix to generate an output matrix, including, for each element i,j of the output matrix, calculating a dot product of an i-th row of the compressed first matrix and a j-th column of the second matrix based on the bitmap. Alternatively, the MMA is configured to multiply, based on the bitmap, the second matrix and the compressed first matrix to generate the output matrix, including, for each element i,j of the output matrix, calculating a dot product of an i-th row of the second matrix and a j-th column of the compressed first matrix based on the bitmap.
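A minimal Python sketch of the first case (compressed matrix on the left) follows. It assumes a row-wise compression in which each row keeps only its non-zero values and the bitmap records where they originally sat; the actual compression layout in the patent may differ, and the names `compress` and `bitmap_matmul` are hypothetical.

```python
def compress(matrix):
    """Assumed row-wise compression: pack the non-zero values of each row
    and record their original positions in a bitmap of the same shape."""
    bitmap = [[1 if v != 0 else 0 for v in row] for row in matrix]
    packed = [[v for v in row if v != 0] for row in matrix]
    return packed, bitmap


def bitmap_matmul(packed, bitmap, second):
    """Multiply the compressed first matrix by a dense second matrix.
    The bitmap selects which rows of `second` pair with each packed value."""
    n_cols = len(second[0])
    out = []
    for packed_row, bits in zip(packed, bitmap):
        # Original column positions of the packed values in this row.
        ks = [k for k, bit in enumerate(bits) if bit]
        out.append([
            sum(v * second[k][j] for v, k in zip(packed_row, ks))
            for j in range(n_cols)
        ])
    return out


# Hypothetical sample matrices.
first = [[1, 0, 2],
         [0, 0, 3]]
second = [[1, 2],
          [3, 4],
          [5, 6]]
packed, bitmap = compress(first)
product = bitmap_matmul(packed, bitmap, second)
```

Each output element is a dot product over only the positions flagged in the bitmap, so zero entries of the first matrix never reach the multiplier.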
Functional unit having tree structure to support vector sorting algorithm and other algorithms
An apparatus is described having a functional unit of an instruction execution pipeline. The functional unit has a plurality of compare-and-exchange circuits coupled to network circuitry to implement a vector sorting tree for a vector sorting instruction. Each of the compare-and-exchange circuits has a respective comparison circuit that compares a pair of inputs. Each of the compare-and-exchange circuits has a same sided first output for presenting a higher of the two inputs and a same sided second output for presenting a lower of the two inputs, said comparison circuit to also support said functional unit's execution of a prefix min and/or prefix add instruction.
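The compare-and-exchange primitive can be modeled in Python as shown below. The wiring used here is an odd-even transposition network, chosen only because it is short and easy to verify; it is an assumption and not the sorting tree of the patent. The `prefix_min` helper is likewise a hypothetical illustration of reusing the same comparison for a prefix-min operation.

```python
def compare_exchange(a, b):
    """Model of one compare-and-exchange circuit: returns (higher, lower)."""
    return (a, b) if a >= b else (b, a)


def network_sort(vec):
    """Sort ascending with an odd-even transposition network
    (an assumed wiring, swapped in for the patent's sorting tree)."""
    v = list(vec)
    n = len(v)
    for stage in range(n):
        # Even stages compare pairs (0,1), (2,3), ...; odd stages (1,2), (3,4), ...
        for i in range(stage % 2, n - 1, 2):
            higher, lower = compare_exchange(v[i], v[i + 1])
            v[i], v[i + 1] = lower, higher
    return v


def prefix_min(vec):
    """Running minimum built from the same comparison primitive."""
    out = []
    running = None
    for x in vec:
        running = x if running is None else compare_exchange(running, x)[1]
        out.append(running)
    return out


sorted_vec = network_sort([5, 1, 4, 2, 3])
running_min = prefix_min([3, 1, 2, 0])
```

Because every stage applies a fixed pattern of comparisons regardless of the data, the same structure maps naturally onto parallel hardware comparators.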
Scalable sparse matrix multiply acceleration using systolic arrays with feedback inputs
Described herein is an accelerator device including a host interface, a fabric interconnect coupled with the host interface, and one or more hardware tiles coupled with the fabric interconnect, the one or more hardware tiles including sparse matrix multiply acceleration hardware including a systolic array with feedback inputs.
Enhanced Multiply Accumulate Device For Neural Networks
A device for performing multiply/accumulate operations processes values that are stored in first and second buffers and have a first width, using a computational pipeline with a second width, such as half the first width. A sequencer processes combinations of portions (high-high, low-low, high-low, low-high) of the values in the first and second buffers using a multiply/accumulate circuit and adds the accumulated result of each combination of portions to a group accumulator. Adding to the group accumulator may be preceded by left shifting the accumulated result (by the first width for the high-high combination and by the second width for the low-high and high-low combinations).
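The portion-wise scheme follows from the identity (a_hi·2^w + a_lo)(b_hi·2^w + b_lo) = a_hi·b_hi·2^(2w) + (a_hi·b_lo + a_lo·b_hi)·2^w + a_lo·b_lo, where w is the second width. The sketch below assumes unsigned 16-bit values and an 8-bit pipeline; the widths, names, and sample data are illustrative assumptions, not details from the patent.

```python
FIRST_WIDTH = 16   # assumed width of the buffered values
SECOND_WIDTH = 8   # assumed width of the narrow pipeline (half the first width)
MASK = (1 << SECOND_WIDTH) - 1


def narrow_mac(xs, ys):
    """Stand-in for the narrow multiply/accumulate circuit."""
    return sum(x * y for x, y in zip(xs, ys))


def wide_dot(a, b):
    """Dot product of unsigned FIRST_WIDTH-bit values using only
    SECOND_WIDTH-bit operands in the multiply/accumulate step."""
    a_hi = [v >> SECOND_WIDTH for v in a]
    a_lo = [v & MASK for v in a]
    b_hi = [v >> SECOND_WIDTH for v in b]
    b_lo = [v & MASK for v in b]

    group = 0
    group += narrow_mac(a_hi, b_hi) << FIRST_WIDTH   # high-high, shifted by first width
    group += narrow_mac(a_hi, b_lo) << SECOND_WIDTH  # high-low, shifted by second width
    group += narrow_mac(a_lo, b_hi) << SECOND_WIDTH  # low-high, shifted by second width
    group += narrow_mac(a_lo, b_lo)                  # low-low, no shift
    return group


# Hypothetical sample buffers.
a = [0x1234, 0xFFFF, 0x0001]
b = [0x00FF, 0xABCD, 0x8000]
total = wide_dot(a, b)
```

The four partial accumulations recombine to exactly the full-width dot product, which is why a half-width pipeline can serve full-width operands at the cost of four passes.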
Multiplier-accumulator circuitry having processing pipelines and methods of operating same
An integrated circuit including memory to store image data and filter weights, and a plurality of multiplier-accumulator execution pipelines, each multiplier-accumulator execution pipeline coupled to the memory to receive (i) image data and (ii) filter weights, wherein each multiplier-accumulator execution pipeline processes the image data, using associated filter weights, via a plurality of multiply and accumulate operations. In one embodiment, the multiplier-accumulator circuitry of each multiplier-accumulator execution pipeline, in operation, receives a different set of image data, each set including a plurality of image data, and, using filter weights associated with the received set of image data, processes the set of image data associated therewith, via performing a plurality of multiply and accumulate operations concurrently with the multiplier-accumulator circuitry of the other multiplier-accumulator execution pipelines, to generate output data. Each set of image data includes all of the image that correlates to the output data generated therefrom.
COMPUTE OPTIMIZATIONS FOR NEURAL NETWORKS
One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including a multi-bit input value and a ternary weight associated with a neural network and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the ternary weight to generate an intermediate product and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.
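With a ternary weight, the multiplication step reduces to passing the input through, negating it, or producing zero. A minimal Python sketch follows, assuming the weight is encoded as -1, 0, or +1; the encoding and the names `ternary_mac` and `ternary_dot` are assumptions made for illustration.

```python
def ternary_mac(accumulator, value, weight):
    """One multiply-accumulate step with a ternary weight (-1, 0, or +1)."""
    if weight not in (-1, 0, 1):
        raise ValueError("weight must be -1, 0, or +1")
    # The "multiplier" only needs to select among value, -value, and 0.
    if weight == 1:
        product = value
    elif weight == -1:
        product = -value
    else:
        product = 0
    # The adder updates the value held in the accumulator register.
    return accumulator + product


def ternary_dot(values, weights):
    """Accumulate a sequence of multi-bit inputs against ternary weights."""
    acc = 0
    for v, w in zip(values, weights):
        acc = ternary_mac(acc, v, w)
    return acc


# Hypothetical inputs and weights.
inputs = [7, -3, 5, 2]
weights = [1, -1, 0, -1]
output = ternary_dot(inputs, weights)
```

Because no general multiplier is exercised, this kind of operation is attractive for low-precision neural network inference.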
Modulo Operation Unit
The present disclosure advantageously provides a modulo operation unit that includes a first input configured to receive operand data, a second input configured to receive modulus data, an initial modulo stage, a sequence of intermediate modulo stages, and a final modulo stage.