G06F5/015

Compute optimizations for neural networks using bipolar binary weight

One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including a multi-bit input value and a bipolar binary weight associated with a neural network and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the bipolar binary weight to generate an intermediate product and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.

System, method, and recording medium for mirroring matrices for batched Cholesky decomposition on a graphic processing unit

A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU), include mirroring a second problem matrix of a second problem to a first problem matrix of a first problem as paired matrices and shifting the second problem matrix by N+1 and combining the first problem matrix and the mirrored second problem matrix into one matrix of (N+1)×N, where the first problem shared memory comprises regular intervals, where the second problem shared memory is continuous, and where the GPU performs batched dense Cholesky decomposition with the one matrix from the combining to accelerate the Cholesky decomposition.

Low latency matrix multiply unit
10970362 · 2021-04-06 · ·

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

LOW LATENCY MATRIX MULTIPLY UNIT
20210209193 · 2021-07-08 ·

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

Reconfigurable segmented scalable shifter

Systems and methods that provide reconfigurable shifter configurations supporting multiple instruction, multiple data (MIMD) are described. Shifters implemented according to embodiments support multiple data shifts with respect to an instance of data shifting, wherein multiple individual different data shifts are implemented at a time in parallel. Reconfigurable segmented scalable shifters of embodiments, in addition being reconfigurable for scalability in supporting data shifting with respect to various bit lengths of data, are configured to support data shifting of differing bit lengths in parallel. The data shifters of embodiments implement segmentation for facilitating data shifting with respect to differing bit lengths. Different data shift commands may be provided with respect to each such segment, thereby facilitating multiple data shifts in parallel with respect to various bit lengths of data. Reconfigurable segmented scalable shifter configurations provide for fully reconfigurable data width and shift command of each message of input data.

LOW LATENCY MATRIX MULTIPLY UNIT
20200327186 · 2020-10-15 ·

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

HYBRID MATRIX MULTIPLICATION PIPELINE
20200272687 · 2020-08-27 ·

Systems, apparatuses, and methods implementing a hybrid matrix multiplication pipeline are disclosed. A hybrid matrix multiplication pipeline is able to execute a plurality of different types of instructions in a plurality of different formats by reusing execution circuitry in an efficient manner. For a first type of instruction for source operand elements of a first size, the pipeline uses N multipliers to perform N multiplication operations on N different sets of operands, where N is a positive integer greater than one. For a second type of instruction for source operand elements of a second size, the N multipliers work in combination to perform a single multiplication operation on a single set of operands, where the second size is greater than the first size. The pipeline also shifts element product results in an efficient manner when implementing a dot product operation.

RECONFIGURABLE SEGMENTED SCALABLE SHIFTER
20200249909 · 2020-08-06 ·

Systems and methods that provide reconfigurable shifter configurations supporting multiple instruction, multiple data (MIMD) are described. Shifters implemented according to embodiments support multiple data shifts with respect to an instance of data shifting, wherein multiple individual different data shifts are implemented at a time in parallel. Reconfigurable segmented scalable shifters of embodiments, in addition being reconfigurable for scalability in supporting data shifting with respect to various bit lengths of data, are configured to support data shifting of differing bit lengths in parallel. The data shifters of embodiments implement segmentation for facilitating data shifting with respect to differing bit lengths. Different data shift commands may be provided with respect to each such segment, thereby facilitating multiple data shifts in parallel with respect to various bit lengths of data. Reconfigurable segmented scalable shifter configurations provide for fully reconfigurable data width and shift command of each message of input data.

LOW LATENCY MATRIX MULTIPLY UNIT
20200226202 · 2020-07-16 ·

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. Two chains of weight shift registers per column of the systolic array are in the matrix multiply unit. Each weight shift register is connected to only one chain and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

Low latency matrix multiply unit
10698974 · 2020-06-30 · ·

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.