G06F5/015

Low latency matrix multiply unit
10698976 · 2020-06-30

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply unit includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
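The effect of the two loading directions can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuit: `load_weights` and `cell_products` are hypothetical names, the grid is square, weight vectors enter one per step, and each cell in column c is assumed to see element c of the vector data input.

```python
def load_weights(weights, transposed):
    """Shift an n x n weight matrix into an n x n grid of cells.

    transposed=False: vectors enter at the top and move down (vertical path).
    transposed=True:  vectors enter at the left and move right (horizontal path).
    """
    n = len(weights)
    grid = [[0] * n for _ in range(n)]
    # The last vector is fed first so that it ends up farthest from the entry edge.
    for vec in reversed(weights):
        if transposed:
            # Every row shifts one cell to the right; vec fills column 0.
            grid = [[vec[r]] + grid[r][:-1] for r in range(n)]
        else:
            # Every column shifts one cell down; vec fills row 0.
            grid = [list(vec)] + grid[:-1]
    return grid


def cell_products(grid, x):
    """Per-cell multiply of the stored weight with a vector data input.

    Assumption: the cell in column c receives x[c].
    """
    return [[w * xc for w, xc in zip(row, x)] for row in grid]


if __name__ == "__main__":
    w = [[1, 2], [3, 4]]
    print(load_weights(w, transposed=False))
    print(load_weights(w, transposed=True))
```

Feeding the same stream of vectors through the horizontal path leaves the transpose of the matrix in the weight matrix registers, which is the property the abstract attributes to the transposed weight shift register.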

BLOCK OPERATIONS FOR AN IMAGE PROCESSOR HAVING A TWO-DIMENSIONAL EXECUTION LANE ARRAY AND A TWO-DIMENSIONAL SHIFT REGISTER

A method is described that includes, on an image processor having a two-dimensional execution lane array and a two-dimensional shift register array, repeatedly shifting first content of multiple rows or columns of the two-dimensional shift register array and repeatedly executing at least one instruction between shifts that operates on the shifted first content and/or second content that is resident in respective locations of the two-dimensional shift register array that the shifted first content has been shifted into.
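The shift-then-execute pattern can be sketched in plain Python. The example below is an assumption-laden illustration, not the claimed method: `shift_left` and `row_window_sum` are hypothetical names, the "instruction" is an element-wise add, and shifts wrap around at the row edge purely to keep the code short.

```python
def shift_left(plane):
    """Shift every row of the register plane one position to the left.

    Assumption: values wrap around at the edge.
    """
    return [row[1:] + row[:1] for row in plane]


def row_window_sum(plane, width):
    """Sum `width` horizontal neighbours at every location.

    `shifted` plays the role of the first content (it moves);
    `acc` plays the role of the second content (it stays resident).
    """
    acc = [row[:] for row in plane]
    shifted = plane
    for _ in range(width - 1):
        shifted = shift_left(shifted)  # one shift of the first content
        # One instruction executed between shifts: add the shifted value
        # to the value resident at the location it was shifted into.
        acc = [[a + s for a, s in zip(acc_row, sh_row)]
               for acc_row, sh_row in zip(acc, shifted)]
    return acc


if __name__ == "__main__":
    print(row_window_sum([[1, 2, 3, 4]], 2))
```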

Low latency matrix multiply unit
10635740 · 2020-04-28

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. The matrix multiply unit has two chains of weight shift registers per column of the systolic array. Each weight shift register is connected to only one chain and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
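One way to see why two chains per column can reduce load latency is the model below. It is a sketch under assumptions the abstract does not state: even-numbered cells are assumed to hang off one chain and odd-numbered cells off the other, both chains are assumed to shift in the same cycle, and `shift_in` and `load_column` are hypothetical names.

```python
def shift_in(values):
    """Shift `values` into a chain of the same length, one value per cycle."""
    chain = [None] * len(values)
    for v in reversed(values):
        chain = [v] + chain[:-1]
    return chain, len(values)  # final chain contents, cycles used


def load_column(weights):
    """Load one column of cells through two interleaved chains.

    Assumption: chain A feeds cells 0, 2, 4, ... and chain B feeds cells 1, 3, 5, ...
    """
    chain_a, cycles_a = shift_in(weights[0::2])
    chain_b, cycles_b = shift_in(weights[1::2])
    # Each cell latches from the single shift register it is connected to.
    cells = [chain_a[i // 2] if i % 2 == 0 else chain_b[i // 2]
             for i in range(len(weights))]
    # The chains shift in parallel, so the longer chain sets the latency.
    return cells, max(cycles_a, cycles_b)


if __name__ == "__main__":
    print(load_column([10, 11, 12, 13]))
```

In this model a column of four cells is loaded in two shift cycles, whereas a single chain of four registers would need four.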

System, Method, and recording medium for mirroring matrices for batched Cholesky decomposition on a graphic processing unit

A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU) include mirroring a second problem matrix of a second problem to a first problem matrix of a first problem as paired matrices, shifting the second problem matrix by N+1, and combining the first problem matrix and the mirrored second problem matrix into one matrix of (N+1)×N by merging the first problem matrix and the mirrored second problem matrix. The first problem matrix and the second problem matrix are symmetric and positive definite matrices.
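Because both matrices are symmetric, only one triangle of each needs to be stored, and two N×N triangles (diagonal included) fill an (N+1)×N array exactly. The sketch below shows one possible packing; the abstract does not give the index mapping, so the layout, `pack_pair`, and `unpack_pair` are all assumptions made for illustration.

```python
def pack_pair(a, b):
    """Pack two symmetric n x n matrices into one (n+1) x n array.

    Assumed layout: the lower triangle of `a` is moved down by one row,
    and the lower triangle of `b` is mirrored into the space above it.
    """
    n = len(a)
    packed = [[None] * n for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1):
            packed[i + 1][j] = a[i][j]  # a's lower triangle, one row down
            packed[j][i] = b[i][j]      # b's lower triangle, mirrored
    return packed


def unpack_pair(packed):
    """Recover both full symmetric matrices from the packed array."""
    n = len(packed[0])
    a = [[0] * n for _ in range(n)]
    b = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            a[i][j] = a[j][i] = packed[i + 1][j]
            b[i][j] = b[j][i] = packed[j][i]
    return a, b


if __name__ == "__main__":
    a = [[4, 2], [2, 3]]
    b = [[9, 3], [3, 5]]
    print(pack_pair(a, b))
```

Every slot of the packed array is used by exactly one stored element, so the pair occupies N(N+1) values instead of 2N².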

SYSTEM, METHOD, AND RECORDING MEDIUM FOR MIRRORING MATRICES FOR BATCHED CHOLESKY DECOMPOSITION ON A GRAPHIC PROCESSING UNIT
20200057790 · 2020-02-20

A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU) include mirroring a second problem matrix of a second problem to a first problem matrix of a first problem as paired matrices, shifting the second problem matrix by N+1, and combining the first problem matrix and the mirrored second problem matrix into one matrix of (N+1)×N, where the first problem shared memory comprises regular intervals, where the second problem shared memory is continuous, and where the GPU performs batched dense Cholesky decomposition with the one matrix from the combining to accelerate the Cholesky decomposition.

Low latency matrix multiply unit
11907330 · 2024-02-20

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply unit includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

LOW LATENCY MATRIX MULTIPLY UNIT
20190354571 · 2019-11-21

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply unit includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

COMPUTE OPTIMIZATIONS FOR NEURAL NETWORKS

One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including a multi-bit input value and a bipolar binary weight associated with a neural network and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the bipolar binary weight to generate an intermediate product and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.
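A multiply-accumulate with a bipolar binary weight can be modelled in a few lines, since multiplying by +1 or -1 only keeps or negates the input. This is a sketch, not the described hardware: `bipolar_mac` and `bipolar_dot` are hypothetical names, and the encoding (bit 1 means +1, bit 0 means -1) is an assumption.

```python
def bipolar_mac(acc, x, weight_bit):
    """One multiply-accumulate step with a bipolar binary weight.

    Assumed encoding: weight_bit 1 stands for +1, weight_bit 0 for -1.
    """
    product = x if weight_bit else -x  # the "multiplication" is a sign select
    return acc + product               # add to the accumulator and update it


def bipolar_dot(inputs, weight_bits):
    """Accumulate a sequence of multi-bit inputs against bipolar weights."""
    acc = 0
    for x, w in zip(inputs, weight_bits):
        acc = bipolar_mac(acc, x, w)
    return acc


if __name__ == "__main__":
    print(bipolar_dot([3, 5, 2], [1, 1, 0]))
```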

BLOCK OPERATIONS FOR AN IMAGE PROCESSOR HAVING A TWO-DIMENSIONAL EXECUTION LANE ARRAY AND A TWO-DIMENSIONAL SHIFT REGISTER

A method is described that includes, on an image processor having a two-dimensional execution lane array and a two-dimensional shift register array, repeatedly shifting first content of multiple rows or columns of the two-dimensional shift register array and repeatedly executing at least one instruction between shifts that operates on the shifted first content and/or second content that is resident in respective locations of the two-dimensional shift register array that the shifted first content has been shifted into.

SYSTEM, METHOD, AND RECORDING MEDIUM FOR MIRRORING MATRICES FOR BATCHED CHOLESKY DECOMPOSITION ON A GRAPHIC PROCESSING UNIT
20190294651 · 2019-09-26

A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU) include mirroring a second problem matrix of a second problem to a first problem matrix of a first problem as paired matrices, shifting the second problem matrix by N+1, and combining the first problem matrix and the mirrored second problem matrix into one matrix of (N+1)×N by merging the first problem matrix and the mirrored second problem matrix. The first problem matrix and the second problem matrix are symmetric and positive definite matrices.