Patent classifications
G06F5/015
LOW LATENCY MATRIX MULTIPLY UNIT
Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply unit includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive, from a horizontal direction, a weight input to be stored in the weight matrix register; a non-transposed weight shift register configured to receive, from a vertical direction, a weight input to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
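The cell structure described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuit: the names `Cell`, `make_array`, `load_weights`, and `stored_weights` are hypothetical, weights are assumed to enter at the top edge (vertical path) or the left edge (horizontal path), and one row of the weight matrix is assumed to be injected per shift step.

```python
class Cell:
    """Software stand-in for one systolic-array cell (hypothetical model)."""

    def __init__(self):
        self.vertical_shift = 0    # non-transposed weight shift register
        self.horizontal_shift = 0  # transposed weight shift register
        self.weight = 0            # weight matrix register

    def latch(self, transposed):
        # The weight matrix register takes its value from one of the two shift registers.
        self.weight = self.horizontal_shift if transposed else self.vertical_shift

    def multiply(self, x):
        # Multiply the stored weight with a vector data input.
        return self.weight * x


def make_array(n):
    return [[Cell() for _ in range(n)] for _ in range(n)]


def load_weights(cells, matrix, transposed):
    """Shift the same row stream in vertically or horizontally, then latch."""
    n = len(cells)
    for k in range(n):
        incoming = matrix[n - 1 - k]  # assumption: last row is injected first
        if transposed:
            for r in range(n):
                # Shift each row of cells one step to the right, then inject at the left edge.
                for c in range(n - 1, 0, -1):
                    cells[r][c].horizontal_shift = cells[r][c - 1].horizontal_shift
                cells[r][0].horizontal_shift = incoming[r]
        else:
            for c in range(n):
                # Shift each column of cells one step down, then inject at the top edge.
                for r in range(n - 1, 0, -1):
                    cells[r][c].vertical_shift = cells[r - 1][c].vertical_shift
                cells[0][c].vertical_shift = incoming[c]
    for row in cells:
        for cell in row:
            cell.latch(transposed)


def stored_weights(cells):
    return [[cell.weight for cell in row] for row in cells]
```

In this model, feeding the same stream of rows through the horizontal path leaves the transpose of the matrix in the weight matrix registers, while the vertical path leaves the matrix as given.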
FAST MATRIX MULTIPLICATION
A system and method of multiplying a first matrix and a second matrix are provided, the method comprising compressing the second matrix into a third matrix so that primarily non-zero values are processed. For each row in the first matrix, the row may be loaded into a row lookup unit. For each entry in the third matrix, a row address may be extracted, a row value may be obtained from the corresponding loaded row of the first matrix based on the extracted row address, the row value from the loaded row may be multiplied by the matrix value from the third matrix for each column, and the multiplied value may be added to an accumulator corresponding to that column. Lastly, a multiplied matrix may be output for the loaded row.
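The steps above can be sketched in a few lines. This is a minimal illustration, assuming the compressed "third matrix" stores, per column of the second matrix, a list of (row address, value) pairs for the non-zero entries; that layout and the names `compress` and `multiply_compressed` are assumptions, since the abstract does not specify the format.

```python
def compress(second):
    """Keep only non-zero entries of the second matrix, grouped by column.

    Each entry is a (row_address, value) pair (assumed layout).
    """
    n_cols = len(second[0])
    return [
        [(r, row[j]) for r, row in enumerate(second) if row[j] != 0]
        for j in range(n_cols)
    ]


def multiply_compressed(first, third):
    """Multiply the first matrix by the compressed form of the second."""
    result = []
    for loaded_row in first:              # row held in the "row lookup unit"
        accumulators = [0] * len(third)   # one accumulator per output column
        for j, entries in enumerate(third):
            for row_address, value in entries:
                # Look up the row value by address, multiply, and accumulate.
                accumulators[j] += loaded_row[row_address] * value
        result.append(accumulators)       # output for the loaded row
    return result
```

For example, `multiply_compressed([[1, 2], [3, 4]], compress([[0, 5], [6, 0]]))` visits only the two non-zero entries of the second matrix for each loaded row.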
System, method, and recording medium for mirroring matrices for batched cholesky decomposition on a graphic processing unit
A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU) include mirroring matrices to form paired matrices and solving the paired matrices simultaneously.
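One possible reading of the mirroring idea can be sketched as follows. The abstract does not describe the storage layout, so everything here is an assumption: since Cholesky decomposition only needs the lower triangle of a symmetric matrix, the sketch packs the lower triangles of two n-by-n matrices into one n-by-(n+1) array, with the second triangle mirrored and shifted by one column, and then factors both in the same column sweep. The names `pack_pair`, `cholesky_pair`, and `unpack_pair` are hypothetical, and the loop over the two problems runs sequentially here, whereas a GPU would map it to parallel threads.

```python
import math


def pack_pair(a, b):
    """Pack the lower triangles of a and b into one n x (n + 1) array."""
    n = len(a)
    packed = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            packed[i][j] = float(a[i][j])      # lower triangle of a, in place
            packed[j][i + 1] = float(b[i][j])  # lower triangle of b, mirrored and shifted
    return packed


def cholesky_pair(packed):
    """Factor both packed matrices in place, one column step at a time."""
    n = len(packed)
    # Index maps from element (i, j) of each lower triangle to packed storage.
    maps = (lambda i, j: (i, j), lambda i, j: (j, i + 1))
    for j in range(n):
        for pos in maps:  # both problems advance in the same column step
            def get(i, k):
                r, c = pos(i, k)
                return packed[r][c]

            r, c = pos(j, j)
            diag = math.sqrt(get(j, j) - sum(get(j, k) ** 2 for k in range(j)))
            packed[r][c] = diag
            for i in range(j + 1, n):
                r, c = pos(i, j)
                packed[r][c] = (get(i, j) - sum(get(i, k) * get(j, k) for k in range(j))) / diag
    return packed


def unpack_pair(packed):
    """Recover the two lower-triangular factors as full n x n lists."""
    n = len(packed)
    la = [[packed[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    lb = [[packed[j][i + 1] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    return la, lb
```

In this layout every element of the packed array is used, which is the motivation assumed here for pairing the matrices.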
COMPUTE OPTIMIZATIONS FOR NEURAL NETWORKS
One embodiment provides a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands, including a multi-bit input value and a one-bit weight associated with a neural network, as well as an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a fused operation, including an exclusive NOR (XNOR) operation and a population count operation, to generate an intermediate product. The adder is configured to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.
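The fused XNOR and population count operation can be illustrated with the common binarized dot product. This is a sketch under assumptions, not the claimed instruction: both operands are taken to be vectors of +1/-1 values packed one bit per element (bit 1 for +1, bit 0 for -1), and the names `xnor_popcount` and `binary_dot_accumulate` are hypothetical.

```python
def xnor_popcount(a_bits, w_bits, n):
    """Count the bit positions where two n-bit packed vectors agree."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask  # 1 wherever the two bits match
    return bin(xnor).count("1")       # population count


def binary_dot_accumulate(acc, a_bits, w_bits, n):
    """Add the +1/-1 dot product of the packed vectors to the accumulator."""
    matches = xnor_popcount(a_bits, w_bits, n)
    # Each match contributes +1 and each mismatch -1: matches - (n - matches).
    return acc + 2 * matches - n
```

The identity used is that the dot product of two +1/-1 vectors of length n equals twice the number of matching positions minus n, so no multiplications are needed.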
VARIABLE WIDTH BARREL SHIFTER
A variable width barrel shifter is disclosed. The variable width barrel shifter includes a first barrel shifter and a second barrel shifter, each configured to receive a data vector of width M as input, and an element-wise multiplexer coupled to the first and second barrel shifters. The element-wise multiplexer is configured to provide a shifted output of the data vector of width M by including a first portion of output from the second barrel shifter and a second portion of output from the first barrel shifter.
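The arrangement above can be modeled in software. This is a sketch under assumptions: the goal is taken to be rotating only the first `width` elements of a width-M vector, the first shifter rotates the full vector right by `shift`, the second rotates it right by `shift - width` (modulo M), and the multiplexer picks per element. The function names are hypothetical.

```python
def rotate_right(vec, shift):
    """Full-width barrel shifter: rotate all M elements right by shift."""
    m = len(vec)
    shift %= m
    return [vec[(i - shift) % m] for i in range(m)]


def variable_width_rotate(vec, width, shift):
    """Rotate the first `width` elements of vec right by `shift`."""
    assert 0 < width <= len(vec)
    shift %= width
    first = rotate_right(vec, shift)           # correct for positions shift..width-1
    second = rotate_right(vec, shift - width)  # correct for positions 0..shift-1
    # Element-wise multiplexer: first portion from the second shifter,
    # second portion from the first shifter.
    return [second[i] if i < shift else first[i] for i in range(width)]
```

With `vec = [0, 1, 2, 3, 4, 5, 6, 7]`, `width = 5`, and `shift = 2`, the elements that wrap around the narrower width come from the second shifter and the rest from the first.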
LOW LATENCY MATRIX MULTIPLY UNIT
Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. The matrix multiply unit includes two chains of weight shift registers per column of the systolic array. Each weight shift register is connected to only one chain, and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
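A small model can show why two chains per column may reduce load latency. The abstract does not say how cells are assigned to chains, so this sketch assumes that even-numbered rows of a column sit on one chain and odd-numbered rows on the other, and that each chain accepts one new weight per shift step. The function name `load_column_two_chains` is hypothetical.

```python
def load_column_two_chains(weights):
    """Load one column of weights through two interleaved shift chains.

    Returns the weights as they end up in the cells and the number of
    shift steps used.
    """
    n = len(weights)
    # Chain 0 serves even rows, chain 1 serves odd rows (assumed assignment).
    chains = [[None] * len(range(p, n, 2)) for p in (0, 1)]
    # The value destined for the far end of each chain is fed first.
    feeds = [list(reversed(weights[p::2])) for p in (0, 1)]
    steps = 0
    while any(feeds):
        for chain, feed in zip(chains, feeds):
            if feed:
                chain.insert(0, feed.pop(0))  # inject at the head of the chain
                chain.pop()                   # oldest slot shifts out of the model
        steps += 1
    column = [None] * n
    column[0::2] = chains[0]
    column[1::2] = chains[1]
    return column, steps
```

Under these assumptions a column of n cells is loaded in about n/2 shift steps rather than n, because both chains shift in parallel.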
High speed data weighted averaging (DWA) to binary converter circuit
A latch circuit sequentially latches a first data weighted averaging (DWA) data word and then a second DWA data word. A first detector circuit identifies a first bit location in the first DWA data word that is associated with an ending of a first string of logic 1 bits in the first DWA data word. A second detector circuit identifies a second bit location in the second DWA data word that is associated with an ending of a second string of logic 1 bits in the second DWA data word. A DWA-to-binary conversion circuit converts the second DWA data word to a binary word by using the first bit location and the second bit location to identify the number of logic 1 bits present in the second DWA data word. A binary value equal to the identified number is output for that binary word.
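The conversion can be sketched in software under the usual DWA property that each word's string of 1 bits starts at the position right after the previous word's string ended, wrapping around the word. That property, the bit-list representation, the handling of all-zero and all-one words, and the names `string_end` and `dwa_to_binary` are assumptions of this sketch, not details taken from the abstract.

```python
def string_end(word, previous_end):
    """Index of the last 1 in the circular run of 1 bits.

    If the word has no 1-to-0 transition (all zeros or all ones), the
    ending position is assumed to stay where it was.
    """
    w = len(word)
    for i in range(w):
        if word[i] == 1 and word[(i + 1) % w] == 0:
            return i
    return previous_end


def dwa_to_binary(first_word, second_word, initial_end=-1):
    """Number of 1 bits in second_word, derived from the two ending positions."""
    w = len(second_word)
    first_end = string_end(first_word, initial_end)
    second_end = string_end(second_word, first_end)
    if all(second_word):
        return w  # full-scale word: the two endings coincide
    # The run starts just after first_end and stops at second_end (circularly).
    return (second_end - first_end) % w
```

The point of the approach is that the count follows from a modular subtraction of two positions instead of summing all the bits of the word.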