IPIQ

G06F5/015

System, method, and recording medium for mirroring matrices for batched Cholesky decomposition on a graphic processing unit

10423695 · 2019-09-24 ·

International Business Machines Corporation

A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU), include mirroring a second problem matrix of a second problem to a first problem matrix of a first problem as paired matrices and shifting the second problem by N+1, combining the first problem matrix and the mirrored second problem matrix into one matrix of (N+1)×N, and reading the fixed size data length of the one square matrix with a fixed data interval for both the first problem and the second problem.
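The pairing idea in this abstract can be illustrated with a small sketch. It assumes both problems are symmetric N×N matrices, so only their lower triangles need to be stored, and it places the second problem's triangle mirrored above the diagonal and shifted by one column. The function names `pack_pair` and `unpack_pair` and the exact index layout are hypothetical and are not taken from the patent.

```python
def pack_pair(a, b):
    """Pack the lower triangles of two symmetric N x N matrices into one N x (N+1) array.

    Assumed layout (illustrative only): problem A keeps its lower triangle in place,
    problem B is mirrored across the diagonal and shifted right by one column.
    """
    n = len(a)
    packed = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            packed[i][j] = a[i][j]      # first problem: column <= row
            packed[j][i + 1] = b[i][j]  # second problem: mirrored, column > row
    return packed


def unpack_pair(packed):
    """Recover both symmetric matrices from the packed N x (N+1) array."""
    n = len(packed)
    a = [[0.0] * n for _ in range(n)]
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            a[i][j] = a[j][i] = packed[i][j]
            b[i][j] = b[j][i] = packed[j][i + 1]
    return a, b
```

Under this layout every slot of the N×(N+1) array is used exactly once, since each triangle holds N(N+1)/2 entries.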

Compute optimizations for neural networks

10410098 · 2019-09-10 ·

Intel Corporation

One embodiment provides for a compute apparatus to perform machine learning operations, the apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands including an input value and a quantized weight value associated with a neural network and an arithmetic logic unit including a barrel shifter, an adder, and an accumulator register, wherein to execute the decoded instruction, the barrel shifter is to shift the input value by the quantized weight value to generate a shifted input value and the adder is to add the shifted input value to a value stored in the accumulator register and update the value stored in the accumulator register.
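A rough software model of the shift-and-accumulate step described above is sketched here. It assumes each quantized weight is a non-negative power-of-two exponent, so multiplying by the weight reduces to a left shift; the function name `shift_accumulate` is hypothetical, and the hardware in the abstract is not limited to this encoding.

```python
def shift_accumulate(inputs, weight_exponents, acc=0):
    """Accumulate inputs scaled by power-of-two weights using shifts only.

    Assumption: each weight is stored as an exponent e >= 0, meaning a weight of 2**e.
    """
    for x, e in zip(inputs, weight_exponents):
        shifted = x << e   # barrel shifter: shift the input by the quantized weight
        acc += shifted     # adder: add to the accumulator and store the result back
    return acc
```

For example, `shift_accumulate([3, 5, 1], [2, 0, 4])` computes 3·4 + 5·1 + 1·16 without any multiplication.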

Low latency matrix multiply unit

11989259 · 2024-05-21 ·

Google LLC

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. Two chains of weight shift registers per column of the systolic array are in the matrix multiply unit. Each weight shift register is connected to only one chain and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
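One way to picture why two weight shift chains per column lower latency is the toy simulation below. It assumes each chain feeds a contiguous half of the column and advances one register per cycle; that split, and the names `load_column` and `column_dot`, are illustrative assumptions rather than details from the abstract.

```python
def load_column(weights, chains=2):
    """Shift a column of weights into its cells over `chains` parallel chains.

    Returns (cells, cycles). Assumption: each chain serves a contiguous
    segment of the column and moves one weight per cycle.
    """
    h = len(weights)
    seg = (h + chains - 1) // chains
    segments = [weights[k * seg:(k + 1) * seg] for k in range(chains)]
    registers = [[None] * len(s) for s in segments]
    cycles = 0
    for step in range(seg):
        for regs, s in zip(registers, segments):
            if step < len(s):
                # Shift the chain by one position and feed the next weight at its head.
                regs.insert(0, s[len(s) - 1 - step])
                regs.pop()
        cycles += 1
    cells = [w for regs in registers for w in regs]
    return cells, cycles


def column_dot(cells, vector):
    """Each cell multiplies its stored weight by a vector input; the column sums the products."""
    return sum(w * x for w, x in zip(cells, vector))
```

In this model a column of four weights loads in two cycles with two chains, versus four cycles with a single chain.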

Apparatus and method for supporting a conversion instruction

10310809 · 2019-06-04 ·

Arm Limited

A data processing system includes instruction decoder circuitry responsive to a conversion instruction FCVTJS to convert a double precision floating point number into a 32-bit integer number. Right shifting circuitry performs a right shift upon at least part of the input number and left shifting circuitry performs a left shift of at least part of the input number. Selection circuitry serves to select one of the right shifted number and the left shifted number as a selected shifted number which forms at least part of the output number which is generated.
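The select-between-shifts structure in this abstract can be mimicked in software. The sketch below assumes the conversion truncates toward zero and wraps modulo 2^32, with NaN and infinities mapping to 0 (the behaviour of JavaScript's ToInt32); the abstract itself does not spell out these rules, and the function name `double_to_int32` is hypothetical.

```python
import struct


def double_to_int32(x):
    """Convert a double to a signed 32-bit integer, wrapping modulo 2**32.

    Assumption: truncation toward zero, NaN and infinity give 0.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF or exponent == 0:
        return 0  # NaN, infinity, zero, or subnormal (magnitude below 1)
    significand = fraction | (1 << 52)  # restore the implicit leading one
    shift = exponent - 1075             # 1023 bias plus 52 fraction bits
    if shift < 0:
        magnitude = significand >> -shift  # right-shift path: drops fractional bits
    else:
        magnitude = significand << shift   # left-shift path: large magnitudes
    low = (-magnitude if sign else magnitude) & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low
```

Values below 2^52 in magnitude take the right-shift path, larger ones take the left-shift path, and only the low 32 bits of the selected result are kept.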

Selectively combinable directional shifters

10289382 · 2019-05-14 ·

Wave Computing, Inc.

Samit Chaudhuri

An apparatus for mathematical manipulation is described allowing the selective combination of shifters to shift binary numbers of various widths. Selective combination allows on-the-fly adjustment of shifters from independent to coordinated shifting operations. Selective combination allows adjustable hardware-based shifting while saving space and resources. Multiple eight-bit shifters can be configured for a variety of operand widths, such as a 32-bit width, a 24-bit width, a 16-bit width, or an eight-bit width. Multiplexers route the appropriate input data to the appropriate shifters. Bidirectional shifting is configured through a selector tree, including both shift left and shift right operations. Opcodes configure the shifters for the desired type of shift and a shifted result is generated.
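As an illustration of combining narrow shifters, the sketch below chains 8-bit lanes so that bits shifted out of one lane carry into the next. It covers only left shifts of fewer than 8 bit positions, on the assumption that whole-byte moves would be handled by the multiplexers; the function name `shift_left_lanes` is hypothetical.

```python
def shift_left_lanes(value, amount, width):
    """Left-shift a `width`-bit value (8, 16, 24, or 32) using chained 8-bit lanes.

    Assumption: 0 <= amount < 8; larger shifts would need byte routing as well.
    """
    assert width in (8, 16, 24, 32) and 0 <= amount < 8
    lanes = [(value >> (8 * k)) & 0xFF for k in range(width // 8)]
    out = []
    carry = 0
    for byte in lanes:  # lowest lane first
        shifted = (byte << amount) | carry
        out.append(shifted & 0xFF)  # bits that stay in this lane
        carry = shifted >> 8        # bits handed to the next lane
    # The carry out of the top lane is discarded, as in a fixed-width shift.
    return sum(b << (8 * k) for k, b in enumerate(out))
```

With `width=8` a single lane shifts independently; with `width=32` four lanes behave as one coordinated 32-bit shifter.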

LOW LATENCY MATRIX MULTIPLY UNIT

20180336163 · 2018-11-22 ·

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.

LOW LATENCY MATRIX MULTIPLY UNIT

20180336164 · 2018-11-22 ·

COMPUTE OPTIMIZATIONS FOR NEURAL NETWORKS

20180307950 · 2018-10-25 ·

Intel Corporation

LOW LATENCY MATRIX MULTIPLY UNIT

20240303297 · 2024-09-12 ·

Mirroring matrices for batched Cholesky decomposition on a graphic processing unit

12086207 · 2024-09-10 ·

International Business Machines Corporation

A batched Cholesky decomposition method, system, and non-transitory computer readable medium for a Graphics Processing Unit (GPU), include mirroring matrices to form paired matrices and solving the paired matrices simultaneously.
