G06F5/015

BLOCK OPERATIONS FOR AN IMAGE PROCESSOR HAVING A TWO-DIMENSIONAL EXECUTION LANE ARRAY AND A TWO-DIMENSIONAL SHIFT REGISTER

A method is described that includes, on an image processor having a two-dimensional execution lane array and a two-dimensional shift register array, repeatedly shifting first content of multiple rows or columns of the shift register array, and repeatedly executing, between shifts, at least one instruction that operates on the shifted first content and/or on second content resident in the locations of the shift register array into which the shifted first content has been shifted.
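
The shift-then-execute pattern can be pictured in software. Below is a hedged Python sketch, using plain lists of rows to stand in for the two-dimensional shift register array; the horizontal-sum stencil and all names are illustrative choices, not taken from the patent:

```python
def shifted_row_sum(grid, taps):
    """Sum each element with its (taps - 1) right-hand neighbours, zero-
    filling past the edge, by alternating shifts with an add executed in
    every lane.  `grid` is a list of equal-length rows standing in for
    the two-dimensional shift register array."""
    acc = [row[:] for row in grid]       # "second content" resident in each lane
    shifted = [row[:] for row in grid]   # "first content" that gets shifted
    for _ in range(taps - 1):
        # shift every row one position left, zero-filling at the edge
        shifted = [row[1:] + [0] for row in shifted]
        # the instruction executed between shifts: accumulate into resident lanes
        acc = [[a + s for a, s in zip(arow, srow)]
               for arow, srow in zip(acc, shifted)]
    return acc
```

Each loop iteration models one shift of all rows followed by one instruction that combines shifted and resident content, which is the repeated pattern the abstract describes.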

MULTI-PRECISION ARITHMETIC RIGHT SHIFT
20230214176 · 2023-07-06

A method includes receiving, by each of an upper shift circuit and a lower shift circuit, an operand for an arithmetic right shift operation. The upper shift circuit is configured to provide an upper output, the lower shift circuit is configured to provide a lower output, and the upper output concatenated with the lower output is a result of the arithmetic right shift operation. The method also includes receiving a shift value for the arithmetic right shift operation; responsive to the shift value, detecting a shift condition in which a portion of, but not all of, the operand could be shifted into bits corresponding to the lower output; and responsive to detecting the shift condition, providing, by a middle shift circuit, at least a portion of the operand to the lower shift circuit as a selectable input.
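
The upper/middle/lower split can be modelled behaviourally. In the Python sketch below, the half-width `HALF`, the function name, and the exact branch structure are illustrative assumptions; the `middle` term stands in for the middle shift circuit that routes the operand bits crossing into the lower output when the shift condition holds:

```python
HALF = 8  # half-width W in bits; an illustrative choice, not from the patent

def asr_split(operand, shift):
    """Arithmetic right shift of an unsigned-encoded 2*HALF-bit operand,
    produced as separate upper and lower HALF-bit outputs whose
    concatenation is the full result.  For 0 < shift < HALF the
    abstract's 'shift condition' holds: low bits of the upper half cross
    into the lower output via the middle path."""
    mask = (1 << HALF) - 1
    upper = (operand >> HALF) & mask
    lower = operand & mask
    # reinterpret the upper half as signed so the shift fills with sign bits
    signed_upper = upper - (1 << HALF) if upper >> (HALF - 1) else upper
    upper_out = (signed_upper >> shift) & mask        # upper shift circuit
    if 0 < shift < HALF:                              # shift condition detected
        middle = upper & ((1 << shift) - 1)           # bits crossing the boundary
        lower_out = ((lower >> shift) | (middle << (HALF - shift))) & mask
    elif shift == 0:
        lower_out = lower
    else:                                             # entire lower half shifted out
        lower_out = (signed_upper >> (shift - HALF)) & mask
    return upper_out, lower_out
```

For example, shifting the 16-bit operand 0x1234 right by 4 yields upper output 0x01 and lower output 0x23, with the crossing bits 0x2 supplied through the middle path.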

Compute optimizations for neural networks using ternary weight

One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands, including a multi-bit input value and a ternary weight associated with a neural network, and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier performs a multiplication operation on the multi-bit input value based on the ternary weight to generate an intermediate product, and the adder adds the intermediate product to the value stored in the accumulator register and updates that stored value.
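
Because a ternary weight is restricted to {-1, 0, +1}, the multiply collapses to negate, zero, or pass-through. The Python sketch below is a behavioural model of that multiply-accumulate step, not the hardware datapath; the function names are illustrative:

```python
def ternary_mac(acc, x, w):
    """One multiply-accumulate step: the ternary weight w in {-1, 0, +1}
    reduces the multiplier stage to negate / zero / pass-through, and
    the adder folds the intermediate product into the accumulator."""
    if w not in (-1, 0, 1):
        raise ValueError("ternary weight must be -1, 0, or +1")
    product = -x if w == -1 else (x if w == 1 else 0)  # multiplier stage
    return acc + product                               # adder + accumulator update

def ternary_dot(xs, ws):
    """Accumulate a whole dot product with ternary weights."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = ternary_mac(acc, x, w)
    return acc
```

The point of the restriction is visible in `ternary_mac`: no general multiplier array is needed, only negation and selection.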

Low latency matrix multiply unit
11500961 · 2022-11-15

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. The matrix multiply unit contains two chains of weight shift registers per column of the systolic array; each weight shift register is connected to only one chain, and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
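
The latency benefit of two chains per column can be sketched behaviourally: if each cell listens to exactly one of two chains, the column's weight registers fill in roughly half the shift cycles of a single chain. In the Python sketch below, the even/odd wiring and all names are illustrative assumptions, not the patent's wiring:

```python
def load_column_weights(weights):
    """Load one column's weight registers over two shift-register
    chains: even-row cells hang off chain 0 and odd-row cells off
    chain 1 (one plausible wiring).  Returns the per-cell weight
    registers and the number of shift cycles used."""
    even = weights[0::2]              # weights destined for chain-0 cells
    odd = weights[1::2]               # weights destined for chain-1 cells
    chain0 = [None] * len(even)
    chain1 = [None] * len(odd)
    cycles = max(len(even), len(odd))
    for t in range(cycles):
        # one shift cycle per chain: the next weight enters at the top;
        # the deepest cell's weight enters first so it ends at the bottom
        if t < len(even):
            chain0 = [even[len(even) - 1 - t]] + chain0[:-1]
        if t < len(odd):
            chain1 = [odd[len(odd) - 1 - t]] + chain1[:-1]
    # each cell copies from the one shift register it is connected to
    cells = [None] * len(weights)
    cells[0::2] = chain0
    cells[1::2] = chain1
    return cells, cycles
```

Loading four weights completes in two shift cycles rather than four, which is the low-latency property the title refers to.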

Low latency matrix multiply unit
11599601 · 2023-03-07

Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. Each cell of the matrix multiply unit includes: a weight matrix register configured to receive a weight input from either a transposed or a non-transposed weight shift register; a transposed weight shift register configured to receive a weight input from a horizontal direction to be stored in the weight matrix register; a non-transposed weight shift register configured to receive a weight input from a vertical direction to be stored in the weight matrix register; and a multiply unit that is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
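
The effect of the per-cell source select can be sketched simply: latching the same stored matrix through the horizontal (transposed) chains leaves the array holding the transpose, with no separate transpose pass. The Python model below is illustrative only; it shows the end state of the weight matrix registers, not the cycle-by-cycle shifting:

```python
def load_weights(weight_matrix, transpose):
    """End-state model of the per-cell select: each weight matrix
    register latches either from the transposed chain (weight arriving
    along its row) or the non-transposed chain (arriving along its
    column)."""
    n = len(weight_matrix)
    cells = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if transpose:
                cells[r][c] = weight_matrix[c][r]   # arrived via a row chain
            else:
                cells[r][c] = weight_matrix[r][c]   # arrived via a column chain
    return cells
```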

TECHNIQUE FOR BIT UP-CONVERSION WITH SIGN EXTENSION
20220326909 · 2022-10-13

A technique for bit-depth up-conversion includes obtaining an input value for a computation in a first bit depth having fewer bits than a second bit depth; converting the input value from the first bit depth to the second bit depth as an unsigned data value; adjusting a pointer to the converted input value based on the first bit depth; performing the computation based on the adjusted pointer to obtain an adjusted output value; and performing a right shift operation on the adjusted output value, based on the first bit depth, to obtain an output value.
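
The unsigned up-conversion followed by a bit-depth-dependent right shift resembles the classic sign-extension-by-shifts idiom: place the narrow value at the top of the wider word as unsigned data, then an arithmetic right shift by the same amount reproduces the sign. The Python sketch below shows that idiom, not the patented pointer-based datapath; all names are illustrative:

```python
def sign_extend_via_shifts(x, in_bits, out_bits):
    """Widen an in_bits value to out_bits with sign extension using only
    an unsigned placement and an arithmetic right shift.  Python ints
    are unbounded, so the signed wider word is modelled explicitly."""
    shift = out_bits - in_bits
    wide = (x & ((1 << in_bits) - 1)) << shift   # unsigned up-conversion, left-justified
    if wide & (1 << (out_bits - 1)):             # reinterpret as a signed wide word
        wide -= 1 << out_bits
    return wide >> shift                         # arithmetic right shift restores the sign
```

For example, the 8-bit pattern 0xF0 widens to the 16-bit value -16, without any explicit sign test on the narrow input.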

APPARATUS AND METHOD FOR SUPPORTING A CONVERSION INSTRUCTION
20170293467 · 2017-10-12

A data processing system 2 includes instruction decoder circuitry 12 responsive to a conversion instruction FCVTJS to convert a double precision floating point number into a 32-bit integer number. Right shifting circuitry 28 performs a right shift upon at least part of the input number, and left shifting circuitry 32 performs a left shift of at least part of the input number. Selection circuitry 38 selects either the right-shifted number or the left-shifted number as a selected shifted number, which forms at least part of the generated output number.
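
Assuming this conversion follows JavaScript's ToInt32 semantics (as Arm's FJCVTZS instruction does: truncate toward zero, reduce modulo 2^32, reinterpret as signed), the architectural result can be modelled in a few lines. This Python sketch models only that result, not the left/right shifter datapath in the abstract:

```python
import math

def to_int32(x):
    """Behavioural model of a JavaScript-style double-to-int32
    conversion: truncate toward zero, take the value modulo 2**32, and
    reinterpret as a signed 32-bit integer.  NaN and infinities map
    to 0."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = int(x)                 # Python's int() truncates toward zero
    n &= 0xFFFFFFFF            # reduce modulo 2**32
    return n - (1 << 32) if n & 0x80000000 else n
```

Note the wrap-around behaviour: values at or above 2^31 reappear as negative 32-bit integers, which is exactly what a dedicated conversion instruction spares software from emulating with multiple operations.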

Hybrid matrix multiplication pipeline

Systems, apparatuses, and methods implementing a hybrid matrix multiplication pipeline are disclosed. A hybrid matrix multiplication pipeline is able to execute a plurality of different types of instructions in a plurality of different formats by reusing execution circuitry in an efficient manner. For a first type of instruction for source operand elements of a first size, the pipeline uses N multipliers to perform N multiplication operations on N different sets of operands, where N is a positive integer greater than one. For a second type of instruction for source operand elements of a second size, the N multipliers work in combination to perform a single multiplication operation on a single set of operands, where the second size is greater than the first size. The pipeline also shifts element product results in an efficient manner when implementing a dot product operation.
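
The second instruction type, where N narrow multipliers combine into one wide multiply, is the schoolbook partial-product decomposition. The Python sketch below shows the idea for N = 4 with 8-bit halves forming a 16x16 multiply; the sizes and names are illustrative assumptions, not the pipeline's actual widths:

```python
def mul16_from_four_mul8(a, b):
    """Form one 16x16 multiply from four 8x8 multiplies by summing
    shifted partial products: a*b = (a_hi*2^8 + a_lo)(b_hi*2^8 + b_lo)."""
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF
    # each line below is one narrow multiplier's product
    p0 = a_lo * b_lo
    p1 = a_lo * b_hi
    p2 = a_hi * b_lo
    p3 = a_hi * b_hi
    # the shift-and-add network recombines the partial products
    return p0 + ((p1 + p2) << 8) + (p3 << 16)
```

Run the other way, the same four multipliers execute four independent 8x8 products per cycle for the narrow instruction type, which is the reuse the abstract describes.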