G06F2207/3812

RECONFIGURABLE DIGITAL SIGNAL PROCESSING (DSP) VECTOR ENGINE

Systems and methods described herein may relate to providing a dynamically configurable circuitry able to process data associated with a variety of matrix dimensions using one or more complex number operations, one or more real number operations, or both. Configurations may be applied to the configurable circuitry to program the configurable circuitry for a next operation. The configurable circuitry may process data according to a variety of operations based at least in part on operation of a repeated processing element coupled in a compute network of processing elements.

Method and apparatus for efficient binary and ternary support in fused multiply-add (FMA) circuits
10713012 · 2020-07-14

An apparatus and method for efficiently performing a multiply add or multiply accumulate operation. For example, one embodiment of a processor comprises: a decoder to decode an instruction specifying a multiply-accumulate or multiply-add operation, the instruction comprising a first operand identifying a multiplier and a second operand identifying a multiplicand; and fused multiply-add (FMA) execution circuitry comprising first multiplication circuitry to perform a multiplication using the multiplicand and multiplier to generate a result for multipliers and multiplicands falling within a first precision range, and second multiplication circuitry to be used instead of the first multiplication circuitry for multipliers and multiplicands falling within a second precision range; control circuitry, responsive to a precision of the first and second operands being below a threshold, to cause the first operand and second operand to be processed by the second multiplication circuitry to generate the result; and adder circuitry to add the result to an accumulated value to generate a new accumulated value.
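The path selection described in this abstract can be illustrated with a minimal Python sketch. It is not the patented circuit: the function names, the integer operands, and the 2-bit threshold are assumptions chosen so that ternary values (-1, 0, 1) take the cheaper path.

```python
def bits_needed(value: int) -> int:
    """Magnitude bits needed to represent value (at least 1)."""
    return max(abs(value).bit_length(), 1)


def ternary_multiply(a: int, b: int) -> int:
    """Stand-in for the second multiplication circuitry (hypothetical).

    For operands in {-1, 0, 1} the product needs no multiplier array:
    it is zero if either operand is zero, otherwise only the sign matters.
    """
    if a == 0 or b == 0:
        return 0
    return 1 if (a > 0) == (b > 0) else -1


def fma_step(acc: int, multiplier: int, multiplicand: int, threshold_bits: int = 2):
    """Return (new accumulated value, name of the path that was used)."""
    low = (bits_needed(multiplier) < threshold_bits
           and bits_needed(multiplicand) < threshold_bits)
    if low:
        product, path = ternary_multiply(multiplier, multiplicand), "low"
    else:
        product, path = multiplier * multiplicand, "full"
    # Adder stage: add the product to the running accumulator.
    return acc + product, path
```

Both paths produce the same arithmetic result; the sketch only shows where the control decision sits relative to the multiplier and the adder.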

Packed 16 bits instruction pipeline

Systems, apparatuses, and methods for efficiently processing arithmetic operations are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.
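The operand selection described here can be sketched in Python, assuming M = 32 and N = 16. The function names and the encoding of the select bits (bit 0 for the first source, bit 1 for the second) are hypothetical, not taken from the patent.

```python
from typing import Tuple


def select_half(reg32: int, use_high: bool) -> int:
    """Return the high or low 16-bit portion of a 32-bit register value."""
    reg32 &= 0xFFFFFFFF
    return (reg32 >> 16) & 0xFFFF if use_high else reg32 & 0xFFFF


def packed_operands(src0: int, src1: int, sel_bits: int) -> Tuple[int, int]:
    """Pick one 16-bit operand from each 32-bit source register.

    Assumed encoding: bit 0 of sel_bits selects the half of src0,
    bit 1 selects the half of src1 (1 = high portion, 0 = low portion).
    """
    return (select_half(src0, bool(sel_bits & 0b01)),
            select_half(src1, bool(sel_bits & 0b10)))
```

For example, `packed_operands(0x12345678, 0xABCDEF01, 0b10)` takes the low half of the first register and the high half of the second.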

CHANGING PRECISION OF OPERANDS
20240095302 · 2024-03-21

Apparatuses, systems, and techniques to perform matrix multiply-accumulate (MMA) operations on data of a first type using one or more MMA instructions for data of a second type. In at least one embodiment, a single tensorfloat-32 (TF32) MMA instruction computes a 32-bit floating point (FP32) output using TF32 input operands converted from FP32 data values.
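As a rough illustration of the FP32-to-TF32 conversion, the Python sketch below keeps the FP32 sign and 8-bit exponent and retains only the top 10 mantissa bits. Truncation is an assumption made for simplicity (hardware may round instead), and the function names are hypothetical.

```python
import struct
from typing import Sequence


def fp32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_fp32(bits: int) -> float:
    """Single-precision value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_tf32(x: float) -> float:
    """Clear the low 13 mantissa bits, leaving sign, 8 exponent and 10 mantissa bits."""
    return bits_fp32(fp32_bits(x) & 0xFFFFE000)


def tf32_dot(a: Sequence[float], b: Sequence[float], c: float) -> float:
    """Dot product of TF32-converted inputs plus c, returned as an FP32 value."""
    acc = c
    for x, y in zip(a, b):
        # Each product of two 11-bit significands fits in 22 bits, so it is exact.
        acc += to_tf32(x) * to_tf32(y)
    # Accumulation here is in Python's double precision; round to FP32 at the end.
    return bits_fp32(fp32_bits(acc))
```

An input such as `1.0 + 2**-11` is representable in FP32 but not in TF32, so it converts to `1.0` before the multiply.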

Multiplication and accumulation (MAC) operator
11909421 · 2024-02-20

A MAC operator includes a plurality of data type converters and a plurality of multipliers. Each of the plurality of data type converters may receive 16-bit input data of one of first to fourth data types of a floating-point format to convert into L-bit output data of the floating-point format. Each of the plurality of multipliers may perform a multiplication on the L-bit output data of the floating-point format outputted from two of the plurality of data type converters to output multiplication result data of the floating-point format.

Use of multiple different variants of floating point number formats in floating point operations on a per-operand basis
11966740 · 2024-04-23

A processor comprising: a register file comprising a group of operand registers for holding data values, each operand register being a fixed number of bits in length for holding a respective data value of that length; and processing logic comprising floating point logic for performing floating point operations on data values in the register file, wherein the floating point logic is configured to process the fixed number of bits in the respective data value according to a floating point format comprising a set of mantissa bits and a set of exponent bits. The processing logic is operable to select between a plurality of different variants of the floating point format, at least some of the variants having differently sized sets of mantissa bits and exponent bits relative to one another.
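The idea of reading the same fixed-width register value under different exponent/mantissa splits can be sketched in Python as follows. The 8-bit width, the two example variants (4 exponent + 3 mantissa bits, 5 exponent + 2 mantissa bits), the IEEE-style bias, and the omission of infinity/NaN handling are all assumptions for illustration; the abstract does not specify them.

```python
def decode(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a (1 + exp_bits + man_bits)-bit pattern as a floating point value.

    Assumes an IEEE-style layout: sign, biased exponent, mantissa, with
    bias = 2**(exp_bits - 1) - 1 and subnormals when the exponent field is 0.
    Infinity and NaN encodings are deliberately not modelled.
    """
    man = bits & ((1 << man_bits) - 1)
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

The same pattern `0x38` decodes to 1.0 with a 4/3 split and to 0.5 with a 5/2 split, which is why the variant has to be selected per operand.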

Decimal and binary floating point arithmetic calculations

Logic is provided for performing decimal and binary floating point arithmetic calculations on first and second operands. The method includes: receiving the first and second operands in packed format; unpacking the first and second operands; swapping the first operand to a fourth operand and the second operand to a third operand, if an exponent of the first operand is less than an exponent of the second operand, otherwise storing the first operand to the third operand and the second operand to the fourth operand; aligning the third and fourth operands based on the exponent difference of the third and fourth operands and a number of leading zeroes of the third operand; performing an add/subtract operation on the aligned third and fourth operands with normalizing and rounding between the operands; and packing the result obtained from the add/subtract.
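The swap and align steps can be illustrated with a simplified Python sketch for unsigned binary operands only. The `(exponent, significand)` tuple representation, the 8-bit significand width, and the use of truncation in place of rounding are assumptions; decimal operands, signs, and packing are left out.

```python
from typing import Tuple

Unpacked = Tuple[int, int]  # (exponent, significand); value = significand * 2**exponent


def add_unpacked(first: Unpacked, second: Unpacked, width: int = 8) -> Unpacked:
    """Add two unpacked operands whose significands fit in `width` bits."""
    # Swap so that the third operand holds the larger exponent.
    if first[0] < second[0]:
        third, fourth = second, first
    else:
        third, fourth = first, second
    e3, s3 = third
    e4, s4 = fourth

    # Align: first shift the third operand left into its leading zeroes ...
    diff = e3 - e4
    leading_zeroes = width - s3.bit_length()
    left = min(diff, leading_zeroes)
    s3 <<= left
    e3 -= left
    # ... then shift the fourth operand right by whatever difference remains.
    s4 >>= e3 - e4  # truncates; real hardware would keep guard bits and round

    # Add, then normalize back into `width` bits.
    exponent, significand = e3, s3 + s4
    while significand >= (1 << width):
        significand >>= 1
        exponent += 1
    return exponent, significand
```

Using the leading zeroes of the larger operand first means fewer low-order bits of the smaller operand are shifted out during alignment.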

Reconfigurable digital signal processing (DSP) vector engine

Systems and methods described herein may relate to providing a dynamically configurable circuitry able to process data associated with a variety of matrix dimensions using one or more complex number operations, one or more real number operations, or both. Configurations may be applied to the configurable circuitry to program the configurable circuitry for a next operation. The configurable circuitry may process data according to a variety of operations based at least in part on operation of a repeated processing element coupled in a compute network of processing elements.

PACKED 16 BITS INSTRUCTION PIPELINE

Systems, apparatuses, and methods for efficiently processing arithmetic operations are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.

METHOD AND APPARATUS FOR EFFICIENT BINARY AND TERNARY SUPPORT IN FUSED MULTIPLY-ADD (FMA) CIRCUITS
20190056916 · 2019-02-21

An apparatus and method for efficiently performing a multiply add or multiply accumulate operation. For example, one embodiment of a processor comprises: a decoder to decode an instruction specifying a multiply-accumulate or multiply-add operation, the instruction comprising a first operand identifying a multiplier and a second operand identifying a multiplicand; and fused multiply-add (FMA) execution circuitry comprising first multiplication circuitry to perform a multiplication using the multiplicand and multiplier to generate a result for multipliers and multiplicands falling within a first precision range, and second multiplication circuitry to be used instead of the first multiplication circuitry for multipliers and multiplicands falling within a second precision range; control circuitry, responsive to a precision of the first and second operands being below a threshold, to cause the first operand and second operand to be processed by the second multiplication circuitry to generate the result; and adder circuitry to add the result to an accumulated value to generate a new accumulated value.