G06F9/3001

ACCELERATED PROCESSING DEVICE AND METHOD OF SHARING DATA FOR MACHINE LEARNING
20230004385 · 2023-01-05 · ·

A processing device is provided which comprises a plurality of compute units configured to process data, a plurality of arithmetic logic units, instantiated separate from the plurality of compute units, and configured to store the data at the arithmetic logic units and perform calculations using the data and an interconnect network, connecting the arithmetic logic units and configured to provide the arithmetic logic units with shared access to the data for communication between the arithmetic logic units. The interconnect network is also configured to provide the compute units with shared access to the data for communication between the compute units.

SYSTOLIC ARRAY WITH EFFICIENT INPUT REDUCTION AND EXTENDED ARRAY PERFORMANCE

Systems and methods are provided to perform multiply-accumulate operations of reduced precision numbers in a systolic array. Each row of the systolic array can receive reduced inputs from a respective reducer. The reduced input can include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs with a first bit-length and the reducers may reduce the bit-length of a given input from the first bit-length to a second shorter bit-length and provide the reduced input to the array. In order to reduce the bit-length, the reducer may reduce the number of trailing bits of the input. Further, the systolic array can receive a reduced and rounded input. The systolic array can propagate the reduced input through the processing elements in the systolic array. Each processing element may include a multiplier and/or an adder to perform arithmetical operations based on the reduced input.

Monolithic vector processor configured to operate on variable length vectors using a vector length register

A computer processor comprising a vector unit is disclosed. The vector unit may comprise a vector register file comprising at least one register to hold a varying number of elements. The vector unit may further comprise a vector length register file comprising at least one register to specify the number of operations of a vector instruction to be performed on the varying number of elements in the at least one register of the vector register file. The computer processor may be implemented as a monolithic integrated circuit.

LARGE INTEGER MULTIPLICATION ENHANCEMENTS FOR GRAPHICS ENVIRONMENT

An apparatus to facilitate large integer multiplication enhancements in a graphics environment is disclosed. The apparatus includes a processor comprising processing resources, the processing resources comprising multiplier circuitry to: receive operands for a multiplication operation, wherein the multiplication operation is part of a chain of multiplication operations for a large integer multiplication; and issue a multiply and add (MAD) instruction for the multiplication operation utilizing at least one of a double precision multiplier or a 48 bit output, wherein the MAD instruction to generate an output in a single clock cycle of the processor.

APPARATUSES, METHODS, AND SYSTEMS FOR A PACKED DATA CONVOLUTION INSTRUCTION WITH SHIFT CONTROL AND WIDTH CONTROL

Systems, methods, and apparatuses to support packed data convolution instructions with shift control and width control are described. In one embodiment, a hardware processor includes a decoder circuit to decode a single instruction into a decoded single instruction, the single instruction having fields that identify a first packed data source, a second packed data source, a packed data destination, a sliding window width, and a stride, and an opcode that indicates an execution circuit is to generate a first chunk of contiguous elements of the first packed data source having a width of the sliding window width, generate a second chunk of contiguous elements of the first packed data source having the width of the sliding window width and shifted by the stride, multiply each element of the first chunk by a corresponding element of a respective chunk of the second packed data source to generate a first set of products, add the first set of products together to generate a first sum, multiply each element of the second chunk by a corresponding element of a respective chunk of the second packed data source to generate a second set of products, add the second set of products together to generate a second sum, and store the first sum in a first element of the packed data destination and the second sum in a second element of the packed data destination; and the execution circuit is to execute the decoded single instruction according to the opcode.

DUAL PIPELINE PARALLEL SYSTOLIC ARRAY

A processing apparatus described herein includes a general-purpose parallel processing engine comprising a systolic array having multiple pipelines, each of the multiple pipelines including multiple pipeline stages, wherein the multiple pipelines include a first pipeline, a second pipeline, and a common input shared between the first pipeline and the second pipeline.

SYSTOLIC ARRAY HAVING SUPPORT FOR OUTPUT SPARSITY

A processing apparatus is described herein that includes a general-purpose parallel processing engine comprising a matrix accelerator including one or more systolic arrays, at least one of the one or more systolic arrays comprising multiple pipeline stages, each pipeline stage of the multiple pipeline stages including multiple processing elements, the multiple processing elements configured to perform processing operations on input matrix elements based on output sparsity metadata. The output sparsity metadata indicates to the multiple processing elements to bypass multiplication for a first row of elements of a second matrix and multiply a second row of elements of the second matrix with a column of matrix elements of a first matrix.

SYSTOLIC ARRAY OF ARBITRARY PHYSICAL AND LOGICAL DEPTH

A processing apparatus includes a processing resource including a general-purpose parallel processing engine and a matrix accelerator. The matrix accelerator includes first circuitry to receive a command to perform operations associated with an instruction, second circuitry to configure the matrix accelerator according to a physical depth of a systolic array within the matrix accelerator and a logical depth associated with the instruction, third circuitry to read operands for the instruction from a register file associated with the systolic array, fourth circuitry to perform operations for the instruction via one or more passes through one or more physical pipeline stages of the systolic array based on a configuration performed by the second circuitry, and fifth circuitry to write output of the operations to the register file associated with the systolic array.

DUAL SUM OF QUADWORD 16X16 MULTIPLY AND ACCUMULATE

Techniques for matrix multiplication are described. In some examples, a single instruction having a format of fields for an opcode, one or more fields to indicate a location of a source/destination operand, one or more fields to indicate a location of a first source operand, and one or more fields to indicate a location of a second source operand is used. Wherein the opcode is to indicate that execution circuitry is to: multiply values from corresponding data elements of the first and second sources, add a first subset of the multiplied values to a first value from the source/destination operand and store in a first data element position of the source/destination operand, and add a second subset of the multiplied values to a second value from the source/destination operand and store in a second data element position of the source/destination operand.

APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS FOR MATRIX MULTIPLICATION INSTRUCTIONS

Techniques for matrix multiplication are described. In some examples, decode circuitry is to decode a single instruction having fields for an opcode, an indication of a location of a first source operand, an indication of a location of a second source operand, and an indication of a location of a destination operand, wherein the opcode is to indicate that execution circuitry is to at least convert data elements of the first and second source operands from a first floating point representation to a second floating point representation, perform matrix multiplication with the converted data elements, and accumulate results of the matrix multiplication in the destination operand in the first floating point representation; and the execution circuitry is to execute to the decoded instruction as specified by the opcode.