Patent classifications
G06F9/30109
EVICTING AND RESTORING INFORMATION USING A SINGLE PORT OF A LOGICAL REGISTER MAPPER AND HISTORY BUFFER IN A MICROPROCESSOR COMPRISING MULTIPLE MAIN REGISTER FILE ENTRIES MAPPED TO ONE ACCUMULATOR REGISTER FILE ENTRY
A computer system, processor, programming instructions and/or method of processing data that includes a main register file having a plurality of entries for storing data; an accumulator register file having a plurality of entries for storing data wherein multiple main register file entries are mapped to one accumulator register file entry in the at least one accumulator register file; a logical register mapper to track and map logical registers to main register file entries, and a history buffer. Processing wide data width instructions includes evicting and restoring information from a single primary entry in the logical register mapper through a single read or write port in the logical register mapper without evicting or restoring the remaining other multiple main register file entries mapped in the accumulator register.
Multiply-accumulation in a data processing apparatus
A data processing apparatus, a method of operating a data processing apparatus, a non-transitory computer readable storage medium, and an instruction are provided. The instruction specifies a first source register, a second source register, and a set of N accumulation registers. In response to the instruction control signals are generated, causing processing circuitry to extract N data elements from content of the first source register, perform a multiplication of each of the N data elements by content of the second source register, and apply a result of each multiplication to content of a respective target register of the set of N accumulation registers. As a result plural (N) multiplications are performed in a manner that effectively provides a multiplier N times the register width, but without requiring the register file to be made N times larger.
Apparatus and method for converting a floating-point value from half precision to single precision
An embodiment of the invention is a processor including execution circuitry to, in response to a decoded instruction, convert a half-precision floating-point value to a single-precision floating-point value and store the single-precision floating-point value in each of the plurality of element locations of a destination register. The processor also includes a decoder and the destination register. The decoder is to decode an instruction to generate the decoded instruction.
PERMUTATION INSTRUCTION
A device includes a vector register file, a memory, and a processor. The vector register file includes a plurality of vector registers. The memory is configured to store a permutation instruction. The processor is configured to access a periodicity parameter of the permutation instruction. The periodicity parameter indicates a count of a plurality of data sources that contain source data for the permutation instruction. The processor is also configured to execute the permutation instruction to, for each particular element of multiple elements of a first permutation result register of the plurality of vector registers, select a data source of the plurality of data sources based at least in part on the count of the plurality of data sources and populate the particular element based on a value in a corresponding element of the selected data source.
CONVOLUTIONAL NEURAL NETWORK OPERATIONS
Methods and systems are disclosed for executing operations on single-instruction-multiple-data (SIMD) units. Techniques disclosed perform a dot product operation on input data during one computer cycle, including convolving the input data, generating intermediate data, and applying one or more transitional operations to the intermediate data to generate output data. Aspects described, wherein the input data is an input to a layer of a convolutional neural network and the generated output data is the output of the layer.
Systems and methods for performing 16-bit floating-point matrix dot product instructions
Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
Systems, apparatuses, and methods for chained fused multiply add
Embodiments of systems, apparatuses, and methods for chained fused multiply add. In some embodiments, an apparatus includes a decoder to decode a single instruction having an opcode, a destination field representing a destination operand, a first source field representing a plurality of packed data source operands of a first type that have packed data elements of a first size, a second source field representing a plurality of packed data source operands that have packed data elements of a second size, and a field for a memory location that stores a scalar value. A register file having a plurality of packed data registers includes registers for the plurality of packed data source operands that have packed data elements of a first size, the source operands that have packed data elements of a second size, and the destination operand. Execution circuitry executes the decoded single instruction to perform iterations of packed fused multiply accumulate operations by multiplying packed data elements of the sources of the first type by sub-elements of the scalar value, and adding results of these multiplications to an initial value in a first iteration and a result from a previous iteration in subsequent iterations.
Systems and methods to load a tile register pair
Embodiments detailed herein relate to systems and methods to load a tile register pair. In one example, a processor includes: decode circuitry to decode a load matrix pair instruction having fields for an opcode and source and destination identifiers to identify source and destination matrices, respectively, each matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded load matrix pair instruction to load every element of left and right tiles of the identified destination matrix from corresponding element positions of left and right tiles of the identified source matrix, respectively, wherein the executing operates on one row of the identified destination matrix at a time, starting with the first row.
SYSTEMS, APPARATUSES, AND METHODS FOR CHAINED FUSED MULTIPLY ADD
Embodiments of systems, apparatuses, and methods for chained fused multiply add. In some embodiments, an apparatus includes a decoder to decode a single instruction having an opcode, a destination field representing a destination operand, a first source field representing a plurality of packed data source operands of a first type that have packed data elements of a first size, a second source field representing a plurality of packed data source operands that have packed data elements of a second size, and a field for a memory location that stores a scalar value. A register file having a plurality of packed data registers includes registers for the plurality of packed data source operands that have packed data elements of a first size, the source operands that have packed data elements of a second size, and the destination operand. Execution circuitry executes the decoded single instruction to perform iterations of packed fused multiply accumulate operations by multiplying packed data elements of the sources of the first type by sub-elements of the scalar value, and adding results of these multiplications to an initial value in a first iteration and a result from a previous iteration in subsequent iterations.
Data Processing Method and Device, and Storage Medium
A data processing method and device, and a storage medium are provided. A processor of the data processing device comprises an index register group. Said method comprises: obtaining a first index value of each of at least one index register according to instruction codes, and determining the at least one index register according to the first index value, the instruction codes being generated by a compiler, and the at least one index register being at least one register in the index register group; and acquiring a first content stored in each of the at least one index register, and determining a first vector register according to the first content; and executing the instruction codes by accessing the first vector register.