Patent classifications
G06F9/30014
Extended memory operations
Systems, apparatuses, and methods related to extended memory operations are described. Extended memory operations can include operations specified by a single address and operand and may be performed by a computing device that includes a processing unit and a memory resource. The computing device can perform extended memory operations on data streamed through the computing tile without receipt of intervening commands. In an example, a computing device is configured to receive a command to perform an operation that comprises performing an operation on a data with the processing unit of the computing device and determine that an operand corresponding to the operation is stored in the memory resource. The computing device can further perform the operation using the operand stored in the memory resource.
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
Implementing 128-bit SIMD operations on a 64-bit datapath
A method of implementing a processor architecture and corresponding system includes operands of a first size and a datapath of a second size. The second size is different from the first size. Given a first array of registers and a second array of registers, each register of the first and second arrays being of the second size, selecting a first register and corresponding second register from the first array and the second array, respectively, to perform operations of the first size. This allows a user, who is interfacing with the hardware processor through software, to provide data of the datapath bit-width instead of the register bit-width. Advantageously, the user is agnostic to the size of the registers.
Computing device and method
The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.
Arrangements for communicating and processing data in a computing system
Systems and methods for reducing data movement in a computer system. The systems and methods use information or knowledge about the structure of an algorithm, operations to be executed at a receiving processing unit, variables or subsets or groups of variables in a distributed algorithm, or other forms of contextual information, for reducing the number of bits transmitted from at least one transmitting processing unit to at least one receiving processing unit or storage device.
Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
Bit string operations in memory
Systems, apparatuses, and methods related to bit string operations in memory are described. The bit string operations may be performed within a memory array without transferring the bit strings or intermediate results of the operations to circuitry external to the memory array. For instance, sensing circuitry that can include a sense amplifier and a compute component can be coupled to a memory array. A controller can be coupled to the sensing circuitry and can be configured to cause one or more bit strings that are formatted according to a universal number format or a posit format to be transferred from the memory array to the sensing circuitry. The sensing circuitry can perform an arithmetic operation, a logical operation, or both using the one or more bit strings.
SYSTEMS, METHODS, AND APPARATUSES FOR TILE LOAD
Embodiments detailed herein relate to matrix operations. In particular, the loading of a matrix (tile) from memory. For example, support for a loading instruction is described in the form of decode circuitry to decode an instruction having fields for an opcode, a destination matrix operand identifier, and source memory information, and execution circuitry to execute the decoded instruction to load groups of strided data elements from memory into configured rows of the identified destination matrix operand to memory.
SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS
Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
Methods and Apparatus for Efficient Denormal Handling In Floating-Point Units
A floating-point (FP) arithmetic unit includes a first FP execution pipeline operatively coupled to a register file, the first FP execution pipeline configured to perform a first FP operation on a first FP operand provided by the register file, the first FP execution pipeline comprising a plurality of execution units; and a first normalization unit operatively coupled to the register file, and the first FP execution pipeline, the first normalization unit configured to normalize the first FP operand, wherein the first normalization unit is configured to operate in parallel with the first FP execution pipeline, and is further configured to, in response to detecting that the first FP operand is a denormal, assert a first FP execution pipeline busy flag to stall the instruction dispatch of a first subsequent FP operation, the first FP operation and the first subsequent FP operation being of one FP operation type.