Patent classifications
G06F9/3802
UN-MARK INSTRUCTIONS ON AN INSTRUCTION MATCH TO REDUCE RESOURCES REQUIRED TO MATCH A GROUP OF INSTRUCTIONS
A method of performing instruction marking in a computer processor architecture includes fetching instructions from a memory unit by a fetching unit in the computer processor architecture. Instruction groups for marking are determined. Fetched instructions are matched to instruction groups for marking. The fetched instructions are marked. Some of the marked instructions are selectively unmarked. The marked and unmarked instructions are forwarded to a queue of instructions for processing in the computer processor architecture.
Systems and methods for performing horizontal tile operations
Disclosed embodiments relate to systems and methods for performing instructions specifying horizontal tile operations. In one example, a processor includes fetch circuitry to fetch an instruction specifying a horizontal tile operation, a location of a M by N source matrix comprising K groups of elements, and locations of K destinations, wherein each of the K groups of elements comprises the same number of elements, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction by generating K results, each result being generated by performing the specified horizontal tile operation across every element of a corresponding group of the K groups, and writing each generated result to a corresponding location of the K specified destination locations.
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
Iterating group sum of multiple accumulate operations
Methods, systems and apparatuses for performing walk operations of single instruction, multiple data (SIMD) instructions are disclosed. One method includes initiating, by a scheduler, a SIMD thread, where the scheduler is operative to schedule the SIMD thread. The method further includes fetching a plurality of instructions for the SIMD thread. The method further includes determining, by a thread arbiter, at least one instruction that is a walk instruction, where the walk instruction iterates a block of instructions for a subset of channels of the SIMD thread, where the walk instruction includes a walk size, and where the walk size is a number of channels in the subset of channels of the SIMD thread that are processed in a walk iteration in association with the walk instruction. The method further includes executing the walk instruction based on the walk size.
Apparatus and method for inhibiting instruction manipulation
An apparatus and method are provided for inhibiting instruction manipulation. The apparatus has execution circuitry for performing data processing operations in response to a sequence of instructions from an instruction set, and decoder circuitry for decoding each instruction in the sequence in order to generate control signals for the execution circuitry. Each instruction comprises a plurality of instruction bits, and the decoder circuitry is arranged to perform a decode operation on each instruction to determine from the value of each instruction bit, and knowledge of the instruction set, the control signals to be issued to the execution circuitry in response to that instruction. An input path to the decoder circuitry comprises a set of wires over which the instruction bits of each instruction are provided. Scrambling circuitry is used to perform a scrambling function on each instruction using a secret scrambling key, such that the wire within the set of wires over which any given instruction bit is provided to the decoder circuitry is dependent on the secret scrambling key. The decode operation performed by the decoder circuitry is then adapted to incorporate a descrambling function using the secret scrambling key to reverse the effect of the scrambling function. As a result, independent of which wire any given instruction bit is provided on, the decode operation is arranged when decoding a given instruction to correctly interpret each instruction bit of that given instruction, based on knowledge of the instruction set, in order to determine from the value of each instruction bit the control signals to be issued to the execution circuitry in response to that given instruction.
INSTRUCTION PRE-FETCHING
Pre-fetching instructions for tasks of an operating system (OS) is provided by calling a task scheduler that determines a load start time for a set of instructions for a particular task corresponding to a task switch condition. The OS calls, and in response to the load start time, a loader entity module that generates a pre-fetch request that loads the set of instructions for the particular task from a non-volatile memory circuit into a random access memory circuit. The OS calls the task scheduler to switch to the particular task.
HARDWARE CONTROLLED INSTRUCTION PRE-FETCHING
A task control circuit maintains, in response to task event information, a task information queue that includes task information for a plurality of tasks. Based upon the task information in the task information queue, a future task switch condition is identified as corresponding to a task switch time for a particular task of the plurality of tasks. A load start time is determined for a set of instructions for the particular task. A pre-fetch request is generated to load the set of instructions for the particular task into the memory circuit. The pre-fetch request is forwarded to a hardware loader circuit. In response to the task switch time, a task event trigger is generated for the particular task. The hardware loader circuit is used to load, in response to the pre-fetch request, the set of instructions from a non-volatile memory into the memory circuit.
Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
SYSTEMS AND METHODS FOR PERFORMING 16-BIT FLOATING-POINT MATRIX DOT PRODUCT INSTRUCTIONS
Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.
ZERO LATENCY PREFETCHING IN CACHES
This invention involves a cache system in a digital data processing apparatus including: a central processing unit core; a level one instruction cache; and a level two cache. The cache lines in the second level cache are twice the size of the cache lines in the first level instruction cache. The central processing unit core requests additional program instructions when needed via a request address. Upon a miss in the level one instruction cache that causes a hit in the upper half of a level two cache line, the level two cache supplies the upper half level cache line to the level one instruction cache. On a following level two cache memory cycle, the level two cache supplies the lower half of the cache line to the level one instruction cache. This cache technique thus prefetches the lower half level two cache line employing fewer resources than an ordinary prefetch.