G06F9/3012

Dedicated vector sub-processor system

A processor includes a plurality of vector sub-processors (VSPs) and a plurality of memory banks dedicated to respective VSPs. A first memory bank corresponding to a first VSP includes a first plurality of high vector general purpose register (VGPR) banks and a first plurality of low VGPR banks corresponding to the first plurality of high VGPR banks. The first memory bank further includes a plurality of operand gathering components that store operands from respective high VGPR banks and low VGPR banks. The operand gathering components are assigned to individual threads while the threads are executed by the first VSP.
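
A minimal Python sketch of the operand-gathering idea as described above: gatherer objects assigned to a thread stage operands read from paired high and low VGPR banks of the memory bank dedicated to one VSP. All names, sizes, and the class structure are illustrative assumptions, not the patented design.

    # Minimal sketch (assumed names and sizes): operand gathering components
    # assigned per thread, staging operands from paired high/low VGPR banks.

    class OperandGatherer:
        def __init__(self):
            self.thread_id = None       # assigned while the thread runs on the VSP
            self.operands = []          # operands staged from the VGPR banks

        def assign(self, thread_id):
            self.thread_id = thread_id
            self.operands.clear()

        def gather(self, high_bank, low_bank, reg_index):
            # Stage the operand halves read from the corresponding high and low banks.
            self.operands.append((high_bank[reg_index], low_bank[reg_index]))

    # Illustrative use: one memory bank dedicated to one VSP.
    high_vgpr_banks = [[0] * 16 for _ in range(4)]   # hypothetical 4 high banks
    low_vgpr_banks = [[0] * 16 for _ in range(4)]    # paired low banks
    gatherers = [OperandGatherer() for _ in range(4)]

    gatherers[0].assign(thread_id=7)                 # gatherer held by thread 7
    gatherers[0].gather(high_vgpr_banks[0], low_vgpr_banks[0], reg_index=3)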

Circular buffer accessing device, system and method

A device includes a circular buffer, which, in operation, is organized into a plurality of subsets of buffers, and control circuitry coupled to the circular buffer. The control circuitry, in operation, receives a memory load command to load a set of data into the circular buffer. The memory load command has an offset parameter indicating a data offset and a subset parameter indicating a subset of the plurality of subsets into which the circular buffer is organized. The control circuitry responds to the command by identifying a set of buffer addresses of the circular buffer based on a value of the offset parameter and a value of the subset parameter, and loading the set of data into the circular buffer using the identified set of buffer addresses.
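
As a minimal sketch of the address identification step, assuming a flat list as the circular buffer and evenly sized subsets (both assumptions, not taken from the abstract): the load command derives its buffer addresses from the subset base and the offset, wrapping within the subset.

    # Minimal sketch (assumed sizing): a circular buffer organized into subsets,
    # with a load command resolving buffer addresses from offset and subset values.

    BUFFER_SIZE = 64
    SUBSET_COUNT = 4
    SUBSET_SIZE = BUFFER_SIZE // SUBSET_COUNT

    circular_buffer = [None] * BUFFER_SIZE

    def load_command(data, offset, subset):
        # Identify the set of buffer addresses from the subset base and the data
        # offset, wrapping within the subset to keep the buffer circular.
        base = subset * SUBSET_SIZE
        addresses = [base + (offset + i) % SUBSET_SIZE for i in range(len(data))]
        for addr, value in zip(addresses, data):
            circular_buffer[addr] = value
        return addresses

    # Illustrative use: load four words into subset 2 starting at offset 14 (wraps).
    print(load_command([10, 11, 12, 13], offset=14, subset=2))   # [46, 47, 32, 33]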

Method and system for instruction block to execution unit grouping
11656875 · 2023-05-23

A method for emulating a guest centralized flag architecture by using a native distributed flag architecture. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks, wherein each of the instruction blocks comprises two half blocks; scheduling the instructions of the instruction blocks to execute in accordance with a scheduler; and using a distributed flag architecture to emulate a centralized flag architecture during guest instruction execution.
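
A minimal sketch of the grouping step only, assuming a fixed hypothetical half-block size of four instructions; the scheduler and the distributed flag emulation are not modeled.

    # Minimal sketch (assumed block sizing): grouping an incoming instruction
    # sequence into blocks, each split into two half blocks, before scheduling.

    HALF_BLOCK_SIZE = 4   # hypothetical number of instructions per half block

    def group_into_blocks(instructions):
        block_size = 2 * HALF_BLOCK_SIZE
        blocks = []
        for i in range(0, len(instructions), block_size):
            chunk = instructions[i:i + block_size]
            # Each block is made of two half blocks handed to the scheduler together.
            blocks.append((chunk[:HALF_BLOCK_SIZE], chunk[HALF_BLOCK_SIZE:]))
        return blocks

    # Illustrative use with placeholder instruction names.
    print(group_into_blocks([f"inst{i}" for i in range(10)]))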

METHOD AND SYSTEM FOR OPTIMIZING DATA TRANSFER FROM ONE MEMORY TO ANOTHER MEMORY
20230111058 · 2023-04-13

A method and system for moving data from a source memory to a destination memory by a processor are disclosed herein. The destination memory stores a sequence of instructions and the sequence of instructions comprises one or more load instructions and one or more store instructions. The processor initially moves the one or more store instructions from the destination memory to the source memory. The processor then executes the one or more load instructions from the destination memory. On executing the one or more load instructions, the data is loaded from the source memory to at least one register in the processor. The processor further initiates execution of the one or more store instructions stored in the source memory. On executing the one or more store instructions from the source memory, the processor stores the data from the at least one register to the destination memory.
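
A minimal sketch of the three-step sequence, using dictionaries as stand-ins for the two memories and a plain list for the registers; the instruction encodings are placeholders.

    # Minimal sketch (assumed layout): the load instructions run from the
    # destination memory to fill registers from the source; the store
    # instructions, first copied into the source memory, then run to write
    # those registers back into the destination memory.

    source_memory = {"data": [1, 2, 3, 4], "stores": None}
    destination_memory = {"loads": "LOAD src.data -> r0..r3",
                          "stores": "STORE r0..r3 -> dst.data",
                          "data": [0, 0, 0, 0]}
    registers = [0, 0, 0, 0]

    # Step 1: move the store instructions from destination to source memory.
    source_memory["stores"] = destination_memory["stores"]

    # Step 2: execute the load instructions held in destination memory.
    registers[:] = source_memory["data"]

    # Step 3: execute the store instructions now held in source memory.
    destination_memory["data"][:] = registers

    print(destination_memory["data"])   # [1, 2, 3, 4]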

Vector table load instruction with address generation field to access table offset value

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
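
A minimal sketch of a table lookup loop, assuming a single lookup table and a vector of offsets; each loop iteration stands in for one table-lookup vector instruction.

    # Minimal sketch (assumed shapes): a table lookup loop that gathers table
    # values for a vector of offsets.

    lookup_table = [v * v for v in range(256)]      # hypothetical table in memory

    def table_lookup_loop(offset_vector, table):
        # Each iteration performs one table-lookup vector instruction, fetching
        # the table entry addressed by that lane's offset value.
        results = []
        for offset in offset_vector:
            results.append(table[offset])
        return results

    print(table_lookup_loop([3, 7, 11], lookup_table))   # [9, 49, 121]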

Wavefront selection and execution
11656877 · 2023-05-23

Techniques are provided for executing wavefronts. The techniques include: at a first time for issuing instructions for execution, performing first identifying, including identifying that sufficient processing resources exist to execute a first set of instructions together within a processing lane; in response to the first identifying, executing the first set of instructions together; at a second time for issuing instructions for execution, performing second identifying, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and in response to the second identifying, executing an instruction independently of any other instruction.
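
A minimal sketch of the issue-time decision, assuming a simple per-lane resource budget and per-instruction costs (both invented for illustration): co-issue a set when it fits, otherwise issue a single instruction independently.

    # Minimal sketch (assumed resource model): at each issue time, co-issue a set
    # of instructions when a lane has enough resources, otherwise issue one alone.

    LANE_RESOURCES = 4   # hypothetical resource units available per processing lane

    def issue(candidate_sets):
        # candidate_sets: lists of (instruction, cost) tuples proposed for co-issue.
        for candidates in candidate_sets:
            if sum(cost for _, cost in candidates) <= LANE_RESOURCES:
                return [inst for inst, _ in candidates]      # execute together
        # No set fits: fall back to executing a single instruction independently.
        first_set = candidate_sets[0]
        return [first_set[0][0]]

    print(issue([[("v_add", 2), ("v_mul", 2)]]))          # co-issued
    print(issue([[("v_fma", 3), ("v_sqrt", 3)]]))         # issued alone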

SYSTEM AND METHOD FOR USING VIRTUAL VECTOR REGISTER FILES

Described is a system and method for using virtual vector register files. In particular, a graphics processor includes a logic unit, a virtual vector register file coupled to the logic unit, a vector register backing store coupled to the virtual vector register file, and a virtual vector register file controller coupled to the virtual vector register file. The virtual vector register file includes an N-deep vector register file and an M-deep vector register file, where N is less than M. The virtual vector register file controller performs eviction and allocation among the N-deep vector register file, the M-deep vector register file, and the vector register backing store, dependent on at least access requests for certain vector registers.
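
A minimal sketch of the controller's eviction and allocation between the two register files and the backing store, assuming a least-recently-used policy, which the abstract does not specify.

    # Minimal sketch (assumed LRU policy): recently requested vector registers
    # stay in the shallow (N-deep) file, spilling to the deeper (M-deep) file and
    # then to the backing store.

    from collections import OrderedDict

    N_DEEP, M_DEEP = 4, 16     # N < M, as in the abstract

    class VirtualVRFController:
        def __init__(self):
            self.fast = OrderedDict()      # N-deep vector register file
            self.slow = OrderedDict()      # M-deep vector register file
            self.backing_store = {}        # vector register backing store

        def access(self, vreg):
            # Promote on access; evict least recently used registers downward.
            value = (self.fast.pop(vreg, None) or self.slow.pop(vreg, None)
                     or self.backing_store.pop(vreg, f"data({vreg})"))
            self.fast[vreg] = value
            if len(self.fast) > N_DEEP:
                spilled, spilled_val = self.fast.popitem(last=False)
                self.slow[spilled] = spilled_val
            if len(self.slow) > M_DEEP:
                spilled, spilled_val = self.slow.popitem(last=False)
                self.backing_store[spilled] = spilled_val
            return value

    ctrl = VirtualVRFController()
    for r in ["v0", "v1", "v2", "v3", "v4", "v0"]:
        ctrl.access(r)
    print(list(ctrl.fast), list(ctrl.slow))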

METHOD AND APPARATUS FOR LEVERAGING SIMULTANEOUS MULTITHREADING FOR BULK COMPUTE OPERATIONS

Apparatus and method for leveraging simultaneous multithreading for bulk compute operations. For example, one embodiment of a processor comprises: a plurality of cores including a first core to simultaneously process instructions of a plurality of threads; a cache hierarchy coupled to the first core and a memory, the cache hierarchy comprising a Level 1 (L1) cache, a Level 2 (L2) cache, and a Level 3 (L3) cache; and a plurality of compute units coupled to the first core including a first compute unit associated with the L1 cache, a second compute unit associated with the L2 cache, and a third compute unit associated with the L3 cache, wherein the first core is to offload instructions for execution by the compute units, the first core to offload instructions from a first thread to the first compute unit, instructions from a second thread to the second compute unit, and instructions from a third thread to the third compute unit.
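
A minimal sketch of the offload path, assuming a static mapping from SMT thread to the compute unit at each cache level; queues and instruction strings are placeholders.

    # Minimal sketch (assumed mapping): a core with SMT threads offloading bulk
    # compute work to compute units sitting next to the L1, L2, and L3 caches.

    compute_units = {"L1": [], "L2": [], "L3": []}      # work queues per compute unit
    thread_to_unit = {0: "L1", 1: "L2", 2: "L3"}        # hypothetical static mapping

    def offload(thread_id, instructions):
        # The core steers each thread's offloaded instructions to the compute unit
        # associated with one level of the cache hierarchy.
        unit = thread_to_unit[thread_id]
        compute_units[unit].extend(instructions)
        return unit

    offload(0, ["memset A"])
    offload(1, ["memcpy B -> C"])
    offload(2, ["reduce D"])
    print(compute_units)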

Renaming with generation numbers

A processor including a register file having a plurality of registers, and configured for out-of-order instruction execution, further includes a renamer unit that produces generation numbers that are associated with register file addresses to provide a renamed version of a register that is temporally offset from an existing version of that register rather than assigning a non-programmer-visible physical register as the renamed register.
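
A minimal sketch of generation-number renaming: each new write to an architectural register address bumps a generation counter instead of claiming a separate physical register. The class and method names are illustrative.

    # Minimal sketch (assumed structure): a renamer that tags each new write to an
    # architectural register with an incremented generation number, rather than
    # mapping it onto a non-programmer-visible physical register.

    from collections import defaultdict

    class GenerationRenamer:
        def __init__(self):
            self.generation = defaultdict(int)   # current generation per register address

        def rename_write(self, reg):
            # A new write creates a temporally offset version of the same register.
            self.generation[reg] += 1
            return (reg, self.generation[reg])

        def rename_read(self, reg):
            # A read refers to the most recent generation of that register.
            return (reg, self.generation[reg])

    r = GenerationRenamer()
    print(r.rename_write("r5"))   # ('r5', 1)
    print(r.rename_read("r5"))    # ('r5', 1)
    print(r.rename_write("r5"))   # ('r5', 2) -- renamed version, same register address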

Loop execution control for a multi-threaded, self-scheduling reconfigurable computing fabric using a reenter queue
11675598 · 2023-06-13

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.
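
A minimal sketch of the two configuration memories, assuming a small instruction memory and dictionary-encoded spoke entries (both illustrative): a spoke entry selects the master synchronous input, the current data path configuration instruction, and the index passed on for the next configurable circuit.

    # Minimal sketch (assumed encoding): a configurable circuit whose spoke
    # instruction selects the master synchronous input, the current data path
    # configuration instruction, and the index handed to the next circuit.

    datapath_instructions = ["ADD", "MUL", "SHIFT", "PASS"]     # first, instruction memory

    # Second memory: spoke instructions paired with data path instruction indices.
    spoke_memory = [
        {"master_input": 0, "current_index": 1, "next_index": 2},
        {"master_input": 3, "current_index": 3, "next_index": 0},
    ]

    def configure(spoke_slot):
        spoke = spoke_memory[spoke_slot]
        current = datapath_instructions[spoke["current_index"]]
        # The next index is forwarded so the next configurable circuit can select
        # its own data path configuration instruction.
        return spoke["master_input"], current, spoke["next_index"]

    print(configure(0))    # (0, 'MUL', 2)
    print(configure(1))    # (3, 'PASS', 0)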