G06F9/3012

Loop thread order execution control of a multi-threaded, self-scheduling reconfigurable computing fabric
11675734 · 2023-06-13

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.

Dual vector arithmetic logic unit

A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general-purpose register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.

Vector processor and control method therefor

A vector processor is disclosed. The vector processor includes a plurality of register files provided to each of a plurality of single instruction multiple data (SIMD) lanes, storing each of a plurality of pieces of data, and respectively outputting input data to be used in a current cycle among the plurality of pieces of data, a shuffle unit for receiving a plurality of pieces of input data outputted from the plurality of register files, and performing shuffling such that the received plurality of pieces of input data respectively correspond to the plurality of SIMD lanes and outputting the same; and a command execution unit for performing a parallel operation by receiving input data outputted from the shuffle unit.
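
As a rough illustration of the shuffle step described in this abstract, the sketch below routes one value per SIMD lane through a permutation before a lane-wise operation. The names `shuffle`, `execute`, and the `source_lane` table are hypothetical and not taken from the patent; the abstract does not say how the routing is encoded.

```python
def shuffle(lane_outputs, source_lane):
    """Return one value per SIMD lane, taken from the register file named in source_lane.

    lane_outputs[i] is the value that register file i outputs this cycle.
    source_lane[i] is the register file whose value lane i should receive
    (an assumed encoding of the shuffle control).
    """
    return [lane_outputs[src] for src in source_lane]


def execute(op, a_values, b_values):
    """Apply the same operation to every lane (the parallel operation)."""
    return [op(a, b) for a, b in zip(a_values, b_values)]


# Rotate the register-file outputs by one lane, then add lane-wise.
rf_outputs = [10, 20, 30, 40]
shuffled = shuffle(rf_outputs, [3, 0, 1, 2])
result = execute(lambda a, b: a + b, rf_outputs, shuffled)
```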

Processor having read shifter and controlling method using the same
11263013 · 2022-03-01

A processor that includes a register file, a read shifter, a decode unit and a plurality of functional units is introduced. The register file includes a read port. The read shifter includes a plurality of shifter entries and is configured to shift out a shifter entry among the plurality of shifter entries every clock cycle. Each of the plurality of shifter entries is associated with a clock cycle and each of the plurality of shifter entries comprises a read value that indicates an availability of the read port of the register file for a read operation in the clock cycle. The decode unit is coupled to the read shifter and is configured to decode and issue an instruction based on the read values included in the plurality of shifter entries of the read shifter. The plurality of functional units is coupled to the decode unit and the register file and is configured to execute the instruction issued by the decode unit and perform the read operation to the read port of the register file.
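
The read shifter in this abstract can be pictured as a small per-cycle reservation table for the read port. The following is a minimal sketch under stated assumptions: a single read port, a boolean read value where `True` means the port is already reserved, and a hypothetical `read_delay` parameter giving how many cycles after issue the instruction reads. None of these names come from the patent.

```python
from collections import deque


class ReadShifter:
    """Toy model: entries[i] tells whether the read port is reserved i cycles from now."""

    def __init__(self, depth):
        self.entries = deque([False] * depth)

    def tick(self):
        # Each clock cycle, shift out the entry for the elapsed cycle
        # and append a fresh, unreserved entry at the far end.
        self.entries.popleft()
        self.entries.append(False)

    def try_issue(self, read_delay):
        # The decode stage issues only if the port is free in the cycle
        # in which the instruction would perform its read.
        if self.entries[read_delay]:
            return False
        self.entries[read_delay] = True
        return True


rs = ReadShifter(depth=4)
first = rs.try_issue(read_delay=2)   # reserves the port two cycles ahead
second = rs.try_issue(read_delay=2)  # conflicts with the first reservation
rs.tick()                            # one cycle passes; the reservation moves closer
third = rs.try_issue(read_delay=2)   # a different cycle is now two ahead
```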

DATA PROCESSING METHOD AND DEVICE
20170315811 · 2017-11-02

Provided is a data processing method including the operations of storing, in a register, a first immediate portion included in a first instruction, from among the first immediate portion and a second immediate portion that constitute an immediate value, which is an operand; determining the immediate value by concatenating the second immediate portion included in a second instruction with the stored first immediate portion; and performing an operation by using a value indicated by the second instruction and the determined immediate value.
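
The two-instruction immediate described in this abstract might look like the sketch below. It assumes the first portion supplies the upper bits and the second portion the lower bits; the abstract does not state the order. The class `Core`, its method names, and the 8-bit width are hypothetical.

```python
def concat_immediate(first_portion, second_portion, second_width):
    """Join two immediate portions; first_portion becomes the upper bits (assumed order)."""
    mask = (1 << second_width) - 1
    return (first_portion << second_width) | (second_portion & mask)


class Core:
    def __init__(self):
        self.imm_reg = 0  # register holding the stored first immediate portion

    def exec_prefix(self, first_portion):
        # First instruction: only stores its immediate portion.
        self.imm_reg = first_portion

    def exec_add(self, reg_value, second_portion, second_width=8):
        # Second instruction: rebuilds the full immediate, then operates on it
        # together with the value the instruction indicates (here, reg_value).
        imm = concat_immediate(self.imm_reg, second_portion, second_width)
        return reg_value + imm


core = Core()
core.exec_prefix(0x12)
total = core.exec_add(reg_value=1, second_portion=0x34)
```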

Fast mapping table register file allocation algorithm for SIMT processors
09798543 · 2017-10-24

One embodiment of the present invention sets forth a technique for allocating register file entries included in a register file to a thread group. A request to allocate a number of register file entries to the thread group is received. A required number of mapping table entries included in a register file mapping table (RFMT) is determined based on the request, where each mapping table entry included in the RFMT is associated with a different plurality of register file entries included in the register file. The RFMT is parsed to locate an available mapping table entry in the RFMT for each of the required mapping table entries. For each available mapping table entry, a register file pointer is associated with an address that corresponds to a first register file entry in the plurality of register file entries associated with the available mapping table entry.
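
The allocation steps in this abstract can be sketched as follows. This is an illustration only: the RFMT is modeled as a list of in-use flags, each mapping table entry is assumed to cover a fixed-size block of consecutive register file entries, and the function name `allocate` and parameter `block_size` are hypothetical.

```python
def allocate(rfmt_in_use, num_regs, block_size):
    """Allocate num_regs register file entries to a thread group.

    rfmt_in_use[i] is True when mapping table entry i is already taken.
    Returns the register file pointers (address of the first register file
    entry in each chosen block), or None if the request cannot be met.
    """
    # Number of mapping table entries needed, rounded up.
    required = -(-num_regs // block_size)

    # Parse the table to locate available mapping table entries.
    free = [i for i, used in enumerate(rfmt_in_use) if not used]
    if len(free) < required:
        return None

    chosen = free[:required]
    for i in chosen:
        rfmt_in_use[i] = True

    # Each pointer is the address of the first register in that block.
    return [i * block_size for i in chosen]


rfmt = [True, False, False, True, False]
pointers = allocate(rfmt, num_regs=10, block_size=8)
```

Because the chosen blocks need not be adjacent, the thread group's registers can be scattered across the register file and still be reached through the pointers.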

Memory access for a vector processor

A method and device for memory access in processors is provided. A processor, comprising a plurality of computational units, is capable of executing a single instruction on multiple pieces of data simultaneously (SIMD). A read operation is initiated to load data from memory into the plurality of computational units (CUs) arranged into a plurality of CU groups. The memory is arranged into a plurality of memory macro-blocks each associated with a respective CU group of the plurality of CU groups. For each CU group a respective first memory address is determined and for each CU group, the data in the associated memory macro-block is accessed at the respective first memory address.
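
A simplified picture of the per-group read in this abstract: each CU group has its own memory macro-block and its own start address. The sketch assumes the CUs of a group read consecutive words starting at that address, which the abstract does not specify; `simd_read` and `cus_per_group` are hypothetical names.

```python
def simd_read(macro_blocks, group_addresses, cus_per_group):
    """Load one word per CU, group by group.

    macro_blocks[g] is the memory macro-block associated with CU group g.
    group_addresses[g] is the first memory address determined for group g.
    """
    loaded = []
    for block, addr in zip(macro_blocks, group_addresses):
        # Assumed layout: the group's CUs take consecutive words from addr.
        loaded.append(block[addr:addr + cus_per_group])
    return loaded


blocks = [[0, 1, 2, 3], [10, 11, 12, 13]]
data = simd_read(blocks, group_addresses=[1, 2], cus_per_group=2)
```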

VLIW processor including a state register for inter-slot data transfer and extended bits operations
09798547 · 2017-10-24

A very long instruction word (VLIW) processor that performs efficient processing including extended bits operations is provided. The VLIW processor includes an instruction control unit, a register file unit, and an instruction execution unit. The instruction execution unit includes a plurality of slots, and a state register arranged between the second slot and the third slot to transfer N-bit data between the second and third slots. The VLIW processor stores data output from the third slot into the state register and uses the data, and thus achieves efficient processing including bit-expanded operations, such as processing performed in response to instructions commonly used in image processing, image recognition, and other processing, while preventing scaling up of the circuit.

Unaligned instruction relocation

In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unaligned ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.

ROTATIONAL DISPATCH FOR PARALLEL SLICE PROCESSOR

Supplemental instruction dispatch may be used in some instances in a parallel slice processor to dispatch additional instructions, referred to as supplemental instructions, to supplemental instruction ports of execution slices and using primary instruction ports of one or more execution slices to supply one or more source operands for such supplemental instructions. In addition, in some instances, in lieu of or in addition to supplemental instruction dispatch, selective slice partitioning may be used to selectively partition groups of execution slices in a parallel slice processor based upon a threading mode within which such execution slices are executing.