G06F15/8076

Vector register file

An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder.

EXECUTION UNIT SHARING BETWEEN PROCESSING CORES IN A CLUSTER OF A SYSTEM-ON-CHIP (SOC)

A method of execution unit (EU) sharing between processor cores is described. The method includes encountering a structural hazard associated with an issued instruction in an instruction queue of a dispatch stage inside an active processor core. The method also includes issuing a request for an idle execution unit of an inactive processor core. The method further includes sending a transaction containing source operands of the issued instruction, and a word address of a result buffer as a destination operand to an allocated EU of the inactive processor core. The method also includes replacing the issued instruction in the instruction queue with a load operation to forward a result of the issued instruction from the result buffer based on the word address.

Vector register file

An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder.

APPARATUS AND METHOD FOR TRANSFERRING A PLURALITY OF DATA STRUCTURES BETWEEN MEMORY AND A PLURALITY OF VECTOR REGISTERS

An apparatus and method are provided for transferring a plurality of data structures between memory and a plurality of vector registers, each vector register being arranged to store a vector operand comprising a plurality of data elements. Access circuitry is used to perform access operations to move data elements of vector operands between the data structures in memory and specified vector registers, each data structure comprising multiple data elements stored at contiguous addresses in the memory. Decode circuitry is responsive to a single access instruction identifying a plurality of vector registers and a plurality of data structures that are located discontiguously with respect to each other in the memory, to generate control signals to control the access circuitry to perform a sequence of access operations to move the plurality of data structures between the memory and the plurality of vector registers such that the vector operand in each vector register holds a corresponding data element from each of the plurality of data structures. This provides a very efficient mechanism for performing complex access operations, resulting in an increase in execution speed, and potential reductions in power consumption.

Execution unit sharing between processing cores in a cluster of a system-on-chip (SoC)

A method of execution unit (EU) sharing between processor cores is described. The method includes encountering a structural hazard associated with an issued instruction in an instruction queue of a dispatch stage inside an active processor core. The method also includes issuing a request for an idle execution unit of an inactive processor core. The method further includes sending a transaction containing source operands of the issued instruction, and a word address of a result buffer as a destination operand to an allocated EU of the inactive processor core. The method also includes replacing the issued instruction in the instruction queue with a load operation to forward a result of the issued instruction from the result buffer based on the word address.

Signal processing apparatus and non-transitory computer-readable storage medium
12405794 · 2025-09-02 · ·

This invention provides a signal processing apparatus which comprises a CPU and a programmable signal processor which has a plurality of execution units, a register file having a plurality of registers connected serially, and a shift controller which issues a shift signal to shift data held in the register file when each of the execution units has completed execution of a program of one cycle. Each of the execution units, after the execution of the program of one cycle, re-executes the program in accordance with the reception of the shift signal. The CPU stores, to a program memory in each of the execution units, a program including instructions to store data of a result of processing to one of registers so that the other execution unit refers the data in next cycle.

Offloading processing tasks to decoupled accelerators for increasing performance in a system on a chip

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.