G06F9/30138

Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
11579888 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

Portions of configuration state registers in-memory

Portions of configuration state registers in-memory. An instruction is obtained, and a determination is made that the instruction accesses a configuration state register. A portion of the configuration state register is in-memory and another portion of the configuration state register is in-processor. Processing associated with the configuration state register is performed. The performing processing is based on a type of access and whether the portion or the other portion is being accessed.

Virtualized Multicore Systems With Extended Instruction Heterogeneity
20230237009 · 2023-07-27 ·

A system on a chip may include a plurality of data plane processor cores sharing a common instruction set architecture. At least one of the data plane processor cores is specialized to perform a particular function via extensions to the otherwise common instruction set architecture. Such systems on a chip may have reduced physical complexity, cost, and time-to-market, and may provide improvements in core utilization and reductions in system power consumption.

Bit width reconfiguration using a shadow-latch configured register file

A processor includes a front-end with an instruction set that operates at a first bit width and a floating point unit coupled to receive the instruction set in the processor that operates at the first bit width. The floating point unit operates at a second bit width and, based upon a bit width assessment of the instruction set provided to the floating point unit, the floating point unit employs a shadow-latch configured floating point register file to perform bit width reconfiguration. The shadow-latch configured floating point register file includes a plurality of regular latches and a plurality of shadow latches for storing data that is to be either read from or written to the shadow latches. The bit width reconfiguration enables the floating point unit that operates at the second bit width to operate on the instruction set received at the first bit width.

PROCESSING DEVICE AND METHOD OF USING A REGISTER CACHE
20220413858 · 2022-12-29 · ·

A processing device is provided which comprises memory, a plurality of registers and a processor. the processor is configured to execute a plurality of portions of a program, allocate a number of the registers per portion of the program such that a number of remaining registers are available as a register cache and transfer data between the number of registers, which are allocated per portion of the program, and the register cache. The processor loads data to the allocated registers to execute a portion of the program, stores data, resulting from execution of the portion, in the register cache, reloads the data in the allocated registers and executes another portion of the program using the data reloaded to the allocated registers and A called function uses the number of allocated registers, which is less than an architectural limit of registers allocated per portion of the program.

Compiler-assisted inter-SIMD-group register sharing

Systems, apparatuses, and methods for efficiently sharing registers among threads are disclosed. A system includes at least a processor, control logic, and a register file with a plurality of registers. The processor assigns a base set of registers to each thread of a plurality of threads executing on the processor. When a given thread needs more than the base set of registers to execute a given phase of program code, the given thread executes an acquire instruction to acquire exclusive access to an extended set of registers from a shared resource pool. When the given thread no longer needs additional registers, the given thread executes a release instruction to release the extended set of registers back into the shared register pool for other threads to use. In one implementation, the compiler inserts acquire and release instructions into the program code based on a register liveness analysis performed during compilation.

DATA PROCESSING

Data processing circuitry comprises out-of-order instruction execution circuitry; register mapping circuitry to map zero or more architectural processor registers relating to execution of that program instruction to respective ones of a set of physical processor registers; commit circuitry to commit, in a program code order, the results of executed program instructions, the commit circuitry being configured to access a data store which stores register tag data to indicate which physical registers mapped by the register mapping circuitry relate to a given program instruction; fault detection circuitry to detect a memory access fault in respect of a vector memory access operation and to generate fault indication data indicative of an element earliest in the element order for which a memory access fault was detected; a fault indication register to store the fault indication data, in which the register mapping circuitry is configured to generate a register mapping for a program instruction for any architectural processor registers relating to execution of that program instruction other than the fault indication register; and control circuitry to encode the fault indication data, applicable to a program instruction not yet committed by the commit circuitry, to register tag data associated with that program instruction.

Hierarchical general register file (GRF) for execution block

In an example, an apparatus comprises a plurality of execution units, and a first general register file (GRF) communicatively couple to the plurality of execution units, wherein the first GRF is shared by the plurality of execution units. Other embodiments are also disclosed and claimed.

SYSTEMS AND METHODS FOR SYNCHRONIZING DATA PROCESSING IN A CELLULAR MODEM

A cellular modem processor can include dedicated processing engines that implement specific, complex data processing operations. The processing engines can be arranged in pipelines, with different processing engines executing different steps in a sequence of operations. Flow control or data synchronization between pipeline stages can be provided using a hybrid of firmware-based flow control and hardware-based data dependency management. Firmware instructions can define data flow by reference to a virtual address space associated with pipeline buffers. A hardware interlock controller within the pipeline can track and enforce the data dependencies for the pipeline.

Hierarchical register file device based on spin transfer torque-random access memory

The embodiments provide a register file device which increases energy efficiency using a spin transfer torque-random access memory for a register file used to compute a general purpose graphic processing device, and hierarchically uses a register cache and a buffer together with the spin transfer torque-random access memory, to minimize leakage current, reduce a write operation power, and solve the write delay.