G06F9/30123

Reducing save restore latency for power control based on write signals

A method of save-restore operations includes monitoring, by a power controller of a parallel processor (such as a graphics processing unit), of a register bus for one or more register write signals. The power controller determines that a register write signal is addressed to a state register that is designated to be saved prior to changing a power state of the parallel processor from a first state to a second state having a lower level of energy usage. The power controller instructs a copy of data corresponding to the state register to be written to a local memory module of the parallel processor. Subsequently, the parallel processor receives a power state change signal and writes state register data saved at the local memory module to an off-chip memory prior to changing the power state of the parallel processor.

Register sharing mechanism to equally allocate disabled thread registers to active threads

An apparatus is disclosed. The apparatus includes one or more processors comprising register sharing circuitry to receive meta-information indicating a number of threads that are to be disabled and provide an indication that an associated thread is disabled, a plurality of General Purpose Register Files (GRFs), wherein one or more of the plurality of GRFs is associated with one of the plurality of threads and a plurality of multiplexers coupled to the one or more GRFs to receive the indication from the register sharing circuitry and disable thread access to an associated GRF based on an indication that a thread is to be disabled.

Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
11579888 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.

SHARED UNIT INSTRUCTION EXECUTION

A data processing apparatus comprises receiver circuitry for receiving instructions from each of a plurality of requester devices. Processing circuitry executes the instructions associated with each of a subset of the requester devices at a time and arbitration circuitry determines the subset of the requester devices and causes the instructions associated with each of the subset of the requester devices to be executed next. In response to the receiver circuitry receiving an instruction of a predetermined type from one of the requester devices outside the subset of requester devices, the arbitration circuitry causes the instruction of the predetermined type to be executed next.

POWER EFFICIENT MEMORY VALUE UPDATES FOR ARM ARCHITECTURES

Disclosed are various examples of providing provide efficient waiting for detection of memory value updates for Advanced RISC Machines (ARM) architectures. An ARM processor component instructs a memory agent to perform a processing action, and executes a waiting function. The waiting function ensures that the processing action is completed by the memory agent. The waiting function performs an exclusive load at a memory location, and a wait for event (WFE) instruction that causes the ARM processor component to wait in a low-power mode for an event register to be set. Once the event register is set, the waiting function completes and a second processing action is executed by the ARM processor component.

OPERATING SYSTEM-LEVEL ASSISTIVE FEATURES FOR CONTEXTUAL PRIVACY

Systems and methods are described that include operations such as detecting a plurality of computing devices configured as a distributed ambient computing system, receiving a request to execute a computing task, obtaining, from the distributed ambient computing system, data representing a device context for at least two of the plurality of devices, generating a combined context corresponding to the distributed ambient computing system, the combined context representing a combination of the device context for the at least two devices, generating and providing at least one decision request based on the computing task and the combined context, receiving a response to the at least one decision request, and triggering execution of the computing task based on the response and the combined context.

EFFICIENT INTER-THREAD COMMUNICATION BETWEEN HARDWARE PROCESSING THREADS OF A HARDWARE MULTITHREADED PROCESSOR BY SELECTIVE ALIASING OF REGISTER BLOCKS
20230229445 · 2023-07-20 ·

A hardware multithreaded processor including a register file, a thread controller, and aliasing circuitry. The thread controller is configured to assign each of multiple hardware processing threads to a corresponding one of multiple register block sets in which each register block set includes at least two of multiple register blocks and in which each register block includes at least two registers. The aliasing circuitry is programmable to redirect a reference provided by a first hardware processing thread to a register of a register block assigned to a second hardware processing thread. The reference may be a register number in an instruction issued by the first hardware processing thread. The register number is converted by the aliasing circuitry to a register file address locating a register of the register block assigned to the second hardware processing thread. The aliasing circuitry may include a programmable register for one or more threads.

Graphics systems and methods for accelerating synchronization using fine grain dependency check and scheduling optimizations based on available shared memory space

Accelerated synchronization operations using fine grain dependency check are disclosed. A graphics multiprocessor includes a plurality of execution units and synchronization circuitry that is configured to determine availability of at least one execution unit. The synchronization circuitry to perform a fine grain dependency check of availability of dependent data or operands in shared local memory or cache when at least one execution unit is available.

Scheduling tasks in a multi-threaded processor
11550591 · 2023-01-10 · ·

A processor comprising: an execution unit for executing a respective thread in each of a repeating sequence of time slots; and a plurality of context register sets, each comprising a respective set of registers for representing a state of a respective thread. The context register sets comprise a respective worker context register set for each of the number of time slots the execution unit is operable to interleave, and at least one extra context register set. The worker context register sets represent the respective states of worker threads and the extra context register set being represents the state of a supervisor thread. The processor is configured to begin running the supervisor thread in each of the time slots, and to enable the supervisor thread to then individually relinquish each of the time slots in which it is running to a respective one of the worker threads.

Initialisation of Worker Threads

A processing device comprising: at least one execution unit configured to interleave execution of a plurality of worker threads, wherein each of the worker threads is configured to execute a same set of code to perform operations on a different set of data held in an input buffer of a memory of the processing device and output the results data to an output buffer. An instruction is executed so as to cause a plurality of operand registers, each of which is associated with one of the worker threads, to be populated with one or more variables enabling each worker to determine where in the input buffer is located its set of input data and where to store its results data.