G06F9/3009

MANAGING RETURN PARAMETER ALLOCATION
20230058935 · 2023-02-23 ·

A hybrid threading processor (HTP) supports thread creation by executing an instruction that indicates an amount of storage space to reserve for return values. Before a thread is created, the indicated amount of space is reserved. The newly created child thread sends a return packet back to the parent thread when the child thread completes. The thread writes its return information into the reserved space and waits for the parent thread to execute a thread join instruction. The thread join instruction takes the returned information from the reserved space and transfers it to the parent thread's register state. The reserved space is released once the child thread is joined. Using a configurable amount of space for each child thread may allow for more child threads to be executed simultaneously.

MECHANISM TO PROVIDE RELIABLE RECEIPT OF EVENT MESSAGES
20230056665 · 2023-02-23 ·

Devices and techniques for providing receipts for event messages in a processor are described herein. A system includes multiple memory-compute nodes coupled to one another over a scale fabric; a set of registers; and an event manager hardware circuitry to: receive an event message corresponding to an event, and the event associated with an event mode; track a counter value representing a number of received event messages related to the event, the counter value stored in the set of registers; compare the number of received event messages to a trigger value; and in response to the number of received event messages equaling the trigger value: use an atomic operation to reset the counter value in the set of registers while maintaining the event mode; and alert a thread of the event.

Synchronization amongst processor tiles

A processing system comprising an arrangement of tiles and an interconnect between the tiles. The interconnect comprises synchronization logic for coordinating a barrier synchronization to be performed between a group of the tiles. The instruction set comprises a synchronization instruction taking an operand which selects one of a plurality of available modes each specifying a different membership of the group. Execution of the synchronization instruction cause a synchronization request to be transmitted from the respective tile to the synchronization logic, and instruction issue to be suspended on the respective tile pending a synchronization acknowledgement being received back from the synchronization logic. In response to receiving the synchronization request from all the tiles in the group as specified by the operand of the synchronization instruction, the synchronization logic returns the synchronization acknowledgment to the tiles in the specified group.

Processor and pipeline processing method for processing multiple threads including wait instruction processing

A pipeline processing unit includes a fetch unit that fetches the instruction for the thread having an execution right, a decoding unit that decodes the instruction fetched by the fetch unit, and a computation execution unit that executes the instruction decoded by the decoding unit. When the WAIT instruction for the thread having the execution right is executed, an instruction holding unit holds instruction fetch information on a processing target instruction to be processed immediately after the WAIT instruction. An execution target thread selection unit selects a thread to be executed based on a wait command and, in response to a wait state started from the execution of the WAIT instruction being canceled, processes the processing target instruction from decoding thereof based on the instruction fetch information on the processing target instruction held in the instruction holding unit.

Synchronizing scheduling tasks with atomic ALU

A method of synchronizing a group of scheduled tasks within a parallel processing unit into a known state is described. The method uses a synchronization instruction in a scheduled task which triggers, in response to decoding of the instruction, an instruction decoder to place the scheduled task into a non-active state and forward the decoded synchronization instruction to an atomic ALU for execution. When the atomic ALU executes the decoded synchronization instruction, the atomic ALU performs an operation and check on data assigned to the group ID of the scheduled task and if the check is passed, all scheduled tasks having the particular group ID are removed from the non-active state.

Computer program, method, and device for distributing resources of computing device
11500634 · 2022-11-15 · ·

A computer program stored in a computer readable storage medium is provided. It includes encoded commands, in which when the computer program is executed by one or more processors of a computer system. The computer program allows the one or more processors to perform certain commands for distributing resources of a computing device.

SCHEDULING TASKS USING SWAP FLAGS
20230097760 · 2023-03-30 ·

A method of activating scheduling instructions within a parallel processing unit is described. The method comprises decoding, in an instruction decoder, an instruction in a scheduled task in an active state and checking, by an instruction controller, if a swap flag is set in the decoded instruction. If the swap flag in the decoded instruction is set, a scheduler is triggered to de-activate the scheduled task by changing the scheduled task from the active state to a non-active state.

Conditional branching control for a multi-threaded, self-scheduling reconfigurable computing fabric
11573796 · 2023-02-07 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.

Computational partition for a multi-threaded, self-scheduling reconfigurable computing fabric
11573834 · 2023-02-07 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an asynchronous packet network; a plurality of configurable circuits arranged in an array, each configurable circuit coupled to the asynchronous packet network and adapted to perform a plurality of computations; and a dispatch interface circuit adapted to partition the plurality of configurable circuits into one or more separate partitions of configurable circuits and to load one or more computation kernels into each partition of configurable circuits. The dispatch interface circuit may load balance across the partitions of configurable circuits by starting threads for execution in the partition having the highest number of available thread identifiers. The dispatch interface may also assert a partition enable signal to merge the one or more separate partitions and assert a stop signal to all configurable circuits of the one or more separate partitions of configurable circuits.

Implementing state-based frame barriers to process colorless roots during concurrent execution

An application thread executes concurrently with a garbage collection (GC) thread traversing a call stack of the application thread. Frames of the call stack that have been processed by the GC thread assume a global state associated with the GC thread. The application thread may attempt to return to a target frame that has not yet assumed the global state. The application thread hits a frame barrier, preventing return to the target frame. The application thread determines a frame state of the target frame. The application thread selects appropriate operations for bringing the target frame to the global state based on the frame state. The selected operations are performed to bring the target frame to the global state. The application thread returns to the target frame.