G06F9/30087

Managing load and store instructions for memory barrier handling

A front-end portion of a pipeline includes a stage that speculatively issues at least some instructions out-of-order. A back-end portion of the pipeline includes one or more stages that access a processor memory system. Execution of instructions is managed in the front-end based on information available in the front-end, and in the back-end based on information available in the back-end. Managing execution of a first memory barrier instruction includes preventing speculative out-of-order issuance of store instructions. The back-end control circuitry provides information accessible to the front-end control circuitry indicating that one or more particular memory instructions have completed handling by the processor memory system. The front-end control circuitry identifies one or more load instructions that were issued before the first memory barrier instruction was issued but are ordered after the first memory barrier instruction, and causes at least one of the identified load instructions to be reissued after the first memory barrier instruction has been issued.
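
The reissue logic described above can be sketched in a few lines of Python. This is a minimal illustrative model, not the patented circuitry: `program` holds instructions in program order, `issue_order` holds the order the front-end actually issued them, and all names are hypothetical.

```python
def reissue_after_barrier(program, issue_order):
    """program: list of ('load'|'store'|'barrier', id) in program order.
    issue_order: list of ids in the order the front-end issued them.
    Returns ids of loads that must be reissued after the barrier issues."""
    barrier_pos = next(i for i, (kind, _) in enumerate(program)
                       if kind == 'barrier')
    barrier_id = program[barrier_pos][1]
    # Loads ordered after the barrier in program order...
    after_barrier = {ident for kind, ident in program[barrier_pos + 1:]
                     if kind == 'load'}
    # ...that were nonetheless issued before the barrier was issued.
    issued_early = set(issue_order[:issue_order.index(barrier_id)])
    return sorted(after_barrier & issued_early)
```

For example, if load 3 follows the barrier (id 2) in program order but was speculatively issued first, `reissue_after_barrier([('load', 1), ('barrier', 2), ('load', 3)], [3, 1, 2])` identifies load 3 for reissue.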

MASKING FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURE
20230058355 · 2023-02-23 ·

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. The tiles can be arranged in an array or grid and can be communicatively coupled. In an example, a first node can include a tile cluster of N memory-compute tiles, and the N memory-compute tiles can be coupled using a first portion of a synchronous compute fabric. Operations performed by the respective processing and storage elements of the N memory-compute tiles can be selectively enabled or disabled based on information in a mask field of data propagated through the first portion of the synchronous compute fabric.
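
The mask-field gating can be modeled as a sketch in Python, with each tile's operation applied only when its bit in the propagated mask is set; disabled tiles pass the value through unchanged. The bit-per-tile encoding is an assumption for illustration.

```python
def propagate(tiles, value, mask):
    """Apply each tile's op to `value` in sequence, skipping tiles whose
    bit in `mask` is clear (disabled tiles pass the value through)."""
    for i, op in enumerate(tiles):
        if mask & (1 << i):          # tile i selectively enabled by mask
            value = op(value)
    return value
```

With tiles `[+1, *2, -3]` and mask `0b101`, only the first and third operations fire, so a value of 5 becomes (5 + 1) - 3 = 3.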

SYSTEMS AND METHODS FOR PROCESSING OUT-OF-ORDER EVENTS
20230056344 · 2023-02-23 ·

The present disclosure provides new and innovative systems and methods for processing out-of-order events. In an example, a computer-implemented method includes obtaining data; committing the obtained data to a fixed-size storage pool that includes a plurality of slots and a pool index comprising a fixed-length array; and transmitting an indication that the data is available. Committing the data includes acquiring a slot in the plurality of slots, locking the acquired slot, storing the obtained data in the acquired slot, updating the pool index by updating the array element corresponding to the acquired slot to store an indication of the obtained data, and unlocking the acquired slot.
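
The commit sequence reads directly as code. Below is a minimal Python sketch of that sequence, assuming a simple free-slot scan for acquisition (the slot-acquisition policy and the class name are illustrative, not from the disclosure).

```python
import threading

class FixedPool:
    """Sketch of the claimed commit sequence: acquire a free slot, lock it,
    store the data, update the fixed-length pool index, unlock, signal."""
    def __init__(self, size):
        self.slots = [None] * size          # fixed-size storage pool
        self.index = [None] * size          # fixed-length pool index
        self.locks = [threading.Lock() for _ in range(size)]
        self.available = []                 # indications handed to consumers

    def commit(self, data, key):
        slot = self.slots.index(None)       # acquire a free slot (simplified)
        with self.locks[slot]:              # lock the acquired slot
            self.slots[slot] = data         # store the obtained data
            self.index[slot] = key          # update the pool-index element
        self.available.append(slot)         # transmit availability indication
        return slot
```

The per-slot lock means producers touching different slots never contend, which is the point of indexing a fixed pool rather than appending to a shared buffer.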

Synchronization amongst processor tiles

A processing system comprises an arrangement of tiles and an interconnect between the tiles. The interconnect comprises synchronization logic for coordinating a barrier synchronization to be performed between a group of the tiles. The tiles' instruction set comprises a synchronization instruction taking an operand which selects one of a plurality of available modes, each specifying a different membership of the group. Execution of the synchronization instruction causes a synchronization request to be transmitted from the respective tile to the synchronization logic, and instruction issue to be suspended on the respective tile pending a synchronization acknowledgement being received back from the synchronization logic. In response to receiving the synchronization request from all the tiles in the group specified by the operand, the synchronization logic returns the synchronization acknowledgement to the tiles in the specified group.
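
The request/acknowledge protocol can be sketched as follows, assuming the mode operand maps to a named set of member tiles (the class and method names are illustrative). A request returns `None` while the requester must stay suspended, and the full group membership once every member has checked in.

```python
class SyncLogic:
    """Sketch of the interconnect's synchronization logic: each tile sends a
    sync request and suspends; when requests from every tile in the group
    selected by the instruction's mode operand have arrived, an
    acknowledgement is returned to all of them."""
    def __init__(self, modes):
        self.modes = modes               # mode name -> set of member tile ids
        self.pending = {}                # mode name -> tiles currently waiting

    def sync_request(self, tile, mode):
        group = self.modes[mode]
        waiting = self.pending.setdefault(mode, set())
        waiting.add(tile)
        if waiting == group:             # requests received from all members
            self.pending[mode] = set()   # reset for the next barrier
            return sorted(group)         # acknowledgement sent to whole group
        return None                      # requester remains suspended
```

With a mode `'all'` covering tiles {0, 1, 2}, the first two requests return `None` and the third releases the whole group.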

PROCESSOR WITH TABLE LOOKUP UNIT

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
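
A table lookup loop of this kind can be sketched in Python: indices are consumed a fixed number of lanes at a time, mimicking the parallel execution units, and each batch is gathered from the table. The lane count and function name are assumptions for illustration.

```python
def table_lookup_loop(table, index_vector, lanes=4):
    """Gather `table` values for `index_vector`, processing `lanes`
    indices per iteration as a stand-in for parallel lookup lanes."""
    out = []
    for start in range(0, len(index_vector), lanes):
        batch = index_vector[start:start + lanes]
        out.extend(table[i] for i in batch)   # gather table values
    return out
```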

SYSTEMS AND METHODS FOR SYNCHRONIZING DATA PROCESSING IN A CELLULAR MODEM

A cellular modem processor can include dedicated processing engines that implement specific, complex data processing operations. The processing engines can be arranged in pipelines, with different processing engines executing different steps in a sequence of operations. Flow control or data synchronization between pipeline stages can be provided using a hybrid of firmware-based flow control and hardware-based data dependency management. Firmware instructions can define data flow by reference to a virtual address space associated with pipeline buffers. A hardware interlock controller within the pipeline can track and enforce the data dependencies for the pipeline.
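
The hardware interlock's dependency rule reduces to a simple sketch: a consumer stage may read a virtual buffer address only after the producer stage has marked its write to that address complete. This minimal model (names hypothetical) captures only that ready-tracking behavior, not the firmware side.

```python
class InterlockController:
    """Sketch of data-dependency tracking over a virtual buffer
    address space shared by pipeline stages."""
    def __init__(self):
        self.ready = set()            # virtual addresses fully written

    def write_complete(self, vaddr):
        self.ready.add(vaddr)         # producer stage signals completion

    def may_read(self, vaddr):
        return vaddr in self.ready    # consumer stage blocked until then
```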

SYNCHRONIZING SCHEDULING TASKS WITH ATOMIC ALU
20230033355 · 2023-02-02 ·

A method of synchronizing a group of scheduled tasks within a parallel processing unit into a known state is described. The method uses a synchronization instruction in a scheduled task which triggers, in response to decoding of the instruction, an instruction decoder to place the scheduled task into a non-active state and forward the decoded synchronization instruction to an atomic ALU for execution. When the atomic ALU executes the decoded synchronization instruction, the atomic ALU performs an operation and check on data assigned to the group ID of the scheduled task and if the check is passed, all scheduled tasks having the particular group ID are removed from the non-active state.
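
The atomic operation-and-check can be sketched as a per-group counter: each arriving task goes non-active and decrements the counter, and when the check finds zero, every task in that group is released. The counter encoding and names are illustrative assumptions, not the patented design.

```python
class AtomicALU:
    """Sketch: the decoded sync instruction is forwarded here; the ALU
    performs an operation and check on the data for the task's group ID,
    and when the check passes all tasks in that group are released from
    the non-active state."""
    def __init__(self, group_sizes):
        self.counters = dict(group_sizes)        # group_id -> outstanding tasks
        self.sleeping = {g: [] for g in group_sizes}

    def sync(self, task_id, group_id):
        self.sleeping[group_id].append(task_id)  # task placed non-active
        self.counters[group_id] -= 1             # atomic op on group data
        if self.counters[group_id] == 0:         # check passed
            released = self.sleeping[group_id]
            self.sleeping[group_id] = []
            return released                      # whole group released
        return []
```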

MONITORING EXECUTION OF APPLICATION SCHEDULES IN COMPUTING SYSTEMS

One or more embodiments of the present disclosure relate to monitoring execution of runnables that may be executed by a computing system, the execution being based at least in part on a schedule. The monitoring may include one or more of: monitoring timing of execution of the runnables, monitoring one or more sequences of execution of the runnables, or monitoring health of at least a portion of the computing system executing the runnables. Additionally or alternatively, one or more embodiments may relate to determining compliance with one or more execution constraints based at least in part on the monitoring.
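
Two of the checks described (sequence and timing compliance) can be sketched against a recorded execution trace. The trace format and single shared deadline below are simplifying assumptions for illustration.

```python
def check_schedule(trace, expected_order, deadline_ms):
    """trace: list of (runnable, start_ms, end_ms) in observed order.
    Returns True only if the runnables ran in the expected sequence and
    each one met the (assumed, shared) deadline."""
    order_ok = [name for name, _, _ in trace] == expected_order
    timing_ok = all(end - start <= deadline_ms for _, start, end in trace)
    return order_ok and timing_ok
```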

Processor with conditional-fence commands excluding designated memory regions
20230036954 · 2023-02-02 ·

An apparatus includes a processor, configured to designate a memory region in a memory, and to issue (i) memory-access commands for accessing the memory and (ii) a conditional-fence command associated with the designated memory region. Memory-Access Control Circuitry (MACC) is configured, in response to identifying the conditional-fence command, to allow execution of the memory-access commands that access addresses within the designated memory region, and to defer the execution of the memory-access commands that access addresses outside the designated memory region, until completion of all the memory-access commands that were issued before the conditional-fence command.