G06F9/3889

SUPER-THREAD PROCESSOR
20220229662 · 2022-07-21 ·

The disclosed inventions include a processor apparatus and method that enable a general-purpose processor to achieve twice the operating frequency of typical processor implementations with a modest increase in area and a modest increase in energy per operation. The invention relies upon exploiting multiple independent streams of execution. Low-area, low-energy memory arrays used for register files operate at a modest frequency. Instructions can be issued at a rate higher than this frequency by including logic that guarantees that instructions from the same thread are spaced more widely than the time to access the register file. The result of the invention is the ability to overlap long-latency structures, which allows the use of lower-energy structures, thereby reducing energy per operation.
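The spacing guarantee described above can be sketched with a simple round-robin ("barrel") issue model: with T threads issuing one instruction per cycle in rotation, consecutive instructions from the same thread are automatically T cycles apart, hiding a slow register-file access. This is a minimal illustrative sketch, not the patented mechanism; all names and the latency value are assumptions.

```python
def issue_schedule(num_threads, instrs_per_thread):
    """Return (cycle, thread) pairs for a round-robin issue order."""
    schedule = []
    cycle = 0
    for _ in range(instrs_per_thread):
        for t in range(num_threads):
            schedule.append((cycle, t))
            cycle += 1
    return schedule

REG_FILE_LATENCY = 4  # assumed cycles to access the slow, low-energy register file
sched = issue_schedule(num_threads=4, instrs_per_thread=3)

# Verify the spacing guarantee: same-thread instructions never issue closer
# together than the register-file access time.
last_issue = {}
for cycle, t in sched:
    if t in last_issue:
        assert cycle - last_issue[t] >= REG_FILE_LATENCY
    last_issue[t] = cycle
```

With four threads, each thread sees its own instructions four cycles apart, so a four-cycle register-file read is fully overlapped by the other threads' issue slots.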

APPARATUS FOR OPTIMIZED MICROCODE INSTRUCTIONS FOR DYNAMIC PROGRAMMING BASED ON IDEMPOTENT SEMIRING OPERATIONS
20210406009 · 2021-12-30 ·

In one embodiment, a method is provided. The method includes determining whether a set of algorithmic operations can be represented using an algebraic formulation. The method also includes generating a sequence of idempotent semiring operations based on the set of algorithmic operations in response to determining that the set of algorithmic operations can be represented using the algebraic formulation. The sequence of idempotent semiring operations is part of an algebraic idempotent semiring, represents the algebraic formulation, and comprises one or more of an associative, commutative pick operation that forms an abelian monoid and an associative tally operation that forms a monoid and distributes over the pick operation. The method also includes generating a sequence of microcode instructions based on the sequence of idempotent semiring operations, wherein the sequence of microcode instructions carries out the sequence of idempotent semiring operations.
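A concrete instance of the pick/tally structure described above is the tropical semiring, where pick = min (associative, commutative, idempotent; identity +inf) and tally = + (associative; identity 0), with + distributing over min. Shortest-path dynamic programming is the classic computation it models. This sketch is illustrative and is not taken from the patent:

```python
import math

PICK_IDENTITY = math.inf   # identity of the pick (min) operation
TALLY_IDENTITY = 0         # identity of the tally (+) operation

def pick(a, b):
    return min(a, b)

def tally(a, b):
    return a + b

def shortest_paths(edges, num_nodes, source):
    """Bellman-Ford expressed purely as pick/tally semiring operations."""
    dist = [PICK_IDENTITY] * num_nodes
    dist[source] = TALLY_IDENTITY
    for _ in range(num_nodes - 1):
        for u, v, w in edges:
            dist[v] = pick(dist[v], tally(dist[u], w))
    return dist

edges = [(0, 1, 5), (0, 2, 2), (2, 1, 1), (1, 3, 3)]
print(shortest_paths(edges, 4, 0))  # [0, 3, 2, 6]
```

Because every update is one pick of two tallies, the whole inner loop lowers naturally to a fixed sequence of microcode operations, which is the shape the claimed method exploits.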

Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
11204769 · 2021-12-21 ·

A global front end scheduler to schedule instruction sequences to a plurality of virtual cores implemented via a plurality of partitionable engines. The global front end scheduler includes a thread allocation array to store a set of allocation thread pointers to point to a set of buckets in a bucket buffer in which execution blocks for respective threads are placed, a bucket buffer to provide a matrix of buckets, the bucket buffer including storage for the execution blocks, and a bucket retirement array to store a set of retirement thread pointers that track a next execution block to retire for a thread.
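The pointer structure above can be sketched as a shared bucket buffer with per-thread allocation and retirement pointers: blocks are placed into free buckets at allocation and retired in per-thread program order. Bucket counts and names are illustrative assumptions, not the patented design.

```python
from collections import deque

class BucketBuffer:
    def __init__(self, num_buckets):
        self.buckets = [None] * num_buckets      # storage for execution blocks
        self.free = deque(range(num_buckets))    # free bucket indices

class ThreadTracker:
    def __init__(self):
        self.allocated = deque()   # allocation order = per-thread program order

def allocate(buf, tracker, block):
    idx = buf.free.popleft()           # next free bucket in the matrix
    buf.buckets[idx] = block
    tracker.allocated.append(idx)      # allocation thread pointer advances
    return idx

def retire_next(buf, tracker):
    idx = tracker.allocated.popleft()  # retirement pointer: oldest block first
    block, buf.buckets[idx] = buf.buckets[idx], None
    buf.free.append(idx)               # bucket returns to the free pool
    return block

buf = BucketBuffer(num_buckets=8)
t0 = ThreadTracker()
allocate(buf, t0, "blk-A")
allocate(buf, t0, "blk-B")
print(retire_next(buf, t0))  # blk-A  (in-order retirement per thread)
```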

Methods and systems for inter-pipeline data hazard avoidance

Methods and parallel processing units for avoiding inter-pipeline data hazards, wherein the hazards are identified at compile time. For each identified inter-pipeline data hazard, the primary instruction and secondary instruction(s) thereof are identified as such and are linked by a counter which is used to track that hazard. When a primary instruction is output by the instruction decoder for execution, the value of the counter associated therewith is adjusted (e.g. incremented) to indicate that there is a hazard related to the primary instruction; when the primary instruction has been resolved by one of multiple parallel processing pipelines, the value of the counter is adjusted (e.g. decremented) to indicate that the hazard has been resolved. When a secondary instruction is output by the decoder for execution, it is stalled in a queue associated with the appropriate instruction pipeline if at least one counter associated with the primary instructions on which it depends indicates an unresolved hazard.
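The counter protocol above reduces to a small increment/decrement discipline, sketched here under assumed names (the counter IDs would be assigned at compile time):

```python
from collections import defaultdict

counters = defaultdict(int)   # one counter per compile-time-identified hazard

def issue_primary(counter_id):
    counters[counter_id] += 1          # hazard now outstanding

def resolve_primary(counter_id):
    counters[counter_id] -= 1          # hazard resolved by a pipeline

def can_issue_secondary(dep_counter_ids):
    # Stall while any depended-on primary still has an outstanding hazard.
    return all(counters[c] == 0 for c in dep_counter_ids)

issue_primary(3)
print(can_issue_secondary([3]))   # False: secondary waits in its pipeline queue
resolve_primary(3)
print(can_issue_secondary([3]))   # True: hazard cleared, secondary may issue
```

Because the check is a zero test on pre-linked counters rather than a register comparison, the stall decision needs no runtime dependency analysis.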

Instruction issue according to in-order or out-of-order execution modes

Apparatus for processing data (2) includes issue circuitry (22) for issuing program instructions (processing operations) to execute either within real time execution circuitry (32) or non real time execution circuitry (24, 26, 28, 30). Registers within a register file (18) are marked as non real time dependent registers if they are allocated to store a data value which is to be written by an uncompleted program instruction issued to the non real time execution circuitry and not yet completed. Issue policy control circuitry (42) responds to a trigger event to enter a real time issue policy mode to control the issue circuitry (22) to issue candidate processing operations (such as program instructions, micro-operations, architecturally triggered processing operations, etc.) to one of the non real time execution circuitry or the real time execution circuitry in dependence upon whether that candidate processing operation reads a register marked as a non real time dependent register.
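The issue policy can be sketched as a per-register mark plus a mode flag: in real-time mode, a candidate reading a marked register is steered away from the real-time circuitry. This is a minimal sketch under assumed names, not the claimed circuitry.

```python
class IssuePolicy:
    def __init__(self, num_regs):
        self.nrt_dependent = [False] * num_regs   # per-register NRT marks
        self.real_time_mode = False               # set by the trigger event

    def on_nrt_issue(self, dest_reg):
        self.nrt_dependent[dest_reg] = True       # pending non-real-time write

    def on_nrt_complete(self, dest_reg):
        self.nrt_dependent[dest_reg] = False      # write completed, mark cleared

    def route(self, src_regs):
        if not self.real_time_mode:
            return "non-real-time"                # assumed default issue path
        if any(self.nrt_dependent[r] for r in src_regs):
            return "non-real-time"                # reads an NRT-dependent register
        return "real-time"

policy = IssuePolicy(num_regs=8)
policy.real_time_mode = True        # trigger event entered real-time issue mode
policy.on_nrt_issue(dest_reg=2)     # r2 awaits a non-real-time result
print(policy.route([2]))  # non-real-time: candidate depends on r2
print(policy.route([5]))  # real-time: no NRT dependence
```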

Controlling the operation of a decoupled access-execute processor
11194581 · 2021-12-07 ·

Data processing apparatuses, methods of data processing, instructions, and simulator computer programs for providing a corresponding instruction execution environment are disclosed. Decode circuitry is responsive to an instance of a predetermined instruction type to cause issue circuitry to issue at least one subsequent instruction for execution to one of first and second instruction execution circuitry which support decoupled access-execute instruction execution. The predetermined instruction type is thus a steering instruction for at least one subsequent instruction and the programmer is provided with a mechanism for determining which program instructions are treated as access instructions and which are treated as execute instructions.
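The steering idea can be sketched as a prefix instruction that sets the target pipeline for subsequent instructions until the next steering instance. The mnemonics and encodings below are invented for illustration; they are not the patented instruction set.

```python
def issue(stream):
    target = "execute"                 # assumed default pipeline
    routed = []
    for instr in stream:
        if instr.startswith("steer."):         # the predetermined steering type
            target = instr.split(".")[1]       # "access" or "execute"
        else:
            routed.append((instr, target))     # issue to the steered pipeline
    return routed

prog = ["steer.access", "ld r1,[r2]", "steer.execute", "add r3,r1,r4"]
print(issue(prog))
# [('ld r1,[r2]', 'access'), ('add r3,r1,r4', 'execute')]
```

The point of the mechanism is that the access/execute split becomes a software decision: the compiler or programmer, not the hardware, decides which instructions feed the decoupled access stream.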

FRACTAL CALCULATING DEVICE AND METHOD, INTEGRATED CIRCUIT AND BOARD CARD
20220188614 · 2022-06-16 ·

A fractal computing device according to an embodiment of the present application may be included in an integrated circuit device. The integrated circuit device includes a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete a user-specified calculation operation. The integrated circuit device may also include a storage device, which is connected to the computing device and the other processing devices respectively and is used for data storage by the computing device and the other processing devices.

Neural processing accelerator

A system for performing calculations. A scratch memory is connected to a plurality of configurable processing elements by a communication fabric including a plurality of configurable nodes. The scratch memory sends out a plurality of streams of data words. Each data word is either a configuration word, used to set the configuration of a node or of a processing element, or a data word carrying an operand or a result of a calculation. Each processing element performs operations according to its current configuration and returns the results to the communication fabric, which conveys them back to the scratch memory.
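The word-stream protocol above can be sketched as a tagged stream consumed by one processing element: configuration words set its operation, data words supply operands, and each operand pair emits a result word. The tags and the pairing rule are illustrative assumptions.

```python
import operator

def run_element(stream):
    op = None          # current PE configuration (set by config words)
    pending = None     # first operand of a pair
    results = []
    for kind, payload in stream:
        if kind == "config":
            op = payload                        # e.g. configure PE as an adder
        elif pending is None:
            pending = payload                   # hold first operand
        else:
            results.append(op(pending, payload))  # compute, emit result word
            pending = None
    return results

stream = [("config", operator.add), ("data", 2), ("data", 3),
          ("data", 10), ("data", 4)]
print(run_element(stream))   # [5, 14]
```

Interleaving configuration and data in one stream is what lets the scratch memory reprogram the fabric on the fly without a separate control path.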

Load balancing of two processors when executing diverse-redundant instruction sequences
11354132 · 2022-06-07 ·

A method and a system are presented for load balancing of two processors when executing diverse-redundant instruction sequences.

LONG-TERM PROGRAMMATIC WORKFLOW MANAGEMENT
20220164224 · 2022-05-26 ·

A system and method for long-term programmatic workflow execution, including: iteratively, while a suspension event is not detected, executing, with a run, the code block using variable values passed from another code block; when a suspension event is detected, suspending run execution and persistently storing the run state; and resuming run execution responsive to receipt of a valid run resumption request.
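The suspend/persist/resume loop above can be sketched as a run that executes code blocks in order, passing variables forward; on a suspension event the next block index and variables are serialized, and a resumption request restores them and continues. The block functions and the JSON persistence format are illustrative assumptions.

```python
import json

BLOCKS = [
    lambda v: {**v, "a": v["x"] + 1},        # code block 1
    lambda v: {**v, "b": v["a"] * 2},        # code block 2: uses passed value "a"
    lambda v: {**v, "done": True},           # code block 3
]

def execute(state, should_suspend):
    """Run blocks until completion or a suspension event; return run state."""
    i, variables = state["next"], state["vars"]
    while i < len(BLOCKS):
        if should_suspend(i):
            return json.dumps({"next": i, "vars": variables})  # persist state
        variables = BLOCKS[i](variables)
        i += 1
    return json.dumps({"next": i, "vars": variables})

# First run: a suspension event fires before block 3 (index 2).
saved = execute({"next": 0, "vars": {"x": 1}}, lambda i: i == 2)
# Later: a valid resumption request restores the persisted state and continues.
final = execute(json.loads(saved), lambda i: False)
print(json.loads(final)["vars"])  # {'x': 1, 'a': 2, 'b': 4, 'done': True}
```

Because the run state is an explicit serializable value, the gap between suspension and resumption can be arbitrarily long, which is the "long-term" property the claim targets.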