G06F9/3873

Loop code processor optimizations

Loop code processor optimizations are implemented as a loop optimizer extension to a processor pipeline. The loop optimizer generates optimized code associated with code loops that include at least one zero-optimizable instruction. The loop optimizer may generate multiple versions of optimized code associated with a particular code loop, where each of the multiple version of optimized code has a different associated condition under which the optimized code can be safely executed.

Rescheduling a failed memory request in a processor

Devices and techniques to reschedule a memory request that has failed when a thread is executing in a processor are described herein. When a memory request for a thread is denied at a point in the execution pipeline of the processor beyond a thread rescheduling point, the thread can be placed into a memory response path of the processor. An indicator that a register write-back will not occur for the thread can also be provided. Then, the thread can be rescheduled with other threads in the memory response path.

Inclusion of Dedicated Accelerators in Graph Nodes

Systems, apparatuses, and methods for implementing a hierarchical scheduling in fixed-function graphics pipeline are disclosed. In various implementations, a processor includes a pipeline comprising a plurality of fixed-function units and a scheduler. The scheduler is configured to schedule a first operation for execution by one or more fixed-function units of the pipeline by scheduling the first operation with a first unit of the pipeline, responsive to a first mode of operation and schedule a second operation for execution by a selected fixed-function unit of the pipeline by scheduling the second operation directly to the selected fixed-function unit, independent of a sequential arrangement of the one or more fixed-function units in the pipeline, responsive to a second mode of operation.

Pipelined configurable processor
10275390 · 2019-04-30 · ·

A configurable processing circuit capable of handling multiple threads simultaneously, the circuit comprising a thread data store, a plurality of configurable execution units, a configurable routing network for connecting locations in the thread data store to the execution units, a configuration data store for storing configuration instances that each define a configuration of the routing network and a configuration of one or more of the plurality of execution units, and a pipeline formed from the execution units, the routing network and the thread data store that comprises a plurality of pipeline sections configured such that each thread propagates from one pipeline section to the next at each clock cycle, the circuit being configured to: (i) associate each thread with a configuration instance; and (ii) configure each of the plurality of pipeline sections for each clock cycle to be in accordance with the configuration instance associated with the respective thread that will propagate through that pipeline section during the clock cycle.

Method and system for determining instruction conflict states for issuance of memory instructions in a VLIW processor

An apparatus including a queue configured to store a plurality of instructions and state information indicating whether each instruction of the plurality of instructions can be performed independently of older pending instructions; and a state-selection circuit configured to set a state information of each instruction of the plurality of instructions in view of an older pending instruction in the queue.

PROCESSOR WITH HYBRID PIPELINE CAPABLE OF OPERATING IN OUT-OF-ORDER AND IN-ORDER MODES

A method and circuit arrangement provide support for a hybrid pipeline that dynamically switches between out-of-order and in-order modes. The hybrid pipeline may selectively execute instructions from at least one instruction stream that require the high performance capabilities provided by out-of-order processing in the out-of-order mode. The hybrid pipeline may also execute instructions that have strict power requirements in the in-order mode where the in-order mode conserves more power compared to the out-of-order mode. Each stage in the hybrid pipeline may be activated and fully functional when the hybrid pipeline is in the out-of-order mode. However, stages in the hybrid pipeline not used for the in-order mode may be deactivated and bypassed by the instructions when the hybrid pipeline dynamically switches from the out-of-order mode to the in-order mode. The deactivated stages may then be reactivated when the hybrid pipeline dynamically switches from the in-order mode to the out-of-order mode.

APPARATUS AND METHOD FOR CONTROLLING USE OF A REGISTER CACHE

An apparatus and method are provided for controlling use of a register cache. The apparatus has execution circuitry for executing instructions to process data values, and a register file comprising a plurality of registers in which to store the data values for access by the execution circuitry. A register cache is also provided that has a plurality of entries and is arranged to cache a subset of the data values for access by the execution circuitry. Each entry is arranged to cache a data value and an indication of the register associated with that cached data value. Prefetch circuitry then performs prefetch operations to prefetch data values from the register file into the register cache. Timing indication storage is used to store, for each data value to be generated as a result of instructions being executed within the execution circuitry, a register identifier for that data value, and timing information indicating when that data value will be generated by the execution circuitry. Cache usage control circuitry is then responsive to receipt of a plurality of register identifiers associated with source data values for a pending instruction yet to be executed by the execution circuitry, to generate, with reference to the timing information in the timing indication storage, a timing control signal to control timing of at least one prefetch operation performed by the prefetch circuitry. Such an approach can lead to significant improvements in the efficiency of utilisation of the register cache.

METHODS AND APPARATUS FOR HANDLING RUNTIME MEMORY DEPENDENCIES

An integrated circuit may include elastic datapaths or pipelines, through which software threads or iterations of loops, may be executed. Throttling circuitry may be coupled along an elastic pipeline in the integrated circuit. The throttling circuitry may include dependency detection circuitry that dynamically detect memory dependency issues that may arise during runtime. To mitigate these dependency issues, the throttling circuitry may assert stall signals to upstream stages in the pipeline.

Additionally, the throttling circuitry may control the pipeline to resolve a store operation prior to a corresponding load operation in order to avoid store/load conflicts. In an embodiment, the throttling circuitry may include a validator circuit, a rewind block, a revert block, and a flush block. The throttling circuitry may pass speculative iterations through the rewind block, and later validate the speculative iterations using the validator block.

Computing device with a conversion unit to convert data values between various sizes of fixed-point and floating-point data

The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.

Architecture for long latency operations in emulated shared memory architectures
10127048 · 2018-11-13 · ·

A processor architecture arrangement for emulated shared memory (ESM) architectures, comprises a number of, preferably a plurality of, multi-threaded processors each provided with interleaved inter-thread pipeline, wherein the pipeline comprises a plurality of functional units arranged in series for executing arithmetic, logical and optionally further operations on data, wherein one or more functional units of lower latency are positioned prior to the memory access segment in said pipeline and one or more long latency units (LLU) for executing more complex operations associated with longer latency are positioned operatively in parallel with the memory access segment. In some embodiments, the pipeline may contain multiple branches in parallel with the memory access segment, each branch containing at least one long latency unit.