G06F9/3873

Pipelining out-of-order instructions

Systems, methods, and computer program products are provided for pipelining out-of-order instructions. Embodiments comprise an instruction reservation station for short instructions of a short latency type and long instructions of a long latency type; an issue queue containing at least two short instructions of the short latency type, which are to be chained to match the latency of a long instruction of the long latency type; a register file; at least one execution pipeline for instructions of the short latency type; and at least one execution pipeline for instructions of the long latency type. Results of the at least one execution pipeline for instructions of the short latency type are written to the register file, preserved in an auxiliary buffer, or forwarded to the inputs of said execution pipelines. Data in the auxiliary buffer are written to the register file.
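
As a rough illustration of the chaining idea, the following Python sketch (all names and latencies are hypothetical, not taken from the patent) pairs short-latency instructions from an issue queue so that the combined occupancy of two chained short instructions matches the latency of one long instruction, keeping the write-back slots of the two pipelines aligned.

```python
# Hypothetical sketch: two chained short-latency instructions occupy the
# short pipeline for as long as one long-latency instruction occupies the
# long pipeline, so both pipelines write back on the same cadence.

SHORT_LATENCY = 3   # assumed cycles in the short pipeline
LONG_LATENCY = 6    # assumed cycles in the long pipeline

def chain_short_ops(issue_queue):
    """Pair up short instructions whose summed latency matches LONG_LATENCY."""
    assert 2 * SHORT_LATENCY == LONG_LATENCY
    chains = []
    while len(issue_queue) >= 2:
        first = issue_queue.pop(0)
        second = issue_queue.pop(0)
        # The second instruction reads the first one's result from the
        # forwarding path or the auxiliary buffer, not the register file.
        chains.append((first, second))
    return chains

queue = ["add r1,r2,r3", "add r4,r1,r5", "sub r6,r7,r8", "sub r9,r6,r1"]
print(chain_short_ops(queue))
```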

Coprocessors with bypass optimization, variable grid architecture, and fused vector operations

In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given coprocessor instruction and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement only a portion of the processing elements. The coprocessor control circuitry may be designed to operate with either the full grid or the partial grid, reissuing instructions in the partial-grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.
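
The partial-grid reissue behavior can be sketched in a few lines of Python; the names and row counts below are assumptions for illustration, not details from the patent.

```python
import math

ARCHITECTED_ROWS = 8   # rows the ISA-visible grid defines (assumed)
IMPLEMENTED_ROWS = 2   # rows physically present in this implementation

def reissue_count(rows_requested):
    """Number of passes needed on a partial grid to cover the request."""
    return math.ceil(rows_requested / IMPLEMENTED_ROWS)

def execute(op, rows_requested):
    # The control circuitry reissues the same instruction once per
    # implemented slice until the full architected operation is covered.
    for p in range(reissue_count(rows_requested)):
        lo = p * IMPLEMENTED_ROWS
        hi = min(lo + IMPLEMENTED_ROWS, rows_requested)
        print(f"pass {p}: {op} on rows {lo}..{hi - 1}")

execute("vmadd", ARCHITECTED_ROWS)  # 4 passes on the 2-row implementation
```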

ON-THE-FLY ADJUSTMENT OF ISSUE-WRITE BACK LATENCY TO AVOID WRITE BACK COLLISIONS USING A RESULT BUFFER

A system and method for avoiding write-back collisions. The system receives a plurality of instructions at a pipeline queue. Next, an issue queue determines a number of cycles for each instruction of the plurality of instructions and determines whether a collision will occur between at least two of the instructions. In response to a collision between at least two of the instructions, the system determines a number of cycles by which to delay at least one of the at least two instructions. The instructions are then executed. The system places the results of instructions that had a calculated delay in a result buffer for the determined number of delay cycles. After the determined number of delay cycles, the system sends the results to a results mux, and from the results mux the results are written back to the register file.
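
A minimal Python sketch of the collision check, assuming a simple model in which each in-flight instruction claims one write-back cycle (all names hypothetical):

```python
# Hypothetical sketch of issue-time collision detection: two in-flight
# instructions collide when they would reach write-back in the same cycle.

class IssueQueue:
    def __init__(self):
        self.writeback_cycles = set()   # write-back cycles already claimed

    def schedule(self, issue_cycle, latency):
        """Return the cycles of delay needed to avoid a write-back collision."""
        wb = issue_cycle + latency
        delay = 0
        while wb + delay in self.writeback_cycles:
            delay += 1              # hold the result in the result buffer
        self.writeback_cycles.add(wb + delay)
        return delay

q = IssueQueue()
print(q.schedule(issue_cycle=0, latency=4))  # 0: no conflict at cycle 4
print(q.schedule(issue_cycle=1, latency=3))  # 1: cycle 4 taken, buffered one cycle
```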

Computing device and method

The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a storage unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to the one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented as fixed-point data, thereby improving the processing speed and efficiency of training operations.
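
As a sketch of what the fixed-point representation buys, the following Python snippet quantizes values into an assumed Q-format so that multiplication reduces to integer arithmetic; the fraction width is an illustrative assumption, not a value from the disclosure.

```python
# Hypothetical sketch: floats are quantized to integers with a shared
# scale, so the operation unit can work in integer arithmetic.

FRACTION_BITS = 8  # assumed Q-format fraction width

def to_fixed(x):
    return round(x * (1 << FRACTION_BITS))

def from_fixed(q):
    return q / (1 << FRACTION_BITS)

def fixed_mul(a, b):
    # integer multiply, then rescale back into the Q-format
    return (a * b) >> FRACTION_BITS

a, b = to_fixed(1.5), to_fixed(-0.25)
print(from_fixed(fixed_mul(a, b)))  # -0.375
```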

VECTOR PROCESSOR SUPPORTING LINEAR INTERPOLATION ON MULTIPLE DIMENSIONS

Techniques are disclosed for a vector processor architecture that enables data interpolation in accordance with multiple dimensions, such as one-, two-, and three-dimensional linear interpolation. The vector processor architecture includes a vector processor and an accompanying vector-addressable memory that enable simultaneous retrieval of multiple entries in the vector-addressable memory to facilitate linear interpolation calculations. The vector processor architecture vastly increases the speed at which such calculations can be performed compared to conventional processing architectures. Example implementations include the calculation of digital pre-distortion (DPD) coefficients for use with radio frequency (RF) transmitter chains to support multi-band applications.
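
A scalar Python sketch of the 1-D case follows; in the disclosed architecture the two neighboring table entries per lane would be fetched simultaneously from the vector-addressable memory, whereas here (with hypothetical names) they are simply indexed.

```python
# Hypothetical sketch of 1-D linear interpolation: each lane fetches the
# two adjacent table entries and blends them by the fractional index.

table = [0.0, 10.0, 15.0, 17.5]   # assumed lookup table

def lerp_lookup(indices):
    """`indices` are fractional positions into `table`, one per lane."""
    out = []
    for x in indices:
        i = int(x)                    # integer part selects the segment
        frac = x - i                  # fractional part blends the neighbors
        lo, hi = table[i], table[min(i + 1, len(table) - 1)]
        out.append(lo + frac * (hi - lo))
    return out

print(lerp_lookup([0.5, 1.25, 2.0]))  # [5.0, 11.25, 15.0]
```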

PROCESSOR WITH ADAPTIVE PIPELINE LENGTH

A system and method for reducing pipeline latency. In one embodiment, a processing system includes a processing pipeline. The processing pipeline includes a plurality of processing stages, each configured to further process data provided by a previous stage. A first of the stages is configured to perform a first function in a pipeline cycle. A second of the stages is disposed downstream of the first of the stages, and is configured to perform, in a pipeline cycle, a second function that is different from the first function. The first of the stages is further configured to selectably perform both the first function and the second function in a single pipeline cycle and bypass the second of the stages.
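
The following Python sketch models the selectable behavior with hypothetical stage functions: when the first stage performs both functions in one cycle, the second stage is bypassed and the effective pipeline is one cycle shorter.

```python
# Hypothetical sketch of the adaptive-length pipeline: stage A can absorb
# stage B's function when timing allows, shortening the pipeline.

def stage_a(x):
    return x + 1          # assumed first function

def stage_b(x):
    return x * 2          # assumed second function

def run_pipeline(x, fuse_stages):
    if fuse_stages:
        # both functions in one pipeline cycle; stage B is bypassed
        return stage_b(stage_a(x)), 1
    # conventional path: one function per cycle
    return stage_b(stage_a(x)), 2

for fuse in (False, True):
    result, cycles = run_pipeline(5, fuse)
    print(f"fused={fuse}: result={result}, cycles={cycles}")
```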

VARIABLE PIPELINE LENGTH IN A BARREL-MULTITHREADED PROCESSOR
20220121450 · 2022-04-21

Devices and techniques for variable pipeline length in a barrel-multithreaded processor are described herein. A completion time for an instruction can be determined prior to insertion into a pipeline of a processor. A conflict between the instruction and a different instruction, based on the completion time, can then be detected. Here, the different instruction is already in the pipeline, and the conflict is detected when the completion time equals the previously determined completion time for the different instruction. A difference between the completion time and an unconflicted completion time can then be calculated and completion of the instruction delayed by the difference.
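
This parallels the result-buffer approach above, but here the delay is computed before the instruction enters the pipeline. A Python sketch under the same one-write-back-per-cycle assumption (names hypothetical):

```python
# Hypothetical sketch: before inserting an instruction, compare its
# completion time against those already in flight; on a match, delay it
# by the gap to the nearest free completion slot.

in_flight = {7, 9}   # completion cycles already claimed in the pipeline

def insert(now, latency):
    unconflicted = now + latency
    completion = unconflicted
    while completion in in_flight:   # conflict: same completion cycle
        completion += 1
    in_flight.add(completion)
    return completion - unconflicted  # cycles of added delay

print(insert(now=4, latency=3))  # cycle 7 is taken, slips to 8 -> delay 1
```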

RESCHEDULING A FAILED MEMORY REQUEST IN A PROCESSOR
20220121486 · 2022-04-21

Devices and techniques for rescheduling a failed memory request in a processor are described herein. When a memory request for a thread is denied at a point in the execution pipeline of the processor beyond a thread rescheduling point, the thread can be placed into a memory response path of the processor. An indicator that a register write-back will not occur for the thread can also be provided. Then, the thread can be rescheduled with other threads in the memory response path.
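
A rough Python sketch of the diversion, with all structure and field names assumed for illustration:

```python
from collections import deque

# Hypothetical sketch: a thread whose memory request is denied past the
# rescheduling point is diverted into the memory response path, flagged so
# that no register write-back occurs, and rescheduled from there.

response_path = deque()

def on_memory_request_denied(thread):
    thread["write_back"] = False       # indicator: no register write-back
    response_path.append(thread)       # joins threads awaiting responses

def drain_response_path(scheduler):
    while response_path:
        thread = response_path.popleft()
        scheduler.append(thread)       # rescheduled with the other threads

sched = []
on_memory_request_denied({"tid": 3, "pc": 0x40})
drain_response_path(sched)
print(sched)
```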

RISC-V ISA BASED MICRO-CONTROLLER UNIT FOR LOW POWER IOT AND EDGE COMPUTING APPLICATIONS
20210357227 · 2021-11-18

A micro-controller unit (MCU) for low-power IoT and edge computing applications is disclosed. The MCU includes an instruction fetching module configured to fetch an instruction from an instruction memory, an instruction decoding module configured to decode the instruction to obtain a decoded instruction, and an execution module including first and second execution units and a clock gating circuit. The second execution unit is configured to execute a set of instruction types. The execution module is configured to receive the decoded instruction from the instruction decoding module and execute it via a particular logic circuit from the first logic circuits associated with the first execution unit; the first logic circuits other than the particular logic circuit are turned off during execution via the clock gating circuit. The execution module is further configured to determine whether the type of the decoded instruction is included in the set of instruction types, and to disable the second logic circuits included in the second execution unit via the clock gating circuit in response to a determination that the type of the decoded instruction is not included in the set.
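
A Python sketch of the gating decisions described above; the circuit names and instruction-type sets are illustrative assumptions, not details from the disclosure.

```python
# Hypothetical sketch of the clock-gating policy: only the logic circuit
# needed by the decoded instruction stays clocked, and the second
# execution unit is gated off when the instruction type is outside its set.

FIRST_UNIT_CIRCUITS = {"alu", "shift", "branch"}   # assumed circuits
SECOND_UNIT_TYPES = {"mul", "div"}                 # types it handles

def clock_enables(decoded_type, needed_circuit):
    enables = {c: (c == needed_circuit) for c in FIRST_UNIT_CIRCUITS}
    enables["second_unit"] = decoded_type in SECOND_UNIT_TYPES
    return enables

print(clock_enables("add", "alu"))
# alu stays clocked; shift, branch, and the second unit are gated off
```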