Patent classifications
G06F9/3869
HIGH-THROUGHPUT ASYNCHRONOUS DATA PIPELINE
One embodiment of the present invention sets forth a data pipeline, which includes a first mousetrap element and a second mousetrap element in a first pipeline stage. Each mousetrap element includes a request latch that, when enabled, allows a request signal to pass from the first pipeline stage to a second pipeline stage following the first pipeline stage in the data pipeline. Each mousetrap element also includes a data latch that, when enabled, allows a data element to pass from the first pipeline stage to the second pipeline stage. Each mousetrap element further includes a latch controller that enables and disables the request and data latches based on a phase signal that alternates between a first value that configures the first mousetrap element to transmit data to the second pipeline stage and a second value that configures the second mousetrap element to transmit data to the second pipeline stage.
ARITHMETIC PROCESSING DEVICE AND ARITHMETIC PROCESSING METHOD
An arithmetic processing device includes: a memory; and a processor coupled to the memory and configured to: execute a plurality of data processes each of which is divided into a plurality of pipeline stages in parallel at different timings; measure a processing time of each of the plurality of pipeline stages; and set a priority of the plurality of pipeline stages in a descending order of the measured processing time.
Writeback Hazard Elimination
A processor includes a processing pipeline, a plurality of result-storage elements, and writeback logic. The processing pipeline is configured to process program operations and to write, to a result storage, up to a predefined maximal number of results of the processed program operations per clock cycle. The result-storage elements are configured to store respective ones of the results. The writeback logic is configured to (i) detect a writeback conflict event in which the processing pipeline produces simultaneous results that exceed the predefined maximal number of results, for writing to the result storage, in a same clock cycle, (ii) in response to detecting the writeback conflict event, to temporarily store at least a given result, from among the simultaneous results, in a given result-storage element, and (iii) to subsequently write the temporarily-stored given result from the given result-storage element to the result storage.
METHOD AND CIRCUIT FOR DYNAMIC POWER CONTROL
Dynamic power control embodiments concern a data processing pipeline. First and second pipeline stages respectively receive first and second clock signals. The first and second pipeline stages are configured to perform first and second operations respectively triggered by first timing edges of the first clock signal and second timing edges of the second clock signal. A clock controller is configured to generate the first and second clock signals. The clock controller is capable of operating in a first mode in which, during a first data processing cycle of the data processing pipeline, a first of the first timing edges is in-phase with a first of the second timing edges. The clock controller is also capable of operating in a second mode in which, during a second data processing cycle of the data processing pipeline, a second of the first timing edges is out of phase with a second of the second timing edges.
Replicating logic blocks to enable increased throughput with sequential enabling of input register blocks
A datapath pipeline which uses replicated logic blocks to increase the throughput of the pipeline is described. In an embodiment, the pipeline, or a part thereof, comprises a number of parallel logic paths each comprising the same logic. Input register stages at the start of each logic path are enabled in turn on successive clock cycles such that data is read into each logic path in turn and the logic in the different paths operates out of phase. The output of the logic paths is read into one or more output register stages and the logic paths are combined using a multiplexer which selects an output from one of the logic paths on any clock cycle. Various optimization techniques are described and in various examples, register retiming may also be used. In various examples, the datapath pipeline is within a processor.
Clock mesh-based power conservation in a coprocessor based on in-flight instruction characteristics
A pipeline includes a first portion configured to process a first subset of bits of an instruction and a second portion configured to process a second subset of the bits of the instruction. A first clock mesh is configured to provide a first clock signal to the first portion of the pipeline. A second clock mesh is configured to provide a second clock signal to the second portion of the pipeline. The first and second clock meshes selectively provide the first and second clock signals based on characteristics of in-flight instructions that have been dispatched to the pipeline but not yet retired. In some cases, a physical register file is configured to store values of bits representative of instructions. Only the first subset is stored in the physical register file in response to the value of the zero high bit indicating that the second subset is equal to zero.
Pipeline merging in a circuit
Devices and techniques for pipeline merging in a circuit are described herein. A parallel pipeline result can be obtained for a transaction index of a parallel pipeline. Here, the parallel pipeline is one of several parallel pipelines that share transaction indices. An element in a vector can be marked. The element corresponds to the transaction index. The vector is one of several vectors respectively assigned to the several parallel pipelines. Further each element in the several vectors corresponds to a possible transaction index with respective elements between vectors corresponding to the same transaction index. Elements between the several vectors that correspond to the same transaction index can be compared to determine when a transaction is complete. In response to the transaction being complete, the result can be released to an output buffer in response to the transaction being complete.
Differential pipeline delays in a coprocessor
A coprocessor such as a floating-point unit includes a pipeline that is partitioned into a first portion and a second portion. A controller is configured to provide control signals to the first portion and the second portion of the pipeline. A first physical distance traversed by control signals propagating from the controller to the first portion of the pipeline is shorter than a second physical distance traversed by control signals propagating from the controller to the second portion of the pipeline. A scheduler is configured to cause a physical register file to provide a first subset of bits of an instruction to the first portion at a first time. The physical register file provides a second subset of the bits of the instruction to the second portion at a second time subsequent to the first time.
USING SPARSITY METADATA TO REDUCE SYSTOLIC ARRAY POWER CONSUMPTION
A processing apparatus can include a general-purpose parallel processing engine comprising a matrix accelerator including a multi-stage systolic array, where each stage includes multiple processing elements associated with multiple processing channels. The multiple processing elements are configured to receive output sparsity metadata that is independent of input sparsity of input matrix elements and perform processing operations on the input matrix elements based on the output sparsity metadata.
Resource management unit for capturing operating system configuration states and offloading tasks
This disclosure describes methods, devices, systems, and procedures in a computing system for capturing a configuration state of an operating system executing on a central processing unit (CPU), and offloading resource-related tasks, based on the configuration state, to a resource management unit such as a system-on-chip (SoC). The resource management unit identifies a status of each resource based on the captured configuration state of the operating system. The resource management unit then processes tasks associated with the status of the resources, such as modifying a clock rate of a clocked component in the computing system. This can alleviate the CPU from processing those tasks thereby improving overall computing system performance and dynamics.