Patent classifications
G06F9/3873
COMPUTING DEVICE AND METHOD
The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a conversion unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.
VARIABLE LATENCY REQUEST ARBITRATION
A technique for scheduling processing tasks having different latencies is provided. The technique involves identifying one or more available requests in one or more request queues, where each request queue corresponds to a different latency. A request arbiter examines a shift register to determine whether there is an available slot for the one or more requests. A slot is available for a request if there is a slot that is a number of slots from the end of the shift register equal to the number of cycles the request takes to complete processing in a corresponding processing pipeline. If a slot is available, the request is scheduled for execution and the slot is marked as being occupied. If a slot is not available, the request is not scheduled for execution on the current cycle. On transitioning to a new cycle, the shift register is shifted towards its end and the technique repeats.
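The slot check described above can be illustrated with a minimal Python sketch. It is one reading of the abstract, not the patented implementation: the class name `RequestArbiter`, the method names, and the convention that `slots[k]` marks a result due `k` cycles from now are all assumptions made for illustration.

```python
class RequestArbiter:
    """Toy model of a shift-register arbiter (hypothetical names throughout)."""

    def __init__(self, max_latency):
        # slots[k] is True when a result is already due k cycles from now.
        # Index 0 is the end of the register (assumed convention).
        self.slots = [False] * (max_latency + 1)

    def try_schedule(self, latency):
        """Schedule a request of the given latency if its slot is free."""
        if not 0 < latency < len(self.slots):
            raise ValueError("latency outside the range tracked by the register")
        if self.slots[latency]:
            return False              # slot occupied: not scheduled this cycle
        self.slots[latency] = True    # mark the slot as occupied
        return True

    def tick(self):
        """Advance one cycle: shift every slot one step toward the end."""
        self.slots.pop(0)
        self.slots.append(False)


arbiter = RequestArbiter(max_latency=4)
first = arbiter.try_schedule(3)    # slot 3 is free, so the request is scheduled
second = arbiter.try_schedule(3)   # same slot is now occupied, so it is rejected
arbiter.tick()                     # the occupied slot moves from index 3 to 2
third = arbiter.try_schedule(2)    # would collide with the earlier request
fourth = arbiter.try_schedule(3)   # slot 3 is free again after the shift
```

Under this convention, two requests are rejected only when they would finish in the same cycle, which is the collision the shift register is there to prevent.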
Triple-pass execution using a retire queue having a functional unit to independently execute long latency instructions and dependent instructions
An execution pipeline architecture of a microprocessor employs a third-pass functional unit, for example, a third-level arithmetic logic unit (ALU) or a third short-latency execution unit, to execute instructions with reduced complexity and area cost of out-of-order execution. The third-pass functional unit allows instructions with long latency execution to be moved into a retire queue. The retire queue further includes the third functional unit (e.g., ALU), a reservation station and a graduate buffer. Data dependencies of dependent instructions in the retire queue are handled independently from the main pipeline.
Techniques to derive efficient conversion and/or color correction of video data
The present disclosure describes techniques for removing unnecessary processing stages from a graphics processing pipeline based on the format of data passed between the stages. Starting with a stage at a middle point in a pipeline, formats of data that are input to and output from the middle stage may be compared to each other. If the formats match, the middle stage may be removed from the pipeline. Thereafter, the format of data input to a pair of middle stages of the pipeline and output from the pipeline may be compared and, if they match, the middle pair may be deleted. This process may repeat until a middle pair is found where no match occurs between the input and output format. The remaining stages of the pipeline may be retained. In cases where a pipeline is not symmetrical, the formats of data at each node may be compared to each other. If a node possesses a format that does not match the format of any other node, then the stages between the node and its closest endpoint in the pipeline may be retained.
Efficient instruction processing for sparse data
Efficient instruction processing for sparse data includes extensions to a processor pipeline to identify zero-optimizable instructions that include at least one zero input operand, and bypass the execute stage of the processor pipeline, determining the result of the operation without executing the instruction. When possible, the extensions also bypass the writeback stage of the processor pipeline.
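As a rough illustration of the idea, the sketch below checks whether an instruction's result can be determined from a zero operand alone. The abstract does not list which opcodes qualify, so the choice of `mul` and `add`, and the function name `try_zero_bypass`, are assumptions.

```python
def try_zero_bypass(op, a, b):
    """Return (True, result) if the result is known without executing.

    Only multiplication and addition are covered here (an assumption);
    every other case falls through to normal execution.
    """
    if op == "mul" and (a == 0 or b == 0):
        return True, 0       # anything times zero is zero
    if op == "add":
        if a == 0:
            return True, b   # 0 + b = b
        if b == 0:
            return True, a   # a + 0 = a
    return False, None       # not zero-optimizable: use the execute stage


bypass_mul = try_zero_bypass("mul", 0, 7)
bypass_add = try_zero_bypass("add", 5, 0)
no_bypass = try_zero_bypass("mul", 2, 3)
```

A hardware version would make this decision before the execute stage; the sketch only shows the decision itself.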
PIPELINED CONFIGURABLE PROCESSOR
A configurable processing circuit capable of handling multiple threads simultaneously, the circuit comprising a thread data store, a plurality of configurable execution units, a configurable routing network for connecting locations in the thread data store to the execution units, a configuration data store for storing configuration instances that each define a configuration of the routing network and a configuration of one or more of the plurality of execution units, and a pipeline formed from the execution units, the routing network and the thread data store that comprises a plurality of pipeline sections configured such that each thread propagates from one pipeline section to the next at each clock cycle, the circuit being configured to: (i) associate each thread with a configuration instance; and (ii) configure each of the plurality of pipeline sections for each clock cycle to be in accordance with the configuration instance associated with the respective thread that will propagate through that pipeline section during the clock cycle.
HIGH PERFORMANCE EXPRESSION EVALUATOR UNIT
Devices and methods for limiting register usage through the use of fixed function processing are provided. The method may include receiving instructions executable by a processor. The method may also include determining that a set of the instructions is executable according to a restricted register mode when the set of the instructions relates to one or more single function operations, wherein the restricted register mode includes only a single access or no access to a register. The method may further include executing, by an expression evaluator, operations of the set of the instructions related to the one or more single function operations, wherein the executing is performed in the restricted register mode and in parallel with the processor performing additional operations of the instructions.
Memristor based multithreading
A method and a device that includes a set of multiple pipeline stages, wherein the set of multiple pipeline stages is arranged to execute a first thread of instructions; multiple memristor based registers that are arranged to store a state of another thread of instructions that differs from the first thread of instructions; and a control circuit that is arranged to control a thread switch between the first thread of instructions and the other thread of instructions by controlling a storage of a state of the first thread of instructions at the multiple memristor based registers and by controlling a provision of the state of the other thread of instructions by the set of multiple pipeline stages; wherein the set of multiple pipeline stages is arranged to execute the other thread of instructions upon a reception of the state of the other thread of instructions.
EXECUTION PIPELINE ADAPTATION
An apparatus and method of data processing are provided. The apparatus comprises at least two execution pipelines, one with a shorter execution latency than the other. The execution pipelines share a write port and issue circuitry of the apparatus issues decoded instructions to a selected execution pipeline. The apparatus further comprises at least one additional pipeline stage and the issue circuitry can detect a write port conflict condition in dependence on a latency indication associated with a decoded instruction which it is to issue. If the issue circuitry intends to issue the decoded instruction to the execution pipeline with the shorter execution latency then when the write port conflict condition is found the issue circuitry will cause use of at least one additional pipeline stage in addition to the target execution pipeline to avoid the write port conflict.
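The conflict check can be sketched in Python as below. This is a simplified model under stated assumptions: the write port accepts one result per cycle, a conflicting instruction bound for the shorter pipeline is delayed by up to `extra_stages` cycles, and any other conflict is treated as a stall. The names `IssueStage`, `issue`, and `extra_stages` are hypothetical.

```python
class IssueStage:
    """Toy issue logic for two pipelines sharing one write port."""

    def __init__(self, short_latency, long_latency, extra_stages=1):
        self.short_latency = short_latency
        self.long_latency = long_latency
        self.extra_stages = extra_stages
        self.reserved = set()   # cycles in which the write port is claimed

    def issue(self, cycle, use_short):
        """Return the write-back cycle, or None if the instruction must stall."""
        latency = self.short_latency if use_short else self.long_latency
        write_cycle = cycle + latency
        if write_cycle not in self.reserved:
            self.reserved.add(write_cycle)
            return write_cycle
        if use_short:
            # Conflict on the short pipeline: route through additional stages.
            for extra in range(1, self.extra_stages + 1):
                if write_cycle + extra not in self.reserved:
                    self.reserved.add(write_cycle + extra)
                    return write_cycle + extra
        return None   # assumed fallback: stall and retry in a later cycle


stage = IssueStage(short_latency=1, long_latency=3)
long_wb = stage.issue(cycle=0, use_short=False)    # writes back in cycle 3
short_wb = stage.issue(cycle=2, use_short=True)    # cycle 3 is taken, so 4
stalled = stage.issue(cycle=1, use_short=False)    # cycle 4 is taken: stall
```

The extra stage lets the short-latency instruction issue immediately rather than wait, at the cost of one added cycle of latency.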