Patent classifications
G06F9/3873
TRIPLE-PASS EXECUTION
An execution pipeline architecture of a microprocessor employs a third-pass functional unit, for example, third-level of arithmetic logic unit (ALU) or third short-latency execution unit to execute instructions with reduced complexity and area cost of out-of-order execution. The third-pass functional unit allows instructions with long latency execution to be moved into a retire queue. The retire queue further includes the third functional unit (e.g., ALU), a reservation station and a graduate buffer. Data dependencies of dependent instructions in the retire queue is handled independently from the main pipeline.
System, method, and computer program product for implementing large integer operations on a graphics processing unit
A system, method, and computer program product for generating executable code for performing large integer operations on a parallel processing unit is disclosed. The method includes the steps of compiling a source code linked to a large integer library to generate an executable file and executing the executable file to perform a large integer operation using a parallel processing unit. The large integer library includes functions for processing large integers that are optimized for the parallel processing unit.
PARALLEL INSTRUCTION SCHEDULER FOR BLOCK ISA PROCESSOR
Apparatus and methods are disclosed for implementing block-based processors, including field programmable gate-array (FPGA) implementations. In one example of the disclosed technology, an instruction decoder configured to generate ready state data for a set of instructions in an instruction block, each of the set of instructions being associated with a different instruction identifier encoded in the transactional block and a parallel instruction scheduler configured to issue an instruction from the set of instructions based on the decoded ready state data. In some examples, the parallel instruction scheduler allows for improved area and energy savings according to the size and type of FPGA components available.
INCREMENTAL SCHEDULER FOR OUT-OF-ORDER BLOCK ISA PROCESSORS
Apparatus and methods are disclosed for implementing incremental schedulers for out-of-order block-based processors, including field programmable gate array implementations. In one example of the disclosed technology, a processor includes an instruction scheduler formed by configuring one or more look up table RAMs to store ready state data for a plurality of instructions in an instruction block. The instruction scheduler further includes a plurality of queues that store ready state data for the processor and sends dependency information to ready determination logic on a first in/first out basis. The instruction scheduler selects one or more of the ready instructions to be issued and executed by the block-based processor.
OUT-OF-ORDER BLOCK-BASED PROCESSORS AND INSTRUCTION SCHEDULERS
Apparatus and methods are disclosed for implementing block-based processors including field programmable gate-array implementations. In one example of the disclosed technology, a block-based processor includes an instruction decoder configured to generate decoded ready dependencies for a transactional block of instructions, where each of the instructions is associated with a different instruction identifier encoded in the transactional block. The processor further includes an instruction scheduler configured to issue an instruction from the set of transactional block of instructions out of order. The instruction is issued based on determining that decoded ready state dependencies for an instruction are satisfied. The determining includes accessing storage with the decoded ready dependencies indexed with a respective instruction identifier that is encoded in the transactional block of instructions.
Scalable compute fabric
A method and apparatus for providing a scalable compute fabric are provided herein. The method includes determining a workflow for processing by the scalable compute fabric, wherein the workflow is based on an instruction set. A pipeline in configured dynamically for processing the workflow, and the workflow is executed using the pipeline.
SCALABLE COMPUTE FABRIC
A method and apparatus for providing a scalable compute fabric are provided herein. The method can include determining a workflow for processing by the scalable compute fabric, wherein the workflow is based on an instruction set. A pipeline can be configured dynamically for processing the workflow, and the workflow is executed using the pipeline. A computing device can include a first group of two or more processing cores, a second group of two or more processing cores, and a third group of two or more processing cores. One or more of the first group of two or more processing cores, the second group of two or more processing cores, and the third group of two or more processing cores can power down based on a type of task to be performed.
EFFICIENT INSTRUCTION PROCESSING FOR SPARSE DATA
Efficient instruction processing for sparse data includes extensions to a processor pipeline to identify zero-optimizable instructions that include at least one zero input operand, and bypass the execute stage of the processor pipeline, determining the result of the operation without executing the instruction. When possible, the extensions also bypass the writeback stage of the processor pipeline.
LOOP CODE PROCESSOR OPTIMIZATIONS
Loop code processor optimizations are implemented as a loop optimizer extension to a processor pipeline. The loop optimizer generates optimized code associated with code loops that include at least one zero-optimizable instruction. The loop optimizer may generate multiple versions of optimized code associated with a particular code loop, where each of the multiple version of optimized code has a different associated condition under which the optimized code can be safely executed.
Parallel slice processor with dynamic instruction stream mapping
A method of operation of a processor core having multiple parallel instruction execution slices and coupled to multiple dispatch queues coupled by a dispatch routing network provides flexible and efficient use of internal resources. The dispatch routing network is controlled to dynamically vary the relationship between the slices and instruction streams according to execution requirements for the instruction streams and the availability of resources in the instruction execution slices. The instruction execution slices may be dynamically reconfigured as between single-instruction-multiple-data (SIMD) instruction execution and ordinary instruction execution on a per-instruction basis. Instructions having an operand width greater than the width of a single instruction execution slice may be processed by multiple instruction execution slices configured to act in concert for the particular instructions. When an instruction execution slice is busy processing a current instruction for one of the streams, another slice can be selected to proceed with execution.