Patent classifications
G06F9/3828
Instruction block allocation
Apparatus and methods are disclosed for throttling processor operation in block-based processor architectures. In one example of the disclosed technology, a block-based instruction set architecture processor includes a plurality of processing cores configured to fetch and execute a sequence of instruction blocks. Each of the processing cores includes function resources for performing operations specified by the instruction blocks. The processor further includes a core scheduler configured to allocate functional resources for performing the operations. The functional resources are allocated for executing the instruction blocks based, at least in part, on a performance metric. The performance metric can be generated dynamically or statically based on branch prediction accuracy, energy usage tolerance, and other suitable metrics.
System and method of populating an instruction word
A methodology for populating an instruction word for simultaneous execution of instruction operations by a plurality of ALUs in a data path is provided. The methodology includes: creating a dependency graph of instruction nodes, each instruction node including at least one instruction operation; first selecting a first available instruction node from the dependency graph; first assigning the selected first available instruction node to the instruction word; second selecting any available dependent instruction nodes that are dependent upon a result of the selected first available instruction node and do not violate any predetermined rule; second assigning to the instruction word the selected any available dependent instruction nodes; and updating the dependency graph to remove any instruction nodes assigned during the first and second assigning from further consideration for assignment.
Coprocessors with Bypass Optimization, Variable Grid Architecture, and Fused Vector Operations
In an embodiment, a coprocessor may include a bypass indication which identifies execution circuitry that is not used by a given processor instruction, and thus may be bypassed. The corresponding circuitry may be disabled during execution, preventing evaluation when the output of the circuitry will not be used for the instruction. In another embodiment, the coprocessor may implement a grid of processing elements in rows and columns, where a given coprocessor instruction may specify an operation that causes up to all of the processing elements to operate on vectors of input operands to produce results. Implementations of the coprocessor may implement a portion of the processing elements. The coprocessor control circuitry may be designed to operate with the full grid or partial grid, reissuing instructions in the partial grid case to perform the requested operation. In still another embodiment, the coprocessor may be able to fuse vector mode operations.
FRACTAL CALCULATING DEVICE AND METHOD, INTEGRATED CIRCUIT AND BOARD CARD
A fractal calculating device according to an embodiment of the present application is included in an integrated circuit device. The integrated circuit device includes a universal interconnect interface and other processing devices. The calculating device interacts with other processing devices to jointly complete a user specified calculation operation. The integrated circuit device may also comprise a storage device. The storage device is respectively connected with the calculating device and other processing devices and is used for data storage of the computing device and other processing devices
Systems and methods for implementing chained tile operations
Disclosed embodiments relate to systems and methods for implementing chained tile operations. In one example, a processor includes fetch circuitry to fetch one or more instructions until a plurality of instructions has been fetched, each instruction to specify source and destination tile operands, decode circuitry to decode the fetched instructions, and execution circuitry, responsive to the decoded instructions, to: identify first and second decoded instructions belonging to a chain of instructions, dynamically select and configure a SIMD path comprising first and second processing engines (PE) to execute the first and second decoded instructions, and set aside the specified destination of the first decoded instruction, and instead route a result of the first decoded instruction from the first PE to be used by the second PE to perform the second decoded instruction.
Arithmetic logic unit layout for a processor
A processor has first, second and third ALUs. The first ALU has on a first side an input and an output. The second ALU has a first side facing the first side of the first ALU, an input and an output on the first side of the second ALU and being in a rotated orientation relative to the input and the output of the first side of the first ALU, and an output on a second side of the second ALU. The third ALU has a first side facing the second side of the second ALU, and an input and an output on the first side of the third ALU. The input of the first side of the first ALU is logically directly connected to the output of the first side of the second ALU.
System and method for populating multiple instruction words
A methodology for populating multiple instruction words is provided. The methodology includes: creating a dependency graph of instruction nodes, each instruction node including at least one instruction operation; first assigning a first instruction node to a first instruction word; identifying a dependent instruction node that is directly dependent upon a result of the first instruction node; first determining whether the dependent instruction node requires any input from two or more sources that are outside of a predefined physical range of each other, the range being smaller than the full extent of the data path; and second assigning, in response to satisfaction of at least one predetermined criteria including a negative result of the first determining, the dependent instruction node to the first instruction word.
Exception handling
A processor includes: a memory; an execution pipeline having a plurality of pipeline stages for executing an operation on data provided to the execution pipeline and storing a result of the operation into the memory; a receive pipeline having a plurality of pipeline stages for handling incoming data to the processor and storing the incoming data into memory; context status storage for holding an exception indicator of an exception encountered by the receive pipeline whilst it is handling incoming data; the receive pipeline being configured to determine that an exception has been encountered in one of its pipeline stages and to delay committing the exception indicator to the context status storage until a final one of its pipeline stages and to continue to receive and store incoming data into the memory until the exception indicator has been committed.
Routing circuitry for permutation of single-instruction multiple-data operands
Techniques are disclosed relating to routing circuitry configured to perform permute operations for operands of threads in a single-instruction multiple-data group. In some embodiments, an apparatus includes hierarchical operand routing circuitry configured to route operands between a set of single-instruction multiple-data (SIMD) pipelines based on a permute instruction. In some embodiments, the routing circuitry includes a first level and a second level. The first level may include a set of multiple crossbar circuits each configured to receive operands from a respective subset of the pipelines and output one or more of the received operands on multiple output lines based on the permute instruction, where the crossbar circuits support full permutation within a respective subset. A second level may be configured to select an operand from a previous level for each of the pipelines, and may select from among only a portion of output operands from the previous level to provide an operand for a respective pipeline.
SYSTEM AND METHOD OF POPULATING AN INSTRUCTION WORD
A methodology for populating an instruction word for simultaneous execution of instruction operations by a plurality of ALUs in a data path is provided. The methodology includes: creating a dependency graph of instruction nodes, each instruction node including at least one instruction operation; first selecting a first available instruction node from the dependency graph; first assigning the selected first available instruction node to the instruction word; second selecting any available dependent instruction nodes that are dependent upon a result of the selected first available instruction node and do not violate any predetermined rule; second assigning to the instruction word the selected any available dependent instruction nodes; and updating the dependency graph to remove any instruction nodes assigned during the first and second assigning from further consideration for assignment.