G06F9/3824

ADVANCED PROCESSOR ARCHITECTURE
20180004530 · 2018-01-04

The invention relates to a method for processing instructions out-of-order on a processor comprising an arrangement of execution units. The inventive method comprises: 1) looking up operand sources in a Register Positioning Table (RPT) and setting operand input references of the instruction to be issued accordingly; 2) checking for an Execution Unit (EXU) available for receiving a new instruction; and 3) issuing the instruction to the available Execution Unit and entering, into the RPT, a reference of the result register addressed by the instruction to that Execution Unit.
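
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the patented design: the names `ExecutionUnit`, `Processor`, and `issue`, the `regfile:` fallback for operands with no RPT entry, and the stall-by-returning-`None` behaviour are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionUnit:
    name: str
    busy: bool = False
    operands: tuple = ()


@dataclass
class Processor:
    exus: list
    # Register Positioning Table: result register -> name of the EXU producing it
    rpt: dict = field(default_factory=dict)

    def issue(self, dest, srcs):
        # Step 1: look up each operand source in the RPT; operands without an
        # entry are assumed (hypothetically) to come from the register file.
        refs = tuple(self.rpt.get(r, f"regfile:{r}") for r in srcs)
        # Step 2: check for an execution unit that can accept a new instruction.
        exu = next((e for e in self.exus if not e.busy), None)
        if exu is None:
            return None  # no EXU available: stall, RPT left unchanged
        # Step 3: issue to that EXU and record the result register in the RPT.
        exu.busy = True
        exu.operands = refs
        self.rpt[dest] = exu.name
        return exu.name
```

With two free units, issuing `r1 = f(r2, r3)` and then `r4 = g(r1, r2)` places the second instruction on another unit whose first operand reference points at the unit producing `r1`.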

METHOD FOR EXECUTING MULTITHREADED INSTRUCTIONS GROUPED INTO BLOCKS
20180011738 · 2018-01-11

A method for executing multithreaded instructions grouped into blocks. The method includes receiving an incoming instruction sequence using a global front end; grouping the instructions to form instruction blocks, wherein the instructions of the instruction blocks are interleaved with multiple threads; scheduling the instructions of the instruction block to execute in accordance with the multiple threads; and tracking execution of the multiple threads to enforce fairness in an execution pipeline.

STREAM REFERENCE REGISTER WITH DOUBLE VECTOR AND DUAL SINGLE VECTOR OPERATING MODES
20180011709 · 2018-01-11

A streaming engine employed in a digital signal processor specifies a fixed read-only data stream. Once fetched, the data stream is stored in two head registers for presentation to functional units in the fixed order. Data use by the functional unit is preferably controlled using the input operand fields of the corresponding instruction. A first read-only operand coding supplies data from the first head register. A first read/advance operand coding supplies data from the first head register and also advances the stream to the next sequential data elements. Corresponding second read-only operand coding and second read/advance operand coding operate similarly with the second head register. A third read-only operand coding supplies double-width data from both head registers.
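
The five operand codings can be illustrated with a small model. This is only a sketch: the class and method names are hypothetical, and it assumes that the two head registers hold consecutive stream elements and that any advance moves both heads to the next pair, which the abstract does not spell out.

```python
class StreamEngine:
    """Toy model of a fixed, read-only stream with two head registers."""

    def __init__(self, elements):
        self._elements = list(elements)
        self._pos = 0  # index of the element held in the first head register

    def read(self, head):
        # Read-only coding: supply data from head register 0 or 1, no advance.
        return self._elements[self._pos + head]

    def read_advance(self, head):
        # Read/advance coding: supply data, then step to the next elements.
        # Assumption: the stream advances past both head registers at once.
        value = self.read(head)
        self._pos += 2
        return value

    def read_double(self):
        # Third read-only coding: double-width data from both head registers.
        return (self.read(0), self.read(1))
```

Repeated `read(0)` calls return the same element, whereas `read_advance` changes what the next read of either head returns.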

APPARATUS TO OPTIMIZE GPU THREAD SHARED LOCAL MEMORY ACCESS

One embodiment provides for a graphics processor comprising first logic coupled with a first execution unit, the first logic to receive a first single instruction multiple data (SIMD) message from the first execution unit; second logic coupled with a second execution unit, the second logic to receive a second SIMD message from the second execution unit; and third logic coupled with a bank of shared local memory (SLM), the third logic to receive a first request to access the bank of SLM from the first logic, a second request to access the bank of SLM from the second logic, and in a single access cycle, schedule a read access to a read port for the first request and a write access to a write port for the second request.
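
The single-cycle scheduling described above might look like the following sketch. The function name `schedule_cycle`, the `(source, kind)` request tuples, and the policy of deferring a request whose port is already taken are assumptions for illustration; the abstract only states that one read and one write can be scheduled in the same access cycle.

```python
def schedule_cycle(requests):
    """Arbitrate one access cycle for a single SLM bank.

    requests: list of (source, kind) tuples, kind being "read" or "write".
    Returns (granted, deferred).
    """
    ports = {"read": None, "write": None}  # one read port, one write port
    deferred = []
    for request in requests:
        _, kind = request
        if ports[kind] is None:
            ports[kind] = request      # port is free this cycle: grant it
        else:
            deferred.append(request)   # port taken: retry in a later cycle
    granted = [r for r in ports.values() if r is not None]
    return granted, deferred
```

A read from the first logic and a write from the second logic are both granted in one cycle, while two reads to the same bank conflict on the read port.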

NEURAL NETWORK COMPUTE TILE

A computing unit is disclosed, comprising a first memory bank for storing input activations and a second memory bank for storing parameters used in performing computations. The computing unit includes at least one cell comprising at least one multiply accumulate (“MAC”) operator that receives parameters from the second memory bank and performs computations. The computing unit further includes a first traversal unit that provides a control signal to the first memory bank to cause an input activation to be provided to a data bus accessible by the MAC operator. The computing unit performs one or more computations associated with at least one element of a data array, the one or more computations being performed by the MAC operator and comprising, in part, a multiply operation of the input activation received from the data bus and a parameter received from the second memory bank.
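
As a rough software analogy of this data flow, the sketch below has a traversal loop select an input activation, place it on a "data bus" variable, and feed it to a multiply-accumulate step together with a parameter. The function name and the assumption that activations and parameters share the same index are illustrative only.

```python
def compute_tile(activations, parameters, indices):
    """Accumulate activation * parameter products for the given indices."""
    accumulator = 0
    for i in indices:                  # stands in for the first traversal unit
        data_bus = activations[i]      # activation from the first memory bank
        parameter = parameters[i]      # parameter from the second memory bank
        accumulator += data_bus * parameter  # the MAC operator
    return accumulator
```

For example, `compute_tile([1, 2, 3], [4, 5, 6], range(3))` computes the dot product of the two lists.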

System, apparatus, and method for a transient load instruction within a VLIW operation
11561792 · 2023-01-24

A transient load instruction for a processor may include a transient or temporary load instruction that is executed in parallel with a plurality of input operands. The temporary load instruction loads a memory value into a temporary location for use within the instruction packet. According to some examples, a VLIW-based microprocessor architecture may include a temporary cache for use in writing and reading a temporary memory value during a single VLIW packet cycle. The temporary cache differs from the normal register bank, which does not allow a value to be written and then read back during the same VLIW packet cycle.
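
The difference between the temporary cache and the register bank can be shown with a toy packet executor. Everything here is hypothetical: the `tload` and `add` operation tuples, the function name, and the rule that a `tload` appears before its consumers in the packet list are assumptions made for the sketch.

```python
def execute_packet(packet, regs, memory):
    """Execute one VLIW packet against a register dict and a memory dict."""
    temp = {}     # temporary cache: readable within the same packet
    pending = {}  # register writes: visible only after the packet completes

    def read(name):
        # Operands come from the temporary cache if present, else from the
        # register bank as it was at the start of the packet.
        return temp[name] if name in temp else regs[name]

    for op in packet:
        if op[0] == "tload":           # ("tload", temp_name, address)
            _, temp_name, address = op
            temp[temp_name] = memory[address]
        elif op[0] == "add":           # ("add", dest, src_a, src_b)
            _, dest, src_a, src_b = op
            pending[dest] = read(src_a) + read(src_b)
    regs.update(pending)               # commit; the temporary cache is dropped
    return regs
```

In the packet `tload t0, [100]; add r0, r0, t0; add r1, r0, r1`, the first `add` sees the freshly loaded `t0`, while the second `add` still sees the old `r0`.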

Execution unit
11561799 · 2023-01-24

An execution unit comprising a processing pipeline configured to perform calculations to evaluate a plurality of mathematical functions. The processing pipeline comprises a plurality of stages through which each calculation for evaluating a mathematical function progresses to an end result. Each of a plurality of processing circuits in the pipeline is configured to perform an operation on input values during at least one stage of the plurality of stages. The plurality of processing circuits include multiplier circuits. A first multiplier circuit and a second multiplier circuit are configured to operate in parallel, such that at the same stage in the processing pipeline, the first multiplier circuit and the second multiplier circuit perform their processing. A third multiplier circuit is arranged in series with the first multiplier circuit and the second multiplier circuit and processes outputs from the first multiplier circuit and the second multiplier circuit.
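
The multiplier arrangement can be mimicked in software as a two-stage pipeline. This is a sketch under assumptions: the function name, the four-value input tuples, and the single latch between stages are illustrative and not taken from the patent.

```python
def run_pipeline(inputs):
    """Simulate two parallel multipliers feeding a third one in series.

    inputs: iterable of (a, b, c, d) tuples; each result is (a * b) * (c * d).
    """
    latch = None   # pipeline register between stage 1 and stage 2
    results = []
    for item in list(inputs) + [None]:  # one extra cycle drains the pipeline
        # Stage 2: the third multiplier consumes last cycle's two products.
        if latch is not None:
            results.append(latch[0] * latch[1])
        # Stage 1: the first and second multipliers work in the same stage.
        if item is None:
            latch = None
        else:
            a, b, c, d = item
            latch = (a * b, c * d)
    return results
```

While the third multiplier finishes one calculation, the first two multipliers are already working on the next, so two calculations are in flight at once.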

Allocation of spare cache reserved during non-speculative execution and speculative execution
11561903 · 2023-01-24

A cache system, having cache sets, a connection to a line identifying an execution type, a connection to a line identifying a status of speculative execution, and a logic circuit that can: allocate a first subset of cache sets when the execution type is a first type indicating non-speculative execution, allocate a second subset when the execution type changes from the first type to a second type indicating speculative execution, and reserve at least one cache set when the execution type is the second type. When the execution type changes from the second to the first type and the status of speculative execution indicates that a result of speculative execution is to be accepted, the logic circuit can reconfigure the second subset when the execution type is the first type; and allocate the at least one cache set when the execution type changes from the first to the second type.

Inhibiting load instruction execution based on reserving a resource of a load and store queue but failing to reserve a resource of a store data queue

A calculation processing apparatus includes a decoder that decodes memory access instructions including a store instruction and a load instruction; a first queue that stores the decoded memory access instructions; a second queue that stores store data related to the store instruction; a storage circuit that stores target address information of the store instruction for which the first queue is reserved but the second queue is not reserved; and an inhibitor that, while the load instruction is being processed, inhibits execution of the load instruction if address information matching the target address information of the load instruction is stored in the storage circuit. This configuration prevents the order of a store instruction and a load instruction from being switched.
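
The inhibitor logic can be approximated by a small address tracker. The class and method names below are hypothetical, and exact address equality stands in for whatever matching the hardware actually performs.

```python
class StoreTracker:
    """Tracks stores that hold a first-queue entry but no store-data entry."""

    def __init__(self):
        self.pending = set()  # plays the role of the storage circuit

    def store_decoded(self, address, data_queue_reserved):
        # The first queue is reserved at decode; remember the target address
        # only if the store data queue could not be reserved.
        if not data_queue_reserved:
            self.pending.add(address)

    def store_data_reserved(self, address):
        # The store later obtained its data queue entry: stop tracking it.
        self.pending.discard(address)

    def may_execute_load(self, address):
        # Inhibit the load while a matching store address is still tracked.
        return address not in self.pending
```

A load to an address with a tracked store is held back until the store reserves its data queue entry, so the load cannot overtake that store.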

Apparatus and method for store pairing with reduced hardware requirements

An apparatus and method for pairing store operations. For example, one embodiment of a processor comprises: a grouping eligibility checker to evaluate a plurality of store instructions based on a set of grouping rules to determine whether two or more of the plurality of store instructions are eligible for grouping; and a dispatcher to simultaneously dispatch a first group of store instructions of the plurality of store instructions determined to be eligible for grouping by the grouping eligibility checker.
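
A grouping eligibility check and dispatcher might be sketched as below. The abstract does not list the grouping rules, so the single rule used here (two stores adjacent in program order that write consecutive `width`-byte locations) is an assumption, as are the function name and the choice to pair at most two stores.

```python
def group_stores(addresses, width=8):
    """Group stores, given in program order, into dispatch groups.

    Hypothetical rule: two adjacent stores are eligible for grouping when the
    second one writes the location immediately after the first.
    """
    groups = []
    i = 0
    while i < len(addresses):
        has_next = i + 1 < len(addresses)
        if has_next and addresses[i + 1] == addresses[i] + width:
            groups.append((addresses[i], addresses[i + 1]))  # dispatched together
            i += 2
        else:
            groups.append((addresses[i],))                   # dispatched alone
            i += 1
    return groups
```

Each returned tuple stands for one dispatch; stores that fail the eligibility check are dispatched on their own.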