G06F9/3804

Power efficient pattern history table fetch in branch predictor

A method and apparatus for branch prediction is disclosed. A pattern history table (PHT) is accessed based on at least one global history value to obtain a prediction value. The prediction value and the at least one global history value used to obtain the prediction value are placed in a queue. If a branch prediction is requested, the queue is accessed to obtain a prediction value. The queue may include any number of entries and the queue maintains the oldest prediction value at the head of the queue. The prediction value at the head of the queue is used when a branch prediction is needed.

DYNAMIC PIPELINE THROTTLING USING CONFIDENCE-BASED WEIGHTING OF IN-FLIGHT BRANCH INSTRUCTIONS

Systems and methods for operating a processor include determining confidence levels, such as high, low, and medium confidence levels, associated with in-flight branch instructions in an instruction pipeline of the processor, based on counters used for predicting directions of the in-flight branch instructions. Numbers of in-flight branch instructions associated with each of confidence levels are determined. A weighted sum of the numbers weighted with weights corresponding to the confidence levels is calculated and the weighted sum is compared with a threshold. A throttling signal may be asserted to indicate that instructions are to be throttled in a pipeline stage of the instruction pipeline based on the comparison.

Instruction sequence buffer to store branches having reliably predictable instruction sequences
09733944 · 2017-08-15 · ·

A method for outputting reliably predictable instruction sequences. The method includes tracking repetitive hits to determine a set of frequently hit instruction sequences for a microprocessor, and out of that set, identifying a branch instruction having a series of subsequent frequently executed branch instructions that form a reliably predictable instruction sequence. The reliably predictable instruction sequence is stored into a buffer. On a subsequent hit to the branch instruction, the reliably predictable instruction sequence is output from the buffer.

METHODS, SYSTEMS, AND APPARATUSES FOR PRECISE LAST BRANCH RECORD EVENT LOGGING

Systems, methods, and apparatuses relating to circuitry to implement precise last branch record event logging in a processor are described. In one embodiment, a hardware processor core includes an execution circuit to execute instructions, a retirement circuit to retire executed instructions, a status register, and a last branch record circuit to, in response to retirement by the retirement circuit of a first taken branch instruction, start a cycle timer and a performance monitoring event counter, and in response to retirement by the retirement circuit of a second taken branch instruction, that is a next taken branch instruction in program order after the first taken branch instruction, write values from the cycle timer and the performance monitoring event counter into a first entry in the status register and clear the values from the cycle timer and the performance monitoring event counter.

System and technique for retrieving an instruction from memory based on a determination of whether a processor will execute the instruction
09817665 · 2017-11-14 · ·

A technique includes receiving a request from a processor to retrieve a first instruction from a memory for a staged execution pipeline. The technique includes selectively retrieving the first instruction from the memory in response to the request based on a determination of whether the processor will execute the first instruction.

Managed instruction cache prefetching

Disclosed is an apparatus and method to manage instruction cache prefetching from an instruction cache. A processor may comprise: a prefetch engine; a branch prediction engine to predict the outcome of a branch; and dynamic optimizer. The dynamic optimizer may be used to control: identifying common instruction cache misses and inserting a prefetch instruction from the prefetch engine to the instruction cache.

Arithmetic processing unit and control method for arithmetic processing unit
11249763 · 2022-02-15 · ·

An arithmetic processing unit includes an instruction decoder which decodes a fetch instruction to issue an execution instruction; a reservation station which temporarily stores the execution instruction; and an arithmetic unit which executes the execution instruction, and the fetch instruction includes a multi-flow instruction which is divided into divided instructions and a single instruction. The instruction decoder includes: a pre-decoder including N number of slots each of which divides the multi-flow instruction into divided instructions; a main decoder including N number of slots each of which decodes the instructions to issue an execution instruction; and a pre-decoder buffer including N−K number of slots each of which temporarily stores instructions in the pre-decoder. The instruction decoder repeats transferring the divided instructions and the single instructions from the slots of the pre-decoder and the slots of the pre-decoder buffer to the main decoder as much as possible in order.

Apparatus and method for handling incorrect branch direction predictions

An apparatus and method are provided for handling incorrect branch direction predictions. The apparatus has processing circuitry for executing instructions, branch prediction circuitry for making branch direction predictions in respect of branch instructions, and fetch circuitry for fetching instructions from an instruction cache in dependence on the branch direction predictions and for forwarding the fetched instructions to the processing circuitry for execution. A cache location buffer stores cache location information for a given branch instruction for which accuracy of the branch direction predictions made by the branch prediction circuitry is below a determined threshold. The cache location information identifies where within the instruction cache one or more instructions are stored that will need to be executed in the event that a subsequent branch direction prediction made for the given branch instruction is incorrect. Control circuitry is responsive to such a subsequent branch direction prediction to obtain from the cache location buffer the cache location information and to provide the obtained cache location information to the fetch circuitry to enable the fetch circuitry to retrieve from the instruction cache the one or more instructions that are to be executed in the event that the processing circuitry does determine that the subsequent branch direction prediction was incorrect.

SINGLE CYCLE MULTI-BRANCH PREDICTION INCLUDING SHADOW CACHE FOR EARLY FAR BRANCH PREDICTION
20170262287 · 2017-09-14 · ·

A method of identifying instructions including accessing a plurality of instructions that comprise multiple branch instructions. For each branch instruction of the multiple branch instructions, a respective first mask is generated representing instructions that are executed if a branch is taken. A respective second mask is generated representing instructions that are executed if the branch is not taken. A prediction output is received that comprises a respective branch prediction for each branch instruction. For each branch instruction, the prediction output is used to select a respective resultant mask from among the respective first and second masks. For each branch instruction, a resultant mask of a subsequent branch is invalidated if a previous branch is predicted to branch over said subsequent branch. A logical operation is performed on all resultant masks to produce a final mask. The final mask is used to select a subset of instructions for execution.

COALESCING ADJACENT GATHER/SCATTER OPERATIONS

According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous a first and a second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.