G06F9/381

DYNAMIC HAMMOCK BRANCH TRAINING FOR BRANCH HAMMOCK DETECTION IN AN INSTRUCTION STREAM EXECUTING IN A PROCESSOR
20210089313 · 2021-03-25 ·

Dynamic hammock branch training for branch hammock detection in an instruction stream executing in a processor is disclosed. A branch hammock detection circuit is configured to dynamically detect branch hammocks in an instruction stream during run-time processing of the instruction stream. In response to an identified conditional branch instruction, the branch hammock detection circuit starts a training process for a potential branch hammock predicated by the conditional branch instruction. The branch hammock detection circuit is configured to determine if an identified in-training branch hammock is an actual branch hammock based on setting a potential convergence point as the target address for the conditional branch instruction based on whether the branch is taken or not taken. If an instruction is processed at the set convergence point, this means the set convergence point can be an actual convergence point and the in-training branch hammock can be detected as an actual branch hammock.

Using loop exit prediction to accelerate or suppress loop mode of a processor

A processor predicts a number of loop iterations associated with a set of loop instructions. In response to the predicted number of loop iterations exceeding a first loop iteration threshold, the set of loop instructions are executed in a loop mode that includes placing at least one component of an instruction pipeline of the processor in a low-power mode or state and executing the set of loop instructions from a loop buffer. In response to the predicted number of loop iterations being less than or equal to a second loop iteration threshold, the set of instructions are executed in a non-loop mode that includes maintaining at least one component of the instruction pipeline in a powered up state and executing the set of loop instructions from an instruction fetch unit of the instruction pipeline.

Memory-adaptive processing method for convolutional neural network and system thereof

A memory-adaptive processing method for a convolutional neural network includes a feature map counting step, a size relation counting step and a convolution calculating step. The feature map counting step is for counting a plurality of input channels of an input feature map tile and a plurality of output channels of an output feature map tile for a convolutional layer operation of the convolutional neural network. The size relation counting step is for obtaining a cache free space size in a feature map cache and counting a size relation among a total input size, a total output size and the cache free space size of the feature map cache. The convolution calculating step is for performing the convolutional layer operation according to a memory-adaptive processing technique.

TRACKING STREAMING ENGINE VECTOR PREDICATES TO CONTROL PROCESSOR EXECUTION
20230418764 · 2023-12-28 ·

In a method of operating a computer system, an instruction loop is executed by a processor in which each iteration of the instruction loop accesses a current data vector and an associated current vector predicate. The instruction loop is repeated when the current vector predicate indicates the current data vector contains at least one valid data element and the instruction loop is exited when the current vector predicate indicates the current data vector contains no valid data elements.

Streaming engine with early exit from loop levels supporting early exit loops and irregular loops
10908901 · 2021-02-02 · ·

A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements. A steam head register stores data elements next to be supplied to functional units for use as operands. Upon a stream break instruction specifying one of the nested loops, the stream engine ends a current iteration of the loop. If the specified loop was not the outermost loop, the streaming engine begins an iteration of a next outer loop. If the specified loop was the outermost nested loop, the streaming engine ends the stream. The streaming engine places a vector of data elements in order in lanes within a stream head register. A stream break instruction is operable upon a vector break.

PROCESSING METHOD, DEVICE, EQUIPMENT AND STORAGE MEDIUM OF LOOP INSTRUCTION
20210208891 · 2021-07-08 ·

The present application discloses a processing method, device, equipment and storage medium of a loop instruction, and relates to the fields of voice and chips. A specific embodiment is: acquiring a computer program including a first loop body, where the first loop body is generated according to a second loop body in a software code to be compiled, the first loop body includes a plurality of first loop instructions, the plurality of first loop instructions can be identified by a hardware structure of a computer device; in the case that the first loop body is detected, determining loop parameters of the first loop body according to the plurality of first loop instructions; acquiring the plurality of first loop instructions according to the loop parameters of the first loop body; executing the plurality of first loop instructions.

Checkpointing of architectural state for in order processing circuitry
11055096 · 2021-07-06 · ·

An in-order processor has a mapping storage element to store current register mapping information identifying, for each of two or more architectural register specifiers, which physical register specifies valid data for that architectural register specifier. At least one checkpoint storage element stores checkpoint register mapping corresponding to a checkpoint of previous architectural state. This enables checkpoints to be saved and restored simply by transferring mapping information between the mapping and checkpoint storage elements, rather than transferring the actual state data.

PROVIDING CODE SECTIONS FOR MATRIX OF ARITHMETIC LOGIC UNITS IN A PROCESSOR
20210026637 · 2021-01-28 · ·

The present invention relates to a processor having a trace cache and a plurality of ALUs arranged in a matrix, comprising an analyser unit located between the trace cache and the ALUs, wherein the analyser unit analyses the code in the trace cache, detects loops, transforms the code, and issues to the ALUs sections of the code combined to blocks for joint execution for a plurality of clock cycles.

Two address translations from a single table look-aside buffer read

A streaming engine employed in a digital data processor specifies a fixed read only data stream. An address generator produces virtual addresses of data elements. An address translation unit converts these virtual addresses to physical addresses by comparing the most significant bits of a next address N with the virtual address bits of each entry in an address translation table. Upon a match, the translated address is the physical address bits of the matching entry and the least significant bits of address N. The address translation unit can generate two translated addresses. If the most significant bits of address N+1 match those of address N, the same physical address bits are used for translation of address N+1. The sequential nature of the data stream increases the probability that consecutive addresses match the same address translation entry and can use this technique.

Apparatus and method for making predictions for instruction flow changing instructions
10901742 · 2021-01-26 · ·

An apparatus and method are provided for making predictions for instruction flow changing instructions. The apparatus has a fetch queue that identifies a sequence of instructions to be fetched for execution by execution circuitry, and prediction circuitry for making predictions in respect of instruction flow changing instructions, and for controlling which instructions are identified in the fetch queue in dependence on the predictions. The prediction circuitry is arranged, during each prediction iteration, to make a prediction for a predict block comprising a sequence of M instruction addresses, in order to identify whether that predict block contains the instruction address for an instruction flow changing instruction that is predicted as taken. During each prediction iteration, the prediction circuitry is arranged by default to access a prediction storage in order to produce prediction information for instructions associated with a specified block of instruction addresses (including at least the predict block being considered), and to use that prediction information to make the prediction for the predict block. Buffer storage is used to retain the prediction information obtained from the prediction storage during one or more previous prediction iterations, and detection circuitry is used to detect when a current predict block being considered during a current prediction iteration comprises one or more instruction addresses for which the associated prediction information is retained in the buffer storage. In that event, the above default behaviour is not adopted, and an override condition is triggered to cause the prediction information for those one or more instruction addresses to be obtained from the buffer storage rather than from the prediction storage, giving rise to a power saving.