G06F9/30065

CONTROL OF BRANCH PREDICTION FOR ZERO-OVERHEAD LOOP

In response to decoding a zero-overhead loop control instruction of an instruction set architecture, processing circuitry sets at least one loop control parameter for controlling execution of one or more iterations of a program loop body of a zero-overhead loop. Based on the at least one loop control parameter, loop control circuitry controls execution of the one or more iterations of the program loop body of the zero-overhead loop, the program loop body excluding the zero-overhead loop control instruction. Branch prediction disabling circuitry detects whether the processing circuitry is executing the program loop body of the zero-overhead loop associated with the zero-overhead loop control instruction, and dependent on detecting that the processing circuitry is executing the program loop body of the zero-overhead loop, disables branch prediction circuitry. This reduces power consumption during a zero-overhead loop when the branch prediction circuitry is unlikely to provide a benefit.

Systems and methods for improving cache efficiency and utilization

Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.

Program flow prediction for loops
11650822 · 2023-05-16 · ·

Instruction processing circuitry comprises fetch circuitry to fetch instructions for execution; instruction decoder circuitry to decode fetched instructions; execution circuitry to execute decoded instructions; and program flow prediction circuitry to predict a next instruction to be fetched; in which the instruction decoder circuitry is configured to decode a loop control instruction in respect of a given program loop and to derive information from the loop control instruction for use by the program flow prediction circuitry to predict program flow for one or more iterations of the given program loop.

ARITHMETIC PROCESSING UNIT AND CONTROL METHOD FOR ARITHMETIC PROCESSING UNIT
20170371674 · 2017-12-28 · ·

An apparatus includes: a cache to retain an instruction; an instruction-control circuit to read out the instruction from the cache; and an instruction-execution circuit to execute the instruction read out from the cache, wherein the cache includes: a pipeline processing circuit including a plurality of selection stages in each of which, among a plurality of requests for causing the cache to operate, a request having a priority level higher than priority levels of other requests is outputted to a next stage and a plurality of processing stages in each of which processing based on a request outputted from a last stage among the plurality of selection stages is sequentially executed; and a cache-control circuit to input a request received from the instruction-control circuit to the selection stage in which processing order of the processing stage is reception order of the request.

PROCESSOR AND CONTROL METHOD OF PROCESSOR

A processor includes: a storage unit that stores instructions; a counting unit that specifies an instruction to be decoded by a count value; a decoding unit that decodes an instruction; and a control unit that, when the decoded instruction is a repeat instruction, updates the count value of the counting unit so as to cause repeat target instructions in number corresponding to a designated number of instructions, out of instructions succeeding the repeat instruction, to be repeatedly executed a designated number of repetition times, and generates updated operands being operation objects of the repeat target instructions that are to be executed for the second or later time, and when the repeat target instructions are to be executed for the second or later time, updates operands of the repeat target instructions for use in the second or later time execution, to the generated updated operands and outputs the updated operands.

Providing code sections for matrix of arithmetic logic units in a processor
11687346 · 2023-06-27 · ·

The present invention relates to a processor having a trace cache and a plurality of ALUs arranged in a matrix, comprising an analyser unit located between the trace cache and the ALUs, wherein the analyser unit analyses the code in the trace cache, detects loops, transforms the code, and issues to the ALUs sections of the code combined to blocks for joint execution for a plurality of clock cycles.

Method and system for hard ware-assisted pre-execution

One aspect provides a system for hardware-assisted pre-execution. During operation, the system determines a pre-execution code region comprising one or more instructions. The system increments a global counter upon initiating the one or more instructions. The system issues a first instruction, which involves setting, in a first entry for the first instruction in a data structure, a first prefetch region identifier with a current value of the global counter. Responsive to a head pointer of the data structure reaching the first entry, the system: determines, based on a non-zero value for the first prefetch region identifier, that the first entry is not available to be allocated; and advances the head pointer to a next entry in the data structure, which renders a load associated with the first entry as a non-blocking load. The system resets the global counter upon completing the one or more instructions.

GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT

Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.

GRAPH COMPUTING APPARATUS, PROCESSING METHOD, AND RELATED DEVICE
20230195526 · 2023-06-22 ·

Embodiments of this application disclose apparatuses, processing methods, and related devices An example apparatus includes at least one processing engine (PE), and each of the at least one PE includes M status buffers, an arbitration logic circuit, and X operation circuits. Each of the M status buffers is configured to store status data of one iterative computing task. The arbitration logic circuit is configured to determine, based on the status data in the each of the M status buffers, L graph computing instructions to be executed in a current clock cycle, and allocate the L graph computing instructions to the X operation circuits. Each of the X operation-units circuits is configured to execute a graph computing instruction allocated by the arbitration logic circuit.

METHOD TO EXECUTE A SENSITIVE COMPUTATION USING MULTIPLE DIFFERENT AND INDEPENDENT BRANCHES
20170344376 · 2017-11-30 · ·

The present invention relates to a method to execute by a processing unit a sensitive computation using multiple different and independent branches each necessitating a given number of processing unit time units to be executed, characterized in that it comprises the following steps of, at each execution of a sensitive computation: generating at least as many identifiers as the number of branches, associating each identifier to a unique branch, generating a random permutation of identifiers, the number of occurrences of each identifier in the permutation being at least equal to the number of central processing unit time units in the shortest of the branches, by processing each identifier in the random permutation, determining successively the branch to execute by each successive central processing unit time units according to the identifier value, for each identifier of the random permutation, executing a central processing unit time unit for the branch determined according to the identifier value.