G06F9/3832

COMPUTING DEVICE AND METHOD FOR REUSING DATA
20220206749 · 2022-06-30 · ·

A computing device and a method for reusing data are provided. The computing device includes a general register and an arithmetic unit coupled to the general register. The arithmetic unit includes a data reuse unit, which is coupled to multiple dot product data units. The data reuse unit is configured to read from the general register and temporarily store a data set used for multiple convolution operations, and determine multiple data subsets from the data set to be respectively inputted into the multiple dot product data units. Two data subsets inputted into two adjacent dot product data unit include a portion of the same data. Each of the multiple dot product data units is configured to perform a dot product operation on the inputted data subset, so as to generate a dot product operation result.

APPARATUS AND METHOD FOR HARDWARE-BASED MEMOIZATION OF FUNCTION CALLS TO REDUCE INSTRUCTION EXECUTION

Apparatus and method for memoizing repeat function calls are described herein. An apparatus embodiment includes: uop buffer circuitry to identify a function for memoization based on retiring micro-operations (uops) from a processing pipeline; memoization retirement circuitry to generate a signature of the function which includes input and output data of the function; a memoization data structure to store the signature; and predictor circuitry to detect an instance of the function to be executed by the processing pipeline and to responsively exclude a first subset of uops associated with the instance from execution when a confidence level associated with the function is above a threshold. One or more instructions that are data-dependent on execution of the instance is then provided with the output data of the function from the memoization data structure.

Logical register recovery within a processor

A computer system, processor, and method for processing information is disclosed that includes partitioning a logical register in the processor into a plurality of ranges of logical register entries based upon the logical register entry, assigning at least one recovery port of a history buffer to each range of logical register entries, initiating a flush recovery process for the processor, and directing history buffer entries to the assigned recovery port based upon the logical register entry associated with the history buffer entry.

METHOD AND APPARATUS FOR COMPARING PREDICTED LOAD VALUE WITH MASKED LOAD VALUE

A digital processor, method, and a non-transitory computer readable storage medium are described, and include a load pipeline operative to access a data content and convert the data content into a load result. The digital processor also includes a value prediction check circuit that is operative to access a speculative content, determine a predicted value from the speculative content, and determine a masked value by masking the data content with a data mask. The masked value is compared to the predicted value, and an action associated with the load result is commanded based upon the comparing of the masked value and the predicted value.

Slice-based allocation history buffer

A multi-slice processor comprising a high-level structure and history buffer. Write backs are no longer associated with the history buffer and the history buffer comprises slices determined by logical register allocation. The history buffer receives a register pointer entry and either releases or restores the entry with functional units comprised in the history buffer.

Shared pointer for local history records used by prediction circuitry

An apparatus has processing circuitry, and history storage circuitry to store local history records. Each local history record corresponds to a respective subset of instruction addresses and tracks a sequence of observed instruction behaviour observed for successive instances of instructions having addresses in that subset. Pointer storage circuitry to store a shared pointer shared between the local history records. The shared pointer indicates a common storage position reached in each local history record. Prediction circuitry determines predicted instruction behaviour for a given instruction address based on a selected portion of a selected local history record stored in the history storage circuitry. The prediction circuitry selects the selected local history record based on the given instruction address and selects the selected portion based on the shared pointer.

Opportunistic consumer instruction steering based on producer instruction value prediction in a multi-cluster processor

Opportunistic consumer instruction steering based on producer instruction value prediction in a multi-cluster processor is disclosed. A processor provides producer instructions and consumer instructions to a steering circuit that steers the program instructions to clusters of instruction execution circuits. An input value provided to a consumer instruction may be a produced value of a producer instruction, creating a dependency. The steering circuit steers a producer instruction to a first cluster and, in response to receiving the consumer instruction and the predicted value of the producer instruction, provides the predicted value to at least a second cluster and steers the consumer instruction to the second cluster for execution with the predicted value as the input value. A consumer instruction can be executed in a different cluster than a producer instruction without a cluster-to-cluster latency penalty, which allows the instruction loads to be better balanced among the clusters for higher processor throughput.

Method and apparatus for a page-local delta-based prefetcher

A method includes recording a first set of consecutive memory access deltas, where each of the consecutive memory access deltas represents a difference between two memory addresses accessed by an application, updating values in a prefetch training table based on the first set of memory access deltas, and predicting one or more memory addresses for prefetching responsive to a second set of consecutive memory access deltas and based on values in the prefetch training table.

Instruction set architecture based and automatic load tracking for opportunistic re-steer of data-dependent flaky branches

Methods and apparatuses relating to instruction set architecture (ISA) based and automatic load tracking hardware for opportunistic re-steer of data-dependent flaky branches are described. In one embodiment, a processor includes a pipeline circuit comprising a decoder to decode instructions into decoded instructions and an execution circuit to execute the decoded instructions, a branch predictor circuit to generate a predicted path for a branch instruction, and a branch re-steer circuit to, for the branch instruction dependent on a result from a load instruction, check if an instruction received by the pipeline circuit is the load instruction, and when the instruction received by the pipeline circuit is the load instruction, check for a write back of the result from the load instruction between a decode of the branch instruction with the decoder and an execution of the branch instruction with the execution circuit, and when the predicted path differs from a path based on the result from the load instruction, re-steer the branch instruction in the pipeline circuit to the path and cause execution of the branch instruction for the path based on the result from the load instruction.

REUSE IN-FLIGHT REGISTER DATA IN A PROCESSOR
20220121448 · 2022-04-21 ·

Devices and techniques for short-thread rescheduling in a processor are described herein. When an instruction for a thread completes, a result is produced. The condition that the same thread is scheduled in a next execution slot and that the next instruction of the thread will use the result can be detected. In response to this condition, the result can be provided directly to an execution unit for the next instruction.