Patent classifications
G06F9/3832
Replay of partially executed instruction blocks in a processor-based system employing a block-atomic execution model
Replay of partially executed instruction blocks in a processor-based system employing a block-atomic execution model is disclosed. In one aspect, a partial replay controller is provided in a processor(s) of a central processing unit (CPU). If an instruction is detected in the instruction block associated with a potential architectural state modification, or an exception occurs during execution of instructions, the instruction block is re-executed. During re-execution of the instruction block, the partial replay controller is configured to record produced results from load/store instructions. Thus, if an exception occurs during re-execution of the instruction block, previously recorded produced results for the executed load/store instructions before the exception occurred are replayed during re-execution of the instruction block after the exception is resolved. Thus, execution of instructions leading up to side-effect operations in the instruction block can be deterministically repeated with previously produced results, without repeating the side-effects.
Context value retrieval prior to or parallel with expansion of previous symbol for context-decoding in range decoder
A highly programmable device, referred to generally as a data processing unit, having multiple processing units for processing streams of information, such as network packets or storage packets, is described. The data processing unit includes one or more specialized hardware accelerators configured to perform acceleration for various data-processing functions. This disclosure describes examples of retrieving values represented by one or more previous symbols needed for decoding a current symbol before or in parallel with the insertion of the values represented by the one or more previous symbols in the data stream.
ENABLING REMOVAL AND RECONSTRUCTION OF FLAG OPERATIONS IN A PROCESSOR
In one embodiment, a processor includes a fetch logic to fetch instructions, a decode logic to decode the fetched instructions, and an execution logic to execute at least some of the instructions. The decode logic may determine whether a flag portion of a first instruction to be folded is to be performed, and if not, accumulate a first immediate value of the first instruction with a folded immediate value obtained from an entry of an immediate buffer. Other embodiments are described and claimed.
DETERMINING PREFETCH PATTERNS
An apparatus and method are provided. The apparatus comprises storage circuitry to store a plurality of data elements. Processing circuitry executes a stream of instructions comprising access instructions that access some of the data elements at given locations.
Training circuitry determines a pattern of the given locations based on the access instructions. Prefetch circuitry performs prefetches based on the pattern and filter circuitry filters the access instructions used by the training circuitry to determine the pattern by including discontinuous access instructions whose given location raises a discontinuity with the given location of a previous access instruction. In this way, it is possible to perform prefetching by calculating, rather than guessing, at a cumulative stride between the access instructions.
Data operations and finite state machine for machine learning via bypass of computational tasks based on frequently-used data values
A mechanism is described for facilitating fast data operations and for facilitating a finite state machine for machine learning at autonomous machines. A method of embodiments, as described herein, includes detecting input data to be used in computational tasks by a computation component of a processor including a graphics processor. The method may further include determining one or more frequently-used data values (FDVs) from the data, and pushing the one or more frequent data values to bypass the computational tasks.
Complex I/O value prediction for multiple values with physical or virtual addresses
An apparatus, and corresponding method, for input/output (I/O) value determination, generates an I/O instruction for an I/O device, the I/O device including a state machine with state transition logic. The apparatus comprises a controller that includes a simplified state machine with a reduced version of the state transition logic of the state machine of the I/O device. The controller is configured to improve instruction execution performance of a processor core by employing the simplified state machine to predict at least one state value of at least one I/O device true state value to be affected by the I/O instruction at the I/O device.
Delivering immediate values by using program counter (PC)-relative load instructions to fetch literal data in processor-based devices
Delivering immediate values by using program counter (PC)-relative load instructions to fetch literal data in processor-based devices is disclosed. In this regard, a processing element (PE) of a processor-based device provides an execution pipeline circuit that comprises an instruction processing portion and a data access portion. Using a literal data access logic circuit, the PE detects a PC-relative load instruction within a fetch window that includes multiple fetched instructions. The PE determines that the PC-relative load instruction can be serviced using literal data that is available to the instruction processing portion of the execution pipeline circuit (e.g., located within the fetch window containing the PC-relative load instruction, or stored in a literal pool buffer), The PE then retrieves the literal data within the instruction processing portion of the execution pipeline circuit, and executes the PC-relative load instruction using the literal data.
Coprocessor prefetcher
A prefetcher for a coprocessor is disclosed. An apparatus includes a processor and a coprocessor that are configured to execute processor and coprocessor instructions, respectively. The processor and coprocessor instructions appear together in code sequences fetched by the processor, with the coprocessor instructions being provided to the coprocessor by the processor. The apparatus further includes a coprocessor prefetcher configured to monitor a code sequence fetched by the processor and, in response to identifying a presence of coprocessor instructions in the code sequence, capture the memory addresses, generated by the processor, of operand data for coprocessor instructions. The coprocessor is further configured to issue, for a cache memory accessible to the coprocessor, prefetches for data associated with the memory addresses prior to execution of the coprocessor instructions by the coprocessor.
TECHNIQUES FOR OPTIMIZING NEURAL NETWORKS FOR MEMOIZATION USING VALUE LOCALIZATION
A system and method for improving parallel processing utilizes increased cache hits in a value cache utilizing memoization. The method includes receiving an input matrix for a parallel processing circuitry, the parallel processing configured to process the input matrix with a second matrix; selecting a portion of the input matrix, the portion having a plurality of values in binary representation; adjusting a first value of the plurality of values to a value which is a power of two; adjusting a second value of the plurality of values based on a third value of the plurality of values; generating a new input matrix based on the input matrix and the adjusted first value; and configuring the parallel processing circuitry to process the new input matrix with the second matrix.
Method and apparatus for dynamically simplifying processor instructions
There is provided methods and devices for dynamically simplifying processor instructions. A method includes receiving, at a computing device, processor instructions and determining, by the computing device, if instruction simplification is enabled for an instruction being processed. The method further includes determining, by the computing device, from an instruction simplification table if the instruction is capable of being simplified and scheduling, by the computing device, a simplified instruction based on the determination from the instruction simplification table. A device includes a processor, and a non-transient computer readable memory having stored thereon instructions which when executed by the processor configure the device to execute the methods disclosed herein.