Patent classifications
G06F9/383
PREDICTION USING INSTRUCTION CORRELATION
A data processing apparatus is provided, which is able to provide predictions for hard to predict instructions. Prediction circuitry generates predictions relating to predictable instructions in a stream, where the prediction circuitry comprises storage circuitry to store, in respect of each of the predictable instructions, a reference to a set of monitored instructions in the stream to be used for generating predictions for the predictable instructions. Processing circuitry receives the predictions from the prediction circuitry and executes the predictable instructions in the stream using the predictions. Programmable instruction correlation parameter storage circuitry stores a given correlation parameter between a given predictable instruction in the stream and a subset of the set of monitored instructions of the given predictable instruction, to assist the prediction circuitry in generating the predictions. If the programmable instruction correlation parameter storage circuitry is currently storing the given correlation parameter, the prediction circuitry generates a given prediction relating to the given predictable instruction based on the subset of the set of monitored instructions indicated in the programmable instruction correlation parameter storage circuitry. Otherwise the prediction circuitry generates the given prediction relating to the given predictable instruction based on the set of monitored instructions indicated in the storage circuitry.
STREAMING ENGINE WITH ERROR DETECTION, CORRECTION AND RESTART
Disclosed embodiments relate to a streaming engine employed in, for example, a digital signal processor. A fixed data stream sequence including plural nested loops is specified by a control register. The streaming engine includes an address generator producing addresses of data elements and a steam head register storing data elements next to be supplied as operands. The streaming engine fetches stream data ahead of use by the central processing unit core in a stream buffer. Parity bits are formed upon storage of data in the stream buffer which are stored with the corresponding data. Upon transfer to the stream head register a second parity is calculated and compared with the stored parity. The streaming engine signals a parity fault if the parities do not match. The streaming engine preferably restarts fetching the data stream at the data element generating a parity fault.
STREAMING ENGINE WITH SEPARATELY SELECTABLE ELEMENT AND GROUP DUPLICATION
A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements. A steam head register stores data elements next to be supplied to functional units for use as operands. An element duplication unit optionally duplicates data element an instruction specified number of times. A vector masking unit limits data elements received from the element duplication unit to least significant bits within an instruction specified vector length. If the vector length is less than a stream head register size, the vector masking unit stores all 0's in excess lanes of the stream head register (group duplication disabled) or stores duplicate copies of the least significant bits in excess lanes of the stream head register.
Data Loading
A data loading circuit and method are provided. The circuit is configured to load data for a feature map calculated by a neural network into a calculation circuit, wherein the size of the convolution kernel of the neural network is K*K data, and a window corresponding to the convolution kernel slides with a step size of S in the feature map, where K and S are positive integers and S<K, the circuit comprising: two data loaders comprising a first data loader and a second data loader; and a controller configured to: control the first data loader to be in a data outputting mode and control the second data loader to be in a data reading mode, when the window slides within K consecutive rows of the feature map.
OPTIMIZE BOUND INFORMATION ACCESSES IN BUFFER PROTECTION
A method, system and apparatus for providing bound information accesses in buffer protection, including providing one-to-one mapping between a general-purpose register and bound information in a BI (bound information) register, saving loaded bound information in the BI register for future use, providing integrity of the bound information in the BI register that is maintained along program execution, and providing a pro-active load of the bound information with one-bit extra control on load instruction of the BI register.
Efficient scheduling of load instructions
When scheduling instructions for execution on a computing device, load instructions are processed before their dependent computational instructions. This can result in the load instructions being scheduled in a non-optimal order. To schedule the load instructions in a preferred order, a scheduler can speculatively schedule the load instructions without committing to their order. Subsequently, when the scheduler encounters the dependent computational instructions, the scheduler can reorder the speculatively scheduled load instructions according to the execution order of the dependent computational instructions.
EXPOSING VALID BYTE LANES AS VECTOR PREDICATES TO CPU
A streaming engine employed in a digital data processor specifies a fixed read only data stream. Once fetched data elements in the data stream are disposed in lanes in a stream head register in the fixed order. Some lanes may be invalid, for example when the number of remaining data elements are less than the number of lanes in the stream head register. The streaming engine automatically produces a valid data word stored in a stream valid register indicating lanes holding valid data. The data in the stream valid register may be automatically stored in a predicate register or otherwise made available. This data can be used to control vector SIMD operations or may be combined with other predicate register data.
PROCESSOR INSTRUCTIONS FOR DATA COMPRESSION AND DECOMPRESSION
A processor that includes compression instructions to compress multiple adjacent data blocks of uncompressed read-only data stored in memory into one compressed read-only data block and store the compressed read-only data block in multiple adjacent blocks in the memory is provided. During execution of an application to operate on the read-only data, one of the multiple adjacent blocks storing the compressed read-only block is read from memory, stored in a prefetch buffer and decompressed in the memory controller. In response to a subsequent request during execution of the application for an adjacent data block in the compressed read-only data block, the uncompressed adjacent block is read directly from the prefetch buffer.
DEVICE, SYSTEM AND METHOD FOR SELECTIVELY DROPPING SOFTWARE PREFETCH INSTRUCTIONS
Techniques and mechanisms for providing information to determine whether a software prefetch instruction is to be executed. In an embodiment, one or more entries of a translation lookaside buffer (TLB) each include a respective value which indicates whether, according to one or more criteria, corresponding data has been sufficiently utilized. Insufficiently utilized data is indicated in a TLB entry with an identifier of an executed instruction to prefetch the corresponding data. An eviction of the TLB entry results in the creation of an entry in a registry of prefetch instructions. The entry in the registry includes the identifier of the executed prefetch instruction, and a value indicating a number of times that one or more future prefetch instructions are to be dropped. In another embodiment, execution of a subsequent prefetch instruction—which also corresponds to the identifier—is prevented based on the registry entry.
Instruction and Micro-Architecture Support for Decompression on Core
Methods and apparatus relating to an instruction and/or micro-architecture support for decompression on core are described. In an embodiment, decode circuitry decodes a decompression instruction into a first micro operation and a second micro operation. The first micro operation causes one or more load operations to fetch data into one or more cachelines of a cache of a processor core. Decompression Engine (DE) circuitry decompresses the fetched data from the one or more cachelines of the cache of the processor core in response to the second micro operation. Other embodiments are also disclosed and claimed.