Patent classifications
G06F9/312
Space efficient checkpoint facility and technique for processor with integrally indexed register mapping and free-list arrays
A processor may efficiently implement register renaming and checkpoint repair even in instruction set architectures with large numbers of wide (bit-width) registers by (i) renaming all destination operand register targets, (ii) implementing free list and architectural-to-physical mapping table as a combined array storage with unitary (or common) read, write and checkpoint pointer indexing and (iiii) storing checkpoints as snapshots of the mapping table, rather than of actual register contents. In this way, uniformity (and timing simplicity) of the decode pipeline may be accentuated and architectural-to-physical mappings (or allocable mappings) may be efficiently shuttled between free-list, reorder buffer and mapping table stores in correspondence with instruction dispatch and completion as well as checkpoint creation, retirement and restoration.
Mitigation of thread hogs on a threaded processor and prevention of allocation of resources to one or more instructions following a load miss
Systems and methods for efficient thread arbitration in a threaded processor with dynamic resource allocation. A processor includes a resource shared by multiple threads. The resource includes an array with multiple entries, each of which may be allocated for use by any thread. Control logic detects a load miss to memory, wherein the miss is associated with a latency greater than a given threshold. The load instruction or an immediately younger instruction is selected for replay for an associated thread. A pipeline flush and replay for the associated thread begins with the selected instruction. Instructions younger than the load instruction are held at a given pipeline stage until the load instruction completes. During replay, this hold prevents resources from being allocated to the associated thread while the load instruction is being serviced.
Stack pointer value prediction
Methods and apparatus for predicting the value of a stack pointer which store data when an instruction is seen which grows the stack. The information which is stored includes a size parameter which indicates by how much the stack is grown and one or both of: the register ID currently holding the stack pointer value or the current stack pointer value. When a subsequent instruction shrinking the stack is seen, the stored data is searched for one or more entries which has a corresponding size parameter. If such an entry is identified, the other information stored in that entry is used to predict the value of the stack pointer instead of using the instruction to calculate the new stack pointer value. Where register renaming is used, the information in the entry is used to remap the stack pointer to a different physical register.
Hardware and firmware paths for performing memory read processes
A storage module may include a controller that has hardware path that includes a plurality of hardware modules configured to perform a plurality of processes associated with execution of a host request. The storage module may also include a firmware module having a processor that executes firmware to perform at least some of the plurality of processes performed by the hardware modules. The firmware module performs the processes when the hardware modules are not able to successfully perform them.
Handling compressed data over distributed cache fabric
Technologies are presented that optimize data processing cost and efficiency. A computing system may comprise at least one processing element; a memory communicatively coupled to the at least one processing element; at least one compressor-decompressor communicatively coupled to the at least one processing element, and communicatively coupled to the memory through a memory interface; and a cache fabric comprising a plurality of distributed cache banks communicatively coupled to each other, to the at least one processing element, and to the at least one compressor-decompressor via a plurality of nodes. In this system, the at least one compressor-decompressor and the cache fabric are configured to manage and track uncompressed data of variable length for data requests by the processing element(s), allowing usage of compressed data in the memory.
Instruction and logic to provide vector compress and rotate functionality
Instructions and logic provide vector compress and rotate functionality. Some embodiments, responsive to an instruction specifying: a vector source, a mask, a vector destination and destination offset, read the mask, and copy corresponding unmasked vector elements from the vector source to adjacent sequential locations in the vector destination, starting at the vector destination offset location. In some embodiments, the unmasked vector elements from the vector source are copied to adjacent sequential element locations modulo the total number of element locations in the vector destination. In some alternative embodiments, copying stops whenever the vector destination is full, and upon copying an unmasked vector element from the vector source to an adjacent sequential element location in the vector destination, the value of a corresponding field in the mask is changed to a masked value. Alternative embodiments zero elements of the vector destination, in which no element from the vector source is copied.
Matrix multiplication operations using pair-wise load and splat operations
Mechanisms for performing a matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A pair-wise load and splat operation is performed to load a pair of scalar values of a second vector operand and replicate the pair of scalar values within a second target vector register. An operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored. This operation may be repeated for a second pair of scalar values of the second vector operand.
Techniques for implementing barriers to efficiently support cumulativity in a weakly-ordered memory system
A technique for operating a cache memory of a data processing system includes creating respective pollution vectors to track which of multiple concurrent threads executed by an associated processor core are currently polluted by a store operation resident in the cache memory. Dependencies in a dependency data structure of a store queue of the cache memory are set based on the pollution vectors to reduce unnecessary ordering effects. Store operations are dispatched from the store queue in accordance with the dependencies indicated by the dependency data structure.
Vector indexed memory access plus arithmetic and/or logical operation processors, methods, systems, and instructions
A processor including a decode unit to receive a vector indexed load plus arithmetic and/or logical (A/L) operation plus store instruction. The instruction is to indicate a source packed memory indices operand that is to have a plurality of packed memory indices. The instruction is also to indicate a source packed data operand that is to have a plurality of packed data elements. The processor also includes an execution unit coupled with the decode unit. The execution unit, in response to the instruction, is to load a plurality of data elements from memory locations corresponding to the plurality of packed memory indices, perform A/L operations on the plurality of packed data elements of the source packed data operand and the loaded plurality of data elements, and store a plurality of result data elements in the memory locations corresponding to the plurality of packed memory indices.