G06F12/0886

Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers

A processor of an aspect includes a plurality of packed data registers, and a decode unit to decode an instruction. The instruction is to indicate a packed data register of the plurality of packed data registers that is to store a source packed memory address information. The source packed memory address information is to include a plurality of memory address information data elements. An execution unit is coupled with the decode unit and the plurality of packed data registers, the execution unit, in response to the instruction, is to load a plurality of data elements from a plurality of memory addresses that are each to correspond to a different one of the plurality of memory address information data elements, and store the plurality of loaded data elements in a destination storage location. The destination storage location does not include a register of the plurality of packed data registers.

Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers

A processor of an aspect includes a plurality of packed data registers, and a decode unit to decode an instruction. The instruction is to indicate a packed data register of the plurality of packed data registers that is to store a source packed memory address information. The source packed memory address information is to include a plurality of memory address information data elements. An execution unit is coupled with the decode unit and the plurality of packed data registers, the execution unit, in response to the instruction, is to load a plurality of data elements from a plurality of memory addresses that are each to correspond to a different one of the plurality of memory address information data elements, and store the plurality of loaded data elements in a destination storage location. The destination storage location does not include a register of the plurality of packed data registers.

FLEXIBLE DICTIONARY SHARING FOR COMPRESSED CACHES
20210232505 · 2021-07-29 ·

Systems, apparatuses, and methods for implementing flexible dictionary sharing techniques for caches are disclosed. A set-associative cache includes a dictionary for each data array set. When a cache line is to be allocated in the cache, a cache controller determines to which set a base index of the cache line address maps. Then, a selector unit determines which dictionary of a group of dictionaries stored by those sets neighboring this set would achieve the most compression for the cache line. This dictionary is then selected to compress the cache line. An offset is added to the base index of the cache line to generate a full index in order to map the cache line to the set corresponding to this chosen dictionary. The compressed cache line is stored in this set with the chosen dictionary, and the offset is stored in the corresponding tag array entry.

DATA TRANSFER SYSTEM
20210255977 · 2021-08-19 ·

A data transfer system including a first memory and a processor includes a second memory and a DMA controller. The processor performs RMW on data which has a size less than a cache line size and in which a portion of a cache line (a unit area of the first memory) is a write destination. Output target data is transferred from an I/O device to the second memory. Thereafter, the DMA controller transfers the output target data from the second memory to the first memory in one or a plurality of transfer unit sizes by which the number of occurrences of RMW is minimized.

Apparatus and method for providing decoded instructions from a decoded instruction cache

A decoding apparatus has fetch circuitry, decode circuitry, and a decoded instruction cache. The decoded instruction cache comprises a plurality of cache blocks, where each cache block is arranged to store up to P decoded instructions from at least one fetch granule allocated to that cache block. When the corresponding decoded instruction for a required instruction is already stored in the decoded instruction cache, the decoded instruction is output in the stream of decoded instructions. Allocation circuitry is arranged, when a cache block is already allocated for existing decoded instructions from a particular fetch granule, and then additional decoded instructions from that particular fetch granule are subsequently produced by the decode circuitry due to a different path being taken through the fetch granule, to update the already allocated cache block to provide both the existing decoded instructions and the additional decoded instructions.

Pseudo-first in, first out (FIFO) tag line replacement

A method is provided that includes searching tags in a tag group comprised in a tagged memory system for an available tag line during a clock cycle, wherein the tagged memory system includes a plurality of tag lines having respective tags and wherein the tags are divided into a plurality of non-overlapping tag groups, and searching tags in a next tag group of the plurality of tag groups for an available tag line during a next clock cycle when the searching in the tag group does not find an available tag line.

DYNAMIC CLUSTERING-BASED DATA COMPRESSION
20210288659 · 2021-09-16 ·

Methods, systems, and techniques for data compression. A cluster fingerprint of an uncompressed data block is determined to correspond to a cluster fingerprint of a base block stored in a base array. This determining involves looking up the cluster fingerprint of the first base block from the base array using the cluster fingerprint of the first uncompressed data block. The difference between the uncompressed data block and the base block is determined, and a compressed data block is encoded using this difference. The compressed data block is then stored in a data array.

System, apparatus and method for selective enabling of locality-based instruction handling

In an embodiment, a processor includes a sparse access buffer having a plurality of entries each to store for a memory access instruction to a particular address, address information and count information; and a memory controller to issue read requests to a memory, the memory controller including a locality controller to receive a memory access instruction having a no-locality hint and override the no-locality hint based at least in part on the count information stored in an entry of the sparse access buffer. Other embodiments are described and claimed.

PARALLEL DECOMPRESSION MECHANISM

An apparatus to facilitate packing compressed data is disclosed. The apparatus includes compression hardware to compress memory data into a plurality of compressed data components and packing hardware to receive the plurality of compressed data components and pack a first of the plurality of compressed data components beginning at a least significant bit (LSB) location of a compressed bit stream and pack a second of the plurality of compressed data components beginning at a most significant bit (MSB) of the compressed bit stream.

PATTERN-BASED CACHE BLOCK COMPRESSION

Systems, methods, and devices for performing pattern-based cache block compression and decompression. An uncompressed cache block is input to the compressor. Byte values are identified within the uncompressed cache block. A cache block pattern is searched for in a set of cache block patterns based on the byte values. A compressed cache block is output based on the byte values and the cache block pattern. A compressed cache block is input to the decompressor. A cache block pattern is identified based on metadata of the cache block. The cache block pattern is applied to a byte dictionary of the cache block. An uncompressed cache block is output based on the cache block pattern and the byte dictionary. A subset of cache block patterns is determined from a training cache trace based on a set of compressed sizes and a target number of patterns for each size.