Patent classifications
G06F9/3816
Gather-scatter cache architecture having plurality of tag and data banks and arbiter for single program multiple data (SPMD) processor
In one embodiment, a cache memory includes: a plurality of data banks, each of the plurality of data banks having a plurality of entries each to store a portion of a cache line distributed across the plurality of data banks; and a plurality of tag banks decoupled from the plurality of data banks, wherein a tag for a cache line is to be assigned to one of the plurality of tag banks. Other embodiments are described and claimed.
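As a rough illustration of a cache line distributed across data banks with tags held in separate tag banks, the sketch below maps byte addresses to banks. The bank counts, the portion size, and the modulo mapping are assumptions made for this example; the abstract does not specify them. Grouping lanes by data bank shows where an arbiter would have to resolve conflicts.

```python
from collections import defaultdict

NUM_DATA_BANKS = 8      # assumed number of data banks
PORTION_BYTES = 8       # assumed bytes of each line held by one data bank
LINE_BYTES = NUM_DATA_BANKS * PORTION_BYTES
NUM_TAG_BANKS = 4       # assumed number of tag banks


def data_bank(addr: int) -> int:
    """Data bank holding the portion of the line that contains addr."""
    return (addr % LINE_BYTES) // PORTION_BYTES


def tag_bank(addr: int) -> int:
    """Tag bank assigned to the line containing addr (assumed modulo mapping)."""
    return (addr // LINE_BYTES) % NUM_TAG_BANKS


def group_lanes_by_data_bank(lane_addrs):
    """Group SPMD lanes by the data bank each lane's address falls in.

    Any bank listed with more than one lane is a conflict that an
    arbiter would need to serialize.
    """
    groups = defaultdict(list)
    for lane, addr in enumerate(lane_addrs):
        groups[data_bank(addr)].append(lane)
    return dict(groups)
```

For example, lanes reading addresses 8 and 72 both land in data bank 1 even though they belong to different cache lines.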
PIPELINED READ-MODIFY-WRITE OPERATIONS IN CACHE MEMORY
In described examples, a processor system includes a processor core that generates memory write requests, a cache memory, and a memory pipeline of the cache memory. The memory pipeline has a holding buffer, an anchor stage, and an RMW pipeline. The anchor stage determines whether a data payload of a write request corresponds to a partial write. If so, the data payload is written to the holding buffer and conforming data is read from a corresponding cache memory address to merge with the data payload. The RMW pipeline has a merge stage and a syndrome generation stage. The merge stage merges the data payload in the holding buffer with the conforming data to make merged data. The syndrome generation stage generates an ECC syndrome using the merged data. The memory pipeline writes the data payload and ECC syndrome to the cache memory.
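The read-modify-write flow described above can be sketched in software as follows. This is a minimal model under stated assumptions: the byte-enable list, the dictionary used as cache storage, and the XOR-based `toy_syndrome` are all hypothetical stand-ins (the XOR is not a real ECC code), and the sketch stores the whole merged word for simplicity.

```python
from functools import reduce


def toy_syndrome(data: bytes) -> int:
    """Stand-in for an ECC syndrome: XOR of all bytes (not a real ECC code)."""
    return reduce(lambda a, b: a ^ b, data, 0)


def rmw_write(cache: dict, addr: int, payload: bytes, byte_enable: list):
    """Model of the anchor stage followed by the merge and syndrome stages.

    cache maps an address to a (data, syndrome) pair.
    byte_enable[i] is True when payload[i] carries new data.
    """
    if all(byte_enable):
        # Full write: no read of existing data is needed.
        merged = bytes(payload)
    else:
        # Partial write: park the payload, then read the existing data.
        holding_buffer = bytes(payload)
        old_data, _ = cache.get(addr, (bytes(len(payload)), 0))
        # Merge stage: take enabled bytes from the payload, the rest from the cache.
        merged = bytes(
            new if enabled else old
            for new, old, enabled in zip(holding_buffer, old_data, byte_enable)
        )
    # Syndrome generation stage operates on the merged data.
    cache[addr] = (merged, toy_syndrome(merged))
    return cache[addr]
```

The point of the merge step is that the syndrome covers the full protected word, so it cannot be computed from the partial payload alone.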
WRITE CONTROL FOR READ-MODIFY-WRITE OPERATIONS IN CACHE MEMORY
In described examples, a processor system includes a processor core that generates memory write requests, and a cache memory with a memory controller having a memory pipeline. The cache memory has cache lines of length L. The cache memory has a minimum write length that is less than a cache line length of the cache memory. The memory pipeline determines whether the data payload includes a first chunk and ECC syndrome that correspond to a partial write and are writable by a first cache write operation, and a second chunk and ECC syndrome that correspond to a full write operation that can be performed separately from the first cache write operation. The memory pipeline performs an RMW operation to store the first chunk and ECC syndrome in the cache memory, and performs the full write operation to store the second chunk and ECC syndrome in the cache memory.
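One way to picture the split between partial and full writes is to cut a write request at multiples of the minimum write length and classify each chunk. The function name, the tuple layout, and the "rmw"/"full" labels below are assumptions made for this sketch, not terms from the abstract.

```python
def split_write(start: int, length: int, min_write: int):
    """Split a write into chunks aligned to the minimum write length.

    Returns a list of (chunk_base, lo, hi, kind) tuples, where [lo, hi) is
    the byte range written within the chunk starting at chunk_base, and
    kind is "full" when the whole chunk is covered, else "rmw".
    """
    chunks = []
    addr = start
    end = start + length
    while addr < end:
        base = addr - addr % min_write
        hi = min(end, base + min_write)
        covers_chunk = addr == base and hi == base + min_write
        chunks.append((base, addr, hi, "full" if covers_chunk else "rmw"))
        addr = hi
    return chunks
```

A 12-byte write starting at offset 4 with an 8-byte minimum write length yields one chunk that needs a read-modify-write (bytes 4 to 7) and one that can be written directly (bytes 8 to 15).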
BRANCH PREDICTION THROUGHPUT BY SKIPPING OVER CACHELINES WITHOUT BRANCHES
According to one general aspect, an apparatus may include a branch prediction circuit configured to predict if a branch instruction will be taken or not. The apparatus may include a branch target buffer circuit configured to store a memory segment empty flag that indicates whether or not the memory segment after a target address includes at least one other branch instruction, wherein the memory segment empty flag was created during a commit stage of a prior occurrence of the branch instruction. The branch prediction circuit may be configured to skip over the memory segment if the memory segment empty flag indicates a lack of other branch instruction(s).
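The skip behavior can be sketched as a lookup that decides where the predictor searches next. The 64-byte segment size, the 4-byte instruction size, the `BTBEntry` fields, and the reading of "memory segment after a target address" as the remainder of the segment containing the target are all assumptions for this example.

```python
from dataclasses import dataclass

SEGMENT_BYTES = 64  # assumed memory segment (cache line) size
INSTR_BYTES = 4     # assumed fixed instruction size


@dataclass
class BTBEntry:
    target: int
    segment_empty: bool  # recorded at commit of a prior occurrence of the branch


def next_lookup_address(btb: dict, branch_pc: int, taken: bool) -> int:
    """Address at which the predictor should look for the next branch."""
    entry = btb.get(branch_pc)
    if not taken or entry is None:
        return branch_pc + INSTR_BYTES
    if entry.segment_empty:
        # No other branch follows the target in this segment, so resume
        # the search at the start of the following segment.
        return (entry.target // SEGMENT_BYTES + 1) * SEGMENT_BYTES
    return entry.target
```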
Device and method for cache utilization aware data compression
A processing device is provided which includes memory and at least one processor. The memory includes main memory and cache memory in communication with the main memory via a link. The at least one processor is configured to receive a request for a cache line and read the cache line from main memory. The at least one processor is also configured to compress the cache line according to a compression algorithm and, when the compressed cache line includes at least one byte predicted not to be accessed, drop the at least one byte from the compressed cache line based on whether the compression algorithm is determined to successfully compress the cache line according to a compression parameter.
Network interface device and host processing device
A network interface device has an input configured to receive data from a network. The data is for one of a plurality of different applications. The network interface device also has at least one processor configured to determine into which of a plurality of available different caches in a host system the data is to be injected by accessing a receive queue comprising at least one descriptor indicating a cache location in one of said plurality of caches to which data is to be injected, wherein said at least one descriptor, which indicates the cache location, has an effect on subsequent descriptors of said receive queue until a next descriptor indicates another cache location. The at least one processor is also configured to cause the data to be injected into the cache location in the host system.
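The "sticky" effect of a descriptor on later descriptors can be sketched as a single pass over the receive queue. The dictionary layout with `buffer` and `cache` keys and the cache names used in the example are hypothetical.

```python
def resolve_cache_targets(descriptors):
    """Pair each receive buffer with the cache location in effect for it.

    A descriptor that names a cache location sets the target for itself
    and for every following descriptor until another one names a new location.
    """
    current = None
    resolved = []
    for desc in descriptors:
        if desc.get("cache") is not None:
            current = desc["cache"]
        resolved.append((desc["buffer"], current))
    return resolved
```

With a queue of four descriptors in which only the first and third name a location, the second inherits the first's location and the fourth inherits the third's.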
Ordering execution of an interrupt handler
A processing unit for a multiprocessor data processing system includes a processor core having an upper level cache and a lower level cache coupled to the processor core. The lower level cache includes one or more state machines for handling requests snooped from the system interconnect. The processing unit includes an interrupt unit configured to, based on receipt of an interrupt request while the processor core is in a powered up state, record which of the one or more state machines are active processing a prior snooped request that can invalidate a cache line in the upper level cache and present an interrupt to the processor core based on determining that each state machine that was active processing a prior snooped request that can invalidate a cache line in the upper level cache has completed processing of its respective prior snooped request.
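The ordering rule amounts to taking a snapshot of the busy state machines when the interrupt request arrives and holding the interrupt until every machine in that snapshot has finished. The class and method names below are assumptions for this sketch, which models a single outstanding interrupt.

```python
class InterruptUnit:
    """Delays an interrupt until earlier invalidating snoops have completed."""

    def __init__(self):
        self.waiting_on = None  # None means no interrupt request is pending

    def on_interrupt(self, active_invalidating_machines):
        # Record only the machines that are busy right now with a snooped
        # request able to invalidate a line in the upper level cache.
        self.waiting_on = set(active_invalidating_machines)

    def on_complete(self, machine_id):
        if self.waiting_on is not None:
            self.waiting_on.discard(machine_id)

    def can_present(self) -> bool:
        return self.waiting_on is not None and not self.waiting_on
```

Machines that start work after the snapshot are not tracked, so later snoops cannot delay the interrupt indefinitely.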
Partial write management in a multi-tiled compute engine
Embodiments described herein provide a general purpose graphics processor comprising a plurality of tiles, each tile of the plurality of tiles comprising at least one execution unit, a local cache, and a cache control unit, and a high bandwidth memory communicatively coupled to the plurality of tiles, wherein the high bandwidth memory is shared between the plurality of tiles. The cache control unit is to implement a partial write management protocol to receive a partial write operation directed to a cache line in the local cache, the partial write operation comprising write data, write the data associated with the partial write operation to the local cache when the cache line is in a modified state, and forward the write data associated with the partial write operation to the high bandwidth memory when the partial write operation triggers a cache miss or when the cache line is in an exclusive state or a shared state. Other embodiments may be described and claimed.
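The routing decision in the partial write management protocol can be sketched as a function of the cache line state. The single-letter state codes, the use of `None` or `"I"` for a miss, and the returned action strings are assumptions for this example.

```python
from typing import Optional


def route_partial_write(line_state: Optional[str]) -> str:
    """Decide where a partial write goes, given the local line state.

    line_state is "M", "E", or "S" on a hit; None or "I" models a miss.
    """
    if line_state == "M":
        # Modified line: the local cache absorbs the write.
        return "write_local_cache"
    if line_state in (None, "I", "E", "S"):
        # Miss, exclusive, or shared: forward the write data to the shared HBM.
        return "forward_to_hbm"
    raise ValueError(f"unknown cache line state: {line_state!r}")
```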
WIDENING MEMORY ACCESS TO AN ALIGNED ADDRESS FOR UNALIGNED MEMORY OPERATIONS
Unaligned atomic memory operations on a processor using a load-store instruction set architecture (ISA) that requires aligned accesses are performed by widening the memory access to an aligned address by the next larger power of two (e.g., 4-byte access is widened to 8 bytes, and 8-byte access is widened to 16 bytes). Data processing operations supported by the load-store ISA including shift, rotate, and bitfield manipulation are utilized to modify only the bytes in the original unaligned address so that the atomic memory operations are aligned to the widened access address. The aligned atomic memory operations using the widened accesses avoid the faulting exceptions associated with unaligned access for most 4-byte and 8-byte accesses. Exception handling is performed in cases in which memory access spans a 16-byte boundary.
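The address arithmetic behind widening can be sketched as below. The abstract does not say what happens when a 4-byte access crosses an 8-byte boundary without crossing a 16-byte one; this sketch assumes the window keeps doubling up to 16 bytes, and the function names and the little-endian byte order are also assumptions.

```python
def widen(addr: int, size: int, max_width: int = 16):
    """Find an aligned power-of-two window that contains the access.

    Returns (base, width, shift_bits), or None when the access spans a
    max_width boundary and must be handled by the exception path.
    """
    width = size * 2
    while width <= max_width:
        base = addr & ~(width - 1)
        if addr + size <= base + width:
            return base, width, (addr - base) * 8
        width *= 2
    return None


def merge_into_wide(wide_value: int, new_value: int, size: int, shift_bits: int) -> int:
    """Replace only the bytes of the original access inside the wide value.

    Mirrors the shift and bitfield operations mentioned above, assuming a
    little-endian layout so byte offset maps to a left shift.
    """
    mask = ((1 << (size * 8)) - 1) << shift_bits
    return (wide_value & ~mask) | ((new_value << shift_bits) & mask)
```

For a 4-byte store at 0x1002, `widen` returns the 8-byte window at 0x1000 with a 16-bit shift, and `merge_into_wide` then rewrites bytes 2 to 5 of that window while leaving the others untouched.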
Multi-processor system with configurable cache sub-domains and cross-die memory coherency
Disclosed embodiments relate to a system with configurable cache sub-domains and cross-die memory coherency. In one example, a system includes R racks, each rack housing N nodes, each node incorporating D dies, each die containing C cores and a die shadow tag, each core including P pipelines and a core shadow tag, each pipeline being associated with a data cache and data cache tags and being either non-coherent or coherent and in one of X coherency domains, wherein each pipeline, when needing to read a cache line, issues a read request to its associated data cache, then, if need be, issues a read request to its associated core-level cache, then, if need be, issues a read request to its associated die-level cache, then, if need be, issues a no-cache remote read request to a target die being mapped to hold the cache line.
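The escalating read sequence can be sketched as a lookup that tries each cache level in turn and falls back to a remote read. Modeling the caches as dictionaries and the remote die as a callable named `remote_read` is an assumption for this example; the result of the remote read is deliberately not installed in any local cache, to mirror the "no-cache" wording.

```python
def read_line(addr, data_cache, core_cache, die_cache, remote_read):
    """Read a cache line, escalating one level at a time.

    Returns (level, data), where level names where the line was found.
    """
    levels = (("data", data_cache), ("core", core_cache), ("die", die_cache))
    for name, cache in levels:
        if addr in cache:
            return name, cache[addr]
    # All local levels missed: ask the die mapped to hold the line,
    # without filling any local cache with the result.
    return "remote", remote_read(addr)
```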