G06F2212/6046

Multi-core communication acceleration using hardware queue device

Apparatus and methods implementing a hardware queue management device for reducing inter-core data transfer overhead by offloading request management and data coherency tasks from the CPU cores. The apparatus include multi-core processors, a shared L3 or last-level cache (LLC), and a hardware queue management device to receive, store, and process inter-core data transfer requests. The hardware queue management device further comprises a resource management system to control the rate in which the cores may submit requests to reduce core stalls and dropped requests. Additionally, software instructions are introduced to optimize communication between the cores and the queue management device.

Acceleration of cache-to-cache data transfers for producer-consumer communication
10430343 · 2019-10-01 · ·

A communication bypass mechanism accelerates cache-to-cache data transfers for communication traffic between caching agents that have separate last-level caches. A method includes bypassing a last-level cache of a first caching agent in response to a cache line having a modified state being evicted from a penultimate-level cache of the first caching agent and a communication attribute of a shadow tag entry associated with the cache line being set. The communication attribute indicates prior communication of the cache line with a second caching agent having a second last-level cache.

FAST UNALIGNED MEMORY ACCESS
20190286445 · 2019-09-19 ·

Fast unaligned memory access. hi accordance with a first embodiment of the present invention, a computing device includes a load queue memory structure configured to queue load operations and a store queue memory structure configured to queue store operations. The computing device includes also includes at least one bit configured to indicate the presence of an unaligned address component for an entry of said load queue memory structure, and at least one bit configured to indicate the presence of an unaligned address component for an entry of said store queue memory structure. The load queue memory may also include memory configured to indicate data forwarding of an unaligned address component from said store queue memory structure to said load queue memory structure.

Management of shared pipeline resource usage based on level information

The execution or processing of an application can be adapted or modified based on a level of a cache in which a requested data block resides, by extracting level information from a cache hierarchy. When a request for a data block is made by a core to a cache memory system, the cache memory system extracts a level of a cache memory in which the data block resides from information stored in the cache memory system. The core is informed of the level of the cache memory in which the data block resides, and uses this information to adapt its processing of the application.

Methods and apparatus for data request scheduling in performing parallel IO operations

Methods and apparatus for data request scheduling in performing parallel IO operations are disclosed. In one example, IO requests directed to an operating system having an IO scheduling component are processed. There, an IO request directed from an application to the operating system is intercepted. A determination is made whether the IO request is subject to immediate processing using available parallel processing resources. When it is determined that the IO request is subject to immediate processing using the available parallel processing resources, the IO scheduling component of the operating system is bypassed. The IO request is directly and immediately processed and passed back to the application using the available parallel processing resources.

System, apparatus and method for selective enabling of locality-based instruction handling

In an embodiment, a processor includes a sparse access buffer having a plurality of entries each to store for a memory access instruction to a particular address, address information and count information; and a memory controller to issue read requests to a memory, the memory controller including a locality controller to receive a memory access instruction having a no-locality hint and override the no-locality hint based at least in part on the count information stored in an entry of the sparse access buffer. Other embodiments are described and claimed.

Systems and methods for implementing a tag-less shared cache and a larger backing cache

A computer processing system includes a plurality of nodes, each node having at least one processor core and at least one level of cache memory which is private to the node, a shared, last level cache (LLC) memory device and a shared, last level cache location buffer containing cache location entries, each cache location entry storing an address tag and a plurality of location information. The location information stored in a cache location entry points to an identified cacheline location within the LLC that stores a cacheline associated with the location information. The cacheline stored in the LLC has associated information identifying the cache location entry.

System, apparatus and method for overriding of non-locality-based instruction handling

In one embodiment, a processor includes: a core including a decode unit to decode a memory access instruction having a no-locality hint to indicate that data associated with the memory access instruction has at least one of non-spatial locality and non-temporal locality; and a locality controller to determine whether to override the no-locality hint based at least in part on one or more performance monitoring values. Other embodiments are described and claimed.

Using an access increment number to control a duration during which tracks remain in cache

Provided are a computer program product, system, and method for using an access increment number to control a duration during which tracks remain in cache. Tracks in a storage in the cache are indicated in a cache list. For each of the tracks indicated in the cache list, an access value is updated when one of the tracks is accessed in the cache. An access to a track in the cache indicated in the cache list is received. A determination is made as to whether an access increment number for the accessed track, wherein the access increment number is greater than one. The access value for the accessed track is incremented by the determined access increment number in response to the track being accessed in the cache. The access value for one of the tracks is used to determine whether to initiate to demote the track from the cache.

METHOD AND SYSTEM FOR EFFICIENT COMMUNICATION AND COMMAND SYSTEM FOR DEFERRED OPERATION
20190236017 · 2019-08-01 ·

A method and system for efficiently executing a delegate of a program by a processor coupled to an external memory. A payload including state data or command data is bound with a program delegate. The payload is mapped with the delegate via the payload identifier. The payload is pushed to a repository buffer in the external memory. The payload is flushed by reading the payload identifier and loading the payload from the repository buffer. The delegate is executed using the loaded payload.