G06F9/3816

MULTIPORTED PARITY SCOREBOARD CIRCUIT
20210117201 · 2021-04-22 · ·

A fast and frugal item-state tracking scoreboard circuit is disclosed. The scoreboard maintains per-item partial states across multiple memory circuits, enabling multiple lookups per clock cycle and multiple state updates per clock cycle. In an embodiment a scoreboard is used to schedule instructions in an out-of-order processor. Each clock cycle the scoreboard indicates the busy state of an instruction's registers and may update the busy state of the destination registers of issuing instructions and completing instructions. Applications include register tracking, function-unit tracking, and cache-line state tracking, in embodiments including processor cores (including superscalar, superpipelined, and multithreaded processors), accelerators, memory systems, and networks. In an embodiment, a register-busy scoreboard circuit is implemented using FPGA LUT RAM memory. In an embodiment, a three-read/two-write per cycle register file scoreboard of 64 registers uses 16 LUTs and indicates whether an instruction is issuable in two LUT delays.

DEVICES TRANSFERRING CACHE LINES, INCLUDING METADATA ON EXTERNAL LINKS
20230409332 · 2023-12-21 ·

In a processing system, a conversion circuit coupled to a system bus generates a flow control unit (FLIT) and provides the FLIT to a link interface circuit for transmission over an external link. The external link may be a peripheral component interface (PCI) express (PCIe) link coupled to an external device comprising a cache or memory. The conversion circuit generates the FLIT, including write information based on the write instruction, metadata associated with at least one cache line, and cache line chunks, including bytes of a cache line. The cache line chunks may be chunks of one of the at least one cache line. Including the metadata in the FLIT avoids separately transmitting the at least one cache line and the metadata over the external link, which improves performance compared to generating separate transmissions. In some examples, the FLIT corresponds to a compute express link (CXL) protocol FLIT.

PARTIAL WRITE MANAGEMENT IN A MULTI-TILED COMPUTE ENGINE

Embodiments described herein provide a general purpose graphics processor comprising a plurality of tiles, each tile of the plurality of tiles comprising at least one execution unit, a local cache, and a cache control unit, and a high bandwidth memory communicatively coupled to the plurality of tiles, wherein the high bandwidth memory is shared between the plurality of tiles. The cache control unit is to implement a partial write management protocol to receive a partial write operation directed to a cache line in the local cache, the partial write operation comprising write data, write the data associated with the partial write operation to the local cache when the cache line is in a modified state, and forward the write data associated with the partial write operation to the high bandwidth memory when the partial write operation triggers a cache miss or when the cache line is in an exclusive state or a shared state. Other embodiments may be described and claimed.

QUALITY OF SERVICE DIRTY LINE TRACKING
20210073137 · 2021-03-11 ·

Systems, apparatuses, and methods for generating a measurement of write memory bandwidth are disclosed. A control unit monitors writes to a cache hierarchy. If a write to a cache line is a first time that the cache line is being modified since entering the cache hierarchy, then the control unit increments a write memory bandwidth counter. Otherwise, if the write is to a cache line that has already been modified since entering the cache hierarchy, then the write memory bandwidth counter is not incremented. The first write to a cache line is a proxy for write memory bandwidth since this will eventually cause a write to memory. The control unit uses the value of the write memory bandwidth counter to generate a measurement of the write memory bandwidth. Also, the control unit can maintain multiple counters for different thread classes to calculate the write memory bandwidth per thread class.

Method and apparatus for using a storage system as main memory
10936492 · 2021-03-02 · ·

A data access system including a processor, multiple cache modules for the main memory, and a storage drive. The cache modules include a FLC controller and a main memory cache. The multiple cache modules function as main memory. The processor sends read/write requests (with physical address) to the cache module. The cache module includes two or more stages with each stage including a FLC controller and DRAM (with associated controller). If the first stage FLC module does not include the physical address, the request is forwarded to a second stage FLC module. If the second stage FLC module does not include the physical address, the request is forwarded to the storage drive, a partition reserved for main memory. The first stage FLC module has high speed, lower power operation while the second stage FLC is a low-cost implementation. Multiple FLC modules may connect to the processor in parallel.

Zeroing a memory block without processor caching

A data processing system includes a plurality of processor cores each having a respective associated cache memory, a memory controller, and a system memory coupled to the memory controller. A zero request of a processor core among the plurality of processor cores is transmitted on an interconnect fabric of the data processing system. The zero request specifies a target address of a target memory block to be zeroed and has no associated data payload. The memory controller receives the zero request on the interconnect fabric and services the zero request by zeroing in the system memory the target memory block identified by the target address, such that the target memory block is zeroed without caching the zeroed target memory block in the cache memory of the processor core.

System and Method for Securing Nonvolatile Memory for Execute-in-Place
20230418603 · 2023-12-28 ·

A system for securing the contents of an external nonvolatile memory associated with a main processing device is disclosed. The system stores additional information associated with each cache line in the nonvolatile memory. In some embodiments, this additional information comprises a NONCE (number used once) and a MAC (Message Authentication Code). When the main processing device reads a cache line from the nonvolatile memory, the NONCE, address and data from the cache line are used to generate a MAC, which is then compared to the MAC stored in the nonvolatile memory. If the MACs match, the cache line is stored in the on-board cache of the main processing device. If the MACs do not match, a countermeasure may be implemented. The use of a NONCE addresses an information leakage issue that is present when stream ciphers, such as AES-CTR or AES-GCM, are used in data storage applications.

TAG PROCESSING FOR EXTERNAL CACHES
20230418758 · 2023-12-28 ·

A device includes a cache memory and a memory controller coupled to the cache memory. The memory controller is configured to receive a first read request from a cache controller over an interconnect, the first read request comprising first tag data identifying a first cache line in the cache memory, and determine that the first read request comprises a tag read request. The memory controller is further configured to read second tag data corresponding to the tag read request from the cache memory, compare the second tag data read from the cache memory to the first tag data received from the cache controller with the first read request, and if the second tag data matches the first tag data, initiate an action with respect to the first cache line in the cache memory.

CHIPLET-INTEGRATED MACHINE LEARNING ACCELERATORS

Techniques for performing machine learning operations are provided. The techniques include configuring a first portion of a first chiplet as a cache; performing caching operations via the first portion; configuring at least a first sub-portion of the first portion of the chiplet as directly-accessible memory; and performing machine learning operations with the first sub-portion by a machine learning accelerator within the first chiplet.

NETWORK INTERFACE DEVICE AND HOST PROCESSING DEVICE
20210026689 · 2021-01-28 · ·

A network interface device has an input configured to receive data from a network. The data is for one of a plurality of different applications. The network interface device also has at least one processor configured to determine which of a plurality of available different caches in a host system the data is to be injected by accessing to a receive queue comprising at least one descriptor indicating a cache location in one of said plurality of caches to which data is to be injected, wherein said at least one descriptor, which indicates the cache location, has an effect on subsequent descriptors of said receive queue until a next descriptor indicates another cache location. The at least one processor is also configured to cause the data to be injected to the cache location in the host system.