G06F2212/507

Cache eviction
11334495 · 2022-05-17 · ·

A data processing apparatus is provided. It includes cache circuitry to store a plurality of items, each having an associated indicator. Processing circuitry executes instructions using at least some of the plurality of items. Fill circuitry inserts a new item into the cache circuitry. Eviction circuitry determines which of the plurality of items is to be a victim item based on the indicator, and evicts the victim item from the cache circuitry. Detection circuitry detects a state of the processing circuitry at a time that the new item is inserted into the cache circuitry, and sets the indicator in dependence on the state.

Cache systems and circuits for syncing caches or cache sets
11734015 · 2023-08-22 · ·

A cache system, having a first cache, a second cache, and a logic circuit coupled to control the first cache and the second cache according to an execution type of a processor. When an execution type of a processor is a first type indicating non-speculative execution of instructions and the first cache is configured to service commands from a command bus for accessing a memory system, the logic circuit is configured to copy a portion of content cached in the first cache to the second cache. The cache system can include a configurable data bit. The logic circuit can be coupled to control the caches according to the bit. Alternatively, the caches can include cache sets. The caches can also include registers associated with the cache sets respectively. The logic circuit can be coupled to control the cache sets according to the registers.

Arithmetic processing apparatus and memory apparatus
11327768 · 2022-05-10 · ·

An arithmetic processing apparatus includes an arithmetic circuit configured to perform an arithmetic operation on data having a first data width and perform an instruction in parallel on each element of data having a second data width, and a cache memory configured to store data, wherein the cache memory includes a tag circuit storing tags for respective ways, a data circuit storing data for the respective ways, a determination circuit that determines a type of an instruction with respect to whether data accessed by the instruction has the first data width or the second data width, and a control circuit that performs either a first pipeline operation where the tag circuit and the data circuit are accessed in parallel or a second pipeline operation where the data circuit is accessed in accordance with a tag result after accessing the tag circuit, based on a result determined by the determination circuit.

Enhanced input of machine-learning accelerator activations
11327690 · 2022-05-10 · ·

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for scheduling operations on a machine-learning accelerator having multiple tiles. The apparatus includes a processor having a plurality of tiles and scheduling circuitry that is configured to select a respective input activation for each tile of the plurality of tiles from either an activation line for the tile or a delay register for the activation line.

PREFETCH KILL AND REVIVAL IN AN INSTRUCTION CACHE

A system comprises a processor including a CPU core, first and second memory caches, and a memory controller subsystem. The memory controller subsystem speculatively determines a hit or miss condition of a virtual address in the first memory cache and speculatively translates the virtual address to a physical address. Associated with the hit or miss condition and the physical address, the memory controller subsystem configures a status to a valid state. Responsive to receipt of a first indication from the CPU core that no program instructions associated with the virtual address are needed, the memory controller subsystem reconfigures the status to an invalid state and, responsive to receipt of a second indication from the CPU core that a program instruction associated with the virtual address is needed, the memory controller subsystem reconfigures the status back to a valid state.

METHOD FOR MATRIX DATA BROADCAST IN PARALLEL PROCESSING
20220129312 · 2022-04-28 ·

Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read access to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted.

Data processing

Data processing apparatus comprises a data access requesting node; data access circuitry to receive a data access request from the data access requesting node and to route the data access request for fulfilment by one or more data storage nodes selected from a group of two or more data storage nodes; and indication circuitry to provide a source indication to the data access requesting node, to indicate an attribute of the one or more data storage nodes which fulfilled the data access request; the data access requesting node being configured to vary its operation in response to the source indication.

Prefetch kill and revival in an instruction cache

A system comprises a processor including a CPU core, first and second memory caches, and a memory controller subsystem. The memory controller subsystem speculatively determines a hit or miss condition of a virtual address in the first memory cache and speculatively translates the virtual address to a physical address. Associated with the hit or miss condition and the physical address, the memory controller subsystem configures a status to a valid state. Responsive to receipt of a first indication from the CPU core that no program instructions associated with the virtual address are needed, the memory controller subsystem reconfigures the status to an invalid state and, responsive to receipt of a second indication from the CPU core that a program instruction associated with the virtual address is needed, the memory controller subsystem reconfigures the status back to a valid state.

Tablewalk takeover
11314657 · 2022-04-26 · ·

In one embodiment, a microprocessor, comprising: a translation lookaside buffer (TLB) configured to indicate that a virtual page address corresponding to a physical page address of a page of memory that a memory access instruction is attempting to access is missing in the TLB; a first micro-op corresponding to a first memory access instruction and configured to initiate a first speculative tablewalk based on a miss in the TLB of a first virtual page address; and a second micro-op corresponding to a second memory access instruction, the second micro-op configured to take over an active first speculative tablewalk of the first micro-op at its current stage of processing based on being older than the first micro-op and further based on having a virtual page address and properties that match the first virtual page address and properties for the first memory access instruction.

System and method for efficient cache coherency protocol processing

To reduce latency and bandwidth consumption in systems, systems and methods are provided for grouping multiple cache line request messages in a related and speculative manner. That is, multiple cache lines are likely to have the same state and ownership characteristics, and therefore, requests for multiple cache lines can be grouped. Information received in response can be directed to the requesting processor socket, and those speculatively received (not actually requested, but likely to be requested) can be maintained in queue or other memory until a request is received for that information, or until discarded to free up tracking space for new requests.