G06F9/3802

Method and apparatus for virtualizing the micro-op cache

Systems, apparatuses, and methods for virtualizing a micro-operation cache are disclosed. A processor includes at least a micro-operation cache, a conventional cache subsystem, a decode unit, and control logic. The decode unit decodes instructions into micro-operations which are then stored in the micro-operation cache. The micro-operation cache has limited capacity for storing micro-operations. When new micro-operations are decoded from pending instructions, existing micro-operations are evicted from the micro-operation cache to make room for the new micro-operations. Rather than being discarded, micro-operations evicted from the micro-operation cache are stored in the conventional cache subsystem. This prevents the original instruction from having to be decoded again on subsequent executions. When the control logic determines that micro-operations for one or more fetched instructions are stored in either the micro-operation cache or the conventional cache subsystem, the control logic causes the decode unit to transition to a reduced-power state.

Throttling Schemes in Multicore Microprocessors
20220365879 · 2022-11-17 ·

An electronic device includes a cache, a processing cluster having one or more processors, and prefetch throttling circuitry that determines a congestion level of the processing cluster based on an extent to which the data retrieval requests sent from the processors to the cache are not satisfied by the cache. Congestion criteria require that the congestion level of the cluster is above a cluster congestion threshold. In accordance with a determination that the congestion level of the cluster satisfies the congestion criteria, the prefetch throttling circuit causes one of the processors to limit prefetch requests to the cache to prefetch requests of at least a threshold quality. In accordance with a determination that the congestion level of the cluster does not satisfy the congestion criteria, the prefetch throttling circuit forgoes causing the processors to limit prefetch requests to the cache to prefetch requests of at least the threshold quality.

Servicing CPU demand requests with inflight prefetches

Disclosed embodiments provide a technique in which a memory controller determines whether a fetch address is a miss in an L1 cache and, when a miss occurs, allocates a way of the L1 cache, determines whether the allocated way matches a scoreboard entry of pending service requests, and, when such a match is found, determine whether a request address of the matching scoreboard entry matches the fetch address. When the matching scoreboard entry also has a request address matching the fetch address, the scoreboard entry is modified to a demand request.

DATA PREFETCHING FOR GRAPHICS DATA PROCESSING

Embodiments are generally directed to data prefetching for graphics data processing. An embodiment of an apparatus includes one or more processors including one or more graphics processing units (GPUs); and a plurality of caches to provide storage for the one or more GPUs, the plurality of caches including at least an L1 cache and an L3 cache, wherein the apparatus to provide intelligent prefetching of data by a prefetcher of a first GPU of the one or more GPUs including measuring a hit rate for the L1 cache; upon determining that the hit rate for the L1 cache is equal to or greater than a threshold value, limiting a prefetch of data to storage in the L3 cache, and upon determining that the hit rate for the L1 cache is less than a threshold value, allowing the prefetch of data to the L1 cache.

ADAPTIVE MATRIX MULTIPLICATION ACCELERATOR FOR MACHINE LEARNING AND DEEP LEARNING APPLICATIONS
20230041850 · 2023-02-09 ·

An adaptive matrix multiplier. In some embodiments, the matrix multiplier includes a first multiplying unit a second multiplying unit,a memory load circuit, and an outer buffer circuit. The first multiplying unit includes a first inner buffer circuit and a second inner buffer circuit, and the second multiplying unit includes a first inner buffer circuit and a second inner buffer circuit. The memory load circuit is configured to load data from memory, in a single burst of a burst memory access mode, into the first inner buffer circuit of the first multiplying unit; and into the first inner buffer circuit of the second multiplying unit.

LIVELOCK RECOVERY CIRCUIT FOR DETECTING ILLEGAL REPETITION OF AN INSTRUCTION AND TRANSITIONING TO A KNOWN STATE
20230033403 · 2023-02-02 ·

Livelock recovery circuits configured to detect livelock in a processor, and cause the processor to transition to a known safe state when livelock is detected. The livelock recovery circuits include detection logic configured to detect that the processor is in livelock when the processor has illegally repeated an instruction; and transition logic configured to cause the processor to transition to a safe state when livelock has been detected by the detection logic.

Systems and methods for performing 16-bit floating-point matrix dot product instructions

Disclosed embodiments relate to computing dot products of nibbles in tile operands. In one example, a processor includes decode circuitry to decode a tile dot product instruction having fields for an opcode, a destination identifier to identify a M by N destination matrix, a first source identifier to identify a M by K first source matrix, and a second source identifier to identify a K by N second source matrix, each of the matrices containing doubleword elements, and execution circuitry to execute the decoded instruction to perform a flow K times for each element (m, n) of the specified destination matrix to generate eight products by multiplying each nibble of a doubleword element (M,K) of the specified first source matrix by a corresponding nibble of a doubleword element (K,N) of the specified second source matrix, and to accumulate and saturate the eight products with previous contents of the doubleword element.

Computational memory

An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.

Dynamic re-evaluation of parameters for non-volatile memory using microcontroller

A non-volatile memory apparatus and corresponding method of operation are provided. The apparatus includes non-volatile memory cells in an integrated circuit device along with a microcontroller in communication with the non-volatile memory cells. The microcontroller is configured to receive a memory operation command and in response, determine a condition value of one of a plurality of conditions associated with the memory operation command and whether the one of the plurality of conditions is dynamic. In parallel, the microcontroller determines and outputs an output value using the condition value. The microcontroller then determines whether the one the plurality of conditions has changed. If the one of the plurality of conditions is dynamic and has changed, the microcontroller determines an updated condition value and in parallel, compares the condition value and the updated condition value and determines and outputs an updated output value using the updated condition value and the comparison.

CORE-BASED SPECULATIVE PAGE FAULT LIST

An embodiment of an integrated circuit may comprise an instruction decoder to decode one or more instructions to be executed by a core, and circuitry coupled to the instruction decoder, the circuitry to determine if a decoded instruction involves a page to be fetched, and determine one or more hints for one or more optional pages that may be fetched along with the page for the decoded instruction. Other embodiments are disclosed and claimed.