G06F2212/6028 (Prefetching based on hints or prefetch instructions)

Controlling issue rates of requests of varying broadcast scopes in a data processing system

A coherent data processing system includes a system fabric communicatively coupling a plurality of coherence participants and fabric control logic. The fabric control logic quantifies congestion on the system fabric based on coherence messages associated with commands issued on the system fabric. Based on that congestion, the fabric control logic determines a rate of request issuance applicable to a set of coherence participants among the plurality of coherence participants, and issues at least one rate command that sets the rate at which the set of coherence participants issues requests to the system fabric.
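
The mechanism lends itself to a simple feedback loop. Below is a minimal Python sketch, assuming retry-type coherence responses serve as the congestion signal and that halving/doubling is the rate policy; the class name, thresholds, and the print standing in for a broadcast rate command are all illustrative, not taken from the patent.

```python
class FabricControl:
    """Quantify fabric congestion from coherence responses and issue rate commands."""

    def __init__(self, window=1000, high=0.10, low=0.02):
        self.window = window              # responses per sampling window (assumed)
        self.high, self.low = high, low   # retry-fraction thresholds (assumed)
        self.seen = self.retries = 0
        self.rate = 1.0                   # permitted issue rate, fraction of full rate

    def observe(self, response):
        """Count one coherence response; a 'retry' indicates a congested fabric."""
        self.seen += 1
        if response == "retry":
            self.retries += 1
        if self.seen == self.window:
            self._issue_rate_command()
            self.seen = self.retries = 0

    def _issue_rate_command(self):
        frac = self.retries / self.window
        if frac > self.high:
            self.rate = max(0.125, self.rate / 2)   # throttle the participants
        elif frac < self.low:
            self.rate = min(1.0, self.rate * 2)     # relax the limit
        # Stand-in for broadcasting a rate command to the selected participants.
        print(f"rate command: issue at {self.rate:.3f} of full rate")
```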

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, including a ray tracing operation and a matrix multiply operation, and a second processing cluster coupled to the first processing cluster. The first processing cluster includes a floating-point unit to perform floating point operations; the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format, with a multiplier that multiplies second and third source operands while an accumulator adds a first source operand to the output of the multiplier.
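
The numeric behavior is easy to model in software. Here is a small Python sketch of a BF16 dot-product accumulate, assuming BF16 values are obtained by truncating float32 (real hardware typically rounds) and that products are accumulated at higher precision; the function names are invented for the example.

```python
import struct

def bf16(x: float) -> float:
    """Truncate a float32 to bfloat16 (sign, 8-bit exponent, 7-bit mantissa)
    and return it widened back to a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bf16_dot_accumulate(src1, src2, src3):
    """src1 + sum(bf16(a) * bf16(b)): the multiplier works on the BF16 second
    and third source operands; the accumulator adds in the first operand."""
    acc = src1
    for a, b in zip(src2, src3):
        acc += bf16(a) * bf16(b)
    return acc

print(bf16_dot_accumulate(1.0, [1.5, 2.5], [2.0, 4.0]))   # 1 + 3 + 10 = 14.0
```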

ZERO LATENCY PREFETCHING IN CACHES

This invention involves a cache system in a digital data processing apparatus including: a central processing unit core; a level one instruction cache; and a level two cache. The cache lines in the level two cache are twice the size of the cache lines in the level one instruction cache. The central processing unit core requests additional program instructions when needed via a request address. Upon a miss in the level one instruction cache that hits in the upper half of a level two cache line, the level two cache supplies the upper half of that line to the level one instruction cache. On the following level two cache memory cycle, the level two cache supplies the lower half of the line to the level one instruction cache. This cache technique thus prefetches the lower half of the level two cache line employing fewer resources than an ordinary prefetch.
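
A toy model makes the two-cycle fill concrete. The sketch below assumes 32-byte L1I lines and 64-byte L2 lines ("upper" meaning the higher-addressed half) and a guaranteed L2 hit; the sizes and names are illustrative.

```python
L1_LINE = 32                 # assumed L1I line size in bytes
L2_LINE = 64                 # L2 lines are twice the L1I line size

def service_l1_miss(addr, l2, fill_l1):
    """Supply the demanded upper half of the L2 line immediately, then the
    lower half on the following L2 cycle as a low-cost prefetch."""
    base = addr - addr % L2_LINE
    line = l2[base]                            # the containing L2 line (assumed hit)
    fill_l1(base + L1_LINE, line[L1_LINE:])    # cycle 1: demanded upper half
    fill_l1(base, line[:L1_LINE])              # cycle 2: prefetched lower half

l2 = {0: bytes(range(64))}
fills = {}
service_l1_miss(40, l2, lambda a, d: fills.setdefault(a, d))
assert list(fills) == [32, 0]                  # upper half first, then lower
```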

MEMORY SYSTEM FOR ACCELERATING GRAPH NEURAL NETWORK PROCESSING
20230026824 · 2023-01-26

A memory system for accelerating graph neural network processing can include on-chip memory on the host to cache the data needed for processing the current root node. The system can also include a volatile memory interposed between the host and a non-volatile memory. The volatile memory can be configured to save one or more sets of next root nodes, neighbor nodes, and corresponding attributes. The non-volatile memory can have sufficient capacity to store the entire graph data, and can also be configured to pre-arrange the sets of next root nodes, neighbor nodes, and corresponding attributes for storage in the volatile memory.
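
The tiering reads naturally as a three-level lookup. The following Python sketch models it with plain dictionaries, assuming the graph maps each node to its neighbor list and attributes; the class and method names are invented for illustration.

```python
class GNNMemory:
    """Three tiers: on-chip cache (current root), DRAM (staged next-root sets),
    and non-volatile memory (the entire graph)."""

    def __init__(self, graph):       # graph: node -> (neighbors, attributes)
        self.nvm = graph             # full graph resides in non-volatile memory
        self.dram = {}               # pre-arranged sets staged in volatile memory
        self.cache = {}              # data for the root currently being processed

    def stage_next(self, roots):
        """Pre-arrange each next root with its neighbors and their attributes,
        and write the packed set into volatile memory."""
        for r in roots:
            nbrs, attrs = self.nvm[r]
            self.dram[r] = (nbrs, attrs, {n: self.nvm[n][1] for n in nbrs})

    def process(self, root):
        """Pull the staged set for one root into the on-chip cache."""
        self.cache = {root: self.dram.pop(root)}
        return self.cache[root]

graph = {i: ([j for j in (i - 1, i + 1) if 0 <= j < 4], {"feat": i}) for i in range(4)}
mem = GNNMemory(graph)
mem.stage_next([1, 2])
nbrs, attrs, nbr_attrs = mem.process(1)   # served from DRAM, not the full graph
```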

COMPRESSION AWARE PREFETCH

Methods, devices, and systems for prefetching data. First data is loaded from a first memory location. The first data is cached in a cache memory. Other data is prefetched to the cache memory based on a compression of the first data and a compression of the other data. In some implementations, the compression of the first data and the compression of the other data are determined based on metadata associated with the first data and metadata associated with the other data. In some implementations, the other data is prefetched to the cache memory based on a total of a compressed size of the first data and a compressed size of the other data being less than a threshold size. In some implementations, the other data is not prefetched to the cache memory based on the other data being uncompressed.
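
The decision rule reduces to a size test over metadata. Below is a minimal Python sketch, assuming metadata records a compressed size and a compressed flag and using an arbitrary threshold; the field names are hypothetical.

```python
THRESHOLD = 128   # assumed size budget in bytes for one combined fill

def maybe_prefetch(first_meta, other_meta, prefetch):
    """Prefetch 'other' only if it is compressed and the two compressed sizes
    together stay under the threshold, per the abstract's implementations."""
    if not other_meta["compressed"]:
        return False                    # uncompressed data is not prefetched
    total = first_meta["size"] + other_meta["size"]   # sizes come from metadata
    if total < THRESHOLD:
        prefetch(other_meta["addr"])
        return True
    return False

first = {"addr": 0x000, "size": 40, "compressed": True}
other = {"addr": 0x040, "size": 48, "compressed": True}
maybe_prefetch(first, other, lambda a: print("prefetch", hex(a)))   # fits: 88 < 128
```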

Register File Prefetch

Techniques relating to register file prefetch are described. In an embodiment, execution circuitry causes issuance of a prefetch request to copy data from a data cache unit to a register file. Other embodiments are also disclosed and claimed.
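
The abstract is terse, but the mechanism can be sketched: a prefetch request fills a load's destination register from the data cache unit ahead of execution. This tiny Python model is purely illustrative; the tuple layout and names are assumptions.

```python
def rf_prefetch(load, data_cache, register_file):
    """Copy the value for a pending load from the data cache unit into its
    destination register, so execution finds the operand already in place.
    'load' is a hypothetical (destination, address) pair."""
    dst, addr = load
    register_file[dst] = data_cache.get(addr, 0)

data_cache = {0x100: 42}
register_file = {}
rf_prefetch(("r5", 0x100), data_cache, register_file)
assert register_file["r5"] == 42    # operand is in the register file at execute
```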

Co-scheduled loads in a data processing apparatus
11693665 · 2023-07-04

A data processing apparatus, and a method of operating it, are disclosed. Issue circuitry buffers operations prior to execution until their operands are available in a set of registers. First and second load operations are identified in the issue circuitry when both depend on a common operand and that operand is available in the set of registers. Load circuitry has a first address generation unit to generate an address for the first load operation and a second address generation unit to generate an address for the second load operation. An address comparison unit compares the first address and the second address. When the comparison determines that the two addresses differ by less than a predetermined address range characteristic of the local temporary storage, the load circuitry causes a single merged lookup to be performed in that local temporary storage.
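
The merged lookup can be illustrated in a few lines. This Python sketch simplifies the "differ by less than a predetermined range" test to a same-line check, and treats the local temporary storage as a line-granular read; the granule size and names are assumptions.

```python
LINE = 16   # assumed granule of the local temporary storage

def co_schedule(base, off1, off2, read_line):
    """Both loads depend on the common operand 'base'; each address generation
    unit forms its own address. One merged lookup serves both loads when the
    addresses land in the same line."""
    a1, a2 = base + off1, base + off2            # first and second AGU outputs
    if a1 // LINE == a2 // LINE:                 # within the storage's reach
        line = read_line(a1 - a1 % LINE)         # single merged lookup
        return line[a1 % LINE], line[a2 % LINE]
    return (read_line(a1 - a1 % LINE)[a1 % LINE],
            read_line(a2 - a2 % LINE)[a2 % LINE])

mem, reads = bytes(range(256)), []
def read_line(addr):
    reads.append(addr)
    return mem[addr:addr + LINE]

print(co_schedule(100, 0, 4, read_line))   # both values from one lookup
assert len(reads) == 1
```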

Proactive caching of transient assistant action suggestions at a feature phone

Described is proactive caching, at a client device (e.g., a feature phone), of transient assistant action suggestions for selective rendering by an assistant client application of the client device. When a transient assistant action suggestion is rendered via the assistant client application and selected, it causes the application to initiate performance of the corresponding assistant action. In various implementations, a prefetched transient action suggestion can be a time-constrained suggestion with associated rendering restriction metadata that defines one or more temporal windows to which rendering of the suggestion is restricted. Proactive cache refresh rate metadata can also be associated with transient action suggestions; it defines a duration during which the assistant client application is to refrain from contacting a remote system to prefetch updated transient assistant action suggestions.
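
Both metadata kinds translate directly into client-side checks. Here is a minimal Python sketch, assuming a suggestion carries a list of (start, end) time-of-day windows and the refresh metadata is a duration in seconds; the schema is hypothetical.

```python
from datetime import datetime, time

def renderable(suggestion, now):
    """Render a time-constrained suggestion only inside one of the temporal
    windows in its rendering-restriction metadata."""
    t = now.time()
    return any(start <= t <= end for start, end in suggestion["windows"])

def may_refresh(last_prefetch, refresh_after_s, now):
    """Per the cache-refresh-rate metadata, refrain from contacting the
    remote system until the defined duration has elapsed."""
    return (now - last_prefetch).total_seconds() >= refresh_after_s

lunch = {"text": "Order lunch nearby?", "windows": [(time(11, 0), time(13, 30))]}
print(renderable(lunch, datetime(2024, 1, 1, 12, 15)))   # True: inside the window
print(may_refresh(datetime(2024, 1, 1, 12, 0), 3600,
                  datetime(2024, 1, 1, 12, 15)))         # False: too soon
```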

Prefetch of random data using application tags

A processor may boot a system. The processor may determine a type of operation on data based on an application tag. The processor may analyze at least one table specific to the application tag. The processor may perform an operation associated with the application tag.
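
Read as a lookup pipeline, the idea is a tag-indexed table driving a prefetch action. The Python sketch below is only an interpretation of the abstract; the table contents, tag names, and actions are all invented for illustration.

```python
# Hypothetical tag table: application tag -> access pattern and prefetch action.
TAG_TABLE = {
    "db-index":  {"pattern": "random",     "action": "prefetch_tagged_set"},
    "log-write": {"pattern": "sequential", "action": "prefetch_next_block"},
}

def on_access(tag, prefetchers):
    entry = TAG_TABLE.get(tag)           # analyze the table for this tag
    if entry is not None:
        prefetchers[entry["action"]]()   # perform the associated operation

on_access("db-index", {
    "prefetch_tagged_set": lambda: print("prefetch the set tagged as random"),
    "prefetch_next_block": lambda: print("prefetch the next sequential block"),
})
```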

STREAMING ENGINE WITH STREAM METADATA SAVING FOR CONTEXT SWITCHING
20230004391 · 2023-01-05

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces the addresses of data elements. A stream head register stores the data elements next to be supplied to functional units for use as operands. Stream metadata is stored in response to a stream store instruction, and stored stream metadata is restored to the streaming engine in response to a stream restore instruction. An interrupt changes an open stream to a frozen state, discarding stored stream data. A return from interrupt changes a frozen stream back to an active state.
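
The save/restore and freeze semantics amount to a small state machine. This Python sketch is a loose model under assumed names; the real streaming engine's registers and instructions are not reproduced here.

```python
class Stream:
    """Toy model of stream context switching: save/restore metadata, and
    freeze an open stream across an interrupt."""

    def __init__(self, loops):
        self.loops = loops            # nested-loop extents defining the stream
        self.pos = [0] * len(loops)   # current position within each loop level
        self.state = "closed"

    def open(self):
        self.state = "active"

    def save(self, store):
        store["meta"] = (self.loops, list(self.pos))   # stream store instruction

    def restore(self, store):
        self.loops, self.pos = store["meta"]           # stream restore instruction
        self.state = "active"        # refill resumes from the restored metadata

    def interrupt(self):
        if self.state == "active":
            self.state = "frozen"    # buffered stream data is discarded

    def return_from_interrupt(self):
        if self.state == "frozen":
            self.state = "active"

s, ctx = Stream([4, 16]), {}
s.open()
s.save(ctx)                 # context switch out: metadata captured
s.interrupt()               # open stream becomes frozen
s.return_from_interrupt()   # frozen stream becomes active again
s.restore(ctx)              # metadata restored to the streaming engine
```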