G06F2212/302

MACHINE LEARNING SPARSE COMPUTATION MECHANISM

Techniques to improve performance of matrix multiply operations are described in which a compute kernel can specify one or more element-wise operations to perform on output of the compute kernel before the output is transferred to higher levels of a processor memory hierarchy.

Tile-Based Graphics
20230102320 · 2023-03-30 ·

A tile-based graphics system has a rendering space sub-divided into a plurality of tiles which are to be processed. Graphics data items, such as parameters or texels, are fetched into a cache for use in processing one of the tiles. Indicators are determined for the graphics data items, whereby the indicator for a graphics data item indicates the number of tiles with which that graphics data item is associated. The graphics data items are evicted from the cache in accordance with the indicators of the graphics data items. For example, the indicator for a graphics data item may be a count of the number of tiles with which that graphics data item is associated, whereby the graphics data item(s) with the lowest count(s) is (are) evicted from the cache.

EFFICIENT COMPRESSED VERBATIM COPY

Compressed verbatim copy can enable more efficient copying of compressed data. In one example, a compressed verbatim copy method involves receiving a command to copy compressed data from a source address of the memory device to a destination address. In response to the receipt of the command, the method involves copying the compressed data in a compressed format from the source address to the destination address without first decompressing the data. A second source address and a second destination address of metadata for the compressed data is determined, and the metadata is copied from the second source address to the second destination address.

GPU cache management based on locality type detection

Wavefront loading in a processor is managed and includes monitoring a selected wavefront of a set of wavefronts. Reuse of memory access requests for the selected wavefront is counted. A cache hit rate in one or more caches of the processor is determined based on the counted reuse. Based on the cache hit rate, subsequent memory requests of other wavefronts of the set of wavefronts are modified by including a type of reuse of cache lines in requests to the caches. In the caches, storage of data in the caches is based on the type of reuse indicated by the subsequent memory access requests. Reused cache lines are protected by preventing cache line contents from being replaced by another cache line for a duration of processing the set of wavefronts. Caches are bypassed when streaming access requests are made.

IMMEDIATE OFFSET OF LOAD STORE AND ATOMIC INSTRUCTIONS

One embodiment provides a graphics processor including a processing resource including a register file, memory, a cache memory, and load/store/cache circuitry to process load, store, and prefetch messages from the processing resource. The circuitry includes support for an immediate address offset that will be used to adjust the address supplied for a memory access to be requested by the circuitry. Including support for the immediate address offset removes the need to execute additional instructions to adjust the address to be accessed prior to execution of the memory access instruction.

Methods and apparatus for implementing cache policies in a graphics processing unit

A method of processing a workload in a graphics processing unit (GPU) may include detecting a work item of the workload in the GPU, determining a cache policy for the work item, and operating at least a portion of a cache memory hierarchy in the GPU for at least a portion of the work item based on the cache policy. The work item may be detected based on information received from an application and/or monitoring one or more performance counters by a driver and/or hardware detection logic. The method may further include monitoring one or more performance counters, wherein the cache policy for the work item may be determined and/or changed based on the one or more performance counters. The cache policy for the work item may be selected based on a runtime learning model.

Memory controllers including examples of calculating hamming distances for neural network and data center applications

Examples of systems and method described herein provide for the processing of image codes (e.g., a binary embedding) at a memory controller with various memory devices. Such images codes may generated by various endpoint computing devices, such as Internet of Things (IoT) computing devices, Such devices can generate a Hamming processing request, having an image code of the image, to compare that representation of the image to other images (e.g., in an image dataset) to identify a match or a set of neural network results. Advantageously, examples described herein may be used in neural networks to facilitate the processing of datasets, so as to increase the rate and amount of processing of such datasets. For example, comparisons of image codes can be performed “closer” to the memory devices, e.g., at the memory controller coupled to memory devices.

GUARANTEED REAL-TIME CACHE CARVEOUT FOR DISPLAYED IMAGE PROCESSING SYSTEMS AND METHODS
20230081746 · 2023-03-16 ·

An electronic device may include an electronic display to display an image based on processed image data. The electronic device may also include image processing circuitry to generate the processed image data based on input image data and previously determined data stored in memory. The image processing circuitry may also operate according to real-time computing constraints. Cache memory may store the previously determined data in a provisioned section of the cache memory allotted to the image processing circuitry. Additionally, a controller may manage reading and writing of the previously determined data to the provisioned section of the cache memory.

Cache coherent acceleration function virtualization
11474871 · 2022-10-18 · ·

The embodiments herein describe a virtualization framework for cache coherent accelerators where the framework incorporates a layered approach for accelerators in their interactions between a cache coherent protocol layer and the functions performed by the accelerator. In one embodiment, the virtualization framework includes a first layer containing the different instances of accelerator functions (AFs), a second layer containing accelerator function engines (AFE) in each of the AFs, and a third layer containing accelerator function threads (AFTs) in each of the AFEs. Partitioning the hardware circuitry using multiple layers in the virtualization framework allows the accelerator to be quickly re-provisioned in response to requests made by guest operation systems or virtual machines executing in a host. Further, using the layers to partition the hardware permits the host to re-provision sub-portions of the accelerator while the remaining portions of the accelerator continue to operate as normal.

TRUSTED LOCAL MEMORY MANAGEMENT IN A VIRTUALIZED GPU

Embodiments are directed to trusted local memory management in a virtualized GPU. An embodiment of an apparatus includes one or more processors including a trusted execution environment (TEE); a GPU including a trusted agent; and a memory, the memory including GPU local memory, the trusted agent to ensure proper allocation/deallocation of the local memory and verify translations between graphics physical addresses (PAs) and PAs for the apparatus, wherein the local memory is partitioned into protection regions including a protected region and an unprotected region, and wherein the protected region to store a memory permission table maintained by the trusted agent, the memory permission table to include any virtual function assigned to a trusted domain, a per process graphics translation table to translate between graphics virtual address (VA) to graphics guest PA (GPA), and a local memory translation table to translate between graphics GPAs and PAs for the local memory.