Patent classifications
G06F2212/455
Methods and apparatus for implementing cache policies in a graphics processing unit
A method of processing a workload in a graphics processing unit (GPU) may include detecting a work item of the workload in the GPU, determining a cache policy for the work item, and operating at least a portion of a cache memory hierarchy in the GPU for at least a portion of the work item based on the cache policy. The work item may be detected based on information received from an application, and/or by a driver and/or hardware detection logic that monitors one or more performance counters. The method may further include monitoring one or more performance counters, and the cache policy for the work item may be determined and/or changed based on those counters. The cache policy for the work item may be selected based on a runtime learning model.
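As a rough illustration of the policy-selection idea above, the following C++ sketch picks a cache policy for a detected work item from performance-counter readings; the names (WorkItemKind, CachePolicy, select_policy) and thresholds are hypothetical stand-ins, not the patented implementation or its learning model.

```cpp
// Minimal sketch (not the patent's implementation): a driver-side policy
// selector that picks a cache policy for a detected work item based on
// performance counters. All names and thresholds are illustrative.
#include <cstdint>
#include <iostream>

enum class WorkItemKind { VertexShading, PixelShading, Compute };
enum class CachePolicy { WriteBack, WriteThrough, Bypass };

struct PerfCounters {
    uint64_t cache_hits = 0;
    uint64_t cache_misses = 0;
    uint64_t evictions = 0;
};

// Pick a policy for one work item; fixed thresholds stand in for the
// runtime learning model described in the abstract.
CachePolicy select_policy(WorkItemKind kind, const PerfCounters& pc) {
    const uint64_t accesses = pc.cache_hits + pc.cache_misses;
    const double hit_rate = accesses ? double(pc.cache_hits) / accesses : 1.0;
    if (kind == WorkItemKind::Compute && hit_rate < 0.2)
        return CachePolicy::Bypass;      // streaming data: don't pollute the cache
    if (hit_rate < 0.5)
        return CachePolicy::WriteThrough;
    return CachePolicy::WriteBack;
}

int main() {
    PerfCounters pc{100, 900, 50};       // mostly misses -> bypass for compute
    std::cout << int(select_policy(WorkItemKind::Compute, pc)) << "\n";
}
```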
GUARANTEED REAL-TIME CACHE CARVEOUT FOR DISPLAYED IMAGE PROCESSING SYSTEMS AND METHODS
An electronic device may include an electronic display to display an image based on processed image data. The electronic device may also include image processing circuitry to generate the processed image data based on input image data and previously determined data stored in memory. The image processing circuitry may also operate according to real-time computing constraints. Cache memory may store the previously determined data in a provisioned section of the cache memory allotted to the image processing circuitry. Additionally, a controller may manage reading and writing of the previously determined data to the provisioned section of the cache memory.
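The carveout concept can be pictured with a minimal C++ sketch: a fixed number of cache ways is reserved for the real-time image-processing client, and a small manager only allocates from that region. The way counts and the CarveoutManager name are assumptions for illustration, not the disclosed controller.

```cpp
// Minimal sketch, not the disclosed controller: a fixed region of a
// set-associative cache is "carved out" (provisioned) for real-time image
// processing, and the manager only allocates display-pipeline lines from
// that region so other clients can never evict them. kTotalWays,
// kCarveoutWays, and CarveoutManager are illustrative assumptions.
#include <bitset>
#include <cstddef>
#include <optional>

constexpr std::size_t kTotalWays    = 16;  // ways in the shared cache
constexpr std::size_t kCarveoutWays = 4;   // ways reserved for image processing

class CarveoutManager {
    std::bitset<kTotalWays> in_use_;
public:
    // Allocate a way for real-time image data: only ways [0, kCarveoutWays)
    // are eligible, guaranteeing the carveout for the display pipeline.
    std::optional<std::size_t> allocate_realtime_way() {
        for (std::size_t w = 0; w < kCarveoutWays; ++w)
            if (!in_use_[w]) { in_use_[w] = true; return w; }
        return std::nullopt;  // carveout exhausted; caller must recycle a way
    }
    void release_way(std::size_t w) { in_use_[w] = false; }
};

int main() {
    CarveoutManager mgr;
    if (auto way = mgr.allocate_realtime_way())  // way 0 from the reserved region
        mgr.release_way(*way);
}
```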
Matrix transpose hardware acceleration
In one example, an apparatus comprises: a buffer memory; and a memory access circuit configured to: fetch, from a first memory, a set of first groups of data elements of a first matrix, each first group of data elements being stored at consecutive memory addresses at the first memory; based on a first configuration, store the set of first groups of data elements at consecutive memory addresses or at non-consecutive memory addresses at the buffer memory; based on a second configuration that defines a memory address offset, fetch a set of second groups of the data elements from the buffer memory, each second group of the data elements being stored at consecutive memory addresses of the buffer memory, each second group being separated by the memory address offset in the buffer memory; and store each fetched second group at consecutive addresses of a destination memory to form a second matrix.
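A software analogue of the described transpose flow, assuming the "memory address offset" of the second configuration equals the row stride: rows are staged contiguously in a buffer, then gathered with that stride and written contiguously to the destination. Function names are illustrative.

```cpp
// Minimal sketch of the transpose-via-staging-buffer idea (not the hardware):
// rows of the source matrix are copied contiguously into a buffer, then
// "second groups" are gathered from the buffer at a fixed address offset
// (the row stride) and written contiguously to the destination, which yields
// the transpose.
#include <cstddef>
#include <iostream>
#include <vector>

std::vector<int> transpose_via_buffer(const std::vector<int>& src,
                                      std::size_t rows, std::size_t cols) {
    // First configuration: store first groups (rows) contiguously.
    std::vector<int> buffer(src);
    // Second configuration: the memory address offset between consecutive
    // elements of a gathered group is one row (cols elements).
    std::vector<int> dst(rows * cols);
    for (std::size_t c = 0; c < cols; ++c)          // each output row
        for (std::size_t r = 0; r < rows; ++r)      // gather with stride = cols
            dst[c * rows + r] = buffer[r * cols + c];
    return dst;
}

int main() {
    std::vector<int> a = {1, 2, 3,
                          4, 5, 6};                 // 2x3 matrix
    for (int v : transpose_via_buffer(a, 2, 3)) std::cout << v << ' ';  // 1 4 2 5 3 6
}
```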
SYSTEMS AND METHODS FOR PRE-PROCESSING AND POST-PROCESSING COHERENT HOST-MANAGED DEVICE MEMORY
The disclosed computer-implemented method may include receiving, from a host via a cache-coherent interconnect, a request to access an address of a coherent memory space of the host. When the request is to write data, the computer-implemented method may include (1) performing, after receiving the data, a post-processing operation on the data to generate post-processed data and (2) writing the post-processed data to a physical address of a device-attached physical memory mapped to the address. When the request is to read data, the computer-implemented method may include (1) reading the data from the physical address of the device-attached physical memory mapped to the address, (2) performing, before responding to the request, a pre-processing operation on the data to generate pre-processed data, and (3) returning the pre-processed data to the host via the cache-coherent interconnect. Various other methods, systems, and computer-readable media are also disclosed.
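A minimal sketch of the write-path post-processing and read-path pre-processing described above, using an illustrative invertible byte transform in place of whatever processing the device applies; the handler names and the in-memory map standing in for device-attached memory are assumptions.

```cpp
// Minimal sketch, not the disclosed system: a device-side handler for
// cache-coherent read/write requests that post-processes data on writes
// before storing it to device-attached memory, and pre-processes it on
// reads before returning it to the host. The XOR transform is a stand-in
// for the actual processing operations.
#include <cstdint>
#include <iostream>
#include <unordered_map>

using Addr = std::uint64_t;
std::unordered_map<Addr, std::uint8_t> device_memory;  // stands in for device-attached physical memory

std::uint8_t post_process(std::uint8_t b) { return b ^ 0x5A; }  // illustrative transform
std::uint8_t pre_process(std::uint8_t b)  { return b ^ 0x5A; }  // inverse transform

void handle_write(Addr addr, std::uint8_t data) {
    device_memory[addr] = post_process(data);   // transform after receiving data
}

std::uint8_t handle_read(Addr addr) {
    return pre_process(device_memory[addr]);    // transform before responding
}

int main() {
    handle_write(0x1000, 42);
    std::cout << int(handle_read(0x1000)) << "\n";  // 42
}
```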
Tensor-based optimization method for memory management of a deep-learning GPU and system thereof
The present disclosure relates to a tensor-based optimization method for GPU memory management in deep learning, comprising at least the following steps: executing at least one computing operation that takes tensors as input and generates tensors as output; tracking access information for the tensors as each computing operation is executed; during a first iteration of training, performing memory swapping operations passively between a CPU memory and a GPU memory so as to obtain the access information for the tensors over a complete iteration; setting up a memory management optimization decision according to the access information obtained for the complete iteration; and, in successive iterations, dynamically adjusting the memory management optimization decision according to operational feedback.
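The profile-then-plan structure can be sketched as follows: tensor access steps recorded during the first (passively swapped) iteration feed a simple decision about which tensors to swap out in later iterations. The gap-based heuristic and all names are illustrative assumptions, not the disclosed method.

```cpp
// Minimal sketch of the two-phase idea (profile, then plan), not the actual
// system: during the first training iteration tensor accesses are recorded;
// afterwards a swap-out decision is built for tensors with large gaps
// between accesses. TensorInfo and plan_swaps are hypothetical names.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct TensorInfo {
    std::vector<int> access_steps;   // operation indices at which the tensor was accessed
    std::size_t bytes = 0;
};

// Decide which tensors are worth swapping to CPU memory between accesses.
std::vector<std::string> plan_swaps(
        const std::unordered_map<std::string, TensorInfo>& profile,
        int min_gap) {
    std::vector<std::string> swap_out;
    for (const auto& [name, info] : profile)
        for (std::size_t i = 1; i < info.access_steps.size(); ++i)
            if (info.access_steps[i] - info.access_steps[i - 1] >= min_gap) {
                swap_out.push_back(name);   // long idle gap: worth offloading
                break;
            }
    return swap_out;
}

int main() {
    std::unordered_map<std::string, TensorInfo> profile{
        {"activations_conv1", {{2, 95}, 64 << 20}},   // accessed early, reused late
        {"weights_conv1",     {{2, 3, 4}, 8 << 20}},
    };
    for (const auto& n : plan_swaps(profile, 50)) std::cout << n << "\n";
}
```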
Cache miss handling for read operations in data processing systems
In a data processing system comprising a cache system configured to transfer data stored in a memory system to a processor and vice-versa, a processing unit operable to read data from a cache of the cache system can send a read request for data to the cache. The cache system, in response to the read request, determines whether the requested data is present in the cache. When the requested data is present in the cache, the cache system returns the data from the cache to the processing unit and invalidates the entry for the data in the cache. When the requested data is not present in the cache, the cache system returns an indication of that to the processing unit, without the cache system sending a request for the data towards the memory system.
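A minimal sketch of the read-and-invalidate behaviour, assuming a simple map-backed cache model: a hit returns the data and invalidates the entry, while a miss returns only an indication and never forwards a request toward the memory system. Names are illustrative.

```cpp
// Minimal sketch of the behaviour described above (not the actual cache
// system): on a hit the entry is returned and invalidated; on a miss an
// indication is returned and no request is sent toward the memory system.
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

class ReadOnceCache {
    std::unordered_map<std::uint64_t, std::uint32_t> lines_;
public:
    void fill(std::uint64_t addr, std::uint32_t data) { lines_[addr] = data; }

    // Returns the data on a hit (and invalidates the entry), or nullopt on a
    // miss; crucially, a miss does NOT trigger a fetch from memory.
    std::optional<std::uint32_t> read(std::uint64_t addr) {
        auto it = lines_.find(addr);
        if (it == lines_.end()) return std::nullopt;   // miss indication only
        std::uint32_t data = it->second;
        lines_.erase(it);                              // invalidate after read
        return data;
    }
};

int main() {
    ReadOnceCache c;
    c.fill(0x40, 7);
    std::cout << c.read(0x40).value_or(0) << ' '       // 7 (hit, then invalidated)
              << c.read(0x40).has_value() << "\n";     // 0 (miss, no memory request)
}
```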
Methods, systems and apparatus to reduce memory latency when fetching pixel kernels
Methods, systems, apparatus, and articles of manufacture to reduce memory latency when fetching pixel kernels are disclosed. An example apparatus includes first interface circuitry to receive, at a first time, a first request from a hardware accelerator including first coordinates of a first pixel disposed in a first image block; second interface circuitry to receive, at a second time after the first time, a second request from the hardware accelerator including second coordinates; and kernel retriever circuitry to, in response to the second request, determine whether the first image block is in cache storage based on a mapping of the second coordinates to a block tag and, in response to determining that the first image block is in the cache storage, access, in parallel, two or more memory devices associated with the cache storage to transfer a plurality of image blocks including the first image block to the hardware accelerator.
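The tag-mapping step can be sketched as follows, assuming an 8x8 image block and a simple coordinate-to-tag packing (both assumptions, not the patented circuitry): coordinates map to a block tag, the tag is checked against cached blocks, and a hit allows blocks to be read from multiple memory banks in parallel.

```cpp
// Minimal sketch (not the patented circuitry): pixel coordinates are mapped
// to a block tag, the tag is looked up in a small cache of image blocks, and
// on a hit the surrounding kernel's blocks can be read from multiple banks.
// Block size, bank count, and names are illustrative assumptions.
#include <cstdint>
#include <iostream>
#include <unordered_set>

constexpr int kBlockW = 8, kBlockH = 8;   // illustrative image-block size
constexpr int kBanks  = 4;                // memory devices accessed in parallel

// Map pixel coordinates to the tag of the block containing them.
std::uint32_t block_tag(int x, int y) {
    return static_cast<std::uint32_t>((y / kBlockH) << 16 | (x / kBlockW));
}

int bank_of(std::uint32_t tag) { return tag % kBanks; }  // which device holds the block

int main() {
    std::unordered_set<std::uint32_t> cached_blocks{block_tag(10, 20)};
    int x = 12, y = 21;                                  // second request's coordinates
    std::uint32_t tag = block_tag(x, y);
    if (cached_blocks.count(tag))
        std::cout << "hit: read block from bank " << bank_of(tag)
                  << " (banks can be read in parallel)\n";
    else
        std::cout << "miss: fetch block from memory\n";
}
```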
Systems and methods for improving cache efficiency and utilization
- Altug Koker
- Joydeep Ray
- Ben Ashbaugh
- Jonathan Pearce
- Abhishek Appu
- Vasanth Ranganathan
- Lakshminarayanan Striramassarma
- Elmoustapha Ould-Ahmed-Vall
- Aravindh Anantaraman
- Valentin Andrei
- Nicolas Galoppo von Borries
- Varghese George
- Yoav Harel
- Arthur Hunter, JR.
- Brent Insko
- Scott Janus
- Pattabhiraman K
- Mike Macpherson
- Subramaniam Maiyuran
- Marian Alin Petre
- Murali Ramadoss
- Shailesh Shah
- Kamal Sinha
- Prasoonkumar Surti
- Vikranth Vemulapalli
Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.
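A minimal sketch of the "defaults versus instruction" decision, assuming a per-instruction cache hint field; the types and the effective_priority helper are hypothetical, not Intel's implementation.

```cpp
// Minimal sketch of the "default settings vs. per-instruction control" idea:
// the controller uses a per-instruction cache hint when one is present,
// otherwise it falls back to the cache's default priority. All names are
// illustrative assumptions.
#include <iostream>
#include <optional>

enum class Priority { Low, Normal, High };

struct CacheDefaults { Priority priority = Priority::Normal; };
struct Instruction   { std::optional<Priority> cache_hint; };  // encoded hint, if any

Priority effective_priority(const CacheDefaults& d, const Instruction& i) {
    return i.cache_hint ? *i.cache_hint : d.priority;  // instruction overrides defaults
}

int main() {
    CacheDefaults defaults;
    Instruction load_with_hint{Priority::High};
    Instruction plain_load{};
    std::cout << int(effective_priority(defaults, load_with_hint)) << ' '   // 2 (High)
              << int(effective_priority(defaults, plain_load)) << "\n";     // 1 (Normal)
}
```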
Cache optimization for graphics systems
A mechanism is described for facilitating optimization of cache associated with graphics processors at computing devices. A method of embodiments, as described herein, includes introducing coloring bits to contents of a cache associated with a processor including a graphics processor, wherein the coloring bits represent a signal identifying one or more caches available for use, while avoiding explicit invalidations and flushes.
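One way to picture coloring bits, under the assumption that they behave like a per-line color compared against a current color: bumping the current color logically invalidates stale lines without explicit invalidations or flushes. The ColoredCache model below is an illustrative sketch, not the patented mechanism.

```cpp
// Minimal sketch of the coloring-bit idea (not the patented mechanism):
// each cache line carries a small color, and the controller keeps a current
// color; bumping the color logically invalidates all lines with the old
// color without walking the cache or issuing explicit flushes.
#include <array>
#include <cstdint>
#include <iostream>

struct Line { std::uint32_t tag = 0; std::uint8_t color = 0; bool valid = false; };

class ColoredCache {
    std::array<Line, 256> lines_{};
    std::uint8_t current_color_ = 0;
public:
    void bump_color() { ++current_color_; }          // "invalidate" everything cheaply

    bool lookup(std::uint32_t tag) const {
        const Line& l = lines_[tag % lines_.size()];
        return l.valid && l.tag == tag && l.color == current_color_;  // stale color = miss
    }
    void fill(std::uint32_t tag) {
        Line& l = lines_[tag % lines_.size()];
        l = Line{tag, current_color_, true};
    }
};

int main() {
    ColoredCache c;
    c.fill(0x1234);
    std::cout << c.lookup(0x1234);           // 1 (hit)
    c.bump_color();                          // no explicit flush needed
    std::cout << c.lookup(0x1234) << "\n";   // 0 (logically invalidated)
}
```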
Reducing Memory Access Latencies During Ray Traversal
While prefetching data for a second fiber, a hierarchical data structure is traversed using a first fiber after deferring traversal for the second fiber. Then context is switched to the second fiber, and the hierarchical data structure is traversed using the second fiber while prefetching data for another fiber.
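A minimal software sketch of the defer-prefetch-switch loop, assuming round-robin fibers and an immediate stand-in for the prefetch; real implementations would traverse a bounding volume hierarchy with asynchronous memory prefetches. All names below are illustrative.

```cpp
// Minimal sketch of the deferred-traversal idea (not the patented scheme):
// when a fiber's next node is not yet resident, its traversal is deferred,
// a prefetch is issued for that node, and another fiber traverses while the
// prefetch is (conceptually) in flight.
#include <deque>
#include <iostream>

struct Node { int id; bool resident; };

struct Fiber {
    int id;
    std::deque<Node> pending;    // nodes still to visit in the hierarchy
};

void prefetch(Node& n) { n.resident = true; }   // stands in for an async memory prefetch

void traverse(std::deque<Fiber>& fibers) {
    while (!fibers.empty()) {
        Fiber f = fibers.front(); fibers.pop_front();
        if (f.pending.empty()) continue;               // fiber finished
        Node& n = f.pending.front();
        if (!n.resident) {
            prefetch(n);                               // hide latency behind other fibers
            fibers.push_back(f);                       // defer: switch to the next fiber
            continue;
        }
        std::cout << "fiber " << f.id << " visits node " << n.id << "\n";
        f.pending.pop_front();
        fibers.push_back(f);                           // round-robin context switch
    }
}

int main() {
    std::deque<Fiber> fibers{
        {1, {{10, true}, {11, false}}},
        {2, {{20, false}, {21, true}}},
    };
    traverse(fibers);
}
```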