Patent classifications
G06F2212/302
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, including a ray tracing operation and a matrix multiply operation, and a second processing cluster coupled to the first processing cluster. The first processing cluster includes a floating-point unit to perform floating-point operations. The floating-point unit is configured to process an instruction using a bfloat16 (BF16) format, with a multiplier to multiply second and third source operands and an accumulator to add a first source operand to the output of the multiplier.
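As a rough illustration of the instruction's semantics, the sketch below models the operation dst = src0 + src1 × src2 in scalar C++, assuming the common bfloat16 convention of truncating an IEEE-754 float32 to its upper 16 bits; the names `bf16_mul_acc`, `to_bf16`, and `to_f32` are hypothetical, and the actual hardware data path (rounding mode, accumulator width) may differ.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative bfloat16: the upper 16 bits of an IEEE-754 float32.
struct bf16 { uint16_t bits; };

static bf16 to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return bf16{static_cast<uint16_t>(u >> 16)};  // truncate the mantissa
}

static float to_f32(bf16 h) {
    uint32_t u = static_cast<uint32_t>(h.bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Hypothetical scalar model of the described instruction:
// dst = src0 + src1 * src2, with BF16 multiplicands and an FP32
// accumulator (a common hybrid-format convention; hardware may differ).
static float bf16_mul_acc(float src0, bf16 src1, bf16 src2) {
    return src0 + to_f32(src1) * to_f32(src2);
}
```

A dot product then reduces to repeated applications of this operation, carrying the FP32 accumulator across elements.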
SYSTEMS AND METHODS FOR CACHE OPTIMIZATION
- Altug Koker
- Joydeep Ray
- Elmoustapha Ould-Ahmed-Vall
- Abhishek Appu
- Aravindh Anantaraman
- Valentin Andrei
- Durgaprasad Bilagi
- Varghese George
- Brent Insko
- Sanjeev Jahagirdar
- Scott Janus
- Pattabhiraman K
- Sungye Kim
- Subramaniam Maiyuran
- Vasanth Ranganathan
- Lakshminarayanan Striramassarma
- Xinmin Tian
Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache memory that is coupled to the processing resources. The cache controller is configured to set an initial aging policy using an aging field based on the age of cache lines within the cache memory and to determine whether a hint or an instruction indicating a level of aging has been received. In one embodiment, the cache memory is configured to be partitioned into multiple cache regions, including a first cache region having a cache eviction policy with a configurable level of data persistence.
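A minimal sketch of how such an aging policy might behave, assuming an 8-bit aging field per line and a hint that scales the per-epoch aging step; the types `AgingHint`, `CacheLine`, and `CacheController` are illustrative, not the patent's actual design.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: each cache line carries a small aging field; a hint
// or instruction selects how aggressively lines age toward eviction.
enum class AgingHint : uint8_t { Default, AgeSlowly, AgeQuickly };

struct CacheLine {
    uint64_t tag = 0;
    uint8_t  age = 0;   // aging field; larger means closer to eviction
    bool     valid = false;
};

class CacheController {
public:
    explicit CacheController(std::size_t lines) : set_(lines) {}

    // Initial policy: every valid line ages one step per epoch.
    // A received hint scales that step.
    void applyHint(AgingHint h) { hint_ = h; }

    void tick() {
        uint8_t step = (hint_ == AgingHint::AgeQuickly) ? 2
                     : (hint_ == AgingHint::AgeSlowly)  ? 0 : 1;
        for (auto& l : set_)
            if (l.valid && l.age < 255 - step) l.age += step;
    }

    // Victim selection: evict the oldest line.
    std::size_t victim() const {
        std::size_t v = 0;
        for (std::size_t i = 1; i < set_.size(); ++i)
            if (set_[i].age > set_[v].age) v = i;
        return v;
    }

private:
    std::vector<CacheLine> set_;
    AgingHint hint_ = AgingHint::Default;
};
```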
Atomic memory update unit and methods
In an aspect, an update unit can evaluate condition(s) in an update request and update one or more memory locations based on the condition evaluation. The update unit can operate atomically to determine whether to effect the update and to make the update. Updates can include one or more of incrementing and swapping values. An update request may specify one of a pre-determined set of update types. Some update types may be conditional and others unconditional. The update unit can be coupled to receive update requests from a plurality of computation units. The computation units may not have privileges to directly generate write requests to be effected on at least some of the locations in memory. The computation units can be fixed function circuitry operating on inputs received from programmable computation elements. The update unit may include a buffer to hold received update requests.
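A software sketch of the described behavior, modeling an update unit that atomically evaluates a condition and applies the update via a compare-exchange loop; the request format and update-type names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical request format: a pre-determined set of update types,
// some conditional (guarded by a comparison) and some unconditional.
enum class UpdateType { Increment, Swap, IncrementIfLess, SwapIfEqual };

struct UpdateRequest {
    UpdateType type;
    uint32_t   operand;   // value to add or swap in
    uint32_t   compare;   // condition operand, if any
};

// Atomically evaluate the condition and apply the update in one step,
// modeled here with a compare-exchange loop on the target location.
inline void applyUpdate(std::atomic<uint32_t>& loc, const UpdateRequest& r) {
    uint32_t cur = loc.load();
    for (;;) {
        uint32_t next = cur;
        switch (r.type) {
            case UpdateType::Increment:       next = cur + r.operand; break;
            case UpdateType::Swap:            next = r.operand;       break;
            case UpdateType::IncrementIfLess: // update only if cur < compare
                if (cur < r.compare) next = cur + r.operand;          break;
            case UpdateType::SwapIfEqual:     // update only if cur == compare
                if (cur == r.compare) next = r.operand;               break;
        }
        if (next == cur) return;  // condition failed or no-op
        if (loc.compare_exchange_weak(cur, next)) return;
    }
}
```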
Reduction of BVH-Node Bandwidth with Incremental Traversal
Incremental encoding of bounding volume hierarchies (BVH) enables coarse quantization of bounding volumes, significantly reducing their memory footprint. In some embodiments, however, reducing the size of the BVH alone does not yield a comparable reduction in memory bandwidth: while the bounding volumes of the BVH nodes can be aggressively quantized, the size of the child node pointers remains a significant overhead. A two-level clustering method introduces a memory layout and node addressing scheme that allows BVH nodes to be reordered, reducing their memory footprint in hardware ray tracing systems that use reduced-precision ray traversal.
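A sketch of the general idea, assuming child bounds quantized to 8 bits relative to the parent's bounds and a small cluster-relative child index in place of a full pointer; the actual layout and addressing scheme may differ.

```cpp
#include <cstdint>

struct AABB { float lo[3], hi[3]; };

// Hypothetical incrementally-encoded node: child bounds are stored as
// coarse 8-bit offsets relative to the parent's bounds, and children are
// addressed by a small index within a cluster instead of a full pointer.
struct QuantizedChild {
    uint8_t qlo[3], qhi[3];  // quantized bounds, parent-relative
    uint8_t childIndex;      // offset within the node's cluster
};

// Decode a child's bounds by scaling the quantized offsets back into the
// parent's extent (a conservative decode would round lo down and hi up).
inline AABB decodeChild(const AABB& parent, const QuantizedChild& c) {
    AABB out;
    for (int i = 0; i < 3; ++i) {
        float extent = parent.hi[i] - parent.lo[i];
        out.lo[i] = parent.lo[i] + extent * (c.qlo[i] / 255.0f);
        out.hi[i] = parent.lo[i] + extent * (c.qhi[i] / 255.0f);
    }
    return out;
}

// Cluster-relative addressing: a cluster base plus a small index replaces
// a full child pointer, shrinking per-node pointer overhead.
inline uint64_t childAddress(uint64_t clusterBase, const QuantizedChild& c,
                             uint32_t nodeSize) {
    return clusterBase + static_cast<uint64_t>(c.childIndex) * nodeSize;
}
```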
Shared virtual memory
A method and system for shared virtual memory between a central processing unit (CPU) and a graphics processing unit (GPU) of a computing device are disclosed herein. The method includes allocating a surface within a system memory. A CPU virtual address space may be created, and the surface may be mapped to the CPU virtual address space within a CPU page table. The method also includes creating a GPU virtual address space equivalent to the CPU virtual address space, mapping the surface to the GPU virtual address space within a GPU page table, and pinning the surface.
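A toy model of the described flow, mapping the same virtual range into both page tables so CPU and GPU pointers are interchangeable; `PageTable`, `Surface`, and `shareSurface` are illustrative stand-ins for the real page-table and pinning machinery.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using VirtAddr = uint64_t;
using PhysAddr = uint64_t;

// Toy page table: virtual page number -> physical page number (4 KiB pages).
struct PageTable {
    std::unordered_map<uint64_t, uint64_t> map;
    void mapRange(VirtAddr va, PhysAddr pa, std::size_t pages) {
        for (std::size_t i = 0; i < pages; ++i)
            map[(va >> 12) + i] = (pa >> 12) + i;
    }
};

struct Surface { PhysAddr base; std::size_t pages; bool pinned = false; };

// Hypothetical flow from the abstract: the surface is mapped at the same
// virtual address in both page tables, creating equivalent CPU and GPU
// virtual address spaces; pinning keeps the backing pages resident.
VirtAddr shareSurface(Surface& s, PageTable& cpuPT, PageTable& gpuPT,
                      VirtAddr va) {
    cpuPT.mapRange(va, s.base, s.pages);  // map into CPU address space
    gpuPT.mapRange(va, s.base, s.pages);  // equivalent GPU address space
    s.pinned = true;                      // pin so pages stay resident
    return va;
}
```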
Memory management for graphics processing unit workloads
A method, a device, and a non-transitory computer readable medium for performing memory management in a graphics processing unit are presented. Hints about the memory usage of an application are provided to a page manager. At least one runtime memory usage pattern of the application is sent to the page manager. Data is swapped into and out of a memory by analyzing the hints and the at least one runtime memory usage pattern.
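One plausible sketch of such a page manager, combining an application-supplied hint with an observed access count to pick swap candidates; the hint categories and threshold logic are assumptions, not the patent's actual heuristics.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class UsageHint { Streaming, Reused, WriteOnce };

// Hypothetical page manager: swap decisions weigh an application-supplied
// hint against the observed runtime access pattern for each page.
class PageManager {
public:
    void hint(uint64_t page, UsageHint h) { hints_[page] = h; }
    void recordAccess(uint64_t page) { ++accesses_[page]; }

    // Pages hinted as streaming (touch-once) with few observed accesses
    // are the cheapest to swap out of GPU memory.
    std::vector<uint64_t> swapCandidates(uint32_t accessThreshold) const {
        std::vector<uint64_t> out;
        for (const auto& [page, h] : hints_) {
            auto it = accesses_.find(page);
            uint32_t n = (it == accesses_.end()) ? 0 : it->second;
            if (h == UsageHint::Streaming && n <= accessThreshold)
                out.push_back(page);
        }
        return out;
    }

private:
    std::unordered_map<uint64_t, UsageHint> hints_;
    std::unordered_map<uint64_t, uint32_t>  accesses_;
};
```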
Machine learning sparse computation mechanism
Techniques are described for improving the performance of matrix multiply operations, in which a compute kernel can specify one or more element-wise operations to perform on the kernel's output before that output is transferred to higher levels of a processor memory hierarchy.
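A minimal CPU-side sketch of the fusion, applying an element-wise operation in the matmul epilogue so the result is transformed while still in registers, before any store toward the memory hierarchy; the function and parameter names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical fused kernel: C = elementwise(A * B), where the element-wise
// op runs in the epilogue, before the result is written out to higher
// levels of the memory hierarchy.
template <typename EltOp>
void matmulFused(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, std::size_t M, std::size_t N,
                 std::size_t K, EltOp op) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;                 // accumulate in registers
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = op(acc);           // element-wise op before store
        }
    }
}

// Usage: fuse a ReLU into the matmul epilogue.
// matmulFused(A, B, C, M, N, K, [](float x) { return std::max(x, 0.0f); });
```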
Modifying Processing of Commands in a Command Queue Based on Accessed Data Related to a Command
Processing of commands at a graphics processor is controlled by receiving input data and generating from it a command for processing at the graphics processor, wherein the command will cause the graphics processor to write out at least one buffer of data to an external memory, and submitting the command to a queue for later processing. Subsequent to submitting the command, but before the write to external memory has completed, further input data is received and it is determined that the buffer of data does not need to be written to external memory. The graphics processor is then signalled to prevent at least a portion of the write to external memory from being performed for the command.
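A simplified model of this control flow, where a later signal clears a per-command write flag so the external-memory write is skipped when the command is eventually processed; `Command`, `CommandQueue`, and `cancelWrite` are hypothetical names.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical queued command: a flag records whether its output buffer
// still needs to be written out to external memory.
struct Command {
    uint64_t id;
    uint64_t bufferId;
    bool     writeBuffer = true;
};

class CommandQueue {
public:
    void submit(const Command& c) { q_.push_back(c); }

    // Signal, before the write completes, that a buffer is no longer
    // needed; any pending command targeting it skips the external write.
    void cancelWrite(uint64_t bufferId) {
        for (auto& c : q_)
            if (c.bufferId == bufferId) c.writeBuffer = false;
    }

    void processNext() {
        if (q_.empty()) return;
        Command c = q_.front();
        q_.pop_front();
        // ... execute the command on the graphics processor ...
        if (c.writeBuffer) {
            // write the buffer out to external memory
        }
    }

private:
    std::deque<Command> q_;
};
```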
Power Aware Translation Lookaside Buffer Invalidation Optimization
One disclosed embodiment includes a method for memory management. The method includes receiving a first request to clear one or more entries of a translation lookaside buffer (TLB), receiving a second request to clear one or more entries of the TLB, bundling the first request with the second request, determining that a processor associated with the TLB transitioned to an inactive mode, and dropping the bundled first and second requests based on the determination.
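A sketch of the bundling-and-drop behavior, on the premise that an inactive processor's TLB entries need not be eagerly invalidated; the request format and class names are illustrative.

```cpp
#include <cstdint>
#include <vector>

struct InvalidateRequest { uint64_t vaStart, vaEnd; };

// Hypothetical power-aware TLB shootdown path: requests are bundled, and
// the whole bundle is dropped if the target processor has transitioned to
// an inactive mode.
class TlbInvalidator {
public:
    void request(InvalidateRequest r) { bundle_.push_back(r); }  // bundle

    void flush(bool processorActive) {
        if (!processorActive) {
            bundle_.clear();          // drop the bundled requests
            return;
        }
        for (const auto& r : bundle_) {
            // ... invalidate TLB entries covering [r.vaStart, r.vaEnd) ...
            (void)r;
        }
        bundle_.clear();
    }

private:
    std::vector<InvalidateRequest> bundle_;
};
```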
MAPPING APERTURES OF DIFFERENT SIZES
Apertures of a first size in a first physical address space of at least one processor are mapped to respective blocks of the first size in a second address space of a storage medium. Apertures of a second size in the first physical address space are mapped to respective blocks of the second size in the second address space, the second size being different from the first size.
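An illustrative translation table under the assumption that small and large apertures occupy independently indexed ranges of the processor's physical address space; the class and method names are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical aperture table: each aperture of a given size in the
// processor's physical address space maps to a block of the same size in
// the storage medium's address space.
class ApertureMap {
public:
    ApertureMap(uint64_t smallSize, uint64_t largeSize)
        : small_(smallSize), large_(largeSize) {}

    void mapSmall(uint64_t apertureIdx, uint64_t blockIdx) {
        smallMap_[apertureIdx] = blockIdx;
    }
    void mapLarge(uint64_t apertureIdx, uint64_t blockIdx) {
        largeMap_[apertureIdx] = blockIdx;
    }

    // Translate an offset within the small-aperture region: aperture index
    // selects a block, and the remainder carries over unchanged.
    uint64_t translateSmall(uint64_t pa) const {
        uint64_t idx = pa / small_, off = pa % small_;
        return smallMap_.at(idx) * small_ + off;
    }

    // Same scheme for the large-aperture region, with large_ as the unit.
    uint64_t translateLarge(uint64_t pa) const {
        uint64_t idx = pa / large_, off = pa % large_;
        return largeMap_.at(idx) * large_ + off;
    }

private:
    uint64_t small_, large_;
    std::unordered_map<uint64_t, uint64_t> smallMap_, largeMap_;
};
```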