Patent classifications
G06F2212/302
ATOMIC HANDLING FOR DISAGGREGATED 3D STRUCTURED SOCS
In a further embodiment, a system on a chip integrated circuit (SoC) is provided that includes an active base die including a first cache memory, a first die mounted on and coupled with the active base die, and a second die mounted on the active base die and coupled with the active base die and the first die. The first die includes an interconnect fabric, an input/output interface, and an atomic operation handler. The second die includes an array of graphics processing elements and an interface to the first cache memory of the active base die. At least one of the graphics processing elements is configured to perform, via the atomic operation handler, an atomic operation to a memory device.
Systems and methods for improving cache efficiency and utilization
- Altug Koker
- Joydeep Ray
- Ben Ashbaugh
- Jonathan Pearce
- Abhishek Appu
- Vasanth Ranganathan
- Lakshminarayanan Striramassarma
- Elmoustapha Ould-Ahmed-Vall
- Aravindh Anantaraman
- Valentin Andrei
- Nicolas Galoppo von Borries
- Varghese George
- Yoav Harel
- Arthur Hunter, JR.
- Brent Insko
- Scott Janus
- Pattabhiraman K
- Mike Macpherson
- Subramaniam Maiyuran
- Marian Alin Petre
- Murali Ramadoss
- Shailesh Shah
- Kamal Sinha
- Prasoonkumar Surti
- Vikranth Vemulapalli
Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.
GRAPHICS DISCARD ENGINE
Systems, apparatuses, and methods for implementing a discard engine in a graphics pipeline are disclosed. A system includes a graphics pipeline with a geometry engine launching shaders that generate attribute data for vertices of each primitive of a set of primitives. The attribute data is consumed by pixel shaders, with each pixel shader generating a deallocation message when the pixel shader no longer needs the attribute data. A discard engine gathers deallocations from multiple pixel shaders and determines when the attribute data is no longer needed. Once a block of attributes has been consumed by all potential pixel shader consumers, the discard engine deallocates the given block of attributes. The discard engine sends a discard command to the caches so that the attribute data can be invalidated and not written back to memory.
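The bookkeeping described in this abstract can be pictured as per-block tracking of outstanding consumers. The following is a minimal sketch under stated assumptions, not the patented design: the class name `DiscardEngine`, its methods, and the use of a Python list as a stand-in for the discard command sent to the caches are all hypothetical.

```python
class DiscardEngine:
    """Toy model: tracks which pixel-shader consumers still need each attribute block."""

    def __init__(self):
        self.pending = {}    # block id -> set of consumers that have not deallocated yet
        self.discarded = []  # block ids for which a discard command was issued

    def allocate(self, block_id, consumers):
        """Register a block of attributes and every potential consumer of it."""
        self.pending[block_id] = set(consumers)

    def deallocate(self, block_id, consumer):
        """Record one deallocation message; return True if the block was discarded."""
        waiting = self.pending[block_id]
        waiting.discard(consumer)
        if waiting:
            return False  # some consumer may still read the attributes
        del self.pending[block_id]
        # Stand-in for the discard command: the caches could now invalidate
        # the block without writing it back to memory.
        self.discarded.append(block_id)
        return True
```

A block is discarded only after the last registered consumer reports in, which mirrors the "consumed by all potential pixel shader consumers" condition in the abstract.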
DISTRIBUTED COMPRESSION/DECOMPRESSION SYSTEM
A graphics processor includes multiple levels of memory units, including a memory device and a cache device located near a graphics component. The graphics processor includes distributed compression/decompression, including a module between the cache device and the memory device. The module can perform compression of write data when the write data is moved from the cache device to the memory device, and perform decompression of read data when the read data is moved from the memory device to the cache device. The graphics processor can include a second level of cache with another compression module between the first level of cache and the second level of cache.
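As a rough illustration of a compression module sitting between a cache and a memory device, the sketch below compresses a line when it is written back and decompresses it when it is read in. It is an assumption-laden toy: `CompressingMemoryPath` and its methods are hypothetical names, and zlib stands in for whatever codec the hardware would actually use.

```python
import zlib


class CompressingMemoryPath:
    """Toy model of a cache, a memory device, and a codec between them."""

    def __init__(self):
        self.cache = {}   # address -> uncompressed bytes (the cache device)
        self.memory = {}  # address -> compressed bytes (the memory device)

    def write_back(self, address):
        """Move a line from cache to memory, compressing it on the way."""
        self.memory[address] = zlib.compress(self.cache.pop(address))

    def fill(self, address):
        """Move a line from memory to cache, decompressing it on the way."""
        self.cache[address] = zlib.decompress(self.memory[address])
        return self.cache[address]
```

A second module of the same shape could sit between two cache levels, as the abstract's last sentence suggests.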
APPARATUS AND METHOD TO IMPROVE MEMORY ACCESS PERFORMANCE BETWEEN SHARED LOCAL MEMORY AND SYSTEM GLOBAL MEMORY
Described is a machine-readable storage medium having instructions stored thereon, that when executed, cause a processor to perform a method which comprises: grouping two or more work groups to form a super-workgroup; and partitioning a portion of a memory space into one or more super-shared local memories (Super-SLMs), wherein the memory space shared within the super-workgroup forms at least one Super-SLM of the one or more Super-SLMs. Described is an apparatus which comprises: a plurality of execution units; a cache memory having a portion characterized as a SLM which is shared with the plurality of execution units at least one of which is to operate on a work group of a sub-slice, wherein the SLM is shared within the work group; and at least one Super-SLM for providing shared memory accessible by different work groups in the sub-slice, wherein the at least one of the execution units is to operate on the different work groups.
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.
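The multiply, add, and rectified linear unit sequence can be sketched in software as follows. This is an illustrative model only: the function names are hypothetical, BF16 conversion is done by truncating a float32 to its upper 16 bits (hardware may round to nearest instead), and the accumulator is an ordinary Python float rather than any format named in the abstract.

```python
import struct


def to_bf16(x: float) -> float:
    """Reduce x to bfloat16 precision by keeping the top 16 bits of its float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # BF16 keeps the sign bit, the 8-bit exponent, and the top 7 mantissa bits.
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def dot_accumulate_relu(a, b, acc=0.0):
    """Return relu(acc + sum(a[i] * b[i])) with inputs reduced to BF16 precision."""
    total = acc
    for x, y in zip(a, b):
        total += to_bf16(x) * to_bf16(y)  # multiply, then add to the running result
    return max(0.0, total)                # rectified linear unit
```

Because BF16 shares float32's eight-bit exponent, the conversion only drops mantissa bits; for example, `to_bf16(3.14159)` yields 3.140625.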
ARCHITECTURE AND ALGORITHMS FOR DATA COMPRESSION
A system architecture conserves memory bandwidth by including a compression utility to process data transfers from the cache into external memory. The cache decompresses transfers from external memory and transfers full format data to naive clients that lack decompression capability and directly transfers compressed data to savvy clients that include decompression capability. An improved compression algorithm includes software that computes the difference between the current data word and each of a number of prior data words. Software selects the prior data word with the smallest difference as the nearest match and encodes the bit width of the difference to this data word. Software then encodes the difference between the current stride and the closest previous stride. Software combines the stride, bit width, and difference to yield the final encoded data word. Software may encode the stride of one data word as a value relative to the stride of a previous data word.
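The nearest-match step can be sketched as below. The abstract does not define "stride" precisely, so this sketch assumes it means how many words back the nearest match sits; the function names, the tuple layout, and the tie-breaking rule are likewise hypothetical and not taken from the patent.

```python
def encode_word(current, history, prev_stride=0):
    """Return (stride_delta, bit_width, diff) for `current`.

    `history` lists prior data words, most recent last, and must be non-empty.
    """
    # Compare against every prior word; keep the smallest absolute difference,
    # preferring the most recent word on ties.
    stride, diff = min(
        ((back, current - word) for back, word in enumerate(reversed(history), start=1)),
        key=lambda pair: (abs(pair[1]), pair[0]),
    )
    bit_width = abs(diff).bit_length()  # magnitude bits only; sign is carried in diff
    # The stride is stored relative to the previous word's stride.
    return stride - prev_stride, bit_width, diff


def decode_word(encoded, history, prev_stride=0):
    """Invert encode_word given the same history and previous stride."""
    stride_delta, _bit_width, diff = encoded
    return history[-(prev_stride + stride_delta)] + diff
```

With `history = [100, 200, 205]`, the word 203 is closest to 205, so it encodes as a stride of 1, a bit width of 2, and a difference of -2.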
HIERARCHICAL LOSSLESS COMPRESSION AND NULL DATA SUPPORT
Described herein are computer graphics technologies to facilitate effective and efficient memory handling for blocks of memory including texture maps. More particularly, one or more implementations described herein facilitate hierarchical lossless compression of memory with null data support for memory resources, including texture maps. More particularly still, one or more implementations described herein facilitate the use of meta-data for lossless compression and the support of null encodings for Tiled Resources. This technology also permits use of the fast-clear compression method, where meta-data specifies that the entire access should return some specified clear value.
Cache memory system and operating method for the same
A cache memory system includes a cache memory, which stores cache data corresponding to portions of main data stored in a main memory and priority data respectively corresponding to the cache data; a table storage unit, which stores a priority table including information regarding access frequencies with respect to the main data; and a controller, which, when at least one from among the main data is requested, determines whether cache data corresponding to the request is stored in the cache memory, deletes one from among the cache data based on the priority data, and updates the cache data set with new data, wherein the priority data is determined based on the information regarding access frequencies.
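A minimal sketch of frequency-driven replacement, assuming the priority of a cached entry is simply its access count in the priority table. `PriorityCache`, its fields, and the "evict the least frequently accessed entry" policy are illustrative assumptions; the abstract does not spell out how priority is derived from the frequencies.

```python
class PriorityCache:
    """Toy cache that evicts the entry whose main data was accessed least often."""

    def __init__(self, capacity, main_memory):
        self.capacity = capacity
        self.main_memory = main_memory  # address -> main data
        self.lines = {}                 # address -> cache data
        self.access_counts = {}         # priority table: address -> access frequency

    def read(self, address):
        # Update the priority table on every request, hit or miss.
        self.access_counts[address] = self.access_counts.get(address, 0) + 1
        if address in self.lines:
            return self.lines[address]  # cache hit
        if len(self.lines) >= self.capacity:
            # Cache full: delete the lowest-priority entry to make room.
            victim = min(self.lines, key=lambda a: self.access_counts[a])
            del self.lines[victim]
        self.lines[address] = self.main_memory[address]
        return self.lines[address]
```

With a capacity of two, reading address 0 twice, then 1, then 2 evicts address 1, since it has the lowest access count among the cached entries.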
Warping data
A method of warping data includes the steps of providing a set of N-dimensional target coordinates, calculating, by a warping engine, N-dimensional source coordinates for the target coordinates, requesting, by the warping engine, data values for a plurality of source coordinates from a cache, and computing, by the warping engine, interpolated data values for each