Texture pipeline with online variable rate dictionary compression
09947071 · 2018-04-17
CPC classification (PHYSICS): G09G2340/02, G09G2360/18, G06T11/40, G09G5/393, G09G5/395
Abstract
A graphics system supports variable rate compression and decompression of texture data and color data. An individual block of data is analyzed to determine a compression data type from a plurality of different compression data types having different compression lengths. The compression data types may include a compression data type for a block having a constant (flat) pixel value over n×n pixels, a compression data type in which a subset of 3 or 4 values represents a plane or gradient, and a wavelet or other compression type to represent higher frequency content. Additionally, metadata indexing provides information to map an uncompressed address to a compressed address. To reduce the storage requirement, the metadata indexing permits two or more duplicate data blocks to reference the same piece of compressed data.
Claims
1. A method of variable rate compression of pixel or texel color values in a texture pipeline of a graphics processing system, comprising: analyzing a set of blocks of color values of one of a texture and an image, wherein the one of the texture and the image is divided into the set of blocks of color values, each of the blocks of color values comprising a region of n×n pixels or texels (n being a natural number); for each individual block of color values, independently determining an associated data type of a plurality of data types based on the individual block of color values, each of the plurality of data types having a different compressed length with each data type having an associated compression type, the plurality of data types comprising a flat data type, a planar data type, and one of a wavelet data type and a spline data type; for each block of color values, compressing color data of each block based on the determined associated data type, the determined associated data type being one of the flat data type, the planar data type, the wavelet data type, or the spline data type; for each compressed block of color values, generating metadata defining a mapping between an uncompressed texture address space of each block and a compressed texture address space and indicating the data type of each block; and decompressing a block of pixels by accessing a dictionary value and using the dictionary value to generate uncompressed color data for a plurality of data blocks.
2. The method of claim 1, wherein duplicative blocks in the set of blocks are indexed to a single instance of representative compressed data.
3. The method of claim 2, further comprising generating a dictionary, wherein the metadata defines an index to representative compressed data in the dictionary.
4. The method of claim 1, wherein the flat data type is one in which all of the color values of the block have the same value and a single representative color value is compressed to represent the data of the block.
5. The method of claim 4, wherein the planar data type is one in which three color values are used to represent the color values of the block.
6. The method of claim 5, wherein one of a wavelet compression and a spline compression is used to compress the color values of the block associated with a corresponding one of the wavelet data type and the spline data type.
7. The method of claim 1, wherein the compression is performed at runtime for a dynamic texture.
8. The method of claim 1, further comprising variable rate decompression of a block of pixels, including utilizing the metadata to identify compressed data for the block and the compression type and performing decompression using a decoder selected for the compression type to obtain uncompressed color data for the block.
9. The method of claim 8, wherein a decompressor receives an uncompressed address miss from a cache and in response accesses metadata and dictionary or compressed values and performs decoding to generate decompressed data for the cache.
10. The method of claim 9, wherein the variable rate compression and decompression are performed at least in part in hardware.
11. The method of claim 1, wherein the variable rate compression is lossless compression and a data block that cannot be compressed within a maximum size limit is stored as an uncompressed block.
12. The method of claim 1, wherein the variable rate compression includes compressing texture data into compressed data and storing the compressed data as a transformed texture in an on-chip L2 cache memory.
13. The method of claim 12, further comprising in response to a cache memory miss, using the metadata to determine the address of the compressed texture data, retrieving the compressed texture data, decoding the compressed texture data, and providing uncompressed texture data to the cache memory.
14. A graphics processing unit, comprising: a texture pipeline including: a compressor to perform variable rate compression of blocks of data of a texture by: analyzing the blocks of data of the texture, the texture being divided into the blocks of data, each of the blocks of data comprising a region of n×n pixels or texels (n being a natural number); for each individual block of data, independently determining an associated data type of a plurality of data types based on the individual block of data, each of the plurality of data types having a different compressed length with each data type having an associated compression type, the plurality of data types comprising a flat data type, a planar data type, and one of a wavelet data type and a spline data type; for each block of data, compressing data of each block based on the determined associated data type, the determined associated data type being one of the flat data type, the planar data type, the wavelet data type, or the spline data type; and a decompressor to perform decompression of the compressed data by accessing a dictionary value and using the dictionary value to generate uncompressed color data for a plurality of data blocks.
15. The graphics processing unit of claim 14, wherein for a set of duplicate blocks generating identical compressed values the compressor stores a single representative compressed value and each instance in the set of duplicate blocks is indexed to the same representative compressed value.
16. The graphics processing unit of claim 14, wherein the flat data type is one in which all color values of a block have the same value and the compressor stores a single representative color value to represent the block.
17. The graphics processing unit of claim 16, wherein the planar data type is one in which three color values are used to represent color data of the block.
18. The graphics processing unit of claim 14, wherein the graphics processing unit performs compression at runtime for dynamic textures.
19. The graphics processing unit of claim 14, further comprising on-chip L1 and L2 cache memory, wherein the graphics processing unit performs variable rate compression of texture data stored in the L2 cache and the decompressor decompresses the compressed data stored in the L2 cache.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(9) Generally speaking, embodiments of the present invention include variable rate compression of color and texture data in a graphics system, as well as decompression of the compressed data. This includes techniques to transform texture data to modified texture data and metadata header data, where the combined memory footprint of the modified texture and metadata header data is smaller than the actual uncompressed texture data. Additionally, embodiments of the present invention include techniques to access the modified texture data and decode it to the actual texture data.
(11) In one embodiment a variable rate compressor 110 supports different compression lengths for different types of data (e.g., a flat data compression type, a linear data compression type, and a wavelet data compression type, as described below in more detail). In this example, the compressed color data generated by compressor 110 is stored in the L2 cache 104. Additionally, a metadata manager (not shown) in compressor 110 generates metadata to assist in address mapping during decompression. A decompressor 120 includes a metadata cache 130, a dictionary table 140, and a decoder 145. The metadata cache 130 is populated with metadata via metadata requests to the L2 cache 104, as illustrated by the arrows in the accompanying figure.
(12) The compressed data is accessed from the L2 cache and provided to decoder 145. Decoder 145 supports different types of decoding based on a compression type and thus may also be considered as having several different decoders. A pass-through path is provided for uncompressed blocks. Once the compressed data is received, it is sent to the appropriate decoder depending on the compression data type. For example, a 2-bit code supports three different compression types and an uncompressed data type. More generally, an arbitrary number of different compression types could be supported.
(13) On a miss in the L1 cache 102, the address is sent to the decompressor 120. The address is looked up in the metadata cache 130 (e.g., via a header table or other data structure) to find the transformed memory address in the L2 cache. For purposes of illustration, the decompressor 120 is shown with a dictionary table 140 to support common cases of duplicate blocks.
(16) Determining a compression data type for a data block permits optimizing the compression of an individual block. For example, for a flat block, in which all of the pixels have the same color value, only a single representative value needs to be compressed to represent all of the color values of the block. Additionally, the compression of blocks containing higher frequency content can be performed using a longer compression length to implement lossless compression. Moreover, a decision may be made not to compress blocks that cannot be compressed losslessly within a maximum compression length.
(17) In a K×K set of blocks there may also be blocks at different locations that are duplicates of one another. For example, suppose that there are two flat blocks that are duplicates of each other in the sense that they have the same constant pixel color value A. Because these two blocks are duplicates of each other, an additional improvement in compression can be achieved by storing one compressed value for all duplicate flat blocks. More generally, there may be instances in which blocks of the same compression data type are duplicates of each other, such as two or more of the plane blocks being duplicates of each other, or two or more of the wavelet compression data type blocks being duplicates of each other.
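The duplicate-block indexing described above can be sketched as follows. This is an illustrative software model with hypothetical names, not the patented hardware: each unique compressed payload is stored once, and every duplicate block's metadata entry indexes the same dictionary slot.

```python
def build_dictionary(compressed_blocks):
    """compressed_blocks: per-block compressed payloads (e.g. bytes).
    Returns (dictionary of unique payloads, per-block index list)."""
    dictionary = []      # unique compressed payloads, stored once each
    index = []           # per-block index into the dictionary
    seen = {}
    for payload in compressed_blocks:
        if payload not in seen:
            seen[payload] = len(dictionary)
            dictionary.append(payload)
        index.append(seen[payload])
    return dictionary, index
```

For three blocks whose compressed payloads are A, B, A, only two payloads are stored and both A-blocks share one index, which is the storage saving described above.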
(19) An exemplary variable rate compression scheme is as follows. First the tile buffer 105 sends out pixel color data (for example, in RGBA8888 format) to the compressor 110 for a tile of size n×n pixels. The compressor 110 processes one n×n pixel square block at a time from the whole tile. For the purpose of illustration, n=4 will be used unless explicitly stated.
(20) For each n×n block, the compressor 110 makes a determination of the compression data type and corresponding entries are made in a metadata cache. Examples include determining whether a block has a flat (constant) pixel data compression type; determining whether a block represents a plane or gradient compression type; or determining whether the block can be represented with a wavelet based compression type. However, it will be understood that additional compression types may be supported and that there may also be instances in which a block is not compressible without loss. The compressed data may be stored as a modified texture in the L2 cache 104. The metadata may be used as an address translator that translates from the uncompressed address space to a compressed address space in order to fetch the modified texture and decompress it in the decoder.
(21) Consider the case of a flat compression data type. An n×n block of this type needs one entry in the metadata cache (also referred to interchangeably as the header table) 130, and the mapping process can point to a dictionary value stored in the dictionary table 140. In one embodiment additional bits referred to as CompressionType (00) are used to indicate that this was a flat compressed block. A determination is made whether flat compression is to be employed for an n×n block. Flat compression is used when all of the pixels in a block can be represented by the same value. If all of the pixels in the block have the same color value, then the block is marked as a flat block and stored as a single pixel value (e.g., RGBA8888) in the dictionary 140. This corresponds to a compression ratio of 16:1 for an individual 4×4 block. Additionally, if the block is duplicated, a further improvement in compression efficiency occurs because a single dictionary value represents two or more duplicate blocks. Table 1 compares uncompressed and compressed block sizes for flat blocks in common texel formats.
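A minimal software sketch of the flat-block path, under the assumption that a block arrives as a list of 16 packed RGBA8888 values (all names here are illustrative):

```python
def classify_flat(block):
    """block: list of 16 packed RGBA8888 texel values for one 4x4 block.
    A block is flat when every texel equals the first one."""
    return all(texel == block[0] for texel in block)

def compress_flat(block, dictionary):
    """Store one representative value per unique flat color in the
    dictionary; return the dictionary index for the block's metadata."""
    value = block[0]
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    return dictionary[value]
```

For a flat 4×4 RGBA8888 block this replaces 64 bytes with a single 4-byte dictionary value, the 16:1 ratio noted above; duplicate flat blocks reuse the same dictionary index.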
(22) TABLE 1. Constant Block Sizes for Common Texel Formats

Format                       RGBA8888  RGBA16161616  RGBA4444  RGB565  R8  Z24  Z32
Uncompressed 4×4 block (B)   64        128           32        32      16  48   64
Compressed block (B)         4         8             2         2       1   3    4
(23) If the block is not flat, then a determination is made whether the block is one of the other supported compression types. A determination is made whether the n×n block represents a plane (or gradient) that can be represented by a linear function ax + by + c, where a, b, c are color constants and x, y are pixel coordinates. Additionally, an optional fine detail component may be included. The detail component augments the color produced by the linear equation using a per-pixel additive value stored in 2's complement, leading to the color value = ax + by + c + d(x, y). In these cases, three color values can represent the whole n×n block (with an optional fourth component if the fine detail is included). Linear blocks thus provide an approximate compression ratio of 2 to 2.5 for 4×4 blocks. The CompressionType bits are set to (01) to indicate that this is a plane (or gradient) block.
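The planar test can be illustrated with a small sketch (hypothetical names; one color channel only, exact integer arithmetic assumed): the constants a, b, c are recovered from three samples, and the block qualifies only if every texel matches the linear function.

```python
def fit_plane(block):
    """block[y][x]: scalar channel values of an n x n block.
    Returns (a, b, c) if the block is an exact plane a*x + b*y + c,
    else None (the block must then try wavelet/uncompressed paths)."""
    c = block[0][0]              # value at (0, 0)
    a = block[0][1] - c          # horizontal gradient from (1, 0)
    b = block[1][0] - c          # vertical gradient from (0, 1)
    for y, row in enumerate(block):
        for x, value in enumerate(row):
            if value != a * x + b * y + c:
                return None
    return (a, b, c)
```

The three returned constants are the three stored color values described above; the optional detail component d(x, y) would be the per-pixel residual left after subtracting the plane.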
(24) For this situation there are several options for storing the three pixel values of the n×n block. One option is to store the three pixel values in the dictionary 140 with an entry in the metadata table. Another option is to store the values in memory as a modified texture along with the length of the three values (12 bytes for RGBA8888 format). In this case, the metadata maps an uncompressed address to a compressed address in memory. Table 2 compares uncompressed and compressed block sizes for linear blocks in common texel formats.
(25) TABLE 2. Linear Block Sizes for Common Texel Formats

Format                         RGBA8888  RGBA16161616  RGBA4444  RGB565  R8    Z24    Z32
Uncompressed 4×4 block (B)     64        128           32        32      16    48     64
Compressed, w/o detail (B)     13        25            7         6.75    3.25  9.25   12.25
Compressed, with detail (B)    29        41            n/a       n/a     7.25  21.25  28.25
(26) If the block is not a flat block type or a plane type, then a determination is made whether it is another type, such as a spline or wavelet compression type. For example, a general wavelet or DGR block is a block stored using a generic two-coefficient wavelet based Golomb-Rice (DGR) code. DGR provides an average compression ratio of 1.6 for 4×4 blocks. General wavelet blocks are ones in which no polynomial fit of the data can be achieved with degree < 4. In such cases, an efficient method of storing such blocks is to use a wavelet basis, which is then stored using recursive indexing followed by a run-length code, which may be implemented using a Golomb-Rice coding scheme.
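Golomb-Rice coding itself can be sketched in a few lines. This is the generic textbook formulation, not necessarily the exact DGR bitstream layout: each value is split by a divisor 2^k into a unary-coded quotient and k binary remainder bits.

```python
def gr_encode(value, k):
    """Golomb-Rice encode a non-negative integer with parameter k.
    Emits the quotient in unary ('1'*q followed by a '0' stop bit),
    then the remainder in k binary bits."""
    q, r = value >> k, value & ((1 << k) - 1)
    return '1' * q + '0' + (format(r, f'0{k}b') if k else '')

def gr_decode(bits, k):
    """Decode one Golomb-Rice value; returns (value, remaining bits)."""
    q = bits.index('0')                               # unary quotient
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0     # k remainder bits
    return (q << k) | r, bits[q + 1 + k:]
```

For example, gr_encode(5, 2) produces '1001': one quotient bit, a stop bit, and the remainder 01. Small residuals, which dominate after the wavelet transform, get short codes.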
(27) As examples of compression sizes for DGR blocks, consider the case of 4×4 blocks and the common texel formats of R8, Z24, and RGBA8888. In this example, this would lead to blocks of maximum size 16, 48, and 64 bytes respectively.
(28) Additionally, in one embodiment a fallback to storing uncompressed data is included if the block cannot be compressed acceptably using the other compression schemes. For example, consider the case in which lossless compression is a requirement for a dynamic texture. If lossless compression is required and none of the compression schemes support lossless compression for the block, the fallback is to store uncompressed data for the block. For the case of DGR, the wavelet compressor may include a rule that if the wavelet compressor cannot construct a representation smaller than the maximum compression length, the block is to be stored as-is without any compression.
(29) The decoder 145 supports pass-through of uncompressed data. However, for compressed data the decoder performs a decoding operation based on the compression type. An exemplary decompression scheme is as follows. The first time a new texture cache-line address is requested by the Level 1 cache 102, the metadata cache 130 sends an address request to fetch the metadata and the dictionary for the texture. The uncompressed address is stored in a FIFO in the metadata cache 130. The address request may include a separate texture identifier (T# ID). For 4×4 tiles, a request for a quad in a specific texture may be identified by the tuple of the T# and the address in the texture. This information permits a metadata query to be performed that includes a tile offset (from the texture base address) and a 4×4 block offset within the tile.
(30) The Level 2 cache 104 returns the metadata along with the dictionary to decode the texture. This is stored in the metadata cache 130 and the dictionary table 140. The uncompressed address is fetched from the FIFO and is used to look up the metadata cache to determine one of the following cases based on the CompressionType bits.
(31) For a flat compressed block, the CompressionType is 00, which indicates that the cache line address is represented by one value. In one embodiment, the metadata cache contains an index into the dictionary table that fetches the pixel value and sends it to the decoder for expansion to n×n color values. The data is returned to the L1 cache. For such cases, all Level 2 cache, and thus memory, accesses are eliminated. In another embodiment, the flat value may be stored along with the compressed data. In such a case, the metadata cache contains a compressed address from which the flat value is fetched.
(32) For a planar/gradient compressed block the CompressionType is 01 and indicates the cache line address is represented by a plane. Assuming the 3 vertices of a plane are stored in memory, the metadata cache contains a compressed address in the L2 cache from which to fetch 12 bytes. The L2 cache returns the 12 bytes to the decoder, which computes all n×n values and sends them back to the L1 cache.
(33) Additional CompressionType values are indicated by using different values for each compression algorithm that is used. For example, a CompressionType of 10 may be used to indicate wavelet/DGR compression and a CompressionType of 11 may be used to indicate uncompressed data. Data is either fetched from the L2 cache 104 or from the dictionary table and is sent to the L1 cache 102. For the case of wavelet compression, the decoder computes all n×n values of the uncompressed data and sends them back to the L1 cache.
(34) As an illustrative example, consider a given 8×8 texture that contains RGBA8888 (4-byte texel) data. The total size of the texture is 256 bytes uncompressed. In one embodiment, the texture is broken down into four 4×4 blocks of data. The metadata can be represented as a header table that contains 4 entries, one for each block. The header table is indexed by the block number. Each header table entry stores two values, a block offset (8 bits, for example, for byte addressability) and a block length (1 bit, to indicate a 4-byte or a 64-byte block). The block offset is added to the base address of the texture to fetch data from memory. If all 16 texel values within a block are the same, as assumed here for blocks one and two, then only 4 bytes of storage are needed for that whole block. The compressed length of the block is 4 bytes compared to 64 bytes for the uncompressed block. In this case, the first entry of the header table has an offset 0 to indicate the starting block address, and a length indicator bit of 0 (4 bytes). The second entry has an offset 4 to indicate it is 4 bytes from the starting texture address. Assume that the texels in blocks three and four are not all the same and are stored as 64-byte quantities. Entry 3 in the header table has an offset of 8 and a length indicator bit of 1 to indicate that the length is 64 bytes. Entry 4 is similarly filled.
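The worked 8×8 example above can be reproduced in a few lines of illustrative code (function and variable names are hypothetical). Blocks one and two are flat; blocks three and four are raw 64-byte blocks:

```python
def build_header(blocks):
    """blocks: list of 4x4 blocks, each a list of 16 packed texels.
    Builds (offset, length_bit) header entries, where length_bit 0
    means a 4-byte flat value and 1 a 64-byte uncompressed block."""
    header, offset = [], 0
    for block in blocks:
        flat = all(t == block[0] for t in block)
        header.append((offset, 0 if flat else 1))
        offset += 4 if flat else 64
    return header, offset          # offset == total compressed bytes

# The example's four blocks: two flat, two with varying texels.
blocks = [[1] * 16, [2] * 16, list(range(16)), list(range(16, 32))]
header, total = build_header(blocks)
```

This yields header entries (0, 0), (4, 0), (8, 1), (72, 1) and 136 bytes of block data; adding the roughly 5-byte header table (4 entries of 9 bits) matches the 141-byte fetch total discussed in the example.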
(35) When the texture L1 cache sends out a texture address, the decompressor 120 identifies the block number from the texel address and looks into the header table of the metadata cache 130. For any texel in the first block, it looks up the offset to compute the actual memory address. It then looks up the length indicator bit (in this case, 0) to determine that only 4 bytes of data need to be read. It sends the memory address along with the length to the L2 cache and/or memory to fetch only those 4 bytes. Once data is returned by the L2 cache, the decoder 145 uses the length bit to replicate the color value for all 16 texels in the block.
(36) In this simple example, the decompressor 120 fetches a total of 141 bytes from memory as opposed to 256 bytes in the traditional case, providing a memory bandwidth reduction of up to 45%.
(37) It will be understood that one option is to selectively use variable rate compression for applications in which it provides the greatest benefits. For example, the variable rate compression may be selectively turned on when dynamic textures are generated by the GPU during runtime and turned off when static textures are used.
(38) The metadata in the metadata cache 130 may be organized into a hierarchical table or other hierarchical data structure. As an illustrative example, support may be provided for different block sizes, such as block sizes from 4×4 to 64×64. A bit code may be used to indicate the block size of a texture. For example, a bit code of 001 may be used to indicate a 4×4 level hierarchy and a bit code of 101 may be used to indicate a 64×64 level hierarchy.
(39) As an example of a hierarchical metadata structure, the data structure may store a starting index into a decoder table and a code book table for each hierarchy. A cache may be provided for a decoder table to perform both metadata decoding and uncompressed-to-compressed translation. A separate cache may be provided for a code book table for code word storage. For each hierarchy, the decoder table may include bits representing a partition offset into memory, an actual offset in memory and length, and a code book value for each quad. In one embodiment a code book table may contain four texel values for each quad.
(40) The choice of block size determines the total size of the metadata and also the compression ratio of the block. There are tradeoffs between block size and metadata overhead. For example, choosing a 4×4 block supports compressing many blocks with either a flat compression or a linear compression. A 64×64 texture, for example, requires 256 metadata entries. If each metadata entry is 33 bits, the total size of the metadata is 1056 B, or 17 L1 cache lines, thus leading to a metadata overhead of 6.6%, or 2.06 bits/texel in a block. Increasing the block size to 8×8 texels reduces the metadata overhead by a factor of four, to 264 B or 5 L1 cache lines, thus leading to an overhead of 1.9%, or 0.51 bits/texel in a block. However, there is a greater computational effort to perform linear compression for an 8×8 block. Consequently, an 8×8 block may not allow for linear compression due to the increased arithmetic and area complexity, potentially reducing the compression rate. It will be understood that as an optimization, choices can be provided for a driver to choose between different block sizes.
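The overhead arithmetic in the preceding paragraph can be checked numerically. The 33-bit entry size and 64-byte cache line are taken from the text; the helper name is hypothetical.

```python
def metadata_overhead(tex_dim, block_dim, entry_bits=33, cacheline=64):
    """Metadata cost for a square tex_dim x tex_dim texture split into
    block_dim x block_dim blocks. Returns (total bytes, L1 cache lines,
    metadata bits per texel)."""
    entries = (tex_dim // block_dim) ** 2       # one entry per block
    total_bytes = entries * entry_bits / 8
    cachelines = -(-total_bytes // cacheline)   # ceiling division
    bits_per_texel = entries * entry_bits / (tex_dim * tex_dim)
    return total_bytes, cachelines, bits_per_texel
```

For a 64×64 texture this reproduces 1056 B, 17 cache lines, and 2.06 bits/texel with 4×4 blocks, and 264 B, 5 cache lines, and 0.51 bits/texel with 8×8 blocks, matching the figures above.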
(42) For RTs that require compression, the COMP unit 510 accumulates these values, performs the compression operation using variable length compressors (VLCs), and sends compressed data every cycle to the L2 cache at the block address specified by the TB. An exemplary output data size to the L2 cache is 256 B. As previously discussed, in one embodiment four different classes of compression algorithms are supported by the COMP unit. For efficient addressing, a metadata map is generated to translate every uncompressed pixel location into a compressed location along with the type and width of the compressed data. This metadata is stored at a metadata address pointed to by the driver. The metadata manager 515 may be assigned to manage the metadata storage process. The COMP unit 510 is responsible for accumulating metadata and compressed data into cache lines before sending them out to an L2 cache slice.
(43) In cases where the RT is the final RT, it is sent directly to the L2 cache as the final frame buffer. In such cases, a Format Conversion (FC) unit may be involved in intermediate operations. The FC unit performs format conversion and expansion to the final RT format.
(45) The compression process begins for each received uncompressed 4×4 quad by converting all color values into deltas by subtracting a pivot value from every pixel 605, using a 15-wide subtraction block. This pivot value is fixed to be the color value at the (0, 0) pixel within a 4×4 block, referred to as c(0, 0). Note that certain data types which are unsigned may need to be extended with one extra sign bit for this step. The results of this step, D(x, y), are used by subsequent decision blocks. A decision is made in block 610 whether all of the resulting values are zero. If yes, the 4×4 block is encoded as a constant. If not, a decision is made whether or not a linear polynomial may be fitted to the values of the block 615. If so, the block is encoded as a linear block. If not, a decision is made whether or not GR wavelet compression may be used to encode the block 620. This may include a check whether or not the total size of the encoded value is less than the uncompressed size. An additional check may be performed to determine whether or not the compressed size is less than a maximum allowable compressed size. If the block fails the GR encoding, it is written as an uncompressed block 630.
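The decision flow of blocks 605 through 630 can be summarized as a sketch. The plane-fit and GR-size estimators are passed in as parameters, since their internals are hardware-specific; all names are illustrative.

```python
def choose_encoding(block, fits_plane, gr_size, max_size=64):
    """block[y][x]: one channel of a 4x4 quad.
    fits_plane(block) -> bool: stand-in for the linear-fit test (615).
    gr_size(deltas) -> int: stand-in for the GR encoded size (620).
    Mirrors the constant -> linear -> wavelet -> uncompressed cascade."""
    pivot = block[0][0]                          # c(0, 0), block 605
    deltas = [v - pivot for row in block for v in row]
    if all(d == 0 for d in deltas):              # block 610
        return 'flat'
    if fits_plane(block):                        # block 615
        return 'linear'
    if gr_size(deltas) < max_size:               # block 620 size checks
        return 'wavelet'
    return 'uncompressed'                        # block 630 fallback
```

The cascade orders the tests from cheapest and most compact encoding to the fallback, so each block gets the shortest representation that holds exactly.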
(47) Hardware support is preferably provided for variable rate compression and decompression. The hardware support may include implementing at least some of the compression and decompression processing using mathematical operations implemented in hardware. That is, it is preferable from a speed and power consumption standpoint to implement as much of the compression and decompression process in hardware as possible. One aspect of the variable rate compression process is that it may be implemented using hardware assists to improve computational speed and power efficiency. For example, the flat compression type may be implemented using comparatively simple hardware to perform subtraction and accumulation operations. The linear data compression type and DGR compression type may also be performed using hardware assists. Similarly, the decoding/decompression operations may be implemented using hardware assists. The ability to use hardware assists for the compression and decompression operations improves speed and power consumption, thus facilitating the use of variable rate compression and decompression during runtime for dynamic textures.
(49) Once the compressed data is received, a mux 720 may be used to send the compressed data to the appropriate decoder depending on the CompressionType. In this example, the compressed data for a flat data type is sent to a corresponding flat data decoder 725, compressed data for a linear (planar) data type is sent to a linear decoder 730, and a differential (DGR wavelet) data type is sent to a differential decoder 735. A pass-through path exists for uncompressed blocks.
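A software analogue of the mux and decoder blocks (illustrative only; the DGR decoder is a placeholder, and the 2-bit codes follow the CompressionType assignments given earlier):

```python
def decode_linear(data):
    """Placeholder planar expander: data = [a, b, c], one channel."""
    a, b, c = data
    return [a * x + b * y + c for y in range(4) for x in range(4)]

def decode_dgr(data):
    """Placeholder for the wavelet/Golomb-Rice inverse transform."""
    raise NotImplementedError

DECODERS = {
    0b00: lambda data: data[:1] * 16,   # flat: replicate the single value
    0b01: decode_linear,                # planar: expand a*x + b*y + c
    0b10: decode_dgr,                   # DGR wavelet decoder
    0b11: lambda data: data,            # uncompressed pass-through
}

def demux(compression_type, data):
    """Route compressed data to the decoder selected by the 2-bit code."""
    return DECODERS[compression_type](data)
```

For instance, code 00 with one stored value expands to 16 identical texels, mirroring the flat decoder 725; code 11 passes the data through unchanged.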
(50) An intermediate cache is included within the data fetch block 715 to cache fetched compressed data, with the address of the cache line acting as a tag, to reduce redundant compressed data fetch.
(51) Decoded quads are written to an intermediate storage buffer 740, where the 4×4 block is constructed before being returned to a texture unit.
(52) The individual decoder blocks 725, 730, and 735 may be implemented using hardware assists that are effectively inverse operations to the compression process. Thus the decompression process may also be implemented to be relatively quick and power efficient.
(54) 1. The processor sends input co-ordinates to a texture addresser.
(55) 2. A texture addresser converts co-ordinates into memory addresses for fetching data and sends them to the L1 texture cache.
(56) 3. The L1 texture cache fetches the data and sends it to a conventional decompressor. On a miss in the L1 cache, data is fetched from L2 cache or from memory.
4. A lossy decompressor then decompresses the color data and sends it to the texture filter.
5. A filter unit interpolates the input data and sends it back to the processor engine.
(60) 1. On a miss in the L1 cache, the address is sent to the variable rate decompressor block 120. The address is looked up in the header table of the metadata cache 130 to find the transformed memory address in the L2 cache.
(61) 2. The L2 cache then sends the transformed texture data back to the variable rate decompressor block 120. The block decodes the transformed data to the actual texture data that is sent back to the L1 cache.
(63) As previously discussed, an application of variable rate compression and decompression is for use with dynamic textures generated at runtime. The time constraints for dynamic textures are very tight. Consequently, hardware assists for the variable rate compression and decompression provide important benefits in speed and energy efficiency.
(64) As previously discussed, an application of the present invention is for use with dynamic textures in which lossless compression is required. However, it will be understood that the variable rate compression and decompression may also be applied to applications in which lossy compression is acceptable.
(65) While the invention has been described in conjunction with specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention. In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The present invention may also be tangibly embodied as a set of computer instructions stored on a computer readable medium, such as a memory device.