Decoder having multiple stages of adders to decode delta encoded data
12554463 · 2026-02-17
Assignee
Inventors
CPC classification
International classification
G06F7/501
PHYSICS
Abstract
Provided are a decoder unit, a decompressor unit, and a method for a decoder having multiple stages of adders to decode delta encoded data. A decoder unit implemented in a cache memory having a cache memory cell array comprises a first stage of adder circuits to add, in parallel, pairs of encoded items transformed using a delta encoding, wherein the encoded items include a plurality of deltas of neighbors of sequential source items. The decoder unit further comprises at least one successive stage of adder circuits to add, in parallel in each stage, pairs of outputs from a previous stage of adder circuits, wherein each successive stage has fewer adder circuits than the previous stage, and wherein output from a last of the at least one successive stage comprises the sequential source items.
Claims
1. A decoder unit implemented in a cache memory having a cache memory cell array, comprising: a first stage of adder circuits to add, in parallel, pairs of encoded items transformed using a delta encoding, wherein the encoded items include a plurality of deltas of neighbors of sequential source items; and at least one successive stage of adder circuits to add, in parallel in each stage of adder circuits, pairs of outputs multiple hops away from a previous stage of adder circuits, wherein each successive stage has fewer adder circuits than the previous stage of adder circuits to add fewer outputs than at the previous stage of adder circuits, and wherein output from a last of the at least one successive stage comprises the sequential source items.
2. The decoder unit of claim 1, wherein the adding, in parallel, pairs of encoded items in the first stage of adder circuits comprises adding, in parallel, each pair of consecutive decoded items to produce first output items.
3. The decoder unit of claim 1, wherein each of the sequential source items comprises a number of bytes of data to encode.
4. The decoder unit of claim 1, wherein there are n encoded items, wherein the first stage of adder circuits comprises n-1 adder circuits to add each consecutive pair of the encoded items to produce n outputs including n-1 added outputs, wherein the at least one successive stage of adder circuits comprises: at least one successive ith stage of adders following the first stage, where i=1 . . . j, and wherein each successive stage adds a pair of outputs from a previous stage 2^i hops away.
5. The decoder unit of claim 4, wherein each successive ith stage following the first stage has (n-2^i) adders.
6. The decoder unit of claim 4, wherein j=log_2(n).
7. The decoder unit of claim 4, wherein each stage outputs n items, wherein the first stage includes one passthrough circuit followed by the n-1 adder circuits, wherein the passthrough circuit passes a first input item to first stage output, also including output from the n-1 adder circuits, to a second stage of adder circuits, wherein each successive ith stage following the first stage has (n-2^i) adders following 2^i passthrough circuits to process output from a previous stage.
8. The decoder unit of claim 4, wherein the encoded items are provided by a decompressor unit in a cache memory controller of the cache memory from a compressed word in a compressed cache line in the cache memory cell array.
9. A decompressor unit implemented in a cache memory having a cache memory cell array, comprising: a decompressor to decompress a compressed chunk of encoded items stored in the cache memory cell array to produce an uncompressed encoded chunk of the encoded items, wherein the encoded items are formed using delta encoding and include a plurality of deltas of neighbors of sequential source items; and a decoder including: a first stage of adder circuits to add, in parallel, pairs of the encoded items; and at least one successive stage of adder circuits to add, in parallel in each stage of adder circuits, pairs of outputs multiple hops away from a previous stage of adder circuits, wherein each successive stage has fewer adder circuits than the previous stage of adder circuits to add fewer outputs than at the previous stage of adder circuits, and wherein output from a last of the at least one successive stage comprises the sequential source items.
10. The decompressor unit of claim 9, wherein the adding, in parallel, pairs of encoded items in the first stage of adder circuits comprises adding, in parallel, each pair of consecutive decoded items to produce first output items, wherein the adding, in parallel, pairs in each successive stage of adder circuits comprises adding items multiple hops away to produce fewer added outputs than at the previous stage of adder circuits.
11. The decompressor unit of claim 9, wherein there are n encoded items, wherein the first stage of adder circuits comprises n-1 adder circuits to add each consecutive pair of the encoded items to produce n outputs including n-1 added outputs, wherein the at least one successive stage of adder circuits comprises: at least one successive ith stage of adders following the first stage, where i=1 . . . j, and wherein each successive stage adds a pair of outputs from a previous stage 2^i hops away.
12. The decompressor unit of claim 11, wherein each successive ith stage following the first stage has (n-2^i) adders.
13. The decompressor unit of claim 11, wherein j=log_2(n).
14. The decompressor unit of claim 11, wherein each stage outputs n items, wherein the first stage includes one passthrough circuit followed by the n-1 adder circuits, wherein the passthrough circuit passes a first input item to first stage output, also including output from the n-1 adder circuits, to a second stage of adder circuits, wherein each successive ith stage following the first stage has (n-2^i) adders following 2^i passthrough circuits to process output from a previous stage.
15. A method for decoding data in a cache memory having a cache memory cell array, comprising: adding, by a first stage of adder circuits, in parallel, pairs of encoded items transformed using a delta encoding, wherein the encoded items include a plurality of deltas of neighbors of sequential source items, to produce output items; and adding, in each of at least one successive stage of adder circuits, in parallel in each stage of adder circuits, pairs of outputs multiple hops away from a previous stage of adder circuits, wherein each successive stage has fewer adder circuits than the previous stage of adder circuits to add fewer outputs than at the previous stage of adder circuits, and wherein output from a last of the at least one successive stage comprises the sequential source items.
16. The method of claim 15, wherein there are n encoded items, wherein the first stage of adder circuits comprises n-1 adder circuits to add each consecutive pair of the encoded items to produce n outputs including n-1 added outputs, wherein the at least one successive stage of adder circuits comprises: at least one successive ith stage of adders following the first stage, where i=1 . . . j, and wherein each successive stage adds a pair of outputs from a previous stage 2^i hops away.
17. The method of claim 16, wherein each successive ith stage following the first stage has (n-2^i) adders.
18. The method of claim 16, wherein j=log_2(n).
19. The method of claim 15, further comprising: decompressing, by a decompressor, a compressed chunk of encoded items stored in the cache memory cell array to produce an uncompressed encoded chunk of the encoded items, wherein the encoded items are formed using delta encoding and include a plurality of deltas of neighbors of sequential source items, and wherein the compressed chunk of encoded items is provided to the first stage of adder circuits to process.
20. The method of claim 15, wherein the adding, in parallel, pairs of encoded items in the first stage of adder circuits comprises adding, in parallel, each pair of consecutive decoded items to produce first output items.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(14) Described embodiments improve the latency of delta decoding technology to decode delta encoded items. Prior art delta decoding techniques require n sequential decoding operations to decode an n-item chunk. Described embodiments provide a delta decoder unit having multiple stages of adders working in parallel. While the total number of adders is increased, there are fewer stages of adders than the number of operations needed to sequentially decode the delta encoded items. For instance, the prior art requires n sequential decoding operations, one for each of the n items. Described embodiments provide a multi-stage decoder of adders to decode the n encoded items in log_2(n) stages of parallel operations, where the output of one stage is input to the next stage. Since log_2(n) operations is substantially less than n sequential operations, described embodiments substantially reduce the latency of delta decoding.
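This multi-stage decode follows the parallel prefix-sum (scan) pattern. The sketch below is an illustrative Python model, not the patented circuitry; the function names are invented, and stage i adds each item to the item 2^i positions before it while the first 2^i items pass through:

```python
import math

def delta_encode(source):
    # keep the first item as a base; every later item becomes the
    # difference from its left neighbor
    return [source[0]] + [source[k] - source[k - 1]
                          for k in range(1, len(source))]

def delta_decode_parallel(encoded):
    # log2(n) stages; in stage i the first 2^i items pass through and
    # every other item is added to the item 2^i positions before it,
    # so the last stage yields the running sums, i.e. the source items
    n = len(encoded)
    items = list(encoded)
    for i in range(int(math.log2(n))):
        hop = 1 << i
        items = [items[k] if k < hop else items[k] + items[k - hop]
                 for k in range(n)]
    return items
```

For n = 8 items this takes 3 parallel stages rather than 7 sequential additions, matching the log_2(n)-stage latency claim.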
(16) With respect to
(17) The cache memory 200 may comprise a high-speed data storage layer which stores a subset of data, typically transient in nature, so that future requests for that data are served faster than is possible by accessing the primary storage location of the data. The cache memory 200 may comprise a volatile or non-volatile memory device, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), or embedded DRAM (eDRAM). Other embodiments may utilize phase change memory (PCM), Magnetoresistive random-access memory (MRAM), Spin Transfer Torque (STT)-MRAM, ferroelectric random-access memory (FeRAM), nanowire-based non-volatile memory, Dual In-Line Memory Modules (DIMMs), NAND storage, e.g., flash memory, Solid State Drive (SSD) storage, non-volatile RAM, etc.
(18) The cache memory controller 204 may be implemented in circuitry in a semiconductor device, such as an Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA).
(20) In one embodiment, the data transformers 304.sub.1, 304.sub.2, 304.sub.3, 304.sub.4 each implement a combination of a delta, XOR, and bit plane transform, or DXB. The delta transform may operate on item-size units of bytes in the input word 302, either subtracting all units from a single base value or subtracting neighboring values from one another. The XOR transform may XOR the delta transformed bytes, and the bit plane operation may then perform a bit plane transform on the data. In certain embodiments, the difference among the DXB transformers 304.sub.i is that they each operate on different data unit sizes, e.g., 8, 16, 32, 64 bytes, of the input chunk 302. In alternative embodiments, different data transforms may be used, and a data transformer 304.sub.i may perform only one data transform or a different number and/or type of data transform than the DXB transform combination. The data transformation may involve delta and two-dimensional encoding. In the delta encoding, neighboring data items are subtracted from one another. In two-dimensional encoding, the encoding is performed both vertically (bit-plane) and horizontally (word-plane), and the dimension producing the greater number of zeroes is selected.
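The DXB combination can be illustrated with a small software model. The following Python sketch is hypothetical: the function name, the mod-256 neighbor subtraction, and the 8-byte chunk size are assumptions for illustration, not details taken from the patent. It shows how slowly varying data yields bit planes that are mostly zero:

```python
def dxb_transform(chunk):
    # toy DXB model for a chunk of 8 byte values (0..255); details assumed
    n = len(chunk)
    # delta: subtract each byte from its left neighbor (mod 256)
    delta = [chunk[0]] + [(chunk[k] - chunk[k - 1]) & 0xFF
                          for k in range(1, n)]
    # XOR: fold each delta byte with its left neighbor
    xored = [delta[0]] + [delta[k] ^ delta[k - 1] for k in range(1, n)]
    # bit plane: plane b gathers bit b of every byte (a bit transpose),
    # so planes of near-constant deltas come out all-zero
    return [sum(((xored[j] >> b) & 1) << j for j in range(n))
            for b in range(8)]
```

On a smoothly increasing chunk such as 100..107, half of the eight output planes are zero, which is what the downstream zero-value compressor exploits.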
(21) The data transformers 304.sub.1, 304.sub.2, 304.sub.3, 304.sub.4 implement arithmetic/logic operations to transform the data units of the input chunk 302 into a transformed chunk encoded with more zeroes than the input chunk 302. The different transform units 304.sub.1, 304.sub.2, 304.sub.3, 304.sub.4 may compete in a tournament, so that the transformed data resulting in the minimum number of non-zero bits is selected.
(22) Each population count unit 308.sub.1, 308.sub.2, 308.sub.3, 308.sub.4 comprises circuitry coupled to each data transformer 304.sub.1, 304.sub.2, 304.sub.3, 304.sub.4, and a population count (POPCOUNT) unit 310 is coupled to the passthrough unit 306. The population count units 308.sub.1, 308.sub.2, 308.sub.3, 308.sub.4 and 310 count the number of non-zero (NZ) bits in the output from the data transformers 304.sub.1, 304.sub.2, 304.sub.3, 304.sub.4 and the passthrough unit 306, respectively. The non-zero bit counts from the population count units 308.sub.1, 308.sub.2, 308.sub.3, 308.sub.4, and 310 are input to a population count (POPCOUNT) decision unit 312, comprising circuitry, that indicates the data transformer 304.sub.1, 304.sub.2, 304.sub.3, 304.sub.4 or passthrough unit 306 outputting the transformed chunk having the fewest number of non-zero bits. This indication of the data transformer 304.sub.i or passthrough unit 306 producing the fewest number of non-zero bits is provided as input to the select minimum output MUX 314, which outputs the transformed chunk 316, having the same byte size as the input word 302, but encoded with a greater number of zeros.
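The popcount tournament amounts to counting set bits per candidate and taking the minimum. A minimal sketch, with invented names (`nonzero_bits`, `select_min`) standing in for the POPCOUNT units 308/310 and the decision unit 312:

```python
def nonzero_bits(chunk):
    # POPCOUNT: total number of set bits across the chunk's bytes
    return sum(bin(byte).count("1") for byte in chunk)

def select_min(candidates):
    # tournament: index of the candidate chunk with the fewest set bits,
    # i.e. the transform (or passthrough) the MUX should select
    return min(range(len(candidates)),
               key=lambda i: nonzero_bits(candidates[i]))
```

In hardware the counts are produced in parallel on the same clock cycles; the software loop only models the selection criterion.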
(23) The zero-value compressor 318 compresses the transformed chunk 316, or input chunk 302, to produce the compressed chunk 322 and a non-zero mask (NZMASK) 320 having bits indicating the non-zero values in the transformed chunk 316. Packing logic 324 creates a compressed chunk package 400, as shown in
(26) The POPCOUNT decision unit 312 determines (at block 610) the data transformer 304.sub.i or the passthrough unit 306 that produces a transformed chunk or input data having the minimum number of non-zero bits and inputs to the MUX 314 an indication of the transformer 304.sub.i or passthrough unit 306 producing the minimum number of non-zero bits. The MUX 314 selects (at block 612) the transformed chunk 316 or input chunk 302 having the minimum number of non-zero bits to output as the transformed chunk 316 to the zero-value compressor 318. The zero-value compressor 318 removes (at block 614) zero-valued data units of a ZVC item size (e.g., 1 bit, 1 byte, 2 bytes, 4 bytes, 8 bytes, 16 bytes) to output the compressed chunk 322 and a non-zero bitmask (NZMASK) 320 indicating data units in the transformed chunk 316 having non-zero values.
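Zero-value compression with a non-zero bitmask can be modeled in software as follows. This is an illustrative sketch with invented function names, not the circuit of the zero-value compressor 318: all-zero units of the chosen item size are dropped, and the NZMASK records which units were kept:

```python
def zvc_compress(chunk, item_size):
    # drop all-zero units of item_size bytes; NZMASK bit idx is set
    # when unit idx contains a non-zero byte and was kept
    units = [chunk[k:k + item_size] for k in range(0, len(chunk), item_size)]
    nzmask, packed = 0, []
    for idx, unit in enumerate(units):
        if any(unit):
            nzmask |= 1 << idx
            packed.extend(unit)
    return packed, nzmask

def zvc_decompress(packed, nzmask, n_units, item_size):
    # reinsert a zero unit wherever the NZMASK bit is clear
    out, pos = [], 0
    for idx in range(n_units):
        if (nzmask >> idx) & 1:
            out.extend(packed[pos:pos + item_size])
            pos += item_size
        else:
            out.extend([0] * item_size)
    return out
```

The round trip is lossless: the mask plus the packed non-zero units fully determine the original chunk.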
(27) Packing logic 324 generates (at block 616) a header 500 for the compressed chunk 322 indicating whether bit plane encoding was used 502 in the data transform, data transform item size 504 of data units in the input chunk 302 transformed, and the zero-value compression item size 506. The ZVC item size 506 may be provided for embodiments using an ensemble of zero-value compressors as described with respect to
(28) Described embodiments optimize zero-value compression by having an ensemble of data transformers encode input chunks or words of a cache line with mostly zeroes to produce transformed chunks, and then selecting a transformed chunk having a minimum number of non-zero values to optimize the zero-value compressor 318, which compresses by removing zero values. Further, the data transformers 304.sub.i transform the input chunk 302 in parallel on the same clock cycles, so that the output having the minimum number of non-zero values, i.e., the most zero values, is selected from the transformed data of all the different transforms. In this way, the best transformed data to compress, which may change for different input chunks or words, is selected on the same clock cycle.
(30) A ZVC selector 706 selects a compressed chunk 708 outputted by one of the ZVCs 702.sub.i that has a minimum number of non-zero bits and outputs a non-zero mask (NZMASK) 710 indicating data units of the ZVC item size having non-zero values, and a ZVC item size 712 of the data units subject to zero-value compression. For instance, if the input chunk 302 is 32 bytes and the ZVC item size is 16 bytes, then the NZMASK 710 has two bits, one for each of the 16 byte data units in the compressed chunk 708. The non-zero bitmask 710, compressed chunk 708, and ZVC item size 712 may be sent to the packing logic 324 in
(32) With the embodiment of
(34) In certain instances, the item size 504 selected for the data transformer method 304.sub.i may be different than the item size 506 selected for the zero-value compressor.
(36) The decoders 1008.sub.1 . . . 1008.sub.4 include circuitry to perform different decoding operations to decode for different encoding techniques used to encode the data with more zeroes. The decoders 1008.sub.i decode the encoded chunk 1004 to transform the chunk 1004 back to the original source data. The data units of the encoded chunk 1004 may comprise a bit or one or more bytes. The different data decoders 1008.sub.i may implement different decoding methods or comprise the same decoding method but with different parameters and tuning. For instance, the different decoders 1008.sub.i may use the same decoding method but process different item size data units of the encoded chunk 1004. For instance, as shown in
(37) In one embodiment, the decoders 1008.sub.i each implement a combination of a delta decoder 1014.sub.i, XOR decoder, and bit plane transform decoder, or DXB decoder, which would decode in the order of bit plane transform, XOR, and delta because, in the embodiment of
(38) The decoders 1008.sub.1, 1008.sub.2, 1008.sub.3, 1008.sub.4 implement arithmetic/logic operations to decode the encoded chunk 1004 to the original chunk 1012.
(41) The circuit structure can be applied to words having any number of units that is a power of two, such as 4, 8, 16, 32, or 64 bytes. If the number of bytes or data units is n, then there are log_2(n) stages. Further, at each stage i, for i=0 . . . (log_2(n)-1), there are 2^i passthrough units followed by (n-2^i) adders. The first 2^i inputs to the ith stage are forwarded to the passthrough units and to the first 2^i adders, to be combined with inputs (2^i+1) through (n-1), respectively.
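Under the stage structure just described, stage i has 2^i passthrough units and n-2^i adders. A short illustrative sketch (the function name is invented) tabulates these per-stage resource counts:

```python
import math

def stage_structure(n):
    # (passthrough units, adders) for each of the log2(n) stages of an
    # n-item decoder: stage i has 2^i passthroughs and n - 2^i adders
    return [(1 << i, n - (1 << i)) for i in range(int(math.log2(n)))]
```

For an 8-unit word this gives stages of (1, 7), (2, 6), and (4, 4), i.e. 17 adders total across 3 parallel stages in place of a chain of sequential additions.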
(43) With the above embodiments, the decoding of delta encoded data is optimized by performing log_2(n) stages of operations in parallel, reducing the latency over delta decoding techniques that decode data sequentially for each item, requiring n sequential operations. In this way, described embodiments optimize and reduce the latency of delta decoding operations using multiple stages of adders to perform stages of decoding operations in parallel, on substantially fewer clock cycles than needed to sequentially decode each delta encoded item.
(44) The letter designators, such as i, n, among others, are used to designate an instance of an element, i.e., a given element, or a variable number of instances of that element when used with the same or different elements.
(45) The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean one or more (but not all) embodiments of the present invention(s) unless expressly specified otherwise.
(46) The terms "including", "comprising", "having" and variations thereof mean "including but not limited to", unless expressly specified otherwise.
(47) The enumerated listing of items does not imply that any or all of the items are mutually exclusive, unless expressly specified otherwise.
(48) The terms "a", "an" and "the" mean "one or more", unless expressly specified otherwise.
(49) Devices that are in communication with each other need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices that are in communication with each other may communicate directly or indirectly through one or more intermediaries.
(50) A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the present invention.
(51) When a single device or article is described herein, it will be readily apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be readily apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the present invention need not include the device itself.
(52) The foregoing description of various embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims herein after appended.