Efficient implementation of a threshold modified min-sum algorithm for low-density parity-check decoders

11309915 · 2022-04-19

Abstract

A hardware-efficient implementation of a threshold modified attenuated min-sum algorithm (“TAMSA”) and a threshold modified offset min-sum algorithm (“TOMSA”) that improve the performance of a low-density parity-check (“LDPC”) decoder by reducing the bit error rate (“BER”) compared to the conventional attenuated min-sum algorithm (“AMSA”), offset min-sum algorithm (“OMSA”), and min-sum algorithm (“MSA”). Embodiments of the present invention preferably use circuit optimization techniques, including a parallel computing structure and lookup tables, and a field-programmable gate array (“FPGA”) or application-specific integrated circuit (“ASIC”) implementation.

Claims

1. A method for implementing a threshold modified min-sum algorithm for a low-density parity check (“LDPC”) decoder comprising: quantization of received channel values; obtaining parallel input data; storing the parallel input data; based on data contained in a parity check matrix, a decoding controller causing an address generator to generate addresses to access stored data; passing data to a check node unit (“CNU”); the CNU calculating a minimum value of the data passed to the CNU; applying an offset value and/or attenuation value to the calculated minimum value if a magnitude of the calculated minimum value comprises a magnitude less than a threshold value and not applying the offset value and/or attenuation value to the magnitude of the calculated minimum value if the magnitude of the calculated minimum value is greater than the threshold value; storing the calculated minimum value; calculating iterations of variable node log-likelihood ratios (“LLRs”) and storing the calculated iterations of LLRs; after each iteration, making a hard decision based on a sign of the calculated iteration of the variable node LLR to determine whether a codeword is valid; and when the hard decision determines that the codeword is valid, passing final data to an output.

2. The method of claim 1 wherein obtaining parallel input data comprises the CNU converting serial input into parallel data and the CNU processing the parallel data in a parallel architecture.

3. The method of claim 1 wherein the CNU calculating a minimum value of the data passed to the CNU comprises the CNU calculating minimum and sub-minimum values of the data passed to the CNU and wherein storing the calculated minimum value comprises storing the calculated minimum and sub-minimum values.

4. The method of claim 3 wherein a circuit used to calculate a sub-minimum value comprises one less data input than a circuit used to calculate a minimum value.

5. The method of claim 3 wherein when a variable node message is equal to a minimum value, the calculated sub-minimum value is assigned as the minimum value for calculations in the CNU.

6. The method of claim 1 wherein when a variable node message is not equal to a minimum value, the calculated minimum value is assigned as the minimum value for calculations in the CNU.

7. The method of claim 1 wherein quantized LLR values are assigned according to a lookup table and all decoder operations are performed on corresponding binary value strings.

8. The method of claim 1 wherein the CNU converts LLRs from previous iterations into parallel data and wherein the CNU converts minimum values from previous iterations into parallel data and passes LLRs from previous iterations and minimum values from previous iterations to a plurality of full subtractor modules and wherein parallel outputs of the LLRs are also passed to a plurality of full adder modules.

9. The method of claim 8 wherein the plurality of full adder modules adds the LLRs with data from a lookup table.

10. The method of claim 8 wherein sign and magnitude values to be sent to nodes are calculated separately from one another.

11. The method of claim 8 wherein the plurality of full subtractor modules is based on data contained in a parity-check matrix.

12. The method of claim 8 wherein the plurality of full adder modules is based on data contained in a parity check matrix.

13. The method of claim 1 wherein signs of all variable nodes connected to a check node are multiplied together.

14. The method of claim 1 wherein a sign of an outgoing message to each variable node is computed by multiplying with a sign of a corresponding variable node.

15. The method of claim 1 wherein when the hard decision determines that the codeword is not valid, a current number of iterations is compared to a predetermined maximum iteration number and if the current number of iterations is smaller than the predetermined maximum iteration number, the method continues with another decoding iteration.

16. The method of claim 1 wherein when the hard decision determines that the codeword is not valid, a current number of iterations is compared to a predetermined maximum iteration number and if the current number of iterations is equal to the predetermined maximum iteration number, a final value is output.

17. The method of claim 1 wherein storing the calculated iterations of LLRs during iterations comprises storing the calculated iterations of LLRs to a random access memory.

18. The method of claim 3 wherein storing the calculated minimum value comprises storing the minimum and sub-minimum values in a check node random access memory.

19. The method of claim 1 further comprising providing a single lookup table which includes both threshold check values and attenuation and/or offset values.

20. A method for implementing a threshold modified min-sum algorithm for a low-density parity check (“LDPC”) decoder comprising: a check node unit calculating minimum and sub-minimum values of data, including converting log-likelihood ratios (“LLRs”) into serial data and converting minimum data into serial data; the check node unit subtracting the minimum data from the LLRs via a plurality of full subtractor modules; storing calculated minimum and sub-minimum values; applying an offset value and/or attenuation value to the calculated minimum value if a magnitude of the calculated minimum value comprises a magnitude less than a threshold value and not applying the offset value and/or attenuation value to the magnitude of the calculated minimum value if the magnitude of the calculated minimum value is greater than the threshold value and after each iteration, making a hard decision based on a sign of the calculated iteration of the variable node log-likelihood ratio to determine whether a codeword is valid.

21. The method of claim 20 wherein applying an offset value to the calculated minimum value comprises applying an offset value that is stored in a lookup table to reduce log-likelihood ratio (“LLR”) values by multiplication and/or subtraction in a quantized LDPC decoder.

22. The method of claim 21 wherein log-likelihood ratio values are not consistently reduced by the same magnitude for all message values.

23. The method of claim 21 wherein the lookup table is determined by a value of the multiplication and/or subtraction and the threshold.

24. The method of claim 21 wherein the lookup table comprises data including minimum and sub-minimum data.

25. The method of claim 21 wherein the lookup table comprises both threshold values and attenuation and/or offset values.

26. The method of claim 20 wherein the method attenuates and/or offsets values that are believed to be unreliable but does not attenuate and/or offset values that are believed to be reliable.

Description

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

(1) The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:

(2) FIG. 1 is a drawing which illustrates a Tanner graph of an LDPC code;

(3) FIG. 2 is a drawing which illustrates simulation results comparing the decoding performance of quantized MSA, AMSA, TAMSA, and layered TAMSA for a (155, 64) QC Tanner LDPC code;

(4) FIG. 3 is a drawing which illustrates a system diagram of an embodiment of the present invention;

(5) FIG. 4 is a drawing which illustrates CNU Architecture of an embodiment of the present invention; and

(6) FIG. 5 is a drawing which illustrates circuit architecture for minimum computation.

DETAILED DESCRIPTION OF THE INVENTION

(7) Referring now to the figures, message-passing (“MP”) decoding of LDPC codes operates by iteratively exchanging messages between the variable nodes (white circles) and check nodes (plus boxes) of the Tanner graph of an LDPC code (see FIG. 1). At the k.sup.th iteration, let V.sub.ij.sup.k denote the LLR value passed from variable node v.sub.i to check node c.sub.j and let C.sub.ji.sup.k denote the LLR value passed from check node c.sub.j to variable node v.sub.i. The set of check nodes connected to v.sub.i is represented by N(v.sub.i) and the set of variable nodes connected to c.sub.j is represented by N(c.sub.j). Assume that codeword u=(u.sub.1, u.sub.2, . . . , u.sub.n) is transmitted over an AWGN channel under binary phase shift keyed (“BPSK”) modulation, where each zero is mapped to +1 and each one is mapped to −1. Let p.sub.0 represent the probability that a 0 is received from the channel and let p.sub.1 represent the probability that a 1 is received from the channel. Let

(8)
$$r_i = \ln\!\left(\frac{p_0}{p_1}\right)$$
denote the LLR value received from the channel for bit i. The MSA is initialized in iteration 0 by passing the received value r.sub.i from each variable node v.sub.i to the check nodes in N(v.sub.i) as
$$V_{ij}^0 = r_i. \qquad \text{(Equation 1)}$$
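By way of illustration only, the following sketch computes the received LLRs and the iteration-0 messages, under the assumption of BPSK transmission over an AWGN channel with known noise variance σ.sup.2, for which r.sub.i=ln(p.sub.0/p.sub.1) evaluates to 2y.sub.i/σ.sup.2; all names are illustrative and not part of the claimed implementation.

```python
import numpy as np

def channel_llrs(y, sigma):
    # For BPSK (0 -> +1, 1 -> -1) over AWGN with noise variance sigma^2,
    # r_i = ln(p0 / p1) evaluates to 2 * y_i / sigma^2.
    return 2.0 * np.asarray(y, dtype=float) / sigma**2

def initialize_messages(r, neighbors):
    # Equation 1: every edge (v_i, c_j) carries V_ij^0 = r_i at iteration 0.
    # neighbors[j] lists the variable nodes in N(c_j).
    return {(j, i): r[i] for j, idx in enumerate(neighbors) for i in idx}
```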

(9) Following initialization, the outgoing message C.sub.ji.sup.k from check node c.sub.j to variable node v.sub.i at iteration k is given by

(10)
$$C_{ji}^k = \left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \min_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right|, \qquad \text{(Equation 2)}$$
where N(c.sub.j)\i denotes the set of all variable nodes connected to check node c.sub.j except v.sub.i. For iteration k>0, the outgoing message V.sub.ij.sup.k from variable node v.sub.i to check node c.sub.j is given by

(11)
$$V_{ij}^k = r_i + \sum_{j' \in N(v_i)\setminus j} C_{j'i}^k, \qquad \text{(Equation 3)}$$
where N(v.sub.i)\j denotes the set of all check nodes connected to variable node v.sub.i except c.sub.j. After all check nodes and all variable nodes are updated, the hard decision estimate is computed as

(12)
$$\hat{u}_i^k = \begin{cases} 0, & r_i + \sum_{j \in N(v_i)} C_{ji}^k > 0, \\[4pt] 1, & r_i + \sum_{j \in N(v_i)} C_{ji}^k \le 0. \end{cases} \qquad \text{(Equation 4)}$$

(13) If the hard decision û is a codeword, decoding preferably stops; otherwise, the decoder starts the next iteration until some pre-specified maximum number of decoder iterations I.sub.max is reached. To reduce the BER performance loss of MSA compared to the sum-product algorithm (“SPA”), the attenuated (or normalized) MSA (“AMSA”) has been proposed. AMSA operates as MSA, but with equation 2 replaced by

(14)
$$C_{ji}^k = \alpha\left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \min_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right|, \qquad \text{(Equation 5)}$$
and OMSA operates as MSA, but with equation 2 replaced by

(15)
$$C_{ji}^k = \left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \max\!\left\{\min_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right| - \beta,\; 0\right\}, \qquad \text{(Equation 6)}$$
respectively, where α>0 and β>0 are constants. AMSA and OMSA reduce the negative effect of overestimating the LLR magnitudes in MSA and improve performance in the low SNR region; however, neither necessarily achieves good performance in the high SNR region. Threshold AMSA (“TAMSA”) and threshold OMSA (“TOMSA”) are known to improve performance in the high SNR region compared to AMSA and MSA. These threshold algorithms are based on the assumption that small problematic graphical objects, called trapping sets, are the major cause of the performance loss at high SNR. TAMSA operates as MSA, but with equation 2 replaced by

(16)
$$C_{ji}^k = \begin{cases} \left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \min\limits_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right|, & \text{if } \min\limits_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right| \ge \tau, \\[6pt] \alpha\left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \min\limits_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right|, & \text{otherwise}, \end{cases} \qquad \text{(Equation 7)}$$
and TOMSA operates as MSA, but with equation 2 replaced by

(17)
$$C_{ji}^k = \begin{cases} \left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \min\limits_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right|, & \text{if } \min\limits_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right| \ge \tau, \\[6pt] \left(\prod_{i' \in N(c_j)\setminus i} \operatorname{sign}\!\left(V_{i'j}^k\right)\right) \cdot \max\!\left\{\min\limits_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right| - \beta,\; 0\right\}, & \text{otherwise}. \end{cases} \qquad \text{(Equation 8)}$$
Compared to AMSA (equation 5) and OMSA (equation 6), TAMSA and TOMSA reduce the magnitudes of the check node LLRs only locally, via a simple threshold test, which improves performance with a negligible increase in complexity.
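By way of illustration only, the check node updates of equations 2 and 5 through 8 can be expressed in software as in the following sketch. This is a minimal floating-point model, not the claimed fixed-point hardware; the function name, variant labels, and default parameters (α=0.8, β=0.15, τ=1.425, matching the example values used below) are illustrative assumptions.

```python
import numpy as np

def check_node_update(V_col, variant="TAMSA", alpha=0.8, beta=0.15, tau=1.425):
    """Messages C_ji from one check node c_j to each attached variable node.

    V_col holds the incoming messages V_ij for all i in N(c_j). Implements
    Equation 2 (MSA), 5 (AMSA), 6 (OMSA), 7 (TAMSA), or 8 (TOMSA).
    """
    V_col = np.asarray(V_col, dtype=float)
    signs = np.where(V_col >= 0, 1.0, -1.0)
    total_sign = np.prod(signs)
    mags = np.abs(V_col)
    C = np.empty_like(V_col)
    for i in range(len(V_col)):
        m = np.min(np.delete(mags, i))   # min over N(c_j)\i
        s = total_sign * signs[i]        # sign product excluding i (signs are +/-1)
        if variant == "MSA" or (variant in ("TAMSA", "TOMSA") and m >= tau):
            C[i] = s * m                 # Equation 2, and the above-threshold branch of 7 and 8
        elif variant in ("AMSA", "TAMSA"):
            C[i] = s * alpha * m         # Equation 5, and the otherwise-branch of Equation 7
        else:
            C[i] = s * max(m - beta, 0.0)  # Equation 6, and the otherwise-branch of Equation 8
    return C
```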

(18) A layered version of TAMSA and TOMSA, with modified update rules, can also be provided. The algorithm is initialized by equation 1; the outgoing message V.sub.ij.sup.k at iteration k>0 is then computed as
$$V_{ij}^k = V_i^{k-1} - C_{ji}^{k-1}, \qquad \text{(Equation 9)}$$
where C.sub.ji.sup.0=0, and the outgoing message for some subset of check nodes is computed following equation 7 (TAMSA) or equation 8 (TOMSA). The choice of subsets will vary depending on the code and the desired parallelization. Message V.sub.ij.sup.k is updated again for the variable nodes connected to the selected subset of check nodes as
$$V_{ij}^k = V_{ij}^k + C_{ji}^k. \qquad \text{(Equation 10)}$$

(19) The decoder preferably repeats equations 7 (or 8) and 10 until all check nodes and variable nodes are updated. Finally, the hard decision estimate (equation 4) is replaced by

(20)
$$\hat{u}_i^k = \begin{cases} 0, & V_i^k > 0, \\ 1, & V_i^k \le 0. \end{cases} \qquad \text{(Equation 11)}$$
If the hard decision û is a codeword, decoding stops; otherwise, the decoder starts the next iteration from equation 9 until a pre-determined maximum number of decoder iterations I.sub.max is reached.
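A corresponding behavioral sketch of layered decoding (equations 9, 10, and 11) is given below. It reuses the check_node_update sketch above; is_codeword stands in for an assumed syndrome-check helper, and the grouping of check_subsets is code-dependent, as noted above.

```python
def layered_decode(r, neighbors, check_subsets, max_iter=100, **kw):
    # r: channel LLRs; neighbors[j]: variable nodes in N(c_j);
    # check_subsets: groups of check nodes updated together.
    V = list(r)                          # total LLR V_i, initialized per Equation 1
    C = {(j, i): 0.0                     # C_ji^0 = 0
         for j in range(len(neighbors)) for i in neighbors[j]}
    u_hat = [0 if v > 0 else 1 for v in V]
    for it in range(max_iter):
        for subset in check_subsets:
            for j in subset:
                idx = list(neighbors[j])
                Vij = [V[i] - C[(j, i)] for i in idx]   # Equation 9
                Cji = check_node_update(Vij, **kw)      # Equation 7 (or 8)
                for n, i in enumerate(idx):
                    C[(j, i)] = Cji[n]
                    V[i] = Vij[n] + Cji[n]              # Equation 10
        u_hat = [0 if v > 0 else 1 for v in V]          # Equation 11
        if is_codeword(u_hat):           # assumed syndrome-check helper
            return u_hat, it + 1
    return u_hat, max_iter
```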

III. Finite Precision Representation of LLRs

(21) Practical hardware implementation of LDPC decoders can rely on finite precision representation of LLRs. Clipping and quantization affect the behavior of the MSA. Moreover, computer simulation or a quantized density evolution (“DE”) algorithm can be used to find the optimal attenuation or offset parameter α or β from equations 5 and 6 for quantized AMSA and OMSA.

(22) In one embodiment, a 5-bit quantizer can be used for the LLR values, where a 4-bit LUT maps the magnitude of the LLRs (and the LLRs after attenuation) and one extra bit carries the sign of the LLRs. Table I illustrates the LUT that can be used to convert received floating-point LLRs to quantized LLRs, where the LLRs are represented as ranges. This mapping is preferably done once in order to quantize r.sub.i as a 4-bit string with a 1-bit sign. After this, all operations in equations 2-11 are preferably performed with (4+1)-bit strings. Attenuation (multiplication by α in equation 5 or 7) and offset (subtraction of β in equation 6 or 8) are preferably not computed in real time; rather, they are preferably computed in advance for each range of LLRs, for a given α and β, with a resulting LUT for the new mapping. The LUT for attenuation of the mapping in Table I with α=0.8 is illustrated in Table II. The LUT for offset of the mapping in Table I with β=0.15 is illustrated in Table III. Threshold attenuation can optionally be achieved by modifying Table II. For example, for τ=1.425 and α=0.8, quantized LLRs smaller than 1010 will be attenuated according to equation 7. In this case, the TAMSA LUT will be the same as Table II for LLRs 0000 to 1001, but LLRs 1010 to 1111 will not be attenuated. Similarly, the TOMSA LUT for τ=1.425 and β=0.15 will be the same as Table III for LLRs 0000 to 1001, but LLRs 1010 to 1111 will not be offset.

(23) TABLE I. FLOATING-POINT LLRs TO 4-BIT STRINGS

  Received LLR     Map
  [0, 0.075)       0000
  [0.075, 0.225)   0001
  [0.225, 0.375)   0010
  [0.375, 0.525)   0011
  [0.525, 0.675)   0100
  [0.675, 0.825)   0101
  [0.825, 0.975)   0110
  [0.975, 1.125)   0111
  [1.125, 1.275)   1000
  [1.275, 1.425)   1001
  [1.425, 1.575)   1010
  [1.575, 1.725)   1011
  [1.725, 1.875)   1100
  [1.875, 2.025)   1101
  [2.025, 2.175)   1110
  [2.175, ∞)       1111

(24) TABLE II. ATTENUATED 4-BIT STRINGS FOR α = 0.8

  LLR    Attenuated LLR
  0000   0000
  0001   0001
  0010   0010
  0011   0010
  0100   0011
  0101   0100
  0110   0101
  0111   0110
  1000   0110
  1001   0111
  1010   1000
  1011   1001
  1100   1010
  1101   1010
  1110   1011
  1111   1100

(25) TABLE III. OFFSET 4-BIT STRINGS FOR β = 0.15

  LLR    Offset LLR
  0000   0000
  0001   0000
  0010   0001
  0011   0010
  0100   0011
  0101   0100
  0110   0101
  0111   0110
  1000   0111
  1001   1000
  1010   1001
  1011   1010
  1100   1011
  1101   1100
  1110   1101
  1111   1110
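The LUTs of Tables I, II, and III can be generated programmatically, as in the following sketch. It assumes that each 4-bit level k≥1 represents the midpoint 0.15k of its bin in Table I (level 0 uses 0.0375); under this assumption the sketch reproduces Tables II and III exactly, and the last two lines fold the τ=1.425 threshold test into a single LUT as described above.

```python
STEP, LEVELS = 0.15, 16

def quantize(x):
    # Table I: [0, 0.075) -> 0000, then bins of width 0.15, saturating at 1111.
    if x < STEP / 2:
        return 0
    return min(int((x - STEP / 2) // STEP) + 1, LEVELS - 1)

def rep(k):
    # Assumed representative value of level k: the midpoint of its Table I bin.
    return STEP * k if k >= 1 else STEP / 4

att_lut = [quantize(0.8 * rep(k)) for k in range(LEVELS)]             # Table II
off_lut = [quantize(max(rep(k) - 0.15, 0.0)) for k in range(LEVELS)]  # Table III

# Folding in the threshold test (tau = 1.425, the lower edge of level 1010):
# levels 1010-1111 pass through unmodified, so one LUT realizes Equation 7 or 8.
tamsa_lut = [att_lut[k] if k < 10 else k for k in range(LEVELS)]
tomsa_lut = [off_lut[k] if k < 10 else k for k in range(LEVELS)]
```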

(26) The LUT approach detailed above is the preferred method for combining attenuation and/or offset with the threshold check (equation 7 or 8) in a single LUT; however, the same result can be obtained by conventional computation of the minimum value in equation 2, followed by a comparison with the threshold value, and then applying or not applying attenuation/offset depending on the result of the threshold check. Attenuation and/or offset, when necessary, can be achieved in the conventional way by a LUT or by alternative circuitry that performs the quantized computation.

(27) FIG. 2 illustrates simulation results for quantized MSA, AMSA, TAMSA, and layered TAMSA with an attenuation factor α=0.8 for all attenuated algorithms and a threshold τ=1.425 for the threshold algorithms (using the LUTs described above). In generating the data for this example, all algorithms were allowed a maximum of 100 iterations. Both TAMSA and layered TAMSA result in significantly improved performance over AMSA and MSA, with the best performance coming from layered TAMSA, which offers close to a 0.4 dB gain over AMSA and MSA at a BER of 10.sup.−9.

(28) Another important metric related to decoder power consumption is the average number of iterations (“ANI”) performed for each algorithm. The results are summarized in Table IV. Both AMSA and TAMSA provided a significant reduction in the average number of iterations when compared to MSA at low SNR, with similar numbers elsewhere.

IV. System Design Considerations

(29) As an example, layered TAMSA can be implemented for the (155, 64) QC Tanner code. Using this particular LDPC code, the corresponding decoder hardware will now be described.

(30) LDPC codes. The parity-check matrix of the (155, 64) QC Tanner code is given by

(31)
$$H = \begin{bmatrix} I_1 & I_2 & I_4 & I_8 & I_{16} \\ I_5 & I_{10} & I_{20} & I_9 & I_{18} \\ I_{25} & I_{19} & I_7 & I_{14} & I_{28} \end{bmatrix}, \qquad \text{(Equation 12)}$$
where I.sub.x is a 31×31 identity matrix with rows shifted cyclically to the left by x positions. According to this specific QC structure, a full-parallel architecture can be used to implement layered MSA, layered AMSA, and layered TAMSA to speed up the decoding process. Specifically, 31 check node unit (“CNU”) modules can optionally be used in the LDPC decoder.
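For reference, the full 93×155 binary parity-check matrix of equation 12 can be expanded from its circulant shifts as in the following sketch; the helper name is illustrative.

```python
import numpy as np

def circulant(shift, size=31):
    # Identity matrix with rows cyclically shifted left by `shift` positions.
    return np.roll(np.eye(size, dtype=np.uint8), -shift, axis=1)

SHIFTS = [[1, 2, 4, 8, 16],      # block rows of Equation 12
          [5, 10, 20, 9, 18],
          [25, 19, 7, 14, 28]]
H = np.block([[circulant(s) for s in row] for row in SHIFTS])
assert H.shape == (93, 155)      # 3 x 5 blocks of 31 x 31 circulants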

(32) At each iteration, message V.sub.ij.sup.k is preferably computed by equation 9. C.sub.ji.sup.k can then be computed using equation 2, 5, 6, 7, or 8, as appropriate, for the first 31 rows in parallel (j=1, 2, . . . , 31), after which all connected variable node LLRs can be updated using V.sub.ij.sup.k by equation 10. This is preferably repeated for the next 31 rows (j=32, 33, . . . , 62), and then the final 31 rows (j=63, 64, . . . , 93) of the parity-check matrix of equation 12. After these three batches of parallel computation, one iteration is complete, the iteration number increases by 1, and the sign of the LLRs is calculated for the hard decision according to equation 11. The decoder stops either when the hard decisions give a valid codeword or when the iteration number reaches a preset maximum iteration number I.sub.max.
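In the layered decoding sketch given earlier, these three batches can be expressed (zero-indexed) as the following illustrative subsets:

```python
# Rows 0-30, 31-61, and 62-92 of H: one batch per block row of Equation 12,
# each batch processed in parallel by the 31 CNU modules.
check_subsets = [range(0, 31), range(31, 62), range(62, 93)]
```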

(33) System Design. The decoder system preferably includes several building blocks, as illustrated in FIG. 3, where the black arrows represent data flow and the white arrows represent control flow. The input serial data V.sub.ij.sup.0 is preferably first converted into parallel data by a serial-in parallel-out (“SIPO”) unit. The data is then preferably stored in random access memory (“RAM”). Because this data contains variable node values, we refer to it as variable node RAM (“VRAM”). The VRAM also preferably stores the temporary variable node LLRs V during the decoding process. The decoder controller preferably controls the decoding process, and the values of check nodes and variable nodes are updated according to the status of the decoder controller. First, according to the parity check matrix of the (155, 64) QC Tanner code in equation 12, the decoder controller preferably asks the address generator to generate several addresses to access the data V.sub.ij.sup.k stored in VRAM. This data is then sent to the CNU, where the minimum values and sub-minimum (second minimum) values are calculated for use in equations 2, 5, 6, 7, or 8. The decoder controller then preferably asks the address generator to generate addresses to store the minimum and sub-minimum values, most preferably into check node random access memory (“CRAM”). Meanwhile, V.sub.ij.sup.k is computed according to equation 10 and stored back into the VRAM. After a decoding iteration is complete, the decoder controller preferably asks the hard decision block to make a hard decision according to the sign of the LLRs in the VRAM and to decide whether the result is a valid codeword (which means the decoding is successful). If it is successful, the final data is preferably sent to the output of the decoder. If not, the decoder controller preferably compares the current number of iterations with a predetermined maximum iteration number I.sub.max. If the number of iterations is smaller than I.sub.max, the decoder preferably starts the next decoding iteration, computing V.sub.ij.sup.k using equation 9 and updating the VRAM; otherwise, the decoder controller preferably finishes the decoding process and outputs the result from the VRAM.

(34) TABLE IV. AVERAGE NUMBER OF ITERATIONS RECORDED FOR THE (155, 64) QC TANNER CODE

  E.sub.b/N.sub.0   MSA [17]   AMSA [17]   TAMSA [17]
  1 dB              68.95      59.28       59.24
  2 dB              30.4       23.13       22.9
  3 dB              7.82       6.28        6.2
  4 dB              3.06       2.95        2.87
  5 dB              1.97       1.98        1.98
  6 dB              1.44       1.46        1.46
  7 dB              1.09       1.10        1.10
  8 dB              0.85       0.86        0.86

(35) To implement the CNU (FIG. 4), the full-parallel structure previously described is preferably used. The LLRs and minimum values from the previous iteration are preferably sent to the CNU serially, where two SIPO units are applied. In one embodiment, five full-subtractor modules are used to implement equation 9 and five full-adder modules are used to implement equation 10. Five of each module are used in this case because there are five block columns in the H matrix of equation 12. Embodiments of the present invention will, in general, preferably use |N(c_j)| full-subtractor modules to implement equation 9 and |N(c_j)| full-adder modules to implement equation 10, such that the number of each module is preferably equal to the number of ones in a given row j of the parity-check matrix. The sign and the magnitude values to be sent to each variable node are preferably calculated separately. First, the signs of all variable nodes connected to the check node are multiplied together to form Π.sub.iϵN(c.sub.j.sub.) sign(V.sub.ij). The sign of the outgoing message to each variable node is preferably computed by multiplying Π.sub.iϵN(c.sub.j.sub.) sign(V.sub.ij) with the sign of the corresponding variable node. Second, the minimum value

(36)
$$\min_{i' \in N(c_j)\setminus i} \left|V_{i'j}^k\right|$$
is now preferably computed. Embodiments of the present invention preferably use a parallel circuit design to determine both the minimum and sub-minimum values of the incoming LLRs in order to efficiently select this minimum value for each of the connected variable nodes, thus providing a very fast CNU module. The preferred architecture used to determine the minimum is illustrated in FIG. 5. In this embodiment, there are five 4-bit inputs corresponding to the five incoming quantized LLRs (Data i bit 1, Data i bit 2, . . . , Data i bit 4, for i=0, 1, . . . , 4) and one 4-bit output (min bit 1, min bit 2, . . . , min bit 4). The circuit used to determine the sub-minimum value is preferably similar, except that it has four 4-bit data inputs because the previously found minimum value is preferably not used. M.sub.1,j represents the minimum value and M.sub.2,j represents the sub-minimum value. Finally, the CNU preferably compares each variable node value with M.sub.1,j to determine which minimum to use: if the variable node message V.sub.ij.sup.k equals M.sub.1,j, then M.sub.2,j is preferably assigned as the minimum value in equations 2, 5, 6, 7, and 8; otherwise, M.sub.1,j is preferably used as the minimum value. Using this method avoids multiple calculations to update each check node. Layered AMSA/OMSA and layered TAMSA/TOMSA require an additional LUT for attenuation/offset compared to layered MSA; however, the hardware costs are preferably the same in each case. Although this discussion has focused on the implementation of the (155, 64) QC Tanner code, the above architecture also generalizes to other QC LDPC codes.
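By way of illustration only, a behavioral sketch of these two CNU steps follows. In sign-magnitude representation the sign product reduces to a parity (XOR) of sign bits, while the magnitude selection is the M.sub.1,j/M.sub.2,j comparison just described; the hardware realizes the minimum with the parallel comparator structure of FIG. 5, and the Python below models only the input/output behavior.

```python
from functools import reduce

def cnu_signs(sign_bits):
    # sign_bits[i] is 1 if V_ij is negative, 0 otherwise. The product of
    # +/-1 signs is the XOR (parity) of the sign bits; excluding the local
    # sign is one more XOR, since each sign is its own inverse.
    total = reduce(lambda a, b: a ^ b, sign_bits, 0)
    return [total ^ s for s in sign_bits]

def cnu_magnitudes(mags):
    # M1 = overall minimum; M2 = sub-minimum over the remaining inputs.
    # An input equal to M1 receives M2; every other input receives M1,
    # yielding min over N(c_j)\i without one minimum computation per output.
    pos = min(range(len(mags)), key=lambda i: mags[i])
    M1 = mags[pos]
    M2 = min(m for i, m in enumerate(mags) if i != pos)
    return [M2 if i == pos else M1 for i in range(len(mags))]
```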

(37) TABLE V. COMPARISON OF HARDWARE RESOURCES

                        Layered MSA    Layered AMSA   Layered TAMSA
  LUT                   14.9k          14.9k          14.9k
  Flip-flop (“FF”)      10.4k          10.4k          10.4k
  Block RAM (“BRAM”)    13.50          13.50          13.50
  Leaf cells            2830           2832           2832
  Power (CLK1)          38480011.06    38480011.06    38480011.06
  Area (CLK1)           72314.61       72314.61       72314.61
  Power (CLK2)          9854972.14     9854972.14     9854972.14
  Area (CLK2)           71167.19       71167.19       71167.19

(38) The hardware resources used by layered MSA, layered AMSA, and layered TAMSA are summarized in Table V. The power and area of layered MSA, AMSA, and TAMSA are the same when the clock is 500 MHz (CLK1) and 100 MHz (CLK2), respectively. The comparison illustrates that layered TAMSA requires no extra hardware resources compared to layered AMSA, and both attenuated algorithms require only 0.07% extra leaf cells compared to conventional layered MSA.

(39) In one embodiment, for low-power hardware design considerations, a full-parallel architecture is preferred over a pipeline structure because, in one embodiment, the layered decoder must complete the computation of the first 31 rows of the parity-check matrix before the computation of the following 31 rows can continue. However, for different LDPC codes, a pipeline structure can provide desirable results, particularly in order to speed up the decoding process. The LUT-based TAMSA/TOMSA approach described above can be applied similarly to pipeline or any other decoder implementations.

(40) The preceding examples can be repeated with similar success by substituting the generically or specifically described components and/or operating conditions of embodiments of the present invention for those used in the preceding examples.

(41) Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitory memory-storage devices.

(42) Although the terms VRAM and CRAM are used to designate where different values are most preferably stored, in one embodiment VRAM and CRAM can comprise one or more non-specific random access memory (“RAM”) chips and/or cards; thus, in one embodiment, values that are most preferably stored in VRAM can be stored on the same RAM device as values which are referred to as preferably being stored in CRAM. Note that in the specification and claims, “about” or “approximately” means within twenty percent (20%) of the numerical amount cited. All computer software disclosed herein may be embodied on any non-transitory computer-readable medium (including combinations of mediums), including without limitation CD-ROMs, DVD-ROMs, hard drives (local or network storage device), USB keys, other removable drives, ROM, and firmware.

(43) Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other. Although the invention has been described in detail with particular reference to the disclosed embodiments, other embodiments can achieve the same results. Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguring their relationships with one another.