Systems and Methods for Decoding of Graph-Based Channel Codes Via Reinforcement Learning
20230231575 · 2023-07-20
Assignee
Inventors
CPC classification
H03M13/1125
ELECTRICITY
International classification
Abstract
Embodiments of the present disclosure relate to sequential decoding of moderate length low-density parity-check (LDPC) codes via reinforcement learning (RL). The sequential decoding scheme is modeled as a Markov decision process (MDP), and an optimized cluster scheduling policy is subsequently obtained via RL. A software agent is trained to schedule all check nodes (CNs) in a cluster, and all clusters in every iteration. A new RL state space model is provided that enables the RL-based decoder to be suitable for longer LDPC codes.
Claims
1. A method for decoding low-density parity-check codes encoded in a traffic channel of a communication signal received by a mobile communication device, the method comprising: generating a decoding schedule for a plurality of clusters of check nodes in response to execution of a reinforcement learning software agent of an LDPC decoder; sequentially decoding each of the plurality of clusters of check nodes according to the decoding schedule; updating a posterior log-likelihood ratio of all variable nodes (VNs) based on the sequential decoding schedule; determining whether a specified maximum number of iterations has been reached or a stopping condition has been satisfied based on the sequential decoding schedule; and in response to determining that the specified maximum number of iterations is reached or the stopping condition is satisfied, outputting a reconstructed signal corresponding to the communication signal received by the mobile communication device.
2. The method of claim 1, further comprising: training the reinforcement learning software agent to schedule the plurality of clusters of check nodes based on a reward associated with an outcome of decoding each of the plurality of clusters of check nodes.
3. The method of claim 2, wherein the reward corresponds to a probability that corrupted bits of the communication signal are correctly reconstructed.
4. The method of claim 2, further comprising establishing a cluster scheduling policy based on the training.
5. The method of claim 4, wherein the decoding schedule is determined based on the cluster scheduling policy.
6. The method of claim 1, further comprising clustering the check nodes into the plurality of clusters to minimize inter-cluster dependency.
7. The method of claim 1, wherein the reinforcement learning software agent implements at least one of a Q-learning scheme or a deep reinforcement learning scheme to generate the cluster scheduling policy.
8. A system for decoding low-density parity-check codes encoded in a traffic channel of a communication signal received by a mobile communication device, the system comprising: a non-transitory computer-readable medium storing instructions for decoding low-density parity-check codes; and a processing device executing the instructions to: generate a decoding schedule for a plurality of clusters of check nodes in response to execution of a reinforcement learning software agent of an LDPC decoder; sequentially decode each of the plurality of clusters of check nodes according to the learned scheduling policy; update a posterior log-likelihood ratio of all variable nodes (VNs) based on the sequential decoding schedule; determine whether a specified maximum number of iterations has been reached or a stopping condition has been satisfied based on the sequential scheduling policy; and output a reconstructed signal corresponding to the communication signal received by the mobile communication device in response to determining that the specified maximum number of iterations is reached or the stopping condition is satisfied.
9. The system of claim 8, wherein the processing device executes the instructions to: train the reinforcement learning software agent to sequentially schedule the plurality of clusters of check nodes based on a reward associated with an outcome of decoding each of the plurality of clusters of check nodes.
10. The system of claim 9, wherein the reward corresponds to a probability that corrupted bits of the communication signal are correctly reconstructed.
11. The system of claim 9, wherein the processing device executes the instructions to establish a cluster scheduling policy based on the training.
12. The system of claim 11, wherein the decoding schedule is determined based on the learned cluster scheduling policy.
13. The system of claim 8, wherein the processing device executes the instructions to cluster the check nodes into the plurality of clusters to minimize inter-cluster dependency.
14. The system of claim 8, wherein the reinforcement learning software agent implements at least one of a Q-learning scheme or a deep reinforcement learning scheme to generate the decoding schedule.
15. A non-transitory computer-readable medium comprising instructions, wherein execution of the instructions by a processing device causes the processing device to: generate a decoding schedule for a plurality of clusters of check nodes in response to execution of a reinforcement learning software agent of an LDPC decoder; sequentially decode each of the plurality of clusters of check nodes according to the learned scheduling policy; update a posterior log-likelihood ratio of all variable nodes (VNs) based on the learned sequential scheduling policy; determine whether a specified maximum number of iterations has been reached or a stopping condition has been satisfied based on the sequential cluster scheduling policy; and output a reconstructed signal corresponding to a communication signal received by a mobile communication device in response to determining that the specified maximum number of iterations is reached or the stopping condition is satisfied.
16. The medium of claim 15, wherein execution of the instructions by the processing device causes the processing device to: train the reinforcement learning software agent to sequentially schedule the plurality of clusters of check nodes based on a reward associated with an outcome of decoding each of the plurality of clusters of check nodes.
17. The medium of claim 16, wherein the reward corresponds to a probability that corrupted bits of the communication signal are correctly reconstructed.
18. The medium of claim 16, wherein execution of the instructions by the processing device causes the processing device to establish a cluster scheduling policy based on the training.
19. The medium of claim 18, wherein the decoding schedule is determined based on the sequential cluster scheduling policy.
20. The medium of claim 15, wherein the reinforcement learning software agent implements at least one of a Q-learning scheme or a deep reinforcement learning scheme to generate the decoding schedule.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0023] Embodiments of the present disclosure provide for systems and methods for sequential decoding of moderate length low-density parity-check (LDPC) codes via reinforcement learning (RL). The sequential decoding process can be embodied in an LDPC decoder including a reinforcement learning software agent executed in a mobile communication device and can be modeled as a Markov decision process (MDP). An optimized cluster scheduling policy can be subsequently obtained via RL. In contrast to conventional approaches, where a software agent learns to schedule only a single check node (CN) within a group (cluster) of CNs per iteration, in embodiments of the present disclosure the software agent of the LDPC decoder is trained to schedule all CNs in a cluster, and all clusters in every iteration. That is, in accordance with embodiments of the present disclosure, in each RL step, the software agent of the LDPC decoder learns to schedule CN clusters sequentially depending on the reward associated with the outcome of scheduling a particular cluster.
[0024] Embodiments of the present disclosure provide an LDPC decoder with a new RL state space model, which has a significantly smaller number of states than previously proposed models, enabling embodiments of the RL-based LDPC decoder of the present disclosure to be suitable for much longer LDPC codes. As a result, embodiments of the RL-based LDPC decoder described herein exhibit a signal-to-noise ratio (SNR) gain of approximately 0.8 dB for fixed bit error probability over the conventional flooding approach.
[0025] With respect to LDPC codes, an [n, k] binary linear code is a k-dimensional subspace of F.sub.2.sup.n, and can be defined as the kernel of a binary parity-check matrix H ∈ F.sub.2.sup.m×n, where m≥n−k. The code's block length is n, and the rate is (n−rank(H))/n. The Tanner graph of a linear code with parity-check matrix H is the bipartite graph G.sub.H=(V ∪ C, E), where V={v.sub.0, . . . , v.sub.n−1} is a set of variable nodes (VNs) corresponding to the columns of H, C={c.sub.0, . . . , c.sub.m−1} is a set of check nodes (CNs) corresponding to the rows of the parity-check matrix H, and E contains an edge between VN v.sub.j and CN c.sub.i whenever the entry of H in row i and column j is a "1". LDPC codes are a class of highly competitive linear codes defined via sparse parity-check matrices or, equivalently, sparse Tanner graphs, and are amenable to low-complexity graph-based message-passing decoding algorithms, making them ideal for practical applications in telecommunications and other fields. One example of a decoding algorithm for which LDPC codes are suitable is belief propagation (BP) iterative decoding.
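As an illustrative sketch (not part of the disclosed decoder), the Tanner graph adjacency described above can be derived from a parity-check matrix as follows; the [7, 4] Hamming matrix and the helper name `tanner_graph` are assumptions used only for demonstration:

```python
import numpy as np

def tanner_graph(H):
    """Adjacency lists of the Tanner graph of parity-check matrix H:
    CN i and VN j are connected whenever H[i, j] == 1."""
    cn_neighbors = [np.flatnonzero(row).tolist() for row in H]    # VNs adjacent to each CN
    vn_neighbors = [np.flatnonzero(col).tolist() for col in H.T]  # CNs adjacent to each VN
    return cn_neighbors, vn_neighbors

# [7, 4] Hamming code parity-check matrix, standing in for a sparse LDPC matrix
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
cns, vns = tanner_graph(H)
print(cns[0])  # -> [0, 1, 3, 4]  (VN neighbors of CN c_0)
print(vns[3])  # -> [0, 1, 2]     (CN neighbors of VN v_3)
```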
[0026] Experimental results for embodiments of the LDPC decoder that utilize two particular classes of LDPC codes—(γ, k)-regular and array-based (AB-) LDPC codes—are described herein. A (γ, k)-regular LDPC code is defined by a parity-check matrix with constant column and row weights equal to γ and k, respectively. A (γ, p) AB-LDPC code, where p is prime, is a (γ, p)-regular LDPC code with additional structure in its parity-check matrix, H(γ, p). In particular, H(γ, p) is a γ×p array of p×p circulant blocks, whose block in row i and column j is σ.sup.ij for i ∈ {0, . . . , γ−1} and j ∈ {0, . . . , p−1}:

H(γ, p)=
[I I I . . . I
 I σ σ.sup.2 . . . σ.sup.p−1
 . . .
 I σ.sup.γ−1 σ.sup.2(γ−1) . . . σ.sup.(γ−1)(p−1)],

where σ.sup.z denotes the circulant matrix obtained by cyclically left-shifting the entries of the p×p identity matrix I by z (mod p) positions. Notice that σ.sup.0=I. In embodiments of the present disclosure, lifted LDPC codes can be obtained by replacing non-zero (resp., zero) entries of the parity-check matrix with randomly generated permutation (resp., all-zero) matrices.
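The circulant structure of H(γ, p) can be sketched as follows; the helper name `ab_ldpc_H` and the left-shift convention for σ are assumptions consistent with the definition above:

```python
import numpy as np

def ab_ldpc_H(gamma, p):
    """Parity-check matrix H(gamma, p) of a (gamma, p) array-based LDPC code:
    a gamma x p array of p x p circulant blocks, with block (i, j) = sigma^(i*j)."""
    def sigma_pow(z):
        # sigma^z: the p x p identity cyclically left-shifted by z (mod p) positions; sigma^0 = I
        return np.roll(np.eye(p, dtype=int), -(z % p), axis=1)
    return np.block([[sigma_pow(i * j) for j in range(p)] for i in range(gamma)])

H = ab_ldpc_H(3, 5)
print(H.shape)  # -> (15, 25)
# Every column has weight gamma = 3 and every row has weight p = 5,
# confirming the (gamma, p)-regularity of the AB-LDPC code.
```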
[0027] In an RL problem, a software agent (learner) interacts with an environment whose state space can be modeled as a finite Markov decision process (MDP). The software agent takes actions that alter the state of the environment and receives a reward in return for each action, with the goal of maximizing the total reward in a series of actions. The optimized sequence of actions can be obtained by employing a cluster scheduling policy which utilizes an action-value function to determine how beneficial an action is for maximizing the long-term expected reward. For embodiments described herein, let [[x]]={0, . . . , x−1}, where x is a positive integer. As an example, an environment can allow m possible actions. A random variable A.sub.l ∈ [[m]], with realization a, represents the index of an action taken by the software agent during learning step l. The current state of the environment before taking action A.sub.l is represented as S.sub.l, with realization s ∈ Z, and S.sub.l+1, with realization s′, represents a new state of the MDP after executing action A.sub.l. A state space S contains all possible state realizations. The reward yielded at step l after taking action A.sub.l in state S.sub.l is represented as R.sub.l(S.sub.l, A.sub.l, S.sub.l+1).
[0028] Optimal policies for MDPs can be estimated via Monte Carlo techniques such as Q-learning. The estimated action-value function Q.sub.l(S.sub.l, A.sub.l) in Q-learning represents the expected long-term reward achieved by the software agent at step l after taking action A.sub.l in state S.sub.l. To improve the estimation in each step, the action-value function can be adjusted according to the recursion

Q.sub.l+1(s, a)=Q.sub.l(s, a)+α[R.sub.l(s, a, s′)+β max.sub.a′Q.sub.l(s′, a′)−Q.sub.l(s, a)], (2)

where s′ represents the new state as a function of s and a, 0<α<1 is the learning rate, β is the reward discount rate, and max.sub.a′Q.sub.l(s′, a′) is the future action-value resulting from the best action a′ available in the new state s′. Note that the new state is updated with each action. The optimal cluster scheduling policy for the software agent, π.sup.(l), in state s is given by
π.sup.(l)=argmax.sub.a Q.sub.l(s, a), (3)
where l is the total number of learning steps elapsed after observing the initial state S.sub.0. In the case of a tie, an action can be uniformly chosen at random from all the maximizing actions.
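Equations (2) and (3) can be illustrated with a minimal tabular Q-learning sketch; the two-state toy MDP and the parameter values here are illustrative assumptions, not the disclosed decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.9):
    # Equation (2): Q(s, a) <- Q(s, a) + alpha * (r + beta * max_a' Q(s', a') - Q(s, a))
    Q[s, a] += alpha * (r + beta * Q[s_next].max() - Q[s, a])

def greedy_action(Q, s):
    # Equation (3): argmax_a Q(s, a), with ties broken uniformly at random
    row = Q[s]
    best = np.flatnonzero(row == row.max())
    return int(rng.choice(best))

# Toy MDP with 2 states and 2 actions: taking action 1 in state 0 always pays reward 1.
Q = np.zeros((2, 2))
for _ in range(200):
    q_update(Q, s=0, a=1, r=1.0, s_next=1)
    q_update(Q, s=0, a=0, r=0.0, s_next=1)

print(greedy_action(Q, 0))  # -> 1
```

The action-value of the rewarded action converges toward its long-term return, so the greedy policy of equation (3) selects it.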
[0029] An embodiment of the RL-based sequential decoding (RL-SD) process can include a belief propagation (BP) decoding algorithm in which the environment is given by the Tanner graph of the LDPC code, and the optimized sequence of actions, i.e., the scheduling of individual clusters, can be obtained using a suitable RL algorithm such as Q-learning. A single cluster scheduling step can be carried out by sending messages from all CNs of a cluster to their neighboring VNs, and subsequently sending messages from these VNs to their CN neighbors. That is, a selected cluster executes one iteration of flooding in each decoding instant. Every cluster is scheduled exactly once within a single decoder iteration. Sequential cluster scheduling can be carried out until a stopping condition is reached, or an iteration threshold is exceeded. The RL-SD method relies on a cluster scheduling policy based on an action-value function, which can be estimated using the RL techniques described herein.
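A single cluster-scheduling step as described above can be sketched as follows, using a min-sum approximation of the BP message updates; the small Hamming-code matrix, the two-cluster partition, and the helper name `schedule_cluster` are illustrative assumptions, not the disclosed decoder:

```python
import numpy as np

def schedule_cluster(H, cluster_rows, llr, m_c2v):
    """One cluster-scheduling step (min-sum sketch of BP): send messages from
    every CN in the cluster to its neighboring VNs, after refreshing the
    VN-to-CN messages those CNs consume. Returns the updated CN-to-VN message
    table and the vector of posterior LLRs."""
    # Extrinsic VN-to-CN messages: posterior minus the message on the same edge
    post = llr + m_c2v.sum(axis=0)
    m_v2c = (post[None, :] - m_c2v) * H
    # CN-to-VN min-sum update, restricted to the CNs of the scheduled cluster
    for c in cluster_rows:
        vns = np.flatnonzero(H[c])
        for v in vns:
            others = [u for u in vns if u != v]
            sign = np.prod(np.sign(m_v2c[c, others]))
            m_c2v[c, v] = sign * np.min(np.abs(m_v2c[c, others]))
    return m_c2v, llr + m_c2v.sum(axis=0)

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
llr = np.array([2.5, -1.0, 3.0, 4.0, 1.5, 2.0, 0.5])  # channel LLRs; v_1 is corrupted
m_c2v = np.zeros(H.shape)
for cluster in ([0, 1], [2]):  # two CN clusters, scheduled sequentially in one iteration
    m_c2v, post = schedule_cluster(H, cluster, llr, m_c2v)
x_hat = (post < 0).astype(int)
print(x_hat)  # -> [0 0 0 0 0 0 0]  (the corrupted bit is corrected)
```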
[0031] The (first) mobile communication device 110 can encode (e.g., with LDPC codes) and modulate a radiofrequency (RF) signal and transmit the RF signal which can be routed through the network 130 and transmitted to the (second) communication device 120, which can demodulate and decode the received RF signal to extract the voice data. In an exemplary embodiment, the first mobile communication device 110 can use LDPC codes for channel coding on the traffic channel. When the second mobile communication device 120 receives the RF signal, the second mobile communication device can extract the LDPC codes from the RF signal and use the extracted LDPC codes to correct channel errors by maintaining parity bits for data bits transmitted via the traffic channel. When a parity check failure is detected by the second mobile communication device 120 for one or more data bits, information from the multiple parity bits of the LDPC codes associated with the one or more data bits can be used by the second mobile communication device 120 to determine the original/correct value for the one or more data bits.
[0033] The memory 206 can include any suitable, non-transitory computer-readable storage medium, e.g., read-only memory (ROM), erasable programmable ROM (EPROM), electrically-erasable programmable ROM (EEPROM), random access memory (RAM), flash memory, and the like. In exemplary embodiments, an operating system 226 and an embodiment of the LDPC decoder 228 can be embodied as computer-readable/executable program code stored on the non-transitory computer-readable memory 206 and implemented using any suitable, high or low-level computing language, scripting language, or any suitable platform, such as, e.g., Java, C, C++, C#, assembly code, machine-readable language, Python, Rails, Ruby, and the like. The memory 206 can also store data to be used by and/or that is generated by the LDPC decoder 228. While memory 206 is depicted as a single component, those skilled in the art will recognize that the memory can be formed using multiple components and that separate non-volatile and volatile memory devices can be used.
[0034] One or more processing and logic devices 204 can be programmed and/or configured to facilitate an operation of the mobile communication device 200 and enable RF communications with other communication devices via a network (e.g., network 130). The processing and/or logic devices 204 can be programmed and/or configured to execute the operating system 226 and the LDPC decoder 228 to implement one or more processes to perform one or more operations (decoding of LDPC codes, error detection and correction). As an example, a microprocessor, micro-controller, central processing unit (CPU), or graphical processing unit (GPU) can be programmed to execute the LDPC decoder 228. As another example, the LDPC decoder 228 can be embodied and executed by an application-specific integrated circuit (ASIC). The processing and/or logic devices 204 can retrieve information/data from and store information/data to the memory 206. For example, the processing device 204 can retrieve and/or store LDPC codes and/or any other suitable information/data that can be utilized by the mobile communication device to perform error detection and correction using LDPC codes.
[0035] The LDPC decoder 228 can include a reinforcement learning (RL) software agent that can sequentially decode the low-density parity-check (LDPC) codes included in the RF signal via reinforcement learning (RL). The sequential decoding process implemented by the software agent can be trained to schedule all check nodes (CNs) in a cluster, and all clusters in every iteration, such that in each RL step, the software agent of the LDPC decoder 228 learns to schedule CN clusters sequentially depending on the reward associated with the outcome of scheduling a particular cluster.
[0036] The RF circuitry 214 can include an RF transceiver, one or more modulation circuits, one or more demodulation circuits, one or more multiplexers, and one or more demultiplexers. The RF circuitry 214 can be configured to transmit and/or receive wireless communications via an antenna 215 pursuant to, for example, the 3rd Generation Partnership Project (3GPP) specifications for 5G NR and/or the International Telecommunication Union (ITU) IMT-2020.
[0037] The display unit 208 can render user interfaces, such as graphical user interfaces (GUIs) to a user and in some embodiments can provide a mechanism that allows the user to interact with the GUIs. For example, a user may interact with the mobile communication device 200 through the display unit 208, which may be implemented as a liquid crystal touchscreen (or haptic) display, a light-emitting diode touchscreen display, and/or any other suitable display device, which may display one or more user interfaces that may be provided in accordance with exemplary embodiments.
[0038] The power source 212 can be implemented as a battery or capacitive elements configured to store an electric charge and power the mobile communication device 200. In exemplary embodiments, the power source 212 can be a rechargeable power source, such as a battery or one or more capacitive elements configured to be recharged via a connection to an external power supply.
[0040] The transmitted and the received words can be represented as x=[x.sub.0, . . . , x.sub.n−1] and y=[y.sub.0, . . . , y.sub.n−1], respectively, where for v ∈ [[n]], each transmitted bit satisfies x.sub.v ∈ {0,1} and each received value can be represented as y.sub.v=(−1).sup.x.sup.v+w.sub.v, with w.sub.v drawn from the Gaussian distribution N(0, σ.sup.2). The posterior log-likelihood ratio (LLR) of a transmitted bit x.sub.v given the channel output y.sub.v can be expressed as L.sub.v=ln(Pr[x.sub.v=0|y.sub.v]/Pr[x.sub.v=1|y.sub.v])=2y.sub.v/σ.sup.2. The posterior LLR computed by VN v during iteration I can be represented as L.sub.v.sup.(I)=Σ.sub.c∈N(v)m.sub.c→v.sup.(I)+L.sub.v, where L.sub.v.sup.(0)=L.sub.v and m.sub.c→v.sup.(I) is the message received by VN v from neighboring CN c in iteration I. Similarly, the posterior LLR computed during iteration I by VN j in the subgraph induced by the cluster with index a ∈ [[┌m/z┐]] can be represented as L.sub.j,a.sup.(I). Hence, L.sub.v.sup.(I)=L.sub.j,a.sup.(I) if VN v in the Tanner graph is also the jth VN in the subgraph induced by the cluster with index a.
[0041] After scheduling cluster a during iteration I, the output {circumflex over (x)}.sub.a.sup.(I) of cluster a, where l.sub.a≤z*k.sub.max is the number of VNs adjacent to cluster a, is obtained by taking hard decisions on the vector of posterior LLRs [L.sub.0,a.sup.(I), . . . , L.sub.l.sub.a.sub.−1,a.sup.(I)], computed according to {circumflex over (x)}.sub.j,a.sup.(I)=(1−sign(L.sub.j,a.sup.(I)))/2 for each j ∈ [[l.sub.a]].
[0042] The output {circumflex over (x)}.sub.a.sup.(I) of cluster a includes the bits reconstructed by the sequential decoder after scheduling cluster a during iteration I. An index of a realization of {circumflex over (x)}.sub.a.sup.(I) in iteration I can be denoted by s.sub.a.sup.(I) ∈ [[2.sup.l.sup.a]]; for example, by interpreting the binary vector {circumflex over (x)}.sub.a.sup.(I) as the binary representation of an integer, the index s.sub.a.sup.(I) can be obtained.
[0043] During the learning/training phase, embodiments of the RL process inform the software agent of the current state of the LDPC decoder and the reward obtained after performing an action (decoding a cluster). Based on these observations, the software agent of the LDPC decoder 228 can take future actions to enhance the total reward earned, which alters the state of the environment as well as the future reward. Given that the transmitted communication signal x is known during the training phase, a vector containing the l.sub.a bits of x that are reconstructed in the output {circumflex over (x)}.sub.a.sup.(I) of a cluster can be represented as x.sub.a=[x.sub.0,a, . . . , x.sub.l.sub.a.sub.−1,a]. The reward for scheduling cluster a can be given by

R.sub.a=(1/l.sub.a)Σ.sub.j∈[[l.sub.a]]1({circumflex over (x)}.sub.j,a.sup.(I)=x.sub.j,a),

where 1(·) denotes the indicator function. Thus, the reward earned by the software agent after scheduling cluster a is identical to the probability that the corrupted bits corresponding to the transmitted bits x.sub.0,a, . . . , x.sub.l.sub.a.sub.−1,a are correctly reconstructed.
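The reward described above, i.e., the fraction of the cluster's l.sub.a adjacent bits that are correctly reconstructed, can be sketched as follows (the helper name `cluster_reward` is an illustrative assumption):

```python
import numpy as np

def cluster_reward(x_hat_a, x_a):
    """Reward for scheduling cluster a: the average of the indicator function
    over the cluster's l_a adjacent VN bits, i.e. the fraction reconstructed
    correctly."""
    return float(np.mean(np.asarray(x_hat_a) == np.asarray(x_a)))

# 3 of the 4 bits adjacent to the cluster match the transmitted bits
print(cluster_reward([0, 1, 1, 0], [0, 1, 0, 0]))  # -> 0.75
```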
[0046] The RL-SD process illustrated by
[0047] With respect to the software agent learning a cluster scheduling policy, the state of the MDP after scheduling a cluster index a during learning step l can be denoted as {circumflex over (x)}.sub.a.sup.(l), and the index of a realization of {circumflex over (x)}.sub.a.sup.(l) can be referred to as s.sub.a ∈ [[2.sup.l.sup.a]]. Thus, s.sub.a also refers to the state of the MDP. The state space of the MDP contains all possible Σ.sub.a∈[[┌m/z┐]]2.sup.l.sup.a states, and the action space is A=[[┌m/z┐]]. Different Q-learning-based RL approaches can be used for solving the sequential decoding problem.
[0048] As an example using deep reinforcement learning (DRL), for MDPs with very large state spaces, the action-value function Q.sub.l(s, a) can be approximated as Q.sub.l(s, a; W) using a deep learning model with tensor W representing the weights connecting all layers in the neural network (NN). In each learning step l, a separate NN can be used, with weights W.sub.l.sup.(a) for each cluster, since a single NN cannot distinguish between the signals {circumflex over (x)}.sub.0.sup.(l), . . . , {circumflex over (x)}.sub.┌m/z┐−1.sup.(l), and hence cannot distinguish between the rewards R.sub.0, . . . , R.sub.┌m/z┐−1 generated by the ┌m/z┐ different clusters. The target of the NN corresponding to cluster a is given by

T.sub.l.sup.(a)=R.sub.l(s.sub.a, a, s′)+β max.sub.a′Q.sub.l(s′, a′; W.sub.l.sup.(a′)),

[0049] where the reward R.sub.l(s.sub.a, a, s′)=R.sub.a. Also, let Q.sub.l(s.sub.a, a; W.sub.l.sup.(a)) be the NN's prediction. In each DRL step, the mean squared error loss between T.sub.l.sup.(a) and Q.sub.l(s.sub.a, a; W.sub.l.sup.(a)) can be minimized using a gradient descent method. The NN corresponding to each cluster learns to map the cluster output {circumflex over (x)}.sub.a.sup.(l) to a vector of ┌m/z┐ predicted action-values, one per cluster.
[0050] During inference, the optimized cluster scheduling policy, π.sub.i.sup.*(I), for scheduling the ith cluster during decoder iteration I is expressed as

π.sub.i.sup.*(I)=argmax.sub.a Q(s.sub.a.sub.i−1, a; W.sup.(a)),

where s.sub.a.sub.i−1 denotes the state of the MDP after scheduling the previously selected cluster a.sub.i−1, and W.sup.(a) denotes the trained weights of the NN corresponding to cluster a.
[0051] As another example using standard Q-learning, for MDPs with moderately large state spaces, a standard Q-learning approach can be used for determining the optimal cluster scheduling order, where the action-value for choosing cluster a in state s.sub.a is given by

Q.sub.l+1(s.sub.a, a)=Q.sub.l(s.sub.a, a)+α[R.sub.a+β max.sub.a′Q.sub.l(s′, a′)−Q.sub.l(s.sub.a, a)], (8)

where s′ is the state reached after scheduling cluster a.
[0052] In each learning step l, cluster a can be selected via an ε-greedy approach according to

a=π.sub.Q.sup.(l) with probability 1−ε, or a chosen uniformly at random from [[┌m/z┐]] with probability ε,
[0053] where π.sub.Q.sup.(l)=argmax.sub.a∈[[┌m/z┐]]Q.sub.l(s.sub.a, a). For ties (as in the first iteration of the standard Q-learning algorithm shown in the figures), an action can be chosen uniformly at random from all the maximizing actions. After training, the optimized cluster scheduling policy π.sub.i.sup.*(I) for scheduling the ith cluster during decoder iteration I can be expressed as

π.sub.i.sup.*(I)=argmax.sub.a Q*(s.sub.a.sub.i−1, a),

[0054] where Q*(s.sub.a.sub.i−1, a) denotes the optimized action-value function and s.sub.a.sub.i−1 is the state of the MDP after scheduling the previously selected cluster a.sub.i−1.
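The ε-greedy selection rule can be sketched as follows; the function name and the toy Q-values are illustrative assumptions, with ε=0.6 chosen to match the learning parameters used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy_cluster(Q_row, epsilon=0.6):
    """Pick a cluster index: exploit argmax_a Q(s, a) with probability 1 - epsilon,
    otherwise explore uniformly over all clusters. Ties among the maxima are
    broken uniformly at random, as in the first Q-learning iteration."""
    n_clusters = len(Q_row)
    if rng.random() < epsilon:
        return int(rng.integers(n_clusters))       # explore
    best = np.flatnonzero(Q_row == Q_row.max())    # exploit, random tie-break
    return int(rng.choice(best))

# Over many draws, the highest-valued cluster is still chosen most often,
# since it receives the exploitation mass plus its share of exploration.
picks = [epsilon_greedy_cluster(np.array([0.1, 0.9, 0.3])) for _ in range(10000)]
counts = np.bincount(picks, minlength=3)
print(counts.argmax())  # -> 1
```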
[0055] The inputs to the standard Q-learning algorithm include a set L={L.sub.0, . . . , L.sub.|L|−1} containing |L| realizations of the channel LLR vector L over which Q-learning is performed, and a parity-check matrix H. The output is Q*(s.sub.a.sub.i−1, a). For each LLR realization in L, the action-value function in equation 8 can be recursively updated l.sub.max times.
Experimental Results
[0056] Experiments were performed to test the performance of the RL-SD process shown in
[0058] The LLR vectors used for training are sampled uniformly at random over a range of A equally spaced SNR values for a given code. Hence, there are |L|/A LLR vectors in L for each SNR value considered. For both considered codes (e.g., [384, 256]-WRAN and (3, 5)-AB LDPC codes), the learning parameters can be as follows: α=0.1, β=0.9, ε=0.6, l.sub.max=50, and |L|=5×10.sup.5, where |L| is chosen to ensure that the training is as accurate as possible without incurring excessive run-time for the standard Q-learning algorithm (e.g., an embodiment of which is shown in the figures).
[0059] For both training and inference, the AWGN channel is considered and all-zero codewords are transmitted using BPSK modulation. Training with the all-zero codeword is sufficient as, due to the symmetry of the BP decoder and the channel, the decoding error is independent of the transmitted signal.
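Generating all-zero-codeword training LLRs over the BPSK/AWGN channel described above can be sketched as follows; the Eb/N0-to-noise-variance conversion and the function name are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def channel_llrs(n, snr_db, rate, num_vectors):
    """Sample channel LLR vectors for the all-zero codeword over BPSK/AWGN.

    All-zero bits map to +1 under BPSK. The noise variance follows from
    Eb/N0 via sigma^2 = 1 / (2 * rate * 10^(snr_db / 10)), and the channel
    LLR of each received value y is 2 * y / sigma^2.
    """
    sigma2 = 1.0 / (2.0 * rate * 10.0 ** (snr_db / 10.0))
    y = 1.0 + rng.normal(0.0, np.sqrt(sigma2), size=(num_vectors, n))
    return 2.0 * y / sigma2

# Four training LLR vectors for a rate-2/3, length-384 code at 2 dB
L = channel_llrs(n=384, snr_db=2.0, rate=256 / 384, num_vectors=4)
print(L.shape)  # -> (4, 384)
# LLRs are positive on average, since the all-zero codeword maps to +1 symbols.
```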
Performance is evaluated in terms of the bit error rate (BER) and the frame error rate (FER), the latter given by Pr[{circumflex over (x)}≠x]. In the case of the WRAN LDPC code, only z=1 is considered, as this code has several degree-11 CNs which render both learning schemes too computationally intensive for z>1. On the other hand, for the AB code, multiple cluster sizes are chosen from z ∈ {1, 2, 3} for both the random and RL-SD schemes. For z ∈ {1, 2}, standard Q-learning can be employed to learn the cluster scheduling policy. For z=3, deep reinforcement learning (DRL) can be utilized, as standard Q-learning is not feasible due to the significantly increased state space. The same number of training examples are used for both standard Q-learning and DRL.
[0060] The BER vs. channel signal-to-noise ratio (SNR), in terms of Eb/N0 in dB, for the [384, 256]-WRAN and (3, 5) AB-LDPC codes using these decoding techniques are shown in
[0061] In Table 1, the average number of CN to VN messages propagated in the considered decoding schemes is compared to attain the results in
TABLE-US-00001 ([384, 256]-WRAN LDPC code)

  SNR (dB)        1      2      3
  flooding       6480   6422   5171
  random (z = 1) 6480   5827   3520
  RL-SD (z = 1)  6467   5450   3179

TABLE-US-00002 ((3, 5) AB-LDPC code)

  SNR (dB)         1      2      3
  flooding        63750  16409  8123
  random (z = 3)  44338  11102  5005
  RL-SD (z = 3)   40448  10694  4998
  random (z = 2)  36328  10254  4994
  RL-SD (z = 2)   31383   7349  4225
  random (z = 1)  59750  10692  4812
  RL-SD (z = 1)   51250   6240  3946

Table 1: Average number of CN to VN messages propagated in various decoding schemes for a [384, 256]-WRAN (left) and a (3, 5) AB-LDPC code (right) to attain the results shown in
[0062] Exemplary flowcharts are provided herein for illustrative purposes and are non-limiting examples of methods. One of ordinary skill in the art will recognize that exemplary methods may include more or fewer steps than those illustrated in the exemplary flowcharts, and that the steps in the exemplary flowcharts may be performed in a different order than the order shown in the illustrative flowcharts.
[0063] The foregoing description of the specific embodiments of the subject matter disclosed herein has been presented for purposes of illustration and description and is not intended to limit the scope of the subject matter set forth herein. It is fully contemplated that other various embodiments, modifications and applications will become apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such other embodiments, modifications, and applications are intended to fall within the scope of the following appended claims. Further, those of ordinary skill in the art will appreciate that the embodiments, modifications, and applications that have been described herein are in the context of a particular environment, and the subject matter set forth herein is not limited thereto but can be beneficially applied in any number of other manners, environments, and purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the novel features and techniques as disclosed herein.