Systems and Methods for Decoding of Graph-Based Channel Codes Via Reinforcement Learning
20230231575 · 2023-07-20
Assignee
Inventors
CPC classification
H03M13/1125
ELECTRICITY
International classification
Abstract
Embodiments of the present disclosure relate to sequential decoding of moderate length low-density parity-check (LDPC) codes via reinforcement learning (RL). The sequential decoding scheme is modeled as a Markov decision process (MDP), and an optimized cluster scheduling policy is subsequently obtained via RL. A software agent is trained to schedule all check nodes (CNs) in a cluster, and all clusters in every iteration. A new RL state space model is provided that enables the RL-based decoder to be suitable for longer LDPC codes.
Claims
1. A method for decoding low-density parity-check codes encoded in a traffic channel of a communication signal received by a mobile communication device, the method comprising: generating a decoding schedule for a plurality of clusters of check nodes in response to execution of a reinforcement learning software agent of an LDPC decoder; sequentially decoding each of the plurality of clusters of check nodes according to the decoding schedule; updating a posterior log-likelihood ratio of all variable nodes (VNs) based on the sequential decoding schedule; determining whether a specified maximum number of iterations has been reached or a stopping condition has been satisfied based on the sequential decoding schedule; and in response to determining that the specified maximum number of iterations is reached or the stopping condition is satisfied, outputting a reconstructed signal corresponding to the communication signal received by the mobile communication device.
2. The method of claim 1, further comprising: training the reinforcement learning software agent to schedule the plurality of clusters of check nodes based on a reward associated with an outcome of decoding each of the plurality of clusters of check nodes.
3. The method of claim 2, wherein the reward corresponds to a probability that corrupted bits of the communication signal are correctly reconstructed.
4. The method of claim 2, further comprising establishing a cluster scheduling policy based on the training.
5. The method of claim 4, wherein the decoding schedule is determined based on the cluster scheduling policy.
6. The method of claim 1, further comprising clustering the check nodes into the plurality of clusters to minimize inter-cluster dependency.
7. The method of claim 1, wherein the reinforcement learning software agent implements at least one of a Q-learning scheme or a deep reinforcement learning scheme to generate the cluster scheduling policy.
8. A system for decoding low-density parity-check codes encoded in a traffic channel of a communication signal received by a mobile communication device, the system comprising: a non-transitory computer-readable medium storing instructions for decoding low-density parity-check codes; and a processing device executing the instructions to: generate a decoding schedule for a plurality of clusters of check nodes in response to execution of a reinforcement learning software agent of an LDPC decoder; sequentially decode each of the plurality of clusters of check nodes according to the learned scheduling policy; update a posterior log-likelihood ratio of all variable nodes (VNs) based on the sequential decoding schedule; determine whether a specified maximum number of iterations has been reached or a stopping condition has been satisfied based on the sequential scheduling policy; and output a reconstructed signal corresponding to the communication signal received by the mobile communication device in response to determining that the specified maximum number of iterations is reached or the stopping condition is satisfied.
9. The system of claim 8, wherein the processing device executes the instructions to: train the reinforcement learning software agent to sequentially schedule the plurality of clusters of check nodes based on a reward associated with an outcome of decoding each of the plurality of clusters of check nodes.
10. The system of claim 9, wherein the reward corresponds to a probability that corrupted bits of the communication signal are correctly reconstructed.
11. The system of claim 9, wherein the processing device executes the instructions to establish a cluster scheduling policy based on the training.
12. The system of claim 11, wherein the decoding schedule is determined based on the learned cluster scheduling policy.
13. The system of claim 8, wherein the processing device executes the instructions to cluster the check nodes into the plurality of clusters to minimize inter-cluster dependency.
14. The system of claim 8, wherein the reinforcement learning software agent implements at least one of a Q-learning scheme or a deep reinforcement learning scheme to generate the decoding schedule.
15. A non-transitory computer-readable medium comprising instructions, wherein execution of the instructions by a processing device causes the processing device to: generate a decoding schedule for a plurality of clusters of check nodes in response to execution of a reinforcement learning software agent of an LDPC decoder; sequentially decode each of the plurality of clusters of check nodes according to the learned scheduling policy; update a posterior log-likelihood ratio of all variable nodes (VNs) based on the learned sequential scheduling policy; determine whether a specified maximum number of iterations has been reached or a stopping condition has been satisfied based on the sequential cluster scheduling policy; and output a reconstructed signal corresponding to a communication signal received by a mobile communication device in response to determining that the specified maximum number of iterations is reached or the stopping condition is satisfied.
16. The medium of claim 15, wherein execution of the instructions by the processing device causes the processing device to: train the reinforcement learning software agent to sequentially schedule the plurality of clusters of check nodes based on a reward associated with an outcome of decoding each of the plurality of clusters of check nodes.
17. The medium of claim 16, wherein the reward corresponds to a probability that corrupted bits of the communication signal are correctly reconstructed.
18. The medium of claim 16, wherein execution of the instructions by the processing device causes the processing device to establish a cluster scheduling policy based on the training.
19. The medium of claim 18, wherein the decoding schedule is determined based on the sequential cluster scheduling policy.
20. The medium of claim 15, wherein the reinforcement learning software agent implements at least one of a Q-learning scheme or a deep reinforcement learning scheme to generate the decoding schedule.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0023] Embodiments of the present disclosure provide for systems and methods for sequential decoding of moderate length low-density parity-check (LDPC) codes via reinforcement learning (RL). The sequential decoding process can be embodied in an LDPC decoder including a reinforcement learning software agent executed in a mobile communication device and can be modeled as a Markov decision process (MDP). An optimized cluster scheduling policy can be subsequently obtained via RL. In contrast to conventional approaches, where a software agent learns to schedule only a single check node (CN) within a group (cluster) of CNs per iteration, in embodiments of the present disclosure the software agent of the LDPC decoder is trained to schedule all CNs in a cluster, and all clusters in every iteration. That is, in accordance with embodiments of the present disclosure, in each RL step, the software agent of the LDPC decoder learns to schedule CN clusters sequentially depending on the reward associated with the outcome of scheduling a particular cluster.
[0024] Embodiments of the present disclosure provide an LDPC decoder with a new RL state space model, which has a significantly smaller number of states than previously proposed models, enabling embodiments of the RL-based LDPC decoder of the present disclosure to be suitable for much longer LDPC codes. As a result, embodiments of the RL-based LDPC decoder described herein exhibit a signal-to-noise ratio (SNR) gain of approximately 0.8 dB for fixed bit error probability over the conventional flooding approach.
[0025] With respect to LDPC codes, an [n, k] binary linear code is a k-dimensional subspace of F.sub.2.sup.n, and can be defined as the kernel of a binary parity-check matrix H ∈ F.sub.2.sup.m×n, where m≥n−k. The code's block length is n, and the rate is (n−rank(H))/n. The Tanner graph of a linear code with parity-check matrix H is the bipartite graph G.sub.H=(V ∪ C, E), where V={v.sub.0, . . . , v.sub.n−1} is a set of variable nodes (VNs) corresponding to the columns of H, C={c.sub.0, . . . , c.sub.m−1} is a set of check nodes (CNs) corresponding to the rows of the parity-check matrix H, and E contains an edge between VN v.sub.j and CN c.sub.i whenever the entry of H in row i and column j is a "1". LDPC codes are a class of highly competitive linear codes defined via sparse parity-check matrices or, equivalently, sparse Tanner graphs, and are amenable to low-complexity graph-based message-passing decoding algorithms, making them ideal for practical applications in telecommunications and other fields. One example of a decoding algorithm for which LDPC codes are suitable is belief propagation (BP) iterative decoding.
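As an illustrative sketch (not part of the disclosed decoder), the Tanner graph adjacency described above can be derived from a parity-check matrix as follows; the [7, 4] Hamming matrix and the helper name `tanner_graph` are assumptions used only for demonstration:

```python
import numpy as np

def tanner_graph(H):
    """Adjacency lists of the Tanner graph of parity-check matrix H:
    CN i and VN j are connected whenever H[i, j] == 1."""
    cn_neighbors = [np.flatnonzero(row).tolist() for row in H]    # VNs adjacent to each CN
    vn_neighbors = [np.flatnonzero(col).tolist() for col in H.T]  # CNs adjacent to each VN
    return cn_neighbors, vn_neighbors

# [7, 4] Hamming code parity-check matrix, standing in for a sparse LDPC matrix
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
cns, vns = tanner_graph(H)
print(cns[0])  # -> [0, 1, 3, 4]  (VN neighbors of CN c_0)
print(vns[3])  # -> [0, 1, 2]     (CN neighbors of VN v_3)
```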
[0026] Experimental results for embodiments of the LDPC decoder that utilize two particular classes of LDPC codes—(γ, k)-regular and array-based (AB-) LDPC codes—are described herein. A (γ, k)-regular LDPC code is defined by a parity-check matrix with constant column and row weights equal to γ and k, respectively. A (γ, p) AB-LDPC code, where p is prime, is a (γ, p)-regular LDPC code with additional structure in its parity-check matrix, H(γ, p). In particular, H(γ, p) is a γ×p array of p×p circulant blocks, whose block in row i and column j is σ.sup.ij for i ∈ {0, . . . , γ−1} and j ∈ {0, . . . , p−1}:

H(γ, p)=
[I I I . . . I
 I σ σ.sup.2 . . . σ.sup.p−1
 . . .
 I σ.sup.γ−1 σ.sup.2(γ−1) . . . σ.sup.(γ−1)(p−1)],

where σ.sup.z denotes the circulant matrix obtained by cyclically left-shifting the entries of the p×p identity matrix I by z (mod p) positions. Notice that σ.sup.0=I. In embodiments of the present disclosure, lifted LDPC codes can be obtained by replacing non-zero (resp., zero) entries of the parity-check matrix with randomly generated permutation (resp., all-zero) matrices.
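The circulant structure of H(γ, p) can be sketched as follows; the helper name `ab_ldpc_H` and the left-shift convention for σ are assumptions consistent with the definition above:

```python
import numpy as np

def ab_ldpc_H(gamma, p):
    """Parity-check matrix H(gamma, p) of a (gamma, p) array-based LDPC code:
    a gamma x p array of p x p circulant blocks, with block (i, j) = sigma^(i*j)."""
    def sigma_pow(z):
        # sigma^z: the p x p identity cyclically left-shifted by z (mod p) positions; sigma^0 = I
        return np.roll(np.eye(p, dtype=int), -(z % p), axis=1)
    return np.block([[sigma_pow(i * j) for j in range(p)] for i in range(gamma)])

H = ab_ldpc_H(3, 5)
print(H.shape)  # -> (15, 25)
# Every column has weight gamma = 3 and every row has weight p = 5,
# confirming the (gamma, p)-regularity of the AB-LDPC code.
```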
[0027] In an RL problem, a software agent (learner) interacts with an environment whose state space can be modeled as a finite Markov decision process (MDP). The software agent takes actions that alter the state of the environment and receives a reward in return for each action, with the goal of maximizing the total reward in a series of actions. The optimized sequence of actions can be obtained by employing a cluster scheduling policy which utilizes an action-value function to determine how beneficial an action is for maximizing the long-term expected reward. For embodiments described herein, let [[x]]={0, . . . , x−1}, where x is a positive integer. As an example, an environment can allow m possible actions. A random variable A.sub.l ∈ [[m]], with realization a, represents the index of an action taken by the software agent during learning step l. The current state of the environment before taking action A.sub.l is represented as S.sub.l, with realization s ∈ Z, and S.sub.l+1, with realization s′, represents a new state of the MDP after executing action A.sub.l. A state space S contains all possible state realizations. The reward yielded at step l after taking action A.sub.l in state S.sub.l is represented as R.sub.l(S.sub.l, A.sub.l, S.sub.l+1).
[0028] Optimal policies for MDPs can be estimated via Monte Carlo techniques such as Q-learning. The estimated action-value function Q.sub.l(S.sub.l, A.sub.l) in Q-learning represents the expected long-term reward achieved by the software agent at step l after taking action A.sub.l in state S.sub.l. To improve the estimation in each step, the action-value function can be adjusted according to the recursion

Q.sub.l+1(s, a)=Q.sub.l(s, a)+α[R.sub.l(s, a, s′)+β max.sub.a′Q.sub.l(s′, a′)−Q.sub.l(s, a)], (2)

where s′ represents the new state as a function of s and a, 0<α<1 is the learning rate, β is the reward discount rate, and max.sub.a′Q.sub.l(s′, a′) is the future action-value resulting from the best action a′ available in the new state s′. Note that the new state is updated with each action. The optimal cluster scheduling policy for the software agent, π.sup.(l), in state s is given by
π.sup.(l)=argmax.sub.a Q.sub.l(s, a), (3)
where l is the total number of learning steps elapsed after observing the initial state S.sub.0. In the case of a tie, an action can be uniformly chosen at random from all the maximizing actions.
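Equations (2) and (3) can be illustrated with a minimal tabular Q-learning sketch; the two-state toy MDP and the parameter values here are illustrative assumptions, not the disclosed decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.9):
    # Equation (2): Q(s, a) <- Q(s, a) + alpha * (r + beta * max_a' Q(s', a') - Q(s, a))
    Q[s, a] += alpha * (r + beta * Q[s_next].max() - Q[s, a])

def greedy_action(Q, s):
    # Equation (3): argmax_a Q(s, a), with ties broken uniformly at random
    row = Q[s]
    best = np.flatnonzero(row == row.max())
    return int(rng.choice(best))

# Toy MDP with 2 states and 2 actions: taking action 1 in state 0 always pays reward 1.
Q = np.zeros((2, 2))
for _ in range(200):
    q_update(Q, s=0, a=1, r=1.0, s_next=1)
    q_update(Q, s=0, a=0, r=0.0, s_next=1)

print(greedy_action(Q, 0))  # -> 1
```

The action-value of the rewarded action converges toward its long-term return, so the greedy policy of equation (3) selects it.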
[0029] An embodiment of the RL-based sequential decoding (RL-SD) process can include a belief propagation (BP) decoding algorithm in which the environment is given by the Tanner graph of the LDPC code, and the optimized sequence of actions, i.e., the scheduling of individual clusters, can be obtained using a suitable RL algorithm such as Q-learning. A single cluster scheduling step can be carried out by sending messages from all CNs of a cluster to their neighboring VNs, and subsequently sending messages from these VNs to their CN neighbors. That is, a selected cluster executes one iteration of flooding in each decoding instant. Every cluster is scheduled exactly once within a single decoder iteration. Sequential cluster scheduling can be carried out until a stopping condition is reached, or an iteration threshold is exceeded. The RL-SD method relies on a cluster scheduling policy based on an action-value function, which can be estimated using the RL techniques described herein.
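A single cluster-scheduling step as described above can be sketched as follows, using a min-sum approximation of the BP message updates; the small Hamming-code matrix, the two-cluster partition, and the helper name `schedule_cluster` are illustrative assumptions, not the disclosed decoder:

```python
import numpy as np

def schedule_cluster(H, cluster_rows, llr, m_c2v):
    """One cluster-scheduling step (min-sum sketch of BP): send messages from
    every CN in the cluster to its neighboring VNs, after refreshing the
    VN-to-CN messages those CNs consume. Returns the updated CN-to-VN message
    table and the vector of posterior LLRs."""
    # Extrinsic VN-to-CN messages: posterior minus the message on the same edge
    post = llr + m_c2v.sum(axis=0)
    m_v2c = (post[None, :] - m_c2v) * H
    # CN-to-VN min-sum update, restricted to the CNs of the scheduled cluster
    for c in cluster_rows:
        vns = np.flatnonzero(H[c])
        for v in vns:
            others = [u for u in vns if u != v]
            sign = np.prod(np.sign(m_v2c[c, others]))
            m_c2v[c, v] = sign * np.min(np.abs(m_v2c[c, others]))
    return m_c2v, llr + m_c2v.sum(axis=0)

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
llr = np.array([2.5, -1.0, 3.0, 4.0, 1.5, 2.0, 0.5])  # channel LLRs; v_1 is corrupted
m_c2v = np.zeros(H.shape)
for cluster in ([0, 1], [2]):  # two CN clusters, scheduled sequentially in one iteration
    m_c2v, post = schedule_cluster(H, cluster, llr, m_c2v)
x_hat = (post < 0).astype(int)
print(x_hat)  # -> [0 0 0 0 0 0 0]  (the corrupted bit is corrected)
```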
[0031] The (first) mobile communication device 110 can encode (e.g., with LDPC codes) and modulate a radiofrequency (RF) signal and transmit the RF signal which can be routed through the network 130 and transmitted to the (second) communication device 120, which can demodulate and decode the received RF signal to extract the voice data. In an exemplary embodiment, the first mobile communication device 110 can use LDPC codes for channel coding on the traffic channel. When the second mobile communication device 120 receives the RF signal, the second mobile communication device can extract the LDPC codes from the RF signal and use the extracted LDPC codes to correct channel errors by maintaining parity bits for data bits transmitted via the traffic channel. When a parity check failure is detected by the second mobile communication device 120 for one or more data bits, information from the multiple parity bits of the LDPC codes associated with the one or more data bits can be used by the second mobile communication device 120 to determine the original/correct value for the one or more data bits.
[0033] The memory 206 can include any suitable, non-transitory computer-readable storage medium, e.g., read-only memory (ROM), erasable programmable ROM (EPROM), electrically-erasable programmable ROM (EEPROM), random access memory (RAM), flash memory, and the like. In exemplary embodiments, an operating system 226 and an embodiment of the LDPC decoder 228 can be embodied as computer-readable/executable program code stored on the non-transitory computer-readable memory 206 and implemented using any suitable, high or low-level computing language, scripting language, or any suitable platform, such as, e.g., Java, C, C++, C#, assembly code, machine-readable language, Python, Rails, Ruby, and the like. The memory 206 can also store data to be used by and/or that is generated by the LDPC decoder 228. While memory 206 is depicted as a single component, those skilled in the art will recognize that the memory can be formed using multiple components and that separate non-volatile and volatile memory devices can be used.
[0034] One or more processing and logic devices 204 can be programmed and/or configured to facilitate an operation of the mobile communication device 200 and enable RF communications with other communication devices via a network (e.g., network 130). The processing and/or logic devices 204 can be programmed and/or configured to execute the operating system 226 and the LDPC decoder 228 to implement one or more processes to perform one or more operations (decoding of LDPC codes, error detection and correction). As an example, a microprocessor, micro-controller, central processing unit (CPU), or graphical processing unit (GPU) can be programmed to execute the LDPC decoder 228. As another example, the LDPC decoder 228 can be embodied and executed by an application-specific integrated circuit (ASIC). The processing and/or logic devices 204 can retrieve information/data from and store information/data to the memory 206. For example, the processing device 204 can retrieve and/or store LDPC codes and/or any other suitable information/data that can be utilized by the mobile communication device to perform error detection and correction using LDPC codes.
[0035] The LDPC decoder 228 can include a reinforcement learning (RL) software agent that can sequentially decode the low-density parity-check (LDPC) codes included in the RF signal via reinforcement learning (RL). The sequential decoding process implemented by the software agent can be trained to schedule all check nodes (CNs) in a cluster, and all clusters in every iteration, such that in each RL step, the software agent of the LDPC decoder 228 learns to schedule CN clusters sequentially depending on the reward associated with the outcome of scheduling a particular cluster.
[0036] The RF circuitry 214 can include an RF transceiver, one or more modulation circuits, one or more demodulation circuits, one or more multiplexers, and one or more demultiplexers. The RF circuitry 214 can be configured to transmit and/or receive wireless communications via an antenna 215 pursuant to, for example, the 3rd Generation Partnership Project (3GPP) specifications for 5G NR and/or the International Telecommunication Union (ITU) IMT-2020.
[0037] The display unit 208 can render user interfaces, such as graphical user interfaces (GUIs) to a user and in some embodiments can provide a mechanism that allows the user to interact with the GUIs. For example, a user may interact with the mobile communication device 200 through the display unit 208, which may be implemented as a liquid crystal touchscreen (or haptic) display, a light-emitting diode touchscreen display, and/or any other suitable display device, which may display one or more user interfaces that may be provided in accordance with exemplary embodiments.
[0038] The power source 212 can be implemented as a battery or capacitive elements configured to store an electric charge and power the mobile communication device 200. In exemplary embodiments, the power source 212 can be a rechargeable power source, such as a battery or one or more capacitive elements configured to be recharged via a connection to an external power supply.
[0040] The transmitted and the received words can be represented as x=[x.sub.0, . . . , x.sub.n−1] and y=[y.sub.0, . . . , y.sub.n−1], respectively, where for v ∈ [[n]], each transmitted bit satisfies x.sub.v ∈ {0,1} and each received value can be represented as y.sub.v=(−1).sup.x.sup.v+w.sub.v, with w.sub.v drawn from the Gaussian distribution N(0, σ.sup.2). The posterior log-likelihood ratio (LLR) of a transmitted bit x.sub.v given the channel output y.sub.v can be expressed as L.sub.v=ln(Pr[x.sub.v=0|y.sub.v]/Pr[x.sub.v=1|y.sub.v])=2y.sub.v/σ.sup.2. The posterior LLR computed by VN v during iteration I can be represented as L.sub.v.sup.(I)=Σ.sub.c∈N(v)m.sub.c→v.sup.(I)+L.sub.v, where L.sub.v.sup.(0)=L.sub.v and m.sub.c→v.sup.(I) is the message received by VN v from neighboring CN c in iteration I. Similarly, the posterior LLR computed during iteration I by VN j in the subgraph induced by the cluster with index a ∈ [[┌m/z┐]] can be represented as L.sub.j,a.sup.(I). Hence, L.sub.v.sup.(I)=L.sub.j,a.sup.(I) if VN v in the Tanner graph is also the jth VN in the subgraph induced by the cluster with index a.
[0041] After scheduling cluster a during iteration I, the output {circumflex over (x)}.sub.a.sup.(I) of cluster a, where l.sub.a≤z*k.sub.max is the number of VNs adjacent to cluster a, is obtained by taking hard decisions on the vector of posterior LLRs [L.sub.0,a.sup.(I), . . . , L.sub.l.sub.a.sub.−1,a.sup.(I)], computed according to {circumflex over (x)}.sub.j,a.sup.(I)=(1−sign(L.sub.j,a.sup.(I)))/2 for each j ∈ [[l.sub.a]].
[0042] The output {circumflex over (x)}.sub.a.sup.(I) of cluster a includes the bits reconstructed by the sequential decoder after scheduling cluster a during iteration I. An index of a realization of {circumflex over (x)}.sub.a.sup.(I) in iteration I can be denoted by s.sub.a.sup.(I) ∈ [[2.sup.l.sup.a]]; for example, by interpreting the binary vector {circumflex over (x)}.sub.a.sup.(I) as the binary representation of an integer, the index s.sub.a.sup.(I) can be obtained.
[0043] During the learning/training phase, embodiments of the RL process inform the software agent of the current state of the LDPC decoder and the reward obtained after performing an action (decoding a cluster). Based on these observations, the software agent of the LDPC decoder 228 can take future actions to enhance the total reward earned, which alters the state of the environment as well as the future reward. Given that the transmitted communication signal x is known during the training phase, a vector containing the l.sub.a bits of x that are reconstructed in the output {circumflex over (x)}.sub.a.sup.(I) of a cluster can be represented as x.sub.a=[x.sub.0,a, . . . , x.sub.l.sub.a.sub.−1,a]. The reward for scheduling cluster a can be given by

R.sub.a=(1/l.sub.a)Σ.sub.j∈[[l.sub.a]]1({circumflex over (x)}.sub.j,a.sup.(I)=x.sub.j,a),

where 1(·) denotes the indicator function. Thus, the reward earned by the software agent after scheduling cluster a is identical to the probability that the corrupted bits corresponding to the transmitted bits x.sub.0,a, . . . , x.sub.l.sub.a.sub.−1,a are correctly reconstructed.
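The reward described above, i.e., the fraction of the cluster's l.sub.a adjacent bits that are correctly reconstructed, can be sketched as follows (the helper name `cluster_reward` is an illustrative assumption):

```python
import numpy as np

def cluster_reward(x_hat_a, x_a):
    """Reward for scheduling cluster a: the average of the indicator function
    over the cluster's l_a adjacent VN bits, i.e. the fraction reconstructed
    correctly."""
    return float(np.mean(np.asarray(x_hat_a) == np.asarray(x_a)))

# 3 of the 4 bits adjacent to the cluster match the transmitted bits
print(cluster_reward([0, 1, 1, 0], [0, 1, 0, 0]))  # -> 0.75
```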
[0046] The RL-SD process illustrated by
[0047] With respect to the software agent learning a cluster scheduling policy, the state of the MDP after scheduling a cluster index a during learning step l can be denoted as {circumflex over (x)}.sub.a.sup.(l), and the index of a realization of {circumflex over (x)}.sub.a.sup.(l) can be referred to as s.sub.a ∈ [[2.sup.l.sup.a]]. Thus, s.sub.a also refers to the state of the MDP. The state space of the MDP contains all possible Σ.sub.a∈[[┌m/z┐]]2.sup.l.sup.a states, and the action space is A=[[┌m/z┐]]. Different Q-learning-based RL approaches can be used for solving the sequential decoding problem.
[0048] As an example using deep reinforcement learning (DRL), for MDPs with very large state spaces, the action-value function Q.sub.l(s, a) can be approximated as Q.sub.l(s, a; W) using a deep learning model with tensor W representing the weights connecting all layers in the neural network (NN). In each learning step l, a separate NN can be used, with weights W.sub.l.sup.(a) for each cluster, since a single NN cannot distinguish between the signals {circumflex over (x)}.sub.0.sup.(l), . . . , {circumflex over (x)}.sub.┌m/z┐−1.sup.(l), and hence cannot distinguish between the rewards R.sub.0, . . . , R.sub.┌m/z┐−1 generated by the ┌m/z┐ different clusters. The target of the NN corresponding to cluster a is given by

T.sub.l.sup.(a)=R.sub.l(s.sub.a, a, s′)+β max.sub.a′Q.sub.l(s′, a′; W.sub.l.sup.(a′)),

[0049] where the reward R.sub.l(s.sub.a, a, s′)=R.sub.a. Also, let Q.sub.l(s.sub.a, a; W.sub.l.sup.(a)) be the NN's prediction. In each DRL step, the mean squared error loss between T.sub.l.sup.(a) and Q.sub.l(s.sub.a, a; W.sub.l.sup.(a)) can be minimized using a gradient descent method. The NN corresponding to each cluster learns to map the cluster output {circumflex over (x)}.sub.a.sup.(l) to a vector of ┌m/z┐ predicted action-values, one per cluster.
[0050] During inference, the optimized cluster scheduling policy, π.sub.i.sup.*(I), for scheduling the ith cluster during decoder iteration I is expressed as

π.sub.i.sup.*(I)=argmax.sub.a Q(s.sub.a.sub.i−1, a; W.sup.(a)),

where s.sub.a.sub.i−1 denotes the state of the MDP after scheduling the previously selected cluster a.sub.i−1, and W.sup.(a) denotes the trained weights of the NN corresponding to cluster a.
[0051] As another example using standard Q-learning, for MDPs with moderately large state spaces, a standard Q-learning approach can be used for determining the optimal cluster scheduling order, where the action-value for choosing cluster a in state s.sub.a is given by

Q.sub.l+1(s.sub.a, a)=Q.sub.l(s.sub.a, a)+α[R.sub.a+β max.sub.a′Q.sub.l(s′, a′)−Q.sub.l(s.sub.a, a)], (8)

where s′ is the state reached after scheduling cluster a.
[0052] In each learning step l, cluster a can be selected via an ε-greedy approach according to

a=π.sub.Q.sup.(l) with probability 1−ε, or a chosen uniformly at random from [[┌m/z┐]] with probability ε,
[0053] where π.sub.Q.sup.(l)=argmax.sub.a∈[[┌m/z┐]]Q.sub.l(s.sub.a, a). For ties (as in the first iteration of the standard Q-learning algorithm shown in the figures), an action can be chosen uniformly at random from all the maximizing actions. After training, the optimized cluster scheduling policy π.sub.i.sup.*(I) for scheduling the ith cluster during decoder iteration I can be expressed as

π.sub.i.sup.*(I)=argmax.sub.a Q*(s.sub.a.sub.i−1, a),

[0054] where Q*(s.sub.a.sub.i−1, a) denotes the optimized action-value function and s.sub.a.sub.i−1 is the state of the MDP after scheduling the previously selected cluster a.sub.i−1.
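The ε-greedy selection rule can be sketched as follows; the function name and the toy Q-values are illustrative assumptions, with ε=0.6 chosen to match the learning parameters used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy_cluster(Q_row, epsilon=0.6):
    """Pick a cluster index: exploit argmax_a Q(s, a) with probability 1 - epsilon,
    otherwise explore uniformly over all clusters. Ties among the maxima are
    broken uniformly at random, as in the first Q-learning iteration."""
    n_clusters = len(Q_row)
    if rng.random() < epsilon:
        return int(rng.integers(n_clusters))       # explore
    best = np.flatnonzero(Q_row == Q_row.max())    # exploit, random tie-break
    return int(rng.choice(best))

# Over many draws, the highest-valued cluster is still chosen most often,
# since it receives the exploitation mass plus its share of exploration.
picks = [epsilon_greedy_cluster(np.array([0.1, 0.9, 0.3])) for _ in range(10000)]
counts = np.bincount(picks, minlength=3)
print(counts.argmax())  # -> 1
```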
[0055] The inputs to the standard Q-learning algorithm include a set L={L.sub.0, . . . , L.sub.|L|−1} containing |L| realizations of the channel LLR vector L over which Q-learning is performed, and a parity-check matrix H. The output is Q*(s.sub.a.sub.i−1, a). For each LLR realization in L, the action-value function in equation 8 can be recursively updated l.sub.max times.
Experimental Results
[0056] Experiments were performed to test the performance of the RL-SD process shown in
[0058] The LLR vectors used for training are sampled uniformly at random over a range of A equally spaced SNR values for a given code. Hence, there are |L|/A LLR vectors in L for each SNR value considered. For both considered codes (e.g., [384, 256]-WRAN and (3, 5)-AB LDPC codes), the learning parameters can be as follows: α=0.1, β=0.9, ε=0.6, l.sub.max=50, and |L|=5×10.sup.5, where |L| is chosen to ensure that the training is as accurate as possible without incurring excessive run-time for the standard Q-learning algorithm (e.g., an embodiment of which is shown in the figures).
[0059] For both training and inference, the AWGN channel is considered and all-zero codewords are transmitted using BPSK modulation. Training with the all-zero codeword is sufficient as, due to the symmetry of the BP decoder and the channel, the decoding error is independent of the transmitted signal.
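Generating all-zero-codeword training LLRs over the BPSK/AWGN channel described above can be sketched as follows; the Eb/N0-to-noise-variance conversion and the function name are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def channel_llrs(n, snr_db, rate, num_vectors):
    """Sample channel LLR vectors for the all-zero codeword over BPSK/AWGN.

    All-zero bits map to +1 under BPSK. The noise variance follows from
    Eb/N0 via sigma^2 = 1 / (2 * rate * 10^(snr_db / 10)), and the channel
    LLR of each received value y is 2 * y / sigma^2.
    """
    sigma2 = 1.0 / (2.0 * rate * 10.0 ** (snr_db / 10.0))
    y = 1.0 + rng.normal(0.0, np.sqrt(sigma2), size=(num_vectors, n))
    return 2.0 * y / sigma2

# Four training LLR vectors for a rate-2/3, length-384 code at 2 dB
L = channel_llrs(n=384, snr_db=2.0, rate=256 / 384, num_vectors=4)
print(L.shape)  # -> (4, 384)
# LLRs are positive on average, since the all-zero codeword maps to +1 symbols.
```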
Performance is evaluated in terms of the bit error rate (BER) and the frame error rate (FER), the latter given by Pr[{circumflex over (x)}≠x]. In the case of the WRAN LDPC code, only z=1 is considered, as this code has several degree-11 CNs which render both learning schemes too computationally intensive for z>1. On the other hand, for the AB code, multiple cluster sizes are chosen from z ∈ {1, 2, 3} for both the random and RL-SD schemes. For z ∈ {1, 2}, standard Q-learning can be employed to learn the cluster scheduling policy. For z=3, deep reinforcement learning (DRL) can be utilized, as standard Q-learning is not feasible due to the significantly increased state space. The same number of training examples are used for both standard Q-learning and DRL.
[0060] The BER vs. channel signal-to-noise ratio (SNR), in terms of Eb/N0 in dB, for the [384, 256]-WRAN and (3, 5) AB-LDPC codes using these decoding techniques are shown in
[0061] In Table 1, the average number of CN to VN messages propagated in the considered decoding schemes is compared to attain the results in
TABLE-US-00001 ([384, 256]-WRAN LDPC code)

  SNR (dB)        1      2      3
  flooding       6480   6422   5171
  random (z = 1) 6480   5827   3520
  RL-SD (z = 1)  6467   5450   3179

TABLE-US-00002 ((3, 5) AB-LDPC code)

  SNR (dB)         1      2      3
  flooding        63750  16409  8123
  random (z = 3)  44338  11102  5005
  RL-SD (z = 3)   40448  10694  4998
  random (z = 2)  36328  10254  4994
  RL-SD (z = 2)   31383   7349  4225
  random (z = 1)  59750  10692  4812
  RL-SD (z = 1)   51250   6240  3946

Table 1: Average number of CN to VN messages propagated in various decoding schemes for a [384, 256]-WRAN (left) and a (3, 5) AB-LDPC code (right) to attain the results shown in
[0062] Exemplary flowcharts are provided herein for illustrative purposes and are non-limiting examples of methods. One of ordinary skill in the art will recognize that exemplary methods may include more or fewer steps than those illustrated in the exemplary flowcharts, and that the steps in the exemplary flowcharts may be performed in a different order than the order shown in the illustrative flowcharts.
[0063] The foregoing description of the specific embodiments of the subject matter disclosed herein has been presented for purposes of illustration and description and is not intended to limit the scope of the subject matter set forth herein. It is fully contemplated that other various embodiments, modifications and applications will become apparent to those of ordinary skill in the art from the foregoing description and accompanying drawings. Thus, such other embodiments, modifications, and applications are intended to fall within the scope of the following appended claims. Further, those of ordinary skill in the art will appreciate that the embodiments, modifications, and applications that have been described herein are in the context of a particular environment, and the subject matter set forth herein is not limited thereto but can be beneficially applied in any number of other manners, environments, and purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the novel features and techniques as disclosed herein.