LEARNING AN OPTIMAL PRECODING POLICY FOR MULTI-ANTENNA COMMUNICATIONS
20230186079 · 2023-06-15
Abstract
Systems and methods for learning and applying an optimal precoding policy for multi-antenna communications in a Multiple Input Multiple Output (MIMO) system are disclosed.
Claims
1. A computer implemented method performed by an agent for training a first neural network that maps a Multiple Input Multiple Output, MIMO, channel state to a precoder in a continuous precoder space, the method comprising: initializing first neural network parameters, φ, of a first neural network, F.sub.φ(H), that estimates a first precoding policy that maps a channel state, H, for a MIMO system to a precoder, w, in a continuous precoder space; initializing second neural network parameters, θ, of a second neural network, S.sub.θ(H, w), that estimates a value function that maps the channel state, H, for the MIMO system and the precoder, w, in the continuous precoder space to a value, q, of the precoder, w, in the channel state H; initializing an initial channel state, H.sub.0, for the MIMO system based on a channel model for the MIMO system or a channel measurement in the MIMO system; and for each time t in a set of times t=0 to t=T−1, where T is a predefined integer value that is greater than 1: choosing or obtaining a precoder, w.sub.t, for a channel state, H.sub.t, that is to be executed or has been executed by a MIMO transmitter in the MIMO system; observing a parameter in the MIMO system as a result of execution of the precoder, w.sub.t; computing a reward, r.sub.t, based on the parameter; observing a channel state, H.sub.t+1, for time t+1; updating the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on an experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1]; computing a gradient, ∇.sub.φF.sub.φ, which is a gradient of the first neural network, F.sub.φ(H), with respect to the first neural network parameters, φ; computing a gradient, ∇.sub.wS.sub.θ, which is a gradient of the second neural network, S.sub.θ(H, w), with respect to the precoder, w; and updating the first neural network parameters, φ, of the first neural network, F.sub.φ(H), based on the gradient, ∇.sub.φF.sub.φ, and the gradient, ∇.sub.wS.sub.θ.
2. The method of claim 1 further comprising either: providing the first neural network parameters, φ, of the first neural network, F.sub.φ(H), to the MIMO system to be used by the MIMO system for precoder selection; or utilizing the first neural network, F.sub.φ(H), for precoder selection for the MIMO system during an execution phase.
3. The method of claim 1 wherein updating the first neural network parameters, φ, of the first neural network, F.sub.φ(H), based on the gradient, ∇.sub.φF.sub.φ, and the gradient, ∇.sub.wS.sub.θ, comprises updating the first neural network parameters, φ, of the first neural network, F.sub.φ(H), in accordance with a rule:
φ←φ+η∇.sub.φF.sub.φ(H)∇.sub.wS.sub.θ(H,w)|.sub.H=H.sub.t.sub.,w=w.sub.t.
4. The method of claim 1 wherein updating the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on the experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1] comprises updating the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on the experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1] in accordance with a Q-learning scheme.
5. The method of claim 1 wherein the parameter observed in the MIMO system as a result of execution of the precoder, w.sub.t, is block error rate.
6. The method of claim 1 wherein the parameter observed in the MIMO system as a result of execution of the precoder, w.sub.t, is throughput.
7. The method of claim 1 wherein the parameter observed in the MIMO system as a result of execution of the precoder, w.sub.t, is channel capacity.
8. The method of claim 1 wherein choosing or obtaining the precoder, w.sub.t, for the channel state, H.sub.t, comprises choosing the precoder, w.sub.t, for the channel state, H.sub.t, as:
w.sub.t=F.sub.φ(H.sub.t)+ε, where ε is an exploration noise.
9. The method of claim 8 further comprising providing the precoder, w.sub.t, to the MIMO system for execution by the MIMO transmitter.
10. The method of claim 8 or 9 wherein the exploration noise is a random noise in the continuous precoder space.
11. The method of claim 10 wherein the step of initializing the initial channel state, H.sub.0, and the steps of choosing or obtaining the precoder, w.sub.t, observing the parameter in the MIMO system, computing the reward, r.sub.t, observing the channel state, H.sub.t+1, updating the second neural network parameters, θ, computing the gradient, ∇.sub.φF.sub.φ, computing the gradient, ∇.sub.wS.sub.θ, and updating the first neural network parameters, φ, for each time t in the set of times t=0 to t=T−1 are repeated for two or more episodes, and a variance of the exploration noise varies over the two or more episodes.
12. The method of claim 11 wherein the variance of the exploration noise gets smaller over the two or more episodes.
13. The method of claim 1 wherein choosing or obtaining the precoder, w.sub.t, for the channel state, H.sub.t, comprises choosing the precoder, w.sub.t, for the channel state, H.sub.t, as:
w.sub.t={tilde over (F)}.sub.φ(H.sub.t), where {tilde over (F)}.sub.φ corresponds to the first neural network, F.sub.φ(H), but where an exploration noise is added to the first neural network parameters, φ.
14. The method of claim 13 further comprising providing the precoder, w.sub.t, to the MIMO system for execution by the MIMO transmitter.
15. The method of claim 13 wherein the exploration noise is a random noise in a parameter space of the first neural network, F.sub.φ(H).
16. The method of claim 15 wherein the step of initializing the initial channel state, H.sub.0, and the steps of choosing or obtaining the precoder, w.sub.t, observing the parameter in the MIMO system, computing the reward, r.sub.t, observing the channel state, H.sub.t+1, updating the second neural network parameters, θ, computing the gradient, ∇.sub.φF.sub.φ, computing the gradient, ∇.sub.wS.sub.θ, and updating the first neural network parameters, φ, for each time t in the set of times t=0 to t=T−1 are repeated for two or more episodes, and a variance of the exploration noise varies over two or more episodes.
17. The method of claim 16 wherein the variance of the exploration noise gets smaller over the two or more episodes.
18-29. (canceled)
30. A processing node that implements an agent for training a first neural network that maps a Multiple Input Multiple Output, MIMO, channel state to a precoder in a continuous precoder space, the processing node comprising processing circuitry configured to cause the processing node to: initialize first neural network parameters, φ, of a first neural network, F.sub.φ(H), that estimates a first precoding policy that maps a channel state, H, for a MIMO system to a precoder, w, in a continuous precoder space; initialize second neural network parameters, θ, of a second neural network, S.sub.θ(H, w), that estimates a value function that maps the channel state, H, for the MIMO system and the precoder, w, to a value, q, of the precoder, w, in the channel state, H; initialize an initial channel state, H.sub.0, for the MIMO system based on a channel model for the MIMO system or a channel measurement in the MIMO system; and for each time t in a set of times t=0 to t=T−1, where T is a predefined integer value that is greater than 1: choose or obtain a precoder, w.sub.t, for a channel state, H.sub.t, that is to be executed or has been executed by a MIMO transmitter in the MIMO system; observe a parameter in the MIMO system as a result of execution of the precoder, w.sub.t; compute a reward, r.sub.t, based on the parameter; observe a channel state, H.sub.t+1, for time t+1; update the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on an experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1]; compute a gradient, ∇.sub.φF.sub.φ, which is a gradient of the first neural network, F.sub.φ(H), with respect to the first neural network parameters, φ; compute a gradient, ∇.sub.wS.sub.θ, which is a gradient of the second neural network, S.sub.θ(H, w), with respect to the precoder, w; and update the first neural network parameters, φ, of the first neural network, F.sub.φ(H), based on the gradient, ∇.sub.φF.sub.φ, and the gradient, ∇.sub.wS.sub.θ.
31. A computer implemented method for precoder selection and application for a Multiple Input Multiple Output, MIMO, system comprising: selecting a precoder, w, for a MIMO transmitter of the MIMO system using a first neural network, F.sub.φ(H), that estimates a first precoding policy that maps a channel state, H, for the MIMO system to the precoder, w, in a continuous precoder space; and applying the selected precoder, w, in the MIMO transmitter.
32. The method of claim 31 wherein the method further comprises training the first neural network, F.sub.φ(H), based on a neural network parameter update rule:
φ←φ+η∇.sub.φF.sub.φ(H)∇.sub.wS.sub.θ(H,w)|.sub.H=H.sub.t.sub.,w=w.sub.t.
33-36. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
DETAILED DESCRIPTION
[0045] The embodiments set forth below represent information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure.
[0046] To address the gap between the unknown optimal solution for Multiple-Input Multiple-Output (MIMO) precoding on a per-RE basis and the conventional sub-optimal solution for MIMO precoding on a per-subband basis, a deep reinforcement learning-based precoding scheme is disclosed herein that can be used to learn an optimal precoding policy for very complex MIMO systems. As described herein, a Reinforcement Learning (RL) agent learns an optimal precoding policy in continuous precoder (i.e., action) space from experience data in a MIMO system. The RL agent interacts with an environment of the MIMO system and channel in an experience sequence of given channel states, precoders taken, and performance parameters (e.g., Block Error Rate (BER), throughput, or channel capacity). The goal of the RL agent is to learn a precoder policy that optimizes the performance parameter (e.g., minimizes BER, maximizes throughput, or maximizes channel capacity). To this end, in one embodiment, the MIMO precoding problem for a single-user (SU) MIMO system is modeled as a contextual-bandit problem in which the RL agent sequentially selects precoders to serve the environment of the MIMO system from a continuous precoder space, based on a precoder selection policy and contextual information about the environment conditions, while simultaneously adapting the precoder selection policy based on reward feedback (e.g., BER, throughput, or channel capacity) from the environment to maximize a numerical reward signal.
[0047] Now, a more detailed description of embodiments of the present disclosure will be provided. As illustrated in the accompanying drawings, a learning agent 102 learns a precoding policy through interaction with a SU-MIMO system 100.
[0048] Before describing the details of the learning agent 102, a description of the SU-MIMO system 100 is beneficial. In this regard, a precoding vector w ∈ ℂ.sup.n.sup.tx.sup.×1 is applied at the transmitter 200, and a combining vector r ∈ ℂ.sup.n.sup.rx.sup.×1 is applied at the receiver 202. At the transmitter 200, an encoder 208 encodes one transport bit stream into a bit block b.sub.tx, which is then symbol-mapped to modem symbols x by a mapper 210. A typical modem constellation is M-ary Quadrature Amplitude Modulation (M-QAM), which consists of a set of M constellation points. Then, a precoder 212 precodes the data symbols x by the precoding vector w to form n.sub.tx data substreams. Finally, the substreams are processed via respective Inverse Fast Fourier Transforms (IFFTs) 214-1 through 214-n.sub.tx to provide time-domain signals that are transmitted via the respective transmit antennas 204-1 through 204-n.sub.tx. In a similar manner, at the receiver 202, signals received via the receive antennas 206-1 through 206-n.sub.rx are transformed to the frequency domain via respective Fast Fourier Transforms (FFTs) 216-1 through 216-n.sub.rx. A combiner 218 combines the resulting data streams by applying the combining vector r to provide a combined signal z. A demapper 220 performs symbol demapping to provide a received bit block {circumflex over (b)}.sub.rx, which is then decoded by a decoder 222 to provide the received bit stream.
[0049] The set of data Resource Elements (REs) in a given subband is denoted herein by φ.sub.d, and a subband precoding application of a precoder w to the data REs i ∈ φ.sub.d is considered. Further, x.sub.i denotes the complex data symbol at RE i, and y.sub.i ∈ ℂ.sup.n.sup.rx.sup.×1 denotes the complex received signal vector at RE i. Then, the received signal at RE i can be written as:
y.sub.i=H.sub.iwx.sub.i+n.sub.i, Equation 1
where H.sub.i ∈ ℂ.sup.n.sup.rx.sup.×n.sup.tx represents the MIMO channel matrix between the transmit antennas 204-1 through 204-n.sub.tx and the receive antennas 206-1 through 206-n.sub.rx at RE i, and n.sub.i ∈ ℂ.sup.n.sup.rx.sup.×1 is an Additive White Gaussian Noise (AWGN) vector whose elements are i.i.d. complex-valued Gaussians with zero mean and variance σ.sub.n.sup.2. Without loss of generality, it is assumed that the data symbol x.sub.i and the precoding vector w are normalized so that E[|x.sub.i|.sup.2]=1 and ∥w∥.sup.2=1, where |·| denotes the absolute value of a complex value and ∥·∥ denotes the 2-norm of a vector. Under these assumptions, the SNR is given by 1/σ.sub.n.sup.2.
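The signal model of Equation (1) and the normalization assumptions above can be sketched numerically. This is an illustrative NumPy snippet under assumed dimensions (n.sub.tx=4, n.sub.rx=2), an assumed 10 dB SNR, and an assumed unit-power QPSK symbol; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx = 4, 2
snr_db = 10.0
sigma_n2 = 10 ** (-snr_db / 10)  # SNR = 1 / sigma_n^2

# Rayleigh-fading MIMO channel H_i (n_rx x n_tx) at one resource element
H = (rng.standard_normal((n_rx, n_tx))
     + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)

# Unit-norm precoding vector w (||w||^2 = 1)
w = rng.standard_normal(n_tx) + 1j * rng.standard_normal(n_tx)
w = w / np.linalg.norm(w)

# Unit-power data symbol x_i (E[|x_i|^2] = 1), here one QPSK point
x = (1 + 1j) / np.sqrt(2)

# AWGN vector with i.i.d. complex Gaussian entries of variance sigma_n^2
n = np.sqrt(sigma_n2 / 2) * (rng.standard_normal(n_rx)
                             + 1j * rng.standard_normal(n_rx))

# Equation (1): received signal at the RE
y = H @ w * x + n
```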
[0050] At the receiver 202, the transmitted data symbol x.sub.i can be recovered by combining the received symbols y.sub.i by the unit-norm vector r.sub.i (i.e., ∥r.sub.i∥.sup.2=1), which yields the estimated complex symbol z.sub.i as:
z.sub.i=r.sub.i.sup.+y.sub.i=r.sub.i.sup.+H.sub.iwx.sub.i+r.sub.i.sup.+n.sub.i, Equation 2
where (·).sup.+ denotes the conjugate transpose (Hermitian transpose) of a vector or matrix.
[0051] Note that r.sub.i.sup.+H.sub.iw in Equation (2) corresponds to the effective channel gain. It is assumed that a Maximal Ratio Combiner (MRC) is used at the receiver 202 (i.e., the combiner 218 is a MRC), which is optimal in the sense of output Signal to Noise Ratio (SNR) maximization when the noise is white.
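The MRC combining of Equation (2) can be sketched as follows: the unit-norm combiner is aligned with the effective channel Hw, which maximizes the effective channel gain r.sup.+Hw under white noise. Dimensions and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx = 4, 2

H = (rng.standard_normal((n_rx, n_tx))
     + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
w = rng.standard_normal(n_tx) + 1j * rng.standard_normal(n_tx)
w = w / np.linalg.norm(w)

# Maximal Ratio Combiner: unit-norm r aligned with the effective channel H w
h_eff = H @ w
r = h_eff / np.linalg.norm(h_eff)

# Effective channel gain r^+ H w of Equation (2); np.vdot conjugates its
# first argument, implementing the conjugate-transpose operation (.)^+
gain = np.vdot(r, h_eff)
```

With this choice of r, the gain equals ∥Hw∥, the largest value achievable by any unit-norm combiner.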
[0052] As mentioned above, the optimal precoding solution is given by a channel-dependent precoder on a per-RE basis. In other words, an optimal precoder w.sub.i is chosen that maximizes the effective channel gain r.sub.i.sup.+H.sub.iw.sub.i on a per-RE basis. However, in practical MIMO-OFDM systems, a precoder is chosen on a per-subband basis, achieving a tradeoff between performance and complexity. A practical subband-precoding solution is obtained based on a spatial channel covariance matrix averaged over the pilot signals in a given subband. The set of pilot REs in a given subband is denoted by φ.sub.p. The channel covariance matrix is given by:

R=(1/|φ.sub.p|)Σ.sub.i∈φ.sub.p H.sub.i.sup.+H.sub.i. Equation 3

Unfortunately, the conventional solution based on this covariance matrix is sub-optimal, and furthermore no truly optimal solution has been found for this setting to date.
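The conventional covariance-based subband precoder described above can be sketched as follows. One common realization, assumed here for illustration, takes the precoder as the principal eigenvector of the averaged covariance matrix; the dimensions, pilot count, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx, n_pilots = 4, 2, 8

# Channel matrices H_i at the pilot REs i in phi_p of one subband
H_pilots = [(rng.standard_normal((n_rx, n_tx))
             + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
            for _ in range(n_pilots)]

# Spatial channel covariance averaged over the pilot REs (n_tx x n_tx)
R = sum(H.conj().T @ H for H in H_pilots) / n_pilots

# Assumed conventional choice: precoder = principal eigenvector of R
eigvals, eigvecs = np.linalg.eigh(R)   # ascending eigenvalues (R is Hermitian)
w_subband = eigvecs[:, -1]
```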
[0053] In what follows, instead of approximating an optimal precoder based on the spatial channel covariance matrix, a learning scheme is described in which the learning agent 102 learns an optimal precoding policy directly from interactions with the complex real-world MIMO environment.
[0054] The learning agent 102 learns a precoding policy that optimizes a performance parameter through an experience sequence of given channel matrices, the precoders taken, and the values of the performance parameter achieved. In the remaining description, the performance parameter is BER. However, the performance parameter is not limited thereto. Other examples of the performance parameter are throughput and channel capacity.
[0055] Returning to the operation of the learning agent 102, the environmental state at time t is defined from the channel matrices at the pilot REs as:
H.sub.t={[vec(Re[H.sub.j]).sup.T,vec(Im[H.sub.j]).sup.T].sup.T}.sub.j∈φ.sub.p, Equation 4
where Re[·] and Im[·] represent the real and imaginary parts of the complex-valued MIMO channel matrix. Note that, regarding notation, H.sub.j is used herein to denote the channel matrix at RE j or i, whereas H.sub.t is used herein to denote the environmental state at time t given by a single channel matrix H.sub.j or a set of channel matrices H.sub.j in pilot REs j at the time t.
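The state stacking described above can be sketched as a small helper. It is an illustrative implementation that assumes vec(·) denotes column-wise stacking (`order="F"`); the function name and example dimensions are hypothetical.

```python
import numpy as np

def channel_state(H_list):
    """Build a real-valued state vector by stacking
    [vec(Re[H_j])^T, vec(Im[H_j])^T]^T over the pilot REs j."""
    parts = [np.concatenate([H.real.ravel(order="F"),
                             H.imag.ravel(order="F")])
             for H in H_list]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
# Three pilot-RE channel matrices, each n_rx x n_tx = 2 x 4
H_list = [rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))
          for _ in range(3)]
s = channel_state(H_list)  # length 3 * 2 * (2 * 4) = 48
```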
[0056] Note that, in one embodiment, the ambiguity in phase information of the channel matrix H is removed. For instance, the channel matrix H with size n.sub.r×n.sub.t can be scaled by the phase of the element corresponding to the first transmit and first receive antenna, denoted by H(1,1), i.e., H←H·e.sup.−j∠H(1,1). In addition, in one embodiment, the ambiguity in amplitude information of the channel matrix H is removed. For instance, the channel matrix H with size n.sub.r×n.sub.t can be scaled by its Frobenius norm, denoted by ∥H∥.sub.F, i.e., H←H/∥H∥.sub.F.
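The two normalizations above can be sketched together. Note that H(1,1) corresponds to `H[0, 0]` in 0-based Python indexing; the function name is a hypothetical label.

```python
import numpy as np

def normalize_channel(H):
    """Remove the phase ambiguity via the phase of H(1,1), then the
    amplitude ambiguity via the Frobenius norm, as described above."""
    H = H * np.exp(-1j * np.angle(H[0, 0]))  # H(1,1) becomes real-valued
    return H / np.linalg.norm(H)             # unit Frobenius norm

rng = np.random.default_rng(0)
H = rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))
Hn = normalize_channel(H)
```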
[0057] The learning agent 102 chooses a precoder w.sub.t in the MIMO channel state H.sub.t according to the precoder policy, and the chosen precoder w.sub.t is applied to the MIMO system 100 to get an experimental BER performance as feedback. In particular, in one example, the BER performance is calculated by comparing the transmit code block b.sub.tx and the receive code block {circumflex over (b)}.sub.rx, as they represent the action value of the precoder w.sub.t over the MIMO channel state H.sub.t without the help of channel coding. The experimental BER is represented by:

BER.sub.exp.sup.t=BER(b.sub.tx,{circumflex over (b)}.sub.rx|H.sub.t,w.sub.t). Equation 5
One example of a reward function computed based on this feedback, with r.sub.t ∈ [−0.5, +0.5], is:
r.sub.t=log.sub.2(1−BER.sub.exp.sup.t)+0.5, Equation 6
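Equation (6) maps a BER of 0 to the maximum reward +0.5 and the worst-case BER of 0.5 to the minimum reward −0.5. It can be written directly as:

```python
import numpy as np

def reward(ber_exp):
    """Equation (6): map an experimental BER in [0, 0.5]
    to a reward r_t in [-0.5, +0.5]."""
    return np.log2(1.0 - ber_exp) + 0.5
```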
[0058] As illustrated in the accompanying drawings, the learning agent 102 maintains two neural networks: a first (actor) neural network F.sub.φ(H) that estimates the precoding policy, and a second (critic) neural network S.sub.θ(H, w) that estimates the value of a precoder w in a channel state H.
[0059] During the training phase, the first neural network F.sub.φ(H) is used to select a precoder in such a way that different actions are explored for a same MIMO channel state H. Note that, in some embodiments, the output of the first neural network F.sub.φ(H) is transformed in the form of a precoder vector or matrix for the MIMO transmission. For example, for digital precoding with unit-power constraint, the transformation includes a procedure for the precoder vector or matrix to have unit Frobenius norm. As another example, for analog precoding with constant modulus constraint, the transformation includes a procedure for each element of the precoder vector or matrix to have unit amplitude. In another example, the precoder w is processed to provide a precoder matrix whose row vectors have a unit norm.
[0060] At each time t, the precoder is executed by the MIMO system 100 in MIMO channel state H.sub.t to provide a reward r.sub.t, generating the experience of [H.sub.t, w.sub.t, r.sub.t]. Through the experiences [s.sub.t, a.sub.t, r.sub.t]=[H.sub.t, w.sub.t, r.sub.t], the second neural network S.sub.θ is trained by a Q-learning scheme to estimate the value of a given MIMO channel state and chosen precoder. At the same time, the first neural network F.sub.φ is trained by utilizing the gradient of the second neural network S.sub.θ to update the neural network parameters φ of F.sub.φ in the direction of the performance gradient. More specifically, the first neural network F.sub.φ is trained by the following parameter update rule:
φ←φ+η∇.sub.φF.sub.φ(H)∇.sub.wS.sub.θ(H,w)|.sub.H=H.sub.t.sub.,w=w.sub.t, Equation 7

where η is a learning rate, ∇.sub.φF.sub.φ is the gradient of F.sub.φ with respect to φ, and ∇.sub.wS.sub.θ is the gradient of S.sub.θ with respect to the chosen precoder w (i.e., the action). The operation of the learning agent 102 to train the first neural network F.sub.φ using the above parameter update rule is illustrated in the accompanying drawings.
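The chain rule behind Equation (7) can be made concrete with a deliberately simple sketch: a linear actor w=Φh and a linear critic q=θ·[h; w], where h is a real-valued state feature vector. These linear stand-ins, the dimensions, and the variable names are assumptions for illustration only; for a linear actor, ∇.sub.φF.sub.φ∇.sub.wS.sub.θ reduces to the outer product of the critic's action gradient with the state.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, eta = 8, 4, 0.01          # state dim, precoder dim, learning rate

Phi = rng.standard_normal((m, d)) * 0.1   # linear actor F_phi: w = Phi @ h
theta = rng.standard_normal(d + m) * 0.1  # linear critic S_theta: q = theta . [h; w]

h = rng.standard_normal(d)                # state features
w = Phi @ h                               # actor's precoder

q_before = theta @ np.concatenate([h, w])

# For this linear critic, the gradient of S_theta w.r.t. the action w is
# simply the last m critic weights
grad_w_S = theta[d:]

# Equation (7): Phi <- Phi + eta * grad_phi(F_phi) * grad_w(S_theta);
# with w = Phi h, the chain rule gives grad_Phi q = outer(grad_w_S, h)
Phi = Phi + eta * np.outer(grad_w_S, h)

q_after = theta @ np.concatenate([h, Phi @ h])
```

Because the update ascends the critic's estimate, the value of the (fixed) critic at the new action is never lower than before the step.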
[0061] In one embodiment, during the training phase, the first neural network F.sub.φ(H) is used to select a precoder in such a way that different precoders (i.e., different actions) are explored for the same MIMO channel state H. In this regard, in one example, an exploration noise ε is sampled from a Gaussian random process and added to the output of the first neural network as follows:

w.sub.t=F.sub.φ(H.sub.t)+ε. Equation 8

In another example, a random parameter noise is added to the parameters φ of the first neural network, i.e.,

w.sub.t={tilde over (F)}.sub.φ(H.sub.t), Equation 9

where {tilde over (F)}.sub.φ denotes the first neural network F.sub.φ with the random noise added to its parameters φ.
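The two exploration variants of Equations (8) and (9) can be sketched side by side, again using the hypothetical linear actor w=Φh for illustration; σ is an assumed exploration standard deviation.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, sigma = 8, 4, 0.1
Phi = rng.standard_normal((m, d)) * 0.1   # stand-in for the actor parameters phi
h = rng.standard_normal(d)                # stand-in for the channel state H_t

# Equation (8): action-space exploration, w_t = F_phi(H_t) + eps
w_action = Phi @ h + sigma * rng.standard_normal(m)

# Equation (9): parameter-space exploration, perturb phi itself
Phi_noisy = Phi + sigma * rng.standard_normal((m, d))
w_param = Phi_noisy @ h
```

In both variants, σ can be decayed across episodes so that exploration shrinks as the policy improves, as the flow below describes.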
[0063] The learning agent 102 sets the episode index ep to 1 (step 702), and initializes the MIMO channel state for time t=0 (i.e., H.sub.0) (step 704). The MIMO channel state H.sub.0 may be initialized based on a known MIMO channel model for the MIMO system 100 or based on a channel measurement from the MIMO system 100. The learning agent 102 sets a time index t equal to 0 (step 706).
[0064] The learning agent 102 chooses a precoder w.sub.t=F.sub.φ(H.sub.t)+ε to be executed by a MIMO transmitter in the MIMO system 100, where, as discussed above, ε is an exploration noise (step 708). As discussed above, in one embodiment, the exploration noise ε is a noise vector sampled from a Gaussian random process. In one embodiment, the exploration noise ε is a random noise in the continuous precoder space. In one embodiment, a variance of the exploration noise ε varies over training episodes. In one embodiment, the variance of the exploration noise ε gets smaller over training episodes. In an alternative embodiment, the learning agent 102 chooses a precoder w.sub.t={tilde over (F)}.sub.φ(H.sub.t), where {tilde over (F)}.sub.φ denotes a modified version of F.sub.φ in which a random noise is added to the neural network parameters φ of the first neural network F.sub.φ. In one embodiment, a variance of this parameter noise varies over training episodes. In one embodiment, the variance of this parameter noise gets smaller over training episodes.
[0065] The learning agent 102 executes the chosen precoder w.sub.t (i.e., the action) in the MIMO system 100 (step 710). In other words, the learning agent 102 provides the chosen precoder w.sub.t to the MIMO system 100 for execution (i.e., use) in the MIMO system 100. The learning agent 102 observes the experimental BER.sub.exp.sup.t in the MIMO system 100 for time t and computes the reward r.sub.t (step 712). In one example, the reward r.sub.t is computed in accordance with Equation (6). The learning agent 102 observes the next MIMO channel state H.sub.t+1 in the MIMO system 100 (step 714).
[0066] The learning agent 102 updates the neural network parameters θ of the second (critic) neural network S.sub.θ via Q-learning on the experience [s.sub.t, a.sub.t, r.sub.t, s.sub.t+1] (step 716). The learning agent 102 also computes the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ(step 718) and updates the neural network parameters φ of the first (actor) neural network F.sub.φ based on the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ in accordance with the parameter update rule of Equation (7) (step 720).
[0067] The learning agent 102 determines whether the last iteration for the current training episode has been reached (i.e., whether t<T−1) (step 722). If the last iteration has not been reached (i.e., if t<T−1), the learning agent increments t (step 724), and the process returns to step 708 and is repeated for the next iteration. Once the last iteration for the current training episode has been reached, the learning agent 102 determines whether the last episode has been reached (i.e., determines whether ep<E) (step 726). If not, the learning agent 102 increments the episode index ep (step 728), and the process returns to step 704 and is repeated for the next episode. Once the last episode has been reached, the training process ends and an execution phase begins. For the execution phase, the learning agent 102 provides the trained model (e.g., provides the neural network parameters φ of the first neural network F.sub.φ) to the MIMO system 100, or utilizes the trained model (e.g., utilizes the first neural network F.sub.φ for precoder selection for the MIMO system 100). Thus, in the execution phase, a MIMO transmitter within the MIMO system 100 transmits a signal using the precoder selected by the trained first neural network F.sub.φ.
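The episodic training loop of steps 702 through 728 can be sketched end-to-end. This is a toy illustration under stated assumptions: linear actor and critic stand-ins, a synthetic quadratic reward in place of the BER-based feedback of Equation (6), and a per-episode decay of the exploration variance. It demonstrates the data flow of the training process, not a convergent precoder design; every name and constant is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 2                 # assumed state / precoder dimensions
E, T = 30, 50               # episodes and iterations per episode
eta, alpha = 0.01, 0.01     # actor and critic learning rates

W_true = 0.3 * rng.standard_normal((m, d))   # hypothetical "best" linear policy

def env_reward(h, w):
    # Toy stand-in for the BER-based reward: larger when w is close to
    # the target W_true @ h, which is unknown to the agent.
    return -np.sum((w - W_true @ h) ** 2)

Phi = np.zeros((m, d))      # linear actor F_phi: w = Phi @ h
theta = np.zeros(d + m)     # linear critic S_theta: q = theta . [h; w]

for ep in range(E):                               # steps 702/726/728
    sigma = 0.5 * 0.9 ** ep                       # exploration variance decays
    for t in range(T):                            # steps 706/722/724
        h = rng.standard_normal(d)                # stand-in for H_t (step 704/714)
        w = Phi @ h + sigma * rng.standard_normal(m)   # step 708, Equation (8)
        r = env_reward(h, w)                      # steps 710-712
        x = np.concatenate([h, w])
        theta -= alpha * (theta @ x - r) * x      # step 716: critic regression
        Phi += eta * np.outer(theta[d:], h)       # steps 718-720, Equation (7)
```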
[0068] In the embodiments described above, the learning agent 102 chooses the precoder w.sub.t for each training iteration. However, the present disclosure is not limited thereto.
[0070] The learning agent 102 sets the episode index ep to 1 (step 902), and initializes MIMO channel state for time t=0 (i.e., H.sub.0) (step 904). The MIMO channel state H.sub.0 may be initialized based on a known MIMO channel model for the MIMO system 100 or based on a channel measurement from the MIMO system 100. The learning agent 102 sets a time index t equal to 0 (step 906).
[0071] The learning agent 102 observes a precoder w.sub.t executed in the MIMO system 100 (step 908). As discussed above, the precoder w.sub.t is selected in the MIMO system 100 in accordance with a conventional precoder selection scheme. The learning agent 102 observes the experimental BER.sub.exp.sup.t in the MIMO system 100 for time t and computes the reward r.sub.t (step 910). In one example, the reward r.sub.t is computed in accordance with Equation (6). The learning agent 102 observes the next MIMO channel state H.sub.t+1 in the MIMO system 100 (step 912).
[0072] The learning agent 102 updates the neural network parameters θ of the second (critic) neural network S.sub.θ via Q-learning on the experience [s.sub.t, a.sub.t, r.sub.t, s.sub.t+1] (step 914). The learning agent 102 also computes the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ(step 916) and updates the neural network parameters φ of the first (actor) neural network F.sub.φ based on the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ in accordance with the parameter update rule of Equation (7) (step 918).
[0073] The learning agent 102 determines whether the last iteration for the current training episode has been reached (i.e., whether t<T−1) (step 920). If the last iteration has not been reached (i.e., if t<T−1), the learning agent increments t (step 922), and the process returns to step 908 and is repeated for the next iteration. Once the last iteration for the current training episode has been reached, the learning agent 102 determines whether the last episode has been reached (i.e., determines whether ep<E) (step 924). If not, the learning agent 102 increments the episode index ep (step 926), and the process returns to step 904 and is repeated for the next episode. Once the last episode has been reached, the training process ends.
[0074] It should be noted that, once the first neural network F.sub.φ is trained, the first neural network F.sub.φ can be used for selecting the precoder w for the MIMO system 100 during an execution phase. During the execution phase, training of the first and second neural networks may cease or may only be performed occasionally (e.g., periodically).
[0076] Once the first neural network F.sub.φ is trained, the learning agent 102 or the MIMO system 100 uses the first neural network F.sub.φ to select a precoder w for a MIMO transmitter of the MIMO system 100 (step 1004). The MIMO system 100 then applies the selected precoder w in the MIMO transmitter (step 1006).
[0077] Optionally, the MIMO system 100 or the learning agent 102 determines whether to fall back to the fallback precoder (e.g., if the performance of the first neural network F.sub.φ falls below a predefined or preconfigured threshold) (step 1008). If so, the process returns to step 1000. Otherwise, the process returns to step 1004.
[0080] In this example, functions 1210 of the learning agent 102 described herein are implemented at the one or more processing nodes 1200 or distributed across two or more of the processing nodes 1200 in any desired manner. In some particular embodiments, some or all of the functions 1210 of the learning agent 102 described herein are implemented as virtual components executed by one or more virtual machines implemented in a virtual environment(s) hosted by the processing node(s) 1200.
[0081] In some embodiments, a computer program is provided that includes instructions which, when executed by at least one processor, cause the at least one processor to carry out the functionality of the learning agent 102 or a processing node(s) 1100 or 1200 implementing one or more of the functions of the learning agent 102 in a virtual environment according to any of the embodiments described herein. In some embodiments, a carrier comprising the aforementioned computer program is provided. The carrier is one of an electronic signal, an optical signal, a radio signal, or a computer readable storage medium (e.g., a non-transitory computer readable medium such as memory).
[0083] Any appropriate steps, methods, features, functions, or benefits disclosed herein may be performed through one or more functional units or modules of one or more virtual apparatuses. Each virtual apparatus may comprise a number of these functional units. These functional units may be implemented via processing circuitry, which may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include Digital Signal Processors (DSPs), special-purpose digital logic, and the like. The processing circuitry may be configured to execute program code stored in memory, which may include one or several types of memory such as Read Only Memory (ROM), Random Access Memory (RAM), cache memory, flash memory devices, optical storage devices, etc. Program code stored in memory includes program instructions for executing one or more telecommunications and/or data communications protocols as well as instructions for carrying out one or more of the techniques described herein. In some implementations, the processing circuitry may be used to cause the respective functional unit to perform corresponding functions according to one or more embodiments of the present disclosure.
[0084] While processes in the figures may show a particular order of operations performed by certain embodiments of the present disclosure, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).
[0085] At least some of the following abbreviations may be used in this disclosure. If there is an inconsistency between abbreviations, preference should be given to how an abbreviation is used above. If listed multiple times below, the first listing should be preferred over any subsequent listing(s).
[0086] 3GPP Third Generation Partnership Project
[0087] 5G Fifth Generation
[0088] 5GS Fifth Generation System
[0089] ASIC Application Specific Integrated Circuit
[0090] CPU Central Processing Unit
[0091] DSP Digital Signal Processor
[0092] FPGA Field Programmable Gate Array
[0093] LTE Long Term Evolution
[0094] Those skilled in the art will recognize improvements and modifications to the embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein.