LEARNING AN OPTIMAL PRECODING POLICY FOR MULTI-ANTENNA COMMUNICATIONS
20230186079 · 2023-06-15
Abstract
Systems and methods for learning and applying an optimal precoding policy for multi-antenna communications in a Multiple Input Multiple Output (MIMO) system are disclosed.
Claims
1. A computer implemented method performed by an agent for training a first neural network that maps a Multiple Input Multiple Output, MIMO, channel state to a precoder in a continuous precoder space, the method comprising: initializing first neural network parameters, φ, of a first neural network, F.sub.φ(H), that estimates a first precoding policy that maps a channel state, H, for a MIMO system to a precoder, w, in a continuous precoder space; initializing second neural network parameters, θ, of a second neural network, S.sub.θ(H, w), that estimates a value function that maps the channel state, H, for the MIMO system and the precoder, w, in the continuous precoder space to a value, q, of the precoder, w, in the channel state H; initializing an initial channel state, H.sub.0, for the MIMO system based on a channel model for the MIMO system or a channel measurement in the MIMO system; and for each time t in a set of times t=0 to t=T−1, where T is a predefined integer value that is greater than 1: choosing or obtaining a precoder, w.sub.t, for a channel state, H.sub.t, that is to be executed or has been executed by a MIMO transmitter in the MIMO system; observing a parameter in the MIMO system as a result of execution of the precoder, w.sub.t; computing a reward, r.sub.t, based on the parameter; observing a channel state, H.sub.t+1, for time t+1; updating the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on an experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1]; computing a gradient, ∇.sub.φF.sub.φ, which is a gradient of the first neural network, F.sub.φ(H), with respect to the first neural network parameters, φ; computing a gradient, ∇.sub.wS.sub.θ, which is a gradient of the second neural network, S.sub.θ(H, w), with respect to the precoder, w; and updating the first neural network parameters, φ, of the first neural network, F.sub.φ(H), based on the gradient, ∇.sub.φF.sub.φ, and the gradient, ∇.sub.wS.sub.θ.
2. The method of claim 1 further comprising either: providing the first neural network parameters, φ, of the first neural network, F.sub.φ(H), to the MIMO system to be used by the MIMO system for precoder selection; or utilizing the first neural network, F.sub.φ(H), for precoder selection for the MIMO system during an execution phase.
3. The method of claim 1 wherein updating the first neural network parameters, φ, of the first neural network, F.sub.φ(H), based on the gradient, ∇.sub.φF.sub.φ, and the gradient, ∇.sub.wS.sub.θ, comprises updating the first neural network parameters, φ, of the first neural network, F.sub.φ(H), in accordance with a rule:
φ←φ+η∇.sub.φF.sub.φ(H)∇.sub.wS.sub.θ(H,w)|.sub.H=H.sub.t.sub.,w=w.sub.t.
4. The method of claim 1 wherein updating the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on the experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1] comprises updating the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on the experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1] in accordance with a Q-learning scheme.
5. The method of claim 1 wherein the parameter observed in the MIMO system as a result of execution of the precoder, w.sub.t, is block error rate.
6. The method of claim 1 wherein the parameter observed in the MIMO system as a result of execution of the precoder, w.sub.t, is throughput.
7. The method of claim 1 wherein the parameter observed in the MIMO system as a result of execution of the precoder, w.sub.t, is channel capacity.
8. The method of claim 1 wherein choosing or obtaining the precoder, w.sub.t, for the channel state, H.sub.t, comprises choosing the precoder, w.sub.t, for the channel state, H.sub.t, as:
w.sub.t=F.sub.φ(H.sub.t)+ε, where ε is an exploration noise.
9. The method of claim 8 further comprising providing the precoder, w.sub.t, to the MIMO system for execution by the MIMO transmitter.
10. The method of claim 8 or 9 wherein the exploration noise is a random noise in the continuous precoder space.
11. The method of claim 10 wherein the step of initializing the initial channel state, H.sub.0, and the steps of choosing or obtaining the precoder, w.sub.t, observing the parameter in the MIMO system, computing the reward, r.sub.t, observing the channel state, H.sub.t+1, updating the second neural network parameters, θ, computing the gradient, ∇.sub.φF.sub.φ, computing the gradient, ∇.sub.wS.sub.θ, and updating the first neural network parameters, φ, for each time t in the set of times t=0 to t=T−1 are repeated for two or more episodes, and a variance of the exploration noise varies over the two or more episodes.
12. The method of claim 11 wherein the variance of the exploration noise gets smaller over the two or more episodes.
13. The method of claim 1 wherein choosing or obtaining the precoder, w.sub.t, for the channel state, H.sub.t, comprises choosing the precoder, w.sub.t, for the channel state, H.sub.t, as:
w.sub.t={tilde over (F)}.sub.φ(H.sub.t), where {tilde over (F)}.sub.φ corresponds to the first neural network, F.sub.φ(H), but where an exploration noise is added to the first neural network parameters, φ.
14. The method of claim 13 further comprising providing the precoder, w.sub.t, to the MIMO system for execution by the MIMO transmitter.
15. The method of claim 13 wherein the exploration noise is a random noise in a parameter space of the first neural network, F.sub.φ(H).
16. The method of claim 15 wherein the step of initializing the initial channel state, H.sub.0, and the steps of choosing or obtaining the precoder, w.sub.t, observing the parameter in the MIMO system, computing the reward, r.sub.t, observing the channel state, H.sub.t+1, updating the second neural network parameters, θ, computing the gradient, ∇.sub.φF.sub.φ, computing the gradient, ∇.sub.wS.sub.θ, and updating the first neural network parameters, φ, for each time t in the set of times t=0 to t=T−1 are repeated for two or more episodes, and a variance of the exploration noise varies over two or more episodes.
17. The method of claim 16 wherein the variance of the exploration noise gets smaller over the two or more episodes.
18-29. (canceled)
30. A processing node that implements an agent for training a first neural network that maps a Multiple Input Multiple Output, MIMO, channel state to a precoder in a continuous precoder space, the processing node comprising processing circuitry configured to cause the processing node to: initialize first neural network parameters, φ, of a first neural network, F.sub.φ(H), that estimates a first precoding policy that maps a channel state, H, for a MIMO system to a precoder, w, in a continuous precoder space; initialize second neural network parameters, θ, of a second neural network, S.sub.θ(H, w), that estimates a value function that maps the channel state, H, for the MIMO system and the precoder, w, to a value, q, of the precoder, w, in the channel state, H; initialize an initial channel state, H.sub.0, for the MIMO system based on a channel model for the MIMO system or a channel measurement in the MIMO system; and for each time t in a set of times t=0 to t=T−1, where T is a predefined integer value that is greater than 1: choose or obtain a precoder, w.sub.t, for a channel state, H.sub.t, that is to be executed or has been executed by a MIMO transmitter in the MIMO system; observe a parameter in the MIMO system as a result of execution of the precoder, w.sub.t; compute a reward, r.sub.t, based on the parameter; observe a channel state, H.sub.t+1, for time t+1; update the second neural network parameters, θ, of the second neural network, S.sub.θ(H, w), based on an experience [H.sub.t, w.sub.t, r.sub.t, H.sub.t+1]; compute a gradient, ∇.sub.φF.sub.φ, which is a gradient of the first neural network, F.sub.φ(H), with respect to the first neural network parameters, φ; compute a gradient, ∇.sub.wS.sub.θ, which is a gradient of the second neural network, S.sub.θ(H, w), with respect to the precoder, w; and update the first neural network parameters, φ, of the first neural network, F.sub.φ(H), based on the gradient, ∇.sub.φF.sub.φ, and the gradient, ∇.sub.wS.sub.θ.
31. A computer implemented method for precoder selection and application for a Multiple Input Multiple Output, MIMO, system comprising: selecting a precoder, w, for a MIMO transmitter of the MIMO system using a first neural network, F.sub.φ(H), that estimates a first precoding policy that maps a channel state, H, for the MIMO system to the precoder, w, in a continuous precoder space; and applying the selected precoder, w, in the MIMO transmitter.
32. The method of claim 31 wherein the method further comprises training the first neural network, F.sub.φ(H), based on a neural network parameter update rule:
φ←φ+η∇.sub.φF.sub.φ(H)∇.sub.wS.sub.θ(H,w)|.sub.H=H.sub.t.sub.,w=w.sub.t.
33-36. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0033] The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
DETAILED DESCRIPTION
[0045] The embodiments set forth below represent information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure.
[0046] To address the gap between the unknown optimal solution for Multiple-Input Multiple-Output (MIMO) precoding on a per-RE basis and the conventional sub-optimal solution for MIMO precoding on a per-subband basis, a deep reinforcement learning-based precoding scheme is disclosed herein that can be used to learn an optimal precoding policy for very complex MIMO systems. As described herein, a Reinforcement Learning (RL) agent learns an optimal precoding policy in continuous precoder (i.e., action) space from experience data in a MIMO system. The RL agent interacts with an environment of the MIMO system and channel in an experience sequence of given channel states, precoders taken, and performance parameters (e.g., Block Error Rate (BER), throughput, or channel capacity). The goal of the RL agent is to learn a precoder policy that optimizes the performance parameter (e.g., minimizes BER, maximizes throughput, or maximizes channel capacity). To this end, in one embodiment, the MIMO precoding problem for a single-user (SU) MIMO system is modeled as a contextual-bandit problem in which the RL agent sequentially selects precoders to serve the environment of the MIMO system from a continuous precoder space, based on a precoder selection policy and contextual information about the environment conditions, while simultaneously adapting the precoder selection policy based on reward feedback (e.g., BER, throughput, or channel capacity) from the environment to maximize a numerical reward signal.
[0047] Now, a more detailed description of embodiments of the present disclosure will be provided. As illustrated in the accompanying drawings, a learning agent 102 learns a precoding policy through interaction with a SU-MIMO system 100.
[0048] Before describing the details of the learning agent 102, a description of the SU-MIMO system 100 is beneficial. In this regard, a precoding vector w ∈ ℂ.sup.n.sup.tx.sup.×1 is applied at the transmitter 200, and a combining vector r ∈ ℂ.sup.n.sup.rx.sup.×1 is applied at the receiver 202. At the transmitter 200, an encoder 208 encodes one transport bit stream into a bit block b.sub.tx, which is then symbol-mapped to modem symbols x by a mapper 210. A typical modem constellation is M-ary Quadrature Amplitude Modulation (M-QAM), which consists of a set of M constellation points. Then, a precoder 212 precodes the data symbols x by the precoding vector w to form n.sub.tx data substreams. Finally, the substreams are processed via respective Inverse Fast Fourier Transforms (IFFTs) 214-1 through 214-n.sub.tx to provide time-domain signals that are transmitted via the respective transmit antennas 204-1 through 204-n.sub.tx. In a similar manner, at the receiver 202, signals received via the receive antennas 206-1 through 206-n.sub.rx are transformed to the frequency domain via respective Fast Fourier Transforms (FFTs) 216-1 through 216-n.sub.rx. A combiner 218 combines the resulting data streams by applying the combining vector r to provide a combined signal z. A demapper 220 performs symbol demapping to provide a received bit block {circumflex over (b)}.sub.rx, which is then decoded by a decoder 222 to provide the received bit stream.
[0049] The set of data Resource Elements (REs) in a given subband is denoted herein by φ.sub.d, and a subband precoding application of a precoder w to the data REs i ∈ φ.sub.d is considered. Further, x.sub.i denotes the complex data symbol at RE i, and y.sub.i ∈ ℂ.sup.n.sup.rx.sup.×1 denotes the complex received signal vector at RE i. Then, the received signal at RE i can be written as:
y.sub.i=H.sub.iwx.sub.i+n.sub.i, Equation 1
where H.sub.i ∈ ℂ.sup.n.sup.rx.sup.×n.sup.tx represents the MIMO channel matrix between the transmit antennas 204-1 through 204-n.sub.tx and the receive antennas 206-1 through 206-n.sub.rx at RE i, and n.sub.i ∈ ℂ.sup.n.sup.rx.sup.×1 is an Additive White Gaussian Noise (AWGN) vector whose elements are i.i.d. complex-valued Gaussians with zero mean and variance σ.sub.n.sup.2. Without loss of generality, it is assumed that the data symbol x.sub.i and the precoding vector w are normalized so that E[|x.sub.i|.sup.2]=1 and ∥w∥.sup.2=1, where |·| denotes the absolute value of a complex value and ∥·∥ denotes the 2-norm of a vector. Under these assumptions, the SNR is given by 1/σ.sub.n.sup.2.
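The signal model of Equation (1) and the normalization assumptions above can be sketched numerically. This is an illustrative NumPy snippet under assumed dimensions (n.sub.tx=4, n.sub.rx=2), an assumed 10 dB SNR, and an assumed unit-power QPSK symbol; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx = 4, 2
snr_db = 10.0
sigma_n2 = 10 ** (-snr_db / 10)  # SNR = 1 / sigma_n^2

# Rayleigh-fading MIMO channel H_i (n_rx x n_tx) at one resource element
H = (rng.standard_normal((n_rx, n_tx))
     + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)

# Unit-norm precoding vector w (||w||^2 = 1)
w = rng.standard_normal(n_tx) + 1j * rng.standard_normal(n_tx)
w = w / np.linalg.norm(w)

# Unit-power data symbol x_i (E[|x_i|^2] = 1), here one QPSK point
x = (1 + 1j) / np.sqrt(2)

# AWGN vector with i.i.d. complex Gaussian entries of variance sigma_n^2
n = np.sqrt(sigma_n2 / 2) * (rng.standard_normal(n_rx)
                             + 1j * rng.standard_normal(n_rx))

# Equation (1): received signal at the RE
y = H @ w * x + n
```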
[0050] At the receiver 202, the transmitted data symbol x.sub.i can be recovered by combining the received symbols y.sub.i by the unit-norm vector r.sub.i (i.e., ∥r.sub.i∥.sup.2=1), which yields the estimated complex symbol z.sub.i as:
z.sub.i=r.sub.i.sup.+y.sub.i=r.sub.i.sup.+H.sub.iwx.sub.i+r.sub.i.sup.+n.sub.i, Equation 2
where (·).sup.+ denotes the conjugate transpose (Hermitian transpose) of a vector or matrix.
[0051] Note that r.sub.i.sup.+H.sub.iw in Equation (2) corresponds to the effective channel gain. It is assumed that a Maximal Ratio Combiner (MRC) is used at the receiver 202 (i.e., the combiner 218 is a MRC), which is optimal in the sense of output Signal to Noise Ratio (SNR) maximization when the noise is white.
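The MRC combining of Equation (2) can be sketched as follows: the unit-norm combiner is aligned with the effective channel Hw, which maximizes the effective channel gain r.sup.+Hw under white noise. Dimensions and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx = 4, 2

H = (rng.standard_normal((n_rx, n_tx))
     + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
w = rng.standard_normal(n_tx) + 1j * rng.standard_normal(n_tx)
w = w / np.linalg.norm(w)

# Maximal Ratio Combiner: unit-norm r aligned with the effective channel H w
h_eff = H @ w
r = h_eff / np.linalg.norm(h_eff)

# Effective channel gain r^+ H w of Equation (2); np.vdot conjugates its
# first argument, implementing the conjugate-transpose operation (.)^+
gain = np.vdot(r, h_eff)
```

With this choice of r, the gain equals ∥Hw∥, the largest value achievable by any unit-norm combiner.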
[0052] As mentioned above, the optimal precoding solution is given by a channel-dependent precoder on a per-RE basis. In other words, an optimal precoder w.sub.i is chosen that maximizes the effective channel gain r.sub.i.sup.+H.sub.iw.sub.i on a per-RE basis. However, in practical MIMO-OFDM systems, a precoder is chosen on a per-subband basis, achieving a tradeoff between performance and complexity. A practical subband-precoding solution is obtained based on a spatial channel covariance matrix averaged over the pilot signals in a given subband. The set of pilot REs in a given subband is denoted by φ.sub.p. The channel covariance matrix is given by:

R=(1/|φ.sub.p|)Σ.sub.i∈φ.sub.p H.sub.i.sup.+H.sub.i. Equation 3

Unfortunately, the conventional solution based on this covariance matrix is sub-optimal, and furthermore no truly optimal solution has been found for this setting to date.
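The conventional covariance-based subband precoder described above can be sketched as follows. One common realization, assumed here for illustration, takes the precoder as the principal eigenvector of the averaged covariance matrix; the dimensions, pilot count, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx, n_pilots = 4, 2, 8

# Channel matrices H_i at the pilot REs i in phi_p of one subband
H_pilots = [(rng.standard_normal((n_rx, n_tx))
             + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
            for _ in range(n_pilots)]

# Spatial channel covariance averaged over the pilot REs (n_tx x n_tx)
R = sum(H.conj().T @ H for H in H_pilots) / n_pilots

# Assumed conventional choice: precoder = principal eigenvector of R
eigvals, eigvecs = np.linalg.eigh(R)   # ascending eigenvalues (R is Hermitian)
w_subband = eigvecs[:, -1]
```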
[0053] In what follows, instead of approximating an optimal precoder based on the spatial channel covariance matrix, a learning scheme is described in which the learning agent 102 learns an optimal precoding policy directly from interactions with the complex real-world MIMO environment.
[0054] The learning agent 102 learns a precoding policy that optimizes a performance parameter through an experience sequence of given channel matrices, the precoders taken, and the values of the performance parameter achieved. In the remaining description, the performance parameter is BER. However, the performance parameter is not limited thereto. Other examples of the performance parameter are throughput and channel capacity.
[0055] Returning to the operation of the learning agent 102, the environmental state at time t is defined from the channel matrices at the pilot REs as:
H.sub.t={[vec(Re[H.sub.j]).sup.T,vec(Im[H.sub.j]).sup.T].sup.T}.sub.j∈φ.sub.p, Equation 4
where Re[·] and Im[·] represent the real and imaginary parts of the complex-valued MIMO channel matrix. Note that, regarding notation, H.sub.j is used herein to denote the channel matrix at RE j or i, whereas H.sub.t is used herein to denote the environmental state at time t given by a single channel matrix H.sub.j or a set of channel matrices H.sub.j in pilot REs j at the time t.
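The state stacking described above can be sketched as a small helper. It is an illustrative implementation that assumes vec(·) denotes column-wise stacking (`order="F"`); the function name and example dimensions are hypothetical.

```python
import numpy as np

def channel_state(H_list):
    """Build a real-valued state vector by stacking
    [vec(Re[H_j])^T, vec(Im[H_j])^T]^T over the pilot REs j."""
    parts = [np.concatenate([H.real.ravel(order="F"),
                             H.imag.ravel(order="F")])
             for H in H_list]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
# Three pilot-RE channel matrices, each n_rx x n_tx = 2 x 4
H_list = [rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))
          for _ in range(3)]
s = channel_state(H_list)  # length 3 * 2 * (2 * 4) = 48
```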
[0056] Note that, in one embodiment, the ambiguity in phase information of the channel matrix H is removed. For instance, the channel matrix H with size n.sub.r×n.sub.t can be scaled by the phase of the element corresponding to the first transmit and first receive antenna, denoted by H(1,1), i.e., H←H·e.sup.−j∠H(1,1). In addition, in one embodiment, the ambiguity in amplitude information of the channel matrix H is removed. For instance, the channel matrix H with size n.sub.r×n.sub.t can be scaled by its Frobenius norm, denoted by ∥H∥.sub.F, i.e., H←H/∥H∥.sub.F.
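The two normalizations above can be sketched together. Note that H(1,1) corresponds to `H[0, 0]` in 0-based Python indexing; the function name is a hypothetical label.

```python
import numpy as np

def normalize_channel(H):
    """Remove the phase ambiguity via the phase of H(1,1), then the
    amplitude ambiguity via the Frobenius norm, as described above."""
    H = H * np.exp(-1j * np.angle(H[0, 0]))  # H(1,1) becomes real-valued
    return H / np.linalg.norm(H)             # unit Frobenius norm

rng = np.random.default_rng(0)
H = rng.standard_normal((2, 4)) + 1j * rng.standard_normal((2, 4))
Hn = normalize_channel(H)
```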
[0057] The learning agent 102 chooses a precoder w.sub.t in the MIMO channel state H.sub.t according to the precoder policy, and the chosen precoder w.sub.t is applied to the MIMO system 100 to get an experimental BER performance as feedback. In particular, in one example, the BER performance is calculated by comparing the transmit code block b.sub.tx and the receive code block {circumflex over (b)}.sub.rx, as they represent the action value of the precoder w.sub.t over the MIMO channel state H.sub.t without the help of channel coding. The experimental BER is represented by:

BER.sub.exp.sup.t=BER(b.sub.tx,{circumflex over (b)}.sub.rx|H.sub.t,w.sub.t). Equation 5
One example of a reward function computed based on this feedback, with r.sub.t ∈ [−0.5, +0.5], is:
r.sub.t=log.sub.2(1−BER.sub.exp.sup.t)+0.5, Equation 6
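Equation (6) maps a BER of 0 to the maximum reward +0.5 and the worst-case BER of 0.5 to the minimum reward −0.5. It can be written directly as:

```python
import numpy as np

def reward(ber_exp):
    """Equation (6): map an experimental BER in [0, 0.5]
    to a reward r_t in [-0.5, +0.5]."""
    return np.log2(1.0 - ber_exp) + 0.5
```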
[0058] As illustrated in the accompanying drawings, the learning agent 102 maintains two neural networks: a first (actor) neural network F.sub.φ(H) that estimates the precoding policy, and a second (critic) neural network S.sub.θ(H, w) that estimates the value of a precoder w in a channel state H.
[0059] During the training phase, the first neural network F.sub.φ(H) is used to select a precoder in such a way that different actions are explored for a same MIMO channel state H. Note that, in some embodiments, the output of the first neural network F.sub.φ(H) is transformed in the form of a precoder vector or matrix for the MIMO transmission. For example, for digital precoding with unit-power constraint, the transformation includes a procedure for the precoder vector or matrix to have unit Frobenius norm. As another example, for analog precoding with constant modulus constraint, the transformation includes a procedure for each element of the precoder vector or matrix to have unit amplitude. In another example, the precoder w is processed to provide a precoder matrix whose row vectors have a unit norm.
[0060] At each time t, the precoder is executed by the MIMO system 100 in MIMO channel state H.sub.t to provide a reward r.sub.t, generating the experience of [H.sub.t, w.sub.t, r.sub.t]. Through the experiences [s.sub.t, a.sub.t, r.sub.t]=[H.sub.t, w.sub.t, r.sub.t], the second neural network S.sub.θ is trained by a Q-learning scheme to estimate the value of a given MIMO channel state and chosen precoder. At the same time, the first neural network F.sub.φ is trained by utilizing the gradient of the second neural network S.sub.θ to update the neural network parameters φ of F.sub.φ in the direction of the performance gradient. More specifically, the first neural network F.sub.φ is trained by the following parameter update rule:
φ←φ+η∇.sub.φF.sub.φ(H)∇.sub.wS.sub.θ(H,w)|.sub.H=H.sub.t.sub.,w=w.sub.t, Equation 7

where η is a learning rate, ∇.sub.φF.sub.φ is the gradient of F.sub.φ with respect to φ, and ∇.sub.wS.sub.θ is the gradient of S.sub.θ with respect to the chosen precoder w (i.e., the action). The operation of the learning agent 102 to train the first neural network F.sub.φ using the above parameter update rule is illustrated in the accompanying drawings.
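The chain rule behind Equation (7) can be made concrete with a deliberately simple sketch: a linear actor w=Φh and a linear critic q=θ·[h; w], where h is a real-valued state feature vector. These linear stand-ins, the dimensions, and the variable names are assumptions for illustration only; for a linear actor, ∇.sub.φF.sub.φ∇.sub.wS.sub.θ reduces to the outer product of the critic's action gradient with the state.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, eta = 8, 4, 0.01          # state dim, precoder dim, learning rate

Phi = rng.standard_normal((m, d)) * 0.1   # linear actor F_phi: w = Phi @ h
theta = rng.standard_normal(d + m) * 0.1  # linear critic S_theta: q = theta . [h; w]

h = rng.standard_normal(d)                # state features
w = Phi @ h                               # actor's precoder

q_before = theta @ np.concatenate([h, w])

# For this linear critic, the gradient of S_theta w.r.t. the action w is
# simply the last m critic weights
grad_w_S = theta[d:]

# Equation (7): Phi <- Phi + eta * grad_phi(F_phi) * grad_w(S_theta);
# with w = Phi h, the chain rule gives grad_Phi q = outer(grad_w_S, h)
Phi = Phi + eta * np.outer(grad_w_S, h)

q_after = theta @ np.concatenate([h, Phi @ h])
```

Because the update ascends the critic's estimate, the value of the (fixed) critic at the new action is never lower than before the step.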
[0061] In one embodiment, during the training phase, the first neural network F.sub.φ(H) is used to select a precoder in such a way that different precoders (i.e., different actions) are explored for the same MIMO channel state H. In this regard, in one example, an exploration noise ε is sampled from a Gaussian random process and added to the output of the first neural network as follows:

w.sub.t=F.sub.φ(H.sub.t)+ε. Equation 8

In another example, a random parameter noise is added to the parameters φ of the first neural network, i.e.,

w.sub.t={tilde over (F)}.sub.φ(H.sub.t), Equation 9

where {tilde over (F)}.sub.φ denotes the first neural network F.sub.φ with the random noise added to its parameters φ.
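The two exploration variants of Equations (8) and (9) can be sketched side by side, again using the hypothetical linear actor w=Φh for illustration; σ is an assumed exploration standard deviation.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, sigma = 8, 4, 0.1
Phi = rng.standard_normal((m, d)) * 0.1   # stand-in for the actor parameters phi
h = rng.standard_normal(d)                # stand-in for the channel state H_t

# Equation (8): action-space exploration, w_t = F_phi(H_t) + eps
w_action = Phi @ h + sigma * rng.standard_normal(m)

# Equation (9): parameter-space exploration, perturb phi itself
Phi_noisy = Phi + sigma * rng.standard_normal((m, d))
w_param = Phi_noisy @ h
```

In both variants, σ can be decayed across episodes so that exploration shrinks as the policy improves, as the flow below describes.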
[0063] The learning agent 102 sets the episode index ep to 1 (step 702), and initializes the MIMO channel state for time t=0 (i.e., H.sub.0) (step 704). The MIMO channel state H.sub.0 may be initialized based on a known MIMO channel model for the MIMO system 100 or based on a channel measurement from the MIMO system 100. The learning agent 102 sets a time index t equal to 0 (step 706).
[0064] The learning agent 102 chooses a precoder w.sub.t=F.sub.φ(H.sub.t)+ε to be executed by a MIMO transmitter in the MIMO system 100, where, as discussed above, ε is an exploration noise (step 708). As discussed above, in one embodiment, the exploration noise ε is a noise vector sampled from a Gaussian random process. In one embodiment, the exploration noise ε is a random noise in the continuous precoder space. In one embodiment, a variance of the exploration noise ε varies over training episodes. In one embodiment, the variance of the exploration noise ε gets smaller over training episodes. In an alternative embodiment, the learning agent 102 chooses a precoder w.sub.t={tilde over (F)}.sub.φ(H.sub.t), where {tilde over (F)}.sub.φ denotes a modified version of F.sub.φ in which a random noise is added to the neural network parameters φ of the first neural network F.sub.φ. In one embodiment, a variance of this parameter noise varies over training episodes. In one embodiment, the variance of this parameter noise gets smaller over training episodes.
[0065] The learning agent 102 executes the chosen precoder w.sub.t (i.e., the action) in the MIMO system 100 (step 710). In other words, the learning agent 102 provides the chosen precoder w.sub.t to the MIMO system 100 for execution (i.e., use) in the MIMO system 100. The learning agent 102 observes the experimental BER.sub.exp.sup.t in the MIMO system 100 for time t and computes the reward r.sub.t (step 712). In one example, the reward r.sub.t is computed in accordance with Equation (6). The learning agent 102 observes the next MIMO channel state H.sub.t+1 in the MIMO system 100 (step 714).
[0066] The learning agent 102 updates the neural network parameters θ of the second (critic) neural network S.sub.θ via Q-learning on the experience [s.sub.t, a.sub.t, r.sub.t, s.sub.t+1] (step 716). The learning agent 102 also computes the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ(step 718) and updates the neural network parameters φ of the first (actor) neural network F.sub.φ based on the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ in accordance with the parameter update rule of Equation (7) (step 720).
[0067] The learning agent 102 determines whether the last iteration for the current training episode has been reached (i.e., whether t<T−1) (step 722). If the last iteration has not been reached (i.e., if t<T−1), the learning agent increments t (step 724), and the process returns to step 708 and is repeated for the next iteration. Once the last iteration for the current training episode has been reached, the learning agent 102 determines whether the last episode has been reached (i.e., determines whether ep<E) (step 726). If not, the learning agent 102 increments the episode index ep (step 728), and the process returns to step 704 and is repeated for the next episode. Once the last episode has been reached, the training process ends and an execution phase begins. For the execution phase, the learning agent 102 provides the trained model (e.g., provides the neural network parameters φ of the first neural network F.sub.φ) to the MIMO system 100, or utilizes the trained model (e.g., utilizes the first neural network F.sub.φ for precoder selection for the MIMO system 100). Thus, in the execution phase, a MIMO transmitter within the MIMO system 100 transmits a signal using the precoder selected by the trained first neural network F.sub.φ.
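The episodic training loop of steps 702 through 728 can be sketched end-to-end. This is a toy illustration under stated assumptions: linear actor and critic stand-ins, a synthetic quadratic reward in place of the BER-based feedback of Equation (6), and a per-episode decay of the exploration variance. It demonstrates the data flow of the training process, not a convergent precoder design; every name and constant is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 4, 2                 # assumed state / precoder dimensions
E, T = 30, 50               # episodes and iterations per episode
eta, alpha = 0.01, 0.01     # actor and critic learning rates

W_true = 0.3 * rng.standard_normal((m, d))   # hypothetical "best" linear policy

def env_reward(h, w):
    # Toy stand-in for the BER-based reward: larger when w is close to
    # the target W_true @ h, which is unknown to the agent.
    return -np.sum((w - W_true @ h) ** 2)

Phi = np.zeros((m, d))      # linear actor F_phi: w = Phi @ h
theta = np.zeros(d + m)     # linear critic S_theta: q = theta . [h; w]

for ep in range(E):                               # steps 702/726/728
    sigma = 0.5 * 0.9 ** ep                       # exploration variance decays
    for t in range(T):                            # steps 706/722/724
        h = rng.standard_normal(d)                # stand-in for H_t (step 704/714)
        w = Phi @ h + sigma * rng.standard_normal(m)   # step 708, Equation (8)
        r = env_reward(h, w)                      # steps 710-712
        x = np.concatenate([h, w])
        theta -= alpha * (theta @ x - r) * x      # step 716: critic regression
        Phi += eta * np.outer(theta[d:], h)       # steps 718-720, Equation (7)
```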
[0068] In the embodiments described above, the learning agent 102 chooses the precoder w.sub.t for each training iteration. However, the present disclosure is not limited thereto.
[0070] The learning agent 102 sets the episode index ep to 1 (step 902), and initializes MIMO channel state for time t=0 (i.e., H.sub.0) (step 904). The MIMO channel state H.sub.0 may be initialized based on a known MIMO channel model for the MIMO system 100 or based on a channel measurement from the MIMO system 100. The learning agent 102 sets a time index t equal to 0 (step 906).
[0071] The learning agent 102 observes a precoder w.sub.t executed in the MIMO system 100 (step 908). As discussed above, the precoder w.sub.t is selected in the MIMO system 100 in accordance with a conventional precoder selection scheme. The learning agent 102 observes the experimental BER.sub.exp.sup.t in the MIMO system 100 for time t and computes the reward r.sub.t (step 910). In one example, the reward r.sub.t is computed in accordance with Equation (6). The learning agent 102 observes the next MIMO channel state H.sub.t+1 in the MIMO system 100 (step 912).
[0072] The learning agent 102 updates the neural network parameters θ of the second (critic) neural network S.sub.θ via Q-learning on the experience [s.sub.t, a.sub.t, r.sub.t, s.sub.t+1] (step 914). The learning agent 102 also computes the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ(step 916) and updates the neural network parameters φ of the first (actor) neural network F.sub.φ based on the gradient vectors ∇.sub.φF.sub.φ and ∇.sub.wS.sub.θ in accordance with the parameter update rule of Equation (7) (step 918).
[0073] The learning agent 102 determines whether the last iteration for the current training episode has been reached (i.e., whether t<T−1) (step 920). If the last iteration has not been reached (i.e., if t<T−1), the learning agent increments t (step 922), and the process returns to step 908 and is repeated for the next iteration. Once the last iteration for the current training episode has been reached, the learning agent 102 determines whether the last episode has been reached (i.e., determines whether ep<E) (step 924). If not, the learning agent 102 increments the episode index ep (step 926), and the process returns to step 904 and is repeated for the next episode. Once the last episode has been reached, the training process ends.
[0074] It should be noted that, once the first neural network F.sub.φ is trained, the first neural network F.sub.φ can be used for selecting the precoder w for the MIMO system 100 during an execution phase. During the execution phase, training of the first and second neural networks may cease or may only be performed occasionally (e.g., periodically).
[0076] Once the first neural network F.sub.φ is trained, the learning agent 102 or the MIMO system 100 uses the first neural network F.sub.φ to select a precoder w for a MIMO transmitter of the MIMO system 100 (step 1004). The MIMO system 100 then applies the selected precoder w in the MIMO transmitter (step 1006).
[0077] Optionally, the MIMO system 100 or the learning agent 102 determines whether to fall back to the fallback precoder (e.g., if the performance of the first neural network F.sub.φ falls below a predefined or preconfigured threshold) (step 1008). If so, the process returns to step 1000. Otherwise, the process returns to step 1004.
[0080] In this example, functions 1210 of the learning agent 102 described herein are implemented at the one or more processing nodes 1200 or distributed across two or more of the processing nodes 1200 in any desired manner. In some particular embodiments, some or all of the functions 1210 of the learning agent 102 described herein are implemented as virtual components executed by one or more virtual machines implemented in a virtual environment(s) hosted by the processing node(s) 1200.
[0081] In some embodiments, a computer program is provided that includes instructions which, when executed by at least one processor, cause the at least one processor to carry out the functionality of the learning agent 102 or a processing node(s) 1100 or 1200 implementing one or more of the functions of the learning agent 102 in a virtual environment according to any of the embodiments described herein. In some embodiments, a carrier comprising the aforementioned computer program is provided. The carrier is one of an electronic signal, an optical signal, a radio signal, or a computer readable storage medium (e.g., a non-transitory computer readable medium such as memory).
[0083] Any appropriate steps, methods, features, functions, or benefits disclosed herein may be performed through one or more functional units or modules of one or more virtual apparatuses. Each virtual apparatus may comprise a number of these functional units. These functional units may be implemented via processing circuitry, which may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include Digital Signal Processors (DSPs), special-purpose digital logic, and the like. The processing circuitry may be configured to execute program code stored in memory, which may include one or several types of memory such as Read Only Memory (ROM), Random Access Memory (RAM), cache memory, flash memory devices, optical storage devices, etc. Program code stored in memory includes program instructions for executing one or more telecommunications and/or data communications protocols as well as instructions for carrying out one or more of the techniques described herein. In some implementations, the processing circuitry may be used to cause the respective functional unit to perform corresponding functions according to one or more embodiments of the present disclosure.
[0084] While processes in the figures may show a particular order of operations performed by certain embodiments of the present disclosure, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).
[0085] At least some of the following abbreviations may be used in this disclosure. If there is an inconsistency between abbreviations, preference should be given to how an abbreviation is used above. If listed multiple times below, the first listing should be preferred over any subsequent listing(s).
[0086] 3GPP Third Generation Partnership Project
[0087] 5G Fifth Generation
[0088] 5GS Fifth Generation System
[0089] ASIC Application Specific Integrated Circuit
[0090] CPU Central Processing Unit
[0091] DSP Digital Signal Processor
[0092] FPGA Field Programmable Gate Array
[0093] LTE Long Term Evolution
[0094] Those skilled in the art will recognize improvements and modifications to the embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein.