METHODS AND APPARATUSES FOR TRAINING A MODEL BASED REINFORCEMENT LEARNING MODEL

Abstract

Embodiments described herein relate to a method and apparatus for training a model based reinforcement learning, MBRL, model for use in an environment. The method comprises obtaining a sequence of observations, o.sub.t, representative of the environment at a time t; estimating latent states s.sub.t at time t using a representation model, wherein the representation model estimates the latent states s.sub.t based on the previous latent states s.sub.t1, previous actions a.sub.t1 and the observations o.sub.t; generating modelled observations, o.sub.m,t, using an observation model, wherein the observation model generates the modelled observations based on the respective latent states s.sub.t, wherein the step of generating comprises determining means and standard deviations based on the latent states s.sub.t; and minimizing a first loss function to update network parameters of the representation model and the observation model, wherein the first loss function comprises a component comparing the modelled observations, o.sub.m,t to the respective observations o.sub.t.

Claims

1. A method for training a model based reinforcement learning, MBRL, model for use in an environment, the method comprising: obtaining a sequence of observations, o.sub.t, representative of the environment at a time t; estimating latent states s.sub.t at time t using a representation model, wherein the representation model estimates the latent states s.sub.t based on the previous latent states s.sub.t1, previous actions a.sub.t1 and the observations o.sub.t; generating modelled observations, o.sub.m,t, using an observation model, wherein the observation model generates the modelled observations based on the respective latent states s.sub.t, wherein the step of generating comprises determining means and standard deviations based on the latent states s.sub.t; and minimizing a first loss function to update network parameters of the representation model and the observation model, wherein the first loss function comprises a component comparing the modelled observations, o.sub.m,t to the respective observations o.sub.t.

2. The method as claimed in claim 1 wherein the step of generating further comprises sampling distributions generated from the means and standard deviations to generate respective modelled observations, o.sub.m,t.

3. The method as claimed in claim 1 further comprising: determining a reward r.sub.t based on a reward model, wherein the reward model determines the reward r.sub.t based on the latent state s.sub.t, wherein the step minimizing the first loss function is further used to update network parameters of the reward model, and wherein the first loss function further comprises a component relating to the how well the reward r.sub.t represents a real reward for the observation o.sub.t.

4. The method as claimed in claim 1 further comprising: estimating a transitional latent state s.sub.trans,t, using a transition model, wherein the transition model estimates the transitional latent state s.sub.trans,t based on the previous transitional latent state s.sub.trans,t1 and a previous action a.sub.t1; wherein the step of minimizing the first loss function is further used to update network parameters of the transition model, and wherein the first loss function further comprises a component relating to how similar the transitional latent state s.sub.trans,t is to the latent state s.sub.t.

5. The method as claimed in claim 3 further comprising: after minimizing the first loss function, minimizing a second loss function to update network parameters of a critic model and an actor model, wherein the critic model determines state values based on the transitional latent states s.sub.trans,t and the actor model determines actions a based on the transitional latent states s.sub.trans,t.

6. The method as claimed in claim 5 wherein the second loss function comprises a component relating to ensuring the state values are accurate, and a component relating to ensuring the actor model leads to transitional latent states, s.sub.trans,t associated with high state values.

7. The method as claimed in claim 1 wherein the environment comprises a cavity filter being controlled by a control unit.

8. The method as claimed in claim 7 wherein the observations, o.sub.t, each comprise S-parameters of the cavity filter.

9. The method as claimed in claim 7 wherein the previous actions a.sub.t1 relate to tuning characteristics of the cavity filter.

10. The method as claimed in claim 1 wherein the environment comprises a wireless device performing transmissions in a cell.

11. The method as claimed in claim 10 wherein the observations, o.sub.t, each comprise a performance parameter experienced by a wireless device.

12. The method as claimed in claim 11 wherein the performance parameter comprises one or more of: a signal to interference and noise ratio; traffic in the cell and a transmission budget.

13. The method as claimed in claim 10 wherein the previous actions a.sub.t1 relate to controlling one or more of: a transmission power of the wireless device; a modulation and coding scheme used by the wireless device; and a radio transmission beam pattern.

14. The method as claimed in claim 1 further comprising using the trained model in the environment.

15. The method as claimed in claim 14 wherein the observations, o.sub.t, each comprise S-parameters of the cavity filter and wherein using the trained model in the environment comprises tuning the characteristics of the cavity filter to produce desired S-parameters.

16. The method as claimed in claim 14 wherein the environment comprises a wireless device performing transmissions in a cell and wherein the using the trained model in the environment comprises adjusting one of: the transmission power of the wireless device; the modulation and coding scheme used by the wireless device; and a radio transmission beam pattern, to obtain a desired value of the performance parameter.

17. An apparatus for training a model based reinforcement learning, MBRL, model for use in an environment, the apparatus comprising processing circuitry configured to cause the apparatus to perform the method as claimed in claim 1.

18. The apparatus of claim 17 wherein the apparatus comprises a control unit for a cavity filter.

19. (canceled)

20. (canceled)

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] For a better understanding of the embodiments of the present disclosure, and to show how it may be put into effect, reference will now be made, by way of example only, to the accompanying drawings, in which:

[0014] FIG. 1 illustrates the process of manually tuning a typical cavity filter by a human expert;

[0015] FIG. 2 illustrates an overview of a training procedure for the MBRL model according to some embodiments;

[0016] FIG. 3 illustrates the method of step 202 of FIG. 2 in more detail;

[0017] FIG. 4 graphically illustrates how step 202 of FIG. 2 may be performed;

[0018] FIG. 5 illustrates an example of a decoder 405 according to some embodiments;

[0019] FIG. 6 graphically illustrates how step 203 of FIG. 2 may be performed;

[0020] FIG. 7 illustrates how the proposed MBRL model can be trained and used in an environment comprising a cavity filter being controlled by a control unit;

[0021] FIG. 8 illustrates a typical example of VNA measurements during a training loop;

[0022] FIG. 9 is a graph illustrating an observational bottleneck, where in the case of the fixed non learnable standard deviation (I) 901 (as used in Dreamer), the resulting MBRL model seems to plateau after a few thousand steps, illustrating that the more simplistic world modelling does not continue learning;

[0023] FIG. 10 is a graph illustrating the observation loss 1001 for an MBRL with a learnable standard deviation according to embodiments described herein, and the observation loss 1002 for an MBRL with a fixed standard deviation;

[0024] FIG. 11 illustrates a comparison between a how quickly a Best Model Free (SAC) agent can tune a cavity filter, and how quickly an MBRL model according to embodiments described herein can tune the cavity filter;

[0025] FIG. 12 illustrates an apparatus comprising processing circuitry (or logic) in accordance with some embodiments;

[0026] FIG. 13 is a block diagram illustrating an apparatus in accordance with some embodiments.

DESCRIPTION

[0027] The following sets forth specific details, such as particular embodiments or examples for purposes of explanation and not limitation. It will be appreciated by one skilled in the art that other examples may be employed apart from these specific details. In some instances, detailed descriptions of well-known methods, nodes, interfaces, circuits, and devices are omitted so as not obscure the description with unnecessary detail. Those skilled in the art will appreciate that the functions described may be implemented in one or more nodes using hardware circuitry (e.g., analog and/or discrete logic gates interconnected to perform a specialized function, ASICs, PLAs, etc.) and/or using software programs and data in conjunction with one or more digital microprocessors or general purpose computers. Nodes that communicate using the air interface also have suitable radio communications circuitry. Moreover, where appropriate the technology can additionally be considered to be embodied entirely within any form of computer-readable memory, such as solid-state memory, magnetic disk, or optical disk containing an appropriate set of computer instructions that would cause a processor to carry out the techniques described herein.

[0028] Hardware implementation may include or encompass, without limitation, digital signal processor (DSP) hardware, a reduced instruction set processor, hardware (e.g., digital or analogue) circuitry including but not limited to application specific integrated circuit(s) (ASIC) and/or field programmable gate array(s) (FPGA(s)), and (where appropriate) state machines capable of performing such functions.

[0029] As described above, traditionally tuning of a cavity filter is performed manually by a human expert in a lengthy and costly process. Model Free reinforcement learning (MFRL) approaches have already shown success is solving this problem. However, the MFRL approaches are not sample efficient, meaning they require a lot of training samples before obtaining a proper tuning policy. As more precise world simulations require more processing time, it may be desirable for agents to be able to learn and to solve a task whilst requiring as few interactions with the environment as possible. For reference, current 3D simulations on cavity filters require around seven minutes for a single agent interaction (carried on a 4-core cpu). Transitioning on real filters requires even more precise simulations, however, training of MFRL agents on such environments is simply unfeasible (time wise). In order to deploy such agents on real filters, a boost in sample efficiency must be achieved.

[0030] Given sufficient samples (term often called asymptotic performance), MFRL tends to exhibit better performance than model based reinforcement learning (MBRL), as errors induced by the world model get propagated to the decision making of the agent. In other words, the world model errors act as a bottleneck on the performance of the MBRL model. On the other hand, MBRL can leverage the world model to boost training efficiency, leading to faster training. For example, the agent can use the learned environment model to simulate sequence of actions and observations, which in turn give it a better understanding of the consequences of his actions. When designing an RL algorithm, one must find a fine balance, between training speed and asymptotic performance. Achieving both requires careful modelling and is the goal of the embodiments described herein.

[0031] Contemporary Model Based Reinforcement Learning (MBRL) techniques have rarely been used to deal with high dimensional observations such as those present when tuning cavity filters. State-of-the-art methods typically lack the precision required in this task, and as such cannot be applied as is whilst exhibiting acceptable results.

[0032] However, recent advances in Model Based Reinforcement Learning (MBRL) have been made which tackle complicated environments, while requiring fewer samples.

[0033] Embodiments described herein therefore provide methods and apparatuses for training a model based reinforcement learning, MBRL, model for use in an environment. In particular, the method of training produces a MBRL model that is suitable for use in environments having high dimensional observations, such as a tuning a cavity filter.

[0034] Embodiments described herein builds on a known MBRL agent structure referred to herein as the Dreamer model (see D. Hafner et al. (2020) Mastering Atari with Discrete World Models retrieved from https://arxiv.org/abs/2010.02193). The resulting MBRL agent according to embodiments described herein provides similar performance to previous MFRL agent whilst requiring significantly fewer samples.

[0035] Reinforcement learning is a learning method concerned with how an agent should take actions in an environment in order to maximize a numerical reward.

[0036] In some examples, the environment comprises a cavity filter being controlled by a control unit. The MBRL model may therefore comprise an algorithm which tunes the cavity filter, for example by turning the screws on the cavity filter.

[0037] The Dreamer model stands out among many other MBRL algorithms as it has achieved performance on a wide array of tasks of varying complexity while requiring significantly fewer samples (e.g. orders of magnitude less than otherwise required). It takes its name from the fact that the actor model in the architecture (which chooses the actions performed by the agent), bases its decisions purely on a lower dimensional latent space. In other words, the actor model leverages the world model to imagine trajectories, without requiring the generation of actual observations. This is particularly useful in some cases, especially where the observations are high dimensional.

[0038] The Dreamer model consists of an Actor-Critic network pair and a World Model. The World Model is fit onto a sequence of observations, so that it can reconstruct the original observation from the latent space and predict the corresponding reward. The actor model and critic model receive as an input the states, e.g. the latent representations of the observations. The critic model aims to predict the value of a state (how close we are to a tuned configuration), while the actor model aims to find the action which would lead to a configuration exhibiting a higher value (more tuned). The actor model obtains more precise value estimates by leveraging the world model to examine the consequences of the actions multiple steps ahead.

[0039] The architecture of an MBRL model according to embodiments described herein comprises one or more of: an actor model a critic model, a reward model (q(r.sub.t|s.sub.t)), a transition model (q(s.sub.t|s.sub.t1, a.sub.t1)), a representation model (p(s.sub.t|s.sub.t1, a.sub.t1, o.sub.t)) and an observation model (q(o.sub.m,t|s.sub.t)). Examples of how these different models may be implemented will now be described in more detail below.

[0040] The actor model aims to predict the next action, given the current latent state s.sub.t. The actor model may for example comprise a neural network. The actor model neural network may comprise a sequence of fully connected layers (e.g. 3 layers with layer widths of, for example, 400, 400 and 300) which then output the mean and the standard deviation of a truncated normal distribution (e.g. to limit the mean to lie within [1,1]).

[0041] The critic model models the value of a given state V(s.sub.t). The critic model may comprise a neural network. The critic model neural network may comprise a sequence of, for fully connected layers (e.g. three layers with layer widths of 400, 400 and 300) which then output the mean of the value distribution (e.g. a one-dimensional output). This distribution may be a Normal Distribution.

[0042] The reward model determines the reward given the current latent state s.sub.t. The reward model may also comprise a neural network. The reward model neural network may also comprise a sequence of fully connected layers (e.g. three fully connected layers with layer widths of, for example, 400, 200 and 50). The reward model may model the mean of a generative Normal Distribution.

[0043] The transition model q(s.sub.t|s.sub.t1, a.sub.t1) aims to predict the next set of latent states (s.sub.t), given the previous latent state (s.sub.t1) and action (a.sub.t1) without utilising the current observation o.sub.t. The transition model may be modelled as a Gated Recurrent Unit (GRU) comprised of one hidden layer which stores a deterministic state h.sub.t (the hidden neural network layer may have a width of 400). Alongside h.sub.t a shallow neural network comprised of Fully Connected Hidden layers (for example with a single layer with a layer width of, for example, 200) may be used to generate stochastic states. The states s.sub.t used above may comprise both deterministic and stochastic states.

[0044] The representation model (p(s.sub.t|s.sub.t1, a.sub.t1, o.sub.t)) is in essence the same as the transition model, with the only difference being that it also incorporates the current observation o.sub.t (in other words, the representation model may be considered posterior over latent states, whereas the transition model is prior over latent states). To do so, the observation o.sub.t is processed by an encoder and an embedding is obtained. The encoder may comprise a neural network. The encoder neural network may comprise a sequence of fully connected layers (e.g. two layers with layer widths of, for example, 600 and 400).

[0045] The observation model q(o.sub.m,t|s.sub.t), which is implemented by a decoder, aims to reconstruct, by generating modelled observation o.sub.m,t, the observation o.sub.t that produced the embedding which then helped to generate the latent state s.sub.t. The latent space must be such that the decoder is able to reconstruct the initial observation as accurately as possible. It may be important that this part of the model is as robust as possible, as it dictates the quality of the latent space, and therefore the usability of the latent space for planning ahead. In the Dreamer algorithm, the observation model generated modelled observations by determining means based on the latent states s.sub.t. The modelled observations were then generated by sampling distributions generated from the respective means.

[0046] FIG. 2 illustrates an overview of a training procedure for the MBRL model according to some embodiments.

[0047] In step 201, the method comprises initialising an experience buffer. The experience buffer may comprise random seed episodes, wherein each seed episode comprises a sequence of experiences. Alternatively, the experience buffer may comprise a series of experiences not contained within seed episodes. Each experience comprises a tuple in the form (o.sub.t, a.sub.t, r.sub.t, o.sub.t+1).

[0048] When drawing information from the experience buffer, the MBRL model may, for example, select a random seed episode, and may then select a random sequence of experiences from the within the selected seed episode.

[0049] The neural network parameters of the various neural networks in the model may also be initialised randomly.

[0050] In step 202, the method comprises training the world model.

[0051] In step 203, the method comprises training the actor-critic model.

[0052] In step 204, the updated model interacts with the environment to add experiences to the experience buffer. The method then returns to step 202. The method may then continue until the network parameters of the world model and the actor-critic model converge, or until the performs at a desired level.

[0053] FIG. 3 illustrates the method of step 202 of FIG. 2 in more detail. FIG. 4 graphically illustrates how step 202 of FIG. 2 may be performed. In FIG. 4 all blocks that are illustrated with non-circular shapes are trainable during step 202 of FIG. 2. In other words, the neural network parameters for the models represented by the non-circular blocks may be updated during step 202 of FIG. 2.

[0054] In step 301, the method comprises obtaining a sequence of observations, o.sub.t, representative of the environment at a time t. For example, as illustrated in FIG. 4, the encoder 401 is configured to receive the observations o.sub.t1 403a (at time t1) and o.sub.t 403b (at time t). The illustrated observations are S-parameters of a cavity filter. This is given as an example of a type of observation, and is not limiting.

[0055] In step 302, the method comprises estimating latent states s.sub.t at time t using a representation model, wherein the representation model estimates the latent states s.sub.t based on the previous latent states s.sub.t1, previous actions a.sub.t1 and the observations o.sub.t. The representation model is therefore based on previous sequences that have occurred. For example, the representation model estimates the latent state s.sub.t 402b at time t based on the previous latent state s.sub.t1 402a, the previous action a.sub.t1 404 and the observation o.sub.t 403b.

[0056] In step 303, the method comprises generating modelled observations, o.sub.m,t, using an observation model (q(o.sub.m,t|s.sub.t)), wherein the observation model generates the modelled observations based on the respective latent states s.sub.t. For example, the decoder 405 generates the modelled observations o.sub.m,t 406b and o.sub.m,t1 406a based on the states stand s.sub.t1 respectively.

[0057] The step of generating comprises determining means and standard deviations based on the latent states s.sub.t. For example, the step of generating may comprise determining a respective mean and standard deviation based on each of the latent states s.sub.t. This is in contrast to the original Dreamer model, which (as described above), produces only means based on the latent states in the observation model.

[0058] FIG. 5 illustrates an example of a decoder 405 according to some embodiments. The decoder 405 determines a mean 501 and a standard deviation 502 based on the latent state s.sub.t it receives as an input. As previously described the decoder comprises a neural network configured to attempt to map the latent state s.sub.t to the corresponding observation o.sub.t.

[0059] The output modelled observation o.sub.m,t may then be determined by sampling a distribution generated from the determined mean and standard deviation.

[0060] In step 304 the method comprises, minimizing a first loss function to update network parameters of the representation model and the observation model, wherein the first loss function comprises a component comparing the modelled observations, o.sub.m,t to the respective observations o.sub.t. In other words, the neural network parameters of the representation model and the observation model may be updated based on how similar the modelled observations o.sub.m,t are to the observations o.sub.t.

[0061] In some examples the method further comprises determining a reward r.sub.t based on a reward model (q(r.sub.t|s.sub.t)) 407, wherein the reward model 407 determines the reward r.sub.t based on the latent state s.sub.t. The step of minimizing the first loss function may then be further used to update network parameters of the reward model. For example, the neural network parameters of the reward model may be updated based on minimizing the loss function. The first loss function may therefore further comprise a component relating to the how well the reward n represents a real reward for the observation o.sub.t. In other words, the loss function may comprise a component measuring how well the determined reward r.sub.t matches how well the observation o.sub.t should be rewarded.

[0062] The overall world model may therefore be trained to simultaneously maximize the likelihood of generating the correct environment rewards r and to maintain an accurate reconstruction of the original observation via the decoder.

[0063] In some examples, the method further comprises estimating a transitional latent state s.sub.trans,t, using a transition model (q(s.sub.trans,t|s.sub.trans,t1, a.sub.t1)). The transition model may estimate the transitional latent state s.sub.trans,t based on the previous transitional latent state s.sub.trans,t1 and a previous action a.sub.t1. In other words, the transition model is similar to the representation model, except the transition model does not take into account the observations o.sub.t. This allows the final trained model to predict (or dream) further into the future.

[0064] The step of minimizing the first loss function may therefore be further used to update network parameters of the transition model. For example, neural network parameters of the transition model may be updated. The first loss function may therefore further comprise a component relating to how similar the transitional latent state s.sub.trans,t is to the latent state s.sub.t. The aim of updating the transition model is to ensure that the transitional latent states s.sub.trans,t produced by the transition model are as similar as possible to the latent states s.sub.t produced by the representation model. The trained transition model may be used in the next stage, e.g. step 203 of FIG. 2.

[0065] FIG. 6 graphically illustrates how step 203 of FIG. 2 may be performed. In FIG. 6, all blocks that are illustrated with non-circular shapes are trainable during step 203 of FIG. 2. In other words, the neural network parameters for the models represented by the non-circular blocks may be updated during step 203 of FIG. 2. In other words, during step 203 of FIG. 2, the actor model 600 and the critic model 601 may be updated.

[0066] Step 203 of FIG. 2 may be initiated by a single observation 603. The observation can be fed into the encoder 401 (trained in step 202), and embedded. The embedded observation may then be used to generate the starting transitional state s.sub.trans,t. The trained transition model then determines the following transitional states s.sub.trans,t+1, and so on, based on the previous transitional state s.sub.trans,t and the previous action a.sub.t.

[0067] Step 203 of FIG. 2 may comprise minimizing a second loss function to update network parameters of the critic model 601 and the actor model 602. The critic model determines state values based on the transitional latent states s.sub.trans,t. The actor model determines actions at based on the transitional latent states s.sub.trans,t.

[0068] The second loss function comprises a component relating to ensuring the state values are accurate (e.g. observations that lie closer to tuned configurations are attributed a higher value), and a component relating to ensuring the actor model leads to transitional latent states, s.sub.trans,t associated with high state values, whilst in some examples also being as explorative as possible (e.g. having high entropy).

[0069] A trained MBRL according to embodiments described herein may then interact with an environment, during which actions and observations fed into the trained encoder, and the trained representation model and actor model are used to determine appropriate actions. The resulting data samples may be fed back into the experience buffer to be used in continual training of the MBRL model.

[0070] In some examples, models may be stored periodically. The process may comprise evaluating stored MBRL models on multiple environments and selecting the best performing MBRL model for use.

[0071] The MBRL model trained according to embodiments described herein may be utilized in environments which require more precise generative models. Potentially, the MBRL model as described by embodiments herein may allow for the learning of any distribution described by some relevant statistics. The MBRL model as described by embodiments herein may significantly decrease the required number of training samples, for example, in a Cavity Filter Environment. This improvement to decrease the required number of training samples is achieved by enhancing the observation model to model a normal distribution with a learnable mean and standard deviation. The decrease in the number of required training samples may be, for example, a factor of 4.

[0072] As previously described, in some examples, the environment in which the MBRL model operates comprises a cavity filter being controlled by a control unit. The MBRL model may be trained and used in this environment. In this example, the observations, o.sub.t, may each comprise S-parameters of the cavity filter, and the actions a.sub.t relate to tuning characteristics of the cavity filter. For example, the actions may comprise turning screws on the cavity filter to change the position of the poles and the zeros.

[0073] Using the a trained MBRL model in the environment comprising a cavity filter controlled by a control unit may comprise tuning the characteristics of the cavity filter to produce desired S-parameters.

[0074] In some examples, the environment may comprise a wireless device performing transmissions in a cell. The MBRL model may be trained and used within this environment. The observations, o.sub.t, may each comprise a performance parameter experienced by a wireless device. For example, the performance parameter may comprise one or more of: a signal to interference and noise ratio; traffic in the cell and a transmission budget. The actions at may relate to controlling one or more of: a transmission power of the wireless device; a modulation and coding scheme used by the wireless device; and a radio transmission beam pattern. Using the trained model in the environment may comprise adjusting one of: the transmission power of the wireless device; the modulation and coding scheme used by the wireless device; and a radio transmission beam pattern, to obtain a desired value of the performance parameter.

[0075] For example, in 4G and 5G cellular communication, link adaptation technique is used to maximize the user throughput and frequency spectrum utilization. The main technique to do so, is the so-called adaptive modulation and coding (ACM) scheme in which the type and order of modulation as well as channel coding rate is selected according to channel quality indicator (CQI). Selecting the optimal ACM according to user's measured SINR (signal to noise and interference ratio) is very hard due to rapid changes in the channel between base station (gNB in 5G terminology) and user, measurements delay, and traffic changes in the cell. An MBRL model according to embodiments described herein may be utilized to find optimal policies for selecting modulation and coding schemes based on observations such as: estimated SINR, traffic in the cell, and transmission budget, to maximize a reward function which represents average throughput to the users active in the cell.

[0076] In another example an MBRL model according to embodiments described herein may be utilized for cell shaping, which is basically a way to dynamically optimize utilization of radio resources in cellular networks by adjusting radio transmission beam patterns according to some network's performance indicators. In this example, the actions may adjust the radio transmission beam pattern in order to change the observations of a network performance indicator.

[0077] In another example, an MBRL model according to embodiments described herein may be utilized in dynamic spectrum sharing (DSS), which is essentially a solution for a smooth transition from 4G to 5G so that existing 4G bands can be utilized for 5G communication without any static restructuring of the spectrum. In fact, using DSS, 4G and 5G can operate in the same frequency spectrum, and a scheduler can distribute the available spectrum resources dynamically between the two radio access standards. Considering its huge potential, an MBRL model according to embodiments described herein may be utilized to adapt an optimal policy for this spectrum sharing task as well. For example, the observations may comprise the amount of data in buffer to be transmitted to each UE (a vector), and standards that each UE can support (another vector). The actions may comprise distributing the frequency spectrum between 4G and 5G standards given a current state/time. For instance, a portion to may be distributed to 4G and a portion may be distributed to 5G.

[0078] As an example, FIG. 7 illustrates how the proposed MBRL model can be trained and used in an environment comprising a cavity filter being controlled by a control unit. Overall, the MBRL model according to embodiments described herein allows for the efficient adaptation of robust state-of-the-art techniques for the process of Cavity Filter Tuning. Not only is the approach more efficient and precise than what is present in the literature, but it is also more flexible and can act as a blueprint for modelling different, potentially more complex generative distributions.

[0079] After obtaining an Agent 700 that can suggest screw rotations in simulation, the goal is to create an end-to-end pipeline which would allow for the tunning of real, physical filters. To this end, a robot may be developed which has direct access to S-parameter readings from the Vector Network Analyser (VNA) 701. Furthermore, actions can easily be translated in exact screw rotations. For example, [1,1] may map to [1080, 1080] degrees rotations (3 full circles). Lastly, the unit may be equipped with the means of altering the screws by the specified angle amount mentioned before.

[0080] The agent 700 may be trained by interacting either with a simulator or directly with a real filter (as shown in FIG. 7), in which case a robot 703 may be used to alter the physical screws. The goal of the agent is to devise a sequence of actions that lead to a tuned configuration as fast as possible.

[0081] The training may be described as follows:

[0082] The agent 700, given an S-parameter observation o, generates an action a, evolving the system, yielding the corresponding reward r and next observation o. The tuple (o,a,r,o) may be stored internally, as it can be later used for training.

[0083] The agent then checks in step 704 if it should train its world model and actor-critic networks (e.g. perform gradient updates every 10 steps). If not, it proceeds to implement the action in the environment using the robot 703 by turning the screws on the filter in step 705.

[0084] If the training is to be performed, the agent 700 may determine in step 706 whether a simulator is being used. If a simulator is being used, the simulator simulates turning the screws in step 707 during the training. If a simulator is not being used, the robot 703 may be used to turn the physical screws on the cavity filter during the training phase.

[0085] During training, the agent 700 may train the world model, for example, by updating its reward, observation, transition and representation models (as described above). This may be performed on the basis of samples (e.g. (o, a, r, o) tuples in an experience buffer). The Actor model and the critic model may then also be updated as described above.

[0086] The goal of the agent is quantified via the reward r, which depicts the distance that the current configuration has to a tuned one. For example, the point-wise Euclidean distance between the current S-parameter values and the desired ones may be used, across the examined frequency range. If a tuned configuration is reached, the agent may, for example, receive a fixed r.sub.tuned reward (e.g. +100).

[0087] If a simulator is not being used, the agent 700 may interact with the filter by changing a set of tunable parameters via the screws that are located on top of it. Thus, observations are mapped to rewards which in turn get mapped (by the agent) to screw rotations which finally lead to physical modifications via the robot 703.

[0088] After training, at inference, the agent may be employed to interact directly with the environment based on received S-parameter observations provided from the VNA 701. In particular, the agent 700 may translate the S-parameter observations into the corresponding screw rotations and may send this information to the robot 703. The robot 703 then executes the screw rotations in step 705 as dictated by the agent 700. This process continues until a tuned configuration is reached.

[0089] FIG. 8 illustrates a typical example of VNA measurements during a training loop.

[0090] Graph 801 illustrates a modelled observation of a S-parameter curve at a time t=0. Graph 802 illustrates a modelled observation of a S-parameter curve at a time t=1 Graph 803 illustrates a modelled observation of a S-parameter curve at a time t=2. Graph 804 illustrates a modelled observation of a S-parameter curve at a time t=3.

[0091] Requirements for what the S-parameter curve should look like in this example are indicated by the horizonal bars. For instance, the curve 805 must lie above the bar 810 in the pass band and below the bars 811a to 811d in the stop band. The curve 806 and curve 807 must lie below the bar 812 in the passband.

[0092] The MBRL model satisfies these requirements after two steps (e.g. by t=2 in Graph 803).

[0093] One of the core components of the Dreamer model is its observation model q(o.sub.t|s.sub.t), which in essence is a decoder who, given a latent representation of the environment s.sub.t (encapsulating information regarding previous observations, rewards and actions) aims to reconstruct the current observation o.sub.t (e.g. the S-parameters of filter). In the Dreamer model, the observation model models the observations via a corresponding high dimensional Gaussian N((s.sub.t), I), where I is the identity matrix. Thus, the Dreamer model is only focused on learning the mean of the distribution, given the latent state s.sub.t. This approach is not sufficient in the environment of a cavity filter being controller by a control unit.

[0094] FIG. 9 is a graph illustrating an observational bottleneck, where in the case of the fixed non learnable standard deviation (I) 901 (as used in Dreamer), the resulting MBRL model seems to plateau after a few thousand steps, illustrating that the more simplistic world modelling does not continue learning.

[0095] On the other hand, by making the observation model also predict the standard deviation, this bottleneck is removed, leading to a more robust latent representation 902. In essence, it is no longer sufficient for an MBRL model to simply be accurate enough to predict the mean, but the whole model must be such that it can also be certain about its predictions. This increased precision yields better performance.

[0096] An MBRL model according to embodiments described herein also showcase enhanced distributional flexibility. Depending on the task, one can augment their network, following a similar procedure, in order to learn relevant statistics of any generative distribution.

[0097] FIG. 10 is a graph illustrating the observation loss 1001 for an MBRL with a learnable standard deviation according to embodiments described herein, and the observation loss 1002 for an MBRL with a fixed standard deviation.

[0098] During training the performance of the decoder may be evaluated by computing the likelihood (or probability) of generating the real observation of using the current decoder distribution. Ideally, a high likelihood will be found. This likelihood may be referred to as observation loss. The formula for observation loss may be log (q(o.sub.t|s.sub.t)). Minimizing the observation loss maximizes the likelihood of the decoder generating the real observation o.sub.t.

[0099] As can be seen from FIG. 10, the observation loss 1002 of the MBRL with a fixed standard deviation plateaus early at around 743 loss, which is close to the theoretically optimum loss of approximately 742.5. Whereas, the observation loss 1001 of the MBRL with a learnable standard deviation according to embodiments described herein continues to fall, thereby increasing the likelihood that the decoder will generate the real observation ot.

[0100] Furthermore, as illustrated in FIG. 11 an MBRL model according to embodiments described herein also manages to exhibit similar performance to a Model Free Soft Actor Critic (SAC) algorithm, while requiring roughly 4 times fewer samples. In particular, FIG. 11 illustrates a comparison between a how quickly a Best Model Free (SAC) agent can tune a cavity filter (illustrated by 1101), and how quickly an MBRL model according to embodiments described herein can tune the cavity filter (illustrated by 1102). The MBRL model according to embodiments described herein (1102) first tunes the filter (with positive reward) at around 8k steps, while the Best Model Free SAC agent (1101) first tunes the filter at around 44k steps. The MBRL model according to embodiments described herein therefore reaches similar performance with around 4 times fewer samples.

TABLE-US-00001 MBRL acc Dreamer embodiments SAC (original) described herein Accuracy 99.93% 69.81% 98.87%/99.72% Training steps 100k 100k 16k/32k

[0101] As can be seen from table 1, the SAC agent reaches 99.93% after training for 100k steps, whereas the MBRL according to embodiments described herein reaches similar performance at around 16k steps (e.g. close to 99%), while requiring at least 4 times fewer samples. In contrast, the original Dreamer model only reaches 69.81% accuracy with 100k steps.

[0102] FIG. 12 illustrates an apparatus 1200 comprising processing circuitry (or logic) 1201. The processing circuitry 1201 controls the operation of the apparatus 1200 and can implement the method described herein in relation to an apparatus 1200. The processing circuitry 1201 can comprise one or more processors, processing units, multi-core processors or modules that are configured or programmed to control the apparatus 1200 in the manner described herein. In particular implementations, the processing circuitry 1201 can comprise a plurality of software and/or hardware modules that are each configured to perform, or are for performing, individual or multiple steps of the method described herein in relation to the apparatus 1200.

[0103] Briefly, the processing circuitry 1201 of the apparatus 1200 is configured to: obtain a sequence of observations, o.sub.t, representative of the environment at a time t; estimate latent states s.sub.t at time t using a representation model, wherein the representation model estimates the latent states s.sub.t based on the previous latent states s.sub.t1, previous actions a.sub.t1 and the observations o.sub.t; generate modelled observations, o.sub.m,t, using an observation model, wherein the observation model generates the modelled observations based on the respective latent states s.sub.t, wherein the step of generating comprises determining means and standard deviations based on the latent states s.sub.t; and minimize a first loss function to update network parameters of the representation model and the observation model, wherein the first loss function comprises a component comparing the modelled observations, o.sub.m,t to the respective observations o.sub.t.

[0104] In some embodiments, the apparatus 1200 may optionally comprise a communications interface 1202. The communications interface 1202 of the apparatus 1200 can be for use in communicating with other nodes, such as other virtual nodes. For example, the communications interface 1202 of the apparatus 1200 can be configured to transmit to and/or receive from other nodes requests, resources, information, data, signals, or similar. The processing circuitry 1201 of apparatus 1200 may be configured to control the communications interface 1202 of the apparatus 1200 to transmit to and/or receive from other nodes requests, resources, information, data, signals, or similar.

[0105] Optionally, the apparatus 1200 may comprise a memory 1203. In some embodiments, the memory 1203 of the apparatus 1200 can be configured to store program code that can be executed by the processing circuitry 1201 of the apparatus 1200 to perform the method described herein in relation to the apparatus 1200. Alternatively or in addition, the memory 1203 of the apparatus 1200, can be configured to store any requests, resources, information, data, signals, or similar that are described herein. The processing circuitry 1201 of the apparatus 1200 may be configured to control the memory 1203 of the apparatus 1200 to store any requests, resources, information, data, signals, or similar that are described herein.

[0106] FIG. 13 is a block diagram illustrating an apparatus 1300 in accordance with an embodiment. The apparatus 1300 can train a model based reinforcement learning, MBRL, model for use in an environment. The apparatus 1300 comprises a obtaining module 1302 configured to obtain a sequence of observations, o.sub.t, representative of the environment at a time t. The apparatus 1300 comprises an estimating module 1304 configured to estimate latent states s.sub.t at time t using a representation model, wherein the representation model estimates the latent states s.sub.t based on the previous latent states s.sub.t1, previous actions a.sub.t1 and the observations o.sub.t. The apparatus 1300 comprises a generating module 1306 configured to generate modelled observations, o.sub.m,t, using an observation model, wherein the observation model generates the modelled observations based on the respective latent states s.sub.t, wherein the step of generating comprises determining means and standard deviations based on the latent states s.sub.t. The apparatus 1300 comprises a minimizing module 1308 configured to minimize a first loss function to update network parameters of the representation model and the observation model, wherein the first loss function comprises a component comparing the modelled observations, o.sub.m,t to the respective observations o.sub.t.

[0107] There is also provided a computer program comprising instructions which, when executed by processing circuitry (such as the processing circuitry 1201 of the apparatus 1200 described earlier, cause the processing circuitry to perform at least part of the method described herein. There is provided a computer program product, embodied on a non-transitory machine-readable medium, comprising instructions which are executable by processing circuitry to cause the processing circuitry to perform at least part of the method described herein. There is provided a computer program product comprising a carrier containing instructions for causing processing circuitry to perform at least part of the method described herein. In some embodiments, the carrier can be any one of an electronic signal, an optical signal, an electromagnetic signal, an electrical signal, a radio signal, a microwave signal, or a computer-readable storage medium.

[0108] Embodiments described herein therefore provide for improved distribution flexibility. In other words, the proposed embodiments to also model the standard deviation via a separate Neural Network Layer is generalizable to many different distributions, as one can augment their network accordingly to predict relevant distribution statistics. If suited, one can impose certain priors (e.g. positive output) via appropriate activation functions for each statistic.

[0109] The embodiments described herein also provide stable training as the MBRL model can steadily learn the standard deviation. As the MBRL model becomes more robust, the MBRL model may gradually decrease the standard deviation of his prediction and become more precise. Unlike maintaining a fixed value for the standard deviation, this change allows for smoother training, characterized by smaller gradient magnitudes.

[0110] The embodiments described herein provide Improved Accuracy. Prior to this disclosure the success rate at tuning filters using MBRL peaked at around 70%, however, embodiments described herein are able to reach performance comparable with the previous MFRL agents (e.g. close to 99%). At the same time, the MBRL model according to embodiments described herein is significantly faster, reaching the aforementioned performance with at least 3 to 4 times fewer training samples in comparison to the best MFRL agents.

[0111] Since training is faster, one can search the hyperparameter space faster. This may be vital for extending our model to more intricate filter environments. Training is also more stable, which leads to less dependency on certain hyperparameters. This greatly speeds up the process of hyperparameter tuning. Furthermore, convincingly solving a task with a broader range of hyperparameters is a good indicator of its extendibility to more complicated filters.

[0112] Therefore, as embodiments described herein effectively train the MBRL model faster, it means that the tuning of cavity filters can be performed much faster. For example, much faster that the current 30 minutes required for a human expert to tune a cavity filter.

[0113] It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word comprising does not exclude the presence of elements or steps other than those listed in a claim, a or an does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.

METHODS AND APPARATUSES FOR TRAINING A MODEL BASED REINFORCEMENT LEARNING MODEL

Inventors

Cpc classification

Classification Explorer

G06N3/044

PHYSICS

Classification Explorer

G06N3/006

PHYSICS

Classification Explorer

G06N3/047

PHYSICS

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

H01J23/20

ELECTRICITY

Classification Explorer

G06N3/092

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

International classification

Classification Explorer

G06N3/092

PHYSICS

Abstract

Claims

Description