CAVITY FILTER TUNING USING IMITATION AND REINFORCEMENT LEARNING

20220343141 · 2022-10-27

Abstract

A method for solving a sequential decision-making problem is provided. The method includes gathering state-action pair data from an expert policy; applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy; and applying a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy and has an output with one or more actions to be performed for solving the sequential decision-making problem.

Claims

1. A method for solving a sequential decision-making problem, the method comprising: gathering state-action pair data from an expert policy; applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy; and applying a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy and has an output with one or more actions to be performed for solving the sequential decision-making problem.

2. The method of claim 1, wherein the imitation learning comprises a behavioral cloning technique.

3. The method of claim 1, wherein the sequential decision-making problem for solving comprises cavity filter tuning and the method further comprises applying a screw selector for tuning a screw in a cavity filter.

4. The method of claim 3, wherein the screw selector comprises a Deep Q Network (DQN).

5. The method of claim 1, wherein the expert policy is based on Tuning Guide Program (TGP).

6. The method of claim 1, wherein the cloned policy is in the form of a neural network, wherein the deepest hidden layer is convolutional in one dimension.

7. The method of claim 1, wherein the reinforcement learning technique comprises the Deep Deterministic Policy Gradient (DDPG) technique.

8. The method of claim 1, wherein the output of the reinforcement learning technique is forced via a multiplied tanh function.

9. The method of claim 1, wherein applying the reinforcement learning technique comprises allowing the reinforcement learning technique to run for N_critic iterations where only a critic network is trained, with no change to an actor network or a target network, and after the N_critic iterations, allowing the technique to run to convergence.

10. The method of claim 1, further comprising performing the one or more actions of the output of the reinforcement learning technique.

11. A node for solving a sequential decision-making problem, the node comprising: a data storage system; and a data processing apparatus comprising a processor, wherein the data processing apparatus is coupled to the data storage system, and the data processing apparatus is configured to: gather state-action pair data from an expert policy; apply imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy; and apply a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy and has an output with one or more actions to be performed for solving the sequential decision-making problem.

12. The node of claim 11, wherein the imitation learning comprises a behavioral cloning technique.

13. The node of claim 11, wherein the sequential decision-making problem for solving comprises cavity filter tuning and wherein the data processing apparatus is further configured to apply a screw selector for tuning a screw in a cavity filter.

14. The node of claim 13, wherein the screw selector comprises a Deep Q Network (DQN).

15. The node of claim 11, wherein the expert policy is based on Tuning Guide Program (TGP).

16. The node of claim 11, wherein the cloned policy is in the form of a neural network, wherein the deepest hidden layer is convolutional in one dimension.

17. The node of claim 11, wherein the reinforcement learning technique comprises the Deep Deterministic Policy Gradient (DDPG) technique.

18. The node of claim 11, wherein an output of the reinforcement learning technique is forced via a multiplied tanh function.

19. The node of claim 11, wherein applying the reinforcement learning technique comprises allowing the reinforcement learning technique to run for N_critic iterations where only the critic network is trained, with no change to the actor network or target network, and after the N_critic iterations, allowing the technique to run to convergence.

20. The node of claim 11, wherein the data processing apparatus is further configured to perform the one or more actions of the output of the reinforcement learning technique.

21-23. (canceled)

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0017] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0018] FIG. 1 illustrates a box diagram with a reinforcement learning component.

[0019] FIG. 2 illustrates an example of the tuning process of the cavity filter with a trained reinforcement learning agent according to an embodiment.

[0020] FIG. 3 illustrates a block diagram of the imitation learning and reinforcement learning technique, also showing the screw selector according to an embodiment.

[0021] FIG. 4 is a flow chart according to an embodiment.

[0022] FIG. 5 is a block diagram of an apparatus according to an embodiment.

[0023] FIG. 6 is a block diagram of an apparatus according to an embodiment.

DETAILED DESCRIPTION

[0024] An example of an intelligent filter tuning technique using a common reinforcement learning technique follows. Filter tuning can be described as a Markov decision process (MDP) as follows.

[0025] State: The S-parameters are the state. The S-parameters are frequency dependent, i.e. S=S(f). For a two-port filter, the S-parameters are S_11, S_12, S_21, and S_22. The S-parameters may be the output of a Vector Network Analyzer, which displays S-parameter curves. For a single observation, the input to the artificial neural networks (ANNs) of the policy function and the Q-network may be a real-valued vector including the real and imaginary parts of all the components of the S-parameters. The S-parameters were sampled at every MHz in the range between 850 and 950 MHz and assembled into a vector with 400 elements.
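The construction of the observation vector can be illustrated with a minimal sketch. The source does not state how the 400 elements are laid out, so the following is an assumption: 100 frequency points at 1 MHz spacing, with the real and imaginary parts of two S-parameters (S11 and S21) concatenated. The function name build_state_vector and the random sample data are hypothetical.

```python
import numpy as np

def build_state_vector(s11, s21):
    """Flatten complex S-parameter samples into one real-valued vector.

    s11, s21: complex 1-D arrays of equal length, one sample per frequency.
    Layout (an assumption): [Re(s11), Im(s11), Re(s21), Im(s21)].
    """
    s11 = np.asarray(s11, dtype=complex)
    s21 = np.asarray(s21, dtype=complex)
    if s11.ndim != 1 or s11.shape != s21.shape:
        raise ValueError("s11 and s21 must be 1-D arrays of equal length")
    return np.concatenate([s11.real, s11.imag, s21.real, s21.imag])

# Hypothetical frequency grid: 100 points at 1 MHz spacing from 850 MHz.
freqs_mhz = 850.0 + np.arange(100)

# Placeholder measurements standing in for Vector Network Analyzer output.
rng = np.random.default_rng(0)
s11 = rng.normal(size=freqs_mhz.size) + 1j * rng.normal(size=freqs_mhz.size)
s21 = rng.normal(size=freqs_mhz.size) + 1j * rng.normal(size=freqs_mhz.size)

state = build_state_vector(s11, s21)
print(state.shape)  # (400,)
```

Other layouts (for example, interleaving real and imaginary parts, or including all four S-parameters on a coarser grid) would produce a vector of the same length; the choice only needs to be consistent between training and deployment.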

[0026] Action: Tuning the cavity filter. For example, a 6p2z type filter has 13 adjustable screws, each with a continuous range [−90°, 90°]. One or more of the screws may be adjusted for tuning purposes.

[0027] Reward: The agent receives a positive reward (e.g. +100) if the state satisfies the design specification; otherwise, a negative reward is incurred depending on the distance to the tuning specifications. This shaped reward function may be heuristically designed by human intuition and does not necessarily lead to an optimal policy for problem solving. An example follows:

\[
d_{11}(f) =
\begin{cases}
0, & \text{if } s_{11}(f) \text{ satisfies the design spec} \\
\left| s_{11}(f) - s_{11}^{\text{spec}}(f) \right|, & \text{otherwise}
\end{cases}
\qquad
d_{21}(f) =
\begin{cases}
0, & \text{if } s_{21}(f) \text{ satisfies the design spec} \\
\left| s_{21}(f) - s_{21}^{\text{spec}}(f) \right|, & \text{otherwise}
\end{cases}
\tag{1}
\]

Here $s_{11}^{\text{spec}}(f)$ and $s_{21}^{\text{spec}}(f)$ are the lower or upper bounds of the design specifications. Then the total reward for a state s becomes:

\[
r(s) =
\begin{cases}
100, & \text{if solved} \\
-\sum_{f} \left( d_{11}(f) + d_{21}(f) \right), & \text{otherwise}
\end{cases}
\tag{2}
\]
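The shaped reward above can be sketched in a few lines. This is an illustrative reading of the two formulas, not the implementation used in the embodiments. It assumes that the curves are given as real-valued magnitudes per frequency point, that s11 is constrained by an upper bound and s21 by a lower bound, and that "solved" means every per-frequency distance is zero. The function names and the example bounds are hypothetical.

```python
import numpy as np

def band_distance(s, lower=None, upper=None):
    """Per-frequency distance d(f) to a spec given as optional bounds.

    Returns 0 where the curve satisfies the bound and
    |s(f) - s_spec(f)| where it violates it.
    """
    s = np.asarray(s, dtype=float)
    d = np.zeros_like(s)
    if lower is not None:
        lower = np.broadcast_to(np.asarray(lower, dtype=float), s.shape)
        d += np.where(s < lower, lower - s, 0.0)
    if upper is not None:
        upper = np.broadcast_to(np.asarray(upper, dtype=float), s.shape)
        d += np.where(s > upper, s - upper, 0.0)
    return d

def reward(s11, s21, s11_upper, s21_lower, solved_reward=100.0):
    """Return +100 when all samples meet the spec, else the negative distance sum."""
    d11 = band_distance(s11, upper=s11_upper)
    d21 = band_distance(s21, lower=s21_lower)
    total = float(np.sum(d11 + d21))
    if total == 0.0:  # assumption: "solved" means no sample violates its bound
        return solved_reward
    return -total

# Hypothetical two-point curves and bounds, for illustration only.
print(reward([-25.0, -22.0], [-0.5, -0.4], s11_upper=-20.0, s21_lower=-1.0))  # 100.0
print(reward([-15.0, -22.0], [-0.5, -3.0], s11_upper=-20.0, s21_lower=-1.0))  # -7.0
```

In the second call, s11 exceeds its bound by 5 at the first point and s21 falls short of its bound by 2 at the second point, so the reward is the negative sum of those distances.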

[0028] The reinforcement learning technique used may be the Deep Deterministic Policy Gradient (DDPG) technique. Simulation results using the DDPG technique show that the agent could find a good policy after sampling about 149,000 data points with the best available hyper-parameters. FIG. 1 illustrates a box diagram with a reinforcement learning component 104, showing (state, reward) input to the reinforcement learning component 104, which interacts with the environment 102 with actions, resulting in a policy π.

[0029] FIG. 2 illustrates an example of the tuning process of the cavity filter with a trained reinforcement learning agent. The tuning specifications are also visible. In the beginning, the curves are quite far off from the design specifications and in the consecutive images the filter is closer to being tuned until step 18 when the tuning process is finished.

[0030] Tuning Guide Program (TGP) is one prominent example of an automatic tuning technique. By calculating the return loss curve which best matches a Chebyshev polynomial within the passband, within the feasible set of the current filter model, TGP can calculate the optimal positions of the screws and thereby provide recommendations for how to tune each screw. As the true filter may not match the model, TGP updates its estimate of the feasible set in each iteration until the filter is tuned.

[0031] TGP is (as of the time of writing) state-of-the-art on the problem of automatic cavity filter tuning. On a 6p2z environment, for example, TGP is able to tune filters with an accuracy of 97% and, on average, 27 screw adjustments. The accuracy, in this case, refers to the probability that the filter will be tuned within 100 adjustments when initialized randomly. Embodiments disclosed herein build upon learning from expert data, such as that gathered by running TGP. Accordingly, embodiments herein provide solutions to the following two problems: (1) with as few data points as possible, how to ensure that the trained policy has a significantly better accuracy than the expert data (e.g. TGP); and (2) with as few data points as possible, how to ensure that the trained policy, on average, uses significantly fewer screw adjustments than the expert data (e.g. TGP), while maintaining the same or substantially similar accuracy.
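The two evaluation metrics used here (accuracy and average number of screw adjustments) can be estimated from a batch of tuning episodes roughly as follows. This is a sketch under stated assumptions: each episode is summarized by the number of adjustments it took, with None marking an episode that never reached the specification, and the average is taken over successful episodes only, which the source does not specify. The function name summarize_episodes is hypothetical.

```python
def summarize_episodes(adjustment_counts, max_adjustments=100):
    """Estimate accuracy and mean adjustments from per-episode results.

    adjustment_counts: list of ints (adjustments until tuned) or None
    for episodes in which the filter was never tuned.
    """
    if not adjustment_counts:
        raise ValueError("at least one episode is required")
    solved = [n for n in adjustment_counts
              if n is not None and n <= max_adjustments]
    accuracy = len(solved) / len(adjustment_counts)
    # Assumption: the average is computed over successful episodes only.
    mean_adjustments = sum(solved) / len(solved) if solved else float("nan")
    return accuracy, mean_adjustments

# Four hypothetical episodes: two tuned in time, one never tuned, one too slow.
accuracy, mean_adjustments = summarize_episodes([20, 30, None, 150])
print(accuracy, mean_adjustments)  # 0.5 25.0
```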

[0032] In order to address the two issues identified above, embodiments herein provide an imitation-reinforcement learning technique, such as detailed below.

[0033] As a first step, state-action pair data is gathered with an expert policy (such as provided by TGP). An expert policy refers to a known policy which is desired to be improved, such as a policy where actions are chosen by a source of expert knowledge (e.g., a human expert that manually selects actions), or a policy that is known to have decent performance (e.g., TGP in the case of tuning cavity filters). After this, behavioral cloning may be performed on the expert policy, yielding a cloned policy. The expert policy and/or cloned policy may take the form of a neural network, where the deepest hidden layer is convolutional in one dimension. Convolutional layers in a neural network convolve the input (e.g., with a multiplication or other dot product) and pass the result to the next layer.
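Behavioral cloning amounts to supervised regression from expert states to expert actions. The following sketch shows the idea with a linear least-squares policy in place of the one-dimensional convolutional network described above; that substitution is made only to keep the example free of deep learning dependencies. The synthetic expert, the dimensions, and the name clone_policy are all hypothetical.

```python
import numpy as np

def clone_policy(states, actions, ridge=1e-6):
    """Fit a linear policy a = [s, 1] @ W to expert state-action pairs."""
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    # Append a constant column so the fitted policy includes a bias term.
    X = np.hstack([states, np.ones((states.shape[0], 1))])
    # Ridge-regularized normal equations keep the solve well-posed.
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ actions)

    def policy(state):
        x = np.append(np.asarray(state, dtype=float), 1.0)
        return x @ W

    return policy

# Synthetic stand-in for expert data: 200 states of dimension 8,
# with 3-dimensional actions produced by a hidden linear expert.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 8))
expert_weights = rng.normal(size=(8, 3))
actions = states @ expert_weights + 0.1

policy = clone_policy(states, actions)
max_error = np.max(np.abs(policy(states[0]) - actions[0]))
print(max_error < 1e-3)  # True
```

With a neural network policy, the same state-action pairs would be used as inputs and regression targets, and the resulting weights would serve as the starting point for the reinforcement learning stage.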

[0034] In order to improve the performance of the policy obtained with imitation learning, a reinforcement learning technique is employed. The reinforcement learning technique may employ an actor-critic architecture (such as DDPG), i.e. an actor neural network and a critic neural network. The actor network is used to select actions, and the critic network is used to criticize the actions made by the actor, where the criticism by the critic network iteratively improves the policy of the actor network. A target network may also be used, which is similar to the actor network and initialized to the actor network, but is updated more slowly than the actor network, in order to improve convergence speed. In embodiments, the DDPG technique may be used, where the actor network is initialized with the weights of the imitation policy trained in the previous steps. To maintain consistency with the imitator network, the output may be forced (e.g., via a multiplied tanh function) to be within the interval [−b_a, b_a]. In order to have a well-initialized critic network, the reinforcement learning technique (e.g., DDPG) may be allowed to run for N_critic iterations in which only the critic network is trained, with no change to the actor network or target network. After this, the technique is allowed to run to convergence.
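Two details of this paragraph lend themselves to a short sketch: bounding the actor output with a multiplied tanh, and the critic-only warm-up phase. The code below is illustrative only. The bound of 90 is a hypothetical value chosen to match the screw range mentioned earlier, and networks_to_update is a hypothetical helper that merely reports which networks a training loop would update at a given iteration.

```python
import numpy as np

def bounded_action(pre_activation, b_a):
    """Squash raw actor outputs into [-b_a, b_a] with a scaled tanh."""
    return b_a * np.tanh(np.asarray(pre_activation, dtype=float))

def networks_to_update(iteration, n_critic):
    """Return which networks are trained at a given 0-based iteration."""
    if iteration < n_critic:
        # Warm-up: only the critic learns; actor and target stay frozen.
        return ("critic",)
    # After the warm-up, training proceeds with all networks.
    return ("critic", "actor", "target")

# Hypothetical bound and warm-up length, for illustration only.
actions = bounded_action([-10.0, 0.0, 10.0], b_a=90.0)
print(np.all(np.abs(actions) <= 90.0))  # True

for iteration in (0, 4, 5):
    print(iteration, networks_to_update(iteration, n_critic=5))
```

Keeping the actor frozen during the warm-up means the critic is first fitted to the behavior of the cloned policy, so its early gradients are less likely to pull the actor away from the imitation starting point.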

[0035] In some embodiments, a screw selector (such as one using a Deep Q Network (DQN)) may be used. For example, a policy trained with DDPG may need to turn all screws in every step in order to converge. This property is suboptimal for minimizing or reducing the number of adjustments needed. A screw selector may be trained (e.g. using DQN) to allow the technique to tune only one screw at a time. In embodiments, anywhere from one screw to all the screws may be adjusted on a given step.

[0036] For example, the screw selector may be trained in the following manner. In every step, S-parameter data is gathered and a trained reinforcement learning actor network (for instance the one from the steps above) predicts an action to be performed for every screw. Both of these (the S-parameter data and the action for every screw) are fed into a fully connected neural network, which predicts a Q-value (short for quality value, an estimate of cumulative reward) for each screw. When trained, the agent tunes the screw with the highest predicted Q-value by the amount predicted by the DDPG actor network for that particular screw. The Q-network is trained using the DQN technique with an ε-decay exploration scheme.
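The selection step described above can be sketched as follows: given one Q-value per screw and the actor's proposed adjustment for every screw, only the adjustment for the highest-valued screw is applied. The exploration part assumes epsilon-greedy selection with a linearly decaying rate; the source does not specify the schedule, so the start value, end value, and decay length are hypothetical, as are all function names.

```python
import numpy as np

def select_single_screw(q_values, actor_actions):
    """Keep only the actor's action for the screw with the highest Q-value."""
    q_values = np.asarray(q_values, dtype=float)
    actor_actions = np.asarray(actor_actions, dtype=float)
    screw = int(np.argmax(q_values))
    action = np.zeros_like(actor_actions)
    action[screw] = actor_actions[screw]
    return screw, action

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly decayed exploration rate (the schedule is an assumption)."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def choose_screw(q_values, step, rng):
    """Pick a random screw with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Hypothetical three-screw example.
screw, action = select_single_screw([0.1, 0.9, 0.3], [5.0, -12.0, 7.0])
print(screw, action.tolist())  # 1 [0.0, -12.0, 0.0]
```

During training, choose_screw would replace the greedy argmax so that the Q-network also sees the outcome of turning screws it currently rates poorly; at deployment only the greedy selection is used.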

[0037] FIG. 3 illustrates a block diagram of the imitation learning and reinforcement learning technique, also showing the screw selector. As shown, there is a simulation environment 302, expert data 304, behavioral cloning 306, reinforcement learning 308, and a screw selector 310.

[0038] The table below shows the performance of different tuning techniques for a 6p2z filter. TGP refers to the expert data mentioned above. DDPG (only) refers to using only reinforcement learning with the DDPG technique. IL-DDPG (without DQN) refers to using imitation learning and reinforcement learning (using the DDPG technique). Finally, IL-DDPG-DQN refers to using imitation learning and reinforcement learning (using the DDPG technique), and additionally using a screw selector (using the DQN technique). The IL-DDPG-DQN combination has a higher success rate and fewer adjustment steps (on average), which leads to shorter total tuning time.

Technique                     TGP    DDPG (only)      IL-DDPG (without DQN)    IL-DDPG-DQN
#Total data points            0      149,000          73,000                   257,000
Success rate                  97%    99.36 ± 0.04%    99.67 ± 0.03%            99.67 ± 0.77%
#Average screw adjustments    23     43               44                       17

[0039] FIG. 4 illustrates a process 400 for solving a sequential decision-making problem according to some embodiments. Process 400 may begin with step s402.

[0040] Step s402 comprises gathering state-action pair data from an expert policy.

[0041] Step s404 comprises applying imitation learning to yield a cloned policy based on the gathered state-action pair data from the expert policy.

[0042] Step s406 comprises applying a reinforcement learning technique, wherein the reinforcement learning technique is initialized based on the cloned policy and has an output with one or more actions to be performed for solving the sequential decision-making problem.

[0043] In embodiments, the imitation learning comprises a behavioral cloning technique. In embodiments, the method further includes applying a screw selector for tuning a screw in a cavity filter, such as a screw selector comprising a Deep Q Network (DQN). In embodiments, the expert policy is based on Tuning Guide Program (TGP). In embodiments, the cloned policy is in the form of a neural network, wherein the deepest hidden layer is convolutional in one dimension. In embodiments, the reinforcement learning technique comprises the Deep Deterministic Policy Gradient (DDPG) technique. In embodiments, an output of the reinforcement learning technique is forced via a multiplied tanh function. In embodiments, applying the reinforcement learning technique comprises allowing the reinforcement learning technique to run for N_critic iterations where only the critic network is trained, with no change to the actor network or target network, and after the N_critic iterations, allowing the technique to run to convergence. In embodiments, the method further includes performing the one or more actions of the output of the reinforcement learning technique.

[0044] FIG. 5 is a block diagram of an apparatus 500, according to some embodiments. As shown in FIG. 5, the apparatus may comprise: processing circuitry (PC) 502, which may include one or more processors (P) 555 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 548 comprising a transmitter (Tx) 545 and a receiver (Rx) 547 for enabling the apparatus to transmit data to and receive data from other nodes connected to a network 510 (e.g., an Internet Protocol (IP) network) to which network interface 548 is connected; and a local storage unit (a.k.a., “data storage system”) 508, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 502 includes a programmable processor, a computer program product (CPP) 541 may be provided. CPP 541 includes a computer readable medium (CRM) 542 storing a computer program (CP) 543 comprising computer readable instructions (CRI) 544. CRM 542 may be a non-transitory computer readable medium, such as magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 544 of computer program 543 is configured such that when executed by PC 502, the CRI causes the apparatus to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, the apparatus may be configured to perform steps described herein without the need for code. That is, for example, PC 502 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[0045] FIG. 6 is a schematic block diagram of the apparatus 500 according to some other embodiments. The apparatus 500 includes one or more modules 600, each of which is implemented in software. The module(s) 600 provide the functionality of apparatus 500 described herein (e.g., the steps herein, e.g., with respect to FIG. 4).

[0046] While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[0047] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.