Randomized reinforcement learning for control of complex systems

11164077 · 2021-11-02

Abstract

A method of controlling a complex system, and a gas turbine controlled by the method, are provided. The method comprises providing training data, which training data represents at least a portion of a state space of the system; setting a generic control objective for the system and a corresponding set point; and exploring the state space, using Reinforcement Learning, for a control policy for the system which maximizes an expected total reward. The expected total reward depends on a randomized deviation of the generic control objective from the corresponding set point.

Claims

1. A method comprising: providing training data that represents at least a portion of a state space of a complex system; and exploring the state space, using Reinforcement Learning, for a control policy for the complex system which maximizes an expected total reward; wherein: prior to exploring the state space, a generic control objective for the complex system and a corresponding set point are set, the corresponding set point being a desired value for a process value of the complex system under control; and the expected total reward is dependent on a randomized deviation of the generic control objective from the corresponding set point.

2. The method of claim 1, wherein the randomized deviation of the generic control objective from the corresponding set point comprises a scaled random number.

3. The method of claim 2, wherein a maximum magnitude of the scaled random number is a fraction of a magnitude of the corresponding set point.

4. The method of claim 1, wherein the generic control objective for the complex system and the corresponding set point are scalable in magnitude.

5. The method of claim 1, wherein the control policy comprises a sequence of state transitions in the state space, each of the state transitions entailing a corresponding reward, and the expected total reward comprising a sum of the corresponding rewards of the sequence of state transitions of the control policy.

6. The method of claim 5, wherein the corresponding reward for each state transition of the sequence of state transitions is approximated by a neural network.

7. The method of claim 6, wherein the exploring of the state space is performed using a policy gradient method.

8. The method of claim 7, wherein the exploring of the state space is performed using Policy Gradient Neural Rewards Regression.

9. The method of claim 8, wherein the exploring of the state space comprises supplying the randomized deviation of the generic control objective from the corresponding set point as an input of the neural network.

10. The method of claim 1, further comprising deploying the control policy to control the complex system, wherein deploying the control policy to control the complex system comprises supplying a deviation of the generic control objective from the corresponding set point as an input of a neural network.

11. The method of claim 10, wherein deploying the control policy to control the complex system further comprises setting the set point to a fixed value.

12. The method of claim 1, wherein the training data comprises system conditions, ambient conditions, and performed control actions recorded as time series at discrete time instants during deployment of the complex system.

13. A computer program product comprising software code for performing the steps of the method of claim 1 when the computer program product is run on a computer.

14. A gas turbine, comprising a control device configured to perform the method of claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Embodiments of the invention will be described with reference to the accompanying drawings, in which the same or similar reference numerals designate the same or similar elements.

(2) FIG. 1 is a schematic diagram for illustrating a method according to an embodiment.

(3) FIG. 2 is a schematic diagram for illustrating a method according to a further embodiment.

(4) FIG. 3 is a schematic diagram for illustrating a neural network topology according to the prior art.

(5) FIG. 4 is a schematic diagram for illustrating a neural network topology deployed in the method according to various embodiments.

(6) FIG. 5 is a schematic diagram for illustrating exemplary training data used in the method according to various embodiments.

(7) FIG. 6 is a schematic diagram illustrating a gas turbine according to an embodiment.

DETAILED DESCRIPTION OF EMBODIMENTS

(8) Exemplary embodiments of the invention will now be described with reference to the drawings. While some embodiments will be described in the context of specific fields of application, the embodiments are not limited to these fields of application. Further, the features of the various embodiments may be combined with each other unless specifically stated otherwise.

(9) The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art.

(10) FIG. 1 is a schematic diagram for illustrating a method 10a according to an embodiment.

(11) The method 10a comprises the steps of: providing 11 training data 40, which training data 40 represents at least a portion of a state space S of a complex system 50; setting 12 a generic control objective 32 for the system 50 and a corresponding set point 33; and exploring 13 the state space S, using Reinforcement Learning, for a control policy for the system 50 which maximizes an expected total reward. The expected total reward depends on a randomized deviation 31 of the generic control objective 32 from the corresponding set point 33.

(12) The control policy comprises a sequence of state transitions in the state space S, each of the state transitions entailing a corresponding reward. The expected total reward comprises a sum of the corresponding rewards of the sequence of state transitions of the control policy.
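For illustration only (this sketch is not part of the claimed method), the relationship described in the preceding paragraph can be written out in Python. The function name is illustrative; the discount factor gamma, discussed with FIG. 3, defaults to 1 so that the expected total reward reduces to a plain sum of the per-transition rewards:

```python
def total_reward(rewards, gamma=1.0):
    """Sum of the per-transition rewards along the control policy's
    sequence of state transitions; gamma < 1 discounts later rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```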

(13) FIG. 2 is a schematic diagram for illustrating a method 10b according to a further embodiment.

(14) In addition to the steps of the method 10a, the method 10b further comprises the step of: deploying 14 the control policy to control the system. This step comprises supplying a deviation of the generic control objective 32 from the corresponding set point 33 as an input of the neural network 30, and may optionally comprise setting the set point 33 to a fixed value. In other words, once the control policy is determined, the deviation of the generic control objective 32 from the corresponding set point 33 is no longer randomized by adding a scaled random number during deployment of the control policy.

(15) FIG. 3 is a schematic diagram for illustrating a neural network topology 20 according to the prior art.

(16) The topology 20 has, as shown at the lower end of FIG. 3, the state s_t, the action a_t and the follow-on state s_t′ = s_(t+1) as inputs, and the so-called quality function Q as output. Inputs and outputs are interconnected via weight matrices A-E and a hyperbolic tangent as an activation function.

(17) The quality function Q measures the goodness of state-action pairs. For example, the left-hand side of FIG. 3 represents the goodness of the state-action pair s_t, a(s_t) = a_t at time instant t, and the right-hand side represents the goodness of the state-action pair s_t′ = s_(t+1), a(s_t′) = a_(t+1), wherein the indices t and t+1 respectively stand for variables at time instant t (current state) and at time instant t+1 (follow-on state).

(18) In other words, the left-hand side and right-hand side of FIG. 3 denote the respective goodness of the consecutive states s_t and s_(t+1) under the desired control policy a(s∈S), which determines which action a∈A of the available actions A to choose in a particular state s.

(19) Thus, it is apparent from the topology 20 that each state transition s_t → s_(t+1) entails a corresponding reward, shown at the top of FIG. 3, which is given by the difference between the left-hand side and the right-hand side of FIG. 3. The discount factor 0 < γ < 1 is merely responsible for ensuring convergence.
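The reward relation of paragraph (19) can be sketched numerically (an illustrative helper, not the patented network; the function name and example Q-values are made up):

```python
def transition_reward(q_current, q_next, gamma=0.9):
    """Reward entailed by a state transition s_t -> s_(t+1), expressed as
    the difference between the quality of the current state-action pair
    and the discounted quality of the follow-on state-action pair."""
    assert 0.0 < gamma < 1.0, "discount factor must lie in (0, 1)"
    return q_current - gamma * q_next
```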

(20) FIG. 4 is a schematic diagram for illustrating a neural network topology 30 deployed in the method according to various embodiments.

(21) FIG. 4 shows that the topology 30 has additional inputs with respect to the topology 20 of FIG. 3.

(22) The additional inputs represent randomized deviations 31, from which weight matrices F-G lead into respective activation functions. As a result, the expected total reward approximated by the topology 30 also depends on the randomized deviation 31.

(23) Each randomized deviation 31 depicted in FIG. 4 comprises a deviation of a generic control objective (or target) 32 from a corresponding set point 33, plus a scaled random number 34 (a time series of which behaves like noise), whose maximum magnitude is a fraction of the magnitude of the corresponding set point 33. For example, the maximum magnitude of the scaled random number 34 may be +/−0.3 times the magnitude of the corresponding set point 33.
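A minimal sketch of this training-time quantity, assuming the +/−0.3 scaling given as an example above (the function name and signature are illustrative, not from the patent):

```python
import random

def randomized_deviation(objective, set_point, max_fraction=0.3, rng=random):
    """Deviation of a generic control objective from its set point, plus a
    scaled random number whose maximum magnitude is a fraction (default
    +/- 0.3) of the magnitude of the set point."""
    scaled_noise = rng.uniform(-max_fraction, max_fraction) * abs(set_point)
    return (objective - set_point) + scaled_noise
```

During deployment the scaled random number would simply be omitted, leaving the plain deviation `objective - set_point`.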

(24) In the case of multiple generic control objectives 32, each generic control objective 32 for the system 50 and its corresponding set point 33 are scalable in magnitude.

(25) The neural network topology 30 approximates the corresponding reward for each state transition of the sequence of state transitions of the desired control policy by exploring 13 the state space S of the system 50 using a policy gradient method, in particular Policy Gradient Neural Rewards Regression.
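The rewards-regression idea can be illustrated with a heavily simplified sketch: a linear quality function q(x) = x @ w stands in for the tanh network of FIG. 4, the randomized deviation is treated as one more input feature, and the difference q(x_t) − γ·q(x_(t+1)) is regressed onto observed rewards. All names, dimensions, and data here are synthetic assumptions, not the patented topology:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
w_true = np.array([1.0, -0.5, 0.25])   # hidden "true" weights (synthetic)
w = np.zeros(3)                        # learned weights

def q(x, w):
    """Linear stand-in for the quality function Q."""
    return x @ w

# synthetic transitions: features = [state, action, randomized deviation]
xs_t = rng.normal(size=(200, 3))
xs_tp1 = rng.normal(size=(200, 3))
rewards = q(xs_t, w_true) - gamma * q(xs_tp1, w_true)

lr = 0.05
for _ in range(200):
    pred = q(xs_t, w) - gamma * q(xs_tp1, w)        # predicted rewards
    err = pred - rewards
    grad = (xs_t - gamma * xs_tp1).T @ err / len(err)  # d(0.5*err^2)/dw
    w -= lr * grad

mse = float(np.mean((q(xs_t, w) - gamma * q(xs_tp1, w) - rewards) ** 2))
```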

(26) To this end, the exploring 13 of the state space S of the system 50 comprises supplying the randomized deviation 31 of the generic control objective 32 from the corresponding set point 33 as an input of the neural network 30.

(27) On the other hand, the deploying of the control policy comprises supplying a deviation of the generic control objective 32 from the corresponding set point 33 as an input of the neural network 30, without any randomization.

(28) FIG. 5 is a schematic diagram for illustrating exemplary training data 40 used in the method according to various embodiments.

(29) The depicted table is one possible representation of the training data 40; alternatives include comma-separated values or database storage.

(30) The training data 40 comprises system conditions 42 and ambient conditions 43 of the system 50 to be controlled, as well as performed control actions 44, recorded as time series 42, 43, 44 at discrete time instants 41 during deployment of the system 50. The system conditions 42 and ambient conditions 43 collectively represent the acquired fraction of the state space S, from which the desired control policy is to be determined.
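One possible comma-separated representation of such training data can be sketched as follows; the column names and values are purely illustrative placeholders, not taken from FIG. 5:

```python
import csv
import io

# Each row: a discrete time instant, a system condition, an ambient
# condition, and the control action performed at that instant.
raw = """time,turbine_speed,ambient_temp,valve_action
0,3000,15.2,0.10
1,3010,15.1,0.12
2,3025,15.3,0.11
"""
rows = list(csv.DictReader(io.StringIO(raw)))
```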

(31) FIG. 6 is a schematic diagram illustrating a gas turbine 50 according to an embodiment.

(32) The gas turbine 50 comprises a control device 51 configured to perform the method 10a; 10b according to various embodiments.

(33) The control device 51, including the neural network 30 shown in FIG. 4, may be trained, based on the training data 40 shown in FIG. 5, in order to determine the desired control policy that maximizes the expected total reward. Once determined, the control policy defines a mapping from individual system states s∈S of the state space S to a control action a∈A from the set of available control actions A. In other words, the control device 51 then knows which control action to perform in each state s, and under one or more generic control objectives, in order to maximize the expected total reward or, equivalently, to optimally control the underlying gas turbine 50.
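The resulting state-to-action mapping can be sketched as a greedy selection over quality estimates (an illustrative helper; the action names are made up):

```python
def greedy_policy(q_values):
    """Mapping from the current system state to the control action that
    maximizes the approximated expected total reward.  q_values maps each
    available action to its quality estimate for the current state."""
    return max(q_values, key=q_values.get)
```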

(34) Although the invention has been shown and described with respect to certain preferred embodiments, equivalents and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalents and modifications and is limited only by the scope of the appended claims.

(35) For illustration, while various examples above have been described for a complex system implemented by a gas turbine, the techniques described herein may be readily applied to other kinds and types of complex systems. Examples of complex systems include: subsea equipment and factories; communication networks; medical equipment, including imaging tools such as magnetic resonance imaging devices or computed tomography devices; power plants such as nuclear power plants or coal power plants; etc.