Randomized reinforcement learning for control of complex systems
11164077 · 2021-11-02
Inventors
- Siegmund Düll (Munich, DE)
- Kai Heesche (Munich, DE)
- Raymond S. Nordlund (Orlando, FL, US)
- Steffen Udluft (Eichenau, DE)
- Marc Christian Weber (Munich, DE)
CPC classification
G06N7/01
PHYSICS
F05D2270/30
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
G06N3/006
PHYSICS
F05D2270/709
MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
Abstract
A method of controlling a complex system and a gas turbine being controlled by the method are provided. The method comprises providing training data, which training data represents at least a portion of a state space of the system; setting a generic control objective for the system and a corresponding set point; and exploring the state space, using Reinforcement Learning, for a control policy for the system which maximizes an expected total reward. The expected total reward depends on a randomized deviation of the generic control objective from the corresponding set point.
Claims
1. A method comprising: providing training data that represents at least a portion of a state space of a complex system; and exploring the state space, using Reinforcement Learning, for a control policy for the complex system which maximizes an expected total reward; wherein: prior to exploring the state space, setting a generic control objective for the complex system and a corresponding set point, the corresponding set point being a desired value for a process value of the complex system under control; and the expected total reward is dependent on a randomized deviation of the generic control objective from the corresponding set point.
2. The method of claim 1, wherein the randomized deviation of the generic control objective from the corresponding set point comprises a scaled random number.
3. The method of claim 2, wherein a maximum magnitude of the scaled random number is a fraction of a magnitude of the corresponding set point.
4. The method of claim 1, wherein the generic control objective for the complex system and the corresponding set point are scalable in magnitude.
5. The method of claim 1, wherein the control policy comprises a sequence of state transitions in the state space, each of the state transitions entailing a corresponding reward, and the expected total reward comprising a sum of the corresponding rewards of the sequence of state transitions of the control policy.
6. The method of claim 5, wherein the corresponding reward for each state transition of the sequence of state transitions is approximated by a neural network.
7. The method of claim 6, wherein the exploring of the state space is performed using a policy gradient method.
8. The method of claim 7, wherein the exploring of the state space is performed using Policy Gradient Neural Rewards Regression.
9. The method of claim 8, wherein the exploring of the state space comprises supplying the randomized deviation of the generic control objective from the corresponding set point as an input of the neural network.
10. The method of claim 1, further comprising deploying the control policy to control the complex system, wherein deploying the control policy to control the complex system comprises supplying a deviation of the generic control objective from the corresponding set point as an input of a neural network.
11. The method of claim 10, wherein deploying the control policy to control the complex system further comprises setting the set point to a fixed value.
12. The method of claim 1, wherein the training data comprises system conditions, ambient conditions, and performed control actions recorded as time series at discrete time instants during deployment of the complex system.
13. A computer program product comprising software code for performing the steps of the method of claim 1 when the computer program product is run on a computer.
14. A gas turbine, comprising a control device configured to perform the method of claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the invention will be described with reference to the accompanying drawings, in which the same or similar reference numerals designate the same or similar elements.
DETAILED DESCRIPTION OF EMBODIMENTS
(8) Exemplary embodiments of the invention will now be described with reference to the drawings. While some embodiments will be described in the context of specific fields of application, the embodiments are not limited to those fields of application. Further, the features of the various embodiments may be combined with each other unless specifically stated otherwise.
(9) The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art.
(11) The method 10a comprises the steps of: providing 11 training data 40, which training data 40 represents at least a portion of a state space S of a complex system 50; setting 12 a generic control objective 32 for the system 50 and a corresponding set point 33; and exploring 13 the state space S, using Reinforcement Learning, for a control policy for the system 50 which maximizes an expected total reward. The expected total reward depends on a randomized deviation 31 of the generic control objective 32 from the corresponding set point 33.
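The randomized reward underlying steps 11 to 13 can be sketched as follows. The function names, the `fraction` parameter, and the negative-absolute-deviation reward shape are illustrative assumptions, not taken from the patent; only the structure (deviation from set point plus a scaled random number, per claims 2 and 3) follows the text:

```python
import random

def randomized_deviation(set_point, fraction=0.1):
    """Scaled random number (claims 2-3): its maximum magnitude is a
    fraction of the magnitude of the corresponding set point."""
    return random.uniform(-1.0, 1.0) * fraction * abs(set_point)

def training_reward(objective_value, set_point, fraction=0.1):
    """Reward used while exploring the state space: it depends on the
    randomized deviation of the generic control objective from the
    corresponding set point."""
    deviation = (objective_value - set_point) + randomized_deviation(set_point, fraction)
    return -abs(deviation)  # smaller deviation -> larger reward (assumed shape)
```

Randomizing the deviation during training exposes the learner to many nearby set points, which is what later allows the set point to be fixed at deployment time.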
(12) The control policy comprises a sequence of state transitions in the state space S, each of the state transitions entailing a corresponding reward. The expected total reward comprises a sum of the corresponding rewards of the sequence of state transitions of the control policy.
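The total reward as a sum over the per-transition rewards can be sketched as below; the optional `discount` parameter is an assumption for generality, as the patent only states a sum:

```python
def expected_total_reward(transition_rewards, discount=1.0):
    """Sum of the corresponding rewards over the sequence of state
    transitions of a control policy (undiscounted when discount=1.0)."""
    return sum((discount ** t) * r for t, r in enumerate(transition_rewards))
```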
(14) In addition to the steps of the method 10a, the method 10b further comprises the step of: deploying 14 the control policy to control the system. This step comprises supplying a deviation of the generic control objective 32 from the corresponding set point 33 as an input of the neural network 30, and may optionally comprise setting the set point 33 to a fixed value. In other words, once the control policy is determined, the deviation of the generic control objective 32 from the corresponding set point 33 is no longer randomized by adding a scaled random number during deployment.
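A minimal sketch of the deployment-time network input, assuming the network consumes the state vector with the plain, non-randomized deviation appended (the function name and input layout are assumptions):

```python
def deployment_input(state, objective_value, set_point):
    """Input vector at deployment: the system state plus the plain
    deviation from the fixed set point, with no random term added."""
    return list(state) + [objective_value - set_point]
```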
(16) The topology 20 has, as shown at the lower end of
(17) The quality function Q measures goodness of state-action pairs. For example, the left-hand side of
(18) In other words, the left-hand side and right-hand side of
(19) Thus, it is apparent from topology 20 that each state transition s_t → s_(t+1) entails a corresponding reward shown at the top of
(22) The additional inputs represent randomized deviations 31, from which weight matrices F-G lead into respective activation functions. As a result, the expected total reward approximated by the topology 30 also depends on the randomized deviation 31.
(23) Each randomized deviation 31 depicted in
(24) In case of multiple generic control objectives 32, the respective generic control objective 32 for the system 50 and the respective corresponding set point 33 are scalable in magnitude.
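One way to make differently sized objectives and set points comparable in magnitude is to normalize each deviation by its set point's magnitude; this particular scaling rule is an illustrative assumption, not the patent's:

```python
def scale_objective(objective_value, set_point):
    """Scale an objective/set-point pair by the set point's magnitude so
    that deviations of differently sized objectives become comparable."""
    scale = abs(set_point) if set_point != 0 else 1.0
    return (objective_value - set_point) / scale
```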
(25) The neural network topology 30 approximates the corresponding reward for each state transition of the sequence of state transitions of the desired control policy by exploring 13 the state space S of the system 50 using a policy gradient method, in particular using Policy Gradient Neural Rewards Regression.
(26) To this end, the exploring 13 of the state space S of the system 50 comprises supplying the randomized deviation 31 of the generic control objective 32 from the corresponding set point 33 as an input of the neural network 30.
(27) On the other hand, the deploying of the control policy comprises supplying a deviation of the generic control objective 32 from the corresponding set point 33 as an input of the neural network 30, without any randomization.
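A toy stand-in for such a network, with the deviation wired in as one extra input node: the architecture, layer sizes, and names below are assumptions and do not reproduce the patent's Policy Gradient Neural Rewards Regression topology; they only illustrate how a deviation enters as an ordinary input:

```python
import numpy as np

class RewardApproximator:
    """Tiny feed-forward network whose inputs include the (possibly
    randomized) deviation as one extra input node, mirroring the role
    of the additional inputs of topology 30."""
    def __init__(self, n_state, n_action, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        n_in = n_state + n_action + 1  # +1 input node for the deviation
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, 1))

    def __call__(self, state, action, deviation):
        # Concatenate state, action, and the deviation into one input vector.
        x = np.concatenate([np.asarray(state), np.asarray(action), [deviation]])
        h = np.tanh(x @ self.W1 + self.b1)
        return float((h @ self.W2)[0])
```

During training the `deviation` argument would carry the randomized deviation 31; at deployment it would carry the plain deviation from the fixed set point.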
(29) The depicted table is one possible representation of the training data 40; alternatives include comma-separated values or database storage.
(30) The training data 40 comprises system conditions 42 and ambient conditions 43 of the system 50 to be controlled, as well as performed control actions 44 recorded as time series 42, 43, 44 at discrete time instants 41 during deployment of the system 50. The system conditions 42 and ambient conditions 43 collectively represent the acquired fraction of the state space S, from which the desired control policy is to be determined.
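Such a table could, for instance, be held as comma-separated values, one of the storage forms the description mentions. The column names and values below are invented for illustration and are not taken from the patent:

```python
import csv
import io

# Hypothetical CSV form of the training data 40: discrete time instants,
# a system condition, an ambient condition, and a performed control action.
CSV_DATA = """time,turbine_temp,ambient_temp,control_action
0,450.0,15.0,0.20
1,452.5,15.1,0.30
2,455.0,15.2,0.10
"""

def load_training_data(text):
    """Read the time series of system conditions, ambient conditions,
    and performed control actions recorded at discrete time instants."""
    return list(csv.DictReader(io.StringIO(text)))
```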
(32) The gas turbine 50 comprises a control device 51 configured to perform the method 10a; 10b according to various embodiments.
(33) The control device 51, including the neural network 30 shown in
(34) Although the invention has been shown and described with respect to certain preferred embodiments, equivalents and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalents and modifications and is limited only by the scope of the appended claims.
(35) For illustration, while above various examples have been described for a complex system implemented by a gas turbine, the techniques described herein may be readily applied to other kinds and types of complex systems. Examples of complex systems include: subsea equipment and factory; communication networks; medical equipment including imaging tools such as magnetic resonance imaging devices or computer tomography devices; power plants such as nuclear power plants or coal power plants; etc.