Method and System for Devising an Optimum Control Policy
20190258228 · 2019-08-22
Inventors
CPC classification
G05B19/4155
PHYSICS
International classification
Abstract
A method for devising an optimum control policy of a controller for controlling a system includes optimizing at least one parameter that characterizes the control policy. A Gaussian process model is used to model the expected dynamics of the system. The optimization optimizes a cost function, which depends on the control policy and the Gaussian process model, with respect to the at least one parameter. The optimization is carried out by evaluating at least one gradient of the cost function with respect to the at least one parameter. For an evaluation of the cost function, a temporal evolution of a state of the system is computed using the control policy and the Gaussian process model. The cost function depends on an evaluation of an expectation value of a cost function under a probability density of an augmented state at time steps.
Claims
1. A method for devising an optimum control policy of a controller for controlling a system, said method comprising: optimizing at least one parameter that characterizes said control policy; using a Gaussian process model to model expected dynamics of the system, wherein said optimization optimizes a cost function which depends on said control policy and said Gaussian process model with respect to said at least one parameter; and carrying out said optimization by evaluating at least one gradient of said cost function with respect to said at least one parameter, wherein for an evaluation of said cost function a temporal evolution of a state of the system is computed using said control policy and said Gaussian process model, and wherein said cost function depends on an evaluation of an expectation value of a cost function under a probability density of an augmented state at time steps.
2. The method according to claim 1, wherein said augmented state at a given time step comprises the state at said given time step.
3. The method according to claim 1, wherein said augmented state at a given time step comprises an error between the state and a desired state at a previous time step.
4. The method according to claim 1, wherein said augmented state at a given time step comprises an accumulated error of a previous time step.
5. The method according to claim 3, wherein the augmented state and/or the desired state are Gaussian random variables.
6. The method according to claim 1, wherein the controller is a multivariate controller.
7. The method according to claim 1, wherein: a first step of optimizing said at least one parameter by said optimization of said cost function with respect to said at least one parameter, a second step of controlling said system by said controller using said control policy parametrized by said optimized at least one parameter, and a third step of updating said Gaussian process model based on a recorded reaction of said system during said second step are carried out iteratively.
8. The method according to claim 1, wherein the system comprises an actuator and/or a robot.
9. The method according to claim 1, wherein said system is controlled by said controller, the control policy of which has been devised by the method.
10. The method according to claim 1, wherein a training system for devising an optimum control policy of a controller is configured to carry out the method.
11. The method according to claim 1, wherein a control system for controlling a system is configured to carry out the method.
12. The method according to claim 1, wherein a computer program contains instructions which cause a processor to carry out the method when the computer program is executed by said processor.
13. The method according to claim 12, wherein a machine-readable storage medium is configured to store the computer program.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] The objects, features and advantages of the disclosure will be apparent from the following detailed descriptions of the various aspects of the disclosure in conjunction with the following drawings.
DETAILED DESCRIPTION
[0043] This signal representing the state x is then passed on to a controller 60, which may, for example, be a PID controller. The controller is parameterized by parameters θ, which the controller 60 may receive from a parameter storage P. The controller 60 computes a signal representing an input signal u, e.g. via equation (11). This signal is then passed on to an output unit 80, which transforms the signal representing the input signal u into an actuation signal A, which is passed on to the physical system 10 and causes said physical system 10 to act. Again, if the input signal u is already in a suitable format, the output unit may be omitted altogether.
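Equation (11) is not reproduced in this excerpt. As an illustration only, the following is a minimal sketch of a discrete-time multivariate PID control law of the kind the controller 60 may implement, acting on the error and accumulated error that appear in the augmented state of claims 3 and 4. The class name and the gain matrices Kp, Ki, Kd are hypothetical names, not taken from the disclosure.

```python
import numpy as np

class MultivariatePID:
    """Sketch of a discrete-time multivariate PID control law.

    Hypothetical illustration: gain matrices Kp, Ki, Kd and the
    time step dt are assumptions, not equation (11) itself.
    """

    def __init__(self, Kp, Ki, Kd, dt):
        self.Kp, self.Ki, self.Kd, self.dt = Kp, Ki, Kd, dt
        self.accumulated_error = None
        self.previous_error = None

    def control(self, x, x_desired):
        # error between the state and the desired state (cf. claim 3)
        error = x_desired - x
        if self.accumulated_error is None:
            self.accumulated_error = np.zeros_like(error)
            self.previous_error = error
        # accumulated error of previous time steps (cf. claim 4)
        self.accumulated_error = self.accumulated_error + error * self.dt
        derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        # input signal u from proportional, integral and derivative terms
        return (self.Kp @ error
                + self.Ki @ self.accumulated_error
                + self.Kd @ derivative)
```

In a multivariate setting the gains are matrices rather than scalars, which is what distinguishes the multivariate controller of claim 6 from a set of independent single-loop PID controllers.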
[0044] The controller 60 may be controlled by software which may be stored on a machine-readable storage medium 45 and executed by a processor 46. For example, said software may be configured to compute the input signal u using the control law given by equation (11).
[0047] First (1000), a random policy is devised, e.g. by randomly assigning values to the parameters θ and storing them in the parameter storage P. The controller 60 then controls the physical system 10 by executing the control policy corresponding to these random parameters θ. The corresponding state signals x are recorded and passed on to block 190.
[0048] Next (1010), a GP dynamics model f̂ is trained using the recorded signals x and u to model the temporal evolution of the system state x, i.e. x_{t+1} = f̂(x_t, u_t).
[0049] Then (1020), a roll-out of the augmented system state z_t over a horizon H is computed based on the GP dynamics model f̂, the present parameters θ and the corresponding control policy π(θ), and the gradient of the cost function J with respect to the parameters θ is computed, e.g. by equations (17)-(20).
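Equations (17)-(20) for the analytic gradient are not reproduced in this excerpt. Purely for illustration, the same quantities can be sketched under two simplifications: the roll-out propagates only the GP posterior mean (rather than the full probability density of the augmented state), and the gradient dJ/dθ is approximated by central finite differences. All function names, the quadratic stage cost, and the policy signature are assumptions.

```python
import numpy as np

def rollout_cost(theta, f_hat, policy, x0, x_desired, H):
    """Deterministic (mean-propagation) roll-out over horizon H.

    Simplified sketch: the disclosure rolls out a probability density
    of the augmented state; here only the mean is propagated, and the
    stage cost is an illustrative quadratic in the tracking error.
    """
    x = x0
    accumulated_error = np.zeros_like(x0)
    J = 0.0
    for _ in range(H):
        error = x_desired - x
        accumulated_error = accumulated_error + error
        u = policy(theta, error, accumulated_error)
        x = f_hat(x, u)                 # GP posterior mean of the next state
        J += float(error @ error)       # illustrative quadratic stage cost
    return J

def cost_gradient(theta, f_hat, policy, x0, x_desired, H, eps=1e-5):
    """Central finite-difference approximation of dJ/dtheta
    (stands in for the analytic gradients of equations (17)-(20))."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp.flat[i] += eps
        tm.flat[i] -= eps
        grad.flat[i] = (rollout_cost(tp, f_hat, policy, x0, x_desired, H)
                        - rollout_cost(tm, f_hat, policy, x0, x_desired, H)) / (2 * eps)
    return grad
```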
[0050] Based on these gradients, new parameters θ are computed (1030). These new parameters replace the present parameters θ in the parameter storage P.
[0051] Next, it is checked whether the parameters θ have sufficiently converged (1040). If it is decided that they have not, the method iterates back to step 1020. Otherwise, the present parameters are selected as the optimum parameters θ* that minimize the cost function J (1050).
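The loop of steps 1020-1050 can be sketched as follows. The disclosure leaves the concrete update rule open; plain gradient descent with a step-size-based convergence test is used here for illustration (in practice a quasi-Newton method such as L-BFGS is also common), and all names are hypothetical.

```python
import numpy as np

def optimize_policy(theta0, grad_fn, learning_rate=0.01, tol=1e-6, max_iters=1000):
    """Sketch of steps 1020-1050: iterate gradient evaluation and
    parameter update until the parameters have sufficiently converged."""
    theta = theta0.copy()
    for _ in range(max_iters):
        grad = grad_fn(theta)                        # step 1020: dJ/dtheta
        theta_new = theta - learning_rate * grad     # step 1030: new parameters
        if np.linalg.norm(theta_new - theta) < tol:  # step 1040: converged?
            return theta_new                         # step 1050: theta*
        theta = theta_new
    return theta
```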
[0052] The controller 60 is then executed with the control policy corresponding to these optimum parameters θ* to control the physical system 10. The input signal u and the state signal x are recorded (1060).
[0053] The GP dynamics model f̂ is then updated (1070) using the recorded signals x and u.
[0054] Next, it is checked whether the GP dynamics model f̂ has sufficiently converged (1080). This convergence can be checked, e.g., by checking the convergence of the log likelihood of the measured data x_t, which is maximized by adjusting the hyperparameters of the GP, e.g. with a gradient-based method. If it is deemed not to have sufficiently converged, the method branches back to step 1020. Otherwise, the present optimum parameters θ* are selected as the parameters that will be used to parametrize the control policy of the controller 60. This concludes the method.
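The quantity whose convergence is checked in step 1080 is the GP log marginal likelihood. A standard sketch of its evaluation is shown below; the function name is hypothetical, and the kernel matrix K is assumed to already encode the hyperparameters being tuned by the gradient-based method.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise):
    """Log likelihood of the measured data under a zero-mean GP.

    Sketch of the quantity checked for convergence in step 1080;
    maximizing it with respect to the hyperparameters inside K is the
    gradient-based tuning the disclosure refers to.
    """
    n = len(y)
    Kn = K + noise * np.eye(n)
    L = np.linalg.cholesky(Kn)                       # Kn = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                         # data-fit term
            - np.log(np.diag(L)).sum()               # complexity term
            - 0.5 * n * np.log(2 * np.pi))           # normalization
```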
[0055] Parts of this disclosure have been published as Andreas Doerr, Duy Nguyen-Tuong, Alonso Marco, Stefan Schaal, and Sebastian Trimpe, "Model-Based Policy Search for Automatic Tuning of Multivariate PID Controllers," arXiv:1703.02899v1, 2017, which is incorporated herein by reference in its entirety.