Constraint adaptor for reinforcement learning control
11676064 · 2023-06-13
Abstract
A system for controlling an operation of a machine subject to state constraints in continuous state space of the machine and subject to control input constraints in continuous control input space of the machine is provided. The system includes an input interface to accept data indicative of a state of the machine, a memory configured to store an optimization problem for computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping the state of the machine within a control invariant set (CIS) to a control input satisfying the control input constraints, and a processor configured to iteratively perform a reinforcement learning (RL) algorithm to jointly control the machine and update the control policy.
Claims
1. A system for controlling an operation of a machine subject to state constraints in continuous state space of the machine and subject to control input constraints in continuous control input space of the machine, comprising: an input interface to accept data indicative of a state of the machine; a memory configured to store an optimization problem for computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping the state of the machine within a control invariant set (CIS) to a control input satisfying the control input constraints, wherein a control of the machine having the state within the CIS according to the control policy maintains the state of the machine within the CIS, wherein the memory includes a supervisor algorithm that obtains the state of the machine and computes a desired safety margin, wherein the supervisor algorithm generates a safe command when a reinforcement learning (RL) algorithm generates a command that is deemed unsafe, wherein the safe command is a modification of the unsafe command according to optimization (SO):
c(t) = min α Σ_{k=1}^{N} ∥u(k|t)∥_1, where c(t) is a cost function, u(k|t) is the predicted value of the command vector u(t+k), α is a scaling factor, k, N are integers, and t is a current time of the system; and a processor configured to iteratively perform the RL algorithm to jointly control the machine and update the control policy, wherein, for performing the joint control and update, the processor is configured to control the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; determine a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and update the control policy that improves a cost function of operation of the machine according to the determined reward.
2. The system of claim 1, wherein the RL algorithm is a deep-deterministic policy gradient (DDPG) algorithm.
3. The system of claim 2, wherein the DDPG algorithm learns both a critic network to estimate long-term values for a given policy and an actor network to sample optimal actions according to the estimated long-term values.
4. The system of claim 1, wherein the reward function is modified to an updated reward by subtracting the cost function from the reward function, wherein the updated reward is expressed by
r̂(t) = r(t) − c(t), where r̂(t) is the updated reward, r(t) is the reward function, c(t) is the cost function, and t is a current time of the system.
5. The system of claim 1, wherein the scaling factor α is chosen such that the cost c(t) is between ±c_b, wherein c_b are bounds on the cost function.
6. The system of claim 1, wherein a maximum penalty G for performing the RL algorithm is about twice a value of c_b: G ≈ 2c_b, wherein c_b are bounds on the cost function.
7. The system of claim 1, wherein the machine is a suspension system of a vehicle.
8. A computer-implemented method for controlling an operation of a machine subject to state constraints in continuous state space of the machine and subject to control input constraints in continuous control input space of the machine, wherein the computer-implemented method is stored in a memory, comprising steps of: accepting data indicative of a state of the machine; computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping the state of the machine within a control invariant set (CIS) to a control input satisfying the control input constraints, wherein a control of the machine having the state within the CIS according to the control policy maintains the state of the machine within the CIS, wherein the memory includes a supervisor algorithm that obtains the state of the machine and computes a desired safety margin, wherein the supervisor generates a safe command when a reinforcement learning (RL) algorithm generates a command that is deemed unsafe, wherein the safe command is a modification of the unsafe command according to optimization (SO):
c(t) = min α Σ_{k=1}^{N} ∥u(k|t)∥_1, where c(t) is a cost function, u(k|t) is the predicted value of the command vector u(t+k), α is a scaling factor, k, N are integers, and t is a current time of the system; and iteratively performing the RL algorithm to jointly control the machine and update the control policy, wherein the iteratively performing step comprises: controlling the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; determining a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and updating the control policy that improves a cost function of operation of the machine according to the determined reward.
9. The method of claim 8, wherein the RL algorithm is a deep-deterministic policy gradient (DDPG) algorithm.
10. The method of claim 9, wherein the DDPG algorithm learns both a critic network to estimate long-term values for a given policy and an actor network to sample optimal actions according to the estimated long-term values.
11. The method of claim 8, wherein the reward function is modified to an updated reward by subtracting the cost function from the reward function, wherein the updated reward is expressed by
r̂(t) = r(t) − c(t), where r̂(t) is the updated reward, r(t) is the reward function, c(t) is the cost function, and t is a current time of the system.
12. The method of claim 8, wherein the scaling factor α is chosen such that the cost c(t) is between ±c_b, wherein c_b are bounds on the cost function.
13. The method of claim 8, wherein a maximum penalty G for performing the RL algorithm is about twice a value of c_b: G ≈ 2c_b, wherein c_b are bounds on the cost function.
14. The method of claim 8, wherein the machine is a suspension system of a vehicle.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The presently disclosed embodiments will be further explained with reference to the attached drawings. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.
DETAILED DESCRIPTION
(11) Various embodiments of the present invention are described hereafter with reference to the figures. It should be noted that the figures are not drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of specific embodiments of the invention. They are not intended as an exhaustive description of the invention or as a limitation on the scope of the invention. In addition, an aspect described in conjunction with a particular embodiment of the invention is not necessarily limited to that embodiment and can be practiced in any other embodiments of the invention.
(12) It is an object of some embodiments to provide a system and a method for controlling an operation of a machine using a data-driven state feedback optimal controller. It is another object of some embodiments to provide such a controller that is suitable for controlling a machine subject to safety constraints. An example of such a data-driven optimal controller uses reinforcement learning (RL) to determine control policies based on data obtained during the operation of the controlled machine and a supervisor to provide feedback to the RL-based controller on safe operation of the control.
(13) In some embodiments, the controlled system 109 is modeled as a discrete-time linear system:
x(t+1) = Ax(t) + Bu(t) + B_w w(t) (1)
where x is a vector containing the system states, u is a vector of commands, and w is a vector of disturbances. When the system is nonlinear, it can for most practical purposes be modeled as a linear system of this form. The RL controller receives a feedback signal 112 from the system, which is generally a function of the system state and command vectors but not of the disturbance input vector, because the disturbance is generally unknown. The controller modifies the command according to the feedback. In general, the feedback q(t) is a function of all of the vectors above:
q(t)=f(x(t), u(t), w(t)) (2)
(14) The system 109 is output-constrained, meaning that the output 115 is subject to constraints. The output can be described mathematically as a linear combination of the system state vector, command vector, and disturbance input vector:
y(t) = Cx(t) + Du(t) + D_w w(t) (3)
(15) The output is a vector, and the constraints to which it is subject are modeled as a set:
Y(t) = y_1(t)ê_1 + y_2(t)ê_2
where y_1(t) and y_2(t) are appropriately valued scalars. For safe operation, the output must remain in the constraint set 123. Mathematically, the constraints are represented as linear-inequality requirements:
Sy(t) ≤ s (4)
which define the polytope or polygon that geometrically represents the constraints.
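By way of non-limiting illustration, the following Python sketch sets up such a constrained linear system and checks the set-membership constraint Sy(t) ≤ s; the matrices and bounds are hypothetical placeholders rather than values from any embodiment.

```python
import numpy as np

# Hypothetical discrete-time linear system x(t+1) = A x(t) + B u(t) + Bw w(t)
# with constrained output y(t) = C x(t) + D u(t) + Dw w(t); all numbers are
# illustrative placeholders, not values from the embodiments.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [0.1]])
Bw = np.array([[0.0],
               [0.05]])
C = np.eye(2)
D = np.zeros((2, 1))
Dw = np.zeros((2, 1))

# Polytopic output constraint S y(t) <= s (here simple box bounds |y_i| <= 1).
S = np.vstack([np.eye(2), -np.eye(2)])
s = np.ones(4)

def step(x, u, w):
    """One step of the dynamics together with the constrained output."""
    x_next = A @ x + B @ u + Bw @ w
    y = C @ x + D @ u + Dw @ w
    return x_next, y

def is_safe(y):
    """Check the set-membership constraint S y <= s."""
    return bool(np.all(S @ y <= s))

x = np.array([0.2, -0.1])
u = np.array([0.05])
w = np.array([0.01])
x, y = step(x, u, w)
print("output", y, "safe:", is_safe(y))
```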
(16) Algorithms used in RL generally do not protect against constraint violation. Conventional RL algorithms work through a trial-and-error process that aims to maximize an accumulation of discounted rewards:
Σ_{t=0}^{∞} γ^t r(t), (5)
where r(t) is the reward function and γ<1 is a positive discount factor. The RL controller selects the command from the feedback through a policy π_θ parameterized by θ:
u(t) = π_θ(q(t)) (6)
(18) The main idea behind this invention is to modify the reward function r(t) by subtracting a cost function c(t) that measures the danger of constraint violation. The updated reward is therefore:
r̂(t) = r(t) − c(t) (7)
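As a minimal illustration of equation (7), the following sketch shows the reward modification as it would be applied at each time step; the numeric values are illustrative only.

```python
def shaped_reward(r, c):
    """Updated reward r_hat(t) = r(t) - c(t), equation (7).

    r: task reward r(t) produced by the environment.
    c: supervisor cost c(t), i.e. the effort needed to stay within constraints
       (zero when the state is deep inside the safe set).
    """
    return r - c

print(shaped_reward(-0.1, 0.4))   # a step that needed corrective effort: -0.5
print(shaped_reward(-0.1, 0.0))   # a zero-effort step keeps its reward:  -0.1
```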
(19) The cost function c(t) is determined by an add-on element called a supervisor.
(21) The supervisor computes the cost by solving the following optimization problem, referred to as the supervisor optimization (SO):
c(t) = min α Σ_{k=1}^{N} ∥u(k|t)∥_1 (8)
subject to the constraint:
Sy(k|t) + y_w^β(k, S) ≤ s (9)
for k = 0, …, N−1 and subject to the constraint:
Hy(k|t) + y_w^β(k, H) ≤ h (10)
(22) The term y(k|t) is the predicted value of y(t+k) at time t according to the dynamics:
x(k+1|t)=Ax(k|t)+Bu(k|t) (11)
y(k|t)=Cx(k|t)+Du(k|t) (12)
with the initial conditions x(0|t) = x(t), which is obtained from the system, and u(0|t) = u(t), which is obtained from the RL controller. The term y_w^β(k, S) is the support of the disturbance set with probability β. This set is the set Ξ_k satisfying:
Pr(y_w(t+k) − y(k|t) ∈ Ξ_k) = β (13)
so that y_w^β(k, S) is the solution to:
min S^T y_w (14)
subject to the constraint:
y_w ∈ Ξ_k (15)
(23) The multiplicative factor α in the (SO) problem is a scaling factor that modifies the size of the cost c(t). In the above, k, N are integers, and t is a current time of the system.
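By way of non-limiting illustration, the (SO) problem above can be posed as a linear program using the standard ℓ1 reformulation (auxiliary variables bounding |u(k|t)|). The following Python sketch does so under simplifying assumptions: the disturbance-support terms are replaced by a constant margin, only the output constraints (9) are imposed, and the system matrices, horizon, margin, and scaling factor α are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import linprog

# Simplified (SO) as a linear program with the l1 trick (slacks t_k >= |u_k|).
# All matrices and numbers below are hypothetical placeholders.
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
C = np.eye(2)
D = np.zeros((2, 1))
S = np.vstack([np.eye(2), -np.eye(2)])   # output constraints S y <= s (box |y_i| <= 1)
s = np.ones(4)
margin = 0.05 * np.ones(4)               # stand-in for the support term y_w^beta(k, S)
alpha = 10.0                             # scaling factor of the cost c(t)
N = 10                                   # prediction horizon

def supervisor_cost(x0, u0):
    """Approximate c(t): min alpha * sum_k ||u(k|t)||_1 s.t. S y(k|t) + margin <= s."""
    nu = B.shape[1]
    n_free = (N - 1) * nu                # free inputs u(1|t), ..., u(N-1|t)
    cost = np.concatenate([np.zeros(n_free), alpha * np.ones(n_free)])
    A_ub, b_ub = [], []
    I = np.eye(n_free)
    A_ub.append(np.hstack([I, -I]));  b_ub.append(np.zeros(n_free))   #  u_k <= t_k
    A_ub.append(np.hstack([-I, -I])); b_ub.append(np.zeros(n_free))   # -u_k <= t_k
    x_const = np.asarray(x0, dtype=float)   # part of x(k|t) fixed by x(t) and u(0|t)
    G = np.zeros((A.shape[0], n_free))      # sensitivity of x(k|t) to the free inputs
    for k in range(N):
        u_sel = np.zeros((nu, n_free))
        if k >= 1:
            u_sel[:, (k - 1) * nu:k * nu] = np.eye(nu)
        u_const = np.asarray(u0, dtype=float) if k == 0 else np.zeros(nu)
        y_const = C @ x_const + D @ u_const          # y(k|t), constant part
        y_sens = C @ G + D @ u_sel                   # y(k|t), free-input part
        A_ub.append(np.hstack([S @ y_sens, np.zeros((S.shape[0], n_free))]))
        b_ub.append(s - margin - S @ y_const)
        x_const = A @ x_const + B @ u_const          # x(k+1|t) = A x(k|t) + B u(k|t)
        G = A @ G + B @ u_sel
    res = linprog(cost, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=[(None, None)] * n_free + [(0, None)] * n_free)
    return res.fun if res.success else None          # None: no safe solution exists

print(supervisor_cost(np.array([0.2, -0.1]), np.array([0.05])))
```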
(24) The solution to the (SO) problem is the minimum effort required to keep the system within constraints according to the system model. The system model is not perfect, hence the need for RL to obtain a better control. Furthermore, RL cannot handle constraints, hence the need for a supervisor (or supervisor algorithm) to inform RL of constraint violation. In this way, the functions of the RL algorithm and the supervisor are complementary to each other. The supervisor is model-based and can determine the optimal value according to the linear model through a relatively simple computation. For example, in the case of the supervisor, we can obtain a strong bound on the value of N, which is the maximum number of steps needed to return to the zero-effort set. To obtain N, we compute the control invariant set (CIS), which is the set of all system states x(t) for which there exists a command u(t) that keeps the state within the CIS according to the system dynamics and satisfies the set-membership constraint Sy(t) ≤ s. Therefore, if a state is not in the CIS, the system is guaranteed to eventually violate constraints.
(25) One way to compute the CIS is to compute the set of all combinations of initial states and commands that guarantee constraint enforcement and project this set onto the x-axis. Once the projection stops changing, we have found the limit N. Specifically, we compute the CIS by defining the set:
C_0 = {(x, u_0) : S(Cx + Du_0) ≤ s} (16)
and then recursively computing the sets:
C_k = {(x, u_0, …, u_k) : Ax + Bu_i ∈ C_{k−1}, i = 1, …, k, (x, u_0) ∈ C_0} (17)
The CIS is the projection of lim_{k→∞} C_k onto the x-axis. When the projection at step k is the same size as the projection at step k−1, we set N = k−1. For practical purposes, we can stop the algorithm a little earlier, when the difference between successive projections is deemed negligible.
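The following sketch illustrates, under strong simplifications, how the horizon N can be estimated in the spirit of the recursion above: the state space is gridded, and for each grid point a linear-program feasibility test checks whether some input sequence keeps the predicted outputs within Sy ≤ s for a growing number of steps; when the count of admissible grid points stops changing, the iteration stops. All matrices and the grid are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import linprog

# Placeholder system and output constraints (same illustrative values as above).
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
C = np.eye(2)
D = np.zeros((2, 1))
S = np.vstack([np.eye(2), -np.eye(2)])
s = np.ones(4)

def admissible(x0, k):
    """True if inputs u(0..k) exist keeping S y(j) <= s for j = 0..k."""
    nu = B.shape[1]
    n_dec = (k + 1) * nu
    A_ub, b_ub = [], []
    x_const, G = np.asarray(x0, dtype=float), np.zeros((A.shape[0], n_dec))
    for j in range(k + 1):
        u_sel = np.zeros((nu, n_dec))
        u_sel[:, j * nu:(j + 1) * nu] = np.eye(nu)
        A_ub.append(S @ (C @ G + D @ u_sel))
        b_ub.append(s - S @ (C @ x_const))
        x_const = A @ x_const                 # constant part of x(j+1)
        G = A @ G + B @ u_sel                 # input-dependent part of x(j+1)
    res = linprog(np.zeros(n_dec), A_ub=np.vstack(A_ub),
                  b_ub=np.concatenate(b_ub), bounds=[(None, None)] * n_dec)
    return res.success

grid = [np.array([a, b]) for a in np.linspace(-1, 1, 11)
                          for b in np.linspace(-1, 1, 11)]
prev_count = None
for k in range(1, 20):
    count = sum(admissible(x, k) for x in grid)
    if count == prev_count:                   # projection stopped shrinking
        print("approximate N =", k - 1, "with", count, "admissible grid states")
        break
    prev_count = count
```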
(26) The existence of the CIS implies that there sometimes does not exist a solution to the (SO) problem, as the state may not be inside the CIS. Furthermore, if the state is outside of the CIS then, according to the model, the system will inevitably violate constraints, as there does not exist a solution to the (SO) problem. If this occurs, we set the penalty to c(t) = −G, where G is a very large number that is larger than any other possible penalty, and perform a procedure to determine a modified command.
(27) The zero-effort set itself is the set of states for which the solution to the (SO) problem is nil. This set can be characterized as a set of linear inequalities:
Hy(k|t) + y_w^β(k, H) ≤ h (18)
for k = 0, …, N*. So far, it is unknown how to compute N* exactly, but it is known that the value is finite and that it is related to the rate of decay of the linear system. Therefore, we choose an N* that is much larger than the settling time of the linear system.
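A minimal sketch of a zero-effort-set membership test follows: with the command held at zero, the predicted outputs are propagated for N* steps and checked against Hy(k|t) + margin ≤ h. The matrices, the margin standing in for y_w^β(k, H), and the value of N* are hypothetical placeholders.

```python
import numpy as np

# Placeholder stable system, zero-effort-set inequalities H y <= h, and N*.
A = np.array([[1.0, 0.1], [0.0, 0.9]])
C = np.eye(2)
H = np.vstack([np.eye(2), -np.eye(2)])
h = 0.5 * np.ones(4)
margin = 0.05 * np.ones(4)
N_star = 200            # chosen much larger than the settling time

def in_zero_effort_set(x0):
    """Check H y(k|t) + margin <= h for k = 0..N_star with zero commands."""
    x = np.asarray(x0, dtype=float)
    for _ in range(N_star + 1):
        if np.any(H @ (C @ x) + margin > h):
            return False
        x = A @ x        # zero-input prediction x(k+1|t) = A x(k|t)
    return True

print(in_zero_effort_set([0.1, 0.1]))   # deep inside the constraints -> True
print(in_zero_effort_set([0.6, 0.0]))   # starts outside H y <= h      -> False
```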
(29) After some experimentation, we realized that the safe command should be chosen at random. When a command is deemed unsafe, it means that applying it will lead to constraint violation. Applying a slightly modified command does not greatly diminish the risk of violating constraints. Furthermore, staying within the neighborhood of an unsafe region prevents the RL controller from exploring all possible regions. Therefore, we instead take drastic action and randomly sample a command satisfying constraints. We do this using a hit-and-run technique, a sketch of which is given after this paragraph. We generate a sequence of commands
{u(0|t), u(1|t), …, u(N−1|t)} = {u_0(0|t), u_0(1|t), …, u_0(N−1|t)} (19)
that satisfy the following constraints:
Sy(k|t) + y_w^β(k, S) ≤ s (20)
Hy(N+k′|t) + y_w^β(N+k′, H) ≤ h (21)
We then pick a random sequence {p_0, p_1, …, p_{N−1}} and set
{u(0|t), u(1|t), …, u(N−1|t)} = {u_1(0|t), u_1(1|t), …, u_1(N−1|t)} (22)
where
u_1(k|t) = u_0(k|t) + λp_k (23)
for all k = 0, …, N−1, k′ = 0, …, N*, and some scalar λ. We then find the smallest λ that satisfies the above constraints. We repeat the above to find sequences u_2, u_3, …. Since we are guaranteed that, as k→∞, the sequence of u_k becomes truly random, we will have sampled the constraint set uniformly at random.
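The following Python sketch implements a standard hit-and-run sampler over a polytope {z : Az ≤ b}, which is one way to realize the random sampling described above; it draws a uniformly random point on the feasible chord in a random direction (rather than the smallest feasible λ), and the command polytope used in the example is a hypothetical placeholder.

```python
import numpy as np

def hit_and_run(A_ub, b_ub, z0, n_samples, rng=None):
    """Standard hit-and-run sampler over the polytope {z : A_ub z <= b_ub}.

    z0 must be strictly feasible.  Each iteration draws a random direction p,
    computes the feasible chord {z + lam * p} and moves to a uniformly random
    point on it; as the chain mixes, the samples approach the uniform
    distribution over the polytope.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(z.size)
        p /= np.linalg.norm(p)
        slack = b_ub - A_ub @ z          # how much room each constraint has
        rate = A_ub @ p                  # how fast each constraint is consumed
        lam_hi = np.min(slack[rate > 1e-12] / rate[rate > 1e-12], initial=np.inf)
        lam_lo = np.max(slack[rate < -1e-12] / rate[rate < -1e-12], initial=-np.inf)
        z = z + rng.uniform(lam_lo, lam_hi) * p
        samples.append(z.copy())
    return np.array(samples)

# Hypothetical command polytope |u_k| <= 1 for a horizon of 3 commands.
A_ub = np.vstack([np.eye(3), -np.eye(3)])
b_ub = np.ones(6)
safe_sequences = hit_and_run(A_ub, b_ub, np.zeros(3), n_samples=5)
print(safe_sequences)
```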
(30) According to some embodiments of this invention, a control system or a control apparatus for controlling an operation of a machine subject to state constraints in continuous state space of the machine and subject to control input constraints in continuous control input space of the machine is realized. To that end, the system or the apparatus may include an input interface to accept data indicative of a state of the machine; a memory configured to store an optimization problem for computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping a state of the machine within the control invariant set (CIS) to a control input satisfying the control input constraints, wherein a control of the machine having the state within the CIS according to the control policy maintains the state of the machine within the CIS; and a processor configured to iteratively perform reinforcement learning (RL) to jointly control the machine and update the control policy. For performing the joint control and update, the processor is configured to control the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; determine a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and update the control policy that improves a cost function of operation of the machine according to the determined reward.
(31) The control method (safe supervisor algorithm) used in the control system or the apparatus according to the present invention can be applied to machines used in a factory automation system, to actuators and suspensions used in a robotic system or in vehicles, or to a plant system.
(33) Although a vehicle suspension system is described below as an example, the safe supervisor (safe supervisor control method) according to the present invention is not limited to the vehicle suspension; the safe supervisor can also be applied to control actuators and suspensions used in a robotic system or a factory automation system.
(34) The motion of the active suspension system is governed by the following linear equations:
M_s(z̈_s − z̈_us) + c_s(ż_s − ż_us) + k_s(z_s − z_us) = F (24)
M_us z̈_us + c_t ż_us + k_t(z_us − z_r) − c_s(ż_s − ż_us) − k_s(z_s − z_us) = F + M_us z̈_r + c_t ż_r (25)
Because they are linear, these equations can be transformed into the required format by letting:
x(t) = (z_s(t) − z_us(t), ż_s(t), z_us(t) − z_r(t), ż_us(t)) (26)
v(t) = F(t) (27)
w(t) = M_us z̈_r(t) + c_t ż_r(t) (28)
We then let:
v(t)=−Kx(t)+u(t) (29)
where K is a stabilizing feedback, i.e., the matrix A−BK is a stable matrix and hence the dynamics of x(t) are stable. The matrix K represents a stabilizing feedback controller which has been designed to control the active suspension. Our intention is to use the RL algorithm to improve the controller in the presence of constraints. In this explanation, the feedback state x(t) is assumed to be directly measured using some instrumentation. This is because in our experimentation, we were able to measure all the states. However, it is likely that in real-world application, one would implement a state estimator 314 to obtain the state estimates 315 by measuring the vertical displacement of the sprung mass 301 using displacement sensors, such as linear variable displacement transducers. The nominal feedback controller has not been designed with constraints in mind and therefore the RL algorithm should improve the controller performance especially in the presence of constraints.
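As a non-limiting sketch of the pre-stabilized command v(t) = −Kx(t) + u(t) of equation (29), the following Python example computes a nominal stabilizing gain K by discrete-time LQR (one possible design; the description only assumes that some stabilizing K has been designed) and verifies that A − BK is Schur stable. The matrices are hypothetical placeholders, not a discretized suspension model.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder discrete-time matrices in controllable companion form, standing
# in for the (discretized) suspension model; Q and R are LQR design weights.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [-0.1, 0.2, -0.3, 0.4]])
B = np.array([[0.0], [0.0], [0.0], [1.0]])
Q = np.eye(4)
R = np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # nominal stabilizing gain

def apply_command(x, u):
    """RL correction u(t) on top of the nominal feedback: v(t) = -K x(t) + u(t)."""
    return -K @ x + u

# The closed-loop matrix A - B K should be Schur stable (all |eig| < 1).
closed_loop = A - B @ K
print("spectral radius of A - BK:", max(abs(np.linalg.eigvals(closed_loop))))
```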
(35) The constraints we consider are that 1) z_s − z_us ≥ ℓ_{s,−}, to protect the spring from breaking due to compression; 2) z_s − z_us ≤ ℓ_{s,+}, to protect the spring from breaking due to stretching; 3) |ż_s| ≤ f_s, to ensure ride comfort for the passengers of the vehicle; and 4) z_s − z_us ≥ ℓ_{us,−}, to protect the wheel from being damaged due to compression of the tire; the terms ℓ_{s,−}, ℓ_{s,+}, f_s and ℓ_{us,−} are positive scalar limits on the functions of the variables above. Since these constraints are linear, they can be modeled in the required form.
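For illustration, the four suspension constraints above can be stacked into the required form Sy ≤ s as follows; the numeric limits and the sign conventions in this sketch are hypothetical placeholders.

```python
import numpy as np

# Hypothetical numeric stand-ins for the suspension limits; the actual values
# and sign conventions depend on the hardware and are placeholders here.
travel_min, travel_max = -0.08, 0.08   # allowed spring travel z_s - z_us  [m]
f_s = 0.5                              # ride-comfort bound on |dz_s/dt|   [m/s]
tire_min = -0.02                       # allowed tire deflection z_us - z_r [m]

# Output vector y = (z_s - z_us, dz_s/dt, z_us - z_r); constraints as S y <= s.
S = np.array([
    [ 1.0,  0.0,  0.0],   #  (z_s - z_us) <= travel_max   (stretching)
    [-1.0,  0.0,  0.0],   # -(z_s - z_us) <= -travel_min  (compression)
    [ 0.0,  1.0,  0.0],   #  dz_s/dt      <= f_s          (comfort)
    [ 0.0, -1.0,  0.0],   # -dz_s/dt      <= f_s          (comfort)
    [ 0.0,  0.0, -1.0],   # -(z_us - z_r) <= -tire_min    (tire compression)
])
s = np.array([travel_max, -travel_min, f_s, f_s, -tire_min])

def violates(y):
    """True if the output leaves the polytope S y <= s."""
    return bool(np.any(S @ y > s))

print(violates(np.array([0.05, 0.2, 0.0])))   # inside the limits -> False
print(violates(np.array([0.10, 0.2, 0.0])))   # too much travel   -> True
```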
(36) The RL algorithm we apply is the deep-deterministic policy gradient (DDPG) algorithm due to its ability to deal with continuous control systems. DDPG learns both a critic network to estimate the long-term value for a given policy and an actor network to sample the optimal action. In application to the suspension system, the critic network is composed of two hidden fully-connected layers, with 160 and 120 neurons respectively, and the actor network is composed of two hidden fully-connected layers, with 160 and 120 neurons respectively, and a softmax layer to sample the optimal action. For the rest of the design of the DDPG algorithm, default hyperparameters have been used. Importantly, since it makes for better learning, DDPG does not apply the optimal control that it has learned:
u(t) = π_θ(q(t)) (30)
(37) Instead, it applies the optimal control modified with some colored noise signal:
u(t) = π_θ(q(t)) + ε_OU (31)
where ε_OU is the output of a colored-noise (also called Ornstein–Uhlenbeck) process.
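A non-limiting sketch of the actor and critic architectures described above (two hidden fully-connected layers with 160 and 120 units each), together with Ornstein–Uhlenbeck exploration noise per equation (31), follows. The observation and action dimensions, action limit, and noise parameters are hypothetical placeholders; where the text mentions a softmax sampling layer on the actor, this sketch substitutes the more common tanh squashing to produce a bounded continuous command.

```python
import torch
import torch.nn as nn

# Placeholder dimensions; not values from the embodiments.
obs_dim, act_dim, act_limit = 4, 1, 1.0

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 160), nn.ReLU(),
            nn.Linear(160, 120), nn.ReLU(),
            nn.Linear(120, act_dim), nn.Tanh(),
        )

    def forward(self, q):
        return act_limit * self.net(q)              # u = pi_theta(q)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 160), nn.ReLU(),
            nn.Linear(160, 120), nn.ReLU(),
            nn.Linear(120, 1),
        )

    def forward(self, q, u):
        return self.net(torch.cat([q, u], dim=-1))  # long-term value estimate

class OUNoise:
    """Discretized Ornstein-Uhlenbeck (colored) noise eps_OU; placeholder tuning."""
    def __init__(self, size, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.eps = torch.zeros(size)

    def sample(self):
        self.eps = (self.eps - self.theta * self.eps * self.dt
                    + self.sigma * (self.dt ** 0.5) * torch.randn_like(self.eps))
        return self.eps

actor, critic, noise = Actor(), Critic(), OUNoise(act_dim)
q = torch.zeros(1, obs_dim)
u = actor(q) + noise.sample()                       # u(t) = pi_theta(q(t)) + eps_OU
print(u.shape, critic(q, u).shape)
```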
(40) It remains to explain the implementation details specific to suspension systems. The reward function that we wish to maximize is rider comfort:
r(t) = −|ż_s(t)| (32)
(41) In other words, we wish the vertical motion of the sprung mass to be minimized. As stated above, the sprung-mass velocity is constrained between ±f_s, which means that the reward is also constrained between these bounds. Through computation of the CIS and experimentation with the safe RL algorithm, we have found that a good choice of the scaling factor α in the (SO) problem is one that ensures the cost c(t) is always between ±c_b, where c_b are bounds on the cost function which, through experimentation, we set as:
c_b = 100f_s (33)
(42) This is because the effort needed to return to safety is very large at the boundaries of the CIS, and so the scaling factor needs to be large to ensure that the cost is high enough nearer to the zero-effort set. Through experimentation, we also found that the maximum penalty G should always be about twice the value of c_b, that is:
G ≈ 2c_b (34)
(43) If the maximum penalty is too large, it induces a dominant effect on the learning process of the RL algorithm, and the algorithm generates commands that overly avoid constraints. Therefore, choosing G to be on the order of magnitude of the largest possible solution of the (SO) problem is appropriate.
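Putting equations (32) through (34) together, a minimal sketch of the reward shaping for the suspension example follows; f_s and the test values are hypothetical placeholders, and applying the maximum penalty G to the shaped reward when the (SO) problem is infeasible is one possible reading of the convention described above.

```python
f_s = 0.5                # placeholder ride-comfort bound on |dz_s/dt|
c_b = 100.0 * f_s        # bound on the supervisor cost, equation (33)
G = 2.0 * c_b            # maximum penalty, equation (34)

def shaped_suspension_reward(dz_s, supervisor_cost, feasible):
    """Reward r(t) = -|dz_s(t)| shaped with the supervisor cost, equation (7).

    supervisor_cost is assumed to be scaled (via alpha) to lie in [-c_b, c_b];
    when the (SO) problem has no solution, the maximum penalty G is used.
    """
    r = -abs(dz_s)
    c = supervisor_cost if feasible else G
    return r - c

print(shaped_suspension_reward(0.2, 3.0, True))    # ordinary step
print(shaped_suspension_reward(0.2, 0.0, False))   # state outside the CIS
```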
(45) Accordingly, some embodiments of the present invention can provide a computer-implemented method for controlling an operation of a machine subject to state constraints in continuous state space of the machine and subject to control input constraints in continuous control input space of the machine. In this case, the method may include steps of accepting data indicative of a state of the machine; computing a safety margin of a state and action pair satisfying the state constraints and a control policy mapping the state of the machine within a control invariant set (CIS) to a control input satisfying the control input constraints, wherein a control of the machine having the state within the CIS according to the control policy maintains the state of the machine within the CIS; and iteratively performing a reinforcement learning (RL) algorithm to jointly control the machine and update the control policy, wherein the iteratively performing step comprises controlling the machine using the control policy to collect data including a sequence of control inputs generated using the control policy and a sequence of states of the machine corresponding to the sequence of control inputs; determining a reward for a quality of the control policy on the state of the machine using a reward function of the sequence of control inputs and the sequence of states of the machine augmented with an adaptation term determined as the minimum amount of effort needed for the machine having the state to remain within the CIS; and updating the control policy that improves a cost function of operation of the machine according to the determined reward. In some cases, the computer-implemented method can be used to control a suspension system of a vehicle.
(46) Although a vehicle suspension system is described above as an example, the safe supervisor (safe supervisor control method) according to the present invention is not limited to the vehicle suspension; the safe supervisor can also be applied to control actuators and suspensions used in a robotic system or a factory automation system.
(47) The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component. Though, a processor may be implemented using circuitry in any suitable format.
(48) Also, the embodiments of the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
(49) Use of ordinal terms such as "first," "second," in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term).
(50) Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention.
(51) Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.