DEVICE AND METHOD FOR CONTROLLING AN AGENT

Abstract

A method for controlling an agent. The method includes determining, for a present state of the agent and a state of an environment of the agent in which the agent should be controlled, a control history indicating a sequence of actions performed by the agent that led to the present state and indicating observations about changes of a state of the agent and/or a state of an environment of the agent, determining an encoding of the control history by supplying the control history to a history encoder comprising a Kalman filter, wherein the encoding is given by a system state estimate determined by the Kalman filter, supplying the encoding to a control policy trained to determine actions from control policy encodings and controlling the agent to perform an action provided by the control policy in response to being supplied with the encoding.

Claims

1. A method for controlling an agent, comprising the following steps: determining, for a present state of the agent and a state of an environment of the agent in which the agent should be controlled, a control history indicating a sequence of actions performed by the agent that led to the present state and indicating observations about changes of a state of the agent and/or the state of the environment of the agent; determining an encoding of the control history by supplying the control history to a history encoder including a Kalman filter, wherein the encoding is given by a system state estimate determined by the Kalman filter; supplying the encoding to a control policy trained to determine actions from control policy encodings; and controlling the agent to perform an action provided by the control policy in response to being supplied with the encoding.

2. The method of claim 1, further comprising: training the control policy wherein parameters of the Kalman filter are trained together with the control policy.

3. The method of claim 1, further comprising: training the control policy using reinforcement learning.

4. The method of claim 1, wherein the Kalman filter is configured to estimate the system state using a linear structured state space model for the system state and the observations which is given by trainable matrices having diagonal structure.

5. The method of claim 1, further comprising: parallel processing of multiple control histories.

6. The method of claim 1, wherein the Kalman filter is configured to repeat, for a control history which indicates a sequence being shorter than a default length, the system state estimate the Kalman filter has determined by an end of the sequence until the Kalman filter has reached a number of estimation iterations corresponding to the default length.

7. The method of claim 1, further comprising: determining the encoding of the control history by supplying the control history to a first Kalman filter of a sequence of Kalman filters, supplying system state estimates of each Kalman filter of the sequence, except a last Kalman filter in the sequence, to a next Kalman filter in the sequence, wherein the encoding is given by a system state estimate determined by the last Kalman filter of the sequence.

8. A controller configured to control an agent, the controller configured to perform the following steps: determining, for a present state of the agent and a state of an environment of the agent in which the agent should be controlled, a control history indicating a sequence of actions performed by the agent that led to the present state and indicating observations about changes of a state of the agent and/or the state of the environment of the agent; determining an encoding of the control history by supplying the control history to a history encoder including a Kalman filter, wherein the encoding is given by a system state estimate determined by the Kalman filter; supplying the encoding to a control policy trained to determine actions from control policy encodings; and controlling the agent to perform an action provided by the control policy in response to being supplied with the encoding.

9. A non-transitory computer-readable medium on which are stored instructions for controlling an agent, the instructions, when executed by a computer, causing the computer to perform the following steps comprising: determining, for a present state of the agent and a state of an environment of the agent in which the agent should be controlled, a control history indicating a sequence of actions performed by the agent that led to the present state and indicating observations about changes of a state of the agent and/or the state of the environment of the agent; determining an encoding of the control history by supplying the control history to a history encoder including a Kalman filter, wherein the encoding is given by a system state estimate determined by the Kalman filter; supplying the encoding to a control policy trained to determine actions from control policy encodings; and controlling the agent to perform an action provided by the control policy in response to being supplied with the encoding.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0028] FIG. 1 shows a control scenario according to an example embodiment of the present invention.

[0029] FIG. 2 illustrates a recurrent actor-critic architecture as an example for a reinforcement learning architecture using history encoders, according to the present invention.

[0030] FIG. 3 illustrates a Kalman filter (KF) layer according to an example embodiment of the present invention.

[0031] FIG. 4 shows a flow diagram 400 illustrating a method for controlling an agent according to an example embodiment of the present invention.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

[0032] The following detailed description refers to the figures that show, by way of illustration, specific details and aspects of this disclosure in which the present invention may be practiced. Other aspects may be utilized, and structural, logical, and electrical changes may be made without departing from the scope of the present invention. The various aspects of this disclosure are not necessarily mutually exclusive, as some aspects of this disclosure can be combined with one or more other aspects of this disclosure to form new aspects.

[0033] In the following, various examples will be described in more detail.

[0034] FIG. 1 shows a control scenario.

[0035] A robot 100 is located in an environment 101. The robot 100 has a start position 102 and should reach a goal position 103. The environment 101 contains obstacles 104 which should be avoided by the robot 100. For example, they may not be passed by the robot 100 (e.g. they are walls, trees or rocks) or should be avoided because the robot would damage or hurt them (e.g. pedestrians).

[0036] The robot 100 has a controller 105 (which may also be remote to the robot 100, i.e. the robot 100 may be controlled by remote control). In the exemplary scenario of FIG. 1, the goal is that the controller 105 controls the robot 100 to navigate the environment 101 from the start position 102 to the goal position 103. For example, the robot 100 is an autonomous vehicle, but it may also be a robot with legs or tracks or another kind of propulsion system (such as a deep-sea or Mars rover).

[0037] Furthermore, embodiments are not limited to the scenario that a robot should be moved (as a whole) between positions 102, 103 but may also be used for the control of a robotic arm whose end-effector should be moved between positions 102, 103 (without hitting obstacles 104) etc.

[0038] Accordingly, in the following, terms like robot, vehicle, machine, etc. are used as examples for the object, i.e. the computer-controlled system (e.g. machine), to be controlled. The approaches described herein can be used with different types of computer-controlled machines such as robots, vehicles and others. The general term robot device is also used in the following to refer to all kinds of technical systems which may be controlled by the approaches described in the following. The environment may also be simulated, e.g. the control policy may be a control policy for a virtual vehicle or other movable device, e.g. in a simulation for testing another policy for autonomous driving.

[0039] Ideally, the controller 105 has learned a control policy that allows it to control the robot 100 successfully (from start position 102 to goal position 103 without hitting obstacles 104) for arbitrary scenarios (i.e. environments, start and goal positions), in particular scenarios that the controller 105 has not encountered before.

[0040] Various embodiments thus relate to learning a control policy for a specified (distribution of) task(s) by interacting with the environment 101. In training, the scenario (in particular environment 101) may be simulated but it will typically be real in deployment.

[0041] One approach to learning a control policy is reinforcement learning (RL), where the robot 100 and/or its controller 105 acts as the reinforcement learning agent.

[0042] Reinforcement learning (RL) is a technique for learning a control policy. An RL algorithm iteratively updates the parameters $\theta$ of a parametric policy $\pi_{\theta}(a|s)$, for example represented by a neural network, that maps states $s$ (e.g. (pre-processed) sensor signals) to actions $a$ (control signals). During training, the policy interacts in rollouts episodically (i.e. in one or more episodes) with the (possibly simulated) environment 101. During a (possibly simulated) training rollout in the environment 101, the controller 105, according to a current control policy, executes, in every discrete time step $t$, an action $a_t$ according to the current state $s_t$, which leads to a new state $s_{t+1}$ in the next discrete time step. Furthermore, a reward $r_t$ is received, which the controller uses to update the policy. A (training) rollout ends once a goal state is reached, the accumulated (potentially discounted) rewards surpass a threshold, or the maximum number of time steps, the time horizon $T$, is reached. During training, a reward-dependent objective function (e.g. the discounted sum of rewards received during a rollout) is maximized by updating the parameters of the policy. In case of an actor-critic RL scheme, as in the example below, the training also includes updating the critic(s). The training ends once the policy meets a certain quality criterion with respect to the objective function, a maximum number of policy updates has been performed, or a maximum number of steps has been taken in the (simulation) environment.
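The rollout procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `rollout` and `toy_step`, the one-dimensional toy environment and the discount factor are hypothetical and introduced solely for this example, and only two of the termination conditions (goal reached, horizon reached) are shown.

```python
def rollout(policy, step, s0, horizon, gamma=0.99):
    """Run one episode and return the visited states and the discounted return."""
    s, discounted_return, states = s0, 0.0, [s0]
    for t in range(horizon):
        a = policy(s)                      # action for the current state
        s, r, done = step(s, a)            # new state and reward from the environment
        discounted_return += (gamma ** t) * r
        states.append(s)
        if done:                           # goal state reached
            break
    return states, discounted_return


def toy_step(s, a):
    """Hypothetical environment: move along a line, reward 1 on reaching position 3."""
    s_next = s + a
    done = s_next >= 3
    return s_next, (1.0 if done else 0.0), done


# A fixed "always move right" policy stands in for a learned policy.
states, ret = rollout(lambda s: 1, toy_step, s0=0, horizon=10, gamma=0.5)
```

In an actual training run, the returned quantities would feed a policy update; that part is deliberately left out here.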

[0043] For the following examples, an agent is considered that acts in a finite-horizon partially observable Markov decision process (POMDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, T, p, O, r, \gamma\}$ with state space $\mathcal{S}$, action space $\mathcal{A}$, observation space $\mathcal{O}$, horizon $T$, a transition function $p: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ that maps states and actions to a probability distribution over $\mathcal{S}$, an emission function $O: \mathcal{S} \to \Delta(\mathcal{O})$ that maps states to a probability distribution over observations, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and a discount factor $\gamma \in [0,1)$.

[0044] At time step $t$ of an episode in $\mathcal{M}$, the agent observes $o_t \sim O(\cdot|s_t)$ and selects an action $a_t$ based on the observed history

[00001] $h_{:t} = \{(o_h, a_h)\}_{h=0}^{t} \in \mathcal{H}_t,$

then receives a reward $r_t = r(s_t, a_t)$ and the next observation $o_{t+1} \sim O(\cdot|s_{t+1})$ with $s_{t+1} \sim p(\cdot|s_t, a_t)$.

[0045] A general setting is considered in which the (RL) agent is equipped with: (i) a stochastic policy $\pi: \mathcal{H}_t \to \Delta(\mathcal{A})$ (the parameters of the policy are omitted here for simplicity) that maps from the observed history to a distribution over actions, and (ii) a value function $Q^{\pi}: \mathcal{H}_t \times \mathcal{A} \to \mathbb{R}$ that maps from history and (present) action to the expected return under the policy, defined as

[00002] $Q^{\pi}(h_{:t}, a_t) = \mathbb{E}_{\pi}\left[\sum_{h=t}^{T} \gamma^{h-t} r_h \,\middle|\, h_{:t}, a_t\right].$

The objective of the agent (i.e. of the control policy it follows) is to maximize the value starting from some initial state $s_0$,

[00003] $\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t} r_t \,\middle|\, s_0\right].$

Accordingly, the policy should be trained (i.e. its parameters determined) such that the agent (which follows its policy) achieves this maximization for any initial state $s_0$.

[0046] A weakness of approaches following the general formulation of RL in POMDPs as above is the dependence of both the policy and the value function on the entire history, which becomes intractable for all but the smallest problems. Instead, practical algorithms seek to compress the history into a compact representation.

[0047] One general framework to learn such representations is through history encoders, which can be defined by a mapping $\phi: \mathcal{H}_t \to \mathcal{Z}$ from the observed history to some latent representation $z_t := \phi(h_{:t})$. In the following, with slight abuse of notation, $\pi(a_t|z_t)$ and $Q^{\pi}(z_t, a_t)$ denote the policy and values under this latent representation, respectively.

[0048] According to various embodiments, a history encoder is used which is based on a Recurrent Kalman Network (RKN) that implements simple probabilistic inference on a latent state. In other words, a history encoder is used that comprises one or more layers, each layer operating according to a Kalman filter.

[0049] A Kalman filter operates based on a linear dynamic system discretized in the time domain. According to various embodiments, for this, a time-varying linear State Space Model (SSM) is considered defined by

[00004] $\begin{matrix} \dot{z}(t) = A_t z(t) + B_t u(t) \\ y(t) = C_t z(t) + D_t u(t) \end{matrix} \qquad (1)$

where $t \in \mathbb{R}_{>0}$, $z(t) \in \mathbb{R}^{N}$ is the state (to be estimated by the Kalman filter), $u(t) \in \mathbb{R}^{P}$ is the input (i.e. the actions), $y(t) \in \mathbb{R}^{M}$ is the output and $(A_t, B_t, C_t, D_t)$ are matrices of appropriate size. Such a continuous-time system can be discretized (e.g., using zero-order hold) for some step size $\Delta$, resulting in a linear recurrent model

[00005] $\begin{matrix} z_k = \bar{A}_k z_{k-1} + \bar{B}_k u_k \\ y_k = \bar{C}_k z_k + \bar{D}_k u_k \end{matrix} \qquad (2)$

[0050] As is common in practice, $\bar{D}_k = 0$ is set. According to various embodiments, structured SSMs are considered, which simply means that special structure is imposed on the learnable matrices $(\bar{A}_k, \bar{B}_k, \bar{C}_k)$. In particular, a diagonal structure with a HiPPO (High-order Polynomial Projection Operators) initialization may be used, which induces stability in the recurrence for handling long sequences.
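For a diagonal continuous-time matrix, zero-order-hold discretization reduces to element-wise operations. The sketch below assumes a time-invariant diagonal matrix with non-zero diagonal entries `lam` and a fixed step size `delta`. The helper name `zoh_diagonal` and all numerical values are illustrative assumptions, not taken from the embodiments, and no HiPPO initialization is shown.

```python
import numpy as np


def zoh_diagonal(lam, B, delta):
    """Zero-order-hold discretization of dz/dt = diag(lam) z + B u.

    lam:   (N,) diagonal of the continuous-time matrix A (entries must be non-zero)
    B:     (N, P) continuous-time input matrix
    delta: step size
    Returns the diagonal of A_bar, shape (N,), and B_bar, shape (N, P).
    """
    A_bar = np.exp(delta * lam)                    # exp(delta * A) for diagonal A
    B_bar = ((A_bar - 1.0) / lam)[:, None] * B     # A^{-1} (A_bar - I) B, row by row
    return A_bar, B_bar


lam = np.array([-1.0, -2.0])        # negative entries give a stable recurrence
B = np.array([[1.0], [0.5]])
A_bar, B_bar = zoh_diagonal(lam, B, delta=0.1)

# One step of the discrete recurrence z_k = A_bar * z_{k-1} + B_bar @ u_k.
z_prev = np.array([1.0, 1.0])
u = np.array([2.0])
z_next = A_bar * z_prev + B_bar @ u
```

Because the transition is diagonal, the recurrence never needs a matrix exponential or a matrix product for the state update, only element-wise multiplication.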

[0051] To introduce uncertainty into state-space models (according to (2) with $\bar{D}_k = 0$), a standard linear-Gaussian SSM

[00006] $\begin{matrix} z_k = \bar{A}_k z_{k-1} + \bar{B}_k u_k + \varepsilon_k \\ y_k = \bar{C}_k z_k + v_k \end{matrix} \qquad (3)$

may be considered, where $\varepsilon_k \sim \mathcal{N}(0, \Sigma_z)$ and $v_k \sim \mathcal{N}(0, \Sigma_y)$ are zero-mean transition and observation noise variables with covariance matrices $\Sigma_z$ and $\Sigma_y$, respectively. The probabilistic dynamics model used by the Kalman filter is then

[00007] $p(z_k | z_{k-1}, u_k) = \mathcal{N}(\bar{A}_k z_{k-1} + \bar{B}_k u_k, \Sigma_z)$

and the observation model used by the Kalman filter is

[00008] $p(y_k | z_k) = \mathcal{N}(\bar{C}_k z_k, \Sigma_y)$

[0053] There is a closed-form solution for Kalman filtering using such models which may be used for implementing the Kalman filter.

[0054] The closed-form equations, however, require matrix inversions, which may be expensive and unsuitable for gradient-based learning. Therefore, according to various embodiments, simplified inference schemes are used under which Kalman filtering is composed of simple element-wise addition and multiplication. In particular, structured SSMs with a diagonal shape are amenable to simple Kalman filtering equations, e.g. as given in reference [2].
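To illustrate how diagonal structure removes the matrix inversion, the following sketch writes one prediction and one update step with element-wise operations only. It is a simplified sketch and not the specific scheme of reference [2]: it assumes a diagonal transition, diagonal noise covariances and an identity observation model (observations living directly in the latent space). The function and variable names are hypothetical.

```python
import numpy as np


def kf_predict(mean, var, a_bar, bu, var_z):
    """Prediction: propagate the mean and the diagonal covariance through the dynamics."""
    return a_bar * mean + bu, a_bar * a_bar * var + var_z


def kf_update(mean, var, w, var_w):
    """Update: fuse the prediction with observation w (identity observation model assumed)."""
    gain = var / (var + var_w)      # element-wise Kalman gain, no matrix inverse needed
    return mean + gain * (w - mean), (1.0 - gain) * var


# Example with a 3-dimensional latent state (all values are arbitrary).
a_bar = np.array([0.9, 0.8, 0.5])       # diagonal of A_bar
bu = np.array([0.1, 0.0, -0.2])         # B_bar @ u_k, precomputed
var_z = np.array([0.01, 0.02, 0.03])    # diagonal of the transition noise covariance
var_w = np.array([0.5, 0.1, 1.0])       # diagonal of the observation noise covariance
w = np.array([1.0, -1.0, 0.3])          # observation

prior_mean, prior_var = kf_predict(np.zeros(3), np.ones(3), a_bar, bu, var_z)
post_mean, post_var = kf_update(prior_mean, prior_var, w, var_w)
```

Under these assumptions the result coincides with the general matrix-form Kalman filter, since every matrix involved is diagonal and its inverse is an element-wise reciprocal.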

[0055] One key benefit of using linear recurrences and simplified inference schemes is that they can be efficiently implemented using parallel scans. For an input sequence of length $K$, a parallel scan's runtime complexity is $O(\log K)$, given sufficient parallel processors. The condition for a parallel scan is that the sequence processing problem is defined in terms of an associative operator $\bullet$, such that $(a \bullet b) \bullet c = a \bullet (b \bullet c)$ holds for any triplet of elements $(a, b, c)$. Linear SSMs and their associated probabilistic filters have such a property, see reference [1].
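As an illustration of the associativity condition, the sketch below treats each step of a diagonal linear recurrence $z_k = a_k z_{k-1} + b_k$ as an affine map and composes these maps with an associative operator. The scan uses a simple recursive-doubling scheme written with NumPy, so the number of vectorized combine steps grows logarithmically with the sequence length (the total work of this particular scheme is higher than that of a work-efficient scan). The names `combine`, `scan_doubling` and `scan_sequential` are assumptions made for this example and do not reflect a specific implementation from reference [1].

```python
import numpy as np


def combine(left, right):
    """Compose two affine maps z -> a*z + b, applying `left` first (associative)."""
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2


def scan_doubling(a, b):
    """Inclusive scan over axis 0 using ceil(log2(K)) vectorized combine steps."""
    a, b = a.copy(), b.copy()
    d = 1
    while d < a.shape[0]:
        # Element i absorbs the partial result ending at position i - d.
        new_a, new_b = combine((a[:-d], b[:-d]), (a[d:], b[d:]))
        a = np.concatenate([a[:d], new_a])
        b = np.concatenate([b[:d], new_b])
        d *= 2
    return a, b


def scan_sequential(a, b, z_init):
    """Reference implementation: plain loop over the recurrence."""
    z, out = z_init, []
    for a_k, b_k in zip(a, b):
        z = a_k * z + b_k
        out.append(z)
    return np.stack(out)


rng = np.random.default_rng(0)
K, N = 13, 4
a = rng.uniform(0.5, 1.0, size=(K, N))
b = rng.normal(size=(K, N))
z_init = rng.normal(size=N)

a_cum, b_cum = scan_doubling(a, b)
z_parallel = a_cum * z_init + b_cum      # all K states from the composed maps
z_reference = scan_sequential(a, b, z_init)
```

Each pass of the `while` loop consists of independent element-wise operations, which is what allows the steps to be distributed over parallel processors.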

[0056] FIG. 2 illustrates a recurrent actor-critic architecture 200 as an example for a reinforcement learning architecture using history encoders.

[0057] Each of an actor 201 and a critic 202 comprises an embedder 203, 204 which generates a history (as described above) from observations and actions. A history encoder 205, 206, 207 encodes the history into a latent state which is used as input for the policy, implemented by a first multi-layer perceptron 208, as well as for two versions of the value function, implemented by a second multi-layer perceptron 209 and a third multi-layer perceptron 210. The usage of two value functions is only an example here and a single one may also be used. Using two value functions and, for example, taking the minimum of their outputs as the value estimate may increase training stability. The architecture may be trained end-to-end according to various types of (standard) actor-critic reinforcement learning and various (actor-critic) loss functions, e.g. with a SAC (Soft Actor-Critic) loss, which aims to maximize the (soft) Q-values.

[0058] As mentioned above, the history encoders 205, 206, 207 each comprise one or more Kalman filter layers.

[0059] FIG. 3 illustrates a Kalman filter (KF) layer 300 according to an embodiment.

[0060] Multiple of these Kalman filter layers may be stacked together to form a history encoder 205, 206, 207, e.g. similarly to non-probabilistic SSM layers and their derivatives. In contrast to standard SSM layers, the KF layer 300 produces a filtered latent state

[00009] $z_{: t}^{+},$

which can then be projected back to the input dimension (i.e. the dimension of the values of the input history $h_{:t}$, which includes embeddings (generated by the respective embedder 203, 204) of the actions, here denoted by $u_{:t}$, see equations (1) to (3), and of the observations, here denoted by $w_{:t}$) for stacking. In the present example, the dimension of the input history's values is changed (e.g. increased) by a first linear layer 301 and the dimension of the filtered latent states

[00010] $z_{:t}^{+}$

is decreased to the dimension of the values of the history by a second linear layer 304. Both linear layers 301, 304 (which may be represented by matrix multiplications) are trainable, i.e. they are trained together with the actor and the critic. Similarly, the matrices $(\bar{A}_k, \bar{B}_k, \bar{C}_k)$ used by the actual Kalman filter are trained in the training of the actor and the critic. The KF layer 300 implements a Kalman filter 305 which, according to the two phases of a Kalman filter, performs a prediction 302 and an update 303.

[0061] So, the KF layer 300 receives a history sequence $h_{:t}$ and projects it into three separate signals in latent space: the inputs (i.e. the actions) $u_{:t}$, the observations $w_{:t}$ and the (diagonal) observation noise covariance $\Sigma_{w,:t}$. These sequences are processed by the Kalman filter 305 according to the standard Kalman filtering equations, whose runtime scales logarithmically with the sequence length when using parallel scans. Lastly, the posterior mean latent states

[00011] $z_{:t}^{+}$

are projected from the latent space back into the history space to obtain the history encodings $z_{:t}$.
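The data flow through such a layer (projection into latent space, filtering, projection back) might be sketched as below. Everything specific in this sketch is an assumption made for illustration: the class name `KFLayerSketch`, the random untrained weights, the time-invariant diagonal transition, the identity observation model, the softplus used to keep variances positive and the sequential loop used in place of a parallel scan.

```python
import numpy as np


class KFLayerSketch:
    """Hypothetical KF layer: project in, filter with diagonal Kalman equations, project out."""

    def __init__(self, d_hist, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_hist)
        # Input projections (stand-in for linear layer 301), untrained random weights.
        self.W_u = rng.normal(0.0, scale, (d_hist, d_latent))   # -> inputs u
        self.W_w = rng.normal(0.0, scale, (d_hist, d_latent))   # -> observations w
        self.W_s = rng.normal(0.0, scale, (d_hist, d_latent))   # -> observation noise
        # Output projection (stand-in for linear layer 304).
        self.W_out = rng.normal(0.0, 1.0 / np.sqrt(d_latent), (d_latent, d_hist))
        self.a_bar = np.full(d_latent, 0.9)   # diagonal transition, assumed time-invariant
        self.var_z = np.full(d_latent, 0.1)   # diagonal transition noise variance

    def __call__(self, h):
        """h: (T, d_hist) history values; returns (T, d_hist) history encodings."""
        u, w = h @ self.W_u, h @ self.W_w
        var_w = np.log1p(np.exp(h @ self.W_s)) + 1e-6   # softplus keeps variances positive
        mean, var = np.zeros_like(self.a_bar), np.ones_like(self.a_bar)
        posterior_means = []
        for k in range(h.shape[0]):
            # Prediction (302); the input matrix is folded into the projection W_u here.
            mean, var = self.a_bar * mean + u[k], self.a_bar**2 * var + self.var_z
            # Update (303) with an identity observation model.
            gain = var / (var + var_w[k])
            mean, var = mean + gain * (w[k] - mean), (1.0 - gain) * var
            posterior_means.append(mean)
        return np.stack(posterior_means) @ self.W_out


layer = KFLayerSketch(d_hist=6, d_latent=16)
history = np.random.default_rng(1).normal(size=(10, 6))
encodings = layer(history)
```

Since the output has the same dimension as the input history values, several such layers could be applied one after another, which corresponds to the stacking mentioned above.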

[0062] In order to be compute-efficient during training, according to various embodiments, the architecture 200 (i.e. a controller, e.g. controller 105 implementing the architecture) processes, in general, batches of variable-sized trajectories. On the other hand, efficient batch execution of parallel scans requires equally-sized sequences (i.e. all sequences having a default length). This incongruence is easily remedied in some sequence-modelling tasks (such as language) by introducing special masking tokens, which are used to pad sequences up to a common maximum length. However, in the general case, a suitable mask value may not be easily defined. In particular, when data is not discrete, the choice of a mask value is arbitrary.

[0063] Instead, the associative operator may be modified to natively handle variable-sized sequences. For example, in (in particular off-policy) RL, sub-sequences (e.g. sub-trajectories) of an episode (i.e. of a complete trajectory obtained from an episode) are sampled as training input and the associative operator is designed to pad shorter sequences by propagating the same state (i.e. the latent state $z_t$ in the present application) over the padded steps. Such an associative operator $\tilde{\bullet}$ (called masked binary operator) may be designed for any associative operator $\bullet$ as follows:

[0064] Let $\bullet$ be an associative operator acting on elements $e \in \mathcal{E}$, such that for any $a, b, c \in \mathcal{E}$, it holds that $(a \bullet b) \bullet c = a \bullet (b \bullet c)$. Then, the masked binary operator associated with $\bullet$, denoted $\tilde{\bullet}$, acts on elements $\tilde{e} = (e, m) \in \mathcal{E} \times \{0, 1\}$, where $m \in \{0, 1\}$ is a binary mask, according to, for $\tilde{a} = (a, m_a)$ and $\tilde{b} = (b, m_b)$,

[00012] $\tilde{a} \, \tilde{\bullet} \, \tilde{b} = \begin{cases} (a \bullet b, m_a) & \text{if } m_b = 0 \\ \tilde{a} & \text{if } m_b = 1 \end{cases} \qquad (4)$
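Equation (4) can be expressed as a small wrapper around any binary operator. In the sketch below, the wrapper name `masked` and the example operator `affine` (composition of scalar maps $z \mapsto a z + b$) are assumptions chosen for illustration. A mask value of 1 marks a padded step, and the example sequence is padded at its end, which is the padding pattern described above.

```python
import itertools


def masked(op):
    """Wrap a binary operator `op` into the masked binary operator of equation (4)."""
    def masked_op(left, right):
        (a, m_a), (b, m_b) = left, right
        # A padded right-hand element (m_b == 1) leaves the left element unchanged.
        return left if m_b == 1 else (op(a, b), m_a)
    return masked_op


def affine(f, g):
    """Compose scalar affine maps z -> a*z + b, applying f first (associative)."""
    return (g[0] * f[0], g[0] * f[1] + g[1])


masked_affine = masked(affine)

# Two valid steps followed by two padded steps (mask 1).
sequence = [((0.5, 1.0), 0), ((0.5, 1.0), 0), ((0.5, 1.0), 1), ((0.5, 1.0), 1)]
prefixes = list(itertools.accumulate(sequence, masked_affine))
```

After the last valid step, every prefix equals the prefix of that step, so the state reached at the end of the shorter sequence is carried unchanged through the padded positions.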

[0065] In summary, according to various embodiments, a method is provided as illustrated in FIG. 4.

[0066] FIG. 4 shows a flow diagram 400 illustrating a method for controlling an agent (e.g. a technical system like a robot device, e.g. a robot or a vehicle).

[0067] In 401, for a present state of the agent and a state of an environment of the agent in which the agent should be controlled, a control history indicating a sequence of actions performed by the agent that led to the present state and indicating observations about changes of a state of the agent and/or a state of an environment of the agent (caused by the sequence of actions) is determined.

[0068] In 402, an encoding of the control history is determined (i.e. generated) by supplying the control history to a history encoder comprising a Kalman filter (i.e. the input the Kalman filter expects, i.e. the series of measurements observed over time as a Kalman filter expects it as input, is given by the control history (or at least derived from it, e.g. by one or more preceding Kalman filters)) wherein the encoding is given by a system state estimate determined by the Kalman filter (from the control history, either directly or from a (pre-) processed version of the control history, e.g., by one or more preceding Kalman filters).

[0069] In 403, the encoding is supplied to a control policy (or actor) trained to determine actions from control policy encodings. The encoding may also be supplied to a critic in case of using actor critic RL.

[0070] In 404, the agent is controlled to perform an action provided by the control policy in response to being supplied with the encoding.
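Steps 401 to 404 can be put together in a small closed-loop sketch. It is a toy illustration under strong assumptions: a one-dimensional position, a scalar Kalman filter as history encoder, and a hand-written proportional rule `policy` standing in for a trained control policy. All names, noise levels and the goal value are hypothetical.

```python
import numpy as np


def encode_history(history, a_bar=1.0, var_z=0.01, var_w=0.25):
    """402: scalar Kalman filter over (action, observation) pairs; returns the state estimate."""
    mean, var = 0.0, 1.0
    for action, obs in history:
        mean, var = a_bar * mean + action, a_bar * a_bar * var + var_z   # prediction
        gain = var / (var + var_w)
        mean, var = mean + gain * (obs - mean), (1.0 - gain) * var       # update
    return mean


def policy(encoding, goal=5.0, max_step=1.0):
    """403: hypothetical stand-in for a trained policy, moves toward the goal."""
    return float(np.clip(goal - encoding, -max_step, max_step))


rng = np.random.default_rng(0)
position = 0.0
history = [(0.0, 0.0)]                          # 401: initial control history
for _ in range(30):
    encoding = encode_history(history)          # 402: encode the control history
    action = policy(encoding)                   # 403: query the control policy
    position += action                          # 404: the agent performs the action
    observation = position + rng.normal(0.0, 0.5)
    history.append((action, observation))       # 401: extended history for the next step
```

Re-encoding the full history at every step keeps the sketch short. A recursive implementation would instead keep the filter's mean and variance between steps and process only the newest action and observation.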

[0071] The approach of FIG. 4 can be used to compute a control signal for controlling a technical system (wherein the technical system or a controller of the technical system may be seen as the agent which in turn follows its control policy and is thus controlled by its control policy), like e.g. a computer-controlled machine, like a robot, a vehicle, a domestic appliance, a power tool, a manufacturing machine, a personal assistant or an access control system. According to various embodiments, a policy for controlling the technical system may be learnt and then the technical system may be operated accordingly.

[0072] Various embodiments may receive and use various types of sensor data for providing information about the environment and the state of the agent (e.g., technical system), i.e., to gather observations, in form of one or more discrete or continuous signals. This includes any type of measurement (force, velocity etc.) as well as image data (i.e., digital images) from various visual sensors (cameras) such as video, radar, LiDAR, ultrasonic, thermal imaging, motion, sonar etc.

[0073] The method of FIG. 4 may be performed by one or more data processing devices (e.g. computers or microcontrollers) having one or more data processing units. The term data processing unit may be understood to mean any type of entity that enables the processing of data or signals. For example, the data or signals may be handled according to at least one (i.e., one or more than one) specific function performed by the data processing unit. A data processing unit may include or be formed from an analog circuit, a digital circuit, a logic circuit, a microprocessor, a microcontroller, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or any combination thereof. Any other means for implementing the respective functions described in more detail herein may also be understood to include a data processing unit or logic circuitry. One or more of the method steps described in more detail herein may be performed (e.g., implemented) by a data processing unit through one or more specific functions performed by the data processing unit.

[0074] Accordingly, according to one embodiment, the method is computer-implemented.

CPC classification: G05D2101/15, G05D1/60, G05D2109/10

International classification: G05D1/60
