UNCERTAINTY-DIRECTED TRAINING OF A REINFORCEMENT LEARNING AGENT FOR TACTICAL DECISION-MAKING

Abstract

A method of providing a reinforcement learning, RL, agent for decision-making to be used in controlling an autonomous vehicle. The method includes: a plurality of training sessions, in which the RL agent interacts with a first environment including the autonomous vehicle, each training session having a different initial value and yielding a state-action value function Q.sub.k(s, a) dependent on state and action; an uncertainty evaluation on the basis of a variability measure for the plurality of state-action value functions evaluated for one or more state-action pairs corresponding to possible decisions by the trained RL agent; additional training, in which the RL agent interacts with a second environment including the autonomous vehicle, wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the variability measure indicates a relatively higher uncertainty.

Claims

1. A method of providing a reinforcement learning, RL, agent for decision-making to be used in controlling an autonomous vehicle, the method comprising: a plurality of training sessions, in which the RL agent interacts with a first environment including the autonomous vehicle, each training session having a different initial value and yielding a state-action value function Q.sub.k(s, a) dependent on state and action; an uncertainty evaluation on the basis of a variability measure for the plurality of state-action value functions evaluated for one or more state-action pairs corresponding to possible decisions by the trained RL agent; additional training, in which the RL agent interacts with a second environment including the autonomous vehicle, wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the variability measure indicates a relatively higher uncertainty.

2. The method of claim 1, further comprising: traffic sampling, in which state-action pairs encountered by the autonomous vehicle are recorded on the basis of at least one physical sensor signal, wherein the uncertainty evaluation relates to the recorded state-action pairs.

3. The method of claim 1, wherein the first and/or the second environment is a simulated environment.

4. The method of claim 3, wherein the second environment is generated from the subset of state-action pairs.

5. The method of claim 1, wherein the state-action pairs in the subset have a variability measure exceeding a predefined threshold.

6. The method of claim 1, wherein the additional training includes modifying said plurality of state-action value functions in respective training sessions.

7. The method of claim 1, wherein the additional training includes modifying a combined state-action value function representing a central tendency of said plurality of state-action value functions.

8. The method of claim 1, wherein the RL agent is configured for tactical decision-making.

9. The method of claim 1, wherein the RL agent includes at least one neural network.

10. The method of claim 9, wherein the RL agent is obtained by a policy gradient algorithm, such as an actor-critic algorithm.

11. The method of claim 9, wherein the RL agent is a Q-learning agent, such as a deep Q network, DQN.

12. The method of claim 9, wherein the training sessions use an equal number of neural networks.

13. The method of claim 9, wherein the initial value corresponds to a randomized prior function, RPF.

14. The method of claim 1, wherein the variability measure is one or more of: a variance, a range, a deviation, a variation coefficient, an entropy.

15. An arrangement for controlling an autonomous vehicle, comprising: processing circuitry and memory implementing a reinforcement learning, RL, agent configured to interact with a first environment including the autonomous vehicle in a plurality of training sessions, each training session having a different initial value and yielding a state-action value function Q.sub.k(s, a) dependent on state and action, the processing circuitry and memory further implementing a training manager configured to: estimate an uncertainty on the basis of a variability measure for the plurality of state-action value functions evaluated for one or more state-action pairs corresponding to possible decisions by the trained RL agent, and initiate additional training, in which the RL agent interacts with a second environment including the autonomous vehicle, wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the variability measure indicates a relatively higher uncertainty.

16. The arrangement of claim 15, further comprising a vehicle control interface configured to record state-action pairs encountered by the autonomous vehicle on the basis of at least one physical sensor in the autonomous vehicle, wherein the training manager is configured to estimate the uncertainty for the recorded state-action pairs.

17. A computer program comprising instructions to cause a processor to perform the method of any of claim 1.

18. A data carrier carrying the computer program of claim 17.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0025] Embodiments of the invention are described, by way of example, with reference to the accompanying drawings, on which:

[0026] FIG. 1 is a flowchart of a method according to an embodiment of the invention;

[0027] FIG. 2 is a block diagram of an arrangement for controlling an autonomous vehicle according to another embodiment of the invention;

[0028] FIG. 3 shows an architecture of a neural network of an RL agent; and

[0029] FIG. 4 is a plot of the mean uncertainty of a chosen action over 5 million training steps in an example.

DETAILED DESCRIPTION

[0030] The aspects of the present invention will now be described more fully with reference to the accompanying drawings, on which certain embodiments of the invention are shown. These aspects may, however, be embodied in many different forms and the described embodiments should not be construed as limiting; rather, they are provided by way of example so that this disclosure will be thorough and complete, and to fully convey the scope of all aspects of invention to those skilled in the art.

[0031] Reinforcement learning is a subfield of machine learning, where an agent interacts with some environment to learn a policy π(s) that maximizes the future expected return. Reference is made to the textbook R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2.sup.nd ed., MIT Press (2018).

[0032] The policy π(s) defines which action a to take in each state s. When an action is taken, the environment transitions to a new state s′ and the agent receives a reward r. The reinforcement learning problem can be modeled as a Markov Decision Process (MDP), which is defined by the tuple ( custom-character T; R; γ) where is the state space, is the action space, T is a state transition model (or evolution operator), R is a reward model, and γ is a discount factor. This model can also be considered to represent the RL agent's interaction with the training environment. At every time step t, the goal of the agent is to choose an action a that maximizes the discounted return,

[00001] $R_{t} = {.Math.}_{k = 0}^{\infty} γ^{k} r_{t + k} .$

In Q-learning, the agent tries to learn the optimal action-value function Q*(s, a), which is defined as

[00002] $Q^{*} (s, a) = \max_{π} 𝔼 [R_{t} | s_{t} = s, a_{t} = a, π] .$

From the optimal action-value function, the policy is derived as per

[00003] $π (s) = \underset{a}{argmax} Q^{*} (s, a) .$

[0033] An embodiment of the invention is illustrated by FIG. 1, which is a flowchart of a method 100 for providing an RL agent for decision-making to be used in controlling an autonomous vehicle. In the embodiment illustrated, the method begins with a plurality of training sessions 110-1, 110-2, . . . , 110-K (K≥2), which may be carried out in a simultaneous or at least time-overlapping fashion. In each training session, the RL agent interacts with a first environment E1 which has its own initial value and includes the autonomous vehicle (or, if the environment is simulated, a model of the vehicle). The k.sup.th training session returns a state-action value function Q.sub.k(s, a), for any 1≤k≤K, from which a decision-making policy may be derived in the manner described above. Preferably, all K state-action value functions are combined into a combined state-action value function Q(s, a), which may represent a central tendency of the state-action value functions, such as a mean value, of the K state-action values:

[00004] $\bar{Q} (s, a) = \frac{1}{K} {.Math.}_{k = 1}^{K} Q_{k} (s, a) .$

[0034] The inventors have realized that the uncertainty of a possible decision corresponding to the state-action pair (ŝ, â) can be estimated on the basis of a variability of the numbers Q.sub.1(ŝ, â), Q.sub.2(ŝ, â), . . . , Q.sub.K (ŝ, â). The variability may be measured as the standard deviation, coefficient of variation (i.e., standard deviation normalized by mean), variance, range, mean absolute difference or the like. The variability measure is denoted c.sub.v(ŝ, â) in this disclosure, whichever definition is used. Conceptually, a goal of the method 100 is to determine a training set custom-character of those states for which the RL agent will be from additional training:

custom-character ={s∈S: c.sub.v(s,a)>C.sub.v for some a∈},

where C.sub.v is the threshold variability and custom-character is the set of possible actions in state s. If the threshold C.sub.v is predefined, its value may represent a desired safety level at which the autonomous vehicle is to be operated. It may have been determined or calibrated by traffic testing and may be based on the frequency of decisions deemed erroneous, collisions, near-collisions, road departures and the like. A possible alternative is to set the threshold dynamically, e.g., in such manner that a predefined percentage of the state-action pairs are to have increased exposure during the additional training.

[0035] To determine the need for additional training, the disclosed method 100 includes an uncertainty evaluation 114 of at least some of the RL agent's possible decisions, which can be represented as state-action pairs. One option is to perform a full uncertainty evaluation including also state-action pairs with a relatively low incidence in real traffic. Another option is to perform a partial uncertainty evaluation. To this end, prior to the uncertainty evaluation 114, an optional traffic sampling 112 may be performed, during which the state-action pairs that are encountered in the traffic are recorded. In the notation used above, the collection of recorded state-action pairs may be written {(ŝ.sub.l, â.sub.l)∈ custom-character ×: 1≤l≤L}, where l is an arbitrary index. In the uncertainty evaluation 114, the variability measure c.sub.v is computed for each recorded state-action pair, that is, for each l∈[1, L]. The training set can be approximated

custom-character ={s∈s=ŝ.sub.l and c.sub.v(ŝ.sub.l,â.sub.l)>C.sub.v for some l∈[1,L]}.

Recalling that L.sub.B, as defined above, is the set of all indices for which the threshold variability C.sub.v is exceeded, the approximate training set is equal to

custom-character ={s∈ŝ.sub.l for some l∈L.sub.B}.

[0036] The method 100 then concludes with an additional training stage 116, in which the RL agent is caused to interact with a second environment E2 differing from the first environment E1 by an increased exposure to the training set custom-character or by an increased exposure to the approximate training set or by another property promoting the exposure to the state-action pairs (ŝ.sub.l, â.sub.l) that have indices in L.sub.B. In some embodiments, it is an RL agent corresponding to combined state-action value function Q(s, a) that is caused to interact with the second environment E2. In other embodiments, the K state-action value function Q.sub.k(s, a), which may be regarded as an equal number of sub-agents, are caused to interact with the second environment E2. Whichever option is chosen, the method 100 is suitable for providing an RL agent adapted to make decisions underlying the control of an autonomous vehicle.

[0037] To illustrate the uncertainty evaluation stage 114, an example output when L=15 and the variability measure is the coefficient of variations may be represented as in Table 1.

TABLE-US-00001 TABLE 1 Example uncertainty evaluation l (ŝ.sub.l, â.sub.l) c.sub.v(ŝ.sub.l, â.sub.l) 1 (S1, right) 0.011 2 (S1, remain) 0.015 3 (S1, left) 0.440 4 (S2, yes) 0.005 5 (S2, no) 0.006 6 (S3, A71) 0.101 7 (S3, A72) 0.017 8 (S3, A73) 0.026 9 (S3, A74) 0.034 10 (S3, A75) 0.015 11 (S3, A76) 0.125 12 (S3, A77) 0.033 13 (S4, right) 0.017 14 (S4, remain) 0.002 15 (S4, left) 0.009

[0038] Here, the sets of possible actions for each state S1, S2, S3, S4 are not known. If it is assumed that the enumeration of state-action pairs for each state is exhaustive, then custom-character |.sub.S1=|.sub.S4={right, remain, left}, |.sub.S2={yes, no} and |.sub.S3={A71, A72, A73, A74, A75, A76, A77}. If the enumeration is not exhaustive, then {right, remain, left}⊂|.sub.S1, {yes, no}⊂|.sub.S2 and so forth. For an example value of the threshold C.sub.v=0.020, one obtains L.sub.B={3,6,8,9,11,12}. The training set consists of all states for which at least one action belongs to a state-action pair with a variability measure exceeding the threshold, namely, custom-character ={S1,S3}, which will be the emphasis of the additional training 116.

[0039] The training set S.sub.B can be defined in alternative ways. For example, the training set may be taken to include all states s∈ custom-character for which the mean variability of the possible actions |.sub.s exceeds the threshold C.sub.v. This may be a proper choice if it is deemed acceptable for the RL agent to have minor points of uncertainty but that the bulk of its decisions are relatively reliable.

[0040] FIG. 2 illustrates an arrangement 200 for controlling an autonomous vehicle 299 according to another embodiment of the invention. The autonomous vehicle 299 may be any road vehicle or vehicle combination, including trucks, buses, construction equipment, mining equipment and other heavy equipment operating in public or non-public traffic. The arrangement 200 may be provided, at last partially, in the autonomous vehicle 299. The arrangement 200, or portions thereof, may alternatively be provided as part of a stationary or mobile controller (not shown), which communicates with the vehicle 299 wirelessly.

[0041] The arrangement 200 includes processing circuitry 210, a memory 212 and a vehicle control interface 214. The vehicle control interface 214 is configured to control the autonomous vehicle 299 by transmitting wired or wireless signals, directly or via intermediary components, to actuators (not shown) in the vehicle. In a similar fashion, the vehicle control interface 214 may receive signals from physical sensors (not shown) in the vehicle so as to detect current conditions of the driving environment or internal states prevailing in the vehicle 299. The processing circuitry 210 implements an RL agent 220 and a training manager 222 to be described next.

[0042] The RL agent 220 interacts with a first environment E1 including the autonomous vehicle 299 in a plurality of training sessions, each training session having a different initial value and yielding a state-action value function dependent on state and action. The RL agent 220 may, at least during the training phase, comprise as many sub-agents as there are training sessions, each sub-agent corresponding to a state-action value function Q.sub.k(s, a). The sub-agents may form a combined into a joint RL agent, corresponding to the combined state-action value function Q(s, a) introduced above, for the purpose of the decision-making. The RL agent 220 according to any of these definitions can be brought to interact with a second environment E2 in which the autonomous vehicle 299 is exposed more intensely to the training set custom-character or its approximation as discussed previously.

[0043] The training manager 222 is configured to estimate an uncertainty on the basis of a variability measure for the plurality of state-action value functions evaluated for a state-action pair corresponding to each of the possible decisions by the RL agent. In some embodiments, the training manager 222 does not perform a complete uncertainty estimation. For example, as suggested by the broken-line arrow, the training manager 222 may receive physical sensor data via the vehicle control interface 214 and determine on this basis a collection of state-action pairs to evaluate, {(ŝ.sub.l, â.sub.l)∈ custom-character ×1≤l≤L}. The training manager 222 is configured to estimate an uncertainty on the basis of the variability measure c.sub.v for the K state-action value functions Q.sub.k(s, a) evaluated for these state-action pairs. The state-action pairs found to be associated with a relatively higher value of the variability measure are to be focused on in additional training, which the training manager 222 will initiate.

[0044] The thus additionally trained RL agent may be used to control the autonomous vehicle 299, namely by executing decision by the RL agent via the vehicle control interface 214.

[0045] Returning to the description of the invention from a mathematical viewpoint, an embodiment relies on the DQN algorithm. This algorithm uses a neural network with weights θ to approximate the optimal action-value function as Q*(s, a)≈Q(s, a; θ); see further V. Mnih et al., “Human-level control through deep reinforcement learning”, Nature, vol. 518, pp. 529-533 (2015) [doi:10.1038/nature14236.]. Since the action-value function follows the Bellman equation, the weights can be optimized by minimizing the loss function

[00005] $L (θ) = 𝔼 [{(r + γ \max_{a} Q (s^{'}, a^{'} : θ^{-}) - Q (s, a; θ))}^{2}]$

[0046] As explained in Mnih, the loss is calculated for a minibatch M and the weights θ.sup.− of a target network are updated repeatedly.

[0047] The DQN algorithm returns a maximum likelihood estimate of the Q values but gives no information about the uncertainty of the estimation. The risk of an action could be represented as the variance of the return when taking that action. One line of RL research focuses on obtaining an estimate of the uncertainty by statistical bootstrap; an ensemble of models is then trained on different subsets of the available data and the distribution that is given by the ensemble is used to approximate the uncertainty. A sometimes better-performing Bayesian posterior is obtained if a randomized prior function (RPF) is added to each ensemble member; see for example I. Osband, J. Aslanides and A. Cassirer, “Randomized prior functions for deep reinforcement learning,” in: S. Bengjo et al. (eds.), Adv. in Neural Inf. Process. Syst. 31 (2018), pp. 8617-8629. When RPF is used, each individual ensemble member, here indexed by k, estimates the Q values as the sum

Q.sub.k(s,a)=f(s,a;θ.sub.k)+βp(s,a;{circumflex over (θ)}.sub.k),

where f, p are neural networks, with parameters θ.sub.k that can be trained and further parameters {circumflex over (θ)}.sub.k that are kept fixed. The factor β can be used to tune the importance of the prior function. When adding the prior, the loss function L(θ) defined above changes into

[00006] $L (θ_{k}) = 𝔼_{M} [{(r + γ \max_{a'} (f_{θ_{k}^{-}} + β {p_{\hat{θ}}}_{k}) (s^{'}, a^{'}) - (f_{θ_{k}^{-}} + β p_{{\hat{θ}}_{k}}) (s, a))}^{2}] .$

[0048] The full ensemble RPF method, which was used in this implementation, may be represented in pseudo-code as Algorithm 1:

TABLE-US-00002 Algorithm 1 Ensemble RPF training process 1: for k ← 1 to K 2: Initialize θ.sub.k and {circumflex over (θ)}.sub.k randomly 3: m.sub.k ← { } 4: i ← 0 5: while networks not converged 6: s.sub.i ← initial random state 7: k ~ custom-character {1, K} 8: while episode not finished 9: a.sub.i ← arg max.sub.a Q.sub.k(s.sub.i, a) 10: s.sub.i+1, r.sub.i ← STEPENVIRONMENT(s.sub.i, a.sub.i) 11: for k ← 1 to K 12: if p ~ (0, 1) < p.sub.add 13: m.sub.k ← m.sub.k ∪ {(s.sub.i, a.sub.i, r.sub.i, s.sub.i+1)} 14: M ← sample from m.sub.k 15: update θ.sub.k with SGD and loss L(θ.sub.k) 16: i ← i + 1
In the pseudo-code, the function StepEnvironment corresponds to a combination of the reward model R and state transition model T discussed above. The notation k˜ custom-character {1, K} refers to sampling of an integer k from a uniform distribution over the integer range [1, K], and p˜(0,1) denotes sampling of a real number from a uniform distribution over the open interval (0,1).

[0049] Here, an ensemble of K trainable neural networks and K fixed prior networks are first initialized randomly. A replay memory is divided into K parallel buffers m.sub.k, for the individual ensemble members (although in practice, this can be implemented in a memory-efficient way that uses a negligible amount of more memory than a single replay memory). To handle exploration, a random ensemble member is chosen for each training episode. Actions are then taken by greedily maximizing the Q value of the chosen ensemble member, which corresponds to a form of approximate Thompson sampling. The new experience (s.sub.i, a.sub.i, r.sub.i, s.sub.i+1) is then added to each ensemble buffer with probability p.sub.add Finally, a minibatch M of experiences is sampled from each ensemble buffer and the trainable network parameters of the corresponding ensemble member are updated by stochastic gradient descent (SGD), using the second definition of the loss function given above.

[0050] The presented ensemble RPF algorithm was trained in a one-way, three-lane highway driving scenario using the Simulation of Urban Mobility (SUMO) traffic simulator. The vehicle to be controlled (ego vehicle) was a 16 m long truck-trailer combination with a maximum speed of 25 m/s. In the beginning of each episode, 25 passenger cars were inserted into the simulation, with a random desired speed in the range 15 to 35 m/s. In order to create interesting traffic situations, slower vehicles were positioned in front of the ego vehicle, and faster vehicles were placed behind the ego vehicle. Each episode was terminated after N=100 timesteps, or earlier if a collision occurred or the ego vehicle drove off the road. The simulation time step was set to Δt=1 s. The passenger vehicles were controlled by the standard SUMO driver model, which consists of an adaptive cruise controller for the longitudinal motion and a lane-change model that makes tactical decisions to overtake slower vehicles. In the scenarios considered here, no strategical decisions were necessary, so the strategical part of the lane-changing model was turned off. Furthermore, in order to make the traffic situations more demanding, the cooperation level of the lane changing model was set to zero. Overtaking was allowed both on the left and right side of another vehicle, and each change took 4 s to complete. This environment was modeled by defining a corresponding state space custom-character action space state transition model T, and reward R.

[0051] FIG. 3 illustrates the architecture of the neural network is used in this embodiment. The architecture includes a temporal convolutional neural network (CNN) architecture, which makes the training faster and, at least in some use cases, gives better results than a standard fully connected (FC) architecture. By applying CNN layers and a max pooling layer to the part of the input that describes the surrounding vehicles, the output of the network becomes independent of the ordering of the surrounding vehicles in the input vector, and the architecture allows a varying input vector size. Rectified linear units (ReLUs) are used as activation functions for all layers, except the last, which has a linear activation function. The architecture also includes a dueling structure that separates the state value V(s) and action advantage A(s, a) estimation.

[0052] In an example, the RL agent was trained in the simulated environment described above. After every 50000 added training samples, henceforth called training steps, the agent was evaluated on 100 different test episodes. These test episodes were randomly generated in the same way as the training episodes, but not present during the training. The test episodes were also kept identical for all the test phases.

[0053] To gain insight into how the uncertainty estimation evolves during the training process, and to illustrate how to set the uncertainty threshold parameter C.sub.v, FIG. 4 shows the coefficient of variation c.sub.v for the chosen action during the test episodes as a function of the number of training steps (scale in millions of steps). Each plotted value is an average over the 100 test episodes of that test phase. FIG. 4 shows the uncertainty of the chosen action, whereas the uncertainty for not-chosen actions may be higher. After around four million training steps, the coefficient of variation settles at around 0.01, with a small spread in values, which may justify setting the threshold at or around C.sub.v=0.02.

[0054] To assess the ability of the RPF ensemble agent to cope with unseen situations, the agent obtained after five million training steps was deployed in scenarios that had not been included in the training episodes. In various situations that involved an oncoming vehicle, the uncertainty estimate was consistently high, c.sub.v≈0.2. The fact that this value is one level of magnitude above the proposed value of the threshold C.sub.v=0.02, along with several further examples, suggests that the criterion c.sub.v(s, a)>C.sub.v for including this state-action value pair (or the state that it involves) in the additional training is a robust and reliable guideline.

[0055] The aspects of the present disclosure have mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the invention, as defined by the appended patent claims. In particular, the disclosed approach to estimating the uncertainty of a possible decision by an RL agent is applicable in machine learning more generally, also outside of the field of autonomous vehicles. It may be advantageous wherever the reliability of a possible decision is expected to influence personal safety, material values, information quality, user experience and the like, and where the problem of precisely focusing additional training arises.

UNCERTAINTY-DIRECTED TRAINING OF A REINFORCEMENT LEARNING AGENT FOR TACTICAL DECISION-MAKING

Assignee

Inventors

Cpc classification

Classification Explorer

G06N7/01

PHYSICS

Classification Explorer

G06N3/006

PHYSICS

Classification Explorer

B60W50/045

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

B60W2050/0043

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

B60W2050/0028

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

B60W60/001

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

G06N3/048

PHYSICS

International classification

Classification Explorer

B60W60/00

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

B60W50/04

PERFORMING OPERATIONS; TRANSPORTING

Abstract

Claims

Description