MANAGING ALEATORIC AND EPISTEMIC UNCERTAINTY IN REINFORCEMENT LEARNING, WITH APPLICATIONS TO AUTONOMOUS VEHICLE CONTROL
20220374705 · 2022-11-24
CPC classification
G06N7/01
PHYSICS
G06F18/214
PHYSICS
G06N3/006
PHYSICS
G06F18/217
PHYSICS
B60W60/001
PERFORMING OPERATIONS; TRANSPORTING
Abstract
Methods relating to the control of autonomous vehicles using a reinforcement learning agent include a plurality of training sessions, in which the agent interacts with an environment, each having a different initial value and yielding a state-action quantile function dependent on state and action. The methods further include a first uncertainty estimation on the basis of a variability measure, relating to a variability with respect to quantile τ, of an average of the plurality of state-action quantile functions evaluated for a state-action pair; and a second uncertainty estimation on the basis of a variability measure, relating to an ensemble variability, for the plurality of state-action quantile functions evaluated for a state-action pair.
Claims
1. A method of controlling an autonomous vehicle using a reinforcement learning, RL, agent, the method comprising: a plurality of training sessions, in which the RL agent interacts with an environment including the autonomous vehicle, each training session having a different initial value and yielding a state-action quantile function dependent on state and action; decision-making, in which the RL agent outputs at least one tentative decision relating to control of the autonomous vehicle; a first uncertainty estimation on the basis of a variability measure, relating to a variability with respect to quantile, of an average of the plurality of state-action quantile functions evaluated for a state-action pair corresponding to the tentative decision; a second uncertainty estimation on the basis of a variability measure, relating to an ensemble variability, for the plurality of state-action quantile functions evaluated for a state-action pair corresponding to the tentative decision; and vehicle control, wherein the at least one tentative decision is executed in dependence of the first and/or second estimated uncertainty.
2. A method of providing a reinforcement learning, RL, agent for decision-making to be used in controlling an autonomous vehicle, the method comprising: a plurality of training sessions, in which the RL agent interacts with a first environment including the autonomous vehicle, each training session having a different initial value and yielding a state-action quantile function dependent on state and action; a first uncertainty estimation on the basis of a variability measure, relating to a variability with respect to quantile, of an average of the plurality of state-action quantile functions evaluated for state-action pairs corresponding to possible decisions by the trained RL agent; a second uncertainty estimation on the basis of a variability measure, relating to an ensemble variability, for the plurality of state-action quantile functions evaluated for said state-action pairs; and additional training, in which the RL agent interacts with a second environment including the autonomous vehicle, wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the first and/or second estimated uncertainty is relatively higher.
3. The method of claim 1, wherein the RL agent includes at least one neural network.
4. The method of claim 1, wherein each of the training sessions employs an implicit quantile network, IQN, from which the RL agent is derivable.
5. The method of claim 4, wherein the initial value of a training session corresponds to a randomized prior function, RPF.
6. The method of claim 1, wherein the uncertainty estimations relate to a combined aleatoric and epistemic uncertainty.
7. The method of claim 1, wherein the variability measure used in the second uncertainty estimation is applied to sampled expected values of the respective state-action quantile functions.
8. The method of claim 1, wherein the variability measure is one or more of: a variance, a range, a deviation, a variation coefficient, an entropy.
9. The method of claim 1, wherein the tentative decision is executed only if the first and second estimated uncertainties are less than respective predefined thresholds.
10. The method of claim 9, wherein: the decision-making includes the RL agent outputting multiple tentative decisions; and the vehicle control includes sequential evaluation of the tentative decisions with respect to their estimated uncertainties.
11. The method of claim 10, wherein a backup decision, which is optionally based on a backup policy, is executed if the sequential evaluation does not return a tentative decision to be executed.
12. The method of claim 1, wherein the decision-making includes tactical decision-making.
13. The method of claim 1, wherein the decision-making is based on a central tendency of weighted averages of the respective state-action quantile functions.
14. An arrangement for controlling an autonomous vehicle, comprising: processing circuitry and memory implementing a reinforcement learning, RL, agent configured to interact with an environment including the autonomous vehicle in a plurality of training sessions, each training session having a different initial value and yielding a state-action quantile function dependent on state and action, and output at least one tentative decision relating to control of the autonomous vehicle, the processing circuitry and memory further implementing a first uncertainty estimator and a second uncertainty estimator configured for a first uncertainty estimation on the basis of a variability measure, relating to a variability with respect to quantile τ, of an average of the plurality of state-action quantile functions evaluated for a state-action pair corresponding to the tentative decision, and a second uncertainty estimation on the basis of a variability measure, relating to an ensemble variability, for the plurality of state-action quantile functions evaluated for a state-action pair corresponding to the tentative decision, the arrangement further comprising a vehicle control interface configured to control the autonomous vehicle by executing the at least one tentative decision in dependence of the estimated first and/or second uncertainty.
15. An arrangement for controlling an autonomous vehicle, comprising: processing circuitry and memory implementing a reinforcement learning, RL, agent configured to interact with a first environment including the autonomous vehicle in a plurality of training sessions, each training session having a different initial value and yielding a state-action quantile function dependent on state and action, the processing circuitry and memory further implementing a training manager configured to perform a first uncertainty estimation on the basis of a variability measure, relating to a variability with respect to quantile τ, of an average of the plurality of state-action quantile functions evaluated for one or more state-action pairs corresponding to possible decisions by the trained RL agent, perform a second uncertainty estimation on the basis of a variability measure, relating to an ensemble variability, for the plurality of state-action quantile functions evaluated for said state-action pairs, and initiate additional training, in which the RL agent interacts with a second environment including the autonomous vehicle, wherein the second environment differs from the first environment by an increased exposure to a subset of state-action pairs for which the first and/or second estimated uncertainty is relatively higher.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] Aspects and embodiments are now described, by way of example, with reference to the accompanying drawings, on which:
DETAILED DESCRIPTION
[0022] The aspects of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, on which certain embodiments of the invention are shown. These aspects may, however, be embodied in many different forms and should not be construed as limiting; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete and will fully convey the scope of all aspects of the invention to those skilled in the art. Like numbers refer to like elements throughout the description.
Theoretical Concepts
[0023] Reinforcement learning (RL) is a branch of machine learning, where an agent interacts with some environment to learn a policy π(s) that maximizes the future expected return. Reference is made to the textbook R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press (2018).
[0024] The policy π(s) defines which action a to take in each state s. When an action is taken, the environment transitions to a new state s′ and the agent receives a reward r. The decision-making problem that the RL agent tries to solve can be modeled as a Markov decision process (MDP), which is defined by the tuple (𝒮, 𝒜, T, R, γ), where 𝒮 is the state space, 𝒜 is the action space, T is a state transition model (or evolution operator), R is a reward model, and γ is a discount factor. The goal of the RL agent is to maximize the expected future return 𝔼[R_t], for every time step t, where R_t = Σ_{k=t}^∞ γ^{k−t} r_k is the discounted return.
The value of taking action a in state s and then following policy π is defined by the state-action value function

Q^π(s, a) = 𝔼[R_t | s_t = s, a_t = a, π].
In Q-learning, the agent tries to learn the optimal state-action value function, which is defined as

Q*(s, a) = max_π Q^π(s, a),

and the optimal policy is derived from the optimal action-value function using the relation

π*(s) = argmax_a Q*(s, a).
[0025] In contrast to Q-learning, distributional RL aims to learn not only the expected return but also the distribution over returns. This distribution is represented by the random variable

Z^π(s, a) = R_t, given s_t = s, a_t = a and policy π.

The mean of this random variable is the classical state-action value function, i.e., Q^π(s, a) = 𝔼[Z^π(s, a)]. The distribution over returns represents the aleatoric uncertainty of the outcome, which can be used to estimate the risk in different situations and to train an agent in a risk-sensitive manner.
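As an illustration (not part of the disclosure), the relationship Q^π(s, a) = 𝔼[Z^π(s, a)] and the use of low quantiles as a risk measure can be sketched with sampled returns; the distribution parameters below are arbitrary:

```python
import numpy as np

# Illustrative sketch only: represent the return distribution Z^pi(s, a) by
# samples; the mean recovers the classical Q-value, while low quantiles
# expose the aleatoric risk of the outcome. Parameters are arbitrary.
rng = np.random.default_rng(0)
returns = rng.normal(loc=5.0, scale=2.0, size=10_000)  # sampled returns

q_value = returns.mean()                 # Q^pi(s, a) = E[Z^pi(s, a)]
q10 = np.quantile(returns, 0.10)         # pessimistic (10%) outcome
cvar10 = returns[returns <= q10].mean()  # mean of the worst 10% of outcomes
```

A risk-sensitive agent could rank actions by `cvar10` instead of `q_value`, trading expected return for protection against bad outcomes.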
[0026] The random variable Z^π has a cumulative distribution function F_Z, whose inverse defines the quantile function Z_τ(s, a) = F_Z^{−1}(τ) for quantiles τ ∈ (0, 1). For τ sampled uniformly, τ ~ U(0, 1), the sample Z_τ(s, a) has the probability distribution of Z^π(s, a), that is, Z_τ(s, a) ~ Z^π(s, a).
[0027] The approach of the present invention, termed the Ensemble Quantile Networks (EQN) method, enables a full uncertainty estimate covering both the aleatoric and the epistemic uncertainty. An agent trained by EQN can then take actions that consider both the inherent uncertainty of the outcome and the model uncertainty in each situation.
[0028] The EQN method uses an ensemble of neural networks, where each ensemble member individually estimates the distribution over returns. This is related to the implicit quantile network (IQN) framework; reference is made to the above-cited works by Dabney and coauthors. The k-th ensemble member provides:

Z_{k,τ}(s, a) = f_τ(s, a; θ_k) + β p_τ(s, a; θ̂_k),

where f_τ and p_τ are neural networks with identical architecture, θ_k are trainable network parameters (weights), whereas θ̂_k denotes fixed network parameters. The second term may be a randomized prior function (RPF), as described in I. Osband, J. Aslanides and A. Cassirer, “Randomized prior functions for deep reinforcement learning,” in: S. Bengio et al. (eds.), Adv. in Neural Inf. Process. Syst. 31 (2018), pp. 8617-8629. The factor β can be used to tune the importance of the RPF. The temporal difference (TD) error of ensemble member k, for two quantile samples τ, τ′ ~ U(0, 1), is

δ_{k,t}^{τ,τ′} = r_t + γ Z_{k,τ′}(s_{t+1}, π̃_k(s_{t+1})) − Z_{k,τ}(s_t, a_t),

where

π̃_k(s) = argmax_a (1/K_τ) Σ_{j=1}^{K_τ} Z_{k,τ̃_j}(s, a)

is a sample-based estimate of the optimal policy using τ̃_j ~ U(0, 1) and K_τ is a positive integer.
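A minimal sketch of the ensemble-member structure, using toy fully connected networks in place of the disclosed architecture (all names, sizes and values below are illustrative assumptions):

```python
import numpy as np

class TinyNet:
    """Toy stand-in for the networks f and p (identical architecture)."""
    def __init__(self, rng, n_in=4, n_hidden=16, n_out=3):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))

    def __call__(self, x):
        return np.tanh(x @ self.W1) @ self.W2

class EnsembleMember:
    """Z_{k,tau}(s, a) = f(.; theta_k) + beta * p(.; theta_hat_k)."""
    def __init__(self, rng, beta=300.0):
        self.f = TinyNet(rng)  # trainable parameters theta_k
        self.p = TinyNet(rng)  # fixed prior theta_hat_k (never updated)
        self.beta = beta

    def z(self, x):
        # Only self.f would be updated during training; the scaled prior
        # keeps untrained members diverse, which drives epistemic estimates.
        return self.f(x) + self.beta * self.p(x)

rng = np.random.default_rng(1)
ensemble = [EnsembleMember(rng) for _ in range(10)]  # K = 10 members
x = rng.normal(size=4)                     # placeholder (s, a, tau) embedding
zs = np.stack([m.z(x) for m in ensemble])  # one output row per member
```

Before training, the frozen priors dominate the outputs, so the ensemble disagrees strongly in unvisited states; training shrinks this disagreement only where data is available.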
[0029] Quantile regression is used. The regression loss, with threshold κ, is calculated as

ρ_τ^κ(δ) = |τ − 1{δ < 0}| · L_κ(δ)/κ,

where 1{·} is the indicator function and L_κ is a regression loss with threshold κ (a suitable choice is given below). The full loss function is obtained from a mini-batch M of sampled experiences, in which the quantiles τ and τ′ are sampled N and N′ times, respectively, according to:

L_EQN(θ_k) = 𝔼_M[ (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ_{τ_i}^κ(δ_{k,t}^{τ_i,τ′_j}) ].
For each new training episode, the agent follows the policy π̃_v(s) of a randomly selected ensemble member v.
[0030] An advantageous option is to use the quantile Huber regression loss, in which L_κ is the Huber loss, defined as

L_κ(δ) = ½ δ², if |δ| ≤ κ; κ(|δ| − ½ κ), otherwise,

which ensures a smooth gradient as δ_{k,t}^{τ,τ′} → 0.
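The Huber and quantile Huber losses above can be written down directly; the sketch below follows the standard IQN form with the threshold κ = 10 used later in the training (Table 1):

```python
import numpy as np

def huber(delta, kappa=10.0):
    # L_kappa(delta): quadratic for |delta| <= kappa, linear beyond,
    # which keeps the gradient smooth as delta -> 0.
    a = np.abs(delta)
    return np.where(a <= kappa, 0.5 * delta ** 2, kappa * (a - 0.5 * kappa))

def quantile_huber(delta, tau, kappa=10.0):
    # rho^kappa_tau(delta) = |tau - 1{delta < 0}| * L_kappa(delta) / kappa:
    # positive TD errors are weighted by tau, negative ones by 1 - tau,
    # which is what makes the regression target the tau-quantile.
    delta = np.asarray(delta, dtype=float)
    indicator = (delta < 0).astype(float)
    return np.abs(tau - indicator) * huber(delta, kappa) / kappa
```

For a high quantile such as τ = 0.9, underestimating the return (positive error) is penalized roughly nine times more than overestimating it, pushing the prediction toward the 90th percentile.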
[0031] The full training process of the EQN agent that was used in this implementation may be represented in pseudo-code as follows:
Algorithm 3 EQN training process
 1: for k ← 1 to K
 2:   Initialize θ_k and θ̂_k randomly
 3:   m_k ← { }
 4: t ← 0
 5: while networks not converged
 6:   s_t ← initial random state
 7:   v ~ U{1, K}
 8:   while episode not finished
 9:     τ̃_1, . . . , τ̃_{K_τ} ~ U(0, 1), i.i.d.
10:     a_t ← argmax_a (1/K_τ) Σ_j Z_{v,τ̃_j}(s_t, a)
11:     s_{t+1}, r_t ← StepEnvironment(s_t, a_t)
12:     for k ← 1 to K
13:       if u ~ U(0, 1) < p_add
14:         m_k ← m_k ∪ {(s_t, a_t, r_t, s_{t+1})}
15:       M ← sample from m_k
16:       update θ_k with SGD and loss L_EQN(θ_k)
17:     t ← t + 1
In the pseudo-code, the function StepEnvironment corresponds to a combination of the reward model R and state transition model T discussed above. The notation v ~ U{1, K} refers to sampling of an integer v from a uniform distribution over the integer range [1, K], and τ ~ U(0, α) denotes sampling of a real number from a uniform distribution over the open interval (0, α). SGD is short for stochastic gradient descent and i.i.d. means independent and identically distributed.
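The experience-adding step of the pseudo-code (with probability p_add) implements a bootstrap: each transition is offered to every ensemble member's replay memory independently, so the members train on partially disjoint data, which sustains ensemble diversity. A small sketch with placeholder transitions:

```python
import random

# Sketch of the bootstrapped replay memories m_1, ..., m_K: every transition
# is added to each member's buffer with probability p_add = 0.5, so each
# buffer ends up holding a different random subset of the experience.
random.seed(0)
K, p_add = 10, 0.5
buffers = [[] for _ in range(K)]
for t in range(1000):
    transition = (t, "a", 0.0, t + 1)  # placeholder (s_t, a_t, r_t, s_{t+1})
    for k in range(K):
        if random.random() < p_add:
            buffers[k].append(transition)
```

Each buffer then holds roughly half of the 1,000 transitions, but a different half per member.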
[0032] The EQN agent allows an estimation of both the aleatoric and epistemic uncertainties, based on a variability measure of the returns, Var_τ[𝔼_k[Z_{k,τ}(s, a)]], and a variability measure of an expected value of returns, Var_k[𝔼_τ[Z_{k,τ}(s, a)]]. Here, the variability measure Var[·] may be a variance, a range, a deviation, a variation coefficient, an entropy or combinations of these. An index of the variability measure is used to distinguish variability with respect to the quantile (Var_τ[·], 0 ≤ τ ≤ 1) from variability across ensemble members (Var_k[·], 1 ≤ k ≤ K). Further, the sampled expected value operator may be defined as

𝔼_τ[Z_{k,τ}(s, a)] = (1/K_τ) Σ_{j=1}^{K_τ} Z_{k,τ̃_j}(s, a),

where τ̃_j ~ U(0, 1) and K_τ is a positive integer. After training of the neural networks, for the reasons presented above, it holds that Z_{k,τ}(s, a) ~ Z^π(s, a) for each k. It follows that

𝔼_τ[Z_{k,τ}(s, a)] ≈ 𝔼[Z^π(s, a)] = Q^π(s, a),

wherein the approximation may be expected to improve as K_τ increases.
[0033] On this basis, the trained agent may be configured to follow the following policy:

π(s) = argmax_a 𝔼_k[𝔼_τ[Z_{k,τ}(s, a)]], if the agent is confident about the maximizing pair (s, a); π(s) = π_backup(s), otherwise,

where π_backup(s) is a decision by a fallback policy or backup policy, which represents safe behavior. The agent is deemed to be confident about a decision (s, a) if both

Var_τ[𝔼_k[Z_{k,τ}(s, a)]] < σ_a²

and

Var_k[𝔼_τ[Z_{k,τ}(s, a)]] < σ_e²,

where σ_a, σ_e are constants reflecting the tolerable aleatoric and epistemic uncertainty, respectively.
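As an illustration, both uncertainty estimates can be computed from a matrix of return samples z[k, i] = Z_{k,τ_i}(s, a), using variance as the variability measure. The toy quantile function below is an arbitrary stand-in for trained networks; the thresholds mirror σ_a = 1.5, σ_e = 1.0 from the results section:

```python
import numpy as np

# z[k, i]: return sample of ensemble member k at quantile tau_i.
rng = np.random.default_rng(2)
K, N_tau = 10, 32
tau = rng.uniform(0.0, 1.0, size=N_tau)
# Toy model: a common linear quantile function plus a small per-member offset
# (small offsets mimic a well-trained, in-distribution state-action pair).
z = 5.0 + 2.0 * tau[None, :] + rng.normal(0.0, 0.1, size=(K, 1))

# Aleatoric: spread over quantiles of the ensemble-averaged quantile function.
aleatoric = np.var(z.mean(axis=0))   # Var_tau[ E_k[ Z_{k,tau}(s, a) ] ]
# Epistemic: spread over members of the per-member expected returns.
epistemic = np.var(z.mean(axis=1))   # Var_k [ E_tau[ Z_{k,tau}(s, a) ] ]

sigma_a2, sigma_e2 = 1.5 ** 2, 1.0 ** 2
confident = bool(aleatoric < sigma_a2 and epistemic < sigma_e2)
```

Because the per-member offsets are small while the return distribution itself has nonzero spread, the epistemic estimate comes out far below the aleatoric one here, and the confidence test passes.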
[0034] For computational simplicity, the first part of the confidence condition can be approximated by replacing the quantile variability Var_τ with an approximate variability measure Var_{τ̃}, computed over a finite set of sampled quantiles τ̃_j, together with the sampled expected value 𝔼_τ defined above.
Implementations
[0035] The presented algorithms for estimating the aleatoric or epistemic uncertainty of an agent have been tested in simulated traffic intersection scenarios. However, these algorithms provide a general approach and could be applied to any type of driving scenario. This section describes how a test scenario is set up, the MDP formulation of the decision-making problem, the design of the neural network architecture, and the details of the training process.
[0036] Simulation setup. An occluded intersection scenario was used. The scenario includes dense traffic and is used to compare the different algorithms, both qualitatively and quantitatively. The scenario was parameterized to create complicated traffic situations, where an optimal policy has to consider both the occlusions and the intentions of the other vehicles, sometimes drive through the intersection at a high speed, and sometimes wait at the intersection for an extended period of time.
[0037] The Simulation of Urban Mobility (SUMO) was used to run the simulations. The controlled ego vehicle, a 12 m long truck, aims to pass the intersection, within which it must yield to the crossing traffic. In each episode, the ego vehicle is inserted 200 m south of the intersection, with a desired speed v_set = 15 m/s. Passenger cars are randomly inserted into the simulation from the east and west ends of the road network with an average flow of 0.5 vehicles per second. The cars intend either to cross the intersection or to turn right. The desired speeds of the cars are uniformly distributed in the range [v_min, v_max] = [10, 15] m/s, and the longitudinal speed is controlled by the standard SUMO speed controller (a type of adaptive cruise controller, based on the Intelligent Driver Model (IDM)), with the exception that the cars ignore the presence of the ego vehicle. Normally, the crossing cars would brake to avoid a collision with the ego vehicle, even when the ego vehicle violates the traffic rules and does not yield. With this exception, however, more collisions occur, which gives a more distinct quantitative difference between different policies. Each episode is terminated when the ego vehicle has passed the intersection, when a collision occurs, or after N_max = 100 simulation steps. The simulations use a step size of Δt = 1 s.
[0038] It is noted that the setup of this scenario includes two important sources of randomness in the outcome for a given policy, which the aleatoric uncertainty estimation should capture. From the viewpoint of the ego vehicle, a crossing vehicle can appear at any time until the ego vehicle is sufficiently close to the intersection, due to the occlusions. Furthermore, there is uncertainty in the underlying driver state of the other vehicles, most importantly in the intention of going straight or turning to the right, but also in the desired speed.
[0039] Epistemic uncertainty is introduced by a separate test, in which the trained agent faces situations outside of the training distribution. In these test episodes, the maximum speed v_max of the surrounding vehicles is gradually increased from 15 m/s (which is included in the training episodes) to 25 m/s. To exclude effects of aleatoric uncertainty in this test, the ego vehicle starts in the non-occluded region close to the intersection, with a speed of 7 m/s.
[0040] MDP formulation. The following Markov decision process (MDP) describes the decision-making problem.
[0041] State space, 𝒮: The state of the system,

s = ({x_i, y_i, v_i, ψ_i} : 0 ≤ i ≤ N_veh),

consists of the position x_i, y_i, longitudinal speed v_i, and heading ψ_i of each vehicle, where index 0 refers to the ego vehicle. The agent that controls the ego vehicle can observe other vehicles within the sensor range x_sensor = 200 m, unless they are occluded.
[0042] Action space, 𝒜: At every time step, the agent can choose between three high-level actions: ‘stop’, ‘cruise’, and ‘go’, which are translated into accelerations through the IDM. The action ‘go’ makes the IDM control the speed towards v_set by treating the situation as if there were no preceding vehicles, whereas ‘cruise’ simply keeps the current speed. The action ‘stop’ places an imaginary target vehicle just before the intersection, which causes the IDM to slow down and stop at the stop line. If the ego vehicle has already passed the stop line, ‘stop’ is interpreted as maximum braking. Finally, the output of the IDM is limited to [a_min, a_max] = [−3, 1] m/s². The agent takes a new decision at every time step Δt and can therefore switch between, e.g., ‘stop’ and ‘go’ multiple times during an episode.
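A hedged sketch of how the three high-level actions could map onto the IDM. The IDM formula itself is standard; apart from the [−3, 1] m/s² output limits, all parameter values here are illustrative assumptions, not taken from the disclosure:

```python
import math

def idm(v, v_des, gap=None, dv=0.0, a_max=1.0, b=3.0, s0=2.0, T=1.0, delta=4):
    """Standard Intelligent Driver Model acceleration (illustrative params)."""
    free = 1.0 - (v / v_des) ** delta        # free-road term
    if gap is None:                          # no (real or imaginary) leader
        a = a_max * free
    else:
        # Desired gap s*; dv = v - v_leader (leader standing still: dv = v).
        s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a_max * b))
        a = a_max * (free - (s_star / gap) ** 2)
    return max(-3.0, min(1.0, a))            # clip to [a_min, a_max] m/s^2

v, v_set, d_stop = 10.0, 15.0, 30.0
a_go = idm(v, v_set)                       # 'go': ignore preceding vehicles
a_cruise = idm(v, v)                       # 'cruise': hold the current speed
a_stop = idm(v, v_set, gap=d_stop, dv=v)   # 'stop': standing target at stop line
```

With these numbers, ‘go’ accelerates, ‘cruise’ yields zero acceleration, and ‘stop’ brakes toward the imaginary standing vehicle at the stop line.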
[0043] Reward model, R: The objective of the agent is to drive through the intersection in a time efficient way, without colliding with other vehicles. A simple reward model is used to achieve this objective. The agent receives a positive reward r_goal = 10 when the ego vehicle manages to cross the intersection and a negative reward r_col = −10 if a collision occurs. If the ego vehicle gets closer to another vehicle than 2.5 m longitudinally or 1 m laterally, a negative reward r_near = −10 is given, but the episode is not terminated. At all other time steps, the agent receives a zero reward.
[0044] Transition model, T: The state transition probabilities are not known by the agent. They are implicitly defined by the simulation model described above.
[0045] Backup policy. A simple backup policy π_backup(s) is used together with the uncertainty criteria. This policy selects the action ‘stop’ if the vehicle is able to stop before the intersection, considering the braking limit a_min. Otherwise, the backup policy selects the action that is recommended by the agent. If the backup policy always consisted of ‘stop’, the ego vehicle could end up standing still in the intersection and thereby cause more collisions. Naturally, more advanced backup policies would be considered in a real-world implementation.
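The stopping test of this backup policy can be sketched with a constant-deceleration braking-distance check; the function and variable names are illustrative:

```python
def backup_policy(v, dist_to_stop_line, agent_action, a_min=-3.0):
    """Select 'stop' only if the vehicle can come to rest before the stop
    line under the braking limit a_min; otherwise defer to the agent."""
    braking_distance = v ** 2 / (2.0 * abs(a_min))  # v^2 / (2|a_min|)
    if braking_distance <= dist_to_stop_line:
        return "stop"
    return agent_action
```

At 10 m/s the truck needs about 16.7 m to stop, so 30 m before the line the backup decision is ‘stop’; at 15 m/s it needs 37.5 m, so 20 m before the line stopping is infeasible and the agent's recommendation is kept.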
[0046] Neural network architecture.
[0047] At the lower left part of the network, an input for the sample quantile τ is seen. An embedding of τ is created by setting φ(τ) = (φ_1(τ), . . . , φ_64(τ)), where φ_j(τ) = cos(πjτ), and then passing φ(τ) through a fully connected layer with 512 units. The output of the embedding is then merged with the output of the concatenating layer as the element-wise (or Hadamard) product.
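The τ-embedding can be sketched as follows. The cosine features φ_j(τ) = cos(πjτ) and the Hadamard merge follow the text; the random weight matrix and ReLU stand in for the trained fully connected layer and are assumptions:

```python
import numpy as np

def tau_embedding(tau, n_cos=64):
    # phi(tau) = (cos(pi*1*tau), ..., cos(pi*64*tau))
    j = np.arange(1, n_cos + 1)
    return np.cos(np.pi * j * tau)

rng = np.random.default_rng(3)
W = rng.normal(0.0, 0.1, (64, 512))            # placeholder dense layer
phi = np.maximum(tau_embedding(0.25) @ W, 0.0) # ReLU activation assumed
state_features = rng.normal(size=512)          # output of concatenating layer
merged = state_features * phi                  # element-wise (Hadamard) product
```

Feeding different τ values through the same network then yields different points of the return distribution from a single set of weights, which is the core idea of the implicit quantile representation.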
[0048] At the right side of the network in
[0049] Training process. Algorithm 3 was used to train the EQN agent. As mentioned above, an episode is terminated due to a timeout after at most N_max steps, since otherwise the current policy could make the ego vehicle stop at the intersection indefinitely. However, since time is not part of the state space, a timeout terminating state is not described by the MDP. Therefore, in order to make the agents act as if the episodes had no time limit, the last experience of a timeout episode is not added to the experience replay buffer. Values of the hyperparameters used for the training are shown in Table 1.
TABLE 1 Hyperparameters

Number of quantile samples N, N′, K_τ        32
Number of ensemble members K                 10
Prior scale factor β                         300
Experience adding probability p_add          0.5
Discount factor γ                            0.95
Learning start iteration N_start             50,000
Replay memory size N_replay                  500,000
Learning rate η                              0.0005
Mini-batch size |M|                          32
Target network update frequency N_update     20,000
Huber loss threshold κ                       10
Initial exploration parameter ε_0            1
Final exploration parameter ε_1              0.05
Final exploration iteration N_ε              500,000
[0050] The training was performed for 3,000,000 training steps, at which point the agents' policies had converged; the trained agents were then tested on 1,000 test episodes. The test episodes were generated in the same way as the training episodes, but they were not present during the training phase.
[0051] Results. The performance of the EQN agent has been evaluated within the training distribution, the results of which are presented in Table 2.
TABLE 2 Dense traffic scenario, tested within training distribution

                    thresholds               collisions (%)   crossing time (s)
EQN with            σ_a = ∞                  0.9 ± 0.1        32.0 ± 0.2
K = 10 and          σ_a = 3.0                0.6 ± 0.2        33.8 ± 0.3
β = 300             σ_a = 2.0                0.5 ± 0.1        38.4 ± 0.5
                    σ_a = 1.5                0.3 ± 0.1        47.2 ± 1.2
                    σ_a = 1.0                0.0 ± 0.0        71.1 ± 1.9
                    σ_a = 1.5, σ_e = 1.0     0.0 ± 0.0        48.9 ± 1.6
The EQN agent appears to unite the advantages of agents that consider only aleatoric or only epistemic uncertainty, and it can estimate both the aleatoric and epistemic uncertainty of a decision. When the aleatoric uncertainty criterion is applied, the number of situations that are classified as uncertain depends on the parameter σ_a.
[0052] The performance of the epistemic uncertainty estimation of the EQN agent is illustrated in the accompanying drawings.
[0053] The results demonstrate that the EQN agent combines the advantages of the individual components and provides a full uncertainty estimate, including both the aleatoric and epistemic dimensions. The aleatoric uncertainty estimate given by the EQN algorithm can be used to balance risk and time efficiency, by applying the aleatoric uncertainty criterion (varying the allowed variance σ_a²).
[0054] The epistemic uncertainty information provides insight into how far a situation is from the training distribution. In this disclosure, the usefulness of an epistemic uncertainty estimate is demonstrated by increasing the safety, through classifying the agent's decisions in situations far from the training distribution as unsafe and then instead applying a backup policy. Whether it is possible to formally guarantee safety with a learning-based method is an open question, and likely an underlying safety layer is required in a real-world application. The EQN agent can reduce the activation frequency of such a safety layer, but possibly even more importantly, the epistemic uncertainty information could be used to guide the training process to regions of the state space in which the current agent requires more training. Furthermore, if an agent is trained in a simulated world and then deployed in the real world, the epistemic uncertainty information can identify situations with high uncertainty, which should be added to the simulated world.
[0055] The algorithms introduced in the present disclosure include a few hyperparameters, whose values need to be set appropriately. The aleatoric and epistemic uncertainty criteria parameters, σ_a and σ_e, can both be tuned after the training is completed and allow a trade-off between risk and time efficiency.
Specific Embodiments
[0056] Having summarized the theoretical concepts underlying the invention and the empirical results confirming their effects, specific embodiments of the present invention will now be described.
[0058] The method 100 may be implemented by an arrangement 300 of the type illustrated in the drawings, which comprises processing circuitry and memory implementing the RL agent, a first and a second uncertainty estimator, and a vehicle control interface.
[0059] The method 100 begins with a plurality of training sessions 110-1, 110-2, . . . , 110-K (K ≥ 2), which may preferably be carried out in a simultaneous or at least time-overlapping fashion. In particular, each of the K training sessions may use a different neural network initiated with an independently sampled set of weights (initial value), see Algorithm 3. Each neural network may implicitly estimate a quantile of the return distribution. In a training session, the RL agent interacts with an environment which includes the autonomous vehicle (or, if the environment is simulated, a model of the vehicle). The environment may further include the surrounding traffic (or a model thereof). The k-th training session returns a state-action quantile function Z_{k,τ}(s, a) = F_Z^{−1}(τ), dependent on state and action.
[0060] A next step of the method 100 includes decision-making 112, in which the RL agent outputs at least one tentative decision (ŝ, â_l), 1 ≤ l ≤ L with L ≥ 1, relating to control of the autonomous vehicle. The decision-making may be based on a central tendency of the K neural networks, such as the mean of the state-action value functions:

π(ŝ) = argmax_a (1/K) Σ_{k=1}^{K} 𝔼_τ[Z_{k,τ}(ŝ, a)].

Alternatively, the decision-making is based on the sample-based estimate π̃(s) of the optimal policy, as introduced above.
[0061] There follows a first uncertainty estimation step 114, which is carried out on the basis of a variability measure Var_τ[𝔼_k[Z_{k,τ}(s, a)]]. As the index τ indicates, the variability captures the variation with respect to the quantile τ. It is the variability of an average 𝔼_k[Z_{k,τ}(s, a)] of the plurality of state-action quantile functions, evaluated for at least one state-action pair (ŝ, â_l) corresponding to the tentative decision, that is estimated. The average may be computed as follows:

𝔼_k[Z_{k,τ}(s, a)] = (1/K) Σ_{k=1}^{K} Z_{k,τ}(s, a).
[0062] The method 100 further comprises a second uncertainty estimation step 116 on the basis of a variability measure Var_k[𝔼_τ[Z_{k,τ}(s, a)]]. As indicated by the index k, the estimation targets the variability among ensemble members (ensemble variability), i.e., among the state-action quantile functions which result from the K training sessions, when evaluated for the state-action pairs (ŝ, â_l) corresponding to the one or more tentative decisions. More precisely, the variability of an expected value with respect to the quantile variable τ is estimated. Particular embodiments may use, rather than 𝔼_τ[Z_{k,τ}(s, a)], an approximation based on a finite number K_τ of sampled quantiles τ̃_j, as defined above.
[0063] The method then continues to vehicle control 118, wherein the at least one tentative decision (ŝ, â_l) is executed in dependence of the first and/or second estimated uncertainties. For example, step 118 may apply a rule by which the decision (ŝ, â_l) is executed only if the condition

Var_τ[𝔼_k[Z_{k,τ}(ŝ, â_l)]] < σ_a²

is true, where σ_a reflects an acceptable aleatoric uncertainty. Alternatively, the rule may stipulate that the decision (ŝ, â_l) is executed only if the condition

Var_k[𝔼_τ[Z_{k,τ}(ŝ, â_l)]] < σ_e²

is true, where σ_e reflects an acceptable epistemic uncertainty. Further alternatively, the rule may require the verification of both these conditions to release decision (ŝ, â_l) for execution; this relates to a combined aleatoric and epistemic uncertainty. Each of these formulations of the rule serves to inhibit execution of uncertain decisions, which tend to be unsafe decisions, and is therefore in the interest of road safety.
[0064] While the method 100 in the embodiment described hitherto may be said to quantize the estimated uncertainty into a binary variable—it passes or fails the uncertainty criterion—other embodiments may treat the estimated uncertainty as a continuous variable. The continuous variable may indicate how much additional safety measures need to be applied to achieve a desired safety standard. For example, a moderately elevated uncertainty may trigger the enforcement of a maximum speed limit or maximum traffic density limit, or else the tentative decision shall not be considered safe to execute.
[0065] In one embodiment, where the decision-making step 112 produces multiple tentative decisions by the RL agent (L ≥ 2), the tentative decisions are ordered in some sequence and evaluated with respect to their estimated uncertainties. The method may apply a rule that the first tentative decision in the sequence which is found to have an estimated uncertainty below the predefined threshold shall be executed. While this may imply that a tentative decision which is located late in the sequence is not executed even though its estimated uncertainty is below the predefined threshold, this remains one of several possible ways in which the tentative decisions can be “executed in dependence of” the estimated uncertainties in the sense of the claims. An advantage with this embodiment is that an executable tentative decision is found without having to evaluate all available tentative decisions with respect to uncertainty.
[0066] In a further development of the preceding embodiment, a backup (or fallback) decision is executed if the sequential evaluation does not return a tentative decision to be executed. For example, if the last tentative decision in the sequence is found to have too large uncertainty, the backup decision is executed. The backup decision may be safety-oriented, which benefits road safety. At least in tactical decision-making, the backup decision may include taking no action. To illustrate, if all tentative decisions achieving an overtaking of a slow vehicle ahead are found to be too uncertain, the backup decision may be to not overtake the slow vehicle. The backup decision may be derived from a predefined backup policy π.sub.backup, e.g., by evaluating the backup policy for the state ŝ.
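The sequential evaluation with fallback can be sketched as a first-fit scan over tentative decisions. The data and names below are illustrative; the per-decision uncertainty values would come from the first and second estimators:

```python
def select_action(tentative, sigma_a2, sigma_e2, backup):
    """Execute the first tentative decision whose aleatoric and epistemic
    uncertainties both pass their thresholds; otherwise fall back."""
    for action, var_a, var_e in tentative:
        if var_a < sigma_a2 and var_e < sigma_e2:
            return action
    return backup

# Illustrative: 'go' fails the aleatoric test, so 'cruise' is chosen.
decisions = [("go", 3.1, 0.4), ("cruise", 1.2, 0.3), ("stop", 0.5, 0.1)]
chosen = select_action(decisions, sigma_a2=2.25, sigma_e2=1.0, backup="stop")
```

If no tentative decision passes, the backup decision is returned, matching the fallback behavior described above.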
[0067] A training set S_B may be formed of those states for which the RL agent will benefit most from additional training:

S_B = {s ∈ S : RL agent not confident for some a ∈ A_s},

where A_s is the set of possible actions in state s and the property “confident” was defined above. The thresholds σ_a, σ_e appearing in the definition of “confident” represent a desired safety level at which the autonomous vehicle is to be operated. The thresholds may have been determined or calibrated by traffic testing and may be based on the frequency of decisions deemed erroneous, of collisions, near-collisions, road departures and the like. A possible alternative is to set the thresholds σ_a, σ_e dynamically, e.g., in such a manner that a predefined percentage of the state-action pairs will have an increased exposure during the additional training.
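The dynamic-threshold alternative can be concretized by a percentile rule, sketched below. The percentile rule itself is an assumption; the disclosure only states that a predefined percentage of state-action pairs should receive increased exposure.

```python
import numpy as np

def dynamic_thresholds(aleatoric_vars, epistemic_vars, exposure_pct=20.0):
    """Set the thresholds sigma_a^2 and sigma_e^2 dynamically so that roughly
    a predefined percentage of state-action pairs exceeds them (and thus gets
    increased exposure during additional training). Hypothetical concretization
    of the dynamic-threshold alternative via percentiles."""
    q = 100.0 - exposure_pct
    return np.percentile(aleatoric_vars, q), np.percentile(epistemic_vars, q)
```

With `exposure_pct=20.0`, roughly the 20% most uncertain state-action pairs end up above the thresholds.
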
[0068] The method 200 may be implemented by an arrangement 400 of the type illustrated in FIG. 4, including a training manager 422 which forms the training set S_B. The training manager 422 is configured, inter alia, to perform the first and second uncertainty estimations described above.
[0069] The method 200 begins with a plurality of training sessions 210-1, 210-2, . . . , 210-K (K≥2), which may preferably be carried out in a simultaneous or at least time-overlapping fashion. In particular, each of the K training sessions may use a different neural network initialized with an independently sampled set of weights (initial value), see Algorithm 3. Each neural network may implicitly estimate a quantile of the return distribution. In a training session, the RL agent interacts with an environment E1 which includes the autonomous vehicle (or, if the environment is simulated, a model of the vehicle). The environment may further include the surrounding traffic (or a model thereof). The kth training session returns a state-action quantile function Z_{k,τ}(s, a) = F_{Z_k}^{-1}(τ), i.e., the inverse of the cumulative distribution function of the return for the state-action pair (s, a), evaluated at quantile τ.
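An ensemble of K independently initialized quantile networks can be sketched in numpy as below. The tiny one-hidden-layer architecture and the way the quantile τ enters as an extra input are illustrative simplifications, not the disclosed network design.

```python
import numpy as np

def init_ensemble(K, state_dim, n_actions, hidden=32, seed=0):
    """Initialize K networks with independently sampled weights (one per
    training session). Minimal sketch of an implicit quantile head: each
    member maps (state, tau) to per-action quantile values Z_{k,tau}(s, a)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(K):
        W1 = rng.normal(scale=0.1, size=(hidden, state_dim + 1))  # +1 input for tau
        b1 = np.zeros(hidden)
        W2 = rng.normal(scale=0.1, size=(n_actions, hidden))
        b2 = np.zeros(n_actions)
        members.append((W1, b1, W2, b2))
    return members

def z_quantile(member, state, tau):
    """Evaluate Z_{k,tau}(s, .) for one ensemble member: one quantile value
    per possible action."""
    W1, b1, W2, b2 = member
    x = np.concatenate([state, [tau]])
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

ensemble = init_ensemble(K=5, state_dim=4, n_actions=3)
z = z_quantile(ensemble[0], np.zeros(4), 0.5)
```

The independent weight draws are what make the ensemble members disagree on rarely visited state-action pairs, which is the signal exploited by the epistemic uncertainty estimation.
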
[0070] To determine the need for additional training, the disclosed method 200 includes a first 214 and a second 216 uncertainty evaluation of at least some of the RL agent's possible decisions, which can be represented as state-action pairs (s, a). One option is to perform a full uncertainty evaluation including also state-action pairs with a relatively low incidence in real traffic. The first uncertainty evaluation 214 includes computing the variability measure Var_τ[E_k[Z_{k,τ}(s, a)]] or an approximation thereof, as described in connection with step 114 of method 100 above. The second uncertainty evaluation 216 includes computing the variability measure Var_k[E_τ[Z_{k,τ}(s, a)]] or an approximation thereof, similar to step 116 of method 100.
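Given samples Z_{k,τ}(s, a) on a grid of K ensemble members and T quantiles for one fixed state-action pair, the two variability measures can be approximated as follows (a sketch; the sample-variance approximation of the Var operators is an assumption):

```python
import numpy as np

def uncertainty_measures(Z):
    """Z has shape (K, T): ensemble member k times quantile sample tau, for
    one fixed state-action pair (s, a).

    Aleatoric proxy: Var_tau[E_k[Z_{k,tau}]] - variance over tau of the
    ensemble-mean quantile curve.
    Epistemic proxy: Var_k[E_tau[Z_{k,tau}]] - variance over members of each
    member's mean return estimate.
    """
    aleatoric = np.var(Z.mean(axis=0))  # mean over k, then variance over tau
    epistemic = np.var(Z.mean(axis=1))  # mean over tau, then variance over k
    return aleatoric, epistemic
```

When all members agree on a wide return distribution, the epistemic measure vanishes while the aleatoric one reflects the spread of the distribution, which is exactly the intended separation of the two uncertainty types.
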
[0071] The method 200 then concludes with an additional training stage 218, in which the RL agent interacts with a second environment E2 including the autonomous vehicle, wherein the second environment differs from the first environment E1 by an increased exposure to the training set S_B.
[0072] In some embodiments, the uncertainty evaluations 214, 216 are partial. To this end, more precisely, an optional traffic sampling step 212 may be performed prior to the uncertainty evaluations 214, 216. During the traffic sampling 212, the state-action pairs that are encountered in the traffic are recorded as a set S_L. Then, an approximate training set Ŝ_B = S_B ∩ S_L may be generated by evaluating the uncertainties only for the elements of S_L. The approximate training set Ŝ_B then replaces S_B in the additional training stage 218. To illustrate, Table 3 shows an uncertainty evaluation for the elements in an example S_L containing fifteen elements, where l is a sequence number.
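The partial evaluation over the recorded set S_L can be sketched as follows. Here `is_confident` is a hypothetical predicate standing in for the application of the thresholds σ_a, σ_e to the two variability measures; the recorded pairs are illustrative.

```python
def approximate_training_set(recorded_pairs, is_confident):
    """Approximate training set: states among the recorded pairs (S_L) for
    which the agent is not confident in at least one recorded action, i.e.
    an estimate of S_B intersected with S_L, obtained without a full sweep
    over all state-action pairs."""
    return {s for (s, a) in recorded_pairs if not is_confident(s, a)}

recorded = [("S1", "left"), ("S1", "right"), ("S2", "yes")]
# Hypothetical confidence predicate: only (S1, left) fails the thresholds.
confident = lambda s, a: not (s == "S1" and a == "left")
approx = approximate_training_set(recorded, confident)
```
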
TABLE 3. Example uncertainty evaluations

 l   (s_l, a_l)      Var_τ[E_k[Z_{k,τ}(s, a)]]   Var_k[E_τ[Z_{k,τ}(s, a)]]
 1   (S1, right)      1.1                         0.3
 2   (S1, remain)     1.5                         0.2
 3   (S1, left)      44                           2.2
 4   (S2, yes)        0.5                         0.0
 5   (S2, no)         0.6                         0.1
 6   (S3, A71)       10.1                         0.9
 7   (S3, A72)        1.7                         0.3
 8   (S3, A73)        2.6                         0.4
 9   (S3, A74)        3.4                         0.0
10   (S3, A75)        1.5                         0.3
11   (S3, A76)       12.5                         0.7
12   (S3, A77)        3.3                         0.2
13   (S4, stop)       1.7                         0.1
14   (S4, cruise)     0.2                         0.0
15   (S4, go)         0.9                         0.2
[0073] Here, the sets of possible actions for each state S1, S2, S3, S4 are not known. If it is assumed that the enumeration of state-action pairs for each state is exhaustive, then A_S1 = {right, remain, left}, A_S2 = {yes, no}, A_S3 = {A71, A72, A73, A74, A75, A76, A77} and A_S4 = {stop, cruise, go}. If the enumeration is not exhaustive, then {right, remain, left} ⊂ A_S1, {yes, no} ⊂ A_S2 and so forth. For an example value of the threshold σ_a² = 4.0 (applied to the third column), all elements but l = 3, 6, 11 pass. If an example threshold σ_e² = 1.0 is enforced (applied to the fourth column), then all elements but l = 3 pass. Element l = 3 corresponds to state S1, and elements l = 6, 11 correspond to state S3. On this basis, if the training set S_B is defined as all states for which at least one action belongs to a state-action pair with an epistemic uncertainty exceeding the threshold, one obtains S_B = {S1}. Alternatively, if the training set is all states for which at least one action belongs to a state-action pair with an aleatoric and/or epistemic uncertainty exceeding the threshold, then S_B = {S1, S3}. These states will be the emphasis of the additional training 218.
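The threshold logic applied to Table 3 can be reproduced directly; the numbers below are copied from the table and the thresholds σ_a² = 4.0, σ_e² = 1.0 from the text.

```python
rows = [  # (l, state, action, aleatoric Var, epistemic Var), as in Table 3
    (1, "S1", "right", 1.1, 0.3), (2, "S1", "remain", 1.5, 0.2),
    (3, "S1", "left", 44.0, 2.2), (4, "S2", "yes", 0.5, 0.0),
    (5, "S2", "no", 0.6, 0.1), (6, "S3", "A71", 10.1, 0.9),
    (7, "S3", "A72", 1.7, 0.3), (8, "S3", "A73", 2.6, 0.4),
    (9, "S3", "A74", 3.4, 0.0), (10, "S3", "A75", 1.5, 0.3),
    (11, "S3", "A76", 12.5, 0.7), (12, "S3", "A77", 3.3, 0.2),
    (13, "S4", "stop", 1.7, 0.1), (14, "S4", "cruise", 0.2, 0.0),
    (15, "S4", "go", 0.9, 0.2),
]
sigma_a2, sigma_e2 = 4.0, 1.0

# Elements failing each threshold.
fail_aleatoric = [l for (l, s, a, va, ve) in rows if va > sigma_a2]  # [3, 6, 11]
fail_epistemic = [l for (l, s, a, va, ve) in rows if ve > sigma_e2]  # [3]

# Epistemic-only definition of the training set:
S_B = {s for (l, s, a, va, ve) in rows if ve > sigma_e2}
# Aleatoric and/or epistemic definition:
S_B_either = {s for (l, s, a, va, ve) in rows if va > sigma_a2 or ve > sigma_e2}
```
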
[0074] In still other embodiments of the method 200, the training set S_B may be taken to include all states s ∈ S for which the mean epistemic variability over the possible actions A_s exceeds the threshold σ_e². This may be a proper choice if it is deemed acceptable for the RL agent to have minor points of uncertainty as long as the bulk of its decisions are relatively reliable. Alternatively, the training set S_B may be taken to include all states s ∈ S for which the mean sum of aleatoric and epistemic variability over the possible actions A_s exceeds the sum of the thresholds σ_a² + σ_e².
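The mean-variability variant can be sketched as below, reusing a few Table 3 rows. With σ_e² = 1.0 the example yields an empty training set, since no state's mean epistemic variability exceeds the threshold, illustrating the tolerance for minor points of uncertainty.

```python
from collections import defaultdict

def training_set_by_mean_epistemic(rows, sigma_e2):
    """Variant: include state s when the MEAN epistemic variability over its
    possible actions A_s exceeds sigma_e2 (rather than any single action).
    `rows` holds (state, action, aleatoric, epistemic) tuples."""
    total, count = defaultdict(float), defaultdict(int)
    for (s, a, va, ve) in rows:
        total[s] += ve
        count[s] += 1
    return {s for s in total if total[s] / count[s] > sigma_e2}

rows = [("S1", "right", 1.1, 0.3), ("S1", "remain", 1.5, 0.2),
        ("S1", "left", 44.0, 2.2), ("S2", "yes", 0.5, 0.0),
        ("S2", "no", 0.6, 0.1)]
# S1 mean epistemic = (0.3 + 0.2 + 2.2) / 3 = 0.9, below sigma_e2 = 1.0,
# so the resulting training set is empty for this data.
```
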
[0075] The aspects of the present disclosure have mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the invention, as defined by the appended patent claims.