Method and System for Checking an Automated Driving Function by Reinforcement Learning
20220396280 · 2022-12-15
Inventors
CPC classification
B60W2050/0028
PERFORMING OPERATIONS; TRANSPORTING
B60W60/001
B60W50/06
International classification
Abstract
A method for checking an automated driving function by reinforcement learning includes providing at least one specification of an automated driving function; generating a scenario, the scenario being specified by a first set of parameters; and determining a reward function such that the reward is greater in the event in which the scenario fails to meet the at least one specification in a simulation than in the event in which the scenario meets the at least one specification in the simulation.
Claims
1.-8. (canceled)
9. A method for checking an automated driving function by reinforcement learning, the method comprising: providing at least one specification of an automated driving function; generating a scenario, the scenario being indicated by a first parameter set; and ascertaining a reward function such that a reward is higher in a case in which the scenario does not meet the at least one specification in a simulation than in a case in which the scenario meets the at least one specification in the simulation.
10. The method according to claim 9, wherein the reward function is ascertained by using a rule-based model.
11. The method according to claim 10, wherein the rule-based model describes a controller of the vehicle for the automated driving function, the controller being a model of the vehicle controlled by way of the automated driving function.
12. The method according to claim 9, further comprising: generating a second parameter set, which indicates a modification of the first parameter set.
13. The method according to claim 12, further comprising: ascertaining an estimate of a value of the reward function for a specific scenario by using a rule-based model in a simulation; generating a further scenario in accordance with a third parameter set, the third parameter set being determined based on the second parameter set and an estimated parameter set, which maximizes an estimate based on the rule-based model; and ascertaining the reward function such that the reward is higher in a case in which the estimate of the value of the reward function is lower for a scenario in a simulation than an actual value of the reward function.
14. The method according to claim 13, wherein the further scenario is generated in accordance with the third parameter set by using an inequality constraint that excludes certain scenarios, or a projection of the third parameter set onto a set of determined scenarios.
15. A computer program product comprising a non-transitory computer readable medium having stored thereon program code which, when executed on one or more processors, carries out the acts of: providing at least one specification of an automated driving function; generating a scenario, the scenario being indicated by a first parameter set; and ascertaining a reward function such that a reward is higher in a case in which the scenario does not meet the at least one specification in a simulation than in a case in which the scenario meets the at least one specification in the simulation.
16. A system for checking an automated driving function by reinforcement learning, the system comprising a processor unit configured to carry out a method comprising: providing at least one specification of an automated driving function; generating a scenario, the scenario being indicated by a first parameter set; and ascertaining a reward function such that a reward is higher in a case in which the scenario does not meet the at least one specification in a simulation than in a case in which the scenario meets the at least one specification in the simulation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE DRAWINGS
[0032] Unless stated otherwise, identical reference signs are used below for identical and identically acting elements.
[0034] The vehicle 100 comprises the driving assistance system 110 for automated driving. During automated driving, the longitudinal and transverse guidance of the vehicle 100 is performed automatically. The driving assistance system 110 thus undertakes the vehicle guidance. To this end, the driving assistance system 110 controls the drive 20, the transmission 22, the hydraulic service brake 24 and the steering 26 by way of intermediate units, not shown.
[0035] To plan and perform the automated driving, the driving assistance system 110 uses surroundings information from a surroundings sensor system that observes the vehicle surroundings. In particular, the vehicle can comprise at least one environment sensor 12 that is set up to capture environment data indicating the vehicle surroundings. The at least one environment sensor 12 can comprise a LiDAR system, one or more radar systems and/or one or more cameras, for example.
[0036] It is an aim of the present disclosure to take an automatically verifiable specification and a continuous virtual simulation environment for an autonomous or automated driving function as a basis for learning how scenarios that falsify the function can be generated efficiently.
[0037] In one example, an ACC (adaptive cruise control) function is considered. The ACC function is set up to maintain a safety distance from a vehicle travelling ahead. A time gap t.sub.h, defined as t.sub.h=h/v, where h denotes the distance from the vehicle travelling ahead and v the ego vehicle's velocity, can be used to formalize the ACC requirements as follows: [0038] Two possible modes: setpoint velocity mode and time interval mode;
[0039] In the setpoint velocity mode, a velocity v.sub.d predefined or desired by the driver, i.e. v.sub.d ∈[v.sub.d,min; v.sub.d,max], is supposed to be maintained.
[0040] In the time interval mode, a time gap t.sub.h, i.e. t.sub.h ∈[t.sub.h,min; t.sub.h,max], from a vehicle travelling ahead is supposed to be maintained.
[0041] The system is in the setpoint velocity mode when v.sub.d≤h/t.sub.d, otherwise the system is in the time interval mode. Moreover, the acceleration of the vehicle must comply at all times with a.sub.c∈[a.sub.c,min; a.sub.c,max].
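The mode logic and acceleration bound above can be sketched in Python. All function names and the numeric acceleration limits are illustrative assumptions; the source only fixes the relations t.sub.h=h/v, the mode-switching condition and the requirement a.sub.c ∈[a.sub.c,min; a.sub.c,max]:

```python
def time_gap(h, v):
    """Time gap t_h = h / v for distance h to the lead vehicle at ego speed v."""
    return h / v

def acc_mode(v_d, h, t_h_d):
    """Setpoint-velocity mode when the desired speed v_d can be held without
    undercutting the desired time gap t_h_d; otherwise time-interval mode.
    The subscript naming (t_h_d) is an assumption for the condition v_d <= h / t."""
    return "setpoint" if v_d <= h / t_h_d else "time_interval"

def acceleration_ok(a_c, a_c_min=-3.0, a_c_max=2.0):
    """The acceleration must comply with a_c in [a_c_min, a_c_max] at all times.
    The numeric bounds here are placeholders, not values from the source."""
    return a_c_min <= a_c <= a_c_max
```
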
[0042] According to the embodiments of the present disclosure, reinforcement learning (RL) is used, and in particular a reinforcement-learning-based adversarial agent (denoted by Agent in the figures). The RL agent learns to generate scenarios that maximize a specific reward. Since the aim of the agent is to falsify the driving function, the reward function is designed such that the agent receives a high reward if the scenario leads to a violation of the specification and a low reward if the autonomous driving function operates according to the specification.
[0043] The agent repeatedly observes the state of the system s, which comprises all the variables relevant to the given specification. Based on the state, the agent performs an action a according to its learnt policy and receives a corresponding reward R(s,a). The action consists of a finite set of scenario parameters. The agent adapts its policy over the course of time in order to maximize its reward.
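The observe-act-reward loop described above can be sketched as follows. This is a minimal stand-in: a greedy record-keeper replaces a real policy-update rule, and the names `step` and `adversarial_search` are assumptions, not from the source:

```python
def adversarial_search(step, policy, s0, iterations=100):
    """Minimal sketch of the adversarial RL loop: the agent observes the
    state s, proposes scenario parameters a via its policy, and the
    highest-reward (most specification-violating) scenario found is kept.
    'step' runs one simulation pass and returns (next_state, reward)."""
    s, best_a, best_r = s0, None, float("-inf")
    for _ in range(iterations):
        a = policy(s)          # action = finite set of scenario parameters
        s, r = step(s, a)      # simulate and evaluate R(s, a)
        if r > best_r:
            best_a, best_r = a, r
    return best_a, best_r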
[0044] The output from the RL agent is a scenario parameter set a, which comprises e.g. an initial vehicle velocity, the desired velocity, the initial time gap and a velocity profile of the vehicle, which is coded by a finite time series of velocity segments v.sub.f(t.sub.i), where t.sub.i ∈{t.sub.0, t.sub.1, . . . , t.sub.n}. An initial parameter set a.sub.0 is used to begin with and a corresponding initial environment state s.sub.0 is calculated. The state s.sub.t includes all the variables that are relevant to the examination of compliance with the specifications, e.g. minimum and maximum acceleration, minimum and maximum distance from the vehicle in front, minimum and maximum time gap, minimum and maximum velocity, etc. All of the above specification requirements can then be either directly recorded or numerically approximated by an inequality in the form A[s; a]−b≤0.
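Stacking the state and action into one vector z=[s; a], the specification check A[s; a]−b≤0 can be evaluated row by row. The matrix A, vector b and the state layout are left abstract in the source; the sketch below assumes plain nested lists:

```python
def spec_residuals(A, z, b):
    """Left-hand sides x_i of the rows of A z - b <= 0 for z = [s; a]."""
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) - b_i
            for row, b_i in zip(A, b)]

def violates_spec(A, z, b):
    """The scenario falsifies the specification if any row is positive."""
    return any(x > 0 for x in spec_residuals(A, z, b))
```
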
[0045] The input of the RL-based agent is the environment state s.sub.t at the time t and the outputs are the modified scenario parameters a.sub.t+1 for the next pass. The reward function is selected e.g. as R(s,a)=Σ.sub.x max(0,(exp(x)−1)), where x denotes the value of an arbitrary row of the left-hand side of the inequality A[s; a]−b≤0 for the specification. This guarantees that the reward is large only when the agent has found a scenario that violates the specification.
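The example reward R(s,a)=Σ.sub.x max(0,(exp(x)−1)) is zero whenever every row of the inequality is satisfied (x≤0) and grows exponentially with the size of a violation. A direct transcription, with the residuals passed in as a list:

```python
import math

def reward(residuals):
    """R(s, a) = sum over rows x of A[s; a] - b of max(0, exp(x) - 1).
    Satisfied rows (x <= 0) contribute exactly 0; violated rows (x > 0)
    contribute exp(x) - 1 > 0, so the reward is positive only for
    specification-violating scenarios."""
    return sum(max(0.0, math.exp(x) - 1.0) for x in residuals)
```
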
[0046] General RL approaches suffer from slow convergence and high variance: it can take millions of iterations to learn complex tasks, and each iteration can be cost-intensive. Even more important is the fact that the variation between learning passes can be very high, which means that some passes of an RL algorithm are successful while others fail on account of chance happenings during initialization and sampling. This high variability of the learning can be a significant obstacle to applying RL. The problem becomes even greater in large parameter spaces.
[0047] The aforementioned problems can be alleviated by introducing prior knowledge about the process, which knowledge can be modelled in an appropriate manner by an inequality g(s.sub.t, a.sub.t)≤0 that excludes scenarios that violate the specification in a trivial way, i.e. it ensures e.g. that the vehicle starts in a nonviolating (safe) state. This inequality is incorporated into the learning process either as a regularization expression in the reward function or as an output limitation for the neural network in order to focus the learning progress. Any RL method compatible with continuous variables, e.g. policy gradient methods or actor-critic methods, can be used for the RL agent.
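One of the two incorporation options named above, the regularization expression in the reward, can be sketched as a penalty on violations of g(s,a)≤0. The penalty form and the weight are assumptions for illustration; the source does not fix them:

```python
import math

def regularized_reward(residuals, g_value, penalty=10.0):
    """Specification reward from the residuals of A[s; a] - b <= 0, minus
    a penalty when the prior-knowledge constraint g(s, a) <= 0 is violated
    (g_value > 0). This steers the agent away from trivially violating
    scenarios, e.g. unsafe initial states. Penalty form/weight are placeholders."""
    base = sum(max(0.0, math.exp(x) - 1.0) for x in residuals)
    return base - penalty * max(0.0, g_value)
```
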
[0048] Even if the method described above can be used to exclude many parameterizations that infringe the specification in a trivial manner, it still takes a considerable number of passes, which can take up to several days, before scenarios of interest are generated by the RL agent. Even more prior knowledge can therefore be incorporated in order to speed up the learning process.
[0050] The method 300 comprises, in block 310, providing at least one specification of an automated driving function; in block 320, generating a scenario, the scenario being indicated by a first parameter set; and in block 330, ascertaining a reward function in such a way that the reward is higher in a case in which the scenario does not meet the at least one specification in a simulation than in a case in which the scenario meets the at least one specification in the simulation, the reward function being ascertained using a rule-based model.
[0051] Irrespective of the algorithm actually used in the autonomous or automated vehicle, it is assumed that the vehicle is controlled by a traditional (rule-based) control system and the driving dynamics are described by a simple analytical model, both of which can be captured by the difference equation x.sub.k+1=f.sub.k(x.sub.k, s.sub.t, a.sub.t), where x.sub.k denotes the state of the vehicle over the execution time. On the basis of this, the following optimization problem can be formulated for the current environment state s.sub.t:
[0052] Solving this yields a new parameter set a.sub.est and an estimate of the maximum reward R.sub.est,max. If the optimization problem is not convex (which is often the case), it is possible to resort to a convex relaxation or other approximation methods.
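The exact optimization problem is not reproduced in this text. As a stand-in under that caveat, the step "find a.sub.est maximizing the rule-based reward estimate" can be sketched as a search over candidate parameter sets, with the rule-based simulation wrapped in a reward-estimate callable (all names here are assumptions):

```python
def estimate_best_params(reward_estimate, candidates):
    """Pick a_est maximizing the rule-based reward estimate R_est over a
    finite candidate set. This brute-force search stands in for the
    (possibly convex-relaxed) optimization problem, whose exact form
    is not given here; 'reward_estimate' rolls out the rule-based
    controller model and scores the resulting trajectory."""
    a_est = max(candidates, key=reward_estimate)
    return a_est, reward_estimate(a_est)
```
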
[0053] An RL agent then receives the state s.sub.t and the RL agent reward
R.sub.nn=|R(s.sub.t,a.sub.t)−R.sub.est|.sub.n, n∈{1,2}
in parallel and generates a new parameter set a.sub.nn. In this way, the RL agent only has to learn the difference between the rule-based control behavior and the actual system, not the whole system, and can generate a corresponding modification a.sub.nn. Finally, the new parameter set for the next execution is stipulated as a.sub.t+1=a.sub.est+a.sub.nn. In order to avoid an initialization in an unsafe state, the method described above can be used to approximate prior knowledge by way of an inequality g(s.sub.t, a.sub.est)≤0.
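The residual reward R.sub.nn=|R(s.sub.t,a.sub.t)−R.sub.est|.sub.n and the combined update a.sub.est+a.sub.nn can be transcribed directly. The element-wise vector layout of the parameter sets is an assumption:

```python
def agent_reward(r_actual, r_est, n=1):
    """R_nn = |R(s_t, a_t) - R_est|^n with n in {1, 2}: the RL agent is
    rewarded where the rule-based estimate and the true simulated reward
    disagree, i.e. it learns only the model/reality residual."""
    return abs(r_actual - r_est) ** n

def next_params(a_est, a_nn):
    """New parameter set for the next execution: a_est from the rule-based
    optimization plus the learned correction a_nn (element-wise)."""
    return [e + c for e, c in zip(a_est, a_nn)]
```
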
[0055] The method comprises generating a second parameter set, which indicates a modification of the first parameter set, and generating a further scenario in accordance with a third parameter set, the third parameter set being determined on the basis of the second parameter set and by using the rule-based model.
[0056] In some embodiments, the further scenario is generated in accordance with the third parameter set by using an inequality constraint that excludes certain scenarios, for example
|a.sub.nn−a.sub.est|<a.sub.threshold
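One simple realization of this constraint, assuming element-wise parameter vectors, is to clip the learned correction so it never deviates from the rule-based estimate by more than the threshold (the function name and the clipping approach are illustrative, not from the source):

```python
def clip_correction(a_nn, a_est, a_threshold):
    """Enforce |a_nn - a_est| < a_threshold element-wise by clipping the
    agent's proposal a_nn back towards the rule-based estimate a_est;
    one way to realize the inequality constraint as an output limitation."""
    return [e + max(-a_threshold, min(a_threshold, n - e))
            for n, e in zip(a_nn, a_est)]
```
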
[0057] The present disclosure is not limited to this inequality constraint, however, and a generalized optimization problem can be used, which can be described as follows:
[0058] Here, a suitable verification input û.sub.t is selected in accordance with a specific scenario class, for example in order to prevent a collision with a vehicle travelling ahead.
[0059] According to an embodiment of the invention, the reward function is ascertained by using the rule-based model, for example. In particular, the RL agent learns to generate scenarios that maximize a reward and reflect a violation of the specification of the driving function. The learning can thus be speeded up by including available prior knowledge in the training process. This allows the automated driving function to be efficiently falsified in order to reveal weaknesses in the automated driving function.