METHOD AND COMPUTER SYSTEM FOR MULTI-LEVEL CONTROL OF MOTION ACTUATORS IN AN AUTONOMOUS VEHICLE

20250091616 ยท 2025-03-20

    Inventors

    Cpc classification

    International classification

    Abstract

    A computer system for controlling at least one motion actuator in an autonomous or semi-autonomous vehicle, the computer system comprising processing circuitry implementing a feedback controller, which is configured to sense an actual motion state of the vehicle and determine a machine-level instruction to the motion actuator for approaching or maintaining a setpoint motion state, and a reinforcement-learning agent, which is trained to perform decision-making regarding the setpoint motion state. The decisions by the reinforcement-learning agent are applied as the setpoint motion state of the feedback controller, and the machine-level instruction from the feedback controller is applied to the motion actuator.

    Claims

    1. A computer-implemented method of controlling at least one motion actuator in an autonomous or semi-autonomous vehicle, comprising: providing a feedback controller configured to sense an actual motion state of the vehicle and determine a machine-level instruction to the motion actuator for approaching or maintaining a setpoint motion state; providing a reinforcement-learning (RL) agent trained to perform decision-making regarding the setpoint motion state; applying decisions by the RL agent as the setpoint motion state of the feedback controller; and applying the machine-level instruction to the motion actuator; wherein at least one of the steps of the method is performed using processing circuitry of a computer system.

    2. The method of claim 1, wherein the feedback setpoint motion state is represented as a continuous variable.

    3. The method of claim 1, wherein the RL agent is trained to perform tactical decision-making regarding the setpoint motion state.

    4. The method of claim 1, wherein the feedback controller is configured to control at least one longitudinal motion actuator.

    5. The method of claim 4, further comprising providing a second feedback controller configured to control at least one lateral motion actuator; wherein the RL agent is trained to perform joint decision-making regarding a setpoint motion state of the longitudinal motion actuator and regarding a setpoint motion state of the lateral motion actuator.

    6. The method of claim 4, wherein the feedback controller includes an adaptive cruise controller (ACC) and the setpoint motion state of the longitudinal motion actuator is a setpoint time-to-collision (TTC).

    7. The method of claim 6, wherein: the second feedback controller includes a lane-change assistant and the setpoint motion state of the lateral motion actuator is a setpoint lane; and the RL agent is trained to perform joint decision-making regarding the setpoint TTC and regarding the setpoint lane.

    8. The method of claim 1, wherein the RL agent is trained to perform decision-making based on a state of the vehicle and/or of vehicles surrounding the vehicle which is not included in the vehicle's actual motion state sensed by the feedback controller.

    9. The method of claim 1, wherein the RL agent is configured with one of the following learning algorithms: deep Q network (DQN); advantage actor critic (A2C); proximal policy optimization (PPO).

    10. The method of claim 1, wherein the RL agent has been trained to perform decision-making in such manner as to minimize a total cost of operation (TCOP).

    11. A computer program product comprising program code for performing, when executed by processing circuitry of a computer system, the method of claim 1.

    12. A non-transitory computer-readable storage medium comprising instructions, which when executed by processing circuitry of a computer system, cause the processing circuitry to perform the method of claim 1.

    13. A computer system for controlling at least one motion actuator in an autonomous or semi-autonomous vehicle, the computer system comprising processing circuitry implementing: a feedback controller, which is configured to sense an actual motion state of the vehicle and determine a machine-level instruction to the motion actuator for approaching or maintaining a setpoint motion state; and a reinforcement-learning (RL) agent trained to perform decision-making regarding the setpoint motion state; wherein the processing circuitry is configured to: apply decisions by the RL agent as the setpoint motion state of the feedback controller; and apply the machine-level instruction to the motion actuator.

    14. A vehicle comprising the computer system of claim 13.

    15. The computer system of claim 13, wherein the feedback controller is configured for a setpoint motion state represented as a continuous variable.

    16. The computer system of claim 13, wherein the RL agent is trained to perform tactical decision-making regarding the setpoint motion state.

    17. The computer system of claim 13, wherein the feedback controller is configured to control at least one longitudinal motion actuator.

    18. The computer system of claim 17, further comprising a second feedback controller configured to control at least one lateral motion actuator; wherein the RL agent is trained to perform joint decision-making regarding a setpoint motion state of the longitudinal motion actuator and regarding a setpoint motion state of the lateral motion actuator.

    19. The computer system of claim 17, wherein the feedback controller includes an adaptive cruise controller (ACC) and the setpoint motion state of the longitudinal motion actuator is a setpoint time-to-collision (TTC).

    20. The computer system of claim 19, wherein: the second feedback controller includes a lane-change assistant and the setpoint motion state of the lateral motion actuator is a setpoint lane; and the RL agent is trained to perform joint decision-making regarding the setpoint TTC and regarding the setpoint lane.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0025] Aspects and embodiments are now described, by way of example, with reference to the accompanying drawings, on which:

    [0026] FIG. 1A illustrates a baseline vehicle control architecture;

    [0027] FIG. 1B a vehicle control architecture according to embodiments herein;

    [0028] FIG. 2 is a flow chart of a method controlling at least one motion actuator in an autonomous vehicle, according to embodiments herein;

    [0029] FIG. 3 is a top view of a multi-lane driving environment, in which some geometric quantities are introduced;

    [0030] FIG. 4 is a block diagram of a computer system;

    [0031] FIG. 5 is a comparison of episode rewards in the baseline vehicle control architecture and an improved variant of the baseline vehicle control architecture by which the state space S has been extended with a leading-vehicle distance; and

    [0032] FIGS. 6A-6C show comparisons of average episode reward in the baseline vehicle control architecture and the novel vehicle control architecture proposed herein, in respective implementations where DQN (FIG. 6A), A2C (FIG. 6B) and PPO (FIG. 6C) is used.

    DETAILED DESCRIPTION

    I. Outline

    [0033] The detailed description set forth below provides information and examples of the disclosed technology with sufficient detail to enable those skilled in the art to practice the disclosure.

    [0034] In some aspects of the present disclosure, the inventors develop ACC together with lane change maneuvers for an autonomous truck in a highway scenario. The inventors first develop a baseline architecture for tactical decision-making and control based on the previous work disclosed in (C.-J. Hoel, Wolff, and Laine 2020), (C.-J. E. Hoel 2020) and WO2021213616A1. It is recalled that the decision-making task of an autonomous vehicle is commonly divided into strategic, tactical and operational decision-making, also called navigation, guidance and stabilization. In short, tactical decisions refer to high-level, often discrete, decisions, such as when to change lanes on a highway, or whether to stop or go at an intersection. Here, the decision-making related to speed control and lane changes is done by a reinforcement learning (RL) agent.

    [0035] Reinforcement learning is a branch of machine learning where an agent acts in an environment and tries to learn a policy that maximizes a cumulative reward function. Reference is made to the textbook R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2.sup.nd ed., MIT Press (2018). The policy (s) defines which action a to take in each state s. When an action is taken, the environment transitions to a new state s and the agent receives a reward r. The reinforcement learning problem can be modeled as a Markov Decision Process (MDP), or possibly as a partially observable MDP (POMDP) approximated as MDP, which is defined by the tuple (custom-character; T; R; ), where custom-character is the state space, custom-character is the action space, T is a state transition model (or evolution operator), R is a reward model, and is a discount factor. This model can also be considered to represent the RL agent's interaction with the training environment. At every time step t, the goal of the agent is to choose an action a that maximizes the discounted return,

    [00001] R t = .Math. k = 0 k r t + k .

    In Q-learning, the agent tries to learn the optimal action-value function Q*(s,a), which is defined as

    [00002] Q * ( s , a ) = max [ R t | s t = s , a t = a , ] .

    From the optimal action-value function, the policy is derived as per

    [00003] ( s ) = arg max a Q * ( s , a ) .

    [0036] Starting from the baseline, RL-based vehicle control architecture according to Hoel et al., the inventors show how adding relevant features to the observation space improves the performance of the agent. Then, the inventors report of training a refined architecture where the high-level decision-making and low-level control actions are separated between the RL agent and low-level controllers based on physical models. Compared to the existing architectures in the literature, the RL agent in the refined architecture only needs to learn a policy to choose high level actions, such as an inter-vehicle gap to be maintained. Interestingly, the inventors find that this optimized action space improves the learning compared to speed control actions. Finally, the inventors design a realistic reward function based on TCOP, which includes explicit costs for energy consumption and human resources.

    [0037] The experimental results are obtained using Simulation of Urban Mobility (SUMO) (Lopez et al. 2018), a simulation platform for the conceptual development of autonomous vehicles. Source code for the tailored RL environment based on SUMO, which the inventors have developed for heavy vehicle combination driving in highway traffic, will be published later (Pathare 2023).

    II. Vehicle and Traffic Model

    [0038] In the present disclosure, the inventors consider a dynamic highway traffic environment with multiple lanes 301, 302, 303, in which an autonomous or semi-autonomous truck (ego vehicle) 100 and passenger cars (surrounding vehicles) 192 move, as shown in FIG. 3. Lane changes between adjacent pairs of the parallel lanes 301, 302, 303 is permitted. In a per se known manner, the ego vehicle 100 is equipped with technical components for applying linear or angular accelerations on the vehicle, such as electric or combustion-based drive motors, regenerative or dissipative brakes, steering arrangements. At least some of the components form part of a drivetrain of ego vehicle 100. Components with these characteristics may be referred to as motion actuators. In addition to motion actuators and other actuators, the ego vehicle 100 may include internal sensors for sensing various conditions of the vehicle itself and external sensors for sensing positions and behavior of surrounding vehicles. The dashed rectangle 310 in FIG. 3 represents the approximate range of the external sensors of the ego vehicle 100.

    [0039] The maximum speed of the ego vehicle is set to be 25 m/s. Furthermore, fifteen passenger cars with speeds between 15 m/s and 35 m/s are simulated based on the default Krauss car-following model (Krauss, Wagner, and Gawron 1997) and further based on the LC2013 lane change model (Erdmann 2014) in SUMO. The initial position and speed pattern of the surrounding vehicles is set randomly, which makes the surrounding traffic nondeterministic throughout the simulation. Starting from the initial position (which will be 800 m after initialization steps in SUMO), the ego vehicle is expected to reach the target 320 set at a distance of 3000 m. This means that the ego vehicle 100 has to drive 2200 m on the highway in each episode. An episode will also terminate if a hazardous situation such as a collision or driving outside the road happens, or if a maximum of 500 RL steps are executed. The observation or state of the environment at each step contains information about position, speed, and state of left/right indicators for ego vehicle and vehicles within the sensor range. In the simulations, a sensor range 310 of 200 m is assumed.

    [0040] In the simulations, a vehicle model with the following the observables of the ego vehicle 100 is used: [0041] 1. Longitudinal speed (v.sub.x0) [0042] 2. Lane change state (positive or negative sign of the lateral speed, sgn(v.sub.y0)) [0043] 3. Lane number [0044] 4. State of left turning indicator (on or off) [0045] 5. State of right turning indicator [0046] 6. Distance to leading vehicle (d.sub.lead)
    Each of the surrounding vehicles, to the extent it is located in the sensor range 310, is modelled with a different set of observables. For the i.sup.th vehicle, the model includes: [0047] 1. Relative longitudinal distance from ego vehicle (d.sub.xi) [0048] 2. Relative lateral distance from ego vehicle (d.sub.yi) [0049] 3. Relative longitudinal speed with respect to ego vehicle (v.sub.xiv.sub.x0) [0050] 4. Lane change state (positive or negative sign of the lateral speed sgn (v.sub.yi)) [0051] 5. Lane number [0052] 6. State of left turning indicator [0053] 7. State of right turning indicator

    [0054] Baseline vehicle control architecture. The baseline architecture in the present disclosure is inspired by (C.-J. Hoel, Wolff, and Laine 2020) and is illustrated in FIG. 1A. Here, the decision-making is done solely by an RL agent 130, with a frequency of 1 second. The driving environment 190 may be a simulated or real environment, which is represented as state x. The action space for the RL agent 130 is discrete. The actions indicated by the control signal u include longitudinal actions and lateral actions. Longitudinal actions are speed changes of 0, 1, 1 or 4 m/s. Lateral actions are to stay on lane, change to left lane, or change to right lane.

    [0055] In the present disclosure, the inventors compare the performance of this baseline architecture with the new architecture described below.

    [0056] New vehicle control architecture. The new vehicle control architecture proposed herein includes a feedback controller 120 and an RL agent 130. The feedback controller 120 is configured to sense an actual motion state x of the vehicle and to determine a machine-level instruction u to the motion actuator for approaching a setpoint motion state x*. The RL agent 130 for its part is trained to perform decision-making regarding the setpoint motion state x*, and the value x* thus decided is applied as setpoint motion state of the feedback controller 120.

    [0057] FIG. 1B shows an example where the new vehicle control architecture is applied to an automated truck driving framework. There are three main components: a high-level decision-making agent 130 based on RL, a low-level feedback controller 120 for longitudinal control (i.e., control of one or more longitudinal motion actuators) and a low-level controller 122 for lateral control. The RL agent chooses, based on the current state of the SUMO environment, a longitudinal action or a lateral action. The action space includes the following actions: [0058] 1. Set short time gap with leading vehicle (1s) [0059] 2. Set medium time gap with leading vehicle (2s) [0060] 3. Set long time gap with leading vehicle (3s) [0061] 4. Increase desired speed by 1 m/s [0062] 5. Decrease desired speed by 1 m/s [0063] 6. Maintain current desired speed and time gap [0064] 7. Change lane to left [0065] 8. Change lane to right

    [0066] The longitudinal action can be one of the following: set desired time gap (short, medium, or long), increment or decrement desired speed, or maintain the current time gap and desired speed. When the RL agent chooses one of these actions, the longitudinal controller will compute the acceleration of the ego vehicle based on an Intelligent Driver Model (IDM; see Treiber et al. 2000) given by

    [00004] v . = d v d t = a ( 1 - ( v v 0 ) - ( s * ( v , v ) s ) 2 ) with s * ( v , v ) = s 0 + v a T + v a v a 2 a b

    where the index refers to the ego vehicle 100 and 1 is the leading vehicle ahead of the ego vehicle 100. The real-valued exponent 0 may be used as a tuning parameter, with a default value of =4. Letting v.sub.i denote the velocity, x.sub.i the longitudinal position, and l.sub.i the length of the vehicle, one obtains the net distance s.sub.:=x.sub.-1x.sub.l.sub.-1 and the velocity difference v.sub.:=v.sub.v.sub.-1. Among the further model parameters, v.sub.0 represents desired velocity, s.sub.0 minimum spacing, T desired time gap (TTC), a maximum acceleration, and b a desired maximum braking deceleration.

    [0067] Here IDM uses the latest desired speed and time gap set by the RL agent. It computes a new acceleration for the truck and sets the resulting speed in SUMO every 0.1 s. This process continues for a total duration of 1 s, after which the RL agent chooses the next high-level action. Note that in the new architecture, the ACC mode is always activated for the truck by construction.

    [0068] The lateral action is to change the truck's course to left or right lane. When the RL agent chooses a lateral action, the lateral controller initiates the lane change. Lane change is performed using the default LC2013 lane change model (Erdmann 2014) in SUMO. The lane width is set to 3.2 m, and the lateral speed of the truck is set to 0.8 m/s. Hence, in this example, it takes a total of 4 s to complete a lane change action, which corresponds to 40 SUMO time steps of 0.1 s duration each. Following this, the RL agent 130 chooses the next high-level action.

    [0069] Reward functions. In order to conduct the experiments, the inventors have formulated two reward functions.

    [0070] The basic reward function is designed so as to motivate the RL agent 130 to have the vehicle 100 drive at maximum speed and reach the target 320 without getting into any hazardous situations. This reward function is similar to that in (C.-J. Hoel, Wolff, and Laine 2020) except that a reward for reaching the target is added, whereas the penalty for emergency braking is not considered for simplicity. The reward at each time step is given by

    [00005] r ( t ) = v t max_v - I l P l - I c P c - I n c P n c - I o P o + I tar R tar T

    Here v.sub.t is the velocity of the vehicle at time step t and max_v is the maximum velocity of the vehicle. I is an indicator function, which takes a value of 1 when the corresponding condition is satisfied, and P and R are the corresponding penalty and reward values respectively. The possible conditions are lane change l, collision c, near collision nc, driving outside the road o and reaching the target tar. T is the total time it takes to reach the target 320. The parameter values used are given in Table 1.

    [0071] An enhanced version of the reward function is designed to provide the agent with more realistic goals similar to those of a human truck driver. In particular, the enhanced reward function may be considered to represent a total cost of operation (TCOP). As used herein TCOP may include both costs, rewards and penalties, and it may assume positive or negative values. Here, the inventors consider realistic cost or revenue values (in monetary units) for each component to explore whether the agent can learn a safe and cost-efficient driving strategy with this reward function. The reward at each time step is given by:

    [00006] r ( t ) = - C e l e t - C d r t - I l P l - I c P c - I n c P n c - I o P o + I tar R tar W tar ( 1 )

    In this expression, C.sub.el is the per-unit cost of electric energy, et is the electric energy used at time step t, C.sub.dr is the driver cost and t is the duration of a time step. The electric energy used during a time step t may be calculated as

    [00007] e t = f t v t t , ( 2 a )

    where f.sub.t, the force applied by the motion actuators at time step t, is given by

    [00008] f t = m a t + 1 2 C d A f air v 2 + m g C r + m g sin , ( 2 b )

    where m is the mass of the vehicle, C.sub.d is the coefficient of air drag, A.sub.f is the frontal area, .sub.air is the air density, C.sub.r is the coefficient of rolling resistance, g is the acceleration due to gravity and a is the acceleration of the vehicle at time step t. The inclination angle (slope) of the road is set to 0 in the simulations. The parameter values used are given in Table 1.

    [0072] In the reward function (1), the penalties are defined as the average cost incurred during hazardous situations. The average cost considered here is the own risk payment which is the portion of an insurance claim made by the vehicle owner or the insured for any loss and or damage that occurs when submitting a claim. Similarly, the reward R.sub.tar denotes the revenue that can be achieved by the truck when it reaches the target 320 (i.e., completes an imaginary transport mission). Revenue is computed as the total expected cost with 20% profit in an ideal scenario where the truck drives with an average speed of 22 m/s and zero acceleration. The total length of travel is 2200 meters and hence the time to reach the target will be 100 s. Based on (2) and parameter values from Table 1, the total energy consumed can be computed as 1.85 kWh. Then, the total energy cost will be 1.85 kWh0.5=0.925 monetary units. The total driver cost for 100 s will be 1.39 units, and the total cost becomes 2.315 units. Further, adding the 20% profit gives a net revenue of 2.780 units, which is used in the reward function along with a weight W.sub.tar.

    [0073] The parameter values indicated in Table 1 may be used in the reward functions.

    TABLE-US-00001 TABLE 1 Example parameter values for the reward functions Basic reward TCOP reward Parameter function function P.sub.l 1 0.1 P.sub.c 10 1000 units P.sub.nc 10 1000 units P.sub.o 10 1000 units R.sub.tar 100 2.78 units C.sub.el 0.5 units per kWh C.sub.dr 50 units/hour t 1 s m 40000 kg C.sub.d 0.36 A.sub.f 10 m.sup.2 .sub.air 1.225 kg/m.sup.3 g 9.81 m/s.sup.2 C.sub.r 0.005

    [0074] Reinforcement learning algorithms. The inventors study three RL algorithms for the decision making in baseline and new architectures. The implementations of these algorithms from the stable-baselines3 library with the default hyperparameters are used (Raffin et al. 2021).

    [0075] 1) Deep Q-Network (DQN): DQN is a reinforcement learning algorithm based on the Q-learning algorithm, which learns the optimal action-value function by iteratively updating estimates of the Q-value for each state-action pair. In DQN, the optimal action-value function in a given environment is approximated using a neural network. The corresponding objective function is given by

    [00009] L ( ) = s , a , r , s [ ( y - Q ( s , a ; ) ) 2 ] ,

    where represents the weights (and e.g. biases) of the Q-network, s and a are the state and action at time t, r is the immediate reward, s is the next state, and

    [00010] y = r + max a Q ( s , a ; - )

    is the target value. The target value is the sum of the immediate reward and the discounted maximum Q-value of the next state, where is the discount factor and .sup. represents the weights (and e.g. biases) of a target network with delayed updates. The objective function is to minimize the mean squared error between the estimated Q-value and the target value, which is then optimized using stochastic gradient descent.

    [0076] Another improvement of DQN over the standard Q-learning approaches is the use of a replay buffer and target network. The agent uses the replay buffer to store transitions and samples from it randomly during training. This enables the efficient use of past experiences. The target network is a copy of the Q-network with delayed updates. Together with the replay buffer, it improves the learning stability of the model and help prevent over fitting.

    [0077] 2) Advantage Actor-Critic (A2C): A2C is a reinforcement learning algorithm that combines the actor-critic method with an estimate of the advantage function. In the actor-critic method, the agent learns two neural networks: an actor network that outputs a probability distribution over actions, and a critic network that estimates the state-value function. The actor network is trained to maximize the expected reward by adjusting the policy parameters, while the critic network is trained to minimize the difference between the estimated value function and the true value function. The advantage function is the difference between the estimated value and the baseline value for a given state-action pair. The corresponding objective function for A2C (surrogate objective function) is given by:

    [00011] L ( ) = t [ log ( a t | s t ) A ^ t - H ( ( s t ) ) ] ,

    where represents the policy parameters, at is the action taken at time t in state s.sub.t, .sub.t is the estimate of the advantage function, H(.sub.(s.sub.t)) is the entropy of the policy distribution, and is a hyperparameter that controls the weight of the entropy regularization term. The goal is to maximize the expected reward by adjusting the policy parameters, while also maximizing the entropy of the policy distribution to encourage exploration.

    [0078] 3) Proximal Policy Optimization (PPO): PPO belongs to the family of policy gradient algorithms and has exhibited great potential in effectively addressing diverse RL problems (Schulman et al. 2017; Svensson et al. 2023). In addition, PPO benefits from simple yet effective implementation and a broad scope of applicability.

    [0079] The PPO algorithm is designed to address the challenges such as the instability that can arise when the policies are rapidly updated. PPO works by optimizing a surrogate objective function that measures the difference between the updated policy and the previous one. The surrogate objective function of PPO is defined as the minimum of two terms: the ratio of the new policy to the old policy multiplied by the advantage function, and the clipped ratio of the new policy to the old policy multiplied by the advantage function. In mathematical terms, this function is given as,

    [00012] L CLIP ( ) = E ^ t [ min { r t ( ) A ^ t , clip ( r t ( ) , 1 - , 1 + ) A ^ t } ]

    where represents the policy parameters,

    [00013] r t ( ) = ( a t | s t ) o l d ( a t | s t )

    is the likelihood ratio between the current policy and the previous policy, the clipping function is defined as

    [00014] clip ( x , x min , x max ) = { x max , x max < x , x , x min x x max , x min , x < x min ,

    t is the estimated advantage of taking action a.sub.t in state s.sub.t, and is a hyperparameter that controls the degree of clipping. The clipping term ensures that the policy update does not result in significant policy changes, which could otherwise destabilize the learning process. The PPO algorithm maximizes this objective function using stochastic gradient ascent to update the policy parameters. The expectation is taken over a batch of trajectories sampled from the environment.

    III. Simulation Study

    [0080] This section presents the results from different experiments conducted on the SUMO simulation platform using the baseline and new architectures with different RL algorithms. The source code used to implement the RL environment for conducting these experiments will later be made available on GitHub (Pathare 2023). The experiments are conducted on a Linux cluster that consists of 28 CPUs with model name Intel Xeon CPU E5-2690 v4 with a clock frequency of 2.60 GHz. The average time elapsed for training the new architecture with DQN, A2C and PPO for 10.sup.6 time steps is 12528 s (3 h 28 min), 13662 s (3 h 47 min), 14124 s (3 h 55 min) respectively.

    [0081] Performance improvement based on selection of states. A first trial aims to show how the performance can be improved by adding relevant features to the observation space. As mentioned in section II, the state space includes position, speed, and lane-change information of the ego vehicle 100 and the surrounding vehicles 192. In this experiment, the performance of the baseline architecture is compared with an improved variant where the distance to the leading vehicle has been added as an explicit feature to the above-described state space. The basic reward function with the PPO algorithm is used in both cases. The learning curve of average reward over five realizations is shown in FIG. 5, and the evaluation metrics (average of 5 validations with 100 episodes each) are shown in Table 2.

    TABLE-US-00002 TABLE 2 Evaluation of baseline architecture and improved baseline architecture with leading vehicle distance in state space Evaluation Baseline Improved baseline metric architecture architecture Reached target successfully 61% 70.6% Driven successfully, but not 0% 0% reached target 320 within maximum number of steps Terminated by collision or driving 39% 29.4% outside road Average speed 18.17 m/s 19.43 m/s Average distance travelled 1503.93 m 1667.86 m Average executed steps 88.43 89.65

    [0082] The average rewards and the evaluation metrics clearly show that the performance has improved by adding the distance to the leading vehicle as a feature in the state. The collision rate is considerably reduced and the average speed of the ego vehicle 100 is slightly increased. It is noted that the leading vehicle distance was already available in the observation space among the properties of surrounding vehicles. The present results show that explicitly adding the leading vehicle distance as a separate feature helps the agent to learn better how to control the speed with respect to the leading vehicle and avoid forward collisions.

    [0083] Performance comparison of baseline architecture and new architecture. In a second trial, the performance of the proposed new architecture shown in FIG. 1B is compared with the baseline framework shown in FIG. 1A using different RL algorithms. For both architectures, an observation space including leading vehicle distance and the basic reward function are used. However, the new architecture introduces a low-level controller (feedback controllers 120, 122) based on a physical model to perform speed control actions. FIGS. 6A-6C show the comparison of average episode rewards in both architectures for different RL algorithms. It can be seen that the new architecture outperforms the baseline regardless of the chosen RL algorithm. DQN obtains lowest average rewards. PPO obtains highest average reward in baseline architecture whereas both PPO and A2C shows similar performance in the new architecture.

    [0084] As PPO gives consistent performance in both architectures, it is used for further validation. More precisely, SUMO-controlled simulation of an ego vehicle using the default Krauss car-following model have been considered for evaluation. The validation results in Table 3 clearly show that the new architecture has improved the performance compared to the baseline architecture.

    TABLE-US-00003 TABLE 3 Evaluation of baseline architecture and new architecture Evaluation Krauss-model Baseline New metric based simulation architecture architecture Reached target 96% 70.6% 97.8% successfully Driven successfully, 0% 0% 0.6% but not reached target 320 within maximum number of steps Terminated by 4% 29.4% 1.6% collision or driving outside road Average speed 22.65 m/s 19.43 m/s 18.56 m/s Average distance 2136.62 m 1667.86 m 2178.37 m travelled Average executed 94.95 89.65 128.75 steps
    More precisely, the collision rate is significantly reduced from 29.4% to 1.6%. The slight decrease in the average speed of the new framework is negligible when considering the improvement in safety. The reduced collision rate also explains why there is an increase in the average travelled distance and average executed steps.

    [0085] Analysis of TCOP-based reward. In a final trial, the TCOP-based reward function is applied to the new vehicle control architecture in order to assess whether the RL agent 130 can successfully acquire a driving strategy that is both safe and cost-efficient. The episodic reward is evaluated with varying weights W.sub.tar assigned to the target completion reward R.sub.tar, and the results are shown in Table 4.

    TABLE-US-00004 TABLE 4 Evaluation of new architecture with TCOP-based reward function for different weights W.sub.tar Evaluation metric W.sub.tar = 1 W.sub.tar = 10 W.sub.tar = 36 Reached target 3.6% 1% 98.8% successfully Driven successfully, 93.4% 96.8% 0.4% but not reached the target within maximum steps Terminated by 3% 2.2% 0.8% collision or driving outside road Average speed 2.39 m/s 1.56 m/s 18.24 m/s Average distance 861.64 m 580.65 m 2187.62 m travelled Average executed 477.02 486.11 126.46 steps

    [0086] As mentioned in section II, in an ideal scenario the cost would amount to 2.31 monetary units, while the revenue R.sub.tar would be 2.78 units. Consequently, in this ideal case, the episodic reward would equal 2.782.31=0.47 when W.sub.tar is 1. This value is relatively small, making it challenging for the agent to learn how to reach the target. Moreover, Table 4 demonstrates that the RL agent 130 exhibits slow driving behavior and struggles to reach the target. However, it becomes evident that increasing the weight assists the agent in learning to drive at higher speeds and successfully reach the target.

    [0087] In addition, the inventors have observed that the convergence of learning with realistic reward values takes about twice as many training steps compared to the basic reward function. Moreover, the new architecture trained with the basic reward function and the TCOP-based reward function has similar TCOP costs when evaluated with the same equation. This implies that incorporating realistic rewards and penalties in the training process could introduce additional complexities, making it more challenging for the agent to converge to an optimal and cost-effective driving strategy. This opens new possibilities to explore and address the challenges associated with training RL agents in real-world environments.

    IV. Implementations

    [0088] Attention will now be directed to implementations of this framework in a utility context. With reference to the flowchart in FIG. 2, a method 200 of controlling an ego vehicle 100 in a multi-lane driving environment 100 (see FIG. 3) according to various embodiments will now be described.

    [0089] In a first step 210 of the method 200, a feedback controller 120 is provided which senses an actual motion state x of the vehicle 100 and determines a machine-level instruction u to the motion actuator for approaching a setpoint motion state x*. Each motion actuator is a technical component for applying a linear or angular acceleration on the vehicle, such as a drive motor, a brake, a steering arrangement, a drivetrain component, etc. The motion state represents a physical or technical variable describing the position, pose or movements of the ego vehicle 100, e.g., a lane number, a speed and a time-to-collision (TTC), though it may as well include variables describing at least one surrounding vehicles 192. The machine-level instruction finally may be an input variable to a motion actuator, such as an acceleration torque, a braking torque, a left/right differential torque, a gear and a steering angle.

    [0090] The setpoint motion state x* may be a continuous variable (including a quasi-continuous digital representation of a real-valued variable, which has finite resolution) or a discrete variable. Step 210 may be completed by providing the feedback controller 120 with access to sensor data allowing the sensing of an actual motion state x of the vehicle. The feedback controller 120 may be configured to perform P, PI, PD or PID control or control governed by another suitable type of adaptive or non-adaptive feedback control law, to determine machine-level instructions u suitable for influencing the motion state x of the vehicle so as to agree or approximately agree with the setpoint motion state x*. The feedback controller 120 may be configured on the basis of a physical model of the vehicle 100, particularly, a physical model of the motion actuator's influence on relevant motion states of the vehicle.

    [0091] Step 210 may include provisioning the feedback controller 120 by sourcing it from a supplier, or alternatively by configuring a multi-purpose controller, or further by implementing the feedback controller 120 using hardware and/or software. In an example where the feedback controller 120 is implemented as a computer program, step 210 may entail initiating an execution of this program on a processor communicatively coupled to the motion actuator and sensors in the vehicle 100.

    [0092] The method 200 can be executed with just this feedback controller 120, or a second feedback controller 122 and any further feedback controller(s) may be added. This corresponds to an optional second step 212 of the method.

    [0093] In embodiments of the method 200 where two or more feedback controllers 210, 212 are provided, they may be associated with functionally distinct groups of motion actuators in the vehicle 100. For example, a first feedback controller 120 may be configured to determine machine-level instructions u.sub.1 to motion actuators generally responsible for longitudinally influencing the movements of the vehicle 100 (e.g., for commanding acceleration or braking), whereas a second feedback controller 122 may be configured to determine machine-level instructions u.sub.2 to motion actuators generally responsible for laterally influencing the movements of the vehicle 100 (e.g., for commanding steering). Similarly, the feedback controllers 120, 122 may determine these instructions u.sub.1,u.sub.2 based on partially different motion states x.sub.1,x.sub.2 of the vehicle. In such embodiments, the notation in FIG. 1B may be understood as u=(u.sub.1,u.sub.2), x=(x.sub.1,x.sub.2), x*=(x*.sub.1,x*.sub.2).

    [0094] In some of these embodiments, the first and second feedback controllers 210 have exclusive authority to directly control the respective groups of motion actuators. Provision is made for indirect control, e.g., that the speed may have to be adjustednotably in view of the available road frictionduring curvilinear movement due to action by the lateral actuators. For example, it may be desirable for the vehicle to decelerate slightly when it enters a curve and for it to accelerate gradually when it exits the curve and approaches straight road.

    [0095] In one example of such embodiments, the (first) feedback controller 120 includes an automatic cruise controller (ACC), and the second feedback controller 122 includes a lane-change assistant. The ACC senses the leading vehicle distance d.sub.lead (see FIG. 3), is configured to maintain a setpoint time-to-collision (TTC), and provides instructions to a drive motor and brakes. It is recalled that the TTC can be computed as d.sub.lead/v.sub.x0. The lane-change assistant for its part senses the positions of the ego vehicle 100 and the surrounding vehicles 192, is configured to go to or stay in a setpoint lane, andto do soprovides instructions to a steering arrangement in the vehicle. Optionally, the lane-change assistant may further activate and deactivate turning indicators as needed, and it may cause the ACC to adjust the velocity in connection to a lane-change maneuver.

    [0096] Returning to FIG. 2 and the description of the method 200, a third step 214 includes providing a reinforcement-learning (RL) agent 130. The RL agent 130 is suitable for decision-making regarding the setpoint motion state x*. Reference is made to above section II for a discussion on a suitable learning algorithm, reward function and training data for providing the RL agent.

    [0097] For example, the RL agent 130 may have been trained to perform such decision-making based on simulated or real-world training data. Suitable training data may be provided in accordance with the simulation setup described in section II and using random initial position and speed pattern of the surrounding vehicles 192. Generally speaking, any one or more learning algorithms that support a discrete action space can be used, including: deep Q network (DQN), advantage actor critic (A2C), proximal policy optimization (PPO). The definition of the reward function used during the training phase may be adapted to the system owner's needs and wishes. A basic example reward function is given in section II, as well as one TCOP-based reward function. The TCOP-based reward function may for example, in the case of a commercial vehicle, include a component representing driver salary, a revenue for reaching a target on time and a penalty for being delayed.

    [0098] Executing the third step 214 may amount to initializing a preexisting RL agent 130 which is ready for use, and making sure its decisions are output in a form suitable for use by the feedback control as a setpoint motion state x*. The third step 214 does not necessarily include defining and/or training the RL agent 130.

    [0099] The input to the RL agent 130 forms the basis of the decision-making concerning the setpoint motion state x*. In some embodiments, the RL agent 130 is fed with the same motion state of the vehicle 100 (and, optionally, of surrounding vehicles 192) as the feedback controller 120 uses, that is s=x. In an important class of embodiments, however, the state s used by the RL agent 130 is richer than the motion state x, in the sense that it provides a more complete description of the ego vehicle 100 and/or of the surrounding vehicles 192. This is illustrated by the vehicle model that was used in the simulation according to section II. Further, the state s may also include a state variable relating to the action requested by the RL agent 130, such as a lane-change state (sign of lateral velocity component) of the ego vehicle 100 in the case of a lane-change assistant. In some embodiments, the dimensionality of the state s may be greater than the dimensionality of the motion state x.

    [0100] In the above-mentioned embodiments, where two feedback controllers 120, 122 are provided, the RL agent 130 is preferably trained to perform joint decision-making regarding the setpoint motion states x*.sub.1,x*.sub.2 for all of these. The decision-making is joint, for instance, if the setpoint motion states x*.sub.1,x*.sub.2 are output in response to a single query to the RL agent 130, particularly, in response to a single query to a single neural network in the RL agent 130. Further examples of joint decision-making include where the first setpoint motion state x*.sub.1 is decided in dependence of an earlier decision on the second setpoint motion states x*.sub.2, or vice versa.

    [0101] In setups with three or more feedback controllers, the RL agent 130 may either perform joint decision-making for all, or there may be multiple parallel RL agents which are associated with respective subgroups of feedback controllers.

    [0102] In two subsequent steps 216 and 218, the decisions made by the RL agent 130 are applied as the setpoint motion state x* of the feedback controller 120, whose machine-level instruction u is applied to the motion actuator in the vehicle 100. The total processing chain may be visualized as two mappings:

    [00015] s .fwdarw. RL agent 130 x * x , x * .fwdarw. feedback controller 120 u

    where the RL agent 130 has an implicit dependence on neural network weights (and e.g. biases) . Similarly, the feedback controller 120 is defined on the basis of a particular control law and assigned parameter values therein (e.g., feedback gain), and it depends parametrically on the setpoint value x* which is to be achieved.

    [0103] The method according to FIG. 4 can be implemented by of a vehicle-carried or stationary computer system. The computer system may be communicatively connected to sensorics, actuators and communication means, as necessary in view of the steps of the relevant embodiment of the method. In particular, the method may be executed in real time during operation of the ego vehicle, i.e., while the ego vehicle 110 moves in the multi-lane driving environment 100. Further, it is possible to execute the method by an entity not authorized to directly control movements of the surrounding vehicles 120, or by an entity that should not or should preferably not make use of its authority to directly control movements of the surrounding vehicles 120.

    [0104] By way of example, FIG. 4 is a schematic diagram of a computer system 400 suitable for carrying out the method 200 according to FIG. 2. For example, the feedback controller 120 can be implemented as software code executing on general-purpose processing circuitry 402 of the computer system 400. The RL agent 130 can be implemented in a similar fashion or using a dedicated chipset, such as a hardware accelerator aimed at AI applications, or a suitably configured graphics processing unit (GPU). The feeding of information therebetween, including the setpoint motion state x*, can be effectuated by a system bus 406.

    [0105] The computer system 400 is adapted to execute instructions from a computer-readable medium to perform these and/or any of the functions or processing described herein. The computer system 400 may be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, or the Internet. While only a single device is illustrated, the computer system 400 may include any collection of devices that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. Accordingly, any reference in the disclosure and/or claims to a computer system, computing system, computer device, computing device, control system, control unit, electronic control unit (ECU), processor device, processing circuitry, etc., includes reference to one or more such devices to individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. For example, control system may include a single control unit or a plurality of control units connected or otherwise communicatively coupled to each other, such that any performed function may be distributed between the control units as desired. Further, such devices may communicate with each other or other devices by various system architectures, such as directly or via a Controller Area Network (CAN) bus, etc.

    [0106] The computer system 400 may comprise at least one computing device or electronic device capable of including firmware, hardware, and/or executing software instructions to implement the functionality described herein. The computer system 400 may include processing circuitry 402 (e.g., processing circuitry including one or more processor devices or control units), a memory 404, and a system bus 406. The computer system 400 may include at least one computing device having the processing circuitry 402. The system bus 406 provides an interface for system components including, but not limited to, the memory 404 and the processing circuitry 402. The processing circuitry 402 may include any number of hardware components for conducting data or signal processing or for executing computer code stored in memory 404. The processing circuitry 402 may, for example, include a general-purpose processor, an application specific processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a circuit containing processing components, a group of distributed processing components, a group of distributed computers configured for processing, or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The processing circuitry 402 may further include computer executable code that controls operation of the programmable device.

    [0107] The system bus 406 may be any of several types of bus structures that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and/or a local bus using any of a variety of bus architectures. The memory 404 may be one or more devices for storing data and/or computer code for completing or facilitating methods described herein. The memory 404 may include database components, object code components, script components, or other types of information structure for supporting the various activities herein. Any distributed or local memory device may be utilized with the systems and methods of this description. The memory 404 may be communicably connected to the processing circuitry 402 (e.g., via a circuit or any other wired, wireless, or network connection) and may include computer code for executing one or more processes described herein. The memory 404 may include non-volatile memory 408 (e.g., read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.), and volatile memory 410 (e.g., random-access memory (RAM)), or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a computer or other machine with processing circuitry 402. A basic input/output system (BIOS) 412 may be stored in the non-volatile memory 408 and can include the basic routines that help to transfer information between elements within the computer system 400.

    [0108] The computer system 400 may further include or be coupled to a non-transitory computer-readable storage medium such as the storage device 414, which may comprise, for example, an internal or external hard disk drive (HDD) (e.g., enhanced integrated drive electronics (EIDE) or serial advanced technology attachment (SATA)), HDD (e.g., EIDE or SATA) for storage, flash memory, or the like. The storage device 414 and other drives associated with computer-readable media and computer-usable media may provide non-volatile storage of data, data structures, computer-executable instructions, and the like.

    [0109] Computer-code which is hard or soft coded may be provided in the form of one or more modules. The module(s) can be implemented as software and/or hard-coded in circuitry to implement the functionality described herein in whole or in part. The modules may be stored in the storage device 414 and/or in the volatile memory 410, which may include an operating system 416 and/or one or more program modules 418. All or a portion of the examples disclosed herein may be implemented as a computer program 420 stored on a transitory or non-transitory computer-usable or computer-readable storage medium (e.g., single medium or multiple media), such as the storage device 414, which includes complex programming instructions (e.g., complex computer-readable program code) to cause the processing circuitry 402 to carry out actions described herein. Thus, the computer-readable program code of the computer program 420 can comprise software instructions for implementing the functionality of the examples described herein when executed by the processing circuitry 402. In some examples, the storage device 414 may be a computer program product (e.g., readable storage medium) storing the computer program 420 thereon, where at least a portion of a computer program 420 may be loadable (e.g., into a processor) for implementing the functionality of the examples described herein when executed by the processing circuitry 402. The processing circuitry 402 may serve as a controller or control system for the computer system 400 that is to implement the functionality described herein.

    [0110] The computer system 400 may include an input device interface 422 configured to receive input and selections to be communicated to the computer system 400 when executing instructions, such as from a keyboard, mouse, touch-sensitive surface, etc. Such input devices may be connected to the processing circuitry 402 through the input device interface 422 coupled to the system bus 406 but can be connected through other interfaces, such as a parallel port, an Institute of Electrical and Electronic Engineers (IEEE) 1394 serial port, a Universal Serial Bus (USB) port, an IR interface, and the like. The computer system 400 may include an output device interface 424 configured to forward output, such as to a display, a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 400 may include a communications interface 426 suitable for communicating with a network as appropriate or desired.

    [0111] The operational actions described in any of the exemplary aspects herein are described to provide examples and discussion. The actions may be performed by hardware components, may be embodied in machine-executable instructions to cause a processor to perform the actions, or may be performed by a combination of hardware and software. Although a specific order of method actions may be shown or described, the order of the actions may differ. In addition, two or more actions may be performed concurrently or with partial concurrence.

    [0112] To summarize, the present disclosure is concerned with optimal trajectory planning for an autonomous vehicle (AV) operating in dense traffic, where vehicles closely interact with each other. To tackle this problem, the inventors propose a novel framework that couples trajectory prediction and planning in multi-agent environments, using distributed model predictive control. A demonstration of this framework is presented in simulation, employing a trajectory planner using non-linear model predictive control. Performance and convergence are analyzed under different assumptions about the prediction errors. The results indicate, in broadly defined conditions, that the obtained locally optimal solutions are superior to the results of decoupled prediction and planning.

    V. Closing Remarks

    [0113] The aspects of the present disclosure have mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the invention, as defined by the appended patent claims.

    [0114] By way of summary, the comparative study presented in section III demonstrates how prudent selection of the state space and action space may improve the learning process of an RL agent for enabling the tactical decision-making in autonomous trucks. The results demonstrate in particular that learning can be acceleratedand overall performance improvedby training the agent to focus on high-level decisions, such as the magnitude of time gap (TTC) to the leading vehicle to be maintained, while leaving the low-level speed control to a physics-based controller. The results thus demonstrate the benefits of separating the decision-making into one high-level process and one low-level process. In particular examples, further, the low-level decision-making process has been separated into multiple parallel subprocesses (corresponding to groups of motion actuations in the vehicle) while the high-level process remains monolithic with joint decision-making for all the low-level subprocesses.

    REFERENCES

    [0115] Das, Lokesh Chandra, and Myounggyu Won. 2021. Saint-Acc: Safety-Aware Intelligent Adaptive Cruise Control for Autonomous Vehicles Using Deep Reinforcement Learning. In ICML. [0116] De Jong, Gerard, Hugh Gunn, and Warren Walker. 2004. National and International Freight Transport Models: An Overview and Ideas for Future Development. Transport Reviews 24. [0117] Desjardins, Charles, and Brahim Chaib-draa. 2011. Cooperative Adaptive Cruise Control: A Reinforcement Learning Approach. IEEE Transactions on Intelligent Transportation Systems 12. [0118] Erdmann, Jakob (2014), Lane-Changing Model in SUMO. In: Proceedings of the SUMO2014 Modeling Mobility with Open Data, 24, pp. 77-88. Deutsches Zentrum fr Luft-und Raumfahrt e.V. SUMO2014, Berlin, Deutschland. ISSN 1866-721X. [0119] Grislis, Aivis. 2010. Longer Combination Vehicles and Road Safety. Transport 25. [0120] Hoel, Carl-Johan E. 2020. Source Code for Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation. https://github.com/carljohanhoel/BayesianRLForAutonomousDriving. [0121] Hoel, Carl-Johan, Krister Wolff, and Leo Laine. 2020. Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation. In Intelligent Vehicles Symposium. IEEE. [0122] Jimnez, Felipe, Jos Eugenio Naranjo, Jos Javier Anaya, Fernando Garca, Aurelio Ponz, and Jos Mara Armingol. 2016. Advanced Driver Assistance System for Road Environments to Improve Safety and Efficiency. Transportation Research Procedia 14. [0123] Kiran, B Ravi, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Prez. 2021. Deep Reinforcement Learning for Autonomous Driving: A Survey. Transactions on Intelligent Transportation Systems 23. [0124] Krauss, S., P. Wagner, and C. Gawron. 1997. Metastable States in a Microscopic Model of Traffic Flow. Phys. Rev. E 55. [0125] Lin, Yuan, John McPhee, and Nasser L. Azad. 2021. Comparison of Deep Reinforcement Learning and Model Predictive Control for Adaptive Cruise Control. IEEE Transactions on Intelligent Vehicles 6. [0126] Lopez, Pablo Alvarez, Michael Behrisch, Laura Bieker-Walz, Jakob Erdmann, Yun-Pang Fltterd, Robert Hilbrich, Leonhard Lcken, Johannes Rummel, Peter Wagner, and Evamarie Wiener. 2018. Microscopic Traffic Simulation Using SUMO. In International Conference on Intelligent Transportation Systems. IEEE. [0127] Moreno, Gonzalo, Lauro Cesar Nicolazzi, Rodrigo De Souza Vieira, and Daniel Martins. 2018. Stability of Long Combination Vehicles. International Journal of Heavy Vehicle Systems 25. [0128] Nilsson, Peter, Leo Laine, Jesper Sandin, Bengt Jacobson, and Olle Eriksson. 2018. On Actions of Long Combination Vehicle Drivers Prior to Lane Changes in Dense Highway Traffica Driving Simulator Study. Transportation Research Part F: Traffic Psychology and Behaviour 55. [0129] Pathare, Deepthi. 2023. Source Code for Improved Tactical Decision Making and Control Architecture for Autonomous Truck in SUMO Using Reinforcement Learning. https://github.com/deepthi-pathare/Autonomous-truck-sumo-gym-env/(available from 21 Sep. 2023) [0130] Raffin, Antonin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. 2021. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research 22. [0131] Sallab, Ahmad E l, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. 2017. Deep Reinforcement Learning Framework for Autonomous Driving. arXiv Preprint arXiv: 1704.02532. [0132] Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347. [0133] Shalev-Shwartz, Shai, Shaked Shammah, and Amnon Shashua. 2016. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv Preprint arXiv: 1610.03295. [0134] Shaout, Adnan, Dominic Colella, and Selim Awad. 2011. Advanced Driver Assistance Systems-Past, Present and Future. In International Computer Engineering Conference. IEEE. [0135] Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Second. The MIT Press. http://incompleteideas.net/book/the-book-2nd.html. [0136] Svensson, Hampus Gummesson, Christian Tyrchan, Ola Engkvist, and Morteza Haghir Chehreghani. 2023. Utilizing Reinforcement Learning for de Novo Drug Design. https://arxiv.org/abs/2303.17615. [0137] Treiber, Martin, Ansgar Hennecke, and Dirk Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical review E 62.2 (2000): 1805. [0138] Xiao, Lingyun, and Feng Gao. 2010. A Comprehensive Review of the Development of Adaptive Cruise Control Systems. Vehicle System Dynamics. [0139] Yang, Da, Xiaoping Qiu, Dan Yu, Ruoxiao Sun, and Yun Pu. 2015. A Cellular Automata Model for Car-Truck Heterogeneous Traffic Flow Considering the Car-Truck Following Combination Effect. Physica A: Statistical Mechanics and Its Applications 424. [0140] Yang, Junru, Xingliang Liu, Shidong Liu, Duanfeng Chu, Liping Lu, and Chaozhong Wu. 2020. Longitudinal Tracking Control of Vehicle Platooning Using DDPG-Based PID. In 2020 4th CAA International Conference on Vehicular Control and Intelligence (CVCI). [0141] Zhao, Dongbin, Bin Wang, and Derong Liu. 2013. A Supervised Actor-Critic Approach for Adaptive Cruise Control. Soft Computing 17.