METHOD AND SYSTEM FOR CONTROLLING A PLURALITY OF VEHICLES, IN PARTICULAR AUTONOMOUS VEHICLES
20230004163 · 2023-01-05
CPC classification
G08G1/20; G05D1/0217; G05D1/0088
Abstract
A traffic planning method for controlling a plurality of vehicles, wherein each vehicle occupies one node in a shared set of planning nodes and is movable to other nodes along predefined edges between pairs of the nodes in accordance with a finite set of motion commands. In the method, initial node occupancies of the vehicles are obtained, and a sequence of motion commands is determined by optimizing a state-action value function which depends on node occupancies s and the motion commands a to be given. The state-action value function includes a command-dependent term, which is updated in each iteration based on a reward function, and a command-independent term, which penalizes node occupancies with too small inter-vehicle gaps and is exempted from said updating.
Claims
1. A traffic planning method for controlling a plurality of vehicles, wherein each vehicle occupies one node in a shared set of planning nodes and is movable to other nodes along predefined edges between pairs of the nodes in accordance with a finite set of motion commands, the method comprising: obtaining initial node occupancies of the vehicles; and determining a sequence of motion commands by optimizing a state-action value function which depends on node occupancies and the motion commands to be given, the state-action value function including at least one command-independent term, which penalizes node occupancies with too small inter-vehicle gaps, and at least one command-dependent term.
2. The method of claim 1, wherein said determining is executed repeatedly, and the command-dependent term is updated on the basis of a predefined reward function after each execution cycle of said determining.
3. The method of claim 2, wherein the command-independent term is exempted from said updating.
4. The method of claim 2, wherein the reward function represents productivity minus cost.
5. The method of claim 1, wherein the command-independent term penalizes an inter-vehicle gap expressed as a time separation of the vehicles in the respective vehicle's direction of movement.
6. The method of claim 1, wherein the command-independent term depends on a gap-balancing indicator which penalizes too small gaps and/or unevenly distributed gaps.
7. The method of claim 6, wherein the gap-balancing indicator depends on a variability measure of the gap sizes, such as a standard deviation of the gap sizes.
8. The method of claim 6, wherein the command-independent term includes a composition of the gap-balancing indicator with at least one of the following functions: a rectified linear unit, ReLU, activation function; a sigmoid function; a Gaussian function.
9. The method of claim 1, further comprising obtaining the state-action value function by a preceding step of reinforcement learning.
10. The method of claim 1, wherein said determining of a sequence of motion commands and any reinforcement learning are performed within a Dyna-2 algorithm.
11. The method of claim 1, wherein the vehicles are autonomous vehicles.
12. A device configured to control a plurality of vehicles, wherein each vehicle occupies one node in a shared set of planning nodes and is movable to other nodes along predefined edges between pairs of the nodes in accordance with a finite set of motion commands, the device comprising: a first interface configured to receive initial node occupancies of the vehicles; a second interface configured to feed motion commands selected from said finite set to said plurality of vehicles; and processing circuitry configured to perform the method of claim 1.
13. A computer program comprising instructions which, when executed, cause a processor to execute the method of claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] Aspects and embodiments are now described, by way of example, with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0023] The aspects of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, on which certain embodiments of the invention are shown. These aspects may, however, be embodied in many different forms and should not be construed as limiting; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and to fully convey the scope of all aspects of the invention to those skilled in the art. Like numbers refer to like elements throughout the description.
[0025] Each vehicle (see the figures) occupies one node in a shared set of planning nodes and is movable to other nodes along predefined edges between pairs of the nodes, in accordance with a finite set of motion commands. The node currently occupied by a vehicle is given by an occupancy function O(⋅), which associates each vehicle with its planning node.
[0026] One aim of the present disclosure is to enable efficient centralized control of the vehicles v1, v2, v3, v4. The vehicles v1, v2, v3, v4 are to be controlled as a group, with mutual coordination. The mutual coordination may entail that any planning node utilization conflicts that could arise between vehicles are deferred to a planning algorithm and resolved at the planning stage. The planning may aim to maximize productivity, such as the total quantity of useful transport system work or the percentage of on-schedule deliveries of goods. The planning may additionally aim to minimize cost, including fuel consumption, battery wear, mechanical component wear or the like.
[0027] Regarding the planning node utilization conflicts that may arise, it may initially be noted that if each vehicle moves one waypoint per epoch, then no vehicle blocks this movement of any other vehicle for the node occupancies (start state) shown in the figures.
[0028] It can also be seen that these node occupancies provide each vehicle with a next waypoint to which it can move in a next epoch. The choice of next waypoint is not arbitrary, however: both vehicles v1 and v4 may theoretically move to waypoint wp3, but this conflict can be avoided by routing vehicle v1 to waypoint wp2 instead. If the system is evolved in the second manner, then vehicle v4 will block vehicle v1 from moving to the next waypoint wp3. This blocking state temporarily reduces the vehicle system's productivity but will be resolved once vehicle v4 continues to waypoint wp4.
[0029] It is easy to realize that the difficulty of the blocking states (as measured, say, by the number of vehicle movements needed to reach a non-blocking state) in a given waypoint topology will increase with the number of vehicles present. The efficiency gain of deploying marginally more vehicles to solve a given utility task in a given environment may therefore be offset by the increased risk of conflicts. A waypoint topology populated with many vehicles may also have more deadlock states, i.e., states where no vehicle movement is possible. As mentioned, a deadlock state may correspond to a real-life scenario where the controlled vehicles need external help to resume operation, such as operator intervention, towing etc.
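Since a deadlock is defined by the absence of any possible vehicle movement, it can be tested directly on a graph of the planning nodes. The following is a minimal Python sketch, assuming an adjacency-list encoding of the road network (the encoding and function name are illustrative, not taken from the disclosure):

```python
def is_deadlock(occupied, edges):
    """True when no vehicle movement is possible: every edge leaving an
    occupied planning node ends at another occupied node."""
    return all(dst in occupied for src in occupied for dst in edges[src])

# Two vehicles facing each other on a two-node corridor are deadlocked:
print(is_deadlock({"wp1", "wp2"}, {"wp1": ["wp2"], "wp2": ["wp1"]}))  # True
```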
[0030] The following description is made under an assumption of discrete time, that is, the traffic system evolves in evenly spaced epochs. The length of an epoch may be of the order of 0.1 s, 1 s, 10 s or longer. At each epoch, either a motion command is given to one of the vehicles v1, v2, v3, v4, a command is given to a predefined group of vehicles, or no command is given. Quasi-simultaneous commands v1.a1, v2.a1 to two vehicles v1, v2 or two vehicle groups can be distributed over two consecutive epochs. To allow approximate simultaneity, the epoch length may be configured to be shorter than the typical time scale of the tactical decision-making for one vehicle. With this setup, the space of possible planning outcomes corresponds to the set of all command sequences of length d, where d is the planning horizon (or lookahead horizon).
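For concreteness, the command space and the resulting set of candidate plans can be sketched as follows; the vehicle identifiers, the two motion primitives and the encoding of "no command" are illustrative assumptions.

```python
from itertools import product

# Illustrative assumptions: four vehicles, two motion primitives per
# vehicle, and None for the epoch where no command is given.
VEHICLES = ["v1", "v2", "v3", "v4"]
MOTIONS = ["a1", "a2"]
COMMANDS = [None] + [(v, a) for v in VEHICLES for a in MOTIONS]

def candidate_plans(d):
    """All command sequences of length d (the planning horizon); the
    space grows as len(COMMANDS) ** d, which is why a practical planner
    prunes it rather than enumerating it exhaustively."""
    return product(COMMANDS, repeat=d)

print(sum(1 for _ in candidate_plans(2)))  # 9 ** 2 = 81 sequences
```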
[0031] With reference now to the figures, a traffic planning method 100 for controlling a plurality of vehicles will be described.
[0032] In a first step 110 of the method 100, a state-action value function Q(s,a)=Q_S(s,a)+Q_L(s) is obtained by executing a reinforcement learning scheme, including training. The type of reinforcement learning scheme may for example be Q-learning or temporal-difference learning on the basis of a predefined reward function R(s,a); see R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press (2018), ISBN 9780262039246. The reward function R(s,a) may represent productivity minus cost. The productivity term(s) or factor(s) may be an (approximate) quantitative indicator of the amount of the utility task which is completed by the vehicles' movements. It may be a measure of the total distance travelled (e.g., vehicle-kilometers), the total distance travelled by vehicles carrying payload, a passenger-distance measure (e.g., passenger-kilometers), a payload-distance measure (e.g., ton-kilometers), a payload quantity delivered to an intended recipient or the like. The cost term(s) or factor(s) in the reward function R(s,a) may reflect energy consumption, a projected maintenance demand in view of mechanical wear (e.g., total velocity variation, peak acceleration, braking events, number of load cycles on structural elements) or chemical wear (e.g., exposure to sunlight, corrosive fluids), and/or safety risks (e.g., minimum vehicle separation). Suitable training data for the reinforcement learning in step 110 may be obtained by running a large number of computer simulations of traffic based on a mathematical model of the vehicles and planning nodes, e.g., a road-network model. Recorded observations of real vehicle movements in an environment are an alternative source of training data. Hybrids of these are possible too, for instance, using real-world observations as initial values for the computer simulations.
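By way of illustration, a reward of the productivity-minus-cost type could be sketched as follows; the transition summary, the particular cost terms and all weights are assumptions chosen for the example, not values from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """Assumed summary of the effects of one epoch of vehicle movement."""
    payload_km: float   # payload-distance produced (productivity)
    energy_kwh: float   # energy consumed
    brake_events: int   # proxy for projected mechanical wear
    min_gap: float      # smallest inter-vehicle gap reached

def reward(t: Transition, w_energy=0.3, w_wear=0.05,
           w_safety=1.0, safe_gap=1.0):
    """R(s, a) = productivity - cost; the terms and weights stand in
    for the options listed in the text."""
    cost = (w_energy * t.energy_kwh
            + w_wear * t.brake_events
            + w_safety * max(0.0, safe_gap - t.min_gap))
    return t.payload_km - cost
```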
[0033] The state-action value function Q(s,a) obtained in the first step 110 depends on node occupancies s and on motion commands a to be given within the planning horizon. The state-action value function (which acts as the objective function in a subsequent optimization) includes at least one command-independent term Q_L(s), which penalizes node occupancies with too small inter-vehicle gaps, and at least one command-dependent term Q_S(s,a). The command-independent term Q_L(s) and the command-dependent term Q_S(s,a) may respectively represent a long-term and a short-term memory of the traffic planning method 100. It is primarily the command-dependent term Q_S(s,a) that may be obtained by means of reinforcement learning according to the reward function R(s,a). The command-dependent term Q_S(s,a) may furthermore be updated after step 110 (the training phase) has ended, i.e., in operation, whereas the command-independent term Q_L(s) may remain constant between planning calls.
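The additive structure Q(s,a)=Q_S(s,a)+Q_L(s), with one updatable term and one fixed term, can be captured in a few lines. This is a minimal sketch assuming hashable states and a tabular short-term memory; the class and attribute names are illustrative.

```python
class StateActionValue:
    """Q(s, a) = Q_S(s, a) + Q_L(s): a learned, command-dependent
    short-term term plus a fixed, command-independent long-term
    gap penalty."""

    def __init__(self, q_long):
        self.q_short = {}      # (s, a) -> value, updated in operation
        self.q_long = q_long   # s -> gap penalty, exempt from updating

    def __call__(self, s, a):
        return self.q_short.get((s, a), 0.0) + self.q_long(s)
```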
[0034] The command-independent term Q_L(s) may be defined by reference to a gap-balancing indicator SoB(s) which, for a state s, expresses the degree of desirability of the prevailing inter-vehicle gaps as a number. For each vehicle, the inter-vehicle gap may be defined as the distance to the vehicle immediately ahead of it, as in the example situations shown in the figures.
[0035] As mentioned initially, a gap size in the sense of one of these definitions may refer to a distance or a time separation. If time is used to quantify the gap, it may correspond to the time needed for the rear vehicle to reach the momentary position of the vehicle immediately ahead of it at the rear vehicle's current speed. Alternatively, the relative speed of the front and rear vehicles may be used as a basis. For example, if the front vehicle is moving more slowly, the gap-size time may correspond to the time to collision absent any intervention, and in the opposite case the gap-size time may be set to a maximum value.
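The two time-based gap definitions can be sketched as follows; the function names, the SI units and the assumed cap t_max are illustrative.

```python
def gap_time_own_speed(distance_m, rear_speed_ms):
    """Time for the rear vehicle to reach the momentary position of the
    vehicle immediately ahead, at the rear vehicle's current speed."""
    return distance_m / rear_speed_ms if rear_speed_ms > 0 else float("inf")

def gap_time_relative(distance_m, rear_speed_ms, front_speed_ms, t_max=60.0):
    """Time to collision absent any intervention; set to an assumed
    maximum t_max when the front vehicle is not the slower one."""
    closing = rear_speed_ms - front_speed_ms
    return distance_m / closing if closing > 0 else t_max
```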
[0036] The gap-balancing indicator SoB according to any of these options provides a convenient basis for determining a suitable sequence of motion commands by optimization. In particular, the gap-balancing indicator SoB(s) can be defined so that it assumes values in [0,1], where the value 1 corresponds to a state s such that no further balancing is possible and the value 0 represents the opposite. The gap-balancing indicator SoB(s) may purposefully be defined to be 1 if a state s is well-balanced but not ideally balanced, namely, if the owner of the traffic system does not find it worthwhile or meaningful to spend resources on further intervention aiming to balance the node occupancies of the vehicles.
[0037] To mention a few examples, a gap-balancing indicator SoB_1 can be constructed based on a standard deviation D(s) of the sizes of the inter-vehicle gaps. The following scaling ensures that SoB_1 takes values in [0,1]:

SoB_1(s) = 1 − D(s)/D_max,

where D_max is the theoretical maximum value of the standard deviation. Viable alternatives to the standard deviation D(s) are the variance, the variability coefficient, the range, the interquartile range, and other variability measures. A gap-balancing indicator of this type may cause the method 100 to return motion commands tending to distribute the available total gap size L (in time or distance) equally among the vehicles 299.
[0038] Another option is to use a gap-balancing indicator SoB_2 which is proportional to the minimum gap size among the vehicles 299. A suitable scaling factor for the minimum gap size may be N/L, where L is the available total gap size and N denotes the number of vehicles. This means SoB_2(s)=0 when two vehicles are about to collide and SoB_2(s)=1 when the gaps are equally distributed.
[0039] A further option is to form a weighted combination SoB_3(s) = α·SoB_1(s) + (1−α)·SoB_2(s), which maintains the interval [0,1] as its image as long as the parameter α satisfies 0<α<1.
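For illustration, the three indicators can be sketched as follows. The expression for the maximum standard deviation D_max assumes that the gap sizes sum to the available total gap size L; it is a derivation for this sketch, not a formula given in the disclosure.

```python
import statistics

def sob1(gaps):
    """SoB_1: standard-deviation-based indicator scaled into [0, 1].
    D_max = (L/N) * sqrt(N - 1) is the largest possible standard
    deviation when the gaps sum to L (one vehicle hoards the whole
    gap budget)."""
    n, total = len(gaps), sum(gaps)
    d_max = (total / n) * (n - 1) ** 0.5
    return 1.0 - statistics.pstdev(gaps) / d_max

def sob2(gaps):
    """SoB_2: minimum gap scaled by N/L -- 0 when two vehicles are
    about to collide, 1 when the gaps are equally distributed."""
    return min(gaps) * len(gaps) / sum(gaps)

def sob3(gaps, alpha=0.5):
    """SoB_3: weighted combination, staying in [0, 1] for 0 < alpha < 1."""
    return alpha * sob1(gaps) + (1 - alpha) * sob2(gaps)

print(sob1([3, 3, 3]), sob2([3, 3, 3]))  # 1.0 1.0 -- perfectly balanced
print(sob2([0, 4, 5]))                   # 0.0 -- collision imminent
```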
[0040] In some embodiments, the command-independent term Q_L(s) includes a composition of a gap-balancing indicator SoB and a step-like function, such as a smooth activation function. This may improve the stability of the traffic system when controlled by outputs of the method 100. Suitable step-like functions may be a rectified linear unit (ReLU) activation function or a modified ReLU,

mReLU(SoB(s); thr, k) = −k·max(thr − SoB(s), 0), where 0<thr≤1 and k>0,

a Gaussian function (or radial basis function),

Q_L(s) = Q_L^min·e^(−ε·SoB(s)²),

a sigmoid function, such as a logistic function, scaled variants of the above functions, or a combination of these.
[0041] An advantageous option is the following combination of two modified ReLUs composed with the above-defined gap-balancing indicators SoB_1, SoB_2:

Q_L(s) = mReLU(SoB_1(s); thr_D, k_D) + mReLU(SoB_2(s); thr_mingap, k_mingap)

For a maximally unbalanced state s_0, one has Q_L(s_0) = −k_D·thr_D − k_mingap·thr_mingap. The thresholds thr_D, thr_mingap represent satisfactory gap balancing, beyond which no further improvement is deemed meaningful or worthwhile. The slopes k_D, k_mingap are preferably set large enough that an improvement of SoB in Q_L(s) outweighs the reward in Q_S(s,a) for moving one vehicle between two waypoints, so that the long-term memory is certain to influence the determination of motion commands. Suitable slopes may be found by trial and error, possibly in view of the number of vehicles, the geometry of the traffic system and the definition of the reward function R(s,a). This proposed combination of two modified ReLUs manages two desirable goals at once: it balances the gaps (i.e., distributes them evenly among the vehicles) and keeps vehicles from travelling so close together that a queue may form.
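Composing the modified ReLU, as reconstructed above, with the indicators from the earlier sketch gives a compact Q_L. The threshold and slope values below are placeholders to be tuned by trial and error, as the text suggests.

```python
def mrelu(x, thr, k):
    """Modified ReLU: a penalty of slope k below the threshold thr and
    exactly zero above it, so a maximally unbalanced state (x = 0)
    contributes -k * thr."""
    return -k * max(thr - x, 0.0)

def q_long(gaps, thr_d=0.8, k_d=50.0, thr_mingap=0.9, k_mingap=50.0):
    """Command-independent term Q_L combining the two indicators sob1
    and sob2 from the earlier sketch; all four parameters are assumed
    tuning values, not values from the disclosure."""
    return (mrelu(sob1(gaps), thr_d, k_d)
            + mrelu(sob2(gaps), thr_mingap, k_mingap))
```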
[0042] In a second step 112 of the method 100, the initial node occupancies of the vehicles are obtained. The node occupancies may be represented as a data structure associating each vehicle with a planning node, similar to the occupancy function O(⋅) introduced above. Ways of obtaining the node occupancies are described below in connection with the figures.
[0043] In a third step 114, a sequence of motion commands a is determined by optimizing the state-action value function Q(s,a). The optimization process may include executing a planning algorithm of any of the types discussed above. It is noted that the optimization process may be allowed to execute until a convergence criterion is met, until an optimality criterion is fulfilled and/or until a predefined time has elapsed. Actually reaching optimality is not a necessary condition under the present invention. The optimization process is normally restricted to a planning horizon (or search depth), which may be determined in view of a computational budget (see the applicant's co-pending application EP21175955.0) or fixed.
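As a stand-in for the planning algorithms alluded to, a depth-limited exhaustive search over command sequences illustrates the optimization in step 114. The simulation model step(s, a) and the deterministic state evolution are assumptions of this sketch.

```python
def plan(s0, q, step, commands, depth):
    """Return the command sequence of length `depth` that maximizes the
    accumulated state-action value Q. `q(s, a)` is the value function,
    `step(s, a)` an assumed model returning the next node occupancies."""
    best = (float("-inf"), [])

    def recurse(s, seq, value):
        nonlocal best
        if len(seq) == depth:
            if value > best[0]:
                best = (value, list(seq))
            return
        for a in commands:
            recurse(step(s, a), seq + [a], value + q(s, a))

    recurse(s0, [], 0.0)
    return best[1]
```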
[0044] In a fourth step 116 of the method 100, the command-dependent term Q_S(s,a) of the state-action value function Q(s,a) is updated on the basis of the reward function R(s,a) after every iteration of the third step 114. Alternatively, a simplified reward function R̃(s,a) can be used which largely reflects the same desiderata as R(s,a) but is cheaper to evaluate. The command-independent term Q_L(s) is preferably exempted from said iterative updating 116, recalling that its role is to act as the long-term memory of the traffic planning method 100. This does not rule out the possibility of adjusting the command-independent term Q_L(s) during ongoing traffic planning, though preferably this is done less frequently than at every execution cycle of the command-determining step 114.
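The exemption of Q_L from the iterative updating can be made concrete with a Q-learning-style update that touches only the short-term table of the StateActionValue sketch above; the learning rate and discount factor are assumed values, and the disclosure equally allows temporal-difference variants.

```python
def update_q_short(qsa, s, a, s_next, r, commands, lr=0.1, gamma=0.95):
    """One update of the command-dependent term Q_S(s, a) after an
    execution cycle of step 114. Only qsa.q_short is modified;
    qsa.q_long is exempt, as the text prescribes."""
    best_next = max(qsa.q_short.get((s_next, b), 0.0) for b in commands)
    old = qsa.q_short.get((s, a), 0.0)
    qsa.q_short[(s, a)] = old + lr * (r + gamma * best_next - old)
```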
[0045] Unlike steps 112, 114, 116, which may collectively be referred to as a planning call, step 110 may constitute a training phase taking place prior to actual operation, and it need not be repeated for each commissioned copy of a traffic planner product.
[0046] In some embodiments of the method 100, the reinforcement learning step 110 and the command-determining step 114 are performed within a Dyna-2 algorithm or an equivalent algorithm. A characteristic of the Dyna-2 algorithm is that learning occurs both while the machine-learning model is being built (training phase) and while it interacts with the system to be controlled.
[0051] The figures further show a device 200 configured to control a plurality of vehicles. The device 200 has a first interface configured to receive initial node occupancies of the vehicles.
[0052] The device 200 further has a second interface 220 configured to feed motion commands selected from said finite set of motion commands to said plurality of vehicles, as well as processing circuitry 230 configured to perform the method 100 described above.
[0053] A possible behavior of the planning in step 114 is illustrated in FIG. 3, in which two vehicles v1, v2 move in a road network modeled by nine planning nodes. The two inter-vehicle gaps may be computed as

[(O(v1)−O(v2)) mod 9]−1 and [(O(v2)−O(v1)) mod 9]−1,

where O(⋅) is the occupancy function introduced above. The vehicles v1, v2 are not allowed to occupy planning nodes 5 and 8 contemporaneously.
TABLE 1: Future node occupancies in FIG. 3 at epochs t0, t1, t2 and t3, for option a) and option b).
It is seen that option a) will lead to a more even distribution of the inter-vehicle gaps at the final epoch t3. Unlike option b), it also does not allow the minimum gap to reach 1 for any epoch. Therefore, from the point of view of inter-vehicle gap balancing, option a) is perceived as the more advantageous one and is likely to be preferred by the traffic planner.
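The gap computation of this example can be verified directly; the sketch below assumes the nine planning nodes form a cycle, consistent with the mod-9 arithmetic.

```python
def ring_gaps(o_v1, o_v2, n_nodes=9):
    """The two inter-vehicle gaps of FIG. 3, counting the free planning
    nodes between the vehicles in each direction of the cycle."""
    return ((o_v1 - o_v2) % n_nodes - 1,
            (o_v2 - o_v1) % n_nodes - 1)

print(ring_gaps(0, 4))  # (4, 3): 4 + 3 free nodes + 2 occupied = 9
```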
[0055] The aspects of the present disclosure have mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the invention, as defined by the appended patent claims.