MULTI-AGV MOTION PLANNING METHOD, DEVICE AND SYSTEM
20220317695 · 2022-10-06
Assignee
Inventors
- Xueqiang WANG (Shandong, CN)
- Yifan ZHANG (Shandong, CN)
- Libing ZOU (Shandong, CN)
- Fuqiang ZHANG (Shandong, CN)
Cpc classification
G06N7/01
PHYSICS
G05D1/0214
PHYSICS
G06N3/0442
PHYSICS
B60W60/0011
PERFORMING OPERATIONS; TRANSPORTING
G05D1/0088
PHYSICS
International classification
G05D1/00
PHYSICS
Abstract
A multi-AGV motion planning method, device and system are disclosed. The method of the present disclosure comprises: establishing an object model through reinforcement learning; building a neural network model based on the object model, performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model until a stable neural network model is obtained; setting an action constraint rule; and after the motion planning is started, inputting the state of current AGV, states of other AGVs and permitted actions in a current environment into the neural network model after trained, obtaining the evaluation indexes of a motion planning result output by the neural network model, obtaining an action to be executed of the current AGV according to the evaluation indexes, and performing validity judgment on the action to be executed using the action constraint rule.
Claims
1. A multi-AGV motion planning method, comprising: establishing an object model for describing a sequence decision process of multi-AGV motion planning through a reinforcement learning method, wherein the object model includes: an AGV state, an action space, and evaluation indexes of a motion planning result, and the AGV state includes a state of current AGV, states of other AGVs calculated based on the state of current AGV, and permitted actions of a current AGV; building a neural network model based on the object model, performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model until a stable neural network model is obtained; setting an action constraint rule, and using the action constraint rule to perform validity judgment on an action to be executed of the current AGV obtained based on the evaluation indexes; and after the motion planning is started, inputting the state of current AGV, states of other AGVs and permitted actions in a current environment into the neural network model after trained, obtaining the evaluation indexes of a motion planning result output by the neural network model, obtaining an action to be executed of the current AGV according to the evaluation indexes, and performing validity judgment on the action to be executed using the action constraint rule, so that the current AGV executes a valid action according to a judgment result.
2. The method according to claim 1, wherein performing validity judgment on the action to be executed using the action constraint rule, so that the current AGV executes a valid action according to a judgment result comprises: judging whether the action to be executed is a permitted action of the current AGV; if the action to be executed is a permitted action of the current AGV, executing the action to be executed by the current AGV; if the action to be executed is not a permitted action of the current AGV, calculating and executing a valid action of the current AGV according to the evaluation indexes of the motion planning result output by the neural network model and the permitted action of the current AGV.
3. The method according to claim 2, wherein if the action to be executed is not a permitted action of the current AGV, calculating and executing a valid action of the current AGV according to the evaluation indexes of the motion planning result output by the neural network model and the permitted action of the current AGV comprises: extracting a plurality of first permitted actions included in an optimal path from the action space of the current AGV; selecting a second permitted action satisfying a preset condition from the plurality of first permitted actions using the evaluation indexes of the motion planning result output by the neural network model; and calculating whether a collision will occur if the current AGV performs the second permitted action according to states of other AGVs, and if a collision will occur, executing a stationary action by the current AGV; if a collision will not occur, executing the second permitted action by the current AGV.
4. The method according to claim 3, wherein the evaluation indexes of the motion planning result include: a strategy π for describing selection of an action a.sub.t by the current AGV in a joint state at a moment t and a value function V for describing an expected return by adopting the strategy π in a joint state, and the joint state is a state composed of the state of current AGV, states of other AGVs and permitted actions; selecting a second permitted action satisfying a preset condition from the plurality of first permitted actions using the evaluation indexes of the motion planning result output by the neural network model comprises: calculating a probability value corresponding to each first permitted action using the strategy π, and selecting a first permitted action corresponding to the largest probability value as the second permitted action.
5. The method according to claim 1, wherein the object model is described using a Markov decision process, the permitted actions are obtained according to an optimal path strategy and an action space, and the action space is a speed and an angle in a body polar coordinate system of the current AGV; the state of current AGV includes: a speed of the current AGV in the body polar coordinate system, a target position, a size of current AGV, an average speed, and a road network distance between the current AGV and the target position; the states of other AGVs include: relative positions of other AGVs in the body polar coordinate system of the current AGV, relative positions of targets of other AGVs in the body polar coordinate system of the current AGV, relative speeds of other AGVs in the body polar coordinate system of the current AGV, road network distances between other AGVs and their targets, sizes of other AGVs, a sum of the sizes of other AGVs and the size of current AGV, and road network distances between other AGVs and the current AGV.
6. The method according to claim 5, wherein the object model further includes: a reward for describing a reward given to the current AGV for taking an action, the reward includes at least three description types: a first description type which is a reward given when the current AGV reaches or approaches the target position, a second description type which is a penalty given when the current AGV collides with or approaches other AGVs, and a third description type which is a penalty given when the current AGV deviates from the road network; wherein the first description type is set in the following way: when the current AGV reaches the target position p.sub.g, giving a positive maximum reward value a; setting a first discount coefficient i based on a road network distance, calculating a positive discount reward value i*a and a negative discount reward value −i*a from the first discount coefficient i and the maximum reward value a, and when the current AGV approaches the target position p.sub.g, giving the positive discount reward value i*a; when the current AGV is stationary or far away from the target position p.sub.g, giving a negative discount reward value −i*a; the second description type is set in the following way: when distances between the current AGV and other AGVs are less than a first threshold condition t.sub.agv, giving a maximum penalty value β; setting a second discount coefficient j based on a distance, calculating a discount penalty value j*β from the second discount coefficient j and the maximum penalty value β, and when distances between the current AGV and other AGVs are greater than the first threshold condition t.sub.agv and less than a second threshold condition m*t.sub.agv, giving the discount penalty value j*β; and when the distances between the current AGV and other AGVs are greater than the second threshold condition m*t.sub.agv, giving no penalty; m is a preset multiple; the third description type is set in the following way: when a distance between the current AGV and a current path is not less than a third threshold condition t.sub.rn, giving a penalty δ; when the distance between the current AGV and the current path is less than the third threshold condition t.sub.rn, giving no penalty.
7. The method according to claim 6, wherein building a neural network model based on the object model comprises: building the neural network model using a fully connected neural network, wherein an input of the neural network model includes the state of current AGV, the states of other AGVs, and the permitted actions of the current AGV; inputting the permission actions of the current AGV to a first single hidden layer fully connected neural network, inputting the state of current AGV to a second single hidden layer fully connected neural network, inputting the states of other AGVs to a long short-term memory network, and then to a third single hidden layer fully connected neural network, concatenating outputs of three single hidden layer fully connected neural networks into a tensor, then inputting the tensor to a double hidden layer fully connected neural network, and outputting the evaluation indexes of the motion planning result by the double hidden layer fully connected neural network; a dimension of the third single hidden layer fully connected neural network is greater than a dimension of the first single hidden layer fully connected neural network, and the dimension of the third single hidden layer fully connected neural network is greater than a dimension of the second single hidden layer fully connected neural network.
8. The method according to claim 7, wherein before inputting the states of other AGVs to a long short-term memory network, the method further comprises: extracting distances between other AGVs and the current AGV, sorting the states of other AGVs according to a distance from the current AGV in descending order, and then inputting the states of other AGVs to the long short-term memory network.
9. The method according to claim 7, wherein building a neural network model based on the object model further comprises: forming a single hidden layer fully connected neural network by serializing a fully connected layer and an activation function.
10. The method according to claim 1, wherein performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model comprises: setting at least one environment, wherein multiple AGVs are deployed in each environment, and a quantity of maps and/or road networks and/or AGVs in each environment is different; and inputting the state of current AGV, states of other AGVs and permitted actions in each environment into the neural network model for training, so as to update parameters of the neural network model.
11. The method according to claim 6, wherein at a moment t, obtaining a reward R.sub.t given to the current AGV for taking an action in the following way based on rewards of three description types:
R.sub.t=R.sub.goal+R.sub.agv+R.sub.rn where R.sub.goal is the first description type, R.sub.agv is the second description type, and R.sub.rn is the third description type.
12. A multi-AGV motion planning device, comprising: a modeling unit for establishing an object model for describing a sequence decision process of multi-AGV motion planning through a reinforcement learning method, wherein the object model includes: an AGV state, an action space, and evaluation indexes of a motion planning result, and the AGV state includes a state of current AGV, states of other AGVs calculated based on the state of current AGV, and permitted actions of a current AGV; a training unit for building a neural network model based on the object model, performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model until a stable neural network model is obtained; a setting unit for setting an action constraint rule, and using the action constraint rule to perform validity judgment on an action to be executed of the current AGV obtained based on the evaluation indexes; and an implementation unit for, after the motion planning is started, inputting the state of current AGV, states of other AGVs and permitted actions in a current environment into the neural network model after trained, obtaining the evaluation indexes of a motion planning result output by the neural network model, obtaining an action to be executed of the current AGV according to the evaluation indexes, and performing validity judgment on the action to be executed using the action constraint rule, so that the current AGV executes a valid action according to a judgment result.
13. The device according to claim 12, wherein the implementation unit comprises a judgment module, a first action execution module, and a second action execution module; the judgment module is for judging whether the action to be executed is a permitted action of the current AGV; the first action execution module is for, when the action to be executed is a permitted action of the current AGV, causing the current AGV to execute the action to be executed; the second action execution module is for, when the action to be executed is not a permitted action of the current AGV, calculating and executing a valid action of the current AGV according to the evaluation indexes of the motion planning result output by the neural network model and the permitted action of the current AGV.
14. The device according to claim 12, wherein the training unit comprises a model building module for building the neural network model using a fully connected neural network, and an input of the neural network model includes the state of current AGV, the states of other AGVs, and the permitted actions of the current AGV; wherein the permission actions of the current AGV are input to a first single hidden layer fully connected neural network, the state of current AGV is input to a second single hidden layer fully connected neural network, the states of other AGVs are input to a long short-term memory network, and then to a third single hidden layer fully connected neural network, outputs of three single hidden layer fully connected neural networks are concatenated into a tensor, then input to a double hidden layer fully connected neural network, and the double hidden layer fully connected neural network outputs the evaluation indexes of the motion planning result; a dimension of the third single hidden layer fully connected neural network is greater than a dimension of the first single hidden layer fully connected neural network, and the dimension of the third single hidden layer fully connected neural network is greater than a dimension of the second single hidden layer fully connected neural network.
15. A multi-AGV motion planning system, comprising: a control center for deploying a multi-AGV motion planning device; an AGV group composed of multiple AGVs and deployed in an environment; and a task scheduling center platform; wherein the multi-AGV motion planning device comprises: a modeling unit for establishing an object model for describing a sequence decision process of multi-AGV motion planning through a reinforcement learning method, wherein the object model includes: an AGV state, an action space, and evaluation indexes of a motion planning result, and the AGV state includes a state of current AGV, states of other AGVs calculated based on the state of current AGV, and permitted actions of a current AGV; a training unit for building a neural network model based on the object model, performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model until a stable neural network model is obtained; a setting unit for setting an action constraint rule, and using the action constraint rule to perform validity judgment on an action to be executed of the current AGV obtained based on the evaluation indexes; and an implementation unit for, after the motion planning is started, inputting the state of current AGV, states of other AGVs and permitted actions in a current environment into the neural network model after trained, obtaining the evaluation indexes of a motion planning result output by the neural network model, obtaining an action to be executed of the current AGV according to the evaluation indexes, and performing validity judgment on the action to be executed using the action constraint rule, so that the current AGV executes a valid action according to a judgment result; the AGVs in the AGV group use their own sensors to obtain their own states and upload them to the control center, and receive action instructions issued by the control center, and execute actions corresponding to the action instructions; the task scheduling center platform is configured to complete task planning and task dispatch and send a task of a certain AGV to the control center, and the control center controls the AGV to complete the task; the control center has a built-in stable neural network model, inputs a state of an AGV received into the neural network model, uses the neural network model to calculate a motion planning strategy of the AGV, and then generates action instructions according to the motion planning strategy and issues them to the AGV.
16. The device according to claim 15, wherein the implementation unit comprises a judgment module, a first action execution module, and a second action execution module; the judgment module is for judging whether the action to be executed is a permitted action of the current AGV; the first action execution module is for, when the action to be executed is a permitted action of the current AGV, causing the current AGV to execute the action to be executed ; the second action execution module is for, when the action to be executed is not a permitted action of the current AGV, calculating and executing a valid action of the current AGV according to the evaluation indexes of the motion planning result output by the neural network model and the permitted action of the current AGV.
17. The system according to claim 15, wherein the training unit comprises a model building module for building the neural network model using a fully connected neural network, and an input of the neural network model includes the state of current AGV, the states of other AGVs, and the permitted actions of the current AGV; wherein the permission actions of the current AGV are input to a first single hidden layer fully connected neural network, the state of current AGV is input to a second single hidden layer fully connected neural network, the states of other AGVs are input to a long short-term memory network, and then to a third single hidden layer fully connected neural network, outputs of three single hidden layer fully connected neural networks are concatenated into a tensor, then input to a double hidden layer fully connected neural network, and the double hidden layer fully connected neural network outputs the evaluation indexes of the motion planning result; a dimension of the third single hidden layer fully connected neural network is greater than a dimension of the first single hidden layer fully connected neural network, and the dimension of the third single hidden layer fully connected neural network is greater than a dimension of the second single hidden layer fully connected neural network.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0025] The present invention will hereinafter be described in conjunction with the following drawing figures, wherein like numerals denote like elements, and:
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
DETAILED DESCRIPTION
[0035] The following detailed description is merely exemplary in nature and is not intended to limit the invention or the application and uses of the invention. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description.
[0036] Herein the exemplary embodiments will be described in detail, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numerals in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the present disclosure as recited in the appended claims.
[0037] At present, when the multi-AGV motion planning is performed in a dynamic dense environment, there are many problems to face such as an increase in search space and dynamic changes in the environment, and the difficulty will increase exponentially as the scale of the environment increases and the number of AGVs increases. Conventional motion planning algorithms have problems such as difficulty in solving and excessive computational cost. Moreover, in dense environments, the conventional methods are likely to cause the deadlock of AGV and cannot complete the motion planning well.
[0038] Based on the above description, the embodiments of the present disclosure aim to solve the problems in the multi-AGV motion planning in a dynamic environment based on the deep reinforcement learning method, and combine the deep reinforcement learning method with the optimal path search method to solve the problems in the multi-AGV motion planning in a dense environment, and thus complete the multi-AGV motion planning task in a dynamic dense environment.
[0039] Referring to
[0040] Step S110, establishing an object model for describing a sequence decision process of multi-AGV motion planning through a reinforcement learning method, wherein the object model includes: an AGV state, an action space, and evaluation indexes of a motion planning result, and the AGV state includes a state of current AGV, states of other AGVs calculated based on the state of current AGV, and permitted actions of a current AGV;
[0041] In this step, the permitted actions of the current AGV are obtained according to the optimal path strategy and action space.
[0042] Step S120, building a neural network model based on the object model, performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model until a stable neural network model is obtained;
[0043] Step S130, setting an action constraint rule, and using the action constraint rule to perform validity judgment on an action to be executed of the current AGV obtained based on the evaluation indexes;
[0044] In this step, the action constraint rule can judge the validity of the action to be executed of the current AGV. When the action to be executed is judged to be a valid action, after the motion planning is started, the current AGV can directly execute the action to be executed. When the action to be executed is judged to be an invalid action, after the motion planning is started, the current AGV needs to extract multiple executable actions contained in the optimal path of the current AGV from the action space, and select the corresponding action to execute from the multiple executable actions.
[0045] Step S140, after the motion planning is started, inputting the state of current AGV, states of other AGVs and permitted actions in a current environment into the neural network model after trained, obtaining the evaluation indexes of a motion planning result output by the neural network model, obtaining a current AGV's action to be executed according to the evaluation indexes, and performing validity judgment on the action to be executed using the action constraint rule, so that the current AGV executes a valid action according to a judgment result.
[0046] As shown in
[0047] Taking a multi-AGV motion environment as an example, the implementation steps of the current AGV motion planning method in a multi-AGV motion environment will be specifically described.
[0048] Refer to the multi-AGV motion environment shown in
[0049] After building the multi-AGV motion environment, the motion planning algorithm design is carried out. The motion planning algorithm design includes: construction of object model, design and training of neural network model, and motion constraint design.
Construction of Object Model
[0050] In an embodiment, the above step S110 specifically uses a Markov decision process (MDP) to describe the object model, and the MDP is a mathematical model of sequential decision, which is used to simulate the random strategy and rewards that can be achieved by the agent in an environment in which the system state has Markov properties.
[0051] In the multi-AGV motion environment of the present embodiment, the object model includes: an AGV state, an action space, and evaluation indexes of the motion planning result. The AGV state includes a state o.sub.a of current AGV, states õa of other AGVs, and permitted actions a.sub.a of the current AGV. The evaluation indexes of the motion planning result include: a strategy π and a value function V.
[0052] The state o.sub.a of current AGV includes: a speed of the current AGV in the body polar coordinate system, a target position, a size of current AGV, an average speed, and a road network distance between the current AGV and the target position.
[0053] The states õ.sub.a of other AGVs include: relative positions of other AGVs in the body polar coordinate system of the current AGV, relative positions of targets of other AGVs in the body polar coordinate system of the current AGV, relative speeds of other AGVs in the body polar coordinate system of the current AGV, road network distances between other AGVs and their targets, sizes of other AGVs, a sum of the sizes of other AGVs and the size of current AGV, and road network distances between other AGVs and the current AGV.
[0054] The permitted actions a.sub.a of the current AGV are obtained according to the optimal path strategy and action space.
[0055] The action space a is the speed and angle in the body polar coordinate system of the current AGV.
[0056] The strategy π is a strategy for describing selection of an action a.sub.t by the current AGV in a joint state at a moment t.
[0057] The value function V is for describing an expected return by adopting the strategy π in a joint state o.sub.t.
[0058] The joint state o.sub.t is a state composed of the state o.sub.a of current AGV, state õ.sub.a of other AGVs and permitted actions a.sub.a.
[0059] In an embodiment, the object model further includes: a reward r for describing a reward given to the current AGV for taking an action a.sub.t, and a discount factor γ. The discount factor γ is an attenuation factor used when calculating the reward obtained by the current AGV for performing the action a.sub.t, and is used to adjust an output of the value function.
[0060] The specific contents of the above object model are as follows:
1. State s
[0061] The AGV state s.sub.a includes the AGV position, speed, target position, size, average speed of the latest n (n is an integer greater than 1) states, and the road network distance between the AGV and the target. The way to obtain the AGV state s.sub.a does not rely on a specific sensor. It can be expressed as:
s.sub.a=[p.sub.x, p.sub.y, v.sub.x, v.sub.y, p.sub.gx, p.sub.gy, r, v.sub.m, d.sub.grn] ∈R.sup.9
[0062] where, in the Euclidean coordinate system Σ.sub.e based on the multi-AGV motion environment, (p.sub.x, p.sub.y) represents the position of the AGV, (v.sub.x, v.sub.y) represents the speed of the AGV, (p.sub.gx, p.sub.gy) represents the target position, r represents the size of the AGV, v.sub.m represents the average speed of the latest n states, d.sub.grn represents the road network distance between the AGV and the target.
[0063] Similarly, relative to the current AGV, the states {tilde over (s)}.sub.a of other AGVs in the multi-AGV motion environment are expressed as follows:
{tilde over (s)}.sub.a=[{tilde over (s)}.sub.a,1, . . . , {tilde over (s)}.sub.a,i, . . . , {tilde over (s)}.sub.a,m] ∈R.sup.9×m
[0064] where {tilde over (s)}.sub.a,i represents the state of the i-th AGV and is expressed as:
{tilde over (s)}.sub.a,i=[{tilde over (p)}.sub.x, {tilde over (p)}.sub.yl , {tilde over (v)}.sub.x, {tilde over (v)}.sub.y, {tilde over (p)}.sub.gx, {tilde over (p)}.sub.gv, {tilde over (r)}, {tilde over (v)}.sub.m, {tilde over (d)}.sub.grn] ∈R.sup.9
Observation o:
[0065] In order to adapt to different motion environments, avoid the impact of coordinate transformation, and achieve a decentralized training method, the present embodiment transforms the AGV state expressed in the global Euclidean coordinate system Σ.sub.e into the AGV observation state, i.e., the state o.sub.a of current AGV, in the body polar coordinate system Σ.sub.p of the current AGV, which is expressed as:
o.sub.a=[r.sub.v, θ.sub.v, r.sub.g, θ.sub.g, r, v.sub.m, d.sub.grm] ∈R.sup.7
[0066] where (r.sub.v, θ.sub.v) represents the polar coordinates of the speed of the current AGV, (r.sub.g, θ.sub.g) represents the polar coordinates of the target of the current AGV, r represents the size of current AGV, v.sub.m represents the average speed of the latest n states, d.sub.grn represents the road network distance between the current AGV and the target.
[0067] Similarly, the states of other AGVs in the environment observed under the body polar coordinate system of the current AGV, i.e., the states of other AGVs, are expressed as:
õ.sub.a=[õ.sub.a,1, . . . , õ.sub.a,i, . . . , õ.sub.a,m].sup.T ∈ R.sup.10×m
[0068] where õ.sub.a,i represents the observation state of the i-th AGV, which can be expressed as:
õ.sub.a,i=[{tilde over (r)}.sub.p, {tilde over (θ)}.sub.p, , {tilde over (θ)}.sub.v, {tilde over (r)}.sub.g, {tilde over (θ)}.sub.g, {tilde over (d)}.sub.grn, {tilde over (r)}, r+{tilde over (r)}, {tilde over (d)}.sub.arn] ∈ R.sup.10
[0069] where ({tilde over (r)}.sub.p, {tilde over (θ)}.sub.p) represents the polar coordinates of the relative position of the i-th AGV in Σ.sub.p, (,
) represents the polar coordinates of the relative speed of the i-th AGV in Σ.sub.p, ({tilde over (r)}.sub.h, {tilde over (θ)} .sub.g) represents the polar coordinates of the relative position of the target of the i-th AGV in Σ.sub.p, {tilde over (d)}.sub.grn represents the road network distance between the i-th AGV and the target, {tilde over (r)} represents the size of the i-th AGV, r+{tilde over (r)} represents the sum of the sizes of the i-th AGV and the current AGV, and {tilde over (d)}.sub.arn represents the road network distance between the i-th AGV and the current AGV.
Permitted Actions a.SUB.a.:
[0070] The AGV travels along the optimal path from the starting position to the target position in the road network. Therefore, there is a constraint relationship between the permitted action of the AGV at the current position and the optimal path at the current position, which is expressed as that the direction of the permitted action is consistent with the direction of the optimal path. Based on this, the present embodiment uses the action space a and the optimal path direction θ.sub.or to encode and obtain the permitted action a.sub.a of the current AGV at the current position. One way of encoding the action space a and the optimal path direction θ.sub.or is:
a.sub.aa∩θ.sub.or ∈R
[0071] In the present embodiment, the optimal path (or optimal path direction θ.sub.or) of the AGV can be obtained in the road network using a path search algorithm according to the starting position and the target position. The path search algorithm includes the A* algorithm (A* is a direct search method for solving the shortest path in a static road network), Dijkstra algorithm (shortest path algorithm), M* algorithm (also known as Big M algorithm).
2. Action Space a
[0072] The action space a taken by the current AGV refers to the speed v and angle θ in the body polar coordinate system Σ.sub.p of the current AGV. The action space a is described based on the body polar coordinate system Σ.sub.p and is not affected by the coordinate transformation of the global coordinate system of different environments. It is expressed as follows:
a=[v,θ] ∈R.sup.2
3. State Transition Model p
[0073] It is the probability distribution p(o.sub.t+1,o.sub.t|a.sub.t) of transitioning to the next joint state o.sub.t by the current AGV taking the action a.sub.t in the current joint state o.sub.t.
4. Reward r
[0074] The reward function describes the rewards given to the current AGV for taking the action a.sub.t in the multi-AGV motion environment, and includes at least three description types. The first description type R.sub.goal is a reward given when the current AGV reaches or approaches the target position. The second description type R.sub.agv is a penalty given when the current AGV collides with or approaches other AGVs. The third description type R.sub.rn is a penalty given when the current AGV deviates from the road network.
[0075] The first description type R.sub.goal is set in the following way:
[0076] when the current AGV reaches the target position p.sub.g, a positive maximum reward value a is given;
[0077] when the current AGV approaches the target position p.sub.g, the positive discount reward value i*a is given;
[0078] when the current AGV is stationary or far away from the target position p.sub.g, a negative discount reward value −i*a is given,
[0079] wherein a first discount coefficient i is set based on a road network distance, and a positive discount reward value i*a and a negative discount reward value −i*a are calculated from the first discount coefficient i and the maximum reward value a.
[0080] The second description type R.sub.agv is set in the following way:
[0081] when distances between the current AGV and other AGVs are less than a first threshold condition t.sub.agv, a maximum penalty value β is given;
[0082] when distances between the current AGV and other AGVs are greater than the first threshold condition t.sub.agv and less than a second threshold condition m*t.sub.agv, the discount penalty value j*β is given;
[0083] when the distances between the current AGV and other AGVs are greater than the second threshold condition m*t.sub.agv, no penalty will be given; m is a preset multiple,
[0084] wherein a second discount coefficient j is set based on a distance, and a discount penalty value j*β is calculated from the second discount coefficient j and the maximum penalty value β.
[0085] The third description type R.sub.rn is set in the following way:
[0086] when a distance d.sub.rn between the current AGV and a current path is not less than a third threshold condition t.sub.rn, a penalty δ is given;
[0087] when the distance d.sub.rn between the current AGV and the current path is less than the third threshold condition t.sub.rn, no penalty is given.
[0088] At a moment t, the complete reward function R.sub.t is:
R.sub.t=R.sub.goal+R.sub.agv+R.sub.rn
5. Discount Factor γ
[0089] It is an attenuation factor used when calculating the reward obtained by the current AGV for performing the action a.sub.t, and is used to adjust an output of the value function, γ ∈[0,1).
6. Strategy π
[0090] According to the current AGV's joint state o.sub.t at the moment t, the strategy π of selecting the action a.sub.t by the current AGV can be described as:
π: (a.sub.t|o.sub.t; w.sub.t)
[0091] where w.sub.t represents the parameters of the current neural network model. The goal of the strategy π is to minimize the expected time for the current AGV to reach the target position, and avoid collisions with other AGVs and avoid deviation from the road network.
7. Value function V
[0092] It is the expected return of the current AGV adopting the strategy π in the joint state o.sub.t, which can be described as:
V (o.sub.t; w.sub.t)=E[R.sub.t|o.sub.t]
[0093] The parameters described above can be used to describe the object model based on the Markov decision process, and thus complete the construction of the object model.
Design and Training of Neural Network Model
[0094] In the present embodiment, the neural network model can be constructed by the following methods.
[0095] A fully connected neural network is used to build a neural network model. As shown in
[0096] the permission actions of the current AGV are input to a first single hidden layer fully connected neural network,
[0097] the state of current AGV is input to a second single hidden layer fully connected neural network,
[0098] the states of other AGVs are input to a long short-term memory (LSTM) network, and then to a third single hidden layer fully connected neural network,
[0099] the outputs of three single hidden layer fully connected neural networks are concatenated into a tensor, then input to a double hidden layer fully connected neural network, and the double hidden layer fully connected neural network outputs the evaluation indexes of the motion planning result;
[0100] the output of the neural network model is the output of the double hidden layer fully connected neural network.
[0101] wherein the dimension of the third single hidden layer fully connected neural network is greater than the dimension of the first single hidden layer fully connected neural network, and the dimension of the third single hidden layer fully connected neural network is greater than the dimension of the second single hidden layer fully connected neural network. The fully connected layer and the activation function ReLU are serialized to form a single hidden layer fully connected neural network. Namely, the fully connected layer and ReLU are serialized to form the first single hidden layer fully connected neural network, the second single hidden layer fully connected neural network, and the third single hidden layer fully connected neural network.
[0102] Since in the present embodiment, the tensor dimension corresponding to the states of other AGVs is much larger than the tensor dimension of the permitted actions and the state of current AGV, if the states of other AGVs, the state of current AGV and the permitted actions are directly concatenated into a tensor and input to the neural network model, it will affect the performance of the neural network model due to the imbalance of dimensions of the input data, and results in poor output results. However, in the present embodiment, each of the three input tensors is first input to the corresponding single hidden layer fully connected neural network, and the tensors input are dimensionally balanced through the single hidden layer fully connected neural networks, and then concatenated into one tensor and input to the double hidden layer fully connected neural network, thereby improving the performance of the entire model.
[0103] In an embodiment, before inputting the states of other AGVs to the long short-term memory network, the distances between other AGVs and the current AGV are extracted first, the states of other AGVs are sorted according to the distance from the current AGV in descending order, and then input the sorted states of other AGVs into the LSTM.
[0104] In an embodiment, a deep reinforcement learning training framework can be established based on the GA3C training framework, and a neural network model can be built. The deep reinforcement learning training framework established is composed of three parts: an agent composed of multiple AGVs, a data queue, and a GPU-based neural network model.
[0105] It should be noted that GA3C refers to the Asynchronous Advantage Actor-Critic (A3C) algorithm applied to the Graphics Processing Unit (GPU). The GA3C framework uses GPUs for reinforcement learning training, and the model training speed and performance can be enhanced.
[0106] In a preferred embodiment, multiple parallel computing GPUs are added to the GA3C training framework to establish a multi-GPU based neural network model. Correspondingly, the deep reinforcement learning training framework is composed of an agent, a data queue, and a multi-GPU neural network model.
[0107] As shown in
[0108] an agent composed of multiple AGVs which is used to interact with the external environment, obtain real-time data such as the state and actions of the AGV, and provide data for building a deep reinforcement learning training framework.
[0109] The data queue includes a predictor and a trainer, and the training data and prediction data obtained are stored in the data queue.
[0110] Among them, the predictor selects the action a.sub.t according to the current strategy π, collects the AGV state and action space as training data, and inputs them to the trainer for training; the predictor inputs the state of current AGV to the neural network model, obtains the strategy π and value function V from the neural network model, select the action a.sub.t, and calculate the reward r to obtain the prediction data; the trainer inputs the state of current AGV, permitted actions, and states other AGVs to the neural network model and trains the neural network model to update the parameters of the neural network model.
[0111] In an embodiment, the step S120 in
[0112] It is understandable that different environments may be different, partially same or completely consistent in various aspects such as paths, restricted areas, quantity of AGVs, and targets. Increasing the environment types can promote the training of algorithms in different environments and ensure the applicability of the model.
[0113] In an embodiment, the multi-GPU neural network model is composed of multiple parallel computing GPUs to form a multi-GPU neural network model based on deep reinforcement learning.
[0114] In an embodiment, the process of updating the parameters of the neural network model comprises:
[0115] using the state of current AGV, permitted actions, and states of other AGVs as training data; randomly initializing neural network model parameters, and calculating a strategy loss function f.sub.π(w) and a value function loss function f.sub.v(w) based on the action a.sub.t obtained and the reward r obtained; updating parameters of the neural network model in each GPU through a back propagation algorithm to train the neural network model; detecting change processes of the reward r accumulated, the strategy loss function f.sub.π(w) and the value function loss function f.sub.v(w) over a period of time, and judging performance of the neural network model according to the reward and loss functions until a stable neural network model is obtained.
[0116] In the present embodiment, the higher the accumulated reward, the better the performance of the neural network model, and the lower the value of the loss function, the better the performance of the neural network model. When the values of the reward and loss functions both reach a stable interval, a stable neural network model can be obtained.
[0117] Among them, the strategy loss function f.sub.π(w) is expressed as:
f.sub.π(w)=log π(a.sub.t|o.sub.t; w.sub.t)(R.sub.t−V (o.sub.t; w.sub.t))+ϕH (π(o.sub.t; w))
[0118] Among them, ϕH(π(o.sub.t; w)) represents a regular term in the strategy loss function f.sub.π(w) which plays the role of adjusting f.sub.π(w).
[0119] The value function loss function f.sub.v(w) is expressed as:
f.sub.v(w)=(R.sub.t−V (o.sub.t; w.sub.t)).sup.2
Action Constraint Design
[0120] In the present embodiment, the action constraint rule is used to restrict the strategy output by the neural network model, and further restrict the action selection of the current AGV, so as to constrain the action executed by the current AGV within an action range permitted by the optimal path, and promote the convergence of model in the training process.
[0121] In an embodiment, the action to be executed of the current AGV obtained based on the evaluation indexes in the step S130 in
[0122] Correspondingly, in
[0123] judging whether the action to be executed is a permitted action of the current AGV;
[0124] if the action to be executed is a permitted action of the current AGV, the current AGV executes the action to be executed;
[0125] if the action to be executed is not a permitted action of the current AGV, calculating and executing a valid action of the current AGV according to the evaluation indexes of the motion planning result output by the neural network model and the permitted action of the current AGV.
[0126] Specifically: extracting a plurality of first permitted actions included in an optimal path from the action space of the current AGV;
[0127] selecting a second permitted action satisfying a preset condition from the plurality of first permitted actions using the evaluation indexes of the motion planning result output by the neural network model; and
[0128] calculating whether a collision will occur if the current AGV performs the second permitted action according to states of other AGVs, and if a collision will occur, executing a stationary action by the current AGV; if a collision will not occur, executing the second permitted action by the current AGV.
[0129] In an embodiment, the probability value corresponding to each first permitted action can be calculated using the strategy π, and a first permitted action corresponding to the largest probability value is selected as the second permitted action.
[0130] Corresponding to the foregoing method, the present disclosure also provides a multi-AGV motion planning device.
[0131]
[0132] a modeling unit 510 for establishing an object model for describing a sequence decision process of multi-AGV motion planning through a reinforcement learning method, wherein the object model includes: an AGV state, an action space, and evaluation indexes of a motion planning result, and the AGV state includes a state of current AGV, states of other AGVs calculated based on the state of current AGV, and permitted actions of a current AGV;
[0133] a training unit 520 for building a neural network model based on the object model, performing environment settings including AGV group deployment, and using the object model of the AGV in a set environment to train the neural network model until a stable neural network model is obtained;
[0134] a setting unit 530 for setting an action constraint rule, and using the action constraint rule to perform validity judgment on an action to be executed of the current AGV obtained based on the evaluation indexes; and
[0135] an implementation unit 540 for, after the motion planning is started, inputting the state of current AGV, states of other AGVs and permitted actions in a current environment into the neural network model after trained, obtaining the evaluation indexes of a motion planning result output by the neural network model, obtaining an action to be executed of the current AGV according to the evaluation indexes, and performing validity judgment on the action to be executed using the action constraint rule, so that the current AGV executes a valid action according to a judgment result.
[0136] In an embodiment, the implementation unit 540 comprises a judgment module, a first action execution module, and a second action execution module;
[0137] the judgment module is for judging whether the action to be executed is a permitted action of the current AGV;
[0138] the first action execution module is for, when the action to be executed is a permitted action of the current AGV, causing the current AGV to execute the action to be executed;
[0139] the second action execution module is for, when the action to be executed is not a permitted action of the current AGV, calculating and executing a valid action of the current AGV according to the evaluation indexes of the motion planning result output by the neural network model and the permitted action of the current AGV.
[0140] In an embodiment, the second action execution module is also for extracting a plurality of first permitted actions included in an optimal path from the action space of the current AGV; selecting a second permitted action satisfying a preset condition from the plurality of first permitted actions using the evaluation indexes of the motion planning result output by the neural network model; calculating whether a collision will occur if the current AGV performs the second permitted action according to states of other AGVs, and if a collision will occur, executing a stationary action by the current AGV; if a collision will not occur, executing the second permitted action by the current AGV.
[0141] In an embodiment, the evaluation indexes of the motion planning result include: a strategy π for describing selection of an action a.sub.t by the current AGV in a joint state at a moment t and a value function V for describing an expected return by adopting the strategy π in a joint state, and the joint state is a state composed of the state of current AGV, states of other AGVs and permitted actions; and
[0142] the second action execution module is specifically for calculating a probability value corresponding to each first permitted action using the strategy π, and selecting a first permitted action corresponding to the largest probability value as the second permitted action.
[0143] In an embodiment, the modeling unit 510 uses the Markov decision process to describe the object model, the permitted actions are obtained according to an optimal path strategy and an action space, and the action space is a speed and an angle in a body polar coordinate system of the current AGV;
[0144] the state of current AGV includes: a speed of the current AGV in the body polar coordinate system, a target position, a size of current AGV, an average speed, and a road network distance between the current AGV and the target position;
[0145] the states of other AGVs include: relative positions of other AGVs in the body polar coordinate system of the current AGV, relative positions of targets of other AGVs in the body polar coordinate system of the current AGV, relative speeds of other AGVs in the body polar coordinate system of the current AGV, road network distances between other AGVs and their targets, sizes of other AGVs, a sum of the sizes of other AGVs and the size of current AGV, and road network distances between other AGVs and the current AGV.
[0146] The object model further includes: a reward for describing a reward given to the current AGV for taking an action, the reward includes at least three description types: a first description type which is a reward given when the current AGV reaches or approaches the target position, a second description type which is a penalty given when the current AGV collides with or approaches other AGVs, and a third description type which is a penalty given when the current AGV deviates from the road network;
[0147] wherein the first description type is set in the following way:
[0148] when the current AGV reaches the target position p.sub.g, a positive maximum reward value a is given; a first discount coefficient i is set based on a road network distance, a positive discount reward value i*a and a negative discount reward value −i*a are calculated from the first discount coefficient i and the maximum reward value a, and when the current AGV approaches the target position p.sub.g, the positive discount reward value i*a is given; when the current AGV is stationary or far away from the target position p.sub.g, a negative discount reward value −i*a is given;
[0149] the second description type is set in the following way:
[0150] when distances between the current AGV and other AGVs are less than a first threshold condition t.sub.agv, a maximum penalty value β is given; a second discount coefficient j is set based on a distance, a discount penalty value j*β is calculated from the second discount coefficient j and the maximum penalty value β, and when distances between the current AGV and other AGVs are greater than the first threshold condition t.sub.agv and less than a second threshold condition m*t.sub.agv, the discount penalty value j*β is given; and when the distances between the current AGV and other AGVs are greater than the second threshold condition m*t.sub.agv, no penalty will be given; m is a preset multiple;
[0151] the third description type is set in the following way:
[0152] when a distance between the current AGV and a current path is not less than a third threshold condition t.sub.rn, a penalty δ is given; when the distance between the current AGV and the current path is less than the third threshold condition t.sub.rn, no penalty is given.
[0153] In an embodiment, the training unit 520 comprises a model building module for building the neural network model using a fully connected neural network, wherein an input of the neural network model includes the state of current AGV, the states of other AGVs, and the permitted actions of the current AGV; the permission actions of the current AGV are input to a first single hidden layer fully connected neural network, the state of current AGV is input to a second single hidden layer fully connected neural network, the states of other AGVs are input to a long short-term memory network, and then to a third single hidden layer fully connected neural network, the outputs of three single hidden layer fully connected neural networks are concatenated into a tensor, then input to a double hidden layer fully connected neural network, and the double hidden layer fully connected neural network outputs the evaluation indexes of the motion planning result; the dimension of the third single hidden layer fully connected neural network is greater than the dimension of the first single hidden layer fully connected neural network, and the dimension of the third single hidden layer fully connected neural network is greater than the dimension of the second single hidden layer fully connected neural network.
[0154] In an embodiment, the training unit 520 further comprises an environment setting module for setting at least one environment, wherein multiple AGVs are deployed in each environment, and a quantity of maps and/or road networks and/or AGVs in each environment is different; and inputting the state of current AGV, states of other AGVs and permitted actions in each environment into the neural network model for training, so as to update parameters of the neural network model.
[0155] The device embodiments described above are merely illustrative, and the specific implementations may refer to the specific implementations of the foregoing method embodiments, which will not be repeated herein.
[0156]
[0157] The AGVs in the AGV group 620 use their own sensors to obtain their own states and upload them to the control center 610, receive the action instructions issued by the control center 610, and execute actions corresponding to the action instructions.
[0158] The AGVs in the same environment run in a multi-process mode, realizes multi-AGV data communication through multi-process data sharing technology, and at the same time, realizes multi-AGV synchronous operation through multi-process data synchronization technology.
[0159] The task scheduling center platform 630 is configured to complete task planning and task dispatch, and send a task of a certain AGV to the control center 610, and the control center 610 controls the AGV to complete the task.
[0160] For example, the task scheduling center platform sends the task of the j-th AGV to the control center, and the control center controls the j-th AGV to complete the task. The task scheduling center platform may include a display interface for real-time supervising how the AGV group execute the tasks, and timely intervening in unexpected situations such as collisions and environmental changes.
[0161] The control center 610 has a built-in stable neural network model, inputs a state of an AGV received into the neural network model, uses the neural network model to calculate a motion planning strategy of the AGV, and then generates action instructions according to the motion planning strategy and issues them to the AGV.
[0162] In sum, first, the technical solution of the present disclosure adopts a deep reinforcement learning method, takes advantage of the merits of neural network models in computing high-dimensional state space and the characteristics of reinforcement learning online control, and thus improves the performance of the multi-AGV motion planning method in dynamic environments. Second, the present embodiment combines the optimal path search algorithm and the deep reinforcement learning algorithm, and constrains the learning direction of the deep reinforcement learning algorithm using the optimal path search algorithm, so that the AGV can plan a feasible route in a dense environment, and prevent the AGV from entering a local deadlock state.
[0163] It should be noted:
[0164] The algorithms and demonstrations given here are not intrinsically associated with any particular computer, virtual system, or other device. All general-purpose system can be used in conjunction with demonstrations here. Based on the above descriptions, architectures for constructing such a system are apparent. In addition, the present disclosure is not dependent on any particular programming language. It is understandable that various programming languages can be used to realize contents of the present disclosure described herein, and that the above descriptions concerning specific languages are intended to reveal the best implementation of the present disclosure.
[0165] Those skilled in the art can understand that it is possible to adaptively change the modules in the device in the embodiments and set them in one or more devices different from the embodiments. The modules or units or components in the embodiments can be combined into one module or unit or component, and in addition, they can be divided into multiple sub-modules or sub-units or sub-components. Except that at least some of such features and/or processes or units are mutually exclusive, any combination can be used to combine all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device disclosed in such a way. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature providing the same, equivalent or similar purpose.
[0166] In addition, those skilled in the art can understand that, although some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present disclosure, and form different embodiments, as would be understood by those in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination way.
[0167] The various component embodiments of the present disclosure may be implemented by hardware, or by software modules running on one or more processors, or by their combination. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) can be used in practice to implement some or all functions of some or all the components of the multi-AGV motion planning device according to the embodiments of the present disclosure. The present disclosure can also be implemented as a device or device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein. Such a program for realizing the present disclosure may be stored on a computer readable medium, or may have the form of one or more signals. Such signals can be downloaded from Internet websites, or provided on carrier signals, or provided in any other form.
[0168] For example,
[0169]
[0170] It should be noted that the above embodiments illustrate rather than limit the present disclosure, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses should not be constructed as a limitation to the claims. The word “comprising” does not exclude the presence of elements or steps not listed in the claims. The word “a” or “an” preceding an element does not exclude the presence of multiple such elements. The present disclosure can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In the unit claims enumerating several devices, several of these devices may be embodied in the same hardware item. The use of the words “first”, “second” and “third” does not indicate any order. These words can be interpreted as names.
[0171] While at least one exemplary embodiment has been presented in the foregoing detailed description, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration of the invention in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing an exemplary embodiment, it being understood that various changes may be made in the function and arrangement of elements described in an exemplary embodiment without departing from the scope of the invention as set forth in the appended claims and their legal equivalents.