METHOD FOR UAV PATH PLANNING IN URBAN AIRSPACE BASED ON SAFE REINFORCEMENT LEARNING
20250085714 · 2025-03-13
CPC classification: Y02T10/40
Abstract
The present invention discloses a method for UAV path planning in urban airspace based on a safe reinforcement learning (RL) algorithm called shield-DDPG, which combines a shield model with a DDPG algorithm and pertains to the field of UAV technologies. The method takes the attractive force from the destination point into account when an action is selected, which improves the convergence speed of the algorithm and also improves the efficiency of UAV path planning. More importantly, the method provided by the present invention can effectively verify the safety of an action in terms of the air collision risk and the ground impact risk, and ensure that the final output of the algorithm is a safe optimal solution. Therefore, the present invention can effectively solve the problem that, when an RL algorithm is used for UAV path planning, it is difficult to ensure the safety of the learning or execution process due to the lack of hard constraints.
Claims
1. A method for UAV path planning in urban airspace based on safe reinforcement learning, comprising: S1, collecting state information of a UAV, urban airspace and an urban ground environment, and defining a state of the UAV at any moment t as s_t, wherein s_t = [x_t, y_t, z_t]; S2, constituting a safe reinforcement learning algorithm architecture called shield-DDPG from four functional modules: an environment module, a neural network module, a shield module, and a replay buffer; and conducting training by the neural network module according to the state s_t, the neural network module comprising a main network and a target network; the shield module being constructed by a linear temporal logic and specifically comprising a finite-state reactive system, a state trace, a safety specification, a Markov decision process, a safety automaton and an observe function, the shield module acting between a main actor network and a main critic network, the main actor network outputting an action u_t; S3, determining, by the shield module, safety of an action a_t = u_t + F_t = [a_t^x, a_t^y, a_t^z], in which F_t = ξ·D_t^D is an attractive force, ξ is an attractive coefficient, and D_t^D is a distance between a UAV current position and a destination point; S4, verifying the safety of the action a_t by the shield module, and finally outputting a safe action ā_t; S5, by the final safe action ā_t obtained, performing ā_t for state transition to obtain a next state s_{t+1} as well as a reward Reward_t; and S6, storing the current state s_t, the final safe action ā_t, the reward Reward_t, the next state s_{t+1}, and a training flag d_t in the replay buffer, and sampling a random minibatch of transitions from the replay buffer for updating the neural network.
2. The method for UAV path planning in urban airspace based on safe reinforcement learning according to claim 1, wherein the finite-state reactive system is M = (S, δ, L), in which S is a set of n states, i.e., S = [s_t]_{t=1}^n, δ represents a state transition function, and L represents an observed label; as for the specification φ, all state traces in the reactive system M should satisfy all properties of φ, and the state traces are safety-constrained when φ is defined as a safety specification φ^safe that satisfies all safety properties; and the observe function is obs: S × E → l = {D_t^{O_i}, Risk_t}, in which D_t^{O_i} is a distance between the UAV and each obstacle O_i, and Risk_t is a ground impact risk when the UAV falls out of control.
3. The method for UAV path planning in urban airspace based on safe reinforcement learning according to claim 2, wherein generating the safe action ā_t by the shield module specifically comprises: first, determining which specific UAV action dimension or dimensions cause(s) an unsafe condition, i.e., except for the dimension to be determined, setting the action in the other two dimensions to 0, executing a state transition process with the action in the dimension to be determined, calculating and determining whether the action is safe at this time, and in a similar fashion, obtaining the unsafe action dimension(s) through separate judgment of all the dimensions; then, keeping the action in each safe dimension unchanged, and as for each unsafe dimension, circularly compressing the original unsafe action j times by a specific step Δ, and judging the action obtained by each compression again; and assuming that m actions among the j actions meet the safety specification, executing the state transition process of the m actions, calculating the reward, and selecting the action with the largest reward as the final safe action ā_t.
4. The method for UAV path planning in urban airspace based on safe reinforcement learning according to claim 3, wherein the final safe action ā_t is performed for state transition to obtain the next state s_{t+1} and the reward Reward_t, wherein one state transition process is expressed as s_{t+1} = s_t + Δt·ā_t, and the reward function Reward_t is calculated from Reward_t = r_1·Reward_t1 + r_2·Reward_t2 + r_3·Reward_t3, in which r_1, r_2 and r_3 are coefficients of the reward subfunctions respectively, λ_i is a reward value correspondingly further applied to the UAV, and Reward_t1 is a reward function for evaluating the distance between the UAV current position and the destination point.
5. The method for UAV path planning in urban airspace based on safe reinforcement learning according to claim 4, wherein in step S6, a random minibatch of B transitions (s_i, a_i, Reward_i, s_{i+1}, d_i) is sampled from the replay buffer, and y_i = Reward_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}) is set, wherein γ is a discount factor, and the parameter θ^Q of the main critic network is updated by minimizing a loss L = (1/B)·Σ_i (y_i − Q(s_i, a_i|θ^Q))².
Description
BRIEF DESCRIPTION OF DRAWINGS
[0029] To make the objectives, technical solutions and beneficial effects of the present invention clearer, the present invention is illustrated with the following drawings:
DETAILED DESCRIPTION
[0032] As shown in the accompanying drawings, the shield-DDPG architecture is constituted by four functional modules: an environment module, a neural network module, a shield module, and a replay buffer.
[0033] Specifically, the environment module is mainly configured to acquire the UAV position and mission planning requirements, as well as the target urban airspace environment (mainly referring to state information of the airspace and the ground area of an urban target area). The UAV position and mission planning requirements mainly include a three-dimensional position state s_t = [x_t, y_t, z_t] of the UAV at a current moment t, a position state s_S = [x_S, y_S, z_S] of the start point, and a position state s_D = [x_D, y_D, z_D] of the destination point. The target urban airspace environment mainly covers static obstacles such as no-fly zones, urban buildings and infrastructures (communication stations and towers), which are equivalent to N cylindrical static obstacles. Taking the i-th obstacle O_i as an example, its position state is s_{O_i}, with a radius R_{O_i} and a height H_{O_i}.
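For illustration only, the environment state described above can be sketched in Python as below; the Obstacle and Mission containers and their field names are assumptions introduced here, not part of the claimed method.

```python
# Minimal sketch of the environment module's data, assuming numpy arrays for
# positions; class and field names are illustrative, not from the patent.
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Obstacle:
    center: np.ndarray   # [x_Oi, y_Oi], ground-plane centre of the cylinder
    radius: float        # R_Oi
    height: float        # H_Oi

@dataclass
class Mission:
    start: np.ndarray                 # s_S = [x_S, y_S, z_S]
    dest: np.ndarray                  # s_D = [x_D, y_D, z_D]
    obstacles: List[Obstacle] = field(default_factory=list)

# The UAV state at moment t is its three-dimensional position s_t = [x_t, y_t, z_t]
s_t = np.array([10.0, 20.0, 50.0])
```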
[0034] Specifically, the shield module is based on linear temporal logic (LTL). The shield module constructed in this invention refers to a protection framework for preventing the algorithm from outputting an unsafe action. The framework consists of a finite-state reactive system, a state trace, a safety specification, a Markov decision process (MDP), a safety automaton and an observe function.
[0035] Specifically, the shield module performs formal modeling of UAV path planning by using the finite-state reactive system M = (S, δ, L) and the state traces commonly used in linear temporal logic, and constrains the state traces by the safety specification φ^safe. Then, the safety specification and the Markov process of UAV path planning are transformed into two safety automaton models respectively. Afterwards, a state reactive system that can realize these two safety automaton models at the same time is constructed, and this state reactive system is the prototype of the shield module. After that, the observe function obs: S × E → l = {D_t^{O_i}, Risk_t} is employed to assist in evaluating the safety of the action.
[0036] For the reactive system M, S represents the state, δ represents the state transition function, and L represents the observed label. For the specification φ, all the state traces in the reactive system M should satisfy all properties of φ. When φ is defined as a safety specification φ^safe that satisfies all the safety properties, the state traces can be safely constrained. The observe function obs is defined as the mapping between the state S and an environment E. In this example, its output is the distance D_t^{O_i} between the UAV and each obstacle and the ground impact risk Risk_t when the UAV falls out of control.
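A minimal sketch of the observe function obs, assuming the ground impact risk is supplied by an external risk model (the `risk_model` callable is a hypothetical placeholder):

```python
import numpy as np

def observe(s_t, obstacles, risk_model):
    """obs: S x E -> l = {D_t^{O_i}, Risk_t}.

    Returns the horizontal distance from the UAV to each cylindrical obstacle
    and the ground impact risk; `risk_model` is an assumed external estimator.
    """
    distances = [np.linalg.norm(s_t[:2] - ob.center) for ob in obstacles]
    return {"D": distances, "Risk": risk_model(s_t)}
```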
[0037] In the present invention, a describing function σ(s, (l, a)), indicating that the action a is performed at the state s under the observed label l, is defined to evaluate to 1 when the resulting state satisfies the safety specification φ^safe, and to 0 otherwise.
[0038] The state transition function may be expressed as s_{t+1} = δ(s_t, a_t) = s_t + Δt·a_t.
[0039] A safe state is obtained when σ(s, (l, a)) is output as 1, and an unsafe state is obtained when σ(s, (l, a)) is output as 0, which is specifically illustrated below.
[0040] For the UAV and the i-th obstacle O_i, whether the xOy plane where the UAV is located intersects with the cylindrical obstacle O_i is determined first. If there is no intersection, it is believed that the UAV cannot collide with the obstacle O_i at the current position, i.e., z_t > H_{O_i}.
[0041] Particularly, in the present invention, an acceptable operational safety separation threshold WC between the UAV and the obstacle is taken into consideration. The safety separation threshold mainly takes into account the UAV's own performance parameters, operation speed, the actual urban airspace, communication and other factors. This parameter reflects the ability to maintain an adequate safe distance interval and avoid an air collision risk between the UAV and the static obstacle. In the present invention, a specific explanation is made as below: [0042] when D_t^{O_i} ≤ R_{O_i} + WC, it is believed that the UAV cannot maintain an adequate safe separation from the obstacle O_i, the current state is an unsafe state, and σ(s, (l, a)) = 0; and [0043] when D_t^{O_i} > R_{O_i} + WC, it is believed that the current state is a safe state, and σ(s, (l, a)) = 1.
[0045] Meanwhile, it is necessary to consider whether the ground impact risk when the UAV falls out of control is within an acceptable target level of safety, which is specifically described below:
[0046] when Risk_t ≥ Risk_max, it is believed that the current state is an unsafe state, and σ(s, (l, a)) = 0; [0047] when Risk_min < Risk_t < Risk_max, it is believed that the current state is a safe state, but there is still a certain degree of ground impact risk, and σ(s, (l, a)) = 1; and [0048] when Risk_t ≤ Risk_min, it is believed that the current state is a safe state, the ground impact risk is negligible, and σ(s, (l, a)) = 1.
[0049] Risk_max refers to the maximum acceptable target level of safety from the ground impact, taking casualties per hour as the evaluation index. At present, it is generally believed that 1×10⁻⁸ casualties/hour is the maximum acceptable target level of safety; and Risk_min refers to the minimum negligible risk from the ground impact, and in the present invention, 1×10⁻¹¹ casualties/hour is considered the minimum negligible risk.
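Combining the air collision cases of paragraphs [0042]-[0043] with the ground impact cases of paragraphs [0046]-[0049], the safety evaluation σ might look as follows; this is a sketch under the stated thresholds, with function and parameter names chosen for illustration:

```python
import numpy as np

RISK_MAX = 1e-8    # maximum acceptable target level of safety (casualties/hour)
RISK_MIN = 1e-11   # minimum negligible ground impact risk (casualties/hour)

def sigma(s, obstacles, wc, risk_t):
    """sigma(s, (l, a)): return 1 for a safe state, 0 for an unsafe state."""
    # Air collision check against each cylindrical obstacle O_i
    for ob in obstacles:
        if s[2] <= ob.height:                      # xOy plane intersects the cylinder
            d = np.linalg.norm(s[:2] - ob.center)  # horizontal distance D_t^{O_i}
            if d <= ob.radius + wc:                # safe separation WC violated
                return 0
    # Ground impact risk check; between RISK_MIN and RISK_MAX the state is
    # still considered safe, but a residual ground impact risk remains
    if risk_t >= RISK_MAX:
        return 0
    return 1
```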
[0050] Based on the shield module and combined with the deep deterministic policy gradient (DDPG) algorithm, the present invention proposes a safe reinforcement learning algorithm called shield-DDPG for UAV path planning, which is specifically described below.
[0051] Before training begins, a state space and an action space are designed in advance with reference to the UAV position and mission planning requirements, as well as the target urban airspace environment.
[0052] At the beginning of training, first, the parameters θ^Q and θ^μ of the critic network Q(s, a|θ^Q) and the actor network μ(s|θ^μ) in the main network are initialized by using random numbers, and the parameters of the main network are copied to the target network (the parameters of the target network are distinguished as θ^{Q′} and θ^{μ′}). Meanwhile, the replay buffer is initialized.
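A sketch of this initialization in PyTorch, assuming a three-dimensional state and action and the 4×256 DNN structure mentioned in paragraph [0077]; the helper `mlp` is introduced here for illustration:

```python
import copy
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, layers=4):
    """Fully connected DNN of 4x256 structure (see paragraph [0077]);
    the output layer is left linear for simplicity."""
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

actor = mlp(3, 3)        # main actor mu(s | theta_mu): state -> action
critic = mlp(3 + 3, 1)   # main critic Q(s, a | theta_Q), takes [s, a] concatenated
# Copy the main network parameters to the target networks (theta_Q', theta_mu')
actor_target = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)
```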
[0053] The state of the UAV at any moment t is defined as s_t, and the set of n states in the whole process may be expressed as S = [s_t]_{t=1}^n.
[0054] For any state s_t of the UAV, the actor network obtains an action u_t = μ(s_t|θ^μ) + ε_t for the state s_t according to the action selection policy and a randomly explored noise ε_t. Here, the explored noise ε_t is introduced mainly to avoid inadequacy of the output action caused by the fact that the neural network has only one output for a given input, and it is unnecessary to add this noise in the test process after training. The noise used in the present invention is Ornstein-Uhlenbeck (OU) noise. Meanwhile, in order to accelerate convergence of the algorithm and avoid falling into a local optimum, the target airspace is abstracted as a virtual artificial potential field. In consideration that the destination point exerts an attractive force F_t = ξ·D_t^D on the UAV (in which ξ is an attractive coefficient, and D_t^D is the distance between the UAV current position and the destination point), the action whose safety is to be determined by the shield module is a_t = u_t + F_t = [a_t^x, a_t^y, a_t^z].
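A sketch of this action selection, assuming the attraction acts as a vector pointing from the UAV toward the destination whose magnitude equals ξ·D_t^D, and that `actor` maps a state to a numpy action; the OUNoise class follows the standard Ornstein-Uhlenbeck formulation:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, the standard DDPG choice."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        # Mean-reverting random walk toward 0 with volatility sigma
        self.x = self.x + self.theta * (0.0 - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.x))
        return self.x

def select_action(actor, s_t, dest, xi, noise):
    """a_t = u_t + F_t, with u_t = mu(s_t | theta_mu) + OU noise."""
    u_t = actor(s_t) + noise.sample()   # noisy actor output u_t
    f_t = xi * (dest - s_t)             # attraction of magnitude xi * D_t^D
    return u_t + f_t                    # candidate a_t, to be checked by the shield
```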
[0055] Afterwards, the shield module verifies the safety of the action a_t, and finally outputs a safe action ā_t, which is described in detail below.
[0056] The shield module executes the state transition process under the action a_t, and evaluates the value of σ(s, (l, a_t)). If all safety properties are satisfied, the action is considered as a safe action ā_t = a_t; if not, the shield module needs to generate a safe action ā_t.
[0057] Specifically, when generating the safe action ā_t, the shield module first determines which specific UAV action dimension or dimensions cause(s) the unsafe condition: except for the dimension to be determined, the action in the other two dimensions is set to 0, the state transition process is executed with the action in the dimension to be determined, and whether the action is safe at this time is calculated and determined; in a similar fashion, the unsafe action dimension(s) are obtained through separate judgment of all the dimensions. Then, the action in each safe dimension is kept unchanged, and for each unsafe dimension, the original unsafe action is circularly compressed j times by a specific step Δ, and the action obtained by each compression is judged again. Assuming that m actions among the j actions meet the safety specification, the state transition process of the m actions is executed, the reward is calculated, and the action with the largest reward is selected as the final safe action ā_t.
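The fallback procedure above can be sketched as follows; the helper callables (`step_fn`, `is_safe`, `reward_fn`), the multiplicative compression step, and the zero-action fallback when no candidate passes are assumptions made for this illustration:

```python
import numpy as np

def generate_safe_action(a_t, s_t, step_fn, is_safe, reward_fn, delta=0.9, j=10):
    """Shield fallback sketch: isolate unsafe dimensions, then compress them.

    `step_fn(s, a)` performs one state transition, `is_safe(s)` evaluates
    sigma, and `reward_fn(s)` computes Reward_t; all names are illustrative.
    """
    # 1) Judge each dimension separately: zero the other two and test it alone
    unsafe_dims = [d for d in range(3)
                   if not is_safe(step_fn(s_t, np.eye(3)[d] * a_t[d]))]

    # 2) Compress the unsafe dimensions j times by step delta, keeping the
    #    actions in the safe dimensions unchanged
    candidates, a = [], a_t.astype(float).copy()
    for _ in range(j):
        for d in unsafe_dims:
            a[d] *= delta                       # circular compression by step delta
        s_next = step_fn(s_t, a)
        if is_safe(s_next):                     # m of the j candidates pass the spec
            candidates.append((reward_fn(s_next), a.copy()))

    # 3) Among the m safe candidates, select the action with the largest reward
    if candidates:
        return max(candidates, key=lambda c: c[0])[1]
    return np.zeros(3)                          # no candidate passed: hover in place
```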
[0058] The final safe action ā_t is performed for state transition to obtain the next state s_{t+1} and the reward Reward_t. One state transition process may be expressed as s_{t+1} = s_t + Δt·ā_t.
[0059] Specifically, the reward function Reward_t is calculated from Reward_t = r_1·Reward_t1 + r_2·Reward_t2 + r_3·Reward_t3.
[0060] Reward_t1 mainly takes the distance between the UAV current position and the destination point into account: the closer the UAV is to the destination point, the larger the value of the reward function Reward_t1. Reward_t2 mainly takes the air collision risk between the UAV and each static obstacle into account: the smaller the total risk, the larger the value of the reward function Reward_t2. Reward_t3 mainly takes the ground impact risk when the UAV falls out of control into account: the smaller the risk, the larger the value of the reward function Reward_t3. r_1, r_2 and r_3 are coefficients of the reward subfunctions respectively. For each reward subfunction, λ_i is a reward value correspondingly further applied to the UAV when a certain condition is met.
[0061] First, the reward function Reward_t1 is intended to evaluate the distance between the UAV current position and the destination point. In the case that the UAV is close enough to the destination point, it is believed in the present invention that an additional reward λ_1 needs to be applied when the distance between the UAV current position and the destination point is less than a specified threshold ζ, where D_total is the distance between the start point and the destination point, and ζ is a constant for judging whether the UAV is close enough to the destination point.
[0062] Second, the reward function Reward_t2 is intended to evaluate the air collision risk between the UAV and each static obstacle. In the case that the air collision risk between the UAV and each static obstacle O_i is negligible, i.e., when every static obstacle O_i meets z_t > H_{O_i} or D_t^{O_i} > R_{O_i} + WC, it is believed in the present invention that an additional reward λ_2 needs to be applied.
[0063] Third, the reward function Reward_t3 is intended to evaluate the ground impact risk when the UAV falls out of control. The ground impact risk when the UAV falls out of control is negligible when it is less than the minimum negligible risk. It is believed in the present invention that an additional reward λ_3 needs to be applied when Risk_t ≤ Risk_min.
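A sketch tying the three reward subfunctions together; the description fixes only their qualitative behaviour and the additional rewards λ_i, so the exact subfunction shapes below (in particular the linear distance-shaping term) are assumptions:

```python
import numpy as np

def reward_t(s_t, dest, d_total, obstacles, wc, risk_t, risk_min=1e-11,
             r=(1.0, 1.0, 1.0), lam=(1.0, 1.0, 1.0), zeta=5.0):
    """Reward_t = r1*Reward_t1 + r2*Reward_t2 + r3*Reward_t3 (sketch)."""
    d_dest = np.linalg.norm(dest - s_t)          # D_t^D

    # Reward_t1: grows as the UAV approaches the destination; extra reward
    # lam[0] once D_t^D < zeta (linear shaping term is an assumption)
    rw1 = 1.0 - d_dest / d_total
    if d_dest < zeta:
        rw1 += lam[0]

    # Reward_t2: extra reward lam[1] when every obstacle is cleared, i.e.
    # z_t > H_Oi or D_t^{O_i} > R_Oi + WC for all O_i
    cleared = all(s_t[2] > ob.height or
                  np.linalg.norm(s_t[:2] - ob.center) > ob.radius + wc
                  for ob in obstacles)
    rw2 = lam[1] if cleared else 0.0

    # Reward_t3: extra reward lam[2] when Risk_t <= Risk_min
    rw3 = lam[2] if risk_t <= risk_min else 0.0

    return r[0] * rw1 + r[1] * rw2 + r[2] * rw3
```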
[0064] Then, the current state, the safe action, the reward, the next state and a training flag (s_t, ā_t, Reward_t, s_{t+1}, d_t) are stored in the replay buffer, in which d_t is the flag for judging whether the current training step is the last step. Applying the training flag d_t to the algorithm is a common method to constrain the training steps in one episode and to prevent an infinite loop, and the specific judgment method is determined according to the actual mission requirement. In the present invention, two judgment methods are provided as double constraints on the training steps: [0065] 1) it is believed that training can be stopped when the distance between the UAV and the destination point is less than a threshold required by the mission, and d_t = 1 at this time, otherwise d_t = 0; and [0066] 2) a maximum training step may be specified for each episode of training, it is believed that training can be stopped when the current step reaches the maximum step, and d_t = 1 at this time, otherwise d_t = 0.
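A minimal replay buffer with the double termination constraint, sketched with illustrative names:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores transitions (s_t, a_t, Reward_t, s_{t+1}, d_t)."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def store(self, s, a, reward, s_next, d):
        self.buf.append((s, a, reward, s_next, d))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)   # random minibatch of B transitions

def done_flag(s_t, dest, threshold, step, max_steps):
    """d_t = 1 when the mission distance threshold or the step limit is reached."""
    reached = np.linalg.norm(dest - s_t) < threshold
    return 1 if (reached or step >= max_steps) else 0
```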
[0067] After that, a random minibatch of B transitions (s_i, a_i, Reward_i, s_{i+1}, d_i) is sampled from the replay buffer, and y_i = Reward_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^{μ′})|θ^{Q′}) is set, in which γ is a discount factor.
[0068] Then, the parameter θ^Q of the main critic network is updated by minimizing the loss L = (1/B)·Σ_i (y_i − Q(s_i, a_i|θ^Q))².
[0069] The main actor policy θ^μ is updated by using the sampled policy gradient: ∇_{θ^μ}J ≈ (1/B)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}.
[0070] Afterwards, the target networks are updated by soft update: θ^{Q′} ← τ·θ^Q + (1−τ)·θ^{Q′} and θ^{μ′} ← τ·θ^μ + (1−τ)·θ^{μ′}, in which τ ≪ 1 is the soft update coefficient.
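The three updates of paragraphs [0068]-[0070] can be sketched in PyTorch as below; masking the bootstrap term with (1 − d_i) is a common practice added here, and the batch is assumed to arrive as stacked tensors:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One shield-DDPG network update using the standard DDPG losses."""
    s, a, r, s_next, d = batch   # tensors sampled from the replay buffer

    # Critic: minimize (1/B) * sum_i (y_i - Q(s_i, a_i | theta_Q))^2
    with torch.no_grad():
        y = r + gamma * (1 - d) * critic_target(
            torch.cat([s_next, actor_target(s_next)], dim=1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: sampled policy gradient, i.e. ascend Q(s, mu(s | theta_mu))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update: theta' <- tau * theta + (1 - tau) * theta'
    for target_net, main_net in ((critic_target, critic), (actor_target, actor)):
        for p_t, p in zip(target_net.parameters(), main_net.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```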
[0072] After completing model construction of the above-mentioned shield-DDPG algorithm, it is necessary to test the model and export it, and the tested and exported model is adopted to generate a recommended path for the UAV.
[0073] Specifically, after the model of the shield-DDPG algorithm meets an end-of-training condition and satisfies the requirement for convergence, parameters of the model can be exported for a model test described in the next step; otherwise, the model is retrained. The requirement for convergence is that the paths output for 10 consecutive times meet mission planning requirements, and the reward is maintained at a high level without an obvious change.
[0074] Then, after the model meets the requirements in the above step, the output effect of the model should be tested. Specifically, the model parameters exported in the above step are imported into a new Python environment, and a random interference within [0, 1] is introduced to the start point under the mission planning requirements. The path is planned 500 times consecutively by the model. If the probability of the path meeting the requirements reaches 99%, it is believed that the model meets the requirements; otherwise, the model is retrained.
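The acceptance test of this paragraph might be automated as below; `plan_path` and `meets_requirements` are hypothetical callables wrapping the exported model and the mission planning requirements:

```python
import numpy as np

def acceptance_test(plan_path, mission, meets_requirements, trials=500):
    """Plan from a randomly disturbed start point `trials` times and accept
    the model if at least 99% of the planned paths meet the requirements."""
    successes = 0
    for _ in range(trials):
        disturbed_start = mission.start + np.random.uniform(0.0, 1.0, size=3)
        path = plan_path(disturbed_start, mission.dest)
        if meets_requirements(path):
            successes += 1
    return successes / trials >= 0.99
```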
[0075] After that, the trained model of the shield-DDPG algorithm is deployed on a UAV path planning system as required, by which a UAV operation path is planned.
[0076] It is worth noting that the model of the shield-DDPG algorithm for UAV path planning provided by the present invention is developed in a Python environment by using the PyTorch framework, but it also has excellent compatibility with engineering platforms developed based on C++.
[0077] Meanwhile, since the model of the shield-DDPG algorithm in the present invention uses the PyTorch framework, the exported model data is generally stored as a .pt file. If the four neural networks are all DNNs of a 4×256 structure (four hidden layers of 256 units each), the size of the trained model is about 500 KB, which improves the loading speed and effectively reduces the memory usage.
[0078] Tests show that the method for UAV path planning in urban airspace based on the shield-DDPG algorithm, provided by the present invention, can effectively improve the safety and efficiency of UAV path planning; an example of simulation results of the planned UAV path is shown in the accompanying drawings.
[0079] The present invention provides a safe reinforcement learning algorithm called shield-DDPG for UAV path planning, which can avoid the air collision risk and the ground impact risk as much as possible while ensuring fast response to the UAV mission requirement, and realizes safe and reliable verification of UAV path planning instructions, thereby effectively ensuring the safety of the planned path and effectively solving the problem of uncertainty in the solution of a general reinforcement learning algorithm. The shield module constructed in this invention ensures the safety of the learning and execution processes throughout the training period, and the algorithm converges quickly.
[0080] Finally, the above preferred embodiments are intended only to illustrate the technical solutions of the present invention but not to limit it. Although the present invention has been described in detail by means of the above preferred embodiments, it should be understood by those skilled in the art that various formal or detailed changes can be made without departing from the scope defined by the claims of the present invention.