Air combat maneuvering method based on parallel self-play
11794898 · 2023-10-24
Assignee
Inventors
- Bo Li (Xi'an, CN)
- Kaifang WAN (Xi'an, CN)
- Xiaoguang GAO (Xi'an, CN)
- Zhigang GAN (Xi'an, CN)
- Shiyang LIANG (Xi'an, CN)
- Kaiqiang YUE (Xi'an, CN)
- Zhipeng YANG (Xi'an, CN)
CPC classification
B64U2101/15
PERFORMING OPERATIONS; TRANSPORTING
International classification
G05D1/10
PHYSICS
Abstract
The present disclosure provides an air combat maneuvering method based on parallel self-play, including the steps of constructing a UAV (unmanned aerial vehicle) maneuver model, constructing a red-and-blue motion situation acquiring model to describe a relative combat situation of red and blue sides, constructing state spaces and action spaces of both red and blue sides and a reward function according to a Markov process, followed by constructing a maneuvering decision-making model structure based on a soft actor-critic (SAC) algorithm, training the SAC algorithm by performing air combat confrontations to realize parallel self-play, and finally testing a trained network, displaying combat trajectories and calculating a combat success rate. The level of confrontations can be effectively enhanced and the combat success rate of the decision-making model can be increased.
Claims
1. An air combat maneuvering method based on parallel self-play, comprising: step S1: constructing an unmanned aerial vehicle (UAV) maneuver model, comprising the following steps: supposing an OXYZ coordinate system to be a three-dimensional spatial coordinate system for UAVs, where origin O represents the center of a combat area for the UAVs, with the X axis pointing to the north, the Z axis pointing to the east and the Y axis pointing in a vertically upward direction; and regarding a UAV as a mass point and establishing equations of motion for the UAV as follows:

dX/dt = v cos θ cos φ
dY/dt = v sin θ
dZ/dt = v cos θ sin φ

wherein v denotes the velocity of the UAV, θ denotes its pitch angle and φ denotes its heading angle.
2. The air combat maneuvering method based on parallel self-play according to claim 1, wherein the step S2 comprises: describing the relative situation of both sides acquired by the red-and-blue motion situation acquiring model with D, d and q, wherein D denotes a position vector between the red side and the blue side in the direction from the red side to the blue side; d denotes the distance between the red side and the blue side; and q denotes a relative azimuth angle, namely the included angle between the velocity vector V_r of the red side and the distance vector D; denoting the combat situation of the blue side relative to the red side by D_r, d and q_r, and the combat situation of the red side relative to the blue side by D_b, d and q_b, wherein D_r denotes the position vector from the red side to the blue side; D_b denotes the position vector from the blue side to the red side; q_r denotes the relative azimuth angle of the blue side to the red side; and q_b denotes the relative azimuth angle of the red side to the blue side; and D_r, D_b, d, q_r and q_b are calculated as follows:

D_r = (X_b − X_r, Y_b − Y_r, Z_b − Z_r)
D_b = −D_r
d = |D_r|
q_r = arccos((V_r · D_r)/(|V_r|·d))
q_b = arccos((V_b · D_b)/(|V_b|·d))
3. The air combat maneuvering method based on parallel self-play according to claim 2, wherein the step S3 comprises: defining the state space of the red UAV as S_r=[X_r, Y_r, Z_r, v_r, θ_r, φ_r, d, q_r] and the state space of the blue UAV as S_b=[X_b, Y_b, Z_b, v_b, θ_b, φ_b, d, q_b]; defining the action space of the red UAV as A_r=[dv_r, dφ_r, dθ_r] and the action space of the blue UAV as A_b=[dv_b, dφ_b, dθ_b]; and forming the reward function R from a distance reward function R_d and an angle reward function R_q, R = w_1·R_d + w_2·R_q, wherein w_1 and w_2 denote the weights of the distance reward and the angle reward; and the angle reward function R_q is expressed as:

R_q1 = −q/180
R_q2 = 3 if q < q_max, and R_q2 = 0 otherwise
R_q = R_q1 + R_q2

wherein R_q1 denotes a continuous angle reward, R_q2 denotes a sparse angle reward, and q_max denotes a maximum off-boresight launch angle of a missile carried by the red side.
4. The air combat maneuvering method based on parallel self-play according to claim 3, wherein the constructing a maneuvering decision-making model structure based on a SAC algorithm comprises: generating maneuver control quantities for both the red and blue sides by the maneuvering decision-making model based on the SAC algorithm, to allow the red and blue sides to maneuver; and implementing the SAC algorithm by neural networks comprising a replay buffer M, one Actor neural network π_θ, two Soft-Q neural networks Q_φ1 and Q_φ2, and two Target Soft-Q networks Q_φ′1 and Q_φ′2, wherein the Actor neural network π_θ receives an input of a state value s_t^r of the red side or a state value s_t^b of the blue side and outputs a mean μ and a variance σ; noise τ is generated by sampling from a standard normal distribution; and an action is generated from μ, σ and τ and limited to the range (−1, 1) by a tanh function, as follows:

μ_r, σ_r = π_θ(s_t^r)
μ_b, σ_b = π_θ(s_t^b)
a_t^r = N(μ_r, σ_r²) = μ_r + σ_r·τ
a_t^b = N(μ_b, σ_b²) = μ_b + σ_b·τ
a_t^r = tanh(a_t^r)
a_t^b = tanh(a_t^b)

wherein the Soft-Q neural networks Q_φ1 and Q_φ2 receive inputs of a state value and an action value and output Q values predicted by the neural networks; and the Target Soft-Q neural networks Q_φ′1 and Q_φ′2 receive inputs of a state value and an action value and output target Q values.
5. The air combat maneuvering method based on parallel self-play according to claim 4, wherein the step S5 comprises: initializing a plurality of groups of UAVs on both sides, with initial positions within the combat area, and setting an initial velocity range, an initial pitch angle range and an initial heading angle range; and training the SAC algorithm by performing air combat confrontations to realize parallel self-play, comprising: step S51: defining the number env_num of parallel self-play environments, defining the number batch_size of batch training sample groups, defining a maximum simulation step size N, initializing step=1 and env=1, initializing the initial situations of both sides, and obtaining an initial state s_t^r of the red side and an initial state s_t^b of the blue side; and step S52: randomly generating an Actor network weight θ and Soft-Q network weights φ_1 and φ_2, initializing the policy network π_θ and the two Soft-Q networks Q_φ1 and Q_φ2, supposing φ′_1=φ_1 and φ′_2=φ_2, and initializing the Target Soft-Q networks Q_φ′1 and Q_φ′2.
6. The air combat maneuvering method based on parallel self-play according to claim 5, wherein the step S6 comprises: step S61: initializing the initial situations of both sides, and obtaining the initial states s_t^r and s_t^b of the red and blue sides; step S62: recording the states s_t^r and s_t^b, inputting them to the Actor neural network of the trained SAC algorithm model to output actions a_t^r and a_t^b of the red and blue sides, and obtaining new states s_{t+1}^r and s_{t+1}^b after both sides perform the actions; step S63: determining whether either side succeeds in engaging in combat, and if yes, ending; otherwise, supposing s_t^r=s_{t+1}^r and s_t^b=s_{t+1}^b, and skipping to step S62; step S64: plotting combat trajectories of both sides according to the recorded states s_t^r and s_t^b; step S65: initializing the initial situations of n groups of UAVs on both sides, performing steps S62 to S63 on each group, and recording the number num of times either side succeeds in engaging in combat; and step S66: calculating num/n, namely a final combat success rate, to indicate the generalization capability of the decision-making model.
7. The air combat maneuvering method based on parallel self-play according to claim 6, wherein in the step S5, the initial velocity range is set as [50 m/s, 400 m/s], the initial pitch angle range as [−90°, 90°], and the initial heading angle range as [−180°, 180°].
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(5) The present disclosure is further described below in conjunction with the accompanying drawings and embodiments.
(6) The overall flow of the method is shown in the accompanying drawings.
(7) Further, the constructing a UAV maneuver model includes the following specific steps:
(8) The position information of the UAVs of both sides is updated according to the equations of motion, so that maneuvering is realized. The information of both sides is then provided to the red-and-blue motion situation acquiring model to calculate the corresponding situations.
(9) An OXYZ coordinate system is supposed to be a three-dimensional spatial coordinate system for UAVs, where origin O represents the center of a combat area for UAVs, with X axis pointing to the north, Z axis pointing to the east and Y axis pointing in a vertical upward direction.
(10) A UAV is regarded as a mass point and equations of motion for the UAV are established as follows:
(11)
dX/dt = v cos θ cos φ
dY/dt = v sin θ
dZ/dt = v cos θ sin φ
where v denotes the velocity of the UAV, θ denotes its pitch angle and φ denotes its heading angle, and the control quantities dv, dθ and dφ update v, θ and φ at each simulation step.
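As an illustration only, the point-mass kinematics above can be sketched in Python with a simple Euler-integration step; the step length dt and the function name uav_step are assumptions for this sketch, not part of the disclosed method.

```python
import math

def uav_step(state, action, dt=0.1):
    """One Euler step of a 3-DOF point-mass UAV model.

    state  = (x, y, z, v, theta, phi): position in the OXYZ frame
             (X north, Z east, Y up), speed v, pitch angle theta,
             heading angle phi (radians).
    action = (dv, dtheta, dphi): commanded rates for the maneuver
             control quantities, as in the action space above.
    """
    x, y, z, v, theta, phi = state
    dv, dtheta, dphi = action
    # Apply the maneuver control quantities.
    v += dv * dt
    theta += dtheta * dt
    phi += dphi * dt
    # Kinematic equations: X north, Y vertical, Z east.
    x += v * math.cos(theta) * math.cos(phi) * dt
    y += v * math.sin(theta) * dt
    z += v * math.cos(theta) * math.sin(phi) * dt
    return (x, y, z, v, theta, phi)
```

For example, one second of level flight due north at 100 m/s moves the UAV 100 m along the X axis while Y and Z stay at zero.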
(12) Further, the step S2 includes the following specific steps:
(13) The red-and-blue motion situation acquiring model can calculate a relative situation according to red and blue state information and provide the relative situation to a maneuvering decision-making module based on a deep reinforcement learning method for decision-making.
(14) The relative situation of both sides acquired by the red-and-blue motion situation acquiring model is described with D, d and q, where D denotes the position vector between the red side and the blue side in the direction from the red side to the blue side; d denotes the distance between the red side and the blue side; and q denotes a relative azimuth angle, namely the included angle between the velocity vector V_r of the red side and the distance vector D.

(15) The combat situation of the blue side relative to the red side is denoted by D_r, d and q_r, and the combat situation of the red side relative to the blue side is denoted by D_b, d and q_b, where D_r denotes the position vector from the red side to the blue side; D_b denotes the position vector from the blue side to the red side; q_r denotes the relative azimuth angle of the blue side to the red side; and q_b denotes the relative azimuth angle of the red side to the blue side.

(16) D_r, D_b, d, q_r and q_b are calculated as follows:
(17)
D_r = (X_b − X_r, Y_b − Y_r, Z_b − Z_r)
D_b = −D_r
d = |D_r|
q_r = arccos((V_r · D_r)/(|V_r|·d))
q_b = arccos((V_b · D_b)/(|V_b|·d))
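A minimal Python sketch of the situation calculation, assuming positions and velocities are given as 3-tuples in the OXYZ frame and azimuth angles are returned in degrees (the function names are illustrative):

```python
import math

def relative_situation(pos_r, vel_r, pos_b, vel_b):
    """Compute (D_r, D_b, d, q_r, q_b) for the red and blue sides."""
    # Position vector from red to blue, and its reverse.
    d_r = [b - a for a, b in zip(pos_r, pos_b)]
    d_b = [-c for c in d_r]
    # Distance between the two sides.
    d = math.sqrt(sum(c * c for c in d_r))

    def azimuth(vel, dvec):
        # Included angle between a velocity vector and a distance vector.
        dot = sum(u * w for u, w in zip(vel, dvec))
        vnorm = math.sqrt(sum(u * u for u in vel))
        return math.degrees(math.acos(dot / (vnorm * d)))

    q_r = azimuth(vel_r, d_r)  # situation of blue relative to red
    q_b = azimuth(vel_b, d_b)  # situation of red relative to blue
    return d_r, d_b, d, q_r, q_b
```

For instance, a red UAV flying straight toward a blue UAV 1000 m ahead yields d = 1000, q_r = 0° (blue dead ahead of red) and q_b = 180° (red directly behind blue).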
(18) Further, the step S3 includes the following specific steps:
(19) The state space of the red UAV is defined as S_r=[X_r, Y_r, Z_r, v_r, θ_r, φ_r, d, q_r] and the state space of the blue UAV as S_b=[X_b, Y_b, Z_b, v_b, θ_b, φ_b, d, q_b].

(20) The action space of the red UAV is defined as A_r=[dv_r, dφ_r, dθ_r] and the action space of the blue UAV as A_b=[dv_b, dφ_b, dθ_b].

(21) The reward function R is formed from a distance reward function R_d and an angle reward function R_q: R = w_1·R_d + w_2·R_q, where w_1 and w_2 denote the weights of the distance reward and the angle reward.
(22) The distance reward function R.sub.d is expressed as:
(23)
(24) The angle reward function R.sub.q is expressed as:
R_q1 = −q/180
R_q2 = 3 if q < q_max, and R_q2 = 0 otherwise
R_q = R_q1 + R_q2

where R_q1 denotes a continuous angle reward, R_q2 denotes a sparse angle reward, and q_max denotes the maximum off-boresight launch angle of the missile carried by the red side.
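The angle reward and the weighted total reward above can be sketched as follows; the zero value of the sparse term outside the launch envelope and the default q_max = 30° and w_1 = w_2 = 0.5 (taken from the exemplary embodiment) are assumptions of this sketch:

```python
def angle_reward(q, q_max=30.0):
    """R_q = R_q1 + R_q2 for a relative azimuth q in degrees:
    a continuous term plus a sparse bonus inside the maximum
    off-boresight launch angle q_max."""
    r_q1 = -q / 180.0                   # continuous angle reward
    r_q2 = 3.0 if q < q_max else 0.0    # sparse angle reward
    return r_q1 + r_q2

def total_reward(r_d, r_q, w1=0.5, w2=0.5):
    """R = w1 * R_d + w2 * R_q."""
    return w1 * r_d + w2 * r_q
```

A target dead ahead (q = 0°) earns the full sparse bonus, while a target at q = 90° yields only the negative continuous term.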
(25) Further, the constructing a maneuvering decision-making model structure based on the SAC algorithm, as shown in the accompanying drawings, includes the following specific steps:
(26) Maneuver control quantities for both the red and blue sides are generated by the maneuvering decision-making model based on the SAC algorithm, to allow the red and blue sides to maneuver.
(27) The SAC algorithm is implemented by neural networks including a replay buffer M, one Actor neural network π_θ, two Soft-Q neural networks Q_φ1 and Q_φ2, and two Target Soft-Q networks Q_φ′1 and Q_φ′2.
(28) The replay buffer M is an experience replay structure dedicated to storing the experience learned in reinforcement learning.
(29) The Actor neural network π_θ receives an input of a state value s_t^r of the red side or a state value s_t^b of the blue side and outputs a mean μ (μ_r or μ_b) and a variance σ (σ_r or σ_b). Noise τ is generated by sampling from a standard normal distribution. An action a_t^r of the red side or a_t^b of the blue side is generated from the mean μ, the variance σ and the noise τ, and is limited to the range (−1, 1) by a tanh function. The process of generating the action is shown below:
μ_r, σ_r = π_θ(s_t^r)
μ_b, σ_b = π_θ(s_t^b)
a_t^r = N(μ_r, σ_r²) = μ_r + σ_r·τ
a_t^b = N(μ_b, σ_b²) = μ_b + σ_b·τ
a_t^r = tanh(a_t^r)
a_t^b = tanh(a_t^b)
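The reparameterized action sampling above can be sketched in Python; drawing the noise per action dimension and the function name sample_action are assumptions of this sketch:

```python
import math
import random

def sample_action(mu, sigma):
    """SAC action generation: a = tanh(mu + sigma * tau), where the
    noise tau is drawn from a standard normal distribution, so every
    action component is limited to the range (-1, 1)."""
    tau = [random.gauss(0.0, 1.0) for _ in mu]
    pre = [m + s * t for m, s, t in zip(mu, sigma, tau)]
    return [math.tanh(p) for p in pre]
```

With zero variance the action collapses to tanh(mu), which is useful for deterministic evaluation of a trained policy.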
(30) The Soft-Q neural networks Q_φ1 and Q_φ2 receive inputs of a state value and an action value and output the Q values predicted by the networks. The Target Soft-Q neural networks Q_φ′1 and Q_φ′2 receive inputs of a state value and an action value and output target Q values.
(31) Each of the Actor, Soft-Q and Target Soft-Q networks is a fully connected neural network with l hidden layers, n neurons per hidden layer and ReLU activation functions.
(32) Further, the step S5 includes the following specific steps:
(33) A plurality of groups of UAVs on both sides are initialized with initial positions within the combat area; the initial velocity range is set as [50 m/s, 400 m/s], the initial pitch angle range as [−90°, 90°], and the initial heading angle range as [−180°, 180°].
(34) The steps of training the SAC algorithm by performing air combat confrontations to realize parallel self-play are as follows: step S51: defining the number env_num of parallel self-play environments, defining the number batch_size of batch training sample groups, defining a maximum simulation step size N, initializing step=1 and env=1, initializing the initial situations of both sides, and obtaining an initial state s_t^r of the red side and an initial state s_t^b of the blue side; and step S52: randomly generating an Actor network weight θ and Soft-Q network weights φ_1 and φ_2, initializing the policy network π_θ and the two Soft-Q networks Q_φ1 and Q_φ2, letting φ′_1=φ_1 and φ′_2=φ_2, and initializing the Target Soft-Q networks Q_φ′1 and Q_φ′2.
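The parallel self-play data collection described above can be sketched as follows; the env interface (a state attribute and a step method returning both sides' transitions) is an assumed structure for illustration, not the disclosed implementation. The key point is that both sides act with the same policy, and transitions from every environment feed one shared replay buffer M:

```python
import random
from collections import deque

def collect_parallel(envs, policy, buffer):
    """One self-play step in each of env_num parallel environments.
    Both red and blue act with the SAME policy (self-play); every
    transition from every environment goes into the shared buffer M."""
    for env in envs:
        s_r, s_b = env.state
        a_r, a_b = policy(s_r), policy(s_b)
        (ns_r, r_r), (ns_b, r_b) = env.step(a_r, a_b)
        buffer.append((s_r, a_r, r_r, ns_r))
        buffer.append((s_b, a_b, r_b, ns_b))
        env.state = (ns_r, ns_b)

def sample_batch(buffer, batch_size=128):
    """Uniformly sample a batch of batch_size transition groups."""
    return random.sample(buffer, min(batch_size, len(buffer)))
```

With env_num environments, each call yields 2·env_num transitions, which is what makes the parallel scheme fill the replay buffer faster than a single self-play environment.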
(35) The target Q value used for both Soft-Q networks is defined as the minimum of the outputs of the two Target Soft-Q networks Q_φ′1 and Q_φ′2:

Q_φ′(s_t, a_t) = min(Q_φ′1(s_t, a_t), Q_φ′2(s_t, a_t))

where Q_φ′1(s_t, a_t) and Q_φ′2(s_t, a_t) denote the target Q values output by the Target Soft-Q networks Q_φ′1 and Q_φ′2.
(36) The loss function of the Actor neural network is defined as follows:

J_π(θ) = E_{s_t∼M, a_t∼π_θ}[α log π_θ(a_t|s_t) − Q_φ(s_t, a_t)]

where α is the entropy regularization coefficient.
(37) The loss function J_Q(φ_i), i=1, 2, of the Soft-Q neural networks is defined as follows:

(38)
J_Q(φ_i) = E_{(s_t,a_t)∼M}[½(Q_φi(s_t, a_t) − (r_t + γ·(Q_φ′(s_{t+1}, a_{t+1}) − α log π_θ(a_{t+1}|s_{t+1}))))²]

where r_t denotes the reward and γ denotes the discount factor.
(39) The weights φ′_1 and φ′_2 of the Target Soft-Q neural networks are soft-updated as follows:

φ′_1 ← τ·φ_1 + (1 − τ)·φ′_1
φ′_2 ← τ·φ_2 + (1 − τ)·φ′_2

where τ here denotes the soft update coefficient.
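The target-weight update and the min-of-two-targets rule can be sketched as below; treating the weights as flat lists and the default coefficient tau=0.005 are assumptions of this sketch (the disclosure does not state a value):

```python
def soft_update(phi, phi_target, tau=0.005):
    """Polyak averaging of Target Soft-Q weights, elementwise:
    phi' <- tau * phi + (1 - tau) * phi'."""
    return [tau * p + (1.0 - tau) * pt for p, pt in zip(phi, phi_target)]

def target_q(q1, q2):
    """Target Q value: the minimum of the two Target Soft-Q outputs,
    which reduces overestimation of the Q value."""
    return min(q1, q2)
```

A small tau makes the target networks trail the online networks slowly, which stabilizes the bootstrapped Q targets.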
(40) The entropy regularization coefficient α is updated, and its loss function is as follows:

J(α) = E[−α log π_θ(a_t|s_t) − α·H_0]

where H_0 denotes the target entropy; step S56: determining whether step is greater than N, and if yes, proceeding to step S57; otherwise, incrementing step by 1, letting s_t^r=s_{t+1}^r and s_t^b=s_{t+1}^b, and skipping to step S53; and step S57: determining whether the algorithm converges or the set number of training episodes is reached, and if yes, ending the training and obtaining the trained SAC algorithm model; otherwise, skipping to step S51.
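The loss function J(α) above can be written as a short Python sketch; averaging over a batch of log-probabilities and the default target entropy H_0 = −3 (from the exemplary embodiment) are assumptions of this sketch:

```python
def alpha_loss(log_probs, alpha, target_entropy=-3.0):
    """J(alpha) = E[-alpha * log pi(a_t|s_t) - alpha * H_0],
    estimated as the mean over a batch of action log-probabilities.
    The gradient of this loss drives alpha up when the policy entropy
    falls below the target H_0, and down otherwise."""
    n = len(log_probs)
    return sum(-alpha * lp - alpha * target_entropy for lp in log_probs) / n
```

When the batch-average log-probability equals −H_0 (i.e. the policy entropy matches the target), the loss is zero and α stops moving.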
(41) Further, the step S6 includes the following specific steps: step S61: initializing the initial situations of both sides, and obtaining the initial states s_t^r and s_t^b of the red and blue sides; step S62: recording the states s_t^r and s_t^b, inputting them to the Actor neural network of the trained SAC algorithm model to output the actions a_t^r and a_t^b of the red and blue sides, and obtaining the new states s_{t+1}^r and s_{t+1}^b after both sides perform the actions; step S63: determining whether either side succeeds in engaging in combat, and if yes, ending; otherwise, letting s_t^r=s_{t+1}^r and s_t^b=s_{t+1}^b, and skipping to step S62; step S64: plotting the combat trajectories of both sides according to the recorded states s_t^r and s_t^b; step S65: initializing the initial situations of n groups of UAVs on both sides, performing steps S62 to S63 on each group, and recording the number num of times either side succeeds in engaging in combat; and step S66: calculating num/n, namely the final combat success rate, to indicate the generalization capability of the decision-making model.
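The testing procedure of steps S61 to S66 reduces to a simple success-rate loop, sketched below; run_episode is an assumed callable standing in for steps S61 to S63 (reset both sides, roll the trained policy forward, report whether either side succeeded in engaging in combat):

```python
def combat_success_rate(run_episode, n=200):
    """Steps S65-S66: run n randomly initialized confrontations with
    the trained model and return num / n, the final combat success
    rate, as a measure of the model's generalization capability."""
    num = sum(1 for _ in range(n) if run_episode())
    return num / n
```

In the exemplary embodiment n = 200 groups are used, so the returned value is the fraction of those 200 confrontations that end in a successful engagement.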
Specific Exemplary Embodiments
(42) In the embodiment, when initializing a plurality of groups of UAVs on both sides, the combat area is x∈[−6 km, 6 km], y∈[3 km, 4 km], z∈[−6 km, 6 km], and an initial velocity range is [50 m/s, 400 m/s], while an initial pitch angle range is [−90°,90°] and an initial heading angle range is [−180°,180°].
(43) The maximum attack range of the missile is 6 km and the minimum attack range is 1 km. The maximum off-boresight launch angle of the missile is 30°, and w_1 = w_2 = 0.5.
(44) The SAC algorithm model is constructed as follows: in the Actor neural network, the number of hidden layers is l=2, with n=256 nodes in each layer. The optimization algorithm is the Adam algorithm, with discount factor γ=0.99, network learning rate lr=0.0003, entropy regularization coefficient α=1 and target entropy H_0=−3.
(45) The number of parallel self-play environments is defined as env_num=[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]; the number of training sample groups is defined as batch_size=128; and the maximum simulation step size is defined as N=800.
(46) After the training is finished, both sides are randomly initialized to test the trained algorithm, and the combat trajectories are displayed in the accompanying drawings.
(47) 200 groups of UAVs on both sides are randomly initialized to test the trained algorithm, and the combat success rate is calculated. The results of the combat success rate varying with the number of parallel self-play environments are shown in the accompanying drawings.
(48) Therefore, the maneuvering decision-making of UAVs can be effectively realized, and the generalization capability of the model can be improved, so that the model can be more practicable.