Intention-driven reinforcement learning-based path planning method

12124282 · 2024-10-22

Abstract

The present invention discloses an intention-driven reinforcement learning-based path planning method, including the following steps: 1: acquiring, by a data collector, a state of a monitoring network; 2: selecting a steering angle of the data collector according to positions of surrounding obstacles, sensor nodes, and the data collector; 3: selecting a speed of the data collector, a target node, and a next target node as an action of the data collector according to an ε-greedy policy; 4: determining, by the data collector, a position in the next time slot according to the selected steering angle and speed; 5: obtaining rewards and penalties according to intentions of the data collector and the sensor nodes, and updating a Q value; 6: repeating step 1 to step 5 until a termination state or a convergence condition is satisfied; and 7: selecting, by the data collector, an action in each time slot having the maximum Q value as a planning result, and generating an optimal path. The method provided in the present invention can complete the data collection path planning with a higher probability of success and performance closer to the intention.

Claims

1. An intention-driven reinforcement learning-based path planning method, comprising the following steps: step A: acquiring, by a data collector, a state of a monitoring network; step B: determining a steering angle of the data collector according to positions of the data collector, sensor nodes, and surrounding obstacles; step C: selecting an action of the data collector according to an ε-greedy policy, wherein the action comprises a speed of the data collector, a target node, and a next target node; step D: adjusting, by the data collector, a direction of sailing according to the steering angle and executing the action to proceed to the next time slot; step E: calculating rewards and penalties according to intentions of the data collector and the sensor nodes, and updating a Q value; step F: repeating step A to step E until the monitoring network reaches a termination state or Q-learning satisfies a convergence condition; and step G: selecting, by the data collector, an action in each time slot having the maximum Q value as a planning result, and generating an optimal data collection path.

2. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein the state s of the monitoring network in step A comprises: a direction of sailing θ[n] of the data collector in a time slot n, coordinates q_u[n] of the data collector, available storage space {b_am[n]}_{m∈M} of the sensor nodes, data collection indicators {w_m[n]}_{m∈M} of the sensor nodes, distances {d_um[n]}_{m∈M} between the data collector and the sensor nodes, and distances {d_uk[n]}_{k∈K} between the data collector and the surrounding obstacles, wherein M is the set of sensor nodes, K is the set of surrounding obstacles, w_m[n] ∈ {0,1} is a data collection indicator of the sensor node m, and w_m[n]=1 indicates that the data collector completes the data collection of the sensor node m in the time slot n, or otherwise indicates that the data collection is not completed.

3. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a formula for calculating the steering angle of the data collector in step B is: θ[n+1] = min(θ_up[n] − θ[n], θ_max) in a case that θ_up[n] ≥ θ[n], or θ[n+1] = max(θ_up[n] − θ[n], −θ_max) in a case that θ_up[n] < θ[n], (11), wherein θ_up[n] is a relative angle between the coordinates q_u[n] of the data collector and a target position p[n], and θ_max is the maximum steering angle of the data collector.

4. The intention-driven reinforcement learning-based path planning method according to claim 3, wherein steps of determining the target position in step B comprise: step B1: determining whether the data collector senses the obstacles, and comparing φ_m1o1[n] with φ_m1o2[n] in a case that the data collector senses the obstacles, wherein in a case that φ_m1o1[n] < φ_m1o2[n], the target position of the data collector is p[n] = p_o1[n], or otherwise, the target position of the data collector is p[n] = p_o2[n], wherein p_o1[n] and p_o2[n] are two points on the boundary of the surrounding obstacles detected by the data collector at the maximum sensing angle, and φ_m1o1[n] and φ_m1o2[n] are respectively relative angles between a target sensor node and the points p_o1[n] and p_o2[n]; step B2: determining whether the path q_u[n]q_m2 from the data collector to the next target node m_2 extends through a communication area C_1 of the target node m_1 in a case that the data collector does not sense the surrounding obstacles, wherein in a case that q_u[n]q_m2 does not extend through C_1, the target position is p[n] = p_c1[n], wherein p_c1[n] is a point in the communication area C_1 realizing the smallest distance |p_c1[n]q_u[n]| + |p_c1[n]q_m2|; and step B3: determining whether the path q_u[n]q_m2 extends through a safety area C_2 of the target node m_1 in a case that q_u[n]q_m2 extends through C_1, wherein in a case that q_u[n]q_m2 does not extend through C_2, the target position is p[n] = q_m2, or otherwise the target position is p[n] = p_c2[n], wherein p_c2[n] is a point in the safety area C_2 realizing the smallest distance |p_c2[n]q_u[n]| + |p_c2[n]q_m2|.

5. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a method for selecting the action according to the ε-greedy policy in step C is expressed as: a = argmin_a Q(s,a) in a case that τ > ε, or a random action otherwise, (12), wherein ε is an exploration probability, τ ∈ [0,1] is a randomly generated value, and Q(s,a) is a Q value for executing the action a in a state s.

6. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a formula for calculating the position in the next time slot of the data collector in step D is: q_u[n] = (x_u[n−1] + v[n]δ cos θ[n], y_u[n−1] + v[n]δ sin θ[n]), ∀n, (13), wherein x_u[n−1] and y_u[n−1] are respectively an x coordinate and a y coordinate of the data collector in the time slot n−1, v[n] is a sailing speed of the data collector, and δ is the duration of each time slot.

7. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein the rewards and penalties corresponding to the intentions of the data collector and the sensor nodes in step E are calculated as defined below: step E1: in a case that the intention of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_tot and return to a base within a specified time T and that the intention of the sensor nodes is to minimize overflow data B_tot^s, a reward R_a(s,s′) of the Q-learning is a weighted sum λE_tot + (1−λ)B_tot^s of the energy consumption of the data collector and the data overflow of the sensor nodes, wherein s′ is the next state of the monitoring network after an action a is executed in a state s, and λ is a weight factor; and step E2: according to the intentions of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s′) = ξ_safe + ξ_bou + ξ_time + ξ_tra + ξ_ter, wherein ξ_safe is a safety penalty, which means that distances between the data collector and the surrounding obstacles and between the data collector and the sensor nodes are required to satisfy an anti-collision distance; ξ_bou is a boundary penalty, which means that the data collector is not allowed to exceed a feasible area; ξ_time is a time penalty, which means that the data collector is required to complete the data collection within a time T; ξ_tra is a traversal collection penalty, which means that data of all sensor nodes is required to be collected; and ξ_ter is a termination penalty, which means that the data collector is required to return to a base within the time T.

8. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a formula for updating the Q value in step E is: Q(s,a) ← (1−α)Q(s,a) + α[R_a(s,s′) + C_a(s,s′) + γ min_a′ Q(s′,a′)], (14), wherein α is a learning rate, and γ is a reward discount factor.

9. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein the termination state of the monitoring network in step F is that the data collector completes the data collection of the sensor nodes or the data collector does not complete the data collection at a moment T, and the convergence condition of the Q-learning is expressed as: 0 ≤ |Q_j(s,a) − Q_{j−1}(s,a)| ≤ ζ, (15), wherein ζ is an allowable learning error, and j is a learning iteration number.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is an exemplary scenario diagram according to the present invention.

(2) FIG. 2 is a schematic diagram of an implementation process according to the present invention.

DETAILED DESCRIPTION

(3) The following specifically describes an intention-driven reinforcement learning-based path planning method provided in the embodiments of the present invention with reference to the accompanying drawings.

(4) FIG. 1 is an exemplary scenario diagram according to the present invention. Refer to FIG. 1.

(5) A marine monitoring network includes one unmanned ship, M sensor nodes, and K obstacles such as sea islands, sea waves, and reefs. The unmanned ship sets out from a base, avoids collision with the obstacles and the sensor nodes, completes data collection of each sensor node within a specified time T, and returns to the base. In order to satisfy intentions of the unmanned ship and the sensor nodes, weighted energy consumption of the unmanned ship and data overflow of the sensor nodes are expressed as rewards for reinforcement learning, and a safety intention, a traversal collection intention, and an intention of returning to the base on time are expressed as penalties, to optimize a path of the unmanned ship by using a Q-learning method.
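Purely as a non-limiting illustration of the reward design described above, the weighted reward and the summed intention penalties might be computed as in the following sketch; the weight `lam` (the factor λ), the uniform penalty constant `c`, and the function names are assumptions introduced here, not part of the disclosure:

```python
def reward(e_tot: float, b_tot: float, lam: float = 0.5) -> float:
    """Weighted sum lam*E_tot + (1 - lam)*B_tot of the unmanned ship's energy
    consumption and the sensor nodes' data overflow. Smaller is better: the
    Q-learning treats this quantity as a cost to be minimized."""
    return lam * e_tot + (1.0 - lam) * b_tot

def penalty(collided: bool, out_of_bounds: bool, overtime: bool,
            uncollected_nodes: int, not_returned: bool, c: float = 100.0) -> float:
    """Sum of the safety, boundary, time, traversal-collection, and termination
    penalties, here all scaled by one hypothetical constant c for simplicity."""
    return c * (collided + out_of_bounds + overtime
                + uncollected_nodes + not_returned)
```

For example, with λ = 0.5, an episode with E_tot = 10 and B_tot = 20 yields a reward term of 15, and each violated intention adds a further penalty of c.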

(6) FIG. 2 is a schematic diagram of an implementation process according to the present invention. Specific implementation processes are as follows: Step I: A data collector acquires state information of a monitoring network, where the state information includes: a direction of sailing θ[n] of the data collector in a time slot n, coordinates q_u[n] of the data collector, available storage space {b_am[n]}_{m∈M} of sensor nodes, data collection indicators {w_m[n]}_{m∈M} of the sensor nodes, distances {d_um[n]}_{m∈M} between the data collector and the sensor nodes, and distances {d_uk[n]}_{k∈K} between the data collector and surrounding obstacles, where M is the set of sensor nodes, K is the set of surrounding obstacles, w_m[n] ∈ {0,1} is a data collection indicator of the sensor node m, and w_m[n]=1 indicates that the data collector completes the data collection of the sensor node m in the time slot n, or otherwise indicates that the data collection is not completed.
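For illustration only, the state acquired in Step I can be sketched as a simple data structure; the class and field names below are hypothetical, assuming planar coordinates and the index sets M and K defined above:

```python
from dataclasses import dataclass

@dataclass
class NetworkState:
    """Hypothetical container for the monitoring-network state in time slot n."""
    theta: float   # direction of sailing θ[n] of the data collector
    q_u: tuple     # coordinates q_u[n] = (x_u[n], y_u[n])
    b_a: dict      # available storage space {b_am[n]}, keyed by node m in M
    w: dict        # collection indicators {w_m[n]} in {0, 1}, keyed by m in M
    d_um: dict     # distances {d_um[n]} to the sensor nodes, keyed by m in M
    d_uk: dict     # distances {d_uk[n]} to the obstacles, keyed by k in K

    def collection_done(self) -> bool:
        # w_m[n] = 1 means the data of node m has been collected by slot n
        return all(v == 1 for v in self.w.values())
```

The `collection_done` helper mirrors the traversal-collection intention: the episode can only terminate successfully once every indicator w_m[n] equals 1.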
Step II: Determine a steering angle of the data collector according to positions of the data collector, the sensor nodes, and the surrounding obstacles, including the following steps: (1): determining whether the data collector senses the obstacles, and comparing φ_m1o1[n] with φ_m1o2[n] in a case that the data collector senses the obstacles, where in a case that φ_m1o1[n] < φ_m1o2[n], the target position of the data collector is p[n] = p_o1[n], or otherwise, the target position of the data collector is p[n] = p_o2[n], where p_o1[n] and p_o2[n] are two points on the boundary of the surrounding obstacles detected by the data collector at the maximum sensing angle, and φ_m1o1[n] and φ_m1o2[n] are respectively relative angles between a target sensor node and the points p_o1[n] and p_o2[n]; (2): determining whether the path q_u[n]q_m2 from the data collector to the next target node m_2 extends through a communication area C_1 of the target node m_1 in a case that the data collector does not sense the surrounding obstacles, where in a case that q_u[n]q_m2 does not extend through C_1, the target position is p[n] = p_c1[n], where p_c1[n] is a point in the communication area C_1 realizing the smallest distance |p_c1[n]q_u[n]| + |p_c1[n]q_m2|; (3): determining whether the path q_u[n]q_m2 extends through the safety area C_2 of the target node m_1 in a case that q_u[n]q_m2 extends through C_1, where in a case that q_u[n]q_m2 does not extend through C_2, the target position is p[n] = q_m2, or otherwise the target position is p[n] = p_c2[n], where p_c2[n] is a point in the safety area C_2 realizing the smallest distance |p_c2[n]q_u[n]| + |p_c2[n]q_m2|; and (4) calculating the steering angle of the data collector by using the following formula:

(7) θ[n+1] = min(θ_up[n] − θ[n], θ_max) in a case that θ_up[n] ≥ θ[n], or θ[n+1] = max(θ_up[n] − θ[n], −θ_max) in a case that θ_up[n] < θ[n]. (6) θ_up[n] is a relative angle between the coordinates q_u[n] of the data collector and a target position p[n], and θ_max is the maximum steering angle of the data collector. Step III: Select an action of the data collector according to an ε-greedy policy, where the action includes a speed of the data collector, a target node, and a next target node. A method for selecting the action according to the ε-greedy policy is expressed as:

(8) a = argmin_a Q(s,a) in a case that τ > ε, or a random action otherwise. (7) ε is an exploration probability, τ ∈ [0,1] is a randomly generated value, and Q(s,a) is a Q value for executing the action a in a state s. Step IV: The data collector adjusts a direction of sailing according to the steering angle and executes the action to proceed to the next time slot, where the coordinates of the data collector are expressed as:

(9) q_u[n] = (x_u[n−1] + v[n]δ cos θ[n], y_u[n−1] + v[n]δ sin θ[n]), ∀n. (8) x_u[n−1] and y_u[n−1] are respectively an x coordinate and a y coordinate of the data collector in the time slot n−1, v[n] is a sailing speed of the data collector, and δ is the duration of each time slot. Step V: Calculate rewards and penalties according to the intentions of the data collector and the sensor nodes, and update the Q value by using the following formula:

(10) Q(s,a) ← (1−α)Q(s,a) + α[R_a(s,s′) + C_a(s,s′) + γ min_a′ Q(s′,a′)]. (9) α is a learning rate, and γ is a reward discount factor.
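Steps III to V can be sketched in a few lines; the following is a minimal, dictionary-backed illustration, assuming the Q table stores costs to be minimized and using placeholder hyperparameter values (α = 0.1, γ = 0.9, ε = 0.1) that are not taken from the disclosure:

```python
import math
import random
from collections import defaultdict

Q = defaultdict(float)                   # Q[(s, a)], defaults to 0 for unseen pairs
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # placeholders for α, γ, and ε

def select_action(s, actions):
    """Step III sketch: exploit argmin_a Q(s, a) when a random draw τ exceeds ε,
    otherwise explore with a random action."""
    if random.random() > EPSILON:
        return min(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

def next_position(x_prev, y_prev, v, theta, delta):
    """Step IV sketch: q_u[n] = (x_u[n-1] + v·δ·cos θ[n], y_u[n-1] + v·δ·sin θ[n])."""
    return (x_prev + v * delta * math.cos(theta),
            y_prev + v * delta * math.sin(theta))

def update_q(s, a, s_next, next_actions, r, c):
    """Step V sketch: Q(s,a) <- (1-α)Q(s,a) + α[R_a(s,s') + C_a(s,s')
    + γ·min_a' Q(s',a')]."""
    best_next = min(Q[(s_next, a2)] for a2 in next_actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + c + GAMMA * best_next)
```

Because the reward is a cost (weighted energy plus overflow) and the penalties are nonnegative, both the update and the greedy selection minimize rather than maximize over actions in this sketch.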

(11) The rewards and penalties are calculated as defined below: (1): in a case that the intention of the data collector is to safely complete the data collection of all sensor nodes with minimum energy consumption E_tot and return to a base within a specified time T and that the intention of the sensor nodes is to minimize overflow data B_tot^s, a reward R_a(s,s′) of the Q-learning is a weighted sum λE_tot + (1−λ)B_tot^s of the energy consumption of the data collector and the data overflow of the sensor nodes, where s′ is a next state of the monitoring network after an action a is executed in a state s, and λ is a weight factor; and (2): according to the intentions of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s′) = ξ_safe + ξ_bou + ξ_time + ξ_tra + ξ_ter, where ξ_safe is a safety penalty, which means that distances between the data collector and the surrounding obstacles and between the data collector and the sensor nodes are required to satisfy an anti-collision distance; ξ_bou is a boundary penalty, which means that the data collector is not allowed to exceed a feasible area; ξ_time is a time penalty, which means that the data collector is required to complete the data collection within a time T; ξ_tra is a traversal collection penalty, which means that data of all sensor nodes is required to be collected; and ξ_ter is a termination penalty, which means that the data collector is required to return to a base within the time T. Step VI: Repeat step I to step V until the monitoring network reaches a termination state or Q-learning satisfies a convergence condition. The termination state is that the data collector completes the data collection of the sensor nodes or the data collector does not complete the data collection at a moment T, and the convergence condition of the Q-learning is expressed as:

(12) 0 ≤ |Q_j(s,a) − Q_{j−1}(s,a)| ≤ ζ. (10) ζ is an allowable learning error, and j is a learning iteration number. Step VII: The data collector selects an action in each time slot having the maximum Q value as a planning result, and generates an optimal data collection path.
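Steps I to VII together form a standard tabular Q-learning loop. The following non-limiting sketch shows one way the outer loop might look, assuming a hypothetical `env` object that implements the state transitions, rewards, and penalties described above; the stopping test mirrors the convergence condition 0 ≤ |Q_j − Q_{j−1}| ≤ ζ:

```python
import copy
import random

def train(env, q, alpha=0.1, gamma=0.9, eps=0.1, zeta=1e-3, max_episodes=500):
    """Q-learning driver: repeat Steps I-V until the termination state or the
    convergence condition of Step VI is reached. `env` is a hypothetical
    environment exposing reset/terminal/actions/step; `q` maps (s, a) pairs
    to Q values treated as costs to be minimized."""
    for _ in range(max_episodes):
        q_prev = copy.deepcopy(q)            # snapshot Q_{j-1} for the ζ test
        s = env.reset()                      # Step I: acquire the network state
        while not env.terminal():
            actions = env.actions(s)
            if random.random() > eps:        # Step III: ε-greedy selection
                a = min(actions, key=lambda x: q.get((s, x), 0.0))
            else:
                a = random.choice(actions)
            s2, r, c = env.step(a)           # Steps II/IV: steer and move
            best = min(q.get((s2, x), 0.0) for x in env.actions(s2))
            q[(s, a)] = ((1 - alpha) * q.get((s, a), 0.0)
                         + alpha * (r + c + gamma * best))   # Step V: Q update
            s = s2
        # Step VI: converged when every Q entry changed by at most ζ
        if q and all(abs(q[k] - q_prev.get(k, 0.0)) <= zeta for k in q):
            break
    return q
```

After training, Step VII replays the learned policy from the initial state, taking in each time slot the action the Q table ranks best, and the resulting sequence of positions is the planned data collection path.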

(13) The intention-driven reinforcement learning-based path planning method of the present invention is applicable to unmanned aerial vehicle (UAV)-assisted ground Internet of Things (IoT) networks, unmanned-ship-assisted ocean monitoring networks, and unmanned-submarine-assisted seabed sensor networks.

(14) It may be understood that the present invention is described with reference to some embodiments, and a person skilled in the art will appreciate that various changes or equivalent replacements may be made to the embodiments of the present invention without departing from the spirit and scope of the present invention. In addition, with the teachings of the present invention, these features and embodiments may be modified to suit specific situations and materials without departing from the spirit and scope of the present invention. Therefore, the present invention is not limited by the specific embodiments disclosed herein, and all embodiments falling within the scope of the claims of this application shall fall within the protection scope of the present invention.