METHOD FOR ENERGY EFFICIENT ROUTING IN WIRELESS SENSOR NETWORK BASED ON MULTI-AGENT DEEP REINFORCEMENT LEARNING
20230231796 · 2023-07-20
Assignee
Inventors
- Jing REN (Chengdu, CN)
- Tongyu SONG (Chengdu, CN)
- Jiangong ZHENG (Chengdu, CN)
- Xiaotong GUO (Chengdu, CN)
- Xuebin TAN (Chengdu, CN)
- Sheng WANG (Chengdu, CN)
- Shizhong XU (Chengdu, CN)
- Xiong WANG (Chengdu, CN)
CPC classification
H04W40/24
ELECTRICITY
H04L45/08
ELECTRICITY
Y02D30/70
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
A method for energy efficient routing in a wireless sensor network based on multi-agent deep reinforcement learning predefines a to-be-deployed wireless sensor network and creates a cooperative routing decision system comprising A decision networks and one sink module, the A decision networks deployed on the agents a.sup.i, i=1, 2, . . . , A, of the sensor nodes, and the sink module deployed on the sink node n.sup.0. Each decision network obtains a probability vector according to its local observation and position vectors. The sink module calculates a routing for each sensor node according to the probability vectors of the A decision networks and sends the routings to the corresponding sensor nodes. A multi-agent deep reinforcement learning algorithm is adopted to train the decision networks of the A agents a.sup.i, i=1, 2, . . . , A of the cooperative routing decision system; the method then deploys the to-be-deployed wireless sensor network into an environment and updates the routing policy of the deployed wireless sensor network at each update cycle of routing.
Claims
1. A method for energy efficient routing in wireless sensor network based on multi-agent deep reinforcement learning, comprising: step S1: for a wireless sensor network to be deployed, denoting the sink node which is connected to power supplies and has unlimited energy by n.sup.0 and the sensor nodes which are battery-powered by n.sup.i, i=1, 2, . . . , A, where A is the number of the battery-powered sensor nodes; and for sensor node n.sup.i, taking the other nodes within its communication range as its neighbor node set N.sub.nbr.sup.i; setting the transmission cycle of each sensor node as U seconds, wherein each sensor node collects environmental data from its environment for T seconds in each transmission cycle and sends the collected environmental data to sink node n.sup.0; deploying an agent on each sensor node, wherein for sensor node n.sup.i, its agent is denoted by a.sup.i; step S2: constructing a cooperative routing decision system, which comprises A decision networks and one sink module, where the A decision networks are deployed on agents a.sup.i, i=1, 2, . . . , A, of sensor nodes n.sup.i, i=1, 2, . . . , A, respectively, and the sink module is deployed on sink node n.sup.0, wherein: the decision network deployed on agent a.sup.i of sensor node n.sup.i is used for determining a probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A] of choosing sink node n.sup.0 and sensor nodes n.sup.j, j=1, 2, . . . , A as its parent node at time t, where p.sub.t.sup.i,j is the probability of choosing node n.sup.j as the parent node of sensor node n.sup.i at time t, j=0, 1, . . . , A, t is a routing decision time, probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . 
, p.sub.t.sup.i,A] is uploaded to the sink module on sink node n.sup.0 through the current routing; the decision network comprises a neural network and a mask module, where the input of the neural network is an input vector which is obtained by concatenating a local observation vector O.sub.t.sup.i and a position vector Pos.sup.i, the output of the neural network is a raw probability vector {circumflex over (P)}.sub.t.sup.i=[{circumflex over (p)}.sub.t.sup.i,0, {circumflex over (p)}.sub.t.sup.i,1, . . . , {circumflex over (p)}.sub.t.sup.i,A] which is sent to the mask module, and {circumflex over (p)}.sub.t.sup.i,j is a raw probability of choosing node n.sup.j as the parent node of sensor node n.sup.i at time t, where: local observation vector O.sub.t.sup.i is obtained by concatenating the data amounts c.sub.s.sup.i,t−b.sup.1 of the collected environmental data and c.sub.o.sup.i,t−b.sup.2 of the forwarded data at the latest B.sub.1 and B.sub.2 routing decisions with the residual energy W.sup.i of sensor node n.sup.i, each normalized by its theoretical maximum:
O.sub.t.sup.i=[c.sub.s.sup.i,t/{circumflex over (c)}.sub.s.sup.i, . . . , c.sub.s.sup.i,t−B.sub.1+1/{circumflex over (c)}.sub.s.sup.i, c.sub.o.sup.i,t/{circumflex over (c)}.sub.o.sup.i, . . . , c.sub.o.sup.i,t−B.sub.2+1/{circumflex over (c)}.sub.o.sup.i, W.sup.i/Ŵ.sup.i]
position vector Pos.sup.i is obtained by normalizing the coordinates of sensor node n.sup.i by the maximal node-to-sink distance max_dis:
Pos.sup.i=(pos.sub.1.sup.i/max_dis, pos.sub.2.sup.i/max_dis)
the mask module is used for correcting raw probability vector {circumflex over (P)}.sub.t.sup.i=[{circumflex over (p)}.sub.t.sup.i,0, {circumflex over (p)}.sub.t.sup.i,1, . . . , {circumflex over (p)}.sub.t.sup.i,A] according to neighbor node set N.sub.nbr.sup.i of sensor node n.sup.i to obtain probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A] as follows: for each probability {circumflex over (p)}.sub.t.sup.i,j in the raw probability vector, firstly, if the corresponding node n.sup.j is not in neighbor node set N.sub.nbr.sup.i of sensor node n.sup.i, setting probability {circumflex over (p)}.sub.t.sup.i,j to 0, otherwise, not changing probability {circumflex over (p)}.sub.t.sup.i,j, then normalizing the remaining probabilities so that they sum to 1 to obtain probability p.sub.t.sup.i,j:
p.sub.t.sup.i,j={circumflex over (p)}.sub.t.sup.i,j/Σ.sub.j′=0.sup.A{circumflex over (p)}.sub.t.sup.i,j′
2. A method for energy efficient routing in wireless sensor network based on multi-agent deep reinforcement learning of claim 1, wherein the neural network in step S2 comprises a first fully connected layer, a second fully connected layer, a third fully connected layer, a concatenate layer, a fourth fully connected layer and a softmax layer, where: the first fully connected layer is used for receiving and processing local observation vector O.sub.t.sup.i and sending the obtained feature to the second fully connected layer; the second fully connected layer is used for processing the received feature and sending its obtained feature to the concatenate layer; the third fully connected layer is used for receiving and processing position vector Pos.sup.i and sending the obtained feature to the concatenate layer; the concatenate layer is used for concatenating the two obtained features and sending the concatenated feature to the fourth fully connected layer; the fourth fully connected layer is used for receiving and processing the concatenated feature and sending its obtained feature to the softmax layer; the softmax layer is used for generating raw probability vector {circumflex over (P)}.sub.t.sup.i=[{circumflex over (p)}.sub.t.sup.i,0, {circumflex over (p)}.sub.t.sup.i,1, . . . , {circumflex over (p)}.sub.t.sup.i,A] according to its received feature.
3. A method for energy efficient routing in wireless sensor network based on multi-agent deep reinforcement learning of claim 1, wherein the method for generating a spanning tree of the wireless sensor network at time t according to probability vectors P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A], i=1, 2, . . . , A in step S2 comprises: step S2.1: initializing an edge set: setting an edge set E.sub.mst by sink node n.sup.0 and initializing it to an empty set, where edge set E.sub.mst is used for storing the edges of the spanning tree generated for the wireless sensor network; step S2.2: randomly selecting a sensor node: randomly selecting an unsampled sensor node n.sup.i*; step S2.3: selecting a candidate parent node: randomly generating a floating point number in the range of (0,1] by sink node n.sup.0, and judging the interval it falls within on the cumulative distribution function of probability vector P.sub.t.sup.i*=[p.sub.t.sup.i*,0, p.sub.t.sup.i*,1, . . . , p.sub.t.sup.i*,A] of unsampled sensor node n.sup.i*, taking the node corresponding to the probability which corresponds to the interval as the candidate parent node n.sup.j* of the unsampled sensor node n.sup.i*; step S2.4: judging whether a routing loop is formed: judging whether a routing loop is formed after the edge (n.sup.i*,n.sup.j*) is added into edge set E.sub.mst, if yes, then going to step S2.5, otherwise going to step S2.6; step S2.5: updating the probability vector: renormalizing probability vector P.sub.t.sup.i*=[p.sub.t.sup.i*,0, p.sub.t.sup.i*,1, . . . , p.sub.t.sup.i*,A] of sensor node n.sup.i* as follows:
4. A method for energy efficient routing in wireless sensor network based on multi-agent deep reinforcement learning of claim 1, wherein a mean field actor critic framework of an actor-critic based multi-agent deep reinforcement learning algorithm is chosen in the training of the decision network in step S3, and the training of the decision network comprises: simulating the data amount of the data collected by each sensor node in the real world according to a probability distribution designed from existing prior knowledge for each type of data collected by the sensor node; taking the decision networks of the agents in the cooperative routing decision system created in step S2 as actor networks and setting a critic network for instructing the learning of the actor networks; modeling the process of routing decision of the decision network as a partially observable Markov decision process, where the input vector of each decision network is taken as a local observation in the partially observable Markov decision process, the parent node chosen by the routing of the corresponding sensor node which is obtained by the sink node is taken as an action in the partially observable Markov decision process, and the reward function is calculated according to the lifetime of the wireless sensor network, the calculating formula being:
Description
BRIEF DESCRIPTION OF THE DRAWING
[0045] The above and other objectives, features and advantages of the present invention will be more apparent from the following detailed description taken in conjunction with the accompanying drawings.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0057] Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings. It should be noted that the similar modules are designated by similar reference numerals although they are illustrated in different drawings. Also, in the following description, a detailed description of known functions and configurations incorporated herein will be omitted when it may obscure the subject matter of the present invention.
[0059] Step S1: Predefining a Wireless Sensor Network
[0060] For a wireless sensor network to be deployed, denoting the sink node which is connected to power supplies and has unlimited energy by n.sup.0 and the sensor nodes which are battery-powered by n.sup.i, i=1, 2, . . . , A, where A is the number of the battery-powered sensor nodes; and for sensor node n.sup.i, taking the other nodes within its communication range as its neighbor node set N.sub.nbr.sup.i.
[0061] Setting the transmission cycle of each sensor node as U seconds, wherein each sensor node collects environmental data from its environment for T seconds in each transmission cycle and sends the collected environmental data to sink node n.sup.0. The transmission cycle and the time length of collecting environmental data should be set so that each sensor node has enough time to complete its data transmission.
[0062] Deploying an agent on each sensor node, wherein for sensor node n.sup.i, its agent is denoted by a.sup.i. Each agent hosts a decision network, so that periodic routing decisions can be made cooperatively to obtain the routing of each sensor node.
[0064] Step S2: Constructing a Cooperative Routing Decision System
[0065] The cooperative routing decision system comprises A decision networks and one sink module, where the A decision networks are deployed on agents a.sup.i, i=1, 2, . . . , A, of sensor nodes n.sup.i, i=1, 2, . . . , A, respectively, and the sink module is deployed on sink node n.sup.0. The decision network and the sink module are detailed as follows:
[0066] The decision network deployed on agent a.sup.i of sensor node n.sup.i is used for determining a probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A] of choosing sink node n.sup.0 and sensor nodes n.sup.j, j=1, 2, . . . , A as its parent node at time t, where p.sub.t.sup.i,j is the probability of choosing node n.sup.j as the parent node of sensor node n.sup.i at time t, j=0, 1, . . . , A, and t is a routing decision time; probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A] is uploaded to the sink module on sink node n.sup.0 through the current routing.
[0067] The decision network comprises a neural network and a mask module, where the input of the neural network is an input vector obtained by concatenating a local observation vector O.sub.t.sup.i and a position vector Pos.sup.i, and the output of the neural network is a raw probability vector {circumflex over (P)}.sub.t.sup.i=[{circumflex over (p)}.sub.t.sup.i,0, {circumflex over (p)}.sub.t.sup.i,1, . . . , {circumflex over (p)}.sub.t.sup.i,A], which is sent to the mask module.
[0068] The local observation vector O.sub.t.sup.i is determined as follows: firstly, obtaining data amounts c.sub.s.sup.i,t−b.sup.1, b.sub.1=0, 1, . . . , B.sub.1−1 of the environmental data collected by sensor node n.sup.i and data amounts c.sub.o.sup.i,t−b.sup.2, b.sub.2=0, 1, . . . , B.sub.2−1 of the data forwarded by sensor node n.sup.i at the latest B.sub.1 and B.sub.2 routing decisions, respectively, together with residual energy W.sup.i of sensor node n.sup.i;
[0069] where {circumflex over (c)}.sub.s.sup.i, {circumflex over (c)}.sub.o.sup.i and Ŵ.sup.i are the theoretical maximums of data amount c.sub.s.sup.i,t−b.sup.1, data amount c.sub.o.sup.i,t−b.sup.2 and residual energy W.sup.i, respectively, which are used for normalization;
[0070] then concatenating the normalized data amounts and the normalized residual energy to obtain the local observation vector:
O.sub.t.sup.i=[c.sub.s.sup.i,t/{circumflex over (c)}.sub.s.sup.i, . . . , c.sub.s.sup.i,t−B.sub.1+1/{circumflex over (c)}.sub.s.sup.i, c.sub.o.sup.i,t/{circumflex over (c)}.sub.o.sup.i, . . . , c.sub.o.sup.i,t−B.sub.2+1/{circumflex over (c)}.sub.o.sup.i, W.sup.i/Ŵ.sup.i]
[0071] In the embodiment, B.sub.1=B.sub.2=5, then the dimension of local observation vector O.sub.t.sup.i is B.sub.1+B.sub.2+1=11.
[0072] The position vector Pos.sup.i is determined as follows: establishing a Cartesian coordinate system with sink node n.sup.0 as an origin and obtaining coordinates (pos.sub.1.sup.i, pos.sub.2.sup.i) of sensor node n.sup.i under the Cartesian coordinate system, where pos.sub.1.sup.i and pos.sub.2.sup.i are the horizontal coordinate and the vertical coordinate of sensor node n.sup.i, respectively, then obtaining a distance dis.sup.i between sensor node n.sup.i and sink node n.sup.0 and a maximal distance max_dis among the A distances dis.sup.i, i=1, 2, . . . , A, then normalizing coordinates (pos.sub.1.sup.i, pos.sub.2.sup.i) to obtain position vector Pos.sup.i:
Pos.sup.i=(pos.sub.1.sup.i/max_dis, pos.sub.2.sup.i/max_dis).
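As a sketch, the input construction above can be written in Python (the function names and argument layout are illustrative, not from the patent):

```python
def build_position_vector(pos, max_dis):
    """Normalize a node's Cartesian coordinates (sink at the origin)
    by the maximal node-to-sink distance, as in Pos^i above."""
    x, y = pos
    return (x / max_dis, y / max_dis)

def build_observation(c_s_hist, c_o_hist, w, c_s_max, c_o_max, w_max):
    """Concatenate the normalized collected/forwarded data amounts of the
    latest B1/B2 routing decisions with the normalized residual energy."""
    obs = [c / c_s_max for c in c_s_hist]   # B1 collected-data amounts
    obs += [c / c_o_max for c in c_o_hist]  # B2 forwarded-data amounts
    obs.append(w / w_max)                   # residual energy
    return obs
```

With B.sub.1=B.sub.2=5 as in the embodiment, the observation has 5+5+1=11 entries.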
[0073] The neural network comprises a first fully connected layer, a second fully connected layer, a third fully connected layer, a concatenate layer, a fourth fully connected layer and a softmax layer, which are detailed as follows:
[0074] The first fully connected layer is used for receiving and processing local observation vector O.sub.t.sup.i and sending the obtained feature to the second fully connected layer.
[0075] The second fully connected layer is used for processing the received feature and sending its obtained feature to the concatenate layer.
[0076] The third fully connected layer is used for receiving and processing position vector Pos.sup.i and sending the obtained feature to the concatenate layer.
[0077] The concatenate layer is used for concatenating the two obtained features and send the concatenated feature to the fourth fully connected layer.
[0078] The fourth fully connected layer is used for receiving and processing the concatenated feature and sending its obtained feature to the softmax layer.
[0079] The softmax layer is used for generating raw probability vector {circumflex over (P)}.sub.t.sup.i=[{circumflex over (p)}.sub.t.sup.i,0, {circumflex over (p)}.sub.t.sup.i,1, . . . , {circumflex over (p)}.sub.t.sup.i,A] according to its received feature.
[0080] According to the description above, the neural network in the embodiment extracts the state of the local observation vector O.sub.t.sup.i through two fully connected layers and extracts the embedded information identifying an agent from the position vector Pos.sup.i. The separate extractions make the extracted features more reasonable and enhance the accuracy of the raw probability vector. In the embodiment, all fully connected layers of the neural network adopt ReLU (Rectified Linear Unit) activation functions, and their widths are all 128.
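A minimal NumPy sketch of this actor architecture (untrained random weights; the class name, the default dimensions and the 20-node network size are assumptions for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class ActorNetwork:
    """Sketch of the decision network: two FC layers on the local
    observation, one FC layer on the position vector, a concatenate
    layer, a fourth FC layer, and a softmax over the A+1 candidate
    parent nodes (width 128 and ReLU, as in the embodiment)."""

    def __init__(self, obs_dim=11, pos_dim=2, n_nodes=20, width=128, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda n_in, n_out: (rng.normal(0.0, 0.1, (n_in, n_out)),
                                    np.zeros(n_out))
        self.W1, self.b1 = init(obs_dim, width)      # first FC (observation)
        self.W2, self.b2 = init(width, width)        # second FC
        self.W3, self.b3 = init(pos_dim, width)      # third FC (position)
        self.W4, self.b4 = init(2 * width, n_nodes)  # fourth FC

    def forward(self, obs, pos):
        h_obs = relu(relu(obs @ self.W1 + self.b1) @ self.W2 + self.b2)
        h_pos = relu(pos @ self.W3 + self.b3)
        h = np.concatenate([h_obs, h_pos])           # concatenate layer
        return softmax(h @ self.W4 + self.b4)        # raw probability vector
```

The output is a valid raw probability vector of length A+1, one entry per candidate parent node.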
[0081] The mask module is used for correcting raw probability vector {circumflex over (P)}.sub.t.sup.i=[{circumflex over (p)}.sub.t.sup.i,0, {circumflex over (p)}.sub.t.sup.i,1, . . . , {circumflex over (p)}.sub.t.sup.i,A] according to neighbor node set N.sub.nbr.sup.i of sensor node n.sup.i to obtain probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A] as follows: for each probability {circumflex over (p)}.sub.t.sup.i,j in the raw probability vector, firstly, if the corresponding node n.sup.j is not in neighbor node set N.sub.nbr.sup.i of sensor node n.sup.i, setting probability {circumflex over (p)}.sub.t.sup.i,j to 0, otherwise, not changing probability {circumflex over (p)}.sub.t.sup.i,j, then normalizing the remaining probabilities so that they sum to 1 to obtain probability p.sub.t.sup.i,j:
p.sub.t.sup.i,j={circumflex over (p)}.sub.t.sup.i,j/Σ.sub.j′=0.sup.A{circumflex over (p)}.sub.t.sup.i,j′
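The mask-and-renormalize step can be sketched as (the function name is illustrative):

```python
def mask_probabilities(raw_probs, neighbor_ids):
    """The mask module: zero the raw probability of every node that is
    not in the neighbor set, then renormalize so the rest sum to 1."""
    masked = [p if j in neighbor_ids else 0.0
              for j, p in enumerate(raw_probs)]
    total = sum(masked)
    return [p / total for p in masked]
```

For example, raw probabilities [0.2, 0.3, 0.5] with only nodes 0 and 2 as neighbors become [0.2/0.7, 0, 0.5/0.7].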
[0082] The sink module is used for making a routing decision according to probability vectors P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A], i=1, 2, . . . , A uploaded to the sink module by A decision networks as follows: firstly, generating a spanning tree of the wireless sensor network at time t according to probability vectors P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A], i=1, 2, . . . , A, then taking sink node n.sup.0 as a root node to recalculate a routing for each sensor node according to the spanning tree.
[0083] A multi-agent deep reinforcement learning algorithm is adopted in the present invention, and the modeling of deep reinforcement learning needs to be in accordance with Markov decision processes. However, after taking the routing decision of a wireless sensor network as a continuous decision and modeling it as a Markov decision process, a test shows that if the routing policy of the wireless sensor network is obtained through the distributed samplings of the A decision networks, heavy routing loops may exist in the obtained routing policy, which leads to unaffordable energy consumption. Therefore, the present invention avoids routing loops entirely through the centralized routing decision of the sink node, and thus the routing performance is enhanced.
[0084] The method for generating a spanning tree of the wireless sensor network at time t according to the probability vectors can be chosen on the basis of the actual embodiment; in the embodiment, it comprises the following steps:
[0085] Step S2.1: Initializing an Edge Set
[0086] Setting an edge set E.sub.mst by sink node n.sup.0 and initializing it to an empty set, where edge set E.sub.mst is used for storing the edges of the spanning tree generated for the wireless sensor network.
[0087] Step S2.2: Randomly Selecting a Sensor Node
[0088] Randomly selecting an unsampled sensor node n.sup.i*.
[0089] Step S2.3: Selecting a Candidate Parent Node
[0090] Randomly generating a floating point number in the range of (0,1] by sink node n.sup.0, and judging the interval it falls within on the cumulative distribution function of probability vector P.sub.t.sup.i*=[p.sub.t.sup.i*,0, p.sub.t.sup.i*,1, . . . , p.sub.t.sup.i*,A] of unsampled sensor node n.sup.i*, taking the node corresponding to the probability which corresponds to the interval as the candidate parent node n.sup.j* of the unsampled sensor node n.sup.i*.
[0091] Supposing the sensor node selected by sink node n.sup.0 at the current routing decision is sensor node n.sup.3, and its probability vector is P.sub.t.sup.3=[0.5, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0.1, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.1], the cumulative distribution function of P.sub.t.sup.3 is [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 0.9, 0.9, 1]; supposing the floating point number is 0.83, which falls within the interval of the 10.sup.th-14.sup.th element of the cumulative distribution function, then sensor node n.sup.10 is selected as the parent node of sensor node n.sup.3.
[0092] Step S2.4: Judging Whether a Routing Loop is Formed
[0093] Judging whether a routing loop is formed after the edge (n.sup.i*,n.sup.j*) is added into edge set E.sub.mst; if yes, then going to step S2.5, otherwise going to step S2.6.
[0094] Step S2.5: Updating the Probability Vector
[0095] Renormalizing probability vector P.sub.t.sup.i*=[p.sub.t.sup.i*,0, p.sub.t.sup.i*,1, . . . , p.sub.t.sup.i*,A] of sensor node n.sup.i* as follows:
p.sub.t.sup.i*,j=p.sub.t.sup.i*,j/(1−p.sub.t.sup.i*,j*), j=0, 1, . . . , A
[0096] Then letting p.sub.t.sup.i*,j*=0 to obtain an updated probability vector and returning to step S2.2.
[0097] For example, when sensor node n.sup.10 is selected as the parent node of sensor node n.sup.3, the corresponding edge is added into edge set E.sub.mst, and a routing loop is formed. Since p.sub.t.sup.3,10=0.1, the updated probability vector is P.sub.t.sup.3=[0.56, 0, 0, 0, 0, 0, 0.11, 0.11, 0, 0, 0, 0, 0, 0, 0, 0.11, 0, 0, 0, 0.11].
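The renormalization of step S2.5 reproduces the worked example numerically (a sketch; the helper name is illustrative):

```python
def renormalize(probs, j_star):
    """Step S2.5: divide every entry by (1 - p[j_star]) and zero the
    loop-forming candidate, so the vector sums to 1 again."""
    p_star = probs[j_star]
    updated = [p / (1.0 - p_star) for p in probs]
    updated[j_star] = 0.0
    return updated

p3 = [0.5, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0.1, 0, 0, 0, 0, 0.1, 0, 0, 0, 0.1]
updated = renormalize(p3, 10)
# 0.5/0.9 ≈ 0.556 and 0.1/0.9 ≈ 0.111, matching the rounded
# vector [0.56, ..., 0.11, ...] in the example above
```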
[0098] Step S2.6: Adding an Edge to the Edge Set
[0099] Adding the edge (n.sup.i*,n.sup.j*) to edge set E.sub.mst and marking sensor node n.sup.i* as sampled.
[0100] Step S2.7: judging whether the number of edges in edge set E.sub.mst is less than A, if yes, returning to step S2.2, otherwise the spanning tree is generated.
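Steps S2.1-S2.7 can be sketched end to end as follows (a hypothetical implementation: the inverse-CDF sampling convention and the ancestor-walk loop check are one reasonable reading of the steps above):

```python
import random

def sample_parent(probs, rng):
    """Step S2.3: draw u in (0,1] and return the first index whose
    cumulative probability reaches u (inverse-CDF sampling)."""
    u = rng.random() or 1.0  # map a draw of 0.0 to 1.0 so u is in (0,1]
    cum = 0.0
    for j, p in enumerate(probs):
        cum += p
        if u <= cum:
            return j
    return len(probs) - 1

def generate_spanning_tree(prob_vectors, rng=None):
    """Steps S2.1-S2.7: sample a parent for each unsampled node,
    rejecting (and renormalizing away) any choice that forms a loop."""
    rng = rng or random.Random(0)
    n = len(prob_vectors)                 # number of sensor nodes (A)
    parent = {0: None}                    # sink node n^0 is the root
    probs = {i: list(v) for i, v in enumerate(prob_vectors, start=1)}
    unsampled = set(probs)
    edges = set()                         # edge set E_mst
    while len(edges) < n:                 # step S2.7
        i = rng.choice(sorted(unsampled))      # step S2.2
        j = sample_parent(probs[i], rng)       # step S2.3
        # step S2.4: (i, j) forms a loop iff walking up from j reaches i
        node, loop = j, False
        while node is not None:
            if node == i:
                loop = True
                break
            node = parent.get(node)
        if loop:                          # step S2.5: renormalize, retry
            p_star = probs[i][j]
            probs[i] = [p / (1.0 - p_star) for p in probs[i]]
            probs[i][j] = 0.0
        else:                             # step S2.6: accept the edge
            edges.add((i, j))
            parent[i] = j
            unsampled.discard(i)
    return edges
```

On a toy three-node instance where node 1 always picks the sink and node 3 always picks node 2, the result is always a valid spanning tree rooted at the sink.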
[0101] Step S3: Training the Decision Networks
[0102] Training the decision networks of the A agents a.sup.i, i=1, 2, . . . , A of the cooperative routing decision system by a multi-agent deep reinforcement learning algorithm.
[0103] For the decision network of an agent, the problem to be solved is an online problem; the routing decisions are interrelated and the decision space is very large, so it is very difficult to obtain an optimal solution. Therefore, a mean field actor critic framework of an actor-critic based multi-agent deep reinforcement learning algorithm is chosen in the embodiment, and the training of the decision networks is performed through a simulation. The detailed steps are as follows:
[0104] In a simulation environment, simulating the amount of the data collected by each sensor node in the real world according to a probability distribution designed from existing prior knowledge for each type of data collected by the sensor nodes. In the embodiment, the data amount is uniformly distributed in the range of [500 bytes, 1000 bytes].
[0105] The decision network of each agent in the cooperative routing decision system created in step S2 is taken as an actor network, and a critic network is set for instructing the learning of the actor network. Extra global information can be obtained by the critic network, which guarantees the feasibility and stability of training. The process of routing decision of the decision network in the present invention is modeled as a partially observable Markov decision process, where the input vector of each decision network is taken as a local observation in the partially observable Markov decision process, the parent node chosen by the routing of the corresponding sensor node which is obtained by the sink node is taken as an action in the partially observable Markov decision process, and the reward function is calculated according to the lifetime of the wireless sensor network, the calculating formula being:
R.sub.t=0, if the wireless sensor network is still running at time t; R.sub.t=T, if the energy of any sensor node runs out at time t
[0106] where R.sub.t is the value of the reward function at time t and T is the lifetime of the wireless sensor network. In other words, after each sensor node completes its data transmission according to the routing corresponding to its action, if the wireless sensor network is still running, the value of the reward function is 0; if the energy of any sensor node runs out, the wireless sensor network is paralyzed and the value of the reward function is the lifetime of the wireless sensor network.
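This terminal reward can be stated as a one-line sketch (the function name is illustrative):

```python
def reward(network_alive, lifetime):
    """R_t = 0 while the network is running; when any sensor node's
    energy runs out, the episode ends with the network lifetime as
    the terminal reward."""
    return 0.0 if network_alive else float(lifetime)
```

A longer-lived network thus receives a larger terminal reward, which is what the actor networks are trained to maximize.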
[0107] Finally, simulating the wireless sensor network through a simulator and training the actor-critic networks by sampling the data obtained from the simulation, so that the training of the decision networks is realized.
[0108] In the embodiment, the decision networks of all agents share parameters to enhance their training efficiency.
[0109] In the embodiment, the critic network comprises six fully connected layers and two concatenate layers, which are detailed as follows:
[0110] The first fully connected layer is used for receiving and processing local observation vector O.sub.t.sup.i and sending obtained feature w.sub.t.sup.O to the second fully connected layer.
[0111] The second fully connected layer is used for receiving the mean action of all neighbor nodes of sensor node n.sup.i at the previous routing decision, the method for determining the mean action is: doing one-hot encoding for the action of each neighbor node and averaging the corresponding encoded vectors to obtain a mean action. The mean action is processed by the second fully connected layer to obtain a feature w.sub.t.sup.ACT, which is inputted into the first concatenate layer.
[0112] The first concatenate layer is used for concatenating the two received features w.sub.t.sup.O and w.sub.t.sup.ACT together. Then the feature obtained by concatenating is sent to the third fully connected layer.
[0113] The third fully connected layer is used for processing the feature obtained by concatenating to obtain a feature w.sub.t.sup.1. Then feature w.sub.t.sup.1 is sent to the second concatenate layer.
[0114] The fourth fully connected layer is used for receiving position vector Pos.sup.i and processing it to obtain a feature w.sup.POS. Feature w.sup.POS is sent to the second concatenate layer.
[0115] The second concatenate layer is used for concatenating the two features w.sub.t.sup.1 and w.sup.POS together, the feature obtained by concatenating is sent to the fifth fully connected layer.
[0116] The fifth fully connected layer is used for processing the received feature to obtain a feature w.sub.t, feature w.sub.t is sent to the sixth fully connected layer;
[0117] The sixth fully connected layer is used for processing the received feature w.sub.t to obtain a final evaluation value.
[0118] In the embodiment, the first to the fifth fully connected layers of the critic network adopt ReLU (Rectified Linear Unit) activation functions, and the sixth fully connected layer adopts a linear activation function.
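The mean action fed to the critic's second fully connected layer ([0111]) can be sketched in NumPy (the function name is illustrative):

```python
import numpy as np

def mean_action(neighbor_actions, n_nodes):
    """One-hot encode each neighbor's previous action (the index of its
    chosen parent node) and average the encodings to obtain the mean
    action used by the critic."""
    onehots = np.eye(n_nodes)[list(neighbor_actions)]
    return onehots.mean(axis=0)
```

For example, two neighbors that previously chose parents 0 and 2 in a four-node network yield the mean action [0.5, 0, 0.5, 0].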
[0119] In the embodiment, RMSProp (Root Mean Squared Propagation) optimizers are used in the training of the actor network and the critic network; the learning rate of the actor network is 1×10.sup.−5 and the learning rate of the critic network is 5×10.sup.−5. In the embodiment, a target critic network is introduced to ensure the stability of training, and the parameters of the target critic network are updated in combination with the parameters of the critic network by using a soft-update policy. The parameter of the soft update is 1×10.sup.−3. To ensure the exploration intensity of the actor network and avoid falling into a local optimum early, an extra entropy regularization term is added into the loss function; the weight of the extra entropy regularization term is set to 1×10.sup.−6.
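The soft-update policy for the target critic can be sketched per parameter (a simplified scalar form; real implementations update whole weight tensors at once):

```python
def soft_update(target_params, critic_params, tau=1e-3):
    """Soft update: move each target-critic parameter a fraction tau
    (1e-3 in the embodiment) toward the corresponding critic parameter."""
    return [tau * c + (1.0 - tau) * t
            for t, c in zip(target_params, critic_params)]
```

With tau=1e-3 the target network trails the critic slowly, which is what stabilizes the bootstrapped training targets.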
[0120] Step S4: Deploying the Wireless Sensor Network
[0121] The next step is to deploy the wireless sensor network.
[0122] Step S4.1: Calculating an Initial Routing for Each Sensor Node
[0123] Firstly, calculating a minimum spanning tree according to the positions and the neighborhoods of sensor nodes n.sup.i, i=1, 2, . . . , A of the wireless sensor network to be deployed, taking the distances between nodes as weights. In the embodiment, Kruskal's algorithm is used for calculating the minimum spanning tree. Then, taking sink node n.sup.0 in the minimum spanning tree as a root node and calculating an initial routing for each sensor node. In the embodiment, the initial routing for each sensor node is calculated by a BFS (Breadth First Search) algorithm.
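Step S4.1 can be sketched with Kruskal's algorithm plus a BFS from the sink (a hypothetical implementation; edge weights stand in for node-to-node distances):

```python
from collections import deque

def kruskal_mst(n_nodes, weighted_edges):
    """Kruskal's algorithm with union-find: scan edges in order of
    increasing weight and keep any edge joining two components."""
    parent = list(range(n_nodes))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((u, v))
    return mst

def initial_routing(n_nodes, mst_edges, sink=0):
    """BFS over the MST from the sink: each node's BFS parent is its
    initial next hop toward the sink."""
    adj = {i: [] for i in range(n_nodes)}
    for u, v in mst_edges:
        adj[u].append(v)
        adj[v].append(u)
    next_hop, queue = {sink: None}, deque([sink])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in next_hop:
                next_hop[v] = u
                queue.append(v)
    return next_hop
```

On a four-node line topology the routing simply chains each node to its MST neighbor closer to the sink.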
[0124] Step S4.2: Generating a Configuration File
[0125] For each sensor node, loading the information of its neighborhood and initial routing into its configuration file according to its position.
[0126] Step S4.3: Loading the Positions
[0127] Loading the positions of the sensor nodes into the sink node.
[0128] Step S4.4: Deploying the Sensor Nodes
[0129] Deploying sensor nodes n.sup.i, i=1, 2, . . . , A into an actual environment according to their respective positions.
[0130] Step S5: Initializing the Sensor Nodes
[0131] When the wireless sensor network is started, setting up two counters in each sensor node n.sup.i and initializing them to 0, wherein the two counters are used for counting the amount cnt.sub.s.sup.i of the collected environmental data and the amount cnt.sub.o.sup.i of the forwarded data at each decision; and initializing a transmission count m in each sensor node to 1.
[0132] Step S6: Monitoring the Collected Environmental Data
[0133] For each sensor node, collecting environmental data from the environment continuously and receiving the environmental data sent by other sensor nodes, sending the environmental data collected in the current transmission cycle and forwarding the environmental data coming from other sensor nodes to the sink node according to the current routing at each transmission interval of U seconds, where the amount of the environmental data collected by sensor node n.sup.i and sent to the parent node of sensor node n.sup.i at the m.sup.th transmission cycle is denoted by d.sub.s.sup.i,m, and the amount of the environmental data coming from other sensor nodes and forwarded by sensor node n.sup.i at the m.sup.th transmission cycle is denoted by d.sub.o.sup.i,m; then amount cnt.sub.s.sup.i of the collected environmental data is cnt.sub.s.sup.i=cnt.sub.s.sup.i+d.sub.s.sup.i,m and amount cnt.sub.o.sup.i of the forwarded data is cnt.sub.o.sup.i=cnt.sub.o.sup.i+d.sub.o.sup.i,m.
[0134] Step S7: Judging Whether the Residual Energy of Any Sensor Node is Below a Set Threshold
[0135] Obtaining the residual energies of the sensor nodes and judging whether any of them is below a pre-defined threshold; if yes, then judging that the wireless sensor network is paralyzed and terminating the routing process, otherwise going to step S8.
[0136] Step S8: Judging whether m % M=0, where M is the routing decision cycle, expressed as a number of transmission cycles, and % is the remainder operator; if yes, then going to step S9, otherwise returning to step S6.
[0137] Step S9: Updating the Routing Policy of the Wireless Sensor Network
[0138] Updating the routing policy of the wireless sensor network through a cooperative routing decision of A agents a.sup.i, i=1, 2, . . . , A.
[0139] Step S9.1: Obtaining the Amount of the Data to be Transmitted
[0140] Obtaining amount cnt.sub.s.sup.i of the collected environmental data and the amount cnt.sub.o.sup.i of the forwarded data of the corresponding sensor node n.sup.i by agent a.sup.i, letting data amount c.sub.s.sup.i,t=cnt.sub.s.sup.i and data amount c.sub.o.sup.i,t=cnt.sub.o.sup.i, then setting amount cnt.sub.s.sup.i of the collected environmental data and amount cnt.sub.o.sup.i of the forwarded data to 0, where i=1, 2, . . . , A.
[0141] Step S9.2: Determining the Input Information of the Decision Network
[0142] Obtaining local observation vector O.sub.t.sup.i and position vector Pos.sup.i of sensor node n.sup.i by agent a.sup.i, then concatenating local observation vector O.sub.t.sup.i and position vector Pos.sup.i together to obtain an input vector and inputting the input vector to the corresponding decision network to obtain a probability vector P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A], where i=1, 2, . . . , A.
[0143] Step S9.3: Gathering the Probability Vectors
[0144] Uploading probability vectors P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A], i=1, 2, . . . , A to sink node n.sup.0 by sensor nodes n.sup.i through their corresponding current routings, respectively.
[0145] Step S9.4: Updating the Routing of Each Sensor Node
[0146] Recalculating a routing for each sensor node by sink node n.sup.0 according to the received probability vectors P.sub.t.sup.i=[p.sub.t.sup.i,0, p.sub.t.sup.i,1, . . . , p.sub.t.sup.i,A], i=1, 2, . . . , A and sending the routings to corresponding sensor nodes, respectively, then returning to step S6.
[0147] In the embodiment, the update cycle M of routing is set to 10, in other words, the routing policy of the wireless sensor network is updated at the interval of 10 transmission cycles.
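The online loop of steps S5-S9 can be illustrated with a toy simulation (the linear per-byte energy model and the `update_fn` hook are assumptions for illustration, not the patent's energy model; node 0 is the sink with unlimited energy, so its entry is ignored):

```python
def depth(routing, i):
    """Hop count from node i to the sink (node 0) along the routing."""
    d = 0
    while i != 0:
        i = routing[i]
        d += 1
    return d

def simulate_lifetime(routing, energy, data_per_cycle, cost_per_byte,
                      M=10, update_fn=None, max_cycles=100000):
    """Each transmission cycle, every node sends its own data plus its
    descendants' data to its parent and spends energy per byte; every M
    cycles the routing may be updated (steps S8-S9); the lifetime is the
    cycle at which some node's residual energy is exhausted (step S7)."""
    n = len(energy)
    routing, energy = list(routing), list(energy)
    for m in range(1, max_cycles + 1):
        load = list(data_per_cycle)
        # push loads toward the sink, deepest nodes first, so each
        # node's load includes all of its descendants' data
        for i in sorted(range(1, n), key=lambda i: -depth(routing, i)):
            j = routing[i]
            if j != 0:
                load[j] += load[i]
        for i in range(1, n):
            energy[i] -= cost_per_byte * load[i]
        if any(energy[i] <= 0 for i in range(1, n)):   # step S7
            return m
        if update_fn is not None and m % M == 0:       # steps S8-S9
            routing = update_fn(routing, energy)
    return max_cycles
```

In a three-node chain (node 2 routes through node 1), node 1 carries twice the load and therefore dies first, fixing the lifetime.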
[0148] In order to illustrate the technical effect, a specific example is given to verify the present invention through an experiment on a wireless sensor network; the lifetimes obtained with different routing methods are compared in Table 1.
TABLE 1
  Routing method                             MST    AODV    EAR    The present invention
  Lifetime of the wireless sensor network    808    1049    1391   1776
[0149] As shown in Table 1, the present invention makes the lifetime of the wireless sensor network markedly longer; its lifetime is more than twice that of MST, which verifies the feasibility of the present invention.
[0150] While illustrative embodiments of the invention have been described above, it is, of course, understood that various modifications will be apparent to those of ordinary skill in the art. Such modifications are within the spirit and scope of the invention, which is limited and defined only by the appended claims.