Large area surveillance method and surveillance robot based on weighted double deep Q-learning

Abstract

A large area surveillance method is based on weighted double deep Q-learning. A robot which of Q-value table including a Q.sub.A-value table and Q.sub.B-value table is provided, an unidentified object enters a large space to trigger the robot, and the robot perceives a current state s and determines whether the current state s is a target state, if yes, the robot reaches a next state and monitors the unidentified object, and if not, the robot reaches a next state, obtains a reward value according to the next state, selectively updates a Q.sub.A-value or Q.sub.B-value with equal probability, and then updates a Q-value until convergence to obtain an optimal surveillance strategy. The problems of a limited surveillance area and camera capacity are resolved, and the synchronization of multiple cameras doesn't need to be considered, and thus the cost is reduced. A large area surveillance robot is also disclosed.

Claims

1. A large area surveillance method based on weighted double deep Q-learning, comprising steps of: S1. providing a large space and a robot in the large space, wherein the robot in a working state reaches a target state from a current state by using a double Q-learning method, a Q-value table of the robot comprises a Q.sub.A-value table and a Q.sub.B-value table, and a Q-value is calculated by using a deep estimation network parameter θ, an update formula of a Q.sub.A-value being as follows: $β_{A} = \frac{| Q_{B} (s^{'}, a^{*}; θ) - Q_{B} (s^{'}, a_{L}; θ) |}{c + | Q_{B} (s^{'}, a^{*}; θ) - Q_{B} (s^{'}, a_{L}; θ) |}, δ = R (s, a) + γ [β_{A} Q_{A} (s^{'}, a^{*}; θ) + (1 - β_{A}) Q_{B} (s^{'}, a^{*}; θ)] - Q_{A} (s, a; θ), and$ $Q_{A} \leftarrow Q_{A} (s, a; θ) + α (s, a) δ,$ an update formula of a Q.sub.B-value being as follows: $β_{B} = \frac{| Q_{A} (s^{'}, a^{*}; θ) - Q_{A} (s^{'}, a_{L}; θ) |}{c + | Q_{A} (s^{'}, a^{*}; θ) - Q_{A} (s^{'}, a_{L}; θ) |}, δ = R (s, a) + γ [β_{B} Q_{B} (s^{'}, a^{*}; θ) + (1 - β_{B}) Q_{A} (s^{'}, a^{*}; θ)] - Q_{B} (s, a; θ), and$ $Q_{B} \leftarrow Q_{B} (s, a; θ) + α (s, a) δ,$ wherein β.sub.A and β.sub.B represent weights, s′ represents a next state, a* represents an optimal action of a next state, a.sub.L represents a worst action of a next state, c is a free parameter and c≥0, δ represents a time differential, R represents a reward value, γ represents a target discount and 0≤γ≤1, s represents the current state, a represents a current action, α represents a learning rate, α is within a range (0, 1), and θ represents the deep estimation network parameter, in the working state, an unidentified object is present in the large space, and in the target state, the unidentified object is in a surveillance area of the robot; S2. setting an initial state of the robot as the current state s by the robot; S3. detecting and determining whether the current state s is the working state by the robot, wherein if not, the process turns to S4, and if yes, the process turns to S5; S4. switching to standby mode to reach a next state s′ by the robot, wherein the process turns to S11; S5. detecting and determining whether the current state s is the target state by the robot, wherein if not, the process turns to S6, and if yes, the process turns to S7; S6. selecting and executing the current action a to reach a next state s′ by the robot, wherein the process turns to S8; S7. selecting and executing the current action a to reach a next state s′ and monitoring the unidentified object by the robot, wherein the process turns to S8; S8. obtaining the reward value R according to the next state s′ by the robot, wherein the process turns to S9; S9. selectively updating the Q.sub.A-value or the Q.sub.B-value with equal probability by the robot, wherein the process turns to S10; S10. determining whether the Q-value table of the robot converges by the robot, wherein if not, the process turns to S11, and if yes, the process turns to S12; S11. resetting the next state s′ as the current state s by the robot, wherein the process returns to S3; S12. formulating an optimal surveillance strategy by the robot, wherein the process turns to S13; S13. resetting the next state s′ as the current state s by the robot, wherein the process turns to S14; S14. detecting and determining whether the current state s is the working state by the robot, wherein if not, the process turns to S15, and if yes, the process turns to S16; S15. switching to standby mode to reach a next state s′ by the robot, wherein the process returns to S13; S16. detecting and determining whether the current state s is the target state by the robot using a deep estimation network, wherein if not, the process turns to S17, and if yes, the process turns to S18; S17. reaching a next state s′ according to the optimal surveillance strategy by the robot, wherein the process returns to S13; and S18. selecting and executing the current action a to reach a next state s′, and monitoring the unidentified object by the robot, wherein the process returns to S13, wherein in the large space, the Q-value, the learning rate α, and the target discount γ of the robot, the structure and parameter θ of the deep estimation network, an action selection manner, and the weight β are only initialized before the robot selects and executes the current action a for the first time.

2. The large area surveillance method based on weighted double deep Q-learning as claimed in claim 1, wherein the large space is divided into several subspaces, after selecting and executing the current action a, the robot remains still in a current subspace or moves to a subspace adjacent to the current subspace, and each subspace being less than or equal to the surveillance area of the robot.

3. The large area surveillance method based on weighted double deep Q-learning as claimed in claim 1, wherein when the robot detects the current state s, a sensor of the robot is used to acquire an approximate location loc.sub.i of the unidentified object and a precise location loc.sub.a of the robot, denoted as s= custom character loc.sub.i,loc.sub.a .

4. The large area surveillance method based on weighted double deep Q-learning as claimed in claim 1, wherein the robot acquires image information by using a camera of the robot, performs feature extraction and classification by using the deep estimation network, and determines by itself whether an unidentified object is present in the surveillance area.

5. The large area surveillance method based on weighted double deep Q-learning as claimed in claim 1, wherein when the robot selects the current action a, there is a larger probability of selecting an action represented by a maximum Q-value, and there is a smaller probability of selecting any another action.

6. The large area surveillance method based on weighted double deep Q-learning as claimed in claim 1, wherein the reward value R is set as follows: $R (s, a) = {\begin{matrix} 10 & {loc}_{i} = {loc}_{a} \\ - 0.1 & {loc}_{i} \neq {loc}_{a} \end{matrix},$ wherein loc.sub.a is a precise location of the robot, and loc.sub.i is an approximate location of an unidentified object, that is, when the unidentified object is in the surveillance area of the robot, a positive reward is provided, and when the robot does not observe the unidentified object, a negative reward is provided.

7. The large area surveillance method based on weighted double deep Q-learning as claimed in claim 1, wherein the robot keeps updating the learning rate α of the robot, $α (s, a) = \frac{1}{{n (s, a)}^{0.8}},$ wherein when the robot executes the current action a, an unidentified object also moves to form a double-movement state, and n is the number of executing the action a in the double-movement state.

8. A large area surveillance robot based on weighted double deep Q-learning, wherein a Q-value table of the robot comprises a Q.sub.A-value table and a Q.sub.B-value table, and a Q-value is calculated by using a deep estimation network parameter θ, wherein an update formula of a Q.sub.A-value is as follows: $β_{A} = \frac{| Q_{B} (s^{'}, a^{*}; θ) - Q_{B} (s^{'}, a_{L}; θ) |}{c + | Q_{B} (s^{'}, a^{*}; θ) - Q_{B} (s^{'}, a_{L}; θ) |}, δ = R (s, a) + γ [β_{A} Q_{A} (s^{'}, a^{*}; θ) + (1 - β_{A}) Q_{B} (s^{'}, a^{*}; θ)] - Q_{A} (s, a; θ), and$ $Q_{A} \leftarrow Q_{A} (s, a; θ) + α (s, a) δ,$ an update formula of a Q.sub.B-value is as follows: $β_{B} = \frac{| Q_{A} (s^{'}, a^{*}; θ) - Q_{A} (s^{'}, a_{L}; θ) |}{c + | Q_{A} (s^{'}, a^{*}; θ) - Q_{A} (s^{'}, a_{L}; θ) |}, δ = R (s, a) + γ [β_{B} Q_{B} (s^{'}, a^{*}; θ) + (1 - β_{B}) Q_{A} (s^{'}, a^{*}; θ)] - Q_{B} (s, a; θ), and$ $Q_{B} \leftarrow Q_{B} (s, a; θ) + α (s, a) δ,$ wherein β.sub.A and β.sub.B represent weights, s′ represents a next state, a* represents an optimal action of a next state, a.sub.L represents a worst action of a next state, c is a free parameter and c≥0, δ represents a time differential, R represents a reward value, γ represents a target discount, and 0≤γ≤1, s represents the current state, a represents a current action, α represents a learning rate, α is within a range (0, 1), and θ represents the deep estimation network parameter; and the robot is also provided with a sensor for detecting a precise location of the robot and an approximate location of an unidentified object in real time, and a camera for monitoring the unidentified object, the sensor and the camera being respectively electrically connected to a main control chip of the robot.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic view of a large area surveillance robot based on weighted double deep Q-learning according to the present invention.

(2) FIG. 2 is a schematic view of a large space and the locations of a robot and an unidentified object based on weighted double deep Q-learning according to the present invention.

(3) FIG. 3 is a flowchart of a large area surveillance method based on weighted double deep Q-learning according to the present invention.

(4) Where: 10, robot; 11, sensor; 12, camera; 13, main control chip; 20, large space; 21, subspace; 30, unidentified object.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

(5) The technical solutions in the embodiments of the present invention will be clearly and completely described in the following with reference to the accompanying drawings. It is obvious that the described embodiments are only illustrative and are not intended to limit the protection scope of the invention.

(6) Based on the embodiments of the present invention, all the other embodiments obtained by those skilled in the art without creative work are within the protection scope of the present invention.

Embodiment 1

(7) Referring to FIG. 1, a large area surveillance robot 10 based on weighted double deep Q-learning is illustrated. A Q-value table of the robot 10 includes a Q.sub.A-value table and a Q.sub.B-value table, and a Q-value is calculated by using a deep estimation network parameter θ.

(8) An update formula of a Q.sub.A-value is as follows:

(9) $β_{A} = \frac{| Q_{B} (s^{'}, a^{*}; θ) - Q_{B} (s^{'}, a_{L}; θ) |}{c + | Q_{B} (s^{'}, a^{*}; θ) - Q_{B} (s^{'}, a_{L}; θ) |}, δ = R (s, a) + γ [β_{A} Q_{A} (s^{'}, a^{*}; θ) + (1 - β_{A}) Q_{B} (s^{'}, a^{*}; θ)] - Q_{A} (s, a; θ), and$ $Q_{A} \leftarrow Q_{A} (s, a; θ) + α (s, a) δ,$

(10) an update formula of a Q.sub.B-value is as follows:

(11) $β_{B} = \frac{| Q_{A} (s^{'}, a^{*}; θ) - Q_{A} (s^{'}, a_{L}; θ) |}{c + | Q_{A} (s^{'}, a^{*}; θ) - Q_{A} (s^{'}, a_{L}; θ) |}, δ = R (s, a) + γ [β_{B} Q_{B} (s^{'}, a^{*}; θ) + (1 - β_{B}) Q_{A} (s^{'}, a^{*}; θ)] - Q_{B} (s, a; θ), and$ $Q_{B} \leftarrow Q_{B} (s, a; θ) + α (s, a) δ,$

(12) where β.sub.A and βB represent weights, s′ represents a next state, a* represents a best action of a next state, a.sub.L represents a worst action of a next state, c is a free parameter and c≥0, δ represents a time differential, R represents a reward value, γ represents a target discount and 0≤γ≤1, s represents a current state, a represents a current action, α represents a learning rate, α is within a range (0, 1), and θ represents the deep estimation network parameter.

(13) The robot 10 is further provided with a sensor 11 for detecting a precise location of the robot 10 and an approximate location of an unidentified object in real time, and a camera 12 for monitoring the unidentified object. The sensor 11 and the camera 12 are respectively electrically connected to a main control chip 13 of the robot 10.

(14) After the robot 10 acquires an image, the image is used as an input for a deep estimation network. The deep estimation network is an 8-layer network. All network nodes are rectified linear units (ReLU). Layer 1 is an input layer, the state is a flat vector whose length is 84*84*3=21168, and a reward signal is a numeric scalar. Layer 2 to Layer 5 are convolutional layers. In Layer 2, the convolutional kernel size is 8*8, the step size is 4*4, the number of output channels is 32, and the output dimension of this layer is 20*20*32. In Layer 3, the convolutional kernel size is 4*4, the step size is 2*2, the number of output channels is 64, and the output dimension of this layer is 9*9*64. In Layer 4, the convolutional kernel size is 3*3, the step size is 1*1, the number of output channels is 64, and the output dimension of this layer is 7*7*64. In Layer 5, the convolutional kernel size is 7*7, the step size is 1*1, and the number of output channels is 1*1*512. Layer 7 is a full connected layer, and the number of output channels is 512. Layer 8 is also a full connected layer, and the number of output channels is the number of actions, that is, an output-value is a Q-value corresponding to each state-action pair. In an experience replay mechanism, the number of each batch of samples is 32, the size of a memory replay unit is 1000000, a target Q-value is updated once for every 10000 samples, and a current Q-value is updated once for every several samples whose quantity is the number of actions.

(15) Referring to FIG. 2 and FIG. 3, a large area surveillance method based on weighted double deep Q-learning is illustrated, and the method includes the following steps:

(16) S1. A large space 20 and a robot 10 in the large space 20 are provided, the robot 10 in a working state reaches a target state from a current state by using a double Q-learning method;

(17) in the working state, an unidentified object 30 is present in the large space 20;

(18) in the target state, the unidentified object 30 is in a surveillance area of the robot 10;

(19) S2. the robot 10 sets its initial state as the current state s.

(20) S3. the robot 10 detects and determines whether the current state s is the working state, where if not, the process turns to S4, and if yes, the process turns to S5;

(21) S4. the robot 10 switches to standby mode to reach a next state s′, where the process turns to S11;

(22) S5. the robot 10 detects and determines whether the current state s is the target state, if not, the process turns to S6, and if yes, the process turns to S7;

(23) S6. the robot 10 selects and executes the current action a to reach a next state s′, where the process turns to S8;

(24) S7. the robot 10 selects and executes the current action a to reach a next state s′ and monitors the unidentified object 30, where the process turns to S8;

(25) S8. the robot 10 obtains a reward value R according to the next state s′ where the process turns to S9;

(26) S9. the robot 10 selectively updates a Q.sub.A-value or a Q.sub.B-value with equal probability, where the process turns to S10;

(27) S10. the robot 10 determines whether the Q-value table of the robot 10 converges, where if not, the process turns to S11, and if yes, the process turns to S12;

(28) S11. the robot 10 resets a next state s′ as the current state s, where the process returns to S3;

(29) S12. the robot 10 formulates an optimal surveillance strategy, where the process turns to S13;

(30) S13. the robot 10 resets a next state s′ as the current state s, where the process turns to S14;

(31) S14. the robot 10 detects and determines whether the current state s is the working state using a deep estimation network, where if not, the process turns to S15, and if yes, the process turns to S16;

(32) S15. the robot 10 switches to standby mode to reach a next state s, where the process returns to S13;

(33) S16. the robot 10 detects and determines whether the current state s is the target state, where if not, the process turns to S17, and if yes, the process turns to S18;

(34) S17. the robot 10 reaches a next state s′ according to the optimal surveillance strategy, where the process returns to S13; and

(35) S18. the robot 10 selects and executes the current action a to reach a next state s, and monitors the unidentified object 30, where the process returns to S13.

(36) In the above technical solution, in a same large space, the Q-value, the learning rate α, and the target discount γ of the robot, an action selection manner, the weight β, and the structure and parameter θ of the deep estimation network, and the like are only initialized before the robot 10 selects and executes the current action a for the first time. In this embodiment, after initialization, the Q-value is 0, the free parameter c is 1, the learning rate α is 0.8, the target discount γ is 0.95, the action selection manner is an Ú-greedy manner, and the weight β is 0.5.

(37) In the above technical solution, the large space 20 is divided into several subspaces 21. After selecting and executing the current action a, the robot 10 remains still in a current subspace or moves to a subspace adjacent to the current subspace. Each subspace 21 is not larger than the surveillance area of the robot 10.

(38) In the above technical solution, when the robot 10 detects the current state s, a sensor of the robot 10 is used to acquire an approximate location loc.sub.i of the unidentified object 30 and a precise location loc.sub.a of the robot 10, denoted as s= custom character loc.sub.i,loc.sub.a. The foregoing state has a Markov property. A future state of a state having a Markov property is only related to a current state and is not related to a previous state.

(39) In the above technical solution, when the robot 10 monitors the unidentified object 30, a camera 12 of the robot 10 is used to acquire image information of the unidentified object.

(40) In the above technical solution, when the robot 10 selects the current action a, there is a larger probability of selecting an action represented by a maximum Q-value, and there is a smaller probability of selecting any another action.

(41) In one embodiment, the camera 12 of the robot 10 is a 360-degree rotatable camera.

(42) In another embodiment, an alarm device (not shown) is further disposed on the robot 10. The alarm device is electrically connected to the main control chip 13 of the robot 10. The robot 10 performs feature extraction and classification by using the deep estimation network according to the image information and determines by itself whether an unidentified object is present in the surveillance area, where if yes, an alarm is raised by using the alarm device of the robot.

(43) In a further embodiment, the robot 10 selects the current action a in an Ú-greedy manner. The action refers to a movement direction of the robot 10, that is, upward movement, downward movement, leftward movement, rightward movement, and stillness.

(44) In one embodiment, the reward value R is set as follows:

(45) $R (s, a) = {\begin{matrix} 10 & {loc}_{i} = {loc}_{a} \\ - 0.1 & {loc}_{i} \neq {loc}_{a} \end{matrix},$

(46) where loc.sub.a is a precise location of the robot, and loc.sub.i is an approximate location of an unidentified object, that is, when an unidentified object is in the surveillance area of the robot, a positive reward is provided, and when the robot observes no unidentified object, a negative reward is provided.

(47) In a further embodiment, the robot 10 keeps updating the learning rate α of the robot 10,

(48) 0 $α (s, a) = \frac{1}{{n (s, a)}^{0.8}},$
where when the robot executes the current action a, an unidentified object also moves to form a double-movement state, and n is the number of executing the action a in the double-movement state. The abovementioned description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Multiple modifications to these embodiments are obvious to those skilled in the art, and general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention will not be limited to theses embodiments illustrated herein, but needs to be in line with the widest scope consistent with the principles and novel features disclosed herein.

Large area surveillance method and surveillance robot based on weighted double deep Q-learning

Assignee

Inventors

Cpc classification

Classification Explorer

G06N20/00

PHYSICS

Classification Explorer

H04N7/18

ELECTRICITY

Classification Explorer

B25J9/1697

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

G05B2219/40264

PHYSICS

Classification Explorer

G05B13/027

PHYSICS

Classification Explorer

B25J9/163

PERFORMING OPERATIONS; TRANSPORTING

Classification Explorer

G05B2219/40298

PHYSICS

International classification

Classification Explorer

G05B19/04

PHYSICS

Classification Explorer

G05B19/18

PHYSICS

Classification Explorer

G05B13/02

PHYSICS

Classification Explorer

B25J9/16

PERFORMING OPERATIONS; TRANSPORTING

Abstract

Claims

Description