Method for adaptive role selection in a coordinated multi-robot search task and system thereof

12306643 · 2025-05-20

Abstract

The present invention discloses a method for adaptive role selection in a coordinated multi-robot search task and a system thereof, comprising: defining a role action space as two discrete values: [explore, cover]; acquiring local perception information o_t^i and joint perception information jo_t^i, inputting them into a role policy, and outputting a role action ρ_t^i; wherein the local perception information comprises an obstacle map, an explored map, a covered map, and a position map, and the joint perception information comprises a merged explored map and a merged covered map; inputting the local perception information o_t^i and the output role action ρ_t^i into a primitive policy, and outputting a primitive action a_t for the robot to interact with the environment; and then controlling the robot to execute the corresponding output primitive action a_t according to the received role action ρ_t^i.

Claims

1. A method for adaptive role selection in a coordinated multi-robot search task, comprising: defining a role action space as two discrete values: [explore, cover]; acquiring and inputting local perception information o_t^i and joint perception information jo_t^i into a role policy, and outputting a role action ρ_t^i; wherein the local perception information comprises an obstacle map, an explored map, a covered map, and a position map, and the joint perception information comprises a merged explored map and a merged covered map; and inputting the local perception information o_t^i and the output role action ρ_t^i into a primitive policy, and outputting a primitive action a_t of the robot to interact with the environment; wherein a training of the role policy comprises: identifying, by the robot, frontier cells or target cells based on the local perception information o_t^i, and evaluating an expected reward between the exploration and the coverage by using the joint perception information jo_t^i; wherein the expected reward is calculated as:
J(θ_r) = 𝔼[∇_{θ_r} log π_{θ_r}^i(ρ_t^i | o_t^i, jo_t^i) · A_t^i], where π_{θ_r}^i represents the parametric representation of the role policy and A_t^i is the advantage function of the role policy; defining an exploration reward R_e and a coverage reward R_c for each robot, the reward function of the role policy being R_t = αR_e + βR_c, where α and β are the reward weight coefficients of the explore role action and the cover role action, respectively; using, by the role policy, a centralized-training distributed-execution architecture based on an (Actor-Critic)^R structure, and training the role policy by using a multi-agent reinforcement learning algorithm; wherein, in a centralized training phase, the role action is output by an Actor^R network, and a state value function V_r(s) is calculated by a Critic^R network to obtain an advantage function A_r(s, ρ) used to judge a rationality of the role action calculated by the Actor^R network; wherein the advantage function A_r(s, ρ) specifically is:
Â_r = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T−t+1}δ_{T−1}, where δ_t = r_t + γV(s_{t+1}) − V(s_t), r_t is the environmental reward at the time t, V(s_t) and V(s_{t+1}) represent the state value functions at the time t and the time t+1, respectively, and γ is the discount factor; after taking the role action ρ in state S, if the value of A is greater than 0, representing that the role action is better than an average, which is defined as a reasonable choice; if the value of A is less than 0, representing that the role action ρ is worse than the average, which is defined as a not-good choice; and, in a training phase of the primitive policy, setting two different rewards: the exploration reward R_e and the coverage reward R_c; a base reward R_p(t) of the primitive policy for each time t being: R_p(t) = R_e = (Σ_{i∈N} Σ_{k∈Ē_t} u_t^k · 𝟙[q_t^i = p^k]) / B_e, if ρ_t^i = 0; and R_p(t) = R_c = Σ_{i∈N} Σ_{j∈C_t} c_t^j · 𝟙[q_t^i = p^j], if ρ_t^i = 1; where ρ_t^i is the output role action, the binary variable u_t^k indicates whether a robot i moves to a passable area at the time t, the passable area being an unexplored grid cell, i ∈ N, and N is a number of robots; wherein u_t^k = 1 indicates that a robot position q_t^i is located at a position p^k of a grid cell k, the grid cell k belonging to the passable area Ē_t; otherwise, u_t^k = 0 indicates that the robot i does not move to the passable area at the time t; the binary variable c_t^j indicates whether the robot i moves to a target cell at the time t, wherein c_t^j = 1 indicates that the robot position q_t^i is located at a position p^j of a target grid cell j, the target grid cell j belonging to an existing target set C_t in the environment; otherwise, c_t^j = 0 indicates that the robot i does not move to the target cell at the time t.

2. The method according to claim 1, wherein, when the robot receives an explore role action, the robot is expected to move towards a frontier cell closest to the robot in a field of view; and, when the robot receives a cover role action, the robot is expected to move towards a target cell closest to the robot in the field of view.

3. The method according to claim 1, wherein a convolutional neural network (CNN) is used as an encoder to generate an embedding vector by embedding the local perception information; the encoder is shared among all the robots; the output role action is spliced with the embedding vector of the local perception information as an input of the primitive policy; and, before the encoded information of the local perception information is extracted, an unexplored area is masked.

4. A system for adaptive role selection in a coordinated multi-robot search task, comprising: an action space module, configured to define a role action space as two discrete values: [explore, cover]; a role selection module, configured to acquire and input local perception information o_t^i and joint perception information jo_t^i into a role policy, and to output a role action ρ_t^i; wherein the local perception information comprises an obstacle map, an explored map, a covered map, and a position map, and the joint perception information comprises a merged explored map and a merged covered map; and a primitive action output module, configured to input the local perception information o_t^i and the output role action ρ_t^i into a primitive policy, and to output a primitive action a_t of the robot to interact with the environment; wherein a training of the role policy comprises: identifying, by the robot, frontier cells or target cells based on the local perception information o_t^i, and evaluating an expected reward between the exploration and the coverage by using the joint perception information jo_t^i; wherein the expected reward is calculated as:
J(θ_r) = 𝔼[∇_{θ_r} log π_{θ_r}^i(ρ_t^i | o_t^i, jo_t^i) · A_t^i], where π_{θ_r}^i represents the parametric representation of the role policy and A_t^i is the advantage function of the role policy; defining an exploration reward R_e and a coverage reward R_c for each robot, the reward function of the role policy being R_t = αR_e + βR_c, where α and β are the reward weight coefficients of the explore role action and the cover role action, respectively; using, by the role policy, a centralized-training distributed-execution architecture based on an (Actor-Critic)^R structure, and training the role policy by using a multi-agent reinforcement learning algorithm; wherein, in a centralized training phase, the role action is output by an Actor^R network, and a state value function V_r(s) is calculated by a Critic^R network to obtain an advantage function A_r(s, ρ) used to judge a rationality of the role action calculated by the Actor^R network; wherein the advantage function A_r(s, ρ) specifically is:
Â_r = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T−t+1}δ_{T−1}, where δ_t = r_t + γV(s_{t+1}) − V(s_t), r_t is the environmental reward at the time t, V(s_t) and V(s_{t+1}) represent the state value functions at the time t and the time t+1, respectively, and γ is the discount factor; after taking the role action ρ in state S, if the value of A is greater than 0, representing that the role action is better than an average, which is defined as a reasonable choice; if the value of A is less than 0, representing that the role action ρ is worse than the average, which is defined as a not-good choice; and, in a training phase of the primitive policy, setting two different rewards: the exploration reward R_e and the coverage reward R_c; a base reward R_p(t) of the primitive policy for each time t being: R_p(t) = R_e = (Σ_{i∈N} Σ_{k∈Ē_t} u_t^k · 𝟙[q_t^i = p^k]) / B_e, if ρ_t^i = 0; and R_p(t) = R_c = Σ_{i∈N} Σ_{j∈C_t} c_t^j · 𝟙[q_t^i = p^j], if ρ_t^i = 1; where ρ_t^i is the output role action, the binary variable u_t^k indicates whether a robot i moves to a passable area at the time t, the passable area being an unexplored grid cell, i ∈ N, and N is a number of robots; wherein u_t^k = 1 indicates that a robot position q_t^i is located at a position p^k of a grid cell k, the grid cell k belonging to the passable area Ē_t; otherwise, u_t^k = 0 indicates that the robot i does not move to the passable area at the time t; the binary variable c_t^j indicates whether the robot i moves to a target cell at the time t, wherein c_t^j = 1 indicates that the robot position q_t^i is located at a position p^j of a target grid cell j, the target grid cell j belonging to an existing target set C_t in the environment; otherwise, c_t^j = 0 indicates that the robot i does not move to the target cell at the time t.

5. A terminal device, comprising a processor and a memory, wherein the processor is configured to execute instructions and the memory is configured to store the instructions; wherein, when the instructions are loaded and executed by the processor, a method for adaptive role selection in a coordinated multi-robot search task according to claim 1 is performed.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a flow chart of a method for adaptive role selection in a coordinated multi-robot search task according to an example of the present invention;

(2) FIG. 2 is a structural diagram of a role policy and a primitive policy in an example of the present invention;

(3) FIG. 3 is an example diagram of a simulation environment in an example of the present invention.

DETAILED DESCRIPTION

(4) It should be pointed out that the following detailed descriptions are all illustrative and are intended to provide further descriptions of the present invention. Unless otherwise specified, all technical and scientific terms used in the present invention have the same meanings as those usually understood by a person of ordinary skill in the art to which the present invention belongs.

(5) It should be noted that the terms used herein are merely used for describing specific implementations, and are not intended to limit exemplary implementations of the present invention. As used herein, the singular form is also intended to include the plural form unless the context clearly dictates otherwise. In addition, it should further be understood that the terms "comprise" and/or "comprising" used in this specification indicate the presence of the stated features, steps, operations, devices, and components, and/or combinations thereof.

Example 1

(6) In one or more embodiments, a method for adaptive role selection in a coordinated multi-robot search task is provided, decoupling the task planning and the task execution of the complex task. Wherein, the task planning allows a robot to learn roles from an upper-level perspective, and the roles are obtained by calculation through a role policy. Role selection between different time steps is driven by a role switching mechanism. The task execution is achieved by the primitive policy.

(7) Upper-level task planning is accomplished through the role selection framework, which consists of a role policy trained by multi-agent reinforcement learning and which guides the robot to autonomously select, in the current state, the role that best exploits its own expertise. In the process of sequential role planning, an intelligent role switching mechanism enables different roles to promote each other and dynamically improve performance. In addition, in the present example, the task execution of the multi-robot system is completed through the primitive policy, which makes decisions based on the local perception information, conditioned on the role output by the upper-level role policy.

(8) As shown in FIG. 2, the present example introduces a double Actor-Critic reinforcement learning algorithm, aiming at embedding the concept of role into the multi-robot area search task. According to the present invention, a centralized-training distributed-execution architecture commonly used for coordinated multi-robot tasks is adopted, so that the multi-robot system obtains distributed policies. The present invention deploys two Actor-Critic networks for training, namely (Actor-Critic)^R and (Actor-Critic)^P, wherein the Actor^R is the network used for role selection, and its output role actions are used as inputs to the Actor^P and Critic^P networks, guiding the Actor^P and Critic^P networks to be trained in the direction of the roles output at the upper level.

(9) In the training process, the role state value function V_r and the primitive state value function V_p are calculated by the Critic^R network and the Critic^P network, respectively. In the execution process, the primitive actions are sampled from the primitive action probability distribution produced by the primitive policy.

(10) In the execution process, the Critic^R and Critic^P networks may be removed, so that the multi-robot system completes the mapping calculation of the upper-level role actions based on the role policy (the Actor^R network), and completes the mapping calculation of the lower-level interaction actions based on the primitive policy (the Actor^P network). The interactive action distribution and the state value function of a robot are conditioned on the upper-level role actions, and different roles correspond to different subtasks.
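A minimal sketch of how the two Actor-Critic pairs described above could be wired together is given below. The class names, layer sizes, and observation dimensions are illustrative assumptions, not the patented implementation; it only shows the role actor feeding the primitive actor, with the critics kept for centralized training.

```python
# Hypothetical sketch of the double Actor-Critic layout: (Actor-Critic)^R selects a role
# from local/joint observations, (Actor-Critic)^P maps (local obs, role) to a primitive action.
import torch
import torch.nn as nn

class RoleActor(nn.Module):                      # Actor^R: outputs role logits [explore, cover]
    def __init__(self, obs_dim, joint_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + joint_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))
    def forward(self, obs, joint_obs):
        return self.net(torch.cat([obs, joint_obs], dim=-1))

class PrimitiveActor(nn.Module):                 # Actor^P: outputs logits over 5 primitive actions
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 5))
    def forward(self, obs, role):
        return self.net(torch.cat([obs, role.float().unsqueeze(-1)], dim=-1))

class Critic(nn.Module):                         # Critic^R / Critic^P: used only in centralized training
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, state):
        return self.net(state).squeeze(-1)

# Distributed execution: only the two actors are needed.
obs, joint_obs = torch.randn(4, 16), torch.randn(4, 8)   # 4 robots, toy dimensions
role = torch.distributions.Categorical(logits=RoleActor(16, 8)(obs, joint_obs)).sample()
action = torch.distributions.Categorical(logits=PrimitiveActor(16)(obs, role)).sample()
```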

(11) Referring to FIG. 1, the method of the present example specifically comprises the following processes:

(12) (1) Defining a role action space as two discrete values: [explore, cover];

(13) Specifically, in a coordinated multi-robot task, the size of the role action space usually matches the number of subtasks. Therefore, based on the attributes of the exploration and coverage subtasks, the role action space is defined as two discrete values: [explore, cover]. When the robot receives a command of the explore role action, the robot is controlled to move towards the nearest frontier cell in its field of view (FOV). Similarly, when the robot receives a command of the cover role action, the robot is controlled to move towards the nearest target cell in the FOV.
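The behavior expected from each role can be illustrated with a short sketch. The grid representation, the Manhattan-distance metric, and the helper names below are assumptions used only for illustration.

```python
# Hypothetical helper: pick the nearest frontier cell (explore) or target cell (cover)
# inside the robot's field of view, using Manhattan distance on a grid.
import numpy as np

def nearest_cell(robot_pos, candidate_cells):
    """Return the candidate cell closest to robot_pos, or None if there are none."""
    if not candidate_cells:
        return None
    dists = [abs(r - robot_pos[0]) + abs(c - robot_pos[1]) for r, c in candidate_cells]
    return candidate_cells[int(np.argmin(dists))]

def goal_for_role(role, robot_pos, frontier_cells, target_cells):
    # role 0 = explore -> head for the nearest frontier cell in the FOV
    # role 1 = cover   -> head for the nearest target cell in the FOV
    return nearest_cell(robot_pos, frontier_cells if role == 0 else target_cells)

print(goal_for_role(0, (2, 2), frontier_cells=[(0, 5), (3, 3)], target_cells=[(6, 6)]))  # (3, 3)
```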

(14) (2) Acquiring and inputting local perception information o_t^i and joint perception information jo_t^i into a role policy, and outputting a role action ρ_t^i; wherein, the local perception information comprises an obstacle map, an explored map, a covered map, and a position map; and, the joint perception information comprises a merged explored map and a merged covered map.

(15) Specifically, the present example introduces a role policy to perform task planning, and tasks are completed based on role actions. The joint-observation-based role actions of all robots represent an upper-level understanding of the dynamic area-searching environment. This design facilitates the adaptation of the corresponding multi-robot system to environments of different scales or to highly complex environments containing more robots. The multi-agent proximal policy optimization (MAPPO) algorithm is used to train the role policy and the primitive policy. There are a centralized critic network and distributed actor networks for both the upper-level role policy and the lower-level primitive policy. In this architecture, each robot has an independent local policy network and a centralized state value network.

(16) In a search environment containing static obstacles and targets, each robot moves within its own field of view (FOV), which is defined as r_FOV. Therefore, each robot can only receive partial environmental information within its FOV. At the time t, the robot i acquires 4-channel local perception information o_t^i = {o_t^o, o_t^e, o_t^c, o_t^p} with a size of r_FOV × r_FOV, comprising an obstacle map, an explored map, a covered map, and a position map. The obstacle map o_t^o collects free cells and obstacle cells. Similarly, the explored map o_t^e and the covered map o_t^c collect the positions of frontier cells and target cells, respectively. The position map o_t^p collects the position information of the neighbor robots N_i; wherein the neighbor robots N_i are the set of robots that satisfy the communication condition ‖p_{N_i} − p_i‖ ≤ r_comm, that is, the relative distance between a neighbor robot at p_{N_i} and the robot at p_i is less than the communication distance r_comm. For example, a robot j located at p_j is called a neighbor of a robot i if ‖p_j − p_i‖ ≤ r_comm. In addition, before the encoded information of the local perception information is extracted, the unexplored area is masked, that is, the environmental information in the unexplored area is set to be invisible to the robot. At the implementation level, a binarization function is used to set unexplored areas on all channel maps to 0.
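The construction of the masked 4-channel local observation can be sketched as follows. The grid encodings (binary maps, 1 marking an obstacle/explored/covered cell or a robot) and the cropping scheme are assumptions for illustration.

```python
# Hypothetical construction of the 4-channel local observation o_t^i with the
# unexplored-area mask applied to every channel.
import numpy as np

def local_observation(obstacle, explored, covered, robot_positions, center, r_fov):
    """Crop an r_fov x r_fov window around `center` from each global map and mask
    every channel where the area is still unexplored."""
    r0, c0 = center[0] - r_fov // 2, center[1] - r_fov // 2
    def crop(m):
        out = np.zeros((r_fov, r_fov), dtype=m.dtype)
        for i in range(r_fov):
            for j in range(r_fov):
                r, c = r0 + i, c0 + j
                if 0 <= r < m.shape[0] and 0 <= c < m.shape[1]:
                    out[i, j] = m[r, c]
        return out
    position = np.zeros_like(obstacle)
    for (r, c) in robot_positions:             # neighbor robots within communication range
        position[r, c] = 1
    obs = np.stack([crop(m) for m in (obstacle, explored, covered, position)])
    mask = crop(explored)                       # binary mask: 0 where the cell is unexplored
    return obs * mask                           # unexplored cells become invisible on every channel

H = W = 10
obs = local_observation(np.zeros((H, W)), np.ones((H, W)), np.zeros((H, W)),
                        [(1, 1), (4, 4)], center=(4, 4), r_fov=5)
print(obs.shape)  # (4, 5, 5)
```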

(17) In the higher-level decision-making process (i.e., the Actor^R network), the robot has to perform two kinds of planning simultaneously. On the one hand, frontier cells or target cells should be identified based on the local perception information o_t^i, so as to better perform the exploration or coverage subtasks. On the other hand, the robot needs to use the joint perception information jo_t^i to evaluate the expected reward between the exploration and the coverage. Specifically, the expectation for optimizing the objective function J(θ_r) is calculated as follows:
J(θ_r) = 𝔼[∇_{θ_r} log π_{θ_r}^i(ρ_t^i | o_t^i, jo_t^i) · A_t^i], where π_{θ_r}^i represents the parametric representation of the role policy and A_t^i is an estimate of the advantage function of the role policy.
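For concreteness, the objective above corresponds to a standard policy-gradient term. The minimal sketch below assumes the role logits, sampled roles, and advantages are already available; in practice MAPPO uses a clipped surrogate, but the plain form illustrates the expectation being optimized.

```python
# Minimal sketch of the role-policy gradient term E[grad log pi(rho | o, jo) * A].
import torch

def role_policy_loss(role_logits, role_action, advantage):
    """role_logits: (B, 2) raw outputs of Actor^R; role_action: (B,) sampled roles;
    advantage: (B,) A_t^i from Critic^R. Minimizing this loss ascends J(theta_r)."""
    log_prob = torch.distributions.Categorical(logits=role_logits).log_prob(role_action)
    return -(log_prob * advantage.detach()).mean()

loss = role_policy_loss(torch.randn(8, 2, requires_grad=True),
                        torch.randint(0, 2, (8,)), torch.randn(8))
loss.backward()
```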

(18) The joint perception information jo_t^i = {jo_t^me, jo_t^mc} comprises the merged explored map jo_t^me and the merged covered map jo_t^mc. Wherein, the merged explored map jo_t^me ∈ ℝ^{H×W}, jo_t^me = {o_0^e, …, o_{t−1}^e, o_t^e}, refers to the set of historically explored areas of all the robots, wherein W and H refer to the width and height of the simulation environment. The merged covered map jo_t^mc ∈ ℝ^{H×W}, jo_t^mc = {o_0^c, …, o_{t−1}^c, o_t^c}, refers to the set of historically covered areas of all the robots. Therefore, the local perception information and the joint perception information are used as inputs of the Actor^R network, which outputs a role action probability distribution for the robot.
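One way to realize the merged maps is an element-wise union over each robot's history of binary maps; the data layout below is an illustrative assumption.

```python
# Hypothetical merge of per-robot explored/covered histories into the joint maps
# jo_t^me and jo_t^mc: a cell counts as explored/covered if any robot has marked it
# at any time step up to t.
import numpy as np

def merge_maps(per_robot_histories):
    """per_robot_histories: list (one entry per robot) of lists of HxW binary maps."""
    merged = None
    for history in per_robot_histories:
        for m in history:
            merged = m.copy() if merged is None else np.maximum(merged, m)
    return merged

robot_a = [np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]])]
robot_b = [np.array([[0, 0], [1, 0]])]
print(merge_maps([robot_a, robot_b]))  # [[1 1] [1 0]]
```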

(19) During the training phase of the role policy, two distinct rewards are defined for each robot: the exploration reward R_e and the coverage reward R_c (see Section B3 for the specific settings of these two rewards). Therefore, the reward of the role policy is R_t = αR_e + βR_c, where α and β are the reward weight coefficients of the explore role action and the cover role action, respectively; they are set to modulate the execution ratio of the subtasks in combination with the completion degree of the tasks. When α is set to 1 and β to 0, the robot gives priority to performing the exploration subtask; otherwise, the robot needs to perform the coverage subtask.
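A small sketch of this weighted reward follows. The symbols α and β are reconstructed from the garbled source, and the completion-based switching threshold is an illustrative assumption.

```python
# Sketch of the role-policy reward R_t = alpha*R_e + beta*R_c, with the weights switched
# according to task completion (alpha = 1, beta = 0 while exploration has priority).
def role_reward(r_explore, r_cover, exploration_ratio, switch_at=0.9):
    alpha, beta = (1.0, 0.0) if exploration_ratio < switch_at else (0.0, 1.0)
    return alpha * r_explore + beta * r_cover

print(role_reward(0.4, 1.0, exploration_ratio=0.5))   # 0.4 -> exploration still prioritized
print(role_reward(0.4, 1.0, exploration_ratio=0.95))  # 1.0 -> coverage takes over
```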

(20) The above settings for training the role policy (or the Actor^R network) train a distributed and independent role policy for each robot, which is also the core design of role selection. The present invention trains the role policy with a multi-agent reinforcement learning algorithm under a centralized-training distributed-execution architecture based on the Actor-Critic structure. In the centralized training phase, the Critic^R network calculates the state value V_r(s) and obtains the advantage function A_r(s, ρ) to judge the rationality of the role action calculated by the Actor^R network.

(21) Wherein,
Â_r = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T−t+1}δ_{T−1}, where Â_r represents the advantage estimate of A output by the model, and δ_t = r_t + γV(s_{t+1}) − V(s_t); wherein r_t is the environmental reward at the time t, V(s_t) and V(s_{t+1}) represent the state value functions at the time t and the time t+1, respectively, γ is the discount factor, and λ is the decay coefficient of the estimator.
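The accumulation above is the standard generalized advantage estimation; a minimal sketch, assuming the per-step rewards and value estimates are already collected, is:

```python
# Sketch of the advantage estimate: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
# accumulated backwards with the (gamma*lambda) decay.
import numpy as np

def advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-T list; values: length-(T+1) list of V(s_t). Returns A_hat per step."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):               # backward accumulation of (gamma*lambda)^k * delta
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

print(advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.3, 0.0]))
```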

(22) After taking the role action ρ in a state S, if the value of A is greater than 0, it means that the role action is better than the average, which is a reasonable choice; if the value of A is less than 0, the role action is worse than the average, which means it is not a good choice.

(23) (3) Inputting the local perception information o_t^i and the output role action ρ_t^i into a primitive policy, and outputting a primitive action a_t of the robot to interact with the environment.

(24) Specifically, the present example uses a two-dimensional grid map to model the multi-robot area search environment, and the primitive action space contains five discrete values: {forward, rightward, backward, leftward, stop}. These primitive actions are encoded as one-hot vectors and are determined based on the probability distribution output by the primitive policy Actor^P, i.e., π_{θ_p}.
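Sampling from that distribution and one-hot encoding the result can be sketched as follows; the probability values are placeholders.

```python
# Sketch of sampling a primitive action from the probability distribution produced by
# the primitive policy and encoding it as a one-hot vector.
import numpy as np

PRIMITIVES = ["forward", "rightward", "backward", "leftward", "stop"]

def sample_primitive(probs, rng=np.random.default_rng()):
    idx = rng.choice(len(PRIMITIVES), p=probs)
    one_hot = np.eye(len(PRIMITIVES))[idx]
    return PRIMITIVES[idx], one_hot

action, encoding = sample_primitive(np.array([0.5, 0.2, 0.1, 0.1, 0.1]))
print(action, encoding)
```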

(25) At the time t, the robot i acquires the 4-channel local perception information o_t^i = {o_t^o, o_t^e, o_t^c, o_t^p} with the size of r_FOV × r_FOV, comprising the obstacle map, the explored map, the covered map, and the position map. The information is then encoded by an encoder mapping O → ℝ^F. In the present invention, a convolutional neural network (CNN) is adopted as the encoder to generate an embedding vector z_t^i. This encoder is shared among all the robots. The output role action ρ_t^i of the role policy is concatenated with the embedding vector z_t^i of the local perception information to form the primitive observation with a dimension of F+1.
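A sketch of a shared CNN encoder and the role concatenation is shown below. The channel count, FOV size, layer widths, and feature dimension F are illustrative assumptions, not the claimed network.

```python
# Hypothetical CNN encoder O -> R^F shared by all robots, plus concatenation of the role
# action with the embedding z_t^i to form the (F+1)-dimensional primitive observation.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, r_fov=9, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * r_fov * r_fov, feat_dim), nn.ReLU(),
        )
    def forward(self, local_obs):              # local_obs: (B, 4, r_fov, r_fov)
        return self.cnn(local_obs)             # z_t: (B, feat_dim)

encoder = SharedEncoder()                       # one instance shared among all robots
local_obs = torch.randn(3, 4, 9, 9)             # 3 robots' local observations
role = torch.tensor([0.0, 1.0, 0.0])            # role actions from Actor^R
z = encoder(local_obs)
primitive_obs = torch.cat([z, role.unsqueeze(-1)], dim=-1)   # shape (3, feat_dim + 1)
print(primitive_obs.shape)
```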

(26) B3: Reward settings for the primitive policy. In the training phase of the primitive policy, two different rewards are set in combination with the subtasks, namely an exploration reward R_e and a coverage reward R_c. The base reward R_p(t) for each time t is:

(27) R_p(t) = R_e = (Σ_{i∈N} Σ_{k∈Ē_t} u_t^k · 𝟙[q_t^i = p^k]) / B_e, if ρ_t^i = 0; R_p(t) = R_c = Σ_{i∈N} Σ_{j∈C_t} c_t^j · 𝟙[q_t^i = p^j], if ρ_t^i = 1; where ρ_t^i is the output role action, and the binary variable u_t^k indicates whether the robot i (i ∈ N, which means N robots in total) moves to an unexplored cell of Ē_t at the time t, wherein u_t^k = 1 if k ∈ Ē_t; otherwise, u_t^k = 0 indicates that the robot i does not move to the passable area at the time t. Similarly, the binary variable c_t^j indicates whether the position of the robot i at the time t is a target cell, wherein c_t^j = 1 indicates that the robot position q_t^i is located at the position p^j of the target grid cell j, the target grid cell j belonging to the existing target set C_t in the environment; otherwise, c_t^j = 0 indicates that the position of the robot i at the time t is not a target cell.

(28) When the role action ρ_t^i output from the upper layer is equal to 0, the corresponding action reward is the exploration reward R_e; otherwise, it is the coverage reward R_c. In a fully cooperative multi-robot setup, robots with the same role should share the same global reward. The role reward at the time t is the sum of all local rewards under the same role. When the robot visits a target cell p^j (q_t^i = p^j), it receives a coverage reward of 1. The exploration radius of each robot is set to rad_e, which allows the robot to explore 2·rad_e cells in the discrete grid map. All unexplored cells k reached at the time t are accumulated as the exploration reward, where u_t^k indicates that the robot moves to a passable area (satisfying q_t^i = p^k). In addition, dividing by the exploration ability B_e normalizes the exploration reward to the range (0, 1), aligning it with the coverage reward.
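The base-reward rule above can be sketched as a role-gated computation. The set-based data structures and the call signature below are assumptions for illustration only.

```python
# Sketch of the base reward R_p(t): an exploration reward (newly reached unexplored cells,
# normalized by the exploration ability B_e) when the role is explore (0), and a coverage
# reward (targets reached) when the role is cover (1).
def base_reward(role, robot_positions, unexplored_cells, target_cells, b_e):
    if role == 0:                                   # explore role
        hits = sum(1 for q in robot_positions if q in unexplored_cells)
        return hits / b_e                           # R_e normalized into (0, 1)
    hits = sum(1 for q in robot_positions if q in target_cells)
    return float(hits)                              # R_c: 1 per visited target cell

positions = [(1, 1), (2, 3)]
print(base_reward(0, positions, unexplored_cells={(1, 1), (5, 5)}, target_cells=set(), b_e=4))  # 0.25
print(base_reward(1, positions, unexplored_cells=set(), target_cells={(2, 3)}, b_e=4))          # 1.0
```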

(29) In the present example, corresponding settings are made for the training of the primitive policy (or the Actor^P network). By inputting the role action ρ_t output by the upper-level role policy together with the local perception information o_t, the primitive action a_t is output to interact with the environment. The capability of the primitive policy represents the ability to explore or to cover.

Example 2

(30) In one or more embodiments, a system for adaptive role selection in a coordinated multi-robot search task is provided, comprising: an action space module, configured to define a role action space as two discrete values: [explore, cover]; a role selection module, configured to acquire and input local perception information o_t^i and joint perception information jo_t^i into a role policy, and to output a role action ρ_t^i; wherein the local perception information comprises an obstacle map, an explored map, a covered map, and a position map, and the joint perception information comprises a merged explored map and a merged covered map; and a primitive action output module, configured to input the local perception information o_t^i and the output role action ρ_t^i into a primitive policy, and to output a primitive action a_t of the robot to interact with the environment; wherein, when the output role action ρ_t^i received by a robot is an explore role action, the robot moves towards the frontier cell closest to the robot in its FOV; and, when the output role action ρ_t^i received by the robot is a cover role action, the robot moves towards the target cell closest to the robot in the FOV.

Example 3

(31) In one or more embodiments, a terminal device is provided, comprising a processor and a memory, wherein the processor is configured to execute instructions, and the memory is configured to store the instructions; wherein, when the instructions are loaded and executed by the processor, the method for adaptive role selection in a coordinated multi-robot search task according to Example 1 is performed. For the sake of brevity, details are not repeated herein.

(32) It should be understood that, in the present example, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor or any conventional processor.

(33) The memory can include a read-only memory and a random access memory, and provides instructions and data to the processor. A portion of the memory can also include a non-volatile random access memory. For example, the memory can also store information about the device type.

(34) In the implementation process, each step of the above method can be completed through hardware integrated logic circuits or software instructions in the processor.

(35) Although the specific embodiments of the present invention are described above in combination with the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, on the basis of the technical scheme of the present invention, various modifications or variations that can be made by those skilled in the art without creative effort are still within the protection scope of the present invention.