METHOD OF ACQUIRING SENSOR DATA ON A CONSTRUCTION SITE, CONSTRUCTION ROBOT SYSTEM, COMPUTER PROGRAM PRODUCT, AND TRAINING METHOD
20240181639 · 2024-06-06
Inventors
- Nitish KUMAR (Buchs, CH)
- Sascha KORL (Buchs, CH)
- Luca BARTOLOMEI (Cantello, IT)
- Lucas TEIXEIRA (Zürich, CH)
- Margarita CHLI (Zürich, CH)
CPC classification
B25J9/1664
PERFORMING OPERATIONS; TRANSPORTING
B25J9/162
PERFORMING OPERATIONS; TRANSPORTING
G05D1/0088
PHYSICS
G06N3/006
PHYSICS
G05B2219/40576
PHYSICS
G05D1/246
PHYSICS
International classification
B25J5/00
PERFORMING OPERATIONS; TRANSPORTING
G05D1/246
PHYSICS
Abstract
A method of acquiring sensor data on a construction site by at least one sensor of a construction robot system comprising at least one construction robot is provided, wherein a sensor is controlled using a trainable agent, thus improving the quality of acquired sensor data. A construction robot system, a computer program product, and a training method are also provided.
Claims
1. A method of acquiring sensor data on a construction site by at least one sensor of a mobile construction robot system comprising at least one construction robot, the method comprising controlling the at least one sensor using a trainable agent.
2. The method according to claim 1, including selecting the sensor by the trainable agent.
3. The method according to claim 1, including controlling a pose of the sensor using the trainable agent.
4. The method according to claim 1, including acquiring at least one of image data or depth image data by the at least one sensor.
5. The method according to claim 1, comprising semantic classification.
6. The method according to claim 1, comprising at least one of localizing the construction robot, trajectory planning of the construction robot, or mapping of at least a part of the construction site.
7. The method according to claim 1, including inferring an informativeness measure by the trainable agent.
8.-10. (canceled)
11. A construction robot system comprising a construction robot, at least one sensor for acquiring sensor data, and a control unit, wherein the control unit comprises a trainable agent, and wherein the mobile construction robot system is configured to acquire sensor data using the method according to claim 1.
12. The mobile construction robot system according to claim 11, wherein the mobile construction robot comprises as the at least one sensor at least one of an image sensor or a depth image sensor.
13. A computer program product including a storage readable by a control unit of a mobile construction robot system comprising at least one sensor for acquiring sensor data, the storage carrying instructions which, when executed by the control unit, cause the construction robot to acquire sensor data using the method according to claim 1.
14. A training method for training a trainable agent of a control unit of a mobile construction robot system according to claim 11, the method comprising training the trainable agent using at least one artificially generated set of sensor data.
15. The training method according to claim 14, including introducing noise into the at least one artificially generated set of sensor data.
16. The method according to claim 2, including controlling a pose of the sensor using the trainable agent.
17. The method according to claim 2, including acquiring at least one of image data or depth image data by the at least one sensor.
18. The method according to claim 2, comprising semantic classification.
19. The method according to claim 2, comprising at least one of localizing the construction robot, trajectory planning of the construction robot, or mapping of at least a part of the construction site.
20. The method according to claim 2, including inferring an informativeness measure by the trainable agent.
21. The method of claim 1, wherein the mobile construction robot system comprises a wheeled vehicle.
22. The method of claim 1, wherein the mobile construction robot system comprises a drone.
23. The method of claim 5, wherein semantic classification includes providing semantic classes corresponding to a construction site background that is expected, or not expected, to be represented in building information model (BIM) data.
Description
IN THE DRAWINGS
[0062] As far as possible, same reference signs are used for functionally equivalent elements within the description and in the figures.
[0064] The method 10 will now be described with reference to an example wherein a construction robot, e. g. a construction drone, uses image data in combination with depth image data as input for navigation. Furthermore, in this example the trainable agent is a RL agent, i. e. the trainable agent is configured for reinforcement learning.
[0065] According to this example, the construction robot shall navigate to a target position on a construction site not previously seen by the construction robot. The image data and the depth image data are acquired by a plurality of sensors. Based on an output of the trainable agent, the trajectory of the construction robot and, thus, of the sensors is adapted such that regions are favored that provide sensor data of high informativeness for localization purposes. In that way, the trainable agent controls the sensors, in particular their poses, in order to stay on a trajectory that permits a safe way to the target position on the construction site. The target position may thus be reached with a high success rate and, in particular, independently of clutter, moving persons, active work zones with reduced visibility, or the like.
[0066] The method 10 uses three main modules: a pose estimation module 12, a trainable agent 14 and a path-planning module 16.
[0067] The pose estimation module 12 takes as input image data 18 from a camera system, e. g. a monocular camera system, and depth image data 20, e. g. from a depth-sensitive camera system. The image data 18 may preferably consist of RGB image data.
[0068] The image data 18 and the depth image data 20 are processed to estimate a pose of a construction robot and to estimate landmarks, in particular 3D landmarks, of the construction site. Furthermore, the pose estimation module 12 generates an occupancy map from the depth image data 20.
[0069] A classifier 22 classifies the image data 18 and provides a semantic mask.
[0070] The landmarks and the occupancy are assigned point-by-point to semantic classes using the semantic mask.
[0071] The trainable agent 14 utilizes the semantic mask to generate an optimal action, which consists of a set of weights to assign to each semantic class.
[0072] The optimal action, i. e. the set of weights, is then communicated to the path planning module 16.
[0073] Finally, the path planning module 16 generates and outputs an optimal trajectory. For this, it considers the dynamics of the construction robot and perceptional quality.
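The data flow between the three modules can be sketched as follows; all interfaces, class labels, and return values are illustrative stubs for exposition, not the actual implementation.

```python
# Illustrative sketch of the three-module data flow (all interfaces
# and values are assumptions, not the actual implementation).

def pose_estimation(image, depth_image):
    """Estimate robot pose and landmarks; build an occupancy map (stubbed)."""
    pose = (0.0, 0.0, 1.0)               # estimated (x, y, z)
    landmarks = [(1.0, 2.0, 0.5)]        # 3D landmarks from visual odometry
    occupancy_map = {(1, 2, 0): "wall"}  # voxel -> semantic label
    return pose, landmarks, occupancy_map

def classify(image):
    """Return a per-pixel semantic mask (stubbed as a class-id grid)."""
    return [[0, 1], [1, 2]]              # e.g. 0=floor, 1=wall, 2=clutter

def trainable_agent(semantic_mask, num_classes=3):
    """Map the semantic mask to one weight per class (the 'optimal action')."""
    return [1.0 / num_classes] * num_classes  # placeholder uniform weights

def plan_path(pose, landmarks, occupancy_map, class_weights):
    """Optimize a trajectory given the per-class perception weights."""
    return [pose, (0.5, 0.5, 1.0)]       # placeholder waypoint list

# One control cycle: sensors -> pose estimation -> agent -> planner.
image, depth = "rgb-frame", "depth-frame"
pose, lms, occ = pose_estimation(image, depth)
weights = trainable_agent(classify(image))
trajectory = plan_path(pose, lms, occ, weights)
```

The stubs only fix the shape of the interfaces: the agent consumes the semantic mask and emits one weight per semantic class, which the planner then uses in its objective.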
[0074] In more detail:
[0075] The image data 18 and the depth image data 20 preferably comprise streams of data.
[0076] The pose estimation module 12 generates 3D reconstructions of the surroundings of the construction robot from the depth image data 20, which are used to generate a dense point cloud. The dense point cloud is stored in an occupancy map employing a 3D circular buffer.
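An occupancy map backed by a 3D circular buffer can be sketched as below; the buffer size, resolution, and labelling scheme are assumptions. Wrapping world coordinates with a modulo lets the map move with the robot without copying voxel data.

```python
import numpy as np

# Minimal sketch of a voxel map backed by a 3D circular buffer
# (sizes, resolution and labels are assumptions for illustration).

class CircularVoxelBuffer:
    def __init__(self, size=64, resolution=0.1):
        self.size = size                  # voxels per axis
        self.resolution = resolution      # metres per voxel
        self.occ = np.zeros((size,) * 3, dtype=np.uint8)  # 0 = free

    def _index(self, point):
        # World position -> wrapped buffer index along each axis.
        idx = np.floor(np.asarray(point) / self.resolution).astype(int)
        return tuple(idx % self.size)

    def insert(self, point, label=1):
        self.occ[self._index(point)] = label  # mark voxel occupied

    def query(self, point):
        return self.occ[self._index(point)]

buf = CircularVoxelBuffer()
buf.insert((1.23, -0.4, 0.9), label=2)    # e.g. label 2 = "wall"
```

Because insertion and lookup use the same wrapped index, old voxels far behind the robot are simply overwritten as new ones arrive, which is the point of the ring-buffer layout.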
[0077] Furthermore, the pose estimation module 12 utilizes a visual odometry (VO) algorithm for estimating the construction robot's pose using the image data 18. In principle, any VO algorithm may be used to estimate the pose of the camera system. In the present example, ORBSLAM (R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, ORB-SLAM: a Versatile and Accurate Monocular SLAM System, IEEE Transactions on Robotics, 2015) is used, which is a keyframe-based VO algorithm. It is a vision-only system; thus, the scale may not be retrievable. As will be described further below, the trainable agent 14 may be trained by simulation of artificial scenes, thus giving access to ground-truth information. The ground-truth information may be used to re-scale the estimated position and the 3D landmarks.
[0078] Both the occupancy map and the landmarks go through a classification step, in which they are assigned to semantic classes using the semantic mask.
[0079] The classifier 22 for generating the semantic mask from the image data 18 may, in principle, be of any suitable kind, e. g. Yolov3 (Yolov3: An incremental improvement, J. Redmon and A. Farhadi, CoRR, 2018).
[0080] The semantic mask also serves as input to the trainable agent 14. The trainable agent 14 outputs an optimal action, which represents values associated with a perceptual informativeness of each semantic class. The optimal action is fed into the path planning module 16.
[0081] The path planning module 16 uses the optimal action to reason about a next best action. The optimal action is utilized as a set of weights in the objective function to be optimized by the path planning module 16. This favors tracking and triangulation of points belonging to parts of the scene particularly useful for camera-based state estimation.
PATH PLANNING MODULE
[0082] This section explains the functioning of the path-planning module 16 in more detail.
[0083] One of the objectives is to let the construction robot move through areas well-suited for VO. For this, the construction robot is to learn which semantic classes are less likely to generate a localization drift.
[0084] The robot learns this by interacting with the environment, selecting an action, and receiving a reward value as feedback.
[0085] Here, an action corresponds to a set of weights, one for each semantic class, in a perception objective function to be optimized in the path planning module 16. The path planning module 16 uses a kinodynamic A* path search, followed by a B-Spline trajectory optimization:
1) Kinodynamic Path Search
[0086] In the first planning step, an aim is to encourage navigation in well-textured areas. The path search is limited to the robot's position in R.sup.3. The trajectory is represented as three independent time-parametrized polynomial functions p(t):
with d∈{x,y,z}.
[0087] The system is assumed to be linear and time-invariant, and the construction robot's state is defined as
s(t):=[p(t).sup.T,{dot over (p)}(t).sup.T, . . . ,p.sup.(n−1)(t).sup.T].sup.T∈R.sup.3n
with control input u(t):=p.sup.(n)(t)∈U:=[−u.sub.max,u.sub.max].sup.3⊂R.sup.3 and n=2, corresponding to a double integrator. Given the current construction robot's state s(t), the control input u(t), and a labelled occupancy map M of the environment, a cost of a trajectory is defined as
where ∥u(t)∥.sup.2 is the control cost; d.sup.j.sub.M(p(t),M) represents a penalty for navigating far away from areas associated with the semantic class j∈{0, . . . , N}, with N the total number of classes; and T is the total time of the trajectory. The terms w.sub.u and w.sub.T are constant weights associated with the respective costs, while w.sup.j is the weight associated with the semantic class j, assigned by the current optimal action. It may change as the construction robot gathers additional experience.
[0088] The cost d.sup.j.sub.M(p(t),M) is defined as
where v.sub.j=[v.sub.x,v.sub.y,v.sub.z].sup.T are the voxels of the occupancy map M with semantic label j, indicated with M.sub.j⊂M. The cost d.sup.j.sub.M(p(t),M) is composed of two potential functions that are calculated as
d.sub.xy(p(t),v.sub.j):=(p.sub.x(t)−v.sub.x).sup.2+(p.sub.y(t)−v.sub.y).sup.2 (4)
and, by defining Δz:=|p.sub.z(t)−v.sub.z|,
where d* controls the minimum height of the construction robot with respect to the voxels in M.sub.j. In order to speed up the search in the A* algorithm, a heuristic adapted to match the cost definitions may be used.
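A minimal sketch of the semantic distance penalty (Eq. 4) and a trajectory cost in the spirit of the description may look as follows. The vertical potential around d* and the closest-voxel aggregation are assumptions, since the text elides the exact forms; the weight names mirror w.sub.u, w.sup.j, and w.sub.T above.

```python
import numpy as np

# Hedged sketch of Eq. 4 and the trajectory cost. The vertical hinge
# around d_star and the min-over-voxels aggregation are assumptions.

def d_xy(p, v):
    # Eq. 4: squared horizontal distance between position p and voxel v.
    return (p[0] - v[0]) ** 2 + (p[1] - v[1]) ** 2

def d_z(p, v, d_star=1.0):
    # Assumed vertical potential: penalize flying closer than d_star
    # above the class voxels; zero once the robot is high enough.
    dz = abs(p[2] - v[2])
    return (d_star - dz) ** 2 if dz < d_star else 0.0

def class_penalty(p, voxels, d_star=1.0):
    # Penalty w.r.t. class-j voxels; here the closest voxel is used.
    return min(d_xy(p, v) + d_z(p, v, d_star) for v in voxels)

def trajectory_cost(controls, positions, class_voxels, class_weights,
                    w_u=1.0, w_T=0.1, dt=0.1):
    # Control effort + weighted semantic penalties + total-time cost.
    control_cost = w_u * sum(np.dot(u, u) for u in controls) * dt
    semantic_cost = sum(
        w_j * sum(class_penalty(p, vox) for p in positions) * dt
        for w_j, vox in zip(class_weights, class_voxels))
    time_cost = w_T * len(positions) * dt  # T: total trajectory time
    return control_cost + semantic_cost + time_cost
```

In the A* search, each candidate motion primitive would be scored with such a cost, so high-weight classes pull the searched path towards their voxels.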
2) Trajectory Optimization
[0089] While the trajectory computed in the path-searching step encourages navigation towards informative areas, the trajectory optimization step leverages the additional information given by the landmarks from the VO. The trajectory is parametrized as a uniform B-Spline of degree K.
[0090] It is defined as
where q.sub.i are the control points at times t.sub.i with i∈{0, . . . ,N}, and B.sub.i,K−1(t) are the basis functions. Each control point in {q.sub.0,q.sub.1, . . . ,q.sub.N} encodes both the position and the orientation of the construction robot, i.e. q.sub.i:=[x.sub.i,y.sub.i,z.sub.i,ψ.sub.i].sup.T∈R.sup.4 with ψ.sub.i∈[−π,π). The B-Spline is optimized in order to generate smooth, collision-free trajectories, encouraging the triangulation and tracking of high-quality landmarks. For a B-Spline of degree K defined by N+1 control points {q.sub.0,q.sub.1, . . . ,q.sub.N}, the optimization acts on {q.sub.K,q.sub.K+1, . . . ,q.sub.N−K} while keeping the first and last K control points fixed due to boundary constraints. The optimization problem is formulated as a minimization of the cost function
F.sub.TOT=λ.sub.sF.sub.s+λ.sub.fF.sub.f+λ.sub.cF.sub.c+λ.sub.lF.sub.l+λ.sub.vF.sub.v (7),
where F.sub.s is a smoothness cost; F.sub.c is a collision cost; F.sub.f is a soft limit on the derivatives (velocity and acceleration) over the trajectory; F.sub.l is a penalty associated with losing track of high-quality landmarks currently in the field of view; and F.sub.v is a soft constraint on the co-visibility between control points of the spline. The coefficients λ.sub.s, λ.sub.c, λ.sub.f, λ.sub.l, and λ.sub.v are fixed weights associated with each cost.
[0091] While maintaining the original cost formulations, similarly to Eq. 2, a novel perception cost that accommodates multiple semantic classes is introduced:
where L.sup.C.sub.j is the set of 3D landmarks associated with class j expressed in a camera frame C, and o.sub.k is a smooth indicator function determining the visibility of landmark l.sub.C from the control pose q.
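The uniform B-Spline parametrization used in the trajectory optimization step can be evaluated with the standard Cox-de Boor recursion. The sketch below is the textbook definition, not the patent's implementation; the uniform knot vector is the usual choice for a uniform B-Spline.

```python
import numpy as np

# Textbook Cox-de Boor evaluation of a uniform B-Spline (illustrative,
# unoptimized; k denotes the spline degree).

def bspline_basis(i, k, t, knots):
    """Basis function B_{i,k}(t) via the Cox-de Boor recursion."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k] != knots[i]:
        left = ((t - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, t, knots))
    right = 0.0
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

def eval_uniform_bspline(control_points, degree, t):
    """Evaluate sum_i B_{i,degree}(t) * q_i with a uniform knot vector."""
    n = len(control_points)
    knots = np.arange(n + degree + 1, dtype=float)  # uniform knots
    return sum(bspline_basis(i, degree, t, knots) * np.asarray(q)
               for i, q in enumerate(control_points))
```

Each control point q_i would be the 4-vector [x, y, z, ψ] from the description; inside the valid parameter range the basis functions sum to one, so the curve stays in the convex hull of its control points.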
[0092] The optimal set of weights, i. e. the above-mentioned optimal action, for each semantic class is computed in real-time by the trainable agent 14 using a policy modeled as a neural network, which is trained in an episode-based deep RL-fashion.
Trainable Agent
a) Structure
[0093] The trainable agent 14, in this example in the form of a RL agent, maps from semantic masks to optimal actions, employing an Actor-Critic model.
[0095] As previously described, the action consists of the set of optimal weights w.sup.j∈[0,1] with j∈{0, . . . , N} used by the path planning module 16 according to Eq. 2 and Eq. 8, in which N is the total number of semantic classes.
[0096] The Actor and the Critic networks share a first part, composed of a 3-layer Convolutional Neural Network (CNN) module 24, followed by a Long Short-Term Memory (LSTM) module 26.
[0097] The LSTM module 26 provides the memory of the generated policy and captures spatial dependencies that would otherwise be hard to identify, as some semantic classes can be linked together (e.g. wall and installation object). The final part of the Critic consists of two Fully Connected (FC) layers composed of 64 units each, while the optimal action is output by the Actor from three FC layers with 128 units each.
[0098] In order to reduce the dimension of the input space, the color mask may be converted into grayscale. Furthermore, the resulting image may be downsampled. By using such a semantic image as input, generalization of the generated policy may be improved. Also, training may be accelerated and improved.
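The input reduction described in this paragraph can be sketched as follows; the channel-averaging and block-averaging choices, as well as the downsampling factor, are assumptions.

```python
import numpy as np

# Sketch of the agent's input reduction: colour semantic mask ->
# grayscale -> downsampled image. Averaging choices are assumptions.

def preprocess_mask(mask_rgb, factor=4):
    gray = mask_rgb.mean(axis=2)               # (H, W, 3) -> (H, W)
    h, w = gray.shape
    h, w = h - h % factor, w - w % factor      # crop to a multiple
    gray = gray[:h, :w]
    # Downsample by averaging factor x factor blocks.
    return gray.reshape(h // factor, factor,
                        w // factor, factor).mean(axis=(1, 3))

mask = np.random.default_rng(0).integers(0, 255, (64, 64, 3)).astype(float)
small = preprocess_mask(mask, factor=4)        # 64x64 -> 16x16
```

The smaller single-channel image keeps the spatial layout of the semantic classes while shrinking the network input by a factor of 3 · factor².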
b) Training
[0099] Policy optimization is performed at fixed-step intervals. For this, an on-policy algorithm, e. g. according to J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal Policy Optimization Algorithms, CoRR, 2017, may be used.
[0100] The training of the trainable agent 14 or, respectively, of the policy, is performed based on data and rewards collected in each episode.
[0101] To reduce the localization error and to increase the chances of getting to the target destination, the reward function received by the trainable agent 14 at step t is defined as
R.sub.t(p(t),e(t)):=R.sub.S+w.sub.ER.sub.E(e(t))+w.sub.GR.sub.G(p(t)) (9),
where R.sub.S is the survival reward, R.sub.E is associated to the localization error e(t) and R.sub.G to the progress towards the goal position. The survival reward is assigned at every step, unless tracking is lost:
Note that losing track of the VO system is not penalized explicitly, in order not to penalize promising actions that lead to high errors due to faulty initialization of the visual tracks at the beginning of the episode.
[0102] The reward associated with the localization error is instead assigned at every step and encourages actions that reduce the drift in the VO system:
where e.sub.min and e.sub.max are the minimum and the maximum acceptable errors, respectively, and R.sub.E.sup.max is the maximum reward value. Finally, the last component of the reward function favors the progress towards the goal position p.sub.G(t) and is inversely proportional to the distance between the current construction robot position and the destination:
where R.sub.G.sup.max is the maximum achievable reward value. So, when the construction robot reaches the goal, it receives a final reward equal to R.sub.G.sup.max.
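The reward of Eq. 9 can be sketched as follows with the parameter values given later in the description (R.sub.E.sup.max=5, R.sub.G.sup.max=50, e.sub.min=0.5 m, e.sub.max=2.5 m, w.sub.E=3, w.sub.G=0.1). The survival reward value and the linear fall-off of R.sub.E between e.sub.min and e.sub.max are assumptions, as the exact forms are elided in the text.

```python
# Hedged sketch of Eq. 9. R_S's magnitude and the linear decay of the
# localization-error reward between e_min and e_max are assumptions.

def reward(dist_to_goal, loc_error, tracking_ok, at_goal,
           R_S=0.1, R_E_max=5.0, R_G_max=50.0,
           e_min=0.5, e_max=2.5, w_E=3.0, w_G=0.1):
    r = R_S if tracking_ok else 0.0            # survival reward
    # Localization-error reward: full below e_min, zero above e_max,
    # assumed linear in between.
    if loc_error <= e_min:
        r_e = R_E_max
    elif loc_error >= e_max:
        r_e = 0.0
    else:
        r_e = R_E_max * (e_max - loc_error) / (e_max - e_min)
    # Goal-progress reward, inversely proportional to the distance;
    # the full R_G_max is granted on reaching the goal.
    r_g = R_G_max if at_goal else R_G_max / (1.0 + dist_to_goal)
    return r + w_E * r_e + w_G * r_g
```

For example, a tracked step at the goal with a 0.1 m error collects the survival reward plus the full error and goal components.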
[0103] Thus, at the beginning of an episode, the construction robot is placed at a given starting position, the VO tracking system is initialized, and an end target position is set.
[0104] Then, the construction robot navigates towards the target position generating trajectories by optimizing the cost functions defined in Eq. 2 and Eq. 7, given the optimal set of weights output by a current policy.
[0105] During movement, the construction robot monitors the localization error. The movement and, thus, the episode, ends when either the target position is reached or the VO system loses track.
[0106] In order to maximize the generalization of the learned policy and to avoid overfitting to a specific scene, the trainable agent 14 is trained in a set of randomly generated environments using a simulator comprising a game engine and a robot simulator engine. In this example, which is based on a drone-like construction robot, the Unity framework may be used as the game engine. The Flightmare framework (Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza, Flightmare: A Flexible Quadrotor Simulator, Conference on Robot Learning, 2020) may be used as the robot simulator engine. In general, the game engine and/or the robot simulator engine may be selected depending on the type of construction site and/or the type of construction robot, e. g. flying robot, legged robot, wheeled robot, robot with tracked wheels, etc., to be trained.
[0107] Continuing with the example, the simulated construction drone may be attributed a set of sensors which corresponds to the sensors of the real construction drone. In this example, it is simulated as being equipped with a front-looking camera mounted with a pitch of 60°.
[0108] The game engine may, thus, provide the image data 18 required by the VO systems as well as the depth image data 20. Additionally, or as an alternative, photorealistic data, e. g. from photogrammetry, may be used.
[0109] In a variant of the method 10, it may also provide the semantic masks. Hence, for the purpose of training, the classifier 22 need not be simulated, but may be replaced by calculated semantic masks representing ground truth.
[0111] Preferably, noise, in particular zero-mean Gaussian noise, may be applied to the depth images in order to mimic the noise in real sensors, such as stereo or depth cameras.
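The noise injection described above can be sketched as below; the standard deviation is an assumed value, chosen only for illustration.

```python
import numpy as np

# Sketch: zero-mean Gaussian noise added to simulated depth images to
# mimic real stereo/depth cameras. sigma is an assumed value.

def add_depth_noise(depth, sigma=0.02, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)   # fixed seed for reproducibility
    noisy = depth + rng.normal(0.0, sigma, size=depth.shape)
    return np.clip(noisy, 0.0, None)     # depth cannot be negative

depth = np.full((4, 4), 2.0)             # a flat 2 m depth image
noisy = add_depth_noise(depth)
```

Clipping at zero reflects that a depth sensor never reports negative range; a real sensor model might additionally make sigma grow with distance.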
[0112] At the beginning of each episode, a new scene is generated, and a target destination is placed randomly in the scene.
[0113] The simulated construction drone (virtually) starts navigating towards the target position, and the episode ends when either the goal is reached or the VO system loses track.
[0114] The trainable agent 14 outputs actions at fixed time intervals or steps, communicates them to the path planning module 16, and collects the reward as feedback.
[0115] In the first episode of the training process, the policy may be initialized randomly. The training continues until a maximum number of steps across all episodes is reached, for example more than 1000, particularly preferably between 5000 and 10000, e. g. 9000.
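The episode loop of paragraphs [0112] to [0115] can be sketched at a high level as follows; the termination probabilities, the per-episode step cap, and all interfaces are placeholder assumptions, and policy optimization itself is only indicated by a comment.

```python
import random

# High-level sketch of the episode-based training loop. Probabilities,
# caps and interfaces are assumptions; PPO-style updates are elided.

def run_training(max_total_steps=9000, steps_per_episode=200, seed=0):
    rng = random.Random(seed)
    total_steps, episodes = 0, 0
    while total_steps < max_total_steps:
        episodes += 1
        # New random scene and a randomly placed target each episode.
        for _ in range(steps_per_episode):
            total_steps += 1
            action = [rng.random() for _ in range(3)]  # per-class weights
            # ... plan with `action`, step the simulator, collect reward ...
            if rng.random() < 0.01:    # goal reached -> episode ends
                break
            if rng.random() < 0.005:   # VO loses track -> episode ends
                break
        # Policy optimization at fixed-step intervals would happen here.
    return episodes, total_steps

episodes, steps = run_training()
```

The outer loop stops once the global step budget is exhausted, regardless of how many episodes it took, matching the fixed-step termination criterion described above.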
[0116] The set of semantic classes may be fixed for all scenes. In general, the system, and thus the method 10, can handle any set of classes.
[0117] The reward parameters are set to R.sub.E.sup.max=5 and R.sub.G.sup.max=50, with minimum and maximum localization errors e.sub.min=0.5 m and e.sub.max=2.5 m. To compute the total reward as in Eq. 9, the weights for components associated to the localization error and to the goal-reaching task are set to w.sub.E=3 and w.sub.G=0.1, respectively.
[0118] An example of training performance of the trainable agent 14 is shown in
[0119] As depicted by the initial sharp increase in the reward curve, the trainable agent 14 quickly learns to identify semantic classes that allow robust localization, resulting in a decrease of the pose estimation error. The training performance subsequently plateaus, as visible from the flattening of the reward curve and the small increase in the translational error. Despite the slightly higher RMSE, the reward does not drop, as the trainable agent 14 is able to reach the target destination more frequently. This indicates that an optimal behavior is reached and that the oscillations in performance are linked more to the randomness of the scene and, consequently, of the VO algorithm's performance.
[0120] Finally,
[0121] In this embodiment, the control unit 104 is arranged inside the mobile construction robot 102. It comprises a computing unit 106 and a computer program product 108 including a storage readable by the computing unit 106. The storage carries instructions which, when executed by the computing unit 106, cause the computing unit 106 and, thus, the mobile construction robot 102 to execute the method 10 as previously described.
[0122] Furthermore, the mobile construction robot 102 comprises a robotic arm 110. The robotic arm may have at least 6 degrees of freedom. It may also comprise a lifting device for increasing the reach and for adding another degree of freedom of the robotic arm 110. The mobile construction robot 102 may comprise more than one robotic arm.
[0123] The robotic arm 110 comprises an end effector, on which a power tool 113 is detachably mounted. The mobile construction robot 102 may be configured for drilling, grinding, plastering, and/or painting floors, walls, ceilings, or the like. For example, the power tool 113 may be a drilling machine. It may comprise a vacuum cleaning unit for automatic removal of dust. The robotic arm 110 and/or the power tool 113 may also comprise a vibration damping unit.
[0124] The robotic arm 110 is mounted on a mobile base 116 of the mobile construction robot 102. In this embodiment, the mobile base 116 is a wheeled vehicle.
[0125] Furthermore, the mobile construction robot 102 may comprise a locating mark 115 in the form of a reflecting prism. The locating mark 115 may be used for high-precision localization of the construction robot. This may be particularly useful if a high-precision position detection device, e. g. a total station, is available on at least a part of the construction site 101.
[0126] Then, for example, the mobile construction robot 102 may navigate to a target position on that part of the construction site 101 using the method 10. After arriving at the target position a working position, e. g. for a hole to be drilled, may be measured and/or fine-tuned using the high-precision position detection device.
[0127] The mobile construction robot 102 comprises a plurality of additional sensors. In particular, it comprises a camera system 112 comprising three 2D-cameras. It further comprises a LIDAR scanner 114.
[0128] It may comprise further modules, for example a communication module, in particular for wireless communication, e. g. with an external cloud computing system (not shown in