Urban environment labelling

11538182 · 2022-12-27

Assignee

Inventors

CPC classification

International classification

Abstract

The present invention relates to a method and system for automatic localisation of static objects in an urban environment. More particularly, the present invention relates to the use of noisy 2-Dimensional (2D) image data to identify and determine 3-Dimensional (3D) positions of objects in large-scale urban or city environments. Aspects and/or embodiments seek to provide a method, system, and vehicle for automatically locating static 3D objects in urban environments by using a voting-based triangulation technique. Aspects and/or embodiments also provide a method for updating map data after automatically locating new 3D static objects in an environment.

Claims

1. A computer-implemented method comprising: determining, by a computing system, a data set including images associated with an object captured in an area; distributing, by the computing system, the data set into at least a first cluster that is associated with the object and a second cluster that is associated with the object, wherein the first and second clusters are based on an identification of the images included in the data set; and determining, by the computing system, a position of the object based on triangulating the object according to at least the first cluster and the second cluster, wherein the determining the position of the object comprises: performing, by the computing system, voting-based triangulation on the first cluster, wherein a vote is based on a distance between an estimated position and a position of a detection of the object in an image; and performing, by the computing system, the voting-based triangulation on the second cluster.

2. The method of claim 1, wherein the triangulating the object is based on a hypothesis and the hypothesis is determined based on at least one of: a point triangulated in front of one or more cameras associated with the images in the first cluster or the second cluster, a reprojection error of intersected rays associated with the images in the first cluster or the second cluster, an angle between optical axes associated with the images in the first cluster or the second cluster, or a distance from the object to the one or more cameras associated with the images in the first cluster or the second cluster.

3. The method of claim 2, further comprising: determining, by the computing system, inliers in the first cluster or the second cluster that observe the object within a threshold distance of a position associated with the hypothesis, wherein the hypothesis is confirmed based on a threshold number of the inliers.

4. The method of claim 3, further comprising: removing, by the computing system, the inliers that observe the object from the data set based on a confirmation of the hypothesis; and distributing, by the computing system, the data set into new clusters associated with another object captured in the area.

5. The method of claim 4, wherein the distributing the data set into the new clusters is based on a determination that at least one hypothesis with a threshold number of inliers remains, wherein the threshold number of inliers is based on an average number of inliers per hypothesis and a confidence parameter.

6. The method of claim 4, further comprising: reusing, by the computing system, computations of inliers that are not associated with a confirmed hypothesis to confirm a hypothesis associated with a third cluster of the new clusters.

7. The method of claim 1, further comprising: determining, by the computing system, another position of another object in the area based on at least a third cluster; and merging, by the computing system, the object and the another object based on a determination that the position of the object and the another position of the another object are within a threshold distance.

8. The method of claim 1, wherein the first cluster and the second cluster are independently processed and results of the independently processed first cluster and the independently processed second cluster are merged.

9. The method of claim 8, wherein the object is detected based on a convolutional neural network trained to predict bounding boxes fitted around pixels representing objects using an application of a thresholding schema to determine connected components of the pixels representing the objects in the images.

10. The method of claim 1, wherein the distributing the data set comprises: distributing the images in the data set captured within a first threshold distance of the object into the first cluster until the first cluster reaches a first threshold size; removing the images distributed into the first cluster from the data set; and distributing the images in the data set captured within a second threshold distance of the object into the second cluster until the second cluster reaches a second threshold size.

11. A system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform: determining a data set including images associated with an object captured in an area; distributing the data set into at least a first cluster that is associated with the object and a second cluster that is associated with the object, wherein the first and second clusters are based on an identification of the images included in the data set; and determining a position of the object based on triangulating the object according to at least the first cluster and the second cluster, wherein the determining the position of the object comprises: performing voting-based triangulation on the first cluster, wherein a vote is based on a distance between an estimated position and a position of a detection of the object in an image; and performing the voting-based triangulation on the second cluster.

12. The system of claim 11, wherein the triangulating the object is based on a hypothesis and the hypothesis is determined based on at least one of: a point triangulated in front of one or more cameras associated with the images in the first cluster or the second cluster, a reprojection error of intersected rays associated with the images in the first cluster or the second cluster, an angle between optical axes associated with the images in the first cluster or the second cluster, or a distance from the object to the one or more cameras associated with the images in the first cluster or the second cluster.

13. The system of claim 12, wherein the at least one processor further causes the system to perform: determining inliers in the first cluster or the second cluster that observe the object within a threshold distance of a position associated with the hypothesis, wherein the hypothesis is confirmed based on a threshold number of the inliers.

14. The system of claim 13, wherein the at least one processor further causes the system to perform: removing the inliers that observe the object from the data set based on a confirmation of the hypothesis; and distributing the data set into new clusters associated with another object captured in the area.

15. The system of claim 14, wherein the distributing the data set into the new clusters is based on a determination that at least one hypothesis with a threshold number of inliers remains, wherein the threshold number of inliers is based on an average number of inliers per hypothesis and a confidence parameter.

16. A non-transitory computer-readable storage medium including instructions that, when executed by at least one processor of a computing system, cause the computing system to perform: determining a data set including images associated with an object captured in an area; distributing the data set into at least a first cluster that is associated with the object and a second cluster that is associated with the object, wherein the first and second clusters are based on an identification of the images included in the data set; and determining a position of the object based on triangulating the object according to at least the first cluster and the second cluster, wherein the determining the position of the object comprises: performing voting-based triangulation on the first cluster, wherein a vote is based on a distance between an estimated position and a position of a detection of the object in an image; and performing the voting-based triangulation on the second cluster.

17. The non-transitory computer-readable storage medium of claim 16, wherein the triangulating the object is based on a hypothesis and the hypothesis is determined based on at least one of: a point triangulated in front of one or more cameras associated with the images in the first cluster or the second cluster, a reprojection error of intersected rays associated with the images in the first cluster or the second cluster, an angle between optical axes associated with the images in the first cluster or the second cluster, or a distance from the object to the one or more cameras associated with the images in the first cluster or the second cluster.

18. The non-transitory computer-readable storage medium of claim 17, wherein the at least one processor further causes the computing system to perform: determining inliers in the first cluster or the second cluster that observe the object within a threshold distance of a position associated with the hypothesis, wherein the hypothesis is confirmed based on a threshold number of the inliers.

19. The non-transitory computer-readable storage medium of claim 18, wherein the at least one processor further causes the computing system to perform: removing the inliers that observe the object from the data set based on a confirmation of the hypothesis; and distributing the data set into new clusters associated with another object captured in the area.

20. The non-transitory computer-readable storage medium of claim 19, wherein the distributing the data set into the new clusters is based on a determination that at least one hypothesis with a threshold number of inliers remains, wherein the threshold number of inliers is based on an average number of inliers per hypothesis and a confidence parameter.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) Embodiments will now be described, by way of example only and with reference to the accompanying drawings having like-reference numerals, in which:

(2) FIG. 1 illustrates a semantic map on which traffic lights are detected and labelled according to an embodiment; and

(3) FIG. 2 depicts the logic flow of the robust voting-based triangulation according to an embodiment.

DETAILED DESCRIPTION

(4) An example embodiment will now be described with reference to FIGS. 1 and 2.

(5) In this embodiment, the system starts by receiving a large set of 2D images I.sub.i, with associated camera-intrinsic parameters q.sub.i and 6 degrees-of-freedom poses P.sub.i∈SE(3), and produces a set of 3D positions of objects L.sub.i∈ℝ.sup.3 detected from the set of 2D images.

(6) As illustrated in FIG. 1, the initial set of 2D images are captured from a mapping fleet traversing various cities/urban environments. Section 101 of FIG. 1 shows an example of such environments. The mapping fleet usually comprises vehicles that traverse roads and paths multiple times, in both directions, at varying times of day and during different weather conditions. During this time, the vehicles of the mapping fleet capture images, 103, 104, at regular intervals. The trajectories of the traversing vehicles are also illustrated in FIG. 1 by 102. The data captured by the fleet of mapping vehicles may also be used to generate a map, 101, of the environment by implementing techniques such as SLAM.

(7) Whilst capturing these images, the system records camera-intrinsic parameters such as the optical centre (principal point), focal length, image distortion, etc. Additionally, the poses can be calculated using a large-scale structure-from-motion (SFM) pipeline. State-of-the-art SFM systems construct large-scale maps of an environment and, in this embodiment, it is used to accurately localise the positions of all the sensors (e.g., cameras). Although it is preferred that the poses are calculated using SFM, there is no restriction on the method of calculation or source of the poses as long as they are accurate and globally consistent.

(8) To calculate the pose P.sub.i of each image, each captured image is resized to 640×480 and then fed through a large-scale, distributed structure-from-motion pipeline, which may run on multiple computers.

(9) In order to detect objects in the data set of 2D images, a noisy 2D detector is applied to each image I.sub.i, resulting in a set of object detections Z.sub.i⊂ℝ.sup.2. In the case of traffic lights, an off-the-shelf CNN trained to predict bounding boxes for traffic lights can be used to generate the 2D object detections in the images. Similarly, when detecting other objects in an environment, CNNs pre-trained to predict bounding boxes for that particular object may be used in this system. Examples of the bounding boxes 105n for traffic lights are illustrated in FIG. 1 within the captured images, 103, 104. The detections illustrated in FIG. 1 correspond to true positive detections of traffic lights from obtained/received images.

(10) The detected traffic lights can be shown on the trajectory or map data as indicated by 106n in section 102.

(11) In the CNN architecture used to detect traffic lights, firstly, a binary segmentation network is used to compute the probability of each pixel in a picture depicting a part of a traffic light. Once a probability for each pixel is computed, a thresholding schema is then applied to determine the connected components of pixels representing traffic lights. Finally, to visually aid the detection, a bounding box is fitted around a group of pixels that are detected to be portraying a traffic light.
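The thresholding and connected-component step described above can be sketched as follows. This is a minimal illustration that assumes a per-pixel probability map has already been produced by the segmentation network (which is omitted); the 0.5 threshold and the 4-connectivity choice are illustrative, not taken from the description:

```python
import numpy as np
from collections import deque

def detect_boxes(prob_map, threshold=0.5):
    """Threshold a per-pixel probability map, find 4-connected components
    of above-threshold pixels, and fit a bounding box around each."""
    mask = np.asarray(prob_map) >= threshold
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                # BFS over one connected component, tracking its extent
                q = deque([(y, x)])
                seen[y, x] = True
                y0 = y1 = y
                x0 = x1 = x
                while q:
                    cy, cx = q.popleft()
                    y0, y1 = min(y0, cy), max(y1, cy)
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))  # (x_min, y_min, x_max, y_max)
    return boxes
```

A production system would typically use a library routine for the connected-component labelling; the BFS above merely makes the schema explicit.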

(12) The output detections of this system are usually noisy and suffer from many false positives and false negatives. As discussed later, the system compensates for these noisy detections by using a large amount of data. One alternative to using a detector as described above is to use hand-annotated labels from internet-based crowdsourcing platforms such as “Amazon Mechanical Turk”, which enable individuals and businesses to coordinate the use of human intelligence to perform tasks that computers currently struggle to complete. However, this alternative also suffers from label noise: each image will have associated ground-truth 2D labels of traffic lights with label noise estimated at approximately 5%.

(13) In this way, many physical 3D objects are detected from the initial data set of 2D images. Each 2D data set covers an area of an urban environment containing a certain number of physical objects, for example, traffic lights. In this embodiment, a traffic light is considered recoverable if it has been observed from at least two different viewpoints with an angle difference of at least θ.sub.min; where this condition is not met and the 3D position of a traffic light cannot be accurately determined, that traffic light is not recoverable. However, as the amount of data increases, almost all the traffic lights in any given area eventually become recoverable.

(14) Bearing in mind that each physical 3D object can be captured by a plurality of images taken from varying angles, many of these detections may in fact relate to the same physical object. Using the set of 2D detections alone, it is not possible to identify which detections are associated with which physical object and thus to identify multiple detections of the same physical object. Any feature descriptors that might associate/differentiate the detections would be useless under the appearance changes that are seen in outdoor environments, particularly in the case of objects that look similar. Traffic lights are a good example of physical 3D objects that are difficult to associate/differentiate. Many existing approaches rely on a need to visually match objects between images.

(15) Without relying on appearance, the only differentiating factor between the physical 3D objects is their position in 3D space. Current methods of multi-view triangulation cannot be used without positions of the objects in 3D space. Instead of using traditional methods of triangulation, this system uses a robust voting-based triangulation method, as shown in FIG. 2, to simultaneously determine 2D associations of physical objects and the position of the traffic lights/physical objects in 3D space. The flow shown in FIG. 2 lists various input and output variables. For example, the inputs may include, but are not limited to, a set of images, camera intrinsics, camera poses, a maximum reprojection error and a minimum ratio of inliers, and the output comprises a 3D position for each physical 3D object.

(16) For each pair of detections (z.sub.a, z.sub.b), where a and b are indices into the 2D detections, from two different images (I.sub.i, I.sub.j), a 3D hypothesis h.sub.ab is created under the assumption that these two detections correspond to the same physical 3D object/traffic light. The pairing of 2D detections results in a total of O(N.sup.2) hypotheses, where N is the total number of detected traffic lights.

(17) In some cases, a hypothesis can be constrained to or is considered viable if it satisfies the following:

(18) 1) triangulation constraint: the point is triangulated in front of each camera,

(19) 2) rays intersect in 3D space: the reprojection error is smaller than d.sub.max,

(20) 3) the projection is stable: the angle between the optical axes is larger than θ.sub.min,

(21) 4) distance to camera: the distance from the traffic light to either camera is less than r.sub.max.

(22) Optionally, additional constraints reflecting prior information about the location of a traffic light can be used to further restrict the hypothesis space.
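The four viability constraints above can be expressed as a single predicate. In this sketch the geometric quantities (per-camera depths, reprojection error, optical-axis angle, and camera ranges) are assumed to be precomputed, and the default values for d.sub.max, θ.sub.min and r.sub.max are illustrative placeholders rather than values from the description:

```python
def hypothesis_viable(depths, reproj_error, axis_angle_deg, ranges,
                      d_max=2.0, theta_min=5.0, r_max=150.0):
    """Check the four viability constraints for a triangulated pair.

    depths: depth of the triangulated point in each camera frame,
    reproj_error: reprojection error of the intersected rays (pixels),
    axis_angle_deg: angle between the two optical axes (degrees),
    ranges: distance from the point to each camera (metres).
    """
    return (all(z > 0 for z in depths)          # 1) in front of each camera
            and reproj_error < d_max            # 2) rays intersect in 3D space
            and axis_angle_deg > theta_min      # 3) stable projection
            and all(r < r_max for r in ranges)) # 4) within sensing range
```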

(23) Once a set of hypotheses has been created, the system estimates the 3D position of each hypothesis. This can be achieved using traditional methods of triangulation using the pair of detections z.sub.a, z.sub.b, as shown in FIG. 2:
l.sub.ab←triangulate({z.sub.a,z.sub.b})

(24) One such method of estimating the 3D position l* of each hypothesis is K-view triangulation where K is indicative of the number of detections for each physical object. In the example of the pair of detections (z.sub.a, z.sub.b), K=2. By using K-view triangulation, the sum of the reprojection errors is minimised:

(25) l*=arg min.sub.l Σ.sub.k∈K(π(l,p.sub.k,q.sub.k)−z.sub.k).sup.2,

(26) where: K is {a, b} in this case, π is the projection of the 3D hypothesis l into the camera at position p.sub.k with camera intrinsics q.sub.k.
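As a concrete sketch of this step, a linear (DLT-style) K-view triangulation is shown below. It minimises an algebraic error rather than the reprojection error in the formula above, so it is a stand-in for the nonlinear optimisation, and it assumes each view is described by a 3×4 projection matrix combining the camera intrinsics and pose:

```python
import numpy as np

def triangulate(points_2d, proj_mats):
    """Linear K-view triangulation (DLT).

    Each 2D detection z_k with 3x4 projection matrix P_k contributes
    two rows to a homogeneous system A*l = 0; the least-squares
    solution is the right singular vector of A with the smallest
    singular value.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.vstack(rows)
    _, _, vt = np.linalg.svd(A)
    l_h = vt[-1]                 # homogeneous 3D point
    return l_h[:3] / l_h[3]      # dehomogenise
```

With noise-free detections the linear solution is exact; with noisy detections it is commonly used to initialise the reprojection-error minimisation described in the text.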

(27) For each estimated 3D position, a set of consistent inliers S.sub.ab is computed. This set of inliers consists of all the 2D detections that correctly observe an object/traffic light at the same location. The set of inliers is computed by projecting the 3D position l* into each image and verifying whether the projected position is within d.sub.max of any 2D detection. In this way, the system determines whether the estimated 3D position of a hypothesis is close enough to a 2D detection in an image for that detection to be considered a correct observation, and each such detection gives the hypothesis a vote.
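The inlier/vote computation can be sketched as below. Detections are assumed to be given per image as 2D points with an associated 3×4 projection matrix per image, and the default d_max is illustrative:

```python
import numpy as np

def count_inliers(l_star, detections_per_image, proj_mats, d_max=8.0):
    """Project the hypothesised 3D position into every image and collect
    the detections lying within d_max of the projection; each such
    detection is a vote for the hypothesis.

    Returns a list of (image index, detection index) pairs.
    """
    inliers = []
    l_h = np.append(l_star, 1.0)
    for i, (dets, P) in enumerate(zip(detections_per_image, proj_mats)):
        x = P @ l_h
        if x[2] <= 0:                # behind the camera: no vote possible
            continue
        proj = x[:2] / x[2]
        for j, z in enumerate(dets):
            if np.linalg.norm(proj - np.asarray(z)) < d_max:
                inliers.append((i, j))
    return inliers
```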

(28) Repeating this for each hypothesis, the hypothesis with the maximum number of votes and the detections that voted for it (the inlier detections) are removed, as they have already been identified as correct. This process is repeated until no hypothesis with at least α·M inliers is found, where M is the average number of inliers per hypothesis and α is a tuneable confidence parameter. This process thus creates a set of confirmed hypotheses.
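The greedy confirmation loop can be sketched as follows, with hypotheses represented as a mapping from a hypothesis id to its set of inlier detection ids. One detail is a reading on my part: M is computed here once over the initial hypotheses, which the text does not pin down:

```python
def greedy_select(hypotheses, alpha=2.0):
    """Repeatedly confirm the hypothesis with the most votes, remove its
    inliers from all remaining hypotheses, and stop once no hypothesis
    keeps at least alpha * M inliers, where M is the average inlier
    count of the initial hypotheses and alpha is the tuneable
    confidence parameter.
    """
    hyps = {h: set(s) for h, s in hypotheses.items()}   # defensive copy
    M = sum(len(s) for s in hyps.values()) / max(len(hyps), 1)
    confirmed = []
    while hyps:
        best = max(hyps, key=lambda h: len(hyps[h]))
        if len(hyps[best]) < alpha * M:
            break                      # no sufficiently supported hypothesis left
        confirmed.append(best)
        used = hyps.pop(best)
        for s in hyps.values():        # reuse inlier computations: just subtract
            s -= used
    return confirmed
```

Subtracting the used inliers in place, rather than recomputing votes from scratch, corresponds to the reuse of inlier computations mentioned later in the text.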

(29) In the case of a noisy but unbiased 2D detector and a uniform distribution of the data, the system converges to the correct solution as the amount of data increases. For example, this can compensate for false negative and/or false positive detections. This is because noisy detections form hypotheses with small numbers of votes, while correct detections gather consistent votes over time. As the amount of data increases, these two vote counts begin to separate, and α is the threshold on their ratio. Notably, the number of received votes is relative to the amount of initial data (2D images) received by the system.

(30) Finally, for every hypothesis its 3D position is refined by optimising the reprojection error over all the hypothesis detections. This entire flow of the system is presented in FIG. 2.

(31) The above method works well for small-scale scenarios but does not scale well to large, city-scale settings due to its potential O(N.sup.4) complexity, where N is the number of detected traffic lights. A slightly better complexity of O(N.sup.3) can be achieved by reusing the computation of the inliers after each iteration. However, to reduce the complexity of the method further, a distribution schema based on splitting the data set into clusters is preferred. In this way, the above method can be used to process each cluster independently, and the results of the clusters can then be merged at the end.

(32) A simple clustering schema can be implemented whereby the system identifies the closest images to a detected traffic light until a cluster of size N.sub.max is created, at which point the cluster is removed from the data set, and the process continues until it terminates.
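One possible reading of this clustering schema is sketched below, simplified to a single object position and 2D camera positions; the real system would iterate over many detected objects:

```python
import math

def cluster_images(image_poses, object_pos, n_max):
    """Greedy clustering sketch: repeatedly take the n_max images
    closest to a detected object, emit them as one cluster, and remove
    them from the pool until the pool is empty.

    image_poses: dict of image id -> (x, y) camera position,
    object_pos: (x, y) position of a detected object,
    n_max: maximum cluster size N_max.
    """
    pool = dict(image_poses)
    clusters = []
    while pool:
        ranked = sorted(pool, key=lambda i: math.dist(pool[i], object_pos))
        cluster = ranked[:n_max]     # the n_max closest remaining images
        clusters.append(cluster)
        for i in cluster:
            del pool[i]
    return clusters
```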

(33) After traffic lights from each cluster are triangulated using the method above, it might be the case that the same traffic light is triangulated in two different clusters. To resolve this issue, all pairs of traffic lights closer than 1 metre are merged, producing the final set of labels L.
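The cross-cluster merge can be sketched as a simple greedy grouping, replacing each group of mutually close positions with its centroid. The centroid choice is an illustrative assumption (the text only says the pairs are merged), and the greedy grouping does not re-examine groups once formed:

```python
import math

def merge_labels(positions, min_sep=1.0):
    """Merge duplicate triangulations from neighbouring clusters: any
    position closer than min_sep metres to an existing group joins it,
    and each group is replaced by the centroid of its members."""
    groups = []                          # each group: list of 3D positions
    for p in positions:
        hit = next((g for g in groups
                    if any(math.dist(p, q) < min_sep for q in g)), None)
        if hit is not None:
            hit.append(p)
        else:
            groups.append([p])
    # centroid of each group becomes one final label
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
```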

(34) Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure.

(35) Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to system aspects, and vice versa. Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination.

(36) It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.