Positioning system and method for determining the three dimensional position of a movable object
20240029304 · 2024-01-25
Abstract
The invention relates to a system and method for determining the three-dimensional position of a movable object. The system comprises a camera that is attachable to the object and configured to capture an image containing image representations of visual features that are spaced apart from the camera, and storage means for storing a 3D model of the visual features, the 3D model comprising a set of landmarks containing position information of the corresponding visual features. The system is configured to: detect the positions of the visual features in the captured image as image positions of the visual features; estimate the camera orientation; define a virtual image plane, preferably based on the camera orientation; project the positions of the visual features stored in the 3D model onto the virtual image plane to create projected positions of the visual features; perform matching between the image positions and the projected positions to identify the visual features captured in the image; and determine the position of the object from the three-dimensional positions of the visual features identified in the matching process.
Claims
1. Positioning system for determining the three-dimensional position of a movable object (1), comprising: a camera (10) that is attachable to the object at a defined relative position and orientation to the object (1) and is configured to capture an image containing image representations of at least a part of a plurality of visual features (2) that are placeable or placed at fixed locations spaced apart from the camera (10), storage means (11) for storing a 3D model of the plurality of visual features (2), the 3D model comprising a set of landmarks (l.sub.1, l.sub.2, . . . , l.sub.n) for the visual features (2), the landmarks (l.sub.1, l.sub.2, . . . , l.sub.n) containing information on the three-dimensional position of the corresponding visual feature (2), and computing means (12) that are configured to: detect the positions of image representations of the visual features (2) in the captured image as image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2); perform an estimation of the orientation of the camera (10); define a virtual image plane, preferably based on the estimation of the orientation of the camera (10); project the three-dimensional positions of at least a part of the visual features (2) stored in the 3D model onto the virtual image plane to create projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2); perform matching between the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) and the projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2) to identify the visual features (2) captured in the image; determine the three-dimensional position of the object (1) based on the information on the three-dimensional positions of the visual features (2) identified in the matching process.
2. Positioning system according to claim 1, comprising orientation determining means (13), preferably a gyroscope, an accelerometer and/or a magnetometer, wherein the computing means (12) are configured to perform the estimation of the orientation of the camera (10) based on information obtained from the orientation determining means (13).
3. Positioning system according to claim 1 or 2, wherein the computing means (12) are configured to perform the matching between the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) and the projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2) based on descriptors (d.sub.in) of the image positions (i.sub.n) and the descriptors (d.sub.pn) of the projected positions (p.sub.n), wherein the computing means (12) are configured to compute the descriptor (d.sub.in/d.sub.pn) for an image/projected position (i.sub.n/p.sub.n) based on the distribution of neighbouring image/projected positions (i.sub.n/p.sub.n).
4. Positioning system according to claim 3, wherein the computing means (12) are configured to compute the descriptor (d.sub.in/d.sub.pn) for an image/projected position (i.sub.n/p.sub.n) as a set of numerical values, the numerical values comprising at least one of:
5. Positioning system according to any of the preceding claims, wherein the computing means (12) are configured to perform matching until a predetermined number of image positions (i.sub.1, i.sub.2, . . . , i.sub.n), preferably three image positions (i.sub.1, i.sub.2, . . . , i.sub.n), are matched to projected positions (p.sub.1, p.sub.2, . . . p.sub.n).
6. Positioning system according to any of the preceding claims, wherein the computing means (12) are configured to determine a first transformation (T) for transforming matched image positions (i.sub.1, i.sub.2, . . . , i.sub.n) to matched projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) obtained during the matching between the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) and the projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2); apply the first transformation (T) to the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) captured in the image to obtain refined image positions (i.sub.1, i.sub.2, . . . , i.sub.n); perform matching between the refined image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) and the projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2) to identify the visual features (2) captured in the image.
7. Positioning system according to any of the preceding claims, wherein the computing means (12) are configured to perform an estimation of the distance between the camera (10) and at least one visual feature (2) whose image representation is contained in the captured image.
8. Positioning system according to claim 7, wherein the computing means (12) are configured to perform the estimation of the distance between the camera (10) and at least one visual feature (2) whose image representation is contained in the captured image based on the average distance between the camera (10) and visual features (2) during previous operation of the positioning system.
9. Positioning system according to claim 7, wherein the computing means (12) are configured to perform the estimation of the distance between the camera (10) and at least one visual feature (2) whose image representation is contained in the captured image based on the number of image representations of visual features (2) in the captured image.
10. Positioning system according to any of the preceding claims, wherein the computing means (12) are configured to define a plane perpendicular to the direction of gravity as virtual image plane.
11. Positioning system according to any of the preceding claims, wherein the computing means (12) are configured to perform an unskewing of the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2).
12. Positioning system according to claim 11, wherein the computing means (12) are configured to determine a first unskewing transformation (S.sub.1) which transforms the estimated orientation of the camera (10) to a vector orthogonal on the virtual image plane, and wherein the unskewing of the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) is performed by applying the first unskewing transformation (S.sub.1) to the image positions (i.sub.1, i.sub.2, . . . , i.sub.n).
13. Positioning system according to claim 11, wherein the computing means (12) are configured to determine a second unskewing transformation (S.sub.2) which transforms the estimated orientation of the camera (10) to a vector in the direction or opposite to the direction of gravity, and wherein the unskewing of the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) is performed by applying the second unskewing transformation (S.sub.2) to the image positions (i.sub.1, i.sub.2, . . . , i.sub.n).
14. Positioning system according to any of the preceding claims, wherein the computing means (12) are configured to perform a correction of the captured image and/or the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) for lens distortions of the camera (10).
15. Method for determining the three-dimensional position of a movable object (1), the method comprising: a) receiving an image captured by a camera (10) with a known position and/or orientation relative to the object (1), the image containing image representations of at least a part of a plurality of visual features (2) that are located at fixed locations spaced apart from the object (1), b) detecting the positions of image representations of the visual features (2) in the captured image as image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2); c) performing an estimation of the orientation of the camera (10); d) defining a virtual image plane, preferably based on the estimation of the orientation of the camera (10); e) projecting three-dimensional positions of at least a part of the visual features (2) stored in a 3D model (30) onto the virtual image plane to create projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2), the 3D model (30) comprising a set of landmarks (l.sub.1, l.sub.2, . . . , l.sub.n) for the visual features (2), the landmarks (l.sub.1, l.sub.2, . . . , l.sub.n) containing information on the three-dimensional position of the corresponding visual feature (2); f) performing matching between the image positions (i.sub.1, i.sub.2, . . . , i.sub.n) of the visual features (2) and the projected positions (p.sub.1, p.sub.2, . . . , p.sub.n) of the visual features (2) to identify the visual features (2) captured in the image; g) determining the three-dimensional position of the object (1) based on the information on the three-dimensional positions of the visual features (2) identified in the matching process.
Description
[0172] The above and further features and advantages of the present invention will become more readily apparent from the following detailed description of preferred embodiments of the invention with reference to the accompanying drawings.
[0179] The camera 10 is communicatively coupled to a computing means 12 and a storage means 11, which is configured to store a 3D model 30 that will be discussed in greater detail below. The computing means 12 and the storage means 11 may be contained in a unit or units distinct from the camera 10, e.g. in a computer that is communicatively coupled to the camera 10. Alternatively, the computing means 12 and/or the storage means 11 may be provided at the position of the camera 10, e.g. in the form of a microcomputer and corresponding data storage device.
[0180] The positioning system comprises the camera 10, the storage means 11 and the computing means 12.
[0181] Spaced apart from the object 1, a plurality of visual features 2 is present in the environment of the object 1. In the embodiment shown in
[0182] In
[0183] Visual features 2 that are in the field of view of the camera 10 will be visible in an image captured by the camera 10 that corresponds to the depicted image plane. The image representations of these visual features 2 are schematically indicated in the depicted image plane above the camera 10. When the object 1 and, thus, the camera 10, are moved, the camera's position and/or orientation will change. Correspondingly, images captured by the camera 10 will display a movement of the image representations of the visual features 2 in the camera's field of view. Some visual features 2 will leave the field of view of the camera 10 and other visual features 2 will enter the field of view of the camera 10.
[0184] The storage means 11 stores a 3D model 30 of the visual features 2. The 3D model 30 contains entries, referred to as landmarks l1, . . . , ln, that hold information on the three-dimensional positions of the visual features 2. If n visual features 2 are accounted for in the 3D model 30, the 3D model 30 will contain n landmarks l1, . . . , ln, each landmark containing information on the three-dimensional position of a corresponding visual feature 2. As detailed in the general part of the description, information on the 3D position of the visual features 2 may, for example, be provided as discrete coordinates, or as a probability distribution.
[0185] If the positioning system can identify the landmarks ln corresponding to the visual features 2 represented in the image, the position of the camera 10 and the object 1 can be calculated, based on the known position of the visual features 2 contained in the 3D model 30 and the intrinsic properties of the camera 10.
[0186] If, however, the camera 10 is obstructed and moved during operation of the positioning system, or if the positioning system is turned off for a while and turned back on again, it is no longer known which visual features 2 are captured in the image of the camera 10. Without this knowledge, the position of the camera 10 and the object 1 cannot be determined.
[0187] Thus, in order to determine the three-dimensional position of the camera 10 and, thus, of the object 1, the visual features 2 in the field of view of the camera 10, whose image representations are contained in the images captured by the camera 10, need to be identified.
[0188] In order to accomplish this task, the computing means 12 are configured to perform a method for identifying the visual features 2 captured in an image of the camera 10. Based on the identification of the visual features 2 represented in the image of the camera 10, the position of the object 1 can be determined as explained above.
[0189] The method for identifying the visual features 2 captured in an image of the camera 10 is based on a novel, efficient algorithm that is designed to match the visual features captured in the image of the camera 10 to the 3D model 30, in order to enable an identification of the visual features 2 captured in the image.
[0190] In order to determine the position of the object 1, the camera 10 of the positioning system is configured to capture an image that contains image representations of at least a part of the visual features 2. In the example of
[0191] The detection of the image positions i1, . . . , in may be based on a suitable feature extraction algorithm, as detailed in the general part of the description. If the image representations of the visual features 2 have a substantial extent in the image (i.e., if the image representations of the visual features 2 cover a substantial area of the image with a plurality of pixels), the image positions i1, . . . , in may be determined as a center of the area of the image representation and stored as x/y (pixel) coordinates. Alternatively, the image positions i1, . . . , in may be stored as a distribution representing the area of the image representation.
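The centre-of-area variant described above can be illustrated with a minimal Python sketch. It assumes that a preceding feature extraction step (not shown) has already grouped the pixels of each image representation into a blob; the function name `blob_centroid` and the sample pixel sets are hypothetical and serve only as illustration.

```python
def blob_centroid(pixels):
    """Return the centre (x, y) of a blob given as (x, y) pixel coordinates."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("a blob needs at least one pixel")
    count = len(pixels)
    centre_x = sum(x for x, _ in pixels) / count
    centre_y = sum(y for _, y in pixels) / count
    return (centre_x, centre_y)


# Hypothetical detections: each blob is the pixel set of one image representation.
blobs = [
    [(10, 20), (11, 20), (10, 21), (11, 21)],  # 2x2 blob
    [(40, 5), (41, 5), (42, 5)],               # 1x3 blob
]
image_positions = [blob_centroid(blob) for blob in blobs]
```

Each resulting tuple is one image position stored as x/y pixel coordinates.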
[0192] The image positions i1, . . . , in thus obtained represent a set of 2D positions that needs to be matched with the 3D positions of the visual features 2 stored in the landmarks l1, . . . , ln in the 3D model 30, in order to identify the visual features 2 corresponding to the image positions i1, . . . , in.
[0193] In order to enable this matching, the positioning system is configured to define a virtual image plane, onto which the 3D positions of the visual features 2 are projected. This projection yields projected positions p1, . . . , pn of the visual features. Matching is performed between the (2D) image positions i1, . . . , in and the (2D) projected positions p1, . . . , pn.
[0194] The matching between the image positions i1, . . . , in and the projected positions p1, . . . , pn is facilitated if the image plane of the captured image from which the image positions i1, . . . , in are determined and the virtual image plane with which the projected positions p1, p2, . . . , pn are generated have a similar orientation. In order to enable a suitable definition of the virtual image plane, the positioning system is configured to estimate the orientation of the camera 10.
[0196] The orientation of the camera 10 may be estimated with the aid of orientation determination means 13 that are rigidly attached to the camera 10. Examples of orientation determination means 13 include inertial measurement units such as a gyroscope, an accelerometer or a magnetometer. Preferably, the orientation determination means 13 are configured to enable at least an approximate estimation of the orientation of the camera 10. For example, if the orientation determination means 13 comprise an accelerometer, the direction of gravity $\vec{g}$ can be determined when the camera 10 is at rest. Taking into consideration the fixed and known positioning of the orientation determination means 13 relative to the camera 10, the orientation $\vec{c}$ of the camera 10 can be estimated.
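As a rough illustration of the accelerometer-based estimate, the following Python sketch derives the direction of gravity in the camera frame and the tilt of the optical axis from a single reading taken at rest. Several points are assumptions made for the sketch and are not taken from the description: the sensor is assumed to report the reaction to gravity when resting (so gravity points opposite to the reading), the mounting is given as a 3x3 rotation matrix `sensor_to_camera`, and the optical axis is assumed to be the camera's +z axis.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / norm for c in v)


def mat_vec(m, v):
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def gravity_in_camera_frame(accel_reading, sensor_to_camera):
    # Assumed convention: at rest the sensor reports the reaction to gravity,
    # so the direction of gravity is the negated, normalized reading.
    g_sensor = normalize(tuple(-a for a in accel_reading))
    # The fixed, known mounting rotates sensor-frame vectors into the camera frame.
    return mat_vec(sensor_to_camera, g_sensor)


def tilt_from_vertical(g_camera, optical_axis=(0.0, 0.0, 1.0)):
    """Angle in radians between the (unit) optical axis and the upward vertical."""
    up = tuple(-c for c in g_camera)
    cos_angle = sum(u * a for u, a in zip(up, optical_axis))
    return math.acos(max(-1.0, min(1.0, cos_angle)))


IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))  # hypothetical mounting: axes aligned
g_cam = gravity_in_camera_frame((0.0, 0.0, 9.81), IDENTITY)
tilt = tilt_from_vertical(g_cam)  # camera looking straight up
```

An accelerometer alone leaves the rotation about the vertical undetermined, which is why the description also mentions a magnetometer or gyroscope.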
[0197] Additionally or alternatively, the orientation determination means 13 may comprise a magnetometer that is configured to determine the direction of the earth magnetic field, which may be used as reference to estimate the orientation of the camera. Further, the orientation determination means 13 may comprise a gyroscope that can be used to estimate the orientation of the camera 10 by integrating the angular velocity provided by the gyroscope during operation of the camera.
[0198] It should be noted that the provision of orientation determination means 13 is not a compulsory element of the positioning system. The orientation of the camera 10 may also be estimated by other means, e.g., by instructing a user to point the camera 10 in a defined direction (e.g. north, straight upwards, tilted upwards at an angle of 45° etc.). Once the camera 10 is approximately oriented as instructed, the camera 10 captures the image that will be used to determine the position of the camera 10 and the object 1. The orientation of the camera 10 is then estimated to correspond to the instructed direction.
[0199] Further, it is possible to estimate the orientation of the camera 10 by the use of specialized camera pedestals or mounts on which the object 1 with the camera 10 is placed before capturing the image for the position determination.
[0200] The positioning system is configured to define a virtual image plane on which the positions of the visual features 2 stored in the 3D model 30 will be projected. The definition of the virtual image plane is preferably based on the estimation of the orientation of the camera 10. In some embodiments, the virtual image plane will be defined as a plane perpendicular to the estimated orientation of the camera 10. Thus, the virtual image plane is defined as parallel to the image plane of the camera 10 at the estimated orientation. In other embodiments described below, the virtual image plane is defined based on the position of the visual features 2 stored in the 3D model 30, or as a plane perpendicular to the direction of gravity.
[0201] Given the defined virtual image plane, the positioning system may project the 3D positions of the visual features 2 stored in the 3D model 30 onto the virtual image plane. Thus, projected positions p1, . . . , pn of the visual features 2 are created. Like the image positions i1, . . . , in, the projected positions p1, . . . , pn form a set of 2D positions that may be expressed as discrete x/y coordinates or x/y probability distributions, depending on the information on the position of the visual features 2 that is available from the landmarks l1, . . . , ln.
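A minimal Python sketch of this projection step is given below for landmarks with discrete coordinates. It assumes an orthographic projection onto a plane through the origin with the given normal; the projection model, the choice of in-plane axes and the names `plane_basis` and `project_landmarks` are assumptions of the sketch, not details taken from the description.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / norm for c in v)


def plane_basis(normal):
    """Return two orthonormal in-plane axes (u, v) for a plane with the given normal."""
    n = normalize(normal)
    # Pick a helper axis that is not nearly parallel to the normal.
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(helper, n))
    v = cross(n, u)
    return u, v


def project_landmarks(landmarks, normal):
    """Orthographically project 3D landmark positions onto the virtual image plane."""
    u, v = plane_basis(normal)
    return [(dot(l, u), dot(l, v)) for l in landmarks]


# Hypothetical landmarks; the virtual image plane is perpendicular to the z axis.
landmarks = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0), (0.0, 2.0, 5.0)]
projected = project_landmarks(landmarks, (0.0, 0.0, 1.0))
```

The in-plane axes are chosen arbitrarily here, which is acceptable when the subsequent matching uses quantities that do not depend on rotation or translation within the plane.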
[0202] The positioning system is configured to perform matching between the image positions i1, . . . , in and the projected positions p1, . . . , pn in order to identify the visual features 2 captured in the image. The matching process is an iterative procedure that is preferably performed by a pairwise comparison of image positions ip and projected positions pq. In some embodiments, a new set of projected positions p1, . . . , pn is created for each choice of a pair of image position ip and projected position pq.
[0203] During matching, the positioning system is configured to select one image position ip and one projected position pq corresponding to a landmark lq. The goal of the matching process is to determine whether the selected image position ip and the selected projected position pq (and, thus, the corresponding landmark lq) correspond to the same visual feature 2.
[0204] According to one embodiment, the pairwise matching is performed by quantifying the distribution of the neighbouring image/projected positions of the selected image position ip and the selected projected position pq. This quantification is achieved by the calculation of a descriptor dip for the selected image position ip and a descriptor dpq for the selected projected position pq. Descriptors are represented by an array of numerical values that characterize the distribution of the neighbouring image/projected positions.
[0206] For the description of the descriptor value definitions illustrated in
[0208] From the vectors $\vec{d}_1$, $\vec{d}_2$ and $\vec{e}_1$, multiple descriptor values that are invariant to rotations, translations and/or scalings can be computed. For example, the quotient of the magnitudes of the vectors, $|\vec{d}_1|/|\vec{d}_2|$, yields a descriptor value invariant to rotation, translation and scaling. Likewise, the vector product of the vectors $\vec{d}_1$ and $\vec{d}_2$, or the vector product of the normalized vectors, may be considered. From the vector product of the vectors $\vec{d}_1$ and $\vec{d}_2$, the angle between the vectors can be calculated, which is also invariant to rotation, translation and scaling.
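The two invariant descriptor values named above can be sketched in Python as follows. The sketch assumes that the two vectors point from the selected position to its nearest and second-nearest neighbour; this reading, the function name `descriptor` and the sample points are assumptions. The angle is taken unsigned.

```python
import math


def descriptor(index, positions):
    """Return (magnitude ratio, angle) for the position at `index`.

    Assumption: d1 and d2 point from the selected position to its nearest and
    second-nearest neighbour; all positions are distinct 2D points.
    """
    px, py = positions[index]
    others = sorted(
        (q for j, q in enumerate(positions) if j != index),
        key=lambda q: math.hypot(q[0] - px, q[1] - py),
    )
    if len(others) < 2:
        raise ValueError("at least two neighbours are required")
    d1 = (others[0][0] - px, others[0][1] - py)
    d2 = (others[1][0] - px, others[1][1] - py)
    ratio = math.hypot(*d1) / math.hypot(*d2)
    cross = d1[0] * d2[1] - d1[1] * d2[0]  # z component of the vector product
    dot = d1[0] * d2[0] + d1[1] * d2[1]    # scalar product
    angle = abs(math.atan2(cross, dot))    # unsigned angle between d1 and d2
    return (ratio, angle)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
# The same points scaled by 3, rotated by 90 degrees and shifted by (10, -4).
moved = [(-3.0 * y + 10.0, 3.0 * x - 4.0) for x, y in points]

original_descriptor = descriptor(0, points)
moved_descriptor = descriptor(0, moved)
```

Both descriptors agree up to rounding, which illustrates the invariance to rotation, translation and scaling stated above.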
[0210] From these vectors, various invariant descriptor values may be calculated. Non-limiting examples for suitable descriptor values are
[0211] From the normalized vector product of the vectors $\vec{d}_1$ and $\vec{d}_2$, the angle between these vectors may be calculated as a numerical value for the descriptor of the image/projected position ip/pq. Analogously, the angle between the vectors $\vec{d}_2$ and $\vec{d}_3$, and the angle between the vectors $\vec{d}_1$ and $\vec{d}_3$, can be calculated as numerical values of the descriptor of the image/projected position ip/pq from the respective normalized vector or scalar products.
[0212] Matching based on descriptors may be performed as follows. First, one image position ip and one projected position pq are selected. Then, a descriptor dip for the image position ip and a descriptor dpq for the projected position pq are calculated, wherein the descriptor dip and the descriptor dpq contain the same set of descriptor values, e.g. both contain the descriptor values {right arrow over (d.sub.2)}/{right arrow over (d.sub.1)}, {right arrow over (d.sub.1)}{right arrow over (d.sub.3)} and
[0213] The descriptor dip and the descriptor dpq are compared by calculating the differences between the respective descriptor values. If the differences of all descriptor values lie below a respective predetermined threshold, a match is confirmed. Otherwise, it is determined that the image position ip and the projected position pq are no match, and a new pair of image position and projected position is selected for matching.
[0216] Matching may be performed iteratively and may be terminated once a predetermined number of matches has been confirmed.
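The threshold comparison and the termination criterion can be sketched in Python as shown below. The descriptors are assumed to be precomputed tuples of equal length; the greedy pair selection order, the rule that each projected position is used at most once, and all names and sample values are assumptions of this sketch.

```python
def descriptors_match(d_image, d_projected, thresholds):
    """A match is confirmed only if every descriptor value differs by less than its threshold."""
    return all(abs(a - b) < t for a, b, t in zip(d_image, d_projected, thresholds))


def match_positions(image_descriptors, projected_descriptors, thresholds, required_matches):
    """Return (p, q) index pairs; stop once the predetermined number of matches is reached."""
    matches = []
    used = set()  # assumption: each projected position is matched at most once
    for p, d_ip in enumerate(image_descriptors):
        for q, d_pq in enumerate(projected_descriptors):
            if q in used:
                continue
            if descriptors_match(d_ip, d_pq, thresholds):
                matches.append((p, q))
                used.add(q)
                break  # no match for this q means the next pair is tried
        if len(matches) >= required_matches:
            break
    return matches


# Hypothetical descriptor values: (magnitude ratio, angle in radians).
image_descriptors = [(0.50, 1.57), (0.80, 0.40), (0.30, 2.10)]
projected_descriptors = [(0.31, 2.08), (0.90, 1.00), (0.51, 1.55), (0.79, 0.42)]
matches = match_positions(image_descriptors, projected_descriptors, (0.05, 0.05), 2)
```

With a required count of 2, the loop stops after the second confirmed pair and never examines the third image position.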
[0217] If a higher robustness against ambiguous matching is desired, corroboration can be performed after a predetermined number of matches has been found. In this case, a set of the matched image positions ia, ib, . . . , ic and a corresponding set of the matched projected positions px, py, . . . , pz is formed. Preferably, each set contains two or more positions. Then, a (first) transformation T that transforms the set of matched image positions ia, ib, . . . , ic into the set of matched projected positions px, py, . . . , pz is determined. Subsequently, the transformation T is iteratively applied to unmatched image positions id to obtain refined image positions idT. For each refined image position idT, it is investigated whether the refined image position idT lies sufficiently close to a projected position pd. This investigation can be performed, for example, by identifying the projected position closest to the refined image position idT and checking whether the distance between them lies below a predetermined threshold.
[0218] If it is decided that the refined image position idT lies sufficiently close to a projected position pd, a corroboration count is incremented by 1. Corroboration continues in this way until either a predetermined corroboration count is reached or exceeded, or all image positions have been transformed. If the predetermined corroboration count is reached or exceeded, corroboration is affirmative, so that the match from which the transformation T for the corroboration has been determined is confirmed. If the corroboration count does not reach or exceed the predetermined corroboration count after transformation of all image positions, the match is discarded and matching resumes with the selection of a new pair of image position and projected position for matching.
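The corroboration procedure can be sketched in Python under the assumption that the transformation T is a 2D similarity (rotation, uniform scaling and translation), which two matched pairs determine completely. The description does not prescribe this form of T. Positions are represented as complex numbers for brevity; all names and sample values are hypothetical.

```python
def similarity_from_two_matches(i_a, i_b, p_x, p_y):
    """Return (a, b) such that T(z) = a * z + b maps i_a to p_x and i_b to p_y."""
    if i_a == i_b:
        raise ValueError("the two matched image positions must differ")
    a = (p_y - p_x) / (i_b - i_a)  # rotation and scale as one complex factor
    b = p_x - a * i_a              # translation
    return a, b


def corroborate(unmatched_image, projected, transform, max_distance, required_count):
    """Count refined image positions that land close to some projected position."""
    a, b = transform
    count = 0
    for z in unmatched_image:
        refined = a * z + b
        nearest = min(abs(refined - p) for p in projected)
        if nearest < max_distance:
            count += 1
            if count >= required_count:
                return True  # corroboration is affirmative
    return False  # all positions transformed without reaching the count


# Hypothetical data: positions as complex numbers x + yj.
projected = [1 + 1j, 1 + 3j, -1 + 3j, -1 + 1j, 10 + 10j]
transform = similarity_from_two_matches(0j, 1 + 0j, 1 + 1j, 1 + 3j)
unmatched = [1 + 1j, 1j, 7 + 7j]  # the last one has no counterpart
confirmed = corroborate(unmatched, projected, transform, 0.1, 2)
rejected = corroborate(unmatched, projected, transform, 0.1, 3)
```

The inverse variant follows by swapping the roles of image and projected positions when calling both functions.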
[0219] When corroboration is affirmative, it is possible to identify to which projected positions pj and, thus, to which landmarks lj the matched image positions ij correspond. Then, the position of the camera 10 and the object 1 can be determined based on the information contained in the landmarks lj identified in the 3D model 30 by the matching process.
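One possible way to compute the position from identified landmarks is sketched below in Python. It is not taken from the description. It assumes that the camera orientation and intrinsic properties are known well enough to convert each matched image position into a viewing ray expressed in the coordinate frame of the 3D model, and it then finds the point closest, in the least-squares sense, to all lines through the landmarks along these rays. All names and sample values are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / norm for c in v)


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 system a * x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the position is not determined")
    solution = []
    for col in range(3):
        m = [[b[r] if c == col else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(m) / d)
    return tuple(solution)


def camera_position(landmarks, ray_directions):
    """Least-squares point closest to the lines through each landmark along its ray."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for landmark, direction in zip(landmarks, ray_directions):
        r = normalize(direction)
        for i in range(3):
            for j in range(3):
                # (I - r r^T) removes the component along the viewing ray.
                proj = (1.0 if i == j else 0.0) - r[i] * r[j]
                a[i][j] += proj
                b[i] += proj * landmark[j]
    return solve3(a, b)


# Hypothetical matched landmarks and viewing rays of a camera located at (1, 2, 0).
landmarks = [(0.0, 0.0, 5.0), (4.0, 1.0, 6.0), (2.0, 6.0, 4.0)]
rays = [(-1.0, -2.0, 5.0), (3.0, -1.0, 6.0), (1.0, 4.0, 4.0)]
position = camera_position(landmarks, rays)
```

The position of the object then follows from the camera position and the defined relative position of the camera on the object.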
[0220] In the description of the corroboration, the determination of a (first) transformation T for transforming image positions to projected positions has been described. It is also possible to perform corroboration in an inverse manner, by determining a (second) transformation T1 that transforms matched projected positions px, py, . . . , pz into the respective matched image positions ia, ib, . . . , ic. In this case, the corroboration is performed by applying the transformation T1 to projected positions pd and investigating whether the refined projected position pdT1 lies sufficiently close to an image position id.
[0221] According to another possible embodiment of the matching process, after estimating the orientation of the camera 10, a pair of one image position ip and one landmark lq is selected. Next, the image position ir closest to the selected image position ip is determined. If the positioning system is configured to support visual features of different sizes, a size relation of the image representations corresponding to ip and ir may be calculated. Alternatively, the size relation can be known a priori if ir is chosen as the position of the closest image representation that has a specific size relative to the image representation corresponding to ip, for example roughly the same size. Further, based on the information available in the 3D model 30, the closest neighbour ls of the selected landmark lq is identified. If the size relation of the selected image position ip and its closest neighbour ir is available, this size relation can be taken into account when identifying the closest neighbour ls of the selected landmark lq (size-aware selection of the closest landmark neighbour).
[0222] Based on these selections of image positions ip, ir and landmarks lq, ls, the positioning system is configured to find an orientation/pose of the camera 10 at which the landmarks lq and ls are projected into projected positions pq and ps, respectively, such that pq corresponds to ip and ps corresponds to ir.
[0223] In order to establish that desired orientation/pose of the camera 10, the originally estimated orientation of the camera 10 is combined with a (third) transformation T2 that preferably consists of a translation and a roll rotation (i.e. a rotation around the vector representing the estimated orientation of the camera 10). The two correspondences between ip and pq and between ir and ps uniquely determine the (third) transformation T2. Once the transformation T2 is determined, it is applied to the estimated orientation of the camera 10 to enable the definition of a refined virtual image plane.
[0224] Matching can then directly proceed with the corroboration step. As-yet unmatched landmarks, i.e. landmarks whose projected positions have not yet been matched (all projected positions except pq and ps), are selected from the 3D model 30 and projected to check whether the resulting projected positions correspond to any of the as-yet unmatched image positions. In the context of this embodiment, it can be beneficial to pick the unmatched landmarks closest to lq first, because landmarks that are close to each other in the 3D model 30 can be expected to also be close to each other in the set of projected 2D positions and thus have the highest probability of lying within the camera's field of view that is represented by the captured image from which the image positions i1, . . . , in have been determined.
[0225] Since this embodiment of the matching process takes into account the distribution of neighbouring image/projected positions of the selected image position ip and the projected position pq corresponding to the selected landmark lq, this matching process may also be regarded as a descriptor-based matching process. That is, the third transformation T2, which is decisive for the described matching process, is determined based on the image positions ip and ir, the selected landmark/projected position lq/pq and the neighbouring landmark/projected position ls/ps. The determination of these closest neighbours, which optionally includes the calculation of size relations, is equivalent to the calculation of a descriptor that characterizes the selected image position ip and the selected landmark/projected position lq/pq by their respective closest neighbour.
[0226] Alternatively, instead of picking ls as the closest (size-aware) landmark based on the distances in the 3D model 30, it is possible to project multiple landmarks that are closest to lq in the 3D model 30, thus producing a set of projected points ps1, ps2, . . . . The closest (size-aware) projected position to the projected position pq can then be selected from the set ps1, ps2, . . . . In case of significant differences in the depth of the landmarks, this approach can yield better matching performance.
[0227] As mentioned earlier, in some embodiments of the positioning system and corresponding method, the virtual image plane may be defined based on the position of the visual features 2 stored in the 3D model 30, or as a plane perpendicular to the direction of gravity. This definition of the virtual plane is particularly useful if the visual features lie far away from the object 1 and the camera 10. According to one embodiment, the orientation of the virtual image plane is determined by fitting an auxiliary plane to the positions of the visual features 2 stored in the 3D model 30, e.g. by a least-squares fit to the positions of the visual features 2. The virtual image plane is then defined as a plane parallel to or identical with the auxiliary plane.
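A least-squares fit of the auxiliary plane can be sketched in Python as follows. The sketch assumes the plane can be written as z = a*x + b*y + c, i.e. it is not vertical, and it minimizes the vertical residuals only; a total least-squares fit would be needed for arbitrarily oriented planes. The function names and sample landmarks are hypothetical.

```python
import math


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Fit z = a*x + b*y + c to 3D points by solving the normal equations."""
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(n)]]
    rhs = [sxz, syz, sz]
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("points are collinear in x/y; the plane is not determined")
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        mc = [[rhs[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)


def plane_normal(a, b):
    """Unit normal of z = a*x + b*y + c, pointing towards +z."""
    norm = math.sqrt(a * a + b * b + 1.0)
    return (-a / norm, -b / norm, 1.0 / norm)


# Hypothetical landmark positions lying on z = 0.1*x - 0.2*y + 3.
landmarks = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.1), (0.0, 1.0, 2.8), (1.0, 1.0, 2.9), (2.0, 1.0, 3.0)]
a, b, c = fit_plane(landmarks)
normal = plane_normal(a, b)
```

The resulting normal fixes the orientation of the virtual image plane, which may be taken as parallel to or identical with the fitted plane.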
[0228] This definition of the auxiliary plane may be understood as the determination of a common plane on which the visual features 2 approximately lie. If the visual features 2 are located above the positioning system at a great distance, the fitting of the auxiliary plane may be replaced by defining the virtual image plane as a plane perpendicular to the direction of gravity. This corresponds to the assumption that the visual features 2 are located approximately at the same height.
[0229] If this type of definition is used for the virtual image plane, the orientation of the image plane of the camera 10 on which the image positions i1, . . . , in lie may substantially deviate from the orientation of the virtual image plane. In this case, it is preferred to unskew the image positions i1, . . . , in in order to improve the performance of the matching process. Unskewing essentially designates the determination and application of a transformation S that transforms the image positions i1, . . . , in into unskewed image positions, i.e. the positions that would have been obtained if the orientation of the camera 10 had been orthogonal to the virtual image plane.
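One possible form of the unskewing transformation S is sketched below. The sketch assumes a pinhole camera, image positions given in normalized coordinates (focal length 1, principal point at the origin), and S given as a 3x3 rotation matrix from the actual camera frame to a virtual camera frame whose optical axis is perpendicular to the virtual image plane. These conventions and all function names are assumptions made for illustration.

```python
import math

def rotation_about_x(angle):
    """Rotation matrix for a tilt of `angle` radians about the x axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]

def apply(m, v):
    """Multiply a 3x3 matrix with a 3-vector."""
    return tuple(sum(m[r][k] * v[k] for k in range(3)) for r in range(3))

def unskew(image_positions, S):
    """Rotate each viewing ray by S and re-project it onto the plane z = 1."""
    unskewed = []
    for x, y in image_positions:
        vx, vy, vz = apply(S, (x, y, 1.0))
        if vz <= 0.0:
            raise ValueError("ray does not intersect the virtual image plane")
        unskewed.append((vx / vz, vy / vz))
    return unskewed
```

For a pure rotation of the camera, this re-projection is exact and does not require any depth information.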
[0230]
[0231] According to some embodiments of the positioning system and corresponding method, the distance between the visual features 2 captured in the image of the camera 10 and the camera 10 is taken into consideration at the generation of the projected positions p1, . . . , pn. The image captured by the camera 10 is used to detect and determine the image positions i1, . . . , in. However, no depth information is derivable from the captured image. When defining the virtual image plane, the estimated orientation of the camera 10 is taken into account (or, if unskewing is used, the orientation of the camera 10 is taken into account for the unskewing, which is equivalent to taking the orientation of the camera 10 into account for the definition of the virtual image plane, since unskewing represents the transformation of the assumed orientation of the camera 10 to the orientation of the virtual image plane).
[0232] The distance between the camera 10 and the visual features 2 also substantially affects the relative positions of the image representations of the visual features 2 in the captured image, especially if the visual features 2 are not located at a great distance from the camera 10. In such cases, it may be beneficial to estimate the depth of the visual features 2 whose image representation is contained in the captured image. The definition of the virtual image plane may then take into account the estimated depth, i.e. the estimated distance between the visual features 2 captured in the image of the camera 10 and the camera 10. Exemplary suitable means for estimating the depth or distance between the visual features 2 captured in the image of the camera 10 and the camera 10 are described in the general part of the description.
[0233] Based on the above explanations, a sequence of method steps for determining the position of the object 1 according to an embodiment of the positioning system and corresponding method will be described under reference to
[0234] The method starts with the camera 10 capturing an image containing an image representation of at least some of the visual features 2. In the captured image, image positions i1, . . . , in of the image representations of the visual features 2 are detected. Further, the orientation of the camera 10 is estimated (this step may be performed before, during or after capturing the image).
[0235] Subsequently, a hypothesis is set up. Setting up a hypothesis comprises defining a virtual image plane, and optionally unskewing the image positions i1, . . . , in. Further optionally, setting up a hypothesis may comprise an estimation of the distance between the visual features 2 captured in the image and the camera 10.
[0236] Next, a pair consisting of a landmark lq from the 3D model 30 and an image position ip is selected. Assuming that the selected landmark lq and the selected image position ip correspond to the same visual feature 2, an assumption on the position of the camera 10 is implicitly made: the camera 10 must then be located on a line that extends through the position stored in the landmark lq and the image position ip on the image plane perpendicular to the assumed orientation of the camera 10. If the depth, i.e. the distance between the camera 10 and the visual features 2 captured in the image, has also been estimated, the position of the camera 10 on said line is also implicitly assumed. Thus, the hypothesis is refined by the selection of the pair of the landmark lq and the image position ip. Further refinements of the hypothesis may include refining the image positions or projected positions by calculation and application of the transformation T or T1, or refining the estimation of the camera orientation and/or the definition of the virtual image plane, e.g. by calculating and applying the transformation T2.
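The implicit assumption on the camera position can be made explicit in a short sketch. It assumes a pinhole camera with normalized image coordinates, a camera-to-world rotation matrix obtained from the estimated orientation, and an estimated distance along the viewing ray; the function name `camera_position` and its parameters are hypothetical.

```python
import math

def camera_position(landmark, image_position, R_cam_to_world, distance):
    """Place the camera on the line through landmark lq and image position ip.

    landmark:        (x, y, z) stored in the selected landmark lq.
    image_position:  (x, y) of ip in normalized image coordinates (assumption).
    R_cam_to_world:  3x3 rotation matrix from the estimated camera orientation.
    distance:        estimated distance between camera and visual feature.
    """
    x, y = image_position
    norm = math.sqrt(x * x + y * y + 1.0)
    ray_cam = (x / norm, y / norm, 1.0 / norm)  # unit viewing ray, camera frame
    ray_world = tuple(
        sum(R_cam_to_world[r][k] * ray_cam[k] for k in range(3)) for r in range(3)
    )
    # Step back from the landmark along the viewing ray by the estimated distance.
    return tuple(landmark[k] - distance * ray_world[k] for k in range(3))
```

Without a distance estimate, only the direction `ray_world` is fixed, so the camera position remains a one-parameter family along that line.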
[0237] Subsequently, the 3D positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30 are projected onto the virtual image plane to create projected positions p1, . . . , pn. The positioning system performs matching between the (optionally unskewed) image positions i1, . . . , in and the projected positions p1, . . . , pn. Matching may comprise the calculation of descriptors as described above. If a predetermined number of matches is found, matching is terminated with a positive result.
[0238] Afterwards, corroboration may be performed on the positive matching result. If corroboration also returns a positive result, the visual features 2 captured in the image may be identified in the 3D model 30 based on the match between image positions i1, . . . , in and projected positions p1, . . . , pn, and the position of the object 1 may be determined. If the matching process or the corroboration do not return a positive result, the process returns to selecting a new pair of a landmark lq and an image position ip. If all possible pairs of landmarks lq and image positions ip have been selected and matched unsuccessfully, the method returns to setting up a new hypothesis.
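The control flow of the sequence described above (hypothesis, pair selection, projection, matching, corroboration) may be sketched as nested loops. The callables `project`, `match` and `corroborate` stand in for the respective steps and are hypothetical placeholders; the sketch only illustrates the order of the steps and the fallback behaviour.

```python
def find_position(hypotheses, num_landmarks, num_image_positions,
                  project, match, corroborate, min_matches):
    """Iterate over hypotheses and landmark/image-position pairs.

    project(hypothesis, q, p) -> projected positions for the pair (lq, ip)
    match(projected)          -> list of (image index, landmark index) matches
    corroborate(matches)      -> True if the matching result is confirmed
    Returns (hypothesis, q, p, matches) on success, otherwise None.
    """
    for hypothesis in hypotheses:                  # set up a hypothesis
        for q in range(num_landmarks):             # select landmark lq
            for p in range(num_image_positions):   # select image position ip
                projected = project(hypothesis, q, p)
                matches = match(projected)
                if len(matches) < min_matches:
                    continue                       # matching failed: next pair
                if corroborate(matches):
                    return hypothesis, q, p, matches
        # all pairs exhausted: fall through to the next hypothesis
    return None
```

The identified matches returned on success are the input from which the position of the object 1 would subsequently be determined.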
[0239] According to an alternative embodiment, the sequence of method steps as shown in
[0240] As is evident from the above description, the described features and functionalities of the positioning system and corresponding method according to different embodiments of the invention may be combined in different ways, depending on the circumstances in which the positioning system and corresponding method are employed. To illustrate this flexibility in more detail, embodiments of the present invention that differ mainly in the setting up of the hypotheses (i.e. the estimation of the camera's orientation, the optional estimation of depth of the visual features 2, the definition of the virtual image plane and the optional unskewing of the image positions i1, . . . , in) are described below, highlighting the advantages obtainable by combining the concepts of the present invention in a suitable manner.
[0241] According to a possible embodiment of the invention, the positioning system is configured to capture an image containing image representations of visual features 2, and to detect image positions i1, . . . , in corresponding to the visual features 2 captured in the image. Subsequently, an orientation of the camera 10 is estimated, as well as a depth of at least one of the visual features 2 in the image. A pair of a landmark lq and an image position ip is selected. The virtual image plane is defined as the image plane of a camera 10 with the estimated orientation and located at the estimated distance from the visual feature 2 corresponding to the landmark lq. Subsequently, matching is performed. If matching is unsuccessful, a new pair of landmark lq and image position ip is selected and the process is repeated. If matching is successful, corroboration is optionally performed on the matching result.
[0242] Matching may be performed with the use of descriptors, as described earlier, optionally including the use of any of the first, second or third transformation T, T1, T2, or by a pattern matching algorithm.
[0243] Alternatively, the camera's roll (i.e. the rotational orientation of the camera 10 around its optical axis) may be estimated and matching may be performed by directly overlapping the projected position pq corresponding to the selected landmark lq and the selected image position ip and checking whether a sufficient number of the remaining image positions in lie sufficiently close to a projected position pn. If no match is found, a new estimate for the roll and/or depth is selected and the matching is repeated. This provides a conceptually simple, easily implementable method for position determination.
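The direct-overlap matching may be sketched as follows. The sketch assumes that the estimated roll and depth have already been applied when generating the projected positions, so that only a 2D shift remains; the function name `overlap_match` and the parameters `tolerance` and `min_matches` are assumptions introduced for illustration.

```python
import math

def overlap_match(image_positions, projected_positions, p, q,
                  tolerance, min_matches):
    """Overlap pq with ip and count image positions that land near a projection.

    Returns a dict {image index: projected index} if at least `min_matches`
    correspondences (including the selected pair) are found, otherwise None.
    """
    # Shift that moves the projected position pq exactly onto the image position ip.
    dx = image_positions[p][0] - projected_positions[q][0]
    dy = image_positions[p][1] - projected_positions[q][1]
    matches = {p: q}
    used = {q}
    for i, pos in enumerate(image_positions):
        if i == p:
            continue
        best, best_dist = None, tolerance
        for j, proj in enumerate(projected_positions):
            if j in used:
                continue
            dist = math.dist(pos, (proj[0] + dx, proj[1] + dy))
            if dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            matches[i] = best
            used.add(best)
    return matches if len(matches) >= min_matches else None
```

A result of None corresponds to the case in which a new roll and/or depth estimate, or a new pair, would be tried.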
[0244] According to another possible embodiment of the invention, the positioning system is configured to capture an image containing image representations of visual features 2, and to detect image positions i1, . . . , in corresponding to the visual features 2 captured in the image. Subsequently, an orientation of the camera 10 is estimated. A plane is fitted to the positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30. This plane is defined as the virtual image plane. Subsequently, the positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30 are orthographically projected onto the fitted plane to obtain the projected positions p1, . . . , pn. The image positions i1, . . . , in are unskewed based on an unskewing transformation S that represents the transformation of the estimated orientation of the camera 10 to a vector perpendicular to the virtual image plane. Subsequently, pairwise matching between the unskewed image positions i1, . . . , in and the projected positions p1, . . . , pn is performed, preferably with the use of descriptors. Since the projected positions p1, . . . , pn are obtained by an orthographic projection, the matching and corroboration criteria should not be set too strictly, i.e., a match between unskewed image positions in and projected positions pn should be confirmed if the differences between the descriptor values and/or the distances during corroboration lie below a relatively large threshold.
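The orthographic projection onto the fitted plane may be sketched as follows. The plane is assumed to be given by a point on it (`origin`) and its normal; the in-plane coordinate axes are chosen arbitrarily, so the resulting 2D coordinates are defined only up to a rotation within the plane. All names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def plane_basis(normal):
    """Return two orthonormal vectors u, v spanning the plane with this normal."""
    n = normalize(normal)
    # Use the coordinate axis least aligned with the normal as a helper.
    k = min(range(3), key=lambda i: abs(n[i]))
    helper = tuple(1.0 if i == k else 0.0 for i in range(3))
    u = normalize(cross(helper, n))
    v = cross(n, u)
    return u, v

def project_orthographic(landmarks, origin, normal):
    """Drop the component along the normal; return 2D in-plane coordinates."""
    u, v = plane_basis(normal)
    projected = []
    for lm in landmarks:
        d = tuple(lm[i] - origin[i] for i in range(3))
        projected.append((dot(d, u), dot(d, v)))
    return projected
```

Because the result does not depend on the camera position, the projected positions need to be computed only once for a given 3D model.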
[0245] This embodiment has the advantage that the generation of the projected positions p1, . . . , pn is only performed once. This embodiment is particularly suitable when it may be assumed that the visual features 2 lie approximately on a common plane, and the camera 10 is located at a great distance from the visual features 2.
[0246] The unskewing of the image positions i1, . . . , in may be modified by first transforming the image positions i1, . . . , in such that they form an orthographic projection on the plane whose normal is the camera's optical axis, and then unskewing these modified image positions as described above. This may improve the accuracy and efficiency of the matching process.
[0247] According to another possible embodiment of the invention, the positioning system is configured to capture an image containing image representations of visual features 2, and to detect image positions i1, . . . , in corresponding to the visual features 2 captured in the image. Subsequently, an orientation of the camera 10 is estimated. A plane is fitted to the positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30. Further, the depth of at least one of the visual features 2 captured in the image is estimated. A pair of a landmark lq and an image position ip is selected, and the virtual image plane is defined as a plane parallel to the fitted plane, spaced apart from the fitted plane at a distance corresponding to the estimated depth. The image positions i1, . . . , in are unskewed based on an unskewing transformation S that represents the transformation of the estimated orientation of the camera 10 to a vector perpendicular to the virtual image plane. The positions of the visual features 2 stored in the 3D model 30 are projected on the virtual image plane with a perspective projection, wherein the position of the camera 10 is estimated based on the selected pair of landmark lq and the unskewed image position ip. Subsequently, matching is performed, preferably with the use of descriptors, and preferably including corroboration.
[0248] This embodiment may provide a higher matching accuracy than the previous embodiment, but requires a re-generation of the projected positions for each matching step.
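The perspective projection used in this embodiment may be sketched as follows. The sketch assumes a pinhole model with unit focal length, a virtual camera located at the position estimated from the selected pair, and a virtual image plane described by its normal and two orthonormal in-plane axes `u` and `v`; these conventions and the function name are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_perspective(landmarks, camera_position, u, v, normal):
    """Project 3D landmark positions onto the virtual image plane.

    camera_position: position assumed for the currently selected pair (lq, ip).
    u, v:            orthonormal axes spanning the virtual image plane.
    normal:          unit normal of the virtual image plane (viewing direction).
    Landmarks behind the virtual camera yield None.
    """
    projected = []
    for lm in landmarks:
        d = tuple(lm[i] - camera_position[i] for i in range(3))
        depth = dot(d, normal)
        if depth <= 0.0:
            projected.append(None)
            continue
        projected.append((dot(d, u) / depth, dot(d, v) / depth))
    return projected
```

Since `camera_position` changes with every selected pair, the projected positions have to be regenerated for each matching step, which is the additional cost mentioned above.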
[0249] According to another possible embodiment of the invention, the orientation of the camera 10 is estimated by prompting the user to point the camera 10 in an essentially upward direction. When the camera is oriented in this way, the positioning system is configured to capture an image containing image representations of visual features 2, and to detect image positions i1, . . . , in corresponding to the visual features 2 captured in the image. The virtual image plane is defined as a plane perpendicular to the direction of gravity.
[0250] According to a first alternative, the positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30 are orthographically projected onto the virtual image plane to obtain the projected positions p1, . . . , pn, and matching is performed. According to a second alternative, the depth of the visual features 2 captured in the image is estimated, and the virtual image plane is defined as a plane perpendicular to the direction of gravity, spaced apart from the visual features 2 captured in the image at a distance corresponding to the estimated depth.
[0251] Subsequently, matching is performed, preferably with the use of descriptors and corroboration.
[0252] This embodiment has the advantage that the matching accuracy may be greatly improved by limiting the camera's orientation to an essentially upward or downward direction. Further, no unskewing is required. Surprisingly, it has been found that this method is very robust and yields reliable positioning results even if the visual features 2 are not approximately located on a common plane. The method is particularly preferable when the majority of the visual features 2 are located above the object 1 at a great distance.
[0253] According to a possible modification of some of the embodiments described above, the step of projecting the positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30 may be replaced by capturing a plurality of images that are then combined into a larger data structure containing image representations of the visual features 2 that appear in the camera's field of view when capturing the plurality of images. Preferably, the camera 10 is moved around whilst capturing the plurality of images for generating the data structure, so as to capture a larger number of visual features 2. Projected positions p1, . . . , pn of the visual features 2 represented in the data structure can be generated from the data structure by detecting the positions of the visual features 2 represented in the data structure in a manner analogous to the determination of the image positions i1, . . . , in. Thus, according to this modification, the projection of the positions of the visual features 2 stored in the landmarks l1, . . . , ln of the 3D model 30 is replaced by obtaining positional information of the visual features 2 from images captured by the camera 10.