Method for simultaneous localization and mapping
11199414 · 2021-12-14
Assignee
Inventors
Cpc classification
G06T7/246
PHYSICS
G01C21/3848
PHYSICS
G01C21/188
PHYSICS
International classification
G01C21/16
PHYSICS
G06T7/246
PHYSICS
Abstract
A method for simultaneous localization and mapping is provided, which can reliably handle strong rotation and fast motion. The method provided a simultaneous localization and mapping algorithm framework based on a key frame, which can support rapid local map extension. Under this framework, a new feature tracking method based on multiple homography matrices is provided, and this method is efficient and robust under strong rotation and fast motion. A camera orientation optimization framework based on a sliding window is further provided to increase motion constraint between successive frames with simulated or actual IMU data. Finally, a method for obtaining a real scale of a specific plane and scene is provided in such a manner that a virtual object is placed on a specific plane in real size.
Claims
1. A method for simultaneous localization and mapping, comprising steps of: 1) a foreground thread processing a video stream in real time, and extracting a feature point for any current frame (I.sub.i); 2) tracking a set of global homography matrices (H.sub.i.sup.G); 3) using a global homography matrix and a specific plane homography matrix to track a three-dimensional point so as to obtain a set of 3D-2D correspondences required by a camera attitude estimation; and 4) evaluating quality of the tracking according to a comparison of a number of the tracked feature points and a predefined threshold and classifying the quality into a first grade, a second grade and a third grade, wherein the quality of the tracking indicated by the first grade is better than that indicated by the second grade, and the quality of the tracking indicated by the second grade is better than that indicated by the third grade; 4.1) when the quality of the tracking is classified as the first grade, performing extension and optimization of a local map, and then determining whether to select a new key frame; 4.2) when the quality of the tracking is classified as the second grade, estimating a set of local homography matrices (H.sub.k.fwdarw.i.sup.L), and re-matching a feature that has failed to be tracked; and 4.3) when the quality of the tracking is classified as the third grade, triggering a relocating program, and once the relocating is successful, performing a global homography matrix tracking using a key frame that is relocated, and then performing tracking of features again, wherein in step 4.1), when it is determined to select a new key frame, the selected new key frame wakes up a background thread for global optimization, the new key frame and a new triangulated three-dimensional point is added for extending a global map, and a local bundle adjustment is adopted for optimization; then an existing three-dimensional plane is extended, the added new three-dimensional point is given to the existing plane, or a new three-dimensional plane is extracted from the added new three-dimensional point; subsequently, a loop is detected by matching the new key frame with an existing key frame; and finally, a global bundle adjustment is performed on the global map.
2. The method for simultaneous localization and mapping according to claim 1, wherein the step 2) comprises: for two adjacent frames of images, estimating a global homography transformation from a previous frame (I.sub.i−l) to a current frame (I.sub.i) by directly aligning the previous frame of image with the current frame of image as:
3. The method for simultaneous localization and mapping according to claim 1, wherein the step 3) comprises: for a feature point which is fully constrained and has a reliable three-dimensional position, determining a search position by predicting a camera orientation (C.sub.i) of a current frame predicting the camera orientation (C.sub.i) using the global homography matrix (H.sub.i.sup.G), denoting a two-dimensional feature point of a frame F.sub.k∈K.sub.i−1 as x.sub.k and denoting a corresponding three-dimensional point as X.sub.k, obtaining a three-dimensional to two-dimensional correspondence (X.sub.k,x.sub.i) through a formula x.sub.i=π(H.sub.k.fwdarw.i.sup.Gx.sub.k.sup.h), in which a set of all the three-dimensional to two-dimensional correspondences is denoted as V.sub.C.sub.
4. The method for simultaneous localization and mapping according to claim 1, wherein said performing extension and optimization of the local map in step 4.1) comprises: for one two-dimensional feature (x.sub.k) of the key frame (F.sub.k), calculating a ray angle according to a formula of
5. The method for simultaneous localization and mapping according to claim 4, wherein said fixing the three-dimensional point position and optimizing all camera orientations in the local window comprises: assuming that there is already a linear acceleration â and a rotation speed {circumflex over (ω)} measured in a local coordinate system, and a real linear acceleration is a=â−b.sub.a+n.sub.a, a real rotation speed is ω={circumflex over (ω)}−b.sub.ω+n.sub.ω, n.sub.a˜N(0, σ.sub.n.sub.
q.sub.i+1=q.sub.Δ(ω.sub.i,t.sub.Δ.sub.
{tilde over (θ)}≈2[q.sub.Δ({circumflex over (ω)}.sub.i−b.sub.ω.sub.
{circumflex over (θ)}≈2[q.sub.i+1.Math.q.sub.i.sup.−1.Math.qΔ.sup.−1({circumflex over (ω)}.sub.i−b.sub.ω.sub.
e.sub.q(q.sub.i,q.sub.i+1,b.sub.ω.sub.
6. The method for simultaneous localization and mapping according to claim 4, wherein the global optimization by the background thread comprises: for each three-dimensional point in the new key frame (F.sub.k), calculating a maximum ray angle with respect to the existing key frame using the formula (8) first; marking the point as a fully constrained three-dimensional point if max.sub.i α(i,k)≥δ.sub.α and three-dimensional coordinates are successfully triangulated; then extending the existing three-dimensional plane visible in the key frame (F.sub.k) by the new and fully constrained three-dimensional point X belonging to the plane; defining a neighborhood relationship by performing Delaunay triangulation on the two-dimensional feature in the key frame (F.sub.k); when a three-dimensional point satisfies the following three conditions at the same time, adding this point to a set V.sub.P of the plane P: 1) X is not assigned to any plane, 2) X is connected with another point in V.sub.P, and 3) |n.sup.TX+d|<δ.sub.P, where d.sub.k is the average depth of the key frame (F.sub.k); adopting a RANSAC algorithm to extract a new three-dimensional plane; randomly sampling three connected points in each iteration to initialize a set V.sub.P of three-dimensional interior points; adopting SVD to recover a three-dimensional plane parameter P, and then starting an inner loop which iteratively adjusts extension and optimization of the plane; for each internal iteration, checking a distance from the three-dimensional points X, which are connected to the points in the set V.sub.P, to the plane; if |n.sup.TX+d|<δ.sub.P, adding X to V.sub.P, optimizing the plane by a minimization formula for all points in V.sub.P:
7. The method for simultaneous localization and mapping according to claim 1, further comprising an initialization step of placing a virtual object on a specific plane of a real scene in a real size: initializing the tracking using a planar marker of a known size, providing a world coordinate system at a center of the planar marker, adopting n three-dimensional points X.sub.i, i=1, . . . , n uniformly along an edge of the planar marker, and recovering an orientation (R,t) of a camera in an initial frame by solving the following optimization equation:
Description
BRIEF DESCRIPTION OF DRAWINGS
(1)
(2)
(3)
(4)
(5)
DESCRIPTION OF EMBODIMENTS
(6) The present disclosure discloses a method for simultaneous localization and mapping, which can reliably handle strong rotation and fast motion, thereby ensuring a good experience of augmented reality. In this method, a new feature tracking method based on multiple homegraphy matrix is proposed, which is efficient and robust under strong rotation and fast motion. Based on this, a real-time local map extension strategy is further proposed, which can triangulate three-dimensional points of a scene in real time. Moreover, a camera orientation optimization framework based on a sliding window is also proposed to increase priori motion constraints between successive frames using simulated or actual IMU data.
(7) Before describing the present method in detail, the convention of the mathematical formula in the method is introduced (same as R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2004). A scalar is represented by an italic character (such as x), and a matrix is represented by a bold uppercase letter (such as H); a column vector is represented by a bold letter (such as b), and a transpose b.sup.⊥ thereof represents a corresponding row vector. In addition, a set of points, vectors, and matrices is generally represented by an italic uppercase letter (such as V). For the vector x, the superscript h represents the corresponding homogeneous coordinate, such as X.sup.h. In the method, each three-dimensional point i contains a three-dimensional position X.sub.i and a local image block X.sub.i. Each frame i holds its image I.sub.i, camera orientation C.sub.i, a set of two-dimensional feature point positions {x.sub.ij}, and a set of corresponding three-dimensional point indices V.sub.i. A set of selected key frames is represented by {F.sub.k|k=1 . . . N.sub.F}. The camera orientation is parameterized as a quaternion q.sub.i and a camera position p.sub.i. For a three-dimensional point X.sub.j, its two-dimensional projection in the i.sub.th image is calculated by the following formula: x.sub.ij=π(K(R.sub.i(X.sub.j−p.sub.i))) (1), where K is the camera internal parameter which is known and constant by default, and R.sub.i is a rotation matrix of q.sub.i, while π(⋅)is a projection function π([x,y,z].sup.T)=[x/z,y/z].sup.T. Each plane i holds its parameter P.sub.i and an index of a three-dimensional point belonging to it, which is marked as V.sub.p.sub.
(8)
For each key frame F.sub.k, a set of specific plane homography matrices is solved and expressed as H.sub.k.fwdarw.i.sup.P={H.sub.k.fwdarw.i.sup.P.sup.
(9) For a set of continuously dynamically changing video sequences (
(10) As shown in
(11) 1. The foreground thread processes a video stream in real time. For any current frame I.sub.ia FAST corner point is extracted first (E. Rosten and T. Drummond. Machine learning for high speed corner detection. In 9th European Conference on Computer Vision, Part I, pages 430-443. Springer, 2006), and a set of global homography matrices H.sub.i.sup.G of I.sub.i is tracked and obtained.
(12) The global homography matrix can roughly align the entire image, and the global homography matrix is estimated by directly aligning the key frame F.sub.k and the current frame I.sub.i:
(13)
{tilde over (F)}.sub.k and Ĩ.sub.i are respectively small blurred images (SBI) of F.sub.k and I.sub.i (G. Klein and D. W. Murray. Improving the agility of keyframe-based SLAM. In 10th European Conference on Computer Vision, Part II, pages 802-815. Springer, 2008). Ω is a SBI domain, and a superscript h indicates that the vector is a homogeneous coordinate. A tilde in {tilde over (H)} indicates that the homography matrix H is converted from a source image space to a SBI space. δ.sub.I, ∥⋅∥.sub.δ represents a Huber function, and a calculation method is as follows:
(14)
(15) Assuming that a baseline between two adjacent frames is small, a global homography matrix can roughly represent a motion between two frames of images. By solving the formula (3) for the previous frame I.sub.i−1 and the current frame I.sub.i, H.sub.(i−1).fwdarw.i.sup.G can be obtained. H.sub.k.fwdarw.i.sup.G=H.sub.(i−1).fwdarw.i.sup.GH.sub.k.fwdarw.(i−1).sup.G ) can be obtained by transitivity. The set of feature matchings (denoted as M.sub.k,i−1={(x.sub.k, x.sub.i−1)}) between F.sub.k and I.sub.i−1 is used, H.sub..fwdarw.(i−1).sup.G. First, a set of key frames which has an overlapping area with I.sub.i−1 is determined using M.sub.k,i−1 and denoted as K.sub.i−1. Five key frames are selected under conditions that |M.sub.k,i−1|>20 and |M.sub.k,i−1| is maximized. For each key frame F.sub.k in K.sub.i−1, M.sub.k,i−1 is used to prevent cumulative errors and offsets, and the formula (3) is modified to:
(16)
δ.sub.x. In the experiment, δ.sub.I=0.1, δ.sub.x=10. With respect to formula (3), formula (5) can interpolate multiple plane motions in case of a larger baseline. The Gauss-Newton method is used to solve formulas (3) and (5).
(17) 2. The global homography matrix and the specific plane homography matrix are used to track the three-dimensional point, and obtain a set of 3D-2D correspondences required by the camera orientation estimation.
(18) For a point which is fully constrained and has a reliable three-dimensional position, it is only necessary to predict the camera orientation of the current frame C.sub.i to project a three-dimensional point and determine the search position. C.sub.i is predicted using a global homography matrix H.sub.i.sup.G. The global homography matrix can be derived from the previous frame I.sub.i−1, and the set of key frames overlapping with it is In order not to lose generality, any two-dimensional feature of the frame F.sub.k∈K.sub.i−1 is denoted as x.sub.k and the corresponding three-dimensional point is denoted as X.sub.k. Through the formula x.sub.i=π(H.sub.k.fwdarw.i.sup.Gx.sub.k.sup.h), a three-dimensional to two-dimensional correspondence (X.sub.k,xi) can be obtained. A set of all three-dimensional to two-dimensional correspondences is denoted as V.sub.c.sub.
(19)
R.sub.i. If there is real IMU data, the camera position can also be predicted directly by propagation of the IMU state.
(20) For a point belonging to the three-dimensional plane P.sub.j, the plane homography matrix H.sub.k.fwdarw.i.sup.P.sup.
(21) 3. The method in the paper (G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In 6th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 225-234, 2007) is used to evaluate quality of the tracking which can be divided into good, medium (not good) and poor (the tracking result may be not good or even poor for fast motion with large translation).
(22) 3.1. When the quality of the tracking is good, the tracking result is used directly to extend and optimize the local map, and then corresponding algorithms in papers (G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In 6th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 225-234, 2007) and (G. Klein and D. W. Murray. Improving the agility of key frame-based SLAM. In 10th European Conference on Computer Vision, Part II, pages 802-815. Springer, 2008) are used to determine whether a new key frame is to be selected or not.
(23) For a two-dimensional feature x.sub.k of the frame F.sub.k, once a corresponding feature x.sub.i is found in frame I.sub.i, the three-dimensional point X will be triangulated. Firstly, the ray angle is calculated according to the formula
(24)
where r.sub.j=R.sub.j.sup.TK.sup.−1x.sub.j.sup.h. In the ideal case where there is sufficient parallax and α(i, k)≥δ.sub.α, the conventional triangulation algorithm in (R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2004) is used to obtain X. In the experiment, it is set that δ.sub.α=1°. However, in most cases, the parallax in the first observation is often insufficient. But on the other hand, the depth of any x.sub.k will result in similar re-projection results in frame I.sub.i, so that in this step an arbitrary depth can be specified for x.sub.k. In the method, the average depth d.sub.k of the key frame F.sub.k is used to initialize X, that is X=d.sub.kK.sup.−1x.sub.k.sup.h, and X is optimized according to a formula of
(25)
(26) Once there is enough motion parallax, the depth error may affect camera tracking. It is best to perform bundle adjustment to the local map at the time when each new frame arrives, but this cannot achieve real-time performance. It is therefore possible to modify the optimization process so as to optimize only the three-dimensional point or the camera orientation at a time, and these two steps are carried out in turn. This strategy is very effective in practical applications, and can significantly reduce the computing load to achieve real-time, even in mobile devices.
(27) Optimizing the three-dimensional point under the premise of a fixed camera orientation is relatively simple. Each three-dimensional point X.sub.i can be individually optimized by the minimization formula
(28)
where V.sub.x.sub.
(29) The orientation of each camera i is optimized separately using equation
(30)
where V.sub.i is a set of point indices of three-dimensional points visible in the frame i. However, any feature tracking method may fail due to the severe blur caused by fast camera motion. These blurred frames lack the constraints of reliable camera orientation estimation. In view of that the fast camera motion typically only occurs occasionally and does not last long, adjacent frames can be used to constrain the blurred frames. Specifically, we draw on the idea of VI-SLAM, such as (A. I. Mourikis and S. I. Roumeliotis. A multi-state constraint kalman filter for vision-aided inertial navigation. In IEEE International Conference on Robotics and Automation, pages 3565-3572. IEEE, 2007), (J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis. Consistency analysis and improvement of vision-aided inertial navigation. IEEE Transactions on Robotics, 30 (1): 158-176, 2014) and (S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. The International Journal of Robotics Research, 34 (3):314-334, 2015). Assume that there is already IMU measurement data in the local coordinate system (including a linear acceleration â and a rotation speed {circumflex over (ω)}) and a camera attitude is the same as that of the IMU, the state of the camera motion can be extended to s=[q.sup.T p.sup.T v.sup.T b.sub.a.sup.T b.sub.ω.sup.T].sup.T, where v is the linear velocity in the global coordinate system, and b.sub.a and b.sub.ω are respectively offsets of the linear acceleration and the rotation speed with time. The real linear acceleration is a=â−b.sub.a+n.sub.a(13) and the rotation speed is ω={circumflex over (ω)}−b.sub.ω+n.sub.ω(14), where n.sub.a˜N(0, σ.sub.n.sub.
(31)
where
(32)
and [⋅].sub.x represents a skew symmetric matrix. b.sub.a and {dot over (b)}.sub.ω are respectively random walk processes represented by the Gaussian white noise model w.sub.a˜N(0, σ.sub.w.sub.
(33)
where t.sub.Δ.sub.
the formula
(34)
where {tilde over (θ)}is a 3×1 error vector. By substituting formula (14) into formula (19), it can be obtained that:
{tilde over (θ)}≈2[q.sub.Δ({circumflex over (ω)}.sub.i−b.sub.ω.sub.
where G.sub.ω is a Jacobian matrix associated with noise By substituting the formula (19) into the formula (17), it can be obtained that:
{tilde over (θ)}≈2[q.sub.i+1.Math.q.sub.i.sup.−1.Math.q.sub.Δ.sup.−1({circumflex over (ω)}.sub.i−b.sub.ω.sub.
Combining (20) and (21), the cost function and covariance of the rotation component are respectively defined as:
e.sub.q(q.sub.i,q.sub.i+1,b.sub.ω.sub.
(35) The derivation for other components is relatively simple. The transitivity formulas of the discrete-time states are as follows:
p.sub.i+1=p.sub.i+v.sub.it.sub.66 .sub.
v.sub.i+1=v.sub.i+R.sub.i.sup.Tv.sub.Δ(a.sub.i,t.sub.Δ.sub.
b.sub.a.sub.
b.sub.ω.sub.
where
(36)
The error function and the corresponding covariance are:
(37)
(38) Based on these correlation constraints between adjacent frames defined by (22) and (25), the energy function of all motion states s.sub.1 . . . s.sub.1 in the sliding window can be defined as:
(39)
where l is a size of the sliding window and ∥e∥.sub.Σ.sup.2=e.sup.TΣ.sup.−1e is a square of a Mahalanobis distance.
(40) The above derivatives assume that inertial measurement data â.sub.i and {circumflex over (ω)}.sub.i are known, but actually there may be no IMU sensor. For the linear acceleration, it can be set to be â.sub.i=0, because there is almost no sudden jump in the AR application. However, for the rotation speed, since the user often needs to turn his or her head to view the entire AR effect, it cannot be set to be {circumflex over (ω)}.sub.i=0. Therefore, feature matching is used to align the continuous SBI to obtain a best fitting:
(41)
where R.sub.Δ(ω,t.sub.Δ) is a rotation matrix of formula (18), and M.sub.i,i+1 is a set of feature matchings between images i and i+1. δ.sub.X controls a weight of a feature matching item and is set to be 10 in the experiment. The Gauss Newton algorithm is used to solve the formula (27).
(42) The above optimization framework can be easily extended to the case of combining real IMU data, and the main difference lies in that there may be multiple IMU measurement data between two adjacent frames. Moreover, there is always some differences between the attitude of the camera and the attitude of the real IMU sensor, which cannot be simply considered to be consistent (the relative attitude of the camera and the real IMU sensor can be obtained by prior calibration). The set of IMU measurement data between frames i and i+1 is denoted as {(aâ.sub.ij, {circumflex over (ω)}.sub.ij, t.sub.ij)|j=1 . . . n.sub.i}, where â.sub.ij, {circumflex over (ω)}.sub.ij and t.sub.ij are the j.sub.th linear acceleration, the rotation speed and the time stamp, respectively. The pre-integration technology in (T. Lupton and S. Sukkarieh. Visual-inertial-aided navigation for high-dynamic motion in built environments without initial conditions. IEEE Transactions on Robotics, 28(1):61-76, 2012) and (C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza. IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. In Robotics: Science and Systems, 2015) can be used to pre-integrate these IMU measurement data, and the components in (18) and (24) are replaced with the following formulas:
(43)
where {circumflex over (b)}.sub.a.sub.
(44) 3.2. When the result is medium (not good), there are still some matched features, most of which are still fully constrained points, and thus, a set of local homography matrices H.sub.k.fwdarw.i.sup.L can be estimated and a feature that has failed to be tracked is re-matched.
(45) If the quality of the tracking of the global homography matrix and the specific three-dimensional plane homography matrix is not good, the obtained matching can be further used to estimate the local homography matrix. The present method uses a simple extraction strategy of multiple homography matrices to estimate a set of local homography matrices. Specifically, RANSAC method is used for M.sub.k,i, and a best local homography matrix is estimated based on the number of the interior points satisfying the homography transformation and denoted as H.sub.k.fwdarw.i.sup.L.sup.
(46) For a feature point which is not fully constrained and has no reliable three-dimensional position, π(H.sub.k.fwdarw.i.sup.Gx.sub.k.sup.h) is first directly used as the search position. If the tracking result is not good, a set of local homography matrices H.sub.k.fwdarw.i.sup.L is estimated according to the matching correspondences. For the unmatched features, trying is performed one by one in the search position {H.sub.k.fwdarw.i.sup.L.sup.
(47) The representation method of the multiple homography matrices in the present method can not only predict a good starting point but also correct perspective distortion. A homography matrix H.sub.k.fwdarw.i is given, and the local block near x.sub.k in the key frame F.sub.k is aligned to the current frame I.sub.i through the formula κ(y)=I.sub.k(H.sub.k.fwdarw.i.sup.−1(π(H.sub.k.fwdarw.ix.sub.k.sup.h+y.sup.h))) (7), y.sup.h, where κ(y) is an image gray scale for an offset of y relative to a center of the block. y∈(−W/2, W/2)×(−W/2, W/2), where W is a dimension of the image block and is set to be 8 in the experiment. For points belonging to the three-dimensional plane P.sub.j, the plane homography matrix H.sub.k.fwdarw.i.sup.P.sup.
(48) Given a precise search range and an undistorted image block, the correspondence in the current frame can be easily tracked and obtained. Similar to PTAM, a value of a zero mean SSD is calculated for a FAST corner point extracted within a search area, and one FAST corner which has the minimum sero-mean SSD is selected when the offset is smaller than a certain threshold.
(49) In the present method, since a sufficient feature matching can generally be obtained by a global homography matrix and a specific plane homography matrix, it is rarely necessary to solve a local homography matrix. Only when there is a large parallax between the frames F.sub.k and I.sub.i, and there are multiple different planes in the overlapping area, it is necessary to additionally try each local homography matrix for those abnormal points. As shown in
(50) 3.3. When the result is poor, the relocating program introduced in the paper (G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In 6th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 225-234, 2007) is triggered. Once the relocating is successful, the relocated key frame is used for global homography matrix tracking, and then the feature tracking is performed again.
(51) 4. Waking up the background thread if a new key frame is selected.
(52) 4.1. A new key frame and a newly triangulated three-dimensional point are added for extending the global map, and a local bundle adjustment is adopted for optimization. After that, the existing three-dimensional plane is extended and the newly added three-dimensional point is given to the existing plane, or a new three-dimensional plane is extracted from the newly added three-dimensional point.
(53) Similar to the paper (G. Klein and D. W. Murray. Parallel tracking and mapping for small AR workspaces. In 6th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 225-234, 2007), the present method uses local and global bundle adjustment methods in the background to optimize the map and camera orientation of the key frame. For each three-dimensional point in the new key frame F.sub.k, the maximum ray angle with respect to the existing key frame is first calculated using formula (8). If max.sub.i α(i,k)≥δ.sub.α and the three-dimensional coordinates are successfully triangulated, the point is marked as fully constrained. The existing three-dimensional plane visible in frame F.sub.k is then extended with a new and fully constrained three-dimensional point X belonging to the plane. In order to determine whether a three-dimensional point X belongs to the plane P, it is only needed to check the distance |n.sup.TX+d| from the point to the plane. However, satisfying the plane equation is only a necessary condition rather than a sufficient condition, due to the fact that the plane equation does not contain plane boundary information. Suppose that the points belonging to the same plane are close to each other in space, only a point having its adjacent point on the plane is checked when checking the point-to-plane distance. The neighborhood relationship is defined by Delaunay triangulation of the two-dimensional feature in the key frame F.sub.k. When a three-dimensional point satisfies the following three conditions, the point is added to the set V.sub.P of the plane P: 1) X is not assigned to any plane, 2) X is connected with another point in V.sub.P, and 3) |n.sup.TX+d|<δ.sub.P. In the experiment, setting δ.sub.P=0.01d.sub.k, where d.sub.k is an average depth of the key frame F.sub.k.
(54) RANSAC algorithm (M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6): 381-395, 1981) is adopted to extract a new three-dimensional plane. In each iteration of the RANSAC, three connected points are randomly sampled to initialize a set of three-dimensional interior points V.sub.P. It is assumed that the three-dimensional plane P is generated by V.sub.P. Then, an inner loop which iteratively adjusts plane extension and optimization begins. In each internal iteration, the distance from those three-dimensional points X, which are connected to points in the set V.sub.P, to the plane is checked. If |n.sup.TX+d|<δ.sub.P, X is added to V.sub.P. After that, the plane is optimized by the minimization formula for all points in V.sub.P:
(55)
and an attempt to further extend V.sub.P is made. When |V.sub.P|<30, P and V.sub.P are discarded. In order to avoid marking one point to multiple planes, the extracted planes are sorted in a descending order according to the number of the relevant three-dimensional points after each RANSAC operation. The set of the sorted interior points is denoted as {V.sub.P.sub.
(56) By adding a point-to-plane constraint, parameters of the three-dimensional plane can be optimized in a global bundle adjustment. The present method uses a LM (Levenberg Marquart) algorithm to minimize the following energy function so as to jointly optimize all camera orientations, three-dimensional points and three-dimensional planes of the key frames:
(57)
(58) It should be noted that other functions of the present method are also solved by the LM algorithm except (3), (5) and (27).
(59) 4.2. Detecting a loop by matching features of a new key frame with those of an existing key frame.
(60) For the loop detection, the same algorithm as the relocation is used to find feature matching of the new key frame with the previous frame, and then a closed circulation loop is obtained through the global bundle adjustment.
(61) 4.3. Performing a bundle adjustment on the global map.
(62) In the experiment, effects of the method proposed by the present disclosure in four different settings are first compared (
(63) In addition, we have applied this method in an application on a mobile terminal and tested it. When the user is using the application, a scene is shot with a mobile phone, and then a plurality of three-dimensional virtual furniture models are inserted into the room for viewing the effect of the furniture placement. We could not know the actual size of the three-dimensional structure recovered by the Monocular Simultaneous Localization and Mapping (SLAM) method. In order to ensure that the three-dimensional furniture model has the right size when being inserted, a method is needed for accurate estimation. However, it may be difficult to estimate the exact size in time based on the noisy IMU data provided by the mobile phone. It is more reliable to use a calibration of a known size. In the application, the user needs to use a sheet of A4 paper, aiming a center of a camera at the paper at the beginning of the application. Our augmented reality system automatically uses camera orientation estimates to automatically detect four sides of the paper. After initialization, the user can download different three-dimensional models in a list and put them into the scene.