Fast and robust multimodal remote sensing image matching method and system

11244197 · 2022-02-08

Abstract

A multimodal remote sensing image matching method and system integrate different local feature descriptors for automatic matching of multimodal remote sensing images. First, a local feature descriptor, such as the Histogram of Oriented Gradient (HOG), Local Self-Similarity (LSS), or Speeded-Up Robust Features (SURF), is extracted for each pixel of an image to form a pixel-wise feature representation map. Then, the three-dimensional fast Fourier transform (3D FFT) is used to establish a fast matching similarity metric in the frequency domain based on the feature representation map, followed by a template matching scheme to obtain control points (CPs) between the images. In addition, a novel pixel-wise feature representation technique named channel features of orientated gradients (CFOG) is introduced, which outperforms the pixel-wise feature representations built on the traditional local descriptors (e.g., HOG, LSS and SURF) in both matching performance and computational efficiency.

Claims

1. A fast and robust multimodal remote sensing image matching method, comprising the following steps: A. comparing a resolution of a reference image and a resolution of an input image; when the two resolutions are the same, proceeding to step B; otherwise, resampling the reference image and the input image until they have a same resolution; B. detecting a series of uniformly distributed interest points in the reference image based on a partitioning strategy; denoting the points as P.sub.1i (i=1, 2, 3, . . . , N), and selecting a template area AreaW.sub.1i centered on point P.sub.1i; C. predicting a matching area AreaW.sub.2i in the input image that corresponds to P.sub.1i according to georeference information of the remote sensing image; D. building a pixel-wise feature representation map (D2) for the matching area AreaW.sub.2i and building a pixel-wise feature representation map (D1) for the template area AreaW.sub.1i; E. establishing a fast similarity metric for control point (CP) detection using 3-Dimensional (3D) Fast Fourier Transform (FFT) based on the pixel-wise feature representation maps D1 and D2, which comprises converting the pixel-wise feature representation maps D1 and D2 into a frequency domain by using the 3D FFT, performing a correlation operation to obtain a similarity map, and defining a position of the maximum of the similarity map as an image matching position; F. obtaining a sub-pixel location for the CP by fitting a local extremum of the similarity map; G. repeating steps C to F and traversing all the points of P.sub.1i (i=1,2,3, . . . ,N) to obtain a CP pair {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) at sub-pixel accuracy for each point in P.sub.1i (i=1,2,3, . . . ,N); and H. rejecting the CPs with large errors from the {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) to obtain the final CPs {PID.sub.1i(x,y), PID.sub.2i(x,y)} (i=1, 2, 3, . . . , S).

2. The method of claim 1, wherein the Step D further comprises the following steps: calculating a local feature descriptor of every pixel according to image data of the matching area; and then arranging a feature vector for every pixel in Z direction to form a 3D pixel-wise feature representation map.

3. The method of claim 2, wherein the local feature descriptor is Histogram of Oriented Gradient (HOG), Local Self-Similarity (LSS), or Speeded Up Robust Features (SURF).

4. The method of claim 1, wherein the Step D further comprises building a channel features of orientated gradients (CFOG) map in the matching area, which comprises the following steps: computing multiple orientated gradients for each pixel to form a 3D orientated gradient map for the image data in the matching area; performing a convolution operation on the 3D orientated gradient map by using a Gaussian filter to generate a feature map g.sub.o.sup.σ, and performing a convolution operation on the feature map g.sub.o.sup.σ in the Z-direction by using a one-dimensional filter [1,2,1] to obtain a feature map g.sub.o.sup.c; and normalizing the feature map g.sub.o.sup.c to obtain the CFOG map.

5. The method of claim 4, wherein the step of building the CFOG further comprises the following steps: for all the pixels in the area, calculating a horizontal gradient g.sub.x (in X-direction) and a vertical gradient g.sub.y (in Y-direction), respectively, by using a one-dimensional filter [−1,0,1] and a one-dimensional filter [−1,0,1].sup.T; using the g.sub.x and g.sub.y to calculate gradient values g.sub.θ of different directions by Formula (1):
$g_\theta=\lfloor \operatorname{abs}(\cos\theta\cdot g_x+\sin\theta\cdot g_y)\rfloor$  (1)
wherein θ is a quantized gradient orientation, abs represents an absolute value, and └ ┘ denotes that the enclosed quantity equals itself when its value is positive or equals zero when its value is not positive; collecting the g.sub.θ of all directions together to form a 3D orientated gradient map g.sub.o; then performing the convolution operation on the g.sub.o by a 2-Dimensional (2D) Gaussian filter with a standard deviation of σ in X-direction and Y-direction to obtain a feature map g.sub.o.sup.σ; and finally performing the convolution operation on the g.sub.o.sup.σ by a one-dimensional filter [1,2,1] in Z-direction to form a feature map g.sub.o.sup.c, wherein each pixel of the feature map g.sub.o.sup.c corresponds to a feature vector f.sub.i in Z-direction and is traversed to normalize the feature vector f.sub.i by Formula (2) to obtain the final CFOG map:
$f_i=\dfrac{f_i}{\lVert f_i\rVert_2+\varepsilon}$  (2)
wherein ε is a constant to avoid division by zero.

6. The method of claim 1, wherein the step E comprises the following steps: obtaining the pixel-wise feature representation maps D.sub.1 and D.sub.2 for the area AreaW.sub.1i and the area AreaW.sub.2i by the Step D, respectively; sliding the D.sub.1 over the D.sub.2 as a template, and matching the D.sub.1 and the D.sub.2 by taking a sum-of-squared differences (SSD) of feature vectors as a similarity metric, wherein the SSD is defined by Formula (3):
$S_i(v)=\sum_{c}[D_1(c)-D_2(c-v)]^2$  (3)
wherein c denotes a coordinate of a pixel in the pixel-wise feature representation map D.sub.1, v is the offset between the D.sub.1 and the D.sub.2, and S.sub.i is the SSD of feature vectors between the D.sub.1 and the D.sub.2; the offset v.sub.i (i.e., the matching position) between the D.sub.1 and the D.sub.2 is obtained by minimizing the S.sub.i by Formula (4):
$v_i=\arg\min_{v}\left\{\sum_{c}[D_1(c)-D_2(c-v)]^2\right\}$  (4)
expanding Formula (4) to obtain:
$v_i=\arg\min_{v}\left\{\sum_{c}D_1^2(c)+\sum_{c}D_2^2(c-v)-2\sum_{c}D_1(c)\cdot D_2(c-v)\right\}$  (5)
wherein, in Formula (5), the first and second terms are nearly constant, and Formula (5) is minimized when the third term is maximum, so that the similarity metric is redefined as:
$v_i=\arg\max_{v}\left\{\sum_{c}D_1(c)\cdot D_2(c-v)\right\}$  (6)
wherein $\sum_{c}D_1(c)\cdot D_2(c-v)$ is a convolution operation; the FFT in the frequency domain is used to accelerate the computation because the convolution operation in the spatial domain becomes a dot product in the frequency domain; thus, the similarity metric based on the FFT is defined as:
$v_i=\arg\max_{v}\left\{F^{-1}\left[F(D_1(c))\cdot F^{*}(D_2(c-v))\right]\right\}$  (7)
wherein F and F.sup.−1 are the forward FFT and inverse FFT, respectively, and F* is the complex conjugate of F; since the D.sub.1 and the D.sub.2 are 3D feature maps, Formula (7) is computed by the 3D FFT according to the principle of convolution; and accordingly, obtaining the final similarity metric as:
$v_i=\arg\max_{v}\left\{3DF^{-1}\left[3DF(D_1(c))\cdot 3DF^{*}(D_2(c-v))\right]\right\}$  (8)
wherein 3DF and 3DF.sup.−1 denote the 3D forward FFT and inverse FFT, respectively, and 3DF* is the complex conjugate of 3DF.

7. A fast and robust multimodal remote sensing image matching system, comprising: a preprocessing unit for comparing resolution information of a reference image and an input image, wherein a next unit follows if the resolutions of the images are the same, or the images are resampled to a same resolution if the resolutions of the images are different; a template area selection unit for detecting a series of uniformly distributed feature points in the reference image, denoting the points as P.sub.1i (i=1,2,3, . . . ,N), and selecting a template area AreaW.sub.1i centered on the point P.sub.1i; a matching area selection unit for predicting a matching area AreaW.sub.2i in the input image corresponding to a point set P.sub.1i (i=1,2,3, . . . ,N) by using the georeference information of remote sensing images; a feature extraction unit for building a pixel-wise feature representation map in the matching area; a preliminary matching unit for establishing a fast similarity metric for control point (CP) detection by using the 3-Dimensional (3D) Fast Fourier Transform (FFT) based on the pixel-wise feature representation map, converting the pixel-wise feature representation map into a frequency domain by using the 3D FFT, performing a correlation operation to obtain a similarity map, and taking the position of the maximum of the similarity map as the image matching position, and obtaining a sub-pixel location for the CP by fitting a local extremum of the similarity map; and repeating the operations involving the units and traversing all points of P.sub.1i (i=1,2,3, . . . ,N) to obtain their corresponding CP pairs {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) at sub-pixel accuracy; and a fine-matching unit for rejecting the CP pairs with large errors from the {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) to obtain the final CP pairs {PID.sub.1i(x,y), PID.sub.2i(x,y)} (i=1, 2, 3, . . . , S).

8. The multimodal remote sensing image matching system of claim 7, wherein the feature extraction unit is used to calculate the local feature descriptor of each pixel for the image data of the matching area, and arrange all the feature vectors corresponding to all pixels in Z-direction to form the 3D pixel-wise feature representation map.

9. The multimodal remote sensing image matching system of claim 7, wherein the preliminary matching unit converts the pixel-wise feature representation map into the frequency domain by using the 3D FFT, obtains the similarity map by correlation operation, and takes the position of the maximum of the similarity map as the image matching position.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is an overall flow chart of the invention.

(2) FIG. 2 is a diagram of the pixel-wise feature representation.

(3) FIG. 3 illustrates the construction process of the CFOG of the invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

(4) To enable those skilled in the art to understand the technical solution of the invention, the solution is described clearly and completely below in combination with the drawings. The described embodiments, and all similar embodiments obtained by those of ordinary skill in the art without creative effort, fall within the protection scope of the invention.

(5) FIG. 1 shows a fast and robust multimodal remote sensing image matching method comprising the following steps:

(6) Step A: determining whether the resolutions of the two images are consistent; if they are consistent, the method proceeds to the next step; otherwise, the images are resampled to the same resolution.

Step B: extracting a large number of uniformly distributed feature points in the reference image by using the Harris or Forstner operator with a partitioning strategy (a brief sketch of this strategy follows this paragraph), particularly comprising: the reference image is divided into n×n non-overlapping square grids; in each grid, the Harris or Forstner value is calculated for each pixel, and the k pixels with the largest values are selected as feature points. As a result, k feature points are detected in each grid and n×n×k feature points are extracted from the reference image, where the values of n and k are set according to actual conditions. These feature points are denoted as P.sub.1i (i=1, 2, 3, . . . , N). Note that image features can be extracted by other operators in other embodiments, and the invention imposes no limitation in this respect.

Step C: based on the georeference information of the remote sensing images, the search (or matching) area of the point set P.sub.1i (i=1, 2, 3, . . . , N) is determined in the input image, and step C comprises the following steps. (1) A point P.sub.1i(x,y) is selected from the point set P.sub.1i (i=1, 2, 3, . . . , N), where x and y denote the coordinates of P.sub.1i(x,y). Then a template area (named AreaW.sub.1i) centred on P.sub.1i(x,y) is selected, and the area has a size of M×M pixels. The geographic coordinate of P.sub.1i(x,y) is denoted as Geo.sub.i. (2) According to the geographic coordinate Geo.sub.i and the georeference information of the input image, a point P.sub.2i(x,y) corresponding to P.sub.1i(x,y) of the reference image is determined in the input image. Then a template area centred on P.sub.2i(x,y) having a size of M×M pixels is determined. This template area is regarded as the matching area AreaW.sub.2i.

Step D: the pixel-wise feature representation is formed for AreaW.sub.1i and AreaW.sub.2i (see FIG. 2). In an embodiment, step D comprises the following steps: a local feature descriptor (e.g., HOG, LSS or SURF) is calculated for every pixel in these areas, and the feature vector of each pixel is arranged in the Z direction to form a 3D pixel-wise feature representation map; the embodiment imposes no limitation on the local feature descriptor. In another embodiment, step D comprises building the channel features of orientated gradients (CFOG) for both AreaW.sub.1i and AreaW.sub.2i (see FIG. 3; a sketch of the construction also follows this paragraph), particularly comprising the following steps. (1) For all the pixels in the area, a one-dimensional filter [−1, 0, 1] and a one-dimensional filter [−1, 0, 1].sup.T are used to calculate a horizontal gradient g.sub.x (in the X-direction) and a vertical gradient g.sub.y (in the Y-direction), respectively. (2) The g.sub.x and g.sub.y are used to calculate the gradient values g.sub.θ of different directions by Formula (1):
$g_\theta=\lfloor \operatorname{abs}(\cos\theta\cdot g_x+\sin\theta\cdot g_y)\rfloor$  (1)
where θ is a quantized gradient orientation. The 360° orientation range is uniformly divided into 18 parts, so that each part spans 20° and θ takes the values {0°, 20°, . . . , 340°}. abs represents the absolute value and converts the gradient information of [180°, 360°) to [0°, 180°) to mitigate the effects of gradient reversal between multimodal images. └ ┘ denotes that the enclosed quantity equals itself when its value is positive and zero otherwise. (3) The g.sub.θ of all directions are collected together to form a 3D orientated gradient map g.sub.o. Then, g.sub.o is convolved with a 2D Gaussian filter of standard deviation σ in the X-direction and Y-direction to obtain a feature map g.sub.o.sup.σ. Finally, g.sub.o.sup.σ is convolved with a one-dimensional filter [1, 2, 1] in the Z-direction to form a feature map g.sub.o.sup.c. (4) Each pixel of the feature map g.sub.o.sup.c corresponds to a feature vector f.sub.i in the Z-direction; each feature vector f.sub.i is traversed and normalized by Formula (2) to remove the influence of illumination changes and obtain the final CFOG map.
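Step B above amounts to a grid-partitioned corner selection. A minimal Python sketch is given below; it assumes OpenCV's cv2.cornerHarris for the per-pixel response, and the function name, grid size n, per-cell count k and Harris parameters are illustrative defaults rather than values fixed by this disclosure.

```python
import numpy as np
import cv2  # assumed available for the per-pixel Harris response

def detect_block_points(image, n=8, k=1, block=2, ksize=3, alpha=0.04):
    """Grid-partitioned Harris detection (Step B): split the image into an
    n x n grid and keep the k strongest responses in every cell."""
    resp = cv2.cornerHarris(np.float32(image), block, ksize, alpha)
    h, w = resp.shape
    points = []
    for gy in range(n):
        for gx in range(n):
            r0, r1 = gy * h // n, (gy + 1) * h // n
            c0, c1 = gx * w // n, (gx + 1) * w // n
            cell = resp[r0:r1, c0:c1]
            idx = np.argsort(cell, axis=None)[-k:]      # k largest Harris values
            rows, cols = np.unravel_index(idx, cell.shape)
            points.extend(zip(rows + r0, cols + c0))
    return points  # (row, col) coordinates of the n*n*k feature points
```

Likewise, the CFOG construction of steps (1)-(4) can be sketched with NumPy/SciPy as follows. The number of orientations, the Gaussian σ and the constant ε are illustrative choices, and the per-pixel normalization applies Formula (2) given in the next paragraph.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter1d

def build_cfog(image, n_orient=18, sigma=0.8, eps=1e-5):
    """Per-pixel CFOG feature map of shape (H, W, n_orient)."""
    img = np.asarray(image, dtype=np.float64)
    # Step (1): horizontal and vertical gradients with the [-1, 0, 1] filters.
    gx = convolve1d(img, np.array([-1.0, 0.0, 1.0]), axis=1)
    gy = convolve1d(img, np.array([-1.0, 0.0, 1.0]), axis=0)
    # Step (2), Formula (1): orientated gradient channels over {0, 20, ..., 340} degrees;
    # the clamp-to-zero bracket is automatically satisfied after taking abs.
    thetas = np.deg2rad(np.arange(n_orient) * 360.0 / n_orient)
    g_o = np.stack([np.abs(np.cos(t) * gx + np.sin(t) * gy) for t in thetas], axis=-1)
    # Step (3): 2D Gaussian smoothing in X/Y, then the [1, 2, 1] filter along Z.
    g_sigma = gaussian_filter1d(gaussian_filter1d(g_o, sigma, axis=0), sigma, axis=1)
    g_c = convolve1d(g_sigma, np.array([1.0, 2.0, 1.0]), axis=-1)
    # Step (4), Formula (2): per-pixel L2 normalization of the Z-direction vector.
    return g_c / (np.linalg.norm(g_c, axis=-1, keepdims=True) + eps)
```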

(7) $f_i=\dfrac{f_i}{\lVert f_i\rVert_2+\varepsilon}$  (2)
where ε is a constant to avoid division by zero.

Step E: based on the pixel-wise feature representation maps, the invention establishes a fast similarity metric for CP detection using the 3D FFT, and step E comprises the following steps. (1) By Step D, the pixel-wise feature representation maps D.sub.1 and D.sub.2 are obtained for the area AreaW.sub.1i and the area AreaW.sub.2i, respectively. D.sub.1 is taken as a template and slid over D.sub.2. The sum-of-squared differences (SSD) of the feature vectors between D.sub.1 and D.sub.2 is taken as the similarity metric for matching. The SSD formula is simplified below, and the matching process is accelerated by the 3D fast Fourier transform (FFT); an illustrative NumPy sketch of this frequency-domain matching is given after Formula (8).
The SSD between D.sub.1 and D.sub.2 is computed by:
$S_i(v)=\sum_{c}[D_1(c)-D_2(c-v)]^2$  (3)
where c is the coordinate of a pixel in the pixel-wise feature representation, v represents the offset between D.sub.1 and D.sub.2, and S.sub.i represents the SSD of feature vectors between D.sub.1 and D.sub.2. The offset v.sub.i (i.e., the matching position) between D.sub.1 and D.sub.2 can be obtained by minimizing the SSD by Formula (4):

(8) $v_i=\arg\min_{v}\left\{\sum_{c}[D_1(c)-D_2(c-v)]^2\right\}$  (4)
The Formula (4) is expanded to obtain:

(9) $v_i=\arg\min_{v}\left\{\sum_{c}D_1^2(c)+\sum_{c}D_2^2(c-v)-2\sum_{c}D_1(c)\cdot D_2(c-v)\right\}$  (5)
In Formula (5), since the first and second terms are nearly constant, the formula will be minimized when the third term is maximum. Therefore, the similarity metric can be redefined as:

(10) $v_i=\arg\max_{v}\left\{\sum_{c}D_1(c)\cdot D_2(c-v)\right\}$  (6)
where $\sum_{c}D_1(c)\cdot D_2(c-v)$ is a convolution operation, which can be accelerated by using the FFT because convolutions in the spatial domain become dot products in the frequency domain. Hence, the similarity metric based on the FFT is defined as:

(11) $v_i=\arg\max_{v}\left\{F^{-1}\left[F(D_1(c))\cdot F^{*}(D_2(c-v))\right]\right\}$  (7)
where F and F.sup.−1 are the forward FFT and inverse FFT, respectively, and F* is the complex conjugate of F. Since D.sub.1 and D.sub.2 are 3D feature representation maps, Formula (7) is computed by the 3D FFT according to the principle of convolution. Accordingly, the final similarity metric is defined as:

(12) $v_i=\arg\max_{v}\left\{3DF^{-1}\left[3DF(D_1(c))\cdot 3DF^{*}(D_2(c-v))\right]\right\}$  (8)
where 3DF and 3DF.sup.−1 denote the 3D forward FFT and inverse FFT, respectively, and 3DF* is the complex conjugate of 3DF. (2) When Formula (8) is used for image matching, D.sub.1 and D.sub.2 are first processed by the 3D FFT to obtain 3DF(D.sub.1) and 3DF(D.sub.2), respectively. Then the dot product between 3DF(D.sub.1) and the complex conjugate 3DF*(D.sub.2) of 3DF(D.sub.2) is computed, followed by an inverse 3D FFT of the result to obtain the similarity map between D.sub.1 and D.sub.2. The maximum of the similarity map corresponds to the offset v.sub.i between D.sub.1 and D.sub.2, i.e., the offset between P.sub.1i(x,y) and P.sub.2i(x,y).
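Assuming the two feature maps are NumPy arrays, the frequency-domain matching of Formula (8) might be sketched as follows; the function name, the zero-padding of the template and the choice of reading the zero channel-shift slice of the 3D correlation are illustrative assumptions rather than requirements of the method.

```python
import numpy as np

def match_by_3d_fft(d1, d2):
    """Sketch of the similarity metric of Formula (8).

    d1: template feature map (m, m, C); d2: search-area feature map (M, M, C),
    both built by the same pixel-wise descriptor (e.g., CFOG)."""
    # Zero-pad the template to the search-area size so the 3D FFT sizes match.
    pad = np.zeros_like(d2)
    pad[:d1.shape[0], :d1.shape[1], :] = d1
    # 3DF^-1[ 3DF(D1) . 3DF*(D2) ]: circular correlation in the frequency domain.
    corr = np.fft.ifftn(np.fft.fftn(pad) * np.conj(np.fft.fftn(d2))).real
    # The zero channel-shift slice sums the per-channel correlations, i.e. the
    # inner product of the feature vectors at every spatial offset v.
    similarity = corr[:, :, 0]
    dy, dx = np.unravel_index(np.argmax(similarity), similarity.shape)
    return dy, dx, similarity
```

The peak location (dy, dx) is the offset v.sub.i of Formula (8); because the FFT correlation is circular, offsets larger than half the window size wrap around, which is why the similarity map is typically recentred (e.g., with np.fft.fftshift) before the sub-pixel fit of Step F.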

(13) The X-direction and Y-direction offsets of the obtained v.sub.i are denoted as (Δx,Δy), and the point P.sub.2i(x−Δx,y−Δy) corresponding to P.sub.1i(x,y) is denoted as P*.sub.2i(x,y). Accordingly, the obtained CP pair is denoted as {P.sub.1i(x,y), P*.sub.2i(x,y)}.

Step F: for the CP pair {P.sub.1i(x,y), P*.sub.2i(x,y)}, its sub-pixel location is determined by a local fitting technique, and step F comprises the following steps. (1) A local window of 3×3 pixels centered on P*.sub.2i(x,y) is selected. The similarity values of all pixels in the window are collected, and a quadratic polynomial relating the similarity values to the pixel coordinates is fitted by least squares. (2) The partial derivatives of the quadratic polynomial are set equal to zero and solved for the extremum position. This position corresponds to a CP pair with sub-pixel accuracy, which is denoted as {PD.sub.1i(x,y), PD.sub.2i(x,y)}.

Step G: steps C to F are repeated, and each point of P.sub.1i (i=1, 2, 3, . . . , N) is traversed to obtain the CP pairs with sub-pixel accuracy, which are denoted as {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N).

Step H: the final CP pairs are obtained by removing the CP pairs with large errors from {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N), and step H comprises the following steps. (1) A projective transformation model is built from the coordinates of {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) by the least squares method. (2) The root mean square error (RMSE) and the residual errors of the CP pairs are calculated, and the CP pair with the largest residual error is removed. (3) The above two steps are repeated until the RMSE is less than 1.5 pixels, yielding the final CP pairs {PID.sub.1i(x,y), PID.sub.2i(x,y)} (i=1, 2, 3, . . . , S) with sub-pixel accuracy. A brief sketch of the Step F fitting and the Step H rejection loop follows this paragraph.
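Under the assumption that the similarity map and the CP coordinates are held in NumPy arrays, the Step F quadratic fit and the Step H rejection loop might look as follows; the function names, the fixed 3×3 window and the four-point stopping guard are illustrative.

```python
import numpy as np

def subpixel_peak(sim, py, px):
    """Step F sketch: fit a quadratic surface to the 3x3 neighbourhood of the
    integer peak (py, px) of the similarity map and return its sub-pixel extremum."""
    window = sim[py - 1:py + 2, px - 1:px + 2]   # assumes the peak is not on the border
    ys, xs = np.mgrid[-1:2, -1:2]
    x, y, s = xs.ravel(), ys.ravel(), window.ravel()
    # s ~ a*x^2 + b*y^2 + c*x*y + d*x + e*y + f, least squares over the 9 samples.
    A = np.column_stack([x * x, y * y, x * y, x, y, np.ones_like(x)])
    a, b, c, d, e, _ = np.linalg.lstsq(A, s, rcond=None)[0]
    # Zero partial derivatives: 2a*dx + c*dy + d = 0 and c*dx + 2b*dy + e = 0.
    dx, dy = np.linalg.solve(np.array([[2 * a, c], [c, 2 * b]]), np.array([-d, -e]))
    return py + dy, px + dx

def refine_cps(src, dst, rmse_thresh=1.5):
    """Step H sketch: iteratively fit a projective transform to the CP pairs and
    drop the pair with the largest residual until the RMSE falls below the threshold."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    keep = np.arange(len(src))
    while True:
        x, y = src[keep, 0], src[keep, 1]
        u, v = dst[keep, 0], dst[keep, 1]
        zeros, ones = np.zeros_like(x), np.ones_like(x)
        # Linear least-squares estimate of the 8 projective parameters.
        A = np.vstack([
            np.column_stack([x, y, ones, zeros, zeros, zeros, -u * x, -u * y]),
            np.column_stack([zeros, zeros, zeros, x, y, ones, -v * x, -v * y]),
        ])
        b = np.concatenate([u, v])
        pa, pb, pc, pd, pe, pf, pg, ph = np.linalg.lstsq(A, b, rcond=None)[0]
        den = pg * x + ph * y + 1.0
        res = np.hypot((pa * x + pb * y + pc) / den - u,
                       (pd * x + pe * y + pf) / den - v)
        rmse = np.sqrt(np.mean(res ** 2))
        if rmse < rmse_thresh or len(keep) <= 4:
            return keep, rmse                    # indices of the retained CP pairs
        keep = np.delete(keep, np.argmax(res))
```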
In another embodiment, the invention presents a fast and robust multimodal remote sensing image matching system, and the system comprises the following units:
a preprocessing unit for comparing resolution information of a reference image and an input image; if the resolutions of both images are the same, the system proceeds to the next unit; otherwise, the images are resampled to the same resolution;
a template area selection unit for detecting a series of uniformly distributed feature points in the reference image by a partitioning strategy; these points are denoted as P.sub.1i (i=1, 2, 3, . . . , N), and a template area AreaW.sub.1i centered on the point P.sub.1i is selected;
a matching area selection unit for predicting a matching area AreaW.sub.2i in the input image corresponding to a point set P.sub.1i (i=1, 2, 3, . . . , N) by using the georeference information of remote sensing images;
a feature extraction unit for building a pixel-wise feature representation map in the matching area;
a preliminary matching unit for establishing a fast similarity metric for CP detection by using the 3D FFT based on the pixel-wise feature representation map; obtaining a sub-pixel location for the CP by fitting the local extremum of the similarity map; and repeating the operations involving these units and traversing all points of P.sub.1i (i=1, 2, 3, . . . , N) to obtain their corresponding CP pairs {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) at sub-pixel accuracy; and a fine-matching unit for rejecting the CP pairs with large errors from the {PD.sub.1i(x,y), PD.sub.2i(x,y)} (i=1, 2, 3, . . . , N) to obtain the final CP pairs {PID.sub.1i(x,y), PID.sub.2i(x,y)} (i=1, 2, 3, . . . , S).
Further, the feature extraction unit is used to calculate the local feature descriptor of each pixel in the image data of the matching area, and to arrange the feature vectors of all pixels in the Z-direction to form the 3D pixel-wise feature representation map.
Further, the preliminary matching unit converts the pixel-wise feature representation map into the frequency domain by using the 3D FFT, obtains the similarity map based on correlation operation, and takes the position of the maximum of the similarity map as the image matching position.
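Putting the units together, a matching pipeline in the spirit of the described system might be composed as below, reusing the earlier sketches (build_cfog, match_by_3d_fft, subpixel_peak, refine_cps). Coordinates are handled as (row, column); the window size M and the way the matching area is predicted from georeference metadata are assumptions made only for illustration.

```python
import numpy as np

def match_images(reference, input_img, point_pairs, M=100):
    """Illustrative composition of the units: template/matching area selection,
    CFOG extraction, 3D-FFT matching, sub-pixel fitting and CP refinement.

    point_pairs: iterable of ((py, px), (qy, qx)) with P1i in the reference image
    and the georeference-predicted P2i in the input image, both as (row, col)."""
    half = M // 2
    cps_ref, cps_in = [], []
    for (py, px), (qy, qx) in point_pairs:
        area1 = reference[py - half:py + half, px - half:px + half]   # AreaW1i (M x M)
        area2 = input_img[qy - half:qy + half, qx - half:qx + half]   # AreaW2i (M x M)
        d1, d2 = build_cfog(area1), build_cfog(area2)
        _, _, sim = match_by_3d_fft(d1, d2)
        sim = np.fft.fftshift(sim)                   # put the zero offset at the centre
        cy, cx = sim.shape[0] // 2, sim.shape[1] // 2
        iy, ix = np.unravel_index(np.argmax(sim), sim.shape)
        sy, sx = subpixel_peak(sim, iy, ix)          # sub-pixel peak of the similarity map
        dy, dx = sy - cy, sx - cx                    # offset v_i between D1 and D2
        cps_ref.append((py, px))
        cps_in.append((qy - dy, qx - dx))            # P2i* = P2i - v (see paragraph (13))
    keep, rmse = refine_cps(np.asarray(cps_ref), np.asarray(cps_in))
    return np.asarray(cps_ref)[keep], np.asarray(cps_in)[keep], rmse
```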
The foregoing describes embodiments of the invention. Based on the matching framework proposed in the invention, different local feature descriptors (e.g., HOG, LSS and SURF) are used to build an effective pixel-wise feature representation, which captures the common structure, shape and texture properties of the images. Moreover, a novel similarity metric is established by the 3D FFT based on the pixel-wise feature representation, and it is more computationally efficient than the similarity metrics commonly used in the spatial domain (e.g., the normalized correlation coefficient and mutual information). In addition, the proposed CFOG is a novel pixel-wise feature representation technique, which outperforms the other pixel-wise feature representation techniques based on HOG, LSS and SURF in both matching efficiency and accuracy. The technical solution of the invention overcomes the sensitivity of traditional matching methods to nonlinear radiometric differences between multimodal images, and can effectively address the difficulty of matching multimodal remote sensing images such as visible, infrared, LiDAR, SAR and raster map data.

(14) The technical solution of the invention is a general technical framework that integrates different local feature descriptors (including but not limited to CFOG, HOG, LSS and SURF) for image matching.

(15) The invention is not limited to the embodiments described above and extends to any new feature or any new combination of features disclosed in the specification, and to any step of any new method or procedure, or any new combination of steps, disclosed herein.