Frequency domain interest point descriptor
09747521 · 2017-08-29
Assignee
Inventors
Cpc classification
International classification
Abstract
Systems and methods for image analysis and recognition are disclosed, in particular the methods for interest point description. An interest point and its surrounding area is broken into subareas, a frequency domain description of each area is created by applying discrete Fourier transform (DFT). Frequency domain features are than coded bitwise by comparing them to predefined thresholds. Subsequently, the present invention provides alternative or improved methods and data structures for interest point description that may reduce memory consumption and allow fast bitwise matching.
Claims
1. A method comprising: identifying a neighborhood around an interest point location identified within an image, the neighborhood being divided into a plurality of subregions; for each subregion, calculating a frequency domain representation by: applying a Fourier transform to the subregion to create the frequency domain representation of the subregion; and normalizing the frequency domain representation of the subregion as created based on the applying of the Fourier transform to the subregion; and combining one or more normalized frequency domain representations that correspond to respective subregions into a multidimensional descriptor for the interest point.
2. The method of claim 1, wherein the frequency domain representation of the subregion is calculated using integral images.
3. The method of claim 1, further comprising: for each subregion, weighting the frequency domain representation before or after normalization.
4. The method of claim 3, wherein the frequency domain representation is weighted using coefficient weights derived from a training set of descriptors from the same image or from different images.
5. The method of claim 4, wherein inverse medians of corresponding descriptor components from the training set are used as weighting coefficients.
6. The method of claim 1, wherein for each subregion, the frequency domain representation of the subregion is calculated as a 2-dimensional DCT of the luminosity of the subregion.
7. The method of claim 1, further comprising encoding the multidimensional descriptor to a bitwise representation.
8. The method of claim 7, wherein Hamming distance between descriptors is used as the distance measure when matching the multidimensional descriptor to another descriptor.
9. The method of claim 7, wherein the bitwise descriptor representation is formed based on one or more comparisons of components of the frequency domain representation to threshold values.
10. The method of claim 9, wherein the threshold values are calculated from a training set of descriptors.
11. A computer based system implementing the method of claim 1.
12. The method of claim 1, wherein the neighborhood around the interest point location is proportional in size to a scale of the interest point.
13. The method of claim 1, wherein combining one or more normalized frequency domain representations that correspond to respective subregions into a multidimensional descriptor for the interest point further comprises: comparing the frequency domain representation of the subregion with a predefined threshold.
14. The method of claim 1, wherein combining one or more normalized frequency domain representations that correspond to respective subregions into a multidimensional descriptor for the interest point further comprises: based on a comparison of the frequency domain representation of the subregion with a predefined threshold, discarding one or more parts of the frequency domain representation of the subregion that exceed the predefined threshold.
15. The method of claim 1, further comprising matching the multidimensional descriptor for the interest point with another descriptor of another interest point present in another image.
16. The method of claim 15, wherein matching the multidimensional descriptor for the interest point with another descriptor comprises calculating a Hamming distance between the multidimensional descriptor for the interest point and the another descriptor.
17. The method of claim 15, wherein the matching comprises bitwise matching.
18. A method comprising identifying a neighborhood around an interest point location identified within an image, the neighborhood being divided into a plurality of subregions; for each subregion, calculating a frequency domain representation by: applying a Fourier transform to the subregion to create the frequency domain representation of the subregion; and normalizing the frequency domain representation of the subregion as created based on the applying of the Fourier transform to the subregion; discarding one or more parts of the frequency domain representation of the subregion that exceed a threshold; and combining one or more normalized frequency domain representations that correspond to respective subregions into a multidimensional descriptor for the interest point.
19. The method of claim 18, further comprising matching the multidimensional descriptor for the interest point with another descriptor of another interest point present in another image by calculating a Hamming distance between the multidimensional descriptor for the interest point and the another descriptor.
20. A system comprising: a processor configured to: identify a neighborhood around an interest point location identified within an image, the neighborhood being divided into a plurality of subregions; for each subregion, calculate a frequency domain representation via: an application of a Fourier transform to the subregion to create the frequency domain representation of the subregion; and a normalization of the frequency domain representation of the subregion as created based on the application of the Fourier transform to the subregion; compare the frequency domain representation of the subregion with a threshold; based on a comparison of the frequency domain representation of the subregion with the threshold, discard one or more p arts of the frequency domain representation of the subregion that exceed the threshold; combine one or more normalized frequency domain representations that correspond to respective subregions into a multidimensional descriptor for the interest point; and match the multidimensional descriptor for the interest point with another descriptor of another interest point present in another image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS
(12) The present invention provides methods and apparatus for creation of distinctive and compact data structures (descriptors), that may be used as compact representative signatures of a digital image or its specific areas. The descriptors created by the embodiments of the present invention enable to quantitively measure the similarity of a pair of interest points either in the same or in a different image.
(13) The methods of the present invention are used as a second step in a sequence of three independent and standalone steps: interest point detection, feature description and feature matching (See
(14) The coordinates (x,y) provided by an interest point detector are used to determine the center of the image area to describe. The area used to compute a descriptor is a square with its linear size proportional to σ, i.e. if we define a as the side length of the described area, a=kσ. Coefficient k may vary in reasonable measures around about 10. Extremely large values of k extend the described area to include too much of surroundings thus reducing emphasis on the interest point itself, whereas the values of k that are too small produce descriptors that are not distinctive because they only describe the local luminosity extremum in the center of an interest point. See
(15) The described area is afterwards divided into smaller subareas, each to form an independent chunk of the descriptor. Such division may be performed in various ways, but for practical purposes the following division method was observed to show the best results: 1. The first subarea is the area itself. 2. 9 more subareas are formed by dividing the area into 9 squares equal in size having the edge length equal to a/3.
See
(16) Each of the described subareas is divided into a uniform orthogonal grid, for illustrative embodiment assume the grid having 8×8 cells, but it can be any size depending on performance, compactness and distinctiveness requirements. Afterwards, average luminosity is calculated for each cell of the grid, thus forming a 8×8 integer matrix L.sub.xy. Optionally we can increase performance with the use of integral images [Crow, 1984] making computation of the average luminosity a constant complexity operation with regards to cell size. Then the luminosity matrix is transformed to frequency domain by applying DFT, specifically DCT [Ahmed N. and R., 1974] was used for illustrative embodiment. As soon as DCT is quite an expensive operation in terms of performance, it's reasonable to use integer DCT instead of usual float DCT. As a result of this operation we derive a new frequency domain 8×8 matrix F.sub.ij.
(17)
(18) By describing an image area in the frequency domain we get 64 numbers, each describing an independent visual peculiarity of an image, in contrary to the spacial domain, where none of the pixels anyhow describes an image or its area as a whole. For example, the F.sub.00 coefficient corresponds to the average luminosity in the area, F.sub.10 characterizes the gradient along the X axis, F.sub.01 is likewise proportional to the gradient along the Y axis and F.sub.22 corresponds to a light or a dark blob in the center of the area depending on its sign.
(19) As soon as the F.sub.00 coefficient corresponds to average luminosity, it should be discarded to make the descriptor robust to lighting changes. Afterwards, we also make the descriptor invariant to contrast changes by normalizing the rest of F.sub.ij:
(20)
To improve performance, normalization may also be limited to only integer operations:
(21)
where C is an arbitrary sufficiently large constant, e.g. 2.sup.16. After normalization, each of the f.sub.ij coefficients shows the significance of the feature corresponding to that coefficient in comparison to all the other features. Thus, visually predominant features will have large absolute values of their corresponding f.sub.ij coefficients.
(22) It is useful to discard the high frequency part of f.sub.ij because high frequency features are rarely important and often correspond to image noise. This also helps keeping the descriptor as compact as possible. Thanks to visual independence of the features described by the f.sub.ij coefficients, each of the remaining coefficients may be stored in just 1 bit with 1-bit denoting a large absolute value of that coefficient and 0-bit denoting f.sub.ij being close to zero. In order to perform such classification, each of |f.sub.ij| should be compared to predefined threshold T.sub.ij. Note that the threshold may not be the same for different frequencies, i.e. for higher frequencies the values of f.sub.ij are typically smaller than for low frequencies, so the thresholds T.sub.ij should be smaller as well.
(23) The exact threshold values T.sub.ij vary depending on the type of images we operate with and should be fine tuned for concrete implementations. One of good ways to define T.sub.ij is using the statistical approach. First, a training set of images is selected depending on the application domain. Then f.sub.ij is computed for each interest point in each of the images from the training set. Afterwards, for each pair of i and j a median m.sub.ij of the list of computed f.sub.ij is calculated. The value of m.sub.ij may be either directly used as a threshold T.sub.ij or somehow transformed. However, any deviations from the median values in T.sub.ij would lead to reduction of descriptor entropy and therefore distinctiveness.
(24) By the thresholding operation described above the memory footprint of the descriptor chunk is reduced to N bits, where N is the number of non-discarded f.sub.ij. To improve the descriptor distinctiveness for the price of its size, significant negative and significant positive f.sub.ij coefficients may be stored separately, thus increasing the descriptor chunk to 2N bits. For example, for a 10-chunk descriptor (i.e. the one describing the whole feature area and its 9 subareas) this leads to a 10N-bit descriptor (20N-bit for signed variant). For practical purposes it is sufficient to use N=8, so that the descriptor takes only 10 bytes (20 bytes for signed alternative).
(25) Example of the descriptor chunk creation process is shown in
(26) In contrary to SIFT or SURF descriptors where the distance is defined as an Euclidean distance between the descriptor vectors, in frequency domain descriptor the distance is defined as a Hamming distance between the two bit sequences. Thus, the distance between descriptors a and b may be calculated as the Hamming weight of a^b, where ^ denotes bitwise XOR (see
(27) Embodiment of the present invention may be created using software, hardware or the combination thereof. Current description of the illustrative embodiment is focused on the software implementation of the present invention. However, any of the parts and methods of the present invention or the whole of it may be implemented as hardware circuit.
CONCLUSION, RAMIFICATIONS, AND SCOPE
(28) The reader will see that, according to one embodiment of the invention, we have provided a method for computing distinctive descriptors of interest points and their surroundings that are both compact and efficient during the matching stage.
(29) While the above description contains many specificities, these should not be seen as limitations on the scope of any embodiment, but as exemplifications of the presently preferred embodiments thereof. Many other ramifications and variations are possible within the teachings of the various embodiments. For example, the embodiment described above operates on 2D images but the present invention may be extended to 3D solid images, or spatio-temporal domains, such as video. Also the present invention was mainly described with reference to gray scale images but the present invention is not limited thereto. The present invention may also be applied to color images, however any such application implies transformation of color image or its specific areas to gray scale in one way or another.
(30) Thus the scope of the invention should be determined by the appended claims and their legal equivalents, and not by the examples given.
REFERENCES
(31) [Ahmed N. and R., 1974] “Discrete cosine transform” (Ahmed N., Natarajan T. and Rao K. R.). [Bissacco, 2011] U.S. Pat. No. 8,086,616 (Bissacco, et al.). [Crow, 1984] “Summed-area tables for texture mapping” (Crow, Franklin). [Funayama, 2012] U.S. Pat. No. 8,165,401 (Funayama, et al.). [Harris C., 1988] “A combined corner and edge detector” (Harris C., Stephens M.). [Herbert Bay, 2008] “Surf: Speeded up robust features” (Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool). [Lin, 2010] U.S. Pat. No. 7,668,376 (Lin, et al.). [Lowe, 2000] U.S. Pat. No. 6,711,293 (Lowe). [Lowe, 2004] “Distinctive image features from scale-invariant keypoints” (Lowe, D. G.). [Rosten and Drummond, 2006] “Machine learning for high-speed corner detection” (E. Rosten and T. Drummond). [Takaki, 2011] U.S. Pat. No. 8,027,514 (Takaki, et al.). [Troyanker, 2003] U.S. Pat. No. 6,563,959 (Troyanker). [Wang, 2012] U.S. Pat. No. 8,233,711 (Wang, et al.).