Multi-modal dense correspondence imaging system
11210560 · 2021-12-28
Assignee
Inventors
Cpc classification
G06V40/103
PHYSICS
G06F18/214
PHYSICS
G06F18/213
PHYSICS
G06V40/10
PHYSICS
International classification
Abstract
A multi-modal dense correspondence image processing system submits the multi-modal images to a neural network to produce multi-modal features for each pixel of each of the multi-modal images. Each multi-modal image includes an image of a first modality and a corresponding image of a second modality different from the first modality. The neural network includes a first subnetwork trained to extract first features from pixels of the first modality, a second subnetwork trained to extract second features from pixels of the second modality, and a combiner configured to combine the first features and the second features to produce multi-modal features of a multi-modal image. The system compares the multi-modal features of a pair of multi-modal images to estimate a dense correspondence between pixels of the multi-modal images of the pair and outputs the dense correspondence between pixels of the multi-modal images in the pair.
Claims
1. A multi-modal dense correspondence image processing system, comprising: an input interface configured to accept a motion sequence of multi-modal images, each multi-modal image including an image of a first modality and a corresponding image of a second modality different from the first modality, wherein corresponding images of different modalities are images of the same scene; a memory configured to store a neural network including a first subnetwork trained to extract first features from pixels of the first modality, a second subnetwork trained to extract second features from pixels of the second modality, and a combiner configured to combine the first features and the second features to produce multi-modal features of a multi-modal image; a processor configured to submit the multi-modal images to the neural network to produce the multi-modal features for each pixel of each of the multi-modal images, wherein each of the multi-modal images is submitted separately to the neural network to produce its multi-modal features, thereby executing the neural network multiple times but once for each of the multi-modal images; and to estimate a dense correspondence between pixels of the multi-modal images by computing distances between multi-modal features of a pair of multi-modal images; and an output interface configured to output the dense correspondence between pixels of the multi-modal images in the pair.
2. The system of claim 1, wherein the first subnetwork is jointly trained with the second subnetwork to reduce an error between the multi-modal features of the multi-modal images and ground truth data.
3. The system of claim 2, wherein the error includes an embedding loss and an optical flow loss, wherein the embedding loss is a distance between multi-modal features produced by the neural network for corresponding pixels of the same point in a pair of different multi-modal images, wherein an optical flow loss is an error in an optical flow reconstructed from the multi-modal features produced by the neural network for corresponding pixels of the same point in the pair of different multi-modal images.
4. The system of claim 1, wherein the neural network is jointly trained with an embedding loss subnetwork trained to reduce a distance between multi-modal features produced by the neural network for corresponding pixels of the same point in a training pair of different multi-modal images and is jointly trained with an optical flow subnetwork trained to reduce an error in an optical flow reconstructed by the optical flow subnetwork from the multi-modal features of pixels in the training pair of different multi-modal images.
5. The system of claim 1, wherein the processor is configured to estimate the dense correspondence by comparing computed distances between the multi-modal features of different pixels in the pair of multi-modal images to find a correspondence between pixels with the smallest distance between their multi-modal features.
6. The system of claim 5, wherein the processor is configured to compare the multi-modal features of different pixels with nested iterations, wherein the nested iteration iterates first through multi-modal features of a first multi-modal image in the pair and for each current pixel of the first multi-modal image in the first iteration, iterates second through multi-modal features of a second multi-modal image in the pair to establish a correspondence between the current pixel in the first multi-modal image and a pixel in the second multi-modal image having the multi-modal features closest to the multi-modal features of the current pixel.
7. The system of claim 5, wherein the processor solves an optimization problem minimizing a difference between the multi-modal features of all pixels of a first multi-modal image in the pair and a permutation of the multi-modal features of all pixels of a second multi-modal image in the pair, such that the permutation defines the corresponding pixels in the pair of multi-modal images.
8. The system of claim 1, wherein the first modality is selected from a depth modality such that the image of the first modality is formed based on a time-of-flight of light, and wherein the second modality is selected from an optical modality such that the image of the second modality is formed by refraction or reflection of light.
9. The system of claim 8, wherein the image of the optical modality is one or combination of a radiography image, an ultrasound image, a nuclear image, a computed tomography image, a magnetic resonance image, an infrared image, a thermal image, and a visible light image.
10. The system of claim 1, wherein a modality of an image is defined by a type of a sensor acquiring the image, such that the image of the first modality is acquired by a sensor of a different type than a sensor that acquired the image of the second modality.
11. The system of claim 1, wherein the images of the first modality are depth images, and wherein the images of the second modality are color images.
12. The system of claim 1, wherein the motion sequence includes a sequence of consecutive digital multi-modal images.
13. The system of claim 1, wherein the motion sequence includes a sequence of multi-modal images, which are images within a temporal threshold in a sequence of consecutive digital multi-modal images.
14. A radar imaging system configured to reconstruct a radar reflectivity image of a moving object from the motion sequence of multi-modal images using the dense correspondence determined by the system of claim 1.
15. A method for multi-modal dense correspondence reconstruction, wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor, carry out steps of the method, comprising: accepting a motion sequence of multi-modal images, each multi-modal image including an image of a first modality and a corresponding image of a second modality different from the first modality, wherein corresponding images of different modalities are images of the same scene; submitting the multi-modal images to a neural network to produce multi-modal features for each pixel of each of the multi-modal images, wherein the neural network includes a first subnetwork trained to extract first features from pixels of the first modality, a second subnetwork trained to extract second features from pixels of the second modality, and a combiner configured to combine the first features and the second features to produce multi-modal features of a multi-modal image, and wherein each of the multi-modal images is submitted separately to the neural network to produce its multi-modal features, thereby executing the neural network multiple times but once for each of the multi-modal images; estimating a dense correspondence between pixels of a pair of the multi-modal images by comparing the multi-modal features of the multi-modal images in the pair; and outputting the dense correspondence between pixels of the multi-modal images in the pair.
16. The method of claim 15, wherein the first subnetwork is jointly trained with the second subnetwork to reduce an error between the multi-modal features of the multi-modal images and ground truth data, wherein the error includes an embedding loss and an optical flow loss, wherein the embedding loss is a distance between multi-modal features produced by the neural network for corresponding pixels of the same point in a pair of different multi-modal images, wherein an optical flow loss is an error in an optical flow reconstructed from the multi-modal features produced by the neural network for corresponding pixels of the same point in the pair of different multi-modal images.
17. The method of claim 15, wherein the first modality is selected from a depth modality such that the image of the first modality is formed based on a time-of-flight of light, and wherein the second modality is selected from an optical modality such that the image of the second modality is formed by refraction or reflection of light.
18. A non-transitory computer-readable storage medium having embodied thereon a program executable by a processor for performing a method, the method comprising: accepting a motion sequence of multi-modal images, each multi-modal image including an image of a first modality and a corresponding image of a second modality different from the first modality; submitting the multi-modal images to a neural network to produce multi-modal features for each pixel of each of the multi-modal images, wherein the neural network includes a first subnetwork trained to extract first features from pixels of the first modality, a second subnetwork trained to extract second features from pixels of the second modality, and a combiner configured to combine the first features and the second features to produce multi-modal features of a multi-modal image, and wherein each of the multi-modal images is submitted separately to the neural network to produce its multi-modal features, thereby executing the neural network multiple times but once for each of the multi-modal images; estimating a dense correspondence between pixels of a pair of the multi-modal images by comparing the multi-modal features of the multi-modal images in the pair; and outputting the dense correspondence between pixels of the multi-modal images in the pair.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(15) These instructions implement a method for computing per-pixel features for multi-modal images. The features are computed in a manner such that for pixels in a pair of multi-modal images belonging to the same part of a human body, the features are similar. In other words, the distance between those features from the different multi-modal images is small according to some metric. For example, in one embodiment the multi-modal images are a depth image and a color (RGB) image.
(16) The image processing system 100 is configured to perform feature computation and correspondence computation between a pair of multi-modal images. The image processing system 100 can include a storage device 108 adapted to store ground truth data 131 used for training, the neural network weights 132, a feature computation 133, and a correspondence computation 134. The storage device 108 can be implemented using a hard drive, an optical drive, a thumb drive, an array of drives, or any combination thereof. Different implementations of the image processing system 100 may have different combinations of the modules 131-134. For example, one embodiment uses the neural network 132 trained in advance. In this embodiment, the ground truth data 131 may be absent.
(17) A human machine interface 110 within the image processing system 100 can connect the system to a keyboard 111 and pointing device 112, wherein the pointing device 112 can include a mouse, trackball, touchpad, joy stick, pointing stick, stylus, or touchscreen, among others. The image processing system 100 can be linked through the bus 106 to a display interface 140 adapted to connect the image processing system 100 to a display device 150, wherein the display device 150 can include a computer monitor, camera, television, projector, or mobile device, among others.
(18) The image processing system 100 can also be connected to an imaging interface 128 adapted to connect the system to an imaging device 130 which provides multi-modal images. In one embodiment, the images for dense correspondence computation are received from the imaging device. The imaging device 130 can include a RGBD camera, depth camera, thermal camera, RGB camera, computer, scanner, mobile device, webcam, or any combination thereof.
(19) A network interface controller 160 is adapted to connect the image processing system 100 through the bus 106 to a network 190. Through the network 190, the images 195, including one or a combination of the features, imaging input documents, and neural network weights, can be downloaded and stored within the storage system 108 for storage and/or further processing.
(20) In some embodiments, the image processing system 100 is connected to an application interface 180 through the bus 106 adapted to connect the image processing system 100 to an application device 185 that can operate based on results of image comparison. For example, the device 185 is a system which uses the dense correspondences to reconstruct radar images of moving people to provide high throughput access security. The image processing system 100 can also be connected with other image processing applications 135.
(24) For example, in some embodiments, the first modality is selected from a depth modality such that the image of the first modality is formed based on a time-of-flight of light, and wherein the second modality is selected from an optical modality such that the image of the second modality is formed by refraction or reflection of light. Additionally, or alternatively, in some embodiments, the image of the optical modality is one or combination of a radiography image, an ultrasound image, a nuclear image, a computed tomography image, a magnetic resonance image, an infrared image, a thermal image, and a visible light image.
(25) Additionally, or alternatively, in some embodiments, a modality of an image is defined by a type of a sensor acquiring an image, such as the image of the first modality is acquired by a sensor of different type than a sensor that acquired the image of the second modality. Additionally, or alternatively, in some embodiments, the images of the first modality are depth images, and the images of the second modality are color images.
(27) In such a manner, the neural network is trained to produce multi-modal features suitable to improve accuracy of dense correspondence. To that end, the multi-modal dense correspondence image processing system 100 is configured to compare the multi-modal features of a pair of multi-modal images to estimate a dense correspondence between pixels of the multi-modal images of the pair and output the dense correspondence between pixels of the multi-modal images in the pair.
(28) The features computation 133 includes several components. A first modality image 401 at time step t.sub.i is input to a neural network 411. The neural network 411 computes a feature vector, or simply features, 421. A second modality image 402 at the same time step t.sub.i is input to a neural network 412. The neural network 412 computes features 422. A concatenation module 430 combines the features 421 and 422 into multi-modal features 423 at time step t.sub.i by concatenating the feature vectors.
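The two-branch feature computation can be sketched numerically. In this sketch the trained subnetworks 411 and 412 are stood in for by random per-pixel linear maps (an assumption for illustration, not the patent's architecture), and the combiner 430 is channel-wise concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 4, 4          # image height and width
D1, D2 = 1, 3        # input channel depths, e.g. depth (1) and RGB (3)
F1, F2 = 8, 8        # per-branch feature channel depths (arbitrary here)

# Stand-ins for subnetworks 411 and 412: a per-pixel linear map from
# input channels to feature channels (like a single 1x1 convolution).
W1 = rng.standard_normal((D1, F1))
W2 = rng.standard_normal((D2, F2))

def branch(image, weights):
    """Map an H x W x D image to H x W x F per-pixel features."""
    return image @ weights

depth_image = rng.standard_normal((H, W, D1))   # first-modality input 401
color_image = rng.standard_normal((H, W, D2))   # second-modality input 402

feat1 = branch(depth_image, W1)                 # features 421
feat2 = branch(color_image, W2)                 # features 422

# Combiner 430: concatenate along the channel axis.
multimodal = np.concatenate([feat1, feat2], axis=-1)   # features 423
print(multimodal.shape)                         # (4, 4, 16)
```

The spatial dimensions are preserved, so the result is a per-pixel multi-modal feature map whose channel depth is the sum of the two branch depths.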
(30) It is to be understood that different image content for a first modality image 401, for example modality image 403, will result in different features 421, for example features 441. Similarly, different image content for a second modality image 402, for example modality image 404, will result in different features 422, for example features 442.
(32) The input image of a first modality 401 includes an array of pixels 510. For clarity only a small subset of pixels 510 is shown in modality image 401. The input image of a first modality 401 has a resolution 515, of height (H) 516 by width (W) 517 by modality channel depth (D.sub.1) 518. The input image of a second modality 402 includes an array of pixels 520. For clarity only a small subset of pixels 520 is shown in modality image 402. The input image of a second modality 402 has a resolution 525, of height (H) 516 by width (W) 517 by modality channel depth (D.sub.2) 528.
(33) The features 421 of a first modality are determined from an array of pixels 530. For clarity only a small subset of pixels 530 for features 421 are shown. Features 421 have a resolution 535 of height (H) 516 by width (W) 517 by feature channel depth (D′.sub.1) 538. The features 422 of a second modality are determined from an array of pixels 540. For clarity only a small subset of pixels 540 for features 422 are shown. Features 422 have a resolution 545 of height (H) 516 by width (W) 517 by feature channel depth (D′.sub.2) 548.
(34) The multi-modal features 423 are determined from an array of pixels 550. For clarity only a single pixel 550 is shown in multi-modal features 423. The multi-modal features 423 have a resolution 555 of height (H) 516 by width (W) 517 by feature channel depth (D′) 558. The multi-modal features 423 are formed by concatenation 430 of features 421 and features 422. The multi-modal features 423 channel depth D′ 558 is thus the sum of channel depths 538 (D′.sub.1) and 548 (D′.sub.2): D′=D′.sub.1+D′.sub.2. Since the H and W for features 421, 422 and 423 are the same as the H and W for input 401 and 402, this disclosure labels them as per-pixel features with channel depths D′.sub.1, D′.sub.2, D′ respectively.
(36) In such a manner, the system 100 is configured to compare the multi-modal features of different pixels with a nested-iterations comparison. The nested-iterations comparison iterates first through the multi-modal features of a first multi-modal image in the pair and, for each current pixel of the first multi-modal image in the first iteration, iterates second through the multi-modal features of a second multi-modal image in the pair to establish a correspondence between the current pixel in the first multi-modal image and the pixel in the second multi-modal image having the multi-modal features closest to the multi-modal features of the current pixel.
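The nested-iterations comparison amounts to a brute-force nearest-neighbor search over per-pixel features. A minimal sketch, not the system's actual implementation:

```python
import numpy as np

def nested_correspondence(feats_a, feats_b):
    """For each pixel of image A, find the pixel of image B whose
    multi-modal feature vector is closest in Euclidean distance."""
    Ha, Wa, _ = feats_a.shape
    Hb, Wb, _ = feats_b.shape
    corr = {}
    for ia in range(Ha):                  # first (outer) iteration: image A
        for ja in range(Wa):
            best, best_dist = None, np.inf
            for ib in range(Hb):          # second (inner) iteration: image B
                for jb in range(Wb):
                    d = np.linalg.norm(feats_a[ia, ja] - feats_b[ib, jb])
                    if d < best_dist:
                        best, best_dist = (ib, jb), d
            corr[(ia, ja)] = best
    return corr

# With identical feature maps, every pixel matches itself (distance zero).
feats = np.arange(16.0).reshape(2, 2, 4)
matches = nested_correspondence(feats, feats)
print(matches[(1, 1)])                    # (1, 1)
```

The quadruple loop is quadratic in the number of pixels; practical systems vectorize the distance computation or use approximate nearest-neighbor search.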
(37) Additionally or alternatively, some embodiments solve an optimization problem minimizing a difference between the multi-modal features of all pixels of a first multi-modal image in the pair and a permutation of the multi-modal features of all pixels of a second multi-modal image in the pair, such that the permutation defines the corresponding pixels in the pair of multi-modal images. For example, one embodiment poses the correspondence computation 134 as an optimization problem:
(39) The per-pixel multi-modal features 423 are stacked into a matrix F.sub.1 and the per-pixel multi-modal features 443 are stacked into a matrix F.sub.2. The matrix M is a permutation matrix. The matrix W can impose constraints on matrix M during optimization. The dense correspondences 601 are then determined by the permutation matrix M after the optimization has finished.
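For a handful of pixels, the minimization over permutation matrices can be illustrated by exhaustive search; the constraint matrix W is omitted in this sketch, and real feature maps would need a polynomial-time assignment solver (such as the Hungarian algorithm) rather than enumeration:

```python
import itertools
import numpy as np

def best_permutation(features_1, features_2):
    """Search all permutations for the one minimizing the Frobenius norm
    || F1 - M @ F2 ||, with M realized by row indexing. Feasible only
    for a very small number of pixels."""
    n = features_1.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = np.linalg.norm(features_1 - features_2[list(perm)])
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm

# F2 is F1 with rows shuffled; the recovered permutation undoes the shuffle.
F1 = np.array([[0., 0.], [1., 1.], [2., 2.]])
F2 = F1[[2, 0, 1]]
print(best_permutation(F1, F2))  # (1, 2, 0): row i of F1 matches row perm[i] of F2
```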
(40) Training
(41)
(42) As described previously, feature computation 133 uses the subnetworks 411 and 412, along with concatenation 430, to produce multi-modal features 423 and 443 for each multi-modal image. The multi-modal features 423 and 443 are input to the embedding loss 720 and also to an optical flow neural network 730. The optical flow network 730 produces an optical flow prediction 740. The embedding prediction is compared with the ground truth data 131 to determine an embedding loss. The optical flow prediction is compared with the ground truth optical flow 131 to determine an optical flow loss.
(43) A loss is the error computed by a function. In one embodiment, the function for the embedding loss 720 is defined as:
(45) The functions D.sub.1( ) and D.sub.2( ) in equation (1) above represent the steps to produce the multi-modal features 423 and 443 respectively. For a given pixel p.sub.i the corresponding feature from features 423 is denoted as D.sub.1(p.sub.i). For a given pixel p′.sub.i the corresponding feature from features 443 is denoted as D.sub.2(p′.sub.i). If the pixels are in correspondence (p.sub.i⇔p′.sub.i and thus y.sub.i=1 in equation (2)), the loss in equation (1) is computed according to the left-hand side with respect to the ‘+’ sign. If the pixels are not in correspondence (y.sub.i=0 in equation (2)), the loss in equation (1) is computed according to the right-hand side with respect to the ‘+’ sign. In colloquial terms the loss function specified with equation (1) tries to achieve similar features for pixels in correspondence, and dissimilar features for pixels that are not in correspondence.
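The description of equations (1) and (2) is consistent with a standard contrastive embedding loss: a distance term for corresponding pixels on the left of the '+' sign and a margin term for non-correspondences on the right. One plausible reconstruction, in which the margin m and the squared-distance form are assumptions not stated in the text, is:

```latex
\mathcal{L}_{\text{embed}}
  = \sum_{i=1}^{N} \Big[\, y_i \,\big\| D_1(p_i) - D_2(p'_i) \big\|^2
  \;+\; (1 - y_i)\,\max\!\big(0,\; m - \big\| D_1(p_i) - D_2(p'_i) \big\|\big)^2 \Big]
\quad\text{with}\quad
y_i = \begin{cases} 1, & p_i \Leftrightarrow p'_i \\ 0, & \text{otherwise,} \end{cases}
```

where the summation corresponds to equation (1) and the indicator y.sub.i to equation (2).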
(46) Training randomly selects a number P of pixels from the multi-modal input images 701 and 702 and determines the corresponding pixels in 711 and 712 using the ground truth optical flow data 131. Training further selects a number Q of non-correspondences. The correspondences and non-correspondences sum to N=P+Q. This selection of pixels provides the data 760 for computing the embedding loss 720.
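The sampling of P correspondences and Q non-correspondences can be sketched as follows; the fixed perturbation used to manufacture non-correspondences is an illustrative choice, not the patent's procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_training_pixels(flow, P, Q):
    """Sample P correspondences via the ground-truth optical flow (label 1)
    and Q non-correspondences (label 0). flow is H x W x 2 displacements."""
    H, W, _ = flow.shape
    samples = []
    for _ in range(P):
        i, j = rng.integers(H), rng.integers(W)
        di, dj = flow[i, j]                       # ground-truth displacement
        samples.append(((i, j), (i + di, j + dj), 1))
    for _ in range(Q):
        i, j = rng.integers(H), rng.integers(W)
        di, dj = flow[i, j]
        # Shift away from the true match to create a non-correspondence.
        samples.append(((i, j), ((i + di + 1) % H, (j + dj + 1) % W), 0))
    return samples

flow = np.zeros((4, 4, 2), dtype=int)             # identity flow, for illustration
pairs = sample_training_pixels(flow, P=3, Q=2)
print(len(pairs))                                 # 5 = P + Q
```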
(49) In some embodiments, the optical flow loss 750 is computed as the difference between the flow values at the pixels of the predicted optical flow image and the pixels of the ground truth optical flow image.
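One common realization of this per-pixel difference, assumed here, is the average endpoint error between the predicted and ground-truth flow fields:

```python
import numpy as np

def optical_flow_loss(pred_flow, gt_flow):
    """Average per-pixel endpoint error between two H x W x 2 flow fields:
    the Euclidean length of the per-pixel flow difference, averaged."""
    return float(np.mean(np.linalg.norm(pred_flow - gt_flow, axis=-1)))

gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[0, 0] = [3.0, 4.0]                 # one pixel off by a (3, 4) vector
print(optical_flow_loss(pred, gt))      # 1.25 = 5 / 4 pixels
```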
(51) Training the neural network involves computing the weight values associated with the connections in the artificial-neural-network. To that end, unless herein stated otherwise, the training includes electronically computing weight values for the connections in the fully connected network, the interpolation and the convolution. The embedding loss 720 and optical flow loss 750 are summed together and a stochastic gradient descent based method is used to update the neural network weights. Training continues until some desired performance threshold is reached.
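The update rule can be illustrated on a toy scalar weight. The two quadratic terms below merely stand in for the embedding loss 720 and the optical flow loss 750, which the actual system computes with the networks described above:

```python
# Toy illustration of summing two losses and applying plain SGD updates.
def combined_loss(w):
    embedding_loss = (w - 2.0) ** 2   # stand-in for embedding loss 720
    flow_loss = (w - 4.0) ** 2        # stand-in for optical flow loss 750
    return embedding_loss + flow_loss

def grad(w):
    # Analytic gradient of the summed loss with respect to the weight.
    return 2.0 * (w - 2.0) + 2.0 * (w - 4.0)

w, lr = 0.0, 0.1
for _ in range(200):                  # iterate until a threshold is reached
    w -= lr * grad(w)
print(round(w, 3))                    # 3.0, the minimizer of the summed loss
```

In practice the gradient is obtained by backpropagation through both loss branches, and the update is applied to all network weights rather than a single scalar.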
(53) The memory 38 includes a database 90, trainer 82, the neural network 780, and preprocessor 84. The database 90 can include historical data 106, training data 88, testing data 92, and ground truth data 94. The database may also include results from operational, training, or retraining modes of using the neural network 780. These elements have been described in detail above.
(54) Also shown in memory 38 is the operating system 74. Examples of operating systems include AIX, OS/2, and DOS. Other elements shown in memory 38 include device drivers 76 which interpret the electrical signals generated by devices such as the keyboard and mouse. A working memory area 78 is also shown in memory 38. The working memory area 78 can be utilized by any of the elements shown in memory 38. The working memory area can be utilized by the neural network 780, trainer 82, the operating system 74 and other functions. The working memory area 78 may be partitioned amongst the elements and within an element. The working memory area 78 may be utilized for communication, buffering, temporary storage, or storage of data while a program is running.
(56) The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component. Though, a processor may be implemented using circuitry in any suitable format.
(57) Also, the embodiments of the invention may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
(58) Use of ordinal terms such as “first,” “second,” in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
(59) Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.