MONOCULAR DEPTH ESTIMATION DEVICE AND DEPTH ESTIMATION METHOD
20230080120 · 2023-03-16
Inventors
- Saad Imran (Seoul, KR)
- Muhammad Umar Khan (London, GB)
- Sikander Bin Mukaram (Northwood, GB)
- Chong-Min Kyung (Daejeon, KR)
CPC Classification
International Classification
Abstract
A depth estimation device includes a difference map generating network and a depth transformation circuit. The difference map generating network generates, from a monocular input image and using a plurality of neural networks, a plurality of difference maps corresponding to a plurality of baselines. The plurality of difference maps includes a first difference map corresponding to a first baseline and a second difference map corresponding to a second baseline. The depth transformation circuit generates a depth map using one of the plurality of difference maps.
Claims
1. A depth estimation device comprising: a difference map generating network configured to generate a plurality of difference maps corresponding to a plurality of baselines from a single input image and to generate a mask indicating a masking region; and a depth transformation circuit configured to generate a depth map using one of the plurality of difference maps, wherein the plurality of difference maps includes a first difference map corresponding to a first baseline and a second difference map corresponding to a second baseline.
2. The depth estimation device of claim 1, further comprising a synthesizing circuit configured to generate a synthesized difference map by combining the mask, the first difference map, and the second difference map.
3. The depth estimation device of claim 2, wherein the synthesizing circuit generates the synthesized difference map by synthesizing data of the first difference map corresponding to the masking region with the second difference map.
4. The depth estimation device of claim 1, wherein the difference map generating network comprises: an encoder configured to generate, using a first neural network, feature data by encoding the input image; a first decoder configured to generate, using a second neural network, the first difference map from the feature data; a second decoder configured to generate, using a third neural network, a left difference map and a right difference map from the feature data; a third decoder configured to generate, using a fourth neural network, the second difference map from the feature data; and a mask generating circuit configured to generate the mask according to the left difference map and the right difference map.
5. The depth estimation device of claim 4, wherein the mask generating circuit comprises: a transformation circuit configured to generate a reconstructed left difference map by transforming the right difference map according to the left difference map; and a comparison circuit configured to generate the mask according to the left difference map and the reconstructed left difference map.
6. The depth estimation device of claim 5, wherein the comparison circuit determines data of the mask by comparing a threshold value with a difference between the left difference map and the reconstructed left difference map.
7. The depth estimation device of claim 4, wherein a learning operation for the second, third, and fourth neural networks uses a first image, a second image paired with the first image to form a first baseline image pair, and a third image paired with the first image to form a second baseline image pair.
8. The depth estimation device of claim 7, further comprising a first loss calculation circuit configured to calculate a first loss function by using the first image and a first reconstructed image generated by transforming the second image according to the first difference map.
9. The depth estimation device of claim 7, further comprising: a second loss calculation circuit configured to calculate a second loss function by using the first image and a second reconstructed image generated by transforming the third image according to the left difference map; and a third loss calculation circuit configured to calculate a third loss function by using the third image and a third reconstructed image generated by transforming the first image according to the right difference map.
10. The depth estimation device of claim 7, further comprising a fourth loss calculation circuit configured to calculate a fourth loss function by calculating a first loss subfunction using the first image and a fourth reconstructed image generated by transforming the third image according to the second difference map, calculating a second loss subfunction using the first difference map and the second difference map, and calculating a third loss subfunction by using the second difference map and the first image.
11. A depth estimation method comprising: receiving an input image corresponding to a single monocular image; generating, from the input image, a plurality of difference maps including a first difference map corresponding to a first baseline and a second difference map corresponding to a second baseline; and generating a depth map using one of the plurality of difference maps.
12. The depth estimation method of claim 11, further comprising: generating, from the input image, a mask indicating a masking region; and generating a synthesized difference map by combining the mask, the second difference map and the first difference map.
13. The depth estimation method of claim 12, wherein generating the synthesized difference map comprises synthesizing data of the first difference map corresponding to the masking region with the second difference map.
14. The depth estimation method of claim 11, further comprising: generating feature data by encoding the input image using a first neural network, wherein generating the plurality of difference maps comprises: generating the first difference map by decoding the feature data using a second neural network; and generating the second difference map by decoding the feature data using a fourth neural network, and wherein generating the mask comprises: generating a left difference map and a right difference map by decoding the feature data using a third neural network, and generating the mask according to the left difference map and the right difference map.
15. The depth estimation method of claim 14, wherein generating the mask comprises: generating a reconstructed left difference map by transforming the right difference map according to the left difference map; and generating the mask by comparing a threshold value to a difference between the left difference map and the reconstructed left difference map.
16. The depth estimation method of claim 14, wherein a learning operation for one or more of the first through fourth neural networks uses a first image, a second image paired with the first image to form a first baseline image pair, and a third image paired with the first image to form a second baseline image pair.
17. The depth estimation method of claim 16, wherein the learning operation comprises: calculating a first loss function by using the first image and a first reconstructed image generated by transforming the second image according to the first difference map; calculating a second loss function by using the first image and a second reconstructed image generated by transforming the third image according to the left difference map; calculating a third loss function by using the third image and a third reconstructed image generated by transforming the first image according to the right difference map; training the first, second, and third neural networks using the first, second, and third loss functions; calculating a fourth loss function by calculating a first loss subfunction using the first image and a fourth reconstructed image generated by transforming the third image according to the second difference map, calculating a second loss subfunction using the first difference map and the second difference map, and calculating a third loss subfunction by using the second difference map and the first image; and training the fourth neural network using the fourth loss function.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate various embodiments, and explain various principles and beneficial aspects of those embodiments.
[0016]
[0017]
[0018]
DETAILED DESCRIPTION
[0019] The following detailed description references the accompanying figures in describing illustrative embodiments consistent with this disclosure. The embodiments are provided for illustrative purposes and are not exhaustive. Additional embodiments not explicitly illustrated or described are possible. Further, modifications can be made to the presented embodiments within the scope of teachings of the present disclosure. The detailed description is not meant to limit embodiments of this disclosure. Rather, the scope of the present disclosure is defined in accordance with claims and equivalents thereof. Also, throughout the specification, reference to “an embodiment” or the like is not necessarily to only one embodiment, and different references to any such phrase are not necessarily to the same embodiment(s).
[0020]
[0021] The depth estimation device 1 includes a difference map generating network 100, a synthesizing circuit 210, and a depth transformation circuit 220.
[0022] During an inference operation, the difference map generating network 100 receives a single input image. The single input image may correspond to a single image taken from a monocular imaging device.
[0023] However, during a learning operation of the difference map generating network 100, a plurality of input images corresponding to sets of multi-baseline images are used. The learning operation will be disclosed in more detail below.
[0024] During the learning operation, the difference map generating network 100 generates a first difference map d.sub.s, a second difference map d.sub.m, and a mask M from the plurality of input images. During the inference operation, the difference map generating network 100 may generate only the second difference map d.sub.m from the single input image.
[0025] In general, a small baseline stereo system generates accurate depth information at a relatively near range. When the baseline is small, an occlusion area visible only to one of the two cameras is relatively small.
[0026] In contrast, a large baseline stereo system generates accurate depth information at a relatively far range. When the baseline is large, the occlusion area is relatively large.
[0027] The first difference map d.sub.s corresponds to a map indicating inferred differences between small baseline images, and the second difference map d.sub.m corresponds to a map indicating inferred differences between large baseline images.
[0028] Disparity represents a distance between two corresponding points in two images, and a difference map represents disparities for the entire image.
[0029] Since a technique for calculating the depth of a point from a baseline, a focal length, and a disparity is well known from articles such as D. Gallup, J. Frahm, P. Mordohai and M. Pollefeys, “Variable baseline/resolution stereo,” 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1-8, doi: 10.1109/CVPR.2008.4587671, a detailed description thereof will be omitted.
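For reference, the well-known relation can be sketched in a few lines; the function and variable names below are illustrative and not part of the disclosure:

```python
# Classical stereo geometry: depth Z = f * B / d, where f is the focal length
# in pixels, B is the baseline, and d is the disparity in pixels.
def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Convert a disparity (pixels) to a depth (meters)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

For example, with a focal length of 700 px and a baseline of 0.10 m, a disparity of 35 px corresponds to a depth of 2.0 m; halving the baseline halves the depth obtained for the same disparity, which is why a small baseline resolves near ranges and a large baseline far ranges.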
[0030] The difference map generating network 100 further generates a mask M, wherein the mask M indicates a masking region of the second difference map d.sub.m to be replaced with data of the first difference map d.sub.s.
[0031] A method of generating the mask M will be disclosed in detail below.
[0032] The synthesizing circuit 210 is used for a training operation, and the depth transformation circuit 220 is used for an inference operation.
[0033] The synthesizing circuit 210 applies the mask M to the second difference map d.sub.m, thus removing the data corresponding to the masking region from the second difference map d.sub.m.
[0034] The synthesizing circuit 210 generates a synthesized difference map using the first difference map d.sub.s and the mask M.
[0035] In this case, the synthesizing circuit 210 replaces data of the masking region in the second difference map d.sub.m with corresponding data of the first difference map d.sub.s.
[0036] The depth transformation circuit 220 generates a depth map from the synthesized difference map.
[0037] In this embodiment, the first difference map d.sub.s corresponding to a first baseline is used inside the masking region, and the second difference map d.sub.m corresponding to a second baseline is used outside the masking region.
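The synthesis described in paragraphs [0033] through [0037] amounts to a per-pixel selection. The following is a minimal sketch, not the circuit's actual implementation; the array names and the optional scaling by the baseline ratio r (described later in the specification as the ratio of the large baseline to the small baseline) are assumptions:

```python
import numpy as np

def synthesize_difference_map(d_s, d_m, mask, r=1.0):
    """Inside the masking region (mask == 1), use the small-baseline map d_s
    scaled by the baseline ratio r; elsewhere, keep the large-baseline map d_m."""
    return np.where(np.asarray(mask) == 1, r * np.asarray(d_s), np.asarray(d_m))
```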
[0038]
[0039] The difference map generating network 100 includes an encoder 110, a first decoder 121, a second decoder 122, a third decoder 123, and a mask generating circuit 130.
[0040] The encoder 110 encodes an input image I.sub.L to generate feature data. In embodiments, the encoder 110 uses a trained neural network to generate the feature data.
[0041] The first decoder 121 decodes the feature data to generate a first difference map d.sub.s, the second decoder 122 decodes the feature data to generate a left difference map d.sub.l and a right difference map d.sub.r, and the third decoder 123 decodes the feature data to generate a second difference map d.sub.m. In embodiments, the first decoder 121, second decoder 122, and third decoder 123 use respective trained neural networks to decode the feature data.
[0042] The mask generating circuit 130 generates a mask M from the left difference map d.sub.l and the right difference map d.sub.r.
[0043] The mask generating circuit 130 includes a transformation circuit 131 that transforms the right difference map d.sub.r according to the left difference map d.sub.l to generate a reconstructed left difference map d.sub.l′.
[0044] In the present embodiment, the transformation operation corresponds to a warp operation, and the warp operation is a type of transformation operation that transforms a geometric shape of an image.
[0045] In this embodiment, the transformation circuit 131 performs a warp operation as shown in Equation 1. The warp operation of Equation 1 is known from prior articles such as Saad Imran, Sikander Bin Mukarram, Muhammad Umar Karim Khan, and Chong-Min Kyung, “Unsupervised deep learning for depth estimation with offset pixels,” Opt. Express 28, 8619-8639 (2020). Equation 1 represents a warp function f.sub.w used to warp an image I with the difference map d. In detail, warping is used to change the viewpoint of a given scene across two views with a given disparity map. For example, if I.sub.L is a left image and d.sub.R is a difference map between the left image I.sub.L and a right image I.sub.R with the right image I.sub.R taken as reference, then in the absence of occlusion, f.sub.w(I.sub.L; d.sub.R) should be equal to the right image I.sub.R.
f.sub.w(I; d)=I(i+d(i, j), j) ∀i, j [Equation 1]
[0046] The transformation circuit 131 may additionally perform, on the operation result of Equation 1, a bilinear interpolation operation as described in M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, (2015), pp. 2017-2025.
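As a concrete sketch of Equation 1 with interpolated sampling along the horizontal axis, one might write the following; this is an illustration under stated assumptions, not the circuit's actual implementation:

```python
import numpy as np

def warp(image, disparity):
    """f_w(I; d)(i, j) = I(i + d(i, j), j): sample each pixel of I at a
    horizontally shifted position, with linear interpolation along the
    horizontal axis and clamping at the image borders."""
    image = np.asarray(image, dtype=np.float64)
    disparity = np.asarray(disparity, dtype=np.float64)
    rows, cols = image.shape
    out = np.zeros_like(image)
    for j in range(rows):
        for i in range(cols):
            x = i + disparity[j, i]           # shifted horizontal coordinate
            x0 = int(np.floor(x))
            t = x - x0                        # interpolation weight in [0, 1)
            x0c = min(max(x0, 0), cols - 1)   # clamp to valid columns
            x1c = min(max(x0 + 1, 0), cols - 1)
            out[j, i] = (1.0 - t) * image[j, x0c] + t * image[j, x1c]
    return out
```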
[0047] The mask generating circuit 130 includes a comparison circuit 132 that generates the mask M by comparing the reconstructed left difference map d.sub.l′ with the left difference map d.sub.l.
[0048] In the occlusion region, there is a high probability that the reconstructed left difference map d.sub.l′ and the left difference map d.sub.l have different values.
[0049] Accordingly, in the present embodiment, if a difference between each pixel of the reconstructed left difference map d.sub.l′ and the corresponding pixel of the left difference map d.sub.l is greater than a threshold value, which is 1 in an embodiment, then corresponding mask data for that pixel is set to 1. Otherwise, the corresponding mask data for that pixel is set to 0. Hereinafter, an occlusion region may be referred to as a masking region.
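The thresholding described above reduces to one comparison per pixel. A minimal sketch follows; the array names are illustrative:

```python
import numpy as np

def generate_mask(d_l, d_l_recon, threshold=1.0):
    """Mask data is 1 where the left difference map and its reconstruction
    disagree by more than the threshold (an occlusion/masking pixel), else 0."""
    diff = np.abs(np.asarray(d_l) - np.asarray(d_l_recon))
    return (diff > threshold).astype(np.float32)
```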
[0050] During the inference operation, the input image I.sub.L is one monocular image such as may be acquired by a single camera. During the inference operation the encoder 110 generates the feature data from the single input image I.sub.L and the third decoder 123 generates the second difference map d.sub.m from the feature data.
[0051] During the learning operation, a prepared training data set is used and the training data set includes three images as one unit of data as shown in
[0052] The three images include a first image I.sub.L, a second image I.sub.R1, and a third image I.sub.R2.
[0053] The first image I.sub.L corresponds to a leftmost image, the second image I.sub.R1 corresponds to a middle image, and the third image I.sub.R2 corresponds to a rightmost image.
[0054] That is, the first image I.sub.L and the second image I.sub.R1 correspond to a small baseline B.sub.s image pair, and the first image I.sub.L and the third image I.sub.R2 correspond to a large baseline B.sub.L image pair.
[0055] During the learning operation, the total loss function is calculated and weights included in the neural networks of the encoder 110, the first decoder 121, and the second decoder 122 shown in
[0056] In this embodiment, weights for the third decoder 123 are adjusted separately, as will be described in detail below.
[0057] In this embodiment, the total loss function L.sub.total corresponds to a combination of an image reconstruction loss component L.sub.recon, a smoothness loss component L.sub.smooth, and a decoder loss component L.sub.dec3, as shown in Equation 2.
L.sub.total=L.sub.recon+λL.sub.smooth+L.sub.dec3 [Equation 2]
[0058] In Equation 2, a smoothness weight λ is set in embodiments to 0.1.
[0059] In Equation 2, the image reconstruction loss component L.sub.recon is defined as Equation 3.
L.sub.recon=L.sub.a(I.sub.L, I.sub.L1′)+L.sub.a(I.sub.L, I.sub.L2′)+L.sub.a(I.sub.R2, I.sub.R2′) [Equation 3]
[0060] In Equation 3, the reconstruction loss component L.sub.recon is expressed as the sum of the first image reconstruction loss function L.sub.a between the first image I.sub.L and the first reconstruction image I.sub.L1′, the second reconstruction loss function L.sub.a between the first image I.sub.L and the second reconstruction image I.sub.L2′, and the third image reconstruction loss function L.sub.a between the third image I.sub.R2 and the third reconstruction image I.sub.R2′.
[0061] In
[0062] The transformation circuit 141 transforms the second image I.sub.R1 according to the first difference map d.sub.s to generate a first reconstructed image I.sub.L1′.
[0063] The transformation circuit 142 transforms the third image I.sub.R2 according to the left difference map d.sub.l to generate a second reconstructed image I.sub.L2′.
[0064] The transformation circuit 143 transforms the first image I.sub.L according to the right difference map d.sub.r to generate a third reconstructed image I.sub.R2′.
[0065] The image reconstruction loss function L.sub.a is expressed by Equation 4. The image reconstruction loss function L.sub.a of Equation 4 represents photometric error between an original image I and a reconstructed image I′.
[0066] In Equation 4, the Structural Similarity Index (SSIM) function is used to compare the similarity between two images; it is a well-known function described in articles such as Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, 13(4):600-612, 2004.
[0067] In Equation 4, N denotes the number of pixels, I denotes an original image, and I′ denotes a reconstructed image. In this embodiment, a 3×3 block filter is used instead of a Gaussian for the SSIM operation.
[0068] In this embodiment, the value of alpha is set to 0.85, so that more weight is given to the SSIM calculation result. The SSIM calculation result produces values based on luminance, contrast, and structure.
[0069] When the difference in luminance between the two images is large, it may be more effective to use the SSIM calculation result.
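Equation 4 itself does not survive in this text. A reconstruction consistent with the surrounding description (N pixels; a per-pixel SSIM term weighted by α against an absolute-difference term, in the manner of Godard et al.) would be, as an assumption:

```latex
L_a(I, I') = \frac{1}{N} \sum_{i,j} \left[ \alpha\,\frac{1 - \mathrm{SSIM}(I_{ij}, I'_{ij})}{2} + (1 - \alpha)\,\lvert I_{ij} - I'_{ij} \rvert \right]
```

Here α = 0.85 as stated in paragraph [0068], so the SSIM term dominates.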
[0070] In Equation 2, the smoothness loss component L.sub.smooth is defined by Equation 5. The smoothness loss penalizes disparity gradients in regions where the image gradient is small, encouraging the difference maps to be smooth except at image edges.
L.sub.smooth=L.sub.s(d.sub.s, I.sub.L)+L.sub.s(d.sub.l, I.sub.L)+L.sub.s(d.sub.r, I.sub.R2) [Equation 5]
[0071] In Equation 5, the smoothness loss component L.sub.smooth is expressed as the sum of the first smoothness loss function L.sub.s between the first difference map d.sub.s and the first image I.sub.L, the second smoothness loss function L.sub.s between the left difference map d.sub.l and the first image I.sub.L, and the third smoothness loss function L.sub.s between the right difference map d.sub.r and the third image I.sub.R2.
[0072] In
[0073] The smoothness loss function L.sub.s is expressed by the following Equation 6. In Equation 6, d corresponds to an input difference map, I corresponds to an input image, ∂x is a horizontal gradient of the input image, and ∂y is a vertical gradient of the input image. It can be seen from Equation 6 that when the image gradient is large, the weight applied to the disparity gradient becomes small, so that disparity discontinuities are permitted at image edges. The same loss has been used in articles such as Godard, Clément et al., “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017): 6602-6611.
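Equation 6 likewise does not survive in this text. The edge-aware smoothness loss of Godard et al., which the description above matches term by term, is, as an assumed reconstruction:

```latex
L_s(d, I) = \frac{1}{N} \sum_{i,j} \lvert \partial_x d_{ij} \rvert\, e^{-\lvert \partial_x I_{ij} \rvert} + \lvert \partial_y d_{ij} \rvert\, e^{-\lvert \partial_y I_{ij} \rvert}
```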
[0074] In Equation 2, the decoder loss component L.sub.dec3 is defined by Equation 7. Here, the decoder loss component is associated with the third decoder 123.
L.sub.dec3=(1−M).Math.L.sub.a(I.sub.L, I.sub.L3′)+L.sub.da(d.sub.s, d.sub.m)+λ.Math.L.sub.s(d.sub.m, I.sub.L) [Equation 7]
[0075] In Equation 7, the decoder loss component L.sub.dec3 is expressed as the sum of the fourth image reconstruction loss function L.sub.a between the first image I.sub.L and the fourth reconstruction image I.sub.L3′, the fourth smoothness loss function L.sub.s between the second difference map d.sub.m and the first image I.sub.L, and the difference assignment loss function L.sub.da between the first difference map d.sub.s and the second difference map d.sub.m.
[0076] In
[0077] The calculation method of the fourth image reconstruction loss function L.sub.a and the fourth smoothness loss function L.sub.s is the same as described above.
[0078] The transformation circuit 144 transforms the third image I.sub.R2 according to the second difference map d.sub.m to generate a fourth reconstructed image I.sub.L3′.
[0079] In Equation 7, (1−M) indicates that pixels in the masking region (also referred to as the occlusion region) do not affect the image reconstruction loss, and the difference assignment loss L.sub.da is considered in the masking region.
[0080] In order for the second difference map d.sub.m to follow the first difference map d.sub.s in the masking region, that is, to minimize the value of the difference assignment loss function L.sub.da, only the weights of the third decoder 123 are adjusted. Accordingly, the first difference map d.sub.s is not affected by the difference assignment loss function L.sub.da.
[0081] In Equation 7, the difference assignment loss function L.sub.da is defined by Equation 8.
[0082] In this embodiment, β is set to 0.85, and r is the ratio of the large baseline to the small baseline.
[0083] By using r, the scale of the first difference map d.sub.s can be adjusted to the scale of the second difference map d.sub.m. For example, when the small baseline is 1 mm and the large baseline is 5 mm, the difference range of the second difference map d.sub.m is 5 times the difference range of the first difference map d.sub.s, and the ratio r is set to 5.
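Equation 8 does not survive in this text. Given that β plays the same weighting role in L.sub.da that α plays in L.sub.a, and that r rescales the first difference map to the scale of the second, one plausible form, offered strictly as an assumption, restricts an SSIM-plus-absolute-difference comparison of r·d.sub.s and d.sub.m to the masking region:

```latex
L_{da}(d_s, d_m) = \frac{1}{N} \sum_{i,j} M_{ij} \left[ \beta\,\frac{1 - \mathrm{SSIM}(r\,d_{s,ij},\, d_{m,ij})}{2} + (1 - \beta)\,\lvert r\,d_{s,ij} - d_{m,ij} \rvert \right]
```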
[0084] Although various embodiments have been illustrated and described, various changes and modifications may be made to the described embodiments without departing from the spirit and scope of the invention as defined by the following claims.