MONOCULAR UNSUPERVISED DEPTH ESTIMATION METHOD BASED ON CONTEXTUAL ATTENTION MECHANISM

Abstract

The present invention provides a monocular unsupervised depth estimation method based on a contextual attention mechanism, belonging to the technical field of image processing and computer vision. The invention adopts a depth estimation method based on a hybrid geometric enhancement loss function and a context attention mechanism, and uses a depth estimation sub-network, an edge sub-network and a camera pose estimation sub-network, all based on convolutional neural networks, to obtain high-quality depth maps. The convolutional neural networks produce the corresponding high-quality depth map from monocular image sequences in an end-to-end manner. The system is easy to construct, the program framework is easy to implement, and the algorithm runs fast; because the method solves for depth in an unsupervised manner, it avoids the difficulty of obtaining ground-truth data that supervised methods face.

Claims

1. An unsupervised method for monocular depth estimation based on a contextual attention mechanism, comprising the following steps: (1) preparing initial data, the initial data including the monocular video sequence used for training and the single image or sequence used for testing; (2) constructing a depth estimation sub-network and an edge estimation sub-network and constructing a context attention mechanism: (2-1) using an encoder-decoder structure, in which a residual network containing the residual structure is used as the main structure of the encoder to convert the input color map into a feature map; the depth estimation sub-network and the edge estimation sub-network share the encoder but have their own decoders, so that each outputs its respective features; the decoders contain deconvolution layers for up-sampling the feature map and converting the feature map into a depth map or an edge map; (2-2) constructing the context attention mechanism into the decoder of the depth estimation sub-network; (3) constructing the camera pose sub-network: the camera pose sub-network contains an average pooling layer and more than five convolutional layers, and except for the last convolutional layer, all other convolutional layers adopt batch normalization and the ReLU activation function; (4) constructing the discriminator structure: the discriminator structure contains more than five convolutional layers, each of which uses batch normalization and the Leaky-ReLU activation function, followed by a final fully connected layer; (5) constructing a loss function based on hybrid geometric enhancement; (6) training the whole network composed of (2), (3) and (4); the supervision adopts the loss function based on hybrid geometric enhancement constructed in step (5) to gradually optimize the network parameters; after training, the trained model is tested on the test set to obtain the output result for the corresponding input image.

2. The unsupervised method for monocular depth estimation based on contextual attention mechanism according to claim 1, wherein the construction of the context attention mechanism in step (2-2) specifically includes the following steps: the context attention mechanism is added to the front end of the decoder of the depth estimation network; the feature map obtained by the previous encoder network is $A \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, $C$ respectively represent the height, width, and number of channels; at first, transform $A$ into $B \in \mathbb{R}^{N \times C}$ ($N = H \times W$), and then multiply $B$ and its transposed matrix $B^T$; the result can get the spatial attention map $S \in \mathbb{R}^{N \times N}$ or channel attention map $S \in \mathbb{R}^{C \times C}$ after the softmax activation function operation, that is, $S = \mathrm{softmax}(BB^T)$ or $S = \mathrm{softmax}(B^T B)$; next, perform matrix multiplication on $S$ and $B$ and transform them into $U \in \mathbb{R}^{H \times W \times C}$ and finally add the original feature map $A$ and $U$ pixel by pixel to get the final feature output $A_a$.

3. The unsupervised method for monocular depth estimation based on contextual attention mechanism according to claim 1, wherein the construction of a loss function based on hybrid geometric enhancement specifically includes the following steps: (5-1) designing the photometric loss function $L_p$; use the depth map information and the camera pose to obtain the source frame image coordinates from the target frame image coordinates, and establish the projection relationship between adjacent frames; the formula is:
$p_s = K T_{t \to s} D_t(p_t) K^{-1} p_t$
where $K$ is the camera calibration parameter matrix, $K^{-1}$ is the inverse matrix of the parameter matrix, $D_t$ is the predicted depth map, $s$ and $t$ represent the source frame and the target frame, respectively; $T_{t \to s}$ is the camera pose information from $t$ to $s$, $p_s$ is the image coordinate of the source frame, and $p_t$ is the image coordinate of the target frame; the source frame image $I_s$ is warped to the viewpoint of the target frame to obtain the synthesized image $\hat{I}_{s \to t}$, which is expressed as follows:
$\hat{I}_{s \to t}(p_t) = I_s(p_s) = \sum_{j \in \{t, b, l, r\}} w^j I_s(p_s^j)$
where $w^j$ is the linear interpolation coefficient, and the value is 1/4; $p_s^j$ is the adjacent pixel in $p_s$, $j \in \{t, b, l, r\}$ represents the 4-neighborhood, and $t$, $b$, $l$, $r$ represent the top, bottom, left and right ends of the coordinate position; $L_p$ is defined as follows:
$L_p = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} M_t^*(p_t) \left\| I_t(p_t) - \hat{I}_{s \to t}(p_t) \right\|$
where $N$ represents the number of images per training, the effective mask $M_t^* = 1 - M$, $M$ is defined as $M = I(\xi \geq 0)$, where $I$ is the indicator function, and the definition of $\xi$ is $\xi = \| D_t - \check{D}_t \|^2 - (\eta_1 \| \check{D}_t \|^2 + \eta_2)$, where $\eta_1$ and $\eta_2$ are weight coefficients set to 0.01 and 0.5 respectively; $\check{D}_t$ is a depth map generated by warping the depth map $D_t$ of the target frame; (5-2) designing the spatial smoothness loss function $L_s$, used to process the depth value of low-texture areas, the formula is as follows:
$L_s = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} \left( \| \nabla_x^2 D_t(p_t) \| e^{-\gamma | E_t(p_t) |} + \| \nabla_y^2 D_t(p_t) \| e^{-\gamma | E_t(p_t) |} \right)$
where the parameter $\gamma$ is set to 10, $E_t$ is the output result of the edge sub-network, and $\nabla_x^2$ and $\nabla_y^2$ are the second-order gradients in the x and y directions of the coordinate system, respectively; to avoid getting trivial solutions, design the edge regularization loss function $L_e$, the formula is as follows:
$L_e = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} \| E_t(p_t) \|^2$
(5-3) designing the left and right consistency loss function $L_d$ to eliminate the error caused by occlusion between the viewpoints; the formula is as follows:
$L_d = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} \| D_t(p_t) - \check{D}_t(p_t) \|$
(5-4) the discriminator uses the adversarial loss function when distinguishing real images and synthetic images; the combination of the depth network, edge network, and camera pose network is regarded as the generator, and the final synthesized image is sent to the discriminator together with the real input image to get better results; the adversarial loss function formula is as follows:
$L_{Adv} = \frac{1}{N} \sum_{t=1}^{N} \left\{ \mathbb{E}_{I_t \sim P(I_t)} [\log \mathcal{D}(I_t)] + \mathbb{E}_{\hat{I}_{s \to t} \sim P(\hat{I}_{s \to t})} [\log (1 - \mathcal{D}(\hat{I}_{s \to t}))] \right\}$
where $P(*)$ represents the probability distribution of the data $*$, $\mathbb{E}$ represents the expectation, and $\mathcal{D}$ represents the discriminator; this adversarial loss function prompts the generator to learn the mapping of synthetic data to real data, so that the synthetic image is similar to the real image; (5-5) the loss function of the overall network structure is defined as follows:
$L = \alpha_1 L_p + \alpha_2 L_s + \alpha_3 L_e + \alpha_4 L_d + \alpha_5 L_{Adv}$
where $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ and $\alpha_5$ are the weight coefficients.

Description

DESCRIPTION OF DRAWINGS

[0023] FIG. 1 is the structure diagram of the convolutional neural network proposed by the present invention.

[0024] FIG. 2 is the structure diagram of the attention mechanism.

[0025] FIG. 3 shows the results: (a) input color image; (b) ground-truth depth map; (c) results of the present invention.

DETAILED DESCRIPTION

[0026] The present invention proposes a monocular unsupervised depth estimation method based on a context attention mechanism, which is described in detail with reference to the drawings and embodiments as follows:

[0027] The method includes the following steps:

[0028] (1) preparing initial data:

[0029] (1-1) two public datasets, the KITTI dataset and the Make3D dataset, are used to evaluate the invention;

[0030] (1-2) the KITTI dataset is used for training and testing of the present invention. It has a total of 40,000 training samples, 4,000 validation samples, and 697 test samples. During training, the original images, with a resolution of 375×1242, are scaled to 128×416. The length of the input image sequence during training is set to 3; the middle frame is the target view and the other frames are the source views.

[0031] (1-3) the Make3D dataset is mainly used to test the generalization performance of the present invention on different datasets. The Make3D dataset has a total of 400 training samples and 134 test samples. Here, the present invention uses only the test set of the Make3D dataset, and the model is trained on the KITTI dataset. The resolution of the original images in the Make3D dataset is 2272×1704. By cropping the central area, the image resolution is changed to 525×1704 so that the samples have the same aspect ratio as the KITTI samples, and the images are then scaled to 128×416 as input for network testing.
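The Make3D preprocessing in (1-3) can be sketched as follows. This is a minimal NumPy illustration, assuming the resolutions are given as height × width; the function names `center_crop_rows` and `resize_nearest` are hypothetical, and nearest-neighbor resampling is an assumption, since the interpolation used for scaling is not specified.

```python
import numpy as np

def center_crop_rows(image, crop_h):
    """Keep the central crop_h rows of an (H, W, C) image."""
    h = image.shape[0]
    top = (h - crop_h) // 2
    return image[top:top + crop_h]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize (assumed; the interpolation is not specified)."""
    h, w = image.shape[:2]
    rows = (np.arange(out_h) * h) // out_h   # source row for each output row
    cols = (np.arange(out_w) * w) // out_w   # source column for each output column
    return image[rows][:, cols]

# Make3D image (2272 x 1704) -> central 525 x 1704 crop -> 128 x 416 network input
make3d = np.zeros((2272, 1704, 3), dtype=np.uint8)
cropped = center_crop_rows(make3d, 525)
net_input = resize_nearest(cropped, 128, 416)
```

The crop changes only the height, which brings the aspect ratio close to that of the 128×416 network input before scaling.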

[0032] (1-4) the input during the test can be either a sequence of images with the length of 3 or a single image.

[0033] (2) the construction of depth estimation sub-network and edge sub-network and the construction of context attention mechanism:

[0034] (2-1) as shown in FIG. 1, the architecture of the depth estimation and edge estimation networks is based on the encoder-decoder structure (N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in: IEEE CVPR, 2016, pp. 4040-4048). Specifically, the encoder adopts a residual network with a 50-layer residual structure (ResNet50), which converts the input color map into feature maps and obtains multi-scale features by using convolutional layers with a stride of 2 to downsample the feature map layer by layer. To reduce the number of training parameters, the depth estimation network and the edge network share one encoder, while each has its own decoder to output its own features. The network structure of each decoder is symmetrical to that of the encoder. It mainly contains deconvolution layers, which infer the final depth map or edge map by gradually up-sampling the feature map. To enhance the feature expression ability of the network, the encoder-decoder structure uses skip connections to connect encoder and decoder feature maps that have the same spatial dimensions.

[0035] The context attention mechanism is added to the front end of the decoder of the depth estimation network; the context attention mechanism is shown in FIG. 2. The feature map obtained by the preceding encoder network is $A \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, $C$ respectively represent the height, width, and number of channels. First, $A$ is reshaped into $B \in \mathbb{R}^{N \times C}$ ($N = H \times W$), and $B$ is multiplied with its transpose $B^T$. After the softmax activation function, the result is the spatial attention map $S \in \mathbb{R}^{N \times N}$ or the channel attention map $S \in \mathbb{R}^{C \times C}$, that is, $S = \mathrm{softmax}(BB^T)$ or $S = \mathrm{softmax}(B^T B)$. Next, matrix multiplication is performed on $S$ and $B$, the product is reshaped into $U \in \mathbb{R}^{H \times W \times C}$, and finally the original feature map $A$ and $U$ are added pixel by pixel to get the final feature output $A_a$. Experiments show that adding this attention mechanism at the front of the depth estimation sub-network decoder significantly improves the results. Beyond that, adding the mechanism to the other networks hardly improves the results and significantly increases the number of network parameters.
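The attention computation described above can be sketched in NumPy as follows. This is an illustrative sketch, not the implementation of the invention: the function name `context_attention` is hypothetical, the softmax is assumed to be applied row-wise (the axis is not stated), and for the channel case the product is taken as $BS$ so that the shapes are compatible.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_attention(A, mode="spatial"):
    """A: feature map of shape (H, W, C). Returns A_a = A + U, same shape."""
    H, W, C = A.shape
    B = A.reshape(H * W, C)                 # B in R^{N x C}, N = H * W
    if mode == "spatial":
        S = softmax(B @ B.T, axis=-1)       # S in R^{N x N}
        U = S @ B                           # (N, N) @ (N, C) -> (N, C)
    elif mode == "channel":
        S = softmax(B.T @ B, axis=-1)       # S in R^{C x C}
        U = B @ S                           # (N, C) @ (C, C) -> (N, C), assumed order
    else:
        raise ValueError("mode must be 'spatial' or 'channel'")
    return A + U.reshape(H, W, C)           # pixel-wise residual addition

features = np.random.default_rng(0).normal(size=(4, 5, 3))
attended = context_attention(features, mode="spatial")
```

Note that the spatial attention map has $N \times N$ entries, so its memory cost grows quadratically with the number of pixels; this is one reason to apply it to the low-resolution feature map at the front of the decoder.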

[0036] (3) construction of camera pose network:

[0037] the camera pose network is mainly used to estimate the pose transformation between two adjacent frames, where the pose transformation refers to the displacement and rotation of the corresponding position between the two adjacent frames. The camera pose network consists of an average pooling layer and eight convolutional layers. Except for the last convolutional layer, all other convolutional layers use batch normalization (BN) and ReLU (Rectified Linear Unit) activation functions.

[0038] (4) construction of the discriminator structure:

[0039] the discriminator is mainly used to judge the authenticity of the color map, that is, to determine whether it is a real color map or a synthesized color map. Its purpose is to enhance the ability of the network to synthesize color maps, thereby indirectly improving the quality of depth estimation. The discriminator structure contains five convolutional layers, each of which uses batch normalization and the Leaky-ReLU activation function, followed by a final fully connected layer.

[0040] (5) to address the difficulty that ordinary unsupervised loss functions have in producing high-quality results in edge, occlusion and low-texture areas, the present invention constructs a loss function based on hybrid geometric enhancement to train the network.

[0041] (5-1) designing the photometric loss function $L_p$; use the depth map information and the camera pose to obtain the source frame image coordinates from the target frame image coordinates, and establish the projection relationship between adjacent frames; the formula is:

$p_s = K T_{t \to s} D_t(p_t) K^{-1} p_t$

[0042] where $K$ is the camera calibration parameter matrix, $K^{-1}$ is the inverse matrix of the parameter matrix, $D_t$ is the predicted depth map, $s$ and $t$ represent the source frame and the target frame, respectively; $T_{t \to s}$ is the camera pose information from $t$ to $s$, $p_s$ is the image coordinate of the source frame, and $p_t$ is the image coordinate of the target frame; the source frame image $I_s$ is warped to the viewpoint of the target frame to obtain the synthesized image $\hat{I}_{s \to t}$, which is expressed as follows:

[00001] $\hat{I}_{s \to t}(p_t) = I_s(p_s) = \sum_{j \in \{t, b, l, r\}} w^j I_s(p_s^j)$

[0043] where $w^j$ is the linear interpolation coefficient, and the value is 1/4; $p_s^j$ is the adjacent pixel in $p_s$, $j \in \{t, b, l, r\}$ represents the 4-neighborhood, and $t$, $b$, $l$, $r$ represent the top, bottom, left and right ends of the coordinate position;
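The projection relationship $p_s = K T_{t \to s} D_t(p_t) K^{-1} p_t$ can be illustrated for a single pixel as follows. This is a sketch under stated assumptions: the pose is represented as a 4×4 homogeneous transform, the pixel is lifted to homogeneous coordinates, and the result is divided by its third component to return to pixel coordinates (a step that is implicit in the formula). The function name `project_to_source` and the intrinsics used in the example are hypothetical.

```python
import numpy as np

def project_to_source(p_t, depth, K, T_t2s):
    """Map a target-frame pixel p_t = (u, v) with depth D_t(p_t) to the source frame."""
    p_h = np.array([p_t[0], p_t[1], 1.0])          # homogeneous pixel coordinate
    cam_t = depth * (np.linalg.inv(K) @ p_h)       # back-project: D_t(p_t) K^{-1} p_t
    cam_s = (T_t2s @ np.append(cam_t, 1.0))[:3]    # move the 3D point into the source camera
    pix = K @ cam_s                                # project with the intrinsics
    return pix[:2] / pix[2]                        # perspective division (assumed)

# Hypothetical intrinsics: focal length 100 px, principal point (50, 40)
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 40.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 1.0                                      # translate 1 unit along x
p_s = project_to_source((60.0, 45.0), 5.0, K, T)
```

With an identity pose the pixel maps to itself; with the translation above, the point at depth 5 shifts by 100 × 1 / 5 = 20 pixels along x.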

[0044] $L_p$ is defined as follows:

[00002] $L_p = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} M_t^*(p_t) \left\| I_t(p_t) - \hat{I}_{s \to t}(p_t) \right\|$

[0045] where $N$ represents the number of images per training iteration, the effective mask $M_t^* = 1 - M$, $M$ is defined as $M = I(\xi \geq 0)$, where $I$ is the indicator function, and the definition of $\xi$ is $\xi = \| D_t - \check{D}_t \|^2 - (\eta_1 \| D_t \|^2 + \eta_1 \| \check{D}_t \|^2 + \eta_2)$, where $\eta_1$ and $\eta_2$ are weight coefficients set to 0.01 and 0.5 respectively; $\check{D}_t$ is a depth map generated by warping the depth map $D_t$ of the target frame;
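A minimal NumPy sketch of the effective mask and the masked photometric loss is given below. It assumes that the norms are evaluated per pixel (so the squared norm of a depth value is its square), that $\xi = \| D_t - \check{D}_t \|^2 - (\eta_1 \| D_t \|^2 + \eta_1 \| \check{D}_t \|^2 + \eta_2)$, and that the image difference is an L1 norm summed over color channels; the function names `valid_mask` and `photometric_loss` are hypothetical.

```python
import numpy as np

def valid_mask(D_t, D_warp, eta1=0.01, eta2=0.5):
    """Effective mask M* = 1 - I(xi >= 0) for depth maps of shape (N, H, W)."""
    xi = (D_t - D_warp) ** 2 - (eta1 * D_t ** 2 + eta1 * D_warp ** 2 + eta2)
    M = (xi >= 0).astype(np.float64)       # 1 where the two depths disagree too much
    return 1.0 - M                         # 1 where the pixel is kept

def photometric_loss(I_t, I_warp, mask):
    """I_t, I_warp: (N, H, W, 3) images; mask: (N, H, W). Returns the scalar L_p."""
    diff = np.abs(I_t - I_warp).sum(axis=-1)        # assumed L1 norm over channels
    return (mask * diff).sum(axis=(1, 2)).mean()    # sum over pixels, mean over images

D_t = np.ones((1, 2, 2))
D_warp = D_t.copy()
D_warp[0, 0, 0] = 10.0                              # one strongly inconsistent pixel
mask = valid_mask(D_t, D_warp)
loss = photometric_loss(np.zeros((1, 2, 2, 3)), np.ones((1, 2, 2, 3)), mask)
```

In this toy example the inconsistent pixel is masked out, so only three of the four pixels contribute to the loss.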

[0046] (5-2) designing the spatial smoothness loss function $L_s$, used to process the depth value of low-texture areas, the formula is as follows:

[00003] $L_s = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} \left( \| \nabla_x^2 D_t(p_t) \| e^{-\gamma | E_t(p_t) |} + \| \nabla_y^2 D_t(p_t) \| e^{-\gamma | E_t(p_t) |} \right)$

[0047] where the parameter $\gamma$ is set to 10, $E_t$ is the output result of the edge sub-network, and $\nabla_x^2$ and $\nabla_y^2$ are the second-order gradients in the x and y directions of the coordinate system, respectively; to avoid getting trivial solutions, design the edge regularization loss function $L_e$, the formula is as follows:

[00004] $L_e = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} \| E_t(p_t) \|^2$
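The two losses above can be sketched in NumPy as follows. The sketch assumes that the second-order gradients are approximated by central finite differences and that the norm of a scalar is its absolute value; the function names are hypothetical. It also makes the role of $L_e$ visible: if the edge map were large everywhere, the weights $e^{-\gamma |E_t|}$ would drive $L_s$ to zero regardless of the depth map, and the penalty on $\| E_t \|^2$ discourages that trivial solution.

```python
import numpy as np

def smoothness_loss(D, E, gamma=10.0):
    """Edge-aware second-order smoothness for depth D and edge map E, both (N, H, W)."""
    d2x = D[:, :, 2:] - 2.0 * D[:, :, 1:-1] + D[:, :, :-2]   # second difference along x
    d2y = D[:, 2:, :] - 2.0 * D[:, 1:-1, :] + D[:, :-2, :]   # second difference along y
    wx = np.exp(-gamma * np.abs(E[:, :, 1:-1]))              # weights at the center pixels
    wy = np.exp(-gamma * np.abs(E[:, 1:-1, :]))
    per_image = (np.abs(d2x) * wx).sum(axis=(1, 2)) + (np.abs(d2y) * wy).sum(axis=(1, 2))
    return per_image.mean()

def edge_regularization(E):
    """L_e: squared edge response summed over pixels, averaged over images."""
    return (E ** 2).sum(axis=(1, 2)).mean()

# Depth that is quadratic in x: the second difference along x is 2 everywhere
x = np.arange(5, dtype=np.float64)
D = np.broadcast_to(x ** 2, (1, 3, 5)).copy()
L_s = smoothness_loss(D, np.zeros((1, 3, 5)))
L_e = edge_regularization(np.full((2, 3, 5), 0.5))
```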

[0048] (5-3) designing the left and right consistency loss function $L_d$ to eliminate the error caused by occlusion between the viewpoints; the formula is as follows:

[00005] $L_d = \frac{1}{N} \sum_{t=1}^{N} \sum_{p_t} \| D_t(p_t) - \check{D}_t(p_t) \|$

[0049] (5-4) the discriminator uses the adversarial loss function when distinguishing real images and synthetic images; the combination of the depth network, edge network, and camera pose network is regarded as the generator, and the final synthesized image is sent to the discriminator together with the real input image to get better results; the adversarial loss function formula is as follows:

[00006] $L_{Adv} = \frac{1}{N} \sum_{t=1}^{N} \left\{ \mathbb{E}_{I_t \sim P(I_t)} [\log \mathcal{D}(I_t)] + \mathbb{E}_{\hat{I}_{s \to t} \sim P(\hat{I}_{s \to t})} [\log (1 - \mathcal{D}(\hat{I}_{s \to t}))] \right\}$

[0050] where $P(*)$ represents the probability distribution of the data $*$, $\mathbb{E}$ represents the expectation, and $\mathcal{D}$ represents the discriminator; this adversarial loss function prompts the generator to learn the mapping of synthetic data to real data, so that the synthetic image is similar to the real image;
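As a rough illustration, the adversarial term can be estimated from discriminator outputs on a batch. The sketch below assumes that the discriminator outputs probabilities in (0, 1), that each expectation is replaced by its single-sample estimate per image, and that the probabilities are clipped for numerical safety; the function name `adversarial_loss` and the clipping constant are hypothetical.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-7):
    """d_real = D(I_t), d_fake = D(I_hat_{s->t}); arrays of shape (N,) with values in (0, 1)."""
    d_real = np.clip(d_real, eps, 1.0 - eps)   # avoid log(0)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real) + np.log(1.0 - d_fake))

# A discriminator that cannot tell real from synthesized images outputs 0.5 for both
undecided = adversarial_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
# A confident, correct discriminator yields a value closer to zero
confident = adversarial_loss(np.array([0.99]), np.array([0.01]))
```

In the usual adversarial setup the discriminator is trained to increase this quantity while the generator is trained to decrease it; how the two updates are scheduled is not detailed here.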

[0051] (5-5) the loss function of the overall network structure is defined as follows:

$L = \alpha_1 L_p + \alpha_2 L_s + \alpha_3 L_e + \alpha_4 L_d + \alpha_5 L_{Adv}$

[0052] where $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ and $\alpha_5$ are the weight coefficients.
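The consistency term $L_d$ and the weighted combination can be sketched as follows. The function names are hypothetical, the norm is assumed to be a per-pixel absolute difference, and the weight values in the example are placeholders chosen only for illustration, since no values for the weight coefficients are given.

```python
import numpy as np

def consistency_loss(D_t, D_warp):
    """L_d: absolute difference between D_t and the warped depth, both (N, H, W)."""
    return np.abs(D_t - D_warp).sum(axis=(1, 2)).mean()

def total_loss(L_p, L_s, L_e, L_d, L_adv, alphas):
    """Weighted sum of the five terms; alphas = (a1, a2, a3, a4, a5)."""
    terms = (L_p, L_s, L_e, L_d, L_adv)
    return sum(a * term for a, term in zip(alphas, terms))

L_d = consistency_loss(np.ones((2, 2, 2)), np.zeros((2, 2, 2)))
# Placeholder weights, not values from the source
L = total_loss(1.0, 2.0, 3.0, L_d, 5.0, alphas=(1.0, 0.5, 0.0, 0.25, 0.0))
```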

[0053] (6) the convolutional neural networks obtained from (2), (3) and (4) are combined into the network structure shown in FIG. 1, and joint training is then performed. The data augmentation strategy proposed in (A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097-1105) is used to augment the initial data and reduce over-fitting. The supervision adopts the hybrid geometric enhancement loss function constructed in (5) to iteratively optimize the network parameters. During training, the batch size is set to 4, the Adam optimization method with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ is used, and the initial learning rate is set to 1e−4. When training is completed, the trained model can be tested on the test set to obtain the output result for the corresponding input image.
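For reference, one parameter update with the stated optimizer settings ($\beta_1 = 0.9$, $\beta_2 = 0.999$, learning rate 1e−4) can be written out as below. This is a sketch of the standard Adam update rule, not code from the invention; the function name `adam_step` is hypothetical and the stabilizing constant `eps=1e-8` is a commonly used default rather than a value given here.

```python
import numpy as np

def adam_step(param, grad, m, v, step, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; step counts from 1. Returns the new (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)               # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# First update of a single weight with gradient 2.0
w, m, v = adam_step(np.array([1.0]), np.array([2.0]), m=0.0, v=0.0, step=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly the learning rate.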

[0054] The final result of this implementation is shown in FIG. 3, where (a) is the input color map, (b) is the ground-truth depth map and (c) is the output depth map result of the present invention.

Cpc classification

PHYSICS: G06T2207/10016, G06V10/454, G06T7/74, G06N3/08, G06N3/045, G06T7/50, G06N3/048, G06T7/75, G06N3/088, G06V10/82, G06T2207/20016, G06T2207/10024, G06T9/002, G06T2207/20084, G06N3/04, G06T2207/20081, G06F18/2132, G06T7/564, G06F18/2193

International classification

PHYSICS: G06T7/564, G06K9/62, G06N3/04, G06N3/08, G06T7/73, G06T9/00