TIME-OF-FLIGHT DEPTH ENHANCEMENT

Abstract

An image processing system configured to receive an input time-of-flight depth map representing the distance of objects in an image from a camera at a plurality of locations of pixels in the respective image, and in dependence on that map to generate an improved time-of-flight depth map for the image, the input time-of-flight depth map having been generated from at least one correlation image representing the overlap between emitted and reflected light signals at the plurality of locations of pixels at a given phase shift, the system being configured to generate the improved time-of-flight depth map from the input time-of-flight depth map in dependence on a colour representation of the respective image and at least one correlation image.

Claims

1. An image processing system configured to receive an input time-of-flight depth map representing the distance of objects in an image from a camera at a plurality of locations of pixels in the respective image, and in dependence on that map to generate an improved time-of-flight depth map for the image, the input time-of-flight depth map having been generated from at least one correlation image representing the overlap between emitted and reflected light signals at the plurality of locations of pixels at a given phase shift, the system being configured to generate the improved time-of-flight depth map from the input time-of-flight depth map in dependence on a colour representation of the respective image and at least one correlation image.

2. The image processing system of claim 1, wherein the colour representation of the respective image has a higher resolution than the input time-of-flight depth map and/or the at least one correlation image.

3. The image processing system of claim 1, wherein the system is configured to generate the improved time-of-flight depth map by means of a trained artificial intelligence model.

4. The image processing system of claim 3, wherein the trained artificial intelligence model is an end-to-end trainable neural network.

5. The image processing system of claim 3, wherein the model is trained using at least one of: input time-of-flight depth maps, correlation images and colour representations of images.

6. The image processing system of claim 3, wherein the system is configured to combine the input time-of-flight depth map with the at least one correlation image to form a correlation-enriched time-of-flight depth map.

7. The image processing system of claim 6, wherein the system is configured to generate the improved time-of-flight depth map by hierarchically upsampling the correlation-enriched time-of-flight depth map in dependence on the colour representation of the respective image.

8. The image processing system of claim 6, wherein the improved time-of-flight depth map has a higher resolution than the input time-of-flight depth map.

9. The image processing system of claim 8, wherein the colour representation of the respective image is a colour-separated representation.

10. A method for generating an improved time-of-flight depth map for an image in dependence on an input time-of-flight depth map representing the distance of objects in the image from a camera at a plurality of locations of pixels in the respective image, the input time-of-flight depth map having been generated from at least one correlation image representing the overlap between emitted and reflected light signals at the plurality of locations of pixels at a given phase shift, the method comprising generating the improved time-of-flight depth map from the input time-of-flight depth map in dependence on a colour representation of the respective image and at least one correlation image.

11. The method of claim 10, wherein the colour representation of the respective image has a higher resolution than the input time-of-flight depth map and/or the at least one correlation image.

12. The method of claim 10, wherein the method comprises generating the improved time-of-flight depth map by means of a trained artificial intelligence model.

13. The method of claim 12, wherein the trained artificial intelligence model is an end-to-end trainable neural network.

14. The method of claim 10, the method further comprising combining the input time-of-flight map with the at least one correlation image to form a correlation-enriched time-of-flight depth map.

15. The method of claim 14, the method further comprising hierarchically upsampling the correlation-enriched time-of-flight depth map in dependence on the colour representation of the respective image.

Description

BRIEF DESCRIPTION OF THE FIGURES

[0028] The present application will now be described by way of example with reference to the accompanying drawings. In the drawings:

[0029] FIGS. 1(a) and 1(b) illustrate the acquisition of ToF depth data.

[0030] FIG. 2(a) illustrates a photographic image. FIG. 2(b) illustrates a ToF depth map corresponding to the image of FIG. 2(a).

[0031] FIG. 3 illustrates an overview of an example of a pipeline for processing a ToF depth map.

[0032] FIG. 4 illustrates an exemplary overview of a pipeline for processing a ToF depth map. A shallow encoder takes RAW correlation images as input. During the decoding stage, noisy ToF depth data is injected and upsampled to four times the original resolution with RGB guidance.

[0033] FIG. 5 illustrates results of the proposed pipeline for ToF upsampling with multi-modality guided upsampling (GU).

[0034] FIGS. 6(a)-(c) illustrate exemplary results for different scenes.

[0035] FIGS. 7(a)-(j) show results obtained using the proposed pipeline and results obtained using classical upsampling with a U-Net without multi-modality guidance for comparison.

[0036] FIGS. 8(a)-(c) show an ablation study, with a comparison between images processed using guided upsampling only, depth injection only, and images processed using the multi-modality approach of the present application.

[0037] FIGS. 8(d) and 8(e) show ground truth images and RGB images.

[0038] FIG. 9 shows an example of a camera configured to use the pipeline of the present application to process images taken by the camera.

DETAILED DESCRIPTION OF THE APPLICATION

[0039] FIG. 3 shows an overview of an exemplary pipeline for generating an enhanced ToF depth map. The pipeline of FIG. 3 comprises an end-to-end trainable neural network. The pipeline takes as an input a ToF depth map 301 of relatively low resolution and quality or density (compared to the output ToF depth map 305). The input time-of-flight depth map 301 represents the distance of objects in an image from a camera at a plurality of locations of pixels in the respective image.

[0040] The input time-of-flight depth map 301 is generated from at least one RAW correlation image representing the overlap between emitted and reflected light signals at the plurality of locations of pixels at a given phase shift. As is well known in the art, using the speed of light, this RAW correlation image data is processed to generate the input ToF depth map. This processing of the RAW correlation data to form the input ToF depth map may be performed separately from the pipeline, or in an initialisation step of the pipeline. The noisy ToF input depth 301 is fed into the learning framework (labelled ToF upsampling, ToFU), indicated at 302.
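
By way of a hedged illustration of the processing mentioned above, the following sketch uses the widely used four-phase continuous-wave formulation, which the present text does not itself specify. Four correlation samples, taken at phase shifts of 0°, 90°, 180° and 270°, yield a phase offset that is converted to a distance using the speed of light. The function names, the correlation model C(θ) = A·cos(φ − θ) + B, the sample ordering and the 20 MHz modulation frequency are assumptions made for illustration only.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def depth_from_correlations(c0, c90, c180, c270, mod_freq_hz):
    """Estimate the depth in metres of one pixel from four correlation samples.

    Assumes the model C(theta) = A * cos(phi - theta) + B, sampled at
    theta = 0, 90, 180 and 270 degrees (an assumption, not from the source).
    """
    # Differences cancel the offset B; atan2 recovers phi, wrapped to [0, 2*pi).
    phase = math.atan2(c90 - c270, c0 - c180) % (2.0 * math.pi)
    # The light travels to the object and back, hence 4*pi rather than 2*pi.
    return SPEED_OF_LIGHT * phase / (4.0 * math.pi * mod_freq_hz)


def simulate_samples(depth_m, mod_freq_hz, amplitude=0.5, offset=1.0):
    """Generate ideal, noise-free correlation samples for a known depth."""
    phase = 4.0 * math.pi * mod_freq_hz * depth_m / SPEED_OF_LIGHT
    shifts = (0.0, math.pi / 2.0, math.pi, 3.0 * math.pi / 2.0)
    return tuple(amplitude * math.cos(phase - s) + offset for s in shifts)


# Round trip: a pixel at 2.5 m, modulated at 20 MHz, is recovered from its samples.
samples = simulate_samples(2.5, 20e6)
estimated_depth = depth_from_correlations(*samples, 20e6)
```

Under this model the phase wraps every c/(2f), roughly 7.5 m at 20 MHz, which is one reason why far away pixel values in an input ToF depth map may be unreliable.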

[0041] The pipeline also takes as an input a colour representation 303 of the respective image for which the input depth map 301 has been generated. In this example, the colour representation is a colour-separated representation, specifically an RGB image. However, the colour representation may comprise one or more channels.

[0042] The pipeline also takes as an input at least one RAW correlation image, as shown at 304. Therefore, multi-modality input data is utilised.

[0043] The system is configured to generate an improved time-of-flight depth map 305 from the input time-of-flight depth map 301 in dependence on the colour representation of the respective image 303 and at least one correlation image 304.

[0044] The system and method will now be described in more detail below with reference to FIG. 4.

[0045] In this example, an end-to-end neural network comprises an encoder-decoder convolutional neural network with a shallow encoder 401 and a decoder 402 with guided upsampling and depth injection. The shallow encoder 401 takes RAW correlation images 403 as input. The network encodes the RAW correlation information 403 at the original resolution of 1/1 from the ToF sensor to extract deep features for depth prediction.

[0046] During the decoding stage, the input ToF depth data, shown at 404 (which may be noisy and corrupted), is injected (i.e. combined with the RAW correlation data) at the original resolution 1/1 and is then hierarchically upsampled to four times the original resolution with RGB guidance. The input ToF depth information is injected in the decoder at the ToF input resolution stage, which helps the network to predict depth information with metric scale.

[0047] RGB images at 2× and 4× the original resolution of the ToF depth map, shown at 405 and 406 respectively, are utilized during guided upsampling (GU) to support the residual correction of a directly upsampled depth map and to enhance boundary precision at depth discontinuities.
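
The text does not state how the 2× guidance image 405 is obtained. One possibility, assumed here purely for illustration, is to average 2×2 blocks of the 4× image 406. The sketch below builds such a two-level guidance pyramid on nested lists; the function names and the dictionary keys are hypothetical.

```python
def downsample_2x(image):
    """Average each 2x2 block of an H x W x C image given as nested lists."""
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            block = [image[y][x], image[y][x + 1],
                     image[y + 1][x], image[y + 1][x + 1]]
            row.append([sum(px[c] for px in block) / 4.0
                        for c in range(channels)])
        out.append(row)
    return out


def guidance_pyramid(rgb_4x):
    """Return RGB guidance at 4x and 2x the ToF resolution (assumed scheme)."""
    return {"4x": rgb_4x, "2x": downsample_2x(rgb_4x)}


# Example: a 4 x 4 image whose pixels store their own (x, y) coordinates.
rgb = [[[x, y, 0] for x in range(4)] for y in range(4)]
pyramid = guidance_pyramid(rgb)
```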

[0048] The noisy ToF depth data 404 is therefore injected and upsampled to four times the original resolution with RGB guidance to generate an enhanced ToF depth map, shown at 407.

[0049] Co-injection of RGB and RAW correlation image modalities helps to super-resolve the input ToF depth map by leveraging additional information to fill the holes (black areas in the input ToF depth map), predict further away regions and resolve ambiguities due to multi-path reflections.

[0050] The above-described method may reliably recover the depth for the whole scene, despite the depth injection from the input ToF depth map being noisy and corrupted and far away pixel values being invalidated. The guided upsampling helps to improve and sharpen depth discontinuities. In this example, the final output is four times the resolution of the original input ToF depth map. However, the depth map may also be upsampled to higher resolutions.

[0051] In summary, the modalities utilized are as follows:

[0052] INPUT: RAW correlation images (low resolution), input ToF depth map (low resolution) and RGB image (high resolution).

[0053] OUTPUT: Upsampled depth map (high resolution).

The modalities complement each other, and ToFU extracts useful information from each modality in order to produce the final super-resolved output ToF depth map.

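
The relationship between the resolutions of the three modalities may be sketched as a simple shape check. This is an illustration only: the class name, the field names and the example sizes are hypothetical, and the check assumes that the RAW correlation images and the input ToF depth map share one resolution and that the RGB image is exactly four times larger in each dimension, as in the example described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToFUShapes:
    """Spatial sizes (height, width) of the three input modalities."""
    correlation_hw: tuple  # RAW correlation images, low resolution
    depth_hw: tuple        # input ToF depth map, low resolution
    rgb_hw: tuple          # RGB image, high resolution

    def output_hw(self, factor=4):
        """Return the size of the upsampled depth map after checking inputs."""
        if self.correlation_hw != self.depth_hw:
            raise ValueError(
                "correlation images and depth map must share a resolution")
        expected = (self.depth_hw[0] * factor, self.depth_hw[1] * factor)
        if self.rgb_hw != expected:
            raise ValueError(f"RGB image must be {expected}, got {self.rgb_hw}")
        return expected


# Example sizes are arbitrary: a 120 x 160 ToF sensor with a 480 x 640 RGB image.
shapes = ToFUShapes((120, 160), (120, 160), (480, 640))
output_size = shapes.output_hw()
```
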
An exemplary network architecture is described as follows. Other configurations are possible.

Layers for Encoder: 1×2D-Convolutions of RAW Correlation Input (->½ Input Resolution)

[0054] Layers before Injection: 1×2D-UpConvolution (from ½ Input Resolution to 1/1 Input Resolution)

Decoder and Guided Upsampler

[0055] Depth Injection:

[0056] For each Input:

[0057] 2D Conv->BatchNorm->LeakyReLU->ResNetBlock->ResNetBlock

[0058] Concatenation

[0059] 4× ResNetBlock

[0060] Residual=2D-Convolution

[0061] Output=Depth+Residual

Concatenation of UpConvolution (before Injection)+Injection Output

Convolution of Concatenation+Upsampling with Bilinear UpSampling (Depth Prediction at 1× Input Resolution)

Layers before GU 1: 1×2D-UpConvolution of Convolution of Concatenation (from 1/1 Input Resolution to 2× Input Resolution)

Guided Upsampling Stage 1:

[0062] For each Input:

[0063] 2D Conv->BatchNorm->LeakyReLU->ResNetBlock->ResNetBlock

[0064] Concatenation

[0065] 4× ResNetBlock

[0066] Residual=2D-Convolution

[0067] Output=Depth+Residual

Concatenation of UpConvolution+Guided Upsampling Output

[0068] Convolution of Concatenation and Upsampling with Bilinear UpSampling (Depth Prediction at 2× Input Resolution)

Layers before GU 2: 1×2D-UpConvolution of Convolution of Concatenation (from 2× Input Resolution to 4× Input Resolution)

Guided Upsampling Stage 2:

[0069] For each Input:

[0070] 2D Conv->BatchNorm->LeakyReLU->ResNetBlock->ResNetBlock

[0071] Concatenation

[0072] 4× ResNetBlock

[0073] Residual=2D-Convolution

[0074] Output=Depth+Residual

Concatenation of UpConvolution+Guided Upsampling Output

Convolution of Concatenation and Prediction of Depth (Depth Prediction at 4× Input Resolution)
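
The "Output=Depth+Residual" pattern of the stages listed above may be sketched as follows. This is not the described network: the learned branch (convolutions and ResNet blocks over the concatenated depth and RGB features) is replaced by a hypothetical placeholder callable, and only the direct bilinear upsampling and the residual addition are shown. The half-pixel sampling convention and all function names are assumptions.

```python
def bilinear_upsample_2x(depth):
    """Upsample an H x W nested-list depth map to 2H x 2W (half-pixel centres)."""
    h, w = len(depth), len(depth[0])
    out = []
    for i in range(2 * h):
        # Source row coordinate of this output row, clamped to the valid range.
        sy = min(max((i + 0.5) / 2.0 - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for j in range(2 * w):
            sx = min(max((j + 0.5) / 2.0 - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            top = depth[y0][x0] * (1 - wx) + depth[y0][x1] * wx
            bottom = depth[y1][x0] * (1 - wx) + depth[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def refine_stage(depth, predict_residual):
    """One stage: upsample the depth directly, then add a predicted residual."""
    upsampled = bilinear_upsample_2x(depth)
    residual = predict_residual(upsampled)
    return [[d + r for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(upsampled, residual)]


def zero_residual(depth):
    """Hypothetical stand-in for the learned, RGB-guided residual branch."""
    return [[0.0] * len(row) for row in depth]


# Two stages take a 2 x 2 map from 1x to 2x and then to 4x the input resolution.
depth_1x = [[1.0, 2.0], [3.0, 4.0]]
depth_2x = refine_stage(depth_1x, zero_residual)
depth_4x = refine_stage(depth_2x, zero_residual)
```

With a zero residual each stage reduces to plain bilinear upsampling; in the described pipeline the residual would instead correct the upsampled depth, particularly along depth discontinuities.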

[0075] Equations (1)-(4) below describe an exemplary loss function. To train the proposed network, the pixel-wise difference between the predicted inverse depth (used to mimic a disparity) and the ground truth is minimized by exploiting a robust norm for fast convergence, together with a smoothness term:

$$L_{\mathrm{Total}} = \omega_{s} L_{\mathrm{Smooth}} + \omega_{D} L_{\mathrm{Depth}} \qquad (1)$$

where:

$$L_{\mathrm{Smooth}} = \sum \left|\nabla D(p)\right|^{T} e^{-\left|\nabla I(p)\right|} \qquad (2)$$

and:

$$L_{\mathrm{Depth}} = \sum \omega_{\mathrm{Scale}} \left|D(p) - D_{\mathrm{Pred}}(p)\right|_{\mathrm{Barron}} \qquad (3)$$

where $|\cdot|_{\mathrm{Barron}}$ is the Barron Loss as proposed in Barron, “A General and Adaptive Robust Loss Function”, CVPR 2019, in the special form of a smoothed $L_{1}$ norm:

$$f(x) = \sqrt{(x/2)^{2} + 1} - 1 \qquad (4)$$

$\omega_{\mathrm{Scale}}$ accounts for the contribution of $L_{\mathrm{Depth}}$ at each scale level, $D$ is the inverse depth and $I$ is the RGB image. As the disparity values for lower scale levels should be scaled accordingly (for example, half the resolution results in half the disparity value), the value for the loss term should be scaled inversely by the same scale parameter. Additionally, the number of pixels decreases quadratically with every scale level, resulting in a scale weight for equal contribution of each scale level of: $\omega_{\mathrm{Scale}} = \mathrm{Scale} \cdot \mathrm{Scale}^{2} = \mathrm{Scale}^{3}$.
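
Equations (1)-(4) may be sketched in plain Python as follows, under several assumptions that the text leaves open: gradients are taken as forward differences, the product in equation (2) is read component-wise per image axis, a single-channel guide image stands in for the RGB image, the smoothness term is applied to the predicted inverse depth, and Scale is read as the downsampling factor of a level (2 for half resolution). The weights of 0.1 and 1.0 are hypothetical placeholders.

```python
import math


def barron_smooth_l1(x):
    """Equation (4): f(x) = sqrt((x / 2)^2 + 1) - 1."""
    return math.sqrt((x / 2.0) ** 2 + 1.0) - 1.0


def scale_weight(scale):
    """Scale weight: Scale * Scale^2 = Scale^3."""
    return scale ** 3


def depth_loss(target, predicted, scale=1.0):
    """Equation (3) for one scale level, on H x W nested lists of inverse depth."""
    return sum(
        scale_weight(scale) * barron_smooth_l1(t - p)
        for t_row, p_row in zip(target, predicted)
        for t, p in zip(t_row, p_row)
    )


def smoothness_loss(inv_depth, guide):
    """Equation (2): depth gradients, down-weighted where the guide has edges."""
    height, width = len(inv_depth), len(inv_depth[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            for dy, dx in ((0, 1), (1, 0)):  # right neighbour, then lower neighbour
                ny, nx = y + dy, x + dx
                if ny < height and nx < width:
                    grad_d = abs(inv_depth[ny][nx] - inv_depth[y][x])
                    grad_i = abs(guide[ny][nx] - guide[y][x])
                    total += grad_d * math.exp(-grad_i)
    return total


def total_loss(target, predicted, guide, w_smooth=0.1, w_depth=1.0, scale=1.0):
    """Equation (1): weighted sum of the smoothness and depth terms."""
    return (w_smooth * smoothness_loss(predicted, guide)
            + w_depth * depth_loss(target, predicted, scale))
```

A depth step that coincides with a strong edge in the guide image contributes little to the smoothness term, whereas the same step in a uniform image region is penalized in full.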

[0076] In one implementation, for generating training data, together with accurate depth ground truth, a physics-based rendering pipeline (PBRT) may be used together with Blender, as proposed in Su, Shuochen, et al., “Deep end-to-end time-of-flight imaging”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018. A low resolution version of the depth is clipped and corrupted with noise to simulate the ToF depth input signal.
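
The simulation of the ToF depth input signal may be sketched as follows. The text does not give the parameters, so the decimation factor, the maximum range, the noise level and the convention of storing invalid pixels as zero are all assumptions chosen for illustration, as are the function and parameter names. Clipping is read here as invalidating pixels beyond the sensor range.

```python
import random


def simulate_tof_input(depth_gt, factor=4, max_range_m=4.0,
                       noise_std_m=0.02, seed=0):
    """Turn a ground truth depth map into a low-resolution, noisy ToF-like map.

    depth_gt is an H x W nested list of depths in metres. Pixels beyond
    max_range_m are stored as 0.0 (assumed convention for invalid pixels).
    """
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    # Simple decimation: keep every factor-th pixel in both directions.
    low_res = [row[::factor] for row in depth_gt[::factor]]
    noisy = []
    for row in low_res:
        noisy_row = []
        for d in row:
            if d > max_range_m:
                noisy_row.append(0.0)  # clipped: beyond the assumed range
            else:
                # Additive Gaussian noise, kept non-negative.
                noisy_row.append(max(0.0, d + rng.gauss(0.0, noise_std_m)))
        noisy.append(noisy_row)
    return noisy


# Example: an 8 x 8 scene at 1 m with one far-away pixel at 10 m.
ground_truth = [[1.0] * 8 for _ in range(8)]
ground_truth[0][0] = 10.0
tof_input = simulate_tof_input(ground_truth)
```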

[0077] FIG. 5 illustrates results of the proposed pipeline for ToF upsampling with multi-modality guided upsampling. The input ToF depth map is shown in FIG. 5(a). The predicted depth map after upsampling to 2× resolution is shown in FIG. 5(b), with the resulting error shown in FIG. 5(c). The predicted depth map after upsampling to 4× resolution is shown in FIG. 5(d), with the resulting error shown in FIG. 5(e). The corresponding RGB image and the ground truth ToF depth maps are shown for comparison in FIGS. 5(f) and 5(g) respectively. The proposed method helps to recover the depth without losing information relating to fine structures, while simultaneously improving edges along depth discontinuities.

[0078] Further exemplary results for different scenes are shown in FIGS. 6(a)-(c). FIG. 6(a) shows input ToF depth maps for the scenes, depicted with small RGB images, FIG. 6(b) shows the respective upsampled outputs, and FIG. 6(c) shows the respective corresponding ground truth depth maps for the scenes.

[0079] FIGS. 7(a)-(j) show a comparison between results obtained using classical upsampling with a U-Net without multi-modality guidance and results obtained using the approach of the present application. FIG. 7(a) shows an input ToF depth map for a scene. FIGS. 7(b) and 7(c) show the ToF depth map and corresponding residual respectively obtained using classical upsampling after 32k iterations. FIGS. 7(d) and 7(e) show the ToF depth map and corresponding residual respectively obtained using the approach described herein after 32k iterations. FIG. 7(f) shows an input ToF depth map for an area of the scene at a higher magnification. FIGS. 7(g) and 7(h) show the ToF depth map and corresponding residual respectively obtained using classical upsampling after convergence. FIGS. 7(i) and 7(j) show the ToF depth map and corresponding residual respectively obtained using the approach described herein after convergence. The ToF depth maps in FIGS. 7(d) and 7(i) generated using the method described herein recover the depth without losing information relating to fine structures, whilst simultaneously improving edges along depth discontinuities.

[0080] In FIG. 8, results obtained using the approach of the present application are shown in FIG. 8(a) compared to results obtained using GU only (no depth injection) in FIG. 8(b) and injection only (no GU) in FIG. 8(c). The corresponding ground truth image and RGB image are shown in FIGS. 8(d) and 8(e) respectively. Edges and fine structures are not refined for “injection only” during classical upsampling (FIG. 8(c)), which can be seen by comparing the residual along edges and fine structures in the dashed and full circles. The depth injection at low resolution helps the network to start with a good depth estimate at lower resolution, thus helping the GU for higher resolutions. In FIG. 8(a), the residual error of the depth prediction when compared to the ground truth is reduced compared to the other approaches. In particular, the GU improves the residual along depth discontinuities where the image gradient is usually strong, thereby recovering also fine structures.

[0081] Therefore depth injection may guide the network to predict a well-defined depth at lower resolution which gets refined with RGB-guidance during the hierarchical guided upsampling to recover the depth at four times the original resolution.

[0082] FIG. 9 shows an example of a camera 901 configured to use the pipeline to process images taken by an image sensor 902 in the camera. The camera also has a depth sensor 903 configured to collect ToF depth data. Such a camera 901 typically includes some onboard processing capability. This could be provided by the processor 904. The processor 904 could also be used for the essential functions of the device.

[0083] The transceiver 905 is capable of communicating over a network with other entities 910, 911. Those entities may be physically remote from the camera 901. The network may be a publicly accessible network such as the internet. The entities 910, 911 may be based in the cloud. Entity 910 is a computing entity. Entity 911 is a command and control entity. These entities are logical entities. In practice they may each be provided by one or more physical devices such as servers and data stores, and the functions of two or more of the entities may be provided by a single physical device. Each physical device implementing an entity comprises a processor and a memory. The devices may also comprise a transceiver for transmitting and receiving data to and from the transceiver 905 of camera 901. The memory stores in a non-transient way code that is executable by the processor to implement the respective entity in the manner described herein.

[0084] The command and control entity 911 may train the artificial intelligence models used in the pipeline. This is typically a computationally intensive task, even though the resulting model may be efficiently described, so it may be efficient for the development of the algorithm to be performed in the cloud, where it can be anticipated that significant energy and computing resource is available. It can be anticipated that this is more efficient than forming such a model at a typical camera.

[0085] In one implementation, once the algorithms have been developed in the cloud, the command and control entity can automatically form a corresponding model and cause it to be transmitted to the relevant camera device. In this example, the pipeline is implemented at the camera 901 by processor 904.

[0086] In another possible implementation, an image may be captured by the camera sensor 902 and the image data may be sent by the transceiver 905 to the cloud for processing in the pipeline. The resulting image could then be sent back to the camera 901, as shown at 912 in FIG. 9.

[0087] Therefore, the method may be deployed in multiple ways, for example in the cloud, on the device, or alternatively in dedicated hardware. As indicated above, the cloud facility could perform training to develop new algorithms or refine existing ones. Depending on the compute capability near to the data corpus, the training could either be undertaken close to the source data, or could be undertaken in the cloud, e.g. using an inference engine. The method may also be implemented at the camera, in a dedicated piece of hardware, or in the cloud.

[0088] The present application therefore uses an end-to-end trainable deep learning pipeline for ToF depth super-resolution, which enriches the input ToF depth map with the encoded features from a low-resolution RAW correlation signal. The composed feature maps are hierarchically upsampled with co-modality guidance from aligned higher-resolution RGB images. By injecting the encoded RAW correlation signal, the ToF depth is enriched by the RAW correlation signal for domain stabilization and modality guidance.

[0089] The method utilizes cross-modality advantages. For example, ToF works well in low light or for textureless areas, while RGB works well in bright scenes or scenes with darker-textured objects.

[0090] Because the pipeline is trainable end-to-end and accesses all three modalities (RGB, depth and RAW correlation) at the same time, the modalities can mutually improve the overall recovered depth map. This may increase the resolution, accuracy and precision of the ToF depth map.

[0091] ToF depth errors may also be corrected. Missing data may be recovered, as the method measures farther away regions, and multi-path ambiguities may be resolved via RGB-guidance.

[0092] The network may utilise supervised or unsupervised training. The network may utilise multi-modality training, with synthetic correlation, RGB, ToF depth and ground truth renderings for direct supervision.

[0093] Additional adjustments may be made to the output ToF depth map in dependence on the ground truth image.

[0094] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present application may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the application.

CPC classification: G06T5/50; G06T7/514; G01S17/894; G06T2207/10024; G01S17/36; G01S7/4808; G06T2207/20081; G06T2207/10028; G01S17/86

International classification: G06T7/514
