WAVELET TRANSFORM BASED DEEP HIGH DYNAMIC RANGE IMAGING

Abstract

Described herein is an image processing apparatus (701) comprising one or more processors (704) configured to: receive (601) a plurality of input images (301, 302, 303); for each input image, form (602) a set of decomposed data by decomposing the input image (301, 302, 303) or a filtered version thereof (307, 308, 309) into a plurality of frequency-specific components (313) each representing the occurrence of features of a respective frequency interval in the input image or the filtered version thereof; process (603) each set of decomposed data using one or more convolutional neural networks to form a combined image dataset (327); and subject (604) the combined image dataset (327) to a construction operation that is adapted for image construction from a plurality of frequency-specific components to thereby form an output image (333) representing a combination of the input images. The resulting HDR output image may have fewer artifacts and provide a better quality result. The apparatus is also computationally efficient, having a good balance between accuracy and efficiency.

Claims

1. An image processing apparatus comprising one or more processors configured to: receive a plurality of input images; for each input image, form a set of decomposed data by decomposing the input image or a filtered version thereof into a plurality of frequency-specific components each representing the occurrence of features of a respective frequency interval in the input image or the filtered version thereof; process each set of decomposed data using one or more convolutional neural networks to form a combined image dataset; and perform a construction operation on the combined image data set, wherein the construction operation is adapted for image construction from a plurality of frequency-specific components, to thereby form an output image representing a combination of the input images.

2. The image processing apparatus as claimed in claim 1, wherein the step of decomposing the input image comprises performing a discrete wavelet transform operation on the input image.

3. The image processing apparatus as claimed in claim 1, wherein the construction operation is an inverse discrete wavelet transform operation.

4. The image processing apparatus as claimed in claim 1, the apparatus comprising a camera and the apparatus being configured to, in response to an input from a user of the apparatus, cause the camera to capture the said plurality of input images, each of the input images being captured with a different exposure from others of the input images.

5. The image processing apparatus as claimed in claim 1, wherein the decomposed data is formed by decomposing a version of the respective input image filtered by a convolutional filter.

6. The image processing apparatus as claimed in claim 1, wherein the apparatus is configured to: mask and weight at least some areas of some of the sets of the decomposed data so as to form attention-filtered decomposed data; select a subset of components of the attention-filtered decomposed data that correspond to lower frequencies than other components of the attention-filtered decomposed data; merge at least the components of the subset of components to form merged data; and wherein the merged data form an input to the construction operation.

7. The image processing apparatus as claimed in claim 6, wherein the apparatus is configured to decompose the attention-filtered data, merge relatively low frequency components of the attention-filtered data through a plurality of residual operations to form convolved low frequency data, and perform a reconstruction operation in dependence on relatively high frequency components of the attention-filtered data and the convolved low frequency data.

8. The image processing apparatus as claimed in claim 1, the apparatus being configured to: for each input image, form the respective set of decomposed data by decomposing the input image or a filtered version thereof into a first plurality of sets of frequency-specific components each representing the occurrence of features of a respective frequency interval in the input image or the filtered version thereof, performing a convolution operation on each of the sets of frequency-specific components to form convolved data and decomposing the convolved data into a second plurality of sets of frequency-specific components each representing the occurrence of features of a respective frequency interval in the convolved data.

9. The image processing apparatus as claimed in claim 8, the apparatus being configured to: merge the first subset of the second plurality of sets of frequency-specific components to form first merged data; perform a masked and weighted combination of a first subset of the second plurality of sets of frequency-specific components and the first merged data to form first combined data; perform a first convolutional combination of a second subset of the second plurality of sets of frequency-specific components to form second combined data; upsample the first and second combined data to form first upsampled data; perform a masked and weighted combination of a first subset of the first plurality of sets of frequency-specific components and the first upsampled data to form third combined data; perform a second convolutional combination of a second subset of the first plurality of sets of frequency-specific components to form fourth combined data; upsample the third and fourth combined data to form second upsampled data; and wherein the output image is formed in dependence on the second upsampled data.

10. The image processing apparatus as claimed in claim 8, wherein the first subsets are subsets of relatively low frequency components.

11. The image processing apparatus as claimed in claim 8, wherein the second subsets are subsets of relatively high frequency components.

12. The image processing apparatus as claimed in claim 8, wherein the output image is formed in dependence on a combination of the second upsampled data and convolved versions of the input images.

13. A computer-implemented image processing method comprising: receiving a plurality of input images; for each input image, forming a set of decomposed data by decomposing the input image or a filtered version thereof into a plurality of frequency-specific components each representing the occurrence of features of a respective frequency interval in the input image or the filtered version thereof; processing each set of decomposed data using one or more convolutional neural networks to form a combined image dataset; and subjecting the combined image dataset to a construction operation that is adapted for image construction from a plurality of frequency-specific components to thereby form an output image representing a combination of the input images.

Description

BRIEF DESCRIPTION OF THE FIGURES

[0028] The present embodiments will now be described by way of example with reference to the accompanying drawings.

[0029] In the drawings:

[0030] FIG. 1 shows an illustration of the HDR imaging task, taking input from multiple frames.

[0031] FIG. 2 schematically illustrates a discrete wavelet transform (DWT).

[0032] FIG. 3 shows a schematic illustration of the overall structure of the network.

[0033] FIG. 4 shows a schematic illustration of the feature merge module.

[0034] FIG. 5 shows a schematic illustration of the upsample module.

[0035] FIG. 6 shows an example of a computer-implemented image processing method.

[0036] FIG. 7 shows an example of a device configured to implement the method described herein.

[0037] FIG. 8 shows a qualitative comparison between results produced using the method described herein and baseline methods.

DETAILED DESCRIPTION

[0038] The network implemented in the apparatus and method described herein is a Unet (as described in Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation”, International Conference on Medical image computing and computer-assisted intervention, pp. 234-241. Springer, Cham, 2015), which is a deep neural network architecture that is commonly used in image processing operations such as image denoising, super-resolution and joint-denoising-demosaicing.

[0039] Wavelet transform is useful tool to transform an image into different groups of frequency components. As shown in FIG. 2, wavelet transform can decompose a signal corresponding to an image into components with different frequency intervals, referred to as low-low (LL), shown at 201, high-low (HL), shown at 202, low-high (LH), shown at 203 and high-high (HH), shown at 204.

[0040] In a preferred implementation, the apparatus and method described herein use wavelet transform combined with a Unet for HDR imaging using multiple LDR images as input. Discrete wavelet transform (DWT) and inverse discrete wavelet transform (IDWT) are used to replace the maxpooling/strided convolution operation in the Unet network. Using DWT may reduce the information loss during the downsampling. IDWT can reconstruct the signal using the output of DWT to restore feature maps to the original scale.

[0041] Wavelet transform is also used in a feature merging module for merging low-frequency components of the inputs. In the feature merging module, which takes the low-frequency components as input, a spatial attention module is used to handle the misaligned regions between the reference and supporting images.

[0042] FIG. 3 shows the overall network architecture of one implementation of the present disclosure. In addition to the basic Unet structure, the network also includes a feature merge module and two upsample modules.

[0043] Each LDR input is sent into an individual encoder of the same structure. In this example, three LDR inputs 301, 302 and 303 are used. For each input, a convolution layer 304, 305, 306 is used to adjust the channel number of the input and form feature maps 307, 308, 309.

[0044] As shown at 310, 311, 312, DWT is then used to decompose the feature map for each input into one low frequency component and three high frequency components, which are shown within the dashed box at 313. Thus, for each input image, a set of decomposed data is formed by decomposing the feature maps into a plurality of frequency-specific components each representing the occurrence of features of a respective frequency interval in the feature map.

[0045] As will be described in more detail below, each set of decomposed data is then processed using one or more convolutional neural networks to form a combined image dataset, which is subjected to a construction operation that is adapted for image construction from a plurality of frequency-specific components to thereby form an output image representing a combination of the input images.

[0046] In the example shown in FIG. 3, the decomposed data is formed by decomposing a version 307, 308, 309 of the respective input image 301, 302, 303 filtered by a convolutional filter 304, 305, 306. Each respective set of decomposed data is formed by decomposing the filtered image into a first plurality of sets of frequency-specific components each representing the occurrence of features of a respective frequency interval in the filtered image (feature map).

[0047] As shown in FIG. 3, in this example, while keeping the high frequency components for later stage (as illustrated by dashed line 314), the low frequency component of each set of decomposed data goes through another convolution layer 315, 316, 317 for feature extraction.

[0048] Convolved data 318, 319 and 320 respectively is formed. The convolved data is subsequently decomposed into a second plurality of sets of frequency-specific components each representing the occurrence of features of a respective frequency interval in the convolved data, as shown within dashed box 324. In this example, DWT is used again, as shown at 321, 322, 323 to decompose the feature maps 318, 319, 320 for each of the inputs into different frequency components (three high frequency and one low frequency), illustrated within dashed box 324, from which the low frequency components are sent into the feature merge module 325 (described later with reference to FIG. 4) where feature fusion is performed and the high frequency components are stored for a later stage (as illustrated by dashed line 326).

[0049] The fused feature map 327 is sent, along with the pre-stored low and high frequency components, to upsampling modules 328 and 329 sequentially to reconstruct the feature map to the original scale, shown at 330. In this example, a global residual connection, shown at 331, is added to enhance the representation ability of the network. The final feature map is shown at 332 and the tonemapped HDR image at 333.

[0050] FIG. 4 shows an exemplary illustration of the feature merge module (325 in FIG. 3). The inputs of the merge module are three low frequency components from different LDR images, shown at 401, 402 and 403. Here, L.sub.2 is the low frequency component of the reference frame. L.sub.1 and L.sub.3 are the low frequency components of the supporting frames.

[0051] The apparatus is configured to mask and weight at least some areas of some of the sets of the decomposed data so as to form attention-filtered decomposed data. A subset of components of the attention-filtered decomposed data are selected that correspond to lower frequencies than other components of the attention-filtered decomposed data. This subset is merged to form merged data, which forms an input to the construction operation.

[0052] In the initial stage, the inputs L.sub.1 and L.sub.3 are sent into the attention modules 404 and 405 respectively separately along with the reference low frequency component L.sub.2 to generate corresponding attention masks M.sub.1.sup.att and M.sub.3.sup.att. Then the masks are applied to the corresponding input using the element-wise multiplication to get L′.sub.1 and L′.sub.3:

L′.sub.1=L.sub.1⊙M.sub.1.sup.att (1)

L′.sub.3=L.sub.3⊙M.sub.3.sup.att (2)

[0053] L′.sub.1, L.sub.2 and L′.sub.3 are concatenated together, as shown at 406, and go through a convolution layer to squeeze the channel number of the feature map 407. DWT is then used to decompose the feature map 407 into different frequency components. The high frequency components are shown within dashed box 408. The low frequency component 409 goes through several residual blocks, indicated at 410, to merge the features into feature map 411. Finally, an IDWT layer 412 is used to restore the feature map to the original scale. The resulting feature map is shown at 413.

[0054] Therefore, in this example, the apparatus is further configured to decompose the attention-filtered data, merge relatively low frequency components of the attention-filtered data through a plurality of residual operations (the residual blocks shown at 410) to form convolved low frequency data, and perform a reconstruction operation in dependence on relatively high frequency components of the attention-filtered data and the convolved low frequency data.

[0055] In this example, for each input image, the apparatus is configured to form the respective set of decomposed data by decomposing the filtered feature map into a first plurality of sets of frequency-specific components each representing the occurrence of features of a respective frequency interval in the feature map, perform a convolution operation on each of the sets of frequency-specific components to form convolved data and decompose the convolved data into a second plurality of sets of frequency-specific components each representing the occurrence of features of a respective frequency interval in the convolved data.

[0056] The first subset of the second plurality of sets of frequency-specific components are merged to form first merged data. A masked and weighted combination of a first subset of the second plurality of sets of frequency-specific components and the first merged data is performed to form first combined data. A first convolutional combination of a second subset of the second plurality of sets of frequency-specific components is performed to form second combined data. The first and second combined data are upsampled to form first upsampled data. A masked and weighted combination of a first subset of the first plurality of sets of frequency-specific components (corresponding to the relatively low frequency components) and the first upsampled data is performed to form third combined data.

[0057] A second convolutional combination of a second subset of the first plurality of sets of frequency-specific components (corresponding to the relatively high frequency components) is performed to form fourth combined data.

[0058] The third and fourth combined data is upsampled to form second upsampled data and, as a result of the global residual connection 331 in FIG. 3, the output image is formed in dependence on the second upsampled data and convolved versions of the input images.

[0059] FIG. 5 shows an exemplary illustration of the upsample module. In contrast to previous computer vision tasks using wavelet transform, which only have single input (such as denoising, classification), the HDR imaging task generally has more than one input LDR image. Therefore, after each DWT layer, multiple groups of low and high components can be achieved, for example:

[00001] $\begin{matrix} L_{1} \overset{DWT}{⟹} {LL}_{1}, L H_{1}, H L_{1}, H H_{1} & (3) \end{matrix}$ $\begin{matrix} L_{2} \overset{DWT}{⟹} {LL}_{2}, L H_{2}, H L_{2}, H H_{2} & (4) \end{matrix}$ $\begin{matrix} L_{3} \overset{DWT}{⟹} {LL}_{3}, L H_{3}, H L_{3}, H H_{3} & (5) \end{matrix}$

[0060] Here L.sub.n is the n.sup.th input and [LL.sub.n,LH.sub.n,HL.sub.n,HH.sub.n] are the components with different frequency intervals from the n.sup.th input. However, when IDWT is used to upsample the feature map, only one set of these components is used. In the method described herein, the learnable merging module merges these low and high frequency components into one set that can be used during upsampling. Firstly, for the high frequency components, the components with the same frequency interval are grouped together:

LHs=Concat(LH.sub.1,LH.sub.2,LH.sub.3) (6)

HLs=Concat(HL.sub.1,HL.sub.2,HL.sub.3) (7)

HHs=Concat(HH.sub.1,HH.sub.2,HH.sub.3) (8)

[0061] These grouped components are shown at 501, 502 and 503 respectively in FIG. 5. These grouped components are then passed through several convolution layers to generate one set of high frequency components, shown at 504, 505 and 506 respectively, and collectively at 507.

[0062] The low frequency components LL.sub.n, shown at 508, 509, 510, are fused following the steps in the feature merge module (LL.sub.1 and LL.sub.3 are sent into the attention modules 511 and 512 respectively separately along with the reference low frequency component LL.sub.2 to generate corresponding attention masks M.sub.1.sup.att and M.sub.3.sup.att and the masks are applied to the corresponding input using the element-wise multiplication to get L′.sub.1 and L′.sub.3) and then concatenated with the feature map F′ 513 from the previous layer to generate a single low frequency component, shown at 514. As shown at 515, IDWT is then used to restore the feature map to the original scale using the generated low 514 and high 507 frequency components.

[0063] As described above, for the input of the network, a set of LDR images {L.sub.1, L.sub.2, L.sub.3} is used. This set of LDR images is mapped to the HDR domain {H.sub.1, H.sub.2, H.sub.3} according to the gamma correction:

[00002] $\begin{matrix} H_{i} = \frac{L_{i}^{γ}}{t_{i}} & (9) \end{matrix}$

[0064] where γ is the gamma correction parameter and t.sub.i is the exposure time of the LDR image L.sub.i. These two sets of images are concatenated together to generate the final input set {X.sub.1, X.sub.2, X.sub.3}:

X.sub.i=Concat(L.sub.i,H.sub.i) (10)

[0065] In the method described herein, the HDR imaging task is formulated as an image-to-image translation problem. Thus, the network is trained through minimizing the L1 loss between the predicted tonemapped output T(Ĥ) and the tonemapped ground truth T(H):

Loss=∥T(Ĥ)−T(H)∥.sub.1 (11)

[0066] where T(⋅) is the μ-law function which can be written as:

[00003] $\begin{matrix} T (H) = \frac{\log (1 + μ H)}{\log (1 + H)} & (12) \end{matrix}$

[0067] where μ is the coefficient to control the compression.

[0068] The method described herein therefore constitutes a learning-based approach combined with wavelet transform for HDR image fusion of inputs from multiple frames.

[0069] As described above, the apparatus comprises a wavelet transform module to decompose inputs into several components with different frequency intervals in the feature space, a parameterized learning module to process the feature fusion of the low-frequency component and high-frequency components in the network and an attention module to deal with the misalignment regions in the support frames.

[0070] Optical flow is not required in the method described herein.

[0071] The learnable modules are used to fuse several components with different frequency intervals from the different input images separately (low-frequency components and high-frequency components). For the feature merge module, only low-frequency components are adopted. Low-frequency components contain more structural information. Therefore, using low-frequency components in the merge stage may be beneficial for alleviating ghosting artifacts and recovering under-exposed and over-exposed regions of the inputs. High-frequency components can preserve detail information, which is helpful for reconstructing details during upsampling.

[0072] FIG. 6 shows an example of a computer-implemented image processing method. At step 601, the method comprises receiving a plurality of input images. At step 602, the method comprises, for each input image, forming a set of decomposed data by decomposing the input image or a filtered version thereof into a plurality of frequency-specific components each representing the occurrence of features of a respective frequency interval in the input image or the filtered version thereof. At step 603, the method comprises processing each set of decomposed data using one or more convolutional neural networks to form a combined image dataset. At step 604, the method comprises subjecting the combined image dataset to a construction operation that is adapted for image construction from a plurality of frequency-specific components to thereby form an output image representing a combination of the input images.

[0073] The apparatus may comprise an imaging device, such as a camera. The apparatus may be configured to, in response to an input from a user of the apparatus, cause the camera, or other imaging device, to capture each of the input images with a different exposure from the other input images.

[0074] FIG. 7 shows an example of apparatus 700 comprising a imaging device 701 configured to use the method describe herein to process image data captured by at least one image sensor in the device. The device 701 comprises image sensors 702, 703. Such a device 701 typically includes some onboard processing capability. This could be provided by the processor 704. The processor 704 could also be used for the essential functions of the device.

[0075] The transceiver 705 is capable of communicating over a network with other entities 710, 711. Those entities may be physically remote from the device 701. The network may be a publicly accessible network such as the internet. The entities 710, 711 may be based in the cloud. Entity 710 is a computing entity. Entity 711 is a command and control entity. These entities are logical entities. In practice they may each be provided by one or more physical devices such as servers and data stores, and the functions of two or more of the entities may be provided by a single physical device. Each physical device implementing an entity comprises a processor and a memory. The devices may also comprise a transceiver for transmitting and receiving data to and from the transceiver 705 of device 701. The memory stores in a non-transient way code that is executable by the processor to implement the respective entity in the manner described herein.

[0076] The command and control entity 711 may train the model used in the device. This is typically a computationally intensive task, even though the resulting model may be efficiently described, so it may be efficient for the development of the algorithm to be performed in the cloud, where it can be anticipated that significant energy and computing resource is available. It can be anticipated that this is more efficient than forming such a model at a typical imaging device.

[0077] In one implementation, once the algorithms have been developed in the cloud, the command and control entity can automatically form a corresponding model and cause it to be transmitted to the relevant imaging device. In this example, the model is implemented at the device 701 by processor 704.

[0078] In another possible implementation, an image may be captured by one or both of the sensors 702, 703 and the image data may be sent by the transceiver 705 to the cloud for processing. The resulting image could then be sent back to the device 701, as shown at 712 in FIG. 7.

[0079] Therefore, the method may be deployed in multiple ways, for example in the cloud, on the device, or alternatively in dedicated hardware. As indicated above, the cloud facility could perform training to develop new algorithms or refine existing ones. Depending on the compute capability near to the data corpus, the training could either be undertaken close to the source data, or could be undertaken in the cloud, e.g. using an inference engine.

[0080] Unlike prior methods, the approach described herein does not need to use optical flow to align the input frames. DWT is advantageously employed to reduce information loss caused by maxpooling and strided convolution operations. DWT is advantageously used to perform the downsampling and IDWT is used for the upsampling.

[0081] Not using optical flow can save computational cost. Furthermore, some hardware does not support optical flow. Therefore, the method described herein may be used on a wider range of hardware and may be deployed in mobile devices.

[0082] Because of the wavelet transform, the information loss is reduced during downsampling. Thus, HDR images with better quality can be generated. Compared with previous methods, embodiments of the present disclosure may achieve a good balance between image quality and computational efficiency.

[0083] FIG. 8 shows a qualitative comparison between the method described herein (image 802) and baseline methods Unet (801) and AHDR (803). From FIG. 8, it can be seen that, in this example, Unet produces artifacts in the human face and arm and AHDR produces artifacts in the car and building. The HDR image produced using the method described herein does not have such artifacts and provides a better quality result. The method may achieve state-of-the-art performance and is also computationally efficient, having a good balance between accuracy and efficiency.

[0084] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present disclosure may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the disclosure.

WAVELET TRANSFORM BASED DEEP HIGH DYNAMIC RANGE IMAGING

Assignee

Inventors

Cpc classification

Classification Explorer

G06T5/50

PHYSICS

Classification Explorer

G06T5/009

PHYSICS

Classification Explorer

G06T5/20

PHYSICS

Classification Explorer

H04N23/80

ELECTRICITY

Classification Explorer

G06T2207/20208

PHYSICS

Classification Explorer

G06T5/10

PHYSICS

Classification Explorer

G06T2207/20084

PHYSICS

Classification Explorer

G06T2207/10144

PHYSICS

Classification Explorer

G06T2207/20064

PHYSICS

Classification Explorer

G06T2207/20081

PHYSICS

Classification Explorer

G06T2207/20221

PHYSICS

Classification Explorer

H04N19/63

ELECTRICITY

Classification Explorer

G06T5/007

PHYSICS

International classification

Classification Explorer

G06T5/10

PHYSICS

Classification Explorer

G06T5/50

PHYSICS

Classification Explorer

G06T5/00

PHYSICS

Classification Explorer

G06T5/20

PHYSICS

Abstract

Claims

Description