Method for generating an adaptive multiplane image from a single high-resolution image

11704778 · 2023-07-18

Abstract

A method to compute a variable number of image planes, selected to better represent the scene while reducing artifacts in the produced novel views. The method analyses the structure of the scene by means of a depth map and selects the positions along the Z-axis at which to split the original image into individual layers. The method also determines the number of layers adaptively.

Claims

1. A method of generating an adaptive multiplane image (AMI) from a single image, comprising: receiving the single image as an input; generating a depth map by performing a depth estimation process; performing adaptive slicing of the single image into partial image planes; processing the partial image planes with occluded regions by inpainting; and rearranging resulting image planes to form an AMI representation, wherein the generating of the depth map comprises: feeding the input to a Time-Frequency Decomposition (TFD) layer, resulting in a new feature map with half of an input resolution and with four times more features in a channel dimension; applying a sequence of convolutional layers; generating a new set of wavelet coefficients; and transforming the new set of wavelet coefficients to the input resolution by an Inverse Time-Frequency Decomposition (I-TFD).

2. The method as in claim 1, wherein the adaptive slicing comprises: receiving the depth map; applying a filtering process to the depth map; applying an edge detector to the filtered depth map; setting a depth value of detected borders to zero; applying a morphological erosion operation, generating a border-aware depth map; and computing the transition index vector Γ by using the normalized histogram h: Γ = ∇²h/h.

3. The method as in claim 2, wherein peaks detected from the transition index vector are selected as candidates for layer transitions.

4. The method as in claim 3, wherein, in the adaptive slicing, once a non-zero value is selected from Γ, provided a current number of layers N is lower than a maximum number of layers N.sub.max, N is incremented, the selected value from Γ is stored as a peak, and neighbors of the selected value are set to zero.

5. The method as in claim 4, wherein each partial image plane is defined as an interval between two transitions, resulting in N−1 image planes.

6. The method as in claim 5, wherein a depth of each image plane is defined as an average depth in a considered interval.

7. The method as in claim 1, wherein the inpainting comprises: applying a morphological dilation in an occluded region to generate an inpainting input image frame; defining a binary mask corresponding to the inpainting input image frame; and computing target color pixels based at least in part on the inpainting input image frame.

8. The method as in claim 7, wherein the target color pixels correspond to pixels from the single image, masked based at least in part on the binary mask.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The objectives and advantages of the present invention will become clearer through the following detailed description of the exemplary and non-limiting drawings presented at the end of this document:

(2) FIG. 1 presents a Multiplane Image (MPI) representation of a scene with constant interval between layers.

(3) FIG. 2 presents the process of rendering an MPI with image planes at a fixed grid interval.

(4) FIG. 3 illustrates an overview of the present invention for generating an adaptive multiplane image from a high-resolution image.

(5) FIG. 4 illustrates the Convolutional Neural Network (CNN) associated with Time-Frequency Decomposition (TFD).

(6) FIG. 5 depicts a diagram of the adaptive slicing algorithm, where N and N.sub.max are the current and the maximum number of image planes, respectively.

(7) FIG. 6 depicts an example of a depth histogram and the selection of transitions based on the transition index.

(8) FIG. 7 depicts a comparative example of the proposed adaptive slicing algorithm and the traditional slicing method based on a fixed grid.

(9) FIG. 8 illustrates the definition of hidden region for training the inpainting process.

(10) FIG. 9 illustrates a use case of the present invention in a virtual window application.

(11) FIG. 10 depicts a use case of the present invention in a dynamic photo application on a Smartphone.

(12) FIG. 11 depicts a use case of the present invention for 3D effect generation from a static photo.

(13) FIG. 12 illustrates the qualitative results of the proposed invention (adaptive slicing) compared with uniform slicing.

DETAILED DESCRIPTION

(14) One of the reasons that previous methods require an elevated number of image planes is that, traditionally, such planes are computed in a fixed grid, without explicitly taking into account the content of the image.

(15) For example, in FIG. 1 an MPI representation from a scene is shown with five illustrative image planes aligned in the Z-axis with respect to the observer 107. In this example, the gap between each layer is defined in a fixed grid from the layer lying in the far Z 101 to the layer in the near Z 102. The calculation of this fixed grid in the Z-axis is usually performed in the disparity domain, which is equivalent to the inverse of the depth. This process produces a set of layers with fixed intervals 103, 104, 105, and 106.

(16) FIG. 2 shows a more detailed illustration of the rendering process of an MPI representation based on a fixed grid. The multiplane image layers in 201 are rendered considering an observer 202 at position Po, resulting in the image 203. In this example, five image planes, 204, 205, 206, 207, and 208 are equally spaced as represented by the intervals 209, 210, 211, and 212. For such representation of the scene, the gap produced between the layers 206 and 207 does not accurately represent the real distance between the trees 213 and 214. This effect can result in artifacts on a rendered new view.

(17) Traditionally, a common strategy to reduce this undesirable effect is to increase the number of layers, as previously mentioned. However, this strategy also increases the computational requirements, i.e., memory and computational power, to compute, store, and render the produced MPI representation.

(18) In this sense, the method proposed by the present invention is divided into three main modules: (i) depth estimation module, (ii) adaptive slicing module, and (iii) inpainting module. In the proposed pipeline, as illustrated in FIG. 3, a single high-resolution image 301 is processed by the depth estimation step 302, which produces a depth map 303. If the depth map 303 corresponding to the image 301 is already available, the depth estimation process performed by the depth estimation step 302 can be skipped. This could happen for capturing devices equipped with a depth sensor, such as time-of-flight sensors or a similar apparatus capable of predicting a depth map.

(19) Then, the depth map 303 is processed by the adaptive slicing step 304, which produces a set of partial image planes 305, 306, and 307, which depends on the content of the image represented by the depth map. The partial image planes are composed of three different regions: (i) actual color image 309, which is a copy of the input image; (ii) transparent regions 308, which allow colors from preceding layers to appear in the rendered view; and (iii) occluded regions 310, which correspond to the pixels that are not visible at this image layer.

(20) The partial image planes that have occluded regions, e.g., 306 and 307, are processed by the inpainting step 311. This step produces a color texture that will be inpainted in the occluded regions (e.g., in 310), resulting in the inpainted image planes 312 and 313. The resulting image planes from this process, in this case 305, 312, and 313, are then arranged to form the AMI representation 314, which can be rendered to a new point of view 315.

(21) The main advantage of the present invention lies in the adaptive slicing module, which produces a set of partial image planes. Unlike the state of the art, the number of partial image planes generated depends on the content of the image represented by the depth map. Moreover, each partial image plane has a depth position (in the Z-axis) computed to better represent the scene, instead of using a fixed grid. This information is also computed by the adaptive slicing module.

(22) Moreover, CNN architectures are used for depth estimation and image inpainting, which use the Discrete Wavelet Transform (DWT), or any other Time-Frequency Decomposition (TFD), to achieve high-resolution estimation with low memory requirements. This allows the method to compute an AMI representation at a higher image resolution compared to previous methods, while requiring less memory and less computation time.

(23) The goal of the depth estimation step is to obtain a depth map, i.e., a matrix representation with the same number of columns and rows as the input image, where each value represents the distance from the observer to the 3D scene surface represented in the input image. Although this information could be obtained by a dedicated apparatus, such as time-of-flight sensors, stereo vision approaches, etc., for the vast majority of images captured by current devices such as Smartphones, digital cameras, etc., the depth information may not be available. Therefore, the present invention also includes the possibility of estimating the depth information from a single color image.

(24) Estimating depth from a single image typically relies on deep learning, and convolutional neural networks (CNN) can perform this task satisfactorily. To mitigate the high computational cost associated with these strategies, the use of a TFD in conjunction with convolutional filters is proposed to reduce the memory required by the network.

(25) This method is illustrated in FIG. 4, where an input image or feature map 401 is fed to a TFD layer 402, resulting in a new feature map 403 with half of the input resolution (in number of rows and columns) and with four times more features in the channels dimension. Then, a sequence of convolutional layers 404 is applied to generate a new set of coefficients or features 405, which are transformed back to the original resolution by an Inverse Time-Frequency Decomposition (I-TFD) 406.

(26) Although the tensors 403 and 405 have the same number of features as 401 and 407, respectively, the channels after the TFD are compact in the spatial dimensions and arranged according to specific frequency responses in the channels dimension, as a characteristic of time-frequency transforms. Due to this arrangement, satisfactory results are achieved by using more compact convolutional filters, with a similar number of filters in the channels dimension but with smaller filter sizes, compared to convolutional layers applied directly to the input 401. As a result, an efficient CNN can be implemented with smaller convolutional filters, therefore requiring less memory and fewer computations than traditional CNN architectures.
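The TFD/I-TFD pair described above can be illustrated with a single-level 2D Haar decomposition, the simplest DWT. The sketch below is a hypothetical illustration, not the patented implementation: it halves the spatial resolution, quadruples the channel count, and is exactly invertible.

```python
import numpy as np

def tfd(x):
    # Single-level 2D Haar decomposition of an (H, W, C) tensor.
    # Output: (H/2, W/2, 4C) with LL, LH, HL, HH bands stacked on the channel axis.
    a = x[0::2, 0::2]
    b = x[0::2, 1::2]
    c = x[1::2, 0::2]
    d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return np.concatenate([ll, lh, hl, hh], axis=-1)

def itfd(y):
    # Inverse transform: (H/2, W/2, 4C) back to (H, W, C).
    c = y.shape[-1] // 4
    ll, lh, hl, hh = y[..., :c], y[..., c:2*c], y[..., 2*c:3*c], y[..., 3*c:]
    x = np.zeros((y.shape[0] * 2, y.shape[1] * 2, c))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

In a CNN, the convolutional layers 404 would operate on the compact (H/2, W/2, 4C) tensor between these two calls, which is where the memory saving comes from.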

(27) Multiple blocks of TFD layers, convolutional layers, and inverse TFD layers can be arranged in a structured way to implement a CNN architecture. The final CNN can be trained on annotated depth data, or on stereo pairs of images for disparity estimation, in order to predict a depth map or a disparity map from an input high-resolution color image.

(28) Additionally, the method was evaluated by using the Discrete Wavelet Transform (DWT) due to its simplicity, but any other TFD could be used, such as Short-Time Fourier Transform, Gabor Transform, Bilinear Time-Frequency Distribution, among others.

(29) The main contribution of the present invention is the adaptive slicing module. Considering a depth map predicted by a CNN or produced by a specific depth estimation apparatus, the goal of this step is to generate a set of partial image planes, each lying at a specific depth in the Z-axis, according to the image content, in order to better represent the 3D scene. The main idea of this algorithm is to slice the scene at regions of high discontinuity, such as the boundaries of objects or border regions. In this way, the boundaries of the generated image planes tend to follow the structure of the scene, which prevents the creation of border artifacts. In addition, if some regions of the depth range are empty (no object or structure lying in a given range of depth), no additional image plane will be placed in that region.

(30) The diagram illustrated in FIG. 5 presents an overview of the adaptive slicing method. Since depth maps estimated by a CNN or by other means are susceptible to noise and imprecise borders, a filtering process is first applied to the input depth map. This process can be implemented by any kind of smoothing, border-preserving filter, such as a bilateral or median filter.

(31) Then, an edge detection step is applied to the filtered depth map in order to detect transitions in the structure of the scene. Examples of edge detectors for this process are the Canny, Sobel, Prewitt, or Roberts operators, among others. The depth values corresponding to the detected borders in 501 are set to zero, so the resulting depth map has abrupt border regions, passing through zero. In order to increase the gap between two regions of different depth, a morphological erosion operation is applied, resulting in a border-aware depth map 502. At this point, the normalized histogram, represented by h, is used to compute the transition index Γ, defined by the following equation:

(32) Γ = ∇²h/h

(33) which represents the normalized second derivative of the depth map histogram. The transition index Γ is a vector with the same size as the histogram h and represents the normalized transitions in the histogram. The higher the values in Γ, the more abrupt the normalized transition in h. In the diagram of FIG. 5, the arrows in 503 and 504 represent these vectors. An example of these two vectors from a real depth map is also depicted in FIG. 6.
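One plausible reading of this computation is sketched below; the bin count, depth range, and epsilon guard are illustrative assumptions, not values given in the text.

```python
import numpy as np

def transition_index(border_aware_depth, bins=64, eps=1e-6):
    # Normalized histogram h of the border-aware depth map (depth assumed in [0, 1]).
    h, _ = np.histogram(border_aware_depth.ravel(), bins=bins, range=(0.0, 1.0))
    h = h / max(h.sum(), 1)
    # Discrete second derivative of h, then normalized by h itself: Γ = ∇²h / h.
    d2h = np.abs(np.convolve(h, [1.0, -2.0, 1.0], mode="same"))
    return d2h / (h + eps)
```

The epsilon avoids division by zero in empty bins; in such bins any non-zero second derivative dominates, which is consistent with transitions being strongest where the histogram changes abruptly.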

(34) Peaks from the transition index vector are selected as candidates for layer transitions. This process is demonstrated in the right part of the diagram in FIG. 5. Once a non-zero value is selected from Γ, if the current number of layers N is lower than the maximum number of layers N.sub.max, N is incremented and the selected value from Γ is stored as a peak. Then, the neighbors of the selected value are set to zero (reset), so the neighboring peaks will not be selected in the next iteration.

(35) The number of neighbors to be set to zero is a parameter of the algorithm and can be defined according to the number of bins in the depth map histogram and the maximum number of image planes. Due to this process, some peaks are intentionally ignored, as can be seen in FIG. 6 by the cross markers. Since the extremities of the histogram are pre-selected to define the boundaries of the range of depth values, the peaks close to the borders were ignored. In addition, peaks close to previously selected (higher) values are also ignored.
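The greedy selection loop of FIG. 5 can be sketched as follows; the default suppression radius and N.sub.max are illustrative parameters only.

```python
import numpy as np

def select_transitions(gamma, n_max=8, suppress=2):
    # Iteratively pick the strongest remaining peak of the transition index,
    # then zero its neighbors so nearby peaks are skipped in later iterations.
    g = np.asarray(gamma, dtype=float).copy()
    peaks = []
    while len(peaks) < n_max and g.max() > 0:
        i = int(np.argmax(g))
        peaks.append(i)
        g[max(0, i - suppress):i + suppress + 1] = 0.0
    return sorted(peaks)
```

The loop stops either when N.sub.max peaks have been stored or when no non-zero value remains, matching the two exit conditions in the diagram.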

(36) At the end of the process described in FIG. 5, a number N≤N.sub.max of transitions will have been selected. Then, each partial image plane is defined as the interval between two transitions, resulting in N−1 image planes. The depth of each image plane is then defined as the average depth in the considered interval. In this way, the average error between the real depth value and the image plane depth is minimal.
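The interval-to-plane step can be sketched as below; treating each transition as a depth value delimiting half-open intervals is an assumption of this illustration.

```python
import numpy as np

def slice_planes(depth, transitions):
    # transitions: sorted depth values; N transitions define N-1 partial planes.
    planes = []
    for lo, hi in zip(transitions[:-1], transitions[1:]):
        mask = (depth >= lo) & (depth < hi)
        # Plane Z position = average depth inside the interval, which minimizes
        # the average error between real depth and plane depth.
        z = float(depth[mask].mean()) if mask.any() else 0.5 * (lo + hi)
        planes.append((mask, z))
    return planes
```

Each returned mask selects the pixels belonging to one partial image plane, and z is its position on the Z-axis.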

(37) The main advantage of using the proposed adaptive slicing step compared to a slicing method based on a fixed grid is that adaptive image planes represent the content of the image with more precision.

(38) As an example, in FIG. 7, an adaptive multiplane image in 701 composed of three image planes, 702, 703, and 704, is compared with an equivalent MPI composed of five planes, on which the layer 702 is equivalent to two layers in 705, 703 is equivalent to two layers in 706, and 704 is equivalent to a single layer in 707. In the final AMI (701), the distance between the image planes is adapted to the content of the image, contrarily to the fixed interval from traditional MPI, from 708 to 711.

(39) Once the partial image planes are computed, the occluded regions need to be inpainted, to avoid showing the gaps between layers during the novel view synthesis of the scene. As previously illustrated in FIG. 3, the regions that require inpainting (310) correspond to portions of image layers that are covered by a non-transparent layer closer to the camera, i.e., with lower depth value. This process could be performed by classic inpainting techniques from the state of the art, such as Telea's algorithm.

(40) The solution adopted in the present invention for the problem of image inpainting is a CNN based on TFD, as previously discussed and illustrated in FIG. 4. The difference between the CNN used for depth estimation and the CNN used for image inpainting is that the former predicts a single-channel map, corresponding to the depth map, while the latter predicts three channels, corresponding to red, green, and blue (RGB) colors.

(41) In order to handle a variable number of image planes, the inpainting process operates on a single image layer. The goal of the inpainting in this context is to produce color pixels in occluded regions of a given image layer. Therefore, during the training process, regions from a given image are removed from the network input. These same regions are provided as targets for the optimization process, in order to drive the network to generate pixel values similar to the original pixels removed from the input. This is a standard training process for inpainting. The difference in the process presented in this invention lies in the way the hidden region in the input image is defined.

(42) This process is illustrated in FIG. 8, where the image layer 801 has the region corresponding to the foreground 802 occluded. In this layer, a morphological dilation 804 is applied to the occluded region, resulting in the input image frame 803 for training the inpainting model. Then, a binary mask 805 corresponding to the removed regions 806 is defined and used to compute the target color pixels. The color pixels provided to supervise the model correspond to the pixels from the image 801, masked by the removed regions 806. After the model training, during inference, the image layer 801 is fed to the model and the pixels generated in the region 802 are used as inpainting.
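The mask construction of FIG. 8 can be sketched as follows. The dilation radius and the 4-connected structuring element are illustrative choices; the function names are hypothetical, not from the patent.

```python
import numpy as np

def dilate(mask, iterations=1):
    # Simple 4-connected binary dilation (a stand-in for a morphology library call).
    m = mask.astype(bool).copy()
    for _ in range(iterations):
        p = np.pad(m, 1)
        m = m | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    return m

def inpainting_training_pair(layer_rgb, occluded):
    # Hide a dilated version of the occluded region from the network input (803).
    hidden = dilate(occluded, iterations=2)
    net_input = layer_rgb * (~hidden)[..., None]
    # Supervision mask (805/806): the removed ring whose true colors are known.
    removed = hidden & ~occluded
    target = layer_rgb * removed[..., None]
    return net_input, removed, target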

(43) Contrary to classical inpainting methods, in which the model is trained with simple geometric crops from the input image for supervision, such as random squares and rectangles, the present method considers the scene structure to define the target region, as illustrated in 806. This process guides the model to learn an inpainting that is coherent with the context, i.e., foreground or background portions of the image.

(44) Moreover, the present invention could be adapted to a broad range of applications based on 3D visual effects, novel view synthesis, or dynamic content creation. For such applications, the methods and algorithms presented in the present invention can be implemented on specific hardware devices, such as Smartphones, Smart TVs, and other devices equipped with one or more processors, memory and/or permanent storage, or digital screens for displaying the results produced by the visual effect. The specific implementation can change according to the device and, as an illustrative example, could follow this scheme: an image 301, stored in the device's memory, is processed according to the method described in the present invention, in such a way that the individual steps are performed by the processor and the result 314 containing the AMI representation can be immediately used for synthesizing the effect on the device's screen or stored for future use. Each layer of the AMI representation can be stored as a binary file, with or without compression, along with the respective position in the Z-axis for each layer, for a subsequent synthesis of the effect. In what follows, it is shown how the method could be applied, but not limited, to three different use cases:

(45) I) A virtual window application is illustrated in FIG. 9. In this scenario, a Smart TV or a display apparatus 901 emulates a virtual window for an observer in 902. The display apparatus 901 can be equipped with a sensor mechanism to detect the position of the observer and render the AMI accordingly. For a given scene being rendered in 901, the observer at its initial position 902 sees the produced image as illustrated in 907. If the observer moves to its right in 906, the rendered scene changes accordingly, as shown in 911. In this case, the mounted horse moved faster relative to the trees and the sky, because it is closer to the observer. The same effect can be observed for the observer at positions 903, 904, and 905 and the corresponding rendered views in 908, 909, and 910.

(46) II) Another possible application of the present invention is the creation of a dynamic photo for Smartphones, as illustrated in FIG. 10. In this example, a Smartphone 1001 displays a rendered view 1008 from an AMI computed from a picture, according to the Smartphone device position. If the device moves horizontally 1003, from position 1006 to 1007, the rendered view changes accordingly from 1011 to 1012. In a similar way, a vertical movement 1002 of the device from 1004 to 1005 produces rendered views that change from 1009 to 1010. The device's movement could be estimated by the accelerometer or gyroscope. The produced visual effect provides a notion of depth to the user, since closer objects appear to move faster than parts of the image farther away. Therefore, the picture behaves as a dynamic photograph, improving the user experience.

(47) III) FIG. 11 illustrates a use case of the present invention applied to generate a video with a 3D effect from a single, static photo. In this example, a Smart TV or a display device 1101 renders an AMI representation in real time, producing a video sequence observed by the user 1102. Each video frame 1103, 1104, 1105, etc., is rendered in real time by the display device 1101. This application could be used to animate static pictures, such as landscape or portrait photos, in a more realistic manner, providing the user with a notion of depth in the scene.

(48) Additionally, the effectiveness of the present invention is evaluated on the depth estimation and novel view synthesis tasks. Although the present invention does not depend on estimated depth maps when a depth-sensing apparatus is available, it would commonly be applied to user scenarios where no depth information is provided, therefore requiring an estimated depth map. In addition, the quality of the generated novel views was evaluated considering the efficiency aspect. Both experimental setups are detailed next.

(49) The proposed depth estimation method is evaluated by comparing it with state-of-the-art approaches on the well-known public NYUv2 depth dataset, published by Silberman et al. at ECCV 2012. Four different metrics were considered:

(50) Threshold:

(51) % of y.sub.i s.t. max(y.sub.i/y.sub.i*, y.sub.i*/y.sub.i) = δ < thr

(52) where y.sub.i and y.sub.i* are the predicted and ground truth depth values, and thr is defined as 1.25, 1.25.sup.2, and 1.25.sup.3 respectively for δ.sub.1, δ.sub.2, and δ.sub.3. In this metric, higher is better.

(53) Abs. Relative Error:

(54) (1/|T|) Σ.sub.y∈T |y − y*|/y*

(55) where T represents the evaluation samples. In this metric, lower is better.

(56) RMSE (linear):

(57) √[(1/|T|) Σ.sub.y∈T ∥y − y*∥.sup.2]

(58) RMSE (log):

(59) √[(1/|T|) Σ.sub.y∈T ∥log y − log y*∥.sup.2]
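The four metrics above can be computed directly from predicted and ground truth depth arrays; a compact sketch (averaging over all pixels of the evaluation set):

```python
import numpy as np

def depth_metrics(pred, gt):
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        # Threshold accuracies: fraction of pixels with ratio below thr (higher is better).
        "delta1": float((ratio < 1.25).mean()),
        "delta2": float((ratio < 1.25 ** 2).mean()),
        "delta3": float((ratio < 1.25 ** 3).mean()),
        # Error metrics (lower is better).
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
    }
```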

(60) Table 1 presents results obtained with the present invention compared to previous methods from the state of the art, as well as the present invention considering a classical CNN architecture and the proposed DWT-based CNN.

(61) Considering the use of the proposed DWT-based CNN, it represents an improvement compared to a classic CNN (without DWT) of 11.5% in the RMSE (linear) metric, while reducing the model size from 17 to 16 MB. This demonstrates that the proposed structure using DWT in conjunction with convolutional filters not only allows a compact model but also improves its accuracy.

(62) TABLE 1

Method                                    δ.sub.1  δ.sub.2  δ.sub.3  Abs. Rel. Error  RMSE (lin)  log.sub.10  Model size
DORN (CVPR'18)                            0.828    0.965    0.992    0.115            0.509       0.051       421 MB
Pattern-Affinitive (CVPR'19)              0.846    0.968    0.994    0.121            0.497       —           —
SharpNet (ICCVW'19)                       0.888    0.979    0.995    0.139            0.495       0.047       —
DenseNet (arXiv'18)                       0.846    0.974    0.994    0.123            0.465       0.053       165 MB
Present invention (classic CNN)           0.751    0.938    0.984    0.168            0.601       0.072        17 MB
Present invention (DWT-based CNN)         0.800    0.956    0.989    0.149            0.532       0.063        16 MB
Present invention (DWT, rigid alignment)  0.861    0.960    0.982    0.117            0.422       0.055        16 MB

(63) Compared to previous methods, the present invention is slightly less precise than recent approaches based on very deep CNN architecture, such as DenseNet. However, the method is one order of magnitude smaller (from 165 MB to 16 MB), which is the main concern in this invention. In addition, scores were also reported considering a rigid alignment between predictions and ground truth depth maps (based on median and standard deviation) in the last row of the table, since the present method for novel view synthesis is invariant to the scale and shift of the depth map prediction.

(64) The present invention was also evaluated by considering the quality of the novel views generated from a single high-resolution image. Since in this task no ground truth is available (considering unconstrained pictures taken from a non-static scene), the following methodology was adopted: it is assumed that a scene represented by a relatively large number of image planes can be projected to a new point of view with high scene fidelity, so the rendered image can be considered a ground-truth image. Then, two different strategies for generating an MPI representation with few image planes are compared: 1) uniform (fixed-grid) slicing and 2) the proposed adaptive slicing approach.

(65) Standard image evaluation metrics are used to compare the proposed adaptive slicing algorithm with a uniform slicing approach, considering the same number of image layers. Specifically, the Structural Similarity (SSIM) and the Peak Signal-to-Noise Ratio (PSNR) metrics were used to compare the projected AMI (or MPI for the uniform distribution) with the considered ground truth based on a high number of image planes. In these experiments, the ground truth was defined as an MPI formed by 64 image planes, and the evaluated AMI and MPI have up to 8 image planes. FIG. 12 presents qualitative results and the table below presents the metric results for these sample images.
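Of these two metrics, PSNR is straightforward to reproduce; SSIM is usually taken from an image-processing library. A minimal sketch, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(img, ref, peak=1.0):
    # Peak Signal-to-Noise Ratio in dB for images with values in [0, peak].
    img = np.asarray(img, dtype=float)
    ref = np.asarray(ref, dtype=float)
    mse = float(np.mean((img - ref) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```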

(66) These examples demonstrate that the AMI produced by the adaptive slicing algorithm results in a higher similarity with the ground truth image when compared to the uniform MPI representation, even when using a smaller number of image planes. This fact can be observed in FIG. 12, where the artifacts produced by the uniform slicing are more salient than those produced by the adaptive slicing, and also in the table below, in which the SSIM and PSNR metrics of the present invention are higher and the effective number of image layers is lower.

(67) TABLE 2

Sample    Method            SSIM   PSNR    Effective number of image layers
Sample 1  Uniform slicing   0.563  17.600  8
          Adaptive slicing  0.600  18.086  7
Sample 2  Uniform slicing   0.706  20.290  8
          Adaptive slicing  0.830  22.872  4
Sample 3  Uniform slicing   0.837  24.277  8
          Adaptive slicing  0.874  26.212  6

(68) Finally, the present invention was evaluated in a set of more than 500 high-resolution publicly available images collected from the Internet. The average results of the SSIM and PSNR metrics are presented in the table below, which confirms the superiority of the present invention when compared to a uniform slicing approach.

(69) TABLE 3

Method            SSIM   PSNR
Uniform slicing   0.660  20.488
Adaptive slicing  0.723  21.969

(70) The invention may include one or a plurality of processors. In this sense, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), or an AI-dedicated processor such as a neural processing unit (NPU).

(71) The processors control the processing of the input data in accordance with a predefined operating rule stored in the non-volatile memory and/or the volatile memory. The predefined operating rule is provided through training or learning.

(72) In the present invention, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule of a desired characteristic is made. The learning may be performed in the device itself or implemented through a separate server/system.

(73) The learning algorithm is a technique for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

(74) Although the present invention has been described in connection with certain preferred embodiments, it should be understood that it is not intended to limit the disclosure to those particular embodiments. Rather, it is intended to cover all alternatives, modifications and equivalents possible within the spirit and scope of the disclosure as defined by the appended claims.