Method for generating an adaptive multiplane image from a single high-resolution image
11704778 · 2023-07-18
Inventors
- Diogo Carbonera Luvizon (São Paulo, BR)
- Henrique Fabricio Gagliardi (São Paulo, BR)
- Otavio Augusto Bizetto Penatti (São Paulo, BR)
Abstract
A method to compute a variable number of image planes, which are selected to better represent the scene while reducing the artifacts on produced novel views. This method analyses the structure of the scene by means of a depth map and selects the position in the Z-axis to split the original image into individual layers. The method also determines the number of layers in an adaptive way.
Claims
1. A method of generating an adaptive multiplane image (AMI) from a single image, comprising: receiving the single image as an input; generating a depth map by performing a depth estimation process; performing adaptive slicing of the single image into partial image planes; processing the partial image planes with occluded regions by inpainting; and rearranging resulting image planes to form an AMI representation, wherein the generating of the depth map comprises: feeding the input to a Time-Frequency Decomposition (TFD) layer in a new feature map with half of an input resolution and with four times more features in a channel dimension; applying a sequence of convolutional layers; generating a new set of wavelet coefficients; and transforming the new set of wavelet coefficients to the input resolution by an Inverse Time-Frequency Decomposition (I-TFD).
2. The method as in claim 1, wherein the adaptive slicing comprises: receiving the depth map; applying a filtering process in the depth map; applying an edge detector to the filtered depth map; setting a depth value of detected borders to zero; applying a morphological erosion operation, generating a border-aware depth map; and computing the transition index Γ vector by using the normalized histogram h:
3. The method as in claim 2, wherein peaks detected from the transition index vector are selected as candidates for layer transitions.
4. The method as in claim 3, wherein in the adaptive slicing once a non-zero value is selected from Γ, provided a current number of layers N is lower than a maximum number of layers N_max, N is incremented and the selected value from Γ is stored as a peak and neighbors from the selected value are set to zero.
5. The method as in claim 4, wherein each partial image plane is defined as an interval between two transitions, resulting in N−1 image planes.
6. The method as in claim 5, wherein a depth of each image plane is defined as an average depth in a considered interval.
7. The method as in claim 1, wherein the inpainting comprises: applying a morphological dilation in an occluded region to generate an inpainting input image frame; defining a binary mask corresponding to the inpainting input image frame; and computing target color pixels based at least in part on the inpainting input image frame.
8. The method as in claim 7, wherein the target color pixels correspond to pixels from image masked at least in part on the inpainting input image frame.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The objectives and advantages of the current invention will become clearer through the following detailed description of the example and non-limitative drawings presented at the end of this document:
DETAILED DESCRIPTION
(14) One of the reasons that previous methods require a large number of image planes is that, traditionally, such planes are computed on a fixed grid, without explicitly taking into account the content of the image.
(15) For example, in
(16)
(17) Traditionally, a common strategy to reduce this undesirable effect is to increase the number of layers, as previously mentioned. However, this strategy also increases the computational requirements, i.e., memory and computational power, to compute, store, and render the produced MPI representation.
(18) To address this, the method proposed by the present invention is divided into three main modules: (i) a depth estimation module, (ii) an adaptive slicing module, and (iii) an inpainting module. In the proposed pipeline, as illustrated in
(19) Then, the depth map 303 is processed by the adaptive slicing step 304, which produces a set of partial image planes 305, 306, and 307; the number of planes depends on the content of the image represented by the depth map. The partial image planes are composed of three different regions: (i) the actual color image 309, which is a copy of the input image; (ii) transparent regions 308, which allow colors from preceding layers to appear in the rendered view; and (iii) occluded regions 310, which correspond to the pixels that are not visible at this image layer.
(20) The partial image planes that have occluded regions, e.g., 306 and 307, are processed by the inpainting step 311. This step produces a color texture that will be inpainted in the occluded regions (e.g., in 310), resulting in the inpainted image planes 312 and 313. The resulting image planes from this process, in this case 305, 312, and 313, are then arranged to form the AMI representation 314, which can be rendered to a new point of view 315.
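As an illustration of how transparent regions let colors from other layers appear in a rendered view, the following minimal sketch blends a stack of image planes from the farthest to the nearest. It assumes standard "over" alpha compositing and an RGBA layout, neither of which is stated in the description; the function name `composite_back_to_front` and the convention that a larger depth means a farther plane are hypothetical. The reprojection of each plane to a new point of view is omitted.

```python
import numpy as np

def composite_back_to_front(layers):
    """Blend RGBA planes with the 'over' operator, farthest plane first.

    `layers` is a list of (rgba, depth) pairs. Each rgba array has shape
    (H, W, 4) with values in [0, 1]; a larger depth is assumed to be farther.
    """
    ordered = sorted(layers, key=lambda item: item[1], reverse=True)
    height, width, _ = ordered[0][0].shape
    out = np.zeros((height, width, 3))
    for rgba, _depth in ordered:
        alpha = rgba[..., 3:4]
        # Opaque pixels replace what is behind them; transparent pixels keep it.
        out = alpha * rgba[..., :3] + (1.0 - alpha) * out
    return out

# Far plane: opaque red. Near plane: opaque blue on the left, transparent on the right.
far = np.zeros((1, 2, 4))
far[..., 0] = 1.0
far[..., 3] = 1.0
near = np.zeros((1, 2, 4))
near[0, 0] = [0.0, 0.0, 1.0, 1.0]
rendered = composite_back_to_front([(near, 1.0), (far, 5.0)])
```

In this toy case the left pixel shows the near plane and the right pixel shows the far plane through the transparent region.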
(21) The main advantage of the present invention is in the adaptive slicing module, which produces a set of partial image planes. Unlike the state of the art, the number of partial image planes generated depends on the content of the image represented by the depth map. Moreover, each partial image plane has a depth position (in the Z-axis) computed to better represent the scene, instead of using a fixed grid. This information is also computed by the adaptive slicing module.
(22) Moreover, CNN architectures are used for depth estimation and image inpainting; these architectures use the Discrete Wavelet Transform (DWT), or any other Time-Frequency Decomposition (TFD), to achieve high-resolution estimation with low memory requirements. This allows the method to compute an AMI representation in higher image resolution compared to previous methods, while requiring less memory and less computation time.
(23) The goal of the depth estimation step is to obtain a depth map, i.e., a matrix representation with the same number of columns and rows as the input image, where each value represents the distance from the observer to the 3D scene surface represented in the input image. Although this information could be obtained by a dedicated apparatus, such as time-of-flight sensors, stereo vision approaches, etc., for the vast majority of images captured by current devices such as Smartphones, digital cameras, etc., the depth information may not be available. Therefore, the present invention also includes the option of estimating the depth information from a single color image.
(24) Depth can be estimated from a single image using deep learning and deep neural networks. In this sense, convolutional neural networks (CNN) can perform this task satisfactorily. To mitigate the high computational cost associated with these strategies, the use of a TFD in conjunction with convolutional filters is proposed to reduce the memory required by the network.
(25) This method is illustrated in
(26) Although the tensors 403 and 405 have the same number of features as 401 and 407 respectively, the channels after the TFD are compact in the spatial dimensions and arranged according to specific frequency responses in the channel dimension, as is characteristic of Time-Frequency Transforms. Due to this arrangement, satisfactory results are achieved by using more compact convolutional filters, with a similar number of filters in the channel dimension but with smaller filter sizes, compared to convolutional layers applied directly to the input 401. As a result, an efficient CNN can be implemented with smaller convolutional filters, therefore requiring less memory and fewer computations than traditional CNN architectures.
(27) Multiple blocks of TFD layers, convolutional layers, and inverse TFD layers can be arranged in a structured way to implement a CNN architecture. The final CNN can be trained on annotated depth data, or on stereo pairs of images for disparity estimation, in order to predict a depth map or a disparity map from an input high-resolution color image.
(28) Additionally, the method was evaluated by using the Discrete Wavelet Transform (DWT) due to its simplicity, but any other TFD could be used, such as Short-Time Fourier Transform, Gabor Transform, Bilinear Time-Frequency Distribution, among others.
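The shape bookkeeping of a TFD layer and its inverse can be sketched with a one-level 2D Haar transform, one of the simplest DWTs. This is only an illustrative sketch: the description does not state which wavelet was used, and the function names `haar_tfd` and `haar_itfd` are hypothetical. The forward step halves the spatial resolution and multiplies the number of channels by four, and the inverse step restores the input exactly.

```python
import numpy as np

def haar_tfd(x):
    """One-level 2D Haar DWT: (H, W, C) -> (H/2, W/2, 4C). H and W must be even."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # low-frequency (approximation) band
    detail_1 = (a - b + c - d) / 2.0   # difference across columns
    detail_2 = (a + b - c - d) / 2.0   # difference across rows
    detail_3 = (a - b - c + d) / 2.0   # diagonal difference
    return np.concatenate([low, detail_1, detail_2, detail_3], axis=-1)

def haar_itfd(y):
    """Inverse of haar_tfd: (H/2, W/2, 4C) -> (H, W, C)."""
    low, d1, d2, d3 = np.split(y, 4, axis=-1)
    h2, w2, channels = low.shape
    x = np.empty((2 * h2, 2 * w2, channels), dtype=y.dtype)
    x[0::2, 0::2] = (low + d1 + d2 + d3) / 2.0
    x[0::2, 1::2] = (low - d1 + d2 - d3) / 2.0
    x[1::2, 0::2] = (low + d1 - d2 - d3) / 2.0
    x[1::2, 1::2] = (low - d1 - d2 + d3) / 2.0
    return x

features = np.random.default_rng(0).random((4, 6, 3))
coefficients = haar_tfd(features)
restored = haar_itfd(coefficients)
```

In a network of the kind described, the convolutional layers would operate on `coefficients` before the inverse transform is applied; that part is not shown here.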
(29) The main contribution of the present invention is the adaptive slicing module. Considering a depth map predicted by a CNN or produced by a specific depth estimation apparatus, the goal of this step is to generate a set of partial image planes, each lying at a specific depth in the Z-axis, according to the image content, in order to better represent the 3D scene. The main idea of this algorithm is to slice the scene in regions of high discontinuity, such as at the boundaries of objects or in border regions. In this way, the boundaries of the generated image planes tend to follow the structure of the scene, which prevents border artifacts. In addition, if some regions in the depth are empty (no object or structure lying in a given range of depth), no additional image plane will be placed in this region.
(30) The diagram illustrated in
(31) Then, an edge detection step is applied to the filtered depth map in order to detect transitions in the structure of the scene. Examples of edge detectors for this process are the Canny, Sobel, Prewitt, or Roberts operators, among others. The depth values corresponding to the detected borders in 501 are set to zero, so the resulting depth map has abrupt border regions, passing through zero. In order to increase the gap between two regions of different depth, a morphological erosion operation is applied, resulting in a border-aware depth map 502. At this point, the normalized histogram, represented by h, is then used to compute the transition index Γ, defined by the following equation:
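The sequence of filtering, edge detection, zeroing of borders, and erosion could be sketched as follows. This is a minimal NumPy-only sketch under several assumptions that are not taken from the description: a 3x3 mean filter stands in for the filtering process, a thresholded Sobel gradient magnitude stands in for the edge detector, the erosion uses a 3x3 structuring element, and the names `border_aware_depth` and `edge_threshold` are hypothetical.

```python
import numpy as np

def _neighborhood(img):
    """Stack the 3x3 neighborhood of every pixel along a new first axis.

    Entry k holds the neighbor at row offset k // 3 and column offset k % 3.
    """
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])

def border_aware_depth(depth, edge_threshold=0.5):
    smoothed = _neighborhood(depth).mean(axis=0)      # assumed filter: 3x3 mean
    s = _neighborhood(smoothed)
    # Sobel responses: right column minus left column, bottom row minus top row.
    gx = (s[2] + 2 * s[5] + s[8]) - (s[0] + 2 * s[3] + s[6])
    gy = (s[6] + 2 * s[7] + s[8]) - (s[0] + 2 * s[1] + s[2])
    edges = np.hypot(gx, gy) > edge_threshold
    zeroed = np.where(edges, 0.0, smoothed)           # detected borders set to zero
    return _neighborhood(zeroed).min(axis=0)          # grayscale erosion widens the gap

# Toy depth map: a near region (depth 1) next to a far region (depth 3).
toy_depth = np.ones((6, 8))
toy_depth[:, 4:] = 3.0
toy_border_aware = border_aware_depth(toy_depth)
```

In the toy example the band of columns around the depth step is forced to zero, while the outermost columns keep their original depth values.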
(32)
(33) which represents the normalized second derivative of the depth map histogram. The transition index Γ is a vector with the same size as the histogram h and represents the normalized transitions in the histogram. The higher the values in Γ, the more abrupt the normalized transition in h. In the diagram of
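A possible way to compute a vector with the stated properties is sketched below. The exact normalization is an assumption: here the absolute second difference of the normalized histogram is divided by its maximum, so that the values lie in [0, 1]. The exclusion of zero-valued border pixels from the histogram, the number of bins, and the name `transition_index` are likewise hypothetical choices made for illustration.

```python
import numpy as np

def transition_index(depth, bins=64):
    """Absolute second difference of the normalized depth histogram, scaled to [0, 1]."""
    values = depth[depth > 0]                     # assumption: zeroed borders are ignored
    counts, _ = np.histogram(values, bins=bins, range=(0.0, float(depth.max())))
    h = counts / max(counts.sum(), 1)             # normalized histogram h
    padded = np.pad(h, 1, mode="edge")
    second = np.abs(padded[:-2] - 2.0 * padded[1:-1] + padded[2:])
    peak = second.max()
    return second / peak if peak > 0 else second  # assumed normalization

toy_depth_values = np.array([[1.0, 1.0, 3.0, 3.0]])
toy_gamma = transition_index(toy_depth_values, bins=8)
```

The result has the same length as the histogram, and its largest values sit at the bins where the histogram changes most abruptly.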
(34) Peaks from the transition index vector are selected as candidates for layer transitions. This process is demonstrated in the right part of the diagram in
(35) The number of neighbors to be set to zero is a parameter of the algorithm and can be defined according to the number of bins in the depth map histogram and the maximum number of image planes. Due to this process, some peaks are intentionally ignored, as can be seen in
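The selection of transitions with neighbor suppression can be sketched as a greedy loop. The order in which peaks are visited is an assumption (largest value first), and the names `select_transitions`, `n_max`, and `n_neighbors` are hypothetical; `n_max` plays the role of the maximum count N_max and `n_neighbors` is the number of bins zeroed on each side of a selected peak.

```python
import numpy as np

def select_transitions(gamma, n_max, n_neighbors):
    """Pick up to n_max peaks from gamma, zeroing the neighbors of each pick."""
    work = np.asarray(gamma, dtype=float).copy()
    peaks = []
    while len(peaks) < n_max and work.size and work.max() > 0:
        idx = int(work.argmax())                  # assumed order: largest value first
        peaks.append(idx)
        lo = max(idx - n_neighbors, 0)
        hi = min(idx + n_neighbors + 1, work.size)
        work[lo:hi] = 0.0                         # suppress the peak and its neighbors
    return sorted(peaks)

example_gamma = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.5, 0.5]
example_peaks = select_transitions(example_gamma, n_max=4, n_neighbors=1)
```

With these inputs the bins next to each selected peak are skipped, which illustrates how some peaks are intentionally ignored, and the loop stops early because no non-zero value remains.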
(36) In the end of the process described in
(37) The main advantage of using the proposed adaptive slicing step compared to a slicing method based on a fixed grid is that adaptive image planes represent the content of the image with more precision.
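To make the contrast with a fixed grid concrete, the sketch below builds partial image planes from a list of transition depths: each plane covers the interval between two consecutive transitions, its depth is the average depth of the pixels in that interval, and empty intervals produce no plane. The RGBA layout, the handling of interval end points, and the name `slice_into_planes` are assumptions; occluded regions are not modeled.

```python
import numpy as np

def slice_into_planes(image, depth, transitions):
    """Split image (H, W, 3) into planes using sorted transition depths.

    N transitions give at most N - 1 planes; returns a list of (rgba, plane_depth).
    """
    planes = []
    pairs = list(zip(transitions[:-1], transitions[1:]))
    for k, (near, far) in enumerate(pairs):
        is_last = k == len(pairs) - 1
        upper = (depth <= far) if is_last else (depth < far)
        inside = (depth >= near) & upper
        if not inside.any():
            continue                              # empty depth range: no plane is placed
        rgba = np.zeros(image.shape[:2] + (4,))
        rgba[..., :3] = np.where(inside[..., None], image, 0.0)
        rgba[..., 3] = inside                     # alpha is 1 inside the interval, else 0
        planes.append((rgba, float(depth[inside].mean())))
    return planes

toy_image = np.full((1, 4, 3), 0.5)
toy_depth_map = np.array([[1.0, 1.0, 3.0, 3.0]])
toy_planes = slice_into_planes(toy_image, toy_depth_map, [0.0, 2.0, 4.0])
```

Here three transitions yield two planes, placed at depths 1.0 and 3.0 rather than at the centers of fixed grid cells.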
(38) As an example, in
(39) Once the partial image planes are computed, the occluded regions need to be inpainted, to avoid showing the gaps between layers during the novel view synthesis of the scene. As previously illustrated in
(40) The solution adopted in the present invention for the problem of image inpainting is a CNN based on TFD, as previously discussed and illustrated in
(41) In order to handle a variable number of image planes, the inpainting process operates on a single image layer at a time. The goal of the inpainting in this context is to produce color pixels in occluded regions of a given image layer. Therefore, during the training process, regions from a given image are removed from the network input. These same regions are provided as targets for the optimization process, in order to drive the network to generate pixel values that are similar to the original pixels removed from the input. This is a standard training process for inpainting. The difference in the process presented in this invention lies in the way that the hidden region in the input image is defined.
(42) This process is illustrated in
(43) Unlike classical inpainting methods, in which the model is trained with simple geometric crops from the input image for supervision, such as random squares and rectangles, in the present method the scene structure is considered to define the target region, as illustrated in 806. This process guides the model to learn an inpainting that is coherent with the context, i.e., foreground or background portions of the image.
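One way to build a structure-driven training sample is sketched below: a region derived from the scene structure is dilated, the dilated region is hidden from the network input, and the hidden pixels become the target. This is only a sketch under assumptions; the 3x3 structuring element, the number of dilation iterations, the zero fill value, and the names `dilate` and `make_training_sample` are hypothetical and not taken from the description.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 structuring element."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        shifted = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
        out = np.any(shifted, axis=0)
    return out

def make_training_sample(image, occluded, iterations=2):
    """Return (network input, binary mask, target pixels) for one image layer."""
    hidden = dilate(occluded, iterations)
    net_input = np.where(hidden[..., None], 0.0, image)  # pixels removed from the input
    target = np.where(hidden[..., None], image, 0.0)     # pixels the network must recover
    return net_input, hidden, target

sample_image = np.ones((5, 5, 3))
sample_region = np.zeros((5, 5), dtype=bool)
sample_region[2, 2] = True
net_input, hidden_mask, target = make_training_sample(sample_image, sample_region, iterations=1)
```

Because the hidden region follows a structure-derived mask instead of a random rectangle, the supervised pixels lie along the boundaries where occlusions actually occur.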
(44) Moreover, the present invention could be adapted to a broad range of applications based on 3D visual effects, novel view synthesis, or dynamic content creation. For such applications, the methods and algorithms presented in the present invention can be implemented on specific hardware devices, such as Smartphones, Smart TVs, and other devices equipped with one or more processors and memory and/or permanent storage, or digital screens for displaying the results produced by the visual effect. The specific implementation can change according to different devices and, as an illustrative example, could follow the following scheme: an image 301, stored in the device's memory, is processed according to the method described in the present invention, in such a way that the individual steps are performed by the processor and the result 314 containing the AMI representation can be immediately used for synthesizing the effect on the device's screen or stored for future use. Each layer of the AMI representation can be stored as a binary file, with or without compression, along with the respective position in the Z axis for each layer, for a subsequent synthesis of the effect. In what follows, it is shown how the method could be applied to, but is not limited to, three different use cases:
(45) I) A virtual window application is illustrated in
(46) II) Another possible application with the present invention is the creation of a dynamic photo for Smartphones, as illustrated in
(47) III)
(48) Additionally, the effectiveness of the present invention is evaluated on the depth estimation and novel view synthesis tasks. Although the present invention is not completely dependent on estimated depth maps for the cases when a depth sensing apparatus is available, it could be commonly applied to user scenarios where no depth information is provided, therefore requiring an estimated depth map. In addition, the quality of the generated novel views was evaluated by considering the efficiency aspect. Both experimental setups are detailed next.
(49) The proposed depth estimation method is evaluated by comparing it with state-of-the-art approaches on the well-known public NYUv2 depth dataset, published by Silberman et al. at ECCV 2012. Four different metrics were considered:
(50) Threshold:
(51)
(52) where γ_i and γ_i* are the predicted and ground truth depth values, and thr is defined as 1.25, 1.25^2, and 1.25^3 for δ_1, δ_2, and δ_3, respectively. In this metric, higher is better.
(53) Abs. Relative Error:
(54)
(55) where T represents the evaluation samples. In this metric, lower is better.
(56) RMSE (linear):
(57)
(58) RMSE (log):
(59)
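The four metrics named above can be sketched as follows, assuming the definitions commonly used in the monocular depth estimation literature: the threshold accuracy is the fraction of samples whose ratio max(prediction / truth, truth / prediction) is below thr, and the log variant of the RMSE uses the natural logarithm. These definitions, and the function name `depth_metrics`, are assumptions made for illustration; `pred` and `gt` correspond to the predicted and ground truth depth values.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Commonly used depth metrics over all evaluation samples (positive depths)."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta_1": float(np.mean(ratio < 1.25)),        # higher is better
        "delta_2": float(np.mean(ratio < 1.25 ** 2)),
        "delta_3": float(np.mean(ratio < 1.25 ** 3)),
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),   # lower is better
        "rmse_lin": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
    }

example_scores = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

In this example two of the three samples are exact and the third is off by a factor of two, so every threshold accuracy equals 2/3 and the absolute relative error equals 1/3.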
(60) Table 1 presents results obtained with the present invention compared to previous methods from the state of the art, as well as the present invention considering a classical CNN architecture and the proposed DWT-based CNN.
(61) The proposed DWT-based CNN represents an improvement of 11.5% in the RMSE (linear) metric compared to a classic CNN (without DWT), from 0.601 to 0.532, while reducing the model size from 17 MB to 16 MB. This demonstrates that the proposed structure using DWT in conjunction with convolutional filters not only allows a compact model but also improves its accuracy.
(62) TABLE 1
Method                                    δ_1    δ_2    δ_3    Abs. Rel. Error  RMSE (lin)  log10  Model size
DORN (CVPR'18)                            0.828  0.965  0.992  0.115            0.509       0.051  421 MB
Pattern-Affinitive (CVPR'19)              0.846  0.968  0.994  0.121            0.497       -      -
SharpNet (ICCVW'19)                       0.888  0.979  0.995  0.139            0.495       0.047  -
DenseNet (arXiv'18)                       0.846  0.974  0.994  0.123            0.465       0.053  165 MB
Present invention (classic CNN)           0.751  0.938  0.984  0.168            0.601       0.072  17 MB
Present invention (DWT-based CNN)         0.800  0.956  0.989  0.149            0.532       0.063  16 MB
Present invention (DWT, rigid alignment)  0.861  0.960  0.982  0.117            0.422       0.055  16 MB
(63) Compared to previous methods, the present invention is slightly less precise than recent approaches based on very deep CNN architectures, such as DenseNet. However, the model is one order of magnitude smaller (16 MB versus 165 MB), which is the main concern in this invention. In addition, scores were also reported considering a rigid alignment between predictions and ground truth depth maps (based on median and standard deviation) in the last row of the table, since the present method for novel view synthesis is invariant to the scale and shift of the depth map prediction.
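A rigid alignment based on median and standard deviation could look like the sketch below, which removes the shift using the medians and the scale using the ratio of standard deviations. The exact procedure used for the last row of the table is not given in the description, so this formula and the name `align_scale_shift` are assumptions.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Match the median and standard deviation of pred to those of gt."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    scale = gt.std() / max(pred.std(), 1e-12)     # guard against a constant prediction
    return (pred - np.median(pred)) * scale + np.median(gt)

truth = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
prediction = 2.0 * truth + 5.0                    # correct up to scale and shift
aligned = align_scale_shift(prediction, truth)
```

A prediction that differs from the ground truth only by a positive scale and a shift is mapped back onto the ground truth, which is why such an alignment suits a method that is invariant to both.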
(64) The present invention was also evaluated by considering the quality of the generated novel views from a single high-resolution image. Since no ground truth is available in this task (considering unconstrained pictures taken from a non-static scene), the following methodology was adopted: it is assumed that a scene represented by a relatively large number of image planes can be projected to a new point of view with high scene fidelity, so the rendered image can be considered a ground-truth image. Two different strategies for generating an MPI representation with few image planes are then compared: 1) uniform (fixed grid) slicing and 2) the proposed adaptive slicing approach.
(65) Standard image evaluation metrics are used to compare the proposed adaptive slicing algorithm with a uniform slicing approach, considering the same maximum number of image layers. Specifically, the Structural Similarity (SSIM) and the Peak Signal-to-Noise Ratio (PSNR) metrics were used to compare the projected AMI (or MPI for uniform distribution) with the considered ground truth based on a high number of image planes. In these experiments, the ground truth was defined as an MPI formed by 64 image planes, and the evaluated AMI and MPI had up to 8 image planes. In
(66) These examples demonstrate that the AMI produced by the adaptive slicing algorithm results in a higher similarity with the ground truth image when compared to the uniform MPI representation, even when using a smaller number of image planes. This fact can be observed from
(67) TABLE 2
Image     Method            SSIM   PSNR    Effective number of layers
Sample 1  Uniform slicing   0.563  17.600  8
Sample 1  Adaptive slicing  0.600  18.086  7
Sample 2  Uniform slicing   0.706  20.290  8
Sample 2  Adaptive slicing  0.830  22.872  4
Sample 3  Uniform slicing   0.837  24.277  8
Sample 3  Adaptive slicing  0.874  26.212  6
(68) Finally, the present invention was evaluated on a set of more than 500 high-resolution publicly available images collected from the Internet. The average results of the SSIM and PSNR metrics are presented in the table below; on both metrics, adaptive slicing scores higher than the uniform slicing approach.
(69) TABLE 3
Method            SSIM   PSNR
Uniform slicing   0.660  20.488
Adaptive slicing  0.723  21.969
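For reference, the PSNR between a rendered view and a reference image can be computed as in the sketch below, which uses the standard definition 10 * log10(peak^2 / MSE). The peak value of 1.0 (images scaled to [0, 1]) and the function name `psnr` are assumptions; the color range used in the experiments is not stated.

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB; returns inf for identical images."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference_view = np.zeros((4, 4, 3))
rendered_view = np.full((4, 4, 3), 0.1)
score = psnr(reference_view, rendered_view)
```

A uniform error of 0.1 on a [0, 1] scale gives a mean squared error of 0.01 and therefore a PSNR of 20 dB, which is in the same range as the values reported in the tables.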
(70) The invention may include one or a plurality of processors. In this sense, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), or an AI-dedicated processor such as a neural processing unit (NPU).
(71) The processors control the processing of the input data in accordance with a predefined operating rule stored in the non-volatile memory and/or the volatile memory. The predefined operating rule model is provided through training or learning.
(72) In the present invention, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule of a desired characteristic is made. The learning may be performed in the device itself and/or may be implemented through a separate server/system.
(73) The learning algorithm is a technique for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
(74) Although the present invention has been described in connection with certain preferred embodiments, it should be understood that it is not intended to limit the disclosure to those particular embodiments. Rather, it is intended to cover all alternatives, modifications and equivalents possible within the spirit and scope of the disclosure as defined by the appended claims.