CODING SCHEME FOR VIDEO DATA USING DOWN-SAMPLING/UP-SAMPLING AND NON-LINEAR FILTER FOR DEPTH MAP
20230051960 · 2023-02-16
Inventors
CPC classification (Section H: Electricity)
H04N13/117
H04N13/161
H04N19/59
H04N13/243
H04N19/597
International classification
Abstract
Methods of encoding and decoding video data are provided. In an encoding method, source video data comprising one or more source views is encoded into a video bitstream. Depth data of at least one of the source views is nonlinearly filtered and down-sampled prior to encoding. After decoding, the decoded depth data is up-sampled and nonlinearly filtered.
Claims
1. A method of encoding video data comprising: receiving the video data, wherein the video data comprises at least one source view, wherein each of the at least one source view comprises a texture map and a depth map; processing the depth map of the at least one source view so as to generate a processed depth map, wherein the processing comprises: nonlinear filtering of the depth map so as to generate a nonlinearly filtered depth map; and down-sampling the nonlinearly filtered depth map so as to generate the processed depth map; and encoding the processed depth map and the texture map so as to generate a video bitstream, wherein the nonlinear filtering comprises enlarging the area of at least one foreground object in the depth map.
2. The method of claim 1, wherein the nonlinear filtering comprises applying a filter, wherein the filter is designed using a machine learning algorithm.
3. The method of claim 1, wherein the nonlinear filtering is performed by a neural network, wherein the neural network comprises a plurality of layers, wherein the down-sampling is performed between two of the layers.
4. The method of claim 1, further comprising: processing the depth map according to a plurality of sets of processing parameters, wherein the processing of the depth map comprises generating a respective plurality of processed depth maps, wherein the processing parameters comprise at least one of a definition of the nonlinear filtering, a definition of the down-sampling, and a definition of processing operations to reconstruct the depth map; selecting a set of processing parameters, wherein the set of processing parameters are arranged to reduce a reconstruction error of a reconstructed depth map after the respective processed depth map has been encoded and decoded; and generating a metadata bitstream identifying the selected set of parameters.
5. A method of decoding video data comprising: receiving a video bitstream, wherein the video bitstream comprises an encoded depth map and an encoded texture map for at least one source view; decoding the encoded depth map so as to produce a decoded depth map; decoding the encoded texture map so as to produce a decoded texture map; and processing the decoded depth map so as to generate a reconstructed depth map, wherein the processing comprises: up-sampling the decoded depth map so as to generate an up-sampled depth map; and nonlinear filtering of the up-sampled depth map so as to generate the reconstructed depth map, wherein the nonlinear filtering comprises reducing the area of at least one foreground object in the depth map.
6. The method of claim 5, further comprising detecting that the decoded depth map has a lower resolution than the decoded texture map before the processing.
7. The method of claim 5, wherein the processing of the decoded depth map is based on the decoded texture map.
8. The method of claim 5, further comprising: up-sampling the decoded depth map; identifying peripheral pixels of at least one foreground object in the up-sampled depth map; determining whether the peripheral pixels are more similar to the foreground object or to the background based on the decoded texture map; and applying nonlinear filtering only to peripheral pixels that are determined to be more similar to the background.
9. The method of claim 5, wherein the nonlinear filtering comprises smoothing the edges of at least one foreground object.
10. The method of claim 5, further comprising: receiving a metadata bitstream, wherein the metadata bitstream is associated with the video bitstream, wherein the metadata bitstream identifies a set of parameters, wherein the set of parameters comprises a definition of the nonlinear filtering and/or a definition of the up-sampling; and processing the decoded depth map based on the identified set of parameters.
11. A computer program stored on a non-transitory medium, wherein the computer program when executed on a processor performs the method as claimed in claim 1.
12. A video encoder comprising: an input circuit, wherein the input circuit is arranged to receive video data, wherein the video data comprises at least one source view, wherein each of the at least one source view comprises a texture map and a depth map; a video processor circuit, wherein the video processor circuit is arranged to process the depth map of the at least one source view so as to generate a processed depth map, wherein the processing comprises: nonlinear filtering of the depth map so as to generate a nonlinearly filtered depth map; and down-sampling the nonlinearly filtered depth map so as to generate the processed depth map; an encoder circuit, wherein the encoder circuit is arranged to encode the texture map of the at least one source view, and the processed depth map, so as to generate a video bitstream; and an output circuit, wherein the output circuit is arranged to output the video bitstream, wherein the nonlinear filtering comprises enlarging the area of at least one foreground object in the depth map.
13. A video decoder comprising: a bitstream input circuit, wherein the bitstream input circuit is arranged to receive a video bitstream, wherein the video bitstream comprises an encoded depth map and an encoded texture map for at least one source view; a first decoder circuit, wherein the first decoder circuit is arranged to decode the encoded depth map from the video bitstream so as to produce a decoded depth map; a second decoder circuit, wherein the second decoder circuit is arranged to decode the encoded texture map from the video bitstream so as to produce a decoded texture map; a reconstruction processor circuit, wherein the reconstruction processor circuit is arranged to process the decoded depth map so as to generate a reconstructed depth map, wherein the processing comprises: up-sampling the decoded depth map so as to generate an up-sampled depth map; and nonlinear filtering of the up-sampled depth map so as to generate the reconstructed depth map; and an output circuit, wherein the output circuit is arranged to output the reconstructed depth map, wherein the nonlinear filtering comprises reducing the area of at least one foreground object in the depth map.
14. A computer program stored on a non-transitory medium, wherein the computer program when executed on a processor performs the method as claimed in claim 5.
15. The video encoder of claim 12, wherein the nonlinear filtering comprises applying a filter, wherein the filter is designed using a machine learning algorithm.
16. The video encoder of claim 12, wherein the nonlinear filtering is performed by a neural network, wherein the neural network comprises a plurality of layers, wherein the down-sampling is performed between two of the layers.
17. The video encoder of claim 12, further comprising: a depth map processing circuit, wherein the depth map processing circuit is arranged to process the depth map according to a plurality of sets of processing parameters, wherein the processing of the depth map comprises generating a respective plurality of processed depth maps, wherein the processing parameters comprise at least one of a definition of the nonlinear filtering, a definition of the down-sampling, and a definition of processing operations to reconstruct the depth map; a selection circuit, wherein the selection circuit is arranged to select a set of processing parameters, wherein the set of processing parameters are arranged to reduce a reconstruction error of a reconstructed depth map after the respective processed depth map has been encoded and decoded; and a generator circuit, wherein the generator circuit is arranged to generate a metadata bitstream identifying the selected set of parameters.
18. The video decoder of claim 13, further comprising a detection circuit, wherein the detection circuit is arranged to detect that the decoded depth map has a lower resolution than the decoded texture map before the processing.
19. The video decoder of claim 13, wherein the processing of the decoded depth map is based on the decoded texture map.
20. The video decoder of claim 13, further comprising: an up-sampling circuit, wherein the up-sampling circuit is arranged to up-sample the decoded depth map; an identification circuit, wherein the identification circuit is arranged to identify peripheral pixels of at least one foreground object in the up-sampled depth map; a determining circuit, wherein the determining circuit is arranged to determine whether the peripheral pixels are more similar to the foreground object or to the background based on the decoded texture map; and a filtering circuit, wherein the filtering circuit is arranged to apply nonlinear filtering only to peripheral pixels that are determined to be more similar to the background.
21. The video decoder of claim 13, wherein the nonlinear filtering comprises smoothing the edges of at least one foreground object.
22. The video decoder of claim 13, further comprising a receiver circuit, wherein the receiver circuit is arranged to receive a metadata bitstream, wherein the metadata bitstream is associated with the video bitstream, wherein the metadata bitstream identifies a set of parameters, wherein the set of parameters comprises a definition of the nonlinear filtering and/or a definition of the up-sampling, and wherein the reconstruction processor circuit is arranged to process the decoded depth map based on the identified set of parameters.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0077] For a better understanding of the invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings, in which:
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0086] The invention will be described with reference to the Figures.
[0087] It should be understood that the detailed description and specific examples, while indicating exemplary embodiments of the apparatus, systems and methods, are intended for purposes of illustration only and are not intended to limit the scope of the invention. These and other features, aspects, and advantages of the apparatus, systems and methods of the present invention will become better understood from the following description, appended claims, and accompanying drawings. It should be understood that the Figures are merely schematic and are not drawn to scale. It should also be understood that the same reference numerals are used throughout the Figures to indicate the same or similar parts.
[0088] Methods of encoding and decoding immersive video are disclosed. In an encoding method, source video data comprising one or more source views is encoded into a video bitstream. Depth data of at least one of the source views is nonlinearly filtered and down-sampled prior to encoding. Down-sampling the depth map helps to reduce the volume of data to be transmitted and therefore helps to reduce the bit rate. However, the inventors have found that simply down-sampling can lead to thin or small foreground objects, such as cables, disappearing from the down-sampled depth map. Embodiments of the present invention seek to mitigate this effect, and to preserve small and thin objects in the depth map.
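The effect described above can be illustrated with a minimal one-dimensional sketch (names and values are illustrative, not from the patent): a one-pixel-wide foreground "cable" (high value = near, assuming an inverse-depth convention) vanishes under plain 2x down-sampling, but survives when a max (dilation) filter is applied first.

```python
def max_filter_1d(row, k=3):
    """Replace each sample with the maximum over a window of size k."""
    r = k // 2
    return [max(row[max(0, i - r):i + r + 1]) for i in range(len(row))]

def downsample_1d(row, factor=2):
    """Keep every `factor`-th sample (nearest-neighbour down-sampling)."""
    return row[::factor]

depth = [0, 0, 0, 9, 0, 0, 0, 0]   # thin foreground object at index 3

naive = downsample_1d(depth)                    # object lost
dilated = downsample_1d(max_filter_1d(depth))   # object preserved

print(naive)    # [0, 0, 0, 0]
print(dilated)  # [0, 9, 9, 0]
```

The naive result contains no trace of the object, whereas dilating first guarantees the object covers at least one retained sample.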
[0089] Embodiments of the present invention may be suitable for implementing part of a technical standard, such as ISO/IEC 23090-12 MPEG-I Part 12 Immersive Video. Where possible, the terminology used herein is chosen to be consistent with the terms used in MPEG-I Part 12. Nevertheless, it will be understood that the scope of the invention is not limited to MPEG-I Part 12, nor to any other technical standard.
[0090] It may be helpful to set out the following definitions/explanations:
[0091] A “3D scene” refers to visual content in a global reference coordinate system.
[0092] An “atlas” is an aggregation of patches from one or more view representations after a packing process, into a picture pair which contains a texture component picture and a corresponding depth component picture.
[0093] An “atlas component” is a texture or depth component of an atlas.
[0094] “Camera parameters” define the projection used to generate a view representation from a 3D scene.
[0095] “Pruning” is a process of identifying and extracting occluded regions across views, resulting in patches.
[0096] A “renderer” is an embodiment of a process to create a viewport or omnidirectional view from a 3D scene representation, corresponding to a viewing position and orientation.
[0097] A “source view” is source video material before encoding that corresponds to the format of a view representation, which may have been acquired by capture of a 3D scene by a real camera or by projection by a virtual camera onto a surface using source camera parameters.
[0098] A “target view” is defined as either a perspective viewport or omnidirectional view at the desired viewing position and orientation.
[0099] A “view representation” comprises 2D sample arrays of a texture component and a corresponding depth component, representing the projection of a 3D scene onto a surface using camera parameters.
[0100] A machine-learning algorithm is any self-training algorithm that processes input data in order to produce or predict output data. In some embodiments of the present invention, the input data comprises one or more views decoded from a bitstream and the output data comprises a prediction/reconstruction of a target view.
[0101] Suitable machine-learning algorithms for being employed in the present invention will be apparent to the skilled person. Examples of suitable machine-learning algorithms include decision tree algorithms and artificial neural networks. Other machine-learning algorithms such as logistic regression, support vector machines or Naïve Bayesian model are suitable alternatives.
[0102] The structure of an artificial neural network (or, simply, neural network) is inspired by the human brain. Neural networks comprise layers, each layer comprising a plurality of neurons. Each neuron comprises a mathematical operation. In particular, each neuron may comprise a different weighted combination of a single type of transformation (e.g. the same type of transformation, such as a sigmoid, but with different weightings). In the course of processing input data, the mathematical operation of each neuron is performed on the input data to produce a numerical output, and the outputs of each layer in the neural network are fed into one or more other layers (for example, sequentially). The final layer provides the output.
[0103] Methods of training a machine-learning algorithm are well known. Typically, such methods comprise obtaining a training dataset, comprising training input data entries and corresponding training output data entries. An initialized machine-learning algorithm is applied to each input data entry to generate predicted output data entries. An error between the predicted output data entries and corresponding training output data entries is used to modify the machine-learning algorithm. This process can be repeated until the error converges, and the predicted output data entries are sufficiently similar (e.g. ±1%) to the training output data entries. This is commonly known as a supervised learning technique.
[0104] For example, where the machine-learning algorithm is formed from a neural network, (weightings of) the mathematical operation of each neuron may be modified until the error converges. Known methods of modifying a neural network include gradient descent, backpropagation algorithms and so on.
[0105] A convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs are regularized versions of multilayer perceptrons.
[0107] The decoder 400 decodes the encoded views (texture and depth). It passes the decoded views to a synthesizer 500. The synthesizer 500 is coupled to a display device, such as a virtual reality headset 550. The headset 550 requests the synthesizer 500 to synthesize and render a particular view of the 3-D scene, using the decoded views, according to the current position and orientation of the headset 550.
[0108] An advantage of the system shown in
[0110] The video encoder 300 also includes a depth decoder 340, a reconstruction processor 350 and an optimizer 360. These components will be described in greater detail in connection with the second embodiment of the encoding method, to be described below with reference to
[0111] Referring to
[0112] The source views received at the input 310 may be the views captured by the array of cameras 10. However this is not essential and the source views need not be identical to the views captured by the camera. Some or all of the source views received at the input 310 may be synthesized or otherwise processed source views. The number of source views received at the input 310 may be larger or smaller than the number of views captured by the array of cameras 10.
[0113] In the embodiment of
[0116] This processing operation effectively grows the size of all local foreground objects and hence keeps small and thin objects. However, the decoder should preferably be aware of what operation was applied, since it should preferably undo the introduced bias and shrink all objects to align the depth map with the texture again.
[0117] According to the present embodiment, the memory requirement for the video decoder is reduced. The original pixel-rate was: 1Y+0.5CrCb+1D, where Y=luminance channel, CrCb=chrominance channels, D=depth channel. According to the present example, using down-sampling by a factor of four (2×2), the pixel-rate becomes: 1Y+0.5CrCb+0.25D. Consequently a 30% pixel-rate reduction can be achieved. Most practical video decoders are 4:2:0 and do not include monochrome modes. In that case a pixel reduction of 37.5% is achieved.
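The arithmetic above can be checked directly. Rates are expressed in samples per luma pixel: Y = 1, 4:2:0 chroma (CrCb) = 0.5, full-resolution depth D = 1.

```python
# Depth carried as a monochrome plane:
before = 1 + 0.5 + 1          # Y + CrCb + D
after  = 1 + 0.5 + 0.25       # depth down-sampled 2x2 -> 1/4 of the samples
print((before - after) / before)   # 0.3 -> 30% pixel-rate reduction

# If the depth must itself be carried in a 4:2:0 stream (no monochrome
# mode), each depth plane drags 50% chroma samples along with it:
before_420 = 1 + 0.5 + (1 + 0.5)
after_420  = 1 + 0.5 + (0.25 + 0.125)
print((before_420 - after_420) / before_420)  # 0.375 -> 37.5% reduction
```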
[0119] Note that the operation of the depth decoder 340 and the reconstruction processor 350 will be described in greater detail below, with reference to the decoding method (see
[0120] Effectively, the video encoder 300 implements a decoder in-the-loop, to allow it to predict how the bitstream will be decoded at the far end decoder. The video encoder 300 selects the set of parameters that will give the best performance at the far end decoder (in terms of minimizing reconstruction error, for a given target bit rate or pixel rate). The optimization can be carried out iteratively, as suggested by the flowchart of
[0121] The parameters tested may include parameters of the nonlinear filtering 120a, parameters of the down-sampling 130a, or both. For example, the system may experiment with down-sampling by various factors in one or both dimensions. Likewise, the system may experiment with different nonlinear filters. For example, instead of a max filter (which assigns to each pixel the maximum value in a local neighborhood), other types of ordinal filter may be used. For instance, the nonlinear filter may analyze the local neighborhood around a given pixel, and may assign to the pixel the second highest value in the neighborhood. This may provide a similar effect to a max filter while helping to avoid sensitivity to single outlying values. The kernel size of the nonlinear filter is another parameter that may be varied.
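An ordinal filter of the kind just described can be sketched as follows (illustrative pure-Python code; a practical implementation would use an optimized image-processing library). For each pixel, the values in its 3x3 neighbourhood are sorted and the value at a chosen rank is selected: rank -1 gives a max filter, and rank -2 gives the "second highest" variant, which is less sensitive to single outlying samples.

```python
def ordinal_filter(img, rank=-2):
    """Apply a 3x3 ordinal filter to a 2-D list of values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sorted(window)[rank]
    return out

img = [[0, 0, 0],
       [0, 9, 0],   # a single outlying depth value
       [0, 0, 0]]
print(ordinal_filter(img, rank=-1)[0][0])  # 9: max filter spreads the outlier
print(ordinal_filter(img, rank=-2)[0][0])  # 0: second-highest ignores it
```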
[0122] Note that parameters of the processing at the video decoder may also be included in the parameter set (as will be described in greater detail below). In this way, the video encoder may select a set of parameters for both the encoding and decoding that help to optimize the quality versus bit rate/pixel rate. The optimization may be carried out for a given scene, or for a given video sequence, or more generally over a training set of diverse scenes and video sequences. The best set of parameters can thus change per sequence, per bit rate and/or per allowed pixel rate.
[0123] The parameters that are useful or necessary for the video decoder to properly decode the video bitstream may be embedded in a metadata bitstream associated with the video bitstream. This metadata bitstream may be transmitted/transported to the video decoder together with the video bitstream or separately from it.
[0125] The method of
[0126] One example of the method of
[0127] In the present embodiment, in order to undo the bias (foreground objects that have grown in size), the nonlinear filtering 240 of the up-scaled depth-maps comprises a color-adaptive, conditional erosion filter (steps 242, 244, and 240a in
[0128] The nonlinear filtering 240 according to the present embodiment will now be described in greater detail.
[0129] The steps taken to perform the adaptive erosion are: [0130] 1. Find local foreground edges—that is, peripheral pixels of foreground objects (marked X in
[0136] As mentioned above, this process can be noisy and may lead to jagged edges in the depth map. The steps taken to smooth the object edges represented in the depth map are: [0137] 1. Find local foreground edges—that is, peripheral pixels of foreground objects (like those marked X in
[0141] This smoothing will tend to convert outlying or protruding foreground pixels into background pixels.
[0142] In the example above, the method used the number of background pixels in a 3×3 kernel to identify whether a given pixel was an outlying peripheral pixel projecting from the foreground object. Other methods may also be used. For example, as an alternative or in addition to counting the number of pixels, the positions of foreground and background pixels in the kernel may be analyzed. If the background pixels are all on one side of the pixel in question, then it may be more likely to be a foreground pixel. On the other hand, if the background pixels are spread all around the pixel in question, then this pixel may be an outlier or noise, and more likely to really be a background pixel.
[0143] The pixels in the kernel may be classified in a binary fashion as foreground or background. A binary flag encodes this for each pixel, with a logical “1” indicating background and a “0” indicating foreground. The neighborhood (that is, the pixels in the kernel) can then be described by an n-bit binary number, where n is the number of pixels in the kernel surrounding the pixel of interest. One exemplary way to construct the binary number is as indicated in the table below:
TABLE 1
b.sub.7 = 1 | b.sub.6 = 0 | b.sub.5 = 1 | b.sub.4 = 0 | b.sub.3 = 0 | b.sub.2 = 1 | b.sub.1 = 0 | b.sub.0 = 1
[0144] In this example, b = b.sub.7 b.sub.6 b.sub.5 b.sub.4 b.sub.3 b.sub.2 b.sub.1 b.sub.0 = 10100101.sub.2 = 165. (Note that the algorithm described above with reference to
[0145] Training comprises counting for each value of b how often the pixel of interest (the central pixel of the kernel) is foreground or background. Assuming equal cost for false alarms and misses, the pixel is determined to be a foreground pixel if it is more likely (in the training set) to be a foreground pixel than a background pixel, and vice versa.
[0146] The decoder implementation will construct b and fetch the answer (pixel of interest is foreground or pixel of interest is background) from a look up table (LUT).
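The construction of b and the LUT lookup can be sketched as follows. The bit packing mirrors the table above; the LUT contents shown here are a toy stand-in (in a real system each of the 256 entries would be set by the training procedure described in the preceding paragraph, i.e. by majority counting over a training set).

```python
def neighbourhood_code(flags):
    """Pack 8 binary neighbour flags (b7 first; 1 = background,
    0 = foreground) into one byte."""
    b = 0
    for flag in flags:
        b = (b << 1) | flag
    return b

# The bit pattern from the table above: b7..b0 = 1,0,1,0,0,1,0,1
b = neighbourhood_code([1, 0, 1, 0, 0, 1, 0, 1])
print(b)  # 165

# Toy LUT (True = central pixel classified as foreground). Here we simply
# say "foreground if fewer than 4 neighbours are background"; a trained
# table would encode whatever the training statistics actually showed.
lut = [bin(code).count("1") < 4 for code in range(256)]
print(lut[b])  # False: code 165 has four background neighbours
```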
[0147] The approach of nonlinearly filtering the depth map at both the encoder and the decoder (for example, dilating and eroding, respectively, as described above) is counterintuitive, because it would normally be expected to remove information from the depth map. However, the inventors have surprisingly found that the smaller depth maps that are produced by the nonlinear down-sampling approach can be encoded (using a conventional video codec) with higher quality for a given bit rate. This quality gain exceeds the loss in reconstruction; therefore, the net effect is to increase end-to-end quality while reducing the pixel-rate.
[0148] As described above with reference to
[0149] When the parameters of the nonlinear filtering and down-sampling at the video encoder have been selected to reduce the reconstruction error, as described above, the selected parameters may be signaled in a metadata bitstream, which is input to the video decoder. The reconstruction processor 450 may use the parameters signaled in the metadata bitstream to assist in correctly reconstructing the depth map. Parameters of the reconstruction processing may include but are not limited to: the up-sampling factor in one or both dimensions; the kernel size for identifying peripheral pixels of foreground objects; the kernel size for erosion; the type of non-linear filtering to be applied (for example, whether to use a min-filter or another type of filter); the kernel size for identifying foreground pixels to smooth; and the kernel size for smoothing.
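The parameter set listed above might be represented as follows. This is purely an illustrative sketch; the field names are assumptions for this example and do not come from the patent text or any standard's metadata syntax.

```python
# Hypothetical reconstruction-parameter set, as the decoder might hold it
# after parsing the metadata bitstream:
reconstruction_params = {
    "upsample_factor": [2, 2],         # horizontal, vertical
    "edge_kernel_size": 3,             # finding peripheral pixels
    "erosion_kernel_size": 3,
    "filter_type": "min",              # type of nonlinear filter to apply
    "smooth_select_kernel_size": 3,    # identifying foreground pixels to smooth
    "smooth_kernel_size": 3,
}

for name, value in reconstruction_params.items():
    print(name, "=", value)
```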
[0150] An alternative embodiment will now be described, with reference to
[0151] The network parameters (weights) of the second part of the network may be transmitted as metadata with the bitstream. Note that different sets of neural network parameters may be created, corresponding to different coding configurations (different down-scale factor, different target bitrate, etc.). This means that the up-scaling filter for the depth map will behave optimally for a given bit rate of the texture map. This can increase performance, since texture coding artefacts change the luminance and chroma characteristics and, especially at object boundaries, this change will result in different weights of the depth up-scaling neural network.
[0153] I=Input 3-channel full-resolution texture map
[0154] Ĩ=Decoded full-resolution texture map
[0155] D=Input 1-channel full-resolution depth map
[0156] D.sub.down=down-scaled depth map
[0157] {tilde over (D)}.sub.down=down-scaled decoded depth map
[0158] C.sub.k=Convolution with k×k kernel
[0159] P.sub.k=Factor k downscale
[0160] U.sub.k=Factor k upsampling
[0161] Each vertical black bar in the diagram represents a tensor of input data or intermediate data—in other words, the input data to a layer of the neural network. The dimensions of each tensor are described by a triplet (p, w, h) where w and h are the width and height of the image, respectively, and p is the number of planes or channels of data. Accordingly, the input texture map has dimensions (3, w, h)—the three planes corresponding to the three color channels. The down-sampled depth map has dimensions (1, w/2, h/2).
[0162] The downscaling P.sub.k may comprise a factor-k downscale average, or a max-pool (or min-pool) operation of kernel size k. A downscale average operation might introduce some intermediate depth values, but the later layers of the neural network may correct for this (for example, based on the texture information).
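The two variants of the factor-2 downscale can be compared on a toy depth map (illustrative pure-Python sketch). Averaging across an object edge produces an intermediate depth value that belongs to neither surface, whereas max-pooling only ever emits values already present in the input.

```python
def pool2x2(img, op):
    """Downscale a 2-D list by a factor of 2, applying op to each 2x2 block."""
    return [[op([img[2*y][2*x],   img[2*y][2*x+1],
                 img[2*y+1][2*x], img[2*y+1][2*x+1]])
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

depth = [[0, 8, 8, 8],     # left block straddles a background/foreground edge
         [0, 8, 8, 8]]

avg = pool2x2(depth, lambda v: sum(v) / 4)
mx  = pool2x2(depth, max)
print(avg)  # [[4.0, 8.0]] -> intermediate value 4.0 at the edge
print(mx)   # [[8, 8]]     -> only original depth values survive
```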
[0163] Note that, in the training phase, the decoded depth map {tilde over (D)}.sub.down is not used. Instead, the uncompressed down-scaled depth map D.sub.down is used. The reason for this is that the training phase of the neural network requires the calculation of derivatives, which is not possible for the nonlinear video-encoder function. This approximation will likely be valid in practice, especially at higher qualities (higher bit rates). In the inference phase (that is, when processing real video data), the uncompressed down-scaled depth map D.sub.down is obviously not available to the video decoder. Therefore, the decoded depth map {tilde over (D)}.sub.down is used. Note also that the decoded full-resolution texture map Ĩ is used in the training phase as well as the inference phase. There is no need to calculate derivatives for it, as this is helper information rather than data processed by the neural network.
[0164] The second part of the network (after video decoding) will typically contain only a few convolutional layers due to the complexity constraints that may exist at a client device.
[0165] The deep-learning approach depends on the availability of training data, which in this case is easy to obtain. The uncompressed texture image and the full-resolution depth map are used at the input side, before video encoding. The second part of the network uses the decoded texture and the down-scaled depth map (produced via the first half of the network) as input for training, and the error is evaluated against the ground-truth full-resolution depth map that was also used as input. So, essentially, patches from the high-resolution source depth map serve both as input and as output to the neural network. The network hence has some aspects of both the auto-encoder architecture and the U-Net architecture. However, the proposed architecture is not a mere combination of these approaches. For instance, the decoded texture map enters the second part of the network as helper data, to optimally reconstruct the high-resolution depth map.
[0166] In the example illustrated in
[0167] The encoded depth map is transported to the video decoder 400 in the video bitstream. It is decoded by the depth decoder 426 in step 226. This produces the down-scaled decoded depth map {tilde over (D)}.sub.down. This is up-sampled (U.sub.2) to be used in the part of the neural network at the video decoder 400. The other input to this part of the neural network is the decoded full-resolution texture map Ĩ, which is generated by the texture decoder 424. This second part of the neural network has three layers. It produces as output a reconstructed estimate of the depth map, which is compared with the original depth map D to produce a resulting error e.
[0168] As will be apparent from the foregoing, the neural network processing may be implemented at the video encoder 300 by the video processor 320 and at the video decoder 400 by the reconstruction processor 450. In the example shown, the nonlinear filtering 120 and the down-sampling 130 are performed in an integrated fashion by the part of the neural network at the video encoder 300. At the video decoder 400, the up-sampling 230 is performed separately, prior to the nonlinear filtering 240, which is performed by the neural network.
[0169] It will be understood that the arrangement of the neural network layers shown in
[0170] In several of the embodiments described above, reference was made to max filtering, max pooling, dilation or similar operations, at the encoder. It will be understood that these embodiments assume that the depth is encoded as 1/d (or other similar inverse relationship), where d is distance from the camera. With this assumption, high values in the depth map indicate foreground objects and low values in the depth map denote background. Therefore, by applying a max- or dilation-type operation, the method tends to enlarge foreground objects. The corresponding inverse process, at the decoder, may be to apply a min- or erosion-type operation.
[0171] Of course, in other embodiments, depth may be encoded as d or log d (or another variable that has a directly correlated relationship with d). This means that foreground objects are represented by low values of d, and background by high values of d. In such embodiments, a min filtering, min pooling, erosion or similar operation may be performed at the encoder. Once again, this will tend to enlarge foreground objects, which is the aim. The corresponding inverse process, at the decoder, may be to apply a max- or dilation-type operation.
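The duality between the two conventions can be shown with a minimal sketch (illustrative values): with inverse depth (1/d) the foreground is high-valued, so the encoder dilates with a max filter; with direct depth (d) the foreground is low-valued, so the equivalent operation is a min filter. Either way the foreground object is enlarged, and the decoder applies the opposite operation.

```python
def filt(row, op):
    """1-D ordinal filter with a window of 3 samples."""
    return [op(row[max(0, i - 1):i + 2]) for i in range(len(row))]

inv_depth = [1, 1, 9, 1, 1]   # inverse-depth convention: 9 = near object
depth     = [9, 9, 1, 9, 9]   # same scene in the direct-d convention

print(filt(inv_depth, max))   # [1, 9, 9, 9, 1] -> foreground enlarged
print(filt(depth, min))       # [9, 1, 1, 1, 9] -> same enlargement
```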
[0172] The encoding and decoding methods of
[0173] Storage media may include volatile and non-volatile computer memory such as RAM, PROM, EPROM, and EEPROM. Various storage media may be fixed within a computing device or may be transportable, such that the one or more programs stored thereon can be loaded into a processor.
[0174] Metadata according to an embodiment may be stored on a storage medium. A bitstream according to an embodiment may be stored on the same storage medium or a different storage medium. The metadata may be embedded in the bitstream but this is not essential. Likewise, metadata and/or bitstreams (with the metadata in the bitstream or separate from it) may be transmitted as a signal modulated onto an electromagnetic carrier wave. The signal may be defined according to a standard for digital communications. The carrier wave may be an optical carrier, a radio-frequency wave, a millimeter wave, or a near field communications wave. It may be wired or wireless.
[0175] To the extent that an embodiment is implemented partly or wholly in hardware, the blocks shown in the block diagrams of
[0176] Generally, examples of methods of encoding and decoding video data, a computer program implementing these methods, and video encoders and decoders are indicated by the embodiments below.
EMBODIMENTS
[0177] 1. A method of encoding video data comprising one or more source views, each source view comprising a texture map and a depth map, the method comprising:
[0178] receiving (110) the video data;
[0179] processing the depth map of at least one source view to generate a processed depth map, the processing comprising: [0180] nonlinear filtering (120), and [0181] down-sampling (130); and
[0182] encoding (140) the processed depth map and the texture map of the at least one source view, to generate a video bitstream.
[0183] 2. The method of embodiment 1, wherein the nonlinear filtering comprises enlarging the area of at least one foreground object in the depth map.
[0184] 3. The method of embodiment 1 or embodiment 2, wherein the nonlinear filtering comprises applying a filter designed using a machine learning algorithm.
[0185] 4. The method of any one of the preceding embodiments, wherein the non-linear filtering is performed by a neural network comprising a plurality of layers and the down-sampling is performed between two of the layers.
[0186] 5. The method of any one of the preceding embodiments, wherein the method comprises processing (120a, 130a) the depth map according to a plurality of sets of processing parameters, to generate a respective plurality of processed depth maps,
[0187] the method further comprising:
[0188] selecting the set of processing parameters that reduces a reconstruction error of a reconstructed depth map after the respective processed depth map has been encoded and decoded; and
[0189] generating a metadata bitstream identifying the selected set of parameters.
[0190] 6. A method of decoding video data comprising one or more source views, the method comprising:
[0191] receiving (210) a video bitstream comprising an encoded depth map and an encoded texture map for at least one source view;
[0192] decoding (226) the encoded depth map, to produce a decoded depth map;
[0193] decoding (224) the encoded texture map, to produce a decoded texture map; and
[0194] processing the decoded depth map to generate a reconstructed depth map, wherein the processing comprises: [0195] up-sampling (230), and [0196] nonlinear filtering (240).
[0197] 7. The method of embodiment 6, further comprising, before the step of processing the decoded depth map to generate the reconstructed depth map, detecting that the decoded depth map has a lower resolution than the decoded texture map.
[0198] 8. The method of embodiment 6 or embodiment 7, wherein the nonlinear filtering comprises reducing the area of at least one foreground object in the depth map.
[0199] 9. The method of any one of embodiments 6-8, wherein the processing of the decoded depth map is based at least in part on the decoded texture map.
[0200] 10. The method of any one of embodiments 6-9, comprising:
[0201] up-sampling (230) the decoded depth map;
[0202] identifying (242) peripheral pixels of at least one foreground object in the up-sampled depth map;
[0203] determining (244), based on the decoded texture map, whether the peripheral pixels are more similar to the foreground object or to the background; and
[0204] applying nonlinear filtering (240a) only to peripheral pixels that are determined to be more similar to the background.
[0205] 11. The method of any one of embodiments 6-10, wherein the nonlinear filtering comprises smoothing (250) the edges of at least one foreground object.
[0206] 12. The method of any one of embodiments 6-11, further comprising receiving a metadata bitstream associated with the video bitstream, the metadata bitstream identifying a set of parameters,
[0207] the method further comprising processing the decoded depth map according to the identified set of parameters.
[0208] 13. A computer program comprising computer code for causing a processing system to implement the method of any one of embodiments 1 to 12 when said program is run on the processing system.
[0209] 14. A video encoder (300) configured to encode video data comprising one or more source views, each source view comprising a texture map and a depth map, the video encoder comprising:
[0210] an input (310), configured to receive the video data;
[0211] a video processor (320), configured to process the depth map of at least one source view to generate a processed depth map, the processing comprising: [0212] nonlinear filtering (120), and [0213] down-sampling (130);
[0214] an encoder (330), configured to encode the texture map of the at least one source view, and the processed depth map, to generate a video bitstream; and
[0215] an output (360), configured to output the video bitstream.
[0216] 15. A video decoder (400) configured to decode video data comprising one or more source views, the video decoder comprising:
[0217] a bitstream input (410), configured to receive a video bitstream, wherein the video bitstream comprises an encoded depth map and an encoded texture map for at least one source view;
[0218] a first decoder (426), configured to decode from the video bitstream the encoded depth map, to produce a decoded depth map;
[0219] a second decoder (424), configured to decode from the video bitstream the encoded texture map, to produce a decoded texture map;
[0220] a reconstruction processor (450), configured to process the decoded depth map to generate a reconstructed depth map, wherein the processing comprises: [0221] up-sampling (230), and [0222] nonlinear filtering (240),
[0223] and an output (470), configured to output the reconstructed depth map.
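The decoder-side reconstruction of embodiments 6, 9 and 10 can be sketched as follows: up-sample the decoded depth map, identify peripheral pixels of foreground objects, and apply a min (erosion-type) filter only to peripheral pixels whose texture is more similar to the background. This is an illustrative NumPy sketch under the 1/d depth convention; the helper names and thresholds are assumptions, not the claimed implementation:

```python
import numpy as np

def upsample_nn(depth, k=2):
    """Nearest-neighbour up-sampling by factor k."""
    return np.repeat(np.repeat(depth, k, axis=0), k, axis=1)

def min_filter_at(depth, mask, k=3):
    """Apply a k-by-k min (erosion-type) filter only where mask is True."""
    pad = k // 2
    padded = np.pad(depth, pad, mode="edge")
    out = depth.copy()
    for i, j in zip(*np.nonzero(mask)):
        out[i, j] = padded[i:i + k, j:j + k].min()
    return out

def reconstruct(depth_lo, texture, fg_thresh=0.5):
    """Up-sample, find peripheral foreground pixels, and erode only those
    whose texture is more similar to the background (fg_thresh is an
    illustrative threshold for the 1/d foreground test)."""
    depth = upsample_nn(depth_lo)
    fg = depth > fg_thresh  # foreground = high values under 1/d coding
    pad = np.pad(fg, 1, mode="edge")
    # Peripheral = foreground pixels with at least one background neighbour.
    neigh_bg = ~(pad[:-2, 1:-1] & pad[2:, 1:-1]
                 & pad[1:-1, :-2] & pad[1:-1, 2:])
    peripheral = fg & neigh_bg
    # Classify by distance of the pixel's texture to the mean foreground
    # and mean background texture values.
    fg_mean = texture[fg].mean()
    bg_mean = texture[~fg].mean()
    bg_like = np.abs(texture - bg_mean) < np.abs(texture - fg_mean)
    return min_filter_at(depth, peripheral & bg_like)
```

In this sketch, peripheral pixels whose texture matches the foreground keep their up-sampled depth, while background-like peripheral pixels are eroded, shrinking the foreground object back towards its original outline.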
[0224] Hardware components suitable for use in embodiments of the present invention include, but are not limited to, conventional microprocessors, application specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs). One or more blocks may be implemented as a combination of dedicated hardware to perform some functions and one or more programmed microprocessors and associated circuitry to perform other functions.
[0225] More specifically, the invention is defined by the appended CLAIMS.
[0226] Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. If a computer program is discussed above, it may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. If the term “adapted to” is used in the claims or description, it is noted the term “adapted to” is intended to be equivalent to the term “configured to”. Any reference signs in the claims should not be construed as limiting the scope.