Hardware-friendly transform method in codecs for plenoptic point clouds

Abstract

A hardware-friendly transform method for codecs of plenoptic point clouds is described. Given that the existing video-based point cloud compression (V-PCC) codec relies on the video codecs of multimedia processors embedded in System-on-Chip (SoC) mobile devices, the remaining V-PCC steps should be as efficient as possible to ensure fair power consumption. Accordingly, the method reduces the complexity of the transform by using integer transforms and by imposing limits on the number of distinct transform dimensions, with these limits designed to minimize coding-efficiency losses.

Claims

1. A method of hardware-friendly transform in codecs for plenoptic point clouds, comprising: obtaining a plenoptic samples vector v; transforming the vector v with a forward matrix M.sub.N; obtaining the transformed vector t; encoding the transformed vector t into a binary file and sending the encoded binary file to a decoder; decoding the binary file into a received vector t′; padding the received vector t′ with K zeros to form the vector {circumflex over (t)}; performing the inverse transform with the inverse transform matrix O.sub.L on the vector {circumflex over (t)}, generating the vector {circumflex over (v)}; and discarding the last K values of the vector {circumflex over (v)}, generating the reconstructed vector v′, such that v′=v if t′=t, wherein the reconstructed vector v′ follows the following relationship, assuming t′=t:
v′=[I.sub.N 0.sub.N×K]×(F.sub.2.sub.M).sup.−1×[I.sub.N; 0.sub.K×N]×M.sub.N×v
where the matrix [I.sub.N 0.sub.N×K] performs the discarding and the matrix [I.sub.N; 0.sub.K×N] performs the padding, such that:
[I 0]×(F.sub.2.sub.M).sup.−1×[I; 0]×M.sub.N=I.sub.N
in which F.sub.2.sub.M is a power of two sized transform and M.sub.N is a forward transform matrix of arbitrary size N.

2. The method according to claim 1, wherein the forward transform matrix M.sub.N is one of a floating-point, a fixed-point approximation, or an integer transform matrix.

3. The method according to claim 1, wherein the K discarded values follow the relationship K=L−N, where L is a dimension of the inverse transform matrix and N is a size of the vector v.

4. The method according to claim 1, wherein the inverse transform matrix has a power of two size.

5. The method according to claim 1, wherein the power of two sized transform F.sub.2.sub.M is a Hadamard matrix in natural order, such that F.sub.2.sub.M=H.sub.2.sub.M, with:
H.sub.2=[1 1; 1 −1], H.sub.2.sub.M=H.sub.2⊗H.sub.2.sub.M−1
in such a way that N≠2.sup.M, ∀M∈ℕ, with K=2.sup.M−N, where the forward transform M.sub.N is an adapted Hadamard matrix built as:
M.sub.N=([I 0]×H.sub.2.sub.M×[I; 0]).sup.−1.

6. The method according to claim 1, wherein arbitrary size adapted DCT matrices are used in an encoder and the integer HEVC DCT is used in a decoder.

7. The method according to claim 1, wherein the inverse transform matrix is pre-multiplied by a permutation matrix in such a way that values padded in the padding are at an end of the decoded vector.

8. The method according to claim 1, wherein an arbitrary size integer DCT matrix is used both in an encoder and in a decoder.

9. The method according to claim 1, wherein values of the transformed vector t or reconstructed vector v′ are multiplied by fixed-point or floating-point constants to keep their values correctly scaled.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) The objectives and advantages of the present invention will be clearer through the following detailed description of the example and the non-limiting drawings presented at the end of this document:

(2) FIG. 1 illustrates an expanded view of the Plenoptic Attribute Encoder of the state of the art.

(3) FIG. 2 illustrates an expanded view of the Plenoptic Attributes Decoder of the state of the art.

(4) FIG. 3 shows the common method for performing a direct and inverse 1D transform.

(5) FIG. 4 represents the method for performing the direct and inverse 1D transform according to an embodiment.

(6) FIG. 5 illustrates the proposed direct transform flow according to an embodiment.

(7) FIG. 6 illustrates the proposed inverse transform flow according to an embodiment.

(8) FIG. 7 shows the adapted Hadamard transform matrix used for the case where the next power of two is 4 according to an embodiment.

(9) FIG. 8 shows the adapted Hadamard transform matrices used for the case where the next power of two is 8 according to an embodiment.

(10) FIG. 9 shows the seven adapted Hadamard matrices for the case where the next power of two is 16 according to an embodiment.

(11) FIG. 10 illustrates the datapath of the Hadamard 1D transform with size eight according to an embodiment.

(12) FIG. 11 shows an example of a mobile SoC and some of its interfaces according to an embodiment.

(13) FIG. 12 illustrates an embodiment of point cloud capture and visualization according to an embodiment.

(14) FIG. 13 illustrates integer DCT matrices for sizes up to 15 according to an embodiment.

(15) FIGS. 14, 15 and 16 present graphs of the rate-distortion curves for the Longdress, RedAndBlack, and Soldier point clouds, respectively, according to an embodiment.

(16) FIG. 17 illustrates energy estimates according to the number of views, considering integer and floating-point transforms with multiplications and additions and shifts according to an embodiment.

(17) FIG. 18 shows the energy estimates of low complexity alternatives according to an embodiment.

(18) FIG. 19 presents the energy efficiency advantage over some integer alternatives according to an embodiment.

(19) FIG. 20 presents the energy efficiency advantage over a floating-point alternative according to an embodiment.

(20) FIGS. 21, 22 and 23 show the relationship between coding efficiency and energy efficiency for Longdress, RedAndBlack, and Soldier point clouds, respectively, according to an embodiment.

DETAILED DESCRIPTION

(21) FIG. 1 illustrates an expanded view of the Plenoptic Attribute Encoder of the state of the art. The attribute image main (101), which is a video frame obtained by projecting color information from the point cloud, goes through video compression (102), generating the attribute sub-bitstream main (103) that is embedded in the full compressed bitstream. The reconstructed attribute image main (104) is the equivalent image retrieved in the decoder. The differential encoder (107) within the Plenoptic Attribute Encoder (105) uses the reconstructed attribute image main (104) and the plenoptic views attribute images (106) to generate differential images. A view padding (109) may be required before the transform (110), which converts the differential images into a compact representation. The present invention improves the padding and transform stages, so they are detailed later. The scaling (111) maps values to the range supported by video compression, from 0 to 255 in the case of an 8-bit video encoder. An additional step of adding 128, or half of the supported range, is included in the scaling process, depending on the type of transform coefficient being generated. Some coefficients, such as padded ones, can be discarded before (112) or after scaling. The remaining transformed images then go through an image padding process (113) to generate an image suited for video compression. Video compression (114) generates the plenoptic attributes sub-bitstreams (115). Transform and scaling metadata (114) are also sent to the compressed bitstream. The reconstructed occupancy map (108) can be used by the differential encoder to ignore the values of unoccupied pixels and is used by image padding.

(22) An expanded view of the Plenoptic Attribute Decoder is illustrated in FIG. 2. The attribute sub-bitstream main (201) is decoded using video decompression (202), generating the reconstructed attribute image main (203). In the Plenoptic Attribute Decoder (204), video decompression (206) decodes the attribute sub-bitstreams (205). The inverse scaling (208), using plenoptic metadata information (207), remaps the values to the range of the used transform. The inverse transform (209) returns the data to the differential encoder format; the result is added to the reconstructed attribute image main (203), generating the reconstructed attribute images (211). The reconstructed plenoptic views (212) are passed to the video-based point cloud decoder for the complete plenoptic point cloud reconstruction. If the number of views is not compatible with the size of the inverse transform (209), padding with zeros (213) can be applied. In this case, views that are not part of the actual point cloud are discarded (214).

(23) FIG. 3 shows the common method for performing a forward and inverse 1D transform, considering the data flow of a plenoptic point clouds codec, such as the one from FIGS. 1 and 2. In the encoder (301), each vector v of N plenoptic samples (302), constructed with one sample of each of the N attribute images (106) or residual attribute images (107), is transformed (303) using a transform matrix (304). This process is performed for each vector of samples from the attribute images from plenoptic views (106).

(24) A forward 1D transform over a vector v with N positions uses a transform matrix M with N×N values, denoted as M.sub.N. Thus, the transformed vector t (305) is the result of a multiplication of the transform matrix M.sub.N with the input vector v, that is:
t=M.sub.N×v

(25) The obtained transformed vector (305) is forwarded to the next encoding steps (306), such as the scaling (110). On the decoder side (307), after the initial steps of the Plenoptic Attribute Decoder (308) the already scaled decoded transform samples (309) are inversely transformed (310), using the inverse of the transform matrix (304) that was used in the encoder. In general, because the transform matrix is designed to be orthonormal, the inverse transform matrix is the transposition of the forward transform matrix. For example, when the transform is the floating-point DCT, the inverse matrix is equal to the transposed DCT matrix. The same is true for the integer DCT of HEVC. Given its symmetry, for the case where the transform matrix is Hadamard, the inverse is Hadamard itself, that is, H.sup.−1=H.

(26) Assuming a scenario in which no loss of information was imposed on t after the transform, in such a way that t′=t, the executed operation is:

(27) v′=M.sup.−1×t′=M.sup.−1×M×v=I.sub.N×v=v

(28) Thus, in this case, the output vector v′ (311) is equal to the input vector v (302). This means that the reconstructed plenoptic samples (312) have been perfectly reconstructed. Of course, to achieve compression, loss of information is expected, and thus t′≠t, resulting in v′≠v.
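As a sketch of this round trip (numpy assumed; the orthonormal DCT-II construction is standard, and variable names are illustrative, not from the invention), the forward transform t=M.sub.N×v followed by the inverse M.sup.−1=M.sup.T recovers v exactly when t′=t:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix M_N: rows are cosine basis vectors.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0, :] *= np.sqrt(1.0 / n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

n = 5
M = dct_matrix(n)
v = np.array([10.0, 12.0, 11.0, 9.0, 13.0])  # plenoptic samples vector v
t = M @ v                                    # forward: t = M_N x v
v_rec = M.T @ t                              # inverse: orthonormal => M^-1 = M^T
assert np.allclose(v_rec, v)                 # lossless round trip when t' = t
```

Because the matrix is orthonormal, no separate scaling step is needed in this idealized case.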

(29) By restricting the number of available transforms, such as allowing the codec to use only power of two sized transforms, there is not always one M.sub.N for every possible N. Therefore, the vector v must be adjusted to become compatible with one of the available transform sizes. To do this, a padding method must be used. A possible padding method is based on repeating the last available value in v until the new padded vector p has a size compatible with the transform. Considering a transform matrix O.sub.2.sub.M with size 2.sup.M×2.sup.M, where 2.sup.M is the smallest power of two such that N≤2.sup.M, the vector p requires 2.sup.M values to be compatible with O. This means that 2.sup.M−N values must be included in such a padded vector. The repetition padding can be represented as:

(30) p=[I.sub.N; L.sub.2.sub.M.sub.−N]×v

(31) where all rows of L.sub.2.sub.M.sub.−N are equal to the last row of I.sub.N.

(32) In this case, the transform of v using O can be expressed as:

(33) {circumflex over (t)}=O×p=O×[I.sub.N; L.sub.2.sub.M.sub.−N]×v

(34) So, the resulting {circumflex over (t)} has size 2.sup.M. If the inverse is to be obtained without losses, all 2.sup.M transformed views must be transmitted to the decoder. To avoid sending the extra information due to padding, and thus reducing the coding efficiency by increasing the rate, a possible approach is to discard the extra K=2.sup.M−N attributes, transmitting only the original number N of plenoptic attributes. In such a case, considering {circumflex over (t)}′ as the transformed vector {circumflex over (t)} with the last 2.sup.M−N values discarded, the decoding operation may rely on zero padding to make {circumflex over (t)}′ compatible with the inverse transform O.sup.−1. This method causes errors in the decoded views, because v=v′ if, and only if, the last 2.sup.M−N values of {circumflex over (t)} are equal to 0. In all other cases v≠v′, so there are errors (e=v−v′≠0). This means that this approach also hurts the coding efficiency, this time in distortion, since the reconstruction will not be perfect due to the information discarded by the encoder.
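The loss introduced by repetition padding plus coefficient discarding can be illustrated numerically. This hedged sketch (numpy assumed; values are arbitrary) uses the size-4 natural-order Hadamard matrix for N=3 and shows that the reconstruction error e=v−v′ is nonzero:

```python
import numpy as np

H4 = np.array([[1, 1, 1, 1],
               [1, -1, 1, -1],
               [1, 1, -1, -1],
               [1, -1, -1, 1]], dtype=float)  # natural-order Hadamard, size 4

N, K = 3, 1
v = np.array([7.0, 3.0, 5.0])
p = np.concatenate([v, np.repeat(v[-1], K)])   # repetition padding of last value
t_hat = H4 @ p                                 # 2^M transformed values
t_sent = t_hat[:N]                             # discard last K to save rate
t_dec = np.concatenate([t_sent, np.zeros(K)])  # decoder zero-pads the missing K
v_hat = np.linalg.inv(H4) @ t_dec              # inverse transform
v_rec = v_hat[:N]                              # keep only the N original views
err = v - v_rec
assert not np.allclose(err, 0)                 # e = v - v' != 0: distortion remains
```

The discarded coefficient t̂[3] was nonzero here, so the zero-padded decoding cannot recover v exactly.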

(35) This invention provides as a solution a method that uses a transform of size N on the encoder and of size N or 2.sup.M on the decoder, such that the inverse is perfect assuming t′=t. In the case where the encoder uses an N sized transform and the decoder uses a 2.sup.M sized one, the inverse transform matrix cannot be obtained from the forward transform matrix, making the present invention different from the prior art illustrated in FIG. 3.

(36) FIG. 4 represents the transform flow of the present invention. On the encoder side (401), each plenoptic samples vector v (402) is transformed (403) using a specific forward matrix (404) to obtain the transformed vector t (405). The remaining encoding operations (406) from the Plenoptic Attributes Encoder are performed on the transformed views to create the binary file that is sent to a decoder (407). After the initial decoding operations (408), the transform is to be computed. However, to keep complexity small, the decoder only has power of two transforms available (409), i.e., the transform matrices are in the form O.sub.2.sub.M. When the size of the received vector t′ (410) is not compatible with the available transforms, K zeros are padded to the end of t′, forming vector {circumflex over (t)} (411). The transform (412) proceeds using O.sub.2.sub.M and {circumflex over (t)}, generating vector {circumflex over (v)} (413). As the original point cloud had N views, the last K values of vector {circumflex over (v)} are discarded (414), generating the decoded vector v′ (415). The reconstructed plenoptic samples (416) are forwarded to the next steps of decoding. Assuming that t′=t and that the inverse and forward matrices are designed as presented in the next paragraphs, then v′=v.

(37) To arrive at this asymmetry, it is possible to start from a known forward power of two sized transform (F.sub.2.sub.M) and define an arbitrary forward matrix (M.sub.N), such that v=v′ when transforming with M.sub.N, padding with zeros, performing the inverse with (F.sub.2.sub.M).sup.−1, and then discarding the last K values from the result, i.e.:

(38) v′=[I.sub.N 0.sub.N×K]×(F.sub.2.sub.M).sup.−1×[I.sub.N; 0.sub.K×N]×M.sub.N×v (assuming t′=t), where [I.sub.N 0.sub.N×K] performs the discarding and [I.sub.N; 0.sub.K×N] performs the padding

(39) To ensure that v′=v, the following equation must be true:

(40) [I 0]×(F.sub.2.sub.M).sup.−1×[I; 0]×M.sub.N=I.sub.N

(41) For the above equation to hold, the matrix

(42) ([I 0]×(F.sub.2.sub.M).sup.−1×[I; 0])

must be equal to (M.sub.N).sup.−1, since (M.sub.N).sup.−1×M.sub.N=I.sub.N. Thus, it is possible to obtain M.sub.N for any value of N<2.sup.M as:

(43) M.sub.N=([I 0]×(F.sub.2.sub.M).sup.−1×[I; 0]).sup.−1.

(44) If such a forward matrix M.sub.N can be found, it can be used in the encoder to avoid transmitting padding, while allowing the decoder to use (F.sub.2.sub.M).sup.−1 to perfectly reconstruct the original vector v using zero padding (only on the decoder). In short, the forward transform matrix is obtained as the inverse of the product of the discarding matrix, the inverse transform matrix, and the padding matrix.
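Equation (43) can be checked numerically. The following sketch (numpy assumed; function and variable names are ours, not from the invention) builds M.sub.N for N=3 from the size-4 Hadamard matrix taken as F.sub.2.sub.M and verifies the zero-pad, inverse-transform, discard round trip:

```python
import numpy as np

def adapted_forward_matrix(F, n):
    # [I 0] x F^-1 x [I; 0] selects the top-left n x n block of F^-1;
    # M_N is its inverse, so zero padding + F^-1 + discarding recovers v.
    return np.linalg.inv(np.linalg.inv(F)[:n, :n])

H4 = np.array([[1, 1, 1, 1],
               [1, -1, 1, -1],
               [1, 1, -1, -1],
               [1, -1, -1, 1]], dtype=float)  # F_4: natural-order Hadamard

N = 3
M3 = adapted_forward_matrix(H4, N)
v = np.array([7.0, 3.0, 5.0])
t = M3 @ v                                    # encoder: arbitrary-size transform
t_hat = np.concatenate([t, np.zeros(4 - N)])  # decoder: pad K = 1 zero
v_hat = np.linalg.inv(H4) @ t_hat             # power of two inverse transform
assert np.allclose(v_hat[:N], v)              # discard last K values: v' = v
```

No padded coefficients are ever transmitted, so rate and distortion are both preserved in the lossless (t′=t) case.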

(45) In the article “Arbitrarily Shaped Transform Coding Based on a New Padding Technique”, a method was proposed to perform the forward floating-point transform with arbitrary size, constituting an optimal padding technique for the 1D transform. In this case, the values can be discarded without loss of information, as mentioned earlier. The proposal shows that the padded values could be interlaced in the input vector to minimize the energy of the coefficients after the transform, helping the compression. Additionally, for each shape N and a known forward transform O.sub.2.sub.M, a floating-point forward matrix M.sub.N may exist to be used in the encoder such that the inverse transform uses (O.sub.2.sub.M).sup.−1. The results presented in the article consider the 2D floating-point DCT. Although this solves arbitrary size transforms in the decoder while also avoiding the need for padding in the encoder, it still requires using a floating-point transform in the decoder.

(46) In this invention, power of two sized transform with integer transform coefficients may be adopted on the decoder side, while an arbitrary size floating-point transform may be adopted on the encoder side. In this case, by using an appropriate scaling, the optimal padding approach may be used in the encoder with arbitrary size floating-point transform. This ensures that the decoder has a small complexity (using integer inverse transform) and reduces the number of available sizes (only powers of two), thus reducing the area demands if the decoder is embedded in a hardware accelerator.

(47) Given that the optimal padding may be interlaced on the input, the decoder must perform the de-interlacing. The inverse transform matrix may be pre-multiplied by a permutation matrix such that, after the inverse transform operation, the “decoded” padded values are at the end of the decoded vector, avoiding extra operations. Then, the last K values can be discarded without information loss. This case is represented in FIG. 4, where the forward transform matrix (404) is of arbitrary size N, while the inverse transform matrix (409) has power of two size and may be pre-multiplied by a permutation matrix. Alternatively, it is possible to perform the reordering after the transform for every padded decoded vector.

(48) A final option is using an arbitrary size integer or fixed-point transform to address only the issue with floating-point operations. In this case, the plenoptic attributes codec is the same, and the resulting complexity will depend on the used coefficients. While this approach does not solve the issue of supporting arbitrary transform sizes, it also does not have the problem of requiring padding and dealing with the coding efficiency reduction caused by discarding padded values.

(49) In all three cases, there is no need for padding to actually be performed on the encoder side, so no extra information is sent to the decoder. This also means that no information is lost by discarding padded values on the encoder. FIG. 5 illustrates the forward transform flow proposed in this invention. First, a vector v with N input values is received (501). Such a vector is transformed using the forward transform matrix M.sub.N, which can be a floating-point, a fixed-point approximation, or an integer transform matrix (502). If the transform is integer, scaling by a floating- or fixed-point constant may be necessary (503). The scaled transformed vector is then sent (504) to the next steps of the Plenoptic Attributes Encoder.

(50) FIG. 6 illustrates the inverse transform flow proposed in this invention. The decoder must know the inverse transform matrix O.sub.L somehow (601), where L≥N. Such information may be signaled in the bitstream, using an implicit definition from levels and/or profiles in a standard, or by another method. The vector t′ with N values is obtained (602). The decoder then verifies whether the inverse transform is compatible with vector t′ (603). If t′ is compatible with the inverse transform matrix, then t′ is assigned to vector {circumflex over (t)} (604). Otherwise, if t′ is not compatible, zero padding is performed over t′ and the result is assigned to {circumflex over (t)} (605). Vector {circumflex over (t)} is then transformed (606) using the inverse matrix, resulting in {circumflex over (v)}′. If a reorder is necessary and was not already accounted for in the inverse transform matrix, {circumflex over (v)}′ is reordered (607). If the reorder was already performed, this step can be skipped, or the identity matrix (I) may be used as the permutation matrix. The decoder then verifies again whether the inverse transform was compatible with vector t′ (608). If true, {circumflex over (v)}′ is assigned to v′ (609); otherwise, the last K=L−N values of {circumflex over (v)}′ are discarded (610), resulting in vector v′. As in the encoder, scaling by a floating- or fixed-point value may be necessary (611). Finally, the resulting vector is forwarded to the remaining steps of the plenoptic attributes decoding (612).
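The decoder flow of FIG. 6 can be sketched as a single function (numpy assumed; the signature and names are illustrative, not part of the invention):

```python
import numpy as np

def decode_inverse_transform(t_prime, O_inv, perm=None, scale=1.0):
    # Hedged sketch of the FIG. 6 decoder flow.
    L = O_inv.shape[0]
    N = len(t_prime)
    K = L - N                                       # K = L - N zeros to pad
    t_hat = np.concatenate([t_prime, np.zeros(K)])  # zero padding (605), no-op if K=0
    v_hat = O_inv @ t_hat                           # inverse transform (606)
    if perm is not None:
        v_hat = v_hat[perm]                         # optional reorder (607)
    v_prime = v_hat[:N] if K > 0 else v_hat         # discard last K values (610)
    return scale * v_prime                          # optional scaling (611)

H4 = np.array([[1, 1, 1, 1],
               [1, -1, 1, -1],
               [1, 1, -1, -1],
               [1, -1, -1, 1]], dtype=float)
v = np.array([7.0, 3.0, 5.0, 1.0])
t = H4 @ v                                          # compatible case: N == L
assert np.allclose(decode_inverse_transform(t, np.linalg.inv(H4)), v)
```

When K>0, the function follows the padding branch; the forward matrix must then be the adapted M.sub.N described above for v′=v to hold.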

(51) The well-known Hadamard transform is a strong candidate to ensure the minimum number of operations in the transform. The Hadamard matrices have sizes that are powers of two by definition. Also, the values in the Hadamard matrix are always +1 or −1, so no multiplication is needed. Moreover, the optimal method to compute the Hadamard transform using only N×log.sub.2 N operations is well known in the art, making such a transform the best candidate for a low-complexity view transform.
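The N×log.sub.2 N computation mentioned above is the fast Walsh-Hadamard butterfly; a minimal pure-Python sketch (function name is ours) matching the natural-order Hadamard matrices used here:

```python
def fwht(x):
    # In-place fast Walsh-Hadamard transform, natural (Sylvester) order:
    # log2(len(x)) stages of len(x) additions/subtractions, no multiplications.
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

# Same result as multiplying by the natural-order H_4.
assert fwht([7, 3, 5, 5]) == [20, 4, 0, 4]
```

Each stage mirrors one level of the recursive construction H.sub.2.sub.M=H.sub.2⊗H.sub.2.sub.M−1, which is also why the hardware datapath of FIG. 10 nests the smaller transforms inside the larger one.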

(52) The Hadamard matrix may have different orderings. The order obtained by the recursive construction is known as the natural ordered Hadamard matrix. The recursive construction is as follows:

(53) H.sub.2=[1 1; 1 −1], H.sub.2.sub.M=H.sub.2⊗H.sub.2.sub.M−1

(54) Assuming a forward transform matrix F.sub.2.sub.M=H.sub.2.sub.M, and thus (F.sub.2.sub.M).sup.−1=H.sub.2.sub.M, a set of forward transform matrices can be obtained when N≠2.sup.M, ∀M∈ℕ, with K=2.sup.M−N, such that:

(55) M.sub.N=([I 0]×H.sub.2.sub.M×[I; 0]).sup.−1

(56) FIGS. 7 to 9 show the obtained M.sub.N considering 2.sup.M=4, 8, 16. Other sizes for N>16 can be obtained considering the above equation. FIG. 7 shows the used transform matrix for the case where the next power of two is 4. There is only one case, i.e., for N=3 the encoder uses M.sub.3 (701), while the decoder uses H.sub.4 (702), considering zero padding on the input vector (213).

(57) FIG. 8 shows the used transform matrices for the case where the next power of two is 8. On the encoder side, for N=5 the encoder uses M.sub.5 (801), for N=6 the encoder uses M.sub.6 (802), and for N=7 the encoder uses M.sub.7 (803). On the other hand, in all these cases the decoder uses H.sub.8 (804), considering zero padding on the input vector (213). As the H.sub.8 is in natural order, it was obtained recursively (805) from H.sub.4 (702).

(58) FIG. 9 shows the seven adapted Hadamard matrices for the case where the next power of two is 16. On the encoder side, for N=9 the encoder uses M.sub.9 (901), for N=10 the encoder uses M.sub.10 (902), for N=11 the encoder uses M.sub.11 (903), for N=12 the encoder uses M.sub.12 (904), for N=13 the encoder uses M.sub.13 (905), for N=14 the encoder uses M.sub.14 (906), and for N=15 the encoder uses M.sub.15 (907). On the other hand, in all these cases the decoder uses H.sub.16 (908), considering zero padding on the input vector (213). As the H.sub.16 is in natural order, it can be obtained recursively (909) from H.sub.8 (805).
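The adapted matrices of FIGS. 7 to 9 follow the construction M.sub.N=([I 0]×H.sub.2.sub.M×[I; 0]).sup.−1. The following sketch (numpy assumed; function names are ours, and any fixed-point scaling applied in the actual figures is ignored) builds these matrices and verifies the perfect-reconstruction property for every non-power-of-two size up to 15:

```python
import numpy as np

def hadamard(m):
    # Natural-order Hadamard via the recursion H_{2^M} = H_2 (x) H_{2^(M-1)}.
    H = np.array([[1.0]])
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    for _ in range(m):
        H = np.kron(H2, H)
    return H

def adapted_hadamard(n):
    # M_N = ([I 0] x H_{2^M} x [I; 0])^-1 for the next power of two 2^M >= n.
    m = (n - 1).bit_length()
    return np.linalg.inv(hadamard(m)[:n, :n])

for n in range(3, 16):
    if n & (n - 1):                          # skip exact powers of two
        M = adapted_hadamard(n)
        H = hadamard((n - 1).bit_length())
        v = np.arange(1.0, n + 1)
        t = np.concatenate([M @ v, np.zeros(H.shape[0] - n)])
        assert np.allclose((H @ t)[:n], v)   # zero-pad, Hadamard, discard: v' = v
```

Here H itself plays the role of the inverse (as in paragraph 25, H.sup.−1=H up to a scaling constant handled separately, per claim 9).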

(59) The Hadamard matrix in its natural order is also important because such a recursive construction allows very efficient hardware architectures, where small transforms can be computed using the same hardware already present in the larger transforms. FIG. 10 shows how simple such an architecture is for H.sub.8 (1001). Because of the recursive construction of the transform matrix, this datapath also contains the datapaths to compute the transform with H.sub.4 (1002) and with H.sub.2 (1003).

(60) Therefore, this realization has the lowest possible transform cost. Its energy requirements are small, since there are only a few operations and no transform coefficients to be multiplied. In addition, the hardware is recursive in nature (FIG. 10), and thus occupies a small silicon area if embedded in a System-on-Chip (SoC), since a circuit that implements a certain size includes the circuits to run in parallel all the smaller sizes that precede it.

(61) FIG. 11 shows an example of a mobile SoC and some of its interfaces. A plenoptic point cloud bitstream can be received through the network antenna (1101), processed in the mobile SoC (1102) initially by the modem subsystem (1103), and then stored in the flash memory of the device (1104) or SD card (1105), or loaded directly into the main random access memory system (RAM) (1106) for on-demand decoding. The memory subsystem, containing an on-chip SRAM module (1107), makes bitstream data available to the whole SoC. To decode each attribute sub-bitstream (205), the mobile SoC (1102) can have an HEVC decoder built into the multimedia IP (1108). After inverse scaling (208) and zero padding (213), the inverse transform (209) can operate as software loaded on the CPU (1109) or GPU (1110), or even be embedded as a hardware accelerator synthesized within the multimedia IP (1108). Unused views should be discarded (214) and the other views can be displayed on a plenoptic screen. Even without a plenoptic screen, a device equipped with a plenoptic point cloud decoder may be able to simulate the different viewing angles using data from the device's sensors (1111), such as gyroscopes and/or accelerometers. According to motion sensor detection, the display (1112) can show colors related to a specific view. The speed at which the device can decode plenoptic point clouds will play an important role in ensuring the realism of the displayed media.

(62) Given the low complexity of the proposed embodiment, the encoder can also be embedded into a SoC. A camera with multiple lenses (1113) and some depth sensing methods may capture different views of a point cloud. The image data will be processed by the Image Signal Processor (ISP) (1114) and can be rendered into a point cloud. Then, a plenoptic point cloud encoder can be loaded on the CPU (1109), GPU (1110) or be enabled on the multimedia IP (1108).

(63) As a second embodiment, the case is considered where the encoder adopts a floating-point 1D DCT transform with optimal padding, which can be implemented as an arbitrary-sized floating-point DCT on the encoder side. However, as explained, changing the position of the padding values can improve the efficiency of transform encoding, and therefore the inverse transform must be reordered. In addition, an integer transform can be a sufficient approximation of the floating-point DCT in the decoder. In this embodiment, the inverse DCT of HEVC is adopted as the integer transform for sizes 4, 8, and 16. Thus, the values must be scaled by 1.0/(64×√(2.sup.M)).
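This scaling can be illustrated for size 4 (numpy assumed; the 4-point matrix below is the well-known HEVC core transform, and the round trip is only near-lossless because the integer matrix approximates the floating-point DCT):

```python
import numpy as np

# HEVC 4-point integer DCT matrix (forward); its transpose is the inverse.
M_int = np.array([[64, 64, 64, 64],
                  [83, 36, -36, -83],
                  [64, -64, -64, 64],
                  [36, -83, 83, -36]], dtype=float)

def dct_matrix(n):
    # Orthonormal floating-point DCT-II, used on the encoder side.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0, :] *= np.sqrt(1.0 / n)
    m[1:, :] *= np.sqrt(2.0 / n)
    return m

v = np.array([10.0, 12.0, 11.0, 9.0])
t = dct_matrix(4) @ v                        # encoder: floating-point DCT
v_hat = (M_int.T @ t) / (64.0 * np.sqrt(4))  # decoder: integer inverse + scaling
assert np.allclose(v_hat, v, atol=0.2)       # near-lossless up to integer rounding
```

The factor 64×√(2.sup.M) compensates the gain built into the HEVC integer coefficients relative to the orthonormal DCT.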

(64) In the case of this embodiment, the hardware design of the inverse DCT of an HEVC IP core can be partially reused, given the necessary modifications to operate in only one dimension. During the modification, the transform coefficients can be reordered. Of course, a person skilled in the art may be able to perform the reordering (shift) on the decoded vector v′, rather than making the shifts over the transform matrix.

(65) This embodiment has small complexity on the decoder (integer and power of two sized transform), while keeping a considerable complexity on the encoder side (floating-point arbitrary size transform). FIG. 12 shows one possible use of this embodiment, namely content creators and distributors with powerful processing capabilities that wish to adopt plenoptic point clouds, paying the larger complexity burden on their side (encoding) while alleviating the load on the clients' side (decoding), which may be mobile devices. Some content (1201) is captured via a camera array (1202). A powerful server (1203) or workstation will be able to represent the captured content by means of a plenoptic point cloud and encode it for distribution using the Plenoptic Attributes Encoder. Such an encoder may execute on CPUs or even on FPGA accelerators integrated into modern servers. The encoded plenoptic point cloud is transmitted through the network (1204) to be decoded and displayed by mobile devices (1205) with relatively low complexity because of the adopted integer inverse transform.

(66) A third embodiment of this invention addresses only the issue of having floating-point operations in the transform. For this, fixed-point arithmetic can be used (because it can be implemented as an integer representation). However, integer DCT approximations can also be used. FIG. 13 shows integer approximations constructed following the same principles as the integer HEVC DCT. However, they do not consist of the same transform used in HEVC, because in that standard the transform is only defined for power of two sizes, namely 4, 8, 16, and 32.

(67) Considering a case where the number of views N=3, M.sub.3 (1301) will be used both on the encoder and on the decoder. However, for decoding, the transpose of M.sub.3 is used. Similarly, for N=5, M.sub.5 (1302) is used on the encoder and M.sub.5.sup.T is used on the decoder. For sizes N={6,7,9,10,11,12,13,14,15}, the transform matrices are M.sub.6 (1303), M.sub.7 (1304), M.sub.9 (1305), M.sub.10 (1306), M.sub.11 (1307), M.sub.12 (1308), M.sub.13 (1309), M.sub.14 (1310), and M.sub.15 (1311), respectively.

(68) The same strategy can be adopted to design integer transforms for sizes larger than N=15. Also, when the number of views is a power of two, the HEVC DCT can be used. There are 61 distinct coefficients, disregarding their sign, considering the 11 matrices in FIG. 13 together with the HEVC DCT for sizes 4, 8 and 16. The disadvantage of this embodiment is that it still requires a specific transform matrix for each number of plenoptic views. While this avoids the need for any padding strategy, it is expensive to manufacture a solution with a dedicated hardware IP for this embodiment. However, it is still a valid approach, based on integer arithmetic that can be accelerated in software (SIMD or GPU). In addition, if, during a specific period, content with a certain number of views is more likely to be adopted, a good solution would be to prototype the transform for such a size in an FPGA. The advantage of this solution is that its coding efficiency is equivalent to that of the arbitrary-sized floating-point DCT, while the efficiency of the FPGA will be better than performing the transform on general purpose devices (CPU/GPU).

(69) The present invention proposes alternative transform methods that can be adopted considering the tradeoff between coding efficiency and complexity. One advantage of having some alternatives with different coding efficiencies versus complexity is that they can be related to specific levels (of complexity) determined by an international standard. To show the effects of the three embodiments of this invention, first an analysis of coding efficiency is provided, then a complexity analysis is provided using energy efficiency estimates and, finally, cost-benefit results are provided.

(70) To evaluate the coding efficiency, three embodiments of this invention and five transforms were implemented considering the state of the art in TMC2v11.0: “Prior art DCT (fp, N)” uses arbitrary size floating-point DCT on both encoder and decoder. “Prior art DCT op (fp, N and 2.sup.M)” uses arbitrary size adapted floating-point DCTs on the encoder and power of two sized floating-point DCT on the decoder. “Prior art DCT (fp, 2.sup.M)” uses power of two sized floating-point DCTs on both encoder and decoder. “Prior art HEVC DCT (i, 2.sup.M)” uses the power of two sized integer DCT from HEVC on both encoder and decoder. “Prior art Hadamard (i, 2.sup.M)” uses the Hadamard transform on both encoder and decoder. Such a transform is integer and power of two sized by definition.

(71) Thus, "Prior art DCT (fp, 2.sup.M)", "Prior art HEVC DCT (i, 2.sup.M)", and "Prior art Hadamard (i, 2.sup.M)" need padding because the transform size does not match the number of views in the tested point clouds. In these cases, repetition padding (repeating the last valid view) was adopted in the encoder and, because the transformed padded views are discarded by the encoder, the decoder uses zero-padding.
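The two padding strategies above can be sketched as follows; `views` and `coeffs` are hypothetical names for the plenoptic view vector and the received transform coefficients:

```python
import numpy as np

def encoder_repetition_pad(views, size):
    """Pad a length-N view vector up to `size` (the next power of two)
    by repeating the last valid view, as done at the encoder."""
    pad = size - len(views)
    return np.concatenate([views, np.full(pad, views[-1], dtype=views.dtype)])

def decoder_zero_pad(coeffs, size):
    """Zero-pad the received coefficients at the decoder: the transformed
    padded views were discarded at the encoder, so the missing positions
    are filled with zeros."""
    pad = size - len(coeffs)
    return np.concatenate([coeffs, np.zeros(pad, dtype=coeffs.dtype)])
```

For example, three views padded to a size-4 transform become `[v0, v1, v2, v2]` at the encoder, while three received coefficients become `[t0, t1, t2, 0]` at the decoder.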

(72) In relation to the embodiments of this invention, in summary: “This invention, embodiment 1” uses arbitrary size adapted Hadamard matrices in the encoder and natural ordered Hadamard matrices in the decoder; “This invention, embodiment 2” uses arbitrary size adapted floating-point DCT on the encoder and integer HEVC DCT on the decoder; “This invention, embodiment 3” uses, in both encoder and decoder, integer DCT of arbitrary size built using the same method for matrices of size in powers of two of the HEVC DCT.
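The encoder/decoder relationship of embodiment 1 (claim 1) can be illustrated with the round-trip sketch below. It uses floating-point arithmetic and a natural-ordered Hadamard matrix purely for illustration (the actual embodiment uses integer matrices): the adapted forward matrix M.sub.N is built so that zero-padding the coefficients, inverse transforming with the power-of-two matrix, and discarding the last K outputs recovers the input exactly.

```python
import numpy as np

def hadamard(m):
    """Natural-ordered (Sylvester) Hadamard matrix of size 2**m."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def adapted_forward_matrix(n, m):
    """Adapted N x N forward matrix M_N satisfying the claim-1 relation
    [I 0] x F^{-1} x [I; 0] x M_N = I_N, with F the 2**m-sized transform.
    The left-hand product reduces to the top-left N x N block of F^{-1},
    so M_N is its inverse (assumed invertible for this sketch)."""
    F_inv = np.linalg.inv(hadamard(m))
    return np.linalg.inv(F_inv[:n, :n])

# Round trip for N = 5 views with a size-8 Hadamard transform at the decoder
n, m = 5, 3
M = adapted_forward_matrix(n, m)
v = np.arange(1.0, n + 1)                          # plenoptic samples
t = M @ v                                          # encoder: adapted transform
t_hat = np.concatenate([t, np.zeros(2**m - n)])    # decoder: zero padding
v_hat = np.linalg.inv(hadamard(m)) @ t_hat         # inverse transform
assert np.allclose(v_hat[:n], v)                   # last K values are discarded
```

The assertion confirms that v′ = v when t′ = t, i.e., the decoder needs only the standard power-of-two inverse transform while the encoder absorbs the size adaptation.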

(73) Table 2 summarizes the tested transforms and their characteristics. In these tests, the encoder was configured with the default TMC2 parameter values of the C2-AI configuration, and the plenoptic attribute encoder was configured so that each attribute image was encoded with QP=QP.sub.main. The different transforms were tested over the Longdress, RedAndBlack, and Soldier point clouds from the original 8i Voxelized Surface Light Fields (VSLF) dataset, which uses 12-bit precision geometry.

(74) TABLE-US-00002 TABLE 2
Name                                  Arithmetic                                      Size
Prior art DCT (fp, N)                 floating-point                                  N
Prior art DCT op (fp, N and 2.sup.M)  floating-point                                  N (encoder) and 2.sup.M (decoder)
Prior art DCT (fp, 2.sup.M)           floating-point                                  2.sup.M
Prior art HEVC DCT (i, 2.sup.M)       integer                                         2.sup.M
Prior art Hadamard (i, 2.sup.M)       integer                                         2.sup.M
This invention, embodiment 1          integer                                         N (encoder) and 2.sup.M (decoder)
This invention, embodiment 2          floating-point (encoder) and integer (decoder)  N (encoder) and 2.sup.M (decoder)
This invention, embodiment 3          integer                                         N

(75) By itself, coding efficiency is a tradeoff between rate and distortion. The rate was calculated considering the bit rates of all views (main and plenoptic). The lower the rate, the better. The distortion was calculated as the Peak Signal-To-Noise Ratio (PSNR) of the Y channel between the original and decoded point clouds, all taken as a single signal instead of averaging the PSNR between views. The higher the PSNR, the better (less noise). One way to assess coding efficiency is through rate-distortion curves, which are presented in FIGS. 14 through 16, for the Longdress, RedAndBlack, and Soldier point clouds, respectively.
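The distortion measure described above (PSNR of the Y channel over all views taken as a single signal) can be sketched as follows, assuming 8-bit samples with peak value 255:

```python
import numpy as np

def psnr_y(original, decoded, peak=255.0):
    """PSNR between original and decoded Y-channel samples, computed
    over all views concatenated as one signal rather than averaging
    per-view PSNR values."""
    err = original.astype(np.float64) - decoded.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * np.log10(peak ** 2 / mse)
```

Computing a single PSNR over the concatenated signal weights every sample equally, whereas a per-view average would overweight views with few occupied pixels.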

(76) Table 3 shows the BD-rates of each transform in relation to multiple attribute encoding (when no transform is used). For these BD-rate values, the lower the value, the better. It is possible to note that embodiment 3 has no loss in coding efficiency compared to the state of the art using arbitrary size floating-point DCT. In addition, there is no loss in the coding efficiency of embodiment 2 in relation to the state of the art using optimal padding, thus showing that the integer transforms with limited sizes from HEVC in the decoder do not affect the coding efficiency. Finally, while embodiment 1 has a small reduction in coding efficiency compared to the other two embodiments, its coding efficiency is still better than those presented in the low complexity approaches from the state of the art using transforms with size 2.sup.M in both encoder and decoder and requiring padding.
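For reference, the BD-rate values in Table 3 follow the Bjontegaard metric. A common approximation (cubic fit of log-rate versus PSNR, integrated over the overlapping PSNR range) is sketched below; this is an illustrative implementation, not necessarily the exact variant used to produce Table 3:

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta rate (%): average rate difference between two
    rate-distortion curves over their overlapping PSNR range. Negative
    values mean the test codec needs less rate than the reference."""
    p_ref = np.polyfit(psnrs_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnrs_test, np.log10(rates_test), 3)
    lo = max(min(psnrs_ref), min(psnrs_test))
    hi = min(max(psnrs_ref), max(psnrs_test))
    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg_diff = (np.polyval(int_test, hi) - np.polyval(int_test, lo)
                - np.polyval(int_ref, hi) + np.polyval(int_ref, lo)) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```

For instance, a test codec whose curve needs exactly half the rate of the reference at every PSNR yields a BD-rate of −50%.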

(77) TABLE-US-00003 TABLE 3
Transform                             Longdress  RedAndBlack  Soldier
Prior art DCT (fp, N)                 −89.37     −83.54       −78.49
Prior art DCT op (fp, N and 2.sup.M)  −88.23     −81.9        −75.78
Prior art DCT (fp, 2.sup.M)           −83.78     −64.4        −63.96
Prior art HEVC DCT (i, 2.sup.M)       −83.81     −64.54       −63.9
Prior art Hadamard (i, 2.sup.M)       −83.28     −57.86       −65.77
This invention, embodiment 1          −87.48     −80.38       −70.18
This invention, embodiment 2          −88.23     −81.88       −75.74
This invention, embodiment 3          −89.38     −83.51       −78.46

(78) Considering a system where alternative methods are implemented in hardware, one way to evaluate the complexity of each method is by its energy efficiency. To estimate the energy efficiency of each method, the estimates provided in Table 1 were used. Energy efficiency estimates are obtained considering the number of operations required to calculate the transform over one sample.
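The estimation method can be sketched as follows. The per-operation energy values below are placeholders, since Table 1 is not reproduced in this section; they only illustrate the computation (operation counts weighted by per-operation cost):

```python
# Hypothetical per-operation energies in pJ (placeholders standing in
# for the values of Table 1; only the weighting scheme is illustrated).
OP_ENERGY_PJ = {
    "int_add": 0.03,
    "int_mul": 3.1,
    "shift":   0.03,
    "fp_add":  0.9,
    "fp_mul":  3.7,
}

def transform_energy_pj(op_counts):
    """Energy estimate for computing the transform over one sample,
    given the number of each operation required."""
    return sum(OP_ENERGY_PJ[op] * n for op, n in op_counts.items())
```

For example, a transform requiring 10 integer additions and 2 integer multiplications per sample would be estimated at 10 × 0.03 + 2 × 3.1 = 6.5 pJ under these placeholder costs.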

(79) First, to demonstrate that integer arithmetic is preferred over floating-point arithmetic, the energy estimates of the floating-point transforms were compared with "Prior art HEVC DCT (i, 2.sup.M)" and "This invention, embodiment 3". Moreover, to show that integer constant multiplication can be more efficiently performed by additions and shifts, the included integer transforms were compared with both implementations, i.e., using multipliers (×) or using additions and shifts (+ and <<). FIG. 17 shows the results. When using arbitrary sizes, the energy increases according to the number of views, i.e., the transform size. When using power of two sizes, the energy increases earlier since it requires the use of the next larger power of two transform size. Comparing "Prior art DCT (fp, 2.sup.M)" with "Prior art HEVC DCT (i, 2.sup.M)" (using ×), it is possible to see that an equivalent integer transform is more energy-efficient than its floating-point counterpart. However, the largest differences appear when comparing the adoption of × versus + and <<. "Prior art HEVC DCT (i, 2.sup.M)" using multiplication requires 11.16× more energy than what is required by using only additions and shifts. In the case of "This invention, embodiment 3", 10.87× more energy is required in the implementation with multipliers. These results make clear that an integer or fixed-point transform is far superior in terms of energy efficiency.
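The replacement of constant multiplications by shifts and additions can be illustrated with two of the HEVC DCT coefficient magnitudes (36 and 83):

```python
def mul_by_36(x):
    """Multiply by the HEVC DCT coefficient 36 with shifts and adds:
    36*x = 32*x + 4*x = (x << 5) + (x << 2)."""
    return (x << 5) + (x << 2)

def mul_by_83(x):
    """Multiply by the HEVC DCT coefficient 83 with shifts and adds:
    83*x = 64*x + 16*x + 2*x + x."""
    return (x << 6) + (x << 4) + (x << 1) + x
```

Since the transform matrix is constant, each multiplier in a hardware datapath can be replaced at design time by a fixed network of shifters and adders, which is the source of the roughly 11× energy reduction reported above.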

(80) FIG. 18 shows the energy estimates of the low-complexity alternatives, namely "This invention, embodiment 1" (encoder and decoder), "This invention, embodiment 2" (decoder only), "This invention, embodiment 3" (encoder and decoder are equivalent), "Prior art HEVC DCT (i, 2.sup.M)", and "Prior art Hadamard (i, 2.sup.M)". The encoder energy of "This invention, embodiment 2" is similar to that of "Prior art DCT (fp, N)" shown in FIG. 17. It is clear that "This invention, embodiment 1" is the best alternative in terms of energy efficiency.

(81) To highlight the energy efficiency advantage of "This invention, embodiment 1" with respect to "Prior art HEVC DCT (i, 2.sup.M)" and "This invention, embodiment 3", the ratio between each of these alternatives and "This invention, embodiment 1" was computed, for both encoder and decoder. FIG. 19 shows the obtained results. In the best case, "This invention, embodiment 1" was about 12× more energy-efficient than "Prior art HEVC DCT (i, 2.sup.M)" for both encoder and decoder. In the worst case, "This invention, embodiment 1" was more than twice as energy-efficient as its counterparts. Considering the decoder, this also means that "This invention, embodiment 1" has about the same benefits compared to "This invention, embodiment 2".

(82) FIG. 20 shows the improvement of adopting "This invention, embodiment 1" instead of an arbitrary size floating-point transform, such as "Prior art DCT (fp, N)". "This invention, embodiment 1" is at least 40× more energy-efficient, and up to 180× more energy-efficient for the largest tested number of views. If more views are used, the differences tend to increase even further.

(83) FIG. 21 shows the BD-Rate (%) versus Energy (pJ) for each transform, considering the Longdress point cloud. The three options that outperform the others, both in terms of coding efficiency and complexity, are "This invention, embodiment 1", "This invention, embodiment 2" and "This invention, embodiment 3". While "Prior art DCT (fp, 2.sup.M)" has virtually the same BD-Rate as "This invention, embodiment 3", the latter uses much less energy than the former. Also, "Prior art DCT op (fp, N and 2.sup.M)" and "This invention, embodiment 2" have similar BD-Rates, but the latter requires much less energy, although only on the decoder side. "This invention, embodiment 2" uses floating-point operations in the encoder, hence requiring more energy than "This invention, embodiment 1" and "This invention, embodiment 3".

(84) FIGS. 22 and 23 show the tradeoff results for the RedAndBlack and Soldier point clouds, respectively. The results are similar to those from Longdress. A slight variation in the energy estimates occurs for Soldier, since it has 13 views instead of the 12 in Longdress and RedAndBlack. Although the coding efficiency of "This invention, embodiment 1" decreases more for Soldier than for the other two point clouds, it is still better than the prior-art options at a similar energy consumption level.

(85) Although the present invention has been described in connection with certain preferential embodiments, it should be understood that it is not intended to limit disclosure to such particular embodiments. Instead, it is intended to cover all possible alternatives, modifications and equivalents within the spirit and scope of the invention, as defined by the attached claims.