IMAGE ENCODING AND DECODING, VIDEO ENCODING AND DECODING: METHODS, SYSTEMS AND TRAINING METHODS

20220272345 · 2022-08-25

    Inventors

    Cpc classification

    International classification

    Abstract

    Lossy or lossless compression and transmission, comprising the steps of: (i) receiving an input image; (ii) encoding it using an encoder trained neural network, to produce a y latent representation; (iii) encoding the y latent representation using a hyperencoder trained neural network, to produce a z hyperlatent representation; (iv) quantizing the z hyperlatent representation using a predetermined entropy parameter to produce a quantized z hyperlatent representation; (v) entropy encoding the quantized z hyperlatent representation into a first bitstream, using predetermined entropy parameters; (vi) processing the quantized z hyperlatent representation using a hyperdecoder trained neural network to obtain a location entropy parameter μ.sub.y, an entropy scale parameter σ.sub.y, and a context matrix A.sub.y of the y latent representation; (vii) processing the y latent representation, the location entropy parameter μ.sub.y and the context matrix A.sub.y, to obtain quantized latent residuals; (viii) entropy encoding the quantized latent residuals into a second bitstream, using the entropy scale parameter σ.sub.y; and (ix) transmitting the bitstreams.

    Claims

    1. A computer-implemented method for lossy or lossless image or video compression and transmission, the method including the steps of: (i) receiving an input image; (ii) encoding the input image using an encoder trained neural network, to produce a y latent representation; (iii) encoding the y latent representation using a hyperencoder trained neural network, to produce a z hyperlatent representation; (iv) quantizing the z hyperlatent representation using a predetermined entropy parameter to produce a quantized z hyperlatent representation; (v) entropy encoding the quantized z hyperlatent representation into a first bitstream, using predetermined entropy parameters; (vi) processing the quantized z hyperlatent representation using a hyperdecoder trained neural network to obtain a location entropy parameter μ.sub.y, an entropy scale parameter σ.sub.y, and a context matrix A.sub.y of the y latent representation; (vii) processing the y latent representation, the location entropy parameter μ.sub.y and the context matrix A.sub.y, using an implicit encoding solver, to obtain quantized latent residuals; (viii) entropy encoding the quantized latent residuals into a second bitstream, using the entropy scale parameter σ.sub.y; and (ix) transmitting the first bitstream and the second bitstream.

    2. The method of claim 1, in which the implicit encoding solver solves the implicit equations: (I) the quantized latent residuals equal a quantisation function of the sum of the y latent representation minus μ.sub.y minus A.sub.y acting on the quantised y latent representation; and (II) the quantised y latent representation equals the quantized latent residuals plus μ.sub.y plus A.sub.y acting on the quantised y latent representation.

    3. The method of claim 2, wherein the implicit encoding solver solves the implicit equations by defining B=I−A, where A is an m×m matrix, and I is the m×m identity matrix, wherein (a) if B is lower triangular, then the serial method forward substitution is used; or (b) if B is upper triangular, then the serial method backward substitution is used; or (c) B is factorised as a triangular decomposition, and then B*y=μ+{circumflex over (ξ)}, where {circumflex over (ξ)} is the quantized residual, is solved by inverting lower triangular factors with forward substitution, and by inverting upper triangular factors with backward substitution; or (d) B is factorised with a QR decomposition, where Q is an orthonormal matrix and R is an upper triangular matrix, and the solution is y=R.sup.−1 Q.sup.tμ, where Q.sup.t is Q transpose, or B is factorised using B=QL, where L is a lower triangular matrix, or B=RQ, or B=LQ, where Q is inverted by its transpose, R is inverted with back substitution, and L is inverted with forward substitution, and then respectively, the solution is y=L.sup.−1 Q.sup.tμ, or y=Q.sup.t R.sup.−1μ, or y=Q.sup.t L.sup.−1μ; or (e) B=D+L+U, with D a diagonal matrix, L a strictly lower triangular matrix, and U a strictly upper triangular matrix, and then the iterative Jacobi method is applied until a convergence criterion is met; or (f) the Gauss-Seidel method is used; or (g) the Successive Over Relaxation method is used; or (h) the Conjugate Gradient method is used.

    4. The method of claim 2, wherein the implicit encoding solver solves the implicit equations using an iterative solver, in which iteration is terminated when a convergence criterion is met.

    5. The method of claim 1, wherein the implicit encoding solver returns residuals and a quantized latent representation y.

    6. The method of claim 1, wherein the matrix A is lower triangular, upper triangular, strictly lower triangular, strictly upper triangular, or A has a sparse, banded structure, or A is a block matrix, or A is constructed so that its matrix norm is less than one, or A is parametrised via a matrix factorisation such as an LU or QR decomposition.

    7. A computer-implemented method for lossy or lossless image or video decoding, the method including the steps of: (i) receiving a first bitstream and a second bitstream; (ii) decoding the first bitstream using an arithmetic decoder, using predetermined entropy parameters, to produce a quantized z hyperlatent representation; (iii) decoding the quantized z hyperlatent representation using a hyperdecoder trained neural network, to obtain a location entropy parameter μ.sub.y, an entropy scale parameter σ.sub.y, and a context matrix A.sub.y of a y latent representation; (iv) decoding the second bitstream using the entropy scale parameter σ.sub.y in an arithmetic decoder, to output quantised latent residuals; (v) processing the quantized latent residuals, the location entropy parameter μ.sub.y and the context matrix A.sub.y, using an implicit decoding solver, to obtain a quantized y latent representation; (vi) decoding the quantized y latent representation using a decoder trained neural network, to obtain a reconstructed image.

    8. The method of claim 7, in which the implicit decoding solver solves the implicit equation that the quantised y latent representation equals the quantized latent residuals plus μ.sub.y plus A.sub.y acting on the quantised y latent representation.

    9. The method of claim 7, in which the implicit decoding solver uses an iterative solver, in which iteration is terminated when a convergence criterion is reached.

    10. The method of claim 7, in which the implicit decoding solver is not the same type of solver as the solver used in encoding.

    11. A computer implemented method of training an encoder neural network, a decoder neural network, a hyperencoder neural network, and a hyperdecoder neural network, and entropy parameters, the neural networks, and the entropy parameters, being for use in lossy image or video compression, transmission and decoding, the method including the steps of: (i) receiving an input training image; (ii) encoding the input training image using the encoder neural network, to produce a y latent representation; (iii) encoding the y latent representation using the hyperencoder neural network, to produce a z hyperlatent representation; (iv) quantizing the z hyperlatent representation using an entropy parameter of the entropy parameters to produce a quantized z hyperlatent representation; (v) entropy encoding the quantized z hyperlatent representation into a first bitstream, using the entropy parameters; (vi) processing the quantized z hyperlatent representation using the hyperdecoder neural network to obtain a location entropy parameter μ.sub.y, an entropy scale parameter σ.sub.y, and a context matrix A.sub.y of the y latent representation; (vii) processing the y latent representation, the location entropy parameter μ.sub.y and the context matrix A.sub.y, using an implicit encoding solver, to obtain quantized latent residuals; (viii) entropy encoding the quantized latent residuals into a second bitstream, using the entropy scale parameter σ.sub.y; (ix) decoding the first bitstream using an arithmetic decoder, using the entropy parameters, to produce a quantized z hyperlatent representation; (x) decoding the quantized z hyperlatent representation using the hyperdecoder neural network, to obtain a location entropy parameter μ.sub.y, an entropy scale parameter σ.sub.y, and a context matrix A.sub.y of a y latent representation; (xi) decoding the second bitstream using the entropy scale parameter σ.sub.y in an arithmetic decoder, to output quantised latent residuals; (xii) processing the quantized latent residuals, the location entropy parameter μ.sub.y and the context matrix A.sub.y, using an (e.g. implicit) (e.g. linear) decoding solver, to obtain a quantized y latent representation; (xiii) decoding the quantized y latent representation using the decoder neural network, to obtain a reconstructed image; (xiv) evaluating a loss function based on differences between the reconstructed image and the input training image, and a rate term; (xv) evaluating a gradient of the loss function; (xvi) back-propagating the gradient of the loss function through the decoder neural network, through the hyperdecoder neural network, through the hyperencoder neural network and through the encoder neural network, and using the entropy parameters, to update weights of the encoder, decoder, hyperencoder and hyperdecoder neural networks, and to update the entropy parameters; and (xvii) repeating steps (i) to (xvi) using a set of training images, to produce a trained encoder neural network, a trained decoder neural network, a trained hyperencoder neural network and a trained hyperdecoder neural network, and trained entropy parameters; and (xviii) storing the weights of the trained encoder neural network, the trained decoder neural network, the trained hyperencoder neural network and the trained hyperdecoder neural network, and storing the trained entropy parameters.

    12. The method of claim 11, in which the implicit encoding solver solves the implicit equations: (I) the quantized latent residuals equal a quantisation function of the sum of the y latent representation minus μ.sub.y minus A.sub.y acting on the quantised y latent representation; and (II) the quantised y latent representation equals the quantized latent residuals plus μ.sub.y plus A.sub.y acting on the quantised y latent representation.

    13. The method of claim 11, in which the implicit decoding solver solves the implicit equation that the quantised y latent representation equals the quantized latent residuals plus μ.sub.y plus A.sub.y acting on the quantised y latent representation.

    14. The method of claim 12, wherein the implicit encoding solver solves the implicit equations by defining B=I−A, where A is an m×m matrix, and I is the m×m identity matrix, wherein (a) if B is lower triangular, then the serial method forward substitution is used; or (b) if B is upper triangular, then the serial method backward substitution is used; or (c) B is factorised as a triangular decomposition, and then B*y=μ+{circumflex over (ξ)}, where {circumflex over (ξ)} is the quantized residual, is solved by inverting lower triangular factors with forward substitution, and by inverting upper triangular factors with backward substitution; or (d) B is factorised with a QR decomposition, where Q is an orthonormal matrix and R is an upper triangular matrix, and the solution is y=R.sup.−1 Q.sup.tμ, where Q.sup.t is Q transpose, or B is factorised using B=QL, where L is a lower triangular matrix, or B=RQ, or B=LQ, where Q is inverted by its transpose, R is inverted with back substitution, and L is inverted with forward substitution, and then respectively, the solution is y=L.sup.−1 Q.sup.tμ, or y=Q.sup.t R.sup.−1μ, or y=Q.sup.t L.sup.−1μ; or (e) B=D+L+U, with D a diagonal matrix, L a strictly lower triangular matrix, and U a strictly upper triangular matrix, and then the iterative Jacobi method is applied until a convergence criterion is met; or (f) the Gauss-Seidel method is used; or (g) the Successive Over Relaxation method is used; or (h) the Conjugate Gradient method is used.

    15. The method of claim 12, wherein the (e.g. implicit) (e.g. linear) decoding solver solves the (e.g. implicit) equations by defining B=I−A, where A is an m×m matrix, and I is the m×m identity matrix, wherein (a) if B is lower triangular, then the serial method forward substitution is used; or (b) if B is upper triangular, then the serial method backward substitution is used; or (c) B is factorised as a triangular decomposition, and then B*y=μ+{circumflex over (ξ)}, where {circumflex over (ξ)} is the quantized residual, is solved by inverting lower triangular factors with forward substitution, and by inverting upper triangular factors with backward substitution; or (d) B is factorised with a QR decomposition, where Q is an orthonormal matrix and R is an upper triangular matrix, and the solution is y=R.sup.−1 Q.sup.tμ, where Q.sup.t is Q transpose, or B is factorised using B=QL, where L is a lower triangular matrix, or B=RQ, or B=LQ, where Q is inverted by its transpose, R is inverted with back substitution, and L is inverted with forward substitution, and then respectively, the solution is y=L.sup.−1 Q.sup.tμ, or y=Q.sup.t R.sup.−1μ, or y=Q.sup.t L.sup.−1μ; or (e) B=D+L+U, with D a diagonal matrix, L a strictly lower triangular matrix, and U a strictly upper triangular matrix, and then the iterative Jacobi method is applied until a convergence criterion is met; or (f) the Gauss-Seidel method is used; or (g) the Successive Over Relaxation method is used; or (h) the Conjugate Gradient method is used.

    16. The method of claim 11, wherein the quantized y latent representation returned by the implicit encoding solver is used elsewhere in the data compression pipeline.

    17. The method of claim 11, wherein the implicit encoding solver solves the implicit equations using an iterative solver, in which iteration is terminated when a convergence criterion is met.

    18. The method of claim 11, wherein the implicit encoding solver returns residuals, and a quantized latent representation y.

    19. The method of claim 11, wherein the matrix A is lower triangular, upper triangular, strictly lower triangular, strictly upper triangular, or A has a sparse, banded structure, or A is a block matrix, or A is constructed so that its matrix norm is less than one, or A is parametrised via a matrix factorisation such as an LU or QR decomposition.

    20. The method of claim 11, wherein the implicit decoding solver is not the same type of solver as the solver used in encoding.
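
    By way of illustration only, and not as part of the claims, the following sketch shows one way the implicit encoding equations of claims 2 and 3 and the implicit decoding equation of claim 8 can be solved by forward substitution when the context matrix A is strictly lower triangular (so that B=I−A is lower triangular with unit diagonal). The NumPy implementation, the variable names and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def encode_residuals(y, mu, A):
    """Solve the implicit encoding equations of claim 2 by forward substitution,
    assuming A is strictly lower triangular (claim 3, option (a)).
    Returns the quantized residuals xi_hat and the quantized latent y_hat."""
    m = y.shape[0]
    xi_hat = np.zeros(m)
    y_hat = np.zeros(m)
    for i in range(m):
        ctx = A[i, :i] @ y_hat[:i]                 # A acting on already-resolved elements
        xi_hat[i] = np.round(y[i] - mu[i] - ctx)   # residual, claim 2, equation (I)
        y_hat[i] = xi_hat[i] + mu[i] + ctx         # quantised latent, claim 2, equation (II)
    return xi_hat, y_hat

def decode_latents(xi_hat, mu, A):
    """Implicit decoding solver of claim 8: solve (I - A) y_hat = mu + xi_hat
    by forward substitution."""
    m = xi_hat.shape[0]
    y_hat = np.zeros(m)
    for i in range(m):
        y_hat[i] = xi_hat[i] + mu[i] + A[i, :i] @ y_hat[:i]
    return y_hat

# toy check that encoding followed by decoding is self-consistent
rng = np.random.default_rng(0)
m = 8
y = rng.normal(size=m)
mu = rng.normal(scale=0.1, size=m)
A = np.tril(rng.normal(scale=0.2, size=(m, m)), k=-1)   # strictly lower triangular
xi_hat, y_hat_enc = encode_residuals(y, mu, A)
y_hat_dec = decode_latents(xi_hat, mu, A)
assert np.allclose(y_hat_enc, y_hat_dec)
```

    For other structures of A, the factorisations and iterative methods listed in claims 3, 14 and 15 (triangular, QR, QL, RQ or LQ decompositions, Jacobi, Gauss-Seidel, Successive Over Relaxation, Conjugate Gradient) would be used instead.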

    Description

    BRIEF DESCRIPTION OF THE FIGURES

    [0389] Aspects of the invention will now be described, by way of example(s), with reference to the following Figures, in which:

    [0390] FIG. 1 shows a schematic diagram of an artificial intelligence (AI)-based compression process, including encoding an input image x using a neural network E( . . . ), and decoding using a neural network D( . . . ), to provide an output image {circumflex over (x)}. Runtime issues are relevant to both the Encoder and the Decoder. Examples of issues relevant to parts of the process are identified.

    [0391] FIG. 2 shows a schematic diagram of an artificial intelligence (AI)-based compression process, including encoding an input image x using a neural network E( . . . ), and decoding using a neural network D( . . . ), to provide an output image {circumflex over (x)}, and in which there is provided a hyper encoder and a hyper decoder. “Dis” denotes elements of a discriminator network.

    [0392] FIG. 3 shows an example structure of an autoencoder with a hyperprior and a hyperhyperprior, where the hyperhyperlatents ‘w’ encode information regarding the latent entropy parameters ϕ.sub.z, which in turn allows for the encoding/decoding of the hyperlatents ‘z’. The model optimises over the parameters of all relevant encoder/decoder modules, as well as the hyperhyperlatent entropy parameters ϕ.sub.w. Note that this hierarchical structure of hyperpriors can be applied recursively without theoretical limitations.

    [0393] FIG. 4 shows a schematic diagram of an example encoding phase of an AI-based compression algorithm utilizing a linear implicit system (with corresponding Implicit Encoding Solver), for video or image compression. Related explanation is provided in section 1.4.2.

    [0394] FIG. 5 shows a schematic diagram of an example decoding phase of an AI-based compression algorithm utilizing an implicit linear system (with corresponding Decode Solver), for video or image compression. Related explanation is provided in section 1.4.2.

    [0395] FIG. 6 shows a worked example of constructing a Directed Acyclic Graph (DAG) given dependencies generated by an L-context matrix. (a) shows the L-context parameters associated with the i-th pixel in this example. The neighbouring context pixels are those directly above the current pixel, and the left neighbour. (b) shows pixels enumerated in raster scan order. Pixels in the same level of the Directed Acyclic Graph are coloured the same. (c) shows the resulting DAG. Those pixels on the same level are conditionally independent of each other, and can be encoded/decoded in parallel.

    [0396] FIG. 7 shows an example encoding process with predicted context matrix Ly in an example AI-based compression pipeline. In this diagram a generic implicit solver is depicted, which could be any one of the methods discussed in Section 2.2.1, for example.

    [0397] FIG. 8 shows an example decoding process with predicted context matrix Ly in an example AI-based compression pipeline. In this diagram a linear equation solver is depicted, which could be any one of the methods discussed in Section 2.2.2, for example.

    [0398] FIG. 9 shows an example of an original image G (left) and its compressed reconstruction H (right), trained using a reconstruction loss incorporating contextual similarity Eq. (4.5). The compressed image has 0.16 bits per pixel and a peak-signal-to-noise ratio of 29.7.

    [0399] FIG. 10 shows an example of an original image G (left) and its compressed reconstruction H (right), trained using a reconstruction loss incorporating adversarial loss Eq. (4.11) on VGG features. The compressed image has 0.17 bits per pixel and a peak-signal-to-noise ratio of 26.

    [0400] FIG. 11 shows an example of an original image G (left) and its compressed reconstruction H (right), trained using a reconstruction loss incorporating adversarial loss Eq. (4.11) on VGG features. The compressed image has 0.18 bits per pixel and a peak-signal-to-noise ratio of 26.

    [0401] FIG. 12 shows a diagram showing an example of architecture 1 of an example discriminator.

    [0402] FIG. 13 shows a diagram showing an example of architecture 2 of an example discriminator.

    [0403] FIG. 14 shows a diagram showing an example of architecture 3 of an example discriminator.

    DETAILED DESCRIPTION

    [0404] Technology Overview

    [0405] We provide a high level overview of our artificial intelligence (AI)-based (e.g. image and/or video) compression technology.

    [0406] In general, compression can be lossless or lossy. In both lossless and lossy compression, the file size is reduced. The file size is sometimes referred to as the “rate”.

    [0407] In lossy compression, however, it is acceptable for the reconstruction to differ from what was input. The output image {circumflex over (x)} after reconstruction of a bitstream relating to a compressed image is not the same as the input image x. The fact that the output image {circumflex over (x)} may differ from the input image x is represented by the hat over the “x”. The difference between x and {circumflex over (x)} may be referred to as “distortion”, or “a difference in image quality”. Lossy compression may be characterized by the “output quality”, or “distortion”.

    [0408] Although our pipeline may contain some lossless compression, overall the pipeline uses lossy compression.

    [0409] Usually, as the rate goes up, the distortion goes down. A relation between these quantities for a given compression scheme is called the “rate-distortion equation”. For example, a goal in improving compression technology is to obtain reduced distortion for a fixed size of the compressed file, which would provide an improved rate-distortion equation. For example, the distortion can be measured using the mean squared error (MSE) between the pixels of x and {circumflex over (x)}, but there are many other ways of measuring distortion, as will be clear to the person skilled in the art. Known compression and decompression schemes include, for example, JPEG, JPEG2000, AVC, HEVC and AV1.
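
    As an illustration of measuring distortion with the mean squared error, the short sketch below computes the MSE and the corresponding peak signal-to-noise ratio (PSNR) between an image x and a reconstruction {circumflex over (x)}; the random test images and the [0, 1] value range are assumptions made for the example only.

```python
import numpy as np

def mse(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

def psnr(x, x_hat, max_val=1.0):
    # PSNR = 10 * log10(max_val^2 / MSE); higher PSNR means lower distortion
    return 10.0 * np.log10(max_val ** 2 / mse(x, x_hat))

x = np.random.rand(512, 512, 3)                                  # stand-in for an input image
x_hat = np.clip(x + 0.01 * np.random.randn(*x.shape), 0.0, 1.0)  # stand-in for a reconstruction
print(mse(x, x_hat), psnr(x, x_hat))
```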

    [0410] Our approach includes using deep learning and AI to provide an improved compression and decompression scheme, or improved compression and decompression schemes.

    [0411] In an example of an artificial intelligence (AI)-based compression process, an input image x is provided. There is provided a neural network characterized by a function E( . . . ) which encodes the input image x. This neural network E( . . . ) produces a latent representation, which we call y. The latent representation is quantized to provide ŷ, a quantized latent. The quantized latent goes to another neural network characterized by a function D( . . . ) which is a decoder. The decoder provides an output image, which we call {circumflex over (x)}. The quantized latent ŷ is entropy-encoded into a bitstream.

    [0412] For example, the encoder is a library which is installed on a user device, e.g. laptop computer, desktop computer, smart phone. The encoder produces the y latent, which is quantized to ŷ, which is entropy encoded to provide the bitstream, and the bitstream is sent over the internet to a recipient device. The recipient device entropy decodes the bitstream to provide ŷ, and then uses the decoder which is a library installed on a recipient device (e.g. laptop computer, desktop computer, smart phone) to provide the output image {circumflex over (x)}.
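
    A minimal sketch of this encode, quantize and decode flow is given below, using small placeholder convolutional networks written in PyTorch. The layer shapes, the rounding quantizer and the omission of the actual entropy coding step are illustrative assumptions; they are not the networks E( . . . ) and D( . . . ) of this disclosure.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):              # plays the role of E(...)
    def __init__(self, c=3, n=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, n, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(n, n, 5, stride=2, padding=2),
        )
    def forward(self, x):
        return self.net(x)             # y latent representation

class Decoder(nn.Module):              # plays the role of D(...)
    def __init__(self, c=3, n=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(n, n, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(n, c, 5, stride=2, padding=2, output_padding=1),
        )
    def forward(self, y_hat):
        return self.net(y_hat)         # reconstructed image x_hat

enc, dec = Encoder(), Decoder()
x = torch.rand(1, 3, 64, 64)           # input image
y = enc(x)                             # latent representation y
y_hat = torch.round(y)                 # quantized latent (entropy coding of y_hat into a
                                       # bitstream, and its transmission, would happen here)
x_hat = dec(y_hat)                     # output image
```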

    [0413] E may be parametrized by a convolution matrix θ such that y=E.sub.θ(x).

    [0414] D may be parametrized by a convolution matrix Ω such that {circumflex over (x)}=D.sub.Ω(ŷ).

    [0415] We need to find a way to learn the parameters θ and Ω of the neural networks.

    [0416] The compression pipeline may be parametrized using a loss function L. In an example, we use gradient descent with back-propagation of the loss function, using the chain rule, to update the weight parameters θ and Ω of the neural networks via the gradients ∂L/∂w.

    [0417] The loss function is the rate-distortion trade-off. The distortion function is 𝒟(x, {circumflex over (x)}), which produces a value: the distortion loss 𝒟. The loss function can be used to back-propagate the gradient to train the neural networks.

    [0418] So for example, we use an input image, we obtain a loss function, we perform a backwards propagation, and we train the neural networks. This is repeated for a training set of input images, until the pipeline is trained. The trained neural networks can then provide good quality output images.

    [0419] An example image training set is the KODAK image set (e.g. at www.cs.albany.edu/˜xypan/research/snr/Kodak.html). An example image training set is the IMAX image set. An example image training set is the Imagenet dataset (e.g. at www.image-net.org/download). An example image training set is the CLIC Training Dataset P (“professional”) and M (“mobile”) (e.g. at http://challenge.compression.cc/tasks/).

    [0420] In an example, the production of the bitstream from ŷ is lossless compression.

    [0421] Based on Shannon entropy in information theory, the minimum rate (which corresponds to the best possible lossless compression) is minus the sum from i=1 to N of (p.sub.ŷ(ŷ.sub.i)*log.sub.2(p.sub.ŷ(ŷ.sub.i))) bits, where p.sub.ŷ is the probability of ŷ, for different discrete ŷ values ŷ.sub.i, where ŷ={ŷ.sub.1, ŷ.sub.2, . . . , ŷ.sub.N}, and where we know the probability distribution p.sub.ŷ. This is the minimum file size in bits for lossless compression of ŷ.
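
    For concreteness, the small sketch below evaluates this Shannon lower bound for a toy discrete distribution; the probability values are invented purely for illustration.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])      # toy probabilities of the discrete values
min_bits_per_symbol = -np.sum(p * np.log2(p))
print(min_bits_per_symbol)                   # 1.75 bits: the Shannon lower bound
```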

    [0422] Various entropy encoding algorithms are known, e.g. range encoding/decoding, arithmetic encoding/decoding.

    [0423] In an example, entropy coding EC uses ŷ and p.sub.ŷ to provide the bitstream. In an example, entropy decoding ED takes the bitstream and p.sub.ŷ and provides ŷ. This example coding/decoding process is lossless.

    [0424] How can we get the filesize in a differentiable way? We use Shannon entropy, or something similar to it. The expression for Shannon entropy is fully differentiable, and a neural network needs a differentiable loss function. Shannon entropy is a theoretical minimum entropy value. The entropy coding we use may not reach this theoretical minimum, but it is expected to come close to it.

    [0425] The pipeline needs a loss that we can use for training, and the loss needs to resemble the rate-distortion trade-off.

    [0426] A loss which may be used for neural network training is Loss=𝒟+λ*R, where 𝒟 is the distortion function, λ is a weighting factor, and R is the rate loss. R is related to entropy. Both 𝒟 and R are differentiable functions.
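
    A self-contained sketch of a single training step using a loss of this form is given below. The tiny stand-in networks, the additive-uniform-noise proxy for quantization and the unit-normal rate proxy are all assumptions made for the example; they are not the pipeline of this disclosure.

```python
import torch
import torch.nn as nn

# tiny stand-ins for E(...) and D(...); the real architectures are not shown here
enc = nn.Sequential(nn.Conv2d(3, 8, 5, 2, 2), nn.ReLU(), nn.Conv2d(8, 8, 5, 2, 2))
dec = nn.Sequential(nn.ConvTranspose2d(8, 8, 5, 2, 2, 1), nn.ReLU(),
                    nn.ConvTranspose2d(8, 3, 5, 2, 2, 1))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
lam = 0.01                                               # weighting factor λ

x = torch.rand(4, 3, 64, 64)                             # a training batch
y = enc(x)
y_noisy = y + torch.empty_like(y).uniform_(-0.5, 0.5)    # differentiable proxy for quantization
x_hat = dec(y_noisy)

distortion = torch.mean((x - x_hat) ** 2)                # D(x, x_hat), here the MSE
# rate proxy: negative log2-likelihood of the (noisy) latent under a unit normal
normal = torch.distributions.Normal(0.0, 1.0)
rate = -normal.log_prob(y_noisy).mean() / torch.log(torch.tensor(2.0))
loss = distortion + lam * rate                           # Loss = D + λ * R
opt.zero_grad(); loss.backward(); opt.step()             # back-propagation and weight update
```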

    [0427] There are some problems concerning the rate equation.

    [0428] The Shannon entropy H gives us a minimum file size as a function of ŷ and p.sub.ŷ, i.e. H(ŷ, p.sub.ŷ). The problem is: how can we know p.sub.ŷ, the probability distribution of the input? In fact, we do not know p.sub.ŷ, so we have to approximate it. We use q.sub.ŷ as an approximation to p.sub.ŷ. Because we use q.sub.ŷ instead of p.sub.ŷ, we are evaluating a cross entropy rather than an entropy. The cross entropy CE(ŷ, q.sub.ŷ) gives us the minimum filesize for ŷ given the probability distribution q.sub.ŷ.

    [0429] There is the relation

    [00003] CE(ŷ, q.sub.ŷ) = H(ŷ, p.sub.ŷ) + KL(p.sub.ŷ||q.sub.ŷ)

    [0430] where KL is the Kullback-Leibler divergence between p.sub.ŷ and q.sub.ŷ. The KL divergence is zero if p.sub.ŷ and q.sub.ŷ are identical.
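
    The relation can be checked numerically for toy discrete distributions, as in the sketch below; the probability values are invented for illustration.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])       # "true" distribution p (toy values)
q = np.array([0.5, 0.3, 0.2])       # assumed / model distribution q

H  = -np.sum(p * np.log2(p))        # entropy under p
CE = -np.sum(p * np.log2(q))        # cross entropy of q under p
KL =  np.sum(p * np.log2(p / q))    # Kullback-Leibler divergence KL(p || q)
assert np.isclose(CE, H + KL)       # CE = H + KL, so CE >= H, with equality when q = p
```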

    [0431] In a perfect world we would use the Shannon entropy to train the rate equation, but that would mean knowing p.sub.ŷ, which we do not know. We only know q.sub.ŷ, which is an assumed distribution.

    [0432] So to achieve small file compression sizes, we need q.sub.ŷ to be as close as possible to p.sub.ŷ. One category of our inventions relates to the q.sub.ŷ we use.

    [0433] In an example, we assume q.sub.ŷ is a factorized parametric distribution.

    [0434] One of our innovations is to make the assumptions about q.sub.ŷ more flexible. This can enable q.sub.ŷ to better approximate p.sub.ŷ, thereby reducing the compressed filesize.

    [0435] As an example, consider that p.sub.ŷ is a multivariate normal distribution, with a mean vector μ and a covariance matrix Σ. Σ has size N×N, where N is the number of pixels in the latent space. Assuming ŷ has dimensions 1×12×512×512 (relating to images with e.g. 512×512 pixels), N is roughly 3 million, so Σ has roughly N.sup.2, i.e. about 10 trillion, parameters that we would need to estimate. This is not computationally feasible. So, usually, assuming a multivariate normal distribution is not computationally feasible.
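
    The parameter count can be checked directly; the latent dimensions below follow the example above.

```python
n_latent = 12 * 512 * 512      # number of latent pixels for a 1x12x512x512 latent
print(n_latent)                # roughly 3 million
print(n_latent ** 2)           # roughly 1e13 entries in a full covariance matrix
```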

    [0436] Let us consider p.sub.ŷ, which as we have argued is too complex to be known exactly. This joint probability density function p(ŷ) can be represented as a product of conditional probability functions, as the second equality in the equation below expresses.

    [00004] p(ŷ) = p(ŷ.sub.1, ŷ.sub.2, . . . , ŷ.sub.N) = p(ŷ.sub.1)*p(ŷ.sub.2|ŷ.sub.1)*p(ŷ.sub.3|ŷ.sub.1, ŷ.sub.2)* . . . *p(ŷ.sub.N|ŷ.sub.1, . . . , ŷ.sub.N−1)

    [0437] Very often p(ŷ) is approximated by a factorized probability density function

    [00005] p(ŷ.sub.1)*p(ŷ.sub.2)*p(ŷ.sub.3)* . . . *p(ŷ.sub.N)

    [0438] The factorized probability density function is relatively easy to calculate computationally. One of our approaches is to start with a q.sub.ŷ which is a factorized probability density function, and then to weaken this condition so as to approach the conditional probability function, i.e. the joint probability density function p(ŷ), and thereby obtain smaller compressed filesizes. This is one of the classes of innovations that we have.

    [0439] Distortion functions 𝒟(x, {circumflex over (x)}) which correlate well with the human visual system are hard to identify. There exist many candidate distortion functions, but typically these do not correlate well with the human visual system when a wide variety of possible distortions is considered.

    [0440] We want humans who view picture or video content on their devices to have a pleasing visual experience when viewing this content, for the smallest possible file size transmitted to those devices. So we have focused on providing improved distortion functions, which correlate better with the human visual system. Modern distortion functions very often contain a neural network, which transforms the input and the output into a perceptual space before comparing them. The neural network can be a generative adversarial network (GAN) which performs some hallucination. There can also be some stabilization. It appears that humans evaluate image quality over density functions, so we try to make p({circumflex over (x)}) match p(x), for example using a generative method, e.g. a GAN.

    [0441] Hallucinating is generating fine detail in an image for the viewer: the fine, high-spatial-frequency detail does not need to be accurately transmitted, but some of it can instead be generated at the receiver end, given suitable cues for generating that detail, where the cues are sent from the transmitter.

    [0442] What should the neural networks E( . . . ) and D( . . . ) look like? What is the architecture optimization for these neural networks? How do we optimize the performance of these neural networks, where performance relates to filesize, distortion and real-time runtime performance? There are trade-offs between these goals. For example, if we increase the size of the neural networks, then distortion can be reduced and/or filesize can be reduced, but runtime performance goes down, because bigger neural networks require more computational resources. Architecture optimization makes computationally demanding neural networks run faster.

    [0443] We have provided innovation with respect to the quantization function Q. The problem with a standard quantization function is that it has zero gradient almost everywhere, which impedes training in a neural network environment that relies on back-propagation of gradients of the loss function. We have therefore provided custom gradient functions, which allow gradients to propagate and so permit neural network training.
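
    One widely used custom gradient of this kind is the straight-through estimator, sketched below in PyTorch. It is given as a representative example of a rounding function with a usable gradient, not necessarily the specific custom gradient functions used in this disclosure.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass, but pass the gradient straight through, so that
    training by back-propagation is not blocked by the zero gradient of rounding."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output              # identity (straight-through) gradient

y = torch.randn(4, requires_grad=True)
y_hat = RoundSTE.apply(y)
y_hat.sum().backward()
print(y.grad)                           # all ones: gradients flow despite rounding
```

    An equivalent one-line form often used in practice is y_hat = y + (torch.round(y) - y).detach().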

    [0444] We can perform post-processing which affects the output image. We can include in the bitstream additional information. This additional information can be information about the convolution matrix Ω, where D is parametrized by the convolution matrix Ω.

    [0445] The additional information about the convolution matrix Ω can be image-specific. An existing convolution matrix can be updated with the additional information about the convolution matrix Ω, and decoding is then performed using the updated convolution matrix.

    [0446] Another option is to fine tune the y, by using additional information about E. The additional information about E can be image-specific.

    [0447] The entropy decoding process should have access to the same probability distribution, if any, that was used in the entropy encoding process. It is possible that there exists some probability distribution for the entropy encoding process that is also used for the entropy decoding process. This probability distribution may be one to which all users are given access; this probability distribution may be included in a compression library; this probability distribution may be included in a decompression library. It is also possible that the entropy encoding process produces a probability distribution that is also used for the entropy decoding process, where the entropy decoding process is given access to the produced probability distribution. The entropy decoding process may be given access to the produced probability distribution by the inclusion of parameters characterizing the produced probability distribution in the bitstream. The produced probability distribution may be an image-specific probability distribution.

    [0448] FIG. 1 shows a schematic diagram of an artificial intelligence (AI)-based compression process, including encoding an input image x using a neural network, and decoding using a neural network, to provide an output image {circumflex over (x)}.

    [0449] In an example of a layer in an encoder neural network, the layer includes a convolution, a bias and an activation function. In an example, four such layers are used.

    [0450] In an example, we assume that q.sub.ŷ is a factorized normal distribution, where y={y.sub.1, y.sub.2, . . . , y.sub.N} and ŷ={ŷ.sub.1, ŷ.sub.2, . . . , ŷ.sub.N}. We assume each ŷ.sub.i (i=1 to N) follows a normal distribution N, e.g. with a mean μ of zero and a standard deviation σ of 1. We can define ŷ=Int(y−μ)+μ, where Int( ) is integer rounding.

    [0451] The rate loss in the quantized latent space is obtained by summing (Σ) from i=1 to N:

    [00006] Rate = −(Σ log.sub.2 q.sub.ŷ(ŷ.sub.i))/N = −(Σ log.sub.2 N(ŷ.sub.i|μ=0, σ=1))/N
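
    The sketch below applies the quantization ŷ=Int(y−μ)+μ introduced above and evaluates the rate formula with the density of a unit normal, as written; a practical entropy coder would use a discretized probability mass rather than the continuous density, and the toy latent values are invented for illustration.

```python
import numpy as np

y = np.random.randn(1000)                  # toy latent values
mu, sigma = 0.0, 1.0
y_hat = np.round(y - mu) + mu              # y_hat = Int(y - mu) + mu

# density of a unit normal evaluated at the quantized latents
density = np.exp(-0.5 * ((y_hat - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
rate = -np.mean(np.log2(density))          # bits per latent element
print(rate)
```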

    [0452] The output image {circumflex over (x)} can be sent to a discriminator network, e.g. a GAN network, to provide scores, and the scores are combined to provide a distortion loss.

    [0453] We want to make the q.sub.ŷ flexible so we can model the p.sub.ŷ better, and close the gap between the Shannon entropy and the cross entropy. We make the q.sub.ŷ more flexible by using meta information. We have another neural network on our y latent space which is a hyper encoder. We have another latent space called z, which is quantized to {circumflex over (z)}. Then we decode the z latent space into distribution parameters such as μ and σ. These distribution parameters are used in the rate equation.

    [0454] With the more flexible distribution, the rate loss becomes, summing (Σ) from i=1 to N:

    [00007] Rate = −(Σ log.sub.2 N(ŷ.sub.i|μ.sub.i, σ.sub.i))/N

    [0455] So we make the q.sub.ŷ more flexible, but the cost is that we must send meta information. In this system, we have

    [00008] bitstream.sub.ŷ = EC(ŷ, q.sub.ŷ(μ, σ)) and ŷ = ED(bitstream.sub.ŷ, q.sub.ŷ(μ, σ))

    [0456] Here the z latent gets its own bitstream.sub.{circumflex over (z)}, which is sent with bitstream.sub.ŷ. The decoder first decodes bitstream.sub.{circumflex over (z)}, then executes the hyper decoder to obtain the distribution parameters (μ, σ); the distribution parameters (μ, σ) are then used with bitstream.sub.ŷ to decode ŷ, which is then passed through the decoder to get the output image {circumflex over (x)}.

    [0457] Although we now have to send bitstream.sub.{circumflex over (z)}, its effect is to make bitstream.sub.ŷ smaller, and the total size of the new bitstream.sub.ŷ plus bitstream.sub.{circumflex over (z)} is smaller than that of bitstream.sub.ŷ without the use of the hyper encoder. This is a powerful method called a hyperprior, and it makes the entropy model more flexible by sending meta information. The loss equation becomes

    [00009] Loss = 𝒟(x, {circumflex over (x)}) + λ.sub.1*R.sub.y + λ.sub.2*R.sub.z
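
    The sketch below illustrates the hyperprior mechanism: a z hyperlatent is produced from y, quantized, and decoded into per-element entropy parameters (μ, σ) that drive the rate of ŷ. The stand-in networks, channel counts and softplus parametrisation of σ are illustrative assumptions, not the hyper encoder and hyper decoder of this disclosure.

```python
import torch
import torch.nn as nn

# illustrative stand-ins; the real hyper encoder/decoder architectures are not shown here
hyper_enc = nn.Sequential(nn.Conv2d(8, 4, 3, 2, 1), nn.ReLU(), nn.Conv2d(4, 4, 3, 2, 1))
hyper_dec = nn.Sequential(nn.ConvTranspose2d(4, 8, 3, 2, 1, 1), nn.ReLU(),
                          nn.ConvTranspose2d(8, 16, 3, 2, 1, 1))   # outputs both mu and sigma

y = torch.randn(1, 8, 16, 16)              # y latent from the encoder
z = hyper_enc(y)                           # z hyperlatent
z_hat = torch.round(z)                     # quantized z (entropy coded into bitstream_z)
params = hyper_dec(z_hat)
mu = params[:, :8]                         # per-element location parameters
sigma = nn.functional.softplus(params[:, 8:])   # per-element scale parameters, kept positive

y_hat = torch.round(y - mu) + mu           # quantize y relative to its predicted mean
normal = torch.distributions.Normal(mu, sigma)
rate_y = -(normal.log_prob(y_hat) / torch.log(torch.tensor(2.0))).mean()
print(rate_y)                              # rate term R_y driven by the hyperprior's (mu, sigma)
```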

    [0458] In more sophisticated approaches, it is further possible to use a hyper hyper encoder for z, and so on recursively.

    [0459] The entropy decoding process of the quantized z latent should have access to the same probability distribution, if any, that was used in the entropy encoding process of the quantized z latent. It is possible that there exists some probability distribution for the entropy encoding process of the quantized z latent that is also used for the entropy decoding process of the quantized z latent. This probability distribution may be one to which all users are given access; this probability distribution may be included in a compression library; this probability distribution may be included in a decompression library. It is also possible that the entropy encoding process of the quantized z latent produces a probability distribution that is also used for the entropy decoding process of the quantized z latent, where the entropy decoding process of the quantized z latent is given access to the produced probability distribution. The entropy decoding process of the quantized z latent may be given access to the produced probability distribution by the inclusion of parameters characterizing the produced probability distribution in the bitstream. The produced probability distribution may be an image-specific probability distribution.

    [0460] FIG. 2 shows a schematic diagram of an artificial intelligence (AI)-based compression process, including encoding an input image x using a neural network, and decoding using a neural network, to provide an output image {circumflex over (x)}, and in which there is provided a hyper encoder and a hyper decoder.

    [0461] In a more sophisticated approach, the distortion function 𝒟(x, {circumflex over (x)}) has multiple contributions. Discriminator networks produce a generative loss L.sub.GEN. For example, a Visual Geometry Group (VGG) network may be used to process x to provide m, and to process {circumflex over (x)} to provide {circumflex over (m)}; a mean squared error (MSE) is then computed using m and {circumflex over (m)} as inputs, to provide a perceptual loss. The MSE using x and {circumflex over (x)} as inputs can also be calculated. The loss equation becomes

    [00010] Loss = λ.sub.1*R.sub.y + λ.sub.2*R.sub.z + λ.sub.3*MSE(x, {circumflex over (x)}) + λ.sub.4*L.sub.GEN + λ.sub.5*VGG(x, {circumflex over (x)}),

    [0462] where the first two terms in the sum are the rate loss, and the final three terms are the distortion loss 𝒟(x, {circumflex over (x)}). Sometimes there can be additional regularization losses, which help to make training stable.
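
    A compact sketch of evaluating a multi-term loss of this form is given below. The discriminator stand-in, the randomly initialised VGG feature extractor (in practice a pretrained network would be used), the particular adversarial-loss expression and the λ values are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# stand-ins: x_hat would come from the decoder, the rates from the entropy model,
# and disc would be one of the discriminator architectures of FIGS. 12-14
x = torch.rand(1, 3, 64, 64)
x_hat = torch.rand(1, 3, 64, 64, requires_grad=True)
rate_y, rate_z = torch.tensor(0.8), torch.tensor(0.1)
disc = nn.Sequential(nn.Conv2d(3, 8, 3, 2, 1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

vgg_feat = vgg16().features[:16].eval()    # VGG feature extractor (randomly initialised here)
m, m_hat = vgg_feat(x), vgg_feat(x_hat)

mse = torch.mean((x - x_hat) ** 2)                      # pixel-space MSE
vgg = torch.mean((m - m_hat) ** 2)                      # perceptual (VGG-feature) loss
l_gen = nn.functional.softplus(-disc(x_hat)).mean()     # a generator-style adversarial loss

lam1, lam2, lam3, lam4, lam5 = 1.0, 1.0, 1.0, 0.01, 1.0     # illustrative weights
loss = lam1 * rate_y + lam2 * rate_z + lam3 * mse + lam4 * l_gen + lam5 * vgg
loss.backward()
```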

    [0463] Notes Re HyperPrior and HyperHyperPrior

    [0464] Regarding a system or method not including a hyperprior: if we have a y latent without a HyperPrior (i.e. without a third and a fourth network), the distribution over the y latent used for entropy coding is not thereby made flexible. The HyperPrior makes the distribution over the y latent more flexible and thus reduces entropy/filesize. Why? Because we can send y-distribution parameters via the HyperPrior. If we use a HyperPrior, we obtain a new, z, latent. This z latent has the same problem as the “old y latent” when there was no hyperprior, in that it has no flexible distribution. However, as the dimensionality of z is usually smaller than that of y, the issue is less severe.

    [0465] We can apply the concept of the HyperPrior recursively and use a HyperHyperPrior on the z latent space of the HyperPrior. If we have a z latent without a HyperHyperPrior (i.e. without a fifth and a sixth network), the distribution over the z latent used for entropy coding is not thereby made flexible. The HyperHyperPrior makes the distribution over the z latent more flexible and thus reduces entropy/filesize. Why? Because we can send z-distribution parameters via the HyperHyperPrior. If we use the HyperHyperPrior, we end up with a new w latent. This w latent has the same problem as the “old z latent” when there was no hyperhyperprior, in that it has no flexible distribution. However, as the dimensionality of w is usually smaller than that of z, the issue is less severe. An example is shown in FIG. 3.

    [0466] The above-mentioned concept can be applied recursively. We can have as many HyperPriors as desired, for instance: a HyperHyperPrior, a HyperHyperHyperPrior, a HyperHyperHyperHyperPrior, and so on.

    [0467] GB Patent application no. 2016824.1, filed 23 Oct. 2020, is incorporated by reference.

    [0468] PCT application PCT/GB2021/051041 entitled “IMAGE COMPRESSION AND DECODING, VIDEO COMPRESSION AND DECODING: METHODS AND SYSTEMS”, filed 29 Apr. 2021, is incorporated by reference.

    [0469] Notes Re Training

    [0470] Regarding seeding the neural networks for training, all the neural network parameters can be randomized with standard methods (such as Xavier Initialization). Typically, we find that satisfactory results are obtained with sufficiently small learning rates.

    [0471] Note

    [0472] It is to be understood that the arrangements referenced herein are only illustrative of the application for the principles of the present inventions. Numerous modifications and alternative arrangements can be devised without departing from the spirit and scope of the present inventions. While the present inventions are shown in the drawings and fully described with particularity and detail in connection with what is presently deemed to be the most practical and preferred examples of the inventions, it will be apparent to those of ordinary skill in the art that numerous modifications can be made without departing from the principles and concepts of the inventions as set forth herein.