IMAGE COMPRESSION AND DECODING, VIDEO COMPRESSION AND DECODING: TRAINING METHODS AND TRAINING SYSTEMS

20230230288 · 2023-07-20


    Abstract

    A computer-implemented method of training an image generative network f_θ for a set of training images, in which an output image x̂ is generated from an input image x of the set of training images non-losslessly, and in which a proxy network is trained for a gradient intractable perceptual metric that evaluates a quality of an output image x̂ given an input image x, the method of training using a plurality of scales for input images from the set of training images. In an embodiment, a blindspot network b_α is trained which generates an output image x̃ from an input image x. Related computer systems, computer program products and computer-implemented methods of training are disclosed.

    Claims

    1. A computer-implemented method of training an image generative network f_θ for a set of training images, in which an output image x̂ is generated from an input image x of the set of training images non-losslessly, and in which a proxy network is trained for a gradient intractable perceptual metric that evaluates a quality of an output image x̂ given an input image x, the method of training using a plurality of scales for input images from the set of training images, the method including the steps of: (i) receiving an input image x of the set of training images and generating one or more images which are derived from x to make a multiscale set of images {x_i} which includes x; (ii) the image generative network f_θ generating an output image x̂_i from an input image x_i ϵ {x_i}, without tracking gradients for f_θ; (iii) the proxy network outputting an approximated function output ŷ_i, using the x_i and the x̂_i as inputs; (iv) the gradient intractable perceptual metric outputting a function output y_i, using the x_i and the x̂_i as inputs; (v) evaluating a loss for the proxy network, using the y_i and the ŷ_i as inputs, and including the evaluated loss for the proxy network in a loss array for the proxy network; (vi) repeating steps (ii) to (v) for all the images x_i in the multiscale set of images {x_i}; (vii) using backpropagation to compute gradients of parameters of the proxy network with respect to an aggregation of the loss array assembled in executions of step (v); (viii) optimizing the parameters of the proxy network based on the results of step (vii), to provide an optimized proxy network; (ix) the image generative network f_θ generating an output image x̂_i from an input image x_i ϵ {x_i}; (x) the optimized proxy network outputting an optimized approximated function output ŷ_i, using the x_i and the x̂_i as inputs; (xi) evaluating a loss for the generative network f_θ, using the x_i, the x̂_i and the optimized approximated function output ŷ_i as inputs, and including the evaluated loss for the generative network f_θ in a loss array for the generative network f_θ; (xii) repeating steps (ix) to (xi) for all the images x_i in the multiscale set of images {x_i}; (xiii) using backpropagation to compute gradients of parameters of the generative network f_θ with respect to an aggregation of the loss array assembled in executions of step (xi); (xiv) optimizing the parameters of the generative network f_θ based on the results of step (xiii), to provide an optimized generative network f_θ; and (xv) repeating steps (i) to (xiv) for each member of the set of training images.
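    By way of illustration only, the alternating schedule of steps (i) to (xv) can be sketched with toy one-parameter stand-ins: the generative network f_θ is replaced by a scalar gain model, the proxy network by a linear model on the MSE, and the gradient intractable metric by a hypothetical scalar function (twice the MSE). None of these stand-ins is a claimed network; they only make the control flow of the claim executable.

```python
import random

def mse(x, x_hat):
    return sum((u - v) ** 2 for u, v in zip(x, x_hat)) / len(x)

def metric(x, x_hat):
    # Stand-in for the gradient intractable perceptual metric h_xi
    # (assumption: twice the MSE; a real system would call e.g. VMAF).
    return 2.0 * mse(x, x_hat)

class ToyProxy:
    # Stand-in proxy network: y_hat = a * MSE(x, x_hat), one parameter a.
    def __init__(self):
        self.a = 1.0
    def __call__(self, x, x_hat):
        return self.a * mse(x, x_hat)

class ToyGenerator:
    # Stand-in generative network f_theta: x_hat = g * x, one parameter g.
    def __init__(self):
        self.g = 0.5
    def __call__(self, x):
        return [self.g * u for u in x]

def multiscale(x, n=3):
    # Step (i): derive the multiscale set {x_i} by factor-2 downsampling
    # (1-D signals stand in for images).
    out = [x]
    for _ in range(n - 1):
        out.append(out[-1][::2])
    return out

def train_step(x, gen, proxy, lr=0.5):
    xs = multiscale(x)
    # Proxy pass, steps (ii)-(viii): no gradients flow through f_theta.
    grad_a, proxy_losses = 0.0, []
    for xi in xs:
        x_hat = gen(xi)                                # (ii)
        y_hat = proxy(xi, x_hat)                       # (iii)
        y = metric(xi, x_hat)                          # (iv)
        proxy_losses.append((y_hat - y) ** 2)          # (v)
        grad_a += 2.0 * (y_hat - y) * mse(xi, x_hat)   # hand-derived d/da
    proxy.a -= lr * grad_a / len(xs)                   # (vii)-(viii)
    # Generative pass, steps (ix)-(xiv): minimise the proxy's score.
    grad_g, gen_losses = 0.0, []
    for xi in xs:
        x_hat = gen(xi)                                # (ix)
        gen_losses.append(proxy(xi, x_hat))            # (x)-(xi)
        mx2 = sum(u * u for u in xi) / len(xi)
        grad_g += -2.0 * proxy.a * (1.0 - gen.g) * mx2  # hand-derived d/dg
    gen.g -= lr * grad_g / len(xs)                     # (xiii)-(xiv)
    return proxy_losses, gen_losses

random.seed(0)
gen, proxy = ToyGenerator(), ToyProxy()
x = [random.random() for _ in range(8)]
for _ in range(50):                                    # step (xv) analogue
    proxy_losses, gen_losses = train_step(x, gen, proxy)
```

    In an automatic-differentiation framework the hand-derived gradients would instead be obtained by back-propagation, with the output of f_θ detached during the proxy pass.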

    2. The method of claim 1, wherein the one or more images which are derived from x to make a multiscale set of images {x_i} are derived by downsampling.

    3. The method of claim 1, wherein the generative network f_θ includes an encoder, which encodes (by performing lossy encoding) an input image x into a bitstream, and includes a decoder, which decodes the bitstream into an output image x̂.

    4. The method of claim 1, wherein the method includes an iteration of a training pass of the generative network, and a training pass of the proxy network.

    5. The method of claim 1, wherein the generative and proxy networks have separate optimizers.

    6. The method of claim 1, wherein for the case of proxy network optimization, gradients do not flow through the generative network.

    7. The method of claim 1, wherein the method is used for learned image or video compression.

    8. The method of claim 1, wherein the gradient intractable perceptual metric is a perceptual loss function.

    9. The method of claim 1, wherein the gradient intractable perceptual metric is VMAF, VIF, DLM or IFC, or a mutual information based estimator.

    10. The method of claim 1, wherein the generative network includes a compression network, wherein a term is added to the total loss of the compression network to stabilise the initial training of the compression network.

    11. The method of claim 1, wherein the generative loss includes a generic distortion loss which includes one or more stabilisation terms.

    12. The method of claim 11, wherein the stabilisation terms include Mean Squared Error (MSE) or a combination of analytical losses with weighted deep-embeddings of a pre-trained neural network.

    13. The method of claim 1, wherein a perceptual quality score is assigned to the image at each scale and is aggregated by an aggregation function.

    14. The method of claim 1, wherein the set of images includes a downsampled image that has been downsampled by a factor of two in each dimension.

    15. The method of claim 1, wherein the set of images includes a downsampled image that has been downsampled by a factor of four in each dimension.

    16. The method of claim 1, wherein the mean of the ŷ_i is used to train the image generative network by attempting to maximise or minimise the mean of the ŷ_i using stochastic gradient descent.

    17. The method of claim 1, wherein the predictions y_i are used to train the proxy network to force its predictions to be closer to an output of the perceptual metric, using stochastic gradient descent.

    18. The method of claim 1, wherein for each image x, an RGB image is provided.

    19. A computer system configured to train an image generative network f_θ for a set of training images, in which the system generates an output image x̂ from an input image x of the set of training images non-losslessly, and in which a proxy network is trained for a gradient intractable perceptual metric that evaluates a quality of an output image x̂ given an input image x, wherein the computer system is configured to: (i) receive an input image x from the set of training images and generate one or more images which are derived from x to make a multiscale set of images {x_i} which includes x; (ii) use the image generative network f_θ to generate an output image x̂_i from an input image x_i ϵ {x_i}, without tracking gradients for f_θ; (iii) use the proxy network to output an approximated function output ŷ_i, using the x_i and the x̂_i as inputs; (iv) use the gradient intractable perceptual metric to output a function output y_i, using the x_i and the x̂_i as inputs; (v) evaluate a loss for the proxy network, using the y_i and the ŷ_i as inputs, and include the evaluated loss for the proxy network in a loss array for the proxy network; (vi) repeat (ii) to (v) for all the images x_i in the multiscale set of images {x_i}; (vii) use backpropagation to compute gradients of parameters of the proxy network with respect to an aggregation of the loss array assembled in executions of (v); (viii) optimize the parameters of the proxy network based on the results of (vii), to provide an optimized proxy network; (ix) use the image generative network f_θ to generate an output image x̂_i from an input image x_i ϵ {x_i}; (x) use the optimized proxy network to output an optimized approximated function output ŷ_i, using the x_i and the x̂_i as inputs; (xi) evaluate a loss for the generative network f_θ, using the x_i, the x̂_i and the optimized approximated function output ŷ_i as inputs, and include the evaluated loss for the generative network f_θ in a loss array for the generative network f_θ; (xii) repeat (ix) to (xi) for all the images x_i in the multiscale set of images {x_i}; (xiii) use backpropagation to compute gradients of parameters of the generative network f_θ with respect to an aggregation of the loss array assembled in executions of (xi); (xiv) optimize the parameters of the generative network f_θ based on the results of (xiii), to provide an optimized generative network f_θ; and (xv) repeat (i) to (xiv) for each member of the set of training images.

    20. A computer-implemented method of training an image generative network f_θ for a set of training images, in which an output image x̂ is generated from an input image x of the set of training images non-losslessly, and in which a proxy network is trained for a gradient intractable perceptual metric that evaluates a quality of an output image x̂ given an input image x, the method including the steps of: (i) the image generative network f_θ generating an output image x̂ from an input image x of the set of training images, without tracking gradients for f_θ; (ii) the proxy network outputting an approximated function output ŷ, using x and x̂ as inputs; (iii) the gradient intractable perceptual metric outputting a function output y, using x and x̂ as inputs; (iv) evaluating a loss for the proxy network, using y and ŷ as inputs; (v) using backpropagation to compute gradients of parameters of the proxy network with respect to the loss evaluated in step (iv); (vi) optimizing the parameters of the proxy network based on the results of step (v), to provide an optimized proxy network; (vii) the image generative network f_θ generating an output image x̂ from an input image x; (viii) the optimized proxy network outputting an optimized approximated function output ŷ, using x and x̂ as inputs; (ix) evaluating a loss for the generative network f_θ, using x, x̂ and the optimized approximated function output ŷ as inputs; (x) using backpropagation to compute gradients of parameters of the generative network f_θ with respect to the loss evaluated in step (ix); (xi) optimizing the parameters of the generative network f_θ based on the results of step (x), to provide an optimized generative network f_θ; and (xii) repeating steps (i) to (xi) for each member of the set of training images.

    Description

    BRIEF DESCRIPTION OF THE FIGURES

    [0116] Aspects of the invention will now be described, by way of example(s), with reference to the following Figures, in which:

    [0117] FIG. 1 shows an example of a generative network f_θ(x)=x̂, and a differentiable proxy network ĥ_ϕ(x, x̂)=ŷ which approximates a non-differentiable target function (GIF) h_ξ(x, x̂)=y. Note, we can train both networks f_θ and ĥ_ϕ at the same time.

    [0118] FIG. 2A shows an example in which a training of f_θ requires gradient flow via ĥ_ϕ and parameter updates from the optimiser opt{f_θ}. The dotted arrows indicate schematically the direction of back-propagation.

    [0119] FIG. 2B shows an example in which a training of ĥ_ϕ involves samples of f_θ(x) and x, but does not require gradients for x̂. ĥ_ϕ is trained to minimise the loss L_proxy(ŷ, y) with optimizer opt{ĥ_ϕ}. The dotted arrows indicate schematically the direction of back-propagation.

    [0120] FIG. 3 shows an example of a structure of a proxy network ĥ_ϕ(x, x̂)=ŷ.

    [0121] FIG. 4 shows an example of a resblock component with 3 internal blocks (×3). For example, “(128, 256, 2)” indicates there are 128 channels in α, 256 channels in β, and a stride of 2 is used to downsample at the end of the sequence. The circle with a “+” at its centre indicates element-wise addition. For example, “Conv2d(128, 128, 1)” indicates a 2D convolutional operation with input channels of size 128, output channels of size 128, a stride of 1 and a default padding of size stride/2.
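    The stride bookkeeping in the caption can be checked with the standard convolution output-size rule. The 3×3 kernel size and the interpretation of “padding of size stride/2” as integer division are assumptions for illustration; the figure itself fixes only the channel counts and strides.

```python
def conv2d_out(size, kernel, stride, padding):
    # Standard 2D convolution output-size rule: floor((H + 2p - k) / s) + 1.
    return (size + 2 * padding - kernel) // stride + 1

# With an assumed 3x3 kernel, stride 2 and padding 1 (the caption's
# "padding of size stride/2"), the resblock's final convolution halves
# the spatial size, e.g. 64 -> 32:
half = conv2d_out(64, kernel=3, stride=2, padding=1)
# A stride-1, padding-1 convolution preserves the spatial size:
same = conv2d_out(64, kernel=3, stride=1, padding=1)
```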

    [0122] FIG. 5 shows an example in which an auto-encoder is the generative network of FIG. 1.

    [0123] FIG. 6 shows an example of adversarial samples generated by a generative network f_θ where h_ξ is VMAF. The white bounding boxes indicate the corresponding enlarged regions in FIG. 7. Note that the distorted image has a VMAF score of 85, out of a maximum of approximately 96.

    [0124] FIG. 7 shows an example of adversarial samples generated by the generative network f_θ where h_ξ is VMAF. The images shown in the figure are enlarged views of the corresponding regions contained within the white bounding boxes shown in FIG. 6. Notice the checkerboard-like artifacts in the distorted image, which have been learnt by the generative network f_θ as a method of minimizing the loss corresponding to f_θ: VMAF is susceptible to these types of artifacts, which possibly lie outside the boundary for which the function is well-defined, so VMAF scores them as if they aligned well with human perception, and the generative network f_θ therefore considers images with these artifacts perceptually more similar. The distorted image is referred to as an adversarial sample.

    [0125] FIG. 8 shows an example of multiscale training for the case of images x ϵ ℝ³ where for each image x, an RGB image at three different scales is provided. The generative network, along with the proxy network and the perceptual metric, processes each scale of image, and an aggregation is performed at the end using some function, such as a mean operator.

    [0126] FIG. 9 shows a training example in which a set of adversarial samples x̃_i is introduced, with associated labels ỹ_i. The loss surface of ĥ_ϕ is directly discouraged from entering blind spots by training against the sample set x̃_i with the self-imposed label set ỹ_i.

    [0127] FIG. 10 shows a training example in which a blind spot network is introduced, with associated outputs x̃_i. The loss surface of ĥ_ϕ is directly discouraged from entering the boundaries of blind spots by training against the samples from the blind spot network with self-imposed labels ỹ_i. The blind spot network itself is trained using a proxy network. The blind spot network can either use the same proxy network as the encoder-decoder network (as in this figure) or a different proxy network (not shown).

    [0128] FIG. 11 shows a schematic diagram of an artificial intelligence (AI)-based compression process, including encoding an input image x using a neural network E( . . . ), and decoding using a neural network D( . . . ), to provide an output image x̂. Runtime issues are relevant to both the Encoder and the Decoder. Examples of issues of relevance to parts of the process are identified.

    DETAILED DESCRIPTION

    Technology Overview

    [0129] We provide a high level overview of some aspects of our artificial intelligence (AI)-based (e.g. image and/or video) compression technology.

    [0130] In general, compression can be lossless or lossy. In both lossless and lossy compression, the file size is reduced. The file size is sometimes referred to as the “rate”.

    [0131] But in lossy compression, the output may differ from the input. The output image x̂ after reconstruction of a bitstream relating to a compressed image is not the same as the input image x. The fact that the output image x̂ may differ from the input image x is represented by the hat over the “x”. The difference between x and x̂ may be referred to as “distortion”, or “a difference in image quality”. Lossy compression may be characterized by the “output quality”, or “distortion”.

    [0132] Although our pipeline may contain some lossless compression, overall the pipeline uses lossy compression.

    [0133] Usually, as the rate goes up, the distortion goes down. A relation between these quantities for a given compression scheme is called the “rate-distortion equation”. For example, a goal in improving compression technology is to obtain reduced distortion, for a fixed size of a compressed file, which would provide an improved rate-distortion equation. For example, the distortion can be measured using the mean square error (MSE) between the pixels of x and x̂, but there are many other ways of measuring distortion, as will be clear to the person skilled in the art. Known compression and decompression schemes include, for example, JPEG, JPEG2000, AVC, HEVC and AV1.

    [0134] In an example, our approach includes using deep learning and AI to provide an improved compression and decompression scheme, or improved compression and decompression schemes.

    [0135] In an example of an artificial intelligence (AI)-based compression process, an input image x is provided. There is provided a neural network characterized by a function E( . . . ) which encodes the input image x. This neural network E( . . . ) produces a latent representation, which we call w. The latent representation is quantized to provide ŵ, a quantized latent. The quantized latent goes to another neural network characterized by a function D( . . . ) which is a decoder. The decoder provides an output image, which we call x̂. The quantized latent ŵ is entropy-encoded into a bitstream.

    [0136] For example, the encoder is a library which is installed on a user device, e.g. laptop computer, desktop computer, smart phone. The encoder produces the w latent, which is quantized to ŵ, which is entropy encoded to provide the bitstream, and the bitstream is sent over the internet to a recipient device. The recipient device entropy decodes the bitstream to provide ŵ, and then uses the decoder which is a library installed on a recipient device (e.g. laptop computer, desktop computer, smart phone) to provide the output image {circumflex over (x)}.
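    The encode-quantize-entropy-encode-decode flow of the two paragraphs above can be sketched end to end. The linear "networks", the rounding quantizer and the use of zlib as an entropy-coder stand-in are all assumptions for illustration; a real pipeline uses trained neural networks E and D and a learned entropy model.

```python
import json
import zlib

# Hypothetical stand-ins for the learned transforms E(...) and D(...).
def encoder(x):              # E: image -> latent w (toy: scale values down)
    return [v / 4.0 for v in x]

def decoder(w_hat):          # D: quantized latent w_hat -> output image x_hat
    return [v * 4.0 for v in w_hat]

def quantize(w):             # round each latent value to the nearest integer
    return [round(v) for v in w]

def entropy_encode(w_hat):   # entropy-coder stand-in: zlib over the symbols
    return zlib.compress(json.dumps(w_hat).encode())

def entropy_decode(bitstream):
    return json.loads(zlib.decompress(bitstream))

# Sender side: produce the bitstream from the input image.
x = [9, 12, 17]
bitstream = entropy_encode(quantize(encoder(x)))
# Recipient side: entropy decode, then decode to the output image.
x_hat = decoder(entropy_decode(bitstream))
```

    Note that x̂ differs from x here (9 is reconstructed as 8.0, 17 as 16.0): the rounding in the quantizer is where the pipeline becomes lossy, while the entropy coding itself is lossless.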

    [0137] E may be parametrized by a convolution matrix Θ such that w=E_Θ(x).

    [0138] D may be parametrized by a convolution matrix Ω such that x̂=D_Ω(ŵ).

    [0139] We need to find a way to learn the parameters Θ and Ω of the neural networks.

    [0140] The compression pipeline may be parametrized using a loss function L. In an example, we use back-propagation with gradient descent on the loss function, using the chain rule, to update the weight parameters Θ and Ω of the neural networks using the gradients ∂L/∂Θ and ∂L/∂Ω.

    [0141] The loss function is the rate-distortion trade-off. The distortion function is 𝒟(x, x̂), which produces a value, which is the distortion loss 𝒟. The loss function can be used to back-propagate the gradient to train the neural networks.

    [0142] So for example, we use an input image, we obtain a loss function, we perform a backwards propagation, and we train the neural networks. This is repeated for a training set of input images, until the pipeline is trained. The trained neural networks can then provide good quality output images.

    [0143] An example image training set is the KODAK image set (e.g. at www.cs.albany.edu/˜xypan/research/snr/Kodak.html). An example image training set is the IMAX image set. An example image training set is the Imagenet dataset (e.g. at www.image-net.org/download). An example image training set is the CLIC Training Dataset P (“professional”) and M (“mobile”) (e.g. at http://challenge.compression.cc/tasks/).

    [0144] In an example, the production of the bitstream from w is lossless compression.

    [0145] The pipeline needs a loss that we can use for training, and the loss needs to resemble the rate-distortion trade-off.

    [0146] A loss which may be used for neural network training is Loss=𝒟+λ*R, where 𝒟 is the distortion function, λ is a weighting factor, and R is the rate loss. R is related to entropy. Both 𝒟 and R are differentiable functions.
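    The trade-off Loss=𝒟+λ*R can be computed as in the sketch below. The MSE distortion, the negative-log-likelihood rate term and the symbol probability table are illustrative assumptions; in practice 𝒟 may be a learned perceptual distortion and R comes from a learned entropy model over the quantized latent.

```python
import math

def mse(x, x_hat):
    # Distortion term D: mean squared error between input and output pixels.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def rate(w_hat, p):
    # Rate term R: total bits, -log2 p(symbol) summed over the quantized
    # latent, under an assumed probability model p.
    return sum(-math.log2(p[v]) for v in w_hat)

def rd_loss(x, x_hat, w_hat, p, lam=0.01):
    # Loss = D + lambda * R  (the rate-distortion trade-off)
    return mse(x, x_hat) + lam * rate(w_hat, p)

# Toy example with an assumed symbol distribution over the latent.
p = {0: 0.5, 1: 0.25, -1: 0.25}
loss = rd_loss([1.0, 2.0], [1.1, 1.9], [0, 1], p, lam=0.01)
```

    Increasing λ weights the rate more heavily, pushing training toward smaller files at the cost of more distortion; decreasing it does the opposite.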

    [0147] Distortion functions 𝒟(x, x̂), which correlate well with the human vision system, are hard to identify. There exist many candidate distortion functions, but typically these do not correlate well with the human vision system, when considering a wide variety of possible distortions.

    [0148] We want humans who view picture or video content on their devices to have a pleasing visual experience when viewing this content, for the smallest possible file size transmitted to the devices. So we have focused on providing improved distortion functions, which correlate better with the human vision system. Modern distortion functions very often contain a neural network, which transforms the input and the output into a perceptual space, before comparing the input and the output. The neural network can be a generative adversarial network (GAN) which performs some hallucination. There can also be some stabilization. It appears that humans evaluate image quality over density functions.

    [0149] Hallucinating is providing fine detail in an image, which can be generated for the viewer: the fine, higher-spatial-frequency detail does not need to be accurately transmitted; instead, some of the fine detail can be generated at the receiver end, given suitable cues for generating the fine details, where the cues are sent from the transmitter.

    [0150] FIG. 11 shows a schematic diagram of an artificial intelligence (AI)-based compression process, including encoding an input image x using a neural network, and decoding using a neural network, to provide an output image x̂.

    [0151] In an example of a layer in an encoder neural network, the layer includes a convolution, a bias and an activation function. In an example, four such layers are used.

    [0152] There is provided a computer-implemented method for lossy image or video compression, transmission and decoding, the method including the steps of:

    (i) receiving an input image at a first computer system;
    (ii) encoding the input image using a first trained neural network, using the first computer system, to produce a latent representation;
    (iii) quantizing the latent representation using the first computer system to produce a quantized latent;
    (iv) entropy encoding the quantized latent into a bitstream, using the first computer system;
    (v) transmitting the bitstream to a second computer system;
    (vi) the second computer system entropy decoding the bitstream to produce the quantized latent;
    (vii) the second computer system using a second trained neural network to produce an output image from the quantized latent, wherein the output image is an approximation of the input image. A related system including a first computer system, a first trained neural network, a second computer system and a second trained neural network, may be provided.

    [0153] An advantage is that for a fixed file size (“rate”), a reduced output image distortion is obtained. An advantage is that for a fixed output image distortion, a reduced file size (“rate”) is obtained.

    [0154] There is provided a computer implemented method of training a first neural network and a second neural network, the neural networks being for use in lossy image or video compression, transmission and decoding, the method including the steps of:

    (i) receiving an input training image;
    (ii) encoding the input training image using the first neural network, to produce a latent representation;
    (iii) quantizing the latent representation to produce a quantized latent;
    (iv) using the second neural network to produce an output image from the quantized latent, wherein the output image is an approximation of the input image;
    (v) evaluating a loss function based on differences between the output image and the input training image;
    (vi) evaluating a gradient of the loss function;
    (vii) back-propagating the gradient of the loss function through the second neural network and through the first neural network, to update weights of the second neural network and of the first neural network; and
    (viii) repeating steps (i) to (vii) using a set of training images, to produce a trained first neural network and a trained second neural network, and
    (ix) storing the weights of the trained first neural network and of the trained second neural network. A related computer program product may be provided.

    [0155] An advantage is that, when using the trained first neural network and the trained second neural network, for a fixed file size (“rate”), a reduced output image distortion is obtained; and for a fixed output image distortion, a reduced file size (“rate”) is obtained.

    Example Aspects of Adversarial Learning of Differentiable Proxy of Gradient Intractable Networks

    [0156] A generative network f_θ which generates an output image x̂ from an input image x is provided. A differentiable proxy network ĥ_ϕ which generates a function output ŷ from x and x̂ according to ĥ_ϕ(x, x̂)=ŷ is provided. The differentiable proxy network ĥ_ϕ approximates a non-differentiable target function (GIF) h_ξ which generates a function output y from x and x̂ according to h_ξ(x, x̂)=y. It is possible to train both networks f_θ and ĥ_ϕ at the same time. An example is shown in FIG. 1.

    [0157] In an example, a training of f_θ requires gradient flow via ĥ_ϕ and parameter updates for f_θ from an optimiser opt{f_θ}. An example is shown in FIG. 2A, in which the dotted arrows indicate schematically the direction of back-propagation.

    [0158] In an example, a training of ĥ_ϕ involves samples of f_θ(x) and x, but does not require gradients for x̂. ĥ_ϕ is trained to minimise the loss L_proxy(ŷ, y) with optimizer opt{ĥ_ϕ}. An example is shown in FIG. 2B, in which the dotted arrows indicate schematically the direction of back-propagation.
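    The key property of this proxy pass can be shown with a toy one-parameter proxy ĥ_ϕ(x, x̂)=ϕ·MSE(x, x̂) fitted to a fixed metric output y: the sample x̂ from f_θ(x) is treated as a constant, so no gradient reaches f_θ. The toy model, learning rate and sample values are assumptions for illustration.

```python
def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def proxy_update(phi, x, x_hat, y, lr=0.01):
    """One step in the style of FIG. 2B: fit y_hat = phi * MSE(x, x_hat)
    to the metric output y. x_hat is a sample from f_theta(x); here it is
    simply a constant list of numbers, so no gradient can flow back into
    f_theta (the 'detach' pattern of autodiff frameworks)."""
    m = mse(x, x_hat)
    y_pred = phi * m
    # Hand-derived d/d_phi of the proxy loss L_proxy = (y_pred - y)^2.
    grad_phi = 2.0 * (y_pred - y) * m
    return phi - lr * grad_phi

# Assumed sample pair and metric output; phi converges to y / MSE = 1.5.
phi = 0.0
x, x_hat, y = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0], 7.0
for _ in range(200):
    phi = proxy_update(phi, x, x_hat, y)
```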

    [0159] A generative network f_θ which generates an output image x̂ from an input image x is provided. In an example, the generative network f_θ includes an encoder, which encodes (e.g. which performs lossy encoding) an input image x into a bitstream, and includes a decoder, which decodes the bitstream into an output image x̂. A differentiable proxy network ĥ_ϕ which generates a function output ŷ from x and x̂ according to ĥ_ϕ(x, x̂)=ŷ is provided. The differentiable proxy network ĥ_ϕ approximates a non-differentiable target function (GIF) h_ξ which generates a function output y from x and x̂ according to h_ξ(x, x̂)=y. It is possible to train both networks f_θ and ĥ_ϕ at the same time. An example is shown in FIG. 5.

    [0160] Adversarial samples may be generated by a generative network f_θ where h_ξ is VMAF. FIG. 6 shows an example of adversarial samples generated by a generative network f_θ where h_ξ is VMAF. The white bounding boxes indicate the corresponding enlarged regions in FIG. 7. Note that the distorted image has a VMAF score of 85, out of a maximum of approximately 96.

    [0161] A generative network f_θ which generates an output image x̂_i from an input image x_i is provided. In an example, the generative network f_θ includes an encoder, which encodes (e.g. which performs lossy encoding) an input image x_i into a bitstream, and includes a decoder, which decodes the bitstream into an output image x̂_i. A differentiable proxy network ĥ_ϕ which generates a function output ŷ_i from x_i and x̂_i according to ĥ_ϕ(x_i, x̂_i)=ŷ_i is provided. The differentiable proxy network ĥ_ϕ approximates a non-differentiable target function (GIF) h_ξ which generates a function output y_i from x_i and x̂_i according to h_ξ(x_i, x̂_i)=y_i. It is possible to train both networks f_θ and ĥ_ϕ at the same time. Multiscale training is provided for the case of multiscale images x_i ϵ ℝ³ where for each image x, an RGB image at a plurality of different scales is used. The generative network f_θ, along with the proxy network ĥ_ϕ and the perceptual metric h_ξ, processes each scale of image and finally performs an aggregation using some aggregation function, such as a mean operator. FIG. 8 shows an example of multiscale training for the case of images x_i ϵ ℝ³ where for each image x, an RGB image at three different scales is provided: x_i, where i=1, 2 or 3.
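    The per-scale scoring and mean aggregation described above can be sketched as follows. The factor-2 subsampling of a 1-D signal and the MSE stand-in for the per-scale quality score are assumptions for illustration; a real system downsamples the H and W dimensions of an RGB image and scores each scale with the proxy or the perceptual metric.

```python
def downsample(x):
    # Factor-2 downsampling of a 1-D stand-in signal.
    return x[::2]

def multiscale_set(x, scales=3):
    # Build {x_i}: the image itself plus progressively downsampled copies.
    out = [x]
    for _ in range(scales - 1):
        out.append(downsample(out[-1]))
    return out

def score(x_i, x_hat_i):
    # Hypothetical per-scale quality score (MSE as a stand-in).
    return sum((a - b) ** 2 for a, b in zip(x_i, x_hat_i)) / len(x_i)

def aggregate(x, x_hat, scales=3):
    # Score each scale, then aggregate with a mean operator.
    per_scale = [score(xi, xhi)
                 for xi, xhi in zip(multiscale_set(x, scales),
                                    multiscale_set(x_hat, scales))]
    return sum(per_scale) / len(per_scale)

agg = aggregate([1.0, 3.0, 1.0, 3.0], [1.0, 1.0, 1.0, 1.0])
```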

    [0162] A generative network f_θ which generates an output image x̂_i from an input image x_i is provided. In an example, the generative network f_θ includes an encoder, which encodes (e.g. which performs lossy encoding) an input image x_i into a bitstream, and includes a decoder, which decodes the bitstream into an output image x̂_i. A differentiable proxy network ĥ_ϕ which generates a function output ŷ_i from x_i and x̂_i according to ĥ_ϕ(x_i, x̂_i)=ŷ_i is provided. The differentiable proxy network ĥ_ϕ approximates a non-differentiable target function (GIF) h_ξ which generates a function output y_i from x_i and x̂_i according to h_ξ(x_i, x̂_i)=y_i. It is possible to train both networks f_θ and ĥ_ϕ at the same time. Multiscale training is provided for the case of multiscale images x_i ϵ ℝ³ where for each image x, an RGB image at a plurality of different scales is used. The generative network f_θ, along with the proxy network ĥ_ϕ and the perceptual metric h_ξ, processes each scale of image and finally performs an aggregation using some aggregation function, such as a mean operator. In an example, a set of adversarial samples x̃_i is introduced, with associated labels ỹ_i, which are generated according to ĥ_ϕ(x_i, x̃_i)=ỹ_i. The loss surface of ĥ_ϕ is directly discouraged from entering blind spots by training against the sample set x̃_i with the self-imposed label set ỹ_i. FIG. 9 shows a training example in which a set of adversarial samples x̃_i is introduced, with associated labels ỹ_i, and the loss surface of ĥ_ϕ is directly discouraged from entering blind spots by training against the sample set x̃_i with the self-imposed label set ỹ_i.

    [0163] A generative network f.sub.θ which generates an output image {circumflex over (x)}.sub.i from an input image x.sub.i is provided. In an example, the generative network f.sub.θ includes an encoder, which encodes (e.g. which performs lossy encoding) an input image x.sub.i into a bitstream, and includes a decoder, which decodes the bitstream into an output image {circumflex over (x)}.sub.i. A differentiable proxy network ĥ.sub.ϕ which generates a function output ŷ.sub.i from x.sub.i and {circumflex over (x)}.sub.i according to ĥ.sub.ϕ(x.sub.i, {circumflex over (x)}.sub.i)=ŷ.sub.i is provided. The differentiable proxy network ĥ.sub.ϕ approximates a non-differentiable target function (GIF) h.sub.ξ which generates a function output y.sub.i from x.sub.i and {circumflex over (x)}.sub.i according to h.sub.ξ(x.sub.i, {circumflex over (x)}.sub.i)=y.sub.i. It is possible to train both networks f.sub.θ and ĥ.sub.ϕ at the same time. Multiscale training may be provided for the case of multiscale images x.sub.iϵℝ.sup.H×W×3, where for each image x, an RGB image at a plurality of different scales is used. The generative network f.sub.θ, together with the proxy network ĥ.sub.ϕ and the perceptual metric h.sub.ξ, processes each scale of image and finally performs an aggregation using some aggregation function, such as a mean operator. In an example, a set of adversarial samples {tilde over (x)}.sub.i is generated by a blind spot network from a set of x.sub.i. The {tilde over (x)}.sub.i have associated labels {tilde over (y)}.sub.i, which are generated according to ĥ.sub.ϕ(x.sub.i, {tilde over (x)}.sub.i)={tilde over (y)}.sub.i. The loss surface of ĥ.sub.ϕ is directly discouraged from entering blind spots by training against the sample set {tilde over (x)}.sub.i with the self-imposed label set {tilde over (y)}.sub.i. The blind spot network itself may be trained using a proxy network. The blind spot network can either use the same proxy network as the encoder-decoder network (as in FIG. 10) or a different one (not shown in FIG. 10). FIG. 10 shows a training example in which a blind spot network is present.
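
A toy sketch of blind-spot mining is given below. A stand-in "blind spot network" proposes candidate samples {tilde over (x)}, and the candidate on which the proxy and the true metric disagree most is kept and given a self-imposed label; here the label is taken from the true metric, which is one possible labelling choice. All functions and parameters are illustrative stand-ins, not the trained networks of the invention.

```python
import numpy as np

# Toy sketch of blind-spot mining for the proxy. All stand-ins illustrative.
rng = np.random.default_rng(1)

def h_xi(x, x_hat):
    # Stand-in gradient intractable metric: mean absolute error.
    return float(np.mean(np.abs(x - x_hat)))

def h_phi(x, x_hat, w=1.0):
    # Stand-in proxy: scaled mean squared error. Wherever this diverges
    # from h_xi is, for this sketch, a "blind spot".
    return float(w * np.mean((x - x_hat) ** 2))

x = rng.random((8, 8))

# "Blind spot network" stand-in: propose candidates at several perturbation
# strengths and keep the one with the largest proxy/metric disagreement.
candidates = [np.clip(x + rng.normal(0, s, x.shape), 0, 1)
              for s in (0.05, 0.2, 0.8)]
gaps = [abs(h_phi(x, c) - h_xi(x, c)) for c in candidates]
x_tilde = candidates[int(np.argmax(gaps))]

# Self-imposed label: the proxy is subsequently trained on the pair
# (x, x_tilde) with target y_tilde, pulling its loss surface toward the
# metric at the blind spot.
y_tilde = h_xi(x, x_tilde)
worst_gap = float(max(gaps))
```

In a full system the blind spot network would itself be a trained generator rather than a random-perturbation search.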

    [0164] In an example of a trained generative network, an encoder including a first trained neural network is provided on a first computer system, and a decoder is provided on a second computer system in communication with the first computer system, the decoder including a second trained neural network. The encoder produces a bitstream from an input image; the bitstream is transmitted to the second computer system, where the decoder decodes the bitstream to produce an output image. The output image may be an approximation of the input image.
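
The deployed pipeline of the paragraph above can be sketched end to end. In practice the encoder and decoder are the first and second trained neural networks; in this illustrative stand-in, uniform 8-bit quantisation plays both roles, so the output image is an approximation of the input bounded by half a quantisation step.

```python
import numpy as np

# Illustrative stand-in for the deployed encode/transmit/decode pipeline.
def encode(x):
    # "Encoder" on the first computer system: lossy quantisation to 8 bits
    # per channel, serialised to a bitstream.
    q = np.clip(np.round(x * 255), 0, 255).astype(np.uint8)
    return q.tobytes(), q.shape

def decode(bitstream, shape):
    # "Decoder" on the second computer system: reconstruct the output image.
    q = np.frombuffer(bitstream, dtype=np.uint8).reshape(shape)
    return q.astype(np.float64) / 255

x = np.random.default_rng(2).random((4, 4, 3))     # input image
bitstream, shape = encode(x)                       # produced on system 1
x_hat = decode(bitstream, shape)                   # produced on system 2
max_err = float(np.max(np.abs(x - x_hat)))         # <= half a quantisation step
```

The bitstream here is 48 bytes for a 4×4×3 image; a learned codec would of course compress far below one byte per channel.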

    [0165] The first computer system may be a server, e.g. a dedicated server, e.g. a machine in the cloud with dedicated GPUs, such as Amazon Web Services or Microsoft Azure, or any other cloud computing service.

    [0166] The first computer system may be a user device. The user device may be a laptop computer, desktop computer, a tablet computer or a smart phone.

    [0167] The first trained neural network may include a library installed on the first computer system.

    [0168] The first trained neural network may be parametrized by one or several convolution matrices Θ, or the first trained neural network may be parametrized by a set of bias parameters, non-linearity parameters, and convolution kernel/matrix parameters.

    [0169] The second computer system may be a recipient device.

    [0170] The recipient device may be a laptop computer, desktop computer, a tablet computer, a smart TV or a smart phone.

    [0171] The second trained neural network may include a library installed on the second computer system.

    [0172] The second trained neural network may be parametrized by one or several convolution matrices Ω, or the second trained neural network may be parametrized by a set of bias parameters, non-linearity parameters, and convolution kernel/matrix parameters.

    Notes Re VMAF

    [0173] Video Multimethod Assessment Fusion (VMAF) is an objective full-reference video quality metric. It predicts subjective video quality based on a reference and distorted video sequence. The metric can be used to evaluate the quality of different video codecs, encoders, encoding settings, or transmission variants.

    [0174] VMAF uses existing image quality metrics and other features to predict video quality:
    [0175] Visual Information Fidelity (VIF): considers information fidelity loss at four different spatial scales.
    [0176] Detail Loss Metric (DLM): measures loss of details, and impairments which distract viewer attention.
    [0177] Mean Co-Located Pixel Difference (MCPD): measures temporal difference between frames on the luminance component.
    [0178] Anti-noise signal-to-noise ratio (AN-SNR).

    [0179] The above features are fused using a support-vector machine (SVM)-based regression to provide a single output score in the range of 0-100 per video frame, with 100 being quality identical to the reference video. These scores are then temporally pooled over the entire video sequence using the arithmetic mean to provide an overall differential mean opinion score (DMOS).
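
The fusion-and-pooling stage can be sketched as follows. VMAF fits the fusion with SVM-based regression trained on subjective scores; in this illustrative sketch a fixed linear map clipped to the 0-100 range stands in for the trained regressor, and the weights and feature values are invented for the example.

```python
import numpy as np

# Hedged sketch of feature fusion and temporal pooling.
def fuse(features, weights, bias):
    # Stand-in for the trained SVM regressor, clipped to the 0-100 range.
    return float(np.clip(features @ weights + bias, 0.0, 100.0))

weights = np.array([30.0, 40.0, -10.0])   # illustrative, not VMAF's model
bias = 20.0

# One feature row per frame: [VIF, DLM, MCPD].
frames = np.array([
    [1.0, 1.0, 0.0],   # frame identical to the reference
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.3],
])
per_frame = [fuse(f, weights, bias) for f in frames]
score = float(np.mean(per_frame))   # arithmetic-mean temporal pooling
```

The identical-to-reference frame scores highest, and the overall score is the mean of the per-frame scores, mirroring the pooling described above.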

    [0180] Due to the public availability of the training source code (“VMAF Development Kit”, VDK), the fusion method can be re-trained and evaluated based on different video datasets and features.

    [0181] Regarding perceptual-specific GIFs, some other examples apart from VMAF are:
    [0182] VIF: Visual Information Fidelity
    [0183] DLM: Detail Loss Metric
    [0184] IFC: Information Fidelity Criterion.

    [0185] Regarding perceptual-specific GIFs, an example class of GIFs is mutual-information-based estimators.

    Notes Re Training

    [0186] Regarding seeding the neural networks for training, all the neural network parameters can be randomized with standard methods (such as Xavier Initialization). Typically, we find that satisfactory results are obtained with sufficiently small learning rates.
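
For concreteness, Xavier (Glorot) uniform initialisation for one fully connected layer can be sketched as below: weights are drawn from U(−a, a) with a = sqrt(6/(fan_in + fan_out)), which keeps activation variance roughly constant across layers at the start of training. The layer sizes are arbitrary examples.

```python
import numpy as np

# Minimal sketch of Xavier (Glorot) uniform initialisation for one layer.
def xavier_uniform(fan_in, fan_out, rng):
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

rng = np.random.default_rng(3)
W = xavier_uniform(256, 128, rng)
bound = np.sqrt(6.0 / (256 + 128))   # 0.125 for this layer; var(W) ~ 2/384
```
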

    Other Applications

    [0187] As an alternative to applications described in this document which use a gradient intractable perceptual metric, the present invention may be re-purposed for applications relating to quantisation. In an application relating to quantisation, we can use a proxy network to learn any function with intractable gradients in machine learning. As an alternative to the perceptual metric, therefore, the quantisation (round) function may be used. A quantisation (round) function may be used in our pipeline on the latent space, to convert it to a quantised latent space during encoding. This is a problem for training, as a quantisation (round) function does not have usable gradients. It is possible to learn the quantisation (round) function using a proxy neural network (since we always know the ground truth values) and to use this network (which allows gradients to be propagated) for quantisation during training. The method is similar to that described in algorithms 1.1, 1.2 and 1.3, but the intractable gradient function is now the quantisation (round) function.
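
The idea of the paragraph above can be sketched as follows: since the ground truth of round() is always available, a differentiable surrogate can be fitted to it and used wherever gradients must flow. In this illustrative sketch the "proxy" is a model linear in differentiable features (the identity plus sine harmonics of period 1) fitted by least squares; the feature choice is an assumption for the example, not the patent's proxy architecture.

```python
import numpy as np

# Hedged sketch: fit a differentiable proxy for round() from ground truth.
rng = np.random.default_rng(4)
x = rng.uniform(-4, 4, 2000)
y = np.round(x)                  # ground truth is always available

def features(t, k=16):
    # Differentiable feature map: identity plus sine harmonics of period 1.
    t = np.atleast_1d(t)
    return np.column_stack([t] + [np.sin(2 * np.pi * n * t)
                                  for n in range(1, k + 1)])

coef, *_ = np.linalg.lstsq(features(x), y, rcond=None)

def proxy_round(t):
    # Smooth surrogate for round(), differentiable everywhere, so gradients
    # can be propagated through quantisation during training.
    return features(t) @ coef

test_x = np.array([-1.3, -0.2, 0.2, 2.7])
err = float(np.max(np.abs(proxy_round(test_x) - np.round(test_x))))
```

Away from the half-integer discontinuities the surrogate tracks round() closely, while remaining smooth enough to backpropagate through.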

    [0188] As an alternative to applications described in this document which use a gradient intractable perceptual metric, the present invention may be re-purposed for applications relating to a runtime device proxy. Techniques such as NAS (Network Architecture Search) can be used to drive the search for efficient architectures using the measured runtime on a device as the loss function to minimise. However, this is currently not practical, as it is too time-consuming to execute each model on a device to assess its runtime in each iteration of training. We use a proxy network to learn the mapping from architecture to runtime. This proxy is trained by generating at least 1000 architectures randomly, timing their runtime on a device, and then fitting a neural network to this data. Having this runtime proxy allows us to obtain runtimes of architectures easily, within a few seconds of processing (e.g. through the forward pass of the proxy network). This proxy can then be used stand-alone to assess run timings of architectures, or in a NAS-based setting to drive learning.
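
The runtime-proxy recipe above can be sketched end to end. In this illustrative sketch the architecture descriptor (depth, width), the simulated benchmark, and the least-squares regressor are all assumptions standing in for real models timed on real hardware and for the fitted neural network.

```python
import numpy as np

# Hedged sketch of a runtime proxy: sample architectures, "time" them, fit a
# regressor, then query runtimes via a cheap forward pass.
rng = np.random.default_rng(5)

def measure_runtime(depth, width):
    # Simulated device benchmark (stand-in for timing on hardware): runtime
    # grows with depth and width^2, plus measurement noise.
    return 0.5 * depth + 0.002 * width ** 2 + rng.normal(0, 0.05)

# Generate at least 1000 random architectures and time each one.
depths = rng.integers(1, 20, 1000)
widths = rng.integers(16, 256, 1000)
runtimes = np.array([measure_runtime(d, w) for d, w in zip(depths, widths)])

# Fit the proxy (here linear in [1, depth, width^2]) by least squares; a
# neural network would play this role in the described application.
X = np.column_stack([np.ones(1000), depths, widths ** 2])
coef, *_ = np.linalg.lstsq(X, runtimes, rcond=None)

def proxy_runtime(depth, width):
    # Cheap estimate, usable stand-alone or as (part of) a NAS loss.
    return float(coef @ np.array([1.0, depth, width ** 2]))

pred = proxy_runtime(10, 128)   # noise-free ground truth is 37.768
```

Once fitted, querying the proxy costs a single dot product rather than an on-device benchmark, which is what makes runtime-driven NAS tractable.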

    Note

    [0189] It is to be understood that the arrangements referenced herein are only illustrative of the application of the principles of the present inventions. Numerous modifications and alternative arrangements can be devised without departing from the spirit and scope of the present inventions. While the present inventions are shown in the drawings and fully described with particularity and detail in connection with what is presently deemed to be the most practical and preferred examples of the inventions, it will be apparent to those of ordinary skill in the art that numerous modifications can be made without departing from the principles and concepts of the inventions as set forth herein.