Encoding and decoding image data

Abstract

Certain aspects of the present disclosure provide techniques for encoding image data for one or more images. In one embodiment, a method includes the steps of downscaling the one or more images, and encoding the one or more downscaled images using an image codec. Another embodiment concerns a computer-implemented method of decoding encoded image data, and a computer-implemented method of encoding and decoding image data.

Claims

1. A computer-implemented method of encoding image data for one or more images using an image codec, wherein the image codec comprises a downscaling process, the method comprising the steps of: downscaling the one or more images in accordance with the initial downscaling process, using an artificial neural network comprising a plurality of layers of neurons; subsequently encoding the one or more downscaled images using the image codec; and iterating, until a stopping condition is reached, the steps of: determining an importance value for a set of neurons or layers of neurons; removing the neurons and/or layers with importance value less than a determined amount from the neural network; and tuning the neural network in accordance with a cost function, wherein the cost function is arranged to determine a redetermined balance between the accuracy of the downscaling performed by neural network and the complexity of the neural network; applying a monotonic scaling function to the weighting of the neurons of the neural network; and tuning the neural network.

2. The method of claim 1, wherein the one or more images are downscaled using one or more filters.

3. The method of claim 2, wherein a filter of the one or more filters is an edge-detection filter.

4. The method of claim 2, wherein a filter of the one or more filters is a blur filter.

5. The method of claim 2, wherein the parameters of a filter used to downscale an image are determined using the results of the encoding of previous images by the image codec.

6. The method of claim 1, wherein in the removal step, the layer with lowest importance value is removed.

7. The method of claim 1, wherein the stopping condition is that the iterated steps have been performed a predetermined number of times.

8. The method of claim 1, wherein the iterated steps and the applying and tuning steps are iterated until a further stopping condition is reached.

9. The method of claim 1, wherein the step of downscaling the one or more images comprises the step of determining the processes to use to downscale an image using the results of the encoding of previous images by the image codec.

10. The method of claim 1, wherein the encoded image data comprises data indicative of the downscaling performed.

11. The method of claim 1, wherein the image codec is lossy.

12. A computer-implemented method of decoding encoded image data for one or more images, wherein the encoded image data was encoded using a method of encoding image data for one or more images using an image codec, wherein the image codec comprises a downscaling process, the method of encoding image data comprising the steps of: downscaling the one or more images in accordance with the initial downscaling process using an artificial neural network comprising a plurality of layers of neurons; and subsequently encoding the one or more downscaled images using the image codec; iterating, until a stopping condition is reached, the steps of: determining an importance value for a set of neurons or layers of neurons; removing the neurons and/or layers with importance value less than a determined amount from the neural network; and tuning the neural network in accordance with a cost function, wherein the cost function is arranged to determine a predetermined balance between the accuracy of the downscaling performed by neural network and the complexity of the neural network; applying a monotonic scaling function to the weighting of the neurons of the neural network; and tuning the neural network, wherein the method of decoding encoded image data comprising the steps of: decoding the encoded image data using the image codec to generate one or more downscaled images; and upscaling the one or more images.

13. The method of claim 12, wherein the encoded image data comprises data indicative of the downscaling performed, and the data indicative of the downscaling performed is used when upscaling the one or more images.

14. The method of claim 1, wherein the image data is video data, the one or more images are frames of video, and the image codec is a video codec.

15. A computing device comprising: a processor; and a memory, wherein the computing device is arranged to perform using the processor a method of encoding image data for one or more images using an image codec, wherein the image codec comprises a downscaling process, the method comprising the steps of: downscaling the one or more images in accordance with the initial downscaling process using an artificial neural network comprising a plurality of layers of neurons; subsequently encoding the one or more downscaled images using the image codec; and iterating, until a stopping condition is reached, the steps of: determining an importance value for a set of neurons or layers of neurons; removing the neurons and/or layers with importance value less than a determined amount from the neural network; and tuning the neural network in accordance with a cost function, wherein the cost function is arranged to determine a predetermined balance between the accuracy of the downscaling performed by neural network and the complexity of the neural network; applying a monotonic scaling function to the weighting of the neurons of the neural network; and tuning the neural network.

16. A non-transitory computer-readable medium comprising instructions that, when executed by a processor of a computing device, cause the computing device to perform a method of encoding image data for one or more images using an image codec, wherein the image codec comprises a downscaling process, the method comprising the steps of: downscaling the one or more images in accordance with the initial downscaling process using an artificial neural network comprising a plurality of layers of neurons; subsequently encoding the one or more downscaled images using the image codec; and iterating, until a stopping condition is reached, the steps of: determining an importance value for a set of neurons or layers of neurons; removing the neurons and/or layers with importance value less than a determined amount from the neural network; and tuning the neural network in accordance with a cost function, wherein the cost function is arranged to determine a predetermined balance between the accuracy of the downscaling performed by neural network and the complexity of the neural network; applying a monotonic scaling function to the weighting of the neurons of the neural network; and tuning the neural network.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Embodiments of the present invention will now be described by way of example only with reference to the accompanying schematic drawings of which:

(2) FIG. 1 is a schematic diagram of a method of encoding and decoding image data in accordance with an embodiment of the invention;

(3) FIG. 2 is a schematic diagram of a layers of an artificial neural network;

(4) FIGS. 3 and 4 show comparison bitrate-PSNR results for video sequences encoded and decoded by methods in accordance with embodiments of the invention, and known methods;

(5) FIG. 5 is a schematic diagram of an iteration process of the method of FIG. 1;

(6) FIG. 6 is a schematic diagram of the FR function of the iteration process of FIG. 5;

(7) FIG. 7 is a flowchart showing the steps of a method of encoding video data with a GOP structure;

(8) FIG. 8 is a flowchart showing the steps of a method of decoding encoded video data encoded using the method of FIG. 7;

(9) FIG. 9 is a schematic diagram of an example convolutional neural network used in image processing;

(10) FIG. 10 is a schematic diagram showing an example of filter pruning of a neural network in accordance with an embodiment of the invention;

(11) FIG. 11 is a flowchart showing the steps of a method of optimizing a neural network in accordance with an embodiment of the invention;

(12) FIGS. 12 to 14 show comparison results for neural networks optimized by methods in accordance with embodiments of the invention, and known neural networks;

(13) FIG. 15 shows examples of scaling coefficients for a 3×3 neural network layer across its 64 channels; and

(14) FIG. 16 is a schematic diagram of a computing device in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

(15) Embodiments of the invention are now described.

(16) FIG. 1 is a schematic diagram showing at a high level the steps of an embodiment of the invention, in which video data is encoded and decoded. The embodiment is applicable to batch processing, i.e. processing a group of images or video frames together without delay constraints (e.g. an entire video sequence), as well as to stream processing, i.e. processing only a limited subset of a stream of images or video frames, or even a select subset of a single image, due to delay or buffering constraints.

(17) In the first downscaling stage 1 of FIG. 1, the input image(s) or video frames (IoVF) are filtered and downscaled in space and/or time (Sa/oT) with one, or combination of, the options described below.

(18) Option 1: One or more content-adaptive filters are applied to the IoVF. The content-adaptive filters track edge direction within one or more IoVFs, or flow of pixel or scene activity in time between two or more successive IoVFs, and downscale in Sa/oT by any rational factor

(19) $\frac{a}{b}$
with a, b any integers greater than zero such that b≥a. The integers a, b are chosen adaptively (i.e. based on the content of the IoVF, which may include the results of the overall process on previous IoVF) in each Sa/oT dimension using the adaptivity mechanisms 2 of FIG. 1, as described in more detail below.

(20) Option 2: An Sa/oT blur filter kernel is applied to the IoVF. The Sa/oT filter may be, but is not limited to, an Sa/oT Gaussian function followed by downscaling as in option 1 above.

(21) Option 3: An autoencoder is applied to the IoVF. The autoencoder comprises a series of layers such as (but not limited to) those shown in FIG. 2, followed by non-linearities (such as, but not limited to, operations: finding the maximum within groups of inputs; comparing inputs to thresholds and producing 0 or 1 according to the result; pooling of multiple values together with other non-linear operations like value rounding to an integer or multivalue assignment with a static or dynamic non-linear function). The autoencoder is designed as a multilayered neural network as discussed in references [10]-[12], trained with offline content & back propagation. The training may for example use, but is not limited to, stochastic gradient descent methods [10], performing downscaling in Sa/oT as per said option 1 above, and minimizing a specific cost function that includes one or both of the following: (I) a distance metric when upscaling with the corresponding autoencoder structure (e.g., mean square error or other); (II) a structural similarity metric between the high-resolution and the Sa/oT-downscaled IoVF (e.g., counting successive peaks in the signal and their signs or comparing the phase of the two signals when computed with a phase-generation transform).

(22) In certain embodiments, the application of the options 1 to 3 may be iterated within a loop, as now described with reference to FIGS. 5 and 6. FIG. 5 shows the process of the downscaling stage 1, in particular possible interactions between the different options, while FIG. 6 shows in more detail the operation FR shown in FIG. 5. In FIGS. 5 and 6 the options 1 to 3 are identified as (001).1 to (001).3, while the iteration loop of all the options is identified as (001).4.

(23) The steps of the iteration process are as follows:

(24) (I) The downscaled IoVFs by any factor

(25) $\frac{a}{b}$
in Sa/oT as per option 1 are upscaled back to their original Sa/oT resolution using an edge-adaptive or autoencoding-based upscaling (designed to approximately invert the downscaling process), and a filter kernel is applied, such as (but not limited to) a fixed or adaptive filter having complementary frequency response to the Sa/oT blur kernel of option 2 (i.e., being an Sa/oT high-frequency—or “detail”—enhancement kernel rather than an Sa/oT blur kernel).

(26) (II) The upscaled IoVFs are downscaled with any of the downscaling processes of options 1-3, and the downscaled error IoVF computed as the difference between the previously-downscaled IoVF and the downscaling of this part.

(27) (III) The downscaled error IoVF is upscaled following any of the options 1-3, and filtered with the filter kernel of step I above in order to be added to the previously-upscaled IoVF of step I above.

(28) (IV) Steps II and III above are repeated until norm-K (also known as L.sub.K) [15] of the downscaled error IoVFs does not change beyond a threshold between successive iterations, or for a fixed number of iterations (e.g., 3 iterations, or any other number).

(29) In embodiments of the invention, the filters of option 1 may include (but are not limited to) edge-adaptive filters that use thresholding to determine edge content in input IoVF or IoVF difference matrices, e.g., local or global thresholding. An example embodiment of such edge-adaptive filters is, but is not limited to, adaptive weighted combinations of adjacent samples from IoVFs in Sa/oT, e.g., for samples A, B, D, E, J, K, M, N of the successive IoVFs 1 & 2 of FIG. 5, the space-time combination:
R.sub.d=T.sub.t.sub.1(w.sub.iA)+T.sub.t.sub.2(w.sub.2B)+T.sub.t.sub.3(w.sub.3D)+T.sub.t.sub.4(w.sub.4E)+T.sub.t.sub.5(w.sub.5J)+T.sub.t.sub.6(w.sub.lK)+T.sub.t.sub.7(w.sub.7M)+T.sub.t.sub.8(w.sub.8N)
with:
∀iϵ{1, . . . ,8}: T.sub.t.sub.i(x)={x, ifx>t.sub.i;0,otherwise}

(30) where ∀i ϵ{1, . . . ,8}: t.sub.i, w.sub.i are thresholds and weights set adaptively in order to align the filtering of the equation R.sub.d to Sa/oT edges in the input IoVFs.

(31) In embodiments of the invention, the “blur” filter kernel of option 2 may include (but is not limited to) the 5×5 Gaussian function of the form:
G[x,y,t]=[1/(2πσ.sup.2)]exp(−(x.sup.2+y.sup.2+t.sup.2)/(2σ.sup.2)),

(32) with −2≤x, y, t≤2 being the function's discrete space-time support and a being the standard deviation of the Gaussian function.

(33) In embodiments of the invention, the autoencoder of option 3 may include, but is not limited to, autoencoders that, through a series of layers such as those shown in FIG. 2, minimize a cost function h.sub.W.sub.b that includes terms maintaining the distance when upscaling with the autoencoding process and when upscaling with a simpler process such as a bicubic or bilinear filter.

(34) Finally, in embodiments of the invention, option 4 may include, but is not limited to, any embodiments of options 1-3 within the iterative loop described, and computing the norm-1 or norm-2 between the error signal of successive iterations in order to establish if the norm is below a threshold.

(35) The adaptivity mechanisms 2 used in option 1 above are now discussed in more detail. The adaptivity mechanisms 2 comprise one or both of the following two parts:

(36) (i) Adaptive tuning of parameters for the downscaling ratio in Sa/oT between successive IoVF (with the downscaling ratio changing in each Sa/oT dimension, or within the same IoVF) and/or adaptively changing the filter or autoencoding parameters in Sa/oT in order to achieve the best results on training data or based on a modelling framework that approximates the cost functions used. Embodiments of the invention include, but are not limited to, this tuning using larger downscaling ratios for high-bitrate encoding and lower ratios for low-bitrate encoding, since it is expected that downscaling ratio will decrease monotonically to the utilized bitrate for the encoding of the low-resolution IoVFs. For example, for bitrates between 3,500 kbps and 7,000 kbps and HD-resolution video content at 25 to 30 frames-per-second (fps), the tuning of parameters can select: spatial downscaling ratios

(37) $\frac{a}{b} = \frac{3}{4}$
for both the horizontal and vertical spatial dimension; temporal downscaling ratio

(38) $\frac{a}{b} = 1;$
the use of the 5×5 blur kernel of eq. 3 and a corresponding 5×5 detail kernel such as a Laplacian of Gaussians with appropriate choice of standard deviation parameters; and three iterations of the iteration process described above.

(39) On the other hand, for bitrates between 100 kbps and 300 kbps, the adaptive tuning can select: spatial downscaling factors

(40) $\frac{a}{b} = \frac{1}{4}$
for the horizontal dimension;

(41) $\frac{a}{b} = \frac{1}{2}$
for the vertical spatial dimension;

(42) $\frac{a}{b} = \frac{1}{2}$
for the temporal dimension; the use of a 6-layer autoencoder with the mean square error after reconstruction; and two iterations of the iteration process described above.

(43) The choice of these parameters can be done based on a combination of offline training with representative IoVFs, as well based on online features extracted from the present IoVFs to be encoded, such as variance between frames, the rate-distortion characteristics of select encoding of certain frames with a fast & single-pass with the utilized video codec, etc. These tuning options are just some of the options that will be apparent to the skilled person.

(44) (ii) In order to control the complexity of the entire process, adaptive tuning of whether to apply the three options described above within areas of each IoVF based on a similarity criterion between successive IoVF, e.g., computing areas of the error between IoVF L.sub.K=f.sub.K(V.sub.i[x,y]−V.sub.i+1[x,y]) with i being the time index, and (x,y) representing an area within the IoVFs and f.sub.K representing a distance function, e.g., norm-K (also known as L.sub.K norm) [15], with K ϵ{0, . . . , inf}. When similarity is above a certain threshold (with the latter itself tuned adaptively), instead of applying any of the three options, the same downscaled output as in the previous IoVF, or the same parameters as used for the options, are used. For example, blocks of 16×16 can be replicated from the reconstruction of previous frames if their difference at low resolution has norm-2 that is below 0.01. The skilled person will appreciate that this is just one example of a threshold rule, and many other possible distance metrics would be apparent to the skilled person.

(45) In the next encoding stage 3 of FIG. 1, the filtered and downscaled IoVFs are encoded using an image or video codec. Any standard MPEG, ITU-T, Open, proprietary, or machine-learning encoder (e.g., autoencoder or other) may be used to compress the produced downscaled IoVFs to a binary stream. Examples include, but are not limited to: the MPEG-ITU-T AVC/H.264 video codec, the MPEG HEVC codec, the JPEG or JPEG2000 image codec, the AOMedia AV1 video codec, machine learning image codecs such as those proposed in research papers [10]-[14].

(46) In the next transmitting stage 4 of FIG. 1, the binary stream produced by the encoding stage 3 is stored or transmitted (over any network), by a lossy or lossless method. This includes the transmission or storage of subsets or “layers” of the produced bitstream, such as select frames or bitstream corresponding to Sa/oT portions of the entire set of IoVFs and decoding any of these subsets with any standard MPEG, ITU-T, Open, proprietary or machine-learning decoder to decompress the provided bitstream into “received” or “recovered” IoVFs. This stage may include the transmission of additional layers known as “refinement layers”, which contain information that can assist in the creation of the full-resolution image/video from the compressed downscaled version, and that is not present within the compressed bitstream of the downscaled version. However, importantly, the transmission/storage and use of said refinement layers is not present in all embodiments of the invention.

(47) In the next decoding stage 5 of FIG. 1, the binary stream from the transmitting stage 4 is decoded, using the corresponding decoder for the image or video codec using in the encoding stage 3.

(48) In the next upscaling stage 5 of FIG. 1, the decoded downscaled IoVFs from the transmitting stage 4 are filtered in Sa/oT with one or a combination of the following options: content-adaptive filters tracking edge direction or flow of pixel or scene activity in time, designed to match (DtM) those of option 1 of the downscaling stage 1, where the definition of “DtM” is one of the following: (i) identical steps are applied, with the only change being alternating signs of addition with subtraction and/or upscaling instead of downscaling; (ii) steps corresponding to those of option 1 of the first stage 1 according to a certain similarity criterion expressed by a closed-form expression (e.g., distance function or other) or optimization process (e.g., dynamic programming process or other convex optimization process that minimizes the error of the upscaled output or the error between the application of the two operators in tandem); an Sa/oT blur (or detail) kernel, such as a Sa/oT Gaussian function DtM that of option 2 the downscaling stage 1; an autoencoder DtM that of option 3 the downscaling stage 1; and/or an iterative process DtM that of the iterative process of the downscaling stage 1.

(49) In this upscaling stage 6, adaptive tuning 7 of parameters for the upscaling ratio in Sa/oT between successive IoVF (with the upscaling ratio changing in each Sa/oT dimension and being any rational number

(50) $\frac{b}{a}$
with a, b any integers greater man zero), or within the same IoVF and/or adaptively changing the filter or autoencoding parameters in Sa/oT, is performed so as to DtM the parameters used in the downscaling stage 1. This includes the adaptive tuning of whether to apply the DtM steps of the downscaling stage 1, or simply replicate the same content as in the previous IoVF (or the same parameters) are used, as in the adaptivity mechanisms 2.

(51) As noted in the comparison results 8 shown in FIG. 1, obtaining the decoded IoVFs and storing, displaying, or ingesting them into a subsequent machine learning or artificial intelligence system, has been shown to demonstrate a reduction in bitrate or quality increase (e.g., increase in PSNR, SSIM or other visual metric), compared with the same encoding stage 3 and decoding stage 5 but in the absence of some or all of the downscaling stage 1, adaptivity mechanisms 2, upscaling stage 6 and adaptive tuning 7.

(52) It will be appreciated that where video data has a GOP (“groups of pictures”) structure, i.e. it comprises a series of GOPs, the parameters for downscaling a whole GOP could be determined (i.e. the same parameters use to downscale all images in the GOP), as by their nature the images in a GOP are likely to share characteristics. Parameters for subsequent GOPs could then be determined, so that different GOPs are downscaled using different parameters, if appropriate. Each downscaled GOP can then encoded and added to the video bitstream using a known video codec in the usual way. A flowchart showing an example method of encoding such video data is shown in FIG. 7. The video bitstream can then be decoded using the video codec, and each GOP upscaled using the required parameters, and a flowchart showing an example method of decoding such video data is shown in FIG. 8.

(53) A method of optimizing an artificial neural network in accordance with an embodiment of the invention is now described. Such a neural network can for example be used as the autoencoder of option 3 of the downscaling stage 1 of the embodiment described above.

(54) The neural network of the present embodiment is a multi-layered neural network a part of which is shown in FIG. 2, which can also be known as a deep neural network (DNN). (Such DNN may be, but are not limited to, convolutional neural networks.) The method can serve to control the complexity-vs-accuracy of such a DNN. It has been observed that practical DNN designs are underpinned by an enormous reuse trend: there are perhaps 15-20 core architectures trained on massive amounts of data (such as: VGG, Inception, GoogleNet, DRAW, RNN LM, etc.), which are then adapted and repurposed within numerous specialized contexts [17]-[21][26][28]. However, it will be appreciated that the method equally applies arbitrary variants or combinations of such DNNs, as well as to new DNNs.

(55) Taking such a DNN, the method of the present embodiment provides pruning and complexity-accuracy scaling to optimize the DNN. The DNN may for example be implemented via any commonly-used library such as TensorFlow, caffe2, etc. Using the method, when a core DNN design like Inception or VGG-16 is fine-tuned for a specific dataset and problem context (e.g. training for IoVF SR as in the above embodiment, but also face recognition, ethnicity/gender recognition, etc.), the method can be used to also add resource-precision scaling.

(56) The method is now described in detail with reference to FIG. 11.

(57) In a first step 100, the initial DNN to be optimized is obtained. The DNN will comprise (artificial) neurons in layers, and filters within the layers. The following aspects of the method are then performed.

T1.1) Adaptive Filter Pruning

(58) The filters are pruned to maximize computational efficiency while minimizing the drop in predictive accuracy of the neural network. For each convolutional layer i, the relative importance of each filter F.sub.i inside the layer can be calculated using the sum of its absolute weights Σ.sub.∀j|F.sub.i,j|, with F.sub.i,j the weights of feature map F.sub.i at position j. Visual examples of feature maps are shown in FIG. 9. Other metrics of importance can be used that go beyond this weighting method, such as sum of squares or other non-linear functions of importance such as harmonic or exponential weighting of the feature map weights. The rationale in using such metrics of importance is that smaller kernel weights will produce feature maps with weak activations, which, when deactivated and removed, should not alter the accuracy of the DNN output in a significant way.

(59) The steps for pruning M filters from the ith convolutional layer are as follows:

(60) (i) For each DNN layer, loop through all feature maps.

(61) (ii) For each filter F.sub.i,j in convolutional layer i, evaluate the importance of neurons (i.e., weights) (step 101 of FIG. 11); this can be done by calculating the sum of its absolute kernel weights s.sub.i=Σ.sub.∀j|F.sub.i,j|, or using some other form of importance measure, such as, but not limited to, sum of squares, or product of squares or absolute values, calculation of entropy of F.sub.i,j (or other variants of information quantification measure), etc.

(62) (iii) Sort s.sub.i in descending order.

(63) (iv) Prune filters with smallest s.sub.i and remove kernels in the convolutional layer i corresponding to the pruned feature maps, which (step 102 of FIG. 11).

(64) (v) Calculate sensitivity to pruning filters for the new kernel matrix for each layer independently by evaluating the resulting pruned network's accuracy on the validation set.

(65) (vi) Create new kernel matrix by copying the remaining kernel weights.

(66) (vii) Retrain the network to compensate for performance degradation (step 103). Retraining can be performed after every single pruning action, after pruning all layers first or anything in between until the original accuracy is restored (the “yes” path of step 104). The remaining steps of the method (the “no” path of step 104 and beyond) are described later below.

(67) An example of the result of filter pruning is shown schematically in FIG. 10, showing the removal of the corresponding feature map and related kernels in the next layer.

T1.2) Adaptive Scaling in DNNs

(68) State-of-the-art deep learning libraries use dataflow graphs to define and run multilayer DNN models. Dataflow graphs explicitly define the linear and non-linear processes of a neural network for every layer, giving libraries like TensorFlow the ability to calculate differentials at every level for every node. Efficient calculation of differentials at every level is important to minimize loss functions. For example, a TensorFlow call corresponding to 64 channels of a convolutional layer (with each channel comprising a 3×3 filter that is applied with stride (per dimension) [1, 2]) is: con1=Conv(data, [3,3,64], [1,2], bias=0.0, stddev=5e-2, padding=‘SAME’, name=‘conv1’);

(69) Such calls are mapped to the corresponding C++ code segments (or directly to custom hardware designs that supports TensorFlow) that carry out the data and weight reordering for the actual convolutions and update the filter weights based on the utilized stochastic gradient descent. After the pruning process of T1.1 has been applied, or independent of said pruning process, a-priori scaling with a scaling coefficient matrix that allows for inherent prioritization according to the utilized coefficients is added. For example, in the above conv1 example, a-priori 3×3×64 scaling could be performed following a set of monotonic functions that prioritized within the 64 filters of the convolutional layer, as well as within the 3×3 masks. The specific shape of these monotonic functions could be fixed, or determined based on a learning framework. An example instantiation of the learned coefficients corresponding to the conv1 layer is shown in FIG. 15. However, it will be appreciated this represents only an example instantiation, and many other functions could be derived, such as exponential decay, hyperbolic functions, polynomial functions, etc. Due to the inherent prioritization imposed by these scaling coefficients, maintaining all the weights during the forward pass (and the backpropagation pass during training) corresponds to the least-efficient, “full-accuracy”, mode (i.e. the accuracy obtained using the entire cony layer). Adaptive, highly-efficient, “lossy”, modes are derived by simply retaining only small parts of the cony layer that are multiplied with largest scaling coefficients during training or inference, and ignoring the remainder parts, as in the example shown in the leftmost part of FIG. 15. Decreasing the retained part of each layer decreases the DNN accuracy. However, it can also lead to 8-128 times less data transfers, processing cycles & energy consumption due to the incurred compaction.

(70) Furthermore, in order to provision for incremental DNN training, training can be provided with subsets of the DNN weights, starting from those corresponding to the most significant scaling coefficients, and progressing to those corresponding to the least significant coefficients. Once the DNN weights have converged for a given scaling region, they can be kept constant in order to apply during the forward pass and train subsequent regions. An example (out of many possible) of three retained layers is shown in the leftmost part of FIG. 15. Retaining only some of these coefficient layers allows for complexity scaling during training and quickly exposes the convergence rate and accuracy characteristics of a DNN design prior to extending the training to more layers. This is a crucial aspect when attempting to customize a number of DNN designs for a given dataset and machine learning context, as headway can be made in the correct direction more quickly, allowing more time to be allocated to full DNN training for the subset of the designs that were deemed appropriate to the data and learning problem at hand. In order to control the scaling properties of the final solution, the shape and size of the retained layers can also be tunable by a user, as such parameters can be exposed as a hyperparameter set within the utilized DNN library.

(71) Next, the following steps are performed:

(72) (viii) Optionally retain only a subset of DNN layers (step 105).

(73) (ix) Apply monotonic layer weighting functions to the retained layers (step 106).

(74) (x) Retrain the network to compensate for performance degradation (step 107). Retraining can be performed after every single layer selection and weighting action, after weighting all layers first, or anything in between until the original accuracy is restored (the “yes” path of step 108).

(75) (xi) Once training is stopped in selective retainment and layer weighting (the “no” path of step 108), optionally restart the whole process (the “yes” option of step 109), or alternatively move to the inference step (step 110, following the “no” option of step 109) using new IoVFs or inputs not seen during training and selectively using layers to implement SR or classification inference on the new data.

(76) An example embodiment of layer weighting within Tensorflow [27] is now given. It will be appreciated that the invention is not limited to this embodiment. As this scaling can be implemented with a Hadamard product, e.g., a tf.mul( )command and broadcasting in NumPy prior to the cony command in TensorFlow, basic integration of such a framework within state-of-the-art DNN libraries requires no change in their codebase. Instead, it can be achieved by automated parsing of the Python layer of a TensorFlow DNN design (similar for caffe2 and elsewhere) and appropriate instrumentation. In order for the auto-differentiation and gradient flow processes of such libraries to leave the scaling coefficients unchanged, they can be defined as constants or hyperparameters, and the shape parameters of the multidimensional monotonic layering functions of FIG. 15 can be established with external optimization tools. Finally, since the proposed layering is applied as a preprocessing layer (and potentially as a backprop postprocessing layer as discussed in the next task), other optimizations like adaptive DNN weight quantization or BinaryNets [23]-[25][39][41] can be applied for system-specific acceleration. Importantly, the disclosed invention is not limited to a specific implementation library like Tensorflow or caffe, and can also be implemented in bespoke hardware processors and any programmable interface that can support DNN operations. Example accuracy-performance SR results with example embodiments of the disclosed pruning, projection operations and approximations for convolution and matrix product operations are shown in FIGS. 12 to 14.

T1.3) Optimization and Control of Scaling Coefficient Functions

(77) The described graceful degradation in accuracy for resource scaling and DNN layer compaction depends on the choice of the framework to design the shape of the monotonic scaling functions of each layer. Starting from a variety of initializations, e.g., using linearly-, polynomially- or exponentially-decaying functions, adjustment of the shape parameters of the functions within the backpropagation operation carried out within each layer is performed. As a non-limiting example, based on the average differential carried across each dimension of the layering coefficients of FIG. 15, a corresponding increase of decrease of the linear, polynomial or the exponential decay constants (and biases) is applied in order to adjust the shape of the utilized scaling functions according to the strength, homogeneity (and direction) of the gradient flow.

(78) Example bitrate-PSNR curves obtained from embodiments of the invention include, but are not limited to, those presented in FIGS. 3 and 4. These utilize settings: a=1, b=2, the blur kernel of eq. 3, the adaptive filtering process of eq. 1, three iterations of the iteration process of FIG. 5, and the MPEG/ITU-T H.264/AVC video codec as embodied by the open-source FFmpeg libx264 library. The horizontal and vertical black arrows indicate example bitrate reduction or PSNR increase offered by the example embodiments of the disclosed invention over the conventional video codec that only performs the encoding stage 3 and decoding stage 5.

(79) Embodiments of the invention include the methods described above performed on a computing device, such as the computing device shown in FIG. 16. The computing device 200 comprises a data interface 201, through which data can be sent or received, for example over a network. The computing device 200 further comprises a processor 202 in communication with the data interface 201, and memory 203 in communication with the processor 202. In this way, the computing device 200 can receive data, such as image data or video data, via the data interface 201, and the processor 202 can store the received data in the memory 203, and process it so as to perform the methods of embodiments the invention, including encoding image data, decoding encoded image data, and optimizing an artificial neural network.

(80) While the present invention has been described and illustrated with reference to particular embodiments, it will be appreciated by those of ordinary skill in the art that the invention lends itself to many different variations not specifically illustrated herein.

(81) Where in the foregoing description, integers or elements are mentioned which have known, obvious or foreseeable equivalents, then such equivalents are herein incorporated as if individually set forth. Reference should be made to the claims for determining the true scope of the present invention, which should be construed so as to encompass any such equivalents. It will also be appreciated by the reader that integers or features of the invention that are described as preferable, advantageous, convenient or the like are optional and do not limit the scope of the independent claims. Moreover, it is to be understood that such optional integers or features, whilst of possible benefit in some embodiments of the invention, may not be desirable, and may therefore be absent, in other embodiments.

REFERENCES

(82) [1] Dong, Jie, and Yan Ye. “Adaptive downsampling for high-definition video coding.” IEEE Transactions on Circuits and Systems for Video Technology 24.3 (2014): 480-488.

(83) [2] Douma, Peter, and Motoyuki Koike. “Method and apparatus for video upscaling.” U.S. Pat. No. 8,165,197. 24 Apr. 2012.

(84) [3] Su, Guan-Ming, et al. “Guided image up-sampling in video coding.” U.S. Pat. No. 9,100,660. 4 Aug. 2015.

(85) [4] Shen, Minmin, Ping Xue, and Ci Wang. “Down-sampling based video coding using super-resolution technique.” IEEE Transactions on Circuits and Systems for Video Technology21.6 (2011): 755-765.

(86) [5] van der Schaar, Mihaela, and Mahesh Balakrishnan. “Spatial scalability for fine granular video encoding.” U.S. Pat. No. 6,836,512. 28 Dec. 2004.

(87) [6] Boyce, Jill, et al. “Techniques for layered video encoding and decoding.” U.S. patent application Ser. No. 13/738,138.

(88) [7] Dar, Yehuda, and Alfred M. Bruckstein. “Improving low bit-rate video coding using spatio-temporal down-scaling.” arXiv preprint arXiv: 1404.4026 (2014).

(89) [8] Martemyanov, Alexey, et al. “Real-time video coding/decoding.” U.S. Pat. No. 7,336,720. 26 Feb. 2008.

(90) [9] Nguyen, Viet-Anh, Yap-Peng Tan, and Weisi Lin. “Adaptive downsampling/upsampling for better video compression at low bit rate.” Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on. IEEE, 2008.

(91) [10] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. “Reducing the dimensionality of data with neural networks.” science313.5786 (2006): 504-507.

(92) [11] van den Oord, Aaron, et al. “Conditional image generation with pixelcnn decoders.” Advances in Neural Information Processing Systems. 2016.

(93) [12] Theis, Lucas, et al. “Lossy image compression with compressive autoencoders.” arXiv preprint arXiv: 1703.00395(2017).

(94) [13] Wu, Chao-Yuan, Nayan Singhal, and Philipp Krähenbühl. “Video Compression through Image Interpolation.” arXiv preprint arXiv: 1804.06919 (2018).

(95) [14] Rippel, Oren, and Lubomir Bourdev. “Real-time adaptive image compression.” arXiv preprint arXiv: 1705.05823 (2017).

(96) [15] Golub, Gene H., and Charles F. Van Loan. Matrix computations. Vol. 3. JHU Press, 2012.

(97) [16] R. Timofte, et al., “NTIRE 2017 challenge on single image super-resolution: Methods and results,” Proc. Comp. Vis. and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conf. on Comp. Vis. and Pattern Recognition, CVPR, IEEE, 2017, https://goo.gl/TQRT7E.

(98) [17] B. Lim, et al. “Enhanced deep residual networks for single image super-resolution,” Proc. Comp. Vis. and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conf. on Comp. Vis. and Pattern Recogn., CVPR, IEEE, 2017, https://goo.gl/PDSTiV.

(99) [18] C. Dong, et al., “Accelerating the super-resolution convolutional neural network,” Proc. 2016 IEEE Conf. on Comp. Vis. and Pattern Recognition, CVPR, IEEE, 2016, https://goo.gl/Qa1UmX.Dong, Jie, and Yan Ye. “Adaptive downsampling for high-definition video coding.” IEEE Transactions on Circuits and Systems for Video Technology 24.3 (2014): 480-488.

(100) [19] Dong, C., Loy, C. C., He, K., Tang, X, “Learning a deep convolutional network for image super-resolution,” Proc. ECCV (2014) 184-199

(101) [20] Dong, C., Loy, C. C., He, K., Tang, X, “Image super-resolution using deep convolutional networks,” IEEE TPAMI 38(2) (2015) 295-307

(102) [21] Yang, C. Y., Yang, M. H., “Fast direct super-resolution by simple functions,” Proc. ICCV. (2013) 561-568

(103) [22] Dong, C., Loy, C., Tang, X.: Accelerating the Super-Resolution Convolutional Neural Network, Proc. ICCV (2016).

(104) [23] Han, Song, Huizi Mao, and William J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv: 1510.00149 (2015).

(105) [24] Han, Song, et al., “Learning both weights and connections for efficient neural network,” Advances in neural information processing systems. 2015.

(106) [25] Iandola, Forrest N., et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360 (2016)

(107) [26] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, 2015.

(108) [27] M. Abadi, “TensorFlow: A system for large-scale machine learning,” Proc. 12th USENIX Symp. on Oper. Syst. Des. and Implem. (OSDI), Savannah, Ga., USA. 2016.

(109) [28] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Proc. Advances in Neural Inf. Process. Syst., NIPS, 2014.

(110) [29] V. Sze, et al., “Hardware for machine learning: Challenges and opportunities,” arXiv preprint, arXiv:1612.07625, 2016.

(111) [30] C. Zhang, et al., “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” Proc. ACM/SIGDA Int. Symp. Field-Prog. Gate Arr., FPGA. ACM, 2015.

(112) [31] T. Chen, et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” Proc. ASPLOS, 2014.

(113) [32] A. Shafiee, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” Proc. IEEE Int. Symp. on Comp. Archit., ISCA, 2016.

(114) [33] L. Chi, et al., “PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory,” Proc. IEEE Int. Symp. on Comp. Archit., ISCA, 2016.

(115) [34] B. Park, et al., “A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications,” Proc. ISSCC, 2015.

(116) [35] A. Cavigelli, et al., “Origami: A convolutional network accelerator,” Proc. GLVLSI, 2015.

(117) [36] H. Mathieu, et al., “Fast training of convolutional networks through FFTs,” Proc. ICLR, 2014.

(118) [37] C. Yang, et al., “Designing energy-efficient convolutional neural networks using energy-aware pruning,” arXiv preprint arXiv:1611.05128, 2016.

(119) [38] J. Hsu, “For sale: deep learning,” IEEE Spectrum, vol. 53, no. 8, pp. 12-13, August 2016.

(120) [39] M. Han, et al., “Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding,” Proc. ICLR, 2016.

(121) [40] K. Chen, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” Proc. ISSCC, 2016.

(122) [41] M. Courbariaux, and Y. Bengio, “Binarynet: Training deep neural networks with weights and activations constrained to +1 or −1,” arXiv preprint, arXiv:1602.02830, 2016.

Encoding and decoding image data

Assignee

Inventors

Cpc classification

Classification Explorer

H04N19/147

ELECTRICITY

Classification Explorer

G06N3/044

PHYSICS

Classification Explorer

G06N3/082

PHYSICS

Classification Explorer

H04N19/117

ELECTRICITY

Classification Explorer

H04N19/85

ELECTRICITY

Classification Explorer

H04N19/59

ELECTRICITY

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

H04N19/587

ELECTRICITY

Classification Explorer

G06N3/049

PHYSICS

Classification Explorer

G06N3/04

PHYSICS

Classification Explorer

G06N3/045

PHYSICS

Classification Explorer

H04N19/192

ELECTRICITY

Classification Explorer

H04N19/80

ELECTRICITY

International classification

Classification Explorer

H04N19/59

ELECTRICITY

Classification Explorer

H04N19/85

ELECTRICITY

Classification Explorer

G06N3/04

PHYSICS

Classification Explorer

G06N3/08

PHYSICS

Classification Explorer

H04N19/80

ELECTRICITY

Abstract

Claims

Description