SPATIALLY ADAPTIVE IMAGE FILTERING
20220277430 · 2022-09-01
Assignee
Inventors
- Filippos KOKKINOS (London, GB)
- Ioannis Marras (London, GB)
- Matteo MAGGIONI (London, GB)
- Stefanos ZAFEIRIOU (London, GB)
- Gregory SLABAUGH (London, GB)
Cpc classification
G06V10/451
PHYSICS
International classification
G06V10/22
PHYSICS
G06V10/44
PHYSICS
Abstract
An image processor for transforming an input image, the image processor being configured to implement a trained artificial intelligence model, wherein the image processor is configured to: receive the input image; based on one or both of (i) the content of the input image and (ii) features extracted from the input image, process the image by the trained artificial intelligence model to: (i) determine a set of image filters; and (ii) for each of a plurality of subregions of the image, select an image filter from the set of image filters; and for each of the plurality of subregions of the image, apply the respective image filter to the subregion or to features extracted from that subregion. This may allow for differentiable selection of filters from a discrete, learnable and decorrelated group of filters to allow for content-based spatial adaptations.
Claims
1. An image processor for transforming an input image, the image processor being configured to implement a trained artificial intelligence model, wherein the image processor is configured to: receive (101) the input image (201); based on one or both of (i) the content of the input image and (ii) features extracted from the input image, process (102) the image by the trained artificial intelligence model to: (i) determine a set of image filters (202, 203, 204, 205); and (ii) for each of a plurality of subregions of the image, select an image filter from the set of image filters; and for each of the plurality of subregions of the image, apply (103) the respective image filter to the subregion or to features extracted from that subregion.
2. The image processor of claim 1, wherein each of the plurality of subregions is a pixel of the input image.
3. The image processor of claim 1, wherein, for each of the plurality of subregions of the image, the selected image filter is applied to the features extracted from that subregion, and wherein the features extracted from the subregion of the input image are defined in the form of a tensor.
4. The image processor as claimed in claim 1, wherein the image processor is further configured to, for each of the plurality of subregions of the image, select an image filter from the set of image filters based on one or both of (i) the content of the respective subregion of the image and (ii) features extracted from the respective subregion of the image.
5. The image processor as claimed in claim 1, wherein the image processor is further configured to, for each of the plurality of subregions of the image, select an image filter from the set of image filters based on one or both of (i) the content of an area around the respective subregion of the image and (ii) features extracted from the areas around the respective subregion of the image.
6. The image processor as claimed in claim 1, wherein the trained artificial intelligence model is a convolutional neural network.
7. The image processor as claimed in claim 6, wherein the convolutional neural network comprises a regularizer which enforces variability to the learned set of image filters.
8. The image processor as claimed in claim 1, wherein the set of image filters comprises a pre-defined number of discrete filters.
9. The image processor as claimed in claim 1, wherein each image filter of the set of image filters is unique from the other members of the set.
10. The image processor as claimed in claim 1, wherein each image filter of the set of image filters is a kernel.
11. The image processor as claimed in claim 10, wherein the set of image filters comprises kernels having at least two different sizes.
12. The image processor as claimed in claim 1, wherein the image processor is configured to perform one or more of the following image operations: demosaicking, superresolution, semantic segmentation and image classification.
13. A method (100) for implementation at an image processor for transforming an input image, the image processor being configured to implement a trained artificial intelligence model, the method comprising: receiving (101) the input image (201); based on one or both of (i) the content of the input image and (ii) features extracted from the input image, processing (102) the image by the trained artificial intelligence model to: (i) determine a set of image filters (202, 203, 204, 205); and (ii) for each of a plurality of subregions of the image, select an image filter from the set of image filters; and for each of the plurality of subregions of the image, applying (103) the respective image filter to the subregion or to features extracted from that subregion.
14. The method of claim 13, wherein each of the plurality of subregions is a pixel of the input image.
15. The method of claim 13, wherein the set of image filters comprises a pre-defined number of discrete filters.
Description
BRIEF DESCRIPTION OF THE FIGURES
[0025] The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
[0026]
[0027]
[0028]
[0029]
DETAILED DESCRIPTION OF THE INVENTION
[0030] Described herein is a filtering unit for an image processor that can perform differentiable selection of filters from a discrete learnable and decorrelated group of filters. The selection may be made per pixel (or other subregion of the image) and thus the computation changes spatially according to the content of the input. The selection of the filters may be performed using a compact CNN network which is trained implicitly to select filters based on features it extracts from the input. The end result is the spatially varying application of the filters to the image or tensor to be filtered.
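[0031] A common way of implementing a convolutional layer in CNNs is a matrix vector product between a kernel $W \in \mathbb{R}^{k \times k \times c_{in} \times c_{out}}$ of support size $k$ and an input $x \in \mathbb{R}^{h \times w \times c_{in}}$.
[0032] The output $y$, with $y_{ij} \in \mathbb{R}^{c_{out}}$ at each position $(i,j)$, is computed as:
$y_{ij} = \sum_{(a,b) \in \mathcal{N}_k(i,j)} W_{i-a,\,j-b}\, x_{ab}$ (1)
[0033] where the neighbourhood is defined as $\mathcal{N}_k(i,j) = \{(a,b) : |a-i| \le k/2,\ |b-j| \le k/2\}$.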
[0034] As can be seen in Equation (1), the same weights are applied at every position of x. This property of convolutional layers is known as translation equivariance. While this property has driven progress in computer vision tasks, weight sharing across all positions is poorly suited to producing a spatially varying output. This intrinsic limitation results from the fact that the loss gradients from all image positions are fed into global kernels, which are trained to minimize the error at all locations. The same problem arises in practice in a wide variety of problems which require dense prediction or regression, such as image segmentation, restoration and enhancement.
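The weight sharing described for Equation (1) can be illustrated with a minimal NumPy sketch. The function name `conv2d_shared`, the zero padding at the image border and the restriction to odd support sizes are assumptions made for illustration only. The kernel is indexed by spatial offset in the cross-correlation convention common in CNN libraries; flipping the kernel spatially gives the other sign convention.

```python
import numpy as np

def conv2d_shared(x, W):
    """Apply ONE kernel W (k, k, c_in, c_out) at every position of x (h, w, c_in).

    Assumes an odd support size k and zero padding at the border.
    """
    h, w, _ = x.shape
    k, c_out = W.shape[0], W.shape[3]
    r = k // 2
    xp = np.pad(x, ((r, r), (r, r), (0, 0)))  # zero padding (assumption)
    y = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]   # k x k neighbourhood of (i, j)
            # Sum over spatial offsets and input channels.
            # The same W is used for every (i, j): this is the weight sharing.
            y[i, j] = np.einsum("abc,abco->o", patch, W)
    return y

# Usage: a Dirac (identity) kernel leaves the image unchanged.
x_demo = np.arange(16.0).reshape(4, 4, 1)
dirac = np.zeros((3, 3, 1, 1))
dirac[1, 1, 0, 0] = 1.0
y_demo = conv2d_shared(x_demo, dirac)
```

Because `W` does not depend on `(i, j)`, every pixel is processed by the same linear operator, which is the behaviour the spatially varying approach relaxes.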
[0035] Instead of applying the same kernel on all pixels as described above, the method described herein selectively breaks the equivariance by selecting which filters (or kernels) from a discrete group should be deployed on which locations of an image. This is termed spatially varying convolution. The group of kernels $\hat{W} \in \mathbb{R}^{n \times k \times k \times c_{in} \times c_{out}}$ contains $n$ kernels, and the output is computed as:
$y_{ij} = \sum_{(a,b) \in \mathcal{N}_k(i,j)} \sum_{m=1}^{n} z_{ijm}\, \hat{W}_{m,\,i-a,\,j-b}\, x_{ab}$ (2)
[0036] where $z \in \mathbb{R}^{h \times w \times n}$ is a one-hot encoded index that indicates which kernel out of n kernels in the group should be selected for every pixel. The selection indices z are predicted from a kernel selection mechanism f given the image to be filtered as input, i.e. z=f(x). It can be seen from Equation 2 that different regions of an image are filtered with distinct kernels from Ŵ, thus selectively breaking the translation equivariance property of convolutional layers.
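A minimal NumPy sketch of the spatially varying convolution of Equation 2 follows. The name `spatially_varying_conv`, the zero padding and the odd support size are illustrative assumptions, and the selection map `z` is supplied by hand here rather than predicted by a selection network f.

```python
import numpy as np

def spatially_varying_conv(x, W_group, z):
    """x: (h, w, c_in); W_group: (n, k, k, c_in, c_out); z: (h, w, n) one-hot."""
    h, w, _ = x.shape
    k, c_out = W_group.shape[1], W_group.shape[4]
    r = k // 2
    xp = np.pad(x, ((r, r), (r, r), (0, 0)))  # zero padding (assumption)
    y = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]
            # Weighted sum over the group; with a one-hot z[i, j] this
            # simply picks the selected kernel for this pixel.
            W_ij = np.einsum("n,nabco->abco", z[i, j], W_group)
            y[i, j] = np.einsum("abc,abco->o", patch, W_ij)
    return y

# Usage: a group of two kernels, a Dirac filter and a 3x3 box mean.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(4, 4, 1))
group = np.zeros((2, 3, 3, 1, 1))
group[0, 1, 1, 0, 0] = 1.0          # kernel 0: identity
group[1, :, :, 0, 0] = 1.0 / 9.0    # kernel 1: box mean
z_demo = np.zeros((4, 4, 2))
z_demo[:, :2, 0] = 1.0              # left half uses the Dirac filter
z_demo[:, 2:, 1] = 1.0              # right half uses the box filter
y_demo = spatially_varying_conv(x_demo, group, z_demo)
```

In this usage the left half of the output equals the input, while the right half is locally averaged, so two regions of one image are filtered by different members of the group.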
[0037] An example of the kernel selection mechanism will now be described.
[0038] The discrete selection of kernels based on the input content can be learned using available training data. In order to extract features of interest, a compact CNN may be used that receives as input an image or embeddings $x \in \mathbb{R}^{h \times w \times n}$ and gives as output probabilities $z \in \mathbb{R}^{h \times w \times n}$. These probabilities represent the likelihood that each filter is most suitable for a particular pixel. The CNN may be trained implicitly to select the best kernel from the group, which is also simultaneously learned, by minimizing a task specific loss.
[0039] Preferably, the selection is discrete, deploying at each pixel the kernel from the group in which the kernel selection CNN is most confident, i.e. $\arg\max z_{ij}$. However, the arg max function is non-differentiable and therefore not suitable for use as a core component of modern deep learning models.
[0040] The issue of discrete selection may be addressed using a differentiable relaxation of the Gumbel-Max trick (as described in Emil Julius Gumbel, “Statistical theory of extreme values and some practical applications: a series of lectures”, Number 33, US Govt. Print. Office, 1954, and Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool, “Dynamic filter networks”, Advances in Neural Information Processing Systems, pages 667-675, 2016), which shows that sampling of a discrete random variable can be converted into a deterministic selection given independently sampled Gumbel noise. The arg max operation of the Gumbel-Max trick may be replaced with a softmax, which is differentiable, and a temperature τ, as follows:
$\hat{X}_k = \mathrm{softmax}\left((\log \alpha_k + G_k)/\tau\right)$ (3)
[0041] When $\tau \to 0$ the softmax function asymptotically approximates an arg max function, while as $\tau \to \infty$ the samples approach a uniform distribution.
[0042] The straight-through version of the Gumbel-softmax estimator may be used, which during the forward pass discretizes the selection to be binary while the backward pass is calculated based on the continuous selection probabilities z. The straight-through estimator allows for faster convergence and intuitive kernel selection maps, despite the apparent inconsistency between the forward and backward pass, which theoretically leads to biased gradient estimation.
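The forward pass of the relaxed selection can be sketched in NumPy as below. This is only an illustration under stated assumptions: `gumbel_softmax` is a hypothetical helper name, `log_alpha` is taken to hold the unnormalized log-probabilities produced by the selection CNN, and G is sampled as standard Gumbel noise. NumPy has no automatic differentiation, so the backward pass of the straight-through estimator is not shown; the sketch returns both the continuous probabilities (which would carry the gradient) and their discretized one-hot version (which would be used in the forward pass).

```python
import numpy as np

def gumbel_softmax(log_alpha, tau, rng):
    """log_alpha: (..., n) unnormalized log-probabilities.

    Returns (soft, hard): relaxed selection probabilities and their
    one-hot discretization along the last axis.
    """
    u = np.clip(rng.random(log_alpha.shape), 1e-12, 1.0 - 1e-12)
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    s = (log_alpha + g) / tau                    # temperature scaling
    s = s - s.max(axis=-1, keepdims=True)        # numerical stability
    soft = np.exp(s)
    soft = soft / soft.sum(axis=-1, keepdims=True)
    n = log_alpha.shape[-1]
    hard = np.eye(n)[soft.argmax(axis=-1)]       # forward-pass one-hot choice
    return soft, hard

# Usage: per-pixel selection among n = 4 kernels on a 3 x 5 image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5, 4))
soft, hard = gumbel_softmax(logits, tau=1.0, rng=rng)
soft_hot, _ = gumbel_softmax(logits, tau=1e6, rng=rng)  # close to uniform
```

In a framework with automatic differentiation, a common idiom for the straight-through behaviour is to output `hard + soft - stop_gradient(soft)`, so that the value equals `hard` while gradients flow through `soft`.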
[0043] A preferred component for robust selection is a regularizer enforcing variability to the set of learnable filters. This form of regularization penalizes the naïve solution where all filters are identical and the filter selection per pixel can be as good as chance. Simultaneously, filters that are dissimilar to each other act as unique linear operators or feature extractors when used in deep neural networks. Their application will yield different results, which increases the expressivity of the learnable group of filters or kernels by suppressing any redundancies. In order to maximize dissimilarity and enforce variability, the cosine similarity between the kernels in a group can be penalized. This may be achieved by first normalizing each kernel and then stacking all kernels as the $n$ rows of a matrix $W_f$, one flattened kernel per row, and computing:
$\mathcal{L}_R = \| W_f W_f^T - I \|_F^2$ (4)
[0044] where I is the identity matrix.
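The regularizer of Equation (4) can be sketched in a few lines of NumPy. The function name `decorrelation_loss` and the small constant guarding against division by zero are assumptions for illustration.

```python
import numpy as np

def decorrelation_loss(kernels):
    """kernels: array of shape (n, ...), one kernel per leading index.

    Returns ||W_f W_f^T - I||_F^2, where the rows of W_f are the
    flattened, unit-norm kernels.
    """
    n = kernels.shape[0]
    W_f = kernels.reshape(n, -1)
    norms = np.linalg.norm(W_f, axis=1, keepdims=True)
    W_f = W_f / np.maximum(norms, 1e-12)     # unit-norm rows (eps is an assumption)
    gram = W_f @ W_f.T                       # pairwise cosine similarities
    return float(np.sum((gram - np.eye(n)) ** 2))

# Usage: two identical kernels are maximally redundant,
# two kernels with disjoint support are fully decorrelated.
identical = np.ones((2, 3, 3))
disjoint = np.zeros((2, 3, 3))
disjoint[0, 1, 1] = 1.0
disjoint[1, 0, 0] = 2.0
loss_identical = decorrelation_loss(identical)
loss_disjoint = decorrelation_loss(disjoint)
```

Since the rows are normalized, the diagonal of `gram` is one and only the off-diagonal cosine similarities contribute to the loss, which is why identical kernels are penalized and orthogonal kernels are not.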
[0045] In the case where kernels of different support sizes are deployed, the kernels may be padded to the maximum support size before the formation of matrix W.sub.f. The model may be trained with decorrelation regularization alongside task specific losses according to:
$\mathcal{L} = \mathcal{L}_{task} + \mathcal{L}_R$ (5)
[0046] Where more than one filtering module is utilized, the regularization loss may be the average of the individual losses.
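The handling of kernels with different support sizes, and the combination of the task loss with the averaged regularization losses, might be sketched as follows. The helper names, the use of centred zero padding and the restriction to odd, square, single-channel kernels are all assumptions made for this illustration.

```python
import numpy as np

def pad_to_support(kernel, k_max):
    """Zero-pad a square (k, k) kernel symmetrically to (k_max, k_max)."""
    p = (k_max - kernel.shape[0]) // 2       # assumes k_max - k is even
    return np.pad(kernel, ((p, p), (p, p)))

def stack_normalized(kernels):
    """Pad kernels to the largest support, flatten and normalize each row."""
    k_max = max(kern.shape[0] for kern in kernels)
    W_f = np.stack([pad_to_support(kern, k_max).ravel() for kern in kernels])
    return W_f / np.linalg.norm(W_f, axis=1, keepdims=True)

def total_loss(task_loss, reg_losses):
    """Task loss plus the average regularization loss over filtering modules."""
    return task_loss + float(np.mean(reg_losses))

# Usage: a 3x3 and a 5x5 kernel end up as rows of one 2 x 25 matrix.
k3 = np.ones((3, 3))
k5 = np.ones((5, 5))
padded = pad_to_support(k3, 5)
W_f = stack_normalized([k3, k5])
loss = total_loss(1.0, [0.2, 0.4])
```

Padding with zeros leaves the action of the smaller kernel unchanged while giving every row of the stacked matrix the same length, so the Gram matrix of cosine similarities is well defined.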
[0047] To address the limitations of spatially equivariant convolutional filtering, the content-adaptive filtering technique described herein modulates the processing of an input according to statistical cues that are derived from an image or a dataset. Therefore, different images will undergo a unique analysis based on the content that is depicted.
[0048] For each of the plurality of subregions of the image, the unit can select an image filter from the set of image filters based on one or both of (i) the content of the respective subregion of the image and (ii) features extracted from the respective subregion of the image. The unit may select an image filter from the set of image filters based on the content of (or features extracted from) a subregion or an area around a subregion. This may allow an appropriate filter to be applied to a particular subregion of the image.
[0049] Similarly to a conventional CNN, in a convolutional layer, a set of convolutional kernels is learned to provide task-specific filters. However, a filter selection mechanism is also learned that identifies which kernel to apply at each pixel. This way, the convolutional filtering can vary from pixel to pixel in the image.
[0050] Therefore, there are two main inter-related features of the approach:
[0051] Selecting the best filter (from a discrete set) to apply to each pixel in the image or tensor. The filter selection mechanism may be implemented as a classifier using a lightweight convolutional neural network.
[0052] Learning task-specific filters during training, which allows for the formation of a discrete set of filters. Redundancy between filters is penalized, in order to produce a set of filters that are distinct from one another.
[0053] At inference, the optimal filter is selected at each pixel, resulting in spatially varying convolution. This adapts the processing locally depending on the image content. These two features may achieve content-dependent filtering of images to handle image processing and computer vision tasks.
[0054] Instead of applying the same kernel on all pixels, the technique described herein may selectively break the equivariance by picking which kernels from a discrete group should be deployed at which locations of an image. The group of kernels may contain a pre-defined number of discrete kernels. Obtaining a group of learnable and decorrelated kernels promotes content-based image enhancement.
[0055] The filter selection mechanism is learned using available training data. In order to extract features of interest, a compact CNN can be used that receives as input an image or embeddings and gives as output probabilities. The CNN can be trained implicitly to select the best kernel from the group, which is also simultaneously learned, by minimizing a task specific loss.
[0056] Preferably, the selection is discrete: for each pixel, the most confident kernel (i.e. the filter with the highest probability for that pixel) according to the kernel selection CNN is deployed from the group, using a differentiable selection technique.
[0057] A preferred component for robust selection is a regularizer enforcing variability to the group of learnable kernels. This form of regularization penalizes the naive solution where all kernels are identical and the kernel selection per pixel can be as good as chance. Simultaneously, kernels that are dissimilar to each other act as unique linear operators or feature extractors when used in deep neural networks. Their application will yield different results, which increases the expressivity of the learnable group of kernels by suppressing any redundancies. In order to maximize dissimilarity and enforce variability, the cosine similarity between the kernels in a group is penalized.
[0058]
[0059]
[0060]
[0061] The transceiver 305 is capable of communicating over a network with other entities 310, 311. Those entities may be physically remote from the camera 301. The network may be a publicly accessible network such as the internet. The entities 310, 311 may be based in the cloud. In one example, entity 310 is a computing entity and entity 311 is a command and control entity. These entities are logical entities. In practice they may each be provided by one or more physical devices such as servers and datastores, and the functions of two or more of the entities may be provided by a single physical device. Each physical device implementing an entity comprises a processor and a memory. The devices may also comprise a transceiver for transmitting and receiving data to and from the transceiver 305 of camera 301. The memory stores in a non-transient way code that is executable by the processor to implement the respective entity in the manner described herein.
[0062] The command and control entity 311 may train the artificial intelligence models used in each module of the system. This is typically a computationally intensive task, even though the resulting model may be efficiently described, so it may be efficient for the development of the algorithm to be performed in the cloud, where it can be anticipated that significant energy and computing resource is available. It can be anticipated that this is more efficient than forming such a model at a typical camera.
[0063] In one implementation, once the deep learning algorithms have been developed in the cloud, the command and control entity can automatically form a corresponding model and cause it to be transmitted to the relevant camera device. In this example, the system is implemented at the camera 301 by processor 304.
[0064] In another possible implementation, an image may be captured by the camera sensor 302 and the image data may be sent by the transceiver 305 to the cloud for processing in the system. The resulting target image could then be sent back to the camera 301, as shown at 312 in
[0065] Therefore, the method may be deployed in multiple ways, for example in the cloud, on the device, or alternatively in dedicated hardware. As indicated above, the cloud facility could perform training to develop new algorithms or refine existing ones. Depending on the compute capability near to the data corpus, the training could either be undertaken close to the source data, or could be undertaken in the cloud, e.g. using an inference engine. The system may also be implemented at the camera, in a dedicated piece of hardware, or in the cloud.
[0066] The method is applicable to both linear and non-linear low- and high-level computer vision problems, for example in dense prediction and regression tasks, as well as standard image classification. The unit may be used to replace convolutional layers in standard neural networks for the tasks of image demosaicking, superresolution, image classification and segmentation, or combinations thereof (such as joint denoising and demosaicking problems).
[0067] The method may be applied on an explicit linear domain for the problem of demosaicking and super-resolution, where the runtime performance is of importance. In both problems, in some implementations, experimental results surpassed the performance of competing linear methods whilst simultaneously achieving competitive accuracy with popular non-linear approaches. Furthermore, the proposed filtering unit may replace convolutional layers in established deep neural networks proposed in literature for the aforementioned problems as well as classification and segmentation tasks.
[0068] For the linear case, as a first step, the method can be trained and tested as a one-shot linear solution for image processing tasks. Note that while the kernel selection mechanism is non-linear, the application of the selected kernel per pixel constitutes a purely linear transformation. Although the expressiveness of the method is restricted in this form, the end result is an application which runs in real time and yet achieves competitive performance with more complex and non-linear systems. Simultaneously, the method allows for groups of kernels with different support sizes.
[0069]
[0070] Instead of modulating or predicting kernels, the approach described herein performs differentiable selection of filters from a discrete learnable and decorrelated group of filters to allow for content based spatial adaptations. The selection can advantageously be made per pixel of the image and thus the computational graph changes spatially according to the content of the input. The selection of the filters can be performed using a compact CNN network which can be trained implicitly to select filters based on features it extracts from the input. The end result is the spatially varying application of the filters on the image or tensor to be filtered.
[0071] The formulation allows for fast and robust kernel selection with minimal overhead that depends mainly on the number of kernels of a group. Therefore, it may support kernels of arbitrary size.
[0072] Simultaneously, the set of filters or kernels can be regularized during training to be decorrelated and thus constitutes a set of unique and diverse operators. In other words, this regularized group of filters or kernels is enforced to have high variability, hence avoiding the naive solution of a group of redundant kernels.
[0073] A performance improvement has been experimentally observed across several computer vision tasks which provides strong empirical evidence for the need of spatial adaptivity and the benefits of selective filtering. The technique may result in improved image quality for image restoration purposes. Simultaneously, the described method may achieve better classification per pixel or per image than prior methods for high-level computer vision tasks.
[0074] The spatially varying convolution may allow the method to produce an output with zero error and learn the optimal filter set. A filter selection heatmap can be generated to depict which filters were selected per pixel. It can be seen whether the selection between the two optimal filters is the proper one to produce minimal error. When compared to a kernel prediction network (KPN), the method is capable of achieving lower error, while the KPN fails to predict a Dirac filter for the majority of the pixels that would require such a filter.
[0075] With regard to computational overhead, the spatially varying convolution can be implemented in parallel using the standard im2col and col2im operations, which break the image into patches according to the kernel support sizes. The patches can then be filtered using a matrix vector operation per pixel, as described in Equation (2). The same implementation is also a known fast solution for the spatially invariant convolution; however, modern computation libraries apply a set of low-level optimization techniques to gain a considerable reduction in execution time.
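A patch-based implementation along these lines might look like the following NumPy sketch. The function names, the zero padding and the choice to evaluate all n kernels on every patch before selecting are assumptions for illustration; because one output pixel is produced per patch, no col2im step is needed in this simplified form.

```python
import numpy as np

def im2col(x, k):
    """Unfold x (h, w, c) into per-pixel patches of shape (h, w, k*k*c).

    Assumes an odd support size k and zero padding at the border.
    """
    h, w, _ = x.shape
    r = k // 2
    xp = np.pad(x, ((r, r), (r, r), (0, 0)))
    # One shifted copy of the image per spatial offset (a, b).
    cols = [xp[a:a + h, b:b + w, :] for a in range(k) for b in range(k)]
    return np.concatenate(cols, axis=-1)     # flattened in (a, b, c) order

def spatially_varying_conv_im2col(x, W_group, z):
    """x: (h, w, c_in); W_group: (n, k, k, c_in, c_out); z: (h, w, n)."""
    n, k, _, c_in, c_out = W_group.shape
    patches = im2col(x, k)                               # (h, w, k*k*c_in)
    W_flat = W_group.reshape(n, k * k * c_in, c_out)     # same (a, b, c) order
    # Filter every patch with every kernel, then keep the selected result.
    y_all = np.einsum("hwp,npo->hwno", patches, W_flat)
    return np.einsum("hwno,hwn->hwo", y_all, z)

# Usage with random data and a random one-hot selection map.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(5, 6, 2))
group = rng.normal(size=(3, 3, 3, 2, 4))
idx = rng.integers(0, 3, size=(5, 6))
z_demo = np.eye(3)[idx]
y_demo = spatially_varying_conv_im2col(x_demo, group, z_demo)
```

Evaluating all n kernels costs n times the arithmetic of a single convolution but keeps the computation as two dense tensor contractions, which is consistent with the observation that the overhead depends mainly on the number of kernels in the group.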
[0076] The filtering unit may be deployed either as a standalone unit or as a part of a deep neural network.
[0077] The processor and method described herein are particularly advantageous for use in applications where runtime is of importance. The unit can be employed in conjunction with deep neural networks as a replacement for standard convolutional layers, enhancing the original architectures with spatially varying computations, which in turn may provide a considerable performance improvement.
[0078] This method therefore provides the very appealing advantage of spatial adaptivity; an important component that is absent from many standard convolutional units, such as those commonly found in CNNs.
[0079] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.