DEEP RECEPTIVE FIELD NETWORKS

20200226427 · 2020-07-16

Assignee

Inventors

CPC classification

International classification

Abstract

The invention provides a method for recognition of information in digital image data. The method comprises a learning phase on a data set of example digital images having known information, in which characteristics of categories are computed automatically from each example digital image and compared to its known category. The method comprises training a convolutional neural network comprising network parameters using said data set, in which, via deep learning, each layer of said convolutional neural network is represented by a linear decomposition of all filters as learned in that layer into basis functions.

Claims

1. A method for recognition of information in digital image data as disclosed herein.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0056] Embodiments of the invention will now be described, by way of example only, with reference to the accompanying schematic drawings, in which corresponding reference symbols indicate corresponding parts, showing in:

[0057] FIG. 1 State of the art convnet trained on random subsets of CIFAR-10;

[0058] FIG. 2 filters randomly sampled from all layers of the GoogLenet model, from left to right layer number increases;

[0059] FIG. 3 a representation of the method and device;

[0060] FIG. 4a, 4b both architectures trained on 300 randomly selected samples of MNIST and 300 randomly selected samples of CIFAR-10 on the bottom, and trained on the full training sets on the top;

[0061] FIG. 5: Computation time vs the size of the convolution filter. Note that RFNets depend mostly on the order of the function basis, and

[0062] FIG. 6 RFNet filters before (left) and after training (right) for 2 epochs on MNIST.

[0063] The drawings are not necessarily to scale.

DESCRIPTION OF PREFERRED EMBODIMENTS

[0064] Convolutional neural networks have large numbers of parameters to learn. This is their strength, as they can solve extremely complicated problems. At the same time, the large number of parameters is a limiting factor in terms of the time and the amount of data needed to train them (Krizhevsky et al., 2012; Coates et al., 2011). Regarding computation time, the GoogLenet architecture trains for up to 21 days on a million images in a thousand classes on top-notch GPUs to achieve its top-5 error. For limited data availability, the small experiment in FIG. 1 quantifies the loss in performance relative to an abundance of data. For many practical small-data problems, pretraining on a large general dataset is an alternative, or otherwise unsupervised pretraining on subsets of the data, but naturally training will be better when data of the same origin and the same difficulty are used. Therefore, the reduction of the effective number of free parameters is of considerable importance for the computation time and classification accuracy of low-data problems.

[0065] The recent review in Nature describes deep learning as a series of transformations of representations of the original data. We aim to use this definition in its most direct form for images. Images, as signals in general, are special in that they demonstrate spatial coherence: the correlation of the value of a pixel with the values in the pixel's neighbourhood almost everywhere. (Only at the side of steep edges does it remain undecided whether a pixel belongs to one side or to the other. The steepness of camera-recorded edges is limited by the bandwidth, as a consequence of which the steepest edges will not occur in practice.) When looking at the intermediate layers of convnets, the learned image filters are spatially coherent themselves, not only for the first layers but for all but the last, fully-connected layer, although there is nothing in the network itself which forces the filters into spatial coherence. See FIG. 2 for an illustration from the intermediate layers 1 to 5.

[0066] In FIG. 2, Filters are randomly sampled from all layers of the GoogLenet model (see Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014), from left to right layer depth increases. Without being forced to do so, the model exhibits spatial coherence (seen as smooth functions almost everywhere) after being trained on ImageNet. This behaviour reflects the spatial coherence in natural images. It supports the assumption that higher layer feature maps can be seen as sufficiently smooth representations themselves.

[0067] In higher layers the size of the coherent patches may be smaller; indeed, the layer still shows coherence. We take this observation and deep-learn the representation at each layer by a linear decomposition onto basis functions, as these are known to be a compact approximation to locally smooth functions. This local (Taylor or Hermite) functional expansion is the basis of our approach.

[0068] In the literature, an elegant approach to reduce model complexity has been proposed by Bruna et al. by the convolutional scattering network cascading Wavelet transform convolutions with nonlinearity and pooling operators. On various subsets of MNIST, they show that this approach results in an effective tool for small dataset classification. The approach computes a translation invariant image representation, stable to deformations, while avoiding information loss by recovering wavelet coefficients in successive layers yielding state-of-the-art results on handwritten digit and texture classification, as these datasets exhibit the described invariants.

[0069] However, the approach is also limited in that one has to keep almost all possible cascade paths (equivalent to all possible filter combinations) according to the model to achieve general invariance. Only if the invariance group which solves the problem at hand is known a priori can one hard-code the invariance network to reduce the feature dimensionality. This is effective when the problem and its invariances are known precisely, but for many image processing applications this is rarely the case. Moreover, the reference does not allow for infinite group invariances. In this work, we aim to devise an algorithm combining the best of both worlds: to be inspired by the use of a wavelet basis to achieve the low-data learning capacity of the scattering convolutional network, while still achieving the full learning capacity of the Convolutional Neural Network (CNN) approach without the need to specify the invariance classes a priori.

[0070] Other attempts to tackle the complicated and extensive training in convnets rely heavily on regularization and data augmentation, for example by dropout. The maxout networks (Goodfellow, 2013) leverage dropout by introducing a new activation function; the approach improved state-of-the-art results on several common vision benchmarks. Another perspective on reducing sample complexity has been taken by Gens and Domingos (2014) by introducing deep symmetry networks. These networks apply non-fixed pooling over arbitrary symmetry groups and have been shown to greatly reduce sample complexity compared to convnets on NORB and rotated MNIST digits when aggregated over the affine group. Also focussing on modelling invariants is the convolutional kernel network approach introduced by Mairal et al. (2014), which learns parameters of stacked kernels and achieves impressive classification results with fewer parameters to learn than a convnet. The many attempts to reduce model complexity, to reduce sample complexity, to regularize models more effectively, or to reduce training time of the convnet approach may all be implemented independently in our method as well. In the experiments we focus on the simplest comparison, that of the standard convnet with our standard receptive field net, without these enhancements on either side.

[0071] Convnets take an image, which in this case we consider to be a function f: R^2 → R, as their input. Each convolutional layer produces feature maps as outputs by subsequent convolution with a sampled spatial aperture w_ij, application of a pooling operator and a nonlinear activation function. However, natural images are usually the sampled version of an underlying smooth function which can be sufficiently described by a set of appropriate smooth basis functions (Koenderink, structure of images). The family of Gaussian derivatives is known to be such a family of functions.

[0072] We assume that not the simple functions expressed by the kernel's weights w_ij are crucial for building invariances, but rather that a learned combination of many such simple functions will exhibit the desired behaviour (LeCun Net with Bruna). For this reason we formulate the filter learning as a function approximation problem, which naturally introduces the Taylor expansion, as it can approximate any continuous function.

[0073] A convolution in the receptive field network (RFnet) uses a convolution kernel F(x, y). In a standard CNN the values F(x, y) for all pixels (x, y) in a small neighbourhood are learned (as shared weights). In an RFnet the kernel function is of the form F(x, y) = G_s(x, y)·f(x, y), where G_s is the Gaussian function that serves as an aperture defining the local neighbourhood. It has been shown in scale space theory (see Koenderink, SOI; Ter Haar Romeny, Book) that a Gaussian aperture leads to more robust results and that it is the aperture that does not introduce spurious details (like ringing artefacts in images). The function f is assumed to be a linear combination of basis functions:


f(x,y) = α_1·φ_1 + . . . + α_n·φ_n

[0074] Instead of learning function values F(x, y) (or f(x, y)), in an RFnet the weights α_i are learned. If we select a complete function basis, we can be confident that any function F can be learned by the network. There are several choices for a complete set of basis functions. The simplest perhaps are the monomial basis functions:

[00002] 1, x, y, (1/2!)·x^2, x·y, (1/2!)·y^2, (1/3!)·x^3, (1/2!)·x^2·y, (1/2!)·x·y^2, (1/3!)·y^3, . . .

[0075] These are the functions that appear in a Taylor series expansion of a function f, and thus we call this basis the Taylor basis. In this case we can view the problem of learning a filter in the RFnet as a function approximation problem, where we learn an approximation of the convolution kernel F(x, y). For illustration we restrict ourselves to a first-order Taylor polynomial, so the convolution kernel is:


F = G_s·g = G_s·f(0) + G_s·f_x(0)·x + G_s·f_y(0)·y (1)

[0076] Here G_s is the Gaussian aperture with a given standard deviation σ = s, and we are approximating an underlying function g(x, y). Now we define the basis B_m(x, y) and the to-be-learned parameters α_m as follows:


B_0 = G_s; B_1 = G_s·x; B_2 = G_s·y; α_0 = f(0); α_1 = f_x(0); α_2 = f_y(0) (2)

[0077] Including orders up to the n-th power in the Taylor expansion, it follows:


F = α_0·B_0 + α_1·B_1 + . . . + α_n·B_n (3)

[0078] Hence, the Taylor basis can locally synthesize arbitrary functions with a spatial accuracy of σ, where the bases are constant function-kernels and the function only depends on the parameters α_m. However, it is possible to choose multiple bases based on this approach. A closely related choice is the basis of the Hermite polynomials:


1, 2x, 2y, 4x^2 − 2, 4y^2 − 2, . . .

[0079] FIG. 3 shows one convolutional layer of the RFnet (not showing pooling and activation, which in an embodiment are standard in our case). To the left is the image I(x, y), or a feature map from a previous layer, which will be convolved with the filters in the first column. The first column displays the Hermite basis up to second order under the Gaussian aperture function; this is preprogrammed in every layer of the RFnet. The second column displays the effective filters F as created by α-weighted sums over the basis functions. Note that these filters are visualized here as effective combinations of the first column; they do not exist in the RFnet at any time. Note also that the basis functions can produce any desired number of different filters.
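This α-weighted synthesis of effective filters from a fixed, preprogrammed basis can be sketched in a few lines. The sketch below is illustrative only: the kernel size, σ, and the random α are arbitrary choices, and the monomials omit the factorial normalization.

```python
import numpy as np

def gaussian_aperture(size, sigma):
    """2D Gaussian window G_s used as the aperture."""
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum(), xx, yy

def taylor_basis(size, sigma, order=2):
    """Basis kernels B_m: monomials up to `order` under the Gaussian aperture."""
    g, xx, yy = gaussian_aperture(size, sigma)
    return np.stack([g * xx**i * yy**(n - i)
                     for n in range(order + 1) for i in range(n + 1)])

B = taylor_basis(size=7, sigma=1.5, order=2)            # 6 basis kernels of 7x7 pixels
alpha = np.random.default_rng(0).standard_normal(B.shape[0])  # stand-in for learned weights
F = np.tensordot(alpha, B, axes=1)                      # one effective 7x7 filter
```

Any number of effective filters follows from stacking more α-vectors; the basis itself stays fixed.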

[0080] It has been shown that any derivative of the Gaussian function, which is our aperture, can be written as the multiplication of the Gaussian function with a Hermite polynomial. Using the Hermite basis, we are thus using convolution kernels that are linear combinations of Gaussian derivatives. Both the Taylor basis and the Hermite basis are complete bases: any function F can be written as a linear combination of the basis functions. The mathematical identity requires the summation of an infinite number of basis functions; truncating the summation sequence at, say, m basis functions leaves us with an approximation of the arbitrary function F.
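The stated identity, that each Gaussian derivative equals a Hermite polynomial multiplied by the Gaussian, can be verified numerically. A minimal 1D sketch, where σ and the sampling grid are arbitrary choices:

```python
import numpy as np

sigma = 1.0
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
G = np.exp(-x**2 / (2.0 * sigma**2))   # Gaussian aperture (unnormalized)

# Numerical first derivative of the Gaussian
dG_num = np.gradient(G, dx)

# Analytic form: Gaussian derivative = Hermite polynomial times the Gaussian.
# With physicists' Hermite H1(t) = 2t and t = x / (sigma * sqrt(2)),
# dG/dx = -H1(t) / (sigma * sqrt(2)) * G = -(x / sigma**2) * G.
t = x / (sigma * np.sqrt(2.0))
dG_ana = -(2.0 * t) / (sigma * np.sqrt(2.0)) * G
```

Higher orders follow the same pattern with higher Hermite polynomials.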

[0081] Observe that the Taylor basis and the Hermite basis are completely equivalent from a mathematical point of view: any Hermite polynomial (up to order n) can be written as a linear combination of Taylor monomials (up to order n) and vice versa. Another basis often used to model the visual front-end are the Gabor functions, kernel functions that multiply the Gaussian aperture with sine/cosine functions of varying frequency. The interpretation of the learned α's changes from basis to basis: in the Taylor series they are the function's derivatives under an aperture, whereas in the Hermite basis they carry a slightly more complex meaning. Hence the exact form of the parameters α_m depends on the chosen parameterization of the basis functions. In this study we use the Hermite basis for the experiments below, as there is evidence that the receptive fields in the human visual brain can be modeled as linear combinations of Gaussian derivative functions. To show the properties of the RFnet, we use the Taylor basis and directly apply it experimentally to approximate natural image patches in the experiment below.

REFERENCES

[0082] Young, The Gaussian derivative model for spatial vision; Koenderink, SOI; Koenderink, Receptive Field Families; Bart ter Haar Romeny, Front-End Vision and Multi-Scale Image Analysis; Lillholm, Statistics and category systems for the shape index descriptor of local 2nd order natural image structure.

TABLE-US-00001
Algorithm 1 RFnet learning: updating the parameters α_ij^l between input map indexed by i and output map indexed by j of layer l in the mini-batch gradient descent framework.
1: Input: input feature maps o_i^(l−1) for each training sample (computed for the previous layer; o^(l−1) is the input image when l = 1), corresponding ground-truth labels {y_1, y_2, . . . , y_K}, the basis kernels {B_1, B_2, . . . , B_M}, previous parameters α_ij^l.
2: Compute the convolutions {Φ_1, Φ_2, . . . , Φ_M} of {o_i^(l−1)} with respect to the basis kernels {B_1, B_2, . . . , B_M}.
3: Obtain the output map o_j^l = α_ij1^l·Φ_1 + α_ij2^l·Φ_2 + . . . + α_ijM^l·Φ_M.
4: Compute δ_jn^l for each output neuron n of the output map o_j^l by equation (7).
5: Compute the derivative φ′(t_jn^l) of the activation function.
6: Compute the gradient ∂E/∂α_ij^l with respect to the weights α_ij^l using equation (7).
7: Update the parameters: α_ij^l = α_ij^l − r·(1/K)·Σ_(k=1..K) [∂E/∂α_ij^l]_k, where r is the learning rate.
8: Output: α_ij^l, the output feature maps o_j^l.

[0083] Convnets are typically trained with the backpropagation algorithm (see Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015, incorporated by reference). The gradient of the error function at the network's output is calculated with respect to all parameters in the network by applying the chain rule and doing a layer-wise backward pass through the whole network. For convolutional neural networks, the weights to be learned are the filter kernels. Traditionally, the filter kernels are randomly initialized and updated in a stochastic gradient descent manner. In our approach, the parameters of the convolutional layers are the parameters α of the Taylor approximators, as shown in the equation above. These Taylor approximators α are learned in a mini-batch gradient descent framework.

[0084] To solve the learning problem, we need to efficiently compute the derivative of the loss function with respect to the parameters α. Taking the derivative of the loss function E with respect to the parameters α is done by applying the chain rule:

[00005] ∂E/∂α_ij^l = Σ_n (∂E/∂o_jn^l)·(∂o_jn^l/∂t_jn^l)·(∂t_jn^l/∂α_ij^l) = Σ_n δ_jn^l·D_c (4)

[0085] Here E is the loss function, l denotes the current layer, i indexes the input feature map, j the output feature map, and n the neurons of the j-th feature map. α_ij^l are the parameters between the i-th input feature map and the j-th output feature map of layer l. o_jn^l is the n-th neural value of the j-th output feature map of layer l. t_jn^l is the output feature before the rectifier activation function φ is applied, so o_jn^l = φ(t_jn^l). In an embodiment, we use the rectifier function as widely used in deep neural networks. To solve equation (4), we split it into two parts, δ_jn^l and the derivative of the convolution function D_c. The first part δ_jn^l is trivial to solve if l is the last layer. For the inner layers, by applying the chain rule, δ_jn^l is:

[00006] δ_jn^l = (Σ_k Σ_q δ_kq^(l+1)·(α_ij1·B_1 + α_ij2·B_2 + . . . + α_ijM·B_M))·φ′(t_jn^l) (5)

[0086] Here, k is the feature-map index of layer l+1 and q is the neural index of feature map k on layer l+1. φ′(t_jn^l) is the derivative of the activation function; in our network, the rectifier function is used as the activation function. The second part of equation (4) depends only on the parameters α_ij. If o_i^(l−1) denotes the output feature map of layer l−1 (which is also the input feature of layer l), the second part of the equation can be calculated as:

[00007] D_c = ∂t_jn^l/∂α_ij = ∂[o_i^(l−1)*(α_ij1·B_1 + α_ij2·B_2 + . . . + α_ijM·B_M)]/∂α_ij = [o_i^(l−1)*B_1, o_i^(l−1)*B_2, . . . , o_i^(l−1)*B_M] (6)

[0087] B_m, where m is an element of {1, 2, 3, . . . , M}, denotes the irreducible basis functions of the Taylor approximators up to the order M. By substituting the two terms, we are able to calculate the derivative of the error with respect to all parameters in the network. The result is as follows:

[00008] ∂E/∂α_ij^l = Σ_n δ_jn^l·[o_i^(l−1)*B_1, o_i^(l−1)*B_2, . . . , o_i^(l−1)*B_M] (7)

with

δ_jn^l = (y − t)·φ′(t_jn^l) if l is the last layer, and
δ_jn^l = (Σ_k Σ_q δ_kq^(l+1)·(α_ij1·B_1 + α_ij2·B_2 + . . . + α_ijM·B_M))·φ′(t_jn^l) if l is an inner layer.

[0088] The algorithm shows how the parameters are updated.
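The gradient of equation (7) for a single layer can be sanity-checked numerically. The sketch below is illustrative: shapes, the random basis, and the target are arbitrary, and a smooth tanh activation stands in for the rectifier so the finite-difference comparison is well behaved.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.standard_normal((12, 12))                   # input map o_i^(l-1)
B = rng.standard_normal((4, 5, 5))                  # M = 4 fixed basis kernels B_m
alpha = rng.standard_normal(4)                      # parameters alpha_ij^l
y = rng.standard_normal((8, 8))                     # target output

# Basis responses I * B_m are computed once; they do not depend on alpha.
resp = [convolve2d(I, b, mode='valid') for b in B]

def loss(a):
    t = sum(am * r for am, r in zip(a, resp))       # pre-activation t_jn
    o = np.tanh(t)                                  # smooth stand-in for the rectifier
    return 0.5 * np.sum((o - y) ** 2), t, o

E, t, o = loss(alpha)
delta = (o - y) * (1.0 - np.tanh(t) ** 2)           # delta_jn = dE/dt_jn
grad = np.array([np.sum(delta * r) for r in resp])  # eq. (7): sum_n delta_jn * (I*B_m)_n

# Central finite-difference check of each dE/dalpha_m
eps = 1e-6
for m in range(len(alpha)):
    ap, am_ = alpha.copy(), alpha.copy()
    ap[m] += eps
    am_[m] -= eps
    num = (loss(ap)[0] - loss(am_)[0]) / (2.0 * eps)
    assert abs(num - grad[m]) < 1e-4 * (1.0 + abs(grad[m]))
```

The analytic gradient matches the finite-difference estimate, mirroring steps 4 to 6 of Algorithm 1 for one input/output map pair.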

[0089] Reducing Number of Convolutions

[0090] When training the network, we convolve our learned filters F(x, y) with an image I(x, y). We have derived the equation for our filter parameters above. Because convolution is a linear operator, we can first convolve I(x, y) with the bases B_m and only afterwards multiply with the learned parameters α_m. So we can rewrite as:

[00009] F(x, y)*I(x, y) = [α_0 . . . α_m]·[I(x, y)*B_0(x, y), . . . , I(x, y)*B_m(x, y)] (8)

[0091] The consequence is that by convolving the input of each layer with the basis filters (specified in number only by their order) and taking the α-weighted sum, we can effectively create responses to any number and shape of filters. This makes the RFnet largely independent of the number of filters present in each layer, as a summation over basis outputs is all that is needed. This decomposition of arbitrary filters is especially beneficial for large numbers of filters per layer and big filter sizes. Consider a single convnet layer with 128 input channels, 256 output channels and a filter size of 5×5 pixels: 32768 single 2D filters have to be convolved and 819200 parameters learned. When applying our receptive fields approach, we are able to generate the 32768 effective filter responses by convolving with the inputs only 1920 times, that is, 128 channels each convolved with 15 basis functions up to order four. We only have to learn 491520 parameters, that is, 15 basis functions times 128 input channels times 256 outputs. This number is only needed when the full basis set of filters up to the fourth order is in use, which is often not even necessary. For MNIST, for instance, a second-order basis suffices, which means 98304 parameters to learn for a layer of 128*256 filters. To conclude, our approach requires more than an order of magnitude fewer convolutions, with only roughly half down to an eighth of the number of parameters to learn, depending on the choice of basis. This is very promising, as the convolution operations have been found to be the bottleneck in fast training of convolutional networks.
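A brief numpy/scipy sketch, with random stand-ins for the image, the basis kernels, and the learned weights, confirms the factorization of equation (8): convolving once per basis kernel and weighting afterwards matches convolving with the assembled filter.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
I = rng.standard_normal((16, 16))     # input image or feature map
B = rng.standard_normal((6, 5, 5))    # fixed basis kernels B_m
alpha = rng.standard_normal(6)        # stand-in for learned weights alpha_m

# Left-hand side: assemble the effective filter first, then convolve once.
F = np.tensordot(alpha, B, axes=1)
lhs = convolve2d(I, F, mode='valid')

# Right-hand side: convolve with each basis kernel once, then take the
# alpha-weighted sum of the responses (eq. 8).
rhs = sum(a * convolve2d(I, b, mode='valid') for a, b in zip(alpha, B))

assert np.allclose(lhs, rhs)
```

In a layer, the basis responses are shared by all output channels, which is the source of the convolution savings counted above.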

[0092] Experiments

[0093] We present two experimental parts. The first focuses on a comparison between four different convolutional architectures on small datasets sampled from the MNIST dataset (see Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998) to validate the ability of the RFnet to achieve competitive results on two object classification benchmarks, while showing more stable results for small training set sizes compared to the other approaches. The dataset sizes are chosen according to M. Ranzato, F.-J. Huang, Y-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. CVPR, 2007.

[0094] In the second part, we demonstrate practical properties of the receptive fields approach. We approximate a natural image patch and benchmark the time benefit of the reduction in convolutions in an RFnet layer, compared to a classical convolutional network layer. All experiments were conducted in Theano. In the test phase, we always test on all examples of the test set and report the average error. The second dataset is the CIFAR-10 benchmark, to demonstrate classification on natural color images.

[0095] Image Classification with Small Sample Sizes

[0096] In this part we compare four convolutional architectures: i) Our own RFnet, with identical setup as our own implementation of ii) a published convnet architecture (Zeiler, Hinton), iii) the best published convnet results on MNIST with small training set size without data augmentation (Lecun) and iv) the convolutional scattering network approach that excels on MNIST, as an example of a predefined convnet. To also show that the RFnet can handle natural color images, we further compare it with our convnet implementation on the CIFAR-10 benchmark with the same approach and show the differences in training for various training set sizes.

[0097] The convnet implementation and experimental setup are chosen according to Zeiler et al. and are identical for our own implementations of the RFnet and the convnet. The convnet and the RFnet consist of 3 convolutional layers, where the last convolutional layer is fully connected to the 10 softmax outputs. After each convolutional layer, max pooling with a kernel size of 3×3 and a stride of 2 is applied, subsequently followed by local response normalization and a rectified linear unit activation function. Each convolutional layer consists of 64 filter kernels, with a size of 5×5 pixels each, 7×7 respectively for the RFnet to compensate for the degrading aperture. On all weights of the last layer a dropout of 0.5 was applied, and on all weights of the convolutional layers a dropout of 0.2. We calculate the cross-entropy loss as error function, and the network is trained for 280 epochs with Adadelta. The batch size for the convnet is 100 and the learning rate 1.0 at the beginning of training, linearly decreasing each epoch until it reaches 0.01 after 280 epochs. As the RFnet's parameters are of a different nature, we changed two parameters: we chose a batch size of 50 and a learning rate of 5.0, linearly decreasing to 0.05 after 280 epochs. The number of Hermite basis functions is 6 for all layers on MNIST; for CIFAR-10 we chose a basis of 10 in the first layer and a basis of 6 for all other layers. The architectures were trained on CIFAR-10 and MNIST with various training set sizes. The obtained results of the convnet are in line with the results reported in the original reference (M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. ICLR, 2013).

[0098] FIGS. 4a and 4b show both architectures trained on 300 randomly selected samples of MNIST and 300 randomly selected samples of CIFAR-10 on the bottom, and trained on the full training sets on the top. The RFnet is much more stable and directly starts to converge for small training set sizes on both datasets. In comparison, the convnet remains at random predictions for a number of epochs before it starts to converge. Furthermore, the final predictions are more accurate for the RFnet for small sample sizes. Training on the full set converges somewhat more slowly for the RFnet compared to the convnet. However, the final accuracy after 280 epochs is very similar: slightly worse for MNIST and slightly better for CIFAR-10.

TABLE-US-00002
Train Samples   RFNet   CNN A   CNN B   Scattering
300             4.33    8.10    7.18    4.70
1000            2.28    3.69    3.21    2.30
2000            1.57    2.06    2.53    1.30
5000            0.93    1.32    1.52    1.03
10000           0.81    0.83    0.85    0.88
20000           0.69    0.69    0.76    0.58
40000           0.47    0.47    0.65    0.53
60000           0.47    0.42    0.53    0.43

[0099] FIG. 6 illustrates the filters before and after training on MNIST.

[0100] Practical Properties of the Receptive Fields Network

[0101] The second part focuses on the properties of the receptive fields, namely their expressiveness and computational efficiency. We benchmark a standard convolutional layer for a forward and backward pass against an RFnet layer with the same number of feature maps, varying the size as well as the number of convolution kernels as indicated. Furthermore, we approximate natural image patches with a Taylor basis, to directly illustrate the expressiveness of our approach.

[0102] For the timing experiment we use 96×96 input images with 32 input channels, convolved with 64 filters, where we vary the kernel size up to the size beyond which the Gaussian aperture tends to zero. To best show our theoretical improvement, we measure computation time on the CPU. In FIG. 5 we illustrate that our approach is less sensitive to the filter kernel size: in an RFnet, the number of convolutions depends on the order of the function basis, not on the number of pixels in a particular filter.

[0103] FIG. 6 shows RFNet filters before (left) and after training (right) for 2 epochs on MNIST. Note how our filters adapt to the task, and exhibit smooth contours already after two epochs of training.

[0104] It will also be clear that the above description and drawings are included to illustrate some embodiments of the invention, and not to limit the scope of protection. Starting from this disclosure, many more embodiments will be evident to a skilled person. These embodiments are within the scope of protection and the essence of this invention and are obvious combinations of prior art techniques and the disclosure of this patent.

EXAMPLES

[0105] 1. A method for recognition of information in digital image data, said method comprising a learning phase on a data set of example digital images having known information, and computing characteristics of categories automatically from each example digital image and comparing the computed characteristics to their known category, said method comprising, in said learning phase, training a convolutional neural network comprising network parameters using said data set, in which in said learning phase, via deep learning, each layer of said convolutional neural network is represented by a linear decomposition into basis functions of all filters as learned in each layer.

[0106] 2. The method of example 1, wherein the network parameters of the convolutional layers are parameters expressing the weights of each member in a set of basis functions selected from a Taylor expansion and a Hermite expansion, for providing approximators for a local image structure by adapting said network parameters during training.

[0107] 3. The method of any one of the preceding examples, further comprising preprogramming the basis functions in the network as Gaussian-shaped filters to decompose the filters.

[0108] 4. The method of any one of the preceding examples, comprising using a receptive field network (RFnet) including a convolution kernel F(x, y) of the form F(x, y) = G_s(x, y)·f(x, y), where G_s is a Gaussian function that serves as an aperture defining the local neighborhood.

[0109] 5. The method of any one of the preceding examples, wherein a set of monomial basis functions:

[00010] 1, x, y, (1/2!)·x^2, x·y, (1/2!)·y^2, (1/3!)·x^3, (1/2!)·x^2·y, (1/2!)·x·y^2, (1/3!)·y^3, . . .

are used for learning the function values, or functionally simpler functions that turn up in a Taylor series expansion of the function f.

[0110] 6. The method of any one of the preceding examples, wherein the to-be-learned parameters α_m are as follows:


B_0 = G_s; B_1 = G_s·x; B_2 = G_s·y; α_0 = f(0); α_1 = f_x(0); α_2 = f_y(0)

where the derivatives of f at that position, with the index indicating their order, are measured by the Gaussian filter of the same derivative order.

[0111] 7. A method for recognition of categorical information from digital image data, said method comprising providing a trained neural network, trained using the method of any one of the preceding examples.

[0112] 8. A device for recognition of categorical information from digital image data, comprising a computer system comprising a computer program product which, when running on said computer system, applies a trained neural network derived according to the method of any one of the preceding examples.

[0113] 9. A computer program product which, when running on a data processor, performs the method of any one of the preceding examples.

[0114] 10. A method for recognition of information in digital image data, said method comprising deriving a convolutional neural network architecture based on a receptive field filter family as a basis to approximate arbitrary functions representing images by at least one selected from a Taylor expansion and a Hermite functional expansion.

[0115] 11. A computer program product for classification of data having local coherence, in particular spatial coherence, for instance data selected from images, time series, and speech data, said computer program product comprising a deep receptive field network comprising a filter kernel comprising a linear combination of basis functions.

[0116] 12. A computer program product for classification of data having local coherence, in particular spatial coherence, for instance data selected from images, time series, and speech data, said computer program product comprising a deep convolutional neural network comprising receptive field functions, wherein said receptive field functions comprise a linear combination of functionally complete basis functions.

[0117] 13. The computer program product of example 11 or 12, wherein said neural network comprises weights that are learnt using a sample dataset, in particular wherein said weights are learned for a whole patch at once.

[0118] 14. The computer program product of example 11, 12 or 13, wherein said neural network comprises a or said kernel that is a linear combination of basis functions:


F(x,y) = α_1·φ_1 + . . . + α_n·φ_n

[0119] wherein in particular φ_i is a complete set of basis functions, and the parameters of the convolutional layers are the parameters α.