Device and a method for image classification using a convolutional neural network

10832097 · 2020-11-10

Abstract

A device for image classification comprising a convolutional neural network configured to generate a plurality of probability values, each probability value being linked to a respective one of a plurality of predetermined classes and indicating the probability that the image or a pixel of the image is associated with the respective class, and the convolutional neural network comprises a plurality of convolutional blocks and each of the convolutional blocks comprises: a first convolutional layer configured to perform a pointwise convolution using a first kernel, a second convolutional layer configured to perform a depthwise convolution using a second kernel, wherein the second kernel has one of a single row and a single column, a third convolutional layer configured to perform a depthwise convolution using a third kernel, wherein the third kernel has a single column if the second kernel has a single row, and the third kernel has a single row if the second kernel has a single column, and a fourth convolutional layer configured to perform a convolution using a fourth kernel.

Claims

1. A device for image classification comprising a processor configured to: receive an image captured by a camera, the image comprising a plurality of pixels; execute a convolutional neural network configured to generate a plurality of probability values, each probability value being linked to a respective one of a plurality of predetermined classes and indicating the probability that the image or a pixel of the image is associated with the respective class, wherein the convolutional neural network comprises a plurality of convolutional blocks and each of the convolutional blocks comprises: a first convolutional layer configured to perform a pointwise convolution using a first kernel; a second convolutional layer configured to perform a depthwise convolution using a second kernel, wherein the second kernel has one of a single row and a single column; a third convolutional layer configured to perform a depthwise convolution using a third kernel, wherein the third kernel has a single column if the second kernel has a single row, and the third kernel has a single row if the second kernel has a single column; and a fourth convolutional layer configured to perform a convolution using a fourth kernel; and generate an output image based on the plurality of probability values in response to classifying the image based on the plurality of probability values.

2. The device as claimed in claim 1, wherein the number of columns of the second kernel is three and the number of rows of the third kernel is three if the second kernel has a single row, and wherein the number of rows of the second kernel is three and the number of columns of the third kernel is three if the second kernel has a single column.

3. The device as claimed in claim 1, wherein the second convolutional layer is configured to perform a pooling operation, in particular a one-dimensional pooling operation, after the depthwise convolution.

4. The device as claimed in claim 1, wherein the fourth convolutional layer is configured to apply the fourth kernel in strides to pixels of an input of the fourth convolutional layer.

5. The device as claimed in claim 1, wherein the number of output channels of the fourth convolutional layer is twice the number of the output channels of the first convolutional layer.

6. The device as claimed in claim 1, wherein the second kernel has a single row, the second convolutional layer is configured to perform a pooling operation after the depthwise convolution, and the fourth kernel has two rows and a single column, or wherein the second kernel has a single column, the second convolutional layer is configured to perform a pooling operation after the depthwise convolution, and the fourth kernel has a single row and two columns, or wherein the second convolutional layer is configured not to perform a pooling operation after the depthwise convolution and the convolution performed by the fourth convolutional layer is a pointwise convolution.

7. The device as claimed in claim 1, wherein at least one of: the first convolutional layer, the second convolutional layer, the third convolutional layer, and the fourth convolutional layer is configured to perform a rectification after the convolution.

8. The device as claimed in claim 1, wherein the second convolutional layer is configured to perform the depthwise convolution using at least two different second kernels, each of the at least two different second kernels having one of a single row and a single column, and wherein the third convolutional layer is configured to perform the depthwise convolution using at least two different third kernels, each of the at least two different third kernels having one of a single row and a single column.

9. The device as claimed in claim 1, wherein at least one of: the number of output channels of the first convolutional layer is the number of input channels of the first convolutional layer multiplied with a factor; and at least one of the second convolutional layer and the third convolutional layer are configured to add a bias term after the depthwise convolution.

10. A system for image classification, the system comprising: a camera configured to capture an image; and a device comprising a convolutional neural network configured to generate a plurality of probability values, each probability value being linked to a respective one of a plurality of predetermined classes and indicating the probability that the image or a pixel of the image is associated with the respective class, wherein the convolutional neural network comprises a plurality of convolutional blocks and each of the convolutional blocks comprises: a first convolutional layer configured to perform a pointwise convolution using a first kernel; a second convolutional layer configured to perform a depthwise convolution using a second kernel, wherein the second kernel has one of a single row and a single column; a third convolutional layer configured to perform a depthwise convolution using a third kernel, wherein the third kernel has a single column if the second kernel has a single row, and the third kernel has a single row if the second kernel has a single column; and a fourth convolutional layer configured to perform a convolution using a fourth kernel; and wherein the device further comprises a processor configured to generate an output image based on the plurality of probability values in response to classifying the image based on the plurality of probability values.

11. A method for classifying images, the method comprising: receiving an image captured by a camera, the image comprising a plurality of pixels; using a convolutional neural network to generate a plurality of probability values, each probability value being linked to a respective one of a plurality of predetermined classes and indicating the probability that the image or a pixel of the image is associated with the respective class, wherein the convolutional neural network comprises a plurality of convolutional blocks and each of the convolutional blocks comprises: a first convolutional layer performing a pointwise convolution using a first kernel, a second convolutional layer performing a depthwise convolution using a second kernel, wherein the second kernel has one of a single row and a single column, a third convolutional layer performing a depthwise convolution using a third kernel, wherein the third kernel has a single column if the second kernel has a single row, and the third kernel has a single row if the second kernel has a single column, and a fourth convolutional layer performing a convolution using a fourth kernel; generating an output image based on the plurality of probability values in response to classifying the image based on the plurality of probability values.

12. The method as claimed in claim 11, wherein the number of columns of the second kernel is 3 and the number of rows of the third kernel is 3 if the second kernel has a single row, and wherein the number of rows of the second kernel is 3 and the number of columns of the third kernel is 3 if the second kernel has a single column.

13. The method as claimed in claim 11, wherein the second convolutional layer performs a pooling operation.

14. The method as claimed in claim 13, wherein the pooling operation is a one-dimensional pooling operation, after the depthwise convolution.

15. The method as claimed in claim 11, wherein the fourth convolutional layer applies the fourth kernel in strides to pixels of an input of the fourth convolutional layer, and wherein the number of output channels of the fourth convolutional layer is twice the number of the output channels of the first convolutional layer.

16. The method as claimed in claim 11, wherein the second kernel has a single row, the second convolutional layer performs a pooling operation after the depthwise convolution, and the fourth kernel has two rows and a single column, or wherein the second kernel has a single column, the second convolutional layer performs a pooling operation after the depthwise convolution, and the fourth kernel has a single row and two columns, or wherein the second convolutional layer does not perform a pooling operation after the depthwise convolution and the convolution performed by the fourth convolutional layer is a pointwise convolution.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) The invention will be described in more detail in the following in an exemplary manner with reference to an embodiment and to the drawings. There are shown in these:

(2) FIG. 1 shows a schematic representation of an exemplary embodiment of a system for image classification;

(3) FIG. 2A shows a schematic representation of a conventional convolution;

(4) FIG. 2B shows a schematic representation of a depthwise convolution;

(5) FIG. 2C shows a schematic representation of a pointwise convolution; and

(6) FIG. 3 shows a comparison of an exemplary embodiment of the invention with other network architectures.

DETAILED DESCRIPTION

(7) Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the various described embodiments. However, it will be apparent to one of ordinary skill in the art that the various described embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.

(8) "One or more" includes a function being performed by one element, a function being performed by more than one element, e.g., in a distributed fashion, several functions being performed by one element, several functions being performed by several elements, or any combination of the above.

(9) It will also be understood that, although the terms first, second, etc. are, in some instances, used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first contact could be termed a second contact, and, similarly, a second contact could be termed a first contact, without departing from the scope of the various described embodiments. The first contact and the second contact are both contacts, but they are not necessarily the same contact.

(10) The terminology used in the description of the various described embodiments herein is for describing embodiments only and is not intended to be limiting. As used in the description of the various described embodiments and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses all possible combinations of one or more of the associated listed items. It will be further understood that the terms "includes", "including", "comprises", and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

(11) As used herein, the term "if" is, optionally, construed to mean "when" or "upon" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" is, optionally, construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]", depending on the context.

(12) FIG. 1 schematically illustrates a system 10 for image classification. The system 10 includes a camera 11 and a device 12.

(13) The camera 11 is mounted on a vehicle and captures images 13 of the area in front of the vehicle. Alternatively, the camera 11 may be directed to an area in the rear and/or at a side of the vehicle.

(14) The images 13 captured by the camera 11 are fed to the device 12. The device 12 performs a method for classifying the images 13 and generates output images 14 which contain probability values for each of the images 13 and, in particular, for each pixel of the images 13. Each probability value indicates the probability that the respective image 13 or pixel is associated with one of a plurality of predetermined classes.

(15) For classifying the images 13 the device 12 includes a convolutional neural network 19 that consists of several convolutional layers and performs multiple convolution and pooling operations as explained below. The different convolutional layers are trained to detect different patterns in the images 13. The final layer of the convolutional neural network 19 outputs individual probability values that are assigned to each image 13 and, in particular, to each pixel. Each probability value indicates the probability that the respective image 13 (or pixel) is associated with one of a plurality of predetermined classes (or object categories). The classes divide the objects shown in the images 13 in different categories that can be typically found in road scenes. For example, there can be a class for vehicles, another class for pedestrians, another class for roads, another class for buildings etc. The probability value of a given image 13 (or pixel) for one of the classes then indicates the probability that the respective image 13 (or pixel) shows an object from this particular class.

(16) In one example, there are the following classes: vehicle, pedestrian, road and building. The probability values output by the final layer of the convolutional neural network 19 for one of the images 13 are, for example, 0.1, 0.1, 0.6 and 0.2 indicating the probability that this particular image 13 shows a vehicle, a pedestrian, a road and a building, respectively.
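As an illustration only, the probability values of this example can be turned into a single class decision by selecting the largest value. The Python sketch below assumes that rule (an argmax); the names `CLASSES` and `classify` are hypothetical, and the description itself does not prescribe how a final decision is derived from the probability values.

```python
# Hypothetical helper: pick the class with the highest probability value.
# The argmax rule is an assumption made for illustration.
CLASSES = ["vehicle", "pedestrian", "road", "building"]


def classify(probabilities):
    """Return (class name, probability) for the largest probability value."""
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability value per class")
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best], probabilities[best]


# Probability values taken from the example in the text.
label, p = classify([0.1, 0.1, 0.6, 0.2])
print(label, p)
```

With the values from the example, the selected class is the road class, since 0.6 is the largest of the four probability values.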

(17) The device 12, the system 10 and the method performed by the device 12 are exemplary embodiments according to the first, second and third aspect of the application, respectively.

(18) When examining the applicability of documents D1 and D2, reduced accuracies were exhibited. Further examination of the relevant architectures revealed that this shortcoming results from bottlenecks in data flow through the convolutional neural network itself. Both algorithms disclosed in documents D1 and D2 force large compression factors inside the convolutional blocks. This kind of compression makes the network lose information too quickly thus harming performance when there is little to no redundancy in the information, i.e., in a small network. Furthermore, both documents D1 and D2 choose to not address the first layer of the network, leaving it rather computationally expensive.

(19) The unified general convolution block structure as explained in the following can safely replace the more expensive convolution layers disclosed in documents D1 and D2 without the risk of losing accuracy.

(20) The convolutional neural network 19 of the device 12 illustrated in FIG. 1 includes a plurality of convolutional blocks 20.sub.i. In the exemplary embodiment illustrated in FIG. 1, the convolutional neural network 19 includes three convolutional blocks denoted by the reference numerals 20.sub.1, 20.sub.2 and 20.sub.3. The convolutional blocks 20.sub.1, 20.sub.2 and 20.sub.3 are arranged consecutively.

(21) The convolutional blocks 20.sub.1, 20.sub.2 and 20.sub.3 have the same structure and each of them includes four consecutively arranged convolutional layers: a first convolutional layer 21.sub.i, a second convolutional layer 22.sub.i, a third convolutional layer 23.sub.i and a fourth convolutional layer 24.sub.i (with i=1, 2, 3).

(22) The image 13 captured by the camera 11 includes a plurality of pixels arranged in an array. The image 13 may, for example, be an RGB image with channels for red, green and blue, a greyscale image, a grey and red image or any other suitable image.

(23) The image 13 is fed to the first convolutional layer 21.sub.1 of the convolutional block 20.sub.1, which is the input layer of the convolutional neural network 19. All other convolutional layers of the convolutional blocks 20.sub.1, 20.sub.2 and 20.sub.3 are hidden layers.

(24) Each of the convolutional layers applies a convolutional operation to its input, whereby an input volume having three dimensions (width, height and depth) is transformed into a 3D output volume of neuron activations. The output is passed to the next convolutional layer. The fourth convolutional layer 24.sub.i of the convolutional block 20.sub.i passes its output to the first convolutional layer 21.sub.i+1 of the next convolutional block 20.sub.i+1 (for i=1, 2). The fourth convolutional layer 24.sub.3 of the last convolutional block, which is the convolutional block 20.sub.3 in the current example, passes its result to fully connected layers 25 arranged after the last convolutional block.

(25) Each of the four convolutional layers 21.sub.i, 22.sub.i, 23.sub.i and 24.sub.i includes a respective kernel which slides over the input signal of the respective convolutional layer, wherein a convolution is performed between the kernel and the input signal.

(26) Modern convolutional neural networks apply a pooling operator for subsampling. The pooling operator, in its common configuration, takes a patch of a predetermined size from the image, for example a 2×2 patch, and returns a single number which is the maximal element of the patch. This kind of operator is biologically motivated and holds many benefits. However, it also significantly reduces the amount of information within the network and forces a data compression. As each displacement of the pooling kernel commonly takes four numbers and outputs just a single number, each pooling layer compresses the data by a factor of four, thus creating a bottleneck. The common method for coping with the pooling-induced bottlenecks is to double the number of channels (kernels) right before the pooling. By doing so the network loses only about half of its data, which has been shown to be a healthy compression.
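The compression argument can be checked with a few lines of Python. The sketch below is only an illustration: the helper names `max_pool_2x2` and `num_floats` and the 32×32×64 layer size are assumptions chosen for the example, not values taken from the description.

```python
def max_pool_2x2(channel):
    """2x2 max-pooling with a step of two on one channel (list of rows).

    Assumes an even number of rows and columns.
    """
    rows, cols = len(channel), len(channel[0])
    return [
        [max(channel[r][c], channel[r][c + 1],
             channel[r + 1][c], channel[r + 1][c + 1])
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def num_floats(height, width, channels):
    """Number of values held by a height x width x channels volume."""
    return height * width * channels


channel = [
    [1, 5, 2, 0],
    [3, 4, 7, 1],
    [0, 2, 9, 8],
    [6, 1, 3, 4],
]
pooled = max_pool_2x2(channel)  # four inputs per output value

before = num_floats(32, 32, 64)            # illustrative layer size
after_pool_only = num_floats(16, 16, 64)   # pooling alone
after_doubling = num_floats(16, 16, 128)   # channels doubled before pooling
print(pooled, before // after_pool_only, before // after_doubling)
```

Pooling alone reduces the illustrative volume by a factor of four, whereas doubling the channels beforehand leaves a factor of two.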

(27) In the unified general convolution block structure proposed herein the number of channels is only compressed to half of the incoming amount in the first part of the convolutional block. For example, if one of the convolutional blocks takes 16 input channels, the first part of the convolutional block compresses them to 8 channels.

(28) Further, to lessen the aggressiveness of the pooling operator while keeping a highly efficient architecture, a dimension-wise separation of the pooling operator is proposed. Instead of pooling from a 2×2 patch, we pool from a 1×2 patch (or a 2×1 patch) and afterwards pool from a 2×1 patch (or a 1×2 patch).
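The dimension-wise separation can be sketched in Python as follows. This is a minimal illustration with hypothetical helper names (`max_pool_1x2`, `max_pool_2x1`); it assumes a step of two and even input sizes. It shows that each one-dimensional step only halves the data, while the two steps together cover the same 2×2 patches.

```python
def max_pool_1x2(channel):
    """1-D max-pooling along each row: compares two horizontal neighbours."""
    return [[max(row[c], row[c + 1]) for c in range(0, len(row), 2)]
            for row in channel]


def max_pool_2x1(channel):
    """1-D max-pooling along each column: compares two vertical neighbours."""
    return [[max(channel[r][c], channel[r + 1][c])
             for c in range(len(channel[0]))]
            for r in range(0, len(channel), 2)]


channel = [
    [1, 5, 2, 0],
    [3, 4, 7, 1],
    [0, 2, 9, 8],
    [6, 1, 3, 4],
]
after_rows = max_pool_1x2(channel)      # 4 rows x 2 columns: data halved
after_both = max_pool_2x1(after_rows)   # 2 rows x 2 columns: halved again
print(after_rows)
print(after_both)
```

Because the maximum of four values equals the maximum of the two pairwise maxima, the combined result matches a direct 2×2 max-pooling of the same input; the benefit of the separation is that other layers can be placed between the two halving steps.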

(29) In addition, instead of convolving all input channels together with a 3×3 kernel, each channel is only convolved with itself, first with a 3×1 kernel (or a 1×3 kernel) and then with a 1×3 kernel (or a 3×1 kernel), and a 1×1 kernel is used which convolves the input channels.

(30) In a first variant, the convolutional blocks 20.sub.i (with i=1, 2, 3) have the following structure:

(31) first convolutional layer 21.sub.i: 1×1×ch/2;

(32) second convolutional layer 22.sub.i: dw 1×3 + 1d mp;

(33) third convolutional layer 23.sub.i: dw 3×1; and

(34) fourth convolutional layer 24.sub.i: 2×1×ch + 1d stride.

(35) The first convolutional layer 21.sub.i performs a pointwise convolution (which is not a depthwise, but a regular convolution) using a first kernel with a size of 1×1, i.e., the first kernel has a single row and a single column. The number of output channels of the first convolutional layer 21.sub.i is ch/2.

(36) The second convolutional layer 22.sub.i performs a depthwise convolution (dw) using a second kernel, wherein the second kernel has a size of 1×3, i.e., the second kernel has a single row and three columns. The depthwise convolution is followed by a max-pooling operation (mp). The pooling operation can be a one-dimensional pooling operation along a row. For example, if the pooling operator (or pooling kernel) has the size of 1×2 and is configured to output the maximum value, the pooling operator compares two input values that are arranged next to each other in a row and outputs the larger value.

(37) The third convolutional layer 23.sub.i performs a depthwise convolution (dw) using a third kernel, wherein the third kernel has a size of 3×1, i.e., the third kernel has three rows and a single column.

(38) The fourth convolutional layer 24.sub.i performs a convolution (which is not a depthwise, but a regular convolution) using a fourth kernel with a size of 2×1, i.e., the fourth kernel has two rows and a single column. The number of output channels of the fourth convolutional layer 24.sub.i is ch, which means that the number of output channels of the fourth convolutional layer 24.sub.i is twice the number of the output channels of first convolutional layer 21.sub.i.

(39) Further, the fourth convolutional layer 24.sub.i applies the fourth kernel in strides to the pixels of the input of the fourth convolutional layer 24.sub.i, for example, in strides of two. The stride of the fourth kernel can be one-dimensional. The fourth kernel then convolves around the input by shifting a number of units along a column at a time, but not along a row.
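The shapes produced by one block of the first variant can be traced with the short Python sketch below. It is an illustration under stated assumptions: the function name `first_variant_block` is hypothetical, the depthwise convolutions are assumed to be padded so that they keep the spatial size, and the one-dimensional pooling and the one-dimensional stride are both assumed to use a step of two. With a 32×32 input and ch = 64 the traced shapes agree with the first four rows of Table 4.

```python
def first_variant_block(height, width, ch):
    """Trace (layer, (rows, columns, channels)) through one block.

    Assumptions: depthwise convolutions keep the spatial size,
    pooling and stride both use a step of two.
    """
    half = ch // 2
    shapes = []
    # First layer: pointwise convolution with ch/2 output channels.
    shapes.append(("1x1xch/2", (height, width, half)))
    # Second layer: depthwise 1x3, then 1-D max-pooling along each row,
    # which halves the number of columns.
    width //= 2
    shapes.append(("dw 1x3 + 1d mp", (height, width, half)))
    # Third layer: depthwise 3x1, shape unchanged.
    shapes.append(("dw 3x1", (height, width, half)))
    # Fourth layer: regular 2x1 convolution with ch output channels,
    # applied in strides of two along each column (halves the rows).
    height //= 2
    shapes.append(("2x1xch + 1d stride", (height, width, ch)))
    return shapes


for name, (h, w, c) in first_variant_block(32, 32, 64):
    print(f"{name:20s} {h}x{w}x{c}  floats={h * w * c}")
```

Under these assumptions no single layer of the block reduces the number of values by more than a factor of two.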

(40) In a second variant, the convolutional blocks 20.sub.i (with i=1, 2, 3) have the following structure:

(41) first convolutional layer 21.sub.i: 1×1×ch/2;

(42) second convolutional layer 22.sub.i: dw 3×1 + 1d mp;

(43) third convolutional layer 23.sub.i: dw 1×3; and

(44) fourth convolutional layer 24.sub.i: 1×2×ch + 1d stride.

(45) The structure of the convolutional blocks 20.sub.i in the second variant is similar to the structure of the convolutional blocks 20.sub.i in the first variant. The difference is that the second convolutional layer 22.sub.i of the second variant performs a depthwise convolution using a second kernel having a size of 3×1, i.e., the second kernel has three rows and a single column. Further, the pooling operation can be a one-dimensional pooling operation along a column. For example, if the pooling operator (or pooling kernel) has the size of 2×1 and is configured to output the maximum value, the pooling operator compares two input values that are arranged next to each other in a column and outputs the larger value.

(46) Further, the third convolutional layer 23.sub.i of the second variant performs a depthwise convolution using a third kernel, wherein the third kernel has a size of 1×3, i.e., the third kernel has a single row and three columns.

(47) The fourth convolutional layer 24.sub.i of the second variant performs a convolution using a fourth kernel with a size of 1×2, i.e., the fourth kernel has a single row and two columns.

(48) Further, the fourth convolutional layer 24.sub.i applies the fourth kernel in strides to the pixels of the input of the fourth convolutional layer 24.sub.i, for example, in strides of two. The stride of the fourth kernel can be one-dimensional. The fourth kernel then convolves around the input by shifting a number of units along a row at a time, but not along a column.

(49) In a third variant, the convolutional blocks 20.sub.i (with i=1, 2, 3) have the following structure:

(50) first convolutional layer 21.sub.i: 1×1×ch/2;

(51) second convolutional layer 22.sub.i: dw 3×1;

(52) third convolutional layer 23.sub.i: dw 1×3; and

(53) fourth convolutional layer 24.sub.i: 1×1×ch.

(54) The structure of the convolutional blocks 20.sub.i in the third variant is similar to the structure of the convolutional blocks 20.sub.i in the first variant. The difference is that in the third variant no max-pooling is performed in the second convolutional layer 22.sub.i and the fourth convolutional layer 24.sub.i performs a pointwise convolution using a fourth kernel with a size of 1×1.

(55) In addition, the fourth convolutional layer 24.sub.i applies the fourth kernel in strides of one to the pixels of the input of the fourth convolutional layer 24.sub.i.

(56) In the following, the idea behind the depthwise convolution is explained as can also be found at the following internet address: http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/.

(57) A conventional (or regular) convolution layer applies a kernel to all of the channels of the input image. It slides this kernel across the image and at each step performs a weighted sum of the input pixels covered by the kernel across all input channels. The convolution operation thereby combines the values of all the input channels. If an image as shown in FIG. 2A has 3 input channels, then running a single convolution kernel across this image results in an output image with only 1 channel per pixel.

(58) This means that for each input pixel, no matter how many channels it has, the convolution writes a new output pixel with only a single channel. In practice many convolution kernels run across the input image. Each kernel gets its own channel in the output.

(59) A depthwise convolution works as shown in FIG. 2B. Unlike a conventional convolution it does not combine the input channels but it performs a convolution on each input channel separately. For an image with 3 input channels, a depthwise convolution creates an output image that also has 3 channels. Each channel gets its own set of weights. The purpose of the depthwise convolution is to filter the input channels.
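A depthwise convolution with a single-row kernel can be written out in plain Python as a minimal sketch. The function name `depthwise_conv_1x3` is hypothetical, zero padding is assumed so that the number of columns is preserved, and the kernel is applied without flipping (cross-correlation), which is the usual convention in neural network libraries.

```python
def depthwise_conv_1x3(image, kernels):
    """Depthwise convolution with one 1x3 kernel per channel.

    image:   list of channels, each a list of rows (channels x rows x cols)
    kernels: one 3-element kernel per channel
    Zero padding keeps the number of columns unchanged (an assumption).
    """
    if len(image) != len(kernels):
        raise ValueError("need exactly one kernel per input channel")
    output = []
    for channel, k in zip(image, kernels):
        out_channel = []
        for row in channel:
            padded = [0] + list(row) + [0]
            out_channel.append([
                k[0] * padded[c] + k[1] * padded[c + 1] + k[2] * padded[c + 2]
                for c in range(len(row))
            ])
        output.append(out_channel)
    return output


image = [
    [[1, 2, 3]],  # channel 0 (one row, three columns)
    [[4, 5, 6]],  # channel 1
    [[7, 8, 9]],  # channel 2
]
# Each channel has its own weights: shift right, identity, shift left.
kernels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
result = depthwise_conv_1x3(image, kernels)
print(result)
```

The output has the same number of channels as the input, and no value of one channel ever contributes to another channel.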

(60) The depthwise convolution can be followed by a pointwise convolution as shown in FIG. 2C. The pointwise convolution is the same as a regular convolution but with a 1×1 kernel. The pointwise convolution adds up all the channels as a weighted sum. As with a regular convolution, many of these pointwise kernels are usually stacked together to create an output image with many channels. The purpose of the pointwise convolution is to combine the output channels of the depthwise convolution to create new features.
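The weighted sum across channels can be illustrated with the following Python sketch. The function name `pointwise_conv` and the example weights are hypothetical and chosen only to make the arithmetic easy to follow; bias terms and rectification are left out.

```python
def pointwise_conv(image, weights):
    """1x1 convolution: each output channel is a weighted sum of the inputs.

    image:   channels x rows x cols (nested lists)
    weights: out_channels x in_channels, one pointwise kernel per row
    """
    in_ch = len(image)
    rows, cols = len(image[0]), len(image[0][0])
    return [
        [[sum(w[i] * image[i][r][c] for i in range(in_ch))
          for c in range(cols)]
         for r in range(rows)]
        for w in weights
    ]


image = [[[1, 2]], [[3, 4]], [[5, 6]]]  # 3 channels, 1 row, 2 columns
weights = [[1, 1, 1], [1, 0, -1]]       # 2 stacked pointwise kernels
combined = pointwise_conv(image, weights)
print(combined)
```

Stacking two pointwise kernels turns the three input channels into two output channels while leaving the spatial size unchanged.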

(61) A depthwise convolution followed by a pointwise convolution is called a depthwise separable convolution. A conventional convolution does both filtering and combining in a single step, but with a depthwise separable convolution these two operations are done as separate steps.
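The saving obtained by separating filtering from combining can be estimated by counting multiplications. The Python sketch below uses the standard counts for a regular convolution and for a depthwise convolution followed by a pointwise convolution; the function names and the 16×16 layer with 64 input and 128 output channels are illustrative assumptions, and the counts assume that the output has the same spatial size as the input. They are not intended to reproduce the figures in tables 5 and 6.

```python
def regular_conv_mults(h, w, k_rows, k_cols, c_in, c_out):
    """Multiplications of a regular convolution producing an h x w output."""
    return h * w * k_rows * k_cols * c_in * c_out


def depthwise_conv_mults(h, w, k_rows, k_cols, channels):
    """Multiplications of a depthwise convolution (one kernel per channel)."""
    return h * w * k_rows * k_cols * channels


def separable_conv_mults(h, w, k_rows, k_cols, c_in, c_out):
    """Depthwise filtering followed by a 1x1 (pointwise) combination."""
    return (depthwise_conv_mults(h, w, k_rows, k_cols, c_in)
            + regular_conv_mults(h, w, 1, 1, c_in, c_out))


# Illustrative layer size (an assumption, not taken from the tables).
regular = regular_conv_mults(16, 16, 3, 3, 64, 128)
separable = separable_conv_mults(16, 16, 3, 3, 64, 128)
print(regular, separable, separable / regular)
```

For this illustrative layer the separable form needs roughly an eighth of the multiplications of the regular 3×3 convolution.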

(62) FIG. 3 shows an exemplary embodiment of the first variant compared to three other network architectures. Baseline is a conventional neural network, also called vanilla CNN. MobileNet is the model proposed by document D1. ShuffleNet is the model proposed by document D2. "dw" and "mp" stand for depthwise convolution and max-pooling, respectively. "fc layer" means fully connected layer. 10 classes are used in all four network architectures shown in FIG. 3. The shapes in FIG. 3 and the following tables are given in the following format: rows of the kernel × columns of the kernel × output channels.

(63) In the following the results of experiments are presented that were undertaken to evaluate the unified convolution block structure as proposed herein.

(64) TABLE 1

Layer                  Shape out    Floats out
3×3×64 + mp            16×16×64     16384
3×3×128 + mp           8×8×128      8192
3×3×256 + mp           4×4×256      4096
1×1×10                 1×1×10       10

(65) Table 1 shows the data compression in floating point numbers (floats for short) throughout the baseline network without efficiency optimization. Convolutional neural networks can generally handle a data compression by a factor of two.

(66) TABLE 2

Layer                  Shape out    Floats out
3×3×64 + mp            16×16×64     16384
dw 3×3×64 + stride     8×8×64       4096
1×1×128                8×8×128      8192
dw 3×3×128 + stride    4×4×128      2048
1×1×256                4×4×256      4096
1×1×10                 1×1×10       10

(67) Table 2 shows the data compression in floats throughout the MobileNet model as proposed by document D1. Stride is an operation similar to max-pooling. The critical compression points are highlighted in bold. Layers from the same block have the same background color.
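The compression between consecutive layers can be computed directly from the "Floats out" column. The Python sketch below copies the values of Table 2 (the final 10-class layer is left out) and flags every layer that compresses by more than a factor of two. The threshold of two and the helper names are assumptions made for illustration; whether the flagged layers coincide exactly with the points highlighted in the original table is likewise an assumption.

```python
# (layer, floats out) pairs copied from Table 2; classifier layer omitted.
TABLE_2 = [
    ("3x3x64 + mp", 16384),
    ("dw 3x3x64 + stride", 4096),
    ("1x1x128", 8192),
    ("dw 3x3x128 + stride", 2048),
    ("1x1x256", 4096),
]


def compression_factors(layers):
    """Return (layer, floats in / floats out) for every layer but the first."""
    return [(name, prev / floats)
            for (_, prev), (name, floats) in zip(layers, layers[1:])]


def bottlenecks(layers, limit=2.0):
    """Layers that compress their input by more than `limit` (assumed 2)."""
    return [name for name, factor in compression_factors(layers)
            if factor > limit]


for name, factor in compression_factors(TABLE_2):
    print(f"{name:22s} x{factor}")
print(bottlenecks(TABLE_2))
```

Under this criterion both strided depthwise layers compress their input by a factor of four, while the pointwise layers that follow them expand the data again by a factor of two.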

(68) TABLE 3

Layer                    Shape out    Floats out
3×3×64 + mp              16×16×64     16384
gc4 1×1×32 + shuffle     16×16×32     8192
dw 3×3×32 + stride       8×8×32       2048
gc4 1×1×128              8×8×128      8192
gc4 1×1×64 + shuffle     8×8×64       4096
dw 3×3×128 + stride      4×4×64       1024
gc4 1×1×256              4×4×256      4096
1×1×10                   1×1×10       10

(69) Table 3 shows the data compression in floats throughout the ShuffleNet model as proposed by document D2. shuffle and gc4 are as explained in document D2. Stride is an operation similar to max-pooling. The critical compression points are highlighted in bold. Layers from the same block have the same background color.

(70) TABLE 4

Layer                    Shape out    Floats out
1×1×32                   32×32×32     32768
dw 1×3×32 + 1d mp        32×16×32     16384
dw 3×1×32                32×16×32     16384
2×1×64 + 1d stride       16×16×64     16384
1×1×64                   16×16×64     16384
dw 1×3×64 + 1d mp        16×8×64      8192
dw 3×1×64                16×8×64      8192
2×1×128 + 1d stride      8×8×128      8192
1×1×128                  8×8×128      8192
dw 1×3×128 + 1d mp       8×4×128      4096
dw 3×1×128               8×4×128      4096
2×1×256 + 1d stride      4×4×256      4096
1×1×10                   1×1×10       10

(71) Table 4 shows the data compression in floats throughout an exemplary model of the invention. "1d mp" means one-dimensional max-pooling. The invention successfully eliminates all bottlenecks in the network while keeping the number of multiplications to a minimum (see tables 5 and 6).

(72) TABLE 5

Model        Num. Multiplications    Ratio to Baseline    Accuracy (%)
Baseline     39559168                1.00                 79.50
MobileNet    4653056                 0.12                 77.70
ShuffleNet   2361344                 0.06                 75.60
Invention    5726208                 0.14                 81.20

(73) Table 5 shows a model comparison on the Cifar10 data set, which is a fundamental data set in computer vision. In table 5 the number of multiplications and the accuracy of each of the models are compared. Notice how the model according to an exemplary embodiment of the invention is significantly more accurate than both MobileNet and ShuffleNet while being only slightly more expensive than MobileNet.
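The "Ratio to Baseline" column of Table 5 follows from the multiplication counts by a simple division, as the Python sketch below shows. The dictionary holds the counts copied from Table 5; the names `TABLE_5` and `ratios_to_baseline` are hypothetical, and rounding to two decimals is assumed to match the table.

```python
# Multiplication counts copied from Table 5 (Cifar10).
TABLE_5 = {
    "Baseline": 39559168,
    "MobileNet": 4653056,
    "ShuffleNet": 2361344,
    "Invention": 5726208,
}


def ratios_to_baseline(mults, baseline="Baseline"):
    """Divide every count by the baseline count, rounded to two decimals."""
    base = mults[baseline]
    return {model: round(count / base, 2) for model, count in mults.items()}


ratios = ratios_to_baseline(TABLE_5)
print(ratios)
# Relative cost of the exemplary model compared with MobileNet.
print(TABLE_5["Invention"] / TABLE_5["MobileNet"])
```

The last line quantifies "slightly more expensive": by these counts the exemplary model needs roughly 1.23 times the multiplications of MobileNet.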

(74) TABLE 6

Model        Num. Multiplications    Ratio to Baseline    Accuracy (%)
Baseline     1781760                 1.00                 91.08
MobileNet    532992                  0.30                 85.64
ShuffleNet   492288                  0.28                 82.73
Invention    390144                  0.22                 88.93

(75) Table 6 shows a model comparison on the SVHN (street view house number) data set. Notice that when the model according to an exemplary embodiment of the invention is extremely small, this model is both significantly cheaper and more accurate than both MobileNet and ShuffleNet. Since the models are initialized with a random seed, all accuracies in tables 5 and 6 are the mean accuracy of multiple runs. This helps to even out cases of particularly good or particularly bad initialization.

(76) While this invention has been described in terms of the preferred embodiments thereof, it is not intended to be so limited, but rather only to the extent set forth in the claims that follow.