Convolutional neural network pruning method based on feature map sparsification
11030528 · 2021-06-08
CPC classification
G06V10/451 · G06N3/082 · G06F18/214 · G06V10/454 · G06F18/217 · G06F18/2136 · G06F18/241
Abstract
A convolutional neural network pruning method based on feature map sparsification, which relates to compressing a convolutional neural network to reduce its number of parameters and amount of computation so as to facilitate actual deployment, is provided. In the training process, L1 or L2 regularization of the feature maps after the activation layers is added to the loss function, so that the corresponding feature map channels acquire different sparsity. Under a given pruned ratio, the convolution kernels corresponding to the channels are pruned according to the sparsity of the feature map channels. After fine-tuning, the pruned network obtains a new accuracy, and the pruned ratio is adjusted according to the change of accuracy before and after pruning. After multiple iterations, a near-optimal pruned ratio is found, and pruning is realized to the maximum extent under the condition that the accuracy does not decrease.
Claims
1. A convolutional neural network pruning method based on feature map sparsification, wherein the method is applied to crop disease classification and specifically comprises the following steps: 1) collecting pictures of crop leaves as a training data set; 2) performing sparsification training on a feature map in the convolutional neural network through the training data set of step 1): in the training process, adding a sparsification item of the feature map to a loss function of the convolutional neural network;
wherein g( ) is L1 or L2 regularization; the L1 regularization formula is:
g(m.sub.i,j)=∥m.sub.i,j∥.sub.1
the L2 regularization formula is:
g(m.sub.i,j)=∥m.sub.i,j∥.sub.2
calculating a mean value of the feature map channel by traversing the entire training data set and using the mean value as the sparsity of the feature map, the sparsity being different for different input samples; at the same time, saving the sparsity of the feature map and adding a feature map channel selection layer; after training the convolutional neural network to convergence, saving the highest accuracy of the verification set and the corresponding network weight; 3) network pruning: 3.1) setting an initial pruned ratio, setting an upper limit of the pruned ratio as 1 and a lower limit of the pruned ratio as 0; 3.2) taking the weight of the network with the highest accuracy on the verification set as the weight of the convolutional neural network, and pruning according to the following rules: sorting the sparsity of each channel of the feature map from small to large, i.e. sort.sub.min.fwdarw.max{feature map sparsity}, and then, according to the pruned ratio, setting the value of the non-learnable parameter mask of the channel selection layer corresponding to the first n channels to 0 and the value of the non-learnable parameter mask of the channel selection layers corresponding to the remaining channels to 1; retraining the pruned network until convergence, and obtaining the highest accuracy of the pruned network on the verification set; 3.3) comparing the highest accuracies on the verification set before and after pruning: if the highest accuracy after pruning is greater than or equal to the highest accuracy before pruning, taking the current pruned ratio as a new lower limit of the pruned ratio and increasing the pruned ratio; otherwise, taking the current pruned ratio as a new upper limit of the pruned ratio and reducing the pruned ratio; repeating steps 3.2) and 3.3) until the difference between the upper and lower limits of the pruned
ratio is less than a certain threshold, thereby meeting the termination condition, and then going to step 4); 4) saving the pruned network: removing the channel selection layer and copying the weight data to a new network, the new network being the pruned convolutional neural network; 5) inputting the pictures of crop leaves collected on site into the pruned network and outputting a crop disease category.
2. The convolutional neural network pruning method based on feature map sparsification according to claim 1, wherein in step 2), the channel selection layer is constructed as follows: supposing the feature map after a certain layer has C channels, defining the C non-learnable parameters of the channel selection layer as mask=[m.sub.1, m.sub.2, m.sub.3 . . . , m.sub.C], wherein m.sub.1, m.sub.2, m.sub.3, . . . m.sub.C are the coefficients corresponding to the C channels in the feature map, and their values are 0 or 1, 0 meaning that the channel cannot be transferred to the later calculation, and 1 meaning that the channel can be transferred to the later calculation.
3. The convolutional neural network pruning method based on feature map sparsification according to claim 1, wherein in step 2), the mean value of the feature map channels is calculated as follows: at the beginning of each training epoch, defining a corresponding mean variable ch_avg for each channel of the feature map after each activation layer, with the initial value being 0; when calculating the first batch of the training epoch, obtaining ch_avg; for batches starting from the second, calculating the mean value of the channel, recording it as new_ch_avg, and updating ch_avg as:
ch_avg←momentum×ch_avg+(1−momentum)×new_ch_avg
wherein momentum is a momentum parameter with a value between 0.9 and 0.99; the meaning of "←" is to assign the value on its right to the variable on its left.
4. The convolutional neural network pruning method based on feature map sparsification according to claim 1, wherein in step 3.3), the termination condition is defined as follows: defining the ratio of the number of pruned channels to the total number of network channels as the pruned ratio, expressed as pruned_ratio; the upper limit of the pruned ratio is expressed as upper_ratio, and the lower limit of the pruned ratio is expressed as lower_ratio; the termination condition is set as upper_ratio−lower_ratio<η, where η takes a value between 0.005 and 0.02, since the value of η determines the number of iterations needed to find the near-optimal pruned ratio.
5. The convolutional neural network pruning method based on feature map sparsification according to claim 4, wherein in step 3.3), the pruned ratio is increased or decreased as follows, pruned_ratio being the pruned ratio of this iteration; increasing the pruned ratio as follows:
lower_ratio←pruned_ratio
pruned_ratio←(lower_ratio+upper_ratio)/2
decreasing the pruned ratio as follows:
upper_ratio←pruned_ratio
pruned_ratio←(lower_ratio+upper_ratio)/2
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
(4) Hereinafter, the specific embodiments of the present invention will be described in further detail with reference to the drawings.
(5) The classification of crop (tomato) diseases is selected as the task; the diseases include tomato powdery mildew, early blight, spot disease and so on, and the dataset is a picture set of crop (tomato) leaves. The convolutional neural network adopts a structure of stacked feature extraction units, each composed of a convolution layer, a batch normalization layer and a ReLU activation layer, with a final linear layer outputting the category. The feature extraction unit is denoted as C, the pooling layer is denoted as M, the linear layer is denoted as L, and the 16-layer network structure is denoted as [C(64),C(64),M,C(128),C(128),C(128),M,C(256),C(256),C(256),M,C(512),C(512),C(512),M,L], where the number in parentheses indicates the number of channels.
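The compact structure notation above can be expanded into an explicit per-layer description. The following sketch (function and variable names are illustrative, not from the patent) shows one way to parse it:

```python
# Illustrative sketch: expand the patent's compact structure string into
# per-layer channel counts. "C(n)" = conv + batch-norm + ReLU unit with
# n output channels, "M" = pooling layer, "L" = final linear layer.
STRUCTURE = ["C64", "C64", "M", "C128", "C128", "C128", "M",
             "C256", "C256", "C256", "M", "C512", "C512", "C512", "M", "L"]

def expand_structure(spec):
    """Return (conv_channels, layer_kinds) from the compact notation."""
    conv_channels, kinds = [], []
    for token in spec:
        if token.startswith("C"):
            conv_channels.append(int(token[1:]))
            kinds.append("conv_bn_relu")
        elif token == "M":
            kinds.append("maxpool")
        else:  # "L"
            kinds.append("linear")
    return conv_channels, kinds

channels, kinds = expand_structure(STRUCTURE)
```

Each `conv_bn_relu` entry corresponds to one feature extraction unit whose post-activation feature map is later sparsified and given a channel selection layer.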
(6) As shown in
(7) 1.1) Adding a sparsification term for the feature maps after all activation layers to the loss function of the convolutional neural network:
(8) Loss=loss(f(x,W),y)+λ·Σ.sub.l=1..L Σ.sub.c=1..C.sub.l (1/(H.sub.l,c·W.sub.l,c)) Σ.sub.i,j g(m.sub.i,j)
(9) where the first item is the loss caused by the model prediction, x is the input of the network, i.e. the tomato leaf picture data, W is the weight of the network, f(x, W) is the output of the network, and y is the sample label. In this example, the sample label is an integer between 0 and 16.
(10) The second item is the sparsification item of the feature maps after all activation layers. λ is the sparsification factor controlling the proportional relationship between the two items; its value is preferably 0.0001-0.0005. l is the activation layer index, with a value range of 1-10, and L is the total number of activation layers, which is 10. c is the channel index of the feature map after the lth activation layer, C.sub.l is the number of channels of the feature map after the lth activation layer, H.sub.l,c and W.sub.l,c are respectively the height and width of the cth channel of the feature map after the lth activation layer, m.sub.i,j is the (i,j)th numerical value of the corresponding feature map, and g( ) is L1 or L2 regularization.
(11) L1 regularization formula is:
g(m.sub.i,j)=∥m.sub.i,j∥.sub.1
(12) L2 regularization formula is:
g(m.sub.i,j)=∥m.sub.i,j∥.sub.2
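A minimal sketch of the sparsification term, assuming each feature map is an array of shape (C, H, W) and that per-element L2 regularization is read as the squared value (an assumption of this sketch; for a scalar, ∥m∥.sub.1 and ∥m∥.sub.2 coincide, and the squared form is a common practical reading):

```python
import numpy as np

def sparsification_term(feature_maps, lam=1e-4, norm="l1"):
    """Hedged sketch of the regularization added to the loss: for each
    post-activation feature map channel, average g(m_ij) over the
    channel's H*W entries, sum over all channels of all layers, and
    scale by the sparsification factor lambda."""
    total = 0.0
    for fmap in feature_maps:          # one array per activation layer, (C, H, W)
        C, H, W = fmap.shape
        for c in range(C):
            ch = fmap[c]
            if norm == "l1":
                total += np.abs(ch).sum() / (H * W)   # g(m) = |m|
            else:
                total += (ch ** 2).sum() / (H * W)    # g(m) = m^2 (assumed reading)
    return lam * total
```

Since the feature maps sit after ReLU layers, their entries are nonnegative, so the L1 term is simply the channel's mean activation scaled by λ, which is what drives unimportant channels toward zero during training.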
(13) 1.2) Adding a channel selection layer after the activation layer. Supposing the feature map after a certain layer has C channels, defining the C non-learnable parameters of the channel selection layer as mask=[m.sub.1, m.sub.2, m.sub.3, . . . , m.sub.C], wherein m.sub.1, m.sub.2, m.sub.3, . . . , m.sub.C are the coefficients corresponding to the C channels in the feature map, and their values are 0 or 1: 0 means that the channel cannot be passed to later calculation, and 1 means that the channel can be passed to later calculation. As shown in
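The channel selection layer amounts to an elementwise multiply by a per-channel 0/1 mask. A minimal sketch (class name is illustrative), assuming feature maps of shape (C, H, W):

```python
import numpy as np

class ChannelSelection:
    """Sketch of the channel selection layer: C non-learnable 0/1
    coefficients, one per channel. A 0 blocks the channel from all
    later computation; a 1 passes it through unchanged."""
    def __init__(self, num_channels):
        self.mask = np.ones(num_channels)      # all channels pass initially

    def forward(self, x):                      # x shape: (C, H, W)
        # Broadcast the per-channel mask over the spatial dimensions.
        return x * self.mask[:, None, None]

sel = ChannelSelection(3)
sel.mask[1] = 0                                # "prune" channel 1
out = sel.forward(np.ones((3, 2, 2)))          # channel 1 is now all zeros
```

Because the mask is non-learnable, pruning and un-pruning a channel during the search is just flipping its mask bit, with no change to the stored convolution weights.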
(14) 1.3) In the training process, calculating the mean value for each channel of the feature map after the activation layer, as the evidence of the importance of the channel. Different input samples have different calculated feature maps. The mean value of the feature map channel is obtained by traversing the entire training dataset.
(15) At the beginning of each training epoch, defining the corresponding mean variable ch_avg for each channel of the feature map after each activation layer, with the initial value being 0. When the first batch of the training epoch is calculated, obtain ch_avg:
(16) ch_avg=(1/(batch_size·H·W))Σ.sub.i=1..batch_size Σ.sub.j=1..H Σ.sub.k=1..W m.sub.i,j,k
(17) wherein batch_size is the batch size, H and W are respectively the height and width of the feature map, and m.sub.i,j,k is the (j,k)th numerical value of the corresponding channel of the ith sample. For batches starting from the second, calculate the mean value of the channel according to the above formula and record it as new_ch_avg; at the same time, update ch_avg as follows:
ch_avg←momentum×ch_avg+(1−momentum)×new_ch_avg
(18) wherein momentum is a momentum parameter with a value between 0.9 and 0.99; the meaning of "←" is to assign the value on its right to the variable on its left.
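The update rule in (17) is an exponential moving average, the same scheme batch normalization uses for its running statistics. A sketch of the per-channel update (function name is illustrative):

```python
def update_channel_mean(ch_avg, new_ch_avg, momentum=0.9, first_batch=False):
    """Running per-channel mean over an epoch, following
    ch_avg <- momentum*ch_avg + (1-momentum)*new_ch_avg.
    On the first batch of the epoch, ch_avg is taken directly from
    that batch's mean."""
    if first_batch:
        return new_ch_avg
    return momentum * ch_avg + (1 - momentum) * new_ch_avg
```

With momentum between 0.9 and 0.99, each new batch nudges the estimate only slightly, so by the end of the epoch ch_avg approximates the channel's mean activation over the whole training set.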
(19) 1.4) After the feature map is subjected to sparsification training for several epochs, the network converges, and the highest accuracy initial_acc of the verification set is recorded. In this example, after training for 160 epochs, the network converges, and the highest accuracy of the verification set is 88.03%.
(20) 2) network pruning:
(21) 2.1) The convolutional neural network loads the network weight with the highest accuracy in the verification set during training. Sorting ch_avg of the channels of the feature map from small to large.
(22) The ratio of the number of pruned channels to the total number of network channels is defined as the pruned ratio, expressed as pruned_ratio. The number of pruned channels is pruned_channels=pruned_ratio×the total number of channels in the network. Set the upper limit upper_ratio of the pruned ratio to 1 and the lower limit lower_ratio to 0; the initial pruned ratio is 0.5. In the sorted mean values of the feature map channels, that is sort.sub.min.fwdarw.max{ch_avg}, the mask value of the channel selection layer corresponding to the first pruned_channels channels is set to 0, and the mask value of the channel selection layer corresponding to the remaining channels is set to 1.
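Step 2.1) can be sketched as follows, assuming all channels' ch_avg values have been gathered into one flat list (the helper name is illustrative):

```python
def build_masks(ch_avgs, pruned_ratio):
    """Sketch of step 2.1): sort channels by their mean activation
    ch_avg from small to large and zero the masks of the first
    pruned_ratio fraction (the least active channels), keeping
    the mask value 1 for the rest."""
    n = len(ch_avgs)
    n_pruned = int(pruned_ratio * n)
    # Channel indices ordered by increasing mean activation.
    order = sorted(range(n), key=lambda i: ch_avgs[i])
    mask = [1] * n
    for i in order[:n_pruned]:
        mask[i] = 0
    return mask
```

In the actual method the flat mask would be split back into per-layer channel selection masks; the sorting-and-thresholding logic is the same.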
(23) 2.2) After changing the mask values of the channel selection layers, fine-tune the network, that is, continue to train for a certain number of epochs, 60 in this example. After the network converges, record the highest accuracy pruned_acc of the pruned network on the verification set.
(24) 2.3) Determine whether the termination condition is met. The termination condition is upper_ratio−lower_ratio<η, where the value of η determines the number of iterations needed to find the near-optimal pruned ratio and is generally 0.005-0.02. If the termination condition is met, determine whether pruned_acc+ε is greater than initial_acc; if so, let lower_ratio=pruned_ratio and save the network weight at this time; if not, do nothing; then proceed to step 3). If the termination condition is not met, proceed to step 2.4).
(25) 2.4) Compare the highest accuracy pruned_acc after pruning with the highest accuracy initial_acc before pruning, with ε being the maximum allowable accuracy loss after pruning. If pruned_acc+ε>initial_acc, the accuracy of the network is maintained under this pruned ratio, so the pruned ratio can be increased and the network weight at this time is saved; otherwise, the accuracy decreases under this pruned ratio, and the pruning degree must be reduced.
(26) Increase the pruned ratio as follows:
(27) lower_ratio←pruned_ratio
pruned_ratio←(lower_ratio+upper_ratio)/2
(28) Decrease the pruned ratio as follows:
upper_ratio←pruned_ratio
(29) pruned_ratio←(lower_ratio+upper_ratio)/2
(30) Repeat steps 2.1) to 2.4) according to the new lower_ratio, pruned_ratio and upper_ratio until the termination condition in step 2.3) is met. The meaning of "←" is to assign the value on its right to the variable on its left.
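Steps 2.1)-2.4) form a binary search over the pruned ratio. A sketch of the loop (the `evaluate` callback, which stands in for "prune at this ratio, fine-tune, and measure verification accuracy", is an illustrative abstraction; `eta` and `eps` correspond to the document's η and ε):

```python
def search_pruned_ratio(evaluate, eta=0.01, eps=0.005, init_ratio=0.5):
    """Binary search for the near-optimal pruned ratio.
    evaluate(ratio) returns the fine-tuned verification accuracy at
    that pruned ratio; evaluate(0.0) stands in for the unpruned
    accuracy initial_acc. The bounds close in until
    upper - lower < eta; eps is the allowed accuracy loss."""
    initial_acc = evaluate(0.0)
    lower, upper, ratio = 0.0, 1.0, init_ratio
    best = 0.0
    while upper - lower >= eta:
        acc = evaluate(ratio)
        if acc + eps > initial_acc:        # accuracy held: prune harder
            lower, best = ratio, ratio
        else:                              # accuracy dropped: back off
            upper = ratio
        ratio = (lower + upper) / 2        # midpoint of the new bounds
    return best
```

Each iteration halves the search interval, so roughly log2(1/η) fine-tuning rounds suffice to locate the largest ratio that keeps the accuracy within ε of the original.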
(31) 3) Save the pruned network
(32) Remove the channel selection layer as follows, and copy the weight data to the new network.
(33) 3.1) Sum the mask of each channel selection layer to obtain an array cfg=[c.sub.1, c.sub.2, c.sub.3 . . . c.sub.n-1, c.sub.n]. Keeping the network structure unchanged, redefine the number of channels in each layer of the network according to cfg.
(34) 3.2) Copy the weights of convolution layer, batch normalization layer and linear layer to the new network.
(35) For the weight of the convolution layer, as shown in
(36) For the batch normalization layer, copy the weight data of the channels whose mask value in the channel selection layer is 1 to the new network; the resulting new network is the pruned convolutional neural network.
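Steps 3.1)-3.2) can be sketched as follows, assuming convolution weights of shape (out_channels, in_channels, kH, kW); the helper names are illustrative, not from the patent:

```python
import numpy as np

def pruned_cfg(masks):
    """Step 3.1) sketch: cfg[i] is the number of surviving channels in
    layer i, obtained by summing that layer's channel-selection mask."""
    return [int(np.sum(m)) for m in masks]

def copy_conv_weight(weight, out_mask, in_mask):
    """Illustrative weight copy for one conv layer: keep only the
    filters whose output channel survives (out_mask == 1) and, within
    each kept filter, only the input channels that survive in the
    previous layer (in_mask == 1)."""
    keep_out = np.flatnonzero(out_mask)
    keep_in = np.flatnonzero(in_mask)
    # np.ix_ selects the kept rows/columns on the first two axes.
    return weight[np.ix_(keep_out, keep_in)]
```

Because each layer's input channels are the previous layer's surviving output channels, the copy must be done layer by layer with both masks in hand; the batch normalization and linear layers are sliced the same way along their single channel axis.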
(37) Input the collected pictures of crop (tomato) leaves into the pruned convolutional neural network, which outputs a category of crop (tomato) diseases.
(38) In this embodiment, the number of parameters of the network before pruning is 20.04 M, and the amount of computation is 12.59 GFlops. The optimal pruned ratio obtained by this method is 56.25%, the number of parameters of the network after pruning is 4.54 M, and the amount of computation is 3.02 GFlops. The number of parameters is reduced by 77.35%, and the forward calculation speed is increased by more than 4 times.
(39) The above embodiments are used to explain the present invention, but not to limit it. Any modifications and changes made to the present invention within the spirit of the present invention and the protection scope of the claims shall fall into the protection scope of the present invention.