METHOD FOR RECOGNIZING FACIAL EXPRESSIONS BASED ON ADVERSARIAL ELIMINATION

Abstract

The present disclosure relates to a method for recognizing facial expressions based on adversarial elimination. First, a facial expression recognition network is built based on a deep convolutional neural network. The network is trained on a natural facial expression data set with a loss function that makes facial expression features easier to distinguish. Then, an improved adversarial elimination method actively eliminates some key features of the input images to generate new data sets, which are used to train new networks with different weight distributions and feature extraction capabilities. This forces the networks to classify expressions based on more features, which reduces the influence of interference factors such as occlusion on recognition accuracy and improves the robustness of the facial expression recognition network. Finally, the final expression classification results are obtained by using network integration and a relative majority voting method.

Claims

1. A method for recognizing facial expressions based on adversarial elimination, comprising the following steps: preprocessing data, acquiring a natural facial expression data set and using images in the data set as input images, and preprocessing the input images to obtain a preprocessed data set; building a facial expression recognition network; preprocessing the images in the data set according to the method in step 1, inputting the preprocessed images into the facial expression recognition network, training the network by using a loss function, and stopping training when the network converges to obtain a category prediction output of a corresponding expression; generating multiple facial expression recognition sub-networks with different weight distributions by using an improved adversarial elimination method, wherein with the improved adversarial elimination method, the training data set of each sub-network can be different, so that the sub-networks can extract different expression features, and thus the generated network has diversity and complementarity; and performing network integration on the multiple sub-networks, and making final classification discrimination based on multiple expression prediction classifications obtained from the multiple sub-networks.

2. The method for recognizing facial expressions based on adversarial elimination according to claim 1, wherein the preprocessing specifically refers to first performing data normalization on the input images, scaling the images to a fixed size, and then performing operations such as data normalization, horizontal flipping, image rotation, and image cropping on images in a train set to obtain a preprocessed data set.

3. The method for recognizing facial expressions based on adversarial elimination according to claim 1, wherein the building a facial expression recognition network comprises the following steps: selecting a ResNet34 model as a main network structure of the facial expression recognition network; fixing all layers of the ResNet34 model except the last fully connected layer, and changing the number of outputs of the last fully connected layer to the number of categories n of the facial expression data set; and pre-training the facial expression recognition network, importing ImageNet training weights to the modified ResNet34 model, recorded as the facial expression recognition network h.sub.t; and setting an initial facial expression recognition network serial number t=0.

4. The method for recognizing facial expressions based on adversarial elimination according to claim 1, wherein a computational formula of the loss function is as follows:

$$L_{\mathrm{Arcface}} = -\frac{1}{T}\sum_{i=1}^{T}\log\frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}},$$

where a batch size and the number of expression categories are T and n respectively, y.sub.i represents a category label of the ith sample image, θ.sub.j represents an included angle between the jth column of a weight matrix and the feature, θ.sub.yi represents an included angle between the y.sub.ith column of the weight matrix and the feature, and s and m represent a feature scale and an additive angular margin penalty respectively.

5. The method for recognizing facial expressions based on adversarial elimination according to claim 1, wherein the improved adversarial elimination algorithm comprises the following steps: performing class activation mapping on the facial expression recognition network h.sub.t by using the following method, for any input image x in the train set, generating its heat map V.sub.x.sup.c under a corresponding target category c, setting the kth feature map output by the last convolutional layer as A.sup.k, where A.sub.ij.sup.k represents a point (i,j) on the feature map A.sup.k, the weight of A.sup.k to a specific expression category c is defined as W.sub.k.sup.c, then the acquisition way of V.sub.x.sup.c is as follows:
$$V_x^c = \mathrm{relu}\left(\sum_k W_k^c \cdot A^k\right),$$

where a computational formula of the weight W.sub.k.sup.c is:

$$W_k^c = \sum_i \sum_j \alpha_{ij}^{kc} \cdot \mathrm{relu}\left(\frac{\partial Y^c}{\partial A_{ij}^k}\right),$$

in the above formula, relu is an activation function, α.sub.ij.sup.kc are gradient weights of the target category c and A.sup.k, and Y.sup.c is a score of the target category c; setting a threshold G, where G is the maximum value in V.sub.x.sup.c; keeping a target region having a value equal to G in V.sub.x.sup.c, and setting the values of the remaining regions to 0; upsampling V.sub.x.sup.c to the size of the input image to obtain a key target region R.sub.x corresponding to the input image x; calculating average pixels of all images in the train set, and replacing pixels in the key target region R.sub.x corresponding to the image x in the train set with the average pixels, so as to erase the key target region for which the facial expression recognition network makes classification discrimination from the training image to generate a new train set; assigning the serial number t of the facial expression recognition network to t+1, generating a new facial expression recognition network h.sub.t according to step 2, sending the newly generated train set and an original test set to h.sub.t according to the method in step 3 for training, and finishing the training when the model converges; and comparing accuracy rates of the sub-network h.sub.t and an initial facial expression recognition network h.sub.0 on the test set, when an accuracy rate difference is not larger than 5%, repeating steps 5.1 to step 5.5 to generate a new sub-network; and when the accuracy rate difference is larger than 5%, discarding the subnetwork h.sub.t, and setting z=t−1, and finally obtaining z generated subnetworks: h.sub.1, h.sub.2, . . . , h.sub.z-1, h.sub.z.

6. The method for recognizing facial expressions based on adversarial elimination according to claim 1, wherein a method for network integration is: performing network integration on z+1 facial expression recognition networks h.sub.0, h.sub.1, h.sub.2, . . . , h.sub.z-1, h.sub.z, then expressing a predicted output of a network h.sub.β on the input image x as an n-dimensional vector h.sub.β(x)=(h.sub.β.sup.1(x); h.sub.β.sup.2(x); . . . ; h.sub.β.sup.n(x)), where the network h.sub.β represents any network from network h.sub.0 to network h.sub.z; then performing classification discrimination on output vectors of all networks by using a relative majority voting method to obtain a classification predicted result H(x), that is, the predicted result is a category with the highest score; and if there are multiple categories with the highest score, randomly selecting one category; and the formula of the relative majority voting method is as follows:

$$H(x) = c_{\arg\max_{j} \sum_{\beta=0}^{z} h_{\beta}^{j}(x)},$$

where h.sub.β.sup.j(x) is an output of the network h.sub.β on a category c.sub.j.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] FIG. 1 is a flow chart of a method for recognizing facial expressions based on adversarial elimination of the present disclosure.

[0024] FIG. 2 is a structure diagram of an improved adversarial elimination method of the present disclosure.

[0025] FIGS. 3A and 3B are schematic diagrams of obtaining a target region through a heat map of the present disclosure.

[0026] FIG. 4 is a structure diagram of network integration of the present disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0027] In order to enable those skilled in the art to better understand and use the present disclosure, the technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings and specific implementations. The following embodiments are only used to illustrate the present disclosure and are not used to limit the scope of the present disclosure.

[0028] The present disclosure relates to a method for recognizing facial expressions based on adversarial elimination. The flow chart thereof is shown in FIG. 1. The method includes the following steps:

[0029] Step 1: The natural expression data set RAF-DB is selected to provide the train set and the test set, and its 12271 train set images and 3068 test set images are used as input images and are preprocessed. Specifically, the input images are first scaled to 224×224 and then subjected to data normalization. Operations such as horizontal flipping, image rotation, and image cropping are performed on the train set images for data augmentation, with the rotation angle kept within 45 degrees. After these operations, a preprocessed data set is obtained.

[0030] Step 2: An NVIDIA GeForce RTX 3090 GPU is used as the training platform, and PyTorch is used as the deep learning framework. The training batch size is set to 32, the learning rate is set to 0.0001, and the Adam optimizer is used.

[0031] Step 3: A ResNet34 model is selected as a main network structure of a facial expression recognition network.

[0032] Step 3.1: All layers of the ResNet34 model except the last fully connected layer are fixed, and the number of outputs of the last fully connected layer is changed to 7, the number of facial expression categories in RAF-DB. The basic expression categories are surprise, fear, anger, happiness, sadness, disgust, and neutral. ImageNet training weights are imported into the modified ResNet34 model by using the PyTorch deep learning framework, and the model is recorded as a facial expression recognition network h.sub.t. The initial facial expression recognition network serial number is set to t=0. The structure of the fine-tuned ResNet34 is shown in Table 1.

TABLE 1 Structure of fine-tuned ResNet34

Network layer   Type       Convolution kernel size, number   Step size
conv1           Conv       7 × 7, 64                         2
conv2_x         Max pool   3 × 3, 64                         2
                Conv       [3 × 3, 64; 3 × 3, 64] × 3        1
conv3_x         Conv       [3 × 3, 128; 3 × 3, 128] × 4      1
conv4_x         Conv       [3 × 3, 256; 3 × 3, 256] × 6      1
conv5_x         Conv       [3 × 3, 512; 3 × 3, 512] × 3      1
Global average pool, 7-dims fc
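As a quick arithmetic check on Table 1, the layer counts can be added up to confirm that the structure has 34 weighted layers. The sketch below is only an illustration: the dictionary and function names are hypothetical, and it assumes that each bracketed residual block in the table holds two convolutions and that pooling layers are not counted.

```python
# (number of residual blocks, convolutions per block) for each stage,
# read from the bracketed entries of Table 1.
RESIDUAL_STAGES = {
    "conv2_x": (3, 2),
    "conv3_x": (4, 2),
    "conv4_x": (6, 2),
    "conv5_x": (3, 2),
}


def count_weighted_layers(stages, stem_convs=1, fc_layers=1):
    """Count convolutional and fully connected layers; pooling has no weights."""
    residual_convs = sum(blocks * convs for blocks, convs in stages.values())
    return stem_convs + residual_convs + fc_layers


total_layers = count_weighted_layers(RESIDUAL_STAGES)
print(total_layers)  # 34
```

The total is 1 (conv1) + 32 (residual convolutions) + 1 (7-dims fc), which matches the "34" in the model name.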

[0033] Step 4: the data set images are preprocessed according to the method in step 1, the preprocessed images are input into the facial expression recognition network, the facial expression recognition network is trained by using a loss function below, and the training is stopped when the network converges to obtain a category prediction output of a corresponding expression. A loss function computational formula is as follows:

$$L_{\mathrm{Arcface}} = -\frac{1}{T}\sum_{i=1}^{T}\log\frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}, \tag{1}$$

where T is the batch size and n is the number of expression categories, y.sub.i represents the category label of the ith sample image, θ.sub.j represents the included angle between the jth column of the weight matrix and the feature, θ.sub.yi represents the included angle between the y.sub.ith column of the weight matrix and the feature, and s and m represent the feature scale and the additive angular margin penalty respectively.
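The loss in formula (1) can be illustrated with a minimal sketch in plain Python. It assumes the cosines cos θ.sub.j between each feature and each weight column have already been computed; the function name and the default values of s and m are placeholders and are not taken from the disclosure.

```python
import math


def arcface_loss(cosines, labels, s=30.0, m=0.5):
    """Mean loss of formula (1) over a batch.

    cosines[i][j] is cos(theta_j) for sample i and category j;
    labels[i] is the category index y_i. s and m are assumed values.
    """
    total = 0.0
    for row, y in zip(cosines, labels):
        # Clamp before acos to guard against rounding outside [-1, 1].
        theta_y = math.acos(max(-1.0, min(1.0, row[y])))
        logits = [s * c for c in row]
        # Only the target category receives the angular margin m.
        logits[y] = s * math.cos(theta_y + m)
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[y]
    return total / len(labels)


batch_cosines = [[0.8, 0.1, -0.2], [0.3, 0.7, 0.0]]
batch_labels = [0, 1]
print(arcface_loss(batch_cosines, batch_labels))
```

With m set to 0 and s set to 1 the function reduces to ordinary softmax cross-entropy on the cosines; a positive m lowers the target logit, so the same batch yields a larger loss and the network is pushed toward a larger angular gap between categories.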

[0034] Step 5: Multiple facial expression recognition sub-networks with different weight distributions are generated by using an improved adversarial elimination method. With the improved adversarial elimination method, the training data set of each sub-network can be different, so that each sub-network can extract different expression features, and thus the generated networks are diverse and complementary. FIG. 2 is a structure diagram of the improved adversarial elimination method. The specific steps of the improved adversarial elimination method are as follows.

[0035] Step 5.1: Class activation mapping is performed on the facial expression recognition network h.sub.t by using the following method. For any input image x in the train set, its heat map V.sub.x.sup.c is generated under a corresponding target category c. The kth feature map output by the last convolutional layer is denoted A.sup.k, and A.sub.ij.sup.k represents the point (i,j) on the feature map A.sup.k. The weight of the kth feature map for a specific expression category c is defined as W.sub.k.sup.c. The heat map V.sub.x.sup.c is then obtained as follows:


$$V_x^c = \mathrm{relu}\left(\sum_k W_k^c \cdot A^k\right), \tag{2}$$

where a computational formula of the weight is:

$$W_k^c = \sum_i \sum_j \alpha_{ij}^{kc} \cdot \mathrm{relu}\left(\frac{\partial Y^c}{\partial A_{ij}^k}\right). \tag{3}$$

[0036] In the above formulas, relu is an activation function, α.sub.ij.sup.kc are the gradient weights of the target category c with respect to A.sup.k, and Y.sup.c is the score of the target category c.
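The two formulas of step 5.1 can be sketched on a tiny example as follows. This is an illustration only: the function names are hypothetical, and the gradient weights α and the gradients of Y.sup.c with respect to A.sup.k are passed in as plain nested lists, whereas in practice they would come from backpropagation through the network.

```python
def relu(value):
    return value if value > 0.0 else 0.0


def channel_weights(alphas, grads):
    """W_k^c = sum over (i, j) of alpha_ij^kc * relu(dY^c / dA_ij^k)."""
    return [
        sum(a * relu(g)
            for a_row, g_row in zip(a_map, g_map)
            for a, g in zip(a_row, g_row))
        for a_map, g_map in zip(alphas, grads)
    ]


def heat_map(feature_maps, weights):
    """V_x^c = relu(sum over k of W_k^c * A^k), evaluated at each location."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [relu(sum(w * fm[i][j] for w, fm in zip(weights, feature_maps)))
         for j in range(cols)]
        for i in range(rows)
    ]


# Two 2x2 feature maps with made-up gradients and uniform alphas.
feature_maps = [[[1.0, 2.0], [0.0, 1.0]], [[0.0, 1.0], [3.0, 0.0]]]
alphas = [[[0.25, 0.25], [0.25, 0.25]], [[0.25, 0.25], [0.25, 0.25]]]
grads = [[[1.0, -1.0], [2.0, 1.0]], [[-4.0, 0.0], [0.0, -2.0]]]

weights = channel_weights(alphas, grads)
print(weights)                          # [1.0, 0.0]
print(heat_map(feature_maps, weights))  # [[1.0, 2.0], [0.0, 1.0]]
```

The second feature map has no positive gradient, so its weight is 0 and it does not contribute to the heat map.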

[0037] Step 5.2: FIGS. 3A and 3B are schematic diagrams of obtaining a target region through a heat map. FIG. 3A is the heat map V.sub.x.sup.c of an input image x under the corresponding target category c, with a size of 7×7. A threshold G is set to the maximum value in V.sub.x.sup.c. FIG. 3B is the target region R.sub.x corresponding to the input image x. First, the region of V.sub.x.sup.c whose value equals G is kept, and the values of the remaining regions are set to 0. V.sub.x.sup.c is then upsampled to the size of the original input image by nearest neighbor interpolation to obtain a target region R.sub.x with a size of 100×100.
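Step 5.2 can be sketched as a thresholding pass followed by nearest neighbor upsampling. The function name is hypothetical, and the sketch assumes the target region is represented as a binary mask (1 inside R.sub.x, 0 outside), which the disclosure does not specify.

```python
def key_region_mask(heat, out_h, out_w):
    """Keep cells equal to the maximum G, zero the rest, then upsample."""
    g = max(max(row) for row in heat)
    kept = [[1 if value == g else 0 for value in row] for row in heat]
    in_h, in_w = len(heat), len(heat[0])
    # Nearest neighbor: each output pixel copies the source cell it falls into.
    return [
        [kept[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


small_heat = [[0.1, 0.9], [0.3, 0.2]]
for row in key_region_mask(small_heat, 4, 4):
    print(row)
# [0, 0, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 0]
# [0, 0, 0, 0]
```

For the sizes given in the text, the call would be key_region_mask(heat, 100, 100) on a 7×7 heat map. A heat map that is zero everywhere would mark the whole image, so a practical implementation would probably guard against that case.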

[0038] Step 5.3: The average pixel values of all images in the train set are calculated separately on the three channels R, G, and B. For each image x in the train set, the pixels of each channel inside its target region R.sub.x are replaced with the average pixel value of that channel. This erases from the training image the key target region on which the facial expression recognition network bases its classification discrimination, and a new train set is generated.
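The erasure in step 5.3 can be sketched as follows, assuming each image is a nested list of (R, G, B) tuples and the target region is a binary mask of the same height and width. Both function names are hypothetical.

```python
def channel_means(images):
    """Average pixel value per channel over all images in the set."""
    sums, count = [0.0, 0.0, 0.0], 0
    for image in images:
        for row in image:
            for pixel in row:
                for ch in range(3):
                    sums[ch] += pixel[ch]
                count += 1
    return tuple(total / count for total in sums)


def erase_region(image, mask, means):
    """Return a copy of image with masked pixels replaced by the channel means."""
    return [
        [means if mask[r][c] else pixel for c, pixel in enumerate(row)]
        for r, row in enumerate(image)
    ]


# Two 1x2 images; only the first pixel of the first image is erased.
images = [[[(0, 0, 0), (10, 20, 30)]], [[(20, 40, 60), (10, 20, 30)]]]
means = channel_means(images)
print(means)                                     # (10.0, 20.0, 30.0)
print(erase_region(images[0], [[1, 0]], means))  # [[(10.0, 20.0, 30.0), (10, 20, 30)]]
```

The means are computed once over the whole train set, so every erased region is filled with the same per-channel values.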

[0039] Step 5.4: The serial number t of the facial expression recognition network is incremented to t+1, and a new facial expression recognition network h.sub.t is generated according to step 3. The newly generated train set and the original test set are sent to h.sub.t for training according to the method in step 4, and training is finished when the model converges.

[0040] Step 5.5: The accuracy rates of the sub-network h.sub.t and the initial facial expression recognition network h.sub.0 on the test set are compared. When the accuracy rate difference is not larger than 5%, steps 5.1 to 5.5 are repeated to generate a new sub-network. When the accuracy rate difference is larger than 5%, the sub-network h.sub.t is discarded. In this way, 10 facial expression recognition sub-networks are finally obtained.
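The control flow of steps 5.1 to 5.5 can be sketched as a loop. Everything here is a hypothetical stand-in: build, train, accuracy, and erase are placeholder callables for steps 3, 4, the test set evaluation, and steps 5.1 to 5.3. The sketch also makes three assumptions that the text leaves open: the "accuracy rate difference" is read as the drop relative to h.sub.0 in absolute terms (0.05), each round erases from the train set produced by the previous round, and a safety cap limits the number of rounds.

```python
def generate_subnetworks(train_set, test_set, build, train, accuracy, erase,
                         max_drop=0.05, max_subnetworks=50):
    """Return [h_0, h_1, ..., h_z]; all callables are placeholders."""
    h0 = train(build(), train_set)
    base_acc = accuracy(h0, test_set)
    networks, current = [h0], h0
    # max_subnetworks is a safety cap and is not part of the disclosure.
    while len(networks) - 1 < max_subnetworks:
        # Steps 5.1 to 5.3: erase the regions the latest network relied on.
        train_set = erase(current, train_set)
        # Step 5.4: train a fresh network on the new train set.
        candidate = train(build(), train_set)
        # Step 5.5: discard the candidate and stop if accuracy fell too far.
        if base_acc - accuracy(candidate, test_set) > max_drop:
            break
        networks.append(candidate)
        current = candidate
    return networks


# Toy run with stand-in callables: accuracy falls by 0.02 per erasure round.
toy_networks = generate_subnetworks(
    train_set=0,
    test_set=None,
    build=dict,
    train=lambda net, data: {"round": data},
    accuracy=lambda net, data: 0.90 - 0.02 * net["round"],
    erase=lambda net, data: data + 1,
)
print([net["round"] for net in toy_networks])  # [0, 1, 2]
```

In the toy run the third candidate loses about 0.06 accuracy relative to h.sub.0 and is discarded, so two sub-networks are kept alongside the initial network.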

[0041] Step 6: The network integration part of the present disclosure is shown in FIG. 4 and includes two decision-making layers. The first decision-making layer performs network integration on the 11 facial expression recognition networks h.sub.0, h.sub.1, h.sub.2, . . . , h.sub.9, h.sub.10, and expresses the predicted output of a network h.sub.β on the input image x as an n-dimensional vector h.sub.β(x)=(h.sub.β.sup.1(x); h.sub.β.sup.2(x); . . . ; h.sub.β.sup.n(x)), where h.sub.β represents any network from h.sub.0 to h.sub.10. The second decision-making layer performs classification discrimination on the output vectors of all networks by using a relative majority voting method to obtain a classification predicted result H(x); that is, the predicted result is the category with the highest score, and if multiple categories share the highest score, one of them is randomly selected. The formula of the relative majority voting method is as follows:

$$H(x) = c_{\arg\max_{j} \sum_{\beta=0}^{10} h_{\beta}^{j}(x)}, \tag{4}$$

where h.sub.β.sup.j(x) is the output of the network h.sub.β on a category c.sub.j.
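Formula (4) together with the random tie-break can be sketched in a few lines. The function name is hypothetical; the sketch works whether each output vector holds one-hot votes or per-category scores, since the disclosure does not fix which form h.sub.β(x) takes.

```python
import random


def plurality_vote(outputs, rng=random):
    """Sum per-category outputs over all networks and return the winning index."""
    n = len(outputs[0])
    totals = [sum(vector[j] for vector in outputs) for j in range(n)]
    best = max(totals)
    winners = [j for j, score in enumerate(totals) if score == best]
    # Ties are broken at random, as described in step 6.
    return rng.choice(winners)


# Three networks casting one-hot votes over three categories.
votes = [[0, 1, 0], [0, 1, 0], [1, 0, 0]]
print(plurality_vote(votes))  # 1
```

Passing a seeded random.Random instance as rng makes the tie-break reproducible, which can be convenient when comparing runs.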

[0042] The description above is only used to illustrate the present disclosure, not to limit the technical solutions described in the present disclosure. Any modifications, equivalent replacements and improvements made within the spirit and principle of the present disclosure shall be encompassed in the protection scope of the present disclosure.