METHOD OF FACE EXPRESSION RECOGNITION

20230011635 · 2023-01-12

Assignee

Inventors

Cpc classification

International classification

Abstract

The present invention provides a method of facial expression recognition comprising three steps: step 1: collecting facial expression data, which helps to solve the problems of scarce, disparate, and biased data that cause overfitting when training the deep learning model; step 2: designing a new deep learning network able to focus on special regions of the face to extract and learn the important features of facial expressions by integrating ensemble attention modules into a basic deep network architecture such as ResNet; step 3: training the ensemble attention deep learning model of step 2 on the dataset collected in step 1, using a combination of two loss functions, ArcFace and Softmax, to reduce the overfitting problem.

Claims

1. A method of facial expression recognition comprising: Step 1: Collecting facial expression data, wherein a facial expression dataset is collected with the purpose of training a deep learning model effectively, the collected facial expression dataset being characterized by richness and diversity, covering many special cases in reality, and distributed according to the following aspects: Expressions: happy, sad, angry, surprise, disgust, fear, neutral; Genders: male, female; Ages: children, teenagers, adults, the elderly; Geography: Europeans, Asians, Vietnamese; Face position: frontal, left or right side with the angle fluctuating from 0° to 90°, face up or down with the angle fluctuating from 0° to 45°; Step 2: Designing a new deep learning network (model) for facial expression recognition, wherein the new deep learning network architecture is built on a basic network (ResNet blocks) and is integrated with ensemble attention modules, these modules supporting the new deep learning network in extracting more valuable features of facial expressions and learning to classify them; Step 3: Training the ensemble attention deep learning model using a combination of two loss functions including ArcFace and Softmax, wherein a final loss function is a summation of the two loss functions with an alpha parameter as a weight of the combination, the formula being:
L.sub.final=alpha*L.sub.ArcFace+(1−alpha)*L.sub.Softmax, in which the alpha parameter is updated automatically based on a learning rate; in an earlier phase of training, alpha is set to a high value to prioritize the ArcFace loss function and reduce overfitting; after the model's training process becomes more stable, alpha is gradually decreased to classify the facial expression based on the Softmax loss.

2. The method of facial expression recognition according to claim 1, wherein in step 2: the network is designed based on ResNet blocks, and the attention modules integrated into these ResNet blocks include a CBAM (Convolutional Block Attention Module) and a U-net; these modules attempt to extract more valuable features based on channel attention and spatial attention mechanisms, orienting the network to attend to and learn the important weights during the training process, in that: the CBAM module is made up of two successive smaller modules, a channel attention module and a spatial attention module, in that: the input of the channel attention module is the features extracted from the ResNet block, which can consist of two layers (used in ResNet 18 and 34) or three layers (used in ResNet 50, 101, 152); these input features are pooled into two one-dimensional vectors and then fed into a deep neural network; the output of this module is a one-dimensional vector, which is then multiplied by the input features and forwarded to the spatial attention module; in the spatial attention module, the input features are merged into two two-dimensional matrices and fed into the convolutional layers; the output of this spatial attention module is again multiplied by the input features and forwarded to the next ResNet block; the U-net module consists of an encoder and a decoder, the purpose of the U-net module being similar to CBAM: to help the network concentrate on spatial features and perform more accurate expression classification; the outputs of the CBAM and U-net modules are combined to generate a final feature set; to avoid these attention modules removing useful features, the input features from the ResNet block are added to the generated feature set to produce the final features, which are passed to the next block; the output features of CBAM and U-net have the same size as the input features; the ensemble attention modules and the ResNet blocks can be serialized N times (N=4 or 5 is recommended) to build a deeper attention network architecture.

3. The method of facial expression recognition according to claim 1, wherein in step 3, two loss functions, ArcFace and Softmax, are combined in the training process of the model, the final loss function being the summation of the two loss functions with an alpha parameter as a weight of the combination, the formula being:
L.sub.final=alpha*L.sub.ArcFace+(1−alpha)*L.sub.Softmax, in which the alpha parameter is updated automatically based on a learning rate; in the earlier phase of training, while the learning rate is high (a learning rate of 0.01 is recommended), alpha is set to a high value (e.g., alpha=0.9) to prioritize the ArcFace loss function and reduce overfitting; after the model becomes more stable, alpha is gradually decreased to classify the facial expression based on the Softmax loss; the decreasing of the learning rate is decided based on the accuracy on the validation dataset: if after 10 epochs the accuracy on the validation dataset does not increase, the learning rate is reduced to 1/10 of the earlier learning rate; the corresponding decreasing rate of alpha is decided based on the training experiment and depends on the training dataset.

Description

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0010] FIG. 1 is the architecture diagram of the deep learning model that integrates ensemble attention modules for facial expression recognition.

[0011] FIG. 2 is a flow diagram of training the ensemble attention deep learning model using a combination of two loss functions: ArcFace and Softmax.

DETAILED DESCRIPTION OF THE INVENTION

[0012] The detailed description of the invention is interpreted in connection with the drawings, which are intended to illustrate variations of the invention without limiting the scope of the patent.

[0013] In this description of the invention, the terms “RetinaFace”, “ResNet”, “ArcFace”, “Softmax”, “FER+”, and “AffectNet” are proper nouns naming models or datasets.

[0014] The method of facial expression recognition includes the following steps:

[0015] Step 1: Collecting facial expression data.

[0016] The purpose of this step is to enrich the facial expression data, since the available datasets are relatively small and differ considerably from real-life situations, which exposes deep learning models to the overfitting problem. The characteristics of the collected dataset include richness and diversity, covering many special cases in reality, and a reasonable distribution according to the following aspects:

[0017] Expressions: happy, sad, angry, surprise, disgust, fear, neutral.

[0018] Genders: male, female.

[0019] Ages: children, teenagers, adults, the elderly.

[0020] Geography: Europeans, Asians, Vietnamese.

[0021] Face position: frontal, left or right side with the angle fluctuating from 0° to 90°, face up or down with the angle fluctuating from 0° to 45°.

[0023] From these raw data, face detection and alignment are performed on the original images by the RetinaFace model. Then, the detected faces are cropped, normalized, and aligned. Next, they are fed into the proposed ensemble attention deep learning model for further processing in the following steps.
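The preprocessing above can be sketched as follows. This is a minimal illustration in Python/NumPy: the RetinaFace detector call and landmark-based alignment are omitted, and the 112×112 input size, the (x1, y1, x2, y2) box format, and the [−1, 1] normalization are assumptions for illustration rather than values fixed by this description:

```python
import numpy as np

def preprocess_face(image, box, size=112):
    """Crop a detected face from an image and normalize it for the model.

    `image` is an HxWx3 uint8 array; `box` is an (x1, y1, x2, y2) bounding
    box as produced by a face detector such as RetinaFace (the detector
    call itself is omitted here, as is landmark-based alignment).
    """
    x1, y1, x2, y2 = box
    face = image[y1:y2, x1:x2]                      # crop the detected face
    # Nearest-neighbor resize to the model's input resolution.
    rows = np.arange(size) * face.shape[0] // size
    cols = np.arange(size) * face.shape[1] // size
    face = face[rows][:, cols]
    # Scale pixel values to [-1, 1], a common normalization for face models.
    return face.astype(np.float32) / 127.5 - 1.0
```

The normalized crop is then what would be fed to the ensemble attention network of step 2.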

[0024] Step 2: Designing a new deep learning network (model) for facial expression recognition.

[0025] FIG. 1 describes the architecture of the proposed deep learning model that integrates ensemble attention modules for facial expression recognition. The network is designed based on ResNet blocks, and the attention modules integrated into these ResNet blocks include CBAM (Convolutional Block Attention Module) and U-net. These modules attempt to extract more valuable features based on channel attention and spatial attention mechanisms. In other words, they orient the network to focus on the important weights during the training process.

[0026] Firstly, the CBAM module is made up of two successive smaller modules: the channel attention module and the spatial attention module. The input of the channel attention module is the features extracted from the ResNet block. This ResNet block can consist of two layers (used in ResNet 18 and 34) or three layers (used in ResNet 50, 101, 152). These input features are pooled into two one-dimensional vectors and then fed into a deep neural network. The output of this module is a one-dimensional vector, which is then multiplied by the input features and forwarded to the spatial attention module. In the spatial attention module, the input features are merged into two two-dimensional matrices and fed into the convolutional layers. Similarly, the output of this spatial attention module is again multiplied by the input features and forwarded to the next ResNet block. Secondly, the U-net module consists of an encoder and a decoder. The purpose of the U-net module is similar to CBAM: to help the network concentrate on spatial features and perform more accurate expression classification.

[0027] Thirdly, the outputs of the CBAM and U-net modules are combined to generate a final feature set. To avoid these attention modules removing useful features, the input features from the ResNet block are added to the generated feature set to produce the final features, which are passed to the next block. The output features of CBAM and U-net have the same size as the input features. The ensemble attention modules and the ResNet blocks can be serialized N times (N=4 or 5 is recommended) to build a deeper attention network architecture.
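A minimal PyTorch sketch of one ensemble attention block is given below. The CBAM and U-net branches follow the description above (channel then spatial attention; a small encoder-decoder producing a spatial mask), but the layer depths, the reduction ratio, and the simple sum used to combine the two branches with the residual input are assumptions for illustration; the exact wiring of FIG. 1 is not reproduced:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to the avg- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, max(channels // reduction, 1), 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(max(channels // reduction, 1), channels, 1),
        )
        # 7x7 conv over the two stacked channel-wise avg/max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention: two pooled descriptors -> MLP -> sigmoid gate.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: two 2-D maps -> conv -> sigmoid gate.
        maps = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(maps))

class UNetAttention(nn.Module):
    """A tiny encoder-decoder producing a spatial attention mask."""
    def __init__(self, channels):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # The mask has the same size as the input, so it can gate it directly.
        return x * self.decoder(self.encoder(x))

class EnsembleAttentionBlock(nn.Module):
    """Combine the CBAM and U-net outputs with a residual connection, so the
    attention modules cannot remove useful input features (as in step 2)."""
    def __init__(self, channels):
        super().__init__()
        self.cbam = CBAM(channels)
        self.unet = UNetAttention(channels)

    def forward(self, x):
        return x + self.cbam(x) + self.unet(x)
```

In the full network, such a block would be interleaved with ResNet blocks and serialized N=4 or 5 times; because the output size equals the input size, the blocks stack without shape adjustments.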

[0028] Step 3: Training the ensemble attention deep learning model using a combination of two loss functions: ArcFace and Softmax.

[0029] FIG. 2 shows this training process.

[0030] This step aims to use these two loss functions to train the model and reduce the overfitting problem. The Softmax loss function is widely used to train many other deep learning models; however, it has the disadvantage of not addressing the overfitting problem. This invention proposes using the ArcFace loss function together with the Softmax loss function. Although the ArcFace loss function has been applied effectively to face recognition, it has received little attention for facial expression recognition. The ArcFace loss function potentially restricts the overfitting problem while training the model and enables better classification of facial expressions. It has been shown to enhance the classification results on learned features and make the training process more stable. The ArcFace loss function is defined as follows (this is an existing formula used in face recognition research; nevertheless, it is given here to show how it is applied in this invention):

[00001] L_{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}   (1)

[0031] Where N is the number of training images; s and m are two constants used to change the magnitude of the feature values and increase the ability to classify the features; θ.sub.yi is the angle between the extracted features and the weights of the deep learning network. The learning objective is to maximize the angular distance θ for feature discrimination between different facial expressions. The final loss function is the summation of the two loss functions, with an alpha parameter in equation (2) as a weight of the combination. This formula is proposed for the first time in this invention:


L.sub.final=alpha*L.sub.ArcFace+(1−alpha)*L.sub.Softmax   (2)
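Equation (1) can be sketched in NumPy as follows. The values s=30 and m=0.5 come from the original ArcFace work and are assumptions here, since this description does not fix them:

```python
import numpy as np

def arcface_loss(features, weights, labels, s=30.0, m=0.5):
    """ArcFace loss as in equation (1).

    features: (N, d) embeddings; weights: (d, C) class-weight vectors;
    labels: (N,) integer expression labels; s, m: scale and angular margin.
    """
    # Cosine similarity between L2-normalized embeddings and class weights.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(f @ w, -1.0, 1.0)                 # (N, C)
    theta = np.arccos(cos)
    # Add the angular margin m only to the target-class angle theta_{y_i}.
    logits = s * cos
    idx = np.arange(len(labels))
    logits[idx, labels] = s * np.cos(theta[idx, labels] + m)
    # Numerically stable cross-entropy over the margin-adjusted logits.
    mx = logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(logits - mx).sum(axis=1)) + mx[:, 0]
    return float(np.mean(log_z - logits[idx, labels]))
```

Because the margin m shrinks the target-class logit, the loss with m > 0 is larger than the plain Softmax cross-entropy on the same cosines, which is what pushes the learned features of different expressions apart by an angular gap.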

[0032] The alpha parameter is updated automatically based on the learning rate. In the earlier phase of training, while the learning rate is high (a learning rate of 0.01 is recommended), alpha is set to a high value (e.g., alpha=0.9) to prioritize the ArcFace loss function and reduce overfitting. After the model's training process becomes more stable, alpha is gradually decreased to classify the facial expression based on the Softmax loss. The decreasing of the learning rate is decided based on the accuracy on the validation dataset: if after 10 epochs the accuracy on the validation dataset does not increase, the learning rate is reduced to 1/10 of the earlier learning rate. The corresponding decreasing rate of alpha is decided based on the training experiment and depends on the training dataset.
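The coupled learning-rate and alpha schedule described above can be sketched in plain Python. The `alpha_step` value is an assumption, since the description leaves the decreasing rate of alpha to be tuned per training dataset:

```python
def combined_loss(l_arcface, l_softmax, alpha):
    """Final loss of equation (2): weighted sum of ArcFace and Softmax."""
    return alpha * l_arcface + (1 - alpha) * l_softmax

class AlphaScheduler:
    """Couple alpha to a reduce-on-plateau learning-rate schedule.

    Starts with lr=0.01 and alpha=0.9 as recommended; each time the
    validation accuracy fails to improve for `patience` epochs, the
    learning rate is divided by 10 and alpha is lowered by `alpha_step`
    (alpha_step=0.2 is an illustrative choice, not fixed by the method).
    """
    def __init__(self, lr=0.01, alpha=0.9, patience=10, alpha_step=0.2):
        self.lr, self.alpha = lr, alpha
        self.patience, self.alpha_step = patience, alpha_step
        self.best_acc, self.bad_epochs = 0.0, 0

    def step(self, val_acc):
        if val_acc > self.best_acc:
            self.best_acc, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= 10                       # reduce lr to 1/10
                self.alpha = max(0.0, self.alpha - self.alpha_step)
                self.bad_epochs = 0
        return self.lr, self.alpha
```

Each epoch, `step(val_acc)` would be called after evaluating on the validation dataset, and the returned alpha used to weight the two losses in `combined_loss`.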

[0033] At the end of step 3, the ensemble attention deep learning model has been trained and can be used to predict facial expressions from images. This model can be applied in software or computer programs for image processing to build related products. Basically, the input of the software can be a camera RTSP (Real Time Streaming Protocol) link or an offline video, and the output is the facial expression analysis results for the people appearing in the camera stream or video. For example, person A has a happy expression, person B has an angry expression, etc.

[0034] Although the above descriptions contain many specifics, they are not intended to limit the embodiments of the invention, but only to illustrate some preferred embodiments.