FACIAL EXPRESSION RECOGNITION METHOD AND SYSTEM COMBINED WITH ATTENTION MECHANISM

20230298382 · 2023-09-21

Abstract

Provided are a facial expression recognition method and system combined with an attention mechanism. The method comprises: detecting faces comprised in each video frame in a video sequence, and extracting corresponding facial ROIs, so as to obtain facial pictures in each video frame; aligning the facial pictures in each video frame on the basis of location information of facial feature points of the facial pictures; inputting the aligned facial pictures into a residual neural network, and extracting spatial features of facial expressions corresponding to the facial pictures; inputting the spatial features of the facial expressions into a hybrid attention module to acquire fused features of the facial expressions; inputting the fused features of the facial expressions into a gated recurrent unit, and extracting temporal features of the facial expressions; and inputting the temporal features of the facial expressions into a fully connected layer, and classifying and recognizing the facial expressions.

Claims

1. A facial expression recognition method combined with an attention mechanism, comprising the following steps: detecting a face comprised in each of video frames in a video sequence, and extracting a corresponding facial region of interest (ROI), so as to obtain a facial picture in each of the video frames; correcting the facial picture in each of the video frames on the basis of location information of a facial feature point of the facial picture in each of the video frames, so that the facial picture in each of the video frames is aligned relative to a plane rectangular coordinate system; inputting the aligned facial picture in each of the video frames of the video sequence into a residual neural network, and extracting a spatial feature of a facial expression corresponding to the facial picture; inputting the spatial feature of the facial expression extracted from the video sequence into a hybrid attention module, wherein the hybrid attention module calculates a feature weight of the facial expression through an attention mechanism, a weight higher than a threshold is assigned to an ROI of a facial expression change and a weight lower than the threshold is assigned to a region irrelevant to the facial expression change to correlate feature information of the facial expression between the video frames, a dependency relationship of the facial expression between the adjacent video frames is extracted, and irrelevant interference features are eliminated to acquire a fused feature of the facial expression; inputting the fused feature of the facial expression acquired from the video sequence into a recurrent neural network, and extracting a temporal feature of the facial expression; and inputting the temporal feature of the facial expression extracted from the video sequence into a fully connected layer, and classifying and recognizing the facial expression in a video based on a facial expression template pre-stored in the fully connected layer.

2. The facial expression recognition method combined with the attention mechanism according to claim 1, wherein correcting the facial picture in each of the video frames on the basis of location information of the facial feature point of the facial picture in each of the video frames, so that the facial picture in each of the video frames is aligned relative to the plane rectangular coordinate system comprises: detecting a plurality of facial expression feature points in the facial picture in each of the video frames, wherein the plurality of facial expression feature points are respectively distributed in an eye area, an eyebrow area, a nose area, a mouth area and a facial contour area; determining a position of a middle point of a face in the facial picture based on a feature point in the eye area and a feature point in the eyebrow area in the facial picture in each of the video frames, and aligning the facial picture based on the position of the middle point of the face; wherein the aligning is alignment relative to the plane rectangular coordinate system, and two sides of the aligned facial picture are respectively parallel to two axes of the plane rectangular coordinate system.

3. The facial expression recognition method combined with the attention mechanism according to claim 2, wherein aligning the facial picture based on the position of the middle point of the face comprises: using an affine transformation matrix to align the facial picture based on the position of the middle point of the face.

4. The facial expression recognition method combined with the attention mechanism according to claim 2, wherein before inputting the aligned facial picture in each of the video frames in the video sequence into the residual neural network, the method further comprises the following step: adjusting a size of the aligned facial picture uniformly to a picture of a preset size.

5. The facial expression recognition method combined with the attention mechanism according to claim 4, wherein the residual neural network, the hybrid attention module, the recurrent neural network and the fully connected layer all need to be pre-trained, and perform the facial expression recognition after the training; in a training phase, the facial picture input to the residual neural network needs to be subjected to the facial picture alignment and adjusted to the picture with the uniform size, and a corresponding facial expression label needs to be marked on each of the facial pictures; the facial expression label is a recognition result of the facial expression of each of the facial pictures.

6. The facial expression recognition method combined with the attention mechanism according to claim 1, wherein the hybrid attention module is composed of a self-attention module and a spatial attention module; the self-attention module calculates a self-attention weight of an expression of a single frame on a space dimension through a fully connected layer and a sigmoid activation function, assigns the weight to the spatial feature, and obtains a spatial attention feature vector; the spatial attention module applies an average pooling layer, a 2D convolution layer (with a kernel size of 3×3 and a padding size of 1), and the sigmoid activation function to spatial attention features of a plurality of frames, extracts an attention weight on a frame dimension, performs feature fusion on the features of the frames, calculates expression change features between adjacent frames, and obtains a fused feature vector fused with a space-time attention weight.

7. A facial expression recognition system combined with an attention mechanism, comprising: a facial picture detection unit, which detects a face comprised in each of video frames in a video sequence, and extracts a corresponding facial region of interest (ROI), so as to obtain a facial picture in each of the video frames; a facial picture alignment unit, which corrects the facial picture in each of the video frames on the basis of location information of a facial feature point of the facial picture in each of the video frames, so that the facial picture in each of the video frames is aligned relative to a plane rectangular coordinate system; a spatial feature extraction unit, which inputs the aligned facial picture in each of the video frames of the video sequence into a residual neural network, and extracts a spatial feature of a facial expression corresponding to the facial picture; a fused feature extraction unit, which inputs the spatial feature of the facial expression extracted from the video sequence into a hybrid attention module, wherein the hybrid attention module calculates a feature weight of the facial expression through an attention mechanism, a weight higher than a threshold is assigned to an ROI of a facial expression change and a weight lower than the threshold is assigned to a region irrelevant to the facial expression change to correlate feature information of the facial expression between the video frames, a dependency relationship of the facial expression between the adjacent video frames is extracted, and irrelevant interference features are eliminated to acquire a fused feature of the facial expression; a temporal feature extraction unit, which inputs the fused feature of the facial expression acquired from the video sequence into a recurrent neural network, and extracts a temporal feature of the facial expression; and a facial expression recognition unit, which inputs the temporal feature of the facial expression extracted from the video sequence into a fully connected layer, and classifies and recognizes the facial expression in a video based on a facial expression template pre-stored in the fully connected layer.

8. The facial expression recognition system combined with the attention mechanism according to claim 7, wherein the facial picture alignment unit detects a plurality of facial expression feature points in the facial picture in each of the video frames, wherein the plurality of facial expression feature points are respectively distributed in an eye area, an eyebrow area, a nose area, a mouth area and a facial contour area; and determines a position of a middle point of a face in the facial picture based on a feature point in the eye area and a feature point in the eyebrow area in the facial picture in each of the video frames, and aligns the facial picture based on the position of the middle point of the face; the aligning is alignment relative to the plane rectangular coordinate system, and two sides of the aligned facial picture are respectively parallel to two axes of the plane rectangular coordinate system.

9. The facial expression recognition system combined with the attention mechanism according to claim 8, further comprising: a picture resizing unit, which adjusts a size of the corrected facial picture uniformly to a picture of a preset size before inputting the aligned facial picture in each of the video frames in the video sequence into the residual neural network.

10. The facial expression recognition system combined with the attention mechanism according to claim 7, wherein the hybrid attention module used in the fused feature extraction unit is composed of a self-attention module and a spatial attention module; the self-attention module calculates a self-attention weight of an expression of a single frame on a space dimension through a fully connected layer and a sigmoid activation function, assigns the weight to the spatial feature, and obtains a spatial attention feature vector; the spatial attention module applies an average pooling layer, a 2D convolution layer (with a kernel size of 3×3 and a padding size of 1), and the sigmoid activation function to spatial attention features of a plurality of frames, extracts an attention weight on a frame dimension, performs feature fusion on the features of the frames, calculates expression change features between adjacent frames, and obtains a fused feature vector fused with a space-time attention weight.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0040] FIG. 1 is a flowchart of a facial expression recognition method combined with an attention mechanism provided by an embodiment of the present disclosure.

[0041] FIG. 2 is a technical flowchart of a facial expression recognition method combined with an attention mechanism provided by an embodiment of the present disclosure.

[0042] FIG. 3 is a structural diagram of an overall model of facial expression recognition combined with an attention mechanism provided by an embodiment of the present disclosure.

[0043] FIG. 4 is a diagram of the internal structure of a hybrid attention module provided by an embodiment of the present disclosure.

[0044] FIG. 5 is an expression classification confusion matrix result diagram of the method of the present disclosure on three datasets provided by an embodiment of the present disclosure.

[0045] FIG. 6 is a framework diagram of a facial expression recognition system combined with an attention mechanism provided by an embodiment of the present disclosure.

DESCRIPTION OF THE EMBODIMENTS

[0046] In order to make the purpose, technical solution and advantages of the present disclosure more clear, the present disclosure will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present disclosure, not to limit the present disclosure.

[0047] FIG. 1 is a flowchart of a facial expression recognition method combined with an attention mechanism provided by an embodiment of the present disclosure. As shown in FIG. 1, the method includes the following steps:

[0048] S101. detecting a face included in each of video frames in a video sequence, and extracting a corresponding facial region of interest (ROI), so as to obtain a facial picture in each video frame;

[0049] S102. correcting the facial picture in each video frame on the basis of location information of a facial feature point of the facial picture in each video frame, so that the facial picture in each video frame is aligned relative to a plane rectangular coordinate system;

[0050] S103. inputting the aligned facial picture in each video frame of the video sequence into a residual neural network, and extracting a spatial feature of a facial expression corresponding to the facial picture;

[0051] S104. inputting the spatial feature of the facial expression extracted from the video sequence into a hybrid attention module, wherein the hybrid attention module calculates the feature weight of the facial expression through the attention mechanism, a weight higher than a threshold is assigned to an ROI of facial expression change and a weight lower than the threshold is assigned to a region irrelevant to the facial expression change to correlate the feature information of facial expression between video frames, a dependency relationship of facial expression between adjacent video frames is extracted, and irrelevant interference features are eliminated to acquire a fused feature of the facial expression;

[0052] S105. inputting the fused feature of the facial expression acquired from the video sequence into a recurrent neural network, and extracting a temporal feature of the facial expression;

[0053] S106. inputting the temporal feature of the facial expression extracted from the video sequence into a fully connected layer, and classifying and recognizing the facial expression in the video based on a facial expression template pre-stored in the fully connected layer.

[0054] Specifically, a detailed technical solution of the facial expression recognition method based on the hybrid attention mechanism provided by the present disclosure is described as follows. FIG. 2 is a technical flowchart of a facial expression recognition method combined with an attention mechanism provided by an embodiment of the present disclosure. As shown in FIG. 2, the method includes the following steps.

[0055] S1 is obtaining face data in a dataset. The dataset may be a video sequence, and the Haar feature extraction method is adopted to detect the face in each video frame of the video sequence through the grayscale changes of the picture and the pixel region difference D_face, and to extract a facial region of interest (ROI), thereby obtaining the facial picture data contained in each video frame of the video sequence.


D_face = Σ_{k≤i1, l≤j1} f(x, y) + Σ_{k≤i2, l≤j2} f(x, y) − ( Σ_{k≤i3, l≤j3} f(x, y) + Σ_{k≤i4, l≤j4} f(x, y) )

[0056] In the formula, (i, j) denotes the coordinate interval of the current divided region, (x, y) is the coordinate of a single pixel in the region, f(x, y) is the pixel value at (x, y), and each sum runs over the pixels of one rectangular sub-region.
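As an illustrative sketch (not the claimed embodiment), the rectangle-difference feature D_face can be computed efficiently from an integral image. The four rectangle coordinates below are hypothetical; a practical Haar cascade detector (e.g., OpenCV's) combines many such features.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[y, x] = sum of img[0:y+1, 0:x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, top, left, bottom, right):
    """Sum of pixels in img[top:bottom, left:right], read off the integral image."""
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def haar_feature(img, r1, r2, r3, r4):
    """D_face = (sum r1 + sum r2) - (sum r3 + sum r4), rects as (top, left, bottom, right)."""
    ii = integral_image(img.astype(np.int64))
    s = lambda r: region_sum(ii, *r)
    return (s(r1) + s(r2)) - (s(r3) + s(r4))
```

Each `region_sum` is four array lookups regardless of rectangle size, which is what makes Haar features cheap enough for exhaustive sliding-window detection.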

[0057] S2 is extracting facial feature points. The facial feature point detection method in the dlib library is adopted to extract 68 facial feature points from the facial picture data obtained in S1; the 68 feature points correspond to the eyes, eyebrows, nose, mouth and facial contour, respectively, and the facial feature point sequence P^(t) is obtained.


P^(t) = {(x_1^(t), y_1^(t)), (x_2^(t), y_2^(t)), (x_3^(t), y_3^(t)), …, (x_68^(t), y_68^(t))}

[0058] In the formula, (x_i^(t), y_i^(t)) is the coordinate of the i-th key point of the facial picture in the t-th video frame of the video sequence, 1 ≤ i ≤ 68.
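A minimal sketch of how the "middle point of the face" used in S3 might be derived from these landmarks, using dlib's standard 68-point index ranges (0-based: eyebrows are points 17-26, eyes are points 36-47). Taking the mean of those points as the midpoint is our assumption for illustration; the disclosure does not fix the exact formula.

```python
import numpy as np

# dlib's 68-point convention (0-based): eyebrows occupy indices 17-26,
# eyes occupy indices 36-47. Combining them via a plain mean is an
# assumption made for this sketch only.
EYEBROWS = slice(17, 27)
EYES = slice(36, 48)

def face_midpoint(landmarks):
    """landmarks: (68, 2) array of (x, y) points P^(t) for one frame."""
    pts = np.vstack([landmarks[EYEBROWS], landmarks[EYES]])
    return pts.mean(axis=0)
```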

[0059] S3 is aligning faces. Based on the facial feature point sequence of the facial picture in each video frame obtained in S2, the faces in the respective video frames are aligned: the middle point of the face is calculated from the location information of the eye area and the eyebrow area among the extracted 68 facial feature points, and an affine transformation matrix is applied to obtain the corrected facial picture in each video frame.

[00001]

    [u]   [a1  b1  c1] [x]
    [v] = [a2  b2  c2] [y]        u = a1·x + b1·y + c1,  v = a2·x + b2·y + c2
    [1]   [ 0   0   1] [1]

[0060] In the formula, (x, y) is the coordinate of the middle point of the current face, (u, v) is the corresponding coordinate after the transformation of the facial picture, c1 and c2 represent the translation amounts, and a1, a2, b1 and b2 represent variation parameters such as rotation and scaling of the current facial picture.
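The affine correction above amounts to one matrix-vector product per point. The helper names below are hypothetical, and choosing the six parameters (e.g., rotation from the inter-eye angle) is left to the alignment routine.

```python
import numpy as np

def affine_matrix(a1, b1, c1, a2, b2, c2):
    """3x3 homogeneous affine matrix built from the six parameters in the formula."""
    return np.array([[a1, b1, c1],
                     [a2, b2, c2],
                     [0.0, 0.0, 1.0]])

def apply_affine(M, x, y):
    """Map (x, y) to (u, v): u = a1*x + b1*y + c1, v = a2*x + b2*y + c2."""
    u, v, _ = M @ np.array([x, y, 1.0])
    return u, v
```

For a pure rotation by angle t plus a shift (tx, ty), one would pass `affine_matrix(cos(t), -sin(t), tx, sin(t), cos(t), ty)`; in practice a library routine such as OpenCV's `warpAffine` applies the same 2×3 upper block to every pixel.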

[0061] S4 is generating an input dataset. The aligned facial picture is uniformly resized to 224×224. FIG. 3 is a structural diagram of an overall model of facial expression recognition combined with an attention mechanism provided by an embodiment of the present disclosure. As shown in FIG. 3, the overall model includes: video frame cutting, a residual convolutional neural network, a hybrid attention module, a recurrent neural network, and a fully connected layer. The details are as follows:

[0062] One-hot encoding is performed on the label L corresponding to each video's expression to obtain the input L_h; a frame sequence is generated with n frames as a group. Since the number of frames differs between videos, following the TSN network processing flow, each video is divided into K parts, one frame is randomly selected from each part as a final input frame, and the resulting sequence of K frames is concatenated with the corresponding label to form a dataset. The data is packaged into an iterable dataloader object as the input for network training.


L_h = δ(L)

dataset = ((w, h, c, frame), L_h)

dataloader = f(batchsize, dataset)

[0063] In the formula, δ is the one-hot encoding rule; w, h and c respectively represent the width, height and number of channels of the current frame, and frame represents the number of video frames; batchsize represents the number of samples selected for a single training step; the function f represents operations such as randomly shuffling the dataset, setting the batchsize, and setting the number of worker processes.
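A minimal sketch of the TSN-style segment sampling and one-hot encoding described above, assuming `num_frames >= K`. The function names are illustrative; a real pipeline would wrap this in a PyTorch-style Dataset/DataLoader.

```python
import numpy as np

def one_hot(label, num_classes):
    """The encoding rule delta: integer expression label -> one-hot vector L_h."""
    L_h = np.zeros(num_classes)
    L_h[label] = 1.0
    return L_h

def sample_tsn_frames(num_frames, K, rng=None):
    """TSN-style sampling: split the frame indices into K contiguous parts
    and draw one random frame index from each part (assumes num_frames >= K)."""
    if rng is None:
        rng = np.random.default_rng()
    edges = np.linspace(0, num_frames, K + 1).astype(int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]
```

Because each index is drawn from its own contiguous segment, the sampled sequence is always in temporal order and covers the whole video regardless of its length.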

[0064] S5 is extracting spatial features through the ResNet network. The dataloader object is input into the residual convolutional neural network ResNet50 to extract the spatial features of the facial expressions in the video sequence and obtain the extracted feature data T.


T = ResNet(dataloader)

[0065] The residual network ResNet50 is utilized as the spatial feature extraction network. A residual network effectively mitigates vanishing and exploding gradients as the number of network layers deepens. Through the identity mapping of a residual block, the network passes the current output to the next layer, and the shortcut connection introduces no additional parameters, so the computational complexity is not increased. Meanwhile, the Batch Normalization and Dropout layers used in the network help prevent problems such as model overfitting and vanishing gradients.
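The identity shortcut can be illustrated with a toy two-layer block in NumPy. This is a didactic sketch only, not the actual ResNet50 bottleneck design (which uses convolutions and batch normalization); it shows that the skip path adds the input back unchanged and carries no parameters of its own.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Minimal residual block: out = relu(F(x) + x), with F a two-layer MLP.

    The shortcut adds x unchanged (identity mapping), so the skip
    connection introduces no extra parameters or computation.
    """
    F = W2 @ relu(W1 @ x)   # the learned residual F(x)
    return relu(F + x)      # identity shortcut
```

When the residual weights are zero, the block reduces to the identity on non-negative inputs, which is why stacking such blocks cannot make optimization worse than a shallower network.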

[0066] S6 is inputting the extracted spatial feature into the hybrid attention module. The purpose of the hybrid attention module is to calculate the feature weight of the facial expression through the attention mechanism, assigning a higher weight to the ROI of facial expression change and a lower weight to regions irrelevant to facial expression change, so that the network learns the features in the attention region, extracts the dependency relationship between frames, and eliminates irrelevant features from the video. The hybrid attention module consists of a self-attention module and a spatial attention module. The self-attention module calculates the self-attention weight of an expression of a single frame on the space dimension through a fully connected layer and a sigmoid activation function, assigns the weight to the spatial feature, and obtains the spatial attention feature vector. The self-attention module only calculates weights within a single frame and ignores the information correlation between frames, so the cascaded spatial attention module applies an average pooling layer, a 2D convolution layer (with a kernel size of 3×3 and a padding size of 1), and the sigmoid activation function to the spatial attention features of multiple frames, extracts the attention weight on the frame dimension, and performs feature fusion on the features of multiple frames to obtain a feature vector fused with a space-time attention weight.

[0067] FIG. 4 is a diagram of the internal structure of a hybrid attention module provided by an embodiment of the present disclosure. As shown in FIG. 4, the spatial feature first enters the self-attention module, which calculates the feature correlation within a single frame to obtain a self-attention weight θ. The obtained self-attention weight is applied to the input feature, yielding a new self-attention feature vector F_weight1^i. The first feature fusion is then performed, and the fused feature F_att1^i is input into the spatial attention module; the expression change features between adjacent frames are calculated to obtain the spatial attention weight θ¹. A weighted calculation yields the spatial attention feature vector F_weight2^i, and the second feature fusion produces the final output feature F_att2^i of the hybrid attention module.


F_weight1^i = δ(T^i · θ)

F_weight2^i = δ(F_att1^i · θ¹)

[0068] In the formula, T^i represents the feature vector of the i-th frame extracted by the ResNet network, and δ represents the sigmoid function.

[0069] Specifically, the hybrid attention module performs two feature fusions; the first feature fusion combines the self-attention feature F_weight1^i with the input feature T^i to obtain F_att1^i.

[00002] F_att1^i = ( Σ_{i=1}^{n} F_weight1^i · T^i ) / ( Σ_{i=1}^{n} F_weight1^i )

[0070] In the formula, n represents the total number of frames of the current video. In the second feature fusion, the obtained spatial attention feature vector F_weight2^i is combined with F_att1^i to obtain F_att2^i.

[00003] F_att2^i = ( Σ_{i=1}^{n} F_weight2^i · F_att1^i ) / ( Σ_{i=1}^{n} F_weight2^i )
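The two fusions can be sketched as normalized weighted combinations over frames. The per-frame index on the fused outputs in the formulas is ambiguous, so this reading keeps a per-frame F_att1^i for the second stage; θ and θ¹ are single weight vectors standing in for the fully connected layer and for the average-pooling + 3×3 convolution stage, respectively, which are simplified away in this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hybrid_attention(T, theta, theta1):
    """Sketch of the hybrid attention fusions under stated assumptions.

    T:      (n, d) per-frame features from the spatial network.
    theta:  (d,)   self-attention weights (stand-in for the FC layer).
    theta1: (d,)   frame-attention weights (stand-in for avg-pool + 3x3 conv).
    """
    # Self-attention: one scalar weight per frame, F_weight1^i = sigmoid(T^i . theta)
    w1 = sigmoid(T @ theta)
    # First fusion: weight each frame feature and normalize over frames,
    # keeping a per-frame F_att1^i for the next stage
    F_att1 = (w1[:, None] * T) / w1.sum()
    # Frame-dimension attention: F_weight2^i = sigmoid(F_att1^i . theta1)
    w2 = sigmoid(F_att1 @ theta1)
    # Second fusion: collapse the frame dimension into one fused vector
    return (w2[:, None] * F_att1).sum(axis=0) / w2.sum()
```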

[0071] S7 is inputting the fused facial feature into the recurrent neural network for temporal feature extraction. The present disclosure selects the gated recurrent unit (GRU) as the recurrent neural network for extracting temporal features; the GRU is structurally simpler than other recurrent neural network models, especially in deeper networks. A GRU performs forgetting and selective memorization simultaneously through its gates, so the number of parameters is significantly reduced and the efficiency is higher. The temporal feature obtained by the GRU is a three-dimensional feature vector F.


F = GRU(F_att2^i) = [batchsize, frame, hidden]

[0072] In the formula, hidden is the size of the hidden layer of the GRU unit; the hidden layer size is set to 128 in the model.
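A single GRU step written out in NumPy, to show how the update and reset gates realize simultaneous forgetting and selective memorization; bias terms are omitted for brevity, and a framework implementation (e.g., a recurrent layer with hidden size 128) would of course be used in practice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step. The update gate z decides how much old state to keep
    (forgetting) and the reset gate r decides how much old state enters the
    candidate (selective memorization)."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # new hidden state
```

Compared with an LSTM, the GRU merges the forget and input decisions into the single gate z and drops the separate cell state, which is where the parameter savings come from.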

[0073] S8 is outputting the feature to the fully connected layer to obtain a prediction result. The feature vector obtained by the GRU unit is reshaped and then input into a fully connected layer to obtain the final expression classification result.

[0074] After performing the above steps, facial expression recognition on video sequences is realized. During training, the cross-entropy loss function is optimized through the stochastic gradient descent algorithm, sigmoid is utilized as the activation function, the weight decay is set to 0.0001, and the momentum is set to 0.9. The learning rate is dynamically adjusted during training, and finally the optimum result is obtained.
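The training objective and update rule can be sketched as follows. `sgd_momentum_step` is a hypothetical helper showing one parameter update with the stated momentum (0.9) and weight decay (0.0001); a softmax cross-entropy is shown as one common realization of the cross-entropy loss over one-hot labels.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, L_h):
    """Cross-entropy between softmax(logits) and a one-hot label vector L_h."""
    p = softmax(logits)
    return -float((L_h * np.log(p + 1e-12)).sum())

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update with the momentum and weight decay stated in the text."""
    grad = grad + weight_decay * w          # L2 weight decay
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```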

[0075] The experiments adopted accuracy, the confusion matrix, and the area under the receiver operating characteristic (ROC) curve as evaluation indices for expression recognition. The larger the accuracy value and the ROC area, the better the recognition effect; the confusion matrix shows the prediction accuracy for each specific expression.
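Two of these indices can be computed directly from predicted and true class labels; a minimal sketch follows (the ROC area additionally requires per-class scores, which are omitted here).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """C[i, j] = number of samples whose true class is i and predicted class is j."""
    C = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def accuracy(C):
    """Overall accuracy: correctly classified samples (the diagonal) over all samples."""
    return C.trace() / C.sum()
```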

[0076] Specifically, Table 1 compares the facial expression recognition accuracy of the method of the present disclosure with that of other methods on the CK+ dataset:

TABLE 1. Comparison of methods on the CK+ dataset

  Method               Average accuracy
  BDBN                 96.7%
  LOMo                 92.0%
  3DIR + landmarks     93.21%
  DTAGN                97.25%
  Inception-w          97.5%
  Present disclosure   98.46%

[0077] Specifically, Table 2 compares the facial expression recognition accuracy of the method of the present disclosure with that of other methods on the Oulu-CASIA dataset:

TABLE 2. Comparison of methods on the Oulu-CASIA dataset

  Method               Average accuracy
  LOMo                 74.0%
  PPDN                 84.59%
  DTAGN                81.46%
  Inception-w          85.24%
  FaceNet2ExpNet       87.7%
  Present disclosure   87.31%

[0078] Specifically, Table 3 compares the facial expression recognition accuracy of the method of the present disclosure with that of other methods on the AFEW dataset:

TABLE 3. Comparison of methods on the AFEW dataset

  Method                  Average accuracy
  Mode variational LSTM   48.83%
  Spatio-temporal RBM     46.36%
  DenseNet-161            51.4%
  Present disclosure      53.44%

[0079] It can be seen from Tables 1, 2 and 3 that the facial expression recognition method combined with the hybrid attention mechanism constructed by the present disclosure performs excellently in accuracy on the three datasets. On the CK+ and AFEW datasets, the accuracy of the method of the present disclosure exceeds that of the current mainstream methods.

[0080] FIG. 5 shows the confusion matrices obtained by the method of the present disclosure on the three datasets. The confusion matrix is a standard format for accuracy evaluation, used to compare the prediction results with the actual classification values. It can be seen from FIG. 5 that the method of the present disclosure achieves good classification results on both the CK+ and Oulu-CASIA datasets. Since the AFEW dataset is captured in natural environments, its confusion matrix differs from those of datasets collected in laboratory environments, but the results are still favorable.

[0081] Table 4 compares the ROC areas of the present disclosure on the various datasets; the ROC area is a performance index for measuring the quality of deep learning methods. The ROC area ranges from 0.5 to 1, and a classifier with a larger value has a better classification effect. It can be seen from Table 4 that the ROC areas of the method of the present disclosure on the three datasets are all much greater than 0.5, indicating that the method of the present disclosure performs well in facial expression recognition and classification.

TABLE 4. Comparison of ROC areas on different datasets

  Dataset      ROC area
  CK+          0.98
  AFEW         0.76
  Oulu-CASIA   0.90

[0082] FIG. 6 is a framework diagram of a facial expression recognition system combined with an attention mechanism provided by an embodiment of the present disclosure. As shown in FIG. 6, the system includes:

[0083] a facial picture detection unit 610, which is configured to detect a face included in each video frame in a video sequence, and extract a corresponding facial region of interest (ROI), so as to obtain a facial picture in each video frame;

[0084] a facial picture alignment unit 620, which is configured to correct the facial picture in each video frame on the basis of location information of facial feature point of the facial picture in each video frame, so that the facial picture in each video frame is aligned relative to the plane rectangular coordinate system;

[0085] a spatial feature extraction unit 630, which is configured to input the aligned facial picture in each video frame of the video sequence into a residual neural network, and extract a spatial feature of a facial expression corresponding to the facial picture;

[0086] a fused feature extraction unit 640, which is configured to input the spatial feature of the facial expression extracted from the video sequence into a hybrid attention module, the hybrid attention module calculates the feature weight of the facial expression through the attention mechanism, a weight higher than the threshold is assigned to an ROI of facial expression change and a weight lower than the threshold is assigned to a region irrelevant to the facial expression change to correlate the feature information of facial expression between video frames, a dependency relationship of facial expression between adjacent video frames is extracted, and irrelevant interference features are eliminated to acquire a fused feature of the facial expression;

[0087] a temporal feature extraction unit 650, which is configured to input the fused feature of the facial expression acquired from the video sequence into a recurrent neural network, and extract a temporal feature of the facial expression; and

[0088] a facial expression recognition unit 660, which is configured to input the temporal feature of the facial expression extracted from the video sequence into a fully connected layer, and classify and recognize the facial expression in the video based on a facial expression template pre-stored in the fully connected layer.

[0089] A picture resizing unit 670 is configured to, before inputting the aligned facial picture in each video frame in the video sequence into the residual neural network, adjust the size of the aligned facial picture uniformly to a picture of a preset size.

[0090] Specifically, for the detailed functions of various units in FIG. 6, please refer to the description in the foregoing method embodiments, and details are not repeated here.

[0091] It is easy for those skilled in the art to understand that the above descriptions are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure should all be included within the scope to be protected by the present disclosure.