Compound expression recognition method with few samples of multi-domain adversarial learning
11837021 · 2023-12-05
Assignee
Inventors
Cpc classification
G06V10/774
PHYSICS
International classification
G06V10/774
PHYSICS
Abstract
Disclosed is a compound expression recognition method with few samples of multi-domain adversarial learning. To extract diverse and complex compound expression features from few samples, multiple small sample datasets are fused and divided into expression sub-domains, and multi-domain adversarial learning is performed to improve the performance of compound expression recognition. Based on the generative adversarial network framework, the face domain and the contour-independent compound expression domain are fused in the generative network to enhance diversity and complexity, and two discriminators are designed to guide the generator. The face discriminator uses the face domain to guide the generator and to check that the generator produces expression-independent face identity attributes, so that the generator has identity diversity. The compound expression fusing discriminator fuses the basic expression domain and the contour-related compound expression domain to guide the generator and to assess the complexity of the expressions generated by the generator.
Claims
1. A compound expression recognition method with few samples of multi-domain adversarial learning, comprising: S1: collecting few sample datasets of compound expressions; S2: dividing the small sample dataset of the compound expressions into a face sub-domain, a contour-independent compound expression sub-domain, a contour-related compound expression sub-domain and a basic expression sub-domain; wherein the face sub-domain refers to the face identity set that is irrelevant to the expression, the contour-independent compound expression sub-domain refers to the compound expression features without facial contours, the contour-related compound expression sub-domain refers to the compound expression features with facial contours, the basic expression sub-domain refers to the features of six basic expressions, including happiness, sadness, surprise, anger, fear and disgust; S3: constructing a generator, a compound expression discriminator, and a face identity discriminator; wherein the generator is configured to fuse the face sub-domain and the contour-independent compound expression sub-domain to generate an image with both identity diversity and compound expression complexity; the compound expression fusing discriminator is configured to combine the contour-related compound expression sub-domain with the basic expression sub-domain, wherein by calculating a cross transition space, it guides the generator and determines the complexity of the expression in the generated compound expression image; the face discriminator is configured to assess whether the compound expression image generated by the generator adheres to the facial feature distribution within the face sub-domain, which guides the generator by providing feedback and ensures that the generated compound expression exhibits diversity in terms of identity; S4: utilizing the face sub-domain and the contour-independent compound expression sub-domain to train the generator, utilizing the face sub-domain to train the face discriminator, and utilizing the contour-related compound expression sub-domain and the basic expression sub-domain to train the compound expression fusing discriminator; S5: inputting pictures comprising the faces into the trained compound expression fusing discriminator, outputting classification vectors of a plurality of the compound expressions, selecting a component vector with the highest softmax value among the classification vectors, and obtaining classification results that match the expressions.
2. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 1, wherein the generator comprises a face encoding module, a contour-independent compound expression encoding module, a noise fusing module, a paired fused domain embedding module and a fused decoding module; wherein the face encoding module is configured to encode facial identity information that is irrelevant to the expressions; the contour-independent compound expression encoding module is configured to encode contour-independent compound expression features; the noise fusing module is configured to fuse a facial feature d.sub.f and a contour-independent compound expression feature d.sub.ce with a random noise d.sub.z to obtain a fusing feature; the paired fused domain embedding module embeds the fusing feature to form a paired fused domain feature d.sub.pair; the fused decoding module is configured to decode the paired fused domain feature d.sub.pair to generate an image I.sub.g.
3. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 2, wherein a processing process of the generator is:
$$I_g = G_{fuse}(d_{pair}) = G_{fuse}\big(\mathrm{Emb}(\mathrm{con}(\alpha \cdot d_f,\ \beta \cdot d_{ce},\ d_z))\big)$$
wherein α and β are control parameters, which are used to control a strength of the features d.sub.f and d.sub.ce in the fusing feature, α, β ∈ (0.8, 1.2] and α + β = 2, wherein con represents a channel-wise connection operation, and Emb represents the embedding module.
4. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 2, wherein the compound expression fusing discriminator comprises a compound expression recognition module, a basic expression recognition module, a compound expression fully connected layer, a basic expression fully connected layer and an expression-intersection-calculation module; the compound expression recognition module is configured to extract contour-related compound expression features; the basic expression recognition module is configured to extract basic expression features; the compound expression fully connected layer is configured to fully connect the contour-related compound expression features; the basic expression fully connected layer is configured to fully connect the basic expression features; the expression-intersection-calculation module is configured to calculate an intersection of a fully connected compound expression feature vector and a basic expression feature vector, and select the component vector with the highest softmax value as the classification result according to the intersection.
5. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 4, wherein the compound expression recognition module adopts the following equation to normalize a spectrum;
6. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 1, wherein the contour-independent compound expression sub-domain uses AU prior knowledge, 68 face landmarks, and a landmark located in a center of a forehead for division of AU areas.
7. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 1, wherein a loss function of the generator is:
L.sub.G=L.sub.C+L.sub.D
L.sub.C=−{λ.sub.G.sub.
L.sub.D=E.sub.I.sub.
8. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 7, wherein a loss function L.sub.f of the face discriminator is:
L.sub.f=E.sub.(I.sub.
9. The compound expression recognition method with few samples of multi-domain adversarial learning according to claim 8, wherein a loss function L.sub.ce of the compound expression fusing discriminator is:
L.sub.ce=E.sub.(I.sub.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
(4) In order to clearly describe the object, technical solution, and advantages of the present disclosure, the present disclosure is described in detail below with reference to the accompanying drawings and embodiments. The specific embodiments are intended solely to illustrate the present disclosure and should not be interpreted as limiting its scope in any way. In addition, the technical features involved in the various embodiments of the present disclosure described below can be combined with each other as long as they do not conflict with each other.
(5) The present disclosure provides a compound expression recognition method with few samples with multi-domain adversarial learning, which includes the following steps:
(6) Step S1 is performed to collect the few sample datasets of compound expressions and basic expressions.
(7) Step S2 is performed to divide the few sample datasets of compound expressions and basic expressions into a face sub-domain, a contour-independent compound expression sub-domain, a contour-related compound expression sub-domain and a basic expression sub-domain.
(8) Specifically, the few-sample compound expression dataset adopted in the present disclosure is established to understand the facial expression responses of people's compound emotions. It has 22 categories, 6 of them are basic expressions (happiness, fury, sadness, disgust, surprise, and fear), 1 of them is neutral expression, and 15 of them are compound expressions.
(9) The dataset has a total of 230 sequences, 22 expression categories, and 6670 expression images, which are divided into four subsets according to the requirements of multi-domain learning in the present disclosure. In the dataset, subset No. 1 has a total of 230 images of neutral expressions, which are adopted by the present disclosure to construct the face domain. The subset No. 2 includes 6 basic expression categories with a total of 1610 images, which are defined by the present disclosure as the basic expression domain. There is a total of 5060 images of the remaining 15 compound expressions, which are defined by the present disclosure as the compound expression domain. The 15 compound expressions are copied into two copies, one copy is directly used as a contour-related compound expression sub-domain without pre-processing, and the other copy is divided and constructed as a contour-independent compound expression sub-domain.
(10) The purpose of the pairing is to fuse the local information of the AU (action unit) area that is not related to the contour and the global area of the face that contains the contour, thereby expanding the data diversity of the compound expression. In order to generate face images with diversity and complexity, the pre-processing of the present disclosure uses the hard coding of the AU area. Utilizing hard coding will cause random loss of some local information. However, since the cross information from the basic expression and the contour-related compound expression domains may be supplemented during the training phase, it is possible to prevent overfitting results from a single feature under small samples.
(11) Therefore, the pre-processing of the present disclosure adopts a hard coding method for area division based on AU prior knowledge. Firstly, the present disclosure adopts the conventional 68 facial landmarks to locate facial areas. Then, the AU prior knowledge is adopted to divide the face into 5 AU areas, as shown in
(12) TABLE 1 Definition of contour-independent AU areas

| AU area | AU code | Face landmark (68-point scheme) | Definition |
|---|---|---|---|
| 1 | 1, 2, 4, 5, 6, 7 | 17, 18, 19, 20, 21, 36, 37, 38, 39, 40, 41 | left eye area |
| 2 | 1, 2, 4, 5, 6, 7 | 22, 23, 24, 25, 26, 42, 43, 44, 45, 46, 47 | right eye area |
| 3 | 9, 11 | 29, 30, 31, 32, 33, 34, 35 | nose area |
| 4 | 10, 12, 15, 16, 17, 18, 24, 25, 26 | 48, 49, 50, 51, 60, 61, 62, 67, 66, 57, 58, 59 | left area of mouth |
| 5 | 10, 12, 15, 16, 17, 18, 24, 25, 26 | 54, 53, 52, 51, 62, 63, 64, 66, 65, 57, 56, 55 | right area of mouth |
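The hard-coded area division in Table 1 can be sketched as a lookup from AU area to landmark indices. This is only an illustration under stated assumptions: the indices are read as 0-based (the largest index in the table is 67), an axis-aligned bounding box is used as the area shape, and the names `AU_AREA_LANDMARKS` and `au_area_boxes` are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# Landmark indices per AU area, copied from Table 1.
# Assumption: indices are 0-based positions in a 68-point landmark list.
AU_AREA_LANDMARKS: Dict[int, List[int]] = {
    1: [17, 18, 19, 20, 21, 36, 37, 38, 39, 40, 41],      # left eye area
    2: [22, 23, 24, 25, 26, 42, 43, 44, 45, 46, 47],      # right eye area
    3: [29, 30, 31, 32, 33, 34, 35],                      # nose area
    4: [48, 49, 50, 51, 60, 61, 62, 67, 66, 57, 58, 59],  # left area of mouth
    5: [54, 53, 52, 51, 62, 63, 64, 66, 65, 57, 56, 55],  # right area of mouth
}


def au_area_boxes(landmarks: Sequence[Point]) -> Dict[int, Box]:
    """Return one bounding box per AU area.

    Using a bounding box is an assumption made for this sketch; the
    disclosure only states that areas are divided from the landmarks.
    """
    if len(landmarks) != 68:
        raise ValueError("expected 68 (x, y) landmarks")
    boxes: Dict[int, Box] = {}
    for area, indices in AU_AREA_LANDMARKS.items():
        xs = [landmarks[i][0] for i in indices]
        ys = [landmarks[i][1] for i in indices]
        boxes[area] = (min(xs), min(ys), max(xs), max(ys))
    return boxes
```

Cropping an image to each returned box would give five local patches that carry no facial contour, which is one possible reading of "contour-independent" here.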
(13) The macro-expression dataset selected by the present disclosure mainly involves 6 basic expressions, consists of 123 subjects, and contains a total of 1236 expression sequences. If the expression peak frame serves as the basic expression, 1236 images are available.
(14) A total of 353 face identities in the two datasets are adopted as the face domain, which is used as the paired input face of the network, and the 2846 basic expressions contained therein are used as the training set and the test set. In the two small sample datasets, there are 353 face identities, 2846 basic expressions, 5060 compound expressions, 5060 contour-independent compound expressions, and a total of 7906 images.
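The totals quoted above follow directly from the per-dataset counts given earlier. The short check below reproduces that arithmetic; the variable names are illustrative, and it assumes that the 230 neutral images of the compound expression dataset correspond to 230 face identities, which is how the figure of 353 identities appears to be obtained.

```python
# Counts quoted in the description for the two small-sample datasets.
compound_dataset = {"identities": 230, "basic": 1610, "compound": 5060}
macro_dataset = {"identities": 123, "basic": 1236}

# Face domain: identities from both datasets.
face_identities = compound_dataset["identities"] + macro_dataset["identities"]

# Basic expression domain: basic-expression images from both datasets.
basic_images = compound_dataset["basic"] + macro_dataset["basic"]

# Compound expression domain comes only from the compound expression dataset.
compound_images = compound_dataset["compound"]

# Total quoted in the text: basic plus compound expression images.
total_images = basic_images + compound_images

print(face_identities, basic_images, total_images)  # 353 2846 7906
```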
(15) Compared with other large-scale expression datasets containing hundreds of thousands of images, the sample space of the datasets selected by the present disclosure is very small, which makes them convenient for carrying out multi-domain adversarial experiments with few samples. In the present disclosure, "few samples" refers to datasets whose sizes are bounded by the values given above.
(16) Step S3 is performed to construct a generator, a compound expression discriminator, and a face identity discriminator.
(17) Different from other generative adversarial networks, the present disclosure fuses the face sub-domain and the contour-independent compound expression sub-domain in the generator to generate compound expression images with diverse identities. A face discriminator and a compound expression discriminator are constructed, the cross transition space is calculated by fusing the contour-related compound expression sub-domain and the basic expression sub-domain, and the generator is thereby guided to generate face images with compound expressions. Finally, the obtained compound expression discriminator has a stable ability to discriminate both diversity and complexity.
(18) The framework of the network is shown in
(19) TABLE 2 Definitions of symbols in the network framework

| Symbol | Definition and description |
|---|---|
| I.sub.f | Input image for the face sub-domain |
| I.sub.pce | Input image for the contour-independent compound expression domain, divided based on AU area prior knowledge |
| I.sub.g | Fake face image synthesized by the generator |
| E.sub.f | Encoding of facial features |
| E.sub.ce | Encoding of compound expression features |
| d.sub.f | Face mixing module |
| d.sub.ce | Compound expression feature mixing module |
| d.sub.z | Noise mixing module |
| d.sub.pair | Embedded encoding module that mixes facial features, contour-independent compound expression features, and noise for pairing |
| G.sub.fuse | Generator that decodes mixed features to synthesize images with diverse compound expression features |
| T.sub.e | Basic expression sub-domain |
| T.sub.ce | Contour-related compound expression sub-domain |
| D.sub.f | Face ID discriminator |
| D.sub.fuse | Compound expression fusing discriminator, fusing D.sub.ce and D.sub.e |
| D.sub.ce | Compound expression recognition module |
| D.sub.e | Basic expression recognition module (e.g., happiness, surprise, sadness, etc.) |
(20) Generator: The purpose of the generator is to generate images with both identity diversity and compound expression complexity. The generator encodes contour-independent compound expression I.sub.pce and face I.sub.f according to two sets of inputs. The contour-independent compound expression feature encoding module E.sub.ce is responsible for encoding compound expression, and the facial feature encoding module E.sub.f encodes face identity information that is not related to expression, such as contour and texture. The paired fusing domain embedding module embeds and fuses the facial feature d.sub.f, the contour-independent compound expression feature d.sub.ce and random noise d.sub.z to form a paired mixed domain feature encoding d.sub.pair.
(21) Equation (1) defines the entire calculation process of the generator. By controlling the integration of face and expression features with random noise, the paired fusing domain embedding module performs the channel-wise feature fusion and embedding operation to form the generated features. Then, the generated features are decoded to generate an image containing face and compound expression information.
$$I_g = G_{fuse}(d_{pair}) = G_{fuse}\big(\mathrm{Emb}(\mathrm{con}(\alpha \cdot d_f,\ \beta \cdot d_{ce},\ d_z))\big), \quad \alpha, \beta \in (0.8, 1.2],\ \alpha + \beta = 2 \tag{1}$$
(22) The image I.sub.g is generated by the fused decoding module G.sub.fuse, which decodes the feature produced by the paired fusing domain embedding module. Here, con represents channel-wise concatenation, Emb represents the embedding encoding, and α and β are control parameters for controlling the feature strengths of d.sub.f and d.sub.ce in the embedding encoding. That is, α and β are respectively used to control the diverse effects of the generated face of the compound expression. If α>β, the generated feature is more inclined to the face identity feature I.sub.f; otherwise, the generated feature is more inclined to the compound expression feature I.sub.ce. Since α needs to control the contour-related information, and in order to ensure identity consistency through the discriminator D.sub.f, α should be greater than 0.8; β puts more emphasis on the local compound expression features of the face and controls the diversity and complexity that are generated, and therefore β should not be less than 1. Therefore, equation (1) satisfies the boundary constraints α, β ∈ (0.8, 1.2] and α+β=2.
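The weighting and channel-wise concatenation inside equation (1) can be illustrated with a small sketch. It assumes each feature is stored as a list of channels (each channel a flat list of floats) and covers only the `con(α·d_f, β·d_ce, d_z)` step; the embedding Emb and the decoder G.sub.fuse are learned networks and are not modelled. The function name `fuse_features` and the data layout are hypothetical.

```python
from typing import List

Channels = List[List[float]]  # one inner list per feature channel


def fuse_features(d_f: Channels, d_ce: Channels, d_z: Channels,
                  alpha: float = 0.9, beta: float = 1.1) -> Channels:
    """Scale the face and expression features, then concatenate along channels.

    Enforces the constraints stated with equation (1):
    alpha, beta in (0.8, 1.2] and alpha + beta = 2.
    """
    if not (0.8 < alpha <= 1.2 and 0.8 < beta <= 1.2):
        raise ValueError("alpha and beta must lie in (0.8, 1.2]")
    if abs(alpha + beta - 2.0) > 1e-9:
        raise ValueError("alpha + beta must equal 2")
    scaled_face = [[alpha * v for v in channel] for channel in d_f]
    scaled_expr = [[beta * v for v in channel] for channel in d_ce]
    noise = [list(channel) for channel in d_z]
    # Channel-wise concatenation: the channel count grows, values are not summed.
    return scaled_face + scaled_expr + noise
```

With the default α=0.9 and β=1.1 quoted in the training configuration, a one-channel face feature, a one-channel expression feature and a one-channel noise feature yield a three-channel fused feature.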
(23) Discriminator: The method of the present disclosure is different from the basic generative adversarial network in that it includes two discriminators. The face discriminator D.sub.f is responsible for identifying identity information that has nothing to do with expressions, so as to help the generator generate diverse identity information; the compound expression fusing discriminator D.sub.fuse is responsible for capturing expression-related features. It achieves this goal by combining the contour-related compound expression feature D.sub.ce and the basic expression features D.sub.e into the discriminator. This fusion allows the calculation of intersecting expression features, effectively guiding the generator to produce compound expressions that exhibit both diversity and complexity. The identification effect of D.sub.fuse is also the ultimate goal of the present disclosure. The identification ability of D.sub.fuse is mainly used in the inference and verification stage, and its performance determines the identification effect of the compound expression trained with few samples.
(24) The result matrix of compound expression fusing discriminator D.sub.fuse is initialized with the classification results of 15 compound expressions output by D.sub.ce. Then, the 6 basic expressions are filled in the corresponding positions. The compound expression fusing discriminator combines two discriminator modules to produce the final compound expression classification result, which may be expressed as:
$$D_{fuse}(D_{ce}, D_e) = F, \quad F_{i,j} = \big(\theta \cdot C_{i,j} + \mu\,(E_i + E_j)\big) \times init, \quad \theta + \mu = 1 \tag{2}$$
(25) In the equation, i and j index the classification positions of two expressions, that is, a pairwise combination of expressions. F.sub.i,j represents the classification result of the compound expression fusing discriminator, C.sub.i,j represents the classification result of the compound expression recognition module, E.sub.i and E.sub.j represent the classification results of the basic expressions, and init represents the initialization value of the compound expression matrix. Specifically, θ and μ are adjustable parameters that control how strongly the compound expression classification and the basic expression classification affect the fusion result. The goal of equation (2) is to calculate the most likely expression intersection in the fusion matrix, which includes both the result of the compound expression itself and the influence of the basic expressions. Since a mutually exclusive combination of basic expressions is impossible (for example, two sets of actions showing happiness and sadness on the face at the same time), impossible combinations of expression intersections are also avoided.
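A toy version of equation (2) is sketched below. It makes two assumptions that the text does not spell out: `init` is treated as a 0/1 mask over expression pairs, with 0 marking a mutually exclusive combination, and the fused result is read off by taking the highest-scoring pair with i < j. The function names and the three-expression example are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def fuse_scores(C: Matrix, E: List[float], init: Matrix,
                theta: float = 0.5, mu: float = 0.5) -> Matrix:
    """Compute F[i][j] = (theta * C[i][j] + mu * (E[i] + E[j])) * init[i][j]."""
    if abs(theta + mu - 1.0) > 1e-9:
        raise ValueError("theta + mu must equal 1")
    n = len(E)
    return [[(theta * C[i][j] + mu * (E[i] + E[j])) * init[i][j]
             for j in range(n)]
            for i in range(n)]


def best_pair(F: Matrix) -> Tuple[int, int]:
    """Return the pair (i, j), i < j, with the highest fused score."""
    n = len(F)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return max(pairs, key=lambda p: F[p[0]][p[1]])


# Toy example with three basic expressions (indices 0, 1, 2).
E = [0.6, 0.1, 0.3]                      # basic expression scores
C = [[0.0, 0.2, 0.7],
     [0.2, 0.0, 0.1],
     [0.7, 0.1, 0.0]]                    # compound scores per pair
init = [[0, 0, 1],
        [0, 0, 1],
        [1, 1, 0]]                       # pair (0, 1) treated as impossible
F = fuse_scores(C, E, init)
print(best_pair(F))                      # (0, 2)
```

In this example the pair (0, 2) scores 0.5·0.7 + 0.5·(0.6 + 0.3) = 0.8, while the masked pair (0, 1) is forced to 0 regardless of its raw scores.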
(26) The generator determines the diversity of samples, which is a key factor in improving the performance of compound expression recognition. A stable generator helps to generate samples stably and avoids the loss of diversity caused by the small number of samples. The present disclosure uses the spectral norm (SN) to constrain the discriminator, which in turn bounds the generator and reduces instability. Since the discriminator fuses two sub-discriminators D.sub.ce and D.sub.e, the training samples of the discriminator are derived from the contour-related compound expression domain and the basic expression domain. Therefore, there are two separate sets of networks. Generally speaking, standard SN uses power iteration to estimate the spectral norm of the weight matrix of each layer of the network, and the weight matrix is then divided by this spectral norm, which approximately bounds the Lipschitz constant of the layer and thereby avoids instability at the module level.
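For reference, standard spectral normalization as described in this paragraph can be sketched with plain power iteration. This shows only the standard technique, not the modified update rule of equation (3); the function names, the fixed iteration count and the list-of-lists matrix layout are illustrative choices, and the sketch assumes a non-zero weight matrix.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def _matvec(W: Matrix, v: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def _mat_t_vec(W: Matrix, u: List[float]) -> List[float]:
    return [sum(W[r][c] * u[r] for r in range(len(W))) for c in range(len(W[0]))]


def _normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spectral_norm(W: Matrix, n_iters: int = 50, seed: int = 0) -> float:
    """Estimate the largest singular value of W by power iteration."""
    rng = random.Random(seed)
    v = _normalize([rng.gauss(0.0, 1.0) for _ in W[0]])
    for _ in range(n_iters):
        u = _normalize(_matvec(W, v))      # left singular vector estimate
        v = _normalize(_mat_t_vec(W, u))   # right singular vector estimate
    return math.sqrt(sum(x * x for x in _matvec(W, v)))


def spectrally_normalize(W: Matrix, n_iters: int = 50) -> Matrix:
    """Divide W by its estimated spectral norm so the result has norm close to 1."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]
```

Training frameworks usually run a single power-iteration step per update and reuse the vector between steps; the fixed loop here simply keeps the sketch self-contained.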
(27) If they are all normalized by SN, it is found that D.sub.e reaches the Lipschitz constant faster than D.sub.ce according to experiments. Moreover, experiments have shown that using the SN norm of D.sub.e as a multiplier for standard spectral normalization helps to balance the normalization speed of the two sets of parameters, as defined by equation (3). Specifically, the present disclosure uses the following update rule for spectral normalization, where σ denotes the standard spectral norm of the weight matrix:
(28) In the equation, W.sub.e represents the parameters of the basic expression recognition module, W represents the parameters of the compound expression recognition module, D.sub.ce and D.sub.e have their own independent parameters and fully connected (FC) layer, and SN mainly controls the parameters of the FC layer and the former layer, so the parameters are controlled separately using independent SNs within their respective network ranges.
(29) Step S4 is performed to utilize the face sub-domain and the contour-independent compound expression sub-domain to train the generator, utilize the face sub-domain to train the face discriminator, and utilize the contour-related compound expression sub-domain and the basic expression sub-domain to train the compound expression fusing discriminator.
(30) Training loss: The overall goal of training the discriminator is to distinguish between compound expressions and faces, while verifying whether the features of compound expressions and faces are correctly separated through consistency constraints. The compound expression fusing discriminator D.sub.fuse deals with compound expression recognition tasks, where there are 15 types of compound expressions, namely K.sub.ce=15, D.sub.fuse ϵR.sup.K.sup.
L.sub.f=E.sub.(I.sub.
L.sub.ce=E.sub.(I.sub.
(31) E.sub.(I.sub.
(32) The purpose of the generator is to produce diverse and compound results that deceive the two discriminators. Therefore, it is necessary to extract as many compound expression features unrelated to facial features and contours as possible. A classification loss is required, which consists of face identity classification and compound expression classification, as defined in equation (6). The loss function is defined as follows:
L.sub.C=−{λ.sub.G.sub.
(33) In the equation, Δ.sub.G.sub.
(34) Since D.sub.fuse combines the contour-related domain and the basic expression domain of the compound expression, while the generator performs generation by pairing the face domain and the contour-independent domain of the compound expression, there are certain differences between the two sets of domains. The purpose of this disclosure is to find the cross domain of the two sets of domains as the training target, and the Wasserstein distance is adopted as the adversarial objective for generation. For this purpose, the Wasserstein distance is adapted to the characteristics of the fusion domain, and a double intersection domain loss function is designed to help the generator achieve the goal, as expressed in the following:
L.sub.D=E.sub.I.sub.
(35) I.sub.ce˜Pdata(I.sub.ce) represents I.sub.ce obeying the distribution of Pdata(I.sub.ce), I.sub.e˜Pdata(I.sub.e) represents I.sub.e obeying the distribution of Pdata(I.sub.e), and I.sub.g˜P(I.sub.g) represents I.sub.g obeying the distribution of P(I.sub.g). Equation (7) indicates that the loss is minimized when the mixed identification result of I.sub.g matches the mixed identification results of I.sub.ce and I.sub.e; that is, the generated expression should include both the compound expression and the basic expression in the sample.
(36) By combining equation (6) and equation (7), the loss function of the generator may be defined as follows:
$$L_G = L_C + L_D \tag{8}$$
(37) Through equation (8), the parameters may be updated during the generator training process.
(38) Configuration for training network: default hyperparameters of the network: α=0.9, β=1.1, θ=0.5, μ=0.5, λ.sub.G.sub.
(39) Step S5 is performed to input generated face images into the trained compound expression fusing discriminator, output classification vectors of various compound expressions, select the component vector with the highest softmax value among the classification vectors, and obtain the classification results that match the expressions.
(40) The present disclosure uses only the D.sub.fuse discriminator for inference, and this model has about 29.3 million parameters. The advantage is that the parameter complexity of the network during inference is small, while the recognition accuracy for compound expressions is high. After network training is completed, the obtained D.sub.fuse discriminator model is used for the inference process, as shown in
(41) TABLE 3 Mapping relationship between classification results and expressions

| No. of result | Code | English label |
|---|---|---|
| 1 | HS | Happily surprised |
| 2 | HD | Happily disgusted |
| 3 | SA | Sadly angry |
| 4 | AD | Angrily disgusted |
| 5 | AP | Appalled |
| 6 | HT | Hatred |
| 7 | AS | Angrily surprised |
| 8 | SS | Sadly surprised |
| 9 | DS | Disgustedly surprised |
| 10 | FS | Fearfully surprised |
| 11 | AW | Awed |
| 12 | SF | Sadly fearful |
| 13 | FD | Fearfully disgusted |
| 14 | FA | Fearfully angry |
| 15 | SD | Sadly disgusted |
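The final selection in step S5 (take the component with the highest softmax value and map it to a label in Table 3) can be sketched as follows. The 15-element score vector stands in for the output of the trained D.sub.fuse, which is not modelled here; the names `LABELS`, `softmax` and `classify` are hypothetical.

```python
import math
from typing import List, Tuple

# (code, English label) in the order of Table 3; result numbers start at 1.
LABELS: List[Tuple[str, str]] = [
    ("HS", "Happily surprised"), ("HD", "Happily disgusted"),
    ("SA", "Sadly angry"), ("AD", "Angrily disgusted"),
    ("AP", "Appalled"), ("HT", "Hatred"),
    ("AS", "Angrily surprised"), ("SS", "Sadly surprised"),
    ("DS", "Disgustedly surprised"), ("FS", "Fearfully surprised"),
    ("AW", "Awed"), ("SF", "Sadly fearful"),
    ("FD", "Fearfully disgusted"), ("FA", "Fearfully angry"),
    ("SD", "Sadly disgusted"),
]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores: List[float]) -> Tuple[int, str, str, float]:
    """Return (result number, code, label, probability) for the top class."""
    if len(scores) != len(LABELS):
        raise ValueError("expected one score per compound expression")
    probs = softmax(scores)
    k = max(range(len(probs)), key=probs.__getitem__)
    code, label = LABELS[k]
    return k + 1, code, label, probs[k]
```

For example, a score vector whose eleventh entry is the largest maps to result No. 11, code AW, label "Awed".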
(42) Compared with the related art, the above technical solutions conceived by the present disclosure may achieve the following advantageous effects.
(43) Based on the known AU prior knowledge, the present disclosure posits that compound expressions emerge during the transitional stages of basic expressions, that is, within the cross-space bridging two distinct basic expressions. Moreover, it is possible to fuse the local features of AU with the global feature encoding of the contour to generate diverse compound expressions. This fusion approach aims to enhance the available data in the few sample datasets. Specifically, in order to obtain a model that encompasses both identity diversity and expression complexity, the present disclosure combines the face domain and the contour-independent compound expression domain, extracts their features separately, and fuses them to generate images with both diversity and complexity in terms of expressions.
(44) In the meantime, to effectively identify compound expressions, two discriminators are adopted to identify identity and expression features respectively. By employing these discriminators, the mutual influence of the two sets of features is reduced, leading to more accurate and reliable compound expression identification. The present disclosure believes that the intersection of basic expressions and compound expressions may improve the capability of discriminating compound expressions. Therefore, a fusion method is adopted to fuse contour-related compound expressions and basic expression features and calculate the cross space, so as to obtain a discriminator with improved compound expression recognition performance. Specifically, in the expression discriminator, the compound expression recognition module is utilized to extract the features of the compound expression sub-domain, and the basic expression recognition module is utilized to extract the features of the basic expression sub-domain; the two sets of features are fully connected, respectively, and by constructing the two sets of fully connected results and calculating the intersection space, the recognition result of the compound expression may be obtained. The disclosure can obtain a small-size recognition model with high generalization on the few sample datasets.
(45) It is easy for those skilled in the art to understand that the above descriptions are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure should all be included within the scope to be protected by the present disclosure.