SPEECH EMOTION RECOGNITION METHOD AND SYSTEM BASED ON FUSED POPULATION INFORMATION
20220328065 · 2022-10-13
Assignee
Inventors
- Taihao LI (Zhejiang, CN)
- Shukai ZHENG (Zhejiang, CN)
- Yulong LIU (Zhejiang, CN)
- Guanxiong PEI (Zhejiang, CN)
- Shijie MA (Zhejiang, CN)
CPC classification
G10L25/18
PHYSICS
International classification
G10L25/18
PHYSICS
Abstract
The present invention discloses a speech emotion recognition method and system based on fused population information. The method includes the following steps: S1: acquiring a user's audio data; S2: preprocessing the audio data, and obtaining a Mel spectrogram feature; S3: cutting off a front mute segment and a rear mute segment of the Mel spectrogram feature; S4: obtaining population depth feature information through a population classification network; S5: obtaining Mel spectrogram depth feature information through a Mel spectrogram preprocessing network; S6: fusing the population depth feature information and the Mel spectrogram depth feature information through SENet to obtain fused information; and S7: obtaining an emotion recognition result from the fused information through a classification network.
Claims
1. A speech emotion recognition method based on fused population information, comprising the following steps: S1: acquiring a user's audio data, expressed as X.sub.audio, through a recording acquisition device; S2: preprocessing the acquired audio data X.sub.audio to generate a Mel spectrogram feature, expressed as X.sub.mel; S3: calculating energy of Mel spectrograms in different time frames for the generated Mel spectrogram feature X.sub.mel, cutting off a front mute segment and a rear mute segment by setting a threshold to obtain a Mel spectrogram feature, expressed as X.sub.input, with a length of T; S4: inputting the Mel spectrogram feature X.sub.input obtained in S3 into a population classification network to obtain population depth feature information, expressed as H.sub.p; S5: inputting the Mel spectrogram feature X.sub.input obtained in S3 into a Mel spectrogram preprocessing network to obtain Mel spectrogram depth feature information, expressed as H.sub.m; S6: fusing the population depth feature information H.sub.p extracted in S4 with the Mel spectrogram depth feature information H.sub.m extracted in S5 through a channel attention network SENet to obtain a fused feature, expressed as H.sub.f; and S7: inputting the fused feature H.sub.f in S6 into the population classification network through a pooling layer to perform emotion recognition; the population classification network is composed of a three-layer Long Short Term Memory (LSTM) network structure, and the S4 specifically comprises the following steps: S4_1: first, segmenting the inputted Mel spectrogram feature X.sub.input with the length of T into three Mel spectrogram segments
2. The speech emotion recognition method based on fused population information of claim 1, wherein the Mel spectrogram preprocessing network in the S5 is composed of a ResNet network and a feature map scaling (FMS) network which are cascaded, and the S5 specifically comprises the following steps: first, expanding the Mel spectrogram feature X.sub.input with the length of T into a 3D matrix; second, extracting emotion-related information from the Mel spectrogram feature X.sub.input by using the ResNet network structure and adopting a two-layer convolution and maximum pooling structure; and third, effectively combining the emotion-related information extracted by the ResNet network through an FMS network architecture to finally obtain the Mel spectrogram depth feature information H.sub.m.
3. The speech emotion recognition method based on fused population information of claim 1, wherein the S6 specifically comprises the following steps: S6_1: the population depth feature information H.sub.p is a 1D vector in space R.sup.C, where C represents a channel dimension; the Mel spectrogram depth feature information H.sub.m is a 3D matrix in space R.sup.T×W×C, where T represents a time dimension, W represents a width dimension, and C represents the channel dimension; performing global average pooling on the Mel spectrogram depth feature information H.sub.m in the time dimension T and the width dimension W through the SENet network, and converting the Mel spectrogram depth feature information H.sub.m into a C-dimensional vector to obtain a 1D vector H.sub.p_avg in the space R.sup.C; wherein
$$H_m = [H^1, H^2, H^3, \ldots, H^C]$$
where,
$$H^c = \left[ [h_{1,1}^c, h_{2,1}^c, h_{3,1}^c, \ldots, h_{T,1}^c]^{\top}, [h_{1,2}^c, h_{2,2}^c, h_{3,2}^c, \ldots, h_{T,2}^c]^{\top}, \ldots, [h_{1,W}^c, h_{2,W}^c, h_{3,W}^c, \ldots, h_{T,W}^c]^{\top} \right]$$
in addition,
$$H_{p\_avg} = [h_{p\_avg}^1, h_{p\_avg}^2, h_{p\_avg}^3, \ldots, h_{p\_avg}^C]$$
a formula of the global average pooling is as follows:
$$h_{p\_avg}^c = \frac{1}{T \times W} \sum_{i=1}^{T} \sum_{j=1}^{W} h_{i,j}^c$$
S6_2: splicing the H.sub.p_avg obtained in S6_1 with the population depth feature information H.sub.p to obtain a spliced feature H.sub.c, expressed as:
$$H_c = [H_{p\_avg}, H_p]$$
S6_3: inputting the spliced feature H.sub.c obtained in S6_2 into a two-layer fully-connected network to obtain a channel weight vector W.sub.c, where a calculation formula of the two-layer fully-connected network is as follows:
Y=W*X+b
where Y represents an output of the two-layer fully-connected network, X represents an input of the two-layer fully-connected network, W represents a weighting parameter of the two-layer fully-connected network, and b represents a bias parameter of the two-layer fully-connected network; and S6_4: multiplying the channel weight vector W.sub.c obtained in S6_3 by the Mel spectrogram depth feature information H.sub.m obtained in S5 to obtain an emotion feature matrix, and performing global average pooling on the emotion feature matrix in a dimension of T×W to obtain a fused feature, expressed as H.sub.f.
4. The speech emotion recognition method based on fused population information of claim 1, wherein the S7 specifically comprises the following steps: S7_1: after passing through the pooling layer, inputting the fused feature H.sub.f obtained in S6 into the two-layer fully-connected network to obtain a 7-dimensional feature vector H.sub.b, where 7 represents a number of all emotion categories; and S7_2: taking the 7-dimensional feature vector H.sub.b=[h.sub.b.sup.1,h.sub.b.sup.2,h.sub.b.sup.3,h.sub.b.sup.4,h.sub.b.sup.5,h.sub.b.sup.6,h.sub.b.sup.7] obtained in S7_1 as an independent variable of a Softmax operator, calculating a final value of Softmax as a probability value of an inputted audio belonging to each emotion category, and finally selecting the category with the maximum probability value as a final audio emotion category, wherein a calculation formula of Softmax is as follows:
$$\mathrm{Softmax}(h_b^i) = \frac{e^{h_b^i}}{\sum_{j=1}^{7} e^{h_b^j}}$$
Description
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0042] In order to make the objectives, technical solutions and technical effects of the present invention clearer, the present invention is further explained in detail below in combination with the accompanying drawings of the specification.
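[0043] As shown in the accompanying drawings, a speech emotion recognition system based on fused population information includes: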
[0044] a speech signal acquisition module, configured to acquire a user's speech signal; usually a high-fidelity single microphone or a microphone array is adopted to reduce the degree of distortion in speech signal acquisition;
[0045] a speech signal preprocessing module, configured to preprocess the acquired speech signal, perform endpoint detection on the speech, remove a front mute segment and a rear mute segment of the speech, and generate data that can be used by a neural network; specifically, the module converts the speech signal from a time-domain signal to a frequency-domain signal (that is, it converts the speech signal from audio samples into Mel spectrogram features) for subsequent processing through the operations of pre-emphasis, framing, windowing, short-time Fourier transform, triangular filtering and mute segment removal; wherein the speech is denoised by spectral subtraction, the speech is pre-emphasized by a Z transform method, and the Mel spectrogram features are extracted from the speech by the short-time Fourier transform method;
[0046] an emotion prediction module, configured to process Mel spectrogram features through a designed network model to predict an emotion category of the user's audio; and
[0047] a data storage module, configured to store the user's speech data and emotion label data in a database such as MySQL.
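The filtering step of the speech signal preprocessing module can be illustrated with a small, dependency-free sketch of a Mel filter bank built from triangular filters. The 128 bands match the Mel spectrogram width used in this specification; the Hz-to-Mel formula shown is one common convention, and the sample rate, FFT size and function names are assumptions made only for illustration.

```python
import math

def hz_to_mel(f):
    # One common Hz-to-Mel convention (assumed; not given in the specification).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_freqs:
            rising = (f - lo) / (mid - lo)    # left slope of the triangle
            falling = (hi - f) / (hi - mid)   # right slope of the triangle
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

# Assumed settings: 16 kHz audio and a 2048-point FFT.
bank = mel_filter_bank(n_mels=128, n_fft=2048, sample_rate=16000)
```

Multiplying each frame's power spectrum by these filters and summing per filter would give one 128-value row of the Mel spectrogram.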
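[0048] As shown in the accompanying drawings, a speech emotion recognition method based on fused population information includes the following steps: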
[0049] S1: a user's audio data, expressed as X.sub.audio, is acquired through a recording acquisition device.
[0050] S2: the acquired audio data X.sub.audio is preprocessed by pre-emphasis and short-time Fourier transform to generate a Mel spectrogram feature, expressed as X.sub.mel, wherein the Mel spectrogram is a matrix with a dimension of T′×128.
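The first operations of S2 can be sketched as follows. This is a minimal illustration of pre-emphasis, framing and windowing only, written without external libraries; the first-order filter with coefficient 0.97, the Hamming window, the frame length and the hop size are assumptions rather than values stated in this specification, and the function names are hypothetical. The short-time Fourier transform and Mel filtering would follow these operations.

```python
import math

def pre_emphasize(samples, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))
    ]

def frame_and_window(samples, frame_len, hop):
    """Split into overlapping frames and apply a Hamming window to each frame."""
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

signal = [1.0] * 8                      # toy constant signal
emphasized = pre_emphasize(signal)      # a constant signal is strongly attenuated
frames = frame_and_window(emphasized, frame_len=4, hop=2)
```

In the Z domain this pre-emphasis filter is H(z) = 1 − αz⁻¹, which boosts high frequencies relative to low ones before the spectrum is computed.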
[0051] S3: the energy of the Mel spectrogram in each time frame is calculated for the generated Mel spectrogram feature X.sub.mel, and a front mute segment and a rear mute segment are cut off by setting a threshold to obtain a Mel spectrogram feature, expressed as X.sub.input, which serves as the network input with a dimension of T×128.
[0052] The front mute segment and the rear mute segment are cut off by removing mute frames as follows: the energy of the Mel spectrogram is added up across the frequency dimension of each frame, a threshold is set, and the frames whose energy is lower than the threshold are removed.
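The mute segment removal of S3 can be sketched as below. This is only an illustration under stated assumptions: the Mel spectrogram is taken to hold non-negative energies, the threshold value is arbitrary, the function name is hypothetical, and low-energy frames between the first and last voiced frame are kept, which is one reading of cutting off only the front and rear segments.

```python
def trim_silence(mel, threshold):
    """Drop leading and trailing frames whose summed energy is below threshold.

    mel is a list of frames (T' x n_mels), each a list of per-band energies.
    """
    energy = [sum(frame) for frame in mel]            # sum over frequency bands
    voiced = [t for t, e in enumerate(energy) if e >= threshold]
    if not voiced:
        return []                                     # nothing above threshold
    return mel[voiced[0]:voiced[-1] + 1]              # keep first..last voiced

# Toy spectrogram with 6 frames and 2 bands; the threshold is an assumed value.
mel = [[0.0, 0.1], [0.0, 0.0], [2.0, 3.0], [0.1, 0.0], [4.0, 1.0], [0.0, 0.2]]
x_input = trim_silence(mel, threshold=1.0)
```

Here the two leading frames and the final frame are removed, leaving T = 3 frames.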
[0053] S4: the X.sub.input obtained in S3 is inputted into a population classification network to obtain population depth feature information H.sub.p. The population classification network is composed of a three-layer LSTM network structure. The LSTM network is a recurrent neural network structure that can effectively handle long-sequence dependence, and a multi-layer LSTM is often used for sequence-dependent data such as speech. S4 specifically includes the following steps:
[0054] S4_1: first, segmenting the inputted Mel spectrogram feature with the length of T into three Mel spectrogram segments […] of equal length in an overlapped manner, wherein the segmentation method is as follows: 0 to […] is segmented as a first segment, […] to […] is segmented as a second segment, and […] to T is segmented as a third segment; and
[0055] S4_2: inputting the three Mel spectrogram segments segmented in S4_1 into the three-layer LSTM network in turn, and then taking the last output from the LSTM network as a final state. In this way, three hidden features with a dimension of 256 are obtained for the three Mel spectrogram segments, and the three hidden features are then averaged to give the final population feature information H.sub.p. The three-layer LSTM can effectively extract information from a long sequence such as the Mel spectrogram; the text content and other information unrelated to the population information in the Mel spectrogram can be effectively removed by taking the last state of the LSTM and averaging, so that the accuracy of population information extraction can be improved.
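The segmentation and averaging of S4 can be sketched as follows. The segment length of T/2 and the offset of T/4 used here are assumptions, chosen to give three equal-length overlapping segments that together span 0 to T; `toy_encoder` is a hypothetical stand-in for the three-layer LSTM and simply returns the last frame so that the example stays runnable without a deep learning library.

```python
def split_overlapping(x_input):
    """Split T frames into three equal-length overlapping segments.

    Assumed layout: length T/2, offset T/4 (exact when T is a multiple of 4).
    """
    t = len(x_input)
    seg_len, hop = t // 2, t // 4
    return [x_input[i * hop:i * hop + seg_len] for i in range(3)]

def average_states(states):
    """Element-wise mean of equal-length feature vectors."""
    return [sum(values) / len(states) for values in zip(*states)]

def toy_encoder(segment):
    """Hypothetical stand-in for the LSTM: returns the last frame as the state."""
    return segment[-1]

x_input = [[float(t), float(2 * t)] for t in range(8)]   # T = 8 frames, 2 bands
segments = split_overlapping(x_input)
h_p = average_states([toy_encoder(seg) for seg in segments])
```

In a real model each segment would pass through the same LSTM, and the 256-dimensional final states would be averaged in the same way.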
[0056] S5: the X.sub.input obtained in S3 is inputted into a Mel spectrogram preprocessing network to obtain Mel spectrogram depth feature information H.sub.m.
[0057] The Mel spectrogram preprocessing network is composed of a ResNet network and an FMS network which are cascaded, and the specific network structure is as shown in the accompanying drawings.
[0058] The ResNet network can increase the network depth and improve the learning ability of the network while alleviating the vanishing gradient problem in deep learning; the FMS network scales the feature maps extracted by the ResNet network, which helps the ResNet network to extract useful information efficiently.
[0059] S6: the population depth feature information H.sub.p extracted in S4 is fused with the Mel spectrogram depth feature information H.sub.m extracted in S5 through a channel attention network SENet, as shown in the accompanying drawings, specifically including the following steps:
[0060] S6_1: the population depth feature information H.sub.p obtained in S4 is a 1D vector in space R.sup.C, where C represents a channel dimension; the Mel spectrogram depth feature information H.sub.m obtained in S5 is a 3D matrix in space R.sup.T×W×C, where T represents a time dimension, W represents a width dimension, and C represents the channel dimension; performing global average pooling on the H.sub.m in the time dimension T and the width dimension W through the SENet network, converting the H.sub.m into a C-dimensional vector to obtain the 1D vector H.sub.p_avg in space R.sup.C; specifically,
$$H_m = [H^1, H^2, H^3, \ldots, H^C]$$
[0061] where,
$$H^c = \left[ [h_{1,1}^c, h_{2,1}^c, h_{3,1}^c, \ldots, h_{T,1}^c]^{\top}, [h_{1,2}^c, h_{2,2}^c, h_{3,2}^c, \ldots, h_{T,2}^c]^{\top}, \ldots, [h_{1,W}^c, h_{2,W}^c, h_{3,W}^c, \ldots, h_{T,W}^c]^{\top} \right]$$
[0062] The feature after the average pooling is as follows:
$$H_{p\_avg} = [h_{p\_avg}^1, h_{p\_avg}^2, h_{p\_avg}^3, \ldots, h_{p\_avg}^C]$$
[0063] a formula of the global average pooling is as follows:
$$h_{p\_avg}^c = \frac{1}{T \times W} \sum_{i=1}^{T} \sum_{j=1}^{W} h_{i,j}^c$$
[0064] S6_2: splicing the H.sub.p_avg obtained in S6_1 with the population depth feature information H.sub.p to obtain a spliced feature H.sub.c, expressed as:
$$H_c = [H_{p\_avg}, H_p]$$
[0065] S6_3: inputting the spliced feature H.sub.c obtained in S6_2 into a two-layer fully-connected network to obtain a channel weight vector W.sub.c. Specifically, a calculation formula of the fully-connected network is as follows:
Y=W*X+b
[0066] where Y represents an output of the network, X represents an input of the network, W represents a weighting parameter of the network, and b represents a bias parameter of the network; and
[0067] S6_4: multiplying the channel weight vector W.sub.c obtained in S6_3 by the Mel spectrogram depth feature information H.sub.m obtained in S5 to obtain a fused feature H.sub.f.
[0068] The SENet automatically calculates the weighting coefficient of each channel through the network, so that the important information extracted by the network can be effectively enhanced while the weight of useless information is reduced. In addition, the SENet to which the population information is added can emphasize the extraction of information related to the pronunciation characteristics of each population, and further improve the accuracy of emotion recognition.
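Steps S6_1 to S6_4 can be sketched in a few lines of dependency-free code. The specification only gives Y = W*X + b for the two fully-connected layers, so the ReLU between the layers and the sigmoid on the output are assumptions borrowed from the usual SENet design; the toy sizes, the weights and the function names are likewise hypothetical.

```python
import math

def global_avg_pool(h_m):
    """Average a T x W x C feature map over T and W, giving a C-vector."""
    t, w, c = len(h_m), len(h_m[0]), len(h_m[0][0])
    return [
        sum(h_m[i][j][k] for i in range(t) for j in range(w)) / (t * w)
        for k in range(c)
    ]

def dense(x, weights, bias):
    """Fully-connected layer Y = W * X + b, with weights given row by row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def fuse(h_m, h_p, fc1, fc2):
    """Population-aware channel attention over h_m (S6_1 to S6_4)."""
    h_p_avg = global_avg_pool(h_m)                        # S6_1
    h_c = h_p_avg + h_p                                   # S6_2: concatenation
    hidden = [max(0.0, v) for v in dense(h_c, *fc1)]      # S6_3, ReLU assumed
    w_c = [1.0 / (1.0 + math.exp(-v)) for v in dense(hidden, *fc2)]  # sigmoid assumed
    # S6_4: scale every channel of h_m by its weight.
    scaled = [[[v * w_c[k] for k, v in enumerate(cell)] for cell in row] for row in h_m]
    return scaled, w_c

h_m = [[[1.0, 2.0]], [[3.0, 4.0]]]      # toy map with T = 2, W = 1, C = 2
h_p = [0.5, -0.5]                       # toy population feature, C = 2
fc1 = ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0])
fc2 = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])   # zero weights: every w_c is 0.5
scaled, w_c = fuse(h_m, h_p, fc1, fc2)
h_f = global_avg_pool(scaled)           # pooled C-dimensional fused vector
```

Because the population feature is concatenated before the fully-connected layers, the channel weights depend on both the spectrogram statistics and the speaker population, which is the point of the fusion.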
[0069] S7: the fused feature H.sub.f in S6 is inputted into the classification network through a pooling layer to perform emotion recognition; that is, the 3D matrix of T×128×256 is converted into a 256-dimensional 1D vector and then inputted to the classification network for emotion recognition. The classification network is composed of a 256-dimensional fully-connected layer and a 7-dimensional fully-connected layer. Finally, the probabilities of the seven emotion categories are calculated from the outputted 7-dimensional feature through a Softmax operator, and the category with the maximum probability is the final emotion category. S7 specifically includes the following steps:
[0070] S7_1: after passing through the pooling layer, inputting the H.sub.f obtained in S6 into the two-layer fully-connected network to obtain a 7-dimensional feature vector H.sub.b, where 7 represents a number of all emotion categories; and
[0071] S7_2: taking the feature vector H.sub.b=[h.sub.b.sup.1,h.sub.b.sup.2,h.sub.b.sup.3,h.sub.b.sup.4,h.sub.b.sup.5,h.sub.b.sup.6,h.sub.b.sup.7] obtained in S7_1 as an independent variable of the Softmax operator, calculating a final value of the Softmax as a probability value of an inputted audio belonging to each emotion category, and finally selecting the category with the maximum probability value as a final audio emotion category, wherein a calculation formula of the Softmax is as follows:
$$\mathrm{Softmax}(h_b^i) = \frac{e^{h_b^i}}{\sum_{j=1}^{7} e^{h_b^j}}$$
[0072] where e is a constant.
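The Softmax step of S7_2 can be illustrated as follows. The 7-dimensional input values are made up for the example, and subtracting the maximum before exponentiating is a common numerical-stability measure that is not part of this specification; it does not change the resulting probabilities.

```python
import math

def softmax(h_b):
    """Softmax over a feature vector, with the maximum subtracted for stability."""
    m = max(h_b)
    exps = [math.exp(v - m) for v in h_b]
    total = sum(exps)
    return [e / total for e in exps]

h_b = [0.1, 2.0, -1.0, 0.5, 0.0, 1.2, -0.3]   # made-up 7-dimensional output
probs = softmax(h_b)
# Index of the emotion category with the maximum probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The mapping from index to emotion name depends on how the seven categories were ordered during training, which is not stated here.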
[0073] In conclusion, the method provided by the embodiment increases the accuracy of extracting audio emotion features by fusing population information, thereby improving the emotion recognition ability of the entire model.
[0074] The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the implementation of the present application has been explained in detail above, those skilled in the art can still modify the technical solutions recorded in the above-mentioned embodiments, or make equivalent replacements of some of the technical features. Any modification or equivalent replacement within the spirit and principle of the present invention shall fall within the protection scope of the present invention.