Method and terminal for generating simulated voice of virtual teacher
11727915 · 2023-08-15
Assignee
Inventors
Cpc classification
G10L25/18
PHYSICS
G10L13/033
PHYSICS
International classification
G10L15/22
PHYSICS
G10L13/033
PHYSICS
G10L25/18
PHYSICS
Abstract
Disclosed are a method and a terminal for generating simulated voices of virtual teachers. Real voice samples of teachers are collected and converted into text sequences, and a text emotion polarity training set and a text tone training set are constructed according to the text sequences; a lexical item emotion model is constructed based on lexical items in the text sequences and is trained by using the emotion polarity training set, and a word vector, an emotion polarity vector, and a weight parameter are obtained by training; the similarity between the word vector and the emotion polarity vector is calculated, emotion features are extracted according to a similarity calculation result, and a conditional vocoder is constructed according to the voice styles and emotion features to generate new voices with emotion changes. The method and the terminal contribute to satisfying the application requirements of high-quality virtual teachers.
Claims
1. A method for generating simulated voices of virtual teachers, comprising the following steps: collecting real voice samples of teachers, converting the real voice samples into text sequences, and constructing a text emotion polarity training set and a text tone training set according to the text sequences; constructing a lexical item emotion model based on lexical items in the text sequences, training the lexical item emotion model by using the text emotion polarity training set, calculating a similarity according to a word vector, an emotion polarity vector and a weight parameter obtained by training, extracting emotion features according to a similarity calculation result, and constructing voices with emotion changes based on the emotion features and voice features of the real voice samples; and obtaining a feature vector of voice rhythm information according to the voices, generating voice style features, obtaining texts to be synthesized, extracting tone features of the texts to be synthesized by using the text tone training set, and generating the simulated voices of the texts to be synthesized based on the voice style features and the tone features.
2. The method for generating simulated voices of virtual teachers according to claim 1, wherein the converting the real voice samples into text sequences comprises: denoising and editing the real voice samples, and then saving a processing result as a WAV file; weighting and framing voice signals in the WAV files, and smoothing the voice signals by windowing; and acquiring text sequences corresponding to the voice signals by using a voice conversion algorithm, filtering lexical items in the text sequences, and dividing the filtered text sequences into different paragraphs by using a segmentation algorithm.
3. The method for generating simulated voices of virtual teachers according to claim 1, wherein the constructing a text emotion polarity training set according to the text sequences comprises: removing stop words, punctuation marks and low-frequency lexical items of lexical item sequences in the text sequences, correcting grammatical errors and spelling errors of the text sequences, and labeling the parts of speech of the lexical items; acquiring emotion word lists and emotion rules of voices, and labeling the emotion polarities of the lexical items by combining the categories of the lexical items and context lexical items thereof; and constructing the text emotion polarity training set according to the emotion polarities of the lexical items and context lexical items thereof.
4. The method for generating simulated voices of virtual teachers according to claim 2, wherein the constructing a text tone training set according to the text sequences comprises: correcting punctuation marks of the text sequences divided into different paragraphs, and configuring corresponding tones for the corrected text sequences; labeling the text sequences with Pinyin according to the tones of the text sequences; and constructing the text tone training set according to the tones and Pinyin of the text sequences in different paragraphs.
5. The method for generating simulated voices of virtual teachers according to claim 1, wherein the constructing a lexical item emotion model based on lexical items in the text sequences, and training the lexical item emotion model by using the text emotion polarity training set comprise: extracting lexical items containing emotion polarities from the text sequences, and constructing a mapping relationship between the lexical items and lexical frequencies based on the extracted lexical items and the lexical frequencies thereof; constructing the lexical item emotion model based on a neural network and the mapping relationship between the lexical items and the lexical frequencies, and calculating word vectors according to the lexical item emotion model; and training the lexical item emotion model by using the text emotion polarity training set to obtain an emotion polarity vector and a weight parameter.
6. The method for generating simulated voices of virtual teachers according to claim 1, wherein the calculating a similarity according to a word vector, an emotion polarity vector and a weight parameter obtained by training, and extracting emotion features according to a similarity calculation result comprise: calculating the similarity between the word vector and the emotion polarity vector:
7. The method for generating simulated voices of virtual teachers according to claim 1, wherein the constructing voices with emotion changes based on the emotion features and voice features of the real voice samples comprises: extracting the voice features of the real voice samples by using fast Fourier transform, nonlinear transform, and filter banks; and constructing a conditional model of a vocoder by taking the emotion features and the voice features as preconditions and input variables of the neural network vocoder, and using the vocoder to generate voices with emotion changes.
8. The method for generating simulated voices of virtual teachers according to claim 1, wherein the obtaining a feature vector of voice rhythm information according to the voices, and generating voice style features and coding states comprise: converting the voice rhythm information into a rhythm feature vector by using a two-dimensional convolutional neural network, batch normalization, a rectified linear unit and a single-layer recurrent neural network layer; mining multi-rhythm features in the voices by using a pair of stacked recurrent neural network layers, giving weights to style features by using an attention mechanism, and acquiring a style coding vector; and according to the style coding vector, generating voice style features and coding states thereof.
9. The method for generating simulated voices of virtual teachers according to claim 8, wherein the obtaining texts to be synthesized, extracting tone features of the texts to be synthesized by using the text tone training set, and generating the simulated voices of the texts to be synthesized based on the voice style features and the tone features comprise: constructing a tone prediction model, training the tone prediction model by using the text tone training set, updating weight parameters in the tone prediction model by using an error back-propagation algorithm, and mapping Pinyin sub-labels to a vector with implied tone features; capturing the fluctuation changes of tones by dilated convolution, and converting them into a tone feature coding state with fixed dimensions by using a fully-connected layer; mining text feature information by using a double-layer recurrent neural network layer, and outputting feature vectors of the texts to be synthesized by a fully-connected layer and a rectified linear unit; and giving weights to the coding states of voice style features and tone features by using an attention mechanism, fusing the coding states through addition operation processing, and according to the texts to be synthesized and simulated voice features, generating voice sequences with voice styles and emotion features.
10. A terminal for generating simulated voices of virtual teachers, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the method for generating simulated voices of virtual teachers according to claim 1 when executing the computer program.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
(11) In order to explain the technical contents, achieved objectives and effects of the present disclosure, detailed description will be made below in combination with embodiments and accompanying drawings.
(12) Referring to
(13) collecting real voice samples of teachers, converting the real voice samples into text sequences, and constructing a text emotion polarity training set and a text tone training set according to the text sequences;
(14) constructing a lexical item emotion model based on lexical items in the text sequences, training the lexical item emotion model by using the text emotion polarity training set, calculating a similarity according to a word vector, an emotion polarity vector and a weight parameter obtained by training, extracting emotion features according to a similarity calculation result, and constructing voices with emotion changes based on the emotion features and voice features of the real voice samples; and
(15) obtaining a feature vector of voice rhythm information according to the voices, generating voice style features, obtaining texts to be synthesized, extracting tone features of the texts to be synthesized by using the text tone training set, and generating the simulated voices of the texts to be synthesized based on the voice style features and the tone features.
(16) It can be seen from the above description that the present disclosure has the following beneficial effects: real voice samples of teachers are collected and converted into text sequences, and thus, a text emotion polarity training set and a text tone training set are constructed according to the text sequences; a lexical item emotion model is constructed based on lexical items in the text sequences, and is trained by using the emotion polarity training set, and a word vector, an emotion polarity vector and a weight parameter are obtained by training; and the similarity between the word vector and the emotion polarity vector is calculated, and emotion features are extracted according to a similarity calculation result, and thus, a conditional vocoder is constructed according to the voice styles and emotion features to generate new voices with emotion changes. The feature vector of voice rhythm information in the new voices is extracted, and voice style features and coding states thereof are generated; then, according to the texts to be synthesized and the voice features, new voice sequences are generated and outputted. With the wide application of virtual teachers in teaching scenes such as classroom teaching, online teaching and campus activities, the method and the terminal contribute to extracting and synthesizing the voice features and emotion styles of teaching administrators, teachers and related users, and satisfy the application requirements of high-quality virtual teachers.
(17) Further, the converting the real voice samples into text sequences includes:
(18) denoising and editing the real voice samples, and then saving a processing result as a WAV file;
(19) weighting and framing voice signals in the WAV file, and smoothing the voice signals by windowing; and
(20) acquiring text sequences corresponding to the voice signals by using a voice conversion algorithm, filtering lexical items in the text sequences, and dividing the filtered text sequences into different paragraphs by using a segmentation algorithm.
(21) It can be seen from the above description that the real voice samples are preprocessed by denoising, editing, segmentation, etc., which can facilitate the subsequent generation of a text training set based on the preprocessed texts.
(22) Further, the constructing a text emotion polarity training set according to the text sequences includes:
(23) removing stop words, punctuation marks and low-frequency lexical items of lexical item sequences in the text sequences, correcting grammatical errors and spelling errors of the text sequences, and labeling the parts of speech of the lexical items;
(24) acquiring emotion word lists and emotion rules of voices, and labeling the emotion polarities of the lexical items by combining the categories of the lexical items and context lexical items thereof; and constructing the text emotion polarity training set according to the emotion polarities of the lexical items and context lexical items thereof.
(25) It can be seen from the above description that the categories of lexical items in text sequences and context lexical items thereof are used to label the emotion polarities, so that a text emotion polarity training set can be constructed by taking the emotion polarities of lexical items and context lexical items thereof as samples.
(26) Further, the constructing a text tone training set according to the text sequences includes:
(27) correcting punctuation marks of the text sequences divided into different paragraphs, and configuring corresponding tones for the corrected text sequences;
(28) labeling the text sequences with Pinyin according to the tones of the text sequences; and
(29) constructing the text tone training set according to the tones and Pinyin of the text sequences in different paragraphs.
(30) It can be seen from the above description that the text sequences are labeled with tones and then labeled with Pinyin, so that the text tone training set can be constructed by using the tones and Pinyin of the text sequences of different paragraphs.
(31) Further, the constructing a lexical item emotion model based on lexical items in the text sequences, and training the lexical item emotion model by using the text emotion polarity training set includes:
(32) extracting lexical items containing emotion polarities from the text sequences, and constructing a mapping relationship between the lexical items and lexical frequencies based on the extracted lexical items and the lexical frequencies thereof;
(33) constructing the lexical item emotion model based on a neural network and the mapping relationship between the lexical items and the lexical frequencies, and calculating word vectors according to the lexical item emotion model; and
(34) training the lexical item emotion model by using the text emotion polarity training set to obtain an emotion polarity vector and a weight parameter.
(35) It can be seen from the above description that the mapping relationship of lexical items and lexical frequencies can be obtained according to the lexical items containing emotion polarities and the lexical frequencies thereof; thus, the lexical item emotion model is established based on a neural network and the mapping relationship, and a word vector is calculated; and the text emotion polarity training set is used to train the lexical item emotion model, so that an emotion polarity vector and a weight parameter are obtained, and the subsequent calculation of the similarity of the above two vectors based on the weight parameter is facilitated.
(36) Further, the calculating a similarity according to a word vector, an emotion polarity vector and a weight parameter obtained by training, and extracting emotion features according to a similarity calculation result includes:
(37) calculating the similarity between the word vector and the emotion polarity vector:
(38) sim(e.sub.j,q.sub.j)=cov(e.sub.j,q.sub.j)/(σ(e.sub.j)σ(q.sub.j))
(39) where cov represents a covariance, σ represents a standard deviation, e.sub.j represents a word vector, and q.sub.j represents an emotion polarity vector;
(40) determining a similarity category according to the similarity between the word vector and the emotion polarity vector, and fusing the word vectors and the emotion polarity vectors according to the similarity category; and
(41) according to the similarity between the word vector and the emotion polarity vector, acquiring an emotion vector by a corresponding operation method, converting lexical item sequences into emotion polarity vector sequences, extracting the features of vector sequences by using a recurrent neural network layer, nonlinearly transforming the features of the vector sequences by two fully-connected layers, compressing, and generating emotion features.
(42) It can be seen from the above description that the similarity between the word vector and the emotion polarity vector is calculated, the word vector and emotion polarity vector are fused according to the similarity category, and the emotion features are compressed and generated by the neural network, and thus, the subsequent obtaining of the voices with emotion changes is facilitated.
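The similarity defined in (39), a covariance divided by the product of standard deviations, is the Pearson correlation coefficient. A minimal numpy sketch; the function name and the toy vectors are illustrative, not from the disclosure:

```python
import numpy as np

def emotion_similarity(e_j, q_j):
    """Pearson correlation between a word vector e_j and an emotion
    polarity vector q_j: cov(e_j, q_j) / (sigma_e * sigma_q)."""
    e_j = np.asarray(e_j, dtype=float)
    q_j = np.asarray(q_j, dtype=float)
    cov = np.mean((e_j - e_j.mean()) * (q_j - q_j.mean()))
    return cov / (e_j.std() * q_j.std())

# A vector is maximally similar to itself, maximally dissimilar to its negation.
print(round(emotion_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), 6))    # 1.0
print(round(emotion_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]), 6)) # -1.0
```

The sign and magnitude of this value can then drive the similarity-category decision and the fusion of the two vectors described in (40).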
(43) Further, the constructing voices with emotion changes based on the emotion features and voice features of the real voice samples includes:
(44) extracting the voice features of the real voice samples by using fast Fourier transform, nonlinear transform, and filter banks;
(45) constructing a conditional model of a vocoder by taking the emotion features and the voice features as preconditions and input variables of the neural network vocoder, and using the vocoder to generate voices with emotion changes.
(46) It can be seen from the above description that after the voice features of the real voice samples are extracted, a conditional model of a vocoder is constructed by taking the emotion features and voice features as preconditions and input variables of the neural network vocoder, and thus, the voices with emotion changes are generated by the vocoder in this way.
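Paragraph (44) names fast Fourier transform, nonlinear transform, and filter banks as the voice feature extractors. A minimal numpy sketch of that pipeline using mel-style triangular filter banks; the sample rate, frame size, FFT size and filter count are typical values assumed here, not taken from the disclosure:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Split the waveform into overlapping frames.
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i*hop:i*hop+frame_len] for i in range(n)])

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0**(m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft // 2 + 1) * pts / (sr / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i+1], bins[i+2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def voice_features(x, n_fft=512):
    frames = frame_signal(x) * np.hamming(400)    # windowing
    spec = np.abs(np.fft.rfft(frames, n_fft))**2  # fast Fourier transform
    fb = mel_filterbank(n_fft=n_fft)
    return np.log(spec @ fb.T + 1e-10)            # nonlinear transform

feats = voice_features(np.random.randn(16000))    # 1 s of stand-in audio
print(feats.shape)  # (98, 26)
```

The resulting log filter-bank matrix is one common form of the "voice features" that a neural vocoder can be conditioned on, alongside the emotion features.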
(47) Further, the obtaining a feature vector of voice rhythm information according to the voices, and generating voice style features include:
(48) converting the voice rhythm information into a rhythm feature vector by using a two-dimensional convolutional neural network, batch normalization, a rectified linear unit and a single-layer recurrent neural network layer;
(49) mining multi-rhythm features in the voices by using a pair of stacked recurrent neural network layers, giving weights to style features by using an attention mechanism, and acquiring a style coding vector; and
(50) according to the style coding vector, generating voice style features and coding states thereof.
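The attention step in (49), which weights style features and produces a style coding vector, can be sketched as attention pooling over per-frame rhythm features. The shapes and the random scoring vector below are assumptions for illustration, standing in for learned parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def style_coding_vector(rhythm_feats, score_w):
    """Attention-pool per-frame rhythm features (T, D) into one style
    coding vector (D,). score_w (D,) plays the role of a learned
    attention scoring vector; here it is just a placeholder."""
    scores = rhythm_feats @ score_w  # (T,) attention scores
    alpha = softmax(scores)          # attention weights, sum to 1
    return alpha @ rhythm_feats      # weighted sum over frames

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 8))  # e.g. recurrent-layer outputs over 50 frames
vec = style_coding_vector(feats, rng.standard_normal(8))
print(vec.shape)  # (8,)
```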
(51) Further, the obtaining texts to be synthesized, extracting tone features of the texts to be synthesized by using the text tone training set, and generating the simulated voices of the texts to be synthesized based on the voice style features and the tone features include:
(52) constructing a tone prediction model, training the tone prediction model by using the text tone training set, updating weight parameters in the tone prediction model by using an error back-propagation algorithm, and mapping Pinyin sub-labels to a vector with implied tone features;
(53) capturing the fluctuation changes of tones by dilated convolution, and converting them into a tone feature coding state with fixed dimensions by using a fully-connected layer;
(54) mining text feature information by using a double-layer recurrent neural network layer, and outputting feature vectors of the texts to be synthesized by a fully-connected layer and a rectified linear unit; and
(55) giving weights to the coding states of voice style features and tone features by using an attention mechanism, fusing the coding states by an addition operation, and according to the texts to be synthesized and simulated voice features, generating voice sequences with voice styles and emotion features.
(56) It can be seen from the above description that the feature vector of voice rhythm information is extracted by using a combination method, the weight is given to the style feature by using the attention mechanism, the style coding vector is extracted, and the voice style feature and a coding state thereof are generated; the fluctuation changes of tones are captured by dilated convolution, the coding state of tone features is obtained, and the fusion coding state of voices and tones is processed by an addition operation; according to the texts to be synthesized and the voice features of the real teacher, new voice sequences are generated and outputted, and thus, the emotion features and voice styles can be added to the voices of the virtual teacher.
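The tone-fluctuation step uses a dilated ("hollow") convolution, whose taps are spaced apart to widen the receptive field. A minimal 1-D sketch; the kernel values and dilation rate are illustrative:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=2):
    """'Hollow' (dilated) 1-D convolution: taps are spaced
    `dilation` steps apart, widening the receptive field."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out = []
    for i in range(len(x) - span + 1):
        out.append(sum(kernel[j] * x[i + j * dilation] for j in range(k)))
    return np.array(out)

# A difference kernel over dilated taps responds to tone rises and falls,
# i.e. the "fluctuation changes" being captured.
tones = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 2.0])
print(dilated_conv1d(tones, [-1.0, 1.0], dilation=2))  # [ 1.  2.  1. -1.]
```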
(57) Referring to
(58) The method and the terminal for generating simulated voices of virtual teachers provided by the present disclosure are suitable for the generation and application of voice emotions and style features of real teachers in the educational metaverse, and will be explained by specific embodiments below.
Embodiment I
(59) Referring to
(60) S1. Collecting real voice samples of teachers, converting the real voice samples into text sequences, and constructing a text emotion polarity training set and a text tone training set according to the text sequences.
(61) S11. Collecting real voice samples of teachers, denoising and editing the real voice samples, and then saving a processing result as a WAV file.
(62) Specifically, according to the set sampling rate, under the condition of interference-free recording, real voice samples of teachers with a preset duration are collected, noise in the real voice samples is eliminated by using a denoising algorithm, the real voice samples are edited or processed by using the labeling, deleting, inserting and moving functions of audio editing software, and an editing or processing result is saved as a waveform voice file in .wav format.
(63) S12. Weighting and framing voice signals in the WAV file, and smoothing the voice signals by windowing.
(64) Specifically, referring to
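Step S12 (weighting, framing, windowing) can be sketched as pre-emphasis followed by overlapping frames and a Hamming window. The coefficient, frame length and hop size are typical speech-processing values, not specified in the disclosure:

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame_len=400, hop=160):
    """Pre-emphasis (weighting), framing, and Hamming windowing, as in
    step S12; alpha, frame_len and hop are assumed typical values."""
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames.
    n = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i*hop:i*hop+frame_len] for i in range(n)])
    # A Hamming window smooths frame edges before spectral analysis.
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of stand-in audio at 16 kHz
print(frames.shape)  # (98, 400)
```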
(65) S13. Acquiring text sequences corresponding to the voice signals by using a voice conversion algorithm, filtering lexical items in the text sequences, and dividing the filtered text sequences into different paragraphs by using a segmentation algorithm.
(66) The voice sequence is automatically converted into the text sequence by using a voice conversion algorithm; for example, the text obtained by collecting, recognizing and converting real voice samples of a teacher is: “Don't all students like durian? Teacher loves durian very much! Because durian can be used for making many valuable products”.
(67) According to the conversion rules of a voice recognition text, the text sequence is divided into different paragraphs by a segmentation algorithm, and each paragraph is labeled with a logo <p>. The labeling result is “<p>Don't all students like durian? Teacher loves durian very much! Because durian can be used for making many valuable products.</p>”.
(68) Regular expressions are used to identify invalid and repeated lexical items, the lexical items are replaced with a common identifier <UNK> in natural language processing, and the result is saved in .txt text format.
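Step S13 in (67)-(68), paragraph labeling with <p> and regex-based replacement of invalid or repeated lexical items with <UNK>, can be sketched as follows; the specific filtering rules are assumptions:

```python
import re

def segment_paragraphs(text):
    # Wrap each non-empty paragraph in a <p>...</p> logo, per S13.
    return ["<p>" + p.strip() + "</p>" for p in text.split("\n") if p.strip()]

def filter_lexical_items(tokens):
    # Replace immediately repeated or invalid (non-word) lexical items
    # with the <UNK> identifier; these rules are illustrative.
    out, prev = [], None
    for tok in tokens:
        if tok == prev or not re.fullmatch(r"[\w'-]+|[.,!?]", tok):
            out.append("<UNK>")
        else:
            out.append(tok)
        prev = tok
    return out

print(segment_paragraphs("Teacher loves durian very much!"))
# ['<p>Teacher loves durian very much!</p>']
print(filter_lexical_items(["Teacher", "loves", "loves", "durian", "###"]))
# ['Teacher', 'loves', '<UNK>', 'durian', '<UNK>']
```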
(69) S14. Constructing a text emotion polarity data set.
(70) S141. Removing stop words, punctuation marks and low-frequency lexical items of lexical item sequences in the text sequences, correcting grammatical errors and spelling errors of the text sequences, and labeling the parts of speech of the lexical items.
(71) In the present embodiment, specific steps are as follows.
(72) S1411. Correcting grammatical errors and spelling errors of paragraph texts, for example, changing “Liek” to “Like”.
(73) S1412. Segmenting a paragraph into lexical item sequences by using a word segmentation algorithm. In the present embodiment, “/” is used for segmentation to obtain “don't/all/students/like/durian/?/Teacher/loves/durian/very much/!/Because/durian/can/be used for/making/many/valuable/products/”.
(74) S1413. According to a stop word dictionary and a lexical frequency statistics threshold, removing stop words, punctuation marks and low-frequency lexical items in the lexical item sequences to obtain “don't/all/students/like/durian//teacher/loves/durian/very much/be used for/making/many/valuable/products”.
(75) S1414. Using a part-of-speech labelling algorithm to label the parts of speech of lexical items, for example, “all (adjective) students (nouns) do not (negative words) like (verb) durian (noun) teacher (noun) loves (verb) durian (noun) very much (adverb) durian (noun) is used for making (verb) many (numeral) valuable (adjective) products (noun)”.
(76) S1415. Retaining lexical items of adjectives, verbs, adverbs, and negative words related to emotion polarities, and eliminating the lexical items of other parts of speech. The processed sequence is “all (adjective) do not (negative word) like (verb) very much (adverb) love (verb) making (verb) valuable (adjective)”.
(77) S1416. Using adjectives and verbs “like”, “love”, “making” and “valuable” as emotion words, and using adverbs and negative words “all”, “very much” and “do not” as corrections to the degree and polarities of emotion words.
(78) S142. Acquiring emotion word lists and emotion rules of voices, and labeling the emotion polarities of the lexical items by combining the categories of the lexical items and context lexical items thereof.
(79) Specifically, emotion word lists are loaded, the word list to which each lexical item belongs is comprehensively judged, and each lexical item is labeled with an emotion polarity. If a lexical item belongs to multiple word lists, the word list to which it belongs is comprehensively judged according to the emotion rules of voices in the teaching scenes, and the lexical items are labeled with five emotion polarity categories, being high positive, low positive, high negative, low negative, and neutral, in combination with the lexical items and the context part-of-speech categories.
(80) In the present embodiment, the emotion words are labeled with the emotion polarities by adopting the following steps:
(81) S1421. Judging the emotion words: “like” and “love” belong to positive emotion words in the positive word list; “have” and “make” are in neither the positive word list nor the passive word list, and are classified as neutral emotion words.
(82) S1422. Assigning values to the emotion polarities: assigning 1, 0, −1, and −1 to positive words, neutral words, passive words and negative words respectively, S.sub.like=1, S.sub.love=1, S.sub.have=0, S.sub.make=0, and S.sub.don't=−1; and assigning different numerical multiples such as S.sub.all=2, S.sub.very much=3 to adverbs of degree according to the degree of modification.
(83) S1423. Composite processing. If the emotion word is a positive word or a passive word, the lexical items of non-emotion words between the emotion word and the previous emotion word are searched for, and if the result is empty, the composite processing is not performed; if the search result is not empty, each found lexical item is processed separately: if the found lexical item is a negative word, S=S*S.sub.negative word; and if it is an adverb of degree, S=S*S.sub.adverb of degree.
(84) For example, if the adverb of degree “all” and the negative word “don't” are before the emotion word “like”, S.sub.like=S.sub.like*S.sub.don't*S.sub.all=1*(−1)*2=−2; there is only one adverb of degree before “love”, S.sub.love=S.sub.love*S.sub.very much=1*3=3.
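The polarity assignment and composite processing above can be reproduced directly; the worked values match the example (S.sub.like=1*(−1)*2=−2, S.sub.love=1*3=3). The word lists below contain only the example's lexical items:

```python
# Composite emotion polarity scoring; the word lists hold only the
# lexical items from the example in the text.
POSITIVE = {"like": 1, "love": 1}
DEGREE = {"all": 2, "very much": 3}  # adverbs of degree
NEGATION = {"don't": -1}             # negative words

def composite_score(emotion_word, modifiers):
    """Multiply the base polarity by every preceding negation and
    degree modifier, as in S = S * S_negative word * S_adverb of degree."""
    s = POSITIVE.get(emotion_word, 0)  # neutral words score 0
    for m in modifiers:
        s *= NEGATION.get(m, 1) * DEGREE.get(m, 1)
    return s

print(composite_score("like", ["don't", "all"]))  # 1 * (-1) * 2 = -2
print(composite_score("love", ["very much"]))     # 1 * 3 = 3
```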
(85) S1424. According to the range of an emotion polarity value S, labeling the emotion polarity of the emotion word:
(86)
(87) In the above emotion sequence, there is “don't” before “like”, so “like” is labeled with a strong passive emotion polarity; and there is “very much” before “love”, so “love” is labeled with a strong positive emotion polarity.
(88) S143. Constructing the text emotion polarity training set according to the emotion polarities of the lexical items and context lexical items thereof.
(89) Specifically, according to the feature that the emotion polarity of the lexical item depends on context information, a supervised learning training sample is constructed. The training sample is divided into the preceding text and the next text. The emotion word labeled with the emotion polarity is introduced as the next text of the training set, and the emotion word with the emotion polarity to be acquired is used as the preceding text of the training set. According to the learning effect, the training sample set is gradually expanded.
(90) In the present embodiment, referring to
(91) S1431. Loading a lexical item sequence {w1, w2, . . . , wn} of emotion words and a labeled emotion polarity sequence {t1, t2, . . . , tn}.
(92) S1432. Constructing training samples by using emotion words and emotion polarities thereof, taking the emotion word to be predicted in the emotion lexical item sequence as a dividing point, and dividing the lexical item sequence into the preceding text and the next text.
(93) S1433. Setting the size of a convolution kernel to be 3 and the step size to be 1, obtaining three emotion words as a convolution processing sequence from the preceding text and the next text respectively according to the order of the lexical item sequence, and sliding the window with the step size of 1 to obtain the next convolution processing sequence. When the length of the convolution processing sequence is less than 3, the current emotion words to be predicted are used as supplements.
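S1432-S1433 can be sketched as follows: the lexical item sequence is split at the emotion word to be predicted, and size-3 windows with step 1 are slid over each side, padding short sequences with the target word. The function name is illustrative:

```python
def context_windows(tokens, target_idx, k=3):
    """Build size-k, stride-1 windows from the preceding and the
    following text around the emotion word to be predicted (S1432,
    S1433). Short sequences are padded with the target word itself."""
    target = tokens[target_idx]
    def windows(seq):
        if not seq:
            return []
        # Pad short sequences with the current word to be predicted.
        seq = seq + [target] * max(0, k - len(seq))
        return [seq[i:i+k] for i in range(len(seq) - k + 1)]
    return windows(tokens[:target_idx]), windows(tokens[target_idx+1:])

pre, nxt = context_windows(["all", "don't", "like", "very much", "love"], 2)
print(pre)  # [['all', "don't", 'like']]
print(nxt)  # [['very much', 'love', 'like']]
```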
(94) S15. Generating a text tone data set.
(95) S151. Correcting punctuation marks of the text sequences divided into different paragraphs, and configuring corresponding tones for the corrected text sequences.
(96) Specifically, according to the usage specification of punctuation marks, the irregular punctuation marks in the segmented texts are corrected, and question marks, exclamation marks, pause marks and emphasis marks are set as a predefined set of punctuation marks. According to the changes of rising tones, falling tones and tones corresponding to all punctuation marks in the set, the punctuation marks that do not belong to the predefined set of punctuation marks are replaced with commas.
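The punctuation normalization in S151 keeps a predefined set of marks and replaces the rest with commas. In this sketch, the symbols chosen for the pause mark and the emphasis mark are assumptions:

```python
# Question, exclamation, pause and emphasis marks; the symbols used for
# the pause and emphasis marks here are assumptions.
PREDEFINED = {"?", "!", "、", "·"}

def normalize_punctuation(tokens):
    """Replace punctuation not in the predefined set with commas (S151)."""
    punct = set("?!,.;:、·…")
    return [t if (t not in punct or t in PREDEFINED) else "," for t in tokens]

print(normalize_punctuation(["durian", "?", "products", ";"]))
# ['durian', '?', 'products', ',']
```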
(97) S152. Labeling the text sequences with Pinyin according to the tones of the text sequences.
(98) Specifically, according to the labeling standard of Chinese Pinyin, an automatic Chinese character tone labeling tool is developed, and the Chinese characters of each paragraph of text are labeled with Pinyin in turn according to the segmentation marks. The first, second, third and fourth tones are placed after the Pinyin, and light tones are expressed by spaces. The ampersand (&) is used to separate Pinyin labels and punctuation marks, such as “Tong (tong2) Xue (xue2) Men (men) Dou (dou1) Bu (bu4) Xi (xi3) Huan (huan1) Liu (liu1) Lian (lian2) Ma (ma) &?”.
(99) The tone and Pinyin labeling result is saved as a file in a .txt format.
(100) S153. Constructing the text and tone training set according to the tones and Pinyin of the text sequences in different paragraphs.
(101) Specifically, the training samples and labels thereof are divided according to punctuation marks, and special labels are added to the samples containing polyphonic characters. A tone and Pinyin labeling sequence is loaded; separators are removed; the tone and Pinyin labeling sequence is divided into a plurality of subsequence training samples; punctuation marks at the ends of subsequences are taken as labels of the training samples; a tone training set is generated; and polyphonic characters and corresponding tones thereof are extracted from the training samples, which are labeled as samples containing polyphonic characters.
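S153 splits the tone-and-Pinyin labeling sequence into training samples, with the punctuation mark after the "&" separator as the sample label. A sketch following the format of the example in (98); the regular expression is an assumption about that format:

```python
import re

def build_tone_samples(labeled_text):
    """Split a tone-and-Pinyin labeled sequence into training samples,
    taking the punctuation after the '&' separator as the label (S153)."""
    samples = []
    # Each chunk ends with '&' followed by one punctuation mark.
    for body, mark in re.findall(r"(.+?)&\s*([?!,.、·])", labeled_text):
        samples.append((body.strip(), mark))
    return samples

seq = ("Tong (tong2) Xue (xue2) Men (men) Dou (dou1) Bu (bu4) Xi (xi3) "
       "Huan (huan1) Liu (liu1) Lian (lian2) Ma (ma) &?")
print(build_tone_samples(seq))
```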
(102) S2. Constructing a lexical item emotion model based on lexical items in the text sequences, training the lexical item emotion model by using the text emotion polarity training set, calculating a similarity according to a word vector, an emotion polarity vector and a weight parameter obtained by training, extracting emotion features according to a similarity calculation result, and constructing voices with emotion changes based on the emotion features and voice features of the real voice samples.
(103) S21. Extracting lexical items containing emotion polarities from the text sequences, and constructing a mapping relationship between the lexical items and lexical frequencies based on the extracted lexical items and the lexical frequencies thereof.
(104) In the present embodiment, according to the statistical results of lexical frequencies, lexical items whose frequencies are less than the threshold value are eliminated, and the mapping relationship is constructed for the remaining lexical items in descending order of lexical frequency.
(105) Specifically, a data vectorization algorithm is used to screen out the lexical items containing emotion polarity information, and a lexical frequency statistics algorithm is used to count the frequencies of the emotion lexical items in the collected voices. A threshold value for the lexical frequency statistics is set, and the lexical items whose frequencies are less than the threshold value are eliminated. The mapping between the emotion lexical items and integer indexes is constructed for the remaining lexical items in descending order of lexical frequency.
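A minimal sketch of the thresholding and index mapping in S21, using Python's `collections.Counter`; the token list, threshold value and the `build_index_mapping` helper are illustrative assumptions.

```python
from collections import Counter

# Hypothetical sketch of S21: count emotion lexical items, eliminate those
# below a frequency threshold, and map the remainder to integer indexes in
# descending order of frequency (ties broken alphabetically for determinism).
def build_index_mapping(emotion_tokens, min_freq=2):
    freq = Counter(emotion_tokens)
    kept = [(w, c) for w, c in freq.items() if c >= min_freq]
    kept.sort(key=lambda wc: (-wc[1], wc[0]))   # frequency, from big to small
    return {w: idx for idx, (w, _) in enumerate(kept)}

mapping = build_index_mapping(
    ["happy", "happy", "happy", "sad", "sad", "angry"], min_freq=2)
# "angry" appears only once and is eliminated by the threshold
```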
(106) S22. Constructing the lexical item emotion model based on the neural network and the mapping relationship between the lexical items and the lexical frequencies, and calculating a word vector according to the lexical item emotion model.
(107) Specifically, a word embedding layer, a one-dimensional convolution neural network, a circulation neural network layer, a fully-connected layer and a normalized exponential function are stacked in sequence according to the output and input specifications of front and back network layers in combination with the serial modeling sequence, the lexical item emotion model is constructed, the emotion polarity training sample is used as the input variable of the model, and the emotion polarity of the lexical item is used as the output result.
(108) In the present embodiment, referring to
(109) S2121. Loading a lexical item sequence {w.sub.1, w.sub.2, . . . , w.sub.n} of emotion words, and for an emotion word w.sub.i to be predicted, which has been labeled with an emotion polarity t.sub.i, respectively obtaining the preceding text {w.sub.1, w.sub.2, w.sub.3} {w.sub.2, w.sub.3, w.sub.4} . . . {w.sub.i−3, w.sub.i−2, w.sub.i−1} and the next text {w.sub.i+1, w.sub.i+2, w.sub.i+3} {w.sub.i+2, w.sub.i+3, w.sub.i+4} . . . {w.sub.n−2, w.sub.n−1, w.sub.n} of the training sample, the label of the training sample being t.sub.i.
(110) S2122. According to the mapping relationship between the emotion lexical items and the integer indexes, respectively mapping the emotion lexical items in the preceding text and the next text of the training sample into two integer sequences of preceding text {num.sub.1, num.sub.2, num.sub.3} {num.sub.2, num.sub.3, num.sub.4} . . . {num.sub.i−3, num.sub.i−2, num.sub.i−1} and next text {num.sub.i+1, num.sub.i+2, num.sub.i+3} {num.sub.i+2, num.sub.i+3, num.sub.i+4} . . . {num.sub.n−2, num.sub.n−1, num.sub.n}.
(111) S2123. Representing a weight matrix of a word embedding layer as n sets of row vectors e.sub.1, e.sub.2, . . . , e.sub.i, . . . , e.sub.n, wherein e.sub.i represents a word vector of w.sub.i, using one-hot coding to represent integer values in an integer sequence as an n-dimensional vector with only one item being 1 and the rest being all 0, for example, representing num.sub.i as an n-dimensional vector (0, 0, . . . , 1, . . . , 0) with the i-th position being 1, and calculating the word vector of w.sub.i:
(112)
e.sub.i=(0,0, . . . ,1, . . . ,0)·W;
where W is the weight matrix of the word embedding layer whose rows are e.sub.1, e.sub.2, . . . , e.sub.n, so that multiplying by the one-hot vector of num.sub.i selects its i-th row e.sub.i.
(113) S2124. According to the integer indexes of the preceding text and the next text of the training sample, respectively converting the lexical items of emotion words in the preceding text and the next text into word vectors containing emotion information to obtain {e.sub.1, e.sub.2, e.sub.3} {e.sub.2, e.sub.3, e.sub.4} . . . {e.sub.i−3, e.sub.i−2, e.sub.i−1} and {e.sub.i+1, e.sub.i+2, e.sub.i+3} {e.sub.i+2, e.sub.i+3, e.sub.i+4} . . . {e.sub.n−2, e.sub.n−1, e.sub.n}.
(114) S2125. Using two one-dimensional convolution neural networks to mine the emotion information in the preceding text and the next text respectively, splicing the processing results and using a recurrent neural network to capture the emotion polarity information of w.sub.i implied in the preceding text and the next text, and after passing through a fully-connected layer and a normalized exponential function, outputting an emotion polarity probability distribution vector ŷ.sub.l of the model-predicted emotion word w.sub.i.
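The sliding trigram windows of S2121 can be illustrated with a short Python helper; the `context_windows` function and the 0-based indexing are assumptions made for the sketch.

```python
# Hypothetical sketch of S2121: build the trigram windows of the preceding
# text and the next text around the emotion word at position i (0-based).
def context_windows(sequence, i, width=3):
    """Return lists of width-sized windows before and after position i."""
    preceding = [sequence[j:j + width] for j in range(0, i - width + 1)]
    following = [sequence[j:j + width]
                 for j in range(i + 1, len(sequence) - width + 1)]
    return preceding, following

seq = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
pre, nxt = context_windows(seq, i=3)  # emotion word w4 to be predicted
```

For w.sub.4 in a seven-word sequence this yields exactly one preceding window {w.sub.1, w.sub.2, w.sub.3} and one next window {w.sub.5, w.sub.6, w.sub.7}, matching the pattern above.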
(115) S213. Using an emotion polarity training set to train the lexical item emotion model to obtain training results and weight parameters.
(116) An initialization algorithm is used to assign values to the weights and threshold parameters in the lexical item emotion model. Based on the emotion polarity training set, a gradient descent algorithm is used to update the weight parameters iteratively, and a model-predicted accuracy threshold is set. When the accuracy of the lexical item emotion model reaches the threshold value, model training is stopped, and the model and learned weight parameters are saved in a .ckpt file. The specific steps of the gradient descent algorithm for updating the weight and threshold parameters are as follows.
(117) S2131. Defining a neuron as a basic unit of calculation in the neural network, and using a Xavier parameter initialization algorithm to initialize a weight and a threshold.
(118)
weight.sub.initialization˜U[−√(6/(n.sub.in+n.sub.out)),√(6/(n.sub.in+n.sub.out))];
bias.sub.initialization˜N[mean=0,std=1];
(119) where n.sub.in and n.sub.out are the number of input neurons and the number of output neurons, respectively.
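A minimal NumPy sketch of the Xavier initialization in S2131, matching the uniform weight bound and standard-normal bias above; the layer sizes and the `xavier_init` helper name are illustrative assumptions.

```python
import numpy as np

# Sketch of S2131: Xavier (Glorot) initialization of a layer's weights,
# uniform over ±sqrt(6 / (n_in + n_out)), with a standard-normal bias.
def xavier_init(n_in, n_out, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (n_in + n_out))
    weight = rng.uniform(-limit, limit, size=(n_in, n_out))
    bias = rng.normal(loc=0.0, scale=1.0, size=n_out)
    return weight, bias

W, b = xavier_init(64, 32)
```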
(120) S2132. Using one-hot coding to express five emotion polarity categories as a five-dimensional vector with only one item being 1 and the rest being all 0, the current emotion word to be predicted being w.sub.i, and an emotion polarity vector t.sub.i=(t.sub.i1, t.sub.i2, t.sub.i3, t.sub.i4, t.sub.i5).
(121) S2133. When training the lexical item emotion model, inputting the preceding text and the next text of the emotion word w.sub.i to be predicted, and outputting a probability distribution vector
(122)
ŷ.sub.l=(ŷ.sub.l1,ŷ.sub.l2,ŷ.sub.l3,ŷ.sub.l4,ŷ.sub.l5)
of model-predicted w.sub.i emotion polarity.
(123) S2134. Calculating the distance between t.sub.i and ŷ.sub.l by using a cross entropy loss function,
(124)
L=−Σ.sub.k=1.sup.5t.sub.ik log ŷ.sub.lk;
(125) S2135. Using a gradient descent algorithm to update weight.sub.initialization and bias.sub.initialization parameters iteratively, and searching for a parameter value enabling the value of the cross entropy loss function to be minimum, an updating formula of the gradient descent algorithm for the first time being:
weight′=weight.sub.initialization−η∇.sub.wL;
bias′=bias.sub.initialization−η∇.sub.bL;
(126) where weight′ and bias′ are the updated parameters, η is the learning rate, and ∇.sub.wL and ∇.sub.bL are the gradients of the cross entropy loss function with respect to the weight and threshold parameters.
(127) S2136. Setting the accuracy threshold value as 95%, using the gradient descent algorithm to update the parameters iteratively until the average value of the cross entropy loss function over all the training samples falls to 5%, and obtaining the weight and bias parameters to complete the training of the lexical item emotion model.
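A toy numeric sketch of S2134 and S2135: the cross entropy between the one-hot polarity vector t.sub.i and a predicted distribution, followed by one gradient descent update. All values, including the learning rate and the gradient, are made-up illustrations rather than quantities from the patent.

```python
import numpy as np

# Sketch of S2134: cross entropy distance between the one-hot label t_i and
# the predicted probability distribution y_hat over five polarity classes.
def cross_entropy(t, y_hat, eps=1e-12):
    return -float(np.sum(t * np.log(y_hat + eps)))

t = np.array([0.0, 0.0, 1.0, 0.0, 0.0])        # one-hot, 5 polarity classes
y_hat = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
loss = cross_entropy(t, y_hat)                 # = -ln(0.6)

# Sketch of S2135: one gradient descent step, parameter <- parameter - eta * gradient
eta, weight, grad_w = 0.1, 0.5, 0.2
weight_new = weight - eta * grad_w
```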
(128) S22. Acquiring the word vector and the emotion polarity vector, calculating the similarity between the word vector and the emotion polarity vector, and fusing lexical item emotion polarities.
(129) S221. Acquiring the word vector and the emotion polarity vector based on the lexical item emotion model and its weight parameters. The lexical item emotion model and its weight parameters are loaded; according to the mapping relationship between the emotion lexical items and the integer indexes and the weight parameters, the word vector with emotion information is acquired and imported into the lexical item emotion model; and the emotion polarity vector is calculated and output according to the functional relationship of the emotion polarity vector in the model.
(130) S222. Using a similarity algorithm to calculate the similarity between the word vector and the emotion polarity vector, the similarity being divided into strong correlation, weak correlation and non-correlation according to the degree of similarity. The word vector and the emotion polarity vector of the lexical item are loaded, the degree of similarity between the vectors is calculated by using the similarity algorithm, the similarity categories of the word vector and the emotion polarity vector are determined, and the similarity is set to the three categories of strong correlation, weak correlation and non-correlation according to the positive or negative sign and the magnitude of the calculation result. The steps for calculating the similarity are as follows.
(131) S2221. Acquiring a word vector e.sub.j and an emotion polarity vector q.sub.j of an emotion word w.sub.j.
(132) S2222. Using a Pearson correlation coefficient to calculate the similarity between the word vector and the emotion polarity vector:
(133)
ρ(e.sub.j,q.sub.j)=cov(e.sub.j,q.sub.j)/(σ(e.sub.j)σ(q.sub.j));
(134) where cov is a covariance and σ is a standard deviation.
(135) S2223. Dividing the degree of correlation according to the calculation result of the Pearson correlation coefficient of the two vectors:
(136)
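The Pearson correlation of S2222 can be sketched with the standard library alone; the example vectors are arbitrary, and the `pearson` helper is an assumption for illustration.

```python
from statistics import mean, pstdev

# Sketch of S2222: Pearson correlation between the word vector e_j and the
# emotion polarity vector q_j, i.e. cov(e, q) / (sigma_e * sigma_q).
def pearson(e, q):
    me, mq = mean(e), mean(q)
    cov = mean([(a - me) * (b - mq) for a, b in zip(e, q)])
    return cov / (pstdev(e) * pstdev(q))

r_pos = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])   # perfectly aligned
r_neg = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])   # perfectly opposed
```

Positive and negative signs of the result are what S2223 uses, together with the magnitude, to divide the degree of correlation.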
(137) S223. According to the similarity category, using an arithmetic mean, weighted mean or adding method, respectively, to realize the fusion of the two vectors.
(138) According to the similarity of the two vectors: if the two vectors are in strong correlation, the arithmetic mean method is used to calculate the emotion polarity information; if the two vectors are in weak correlation, the weighted mean method is used to process the emotion polarity information; and if the two vectors are in non-correlation, the word vector and the emotion polarity vector are added to obtain the emotion polarity information of the lexical item.
(139) Taking the word vector e.sub.j and the emotion polarity vector q.sub.j of the emotion word w.sub.j as an example, the weighted mean method is:
(140)
v.sub.j=(∥e.sub.j∥e.sub.j+∥q.sub.j∥q.sub.j)/(∥e.sub.j∥+∥q.sub.j∥);
(141) in the formula, ∥ is the modulus length of the vector.
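The three fusion modes of S223 can be sketched as one dispatching function. The modulus-length weighting used for the weak-correlation branch is one plausible reading of the weighted mean formula above, and the `fuse` helper and category names are assumptions for the sketch.

```python
# Hypothetical sketch of S223: fuse the word vector e and the emotion polarity
# vector q according to the similarity category — arithmetic mean for strong
# correlation, a modulus-length weighted mean for weak correlation, and
# element-wise addition for non-correlation.
def norm(v):
    return sum(x * x for x in v) ** 0.5

def fuse(e, q, category):
    if category == "strong":
        return [(a + b) / 2.0 for a, b in zip(e, q)]
    if category == "weak":
        we, wq = norm(e), norm(q)
        return [(we * a + wq * b) / (we + wq) for a, b in zip(e, q)]
    return [a + b for a, b in zip(e, q)]   # non-correlation: addition

v_strong = fuse([1.0, 3.0], [3.0, 1.0], "strong")
v_add = fuse([1.0, 0.0], [0.0, 1.0], "non-correlation")
```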
(142) S23. Constructing a conditional vocoder to output new voices with emotion changes.
(143) S231. Extracting voice features of a teacher by using fast Fourier transform, non-linear transform and a filter bank.
(144) Specifically, referring to
(145) S232. Using a recurrent neural network layer to extract the features of a vector sequence, and converting the lexical items into emotion features. According to the similarity between the word vector and the emotion polarity vector, an emotion vector is acquired by using the corresponding operation method, and the lexical item sequence is converted into an emotion polarity vector sequence; the features of the vector sequence are extracted by using the recurrent neural network layer, transformed non-linearly by two fully-connected layers, and then compressed to generate the emotion feature.
(146) In the present embodiment, the specific steps of acquiring the emotion feature are as follows.
(147) S2321. Loading the sequence {w.sub.1, w.sub.2, . . . , w.sub.n} of the lexical items of emotion words.
(148) S2322. Acquiring the word vector and the emotion polarity vector of each emotion word in the lexical item sequence, calculating a similarity, and according to the calculation result, using a corresponding fusion mode to obtain the emotion vector sequence.
(149) S2323. Using a recurrent neural network to extract the features of the emotion vector sequence, and outputting an emotion feature vector h={h.sub.1, h.sub.2, . . . , h.sub.j} after non-linear transformation by the two fully-connected layers and compression.
(150) S233. Constructing the conditional vocoder based on emotion and voice features of the teacher to generate new voices with emotion changes.
(151) Specifically, the emotion features and the voice features of the Mel-spectrogram are taken as the preconditions and input variables of a neural network vocoder respectively, and a conditional model of the vocoder is constructed; the emotion changes and the pitch and timbre features are fused by the vocoder to generate the new voices with emotion changes for subsequent voice synthesis.
(152) In the present embodiment, referring to
(153) S2331. Using the emotion feature vector h={h.sub.1, h.sub.2, . . . , h.sub.j} and a voice feature x={x.sub.1, x.sub.2, . . . , x.sub.T} of the Mel-spectrogram as the precondition and input of the vocoder respectively.
(154) S2332. Having a conditional model formula for the vocoder:
p(x|h)=Π.sub.t=1.sup.Tp(x.sub.t|x.sub.1,x.sub.2, . . . ,x.sub.t−1,h);
(155) where x.sub.t is the voice feature of the Mel-spectrogram at time t.
(156) S2333. Having a calculating formula for fusing the emotion feature h and the voice feature x of the Mel-spectrogram:
z=tanh(W.sub.1*x+V.sub.1.sup.Th)⊙σ(W.sub.2*x+V.sub.2.sup.Th);
(157) where tanh is the hyperbolic tangent function, σ is a sigmoid function, ⊙ is an element-wise (Hadamard) product, and V.sub.1, V.sub.2, W.sub.1 and W.sub.2 are weight parameters.
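The gated fusion of S2333 can be illustrated numerically, treating ⊙ as an element-wise product as in typical gated activation units; the scalar inputs stand in for the convolution and conditioning terms and are illustrative assumptions.

```python
import numpy as np

# Toy sketch of S2333: z = tanh(branch_a) ⊙ sigmoid(branch_b), where branch_a
# stands for W1*x + V1^T h and branch_b for W2*x + V2^T h. The tanh branch
# carries the signal and the sigmoid branch acts as a gate in [0, 1].
def gated_fusion(a, b):
    return np.tanh(a) * (1.0 / (1.0 + np.exp(-b)))

z = gated_fusion(np.array([0.0, 1.0]), np.array([0.0, 0.0]))
# with a zero gate input, sigmoid = 0.5, so each tanh output is halved
```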
(158) S3. Obtaining a feature vector of voice rhythm information according to the voices, generating voice style features and coding states, obtaining texts to be synthesized, and generating simulated voices of the texts to be synthesized according to the texts to be synthesized and the voice style features.
(159) S31. Generating voice style features fused with text emotion information, and using a one-to-many recurrent neural network layer to mine multi-rhythm features in the voice to obtain a voice style coding state.
(160) S311. Using a two-dimensional convolution neural network, batch normalization, a rectified linear unit, and a single-layer recurrent neural network layer to convert the rhythm information into a rhythm feature vector.
(161) Specifically, the two-dimensional convolution neural network is used to extract the voice features of the teacher and acquire the rhythm information of the tone, time domain distribution, stress and emotion. A batch normalization algorithm is used to process the multi-rhythm information in the voice. The single-layer recurrent neural network layer is used to extract the rhythm information and convert it into a rhythm feature vector of a fixed dimension.
(162) S312. Using the one-to-many recurrent neural network layer to mine the multi-rhythm feature in the voice, and assigning a feature weight to a style by using an attention mechanism to obtain a style coding vector.
(163) Specifically, the number of voice style features required to be captured is set, the one-to-many recurrent neural network layer is used to mine the rhythm feature vector to acquire the tone, pitch, timbre, rhythm and emotion voice style features of a real teacher, the attention mechanism is applied to assign a high weight to prominent features of the voice style, the style features are added for operation, and the style coding vector is generated.
(164) In the present embodiment, referring to
(165) S3121. Acquiring a rhythm feature vector pr={pr.sub.1, pr.sub.2, . . . , pr.sub.k} containing the tone, time domain distribution, stress and emotion information.
(166) S3122. Based on the tone, pitch, timbre, rhythm and emotion, constructing a five-dimensional feature voice style, taking the rhythm feature vector as an input variable of a one-to-many recurrent neural network, and outputting a feature vector {s.sub.1, s.sub.2, s.sub.3, s.sub.4, s.sub.5} of the voice style.
(167) S3123. According to the features of the voice style of the teacher, using the attention mechanism to assign different weights to the five voice style features, a formula for calculating the feature weight of the voice style being:
[α.sub.1,α.sub.2,α.sub.3,α.sub.4,α.sub.5]=softmax([score(s.sub.1,q),score(s.sub.2,q),score(s.sub.3,q),score(s.sub.4,q),score(s.sub.5,q)])
(168) where score is a scoring function and q is a query vector.
(169) S3124. Multiplying five-dimensional variables of the voice style feature by the corresponding weights, adding the operation results, and outputting a style coding vector style={style.sub.1, style.sub.2, . . . , style.sub.i}.
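S3123 and S3124 together amount to a softmax over the scores of the five style features followed by a weighted sum. The sketch below assumes a dot product with the query vector q as the scoring function, which the patent leaves unspecified; the `style_coding` helper and the example features are illustrative.

```python
import numpy as np

# Hypothetical sketch of S3123-S3124: softmax attention weights over five
# voice style feature vectors, then the weighted sum yielding the style
# coding vector. score(s, q) is assumed to be the dot product s · q.
def style_coding(features, q):
    scores = np.array([f @ q for f in features])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the five scores
    style = sum(w * f for w, f in zip(weights, features))
    return weights, style

feats = [np.ones(4) * k for k in range(5)]        # five style feature vectors
weights, style = style_coding(feats, q=np.ones(4))
```

Prominent features (here the ones with larger entries, hence larger scores) receive the higher weights, as described above.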
(170) S313. Extracting the style coding vector, and generating the voice style features and the coding state thereof.
(171) Specifically, the number of extraction modules is set, the fully-connected layer, batch normalization and the rectified linear unit are taken as a group of extraction modules, the dimension of an output coding state is set, and the extraction modules are used to perform non-linear transformation and compression operations on the style coding vector so as to generate a coding state containing the voice style features and having a fixed dimension.
(172) S32. According to the modeling order, constructing a tone prediction model, and capturing tone fluctuation changes and converting the same into a tone feature coding state.
(173) S321. According to the serialized modeling design specification and the modeling order, constructing the tone prediction model, wherein the word embedding layer, a double-layer recurrent neural network layer, the fully-connected layer and the normalized exponential function are stacked to construct the tone prediction model; the tone training samples are the input variables of the model, and the output of the model implies the probability distribution of the punctuation marks of rising tones, falling tones and tone changes.
(174) In the present embodiment, referring to
(175) S3211. Loading a Pinyin labeling sub-sequence p={p.sub.1, p.sub.2, . . . , p.sub.n} with tones.
(176) S3212. Using the word embedding layer to convert the sub-sequence into a vector sequence e={e.sub.1, e.sub.2, . . . , e.sub.n} implying tone changes.
(177) S3213. Using the double-layer recurrent neural network layer to capture the pitch and fluctuation change features of the tone in the vector sequence.
(178) S3214. Using the features captured by non-linear transformation of the fully-connected layer and compression to obtain a probability distribution vector pun={pun.sub.1, pun.sub.2, . . . , pun.sub.k} of the corresponding punctuation symbols of the sub-sequence through processing of the normalized exponential function.
(179) S322. Updating the weight parameters in the model by using a back-propagation algorithm, and mapping Pinyin sub-labels into a vector implying the tone features.
(180) A text tone training set is used to train the tone prediction model, the error back-propagation algorithm is used to update the weight parameters in the model, and a predicted accuracy threshold value is set; when the accuracy of the tone prediction model reaches the threshold value, model training is stopped. According to the tone changes of the Pinyin sub-sequences implied by the weight parameters in the word embedding layer, the weight parameters are used to map the sub-sequences into feature vectors containing the tone changes.
(181) In the present embodiment, the specific steps of the error back-propagation algorithm are as follows.
(182) S3221. In the tone prediction model, making the input of the i-th layer and the output of the (i+1)-th layer be x.sub.i and x.sub.i+1 respectively, the weight parameters of the two layers being respectively w.sub.i and w.sub.i+1.
(183) S3222. Defining a true output result as z, and calculating an error between a model predicted result and the true output result:
δ=z−x.sub.i+1.
(184) S3223. Transferring the error from the (i+1)-th layer to the i-th layer through a chain rule, and respectively calculating the errors of the i-th layer and the (i+1)-th layer:
δ.sub.i+1=w.sub.i+1δ;δ.sub.i=w.sub.iδ.sub.i+1.
(185) S3224. Calculating weight parameters of the updated i-th layer and (i+1)-th layer respectively:
w.sub.i=w.sub.i+ηδ.sub.iƒx.sub.i;
w.sub.i+1=w.sub.i+1+ηδ.sub.i+1ƒx.sub.i+1;
(186) where η is the learning rate and ƒ is the derivative of an activation function.
(187) S323. Using dilated convolution to capture the tone fluctuation changes, and using the fully-connected layer to convert the tone fluctuation changes into a tone feature coding state with a fixed dimension.
(188) Specifically, a dilated causal convolution neural network is used to capture the rule of fluctuation changes in the tone feature vector, the fluctuation changes of the tone are sequentially spliced and processed in the order of time steps, the splicing results are transformed non-linearly by the fully-connected layer, and the processing results are compressed into the tone feature coding state with the fixed dimension.
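A one-dimensional dilated causal convolution of the kind used in S323 can be sketched in a few lines: each output at time t only looks at inputs at t, t−d, t−2d, and so on, never into the future. The kernel, signal and `dilated_causal_conv1d` helper are illustrative assumptions.

```python
# Sketch of a 1-D dilated (hole) causal convolution as used in S323 to
# capture tone fluctuation changes along the time axis.
def dilated_causal_conv1d(x, kernel, dilation):
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:           # causal: never read future time steps
                acc += w * x[idx]
        out.append(acc)
    return out

# difference kernel with dilation 2: output compares x[t] with x[t-2]
y = dilated_causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel=[1.0, -1.0], dilation=2)
```

With a difference kernel, the outputs measure fluctuation across a two-step gap, which is the kind of longer-range tonal change a dilated receptive field exposes.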
(189) S33. Generating simulated voice features, wherein the attention mechanism is used to assign weights to the voice coding state and the tone coding state, and the coding states are fused through addition operation processing; according to the text to be synthesized and the simulated voice features, a voice sequence with the voice style and emotion features of the real teacher is generated and output.
(190) S331. Using the double-layer recurrent neural network layer to mine text feature information, and outputting a feature vector of the text to be synthesized via the fully-connected layer and the rectified linear unit.
(191) A neural network language model is used to convert the text to be synthesized into a text vector, and feature information in the text vector is mined by using the double-layer recurrent neural network layer to acquire the output result of the last time step, and the feature vector of the text to be synthesized is acquired through processing by a fully-connected layer and a rectified linear unit function.
(192) In the present embodiment, the specific steps of obtaining the features of texts to be synthesized are as follows.
(193) S3311. Obtaining texts to be synthesized {w.sub.1, w.sub.2, . . . , w.sub.n}.
(194) S3312. Converting the texts to be synthesized into text vectors, i.e., text={text.sub.1, text.sub.2, . . . , text.sub.n}, by using a neural network language model.
(195) S3313. Extracting the text vector features by adopting a double-layer recurrent neural network, and acquiring text structures and semantic features.
(196) S3314. Processing the text features by using a fully-connected layer and a rectified linear unit function to obtain text feature vectors to be synthesized, i.e., f={f.sub.1, f.sub.2, . . . , f.sub.k}.
(197) S332. Multi-coding state fusion. The coding states of voice styles and tone features are acquired. According to the prominent voice styles and the degree of tone fluctuation changes in the voices of the real teachers, the weight is given to the coding state of each part by using the attention mechanism, the coding states of the above two features and corresponding weights thereof are calculated by adopting the addition operation, and the simulated voice features are acquired.
(198) The coding states of the voice styles and the tone features are s.sub.state and p.sub.state respectively, and the weights given to the two coding states by the attention mechanism are weight.sub.s and weight.sub.p respectively, so the addition operation is:
feature=weight.sub.s*s.sub.state+weight.sub.p*p.sub.state
(199) S333. Real teacher style voice generation. The texts to be synthesized and the simulated voice features are used as input variables, the style and prosody features of the voice sequence are acquired by combining multi-coding state fusion results in the voice synthesizer, and according to the emotion polarities of the teacher, the voice sequence with the voice style and emotion features of the real teacher is generated and outputted.
Embodiment II
(200) Referring to
(201) In summary, according to the method and the terminal for generating simulated voices of virtual teachers provided by the present disclosure, the voice samples of real teachers are collected, the preprocessing result is saved as a WAV file after preprocessing, and voice texts are obtained by adopting a voice conversion algorithm. The grammatical errors and spelling errors of the texts are corrected; stop words, punctuation marks and low-frequency lexical items are removed; the parts of speech of the lexical items are labeled; according to the emotion rules, the emotion polarities of the lexical items are labeled; a set of punctuation marks with tone changes is defined in advance; and an automatic Chinese character tone labeling tool is developed for labeling Chinese Pinyin. According to the statistical results of lexical frequencies and in descending order of lexical frequency, the mapping relationship is constructed, and the lexical item emotion model is constructed. The text emotion polarity training set is used to train the lexical item emotion model, and the training results and weight parameters are obtained. The similarity between the word vector and the emotion polarity vector is calculated, and the fusion of the word vector and the emotion polarity vector is realized according to the similarity category; the voice style and emotion features of real teachers are extracted, and thus the conditional vocoder is constructed to generate the new voices with emotion changes.
The feature vector of voice rhythm information is extracted by using a combination method, weights are given to the style features by using the attention mechanism, the style coding vector is extracted, and the voice style features and the coding state thereof are generated; the fluctuation changes of tones are captured by dilated convolution, the coding state of tone features is obtained, and the fused coding state of voices and tones is processed by an addition operation; and according to the texts to be synthesized and the voice features of the real teacher, new voice sequences are generated and output. With the wide application of virtual teachers in classroom teaching, online teaching, campus activities and other teaching scenarios, the demand for simulated voice synthesis services is increasingly urgent. The method and the terminal contribute to extracting and synthesizing the voice features and emotion styles of teaching administrators, teachers and other related users, and satisfy the application requirements of high-quality virtual teachers.
(202) The foregoing descriptions are merely preferred embodiments of the present application but are not intended to limit the patent scope of the present application. Any equivalent modifications made to the structures or processes based on the content of the specification and the accompanying drawings of the present application for direct or indirect use in other relevant technical fields shall also be encompassed in the patent protection scope of the present application.