Multi-language mixed speech recognition method
11151984 · 2021-10-19
Assignee
Inventors
CPC classification
G10L15/148
PHYSICS
G10L15/14
PHYSICS
International classification
G10L15/06
PHYSICS
Abstract
The invention discloses a multi-language mixed speech recognition method, which belongs to the technical field of speech recognition; the method comprises: step S1, configuring a multi-language mixed dictionary including a plurality of different languages; step S2, performing training according to the multi-language mixed dictionary and multi-language speech data including a plurality of different languages to form an acoustic recognition model; step S3, performing training according to multi-language text corpus including a plurality of different languages to form a language recognition model; step S4, forming the speech recognition system by using the multi-language mixed dictionary, the acoustic recognition model and the language recognition model; and subsequently, recognizing mixed speech by using the speech recognition system, and outputting a corresponding recognition result. The above technical solution has the beneficial effects of being able to support the recognition of mixed speech in multiple languages, improving the accuracy and efficiency of recognition, and thus improving the performance of the speech recognition system.
Claims
1. A multi-language mixed speech recognition method, comprising: configuring a multi-language mixed dictionary including a plurality of different languages; executing training, based on the multi-language mixed dictionary and multi-language speech data including the plurality of different languages, to form an acoustic recognition model; executing training, based on multi-language text corpus including the plurality of different languages, to form a language recognition model; forming a speech recognition system based on: the multi-language mixed dictionary, the acoustic recognition model, and the language recognition model; recognizing a mixed speech by the speech recognition system; and outputting a corresponding recognition result of the recognition of the mixed speech, wherein forming the acoustic recognition model includes: executing training, based on multi-language speech data in which the plurality of different languages are mixed and the multi-language mixed dictionary to form an acoustic model; extracting a speech feature from the multi-language speech data, and executing a frame alignment operation on the speech feature by the acoustic model to obtain an output label corresponding to the speech feature in each frame; and executing training, to form the acoustic recognition model, based on the speech feature as input data of the acoustic recognition model and the output label corresponding to the speech feature as an output label in an output layer of the acoustic recognition model.
2. The multi-language mixed speech recognition method according to claim 1, wherein the multi-language mixed dictionary is configured based on a single-language dictionary corresponding to each different language in a manner of triphone modeling.
3. The multi-language mixed speech recognition method according to claim 1, wherein the multi-language mixed dictionary is configured in a manner of triphone modeling, and when the multi-language mixed dictionary is being configured, a corresponding language mark is respectively added in front of a phone of each language contained in the multi-language mixed dictionary to distinguish phones of the plurality of different languages.
4. The multi-language mixed speech recognition method according to claim 1, wherein the acoustic model is a hidden Markov-Gaussian mixture model.
5. The multi-language mixed speech recognition method according to claim 1, wherein after the acoustic recognition model is trained, the output layer of the acoustic recognition model is adjusted by: respectively calculating a prior probability of each language, and calculating the common prior probability of silence of all kinds of languages; respectively calculating a posterior probability of each language, and calculating the posterior probability of silence; and adjusting the output layer of the acoustic recognition model based on the prior probability and the posterior probability of each language, and the prior probability and the posterior probability of silence.
6. The multi-language mixed speech recognition method according to claim 5, wherein the prior probability of each language is respectively calculated based on the following formula: P(q_j^i)=Count(q_j^i)/(Σ_{k=1}^{M_j}Count(q_j^k)+Σ_{k=1}^{M_sil}Count(q_sil^k)); wherein q_j^i represents an output label of an ith state of a jth language in the multi-language speech data, q_sil^k represents an output label of a kth state of silence, Count(q_j^k) and Count(q_sil^k) respectively represent total numbers of the corresponding output labels in the multi-language speech data, M_j represents a total number of states of the jth language, and M_sil represents a total number of states of silence.
7. The multi-language mixed speech recognition method according to claim 5, wherein the prior probability of silence is calculated based on the following formula: P(q_sil^i)=Count(q_sil^i)/(Σ_{j∈L}Σ_{k=1}^{M_j}Count(q_j^k)+Σ_{k=1}^{M_sil}Count(q_sil^k)); wherein q_sil^i represents an output label of an ith state of silence in the multi-language speech data, q_j^k represents an output label of a kth state of a jth language, Count(q_j^k) and Count(q_sil^k) respectively represent total numbers of the corresponding output labels, L represents a set of all languages, M_j represents a total number of states of the jth language, and M_sil represents a total number of states of silence.
8. The multi-language mixed speech recognition method according to claim 5, wherein the posterior probability of each language is respectively calculated based on the following formula: P(q_j^i|x)=exp(y_j^i)/(Σ_{k=1}^{M_j}exp(y_j^k)+Σ_{k=1}^{M_sil}exp(y_sil^k)); wherein x represents the speech feature, q_j^i represents an output label of an ith state of a jth language in the multi-language speech data, y_j^k represents input data of a kth state of the jth language, y_sil^k represents input data of a kth state of silence, M_j represents a total number of states of the jth language, and M_sil represents a total number of states of silence.
9. The multi-language mixed speech recognition method according to claim 5, wherein the posterior probability of silence is calculated based on the following formula: P(q_sil^i|x)=exp(y_sil^i)/(Σ_{j∈L}Σ_{k=1}^{M_j}exp(y_j^k)+Σ_{k=1}^{M_sil}exp(y_sil^k)); wherein x represents the speech feature, q_sil^i represents an output label of an ith state of silence, y_j^k represents input data of a kth state of a jth language, y_sil^k represents input data of a kth state of silence, L represents a set of all languages, M_j represents a total number of states of the jth language, and M_sil represents a total number of states of silence.
10. The multi-language mixed speech recognition method according to claim 1, wherein the acoustic recognition model is an acoustic model of a deep neural network.
11. The multi-language mixed speech recognition method according to claim 1, wherein the language recognition model is formed by training based on an n-Gram model, or a recurrent neural network.
12. The multi-language mixed speech recognition method according to claim 1, wherein, after the speech recognition system is formed, weight adjustment is executed on different kinds of languages in the speech recognition system at first; steps of executing the weight adjustment comprise: respectively determining a posterior probability weight value of each language according to real speech data; and respectively adjusting the posterior probability of each language based on the posterior probability weight value to complete the weight adjustment.
13. The multi-language mixed speech recognition method according to claim 12, wherein the weight adjustment is performed based on the following formula:
P̂(q_j^i|x)=a_j·P(q_j^i|x); wherein, q_j^i is used for representing the output label of the ith state of the jth language in the multi-language speech data; x is used for representing the speech feature; P(q_j^i|x) is used for representing the posterior probability of the output label q_j^i in the multi-language speech data; a_j is used for representing the posterior probability weight value of the jth language in the multi-language speech data; and P̂(q_j^i|x) is used for representing the posterior probability of the output label q_j^i in the multi-language speech data after the weight adjustment.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
(7) A clear and complete description of technical solutions in the embodiments of the present invention will be given below, in combination with the drawings in the embodiments of the present invention. Apparently, the embodiments described below are merely a part, but not all, of the embodiments of the present invention. All of other embodiments, obtained by those of ordinary skill in the art based on the embodiments of the present invention without any creative effort, fall into the protection scope of the present invention.
(8) It should be noted that the embodiments in the present invention and the features in the embodiments can be combined with each other without conflict.
(9) A further description of the present invention is given below in combination with the drawings and specific embodiments, but is not used as a limitation to the present invention.
(10) Based on the above problems existing in the prior art, the present invention provides a multi-language mixed speech recognition method. The so-called mixed speech refers to speech data in which a plurality of different languages are mixed. For example, when a user speaks a sentence meaning "I need a USB interface", the segment of speech includes not only Chinese speech but also the English proper noun "USB", so the segment of speech is mixed speech. In other embodiments of the present invention, the mixed speech can also be a mixture of two or more languages, which is not limited herein.
(11) In the above multi-language mixed speech recognition method, a speech recognition system for recognizing mixed speech needs to be formed at first. As shown in the drawings, the method for forming the speech recognition system specifically comprises:
(12) step S1, configuring a multi-language mixed dictionary including a plurality of different languages;
(13) step S2, performing training according to the multi-language mixed dictionary and multi-language speech data including a plurality of different languages to form an acoustic recognition model;
(14) step S3, performing training according to multi-language text corpus including a plurality of different languages to form a language recognition model; and
(15) step S4, forming the speech recognition system by using the multi-language mixed dictionary, the acoustic recognition model and the language recognition model.
(16) After the speech recognition system is formed, the mixed speech can be recognized by using the speech recognition system, and a corresponding recognition result is output.
(17) Specifically, in the present embodiment, the multi-language mixed dictionary is a mixed dictionary including a plurality of different languages, and the mixed dictionary is configured at the phone level. In a preferred embodiment of the present invention, the above mixed dictionary is configured in a manner of triphone modeling, which yields a more stable dictionary model than word-level modeling. In addition, since the dictionaries of different languages may contain phones represented by the same characters, when the mixed dictionary is configured, a corresponding language mark needs to be respectively added in front of the phones of each language included in the multi-language mixed dictionary to distinguish the phones of the plurality of different languages.
(18) For example, both the Chinese and English phone sets include "b", "d" and other phones. To distinguish them, a language mark is added in front of every phone of the English phone set (for example, "en" is added to serve as a prefix), so as to distinguish the English phone set from the Chinese phone set, as shown in the drawings.
(19) The language mark of one of the languages can be empty. For example, if there are two languages in the mixed dictionary, the language mark only needs to be added to one of them to distinguish the two languages; similarly, if there are three languages in the mixed dictionary, language marks only need to be added to two of them to distinguish the three languages, and so on.
(20) In the above mixed dictionary, the language mark can also be added between the phone sets that may cause confusion, for example, one mixed dictionary includes Chinese, English and other languages, and only the Chinese and English phone sets may be confused, therefore the language mark only needs to be added in front of the English phone set.
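The language-marking scheme described above can be sketched in code. The following is a hypothetical illustration, not taken from the patent: the function name, the prefix format "en_", and the toy phone transcriptions are all assumptions. Each single-language lexicon is merged into one mixed dictionary, and the phones of marked languages receive a language prefix so that colliding symbols such as "b" and "d" stay distinct.

```python
# Illustrative sketch of building a multi-language mixed dictionary.
# Phones of languages listed in marked_langs get a language-mark prefix,
# so identical phone symbols from different languages stay distinct.
def merge_lexicons(lexicons, marked_langs):
    """lexicons: {language: {word: [phones]}}; marked_langs: languages
    whose phones receive a prefix such as 'en_'."""
    mixed = {}
    for lang, entries in lexicons.items():
        prefix = f"{lang}_" if lang in marked_langs else ""
        for word, phones in entries.items():
            mixed[word] = [prefix + p for p in phones]
    return mixed

# Toy lexicons (phone transcriptions are made up for illustration).
zh = {"需要": ["x", "u", "y", "ao"]}
en = {"USB": ["y", "uw", "eh", "s", "b", "iy"]}
mixed = merge_lexicons({"zh": zh, "en": en}, marked_langs={"en"})
```

Marking only the languages whose phone sets may collide, as paragraph (20) suggests, corresponds to passing a smaller `marked_langs` set.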
(21) In the present embodiment, after the multi-language mixed dictionary is formed, training is performed according to the multi-language mixed dictionary and the multi-language speech data including a plurality of different languages to form the acoustic recognition model. Specifically, the multi-language speech data are mixed speech data including a plurality of different languages, prepared in advance for training, and the mixed dictionary provides the phones of the different languages in the process of forming the acoustic recognition model. Therefore, in order to obtain the triphone relationships of the mixed-language phones in this training process, the above multi-language speech data in which a plurality of different languages are mixed need to be prepared, and the operation is performed according to the multi-language mixed dictionary formed above.
(22) In the present embodiment, training is performed according to multi-language text corpus in which a plurality of different languages are mixed to form a language recognition model, the multi-language mixed dictionary, the acoustic recognition model and the language recognition model are included in a speech recognition system, and the mixed speech input by a user and including a plurality of languages is recognized according to the speech recognition system to output a recognition result.
(23) In the present embodiment, after the above processing, the recognition process of the mixed speech is similar to that of single-language speech in the prior art: the speech features in a segment of speech data are recognized as corresponding phones or word sequences by the acoustic recognition model, and the word sequences are recognized as a complete sentence by the language recognition model, thereby completing the recognition of the mixed speech. The recognition process will not be described in detail herein.
(24) In summary, in the technical solution of the present invention, the multi-language mixed dictionary including a plurality of languages is formed according to a plurality of language dictionaries of single languages at first, and language marks are added to the phones of the different languages for distinguishing. Then, training is performed according to the multi-language mixed speech data and the multi-language mixed dictionary to form an acoustic recognition model, and training is performed according to the multi-language mixed text corpus to form a language recognition model. Then, a complete speech recognition system is formed according to the multi-language mixed dictionary, the acoustic recognition model and the language recognition model to recognize the multi-language mixed speech input by the user.
(25) In a preferred embodiment of the present invention, as shown in the drawings, the above step S2 of forming the acoustic recognition model specifically comprises:
(26) step S21, performing training according to multi-language speech data in which a plurality of different languages are mixed and the multi-language mixed dictionary to form an acoustic model;
(27) step S22, extracting a speech feature from the multi-language speech data, and performing a frame alignment operation on the speech feature by using the acoustic model to obtain an output label corresponding to the speech feature in each frame; and
(28) step S23, using the speech feature as input data of the acoustic recognition model, and using the output label corresponding to the speech feature as an output label in an output layer of the acoustic recognition model to perform training to form the acoustic recognition model.
(29) Specifically, in the present embodiment, before the acoustic recognition model is formed by training, training is performed according to the multi-language speech data in which the plurality of different languages are mixed to form an acoustic model. The acoustic model can be a hidden Markov model-Gaussian mixture model (HMM-GMM). In view of the parameter re-estimation robustness problem in triphone modeling, a parameter sharing technique can be adopted in the process of training the acoustic model, thereby reducing the parameter scale. The modeling technique of the HMM-GMM acoustic model is quite mature at present, and thus will not be described herein again.
(30) In the present embodiment, after the acoustic model is formed, the frame alignment operation needs to be performed on the multi-language speech data by using the acoustic model, so that the speech feature extracted from each frame of the multi-language speech data corresponds to an output label. Specifically, after the frame alignment, the speech feature in each frame corresponds to a GMM serial number. The output label in the output layer of the acoustic recognition model is the label corresponding to the speech feature in each frame; therefore, the number of output labels in the output layer of the acoustic recognition model equals the number of GMMs in the HMM-GMM model, and each output node corresponds to one GMM.
(31) In the present embodiment, the speech feature is used as input data of the acoustic recognition model, and the output label corresponding to the speech feature is used as the output label in the output layer of the acoustic recognition model to perform training to form the acoustic recognition model.
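The frame-alignment step can be read as producing one (feature, label) pair per frame. Below is a minimal sketch under the assumption that the alignment is already available as a list of GMM serial numbers; the function and variable names are illustrative, not from the patent.

```python
import numpy as np

# After HMM-GMM forced alignment, each feature frame carries the serial
# number of the GMM it was aligned to; these pairs form the training data
# for the neural acoustic recognition model.
def make_training_pairs(features, alignment):
    """features: (T, D) per-frame speech features;
    alignment: length-T sequence of GMM serial numbers."""
    X = np.asarray(features, dtype=np.float32)   # network input
    y = np.asarray(alignment, dtype=np.int64)    # output-layer labels
    assert len(X) == len(y), "one output label per feature frame"
    return X, y
```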
(32)
(33) In the preferred embodiment of the present invention, in the above step S23, after the acoustic recognition model is trained, adjustment operations such as prior correction need to be performed on the output layer of the acoustic recognition model in view of the plurality of languages, as shown in the drawings, specifically comprising:
(34) step S231, respectively calculating a prior probability of each language, and calculating the common prior probability of silence of all kinds of languages;
(35) step S232, respectively calculating a posterior probability of each language, and calculating the posterior probability of silence; and
(36) step S233, adjusting the output layer of the acoustic recognition model according to the prior probability and the posterior probability of each language, and the prior probability and the posterior probability of silence.
(37) Specifically, in a preferred embodiment of the present invention, when the acoustic recognition model is used for performing speech recognition, with respect to a given speech feature, the character string of the output result thereof is usually determined by the following formula:
ŵ=arg max_w P(x|w)P(w)/P(x); (1)
(38) wherein, ŵ is used for representing the character string of the output result, w represents a possible character string, x represents an input speech feature, P(w) is used for representing the probability of the above language recognition model, and P(x|w) is used for representing the probability of the above acoustic recognition model.
(39) Then the above P(x|w) can be further expanded as:
P(x|w)=Σ_{q_0,…,q_T} π(q_0)Π_{t=1}^{T}P(q_t|q_{t−1})P(x_t|q_t); (2)
(41) wherein, x_t is used for representing the speech feature input at the moment t, q_t is used for representing a triphone state bound at the moment t, π(q_0) is used for representing the probability distribution of the initial state q_0, P(q_t|q_{t−1}) is used for representing the transition probability from the state q_{t−1} to the state q_t, and P(x_t|q_t) is used for representing the probability that the speech feature x_t is emitted at the state q_t.
(42) Then, the above P(x.sub.t|q.sub.t) can be further expanded as:
P(x_t|q_t)=P(q_t|x_t)P(x_t)/P(q_t); (3)
(43) wherein, P(q_t|x_t) represents the posterior probability of the output layer of the acoustic recognition model, P(q_t) represents the prior probability of the acoustic recognition model, and P(x_t) represents the probability of x_t. P(x_t) is not related to the character string sequence, and thus can be ignored.
(44) According to the above formula (3), it can be concluded that the character string of the output result can be adjusted by calculating the prior probability and the posterior probability of the output layer of the acoustic recognition model.
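Formula (3) amounts to the usual "scaled likelihood" computation: the HMM emission score is the network posterior divided by the label prior, with P(x_t) dropped. A minimal sketch in log space follows; the function and variable names are illustrative assumptions, not from the patent.

```python
import numpy as np

# Emission score per formula (3): log P(x_t|q) = log P(q|x_t) - log P(q),
# up to an additive constant log P(x_t) that does not affect decoding.
def scaled_log_likelihood(log_posteriors, priors):
    """log_posteriors: (T, N) array of log P(q|x_t) from the network;
    priors: (N,) array of label priors P(q)."""
    return log_posteriors - np.log(priors)[None, :]
```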
(45) In a preferred embodiment of the present invention, the prior probability P(q) of the neural network is usually calculated by the following formula:
P(q^i)=Count(q^i)/N; (4)
(47) wherein, Count(q^i) is used for representing the total number of the labels q^i in the multi-language speech data, and N is used for representing the total number of all output labels.
(48) In a preferred embodiment of the present invention, since the number of training speech data of different kinds of languages may be different, the prior probability cannot be uniformly calculated and needs to be respectively calculated according to different kinds of languages.
(49) In a preferred embodiment of the present invention, in the above step S231, the prior probability of each language is respectively calculated at first, and the common prior probability of silence of all kinds of languages is calculated.
(50) The prior probability of each language is respectively calculated according to the following formula at first:
P(q_j^i)=Count(q_j^i)/(Σ_{k=1}^{M_j}Count(q_j^k)+Σ_{k=1}^{M_sil}Count(q_sil^k)); (6)
(52) wherein,
(53) q_j^i is used for representing the output label of the ith state of the jth language in the multi-language speech data;
(54) P(q_j^i) is used for representing the prior probability of the output label q_j^i in the multi-language speech data;
(55) Count(q_j^i) is used for representing the total number of the output labels q_j^i in the multi-language speech data;
(56) q_sil^i is used for representing the output label of the ith state of silence in the multi-language speech data;
(57) Count(q_sil^i) is used for representing the total number of the output labels q_sil^i in the multi-language speech data;
(58) M_j is used for representing the total number of states of the jth language in the multi-language speech data; and
(59) M_sil is used for representing the total number of states of silence in the multi-language speech data.
(60) Then, the prior probability of silence is calculated according to the following formula:
P(q_sil^i)=Count(q_sil^i)/(Σ_{j∈L}Σ_{k=1}^{M_j}Count(q_j^k)+Σ_{k=1}^{M_sil}Count(q_sil^k)); (7)
(62) wherein,
(63) P(q_sil^i) is used for representing the prior probability of the output label q_sil^i in the multi-language speech data; and
(64) L is used for representing the set of all languages in the multi-language speech data.
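One way to read the per-language priors of step S231 is that each language's label counts are normalized together with the shared silence counts, while the silence prior is normalized over all languages plus silence. The sketch below follows that interpretation, which is an assumption on our part; the function name and data layout are illustrative.

```python
import numpy as np

# Per-language priors: each language's counts are normalized over that
# language's labels plus the shared silence labels; the silence prior is
# normalized over all languages plus silence.
def language_priors(counts_by_lang, sil_counts):
    """counts_by_lang: {language: array of Count(q_j^i)};
    sil_counts: array of Count(q_sil^i)."""
    sil_total = sil_counts.sum()
    priors = {lang: c / (c.sum() + sil_total)
              for lang, c in counts_by_lang.items()}
    all_total = sum(c.sum() for c in counts_by_lang.values())
    sil_priors = sil_counts / (all_total + sil_total)
    return priors, sil_priors
```

Note that under this normalization the priors of all labels taken together no longer sum to 1, matching the remark in paragraph (80).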
(65) In a preferred embodiment of the present invention, after the prior probability of each language and the prior probability of silence are calculated, the posterior probability of the acoustic recognition model is then calculated. The posterior probability P(q^i|x) output by the neural network is usually calculated by the output layer, and when the output layer is implemented by the softmax nonlinear unit, the posterior probability is usually calculated according to the following formula:
P(q^i|x)=exp(y^i)/Σ_{k=1}^{N}exp(y^k); (5)
(67) wherein, y^i is used for representing the input value of the ith state, and N represents the number of all states.
(68) Similarly, in the acoustic recognition model, the imbalance of the number of training data in different kinds of languages may result in the imbalance in the distribution of state value calculation results of different kinds of languages, so the posterior probability still needs to be calculated respectively for different kinds of languages.
(69) In a preferred embodiment of the present invention, in the above step S232, the posterior probability of each language is respectively calculated according to the following formula:
P(q_j^i|x)=exp(y_j^i)/(Σ_{k=1}^{M_j}exp(y_j^k)+Σ_{k=1}^{M_sil}exp(y_sil^k)); (8)
(71) wherein,
(72) x is used for representing the speech feature;
(73) P(q_j^i|x) is used for representing the posterior probability of the output label q_j^i in the multi-language speech data;
(74) y_j^i is used for representing the input data of the ith state of the jth language in the multi-language speech data;
(75) y_sil^i is used for representing the input data of the ith state of silence; and
(76) exp is used for representing the exponential function.
(77) In a preferred embodiment of the present invention, in the step S232, the posterior probability of silence is calculated according to the following formula:
P(q_sil^i|x)=exp(y_sil^i)/(Σ_{j∈L}Σ_{k=1}^{M_j}exp(y_j^k)+Σ_{k=1}^{M_sil}exp(y_sil^k)); (9)
(79) wherein, P(q_sil^i|x) is used for representing the posterior probability of the output label q_sil^i in the multi-language speech data.
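Under the same reading as the priors, the per-language posteriors of step S232 replace the global softmax with a per-language softmax in which each language's states compete only with themselves and the shared silence states. A sketch under that assumption; the function name and data layout are illustrative, not from the patent.

```python
import numpy as np

# Per-language softmax: each language's states are normalized over that
# language's states plus the shared silence states; silence is normalized
# over all languages plus silence.
def language_softmax(logits_by_lang, sil_logits):
    """logits_by_lang: {language: array of y_j^i}; sil_logits: array of y_sil^i."""
    sil_exp = np.exp(sil_logits).sum()
    post = {lang: np.exp(y) / (np.exp(y).sum() + sil_exp)
            for lang, y in logits_by_lang.items()}
    all_exp = sum(np.exp(y).sum() for y in logits_by_lang.values())
    sil_post = np.exp(sil_logits) / (all_exp + sil_exp)
    return post, sil_post
```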
(80) In the present invention, the prior probabilities and the posterior probabilities of each language and of the silence state can be calculated by using the above improved formulas (6)-(9), so that the acoustic recognition model can meet the output requirements of multi-language mixed modeling, and each language and the silence state can be described more accurately. It should be noted that after the above formulas are adjusted, the sums of the prior probabilities and of the posterior probabilities are no longer 1.
(81) In a preferred embodiment of the present invention, in the above step S3, the language recognition model can be formed by training with an n-Gram model, or with a recurrent neural network. The above multi-language text corpus should include monolingual text corpora of the plurality of languages, as well as text data in which the plurality of languages are mixed.
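As one possible instance of the n-Gram option, the toy sketch below trains an unsmoothed bigram ("2-Gram") model over mixed-language text. The whitespace tokenization and all data are illustrative assumptions; real systems use smoothed counts over large corpora.

```python
from collections import Counter

# Toy unsmoothed bigram language model over mixed-language text.
def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigram.update(toks[:-1])                  # history counts
        bigram.update(zip(toks[:-1], toks[1:]))    # pair counts
    def prob(prev, w):
        """Maximum-likelihood estimate P(w | prev)."""
        return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0
    return prob

prob = train_bigram(["我 需要 一个 USB 接口"])
```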
(82) In a preferred embodiment of the present invention, after the speech recognition system is formed, weight adjustment is performed on different kinds of languages in the speech recognition system at first.
(83) As shown in the drawings, the steps of performing the weight adjustment comprise:
(84) step A1, respectively determining a posterior probability weight value of each language according to real speech data; and
(85) step A2, respectively adjusting the posterior probability of each language according to the posterior probability weight value to complete the weight adjustment.
(86) Specifically, in the present embodiment, after the speech recognition system is formed, a problem of unbalanced training data sizes may arise in the training process: a language with a relatively large amount of data may obtain a relatively large prior probability, and since the final recognition score is obtained by dividing the posterior probability by the prior probability, the actual recognition score of the language with more training data becomes smaller. As a result, the recognition system may tend to recognize one language while failing to recognize another, leading to a deviation of the recognition result.
(87) In order to solve this problem, before the above speech recognition system is put into practical use, it is necessary to use real data as a development set to perform actual measurement so as to adjust the weight of each language. The above weight adjustment is usually applied to the posterior probability output by the acoustic recognition model, so the formula is as follows:
P̂(q_j^i|x)=a_j·P(q_j^i|x); (10)
(88) wherein,
(89) q_j^i is used for representing the output label of the ith state of the jth language in the multi-language speech data;
(90) x is used for representing the speech feature;
(91) P(q_j^i|x) is used for representing the posterior probability of the output label q_j^i in the multi-language speech data;
(92) a_j is used for representing the posterior probability weight value of the jth language in the multi-language speech data, and the posterior probability weight value is determined by performing actual measurement on the acoustic recognition model through the development set formed by the real data; and
(93) P̂(q_j^i|x) is used for representing the posterior probability of the output label q_j^i in the multi-language speech data after the weight adjustment.
(94) Through the above weight adjustment, the speech recognition system can obtain a very good recognition effect in different application scenarios.
(95) In a preferred embodiment of the present invention, for a speech recognition system mixing Chinese and English, after the actual measurement on the real data, the posterior probability weight value of Chinese can be set to 1.0, the posterior probability weight value of English can be set to 0.3, and the posterior probability weight value of silence can be set to 1.0.
(96) In other embodiments of the present invention, the posterior probability weight value can be adjusted repeatedly by using different real data, and an optimal value is determined at last.
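The weight adjustment of formula (10) is a simple element-wise scaling of each language's posteriors. A minimal sketch using the Chinese/English example weights from paragraph (95); the function name and dictionary layout are illustrative assumptions.

```python
import numpy as np

# Formula (10): scale each language's posteriors by its weight a_j.
def apply_language_weights(posteriors_by_lang, weights):
    return {lang: weights[lang] * np.asarray(p)
            for lang, p in posteriors_by_lang.items()}

# Example weights for a Chinese/English system tuned on real data.
weights = {"zh": 1.0, "en": 0.3, "sil": 1.0}
adjusted = apply_language_weights(
    {"zh": [0.5, 0.2], "en": [0.4], "sil": [0.3]}, weights)
```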
(97) The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the embodiments and the protection scope of the present invention. Those skilled in the art should be aware that solutions obtained by making equivalent substitutions and obvious variations by using the specification of the present invention and the contents shown in the figures shall all fall within the protection scope of the present invention.