SYSTEM AND METHOD FOR TONE RECOGNITION IN SPOKEN LANGUAGES
20230186905 · 2023-06-15
Inventors
CPC classification
G10L15/30
PHYSICS
International classification
G10L15/30
PHYSICS
Abstract
There is provided a system and method for recognizing tone patterns in spoken languages using sequence-to-sequence neural networks in an electronic device. The recognized tone patterns can be used to improve the accuracy of a speech recognition system for tonal languages.
Claims
1. A method of speech recognition on acoustic signals associated with a tonal language, in a computing device, the method comprising: applying a feature vector extractor to an input acoustic signal and outputting a sequence of feature vectors for the input acoustic signal; applying at least one runtime model of one or more neural networks to the sequence of feature vectors and producing a sequence of tones as output from the input acoustic signal; wherein the sequence of tones are predicted as probabilities of each feature vector of the sequence of feature vectors representing a part of a tone of the sequence of tones; applying an acoustic model to the input acoustic signal to obtain one or more complementary acoustic vectors; and combining the sequence of tones and the one or more complementary acoustic vectors to output a speech recognition result of the input acoustic signal.
2. The method of claim 1 wherein the sequence of tones define a tone posteriorgram.
3. The method of claim 1 wherein the complementary acoustic vectors are speech feature vectors or a phoneme posteriorgram.
4. The method of claim 3 wherein the speech feature vectors are provided by one of a Mel-frequency cepstral coefficients (MFCC) technique, a filterbank features (FBANK) technique, or a perceptual linear predictive (PLP) technique.
5. The method of claim 1, further comprising: mapping the sequence of feature vectors to the sequence of tones using one or more neural networks to learn at least one model to map the sequence of feature vectors to the sequence of tones.
6. The method of claim 1, wherein the feature vector extractor comprises one or more of a multi-layer perceptron (MLP), a convolutional neural network (CNN), a recurrent neural network (RNN), a cepstrogram, a spectrogram, a Mel-filtered cepstrum coefficients (MFCC), or a filterbank coefficient (FBANK).
7. The method of claim 6, wherein the neural network is a sequence-to-sequence network.
8. The method of claim 7 wherein the sequence-to-sequence network comprises one or more of an MLP, a CNN, or an RNN, trained using a loss function appropriate to connectionist temporal classification (CTC) training, encoder-decoder training, or attention training.
9. The method of claim 8 wherein the sequence-to-sequence network has one or more uni-directional or bi-directional recurrent layers.
10. The method of claim 8 wherein, when the sequence-to-sequence network is an RNN, the RNN has recurrent units such as long short-term memory (LSTM) units or gated recurrent units (GRU).
11. The method of claim 10, where the RNN is implemented using one or more of uni-directional or bi-directional LSTM or GRU units.
12. The method of claim 1 further comprising a preprocessing network for computing frames using a Hamming window to define a cepstrogram input representation.
13. The method of claim 12 further comprising a convolutional neural network for performing n x m convolutions on the cepstrogram and then pooling prior to application of an activation layer.
14. The method of claim 13 wherein n=2, 3 or 4 and m=3 or 4.
15. The method of claim 13 wherein pooling comprises 2×2 pooling, average pooling, or l2-norm pooling.
16. The method of claim 13 wherein the activation layers of the one or more neural networks comprise one of a rectified linear unit (ReLU) activation function using a three-layer network, a sigmoid layer, or a tanh layer.
17. A speech recognition system comprising: an audio input device; a processor coupled to the audio input device; a memory coupled to the processor, the memory for estimating tones present in an input acoustic signal and outputting a sequence of feature vectors for the input acoustic signal by: applying a feature vector extractor to an input acoustic signal and outputting a sequence of feature vectors for the input acoustic signal; applying at least one runtime model of one or more neural networks to the sequence of feature vectors and producing a sequence of tones as output from the input acoustic signal, wherein the sequence of tones are predicted as probabilities of each feature vector of the sequence of feature vectors representing a part of a tone of the sequence of tones; applying an acoustic model to the input acoustic signal to obtain one or more complementary acoustic vectors; and combining the sequence of tones and the one or more complementary acoustic vectors to output a speech recognition result of the input acoustic signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] Further features and advantages of the present disclosure will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026] It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
DETAILED DESCRIPTION
[0027] A system and method are provided that learn to recognize sequences of tones, without segmented training data, using sequence-to-sequence networks. A sequence-to-sequence network is a neural network trained to output a sequence, given a sequence as input. Sequence-to-sequence networks include connectionist temporal classification (CTC) networks, encoder-decoder networks, and attention networks, among other possibilities. The model used in sequence-to-sequence networks is typically a recurrent neural network (RNN); however, non-recurrent architectures also exist: for example, a convolutional neural network (CNN) can be trained for speech recognition using a CTC-like sequence loss function.
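As an illustrative sketch (not the claimed implementation), the alignment probability that a CTC network is trained to maximize can be computed in probability space with the standard forward recursion over a blank-extended label sequence; the function name and list-based representation here are assumptions for clarity:

```python
def ctc_forward(probs, target, blank=0):
    """Total probability of all frame-level alignments of `target`.

    probs:  list of T per-frame distributions over V labels (T x V).
    target: list of label ids, with no blanks.
    """
    # Extend the target with blanks: a-b  ->  _ a _ b _
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S, T = len(ext), len(probs)

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]          # start with blank
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]     # or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]            # stay on the same state
            if s >= 1:
                a += alpha[t - 1][s - 1]   # advance one state
            # skip the blank between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # valid alignments end on the last label or the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

Training minimizes the negative log of this quantity; in practice the recursion is carried out in log space for numerical stability.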
[0028] In accordance with an aspect there is provided a method of processing and/or recognizing tones in acoustic signals associated with a tonal language, in a computing device, the method comprising: applying a feature vector extractor to an input acoustic signal and outputting a sequence of feature vectors for the input acoustic signal; and applying at least one runtime model of one or more neural networks to the sequence of feature vectors and producing a sequence of tones as output from the input acoustic signal; wherein the sequence of tones are predicted as probabilities of each given speech feature vector of the sequence of feature vectors representing a part of a tone.
[0029] In accordance with another aspect the sequence of feature vectors are mapped to a sequence of tones using one or more sequence-to-sequence networks to learn at least one model to map the sequence of feature vectors to a sequence of tones.
[0030] In accordance with an aspect the feature vector extractor comprises one or more of a multi-layer perceptron (MLP), a convolutional neural network (CNN), a recurrent neural network (RNN), a cepstrogram computer, a spectrogram computer, a Mel-filtered cepstrum coefficients (MFCC) computer, or a filterbank coefficient (FBANK) computer.
[0031] In accordance with an aspect, the sequence of output tones can be combined with complementary acoustic vectors, such as MFCC or FBANK feature vectors or a phoneme posteriorgram, enabling a speech recognition system to perform speech recognition in a tonal language with higher accuracy.
[0032] In accordance with an aspect the sequence-to-sequence network comprises one or more of an MLP, a feed-forward neural network (DNN), a CNN, or an RNN, trained using a loss function appropriate to CTC training, encoder-decoder training, or attention training.
[0033] In accordance with an aspect, an RNN is implemented using one or more of uni-directional or bi-directional GRU or LSTM units, or a derivative thereof.
[0034] The system and method described can be implemented in a speech recognition system to assist in estimating words. The speech recognition system is implemented on a computing device having a processor, memory and microphone input device.
[0035] In another aspect, there is provided a method of processing and/or recognizing tones in acoustic signals, the method comprising a trainable feature vector extractor and a sequence-to-sequence neural network.
[0036] In another aspect, there is provided a computer readable media comprising computer executable instructions for performing the method.
[0037] In another aspect, there is provided a system for processing acoustic signals, the system comprising a processor and memory, the memory comprising computer executable instructions for performing the method.
[0038] In an implementation of the system, the system comprises a cloud-based device for performing cloud-based processing.
[0039] In yet another aspect, there is provided an electronic device comprising an acoustic sensor for receiving acoustic signals, the system described herein, and an interface with the system to make use of the estimated tones when the system has outputted them.
[0040] Referring to
[0041] Referring to
[0042] The sequence-to-sequence network is typically a recurrent neural network (RNN) 230 which can have one or more uni-directional or bi-directional recurrent layers. The recurrent neural network 230 can also have more complex recurrent units such as long-short term memory (LSTM) or gated recurrent units (GRU), etc.
[0043] In one embodiment, the sequence-to-sequence network uses the CTC loss function 240 to learn to output the correct tone sequence. The output may be decoded from the logits produced by the network using a greedy search or a beam search.
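A minimal sketch of the greedy decoding mentioned above, assuming per-frame scores with the blank label at index 0 (an illustrative helper, not the claimed implementation; a beam search would instead track multiple partial hypotheses):

```python
def ctc_greedy_decode(logits, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for b in best:
        if b != prev and b != blank:
            out.append(b)
        prev = b
    return out
```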
EXAMPLE AND EXPERIMENT
[0044] An example of the method is shown in
[0045] Table 1 lists one possible set of hyper-parameters used in the recognizer for these example experiments. We used a bidirectional gated recurrent unit (BiGRU) with 128 hidden units in each direction as the RNN. The RNN has an affine layer with 6 outputs: 5 for the 5 Mandarin tones, and 1 for the CTC “blank” label.
TABLE-US-00001
Layers of the recognizer described in the experiment

Layer type        Hyperparameters
framing           25 ms window, 10 ms stride
windowing         Hamming window
FFT               length 512
abs → log → IFFT  length 512
conv2d            11×11, 16 lifters, stride 1
pool              4×4, max, stride 2
activation        ReLU
conv2d            11×11, 16 lifters, stride 1
pool              4×4, max, stride 2
activation        ReLU
conv2d            11×11, 16 lifters, stride 1
pool              4×4, max, stride 2
activation        ReLU
dropout           50%
recurrent         BiGRU, 128 hidden units
CTC               (output layer)
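The framing and windowing rows of Table 1 (25 ms frames, 10 ms stride, Hamming window) can be sketched as follows; the function name and the pure-Python list representation of the signal are illustrative assumptions:

```python
import math

def frames_hamming(signal, rate=16000, win_ms=25, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    win = int(rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(rate * hop_ms / 1000)   # 160 samples at 16 kHz
    # Hamming window: 0.54 - 0.46*cos(2*pi*n/(N-1))
    ham = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    return [[signal[i + n] * ham[n] for n in range(win)]
            for i in range(0, len(signal) - win + 1, hop)]
```

Each windowed frame would then be passed through the FFT, abs, log, and IFFT stages of Table 1 to produce one column of the cepstrogram.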
[0046] The network was trained for a maximum of 20 epochs using an optimizer such as, for example, the one disclosed in Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations (ICLR), 2015, hereby incorporated by reference, with a learning rate of 0.001 and gradient clipping. Batch normalization for RNNs and a novel optimization curriculum, the SortaGrad curriculum learning strategy, were utilized, as described in Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in 33rd International Conference on Machine Learning (ICML), 2016, pp. 173-182, in which training sequences are drawn from the training set in order of length during the first epoch and randomly in subsequent epochs. For regularization, early stopping on the validation set was used to select the final model. To decode the tone sequences from the logits, a greedy search was used.
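The SortaGrad ordering described above, drawing sequences by length in the first epoch and randomly afterwards, can be sketched with a hypothetical helper (the name and `dataset` layout of (sequence, label) pairs are assumptions):

```python
import random

def sortagrad_batches(dataset, batch_size, epoch, seed=0):
    """Yield mini-batches: sorted by sequence length in epoch 0, shuffled later."""
    if epoch == 0:
        order = sorted(dataset, key=lambda ex: len(ex[0]))
    else:
        order = random.Random(seed + epoch).sample(dataset, len(dataset))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

Starting on short sequences tends to stabilize early CTC training, which is the motivation given for the curriculum in Deep Speech 2.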
[0047] In an embodiment, the predicted tones are combined with complementary acoustic information to enhance the performance of a speech recognition system. Examples of such complementary acoustic information include a sequence of acoustic feature vectors or a sequence of posterior phoneme probabilities (also known as a phone posteriorgram) obtained via a separate model or set of models, such as a fully connected network, a convolutional neural network, or a recurrent neural network. The posterior probabilities can also be obtained via a joint learning method, such as multi-task learning that combines tone recognition and phone recognition, among other tasks.
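One simple way to combine a tone posteriorgram with a phoneme posteriorgram is frame-wise concatenation of the two probability vectors; this sketch is illustrative, as the patent does not prescribe a particular combination scheme:

```python
def combine_posteriorgrams(tone_post, phone_post):
    """Concatenate tone and phoneme posterior vectors frame by frame.

    Both inputs are lists of per-frame probability vectors of equal length T.
    """
    assert len(tone_post) == len(phone_post), "posteriorgrams must be frame-aligned"
    return [t + p for t, p in zip(tone_post, phone_post)]
```

The concatenated vectors can then be fed to a downstream recognizer in place of (or alongside) standard acoustic features.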
[0048] An experiment to show that the predicted tones can improve the performance of a speech recognition system was performed. For this experiment, 31 native Mandarin speakers were recorded reading a set of 8 pairs of phonetically similar commands. The 16 commands, as shown in Table 2, were chosen to be phonetically identical except in tones. Two neural networks were trained to recognize this command set: one with phoneme posteriors alone as input, and one with both phoneme and tone posteriors as input.
TABLE-US-00002
Commands used in confusable command experiment

Index  Transcription in pinyin   English translation
0      "nǐ de xióngmāo"          "your panda"
1      "nǐ de xiōngmáo"          "your chest hair"
2      "wǒ kěyǐ wèn nǐ ma?"      "Can I ask you?"
3      "wǒ kěyǐ wěn nǐ ma?"      "Can I kiss you?"
4      "wǒ xǐhuān yánjiū"        "I like to study"
5      "wǒ xǐhuān yān jiǔ"       "I like smoking and drinking"
6      "shānghài"                "injure"
7      "Shànghǎi"                "Shanghai (city)"
8      "lǎogōng"                 "husband"
9      "láogōng"                 "hard labour"
10     "shīqù"                   "lose"
11     "shíqǔ"                   "pick up"
12     "yèzhǔ"                   "owner"
13     "yězhū"                   "wild boar"
14     "shìyán"                  "promise"
15     "shīyán"                  "slip of the tongue"
Results
[0049] The performance of a number of tone recognizers is compared in Table 3. Rows [1]-[5] of the table provide Mandarin tone recognition results reported elsewhere in the literature. Row [6] of the table shows the result of the example of the presently disclosed method, which achieves better results than the other reported results by a wide margin, with a TER of 11.7%.
TABLE-US-00003
Comparison of tone recognition results

Method                Model and input features       TER
[1] Lei et al.        HDPF → MLP                     23.8%
[2] Kalinli           Spectrogram → Gabor → MLP      21.0%
[3] Huang et al.      HDPF → GMM                     19.0%
[4] Huang et al.      MFCC + HDPF → RNN              17.1%
[5] Ryant et al.      MFCC → MLP                     15.6%
[6] Present method    CG → CNN → RNN → CTC           11.7%

[1] Xin Lei, Manhung Siu, Mei-Yuh Hwang, Mari Ostendorf, and Tan Lee, "Improved tone modeling for Mandarin broadcast news speech recognition," Proc. of Int. Conf. on Spoken Language Processing, pp. 1237-1240, 2006.
[2] Ozlem Kalinli, "Tone and pitch accent classification using auditory attention cues," ICASSP, pp. 5208-5211, May 2011.
[3] Hank Huang, Han Chang, and Frank Seide, "Pitch tracking and tone features for Mandarin speech recognition," ICASSP, pp. 1523-1526, 2000.
[4] Hao Huang, Ying Hu, and Haihua Xu, "Mandarin tone modeling using recurrent neural networks," arXiv preprint arXiv:1711.01946, 2017.
[5] Neville Ryant, Jiahong Yuan, and Mark Liberman, "Mandarin tone classification without pitch tracking," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4868-4872, 2014.
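The tone error rate (TER) reported in Table 3 is conventionally the Levenshtein edit distance between the recognized and reference tone sequences, normalized by the reference length; a sketch assuming that standard definition, which the text does not spell out:

```python
def tone_error_rate(ref, hyp):
    """Edit distance (substitutions + insertions + deletions) / reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions
    for j in range(n + 1):
        d[0][j] = j                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[m][n] / m
```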
[0050]
[0051] Another embodiment in which tone recognition is useful is computer-assisted language learning. Correct pronunciation of tones is necessary for a speaker to be intelligible when speaking a tonal language. In a computer-assisted language learning application, such as Rosetta Stone™ or Duolingo™, tone recognition can be used to check whether the learner is pronouncing the tones of a phrase correctly. This can be done by recognizing the tones spoken by the learner and checking whether they match the expected tones of the phrase to be spoken.
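The matching step described above can be sketched as a position-wise comparison (a hypothetical helper; a production system would also align recognized and expected sequences of unequal length before comparing):

```python
def mispronounced_tones(expected, recognized):
    """Return indices where the learner's recognized tone differs from the
    expected one; expected tones missing from the recognized sequence also
    count as errors."""
    n = min(len(expected), len(recognized))
    errors = [i for i in range(n) if expected[i] != recognized[i]]
    errors += list(range(n, len(expected)))
    return errors
```

An empty result means every expected tone was produced correctly, so the application can accept the learner's attempt.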
[0052] Another embodiment for which automatic tone recognition is useful is corpus linguistics, in which patterns in a spoken language are inferred from large amounts of data obtained for that language. For instance, a certain word may have multiple pronunciations (consider how “either” in English may be pronounced as “IY DH ER” or “AY DH ER”), each with a different tone pattern. Automatic tone recognition can be used to search a large audio database and determine how often each pronunciation variant is used, and in which context each pronunciation is used, by recognizing the tones with which the word is spoken.
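The corpus search described above reduces to tallying the recognized tone pattern of each occurrence of a word; a minimal sketch with illustrative names:

```python
from collections import Counter

def pronunciation_variant_counts(recognized_tone_sequences):
    """Count how often each tone pattern occurs across occurrences of a word.

    recognized_tone_sequences: one recognized tone sequence per utterance.
    """
    return Counter(tuple(seq) for seq in recognized_tone_sequences)
```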
[0053]
[0054]
[0055] Each element in the embodiments of the present disclosure may be implemented as hardware, software/program, or any combination thereof. Software code, either in its entirety or in part, may be stored in a non-transitory computer readable medium or memory (e.g., a ROM, for example a non-volatile memory such as flash memory, a CD-ROM, a DVD-ROM, a Blu-ray™ disc, a semiconductor ROM, a USB drive, or a magnetic recording medium, for example a hard disk). The program may be in the form of source code, object code, a code intermediate between source and object code such as partially compiled form, or in any other form.
[0056] It would be appreciated by one of ordinary skill in the art that the system and components shown in