HMM DECODING COMPENSATION FOR SPEECH RECOGNITION AND MULTI-STRUCTURED DECODING FOR LOW RESOURCE COMMAND RECOGNITION
20260105912 · 2026-04-16
Assignee
Inventors
CPC classification
G10L15/148
PHYSICS
International classification
Abstract
Described are techniques to recognize a spoken wake word (WW) or command for a human-machine interface using a speech recognition system that does not require any WW/command-matching speech data for training. The system uses the text or grapheme representation of the WW or commands for training before deployment. The technique includes receiving a target phrase for recognition by a speech recognition model. The technique includes analyzing a sequence of acoustic units representative of the target phrase when the target phrase is spoken to generate offline analysis data. The technique further includes constructing the speech recognition model based on the offline analysis data to decode speech signals of the target phrase according to the acoustic units. The technique further includes processing speech based on the speech recognition model to detect a presence of the target phrase.
Claims
1. A method of speech recognition by a device, the method comprising: receiving a target phrase for recognition by a speech recognition model; analyzing a sequence of acoustic units representative of the target phrase when the target phrase is spoken to generate offline analysis data; constructing the speech recognition model based on the offline analysis data to decode speech signals of the target phrase according to the acoustic units; and processing speech based on the speech recognition model to detect a presence of the target phrase.
2. The method of claim 1, wherein processing speech based on the speech recognition model comprises: determining from the speech recognition model a model likelihood score representing a likelihood of the presence of the target phrase based on an observed sequence of acoustic units decoded from the speech.
3. The method of claim 2, wherein processing speech based on the speech recognition model further comprises: modifying the model likelihood score based on the observed sequence of acoustic units and the offline analysis data to determine the presence of the target phrase.
4. The method of claim 1, wherein the speech recognition model comprises a sequence of decoding states, wherein each decoding state of the sequence of decoding states models each acoustic unit of the sequence of acoustic units, and wherein analyzing the sequence of acoustic units comprises: determining an order of transitioning through the sequence of decoding states based on time.
5. The method of claim 1, wherein the target phrase comprises a plurality of words, and wherein the offline analysis data comprises an expected length in time of the target phrase and an expected length in time of each of the plurality of words when the target phrase is spoken.
6. The method of claim 1, wherein the speech recognition model comprises a sequence of states, wherein each state of the sequence of states models each acoustic unit of the sequence of acoustic units, and wherein processing speech based on the speech recognition model comprises determining a likelihood of the presence of the target phrase based on an order of transitions between states of the sequence of states when the target phrase is spoken.
7. The method of claim 1, wherein the offline analysis data comprises an expected length in time of each of the acoustic units in the sequence of acoustic units.
8. The method of claim 7, wherein the speech recognition model comprises a sequence of decoding states, wherein each decoding state of the sequence of decoding states models each acoustic unit of the sequence of acoustic units, and wherein the expected length in time of each of the acoustic units comprises an expected length in time the speech recognition model stays in each of the decoding states when decoding speech signals of the target phrase.
9. The method of claim 1, wherein the speech recognition model comprises a sequence of decoding states, wherein each decoding state of the sequence of decoding states models each acoustic unit of the sequence of acoustic units, and wherein the offline analysis data comprises: one or more acoustically similar acoustic units to an acoustic unit modeled by a decoding state, wherein the acoustically similar acoustic units are associated with probability estimates of a presence of the acoustically similar acoustic units when the decoding state identifies the acoustic unit modeled by the decoding state as a most likely acoustic unit.
10. The method of claim 1, wherein constructing the speech recognition model based on the offline analysis data comprises: constructing a sequence decoding model based on the sequence of acoustic units, wherein the sequence decoding model includes a sequence of states, and wherein each state of the sequence of states models each acoustic unit of the sequence of acoustic units; and constructing a decoding compensation model based on the offline analysis data to modify a decoding output of the sequence decoding model.
11. The method of claim 10, wherein the sequence decoding model decodes a most likely path through the sequence of states when processing speech, and wherein the decoding compensation model compares transitions through the sequence of states of the most likely path with expected transitions through the sequence of states when the speech recognition model processes acoustic units of the target phrase.
12. The method of claim 11, wherein the expected transitions through the sequence of states comprises at least one of: a ratio between an expected length in time of a word in the target phrase and an expected total length in time of the target phrase when the target phrase is spoken; an expected transition of 1 state through the sequence of states when the target phrase is spoken; an expected length in time in each state of the sequence of states when the target phrase is spoken; or acoustically similar acoustic units to an acoustic unit that is modeled by each state of the sequence of states, wherein each of the acoustically similar acoustic units is associated with a probability estimate of a detection when the acoustic unit is modeled by a corresponding state.
13. The method of claim 12, wherein processing speech based on the speech recognition model comprises at least one of: comparing a ratio of an observed length in time of a word in the speech and an observed total length in time of the speech when transitioning through the sequence of states of the most likely path with the ratio between an expected length in time of a word in the target phrase and an expected total length in time of the target phrase to generate a word-ratio penalty; comparing observed state jumps when transitioning through the sequence of states of the most likely path with the expected transition of 1 state for the target phrase to generate a state jump penalty; comparing an observed length in time in each state when transitioning through the sequence of states of the most likely path with the expected length in time in each state of the target phrase to generate a state walk penalty; or comparing a probability estimate of a most likely acoustic unit modeled by each state when transitioning through the sequence of states of the most likely path with probability estimates for the acoustically similar acoustic units and the acoustic unit modeled by each state for the expected transitions of the target phrase to generate a top-1 penalty.
14. The method of claim 13, wherein the state walk penalty is weighted by a probability of transitioning within each state for the expected transitions of the target phrase.
15. The method of claim 13, wherein the top-1 penalty for a state is weighted by a probability estimate of the acoustic unit modeled by the state to reward the most likely acoustic unit that matches the acoustic unit modeled by the state, and to penalize the most likely acoustic unit that fails to match the acoustic unit modeled by the state, when a probability estimate of the acoustic unit modeled by the state is high.
16. The method of claim 13, wherein processing speech based on the speech recognition model comprises: combining the word-ratio penalty, the state jump penalty, the state walk penalty, and the top-1 penalty to generate a total compensation; and modifying a score associated with the most likely path by the total compensation to generate a modified score indicating a probability of the presence of the target phrase.
17. The method of claim 1, wherein the target phrase comprises at least one of: a wake-word spoken to address the device; a simple command spoken following the wake-word, wherein the simple command includes one or more words; a compound command spoken following the wake-word, wherein the compound command includes a common sub-command and a second sub-command unique to each compound command; a number and an associated unit spoken following the wake-word; or a complex command spoken following the wake-word, wherein the complex command includes a combination of any one of the simple command, the compound command, and the number and the associated unit.
18. The method of claim 17, wherein constructing the speech recognition model based on the offline analysis data comprises: constructing a sequence decoding model based on a sequence of acoustic units of the wake-word followed by a sequence of acoustic units of a command, wherein the sequence decoding model models a first sequence of states corresponding to the sequence of acoustic units of the wake-word and a second sequence of states corresponding to the sequence of acoustic units of the command, and wherein a state of the first sequence of states corresponding to a last acoustic unit of the wake-word also models a gap between the wake-word and the command.
19. The method of claim 17, wherein constructing the speech recognition model based on the offline analysis data comprises: constructing a sequence decoding model based on a concatenation of sequences of acoustic units of a plurality of words of a command, wherein the sequence decoding model includes a first sequence of states modeling a sequence of acoustic units of a first word, a silence state modeling a gap between the first word and a second word of the command, and a second sequence of states modeling a sequence of acoustic units of the second word, and wherein the silence state also models a last acoustic unit of the first word and a first acoustic unit of the second word.
20. An apparatus comprising: an input terminal configured to receive an audio signal from one or more microphones; and a processing system configured to: receive a target phrase for recognition by a speech recognition model; analyze a sequence of acoustic units representative of the target phrase when the target phrase is spoken to generate offline analysis data; construct the speech recognition model based on the offline analysis data to decode speech signals of the target phrase according to the acoustic units; and process the audio signal based on the speech recognition model to detect a presence of the target phrase.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.
DETAILED DESCRIPTION
[0059] Examples of various aspects and variations of the subject technology are described herein and illustrated in the accompanying drawings. The following description is not intended to limit the invention to these embodiments, but rather to enable a person skilled in the art to make and use this invention.
[0060]
[0061] Described are methods and systems for a WW and command recognition solution that does not require any user-defined WW or command-matching speech data for training. This approach enables the user/customer to quickly and inexpensively deploy a speech recognition-based human-machine interface (HMI). The disclosed systems and methods enable a speech recognition solution trained (e.g., solely) on text-specified WWs and commands to be deployed very quickly. The system is termed data-free speech recognition because it does not need speech data specific to the user-defined WWs and commands. Advantageously, the data-free system may use, for example, a text or grapheme representation of the WWs or commands for online training on the order of seconds before the system is ready for use. The resulting low complexity and small memory footprint allow implementation on an edge processor, while the system is robust enough to handle confusable phonemes, accents, different pronunciations, and adverse environmental conditions.
[0062] A system architecture for continuous speech recognition may have a feature analysis module that processes the speech signal, followed by a unit matching block, a lexical decoding block, a syntactical analysis block, and a semantic analysis block to generate a recognized utterance. The feature analysis block typically involves a spectral and/or temporal analysis of the speech signal yielding observation vectors, x, which are processed by the unit matching block to characterize various speech sounds. The unit matching block may include a recognition unit to recognize linguistically-based sub-word units such as phones, diphones, or triphones, partial or whole word units, or even multiple word units. Generally, the smaller the sub-word unit, the fewer of them there are, but the more complicated their structure is in speech, and hence the more system performance hinges on the remaining architecture blocks.
[0063] The lexical decoding block applies word-based knowledge to the output of the recognition unit, putting restrictions on possible unit decoding by considering word structure. A word dictionary may be included to further restrict possibilities to the valid word database. In the case that the output of the recognition unit is words, the lexical decoding block may be eliminated. The syntactical analysis block applies further constraints based on word grammar and proper sequencing. Finally, the semantic analysis block applies additional constraints based on meaning, reference, logic, implication, application, etc.
[0064] The continuous speech recognition architecture may be capable of handling a large vocabulary. Many applications exist where the valid vocabulary is very limited, often limited to the context of a single focused scenario, such as controlling the functions of an oven, or adjusting the settings of a smart thermostat. Often, the application includes a wake-word (WW) to address the device, followed by a limited and known set of commands. In one embodiment, the application may accept a command without a WW present (e.g., push to talk). For the case of the WW, the job at hand is to recognize a single word or phrase in an essentially infinite possible range of input speech, noise, and conditions. In this deployment scenario, the continuous speech recognition architecture may be simplified by eliminating the syntactical analysis block and the semantic analysis block. In further simplification, if the recognition unit is targeted to recognize the WW itself, then the lexical decoding block may be eliminated as well.
[0065] For speech recognition of commands, if the words contained within the set of commands are limited and the grammar is known and restricted to the set of commands, the syntactical analysis may generate the recognized commands, thus eliminating the semantic analysis. If the lexical decoding can construct complete commands rather than individual words, the syntactical analysis may be eliminated too. One or more of the feature analysis module, unit matching block, lexical decoding block, syntactical analysis block, and semantic analysis block may leverage neural networks and machine learning models, depending on the applications, resource requirements, performance requirements, hardware capabilities, available training data, etc., to generate a wide range of speech recognition architectures customized to the tasks and resources at hand.
[0066] To accelerate the training and deployment of a neural network-based speech recognition architecture to recognize user-defined WWs and commands while offering robust performance, a data-free speech recognition architecture may receive text rather than speech data to build one or more command models during an online training phase. During the inference phase, the neural network-based speech recognition system may use the command models and an acoustic model (derived from offline training) to recognize speech as a command associated with the received text.
[0067]
[0068] On the other hand, the online training 230 may be performed using user-defined WWs and command texts 232 to build one or more command models such as a lexical analysis model 238 or a syntactical analysis model 240 for WW and command speech recognition during inference. In one embodiment, the online training 230 may train a text-to-speech (TTS) engine (not shown) to generate synthetic speech of the WWs and command texts 232 that is then used for lexical analysis training 234 of the lexical analysis model 238 and syntactical analysis training 236 of the syntactical analysis model 240. In one embodiment, the lexical analysis model 238 and/or the syntactical analysis model 240 may be based on a statistical model such as a Hidden Markov Model (HMM) phoneme-based word model.
[0069] During the inference stage 250, the data-free speech recognition architecture uses the one or more command models (e.g., lexical analysis model 238 and syntactical analysis model 240) from the online training and the acoustic model 218 from the offline training to recognize speech 252 uttered by a user as a WW 262 or a command 264 associated with the text. The inference stage 250 may use the same feature analysis module 214 used for the offline training 210 of the acoustic model 218. The feature analysis module 214 may perform spectral and/or temporal analysis of the speech 252 for processing by the neural network-based unit matching block 256 using the acoustic model 218. A sequence decoding block that includes lexical decoding 258 and syntactical decoding 260 may use the lexical analysis model 238 and the syntactical analysis model 240, respectively, to process the output of the unit matching block 256 to recognize a WW 262 or command 264 from the set of user-defined WWs and command texts 232.
[0070]
[0071] The role of the offline training is to produce a neural network-based acoustic model 218 capable of identifying the acoustic units present in the input speech. The acoustic unit used by the system may be the phoneme. The neural network-based acoustic model is trained offline as it is independent of the target phrases (other than the language). In one embodiment, the offline training may produce a neural network-based acoustic model 218 capable of identifying acoustic units present in any language so that the data-free speech recognition architecture for many languages may use the same acoustic model. Target phrases as used here may refer to the WWs or commands that the data-free speech recognition system is trained to recognize. To achieve this, offline training may employ multiple speech databases. The databases may be annotated, such as the phoneme-annotated database 320, to identify the boundaries of the phonemes contained within the databases. The augmentation block 322 may augment the phoneme-annotated databases 320 to further enrich the variety of content and improve training. The data select/prep module 323 may select the files for training based on the desired balance in terms of gender, native talkers, non-native talkers, accent, age, background noise, room acoustics, etc., and can be adjusted based on the envisioned target application (close talking, near field, far field, geographic location, background noise environment, etc.). The feature extraction block 214 performs a spectral and temporal analysis of the input speech from the selected file and produces observation vectors, also referred to as observation sequences, which capture the important recognition aspects and discard as much of the rest as possible.
[0072] The neural network (NN) training block 324 iterates through the set of observation sequences and respective annotation information to learn to distinguish between and classify the input speech according to the specified acoustic units, in this case phonemes. The output of the neural network-based acoustic model 218 is a sequence of Softmax vectors consisting of the phonemes of the language plus a silence/noise class. In one embodiment, the output of the neural network-based acoustic model 218 is a frame-by-frame estimate of the likelihood or probability of the presence of the set of recognized units (phonemes in this case). In one embodiment, the output of the neural network-based acoustic model 218 represents some other estimate of the presence of the recognized units. This estimate may be limited in accuracy and precision and varies by different NN designs, training databases used, quality of the annotations, etc.
[0073] The phoneme analysis block 326 performs an analysis of the final acoustic model, captures relevant information, and incorporates this information into a decoding model. In one embodiment, the phoneme analysis block 326 collects statistical information characterizing the phonetic content and behavior of the databases and the trained acoustic model 218 that is later used to improve decoder model construction and performance.
[0074] The decoding model is trained or constructed online as it is dependent on the target phrase and is performed quickly for near immediate use by the user. The online training is based on the user-defined WWs and commands which are entered as text. In
[0075]
[0076]
[0077] As discussed, the NN-based unit matching block 256 is trained offline using annotated databases to learn to discriminate between the units (phonemes). The HMM models for sequence decoding 420 are based on the user-defined text input descriptions of the desired set of WWs and commands (e.g., target phrases). Since the system is data-free, material containing the target phrases is not available during the offline training process. The data-free speech recognition architecture utilizes the training procedure as discussed in
[0078]
[0079] The annotated database 320 includes labeling of the true units and their respective time boundaries within the training speech clips. During offline training of the acoustic model 218, the unit matching block 256 processes through the annotated database while simultaneously the phoneme analysis block 326 collects and compiles information on how the likelihood results 610 relate to the true labeled units within their respective time boundaries. For example, the phoneme analysis block 326 collects statistical information characterizing the phonetic content and behavior of the annotated databases 320 and the unit matching block 256. The phoneme analysis block 326 compiles this statistical information into a unit recognition statistics database 620. The ASP 640 then uses the unit recognition statistics database 620 to improve its results when utilizing the unit matching block 256 on unseen speech data. In one embodiment, the phoneme analysis block 326 generates a matrix of statistics characterizing the unit matching block 256 of the acoustic model 218.
[0080] In one aspect, the recognition model 350 uses the matrix of statistics to aid the ASP 640 such as the sequence decoding block 420 incorporating the HMM of
[0081]
[0082] The HMM evaluates the probability of the sequence of observations at each input vector of Softmax values given a model λ of the HMM. The HMM may evaluate the probability by the forward or Viterbi algorithm. The assignment of the phoneme to each state of the HMM is challenging and may use approaches such as a phonetic dictionary, manual definition, a tokenizer, etc. However, the resulting phonetic transcription may not be a good match, especially for unseen words. In addition, these approaches do not generally yield alternative pronunciations. In one aspect of the present disclosure, a decoding compensation model incorporates features derived from internal behavior of the HMM and offline analysis of matching and non-matching words to improve the discrimination ability of the HMM.
[0083] Given the observation sequence X=x.sub.1, x.sub.2, . . . , x.sub.T, and a model λ=(A, B, π), the HMM evaluates the probability of the observation sequence, P(X|λ), given the model (i.e., the probability of the observation sequence X=x.sub.1, x.sub.2, . . . , x.sub.T given the model λ of the HMM).
[0084] The word/command model 730 shows an example of using an HMM to model the WW Okay Infineon with the pronunciation /OW/K/EY/IH/N/F/IH/N/IY/AA/N/. However, alternate pronunciations may be equally valid or common, and these are supported. The word/command model 730 may support alternate pronunciations by allowing multiple phonemes in each state definition of the HMM. The preferred or most common phoneme for each state may be listed first and used to derive the expected length (time) of the state. The WW Okay Infineon with multiple pronunciations is shown.
[0085] The value of the Softmax output corresponding to the j.sup.th phoneme, ph.sub.j, is an estimate of the posterior probability P(q=ph.sub.j|x) where x is the observed input feature vector. Ideally, the acoustic model such as the unit matching block 256 would have 100% accuracy and be 100% confident (true phoneme Softmax score is 1.0). However, this is not the case. In fact, in clean conditions, the true phoneme has been shown to have the maximum score in the Softmax vector about 70-80% of the time with an average value of 0.5-0.9.
[0086] The Softmax output contains a posterior probability estimate for all phonemes, ph.sub.j, j=1 . . . N, where N is the number of phonemes (plus noise), and has the property of:

Σ.sub.j=1.sup.N P(q=ph.sub.j|x)=1   (1)

on a frame-by-frame basis. Hence, when the Softmax value of the true phoneme is less than 1.0, the difference is contained in the next most likely phonemes. Accordingly, a matrix can be built to capture this information to better characterize the acoustic model Softmax output. As indicated, the phoneme analysis block 326 of
[0087] The word/command model 730 also includes a decoding compensation block 740 and a universal background model 750. The decoding compensation block 740 takes as its input the offline analysis from the phoneme analysis block 326, intermediate model observations and probabilities from the word/command model 730, and the current model probability, P(X.sub.n|λ), to compute a new frame-by-frame model score that is used to detect the presence of the user-defined WW or commands. The decoding compensation block 740 incorporates not only P(X.sub.n|λ) but also features derived from internal behavior of the word/command model 730, and additionally considers offline analysis of matching words (positive input) and non-matching words (negative input) from the phoneme analysis block 326 to improve the discrimination ability of the sequence decoding block 420 compared with P(X.sub.n|λ).
[0088] To further improve the robustness of the sequence decoding block 420 to poorly articulated speech, noisy conditions, different speaker rates, etc., the universal background model (UBM) 750 attempts to normalize these conditions in estimating the user-defined WW or commands.
[0089] In one embodiment, the UBM 750 is modeled as a 3-state HMM of a leading silence state (leading sil), a speech state (Sp), and a final silence state (final sil). An HMM may be modeled by the transition probabilities of the states and the emission probabilities of the states. The transition probability of a state indicates how likely the HMM is to transition to the state given some current state. The emission probability of a state indicates the probability of the HMM generating an observation given some current state. The 3-state HMM may obtain an emission probability for the leading silence state from the NN Softmax entry for SIL, while the emission probability for the speech state is 1−(the emission probability for the leading silence state). The transition probability for the leading silence state may be based on the expected length of leading silence from when the SOD triggers until the start of the speech. The transition probability of the speech state may be based on the average length of the phonemes in the WW/CMD for which this UBM is working. The self-transition probability of the final silence state may be 1.0. The probability of the UBM may be the maximum of the probabilities of these three states. The following describes in detail the operations of the sequence decoding block 420.
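As an illustration of the 3-state UBM just described, the following minimal Python sketch scores a sequence of softmax frames; the self-transition defaults, the use of the SIL softmax entry as the emission for both silence states, and the function name are assumptions made for illustration rather than the disclosed implementation.

```python
import numpy as np

def ubm_score(softmax_frames, sil_index, a_lead=0.9, a_speech=0.95):
    """Illustrative 3-state UBM (leading silence, speech, final silence).

    softmax_frames: (T, N) acoustic-model softmax outputs per frame.
    sil_index: index of the silence/noise class in the softmax vector.
    a_lead, a_speech: assumed self-transition probabilities; in the text these
    would be derived from the expected leading-silence length and the average
    phoneme length of the WW/CMD. The speech emission is 1 minus the SIL entry.
    """
    eps = 1e-12
    trans = np.log(np.array([
        [a_lead, 1.0 - a_lead, 0.0],       # leading sil -> {leading sil, speech}
        [0.0, a_speech, 1.0 - a_speech],   # speech -> {speech, final sil}
        [0.0, 0.0, 1.0],                   # final sil self-transition = 1.0
    ]) + eps)

    def emissions(frame):
        p_sil = max(frame[sil_index], eps)
        return np.log([p_sil, max(1.0 - p_sil, eps), p_sil])

    log_delta = np.array([0.0, -np.inf, -np.inf]) + emissions(softmax_frames[0])
    for frame in softmax_frames[1:]:
        log_delta = emissions(frame) + np.max(log_delta[:, None] + trans, axis=0)
    # the UBM probability is taken as the best of the three state scores
    return float(np.max(log_delta))
```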
[0090] As discussed, the phoneme analysis block 326 of
[0091]
[0092] The unit matching block 256 of
[0093]
[0094]
[0095] The AM may compute the posterior probabilities, P.sub.n (910), using data within a window of length t.sub.window (920), which is shifted every analysis interval by t.sub.frame (930). The center of each window represents the time instance for the respective P.sub.n. For unit U_n (940), the start and end times are labeled U_n.sub.start (950) and U_n.sub.end (960), respectively. The posterior probabilities P.sub.0 and P.sub.1 lie within the time boundaries of U_n and are collected as part of the data for it, while P.sub.2 lies outside of the boundary. The phoneme analysis block 326 of
[0096] In one aspect, the phoneme analysis block 326 may build a matrix to capture this information and better characterize the AM output. For example, the phoneme analysis block 326 may compile a matrix by processing through the annotated training database 320. For each phoneme time boundary U_n (940), the posterior probabilities vector, P.sub.n (910), of the frame (e.g., t.sub.window (920)) containing the maximum value of the root truth unit (phoneme) is identified and stored. The phoneme analysis block 326 repeats this process for each utterance in the database such that a matrix of P.sub.n (910) values is accumulated for each unit. The vectors are averaged across time to obtain the average P.sub.n vector values when the root-truth unit is at its maximum value within the identified boundaries.
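The following sketch shows one way such a matrix of averaged P.sub.n vectors could be accumulated from an annotated database; the (softmax_frames, segments) data layout and the function name are assumed for illustration.

```python
import numpy as np

def build_similarity_matrix(utterances, n_phonemes):
    """For each annotated unit, take the softmax frame where the true phoneme
    peaks inside its labeled boundaries, and average those vectors per phoneme.

    utterances: iterable of (softmax_frames, segments), where softmax_frames is
    a (T, n_phonemes) array and segments is a list of (true_idx, start, end)
    frame boundaries. This layout is an assumption for illustration.
    """
    sums = np.zeros((n_phonemes, n_phonemes))
    counts = np.zeros(n_phonemes)
    for softmax_frames, segments in utterances:
        for true_idx, start, end in segments:
            window = softmax_frames[start:end]
            if len(window) == 0:
                continue
            # frame where the ground-truth phoneme reaches its maximum score
            peak = window[np.argmax(window[:, true_idx])]
            sums[true_idx] += peak
            counts[true_idx] += 1
    counts[counts == 0] = 1
    # row j holds the average softmax vector observed when q = ph_j peaks
    return sums / counts[:, None]
```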
[0097]
[0098]
[0099] As explained above, the AM is neither 100% accurate nor 100% confident in classifying the phonemes. The output classes sum to one as shown in Equation (1). Hence, when the Softmax value of the true phoneme is less than 1.0, the difference is contained in the next most likely phonemes. The similarity matrix captures statistics of this confusion across similar phonemes. In one aspect, to improve the discrimination ability of the AM, the unit matching block 256 incorporates the a priori information in the similarity matrix into the posterior probability estimates.
[0100]
[0101] The posterior probability that the current phoneme, q, is the j.sup.th phoneme, ph.sub.j, is given by the equation:

P(q=ph.sub.j|x)=Sx(j)   (2)

where Sx(j) is the Softmax value of the j.sup.th phoneme. The similarity matrix gives the average value of each phoneme, ph.sub.i, when q=ph.sub.j and ph.sub.j is at its maximum value. For example, referring to the magnified similarity matrix in
[0102] To address this, one technique to improve the posterior probability estimates is to add the Softmax values of the confusing phonemes with restrictions in consideration of the ratios in the similarity matrix. Denote the top-N highest scores in the j.sup.th column of the similarity matrix, Sim.sub.j, to be {ph.sub.k1, ph.sub.k2, . . . , ph.sub.kN}, then a new posterior probability estimate, P.sub.sim, is formulated to be:
where .sub.R is a small factor (e.g., .sub.R=0.1) added to the ratio of
to allow some variance, as the ratios represent a global average. The model compensation block 1210 modifies the likelihood results 610 (e.g., Softmax outputs) as in Equation 3 to improve the posterior probability estimates only when the Softmax outputs correlate well with the similarity matrix, thus improving recognition ability without increasing false detections.
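Because Equation 3 is not reproduced above, the sketch below gives one plausible reading of the modification: the softmax scores of the top confusing phonemes are added to the target score only while their ratios to the target score stay within the similarity-matrix ratios plus a small slack. The gating rule, the slack handling, and the capping at 1.0 are assumptions.

```python
import numpy as np

def posterior_with_similarity(softmax, sim, j, top_n=3, slack=0.1):
    """One plausible reading of the P_sim modification.

    softmax: the current frame's softmax vector.
    sim: similarity matrix, sim[i, j] = average score of ph_i when ph_j peaks.
    j: index of the phoneme whose posterior is being estimated.
    Confusing phonemes' scores are added only while their ratio to the target
    score does not exceed the ratio seen in the similarity matrix plus a small
    slack; the gating rule and slack value are illustrative assumptions.
    """
    target = softmax[j]
    if target <= 0.0:
        return target
    column = sim[:, j].copy()
    column[j] = -np.inf                      # exclude the target phoneme itself
    confusers = np.argsort(column)[::-1][:top_n]
    p_sim = target
    for k in confusers:
        expected_ratio = sim[k, j] / max(sim[j, j], 1e-12)
        if softmax[k] / target <= expected_ratio + slack:
            p_sim += softmax[k]
    return min(p_sim, 1.0)
```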
[0103] Advantageously, the approach to modify the posterior probabilities according to the computed statistics in the similarity matrix improves phonetic modeling for a given acoustic model, thereby improving phoneme recognition and speech recognition. In addition, the acoustic model training loop may integrate the model compensation block 1210 to automatically compensate for different AM characteristics, thus avoiding the need to retrain or retune the ASP system.
[0104] In one aspect, the HMM used for decoding the posterior probability sequence for speech recognition may use Equation 3 to improve the transition probabilities for the state decoding.
[0105]
[0106] The j.sup.th state in the HMM represents the j.sup.th phoneme in the WW or command being recognized. The model begins and ends with a silence state, and the total number of states in the model is N.sub.states.
[0107] In one aspect, alternate pronunciations may be equally valid or common. The HMM may support alternate pronunciations by allowing multiple phonemes in each state definition, as shown for states 1320 and 1330. The preferred or most common phoneme for each state may be listed first and used to derive the expected length (time) of the state. The HMM may limit the number of pronunciations to minimize processing complexity. In addition, the alternative pronunciations increase the chance for false detections, so this may be weighed against the improvement in detection rate.
[0108] When determining the pronunciations to be included in the HMM, the model may include probabilities. For example, if two pronunciations are equally probable, then the HMM may include both pronunciations. However, if one pronunciation has a probability of 0.99 while the other has 0.01, then the HMM may not include the latter pronunciation, as its inclusion will only slightly improve the positive recognition rate, while likely increasing the false detection rate by a disproportionate amount.
[0109] In one aspect, a tokenizer such as the tokenizer/custom dictionary block 340 of
[0110] In one aspect, the HMM of a sequence decoding model contains the primary pronunciation phoneme for each state, along with up to P1 alternate pronunciations for each state, for a total of up to P phones for each state. For each pronunciation, the highest confusing phonemes according to the similarity matrix may be included along with their respective ratios according to Equation 3. In one embodiment, confusable phonemes are included until the sum of Equation 4:
is greater than 0.8, Sim.sub.j(k)>0.05, and k≤N≤4. These parameters are configurable.
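A minimal sketch of how confusable phonemes might be selected for a state under these configurable parameters is shown below; the exact accumulation rule is an assumption.

```python
import numpy as np

def select_confusable(sim, j, mass_threshold=0.8, min_sim=0.05, max_k=4):
    """Pick confusable phonemes for the state modeling phoneme j.

    Phonemes are added in decreasing similarity until the accumulated column
    mass exceeds mass_threshold, each added phoneme has similarity above
    min_sim, and at most max_k phonemes are kept. The threshold values follow
    the configurable defaults in the text; the accumulation rule is assumed.
    """
    order = np.argsort(sim[:, j])[::-1]
    chosen, mass = [], 0.0
    for k in order:
        if len(chosen) >= max_k or sim[k, j] <= min_sim:
            break
        # keep the confusable phoneme with its ratio to the target phoneme
        chosen.append((int(k), sim[k, j] / max(sim[j, j], 1e-12)))
        mass += sim[k, j]
        if mass > mass_threshold:
            break
    return chosen
```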
[0111]
[0112] As discussed in
[0113]
[0114] In one aspect, the HMM may be determined by parameters λ=(A, B, π), where A represents the transition probability matrix of the states, B represents the emission probability, and π represents the initial state distribution. A is a matrix whose rows represent a probability distribution that dictates how likely the HMM is to transition to each state, given some current state. B estimates the probability of the HMM generating an observation X=(x.sub.1, x.sub.2, . . . , x.sub.T), given some current state. π is a probability distribution that dictates the probability of the HMM starting in each state (usually starting in the first state).
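For illustration, the A and π of a left-to-right phrase model could be constructed from expected per-state durations as sketched below; the duration-based self-transition heuristic (a.sub.jj = 1 − 1/expected frames) is a common choice assumed here rather than a formula taken from the disclosure, and B is supplied at run time by the acoustic-model softmax outputs.

```python
import numpy as np

def build_left_to_right_hmm(expected_frames):
    """Construct (A, pi) for a left-to-right phrase model.

    expected_frames: expected number of frames spent in each state (e.g.,
    average phoneme length divided by the frame period). The self-transition
    probability 1 - 1/expected_frames[j] is an illustrative assumption.
    """
    n = len(expected_frames)
    A = np.zeros((n, n))
    for j, d in enumerate(expected_frames):
        stay = max(0.0, 1.0 - 1.0 / max(d, 1.0))
        A[j, j] = stay
        if j + 1 < n:
            A[j, j + 1] = 1.0 - stay       # advance to the next phoneme state
        else:
            A[j, j] = 1.0                  # final state absorbs
    pi = np.zeros(n)
    pi[0] = 1.0                            # usually start in the first state
    return A, pi
```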
[0115] For example, in the HMM of the word/command model 730 of
[0116] Given the observation sequence X=x.sub.1, x.sub.2, . . . , x.sub.T, and a model λ=(A, B, π), the HMM evaluates the probability of the observation sequence, P(X|λ), given the model (i.e., the probability of the observation sequence X=x.sub.1, x.sub.2, . . . , x.sub.T given the model λ of the HMM). A straightforward approach to evaluate P(X|λ) may sum over all possible state sequences s.sub.1, s.sub.2, . . . , s.sub.T that could result in the observation sequence, X. However, this direct computation method is extremely complex.
[0117]
[0118] Instead, the HMM may evaluate P(X|λ) using a recursive approach, known as the forward algorithm, based on the Markov assumption of the HMM.
[0119]
[0120] In one embodiment, the HMM may use the Viterbi algorithm to further simplify the recursive approach by considering only the most likely path, instead of a summation.
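A minimal Viterbi sketch over such a phrase model is shown below; it converts softmax posteriors to scaled likelihoods by dividing by phoneme priors (as discussed in the following paragraphs), assumes one phoneme per state for brevity, and uses illustrative names throughout.

```python
import numpy as np

def viterbi_log(softmax_frames, priors, state_phonemes, A, pi):
    """Most-likely-path evaluation of a phrase HMM (a sketch, not the
    disclosed implementation).

    softmax_frames: (T, N) acoustic-model posteriors per frame.
    priors: relative frequency of each phoneme, used to turn posteriors into
    scaled likelihoods.
    state_phonemes: list mapping each HMM state to its phoneme index.
    A, pi: transition matrix and initial distribution of the phrase model.
    Returns the best log score and the decoded state walk s_ML.
    """
    T, S = len(softmax_frames), len(state_phonemes)
    logA = np.log(A + 1e-12)
    emit = np.log(np.maximum(softmax_frames[:, state_phonemes], 1e-12)
                  / np.maximum(priors[state_phonemes], 1e-12))
    delta = np.log(pi + 1e-12) + emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA        # rows: from state, cols: to state
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):             # backtrack the most likely path
        path.append(int(back[t, path[-1]]))
    return float(np.max(delta)), path[::-1]
```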
[0121]
[0122] The value of the Softmax output corresponding to the j.sup.th phoneme, ph.sub.j, is an estimate of the posterior probability P(q=ph.sub.j|x). Using Bayes Rule, the posterior probability P(q=ph.sub.j|x) may be related to the likelihood p(x|q=ph.sub.j) by:

P(q=ph.sub.j|x)=p(x|q=ph.sub.j)P(q=ph.sub.j)/p(x)

[0123] Rearranging in terms of the likelihood yields:

p(x|q=ph.sub.j)=P(q=ph.sub.j|x)p(x)/P(q=ph.sub.j)   (7)
[0124] The Viterbi Algorithm may evaluate the trellis using the Softmax output. Interpreting Equation (7), the likelihoods are obtained by dividing the posterior probabilities by the a priori probabilities, which means to divide the NN Softmax output scores by the relative frequency of each phoneme, P(q=ph.sub.j). Equation 7 also shows scaling the division of the posterior probabilities with the a priori probabilities by the probability of observing x, which may be estimated from the universal background model 750 of
[0125] The model probability evaluated with the Viterbi algorithm is described by:
where a.sub.s is the transition probability vector, such as {a.sub.ij, a.sub.jj, a.sub.kj} of
[0126] Substituting Equation 7 for p (x|q=ph.sub.j) into Equation 8 yields:
[0127] In one embodiment, the HMM may evaluate P(X|λ) using a modified version of Equation 9 by including the similarity matrix formulation. Using Equation 3, Equation 8 may be modified according to:
[0128] Next, the HMM may incorporate multiple pronunciations. Each state may include multiple phonemes in its definition. Define the phonemes included in the definition of state s to be .sub.s:
[0129] The HMM may then evaluate the probability of the observation sequence as:
[0130] At each frame time, t, the state with the highest likelihood is known as the most likely state,
and may be expressed as.
[0131] Advantageously, the HMM as described automatically optimizes the decoder model to different acoustic models to improve performance and accuracy. It also recognizes different accents and pronunciations.
[0132] Evaluation of the trellis using the Viterbi algorithm yields the likelihood of the most probable path, but says nothing of the path itself. However, since the HMM is modeling a WW or CMD, and each state represents a phoneme in sequence from beginning to end, it would follow that the most likely state, found according to Equation (13), may also progress in sequence over time. Thus, the path that the most likely state takes can also be used to discriminate among the state sequences. The most likely state sequence (also referred to as the state walk), s.sub.ML, may be expressed starting from the first frame in which the most likely state is not the initial silence state.
[0133]
[0134]
[0135] Referring back to
[0136] In one aspect, the sub-word ratio analysis block 1510 may analyze the time length of words in the decoded sequence of a WW or command. The WW/CMD/phrase often comprises individual words. For these words, the time (number of frames) expected for each word may be estimated. Each talker may speak slower or faster, but the ratio of length between words is expected to remain approximately consistent. If too much time is spent in one word versus another in relation to what is expected, then it is less likely that the WW is present.
[0137] The sub-word ratio analysis block 1510 may compute the sub-word ratio penalty based on the states in the HMM representing the different words and the number of frames spent in each word using the most likely state sequence s.sub.ML decoded by the Viterbi algorithm of the HMM. The sub-word ratio analysis block 1510 may compute the ratio between each word, L.sub.i, and the total length of the WW, L.sub.t, and compare it to the expected lengths.
[0138] The sub-word ratio analysis block 1510 may compute a log likelihood penalty from this comparison, where, in one embodiment, the default values of TH.sub.swr and R.sub.swr are {0.15, 15.0}.
[0139] The sub-word ratio analysis block 1510 may compute the total sub-word ratio penalty as the sum of all the sub-word penalties, p.sub.swr:
where N.sub.words is the number of sub-words in the WW or command.
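Since the penalty equations are not reproduced above, the sketch below gives one plausible reading of the sub-word ratio penalty: each word's observed length ratio is compared to its expected ratio, and any deviation beyond TH.sub.swr is scaled by R.sub.swr. Only the default values follow the text; the formula itself is an assumption.

```python
def subword_ratio_penalty(observed_frames, expected_frames,
                          th_swr=0.15, r_swr=15.0):
    """Sub-word ratio penalty sketch.

    observed_frames / expected_frames: frames spent in each word of the phrase
    according to s_ML and the offline analysis, respectively. Deviations of
    each word's length ratio beyond th_swr are scaled by r_swr; this exact
    form is an illustrative assumption.
    """
    total_obs = float(sum(observed_frames)) or 1.0
    total_exp = float(sum(expected_frames)) or 1.0
    penalty = 0.0
    for obs, exp in zip(observed_frames, expected_frames):
        deviation = abs(obs / total_obs - exp / total_exp)
        if deviation > th_swr:
            penalty -= r_swr * (deviation - th_swr)
    return penalty   # added to the log model score (negative = less likely)
```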
[0140] In one aspect, the state jump analysis block 1520 may analyze state jumps or skips in the s.sub.ML. Since every state in the HMM represents a phoneme in the pronunciation of the desired word, s.sub.ML should include every state. When there are skipped or jumped states in the s.sub.ML, it implies that the phoneme was not present in the input speech.
[0141]
[0142] The state jump analysis block 1520 may compute the state jump penalty, p.sub.sj, as the weighted sum of all of the state jumps found in s.sub.ML according to:
where
are the states of s.sub.ML at frames t and t−1, respectively, and R.sub.sjp is a state jump weighting factor. In one embodiment, the default value of R.sub.sjp is 0.5.
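One plausible reading of the state jump penalty is sketched below, penalizing each skipped state in s.sub.ML by R.sub.sjp; the per-skip weighting is an assumption, while the default value of R.sub.sjp follows the text.

```python
def state_jump_penalty(state_walk, r_sjp=0.5):
    """State jump penalty sketch.

    state_walk: the decoded most likely state sequence s_ML.
    Every skipped state between consecutive frames implies a phoneme that was
    never observed, so each skip is penalized and weighted by r_sjp. Treating
    the penalty as r_sjp times the number of skipped states is an assumption.
    """
    penalty = 0.0
    for prev, curr in zip(state_walk[:-1], state_walk[1:]):
        skipped = curr - prev - 1
        if skipped > 0:
            penalty -= r_sjp * skipped
    return penalty
```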
[0143] In one aspect, the self-transition analysis block 1530 may analyze the self-transition probability for each state in the s.sub.ML. The average length of each phoneme may be expressed as a number of frames by dividing by t.sub.frame, where t.sub.frame is the time for each frame.
[0144] Hence, the average phoneme length in frames is the expected number of frames in the corresponding state for which ph.sub.j is the preferred pronunciation. The self-transition analysis block 1530 may compare this expected number to the observed number of consecutive frames that the s.sub.ML spends in a state. If the observed number of frames for a state substantially exceeds the expected number of frames, it is less likely that the WW is present.
[0145]
[0146] The self-transition analysis block 1530 may quantify the difference between the observed and the expected number of frames the s.sub.ML remains in each state. Let
be the number of consecutive self-transitions starting from
and
where R.sub.st is a self-transition factor. In one embodiment, R.sub.st may have a default value of 0.5.
[0147] Equation 20 shows that if the difference between the observed self-transition length and the expected length of the phoneme is greater than zero, this difference is multiplied by the self-transition factor, R.sub.st, and then by the log of the self-transition probability to compute the self-transition penalty for the state corresponding to the phoneme. Equation 20 computes a sum of all such formulations for the self-transitions in the current state walk. This formulation is proportional not only to the number of frames beyond that expected, but also to how quickly a state is expected to transition to the next state. For example, if the number of self-transitions has exceeded the expected number by 2 frames and R.sub.st is 0.5, then the penalty is the log of the self-transition probability (since 0.5×2=1). If it is highly likely for the state to stay in its current state, then the self-transition probability is close to 1, and its log is close to 0, yielding a small penalty. This makes sense because exceeding by 2 frames is more likely in this case. However, if the phoneme length is short, then the self-transition probability is low and the log term has a large magnitude. In this case, exceeding the expected length by 2 frames when the probability to transition is 0.9 is highly unlikely, and hence, the self-transition penalty is larger.
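For illustration, the self-transition penalty described in this paragraph could be computed as sketched below; the run detection over s.sub.ML is an assumption, while the excess-frames-times-R.sub.st-times-log(self-transition probability) form follows the description.

```python
import numpy as np

def self_transition_penalty(state_walk, expected_frames, self_probs, r_st=0.5):
    """Self-transition (state walk) penalty sketch.

    state_walk: decoded most likely state sequence s_ML.
    expected_frames[j]: expected number of frames in state j.
    self_probs[j]: self-transition probability of state j.
    For each run of consecutive frames in a state, any excess over the
    expected length is multiplied by r_st and by log(self_probs[j]).
    """
    penalty, run_state, run_len = 0.0, None, 0
    for s in list(state_walk) + [None]:          # sentinel flushes the last run
        if s == run_state:
            run_len += 1
            continue
        if run_state is not None:
            excess = run_len - expected_frames[run_state]
            if excess > 0:
                penalty += r_st * excess * np.log(max(self_probs[run_state], 1e-12))
        run_state, run_len = s, 1
    # log of a probability is negative, so exceeding the expected length lowers the score
    return penalty
```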
[0148] In one aspect, the top-1 statistics analysis block 1540 may weigh the s.sub.ML based on how well the s.sub.ML matches the expected phonemes of the WW or command. The motivation for the analysis is that even though the HMM captures well the likelihood of the input given the model, P(X|λ), it does not inherently give weight to the absolute ranking of the phonemes in the Softmax output.
[0149]
[0150] Define Top1(state, phoneme) as the fraction of frames in which the given phoneme is ranked top-1 while the HMM is in the given state. Hence, from Table 1, Top1(1, OW)=0.312. Define
to be the phoneme at time t whose Softmax score is the highest:
where
is the prior probability.
[0151] The top-1 statistics analysis block 1540 may tally the score at each frame t according to state
and then may average the scores for each state. The top-1 statistics analysis block 1540 may average the average score in each state across states to obtain a final top1 score, score.sub.top1. If there are no scores in a state, then that state may obtain a score average of 1. The final penalty, p.sub.top1 is then obtained by:
where R.sub.top1 is the top-1 score factor. In one embodiment, R.sub.top1 may have a default value of 3.0.
[0152] The formulation of p.sub.top1 rewards top-1 sequences matching that expected, giving extra weight to those phonemes highly expected to be in the top-1, while penalizing top-1 sequences that do not match, again especially those not matching phonemes highly expected to be ranked top-1. Note that the final penalty, p.sub.top1, can be positive or negative.
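The per-frame tally is not specified above, so the sketch below assumes a simple rule: each frame contributes +Top1 when the observed top-1 phoneme matches the state's phoneme and −Top1 when it does not, before the per-state and across-state averaging and the R.sub.top1 scaling described in the text.

```python
import numpy as np

def top1_penalty(state_walk, top1_frame_phonemes, state_phonemes,
                 top1_stats, n_states, r_top1=3.0):
    """Top-1 statistics penalty sketch.

    state_walk: decoded most likely state sequence s_ML.
    top1_frame_phonemes: for each frame, the phoneme with the highest softmax score.
    state_phonemes[j]: the phoneme modeled by state j.
    top1_stats[j]: expected top-1 ranking fraction for that phoneme in state j.
    The +/- per-frame scoring rule and the per-state Top1 simplification are
    assumptions; the averaging and R_top1 = 3.0 default follow the text.
    """
    per_state = [[] for _ in range(n_states)]
    for state, observed in zip(state_walk, top1_frame_phonemes):
        weight = top1_stats[state]
        match = 1.0 if observed == state_phonemes[state] else -1.0
        per_state[state].append(match * weight)
    # states never visited default to a score average of 1 as described in the text
    state_means = [np.mean(s) if s else 1.0 for s in per_state]
    score_top1 = float(np.mean(state_means))
    return r_top1 * score_top1   # may be positive (reward) or negative (penalty)
```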
[0153] The total compensation, p.sub.ZNN, is the sum of all of the individual compensations:
[0159] The decoding compensation block 740 may improve the sequence decoding of the WWs or commands from the HMM by modifying the model likelihood score, P(X|λ), at time t, according to:
[0160] Advantageously, use of the decoding compensation block 740 on different WW models demonstrates a 50-90% reduction in the false alarm (FA) rate. The decoding compensation block 740 is integrated within the decoding and operates frame-by-frame, thus working seamlessly with the Viterbi algorithm and introducing essentially no additional algorithm or processing delay.
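Putting the pieces together, one way the total compensation could be applied to the frame's model score is sketched below; adding p.sub.ZNN in the log domain is an assumption about the unreproduced combining equation.

```python
def compensated_score(log_model_score, p_swr, p_sj, p_st, p_top1):
    """Combine the individual compensations into p_ZNN and adjust the frame's
    model likelihood score, mirroring paragraphs [0153] and [0159]. The
    log-domain addition is an illustrative assumption.
    """
    p_znn = p_swr + p_sj + p_st + p_top1   # total compensation
    return log_model_score + p_znn          # modified score used for detection
```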
[0161] In one aspect, a set of sequence decoding structures customized to the user-defined command set may support any combination of WW, simple commands, compound commands, and numbers-based commands.
[0162]
[0163] The sequence decoding structure for WWs is similar to that depicted in
[0164] A simple command may include constituent words that contain little or no commonality with other commands. For example, a simple command may be the command to take a picture, or to set alarm clock to snooze. Similar to the decoding structure for WWs, a single structure block for decoding simple commands may include a lexical decoding block 2238 trained to recognize one or more simple commands 2230. The lexical decoding block 2238 may process the sequence of phoneme likelihood vectors from the unit matching block 256 to render a recognized simple command by modeling the combined phonemes of the constituent words of the simple command.
[0165] Compound commands may include a mix of sub-commands that are common and sub-commands that are unique to each compound command. For example, the four compound commands 1) Turn the light on in the living room; 2) Turn the light on on the porch; 3) Turn the light on behind the study desk; and 4) Turn the light on by the stove, may be split into the common sub-command Turn the light on followed by the four unique second stage sub-commands.
[0166] A sequence decoding structure for decoding compound commands may include a lexical decoding block 2248 trained to recognize common sub-commands and unique sub-commands based on a limited word dictionary 2240. A syntactical analysis block 2244 trained to recognize one or more compound commands 2242 may apply constraints based on word grammar and proper sequencing to evaluate the common sub-commands and unique sub-commands. For example, if the syntactical analysis block 2244 recognizes Turn the light on, the syntactical analysis block 2244 may evaluate the set of the four second stage sub-commands to render a recognized compound command.
[0167] If a command includes only a few numbers, such as Set the dial to {1,2}, then the command can be unrolled into two separate simple commands, or into a compound command. However, this becomes impractical when the number range is large, such as setting the temperature of an oven to two hundred forty seven degrees. A sequence decoding structure for decoding a large number range followed by units of the numbers such as temperature, volume, currency, time, etc. (referred to as number-based entities) may include a lexical decoding block 2258 trained to recognize numbers and units based on a number/unit dictionary 2250. A syntactical analysis block 2254 trained to recognize numbers followed by units may apply rules 2252 to evaluate the sequence of numbers and units. Number decoding may need to consider the past and current to determine the future. For example, the number two could be the end of recognition if the expected range is digits, or it may be followed by hundred or something else if a larger range is defined. Thus, number decoding may include a semantic analysis block 2255 trained to evaluate commands based on constraints such as meaning, reference, logic, implication, application, etc. (collectively app 2255) to render a recognized number-based entity.
[0168] A complex command may include a simple or compound command followed by a large range of numbers and a unit. For example, a complex command may be the command to set oven temperature to two hundred forty seven degrees. A sequence decoding structure for decoding complex commands may combine the structure blocks of a decoding structure for simple commands, compound commands, and number-based entities. For example, a sequence decoding structure for complex commands may include a lexical decoding block 2238 trained to recognize one or more simple commands 2230, a lexical decoding structure 2258 trained to recognize numbers and units based on a number/unit dictionary 2250, a syntactical analysis block 2254 trained to recognize numbers followed by units based on rules 2252, and a semantic analysis block 2255 trained to evaluate number-based entities based on constraints in app 2255. The sequence decoding structure may render a recognized complex command composed of a simple command followed by a number-based entity.
[0169]
[0170]
[0171]
[0172]
[0173]
[0174] A user may define the WWs and commands in the command set and may invoke a design flow to map the user-defined command set to the desired sequence decoding structures as part of a training process.
[0175]
[0176] In operation 2401, the data free speech recognition system may select user-defined WWs and commands.
[0177] In operation 2403, the data free speech recognition system may analyze content and inherent structure of the WWs and commands. In one embodiment, the WWs may include multiple constituent words, and the commands may be classified as simple commands, compound commands, number-based entities, and complex commands that include combinations of simple/complex commands and number-based entities.
[0178] In operation 2405, the data free speech recognition system may construct recognition models for the WWs and commands based on the analysis. In one embodiment, the recognition models may include sequence decoding structures such as an HMM that evaluates the probability of the observation sequence, P(X|λ), given the model λ of the HMM.
[0179] In operation 2407, the data free speech recognition system may train the recognition model to recognize the WWs and commands (e.g., target phrases). In one embodiment, an online training process as described in
[0180] The data free speech recognition system may deploy the recognition models to detect WWs and commands in speech during the inference stage as discussed in
[0181]
[0182] Another potential issue is the random-likeness of the last non-silence WW state and the first non-silence (S2) state of the command models. For example, if CMD1 (2520) starts with the word Next then S.sub.12 (2580) will be modeled by phoneme /N/ and happens to match well with the /N/ (2560) from the end of Infineon. CMD 2 (2530) may not match the /N/ (2560). Hence, CMD1 (2520) may have a higher initial likelihood than CMD 2 (2530), completely unrelated to the command being spoken. This results in a bias towards CMD 1 (2520) and a decrease in performance of the command models. In one aspect, the command models may compensate for the WW-to-command transition.
[0183]
[0184] A recognition model for a command concatenates each word of the command into a single model, each word separated by a silence state. However, the amount or even presence of a silence gap between words is quite variable, depending on the words and the talker. In one aspect, to better handle command-to-command transitions, a command model may fold the last non-silence state from the preceding word and the first non-silence state from the following word into the silence state modeling the gap.
[0185]
[0186] In one embodiment, a compound command composed of multiple stage sub-commands may have modified states for sub-command transitions similar to that for a simple command. However, the first stage sub-command may include only a preceding silence state (decoding only a silence gap) and no trailing silence state. Intermediate sub-commands may not contain preceding or trailing silence states. The final stage sub-command may include only a trailing silence state and no preceding silence state.
[0187] As discussed in operation 2407 of
[0188]
[0189] Tokenizers based on machine learning or deep learning approaches may achieve high transcription accuracy when performing the G2P task. However, their performance relies heavily on both the quality and volume of the training data based on real speech. Training databases may be restricted for use, expensive to obtain, or may not exist in enough quantity, especially for different languages, to properly train the models. It is also desirable to train the tokenizers to support different accents, dialects, and languages.
[0190] Described is a statistics-based tokenizer solution that includes a training phase and a decoder phase. In the training phase, the tokenizer may process words from a reference phonetic dictionary containing word-token transcriptions. The tokenizer may break words in the dictionary into sub-words and may compile statistics to generate a custom dictionary containing sub-words and their estimated likelihoods.
[0191] In the decoding phase, the tokenizer may analyze the text input, perform a sub-word search, and solve iteratively using the sub-words and their likelihoods from the custom dictionary to maximize the token stream probability. In one aspect, during the decoding phase, the tokenizer may analyze the text input of target phrases in the user-defined command set using the dictionary of sub-words and their estimated likelihoods to tabulate the most likely phoneme string equivalents of the text input and their likelihoods. The data free speech recognition system may use the top-N phoneme string equivalents for online training of the recognition models of the target phrases such that sequence decoding of the speech of the target phrases is adaptive to the offline-trained acoustic model.
[0192]
[0193] In the decoding phase 2950, the tokenizer 2960 may split a target text input, such as text input of target phrases in the user-defined command set, into different unique combinations of sub-words. The tokenizer 2960 may perform a search of each sub-word of the combinations in the sub-word likelihood dictionary 2940 to find the phoneme corresponding to the sub-word and the sub-word's position within the target text input. Each combination of phoneme, sub-word, and the sub-word's position has a corresponding probability. The tokenizer 2960 may multiply the corresponding probabilities for all the sub-words in each unique combination of sub-word split to obtain the probability of the combination. The tokenizer 2960 may solve for the most likely combination among all the combinations to maximize the probability of the phoneme string equivalent for the target text.
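A minimal sketch of this search-and-solve decoding is shown below; the dictionary layout (mapping a {sub-word, tag} pair to candidate phoneme strings with probabilities) and the reduction of positions to four tags are simplifying assumptions.

```python
from functools import lru_cache

def decode_word(word, subword_dict, min_len=2):
    """Tokenizer decoding sketch: find the most likely phoneme string for a
    word by trying sub-word splits and multiplying per-sub-word likelihoods.

    subword_dict: mapping (subword, tag) -> list of (phoneme_string, prob),
    where tag is 'Start', 'Middle', 'End', or 'Full'. This layout is assumed
    for illustration; positional indices are simplified into the four tags.
    """
    n = len(word)

    @lru_cache(maxsize=None)
    def best(start):
        if start == n:
            return 1.0, ""
        best_prob, best_phones = 0.0, None
        for end in range(start + min_len, n + 1):
            sub = word[start:end]
            if start == 0 and end == n:
                tag = "Full"
            elif start == 0:
                tag = "Start"
            elif end == n:
                tag = "End"
            else:
                tag = "Middle"
            for phones, prob in subword_dict.get((sub, tag), []):
                rest_prob, rest_phones = best(end)
                if rest_phones is None:
                    continue
                total = prob * rest_prob
                if total > best_prob:
                    best_prob = total
                    best_phones = (phones + " " + rest_phones).strip()
        return best_prob, best_phones

    return best(0)   # (probability, phoneme string) or (0.0, None) if unsolved
```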
[0194]
[0195] In one embodiment of the sub-word splitting block 3030, a split block 3040 splits the current dictionary word into different sub-words. For example, splitting may proceed one grapheme at a time from the beginning of the word, and/or from the end of the word, and/or in both directions from the middle or other starting point. In one embodiment, the sub-words may have a minimum length of 2 graphemes.
[0196] The split block 3040 may consider certain exceptions, conditions, rules for common beginnings/common endings, etc., 3050, when splitting. For example, in English, certain grapheme pairs exist that constitute a single phoneme such as [ph, sh, ch, th, ck, ng, ll, ss, tt, aw]. If the split block 3040 observes these pairs, the split block 3040 will not split the pairs and will consider each pair as a single grapheme unit. The split block 3040 may employ other exceptions or rules to improve the splitting such as common endings [ing, ion, etc.].
[0197] A tag/position block 3060 may categorize the different positions of the sub-words within the original word into tags. For example, the tag/position block 3060 may assign the tags <Start>, <Middle>, <End> to categorize sub-words that are positioned at the start, middle, or end of the word. The tag/position block 3060 may assign the tag <Full> for a sub-word that constitutes the complete original word. In addition, the tag/position block 3060 may assign the grapheme starting position number to track the original location of the sub-word within the word.
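For illustration only, the following Python sketch shows one plausible way to perform the splitting and tagging described above; the helper names, the simplified digraph list, and the enumeration strategy are assumptions for this example and are not taken from the split block 3040 or the tag/position block 3060 themselves.

# Minimal sketch of sub-word splitting and tagging (illustrative only).
DIGRAPHS = ["ph", "sh", "ch", "th", "ck", "ng", "ll", "ss", "tt", "aw"]

def to_units(word):
    """Group graphemes so the listed digraphs count as single grapheme units."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

def split_and_tag(word, min_len=2):
    """Yield (sub_word, tag, start_position) candidates for a dictionary word."""
    units = to_units(word)
    n = len(units)
    yield "".join(units), "<Full>", 0            # the whole word as one candidate
    for start in range(n):
        for end in range(start + min_len, n + 1):
            if start == 0 and end == n:
                continue                          # already emitted as <Full>
            tag = "<Start>" if start == 0 else ("<End>" if end == n else "<Middle>")
            yield "".join(units[start:end]), tag, start

for sub, tag, pos in split_and_tag("washing"):
    print(sub, tag, pos)

Running the example on a word such as "washing" yields candidates like ("wa", "<Start>", 0) and ("shing", "<End>", 2), which the training phase could then tally against the phoneme transcription of the word.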
[0198]
[0199] Referring back to
[0200]
[0201] Referring back to
[0202]
[0203] Referring back to
[0204]
[0205] The sub-word likelihoods dictionary 2940 contains the likelihoods of each unique {sub-word, phoneme, tag} triplet after processing through the complete input phonetic dictionary 2920. The final sub-word likelihoods dictionary 2940 may include the complete table of tallies as in FIG. 34 or may be pruned to contain only the top-N likely pronunciations to reduce table storage requirements. In one embodiment, if only the most likely final word pronunciation is required, then the sub-word likelihoods dictionary 2940 can be pruned to contain only the top-1 likely pronunciation for each {sub-word, tag} pair.
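As a small illustration of this pruning step, the Python sketch below (with a hypothetical data layout and made-up tallies) converts raw {sub-word, phoneme, tag} counts into likelihoods and keeps only the N most likely pronunciations for each {sub-word, tag} pair.

def prune_dictionary(tallies, top_n=1):
    """tallies maps (sub_word, tag) -> {phoneme: count}; returns the same keys
    mapped to a list of (phoneme, likelihood) kept to the top_n entries."""
    pruned = {}
    for key, phoneme_counts in tallies.items():
        total = sum(phoneme_counts.values())
        ranked = sorted(((ph, count / total) for ph, count in phoneme_counts.items()),
                        key=lambda item: item[1], reverse=True)
        pruned[key] = ranked[:top_n]
    return pruned

# Made-up tallies compiled during the training phase, for illustration only.
tallies = {
    ("ph", "<Start>"): {"F": 40, "P": 2},
    ("ough", "<End>"): {"OW": 12, "AH F": 9, "UW": 6},
}
print(prune_dictionary(tallies, top_n=2))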
[0206]
[0207] In one embodiment, a split block 3540 may split the graphemes of the unseen input word into different combinations of sub-words. The split block 3540 may be exhaustive, covering every combination of different lengths and numbers of sub-words. In one embodiment, the split block 3540 for the decoding phase may be the same as the split block 3040 used during the training phase.
[0208]
[0209] Referring back to
[0210] A search and solve block 3520 may search through the trained sub-word likelihoods dictionary 2940 for the sub-words contained in each unique sub-word split combination of the input word to find the phonemes corresponding to the sub-words. In one embodiment, the phonemes may be based on the positions associated with the sub-words within the input word. If the search and solve block 3520 finds the phonemes corresponding to all the sub-words of a sub-word split combination in the sub-word likelihoods dictionary 2940, then the combination is solved and the search and solve block 3520 may combine the corresponding phonemes for each sub-word into the corresponding solution.
[0211] The phoneme corresponding to each sub-word of the sub-word split combination has a likelihood (probability) found from the sub-word likelihoods dictionary 2940. The search and solve block 3520 may multiply the probabilities for the phonemes corresponding to all the sub-words in each unique sub-word split combination to obtain the phonetic probability of the combination. The search and solve block 3520 may compile the phonetic probabilities for all unique sub-word split combinations of the input word into a phonetic solutions and likelihoods tabulation 3530. In one embodiment, the phonetic solutions and likelihoods tabulation 3530 may tabulate the sub-word split combination with the highest phonetic probability among all the combinations to maximize the probability of the phoneme string equivalent of the input word for use in online training of the recognition model of the data free speech recognition system. In one embodiment, the phonetic solutions and likelihoods tabulation 3530 may tabulate the sub-word split combinations with the N highest phonetic probabilities among all the combinations.
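A compact sketch of the search and solve step is given below. It assumes a pruned likelihood dictionary keyed by (sub-word, tag), enumerates contiguous splits of the input word, multiplies the per-sub-word likelihoods, and returns the N most likely phoneme strings; the helper names, the data layout, and the use of only the top-1 pronunciation per sub-word are simplifying assumptions for this example.

def partitions(units, min_len=2):
    """Enumerate contiguous partitions of the grapheme units into sub-words."""
    n = len(units)
    def rec(start):
        if start == n:
            yield []
            return
        for end in range(start + min_len, n + 1):
            head = ("".join(units[start:end]), start, end)
            for tail in rec(end):
                yield [head] + tail
    return rec(0)

def tag_for(start, end, n):
    if start == 0 and end == n:
        return "<Full>"
    if start == 0:
        return "<Start>"
    return "<End>" if end == n else "<Middle>"

def search_and_solve(word, likelihood_dict, top_n=3, min_len=2):
    """Return up to top_n (phoneme_string, probability) solutions for a word."""
    units = list(word)        # per-character graphemes; digraph grouping could be added
    n = len(units)
    solutions = []
    for combo in partitions(units, min_len):
        prob, phones, solved = 1.0, [], True
        for sub, start, end in combo:
            entries = likelihood_dict.get((sub, tag_for(start, end, n)))
            if not entries:
                solved = False          # a sub-word is missing; discard this split
                break
            phoneme, p = entries[0]     # use the most likely pronunciation only
            phones.append(phoneme)
            prob *= p
        if solved:
            solutions.append((" ".join(phones), prob))
    solutions.sort(key=lambda x: x[1], reverse=True)
    return solutions[:top_n]

# Tiny demonstration dictionary with made-up likelihoods.
demo_dict = {
    ("sing", "<Full>"): [("S IH NG", 0.6)],
    ("si", "<Start>"): [("S IH", 0.9)],
    ("ng", "<End>"): [("NG", 0.8)],
}
print(search_and_solve("sing", demo_dict, top_n=2))

A fuller implementation could also branch over the alternate pronunciations stored for each sub-word rather than only the top-1 entry, which enlarges the solution list before the final top-N selection.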
[0212]
[0213] Advantageously, the approach for training and decoding the statistics-based tokenizer as described yields high transcription accuracy. The approach can support tokens other than phonemes. The training phase with different phonetic dictionaries can support different accents, dialects, and languages without requiring any additional training data. The training phase can also support small-sized phonetic dictionaries for languages with limited phonetic dictionary support. The decoding phase can identify the most likely pronunciation or the N most-likely pronunciations, making the approach attractive for a data free speech recognition system.
[0214] In one aspect, after the training and decoding phases of the tokenizer, the data free speech recognition system may use the tokenizer to generate strings of phonemes from the user-defined WWs or commands for the online training of the recognition models used during inference. The online training of the recognition models may use the phonemes from the tokenizer and SoftMax vectors from the acoustic model to compile statistics. Sequence decoding of the WWs or commands may use the statistics to achieve a highly accurate, robust phoneme string equivalent for the WWs or commands that is adaptive to the acoustic model and to alternate pronunciations of the WWs or commands.
[0215] In one aspect, a text-to-speech (TTS) engine may generate synthetic speech to modify/enhance the speech recognition model (e.g., HMM model) of the data free speech recognition system. As discussed in
[0216] TTS engines based on machine learning or deep learning may produce excellent synthetic speech quality that is barely discernable from real speech to the untrained listener. They are generally capable of synthesizing hundreds of different talkers, either cloning real target talkers, or generating purely fictional talkers. TTS engines may also target different emotions, accents, and prosodies. While these features increase the variability of the output speech, such variability may still not approach that of real speech. To further increase the statistical variation in the synthetic speech, TTS engines may apply augmentation techniques such as time scale modification, vocal tract normalization, level scaling, etc. An ASP system may use a TTS engine to adapt or train a speech recognition model that is already trained using real speech. Such an approach may be useful when limited real speech data is available for training purposes, for example, on an uncommon language, or for new words in an evolving language. However, when a speech recognition model is trained solely on synthetic speech generated from a TTS, the synthetic-speech training data may be inadequate because the synthetic speech may not accurately represent the desired statistics, spectral content, variability, etc. of real speech.
[0217] Described herein is an approach to use a TTS engine to synthesize speech that is otherwise unavailable to train or tune a data free speech recognition system to recognize target phrases in the user-defined command set using only the text or grapheme representation of the target phrases. The data free speech recognition system does not rely on real speech that matches the target phrases, yet may achieve good performance. In one embodiment, the approach may iteratively tune the settings and an augmentation block of a TTS engine to match the target characteristics of real speech and may utilize a compensation block to further compensate/adapt the synthetic speech to real speech.
[0218] In one aspect, during online training of the recognition model of the data free speech recognition system, the data free speech recognition system may tune TTS settings and the augmentation block of a TTS engine using an annotated database to derive the compensation block to minimize differences between the synthetic speech and real speech. After the TTS settings and the augmentation block are tuned, the TTS engine may synthesize the target speech from the user-defined WWs and commands to aid the online training of the recognition models of the target phrases.
[0219]
[0220] The tuning phase uses one or more annotated databases 3850 considered to contain the target or desired characteristics of speech to tune the components. Because the data free speech recognition system lacks speech data specific to the user-defined WWs and commands, the annotated databases 3850 do not contain speech data of the target phrases. Instead, the annotated databases 3850 may contain speech from an ensemble of talkers representative of a particular language, or from a set of talkers from a particular region with a desired target accent. For example, the annotated databases 3850 may contain word-token transcriptions (e.g., text and speech pairs) of the ensemble of talkers.
[0221] The TTS engine 3810 takes as its input the text of each training segment of the annotated speech databases 3850 to produce the equivalent synthetic speech based on the settings and speaker from the selection block 3820. The TTS engine 3810 may have the ability to synthesize multiple talkers, and/or model different prosodies (rhythm, melody, emphasis, duration, level), etc.
[0222] The augmentation block 3830 may process the synthetic speech with augmentation features selected by the selection block 3820 to generate augmented synthetic speech. In one embodiment, the augmentation features may include time scale modification (speed up, slow down), vocal tract length compensation (or other spectral warping), gain scaling, etc.
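Purely as an illustration of such augmentation, the following NumPy sketch applies a naive time-scale change and gain scaling to a speech buffer; the parameter values are made up, and vocal tract length compensation and pitch-preserving time-scale modification are intentionally omitted for brevity.

import numpy as np

def augment(speech, speed=1.0, gain_db=0.0):
    """Very simple augmentation: naive time-scale change plus level scaling.

    speed   > 1.0 shortens the utterance, < 1.0 lengthens it. Linear-interpolation
              resampling also shifts pitch; a production time-scale-modification
              algorithm would preserve pitch and is not shown here.
    gain_db   level scaling in decibels.
    """
    n_out = max(1, int(round(len(speech) / speed)))
    t_in = np.linspace(0.0, 1.0, num=len(speech))
    t_out = np.linspace(0.0, 1.0, num=n_out)
    out = np.interp(t_out, t_in, speech)
    return out * 10.0 ** (gain_db / 20.0)

# Example: lengthen by about 5 percent and attenuate by 3 dB (illustrative values).
dummy = np.random.randn(16000)                 # one second of fake audio at 16 kHz
augmented = augment(dummy, speed=0.95, gain_db=-3.0)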
[0223] An acoustic model 3860 such as the acoustic model 218 of
[0224] An analysis block 3870 compares the outputs of the acoustic model 3860 for the synthetic speech and the real speech. The analysis block 3870 may provide the results of this analysis to the selection block 3820 to adjust the TTS settings and augmentation features. The tuning phase may iterate the TTS settings and augmentation features until convergence of the synthetic speech and the real speech as analyzed by the analysis block 3870. After convergence, the compensation block 3840 may derive compensation or mapping information for use by the data free speech recognition system to further minimize the differences between the synthetic and real speech during the online training phase of the recognition models of the target phrases.
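One way to picture the tuning loop is the sketch below. The TTS engine, augmentation block, and acoustic model appear as hypothetical callables, and a symmetric KL divergence between averaged softmax posteriors of synthetic and real speech stands in for the convergence measure of the analysis block; none of these specific choices is mandated by the description above.

import numpy as np

def avg_posterior(acoustic_model, utterances):
    """Average the frame-level softmax posteriors over a set of utterances."""
    frames = np.concatenate([acoustic_model(u) for u in utterances], axis=0)
    return frames.mean(axis=0)

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two averaged posterior vectors."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def tune_tts(tts, augment, acoustic_model, texts, real_speech,
             candidate_settings, tol=0.05):
    """Pick the TTS/augmentation setting whose synthetic speech statistics are
    closest to those of the real annotated speech (greedy, illustrative only)."""
    target = avg_posterior(acoustic_model, real_speech)
    best = None
    for setting in candidate_settings:
        synth = [augment(tts(text, **setting["tts"]), **setting["aug"])
                 for text in texts]
        score = symmetric_kl(avg_posterior(acoustic_model, synth), target)
        if best is None or score < best[0]:
            best = (score, setting)
        if best[0] < tol:
            break                       # synthetic and real statistics converged
    return best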
[0225]
[0226] In one embodiment, the tuned TTS engine 3810 may synthesize the target speech of the WWs or commands (e.g., the target text) using the ensemble of TTS settings and speakers from the selection block 3820 determined during the tuning phase. For example, the TTS engine 3810 may be the TTS 346 of
[0227] The augmentation block 3830 may process the synthetic speech with the augmentation features from the selection block 3820 that were determined during the tuning phase to generate augmented synthetic speech. In one embodiment, the augmentation features may include time scale modification (speed up, slow down), vocal tract length compensation (or other spectral warping), gain scaling, etc. An analysis block 3910 may analyze the augmented synthetic speech to tune or train the data free speech recognition system 3920. For example, the analysis block 3910 may be the analysis module 348 that analyzes the synthetic speech generated by the TTS 346 to aid the generation of the recognition model 350 as shown in
[0228]
[0229] An aligner block 4010 may determine the phonemes of the augmented synthetic speech and their time boundaries. In one embodiment, the aligner block 4010 may use the Montreal Forced Aligner (MFA) to determine the time boundaries of the phonemes. The aligner block 4010 may output the phoneme time boundaries to a statistics collection block 4020.
[0230]
[0231] Referring back to
[0232] Refer back to
[0233] In one embodiment, the HMM for a target phrase may use the top-1 statistics to support alternate pronunciations by allowing multiple phonemes in each state definition, as shown for states 1320 and 1330 in
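As a minimal sketch of a state definition that admits alternate phonemes, the example below scores a frame against a state as the best softmax posterior among the phonemes that the state allows; the data layout and scoring rule are illustrative assumptions rather than the exact state structure used by the HMM.

import numpy as np

# Each decoding state lists the phoneme indices it accepts; a state covering an
# alternate pronunciation simply lists more than one index (illustrative layout).
states = [
    {"phoneme_ids": [12]},        # state tied to a single phoneme
    {"phoneme_ids": [3, 7]},      # state allowing two alternate phonemes
]

def emission_score(state, softmax_frame):
    """Score a frame against a state as the best posterior among its phonemes."""
    return max(softmax_frame[i] for i in state["phoneme_ids"])

frame = np.random.dirichlet(np.ones(40))   # one fake softmax frame over 40 phonemes
print([emission_score(s, frame) for s in states])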
[0234] In one aspect, an offline analysis may compute the average length (in time) of each phoneme using the annotated database used for training the acoustic model, such as the phoneme annotated database 320 of
[0235] In one embodiment, if the starting time of the $i^{th}$ occurrence of phoneme $ph_j$ is $t_{start}(i, j)$ and the ending time is $t_{end}(i, j)$, then the average length is

$$\bar{L}_j = \frac{1}{N_{ph}} \sum_{i=1}^{N_{ph}} \left( t_{end}(i, j) - t_{start}(i, j) \right),$$

where $N_{ph}$ is the number of occurrences of the phoneme $ph_j$ used in the average.
[0236] The self-transition analysis block 1530 of the decoding compensation analysis block 740 may compute the self-transition probability of the decoding state that models phoneme $ph_j$ as

$$a_{jj} = 1 - \frac{t_{frame}}{\bar{L}_j},$$

where $t_{frame}$ is the time (in seconds) for each frame and $\bar{L}_j$ is the average length of phoneme $ph_j$.
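Under these definitions, the offline analysis reduces to averaging the annotated durations and converting the expected number of frames into a self-transition probability. The Python sketch below assumes annotations supplied as (phoneme, start time, end time) tuples, a 10 ms frame step, and the geometric-duration relation shown above; all names and values are illustrative.

from collections import defaultdict

def phoneme_self_transitions(annotations, t_frame=0.01):
    """annotations: iterable of (phoneme, start_time_s, end_time_s) tuples.
    Returns phoneme -> (average_length_s, self_transition_probability)."""
    durations = defaultdict(list)
    for phoneme, start, end in annotations:
        durations[phoneme].append(end - start)
    result = {}
    for phoneme, lengths in durations.items():
        avg_len = sum(lengths) / len(lengths)
        # A geometric duration model that stays an expected d = avg_len / t_frame
        # frames in a state has self-transition probability 1 - 1/d.
        d = avg_len / t_frame
        result[phoneme] = (avg_len, max(0.0, 1.0 - 1.0 / d))
    return result

# Example with made-up phoneme boundaries (in seconds) and a 10 ms frame step.
demo = [("AH", 0.10, 0.18), ("AH", 0.50, 0.62), ("S", 0.62, 0.75)]
print(phoneme_self_transitions(demo, t_frame=0.01))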
[0237] In one embodiment, as an alternative to computing the average length of each phoneme in the annotated database used for training the acoustic model, an offline analysis may use the synthetic speech. The offline analysis may use the synthetic speech in conjunction with the phoneme boundaries for the phonemes, as described for compiling the top-1 statistics of
[0238]
[0239] In operation 4201, the system receives a target phrase for recognition by a speech recognition model.
[0240] In operation 4203, the system analyzes a sequence of acoustic units representative of the target phrase when the target phrase is spoken to generate offline analysis data.
[0241] In operation 4205, the system constructs the speech recognition model based on the offline analysis data to decode speech signals of the target phrase according to the acoustic units.
[0242] In operation 4207, the system processes speech based on the speech recognition model to detect a presence of the target phrase.
[0243]
[0244] A microphone 4301 of the data processing system 4300 may capture audio signals and store an input signal containing noise and target speech in a buffer 4303. In one embodiment, an input terminal (not shown) of the data processing system 4300 may receive audio signals captured by one or more external microphones for storage in the buffer 4303.
[0245] A processor 4320 may read the captured audio signals from the buffer for processing. The processor 4320 may retrieve computer-readable instructions from the memory 4330 to execute the instructions to perform the operations described above. The processor 4320 may contain one or more processing cores. The memory 4330 may include one or more ROMs (read only memories), volatile random access memories (RAMs), and/or other types of memories. Communication between the buffer 4310, processor 4320, and memory 4330 may take place through a communication bus 4380.
[0246] In one aspect, during offline training of a neural network-based acoustic model, the processor 4320 may perform feature extraction of input speech from a phoneme annotated database to generate observation vectors, iterate the acoustic model through the observation vectors to learn to distinguish the input speech according to phonemes, and analyze the vectors of phonemes from the acoustic model to generate a similarity matrix.
[0247] In one aspect, during the online training of a decoding model, the processor 4320 may implement a tokenizer to convert text of user-defined WWs/commands to phoneme sequences, and may train the decoding model based on the phoneme sequences of the WWs/commands and the similarity matrix from the offline training.
[0248] In one aspect, during the inference stage of the data free speech recognition system, the processor 4320 may implement a SOD algorithm to detect active speech, perform feature extraction of the active speech to generate observation vectors, invoke the acoustic model based on the observation vectors to generate Softmax vectors, and apply statistical modeling on the Softmax vectors according to the decoding model to determine if a user-defined WW or command is spoken.
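The inference stages listed above can be summarized in the following sketch, in which the SOD, feature extraction, acoustic model, and per-phrase sequence decoders are hypothetical callables; only the ordering of the stages and the thresholded decision follow the description.

def recognize(audio, sod, extract_features, acoustic_model, decoders, threshold):
    """Run the inference stages in the order described: speech-onset detection,
    feature extraction, acoustic model, then per-phrase sequence decoding."""
    active = sod(audio)                        # returns active speech or None
    if active is None:
        return None
    observations = extract_features(active)    # frame-level observation vectors
    softmax_frames = acoustic_model(observations)
    best_phrase, best_score = None, float("-inf")
    for phrase, decode in decoders.items():
        score = decode(softmax_frames)          # model likelihood for this phrase
        if score > best_score:
            best_phrase, best_score = phrase, score
    return best_phrase if best_score >= threshold else None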
[0249] In one aspect, the processor 4320 may tune a TTS engine to match the characteristics of real speech during a tuning phase. For example, during the tuning of the TTS engine, the processor 4320 may use an annotated speech database to tune the TTS settings, augmentation parameters, and compensation block of the TTS engine to minimize differences between synthetic speech generated by the TTS engine and real speech.
[0250] In one aspect, the processor 4320 may train a tokenizer by applying words from a reference phonetic dictionary to the tokenizer to generate a custom dictionary containing sub-words and their estimated likelihoods. During the decoding phase of the tokenizer, the processor 4320 may invoke the tokenizer to analyze the text input of user-defined WWs/commands using the custom dictionary of sub-words and their estimated likelihoods to tabulate the most likely phoneme string equivalents of the text input and their likelihoods, which may be used for online training of the decoding model.
[0251] Various embodiments of the data free speech recognition system described herein may include various operations. These operations may be performed and/or controlled by hardware components, digital hardware and/or firmware/programmable registers (e.g., as implemented in a computer-readable medium), and/or combinations thereof. The methods and illustrative examples described herein are not inherently related to any particular device or other apparatus. For example, during the inference stage of the data free speech recognition system, the processor 4320 may invoke a SOD block 4340 to detect active speech, a feature extraction block 4350 to perform feature extraction of the active speech to generate observation vectors, a phoneme unit matching block 4360 to generate Softmax vectors based on the observation vectors, and a WW/command sequence decoding block 4370 that applies statistical modeling on the Softmax vectors to determine if a user-defined WW or command is spoken. The required structure for a variety of these systems will appear as set forth in the description above.
[0252] A computer-readable medium used to implement operations of various aspects of the disclosure may be a non-transitory computer-readable storage medium that may include, but is not limited to, electromagnetic storage medium, magneto-optical storage medium, ROM, RAM, erasable programmable memory (e.g., EPROM and EEPROM), flash memory, or another now-known or later-developed non-transitory type of medium that is suitable for storing configuration information.
[0253] The above description is intended to be illustrative, and not restrictive. Although the present disclosure has been described with references to specific illustrative examples, it will be recognized that the present disclosure is not limited to the examples described. The scope of the disclosure should be determined with reference to the following claims, along with the full scope of equivalents to which the claims are entitled.
[0254] As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises", "comprising", "may include", and/or "including", when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
[0255] It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
[0256] Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times or the described operations may be distributed in a system which allows the occurrence of the processing operations at various intervals associated with the processing. For example, certain operations may be performed, at least in part, in a reverse order, concurrently and/or in parallel with other operations.
[0257] Various units, circuits, or other components may be described or claimed as "configured to" or "configurable to" perform a task or tasks. In such contexts, the phrase "configured to" or "configurable to" is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task, or configurable to perform the task, even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the "configured to" or "configurable to" language include hardware, for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is "configured to" perform one or more tasks, or is "configurable to" perform one or more tasks, is expressly intended not to invoke 35 U.S.C. 112, sixth paragraph, for that unit/circuit/component.
[0258] Additionally, "configured to" or "configurable to" can include generic structure (e.g., generic circuitry) that is manipulated by firmware (e.g., an FPGA) to operate in a manner that is capable of performing the task(s) at issue. "Configured to" may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks. "Configurable to" is expressly intended not to apply to blank media, an unprogrammed processor, an unprogrammed programmable logic device, an unprogrammed programmable gate array, or other unprogrammed device, unless accompanied by programmed media that confers the ability to the unprogrammed device to be configured to perform the disclosed function(s).
[0259] The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.