Patent classifications
G10L15/197
Machine action based on language-independent graph rewriting of an utterance
An utterance in any of various languages is processed to derive a predicted label using a generated grammar. The grammar is suitable for deriving the meaning of utterances in several languages (polyglot). The utterance is processed by an encoder using word embeddings, and the encoder and a decoder apply the polyglot grammar to obtain a machine-readable result. The machine-readable result is well-formed because reentrancies of intermediate variable references are accounted for; the well-formed result reduces ambiguity at the decoder. A machine then takes action based on the machine-readable result. Sparseness of the generated polyglot grammar is reduced by a two-pass approach in which placeholders are ultimately replaced by edge labels.
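The two-pass placeholder idea can be illustrated as follows; this is a hedged sketch, not the patented method itself. The assumption is that a first pass abstracts concrete edge labels in grammar rules into numbered placeholders (so many utterances share one rule skeleton, reducing sparseness), and a second pass restores the labels. All names are illustrative.

```python
def first_pass(rules):
    """Replace concrete edge labels with numbered placeholders.

    Each rule (a sequence of edge labels) becomes a skeleton of
    placeholders plus a mapping for reversing the substitution."""
    skeletons = []
    for labels in rules:
        mapping = {}
        skeleton = []
        for label in labels:
            if label not in mapping:
                mapping[label] = f"<P{len(mapping)}>"
            skeleton.append(mapping[label])
        skeletons.append((skeleton, mapping))
    return skeletons

def second_pass(skeleton, mapping):
    """Restore edge labels by inverting the placeholder mapping."""
    inverse = {placeholder: label for label, placeholder in mapping.items()}
    return [inverse[placeholder] for placeholder in skeleton]
```

Note that repeated labels map to the same placeholder, so a rule such as `["ARG0", "ARG1", "ARG0"]` yields the reusable skeleton `["<P0>", "<P1>", "<P0>"]`.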
ADDING WORDS TO A PREFIX TREE FOR IMPROVING SPEECH RECOGNITION
An approach for improving speech recognition is provided. A processor receives a new word to add to a prefix tree. A processor determines a bonus score for a first transition from a first node to a second node in the prefix tree on condition that the first transition is included in a path of at least one transition representing the new word. A processor determines a hypothesis score for a hypothesis that corresponds to a speech sequence based on the prefix tree, where the bonus score is added to an initial hypothesis score to determine the hypothesis score. In response to a determination that the hypothesis score exceeds a threshold value, a processor generates an output text sequence for the speech sequence based on the hypothesis.
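A minimal sketch of the scoring mechanism described above, under the assumption of a character-level prefix tree where every transition on a newly added word's path carries a bonus that is added to a hypothesis's initial score. Class and method names are illustrative, not taken from the patent.

```python
class PrefixTree:
    """Character-level prefix tree with per-transition bonus scores."""

    def __init__(self):
        self.root = {"children": {}, "bonus": {}}

    def add_word(self, word, bonus=1.0):
        """Add a word; each transition on its path earns a bonus."""
        node = self.root
        for ch in word:
            node["bonus"][ch] = node["bonus"].get(ch, 0.0) + bonus
            node = node["children"].setdefault(ch, {"children": {}, "bonus": {}})

    def hypothesis_score(self, hypothesis, initial_score):
        """Initial score plus bonuses collected along the tree path."""
        score, node = initial_score, self.root
        for ch in hypothesis:
            score += node["bonus"].get(ch, 0.0)
            if ch not in node["children"]:
                break
            node = node["children"][ch]
        return score
```

In use, a decoder would compare `hypothesis_score(...)` against a threshold and emit the output text sequence only when the threshold is exceeded, so hypotheses matching newly added vocabulary are favored.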
MULTI-MODAL INPUT ON AN ELECTRONIC DEVICE
A computer-implemented input-method editor process includes receiving a request from a user for an application-independent input method editor having written and spoken input capabilities, identifying that the user is about to provide spoken input to the application-independent input method editor, and receiving a spoken input from the user. The spoken input corresponds to input to an application and is converted to text that represents the spoken input. The text is provided as input to the application.
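The flow described above can be sketched as follows, with the caveat that this is an assumed, simplified shape: an application-independent input-method editor accepts spoken input, converts it to text, and delivers the text to whichever application requested input. The `recognize` function is a placeholder for a real speech recognizer.

```python
def recognize(audio):
    """Placeholder for an actual speech-to-text backend."""
    return audio.get("transcript", "")

class InputMethodEditor:
    """Accepts spoken input on behalf of any application."""

    def __init__(self, deliver):
        self.deliver = deliver          # callback into the target application

    def on_spoken_input(self, audio):
        text = recognize(audio)         # convert speech to representative text
        self.deliver(text)              # provide the text as application input

# Usage: any application supplies its own delivery callback.
received = []
ime = InputMethodEditor(received.append)
ime.on_spoken_input({"transcript": "hello world"})
```

The point of the callback design is application independence: the editor never needs to know which application it is serving.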
Using Video Clips as Dictionary Usage Examples
Implementations are provided for automatically mining corpus(es) of electronic video files for video clips that contain spoken utterances that are suitable usage examples to accompany or complement dictionary definitions. These video clips may then be associated with target n-grams in a searchable database, such as a database underlying an online dictionary. In various implementations, a set of candidate video clips in which a target n-gram is uttered in a target context may be identified from a corpus of electronic video files. For each candidate video clip of the set, pre-existing manual subtitles associated with the candidate video clip may be compared to text generated based on speech recognition processing of an audio portion of the candidate video clip. Based at least in part on the comparing, a measure of suitability as a dictionary usage example may be calculated for the candidate video clip.
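The comparison step might look like the following sketch, which uses simple token Jaccard overlap between the manual subtitles and the ASR output as an assumed stand-in for whatever agreement measure an implementation actually uses; high agreement suggests clear speech, making the clip a better usage example.

```python
def token_overlap(subtitles, asr_text):
    """Jaccard overlap between subtitle tokens and ASR tokens."""
    a, b = set(subtitles.lower().split()), set(asr_text.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def suitability(clip, target_ngram):
    """Suitability of a candidate clip as a dictionary usage example.

    Clips whose subtitles do not clearly contain the target n-gram
    are ruled out; otherwise suitability tracks subtitle/ASR agreement."""
    if target_ngram.lower() not in clip["subtitles"].lower():
        return 0.0
    return token_overlap(clip["subtitles"], clip["asr_text"])
```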
Language model biasing modulation
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for modulating language model biasing. In some implementations, context data is received. A likely context associated with a user is determined based on at least a portion of the context data. One or more language model biasing parameters are selected based at least on the likely context associated with the user. A context confidence score associated with the likely context is determined based on at least a portion of the context data. One or more language model biasing parameters are adjusted based at least on the context confidence score. A baseline language model is biased based at least on one or more of the adjusted language model biasing parameters. The baseline language model is provided for use by an automated speech recognizer (ASR).
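A minimal sketch of the modulation step, assuming biasing parameters take the form of per-phrase log-probability boosts (an illustrative representation, not necessarily the patent's): the boosts are scaled by the context confidence score before being applied to the baseline model.

```python
def adjust_biasing(biasing_params, confidence):
    """Scale each biasing boost by a context confidence in [0, 1]."""
    return {phrase: boost * confidence for phrase, boost in biasing_params.items()}

def bias_language_model(baseline_logprobs, adjusted_params, floor=-10.0):
    """Add the adjusted boosts to the baseline log-probabilities.

    Phrases absent from the baseline start from an assumed floor value."""
    biased = dict(baseline_logprobs)
    for phrase, boost in adjusted_params.items():
        biased[phrase] = biased.get(phrase, floor) + boost
    return biased
```

With this shape, a low-confidence context leaves the baseline model nearly unchanged, while a high-confidence context applies the full boost, which is the modulation behavior the abstract describes.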
RESPONSE METHOD, TERMINAL, AND STORAGE MEDIUM
A response method, a terminal, and a storage medium. The response method comprises: determining, at a first time point by means of speech recognition processing, a first target text corresponding to the first time point (1001); determining, according to the first target text, a first predicted intention and an answer to be pushed, wherein said answer is used for responding to speech information (1002); continuing to determine, by means of the speech recognition processing, a second target text and a second predicted intention corresponding to a second time point, wherein the second time point immediately follows the first time point (1003); determining, according to the first predicted intention and the second predicted intention, whether a preset response condition is satisfied (1004); and responding according to said answer if the preset response condition is determined to be satisfied (1005).
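Steps (1004)–(1005) can be sketched as follows, under the assumption (not stated in the abstract) that the preset response condition is intention stability: the answer prepared at the first time point is pushed only if the second time point predicts the same intention.

```python
def should_respond(first_intention, second_intention):
    """Assumed preset response condition: the predicted intention is stable
    across two consecutive time points."""
    return first_intention is not None and first_intention == second_intention

def respond_if_ready(first_intention, second_intention, answer):
    """Push the prepared answer when the response condition holds."""
    if should_respond(first_intention, second_intention):
        return answer           # respond according to the prepared answer
    return None                 # condition not met; keep recognizing speech
```

The benefit of such a condition is latency: the answer is prepared from a partial transcript at the first time point and released as soon as the next time point confirms the intention, rather than waiting for the full utterance.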