G10L2015/081

System and Method for Performing Dual Mode Speech Recognition

A system and method is presented for performing dual mode speech recognition, employing a local recognition module on a mobile device and a remote recognition engine on a server device. The system accepts a spoken query from a user, and both the local recognition module and the remote recognition engine perform speech recognition operations on the query, returning a transcription and confidence score, subject to a latency cutoff time. If both sources successfully transcribe the query, then the system accepts the result having the higher confidence score. If only one source succeeds, then that result is accepted. In either case, if the remote recognition engine does succeed in transcribing the query, then a client vocabulary is updated if the remote system result includes information not present in the client vocabulary.
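
The arbitration rule in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Result` type, the function names, the use of `None` to represent a failed or timed-out recognizer, the tie-breaking choice, and the word-level vocabulary update are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    text: str
    confidence: float  # assumed to be comparable across both recognizers


def choose_result(local: Optional[Result], remote: Optional[Result]) -> Optional[Result]:
    """Pick a transcription; None stands for a failure or a missed latency cutoff."""
    if local is not None and remote is not None:
        # Both succeeded: keep the higher confidence score.
        # Ties go to the remote result (an arbitrary choice for this sketch).
        return local if local.confidence > remote.confidence else remote
    # At most one succeeded: return it, or None if both failed.
    return local if local is not None else remote


def update_client_vocabulary(vocabulary: set, remote: Optional[Result]) -> set:
    """Add words from a successful remote result that the client lacks.

    Returns the set of newly added words. Treating "information not present
    in the client vocabulary" as whitespace-separated words is an assumption.
    """
    if remote is None:
        return set()
    new_words = {w for w in remote.text.lower().split() if w not in vocabulary}
    vocabulary |= new_words
    return new_words
```

Note that the vocabulary update runs whenever the remote engine succeeds, even when the local result wins the confidence comparison.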

Dynamic Acoustic Model Switching to Improve Noisy Speech Recognition

An automatic speech recognition system for a vehicle includes a controller configured to select an acoustic model from a library of acoustic models based on ambient noise in a cabin of the vehicle and operating parameters of the vehicle. The controller is further configured to apply the selected acoustic model to noisy speech to improve recognition of the speech.
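
As a rough illustration of selecting a model from a library based on cabin noise and vehicle operating parameters, the sketch below uses a simple rule table. The model names, the particular parameters (speed, window state), and every threshold are hypothetical; the abstract does not specify them.

```python
def select_acoustic_model(noise_db: float, speed_kmh: float, windows_open: bool) -> str:
    """Return the name of an acoustic model from a hypothetical library.

    All thresholds and model names are illustrative assumptions.
    """
    if windows_open:
        # Wind noise is assumed to dominate when windows are open.
        return "windows_open"
    if noise_db >= 70.0 or speed_kmh >= 100.0:
        return "highway"
    if noise_db >= 55.0 or speed_kmh >= 30.0:
        return "city"
    return "quiet"
```

A controller would call this function whenever the measured conditions change and then decode incoming speech with the returned model.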

WAKE-ON-VOICE METHOD AND DEVICE
20170206895 · 2017-07-20

The present invention provides a wake-on-voice method and device. The method includes: obtaining a voice inputted by a user; processing data frames of the voice with a frame skipping strategy and performing a voice activity detection on the data frames by a time-domain energy algorithm; extracting an acoustic feature of the voice and performing a voice recognition on the acoustic feature according to a preset recognition network and an acoustic model; and performing an operation corresponding to the voice if the voice is a preset wake-up word in the preset recognition network.
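
The voice activity detection step can be sketched as follows. This is a minimal example under stated assumptions: energy is taken as the mean squared amplitude of a frame, only every `skip`-th frame is measured, and skipped frames reuse the most recent decision. The abstract does not say how skipped frames are handled, and the frame length, skip factor, and threshold here are hypothetical.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (a time-domain energy measure)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def detect_voice(samples: Sequence[float], frame_len: int = 160,
                 skip: int = 2, threshold: float = 0.01) -> List[bool]:
    """Label each full frame as speech (True) or non-speech (False).

    Energy is computed only on every `skip`-th frame; the frames in between
    inherit the decision of the last measured frame (an assumption).
    """
    decisions: List[bool] = []
    last = False
    for i in range(len(samples) // frame_len):
        if i % skip == 0:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            last = frame_energy(frame) > threshold
        decisions.append(last)
    return decisions
```

With `skip=2` the energy computation runs on half of the frames, which is the kind of saving a frame skipping strategy aims for on a low-power wake-up path.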

System and method of extracting clauses for spoken language understanding

A clausifier and method of extracting clauses for spoken language understanding are disclosed. The method relates to generating a set of clauses from speech utterance text and comprises inserting at least one boundary tag in speech utterance text related to sentence boundaries, inserting at least one edit tag indicating a portion of the speech utterance text to remove, and inserting at least one conjunction tag within the speech utterance text. The result is a set of clauses that may be identified within the speech utterance text according to the inserted at least one boundary tag, at least one edit tag and at least one conjunction tag. The disclosed clausifier comprises a sentence boundary classifier, an edit detector classifier, and a conjunction detector classifier. The clausifier may comprise a single classifier or a plurality of classifiers to perform the steps of identifying sentence boundaries, editing text, and identifying conjunctions within the text.
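
Once the three kinds of tags have been inserted, recovering the clauses is a small text-processing step. The sketch below covers only that final step, not the classifiers that place the tags. The tag syntax (`<s>` for a sentence boundary, `<c>` placed before a conjunction, `<e>…</e>` around text to remove) is a hypothetical notation chosen for the example.

```python
import re
from typing import List


def extract_clauses(tagged: str) -> List[str]:
    """Split tagged utterance text into clauses.

    Assumed tags: <e>...</e> marks an edit to delete, <s> marks a sentence
    boundary, <c> marks the start of a conjoined clause.
    """
    # Remove every edited (disfluent) span.
    without_edits = re.sub(r"<e>.*?</e>", " ", tagged)
    # Break the remaining text at boundary and conjunction tags.
    parts = re.split(r"<s>|<c>", without_edits)
    # Normalize whitespace and drop empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]


if __name__ == "__main__":
    text = "i want to <e>uh i want to</e> check my bill <c> and pay it <s> thanks"
    print(extract_clauses(text))
```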

System and method for performing dual mode speech recognition

A system and method is presented for performing dual mode speech recognition, employing a local recognition module on a mobile device and a remote recognition engine on a server device. The system accepts a spoken query from a user, and both the local recognition module and the remote recognition engine perform speech recognition operations on the query, returning a transcription and confidence score, subject to a latency cutoff time. If both sources successfully transcribe the query, then the system accepts the result having the higher confidence score. If only one source succeeds, then that result is accepted. In either case, if the remote recognition engine does succeed in transcribing the query, then a client vocabulary is updated if the remote system result includes information not present in the client vocabulary.

Performing speech recognition using a local language context including a set of words with descriptions in terms of components smaller than the words

A method of a local recognition system controlling a host device to perform one or more operations is provided. The method includes receiving, by the local recognition system, a query, performing speech recognition on the received query by implementing, by the local recognition system, a local language context comprising a set of words comprising descriptions in terms of components smaller than the words, and performing speech recognition, using the local language context, to create a transcribed query. Further, the method includes controlling the host device in dependence upon the speech recognition performed on the transcribed query.

Automatic cognate detection in a computer-assisted language learning system

According to an aspect, a first word in a first language and a second word in a second language in a bilingual corpus are stemmed. A probability for aligning the first stem and the second stem and a distance metric between the normalized first stem and the normalized second stem are calculated. The first word and the second word are identified as a cognate pair when the probability and the distance metric meet a threshold criterion, and are stored as a cognate pair in a set of cognates. A candidate sentence in the second language is retrieved from a corpus. The candidate sentence is filtered by the active vocabulary of a user in the second language and the set of cognates. A sentence quality score is calculated for the candidate sentence, and the candidate sentence is ranked for presentation to the user based on the sentence quality score.
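
The cognate test can be illustrated with a short sketch. Several choices here are assumptions, since the abstract does not fix them: stems are normalized by lowercasing and stripping diacritics, the distance metric is Levenshtein distance divided by the longer stem length, the alignment probability is supplied by the caller (for example from a word alignment model), and both threshold values are hypothetical.

```python
import unicodedata


def normalize(stem: str) -> str:
    """Lowercase a stem and strip diacritics (assumed normalization)."""
    decomposed = unicodedata.normalize("NFKD", stem.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def scaled_distance(a: str, b: str) -> float:
    """Edit distance between normalized stems, scaled to the range 0..1."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_cognate(stem_a: str, stem_b: str, align_prob: float,
               min_prob: float = 0.5, max_dist: float = 0.4) -> bool:
    """Both criteria must hold; the thresholds are illustrative only."""
    return align_prob >= min_prob and scaled_distance(stem_a, stem_b) <= max_dist
```

Requiring both a high alignment probability and a small distance filters out pairs that merely look alike as well as translation pairs with no shared form.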

Voice recognition apparatus and method of controlling the same

A voice recognition apparatus includes a voice recognizer configured to recognize a user utterance; a storage unit configured to store a plurality of tokens; a token network generator; and a processor. The token network generator is configured to generate a plurality of recognition tokens from the recognized user utterance, search the stored tokens for a similar token for each recognition token and for a peripheral token that has a history of being used with the recognition token, and generate a token network from the recognition tokens, the similar tokens, and the peripheral tokens. The processor is configured to control the token network generator to generate the token network when the user utterance is recognized by the voice recognizer, calculate transition probabilities between the tokens constituting the token network, and generate text data for a corrected user utterance using the calculated transition probabilities.
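
One way to picture the correction step is as a best-path search over a token network. The sketch below is an illustration under assumptions the abstract does not state: the network is a list of columns, each holding the alternative tokens for one position; transition probabilities are estimated from hypothetical usage counts of token pairs; and the corrected text is the path with the highest product of transition probabilities.

```python
from typing import Dict, List, Tuple

Counts = Dict[Tuple[str, str], int]


def transition_prob(counts: Counts, prev: str, nxt: str, floor: float = 1e-6) -> float:
    """Relative frequency of `nxt` following `prev`, with a small floor for unseen pairs."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    if total == 0:
        return floor
    return max(counts.get((prev, nxt), 0) / total, floor)


def best_path(network: List[List[str]], counts: Counts) -> List[str]:
    """Viterbi-style search; `network` must contain at least one column."""
    # scores[token] = (probability of the best path ending in token, that path)
    scores = {tok: (1.0, [tok]) for tok in network[0]}
    for column in network[1:]:
        new_scores = {}
        for tok in column:
            new_scores[tok] = max(
                ((p * transition_prob(counts, prev, tok), path + [tok])
                 for prev, (p, path) in scores.items()),
                key=lambda item: item[0],
            )
        scores = new_scores
    return max(scores.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    network = [["play"], ["sum", "some"], ["music"]]
    counts = {("play", "some"): 5, ("play", "sum"): 1, ("some", "music"): 4}
    print(" ".join(best_path(network, counts)))
```

Here "sum" plays the role of a misrecognized token and "some" the similar token retrieved from storage; the usage history favors the corrected sequence.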

FAITHFUL GENERATION OF OUTPUT TEXT FOR MULTIMODAL APPLICATIONS
20250078818 · 2025-03-06

Systems and techniques are described for generating and using unimodal/multimodal generative models that mitigate hallucinations. For example, a computing device can encode input data to generate encoded representations of the input data. The computing device can obtain intermediate data including a plurality of partial sentences associated with the input data and can generate, based on the intermediate data, at least one complete sentence associated with the input data. The computing device can encode the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence. The computing device can generate a faithfulness score based on a comparison of the encoded representations of the input data and the at least one encoded representation of the at least one complete sentence. The computing device can re-rank the plurality of partial sentences of the intermediate data based on the faithfulness score to generate re-ranked data.
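
The re-ranking step can be sketched with plain vectors standing in for the encoded representations. This example assumes cosine similarity as the comparison, which the abstract does not specify, and assumes each candidate arrives with the embedding of the complete sentence generated from it. The encoders themselves are out of scope, so the embeddings are supplied by the caller.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (partial sentence, embedding of its completed sentence)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(input_vec: Sequence[float], candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least faithful to the input encoding."""
    return sorted(candidates, key=lambda c: cosine(input_vec, c[1]), reverse=True)
```

Candidates whose completed sentences drift away from the input encoding fall to the bottom of the list, which is how a faithfulness score can suppress hallucinated continuations.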

System and Method for Learning Alternate Pronunciations for Speech Recognition

A system and method for learning alternate pronunciations for speech recognition is disclosed. Pronunciation learning may cover alternative name pronunciations that are not covered by a general pronunciation dictionary. In an embodiment, the detection of phone-level and syllable-level mispronunciations in words and sentences may be based on acoustic models trained by Hidden Markov Models. Mispronunciations may be detected through a series of tests that compare the likelihood of the potential state of the target pronunciation unit with a pre-determined threshold. It is also within the scope of an embodiment to detect accents.