G10L15/197

SCALABLE ENTITIES AND PATTERNS MINING PIPELINE TO IMPROVE AUTOMATIC SPEECH RECOGNITION

A computing system obtains features that have been extracted from an acoustic signal, where the acoustic signal comprises spoken words uttered by a user. The computing system performs automatic speech recognition (ASR) based upon the features and a language model (LM) generated based upon expanded pattern data. The expanded pattern data includes a name of an entity and a search term, where the entity belongs to a segment identified in a knowledge base. The search term has been included in queries for entities belonging to the segment. The computing system identifies a sequence of words corresponding to the features based upon results of the ASR. The computing system transmits computer-readable text to a search engine, where the text includes the sequence of words.
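
The abstract does not say how the pattern data is expanded. As a rough sketch only, the following assumes that every entity name in a segment is paired with every search term seen in queries for that segment, and that the resulting phrases become LM training text. The dictionaries, the sample data, the function name `expand_patterns`, and the "entity then term" word order are all hypothetical.

```python
from itertools import product

# Hypothetical knowledge base: segment -> entity names (illustrative data only).
KNOWLEDGE_BASE = {
    "restaurant": ["Blue Door Cafe", "Luigi's"],
}

# Hypothetical search terms observed in queries for entities of each segment.
SEGMENT_SEARCH_TERMS = {
    "restaurant": ["menu", "opening hours"],
}


def expand_patterns(knowledge_base, segment_terms):
    """Pair every entity name in a segment with every search term of that segment."""
    expanded = []
    for segment, entities in knowledge_base.items():
        terms = segment_terms.get(segment, [])
        # Assumption: a plain cross product, written as "<entity> <term>".
        for entity, term in product(entities, terms):
            expanded.append(f"{entity} {term}")
    return expanded


expanded_patterns = expand_patterns(KNOWLEDGE_BASE, SEGMENT_SEARCH_TERMS)
```

Under these assumptions, `expanded_patterns` holds four phrases (each of the two entities with each of the two terms) that could then be fed to whatever LM training procedure is in use.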

METHOD FOR FACILITATING SPEECH ACTIVITY DETECTION FOR STREAMING SPEECH RECOGNITION
20220358913 · 2022-11-10

The present disclosure relates to a system and method for automatic recording of speech. The system performs end-of-sentence detection, which may also serve as punctuation prediction. The system couples Natural Language Processing (NLP) with Automatic Speech Recognition (ASR) through a switching mechanism that decides when the ASR should start or stop recording for processing. The decision relies on a temporal neural network, which tells the switching mechanism whether a meaningful sentence has been formed. The temporal neural network is a sequence-to-classification network trained on a large dataset of news articles.
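
A minimal sketch of the switching idea, assuming the classifier is available as a callable that receives the words recognized so far and returns whether they form a complete sentence. The function `segment_stream` and the rule-based `toy_classifier` are hypothetical placeholders. The toy rule stands in for the temporal neural network and is not a model of it.

```python
from typing import Callable, Iterable, List


def segment_stream(
    words: Iterable[str],
    sentence_is_complete: Callable[[List[str]], bool],
) -> List[List[str]]:
    """Cut a word stream into segments whenever the classifier reports a complete sentence."""
    segments: List[List[str]] = []
    current: List[str] = []
    for word in words:
        current.append(word)
        # The "switch": stop the current recording window once a sentence is formed.
        if sentence_is_complete(current):
            segments.append(current)
            current = []
    if current:
        # Flush whatever remains when the stream ends.
        segments.append(current)
    return segments


def toy_classifier(words: List[str]) -> bool:
    # Placeholder rule, not a neural network: a few words are treated as sentence-final.
    return words[-1] in {"tomorrow", "please"}


stream = "call me tomorrow send the report please".split()
segments = segment_stream(stream, toy_classifier)
```

Each segment boundary is also a natural place to insert punctuation, which is consistent with the abstract's remark that end-of-sentence detection can double as punctuation prediction.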

MEETING TRANSCRIPTION USING CUSTOM LEXICONS BASED ON DOCUMENT HISTORY
20230042473 · 2023-02-09

A collaborative content management system allows multiple users to access and modify collaborative documents. When audio data is recorded by or uploaded to the system, the audio data may be transcribed or summarized to improve accessibility and user efficiency. Text transcriptions are associated with portions of the audio data representative of the text, and users can search the text transcription and access the portions of the audio data corresponding to search queries for playback. An outline can be automatically generated based on a text transcription of audio data and embedded as a modifiable object within a collaborative document. The system associates hot words with actions to modify the collaborative document upon identifying the hot words in the audio data. Collaborative content management systems can also generate custom lexicons for users based on documents associated with the user for use in transcribing audio data, ensuring that text transcription is more accurate.
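
One way to picture the custom-lexicon step, as a sketch under stated assumptions: collect terms that recur in a user's documents but are missing from a general vocabulary, and hand them to the transcriber as extra lexicon entries. The function `build_custom_lexicon`, the `min_count` cutoff, the tokenizing regular expression, and the sample documents are all hypothetical. The abstract does not describe the actual selection criteria.

```python
import re
from collections import Counter


def build_custom_lexicon(documents, base_vocabulary, min_count=2):
    """Return recurring document terms that the base vocabulary does not cover."""
    counts = Counter()
    for text in documents:
        # Simple tokenizer (assumption): words start with a letter and may
        # contain digits, apostrophes, or hyphens.
        counts.update(re.findall(r"[A-Za-z][A-Za-z0-9'-]*", text.lower()))
    return sorted(
        word
        for word, count in counts.items()
        if count >= min_count and word not in base_vocabulary
    )


documents = [
    "Kickoff notes for Project Zephyrix",
    "Zephyrix launch plan and notes",
]
base_vocabulary = {"kickoff", "notes", "for", "project", "launch", "plan", "and"}
lexicon = build_custom_lexicon(documents, base_vocabulary)
```

Here the invented project name is the only term that recurs and is absent from the base vocabulary, so it is the only lexicon entry.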

USER MEDIATION FOR HOTWORD/KEYWORD DETECTION
20230101572 · 2023-03-30

Techniques are described herein for improving performance of machine learning model(s) and thresholds utilized in determining whether automated assistant function(s) are to be initiated. A method includes: receiving, via one or more microphones of a client device, audio data that captures a spoken utterance of a user; processing the audio data using a machine learning model to generate a predicted output that indicates a probability of one or more hotwords being present in the audio data; determining that the predicted output satisfies a secondary threshold that is less indicative of the one or more hotwords being present in the audio data than is a primary threshold; in response to determining that the predicted output satisfies the secondary threshold, prompting the user to indicate whether or not the spoken utterance includes a hotword; receiving, from the user, a response to the prompting; and adjusting the primary threshold based on the response.
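
The two-threshold flow can be sketched as follows. The abstract only states that the primary threshold is adjusted based on the user's response. The direction and size of the adjustment (`step`), the choice to activate the assistant when the user confirms, and the names `handle_prediction` and `ask_user` are assumptions made for illustration.

```python
def handle_prediction(probability, primary, secondary, ask_user, step=0.05):
    """Return (activate, new_primary) for one hotword prediction.

    ask_user is a hypothetical callback that returns True when the user
    confirms that the utterance contained a hotword.
    """
    if probability >= primary:
        # Confident detection: activate without prompting.
        return True, primary
    if probability >= secondary:
        # Borderline detection: ask the user and adjust the primary threshold.
        confirmed = ask_user()
        if confirmed:
            # Assumption: a confirmed near miss lowers the primary threshold,
            # but never below the secondary threshold.
            new_primary = max(secondary, primary - step)
        else:
            # Assumption: a rejected prompt raises the primary threshold.
            new_primary = min(1.0, primary + step)
        return confirmed, new_primary
    # Below both thresholds: ignore.
    return False, primary


activate, new_primary = handle_prediction(0.7, primary=0.8, secondary=0.6, ask_user=lambda: True)
```

In this call the prediction falls between the two thresholds, the simulated user confirms the hotword, and the primary threshold moves down by one step.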

Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
20230096821 · 2023-03-30

A method of training a language model for rare-word speech recognition includes obtaining a set of training text samples, and obtaining a set of training utterances used for training a speech recognition model. Each training utterance in the set of training utterances includes audio data corresponding to an utterance and a corresponding transcription of the utterance. The method also includes applying rare word filtering on the set of training text samples to identify a subset of rare-word training text samples that include words that do not appear in the transcriptions from the set of training utterances or appear in those transcriptions fewer than a threshold number of times. The method further includes training the external language model on the transcriptions from the set of training utterances and the identified subset of rare-word training text samples.
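
The rare-word filter described above can be sketched in a few lines. Whitespace tokenization, lowercasing, the function name `select_rare_word_samples`, the threshold value, and the sample sentences are assumptions for illustration. The abstract does not specify them.

```python
from collections import Counter


def select_rare_word_samples(text_samples, transcriptions, threshold=2):
    """Keep text samples containing a word seen fewer than `threshold` times in the transcriptions."""
    counts = Counter(
        word for line in transcriptions for word in line.lower().split()
    )
    # A Counter returns 0 for unseen words, so words that never appear in
    # the transcriptions also count as rare.
    return [
        sample
        for sample in text_samples
        if any(counts[word] < threshold for word in sample.lower().split())
    ]


transcriptions = ["play some music", "play the news", "stop the music"]
text_samples = ["play the music", "play zydeco music", "stop the news"]

rare_samples = select_rare_word_samples(text_samples, transcriptions)
# The LM training text is then the transcriptions plus the selected samples.
lm_training_text = transcriptions + rare_samples
```

With a threshold of 2, the first sample is dropped because all of its words occur at least twice in the transcriptions, while the other two are kept because they contain an unseen word or a word seen only once.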

Enhancing ASR System Performance for Agglutinative Languages

A training-stage technique trains a language model for use in an ASR system. The technique includes: obtaining a training corpus that includes a sequence of terms; determining that an original term in the training corpus is not present in a dictionary resource; segmenting the original term into two or more sub-terms using a segmentation resource; determining that the segmentation of the original term into the two or more sub-terms is a valid segmentation, based on two or more validity tests; and training the language model based on the terms that have been identified. A computer-implemented inference-stage technique applies the language model to produce ASR output results. The inference-stage technique merges a sub-term with a preceding term if these two terms are separated by no more than a prescribed interval of time.
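
The inference-stage merge can be illustrated with a small sketch. It assumes the recognizer emits tokens with start and end times and that sub-terms are already flagged. The `Token` class, the `is_sub_term` flag, the 0.1-second `max_gap`, and the sample tokens are hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start: float  # seconds
    end: float    # seconds
    is_sub_term: bool = False  # assumption: sub-terms are flagged upstream


def merge_sub_terms(tokens, max_gap=0.1):
    """Attach each sub-term to the preceding token when the pause between them is at most max_gap."""
    merged = []
    for token in tokens:
        if merged and token.is_sub_term and token.start - merged[-1].end <= max_gap:
            prev = merged[-1]
            # Join the texts and extend the time span to cover both tokens.
            merged[-1] = Token(prev.text + token.text, prev.start, token.end, prev.is_sub_term)
        else:
            merged.append(token)
    return merged


tokens = [
    Token("ev", 0.00, 0.20),
    Token("ler", 0.22, 0.40, is_sub_term=True),
    Token("de", 0.90, 1.00, is_sub_term=True),
]
merged_text = [token.text for token in merge_sub_terms(tokens)]
```

In this example the second token follows the first after a very short pause and is merged into it, whereas the third token starts half a second later and is left as a separate term.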