Patent classifications
G10L15/05
Freeze words
A method for detecting freeze words includes receiving audio data that corresponds to an utterance spoken by a user and captured by a user device associated with the user. The method also includes processing, using a speech recognizer, the audio data to determine that the utterance includes a query for a digital assistant to perform an operation. The speech recognizer is configured to trigger endpointing of the utterance after a predetermined duration of non-speech in the audio data. Before the predetermined duration of non-speech, the method includes detecting a freeze word in the audio data. In response to detecting the freeze word in the audio data, the method also includes triggering a hard microphone closing event at the user device. The hard microphone closing event prevents the user device from capturing any audio subsequent to the freeze word.
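The interaction between silence-based endpointing and a freeze word can be sketched as follows. This is a minimal illustration, not the patented implementation: the frame representation, the function name `capture_query`, the freeze word "stop", and the three-frame silence limit are all assumptions made for the example.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical frame format: (is_speech, recognized word or None).
Frame = Tuple[bool, Optional[str]]


def capture_query(
    frames: Iterable[Frame],
    freeze_word: str = "stop",   # assumed freeze word, for illustration only
    endpoint_frames: int = 3,    # assumed "predetermined duration" in frames
) -> Tuple[List[str], str]:
    """Collect query words until the freeze word or a silence timeout."""
    words: List[str] = []
    silence = 0
    for is_speech, word in frames:
        # Freeze word detected before the silence timeout: hard microphone
        # close, so nothing after this frame is captured.
        if word is not None and word.lower() == freeze_word:
            return words, "hard_close"
        if is_speech:
            silence = 0
            if word is not None:
                words.append(word)
        else:
            silence += 1
            # Regular endpointing after enough consecutive non-speech frames.
            if silence >= endpoint_frames:
                return words, "endpoint"
    return words, "stream_end"
```

In this sketch, audio that follows the freeze word is never read from the stream, which mirrors the abstract's statement that the hard microphone closing event prevents any further capture.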
SYSTEM AND METHOD OF PERFORMING AUTOMATIC SPEECH RECOGNITION USING END-POINTING MARKERS GENERATED USING ACCELEROMETER-BASED VOICE ACTIVITY DETECTOR
A method of performing automatic speech recognition (ASR) using end-pointing markers generated using an accelerometer-based voice activity detector starts with a voice activity detector (VAD) generating an accelerometer VAD output (VADa) based on data output by at least one accelerometer that is included in at least one earbud. The at least one accelerometer detects vibration of the user's vocal cords. A voice processor detects a speech signal based on acoustic signals from at least one microphone. An end-pointer generates the end-pointing markers based on the VADa output, and an ASR engine performs ASR on the speech signal based on the end-pointing markers. Other embodiments are also described.
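One way to picture the step from VADa output to end-pointing markers is shown below. This is a sketch under stated assumptions: the per-frame energy input, the fixed threshold, and the function names `accel_vad` and `endpoint_markers` are hypothetical and do not come from the abstract.

```python
from typing import List, Sequence, Tuple


def accel_vad(frame_energy: Sequence[float], threshold: float) -> List[bool]:
    """Assumed VADa: a frame is voiced when accelerometer energy exceeds a threshold."""
    return [energy > threshold for energy in frame_energy]


def endpoint_markers(vad: Sequence[bool]) -> List[Tuple[int, int]]:
    """Turn a binary VAD track into (start, end) frame markers, end exclusive."""
    markers: List[Tuple[int, int]] = []
    start = None
    for i, active in enumerate(vad):
        if active and start is None:
            start = i                      # speech onset
        elif not active and start is not None:
            markers.append((start, i))     # speech offset
            start = None
    if start is not None:
        markers.append((start, len(vad)))  # speech runs to the end of the track
    return markers
```

An ASR engine could then be run only on the microphone frames that fall inside each `(start, end)` pair.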
SYSTEMS AND METHODS FOR POWER-EFFICIENT KEYWORD DETECTION
Systems and methods for audio processing include capturing sound data via at least one microphone of a network microphone device (NMD) and determining whether the captured sound includes voice activity. While in a first stage, the NMD forgoes spatial processing of the captured sound data. If the NMD determines that the detected sound includes voice activity, the NMD transitions to a second stage. In this second stage, the NMD spatially processes the detected sound to produce filtered sound data and detects a wake word. After detecting the wake word, the NMD may determine an action to be performed based on the captured sound data.
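The two-stage flow can be sketched roughly as below. Everything here is an assumption for illustration: the mean-square energy check stands in for voice activity detection, channel averaging stands in for spatial processing, and `wake_word_detector` is a hypothetical callable supplied by the caller.

```python
from typing import Callable, List, Sequence


def has_voice_activity(channel: Sequence[float], energy_threshold: float) -> bool:
    """Stage one: cheap single-channel energy check (channel must be non-empty)."""
    energy = sum(x * x for x in channel) / len(channel)
    return energy > energy_threshold


def spatially_filter(channels: Sequence[Sequence[float]]) -> List[float]:
    """Stage two stand-in: average the channels sample by sample."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def process_block(
    channels: Sequence[Sequence[float]],
    wake_word_detector: Callable[[List[float]], bool],
    energy_threshold: float = 0.01,
) -> str:
    # First stage: no spatial processing is done unless voice activity is found.
    if not has_voice_activity(channels[0], energy_threshold):
        return "stage1_idle"
    # Second stage: spatial processing, then wake-word detection.
    filtered = spatially_filter(channels)
    return "wake_word" if wake_word_detector(filtered) else "stage2_no_wake_word"
```

The power saving in this sketch comes from the early return: the multi-channel work and the detector are skipped entirely while the device is in the first stage.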
Hybrid decoding using hardware and software for automatic speech recognition systems
Embodiments describe a method for decoding speech including receiving speech input at an audio input device, generating speech data that is a digital representation of the speech input, extracting acoustic features of the speech data, assigning acoustic scores to the acoustic features, receiving data representing the acoustic features and the acoustic scores, decoding the data representing the acoustic features into a word, having a word score, by referencing a WFST language model, modifying the word score into a new word score based on a personalized grammar model stored in the external memory device, where the processor is separate from and external to the WFST accelerator, and determining an intent represented by a plurality of words outputted by the WFST accelerator, where the plurality of words includes the word and the new word score.
NOISE DATA AUGMENTATION FOR NATURAL LANGUAGE PROCESSING
Techniques are provided for noise data augmentation for training chatbot systems in natural language processing. In one particular aspect, a method is provided that includes receiving a training set of utterances for training an intent classifier to identify one or more intents for one or more utterances; augmenting the training set of utterances with noise text to generate an augmented training set of utterances; and training the intent classifier using the augmented training set of utterances. The augmenting includes: obtaining the noise text from a list of words, a text corpus, a publication, a dictionary, or any combination thereof, irrelevant to the original text within the utterances of the training set of utterances, and incorporating the noise text within the utterances relative to the original text in the utterances of the training set of utterances at a predefined augmentation ratio to generate augmented utterances.
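The augmentation step might look like the sketch below. It makes several assumptions that the abstract does not state: the augmentation ratio is read as the fraction of utterances that receive a noisy copy, exactly one noise word is inserted per copy at a random position, and the name `augment_with_noise` is hypothetical.

```python
import random
from typing import List, Sequence


def augment_with_noise(
    utterances: Sequence[str],
    noise_words: Sequence[str],
    ratio: float,
    seed: int = 0,
) -> List[str]:
    """Return the original utterances plus noisy copies of a sampled subset."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    n_aug = int(len(utterances) * ratio)
    augmented = list(utterances)
    for utt in rng.sample(list(utterances), n_aug):
        tokens = utt.split()
        # Insert one unrelated word at a random position (assumed policy).
        position = rng.randint(0, len(tokens))
        tokens.insert(position, rng.choice(list(noise_words)))
        augmented.append(" ".join(tokens))
    return augmented
```

The returned list would then be used, together with the original intent labels, to train the intent classifier.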
Voice identification method, device, apparatus, and storage medium
A voice identification method, device, apparatus, and storage medium are provided. The method includes: receiving voice data; performing voice identification on the voice data to obtain first text data associated with the voice data; determining common text data in a preset fixed data table, wherein a similarity between a pronunciation of the determined common text data and a pronunciation of the first text data meets a preset condition, and wherein the determined common text data is a voice identification result with an occurrence number larger than a first preset threshold; and replacing the first text data with the determined common text data.
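The replacement step can be sketched as a lookup against a table of frequent results. This is an illustration only: character-level similarity from Python's `difflib.SequenceMatcher` stands in for pronunciation similarity, and the table contents, the thresholds, and the name `correct_transcript` are assumptions.

```python
from difflib import SequenceMatcher
from typing import Mapping


def correct_transcript(
    first_text: str,
    common_counts: Mapping[str, int],  # assumed table: result text -> occurrence count
    min_count: int = 100,              # assumed "first preset threshold"
    min_similarity: float = 0.8,       # assumed "preset condition"
) -> str:
    """Replace a recognition result with a frequent, similar-sounding one if any."""
    best_text, best_score = first_text, 0.0
    for candidate, count in common_counts.items():
        if count <= min_count:
            continue  # only frequently occurring results qualify
        # Character similarity used here as a proxy for pronunciation similarity.
        score = SequenceMatcher(None, first_text, candidate).ratio()
        if score >= min_similarity and score > best_score:
            best_text, best_score = candidate, score
    return best_text
```

A production system would more likely compare phoneme or pinyin sequences than raw characters; the control flow would stay the same.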