Patent classifications
G10L15/075
Speaker adaptation for attention-based encoder-decoder
Embodiments are associated with a speaker-independent attention-based encoder-decoder model that classifies output tokens based on input speech frames and is associated with a first output distribution, and a speaker-dependent attention-based encoder-decoder model that classifies output tokens based on input speech frames and is associated with a second output distribution. The speaker-dependent model is trained to classify output tokens based on input speech frames of a target speaker and is simultaneously trained to maintain a similarity between the first output distribution and the second output distribution. Automatic speech recognition is then performed on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.
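The joint objective described in this abstract can be sketched as a weighted sum of two terms. The abstract does not say how similarity between the two output distributions is measured, so this sketch assumes a KL divergence from the speaker-independent distribution to the speaker-dependent one. The function names and the interpolation weight are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adaptation_loss(sd_logits, si_logits, target, weight=0.5):
    """Per-token loss for adapting the speaker-dependent (SD) model.

    Assumption: similarity to the speaker-independent (SI) model is
    enforced with KL(SI || SD); the abstract does not name the measure.
    """
    p_sd = softmax(sd_logits)
    p_si = softmax(si_logits)
    # Term 1: classify the target speaker's token correctly.
    ce = -math.log(p_sd[target])
    # Term 2: stay close to the SI model's output distribution.
    kl = sum(p * math.log(p / q) for p, q in zip(p_si, p_sd) if p > 0)
    return (1 - weight) * ce + weight * kl
```

With weight set to 0 the loss reduces to plain cross-entropy on the target speaker's data, and with weight set to 1 only the similarity term remains.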
Apparatus, method, non-transitory computer-readable recording medium storing program, and robot
A processor causes a robot to execute any one of a first, second, and third action as an initial action. The initial action is executed for communication with a target person according to a captured image and a captured sound. When a sound is acquired by a microphone after execution of the current action, the processor causes the robot to execute an action one-level higher than the current action. The current action includes the initial action. When the sound is not acquired, the processor determines whether a time elapsed from the execution of the current action is shorter than a threshold. When the time is shorter than the threshold, the processor causes the robot to continue the current action. When the time is equal to or longer than the threshold, the processor causes the robot to execute an action one-level lower than the current action.
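The escalation rule in this abstract amounts to a small state transition. The sketch below assumes the three actions map to integer levels 1 to 3, with larger numbers meaning higher levels, and that the level is clamped at both ends. Neither detail is stated in the abstract, and the function and parameter names are hypothetical.

```python
def next_action(level, sound_heard, elapsed_s, threshold_s,
                min_level=1, max_level=3):
    """Return the action level to execute after the current one.

    Assumption: levels are integers and are clamped to
    [min_level, max_level]; the abstract does not say what happens
    at the highest or lowest level.
    """
    if sound_heard:
        # A sound was acquired: move one level higher.
        return min(level + 1, max_level)
    if elapsed_s < threshold_s:
        # No sound yet, threshold not reached: continue the current action.
        return level
    # No sound and the threshold has been reached: move one level lower.
    return max(level - 1, min_level)
```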
ROBUST EXPANDABLE DIALOGUE SYSTEM
An automated natural dialogue system provides a combination of structure and flexibility to allow for ease of annotation of dialogues as well as learning and expanding the capabilities of the dialogue system based on natural language interactions.
Systems and methods for speech recognition in unseen and noisy channel conditions
Systems and methods for speech recognition are provided. In some aspects, the method comprises receiving, using an input, an audio signal. The method further comprises splitting the audio signal into auditory test segments. The method further comprises extracting, from each of the auditory test segments, a set of acoustic features. The method further comprises applying the set of acoustic features to a deep neural network to produce a hypothesis for the corresponding auditory test segment. The method further comprises selectively performing one or more of: indirect adaptation of the deep neural network and direct adaptation of the deep neural network.
METHOD AND APPARATUS FOR CORRECTING FAILURES IN AUTOMATED SPEECH RECOGNITION SYSTEMS
Systems and methods are disclosed and described for correcting errors in ASR transcriptions. For an incorrect transcription, different words or phrases from the transcription, and/or related words or phrases, are submitted as hint words to the ASR system, and the voice query is submitted again, to determine new transcriptions. This process is repeated with different transcription terms until a different and more accurate transcription is generated. This increases the accuracy of ASR systems.
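The retry loop in this abstract might look like the following sketch. The recognize callable stands in for an ASR system that accepts hint words; it is hypothetical, not the API of any real library. The sketch stops at the first transcription that differs from the rejected one and leaves the judgment of whether it is actually better to the caller.

```python
def correct_transcription(recognize, audio, rejected, candidate_hints):
    """Resubmit the same audio with one hint word at a time.

    recognize(audio, hints) -> str is a hypothetical ASR callable.
    Returns the first transcription that differs from `rejected`,
    or None if every hint reproduces the rejected transcription.
    """
    for hint in candidate_hints:
        new_text = recognize(audio, [hint])
        if new_text != rejected:
            return new_text
    return None
```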
Speech model personalization via ambient context harvesting
An apparatus for speech model personalization via ambient context harvesting is described herein. The apparatus includes a microphone, a context harvesting module, a confidence module, and a training module. The context harvesting module is to determine a context associated with audio signals captured by the microphone. The confidence module is to determine a confidence of the context as applied to the audio signals. The training module is to train a neural network in response to the confidence being above a predetermined threshold.
Robust expandable dialogue system
An automated natural dialogue system provides a combination of structure and flexibility to allow for ease of annotation of dialogues as well as learning and expanding the capabilities of the dialogue system based on natural language interactions.
DYNAMIC CONTEXT EXTRACTION FROM MEDIA STREAMS
A method of enabling a virtual assistant (VA) serving a user to dynamically acquire contextual information regarding a digital media environment accessed by the user includes: extracting, by an analysis engine, the contextual information dynamically from at least one of media content accessed by the user and webpage content accessed by the user; and injecting, by the analysis engine, the extracted contextual information into a VA memory to serve the user. The analysis engine is configured to analyze the extracted contextual information using at least one machine learning (ML) model. The extracted contextual information includes at least one of topics, intents, entities, sentiments, and products of interest. The at least one ML model includes at least one of Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), speaker diarization, sentiment analysis on media streams, and web analytics for product focus.
Bias detection in speech recognition models
Systems and methods are provided for detecting demographic bias in automatic speech recognition (ASR) systems. Corpora of transcriptions from different demographic groups are analyzed, where a first group is known to be susceptible to bias and a second group is known not to be susceptible to bias. A difference between the transcription accuracy for the first group and the transcription accuracy for the second group is measured. ASR accuracy for each group is measured, and the two accuracies are compared using both statistics-based and practicality-based methodologies to determine whether a given ASR system or model exhibits a meaningful level of bias. Based on the statistical significance and the practical significance, an alert including a recommendation to adjust the ASR model is generated.
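One way to combine a statistical check with a practical one is sketched below. The abstract does not name the tests used, so this sketch assumes a two-proportion z-test on word-level accuracy for statistical significance and a minimum absolute accuracy gap for practical significance. It also assumes an alert requires both. The function name and default thresholds are hypothetical.

```python
import math

def bias_alert(correct_a, total_a, correct_b, total_b,
               z_crit=1.96, min_gap=0.02):
    """Return True when the accuracy gap between two groups is both
    statistically and practically significant.

    Assumptions: two-proportion z-test; fixed minimum accuracy gap.
    """
    acc_a = correct_a / total_a
    acc_b = correct_b / total_b
    # Pooled accuracy under the null hypothesis of no difference.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (acc_b - acc_a) / se if se > 0 else 0.0
    statistically_significant = abs(z) >= z_crit
    practically_significant = abs(acc_b - acc_a) >= min_gap
    return statistically_significant and practically_significant
```

Requiring both conditions keeps very large corpora from triggering alerts on gaps that are measurable but too small to matter in practice.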
System and methods for accent and dialect modification
Systems and methods for accent and dialect modification are disclosed. Discussed are a method for selecting a target dialect and accent to use to modify voice communications based on a context and a method for selectively modifying one or more words in voice communications in one dialect and accent with one or more vocal features of a different accent.