Patent classifications
G10L15/075
Bias detection in speech recognition models
Systems and methods for detecting demographic bias in automatic speech recognition (ASR) systems. Corpuses of transcriptions from different demographic groups are analyzed, where one of the groups is known to be susceptible to bias and another group is known not to be susceptible to bias. ASR accuracy for each group is measured and compared to each other using both statistics-based and practicality-based methodologies to determine whether a given ASR system or model exhibits a meaningful level of bias.
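The group-comparison idea in this abstract can be sketched in a few lines: compute word error rate (WER) per group, then apply both a practicality-style check (absolute WER gap over a threshold) alongside the raw statistics. The `practical_gap` threshold and the report structure are assumptions for illustration, not details from the patent.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word tokens / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def bias_report(pairs_a, pairs_b, practical_gap=0.05):
    """Compare mean WER across two demographic groups.

    pairs_*: lists of (reference, hypothesis) transcription pairs.
    The practical_gap threshold stands in for the patent's
    practicality-based test; the statistics-based test would be
    layered on top of these per-group means.
    """
    wer_a = sum(wer(r, h) for r, h in pairs_a) / len(pairs_a)
    wer_b = sum(wer(r, h) for r, h in pairs_b) / len(pairs_b)
    gap = abs(wer_a - wer_b)
    return {"wer_a": wer_a, "wer_b": wer_b, "gap": gap,
            "practically_biased": gap > practical_gap}
```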
Dialog management system
Techniques for determining an intent for a user input in a dialog are described. The system processes historic interaction data that is structured based on skills and intents, with each skill-intent pair being associated with one or more past user inputs received by the system, one or more sample inputs, and one or more alternative representations of the user inputs. Based on processing of the historic interaction data and dialog data of previous turns of the dialog, the system determines potential intents for the user input of the current turn of the dialog. The potential intents may correspond to a presently active skill or another skill, enabling the user to interact with another skill during the dialog.
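A minimal sketch of scoring skill-intent pairs against a current utterance, assuming the historic interaction data is a mapping from (skill, intent) pairs to their sample inputs and giving a small bonus for overlap with previous-turn context. The data layout and the 0.1 context weight are illustrative assumptions, not details from the patent.

```python
def score_intents(utterance, interaction_data, dialog_context=()):
    """Rank (skill, intent) pairs by token overlap with their sample inputs.

    interaction_data: {(skill, intent): [sample input strings]}
    dialog_context: tokens from previous dialog turns, lightly weighted
    so intents from other skills can still surface mid-dialog.
    """
    tokens = set(utterance.lower().split())
    ctx = set(w.lower() for w in dialog_context)
    scored = []
    for (skill, intent), samples in interaction_data.items():
        best = 0.0
        for s in samples:
            st = set(s.lower().split())
            overlap = len(tokens & st) / max(len(tokens | st), 1)
            overlap += 0.1 * len(ctx & st) / max(len(st), 1)  # context bonus
            best = max(best, overlap)
        scored.append(((skill, intent), best))
    return sorted(scored, key=lambda x: -x[1])
```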
Fully supervised speaker diarization
A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive the extracted speaker-discriminative embedding as a feature input. The probabilistic generative model is trained on a corpus of training speech utterances each segmented into a plurality of training segments. Each training segment includes a corresponding speaker-discriminative embedding and a corresponding speaker label. The method also includes assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
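The labeled-training-then-distribution-then-assignment flow above can be sketched with a deliberately simple generative stand-in: fit a per-speaker mean embedding, then softmax over negative squared distances to get a distribution per segment. The real model in the patent is more sophisticated; this only mirrors the supervised interface.

```python
import math

def train_speaker_model(training_segments):
    """Fit a per-speaker mean embedding from labeled training segments.

    training_segments: list of (embedding_vector, speaker_label) pairs,
    standing in for the patent's corpus of segmented training utterances.
    """
    sums, counts = {}, {}
    for emb, label in training_segments:
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def speaker_distribution(model, embedding):
    """Probability distribution over speakers: softmax of negative
    squared distance to each speaker's mean embedding."""
    scores = {lab: -sum((a - b) ** 2 for a, b in zip(mean, embedding))
              for lab, mean in model.items()}
    m = max(scores.values())
    exp = {lab: math.exp(s - m) for lab, s in scores.items()}
    z = sum(exp.values())
    return {lab: e / z for lab, e in exp.items()}

def diarize(model, segment_embeddings):
    """Assign each segment the most probable speaker label."""
    return [max(speaker_distribution(model, e).items(), key=lambda kv: kv[1])[0]
            for e in segment_embeddings]
```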
LINGUISTIC MODEL SELECTION FOR ADAPTIVE AUTOMATIC SPEECH RECOGNITION
The present disclosure describes dynamically adjusting linguistic models for automatic speech recognition based on biometric information to produce a more reliable speech recognition experience. Embodiments include receiving a speech signal, receiving a biometric signal from a biometric sensor implemented at least partially in hardware, determining a linguistic model based on the biometric signal, and processing the speech signal for speech recognition using the linguistic model based on the biometric signal.
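The receive-biometric / select-model / decode loop can be shown as a small dispatch. The heart-rate field, the 100 bpm threshold, and the "calm"/"stressed" model names are all hypothetical; the patent leaves the biometric-to-model mapping open.

```python
def select_linguistic_model(biometric, models):
    """Choose a linguistic model from a biometric reading.

    biometric: e.g. {"heart_rate": 110} from a hardware biometric sensor.
    models: named linguistic models; the keys here are illustrative.
    """
    hr = biometric.get("heart_rate", 70)
    key = "stressed" if hr > 100 else "calm"
    return models[key]

def recognize(speech_signal, biometric, models):
    """Sketch of the adaptive pipeline: pick the model dynamically,
    then process the speech signal with it."""
    model = select_linguistic_model(biometric, models)
    return model(speech_signal)
```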
Systems and methods for learning for domain adaptation
A method for training parameters of a first domain adaptation model. The method includes evaluating a cycle consistency objective using a first task specific model associated with a first domain and a second task specific model associated with a second domain, and evaluating one or more first discriminator models to generate a first discriminator objective using the second task specific model. The one or more first discriminator models include a plurality of discriminators corresponding to a plurality of bands that correspond to domain variable ranges of the first and second domains, respectively. The method further includes updating, based on the cycle consistency objective and the first discriminator objective, one or more parameters of the first domain adaptation model for adapting representations from the first domain to the second domain.
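The two objectives can be sketched abstractly: an L1 cycle-reconstruction loss through a forward and a backward mapping, and a discriminator objective where each translated sample is routed to the discriminator covering its band of the domain variable. The band function, the choice of L1, and the `1 - score` generator objective are illustrative assumptions.

```python
def cycle_consistency_loss(x_batch, g_ab, g_ba):
    """Mean L1 cycle loss: x -> g_ab -> g_ba should reconstruct x."""
    total = 0.0
    for x in x_batch:
        x_rec = g_ba(g_ab(x))
        total += sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)
    return total / len(x_batch)

def banded_discriminator_objective(samples, band_fn, discriminators):
    """Score each adapted sample with the discriminator for its band.

    band_fn maps a sample to a band index (e.g. a range of some domain
    variable such as SNR -- a hypothetical choice). The generator wants
    scores near 1 ('real'), so the objective is the mean of (1 - score).
    """
    total = 0.0
    for s in samples:
        d = discriminators[band_fn(s)]
        total += 1.0 - d(s)
    return total / len(samples)
```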
Streaming Action Fulfillment Based on Partial Hypotheses
A method for streaming action fulfillment receives audio data corresponding to an utterance where the utterance includes a query to perform an action that requires performance of a sequence of sub-actions in order to fulfill the action. While receiving the audio data, but before receiving an end of speech condition, the method processes the audio data to generate intermediate automated speech recognition (ASR) results, performs partial query interpretation on the intermediate ASR results to determine whether the intermediate ASR results identify an application type needed to perform the action and, when the intermediate ASR results identify a particular application type, performs a first sub-action in the sequence of sub-actions by launching a first application to execute on the user device where the first application is associated with the particular application type. The method, in response to receiving an end of speech condition, fulfills performance of the action.
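The streaming behavior described above can be sketched as a loop over intermediate ASR results that launches the matching application as soon as an application type is identified, before the end-of-speech condition, and fulfills on the final result. The keyword table and callback names are hypothetical.

```python
# Hypothetical keyword -> application-type table for partial query
# interpretation; a real system would use a trained interpreter.
APP_TYPES = {"play": "music_player", "navigate": "maps", "call": "dialer"}

def stream_fulfill(partial_results, on_launch, on_fulfill):
    """Consume intermediate ASR results as they stream in.

    partial_results: iterable of (text, is_final) tuples; is_final=True
    signals the end-of-speech condition. on_launch performs the first
    sub-action (launching the app) before endpointing; on_fulfill
    completes the remaining sub-actions once speech has ended.
    """
    launched = None
    for text, is_final in partial_results:
        if launched is None:
            for kw, app in APP_TYPES.items():
                if kw in text.lower():
                    launched = app
                    on_launch(app)  # first sub-action, pre-endpoint
                    break
        if is_final:
            on_fulfill(launched, text)  # fulfill the full action
            return launched
    return launched
```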
Memory deterioration detection and amelioration
Memory deterioration detection and evaluation includes capturing a plurality of human utterances with a voice interface and generating, for a user, a human utterances corpus that comprises human utterances selected from the plurality of human utterances based on meanings of the human utterances as determined by natural language processing by a computer processor. Based on data generated in response to signals sensed by one or more sensing devices operatively coupled with the computer processor, contextual information corresponding to one or more human utterances of the corpus is determined. Patterns among the corpus of human utterances are recognized based on pattern recognition performed by the computer processor using one or more machine learning models. Based on the pattern recognition, a change in memory functioning of the user is identified. The identified change is classified, based on the contextual information, as to whether the change is likely due to memory impairment of the user.
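As one concrete example of the kind of pattern such a system might recognize (chosen here for illustration, not taken from the patent), a rising rate of repeated questions within a short window of utterances could flag a change in memory functioning:

```python
def repeated_question_rate(utterances, window=10):
    """Fraction of questions that repeat an earlier question within a
    sliding window of recent utterances -- one simple, hypothetical
    pattern that could feed the classification step."""
    norm = [u.lower().strip() for u in utterances]
    repeats = questions = 0
    for i, u in enumerate(norm):
        if not u.endswith("?"):
            continue
        questions += 1
        if u in norm[max(0, i - window):i]:
            repeats += 1
    return repeats / questions if questions else 0.0
```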
Systems and methods for an automatic language characteristic recognition system
In some embodiments, a method of creating an automatic language characteristic recognition system is provided. The method can include receiving a plurality of audio recordings. The method also can include segmenting each of the plurality of audio recordings to create a plurality of audio segments for each audio recording. The method additionally can include clustering each audio segment of the plurality of audio segments according to audio characteristics of each audio segment to form a plurality of audio segment clusters. Other embodiments are provided.
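The segment-then-cluster pipeline can be sketched with fixed-length segmentation and a minimal k-means over one scalar audio characteristic per segment (mean energy here, chosen only for illustration; the patent does not specify the characteristics).

```python
def segment(audio, seg_len):
    """Split a sample sequence into fixed-length segments."""
    return [audio[i:i + seg_len] for i in range(0, len(audio), seg_len)]

def mean_energy(seg):
    """One illustrative audio characteristic: mean squared amplitude."""
    return sum(s * s for s in seg) / len(seg)

def kmeans_1d(values, k, iters=20):
    """Minimal k-means over scalar features to form segment clusters."""
    vs = sorted(values)
    if k > 1:
        centers = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    else:
        centers = [vs[0]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[j].append(v)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
    return labels, centers
```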
System and method for machine-mediated human-human conversation
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing speech. A system configured to practice the method monitors user utterances to generate a conversation context. Then the system receives a current user utterance independent of non-natural language input intended to trigger speech processing. The system compares the current user utterance to the conversation context to generate a context similarity score, and if the context similarity score is above a threshold, incorporates the current user utterance into the conversation context. If the context similarity score is below the threshold, the system discards the current user utterance. The system can compare the current user utterance to the conversation context based on an n-gram distribution, a perplexity score, and a perplexity threshold. Alternately, the system can use a task model to compare the current user utterance to the conversation context.
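The compare-then-incorporate-or-discard logic can be sketched with a word n-gram overlap score as a simple stand-in for the abstract's n-gram-distribution and perplexity comparison; the 0.3 threshold and the Jaccard-style score are illustrative assumptions.

```python
def context_similarity(utterance, context_utterances, n=2):
    """Overlap of word n-grams (plus unigrams) between the current
    utterance and the accumulated conversation context."""
    def ngrams(text):
        w = text.lower().split()
        grams = set(tuple(w[i:i + n]) for i in range(len(w) - n + 1))
        return grams | set((t,) for t in w)
    cur = ngrams(utterance)
    ctx = set()
    for u in context_utterances:
        ctx |= ngrams(u)
    return len(cur & ctx) / max(len(cur), 1)

def update_context(utterance, context, threshold=0.3):
    """Incorporate the utterance into the conversation context only if
    its similarity score clears the threshold; otherwise discard it as
    speech not addressed to the system."""
    if not context or context_similarity(utterance, context) >= threshold:
        context.append(utterance)
        return True
    return False
```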
NOISE COMPENSATION IN SPEAKER-ADAPTIVE SYSTEMS
A method of adapting an acoustic model relating acoustic units to speech vectors, wherein said acoustic model comprises a set of speech factor parameters related to a given speech factor and which enable the acoustic model to output speech vectors with different values of the speech factor, the method comprising: inputting a sample of speech with a first value of the speech factor;
determining values of the set of speech factor parameters which enable the acoustic model to output speech with said first value of the speech factor; and
employing said determined values of the set of speech factor parameters in said acoustic model, wherein said sample of speech is corrupted by noise, and wherein said step of determining the values of the set of speech factor parameters comprises: (i) obtaining noise characterization parameters characterising the noise; (ii) performing a speech factor parameter generation algorithm on the sample of speech, thereby generating corrupted values of the set of speech factor parameters; (iii) using the noise characterization parameters to map said corrupted values of the set of speech factor parameters to clean values of the set of speech factor parameters, wherein the clean values of the set of speech factor parameters are estimates of the speech factor parameters which would be obtained by performing the speech factor parameter generation algorithm on the sample of speech if the sample of speech were not corrupted by the noise; and (iv) employing said clean values of the set of speech factor parameters as said determined values of the set of speech factor parameters.
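Steps (i)-(iv) of the claim can be sketched end to end: characterize the noise from non-speech frames, generate corrupted parameters, and map them to clean estimates. The patent's mapping would be learned; the additive compensation below is purely illustrative.

```python
def characterize_noise(frames, speech_present):
    """Step (i): estimate a noise level from frames flagged as
    non-speech (a simple, hypothetical characterization)."""
    noise = [f for f, sp in zip(frames, speech_present) if not sp]
    return sum(noise) / len(noise) if noise else 0.0

def map_to_clean(corrupted_params, noise_level, alpha=1.0):
    """Step (iii): map corrupted speech-factor parameter values to clean
    estimates. This linear subtraction of the noise contribution stands
    in for the learned mapping; alpha is an illustrative scale."""
    return [p - alpha * noise_level for p in corrupted_params]

def adapt(frames, speech_present, generate_params):
    """Steps (i)-(iv) combined: characterize noise, run the parameter
    generation algorithm on the noisy sample (step ii), map to clean
    values, and return them for use in the acoustic model (step iv)."""
    noise_level = characterize_noise(frames, speech_present)
    corrupted = generate_params(frames)
    return map_to_clean(corrupted, noise_level)
```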