Patent classifications
G10L2015/0638
Natural language processing routing
Devices and techniques are generally described for a speech processing routing architecture. In various examples, first data comprising a first feature definition is received. The first feature definition may include a first indication of first source data and first instructions for generating feature data using the first source data. In various examples, the feature data may be generated according to the first feature definition. In some examples, a speech processing system may receive a first request to process a first utterance. The feature data may be retrieved from a non-transitory computer-readable memory. The speech processing system may determine a first skill for processing the first utterance based at least in part on the feature data.
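The flow the abstract describes can be sketched in a few lines, assuming a feature definition pairs source data with instructions (here, a plain function) and that feature data maps keywords to skills. All names, the keyword-match routing rule, and the fallback skill are illustrative assumptions, not the patented design:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class FeatureDefinition:
    source_data: Dict[str, list]          # indication of source data
    instructions: Callable[[Dict], Dict]  # how to derive feature data

def generate_feature_data(defn: FeatureDefinition) -> Dict:
    """Generate feature data according to the feature definition."""
    return defn.instructions(defn.source_data)

def route_utterance(utterance: str, feature_data: Dict[str, str]) -> str:
    """Pick a skill for processing the utterance based on stored feature data."""
    for keyword, skill in feature_data.items():
        if keyword in utterance.lower():
            return skill
    return "fallback_skill"

# Example: derive keyword-to-skill feature data from per-domain word lists.
defn = FeatureDefinition(
    source_data={"music": ["play", "song"], "weather": ["forecast", "rain"]},
    instructions=lambda src: {kw: f"{domain}_skill"
                              for domain, kws in src.items() for kw in kws},
)
features = generate_feature_data(defn)  # stored, then retrieved per request
print(route_utterance("Play my morning song", features))  # → music_skill
```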
Model training system for custom speech-to-text models
- Vivek Govindan ,
- Varun Sembium Varadarajan ,
- Christian Egon Berkhoff Dossow ,
- Himalay Mohanlal Joriwal ,
- Sai Madhuri Bhavirisetty ,
- Abhinav Kumar ,
- Orestis Lykouropoulos ,
- Akshay Nalwaya ,
- Rahul Gupta ,
- Sravan Babu Bodapati ,
- Liangwei Guo ,
- Julian E. S. Salazar ,
- Yibin Wang ,
- K P N V D S Siva Rama ,
- Calvin Xuan Li ,
- Mohit Narendra Gupta ,
- Asem Rustum ,
- Katrin Kirchhoff ,
- Pu Zhao
A transcription service may receive a request from a developer to build a custom speech-to-text model for a specific domain of speech. The custom speech-to-text model for the specific domain may replace a general speech-to-text model or add to a set of one or more speech-to-text models available for transcribing speech. The transcription service may receive a training data set and instructions representing tasks. The transcription service may determine respective schedules for executing the instructions based at least in part on dependencies between the tasks. The transcription service may execute the instructions according to the respective schedules to train a speech-to-text model for the specific domain using the training data set. The transcription service may deploy the trained speech-to-text model as part of a network-accessible service for an end user to convert audio in the specific domain into text.
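Scheduling instructions based on task dependencies can be sketched with a topological ordering, assuming each task lists the tasks it depends on. The task names and dependency graph below are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical training pipeline: each task maps to the set of tasks
# that must complete before it can run.
tasks = {
    "validate_data": set(),
    "tokenize": {"validate_data"},
    "build_lexicon": {"validate_data"},
    "train_model": {"tokenize", "build_lexicon"},
    "evaluate": {"train_model"},
    "deploy": {"evaluate"},
}

# static_order() yields an execution schedule in which every task
# appears after all of its dependencies.
schedule = list(TopologicalSorter(tasks).static_order())
print(schedule)
```

Executing the instructions in `schedule` order guarantees, for example, that the model is trained only after both preprocessing tasks finish.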
Intent authoring using weak supervision and co-training for automated response systems
A combination of propagation operations and learning algorithms is applied, using a selected set of labeled conversational logs retrieved from a subset of a plurality of conversational logs, to a remaining corpus of the plurality of conversational logs to train an automated response system according to an intent associated with each of the conversational logs. The combination of propagation operations and learning algorithms may include defining the labels by a user for the selected set of the subset of the plurality of conversational logs; training a probabilistic classifier using the defined labels of features of the selected set, wherein the probabilistic classifier produces labeling decisions for the subset of conversational logs; weighting the features of the selected set in a model optimization process; and/or training an additional classifier using the weighted features of the selected set and applying the additional classifier to the remaining corpus.
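A minimal sketch of the label-propagation idea, with a toy word-frequency classifier standing in for the probabilistic classifier the abstract describes; the intent labels and conversational logs are illustrative:

```python
from collections import Counter, defaultdict

def train(labeled_logs):
    """Count word frequencies per intent label from the labeled subset."""
    weights = defaultdict(Counter)
    for text, intent in labeled_logs:
        weights[intent].update(text.lower().split())
    return weights

def classify(text, weights):
    """Propagate an intent label to an unlabeled log by feature overlap."""
    words = text.lower().split()
    return max(weights, key=lambda intent:
               sum(weights[intent][w] for w in words))

# Small user-labeled set; the trained classifier then labels the
# remaining corpus of conversational logs.
labeled = [("forgot my password help", "account_help"),
           ("what is my bill total", "billing")]
weights = train(labeled)
print(classify("my password is not working", weights))  # → account_help
```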
ENHANCING SIGNATURE WORD DETECTION IN VOICE ASSISTANTS
Systems and methods for detecting a spoken sentence in a speech recognition system are disclosed herein. Speech data is buffered based on an audio signal captured at a computing device operating in an active mode. The speech data is buffered irrespective of whether the speech data comprises a signature word. The buffered speech data is processed to detect a presence of the sentence comprising at least one command and a query for the computing device. Processing the buffered speech data includes detecting the signature word in the buffered speech data, and in response to detecting the signature word in the speech data, initiating detection of the sentence in the buffered speech data.
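The buffering idea can be sketched with a fixed-size queue over recognized tokens: every token is buffered whether or not it is the signature word, and only when the signature word arrives is the buffer scanned for a sentence holding a command and a query. The signature word, command set, and query words below are assumptions:

```python
from collections import deque

buffer = deque(maxlen=8)  # fixed-size stand-in for the audio buffer
COMMANDS = {"play", "stop", "set"}
QUERY_WORDS = {"what", "when", "who"}

def process(tokens, signature="computer"):
    for tok in tokens:
        buffer.append(tok)            # buffered irrespective of content
        if tok == signature:          # signature word initiates detection
            buffered = list(buffer)
            return {"command": any(t in COMMANDS for t in buffered),
                    "query": any(t in QUERY_WORDS for t in buffered)}
    return None                       # no signature word: buffer only

# Speech preceding the signature word is already in the buffer when
# detection is initiated, so "play some music computer" is recoverable.
result = process(["play", "some", "music", "computer"])
print(result)
```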
SOLUTION GUIDED RESPONSE GENERATION FOR DIALOG SYSTEMS
A processor may receive first voice data associated with a first user utterance in conversation in a guided dialog system. The processor may identify from the first voice data a first topic of a set of topics associated with the first user utterance. The processor may identify a first solution associated with the first topic, the first solution having one or more solution segments for accomplishing a task related to the topic. The processor may generate a first response for a second user based on a first solution segment of the first solution and the first voice data.
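A toy sketch of the lookup-and-respond flow, assuming topics are matched by keyword and each topic's solution is an ordered list of segments; the topic, segments, and fallback wording are invented:

```python
# Hypothetical solution store: topic → ordered solution segments.
SOLUTIONS = {
    "wifi": ["restart the router", "check the cable", "call support"],
}

def identify_topic(utterance):
    """Identify the first known topic mentioned in the utterance."""
    return next((t for t in SOLUTIONS if t in utterance.lower()), None)

def first_response(utterance):
    """Generate a response from the first solution segment of the topic."""
    topic = identify_topic(utterance)
    if topic is None:
        return "Could you tell me more about the problem?"
    segment = SOLUTIONS[topic][0]
    return f"Let's start with this step: {segment}."

print(first_response("My wifi keeps dropping"))
```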
MACHINE-LEARNING-MODEL BASED NAME PRONUNCIATION
A computer-implemented conferencing method is disclosed. A conference session between a user and one or more other conference participants is initiated via a computer conference application. An attribute-specific pronunciation of the user's name is determined via one or more attribute-specific-pronunciation machine-learning models previously trained based at least on one or more attributes of the one or more other conference participants. The attribute-specific pronunciation of the user's name is compared to a preferred pronunciation of the user's name via computer-pronunciation-comparison logic. Based on the attribute-specific pronunciation of the user's name being inconsistent with the preferred pronunciation of the user's name, a pronunciation learning protocol is automatically executed to convey, via the computer conference application, the preferred pronunciation of the user's name to the one or more other conference participants.
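The comparison-and-remind logic can be sketched as below, with a lookup table keyed on an assumed participant attribute (locale) standing in for the attribute-specific machine-learning model; names, locales, and pronunciation strings are illustrative:

```python
# Stand-in for the attribute-specific-pronunciation model's predictions.
PREDICTED = {("Jana", "en-US"): "JAY-nuh", ("Jana", "cs-CZ"): "YAH-nah"}

def check_pronunciation(name, locale, preferred):
    """Compare the attribute-specific pronunciation to the preferred one;
    on a mismatch, return a learning-protocol message for participants."""
    predicted = PREDICTED.get((name, locale))
    if predicted != preferred:
        return f"Reminder: {name} is pronounced {preferred}."
    return None  # consistent: no protocol needed

print(check_pronunciation("Jana", "en-US", "YAH-nah"))
```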
Noise data augmentation for natural language processing
Techniques for noise data augmentation for training chatbot systems in natural language processing. In one particular aspect, a method is provided that includes receiving a training set of utterances for training an intent classifier to identify one or more intents for one or more utterances; augmenting the training set of utterances with noise text to generate an augmented training set of utterances; and training the intent classifier using the augmented training set of utterances. The augmenting includes: obtaining the noise text from a list of words, a text corpus, a publication, a dictionary, or any combination thereof, independent of the original text within the utterances of the training set; and incorporating the noise text within the utterances, relative to the original text, at a predefined augmentation ratio to generate augmented utterances.
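A toy sketch of augmentation at a predefined ratio: for each training utterance, noise words are inserted so that roughly `ratio` of the original token count becomes noise. The noise word list and the 25% ratio are assumptions for illustration:

```python
import random

def augment(utterance, noise_words, ratio=0.25, seed=0):
    """Insert noise words into an utterance at a predefined ratio."""
    rng = random.Random(seed)  # seeded for reproducibility
    tokens = utterance.split()
    n_noise = max(1, round(len(tokens) * ratio))
    for _ in range(n_noise):
        pos = rng.randrange(len(tokens) + 1)   # random insertion point
        tokens.insert(pos, rng.choice(noise_words))
    return " ".join(tokens)

noise = ["um", "uh", "like"]
out = augment("book a table for two", noise)
print(out)
```

The augmented utterance keeps its original tokens, so the intent label carries over unchanged to the augmented training set.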
Text independent speaker recognition
Text independent speaker recognition models can be utilized by an automated assistant to verify a particular user spoke a spoken utterance and/or to identify the user who spoke a spoken utterance. Implementations can include automatically updating a speaker embedding for a particular user based on previous utterances by the particular user. Additionally or alternatively, implementations can include verifying a particular user spoke a spoken utterance using output generated by both a text independent speaker recognition model as well as a text dependent speaker recognition model. Furthermore, implementations can additionally or alternatively include prefetching content for several users associated with a spoken utterance prior to determining which user spoke the spoken utterance.
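The automatic-update idea can be sketched by treating speaker embeddings as plain vectors: verify by cosine similarity against the stored embedding, and on success refresh the stored embedding with a running average over verified utterances. The threshold and update rate are assumed values:

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def verify_and_update(stored, new, threshold=0.8, alpha=0.1):
    """Verify the speaker; if verified, blend the new embedding in."""
    if cosine(stored, new) < threshold:
        return stored, False          # not the same speaker: no update
    updated = [(1 - alpha) * s + alpha * n for s, n in zip(stored, new)]
    return updated, True

emb, ok = verify_and_update([1.0, 0.0], [0.9, 0.1])
print(ok)  # cosine ≈ 0.994, above the threshold
```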
EXPLAINING ANOMALOUS PHONETIC TRANSLATIONS
A method includes: receiving, by a computing device, a digital voice stream; receiving, by the computing device, converted text that represents the digital voice stream; identifying, by the computing device, an erroneously converted portion of the converted text; selecting, by the computing device, the erroneously converted portion for explainability processing; parsing, by the computing device, the erroneously converted portion into parts based on a predetermined parsing level; collecting, by the computing device, supplementary input data related to the erroneously converted portion; and determining, by the computing device and based on the supplementary input data, a reason why the erroneously converted portion was erroneously converted.
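The parsing step can be sketched as below, assuming "word" and "character" as the available parsing levels and a supplementary-data record attached to each part; the levels and context fields are illustrative assumptions:

```python
def parse_portion(text, level="word"):
    """Split an erroneously converted portion at the chosen parsing level."""
    if level == "word":
        return text.split()
    if level == "character":
        return list(text.replace(" ", ""))
    raise ValueError(f"unknown parsing level: {level}")

def with_context(parts, supplementary):
    """Attach supplementary input data to each parsed part."""
    return [{"part": p, **supplementary} for p in parts]

# e.g. "to the store" was converted as "two the store"
parts = parse_portion("two the store", level="word")
explained = with_context(parts, {"speaker_accent": "unknown", "noise_db": 62})
print(explained)
```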
Filtering directive invoking vocal utterances
Methods, computer program products, and systems are presented. The methods, computer program products, and systems can include, for instance: receiving, from a user, voice data defining a candidate directive invoking vocal utterance for invoking a directive to execute a first text based command to perform a first computer function of a computer system, wherein the candidate directive invoking vocal utterance includes at least one word or phrase of the text based command, wherein the computer system is configured to perform the first computer function in response to the first text based command and wherein the computer system is configured to perform a second computer function in response to a second text based command; determining, based on machine logic, whether a word or phrase of the candidate vocal utterance sounds confusingly similar to a speech rendering of a word or phrase defining the second text based command.
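The filtering decision can be sketched with string similarity standing in for the phonetic comparison the abstract describes; the similarity measure and threshold are assumptions:

```python
from difflib import SequenceMatcher

def sounds_confusingly_similar(candidate, existing_command, threshold=0.8):
    """Flag a candidate utterance whose text is near an existing command."""
    ratio = SequenceMatcher(None, candidate.lower(),
                            existing_command.lower()).ratio()
    return ratio >= threshold

print(sounds_confusingly_similar("turn on the light", "turn on the lights"))
```

A flagged candidate would be rejected or queued for disambiguation, so it cannot accidentally trigger the second command.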