Patent classifications
G10L15/06
Skill shortlister for natural language processing
Devices and techniques are generally described for application determination in speech processing. Input data corresponding to a spoken utterance may be received. Speech recognition processing may be performed on the input data to generate text data. A machine learning encoder may generate a vector representation of the input data. A first binary classifier may determine a first probability that the input data corresponds to a first speech-processing application. A second binary classifier may determine a second probability that the input data corresponds to a second speech-processing application. A selection between the first speech-processing application and the second speech-processing application may be made based at least in part on the first probability and the second probability.
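The shortlisting flow in this abstract (encode the utterance, score it with one independent binary classifier per application, pick the winner) can be sketched as follows. This is an illustrative toy, not the patented system: the bag-of-words "encoder", the vocabulary, the hand-set weights, and the application names are all invented placeholders.

```python
import math

VOCAB = ["play", "music", "weather", "today"]  # toy vocabulary (illustrative)

def encode(text: str) -> list:
    # Stand-in for the machine learning encoder: a bag-of-words count
    # vector over a tiny fixed vocabulary.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def binary_classifier(weights, bias):
    # One independent sigmoid classifier per speech-processing application.
    def prob(x):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return prob

CLASSIFIERS = {  # hand-set weights purely for illustration
    "music_app": binary_classifier([1.0, 1.0, 0.0, 0.0], -1.0),
    "weather_app": binary_classifier([0.0, 0.0, 1.0, 1.0], -1.0),
}

def shortlist(text: str) -> str:
    # Score the utterance with every per-application classifier and
    # select the application with the highest probability.
    x = encode(text)
    probs = {app: clf(x) for app, clf in CLASSIFIERS.items()}
    return max(probs, key=probs.get)
```

Because each classifier is binary and independent, applications can be added or removed without retraining the others, which is the practical appeal of a shortlister over a single multi-class model.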
TRAINING SPEECH PROCESSING MODELS USING PSEUDO TOKENS
A speech processing model may be trained using pseudo tokens. Training a speech processing model with pseudo tokens may allow for training with a smaller amount of labeled training data and accordingly lower costs. A set of pseudo tokens may be determined by computing feature vectors from unlabeled training data, clustering the feature vectors, and performing token compression using the clustered feature vectors. A first speech processing model may be trained using unlabeled training data by determining sequences of pseudo tokens corresponding to the unlabeled training data. A second speech processing model may be initialized using the first speech processing model and then trained using labeled training data. The second speech processing model may then be deployed to a speech processing application.
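The pseudo-token pipeline described here (compute feature vectors, cluster them, compress runs of repeated cluster ids into tokens) can be sketched with plain k-means. The feature vectors below are synthetic 2-D points standing in for real acoustic features; the clustering and compression steps follow the abstract, but everything else is an illustrative assumption.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    # Plain k-means over feature vectors computed from unlabeled audio;
    # each resulting centroid defines one pseudo token.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[nearest].append(v)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(dim) / len(cl) for dim in zip(*cl)]
    return centroids

def pseudo_token_sequence(frames, centroids):
    # Map each frame to its nearest centroid id, then compress runs of
    # the same id into a single pseudo token (token compression).
    ids = [min(range(len(centroids)), key=lambda c: sq_dist(f, centroids[c]))
           for f in frames]
    compressed = []
    for t in ids:
        if not compressed or compressed[-1] != t:
            compressed.append(t)
    return compressed
```

The resulting pseudo-token sequences serve as free "labels" for pretraining the first model, so the expensive human-labeled data is only needed for the second, fine-tuned model.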
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING SYSTEM, AND INFORMATION PROCESSING METHOD
An information processing apparatus (100) includes an acquisition unit (132) that acquires, from a storage unit (120) that stores episode data (D1) of a speaker, the episode data (D1) regarding topic information included in utterance data (D30) of the speaker, and an interaction control unit (133) that controls an interaction with the speaker so as to include an episode based on the episode data (D1).
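The retrieval flow this abstract claims (an acquisition unit pulls episode data (D1) matching topic information in the speaker's utterance (D30), and an interaction control unit folds the episode into the reply) could be sketched roughly as below. The episode store, topic-matching rule, and response wording are all hypothetical placeholders.

```python
# Toy stand-in for the storage unit (120): episode data keyed by topic.
EPISODE_STORE = {
    "travel": "you mentioned visiting Kyoto last spring",
    "food": "you said you enjoyed making curry last week",
}

def acquire_episode(utterance: str, store: dict):
    # Acquisition unit: look up episode data for any topic word that
    # appears in the speaker's utterance.
    for topic, episode in store.items():
        if topic in utterance.lower():
            return episode
    return None

def control_interaction(utterance: str) -> str:
    # Interaction control unit: include the retrieved episode in the
    # response when one is found.
    episode = acquire_episode(utterance, EPISODE_STORE)
    if episode is None:
        return "Tell me more."
    return f"Speaking of that, {episode}."
```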
Abnormality degree calculation system and method
An abnormality degree calculation system includes: a feature amount vector extraction unit configured to generate and output a feature amount vector from an input signal originating from vibration of a target device; an encoding unit configured to receive as an input a set composed of the feature amount vector and a device type vector representing a type of the target device and to output an encoding vector; a decoding unit configured to receive as an input the encoding vector and the device type vector and to output a decoding vector; a learning unit configured to learn parameters of neural networks of the encoding unit and the decoding unit; and an abnormality degree calculation unit configured to calculate a degree of abnormality defined as a function of the feature amount vector from the feature amount vector extraction unit, the encoding vector from the encoding unit, and the decoding vector from the decoding unit.
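The conditioning trick in this abstract (both encoder and decoder receive the device type vector alongside their main input, and the abnormality degree is a function of the feature, encoding, and decoding vectors) can be sketched with tiny linear maps. In the claimed system the weights are learned by the learning unit; the identity-like weights and the squared-error score below are illustrative assumptions only.

```python
def one_hot(index, n):
    # Device type vector as a one-hot encoding.
    return [1.0 if i == index else 0.0 for i in range(n)]

def linear(rows, x):
    # A minimal linear layer: matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def encode(feature, device_vec, w_enc):
    # The encoder consumes the feature amount vector concatenated with
    # the device type vector, so one model conditions on device type.
    return linear(w_enc, feature + device_vec)

def decode(code, device_vec, w_dec):
    # The decoder likewise receives the encoding plus the device type.
    return linear(w_dec, code + device_vec)

def abnormality_degree(feature, decoded):
    # One plausible abnormality function: squared reconstruction error
    # between the input feature vector and the decoded vector.
    return sum((f - d) ** 2 for f, d in zip(feature, decoded))
```

With learned weights, signals typical for a device type reconstruct well (low degree of abnormality), while atypical vibrations reconstruct poorly and score high.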
Inverted Projection for Robust Speech Translation
The technology provides an approach to train translation models that are robust to transcription errors and punctuation errors. The approach includes introducing errors from actual automatic speech recognition and automatic punctuation systems into the source side of the machine translation training data. A method for training a machine translation model includes performing automatic speech recognition on input source audio to generate a system transcript. The method aligns a human transcript of the source audio to the system transcript, including projecting system segmentation onto the human transcript. Then the method performs segment robustness training of a machine translation model according to the aligned human and system transcripts, and performs system robustness training of the machine translation model, e.g., by injecting token errors into training data.
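The system-robustness step (injecting token errors into the source side of the training data) can be sketched as a simple corruption pass. The three error modes below (deletion, repetition, punctuation stripping) are illustrative choices mimicking ASR and automatic-punctuation mistakes, not the specific error model of the patent.

```python
import random

def inject_token_errors(tokens, error_rate, rng):
    # Randomly delete a token, duplicate it, or strip its punctuation,
    # mimicking ASR and automatic-punctuation errors on the source side.
    out = []
    for tok in tokens:
        r = rng.random()
        if r < error_rate / 3:
            continue                       # simulated ASR deletion
        if r < 2 * error_rate / 3:
            out.extend([tok, tok])         # simulated ASR repetition
        elif r < error_rate:
            out.append(tok.strip(".,!?"))  # simulated punctuation error
        else:
            out.append(tok)                # token passed through unchanged
    return out
```

Training the translation model on source sentences corrupted this way, paired with clean targets, teaches it to produce good translations even when the deployed ASR front end makes the same kinds of mistakes.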
Deep learning models for speech recognition
Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows for a large amount of varied data for training to be efficiently obtained. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
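One simple form of the data synthesis alluded to above is overlaying recorded noise on clean speech at a controlled signal-to-noise ratio, multiplying the amount of varied training data. The sketch below shows only that SNR-scaled mixing idea; the actual synthesis techniques of the embodiments are not specified here.

```python
import math

def mix_noise(speech, noise, snr_db):
    # Overlay a noise track on clean speech at a target SNR in dB:
    # scale the noise so power(speech) / power(scaled noise) hits the
    # requested ratio, then add the two sample-by-sample.
    def power(x):
        return sum(s * s for s in x) / len(x)
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` over a range of values during training exposes the network to conditions from nearly clean to heavily degraded, which is what lets it learn noise robustness directly rather than via hand-designed components.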
SYSTEMS AND METHODS FOR FACILITATING INTEGRATIVE, EXTENSIBLE, COMPOSABLE, AND INTERPRETABLE DEEP LEARNING
Some disclosed systems are configured to obtain a knowledge module configured to receive one or more knowledge inputs corresponding to one or more different modalities and generate a set of knowledge embeddings to be integrated with a set of multi-modal embeddings generated by a multi-modal main model. The systems receive a knowledge input at the knowledge module, identify a knowledge type associated with the knowledge input, and extract a knowledge unit from the knowledge input. The systems select a representation model that corresponds to the knowledge type and select a grounding type configured to ground the knowledge unit into the representation model. The systems then ground the knowledge unit into the representation model according to the grounding type.
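The dispatch the abstract describes (identify a knowledge type, extract a knowledge unit, select a representation model and a grounding type for that type, then ground the unit) could be sketched as a lookup-table pipeline. The knowledge types, models, and grounding labels here are invented placeholders, not the disclosed system's actual taxonomy.

```python
# Hypothetical registries keyed by knowledge type.
REPRESENTATION_MODELS = {
    "text": lambda unit: ("text-embedding", unit),
    "graph": lambda unit: ("graph-embedding", unit),
}
GROUNDING_TYPES = {"text": "span-grounding", "graph": "node-grounding"}

def identify_knowledge_type(knowledge_input):
    # Toy rule: treat dict-like inputs as graph knowledge, else text.
    return "graph" if isinstance(knowledge_input, dict) else "text"

def extract_knowledge_unit(knowledge_input, ktype):
    # Toy extraction: first node of a graph, or first token of text.
    if ktype == "graph":
        return next(iter(knowledge_input))
    return knowledge_input.split()[0]

def ground(knowledge_input):
    # Select the representation model and grounding type for the
    # identified knowledge type, then ground the extracted unit.
    ktype = identify_knowledge_type(knowledge_input)
    unit = extract_knowledge_unit(knowledge_input, ktype)
    model = REPRESENTATION_MODELS[ktype]
    return GROUNDING_TYPES[ktype], model(unit)
```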
DYNAMIC BOUNDARY CREATION FOR VOICE COMMAND AUTHENTICATION
A computer-implemented method executes voice commands issued from within a command boundary. The method includes defining a command boundary for a VCD, where the command boundary is based on receiving an input from a user. The method further includes receiving, from the user and by the VCD, a voice command. The method also includes determining an origination location of the voice command. The method includes classifying the voice command into a command category. The method further includes executing the voice command in response to determining the origination location is within the command boundary for the VCD. The method also includes storing a set of data for the voice command.
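The gating logic of the claimed method (classify the command, store its data, and execute only when the origination location falls inside the command boundary) can be sketched directly. The circular boundary, the keyword-based category rule, and the return strings are illustrative assumptions; the patent's boundary may take any user-defined shape.

```python
import math

def within_boundary(origin, center, radius):
    # A circular command boundary around the VCD, defined from user input
    # as a center point and radius (one possible boundary shape).
    return math.dist(origin, center) <= radius

def handle_command(command, origin, center, radius, log):
    # Classify the voice command into a category, store its data, and
    # execute only if it originated inside the command boundary.
    category = "control" if command.split()[0] in {"turn", "set", "open"} else "query"
    log.append((command, category, origin))
    if within_boundary(origin, center, radius):
        return f"executed: {command}"
    return "rejected: origination outside command boundary"
```

Note that the command's data is stored regardless of outcome, matching the abstract's separate storing step, while execution is conditional on the location check.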
SYSTEM AND METHOD FOR GENERATING WRAP UP INFORMATION
A system for generating wrap-up information is capable of learning how interactions are transformed into contact notes and outcome codes using natural language processing and can generate the contact notes and outcome codes for new incoming interactions by applying prediction models trained on interaction data, contact notes and outcome codes. The system for generating wrap-up information receives interaction data, including interaction audio data, interaction transcripts, associated contact notes and associated outcome codes. The interaction transcripts are generated from previous interactions between agents and customers. The contact notes and outcome codes are generated by agents during the associated previous interactions. The system processes and uses the interaction data to train prediction models to analyze interaction audio data and interaction transcripts and predict appropriate contact notes and outcome codes for the interaction. Once trained, the prediction model(s) can generate appropriate contact notes and outcome codes for new interactions.
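The train-then-predict loop for outcome codes can be sketched with a simple word-overlap scorer. Real deployments would use proper NLP models over transcripts and audio; the frequency-counting "model" and the example codes below are illustrative stand-ins only, and contact-note generation is omitted.

```python
from collections import Counter, defaultdict

def train(history):
    # Build per-outcome-code word frequencies from past interactions,
    # given (transcript, outcome_code) pairs: a toy stand-in for
    # training the prediction models on interaction data.
    model = defaultdict(Counter)
    for transcript, code in history:
        model[code].update(transcript.lower().split())
    return model

def predict_outcome(model, transcript):
    # Score each outcome code by how much of its vocabulary the new
    # transcript shares, and return the best-scoring code.
    words = transcript.lower().split()
    def score(code):
        counts = model[code]
        return sum(counts[w] for w in words) / sum(counts.values())
    return max(model, key=score)
```

Automating this wrap-up step is the point of the system: the agent's post-call note-taking time shrinks to reviewing a suggested note and code rather than writing them from scratch.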