G10L15/063

TRAINING SPEECH PROCESSING MODELS USING PSEUDO TOKENS

A speech processing model may be trained using pseudo tokens. Training a speech processing model with pseudo tokens may allow for training with a smaller amount of labeled training data and accordingly lower costs. A set of pseudo tokens may be determined by computing feature vectors from unlabeled training data, clustering the feature vectors, and performing token compression using the clustered feature vectors. A first speech processing model may be trained using unlabeled training data by determining sequences of pseudo tokens corresponding to the unlabeled training data. A second speech processing model may be initialized using the first speech processing model and then trained using labeled training data. The second speech processing model may then be deployed to a speech processing application.
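The pseudo-token pipeline above (compute feature vectors, cluster them, compress the resulting token sequences) can be sketched as follows. This is a minimal illustration, not the patent's implementation: `assign_pseudo_tokens` and `compress_tokens` are hypothetical names, and the centroids stand in for the output of a clustering step (e.g., k-means over frame-level features).

```python
import numpy as np

def assign_pseudo_tokens(features, centroids):
    """Map each frame-level feature vector to the ID of its nearest cluster centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1).tolist()

def compress_tokens(tokens):
    """Token compression: collapse consecutive repeats, since adjacent frames
    assigned to the same cluster typically belong to one acoustic unit."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

# Toy example: 2-D features and two cluster centroids.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.1], [1.0, 0.8], [0.1, 0.1]])
tokens = assign_pseudo_tokens(features, centroids)   # [0, 0, 1, 1, 0]
print(compress_tokens(tokens))                       # [0, 1, 0]
```

The compressed sequence is what the first (unlabeled-data) model would be trained to predict in place of human transcripts.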

Abnormality degree calculation system and method
11710500 · 2023-07-25

An abnormality degree calculation system includes: a feature amount vector extraction unit configured to generate and output a feature amount vector from an input signal originating from vibration of a target device; an encoding unit configured to receive as an input a set composed of the feature amount vector and a device type vector representing a type of the target device and output an encoding vector; a decoding unit configured to receive as an input the encoding vector and the device type vector and output a decoding vector; a learning unit configured to learn parameters of the neural networks of the encoding unit and the decoding unit; and an abnormality degree calculation unit configured to calculate a degree of abnormality defined as a function of the feature amount vector from the feature amount vector extraction unit, the encoding vector from the encoding unit, and the decoding vector from the decoding unit.

Inverted Projection for Robust Speech Translation
20230021824 · 2023-01-26

The technology provides an approach to train translation models that are robust to transcription errors and punctuation errors. The approach includes introducing errors from actual automatic speech recognition and automatic punctuation systems into the source side of the machine translation training data. A method for training a machine translation model includes performing automatic speech recognition on input source audio to generate a system transcript. The method aligns a human transcript of the source audio to the system transcript, including projecting system segmentation onto the human transcript. Then the method performs segment robustness training of a machine translation model according to the aligned human and system transcripts, and performs system robustness training of the machine translation model, e.g., by injecting token errors into training data.
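The system-robustness step (injecting token errors into the source side of the training data) can be sketched as below. This is a generic illustration of ASR-style corruption, not the patent's actual error model; the function name and the drop/duplicate/substitute operations are assumptions.

```python
import random

def inject_token_errors(tokens, error_rate=0.1, vocab=None, seed=None):
    """Randomly drop, duplicate, or substitute source-side tokens so the
    translation model sees transcripts with ASR-like mistakes during training."""
    rng = random.Random(seed)
    vocab = vocab or tokens  # fall back to in-sentence substitutions
    out = []
    for tok in tokens:
        if rng.random() < error_rate:
            op = rng.choice(["drop", "dup", "sub"])
            if op == "drop":
                continue           # simulate a deletion error
            elif op == "dup":
                out.extend([tok, tok])  # simulate a repetition error
            else:
                out.append(rng.choice(vocab))  # simulate a substitution error
        else:
            out.append(tok)
    return out

clean = "the cat sat on the mat".split()
noisy = inject_token_errors(clean, error_rate=0.3, seed=0)
```

The target side stays untouched, so the model learns to produce clean translations from corrupted input.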

Deep learning models for speech recognition

Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allow a large amount of varied training data to be obtained efficiently. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
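One common data-synthesis technique of the kind described (the specific methods in the abstract are not detailed there, so this is an assumed example) is overlaying noise clips on clean speech at a target signal-to-noise ratio, multiplying the variety of training data cheaply:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Overlay a noise clip on clean speech at the given SNR (in dB),
    looping/cropping the noise to match the clean signal's length."""
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_clean / p_scaled_noise hits the target SNR.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

t = np.linspace(0.0, 1.0, 8000)
clean = np.sin(2 * np.pi * 440 * t)                    # 1 s of a 440 Hz tone
noise = np.random.default_rng(0).normal(size=4000)     # 0.5 s noise clip, looped
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```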

SYSTEMS AND METHODS FOR FACILITATING INTEGRATIVE, EXTENSIBLE, COMPOSABLE, AND INTERPRETABLE DEEP LEARNING

Some disclosed systems are configured to obtain a knowledge module configured to receive one or more knowledge inputs corresponding to one or more different modalities and generate a set of knowledge embeddings to be integrated with a set of multi-modal embeddings generated by a multi-modal main model. The systems receive a knowledge input at the knowledge module, identify a knowledge type associated with the knowledge input, and extract a knowledge unit from the knowledge input. The systems select a representation model that corresponds to the knowledge type and select a grounding type configured to ground the knowledge unit into the representation model. The systems then ground the knowledge unit into the representation model according to the grounding type.
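The routing described above (identify a knowledge type, pick a representation model and a grounding type, then ground) amounts to a dispatch table keyed on knowledge type. A minimal sketch, in which every name, type label, and grounding strategy is invented for illustration:

```python
def identify_knowledge_type(knowledge_input):
    # Hypothetical rule: structured triples are "graph" knowledge, else "text".
    if isinstance(knowledge_input, dict) and "triples" in knowledge_input:
        return "graph"
    return "text"

# Placeholder representation models: real systems would return embeddings.
REPRESENTATION_MODELS = {
    "graph": lambda unit: ("graph-embedding", unit),
    "text": lambda unit: ("text-embedding", unit),
}

GROUNDING_TYPES = {"graph": "node-linking", "text": "span-alignment"}

def ground(knowledge_input):
    """Identify the type, extract the unit, select model and grounding type,
    then ground the unit into the chosen representation model."""
    ktype = identify_knowledge_type(knowledge_input)
    unit = knowledge_input["triples"] if ktype == "graph" else knowledge_input
    model = REPRESENTATION_MODELS[ktype]
    return {"type": ktype, "grounding": GROUNDING_TYPES[ktype], "embedding": model(unit)}
```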

DYNAMIC BOUNDARY CREATION FOR VOICE COMMAND AUTHENTICATION

A computer-implemented method executes voice commands issued from within a command boundary. The method includes defining a command boundary for a voice command device (VCD), where the command boundary is based on receiving an input from a user. The method further includes receiving, from the user and by the VCD, a voice command. The method also includes determining an origination location of the voice command. The method includes classifying the voice command into a command category. The method further includes executing the voice command in response to determining the origination location is within the command boundary for the VCD. The method also includes storing a set of data for the voice command.
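The core gating step (execute only when the origination location falls inside the boundary) can be sketched as below. For simplicity this assumes the boundary is a circle around the device, which is one possible shape of the user-defined boundary, not necessarily the patent's:

```python
import math

def within_command_boundary(origin, center, radius_m):
    """Treat the command boundary as a circle of radius radius_m (meters)
    around the device; test whether the command's origin falls inside it."""
    dx, dy = origin[0] - center[0], origin[1] - center[1]
    return math.hypot(dx, dy) <= radius_m

def handle_voice_command(origin, command, center=(0.0, 0.0), radius_m=3.0):
    # Execute only when the origination location is inside the boundary.
    if within_command_boundary(origin, center, radius_m):
        return f"executing: {command}"
    return "ignored: outside command boundary"

print(handle_voice_command((1.0, 2.0), "lights on"))   # executing: lights on
print(handle_voice_command((5.0, 5.0), "lights on"))   # ignored: outside command boundary
```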

SYSTEM AND METHOD FOR GENERATING WRAP UP INFORMATION

A system for generating wrap-up information is capable of learning how interactions are transformed into contact notes and outcome codes using natural language processing and can generate the contact notes and outcome codes for new incoming interactions by applying prediction models trained on interaction data, contact notes and outcome codes. The system for generating wrap-up information receives interaction data, including interaction audio data, interaction transcripts, associated contact notes and associated outcome codes. The interaction transcripts are generated from the previous interactions between agents and customers. The contact notes and outcome codes are generated by agents during the associated previous interactions. The system processes and uses the interaction data to train prediction models to analyze interaction audio data and interaction transcripts and predict appropriate contact notes and outcome codes for the interaction. Once trained, the prediction model(s) can generate appropriate contact notes and outcome codes for new interactions.
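The prediction step (transcript in, contact note and outcome code out) can be illustrated with a deliberately simple stand-in for the trained model: nearest historical transcript by bag-of-words overlap. A real system would train a classifier or sequence model; the names and toy data here are invented.

```python
from collections import Counter

def bow(text):
    # Bag-of-words representation of a transcript.
    return Counter(text.lower().split())

def predict_wrapup(new_transcript, training_data):
    """training_data: list of (transcript, contact_note, outcome_code) from
    previous agent-customer interactions. Returns the note and code of the
    most word-overlapping historical transcript."""
    def overlap(a, b):
        return sum((a & b).values())
    target = bow(new_transcript)
    best = max(training_data, key=lambda row: overlap(bow(row[0]), target))
    return best[1], best[2]

history = [
    ("customer asked to cancel the subscription", "cancellation requested", "CANCEL"),
    ("customer reported billing error on invoice", "billing issue logged", "BILLING"),
]
print(predict_wrapup("I want to cancel my subscription today", history))
# ('cancellation requested', 'CANCEL')
```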

EXTRACTING ENGAGING QUESTIONS FROM A COMMUNICATION SESSION

Methods and systems provide for extracting engaging questions from a communication session. In one embodiment, the system connects to a communication session with a number of participants; receives a transcript of a conversation between the participants produced during the communication session; extracts, from the transcript, utterances including one or more sentences spoken by the participants; identifies a subset of the utterances spoken by a subset of the participants associated with a prespecified organization; extracts engaging questions within the subset of utterances, the engaging questions each including a question asked by the participant associated with the organization that is immediately answered in the following utterance by a participant not associated with the organization; and presents, for display at one or more client devices, data corresponding to the extracted engaging questions.
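The extraction rule above (a question from an organization-affiliated participant that the very next utterance, from a non-affiliated participant, answers) can be sketched directly. Detecting questions by a trailing "?" is a simplifying assumption; a production system would use a classifier.

```python
def extract_engaging_questions(utterances, org_members):
    """utterances: list of (speaker, text) in conversation order.
    An engaging question is asked by an org member and immediately
    answered in the next utterance by a non-member."""
    engaging = []
    for cur, nxt in zip(utterances, utterances[1:]):
        asker, text = cur
        responder, _ = nxt
        if (asker in org_members
                and text.rstrip().endswith("?")
                and responder not in org_members):
            engaging.append(text)
    return engaging

convo = [
    ("alice", "What challenges are you facing with onboarding?"),
    ("bob", "Mostly the manual data entry."),
    ("alice", "We should sync next week."),
    ("bob", "Sounds good."),
]
print(extract_engaging_questions(convo, org_members={"alice"}))
# ['What challenges are you facing with onboarding?']
```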

TALKING SPEED ANALYSIS PER TOPIC SEGMENT IN A COMMUNICATION SESSION

Methods and systems provide for presenting the results of talking speed analysis per topic segment in a communication session. In one embodiment, the system connects to a communication session with a number of participants; receives a transcript of a conversation between the participants produced during the communication session, the transcript including timestamps for each utterance of a speaking participant; determines, based on analysis of the transcript, a meeting type for the communication session; generates, based on the meeting type, a number of topic segments for the conversation and respective timestamps for the topic segments; for each topic segment, determines, based on analysis of the transcript and the timestamps for each utterance in the topic segment, a word count per unit of time for each speaking participant associated with a prespecified organization; and presents, to one or more client devices, data relating to the word count per unit of time for speaking participants within each topic segment and across all topic segments.
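The per-segment computation (bucket each utterance into its topic segment by timestamp, then count words per unit of time for organization-affiliated speakers) can be sketched as follows. The data shapes and function name are illustrative assumptions.

```python
def words_per_minute(transcript, segments, org_members):
    """transcript: list of (speaker, start_sec, text);
    segments: list of (label, start_sec, end_sec).
    Returns {segment label: {speaker: words per minute}} for org members."""
    results = {}
    for label, seg_start, seg_end in segments:
        counts = {}
        for speaker, start, text in transcript:
            if speaker in org_members and seg_start <= start < seg_end:
                counts[speaker] = counts.get(speaker, 0) + len(text.split())
        minutes = (seg_end - seg_start) / 60.0
        results[label] = {s: c / minutes for s, c in counts.items()}
    return results

transcript = [
    ("alice", 0, "welcome everyone to the quarterly review"),
    ("bob", 30, "thanks happy to be here"),
    ("alice", 70, "first item is the roadmap"),
]
segments = [("intro", 0, 60), ("roadmap", 60, 120)]
print(words_per_minute(transcript, segments, org_members={"alice"}))
# {'intro': {'alice': 6.0}, 'roadmap': {'alice': 5.0}}
```

Summing the per-segment counts gives the cross-segment figure the abstract also reports.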

APPROACHES TO GENERATING STUDIO-QUALITY RECORDINGS THROUGH MANIPULATION OF NOISY AUDIO
20230230610 · 2023-07-20

Introduced here are computer programs and associated computer-implemented techniques for manipulating noisy audio signals to produce clean audio signals that are sufficiently high quality so as to be largely, if not entirely, indistinguishable from “rich” recordings generated by recording studios. When a noisy audio signal is obtained by a media production platform, the noisy audio signal can be manipulated to sound as if recording occurred with sophisticated equipment in a soundproof environment. Manipulation can be performed by a model that, when applied to the noisy audio signal, can manipulate its characteristics so as to emulate the characteristics of clean audio signals that are learned through training.
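As a point of comparison for the learned manipulation described above, a classical spectral-gating pass shows the general shape of the operation: estimate which parts of the spectrum are noise and suppress them. The patent's model learns this mapping from data; the fixed threshold below is a simplifying stand-in.

```python
import numpy as np

def spectral_gate(noisy, frame=256, gate_db=-30.0):
    """Zero out frequency bins whose magnitude falls more than gate_db below
    each frame's peak, then resynthesize. A trained model would instead
    predict the suppression mask from examples of clean audio."""
    out = np.zeros_like(noisy)
    for i in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[i : i + frame])
        mag = np.abs(spec)
        thresh = mag.max() * 10 ** (gate_db / 20)
        spec[mag < thresh] = 0.0              # suppress low-energy (noise) bins
        out[i : i + frame] = np.fft.irfft(spec, n=frame)
    return out

n = np.arange(256)
clean = np.sin(2 * np.pi * 8 * n / 256)       # a pure tone, one FFT bin
noisy = clean + 1e-4 * np.random.default_rng(0).normal(size=256)
denoised = spectral_gate(noisy)               # close to the clean tone
```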