G10L15/1807

Automated predictive analysis and modification of user interaction features using multiple classification models
11210607 · 2021-12-28

Methods and apparatuses are described for automated predictive analysis of user interactions to determine a modification based upon competing classification models. A server computing device receives first encoded text for prior user interactions and trains a plurality of classification models using the first text. The server determines a prediction cost for each of the models based upon the training. The server receives second encoded text for a current user interaction and executes the trained models using the second text to generate a prediction vector for each model that maximizes user engagement. The server selects one of the models based upon the prediction vectors, identifies a communication feature of the model, generates a user interaction modification, and transmits the user interaction modification to a client computing device.
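The selection step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy "trained models" are stand-in callables, and using the peak of the prediction vector as the engagement score is an assumption.

```python
def select_model(models, encoded_text):
    """Run each trained model on the current interaction's encoded text
    and return the model whose prediction vector scores highest."""
    best_model, best_score = None, float("-inf")
    for model in models:
        vector = model(encoded_text)   # prediction vector for this model
        score = max(vector)            # engagement proxy (an assumption)
        if score > best_score:
            best_model, best_score = model, score
    return best_model

# Toy stand-ins for trained classification models.
model_a = lambda text: [0.2, 0.5, 0.3]
model_b = lambda text: [0.1, 0.1, 0.8]

chosen = select_model([model_a, model_b], "encoded interaction")
print(chosen is model_b)  # True: model_b's vector peaks higher
```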

Coach-Assist Controller for Customer Service Representative (CSR) Interactions
20210390491 · 2021-12-16

This disclosure describes techniques that allow a coach-assist controller to provide coach support to a customer service representative (CSR) during an ongoing consumer-CSR interaction. The coach-assist controller may intercept a consumer-CSR interaction and generate corresponding interaction data. The coach-assist controller may further analyze the interaction data to infer a current state of the consumer-CSR interaction, and in doing so, determine whether to request coach support for the CSR.
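A rough sketch of the infer-state-then-decide flow described above. The state names, signals, and thresholds are hypothetical; a real controller would infer state from richer interaction data.

```python
def infer_state(interaction):
    """Infer a coarse state of the consumer-CSR interaction
    from simple signals (names and thresholds are illustrative)."""
    if interaction["consumer_sentiment"] < -0.5:
        return "escalating"
    if interaction["silence_sec"] > 10:
        return "stalled"
    return "nominal"

def needs_coach(state):
    """Request coach support only for problematic states."""
    return state in {"escalating", "stalled"}

state = infer_state({"consumer_sentiment": -0.8, "silence_sec": 2})
print(state, needs_coach(state))  # escalating True
```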

Method of generating estimated value of local inverse speaking rate (ISR) and device and method of generating predicted value of local ISR accordingly

A method is disclosed. The proposed method includes: providing an initial speech corpus including plural utterances; based on a maximum a posteriori (MAP) condition, according to the respective sequences of syllable duration, syllable duration prosodic state, syllable tone, base-syllable type, and break type of the k-th utterance, using a probability of an ISR of the k-th utterance x_k to estimate an estimated value x̂_k of x_k; and through the MAP condition, according to the respective sequences of syllable duration, syllable duration prosodic state, syllable tone, base-syllable type, and break type of the given l-th breath group/prosodic phrase group (BG/PG) of the k-th utterance, using a probability of an ISR of the l-th BG/PG of the k-th utterance x_{k,l} to estimate an estimated value x̂_{k,l} of x_{k,l}, wherein x̂_{k,l} is the estimated value of local ISR, and the mean of the prior probability model of x̂_{k,l} is x̂_k.
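The hierarchical structure above (the utterance-level estimate x̂_k serving as the prior mean for each local BG/PG estimate x̂_{k,l}) can be illustrated with a one-dimensional Gaussian MAP estimate. The Gaussian assumption and the variance values are illustrative, not taken from the patent.

```python
def map_estimate(prior_mean, prior_var, obs, obs_var):
    """MAP (= posterior mean) for a Gaussian observation with a
    Gaussian prior: a precision-weighted average of prior and data."""
    w = prior_var / (prior_var + obs_var)
    return prior_mean + w * (obs - prior_mean)

x_hat_k = 4.0        # utterance-level ISR estimate (prior mean)
obs_local = 6.0      # raw local-rate observation for one BG/PG
x_hat_kl = map_estimate(x_hat_k, prior_var=0.25, obs=obs_local, obs_var=0.25)
print(x_hat_kl)  # 5.0: shrunk halfway toward the utterance-level prior
```

With equal prior and observation variances the local estimate lands midway between the raw observation and the utterance-level prior; a tighter prior would pull it further toward x̂_k.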

System and method for cross-speaker style transfer in text-to-speech and training data generation

Systems are configured for generating spectrogram data characterized by the voice timbre of a target speaker and the prosody style of a source speaker by converting a waveform of source speaker data to phonetic posteriorgram (PPG) data, extracting additional prosody features from the source speaker data, and generating a spectrogram based on the PPG data and the extracted prosody features. The systems are configured to utilize and train a machine learning model for generating spectrogram data and for training a neural text-to-speech model with the generated spectrogram data.
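The three-stage pipeline described above can be sketched structurally. All function bodies here are toy stand-ins; in the described system each stage would be a trained neural model.

```python
def to_ppg(waveform):
    """Speaker-independent phonetic posteriorgram per frame (toy stand-in)."""
    return [frame % 5 for frame in waveform]

def extract_prosody(waveform):
    """Source-speaker prosody features, e.g. pitch/energy (toy stand-in)."""
    return [frame * 0.1 for frame in waveform]

def to_spectrogram(ppg, prosody, target_timbre):
    """Combine content (PPG), style (prosody), and the target voice timbre."""
    return [(p, s, target_timbre) for p, s in zip(ppg, prosody)]

source = [3, 7, 11]
spec = to_spectrogram(to_ppg(source), extract_prosody(source), "target_voice")
print(len(spec) == len(source))  # True: one spectrogram frame per input frame
```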

APPLICATIONS AND SERVICES FOR ENHANCED PROSODY INSTRUCTION

Systems, methods, and software are disclosed herein that improve the instruction of prosody in the context of software applications and services. In various implementations, a service analyzes an audio recording of a user reading text aloud to determine the prosody of the reading. The service provides data to an application indicative of the prosody, as well as a reference prosody for the text. The application may then display a visualization comparing the user's prosody for the text to the reference prosody for the text, for consumption by users, e.g., a teacher or the reader.
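The comparison the application visualizes could be as simple as per-syllable pitch deviation. This sketch assumes aligned per-syllable pitch values in Hz; the alignment and feature choice are assumptions.

```python
def prosody_deviation(user_pitch, reference_pitch):
    """Per-syllable difference between the user's and the reference
    pitch contour (Hz), suitable for plotting in a visualization."""
    return [u - r for u, r in zip(user_pitch, reference_pitch)]

user = [180, 220, 210, 170]        # reader's pitch per syllable
reference = [175, 240, 200, 160]   # reference prosody for the same text
print(prosody_deviation(user, reference))  # [5, -20, 10, 10]
```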

Templated rule-based data augmentation for intent extraction

An agent automation system includes a memory configured to store a natural language understanding (NLU) framework and a model, wherein the model includes at least one original meaning representation. The system includes a processor configured to execute instructions of the NLU framework to cause the agent automation system to perform actions including: performing rule-based generalization of the model to generate at least one generalized meaning representation of the model from the at least one original meaning representation of the model; performing rule-based refinement of the model to prune or modify the at least one generalized meaning representation of the model, or the at least one original meaning representation of the model, or a combination thereof; and after performing the rule-based generalization and the rule-based refinement of the model, using the model to extract intents/entities from a received user utterance.
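The generalize-then-refine sequence can be sketched on a toy meaning representation. The template, the synonym rule, and the pruning rule are all illustrative stand-ins for the patent's rule-based generalization and refinement.

```python
original = ("book", "flight", "to", "CITY")  # toy meaning representation

def generalize(rep):
    """Toy generalization rule: expand the verb slot into synonyms,
    producing generalized meaning representations."""
    return [(verb,) + rep[1:] for verb in ("book", "reserve", "schedule")]

def refine(reps, banned_verbs=("schedule",)):
    """Toy refinement rule: prune generalized forms using a banned verb."""
    return [r for r in reps if r[0] not in banned_verbs]

# Generalization first, then refinement, as described above.
model = refine(generalize(original))
print(model)  # [('book', 'flight', 'to', 'CITY'), ('reserve', 'flight', 'to', 'CITY')]
```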

SYSTEMS AND METHODS FOR DETERMINING WHETHER TO TRIGGER A VOICE CAPABLE DEVICE BASED ON SPEAKING CADENCE
20230267921 · 2023-08-24

Systems and methods are described for determining whether to activate a voice activated device based on a speaking cadence of the user. When the user speaks with a first cadence, the system may determine that the user does not intend to activate the device and may accordingly not trigger the voice activated device. When the user speaks with a second cadence, the system may determine that the user does wish to trigger the device and may accordingly trigger the voice activated device.
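A minimal sketch of the cadence test, assuming cadence is measured in syllables per second and that a slower, more deliberate cadence signals intent to address the device. The threshold value is hypothetical.

```python
def should_trigger(syllables, duration_sec, max_cadence=3.5):
    """Trigger the voice activated device only when the measured
    speaking cadence falls at or below the threshold."""
    cadence = syllables / duration_sec
    return cadence <= max_cadence

print(should_trigger(syllables=6, duration_sec=2.0))   # True: 3.0 syll/s
print(should_trigger(syllables=12, duration_sec=2.0))  # False: 6.0 syll/s
```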

Information processing apparatus and learning method
11335337 · 2022-05-17

An information processing apparatus includes a memory and a processor coupled to the memory, the processor configured to: generate phoneme string information in which a plurality of phonemes included in voice information are arranged in time series, based on a recognition result of the phonemes for the voice information; and learn parameters of a network such that, when the phoneme string information is input to the network, the output information from the network approaches correct answer information indicating whether a predetermined conversation situation is included in the voice information corresponding to the phoneme string information.
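The learning objective above can be illustrated with a deliberately tiny pure-Python "network": a per-phoneme weight and a sigmoid output trained toward the 0/1 correct-answer label. The model, features, and hyperparameters are toy assumptions, not the patent's architecture.

```python
import math

def predict(weights, phoneme_string):
    """Network output: sigmoid over summed per-phoneme weights."""
    score = sum(weights.get(p, 0.0) for p in phoneme_string)
    return 1.0 / (1.0 + math.exp(-score))

def train(samples, epochs=200, lr=0.5):
    """Learn weights so the output approaches each sample's label."""
    weights = {}
    for _ in range(epochs):
        for phonemes, label in samples:
            err = label - predict(weights, phonemes)
            for p in phonemes:
                weights[p] = weights.get(p, 0.0) + lr * err
    return weights

# Label 1 = the "conversation situation" is present in the voice data.
data = [(["a", "r", "i"], 1), (["o", "k", "u"], 0)]
w = train(data)
print(predict(w, ["a", "r", "i"]) > 0.9)  # True
print(predict(w, ["o", "k", "u"]) < 0.1)  # True
```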

Systems and methods for classification and rating of calls based on voice and text analysis

Methods and systems include sending recording data of a call to a first server and a second server, wherein the recording data includes a first voice of a first participant of the call and a second voice of a second participant of the call; receiving, from the first server, a first emotion score representing a degree of a first emotion associated with the first voice, and a second emotion score representing a degree of a second emotion associated with the first voice; receiving, from the second server, a first sentiment score, a second sentiment score, and a third sentiment score; determining a quality score and classification data for the recording data based on the first emotion score, the second emotion score, the first sentiment score, the second sentiment score, and the third sentiment score; and outputting the quality score and the classification data for visualization of the recording data.
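The scoring step can be sketched as below: the two emotion scores and three sentiment scores are folded into one quality score and a coarse classification. The equal weighting and the labels are illustrative assumptions, not the patent's formula.

```python
def rate_call(emotion_scores, sentiment_scores):
    """Combine emotion and sentiment scores into a quality score
    and classification data (weights and labels are illustrative)."""
    scores = emotion_scores + sentiment_scores
    quality = sum(scores) / len(scores)
    label = "good" if quality >= 0.5 else "needs review"
    return quality, label

# Two emotion scores (first voice) and three sentiment scores.
quality, label = rate_call([0.75, 0.5], [0.5, 0.25, 0.5])
print(quality, label)  # 0.5 good
```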

Method and apparatus for generating caption

A method and apparatus for generating a caption are provided. The method of generating a caption according to one embodiment comprises: generating caption text which corresponds to a voice of a speaker included in broadcast data; generating reference voice information using a part of the voice of the speaker included in the broadcast data; and generating caption style information for the caption text based on the voice of the speaker and the reference voice information.
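One way the reference voice information might drive caption style is by comparing current voice features against the reference baseline. The features, thresholds, and style flags here are hypothetical.

```python
def caption_style(current, reference):
    """Derive caption style information by comparing the current voice
    features to the reference voice information (thresholds illustrative)."""
    style = {}
    style["bold"] = current["energy"] > 1.5 * reference["energy"]
    style["size"] = "large" if current["pitch"] > 1.3 * reference["pitch"] else "normal"
    return style

reference = {"energy": 0.4, "pitch": 120.0}  # built from earlier speech
shout = {"energy": 0.9, "pitch": 180.0}      # current segment
print(caption_style(shout, reference))  # {'bold': True, 'size': 'large'}
```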