Patent classifications
G10L15/1807
Intent recognition and emotional text-to-speech learning
An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
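As a rough illustration of applying an intent model to both the text results and the acoustic feature annotations, the following Python sketch scores intents with a hand-weighted linear model. The function name `recognize_intent`, the feature naming scheme, the `emotion` annotation, and the weights are all assumptions made for this example; the abstract does not specify how the intent model is built or trained.

```python
def recognize_intent(text, annotations, weights):
    """Score each intent from word features plus acoustic annotations."""
    # Text results contribute one feature per recognized word.
    features = {f"word:{w}" for w in text.lower().split()}
    # Acoustic feature annotations contribute one feature per key/value pair.
    features |= {f"acoustic:{k}={v}" for k, v in annotations.items()}
    scores = {
        intent: sum(intent_weights.get(f, 0.0) for f in features)
        for intent, intent_weights in weights.items()
    }
    # Return the highest-scoring intent.
    return max(scores, key=scores.get)


# Hypothetical hand-set weights; a real intent model would be trained.
WEIGHTS = {
    "request_call": {"word:call": 1.0},
    "complaint": {"word:call": 0.5, "acoustic:emotion=angry": 1.0},
}

print(recognize_intent("call support", {"emotion": "neutral"}, WEIGHTS))  # request_call
print(recognize_intent("call support", {"emotion": "angry"}, WEIGHTS))    # complaint
```

The same words map to different intents once the acoustic annotation changes, which is the point of feeding both inputs to the model.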
Automated social agent interaction quality monitoring and improvement
A system for monitoring and improving social agent interaction quality includes a computing platform having processing hardware and a system memory storing a software code. The processing hardware is configured to execute the software code to receive, from a social agent, interaction data describing an interaction of the social agent with a user, and to perform an assessment of the interaction, using the interaction data, as one of successful or including a flaw. When the assessment indicates that the interaction includes the flaw, the processing hardware is further configured to execute the software code to identify an interaction strategy for correcting the flaw, and to deliver, to the social agent, one or both of the assessment and the interaction strategy to correct the flaw in the interaction.
RECOGNIZING ACCENTED SPEECH
Techniques and apparatuses for recognizing accented speech are described. In some embodiments, an accent module recognizes accented speech using an accent library based on device data, uses different speech recognition correction levels based on an application field into which recognized words are set to be provided, or updates an accent library based on corrections made to incorrectly recognized speech.
PERVASIVE ADVISOR FOR MAJOR EXPENDITURES
A pervasive advisor for major purchases and other expenditures may detect that a customer is contemplating a major purchase (e.g., through active listening). The advisor may assist the customer with the timing and manner of making the purchase in a way that is financially sensible in view of the customer's financial situation. A customer may be provided with dynamically-updated information in response to recent actions that may affect an approved loan amount and/or interest rate. Underwriting of a loan may be triggered based on the geo-location of the user. Financial advice may be provided to customers to help them meet their goals using information obtained from third party sources, such as purchase options based on particular goals. The pervasive advisor may thus intervene to assist with budgeting, financing, and timing of major expenditures based on the customer's location and on the customer's unique and changing circumstances.
Generating acoustic sequences via neural networks using combined prosody info
An example system includes a processor to receive a linguistic sequence and a prosody info offset. The processor can generate, via a trained prosody info predictor, combined prosody info including a number of observations based on the linguistic sequence. The observations include linear combinations of statistical measures that evaluate a prosodic component over a predetermined period of time. The processor can generate, via a trained neural network, an acoustic sequence based on the combined prosody info, the prosody info offset, and the linguistic sequence.
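To make "linear combinations of statistical measures" concrete, here is a minimal Python sketch that computes observations from the mean and standard deviation of one prosodic component over a window. The choice of mean and standard deviation, the weight pairs, and the additive use of the offset are assumptions for illustration; the abstract names neither the measures nor how the offset enters the neural network.

```python
from statistics import mean, pstdev


def combined_prosody_observations(values, weights):
    """Build observations as linear combinations of mean and std.

    values:  samples of one prosodic component (for example log-pitch)
             over a predetermined window.
    weights: (w_mean, w_std) pairs; each pair yields one observation.
    """
    m = mean(values)
    s = pstdev(values)
    return [w_mean * m + w_std * s for w_mean, w_std in weights]


def apply_offset(observations, offset):
    """Shift each observation by a user-supplied offset (assumed additive)."""
    return [o + d for o, d in zip(observations, offset)]


# Two observations: roughly an upper and a lower bound of the component.
obs = combined_prosody_observations([1.0, 3.0], [(1, 2), (1, -2)])
print(obs)                             # [4.0, 0.0]
print(apply_offset(obs, [0.5, -0.5]))  # [4.5, -0.5]
```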
Automated word correction in speech recognition systems
Systems and methods for correcting recognition errors in speech recognition systems are disclosed herein. Natural conversational variations are identified to determine whether a query intends to correct a speech recognition error or whether the query is a new command. When the query intends to correct a speech recognition error, the system identifies a location of the error and performs the correction. The corrected query can be presented to the user or be acted upon as a command for the system.
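The decision described above (correction versus new command, then locating and fixing the error) can be sketched in Python as follows. The cue phrases in `CUE`, the single-word restriction, and the use of string similarity to locate the error are assumptions for this example, not the method the abstract claims.

```python
import difflib
import re

# Hypothetical cue phrases that mark a query as a correction.
CUE = re.compile(r"^(?:no,?\s+)?(?:i said|i meant)\s+(\w+)$", re.IGNORECASE)


def handle_query(previous, query):
    """Return (is_correction, text) for a follow-up query."""
    match = CUE.match(query.strip())
    words = previous.split()
    if match is None or not words:
        # No correction cue: treat the query as a new command.
        return False, query
    fix = match.group(1)

    def similarity(word):
        return difflib.SequenceMatcher(None, word.lower(), fix.lower()).ratio()

    # Locate the error: the previous word most similar to the correction.
    error_index = max(range(len(words)), key=lambda i: similarity(words[i]))
    words[error_index] = fix
    return True, " ".join(words)


print(handle_query("call John Smyth", "no, I said Smith"))  # (True, 'call John Smith')
print(handle_query("call John Smyth", "play some music"))   # (False, 'play some music')
```

The corrected text can then be shown to the user or executed as a command, as the abstract describes.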
SYSTEM AND METHOD FOR EXTRACTING HIDDEN CUES IN INTERACTIVE COMMUNICATIONS
Disclosed herein are system, method, and computer program product embodiments for machine learning systems that process interactive communications between at least two participants. Speech and text within the interactive communications are analyzed using machine learning classifiers to extract prosodic, semantic, and key phrase cues and thereby identify changes to emotion, sentiments, and key phrases. A summary of the interactive communications between a first participant and a second participant is generated based at least in part on the extracted prosodic, semantic, and key phrase cues, and the summary is highlighted based on any of the changes to emotion, the sentiments, or the key phrases.
DETECTING NON-VERBAL, AUDIBLE COMMUNICATION CONVEYING MEANING
Various embodiments of the invention provide methods, systems, and computer-program products for analyzing audio to capture semantic and non-semantic characteristics of the audio and the corresponding relationships between them. In particular embodiments, the audio is segmented into a set of utterance segments, in which a party is speaking, and a set of noise segments, in which the party is not speaking. The semantic and non-semantic characteristics are then captured for each of the utterance segments. Specifically, speech analytics is performed on each segment to identify the words spoken by the party in the segment as semantic characteristics. Further, laughter, emotion, and sentence boundary detection is performed on each segment to identify occurrences of such in the segment as non-semantic characteristics. Once these characteristics are identified for each segment, various embodiments of the invention construct a transcript based on the identified semantic and non-semantic characteristics.
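A minimal Python sketch of the two ends of this pipeline, segmentation and transcript construction, is shown below. Splitting on a per-frame energy threshold and rendering non-semantic events as bracketed tags are assumptions for illustration; the abstract does not say how segments are found or how the transcript is formatted, and the speech analytics and detectors in between are omitted.

```python
from itertools import groupby


def segment_frames(energies, threshold):
    """Group consecutive frames into utterance and noise segments.

    Returns (label, start_frame, end_frame) tuples with end exclusive.
    """
    segments = []
    start = 0
    for is_speech, run in groupby(energies, key=lambda e: e >= threshold):
        length = len(list(run))
        label = "utterance" if is_speech else "noise"
        segments.append((label, start, start + length))
        start += length
    return segments


def build_transcript(annotated_utterances):
    """Join words (semantic) and bracketed event tags (non-semantic)."""
    lines = []
    for words, events in annotated_utterances:
        tags = "".join(f" [{event}]" for event in events)
        lines.append(" ".join(words) + tags)
    return "\n".join(lines)


print(segment_frames([0.1, 0.2, 0.9, 0.8, 0.1], threshold=0.5))
# [('noise', 0, 2), ('utterance', 2, 4), ('noise', 4, 5)]
print(build_transcript([(["that", "is", "great"], ["laughter"]), (["thanks"], [])]))
```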
Voice interaction system, its processing method, and program therefor
A voice interaction system performs a voice interaction with a user. The voice interaction system includes: ask-again detection means for detecting an ask-again by the user; response-sentence generation means for generating, when the ask-again detection means has detected the ask-again, a response sentence for the ask-again based on the response sentence given to the user before the ask-again; and storage means for storing a history of the voice interaction with the user. When the earlier response sentence includes a word whose frequency of appearance in the stored history of the voice interaction is equal to or smaller than a first predetermined value, the response-sentence generation means generates a response sentence for the ask-again that is formed of only this word or in which this word is emphasized.
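The frequency rule above can be sketched in Python as follows. The function name `ask_again_response`, whitespace tokenization, uppercase as the form of emphasis, picking the rarest word when several qualify, and repeating the sentence unchanged when none qualifies are all assumptions made for this example.

```python
from collections import Counter


def ask_again_response(previous_response, history, max_count=1, word_only=False):
    """Build a reply to an ask-again from the previous response sentence."""
    # Count how often each word appears in the stored interaction history.
    counts = Counter(w.lower() for utterance in history for w in utterance.split())
    words = previous_response.split()
    # Words at or below the threshold are likely the ones the user missed.
    rare = [w for w in words if counts[w.lower()] <= max_count]
    if not rare:
        return previous_response  # assumed fallback: repeat unchanged
    target = min(rare, key=lambda w: counts[w.lower()])
    if word_only:
        return target
    # Emphasize the rare word (here by uppercasing it).
    return " ".join(w.upper() if w == target else w for w in words)


history = ["i went to kyoto", "i went to the park", "i went to the gym"]
print(ask_again_response("i went to kyoto", history))                  # i went to KYOTO
print(ask_again_response("i went to kyoto", history, word_only=True))  # kyoto
```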