Patent classifications
G10L15/1807
CONTEXT-AWARE PROSODY CORRECTION OF EDITED SPEECH
Methods are performed by one or more processing devices for correcting prosody in audio data. A method includes operations for accessing subject audio data in an audio edit region of the audio data. The subject audio data in the audio edit region potentially lacks prosodic continuity with unedited audio data in an unedited audio portion of the audio data. The operations further include predicting, based on a context of the unedited audio data, phoneme durations including a respective phoneme duration of each phoneme in the unedited audio data. The operations further include predicting, based on the context of the unedited audio data, a pitch contour comprising at least one respective pitch value of each phoneme in the unedited audio data. Additionally, the operations include correcting prosody of the subject audio data in the audio edit region by applying the phoneme durations and the pitch contour to the subject audio data.
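The correction step above, applying predicted phoneme durations and a predicted pitch contour to the edited audio, can be sketched in Python. The function names, the per-phoneme stretch factors, and the linear interpolation of the contour are illustrative assumptions, not the patented implementation:

```python
def stretch_factors(actual_durations, predicted_durations):
    """Per-phoneme time-scale factors that a time-scale-modification stage
    would apply so each edited phoneme matches the duration predicted
    from the unedited context."""
    return [pred / actual
            for actual, pred in zip(actual_durations, predicted_durations)]

def frame_contour(phoneme_pitches, n_frames):
    """Expand one predicted pitch value per phoneme into a frame-level
    contour by linear interpolation (an illustrative choice; the abstract
    only says at least one pitch value per phoneme is predicted)."""
    if len(phoneme_pitches) == 1 or n_frames == 1:
        return [phoneme_pitches[0]] * n_frames
    out = []
    for i in range(n_frames):
        pos = i * (len(phoneme_pitches) - 1) / (n_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, len(phoneme_pitches) - 1)
        frac = pos - lo
        out.append(phoneme_pitches[lo] * (1 - frac) + phoneme_pitches[hi] * frac)
    return out
```

A vocoder or TSM algorithm would then consume the stretch factors and the frame-level contour to resynthesize the edit region.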
Voice identification method, device, apparatus, and storage medium
A voice identification method, device, apparatus, and a storage medium are provided. The method includes: receiving voice data; performing voice identification on the voice data to obtain first text data associated with the voice data; determining common text data in a preset fixed data table, wherein a similarity between the pronunciation of the determined common text data and the pronunciation of the first text data meets a preset condition, and wherein the determined common text data is a voice identification result whose occurrence count exceeds a first preset threshold; and replacing the first text data with the determined common text data.
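A minimal sketch of the replacement logic, assuming Levenshtein distance over phonetic transcriptions as the pronunciation-similarity measure (the abstract leaves the metric unspecified); `correct_transcript` and the table layout are hypothetical:

```python
def edit_distance(a, b):
    """Levenshtein distance with a rolling row, used here as a crude
    proxy for pronunciation similarity between phonetic strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def correct_transcript(first_text, first_phones, fixed_table,
                       min_count=100, max_dist=1):
    """fixed_table maps common text -> (phonetic form, occurrence count).
    Replace first_text with a sufficiently frequent entry whose
    pronunciation is within max_dist; otherwise keep the ASR output."""
    best, best_d = first_text, max_dist + 1
    for text, (phones, count) in fixed_table.items():
        if count > min_count:
            d = edit_distance(phones, first_phones)
            if d < best_d:
                best, best_d = text, d
    return best
```

The thresholds `min_count` and `max_dist` stand in for the "first preset threshold" and the "preset condition" of the claim.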
SPEAKING-RATE NORMALIZED PROSODIC PARAMETER BUILDER, SPEAKING-RATE DEPENDENT PROSODIC MODEL BUILDER, SPEAKING-RATE CONTROLLED PROSODIC-INFORMATION GENERATION DEVICE AND PROSODIC-INFORMATION GENERATION METHOD ABLE TO LEARN DIFFERENT LANGUAGES AND MIMIC VARIOUS SPEAKERS' SPEAKING STYLES
A speaking-rate dependent prosodic model builder and a related method are disclosed. The proposed builder includes a first input terminal for receiving first information of a first language spoken by a first speaker, a second input terminal for receiving second information of a second language spoken by a second speaker, and a functional information unit having a function, wherein the function includes a first plurality of parameters simultaneously relevant to the first language and the second language, or a plurality of sub-parameters in a second plurality of parameters relevant to the second language alone. Under a maximum a posteriori condition, and based on the first information, the second information, and the first plurality of parameters or the plurality of sub-parameters, the functional information unit produces speaking-rate dependent reference information and constructs a speaking-rate dependent prosodic model of the second language.
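As a toy analogue of the maximum-a-posteriori fitting the builder performs, the sketch below MAP-adapts a single Gaussian mean (say, a speaking-rate-normalized syllable duration for the second language), with a conjugate prior standing in for parameters shared with the first language; the actual prosodic model is far richer than this one-parameter illustration:

```python
def map_mean(observations, prior_mean, prior_weight):
    """MAP estimate of a Gaussian mean under a conjugate prior:
    (tau * mu0 + n * xbar) / (tau + n). With little data the estimate
    stays near the shared prior; with much data it follows the
    second-language observations."""
    n = len(observations)
    xbar = sum(observations) / n
    return (prior_weight * prior_mean + n * xbar) / (prior_weight + n)
```

With three observations at 2.0 and a prior mean of 1.0 weighted like three observations, the estimate lands midway between prior and data.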
SELF-DRIVING VEHICLE SYSTEMS AND METHODS
When a human driver picks up a passenger, the driver typically looks at the passenger to determine if the passenger is the individual who is supposed to receive a ride. Self-driving vehicles often do not have a human in the vehicle to make this decision. Several systems and methods described herein enable self-driving vehicles to know if they are picking up the correct person.
MOTOR VEHICLE OPERATING DEVICE WITH A CORRECTION STRATEGY FOR VOICE RECOGNITION
The invention relates to a method for operating a motor vehicle, wherein a first speech input of a user is received, at least one recognition result (A-D) is determined by means of a speech recognition system, the at least one recognition result (A-D) is output as a result list on an output device of the motor vehicle, and a second speech input of the user is received. The objective of the invention is to avoid presenting a false recognition result twice. In the second speech input, a repetition of the content of the first speech input is first recognized, which indicates a correction request by the user. From this, an excludable portion of the result list is determined, and when a recognition result (C-E) is determined for the second speech input, the excludable portion is excluded from the possible recognition results.
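The exclusion strategy can be sketched as follows; `recognize_with_correction`, the N-best list, and the repetition flag are illustrative stand-ins for the recognizer and the repetition detector:

```python
def recognize_with_correction(nbest, previously_offered, is_repetition):
    """When the second input is detected as a content repetition (a
    correction request), drop every hypothesis already shown to the
    user before picking a new result from the N-best list."""
    candidates = nbest
    if is_repetition:
        excluded = set(previously_offered)
        candidates = [h for h in nbest if h not in excluded]
    return candidates[0] if candidates else None
```

If the first pass offered A-D and was rejected, the second pass can only return a hypothesis outside that excludable portion.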
Analysis of professional-client interactions
One or more processors receive recording data of a meeting between a professional and a client. One or more processors analyze the recording data to make one or more determinations. One or more processors identify one or more characteristics of the professional based on the one or more determinations. One or more processors match the one or more characteristics of the professional to one or more preferences of an individual seeking one or more professionals. One or more processors build a profile of the professional based on the one or more characteristics and store the profile in a database. One or more processors search the database for one or more profiles that match the one or more preferences of the individual seeking one or more professionals and display the one or more profiles.
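A minimal sketch of the matching and search steps, assuming characteristics and preferences are plain tags and using an overlap ratio as the match criterion (the abstract does not fix a metric); all names are hypothetical:

```python
def match_score(profile_traits, preferences):
    """Fraction of the seeker's preferences satisfied by a
    professional's derived characteristics."""
    if not preferences:
        return 0.0
    return len(set(profile_traits) & set(preferences)) / len(preferences)

def search_profiles(database, preferences, threshold=0.5):
    """Return the stored profiles whose characteristics satisfy at
    least `threshold` of the seeker's preferences."""
    return [p for p in database
            if match_score(p["traits"], preferences) >= threshold]
```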
Intent recognition and emotional text-to-speech learning
An example intent-recognition system comprises a processor and memory storing instructions. The instructions cause the processor to receive speech input comprising spoken words. The instructions cause the processor to generate text results based on the speech input and generate acoustic feature annotations based on the speech input. The instructions also cause the processor to apply an intent model to the text results and the acoustic feature annotations to recognize an intent based on the speech input. An example system for adapting an emotional text-to-speech model comprises a processor and memory. The memory stores instructions that cause the processor to receive training examples comprising speech input and receive labelling data comprising emotion information associated with the speech input. The instructions also cause the processor to extract audio signal vectors from the training examples and generate an emotion-adapted voice font model based on the audio signal vectors and the labelling data.
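The claimed combination, an intent model applied to both the text results and the acoustic feature annotations, might be sketched with a toy late-fusion rule; the keyword rules and the `terminal_pitch` annotation are invented for illustration and are far simpler than a trained intent model:

```python
def recognize_intent(text, acoustic_annotations):
    """Toy intent model: keyword rules over the ASR text, with acoustic
    annotations (e.g. rising terminal pitch marking a question)
    disambiguating utterances the text alone cannot."""
    lowered = text.lower()
    if "play" in lowered:
        return "play_music"
    if acoustic_annotations.get("terminal_pitch") == "rising":
        return "question"
    return "statement"
```

The point of the fusion is visible in the second rule: "you finished" is a statement as text, but rising pitch turns it into a question.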
System and method for conversational agent via adaptive caching of dialogue tree
The present teaching relates to a method, system, medium, and implementations for managing a user-machine dialogue. Sensor data is received at a device, including an utterance representing the speech of a user engaged in a dialogue with the device. The speech of the user is determined based on the utterance, and a response to the user is searched for by a local dialogue manager residing on the device against a sub-dialogue tree stored on the device. The response, if identified from the sub-dialogue tree, is rendered to the user in response to the speech. If the response is not available in the sub-dialogue tree, a request for it is sent to a server.
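The cache-then-fallback behavior can be sketched as follows; `LocalDialogueManager`, the dictionary-shaped sub-dialogue tree, and `fetch_from_server` are illustrative assumptions about the on-device components:

```python
class LocalDialogueManager:
    """On-device dialogue manager: answer from the cached sub-dialogue
    tree when possible, otherwise fall back to the server."""

    def __init__(self, sub_tree, fetch_from_server):
        self.sub_tree = dict(sub_tree)   # {utterance: response}
        self.fetch = fetch_from_server   # stand-in for the server request

    def respond(self, utterance):
        if utterance in self.sub_tree:        # local hit: no round-trip
            return self.sub_tree[utterance]
        response = self.fetch(utterance)      # miss: ask the server
        self.sub_tree[utterance] = response   # adaptively cache it
        return response
```

A repeated query is served locally on the second turn, which is the latency win the adaptive caching is after.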
Method and apparatus for managing agent interactions with enterprise customers
A method and apparatus for managing agent interactions with customers of an enterprise are disclosed. The method includes generating a value representative of an emotional state of a customer engaged in an ongoing interaction with a virtual agent (VA) associated with the enterprise. The value is generated based, at least in part, on one or more inputs provided by the customer during the ongoing interaction. The value is compared with a predefined emotional threshold range to determine whether the emotional state of the customer is a non-neutral state. The ongoing interaction is deflected to one of a human agent and a specialized VA capable of empathetically handling the ongoing interaction if it is determined that the emotional state of the customer is the non-neutral state.
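A minimal sketch of the deflection logic, assuming a scalar emotion value and a symmetric neutral band (the abstract leaves both representations open), with the human-availability choice added as an illustrative routing detail:

```python
def route_interaction(emotion_value, neutral_range=(-0.2, 0.2),
                      human_available=True):
    """Keep the interaction with the standard VA while the emotion score
    stays inside the predefined neutral band; once it leaves the band
    (a non-neutral state), deflect to a human agent if one is free,
    otherwise to an empathetic specialized VA."""
    lo, hi = neutral_range
    if lo <= emotion_value <= hi:
        return "standard_va"
    return "human_agent" if human_available else "specialized_va"
```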
Prosodic and lexical addressee detection
Prosodic features are used for discriminating computer-directed speech from human-directed speech. Statistics and models describing energy/intensity patterns over time, speech/pause distributions, pitch patterns, vocal effort features, and speech segment duration patterns may be used for prosodic modeling. The prosodic features for at least a portion of an utterance are monitored over a period of time to determine a shape associated with the utterance. A score may be determined to assist in classifying the current utterance as human-directed or computer-directed without relying on knowledge of preceding or following utterances. Outside data may be used for training lexical addressee detection systems for the H-H-C (human-human-computer) scenario. H-C training data can be obtained from a single-user H-C collection, and H-H speech can be modeled using general conversational speech. H-C and H-H language models may also be adapted using interpolation with small amounts of matched H-H-C data.
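The language-model interpolation mentioned last is concrete enough to sketch: with unigram probability tables standing in for full language models, adaptation is a convex combination controlled by an interpolation weight. The table representation is a simplification for illustration:

```python
def interpolate_lm(p_matched, p_outside, lam):
    """Linearly interpolate two language models' word probabilities:
    lam * P_matched + (1 - lam) * P_outside, as used to adapt H-C and
    H-H models with small amounts of matched H-H-C data."""
    vocab = set(p_matched) | set(p_outside)
    return {w: lam * p_matched.get(w, 0.0) + (1 - lam) * p_outside.get(w, 0.0)
            for w in vocab}
```

A small `lam` keeps the model close to the large outside corpus while still leaning toward the scarce matched data where the two disagree.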