Patent classifications
G10L13/08
Preprocessor System for Natural Language Avatars
A preprocessor for use with natural language processors that control computerized avatars embeds avatar control information in the speech response file of the natural language processor, giving avatars improved perceived emotional intelligence. Rapid avatar response is provided by independent end-of-speech detection and a response cache that bypasses text-to-speech conversion time. The preprocessor may be shared among multiple websites to provide a shared analysis for query optimization.
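The response cache described above can be illustrated with a minimal sketch. The class and function names here are invented for illustration and are not from the patent; the point is that synthesized audio is stored keyed by query text, so a repeat query is served from the cache and the slow text-to-speech path is bypassed.

```python
# Hypothetical sketch of a response cache that bypasses text-to-speech
# conversion time on repeat queries. Names are illustrative, not from
# the patent.
class ResponseCache:
    def __init__(self):
        self._cache = {}  # query text -> synthesized audio bytes

    def get_response(self, query, synthesize):
        """Return cached audio if present; otherwise synthesize and cache."""
        if query not in self._cache:
            self._cache[query] = synthesize(query)  # slow TTS path
        return self._cache[query]                   # fast path thereafter

calls = {"n": 0}

def fake_tts(text):
    """Stand-in for a real text-to-speech engine."""
    calls["n"] += 1
    return b"AUDIO:" + text.encode()

cache = ResponseCache()
first = cache.get_response("hello", fake_tts)
second = cache.get_response("hello", fake_tts)  # served from cache
```

On the second call the synthesizer is never invoked, which is the latency win the abstract claims.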
REAL TIME CORRECTION OF ACCENT IN SPEECH AUDIO SIGNALS
Systems and methods for real-time correction of an accent in a speech audio signal are provided. A method includes: dividing the speech audio signal into a stream of input chunks, where an input chunk includes a pre-defined number of frames of the speech audio signal; extracting, by an acoustic features extraction module, acoustic features from the input chunk and a context associated with the input chunk, the context being a pre-determined number of frames preceding the input chunk in the stream; extracting, by a linguistic features extraction module, linguistic features from the input chunk and the context; receiving a speaker embedding for a human speaker; providing the speaker embedding, the acoustic features, and the linguistic features to a synthesis module to generate a mel-spectrogram with a reduced accent; and providing the mel-spectrogram to a vocoder to generate an output chunk of an output audio signal.
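The chunking step of this method can be sketched as follows, with assumed chunk and context sizes (the patent leaves both as pre-defined parameters). Each chunk of frames is paired with the frames that immediately precede it, which the feature extraction modules consume together.

```python
# Minimal sketch (assumed parameters) of splitting a frame stream into
# chunks, each paired with a context of preceding frames.
def chunk_with_context(frames, chunk_size, context_size):
    """Return (context, chunk) pairs over a stream of frames."""
    pairs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Context is up to `context_size` frames before this chunk;
        # the first chunk has no preceding frames.
        context = frames[max(0, start - context_size):start]
        pairs.append((context, chunk))
    return pairs

frames = list(range(10))  # stand-in for audio frames
pairs = chunk_with_context(frames, chunk_size=4, context_size=2)
# pairs[1] == ([2, 3], [4, 5, 6, 7])
```

In the described pipeline each (context, chunk) pair would feed the acoustic and linguistic feature extractors before synthesis and vocoding.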
Displaying media information and graphical controls for a chat application
A method for displaying media information includes: receiving a media information request sent by an originating client, where the media information request carries media information and destination client information; parsing the received media information to obtain text information corresponding to the media information; synthesizing information related to the media information with information related to the text information to obtain composite information; and sending the composite information to a destination client according to the destination client information, so that the destination client obtains the media information and the text information according to the composite information and displays the media information and the text information.
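The compositing step above can be sketched in miniature. The field names and functions here are invented for illustration: the server bundles the media payload with the text parsed from it into one composite message, and the destination client unpacks both for display.

```python
# Hypothetical sketch of composing and unpacking the composite
# information described in the abstract. Field names are illustrative.
def synthesize_composite(media, text, destination):
    """Bundle media information and its parsed text for one destination."""
    return {"destination": destination, "media": media, "text": text}

def handle_at_destination(composite):
    """Destination client extracts media and text to display together."""
    return composite["media"], composite["text"]

msg = synthesize_composite(b"voice-bytes", "hello there", "client-42")
media, text = handle_at_destination(msg)
```

Sending one composite message rather than two lets the destination client present the media and its transcription together, as the abstract describes.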
Viseme data generation for presentation while content is output
Systems and methods for viseme data generation are disclosed. Uncompressed audio data is generated and/or utilized to determine the beats per minute of the audio data. Visemes are associated with the audio data utilizing a Viterbi algorithm and the beats per minute. A time-stamped list of viseme data is generated that associates the visemes with the portions of the audio data that they correspond to. An animatronic toy and/or an animation is caused to lip sync using the viseme data while audio corresponding to the audio data is output.
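The time-stamped viseme list can be illustrated with a small sketch. The data and the equal-duration windowing are invented for illustration (the patent derives timing from a Viterbi alignment and the beats per minute); what matters is the shape of the output: each viseme is mapped to the start and end time of the audio it corresponds to.

```python
# Illustrative sketch (invented data and timing) of a time-stamped
# viseme list that lets playback drive lip sync while audio is output.
def build_viseme_timeline(visemes, segment_duration):
    """Assign each viseme a [start, end) window of equal duration."""
    timeline = []
    for i, viseme in enumerate(visemes):
        start = i * segment_duration
        timeline.append({"viseme": viseme,
                         "start": start,
                         "end": start + segment_duration})
    return timeline

timeline = build_viseme_timeline(["A", "E", "O"], segment_duration=0.25)
# timeline[1] == {"viseme": "E", "start": 0.25, "end": 0.5}
```

During playback, an animatronic toy or animation would look up the entry whose window contains the current audio timestamp and pose the mouth accordingly.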
Using speech to text data in training text to speech models
A system and method for providing a text-to-speech output by: receiving user audio data; determining a user region-specific pronunciation classification according to the audio data; determining text for a response to the user according to the audio data; identifying a portion of the text that appears in a region-specific pronunciation dictionary; and using a phoneme string for that portion, selected from the dictionary according to the user region-specific pronunciation classification, in a text-to-speech output to the user.
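The dictionary lookup at the heart of this method can be sketched as follows. The entries and the ARPAbet-style phoneme strings are illustrative, not from the patent: the dictionary maps a word to one phoneme string per regional classification, and the string matching the user's classification is selected for synthesis.

```python
# Hedged sketch of a region-specific pronunciation dictionary lookup.
# Entries and phoneme strings are illustrative examples only.
PRONUNCIATIONS = {
    "tomato": {"en-US": "T AH M EY T OW", "en-GB": "T AH M AA T OW"},
}

def phonemes_for(word, region, dictionary=PRONUNCIATIONS):
    """Return the phoneme string for the user's regional classification,
    or None if the word or region is not in the dictionary."""
    entry = dictionary.get(word)
    if entry is None:
        return None
    return entry.get(region)

us = phonemes_for("tomato", "en-US")  # "T AH M EY T OW"
gb = phonemes_for("tomato", "en-GB")  # "T AH M AA T OW"
```

Words absent from the dictionary would fall through to the synthesizer's default pronunciation.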
Systems and methods for machine-generated avatars
Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing uses machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.