Patent classifications
G10L2021/105
Method of Converting Phoneme Transcription Data Into Lip Sync Animation Data for 3D Animation Software
Described is a system, method, and computer program product that substantially advances the art of animating lip sync in 3D computer-animated characters by automatically producing data from a phoneme transcription of a dialog audio file, yielding lip-sync animation that is more realistic, smooth, and aesthetically pleasing than that produced by current phoneme-target lip-sync systems. The invention converts a phoneme transcription of a recorded dialog audio file into keyframe data that dynamically controls 16 independent animation parameters, each associated with a different part of the animated character's mouth, and then algorithmically modifies that data so that it conforms to the previously unknown complex, subtle, and context-specific relationships between audible phonemes and visible mouth movements.
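As a rough illustration of the pipeline this abstract describes (phoneme transcription in, per-parameter keyframes out, followed by an algorithmic smoothing pass), here is a minimal Python sketch. The pose table, the meaning of the 16 parameters, and the moving-average smoothing are illustrative assumptions; the patent's context-specific modification rules are not disclosed in the abstract.

```python
import numpy as np

N_PARAMS = 16  # independent mouth-part parameters (jaw, lip corners, etc.)

# Hypothetical target poses: phoneme -> 16 parameter values in [0, 1].
POSE_TABLE = {
    "M":  np.zeros(N_PARAMS),      # lips closed
    "AA": np.full(N_PARAMS, 0.8),  # jaw open
    "UW": np.full(N_PARAMS, 0.4),  # lips rounded
}

def phonemes_to_keyframes(transcription, fps=24, window=3):
    """transcription: list of (phoneme, start_sec, end_sec) tuples."""
    end = max(t1 for _, _, t1 in transcription)
    frames = np.zeros((int(end * fps) + 1, N_PARAMS))
    # Step 1: stamp each phoneme's target pose onto its frame range.
    for ph, t0, t1 in transcription:
        frames[int(t0 * fps): int(t1 * fps) + 1] = POSE_TABLE.get(ph, 0.1)
    # Step 2: smooth each parameter track; a moving average stands in for
    # the patent's context-specific modification rules.
    kernel = np.ones(window) / window
    for p in range(N_PARAMS):
        frames[:, p] = np.convolve(frames[:, p], kernel, mode="same")
    return frames  # rows are frames, columns are animation parameters

keys = phonemes_to_keyframes([("M", 0.0, 0.1), ("AA", 0.1, 0.35), ("UW", 0.35, 0.5)])
print(keys.shape)  # (13, 16) at 24 fps
```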
Using machine-learning models to determine movements of a mouth corresponding to live speech
Disclosed systems and methods predict visemes from an audio sequence. A viseme-generation application accesses a first set of training data that includes a first audio sequence, representing a sentence spoken by a first speaker, and a sequence of visemes, each mapped to a respective audio sample of the first audio sequence. The application creates a second set of training data by adjusting a second audio sequence, in which a second speaker speaks the same sentence, such that the second and first sequences have the same length and at least one phoneme occurs at the same time stamp in both. The application then maps the sequence of visemes to the second audio sequence and trains a viseme prediction model to predict a sequence of visemes from an audio sequence.
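The distinctive step here is building the second training set by warping the second speaker's audio onto the first speaker's timeline so the first speaker's viseme labels can be reused. A minimal sketch, assuming precomputed frame-level features (e.g., MFCCs) and a uniform linear warp; the patent additionally requires at least one phoneme to land on the same time stamp, which in practice calls for forced alignment or dynamic time warping rather than this uniform stretch.

```python
import numpy as np

def align_to_reference(ref_feats, other_feats):
    """Stretch other_feats (T2 x D) onto ref_feats' time axis (T1 x D)."""
    t1, t2 = len(ref_feats), len(other_feats)
    idx = np.linspace(0, t2 - 1, t1)        # uniform time warp
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, t2 - 1)
    w = (idx - lo)[:, None]
    return (1 - w) * other_feats[lo] + w * other_feats[hi]

ref = np.random.rand(100, 13)    # speaker A: 100 frames of audio features
other = np.random.rand(80, 13)   # speaker B: same sentence, 80 frames
visemes = np.random.randint(0, 12, size=100)  # labels aligned to speaker A

other_aligned = align_to_reference(ref, other)
# After warping, (other_aligned[i], visemes[i]) is a new training pair.
assert other_aligned.shape == ref.shape
```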
CONFIGURATION FOR REMOTE MULTI-CHANNEL LANGUAGE INTERPRETATION PERFORMED VIA IMAGERY AND CORRESPONDING AUDIO AT A DISPLAY-BASED DEVICE
A configuration is implemented to receive, with a processor, a request from a customer care platform for spoken language interpretation of a user query from a first spoken language to a second spoken language. The first spoken language is spoken by a user situated at a display-based device that is remote from the customer care platform; the user sends the query from the display-based device to the customer care platform. The configuration performs, at a language interpretation platform, a first spoken language interpretation of the user query from the first spoken language to the second spoken language, and transmits that interpretation from the language interpretation platform to the customer care platform so that a customer care representative who speaks the second spoken language understands the first spoken language being spoken by the user.
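This abstract describes a three-party message flow rather than an algorithm. The sketch below models that flow with hypothetical class and field names (InterpretationRequest, CustomerCarePlatform, and so on; none of these come from the patent).

```python
from dataclasses import dataclass

@dataclass
class InterpretationRequest:
    query_audio: bytes
    source_lang: str   # spoken by the user at the display-based device
    target_lang: str   # spoken by the customer care representative

class LanguageInterpretationPlatform:
    def interpret(self, req: InterpretationRequest) -> bytes:
        # Placeholder for human or machine interpretation of the audio.
        return b"<audio in " + req.target_lang.encode() + b">"

class CustomerCarePlatform:
    def __init__(self, interpreter: LanguageInterpretationPlatform):
        self.interpreter = interpreter

    def handle_user_query(self, query_audio: bytes) -> bytes:
        # The platform forwards the user's query for interpretation.
        req = InterpretationRequest(query_audio, "es", "en")
        return self.interpreter.interpret(req)  # relayed to the representative

platform = CustomerCarePlatform(LanguageInterpretationPlatform())
print(platform.handle_user_query(b"<user audio>"))
```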
Production of speech based on whispered speech and silent speech
A method, a system, and a computer program product are provided for interpreting low-amplitude speech and transmitting amplified speech to a remote communication device. At least one computing device receives sensor data, associated with the low-amplitude speech, from multiple sensors. At least one of the computing devices analyzes the sensor data to map it to syllables, producing a string of one or more words. An electronic representation of the string of one or more words may be generated and transmitted to a remote communication device, which produces the amplified speech from the electronic representation.
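A toy sketch of the sensor-to-text mapping described above, with a stand-in classifier and syllable inventory. A real system would run trained models over the sensor channels and use a lexicon or language model for the syllable-to-word step.

```python
import numpy as np

SYLLABLES = ["hel", "lo", "yes", "no"]  # illustrative inventory

def classify_frame(frame: np.ndarray) -> str:
    # Stand-in for a trained model over EMG/vibration/etc. sensor channels.
    return SYLLABLES[int(frame.sum()) % len(SYLLABLES)]

def sensors_to_text(frames):
    syllables = [classify_frame(f) for f in frames]
    # Naive join; a real system would consult a lexicon/language model here.
    return "".join(syllables)

frames = [np.zeros(8), np.full(8, 0.125)]  # contrived inputs for the demo
text = sensors_to_text(frames)             # -> "hello"
print(text)  # this string would be synthesized and sent at normal amplitude
```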
Systems and methods for machine-generated avatars
Systems and methods are disclosed for creating a machine-generated avatar: an avatar generated by processing video and audio information extracted from a recording of a human reading a corpus, and enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were never recorded. The video and audio processing uses machine-learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.
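Of the feature families named above, the wavelet features are the easiest to make concrete. A sketch using PyWavelets, with an illustrative wavelet and decomposition level; summarizing each sub-band by its energy is an assumption, not the patent's method.

```python
import numpy as np
import pywt  # pip install PyWavelets

def wavelet_features(audio: np.ndarray, wavelet="db4", level=3):
    # Multilevel discrete wavelet decomposition of the audio signal.
    coeffs = pywt.wavedec(audio, wavelet, level=level)
    # Summarize each sub-band by its energy; real models may keep more detail.
    return np.array([np.sum(c ** 2) for c in coeffs])

audio = np.sin(np.linspace(0, 40 * np.pi, 16000))  # 1 s of a fake recording
print(wavelet_features(audio))                     # 4 sub-band energies
```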
Methods and systems for image and voice processing
Systems and methods are disclosed for training an autoencoder using images that include faces, where the autoencoder comprises an input layer, an encoder configured to output a latent image from a corresponding input image, and a decoder configured to attempt to reconstruct the input image from the latent image. An image sequence of a face exhibiting a plurality of facial expressions, and the transitions between them, is generated and accessed; the images are captured from a plurality of different angles and under different lighting. The autoencoder is trained using source images that include the source face with different expressions captured at different angles under different lighting, and using destination images that include a destination face. The trained autoencoder is used to generate an output in which the likeness of the face in the destination images is swapped with the likeness of the source face, while the expressions of the destination face are preserved.
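The abstract does not specify the architecture beyond encoder and decoder, but the described swap (destination expressions, source likeness) matches the common shared-encoder, per-identity-decoder pattern. A minimal PyTorch sketch under that assumption, with illustrative layer sizes and random tensors standing in for the captured images:

```python
import torch
import torch.nn as nn

IMG = 64 * 64 * 3  # flattened image size

# One shared encoder, one decoder per identity.
encoder = nn.Sequential(nn.Linear(IMG, 512), nn.ReLU(), nn.Linear(512, 128))
decoder_src = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, IMG))
decoder_dst = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, IMG))

opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(decoder_src.parameters())
    + list(decoder_dst.parameters()),
    lr=1e-4,
)
loss_fn = nn.MSELoss()

src_batch = torch.rand(8, IMG)  # source face at varied angles/lighting
dst_batch = torch.rand(8, IMG)  # destination face

for _ in range(3):  # a few illustrative training steps
    opt.zero_grad()
    loss = (loss_fn(decoder_src(encoder(src_batch)), src_batch)
            + loss_fn(decoder_dst(encoder(dst_batch)), dst_batch))
    loss.backward()
    opt.step()

# Swap at inference: encode a destination frame, decode with the source
# decoder, keeping the destination's expression with the source's likeness.
swapped = decoder_src(encoder(dst_batch))
```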
JOINT AUDIO-VIDEO FACIAL ANIMATION SYSTEM
The present invention relates to a joint automatic audio-visual-driven facial animation system that, in some example embodiments, includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system with a strong language model for speech recognition, from whose word lattice a phoneme alignment is obtained.
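A toy sketch of the last step named above: expanding a best word path from the lattice into a phoneme alignment using a pronunciation lexicon. Splitting each word's duration evenly across its phonemes is a simplification; real LVCSR systems derive phoneme boundaries from the acoustic model.

```python
# Illustrative pronunciation lexicon (ARPAbet-style symbols).
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def phoneme_alignment(best_path):
    """best_path: list of (word, start_sec, end_sec) from the word lattice."""
    segments = []
    for word, t0, t1 in best_path:
        phones = LEXICON[word]
        step = (t1 - t0) / len(phones)  # even split per phoneme
        for i, ph in enumerate(phones):
            segments.append((ph, t0 + i * step, t0 + (i + 1) * step))
    return segments

for ph, t0, t1 in phoneme_alignment([("hello", 0.0, 0.4), ("world", 0.4, 0.9)]):
    print(f"{ph}\t{t0:.2f}\t{t1:.2f}")
```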
Method of translating and synthesizing a foreign language
A method to interactively convert a source-language video/audio stream into one or more target languages in high-definition video format using a computer. The spoken words in the converted language are synchronized with synthesized movements of a rendered mouth. Original audio and video streams from pre-recorded or live sermons are re-synthesized in another language with the original emotional and tonal characteristics; the original sermon may be in any language and be translated into any other language. The mouth and jaw are digitally rendered with viseme and phoneme morph targets that are pre-generated for lip synching with the synthesized target-language audio. Each video frame has the simulated lips and jaw inserted over the original. The new audio and video are then encoded and uploaded for internet viewing or recorded to a storage medium.
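The morph-target step can be sketched as a per-frame blend of pre-generated viseme targets over a neutral mouth mesh. The vertex count, target offsets, and weights below are placeholders.

```python
import numpy as np

N_VERTS = 200  # mouth/jaw-region vertices

neutral = np.zeros((N_VERTS, 3))  # neutral mouth pose
# Hypothetical morph targets: per-vertex offsets from the neutral pose.
targets = {"AA": np.random.rand(N_VERTS, 3), "OO": np.random.rand(N_VERTS, 3)}

def blend_frame(weights):
    """weights: {viseme: value in [0, 1]} for one video frame."""
    mesh = neutral.copy()
    for viseme, w in weights.items():
        mesh += w * targets[viseme]  # linear blend of morph targets
    return mesh  # rendered lips/jaw, composited over the original frame

frame_mesh = blend_frame({"AA": 0.7, "OO": 0.2})
print(frame_mesh.shape)  # (200, 3)
```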
COMPUTING SYSTEM FOR EXPRESSIVE THREE-DIMENSIONAL FACIAL ANIMATION
A computer-implemented technique for animating a visual representation of a face based on the spoken words of a speaker is described herein. A computing device receives an audio sequence comprising content features reflective of spoken words uttered by a speaker. The computing device generates latent content variables and latent style variables based upon the audio sequence. The latent content variables are used to synchronize movement of the lips of the visual representation to the spoken words; the latent style variables are derived from an expected appearance of the speaker's facial features as the words are uttered and are used to synchronize movement of the full facial features of the visual representation to the spoken words. The computing device causes the visual representation of the face to be animated on a display based upon the latent content variables and the latent style variables.
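A minimal sketch of the two-branch idea, assuming frame-level audio features in and facial animation coefficients (e.g., blendshape weights) out. All layer choices and sizes are illustrative, and the patent's actual derivation of the style variables is not reproduced here.

```python
import torch
import torch.nn as nn

class AudioToFace(nn.Module):
    def __init__(self, n_audio=80, n_content=32, n_style=16, n_blend=51):
        super().__init__()
        self.trunk = nn.GRU(n_audio, 128, batch_first=True)
        self.content_head = nn.Linear(128, n_content)  # drives lip sync
        self.style_head = nn.Linear(128, n_style)      # drives full-face style
        self.decoder = nn.Linear(n_content + n_style, n_blend)

    def forward(self, audio_feats):
        h, _ = self.trunk(audio_feats)  # (B, T, 128)
        content = self.content_head(h)  # latent content variables
        style = self.style_head(h)      # latent style variables
        return self.decoder(torch.cat([content, style], dim=-1))

model = AudioToFace()
coeffs = model(torch.rand(2, 100, 80))  # 2 clips, 100 frames of audio features
print(coeffs.shape)                     # (2, 100, 51) animation coefficients
```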
Method and apparatus for generating facial expression and training method for generating facial expression
A method and apparatus for generating a facial expression may receive an input image and generate facial expression images that change from the input image based on an index indicating the facial expression intensity of the input image, the index itself being obtained from the input image.
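A sketch of intensity-conditioned generation: a generator consumes an image plus a scalar intensity index and is queried at increasing intensities to produce the changing expression images. The untrained network and the 0.1 intensity step are placeholders.

```python
import torch
import torch.nn as nn

IMG = 64 * 64 * 3  # flattened image size

# Generator conditioned on a scalar intensity appended to the image vector.
generator = nn.Sequential(nn.Linear(IMG + 1, 512), nn.ReLU(), nn.Linear(512, IMG))

def expression_sequence(image, start_intensity, steps=5):
    frames = []
    for k in range(1, steps + 1):
        # Increase the intensity index a little per generated frame.
        alpha = torch.tensor([min(start_intensity + k * 0.1, 1.0)])
        frames.append(generator(torch.cat([image, alpha])))
    return torch.stack(frames)  # progressively stronger expressions

seq = expression_sequence(torch.rand(IMG), start_intensity=0.3)
print(seq.shape)  # (5, 12288)
```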