Patent classifications
G10L2021/105
Method and apparatus for processing information
Embodiments of the present disclosure provide a method and apparatus for processing information. A method may include: generating voice response information based on voice information sent by a user; generating a phoneme sequence based on the voice response information; generating mouth movement information based on the phoneme sequence, the mouth movement information being used for controlling a mouth movement of a displayed three-dimensional human image when playing the voice response information; and playing the voice response information, and controlling the mouth movement of the three-dimensional human image based on the mouth movement information.
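Below is a minimal, hypothetical sketch of the pipeline this abstract describes: a phoneme sequence derived from the voice response is mapped to mouth-shape (viseme) keyframes that would drive a displayed 3D face while the response audio plays. The phoneme-to-viseme table, timing, and parameter names are illustrative assumptions, not the disclosed implementation.

```python
# Sketch: phoneme sequence -> time-stamped mouth-movement keyframes.
from dataclasses import dataclass

# Illustrative phoneme -> (openness, lip rounding) values; not from the patent.
VISEME_TABLE = {
    "AA": (0.9, 0.2), "IY": (0.3, 0.1), "UW": (0.3, 0.9),
    "M":  (0.0, 0.4), "F":  (0.2, 0.3), "SIL": (0.0, 0.0),
}

@dataclass
class MouthKeyframe:
    time_s: float      # when to apply the pose, relative to audio start
    openness: float    # 0 = closed, 1 = fully open
    rounding: float    # 0 = spread, 1 = rounded

def mouth_movement_from_phonemes(phonemes, frame_duration_s=0.08):
    """Turn a phoneme sequence into mouth keyframes for a 3D face."""
    keyframes = []
    for i, ph in enumerate(phonemes):
        openness, rounding = VISEME_TABLE.get(ph, (0.4, 0.3))
        keyframes.append(MouthKeyframe(i * frame_duration_s, openness, rounding))
    return keyframes

if __name__ == "__main__":
    # "hello" rendered as a toy phoneme sequence
    for kf in mouth_movement_from_phonemes(["SIL", "HH", "AH", "L", "OW", "SIL"]):
        print(kf)
```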
JOINT AUDIO-VIDEO FACIAL ANIMATION SYSTEM
The present invention relates to a joint automatic audio-visual-driven facial animation system that, in some example embodiments, includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system with a strong language model for speech recognition, and obtains phoneme alignments from the word lattice.
IDENTITY PRESERVING REALISTIC TALKING FACE GENERATION USING AUDIO SPEECH OF A USER
Speech-driven facial animation is useful for a variety of applications such as telepresence, chatbots, etc. The necessary attributes of a realistic face animation are: (1) audio-visual synchronization, (2) identity preservation of the target individual, (3) plausible mouth movements, and (4) presence of natural eye blinks. Existing methods mostly address audio-visual lip synchronization and synthesis of natural facial gestures for overall video realism; however, existing approaches are not accurate. The present disclosure provides a system and method that learn the motion of facial landmarks as an intermediate step before generating texture. Person-independent facial landmarks are generated from audio for invariance to different voices, accents, etc. Eye blinks are imposed on the facial landmarks, and the person-independent landmarks are retargeted to person-specific landmarks to preserve identity-related facial structure. Facial texture is then generated from the person-specific facial landmarks, which helps to preserve identity-related texture.
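A hedged sketch of the retargeting step described above follows: generic (person-independent) landmark motion is re-expressed relative to a target person's neutral landmarks so identity-specific facial structure is preserved. The array shapes and the scaling heuristic are assumptions for illustration, not the patented method.

```python
# Sketch: retarget person-independent landmark motion onto a specific face.
import numpy as np

def retarget_landmarks(generic_seq, generic_neutral, person_neutral):
    """Map a (T, L, 2) generic landmark sequence onto person-specific landmarks.

    generic_seq:     predicted person-independent landmarks per frame
    generic_neutral: neutral (resting) pose in the generic landmark space
    person_neutral:  target person's neutral landmarks (same L points)
    """
    # Motion is taken as displacement from the generic neutral pose...
    displacement = generic_seq - generic_neutral
    # ...scaled to the target face size and added to the person's neutral pose.
    scale = np.ptp(person_neutral, axis=0) / np.ptp(generic_neutral, axis=0)
    return person_neutral + displacement * scale

if __name__ == "__main__":
    T, L = 5, 68                                     # 5 frames, 68 landmarks (illustrative)
    rng = np.random.default_rng(0)
    generic_neutral = rng.uniform(0, 1, (L, 2))
    person_neutral = generic_neutral * 1.2 + 0.05    # slightly larger face
    generic_seq = generic_neutral + rng.normal(0, 0.01, (T, L, 2))
    print(retarget_landmarks(generic_seq, generic_neutral, person_neutral).shape)
```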
Motion Tracking and Image Recognition of Hand Gestures to Animate a Digital Puppet, Synchronized with Recorded Audio
There is provided a system and method for creating a digital puppet show by animating a digital puppet using gestures and sound. The method includes presenting the digital puppet to a user on a display. The method further includes receiving motion data corresponding to a gesture, using a camera, from the user, translating the motion data into digital data using a motion tracking algorithm, and animating the digital puppet, on the display, using the digital data. The method can further include receiving audio data from the user using a microphone and playing the audio data, using a speaker, while animating the digital puppet on the display to create the digital puppet show.
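The following is a minimal, hypothetical sketch of the puppet-animation loop described above: per-frame hand positions (standing in for camera-based motion tracking) are translated into puppet joint angles on a shared timeline that recorded audio playback could follow. The tracker input, pose mapping, and joint names are illustrative stand-ins.

```python
# Sketch: tracked hand motion -> puppet poses on a common timeline.
import math

def hand_to_puppet_pose(hand_x, hand_y):
    """Translate a normalized hand position (0..1) into puppet arm angles (degrees)."""
    return {"left_arm": hand_x * 90.0, "right_arm": hand_y * 90.0}

def animate_puppet(tracked_frames, fps=30):
    """Yield (timestamp_s, pose) pairs that a renderer and audio player would consume together."""
    for i, (x, y) in enumerate(tracked_frames):
        yield i / fps, hand_to_puppet_pose(x, y)

if __name__ == "__main__":
    # Synthetic "tracking" data standing in for camera motion capture.
    frames = [(0.5 + 0.5 * math.sin(i / 5), 0.5) for i in range(10)]
    for t, pose in animate_puppet(frames):
        print(f"{t:.2f}s -> {pose}")
```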
METHOD AND APPARATUS FOR PREDICTING MOUTH-SHAPE FEATURE, AND ELECTRONIC DEVICE
A method and apparatus for predicting a mouth-shape feature, and an electronic device, are provided. A specific implementation of the method comprises: recognizing a phonetic posteriorgram (PPG) of a phonetic feature; and performing a prediction on the PPG by using a neural network model to predict a mouth-shape feature of the phonetic feature, the neural network model being obtained by training with training samples, its input including a PPG and its output including a mouth-shape feature, and the training samples including PPG training samples and mouth-shape feature training samples.
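A hedged sketch of the prediction step described above: a PPG frame (a probability vector over phonemes) is fed through a small feed-forward network that outputs mouth-shape parameters. The layer sizes, random weights, and output dimensions are illustrative only; the abstract does not specify this architecture.

```python
# Sketch: one PPG frame -> mouth-shape feature vector via a toy network.
import numpy as np

rng = np.random.default_rng(0)
N_PHONEMES, HIDDEN, N_MOUTH_PARAMS = 40, 64, 8   # assumed dimensions

# Randomly initialized weights stand in for a model trained on
# (PPG, mouth-shape) sample pairs as the abstract describes.
W1 = rng.normal(0, 0.1, (N_PHONEMES, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, N_MOUTH_PARAMS))

def predict_mouth_shape(ppg_frame):
    """Map one PPG frame (probabilities over phonemes) to mouth-shape features."""
    hidden = np.tanh(ppg_frame @ W1)
    return hidden @ W2

if __name__ == "__main__":
    ppg = rng.dirichlet(np.ones(N_PHONEMES))      # a toy PPG frame (sums to 1)
    print(predict_mouth_shape(ppg))
```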
SPEECH-DRIVEN FACIAL ANIMATION GENERATION METHOD
The present disclosure discloses a speech-driven facial animation generation method. The method is divided into six steps: extracting speech features, collecting frequency information, summarizing time information, decoding action features, driving a facial model, and sliding a signal window. The present disclosure can drive any facial model in real time, with a certain delay, from an input speech audio signal to generate animation. The animation quality reaches the current state of the art in speech animation while remaining lightweight and robust. The present disclosure can be used to generate speech animation in different scenarios, such as VR social networking, virtual speech assistants, and games.
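A minimal sketch of the sliding-window flow named in those six steps: a window of audio samples is turned into frequency features, summarized over time, decoded into facial action values, applied to a face model, and then the window slides forward. Each stage below is a deliberately simple stand-in, not the disclosed method.

```python
# Sketch: sliding-window speech-to-animation loop.
import numpy as np

def extract_features(window):        # extract speech features / collect frequency info (toy: magnitude spectrum)
    return np.abs(np.fft.rfft(window))

def summarize_time(features):        # summarize time information (toy: mean log energy)
    return np.log1p(features).mean()

def decode_actions(summary):         # decode action features (toy: single jaw-open value)
    return {"jaw_open": float(np.clip(summary, 0.0, 1.0))}

def drive_face_model(actions):       # drive a facial model (stub renderer)
    print(actions)

def animate(audio, window_size=800, hop=200):
    for start in range(0, len(audio) - window_size + 1, hop):   # slide the signal window
        window = audio[start:start + window_size]
        drive_face_model(decode_actions(summarize_time(extract_features(window))))

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    audio = 0.1 * np.sin(2 * np.pi * 220 * t)   # one second of a toy tone
    animate(audio[:4000])                       # animate a quarter second of it
```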
STYLE-AWARE AUDIO-DRIVEN TALKING HEAD ANIMATION FROM A SINGLE IMAGE
Embodiments of the present invention provide systems, methods, and computer storage media for generating an animation of a talking head from an input audio signal of speech and a representation (such as a static image) of a head to animate. Generally, a neural network can learn to predict a set of 3D facial landmarks that can be used to drive the animation. In some embodiments, the neural network can learn to detect different speaking styles in the input speech and account for the different speaking styles when predicting the 3D facial landmarks. Generally, template 3D facial landmarks can be identified or extracted from the input image or other representation of the head, and the template 3D facial landmarks can be used with successive windows of audio from the input speech to predict 3D facial landmarks and generate a corresponding animation with plausible 3D effects.
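Below is a hedged sketch of the windowed prediction idea described above: template 3D landmarks taken from the input image are combined with successive audio windows, and a predictor (here a trivial placeholder instead of the learned network) outputs per-window 3D landmark positions that drive the animation. The window length and the placeholder predictor are assumptions for illustration.

```python
# Sketch: template landmarks + audio windows -> per-frame 3D landmarks.
import numpy as np

def predict_landmark_offsets(audio_window, n_landmarks=68):
    """Placeholder for the learned network: louder audio opens the jaw more."""
    energy = float(np.sqrt(np.mean(audio_window ** 2)))
    offsets = np.zeros((n_landmarks, 3))
    offsets[48:68, 1] = -0.1 * energy     # lower-face landmarks move with energy
    return offsets

def animate_from_audio(template_landmarks, audio, window=1600, hop=400):
    frames = []
    for start in range(0, len(audio) - window + 1, hop):
        offsets = predict_landmark_offsets(audio[start:start + window],
                                           len(template_landmarks))
        frames.append(template_landmarks + offsets)
    return np.stack(frames)               # (num_windows, 68, 3) landmark animation

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    template = rng.uniform(-1, 1, (68, 3))        # stand-in for image-derived landmarks
    audio = 0.2 * rng.standard_normal(16000)      # one second of toy audio at 16 kHz
    print(animate_from_audio(template, audio).shape)
```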
Computer-implemented systems and methods for acquiring and assessing physical-world data indicative of avatar interactions
Systems and methods are provided for acquiring physical-world data indicative of interactions of a subject with an avatar for evaluation. An interactive avatar is provided for interaction with the subject. Speech from the subject to the avatar is captured, and automatic speech recognition is performed to determine content of the subject speech. Motion data from the subject interacting with the avatar is captured. A next action of the interactive avatar is determined based on the content of the subject speech or the motion data. The next action of the avatar is implemented, and a score for the subject is determined based on the content of the subject speech and the motion data.
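A minimal, hypothetical sketch of the interaction loop described above: the subject's recognized speech and captured motion are used both to pick the avatar's next action and to accumulate a score. The keyword rules and scoring weights are illustrative stand-ins for whatever evaluation logic is actually used.

```python
# Sketch: ASR text + motion data -> next avatar action and a toy score.
def next_avatar_action(speech_text, motion_detected):
    """Choose the avatar's next action from ASR output and motion data."""
    if "hello" in speech_text.lower():
        return "wave_back"
    if motion_detected:
        return "nod"
    return "prompt_subject"

def score_interaction(speech_text, motion_detected):
    """Toy score combining speech content and physical engagement."""
    return len(speech_text.split()) + (5 if motion_detected else 0)

if __name__ == "__main__":
    transcript = "Hello avatar, nice to meet you"   # stand-in for ASR output
    moved = True                                    # stand-in for captured motion data
    print(next_avatar_action(transcript, moved), score_interaction(transcript, moved))
```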
METHOD AND APPARATUS FOR CONTROLLING AVATARS BASED ON SOUND
Provided is a method for controlling avatar motion, which is operated in a user terminal and includes receiving input audio through an audio sensor, and controlling, by one or more processors, a motion of a first user avatar based on the input audio.
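A minimal sketch of audio-driven avatar control as described above: frames of sensed audio are reduced to an energy value, and a simple threshold switches the avatar between "idle" and "talking" motion states. The threshold value and motion-state names are illustrative assumptions.

```python
# Sketch: per-frame audio energy -> avatar motion state.
import numpy as np

TALK_THRESHOLD = 0.05   # assumed RMS level separating silence from speech

def avatar_motion_for_frame(audio_frame):
    """Return the avatar motion state for one frame of sensed audio."""
    rms = float(np.sqrt(np.mean(audio_frame ** 2)))
    return "talking" if rms > TALK_THRESHOLD else "idle"

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    quiet = 0.01 * rng.standard_normal(1600)    # ~0.1 s of near-silence at 16 kHz
    loud = 0.2 * rng.standard_normal(1600)      # ~0.1 s of speech-level audio
    print(avatar_motion_for_frame(quiet), avatar_motion_for_frame(loud))
```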