Patent classifications
G10L2021/105
PERIOCULAR AND AUDIO SYNTHESIS OF A FULL FACE IMAGE
Systems and methods for synthesizing an image of the face by a head-mounted device (HMD) are disclosed. The HMD may not be able to observe a portion of the face. The systems and methods described herein can generate a mapping from a conformation of the portion of the face that is not imaged to a conformation of the portion of the face observed. The HMD can receive an image of a portion of the face and use the mapping to determine a conformation of the portion of the face that is not observed. The HMD can combine the observed and unobserved portions to synthesize a full face image.
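The abstract does not say how the mapping is represented. As a minimal sketch, assume it is a table of paired conformation vectors collected while the whole face was visible, queried by nearest neighbor at run time. The names `build_mapping` and `infer_unobserved` and the tuple encoding of a conformation are hypothetical and are not taken from the patent.

```python
import math

def build_mapping(calibration_pairs):
    """Store (observed_conformation, unobserved_conformation) pairs.

    Assumption: each conformation is a tuple of floats, e.g. periocular
    parameters paired with lower-face parameters from a calibration session.
    """
    return [(tuple(obs), tuple(unobs)) for obs, unobs in calibration_pairs]

def infer_unobserved(mapping, observed):
    """Return the unobserved conformation paired with the closest stored observation."""
    closest = min(mapping, key=lambda pair: math.dist(pair[0], observed))
    return closest[1]

# Toy calibration data: (periocular parameters) -> (lower-face parameters)
mapping = build_mapping([
    ((0.0, 0.0), (0.1,)),   # neutral eyes  -> mouth nearly closed
    ((1.0, 1.0), (0.9,)),   # raised cheeks -> wide smile
])
lower_face = infer_unobserved(mapping, (0.9, 0.8))
```

A learned regression model could replace the lookup table; the nearest-neighbor form is used here only because it needs no training code.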
EMBODIED NEGOTIATION AGENT AND PLATFORM
Human speech signals that are uttered within an environment are transcribed; the environment includes one or more avatars representing one or more software agents; the human speech signals are directed to at least one of the avatars. At least one non-speech behavioral trace is obtained within the environment; the trace is representative of non-speech behavior directed to the at least one of the avatars. The transcribed human speech signals and the at least one non-speech behavioral trace are forwarded to the one or more software agents. A proposed act is obtained from at least one of the agents; responsive thereto, a command is issued to cause the avatar corresponding to the software agent from which the proposed act is obtained to emit synthesized speech and to act visually in accordance with the proposed act.
Identity preserving realistic talking face generation using audio speech of a user
Speech-driven facial animation is useful for a variety of applications such as telepresence, chatbots, etc. The necessary attributes of a realistic face animation are: (1) audiovisual synchronization, (2) identity preservation of the target individual, (3) plausible mouth movements, and (4) presence of natural eye blinks. Existing methods mostly address audio-visual lip synchronization and synthesis of natural facial gestures for overall video realism. However, existing approaches are not accurate. The present disclosure provides a system and method that learn the motion of facial landmarks as an intermediate step before generating texture. Person-independent facial landmarks are generated from audio for invariance to different voices, accents, etc. Eye blinks are imposed on the facial landmarks, and the person-independent landmarks are retargeted to person-specific landmarks to preserve identity-related facial structure. Facial texture is then generated from the person-specific facial landmarks, which helps to preserve identity-related texture.
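The blink and retargeting steps can be illustrated with a small sketch. It assumes 2D landmarks stored as `(x, y)` tuples and a retargeting rule that copies each landmark's displacement from a generic neutral face onto the target's neutral face. Both function names and the displacement-transfer rule are illustrative assumptions; the abstract does not specify the actual method.

```python
def impose_blink(landmarks, upper_idx, lower_idx, closure):
    """Move upper-eyelid landmarks toward the lower eyelid.

    closure is in [0, 1]; 1.0 means the upper lid reaches the lower lid.
    """
    out = list(landmarks)
    for u, l in zip(upper_idx, lower_idx):
        ux, uy = out[u]
        _, ly = out[l]
        out[u] = (ux, uy + closure * (ly - uy))
    return out

def retarget(generic_neutral, generic_frame, target_neutral):
    """Apply the generic face's per-landmark displacement to the target face."""
    return [
        (tx + (fx - gx), ty + (fy - gy))
        for (gx, gy), (fx, fy), (tx, ty)
        in zip(generic_neutral, generic_frame, target_neutral)
    ]

# Toy example: the third landmark (a lip point) moves down by 0.2
generic_neutral = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
generic_frame = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.2)]
target_neutral = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
person_specific = retarget(generic_neutral, generic_frame, target_neutral)
```

The target keeps its own face proportions (here twice as wide) while inheriting the motion, which is the sense in which identity-related structure is preserved in this toy version.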
Producing realistic talking face with expression using images text and voice
A method for providing visual sequences using one or more images, comprising: receiving one or more images of a person showing at least one face; receiving a message to be enacted by the person, wherein the message comprises at least a text or an emotional and movement command; processing the message to extract or receive audio data related to the voice of the person and facial movement data related to an expression to be carried on the face of the person; processing the image(s), the audio data, and the facial movement data; and generating an animation of the person enacting the message. The emotional and movement command is a GUI or multimedia based instruction to invoke the generation of facial expression(s) and/or body part movement(s).
METHOD FOR PROVIDING SPEECH VIDEO AND COMPUTING DEVICE FOR EXECUTING THE METHOD
A computing device according to an embodiment is provided with one or more processors and a memory storing one or more programs executed by the one or more processors. The computing device includes a standby state video generating module that generates a standby state video in which a person in a video is in a standby state, a speech state video generating module that generates a speech state video in which a person in a video is in a speech state based on a source of speech content, and a video reproducing module that reproduces the standby state video and generates a synthesized speech video by synthesizing the standby state video being reproduced and the speech state video.
Style-aware audio-driven talking head animation from a single image
Embodiments of the present invention provide systems, methods, and computer storage media for generating an animation of a talking head from an input audio signal of speech and a representation (such as a static image) of a head to animate. Generally, a neural network can learn to predict a set of 3D facial landmarks that can be used to drive the animation. In some embodiments, the neural network can learn to detect different speaking styles in the input speech and account for the different speaking styles when predicting the 3D facial landmarks. Generally, template 3D facial landmarks can be identified or extracted from the input image or other representation of the head, and the template 3D facial landmarks can be used with successive windows of audio from the input speech to predict 3D facial landmarks and generate a corresponding animation with plausible 3D effects.
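The use of successive audio windows with template landmarks can be sketched as follows. The neural network is replaced by a caller-supplied function, `predict_offsets`, which is a hypothetical stand-in; the window size, hop length, and the idea of predicting per-landmark offsets from the template are likewise assumptions made for illustration.

```python
def sliding_windows(samples, size, hop):
    """Yield successive windows of `size` samples, advancing by `hop` samples."""
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]

def animate(samples, template, predict_offsets, size, hop):
    """Produce one set of 3D landmarks per audio window.

    predict_offsets(window) stands in for the trained network and must return
    one (dx, dy, dz) offset per template landmark.
    """
    frames = []
    for window in sliding_windows(samples, size, hop):
        offsets = predict_offsets(window)
        frames.append([
            (x + dx, y + dy, z + dz)
            for (x, y, z), (dx, dy, dz) in zip(template, offsets)
        ])
    return frames
```

Overlapping windows (a hop smaller than the window size) give each frame some context from neighboring audio, which is a common choice for this kind of pipeline, though the abstract does not state whether the windows overlap.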
Three-dimensional expression base generation method and apparatus, speech interaction method and apparatus, and medium
This application provides a three-dimensional (3D) expression base generation method performed by a computer device. The method includes: obtaining image pairs of a target object in n types of head postures, each image pair including a color feature image and a depth image in a head posture; constructing a 3D human face model of the target object according to the n image pairs; and generating a set of expression bases of the target object according to the 3D human face model of the target object. According to this application, based on a reconstructed 3D human face model, a set of expression bases of a target object is further generated, so that more diversified product functions may be expanded based on the set of expression bases.
Three-dimensional face animation from speech
A method for training a three-dimensional model face animation model from speech is provided. The method includes determining a first correlation value for a facial feature based on an audio waveform from a first subject, generating a first mesh for a lower portion of a human face, based on the facial feature and the first correlation value, updating the first correlation value when a difference between the first mesh and a ground truth image of the first subject is greater than a pre-selected threshold, and providing a three-dimensional model of the human face animated by speech to an immersive reality application accessed by a client device based on the difference between the first mesh and the ground truth image of the first subject. A non-transitory, computer-readable medium storing instructions to cause a system to perform the above method, and the system, are also provided.
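The threshold-gated update described above can be sketched as a toy training loop. This is an illustration under strong assumptions: the "correlation value" is modeled as a single scalar weight, the mesh is a flat list of vertex coordinates, the ground truth is given as coordinates rather than an image, and the update is a plain gradient step on squared error. None of these choices come from the patent.

```python
def mesh_from_feature(weight, feature, template):
    """Toy mesh model: shift every template coordinate by weight * feature."""
    return [v + weight * feature for v in template]

def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length coordinate lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def fit_weight(feature, template, ground_truth,
               threshold=1e-3, lr=0.5, max_steps=1000):
    """Update the weight only while the mesh differs from ground truth by more than threshold."""
    weight = 0.0
    for _ in range(max_steps):
        mesh = mesh_from_feature(weight, feature, template)
        diff = mean_abs_diff(mesh, ground_truth)
        if diff <= threshold:
            return weight, diff
        # Gradient of the mean squared error with respect to the weight
        grad = sum(2 * (m - g) * feature
                   for m, g in zip(mesh, ground_truth)) / len(mesh)
        weight -= lr * grad
    mesh = mesh_from_feature(weight, feature, template)
    return weight, mean_abs_diff(mesh, ground_truth)

# Ground truth generated with a weight of 2.0, which the loop should recover
template = [0.0, 1.0, 2.0]
ground_truth = mesh_from_feature(2.0, 0.5, template)
weight, diff = fit_weight(0.5, template, ground_truth)
```

In this toy setup the remaining difference shrinks by a constant factor on each step, so the loop stops once the mesh is within the threshold of the ground truth.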
System, method, and computer program for transmitting face models based on face data points
A system, method, and computer program are provided for receiving face models based on face nodal points. In use, a real-time face model is received, wherein the real-time face model includes one or more face nodal points. Real-time face nodal points are received, including additional one or more face nodal points. The real-time face model is manipulated based on the real-time face nodal points.
Systems And Methods For Machine-Generated Avatars
Systems and methods are disclosed for creating a machine generated avatar. A machine generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus and enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing consists of the use of machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelets.