Patent classifications
G10L2021/105
Speech image providing method and computing device for performing the same
A computing device according to an embodiment includes one or more processors and a memory storing one or more programs executed by the one or more processors. A standby state image generating module is configured to generate a standby state image in which a person is in a standby state, and to generate a back-motion image set including a plurality of back-motion images taken at a preset frame interval from the standby state image, for image interpolation with a preset reference frame of the standby state image. A speech state image generating module is configured to generate a speech state image in which the person is in a speech state based on a source of speech content, and an image playback module is configured to generate a synthetic speech image by combining the standby state image and the speech state image while playing the standby state image.
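As a rough illustration of the back-motion mechanism, the sketch below subsamples standby frames at a preset interval and cross-fades one toward a reference frame. All names are hypothetical, and the linear blend merely stands in for whatever image-interpolation method the patent actually uses.

```python
import numpy as np

def build_back_motion_set(standby_frames, frame_interval=5):
    """Subsample the standby-state image at a preset frame interval to
    form the back-motion image set (interval value is illustrative)."""
    return standby_frames[::frame_interval]

def interpolate_to_reference(frame, reference, alpha):
    """Cross-fade a back-motion frame toward the preset reference frame.
    A linear blend stands in for the patent's unspecified interpolation."""
    return ((1.0 - alpha) * frame + alpha * reference).astype(frame.dtype)

# Usage: blend halfway between a back-motion frame and the reference frame.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(30)]
back_motion = build_back_motion_set(frames, frame_interval=5)
mid = interpolate_to_reference(back_motion[0], frames[0], alpha=0.5)
```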
STYLE-AWARE AUDIO-DRIVEN TALKING HEAD ANIMATION FROM A SINGLE IMAGE
Embodiments of the present invention provide systems, methods, and computer storage media for generating an animation of a talking head from an input audio signal of speech and a representation (such as a static image) of a head to animate. Generally, a neural network can learn to predict a set of 3D facial landmarks that can be used to drive the animation. In some embodiments, the neural network can learn to detect different speaking styles in the input speech and account for the different speaking styles when predicting the 3D facial landmarks. Generally, template 3D facial landmarks can be identified or extracted from the input image or other representation of the head, and the template 3D facial landmarks can be used with successive windows of audio from the input speech to predict 3D facial landmarks and generate a corresponding animation with plausible 3D effects.
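The windowed prediction loop might look like the following sketch, where `predict_fn` stands in for the trained network and the 200 ms window, 30 fps hop, and 16 kHz rate are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def predict_landmark_sequence(audio, template_landmarks, predict_fn,
                              sample_rate=16000, fps=30, window_s=0.2):
    """Slide successive audio windows over the speech signal and predict
    one set of 3D facial landmarks per output frame, conditioned on the
    template landmarks extracted from the still input image."""
    window = int(window_s * sample_rate)   # samples per analysis window
    hop = sample_rate // fps               # one window per video frame
    frames = []
    for start in range(0, len(audio) - window + 1, hop):
        chunk = audio[start:start + window]
        frames.append(predict_fn(chunk, template_landmarks))
    return np.stack(frames)                # (num_frames, num_landmarks, 3)
```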
Systems and methods for improving animation of computer-generated avatars
The disclosed computer-implemented method may include identifying a set of action units (AUs) associated with a face of a user. Each AU may be associated with a muscle group engaged by the user to produce a viseme associated with a sound produced by the user. The method may also include, for each AU in the set of AUs, determining a set of AU parameters associated with the AU and the viseme. The set of AU parameters may include (1) an onset curve, and (2) a falloff curve. The method may also include (1) detecting that the user has produced the sound, and (2) directing a computer-generated avatar to produce the viseme in accordance with the set of AU parameters in response to detecting that the user is producing the sound. Various other methods, systems, and computer-readable media are also disclosed.
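A minimal sketch of how per-AU onset and falloff parameters could be represented and evaluated over time; the cosine ease shapes and field names are assumptions, since the abstract states only that each AU carries an onset curve and a falloff curve.

```python
import math
from dataclasses import dataclass

@dataclass
class AUParameters:
    onset_duration: float    # seconds for the muscle group to engage
    falloff_duration: float  # seconds to relax back to rest

    def weight(self, t, viseme_end):
        """Activation weight in [0, 1] at time t since viseme onset."""
        if t < self.onset_duration:                       # rising edge
            return 0.5 - 0.5 * math.cos(math.pi * t / self.onset_duration)
        if t <= viseme_end:                               # held at peak
            return 1.0
        rel = min((t - viseme_end) / self.falloff_duration, 1.0)
        return 0.5 + 0.5 * math.cos(math.pi * rel)        # falling edge

# Example: an AU for an open-mouth viseme, evaluated mid-falloff.
au = AUParameters(onset_duration=0.08, falloff_duration=0.12)
print(au.weight(t=0.36, viseme_end=0.30))  # 0.5, halfway through falloff
```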
CALL CONTROL METHOD AND RELATED PRODUCT
Provided are a call control method and related product. In the method, during a voice call between a first user of a first terminal and a second user of a second terminal, a three-dimensional face model of the second user is displayed; model-driven parameters, including expression parameters and posture parameters, are determined according to the call voice of the second user; and the three-dimensional face model of the second user is driven according to the model-driven parameters to display a three-dimensional simulated call animation of the second user.
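One step of the described call loop might be sketched as follows; `extract_params` and the `face_model` interface are hypothetical stand-ins for components the abstract does not specify.

```python
from dataclasses import dataclass

@dataclass
class ModelDrivenParams:
    expression: list   # e.g. blendshape weights
    posture: list      # e.g. head rotation / translation

def drive_call_animation(voice_frame, extract_params, face_model):
    """One step of the call loop: derive expression and posture parameters
    from the second user's call voice, then drive the 3D face model to
    display the simulated call animation."""
    params = extract_params(voice_frame)      # audio -> ModelDrivenParams
    face_model.set_expression(params.expression)
    face_model.set_posture(params.posture)
    face_model.render()
```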
Matching mouth shape and movement in digital video to alternative audio
A method for matching mouth shape and movement in digital video to alternative audio includes deriving a sequence of facial poses, including mouth shapes, for an actor from a source digital video. Each pose in the sequence corresponds to the middle position of a respective audio sample. The method further includes generating an animated face mesh based on the sequence of facial poses and the source digital video, transferring tracked expressions from the animated face mesh or the target video to the source video, and generating a rough output video that includes the transferred expressions. The method further includes generating a finished video at least in part by refining the rough output video using a parametric autoencoder trained on mouth shapes in the animated face mesh or the target video. One or more computers may perform the operations of the method.
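The correspondence between poses and the middle position of each audio sample can be illustrated with a short timestamp calculation; the window length and non-overlapping layout are assumptions.

```python
def pose_timestamps(num_samples, samples_per_window, sample_rate):
    """Timestamp (in seconds) of the middle position of each audio
    sample window, i.e. the instant each facial pose corresponds to."""
    return [
        (start + samples_per_window // 2) / sample_rate
        for start in range(0, num_samples - samples_per_window + 1,
                           samples_per_window)
    ]

# Example: 1 s of 16 kHz audio cut into 100 ms windows.
print(pose_timestamps(16000, 1600, 16000))  # [0.05, 0.15, ..., 0.95]
```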
System and method for voice driven lip syncing and head reenactment
A system and method for voice-driven animation of an object in an image: an input video depicting a puppet object is sampled to obtain an image; audio data is received and voice related features are extracted from it; an expression representation, related to a region of interest, is produced based on the voice related features; auxiliary data related to the image is obtained from the image; and a target image is generated based on the expression representation and the auxiliary data.
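Read as a pipeline, the claim might be sketched like this; every callable on the hypothetical `models` object is a stand-in for a component the abstract leaves unspecified.

```python
def generate_target_image(input_video, audio_data, models):
    """End-to-end sketch of the claimed flow."""
    image = models.sample_frame(input_video)          # sample the puppet video
    feats = models.extract_voice_features(audio_data) # voice related features
    expression = models.expression_net(feats)         # region-of-interest expression
    aux = models.extract_auxiliary(image)             # auxiliary data from the image
    return models.generator(expression, aux)          # synthesized target image
```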
Embodied negotiation agent and platform
Human speech signals that are uttered within an environment are transcribed; the environment includes one or more avatars representing one or more software agents; the human speech signals are directed to at least one of the avatars. At least one non-speech behavioral trace is obtained within the environment; the trace is representative of non-speech behavior directed to the at least one of the avatars. The transcribed human speech signals and the at least one non-speech behavioral trace are forwarded to the one or more software agents. A proposed act is obtained from at least one of the agents; responsive thereto, a command is issued to cause the avatar corresponding to the software agent from which the proposed act is obtained to emit synthesized speech and to act visually in accordance with the proposed act.
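The turn-by-turn flow could be sketched as below; all interfaces (`env`, `agents`, the fields of the proposed act) are hypothetical, since the abstract describes only the information flow.

```python
def negotiation_turn(env, agents):
    """One turn of the described loop: transcribe the human's speech,
    capture a non-speech behavioral trace (e.g. gaze or gesture), forward
    both to the software agents, then realize a proposed act as avatar
    speech and visual behavior."""
    transcript = env.transcribe_speech()
    trace = env.capture_behavior_trace()
    for agent in agents:
        agent.observe(transcript, trace)
    act = agents[0].propose_act()
    env.issue_command(avatar=act.avatar,
                      speech=act.synthesized_speech,
                      motion=act.visual_action)
```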
PROGRAM, METHOD, AND TERMINAL DEVICE
The present invention produces a behavior of a character arranged in a virtual space on the basis of data including information capable of specifying a motion input by a performer who plays the character and the performer's speech, displays the character as viewed from a predetermined point of view in the virtual space on a predetermined display unit, and outputs a sound depending on the performer's speech. A behavior of the character's mouth in the virtual space is controlled based on the data, the next facial expression to be taken as the character's facial expression is specified in accordance with a predetermined rule, and the character's facial expression in the virtual space is switched, with the lapse of time, to the facial expression so specified.
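A minimal sketch of the time-based expression switching, assuming a fixed cadence; the 2-second interval and the `character` interface are illustrative, not from the patent.

```python
import time

def run_expression_switching(character, next_expression, interval=2.0):
    """On a fixed cadence, ask the predetermined rule for the character's
    next facial expression and switch to it, while mouth motion is driven
    separately from the performer data."""
    while character.is_active():
        expression = next_expression(character.current_expression())
        character.set_expression(expression)
        time.sleep(interval)
```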
SYSTEM AND METHOD FOR LIP-SYNCING A FACE TO TARGET SPEECH USING A MACHINE LEARNING MODEL
A processor-implemented method is provided for generating a lip-sync of a face to target speech of a live session, in one or more languages and in sync, with improved visual quality, using a machine learning model and a pre-trained lip-sync model. The method includes (i) determining a visual representation of the face and an audio representation, where the visual representation includes crops of the face at a first timestamp; (ii) modifying the crops of the face to obtain masked crops; (iii) obtaining a reference frame from the visual representation at a second timestamp; (iv) combining the masked crops at the first timestamp with the reference frame to obtain lower half crops; (v) training the machine learning model by providing historical lower half crops and historical audio representations as training data; (vi) generating lip-synced frames for the face to the target speech; and (vii) generating in-sync lip-synced frames using the pre-trained lip-sync model.
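Steps (ii)-(iv) can be illustrated with a small masking-and-stacking sketch; zeroing the lower half and channel-wise concatenation are assumptions, since the claim says only that the crops are masked and combined.

```python
import numpy as np

def make_model_input(face_crop, reference_crop):
    """Mask the lower half (mouth region) of the crop at the first
    timestamp and stack it channel-wise with the reference crop from the
    second timestamp, forming the model's per-frame input."""
    masked = face_crop.copy()
    masked[face_crop.shape[0] // 2:, :, :] = 0   # zero out the mouth region
    return np.concatenate([masked, reference_crop], axis=-1)

# Example: two 96x96 RGB crops -> one 96x96x6 input tensor.
crop = np.random.rand(96, 96, 3)
ref = np.random.rand(96, 96, 3)
print(make_model_input(crop, ref).shape)  # (96, 96, 6)
```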