G10L2021/105

Virtual object image display method and apparatus, electronic device and storage medium

The application provides a virtual object image display method and apparatus, an electronic device and a storage medium, relates to the field of artificial intelligence, in particular computer vision and deep learning, and may be applied to virtual object dialogue scenarios. The specific implementation scheme includes: segmenting acquired voice to obtain voice segments; predicting lip shape sequence information for the voice segments; searching for a corresponding lip shape image sequence based on the lip shape sequence information; performing lip fusion between the lip shape image sequence and a virtual object baseplate to obtain a virtual object image; and displaying the virtual object image. The application improves the ability to obtain virtual object images.
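The described pipeline (segment the voice, predict a lip-shape label per segment, look up a stored lip image, fuse it onto a baseplate) can be sketched in Python. The fixed segment length, the viseme labels, the energy-based stand-in predictor, and the lookup table below are all illustrative assumptions, not details from the application.

```python
# Illustrative sketch of the described pipeline: segment audio, predict a
# lip-shape (viseme) label per segment, look up a stored lip image for each
# label, and fuse it onto a base portrait. All names and thresholds assumed.

SEGMENT_LEN = 4  # samples per voice segment (assumed)

# Assumed lookup table: viseme label -> stored lip-shape image (placeholder)
LIP_IMAGE_LIBRARY = {"closed": "img_closed", "open": "img_open", "round": "img_round"}

def segment_voice(samples):
    """Split the acquired voice into fixed-length voice segments."""
    return [samples[i:i + SEGMENT_LEN] for i in range(0, len(samples), SEGMENT_LEN)]

def predict_lip_shape(segment):
    """Stand-in predictor: map mean amplitude to a viseme label."""
    energy = sum(abs(s) for s in segment) / len(segment)
    if energy < 0.2:
        return "closed"
    return "open" if energy < 0.6 else "round"

def fuse(baseplate, lip_image):
    """Stand-in for lip fusion: composite the lip image onto the baseplate."""
    return f"{baseplate}+{lip_image}"

def render_virtual_object(samples, baseplate="base_portrait"):
    frames = []
    for seg in segment_voice(samples):
        label = predict_lip_shape(seg)            # lip shape sequence information
        lip_image = LIP_IMAGE_LIBRARY[label]      # search corresponding lip image
        frames.append(fuse(baseplate, lip_image)) # lip fusion with baseplate
    return frames
```

In a real system the predictor would be a learned model and the fusion step an image compositing operation; here strings stand in for images so the routing of the pipeline stays visible.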

REAL-TIME GENERATION OF SPEECH ANIMATION
20220108510 · 2022-04-07 ·

To realistically animate a String (such as a sentence), a hierarchical search algorithm is provided that searches for stored examples (Animation Snippets) of sub-strings of the String, in decreasing order of sub-string length, and concatenates the retrieved sub-strings to complete the speech animation for the String. In one embodiment, real-time generation of speech animation uses Model Visemes to predict the animation sequences at onsets of visemes and a look-up-table-based (data-driven) algorithm to predict the dynamics at transitions between visemes. Specifically posed Model Visemes may be blended with speech animation generated using another method at the corresponding time points in the animation when the visemes are to be expressed. An Output Weighting Function is used to map Speech input and Expression input into Muscle-Based Descriptor weightings.
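The hierarchical search can be read as a greedy longest-match cover of the input string against a snippet library. The sketch below is one plausible interpretation under that reading; the snippet library contents and the skip-on-miss fallback are assumptions for illustration.

```python
# Sketch of the hierarchical search: look up stored Animation Snippets for
# sub-strings of the input String, trying longer sub-strings first, and
# concatenate the retrieved snippets. The library below is assumed.

SNIPPET_LIBRARY = {
    "hel": "anim_hel",
    "lo": "anim_lo",
    "h": "anim_h",
    "e": "anim_e",
    "l": "anim_l",
    "o": "anim_o",
}

def animate(string, library=SNIPPET_LIBRARY):
    """Greedily cover `string` with the longest stored sub-strings."""
    snippets, i = [], 0
    while i < len(string):
        # try the longest remaining prefix first, then shorter ones
        for length in range(len(string) - i, 0, -1):
            key = string[i:i + length]
            if key in library:
                snippets.append(library[key])
                i += length
                break
        else:
            i += 1  # no stored example for this unit: skip it (assumed fallback)
    return snippets
```

Because longer stored examples capture more coarticulation, preferring them over single-unit snippets is what makes the concatenated result look natural.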

Method and apparatus for controlling mouth shape changes of three-dimensional virtual portrait

Embodiments of the present disclosure relate to a method and apparatus for controlling mouth shape changes of a three-dimensional virtual portrait, relating to the field of cloud computing. The method may include: acquiring a to-be-played speech; sliding a preset time window at a preset step length in the to-be-played speech to obtain at least one speech segment; generating, based on the at least one speech segment, a mouth shape control parameter sequence for the to-be-played speech; and controlling, in response to playing the to-be-played speech, a preset mouth shape of the three-dimensional virtual portrait to change based on the mouth shape control parameter sequence.
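The windowing step above has a direct expression in code: slide a preset time window at a preset step length over the speech to obtain at least one segment. The window and step values here are illustrative assumptions.

```python
# Sketch of the segmentation step: slide a preset time window at a preset
# step length over the to-be-played speech, yielding possibly overlapping
# segments. Each segment would then be mapped to mouth shape control
# parameters by a downstream model (not shown).

def sliding_segments(speech, window, step):
    """Return at least one speech segment per window position."""
    segments = []
    for start in range(0, max(len(speech) - window, 0) + 1, step):
        segments.append(speech[start:start + window])
    return segments
```

With a step smaller than the window, consecutive segments overlap, which helps the resulting mouth shape control parameter sequence change smoothly across segment boundaries.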

LIVE STREAMING CONTROL METHOD AND APPARATUS, LIVE STREAMING DEVICE, AND STORAGE MEDIUM
20220101871 · 2022-03-31 ·

Embodiments of the present application relate to the technical field of the Internet and provide a live streaming control method and apparatus, a live streaming device, and a storage medium. Voice information of a live streamer is obtained, analyzed and processed, and, according to the processing result, a virtual image in the live streaming screen is controlled to execute an action matching the voice information. This improves the precision of controlling the virtual image and gives the virtual image a high degree of matching with the live streamer's streaming content.

Spatial audio and avatar control using captured audio signals

An audio system in a local area providing an audio signal to a headset of a remote user is presented herein. The audio system identifies sounds from a human sound source in the local area, based in part on sounds detected within the local area. The audio system generates an audio signal for presentation to a remote user within a virtual representation of the local area based in part on a location of the remote user within the virtual representation of the local area relative to a virtual representation of the human sound source within the virtual representation of the local area. The audio system provides the audio signal to a headset of the remote user, wherein the headset presents the audio signal as part of the virtual representation of the local area to the remote user.
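The abstract does not specify how the audio signal is derived from the remote user's location relative to the virtual sound source. As a hedged illustration only, the sketch below applies a simple inverse-distance gain, one common element of spatial audio rendering; it is an assumption, not the patent's method.

```python
import math

def spatialize(sample, source_pos, listener_pos, min_distance=1.0):
    """Attenuate a mono sample by the distance between the virtual
    representation of the sound source and the remote listener."""
    distance = math.dist(source_pos, listener_pos)
    # clamp so gain never exceeds 1.0 when listener is very close
    gain = min_distance / max(distance, min_distance)
    return sample * gain
```

A full renderer would also apply direction-dependent filtering (e.g. HRTFs) so the headset presents the source at the correct position in the virtual local area.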

SYSTEM AND METHOD FOR SYNTHESIZING PHOTO-REALISTIC VIDEO OF A SPEECH
20220084273 · 2022-03-17 ·

A system and a method for obtaining a photo-realistic video from a text. The method includes: providing the text and an image of a talking person; synthesizing a speech audio from the text; extracting an acoustic feature from the speech audio by an acoustic feature extractor; and generating the photo-realistic video from the acoustic feature and the image by a video generation neural network. The video generation neural network is pre-trained by: providing a training video and a training image; extracting a training acoustic feature from training audio of the training video by the acoustic feature extractor; generating video frames from the training image and the training acoustic feature by the video generation neural network; and comparing the generated video frames with ground truth video frames using a generative adversarial network (GAN). The ground truth video frames correspond to the training video frames.

Systems and methods for improving animation of computer-generated avatars

The disclosed computer-implemented method may include identifying a muscle group engaged by a user to execute a predefined body action by (1) capturing a set of images of the user while the user executes the predefined body action, and (2) associating a feature of a body of the user with the muscle group based on the predefined body action and the set of images. The method may also include determining, based on the set of images, a set of parameters associated with the user, the muscle group, and the predefined body action. The method may also include directing a computer-generated avatar that represents the body of the user to produce the predefined body action in accordance with the set of parameters. Various other methods, systems, and computer-readable media are also disclosed.

Computing system for expressive three-dimensional facial animation

A computer-implemented technique for animating a visual representation of a face based on spoken words of a speaker is described herein. A computing device receives an audio sequence comprising content features reflective of spoken words uttered by a speaker. The computing device generates latent content variables and latent style variables based upon the audio sequence. The latent content variables are used to synchronize movement of lips on the visual representation to the spoken words uttered by the speaker. The latent style variables are derived from an expected appearance of facial features of the speaker as the speaker utters the spoken words and are used to synchronize movement of full facial features of the visual representation to the spoken words uttered by the speaker. The computing device causes the visual representation of the face to be animated on a display based upon the latent content variables and the latent style variables.

SYSTEM AND METHOD FOR VOICE DRIVEN LIP SYNCING AND HEAD REENACTMENT

A system and method for voice-driven animation of an object in an image by: sampling an input video, depicting a puppet object, to obtain an image; receiving audio data; extracting voice-related features from the audio data; producing an expression representation based on the voice-related features, wherein the expression representation is related to a region of interest; obtaining, from the image, auxiliary data related to the image; and generating a target image based on the expression representation and the auxiliary data.

METHODS AND SYSTEMS FOR IMAGE AND VOICE PROCESSING

Systems and methods are disclosed that are configured to train an autoencoder using images that include faces, wherein the autoencoder comprises an input layer, an encoder configured to output a latent image from a corresponding input image, and a decoder configured to attempt to reconstruct the input image from the latent image. An image sequence of a face exhibiting a plurality of facial expressions and transitions between facial expressions is generated and accessed. Images of the plurality of facial expressions and transitions between facial expressions are captured from a plurality of different angles and under different lighting. An autoencoder is trained using source images that include the face with different facial expressions captured at different angles with different lighting, and using destination images that include a destination face. The trained autoencoder is used to generate an output in which the likeness of the face in the destination images is swapped with the likeness of the source face, while preserving the expressions of the destination face.
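The swap behavior described here is characteristic of face-swap autoencoders with a shared encoder and per-identity decoders: at swap time a destination image is encoded and then decoded with the source-identity decoder, so the source likeness is produced while the destination's expression, carried in the latent image, is preserved. The sketch below shows only this routing; toy linear maps stand in for the trained networks, and all specifics are assumptions.

```python
# Minimal sketch of shared-encoder / per-identity-decoder routing.
# Images are lists of floats; the "networks" are toy linear maps.

def encode(image):
    """Shared encoder: compress an image into a latent image."""
    return [0.5 * x for x in image]  # toy compression

def make_decoder(identity_bias):
    """Decoder trained for one identity (toy: identity-specific offset)."""
    def decode(latent):
        return [2.0 * z + identity_bias for z in latent]
    return decode

decode_source = make_decoder(identity_bias=1.0)        # trained on source face
decode_destination = make_decoder(identity_bias=-1.0)  # trained on destination face

def swap(destination_image):
    """Encode the destination face, then decode with the SOURCE decoder,
    so the output carries the source likeness with the destination's
    expression preserved in the shared latent image."""
    return decode_source(encode(destination_image))
```

Training both identities through the one encoder is what forces the latent image to carry identity-agnostic information such as pose and expression, which is why the cross-decoding step performs the swap.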