Patent classifications
G10L2021/105: Synthesis of the lips movements from speech, e.g. for "talking heads"
APPARATUS AND METHOD FOR GENERATING SPEECH SYNTHESIS IMAGE
An apparatus for generating a speech synthesis image according to a disclosed embodiment is an apparatus for generating a speech synthesis image based on machine learning, the apparatus including a first global geometric transformation predictor configured to be trained to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image based on the source image and the target image, a local feature tensor predictor configured to be trained to predict a feature tensor for a local motion of the person based on input target image-related information, and an image generator configured to be trained to reconstruct the target image based on the global geometric transformation, the source image, and the feature tensor for the local motion.
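As a concrete illustration of the three trainable modules this abstract names, the following is a minimal PyTorch sketch. The module names, network depths, tensor shapes, and the 2x3 affine parameterization of the global geometric transformation are all assumptions for illustration, not the embodiment's actual architecture.

```python
# Illustrative sketch only; names, shapes, and the affine parameterization
# are assumptions, not the patent's actual implementation.
import torch
import torch.nn as nn

class GlobalGeometricPredictor(nn.Module):
    """Predicts a 2x3 affine transform for the person's global motion."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 6),  # six affine parameters
        )

    def forward(self, source, target):
        theta = self.encoder(torch.cat([source, target], dim=1))
        return theta.view(-1, 2, 3)

class LocalFeatureTensorPredictor(nn.Module):
    """Predicts a feature tensor for local motion (e.g., facial movement)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )

    def forward(self, target_info):
        return self.encoder(target_info)

class ImageGenerator(nn.Module):
    """Reconstructs the target from the warped source plus local features."""
    def __init__(self):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(3 + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear"),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, theta, source, local_feats):
        # Warp the source by the predicted global transformation, then fuse
        # with the local-motion feature tensor to reconstruct the target.
        grid = nn.functional.affine_grid(theta, source.shape, align_corners=False)
        warped = nn.functional.grid_sample(source, grid, align_corners=False)
        warped = nn.functional.interpolate(warped, size=local_feats.shape[-2:])
        return self.decoder(torch.cat([warped, local_feats], dim=1))

src = tgt = torch.rand(1, 3, 64, 64)
theta = GlobalGeometricPredictor()(src, tgt)
feats = LocalFeatureTensorPredictor()(tgt)
recon = ImageGenerator()(theta, src, feats)  # (1, 3, 64, 64)
```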
Actor-Replacement System for Videos
In one aspect, an example method includes (i) estimating, using a skeletal detection model, a pose of an original actor for each of multiple frames of a video; (ii) obtaining, for each of a plurality of the estimated poses, a respective image of a replacement actor; (iii) obtaining replacement speech in the replacement actor's voice that corresponds to speech of the original actor in the video; (iv) generating, using the estimated poses, the images of the replacement actor, and the replacement speech, synthetic frames corresponding to the multiple frames of the video that depict the replacement actor in place of the original actor, with the synthetic frames including facial expressions for the replacement actor that temporally align with the replacement speech; and (v) combining the synthetic frames and the replacement speech so as to obtain a synthetic video that replaces the original actor with the replacement actor.
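The claim reads as a five-step pipeline, which the following hedged Python sketch makes explicit. Every callable passed in (estimate_pose, render_frame, mux) is a hypothetical placeholder, since the claim names no concrete models or libraries.

```python
# Sketch of the claimed five-step flow; all injected callables are
# hypothetical placeholders, not components named by the patent.
def replace_actor(frames, replacement_images, replacement_speech,
                  estimate_pose, render_frame, mux):
    # (i) estimate the original actor's pose in each frame
    poses = [estimate_pose(frame) for frame in frames]
    # (ii) a replacement-actor image per estimated pose, and
    # (iii) replacement speech, are supplied by the caller.
    # (iv) generate synthetic frames depicting the replacement actor,
    #      with facial expressions aligned in time to the speech.
    synthetic_frames = [
        render_frame(pose, image, replacement_speech, frame_index=t)
        for t, (pose, image) in enumerate(zip(poses, replacement_images))
    ]
    # (v) combine the synthetic frames with the replacement speech.
    return mux(synthetic_frames, replacement_speech)
```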
Identity transfer models for generating audio/video content
Systems, devices, and methods are provided for training and/or inferencing using machine-learning models. In at least one embodiment, a user selects a source media (e.g., video or audio file) and a target identity. A content embedding may be extracted from the source media, and an identity embedding may be obtained for the target identity. The content embedding of the source media and the identity embedding of the target identity may be provided to a transfer model that generates synthesized media. For example, a user may select a song that is sung by a first artist and then select a second artist as the target identity to produce a cover of the song in the voice of the second artist.
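A minimal sketch of the embedding-based transfer follows, assuming PyTorch, assumed embedding sizes (a 256-dim content embedding per frame, a 128-dim identity embedding), and an assumed mel-spectrogram output; the abstract does not specify the transfer model's form.

```python
# Illustrative transfer model: content embedding + identity embedding in,
# synthesized media features out. All dimensions are assumptions.
import torch
import torch.nn as nn

class TransferModel(nn.Module):
    def __init__(self, content_dim=256, identity_dim=128, hidden=512, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + identity_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),  # e.g., per-frame mel-spectrogram bins
        )

    def forward(self, content_emb, identity_emb):
        # Broadcast the single identity vector across all content frames,
        # then decode each (content, identity) pair into output features.
        identity = identity_emb.expand(content_emb.size(0), -1)
        return self.net(torch.cat([content_emb, identity], dim=-1))

# Usage: content from the source song, identity of the second artist.
content = torch.randn(400, 256)   # 400 frames of content embedding
identity = torch.randn(1, 128)    # target-identity embedding
mel = TransferModel()(content, identity)  # (400, 80) synthesized features
```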
SYSTEM AND METHOD FOR ANIMATED LIP SYNCHRONIZATION
A system and method for animated lip synchronization. The method includes: capturing speech input; parsing the speech input into phonemes; aligning the phonemes to the corresponding portions of the speech input; mapping the phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units.
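The phoneme-to-viseme mapping and the jaw/lip decomposition of viseme action units can be illustrated with a small Python sketch. The phoneme set, viseme labels, and numeric jaw/lip weights below are illustrative assumptions, not the patent's tables.

```python
# Illustrative phoneme->viseme table; entries and weights are assumed.
PHONEME_TO_VISEME = {
    "AA": "ah", "AE": "ah", "B": "mbp", "M": "mbp", "P": "mbp",
    "F": "fv", "V": "fv", "S": "ss", "Z": "ss", "UW": "ou",
}

# Each viseme action unit carries separate jaw and lip contributions.
VISEME_ACTION_UNITS = {
    "ah":  {"jaw": 0.8, "lip": 0.3},
    "mbp": {"jaw": 0.0, "lip": 1.0},   # lips fully closed
    "fv":  {"jaw": 0.1, "lip": 0.7},
    "ss":  {"jaw": 0.2, "lip": 0.4},
    "ou":  {"jaw": 0.4, "lip": 0.9},   # rounded lips
}

def to_action_units(aligned_phonemes):
    """aligned_phonemes: list of (phoneme, start_sec, end_sec) tuples."""
    units = []
    for phoneme, start, end in aligned_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "ah")
        units.append({"viseme": viseme, "start": start, "end": end,
                      **VISEME_ACTION_UNITS[viseme]})
    return units

print(to_action_units([("B", 0.00, 0.08), ("AA", 0.08, 0.31)]))
```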
Voice band detection and implementation
A system encourages experimentation with audio frequency and speaker technologies while causing an inanimate object to appear to lip-sync. The system applies a bandpass filter to an incoming audio stream to determine the magnitude of audio content in a frequency band of interest; for example, it may filter for the voice band associated with speech. A controller controls a strobe light to flash at a particular point of travel of a platform reciprocating at a known frequency, creating the illusion that a sculpture, such as a piece of paper formed into a ring, is lip-syncing to music.
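The voice-band magnitude measurement can be sketched with a standard bandpass filter. The sketch below assumes SciPy and a conventional 300-3400 Hz telephone voice band; the patent's exact band edges and filter design are not given in the abstract.

```python
# Voice-band magnitude via a bandpass filter. Band edges, filter order,
# and design (Butterworth) are conventional assumptions.
import numpy as np
from scipy.signal import butter, sosfilt

def voice_band_magnitude(samples, sample_rate=44100, low=300.0, high=3400.0):
    """Return the RMS magnitude of the audio within the voice band."""
    sos = butter(4, [low, high], btype="bandpass", fs=sample_rate, output="sos")
    band = sosfilt(sos, samples)
    return float(np.sqrt(np.mean(band ** 2)))

# Example: a 1 kHz tone falls inside the band, so its magnitude is high.
t = np.linspace(0, 1.0, 44100, endpoint=False)
print(voice_band_magnitude(np.sin(2 * np.pi * 1000 * t)))
```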
SYSTEMS, METHODS, DEVICES AND APPARATUSES FOR DETECTING FACIAL EXPRESSION
A system, method and apparatus for detecting facial expressions according to EMG signals.
Method of generating 3D video, method of training model, electronic device, and storage medium
A method of generating a 3D video, a method of training a neural network model, an electronic device, and a storage medium, which relate to the field of image processing, and in particular to the technical fields of computer vision, augmented/virtual reality, and deep learning. The method includes: determining, based on an input speech feature, a principal component analysis (PCA) coefficient by using a first network, wherein the PCA coefficient is used to generate the 3D video; correcting the PCA coefficient by using a second network; generating lip movement information based on the corrected PCA coefficient and a PCA parameter of the neural network model, wherein the neural network model includes the first network and the second network; and applying the lip movement information to a pre-constructed 3D basic avatar model to obtain a 3D video with a lip movement effect.
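The reconstruction step, corrected PCA coefficients combined with the model's PCA parameters to produce lip movement, is ordinary PCA synthesis. A numerical sketch with assumed shapes follows (32 components; a 468-vertex lip region borrowed from common face-mesh toolkits).

```python
# PCA synthesis sketch: mean shape + coefficients @ components.
# All shapes and the random stand-in data are illustrative assumptions.
import numpy as np

n_components, n_lip_coords = 32, 3 * 468  # e.g., 468 lip-region vertices

rng = np.random.default_rng(0)
mean_shape = rng.normal(size=n_lip_coords)                  # PCA mean
components = rng.normal(size=(n_components, n_lip_coords))  # PCA basis

def lip_movement(corrected_coeff):
    """Reconstruct lip geometry from a corrected PCA coefficient vector."""
    return mean_shape + corrected_coeff @ components

coeff = rng.normal(size=n_components)  # stands in for the second network's output
vertices = lip_movement(coeff).reshape(-1, 3)  # apply to the base avatar mesh
print(vertices.shape)  # (468, 3)
```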
METHOD AND APPARATUS FOR GENERATING FACIAL EXPRESSION AND TRAINING METHOD FOR GENERATING FACIAL EXPRESSION
A method and apparatus for generating a facial expression may receive an input image, and generate facial expression images that change from the input image based on an index indicating a facial expression intensity, the index being obtained from the input image.
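One plausible reading is a generator conditioned on the input image and a scalar intensity index; sweeping the index yields the sequence of changing expression images. The sketch below assumes PyTorch and an index tiled into an extra input channel, which is one conditioning choice among many.

```python
# Intensity-conditioned generator sketch; the conditioning scheme and
# architecture are illustrative assumptions, not the patent's method.
import torch
import torch.nn as nn

class ExpressionGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, intensity):
        # Tile the scalar intensity index into an extra input channel.
        idx = intensity.view(-1, 1, 1, 1).expand(-1, 1, *image.shape[-2:])
        return self.net(torch.cat([image, idx], dim=1))

gen = ExpressionGenerator()
image = torch.rand(1, 3, 64, 64)
# Sweep the index to produce images of increasing expression intensity.
frames = [gen(image, torch.tensor([i / 4])) for i in range(5)]
```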
Information processing method and information processing device
An information processing method includes receiving a change instruction to change a voice parameter used in synthesizing a voice for a set of texts, changing the voice parameter in accordance with the change instruction, changing, in accordance with the change instruction, an image parameter used in synthesizing an image of a virtual object, the virtual object indicating a character that vocalizes the voice that has been synthesized, synthesizing the voice using the changed voice parameter, and synthesizing the image using the changed image parameter.
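The point of the method is that a single change instruction updates a voice parameter and a matching image parameter together before synthesis. A small sketch follows; the coupling rule between pitch and mouth rendering is invented for illustration, and the synthesis calls are hypothetical.

```python
# Coupled parameter update sketch; the pitch->mouth rule is an assumed
# example of keeping appearance consistent with voice, not the patent's rule.
from dataclasses import dataclass

@dataclass
class CharacterState:
    voice_pitch: float = 1.0   # voice parameter used by the voice synthesizer
    mouth_scale: float = 1.0   # image parameter used by the renderer

def apply_change_instruction(state, pitch_delta):
    # One instruction changes both parameters before either synthesis step.
    state.voice_pitch += pitch_delta
    state.mouth_scale = 1.0 / state.voice_pitch
    return state

state = apply_change_instruction(CharacterState(), pitch_delta=0.2)
# synthesize_voice(texts, state.voice_pitch)   # hypothetical synthesis calls
# synthesize_image(state.mouth_scale)
print(state)
```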
PERIOCULAR AND AUDIO SYNTHESIS OF A FULL FACE IMAGE
Systems and methods for synthesizing an image of the face by a head-mounted device (HMD) are disclosed. The HMD may not be able to observe a portion of the face. The systems and methods described herein can generate a mapping from a conformation of the unobserved portion of the face to a conformation of the observed portion. The HMD can receive an image of a portion of the face and use the mapping to determine a conformation of the portion that is not observed. The HMD can combine the observed and unobserved portions to synthesize a full face image.
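The observed-to-unobserved relationship can be sketched as a regression between parameterized face conformations (e.g., blendshape vectors). The linear least-squares form, dimensions, and training setup below are illustrative assumptions; the abstract does not specify the mapping's form.

```python
# Regression sketch: predict the unobserved lower-face conformation from
# the observed periocular conformation. Dimensions and the synthetic
# training pairs are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_pairs, periocular_dim, lower_face_dim = 500, 20, 30

# Training pairs captured while the full face was observable.
observed = rng.normal(size=(n_pairs, periocular_dim))
unobserved = observed @ rng.normal(size=(periocular_dim, lower_face_dim))

# Fit a least-squares mapping from periocular to lower-face conformation.
W, *_ = np.linalg.lstsq(observed, unobserved, rcond=None)

def synthesize_full_face(periocular_conformation):
    lower = periocular_conformation @ W
    # Combine observed and predicted-unobserved parts into one vector.
    return np.concatenate([periocular_conformation, lower])

print(synthesize_full_face(rng.normal(size=periocular_dim)).shape)  # (50,)
```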