G10L2021/105

DEVICE AND METHOD FOR SYNTHESIZING IMAGE CAPABLE OF IMPROVING IMAGE QUALITY
20230177664 · 2023-06-08 ·

An image synthesis device according to a disclosed embodiment has one or more processors and a memory which stores one or more programs executed by the one or more processors. The image synthesis device includes a first artificial neural network provided to learn each of a first task of using a damaged image as an input to output a restored image and a second task of using an original image as an input to output a reconstructed image, and a second artificial neural network connected to an output layer of the first artificial neural network, and trained to use the reconstructed image output from the first artificial neural network as an input and improve the image quality of the reconstructed image.

METHOD AND SYSTEM FOR GENERATING 2D ANIMATED LIP IMAGES SYNCHRONIZING TO AN AUDIO SIGNAL

This disclosure relates generally to a method and system for generating 2D animated lip images synchronizing to an audio signal for an unseen subject. Recent advances in Convolutional Neural Network (CNN) based approaches generate convincing talking heads. Personalization of such talking heads requires training the model with a large number of samples of the target person, which is time-consuming. The lip generator system receives an audio signal and a target lip image of an unseen target subject as inputs from a user and processes these inputs to extract a plurality of high-dimensional audio image features. The lip generator system is meta-trained with a training dataset that covers a large variety of subjects' ethnicities and vocabulary. The meta-trained model generates realistic animation for a previously unseen face and unseen audio when fine-tuned with only a few-shot samples for a predefined interval of time. Additionally, the method protects intrinsic features of the unseen target subject.

Methods and systems for image and voice processing

Systems and methods are disclosed configured to train an autoencoder using images that include faces, wherein the autoencoder comprises an input layer, an encoder configured to output a latent image from a corresponding input image, and a decoder configured to attempt to reconstruct the input image from the latent image. An image sequence of a face exhibiting a plurality of facial expressions and transitions between facial expressions is generated and accessed. Images of the plurality of facial expressions and transitions between facial expressions are captured from a plurality of different angles and using different lighting. An autoencoder is trained using source images that include the face with different facial expressions captured at different angles with different lighting, and using destination images that include a destination face. The trained autoencoder is used to generate an output where the likeness of the face in the destination images is swapped with the likeness of the source face, while preserving expressions of the destination face.

SYSTEM AND METHOD TO INSERT VISUAL SUBTITLES IN VIDEOS

A system and method to insert visual subtitles in videos is described. The method comprises segmenting an input video signal to extract the speech segments and music segments. Next, a speaker representation is associated with each speech segment corresponding to a speaker visible in the frame. Further, the speech segments are analysed to compute the phones and the duration of each phone. The phones are mapped to corresponding visemes, and a viseme-based language model is created with a corresponding score. The most relevant viseme is selected for the speech segments by computing a total viseme score. Further, a speaker representation sequence is created such that phones and emotions in the speech segments are represented as reconstructed lip movements and eyebrow movements. The speaker representation sequence is then integrated with the music segments and superimposed on the input video signal to create subtitles.
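The phone-to-viseme mapping and total-score selection described in this abstract can be illustrated with a minimal sketch. The abstract does not define how the total viseme score is computed, so the sketch assumes it is the sum of phone durations per viseme, each weighted by a per-viseme language-model score. The table `PHONE_TO_VISEME`, the function `select_viseme`, and the `lm_score` weights are all hypothetical names introduced for illustration only.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny phone-to-viseme table (an assumption;
# a real system would cover a complete phone set).
PHONE_TO_VISEME = {
    "p": "closed_lips", "b": "closed_lips", "m": "closed_lips",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "aa": "open_jaw", "ae": "open_jaw",
    "uw": "rounded_lips", "ow": "rounded_lips",
}


def select_viseme(phones, lm_score=None):
    """Pick the viseme with the highest total score for one speech segment.

    phones:   list of (phone, duration_in_seconds) pairs.
    lm_score: optional dict mapping viseme -> weight (assumed stand-in for
              the language-model score); missing visemes default to 1.0.
    Returns (best_viseme, totals), or (None, {}) if no phone is mapped.
    """
    lm_score = lm_score or {}
    totals = defaultdict(float)
    for phone, duration in phones:
        viseme = PHONE_TO_VISEME.get(phone)
        if viseme is None:
            continue  # skip phones that the toy table does not cover
        totals[viseme] += duration * lm_score.get(viseme, 1.0)
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)
```

Calling `select_viseme([("p", 0.05), ("aa", 0.20), ("m", 0.04)])` accumulates the two bilabial phones into one viseme and the vowel into another, then returns whichever total is larger; passing `lm_score` shifts that balance.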

THREE-DIMENSIONAL EXPRESSION BASE GENERATION METHOD AND APPARATUS, SPEECH INTERACTION METHOD AND APPARATUS, AND MEDIUM
20220036636 · 2022-02-03 ·

This application provides a three-dimensional (3D) expression base generation method performed by a computer device. The method includes: obtaining image pairs of a target object in n types of head postures, each image pair including a color feature image and a depth image in a head posture; constructing a 3D human face model of the target object according to the n image pairs; and generating a set of expression bases of the target object according to the 3D human face model of the target object. According to this application, based on a reconstructed 3D human face model, a set of expression bases of a target object is further generated, so that more diversified product functions may be expanded based on the set of expression bases.

THREE-DIMENSIONAL FACE ANIMATION FROM SPEECH

A method for training a three-dimensional face animation model from speech is provided. The method includes determining a first correlation value for a facial feature based on an audio waveform from a first subject, generating a first mesh for a lower portion of a human face, based on the facial feature and the first correlation value, updating the first correlation value when a difference between the first mesh and a ground truth image of the first subject is greater than a pre-selected threshold, and providing a three-dimensional model of the human face animated by speech to an immersive reality application accessed by a client device based on the difference between the first mesh and the ground truth image of the first subject. A non-transitory, computer-readable medium storing instructions to cause a system to perform the above method, and the system, are also provided.

Photo-realistic synthesis of image sequences with lip movements synchronized with speech

Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model. An input audio feature vector corresponding to desired speech with which a synthesized image sequence will be synchronized is provided. The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with lip movements synchronized with the desired speech.
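The library lookup step in this abstract, in which generated visual feature vectors are used to find matching images, can be sketched as a nearest-neighbour search. This is only an illustration under stated assumptions: the abstract does not specify the matching criterion, so the sketch assumes an independent per-frame Euclidean nearest neighbour, and the names `nearest_frame` and `match_sequence` as well as the `(feature_vector, frame_id)` library layout are hypothetical.

```python
import math


def nearest_frame(query, library):
    """Return the index of the library entry closest to `query`.

    library: list of (feature_vector, frame_id) pairs (assumed layout).
    Distance is plain Euclidean distance, which is an assumption.
    """
    return min(range(len(library)),
               key=lambda i: math.dist(query, library[i][0]))


def match_sequence(trajectory, library):
    """Map each visual feature vector in the trajectory to a frame id.

    The result is the concatenated sequence of library frames, one per
    input vector, chosen independently for each time step.
    """
    return [library[nearest_frame(vec, library)][1] for vec in trajectory]


# Toy image library: 2-D feature vectors standing in for mouth shapes.
image_library = [
    ((0.0, 0.0), "closed"),
    ((1.0, 0.0), "half_open"),
    ((1.0, 1.0), "open"),
]
```

For example, `match_sequence([(0.1, 0.0), (0.9, 0.2), (0.8, 0.9)], image_library)` picks one library frame per trajectory point. A practical system would likely also penalise abrupt jumps between consecutive frames, which this per-frame sketch ignores.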

Photorealistic CGI Generated Character
20170221504 · 2017-08-03 ·

This document presents an apparatus and method for creating and displaying a three-dimensional CGI character. The 3D CGI character is created as a digital file that is displayed on a silhouette that has an outline of the character created. The top portion of the display silhouette is transparent upon which the 3D CGI character is displayed by a holographic projector. The bottom half of the display silhouette is opaque and conceals the CPU and projector components. The entire silhouette performs as a stand-alone display unit that may be placed in any location a user requires.

SYSTEM AND METHOD FOR AN INTERACTIVE QUERY UTILIZING A SIMULATED PERSONALITY
20170269946 · 2017-09-21 ·

A system and method provides for an interactive query comprising a first input module capable of receiving input for creating a simulated personality for a first user. An expert system is capable of creating and storing the simulated personality. An output module is used for presenting the simulated personality to a second user. An interactive query module is capable of allowing the second user to communicate with the simulated personality of the first user.

IMAGE PROCESSING DEVICE, ANIMATION DISPLAY METHOD AND COMPUTER READABLE MEDIUM

An image processing device includes a controller and a display. The controller adds an expression to a displayed face image in accordance with audio when the audio is output. Further, the controller generates an animation in which a mouth contained in the face image with the expression moves in sync with the audio. The display displays the generated animation.