G10L2021/105

Method for Audio-Driven Character Lip Sync, Model for Audio-Driven Character Lip Sync and Training Method Therefor
20240054711 · 2024-02-15

Embodiments of the present disclosure provide a method for audio-driven character lip sync, a model for audio-driven character lip sync, and a training method therefor. A target dynamic image is obtained by acquiring a character image of a target character and speech for generating the target dynamic image, processing the character image and the speech into trainable image-audio data, and mixing the image-audio data with auxiliary data for training. Where a large amount of sample data would otherwise need to be obtained for training in different scenarios, a video of another character speaking is processed as an auxiliary video to obtain the auxiliary data. The auxiliary data, which replaces non-general sample data, is input into the model together with the other data in a preset ratio for training. The auxiliary data improves the training of the model's synthetic lip sync action by keeping parts unrelated to the lip sync action out of the training process. In this way, the problem that a large amount of sample data is required during training is resolved.
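
The abstract does not spell out the training procedure, but the preset-ratio mixing step can be pictured as a small batch-assembly routine. The sketch below is a minimal illustration; the function name, the 0.5 ratio, and the batch size are hypothetical placeholders, not values from the patent.

```python
import random

def mix_training_batch(target_pairs, auxiliary_pairs, aux_ratio=0.5, batch_size=8):
    """Assemble one training batch by mixing target image-audio pairs with
    auxiliary pairs (derived from videos of other characters speaking) at a
    preset ratio, as the abstract describes.

    aux_ratio is the fraction of the batch drawn from auxiliary data; the
    value 0.5 is an illustrative assumption, not taken from the patent.
    """
    n_aux = int(batch_size * aux_ratio)
    n_target = batch_size - n_aux
    batch = random.sample(auxiliary_pairs, n_aux) + random.sample(target_pairs, n_target)
    random.shuffle(batch)  # interleave auxiliary and target samples
    return batch
```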

LEARNING METHOD FOR GENERATING LIP SYNC IMAGE BASED ON MACHINE LEARNING AND LIP SYNC IMAGE GENERATION DEVICE FOR PERFORMING SAME
20240055015 · 2024-02-15

A lip sync image generation device based on machine learning according to a disclosed embodiment includes an image synthesis model and a lip sync discrimination model, both artificial neural network models. The image synthesis model takes a person background image and an utterance audio signal as input and generates a lip sync image. The lip sync discrimination model discriminates the degree of match between the lip sync image generated by the image synthesis model and the utterance audio signal input to the image synthesis model.
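
The two-model arrangement reads like an adversarial setup, with the discriminator scoring audio-image match. Below is a minimal PyTorch sketch under that assumption; the linear stand-in networks, layer sizes, and loss choice are illustrative guesses, not the patent's architecture.

```python
import torch
import torch.nn as nn

class ImageSynthesisModel(nn.Module):
    """Stand-in generator: person background image + utterance audio -> lip sync image."""
    def __init__(self, img_dim=1024, audio_dim=128):
        super().__init__()
        self.net = nn.Linear(img_dim + audio_dim, img_dim)  # placeholder network

    def forward(self, person_bg, utterance_audio):
        return self.net(torch.cat([person_bg, utterance_audio], dim=-1))

class LipSyncDiscriminator(nn.Module):
    """Stand-in discriminator: scores the degree of match between image and audio."""
    def __init__(self, img_dim=1024, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + audio_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, lip_sync_image, utterance_audio):
        return self.net(torch.cat([lip_sync_image, utterance_audio], dim=-1))

gen, disc = ImageSynthesisModel(), LipSyncDiscriminator()
person_bg = torch.randn(4, 1024)   # flattened person background images (illustrative)
audio = torch.randn(4, 128)        # utterance audio features (illustrative)
fake = gen(person_bg, audio)
sync_score = disc(fake, audio)     # generator is trained to raise this match score
gen_loss = nn.functional.binary_cross_entropy_with_logits(
    sync_score, torch.ones_like(sync_score))
```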

Interactive systems and methods

A method of producing an avatar video, the method comprising the steps of: providing a reference image of a person's face; providing a plurality of characteristic features representative of a facial model X0 of the person's face, the characteristic features defining a facial pose dependent on the person speaking; providing a target phrase to be rendered over a predetermined time period during the avatar video and providing a plurality of time intervals t within the predetermined time period; generating, for each of said time intervals t, speech features from the target phrase, to provide a sequence of speech features; and generating, using the plurality of characteristic features and the sequence of speech features, a sequence of facial models Xt for each of said time intervals t.
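
The claim enumerates a concrete per-interval pipeline, which can be pictured as a short loop. In the sketch below, speech_feature_fn and facial_model_fn are hypothetical stand-ins for the unspecified feature extractor and facial model generator.

```python
def generate_avatar_models(characteristic_features, target_phrase, time_intervals,
                           speech_feature_fn, facial_model_fn):
    """Sketch of the claimed pipeline: for each time interval t, extract speech
    features from the target phrase, then combine them with the characteristic
    features of the facial model X0 to produce a facial model Xt.

    speech_feature_fn and facial_model_fn are hypothetical callables; the
    patent does not define their implementations.
    """
    # One speech feature per time interval t, in sequence.
    speech_features = [speech_feature_fn(target_phrase, t) for t in time_intervals]
    # One facial model Xt per interval, driven by the speech features.
    return [facial_model_fn(characteristic_features, f) for f in speech_features]
```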

Processing speech signals of a user to generate a visual representation of the user
11900940 · 2024-02-13

A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors, and a data processing device comprising the one or more processors and configured to generate a representation of the speaker that produced the voice signal in response to receiving it. The data processing device executes a voice embedding function to generate, from the voice signal, a feature vector representing one or more signal features of the voice signal; maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
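
The abstract names three processing stages that compose naturally. The sketch below assumes the feature vector is a sequence of signal features; all three callables are hypothetical stand-ins, since the patent does not expose an API.

```python
def speaker_image_from_voice(voice_signal, voice_embedding, modality_transfer, render):
    """Hedged sketch of the claimed pipeline; the callables are hypothetical.

    voice_embedding  : voice signal -> feature vector of signal features
    modality_transfer: signal feature -> visual feature of the speaker
    render           : visual features -> visual representation (image data)
    """
    feature_vector = voice_signal and voice_embedding(voice_signal)
    # Map each signal feature to a visual feature of the speaker.
    visual_features = [modality_transfer(f) for f in feature_vector]
    # The visual representation comprises the mapped visual features.
    return render(visual_features)
```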

SYSTEM, METHOD, AND COMPUTER PROGRAM FOR TRANSMITTING FACE MODELS BASED ON FACE DATA POINTS
20190379863 · 2019-12-12

A system, method, and computer program are provided for receiving face models based on face data points. In use, a real-time face model is received, wherein the real-time face model includes one or more face data points. Real-time face data points are then received, including one or more additional face data points, and the real-time face model is manipulated based on the real-time face data points.
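
The manipulation step can be pictured as merging incoming data points into a stored model. The dict-of-points representation below is purely an assumption for illustration; the patent does not define one.

```python
def update_face_model(face_model, realtime_points):
    """Illustrative sketch only: merge newly received real-time face data
    points into a previously received face model, manipulating it in place.

    Both arguments are assumed to be dicts mapping point IDs to positions.
    """
    for point_id, position in realtime_points.items():
        face_model[point_id] = position  # overwrite or add each data point
    return face_model
```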

PRODUCTION OF SPEECH BASED ON WHISPERED SPEECH AND SILENT SPEECH

A method, a system, and a computer program product are provided for interpreting low-amplitude speech and transmitting amplified speech to a remote communication device. At least one computing device receives sensor data from multiple sensors, the sensor data being associated with the low-amplitude speech. At least one of the computing devices analyzes the sensor data to map it to at least one syllable, resulting in a string of one or more words. An electronic representation of the string of one or more words may be generated and transmitted to a remote communication device for producing the amplified speech from the electronic representation.
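
The flow from sensors to amplified speech can be sketched in a few lines. All callables below are hypothetical stand-ins, since the patent names no APIs, and the UTF-8 encoding is only one possible electronic representation.

```python
def amplify_low_amplitude_speech(sensor_frames, syllable_classifier,
                                 assemble_words, transmit):
    """Sketch of the described flow under stated assumptions.

    syllable_classifier: one frame of multi-sensor data -> a syllable
    assemble_words     : syllables -> a string of one or more words
    transmit           : sends bytes to the remote communication device
    """
    # Map sensor data to syllables, then syllables to a word string.
    syllables = [syllable_classifier(frame) for frame in sensor_frames]
    words = assemble_words(syllables)
    # Generate an electronic representation of the string and transmit it.
    transmit(words.encode("utf-8"))
```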

Systems And Methods For Generating Composite Media Using Distributed Networks
20190370554 · 2019-12-05

Distributed systems and methods for generating composite media include receiving a media context that defines the media to be generated, the media context including a definition of a sequence of media segment specifications and an identification of a set of remote devices. For each media segment specification, a reference segment may be generated and transmitted to at least one remote device. A media segment, recorded by a camera, may be received from each of the remote devices. Verified media segments may replace the corresponding reference segments. The media segments may be aggregated and an updated sequence of media segments defined. An instance of the media context that includes a subset of the updated sequence of media segments may then be generated.
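
A rough picture of the distributed loop, under stated assumptions: the media_context layout and all helper callables are hypothetical, and the device-selection strategy is illustrative only.

```python
def build_composite_media(media_context, generate_reference, record_on_device, verify):
    """Sketch of the distributed flow described in the abstract.

    media_context = {"segments": [spec, ...], "devices": [device, ...]}
    (an assumed layout; the patent does not specify the data structure).
    """
    updated_sequence = []
    for spec in media_context["segments"]:
        reference = generate_reference(spec)  # reference segment for this spec
        # Transmit the reference to at least one remote device and collect
        # the camera-recorded segment back (device choice is illustrative).
        for device in media_context["devices"]:
            recorded = record_on_device(device, reference)
            if verify(recorded, spec):
                # A verified segment replaces the corresponding reference.
                updated_sequence.append(recorded)
                break
        else:
            updated_sequence.append(reference)  # keep the reference if unverified
    return updated_sequence  # an instance may then use a subset of this sequence
```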

Systems and methods for generating synthetic videos based on audio contents
11968433 · 2024-04-23

Systems and methods for generating a synthetic video based on audio are provided. An exemplary system may include a memory storing computer-readable instructions and at least one processor. The processor may execute the computer-readable instructions to perform operations. The operations may include receiving a reference video including a motion picture of a human face and receiving the audio, which includes speech. The operations may also include generating a synthetic motion picture of the human face based on the reference video and the audio. The synthetic motion picture of the human face may include a motion of a mouth of the human face presenting the speech. The motion of the mouth may match the content of the speech. The operations may further include generating the synthetic video based on the synthetic motion picture of the human face.
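
The two claimed stages compose naturally. In the sketch below, motion_generator and compositor are hypothetical stand-ins for the unspecified models.

```python
def synthesize_video(reference_video, speech_audio, motion_generator, compositor):
    """Sketch of the two claimed stages; both callables are hypothetical
    stand-ins, since the abstract does not specify the models used.
    """
    # Stage 1: synthetic motion picture of the face, with the mouth motion
    # matching the content of the speech.
    face_motion = motion_generator(reference_video, speech_audio)
    # Stage 2: the final synthetic video built from the synthetic face motion.
    return compositor(face_motion, speech_audio)
```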

Method for providing speech video and computing device for executing the method
11967336 · 2024-04-23

A computing device according to an embodiment is provided with one or more processors and a memory storing one or more programs executed by the one or more processors. The computing device includes a standby state video generating module that generates a standby state video in which a person in a video is in a standby state; a speech state video generating module that generates a speech state video in which the person is in a speech state, based on a source of speech content; and a video reproducing module that reproduces the standby state video and generates a synthesized speech video by synthesizing the standby state video being reproduced with the speech state video.
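
The three modules the abstract names can be pictured as a small playback loop. The speech_source interface and all callables below are hypothetical assumptions, not the patent's design.

```python
def play_speech_video(person_video, speech_source, make_standby, make_speech, blend):
    """Sketch of the three modules: standby video generation, speech video
    generation, and reproduction/synthesis. All callables are hypothetical.
    """
    standby = make_standby(person_video)  # standby state video
    while True:
        if speech_source.has_content():   # assumed interface on the speech source
            speech = make_speech(person_video, speech_source.next())
            # Synthesize the standby video being reproduced with the speech video.
            yield blend(standby, speech)
        else:
            yield standby                 # keep reproducing the standby video
```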

Wearable speech input-based to moving lips display overlay
11955135 · 2024-04-09

Eyewear having a speech-to-moving-lips algorithm that receives and translates the speech and utterances of a person viewed through the eyewear, and then displays an overlay of moving lips corresponding to that speech on the mask of the viewed person. A database of text-to-moving-lips information is used to translate the speech and generate the moving lips in near real time with little latency. This translation gives deaf and hearing-impaired users the ability to understand and communicate with the person viewed through the eyewear while that person is wearing a mask. The translation may include automatic speech recognition (ASR) and natural language understanding (NLU) as a sound recognition engine.
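
The pipeline from heard speech to a displayed lip overlay can be sketched briefly. Below, asr, viseme_db, and draw_overlay are hypothetical stand-ins for the sound recognition engine, the text-to-moving-lips database, and the eyewear display; none are named in the patent.

```python
def render_lip_overlay(audio_frame, asr, viseme_db, draw_overlay):
    """Sketch of the described pipeline under stated assumptions.

    asr         : speech and utterances (audio) -> text
    viseme_db   : text-to-moving-lips database lookup -> lip animation frames
    draw_overlay: renders the moving lips over the viewed person's mask
    """
    text = asr(audio_frame)              # recognize the viewed person's speech
    lip_frames = viseme_db.lookup(text)  # translate text to moving lips
    draw_overlay(lip_frames)             # overlay on the mask, in near real time
```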