G10L2021/105

Information processing apparatus, information processing method, and program
09557956 · 2017-01-31

An information processing apparatus is provided which includes a metadata extraction unit for analyzing an audio signal in which a plurality of instrument sounds are mixed and for extracting, as a feature quantity of the audio signal, metadata that changes as the playing time elapses, and a player parameter determination unit for determining, based on the metadata extracted by the metadata extraction unit, a player parameter for controlling the movement of a player object corresponding to each instrument sound.
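
As a rough illustration of the idea in this abstract, the sketch below extracts a simple time-varying feature (short-time energy) from a mixed audio signal and maps it to a motion parameter for a player object. The energy-based "metadata" and the parameter mapping are assumptions for illustration, not the apparatus's actual analysis.

```python
import numpy as np

def extract_metadata(audio: np.ndarray, hop: int = 1024) -> np.ndarray:
    """Toy 'metadata': short-time energy of the mixed signal over the playing time."""
    frames = [audio[i:i + hop] for i in range(0, len(audio) - hop, hop)]
    return np.array([float(np.mean(f ** 2)) for f in frames])

def player_parameters(metadata: np.ndarray) -> dict:
    """Map the time-varying feature to a motion-intensity parameter for a player object."""
    peak = metadata.max() or 1.0
    return {"motion_intensity": (metadata / peak).tolist()}

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0.0, 2.0, 2 * sr)
    mix = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 220 * t)
    params = player_parameters(extract_metadata(mix))
    print(len(params["motion_intensity"]), "frames of motion intensity")
```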

THREE-DIMENSIONAL FACE ANIMATION FROM SPEECH

A method for training a three-dimensional face animation model from speech is provided. The method includes determining a first correlation value for a facial feature based on an audio waveform from a first subject, generating a first mesh for a lower portion of a human face based on the facial feature and the first correlation value, updating the first correlation value when a difference between the first mesh and a ground truth image of the first subject is greater than a pre-selected threshold, and providing a three-dimensional model of the human face, animated by speech, to an immersive reality application accessed by a client device based on the difference between the first mesh and the ground truth image of the first subject. A non-transitory, computer-readable medium storing instructions to cause a system to perform the above method, and the system itself, are also provided.
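
A hedged sketch of the training loop the abstract describes: a scalar correlation value drives a lower-face mesh generator, and the value is updated while the difference between the generated mesh and the ground truth exceeds a threshold. The mesh generator, the mean-squared difference, and the gradient step below are stand-ins, not the claimed model.

```python
import numpy as np

def lower_face_mesh(feature: np.ndarray, correlation: float) -> np.ndarray:
    """Stand-in generator: offset a base lower-face mesh by the correlated feature."""
    base = np.zeros((64, 3))                 # 64 placeholder vertices
    base[:, 1] = np.linspace(0.0, 1.0, 64)   # rough jaw-to-lip layout
    return base + correlation * feature.mean()

def train_correlation(feature, ground_truth, threshold=1e-3, lr=0.1, steps=200):
    corr = 0.0
    for _ in range(steps):
        mesh = lower_face_mesh(feature, corr)
        diff = float(np.mean((mesh - ground_truth) ** 2))
        if diff <= threshold:                # stop once the mesh is close enough
            break
        # gradient of the mean-squared difference with respect to `corr`
        corr -= lr * 2.0 * feature.mean() * float(np.mean(mesh - ground_truth))
    return corr

feature = np.array([0.4, 0.6, 0.5])
target = lower_face_mesh(feature, 0.8)       # pretend ground truth
print(round(train_correlation(feature, target), 3))   # approaches 0.8
```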

VIRTUAL PHOTOREALISTIC DIGITAL ACTOR SYSTEM FOR REMOTE SERVICE OF CUSTOMERS
20170011745 · 2017-01-12

A system for remote servicing of customers includes an interactive display unit at the customer location that provides two-way audio/visual communication with a remote service/sales agent, wherein communication inputted by the agent is delivered to customers via a virtual Digital Actor on the display. The system also provides remote customer service using physical mannequins with interactive capability and two-way audio/visual communication with the remote agent, wherein communication inputted by the remote service or sales agent is delivered to customers through the physical mannequin. A web solution integrates the virtual Digital Actor system into a business website. A smartphone solution provides the remote service to customers via an app. In another embodiment, the Digital Actor is instead displayed as a 3D hologram. The Digital Actor is also used in an e-learning solution, in a movie studio suite, and as a presenter on TV, online, or in other broadcasting applications.

Real-time generation of speech animation

To realistically animate a String (such as a sentence), a hierarchical search algorithm is provided to search for stored examples (Animation Snippets) of sub-strings of the String, in decreasing order of sub-string length, and to concatenate the retrieved sub-strings to complete the String of speech animation. In one embodiment, real-time generation of speech animation uses model visemes to predict the animation sequences at onsets of visemes and a look-up-table-based (data-driven) algorithm to predict the dynamics at transitions between visemes. Specifically posed Model Visemes may be blended with speech animation generated using another method at the corresponding time points in the animation when the visemes are to be expressed. An Output Weighting Function is used to map Speech input and Expression input into Muscle-Based Descriptor weightings.
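
A minimal sketch of the hierarchical lookup step described above, assuming a dictionary of stored Animation Snippets keyed by sub-string: the longest matching sub-strings are retrieved first and concatenated to cover the input String. The greedy cover and the snippet store are illustrative assumptions.

```python
def cover_with_snippets(text: str, snippets: dict) -> list:
    """Greedily cover `text` with the longest stored sub-strings (Animation Snippets)."""
    result, i = [], 0
    while i < len(text):
        # try sub-strings in decreasing order of length
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in snippets:
                result.append(snippets[piece])
                i += length
                break
        else:
            i += 1          # no snippet covers this character; skip it
    return result

store = {"hel": "anim_hel", "lo": "anim_lo", "o": "anim_o"}
print(cover_with_snippets("hello", store))   # ['anim_hel', 'anim_lo']
```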

Method for generating a talking head video with mouth movement sequence, device and computer-readable storage medium

A method for generating a talking head video includes: obtaining a text and an image containing a face of a user; determining a phoneme sequence that corresponds to the text and includes one or more phonemes; determining acoustic features corresponding to the text according to the phoneme sequence, and obtaining synthesized speech corresponding to the text according to the acoustic features; determining a first mouth movement sequence corresponding to the text according to the phoneme sequence, and determining a second mouth movement sequence corresponding to the text according to the acoustic features; creating a facial action video corresponding to the user according to the first mouth movement sequence, the second mouth movement sequence and the image; and processing the synthesized speech and the facial action video synchronously to obtain a talking head video corresponding to the user.
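
The pipeline in this abstract can be sketched as a chain of stages, text to phonemes to acoustic features to speech, with two mouth movement sequences combined per frame. The placeholder phoneme lookup, acoustic features, and mouth rules below are assumptions standing in for the patented components.

```python
from dataclasses import dataclass

@dataclass
class TalkingHeadResult:
    speech: list           # synthesized speech frames
    facial_video: list     # facial action frames

def generate_talking_head(text: str, face_image: str) -> TalkingHeadResult:
    phonemes = list(text.lower())                         # placeholder phoneme sequence
    acoustic = [hash(p) % 100 / 100.0 for p in phonemes]  # placeholder acoustic features
    speech = [a * 2 for a in acoustic]                    # "synthesized" speech frames
    mouth_from_phonemes = [ord(p) % 5 for p in phonemes]  # first mouth movement sequence
    mouth_from_acoustic = [int(a * 5) for a in acoustic]  # second mouth movement sequence
    frames = [
        {"image": face_image, "mouth": (m1 + m2) / 2}     # combine both sequences per frame
        for m1, m2 in zip(mouth_from_phonemes, mouth_from_acoustic)
    ]
    return TalkingHeadResult(speech=speech, facial_video=frames)

print(generate_talking_head("hi", "user_face.png"))
```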

WRONG PHRASE REPLACEMENT
20250174249 · 2025-05-29

According to one embodiment, a method, computer system, and computer program product for wrong phrase replacement is provided. The embodiment may include, in response to identifying an error spoken by a presenter in a multimedia file, generating a plan to correct the error. The embodiment may also include generating a corrected audio segment based on the plan. The embodiment may further include replacing an original audio segment in the multimedia file containing the error with the corrected audio segment. The embodiment may also include modifying a lip movement in a video segment of the multimedia file so lip movements of the presenter correspond to respective phonetics in the corrected audio segment. The embodiment may further include replacing an original lip movement with the modified lip movement so that the modified lip movement corresponds with the corrected audio segment.
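
A minimal sketch of the splicing step under the assumption that audio and video are frame-aligned lists and the erroneous span is already known: the corrected audio and re-timed lip frames replace the original span. The interface and frame representation are hypothetical.

```python
def replace_wrong_phrase(audio, video, error_span, corrected_audio, corrected_lips):
    """Splice corrected audio and matching lip frames over the erroneous span.

    `audio` and `video` are frame-aligned lists; `error_span` is a (start, end)
    frame index pair; the corrected inputs are assumed to match the span length.
    """
    start, end = error_span
    fixed_audio = audio[:start] + corrected_audio + audio[end:]
    fixed_video = video[:start] + corrected_lips + video[end:]
    return fixed_audio, fixed_video

audio = list(range(10))
video = [f"frame{i}" for i in range(10)]
new_audio, new_video = replace_wrong_phrase(audio, video, (3, 6), [99, 99, 99],
                                            ["lip_a", "lip_b", "lip_c"])
print(new_audio)   # [0, 1, 2, 99, 99, 99, 6, 7, 8, 9]
```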

Method for providing speech video and computing device for executing the method
12367892 · 2025-07-22

In a method of providing a speech video according to an embodiment, a standby-state video in which a person in the video is in a standby state is reproduced; a speech-state video in which the person is in a speech state is generated based on a source of speech content; the standby-state video being reproduced is returned to a reference frame of the standby-state video based on a back-motion image; and a synthesized speech video is generated by synthesizing the returned reference frame and the speech-state video.
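
A hedged sketch of the return-and-splice behaviour: while the standby-state video loops, arrival of speech content triggers a back motion that returns playback to a reference frame before the generated speech-state clip is appended. The frame lists and the reversed-frame back motion are assumptions for illustration.

```python
def synthesize_speech_video(standby_frames, speech_frames, current_index, reference_index):
    """Return to the reference frame of the standby clip, then append the speech clip."""
    if current_index >= reference_index:
        back_motion = standby_frames[reference_index:current_index + 1][::-1]
    else:
        back_motion = standby_frames[current_index:reference_index + 1]
    return back_motion + speech_frames

standby = [f"standby_{i}" for i in range(8)]
speech = [f"speech_{i}" for i in range(3)]
print(synthesize_speech_video(standby, speech, current_index=5, reference_index=2))
```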

Speech-driven animation using one or more neural networks

Apparatuses, systems, and techniques are presented to generate digital content. In at least one embodiment, one or more neural networks are used to generate video information based at least in part upon voice information and a combination of image features and facial landmarks corresponding to one or more images of a person.
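
A minimal PyTorch-style sketch, assuming a simple fusion design: per-frame voice features are concatenated with image features and facial landmarks from a reference image and mapped to per-frame video features. The layer sizes, the concatenation fusion, and the feature dimensions are assumptions, not the disclosed networks.

```python
import torch
import torch.nn as nn

class SpeechDrivenAnimator(nn.Module):
    """Fuse voice features with image features and landmarks to predict frame features."""
    def __init__(self, voice_dim=80, image_dim=256, landmark_dim=136, out_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(voice_dim + image_dim + landmark_dim, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, voice, image_feats, landmarks):
        # voice: (T, voice_dim); image_feats: (image_dim,); landmarks: (landmark_dim,)
        t = voice.shape[0]
        context = torch.cat([image_feats, landmarks]).expand(t, -1)
        return self.fuse(torch.cat([voice, context], dim=-1))   # (T, out_dim) frame features

model = SpeechDrivenAnimator()
frames = model(torch.randn(25, 80), torch.randn(256), torch.randn(136))
print(frames.shape)   # torch.Size([25, 512])
```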

Synthetic emotion in continuously generated voice-to-video system

One example method includes collecting an audio segment that includes audio data generated by a user, analyzing the audio data to identify an emotion expressed by the user, computing start and end indices of a video segment, selecting video data that shows the emotion expressed by the user, using the video data and the start and end indices of the video segment to modify a face of the user as the face appears in the video segment so as to generate modified face frames, and stitching the modified face frames into the video segment to create a modified video segment with the emotion expressed by the user, where the modified video segment includes the audio data generated by the user.
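
A rough sketch of the final stitching step, assuming the emotion-matched face frames and the start/end indices have already been computed: the faces in that span are replaced and the user's original audio is kept. The frame representation and helper function are hypothetical.

```python
def apply_emotion(video_frames, audio, start, end, emotion_face_frames):
    """Stitch emotion-modified face frames into [start, end) and keep the user's audio."""
    modified = list(video_frames)
    for offset, face in enumerate(emotion_face_frames):
        index = start + offset
        if index < end:
            modified[index] = face          # replace the face shown in this frame
    return {"video": modified, "audio": audio}

clip = apply_emotion([f"frame{i}" for i in range(6)], "user_audio.wav",
                     start=2, end=5, emotion_face_frames=["happy_a", "happy_b", "happy_c"])
print(clip["video"])   # ['frame0', 'frame1', 'happy_a', 'happy_b', 'happy_c', 'frame5']
```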

Creating images, meshes, and talking animations from mouth shape data

Creating images and animations of lip motion from mouth shape data includes providing, as one or more input features to a neural network model, a vector of a plurality of coefficients. Each coefficient of the plurality of coefficients corresponds to a different mouth shape. Using the neural network model, a data structure output is generated that specifies a visual representation of a mouth, including lips, having a shape corresponding to the vector.
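
A minimal sketch under a blendshape-style assumption: each coefficient in the input vector weights a different basis mouth shape, and the weighted sum stands in for the structured mesh output the neural network model would produce. The linear decoder below is only an illustration of the input/output relationship, not the claimed model.

```python
import numpy as np

def mouth_mesh_from_coefficients(coefficients: np.ndarray, basis_shapes: np.ndarray) -> np.ndarray:
    """Decode a coefficient vector into mouth vertices as a weighted sum of basis shapes."""
    # coefficients: (K,); basis_shapes: (K, V, 3) -> output mesh: (V, 3)
    return np.tensordot(coefficients, basis_shapes, axes=1)

rng = np.random.default_rng(0)
basis = rng.normal(size=(8, 32, 3))        # 8 mouth-shape bases, 32 vertices each
coeffs = rng.uniform(size=8)               # one coefficient per mouth shape
print(mouth_mesh_from_coefficients(coeffs, basis).shape)   # (32, 3)
```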