G10L2021/105

System and method for talking avatar
11600290 · 2023-03-07 ·

Aspects of this disclosure provide techniques for generating a viseme and corresponding intensity pair. In some embodiments, the method includes generating, by a server, a viseme and corresponding intensity pair based on at least one of a clean vocal track or a corresponding transcription. The method may include generating, by the server, a compressed audio file based on at least one of the viseme, the corresponding intensity, music, or a visual offset. The method may further include generating, by the server or a client-end application, a buffer of raw pulse-code modulated (PCM) data by decoding at least a part of the compressed audio file, where the viseme is scheduled to align with a corresponding phoneme.
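A minimal sketch of the scheduling idea in the abstract above: timed phonemes are mapped to (viseme, intensity) pairs so each viseme fires when its phoneme starts in the decoded PCM playback. The phoneme-to-viseme table and the loudness-based intensity heuristic are illustrative assumptions, not the patent's actual mapping.

```python
# Minimal phoneme -> viseme lookup (assumed; real systems use fuller tables).
PHONEME_TO_VISEME = {
    "AA": "aa", "IY": "ee", "UW": "oo", "M": "mbp", "B": "mbp", "P": "mbp",
    "F": "fv", "V": "fv", "sil": "rest",
}

def schedule_visemes(timed_phonemes, loudness):
    """timed_phonemes: list of (phoneme, start_sec, end_sec).
    loudness: callable mapping a time in seconds to 0..1 vocal energy.
    Returns (start_sec, viseme, intensity) events aligned so each viseme
    is scheduled exactly when its phoneme begins."""
    events = []
    for phoneme, start, end in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "rest")
        mid = (start + end) / 2.0  # sample loudness at the phoneme midpoint
        intensity = 0.0 if viseme == "rest" else round(loudness(mid), 3)
        events.append((start, viseme, intensity))
    return events

events = schedule_visemes(
    [("M", 0.00, 0.08), ("AA", 0.08, 0.25), ("sil", 0.25, 0.40)],
    loudness=lambda t: 0.8,
)
print(events)  # [(0.0, 'mbp', 0.8), (0.08, 'aa', 0.8), (0.25, 'rest', 0.0)]
```

In a player, these events would be consumed against the playback clock of the decoded PCM buffer, which is what keeps the mouth shapes in sync with the audio.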

MATCHING MOUTH SHAPE AND MOVEMENT IN DIGITAL VIDEO TO ALTERNATIVE AUDIO
20230121540 · 2023-04-20 ·

A method for matching mouth shape and movement in digital video to alternative audio includes deriving a sequence of facial poses, including mouth shapes, for an actor from a source digital video. Each pose in the sequence corresponds to the middle position of its audio sample. The method further includes generating an animated face mesh based on the sequence of facial poses and the source digital video, transferring tracked expressions from the animated face mesh or the target video to the source video, and generating a rough output video that includes the transferred expressions. The method further includes generating a finished video at least in part by refining the rough output video using a parametric autoencoder trained on mouth shapes in the animated face mesh or the target video. One or more computers may perform the operations of the method.
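The "middle position of each audio sample" pairing can be sketched as picking, for each fixed-length audio window, the video frame whose interval covers the window's temporal midpoint. The window length and frame rate below are illustrative assumptions.

```python
def middle_pose_indices(num_audio_windows, window_sec, fps):
    """For each audio analysis window, return the index of the video frame
    that covers the window's temporal midpoint, so each facial pose
    corresponds to the middle position of its audio sample."""
    indices = []
    for w in range(num_audio_windows):
        mid_t = (w + 0.5) * window_sec   # midpoint time of window w
        indices.append(int(mid_t * fps)) # frame whose interval contains mid_t
    return indices

# 40 ms audio windows against 25 fps video: one pose per frame.
print(middle_pose_indices(3, 0.04, 25))  # [0, 1, 2]
```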

ARTIFICIAL INTELLIGENCE-BASED ANIMATION CHARACTER DRIVE METHOD AND RELATED APPARATUS

This application discloses an artificial intelligence (AI) based animation character drive method. A first expression base of a first animation character corresponding to a speaker is determined by acquiring media data that includes the speaker's facial expression changes while speaking; the first expression base reflects different expressions of the first animation character. After target text information is obtained, an acoustic feature and a target expression parameter corresponding to the target text information are determined according to the target text information, the acquired media data, and the first expression base. A second animation character having a second expression base may then be driven according to the acoustic feature and the target expression parameter, so that the second animation character simulates the speaker's voice and facial expression when saying the target text information, thereby improving the user's experience of interacting with the animation character.
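An expression base driving a character, as described above, is commonly a linear blendshape-style model: each expression in the base contributes a per-vertex offset scaled by its parameter. The sketch below is a toy illustration under that assumption; the dimensions and values are made up, not the patent's actual model.

```python
def drive_character(neutral, expression_base, params):
    """neutral: list of [x, y, z] rest-pose vertices.
    expression_base: list of K expressions, each a list of per-vertex
    [dx, dy, dz] offsets. params: K weights for the target expression.
    Returns the deformed vertex positions: neutral + sum_k params[k]*base[k]."""
    deformed = [list(v) for v in neutral]
    for weight, offsets in zip(params, expression_base):
        for vertex, (dx, dy, dz) in zip(deformed, offsets):
            vertex[0] += weight * dx
            vertex[1] += weight * dy
            vertex[2] += weight * dz
    return deformed

neutral = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
base = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # expression 0: raise vertex 0
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # expression 1: push vertex 1 back
]
print(drive_character(neutral, base, [0.5, 0.0]))
# [[0.0, 0.5, 0.0], [1.0, 0.0, 0.0]]
```

Driving a second character with a second expression base then amounts to applying the same parameter vector to that character's own base, which is what lets the expression transfer across characters.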

METHOD OF PROCESSING IMAGE, METHOD OF TRAINING MODEL, ELECTRONIC DEVICE AND MEDIUM

A method of processing an image, a method of training a model, an electronic device and a medium, relating to the field of artificial intelligence technology, in particular to deep learning, computer vision and other technical fields. A solution includes: generating a first face image whose definition (sharpness) difference and authenticity difference from a reference face image are within a set range; adjusting, according to a target voice used to drive the first face image, facial action information related to pronunciation in the first face image, so as to generate a second face image whose facial tissue positions conform to the pronunciation rules of the target voice; and determining the second face image as the face image driven by the target voice.

SYNTHETIC EMOTION IN CONTINUOUSLY GENERATED VOICE-TO-VIDEO SYSTEM
20230061761 · 2023-03-02 ·

One example method includes collecting an audio segment that includes audio data generated by a user, analyzing the audio data to identify an emotion expressed by the user, computing start and end indices of a video segment, selecting video data that shows the identified emotion, using the video data and the start and end indices to modify the user's face as it appears in the video segment so as to generate modified face frames, and stitching the modified face frames into the video segment to create a modified video segment that carries the emotion expressed by the user and includes the audio data generated by the user.
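Computing the start and end indices of the video segment that spans a collected audio segment is, under a fixed frame rate, a small timestamp-to-frame conversion. The sketch below assumes the audio segment is described by start/end times in seconds; flooring the start and ceiling the end ensures the video segment fully covers the audio.

```python
import math

def segment_frame_indices(audio_start_sec, audio_end_sec, fps):
    """Return (start_frame, end_frame) for the video segment covering the
    audio segment; end_frame is exclusive. Floor the start and ceil the end
    so no audio falls outside the selected frames."""
    start = math.floor(audio_start_sec * fps)
    end = math.ceil(audio_end_sec * fps)
    return start, end

# A 1.2 s .. 2.5 s audio segment against 30 fps video:
print(segment_frame_indices(1.2, 2.5, 30))  # (36, 75)
```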

Spatial audio and avatar control at headset using audio signals

An audio system that provides an audio signal from a local area to the headset of a remote user is presented herein. The audio system identifies sounds from a human sound source in the local area, based in part on sounds detected within that area. It generates an audio signal for presentation to the remote user within a virtual representation of the local area, based in part on the remote user's location within that virtual representation relative to the virtual representation of the human sound source. The audio system provides the audio signal to the remote user's headset, which presents it as part of the virtual representation of the local area.
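The position-dependent rendering described above can be illustrated with a toy 2-D spatializer: gain falls off with distance, and a left/right pan follows the source's lateral offset from the listener. Real headset audio would use HRTFs and head orientation; this sketch only shows that the signal depends on the remote user's position relative to the virtual sound source, and all names and constants here are assumptions.

```python
import math

def spatialize_gain_pan(listener_pos, source_pos, ref_dist=1.0):
    """Toy spatialization in a virtual room: inverse-distance gain (clamped
    inside ref_dist) plus a pan in [-1, 1] from the source's x-offset
    relative to the listener. Positions are (x, y) tuples."""
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dy)
    gain = ref_dist / max(dist, ref_dist)            # 1.0 up close, then 1/r
    pan = max(-1.0, min(1.0, dx / max(dist, 1e-9)))  # -1 = left, +1 = right
    return round(gain, 3), round(pan, 3)

# A source 2 m to the listener's right: half gain, fully panned right.
print(spatialize_gain_pan((0.0, 0.0), (2.0, 0.0)))  # (0.5, 1.0)
```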

JOINT AUDIO-VIDEO FACIAL ANIMATION SYSTEM
20230106140 · 2023-04-06 ·

The present invention relates to a joint automatic audio-visual driven facial animation system that, in some example embodiments, includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system with a strong language model for speech recognition, with phoneme alignment obtained from the word lattice.
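One simple way to go from a word lattice's best path to a phoneme alignment is to look up each word's pronunciation in a lexicon and split the word's time span across its phonemes. The uniform split and the tiny lexicon below are illustrative stand-ins, not the system's actual alignment procedure (which would come from the recognizer's acoustic model).

```python
# Tiny assumed pronunciation lexicon (real systems use full dictionaries).
LEXICON = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}

def phoneme_alignment(word_spans):
    """word_spans: [(word, start_sec, end_sec), ...] as from a lattice's
    best path. Returns [(phoneme, start_sec, end_sec), ...] by dividing
    each word's span uniformly among its phonemes."""
    out = []
    for word, start, end in word_spans:
        phones = LEXICON[word]
        step = (end - start) / len(phones)
        for i, p in enumerate(phones):
            out.append((p, round(start + i * step, 3),
                           round(start + (i + 1) * step, 3)))
    return out

print(phoneme_alignment([("hi", 0.0, 0.2)]))
# [('HH', 0.0, 0.1), ('AY', 0.1, 0.2)]
```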

System and Method for Talking Avatar
20230206939 · 2023-06-29 ·

Aspects of this disclosure provide techniques for generating a viseme and corresponding intensity pair. In some embodiments, the method includes generating, by a server, a viseme and corresponding intensity pair based on at least one of a clean vocal track or a corresponding transcription. The method may include generating, by the server, a compressed audio file based on at least one of the viseme, the corresponding intensity, music, or a visual offset. The method may further include generating, by the server or a client-end application, a buffer of raw pulse-code modulated (PCM) data by decoding at least a part of the compressed audio file, where the viseme is scheduled to align with a corresponding phoneme.

METHOD AND SYSTEM FOR APPLYING SYNTHETIC SPEECH TO SPEAKER IMAGE
20230206896 · 2023-06-29 ·

The present disclosure relates to a method for applying a synthesized voice to a speaker image. The method includes receiving an input text, inputting the input text to an artificial neural network text-to-speech synthesis model and outputting voice data for the input text, generating a synthesized voice corresponding to the output voice data, and generating information on a plurality of phonemes included in the output voice data, where the information on the plurality of phonemes may include timing information for each of the plurality of phonemes.
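The per-phoneme timing information described above can be sketched as accumulating the durations a synthesis model assigns to each phoneme into absolute start/end times, which is what later lets mouth shapes in the speaker image be matched to the audio. The phonemes and durations below are assumed example inputs.

```python
def phoneme_timings(phonemes, durations_sec):
    """Return [(phoneme, start_sec, end_sec), ...] by accumulating the
    duration the synthesis model assigned to each phoneme."""
    timings, t = [], 0.0
    for p, d in zip(phonemes, durations_sec):
        timings.append((p, round(t, 3), round(t + d, 3)))
        t += d
    return timings

print(phoneme_timings(["HH", "AH", "L", "OW"], [0.06, 0.10, 0.07, 0.12]))
# [('HH', 0.0, 0.06), ('AH', 0.06, 0.16), ('L', 0.16, 0.23), ('OW', 0.23, 0.35)]
```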

Video Processing Method, Electronic Device And Non-transitory Computer-Readable Storage Medium

Provided are a video processing method, an electronic device and a non-transitory computer-readable storage medium, which relate to the field of data processing, and in particular to the field of video generation. In the specific implementation, text content and a selection instruction are received, where the selection instruction indicates a model for generating a virtual object; the text content is converted into a voice; a blend shape (mixed deformation) parameter set is generated according to the text content and the voice; and the model of the virtual object is rendered with the blend shape parameter set to obtain a picture set of the virtual object, from which a video of the virtual object broadcasting the text content is generated. By means of the present disclosure, it is possible to simplify the large number of complicated operations otherwise required for video production.
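The staged pipeline above (text → voice → parameter set → picture set → video) can be sketched with placeholder stages. Every function body here is a stand-in for illustration only; the function names, the one-parameter-frame-per-word assumption, and the model name are invented, not the disclosure's actual components.

```python
def text_to_voice(text):
    # Stand-in for TTS: pretend each word becomes one voice unit.
    return text.split()

def voice_to_blend_params(voice_units):
    # Stand-in: one blend shape parameter frame per voice unit,
    # with the jaw opening on each word.
    return [{"jaw_open": 0.6, "word": w} for w in voice_units]

def render_frames(model_name, param_set):
    # Stand-in renderer: produce one "picture" per parameter frame,
    # tagged with the selected virtual-object model.
    return [f"{model_name}:{p['word']}" for p in param_set]

# Chain the stages: the selection instruction picks the model,
# the text flows through voice and parameters to rendered frames.
frames = render_frames(
    "anchor_v1",
    voice_to_blend_params(text_to_voice("good evening viewers")),
)
print(frames)  # ['anchor_v1:good', 'anchor_v1:evening', 'anchor_v1:viewers']
```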