
ELECTRONIC DEVICE FOR GENERATING MOUTH SHAPE AND METHOD FOR OPERATING THEREOF

An electronic device includes at least one processor, and at least one memory storing instructions executable by the at least one processor and operatively connected to the at least one processor, where the at least one processor is configured to acquire voice data to be synthesized with at least one first image, generate a plurality of mouth shape candidates by using the voice data, select a mouth shape candidate among the plurality of mouth shape candidates, generate at least one second image based on the selected mouth shape candidate and at least a portion of each of the at least one first image, and generate at least one third image by applying at least one super-resolution model to the at least one second image.
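The claimed flow — acquire voice data, generate mouth-shape candidates, select one, composite it with the first image, then super-resolve — can be sketched as follows. All function names, the candidate representation, and the scoring rule are hypothetical stand-ins, not the patent's actual models:

```python
import random

def generate_mouth_shape_candidates(voice_data, n=5):
    # Hypothetical: each candidate is a small parameter vector for a mouth shape.
    rng = random.Random(len(voice_data))
    return [[rng.random() for _ in range(4)] for _ in range(n)]

def select_candidate(candidates, score=sum):
    # Pick the candidate with the highest (hypothetical) fitness score.
    return max(candidates, key=score)

def render_second_image(first_image, mouth_shape):
    # Composite the selected mouth shape onto the base face image.
    return {"base": first_image, "mouth": mouth_shape}

def super_resolve(image, scale=2):
    # Stand-in for the learned super-resolution model of the claim.
    return {**image, "scale": scale}

voice = "hello"
candidates = generate_mouth_shape_candidates(voice)   # plurality of candidates
chosen = select_candidate(candidates)                 # select one
frame = render_second_image("frame_0", chosen)        # second image
hires = super_resolve(frame)                          # third image
```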

FACE-TRANSLATOR: END-TO-END SYSTEM FOR SPEECH-TRANSLATED LIP-SYNCHRONIZED AND VOICE PRESERVING VIDEO GENERATION
20250315631 · 2025-10-09

A neural end-to-end system is provided for the face- and voice-preserving translation of videos. The system is a pipeline of multiple models that produces a video of the original speaker speaking in the target language with modified lip movement to match the target speech, while preserving the emphases and prosody of the original speech and the voice characteristics of the original speaker. The pipeline starts with automatic speech recognition including emphasis detection, followed by the translation model. The translated text is then synthesized by a text-to-speech model that recreates the original emphases in the target sentence. The resulting synthetic speech is then converted back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a generative model generates frames of adapted lip movements, which are combined with the audio to produce the final output. The disclosure further describes several use cases and configurations that apply these techniques to video conferencing, dubbing, low-bandwidth transmission, speech enhancement, and assistive technology for the hearing impaired.
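The five-stage pipeline described above can be sketched end to end. Every stage function below is a hypothetical stub standing in for the corresponding model (ASR, translation, TTS, voice conversion, lip-sync generator):

```python
def face_translate(video_frames, audio, target_lang):
    """Sketch of the claimed pipeline; each stage below is a hypothetical stub."""
    text, emphases = asr_with_emphasis(audio)           # 1. ASR + emphasis detection
    translated = translate(text, target_lang)           # 2. machine translation
    synth = tts_with_emphasis(translated, emphases)     # 3. TTS recreating emphases
    target_audio = voice_convert(synth, speaker=audio)  # 4. back to original voice
    frames = lipsync(video_frames, target_audio)        # 5. generative lip adaptation
    return frames, target_audio

# Minimal stand-ins so the sketch runs end to end.
def asr_with_emphasis(audio): return ("hello world", [0])
def translate(text, lang): return f"[{lang}] {text}"
def tts_with_emphasis(text, emphases): return {"text": text, "emphases": emphases}
def voice_convert(synth, speaker): return {**synth, "voice": "original"}
def lipsync(frames, audio): return [f + "_synced" for f in frames]

out_frames, out_audio = face_translate(["f0", "f1"], "raw_audio", "de")
```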

Apparatus and method for generating speech synthesis image
12437773 · 2025-10-07

An apparatus for generating a speech synthesis image according to a disclosed embodiment is an apparatus for generating a speech synthesis image based on machine learning, the apparatus including a first global geometric transformation predictor configured to be trained to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image based on the source image and the target image, a local feature tensor predictor configured to be trained to predict a feature tensor for a local motion of the person based on input target image-related information, and an image generator configured to be trained to reconstruct the target image based on the global geometric transformation, the source image, and the feature tensor for the local motion.
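The decomposition into a global geometric transformation and a local motion residual can be illustrated on 2D keypoints. This is a minimal sketch under assumed representations (affine matrix for the global motion, per-point offsets for the local motion), not the patent's trained predictors:

```python
def apply_global_transform(points, matrix, translation):
    # Global geometric transformation: affine map of 2D keypoints.
    return [(matrix[0][0] * x + matrix[0][1] * y + translation[0],
             matrix[1][0] * x + matrix[1][1] * y + translation[1])
            for x, y in points]

def apply_local_motion(points, local_offsets):
    # Local feature stand-in: per-point residual offsets (e.g. around the lips).
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, local_offsets)]

def generate_target(source_points, matrix, translation, local_offsets):
    # Generator stand-in: reconstruct target keypoints from global + local motion.
    return apply_local_motion(
        apply_global_transform(source_points, matrix, translation), local_offsets)

src = [(0.0, 0.0), (1.0, 0.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]
target = generate_target(src, identity, (0.5, 0.0), [(0.0, 0.1), (0.0, -0.1)])
```

In the actual apparatus all three components are trained jointly to reconstruct the target image; here only the geometric composition is shown.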

Method and apparatus for providing interactive avatar services
12450809 · 2025-10-21

A method of providing an avatar service includes obtaining a user-uttered voice and spatial information of a user-utterance space, transmitting the user-uttered voice and the spatial information to a server, receiving, from the server, a first avatar voice answer and an avatar facial expression sequence corresponding to the first avatar voice answer, which are determined based on the user-uttered voice and the spatial information, determining first avatar facial expression data based on the first avatar voice answer and the avatar facial expression sequence, identifying a certain event during reproduction of a first avatar animation created based on the first avatar voice answer and the first avatar facial expression data, determining second avatar facial expression data or a second avatar voice answer based on the certain event, and reproducing a second avatar animation created based on the second avatar facial expression data or the second avatar voice answer.
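The client-server flow of the claim can be sketched as a session loop: the first animation is built from the server's answer, and a detected event during playback triggers a second animation. The server interface, event model, and data shapes below are all hypothetical:

```python
def avatar_session(user_voice, spatial_info, server, events=()):
    """Hypothetical client-side flow of the claimed method."""
    # Server returns a voice answer and a facial-expression sequence.
    answer, expr_seq = server(user_voice, spatial_info)
    first_expr = {"answer": answer, "sequence": expr_seq}
    playback = [("animation_1", first_expr)]
    for event in events:                # e.g. a user interruption during playback
        second = {"answer": f"re:{event}", "sequence": expr_seq}
        playback.append(("animation_2", second))
    return playback

def fake_server(voice, space):
    # Stand-in for the server-side answer and expression-sequence generation.
    return (f"answer to {voice}", ["smile", "nod"])

log = avatar_session("hi", {"room": "office"}, fake_server, events=["interrupt"])
```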

System and method of modulating animation curves
12561881 · 2026-02-24

A system and method of modulating animation curves based on audio input. The method including: identifying phonetic features for a plurality of visemes in the audio input; determining viseme animation curves based on parameters representing a spatial appearance of the plurality of visemes; modulating the viseme animation curves based on melodic accent, pitch sensitivity, or both, based on the phonetic features; and outputting the modulated animation curves.
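A minimal sketch of the modulation step: a per-frame viseme curve is reshaped by melodic accent and pitch sensitivity. The specific modulation rule here (accent scales amplitude, pitch sensitivity sharpens the curve toward its peak) is an illustrative assumption, not the patented formula:

```python
def modulate_viseme_curve(curve, melodic_accent=0.0, pitch_sensitivity=0.0):
    """Modulate a viseme animation curve (hypothetical rule).

    melodic_accent boosts overall amplitude; pitch_sensitivity
    sharpens the curve by weighting each value toward the peak.
    """
    peak = max(curve) or 1.0            # avoid division by zero on a flat curve
    gain = 1.0 + melodic_accent
    return [gain * v * (v / peak) ** pitch_sensitivity for v in curve]

base_curve = [0.0, 0.5, 1.0, 0.5, 0.0]  # e.g. jaw-open value per frame
modulated = modulate_viseme_curve(base_curve, melodic_accent=0.2,
                                  pitch_sensitivity=1.0)
```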

Photorealistic Talking Faces from Audio
20260038179 · 2026-02-05

Provided is a framework for generating photorealistic 3D talking faces conditioned only on audio input. In addition, the present disclosure provides associated methods to insert generated faces into existing videos or virtual environments. We decompose faces from video into a normalized space that decouples 3D geometry, head pose, and texture. This allows separating the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. To stabilize temporal dynamics, we propose an auto-regressive approach that conditions the model on its previous visual state. We also capture face illumination in our model using audio-independent 3D texture normalization.
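The auto-regressive conditioning described above — each prediction conditioned on the audio frame and the previous visual state — can be sketched generically. The predictor and the scalar "state" are toy stand-ins for the regression over 3D shape and 2D texture atlas:

```python
def autoregressive_face_sequence(audio_features, predict, initial_state):
    """Run a predictor auto-regressively over audio frames.

    Each step is conditioned on its previous visual state, stabilizing
    temporal dynamics; predict and the state layout are hypothetical.
    """
    states = []
    state = initial_state
    for frame in audio_features:
        state = predict(frame, state)   # stands in for 3D shape + texture regression
        states.append(state)
    return states

def toy_predict(audio_frame, prev_state):
    # Stand-in regression: blend the previous state toward the audio feature.
    return 0.5 * prev_state + 0.5 * audio_frame

seq = autoregressive_face_sequence([1.0, 1.0, 1.0], toy_predict, 0.0)
```

Note how the toy sequence converges smoothly rather than jumping, which is the temporal-stability property the auto-regressive conditioning is meant to provide.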

METHOD FOR GENERATING DIGITAL HUMAN, INTELLIGENT AGENT, ELECTRONIC DEVICE AND STORAGE MEDIUM

Provided is a method for generating a digital human, an intelligent agent, an electronic device and a storage medium, relating to the field of artificial intelligence technology, and particularly to the fields of computer vision, deep learning, large model, augmented reality and other technologies. The method includes: segmenting a target object from an image to be processed to obtain a target sub-image; selecting a digital human to be optimized that is compatible with the target object from a digital human set based on the target sub-image; generating clothing texture of the digital human to be optimized based on an appearance feature of the target object in the target sub-image; applying the clothing texture to the digital human to be optimized to obtain a target digital human; and driving the target digital human.
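The five claimed steps can be sketched as a pipeline whose stages are injected as callables. Everything below (function names, dict-based avatar representation) is a hypothetical illustration of the control flow, not the disclosed models:

```python
def build_digital_human(image, digital_human_set, segment, select, texture_fn, drive):
    """Hypothetical sketch of the five claimed steps."""
    sub_image = segment(image)                    # 1. segment target object
    base = select(digital_human_set, sub_image)   # 2. pick a compatible digital human
    clothing = texture_fn(sub_image)              # 3. generate clothing texture
    target = {**base, "texture": clothing}        # 4. apply texture to obtain target
    return drive(target)                          # 5. drive the target digital human

result = build_digital_human(
    image="photo",
    digital_human_set=[{"id": "dh1"}, {"id": "dh2"}],
    segment=lambda img: f"{img}_person",
    select=lambda s, sub: s[0],
    texture_fn=lambda sub: f"texture_from_{sub}",
    drive=lambda t: {**t, "driven": True},
)
```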

Systems And Methods For Machine-Generated Avatars
20260094612 · 2026-04-02

Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing uses machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.

Machine translation system for entertainment and media
12596892 · 2026-04-07

Techniques for generating translated audio output based on media content are disclosed. Text is accessed corresponding to media content. One or more untranslated mouth shape indicia are determined based on the text. The text is parsed into one or more text chunks when one or more dubbing parameters are met. The parsed text is translated from a first spoken language to a second spoken language. One or more translated mouth shape indicia are determined. The one or more translated mouth shape indicia and the one or more untranslated mouth shape indicia are compared based on a predetermined tolerance threshold. A translated audio output is generated based on the translated text.
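The comparison step — determining mouth-shape indicia for source and translated text and checking them against a tolerance — can be sketched on one text chunk. The indicia function, the translation stub, and the relative-count comparison rule are all hypothetical simplifications:

```python
def dub_chunk(text, translate, mouth_shapes, tolerance=0.2):
    """Hypothetical sketch: compare mouth-shape indicia of source and
    translated text against a predetermined tolerance threshold."""
    source_shapes = mouth_shapes(text)            # untranslated mouth shape indicia
    translated = translate(text)                  # first -> second spoken language
    translated_shapes = mouth_shapes(translated)  # translated mouth shape indicia
    # Compare, e.g., relative difference in mouth-shape counts.
    diff = abs(len(translated_shapes) - len(source_shapes)) / max(len(source_shapes), 1)
    within_tolerance = diff <= tolerance
    return translated, within_tolerance

translated, ok = dub_chunk(
    "hello world",
    translate=lambda t: "hallo welt",
    mouth_shapes=lambda t: list(t.replace(" ", "")),  # stand-in: one shape per letter
)
```

A chunk whose translation passes the tolerance check would then proceed to audio synthesis; what happens on failure (e.g. re-translation or re-chunking) is not specified here.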

AI-driven system and methods for personalized virtual medical and spiritual advisor avatars with adaptive therapeutic audio, biometric monitoring, and immersive AR/VR healthcare interfaces
12609103 · 2026-04-21

A computer-implemented system personalizes virtual advisors for immersive healthcare by creating virtual medical and spiritual avatars that resemble trusted authority figures using deepfake technology and multimodal deep neural networks. The virtual medical advisor tailors guidance by analyzing unstructured electronic health record data with natural language processing and BERT-based techniques while adapting its communication based on real-time physiological data from sensors like EEG and photoplethysmography. Concurrently, the virtual spiritual advisor offers faith-based counseling by factoring in user-declared spiritual preferences and sacred text analysis weighted for doctrinal considerations. Additional features include gamification with cryptocurrency tokens or NFTs for health activities, blockchain-based audit trails for HIPAA compliance, and federated learning with differential privacy. The system also employs 3D anatomical simulations to visualize pharmacokinetics and uses adaptive audio treatments in augmented reality with techniques like binaural beats and haptic feedback, all optimized through reinforcement learning based on historical interactions.