G10L21/10

METHOD FOR ANIMATION SYNTHESIS, ELECTRONIC DEVICE AND STORAGE MEDIUM
20220375456 · 2022-11-24

A method for animation synthesis includes: obtaining an audio stream to be processed and a syllable sequence, wherein the audio stream and the syllable sequence correspond to the same text and each syllable in the syllable sequence is the pinyin of a corresponding character of the text; obtaining a phoneme information sequence of the audio stream by performing phoneme detection on the audio stream, wherein each piece of phoneme information in the phoneme information sequence comprises a phoneme category and a pronunciation time period; determining a pronunciation time period corresponding to each syllable in the syllable sequence based on the syllable sequence and on the phoneme categories and pronunciation time periods in the phoneme information sequence; and generating an animation video corresponding to the audio stream based on the pronunciation time period corresponding to each syllable in the syllable sequence and an animation frame sequence corresponding to each syllable.
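
The alignment and frame-scheduling steps in this abstract can be illustrated with a minimal Python sketch. It is not the claimed method: it assumes that the detected phoneme categories are pinyin initials and finals whose concatenation spells each syllable, and that each syllable's animation frame sequence is stretched to its pronunciation time period by nearest-index resampling. The names `Phoneme`, `align_syllables`, and `schedule_frames` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    category: str  # assumed to be a pinyin initial or final, e.g. "n" or "ao"
    start: float   # start of the pronunciation time period, in seconds
    end: float     # end of the pronunciation time period, in seconds


def align_syllables(syllables, phonemes):
    """Assign each syllable the time span covered by the phonemes that spell it.

    Assumption: consecutive phoneme categories concatenate exactly to each syllable.
    """
    spans = []
    i = 0
    for syllable in syllables:
        if not syllable:
            raise ValueError("empty syllable")
        first = i
        built = ""
        while built != syllable:
            if i >= len(phonemes) or not syllable.startswith(built + phonemes[i].category):
                raise ValueError(f"phonemes do not spell syllable {syllable!r}")
            built += phonemes[i].category
            i += 1
        spans.append((syllable, phonemes[first].start, phonemes[i - 1].end))
    return spans


def schedule_frames(spans, frames_by_syllable, fps):
    """Stretch or shrink each syllable's frame sequence to fill its time span."""
    video = []
    for syllable, start, end in spans:
        frames = frames_by_syllable[syllable]
        count = max(1, round((end - start) * fps))  # output frames for this span
        for k in range(count):
            # Map output frame k onto the source sequence (nearest-index resampling).
            video.append(frames[min(k * len(frames) // count, len(frames) - 1)])
    return video


phonemes = [
    Phoneme("n", 0.0, 0.1), Phoneme("i", 0.1, 0.25),
    Phoneme("h", 0.25, 0.3), Phoneme("ao", 0.3, 0.5),
]
spans = align_syllables(["ni", "hao"], phonemes)
video = schedule_frames(
    spans, {"ni": ["ni_0", "ni_1"], "hao": ["hao_0", "hao_1", "hao_2"]}, fps=20
)
```

A greedy prefix match keeps the sketch short; real phoneme detectors may emit categories that do not map one-to-one onto pinyin spelling, in which case a lookup table or a tolerant alignment would be needed.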

DEVICE AND METHOD FOR GENERATING SPEECH VIDEO ALONG WITH LANDMARK
20220375224 · 2022-11-24

A speech video generation device according to an embodiment includes: a first encoder, which receives an input of a person background image that is a video part in a speech video of a predetermined person and extracts an image feature vector from the person background image; a second encoder, which receives an input of a speech audio signal that is an audio part in the speech video and extracts a voice feature vector from the speech audio signal; a combining unit, which generates a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; a first decoder, which reconstructs the speech video of the person using the combined vector as an input; and a second decoder, which predicts a landmark of the speech video using the combined vector as an input.
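
The data flow in this abstract (two encoders, a combining unit, and two decoders that share one combined vector) can be sketched in plain Python. This is a toy under stated assumptions, not the disclosed device: untrained random linear maps stand in for the encoder and decoder networks, the combining unit is assumed to concatenate the two feature vectors, and all names and dimensions (`make_linear`, `SpeechVideoSketch`, `feat_dim`, and so on) are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random linear map, a stand-in for a trained network."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


class SpeechVideoSketch:
    def __init__(self, image_dim, audio_dim, feat_dim, frame_dim, n_landmarks, seed=0):
        rng = random.Random(seed)
        self.image_encoder = make_linear(image_dim, feat_dim, rng)   # first encoder
        self.audio_encoder = make_linear(audio_dim, feat_dim, rng)   # second encoder
        self.video_decoder = make_linear(2 * feat_dim, frame_dim, rng)            # first decoder
        self.landmark_decoder = make_linear(2 * feat_dim, 2 * n_landmarks, rng)   # second decoder

    def forward(self, image, audio):
        # Combining unit: concatenation of the two feature vectors (an assumption).
        combined = self.image_encoder(image) + self.audio_encoder(audio)
        frame = self.video_decoder(combined)      # reconstructed frame, flattened
        flat = self.landmark_decoder(combined)    # x0, y0, x1, y1, ...
        landmarks = list(zip(flat[0::2], flat[1::2]))
        return frame, landmarks


model = SpeechVideoSketch(image_dim=12, audio_dim=8, feat_dim=4, frame_dim=12, n_landmarks=5)
frame, landmarks = model.forward([0.5] * 12, [0.1] * 8)
```

Both decoders read the same combined vector, so in a trained system the landmark prediction and the video reconstruction would be driven by one shared representation.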

DEVICE AND METHOD FOR GENERATING SPEECH VIDEO
20220375190 · 2022-11-24

A speech video generation device according to an embodiment includes: a first encoder, which receives an input of a first person background image of a predetermined person partially hidden by a first mask and extracts a first image feature vector from the first person background image; a second encoder, which receives an input of a second person background image of the person partially hidden by a second mask and extracts a second image feature vector from the second person background image; a third encoder, which receives an input of a speech audio signal of the person and extracts a voice feature vector from the speech audio signal; a combining unit, which generates a combined vector of the first image feature vector, the second image feature vector, and the voice feature vector; and a decoder, which reconstructs a speech video of the person using the combined vector as an input.

SYNTHESIZING VIDEO FROM AUDIO USING ONE OR MORE NEURAL NETWORKS
20220374637 · 2022-11-24

Apparatuses, systems, and techniques are presented to reduce an amount of data to be transmitted for media content. In at least one embodiment, one or more neural networks are used to generate video and audio information corresponding to one or more people based, at least in part, on at least one image and voice information corresponding to the one or more people.

Automated conversation content items from natural language

A conversation augmentation system can automatically augment a conversation with content items based on natural language from the conversation. The conversation augmentation system can select content items to add to the conversation based on determined user “intents” generated using machine learning models. The conversation augmentation system can generate intents for natural language from various sources, such as video chats, audio conversations, textual conversations, virtual reality environments, etc. The conversation augmentation system can identify constraints for mapping the intents to content items or context signals for selecting appropriate content items. In various implementations, the conversation augmentation system can add selected content items to a storyline the conversation describes or can augment a platform in which an unstructured conversation is occurring.

WEARABLE SPEECH INPUT-BASED TO MOVING LIPS DISPLAY OVERLAY
20230056847 · 2023-02-23

Eyewear having a speech-to-moving-lips algorithm that receives and translates speech and utterances of a person viewed through the eyewear, and then displays an overlay of moving lips corresponding to the speech and utterances on a mask of the viewed person. A database of text-to-moving-lips information is used to translate the speech and generate the moving lips in near-real time with little latency. This translation gives deaf or hearing-impaired users the ability to understand and communicate with the person viewed through the eyewear when that person is wearing a mask. The translation may include automatic speech recognition (ASR) and natural language understanding (NLU) as a sound recognition engine.

SYSTEM AND METHOD FOR DUAL MODE PRESENTATION OF CONTENT IN A TARGET LANGUAGE TO IMPROVE LISTENING FLUENCY IN THE TARGET LANGUAGE
20230048738 · 2023-02-16

Embodiments of a language learning system and method for implementing or assisting in self-study for improving listening fluency in a target language are disclosed. Such embodiments may simultaneously present the same piece of content in an auditory presentation and a corresponding visual presentation of a transcript of the auditory presentation, where the two presentations are adapted to work in tandem to increase the effectiveness of language learning for users.
