G06T13/205

METHOD FOR OUTPUTTING BLEND SHAPE VALUE, STORAGE MEDIUM, AND ELECTRONIC DEVICE

A method for outputting a blend shape value includes: performing feature extraction on obtained target audio data to obtain a target audio feature vector; inputting the target audio feature vector and a target identifier into an audio-driven animation model; inputting the target audio feature vector into an audio encoding layer, determining an input feature vector of a next layer at a (2t−n)/2 time point based on input feature vectors of a previous layer between a t time point and a t−n time point, determining a feature vector having a causal relationship with the input feature vector of the previous layer as a valid feature vector, and sequentially outputting target-audio encoding features; inputting the target identifier into a one-hot encoding layer for binary vector encoding to obtain a target-identifier encoding feature; and outputting a blend shape value corresponding to the target audio data.
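The causal audio-encoding step and the one-hot identifier encoding can be sketched roughly as follows; the windowed mean stands in for the model's learned convolution weights, and all names here are illustrative rather than taken from the patent.

```python
import numpy as np

def one_hot(identifier: int, num_ids: int) -> np.ndarray:
    """Binary vector encoding of a target identifier (the one-hot layer)."""
    v = np.zeros(num_ids)
    v[identifier] = 1.0
    return v

def causal_layer(prev: np.ndarray, n: int) -> np.ndarray:
    """Each next-layer feature depends only on previous-layer features in the
    causal window [t - n, t]; a plain mean stands in for the learned weights
    of the real audio encoding layer."""
    out = np.zeros(len(prev))
    for t in range(len(prev)):
        out[t] = prev[max(0, t - n):t + 1].mean()  # past/present inputs only
    return out

features = np.arange(8, dtype=float)   # toy previous-layer feature sequence
encoded = causal_layer(features, n=2)  # sequential target-audio encoding features
identity = one_hot(3, num_ids=5)       # target-identifier encoding feature
```

Because only inputs at or before t contribute to each output, the layer can emit its encoding features sequentially as audio arrives.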

SYSTEMS AND METHODS FOR AUTOMATED REAL-TIME GENERATION OF AN INTERACTIVE AVATAR UTILIZING SHORT-TERM AND LONG-TERM COMPUTER MEMORY STRUCTURES

Systems and methods enable rendering an avatar attuned to a user. The systems and methods include receiving audio-visual data of user communications of a user. Using the audio-visual data, the systems and methods may determine vocal characteristics of the user, facial action units representative of facial features of the user, and speech of the user based on a speech recognition model and/or natural language understanding model. Based on the vocal characteristics, an acoustic emotion metric can be determined. Based on the recognized speech, a speech emotion metric may be determined. Based on the facial action units, a facial emotion metric may be determined. Based on a combination of the acoustic emotion metric, the speech emotion metric, and the facial emotion metric, an emotional complex signature may be determined to represent an emotional state of the user for rendering the avatar attuned to that emotional state.
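The final fusion step admits a simple sketch: combine the three per-modality emotion metrics into one signature. The fixed weights and the normalization below are assumptions; the abstract only states that the signature is based on a combination of the three metrics.

```python
import numpy as np

# Hypothetical modality weights; the patent does not specify the combination.
WEIGHTS = {"acoustic": 0.3, "speech": 0.4, "facial": 0.3}

def emotion_signature(acoustic: np.ndarray, speech: np.ndarray,
                      facial: np.ndarray) -> np.ndarray:
    """Fuse per-emotion score vectors (e.g. over [happy, sad, angry, neutral])
    from the three modalities into one emotional-complex signature."""
    sig = (WEIGHTS["acoustic"] * acoustic
           + WEIGHTS["speech"] * speech
           + WEIGHTS["facial"] * facial)
    return sig / sig.sum()  # normalize to a distribution over emotions
```

An avatar renderer could then select expression parameters from the dominant entries of the normalized signature.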

SYSTEMS AND METHODS FOR ANIMATION GENERATION

Systems and methods for animating from audio in accordance with embodiments of the invention are illustrated. One embodiment includes a method for generating animation from audio. The method includes steps for receiving input audio data, generating an embedding for the input audio data, and generating several predictions for several tasks from the generated embedding. The several predictions include at least one of blendshape weights, event detection, and/or voice activity detection. The method includes steps for generating a final prediction from the several predictions, where the final prediction includes a set of blendshape weights, and generating an output based on the generated final prediction.
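A minimal sketch of the shared-embedding, multi-head structure; the toy heads and the gating fusion are assumptions for illustration, not the patent's trained networks.

```python
import numpy as np

def multitask_heads(z: np.ndarray):
    """Several task predictions from one shared audio embedding z."""
    blendshapes = 1.0 / (1.0 + np.exp(-z[:8]))     # 8 toy blendshape weights in (0, 1)
    voice_activity = float(z.mean() > 0.0)          # toy voice activity detection
    event_detected = float(np.abs(z).max() > 2.0)   # toy event detection
    return blendshapes, voice_activity, event_detected

def final_prediction(blendshapes: np.ndarray, voice_activity: float) -> np.ndarray:
    """One plausible fusion: suppress blendshape motion when no voice is detected."""
    return blendshapes * voice_activity

z = np.linspace(-2.0, 1.0, 16)       # stand-in for a learned audio embedding
bs, vad, event = multitask_heads(z)
final = final_prediction(bs, vad)    # final set of blendshape weights
```

Sharing one embedding across tasks lets the auxiliary predictions (voice activity, events) regularize the blendshape head while adding little inference cost.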

AVATAR RENDERING OF PRESENTATIONS

A computer-implemented method for avatar rendering of virtual presentations is disclosed. The computer-implemented method includes extracting visual content from a presentation. The computer-implemented method further includes extracting audio content from the presentation. The computer-implemented method includes correlating the visual content with the audio content of the presentation. The computer-implemented method includes generating a virtual avatar to dynamically render a virtual presentation to a viewer, based at least in part, on the correlated visual content and audio content of the presentation.

FACIAL ACTIVITY DETECTION FOR VIRTUAL REALITY SYSTEMS AND METHODS

In an embodiment, a virtual reality ride system includes a display to present virtual reality image content to a first rider, an audio sensor to capture audio data associated with a second rider, and an image sensor to capture image data associated with the second rider. The virtual reality ride system also includes at least one processor communicatively coupled to the display and configured to (i) receive the audio data, the image data, or both, (ii) generate a virtual avatar corresponding to the second rider, wherein the virtual avatar includes a set of facial features, (iii) update the set of facial features based on the audio data, the image data, or both, and (iv) instruct the display to present the virtual reality image content including the virtual avatar and the updated set of facial features.

SPEECH IMAGE PROVIDING METHOD AND COMPUTING DEVICE FOR PERFORMING THE SAME
20230005202 · 2023-01-05

A computing device according to an embodiment includes one or more processors and a memory storing one or more programs executed by the one or more processors. The device includes a standby state image generating module configured to generate a standby state image in which a person is in a standby state and to generate a back-motion image set including a plurality of back-motion images at a preset frame interval from the standby state image, for image interpolation with a preset reference frame of the standby state image; a speech state image generating module configured to generate a speech state image in which the person is in a speech state based on a source of speech content; and an image playback module configured to generate a synthetic speech image by combining the standby state image and the speech state image while playing the standby state image.

HARMONY-AWARE HUMAN MOTION SYNTHESIS WITH MUSIC
20230005201 · 2023-01-05

A method and device for harmony-aware audio-driven motion synthesis are provided. The method includes determining a plurality of testing meter units according to an input audio, each testing meter unit corresponding to an input audio sequence of the input audio, obtaining an auditory input corresponding to each testing meter unit, obtaining an initial pose of each testing meter unit as a visual input based on a visual motion sequence synthesized for a previous testing meter unit, and automatically generating a harmony-aware motion sequence corresponding to the input audio using a generator of a generative adversarial network (GAN) model. The GAN model is trained by incorporating a hybrid loss function. The hybrid loss function includes a multi-space pose loss, a harmony loss, and a GAN loss. The harmony loss is determined according to beat consistencies of audio-visual beat pairs.
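The hybrid loss is a weighted combination of its three terms; the weights below and the beat-distance form of the harmony term are illustrative assumptions, since the abstract defines the harmony loss only via beat consistencies of audio-visual beat pairs.

```python
import numpy as np

def hybrid_loss(pose_loss: float, harmony_loss: float, gan_loss: float,
                w_pose: float = 1.0, w_harmony: float = 0.5,
                w_gan: float = 0.1) -> float:
    """Weighted sum of the multi-space pose loss, harmony loss, and GAN loss;
    the weights are illustrative, not taken from the patent."""
    return w_pose * pose_loss + w_harmony * harmony_loss + w_gan * gan_loss

def harmony_loss_from_beats(audio_beats: np.ndarray,
                            motion_beats: np.ndarray) -> float:
    """Toy harmony term: mean timing gap between paired audio beats and
    visual (motion) beats; zero when every pair is perfectly aligned."""
    return float(np.abs(audio_beats - motion_beats).mean())
```

During GAN training the generator would minimize this hybrid loss per meter unit, carrying the last synthesized pose forward as the visual input for the next unit.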

INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND PROGRAM

An information processing device includes a control unit that displays a virtual object in a three-dimensional coordinate space associated with a real space and causes the virtual object to perform a predetermined movement according to a predetermined sound reproduced from a real sound source disposed in the real space. The control unit performs delay amount setting processing that increases the delay amount between the reproduction timing, at which the real sound source reproduces the predetermined sound, and the movement start timing, at which the virtual object is caused to perform the predetermined movement, as the position of the real sound source in the real space and the position of the virtual object in the three-dimensional coordinate space become farther apart.
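One natural way to make the delay grow with separation is to match the acoustic travel time; the linear rule below is an assumption, since the abstract only requires the delay amount to increase with the distance between the sound source and the virtual object.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # in air at roughly 20 °C

def movement_delay_s(distance_m: float) -> float:
    """Delay between the sound's reproduction timing and the virtual object's
    movement start timing, chosen so the motion begins when the sound would
    actually reach the object's apparent position."""
    return distance_m / SPEED_OF_SOUND_M_PER_S
```

For example, a virtual object placed 34.3 m from the speaker would start its movement about 0.1 s after the sound is reproduced.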

RENDERING VIRTUAL ARTICLES OF CLOTHING BASED ON AUDIO CHARACTERISTICS
20220406001 · 2022-12-22

Systems and methods for generating a virtual article of clothing at a display are described. Some examples may include: obtaining video data and audio data, analyzing the video data to determine one or more body joints of a target object appearing in the video data. A mesh based on the determined one or more body joints may be generated. The audio data may be analyzed to determine audio characteristics associated with the audio data. Texture rendering information associated with a virtual article of clothing may be determined based on the audio characteristics. A rendered video may be generated by rendering the virtual article of clothing to the generated mesh using the texture rendering information.
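Mapping audio characteristics to texture rendering information can be sketched as below; the loudness-to-brightness and difference-energy-to-hue mappings are invented for illustration, as the abstract leaves the specific mapping open.

```python
import numpy as np

def texture_params_from_audio(samples: np.ndarray) -> dict:
    """Derive texture rendering parameters for the virtual clothing from
    simple audio characteristics: RMS loudness drives brightness and a
    rough high-frequency-energy proxy drives hue."""
    rms = float(np.sqrt(np.mean(samples ** 2)))
    brightness = min(1.0, 4.0 * rms)                      # clamp to [0, 1]
    hue = float(np.mean(np.abs(np.diff(samples)))) % 1.0  # wrap into [0, 1)
    return {"brightness": brightness, "hue": hue}
```

The renderer would apply these parameters when texturing the virtual article of clothing onto the mesh built from the detected body joints.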

ARTIFICIAL INTELLIGENCE (AI) LIFELIKE 3D CONVERSATIONAL CHATBOT
20220398794 · 2022-12-15

A 3D conversational chatbot is disclosed. The conversational chatbot is embodied in an avatar to provide a human-like experience for end users. The chatbot is artificial intelligence-based and is configured with the knowledge of its owner; that knowledge depends on the owner, such as the products and/or services the owner provides, so the chatbot is customized with AI for the owner's specific needs. The avatar communicates with the user, such as a customer, to answer questions with life-like speech and facial movement.