
Spatial audio and avatar control at headset using audio signals

An audio system in a local area providing an audio signal to a headset of a remote user is presented herein. The audio system identifies sounds from a human sound source in the local area, based in part on sounds detected within the local area. The audio system generates an audio signal for presentation to a remote user within a virtual representation of the local area based in part on a location of the remote user within the virtual representation of the local area relative to a virtual representation of the human sound source within the virtual representation of the local area. The audio system provides the audio signal to a headset of the remote user, wherein the headset presents the audio signal as part of the virtual representation of the local area to the remote user.
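The positional rendering described in the abstract can be sketched as a distance-attenuation and pan computation over the relative positions of the virtual sound source and the remote listener. The function name, the inverse-distance rolloff, and the constant-power pan law below are illustrative assumptions, not the patented method.

```python
import math

def spatialize(source_pos, listener_pos, listener_yaw, mono_gain=1.0):
    """Compute per-ear gains for a sound source, based on the listener's
    position and orientation relative to the source in the virtual space.

    source_pos, listener_pos: (x, z) coordinates in the virtual area.
    listener_yaw: listener facing direction in radians.
    """
    dx = source_pos[0] - listener_pos[0]
    dz = source_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dz)
    # Inverse-distance attenuation with a near-field clamp at 1 unit.
    attenuation = mono_gain / max(dist, 1.0)
    # Azimuth of the source relative to the listener's facing direction.
    azimuth = math.atan2(dx, dz) - listener_yaw
    # Constant-power pan: derive left/right gains from the azimuth.
    pan = math.sin(azimuth)  # -1 = hard left, +1 = hard right
    left = attenuation * math.sqrt((1.0 - pan) / 2.0)
    right = attenuation * math.sqrt((1.0 + pan) / 2.0)
    return left, right
```

A source directly ahead yields equal left/right gains; a source to the listener's right pans fully right.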

Actor-Replacement System for Videos
20240185890 · 2024-06-06

In one aspect, an example method includes (i) estimating, using a skeletal detection model, a pose of an original actor for each of multiple frames of a video; (ii) obtaining, for each of a plurality of the estimated poses, a respective image of a replacement actor; (iii) obtaining replacement speech in the replacement actor's voice that corresponds to speech of the original actor in the video; (iv) generating, using the estimated poses, the images of the replacement actor, and the replacement speech, synthetic frames corresponding to the multiple frames of the video that depict the replacement actor in place of the original actor, with the synthetic frames including facial expressions for the replacement actor that temporally align with the replacement speech; and (v) combining the synthetic frames and the replacement speech so as to obtain a synthetic video that replaces the original actor with the replacement actor.

METHOD FOR PROVIDING SPEECH VIDEO AND COMPUTING DEVICE FOR EXECUTING THE METHOD
20240185877 · 2024-06-06 ·

In a method of providing a speech video according to an embodiment, a standby state video, in which a person in the video is in a standby state, is reproduced; a speech state video, in which the person is in a speech state, is generated based on a source of speech content; the standby state video being reproduced is returned to a reference frame of the standby state video based on a back motion image; and a synthesized speech video is generated by synthesizing the returned reference frame and the speech state video.

Image processing device, animation display method and computer readable medium

An image processing device includes a controller and a display. The controller adds an expression to a displayed face image in accordance with audio being output. Further, the controller generates an animation in which a mouth contained in the face image with the expression moves in sync with the audio. The display displays the generated animation.
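Audio-synchronized mouth motion of the kind described above could be driven by a per-frame energy measure. The sketch below (the frame size, the RMS-to-openness scaling factor, and the function name) is an illustrative assumption, not the device's actual method.

```python
import math

def mouth_openness(samples, frame_size=800):
    """Map audio sample frames to mouth-openness values in [0, 1] so a
    face animation can move the mouth in sync with the audio.

    samples: mono audio samples normalized to [-1, 1].
    frame_size: samples per animation frame (800 = 60 fps at 48 kHz).
    """
    openness = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        if not frame:
            break
        # RMS energy of the frame drives how far the mouth opens.
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        openness.append(min(1.0, rms * 4.0))  # 4.0: assumed sensitivity
    return openness
```

Silence maps to a closed mouth; louder frames open it further, clamped at fully open.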

Image processing device
10290138 · 2019-05-14

A device comprising a memory (2) storing sound data, three-dimensional surface data, and a plurality of control data sets which represent control points defined by coordinate data associated with sound data; and a processor (4) which, on the basis of first and second successive sound data and of first three-dimensional surface data, selects the control data sets associated with the first and second sound data, and defines second three-dimensional surface data by applying a displacement to each point. The displacement of a given point is calculated as the sum of displacement vectors calculated for each control point on the basis of the sum of first and second vectors, weighted by the ratio between the result of a two-variable function exhibiting a zero limit at infinity, applied to the given point and to the control point, and the sum of the results of this function applied to the given point on the one hand and to each of the control points on the other hand. The first vector represents the displacement of the control point between the first and the second sound data. The second vector corresponds to the difference between the coordinate data of the point and the coordinate data of the control point in the first sound data, multiplied by a coefficient dependent on the gradient of the first vector.
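The kernel-normalized weighting above can be illustrated with a minimal sketch. The inverse-quadratic kernel is one example of a two-variable function with a zero limit at infinity, and collapsing the first/second-vector sum into a single displacement per control point is a simplifying assumption for brevity.

```python
def inv_quadratic(p, c):
    """Example two-variable kernel with a zero limit at infinity."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, c))
    return 1.0 / (1.0 + d2)

def deform(points, controls, displacements, kernel=inv_quadratic):
    """Displace each surface point by the kernel-weighted sum of the
    control-point displacement vectors, with weights normalized so they
    sum to 1 (the ratio of the kernel value to the sum over all controls).
    """
    out = []
    for p in points:
        weights = [kernel(p, c) for c in controls]
        total = sum(weights)
        moved = list(p)
        for w, d in zip(weights, displacements):
            for k in range(len(moved)):
                moved[k] += (w / total) * d[k]
        out.append(tuple(moved))
    return out
```

With a single control point the normalized weight is 1, so every surface point inherits the full control displacement; with several controls, nearer ones dominate.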

JOINT AUDIO-VIDEO FACIAL ANIMATION SYSTEM
20190130628 · 2019-05-02 ·

The present invention relates to a joint automatic audio-visual driven facial animation system that, in some example embodiments, includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system with a strong language model for speech recognition, and obtains phoneme alignment from the word lattice.

Animation synthesis system and lip animation synthesis method

An animation display system is provided. The animation display system includes a display; a storage configured to store a language model database, a phonetic-symbol lip-motion matching database and a lip motion synthesis database; and a processor electronically connected to the storage and the display, respectively. The processor includes a speech conversion module, a phonetic-symbol lip-motion matching module, and a lip motion synthesis module. A lip animation display method is also provided.

System and method for talking avatar
12039997 · 2024-07-16

Aspects of this disclosure provide techniques for generating a viseme and corresponding intensity pair. In some embodiments, the method includes generating, by a server, a viseme and corresponding intensity pair based at least on one of a clean vocal track or corresponding transcription. The method may include generating, by the server, a compressed audio file based at least on one of the viseme, the corresponding intensity, music, or visual offset. The method may further include generating, by the server or a client end application, a buffer of raw pulse-code modulated (PCM) data based on decoding at least a part of the compressed audio file, where the viseme is scheduled to align with a corresponding phoneme.
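Scheduling visemes against decoded PCM data, as described above, might look like the following sketch. The event format, the default sample rate, and the `visual_offset` parameter are assumptions for illustration, not the claimed method.

```python
def schedule_visemes(viseme_events, sample_rate=48000, visual_offset=0.0):
    """Convert (time_sec, viseme, intensity) events into PCM sample
    offsets so each viseme can be rendered in alignment with its phoneme.

    visual_offset shifts mouth shapes earlier/later relative to the audio
    (negative values are clamped to the start of the buffer).
    """
    schedule = []
    for t, viseme, intensity in viseme_events:
        sample_index = max(0, int(round((t + visual_offset) * sample_rate)))
        schedule.append((sample_index, viseme, intensity))
    return sorted(schedule)
```

At playback time, the renderer would pop events whose sample index falls within the current PCM buffer window and drive the avatar's mouth shape accordingly.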

Photorealistic talking faces from audio

Provided is a framework for generating photorealistic 3D talking faces conditioned only on audio input. In addition, the present disclosure provides associated methods to insert generated faces into existing videos or virtual environments. We decompose faces from video into a normalized space that decouples 3D geometry, head pose, and texture. This allows separating the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. To stabilize temporal dynamics, we propose an auto-regressive approach that conditions the model on its previous visual state. We also capture face illumination in our model using audio-independent 3D texture normalization.

SYSTEMS AND METHODS FOR MACHINE-GENERATED AVATARS
20190057714 · 2019-02-21 ·

Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing consists of machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.