Patent classifications
G10L2021/105
Joint audio-video facial animation system
The present invention relates to a joint automatic audio-visual-driven facial animation system that, in some example embodiments, includes a full-scale, state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) engine with a strong language model for speech recognition, with phoneme alignment obtained from the word lattice.
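A minimal sketch of the alignment step, assuming a hypothetical word lattice whose best path carries word start and end times and a toy pronunciation lexicon; a real LVCSR decoder would emit per-phoneme timings directly, so the uniform split below is only illustrative.

```python
# Illustrative only: reading phoneme timings off the best path of a word lattice
# to drive a facial-animation pipeline. Lattice and lexicon structures are hypothetical.
from dataclasses import dataclass

@dataclass
class WordArc:
    word: str
    start: float  # seconds
    end: float    # seconds

# Toy pronunciation lexicon (assumed): word -> phoneme list.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def phoneme_alignment(best_path):
    """Spread each word's duration uniformly over its phonemes (stand-in for real timings)."""
    alignment = []
    for arc in best_path:
        phones = LEXICON[arc.word]
        step = (arc.end - arc.start) / len(phones)
        for i, ph in enumerate(phones):
            alignment.append((ph, arc.start + i * step, arc.start + (i + 1) * step))
    return alignment

if __name__ == "__main__":
    path = [WordArc("hello", 0.00, 0.45), WordArc("world", 0.45, 0.90)]
    for ph, t0, t1 in phoneme_alignment(path):
        print(f"{ph:>2}  {t0:.2f}-{t1:.2f}s")
```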
Periocular and audio synthesis of a full face image
Systems and methods for synthesizing an image of the face by a head-mounted device (HMD) are disclosed. The HMD may not be able to observe a portion of the face. The systems and methods described herein can generate a mapping between a conformation of the portion of the face that is not imaged and a conformation of the portion of the face that is observed. The HMD can receive an image of a portion of the face and use the mapping to determine a conformation of the portion of the face that is not observed. The HMD can combine the observed and unobserved portions to synthesize a full face image.
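A minimal sketch of the mapping idea, assuming the observed (periocular) and unobserved (lower-face) regions are each coded as small parameter vectors; the linear least-squares map and the dimensions are illustrative assumptions, not the disclosed model.

```python
# Fit a simple linear map from observed periocular parameters to unobserved
# lower-face parameters, then apply it to a new periocular observation.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HID_DIM = 8, 12  # assumed periocular vs. lower-face parameter counts

# Calibration pairs collected while the whole face was observable (synthetic here).
obs_train = rng.normal(size=(200, OBS_DIM))
hid_train = obs_train @ rng.normal(size=(OBS_DIM, HID_DIM)) + 0.05 * rng.normal(size=(200, HID_DIM))

# Least-squares fit stands in for the learned mapping.
W, *_ = np.linalg.lstsq(obs_train, hid_train, rcond=None)

def synthesize_full_face(periocular_conformation: np.ndarray) -> np.ndarray:
    """Infer the unobserved conformation and concatenate into a full-face code."""
    lower_face = periocular_conformation @ W
    return np.concatenate([periocular_conformation, lower_face])

full = synthesize_full_face(rng.normal(size=OBS_DIM))
print(full.shape)  # (OBS_DIM + HID_DIM,)
```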
PROCESSING SPEECH SIGNALS OF A USER TO GENERATE A VISUAL REPRESENTATION OF THE USER
A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors, and a data processing device comprising the one or more processors and configured to generate, in response to receiving the voice signal, a representation of the speaker that generated the voice signal. The data processing device executes a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal; maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
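A hedged sketch of the described pipeline, with an assumed log-energy voice embedding and a linear modality transfer function standing in for whatever learned models the system actually uses; all function names here are hypothetical.

```python
# Voice signal -> feature vector -> visual feature, using toy stand-in models.
import numpy as np

def voice_embedding(voice_signal: np.ndarray, frame: int = 400) -> np.ndarray:
    """Toy embedding: per-frame log-energy statistics as the feature vector."""
    frames = voice_signal[: len(voice_signal) // frame * frame].reshape(-1, frame)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-8)
    return np.array([log_energy.mean(), log_energy.std()])

def modality_transfer(signal_features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linear mapping from signal features to visual features (e.g. jaw opening)."""
    return weights @ signal_features

def render_visual_representation(visual_features: np.ndarray) -> dict:
    """Stand-in for image generation: report the predicted visual feature."""
    return {"jaw_opening": float(visual_features[0])}

rng = np.random.default_rng(1)
signal = rng.normal(size=16000)  # 1 s of synthetic audio at 16 kHz
W = rng.normal(size=(1, 2))      # assumed learned transfer weights
print(render_visual_representation(modality_transfer(voice_embedding(signal), W)))
```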
Facial recognition method for video conference and server using the method
A facial recognition method for video conferencing, requiring reduced bandwidth and transmitting video and audio frames synchronously, first determines whether a 3D body model of a first user at the local end has already been retrieved, or is otherwise retrievable, from a historical database. Multiple audio frames of the first user are collected, and audio frequencies in a specific range are filtered out. An envelope curve of the first audio frames, together with multiple attack time periods and multiple release time periods of the envelope curve, is calculated and correlated with the lip movements of the first user. Information packets of the same, along with head-rotating and limb-swinging images of the first user, are transmitted to a remote second user so that the 3D body model can simulate and show the lip shapes and other movements of the first user.
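A rough sketch of the envelope step, assuming a rectify-and-smooth envelope and a simple rising/falling segmentation into attack and release periods; the filtering, window length, and threshold are illustrative choices.

```python
# Compute an amplitude envelope of audio and mark rising (attack) and falling
# (release) periods, which would then be correlated with lip movements.
import numpy as np

def envelope(audio: np.ndarray, win: int = 256) -> np.ndarray:
    """Rectify and smooth with a moving average to approximate the envelope."""
    kernel = np.ones(win) / win
    return np.convolve(np.abs(audio), kernel, mode="same")

def attack_release_periods(env: np.ndarray, eps: float = 1e-4):
    """Return index ranges where the envelope is rising (attack) or falling (release)."""
    diff = np.diff(env)
    rising = diff > eps
    periods = {"attack": [], "release": []}
    start, state = 0, rising[0]
    for i in range(1, len(rising)):
        if rising[i] != state:
            periods["attack" if state else "release"].append((start, i))
            start, state = i, rising[i]
    periods["attack" if state else "release"].append((start, len(rising)))
    return periods

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 220 * t) * np.clip(np.sin(2 * np.pi * 2 * t), 0, None)
print({k: len(v) for k, v in attack_release_periods(envelope(audio)).items()})
```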
Production of speech based on whispered speech and silent speech
A method, a system, and a computer program product are provided for interpreting low amplitude speech and transmitting amplified speech to a remote communication device. At least one computing device receives sensor data from multiple sensors. The sensor data is associated with the low amplitude speech. At least one of the at least one computing device analyzes the sensor data to map the sensor data to at least one syllable resulting in a string of one or more words. An electronic representation of the string of the one or more words may be generated and transmitted to a remote communication device for producing the amplified speech from the electronic representation.
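A minimal sketch of the sensor-to-speech mapping, assuming nearest-template matching of sensor frames to syllables and a naive syllable join; the templates and matcher are placeholders for whatever models the computing device actually applies.

```python
# Sensor frames -> syllables -> word string -> electronic representation for transmission.
import numpy as np

SYLLABLE_TEMPLATES = {  # assumed calibration templates: syllable -> feature vector
    "hel": np.array([0.9, 0.1, 0.3]),
    "lo":  np.array([0.2, 0.8, 0.5]),
}

def classify_syllable(frame: np.ndarray) -> str:
    """Nearest-template match for one sensor frame."""
    return min(SYLLABLE_TEMPLATES, key=lambda s: np.linalg.norm(frame - SYLLABLE_TEMPLATES[s]))

def sensor_frames_to_text(frames) -> str:
    """Map each frame to a syllable and join; a real system would add a language model."""
    return "".join(classify_syllable(f) for f in frames)

def electronic_representation(text: str) -> bytes:
    """Encode the recognized word string for transmission to a remote device."""
    return text.encode("utf-8")

frames = [np.array([0.85, 0.15, 0.25]), np.array([0.25, 0.75, 0.55])]
print(electronic_representation(sensor_frames_to_text(frames)))  # b'hello'
```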
Processing speech to drive animations on avatars
Functionality is disclosed herein for a framework that allows a VR/AR application to utilize different services. In some configurations, a VR/AR application can utilize different services, such as an animation service, a multi-modal disambiguation service, a virtual platform service, a recognition service, an automatic speech recognition (ASR) service, a text-to-speech (TTS) service, a search service, as well as one or more other services. Instead of the developer of the VR/AR application having to develop programming code to implement features provided by one or more of the services, the developer may utilize functionality of existing services that are available from a service provider network.
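A hedged sketch of the framework idea: the application names a capability (here an assumed "asr" service) and the framework dispatches to a registered service, so the developer does not implement the feature directly; the registry and service interface below are illustrative, not the service provider network's actual API.

```python
# A toy service registry: the application asks for a named capability and the
# framework routes the request to whichever service implementation is registered.
from typing import Protocol

class Service(Protocol):
    def invoke(self, request: dict) -> dict: ...

class EchoAsrService:
    def invoke(self, request: dict) -> dict:
        # Placeholder recognizer: reports only how much audio it received.
        return {"transcript": f"<recognized from {len(request['audio'])} samples>"}

class ServiceRegistry:
    """Framework-side lookup so the application only names the capability it wants."""
    def __init__(self) -> None:
        self._services: dict[str, Service] = {}

    def register(self, name: str, service: Service) -> None:
        self._services[name] = service

    def call(self, name: str, request: dict) -> dict:
        return self._services[name].invoke(request)

registry = ServiceRegistry()
registry.register("asr", EchoAsrService())
print(registry.call("asr", {"audio": [0.0] * 16000}))
```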
METHOD AND APPARATUS FOR GENERATING ANIMATION
Embodiments of the present disclosure provide a method and apparatus for generating an animation. A method may include: extracting an audio feature from the target speech segment by segment, and aggregating the audio features into an audio feature sequence composed of the audio feature of each speech segment; inputting the audio feature sequence into a pre-trained mouth-shape information prediction model to obtain a mouth-shape information sequence corresponding to the audio feature sequence; generating, for each piece of mouth-shape information in the mouth-shape information sequence, a face image including the mouth-shape object indicated by that mouth-shape information; and using the generated face images as key frames of a facial animation to generate the facial animation.
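A minimal sketch of the prediction step, assuming a small recurrent network that maps one audio feature vector per speech segment to one mouth-shape parameter vector per segment; the feature sizes and architecture are assumptions, not the pre-trained model described above.

```python
# Segment-level audio features -> per-segment mouth-shape parameters (key-frame inputs).
import torch
import torch.nn as nn

class MouthShapePredictor(nn.Module):
    def __init__(self, audio_dim: int = 13, mouth_dim: int = 4):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, 32, batch_first=True)
        self.head = nn.Linear(32, mouth_dim)

    def forward(self, audio_feature_sequence: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.rnn(audio_feature_sequence)  # (batch, segments, 32)
        return self.head(hidden)                      # (batch, segments, mouth_dim)

model = MouthShapePredictor()
segments = torch.randn(1, 25, 13)        # 25 speech segments of 13-dim features (assumed)
mouth_shape_sequence = model(segments)   # one mouth-shape vector per segment
print(mouth_shape_sequence.shape)        # torch.Size([1, 25, 4])
```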
USING MACHINE-LEARNING MODELS TO DETERMINE MOVEMENTS OF A MOUTH CORRESPONDING TO LIVE SPEECH
Disclosed systems and methods predict visemes from an audio sequence. A viseme-generation application accesses a first set of training data that includes a first audio sequence representing a sentence spoken by a first speaker and a sequence of visemes, each viseme mapped to a respective audio sample of the first audio sequence. The viseme-generation application creates a second set of training data by adjusting a second audio sequence, spoken by a second speaker speaking the same sentence, such that the second and first sequences have the same length and at least one phoneme occurs at the same time stamp in both sequences. The viseme-generation application maps the sequence of visemes to the second audio sequence and trains a viseme prediction model to predict a sequence of visemes from an audio sequence.
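A hedged sketch of the training-data adjustment, assuming simple linear time-stretching of the second speaker's features to the first sequence's length so the first speaker's viseme labels can be reused; the real application may align phonemes differently.

```python
# Time-stretch a second speaker's audio features to the first speaker's length,
# then pair them with the first speaker's per-frame viseme labels.
import numpy as np

def stretch_to_length(features: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly resample a (frames, dims) feature matrix to target_len frames."""
    src = np.linspace(0.0, 1.0, num=len(features))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.stack([np.interp(dst, src, features[:, d]) for d in range(features.shape[1])], axis=1)

rng = np.random.default_rng(2)
first_audio = rng.normal(size=(120, 13))       # first speaker, 120 frames (synthetic)
first_visemes = rng.integers(0, 20, size=120)  # one viseme label per frame (synthetic)
second_audio = rng.normal(size=(150, 13))      # same sentence, different pace

second_aligned = stretch_to_length(second_audio, len(first_audio))
second_training_pairs = list(zip(second_aligned, first_visemes))  # reuse the labels
print(second_aligned.shape, len(second_training_pairs))
```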
Method for audio-driven character lip sync, model for audio-driven character lip sync and training method therefor
Embodiments of the present disclosure provide a method for audio-driven character lip sync, a model for audio-driven character lip sync, and a training method therefor. A target dynamic image is obtained by acquiring a character image of a target character and speech for generating the target dynamic image, processing the character image and the speech into trainable image-audio data, and mixing the image-audio data with auxiliary data for training. When a large amount of sample data would otherwise need to be obtained for training in different scenarios, a video of another character speaking is processed as an auxiliary video to obtain the auxiliary data. The auxiliary data, which replaces non-general sample data, is input into a model together with the other data in a preset ratio for training. The auxiliary data improves the training of the model's synthetic lip-sync action, so that parts unrelated to the synthetic lip-sync action are excluded from the training process. In this way, the problem that a large amount of sample data is required during training is resolved.
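A minimal sketch of the data-mixing idea, assuming batches that draw target-character pairs and auxiliary pairs in a preset ratio; the ratio, batch size, and sample format are illustrative.

```python
# Yield training batches that mix target-character image-audio pairs with
# auxiliary pairs (from videos of other characters) at a preset ratio.
import random

def mixed_batches(target_pairs, auxiliary_pairs, aux_ratio=0.5, batch_size=8):
    """Yield batches containing roughly aux_ratio auxiliary samples per batch."""
    n_aux = int(round(batch_size * aux_ratio))
    n_tgt = batch_size - n_aux
    while True:
        batch = random.sample(target_pairs, n_tgt) + random.sample(auxiliary_pairs, n_aux)
        random.shuffle(batch)
        yield batch

target = [("target_frame_%d" % i, "target_audio_%d" % i) for i in range(20)]
auxiliary = [("aux_frame_%d" % i, "aux_audio_%d" % i) for i in range(200)]
first_batch = next(mixed_batches(target, auxiliary, aux_ratio=0.25))
print(sum(name.startswith("aux") for name, _ in first_batch), "auxiliary samples in batch")
```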
MODEL LEARNING SYSTEM, MODEL LEARNING METHOD, A NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM, AN ANIMATION GENERATION SYSTEM, AND AN ANIMATION GENERATION METHOD
Embodiments of the present disclosure provide methods, systems and non-transitory computer readable media of performing voice model learning and rig model learning. The voice model learning includes extracting an acoustic feature value by executing predetermined acoustic signal processing with respect to voice data including human voice and extracting a voice feature value by executing first transformation processing with respect to first input information including the extracted acoustic feature value. The rig model learning includes extracting a frame feature value by executing second transformation processing with respect to second input information including the extracted voice feature value and outputting character control information for controlling a character from the extracted frame feature value.
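A hedged sketch of the two learned stages, with small MLPs standing in for the first and second transformation processing; dimensions and layer choices are assumptions, not the disclosed architectures.

```python
# Voice model: acoustic features -> voice features.
# Rig model: voice features -> frame features -> character control information.
import torch
import torch.nn as nn

class VoiceModel(nn.Module):
    """First transformation processing: acoustic feature -> voice feature value."""
    def __init__(self, acoustic_dim=40, voice_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(acoustic_dim, 64), nn.ReLU(), nn.Linear(64, voice_dim))

    def forward(self, acoustic):
        return self.net(acoustic)

class RigModel(nn.Module):
    """Second transformation processing: voice feature -> frame feature -> control info."""
    def __init__(self, voice_dim=16, frame_dim=32, control_dim=10):
        super().__init__()
        self.frame = nn.Linear(voice_dim, frame_dim)
        self.control = nn.Linear(frame_dim, control_dim)

    def forward(self, voice_feature):
        frame_feature = torch.relu(self.frame(voice_feature))
        return self.control(frame_feature)  # e.g. rig/blendshape controls per frame

acoustic_features = torch.randn(100, 40)  # 100 frames of assumed acoustic features
controls = RigModel()(VoiceModel()(acoustic_features))
print(controls.shape)                     # torch.Size([100, 10])
```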