Patent classifications
G10L2021/105
Systems and Methods for Assisted Translation and Lip Matching for Voice Dubbing
Systems and methods for generating candidate translations for use in creating synthetic or human-acted voice dubbings, aiding human translators in generating translations that match the corresponding video, automatically grading how well a candidate translation matches the corresponding video, suggesting modifications to the speed and/or timing of the translated text to improve the grading of a candidate translation, and suggesting modifications to the voice dubbing and/or video to improve the grading of a candidate translation. In that regard, the present technology may be used to fully automate the process of generating lip-matched translations and associated voice dubbings, or as an aid for human-in-the-loop processes that may reduce or eliminate the time and effort required from translators, adapters, voice actors, and/or audio editors to generate voice dubbings.
Systems, methods, devices and apparatuses for detecting facial expression
A system, method and apparatus for detecting facial expressions according to EMG signals.
Processing speech signals of a user to generate a visual representation of the user
A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors and a data processing device comprising the one or more processors configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal. The data processing device executes a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal, maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector, and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
APPARATUS, METHOD, AND COMPUTER PROGRAM FOR PROVIDING LIP-SYNC VIDEO AND APPARATUS, METHOD, AND COMPUTER PROGRAM FOR DISPLAYING LIP-SYNC VIDEO
Provided is a lip-sync video providing apparatus for providing a video in which a voice and lip shapes are synchronized. The lip-sync video providing apparatus is configured to obtain a template video including at least one frame and depicting a target object, obtain a target voice to be used as a voice of the target object, generate a lip image corresponding to the voice for each frame of the template video by using a trained first artificial neural network, and generate lip-sync data including frame identification information of a frame in the template video, the lip image, and position information regarding the lip image in a frame in the template video.
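The lip-sync data described in the abstract above (frame identification information, a lip image, and the position of that image within the frame) can be pictured as a small record type. The following is a minimal sketch, not the apparatus's actual format: the names `LipSyncEntry` and `apply_entry` are hypothetical, images are modeled as nested lists of grayscale values, and pasting the lip image onto the template frame is an assumption about how a displaying apparatus might use the data.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Grayscale pixels in row-major order; a stand-in for real image data.
Image = List[List[int]]


@dataclass
class LipSyncEntry:
    """Hypothetical record for one frame's lip-sync data."""
    frame_id: int              # frame identification information
    lip_image: Image           # lip image generated for that frame
    position: Tuple[int, int]  # (row, col) of the lip image's top-left corner


def apply_entry(frame: Image, entry: LipSyncEntry) -> Image:
    """Return a copy of `frame` with the lip image pasted at its position.

    Assumes the lip image fits entirely inside the frame.
    """
    out = [row[:] for row in frame]  # copy so the template frame is untouched
    top, left = entry.position
    for r, lip_row in enumerate(entry.lip_image):
        for c, pixel in enumerate(lip_row):
            out[top + r][left + c] = pixel
    return out
```

Under these assumptions, only the small lip image and its coordinates need to accompany each frame identifier, and the template frame itself is left unmodified.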
Method and apparatus for predicting mouth-shape feature, and electronic device
A method and apparatus for predicting a mouth-shape feature, and an electronic device are provided. A specific implementation of the method comprises: recognizing a phonetic posteriorgram (PPG) of a phonetic feature; and performing a prediction on the PPG by using a neural network model, to predict a mouth-shape feature of the phonetic feature, the neural network model being obtained by training with training samples and an input thereof including a PPG and an output thereof including a mouth-shape feature, and the training samples including a PPG training sample and a mouth-shape feature training sample.
Method and apparatus for controlling avatars based on sound
Provided is a method for controlling avatar motion, which is operated in a user terminal and includes receiving input audio via an audio sensor, and controlling, by one or more processors, a motion of a first user avatar based on the input audio.
VOICE INTERACTION METHOD AND ELECTRONIC DEVICE
Embodiments of this application provide a voice interaction method and an electronic device, and relate to the field of artificial intelligence (AI) technologies and the field of voice processing technologies. A specific solution includes the following: an electronic device may receive first voice information sent by a second user and, in response, recognize the first voice information. The first voice information is used to request a voice conversation with a first user. On recognizing that the first voice information is voice information of the second user, the electronic device may hold a voice conversation with the second user by imitating the voice of the first user, in the mode in which the first user converses with the second user.
ELECTRONIC DEVICE FOR GENERATING MOUTH SHAPE AND METHOD FOR OPERATING THEREOF
An electronic device includes at least one processor, and at least one memory storing instructions executable by the at least one processor and operatively connected to the at least one processor, where the at least one processor is configured to acquire voice data to be synthesized with at least one first image, generate a plurality of mouth shape candidates by using the voice data, select a mouth shape candidate among the plurality of mouth shape candidates, generate at least one second image based on the selected mouth shape candidate and at least a portion of each of the at least one first image, and generate at least one third image by applying at least one super-resolution model to the at least one second image.
Viseme data generation for presentation while content is output
Systems and methods for viseme data generation are disclosed. Uncompressed audio data is generated and/or utilized to determine the beats per minute of the audio data. Visemes are associated with the audio data utilizing a Viterbi algorithm and the beats per minute. A time-stamped list of viseme data is generated that associates the visemes with the portions of the audio data that they correspond to. An animatronic toy and/or an animation is caused to lip sync using the viseme data while audio corresponding to the audio data is output.
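The abstract above names a Viterbi algorithm and a time-stamped list of viseme data but gives no detail on either. The following is a minimal sketch of how Viterbi decoding over per-frame viseme scores could yield such a list. Everything specific here is an assumption for illustration: the function names `viterbi` and `to_timestamped`, the log-probability inputs, and the fixed frame duration. The sketch does not model how beats per minute enters the method, since the abstract does not say.

```python
def viterbi(emissions, transitions, initial):
    """Return the most likely state sequence.

    emissions[t][s]   : log-probability of the audio frame t given viseme s
    transitions[i][j] : log-probability of moving from viseme i to viseme j
    initial[s]        : log-probability of starting in viseme s
    """
    n_states = len(initial)
    score = [initial[s] + emissions[0][s] for s in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for frame in emissions[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: prev[i] + transitions[i][j])
            pointers.append(best_i)
            score.append(prev[best_i] + transitions[best_i][j] + frame[j])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def to_timestamped(path, labels, frame_seconds):
    """Collapse runs of identical states into (start, end, viseme) entries."""
    entries = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            entries.append((start * frame_seconds,
                            t * frame_seconds,
                            labels[path[start]]))
            start = t
    return entries
```

A caller would pass the decoded path, a list of viseme labels, and the assumed frame duration to `to_timestamped` to obtain entries that a lip-sync animation could step through while the audio plays.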