GLOBAL PROSODY STYLE TRANSFER WITHOUT TEXT TRANSCRIPTIONS

A computer-implemented method is provided for using a machine learning model to disentangle prosody in spoken natural language. The method includes encoding, by a computing device, the spoken natural language to produce content code. The method further includes resampling, by the computing device and without text transcriptions, the content code to obscure the prosody, by applying an unsupervised technique to the machine learning model to generate prosody-obscured content code. The method additionally includes decoding, by the computing device, the prosody-obscured content code to synthesize speech indirectly based upon the content code.
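As a rough illustration of the encode / resample / decode pipeline this abstract describes, the PyTorch sketch below encodes spectrogram frames into a content code, randomly resamples that code along the time axis so the original rhythm is obscured, and decodes the result back to frames. The GRU layers, dimensions, and linear-interpolation resampler are illustrative assumptions, not the claimed architecture; the resampling needs no labels, which is the sense in which the step is unsupervised.

    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        """Encodes speech features into a content code (sizes are assumed)."""
        def __init__(self, n_mels=80, code_dim=64):
            super().__init__()
            self.rnn = nn.GRU(n_mels, code_dim, batch_first=True)

        def forward(self, mel):            # mel: (batch, frames, n_mels)
            code, _ = self.rnn(mel)
            return code                    # (batch, frames, code_dim)

    def random_resample(code, min_scale=0.5, max_scale=1.5):
        # Stretch/compress the code by a random factor so the original
        # timing (prosody) can no longer be read off the code directly.
        scale = torch.empty(1).uniform_(min_scale, max_scale).item()
        new_len = max(1, int(code.size(1) * scale))
        return nn.functional.interpolate(
            code.transpose(1, 2), size=new_len, mode="linear",
            align_corners=False).transpose(1, 2)

    class Decoder(nn.Module):
        """Synthesizes speech features from the prosody-obscured code."""
        def __init__(self, code_dim=64, n_mels=80):
            super().__init__()
            self.rnn = nn.GRU(code_dim, n_mels, batch_first=True)

        def forward(self, code):
            mel, _ = self.rnn(code)
            return mel

    mel = torch.randn(1, 200, 80)          # stand-in for input speech features
    code = ContentEncoder()(mel)           # content code
    obscured = random_resample(code)       # prosody-obscured content code
    out = Decoder()(obscured)              # speech synthesized from the code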

METHOD OF CONVERTING SPEECH, ELECTRONIC DEVICE, AND READABLE STORAGE MEDIUM
20220383876 · 2022-12-01

A method of converting speech, an electronic device, and a readable storage medium are provided, which relate to the field of artificial intelligence technology such as speech and deep learning, and in particular to speech conversion technology. The method of converting speech includes: acquiring a first speech of a target speaker; acquiring a speech of an original speaker; extracting a first feature parameter from the first speech of the target speaker; extracting a second feature parameter from the speech of the original speaker; processing the first feature parameter and the second feature parameter to obtain Mel spectrum information; and converting the Mel spectrum information to output a second speech of the target speaker having a tone identical to that of the first speech of the target speaker and a content identical to that of the speech of the original speaker.
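A toy sketch of the fusion step may clarify the two feature parameters: a global timbre vector pooled from the target speaker's speech is combined with frame-level content features of the original speaker's speech to yield Mel spectrum information. The mean-pooling, the fusion layer, and all shapes are assumptions for illustration; a real system would use trained networks and a neural vocoder for the final Mel-to-waveform conversion.

    import torch
    import torch.nn as nn

    n_mels = 80
    target_mel = torch.randn(1, 150, n_mels)   # first speech (target speaker)
    source_mel = torch.randn(1, 200, n_mels)   # speech of the original speaker

    # First feature parameter: a global timbre vector for the target speaker.
    timbre = target_mel.mean(dim=1, keepdim=True)       # (1, 1, n_mels)
    # Second feature parameter: frame-level content features of the source.
    content = source_mel                                # (1, frames, n_mels)

    # Fuse both parameters and project back to Mel bins.
    fuse = nn.Linear(2 * n_mels, n_mels)
    mel_out = fuse(torch.cat(
        [content, timbre.expand(-1, content.size(1), -1)], dim=-1))
    print(mel_out.shape)    # (1, 200, 80): Mel frames to hand to a vocoder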

Communication using interactive avatars

Generally, this disclosure describes a video communication system that replaces actual live images of the participating users with animated avatars. A method may include selecting an avatar; initiating communication; detecting a user input; identifying the user input; identifying an animation command based on the user input; generating avatar parameters; and transmitting at least one of the animation command and the avatar parameters.
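The claimed flow (detect an input, identify an animation command, generate avatar parameters, transmit) can be sketched as a simple dispatch. The command names, parameter fields, and JSON packet below are invented for illustration; the disclosure does not specify a wire format.

    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class AvatarParams:
        avatar_id: str
        mouth_open: float   # 0..1, drives lip animation
        head_yaw: float     # radians

    # Hypothetical mapping from identified user inputs to animation commands.
    COMMANDS = {"smile_gesture": "play_smile", "nod_gesture": "play_nod"}

    def handle_input(user_input: str, avatar_id: str) -> str:
        """Identify the input, pick a command, build avatar parameters,
        and serialize the packet that would be transmitted to the peer."""
        command = COMMANDS.get(user_input, "idle")
        params = AvatarParams(avatar_id=avatar_id, mouth_open=0.3, head_yaw=0.0)
        return json.dumps({"command": command, "params": asdict(params)})

    print(handle_input("smile_gesture", avatar_id="fox_01"))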

ELECTRONIC DEVICE, METHOD AND COMPUTER PROGRAM

An electronic device having circuitry configured to perform source separation on an audio signal to obtain a separated source and a residual signal, to perform feature extraction on the separated source to obtain one or more processing parameters, and to perform audio processing on a captured audio signal based on the one or more processing parameters to obtain an adjusted separated source.
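A minimal numeric sketch of the three stages (separation, feature extraction, processing) follows. The trivial low/high-band split stands in for a real separator such as a trained network, and the single extracted parameter (the source's RMS level) is an assumption chosen to keep the example short.

    import numpy as np

    def separate(signal, sr=16000, cutoff=1000):
        """Toy separation: low band as the separated source, rest as residual."""
        spec = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
        low = spec * (freqs < cutoff)
        return (np.fft.irfft(low, len(signal)),
                np.fft.irfft(spec - low, len(signal)))

    def extract_params(source):
        # Feature extraction: one processing parameter, the source's RMS level.
        return {"target_rms": float(np.sqrt(np.mean(source ** 2)))}

    def process(captured, params):
        # Audio processing: gain-match the captured signal to that level,
        # yielding the adjusted separated source.
        rms = np.sqrt(np.mean(captured ** 2)) + 1e-12
        return captured * (params["target_rms"] / rms)

    audio = np.random.randn(16000)           # stand-in for the input signal
    source, residual = separate(audio)
    adjusted = process(np.random.randn(16000), extract_params(source))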

METHODS AND SYSTEMS FOR IMAGE PROCESSING USING A LEARNING ENGINE

Systems and methods are disclosed for training an autoencoder. A training dataset is generated comprising images of different faces. A first autoencoder configuration is generated, comprising a first encoder and a first decoder. The first autoencoder configuration is trained using the dataset images, wherein weights associated with the first encoder and weights associated with the first decoder are modified. A second autoencoder configuration is generated comprising the first encoder and a second decoder. The second decoder is trained using images of a first target face. The first encoder weights are substantially maintained, and weights associated with the second decoder are modified. An autoencoder comprising the trained first encoder and the trained second decoder generates an output using a source image of a first face having a facial expression, where the facial expression of the first face from the source image is applied to the first target face.
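The two-configuration scheme can be sketched in a few lines: one shared encoder, a first decoder trained on many faces, then a second decoder trained on the target face while only the decoder's weights are updated. Operating on flat vectors rather than images, and the layer sizes, are simplifying assumptions.

    import torch
    import torch.nn as nn

    dim = 128
    encoder   = nn.Linear(dim, 32)           # first encoder (shared)
    decoder_a = nn.Linear(32, dim)           # first decoder
    decoder_b = nn.Linear(32, dim)           # second decoder (target face)

    def train(enc, dec, data, freeze_encoder=False, steps=100):
        # When freeze_encoder is True the optimizer updates only the decoder,
        # so the encoder weights are substantially maintained.
        params = list(dec.parameters()) + (
            [] if freeze_encoder else list(enc.parameters()))
        opt = torch.optim.Adam(params, lr=1e-3)
        for _ in range(steps):
            loss = nn.functional.mse_loss(dec(enc(data)), data)
            opt.zero_grad()
            loss.backward()
            opt.step()

    faces  = torch.randn(64, dim)            # stand-in for many-face dataset
    target = torch.randn(16, dim)            # stand-in for target-face images
    train(encoder, decoder_a, faces)                        # first configuration
    train(encoder, decoder_b, target, freeze_encoder=True)  # second configuration

    source_face = torch.randn(1, dim)        # source face with an expression
    swapped = decoder_b(encoder(source_face))  # expression mapped to target face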

METHOD AND SYSTEM FOR TIME AND FEATURE MODIFICATION OF SIGNALS

The application relates to a computer-implemented method and system for modifying at least one feature of an input audio signal based on features in a guide audio signal. The method comprises: determining matchable and unmatchable sections of the guide and input audio signals; generating a time-alignment path for modifying the at least one feature of the input audio signal in the matchable sections of the input audio signal based on corresponding features in the matchable sections of the guide audio signal; and, based on the time-alignment path, modifying the at least one feature in the matchable sections of the input audio signal.
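One standard way to realize such a time-alignment path is dynamic time warping (DTW) between feature tracks of the guide and input signals; the pure-Python DTW below, over a single matchable section covering both signals, is an illustrative stand-in for whatever alignment the application actually claims.

    import numpy as np

    def dtw_path(guide, inp):
        """Dynamic-time-warping path between two 1-D feature tracks."""
        n, m = len(guide), len(inp)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(guide[i - 1] - inp[j - 1])
                cost[i, j] = d + min(cost[i - 1, j - 1],
                                     cost[i - 1, j], cost[i, j - 1])
        # Backtrack from the end to recover the alignment path.
        path, i, j = [], n, m
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([cost[i - 1, j - 1],
                                  cost[i - 1, j], cost[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i = i - 1
            else:
                j = j - 1
        return path[::-1]

    guide = np.abs(np.random.randn(40))      # guide-signal feature track
    inp   = np.abs(np.random.randn(55))      # input-signal feature track
    modified = inp.copy()
    for gi, ii in dtw_path(guide, inp):
        modified[ii] = guide[gi]             # pull the input feature to the guide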

AUDIO REACTIVE AUGMENTED REALITY

Methods, systems, and storage media for augmenting a video are disclosed. Exemplary implementations may: receive a selection of an effect; receive user-generated content comprising video data and audio data; detect a characteristic of the audio data comprising at least a volume and/or a pitch of the audio data during a period of time; determine a series of numeric values based on the characteristic of the audio data during the period of time, individual numeric values of the series of numeric values being correlated with an amplitude of the volume and/or pitch at a discrete point within the period of time; and augment at least one of the video data and the audio data to include the effect based on the series of numeric values at discrete points in time within the period of time.
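The series of numeric values the abstract describes can be sketched as per-frame RMS volume over the clip, then used to scale an effect on the matching video frames. The window size, normalization, and brightness effect are illustrative assumptions.

    import numpy as np

    sr, fps = 16000, 30
    audio = np.random.randn(sr * 2)          # stand-in for 2 s of audio data
    hop = sr // fps                          # one numeric value per video frame

    # Series of numeric values: volume amplitude at discrete points in time.
    values = np.array([
        np.sqrt(np.mean(audio[i:i + hop] ** 2))
        for i in range(0, len(audio) - hop, hop)])
    values /= values.max() + 1e-12           # normalize to 0..1

    def augment_frame(frame, value):
        # Apply the effect: scale frame brightness by the volume amplitude.
        return np.clip(frame * (0.5 + value), 0, 255).astype(np.uint8)

    frames = np.random.randint(0, 255, (len(values), 4, 4, 3), dtype=np.uint8)
    out = [augment_frame(f, v) for f, v in zip(frames, values)]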