G10L21/007

SYNTHESIZED SPEECH GENERATION

A device for speech generation includes one or more processors configured to receive one or more control parameters indicating target speech characteristics. The one or more processors are also configured to process, using a multi-encoder, an input representation of speech based on the one or more control parameters to generate encoded data corresponding to an audio signal that represents a version of the speech based on the target speech characteristics.

Streaming voice conversion method and apparatus and computer readable storage medium using the same

The present disclosure provides a streaming voice conversion method as well as an apparatus and a computer readable storage medium using the same. The method includes: obtaining to-be-converted voice data; partitioning the to-be-converted voice data in an order of data obtaining time as a plurality of to-be-converted partition voices, where the to-be-converted partition voice data carries a partition mark; performing a voice conversion on each of the to-be-converted partition voices to obtain a converted partition voice, where the converted partition voice carries a partition mark; performing a partition restoration on each of the converted partition voices to obtain a restored partition voice, where the restored partition voice carries a partition mark; and outputting each of the restored partition voices according to the partition mark carried by the restored partition voice. In this manner, the response time is shortened, and the conversion speed is improved.

Unsupervised Learning of Disentangled Speech Content and Style Representation
20220189456 · 2022-06-16 · ·

A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.

SYSTEM FOR CONVERTING VIBRATION TO VOICE FREQUENCY WIRELESSLY
20220174423 · 2022-06-02 ·

The present application discloses a system for converting vibration to voice frequency wirelessly and a method thereof. By sensing a first vibration variation data and a voice frequency variation data of a vocal vibration part in a first sensing period, a voice frequency reference data is obtained from the voice frequency variation data and the first vibration result. A second vibration result is obtained at a second sensing period for converting to a voice frequency output signal, and the voice frequency output signal is used to output as a voice signal corresponding to the voice frequency various result. Thus, the present application provides a voice signal close to a human voice.

AUDIO SIGNAL CONVERSION MODEL LEARNING APPARATUS, AUDIO SIGNAL CONVERSION APPARATUS, AUDIO SIGNAL CONVERSION MODEL LEARNING METHOD AND PROGRAM

A voice signal conversion model learning device includes: a generation unit configured to execute generation processing of generating a conversion destination voice signal on the basis of an input voice signal that is a voice signal of an input voice, conversion source attribute information that is information indicating an attribute of an input voice that is a voice represented by the input voice signal, and conversion destination attribute information indicating an attribute of a voice represented by the conversion destination voice signal that is a voice signal of a conversion destination of the input voice signal; and an identification unit configured to execute voice estimation processing of estimating whether or not a voice signal that is a processing target is a voice signal representing a vocal sound actually uttered by a person on the basis of the conversion source attribute information and the conversion destination attribute intonation, wherein the conversion destination voice signal is input to the identification unit, the processing target is a voice signal input to the identification unit, and the generation unit and the identification unit pertain learning on the basis of an estimation result of the voice estimation processing.

POST FILTER FOR AUDIO SIGNALS

In some embodiments, a pitch filter for filtering a preliminary audio signal generated from an audio bitstream is disclosed. The pitch filter has an operating mode selected from one of either: (i) an active mode where the preliminary audio signal is filtered using filtering information to obtain a filtered audio signal, and (ii) an inactive mode where the pitch filter is disabled. The preliminary audio signal is generated in an audio encoder or audio decoder having a coding mode selected from at least two distinct coding modes, and the pitch filter is capable of being selectively operated in either the active mode or the inactive mode while operating in the coding mode based on control information.

POST FILTER FOR AUDIO SIGNALS

In some embodiments, a pitch filter for filtering a preliminary audio signal generated from an audio bitstream is disclosed. The pitch filter has an operating mode selected from one of either: (i) an active mode where the preliminary audio signal is filtered using filtering information to obtain a filtered audio signal, and (ii) an inactive mode where the pitch filter is disabled. The preliminary audio signal is generated in an audio encoder or audio decoder having a coding mode selected from at least two distinct coding modes, and the pitch filter is capable of being selectively operated in either the active mode or the inactive mode while operating in the coding mode based on control information.

DATA CONVERSION LEARNING DEVICE, DATA CONVERSION DEVICE, METHOD, AND PROGRAM

Accurate conversion to data of a conversion target domain is allowed. A training unit 32 trains a forward generator, an inverse generator, a conversion target discriminator, and a conversion source discriminator to optimize an objective function.

METHOD OF CONVERTING VOICE FEATURE OF VOICE
20220157329 · 2022-05-19 · ·

A method and apparatus for converting a voice of a first speaker into a voice of a second speaker by using a plurality of trained artificial neural networks are provided. The method of converting a voice feature of a voice comprises (i) generating a first audio vector corresponding to a first voice by using a first artificial neural network, (ii) generating a first text feature value corresponding to the first text by using a second artificial neural network, (iii) generating a second audio vector by removing the voice feature value of the first voice from the first audio vector by using the first text feature value and a third artificial neural network, and (iv) generating, by using the second audio vector and a voice feature value of a target voice, a second voice in which a feature of the target voice is reflected.

Personalized Accent and/or Pace of Speaking Modulation for Audio/Video Streams
20230267941 · 2023-08-24 ·

Aspects of the disclosure relate to generating personalized accent and/or pace of speaking modulation for audio/video streams. In some embodiments, a computing platform may train an artificial intelligence model on audio or video samples associated with different geographic regions. The computing platform may receive, via a communication interface, an audio or video stream associated with a first geographic region. The computing platform may identify a second geographic region different from the first geographic region. The computing platform may transform the audio or video stream to correspond to the second geographic region different from the first geographic region. The computing platform may send, via the communication interface, the transformed audio or video stream to a user device associated with the second geographic region.