Patent classifications
G10L2021/0135
Streaming voice conversion method and apparatus and computer readable storage medium using the same
The present disclosure provides a streaming voice conversion method, as well as an apparatus and a computer readable storage medium using the same. The method includes: obtaining to-be-converted voice data; partitioning the to-be-converted voice data, in the order in which the data is obtained, into a plurality of to-be-converted partition voices, where each to-be-converted partition voice carries a partition mark; performing a voice conversion on each of the to-be-converted partition voices to obtain a converted partition voice, where the converted partition voice carries a partition mark; performing a partition restoration on each of the converted partition voices to obtain a restored partition voice, where the restored partition voice carries a partition mark; and outputting each of the restored partition voices according to the partition mark it carries. In this manner, the response time is shortened and the conversion speed is improved.
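The partition-mark pipeline the abstract describes (chunk, convert, restore, reassemble in mark order) can be sketched as a toy; the chunk size, the gain-change stand-in for the actual conversion model, and all function names are assumptions, not the patent's implementation:

```python
import numpy as np

def partition_stream(samples, chunk_size):
    """Split incoming audio into time-ordered partitions, each tagged
    with a partition mark (its sequence index)."""
    return [(mark, samples[i:i + chunk_size])
            for mark, i in enumerate(range(0, len(samples), chunk_size))]

def convert_partition(mark, chunk):
    """Stand-in for the per-partition voice conversion; here just a gain
    change so the example stays self-contained."""
    return mark, chunk * 0.5

def restore_and_output(converted):
    """Reassemble converted partitions in partition-mark order."""
    ordered = sorted(converted, key=lambda pair: pair[0])
    return np.concatenate([chunk for _, chunk in ordered])

audio = np.arange(10, dtype=np.float32)
parts = partition_stream(audio, chunk_size=4)
converted = [convert_partition(m, c) for m, c in parts]
out = restore_and_output(converted)
```

Because each partition is converted as soon as it arrives, output can begin before the full utterance is received, which is the source of the claimed response-time reduction.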
Unsupervised Learning of Disentangled Speech Content and Style Representation
A linguistic content and speaking style disentanglement model includes a content encoder, a style encoder, and a decoder. The content encoder is configured to receive input speech as input and generate a latent representation of linguistic content for the input speech as output. The content encoder is trained to disentangle speaking style information from the latent representation of linguistic content. The style encoder is configured to receive the input speech as input and generate a latent representation of speaking style for the input speech as output. The style encoder is trained to disentangle linguistic content information from the latent representation of speaking style. The decoder is configured to generate output speech based on the latent representation of linguistic content for the input speech and the latent representation of speaking style for the same or different input speech.
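The three-module shape of this architecture can be sketched with fixed random projections standing in for the trained networks; all dimensions and names are invented for illustration, and the key structural assumption is that content is a per-frame sequence while style is a single pooled vector:

```python
import numpy as np

rng = np.random.default_rng(0)
FRAMES, MELS, CONTENT_DIM, STYLE_DIM = 50, 80, 16, 8

# Toy "trained" weights: random projections, not real learned encoders.
W_content = rng.standard_normal((MELS, CONTENT_DIM)) * 0.1
W_style = rng.standard_normal((MELS, STYLE_DIM)) * 0.1
W_dec = rng.standard_normal((CONTENT_DIM + STYLE_DIM, MELS)) * 0.1

def content_encoder(mel):
    # Per-frame latents: linguistic content varies over time.
    return mel @ W_content                      # (FRAMES, CONTENT_DIM)

def style_encoder(mel):
    # One pooled vector: speaking style is treated as utterance-global.
    return (mel @ W_style).mean(axis=0)         # (STYLE_DIM,)

def decoder(content, style):
    style_tiled = np.tile(style, (content.shape[0], 1))
    return np.concatenate([content, style_tiled], axis=1) @ W_dec

speech_a = rng.standard_normal((FRAMES, MELS))
speech_b = rng.standard_normal((FRAMES, MELS))

# Content of A rendered in the style of B -- the cross-input use case
# the last sentence of the abstract permits.
out = decoder(content_encoder(speech_a), style_encoder(speech_b))
```

The pooling in `style_encoder` is what makes mixing content from one utterance with style from another dimensionally possible.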
System and method for cross-speaker style transfer in text-to-speech and training data generation
Systems are configured for generating spectrogram data characterized by a voice timbre of a target speaker and a prosody style of a source speaker by converting a waveform of the source speaker data to phonetic posteriorgram (PPG) data, extracting additional prosody features from the source speaker data, and generating a spectrogram based on the PPG data and the extracted prosody features. The systems are configured to utilize/train a machine learning model for generating spectrogram data and for training a neural text-to-speech model with the generated spectrogram data.
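The PPG-plus-prosody pathway can be illustrated with toy feature extractors; the feature dimensions, the softmax stand-in for an ASR acoustic model, and the two-track prosody representation are assumptions for the sketch, not the patent's actual features:

```python
import numpy as np

rng = np.random.default_rng(1)
FRAMES, FEAT_DIM, N_PHONES, MELS = 40, 13, 42, 80

W_ppg = rng.standard_normal((FEAT_DIM, N_PHONES))
W_mel = rng.standard_normal((N_PHONES + 2, MELS)) * 0.1

def extract_ppg(frames):
    """Stand-in ASR acoustic model: per-frame phoneme posteriors
    (each row is a distribution over phoneme classes)."""
    logits = frames @ W_ppg
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def extract_prosody(frames):
    """Toy prosody features: one pitch-like and one energy-like track."""
    f0 = frames.mean(axis=1, keepdims=True)
    energy = (frames ** 2).mean(axis=1, keepdims=True)
    return np.concatenate([f0, energy], axis=1)  # (FRAMES, 2)

def spectrogram_from(ppg, prosody):
    """Combine speaker-independent PPGs with source prosody to drive
    spectrogram generation in the target speaker's timbre."""
    return np.concatenate([ppg, prosody], axis=1) @ W_mel

feats = rng.standard_normal((FRAMES, FEAT_DIM))
mel = spectrogram_from(extract_ppg(feats), extract_prosody(feats))
```

Because PPGs are largely speaker-independent, the target timbre is carried by the (here randomly initialized) generation weights, while the source speaker's prosody rides alongside as explicit features.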
AUDIO SIGNAL CONVERSION MODEL LEARNING APPARATUS, AUDIO SIGNAL CONVERSION APPARATUS, AUDIO SIGNAL CONVERSION MODEL LEARNING METHOD AND PROGRAM
A voice signal conversion model learning device includes: a generation unit configured to execute generation processing of generating a conversion destination voice signal on the basis of an input voice signal that is a voice signal of an input voice, conversion source attribute information indicating an attribute of the input voice represented by the input voice signal, and conversion destination attribute information indicating an attribute of the voice represented by the conversion destination voice signal, the conversion destination voice signal being a conversion destination of the input voice signal; and an identification unit configured to execute voice estimation processing of estimating, on the basis of the conversion source attribute information and the conversion destination attribute information, whether or not a voice signal that is a processing target represents a vocal sound actually uttered by a person. The conversion destination voice signal is input to the identification unit, the processing target is the voice signal input to the identification unit, and the generation unit and the identification unit perform learning on the basis of an estimation result of the voice estimation processing.
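This generation/identification pair is the generator-discriminator pattern of an attribute-conditioned GAN. A minimal sketch, in which the networks are single random linear layers and the attributes are one-hot codes (all of this is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
FEAT, N_ATTR = 32, 4

def one_hot(i, n=N_ATTR):
    v = np.zeros(n)
    v[i] = 1.0
    return v

W_gen = rng.standard_normal((FEAT + 2 * N_ATTR, FEAT)) * 0.1
W_dis = rng.standard_normal((FEAT + 2 * N_ATTR, 1)) * 0.1

def generation_unit(voice, src_attr, dst_attr):
    """Generate a conversion-destination signal conditioned on both the
    conversion-source and conversion-destination attributes."""
    x = np.concatenate([voice, one_hot(src_attr), one_hot(dst_attr)])
    return np.tanh(x @ W_gen)

def identification_unit(voice, src_attr, dst_attr):
    """Estimate the probability that the processing-target signal is a
    vocal sound actually uttered by a person, under the same conditioning."""
    x = np.concatenate([voice, one_hot(src_attr), one_hot(dst_attr)])
    return 1.0 / (1.0 + np.exp(-(x @ W_dis)[0]))

real = rng.standard_normal(FEAT)
fake = generation_unit(real, src_attr=0, dst_attr=2)
p_fake = identification_unit(fake, 0, 2)  # generator learns to push this up
p_real = identification_unit(real, 0, 2)  # discriminator learns to keep this up
```

In training, the estimation results `p_real` and `p_fake` would drive adversarial updates to both units, which is what the final clause of the abstract describes.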
Personalized voice conversion system
A personalized voice conversion system includes a cloud server and an intelligent device that communicates with the cloud server. The intelligent device uploads an original voice signal to the cloud server. The cloud server converts the original voice signal into an intelligible voice signal based on an intelligible voice conversion model. The intelligent device downloads and plays the intelligible voice signal. Based on the original voice signal and the corresponding intelligible voice signal, the cloud server and the intelligent device train an off-line voice conversion model provided to the intelligent device. When the intelligent device stops communicating with the cloud server, the intelligent device converts a new original voice signal into a new intelligible voice signal based on the off-line voice conversion model and plays the new intelligible voice signal.
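The online/offline fallback logic is straightforward to sketch; the class, string stand-ins for the two conversion models, and the pair-collection detail are illustrative assumptions:

```python
class VoiceConverter:
    """Toy client: uses the cloud model while connected, and falls back
    to a locally trained offline model otherwise."""

    def __init__(self):
        self.connected = True
        self.offline_pairs = []  # (original, intelligible) training pairs

    def cloud_convert(self, signal):
        return "intelligible:" + signal        # stand-in for the server model

    def offline_convert(self, signal):
        return "offline-intelligible:" + signal  # stand-in for the device model

    def convert(self, signal):
        if self.connected:
            out = self.cloud_convert(signal)
            # Each cloud round trip also yields a training pair for the
            # off-line model, as the abstract describes.
            self.offline_pairs.append((signal, out))
            return out
        return self.offline_convert(signal)

vc = VoiceConverter()
online_out = vc.convert("hello")
vc.connected = False          # communication with the cloud server stops
offline_out = vc.convert("world")
```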
REAL-TIME VOICE CONVERTER
Provided are systems and methods for real-time voice conversion. An example method includes generating, using an automatic speech recognition model, first embedding vectors from a first spectrum representation of a first speech audio signal of a first person, wherein the first embedding vectors are indicative of sounds present in the first speech audio signal; generating, using a speaker encoder, second embedding vectors from a second speech audio signal of a second person, wherein the second embedding vectors are indicative of voice characteristics of the second person; generating, based on the first embedding vectors and the second embedding vectors, acoustic features; generating, using a decoder, based on the acoustic features, a second spectrum representation; and synthesizing, based on the second spectrum representation and using a vocoder, a synthetic speech audio signal substantially resembling pronunciation of the first speech audio signal by the second person.
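The five-stage pipeline of this abstract (ASR embeddings, speaker embedding, acoustic features, decoder, vocoder) can be sketched end to end with random linear stand-ins; every dimension, weight, and the trivial "vocoder" are assumptions made only so the shapes line up:

```python
import numpy as np

rng = np.random.default_rng(3)
FRAMES, MELS, C_DIM, S_DIM, HOP = 30, 80, 24, 12, 256

W_asr = rng.standard_normal((MELS, C_DIM)) * 0.1
W_spk = rng.standard_normal((MELS, S_DIM)) * 0.1
W_dec = rng.standard_normal((C_DIM + S_DIM, MELS)) * 0.1

def asr_embeddings(spec):
    """First embedding vectors: per-frame, indicative of sounds present."""
    return spec @ W_asr                          # (FRAMES, C_DIM)

def speaker_embedding(spec):
    """Second embedding: one vector of the second person's voice traits."""
    return (spec @ W_spk).mean(axis=0)           # (S_DIM,)

def decode(content, speaker):
    """Combine both embeddings into acoustic features, then a spectrum."""
    spk = np.tile(speaker, (content.shape[0], 1))
    return np.concatenate([content, spk], axis=1) @ W_dec

def vocoder(spec):
    """Trivial stand-in vocoder: one hop of samples per spectral frame."""
    return np.repeat(spec.mean(axis=1), HOP)

spec_first = np.abs(rng.standard_normal((FRAMES, MELS)))   # first person
spec_second = np.abs(rng.standard_normal((FRAMES, MELS)))  # second person
out_spec = decode(asr_embeddings(spec_first), speaker_embedding(spec_second))
wave = vocoder(out_spec)
```

The frame count of the output follows the first speaker's spectrum, so the pronunciation timing of the first signal is preserved while the voice characteristics come from the second.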
METHOD OF CONVERTING VOICE FEATURE OF VOICE
A method and apparatus for converting a voice of a first speaker into a voice of a second speaker by using a plurality of trained artificial neural networks are provided. The method of converting a voice feature of a voice comprises (i) generating a first audio vector corresponding to a first voice by using a first artificial neural network, (ii) generating a first text feature value corresponding to a first text of the first voice by using a second artificial neural network, (iii) generating a second audio vector by removing a voice feature value of the first voice from the first audio vector by using the first text feature value and a third artificial neural network, and (iv) generating, by using the second audio vector and a voice feature value of a target voice, a second voice in which a feature of the target voice is reflected.
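Steps (i)-(iv) amount to subtracting the source speaker's voice feature from an audio representation and adding the target's. In a deliberately idealized linear toy where each representation is an additive vector (a strong assumption made purely to show the arithmetic, not a claim about the patent's networks):

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 16

content = rng.standard_normal(DIM)     # what was said (text feature, ideal)
spk_first = rng.standard_normal(DIM)   # first speaker's voice feature value
spk_target = rng.standard_normal(DIM)  # target voice feature value

# (i)  first network: encode the first voice into a first audio vector
first_audio_vec = content + spk_first
# (ii) second network: text feature value of the first text (ideal case)
text_feat = content
# (iii) third network: estimate and remove the first voice's feature value,
#       guided by the text feature
estimated_voice_feat = first_audio_vec - text_feat   # recovers spk_first here
second_audio_vec = first_audio_vec - estimated_voice_feat
# (iv) reflect the target voice's feature in the output
second_voice = second_audio_vec + spk_target
```

In reality all four quantities live in learned latent spaces and the subtraction is performed by a trained network rather than literal vector arithmetic; the toy only shows why a text feature makes the removal well-posed.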
Method for modifying a style of an audio object, and corresponding electronic device, computer readable program products and computer readable storage medium
The disclosure relates to a method for processing an input audio signal. According to an embodiment, the method includes obtaining a base audio signal that is a copy of the input audio signal and generating an output audio signal from the base signal, the output audio signal having style features obtained by modifying the base signal so that a distance between base style features representative of a style of the base signal and a reference style feature decreases. The disclosure also relates to a corresponding electronic device, computer readable program product and computer readable storage medium.
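The core idea, iteratively modifying a copy of the input so that a style-feature distance to a reference shrinks, can be sketched with a toy style descriptor (mean and spread of the signal; the descriptor, step rule, and all parameters are assumptions, not the patent's features):

```python
import numpy as np

def style_features(signal):
    """Toy style descriptor: mean and standard deviation of the signal."""
    return np.array([signal.mean(), signal.std()])

def match_style(base, reference_style, steps=200, lr=0.1):
    """Modify a copy of the base signal so the distance between its style
    features and the reference style feature decreases."""
    x = base.copy()
    ref_mu, ref_sigma = reference_style
    for _ in range(steps):
        mu, sigma = style_features(x)
        x = x + lr * (ref_mu - mu)                   # nudge the mean
        if sigma > 0:                                # nudge the spread
            scale = 1 + lr * (ref_sigma / sigma - 1)
            x = (x - x.mean()) * scale + x.mean()
    return x

rng = np.random.default_rng(5)
base = rng.standard_normal(1000)
ref_style = np.array([2.0, 0.5])
before = np.linalg.norm(style_features(base) - ref_style)
out = match_style(base, ref_style)
after = np.linalg.norm(style_features(out) - ref_style)
```

A real embodiment would use learned style features and gradient descent on the distance; the toy uses closed-form nudges so the monotone decrease is easy to verify.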
System for improving dysarthria speech intelligibility and method thereof
A system for improving dysarthria speech intelligibility, and a method thereof, are provided. In the system, the user only needs to provide a set of paired corpora including a reference corpus and a patient corpus, and a speech disordering module automatically generates a new corpus completely synchronous with the reference corpus; the new corpus can then be used as a training corpus for training a dysarthria voice conversion model. The present invention does not need conventional corpus-alignment technology or manual pre-processing of the training corpus, so that manpower cost and time cost are reduced and synchronization of the training corpus is ensured, thereby improving both the training and conversion qualities of the voice conversion model.
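One way to obtain a disordered corpus that is frame-for-frame synchronous with the reference, without any alignment step, is to apply a disordering transform directly to the reference frames. The sketch below assumes exactly that; the transform, the "patient profile" parameters, and how such a profile would be learned from the patient corpus are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def disorder(frames, profile):
    """Toy 'speech disordering' transform: distort each reference frame
    in place, so the output stays frame-for-frame synchronous."""
    noisy = frames + profile["noise"] * rng.standard_normal(frames.shape)
    return noisy * profile["gain"]

reference = rng.standard_normal((100, 80))   # reference corpus frames
profile = {"noise": 0.3, "gain": 0.8}        # stand-in for a learned profile
disordered = disorder(reference, profile)

# Training pairs for the dysarthria voice-conversion model: already
# aligned by construction, so no corpus-alignment pre-processing is needed.
pairs = list(zip(disordered, reference))
```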
Singing voice conversion
A method, computer program, and computer system are provided for converting a first singing voice associated with a first speaker to a second singing voice associated with a second speaker. A context associated with one or more phonemes corresponding to the first singing voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a sample corresponding to the first singing voice is converted to a sample corresponding to the second singing voice using the generated mel-spectrogram features.
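The encode-align-generate sequence can be sketched with duration-based upsampling standing in for the learned alignment and a one-step recurrence standing in for the recursive mel generation; the phoneme inventory, durations, and weights are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
N_PHONES, CTX_DIM, MELS = 5, 8, 80

W_ctx = rng.standard_normal((N_PHONES, CTX_DIM)) * 0.1
W_out = rng.standard_normal((CTX_DIM + MELS, MELS)) * 0.1

def encode_context(phoneme_ids):
    """Encode a context vector for each phoneme of the first singing voice."""
    one_hot = np.eye(N_PHONES)[phoneme_ids]
    return one_hot @ W_ctx                        # (n_phonemes, CTX_DIM)

def align(context, durations):
    """Align phonemes to target acoustic frames by repeating each phoneme's
    context for its duration (a simple stand-in for learned alignment)."""
    return np.repeat(context, durations, axis=0)

def generate_mels(aligned):
    """Recursive generation: each mel frame is predicted from the aligned
    context plus the previously generated frame."""
    prev = np.zeros(MELS)
    frames = []
    for ctx in aligned:
        prev = np.tanh(np.concatenate([ctx, prev]) @ W_out)
        frames.append(prev)
    return np.stack(frames)

phonemes = [0, 3, 1]
durations = [2, 4, 3]                             # frames per phoneme
mels = generate_mels(align(encode_context(phonemes), durations))
```

In singing voice conversion the durations matter more than in speech, since the output must keep the source melody's note timing while the generated mels carry the second speaker's timbre.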