G10L13/10

VOICE GENERATING METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
20230131494 · 2023-04-27

A voice generating method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a text to be processed and determining an associated text of the text to be processed; acquiring an associated prosodic feature of the associated text; determining an associated text feature of the associated text based on the text to be processed; determining a spectrum feature of the text to be processed based on the associated prosodic feature and the associated text feature; and generating a target voice corresponding to the text to be processed based on the spectrum feature.
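
Read as a pipeline, the claim amounts to: context text → prosodic and textual features → spectrum → waveform. A minimal Python sketch of that data flow follows; every component (embed, determine_associated_text, predict_spectrum) is a toy stand-in, since the abstract does not disclose concrete models.

    import hashlib
    import numpy as np

    def embed(text, dim=8):
        # Deterministic toy embedding seeded from the text, so the sketch
        # runs without a trained encoder.
        seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
        return np.random.default_rng(seed).standard_normal(dim)

    def determine_associated_text(text, document):
        # Assumed notion of "associated text": the sentence preceding the
        # text to be processed in the same document.
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        i = sentences.index(text)
        return sentences[i - 1] if i > 0 else ""

    def predict_spectrum(prosodic_feat, text_feat, frames=40):
        # Stand-in acoustic model: modulate the fused features over time.
        fused = np.concatenate([prosodic_feat, text_feat])   # (16,)
        return np.hanning(frames)[:, None] * fused[None, :]  # (frames, 16)

    document = "It was late. The train had already left"
    text = "The train had already left"
    assoc = determine_associated_text(text, document)
    spectrum = predict_spectrum(embed("prosody:" + assoc), embed(assoc))
    print(spectrum.shape)  # (40, 16): input to a vocoder in a real system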

GENERATING AUDIO FILES BASED ON USER GENERATED SCRIPTS AND VOICE COMPONENTS

A computer-implemented method, according to one embodiment, includes determining whether a predetermined version of a source script is available. In response to determining that the predetermined version of the source script is available, it is used to condition a first processor, and instructions are sent to the conditioned first processor to generate a translated copy of the source script by translating the words in the source script from a source language to a target language. Instructions are also sent to a second processor to determine a distribution of metrics associated with the speech of each of the actors in the source audio file. The distributions are used to condition a third processor, and instructions are sent to the conditioned third processor to generate an audio file that includes words spoken in the target language. Furthermore, instructions are sent to merge the generated audio file with the corresponding video file.
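
One way to picture the orchestration is as a driver that conditions three model services in sequence. The sketch below keeps only the control flow from the abstract; translate, synthesize, and mux are injected callables standing in for the three processors, and the "metrics" are reduced to mean/std of pitch-like values per actor.

    import numpy as np

    def speech_metric_distribution(actor_frames):
        # Per-actor distribution of speech metrics, summarized as
        # mean and standard deviation (an assumed summarization).
        return {"mean": float(np.mean(actor_frames)),
                "std": float(np.std(actor_frames))}

    def dub(source_script, source_audio_by_actor, video_file,
            translate, synthesize, mux):
        if source_script is None:                       # predetermined-version check
            raise ValueError("no predetermined version of the source script")
        translated = translate(source_script)           # first processor
        dists = {actor: speech_metric_distribution(f)   # second processor
                 for actor, f in source_audio_by_actor.items()}
        audio = synthesize(translated, dists)           # third processor, conditioned
        return mux(video_file, audio)                   # merge audio with video

    # Toy usage with trivial stand-ins for each processor:
    out = dub("hola mundo",
              {"actor_a": np.array([180.0, 190.0, 175.0])},
              "film.mp4",
              translate=lambda s: "hello world",
              synthesize=lambda text, dists: f"<audio:{text}>",
              mux=lambda video, audio: (video, audio))
    print(out)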

Method for synthesized speech generation using emotion information correction and apparatus

A method includes generating first synthesized speech by using text and a first emotion information vector configured for the text, extracting a second emotion information vector from the first synthesized speech, determining whether correction of the second emotion information vector is needed by comparing a loss value, calculated from the first and second emotion information vectors, with a preconfigured threshold, re-performing speech synthesis by using a third emotion information vector generated by correcting the second emotion information vector, and outputting the resulting synthesized speech, thereby configuring emotion information of speech more effectively. A speech synthesis apparatus may be associated with an artificial intelligence module, drones (unmanned aerial vehicles, UAVs), robots, augmented reality (AR) devices, virtual reality (VR) devices, devices related to 5G services, and the like.
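
The correction step is essentially a feedback loop: synthesize, measure the emotion actually realized, and re-synthesize with a corrected vector if the mismatch exceeds a threshold. A runnable sketch, with MSE as the loss and a simple move-toward-target correction rule (both assumptions; the abstract fixes neither choice):

    import numpy as np

    def correct_and_synthesize(text, e1, synthesize, extract_emotion,
                               threshold=0.01, alpha=0.5, max_iters=5):
        """e1: first emotion information vector configured for the text."""
        speech = synthesize(text, e1)
        for _ in range(max_iters):
            e2 = extract_emotion(speech)            # second emotion info vector
            loss = float(np.mean((e1 - e2) ** 2))   # assumed MSE loss
            if loss <= threshold:                   # no correction needed
                return speech
            e3 = e2 + alpha * (e1 - e2)             # corrected third vector
            speech = synthesize(text, e3)           # re-perform synthesis
        return speech

    # Toy stand-ins: "speech" is just the emotion vector plus noise,
    # and extraction recovers it exactly.
    rng = np.random.default_rng(0)
    synth = lambda text, e: e + rng.normal(0, 0.05, e.shape)
    out = correct_and_synthesize("hello", np.array([0.9, 0.1, 0.0]),
                                 synth, extract_emotion=lambda s: s)
    print(out.round(3))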

Method, system, and device for performing real-time sentiment modulation in conversation systems
11636850 · 2023-04-25

A method and system for performing real-time sentiment modulation in conversation systems is disclosed. The method includes generating an impact table comprising a plurality of sentiment vectors and a plurality of emotion vectors associated with a plurality of sentences. The method further includes generating, for each of the plurality of sentences, a dependency vector based on the associated sentiment vector and the associated emotion vector. The method further includes stacking the generated dependency vectors to generate a waveform representing variance in sentiment and emotion across words within the plurality of sentences. The method further includes altering at least one portion of the waveform based on a desired emotional output to generate a reshaped waveform. The method further includes generating a set of rephrased sentences associated with the at least one portion, based on the reshaped waveform, the plurality of sentences, and a user-defined sentiment output.
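
Under the (assumed) reading that each dependency vector is a combination of a sentence's sentiment and emotion vectors, stacking and reshaping reduce to array operations. The sketch below blends one region of the stacked waveform toward a desired emotional output; the blend rule and weights are illustrative assumptions.

    import numpy as np

    def dependency_vector(sentiment, emotion, w=0.5):
        # Assumed combination rule: a weighted blend of the two vectors.
        return w * sentiment + (1 - w) * emotion

    def build_waveform(sentiments, emotions):
        # Stack per-sentence dependency vectors into a (sentences, dims)
        # array representing sentiment/emotion variance across the text.
        return np.stack([dependency_vector(s, e)
                         for s, e in zip(sentiments, emotions)])

    def reshape_region(waveform, start, stop, target, strength=0.8):
        # Pull one portion of the waveform toward the desired output.
        out = waveform.copy()
        out[start:stop] = (1 - strength) * out[start:stop] + strength * target
        return out

    sentiments = [np.array([0.8, 0.2]), np.array([-0.6, 0.1]), np.array([0.1, 0.9])]
    emotions   = [np.array([0.5, 0.5]), np.array([-0.4, 0.3]), np.array([0.0, 1.0])]
    wave = build_waveform(sentiments, emotions)
    reshaped = reshape_region(wave, 1, 2, target=np.array([0.7, 0.7]))
    # A rephrasing model would then generate sentences matching reshaped[1:2].
    print(wave[1].round(2), "->", reshaped[1].round(2))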

System and method for cross-speaker style transfer in text-to-speech and training data generation

Systems are configured for generating spectrogram data characterized by the voice timbre of a target speaker and the prosody style of a source speaker by converting a waveform of source speaker data to phonetic posteriorgram (PPG) data, extracting additional prosody features from the source speaker data, and generating a spectrogram based on the PPG data and the extracted prosody features. The systems are configured to train and utilize a machine learning model for generating spectrogram data and for training a neural text-to-speech model with the generated spectrogram data.
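
The conversion path is waveform → PPG (speaker-independent phonetic content) plus prosody features (e.g., F0 and energy, an assumption) → spectrogram in the target speaker's timbre. A toy sketch of that interface, with placeholder extractors in place of the ASR model and the trained conversion model:

    import numpy as np

    def extract_ppg(waveform, frames=50, phones=40):
        # Stand-in for an ASR acoustic model's per-frame phonetic posteriors.
        logits = np.random.default_rng(0).standard_normal((frames, phones))
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)          # rows sum to 1

    def extract_prosody(waveform, frames=50):
        # Stand-in prosody features: per-frame pitch and energy estimates.
        f0 = 120 + 20 * np.sin(np.linspace(0, 3, frames))
        energy = np.abs(np.interp(np.linspace(0, 1, frames),
                                  np.linspace(0, 1, len(waveform)),
                                  waveform))
        return np.stack([f0, energy], axis=1)            # (frames, 2)

    def generate_spectrogram(ppg, prosody, target_speaker_emb, mels=80):
        # Stand-in conversion model: project the concatenated features,
        # biased by the target speaker embedding (voice timbre).
        x = np.concatenate([ppg, prosody], axis=1)       # (frames, phones+2)
        proj = np.random.default_rng(1).standard_normal((x.shape[1], mels))
        return x @ proj + target_speaker_emb             # (frames, mels)

    src = np.sin(np.linspace(0, 100, 8000))              # source speaker waveform
    spec = generate_spectrogram(extract_ppg(src), extract_prosody(src),
                                target_speaker_emb=np.zeros(80))
    print(spec.shape)  # (50, 80): training data for a neural TTS model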

TEXT-TO-SPEECH SYNTHESIS METHOD AND APPARATUS USING MACHINE LEARNING, AND COMPUTER-READABLE STORAGE MEDIUM
20230067505 · 2023-03-02

A text-to-speech synthesis method using machine learning is disclosed. The method includes generating a single artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of learning texts and speech data corresponding to the plurality of learning texts, receiving an input text, receiving an articulatory feature of a speaker, and generating output speech data for the input text that reflects the articulatory feature of the speaker by inputting the articulatory feature to the single artificial neural network text-to-speech synthesis model.
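
The key point is a single model conditioned on a speaker's articulatory feature, rather than one model per speaker. A minimal sketch of that call signature, with a toy linear "model" standing in for the trained network:

    import numpy as np

    class SingleTTSModel:
        # One set of weights serves every speaker; the articulatory feature
        # is an extra conditioning input, not a separate model.
        def __init__(self, text_dim=16, speaker_dim=8, mels=80, seed=0):
            rng = np.random.default_rng(seed)
            self.w = rng.standard_normal((text_dim + speaker_dim, mels)) * 0.1

        def synthesize(self, text_feat, articulatory_feat):
            x = np.concatenate([text_feat, articulatory_feat])
            return np.tanh(x @ self.w)          # toy per-utterance "frame"

    model = SingleTTSModel()
    text_feat = np.ones(16)
    alice = np.full(8, 0.5)    # hypothetical articulatory feature of speaker A
    bob = np.full(8, -0.5)     # ... of speaker B: same model, different voice
    print(np.allclose(model.synthesize(text_feat, alice),
                      model.synthesize(text_feat, bob)))  # False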