Patent classifications
G10L13/047
SPEECH SYNTHESIS METHOD, AND ELECTRONIC DEVICE
The disclosure provides a speech synthesis method and an electronic device. The technical solution is as follows. A text to be synthesized and speech features of a target user are obtained. First acoustic features are predicted based on the text to be synthesized and the speech features. A target template audio is obtained from a template audio library based on the text to be synthesized. Second acoustic features of the target template audio are extracted. Target acoustic features are generated by splicing the first acoustic features and the second acoustic features. Speech synthesis is performed on the text to be synthesized based on the target acoustic features and the speech features, to generate a target speech of the text to be synthesized.
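The splicing step above can be sketched as a simple concatenation of two acoustic feature matrices. This is a minimal illustration, assuming mel-spectrogram-style features of shape (frames × bins); the abstract does not specify the feature type or the splicing axis.

```python
import numpy as np

def splice_acoustic_features(first_feats, second_feats):
    """Join predicted features and template features along the time axis
    to form the target acoustic features."""
    return np.concatenate([first_feats, second_feats], axis=0)

first = np.random.randn(120, 80)   # predicted first acoustic features (frames x bins)
second = np.random.randn(60, 80)   # second acoustic features extracted from template audio
target = splice_acoustic_features(first, second)
assert target.shape == (180, 80)
```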
SYSTEM FOR INTELLIGENT FACILITATION OF SPEECH SYNTHESIS AND SPEECH RECOGNITION WITH AUTO-TRANSLATION ON SOCIAL MEDIA PLATFORM
The present invention relates to social media networking platform features. It particularly relates to a system that facilitates speech synthesis, i.e. a text-to-speech (audio) feature, on a social media networking platform. The present invention further relates to the facilitation of speech recognition, i.e. an audio-to-text feature, on the social media networking platform. In addition, the system facilitates auto-translation across all supported languages so that users can operate in their preferred language. Further, the system enables sharing of content on its portal while retaining the identity of, and attribution to, the original creator of the content. The system may be operated through all common forms of multimedia devices, such as computers, laptops, mobile phones, and tablets.
TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, AND A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM
A text-to-speech synthesis method includes receiving text, inputting the received text into a synthesizer that includes a prediction network configured to convert the received text into speech data having a speech attribute that includes emotion, intention, projection, pace, and/or accent, and outputting said speech data. The prediction network is obtained by obtaining a first sub-dataset and a second sub-dataset, where the first sub-dataset and the second sub-dataset each include audio samples and corresponding text, and the speech attribute of the audio samples of the second sub-dataset is more pronounced than the speech attribute of the audio samples of the first sub-dataset, training a first model using the first sub-dataset until a performance metric reaches a first predetermined value, training a second model by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value, and selecting one trained model as the prediction network.
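The two-stage training schedule described above can be sketched as a generic "train until the metric reaches a threshold" loop applied twice, continuing from the first model's weights. This is a toy sketch with stand-in objects; the real prediction network, datasets, and performance metric are not specified in the abstract.

```python
def train_until(model, dataset, step_fn, metric_fn, target):
    """Keep taking training steps on dataset until the performance
    metric reaches the target value, then return the trained model."""
    while metric_fn(model) < target:
        step_fn(model, dataset)
    return model

# toy stand-ins: the "model" is just a counter that improves each step
model = {"metric": 0.0}
step = lambda m, d: m.update(metric=m["metric"] + 0.1)
metric = lambda m: m["metric"]

# stage 1: train a first model on the first (less expressive) sub-dataset
first_model = train_until(model, "sub_dataset_1", step, metric, target=0.5)
# stage 2: obtain a second model by further training the first model
# on the second (more expressive) sub-dataset
second_model = train_until(first_model, "sub_dataset_2", step, metric, target=0.9)
assert metric(second_model) >= 0.9
```

Note that the second stage resumes from the first model rather than reinitializing, which is the key point of the claimed training scheme.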
Stylizing text-to-speech (TTS) voice response for assistant systems
In one embodiment, a method includes receiving a voice input from a user and determining a first style of the voice input, based on first features extracted from the voice input. A second style for a voice response having second features may then be determined based on the first style. Finally, the voice response may be generated based on the second features of the second style, and this voice response may be provided in response to the voice input.
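The style-determination flow above can be sketched as: extract first features from the voice input, classify them into a first style, then map that style to a second style for the response. The feature names, thresholds, and style labels below are all hypothetical; the patent does not enumerate them.

```python
def extract_style_features(voice_input):
    """Toy first-feature extraction; assumes speaking rate (words/sec)
    and relative volume are already available on the input."""
    return {"rate": voice_input["rate"], "volume": voice_input["volume"]}

def classify_first_style(features):
    """Classify the user's input style from the extracted features."""
    if features["volume"] < 0.3:
        return "whisper"
    if features["rate"] > 3.5:
        return "urgent"
    return "neutral"

# hypothetical mapping from the detected first style to the response's second style
RESPONSE_STYLE = {"whisper": "soft", "urgent": "brisk", "neutral": "neutral"}

def choose_response_style(voice_input):
    first_style = classify_first_style(extract_style_features(voice_input))
    return RESPONSE_STYLE[first_style]

# a quiet input yields a soft-spoken response style
assert choose_response_style({"rate": 2.0, "volume": 0.1}) == "soft"
```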
On-device speech synthesis of textual segments for training of on-device speech recognition model
Processor(s) of a client device can: identify a textual segment stored locally at the client device; process the textual segment, using a speech synthesis model stored locally at the client device, to generate synthesized speech audio data that includes synthesized speech of the identified textual segment; process the synthesized speech, using an on-device speech recognition model that is stored locally at the client device, to generate predicted output; and generate a gradient based on comparing the predicted output to ground truth output that corresponds to the textual segment. In some implementations, the generated gradient is used, by processor(s) of the client device, to update weights of the on-device speech recognition model. In some implementations, the generated gradient is additionally or alternatively transmitted to a remote system for use in remote updating of global weights of a global speech recognition model.
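The TTS-then-ASR gradient loop above can be sketched with toy stand-ins: a "synthesizer" that maps a text segment to noisy audio features, a one-parameter "recognizer", and a mean-squared-error gradient comparing the predicted output to the ground-truth segment. Everything here is a deliberately simplified analogue of the on-device models, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(text_vec):
    """Toy local speech synthesis model: text features plus a little noise."""
    return text_vec + 0.01 * rng.standard_normal(text_vec.shape)

def recognize(audio, w):
    """Toy local speech recognition model: one linear weight w."""
    return w * audio

def on_device_gradient(text_vec, w):
    """Generate a gradient by comparing predicted output to the
    ground truth that corresponds to the textual segment."""
    audio = synthesize(text_vec)
    predicted = recognize(audio, w)
    err = predicted - text_vec          # ground truth = the segment itself
    return 2.0 * np.mean(err * audio)   # d(MSE)/dw

# locally stored textual segment, encoded as toy features
text = np.array([ord(c) for c in "hello"], dtype=float) / 128.0
w = 0.5
for _ in range(200):
    g = on_device_gradient(text, w)
    w -= 0.1 * g    # local update; g could instead be sent to a remote system
```

In the claimed scheme the same gradient serves two purposes: updating the on-device recognition weights locally, or being transmitted for federated-style updates of a global model.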
METHOD AND APPARATUS FOR PROCESSING SPEECH, ELECTRONIC DEVICE AND STORAGE MEDIUM
A method for processing speech includes: acquiring an original speech; extracting a spectrogram from the original speech; acquiring a speech synthesis model, where the speech synthesis model includes a first generation sub-model and a second generation sub-model; generating a harmonic structure of the spectrogram by invoking the first generation sub-model to process the spectrogram; and generating a target speech by invoking the second generation sub-model to process the harmonic structure and the spectrogram.
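The two-sub-model pipeline above can be sketched as follows. The "harmonic structure" here is approximated by keeping only the strongest frequency bins per frame, and the second sub-model is a toy fusion of the two inputs; both stand in for learned networks that the abstract does not detail.

```python
import numpy as np

def first_generation_submodel(spec, k=5):
    """Toy harmonic-structure generator: keep the k strongest
    frequency bins in each frame, zero out the rest."""
    harmonic = np.zeros_like(spec)
    idx = np.argsort(spec, axis=1)[:, -k:]       # top-k bins per frame
    rows = np.arange(spec.shape[0])[:, None]
    harmonic[rows, idx] = spec[rows, idx]
    return harmonic

def second_generation_submodel(harmonic, spec):
    """Toy target-speech generator: fuse harmonic structure with the
    original spectrogram (weights are arbitrary placeholders)."""
    return 0.7 * harmonic + 0.3 * spec

spec = np.abs(np.random.randn(100, 80))          # frames x frequency bins
harmonic = first_generation_submodel(spec)
target = second_generation_submodel(harmonic, spec)
assert target.shape == spec.shape
```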