Patent classifications
G10L13/10
TEXT-TO-SPEECH (TTS) PROCESSING
During text-to-speech processing, a speech model creates output audio data, including speech, that corresponds to input text data that includes a representation of the speech. A spectrogram estimator estimates a frequency spectrogram of the speech; the corresponding frequency-spectrogram data is used to condition the speech model. A plurality of acoustic features corresponding to different segments of the input text data, such as phonemes, syllable-level features, and/or word-level features, may be separately encoded into context vectors; the spectrogram estimator uses these separate context vectors to create the frequency spectrogram.
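The separate per-level encoding the abstract describes can be illustrated with a toy sketch. Everything here is invented for illustration (the hash-based "encoder" stands in for the trained neural encoders; function names and dimensions are assumptions): each linguistic level is encoded into its own context vector, and the vectors stay separate until the spectrogram-estimation step combines them.

```python
import math

def encode(features, dim):
    """Toy encoder: hash each symbolic feature into a fixed-size context vector."""
    vec = [0.0] * dim
    for f in features:
        vec[hash(f) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def estimate_spectrogram(phonemes, syllables, words, dim=8):
    """Encode each linguistic level into its own context vector, then
    concatenate them into one conditioning vector (a stand-in for a
    spectrogram frame)."""
    contexts = [encode(level, dim) for level in (phonemes, syllables, words)]
    frame = []
    for c in contexts:
        # Keep the per-level vectors separate (concatenation) rather than
        # pooling them early, as the abstract describes.
        frame.extend(c)
    return frame

frame = estimate_spectrogram(["HH", "AH", "L", "OW"], ["hel", "lo"], ["hello"])
```

In a real system, the conditioning vector would feed the speech model that produces the output audio; here it is simply the concatenation of the three unit-normalized context vectors.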
SPEECH PROCESSING METHOD AND APPARATUS, DEVICE AND COMPUTER STORAGE MEDIUM
The present disclosure discloses a speech processing method and apparatus, a device and a computer storage medium, and relates to speech and deep learning technologies in the field of artificial intelligence technologies. A specific implementation solution involves: acquiring a vocoder feature obtained for text; correcting a value of an unvoiced and voiced (UV) feature in the vocoder feature according to an energy feature and/or a speech spectrum feature in the vocoder feature; and providing the corrected vocoder feature for a vocoder, so as to obtain synthesized speech.
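The correction step the abstract describes (adjusting the unvoiced/voiced decision using the energy feature) can be sketched as a simple consistency check. This is a naive illustration under assumed conventions (UV as a 0/1 flag per frame, a single energy threshold); the disclosure's actual criterion may also use the speech spectrum feature:

```python
def correct_uv(uv_flags, energies, energy_threshold=0.1):
    """Flip UV decisions that contradict the frame energy: a frame marked
    voiced (1) with near-zero energy becomes unvoiced (0), and a frame
    marked unvoiced (0) with high energy becomes voiced (1)."""
    corrected = []
    for uv, energy in zip(uv_flags, energies):
        if uv == 1 and energy < energy_threshold:
            corrected.append(0)
        elif uv == 0 and energy >= energy_threshold:
            corrected.append(1)
        else:
            corrected.append(uv)
    return corrected
```

The corrected flags would then be packed back into the vocoder feature before it is handed to the vocoder; stray voiced flags on silent frames are a common source of buzzy synthesis artifacts, which is why such a cleanup pass helps.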

SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE AND COMPUTER STORAGE MEDIUM
The present disclosure discloses a speech synthesis method and apparatus, a device and a computer storage medium, and relates to speech and deep learning technologies in the field of artificial intelligence technologies. A specific implementation solution involves: acquiring to-be-synthesized text; acquiring a prosody feature extracted from the text; inputting the text and the prosody feature into a speech synthesis model to obtain a vocoder feature; and inputting the vocoder feature into a vocoder to obtain synthesized speech.
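The four-step pipeline in this abstract (text → prosody feature → synthesis model → vocoder feature → vocoder → speech) can be sketched end to end with stub functions. All names and the toy feature values are assumptions; the real components are trained models:

```python
def extract_prosody(text):
    """Stub prosody extractor: one coarse 'emphasis' value per word."""
    return [len(word) / 10.0 for word in text.split()]

def synthesis_model(text, prosody):
    """Stub speech synthesis model: maps (text, prosody) to per-word
    vocoder features."""
    return [{"f0": 100.0 + 50.0 * p, "energy": p} for p in prosody]

def vocoder(features):
    """Stub vocoder: returns one 'sample' per feature frame."""
    return [f["f0"] * f["energy"] for f in features]

def synthesize(text):
    prosody = extract_prosody(text)            # prosody feature from the text
    features = synthesis_model(text, prosody)  # text + prosody -> vocoder feature
    return vocoder(features)                   # vocoder feature -> speech
```

The point of the sketch is the data flow: the prosody feature is extracted first and fed into the synthesis model *alongside* the text, rather than being predicted implicitly inside it.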
TEXT-TO-SPEECH PROCESSING USING INPUT VOICE CHARACTERISTIC DATA
During text-to-speech processing, a speech model creates synthesized speech that corresponds to input data. The speech model may include an encoder for encoding the input data into a context vector and a decoder for decoding the context vector into spectrogram data. The speech model may further include a voice decoder that receives vocal characteristic data representing a desired vocal characteristic of synthesized speech. The voice decoder may process the vocal characteristic data to determine configuration data, such as weights, for use by the speech decoder.
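The voice decoder described here acts like a small hypernetwork: it maps a vocal-characteristic vector to configuration data (weights) that the speech decoder then uses. A minimal sketch, with an invented deterministic weight-derivation rule standing in for the trained voice decoder:

```python
def voice_decoder(vocal_characteristics, in_dim, out_dim):
    """Toy 'hypernetwork': derive speech-decoder weights from a
    vocal-characteristic vector by cycling through its entries."""
    weights = []
    for i in range(out_dim):
        row = [vocal_characteristics[(i + j) % len(vocal_characteristics)]
               for j in range(in_dim)]
        weights.append(row)
    return weights

def speech_decoder(context_vector, weights):
    """Linear decoder whose weights were produced by the voice decoder."""
    return [sum(w * x for w, x in zip(row, context_vector)) for row in weights]

voice = [0.2, -0.1, 0.4]  # assumed toy embedding of the desired vocal characteristic
W = voice_decoder(voice, in_dim=3, out_dim=2)
spectrogram_frame = speech_decoder([1.0, 0.0, 0.0], W)
```

Changing `voice` changes `W`, and therefore the decoder's output, without retraining the encoder or decoder themselves; that is the architectural idea the abstract points at.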
VOICE SYNTHESIS METHOD, VOICE SYNTHESIS APPARATUS, AND RECORDING MEDIUM
A voice synthesis method and apparatus generate second control data using an intermediate trained model with first input data including first control data designating phonetic identifiers; change the second control data in accordance with a first user instruction provided by a user; generate synthesis data representing frequency characteristics of a voice to be synthesized, using a final trained model with final input data including the first control data and the changed second control data; and generate a voice signal based on the generated synthesis data.
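The two-stage flow in this abstract (intermediate model produces editable control data, the user adjusts it, and the final model consumes the edited version) can be sketched with stubs. Every function and value below is an invented illustration; the real intermediate and final models are trained networks:

```python
def intermediate_model(first_control):
    """Stub intermediate trained model: derive per-phoneme pitch
    (the second control data) from the phonetic identifiers."""
    return [100.0 + 10.0 * i for i, _ in enumerate(first_control)]

def apply_user_edit(second_control, index, new_value):
    """The user instruction: override one value of the second control data."""
    edited = list(second_control)
    edited[index] = new_value
    return edited

def final_model(first_control, second_control):
    """Stub final trained model: frequency characteristics per phoneme,
    conditioned on both the original and the edited control data."""
    return list(zip(first_control, second_control))

def render_signal(synthesis_data):
    """Stub signal generation: one amplitude value per (phoneme, pitch) pair."""
    return [pitch / 1000.0 for _, pitch in synthesis_data]

phonemes = ["k", "a", "t"]                 # first control data (phonetic identifiers)
pitch = intermediate_model(phonemes)       # second control data
pitch = apply_user_edit(pitch, 1, 180.0)   # user raises the vowel's pitch
signal = render_signal(final_model(phonemes, pitch))
```

Exposing the intermediate control data to the user is the distinguishing step: the user edits an interpretable quantity (here, pitch) rather than the final waveform.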
Voice synthesis method and apparatus generate second control data using an intermediate trained model with first input data including first control data designating phonetic identifiers, change the second control data in accordance with a first user instruction provided by a user, generate synthesis data representing frequency characteristics of a voice to be synthesized using a final trained model with final input data including the first control data and the changed second control data, and generate a voice signal based on the generated synthesis data.