G10L13/06

USING SPEECH TO TEXT DATA IN TRAINING TEXT TO SPEECH MODELS

A system and method for providing a text-to-speech output by receiving user audio data, determining a user region-specific pronunciation classification according to the audio data, determining text for a response to the user according to the audio data, identifying a portion of the text that is included in a region-specific pronunciation dictionary, and using a phoneme string, selected from the dictionary according to the user region-specific pronunciation classification, for that portion in a text-to-speech output to the user.
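The dictionary lookup step described in this abstract can be illustrated with a minimal sketch. It assumes the region classification has already been derived from the audio and is passed in as a string. The dictionary contents, the region labels, the phoneme strings and the function name `select_phonemes` are all hypothetical and do not come from the patent.

```python
# Hypothetical region-specific pronunciation dictionary:
# word -> {region label: phoneme string}. Entries are illustrative only.
REGIONAL_PRONUNCIATIONS = {
    "tomato": {"us": "T AH M EY T OW", "uk": "T AH M AA T OW"},
    "schedule": {"us": "S K EH JH UH L", "uk": "SH EH JH UH L"},
}


def select_phonemes(response_text, region):
    """Return (token, phoneme string or None) pairs for the response text.

    A token gets a phoneme string only if the dictionary contains it and
    has an entry for the given region; otherwise None is returned so a
    default pronunciation could be used downstream.
    """
    result = []
    for word in response_text.lower().split():
        token = word.strip(".,!?")
        entry = REGIONAL_PRONUNCIATIONS.get(token)
        phonemes = entry.get(region) if entry else None
        result.append((token, phonemes))
    return result


pairs = select_phonemes("Your tomato is ready.", "uk")
```

Tokens without a dictionary entry map to `None` here, which leaves the choice of fallback pronunciation to the rest of the text-to-speech pipeline.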

Training method and apparatus for a speech synthesis model, and storage medium

The present application discloses a training method and apparatus for a speech synthesis model, an electronic device, and a storage medium. The method includes: taking a syllable input sequence, a phoneme input sequence and a Chinese character input sequence of a current sample as inputs of an encoder of a model to be trained, to obtain encoded representations of these three sequences at an output end of the encoder; fusing the encoded representations of these three sequences, to obtain a weighted combination of these three sequences; taking the weighted combination as an input of an attention module, to obtain a weighted average of the weighted combination at each moment at an output end of the attention module; and taking the weighted average as an input of a decoder of the model to be trained, to obtain a speech Mel spectrum of the current sample at an output end of the decoder.
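The fusion and attention steps can be sketched roughly as follows. The abstract does not say how the fusion weights are obtained or which attention mechanism is used, so this sketch assumes fixed scalar weights, three encoded sequences of identical shape, and plain dot-product attention. The function names `fuse`, `softmax` and `attend` are hypothetical.

```python
import math


def fuse(encodings, weights):
    """Weighted combination of equally shaped encoded sequences.

    Each encoding is a list of T steps, each step a list of D floats.
    Assumption: one fixed scalar weight per input sequence.
    """
    steps, dim = len(encodings[0]), len(encodings[0][0])
    return [
        [sum(w * enc[t][d] for w, enc in zip(weights, encodings))
         for d in range(dim)]
        for t in range(steps)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Dot-product attention: weighted average of memory rows for one step."""
    scores = [sum(q * k for q, k in zip(query, row)) for row in memory]
    alphas = softmax(scores)
    dim = len(memory[0])
    return [sum(a * row[d] for a, row in zip(alphas, memory))
            for d in range(dim)]


# Toy encoder outputs for syllable, phoneme and character sequences.
syllable = [[1.0, 0.0], [0.0, 1.0]]
phoneme = [[0.0, 1.0], [1.0, 0.0]]
character = [[1.0, 1.0], [1.0, 1.0]]

fused = fuse([syllable, phoneme, character], [0.5, 0.25, 0.25])
context = attend([0.0, 0.0], fused)  # input a decoder would consume
```

In a trained model the fusion weights and the attention query would be learned or produced by the decoder state; they are fixed here only to keep the example small.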

WIRELESS COMMUNICATION DEVICE USING VOICE RECOGNITION AND VOICE SYNTHESIS
20230090052 · 2023-03-23

Disclosed is a wireless communication device including a voice recognition portion configured to convert a voice signal input through a microphone into a syllable information stream using voice recognition, an encoding portion configured to encode the syllable information stream to generate digital transmission data, a transmission portion configured to modulate the digital transmission data into a transmission signal and transmit the transmission signal through an antenna, a reception portion configured to demodulate a reception signal received through the antenna into digital reception data and output the digital reception data, a decoding portion configured to decode the digital reception data to generate the syllable information stream, and a voice synthesis portion configured to convert the syllable information stream into the voice signal using voice synthesis and output the voice signal through a speaker.
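The encoding and decoding portions can be illustrated with a small round-trip sketch. The abstract does not specify a wire format, so this example assumes a hypothetical syllable inventory and a simple layout of a 16-bit count followed by 16-bit syllable indices, packed with Python's standard `struct` module. Voice recognition, modulation and voice synthesis are left out.

```python
import struct

# Hypothetical syllable inventory shared by sender and receiver.
SYLLABLES = ["an", "nyeong", "ha", "se", "yo"]
SYLLABLE_TO_ID = {s: i for i, s in enumerate(SYLLABLES)}


def encode_stream(syllables):
    """Pack a syllable stream as big-endian uint16: count, then indices."""
    ids = [SYLLABLE_TO_ID[s] for s in syllables]
    return struct.pack(f">H{len(ids)}H", len(ids), *ids)


def decode_stream(data):
    """Recover the syllable stream from the packed bytes."""
    (count,) = struct.unpack_from(">H", data, 0)
    ids = struct.unpack_from(f">{count}H", data, 2)
    return [SYLLABLES[i] for i in ids]


sent = ["an", "nyeong", "ha", "se", "yo"]
payload = encode_stream(sent)      # digital transmission data
received = decode_stream(payload)  # syllable stream on the receiving side
```

Five syllables occupy 12 bytes in this layout, which hints at why sending syllable indices can need far less bandwidth than sending sampled audio.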

VOICE GENERATING METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
20230131494 · 2023-04-27

A voice generating method and apparatus, an electronic device and a storage medium. The specific implementation solution includes: acquiring a text to be processed, and determining an associated text of the text to be processed; acquiring an associated prosodic feature of the associated text; determining an associated text feature of the associated text based on the text to be processed; determining a spectrum feature to be processed of the text to be processed based on the associated prosodic feature and the associated text feature; and generating a target voice corresponding to the text to be processed based on the spectrum feature to be processed.

METHOD AND SYSTEM FOR USER-INTERFACE ADAPTATION OF TEXT-TO-SPEECH SYNTHESIS
20230122824 · 2023-04-20

A method and system are disclosed for adapting speech synthesis according to user-interface input. While synthesizing speech from a text segment with a text-to-speech (TTS) system and concurrently displaying the text segment in a display device, the system may receive tracking operation input tracking a portion of text undergoing synthesis and identifying a context portion of the text for which prior-synthesized speech has been synthesized at a canonical speech-pace. The tracking information may be used to adjust a speech-pace of TTS synthesis of the portion from the canonical speech-pace to an adapted speech-pace, and speech characteristics of synthesized speech of the portion may be adapted by applying both the adapted speech-pace and synthesized speech characteristics of the prior-synthesized speech of the context portion to TTS synthesis processing of the portion. The synthesized speech of the identified portion may be output at the adapted speech-pace and with the adapted speech characteristics.

Methods and systems for synthesizing speech audio

A computer-implemented method for synthesizing speech audio includes obtaining a grammatical profile defining an input text of actual words as a function of at least syllable-occurrence rates and syllable-count-per-word rates; generating a dictionary of pseudo-words having the syllable-count-per-word rates, each pseudo-word consisting of one syllable or concatenated syllables selected from the input text, wherein substantially all of the pseudo-words are not actual words; constructing an output text product having the grammatical profile, the output text product comprising at least one sentence consisting of one or more pseudo-words selected from the dictionary; and synthesizing speech audio using the output text product. Related systems and computer-readable media are also provided.
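The pseudo-word dictionary step can be sketched as weighted sampling. This is a minimal illustration, assuming the grammatical profile is given as two plain mappings (syllable to occurrence rate, syllable count to rate) and that known real words are supplied as a set. The function name `make_pseudo_words` and all sample values are hypothetical.

```python
import random


def make_pseudo_words(syllable_rates, count_rates, real_words, n, seed=0):
    """Sample n pseudo-words that follow the given rates.

    syllable_rates: syllable -> relative occurrence rate
    count_rates:    syllables per word -> relative rate
    real_words:     words to reject so the output contains no actual words
    Note: this loops forever if every possible candidate is a real word.
    """
    rng = random.Random(seed)
    syllables = list(syllable_rates)
    syllable_weights = [syllable_rates[s] for s in syllables]
    counts = list(count_rates)
    count_weights = [count_rates[c] for c in counts]

    words = []
    while len(words) < n:
        k = rng.choices(counts, weights=count_weights)[0]
        word = "".join(rng.choices(syllables, weights=syllable_weights, k=k))
        if word not in real_words:
            words.append(word)
    return words


syllable_rates = {"ta": 0.5, "ko": 0.3, "mi": 0.2}
count_rates = {1: 0.2, 2: 0.5, 3: 0.3}
dictionary = make_pseudo_words(syllable_rates, count_rates, {"tako"}, n=20, seed=1)
sentence = " ".join(dictionary[:5])  # one toy sentence of pseudo-words
```

Rejecting every listed real word is stricter than the abstract's "substantially all" wording; a looser filter would also fit the description.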