G10L13/10

COMPUTING SYSTEM FOR DOMAIN EXPRESSIVE TEXT TO SPEECH

A computing system obtains text that includes words and provides the text as input to an emotional classifier model that has been trained to perform emotional classification. The computing system obtains a textual embedding of the text as output of the emotional classifier model. The computing system generates a phoneme sequence based upon the words of the text. The computing system generates, by way of an encoder of a text-to-speech (TTS) model, a phoneme encoding based upon the phoneme sequence. The computing system provides the textual embedding and the phoneme encoding as input to a decoder of the TTS model. The computing system causes speech that includes the words to be played over a speaker based upon output of the decoder of the TTS model, where the speech reflects an emotion underlying the text due to the textual embedding provided to the decoder.
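
As a rough illustration of the claimed flow, the sketch below wires a stand-in emotion classifier, a grapheme-to-phoneme step, an encoder, and an emotion-conditioned decoder together. The lexicon, the one-hot embedding, and all of the toy math are assumptions for illustration, not the patented models.

```python
# Hypothetical sketch: an emotion classifier yields a textual embedding
# that conditions the TTS decoder alongside the phoneme encoding.

EMOTION_LEXICON = {"great": "joy", "terrible": "anger", "sad": "sadness"}

def emotion_embedding(text):
    """Stand-in for the trained emotional classifier: returns a
    one-hot embedding over a tiny emotion inventory."""
    emotions = ["neutral", "joy", "anger", "sadness"]
    label = "neutral"
    for word in text.lower().split():
        label = EMOTION_LEXICON.get(word.strip(".,!?"), label)
    return [1.0 if e == label else 0.0 for e in emotions]

def to_phonemes(text):
    """Toy grapheme-to-phoneme step: one pseudo-phoneme per letter."""
    return [c for c in text.lower() if c.isalpha()]

def encode_phonemes(phonemes):
    """Stand-in encoder: maps each pseudo-phoneme to a scalar code."""
    return [ord(p) - ord("a") for p in phonemes]

def decode(phoneme_encoding, textual_embedding):
    """Stand-in decoder: each output frame is conditioned on both the
    phoneme code and the emotion embedding."""
    return [(code, tuple(textual_embedding)) for code in phoneme_encoding]

text = "What a great day"
frames = decode(encode_phonemes(to_phonemes(text)), emotion_embedding(text))
```

The point of the sketch is the dataflow: the emotion embedding bypasses the phoneme encoder and joins the decoder input, which is what lets the rendered speech reflect the emotion underlying the text.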

Artificial intelligence apparatus for generating text or speech having content-based style and method for the same
11488576 · 2022-11-01

Provided is an artificial intelligence (AI) apparatus for generating a speech having a content-based style, including: a memory configured to store a plurality of TTS (Text-To-Speech) engines; and a processor configured to: obtain image data or text data containing a text, extract at least one content keyword corresponding to the text, determine a speech style based on the extracted content keyword, generate a speech corresponding to the text by using a TTS engine corresponding to the determined speech style among the plurality of TTS engines, and output the generated speech.
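
The keyword-to-style-to-engine selection described above can be sketched as follows; the keyword table, style names, and engine stubs are all assumptions standing in for the plurality of stored TTS engines.

```python
# Illustrative sketch: content keywords extracted from the text pick a
# speech style, which in turn selects one of several TTS engines.

STYLE_BY_KEYWORD = {"news": "formal", "bedtime": "soothing", "sports": "excited"}

def extract_keywords(text):
    """Keep only words that appear in the content-keyword table."""
    return [w for w in text.lower().split() if w in STYLE_BY_KEYWORD]

def pick_style(keywords):
    """Map the first matched keyword to a style; default to neutral."""
    return STYLE_BY_KEYWORD[keywords[0]] if keywords else "neutral"

# Stand-ins for the plurality of stored TTS engines, one per style.
TTS_ENGINES = {
    "formal":   lambda t: f"[formal] {t}",
    "soothing": lambda t: f"[soothing] {t}",
    "excited":  lambda t: f"[excited] {t}",
    "neutral":  lambda t: f"[neutral] {t}",
}

def synthesize(text):
    style = pick_style(extract_keywords(text))
    return TTS_ENGINES[style](text)
```

For example, `synthesize("tonight's sports highlights")` routes through the "excited" engine because "sports" is a known content keyword.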

Method and apparatus for training speech spectrum generation model, and electronic device

The present application discloses a method and an apparatus for training a speech spectrum generation model, as well as an electronic device, and relates to the technical fields of speech synthesis and deep learning. A specific implementation is as follows: inputting a first text sequence into the speech spectrum generation model to generate an analog spectrum sequence corresponding to the first text sequence, and obtaining a first loss value of the analog spectrum sequence according to a preset loss function; inputting the analog spectrum sequence corresponding to the first text sequence into an adversarial loss function model, which is a generative adversarial network model, to obtain a second loss value of the analog spectrum sequence; and training the speech spectrum generation model based on the first loss value and the second loss value.
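
The two-loss training signal can be sketched numerically; the choice of L1 for the preset loss, the least-squares form of the adversarial term, and the weighting are assumptions, with toy callables in place of real networks.

```python
# Minimal sketch of combining a preset spectrum loss with an
# adversarial loss from a GAN discriminator.

def l1_loss(predicted, target):
    """First loss value: mean absolute error against the reference spectrum."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def adversarial_loss(discriminator_score):
    """Second loss value: the generator is rewarded when the
    discriminator scores its spectrum as real (score -> 1.0)."""
    return (1.0 - discriminator_score) ** 2

def total_loss(predicted, target, discriminator_score, adv_weight=0.5):
    """Training objective: first loss plus weighted second loss."""
    return l1_loss(predicted, target) + adv_weight * adversarial_loss(discriminator_score)
```

A perfect spectrum that also fools the discriminator yields zero total loss; either a reconstruction error or a low discriminator score pushes the loss up, so both signals shape the generator.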

WIRELESS COMMUNICATION DEVICE USING VOICE RECOGNITION AND VOICE SYNTHESIS
20230090052 · 2023-03-23

Disclosed is a wireless communication device including a voice recognition portion configured to convert a voice signal input through a microphone into a syllable information stream using voice recognition, an encoding portion configured to encode the syllable information stream to generate digital transmission data, a transmission portion configured to modulate the digital transmission data into a transmission signal and transmit the transmission signal through an antenna, a reception portion configured to demodulate a reception signal received through the antenna into digital reception data and output the digital reception data, a decoding portion configured to decode the digital reception data to regenerate the syllable information stream, and a voice synthesis portion configured to convert the syllable information stream into a voice signal using voice synthesis and output the voice signal through a speaker.
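
The core idea, transmitting a compact syllable stream instead of audio, can be sketched as a round trip through the encoding and decoding portions; the syllable inventory and the fixed one-byte-per-syllable codec are assumptions for illustration.

```python
# Hedged sketch of the syllable-stream path: recognized syllables are
# encoded to compact digital data, carried over the link, then decoded
# back into a syllable stream for synthesis at the receiver.

SYLLABLES = ["an", "nyeong", "ha", "se", "yo"]  # toy syllable inventory
INDEX = {s: i for i, s in enumerate(SYLLABLES)}

def encode(syllable_stream):
    """Encoding portion: map each syllable to a one-byte index."""
    return bytes(INDEX[s] for s in syllable_stream)

def decode(data):
    """Decoding portion: recover the syllable stream from the indices."""
    return [SYLLABLES[b] for b in data]

payload = encode(["an", "nyeong"])   # what the transmission portion modulates
recovered = decode(payload)          # what the voice synthesis portion speaks
```

Note the bandwidth argument implicit in the design: two syllables travel as two bytes, orders of magnitude smaller than the equivalent audio samples.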

Method and system for parametric speech synthesis
11605371 · 2023-03-14

Embodiments of the present systems and methods may provide techniques for synthesizing speech in any voice in any language in any accent. For example, in an embodiment, a text-to-speech conversion system may comprise a text converter adapted to convert input text to at least one phoneme selected from a plurality of phonemes stored in memory, a machine-learning model storing voice patterns for a plurality of individuals and adapted to receive the at least one phoneme and an identity of a speaker and to generate acoustic features for each phoneme, and a decoder adapted to receive the generated acoustic features and to generate a speech signal simulating a voice of the identified speaker in a language.
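
A sketch of the text converter, speaker-conditioned acoustic model, and decoder chain follows; the speaker table, the pitch-based feature, and the per-character phonemization are illustrative assumptions, not the claimed machine-learning model.

```python
# Illustrative sketch: per-speaker voice parameters modulate the
# acoustic features generated for each phoneme before a decoder
# renders a speech signal in that speaker's voice.

VOICE_PARAMS = {"alice": {"pitch": 220.0}, "bob": {"pitch": 120.0}}

def phonemize(text):
    """Stand-in text converter: one pseudo-phoneme per letter."""
    return [c for c in text.lower() if c.isalpha()]

def acoustic_features(phonemes, speaker_id):
    """Stand-in acoustic model: one pitch value per phoneme, anchored
    at the identified speaker's base pitch."""
    base = VOICE_PARAMS[speaker_id]["pitch"]
    return [base + (ord(p) - ord("a")) for p in phonemes]

def decode_to_signal(features):
    """Stand-in decoder: one 'sample' per feature frame."""
    return [round(f, 1) for f in features]

signal = decode_to_signal(acoustic_features(phonemize("hi"), "alice"))
```

Because the phoneme inventory is decoupled from the speaker parameters, the same phoneme sequence can be rendered in any stored voice, which is the mechanism behind the "any voice in any language" claim.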

VOICE GENERATING METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
20230131494 · 2023-04-27

A voice generating method and apparatus, an electronic device and a storage medium. The specific implementation solution includes: acquiring a text to be processed, and determining an associated text of the text to be processed; acquiring an associated prosodic feature of the associated text; determining an associated text feature of the associated text based on the text to be processed; determining a spectrum feature to be processed of the text to be processed based on the associated prosodic feature and the associated text feature; and generating a target voice corresponding to the text to be processed based on the spectrum feature to be processed.
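
The flow above, conditioning the current text's spectrum on features drawn from an associated text, can be sketched as below; treating the previous sentence as the associated text, and all of the scalar "features", are assumptions for illustration.

```python
# Minimal sketch: prosodic and textual features of an associated
# (here, preceding) text condition the spectrum feature predicted
# for the text to be processed.

def associated_text(history, text):
    """Pick the associated text: here, the previous sentence if any."""
    return history[-1] if history else ""

def prosodic_feature(text):
    """Toy prosody: a longer associated text implies a slower rate."""
    return {"rate": 1.0 / (1 + len(text.split()))}

def text_feature(assoc, current):
    """Toy associated text feature: word overlap between the texts."""
    return len(set(assoc.lower().split()) & set(current.lower().split()))

def spectrum(current, prosody, relation):
    """One 'spectrum frame' per word, scaled by the inherited prosody
    and the strength of the association."""
    return [prosody["rate"] * (1 + relation) for _ in current.split()]

assoc = associated_text(["good morning world"], "hello world")
frames = spectrum("hello world", prosodic_feature(assoc), text_feature(assoc, "hello world"))
```

The sketch mirrors the claimed ordering: associated text first, then its prosodic and text features, then the spectrum feature, from which the target voice would finally be generated.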