Patent classifications
G10L13/10
UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS
Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
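The idea of sampling phoneme durations at inference can be sketched as follows. This is a minimal illustration, not the patented method: a per-phoneme log-normal distribution stands in for the learned generative distribution, and the names `sample_durations`, `expand_to_frames`, `mean_log`, and `std_log` are hypothetical.

```python
import random

def sample_durations(phonemes, mean_log, std_log, seed=None):
    """Draw one duration (in frames) per phoneme.

    Assumption: a log-normal per phoneme stands in for whatever
    generative duration model a real system would learn.
    """
    rng = random.Random(seed)
    durations = []
    for p in phonemes:
        d = rng.lognormvariate(mean_log[p], std_log[p])
        durations.append(max(1, round(d)))  # every phoneme gets at least one frame
    return durations

def expand_to_frames(phonemes, durations):
    """Repeat each phoneme for its sampled number of frames."""
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]

# Hypothetical parameters for three phonemes.
phonemes = ["HH", "AY", "T"]
mean_log = {"HH": 1.0, "AY": 2.0, "T": 1.2}
std_log = {"HH": 0.3, "AY": 0.4, "T": 0.3}

durations = sample_durations(phonemes, mean_log, std_log, seed=1)
frames = expand_to_frames(phonemes, durations)
```

Because the durations are sampled rather than predicted deterministically, calling `sample_durations` with a different seed can yield a different rhythm for the same text.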
COMPUTING SYSTEM FOR UNSUPERVISED EMOTIONAL TEXT TO SPEECH TRAINING
A text to speech (TTS) model is trained based on training data including text samples. The text samples are provided to a text embedding model for outputting text embeddings for the text samples. The text embeddings are clustered into several clusters of text embeddings. The several clusters are representative of variations in emotion. The TTS model is then trained based upon the several clusters of text embeddings. Upon being trained, the TTS model is configured to receive text input and output a spoken utterance that corresponds to the text input. The TTS model is configured to output the spoken utterance with emotion. The emotion is based upon the text input and the training of the TTS model.
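The clustering step described above can be sketched with a small k-means routine. This is an assumption-laden illustration: the abstract does not name a clustering algorithm, the two-dimensional embeddings are toy values rather than output of a real text embedding model, and `kmeans` is a hypothetical helper.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic init (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data: texts with made-up 2-D embeddings (hypothetical values).
texts = ["I am thrilled!", "This is awful.", "What a great day!", "I hate this."]
embeddings = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]

labels, _ = kmeans(embeddings, k=2)
# Each cluster id acts as an unsupervised emotion label for TTS training.
training_pairs = list(zip(texts, labels))
```

In this sketch the cluster id attached to each text sample plays the role of an emotion label that was never annotated by hand, which is the sense in which the training is unsupervised.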
Systems and methods for aggregating content
Methods and devices produce an audio representation of aggregated content by selecting preferred content from a number of sources. The sources are emotion-tagged. The emotion-tagged preferred content sources are converted into audio files. A set of audio files corresponding to the converted preferred content is generated. The preferred content is individually converted into the audio files. The generated set comprises non-aggregated content.
Systems and methods for enhancing responsiveness to utterances having detectable emotion
Methods, systems, and related products provide emotion-sensitive responses to a user's commands and other utterances received at an utterance-based user interface. Acknowledgements of the user's utterances are adapted to the user and/or the user device, and to emotions detected in the utterance that have been mapped from one or more emotion features extracted from the utterance. In some examples, extraction of a user's changing emotion during a sequence of interactions is used to generate a response to a user's uttered command. In some examples, emotion processing and command processing of natural utterances are performed asynchronously.
Methods and systems for synthesizing speech audio
A computer-implemented method for synthesizing speech audio includes obtaining a grammatical profile defining an input text of actual words as a function of at least syllable-occurrence rates and syllable-count-per-word rates; generating a dictionary of pseudo-words having the syllable-count-per-word rates, each pseudo-word consisting of one syllable or concatenated syllables selected from the input text, wherein substantially all of the pseudo-words are not actual words; constructing an output text product having the grammatical profile, the output text product comprising at least one sentence consisting of one or more pseudo-words selected from the dictionary; and synthesizing speech audio using the output text product. Related systems and computer-readable media are also provided.
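The pseudo-word steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the input text is already split into syllables (the abstract does not say how syllabification is done), "not an actual word" is approximated by "not a word of the input text", and the function names are hypothetical.

```python
import random
from collections import Counter

def build_profile(syllabified_words):
    """Count syllable occurrences and syllables-per-word in the input text."""
    syllable_counts = Counter(s for w in syllabified_words for s in w)
    length_counts = Counter(len(w) for w in syllabified_words)
    return syllable_counts, length_counts

def make_pseudo_words(syllabified_words, n, seed=0):
    """Generate up to n pseudo-words that follow the input's profile."""
    rng = random.Random(seed)
    syllable_counts, length_counts = build_profile(syllabified_words)
    real_words = {"".join(w) for w in syllabified_words}
    syllables, syl_weights = zip(*syllable_counts.items())
    lengths, len_weights = zip(*length_counts.items())
    pseudo, attempts = [], 0
    while len(pseudo) < n and attempts < 100 * n:
        attempts += 1
        k = rng.choices(lengths, weights=len_weights)[0]
        word = "".join(rng.choices(syllables, weights=syl_weights, k=k))
        if word not in real_words:  # assumption: only input words count as "actual"
            pseudo.append(word)
    return pseudo

def make_sentence(dictionary, n_words, seed=0):
    """Build one sentence of pseudo-words drawn from the dictionary."""
    rng = random.Random(seed)
    return " ".join(rng.choice(dictionary) for _ in range(n_words)) + "."

# Hypothetical pre-syllabified input text.
words = [["syn", "the", "sis"], ["speech"], ["au", "di", "o"], ["text"]]
dictionary = make_pseudo_words(words, 5)
sentence = make_sentence(dictionary, 4)
```

The resulting sentence can then be passed to any speech synthesizer. A production system would need a real lexicon to reject actual words; the input-vocabulary check here is only a stand-in.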
SPONTANEOUS TEXT TO SPEECH (TTS) SYNTHESIS
The present disclosure provides methods and apparatuses for spontaneous text-to-speech (TTS) synthesis. A target text may be obtained. A fluency reference factor may be determined based at least on the target text. An acoustic feature corresponding to the target text may be generated with the fluency reference factor. A speech waveform corresponding to the target text may be generated based on the acoustic feature.
SPEECH SYNTHESIS METHOD, DEVICE AND COMPUTER-READABLE STORAGE MEDIUM
A speech synthesis method includes: obtaining an acoustic feature sequence of a text to be processed; processing the acoustic feature sequence in parallel by using a non-autoregressive computing model to obtain first audio information of the text to be processed, wherein the first audio information comprises audio corresponding to each segment; processing the acoustic feature sequence and the first audio information by using an autoregressive computing model to obtain a residual value corresponding to each segment; and obtaining second audio information corresponding to an i-th segment based on the first audio information corresponding to the i-th segment and the residual values corresponding to the first through the (i-1)-th segments, wherein the synthesized audio of the text to be processed comprises the second audio information of every segment, i = 1, 2, ..., n, and n is the total number of segments.
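The final combination step can be sketched as follows. This is an illustration only: the abstract says the second audio information is "based on" the first audio and the earlier residuals without giving the formula, so simple addition of the accumulated residuals is an assumption, the residuals are taken as given rather than produced by an autoregressive model, and `combine_segments` is a hypothetical name.

```python
def combine_segments(first_audio, residuals):
    """Refine each segment using the residuals of all earlier segments.

    first_audio: n segments from the parallel (non-autoregressive) pass,
                 each a list of samples of equal length.
    residuals:   n residual vectors, one per segment, same length.
    Assumption: refinement = first audio + sum of earlier residuals.
    """
    second_audio = []
    running = [0.0] * len(first_audio[0])  # sum of residuals 1..(i-1)
    for i, segment in enumerate(first_audio):
        # Segment i sees only the residuals of segments before it.
        second_audio.append([s + r for s, r in zip(segment, running)])
        running = [a + b for a, b in zip(running, residuals[i])]
    return second_audio

first_audio = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
residuals = [[0.5, 0.0], [0.25, 0.25], [9.0, 9.0]]
second_audio = combine_segments(first_audio, residuals)
```

Note that the first segment is passed through unchanged, since there are no earlier residuals, and the residual of the last segment is never used.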