
COMPUTING SYSTEM FOR UNSUPERVISED EMOTIONAL TEXT TO SPEECH TRAINING

A text to speech (TTS) model is trained based on training data including text samples. The text samples are provided to a text embedding model that outputs text embeddings for the text samples. The text embeddings are clustered into several clusters of text embeddings, where the several clusters are representative of variations in emotion. The TTS model is then trained based upon the several clusters of text embeddings. Upon being trained, the TTS model is configured to receive text input and output a corresponding spoken utterance with emotion, where the emotion is based upon the text input and the training of the TTS model.
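
A minimal sketch of the clustering step described above, assuming a hypothetical embed_text() helper that maps a sentence to a fixed-size vector; k-means is used here as one plausible clustering choice, since the abstract does not name a specific algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_text_embeddings(texts, embed_text, n_clusters=8):
    """Embed each training sentence and group the embeddings into
    clusters taken to represent variations in emotion."""
    embeddings = np.stack([embed_text(t) for t in texts])  # (N, dim)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(embeddings)
    # Each (text, cluster_id) pair can then condition TTS training, so
    # the model learns one prosodic/emotional mode per cluster.
    return cluster_ids, kmeans.cluster_centers_
```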

Duration informed attention network for text-to-speech analysis
11468879 · 2022-10-11

A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A first set of spectra is generated based on the sequence of text components. A second set of spectra is generated based on the first set of spectra and the respective temporal durations of the sequence of text components. A spectrogram frame is generated based on the second set of spectra. An audio waveform is generated based on the spectrogram frame. The audio waveform is provided as an output.
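A minimal sketch of the duration-informed expansion this abstract describes: the first set of per-component spectra is repeated according to each component's predicted duration, yielding the frame-level second set. The duration model itself is assumed; durations are given here directly as frame counts:

```python
import numpy as np

def expand_by_duration(component_spectra, durations):
    """component_spectra: (num_components, dim) array of spectra.
    durations: predicted frames per component, e.g., [2, 3, 1].
    Returns a (sum(durations), dim) frame-level sequence."""
    return np.repeat(component_spectra, durations, axis=0)

# Example: 3 text components with 4-dim spectra.
spectra = np.random.randn(3, 4)
frames = expand_by_duration(spectra, [2, 3, 1])
assert frames.shape == (6, 4)
```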

SPONTANEOUS TEXT TO SPEECH (TTS) SYNTHESIS
20230206899 · 2023-06-29

The present disclosure provides methods and apparatuses for spontaneous text-to-speech (TTS) synthesis. A target text may be obtained. A fluency reference factor may be determined based at least on the target text. An acoustic feature corresponding to the target text may be generated with the fluency reference factor. A speech waveform corresponding to the target text may be generated based on the acoustic feature.
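A hedged sketch of conditioning acoustic-feature generation on a fluency reference factor. The factor is modeled here as a single scalar appended to each text-encoder frame; the disclosure does not specify its form, so the network shape and the scalar encoding are assumptions:

```python
import torch
import torch.nn as nn

class FluencyConditionedAcousticModel(nn.Module):
    def __init__(self, text_dim=256, mel_dim=80):
        super().__init__()
        # The +1 input channel carries the fluency reference factor.
        self.net = nn.Sequential(
            nn.Linear(text_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, mel_dim),
        )

    def forward(self, text_encoding, fluency):
        # text_encoding: (T, text_dim); fluency: scalar tensor in [0, 1].
        f = fluency.expand(text_encoding.size(0), 1)
        return self.net(torch.cat([text_encoding, f], dim=-1))
```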

SYSTEMS AND METHODS FOR TRANSPOSING SPOKEN OR TEXTUAL INPUT TO MUSIC
20230197058 · 2023-06-22

Described herein are real-time musical translation devices (RETM) and methods of use thereof. Exemplary uses of RETMs include optimizing the understanding and/or recall of an input message for a user and improving a cognitive process in a user.

LATENT-SEGMENTATION INTONATION MODEL

The intonation model disclosed herein assigns different words within a sentence to be prominent, analyzes multiple prominence possibilities (in some cases, all prominence possibilities), and learns the parameters of the model using large amounts of data. Unlike previous systems, intonation patterns are discovered from data. Speech data is sub-segmented into words, the different segments are analyzed and used for learning, and a determination is made as to whether the segmentations predict pitch.
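
A minimal sketch of what "analyzes multiple prominence possibilities" can mean in practice: every binary prominence assignment over the words of a sentence is enumerated, so each can be scored against observed pitch. The scoring model itself is assumed:

```python
from itertools import product

def prominence_assignments(words):
    """Yield every binary prominence assignment over the words."""
    for mask in product([0, 1], repeat=len(words)):
        yield list(zip(words, mask))

for assignment in prominence_assignments(["the", "cat", "sat"]):
    print(assignment)  # e.g., [('the', 0), ('cat', 1), ('sat', 0)]
```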

Clockwork hierarchical variational encoder

A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
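
A hedged sketch of the per-phoneme step above: a predicted duration is converted into a count of fixed-length frames, and that many mel-spectrogram frames are produced for the phoneme. The 12.5 ms frame length and the decode_frame callback are assumptions, not details from the filing:

```python
import numpy as np

FRAME_SHIFT_S = 0.0125  # fixed frame length of 12.5 ms (an assumption)

def frames_for_phoneme(predicted_duration_s, decode_frame):
    """Produce fixed-length mel frames covering the predicted duration."""
    n_frames = max(1, round(predicted_duration_s / FRAME_SHIFT_S))
    return np.stack([decode_frame(i, n_frames) for i in range(n_frames)])

# Example with a dummy decoder that emits zero-valued 80-dim frames.
mels = frames_for_phoneme(0.1, lambda i, n: np.zeros(80))
assert mels.shape == (8, 80)
```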

SPEAKING-RATE NORMALIZED PROSODIC PARAMETER BUILDER, SPEAKING-RATE DEPENDENT PROSODIC MODEL BUILDER, SPEAKING-RATE CONTROLLED PROSODIC-INFORMATION GENERATION DEVICE AND PROSODIC-INFORMATION GENERATION METHOD ABLE TO LEARN DIFFERENT LANGUAGES AND MIMIC VARIOUS SPEAKERS' SPEAKING STYLES
20170309271 · 2017-10-26

A speaking-rate dependent prosodic model builder and a related method are disclosed. The proposed builder includes a first input terminal for receiving first information of a first language spoken by a first speaker, a second input terminal for receiving second information of a second language spoken by a second speaker, and a functional information unit having a function. The function includes either a first plurality of parameters simultaneously relevant to the first language and the second language, or a plurality of sub-parameters in a second plurality of parameters relevant to the second language alone. Under a maximum a posteriori condition, and based on the first information, the second information, and the first plurality of parameters or the plurality of sub-parameters, the functional information unit produces speaking-rate dependent reference information and constructs a speaking-rate dependent prosodic model of the second language.
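
One hedged reading of the "maximum a posteriori condition" above, with X_1 and X_2 the first and second speakers' prosodic data and Lambda the shared parameters (the symbols are ours, not the filing's):

```latex
\hat{\Lambda} = \arg\max_{\Lambda} \; p(\Lambda \mid X_1, X_2)
             = \arg\max_{\Lambda} \; p(X_1, X_2 \mid \Lambda)\, p(\Lambda)
```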

Duration informed attention network (DURIAN) for audio-visual synthesis
11670283 · 2023-06-06

A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A spectrogram frame is generated based on the duration model. An audio waveform is generated based on the spectrogram frame. Video information is generated based on the audio waveform. The audio waveform is provided as an output along with a corresponding video.
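A minimal sketch of pairing the synthesized waveform with video output, as the audio-visual variant requires: each video frame is assigned the span of audio samples it covers, so the generated video stays synchronized with the speech. The sample rate and frame rate are illustrative assumptions:

```python
def audio_video_alignment(num_samples, sample_rate=22050, fps=25):
    """Return (start_sample, end_sample) spans, one per video frame."""
    samples_per_frame = sample_rate / fps
    num_frames = int(num_samples / samples_per_frame)
    return [(int(i * samples_per_frame), int((i + 1) * samples_per_frame))
            for i in range(num_frames)]

spans = audio_video_alignment(22050)  # one second of audio
assert len(spans) == 25
```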

Playback apparatus, setting apparatus, playback method, and program
09728201 · 2017-08-08

A playback apparatus includes: an acquiring unit that acquires auditory language data including data to be played back as a spoken voice; an analyzing unit that analyzes the auditory language data to output an analysis result; a setting unit that sets at least a portion of the auditory language data to a control portion to be played back at a set playback speed, based on the analysis result; and a voice playback unit that plays back the control portion as a spoken voice at the set playback speed.
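
A minimal sketch of the setting step above: an analysis result (here a hypothetical per-segment importance score) decides which portions become control portions and what playback speed each receives. The thresholds and speeds are illustrative:

```python
def set_playback_speeds(segments, analyze):
    """segments: portions of the auditory language data to be spoken.
    analyze: callable returning an importance score in [0, 1]."""
    plan = []
    for seg in segments:
        score = analyze(seg)
        # Important portions play slowly; low-value portions play fast.
        speed = 0.8 if score > 0.7 else (1.5 if score < 0.3 else 1.0)
        plan.append((seg, speed))
    return plan
```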

Sonification of Words and Phrases Identified by Analysis of Text
20170278507 · 2017-09-28

A text mining tool is operated on a given text to obtain words and/or phrases ranked by frequency of occurrence. Thereafter, a text-to-speech converter is used to speak each word/phrase output by the text mining tool, and the loudness at which each word/phrase is spoken depends on its corresponding frequency, which the text mining tool additionally outputs for each word/phrase. In certain embodiments, words/phrases are categorized into multiple themes by the text mining tool, and in these embodiments corresponding multiple voices and/or accents are used to indicate, via sonification, a specific theme of each word/phrase being spoken.
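
A hedged sketch of the frequency-to-loudness mapping this abstract describes, with a Counter standing in for the text mining tool and a hypothetical speak(word, gain_db) callback standing in for the text-to-speech converter:

```python
from collections import Counter

def sonify(text, speak, max_gain_db=0.0, min_gain_db=-20.0):
    """Speak each word at a volume proportional to its frequency."""
    ranked = Counter(text.lower().split()).most_common()
    if not ranked:
        return
    peak = ranked[0][1]
    for word, freq in ranked:
        # Linear map: the most frequent word plays at full volume.
        gain = min_gain_db + (max_gain_db - min_gain_db) * freq / peak
        speak(word, gain)
```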