Patent classifications
G10L2013/105
METHOD AND ELECTRONIC DEVICE FOR INTELLIGENTLY READING DISPLAYED CONTENTS
A method for intelligently reading displayed contents by an electronic device is provided. The method includes obtaining a screen representation based on a plurality of contents displayed on a screen of the electronic device. The method includes extracting a plurality of insights comprising at least one of intent, importance, emotion, sound representation, and information sequence of the plurality of contents from the plurality of contents based on the screen representation. The method includes generating audio emulating the extracted plurality of insights.
END-TO-END NEURAL TEXT-TO-SPEECH MODEL WITH PROSODY CONTROL
Methods and systems are provided for generating an end-to-end neural text-to-speech (TTS) model that processes input text to generate speech representations. A set of text documents with annotations inserted to indicate prosodic features is input into the TTS model. The TTS model is trained using the annotated dataset and a corresponding dataset of speech representations of the text documents that include prosody associated with the indicated prosodic features. The trained TTS model thereby learns to associate the prosody with the annotations.
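The abstract leaves the annotation syntax unspecified. As a minimal sketch, assuming a hypothetical inline tag scheme such as `<pause>` or `<pitch=high>`, the annotations could be separated from the plain training text like this:

```python
import re

# Hypothetical annotation scheme: inline tags such as <pitch=high> or
# <pause> mark prosodic features in the training text (the patent does
# not specify a concrete tag syntax).
TAG_RE = re.compile(r"<(\w+)(?:=(\w+))?>")

def split_annotations(annotated_text):
    """Separate plain text from (position, feature, value) prosody tags."""
    tags, plain = [], []
    offset = 0          # characters consumed by tags so far
    last = 0
    for m in TAG_RE.finditer(annotated_text):
        plain.append(annotated_text[last:m.start()])
        position = m.start() - offset      # index into the plain text
        tags.append((position, m.group(1), m.group(2)))
        offset += m.end() - m.start()
        last = m.end()
    plain.append(annotated_text[last:])
    return "".join(plain), tags
```

The plain text would feed the TTS front end while the positioned tags supervise the prosody targets.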
Variable-speed phonetic pronunciation machine
A machine causes a touch-sensitive screen to present a graphical user interface that depicts a slider control aligned with a word that includes a first alphabetic letter and a second alphabetic letter. A first zone of the slider control corresponds to the first alphabetic letter, and a second zone of the slider control corresponds to the second alphabetic letter. The machine detects a touch-and-drag input that begins within the first zone and enters the second zone. In response to the touch-and-drag input beginning within the first zone, the machine presents a first phoneme that corresponds to the first alphabetic letter, and the presenting of the first phoneme may include audio playback of the first phoneme. In response to the touch-and-drag input entering the second zone, the machine presents a second phoneme that corresponds to the second alphabetic letter, which may include audio playback of the second phoneme.
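The zone-to-phoneme mapping described above can be sketched as follows; the letter-to-phoneme table and the equal-width zone layout are illustrative assumptions, not details from the patent:

```python
# Toy letter-to-phoneme table for the word "cat" (an assumption).
PHONEMES = {"c": "/k/", "a": "/ae/", "t": "/t/"}

def build_zones(word, slider_width):
    """Divide a slider of the given pixel width into equal zones, one per letter."""
    zone_w = slider_width / len(word)
    return [(i * zone_w, (i + 1) * zone_w, letter)
            for i, letter in enumerate(word)]

def phoneme_at(zones, x):
    """Return the phoneme for the zone containing drag position x, if any."""
    for start, end, letter in zones:
        if start <= x < end:
            return PHONEMES.get(letter)
    return None
```

As the touch-and-drag input crosses from one zone into the next, the machine would trigger playback of the phoneme returned for the new position.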
Attention-Based Clockwork Hierarchical Variational Encoder
A method (400) for representing an intended prosody in synthesized speech includes receiving a text utterance (310) having at least one word (240), and selecting an utterance embedding (204) for the text utterance. Each word in the text utterance has at least one syllable (230) and each syllable has at least one phoneme (220). The utterance embedding represents an intended prosody. For each syllable, using the selected utterance embedding, the method also includes: predicting a duration (238) of the syllable by decoding a prosodic syllable embedding (232, 234) for the syllable based on attention by an attention mechanism (340) to linguistic features (222) of each phoneme of the syllable and generating a plurality of fixed-length predicted frames (260) based on the predicted duration for the syllable.
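A minimal numeric sketch of the attention step: the syllable's duration is predicted as an attention-weighted combination over its phonemes' linguistic features. The scalar features and dot-product scoring below are assumptions standing in for the learned embeddings and attention mechanism the abstract describes:

```python
import math

def softmax(scores):
    """Standard softmax over a list of attention scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_syllable_duration(phoneme_features, query):
    """Attention-weighted combination of per-phoneme duration cues.

    phoneme_features: list of (duration_cue_ms, score_feature) pairs,
    one per phoneme of the syllable (toy stand-ins for linguistic features).
    query: scalar query the attention scores each phoneme against.
    """
    weights = softmax([query * f for _, f in phoneme_features])
    return sum(w * d for w, (d, _) in zip(weights, phoneme_features))
```

With a zero query the attention is uniform, so the prediction is the mean of the per-phoneme cues; a trained query would learn to weight phonemes unevenly.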
METHOD AND APPARATUS FOR SPEECH SYNTHESIS, AND STORAGE MEDIUM
A method for speech synthesis includes obtaining text to be synthesized and an identifier of a speaker, the text being written in a first language; obtaining pronunciation information of each character in the text; generating linguistic features of the text by performing feature extraction on the pronunciation information of each character in the text based on the first language; and obtaining a target speech in a second language other than the first language, by performing speech synthesis based on the linguistic features and the identifier of the speaker.
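The pipeline can be sketched end to end with stub components; the toy pinyin lookup and the returned record below are illustrative stand-ins for the models the abstract describes:

```python
# Toy per-character pronunciation lookup for the first language
# (Mandarin, pinyin) -- an assumption for illustration.
PRONUNCIATIONS = {"你": "ni3", "好": "hao3"}

def linguistic_features(text):
    """Per-character pronunciation info, extracted per the first language."""
    return [PRONUNCIATIONS.get(ch, ch) for ch in text]

def synthesize(text, speaker_id, target_language="en"):
    """Sketch: first-language text + speaker ID -> second-language speech."""
    feats = linguistic_features(text)
    # A real system would run an acoustic model conditioned on the
    # speaker identifier; here we just return a descriptive record.
    return {"speaker": speaker_id,
            "language": target_language,
            "features": feats}
```

The key idea preserved here is that feature extraction follows the first language's rules while the speaker identifier conditions the cross-lingual output.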
COMPUTING SYSTEM FOR DOMAIN EXPRESSIVE TEXT TO SPEECH
A computing system obtains text that includes words and provides the text as input to an emotional classifier model that has been trained based upon emotional classification. The computing system obtains a textual embedding of the computer-readable text as output of the emotional classifier model. The computing system generates a phoneme sequence based upon the words of the text. The computing system generates, by way of an encoder of a text to speech (TTS) model, a phoneme encoding based upon the phoneme sequence. The computing system provides the textual embedding and the phoneme encoding as input to a decoder of the TTS model. The computing system causes speech that includes the words to be played over a speaker based upon output of the decoder of the TTS model, where the speech reflects an emotion underlying the text due to the textual embedding provided to the decoder.
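The fusion of the textual (emotion) embedding with the phoneme encoding at the decoder input can be sketched numerically; the keyword-count classifier, per-phoneme scalar encoding, and fusion by concatenation are all assumptions for illustration:

```python
# Stand-in emotion classifier: score a few emotion cues by keyword counts.
EMOTION_CUES = {"joy": ["happy", "great"], "anger": ["hate", "awful"]}

def emotion_embedding(text):
    """Toy textual embedding: one count per emotion category."""
    lowered = text.lower()
    return [sum(word in lowered for word in words)
            for words in EMOTION_CUES.values()]

def phoneme_encoding(phonemes):
    """Stand-in encoder output: one scalar per phoneme (its symbol length)."""
    return [len(p) for p in phonemes]

def decoder_input(text, phonemes):
    """Concatenate the emotion embedding with the phoneme encoding."""
    return emotion_embedding(text) + phoneme_encoding(phonemes)
```

In the system described above, a neural decoder conditioned on this combined input would produce speech whose prosody reflects the detected emotion.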
VOICE GENERATING METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM
A voice generating method and apparatus, an electronic device, and a storage medium are provided. The specific implementation includes: acquiring a text to be processed and determining an associated text of the text to be processed; acquiring an associated prosodic feature of the associated text; determining an associated text feature of the associated text based on the text to be processed; determining a spectrum feature of the text to be processed based on the associated prosodic feature and the associated text feature; and generating a target voice corresponding to the text to be processed based on the spectrum feature.
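The steps above can be sketched with stub features; choosing the preceding sentence as the associated text, and the particular length/overlap features below, are illustrative assumptions rather than details from the patent:

```python
def associated_text(current, history):
    """Pick the preceding sentence as the associated text (an assumption)."""
    return history[-1] if history else ""

def prosodic_feature(text):
    """Stand-in prosodic feature: a length-based cue and a pause-comma count."""
    return [len(text) % 7, text.count(",")]

def text_feature(assoc, current):
    """Stand-in associated-text feature: word overlap with the current text."""
    shared = set(assoc.split()) & set(current.split())
    return [len(shared)]

def spectrum_feature(current, history):
    """Fuse the associated prosodic and text features into a spectrum feature."""
    assoc = associated_text(current, history)
    return prosodic_feature(assoc) + text_feature(assoc, current)
```

A real implementation would replace each stub with a learned model; the structure kept here is that the target voice's spectrum depends on features of the *associated* text, not only the text being synthesized.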
METHOD AND SYSTEM FOR USER-INTERFACE ADAPTATION OF TEXT-TO-SPEECH SYNTHESIS
A method and system are disclosed for adapting speech synthesis according to user-interface input. While synthesizing speech from a text segment with a text-to-speech (TTS) system and concurrently displaying the text segment on a display device, the system may receive tracking-operation input that tracks a portion of the text undergoing synthesis and identifies a context portion of the text for which speech has previously been synthesized at a canonical speech-pace. The tracking information may be used to adjust the speech-pace of TTS synthesis of the tracked portion from the canonical speech-pace to an adapted speech-pace, and the speech characteristics of the synthesized speech may be adapted by applying both the adapted speech-pace and the speech characteristics of the prior-synthesized speech of the context portion to TTS synthesis of the tracked portion. The synthesized speech of the tracked portion may then be output at the adapted speech-pace and with the adapted speech characteristics.
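A minimal sketch of deriving the adapted speech-pace from the tracking input; the canonical words-per-minute figure and the blending factor are illustrative assumptions:

```python
# Canonical synthesis pace in words per minute (an assumed value).
CANONICAL_WPM = 150.0

def adapted_pace(tracked_words, tracking_seconds, smoothing=0.5):
    """Blend the user's observed tracking rate with the canonical pace.

    smoothing=1.0 ignores the user entirely; smoothing=0.0 follows the
    user's tracking rate exactly.
    """
    user_wpm = tracked_words / (tracking_seconds / 60.0)
    return smoothing * CANONICAL_WPM + (1.0 - smoothing) * user_wpm
```

Applying the blended pace (rather than the raw tracking rate) keeps the output from jumping abruptly when the user's finger speed fluctuates.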
Clockwork hierarchal variational encoder
A method of providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word and selecting a mel spectral embedding for the text utterance. Each word has at least one syllable and each syllable has at least one phoneme. For each phoneme, the method further includes using the selected mel spectral embedding to: (i) predict a duration of the corresponding phoneme based on corresponding linguistic features associated with the word that includes the corresponding phoneme and corresponding linguistic features associated with the syllable that includes the corresponding phoneme; and (ii) generate a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.
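The mapping from a predicted phoneme duration to a number of fixed-length spectrogram frames can be sketched directly; the 12.5 ms frame shift is a common hop size in TTS systems, not a value taken from the patent:

```python
import math

FRAME_MS = 12.5  # assumed fixed frame length (hop) in milliseconds

def frames_for_duration(duration_ms):
    """Number of fixed-length frames covering the predicted duration."""
    return max(1, math.ceil(duration_ms / FRAME_MS))

def frame_grid(phoneme_durations_ms):
    """Map each phoneme's predicted duration to its frame count."""
    return [frames_for_duration(d) for d in phoneme_durations_ms]
```

Each entry of the grid says how many fixed-length mel-frequency spectrogram frames the decoder must emit for that phoneme.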
UNSUPERVISED ALIGNMENT FOR TEXT TO SPEECH SYNTHESIS USING NEURAL NETWORKS
Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
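Sampling phoneme durations at inference from a separate generative distribution can be sketched as below; the per-phoneme log-normal prior and its parameters are assumptions, standing in for the learned rhythm distribution the abstract describes:

```python
import random

def sample_durations(phonemes, mu=3.5, sigma=0.3, seed=None):
    """Draw one duration (in ms) per phoneme from a log-normal prior.

    A log-normal keeps every sampled duration positive; resampling
    with a different seed yields a different speech rhythm, which is
    the diversity mechanism described above.
    """
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in phonemes]
```

Pitch or energy could be sampled the same way, from their own distributions, to further vary the synthesized speech.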