G10L2013/083

Apparatus for media entity pronunciation using deep learning
11501764 · 2022-11-15

Methods, systems, and related products for voice-enabled computer systems are described. A machine-learning model is trained to produce pronunciation output based on text input. The trained machine-learning model is used to produce pronunciation data for text input even where the text input includes numbers, punctuation, emoji, or other non-letter characters. The machine-learning model is further trained based on real-world data from users to improve pronunciation output.
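
As a minimal, hypothetical sketch of the kind of model this abstract describes (the filing does not disclose an architecture), a character-level network can consume any codepoint, so digits, punctuation, and emoji need no hand-written rules. All names, sizes, and the PyTorch dependency below are illustrative assumptions:

```python
# Illustrative sketch, not the patent's model: a character-level network
# mapping raw text (digits and emoji included) to per-step phoneme logits.
import torch
import torch.nn as nn

class G2PModel(nn.Module):
    def __init__(self, n_chars, n_phonemes, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_phonemes)  # per-step phoneme scores

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer codes for *any* character,
        # so numbers, punctuation, and emoji need no special-case rules.
        x = self.embed(char_ids)
        h, _ = self.encoder(x)
        return self.decoder(h)  # (batch, seq_len, n_phonemes)

# Toy usage: encode a string codepoint-by-codepoint, capped to a small vocab.
text = "Track 3 🔥"
ids = torch.tensor([[min(ord(c), 4095) for c in text]])
model = G2PModel(n_chars=4096, n_phonemes=64)
logits = model(ids)           # phoneme scores per character
print(logits.argmax(-1))      # greedy phoneme ids (untrained here)
```

The further training on real-world user data that the abstract mentions would amount to fine-tuning the same network on logged pronunciation corrections.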

AUDIO PLAYING METHOD, ELECTRONIC DEVICE, AND STORAGE MEDIUM
20220269724 · 2022-08-25

The present application provides an audio playing method, an electronic device, and a computer-readable storage medium. The method comprises: recognizing speech in an audio file to be played to obtain a text file containing sentence segmentation symbols; generating respective sentence segmentation tags at positions in the audio file corresponding to the sentence segmentation symbols, according to a correspondence relationship between the audio file and the text file; in response to a trigger operation, determining a target play point according to a current play position of the audio file and the respective positions of the sentence segmentation tags; and playing the audio file from the target play point.
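
A minimal sketch of the claimed flow, with invented data structures (word/end-time pairs standing in for the recognizer's output):

```python
import bisect

def build_tags(words):
    """words: (word, end_sec) pairs from recognizing the audio file; a word
    ending in a segmentation symbol marks a sentence boundary in the audio."""
    return [end for word, end in words if word[-1] in ".!?;"]

def target_play_point(tags, current_sec, direction=-1):
    """On a trigger operation, jump to the nearest tag before the current
    play position (direction=-1) or after it (direction=+1)."""
    i = bisect.bisect_left(tags, current_sec)
    return tags[max(i + direction, 0)] if direction < 0 else tags[min(i, len(tags) - 1)]

tags = build_tags([("Hello", 0.9), ("world.", 1.6), ("Next", 2.1), ("one.", 3.0)])
print(target_play_point(tags, current_sec=2.5))  # 1.6 -> replay current sentence
```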

Using emoticons for contextual text-to-speech expressivity
09767789 · 2017-09-19

Techniques disclosed herein include systems and methods that improve the audible emotional characteristics used when synthesizing speech from a text source. Systems and methods herein use emoticons identified from a source text to provide contextual text-to-speech expressivity. In general, techniques herein analyze text and identify emoticons included within the text. The source text is then tagged with corresponding mood indicators. For example, if the system identifies an emoticon at the end of a sentence, then the system can infer that this sentence has a specific tone or mood associated with it. Depending on whether the emoticon is a smiley face, angry face, sad face, laughing face, etc., the system can infer a tone or mood from the various emoticons and then change or modify the expressivity of the TTS output, such as by changing intonation, prosody, speed, pauses, and other expressivity characteristics.
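
A hedged sketch of the emoticon-to-mood-to-prosody mapping; the tables and values are invented, not the patent's:

```python
MOODS = {":)": "happy", ":(": "sad", ">:(": "angry", ":D": "laughing"}
PROSODY = {
    "happy":    {"pitch": "+10%", "rate": "105%"},
    "sad":      {"pitch": "-10%", "rate": "90%"},
    "angry":    {"pitch": "+5%",  "rate": "110%", "volume": "loud"},
    "laughing": {"pitch": "+15%", "rate": "100%"},
}

def mood_of(sentence):
    """Infer a mood indicator from an emoticon in the sentence; check longer
    emoticons first so ">:(" wins over its substring ":(" ."""
    for emoticon, mood in sorted(MOODS.items(), key=lambda kv: -len(kv[0])):
        if emoticon in sentence:
            return mood
    return None

def prosody_for(sentence):
    """Expressivity settings to hand the synthesizer for this sentence."""
    return PROSODY.get(mood_of(sentence), {})

print(prosody_for("That encore was unreal :D"))  # laughing prosody
print(prosody_for("Doors open at eight"))        # {} -> neutral delivery
```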

Systems and methods for providing non-lexical cues in synthesized speech

Systems and methods are disclosed for providing non-lexical cues in synthesized speech. An example system includes one or more storage devices including instructions and a processor to execute the instructions. The processor is to execute the instructions to: generate first and second non-lexical cues to enhance speech to be synthesized from text; determine a first insertion point of the first non-lexical cue in the text; determine a second insertion point of the second non-lexical cue in the text; and insert the first non-lexical cue at the first insertion point and the second non-lexical cue at the second insertion point. The example system also includes a transmitter to communicate the text with the inserted first non-lexical cue and the inserted second non-lexical cue over a network.
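
A small sketch of the two-cue insertion steps; the offsets and cue strings are invented for illustration:

```python
def insert_cues(text, cues):
    """cues: (char_offset, cue_string) pairs; splice right-to-left so the
    earlier offsets remain valid after each insertion."""
    for offset, cue in sorted(cues, reverse=True):
        text = text[:offset] + cue + text[offset:]
    return text

msg = "I checked the logs. The deploy failed."
# First cue after the first sentence, second cue before the last word.
print(insert_cues(msg, [(20, "[breath] "), (30, ", uh,")]))
# -> "I checked the logs. [breath] The deploy, uh, failed."
# The enhanced text then travels over the network to the synthesizer.
```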

SYSTEM AND METHODOLOGY FOR MODULATION OF DYNAMIC GAPS IN SPEECH
20220189500 · 2022-06-16

A system capable of speech gap modulation is configured to: receive at least one composite speech portion, which comprises at least one speech portion and at least one dynamic-gap portion, wherein the speech portion(s) comprise at least one variable-value speech portion and the dynamic-gap portion(s) are associated with a pause in speech; receive at least one synchronization point, wherein the synchronization point(s) associate a point in time in the composite speech portion(s) with a point in time in other media portion(s); and modulate the dynamic-gap portion(s), based at least partially on the variable-value speech portion(s) and on the synchronization point(s), thereby generating at least one modulated composite speech portion. This facilitates improved synchronization of the modulated composite speech portion(s) and the other media portion(s) at the synchronization point(s) when combining the other media portion(s) and the audio-format modulated composite speech portion(s) into a synchronized multimedia output.
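
A worked, hypothetical example of the modulation arithmetic (the patent gives no formulas): shrink or stretch the dynamic gap so the synchronization point still lands at its target time in the other media.

```python
def modulate_gap(speech_durations_sec, sync_target_sec):
    """New gap length so speech resumes exactly at the sync point; clamped
    at zero when the variable-value speech already runs long."""
    return max(sync_target_sec - sum(speech_durations_sec), 0.0)

# "Hi <name>," must finish, then a gap, then speech resumes on a 3.0 s
# video cue. A short name leaves a long gap; a long name shrinks it.
print(modulate_gap([0.4, 0.5], sync_target_sec=3.0))   # ~2.1 s gap
print(modulate_gap([0.4, 1.9], sync_target_sec=3.0))   # ~0.7 s gap
```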

METHOD FOR GENERATING SYNTHETIC SPEECH AND SPEECH SYNTHESIS SYSTEM
20220165247 · 2022-05-26

This application relates to a speech synthesis system. In one aspect, the system includes an encoder configured to generate a speaker embedding vector based on a first speech signal corresponding to a verbal utterance. The system may also include a synthesizer configured to perform, at least once, a cycle of generating a plurality of spectrograms corresponding to a verbal utterance of a sequence of text written in a particular natural language, based on the speaker embedding vector, and selecting a first spectrogram from among the spectrograms to output. The system may further include a vocoder configured to generate a second speech signal corresponding to the sequence of the text based on the first spectrogram.
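
A structural sketch only: the three stages named in the abstract, with stubs standing in for the real networks (nothing here reflects the actual models):

```python
import numpy as np

def encoder(reference_audio):
    """Speaker encoder: reference waveform -> fixed-size speaker embedding."""
    rng = np.random.default_rng(abs(hash(reference_audio.tobytes())) % 2**32)
    return rng.normal(size=256)                    # stand-in embedding

def synthesizer(text, speaker_embedding, n_candidates=4):
    """Generate several candidate mel spectrograms, then select one (the
    abstract's generate-and-select cycle); the score is a dummy stand-in
    for whatever learned quality metric picks the first spectrogram."""
    candidates = [np.zeros((80, 20 + len(text))) for _ in range(n_candidates)]
    scores = [c.sum() for c in candidates]         # stand-in quality score
    return candidates[int(np.argmax(scores))]

def vocoder(mel):
    """Mel spectrogram -> waveform (stub: silence of a plausible length)."""
    return np.zeros(mel.shape[1] * 256)

ref = np.zeros(16000)                              # 1 s reference utterance
wav = vocoder(synthesizer("Hello there", encoder(ref)))
print(wav.shape)
```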

User interface for generating expressive content

Generation of expressive content is provided. An expressive synthesized speech system provides improved voice-authoring user interfaces by which a user can efficiently author content for generating expressive output. The system provides an expressive keyboard for entering textual content and for selecting expressive operators, such as emoji objects or punctuation objects, that apply predetermined prosody attributes or visual effects to the textual content. A voicesetting editor mode enables the user to author and adjust particular prosody attributes associated with the content for composing carefully crafted synthetic speech. An active listening mode (ALM) is also provided; when it is selected, a set of ALM effect options is displayed, each associated with a particular sound effect and/or visual effect, so the user can rapidly respond with expressive vocal sound effects or visual effects while listening to others speak.
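
An invented sketch of how expressive operators might map to predetermined prosody attributes, with a voicesetting-style manual override; the names and values are assumptions:

```python
# Hypothetical operator table: keyboard objects (emoji, punctuation)
# mapped to predetermined prosody attributes.
OPERATORS = {
    "😀": {"pitch": "+12%", "rate": "108%"},
    "!!": {"volume": "loud", "rate": "112%"},
    "…":  {"pause_ms": 600},
}

def apply_operators(phrase, operators, overrides=None):
    """Merge operator prosody with the author's manual voicesetting edits."""
    settings = {}
    for op in operators:
        settings.update(OPERATORS.get(op, {}))
    settings.update(overrides or {})               # hand-tuned values win
    return {"text": phrase, "prosody": settings}

print(apply_operators("No way", ["😀", "!!"], overrides={"rate": "100%"}))
```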

SYSTEMS AND METHODS FOR PROVIDING NON-LEXICAL CUES IN SYNTHESIZED SPEECH

Systems and methods are disclosed for providing non-lexical cues in synthesized speech. An example system includes processor circuitry to generate a breathing cue to enhance speech to be synthesized from text; determine a first insertion point of the breathing cue in the text, wherein the breathing cue is identified by a first tag of a markup language; generate a prosody cue to enhance speech to be synthesized from the text; determine a second insertion point of the prosody cue in the text, wherein the prosody cue is identified by a second tag of the markup language; insert the breathing cue at the first insertion point based on the first tag and the prosody cue at the second insertion point based on the second tag; and trigger a synthesis of the speech from the text, the breathing cue, and the prosody cue.
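
A sketch of the tagged-insertion steps using SSML-style markup. The <prosody> element is real SSML; the <breath/> tag is hypothetical, standing in for whatever markup-language tag identifies the breathing cue:

```python
def insert_tagged_cues(text, breath_at, prosody_span, prosody_attrs):
    """Splice a breath tag and a prosody span into the text, highest offset
    first so earlier offsets stay valid, then wrap for synthesis."""
    start, end = prosody_span
    attrs = " ".join(f'{k}="{v}"' for k, v in prosody_attrs.items())
    edits = sorted([(end, "</prosody>"),
                    (start, f"<prosody {attrs}>"),
                    (breath_at, "<breath/> ")], reverse=True)
    for pos, tag in edits:
        text = text[:pos] + tag + text[pos:]
    return f"<speak>{text}</speak>"

print(insert_tagged_cues("Well that was close.", breath_at=5,
                         prosody_span=(10, 19), prosody_attrs={"rate": "slow"}))
# -> <speak>Well <breath/> that <prosody rate="slow">was close</prosody>.</speak>
```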

Systems and methods for providing non-lexical cues in synthesized speech

Systems and methods are disclosed for providing non-lexical cues in synthesized speech. An example system includes one or more storage devices including instructions and a processor to execute the instructions. The processor is to execute the instructions to: determine a user tone of a user input; generate a response to the user input based on the user tone; and identify a response tone associated with the user tone. The example system also includes a transmitter to communicate the response and the response tone over a network.
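
A hypothetical sketch of the tone-matching flow; the keyword classifier and tone table are invented placeholders for whatever the system actually uses:

```python
TONE_KEYWORDS = {"angry": {"terrible", "furious", "worst"},
                 "happy": {"great", "thanks", "awesome"}}
RESPONSE_TONE = {"angry": "calm", "happy": "upbeat", "neutral": "neutral"}

def user_tone(text):
    """Crude keyword classifier standing in for real tone detection."""
    words = set(text.lower().split())
    for tone, keys in TONE_KEYWORDS.items():
        if words & keys:
            return tone
    return "neutral"

def respond(user_input):
    """Pair a generated reply with the response tone matched to the user."""
    tone = user_tone(user_input)
    reply = "Let me help with that."       # placeholder response generator
    return {"response": reply, "response_tone": RESPONSE_TONE[tone]}

print(respond("This is the worst update ever"))  # delivered in a calm tone
```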
