G10L2013/105

PREDICTION DEVICE, PREDICTION METHOD, AND PROGRAM

An estimation device (100) that estimates the duration of a speech section includes: a representation conversion unit (11) that converts a plurality of words included in learning utterance information into a plurality of pieces of numeric representation data; an estimation data generation unit (12) that generates estimation data by using a plurality of pieces of the learning utterance information and the plurality of pieces of numeric representation data; an estimation model learning unit (13) that learns an estimation model by using the estimation data and the durations of the plurality of words; and an estimation unit (20) that estimates the duration of a predetermined speech section based on utterance information of a user by using the estimation model.
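
The abstract names four functional units but no concrete representations or models. The toy sketch below, with an invented word-hashing representation and a least-squares regressor, only illustrates how units (11)-(13) could feed the estimation unit (20); every model choice here is an assumption.

```python
import hashlib
import numpy as np

def to_numeric(word: str, dim: int = 8) -> np.ndarray:
    """Representation conversion unit (11): word -> numeric vector."""
    seed = int.from_bytes(hashlib.md5(word.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

# Estimation data generation unit (12): pair numeric data with durations.
train_words = ["hello", "world", "speech", "duration"]
durations_s = np.array([0.41, 0.38, 0.52, 0.67])  # invented example values
X = np.stack([to_numeric(w) for w in train_words])

# Estimation model learning unit (13): fit a least-squares duration model.
weights, *_ = np.linalg.lstsq(X, durations_s, rcond=None)

# Estimation unit (20): estimate the duration of a new speech section.
print(f"{to_numeric('synthesis') @ weights:.3f} s")
```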

User interface for generating expressive content

Generation of expressive content is provided. An expressive synthesized speech system provides improved voice authoring user interfaces by which a user is enabled to efficiently author content for generating expressive output. The system provides an expressive keyboard for entering textual content and for selecting expressive operators, such as emoji objects or punctuation objects, which apply predetermined prosody attributes or visual effects to the textual content. A voicesetting editor mode enables the user to author and adjust particular prosody attributes associated with the content for composing carefully crafted synthetic speech. An active listening mode (ALM) is also provided; when it is selected, a set of ALM effect options is displayed, each option associated with a particular sound effect and/or visual effect. The user is thereby enabled to respond rapidly with expressive vocal sound effects or visual effects while listening to others speak.
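
As a rough illustration of the expressive-operator idea, the sketch below maps emoji and punctuation operators to predetermined prosody attributes. The attribute schema, names, and values are assumptions for illustration, not the patent's.

```python
from dataclasses import dataclass

@dataclass
class Prosody:
    rate: float = 1.0    # relative speaking rate (assumed schema)
    pitch: float = 0.0   # pitch shift in semitones
    volume: float = 1.0  # relative loudness

# Each expressive operator carries predetermined prosody attributes.
OPERATORS = {
    "!":   Prosody(rate=1.1, pitch=2.0, volume=1.2),   # excited
    "...": Prosody(rate=0.8, pitch=-1.0),              # hesitant
    "😢":  Prosody(rate=0.7, pitch=-2.0, volume=0.8),  # sad
}

def voiceset(text: str, operator: str) -> tuple[str, Prosody]:
    """Attach the operator's predetermined prosody to the typed text."""
    return text, OPERATORS.get(operator, Prosody())

print(voiceset("see you tomorrow", "😢"))
```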

Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium

The present disclosure discloses a method and apparatus for training a model, a method and apparatus for synthesizing speech, a device, and a storage medium, and relates to the fields of natural language processing and deep learning. The method for training a model may include: determining a phoneme feature and a prosodic word boundary feature of sample text data; inserting a pause character into the phoneme feature according to the prosodic word boundary feature to obtain a combined feature of the sample text data; and training an initial speech synthesis model according to the combined feature of the sample text data to obtain a target speech synthesis model.
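
The pause-insertion step lends itself to a short sketch. The pause token name and the encoding of boundaries below are illustrative assumptions; the disclosure does not fix them.

```python
PAUSE = "sp"  # assumed pause character

def combine(phonemes: list[str], word_lens: list[int],
            boundary_after_word: set[int]) -> list[str]:
    """Insert PAUSE after each prosodic word flagged as a boundary."""
    out, i = [], 0
    for w, n in enumerate(word_lens):
        out.extend(phonemes[i:i + n])
        i += n
        if w in boundary_after_word:
            out.append(PAUSE)
    return out

# Two prosodic words of four phonemes each, with a pause after the first.
print(combine(["n", "i", "h", "ao", "sh", "i", "j", "ie"],
              word_lens=[4, 4], boundary_after_word={0}))
```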

Unsupervised alignment for text to speech synthesis using neural networks

Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
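
A minimal sketch of sampling rhythm from a separate generative distribution follows. The log-normal parameterization and temperature knob are assumptions, since the abstract does not name a distribution family.

```python
import numpy as np

rng = np.random.default_rng()

def sample_durations(mu: np.ndarray, sigma: np.ndarray,
                     temperature: float = 1.0) -> np.ndarray:
    """Draw per-phoneme durations (seconds) from log-normal distributions."""
    z = rng.standard_normal(mu.shape) * temperature
    return np.exp(mu + sigma * z)

mu = np.log(np.array([0.06, 0.11, 0.09]))  # per-phoneme mean log-duration
sigma = np.array([0.20, 0.30, 0.25])       # per-phoneme spread
print(sample_durations(mu, sigma))         # a fresh rhythm on every call
```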

System for Communication Skills Training Using Juxtaposition of Recorded Takes
20220028369 · 2022-01-27

An Internet-based application allows a trainee to record a performance of a scene containing roles A and B, with performers for the scene's roles alternately speaking their respective lines. The system displays the lines in a teleprompter style and, based on the experience level of the trainee, may blank out increasing portions of the teleprompter-style lines. If the trainee is assigned role A, the system presents each role A line to be spoken by the trainee with a time progress bar indicating the speed/timing or time remaining for that line. The trainee's performance is recorded by a computer. The teleprompter timer ensures that the trainee's performance is coordinated with a take of role B, even though the trainee's take and the role B take are actually recorded at different times. The takes are played in tandem to evaluate the effectiveness of the training.
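
The experience-based blanking can be sketched simply. The linear blanking schedule below is an assumption; the abstract only says that increasing portions are blanked out.

```python
def blank_line(line: str, experience_level: int, max_level: int = 5) -> str:
    """Blank a larger fraction of the line as experience level rises."""
    words = line.split()
    n_blank = round(len(words) * min(experience_level, max_level) / max_level)
    shown = words[:len(words) - n_blank]
    return " ".join(shown + ["____"] * n_blank)

line = "To be or not to be that is the question"
for level in (0, 2, 4):
    print(level, "->", blank_line(line, level))
```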

Speech Synthesis Prosody Using A BERT Model

A method for generating a prosodic representation includes receiving a text utterance having one or more words, each word having at least one syllable, and each syllable having at least one phoneme. The method also includes generating, using a Bidirectional Encoder Representations from Transformers (BERT) model, a sequence of wordpiece embeddings, each associated with one of the one or more words of the text utterance, and selecting an utterance embedding for the text utterance, the utterance embedding representing an intended prosody. For each syllable, using the selected utterance embedding and a prosody model that incorporates the BERT model, the method generates a corresponding prosodic syllable embedding based on the wordpiece embedding associated with the word that includes the syllable, and predicts a duration of the syllable by encoding linguistic features of each phoneme of the syllable with the corresponding prosodic syllable embedding.
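
A heavily simplified sketch of this flow follows, with random vectors standing in for BERT outputs and assumed linear layers; all shapes and the softplus duration head are illustrative, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
wordpiece_emb = rng.standard_normal(D)   # stand-in BERT wordpiece embedding
utterance_emb = rng.standard_normal(D)   # selected intended-prosody embedding

W_syl = rng.standard_normal((D, 2 * D))  # assumed projection weights
syllable_emb = np.tanh(W_syl @ np.concatenate([wordpiece_emb, utterance_emb]))

phoneme_feats = rng.standard_normal((3, D))  # linguistic features, 3 phonemes
w_dur = rng.standard_normal(2 * D)           # assumed duration regressor

# Encode the syllable's phoneme features with its prosodic embedding.
encoded = np.concatenate([phoneme_feats.mean(axis=0), syllable_emb])
duration_s = np.logaddexp(0.0, w_dur @ encoded)  # softplus keeps it positive
print(f"predicted syllable duration: {duration_s:.3f} s")
```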

SPEECH SYNTHESIS METHOD AND APPARATUS, DEVICE, AND STORAGE MEDIUM
20230298564 · 2023-09-21

Embodiments of this application provide a speech synthesis method performed by an electronic device. The method includes: acquiring a target text to be synthesized into speech; generating hidden layer features and prosodic features of the target text, and predicting the pronunciation duration of characters in the target text, using an acoustic model corresponding to the target text; generating acoustic features corresponding to the target text based on the hidden layer features, the prosodic features, and the pronunciation duration; and synthesizing a target speech corresponding to the target text according to the acoustic features. The solution provided by these embodiments helps reduce the difficulty of speech synthesis.
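
Combining per-character features with predicted durations resembles length regulation in parallel TTS. The sketch below illustrates that reading with invented shapes and frame counts; it is not the application's actual acoustic model.

```python
import numpy as np

rng = np.random.default_rng(2)
chars = 4
hidden = rng.standard_normal((chars, 8))   # hidden layer features
prosody = rng.standard_normal((chars, 4))  # prosodic features
dur_frames = np.array([3, 5, 2, 4])        # predicted duration in frames

# Combine per-character features, then repeat each row for its duration
# to obtain frame-level acoustic features a vocoder could render.
combined = np.concatenate([hidden, prosody], axis=1)  # (chars, 12)
acoustic = np.repeat(combined, dur_frames, axis=0)    # (14 frames, 12)
print(acoustic.shape)
```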

METHODS AND APPARATUS TO CONVERT IMAGE TO AUDIO
20230134984 · 2023-05-04

Methods, apparatus, systems, and articles of manufacture are disclosed. An example apparatus includes: at least one memory; instructions; and processor circuitry to execute the instructions to: identify a word in an image, the word to be converted to an audio waveform; encode the word identified in the image into an ordered list of phonemes; and synthesize the audio waveform of the word based on an output of a neural network that determines a duration for which a phoneme of the ordered list of phonemes is to be expressed in the audio waveform.
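
A toy end-to-end sketch of the word-to-audio flow follows. The lexicon, the duration table (standing in for the neural network's output), and the sine-tone synthesis are all illustrative stand-ins.

```python
import numpy as np

LEXICON = {"cat": ["K", "AE", "T"]}              # word -> ordered phonemes
DURATION_S = {"K": 0.05, "AE": 0.15, "T": 0.06}  # stand-in network output
PITCH_HZ = {"K": 180.0, "AE": 220.0, "T": 160.0}

def synthesize(word: str, sr: int = 16000) -> np.ndarray:
    """Render each phoneme for its predicted duration, then concatenate."""
    chunks = []
    for ph in LEXICON[word]:  # encode the word into its phoneme list
        t = np.arange(int(DURATION_S[ph] * sr)) / sr
        chunks.append(np.sin(2 * np.pi * PITCH_HZ[ph] * t))  # toy "phoneme"
    return np.concatenate(chunks)

print(synthesize("cat").shape)  # samples = sum of phoneme durations * sr
```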

TEXT DATA PROCESSING METHOD AND APPARATUS
20230360634 · 2023-11-09

The present disclosure relates to text data processing methods and apparatuses. One example method includes obtaining target text, where a phoneme of the target text includes a first phoneme and a second phoneme that are adjacent to each other. Feature extraction is performed on the first phoneme and the second phoneme to obtain a first audio feature of the first phoneme and a second audio feature of the second phoneme. By using a target recurrent neural network (RNN) and based on the first audio feature, first speech data corresponding to the first phoneme is obtained. By using the target RNN and based on the second audio feature, second speech data corresponding to the second phoneme is obtained. By using a vocoder and based on the first speech data and the second speech data, audio corresponding to the first phoneme and audio corresponding to the second phoneme are obtained.
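
A minimal numeric sketch of the per-phoneme RNN-plus-vocoder flow follows. The RNN sizes and the sine-based stand-in vocoder are assumptions made only to show the data path.

```python
import numpy as np

rng = np.random.default_rng(3)
D, H = 6, 8
Wx, Wh = rng.standard_normal((H, D)), rng.standard_normal((H, H))

def rnn_step(feat: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Target RNN: one phoneme's audio feature -> its speech data."""
    return np.tanh(Wx @ feat + Wh @ h)

def vocoder(speech_data: np.ndarray, n: int = 80) -> np.ndarray:
    """Stand-in vocoder: speech data -> a short audio chunk."""
    t = np.arange(n)
    return np.sin(2 * np.pi * (2 + speech_data[0]) * t / n)

h = np.zeros(H)
audio = []
for feat in rng.standard_normal((2, D)):  # first and second phoneme features
    h = rnn_step(feat, h)                 # first/second speech data
    audio.append(vocoder(h))              # audio for each phoneme
print(np.concatenate(audio).shape)
```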
