G10L13/0335

Method and terminal for generating simulated voice of virtual teacher

Disclosed are a method and a terminal for generating simulated voices of virtual teachers. Real voice samples of teachers are collected and converted into text sequences, from which a text emotion polarity training set and a text tone training set are constructed. A lexical-item emotion model is built from the lexical items in the text sequences and trained on the emotion polarity training set, yielding word vectors, an emotion polarity vector, and a weight parameter. The similarity between each word vector and the emotion polarity vector is then calculated, emotion features are extracted according to the similarity results, and a conditional vocoder is constructed from the voice styles and emotion features to generate new voices with emotional variation. The method and the terminal help satisfy the application requirements of high-quality virtual teachers.
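The similarity calculation between a word vector and the emotion polarity vector is typically a cosine measure. A minimal sketch, assuming cosine similarity and a hypothetical threshold for selecting emotion features (the vectors and threshold below are illustrative placeholders, not values from the patent):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical trained vectors: one word vector and the emotion polarity vector.
word_vec = [0.8, 0.1, 0.3]
polarity_vec = [0.7, 0.2, 0.4]

# Words whose similarity exceeds a threshold are kept as emotion features.
THRESHOLD = 0.9
similarity = cosine_similarity(word_vec, polarity_vec)
is_emotion_feature = similarity > THRESHOLD
```

In a full pipeline this test would run over every lexical item in the text sequence, collecting the high-similarity words as the emotion features fed to the conditional vocoder.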

Cognitive modification of verbal communications from an interactive computing device

A method includes: determining, by a computer device, a current context associated with a user who is the target audience of an unprompted verbal output of an interactive computing device; determining, by the computer device, one or more parameters that are most effective in getting the attention of the user in the determined current context; and modifying, by the computer device, the unprompted verbal output of the interactive computing device using the determined one or more parameters.

END-TO-END MODULAR SPEECH SYNTHESIS SYSTEMS AND METHODS

A method for speech synthesis using prosody capture and transfer includes receiving a first speech in a target prosody and receiving a second speech in a target voice; extracting prosodic features from a first speech segment in the target prosody; and generating a synthetic speech segment in the target voice with the target prosody by transferring the prosodic features, per phoneme, from the first speech segment to a second speech segment.
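The per-phoneme transfer step can be pictured as copying pitch and timing features from aligned source phonemes onto target-voice phonemes. A minimal sketch; the one-to-one alignment, the feature names (`f0`, `duration`), and the dictionary representation are all assumptions, not the patent's actual data structures:

```python
def transfer_prosody(source_phonemes, target_phonemes):
    # Per-phoneme transfer: copy prosodic features extracted from the
    # prosody-source segment onto the aligned target-voice phonemes.
    # A one-to-one alignment is assumed here for simplicity.
    assert len(source_phonemes) == len(target_phonemes)
    out = []
    for src, tgt in zip(source_phonemes, target_phonemes):
        merged = dict(tgt)                    # keep the target voice identity
        merged["f0"] = src["f0"]              # transfer pitch
        merged["duration"] = src["duration"]  # transfer timing
        out.append(merged)
    return out
```

A real system would also need to time-align the two utterances (e.g. via forced alignment) before any per-phoneme copy is possible.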

Unsupervised alignment for text to speech synthesis using neural networks

Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
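Modeling speech rhythm as a separate generative distribution means phoneme durations can simply be sampled at inference time. A minimal sketch, assuming a log-normal duration distribution and 10 ms frames; the distribution family and its parameters are hypothetical, not taken from the patent:

```python
import random

def sample_durations(phonemes, mu=-2.5, sigma=0.3, frame_ms=10, rng=None):
    # Sample each phoneme's duration (in frames) from a log-normal rhythm
    # distribution, independently of the text-to-spectrogram model.
    # exp(mu) ~= 82 ms gives a plausible median phoneme length (assumed).
    rng = rng or random.Random()
    durations = []
    for _ in phonemes:
        seconds = rng.lognormvariate(mu, sigma)
        durations.append(max(1, round(seconds * 1000 / frame_ms)))
    return durations
```

Re-sampling with a different random state yields a different rhythm for the same text, which is the diversity benefit the abstract describes; pitch or energy could be sampled the same way.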

Oral communication device and computing system for processing data and outputting user feedback, and related methods

Typical graphical user interfaces and predefined data fields limit the interaction between a person and a computing system. An oral communication device and a data enablement platform are provided for ingesting oral conversational data from people, and using machine learning to provide intelligence. At the front end, an oral conversational bot, or chatbot, interacts with a user. On the backend, the data enablement platform has a computing architecture that ingests data from various external data sources as well as data from internal applications and databases. These data and algorithms are applied to surface new data, identify trends, provide recommendations, infer new understanding, predict actions and events, and automatically act on this computed information. The chatbot then provides audio data that reflects the information computed by the data enablement platform. The system and the devices, for example, are adaptable to various industries.

GENERATING SPEECH SIGNALS USING BOTH NEURAL NETWORK-BASED VOCODING AND GENERATIVE ADVERSARIAL TRAINING
20210366461 · 2021-11-25

Systems and methods are described for generating speech signals using an encoder/decoder that synthesizes human voice signals (a “vocoder”). A processor may receive inputs that include a plurality of mel scale spectrograms, a fundamental frequency signal, and a constant noise signal. The received inputs may be encoded into two vector sequences by concatenating the received inputs and then filtering a result of the concatenating operation. A plurality of harmonic samples may be generated from one of the two generated vector sequences using an additive oscillator. The harmonic samples may be generated using processing steps that include applying a sigmoid non-linearity to the one of the two generated vector sequences. Each of the plurality of harmonic samples may then be used to create a vector, which may be used as an input for a convolutional decoder for adversarial training. The convolutional decoder may output the speech signals based on the vector made up of the plurality of harmonic samples.
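The additive-oscillator step can be illustrated by summing sinusoids at integer multiples of the fundamental frequency, with per-harmonic amplitudes passed through the sigmoid non-linearity the abstract mentions. The amplitude logits, sample rate, and sample count below are placeholders, not values from the patent:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def harmonic_samples(f0, amp_logits, sr=16000, n_samples=160):
    # Additive oscillator: sum sinusoids at integer multiples of f0.
    # Per-harmonic amplitudes come from a sigmoid applied to the
    # (hypothetical) encoded vector sequence, as the abstract describes.
    amps = [sigmoid(a) for a in amp_logits]
    samples = []
    for t in range(n_samples):
        phase = 2.0 * math.pi * f0 * t / sr
        samples.append(sum(a * math.sin((k + 1) * phase)
                           for k, a in enumerate(amps)))
    return samples
```

In the described system these harmonic samples would then be stacked into a vector and fed to the convolutional decoder for adversarial training.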

SOUND SIGNAL SYNTHESIS METHOD, NEURAL NETWORK TRAINING METHOD, AND SOUND SYNTHESIZER
20210350783 · 2021-11-11

A sound signal synthesis method includes inputting control data representing conditions of a sound signal into a neural network to estimate first data representing a deterministic component of the sound signal and second data representing a stochastic component of the sound signal, and combining the deterministic component represented by the first data with the stochastic component represented by the second data to generate the sound signal. The neural network has learned a relationship between control data representing the conditions of a reference signal, the deterministic component of the reference signal, and the stochastic component of the reference signal.
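The combining step amounts to adding a scaled noise component to the deterministic component. A minimal sketch, assuming Gaussian noise and a per-sample standard deviation as the stochastic component's representation (both assumptions; the patent does not specify the noise model):

```python
import random

def combine_components(deterministic, noise_std, rng=None):
    # Final synthesis step: add a stochastic (noise) component, scaled per
    # sample by the network-estimated standard deviation, to the
    # deterministic (e.g. harmonic) component.
    rng = rng or random.Random()
    return [d + s * rng.gauss(0.0, 1.0)
            for d, s in zip(deterministic, noise_std)]
```

With the noise scale driven to zero, the output reduces to the deterministic component alone, which makes the split easy to sanity-check.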

Speech synthesis apparatus and method

The present disclosure relates to a speech synthesis apparatus and method that can remove discontinuity between phoneme units when generating a synthesized sound from the phoneme units, thereby implementing natural utterances and producing a high-quality synthesized sound having stable prosody.

Method for changing speed and pitch of speech and speech synthesis system
11776528 · 2023-10-03

This application relates to a method of synthesizing a speech whose speed and pitch are changed. In one aspect, a spectrogram is generated by performing a short-time Fourier transform on a first speech signal using a first hop length and a first window length, and speech signals of sections having a second window length are extracted from the spectrogram at intervals of a second hop length. The ratio between the first hop length and the second hop length is set equal to the playback rate, and the ratio between the first window length and the second window length is set equal to the pitch change rate, thereby generating a second speech signal whose speed and pitch are changed.
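The stated ratios determine the second hop and window lengths directly from the first ones. A small sketch of that arithmetic, with function and parameter names of my own choosing and rounding to integer sample counts as an added assumption:

```python
def resynthesis_params(first_hop, first_window, playback_rate, pitch_rate):
    # Per the stated ratios:
    #   first_hop / second_hop       == playback_rate  (speed change)
    #   first_window / second_window == pitch_rate     (pitch change)
    second_hop = round(first_hop / playback_rate)
    second_window = round(first_window / pitch_rate)
    return second_hop, second_window

# e.g. 1.5x faster playback and pitch raised by a factor of 2:
# resynthesis_params(256, 1024, 1.5, 2.0) -> (171, 512)
```

Rates of 1.0 leave both lengths unchanged, so the original signal is reproduced; values above 1.0 shorten the hop (faster playback) or the window (higher pitch).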

Development of voice and other interaction applications

Among other things, a developer of an interaction application for an enterprise can create items of content to be provided to an assistant platform for use in responses to requests of end-users. The developer can deploy the interaction application using defined items of content and an available general interaction model including intents and sample utterances having slots. The developer can deploy the interaction application without requiring the developer to formulate any of the intents, sample utterances, or slots of the general interaction model.