Patent classifications
G10L13/033
Method of embodying online media service having multiple voice systems
A method of embodying an online media service having a multiple voice system includes a first operation of collecting preset online articles and content from a specific media site and displaying the online articles and content on a screen of a personal terminal, a second operation of inputting a voice of a subscriber or setting a voice of a specific person among voices that are pre-stored in a database, a third operation of recognizing and classifying the online articles and content, a fourth operation of converting the classified online articles and content into speech, and a fifth operation of outputting the online articles and content using the voice of the subscriber or the specific person, which is set in the second operation.
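The five operations in this abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented method: the `Article` type, the keyword table, and the `classify`, `synthesize`, and `read_articles` names are all hypothetical, classification is reduced to keyword matching, and synthesis is a placeholder that tags text with the selected voice instead of producing audio.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Article:
    title: str
    body: str


# Hypothetical keyword table standing in for the "recognizing and
# classifying" operation; the abstract does not specify a classifier.
CATEGORY_KEYWORDS = {
    "sports": ("match", "league", "score"),
    "finance": ("market", "stock", "bank"),
}


def classify(article: Article) -> str:
    """Third operation: assign a category (keyword stand-in)."""
    text = f"{article.title} {article.body}".lower()
    for category, words in CATEGORY_KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "general"


def synthesize(text: str, voice: str) -> str:
    """Fourth and fifth operations: placeholder that tags the text with
    the chosen voice instead of producing audio."""
    return f"[{voice}] {text}"


def read_articles(articles: List[Article], voice: str) -> List[Tuple[str, str]]:
    """Classify each collected article and 'speak' it in the set voice."""
    return [
        (classify(a), synthesize(f"{a.title}. {a.body}", voice))
        for a in articles
    ]
```

The `voice` argument plays the role of the second operation's choice between the subscriber's own voice and a pre-stored one; a real system would pass a voice model rather than a label.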
Nuance-based augmentation of sign language communication
In certain embodiments, nuance-based augmentation of gesture may be facilitated. In some embodiments, a video stream depicting sign language gestures of an individual may be obtained via a wearable device associated with a user. A textual translation of the sign language gestures in the video stream may be determined. Emphasis information related to the sign language gestures may be identified based on an intensity of the sign language gestures. One or more display characteristics may be determined based on the emphasis information. The textual translation may be caused to be displayed to the user via the wearable device according to the one or more display characteristics. In some embodiments, a unique voice profile for the individual may be determined. A spoken translation of the sign language gestures may be generated according to the textual translation, the unique voice profile, and the emphasis information.
Predicting spectral representations for training speech synthesis neural networks
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a neural network to perform speech synthesis. One of the methods includes obtaining a training data set for training a first neural network to process a spectral representation of an audio sample and to generate a prediction of the audio sample, wherein, after training, the first neural network obtains spectral representations of audio samples from a second neural network; for a plurality of audio samples in the training data set: generating a ground-truth spectral representation of the audio sample; and processing the ground-truth spectral representation using a third neural network to generate an updated spectral representation of the audio sample; and training the first neural network using the updated spectral representations, wherein the third neural network is configured to generate updated spectral representations that resemble spectral representations generated by the second neural network.
System and method for posthumous dynamic speech synthesis using neural networks and deep learning
A system and method for posthumous dynamic speech synthesis digitally clones the original voice of a deceased user, which allows an operational user to remember the original user post mortem. The system utilizes a neural network and deep learning to digitally duplicate the vocal frequency, personality, and characteristics of the original voice of the deceased user. This systematic approach to dynamic speech synthesis involves several stages of compression, coding, decoding, and training on the speech patterns of the original voice. The data processing of the original voice includes audio sampling and a Lossy-Lossless method of dual compression. Additionally, the voice data is compressed to generate a Mel spectrogram. A voice codec converts the spectrogram into a PNG file, which is synthesized into the cloned voice. After the algorithmic operations, coding, and decoding of the voice data, the generated cloned voice is implemented in a physical media outlet for consumption by the operational user.
Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
A text-to-speech synthesis method using machine learning is disclosed. The method includes generating a single artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of learning texts and speech data corresponding to the plurality of learning texts, receiving an input text, receiving an articulatory feature of a speaker, and generating output speech data for the input text reflecting the articulatory feature of the speaker by inputting the articulatory feature of the speaker to the single artificial neural network text-to-speech synthesis model.
Emotion classification information-based text-to-speech (TTS) method and apparatus
Disclosed are an emotion classification information-based text-to-speech (TTS) method and device. The emotion classification information-based TTS method according to an embodiment of the present invention may, when emotion classification information is set in a received message, transmit metadata corresponding to the set emotion classification information to a speech synthesis engine and, when no emotion classification information is set in the received message, generate new emotion classification information through semantic analysis and context analysis of sentences in the received message and transmit the metadata to the speech synthesis engine. The speech synthesis engine may perform speech synthesis by carrying emotion classification information based on the transmitted metadata.
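The branching described in this abstract (use the sender's emotion label when one is set, otherwise infer one from the message) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Message` type, the metadata layout, and the `infer_emotion` and `build_metadata` names are hypothetical, and a simple keyword lookup stands in for the semantic and context analysis.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists standing in for semantic and context analysis.
EMOTION_KEYWORDS = {
    "joy": ("great", "congratulations", "happy"),
    "sadness": ("sorry", "miss", "sad"),
    "anger": ("unacceptable", "furious", "angry"),
}


@dataclass
class Message:
    text: str
    emotion: Optional[str] = None  # label set by the sender, if any


def infer_emotion(text: str) -> str:
    """Fallback path: derive a label from the message text."""
    lowered = text.lower()
    for label, words in EMOTION_KEYWORDS.items():
        if any(word in lowered for word in words):
            return label
    return "neutral"


def build_metadata(message: Message) -> dict:
    """Build the metadata that would be sent to the speech synthesis engine."""
    if message.emotion is not None:
        # Emotion classification information was set in the received message.
        return {"emotion": message.emotion, "source": "sender"}
    # No label was set, so generate one from the text.
    return {"emotion": infer_emotion(message.text), "source": "inferred"}
```

The `source` field is an assumption added for illustration; it lets a downstream engine distinguish sender-supplied labels from inferred ones.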
Voice synthesis for virtual agents
Techniques are described for generating a custom voice for a virtual agent. In one implementation, a method includes receiving information identifying a customer contacting a call center. The method includes selecting a voice for a virtual agent based on information about the customer. The method also includes assigning the voice to the virtual agent during communications with the customer.
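The three steps in this abstract (receive customer information, select a voice, assign it to the agent) can be sketched as below. This is a minimal illustration under stated assumptions: the `Customer` fields, the voice catalog, the voice identifiers, and the `select_voice` and `VirtualAgent` names are all hypothetical, and the abstract does not say which customer attributes drive the selection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Customer:
    customer_id: str
    language: str
    region: str


# Hypothetical voice catalog keyed by (language, region).
VOICE_CATALOG = {
    ("en", "US"): "voice_en_us_a",
    ("en", "GB"): "voice_en_gb_a",
    ("es", "MX"): "voice_es_mx_a",
}
DEFAULT_VOICE = "voice_en_us_a"


def select_voice(customer: Customer) -> str:
    """Select a voice from information about the customer."""
    return VOICE_CATALOG.get((customer.language, customer.region), DEFAULT_VOICE)


class VirtualAgent:
    def __init__(self) -> None:
        self.voice: Optional[str] = None

    def start_call(self, customer: Customer) -> str:
        """Assign the selected voice for the duration of the call."""
        self.voice = select_voice(customer)
        return self.voice
```

A lookup table with a default keeps the selection deterministic; a real system might instead rank candidate voices using richer customer data.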