Patent classifications
G10L13/0335
Speech synthesizer using artificial intelligence and method of operating the same
Disclosed herein is a speech synthesizer using artificial intelligence including a memory, a communication processor configured to receive utterance information of words uttered by a user from a terminal, and a processor configured to acquire a plurality of utterance intonation phrase (IP) ratios respectively corresponding to a plurality of words uttered by the user based on the utterance information, compare a plurality of IP ratio tables respectively corresponding to a plurality of voice actors with the plurality of utterance IP ratios, acquire a plurality of non-utterance IP ratios respectively corresponding to a plurality of unuttered words based on a result of comparison, and generate a personalized synthesized speech model based on the plurality of utterance IP ratios and the plurality of non-utterance IP ratios.
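The matching and fill-in steps described above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the patent does not give the comparison metric (mean absolute difference is used below), nor the function or table names (`closest_actor`, `fill_non_utterance_ratios`, the actor tables).

```python
# Hypothetical sketch of the voice-actor matching step; metric and
# names are invented, not taken from the patent.

def closest_actor(utterance_ratios, actor_tables):
    """Pick the actor whose IP-ratio table best matches the user's
    uttered-word IP ratios (mean absolute difference over shared words)."""
    def distance(table):
        shared = [w for w in utterance_ratios if w in table]
        return sum(abs(utterance_ratios[w] - table[w]) for w in shared) / len(shared)
    return min(actor_tables, key=lambda name: distance(actor_tables[name]))

def fill_non_utterance_ratios(utterance_ratios, actor_tables, all_words):
    """Borrow IP ratios for unuttered words from the closest actor's table."""
    table = actor_tables[closest_actor(utterance_ratios, actor_tables)]
    return {w: table[w] for w in all_words if w not in utterance_ratios}

user = {"hello": 0.8, "world": 0.2}
actors = {
    "actor_a": {"hello": 0.7, "world": 0.3, "sunny": 0.5},
    "actor_b": {"hello": 0.1, "world": 0.9, "sunny": 0.4},
}
print(closest_actor(user, actors))  # actor_a
print(fill_non_utterance_ratios(user, actors, ["hello", "world", "sunny"]))  # {'sunny': 0.5}
```

The personalized model would then be built from the combined utterance and non-utterance ratios.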
Dynamic speech output configuration
Techniques are described for providing dynamically configured speech output, through which text data from a message is presented as speech output through a text-to-speech (TTS) engine that employs a voice profile to provide a machine-generated voice that approximates that of the sender of the message. The sender can also indicate the type of voice they would prefer the TTS engine use to render their text to a recipient, and the voice to be used can be specified in a sender's user profile, as a preference or attribute of the sending user. In some examples, the voice profile to be used can be indicated as metadata included in the message. A voice profile can specify voice attributes such as the tone, pitch, register, timbre, pacing, gender, accent, and so forth. A voice profile can be generated through a machine learning (ML) process.
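The profile-resolution behaviour described above might look like the following sketch. The field names and the precedence order (message metadata, then the sender's stored preference, then a default) are assumptions; the patent only says both sources can indicate the voice.

```python
from dataclasses import dataclass

# Illustrative only: field names and resolution order are assumed.

@dataclass
class VoiceProfile:
    pitch: str = "medium"
    pacing: str = "normal"
    accent: str = "neutral"

def resolve_profile(message_metadata, sender_profile, default=None):
    """Prefer a profile carried in the message metadata, then the
    sender's stored preference, then a system default."""
    return (message_metadata.get("voice_profile")
            or sender_profile.get("preferred_voice")
            or default
            or VoiceProfile())

p = resolve_profile({}, {"preferred_voice": VoiceProfile(pitch="low")})
print(p.pitch)  # low
```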
SOUND SIGNAL GENERATION METHOD, GENERATIVE MODEL TRAINING METHOD, SOUND SIGNAL GENERATION SYSTEM, AND RECORDING MEDIUM
A computer-implemented sound signal generation method includes: obtaining a first sound source spectrum of a sound signal to be generated; obtaining a first spectral envelope of the sound signal; and estimating fragment data representative of samples of the sound signal based on the obtained first sound source spectrum and the obtained first spectral envelope.
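The combination of a sound-source spectrum and a spectral envelope is the classic source-filter idea; a minimal sketch of that step is below. The patent's estimator is learned, so a plain inverse FFT stands in for it here, and the toy spectra are invented.

```python
import numpy as np

# Minimal source-filter sketch: shape a source spectrum with a spectral
# envelope, then invert to time-domain samples. The learned estimator
# from the patent is replaced by a plain inverse real FFT.

def synthesize_fragment(source_spectrum, spectral_envelope):
    """Multiply source by envelope and return real-valued samples."""
    shaped = source_spectrum * spectral_envelope  # filter the source
    return np.fft.irfft(shaped)                   # back to samples

n = 16
source = np.ones(n // 2 + 1)                  # flat, pulse-like source
envelope = np.linspace(1.0, 0.0, n // 2 + 1)  # low-pass envelope
fragment = synthesize_fragment(source, envelope)
print(fragment.shape)  # (16,)
```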
INFORMATION PROCESSING METHOD, ESTIMATION MODEL CONSTRUCTION METHOD, INFORMATION PROCESSING DEVICE, AND ESTIMATION MODEL CONSTRUCTING DEVICE
An information processing device includes a memory storing instructions, and a processor configured to implement the stored instructions to execute a plurality of tasks. The tasks include: a first generating task that generates a series of fluctuations of a target sound based on first control data of the target sound to be synthesized, using a first model trained to have an ability to estimate a series of fluctuations of the target sound based on first control data of the target sound, and a second generating task that generates a series of features of the target sound based on second control data of the target sound and the generated series of fluctuations of the target sound, using a second model trained to estimate a series of features of the target sound based on second control data of the target sound and a series of fluctuations of the target sound.
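The two-task data flow above can be sketched with the trained models replaced by trivial stand-in functions; only the pipeline shape (control data → fluctuations → features) is taken from the abstract, the arithmetic is invented.

```python
# Stand-ins for the two trained models; the point is the data flow.

def first_model(first_control):
    """Stand-in for the model estimating a fluctuation series."""
    return [c * 2 for c in first_control]

def second_model(second_control, fluctuations):
    """Stand-in for the model estimating a feature series."""
    return [c + f for c, f in zip(second_control, fluctuations)]

def generate_features(first_control, second_control):
    fluctuations = first_model(first_control)          # first generating task
    return second_model(second_control, fluctuations)  # second generating task

print(generate_features([1, 2], [10, 20]))  # [12, 24]
```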
Information processing apparatus that fades system utterance in response to interruption
An apparatus and method control the output of a system utterance upon the occurrence of a barge-in utterance, enabling smooth interaction between a user and the system. Fade processing lowers at least one of the volume, the speech rate, or the pitch (voice pitch) of the system utterance, starting from the time the barge-in utterance, i.e., a user interruption made while the system utterance is being output, begins. Even after the fade processing completes, the output state reached at its completion is maintained. If the system utterance level falls to or below a predefined threshold during the fade processing, the system utterance is displayed on a display unit. One of stopping, continuing, or rephrasing is executed based on the intention of the barge-in utterance and on whether an important word is included in the system utterance.
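The fade-and-hold behaviour, and the switch to on-screen display below a threshold, can be sketched as follows. The linear fade curve, the floor value, and the threshold are illustrative choices; the patent leaves them open.

```python
# Hedged sketch of the barge-in fade behaviour; curve and constants
# are assumptions.

def fade_level(t, fade_start, fade_duration, floor=0.2):
    """Output level at time t: 1.0 before the barge-in, fading linearly
    over fade_duration, then held at the level reached at completion."""
    if t <= fade_start:
        return 1.0
    progress = (t - fade_start) / fade_duration
    if progress >= 1.0:
        return floor                       # post-fade state is maintained
    return 1.0 - (1.0 - floor) * progress

def should_display(level, threshold=0.5):
    """Show the system utterance as on-screen text once its level is at
    or below the threshold."""
    return level <= threshold

print(fade_level(0.0, fade_start=1.0, fade_duration=2.0))  # 1.0
print(fade_level(5.0, fade_start=1.0, fade_duration=2.0))  # 0.2
print(should_display(fade_level(5.0, fade_start=1.0, fade_duration=2.0)))  # True
```

The same level could drive volume, speech rate, or pitch, per the abstract's "at least one of" wording.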
Voice processing method for processing voice signal representing voice, voice processing device for processing voice signal representing voice, and recording medium storing program for processing voice signal representing voice
A voice processing method realized by a computer includes compressing forward a first steady period of a plurality of steady periods in a voice signal representing voice, and extending forward a transition period between the first steady period and a second steady period of the plurality of steady periods in the voice signal. Each of the plurality of steady periods is a period in which acoustic characteristics are temporally stable. The second steady period is a period immediately after the first steady period and has a pitch that is different from a pitch of the first steady period.
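A toy illustration of the retiming idea, operating on frame labels rather than waveforms: compressing a steady period drops frames from its tail, and extending the transition repeats its leading frame, so the pitch change is spread over a longer span. The frame representation and `shift` parameter are assumptions.

```python
# Toy retiming on frame sequences; real processing would resample audio.

def retime(steady1, transition, steady2, shift):
    """Shorten the first steady period by `shift` frames and lengthen
    the transition by the same amount, keeping total length fixed."""
    compressed = steady1[:len(steady1) - shift]    # compress forward
    extended = transition[:1] * shift + transition # extend forward
    return compressed + extended + steady2

out = retime(["A"] * 4, ["T"] * 2, ["B"] * 3, shift=2)
print(out)  # ['A', 'A', 'T', 'T', 'T', 'T', 'B', 'B', 'B']
```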
METHOD FOR CHANGING SPEED AND PITCH OF SPEECH AND SPEECH SYNTHESIS SYSTEM
This application relates to a method of synthesizing speech whose speed and pitch are changed. In one aspect, a spectrogram is generated by performing a short-time Fourier transform on a first speech signal using a first hop length and a first window length, and speech signals of sections having a second window length are extracted from the spectrogram at intervals of a second hop length. The ratio of the first hop length to the second hop length is set equal to the playback rate, and the ratio of the first window length to the second window length is set equal to the pitch change rate, thereby generating a second speech signal whose speed and pitch are changed.
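The hop/window arithmetic stated above can be shown directly: given a desired playback rate and pitch change rate, derive the synthesis hop and window lengths from the analysis ones. Only the ratios come from the abstract; the example values (256-sample hop, 1024-sample window) are arbitrary.

```python
# Parameter arithmetic only: hop1/hop2 == playback_rate and
# win1/win2 == pitch_rate, per the ratios in the abstract.

def synthesis_params(hop1, win1, playback_rate, pitch_rate):
    """Derive the second hop and window lengths from the first."""
    hop2 = round(hop1 / playback_rate)
    win2 = round(win1 / pitch_rate)
    return hop2, win2

# 2x faster playback at the original pitch:
print(synthesis_params(256, 1024, playback_rate=2.0, pitch_rate=1.0))  # (128, 1024)
# Original speed, pitch raised by a factor of 2:
print(synthesis_params(256, 1024, playback_rate=1.0, pitch_rate=2.0))  # (256, 512)
```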
Text-to-Speech Adapted by Machine Learning
Machine-learned models take in vectors representing desired behaviors and generate voice vectors that provide the parameters for text-to-speech (TTS) synthesis. Models may be trained on behavior vectors that include user profile attributes, situational attributes, or semantic attributes. Situational attributes may include the age of people present, music that is playing, location, noise, and mood. Semantic attributes may include the presence of proper nouns, number of modifiers, emotional charge, and domain of discourse. TTS voice parameters may apply per utterance and per word, so as to enable contrastive emphasis.
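A stand-in for the learned behavior-to-voice mapping is sketched below as a fixed linear layer; the weights, bias, and attribute names are invented purely to show the vector-in, vector-out shape of the model.

```python
# Hypothetical linear stand-in for the trained model mapping a
# behavior vector to TTS voice parameters.

def voice_vector(behavior, weights, bias):
    """Linear map: one output voice parameter per weight row."""
    return [sum(w * x for w, x in zip(row, behavior)) + b
            for row, b in zip(weights, bias)]

# behavior = [noise_level, emotional_charge]   (assumed attributes)
behavior = [1.0, 0.0]
weights = [[0.5, 0.0],   # -> volume
           [0.0, 1.0]]   # -> pitch offset
bias = [1.0, 0.0]
print(voice_vector(behavior, weights, bias))  # [1.5, 0.0]
```

A real system would learn `weights` and `bias`, and could evaluate the map once per utterance or once per word.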
SOUND SYNTHESIZING METHOD AND PROGRAM
A sound synthesizing method according to one aspect of the present disclosure relates to a sound synthesizing method that is realized by a computer, including receiving musical score data and acoustic data via a user interface; and generating, based on a respective one of the musical score data and the acoustic data, acoustic features of a sound waveform having a desired timbre.
SOUND SYNTHESIS METHOD, SOUND SYNTHESIS APPARATUS, AND RECORDING MEDIUM STORING INSTRUCTIONS TO PERFORM SOUND SYNTHESIS METHOD
There is provided a sound synthesis apparatus. The apparatus comprises a transceiver configured to obtain a plurality of sound samples; and a processor, wherein the processor is configured to: preprocess each sound sample to convert each sound sample into a spectrogram; generate a plurality of latent codes by inputting the spectrogram of each sound sample to an encoder of an artificial neural network pre-trained to output a latent code that maximizes timbre information; generate one synthesized latent code by synthesizing the plurality of latent codes based on a weight preset for each sound sample; and generate a synthesized sound by inputting the synthesized latent code to a decoder of the pre-trained artificial neural network.
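The latent-code mixing step described above reduces to a weighted sum; a minimal sketch is below, with the encoder and decoder omitted and the latent codes and weights invented for illustration.

```python
# Sketch of the latent-code synthesis step only; the pre-trained
# encoder/decoder are omitted.

def synthesize_latent(latent_codes, weights):
    """Weighted sum of per-sample latent codes (one weight per sample)."""
    dim = len(latent_codes[0])
    return [sum(w * code[i] for w, code in zip(weights, latent_codes))
            for i in range(dim)]

codes = [[1.0, 0.0], [0.0, 1.0]]
print(synthesize_latent(codes, [0.75, 0.25]))  # [0.75, 0.25]
```

The resulting code would then be decoded into the synthesized sound.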