G10L2013/021

Synthesis of speech from text in a voice of a target speaker using neural networks

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
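The data flow in this abstract (reference audio to speaker vector, then text plus speaker vector to a spectrogram-like output) can be sketched as below. This is only an illustrative sketch: the abstract describes trained neural networks, whereas `speaker_encoder` and `spectrogram_generator` here are hypothetical toy stand-ins (frame averaging and per-character frames) chosen to show how the pieces connect, not how the claimed engines work.

```python
import math
from typing import List, Sequence


def speaker_encoder(audio_frames: Sequence[Sequence[float]]) -> List[float]:
    """Toy stand-in for the speaker encoder engine (assumption, not a trained model).

    Averages the feature frames and L2-normalizes the result so that every
    speaker is represented by a fixed-size vector.
    """
    dim = len(audio_frames[0])
    mean = [sum(frame[i] for frame in audio_frames) / len(audio_frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid division by zero
    return [x / norm for x in mean]


def spectrogram_generator(text: str, speaker_vector: Sequence[float]) -> List[List[float]]:
    """Toy stand-in for the spectrogram generation engine (assumption).

    Emits one frame per character and conditions it on the speaker by
    concatenating the speaker vector onto each frame.
    """
    return [[ord(ch) / 255.0, *speaker_vector] for ch in text]


def synthesize(audio_frames: Sequence[Sequence[float]], text: str) -> List[List[float]]:
    """Chain the two stages in the order the abstract lists them."""
    speaker_vector = speaker_encoder(audio_frames)
    return spectrogram_generator(text, speaker_vector)
```

Calling `synthesize(reference_frames, "hi")` returns two frames, each carrying the same speaker vector, which mirrors the idea that the speaker identity is computed once from the reference audio and then reused for any input text.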

ENHANCED SPOKEN DIALOGUE MODIFICATION

A method for modifying recorded audio of a video title may include: identifying a video title available for presentation; identifying a portion of audio data, of the video title, representing dialogue of an actor of the video title; identifying a first utterance of the dialogue to be replaced in the video title; selecting, based on a replaceable amount of the audio data, a second utterance to replace the first utterance in the portion of the audio data; generating, based on a voice profile of the actor, first speech signals representing the actor uttering the second utterance; removing second speech signals representing the actor uttering the first utterance from the dialogue; adding the first speech signals into the dialogue using at least a portion of the replaceable amount; and generating a modified version of the video title comprising the dialogue with the first speech signals added.
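The selection step, choosing a second utterance based on the replaceable amount of audio, might look like the sketch below. The abstract does not say how candidates are scored, so this sketch assumes a hypothetical rule: each candidate comes with an estimated spoken duration, and the longest candidate that still fits the replaceable window is picked. The function name `select_replacement` and the example durations are illustrative only.

```python
from typing import Dict, Optional


def select_replacement(candidates: Dict[str, float], replaceable_seconds: float) -> Optional[str]:
    """Pick a replacement utterance that fits the replaceable audio window.

    `candidates` maps each candidate utterance to an estimated duration in
    seconds (hypothetical input). Returns the longest candidate that does not
    exceed `replaceable_seconds`, or None when nothing fits.
    """
    fitting = {utt: dur for utt, dur in candidates.items() if dur <= replaceable_seconds}
    if not fitting:
        return None
    # Assumed preference: fill as much of the window as possible.
    return max(fitting, key=fitting.get)
```

For example, with candidates `{"darn": 0.30, "goodness": 0.55, "oh my": 0.45}` and a 0.5 second window, the function returns `"oh my"`, since `"goodness"` would overrun the window.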

MULTIFUNCTIONAL AUDIO SIGNAL GENERATION APPARATUS
20170372711 · 2017-12-28 ·

A sample counter in each channel performs a counting operation at a given rate. Independently for each channel, the rate and an initial value for the counter are set, and the start and stop of the counting operation are controlled, so that a partial portion of an original waveform, corresponding to the count range from the set initial value to a count stop point, is reproduced in the channel. A control section sets the initial values in the individual channels of a set selected from among the channels, such that sample values at different sample positions of the original waveform are retrieved simultaneously in the individual channels of the set, and controls an overlap adder to add up the retrieved sample values, so that the output is an audio waveform signal in which a plurality of partial portions of the original waveform partially overlap each other.
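The per-channel counters and the overlap adder can be simulated tick by tick, as in the sketch below. It makes several assumptions that are not in the abstract: the names `Channel` and `render`, a `start_tick` field to model when a channel's counting begins, truncation of fractional counter values to an integer sample index, and plain summation with no crossfade window.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Channel:
    initial: float       # counter's initial value (sample position in the waveform)
    rate: float          # counter increment per tick
    stop: float          # count stop point (exclusive)
    start_tick: int = 0  # hypothetical: tick at which counting starts


def render(waveform: Sequence[float], channels: List[Channel], ticks: int) -> List[float]:
    """Run all channel counters in lockstep and sum the retrieved samples."""
    counters = [ch.initial for ch in channels]
    out = []
    for tick in range(ticks):
        total = 0.0
        for i, ch in enumerate(channels):
            pos = counters[i]
            # Skip channels that have not started or have reached their stop point.
            if tick < ch.start_tick or pos >= ch.stop or int(pos) >= len(waveform):
                continue
            total += waveform[int(pos)]  # sample retrieved by this channel
            counters[i] = pos + ch.rate  # advance the counter at the channel's rate
        out.append(total)                # overlap adder output for this tick
    return out
```

With `waveform = [1, 2, 3, 4, 5, 6]`, one channel reading positions 0 to 3 from tick 0 and a second reading positions 2 to 5 from tick 2, the two partial portions overlap during ticks 2 and 3, where the adder outputs the sum of both channels' samples.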

METHOD AND DEVICE FOR OPTIMIZING SPEECH SYNTHESIS SYSTEM
20170206886 · 2017-07-20 ·

The present invention provides a method and a device for optimizing a speech synthesis system. The method comprises: receiving speech synthesis requests containing text messages; determining the load level of the speech synthesis system when the speech synthesis requests are received; and selecting speech synthesis paths corresponding to the load level and synthesizing the text into speech according to the selected speech synthesis paths.
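Mapping a load level to a synthesis path could be as simple as the threshold lookup sketched below. The abstract names neither the paths nor the load metric, so the path names (`"high_quality"`, `"balanced"`, `"fast"`), the thresholds, and the idea of expressing load as a fraction between 0 and 1 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical (upper load bound, path name) pairs, in ascending order of load.
PATHS: List[Tuple[float, str]] = [
    (0.5, "high_quality"),  # light load: spend more compute per request
    (0.8, "balanced"),      # moderate load
    (1.0, "fast"),          # heavy load: cheapest path
]


def select_path(load: float) -> str:
    """Return the synthesis path for the current load level (0.0 to 1.0)."""
    for upper_bound, name in PATHS:
        if load <= upper_bound:
            return name
    # Loads above the last bound fall back to the cheapest path.
    return PATHS[-1][1]
```

A request arriving at 30% load would take the `"high_quality"` path, while one arriving at 90% load would be routed to `"fast"`, trading output quality for throughput as load rises.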
