G10L2013/021

Synthesis of speech from text in a voice of a target speaker using neural networks

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
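The data flow in this abstract (reference audio to speaker vector, then text plus speaker vector to spectrogram) can be sketched as below. This is only an illustration of the wiring: `toy_speaker_encoder` and `toy_spectrogram_generator` are hypothetical stand-ins for the trained neural networks and are not taken from the patent.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Spectrogram = List[List[float]]


def toy_speaker_encoder(audio: Sequence[float], dim: int = 4) -> Vector:
    """Hypothetical stand-in for a trained speaker encoder.

    A real encoder would be a network trained to distinguish speakers;
    here we average chunks of the signal and normalize to unit length.
    """
    chunk = max(1, len(audio) // dim)
    raw = [sum(audio[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def toy_spectrogram_generator(
    text: str, speaker_vector: Vector, n_bins: int = 4
) -> Spectrogram:
    """Hypothetical stand-in: one frame per character, shifted by the speaker vector."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append(
            [base + speaker_vector[b % len(speaker_vector)] for b in range(n_bins)]
        )
    return frames


def synthesize(
    reference_audio: Sequence[float],
    text: str,
    encoder: Callable[[Sequence[float]], Vector] = toy_speaker_encoder,
    generator: Callable[[str, Vector], Spectrogram] = toy_spectrogram_generator,
) -> Spectrogram:
    # Step 1: derive a speaker vector from the target speaker's audio.
    speaker_vector = encoder(reference_audio)
    # Step 2: condition spectrogram generation on both text and speaker vector.
    return generator(text, speaker_vector)
```

The same text with a different reference recording yields a different spectrogram, which is the property the abstract relies on.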

Multifunctional audio signal generation apparatus
09792916 · 2017-10-17

A sample counter in each channel performs a counting operation at a given rate. Independently for each channel, the rate and an initial value for the counter are set, and the start and stop of the counting operation are controlled, so that a partial portion of an original waveform corresponding to the count range from the set initial value to a count stop point is reproduced in the channel. A control section sets the initial values in individual ones of a set of channels, selected from among the channels, such that sample values at different sample positions of the original waveform are simultaneously retrieved in individual ones of the set of channels, and controls an overlap adder to add up the retrieved sample values, so that sample values of an audio waveform signal in which a plurality of partial portions of the original waveform partially overlap each other are output.
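A minimal sketch of the per-channel counters and the overlap adder, assuming one output sample per tick and no windowing. The `Channel` class and `overlap_add` function are hypothetical names chosen for illustration, not the apparatus's actual design.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Channel:
    position: float  # counter value, initialized to the channel's initial value
    rate: float      # counter increment per output tick
    stop: float      # count stop point
    running: bool = True

    def tick(self, waveform: Sequence[float]) -> float:
        """Return the sample at the current count, then advance the counter."""
        if (
            not self.running
            or self.position >= self.stop
            or int(self.position) >= len(waveform)
        ):
            self.running = False
            return 0.0
        sample = waveform[int(self.position)]
        self.position += self.rate
        return sample


def overlap_add(
    waveform: Sequence[float], channels: List[Channel], n_ticks: int
) -> List[float]:
    # On each tick every channel reads a different position of the same
    # original waveform; the adder sums the retrieved samples.
    return [sum(ch.tick(waveform) for ch in channels) for _ in range(n_ticks)]
```

For example, two channels started at positions 0 and 4 of an eight-sample waveform reproduce the two halves on top of each other.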

TAILORED VOICE NAVIGATION ANALYSIS SYSTEM

An approach for tailoring voice navigation instruction output. A navigation audio tailor receives songs including instrumental segments and associated vocal segments. The navigation audio tailor identifies instrumental-only segments, where the instrumental-only segments mark time durations in which the instrumental segments are free of the associated vocal segments. The navigation audio tailor receives navigation instructions, where the navigation instructions are based on text to create voice navigation instructions. The navigation audio tailor determines navigation instruction output timing, where the output timing is associated with one of the instrumental-only segments to create tailored navigation instructions, and the navigation audio tailor outputs the tailored navigation instructions to an audio playback device, where the output combines at least one of the songs.
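The timing step can be sketched as interval arithmetic: subtract vocal intervals from instrumental intervals, then place the instruction in the first gap that fits. The function names, the interval representation (seconds as `(start, end)` tuples), and the first-fit rule are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def instrumental_only(
    instrumental: List[Interval], vocal: List[Interval]
) -> List[Interval]:
    """Return the parts of the instrumental intervals that contain no vocals."""
    result: List[Interval] = []
    for start, end in instrumental:
        cursor = start
        for v_start, v_end in sorted(vocal):
            if v_end <= cursor or v_start >= end:
                continue  # this vocal interval does not touch the remainder
            if v_start > cursor:
                result.append((cursor, v_start))
            cursor = max(cursor, v_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def schedule_instruction(
    gaps: List[Interval], earliest: float, duration: float
) -> Optional[float]:
    """Pick the first start time at or after `earliest` that fits inside a gap."""
    for start, end in sorted(gaps):
        begin = max(start, earliest)
        if begin + duration <= end:
            return begin
    return None
```

A real system would also need a fallback when no gap fits before the maneuver; this sketch simply returns `None` in that case.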

Synthesis of Speech from Text in a Voice of a Target Speaker Using Neural Networks

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

Synthesis of speech from text in a voice of a target speaker using neural networks

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

METHOD FOR PROVIDING GROUP CALL SERVICE, AND ELECTRONIC DEVICE SUPPORTING SAME

An electronic device includes a communication module and a processor operatively connected to the communication module. The processor is configured to: receive and store a first speech voice related to at least a first external device, and a second speech voice related to a second external device; if individual speech is detected, transmit the first speech voice or the second speech voice having a first playback speed to at least a first external device and a second external device; and, if simultaneous speech is detected, convert, into a second playback speed different from the first playback speed, at least a part of a synthesized voice in which at least first overlap speech of the first speech voice and at least second overlap speech of the second speech voice are successively connected, and transmit the synthesized voice to the at least first external device and the second external device.
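A rough sketch of the simultaneous-speech case, assuming each speech voice is described by a start and end time and that the second playback speed is faster than the first (the abstract only says it is different). `Speech`, `plan_playback`, and the speed values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (speaker, source start, source end, playback speed)
Clip = Tuple[str, float, float, float]


@dataclass
class Speech:
    speaker: str
    start: float
    end: float


def overlap(a: Speech, b: Speech) -> Optional[Tuple[float, float]]:
    """Return the time range in which both speakers talk, if any."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return (start, end) if start < end else None


def plan_playback(
    a: Speech, b: Speech, first_speed: float = 1.0, second_speed: float = 2.0
) -> List[Clip]:
    ov = overlap(a, b)
    if ov is None:
        # Individual speech: play each voice in order at the first speed.
        first, second = sorted((a, b), key=lambda s: s.start)
        return [
            (first.speaker, first.start, first.end, first_speed),
            (second.speaker, second.start, second.end, first_speed),
        ]
    # Simultaneous speech: connect the two overlapping parts successively
    # and play the result at the second speed.
    return [
        (a.speaker, ov[0], ov[1], second_speed),
        (b.speaker, ov[0], ov[1], second_speed),
    ]


def playback_duration(clips: List[Clip]) -> float:
    return sum((end - start) / speed for _, start, end, speed in clips)
```

With a second speed of 2.0, two overlapping two-second parts play back to back in the same two seconds the overlap originally occupied.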

Synthesis of Speech from Text in a Voice of a Target Speaker Using Neural Networks

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

SYNTHESIS OF SPEECH FROM TEXT IN A VOICE OF A TARGET SPEAKER USING NEURAL NETWORKS

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.

Multifunctional audio signal generation apparatus
10388290 · 2019-08-20

A sample counter in each channel performs counting operation at a given rate. Independently for each channel, the rate and an initial value for the counter are set, and start and stop of the counting operation of the counter are controlled, so that a partial portion of an original waveform corresponding to a count range from the set initial value to a count stop point is reproduced in the channel. A control section sets the initial values in individual ones of a set of channels, selected from among the channels, such that sample values at different sample positions of the original waveform are simultaneously retrieved in individual ones of the set of channels, and controls an overlap adder to add up the retrieved sample values, so that sample values of an audio waveform signal with a plurality of partial portions of the original waveform, partially overlapping each other are output.

Method and device for optimizing speech synthesis system

The present invention provides a method and a device for optimizing a speech synthesis system. The method comprises: receiving speech synthesis requests containing text messages; determining the load level of the speech synthesis system when the speech synthesis requests are received; and selecting speech synthesis paths corresponding to the load level and synthesizing the text into speech according to the speech synthesis paths.
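The load-based selection can be sketched as a lookup from load level to synthesis path. The three levels, the utilization thresholds (0.5 and 0.8), and the path functions below are assumptions for illustration; the abstract does not specify them.

```python
from typing import Callable, Dict


def high_quality_path(text: str) -> str:
    # Hypothetical placeholder for a slower, higher-quality synthesizer.
    return f"[high-quality audio for: {text}]"


def balanced_path(text: str) -> str:
    return f"[balanced audio for: {text}]"


def fast_path(text: str) -> str:
    # Hypothetical placeholder for a cheaper synthesizer used under heavy load.
    return f"[fast audio for: {text}]"


PATHS: Dict[str, Callable[[str], str]] = {
    "low": high_quality_path,
    "medium": balanced_path,
    "high": fast_path,
}


def load_level(active_requests: int, capacity: int) -> str:
    """Classify current load; thresholds are illustrative assumptions."""
    utilization = active_requests / capacity
    if utilization < 0.5:
        return "low"
    if utilization < 0.8:
        return "medium"
    return "high"


def synthesize(text: str, active_requests: int, capacity: int) -> str:
    # The load level is determined at the moment the request is received.
    level = load_level(active_requests, capacity)
    return PATHS[level](text)
```

Keeping the mapping in a table means a new load level or path can be added without touching the request handler.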