Patent classifications
G10L13/07
Voice synthesis apparatus and voice synthesis method utilizing diphones or triphones and machine learning
A voice synthesis method includes: sequentially acquiring voice units comprising at least one of a diphone or a triphone in accordance with synthesis information for synthesizing voices; generating statistical spectral envelopes using a statistical model built by machine learning in accordance with the synthesis information for synthesizing the voices; and concatenating the sequentially acquired voice units and modifying a frequency spectral envelope of each voice unit in accordance with the generated statistical spectral envelope, thereby synthesizing a voice signal based on the concatenated voice units having the modified frequency spectra.
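The envelope-modification step can be illustrated with a minimal sketch. It assumes per-bin positive amplitude envelopes and a log-domain interpolation controlled by a `weight` parameter; the function names, the interpolation rule, and the weight are all hypothetical and are not taken from the patent.

```python
import math


def blend_envelopes(unit_env, stat_env, weight):
    """Interpolate two spectral envelopes in the log-amplitude domain.

    weight = 0.0 keeps the unit's own envelope, weight = 1.0 adopts the
    statistical envelope. All amplitudes are assumed to be positive.
    """
    return [
        math.exp((1.0 - weight) * math.log(u) + weight * math.log(s))
        for u, s in zip(unit_env, stat_env)
    ]


def modify_unit_spectrum(spectrum, unit_env, stat_env, weight=0.5):
    """Rescale each frequency bin so the unit's envelope moves toward stat_env.

    Each bin is multiplied by (target envelope / original envelope), which
    changes the envelope while leaving the fine spectral structure intact.
    """
    target_env = blend_envelopes(unit_env, stat_env, weight)
    return [x * t / u for x, t, u in zip(spectrum, target_env, unit_env)]
```

In this sketch, applying the same function to each concatenated unit with a shared statistical envelope is what would smooth timbre differences at unit boundaries.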
METHODS, SYSTEMS, AND MEDIA FOR SEAMLESS AUDIO MELDING BETWEEN SONGS IN A PLAYLIST
In accordance with some embodiments of the disclosed subject matter, mechanisms for seamless audio melding between audio items in a playlist are provided. In some embodiments, a method for transitioning between audio items in playlists is provided, comprising: identifying a sequence of audio items in a playlist of audio items, wherein the sequence of audio items includes a first audio item and a second audio item that is to be played subsequent to the first audio item; and modifying an end portion of the first audio item and a beginning portion of the second audio item, where the end portion of the first audio item and the beginning portion of the second audio item are to be played concurrently to transition between the first audio item and the second audio item, wherein the end portion of the first audio item and the beginning portion of the second audio item have an overlap duration, and wherein modifying the end portion of the first audio item and the beginning portion of the second audio item comprises: generating a first spectrogram corresponding to the end portion of the first audio item and a second spectrogram corresponding to the beginning portion of the second audio item; identifying, for each frequency band in a series of frequency bands, a window over which the first spectrogram within the end portion of the first audio item and the second spectrogram within the beginning portion of the second audio item have a particular cross-correlation; modifying, for each frequency band in the series of frequency bands, the end portion of the first spectrogram and the beginning portion of the second spectrogram such that amplitudes of frequencies within the frequency band decrease within the first spectrogram over the end portion of the first spectrogram and that amplitudes of frequencies within the frequency band increase within the second spectrogram over the beginning portion of the second spectrogram; and generating a modified version of the first audio item that includes the modified end portion of the first audio item based on the modified end portion of the first spectrogram and generating a modified version of the second audio item that includes the modified beginning portion of the second audio item based on the modified beginning portion of the second spectrogram.
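A rough sketch of the per-band crossfade follows. It assumes band-major spectrograms (`spec[band]` is a list of magnitudes over the overlapping frames), interprets "a particular cross-correlation" as the highest Pearson correlation over a fixed-width window, and uses linear gain ramps. The function names and all three choices are illustrative assumptions, not details from the abstract.

```python
def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def best_window(band_a, band_b, width):
    """Start frame of the fixed-width window where the two bands correlate most."""
    starts = range(len(band_a) - width + 1)
    return max(
        starts,
        key=lambda s: correlation(band_a[s:s + width], band_b[s:s + width]),
    )


def crossfade_band(band_a, band_b, start, width):
    """Fade band_a out and band_b in across frames [start, start + width)."""
    out_a, out_b = [], []
    for t, (a, b) in enumerate(zip(band_a, band_b)):
        if t < start:
            gain = 0.0          # before the window: only the first item is heard
        elif t >= start + width:
            gain = 1.0          # after the window: only the second item is heard
        else:
            gain = (t - start) / width
        out_a.append(a * (1.0 - gain))
        out_b.append(b * gain)
    return out_a, out_b


def meld(spec_a, spec_b, width):
    """Apply an independently placed crossfade to every frequency band."""
    melded_a, melded_b = [], []
    for band_a, band_b in zip(spec_a, spec_b):
        start = best_window(band_a, band_b, width)
        out_a, out_b = crossfade_band(band_a, band_b, start, width)
        melded_a.append(out_a)
        melded_b.append(out_b)
    return melded_a, melded_b
```

Converting the modified spectrograms back to audio (for example with an inverse short-time Fourier transform) is outside the scope of this sketch.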
Learnable speed control for speech synthesis
A method, computer program, and computer system are provided for synthesizing speech at one or more speeds. A context associated with one or more phonemes corresponding to a speaking voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a voice sample corresponding to the speaking voice is synthesized using the generated mel-spectrogram features.
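One common way to align phonemes to acoustic frames with a speed control is duration-based expansion, sketched below. The abstract does not say that this is the mechanism used; the per-phoneme `durations` input, the `speed` parameter, and the function name are assumptions made for illustration.

```python
def expand_phonemes(phonemes, durations, speed=1.0):
    """Repeat each phoneme once per acoustic frame it is assumed to cover.

    durations holds a frame count per phoneme at normal speed. A speed above
    1.0 shortens every phoneme, a speed below 1.0 lengthens it. Each phoneme
    keeps at least one frame so that none is dropped.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        frame_count = max(1, round(duration / speed))
        frames.extend([phoneme] * frame_count)
    return frames
```

The resulting frame-level phoneme sequence is the kind of input a decoder could consume one frame at a time when generating mel-spectrogram features.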
SINGING VOICE CONVERSION
A method, computer program, and computer system are provided for converting a first singing voice associated with a first speaker to a second singing voice associated with a second speaker. A context associated with one or more phonemes corresponding to the first singing voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a sample corresponding to the first singing voice is converted to a sample corresponding to the second singing voice using the generated mel-spectrogram features.
Systems and methods for generating synthesized speech responses to voice inputs by training a neural network model based on the voice input prosodic metrics and training voice inputs
The system provides a synthesized speech response to a voice input, based on the prosodic character of the voice input. The system receives the voice input and calculates at least one prosodic metric of the voice input. The at least one prosodic metric can be associated with a word, a phrase, a grouping thereof, or the entire voice input. The system also determines a response to the voice input, which may include the sequence of words that form the response. The system generates the synthesized speech response by determining prosodic characteristics based on the response and on the prosodic character of the voice input. The system outputs the synthesized speech response, which answers the voice input in a way that is more natural, more relevant, or both. The prosodic character of the voice input and/or response may include pitch, note, duration, prominence, timbre, rate, and rhythm, for example.
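As an illustration of calculating prosodic metrics over an entire voice input, the sketch below assumes the input has already been segmented into words with start time, end time, and an average pitch. The tuple layout, the metric names, and the function name are hypothetical.

```python
from statistics import mean


def prosodic_metrics(words):
    """Summarize pitch, rate, and duration for a whole utterance.

    words is a list of (text, start_s, end_s, pitch_hz) tuples in time order.
    """
    pitches = [pitch for _, _, _, pitch in words]
    durations = [end - start for _, start, end, _ in words]
    total_time = words[-1][2] - words[0][1]
    return {
        "mean_pitch_hz": mean(pitches),
        "pitch_range_hz": max(pitches) - min(pitches),
        "rate_words_per_s": len(words) / total_time,
        "mean_word_duration_s": mean(durations),
    }
```

A response generator could then, for instance, match its own speaking rate or pitch range to these values; how the metrics feed the neural network model is not specified in the abstract.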
SPEECH SYNTHESIS UNIT SELECTION
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting units for speech synthesis. One of the methods includes determining a sequence of text units that each represent a respective portion of text for speech synthesis; and determining multiple paths of speech units that each represent the sequence of text units by selecting a first speech unit that includes speech synthesis data representing a first text unit; selecting multiple second speech units including speech synthesis data representing a second text unit based on (i) a join cost to concatenate the second speech unit with a first speech unit and (ii) a target cost indicating a degree that the second speech unit corresponds to the second text unit; and defining paths from the selected first speech unit to each of the multiple second speech units to include in the multiple paths of speech units.
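The path-extension step can be sketched as a small beam search. The cost functions are passed in as callables, and the `beam` width, the `(cost, units)` path representation, and the function name are assumptions made for this example only.

```python
def extend_paths(paths, candidates, target_cost, join_cost, beam=2):
    """Extend every path with its `beam` cheapest candidate units.

    paths is a list of (total_cost, [units]) pairs. For each path, every
    candidate is scored by the join cost against the path's last unit plus
    the candidate's target cost. The extended paths are returned sorted by
    total cost, cheapest first.
    """
    extended = []
    for total_cost, units in paths:
        last_unit = units[-1]
        scored = sorted(
            ((join_cost(last_unit, unit) + target_cost(unit), unit)
             for unit in candidates),
            key=lambda pair: pair[0],
        )
        for step_cost, unit in scored[:beam]:
            extended.append((total_cost + step_cost, units + [unit]))
    return sorted(extended, key=lambda path: path[0])
```

Calling the function once per text unit, feeding each result back in as `paths`, keeps several alternative unit sequences alive instead of committing to a single greedy choice.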
Speaker retrieval device, speaker retrieval method, and computer program product
A speaker retrieval device includes a first converting unit, a receiving unit, and a searching unit. The first converting unit converts pre-registered acoustic models into score vectors, using an inverse transform model of a first conversion model that converts score vectors representing the features of voice quality into acoustic models, and registers each score vector in score management information in association with a speaker identifier. The receiving unit receives input of a score vector. The searching unit searches the score management information for the speaker identifiers whose score vectors are similar to the received score vector.
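The searching step can be illustrated with a nearest-neighbor lookup. The sketch assumes the score management information is a dictionary from speaker identifier to score vector and that similarity is measured by Euclidean distance; both are assumptions, and the inverse transform that produces the score vectors is not modeled here.

```python
import math


def search_speakers(score_table, query, top_k=2):
    """Return the top_k speaker identifiers whose score vectors are closest to query."""
    ranked = sorted(
        score_table,
        key=lambda speaker_id: math.dist(score_table[speaker_id], query),
    )
    return ranked[:top_k]
```

With a table such as `{"spk_a": [0.9, 0.1], "spk_b": [0.2, 0.8]}`, a query near `[1.0, 0.0]` would rank `spk_a` ahead of `spk_b`.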