Patent classifications
G10L13/07
Learnable speed control of speech synthesis
A method, computer program, and computer system are provided for synthesizing speech at one or more speeds. A context associated with one or more phonemes corresponding to a speaking voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a voice sample corresponding to the speaking voice is synthesized using the generated mel-spectrogram features.
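The abstract's pipeline (encode phoneme contexts, align them to target frames, recursively generate mel frames) can be illustrated with a minimal numpy sketch. The duration scaling by a speed factor, the stand-in decoder weights, and the toy recursion are all illustrative assumptions, not the patented model:

```python
import numpy as np

def align_phonemes(encodings, durations, speed=1.0):
    """Upsample phoneme encodings to target acoustic frames.

    Each phoneme encoding is repeated for its predicted duration,
    scaled by the speed factor (speed > 1.0 -> faster speech,
    fewer frames per phoneme)."""
    frames = []
    for enc, dur in zip(encodings, durations):
        n_frames = max(1, int(round(dur / speed)))
        frames.extend([enc] * n_frames)
    return np.stack(frames)

def generate_mels(aligned, n_mels=80):
    """Recursively generate mel-spectrogram features: each frame is
    conditioned on its aligned encoding and the previous mel frame."""
    prev = np.zeros(n_mels)
    W = np.full((aligned.shape[1], n_mels), 0.01)  # stand-in decoder weights
    mels = []
    for enc in aligned:
        frame = np.tanh(enc @ W + 0.5 * prev)      # toy autoregressive step
        mels.append(frame)
        prev = frame
    return np.stack(mels)
```

With three 16-dimensional phoneme encodings and durations [4, 6, 2] at speed 2.0, the alignment yields 2 + 3 + 1 = 6 frames, from which 6 mel frames are generated recursively.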
SPEECH PROCESSING DEVICE, SPEECH PROCESSING METHOD, AND RECORDING MEDIUM
A speech processing device according to an aspect of the present invention examines the precision and quality of each piece of data stored in a database so that it can generate highly stable synthesized speech close to a human voice.
A speech processing device according to an aspect of the present invention includes a first storing means for storing an original-speech F0 pattern, being an F0 pattern extracted from recorded speech, together with first determination information associated with the original-speech F0 pattern, and a first determining means for determining whether or not to reproduce the original-speech F0 pattern in accordance with the first determination information.
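The storing and determining means described above can be sketched as a small store that pairs each original-speech F0 pattern with its determination information. The phrase key, the reliability score, and the fixed threshold are hypothetical choices for illustration; the patent does not specify what the determination information contains:

```python
from dataclasses import dataclass

@dataclass
class StoredF0Pattern:
    contour: list        # F0 values (Hz) extracted from recorded speech
    reliability: float   # determination information, e.g. extraction confidence

class F0PatternStore:
    """First storing means: original-speech F0 patterns, each paired
    with its associated determination information."""
    def __init__(self, threshold=0.8):
        self.patterns = {}
        self.threshold = threshold

    def store(self, phrase, contour, reliability):
        self.patterns[phrase] = StoredF0Pattern(contour, reliability)

    def should_reproduce(self, phrase):
        """First determining means: reproduce the original-speech F0
        pattern only when its determination information clears the
        threshold; otherwise fall back to a generated pattern."""
        p = self.patterns.get(phrase)
        return p is not None and p.reliability >= self.threshold
```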
UNIT-SELECTION TEXT-TO-SPEECH SYNTHESIS BASED ON PREDICTED CONCATENATION PARAMETERS
Systems and processes for performing unit-selection text-to-speech synthesis are provided. In an example process, text to be converted to speech is received. The text is represented as a sequence of target units. A plurality of candidate speech segments corresponding to the sequence of target units are selected. Predicted statistical parameters of acoustic features associated with the sequence of target units are determined. The predicted statistical parameters of acoustic features are used to determine target costs and concatenation costs associated with the plurality of candidate speech segments. Based on a combined cost determined from the target costs and concatenation costs, a subset of candidate speech segments is selected from the plurality of candidate speech segments. Speech corresponding to the received text is generated using the subset of candidate speech segments.
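Selecting candidate segments by a combined target-plus-concatenation cost is classically done with a Viterbi search over the candidate lattice. The sketch below assumes the costs are already computed (in the abstract they come from predicted statistical parameters of acoustic features) and finds the minimum-combined-cost path:

```python
import numpy as np

def select_units(target_costs, concat_costs):
    """Viterbi search over candidate speech segments.

    target_costs[t][i]    : target cost of candidate i for target unit t
    concat_costs[t][i][j] : cost of concatenating candidate i (unit t)
                            with candidate j (unit t + 1)
    Returns the candidate index chosen for each target unit that
    minimises the combined target + concatenation cost."""
    T = len(target_costs)
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for t in range(1, T):
        total = (best[-1][:, None]
                 + np.asarray(concat_costs[t - 1], dtype=float)
                 + np.asarray(target_costs[t], dtype=float)[None, :])
        back.append(total.argmin(axis=0))   # best predecessor per candidate
        best.append(total.min(axis=0))
    path = [int(best[-1].argmin())]
    for bp in reversed(back):               # trace the best path backwards
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

For two target units with candidate target costs [1, 5] and [2, 2] and a concatenation cost matrix [[0, 10], [10, 0]], the path (0, 0) wins with combined cost 1 + 0 + 2 = 3.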
METHODS AND SYSTEMS FOR SYNTHESISING SPEECH FROM TEXT
A method for synthesising speech from text includes receiving text and encoding, by way of an encoder module, the received text. The method further includes determining, by way of an attention module, a context vector from the encoding of the received text, wherein determining the context vector comprises at least one of: applying a threshold function to an attention vector and accumulating the thresholded attention vector, or applying an activation function to the attention vector and accumulating the activated attention vector. The method further includes determining speech data from the context vector.
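The two alternatives in the abstract (threshold-then-accumulate, or activate-then-accumulate) can be sketched as follows. The threshold value and the choice of sigmoid as the activation are illustrative assumptions; the abstract does not name specific functions:

```python
import numpy as np

def context_vector(attention, encodings, mode="threshold", tau=0.1):
    """Compute a context vector from attention weights over encoder states.

    mode="threshold": zero out attention weights below tau, then
                      accumulate the thresholded weights over encodings.
    mode="activate":  pass the attention weights through an activation
                      (sigmoid here), then accumulate."""
    attention = np.asarray(attention, dtype=float)
    if mode == "threshold":
        weights = np.where(attention >= tau, attention, 0.0)
    else:
        weights = 1.0 / (1.0 + np.exp(-attention))  # sigmoid activation
    # accumulate: weighted sum of the encoder states
    return weights @ np.asarray(encodings, dtype=float)
```

With attention [0.05, 0.9, 0.05] over three one-hot encoder states and tau = 0.1, the two low weights are zeroed and the context vector is 0.9 times the middle state.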
Estimating Clean Speech Features Using Manifold Modeling
The technology described in this document can be embodied in a computer-implemented method that includes receiving, at one or more processing devices, a portion of an input signal representing noisy speech, and extracting, from the portion of the input signal, one or more frequency domain features of the noisy speech. The method also includes generating a set of projected features by projecting each of the one or more frequency domain features on a manifold that represents a model of frequency domain features for clean speech. The method further includes using the set of projected features for at least one of: a) generating synthesized speech that represents a noise-reduced version of the noisy speech, b) performing speaker recognition, or c) performing speech recognition.
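Projecting noisy features onto a clean-speech manifold can be approximated in the simplest case by snapping each noisy feature vector to its nearest clean-speech exemplar. This nearest-neighbour stand-in is a deliberate simplification of whatever manifold model the patent actually uses:

```python
import numpy as np

def project_to_manifold(noisy_features, clean_exemplars):
    """Project each noisy frequency-domain feature vector onto a model
    of the clean-speech manifold, approximated here by the nearest
    clean-speech exemplar (a deliberately simple manifold model)."""
    noisy = np.atleast_2d(np.asarray(noisy_features, dtype=float))
    clean = np.asarray(clean_exemplars, dtype=float)
    # pairwise squared distances, shape (n_noisy, n_clean)
    d2 = ((noisy[:, None, :] - clean[None, :, :]) ** 2).sum(axis=-1)
    return clean[d2.argmin(axis=1)]
```

The projected features can then feed any of the three downstream uses listed in the abstract: resynthesis of a noise-reduced signal, speaker recognition, or speech recognition.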
Methods, systems, and media for seamless audio melding between songs in a playlist
In accordance with some embodiments of the disclosed subject matter, mechanisms for seamless audio melding between audio items in a playlist are provided. In some embodiments, a method for transitioning between audio items in playlists is provided, comprising: identifying a sequence of audio items in a playlist of audio items, wherein the sequence of audio items includes a first audio item and a second audio item that is to be played subsequent to the first audio item; and modifying an end portion of the first audio item and a beginning portion of the second audio item, where the end portion of the first audio item and the beginning portion of the second audio item are to be played concurrently to transition between the first audio item and the second audio item, wherein the end portion of the first audio item and the beginning portion of the second audio item have an overlap duration, and wherein modifying the end portion of the first audio item and the beginning portion of the second audio item comprises: generating a first spectrogram corresponding to the end portion of the first audio item and a second spectrogram corresponding to the beginning portion of the second audio item; identifying, for each frequency band in a series of frequency bands, a window over which the first spectrogram within the end portion of the first audio item and the second spectrogram within the beginning portion of the second audio item have a particular cross-correlation; modifying, for each frequency band in the series of frequency bands, the end portion of the first spectrogram and the beginning portion of the second spectrogram such that amplitudes of frequencies within the frequency band decrease within the first spectrogram over the end portion of the first spectrogram and that amplitudes of frequencies within the frequency band increase within the second spectrogram over the beginning portion of the second spectrogram; and generating a modified version of the first audio item 
that includes the modified end portion of the first audio item based on the modified end portion of the first spectrogram; and generating a modified version of the second audio item that includes the modified beginning portion of the second audio item based on the modified beginning portion of the second spectrogram.
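The core melding step (per-band fade-out of the first spectrogram and fade-in of the second over the overlap) can be sketched with a linear ramp. The linear ramp is a simplifying assumption standing in for the per-band, cross-correlation-aligned windows the abstract describes:

```python
import numpy as np

def meld_spectrograms(end_spec, begin_spec):
    """Cross-fade two magnitude spectrograms over their overlap.

    end_spec, begin_spec: arrays of shape (n_bands, n_frames) covering
    the end portion of the first audio item and the beginning portion
    of the second.  Each frequency band's amplitudes decrease over the
    first spectrogram and increase over the second, here with a simple
    linear ramp applied uniformly to every band."""
    n_frames = end_spec.shape[1]
    ramp = np.linspace(1.0, 0.0, n_frames)          # fade-out weights
    faded_end = end_spec * ramp[None, :]
    faded_begin = begin_spec * (1.0 - ramp)[None, :]
    return faded_end, faded_begin
```

Because the two ramps sum to one at every frame, playing the modified portions concurrently keeps the total per-band energy roughly constant through the transition.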
Methods and apparatus for voice-enabling a web application
Methods and apparatus for voice-enabling a web application, wherein the web application includes one or more web pages rendered by a web browser on a computer. At least one information source external to the web application is queried to determine whether information describing a set of one or more supported voice interactions for the web application is available, and in response to determining that the information is available, the information is retrieved from the at least one information source. Voice input for the web application is then enabled based on the retrieved information.