Patent classifications
G10L13/06
Method and apparatus for generating speech
A speech generation method and apparatus are disclosed. The speech generation method includes: obtaining, by a processor, a linguistic feature and a prosodic feature from an input text; determining, by the processor, a first candidate speech element through a cost calculation and a Viterbi search based on the linguistic feature and the prosodic feature; generating, at a speech element generator implemented at the processor, a second candidate speech element based on the linguistic feature or the prosodic feature and the first candidate speech element; and outputting, by the processor, an output speech by concatenating the second candidate speech element and a speech sequence determined through the Viterbi search.
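The cost calculation and Viterbi search described above follow the classic unit-selection pattern: each text position has several candidate units, and the search minimizes a target cost (fit to the linguistic/prosodic features) plus a concatenation cost (smoothness at each join). A minimal sketch, assuming toy feature vectors; the function names and cost definitions are illustrative, not from the patent:

```python
# Unit-selection Viterbi search sketch. candidates[i] holds the
# feature vectors of the candidate units for text position i;
# targets[i] is the desired linguistic/prosodic feature vector.

def target_cost(unit, target_feat):
    # squared distance between a candidate unit and the target features
    return sum((u - t) ** 2 for u, t in zip(unit, target_feat))

def concat_cost(prev_unit, unit):
    # smoothness penalty at the join between consecutive units
    return sum((p - u) ** 2 for p, u in zip(prev_unit, unit))

def viterbi_select(candidates, targets):
    """Return the index of the chosen unit at each position."""
    n = len(candidates)
    best = [target_cost(u, targets[0]) for u in candidates[0]]
    back = [[0] * len(c) for c in candidates]
    for i in range(1, n):
        new_best = []
        for j, u in enumerate(candidates[i]):
            costs = [best[k] + concat_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            back[i][j] = k
            new_best.append(costs[k] + target_cost(u, targets[i]))
        best = new_best
    # trace back the lowest-cost path of unit indices
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]
```

The second candidate speech element in the claim would then be generated to replace or augment units on this lowest-cost path before concatenation.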
Synthetic speech processing
A speech-processing system receives input data representing text. An input encoder processes the input data to determine first embedding data representing the text. A local attention encoder processes a subset of the first embedding data in accordance with a predicted size to determine second embedding data. An attention encoder processes the second embedding data to determine third embedding data. A decoder processes the third embedding data to determine audio data corresponding to the text.
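The encoder chain above (input encoder, local attention over a window of predicted size, full attention encoder, decoder) can be sketched with simple stand-ins. This is only a shape-level illustration under assumed, simplified "attention" operations; all function names and the averaging/softmax stand-ins are hypothetical, not the patent's learned layers:

```python
import numpy as np

def input_encoder(token_ids, dim=4):
    # first embedding data: one vector per input token
    rng = np.random.default_rng(0)
    table = rng.normal(size=(100, dim))
    return table[token_ids]

def local_attention_encoder(emb, window):
    # attend only over a sliding window of the predicted size
    return np.stack([
        emb[max(0, i - window + 1): i + 1].mean(axis=0)
        for i in range(len(emb))
    ])  # second embedding data

def attention_encoder(emb):
    # full-sequence attention stand-in (softmax over dot products)
    weights = np.exp(emb @ emb.T)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ emb  # third embedding data

def decoder(emb, samples_per_step=2):
    # map each embedding step to a few audio samples
    return np.tanh(emb).mean(axis=1).repeat(samples_per_step)

tokens = np.array([1, 5, 9, 3])
audio = decoder(attention_encoder(
    local_attention_encoder(input_encoder(tokens), window=2)))
```

The point of the local stage is that each position only sees a bounded window, which keeps the attention cost linear in sequence length before the full attention encoder runs.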
Automatic synthesis of translated speech using speaker-specific phonemes
An embodiment includes converting an original audio signal to an original text string, the original audio signal being from a recording of the original text string spoken by a specific person in a source language. The embodiment generates a translated text string by translating the original text string from the source language to a target language, including translation of a word from the source language to the target language. The embodiment assembles a standard phoneme sequence from a set of standard phonemes, where the standard phoneme sequence includes a standard pronunciation of the translated word. The embodiment also associates a custom phoneme with a standard phoneme of the standard phoneme sequence, where the custom phoneme includes the specific person's pronunciation of a sound in the translated word. The embodiment synthesizes the translated text string into a translated audio signal in which the translated word is pronounced using the custom phoneme.
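The key step in the embodiment above is substituting a speaker-specific phoneme for a standard one before synthesis. A minimal sketch, assuming a made-up lexicon and ARPAbet-style symbols; the phoneme names and the `custom_map` structure are illustrative:

```python
# Hypothetical standard pronunciation lexicon for the target language.
standard_lexicon = {"hello": ["HH", "AH", "L", "OW"]}

def assemble_sequence(word):
    # standard phoneme sequence for the translated word
    return list(standard_lexicon[word])

def apply_custom_phonemes(seq, custom_map):
    # custom_map: standard phoneme -> the speaker's own variant
    return [custom_map.get(p, p) for p in seq]

seq = assemble_sequence("hello")
custom = apply_custom_phonemes(seq, {"OW": "OW_speaker7"})
# the sequence now ends with the speaker's own pronunciation of "OW"
```

The resulting mixed sequence feeds the synthesizer, so most of the word uses standard phonemes while the sounds the specific person pronounces distinctively use their recorded variants.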
Text-to-speech from media content item snippets
A text-to-speech engine creates audio output that includes synthesized speech and one or more media content item snippets. The input text is obtained and partitioned into text sets. A track having lyrics that match a part of one of the text sets is identified. The portion of the track's audio that contains the matching lyric is extracted based on forced-alignment data. The extracted audio is combined with synthesized speech corresponding to the remainder of the input text to form the audio output.
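The pipeline above (match a text set to lyrics, look up snippet boundaries from forced alignment, splice with TTS for the rest) can be sketched as follows. The track catalog, alignment table, and `synthesize` stub are all hypothetical stand-ins:

```python
# Hypothetical catalog: lyrics plus forced-alignment word timings.
tracks = {
    "track_42": {
        "lyrics": "never gonna give you up",
        "alignment": {"never": (10.0, 10.4), "gonna": (10.4, 10.7)},
    }
}

def find_matching_track(text_set):
    for track_id, meta in tracks.items():
        if text_set in meta["lyrics"]:
            return track_id
    return None

def snippet_bounds(track_id, text_set):
    # use the alignment of the first and last matched words
    words = text_set.split()
    align = tracks[track_id]["alignment"]
    return align[words[0]][0], align[words[-1]][1]

def synthesize(text):
    return f"<tts:{text}>"  # stand-in for a TTS engine

def build_output(input_text, text_set):
    track_id = find_matching_track(text_set)
    start, end = snippet_bounds(track_id, text_set)
    remainder = input_text.replace(text_set, "").strip()
    return [synthesize(remainder), ("snippet", track_id, start, end)]

out = build_output("play never gonna", "never gonna")
```

In a real system the second element would be cut from the track's audio at those timestamps and concatenated with the synthesized waveform.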
PAUSE ESTIMATION MODEL LEARNING APPARATUS, PAUSE ESTIMATION APPARATUS, METHODS AND PROGRAMS FOR THE SAME
A pause estimation model learning apparatus includes: a morphological analysis unit configured to perform morphological analysis on training text data to provide M types of information, M being an integer equal to or larger than 2; a feature selection unit configured to combine N pieces of information, among the M types of information, into an input feature when a predetermined condition is satisfied, and to select a predetermined one of the N pieces of information as the input feature when the condition is not satisfied, N being an integer equal to or larger than 2 and equal to or smaller than M; and a learning unit configured to learn a pause estimation model by using the input feature selected by the feature selection unit and a correct pause label.
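The feature-selection rule above (combine N of the M analysis outputs when a condition holds, otherwise fall back to one predetermined piece) reduces to a small branch. A sketch with an assumed condition and made-up morphological fields; none of the names come from the patent:

```python
def select_input_feature(morph_info, combine_keys, fallback_key, condition):
    """morph_info: dict holding the M types of information for a token."""
    if condition(morph_info):
        # condition satisfied: combine the N selected pieces
        return "+".join(morph_info[k] for k in combine_keys)
    # otherwise: a single predetermined piece becomes the feature
    return morph_info[fallback_key]

info = {"surface": "walk", "pos": "VERB", "accent": "H"}
feat = select_input_feature(
    info,
    combine_keys=["pos", "accent"],   # the N pieces (here N = 2)
    fallback_key="pos",
    condition=lambda m: m["pos"] == "VERB",  # assumed condition
)
```

The selected feature string, paired with a correct pause label, is what the learning unit would train the pause estimation model on.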
VOICE COMMUNICATION BETWEEN A SPEAKER AND A RECIPIENT OVER A COMMUNICATION NETWORK
Voice communication, between a speaker and a recipient, either or both of whom may be in a motor vehicle, is provided via a communication network. In a first step, an input speech utterance is received from the speaker. Optionally, a bandwidth of a connection to the communication network is evaluated at the side of the speaker. The input speech utterance is then converted to text. At least the text is transmitted over the communication network. In case of a sufficiently large bandwidth, the input speech utterance may be transmitted as voice and as text. The transmitted text is converted into an output speech utterance that simulates a voice of the speaker. Finally, the output speech utterance is provided to the recipient.
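The send/receive logic above amounts to: always transmit text, additionally transmit voice when bandwidth allows, and re-synthesize in the speaker's voice otherwise. A sketch with stub ASR/TTS calls; the threshold value and all function names are assumptions:

```python
BANDWIDTH_THRESHOLD_KBPS = 64  # assumed cutoff for "sufficiently large"

def speech_to_text(utterance):
    return f"text({utterance})"  # ASR stub

def build_payload(utterance, bandwidth_kbps):
    # the text is always transmitted
    payload = {"text": speech_to_text(utterance)}
    if bandwidth_kbps >= BANDWIDTH_THRESHOLD_KBPS:
        payload["voice"] = utterance  # enough bandwidth: send both
    return payload

def receive(payload, speaker_voice_id):
    if "voice" in payload:
        return payload["voice"]  # play the original utterance
    # otherwise synthesize speech that simulates the speaker's voice
    return f"tts({payload['text']}, voice={speaker_voice_id})"

out = receive(build_payload("hi", bandwidth_kbps=32), "spk1")
```

The design trades bandwidth for compute: text is cheap to transmit over a weak in-vehicle connection, and the recipient side reconstructs a speaker-like voice from it.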
SPEECH SYNTHESIS METHOD, AND ELECTRONIC DEVICE
The disclosure provides a speech synthesis method and an electronic device. The technical solution is as follows. A text to be synthesized and speech features of a target user are obtained. First acoustic features are predicted based on the text to be synthesized and the speech features. A target template audio is obtained from a template audio library based on the text to be synthesized. Second acoustic features of the target template audio are extracted. Target acoustic features are generated by splicing the first acoustic features and the second acoustic features. Speech synthesis is performed on the text to be synthesized based on the target acoustic features and the speech features, to generate a target speech of the text to be synthesized.
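The splicing step above concatenates predicted acoustic frames with frames extracted from the matched template audio. A shape-level sketch, assuming frame matrices of fixed width; the prediction and extraction stubs are hypothetical stand-ins for the acoustic model and feature extractor:

```python
import numpy as np

def predict_first_features(text, speaker_feats):
    # acoustic-model stub: one 3-dim frame per character,
    # scaled by the target user's speech features
    return np.ones((len(text), 3)) * speaker_feats

def extract_second_features(template_audio):
    # feature-extraction stub over the matched template audio
    return np.full((len(template_audio), 3), 0.5)

def splice(first, second):
    # target features = predicted frames followed by template frames
    return np.concatenate([first, second], axis=0)

first = predict_first_features("hi", speaker_feats=0.2)
second = extract_second_features([0.1, 0.2, 0.3])
target = splice(first, second)
```

The spliced matrix then drives synthesis, so the output speech inherits frames from both the model's prediction and the real template recording.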
DYNAMIC ADJUSTMENT OF CONTENT DESCRIPTIONS FOR VISUAL COMPONENTS
In some implementations, a mobile application may receive at least one string included in at least one content description associated with at least one visual component of a screen generated by the mobile application. The mobile application may apply a function to the at least one string, wherein the function performs a targeted replacement of characters included in the at least one string based on at least one optimization associated with a text-to-speech algorithm. Accordingly, the mobile application may receive output from the function that includes at least one modified string based on the at least one string, and generate an audio signal, based on the at least one modified string, using the text-to-speech algorithm.
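The targeted character replacement described above can be sketched as a small rewrite function. The replacement table is an assumed example of a TTS-oriented optimization (spelling out symbols a speech engine reads poorly); it is not from the source:

```python
# Hypothetical replacements that make symbols speakable.
TTS_REPLACEMENTS = {
    "&": " and ",
    "/": " or ",
    "#": " number ",
}

def optimize_for_tts(description):
    # targeted replacement of characters in the content description
    for ch, spoken in TTS_REPLACEMENTS.items():
        description = description.replace(ch, spoken)
    # normalize the extra whitespace the replacements introduce
    return " ".join(description.split())

label = optimize_for_tts("Save & exit / cancel")
```

The modified string, rather than the raw accessibility label, is what the application would hand to the text-to-speech engine to generate the audio signal.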