Patent classification: G10L13/086
Speech Recognition and Text-to-Speech Learning System
An example text-to-speech learning system performs a method for generating a pronunciation sequence conversion model. The method includes generating a first pronunciation sequence from a speech input of a training pair and generating a second pronunciation sequence from a text input of the training pair. The method also includes determining a pronunciation sequence difference between the first pronunciation sequence and the second pronunciation sequence; and generating a pronunciation sequence conversion model based on the pronunciation sequence difference. An example speech recognition learning system performs a method for generating a pronunciation sequence conversion model. The method includes extracting an audio signal vector from a speech input and applying an audio signal conversion model to the audio signal vector to generate a converted audio signal vector. The method also includes adapting an acoustic model based on the converted audio signal vector to generate an adapted acoustic model.
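As a rough illustration of the pronunciation-sequence-difference step (a sketch, not the patent's actual method), the following Python aligns the two phoneme sequences of a training pair and records one-to-one substitutions as a toy conversion model; the ARPAbet-style phonemes and the dictionary format are assumptions made here for illustration.

    from difflib import SequenceMatcher

    def build_conversion_model(speech_phonemes, text_phonemes):
        """Collect phoneme substitutions observed between the sequence
        recognized from speech and the sequence predicted from text."""
        model = {}
        matcher = SequenceMatcher(a=text_phonemes, b=speech_phonemes)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace" and (i2 - i1) == (j2 - j1):
                # Record each aligned substitution as a conversion rule.
                for src, dst in zip(text_phonemes[i1:i2], speech_phonemes[j1:j2]):
                    model.setdefault(src, dst)
        return model

    # Hypothetical training pair: phonemes predicted from text vs. heard in speech.
    text_seq   = ["T", "AH", "M", "EY", "T", "OW"]
    speech_seq = ["T", "AH", "M", "AA", "T", "OW"]
    print(build_conversion_model(text_seq, speech_seq))  # {'EY': 'AA'}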
Speech generation using crosslingual phoneme mapping
Computer-generated speech can be produced for cross-lingual natural language textual data streams by utilizing a universal phoneme set. In a variety of implementations, the natural language textual data stream includes a primary language portion in a primary language and a secondary language portion that is not in the primary language. Phonemes corresponding to the secondary language portion can be determined from a set of phonemes in a universal data set. These phonemes can be mapped to a set of phonemes for the primary language. Audio data can then be generated for these phonemes so that the secondary language portion of the natural language textual data stream is pronounced with phonemes associated with the primary language.
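A minimal sketch of the mapping step, assuming an invented universal-to-English phoneme table; the symbols and mapping choices below are illustrative, not taken from the patent.

    # Hypothetical universal-to-primary phoneme mapping (invented for illustration).
    UNIVERSAL_TO_ENGLISH = {
        "ʁ": "R",    # French uvular r -> closest English phoneme
        "y": "UW",   # French /y/ -> English /u:/
        "ɛ̃": "EH N", # nasal vowel -> vowel plus nasal consonant
    }

    def map_to_primary(universal_phonemes, table=UNIVERSAL_TO_ENGLISH):
        """Map phonemes of a secondary-language span onto the primary
        language's phoneme set, falling back to the input symbol."""
        mapped = []
        for ph in universal_phonemes:
            mapped.extend(table.get(ph, ph).split())
        return mapped

    # French "rue" rendered with English phonemes: ['R', 'UW']
    print(map_to_primary(["ʁ", "y"]))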
Phonemes and graphemes for neural text-to-speech
A method includes receiving a text input including a sequence of words represented as an input encoder embedding. The input encoder embedding includes a plurality of tokens, with the plurality of tokens including a first set of grapheme tokens representing the text input as respective graphemes and a second set of phoneme tokens representing the text input as respective phonemes. The method also includes, for each respective phoneme token of the second set of phoneme tokens: identifying a respective word of the sequence of words corresponding to the respective phoneme token and determining a respective grapheme token representing the respective word of the sequence of words corresponding to the respective phoneme token. The method also includes generating an output encoder embedding based on a relationship between each respective phoneme token and the corresponding grapheme token determined to represent a same respective word as the respective phoneme token.
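One way to picture the phoneme-to-grapheme pairing is the sketch below; the token structure is a hypothetical simplification of the described embedding, not the patent's representation.

    from dataclasses import dataclass

    @dataclass
    class Token:
        value: str
        word_index: int  # which word of the input the token came from

    def align_phonemes_to_graphemes(grapheme_tokens, phoneme_tokens):
        """For each phoneme token, find the grapheme token representing
        the same word, mirroring the pairing described above."""
        by_word = {g.word_index: g for g in grapheme_tokens}
        return [(p, by_word[p.word_index]) for p in phoneme_tokens]

    # Hypothetical tokens for the two-word input "hello world".
    graphemes = [Token("hello", 0), Token("world", 1)]
    phonemes = [Token("HH", 0), Token("AH", 0), Token("L", 0), Token("OW", 0),
                Token("W", 1), Token("ER", 1), Token("L", 1), Token("D", 1)]
    for p, g in align_phonemes_to_graphemes(graphemes, phonemes):
        print(p.value, "->", g.value)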
Speech synthesis system, method and non-transitory computer readable medium with language option selection and acoustic models
A speech synthesis system includes an operating interface, a storage unit and a processor. The operating interface provides a plurality of language options for a user to select one output language option therefrom. The storage unit stores a plurality of acoustic models. Each acoustic model corresponds to one of the language options and includes a plurality of phoneme labels corresponding to a specific vocal. The processor receives a text file and generates output speech data corresponding to the specific vocal according to the text file, a speech synthesizer, and one of the acoustic models which corresponds to the output language option.
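A toy sketch of the model-selection logic, with a stubbed synthesizer and an invented model registry standing in for the stored acoustic models; all names and values are assumptions.

    # Hypothetical registry of acoustic models, keyed by language option.
    ACOUSTIC_MODELS = {
        "en-US": {"phoneme_labels": ["HH", "AH", "L", "OW"], "vocal": "speaker_A"},
        "zh-TW": {"phoneme_labels": ["n", "i", "h", "au"],   "vocal": "speaker_A"},
    }

    def synthesize(text_file_contents: str, output_language: str) -> str:
        """Pick the acoustic model matching the selected language option and
        hand it to a (stubbed) synthesizer; returns a description of the output."""
        model = ACOUSTIC_MODELS[output_language]
        # A real synthesizer would produce audio; this stub just reports the choice.
        return (f"speech for {text_file_contents!r} in voice {model['vocal']} "
                f"using {len(model['phoneme_labels'])} phoneme labels")

    print(synthesize("hello", "en-US"))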
Electronic apparatus and controlling method thereof
An electronic apparatus includes a memory configured to store first voice recognition information related to a first language and second voice recognition information related to a second language, and a processor configured to obtain, on the basis of the first voice recognition information, a first text corresponding to a received user voice. When the obtained first text indicates that an entity name is included in the user voice, the processor identifies the segment of the user voice in which the entity name is included, obtains a second text corresponding to the identified segment on the basis of the second voice recognition information, and obtains control information corresponding to the user voice on the basis of the first text and the second text.
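A schematic of the two-pass recognition flow, with both recognizers stubbed out; the placeholder convention, time segment, and example results are invented for illustration.

    def recognize_first_language(audio):
        # Stand-in for ASR in the first language; returns text plus the time
        # segment where an entity name was detected (values invented here).
        return "play song <entity>", ("<entity>", 1.2, 2.0)

    def recognize_second_language(audio_segment):
        # Stand-in for ASR in the second language.
        return "아리랑"

    def handle_user_voice(audio):
        """Obtain a first text in the first language; if it flags an entity
        name, re-recognize that segment in the second language and merge."""
        first_text, (placeholder, start, end) = recognize_first_language(audio)
        if placeholder in first_text:
            second_text = recognize_second_language((audio, start, end))
            return first_text.replace(placeholder, second_text)
        return first_text

    print(handle_user_voice("user_voice.wav"))  # play song 아리랑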
Maintaining original volume changes of a character in a revoiced media stream
Methods, systems, and computer-readable media for artificially generating a revoiced media stream and maintaining original volume changes of a character in the revoiced media stream are provided. For example, a media stream including an individual speaking may be obtained. A transcript of the media stream may be obtained. The transcript of the media stream may be translated to a target language. A revoiced media stream in which the translated transcript in the target language is spoken by a virtual entity may be generated, wherein a ratio of the volume levels between first and second sets of words in the revoiced media stream is substantially identical to the ratio of volume levels between corresponding first and second utterances in the received media stream.
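One plausible reading of the volume-ratio constraint, sketched with NumPy; using RMS level as the volume measure and rescaling the second utterance are assumptions here, not the patent's stated method.

    import numpy as np

    def rms(samples: np.ndarray) -> float:
        """Root-mean-square level, used here as the 'volume' of an utterance."""
        return float(np.sqrt(np.mean(samples ** 2)))

    def match_volume_ratio(orig_a, orig_b, revoiced_a, revoiced_b):
        """Rescale the second revoiced utterance so the revoiced B/A level
        ratio matches the original B/A level ratio."""
        target_ratio = rms(orig_b) / rms(orig_a)
        current_ratio = rms(revoiced_b) / rms(revoiced_a)
        return revoiced_b * (target_ratio / current_ratio)

    rng = np.random.default_rng(0)
    orig_a = rng.normal(0, 0.2, 16000)   # original loud utterance
    orig_b = rng.normal(0, 0.05, 16000)  # original quiet utterance
    rev_a = rng.normal(0, 0.1, 16000)    # revoiced at a flat level
    rev_b = rng.normal(0, 0.1, 16000)
    rev_b_fixed = match_volume_ratio(orig_a, orig_b, rev_a, rev_b)
    # Both printed ratios should agree (about 0.25).
    print(round(rms(orig_b) / rms(orig_a), 3), round(rms(rev_b_fixed) / rms(rev_a), 3))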
System and method for intelligent language switching in automated text-to-speech systems
Systems, methods, and computer-readable storage media for providing intelligent switching of languages and/or pronunciations in a text-to-speech system. As the system receives text, the text is analyzed to identify portions which should have speech constructed using a pronunciation distinct from that of the remaining text. The text-to-speech system uses multiple pronunciation dictionaries to generate and produce speech corresponding to the text, where the identified portions of the text are in a different language or have a different accent from the remainder of the text. Having generated speech corresponding to the text in multiple languages, accents, or dialects, the system combines the portions and communicates the combined speech to the recipient.
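A minimal sketch of per-word dictionary switching, assuming tiny invented pronunciation dictionaries and words already tagged with a detected language.

    # Hypothetical pronunciation dictionaries for two languages.
    DICTS = {
        "en": {"the": "DH AH", "louvre": "L UW V R AH"},  # anglicized fallback
        "fr": {"louvre": "l u v ʁ"},
    }

    def pronounce(tagged_words):
        """Look each word up in the dictionary for its detected language,
        then stitch the per-word pronunciations back together in order."""
        pieces = []
        for word, lang in tagged_words:
            lexicon = DICTS.get(lang, DICTS["en"])
            pieces.append(lexicon.get(word, word))
        return " | ".join(pieces)

    # 'louvre' tagged as French gets the French dictionary's pronunciation.
    print(pronounce([("the", "en"), ("louvre", "fr")]))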
Dynamic translation for a conversation
A conversation design is received for a conversation bot that enables the conversation bot to provide a service using a conversation flow specified at least in part by the conversation design. The conversation design specifies in a first human language at least a portion of a message content to be provided by the conversation bot. It is identified that an end-user of the conversation bot prefers to converse in a second human language different from the first human language. In response to a determination that the message content is to be provided by the conversation bot to the end-user, the message content of the conversation design is dynamically translated for the end-user from the first human language to the second human language. The translated message content is provided to the end-user in a message from the conversation bot.
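The translate-only-when-languages-differ decision might look like the following sketch, with a stubbed translator in place of a real machine-translation service; the function names and example strings are assumptions.

    # Stub translator; a real system would call a machine-translation service.
    def translate(text: str, source: str, target: str) -> str:
        fake = {("Hello!", "en", "es"): "¡Hola!"}
        return fake.get((text, source, target), text)

    def send_message(design_text: str, design_lang: str, user_lang: str) -> str:
        """Translate the designed message content only when the end-user's
        preferred language differs from the conversation design's language."""
        if user_lang != design_lang:
            return translate(design_text, design_lang, user_lang)
        return design_text

    print(send_message("Hello!", design_lang="en", user_lang="es"))  # ¡Hola!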
Method and system for remote communication based on real-time translation service
According to an embodiment of the present disclosure, a method for remote communication based on a real-time translation service, provided by a real-time translation application executed by at least one processor of a computing device, comprises: performing augmented reality-based remote communication; setting an initial value of a translation function for the remote communication; obtaining communication data of other users through the remote communication; performing language detection on the obtained communication data; when a target translation language is detected within the communication data, translating the detected communication data in the target translation language; and providing the translated communication data.
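A compact sketch of the detect-then-translate step; the toy detector below only distinguishes Hangul from Latin text, an assumption made purely for illustration.

    # Toy language detector keyed on Unicode ranges (invented for illustration).
    def detect_language(text: str) -> str:
        return "ko" if any("\uac00" <= ch <= "\ud7a3" for ch in text) else "en"

    def relay(communication_data: str, target_translation_language: str,
              translate=lambda t, s, d: f"[{s}->{d}] {t}"):
        """Detect the language of incoming communication data and translate
        it only when it matches the configured target translation language."""
        detected = detect_language(communication_data)
        if detected == target_translation_language:
            return translate(communication_data, detected, "en")
        return communication_data

    print(relay("안녕하세요", target_translation_language="ko"))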
Systems and methods for replaying content dialogue in an alternate language
Systems and methods are described herein for replaying content dialogue in an alternate language in response to a user command. While the content is playing on a media device, a first language in which the content dialogue is spoken is identified. Upon receiving a voice command to repeat a portion of the dialogue, a second language in which the command was spoken is identified. The portion of the content dialogue to repeat is identified and translated from the first language to the second language. The translated portion of the content dialogue is then output. In this way, the user can simply ask in their native language for the dialogue to be repeated, and the repeated portion of the dialogue is presented in the user's native language.
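A schematic of the replay flow, with invented subtitle data, a naive command-language detector, and a bracketed stand-in for translation; all of these are assumptions, not the described system's components.

    # Stub components; subtitle data, detector, and translator are invented here.
    DIALOGUE = [("00:01", "en", "Meet me at the harbor.")]

    def language_of(utterance: str) -> str:
        # Naive detector: treat commands starting with "qué" as Spanish.
        return "es" if utterance.strip("?¿ ").lower().startswith("qué") else "en"

    def repeat_in_users_language(command: str) -> str:
        """Identify the command's language, translate the last dialogue line
        into it, and return the translated line for playback."""
        user_lang = language_of(command)
        _, content_lang, line = DIALOGUE[-1]
        if user_lang != content_lang:
            return f"[{content_lang}->{user_lang}] {line}"  # stand-in for translation
        return line

    print(repeat_in_users_language("¿Qué dijo?"))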