G10L13/086

INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD

In a terminal device (10), a communication unit (15) receives a counterpart utterance text in a counterpart language section and a counterpart utterance speech in a counterpart non-language section, and a control unit (11) outputs the counterpart utterance text after performing language translation, and outputs the counterpart utterance speech in the counterpart non-language section without performing language translation. For example, the control unit (11) outputs the counterpart utterance speech in the counterpart non-language section before outputting a result of the language translation of the counterpart utterance text.
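
The output ordering described in this abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class names `LanguageSection` and `NonLanguageSection`, the function `process_sections`, and the `translate` callback are all hypothetical, and the sketch assumes the "for example" ordering in which untranslated speech is output before the translation results.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class LanguageSection:
    text: str  # counterpart utterance text, to be translated


@dataclass
class NonLanguageSection:
    speech: bytes  # counterpart utterance speech (hypothetical raw audio), passed through


Section = Union[LanguageSection, NonLanguageSection]


def process_sections(
    sections: List[Section], translate: Callable[[str], str]
) -> List[Tuple[str, Union[bytes, str]]]:
    """Return outputs in order: untranslated speech first, then translated text."""
    # Non-language sections are output as received, without language translation.
    passthrough = [
        ("speech", s.speech) for s in sections if isinstance(s, NonLanguageSection)
    ]
    # Language sections are output only after translation.
    translated = [
        ("text", translate(s.text)) for s in sections if isinstance(s, LanguageSection)
    ]
    return passthrough + translated
```

Here `translate` stands in for whatever translation service the device would call; any function from string to string can be passed in.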

Multilingual text-to-speech synthesis
11769483 · 2023-09-26

A multilingual text-to-speech synthesis method and system are disclosed. The method includes receiving an articulatory feature of a speaker regarding a first language, receiving an input text of a second language, and generating output speech data for the input text of the second language that simulates the speaker's speech by inputting the input text of the second language and the articulatory feature of the speaker regarding the first language to a single artificial neural network multilingual text-to-speech synthesis model. The single artificial neural network multilingual text-to-speech synthesis model is generated by learning similarity information between phonemes of the first language and phonemes of the second language based on a first learning data of the first language and a second learning data of the second language.

Hotword-Aware Speech Synthesis
20210366459 · 2021-11-25

A method includes receiving text input data for conversion into synthesized speech and determining, using a hotword-aware model trained to detect a presence of a hotword assigned to a user device, whether a pronunciation of the text input data includes the hotword. The hotword is configured to initiate a wake-up process on the user device for processing the hotword and/or one or more other terms following the hotword in the audio input data. When the pronunciation of the text input data includes the hotword, the method also includes generating an audio output signal from the text input data and providing the audio output signal to an audio output device to output the audio output signal. The audio output signal, when captured by an audio capture device of the user device, is configured to prevent initiation of the wake-up process on the user device.

Multilingual text-to-speech synthesis
11217224 · 2022-01-04

A multilingual text-to-speech synthesis method and system are disclosed. The method includes receiving first learning data including a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language, receiving second learning data including a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language, and generating a single artificial neural network text-to-speech synthesis model by learning similarity information between phonemes of the first language and phonemes of the second language based on the first learning data and the second learning data.

PROCESSING SEQUENCES USING CONVOLUTIONAL NEURAL NETWORKS

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing sequences using convolutional neural networks. One of the methods includes, for each of the time steps: providing a current sequence of audio data as input to a convolutional subnetwork, wherein the current sequence comprises the respective audio sample at each time step that precedes the time step in the output sequence, and wherein the convolutional subnetwork is configured to process the current sequence of audio data to generate an alternative representation for the time step; and providing the alternative representation for the time step as input to an output layer, wherein the output layer is configured to: process the alternative representation to generate an output that defines a score distribution over a plurality of possible audio samples for the time step.
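
The per-time-step loop in this abstract can be sketched roughly as below. This is a toy illustration under stated assumptions, not the claimed network: a single filter over the most recent samples stands in for the convolutional subnetwork, a linear layer plus softmax stands in for the output layer, and all function names, weights, and the greedy sample choice are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def representation(history: Sequence[float], kernel: Sequence[float]) -> float:
    """Stand-in convolutional subnetwork: one filter over the last len(kernel)
    samples that precede the current time step (zero-padded on the left)."""
    k = len(kernel)
    window = ([0.0] * k + list(history))[-k:]
    return math.tanh(sum(w * x for w, x in zip(kernel, window)))


def score_distribution(
    history: Sequence[float],
    kernel: Sequence[float],
    out_weights: Sequence[float],
    out_bias: Sequence[float],
) -> List[float]:
    """Output layer: scores over the possible audio sample values for this step."""
    h = representation(history, kernel)
    return softmax([w * h + b for w, b in zip(out_weights, out_bias)])


def generate(
    num_steps: int,
    kernel: Sequence[float],
    out_weights: Sequence[float],
    out_bias: Sequence[float],
    levels: Sequence[float],
) -> List[float]:
    """Autoregressive loop: each step sees only previously generated samples."""
    samples: List[float] = []
    for _ in range(num_steps):
        probs = score_distribution(samples, kernel, out_weights, out_bias)
        # Greedy choice for determinism; sampling from probs is the other option.
        best = max(range(len(probs)), key=probs.__getitem__)
        samples.append(levels[best])
    return samples
```

The key property the sketch preserves is causality: the distribution for a time step is computed only from samples at earlier time steps.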

Semi-structured content aware bi-directional transformer

A method, computer system, and a computer program product for natural language processing are provided. A first text corpus that includes semi-structured content that includes hierarchical nodes may be received. Some of the hierarchical nodes may be masked. Node embeddings and level embeddings may be generated from the semi-structured content of the first text corpus and from the masked hierarchical nodes. The node embeddings and the level embeddings may be included in a bi-directional transformer model. The bi-directional transformer model may be trained on the first text corpus by reducing loss from the bi-directional transformer model predicting the masked hierarchical nodes.
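
The masking step in this abstract might look like the sketch below. It covers only data preparation, and it assumes the semi-structured content arrives as a nested dictionary of node labels; the `flatten` and `mask_nodes` helpers and the `[MASK]` token are hypothetical. The embeddings and the transformer itself are out of scope here.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token for a masked node


def flatten(tree: Dict[str, dict], level: int = 0) -> List[Tuple[int, str]]:
    """Depth-first list of (level, label) pairs for a nested {label: children} dict."""
    nodes: List[Tuple[int, str]] = []
    for label, children in tree.items():
        nodes.append((level, label))
        nodes.extend(flatten(children, level + 1))
    return nodes


def mask_nodes(
    nodes: List[Tuple[int, str]], mask_prob: float, rng: random.Random
) -> Tuple[List[Tuple[int, str]], Dict[int, str]]:
    """Replace some node labels with MASK, keeping each node's level.

    Returns the masked sequence and a map from position to the original label,
    which a model would be trained to predict.
    """
    masked: List[Tuple[int, str]] = []
    targets: Dict[int, str] = {}
    for i, (level, label) in enumerate(nodes):
        if rng.random() < mask_prob:
            masked.append((level, MASK))
            targets[i] = label
        else:
            masked.append((level, label))
    return masked, targets
```

Keeping the level alongside each masked label reflects the abstract's separation of node embeddings from level embeddings: the level stays visible to the model even when the node label is hidden.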

METHODS AND SYSTEMS FOR CONTROL OF CONTENT IN AN ALTERNATE LANGUAGE OR ACCENT
20230316009 · 2023-10-05

Systems and methods are described herein for replaying content dialogue in an alternate language in response to a user command. While the content is playing on a media device, a first language in which the content dialogue is spoken is identified. Upon receiving a voice command to repeat a portion of the dialogue, the language in which the command was spoken is identified. The portion of the content dialogue to repeat is identified and translated from the first language to the second language. The translated portion of the content dialogue is then output. In this way, the user can simply ask in their native language for the dialogue to be repeated and the repeated portion of the dialogue is presented in the user's native language.

DYNAMIC SYSTEM RESPONSE CONFIGURATION
20230282201 · 2023-09-07

A natural language processing system may use system response configuration data to determine customized output data forms when outputting data for a user. The system response configuration data may represent various output attributes the system may use when creating output data. The system may have a certain number of existing profiles where a profile is associated with certain settings for the system response configuration data/attributes. The system may also use various data such as context data, sentiment data, or the like to customize system response configuration data during a dialog. Other components, such as natural language generation (NLG), text-to-speech (TTS), or the like, may use the customized system response configuration data to determine the form, timing, etc. of output data to be presented to a user.

Automatic generation of videos for digital products using instructions of a markup document on web based documents
11756528 · 2023-09-12

A system for generating videos uses a domain-specific instructional language and a video rendering engine that produces videos against a digital product that changes and evolves over time. The video rendering engine uses the instructions in an instruction markup document written in the domain-specific instructional language to generate a video while navigating a web-based document (which is different from the instruction markup document) representing the digital product for which the video is generated. The video rendering engine navigates the web-based document, coupled with the instruction markup document, which explains the operations to be performed on the web-based document. The instruction markup document also identifies the special effects that manipulate the underlying product in real time, includes the spoken text for generating subtitles, and provides formalized change management by design.

GENERATING REVOICED MEDIA STREAMS IN A VIRTUAL REALITY
20230156294 · 2023-05-18

Methods, systems, and computer-readable media for generating videos with characters indicating regions of images are provided. For example, an image containing a first region may be received. At least one characteristic of a character may be obtained. A script containing a first segment of the script may be received. The first segment of the script may be related to the first region of the image. The at least one characteristic of a character and the script may be used to generate a video of the character presenting the script and at least part of the image, where the character visually indicates the first region of the image while presenting the first segment of the script.