G10L13/086

Training Speech Synthesis to Generate Distinct Speech Sounds

A method (800) of training a text-to-speech (TTS) model (108) includes obtaining training data (150) including reference input text (104) that includes a sequence of characters, a sequence of reference audio features (402) representative of the sequence of characters, and a sequence of reference phone labels (502) representative of distinct speech sounds of the reference audio features. For each of a plurality of time steps, the method includes generating a corresponding predicted audio feature (120) based on a respective portion of the reference input text for the time step and generating, using a phone label mapping network (510), a corresponding predicted phone label (520) associated with the predicted audio feature. The method also includes aligning the predicted phone label with the reference phone label to determine a corresponding predicted phone label loss (622) and updating the TTS model based on the corresponding predicted phone label loss.
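
This abstract describes a multi-task training loop: the model predicts audio features per time step, a phone label mapping network classifies each predicted feature into a phone, and the resulting phone-label loss updates the model alongside the usual feature loss. The PyTorch sketch below is a minimal illustration under stated assumptions, not the patent's implementation: the GRU decoder, the module names (`tts_model`, `phone_mapper`), the dimensions, and the simple per-time-step alignment are all stand-ins.

```python
import torch
import torch.nn as nn

FEATURE_DIM, NUM_PHONES = 80, 64   # e.g. 80 mel bins; assumed phone inventory size

tts_model = nn.GRU(input_size=256, hidden_size=FEATURE_DIM, batch_first=True)  # stand-in decoder
phone_mapper = nn.Linear(FEATURE_DIM, NUM_PHONES)   # stand-in phone label mapping network
optimizer = torch.optim.Adam(
    list(tts_model.parameters()) + list(phone_mapper.parameters()))

def training_step(text_encoding, ref_features, ref_phone_labels):
    """One update: predict audio features per time step, map each feature to
    phone logits, and combine the feature loss with the phone-label loss."""
    pred_features, _ = tts_model(text_encoding)      # (batch, time, FEATURE_DIM)
    phone_logits = phone_mapper(pred_features)       # (batch, time, NUM_PHONES)
    feature_loss = nn.functional.l1_loss(pred_features, ref_features)
    # Predicted and reference labels are assumed already time-aligned here;
    # the patent aligns them before computing the phone label loss.
    phone_loss = nn.functional.cross_entropy(
        phone_logits.reshape(-1, NUM_PHONES), ref_phone_labels.reshape(-1))
    loss = feature_loss + phone_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = training_step(torch.randn(2, 50, 256),
                     torch.randn(2, 50, FEATURE_DIM),
                     torch.randint(0, NUM_PHONES, (2, 50)))
```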

Generating videos with a character indicating a region of an image
11595738 · 2023-02-28 ·

Methods, systems, and computer-readable media for generating videos with characters indicating regions of images are provided. For example, an image containing a first region may be received. At least one characteristic of a character may be obtained. A script containing a first segment may be received, where the first segment is related to the first region of the image. The at least one characteristic of the character and the script may be used to generate a video of the character presenting the script and at least part of the image, where the character visually indicates the first region of the image while presenting the first segment of the script.
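
One way to read this is as a cue timeline: each script segment carries an optional image region, and a renderer makes the character indicate that region while the segment is spoken. The sketch below is illustrative only; the dataclasses, the words-per-second duration estimate, and the cue format are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Region:
    x: int; y: int; width: int; height: int   # image region, in pixels

@dataclass
class ScriptSegment:
    text: str
    region: Optional[Region]   # region the character indicates, if any

def build_timeline(segments, words_per_second=2.5):
    """Return (start, end, text, region) cues; durations from word count."""
    t, cues = 0.0, []
    for seg in segments:
        duration = max(1.0, len(seg.text.split()) / words_per_second)
        cues.append((t, t + duration, seg.text, seg.region))
        t += duration
    return cues

timeline = build_timeline([
    ScriptSegment("This product has three parts.", None),
    ScriptSegment("Note the handle on the left.", Region(10, 40, 120, 200)),
])
for start, end, text, region in timeline:
    print(f"{start:4.1f}-{end:4.1f}s  {text!r}  indicate={region}")
```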

DIALOGUE APPARATUS, METHOD AND PROGRAM

A dialogue apparatus includes a speech recognition unit (1) configured to perform speech recognition on an input utterance to generate a text corresponding to the utterance, a speech waveform corresponding to the utterance, and information regarding the length of the sound of the utterance; a language understanding unit (2) configured to determine the content of the utterance by using the text corresponding to the utterance; a dialogue management unit (3) configured to determine the content of a response corresponding to the utterance by using the content of the utterance; an utterance state extraction unit (4) configured to extract a state of the utterance by using the text, the speech waveform, and the information regarding the length of the sound of the utterance; a response state determination unit (5) configured to determine a state of the response according to the state of the utterance; a response sentence generation unit (6) configured to generate a response sentence by using the content of the response; and a speech synthesis unit (7) configured to synthesize speech corresponding to the response sentence, taking the state of the response into account.
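
The seven units form two paths that merge at synthesis: the recognized text flows through understanding and dialogue management, while the text, waveform, and length information also feed utterance-state extraction and response-state determination. The sketch below shows only that data flow; every function body is a placeholder assumption.

```python
def recognize(audio):                       # unit (1)
    # returns recognized text, the speech waveform, and sound-length info
    return "hello", audio, {"duration_s": 0.8}

def understand(text):                       # unit (2)
    return {"intent": "greeting"}

def manage_dialogue(utterance_content):     # unit (3)
    return {"act": "greet_back"}

def extract_utterance_state(text, waveform, length_info):   # unit (4)
    # trivial stand-in rule: short utterances are treated as hurried
    return "hurried" if length_info["duration_s"] < 1.0 else "calm"

def determine_response_state(utterance_state):              # unit (5)
    return {"hurried": "brief", "calm": "relaxed"}[utterance_state]

def generate_response_sentence(response_content):           # unit (6)
    return "Hello! How can I help?"

def synthesize(sentence, response_state):                   # unit (7)
    return f"<audio style={response_state}: {sentence!r}>"

def respond(audio):
    text, waveform, length_info = recognize(audio)
    utterance_content = understand(text)
    response_content = manage_dialogue(utterance_content)
    utterance_state = extract_utterance_state(text, waveform, length_info)
    response_state = determine_response_state(utterance_state)
    sentence = generate_response_sentence(response_content)
    return synthesize(sentence, response_state)

print(respond(b"\x00" * 16000))
```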

Dynamic system response configuration

A natural language processing system may use system response configuration data to determine customized output data forms when outputting data for a user. The system response configuration data may represent various output attributes the system may use when creating output data. The system may have a number of existing profiles, where each profile is associated with particular settings for the system response configuration data/attributes. The system may also use various data, such as context data, sentiment data, or the like, to customize system response configuration data during a dialog. Other components, such as natural language generation (NLG), text-to-speech (TTS), or the like, may use the customized system response configuration data to determine the form, timing, etc. of output data to be presented to a user.
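
A rough sketch of the configuration flow follows: start from a stored profile, then overlay per-dialog adjustments from context and sentiment signals, producing an attribute set for downstream NLG/TTS. The profile names, attributes, and adjustment rules here are all illustrative assumptions.

```python
PROFILES = {
    "default": {"verbosity": "normal", "speaking_rate": 1.0, "tone": "neutral"},
    "concise": {"verbosity": "low", "speaking_rate": 1.1, "tone": "neutral"},
}

def configure_response(profile_name, context=None, sentiment=None):
    """Start from an existing profile, then customize the response
    configuration attributes using context and sentiment data."""
    config = dict(PROFILES[profile_name])
    if context and context.get("driving"):
        config["verbosity"] = "low"          # shorter answers while driving
    if sentiment == "frustrated":
        config["tone"] = "empathetic"
        config["speaking_rate"] = 0.9        # slow down slightly
    return config   # consumed downstream by NLG / TTS components

print(configure_response("default", context={"driving": True}, sentiment="frustrated"))
```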

GENERATING PERSONALIZED VIDEOS FROM TEXTUAL INFORMATION
20250234072 · 2025-07-17 ·

Systems, methods and non-transitory computer readable media for generating personalized videos from textual information are provided. An indication of a preference of a user is obtained. Further, textual information for generating a personalized video is obtained from the user. At least one characteristic of a character is selected based on the preference of the user. An artificial neural network, the textual information, and the selected at least one characteristic of the character are used to generate the personalized video depicting the character with the selected at least one characteristic.
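
The selection step can be pictured as a preference-to-characteristic mapping feeding a generator. In the sketch below, the mapping table and the `generate_video` stub stand in for the patent's artificial neural network; none of the names come from the source.

```python
PREFERENCE_TO_CHARACTERISTIC = {          # assumed mapping
    "formal": {"outfit": "suit", "voice": "measured"},
    "casual": {"outfit": "t-shirt", "voice": "upbeat"},
}

def generate_video(text, characteristic):
    # placeholder for the neural network that renders the character
    return {"frames": [], "script": text, "character": characteristic}

def personalized_video(user_preference, textual_information):
    characteristic = PREFERENCE_TO_CHARACTERISTIC[user_preference]
    return generate_video(textual_information, characteristic)

print(personalized_video("casual", "Welcome to your weekly summary!"))
```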

Automatic dubbing method and apparatus

An automatic dubbing method is disclosed. The method comprises: extracting speeches of a voice from an audio portion of a media content (504); obtaining a voice print model for the extracted speeches of the voice (506); processing the extracted speeches by utilizing the voice print model to generate replacement speeches (508); and replacing the extracted speeches of the voice with the generated replacement speeches in the audio portion of the media content (510).
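
The four numbered steps (504-510) form a straightforward pipeline. The sketch below mirrors that pipeline with stub functions; the `Segment` type and every function body are assumptions, not the patent's implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float; end: float; audio: bytes   # one extracted speech span

def extract_speeches(audio_track):                     # step (504)
    return [Segment(0.0, 2.0, audio_track[:32000])]

def build_voice_print(segments):                       # step (506)
    return {"embedding": [0.0] * 128}                  # placeholder speaker model

def generate_replacements(segments, voice_print):      # step (508)
    # e.g. synthesize replacement speech in the modeled voice
    return [Segment(s.start, s.end, b"synthesized") for s in segments]

def replace_in_track(audio_track, old, new):           # step (510)
    # splice replacement audio over the original segment spans
    return audio_track  # splicing elided in this sketch

def auto_dub(audio_track):
    segments = extract_speeches(audio_track)
    voice_print = build_voice_print(segments)
    replacements = generate_replacements(segments, voice_print)
    return replace_in_track(audio_track, segments, replacements)

auto_dub(b"\x00" * 64000)
```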

METHOD AND APPARATUS FOR SPEECH SYNTHESIS, AND STORAGE MEDIUM

A method for speech synthesis includes obtaining text to be synthesized and an identifier of a speaker, the text being written in a first language; obtaining pronunciation information of each character in the text; generating linguistic features of the text by performing feature extraction on the pronunciation information of each character in the text based on the first language; and obtaining a target speech in a second language other than the first language, by performing speech synthesis based on the linguistic features and the identifier of the speaker.
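
The claimed steps chain naturally: per-character pronunciation lookup in the first language, linguistic feature extraction based on that language, then synthesis conditioned on the speaker identifier to produce speech in the second language. The sketch below shows only that data flow; every function is a stub and the feature layout is an assumption.

```python
def pronunciation_of(char, language):
    # e.g. pinyin lookup when the first language is Chinese (assumed)
    return {"char": char, "phonemes": ["?"], "lang": language}

def linguistic_features(pronunciations, language):
    # feature extraction based on the first language's phonology
    return [{"phonemes": p["phonemes"], "lang": language} for p in pronunciations]

def synthesize(features, speaker_id, target_language):
    return f"<speech lang={target_language} speaker={speaker_id}>"

def cross_lingual_tts(text, speaker_id, first_language="zh", second_language="en"):
    prons = [pronunciation_of(c, first_language) for c in text]
    feats = linguistic_features(prons, first_language)
    return synthesize(feats, speaker_id, second_language)

print(cross_lingual_tts("你好", speaker_id=7))
```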

SEMI-STRUCTURED CONTENT AWARE BI-DIRECTIONAL TRANSFORMER
20220358906 · 2022-11-10 ·

A method, computer system, and a computer program product for natural language processing are provided. A first text corpus that includes semi-structured content that includes hierarchical nodes may be received. Some of the hierarchical nodes may be masked. Node embeddings and level embeddings may be generated from the semi-structured content of the first text corpus and from the masked hierarchical nodes. The node embeddings and the level embeddings may be included in a bi-directional transformer model. The bi-directional transformer model may be trained on the first text corpus by reducing loss from the bi-directional transformer model predicting the masked hierarchical nodes.
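 
This is masked-node pretraining in the style of a masked language model, with a level embedding added to each node embedding to encode hierarchy depth. The PyTorch sketch below illustrates that recipe under assumed sizes; the vocabulary, depth count, dimensions, and mask id are all stand-ins, not values from the source.

```python
import torch
import torch.nn as nn

VOCAB, LEVELS, DIM, MASK_ID = 1000, 8, 64, 0   # assumed sizes; id 0 reserved for [MASK]

node_emb = nn.Embedding(VOCAB, DIM)    # node embeddings
level_emb = nn.Embedding(LEVELS, DIM)  # level (hierarchy depth) embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(DIM, VOCAB)           # predicts the identity of masked nodes

def masked_node_loss(node_ids, level_ids, mask):
    """mask: bool tensor, True where a node is replaced by MASK_ID."""
    inputs = node_ids.masked_fill(mask, MASK_ID)
    hidden = encoder(node_emb(inputs) + level_emb(level_ids))  # bi-directional attention
    logits = head(hidden)
    # loss only over masked positions: the model must recover the hidden nodes
    return nn.functional.cross_entropy(logits[mask], node_ids[mask])

nodes = torch.randint(1, VOCAB, (2, 16))
levels = torch.randint(0, LEVELS, (2, 16))
mask = torch.rand(2, 16) < 0.15
mask[0, 0] = True                      # ensure at least one masked position
print(masked_node_loss(nodes, levels, mask))
```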

TRANSLATIONAL BOT FOR GROUP COMMUNICATION
20230099757 · 2023-03-30 ·

The present disclosure is directed to systems, methods and devices for providing real-time translation for group communications. A speech input may be received from a first group communication device associated with a first language. One or more groups to which to distribute the speech input may be determined, wherein each of the one or more groups comprises at least one group communication device associated with a language that is different than the first language. The received speech input may be translated into the corresponding language of each of the one or more groups, and the translated speech may be sent to each group communication device of a group in that group's corresponding language.
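
A natural implementation detail is to translate once per target language and then fan out to every device in the matching group. The sketch below assumes that design; the group table, `translate`, and `send` are stubs, not the disclosure's components.

```python
GROUPS = {
    "team-fr": {"language": "fr", "devices": ["dev-1", "dev-2"]},
    "team-de": {"language": "de", "devices": ["dev-3"]},
}

def translate(text, source_lang, target_lang):
    return f"[{source_lang}->{target_lang}] {text}"   # placeholder translation

def send(device, payload):
    print(f"{device}: {payload}")

def distribute(speech_text, source_lang="en"):
    # deliver to every group whose language differs from the source language
    for group in GROUPS.values():
        if group["language"] == source_lang:
            continue
        translated = translate(speech_text, source_lang, group["language"])
        for device in group["devices"]:
            send(device, translated)

distribute("The meeting starts at noon.")
```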

METHOD, DEVICE, AND NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM TO PROVIDE AUDIO ENGAGEMENT SERVICE FOR COLLECTING PRONUNCIATIONS BY ACCENT
20230095928 · 2023-03-30 ·

A method, device and non-transitory computer-readable recording medium provide an audio engagement service for collecting pronunciations by accent. An audio engagement service method includes setting an accent indicating a country, an ethnic group, or a geographic region of a participant; generating pronunciation content of the accent from audio of the participant uttering a given example text; and providing the pronunciation content together with information on the accent.
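
As a data-model sketch only: each submitted recording is tagged with the participant's accent and the example text it pronounces, and the service returns the content together with the accent information. The types and the accent label format below are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class PronunciationContent:
    accent: str            # e.g. a country, ethnic group, or geographic region
    example_text: str
    audio: bytes

@dataclass
class AccentCollector:
    corpus: list = field(default_factory=list)

    def submit(self, accent, example_text, audio):
        """Store a recording as pronunciation content of the given accent."""
        content = PronunciationContent(accent, example_text, audio)
        self.corpus.append(content)
        return content     # provided together with the accent information

collector = AccentCollector()
item = collector.submit("en-IE", "The quick brown fox", b"\x00\x01")
print(item.accent, item.example_text)
```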