G10L13/02

Multilingual speech synthesis and cross-language voice cloning

A method includes receiving an input text sequence to be synthesized into speech in a first language and obtaining a speaker embedding, the speaker embedding specifying voice characteristics of a target speaker for synthesizing the input text sequence into speech that clones the voice of the target speaker. The target speaker is a native speaker of a second language different from the first language. The method also includes generating, using a text-to-speech (TTS) model, an output audio feature representation of the input text sequence by processing the input text sequence and the speaker embedding. The output audio feature representation includes the voice characteristics of the target speaker specified by the speaker embedding.
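
The following is a minimal sketch, in Python, of the data flow described in this abstract: a speaker embedding obtained from the target speaker conditions a TTS model alongside the input text sequence to yield an output audio feature representation (here, a mel-spectrogram-like array). All function names, dimensions, and placeholder computations are hypothetical, not the patented implementation.

```python
import numpy as np

def get_speaker_embedding(reference_audio: np.ndarray, dim: int = 256) -> np.ndarray:
    """Hypothetical speaker encoder: maps reference audio from the target
    speaker (a native speaker of the second language) to a fixed-size
    embedding capturing that speaker's voice characteristics."""
    # Placeholder: a real encoder would be a trained neural network.
    seed = abs(hash(reference_audio.tobytes())) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def synthesize(text: str, speaker_embedding: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Hypothetical multilingual TTS model: consumes the input text sequence
    (first language) plus the speaker embedding and returns an output audio
    feature representation in the target speaker's voice."""
    n_frames = max(1, len(text)) * 5              # rough frame count per character
    features = np.zeros((n_frames, n_mels))
    # Placeholder conditioning: add part of the embedding to every frame.
    features += speaker_embedding[:n_mels]
    return features

# Usage: clone an English speaker's voice for Spanish input text.
reference = np.zeros(16000, dtype=np.float32)     # 1 s of reference audio
embedding = get_speaker_embedding(reference)
mel = synthesize("Hola, ¿cómo estás?", embedding)
print(mel.shape)                                  # (frames, 80)
```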

Methods and systems for pushing audiovisual playlist based on text-attentional convolutional neural network
11580979 · 2023-02-14

In some embodiments, methods and systems for pushing audiovisual playlists based on a text-attentional convolutional neural network include a local voice interactive terminal, a dialog system server, and a playlist recommendation engine, where the dialog system server and the playlist recommendation engine are each connected to the local voice interactive terminal. In some embodiments, the local voice interactive terminal includes a microphone array, a host computer connected to the microphone array, and a voice synthesis chip board connected to the microphone array. In some embodiments, the playlist recommendation engine obtains rating data from a rating predictor constructed with the neural network; the host computer parses the data into recommended playlist information; and the voice interactive terminal synthesizes the result and pushes it to the user in the form of voice.
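
A minimal sketch of the pipeline described above, assuming a simple term-overlap score stands in for the text-attentional CNN rating predictor; the Track class and all function names are hypothetical, and the voice synthesis step is simulated by returning the text that would be spoken.

```python
from dataclasses import dataclass

@dataclass
class Track:
    title: str
    description: str

def predict_rating(user_query: str, track: Track) -> float:
    """Stand-in for the text-attentional CNN rating predictor: scores how
    well a track's text matches the user's spoken request."""
    query_terms = set(user_query.lower().split())
    track_terms = set((track.title + " " + track.description).lower().split())
    return len(query_terms & track_terms) / max(1, len(query_terms))

def recommend_playlist(user_query: str, catalog: list[Track], k: int = 3) -> list[Track]:
    """Playlist recommendation engine: ranks tracks by predicted rating."""
    return sorted(catalog, key=lambda t: predict_rating(user_query, t), reverse=True)[:k]

def push_as_voice(playlist: list[Track]) -> str:
    """Host computer parses the recommendation into text; a voice synthesis
    chip board (simulated here by returning the prompt) would speak it."""
    titles = ", ".join(t.title for t in playlist)
    return f"Here is your playlist: {titles}"

catalog = [Track("Rainy Jazz", "calm piano jazz for rainy evenings"),
           Track("Morning Run", "upbeat electronic workout mix"),
           Track("Focus Flow", "calm instrumental music for studying")]
print(push_as_voice(recommend_playlist("calm music for a rainy evening", catalog)))
```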

CORRECTION METHOD OF SYNTHESIZED SPEECH SET FOR HEARING AID
20230038118 · 2023-02-09

A method for correcting a synthesized speech set for a hearing aid, according to an aspect of the present invention, includes the steps of: outputting first synthesized speech for testing on the basis of first synthesized speech data for testing correlated with a first phoneme label in a synthesized speech set for testing; accepting a first answer selected by a user; outputting second synthesized speech for testing on the basis of second synthesized speech data for testing correlated with a second phoneme label in the synthesized speech set for testing; accepting a second answer selected by the user; and, in a case in which the first answer matches the second phoneme label and the second answer does not match the second phoneme label, correlating first synthesized speech data for hearing aid with the second phoneme label, instead of second synthesized speech data for hearing aid, in a synthesized speech set for hearing aid.
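
The correction rule in this abstract reduces to a simple conditional remapping. The sketch below assumes the speech sets are represented as dictionaries keyed by phoneme label; all names and data are illustrative, not the claimed implementation.

```python
def correct_hearing_aid_set(aid_set: dict[str, bytes],
                            first_label: str,
                            second_label: str,
                            answers: dict[str, str]) -> dict[str, bytes]:
    """Illustrative sketch of the correction rule described above.

    aid_set maps phoneme labels to hearing-aid synthesized speech data.
    answers maps each tested phoneme label to the phoneme the user reported
    hearing for that label's test speech. If the user heard the *second*
    phoneme when the first label's test speech was played, but did not hear
    the second phoneme for its own test speech, the hearing-aid set
    re-associates the second label with the first label's speech data."""
    first_answer = answers[first_label]
    second_answer = answers[second_label]
    corrected = dict(aid_set)
    if first_answer == second_label and second_answer != second_label:
        corrected[second_label] = aid_set[first_label]
    return corrected

# Example: the user hears "pa" when "ba" is played, and mishears "pa" itself.
aid_set = {"ba": b"<ba aid audio>", "pa": b"<pa aid audio>"}
answers = {"ba": "pa", "pa": "ta"}
print(correct_hearing_aid_set(aid_set, "ba", "pa", answers))
```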

METHOD AND APPARATUS FOR PROCESSING VIRTUAL VIDEO LIVESTREAMING, STORAGE MEDIUM AND ELECTRONIC DEVICE
20230039789 · 2023-02-09

A method includes: receiving text data and motion data of a virtual object, the motion data including a motion identifier of a specified motion and a start position identifier indicating the position in the text data at which the specified motion starts, so that the motion is aligned with the text; generating audio data and expression data of the virtual object according to the text data, and generating facial images of the virtual object according to the expression data; generating a background image sequence containing the specified motion according to the start position identifier and the motion identifier, the background image sequence including at least one background image; performing image fusion processing on the facial images and the at least one background image to obtain one or more live video frames; and synthesizing the live video frames with the audio data into a live video stream in real time.
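
A minimal sketch of the pipeline described above, with placeholder generators and renderers; the array shapes, frame rates, and function names are assumptions rather than details from the application. A real system would additionally mux the fused frames and audio into a live stream.

```python
import numpy as np

def text_to_audio_and_expression(text: str) -> tuple[np.ndarray, list[np.ndarray]]:
    """Stand-in for TTS + expression generation: returns audio samples and
    one expression vector per video frame."""
    n_frames = max(1, len(text) // 2)
    audio = np.zeros(n_frames * 640, dtype=np.float32)    # ~25 fps at 16 kHz
    expressions = [np.zeros(52) for _ in range(n_frames)]  # e.g. blendshape weights
    return audio, expressions

def render_face(expression: np.ndarray) -> np.ndarray:
    """Stand-in renderer: expression vector -> facial image."""
    return np.zeros((256, 256, 3), dtype=np.uint8)

def background_sequence(motion_id: str, start_pos: int, n_frames: int) -> list[np.ndarray]:
    """Background images containing the specified motion, aligned so the
    motion begins at the frame corresponding to the text start position."""
    return [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(n_frames)]

def fuse(face: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Image fusion: paste the facial image onto the background frame."""
    frame = background.copy()
    frame[:face.shape[0], :face.shape[1]] = face
    return frame

def build_live_stream(text: str, motion_id: str, start_pos: int):
    audio, expressions = text_to_audio_and_expression(text)
    backgrounds = background_sequence(motion_id, start_pos, len(expressions))
    frames = [fuse(render_face(e), b) for e, b in zip(expressions, backgrounds)]
    return frames, audio   # a real system would mux these into a live video stream

frames, audio = build_live_stream("Hello everyone, welcome to the stream.", "wave_hand", 0)
print(len(frames), audio.shape)
```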

Systems and Methods for Assisted Translation and Lip Matching for Voice Dubbing
20230039248 · 2023-02-09

Systems and methods are provided for generating candidate translations for use in creating synthetic or human-acted voice dubbings; aiding human translators in generating translations that match the corresponding video; automatically grading how well a candidate translation matches the corresponding video; suggesting modifications to the speed and/or timing of the translated text to improve the grade of a candidate translation; and suggesting modifications to the voice dubbing and/or video to improve the grade of a candidate translation. In that regard, the present technology may be used to fully automate the process of generating lip-matched translations and associated voice dubbings, or as an aid for human-in-the-loop processes that may reduce or eliminate the time and effort required from translators, adapters, voice actors, and/or audio editors to generate voice dubbings.
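
As one hedged illustration of the grading and adjustment steps, the sketch below scores a candidate translation by comparing its estimated spoken duration against the duration of the original on-screen speech and derives a playback-speed suggestion from the same ratio; the speaking-rate constant and both function names are assumptions, and a real grader would also consider lip shapes and timing.

```python
def grade_translation(candidate: str, source_duration_s: float,
                      speaking_rate_cps: float = 15.0) -> float:
    """Hypothetical grading function: scores how well a candidate
    translation's estimated spoken duration matches the duration of the
    original speech (a simple proxy for lip matching), on a 0..1 scale."""
    estimated_duration = len(candidate) / speaking_rate_cps
    mismatch = abs(estimated_duration - source_duration_s) / source_duration_s
    return max(0.0, 1.0 - mismatch)

def suggest_speed_adjustment(candidate: str, source_duration_s: float,
                             speaking_rate_cps: float = 15.0) -> float:
    """Suggests a speed factor for the dubbed audio so its length matches
    the video (values > 1 mean the line must be spoken or played faster)."""
    estimated_duration = len(candidate) / speaking_rate_cps
    return estimated_duration / source_duration_s

candidates = ["Buenos días a todos", "Muy buenos días a todas las personas presentes"]
for c in candidates:
    print(c,
          round(grade_translation(c, source_duration_s=1.5), 2),
          round(suggest_speed_adjustment(c, source_duration_s=1.5), 2))
```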

System answering of user inputs
11556575 · 2023-01-17

Techniques for structuring knowledge bases specific to a user or group of users, and techniques for using those knowledge bases to answer user inputs, are described. A knowledge base may be populated with information provided by users associated with the knowledge base. Users associated with a knowledge base may proactively provide content to the knowledge base, and/or a system may solicit an answer to a user input from users associated with a particular knowledge base. When the system receives an answer, the system may populate the knowledge base with the answer and may output the answer to the user that originated the user input. The system may deliver user inputs to be answered via messages or by establishing two-way communication sessions.
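
A minimal sketch of the described flow, assuming a per-group knowledge base stored as a dictionary: a user input is answered from the knowledge base when possible; otherwise an answer is solicited from associated users (simulated here), stored, and returned to the originating user. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Per-user (or per-group) knowledge base populated by its members."""
    members: set[str]
    answers: dict[str, str] = field(default_factory=dict)

    def add_answer(self, question: str, answer: str) -> None:
        self.answers[question.lower()] = answer

    def lookup(self, question: str) -> str | None:
        return self.answers.get(question.lower())

def solicit_answer_from_members(members: set[str], question: str) -> str:
    # Placeholder: a real system would message members or open a two-way
    # communication session and wait for a reply.
    return f"(answer provided by {sorted(members)[0]})"

def handle_user_input(kb: KnowledgeBase, user: str, question: str) -> str:
    """Answer from the knowledge base if possible; otherwise solicit an
    answer, store it, and return it to the originating user (`user`)."""
    answer = kb.lookup(question)
    if answer is None:
        answer = solicit_answer_from_members(kb.members, question)
        kb.add_answer(question, answer)
    return answer

kb = KnowledgeBase(members={"alex", "sam"})
print(handle_user_input(kb, "sam", "What is the Wi-Fi password?"))
print(handle_user_input(kb, "alex", "what is the wi-fi password?"))  # answered from the KB
```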