G10L13/027

Automatic Voiceover Generation

A method includes receiving a voice request to generate synthesized voiceover speech for a target advertisement having one or more advertising campaign attributes. The method also includes generating, based on the one or more advertising campaign attributes, a voiceover script that includes a sequence of text for the synthesized voiceover speech. The method also includes generating, using a text-to-speech (TTS) system, the synthesized voiceover speech. The TTS system is configured to receive, as input, the sequence of text for the voiceover script and generate, as output, the synthesized voiceover speech. Here, the synthesized voiceover speech has speech characteristics specified by a target TTS vertical. The method also includes overlaying the synthesized voiceover speech on the target advertisement.
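
A minimal Python sketch of the described pipeline; the function names, attribute fields, and TTS interface are illustrative assumptions rather than details from the abstract:

```python
from dataclasses import dataclass

@dataclass
class VoiceRequest:
    # Illustrative request: campaign attributes plus the ad's audio track.
    campaign_attributes: dict
    ad_audio: list

def generate_script(attributes: dict) -> str:
    """Assumed script generator: derive the voiceover text from the
    advertising campaign attributes."""
    return f"Introducing {attributes['product']}, made for {attributes['audience']}."

def tts_synthesize(text: str, vertical: str) -> list:
    """Stand-in for the TTS system: takes the script text as input and returns
    synthesized speech whose characteristics follow the target TTS vertical."""
    return [f"<speech style={vertical}>", text]

def overlay(ad_audio: list, voiceover: list) -> list:
    # Assumed mixing step: lay the synthesized voiceover over the ad track.
    return ad_audio + voiceover

request = VoiceRequest(
    campaign_attributes={"product": "SolarKettle", "audience": "campers",
                         "vertical": "outdoor"},
    ad_audio=["<ad-track>"],
)
script = generate_script(request.campaign_attributes)
speech = tts_synthesize(script, request.campaign_attributes["vertical"])
print(overlay(request.ad_audio, speech))
```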

VOICE COMMUNICATION BETWEEN A SPEAKER AND A RECIPIENT OVER A COMMUNICATION NETWORK

Voice communication between a speaker and a recipient, either or both of whom may be in a motor vehicle, is provided via a communication network. In a first step, an input speech utterance is received from the speaker. Optionally, the bandwidth of the connection to the communication network is evaluated on the speaker's side. The input speech utterance is then converted to text. At least the text is transmitted over the communication network; if the bandwidth is sufficiently large, the input speech utterance may be transmitted as voice as well as text. The transmitted text is converted into an output speech utterance that simulates the voice of the speaker. Finally, the output speech utterance is provided to the recipient.
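
A sketch of the speaker-side and recipient-side logic under stated assumptions; the 64 kbps threshold and all function names are invented for illustration:

```python
def transmit(utterance_audio: bytes, bandwidth_kbps: float, stt) -> dict:
    """Speaker side. Always convert speech to text and send the text; if the
    measured bandwidth is large enough (threshold is an assumption), send the
    original voice as well."""
    payload = {"text": stt(utterance_audio)}
    if bandwidth_kbps >= 64:          # assumed "sufficiently large" threshold
        payload["voice"] = utterance_audio
    return payload

def receive(payload: dict, tts_in_speaker_voice) -> bytes:
    """Recipient side: play the transmitted voice if present, otherwise
    re-synthesize the text with a TTS model simulating the speaker's voice."""
    return payload.get("voice") or tts_in_speaker_voice(payload["text"])

# Tiny demo with stand-in converters.
stt = lambda audio: "turn left ahead"
tts = lambda text: f"<{text} in speaker's voice>".encode()
print(receive(transmit(b"<waveform>", bandwidth_kbps=32, stt=stt), tts))
```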

BRAIN COMPUTER INTERFACE RUNNING A TRAINED ASSOCIATIVE MODEL APPLYING MULTIWAY REGRESSION TO SIMULATE ELECTROCORTICOGRAPHY SIGNAL FEATURES FROM SENSED EEG SIGNALS, AND CORRESPONDING METHOD

A brain computer interface (BCI) comprises an input adapted to be connected to at least one electroencephalography (EEG) sensor to receive EEG signals, a processor running an associative model trained to simulate electrocorticography (ECoG) signal features from the EEG signals received via the input, and an output to transmit the simulated ECoG signal features.
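
A toy numpy sketch of such an associative model, using ridge regression as a simple stand-in for the multiway regression named in the title; the shapes and channel counts are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented shapes: 1000 time windows, 32 EEG feature channels in,
# 16 ECoG signal features out.
X_eeg = rng.normal(size=(1000, 32))                          # sensed EEG features
W_true = rng.normal(size=(32, 16))
Y_ecog = X_eeg @ W_true + 0.1 * rng.normal(size=(1000, 16))  # target ECoG features

# Fit the associative model by regularized least squares.
lam = 1.0
W = np.linalg.solve(X_eeg.T @ X_eeg + lam * np.eye(32), X_eeg.T @ Y_ecog)

def simulate_ecog(eeg_window: np.ndarray) -> np.ndarray:
    """BCI output: simulated ECoG signal features for one EEG window."""
    return eeg_window @ W

print(simulate_ecog(X_eeg[0]).shape)   # (16,)
```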

INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING SYSTEM, AND INFORMATION PROCESSING METHOD
20230026093 · 2023-01-26

An information processing apparatus (100) includes an acquisition unit (132) that acquires, from a storage unit (120) storing episode data (D1) of a speaker, the episode data (D1) related to topic information included in utterance data (D30) of the speaker, and an interaction control unit (133) that controls an interaction with the speaker so that the interaction includes an episode based on the episode data (D1).
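
A minimal sketch of the acquisition and interaction-control units; the keyword-based topic matcher and all names are assumptions standing in for the real topic extraction:

```python
from typing import Optional

# Storage unit: episode data of the speaker, keyed by topic (illustrative).
EPISODE_STORE = {
    "travel": "Last summer you mentioned hiking in the Alps.",
    "music":  "You said you play the piano on weekends.",
}

def acquire_episode(utterance: str) -> Optional[str]:
    """Acquisition unit (assumed matcher): look up episode data for a topic
    mentioned in the speaker's utterance."""
    topic = next((t for t in EPISODE_STORE if t in utterance.lower()), None)
    return EPISODE_STORE.get(topic) if topic else None

def interaction_control(utterance: str) -> str:
    """Interaction control unit: shape the reply so it includes the episode."""
    episode = acquire_episode(utterance)
    return f"{episode} How did that go?" if episode else "Tell me more."

print(interaction_control("I want to talk about travel plans."))
```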

Virtual Conversational Agent
20230026945 · 2023-01-26

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating and operating voice-conversing virtual agents with pre-modeled and inherited human behavior across use cases and domains. One of the methods includes: using a first, non-domain-specific neural network based model to predict a non-domain-specific conversational situation, the first model trained with labelled parts of conversations from more than one domain; forwarding the non-domain-specific conversational situation to a second, domain-specific neural network based model; using the second model to predict a conversational situation and to provide a system intent, the second model trained with labelled parts of conversations from a specified domain; and generating a response based at least in part on the predicted conversational situation and system intent.
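
A stub-level sketch of the two-stage flow; the situation labels, intents, and rules below are invented placeholders for the two trained neural models:

```python
def non_domain_model(conversation_part: str) -> str:
    """Stage 1 stub: a non-domain-specific model, trained on labelled parts of
    conversations from many domains, predicts a conversational situation."""
    return "user_frustrated" if "not working" in conversation_part else "neutral"

def domain_model(situation: str):
    """Stage 2 stub: a domain-specific model refines the situation and
    provides a system intent."""
    if situation == "user_frustrated":
        return "support_escalation", "offer_troubleshooting"
    return "smalltalk", "ask_clarifying_question"

def generate_response(utterance: str) -> str:
    situation = non_domain_model(utterance)        # forwarded to stage 2
    refined_situation, intent = domain_model(situation)
    return f"({refined_situation}, {intent}) Let's sort this out together."

print(generate_response("My router is not working again."))
```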

APPARATUS, METHOD, AND COMPUTER PROGRAM FOR PROVIDING LIP-SYNC VIDEO AND APPARATUS, METHOD, AND COMPUTER PROGRAM FOR DISPLAYING LIP-SYNC VIDEO
20230023102 · 2023-01-26

Provided is a lip-sync video providing apparatus for providing a video in which a voice and lip shapes are synchronized. The lip-sync video providing apparatus is configured to obtain a template video including at least one frame and depicting a target object, obtain a target voice to be used as the voice of the target object, generate a lip image corresponding to the voice for each frame of the template video by using a trained first artificial neural network, and generate lip-sync data including identification information of a frame in the template video, the lip image, and position information for the lip image within that frame.
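
A sketch of the per-frame lip-sync data the abstract describes; LipSyncDatum, lip_model, and locate_lips are illustrative names, not from the source:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class LipSyncDatum:
    """Per-frame lip-sync data as described in the abstract."""
    frame_id: int               # identifies the frame in the template video
    lip_image: bytes            # lip image generated for this frame
    position: Tuple[int, int]   # where the lip image sits within the frame

def build_lip_sync_data(template_frames: List[bytes], target_voice: bytes,
                        lip_model: Callable, locate_lips: Callable) -> List[LipSyncDatum]:
    """Sketch of the providing apparatus; lip_model (the trained first neural
    network) and locate_lips are stand-ins for the real components."""
    data = []
    for frame_id, frame in enumerate(template_frames):
        lip_image = lip_model(frame, target_voice)  # lip shape matching the voice
        data.append(LipSyncDatum(frame_id, lip_image, locate_lips(frame)))
    return data

# Demo with trivial stand-ins.
frames = [b"f0", b"f1"]
data = build_lip_sync_data(frames, b"<voice>",
                           lip_model=lambda f, v: b"<lips>",
                           locate_lips=lambda f: (120, 300))
print(data[0])
```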

END-TO-END SPEECH CONVERSION

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end-to-end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.
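
The model's interface can be sketched as a single audio-to-audio function; the stand-in below only illustrates the shape of the data flow (no real conversion happens):

```python
def conversion_model(audio_in_first_voice: bytes) -> bytes:
    """Stand-in for the trained model: audio in the user's voice goes in,
    audio of the same terms in a synthesized voice comes out. No speech
    recognition happens anywhere on this path."""
    return b"<same terms, synthesized voice>"  # placeholder output waveform

def convert(first_audio: bytes) -> bytes:
    # Receive first audio data, feed it to the model, and provide the
    # second audio data for output.
    return conversion_model(first_audio)

print(convert(b"<utterance waveform>"))
```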

TEXT-TO-SPEECH SYNTHESIS METHOD AND SYSTEM, AND A METHOD OF TRAINING A TEXT-TO-SPEECH SYNTHESIS SYSTEM

A text-to-speech synthesis method includes receiving text, inputting the received text into a synthesizer that includes a prediction network configured to convert the received text into speech data having a speech attribute that includes emotion, intention, projection, pace, and/or accent, and outputting said speech data. The prediction network is obtained as follows. A first sub-dataset and a second sub-dataset are obtained, each comprising audio samples and corresponding text, where the speech attribute of the audio samples of the second sub-dataset is more pronounced than that of the first sub-dataset. A first model is trained using the first sub-dataset until a performance metric reaches a first predetermined value. A second model is then trained by further training the first model using the second sub-dataset until the performance metric reaches a second predetermined value. Finally, one trained model is selected as the prediction network.
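
A schematic sketch of the two-stage training procedure; train_until, the metric's direction (assumed higher-is-better), and the selection rule are assumptions for illustration:

```python
import copy

def train_until(model: dict, dataset, target: float, train_step, evaluate) -> dict:
    """Train until the performance metric reaches the target (assumed here to
    increase with training; the abstract does not fix the metric's direction)."""
    while evaluate(model, dataset) < target:
        train_step(model, dataset)
    return model

def build_prediction_network(sub1, sub2, first_target, second_target,
                             new_model, train_step, evaluate, select):
    first = train_until(new_model(), sub1, first_target, train_step, evaluate)
    # The second model is obtained by *further* training the first model,
    # now on the sub-dataset with the more pronounced speech attribute.
    second = train_until(copy.deepcopy(first), sub2, second_target,
                         train_step, evaluate)
    return select([first, second])  # one trained model becomes the prediction network

# Demo with a stand-in "model" that just accumulates a score per step.
net = build_prediction_network(
    sub1=None, sub2=None, first_target=0.5, second_target=0.9,
    new_model=lambda: {"score": 0.0},
    train_step=lambda m, d: m.update(score=m["score"] + 0.1),
    evaluate=lambda m, d: m["score"],
    select=lambda models: models[-1],
)
print(net)  # metric has reached the second predetermined value
```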