G10L25/90

Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy

Techniques for the generation of dubbed audio for an audio/video file are described. An exemplary approach is to receive a request to generate dubbed speech for an audio/visual file; and in response to the request to: extract speech segments from an audio track of the audio/visual file associated with identified speakers; translate the extracted speech segments into a target language; determine a machine learning model per identified speaker, the trained machine learning models to be used to generate a spoken version of the translated, extracted speech segments based on the identified speaker; generate, per translated, extracted speech segment, a spoken version of the translated, extracted speech segments using a trained machine learning model that corresponds to the identified speaker of the translated, extracted speech segment and prosody information for the extracted speech segments; and replace the extracted speech segments from the audio track of the audio/visual file with the spoken versions of the translated, extracted speech segments to generate a modified audio track.
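The dubbing flow described above (extract segments per speaker, translate, synthesize with a per-speaker model plus prosody, substitute back) can be sketched as a minimal toy pipeline. All names here are hypothetical stand-ins: `translate` uses a lookup table instead of a real MT model, and `synthesize` returns a placeholder string instead of audio samples.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    text: str
    start: float
    end: float

def translate(text, target_lang):
    # Stub translator: a real system would call a machine translation model.
    table = {"hello": "hola", "goodbye": "adios"}
    return table.get(text, text)

def synthesize(model, text, prosody):
    # Stub TTS: a real per-speaker model would emit audio samples.
    return f"<audio:{model}:{text}:rate={prosody['rate']}>"

def dub(segments, models, target_lang="es"):
    # For each extracted segment: translate, then synthesize with the
    # model matching that segment's identified speaker, carrying prosody
    # (here just the segment's duration) from the original speech.
    dubbed = []
    for seg in segments:
        translated = translate(seg.text, target_lang)
        prosody = {"rate": seg.end - seg.start}
        audio = synthesize(models[seg.speaker], translated, prosody)
        dubbed.append((seg, audio))
    return dubbed

segments = [Segment("alice", "hello", 0.0, 1.2), Segment("bob", "goodbye", 1.5, 2.4)]
models = {"alice": "tts_alice_es", "bob": "tts_bob_es"}
result = dub(segments, models)
```

In the claimed approach, the synthesized outputs would then replace the original segments in the audio track; the sketch stops at producing the per-segment spoken versions.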

Processing speech signals in voice-based profiling
11538472 · 2022-12-27

This document describes a data processing system for processing a speech signal for voice-based profiling. The data processing system segments the speech signal into a plurality of segments, with each segment representing a portion of the speech signal. For each segment, the data processing system generates a feature vector comprising data indicative of one or more features of the portion of the speech signal represented by that segment and determines whether the feature vector comprises data indicative of one or more features with a threshold amount of confidence. For each of a subset of the generated feature vectors, the system processes data in that feature vector to generate a prediction of a value of a profile parameter and transmits an output responsive to machine executable code that generates a visual representation of the prediction of the value of the profile parameter.
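The segment-then-threshold flow above can be illustrated with a minimal sketch. The features and the confidence rule here are invented for illustration (mean absolute amplitude, zero-crossing count, and a toy confidence derived from energy); the actual system would use its own feature extractors and confidence estimates.

```python
import statistics

def segment_signal(signal, seg_len):
    # Split the speech signal into fixed-length segments.
    return [signal[i:i + seg_len] for i in range(0, len(signal), seg_len)]

def feature_vector(segment):
    # Toy features: mean absolute amplitude and zero-crossing count,
    # plus a toy confidence score for the extracted features.
    energy = sum(abs(x) for x in segment) / len(segment)
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    return {"energy": energy, "crossings": crossings,
            "confidence": min(1.0, energy * 4)}

def predict_profile(vectors, threshold=0.5):
    # Keep only feature vectors extracted with at least the threshold
    # confidence, then predict a (toy) profile parameter value from them.
    confident = [v for v in vectors if v["confidence"] >= threshold]
    if not confident:
        return None
    return statistics.mean(v["energy"] for v in confident)

signal = [0.3, -0.4, 0.5, -0.2, 0.05, -0.05, 0.02, -0.01]
vectors = [feature_vector(s) for s in segment_signal(signal, 4)]
param = predict_profile(vectors)
```

The second (quiet) segment falls below the confidence threshold and is excluded from the subset used for prediction, mirroring how the described system restricts prediction to sufficiently confident feature vectors.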

Interaction system, non-transitory computer readable storage medium, and method for controlling interaction system
11538491 · 2022-12-27

An interaction system that interacts with a user is disclosed. The interaction system includes: an input device that receives a speech signal of the user; a computing device that determines a speech content of the interaction system for a speech content acquired from the speech signal of the user such that a frequency distribution of speech feature values of the speech content of the interaction system approaches an ideal frequency distribution; and an output device that outputs the determined speech content of the interaction system.
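The core idea, choosing system speech so that the frequency distribution of its speech feature values approaches an ideal distribution, can be sketched with a toy selection rule. The "tone" labels, the L1 distance, and the candidate responses are all illustrative assumptions, not the patent's actual feature values or distance measure.

```python
from collections import Counter

def normalize(counts):
    # Convert raw counts into a frequency distribution.
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def distance(hist_a, hist_b):
    # L1 distance between two normalized frequency distributions.
    keys = set(hist_a) | set(hist_b)
    return sum(abs(hist_a.get(k, 0) - hist_b.get(k, 0)) for k in keys)

def choose_response(candidates, history, ideal):
    # Pick the candidate whose feature value (here a coarse "tone" label)
    # moves the system's running distribution closest to the ideal one.
    best, best_d = None, float("inf")
    for text, tone in candidates:
        trial = Counter(history)
        trial[tone] += 1
        d = distance(normalize(trial), ideal)
        if d < best_d:
            best, best_d = text, d
    return best

history = Counter({"calm": 3, "excited": 1})   # past system speech features
ideal = {"calm": 0.5, "excited": 0.5}          # target distribution
candidates = [("Let's take it slow.", "calm"), ("That's amazing!", "excited")]
reply = choose_response(candidates, history, ideal)
```

Since the history is already calm-heavy, the excited candidate pulls the distribution toward the 50/50 ideal and is selected.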

PRIVATE SPEECH FILTERINGS
20220406315 · 2022-12-22

In some examples, an electronic device comprises an image sensor to detect a user action, an audio input device to receive an audio signal, and a processor coupled to the audio input device and the image sensor. The processor is to determine that the audio signal includes private speech based on the user action, remove the private speech from the audio signal to produce a filtered audio signal, and transmit the filtered audio signal.
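The removal step can be sketched very simply: given sample indices flagged as private, zero them out before transmission. How the spans are derived (correlating the image-sensor-detected user action with the audio timeline) is assumed here; the function below only shows the filtering itself.

```python
def filter_private_speech(samples, private_spans):
    # Replace samples inside flagged private spans with silence (zeros),
    # producing a filtered audio signal safe to transmit.
    filtered = list(samples)
    for start, end in private_spans:
        for i in range(start, min(end, len(filtered))):
            filtered[i] = 0
    return filtered

audio = [5, 7, 9, 2, 4, 6, 8, 1]
private = [(2, 5)]   # e.g., span where the user action signaled private speech
clean = filter_private_speech(audio, private)
```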

Systems and methods for noise cancellation

A computing device may receive audio data from a microphone representing audio in an environment of the device, which may correspond to an utterance and noise. A model may be trained to process the audio data to cancel noise from the audio data. The model may include an encoder that includes one or more dense layers, one or more recurrent layers, and a decoder that includes one or more dense layers.
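The dense-encoder / recurrent / dense-decoder structure can be shown with a tiny pure-Python forward pass. The layer sizes, weights, and tanh activations below are arbitrary illustrative choices; a real noise-cancellation model would be trained and would operate on actual audio frames.

```python
import math

def dense(vec, weights, bias):
    # One dense (fully connected) layer with tanh activation.
    return [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def rnn_step(x, h, w_x, w_h, bias):
    # A simple recurrent cell: new state mixes the input and previous state.
    return [math.tanh(sum(wx * xi for wx, xi in zip(rx, x)) +
                      sum(wh * hi for wh, hi in zip(rh, h)) + b)
            for rx, rh, b in zip(w_x, w_h, bias)]

def denoise(frames, params):
    # Encoder dense layer -> recurrent layer over time -> decoder dense layer.
    h = [0.0] * len(params["rnn_b"])
    out = []
    for frame in frames:
        enc = dense(frame, params["enc_w"], params["enc_b"])
        h = rnn_step(enc, h, params["rnn_wx"], params["rnn_wh"], params["rnn_b"])
        out.append(dense(h, params["dec_w"], params["dec_b"]))
    return out

# Tiny fixed parameters (2-d frames, 2-unit layers) just to show the data flow.
params = {
    "enc_w": [[0.5, -0.2], [0.1, 0.3]], "enc_b": [0.0, 0.0],
    "rnn_wx": [[0.4, 0.0], [0.0, 0.4]], "rnn_wh": [[0.2, 0.0], [0.0, 0.2]],
    "rnn_b": [0.0, 0.0],
    "dec_w": [[1.0, 0.0], [0.0, 1.0]], "dec_b": [0.0, 0.0],
}
frames = [[0.8, -0.1], [0.6, 0.2]]
cleaned = denoise(frames, params)
```

The recurrent state lets each output frame depend on earlier frames, which is what allows the model to track and suppress noise over time rather than frame by frame.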

Dynamic creation and insertion of content

In an aspect, during a presentation of a presentation material, viewers of the presentation material can be monitored. Based on the monitoring, new content can be determined for insertion into the presentation material. The new content can be automatically inserted into the presentation material in real time. In another aspect, during the presentation, a presenter of the presentation material can be monitored. The presenter's speech can be intercepted and analyzed to detect a level of confidence. Based on the detected level of confidence, the presenter's speech can be adjusted and the adjusted speech can be played back automatically, for example, in lieu of the presenter's intercepted original speech.
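Both aspects can be sketched as simple control rules. The engagement score, the 0.5/0.6 thresholds, and the placeholder content are all invented for illustration; the described system would use its own viewer-monitoring signals and speech-adjustment machinery.

```python
def engagement(viewer_signals):
    # Toy engagement score: fraction of viewers currently attentive (1/0).
    return sum(viewer_signals) / len(viewer_signals)

def maybe_insert_content(slides, index, viewer_signals, threshold=0.5):
    # First aspect: if monitored engagement drops below a threshold,
    # insert new content right after the current slide, in real time.
    if engagement(viewer_signals) < threshold:
        slides.insert(index + 1, "<recap/example inserted here>")
    return slides

def adjust_speech(utterance, confidence, floor=0.6):
    # Second aspect: if the detected confidence level is low, mark the
    # intercepted utterance for adjusted playback instead of the original.
    return utterance if confidence >= floor else f"[adjusted] {utterance}"

slides = ["intro", "method", "results"]
slides = maybe_insert_content(slides, 0, [1, 0, 0, 0])   # engagement 0.25
speech = adjust_speech("So, um, the results are good", confidence=0.4)
```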