G10L2021/0135

AUDIO REACTIVE AUGMENTED REALITY

Methods, systems, and storage media for augmenting a video are disclosed. Exemplary implementations may: receive a selection of an effect; receive user-generated content comprising video data and audio data; detect a characteristic of the audio data comprising at least a volume and/or a pitch of the audio data during a period of time; determine a series of numeric values based on the characteristic of the audio data during the period of time, individual numeric values of the series of numeric values being correlated with an amplitude of the volume and/or pitch at a discrete point within the period of time; and augment at least one of the video data and/or the audio data to include the effect based on the series of numeric values at discrete points in time within the period of time.
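The core of the abstract is deriving a series of numeric values from an audio characteristic (volume over a period of time) and using those values to drive an effect. A minimal sketch of that idea, using per-frame RMS energy as the volume characteristic and a normalized series as the effect parameter (function names and the 50 ms frame size are illustrative assumptions, not from the patent):

```python
import numpy as np

def volume_series(audio: np.ndarray, sample_rate: int, frame_ms: float = 50.0) -> np.ndarray:
    """Compute one RMS volume value per frame, i.e. at discrete points in time."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def effect_intensity(values: np.ndarray) -> np.ndarray:
    """Normalize the series to [0, 1] so each value can drive an effect parameter."""
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span
```

Each entry of the normalized series could then scale a visual effect (brightness, particle count, overlay size) at the corresponding video frame.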

SYSTEMS AND METHODS FOR ADAPTING HUMAN SPEAKER EMBEDDINGS IN SPEECH SYNTHESIS

Novel methods and systems for adapting a voice cloning synthesizer for a new speaker using real speech data are disclosed. Utterances from one or more target speakers are parameterized and are used to initialize an embedding vector for use with a voice synthesizer, by means of clustering the utterance data and determining the centroid of the data, using a speaker identification neural network, and/or by finding the closest stored embedded vector to the utterance data.
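Two of the initialization strategies named above, clustering utterance data to a centroid and finding the closest stored embedding, can be sketched directly (the function names are illustrative; real systems would compute utterance embeddings with a speaker encoder network):

```python
import numpy as np

def centroid_embedding(utterance_embeddings: np.ndarray) -> np.ndarray:
    """Initialize a speaker embedding as the centroid (mean) of per-utterance embeddings."""
    return utterance_embeddings.mean(axis=0)

def nearest_stored_embedding(query: np.ndarray, stored: np.ndarray) -> int:
    """Return the index of the stored embedding closest (Euclidean) to the query."""
    dists = np.linalg.norm(stored - query, axis=1)
    return int(dists.argmin())
```

The centroid gives a warm start for fine-tuning; the nearest stored vector gives a usable embedding with no adaptation at all.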

System and method for cross-speaker style transfer in text-to-speech and training data generation

Systems are configured for generating spectrogram data characterized by the voice timbre of a target speaker and the prosody style of a source speaker by converting a waveform of the source speaker data to phonetic posteriorgram (PPG) data, extracting additional prosody features from the source speaker data, and generating a spectrogram based on the PPG data and the extracted prosody features. The systems are configured to utilize/train a machine learning model for generating spectrogram data and for training a neural text-to-speech model with the generated spectrogram data.
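The data flow described above has three stages: speaker-independent phone posteriors (the PPG), per-frame prosody features, and a synthesizer that maps both to a spectrogram. A shape-level sketch with stand-in components (every function here is a placeholder for a trained neural module; the dimensions and constant pitch value are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ppg_from_waveform(waveform: np.ndarray, n_frames: int, n_phones: int = 40) -> np.ndarray:
    """Stand-in for the ASR front end: per-frame phone posteriors (rows sum to 1)."""
    logits = rng.normal(size=(n_frames, n_phones))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def prosody_features(waveform: np.ndarray, n_frames: int) -> np.ndarray:
    """Stand-in prosody extractor: per-frame pitch and energy (2 features)."""
    frame_len = len(waveform) // n_frames
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt((frames ** 2).mean(axis=1, keepdims=True))
    pitch = np.full((n_frames, 1), 120.0)  # placeholder constant F0 in Hz
    return np.hstack([pitch, energy])

def spectrogram_from(ppg: np.ndarray, prosody: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Stand-in synthesizer: project [PPG | prosody] frames to mel bins."""
    features = np.hstack([ppg, prosody])
    projection = rng.normal(size=(features.shape[1], n_mels))
    return features @ projection
```

Because the PPG discards timbre while the prosody features keep pitch and energy, the synthesizer conditioned on the target speaker can reproduce the source speaker's style in the target speaker's voice.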

Voice morphing apparatus having adjustable parameters
11600284 · 2023-03-07

A voice morphing apparatus having adjustable parameters is described. The disclosed system and method include a voice morphing apparatus that morphs input audio to mask a speaker's identity. Parameter adjustment uses evaluation of an objective function that is based on the input audio and output of the voice morphing apparatus. The voice morphing apparatus includes objectives that are based adversarially on speaker identification and positively on audio fidelity. Thus, the voice morphing apparatus is adjusted to reduce identifiability of speakers while maintaining fidelity of the morphed audio. The voice morphing apparatus may be used as part of an automatic speech recognition system.
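The parameter adjustment described above optimizes two opposing terms: an adversarial speaker-identification term to be driven down and a fidelity term to be kept up. A minimal sketch of that objective and a parameter search over it (the helper names, the scalar scores, and the weighting factor `alpha` are assumptions for illustration):

```python
import numpy as np

def best_morph_setting(settings, identify, measure_fidelity, alpha=1.0):
    """Pick the morphing parameter setting that scores best under the objective.

    identify(s): probability a speaker-ID model still recognizes the true speaker.
    measure_fidelity(s): similarity between the input and the morphed audio.
    Both are penalized/rewarded in a single scalar score per setting.
    """
    scores = [measure_fidelity(s) - alpha * identify(s) for s in settings]
    return settings[int(np.argmax(scores))]
```

With toy score functions where stronger morphing hurts identification faster than it hurts fidelity, the search prefers the strongest setting; in practice both terms would come from trained models evaluated on the morphed audio.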

REAL-TIME SPEECH-TO-SPEECH GENERATION (RSSG) AND SIGN LANGUAGE CONVERSION APPARATUS, METHOD AND A SYSTEM THEREFORE
20220327294 · 2022-10-13

A real-time speech-to-speech generator and sign gestures converter system is disclosed. Communication remains challenging for deaf or hearing-impaired people. Embodiments of the invention provide a direct speech-to-speech translation system with further conversion to sign gestures. Direct speech-to-speech translation and sign gesture conversion use a one-tier approach, creating a unified model for the whole application. The single-model ecosystem takes audio (a MEL spectrogram) as input and gives audio (a MEL spectrogram) as output to a speech-sign converter device with a display. This solves the bottleneck problem by converting the translated speech from the first language directly to sign language gestures with emotion, preserving phonetic information along the way. The model needs parallel audio samples in two languages. The training methodology involves augmenting or changing both sides of the audio equally; the output is later converted to sign gestures, which are displayed on the speech-sign converter device.

Audio translator
11605369 · 2023-03-14

An audio translation system includes a feature extractor and a style transfer machine learning model. The feature extractor generates, for each of a plurality of source voice files, one or more source voice parameters encoded as a collection of source feature vectors, and generates, for each of a plurality of target voice files, one or more target voice parameters encoded as a collection of target feature vectors. The style transfer machine learning model is trained on the collection of source feature vectors for the plurality of source voice files and the collection of target feature vectors for the plurality of target voice files to generate a style-transformed feature vector.
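The extractor/training-set split above can be sketched in a few lines. Here a toy feature vector (per-segment RMS energy, an assumption standing in for real voice parameters such as MFCCs or pitch contours) is computed for each file, and the source/target collections are stacked into training pairs for the style transfer model:

```python
import numpy as np

def extract_feature_vector(voice: np.ndarray, n_segments: int = 4) -> np.ndarray:
    """Toy feature extractor: per-segment RMS energy as the voice parameter vector."""
    seg_len = len(voice) // n_segments
    segs = voice[: n_segments * seg_len].reshape(n_segments, seg_len)
    return np.sqrt((segs ** 2).mean(axis=1))

def build_training_pairs(source_files, target_files):
    """Stack source and target feature vectors into the two training collections."""
    src = np.stack([extract_feature_vector(f) for f in source_files])
    tgt = np.stack([extract_feature_vector(f) for f in target_files])
    return src, tgt
```

The style transfer model would then learn a mapping from rows of the source collection toward the distribution of the target collection.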

Speaker conversion for video games

This specification describes a computer-implemented method of generating speech audio for use in a video game, wherein the speech audio is generated using a voice convertor that has been trained to convert audio data for a source speaker into audio data for a target speaker. The method comprises receiving: (i) source speech audio, and (ii) a target speaker identifier. The source speech audio comprises speech content in the voice of a source speaker. Source acoustic features are determined for the source speech audio. A target speaker embedding associated with the target speaker identifier is generated as output of a speaker encoder of the voice convertor. The target speaker embedding and the source acoustic features are inputted into an acoustic feature encoder of the voice convertor. One or more acoustic feature encodings are generated as output of the acoustic feature encoder. The one or more acoustic feature encodings are derived from the target speaker embedding and the source acoustic features. Target speech audio is generated for the target speaker. The target speech audio comprises the speech content in the voice of the target speaker. The generating comprises decoding the one or more acoustic feature encodings using an acoustic feature decoder of the voice convertor.
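The conversion pipeline described above, look up a target speaker embedding, condition the acoustic encoder on it, then decode to target acoustics, can be sketched at the shape level. Every component here is a stand-in for a trained network, and the speaker table, dimensions, and names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

SPEAKER_EMBEDDINGS = {  # hypothetical table: one 8-d embedding per game character
    "narrator": rng.normal(size=8),
    "villain": rng.normal(size=8),
}

def encode(source_acoustics: np.ndarray, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stand-in acoustic feature encoder: condition each frame on the target embedding."""
    tiled = np.tile(speaker_embedding, (source_acoustics.shape[0], 1))
    return np.hstack([source_acoustics, tiled])

def decode(encodings: np.ndarray, n_out: int = 80) -> np.ndarray:
    """Stand-in acoustic feature decoder: project encodings to target acoustic features."""
    projection = rng.normal(size=(encodings.shape[1], n_out))
    return encodings @ projection

def convert(source_acoustics: np.ndarray, target_speaker_id: str) -> np.ndarray:
    """Full pipeline: speaker lookup, conditioned encoding, decoding."""
    emb = SPEAKER_EMBEDDINGS[target_speaker_id]
    return decode(encode(source_acoustics, emb))
```

Keeping the speaker identity as a lookup key is what lets one trained converter serve many in-game characters: only the embedding changes per target.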

Learning speech data generating apparatus, learning speech data generating method, and program

A training speech data generating apparatus includes: a voice conversion unit that converts speech data, using fourth noise data (noise data based on third noise data), so that the speech data is clearly audible under a noise environment corresponding to the fourth noise data; and a noise superimposition unit that obtains training speech data by superimposing the third noise data on the converted speech data.
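The noise superimposition step can be sketched as mixing noise into the (converted) speech at a chosen signal-to-noise ratio. The function name and the SNR-based scaling are illustrative assumptions; the abstract itself does not specify how the mixing level is set:

```python
import numpy as np

def superimpose_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech, scaled to hit a target signal-to-noise ratio in dB."""
    noise = noise[: len(speech)]
    speech_power = (speech ** 2).mean()
    noise_power = (noise ** 2).mean()
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Pairing the clean (or noise-robust converted) speech with its noisy mixture yields the kind of training example the apparatus produces.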

Generative voice for automated bot handoff to customer service representative

A system and method for applying a generative voice associated with a particular customer service representative to an automated bot that initially interacts with a customer to provide a seamless handoff between the automated bot and the particular customer service representative is described. In one embodiment, when a call from a customer is received at the customer service call center, the customer is matched with a potential customer service representative that is likely to handle the customer's call. The customer will then initially interact with an automated bot that has applied a generative voice associated with the likely customer service representative. The customer can talk with the automated bot using the generative voice and, if needed, when the call is handed off from the automated bot to the customer service representative, the customer will not notice a change in voice or other discontinuity on the call.

VOICE MODIFICATION FOR WEARABLE DEVICE

A method for audio modification for a wearable device in operative Bluetooth communication with a device includes receiving a sound signal at a microphone of the wearable device, encoding the sound signal at a digital signal processor of the wearable device using a codec to provide an encoded audio stream, routing the encoded audio stream from the digital signal processor over a Bluetooth parallel simultaneous transport channel to the device, modifying the encoded audio stream using an audio modification software application executing at a processor external to the wearable device to provide an encoded modified audio stream, communicating the encoded modified audio stream from the audio modification software application executing on the processor over the Bluetooth parallel simultaneous transport channel back to the wearable device, and communicating the encoded modified audio stream from the wearable device over a Bluetooth hands free profile (HFP) channel to the device.