G10L21/06

GENERATING AUDIO WAVEFORMS USING ENCODER AND DECODER NEURAL NETWORKS

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing an input audio waveform using a generator neural network to generate an output audio waveform. In one aspect, a method comprises: receiving an input audio waveform; processing the input audio waveform using an encoder neural network to generate a set of feature vectors representing the input audio waveform; and processing the set of feature vectors representing the input audio waveform using a decoder neural network to generate an output audio waveform that comprises a respective output audio sample for each of a plurality of output time steps.
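The encode-then-decode flow the abstract describes can be illustrated with a toy NumPy sketch: a stand-in "encoder" frames the input waveform and maps each frame to a feature vector, and a stand-in "decoder" maps each feature vector back to one output sample per output time step. The linear maps here are random placeholders for the actual neural networks, which the abstract does not specify.

```python
import numpy as np

def encode(waveform, frame_len=64):
    """Toy 'encoder': split the waveform into frames and project each
    frame to a feature vector with a fixed random linear map
    (illustrative stand-in for the encoder neural network)."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    rng = np.random.default_rng(0)
    w_enc = rng.standard_normal((frame_len, 16)) / np.sqrt(frame_len)
    return np.tanh(frames @ w_enc)          # (n_frames, 16) feature vectors

def decode(features, frame_len=64):
    """Toy 'decoder': map each feature vector back to frame_len output
    audio samples, one per output time step."""
    rng = np.random.default_rng(1)
    w_dec = rng.standard_normal((features.shape[1], frame_len)) / 4.0
    return (features @ w_dec).reshape(-1)   # flat output waveform

x = np.sin(2 * np.pi * 440 * np.arange(1024) / 16000)  # 440 Hz test tone
feats = encode(x)
y = decode(feats)
print(feats.shape, y.shape)  # (16, 16) (1024,)
```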

Hotword-based speaker recognition
11557301 · 2023-01-17

Systems, methods performed by data processing apparatus and computer storage media encoded with computer programs for receiving an utterance from a user in a multi-user environment, each user having an associated set of available resources, determining that the received utterance includes at least one predetermined word, comparing speaker identification features of the uttered predetermined word with speaker identification features of each of a plurality of previous utterances of the predetermined word, the plurality of previous predetermined word utterances corresponding to different known users in the multi-user environment, attempting to identify the user associated with the uttered predetermined word as matching one of the known users in the multi-user environment, and based on a result of the attempt to identify, selectively providing the user with access to one or more resources associated with a corresponding known user.
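The comparison-and-selective-access step can be sketched as scoring the hotword's speaker-identification features against each known user's enrolled features. Cosine similarity and a fixed threshold are assumptions for illustration; the abstract does not specify the matching function.

```python
import numpy as np

def best_matching_user(utterance_features, enrolled, threshold=0.75):
    """Compare speaker-identification features of the uttered hotword with
    each known user's enrolled features; return the best match only if it
    clears the threshold, otherwise None (i.e. deny resource access).
    Hypothetical sketch: the metric and threshold are not from the patent."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {user: cosine(utterance_features, feats)
              for user, feats in enrolled.items()}
    user, score = max(scores.items(), key=lambda kv: kv[1])
    return user if score >= threshold else None

# Two hypothetical enrolled users in the multi-user environment.
enrolled = {"alice": np.array([1.0, 0.1, 0.0]),
            "bob":   np.array([0.0, 1.0, 0.2])}
print(best_matching_user(np.array([0.9, 0.2, 0.0]), enrolled))  # alice
```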

Viseme data generation for presentation while content is output

Systems and methods for viseme data generation are disclosed. Uncompressed audio data is generated and/or utilized to determine the beats per minute of the audio data. Visemes are associated with the audio data utilizing a Viterbi algorithm and the beats per minute. A time-stamped list of viseme data is generated that associates the visemes with the portions of the audio data that they correspond to. An animatronic toy and/or an animation is caused to lip sync using the viseme data while audio corresponding to the audio data is output.
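The association step can be illustrated with a generic Viterbi decode over viseme states: per-frame emission scores pick the locally likely viseme, and transition scores keep the sequence smooth. This is the standard algorithm only; how the patent folds beats-per-minute into the scoring is not specified, so the probabilities below are placeholders.

```python
import numpy as np

def viterbi(emission_log_probs, transition_log_probs, init_log_probs):
    """Standard Viterbi decode: most likely viseme state sequence given
    per-frame emission log-probabilities and a transition matrix."""
    n_frames, n_states = emission_log_probs.shape
    dp = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    dp[0] = init_log_probs + emission_log_probs[0]
    for t in range(1, n_frames):
        scores = dp[t - 1][:, None] + transition_log_probs  # (from, to)
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + emission_log_probs[t]
    path = [int(dp[-1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Three hypothetical visemes over five audio frames.
em = np.log(np.array([[.8, .1, .1], [.7, .2, .1], [.1, .8, .1],
                      [.1, .7, .2], [.1, .1, .8]]))
tr = np.log(np.array([[.8, .15, .05], [.05, .8, .15], [.05, .15, .8]]))
init = np.log(np.array([.9, .05, .05]))
print(viterbi(em, tr, init))  # [0, 0, 1, 1, 2]
```

Each decoded state index would then be written out with its frame's timestamp to form the time-stamped viseme list the abstract describes.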

AUTHORING AN IMMERSIVE HAPTIC DATA FILE USING AN AUTHORING TOOL

Methods and systems of authoring audio signal(s) into haptic data file(s) are disclosed. An audio analysis module analyses the audio signal(s) using filterbank(s) or by performing a spectrogram analysis. Transients are detected in the audio signal. If present, the transients are processed to determine a transient score and a transient binary. A database stores device specific information and actuator specific information. A haptic perceptual bandwidth of an electronic computing device having an embedded actuator is determined by using information from the database. A user interface allows modification of time-amplitude values and transient values based on the determined haptic perceptual bandwidth. Authored time-amplitude values are aggregated in authored audio descriptor data, which is passed to a transformation module that fits the data into the haptic perceptual bandwidth and implements algorithms to produce transformed audio descriptor data. Finally, the transformed audio descriptor data is converted to the haptic data file.
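The transient score and transient binary can be illustrated with a simple energy-based detector: score each frame by its jump in RMS energy over the previous frame and flag frames that cross a threshold. The scoring function and threshold are assumptions for illustration; the abstract does not define them.

```python
import numpy as np

def transient_score(audio, frame_len=256, threshold=0.1):
    """Toy transient detector: the per-frame score is the increase in RMS
    energy over the previous frame; the binary flags frames whose score
    crosses the threshold (illustrative only)."""
    n = len(audio) // frame_len
    frames = audio[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    score = np.maximum(np.diff(rms, prepend=rms[0]), 0.0)
    binary = (score > threshold).astype(int)
    return score, binary

fs = 8000
audio = np.zeros(fs)
audio[4000:4100] = 1.0             # a sharp click mid-signal
score, binary = transient_score(audio)
print(binary.nonzero()[0])         # frame index where the click begins
```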

SYSTEMS, METHODS, AND DEVICES FOR AUDIO CORRECTION
20220417659 · 2022-12-29

Systems, methods, and devices relating to audio correction are described. A first portion of content including first spoken audio content indicating first word(s) may be determined. Background audio content of the first portion of the content may be determined. A voice profile may be determined based on the first spoken audio content. Based on the voice profile, second spoken audio content indicating second word(s) to replace the first word(s) may be generated. Based on mixing the background audio content and the second spoken audio content, a second portion of content may be determined. In the content, the first portion of the content may be replaced with the generated second portion of content.
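The mix-and-splice step can be sketched directly: take the estimated background audio of the first portion, add the generated replacement speech, and write the result back over that portion of the content. All signals below are hypothetical stand-ins; estimating the background and synthesizing voice-profile-matched speech are outside this snippet.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
# Stand-ins for the abstract's pieces (all hypothetical signals):
background_tone = 0.3 * np.sin(2 * np.pi * 120 * t)   # background audio
old_speech      = 0.6 * np.sin(2 * np.pi * 300 * t)   # first spoken words
new_speech      = 0.6 * np.sin(2 * np.pi * 250 * t)   # generated replacement

# Content: silence, then the portion to correct, then silence.
content = np.concatenate([np.zeros(fs), background_tone + old_speech,
                          np.zeros(fs)])
start, end = fs, 2 * fs                 # bounds of the first portion

# Mix background + replacement speech, then splice into the content.
corrected = content.copy()
corrected[start:end] = background_tone + new_speech
```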

PROHIBITING VOICE ATTACKS

In an approach for prohibiting voice attacks, a processor, in response to receiving a voice input from a source, determines, using a predetermined filter including an allowlist, that the voice input does not match any corresponding entry of the predetermined filter. A processor routes the voice input to an adversarial pipeline for processing. A processor identifies an adversarial example of the voice input using a predetermined connectionist temporal classification method. A processor generates a configurable distorted adversarial example using the adversarial example identified. In response to a user reply, a processor injects the configurable distorted adversarial example as noise into a voice stream of the user reply in real-time to alter the voice stream. A processor routes the altered voice stream to the source.
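The final injection step can be sketched as mixing a pre-generated, distorted adversarial example into the user's reply stream at low gain. Generating the adversarial example itself (the CTC-based part) is outside this snippet; the gain and clipping are illustrative assumptions.

```python
import numpy as np

def inject_adversarial_noise(reply, adversarial, gain=0.05):
    """Mix a (pre-generated, distorted) adversarial example into the
    user's reply stream so the altered stream degrades an attacking
    recognizer while staying in valid sample range. Hypothetical sketch."""
    adv = np.resize(adversarial, reply.shape)   # tile/trim to stream length
    altered = reply + gain * adv
    return np.clip(altered, -1.0, 1.0)

rng = np.random.default_rng(0)
reply = rng.normal(0, 0.1, 8000)            # stand-in voice stream
adversarial = rng.normal(0, 1.0, 2000)      # stand-in adversarial example
altered = inject_adversarial_noise(reply, adversarial)
```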

SOUND SIGNAL PROCESSING APPARATUS AND METHOD OF PROCESSING SOUND SIGNAL
20220392479 · 2022-12-08

A sound signal processing apparatus may include: a directional microphone configured to detect a user voice signal including a user's voice by arranging the directional microphone to face an utterance point of the user's voice; a non-directional microphone configured to detect a mixed sound signal comprising the user's voice and an external sound; and a processor configured to generate an external sound signal by attenuating the user's voice from the mixed sound signal, by subtracting the user voice signal from the mixed sound signal.
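The differential calculation reduces to a sample-wise subtraction: the mixed (non-directional) signal minus the user-voice (directional) signal leaves an estimate of the external sound. The signals below are stand-ins, and a real device would also need gain and delay alignment between the two microphones, which this sketch omits.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
user_voice = 0.5 * np.sin(2 * np.pi * 200 * t)    # directional-mic capture
external = 0.2 * np.sin(2 * np.pi * 1000 * t)     # ambient/external sound
mixed = user_voice + external                      # non-directional capture

# Attenuate the user's voice by subtracting the directional signal.
external_est = mixed - user_voice
print(np.max(np.abs(external_est - external)))     # ~0.0
```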

SOMATOSENSORY VIBRATION GENERATING DEVICE AND METHOD FOR FORMING SOMATOSENSORY VIBRATION
20220392318 · 2022-12-08 ·

The invention provides a somatosensory vibration generating device comprising: an audio signal receiving module for receiving sound waves of external environmental sounds and converting the sound waves into a first audio frequency signal; a digital-to-analog conversion module for performing digital-to-analog conversion on the first audio frequency signal to generate and output a second audio frequency signal after digital-to-analog conversion; a digital signal processing module for converting the second audio frequency signal output by the digital-to-analog conversion module into a first vibration signal; an operational amplifier for performing gain processing on the first vibration signal and outputting a second vibration signal after gain processing; and at least one tactile transducer at least comprising a vibration element; wherein a frequency of the second audio frequency signal is less than 200 Hz.
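The sub-200 Hz constraint on the signal that ultimately drives the tactile transducer can be illustrated with a brick-wall low-pass in the frequency domain. This is an offline FFT sketch for clarity; an actual device would use a causal analog or digital filter, and the patent does not specify the filtering method.

```python
import numpy as np

def to_vibration_signal(audio, fs, cutoff_hz=200.0):
    """Keep only spectral content below cutoff_hz so the signal driving
    the tactile transducer stays under 200 Hz (FFT brick-wall low-pass;
    illustrative, not the patent's stated method)."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / fs)
    spectrum[freqs >= cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

fs = 4000
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 1000 * t)
vib = to_vibration_signal(audio, fs)
# vib retains the 50 Hz component; the 1 kHz tone is removed
```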
