DATA AUGMENTATION SYSTEM AND METHOD FOR MULTI-MICROPHONE SYSTEMS

A method, computer program product, and computing system for obtaining one or more speech signals from a first device, thus defining one or more first device speech signals. One or more speech signals may be obtained from a second device, thus defining one or more second device speech signals. One or more acoustic relative transfer functions mapping reverberation from the one or more first device speech signals to the one or more second device speech signals may be generated. One or more augmented second device speech signals may be generated based upon, at least in part, the one or more acoustic relative transfer functions and first device training data.
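The relative-transfer-function mapping described above can be sketched as a per-frequency least-squares estimate: the ratio of the cross-spectrum between the two devices' signals to the first device's auto-spectrum. The framing, FFT size, and function names below are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def estimate_rtf(first_dev, second_dev, n_fft=256):
    """Estimate a per-frequency relative transfer function (RTF) mapping the
    first device's speech signal to the second device's, via the ratio of
    cross- to auto-spectra (a least-squares / Wiener estimate per bin)."""
    frames = len(first_dev) // n_fft
    X = np.fft.rfft(first_dev[:frames * n_fft].reshape(frames, n_fft), axis=1)
    Y = np.fft.rfft(second_dev[:frames * n_fft].reshape(frames, n_fft), axis=1)
    # H(f) = E[Y X*] / E[|X|^2], averaged over frames
    return (Y * X.conj()).mean(axis=0) / ((np.abs(X) ** 2).mean(axis=0) + 1e-12)

def augment(first_dev_training, rtf, n_fft=256):
    """Apply the estimated RTF to first-device training data to synthesize
    pseudo second-device speech (augmented training data)."""
    frames = len(first_dev_training) // n_fft
    X = np.fft.rfft(first_dev_training[:frames * n_fft].reshape(frames, n_fft), axis=1)
    return np.fft.irfft(X * rtf, n=n_fft, axis=1).ravel()
```

A production system would use overlapping windows and regularization; this frame-wise ratio only illustrates the mapping idea.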

Electronic device and controlling method using non-speech audio signal in the electronic device
11562741 · 2023-01-24

An electronic device is provided. The electronic device comprises a speaker, a plurality of microphones, at least one processor operatively connected with the speaker and the plurality of microphones, and a memory operatively connected with the at least one processor, wherein the memory is configured to store instructions which, when executed, cause the at least one processor to perform speech audio processing or non-speech audio processing on audio signals received via the plurality of microphones, upon obtaining a non-speech audio signal based on the speech audio processing or the non-speech audio processing, identify a non-speech audio signal pattern corresponding to the non-speech audio signal, obtain a non-speech audio signal-based first command based on the identified non-speech audio signal pattern, and perform at least one action corresponding to the obtained non-speech audio signal-based first command.
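The pipeline in this abstract — obtain a non-speech signal, identify its pattern, map the pattern to a command, perform the action — can be sketched with a toy detector. The clap patterns, threshold, and command table below are hypothetical; the patent does not name specific patterns or actions.

```python
import numpy as np

# Hypothetical pattern-to-command table (illustrative only).
COMMANDS = {"double_clap": "toggle_lights", "triple_clap": "play_music"}

def detect_clap_pattern(signal, rate=16000, threshold=0.5, min_gap=0.1):
    """Identify a non-speech audio signal pattern (here: counting clap-like
    energy bursts) and return the matching command, if any."""
    envelope = np.abs(signal)
    above = envelope > threshold * envelope.max()
    # Rising edges of the thresholded envelope mark burst onsets.
    edges = np.flatnonzero(above[1:] & ~above[:-1])
    claps, last = 0, -np.inf
    for e in edges:
        if e - last >= min_gap * rate:  # ignore edges closer than min_gap
            claps, last = claps + 1, e
    pattern = {2: "double_clap", 3: "triple_clap"}.get(claps)
    return COMMANDS.get(pattern)
```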

Presence detection using ultrasonic signals with concurrent audio playback

Techniques for presence-detection devices to detect movement of a person in an environment by emitting ultrasonic signals using a loudspeaker that is concurrently outputting audible sound. To detect movement by the person, the devices characterize the change in the frequency, or the Doppler shift, of the reflections of the ultrasonic signals off the person caused by the movement of the person. However, when a loudspeaker plays audible sound while emitting the ultrasonic signal, audio signals generated by microphones of the devices include distortions caused by the loudspeaker. These distortions can be interpreted by the presence-detection devices as indicating movement of a person when there is no movement, or as indicating lack of movement when a user is moving. The techniques include processing audio signals to remove distortions to more accurately identify changes in the frequency of the reflections of the ultrasonic signals caused by the movement of the person.
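The Doppler cue described above can be illustrated by comparing spectral energy at the ultrasonic carrier with energy in the side-bands around it: a moving reflector shifts reflected energy off the carrier. Restricting analysis to the ultrasonic region is a crude stand-in for the distortion-removal processing in the abstract; the carrier frequency and bandwidths below are assumptions.

```python
import numpy as np

def doppler_energy(mic, rate=48000, carrier=20000.0, band=200.0):
    """Return the ratio of Doppler side-band energy to carrier energy around
    an ultrasonic tone; a larger ratio suggests motion in the environment."""
    spec = np.abs(np.fft.rfft(mic * np.hanning(len(mic))))
    freqs = np.fft.rfftfreq(len(mic), 1.0 / rate)
    carrier_energy = spec[np.abs(freqs - carrier) < 5.0].sum()
    sidebands = spec[(np.abs(freqs - carrier) >= 5.0) &
                     (np.abs(freqs - carrier) < band)].sum()
    return sidebands / (carrier_energy + 1e-12)
```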

Mixed adaptive and fixed coefficient neural networks for speech enhancement

Systems, methods and computer-readable media are provided for speech enhancement using a hybrid neural network. An example process can include receiving, by a first neural network portion of the hybrid neural network, audio data and reference data, the audio data including speech data, noise data, and echo data; filtering, by the first neural network portion, a portion of the audio data based on adapted coefficients of the first neural network portion, the portion of the audio data including the noise data and/or echo data; based on the filtering, generating, by the first neural network portion, filtered audio data including the speech data and an unfiltered portion of the noise data and/or echo data; and based on the filtered audio data and the reference data, extracting, by a second neural network portion of the hybrid neural network, the speech data from the filtered audio data.

Wireless audio synchronization using a spread code

Disclosed herein are system, apparatus, article of manufacture, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof, for synchronizing playback of audiovisual content among multiple speakers. In some embodiments, a first smart speaker receives a spread spectrum signal from a second smart speaker over an audio data channel. The first smart speaker despreads the spread spectrum signal based on a spreading code. The first smart speaker determines a time of receipt of the spread spectrum signal based on the despreading. The first smart speaker receives a time of transmission of the spread spectrum signal. The first smart speaker then calculates a playback delay based on the time of receipt and the time of transmission. Then the first smart speaker controls the playback of the audiovisual content based on the playback delay.
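The despreading step above amounts to cross-correlating the received audio with the known spreading code: the correlation peak marks the sample of receipt, and the playback delay follows from receipt minus transmission time. The code length and sample rate below are assumptions.

```python
import numpy as np

def time_of_receipt(received, code):
    """Despread by correlating the received audio with the spreading code;
    the index of the correlation peak is the sample of receipt."""
    corr = np.correlate(received, code, mode="valid")
    return int(np.argmax(np.abs(corr)))

def playback_delay(receipt_sample, transmit_sample, rate=48000):
    """Delay (seconds) between transmission and receipt, used to align playback."""
    return (receipt_sample - transmit_sample) / rate
```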

AUDIO BEAMFORMING WITH NULLING CONTROL SYSTEM AND METHODS
20230224635 · 2023-07-13

Audio beamforming systems and methods that enable more precise control of lobes and nulls of an array microphone are provided. Optimized beamformer coefficients can be generated to result in beamformed signals associated with one or more lobes steered towards one or more desired sound locations and one or more nulls steered towards one or more undesired sound locations. The performance of acoustic echo cancellation can thereby be improved.
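One classical way to obtain coefficients with both a steered lobe and a steered null is a linearly constrained minimum-norm solution: unity response toward the desired direction and zero response toward the interferer. This is a textbook LCMV sketch under an identity noise covariance, not the patent's optimization; the array geometry and frequency are assumptions.

```python
import numpy as np

def constrained_beamformer(mic_positions, freq, look_dir, null_dir, c=343.0):
    """Minimum-norm weights w satisfying A^H w = [1, 0]: unity gain toward
    look_dir, a null toward null_dir, for a linear array at one frequency."""
    def steering(theta):
        # Plane-wave phase delays across the linear array (far field).
        delays = mic_positions * np.sin(theta) / c
        return np.exp(-2j * np.pi * freq * delays)
    A = np.column_stack([steering(look_dir), steering(null_dir)])
    # w = A (A^H A)^{-1} f  — minimum-norm solution of the constraints
    return A @ np.linalg.solve(A.conj().T @ A, np.array([1.0, 0.0]))
```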

SENSITIVITY MODE FOR AN AUDIO SPOTTING SYSTEM
20230223042 · 2023-07-13

An audio spotting system configured for various operating modes, including a regular mode and a sensitivity mode, is described. An example cascade audio spotting system may include a high-power subsystem including a high-power trigger and a transfer module. This high-power trigger includes one or more detection models used to detect whether a target sound activity is included in the one or more audio streams. The one or more detection models are associated with a first set of hyperparameters when the cascade audio spotting system is in a regular mode, and the one or more detection models are associated with a second set of hyperparameters when the cascade audio spotting system is in a sensitivity mode. The transfer module provides at least one of one or more processed audio streams for further processing in response to the high-power trigger detecting the target sound activity in the one or more audio streams.
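The abstract does not specify which hyperparameters differ between modes; a minimal sketch uses a mode-dependent detection threshold, with the threshold values here being illustrative.

```python
# Hypothetical per-mode hyperparameter sets (values are illustrative).
HYPERPARAMS = {"regular": {"threshold": 0.8}, "sensitivity": {"threshold": 0.5}}

def high_power_trigger(score, mode="regular"):
    """Fire the trigger when the detection model's score exceeds the
    threshold associated with the current operating mode."""
    return score >= HYPERPARAMS[mode]["threshold"]
```

In sensitivity mode the lower threshold trades more false triggers for fewer missed detections.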

CASCADE AUDIO SPOTTING SYSTEM
20230223041 · 2023-07-13

Systems and methods for identifying audio events in one or more audio streams include the use of a cascade audio spotting system (such as a cascade keyword spotting system (KWS)) to reduce power consumption while maintaining a desired performance. An example cascade audio spotting system may include a first module and a high-power subsystem. The first module is to receive an audio stream from one or more audio streams, process the audio stream to detect a first target sound activity in the audio stream, and provide a first signal in response to detecting the first target sound activity in the audio stream. The high-power subsystem is to (in response to the first signal being provided by the first module) receive the one or more audio streams and process the one or more audio streams to detect a second target sound activity in the one or more audio streams.
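The cascade structure above — a cheap always-on first module gating an expensive second subsystem — can be sketched as follows. The energy gate and template-correlation detector are stand-ins; a real high-power subsystem would run a neural detection model.

```python
import numpy as np

def low_power_stage(frame, energy_threshold=0.01):
    """First module: a cheap energy detector that gates the high-power stage."""
    return float(np.mean(frame ** 2)) > energy_threshold

def high_power_stage(frame, keyword_template, score_threshold=0.7):
    """High-power subsystem stand-in: normalized correlation against a
    keyword template (a trained model in a real system)."""
    c = np.dot(frame, keyword_template)
    norm = np.linalg.norm(frame) * np.linalg.norm(keyword_template) + 1e-12
    return c / norm > score_threshold

def cascade_spotter(frame, keyword_template):
    # The expensive stage runs only when the cheap stage fires,
    # which is what saves power in the cascade design.
    return low_power_stage(frame) and high_power_stage(frame, keyword_template)
```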

Adaptive multichannel dereverberation for automatic speech recognition

Utilizing an adaptive multichannel technique to mitigate reverberation present in received audio signals, prior to providing corresponding audio data to one or more additional component(s), such as automatic speech recognition (ASR) components. Implementations disclosed herein are “adaptive”, in that they utilize a filter, in the reverberation mitigation, that is online, causal and varies depending on characteristics of the input. Implementations disclosed herein are “multichannel”, in that a corresponding audio signal is received from each of multiple audio transducers (also referred to herein as “microphones”) of a client device, and the multiple audio signals (e.g., frequency domain representations thereof) are utilized in updating of the filter—and dereverberation occurs for audio data corresponding to each of the audio signals (e.g., frequency domain representations thereof) prior to the audio data being provided to ASR component(s) and/or other component(s).
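An online, causal dereverberation filter of the kind described can be sketched for a single STFT frequency bin of a single channel: late reverberation is predicted from delayed past frames by an adaptively updated linear filter and subtracted, in the spirit of weighted-prediction-error (WPE) methods. The NLMS update, delay, and filter order below are assumptions, not the disclosed algorithm.

```python
import numpy as np

def adaptive_dereverb(frames, delay=2, order=4, mu=0.1):
    """Online dereverberation of one frequency bin. `frames` is a 1-D
    complex array of STFT values; frames closer than `delay` are left
    untouched so the direct path and early reflections are preserved."""
    g = np.zeros(order, dtype=complex)
    out = np.empty_like(frames)
    for t in range(len(frames)):
        buf = np.zeros(order, dtype=complex)
        if t >= delay:
            past = frames[max(0, t - delay - order + 1):t - delay + 1][::-1]
            buf[:len(past)] = past
        est = np.vdot(g, buf)            # predicted late reverberation
        out[t] = frames[t] - est         # dereverberated frame
        # NLMS update of the prediction filter (adaptive, input-dependent)
        g += mu * np.conj(out[t]) * buf / (np.vdot(buf, buf).real + 1e-8)
    return out
```

The multichannel version stacks delayed frames from all microphones into the prediction buffer; the structure of the update is unchanged.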

Automated transcript generation from multi-channel audio

Systems and methods are described for generating a transcript of a legal proceeding or other multi-speaker conversation or performance in real time or near-real time using multi-channel audio capture. Different speakers or participants in a conversation may each be assigned a separate microphone that is placed in proximity to the given speaker, where each audio channel includes audio captured by a different microphone. Filters may be applied to isolate each channel to include speech utterances of a different speaker, and these filtered channels of audio data may then be processed in parallel to generate speech-to-text results that are interleaved to form a generated transcript.
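The final interleaving step can be sketched as a timestamp-ordered merge of per-channel speech-to-text results, with the channel index identifying the speaker. The tuple format and speaker labels below are illustrative assumptions.

```python
def interleave_transcript(channel_results):
    """Merge per-channel speech-to-text results into one transcript ordered
    by start time. Each channel's results are (start_time, text) tuples;
    each channel corresponds to one speaker's microphone."""
    merged = sorted(
        (start, ch, text)
        for ch, results in enumerate(channel_results)
        for start, text in results
    )
    return [f"Speaker {ch + 1}: {text}" for _, ch, text in merged]
```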