Patent classifications
G10L2021/02166
Asynchronous ad-hoc distributed microphone array processing in smart home applications using voice biometrics
Voice biometrics scoring is performed on received asynchronous audio outputs from microphones distributed at ad hoc locations to generate confidence scores that indicate a likelihood of an enrolled user speech utterance in the output, a subset of the outputs is selected based on the confidence scores, and the subset is spatially processed to provide audio output for voice application use. Alternatively, asynchronous spatially processed audio outputs and corresponding biometric identifiers are received from corresponding devices distributed at ad hoc locations, audio frames of the outputs are synchronized using the biometric identifiers, and the synchronized frames are coherently combined. Alternatively, uttered speech associated with respective ad hoc distributed devices is received and non-coherently combined to generate a final output of uttered speech. The uttered speech is recognized from respective spatially processed outputs generated by the respective devices using biometrics of talkers enrolled by the devices.
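The selection step in the first embodiment above (score each microphone output, then keep a subset) can be sketched as follows. This is only an illustration: the function name `select_outputs`, the threshold of 0.5, the cap of three outputs, and the room labels are all hypothetical, and the abstract does not say how the subset is actually chosen.

```python
from typing import Dict, List


def select_outputs(scores: Dict[str, float],
                   threshold: float = 0.5,
                   max_count: int = 3) -> List[str]:
    """Return the IDs of the best-scoring microphone outputs.

    scores maps a microphone ID to a voice-biometric confidence score,
    assumed here to lie in [0, 1]. The threshold and max_count values are
    illustrative assumptions, not values from the abstract.
    """
    passing = [(mic, s) for mic, s in scores.items() if s >= threshold]
    # Highest confidence first, so the cap keeps the most reliable outputs.
    passing.sort(key=lambda pair: pair[1], reverse=True)
    return [mic for mic, _ in passing[:max_count]]


# Hypothetical scores for five ad hoc microphones.
scores = {"kitchen": 0.91, "hall": 0.42, "sofa": 0.77, "door": 0.65, "attic": 0.58}
selected = select_outputs(scores)
```

Only the selected outputs would then be passed on to spatial processing.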
Directional noise suppression
Systems and methods of providing improved directional noise suppression in an electronic device implement a technique that specifies a direction or speaker of interest, determines the directions corresponding to speakers not lying in the direction of interest, beam forms the reception pattern of the device microphone array to focus in the direction of interest and suppresses signals from the other directions, creating beam formed reception data. A spatial mask is generated as a function of direction relative to the direction of interest. The spatial mask emphasizes audio reception in the direction of interest and attenuates audio reception in the other directions. The beam formed reception data is then multiplied by the spatial mask to generate an audio signal with directional noise suppression.
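The final masking step can be sketched as below, assuming the beam formed data is available as a list of values with an estimated arrival direction for each. The Gaussian mask shape, the 30 degree width, and the 0.1 gain floor are assumptions made for illustration; the abstract says only that the mask is a function of direction relative to the direction of interest.

```python
import math


def spatial_mask(angle_deg, target_deg, width_deg=30.0, floor=0.1):
    """Gain near 1 toward the direction of interest, attenuated elsewhere.

    The Gaussian shape, width_deg, and floor are illustrative assumptions.
    """
    # Wrap the angular difference into [-180, 180) degrees.
    diff = (angle_deg - target_deg + 180.0) % 360.0 - 180.0
    gain = math.exp(-0.5 * (diff / width_deg) ** 2)
    return max(floor, gain)


def apply_mask(beamformed, directions, target_deg):
    """Multiply each beam formed value by the mask gain for its direction."""
    return [x * spatial_mask(d, target_deg)
            for x, d in zip(beamformed, directions)]


# Two equal-level components: one on target, one from directly behind.
masked = apply_mask([1.0, 1.0], [0.0, 180.0], target_deg=0.0)
```

In this sketch the on-target component passes unchanged while the component from behind is reduced to the floor gain.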
Speech processing apparatus and method using a plurality of microphones
A speech processing apparatus includes a plurality of microphones configured to receive a plurality of input signals, and processing circuitry configured to generate a spatial filtering signal corresponding to the plurality of input signals through spatial filtering, generate estimated noise information by integrating directional noise information representing a level of a noise signal received from a direction of interest with diffuse noise information representing levels of noise signals received from various directions based on whether the plurality of input signals have directionality, and generate an estimated speech signal by filtering the spatial filtering signal based on the estimated noise information.
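One way to picture the noise integration described above is a weighted blend of the two noise estimates followed by a gain applied to the spatially filtered signal. The linear blend, the `directionality` weight in [0, 1], and the spectral-subtraction style gain with a floor are all assumptions for illustration; the abstract does not specify how the two estimates are integrated or how the final filter is computed.

```python
def estimate_noise(directional, diffuse, directionality):
    """Blend directional and diffuse noise power estimates.

    directionality is a hypothetical score in [0, 1]: 1 means the inputs
    look strongly directional, 0 means they look diffuse.
    """
    w = min(1.0, max(0.0, directionality))
    return w * directional + (1.0 - w) * diffuse


def suppression_gain(signal_power, noise_power, floor=0.05):
    """Gain for the spatially filtered signal (illustrative rule only)."""
    if signal_power <= 0.0:
        return floor
    return max(floor, 1.0 - noise_power / signal_power)


noise = estimate_noise(directional=4.0, diffuse=2.0, directionality=0.5)
gain = suppression_gain(signal_power=10.0, noise_power=noise)
```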
Device to amplify and clarify voice
A voice enhancing device amplifies and clarifies the voice of a user with hypophonia or other voice issues. The device includes a collar of either a rigid or a soft material that is shaped to comfortably sit on the shoulders of the user. One or more microphone arrays are adjustably mounted to the collar to capture audio of the user talking. An electronics module enhances the captured audio signal and generates an enhanced audio signal that drives at least one speaker adjustably attached to the collar. The electronic controller implements one or more of an AGC amplifier to correct amplitude variation in spoken words, adaptive filtering to actively filter out background noise, a variable attack and decay function to improve intelligibility of the spoken words, a diphthong modification function to clarify the spoken words, and an echo cancelation function to reduce echo and feedback in the enhanced audio.
Microphone array based deep learning for time-domain speech signal extraction
A device for processing audio signals in the time domain includes a processor configured to receive multiple audio signals corresponding to respective microphones of two or more microphones of the device, at least one of the multiple audio signals comprising speech of a user of the device. The processor is configured to provide the multiple audio signals to a machine learning model, the machine learning model having been trained based at least in part on an expected position of the user of the device and expected positions of the respective microphones on the device. The processor is configured to provide an audio signal that is enhanced with respect to the speech of the user relative to the multiple audio signals, wherein the audio signal is a waveform output from the machine learning model.
Generating an audio signal from multiple microphones based on uncorrelated noise detection
An audio capture device selects between multiple microphones to generate an output audio signal depending on detected conditions. The audio capture device determines whether one or more microphones are wet or dry and selects one or more audio signals from the one or more microphones depending on their respective conditions. The audio capture device generates a mono audio output signal or a stereo output signal depending on the respective conditions of the one or more microphones.
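The selection logic can be sketched for a two-microphone case as below. The specific policy (stereo when both microphones are dry, the dry channel as mono when one is wet, an averaged mono mix when both are wet) is an assumption chosen for illustration; the abstract states only that the selection and the mono or stereo choice depend on the microphones' conditions. The function name `choose_output` is hypothetical.

```python
def choose_output(left, right, left_wet, right_wet):
    """Pick a mono or stereo output from two microphone signals.

    Returns (mode, channels), where channels is a list of sample lists.
    The policy below is an illustrative assumption, not the patented rule.
    """
    if not left_wet and not right_wet:
        # Both microphones usable: keep both channels.
        return "stereo", [left, right]
    if left_wet and right_wet:
        # Neither is clean: fall back to an averaged mono mix.
        mixed = [(a + b) / 2.0 for a, b in zip(left, right)]
        return "mono", [mixed]
    # Exactly one microphone is wet: use the dry one alone.
    dry = right if left_wet else left
    return "mono", [dry]


mode, channels = choose_output([1.0, 2.0], [3.0, 4.0],
                               left_wet=True, right_wet=False)
```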
Spatially informed acoustic echo cancelation
A plurality of microphone signals can be captured with a plurality of microphones of the device. One or more echo dominant audio signals can be determined based on a pick-up beam directed towards one or more speakers of a playback device. Sound that is emitted from the one or more speakers and sensed by the plurality of microphones can be removed from the plurality of microphone signals, by using the one or more echo dominant audio signals as a reference, resulting in clean audio.
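Removing echo with a reference signal is commonly done with an adaptive filter. The sketch below uses a normalized least mean squares (NLMS) filter, which is a swapped-in textbook technique: the abstract does not name the cancelation algorithm. Here `ref` stands in for an echo dominant signal obtained from the pick-up beam, and the tap count, step size, and synthetic two-tap echo path are assumptions made for the demonstration.

```python
import random


def nlms_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Subtract the part of mic that is linearly predictable from ref.

    mic and ref are equal-length sample lists. Returns the residual signal.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step scaled by the reference energy in the window.
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


# Synthetic test: the microphone hears only a two-tap echo of the reference.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.6 * ref[n] + (0.3 * ref[n - 1] if n > 0 else 0.0)
       for n in range(len(ref))]
clean = nlms_cancel(mic, ref)
```

With no near-end speech in this synthetic signal, the residual should decay toward zero as the filter converges on the echo path.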
End-To-End Time-Domain Multitask Learning for ML-Based Speech Enhancement
Disclosed is a multi-task machine learning model such as a time-domain deep neural network (DNN) that jointly generates an enhanced target speech signal and target audio parameters from a mixed signal of target speech and interference signal. The DNN may encode the mixed signal, determine masks used to jointly estimate the target signal and the target audio parameters based on the encoded mixed signal, apply the masks to separate the target speech from the interference signal to jointly estimate the target signal and the target audio parameters, and decode the masked features to enhance the target speech signal and to estimate the target audio parameters. The target audio parameters may include a voice activity detection (VAD) flag of the target speech. The DNN may leverage multi-channel audio signals and multi-modal signals such as video signals of the target speaker to improve the robustness of the enhanced target speech signal.
Sensory perception accelerator
To reduce the reliance on software for complex computations used in machine sensory perception, a sensory perception accelerator may include a neural network accelerator and a linear algebra accelerator. The neural network accelerator may include systolic arrays to perform neural network computations concurrently on image data and audio data. The linear algebra accelerator may include matrix computation circuits operable to perform matrix operations on image data and motion data.
Context aware hearing optimization engine
One or more context aware processing parameters and an ambient audio stream are received. One or more sound characteristics associated with the ambient audio stream are identified using a machine learning model. One or more actions to perform are determined using the machine learning model and based on the one or more context aware processing parameters and the identified one or more sound characteristics. The one or more actions are performed.