G10L2021/02166

Multi-stream target-speech detection and channel fusion

Audio processing systems and methods include an audio sensor array configured to receive a multichannel audio input and generate a corresponding multichannel audio signal, together with target-speech detection logic and an automatic speech recognition engine or VoIP application. An audio processing device includes a target speech enhancement engine configured to analyze a multichannel audio input signal and generate a plurality of enhanced target streams; a multi-stream target-speech detection generator comprising a plurality of target-speech detector engines, each configured to determine a probability of detecting a specific target speech of interest in its stream, wherein the multi-stream target-speech detection generator is configured to determine a plurality of weights associated with the enhanced target streams; and a fusion subsystem configured to apply the plurality of weights to the enhanced target streams to generate an enhanced output signal.
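A minimal sketch of the weight-and-fuse step described above, assuming (the abstract does not specify the rule) that the per-stream detector probabilities are simply normalized into fusion weights:

```python
import numpy as np

def fuse_streams(streams, detection_probs):
    """Weight each enhanced target stream by its normalized target-speech
    detection probability and sum the streams into one output signal.

    streams: array-like of shape (n_streams, n_samples)
    detection_probs: per-stream probability of target speech, shape (n_streams,)
    """
    probs = np.asarray(detection_probs, dtype=float)
    weights = probs / probs.sum()          # normalize so the weights sum to 1
    return weights @ np.asarray(streams)   # weighted sum over the streams
```

In practice the weights could equally be produced by a softmax over detector scores; the normalization above is the simplest choice consistent with the abstract.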

Microphone array device, conference system including microphone array device and method of controlling a microphone array device

A microphone array device includes microphone capsules and at least one processing unit configured to receive the output signals of the microphone capsules, dynamically steer an audio beam based on the received output signals, and generate and provide an audio output signal based on the received output signals. The processing unit is configured to operate in a dynamic beam mode, in which at least one focused audio beam is formed that points towards a detected audio source, and in a default beam mode, in which a broader audio beam is formed that substantially covers a default detection area. The microphone array may be incorporated into a conference system.
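The steering itself can be illustrated with a plain delay-and-sum beamformer; the integer per-capsule delays, which would be derived from the detected source direction and the array geometry, are assumed as inputs here:

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples):
    """Steer an audio beam by delaying each capsule's output and averaging.

    mic_signals: array of shape (n_mics, n_samples)
    delays_samples: integer delay (in samples) applied to each capsule
    """
    mic_signals = np.asarray(mic_signals, dtype=float)
    out = np.zeros(mic_signals.shape[1])
    for sig, d in zip(mic_signals, delays_samples):
        out += np.roll(sig, int(d))        # circular shift as a simple delay
    return out / mic_signals.shape[0]
```

With all delays zero this degenerates to the broad "default beam" average; nonzero delays focus the beam toward a particular direction.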

Online target-speech extraction method based on auxiliary function for robust automatic speech recognition

A target speech signal extraction method for robust speech recognition includes: initializing a steering vector for a target speech source and an adaptive vector, setting a real output channel of the target speech source as the output of the adaptive vector, initializing adaptive vectors for noise, and setting dummy channels as the outputs of the adaptive vectors for noise; setting a cost function that minimizes the dependency between the real output for the target speech source and the dummy outputs for noise; setting an auxiliary function for the cost function, and updating the adaptive vector for the target speech source and the adaptive vectors for noise by using the auxiliary function and the steering vector; estimating the target speech signal by using the adaptive vector, thereby extracting the target speech signal from the input signals; and updating the steering vector for the target speech source.
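One auxiliary-function update of the adaptive vector can be sketched in the style of AuxIVA-type algorithms (a generic illustration of the technique, not the patent's exact update rule): the contrast function yields a weighted covariance matrix, and the new adaptive vector is obtained by solving against the steering vector.

```python
import numpy as np

def aux_update(X, w, a, eps=1e-8):
    """One auxiliary-function update of the adaptive vector for the target
    channel (AuxIVA-style sketch).

    X: complex STFT observations in one frequency bin, shape (n_channels, n_frames)
    w: current adaptive (demixing) vector, shape (n_channels,)
    a: steering vector for the target speech source, shape (n_channels,)
    """
    y = w.conj() @ X                           # current target output
    phi = 1.0 / np.maximum(np.abs(y), eps)     # weights from the contrast function
    V = (X * phi) @ X.conj().T / X.shape[1]    # weighted (auxiliary) covariance
    w_new = np.linalg.solve(V, a)              # minimize the auxiliary function
    denom = np.sqrt((w_new.conj() @ V @ w_new).real) + eps
    return w_new / denom                       # scale normalization
```

Iterating this update, together with the analogous updates for the noise (dummy-channel) vectors and a re-estimate of the steering vector, corresponds to the loop the abstract describes.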

Hearing device comprising a keyword detector and an own voice detector and/or a transmitter

A hearing device, e.g. a hearing aid, is configured to be arranged at least partly on a user's head or at least partly implanted in a user's head. The hearing device comprises a) at least one input transducer for picking up an input sound signal from the environment and providing at least one electric input signal representing said input sound signal; b) a signal processor providing a processed signal based on one or more of said at least one electric input signal; c) an output unit for converting said processed signal, or a signal originating therefrom, to stimuli perceivable by said user as sound; d) a keyword spotting system comprising d1) a keyword detector configured to detect a limited number of predefined keywords or phrases or sounds in said at least one electric input signal or in a signal derived therefrom, and to provide a keyword indicator of whether or not, or with what probability, said keywords or phrases or sounds are detected, and d2) an own voice detector for providing an own voice indicator estimating whether or not, or with what probability, a given input sound signal originates from the voice of the user of the hearing device. The hearing device further comprises e) a controller configured to provide an own-voice-keyword indicator of whether or not, or with what probability, a given one of said keywords or phrases or sounds is currently detected and spoken by said user, said own-voice-keyword indicator being dependent on said keyword indicator and said own voice indicator.
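The abstract only requires the combined indicator to depend on both detector outputs. Assuming, for illustration, that the two detectors are treated as independent, the simplest combination is the product of the two probabilities:

```python
def own_voice_keyword_prob(keyword_prob, own_voice_prob):
    """Combine the keyword indicator and the own-voice indicator into a
    single own-voice-keyword probability. Independence of the two
    detectors is an assumption made for this sketch."""
    return keyword_prob * own_voice_prob
```

A real device might instead gate the keyword indicator on a hard own-voice decision, or learn the combination; the product is merely the minimal instance of "dependent on both indicators".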

Speech-Tracking Listening Device
20220417679 · 2022-12-29

A system (20) includes a plurality of microphones (22), configured to generate different respective signals in response to acoustic waves (36) arriving at the microphones, and a processor (34). The processor is configured to receive the signals, to combine the signals into multiple channels, which correspond to different respective directions relative to the microphones by virtue of each channel representing any portion of the acoustic waves arriving from the corresponding direction with greater weight, relative to others of the directions, to calculate respective energy measures of the channels, to select one of the directions, in response to the energy measure for the channel corresponding to the selected direction passing one or more energy thresholds, and to output a combined signal representing the selected direction with greater weight, relative to others of the directions. Other embodiments are also described.
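The energy-based direction selection can be sketched as follows, assuming a single "on" threshold (the abstract allows one or more thresholds) and per-channel mean-square energy:

```python
import numpy as np

def select_direction(channels, on_threshold):
    """Pick the steering direction whose channel energy passes the threshold.

    channels: array-like of shape (n_directions, n_samples), one beamformed
              channel per candidate direction
    Returns the index of the selected direction, or None if no channel's
    energy passes the threshold.
    """
    energies = np.mean(np.square(channels), axis=1)   # per-channel energy
    best = int(np.argmax(energies))
    return best if energies[best] >= on_threshold else None
```

The selected index would then drive the output stage that weights the corresponding direction more heavily in the combined signal.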

ACOUSTIC CROSSTALK SUPPRESSION DEVICE AND ACOUSTIC CROSSTALK SUPPRESSION METHOD

An acoustic crosstalk suppression device includes a speaker estimation unit configured to estimate a main speaker based on voice signals collected by n microphones corresponding to n persons (n being an integer equal to or larger than 3); n filter update units, each configured to update a parameter of a filter that generates a suppression signal for a crosstalk component included in the voice signal of the main speaker; and a crosstalk suppression unit configured to suppress the crosstalk component by using a synthesized suppression signal generated by at most (n-1) of the filter update units corresponding to reference signals collected by at most (n-1) of the microphones.
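A generic adaptive-filter update of the kind such a filter update unit performs can be sketched with one NLMS step per sample (NLMS is a standard choice for this task, not necessarily the patent's rule); the reference microphone supplies the interfering speaker's signal:

```python
import numpy as np

def nlms_crosstalk_step(w, ref, desired, mu=0.5, eps=1e-8):
    """One NLMS update of a crosstalk-cancelling filter.

    w: filter taps, shape (L,)
    ref: last L samples of a reference (interfering) microphone, newest first
    desired: current sample of the main speaker's microphone
    Returns (updated taps, crosstalk-suppressed output sample).
    """
    y = w @ ref                      # estimated crosstalk component
    e = desired - y                  # suppressed output (error signal)
    w = w + mu * e * ref / (ref @ ref + eps)
    return w, e
```

In the device, up to (n-1) such filters run in parallel, one per reference microphone, and their suppression signals are synthesized before subtraction.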

ACOUSTIC OUTPUT APPARATUS

The present disclosure relates to an acoustic output apparatus. The acoustic output apparatus comprises: at least one low-frequency acoustic driver that outputs sound from at least two first sound guiding holes; at least one high-frequency acoustic driver that outputs sound from at least two second sound guiding holes; and a controller configured to cause the low-frequency acoustic driver to output sound in a first frequency range, and cause the high-frequency acoustic driver to output sound in a second frequency range, wherein the second frequency range includes frequencies higher than the first frequency range.
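The controller's band split can be illustrated with a first-order complementary crossover (a deliberately minimal sketch; a real apparatus would use a higher-order network):

```python
import numpy as np

def one_pole_crossover(x, fc, fs):
    """Split a signal into a low band (for the low-frequency driver) and a
    complementary high band (for the high-frequency driver) at crossover
    frequency fc (Hz), sample rate fs (Hz)."""
    alpha = 1.0 - np.exp(-2.0 * np.pi * fc / fs)
    x = np.asarray(x, dtype=float)
    low = np.zeros_like(x)
    state = 0.0
    for i, s in enumerate(x):
        state += alpha * (s - state)   # one-pole low-pass
        low[i] = state
    high = x - low                      # complementary high band
    return low, high
```

Because the high band is formed as the complement of the low band, the two driver feeds sum back exactly to the input signal.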

Detecting a trigger of a digital assistant

Systems and processes for operating an intelligent automated assistant are provided. In accordance with one example, a method includes, at an electronic device with one or more processors, memory, and a plurality of microphones, sampling, at each of the plurality of microphones of the electronic device, an audio signal to obtain a plurality of audio signals; processing the plurality of audio signals to obtain a plurality of audio streams; and determining, based on the plurality of audio streams, whether any of the plurality of audio signals corresponds to a spoken trigger. The method further includes, in accordance with a determination that the plurality of audio signals corresponds to the spoken trigger, initiating a session of the digital assistant; and in accordance with a determination that the plurality of audio signals does not correspond to the spoken trigger, forgoing initiating a session of the digital assistant.
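The any-stream trigger decision reduces to a simple predicate over per-stream detector scores; the score threshold here is an assumed parameter, not one stated in the abstract:

```python
def detect_trigger(stream_scores, threshold=0.5):
    """Decide whether any processed audio stream contains the spoken
    trigger, and start an assistant session only if one does."""
    triggered = any(s >= threshold for s in stream_scores)
    return "start_session" if triggered else "no_session"
```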

Pre-voice-separation/recognition synchronization of time-based voice collections based on device clock-cycle differentials

Methods and devices for conducting, based on a clock difference among a plurality of voice collection devices, a synchronization process on voice information collected by those devices. After the synchronization process is performed, a voice separation and recognition process is conducted on the voice information that was collected by the plurality of voice collection devices and synchronized based on the clock difference.
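The synchronization step can be sketched as trimming each recording by its clock offset before the joint separation/recognition stage. The sign convention (positive offset meaning the device's clock runs ahead) is an assumption for this illustration:

```python
import numpy as np

def synchronize(recordings, clock_offsets_s, fs):
    """Align recordings from several voice collection devices by trimming
    each according to its clock offset (in seconds) at sample rate fs.

    Returns the aligned recordings, cropped to a common length.
    """
    shifts = [int(round(o * fs)) for o in clock_offsets_s]
    base = min(shifts)
    rel = [s - base for s in shifts]                     # non-negative trims
    n = min(len(r) - d for r, d in zip(recordings, rel)) # common length
    return [np.asarray(r)[d:d + n] for r, d in zip(recordings, rel)]
```

Sub-sample offsets would in practice require fractional-delay resampling; integer-sample trimming is the minimal version of the idea.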

Beamformer enhanced direction of arrival estimation in a reverberant environment with directional noise

An estimator of direction of arrival (DOA) of speech from a far-field talker to a device in the presence of room reverberation and directional noise includes audio inputs received from multiple microphones and one or more beamformer outputs generated by processing the microphone inputs. A first DOA estimate is obtained by performing generalized cross-correlation between two or more of the microphone inputs. A second DOA estimate is obtained by performing generalized cross-correlation between one of the one or more beamformer outputs and one or more of: the microphone inputs and other of the one or more beamformer outputs. A selector selects the first or second DOA estimate based on an SNR estimate at the microphone inputs and a noise reduction amount estimate at the beamformer outputs. The SNR and noise reduction estimates may be obtained based on the detection of a keyword spoken by a desired talker.
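The pairwise generalized cross-correlation at the heart of both DOA estimates is typically GCC-PHAT; a self-contained sketch of the delay estimation between one pair of inputs (microphone or beamformer signals):

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (in seconds) of sig relative to ref using
    generalized cross-correlation with PHAT weighting."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.maximum(np.abs(S), 1e-12)     # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(S, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs
```

Given the estimated delay and the microphone spacing, the DOA follows from simple geometry; the selector described above then chooses between the microphone-pair and beamformer-assisted estimates.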