Spatially-controlled noise reduction for headsets with variable microphone array orientation
10079026 · 2018-09-18
Assignee
Inventors
Cpc classification
G10K2200/10
PHYSICS
G10K2210/1081
PHYSICS
International classification
H04R1/10
ELECTRICITY
Abstract
A method may include determining a desired speech estimate originating from a speech acceptance direction range while reducing a level of interfering noise, determining an interfering noise estimate originating from a noise rejection direction range while reducing a level of desired speech, calculating a ratio of the desired speech estimate to the interfering noise estimate, dynamically computing a set of thresholds based on the speech acceptance direction range, noise rejection direction range, a background noise level, and a noise type, estimating a power spectral density of background noise arriving from the noise rejection direction range, calculating a frequency-dependent gain function based on the power spectral density of background noise and thresholds, and applying the frequency-dependent gain function to at least one microphone signal generated by the plurality of microphones to reduce noise arriving from the noise rejection direction while preserving desired speech arriving from the speech acceptance direction.
Claims
1. A method for voice processing in an audio device having an array of a plurality of microphones wherein the array is capable of having a plurality of positional orientations relative to a user of the array, the method comprising: determining a desired speech estimate originating from a speech acceptance direction range of a speech acceptance direction while reducing a level of interfering noise; determining an interfering noise estimate originating from a noise rejection direction range of a noise rejection direction while reducing a level of desired speech; calculating a ratio of the desired speech estimate to the interfering noise estimate; dynamically computing a set of thresholds based on the speech acceptance direction range, noise rejection direction range, a background noise level, and a noise type; estimating a power spectral density of background noise arriving from the noise rejection direction range; calculating a frequency-dependent gain function based on the power spectral density of background noise and thresholds; and applying the frequency-dependent gain function to at least one microphone signal generated by the plurality of microphones to reduce noise arriving from the noise rejection direction while preserving desired speech arriving from the speech acceptance direction.
2. The method of claim 1, wherein calculating the frequency-dependent gain function comprises setting one or more coefficients of the frequency-dependent gain function based on a comparison of the ratio to one of the thresholds.
3. The method of claim 1, wherein calculating the frequency-dependent gain function comprises setting one or more coefficients of the frequency-dependent gain function based on a comparison of a cross-correlation between microphone signals generated by the plurality of microphones to one of the thresholds.
4. The method of claim 1, wherein calculating the frequency-dependent gain function comprises setting one or more coefficients of the frequency-dependent gain function based on a direction of arrival estimate for desired speech.
5. The method of claim 1, wherein the noise type comprises one of directional noise, diffused noise, and uncorrelated noise.
6. The method of claim 1, further comprising dynamically adjusting the set of thresholds based on ambient noise conditions.
7. The method of claim 1, further comprising adjusting the maximum noise reduction limit based on ambient noise conditions.
8. The method of claim 1, further comprising: computing the ratio at separate frequencies; and adjusting the power spectral density of the background noise separately as a function of a computed frequency-dependent ratio for each of the separate frequencies.
9. The method of claim 1, further comprising modifying the set of thresholds as a function of speech acceptance direction range and noise rejection direction range.
10. The method of claim 1, further comprising controlling the null direction of a spatially-controlled adaptive nullformer based on the ratio.
11. The method of claim 10, wherein an output of the spatially-controlled adaptive nullformer is used as a reference signal for an adaptive noise reduction filter.
12. An integrated circuit for implementing at least a portion of an audio device having an array of a plurality of microphones wherein the array is capable of having a plurality of positional orientations relative to a user of the array, comprising: a plurality of microphone inputs, each microphone input associated with one of the plurality of microphones; a processor configured to: determine a desired speech estimate originating from a speech acceptance direction range of a speech acceptance direction while reducing a level of interfering noise; determine an interfering noise estimate originating from a noise rejection direction range of a noise rejection direction while reducing a level of desired speech; calculate a ratio of the desired speech estimate to the interfering noise estimate; dynamically compute a set of thresholds based on the speech acceptance direction range, noise rejection direction range, a background noise level, and a noise type; estimate a power spectral density of background noise arriving from the noise rejection direction range; calculate a frequency-dependent gain function based on the power spectral density of background noise and thresholds; and apply the frequency-dependent gain function to at least one microphone signal generated by the plurality of microphones to reduce noise arriving from the noise rejection direction while preserving desired speech arriving from the speech acceptance direction.
13. The integrated circuit of claim 12, wherein calculating the frequency-dependent gain function comprises setting one or more coefficients of the frequency-dependent gain function based on a comparison of the ratio to one of the thresholds.
14. The integrated circuit of claim 12, wherein calculating the frequency-dependent gain function comprises setting one or more coefficients of the frequency-dependent gain function based on a comparison of a cross-correlation between microphone signals generated by the plurality of microphones to one of the thresholds.
15. The integrated circuit of claim 12, wherein calculating the frequency-dependent gain function comprises setting one or more coefficients of the frequency-dependent gain function based on a direction of arrival estimate for desired speech.
16. The integrated circuit of claim 12, wherein the noise type comprises one of directional noise, diffused noise, and uncorrelated noise.
17. The integrated circuit of claim 12, wherein the processor is further configured to dynamically adjust the set of thresholds based on ambient noise conditions.
18. The integrated circuit of claim 12, wherein the processor is further configured to adjust the maximum noise reduction limit based on ambient noise conditions.
19. The integrated circuit of claim 12, wherein the processor is further configured to: compute the ratio at separate frequencies; and adjust the power spectral density of the background noise separately as a function of a computed frequency-dependent ratio for each of the separate frequencies.
20. The integrated circuit of claim 12, wherein the processor is further configured to modify the set of thresholds as a function of speech acceptance direction range and noise rejection direction range.
21. The integrated circuit of claim 12, wherein the processor is further configured to control the null direction of a spatially-controlled adaptive nullformer based on the ratio.
22. The integrated circuit of claim 21, wherein an output of the spatially-controlled adaptive nullformer is used as a reference signal for an adaptive noise reduction filter.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) A more complete understanding of the present embodiments and certain advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawings, in which like reference numbers indicate like features.
DETAILED DESCRIPTION
(18) In this disclosure, systems and methods are proposed for non-linear beamforming-based noise reduction in a dual-microphone array that is robust to dynamic changes in the desired speech arrival direction. The systems and methods herein may be useful in, among other applications, in-ear fitness headsets wherein the microphones are placed in a control box. In such headsets, the microphone array position with respect to a user's mouth varies significantly depending on the headset wearing preference of the user. Moreover, the microphone array orientation is not constant because head movements and obstructions from collared shirts and heavy jackets may prevent the control box from resting in a consistent position. Hence, the desired speech arrival direction is not constant in such configurations, and the systems and methods proposed herein may ensure that the user's speech is preserved under various array orientations while improving the signal-to-noise ratio more than single-microphone processing would. Specifically, given a pre-specified speech arrival direction range, the systems and methods disclosed herein may suppress interfering noise that arrives from directions outside of that range. The systems and methods disclosed herein may also derive a statistic that estimates an interference-to-desired-speech ratio and use this statistic to dynamically update a background noise estimate for a single-channel spectral subtraction-based noise reduction algorithm. The aggressiveness of noise reduction may also be controlled based on the derived statistic. Ambient-aware information such as a noise level and/or a noise type (e.g., diffuse, directional, or uncorrelated noise) may also be used to appropriately control the background noise estimation process. The derived statistics may also be used to detect the presence of desired near-field signals. This signal detection may be used in various applications as described below.
(19) In accordance with embodiments of this disclosure, an automatic playback management framework may use one or more audio event detectors. Such audio event detectors for an audio device may include a near-field detector that may detect when sounds in the near-field of the audio device are detected, such as when a user of the audio device (e.g., a user that is wearing or otherwise using the audio device) speaks, a proximity detector that may detect when sounds in proximity to the audio device are detected, such as when another person in proximity to the user of the audio device speaks, and a tonal alarm detector that detects acoustic alarms that may have been originated in the vicinity of the audio device.
(21) As shown in
(22) As shown in
(27) As shown in
(28) As known in the art, a first-order beamformer is one that combines two microphone signals to form a virtual signal acquisition beam focused towards a desired look direction such that signals arriving from directions other than the look direction are attenuated. Typically, the output signal-to-noise ratio of a beamformer is high due to the attenuation of signals arriving from directions other than the desired look direction. For example,
(29) In order to determine if desired speech is present in a speech acceptance angle, a spatial statistic may be derived by forming a set of fixed beamformers including speech beamformer 54 and noise beamformer 55. Speech beamformer 54 may comprise microphone inputs corresponding to microphone inputs 52 that may generate a beam based on microphone signals (e.g., x.sub.1, x.sub.2) received by such inputs. Speech beamformer 54 may be configured to form a beam to spatially filter audible sounds from microphones 51 coupled to microphone inputs 52. In some embodiments, speech beamformer 54 may comprise a unidirectional beamformer configured to form a respective unidirectional beam in a desired look direction to receive and spatially filter audible sounds from microphones 51 coupled to microphone inputs 52, wherein such respective unidirectional beam may have a spatial null in a direction opposite of the look direction. In some embodiments, speech beamformer 54 may be implemented as a time-domain beamformer. Speech beamformer 54 may be formed to capture most of the speech arriving from a speech acceptance direction while suppressing interfering noise coming from other directions.
(30) Noise beamformer 55 may comprise microphone inputs corresponding to microphone inputs 52 that may generate a beam based on microphone signals (e.g., x.sub.1, x.sub.2) received by such inputs. Noise beamformer 55 may be configured to form a beam to spatially filter audible sounds from microphones 51 coupled to microphone inputs 52. In some embodiments, noise beamformer 55 may comprise a unidirectional beamformer configured to form a respective unidirectional beam in a desired look direction (e.g., different than the look direction of speech beamformer 54) to receive and spatially filter audible sounds from microphones 51 coupled to microphone inputs 52, wherein such respective unidirectional beam may have a spatial null in a direction opposite of the look direction. In some embodiments, noise beamformer 55 may be implemented as a time-domain beamformer. Similarly to speech beamformer 54, noise beamformer 55 may be formed to capture noise coming from a noise rejection direction while suppressing signals arriving from the speech acceptance direction.
(31) Either or both of speech beamformer 54 and noise beamformer 55 may comprise a first-order beamformer.
(32) Each of the null directions for speech beamformer 54 and noise beamformer 55 may be chosen based on pre-specified speech acceptance and noise rejection direction ranges, respectively.
y_s[n] = v_1^n[n] x_1[n] - v_2^n[n] x_2[n - n_s]

y_n[n] = v_1^n[n] v_1^s[n] x_1[n - n_n] - v_2^n[n] v_2^s[n] x_2[n]

where v_1^s[n] and v_2^s[n] are calibration gains compensating for near-field propagation loss effects, and the calibrated values may be different for various headset positions. The gains v_1^n[n] and v_2^n[n] are the microphone calibration gains adjusted dynamically to account for microphone sensitivity mismatches. The delay n_s of speech beamformer 54 and the delay n_n of noise beamformer 55 may be calculated as:
(33) n_s = round(d F_s cos(θ_n) / c), n_n = round(d F_s cos(θ_s) / c)

where d is the microphone spacing, c is the speed of sound, F_s is a sampling frequency, θ_n is an expected direction of arrival of a most commonly present dominant interfering signal, and θ_s is the angle of arrival of the desired speech in a most prevailing headset position.
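The delay computation above can be sketched as follows. This is a minimal illustration, not the claimed implementation; the spacing, sampling rate, and arrival angles passed in are hypothetical example values, and taking the absolute value of the rounded delay is an assumption made so that the delays index past samples.

```python
import math

def beamformer_delays(d, c, fs, theta_n_deg, theta_s_deg):
    """Integer sample delays for the fixed speech and noise beamformers.

    d: microphone spacing (m), c: speed of sound (m/s), fs: sampling
    rate (Hz); theta_n_deg / theta_s_deg are the assumed interferer and
    desired-speech arrival angles in degrees.  Each delay steers the
    corresponding beamformer null toward that direction.
    """
    n_s = abs(round(d * fs * math.cos(math.radians(theta_n_deg)) / c))
    n_n = abs(round(d * fs * math.cos(math.radians(theta_s_deg)) / c))
    return n_s, n_n
```

For a 2 cm spacing at 48 kHz, an interferer assumed at 180° and speech at 30° give delays of a few samples, which is the expected scale for a closely spaced dual-microphone array.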
(34) The instantaneous spatial statistics for an inverse signal-to-noise ratio may be computed as:

(35) ISNR[m] = E_n[m] / E_s[m]

where m is a frame index and E_s[m] and E_n[m] are the smoothed energies of the speech and noise beamformer outputs, updated recursively as

E_i[m] = α_isnr E_i[m-1] + (1 - α_isnr) Ê_i[m]

where α_isnr is a smoothing constant and Ê_i[m] is an instantaneous frame energy. The energies may be computed based on a sum of weighted squares. A weighted averaging method may provide better detection results when compared with a less expensive exponential averaging method. The weights may be assigned to provide more emphasis on a present frame of data and less emphasis on past frames. For example, the weight for a present frame may be 1 and the weights for the past frames may follow a linear relation (e.g., 0.25 for the oldest data and 1 for the latest data among the past frames). Thus, a weighted energy Ê_i[m] for a frame of data x[m,n] may be given by:
(36) Ê_i[m] = Σ_{n=0}^{N-1} w[n] y_i²[m,n]

where N is the number of samples in a frame, w[n] are the frame weights, and y_i[m,n] is a beamformer output. The instantaneous inverse signal-to-noise ratio may be further smoothed using a slow-attack/fast-decay approach, such as given by:

(37) ISNR_s[m] = α_a ISNR_s[m-1] + (1 - α_a) ISNR[m], if ISNR[m] > ISNR_s[m-1]; otherwise ISNR_s[m] = α_d ISNR_s[m-1] + (1 - α_d) ISNR[m]

where α_a and α_d are attack and decay smoothing constants, with α_a > α_d so that the statistic rises slowly and falls quickly.
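The weighted frame energy and the inverse SNR statistic can be sketched as below. This is an illustrative interpretation, not the patented implementation: here the linear weight ramp is applied across the whole frame history, and the small epsilon guarding division by zero is an added assumption.

```python
import numpy as np

def weighted_frame_energy(frames):
    """Weighted sum-of-squares energy over a short history of frames.

    frames: list of 1-D sample arrays, oldest first, newest last.  The
    weights rise linearly from 0.25 for the oldest frame to 1.0 for the
    newest, emphasizing present data over past data.
    """
    k = len(frames)
    weights = np.linspace(0.25, 1.0, k) if k > 1 else np.array([1.0])
    return float(sum(w * np.sum(np.square(f)) for w, f in zip(weights, frames)))

def inverse_snr(e_noise_beam, e_speech_beam, eps=1e-12):
    """Instantaneous inverse SNR: noise-beam energy over speech-beam energy."""
    return e_noise_beam / (e_speech_beam + eps)
```

With two unit-energy frames of four samples each, the oldest contributes at weight 0.25 and the newest at weight 1.0.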
(38) When an acoustic source is close to a microphone, the direct-to-reverberant signal ratio at the microphone is usually high. The direct-to-reverberant ratio usually depends on the reverberation time (RT_60) of a room/enclosure and/or other physical structures that are in the path between the near-field source and the microphone. When the distance between the source and the microphone increases, the direct-to-reverberant ratio decreases due to propagation loss in the direct path, and the energy of the reverberant signal becomes comparable to that of the direct path signal. This concept provides a statistic that may indicate the presence of a near-field signal and that is robust to array position. A cross-correlation sequence between microphones 51 may be computed as:
(39) r_x1x2[m,λ] = Σ_n x_1[m,n] x_2[m,n+λ]

(40) wherein the range of the lag λ is bounded by the maximum possible inter-microphone delay:

(41) -floor(d F_s / c) ≤ λ ≤ floor(d F_s / c)

(42) A maximum normalized correlation statistic may be computed as:

ρ̃[m] = max_λ |r_x1x2[m,λ]| / √(E_x1 E_x2)
where E_x1 and E_x2 correspond to the signal energies of the first and second microphones, respectively. This statistic is further smoothed to get:

ρ[m] = α_ρ ρ[m-1] + (1 - α_ρ) ρ̃[m]

where α_ρ is a smoothing constant.
(43) A spatial resolution of the cross-correlation sequence may be increased by interpolating the cross-correlation sequence using the Lagrange interpolation function. A direction of arrival (DOA) statistic may be estimated by selecting a lag corresponding to a maximum value of the interpolated cross-correlation sequence r̃_x1x2[m,λ]:

(44) λ_max[m] = argmax_λ r̃_x1x2[m,λ]
(45) The selected lag index may then be converted into an angular value by using the following formula:
(46) θ_DOA[m] = arccos(c λ_max[m] / (d F_r))

where F_r = rF_s is an interpolated sampling frequency and r is an interpolation rate. To reduce the estimation error due to outliers, the direction of arrival estimate may be median filtered to provide a smoothed version of a raw direction of arrival estimate. In some embodiments, a median filter window size may be set at three estimates.
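The lag-to-angle conversion can be sketched as follows. The arccos mapping from the plane-wave relation cos(θ) = λc/(dF_r) is an assumption here, as is clamping the argument against numerical overshoot.

```python
import math

def lag_to_angle_deg(lag, d, c, fr):
    """Convert a correlation-peak lag (at interpolated rate fr = r*fs)
    to an arrival angle in degrees via cos(theta) = lag*c/(d*fr)."""
    arg = lag * c / (d * fr)
    arg = max(-1.0, min(1.0, arg))  # clamp against rounding overshoot
    return math.degrees(math.acos(arg))
```

A zero lag maps to 90°, i.e., a source broadside to the two-microphone axis, which is the sanity check for this geometry.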
(47) A technique known as spectral subtraction may be used to reduce noise in an audio system. If s[n] is a clean speech sample corrupted by an additive and uncorrelated noise sample n[n], then a noisy speech sample x[n] may be given by:
x[n]=s[n]+n[n].
Because s[n] and n[n] are uncorrelated, a discrete power spectrum of the noisy speech P_x[k] may be given by:

P_x[k] = P_s[k] + P_n[k]

where P_s[k] and P_n[k] are the discrete power spectrum of speech and the discrete power spectrum of noise, respectively.
(48) If the discrete power spectral density (PSD) of the noise source is completely known, it may be subtracted from the noisy speech signal using what is known as a Wiener filter solution in order to produce clean speech. Specifically:
P_s[k] = P_x[k] - P_n[k].
(49) A frequency response H[k] of the above subtraction process may be written as:

(50) H[k] = (P_x[k] - P_n[k]) / P_x[k] = 1 - P_n[k] / P_x[k]
(51) Typically, a noise source is not known, so the crux of a spectral subtraction algorithm is the estimation of the power spectral density of the noise. For a single-microphone noise reduction solution, the noise is estimated from the noisy speech, which is the only available signal. The noise estimated from noisy speech thus may not be accurate. Therefore, a system may need to adjust the spectral subtraction in order to reduce speech distortion resulting from inaccurate noise estimates. For this reason, many spectral subtraction-based noise reduction methods introduce a parameter that controls the spectral weighting factor, such that frequencies with low signal-to-noise ratio are attenuated and frequencies with high signal-to-noise ratio are not modified. The frequency response above may be modified as:

(52) H[k] = 1 - α P̂_n[k] / P_x[k]

where P̂_n[k] is the power spectrum of the noise estimate, and α is a parameter which controls the spectral weighting factor on a sub-band basis. The response H[k] above may be used in a weighting filter. A clean speech estimate Y[k] may be obtained by applying the response H[k] of the weighting filter to the Fourier transform of the noisy speech signal X[k], as follows:
Y[k]=X[k]H[k].
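The weighted spectral subtraction above can be sketched per frequency bin as follows. The oversubtraction factor and the gain floor values here are illustrative tuning assumptions, not values from the patent.

```python
import numpy as np

def spectral_gain(px, pn_hat, alpha=2.0, gain_floor=0.1):
    """Per-bin spectral-subtraction gain H[k] = 1 - alpha*Pn_hat/Px,
    clamped below by gain_floor so attenuation is bounded.  px and
    pn_hat are arrays of noisy-signal and estimated-noise PSDs."""
    h = 1.0 - alpha * pn_hat / np.maximum(px, 1e-12)
    return np.maximum(h, gain_floor)
```

A bin whose noise estimate is a quarter of the noisy power keeps half its amplitude, while a bin dominated by noise is clamped to the floor rather than driven negative.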
(53) The various spatial statistics described above may be used by audio device 50 as a powerful aid to augment single-channel noise reduction techniques similar to spectral subtraction described above. Such spatial statistics provide information regarding the likelihood of desired speech and noise-only presence conditions. For example, such information may be used in a binary approach to update the background noise whenever a noise-only presence condition is detected. Similarly, the background noise estimation may be frozen if there is a high likelihood of desired speech presence. Further, instead of using such binary approach, audio device 50 may use a multiple state discrete signaling approach to obtain maximum benefits from the spatial statistics by accounting for noise level fluctuations. Specifically, what is known as a modified Doblinger noise estimate may be augmented by audio device 50 with the spatial statistics as further described below. A modified Doblinger noise estimate may be given by:
(54) P̂_n[m,k] = P_x[m,k], if P_x[m,k] < P̂_n[m-1,k]; otherwise P̂_n[m,k] = α_pn P̂_n[m-1,k] + (1 - α_pn) P_x[m,k]

where P̂_n[m,k] is a noise spectral density estimate at spectral bin k, P_x[m,k] is a power spectral density of the noisy signal, and α_pn is a noise update rate that controls the rate at which the background noise is estimated. The minimum statistic condition in the above update equation may render the noise estimate under-biased at all times. This under-biased noise estimate may introduce musical artifacts during the noise reduction process.
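A single frame of this minimum-tracking update can be sketched as follows; it is a simplified illustration of the Doblinger-style rule described above, not the full modified estimator with spatial-statistic control.

```python
import numpy as np

def noise_psd_update(pn_prev, px, alpha_pn=0.95):
    """Minimum-tracking noise PSD update: follow the noisy PSD
    immediately when it drops below the current estimate (the
    under-biasing minimum-statistic branch), otherwise rise slowly
    toward it at rate controlled by alpha_pn."""
    slow_rise = alpha_pn * pn_prev + (1.0 - alpha_pn) * px
    return np.where(px < pn_prev, px, slow_rise)
```

A bin whose noisy power falls is tracked instantly, while a bin whose power rises only nudges the estimate upward, which is what makes the estimate under-biased.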
(55) As shown in
(56) The performance of the spatially-controlled noise reduction algorithm described herein may be improved if the background noise in microphone signal x.sub.1 is reduced. Such background noise reduction may be performed via an adaptive filter architecture implemented by nullformer 60, adaptive filter 74, and combiner 72. Given two microphone signals x.sub.1 and x.sub.2, the adaptive architecture implemented by nullformer 60, adaptive filter 74, and combiner 72 may generate a background noise signal that is closely matched (in a mean square error sense) with the background noise present in one of the microphone signals. Adaptive nullformer 60 may generate a reference signal to adaptive filter 74 by combining the two microphone signals x.sub.1 and x.sub.2 such that the desired speech signal leakage in the reference signal is minimized to avoid speech suppression during the background noise removal process. Specifically, to obtain the reference signal, adaptive nullformer 60 may have a null focused towards the desired speech direction. However, unlike fixed noise beamformer 55, the null for adaptive nullformer 60 may be dynamically modified as a desired speech direction is modified. Combiner 72 may remove the background noise signal generated by adaptive filter 74 from microphone signal x.sub.1.
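The adaptive-filter architecture described above can be sketched with a normalized LMS update. This is a generic NLMS noise canceller standing in for adaptive filter 74 and combiner 72; the filter order and step size are illustrative assumptions, and the dynamic nullformer that produces the reference is not modeled.

```python
import numpy as np

def nlms_noise_canceller(primary, reference, order=8, mu=0.5):
    """Adaptive noise cancellation sketch: an NLMS filter shapes the
    nullformer output `reference` to match the noise in `primary`;
    the residual e[n] (primary minus the filtered reference) is the
    enhanced output, playing the role of combiner 72."""
    w = np.zeros(order)          # adaptive filter coefficients
    buf = np.zeros(order)        # delay line of reference samples
    out = np.zeros(len(primary))
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        e = primary[n] - w @ buf           # cancellation residual
        w += mu * e * buf / (buf @ buf + 1e-12)  # normalized LMS step
        out[n] = e
    return out
```

When the primary signal is pure correlated noise, the residual energy collapses after convergence; desired speech uncorrelated with the reference would pass through in e[n].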
(57) VAD and system controls block 70 may track the desired speech direction as shown in
(58) Speech leakage that may arise from false tracking of a desired speech direction may induce speech suppression in adaptive filter 74. The effects of poor desired speech detection in high noise may be mitigated by ensuring that coefficients of adaptive filter 74 are not updated whenever a speech signal is detected by VAD and system controls 70. Logic inverse to that shown in
(59) Voice activity detection may be performed by VAD and system controls 70 based on an output of speech beamformer 54. Speech beamformer 54 thus helps improve the input signal-to-noise ratio for the voice activity detector, increasing the speech detection performance in noisy conditions while reducing false alarms from competing speech-like interference arriving from the noise rejection direction. Any suitable approach may be used for detecting the presence of speech in a given input signal, as is known in the art.
(60) The inverse signal-to-noise ratio ISNR as shown in
(61) The noise beam signal energy E_n[m] may be used as a background noise level estimate. The instantaneous energy may be smoothed further using a recursive averaging filter to reduce the variance of the noise level estimate. The measured noise level may be split into five different noise levels, namely, very-low, low, medium, high, and very-high noise levels. As shown in
(62) In order to avoid frequent noise mode state transitions, the instantaneous noise modes from past history may be used to derive a slowly varying noise mode. The discrete noise mode distribution may be updated every frame based on instantaneous noise mode values from current and past frames. The noise mode that occurred most frequently is chosen as the current noise mode. For example, if the noise mode distribution for the past 2000 frames consists of very-low: 10 frames, low: 500 frames, medium: 900 frames, high: 500 frames, and very-high: 90 frames, then the current noise mode may be set to medium.
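The majority-vote rule above can be sketched in a few lines; the list-of-labels representation of the frame history is an illustrative choice.

```python
from collections import Counter

def slow_noise_mode(history):
    """Most frequent instantaneous noise mode over the history window,
    yielding a slowly varying mode that resists brief transitions."""
    return Counter(history).most_common(1)[0][0]
```

Feeding it the example distribution from the text (10/500/900/500/90 frames across the five modes) selects "medium".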
(63) Accordingly, the inverse signal-to-noise ratio ISNR thresholds upperThresh, medThresh and lowerThresh may be dynamically adjusted based on the noise mode as follows:
dyn[upper|med|lower]Thresh = [upper|med|lower]Thresh + [upper|med|lower]ThreshOffset[i], i ∈ {very-low, low, medium, high, very-high}
where the offset values for the thresholds may be determined empirically and may be tuned as a function of desired speech acceptance and noise rejection direction ranges. Similarly, the maximum achievable noise reduction limit in each spectral bin may be dynamically adjusted to maintain a good trade-off between noise reduction and speech suppression. For example, in extremely high noise conditions, it is preferable to apply less noise reduction while preserving the speech, since spectral subtraction algorithms in general suppress speech in extremely high noise conditions because the SNR is low at all frequency bins. Similarly, to reduce residual noise artifacts, the spectral subtraction-based gain calculation may be substituted by a linear attenuation function at low/medium noise conditions if the spatial statistics point to a high likelihood of noise-only conditions, as shown in U.S. Pat. No. 7,454,010, which is incorporated herein by reference.
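The per-mode threshold adjustment can be sketched as below. The dictionary layout and the numeric offsets are illustrative assumptions; the patent only specifies that offsets are empirically tuned per noise mode.

```python
def dynamic_thresholds(base, offsets, noise_mode):
    """Offset each base ISNR threshold ('upper', 'med', 'lower') by an
    empirically tuned value selected by the current noise mode."""
    return {name: base[name] + offsets[name][noise_mode] for name in base}
```

For instance, a "high" noise mode might push all three thresholds upward so that noise updates remain active despite elevated beam energies.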
(64) The foregoing describes systems and methods for implementing a dual-microphone, non-linear beamforming technique that is robust to changes in array position with respect to a user's mouth. The technique provides tuning flexibility wherein the speech acceptance and noise rejection directions may be intuitively controlled by appropriate thresholds. In addition, the proposed technique may be easily modified for use in a headset with a fixed desired speech direction. The performance of the technique may be further improved if a robust near-field detector, such as that disclosed in U.S. patent application Ser. No. 15/584,347 and incorporated herein by reference, is augmented with the non-linear beamformer described herein.
(65) It should be understood, especially by those having ordinary skill in the art with the benefit of this disclosure, that the various operations described herein, particularly in connection with the figures, may be implemented by other circuitry or other hardware components. The order in which each operation of a given method is performed may be changed, and various elements of the systems illustrated herein may be added, reordered, combined, omitted, modified, etc. It is intended that this disclosure embrace all such modifications and changes and, accordingly, the above description should be regarded in an illustrative rather than a restrictive sense.
(66) Similarly, although this disclosure makes reference to specific embodiments, certain modifications and changes can be made to those embodiments without departing from the scope and coverage of this disclosure. Moreover, any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element.
(67) Further embodiments likewise, with the benefit of this disclosure, will be apparent to those having ordinary skill in the art, and such embodiments should be deemed as being encompassed herein.