Multi-microphone method for estimation of target and noise spectral variances for speech degraded by reverberation and optionally additive noise
09723422 · 2017-08-01
Assignee
Inventors
Cpc classification
H04R25/30
ELECTRICITY
H04R25/407
ELECTRICITY
International classification
Abstract
The application relates to an audio processing system and a method of processing a noisy (e.g. reverberant) signal comprising first (v) and optionally second (w) noise signal components and a target signal component (x), the method comprising a) Providing or receiving a time-frequency representation Y.sub.i(k,m) of a noisy audio signal y.sub.i at an i.sup.th input unit, i=1, 2, . . . , M, where M≧2; b) Providing (e.g. predefined spatial) characteristics of said target signal component and said noise signal component(s); and c) Estimating spectral variances or scaled versions thereof λ.sub.V, λ.sub.X of said first noise signal component v (representing reverberation) and said target signal component x, respectively, said estimates of λ.sub.V and λ.sub.X being jointly optimal in maximum likelihood sense, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) (and W.sub.i(k,m)) of respective signals y.sub.i(n), and signal components x.sub.i, and v.sub.i (and w.sub.i) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m) and V.sub.i(k,m) (and W.sub.i(k,m)) are uncorrelated. An advantage of the invention is that it provides the basis for an improved intelligibility of an input speech signal. The invention may e.g. be used for hearing assistance devices, e.g. hearing aids.
Claims
1. A method of processing a noisy audio signal y(n) including a target signal component x(n) and a first noise signal component v(n), n representing time, the method comprising: providing or receiving a time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two, in a number of frequency bands and a number of time instances, k being a frequency band index and m being a time index; providing characteristics of said target signal component represented by a look vector d(k,m), whose elements (i=1, 2, . . . , M) define the frequency and time dependent absolute acoustic transfer function from a target signal source to each of the M input units, or the relative acoustic transfer function of the ith input unit to a reference input unit, or an inter input covariance matrix d(k,m)·d(k,m).sup.H; providing characteristics of said first noise signal component defined by an inter input unit covariance matrix C.sub.v(k,m); estimating spectral variances or scaled versions thereof λ.sub.V, λ.sub.X of said first noise signal component v and said target signal component x, respectively, as a function of frequency index k and time index m, said estimates of λ.sub.V and λ.sub.X being jointly optimal in maximum likelihood sense, jointly optimal being taken to mean that both of the spectral variance λ.sub.V, λ.sub.X are estimated in the same maximum likelihood estimation process, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m) and V.sub.i(k,m) are uncorrelated; and processing the noisy audio signal y.sub.i(n) based on the estimated spectral variances or scaled versions thereof to provide a noise reduced signal.
2. A method according to claim 1 wherein the noisy audio signal y.sub.i(n) comprises a reverberant signal comprising a target signal component and a reverberation signal component.
3. A method according to claim 1 wherein said characteristics of the first noise signal component v is represented by an inter input unit covariance matrix C.sub.v or a scaled version thereof and wherein said first noise signal component v.sub.i(n) is essentially spatially isotropic.
4. A method according to claim 1 wherein said first noise signal component v.sub.i(n) is constituted by late reverberations.
5. A method according to claim 1 wherein the first noise signal component is a reverberation signal component v(n), and the noisy audio signal y(n) further comprises a second noise signal component being an additive noise signal component w(n), and wherein the method further comprises providing characteristics of said second noise signal component defined by a predetermined inter input unit covariance matrix C.sub.w(k,m).
6. A method according to claim 5 wherein the noisy audio signal y.sub.i(n) at the i.sup.th input unit comprises a target signal component x.sub.i(n), a reverberation signal component v.sub.i(n), and an additive noise component w.sub.i(n).
7. A method according to claim 5 wherein the characteristics of said second noise signal component w is represented by a predetermined inter input unit covariance matrix C.sub.W of the additive noise.
8. A method according to claim 1 wherein the characteristics of the target signal is represented by a look vector d(k,m) whose elements (i=1, 2, . . . , M) define the frequency and time dependent absolute acoustic transfer function from a target signal source to each of the M input units, or the relative acoustic transfer function from the i.sup.th input unit to a reference input unit.
9. A method according to claim 8 wherein said look vector d(k,m) and said noise covariance matrix C.sub.V(k,m), and optionally C.sub.W(k,m), are determined in an off-line procedure.
10. A method according to claim 1 further comprising: estimating the inter input unit covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal based on a number D of observations.
11. A method according to claim 10 wherein said maximum-likelihood estimates of the spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) of the target signal component x and the noise signal component v, respectively, are derived from estimates of the inter-input unit covariance matrices C.sub.Y(k,m), C.sub.X(k,m), C.sub.V(k,m), and optionally C.sub.W(k,m), and the look vector d(k,m).
12. A method according to claim 1 wherein processing the noisy audio signal y.sub.i(n) based on the estimated spectral variances or scaled versions thereof to provide a noise reduced signal comprises: applying beamforming to the noisy audio signal y(n) providing a beamformed signal and single channel post filtering to the beamformed signal to suppress noise signal components from a direction of the target signal and to provide the resulting noise reduced signal.
13. A method according to claim 12 wherein said beamforming is a target signal enhancement spatial filtering based on MVDR filtering applied to the time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, to provide a beamformed signal wherein signal components from other directions than a direction of the target signal component are attenuated, while leaving signal components from the direction of the target signal component un-attenuated.
14. A method according to claim 12 wherein gain values g.sub.sc(k,m) applied to the beamformed signal in the single channel post filtering process are based on the estimates of the spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) of the target signal component x and the first noise signal component v, respectively.
15. A data processing system comprising: a processor; and a memory having stored thereon program code which when executed cause the processor to perform the method of claim 1.
16. An audio processing system for processing a noisy audio signal y comprising a target signal component x and a first noise signal component v, the audio processing system comprising: a multitude M of input units adapted to provide or to receive a time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two, in a number of frequency bands and a number of time instances, k being a frequency band index and m being a time index; a look vector d(k,m), whose elements (i=1, 2, . . . , M) define the frequency and time dependent absolute acoustic transfer function from a target signal source to each of the M input units, or the relative acoustic transfer function from the ith input unit to a reference input unit, or an inter input covariance matrix d(k,m)·d(k,m).sup.H, for the target signal component; an inter-input unit covariance matrix C.sub.v(k,m) for the first noise signal component, or scaled versions thereof; a covariance estimation unit for estimating an inter input unit covariance matrix Ĉ.sub.Y(k,m), or a scaled version thereof, of the noisy audio signal based on the time-frequency representation Y.sub.i(k,m) of the noisy audio signals y.sub.i(n); and a spectral variance estimation unit for estimating spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) or scaled versions thereof of the target signal component x and the first noise signal component v, respectively, based on said look vector d(k,m), said inter-input unit covariance matrix C.sub.v(k,m), and the covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal, or scaled versions thereof, wherein said estimates of λ.sub.V and λ.sub.X are jointly optimal in maximum likelihood sense, jointly optimal being taken to mean that both of the spectral variance λ.sub.V and λ.sub.X are estimated in the same maximum likelihood estimation process, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m) and V.sub.i(k,m) are uncorrelated; and a signal processing unit adapted to process the noisy audio signal y.sub.i(n) based on the estimated spectral variances or scaled versions thereof to provide a noise reduced signal.
17. An audio processing system according to claim 16 wherein the noisy audio signal y(n) comprises a target signal component x(n), a first noise signal component being a reverberation signal component v(n), and a second noise signal component being an additive noise signal component w(n), and wherein the audio processing system comprises a predetermined inter input unit covariance matrix C.sub.W of the additive noise.
18. An audio processing system according to claim 17 wherein the spectral variance estimation unit is configured to estimate spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) or scaled versions thereof of the target signal component x and the first noise signal component v, respectively, based on said look vector d(k,m), said inter-input unit covariance matrix C.sub.v(k,m) of the first noise component, said inter-input unit covariance matrix C.sub.W(k,m) of the second noise component, and said covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal, or scaled versions thereof, wherein said estimates of λ.sub.V and λ.sub.X are jointly optimal in maximum likelihood sense, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), V.sub.i(k,m), and W.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), v.sub.i(n), w.sub.i(n) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m), V.sub.i(k,m) and W.sub.i(k,m) are mutually uncorrelated.
19. An audio processing system according to claim 16 further comprising: one of a hearing aid, a headset, an earphone, and an ear protection device, or a combination thereof.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) The disclosure will be explained more fully below in connection with a preferred embodiment and with reference to the drawings in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9) The figures are schematic and simplified for clarity, and they just show details which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference signs are used for identical or corresponding parts.
(10) Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
DETAILED DESCRIPTION OF EMBODIMENTS
(11)
(12)
(13)
(14)
(15)
(16)
(17) In an embodiment, at least one of the M input units IU.sub.i comprises an input transducer, e.g. a microphone for converting an input sound to an electric input signal (cf. e.g.
(18) A. Two Microphone Maximum-Likelihood Estimation of Speech and Late-Reverberation Spectral Variances for Speech Signals in the Presence of Reverberation (Only) (
(19) Another embodiment of an audio processing device according to the present disclosure illustrating a more specific implementation (but comprising the same elements as shown and discussed in
(20) In the following, the 2-microphone system is described in more detail. Let us assume that one target speaker is present in the acoustical scene, and that the signal reaching the hearing aid microphones consists of the two components a) and b) described above. The goal is to estimate the power at given frequencies and time instants of these two signal components. The signal reaching microphone number i may be written as
y.sub.i(n)=x.sub.i(n)+v.sub.i(n),
where x.sub.i(n) is the target signal component at the microphone, and v.sub.i(n) is the undesired reverberation component, which we assume is uncorrelated with the target signal x.sub.i(n), and y.sub.i(n) is the observable reverberant signal. The reverberant signal at each microphone is passed through an analysis filterbank (AFB) leading to a signal in the time-frequency domain,
Y.sub.i(k,m)=X.sub.i(k,m)+V.sub.i(k,m),
where k is a frequency index and m is a time (frame) index (and i=1, 2). For convenience, these spectral coefficients may be thought of as Discrete-Fourier Transform (DFT) coefficients.
Since all operations are identical for each frequency index, we skip the frequency index in the following for notational convenience. For example, instead of Y.sub.i(k,m), we simply write Y.sub.i(m).
(21) For a given frequency index k and time index m, noisy spectral coefficients for each microphone are collected in a vector (of size 2, since M=2; in general of size M), T indicating vector (matrix) transposition:
Y(m)=[Y.sub.1(m)Y.sub.2(m)].sup.T,
X(m)=[X.sub.1(m)X.sub.2(m)].sup.T,
and
V(m)=[V.sub.1(m)V.sub.2(m)].sup.T,
so that
Y(m)=X(m)+V(m).
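By way of a non-limiting illustration, the analysis filterbank and the stacking of per-microphone coefficients into the vector Y(m) may be sketched as a windowed DFT. The function name `analysis_filterbank`, the Hann window, the frame length and the hop size are assumptions made for this sketch only; the disclosure does not prescribe a particular filterbank.

```python
import numpy as np

def analysis_filterbank(y, frame_len=64, hop=32):
    """Windowed DFT of a multi-microphone signal (illustrative stand-in for an AFB).

    y: real array of shape (M, n_samples), one row per microphone.
    Returns Y with shape (M, K, n_frames), K = frame_len // 2 + 1,
    so that Y[:, k, m] is the vector Y(m) for frequency index k.
    """
    y = np.atleast_2d(np.asarray(y, dtype=float))
    window = np.hanning(frame_len)                      # assumed window choice
    n_frames = 1 + (y.shape[1] - frame_len) // hop
    Y = np.empty((y.shape[0], frame_len // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = y[:, m * hop : m * hop + frame_len] * window
        Y[:, :, m] = np.fft.rfft(frame, axis=1)
    return Y

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 512))   # stand-in target component at M=2 microphones
v = rng.standard_normal((2, 512))   # stand-in reverberation component
Y = analysis_filterbank(x + v)
Y_vec = Y[:, 5, 3]                  # the vector Y(m) for k=5, m=3
```

Because the transform is linear, the additive model y=x+v in the time domain carries over to Y(m)=X(m)+V(m) in every time-frequency tile.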
(22) For a given frame index m, and frequency index k (suppressed in the notation), let d′(m)=[d′.sub.1(m) d′.sub.2(m)] denote a vector (of size 2) whose elements d.sub.1′ and d.sub.2′ represent the (generally complex-valued) acoustic transfer function from target sound source to each microphone (Mic.sub.1, Mic.sub.2), respectively. It is often more convenient to operate with a normalized version of d′(m). More specifically, let
d(m)=d′(m)/d′.sub.i(m)
denote a vector, normalized by the transfer function d′.sub.i(m) to a chosen reference microphone i, whose elements (M in total, here M=2) represent the relative transfer function from the target source to the respective microphone. This implies that the i.sup.th element in this vector equals one, and the remaining elements describe the acoustic transfer function from the other microphones to this reference microphone.
(23) This means that the noise free microphone vector X(m) (which cannot be observed directly), can be expressed as
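X(m)=X.sub.i(m)d(m),
where X.sub.i(m) is the spectral coefficient of the target signal at the reference microphone i, with spectral variance λ.sub.X(m)=E|X.sub.i(m)|.sup.2.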
(24) The inter-microphone covariance matrix for the clean signal is then given by
C.sub.X(m)=λ.sub.X(m)d(m)d(m).sup.H,
where H denotes Hermitian transposition.
(25) In an embodiment, the inter-microphone covariance matrix of the late-reverberation is modelled as the covariance arising from an isotropic field,
C.sub.V(m)=λ.sub.V(m)C.sub.iso,
where C.sub.iso is the covariance matrix of the late-reverberation, and λ.sub.V(m) is the reverberation power at the reference microphone, which, obviously, is time-varying to take into account the time-varying power level of reverberation.
(26) The inter-microphone covariance matrix of the reverberant signal Y(m) is given by
C.sub.Y(m)=C.sub.X(m)+C.sub.V(m),
because the target and late-reverberation signals are assumed to be uncorrelated. Inserting expressions from above, we arrive at the following expression for C.sub.Y(m),
C.sub.Y(m)=λ.sub.X(m)d(m)d(m).sup.H+λ.sub.V(m)C.sub.iso.
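The signal model above may be illustrated numerically as follows. The sketch normalizes an absolute look vector to a reference microphone and assembles C.sub.Y from given variances. The helper names `relative_look_vector` and `model_covariance`, and all numerical values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def relative_look_vector(d_abs, ref=0):
    """Normalize absolute transfer functions d' so that element `ref` equals one."""
    d_abs = np.asarray(d_abs, dtype=complex)
    return d_abs / d_abs[ref]

def model_covariance(lam_x, lam_v, d, C_iso):
    """C_Y = lam_X d d^H + lam_V C_iso for one time-frequency tile."""
    d = np.asarray(d, dtype=complex).reshape(-1, 1)
    return lam_x * (d @ d.conj().T) + lam_v * np.asarray(C_iso, dtype=complex)

# Illustrative values only (M = 2 microphones).
d_abs = np.array([0.8 * np.exp(0.3j), 0.5 * np.exp(-0.4j)])
d = relative_look_vector(d_abs)
C_iso = np.array([[1.0, 0.6], [0.6, 1.0]])   # unit diagonal at the reference microphone
C_Y = model_covariance(2.0, 0.5, d, C_iso)
```

The target term is a rank-one matrix, whereas the isotropic term is full rank; it is this structural difference that allows the two variances to be separated.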
(27) In practice, vector d(m) may be estimated in an off-line calibration procedure (if we assume the target to be in a fixed location compared to the hearing aid microphone array, i.e., if the user “chooses with the nose”), or it may be estimated online.
(28) The matrix C.sub.iso is preferably estimated off-line by exposing hearing aids mounted on a dummy head to a reverberant sound field (e.g. approximated as an isotropic field), and measuring the resulting inter-microphone covariance matrix.
(29) Given the expression above, we wish to find estimates of spectral variances λ.sub.X(m) and λ.sub.V(m). In particular, it is possible to derive the following expressions for maximum likelihood estimates of these quantities. Let
(30) Ĉ.sub.Y(m)=(1/D)Σ.sub.j=m−D+1.sup.m Y(j)Y(j).sup.H
denote an estimate of the noisy inter-microphone covariance matrix C.sub.Y(m), based on D observations. Ĉ.sub.Y is determined in a unit for estimating inter-microphone covariance (CovEU in
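(31) λ.sub.V,ml(m)=(1/(M−1))tr(Q.sub.u(m)Ĉ.sub.Y(m)C.sub.iso.sup.−1),
where tr(·) denotes the matrix trace, with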
Q.sub.u(m)=I−d(m)(d(m).sup.HC.sub.iso.sup.−1d(m)).sup.−1d(m).sup.HC.sub.iso.sup.−1,
I being the identity matrix (vector), and M=2 is the number of microphones.
(32) Furthermore,
λ.sub.X,ml(m)=w.sub.mvdr.sup.H(m)(Ĉ.sub.Y(m)−λ.sub.V,ml(m)C.sub.iso)w.sub.mvdr(m),
where
(33) w.sub.mvdr(m)=C.sub.iso.sup.−1d(m)/(d(m).sup.HC.sub.iso.sup.−1d(m))
is a vector of filter weights for a minimum-variance distortionless response (MVDR) beamformer, see e.g. [Haykin; 2001]. The filter weights w.sub.mvdr(m) (w_mvdr(m,k) in
(34) The two boxed equations above constitute an embodiment of our proposed method for estimating spectral variances of a target speaker in reverberation, as a function of time (index m) and frequency (suppressed index k).
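For illustration only, the estimation chain of this section may be sketched as below. The sketch assumes the sample covariance is the average of the outer products Y(j)Y(j).sup.H over D frames, that λ.sub.V,ml is the trace of Q.sub.u Ĉ.sub.Y C.sub.iso.sup.−1 divided by M−1, and that the MVDR weights are C.sub.iso.sup.−1 d normalized by d.sup.H C.sub.iso.sup.−1 d. These are the standard forms consistent with the definitions of Q.sub.u and λ.sub.X,ml above; the function names are hypothetical.

```python
import numpy as np

def sample_covariance(Y_frames):
    """Y_frames: (M, D) complex array; returns (1/D) * sum_j Y(j) Y(j)^H."""
    Y_frames = np.asarray(Y_frames, dtype=complex)
    return Y_frames @ Y_frames.conj().T / Y_frames.shape[1]

def mvdr_weights(d, C_noise):
    """w = C^-1 d / (d^H C^-1 d), distortionless towards the look vector d."""
    Ci_d = np.linalg.solve(C_noise, d)
    return Ci_d / (d.conj() @ Ci_d)

def ml_variances(C_Y_hat, d, C_iso):
    """Return (lam_X_ml, lam_V_ml) for one time-frequency tile."""
    M = d.shape[0]
    C_iso_inv = np.linalg.inv(C_iso)
    denom = d.conj() @ C_iso_inv @ d
    # Q_u = I - d (d^H C_iso^-1 d)^-1 d^H C_iso^-1
    Q_u = np.eye(M) - np.outer(d, d.conj() @ C_iso_inv) / denom
    lam_v = np.real(np.trace(Q_u @ C_Y_hat @ C_iso_inv)) / (M - 1)
    w = mvdr_weights(d, C_iso)
    lam_x = np.real(w.conj() @ (C_Y_hat - lam_v * C_iso) @ w)
    return lam_x, lam_v

# Sanity check with an exact model covariance (illustrative values).
d = np.array([1.0, 0.7 * np.exp(0.5j)])
C_iso = np.array([[1.0, 0.6], [0.6, 1.0]])
C_Y = 2.0 * np.outer(d, d.conj()) + 0.5 * C_iso
lam_x_ml, lam_v_ml = ml_variances(C_Y, d, C_iso)
```

When fed the exact model covariance, the sketch returns the variances used to build it. With a finite number D of frames the estimates fluctuate and may even become negative; the sketch does not clip them.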
(35) The spectral variances λ.sub.X(m) and λ.sub.V(m) have several usages as exemplified in the following sections A1 and A2.
(36) A1. Direct-to-Reverberation Ratio Estimation
(37) The ratio λ.sub.X(m)/λ.sub.V(m) can be seen as an estimate of the direct-to-reverberation ratio (DRR). The DRR correlates with the distance to the sound source [Hioka et al.; 2011], and is also linked to speech intelligibility. Having a DRR estimate available in a hearing assistance device allows e.g. the device to change to a relevant processing strategy, or to inform the user of the hearing assistance device that the device finds the processing conditions difficult, etc.
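As a minimal sketch, the DRR estimate may be expressed in decibels. The decibel scaling and the small floor guarding against zero or negative variance estimates are assumptions of this example and are not taken from the disclosure.

```python
import math

def drr_db(lam_x, lam_v, floor=1e-12):
    """Direct-to-reverberation ratio lam_X / lam_V in dB (floor is an assumed guard)."""
    return 10.0 * math.log10(max(lam_x, floor) / max(lam_v, floor))

drr = drr_db(2.0, 0.5)   # a ratio of 4, i.e. about 6 dB
```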
(38) A2 De-Reverberation
(39) A common strategy for de-reverberation in the time-frequency domain is to suppress the time-frequency tiles where the target-to-reverb ratio is small and maintain the time-frequency tiles where the target-to-reverb ratio is large (or suppress such TF-tiles less). The perceptual result of such processing is a target signal where the reverberation has been reduced. The crucial component in any such system is to determine from the available reverberant signal which time-frequency tiles are dominated by reverberation, and which are not.
(40) As before, reverberant microphone signals y.sub.i are decomposed into a time-frequency representation, using analysis filterbanks (AFB in
{tilde over (Y)}(m)={tilde over (X)}(m)+{tilde over (V)}(m),
where
{tilde over (Y)}(m)=w(m).sup.HY(m),
{tilde over (X)}(m)=w(m).sup.HX(m),
and
{tilde over (V)}(m)=w(m).sup.HV(m),
where, as before, the frequency index k for notational convenience has been suppressed.
(41) We are interested in estimates of the power of the target component and of the late-reverberation component entering the single-channel post-processing filter. These can be found using our estimated spectral variances as
{tilde over (λ)}.sub.X,ml(m)=E|w(m).sup.HX(m)|.sup.2=λ.sub.X,ml(m)|w(m).sup.Hd(m)|.sup.2,
and
{tilde over (λ)}.sub.V,ml(m)=E|w(m).sup.HV(m)|.sup.2=λ.sub.V,ml(m)w(m).sup.HC.sub.isow(m),
respectively.
(42) So, the power of the target component and of the late-reverberation component entering the single-channel post-processing filter can be found from our maximum-likelihood estimates of spectral variances, λ.sub.X,ml(m) and λ.sub.V,ml(m), and quantities which are otherwise available.
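The mapping from the estimated spectral variances to the powers at the beamformer output may be sketched as follows. The function name `output_powers` and the numerical values are hypothetical; MVDR weights are used only as an example of a beamformer w(m).

```python
import numpy as np

def output_powers(w, d, C_iso, lam_x, lam_v):
    """Target and late-reverberation power at the output of beamformer w."""
    lam_x_tilde = lam_x * abs(w.conj() @ d) ** 2            # lam_X |w^H d|^2
    lam_v_tilde = lam_v * np.real(w.conj() @ C_iso @ w)     # lam_V w^H C_iso w
    return lam_x_tilde, lam_v_tilde

d = np.array([1.0, 0.7 * np.exp(0.5j)])
C_iso = np.array([[1.0, 0.6], [0.6, 1.0]])
Ci_d = np.linalg.solve(C_iso, d)
w = Ci_d / (d.conj() @ Ci_d)          # MVDR weights, chosen only as an example
lam_x_tilde, lam_v_tilde = output_powers(w, d, C_iso, lam_x=2.0, lam_v=0.5)
```

For an MVDR beamformer the target power passes unchanged (w.sup.H d=1), while the reverberation power is scaled by 1/(d.sup.H C.sub.iso.sup.−1 d).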
(43) The single-channel post-processing filter then uses the estimates {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) to find an appropriate gain g.sub.SC(m) to apply to the beamformer output, {tilde over (Y)}(m). That is, g.sub.SC(m) may generally be expressed as a function of {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) and potentially other parameters. For example, for a Wiener gain function, we have (e.g., [Loizou; 2013])
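(44) g.sub.wien(m)={tilde over (λ)}.sub.X,ml(m)/({tilde over (λ)}.sub.X,ml(m)+{tilde over (λ)}.sub.V,ml(m)),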
whereas for the Ephraim-Malah gain function [Ephraim-Malah; 1984], we have
g.sub.em(m)=ƒ({tilde over (λ)}.sub.X,ml(m)/{tilde over (λ)}.sub.V,ml(m),|{tilde over (Y)}(m)|.sup.2/{tilde over (λ)}.sub.V,ml(m)).
(45) Many other possible gain functions exist, but they are typically a function of both {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m), and potentially other parameters.
(46) Finally, the gain function g.sub.SC(m) is applied to the beamformer output {tilde over (Y)}(m) to result in the de-reverberated time-frequency tile {circumflex over (X)}(m), i.e.,
{circumflex over (X)}(m)=g.sub.SC(m){tilde over (Y)}(m).
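A minimal sketch of the post-filter step, assuming the textbook Wiener gain (target power divided by the sum of target and reverberation power), is given below. Clipping negative variance estimates to zero and the small constant `eps` are assumptions added for numerical robustness and are not part of the disclosure.

```python
def wiener_gain(lam_x_tilde, lam_v_tilde, eps=1e-12):
    """Wiener gain from the target and reverberation powers at the beamformer output."""
    lam_x_tilde = max(lam_x_tilde, 0.0)   # assumed guard against negative estimates
    lam_v_tilde = max(lam_v_tilde, 0.0)
    return lam_x_tilde / (lam_x_tilde + lam_v_tilde + eps)

Y_tilde = 0.3 - 0.4j            # illustrative beamformer output for one tile
g = wiener_gain(2.0, 0.5)       # gain is close to 0.8 for these values
X_hat = g * Y_tilde             # de-reverberated time-frequency tile
```

The gain lies between 0 and 1 and approaches 1 in tiles dominated by the target, so such tiles are left nearly untouched while reverberation-dominated tiles are attenuated.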
(47) In an embodiment of the system of
(48) B. Two Microphone Maximum-Likelihood Estimation of Speech and Late-Reverberation Spectral Variances for Speech Signals in the Presence of Reverberation and Additive Noise (
(49) The following outline illustrates yet another embodiment of an audio processing device according to the present disclosure shown in
(50) Let us assume that one target speaker is present in the acoustical scene, and that the signal reaching the hearing aid microphones consists of the three components a), b), and c) described above. The goal is to estimate the power at given frequencies and time instants of the signal components a) and b). The observable reverberant signal y.sub.i(n) reaching microphone number i may be written as
y.sub.i(n)=x.sub.i(n)+v.sub.i(n)+w.sub.i(n),
where x.sub.i(n) is the target signal component at the microphone, v.sub.i(n) is the undesired reverberation component, and w.sub.i(n) is the additive noise component, which are all assumed to be mutually uncorrelated. The reverberant signal at each microphone is passed through an analysis filter bank leading to a signal in the time-frequency domain,
Y.sub.i(k,m)=X.sub.i(k,m)+V.sub.i(k,m)+W.sub.i(k,m),
where k is a frequency index and m is a time (frame) index. For convenience, these spectral coefficients may be thought of as Discrete-Fourier Transform (DFT) coefficients.
(51) Since all operations are identical for each frequency index, we skip the frequency index in the following for notational convenience. For example, instead of Y.sub.i(k,m), we simply write Y.sub.i(m).
(52) For a given frequency index k and time index m, noisy spectral coefficients for each microphone are collected in a vector,
Y(m)=[Y.sub.1(m)Y.sub.2(m)].sup.T,
X(m)=[X.sub.1(m)X.sub.2(m)].sup.T,
V(m)=[V.sub.1(m)V.sub.2(m)].sup.T,
and
W(m)=[W.sub.1(m)W.sub.2(m)].sup.T
so that
Y(m)=X(m)+V(m)+W(m).
(53) For a given frame index m, and frequency index k (suppressed in the notation), let
d′(m)=[d′.sub.1(m)d′.sub.2(m)]
denote the (generally complex-valued) acoustic transfer function from target sound source to each microphone. It is often more convenient to operate with a normalized version of d′(m). More specifically, let
d(m)=d′(m)/d′.sub.i(m)
denote a vector, normalized by the transfer function d′.sub.i(m) to a chosen reference microphone i, whose elements represent the relative transfer function from the target source to the respective microphone. This implies that the i.sup.th element in this vector equals one, and the remaining elements describe the acoustic transfer function from the other microphones to this reference microphone.
(54) This means that the noise free microphone vector X(m) (which cannot be observed directly), can be expressed as
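X(m)=X.sub.i(m)d(m),
where X.sub.i(m) is the spectral coefficient of the target signal at the reference microphone i, with spectral variance λ.sub.X(m)=E|X.sub.i(m)|.sup.2.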
(55) The inter-microphone covariance matrix for the clean signal is then given by
C.sub.X(m)=λ.sub.X(m)d(m)d(m).sup.H,
where H denotes Hermitian transposition.
(56) We model the inter-microphone covariance matrix of the late-reverberation as the covariance arising from an isotropic field,
C.sub.V(m)=λ.sub.V(m)C.sub.iso,
where C.sub.iso is the covariance matrix of the late-reverberation, normalized to have a value of 1 at the diagonal element corresponding to reference microphone, and λ.sub.V(m) is the reverberation power at the reference microphone, which, obviously, is time-varying to take into account the time-varying power level of reverberation.
(57) Finally, we assume that the covariance matrix of the additive noise is known and time-invariant. In practice, this matrix can be estimated from noise-only signal regions preceding speech activity, using a voice-activity detector.
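One way to obtain such an estimate of C.sub.W is sketched below, assuming that per-frame voice-activity decisions are already available from some detector. The function name `noise_covariance` and the boolean flag array `is_speech` are hypothetical; the disclosure does not specify the detector or the averaging scheme.

```python
import numpy as np

def noise_covariance(Y_frames, is_speech):
    """Average Y(j) Y(j)^H over frames flagged as noise-only.

    Y_frames: (M, n_frames) complex coefficients of one frequency band.
    is_speech: boolean array of length n_frames from an (assumed) voice-activity detector.
    """
    Y_frames = np.asarray(Y_frames, dtype=complex)
    noise_only = Y_frames[:, ~np.asarray(is_speech, dtype=bool)]
    if noise_only.shape[1] == 0:
        raise ValueError("no noise-only frames available")
    return noise_only @ noise_only.conj().T / noise_only.shape[1]

frames = np.array([[1.0, 2.0, 1j],
                   [1.0, 0.0, 1.0]])
C_W = noise_covariance(frames, is_speech=[False, True, False])
```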
(58) The inter-microphone covariance matrix of the noisy and reverberant signal is then given by
C.sub.Y(m)=C.sub.X(m)+C.sub.V(m)+C.sub.W,
because the target, the late-reverberation, and the noise were assumed mutually uncorrelated. As mentioned, C.sub.W is assumed known and constant (hence the lack of time-index). Inserting expressions from above, we arrive at the following expression for C.sub.Y(m),
C.sub.Y(m)=λ.sub.X(m)d(m)d(m).sup.H+λ.sub.V(m)C.sub.iso+C.sub.W.
(59) In practice, vector d(m) may be estimated in an off-line calibration procedure (if we assume the target to be in a fixed location compared to the hearing aid microphone array, i.e., if the user “chooses with the nose”), or it may be estimated online.
(60) Matrix C.sub.iso is estimated offline by exposing hearing aids mounted on a dummy head to a reverberant sound field (e.g. approximated as an isotropic field), and measuring the resulting inter-microphone covariance matrix.
(61) Given the expression above, we wish to find estimates of spectral variances λ.sub.X(m) and λ.sub.V(m). In particular, it is possible to derive the following expressions for maximum likelihood estimates of these quantities. Let
(62) Ĉ.sub.Y(m)=(1/D)Σ.sub.j=m−D+1.sup.m Y(j)Y(j).sup.H
denote an estimate of the noisy inter-microphone covariance matrix C.sub.Y(m), based on D observations.
B1 Special Case: No Additive Noise (C.sub.W=0)
(63) We first consider the case when there is no additive noise present (C.sub.W=0), because in this case the resulting ML estimators are particularly simple. In practice, the noise is never completely absent, but the following results hold for high signal-to-noise ratios, i.e., when C.sub.W is small compared to C.sub.X(m), or in very reverberant situations, i.e., when C.sub.W is small compared to C.sub.V(m).
(64) In this case, the following maximum-likelihood estimates of spectral variances λ.sub.X(m) and λ.sub.V(m) can be derived:
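(65) λ.sub.V,ml(m)=(1/(M−1))tr(Q.sub.u(m)Ĉ.sub.Y(m)C.sub.iso.sup.−1),
where tr(·) denotes the matrix trace and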
Q.sub.u(m)=I−d(m)(d(m).sup.HC.sub.iso.sup.−1d(m)).sup.−1d(m).sup.HC.sub.iso.sup.−1,
and M=2 is the number of microphones. Furthermore,
λ.sub.X,ml(m)=w.sub.mvdr.sup.H(m)(Ĉ.sub.Y(m)−λ.sub.V,ml(m)C.sub.iso)w.sub.mvdr(m),
where
(66) w.sub.mvdr(m)=C.sub.iso.sup.−1d(m)/(d(m).sup.HC.sub.iso.sup.−1d(m))
is a vector of filter weights for a minimum-variance distortionless response (MVDR) beamformer, see e.g. [Haykin; 2001].
(67) The two boxed equations above constitute an embodiment of the proposed method in the special case of low additive noise, for estimating spectral variances of a target speaker in reverberation, as a function of time (index m) and frequency (suppressed index k), same result as provided in section A above.
(68) B2. General Case: Additive Noise (C.sub.W≠0)
(69) To express the maximum likelihood estimates of the spectral variances λ.sub.X(m) and λ.sub.V(m) in this general case, we need to introduce some additional notation.
(70) First, let us introduce an M×(M−1) complex-valued blocking matrix B∈C.sup.M×(M−1) given by
[Bd]=I−d(m)(d(m).sup.Hd(m)).sup.−1d(m).sup.H,
i.e., the matrix B is given by the first M−1 columns of the matrix on the right-hand side.
(71) Also, let us define a pre-whitening matrix D∈C.sup.(M−1)×(M−1), which has the property that
(B.sup.HC.sub.WB).sup.−1=D.sup.HD.
(72) Matrix D can, e.g., be found from a Cholesky decomposition of the matrix on the left-hand side above.
(73) In any case, matrices B and D can be computed from known quantities at any time instant m.
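A sketch of how B and D might be computed for a given look vector and noise covariance is shown below. The function names are hypothetical. Note the convention assumed here: the Cholesky factor is taken such that D D.sup.H equals (B.sup.HC.sub.WB).sup.−1, which is the ordering under which D.sup.HB.sup.HC.sub.WBD becomes the identity matrix; treat this ordering as an assumption of the sketch rather than as a statement of the disclosed definition.

```python
import numpy as np

def blocking_matrix(d):
    """First M-1 columns of I - d (d^H d)^-1 d^H; satisfies B^H d = 0."""
    d = np.asarray(d, dtype=complex).reshape(-1, 1)
    P = np.eye(d.shape[0]) - d @ d.conj().T / (d.conj().T @ d)
    return P[:, :-1]

def prewhitening_matrix(B, C_W):
    """Cholesky-based D with D D^H = (B^H C_W B)^-1 (assumed convention)."""
    A = B.conj().T @ C_W @ B
    return np.linalg.cholesky(np.linalg.inv(A))   # lower-triangular factor

# Illustrative values with M = 3 microphones.
d = np.array([1.0, 0.7 * np.exp(0.5j), 0.4 * np.exp(-0.2j)])
C_W = np.array([[1.0, 0.2, 0.1],
                [0.2, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
B = blocking_matrix(d)
D = prewhitening_matrix(B, C_W)
C_W_prime = D.conj().T @ B.conj().T @ C_W @ B @ D   # noise covariance after blocking and whitening
```

With this choice the target component is removed by B (since B.sup.Hd=0) and the additive noise becomes white in the primed domain, so that only the reverberation variance remains unknown there.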
(74) To describe the maximum likelihood estimates compactly, we need to introduce the signal quantities from the previous section in a blocked and whitened domain. The quantities in this new domain are denoted with ′. We define
Y′(m)=D.sup.HB.sup.HY(m),
and similarly for X′(m), V′(m), and W′(m). Covariance matrices in this blocked and pre-whitened domain are given by
C.sub.Y′(m)=D.sup.HB.sup.HC.sub.Y(m)BD,
and similarly for C.sub.X′(m), C.sub.iso′(m), C.sub.W′(m), and Ĉ.sub.Y′(m). Note that all these (square) covariance matrices have dimension M′=M−1, where M is the number of microphones.
(75) Finally, let us introduce some additional notation. Let
C.sub.Y′(m)=UΛ.sub.Y′U.sup.H
denote the eigenvalue decomposition of the (blocked and pre-whitened) covariance matrix C.sub.Y′(m), where the columns of matrix U are the eigenvectors and the eigenvalues are the diagonal elements of the diagonal matrix
Λ.sub.Y′=diag(λ.sub.y1 . . . λ.sub.yM′).
Similarly, let
C.sub.iso′=UΛ.sub.iso′U.sup.H
denote the eigenvalue decomposition of the (blocked and pre-whitened) matrix C.sub.iso′, such that
Λ.sub.iso′=diag(λ.sub.iso,1, . . . ,λ.sub.iso,M′)
is a diagonal eigenvalue matrix.
(76) Furthermore, let g.sub.m denote the m'th diagonal element of the matrix
U.sup.HĈ.sub.Y′(m)U.
(77) Then it can be shown that the maximum likelihood estimate λ.sub.V,ML of λ.sub.V can be found as one of the roots of the polynomial (in the variable λ.sub.V):
(78)
(79) Specifically, λ.sub.V(m) is found as the positive, real root of the polynomial. In most cases, there is only one such root.
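The polynomial itself is not reproduced in this text. Purely as an illustration, the sketch below builds one polynomial that follows from the stated Gaussian assumptions when the additive noise is white in the blocked and pre-whitened domain, so that the j'th eigenvalue of C.sub.Y′ is λ.sub.Vλ.sub.iso,j+1. Setting the derivative of the log-likelihood to zero and clearing denominators gives the sum over j of λ.sub.iso,j(λ.sub.Vλ.sub.iso,j+1−g.sub.j) multiplied by the product over k≠j of (λ.sub.Vλ.sub.iso,k+1).sup.2. This derivation and the function name `ml_reverb_variance` are assumptions of the example and may differ in form from the polynomial referred to above.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def ml_reverb_variance(lam_iso, g):
    """Positive real roots of the assumed likelihood-stationarity polynomial in lam_V.

    lam_iso: eigenvalues of the blocked, pre-whitened C_iso'.
    g: diagonal elements of U^H C_Y_hat' U (same ordering as lam_iso).
    Coefficients are stored lowest order first, as numpy.polynomial expects.
    """
    lam_iso = np.asarray(lam_iso, dtype=float)
    g = np.asarray(g, dtype=float)
    total = np.zeros(1)
    for j in range(lam_iso.size):
        # lam_iso_j * (lam_iso_j * x + 1 - g_j)
        term = lam_iso[j] * np.array([1.0 - g[j], lam_iso[j]])
        for k in range(lam_iso.size):
            if k != j:
                factor = np.array([1.0, lam_iso[k]])          # (lam_iso_k * x + 1)
                term = P.polymul(term, P.polymul(factor, factor))
        total = P.polyadd(total, term)
    roots = P.polyroots(total)
    return [float(r.real) for r in roots
            if abs(r.imag) < 1e-9 * max(1.0, abs(r)) and r.real > 0]

# With M = 2 microphones there is a single eigenvalue and the root is (g - 1) / lam_iso.
roots_m2 = ml_reverb_variance([2.0], [5.0])
# Exact-model check with two eigenvalues and lam_V = 1.5, i.e. g_j = 1.5 * lam_iso_j + 1.
roots_m3 = ml_reverb_variance([2.0, 0.5], [4.0, 1.75])
```

In both checks the only positive real root coincides with the reverberation variance used to generate the inputs.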
(80) The corresponding maximum-likelihood estimate λ.sub.X,ML(m) of the target speech spectral variance λ.sub.X(m) can then be found from quantities in the non-blocked and non-prewhitened domain as:
λ.sub.X,ML(m)=w.sub.mvdr.sup.H(m)(Ĉ.sub.Y(m)−λ.sub.V,ML(m)C.sub.iso−C.sub.W)w.sub.mvdr(m),
where
(81) w.sub.mvdr(m)=C.sub.V+W.sup.−1(m)d(m)/(d(m).sup.HC.sub.V+W.sup.−1(m)d(m)),
where
C.sub.V+W(m)=λ.sub.V,ML(m)C.sub.iso+C.sub.W.
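The target-variance estimate in the general case may be sketched as follows, assuming the MVDR weights take the standard form C.sub.V+W.sup.−1 d normalized by d.sup.H C.sub.V+W.sup.−1 d. The function name `ml_target_variance` and the numerical values are hypothetical.

```python
import numpy as np

def ml_target_variance(C_Y_hat, d, C_iso, C_W, lam_v_ml):
    """lam_X_ML = w^H (C_Y_hat - lam_V_ML C_iso - C_W) w, with MVDR weights for C_V+W."""
    C_vw = lam_v_ml * C_iso + C_W                 # combined reverberation-plus-noise covariance
    Ci_d = np.linalg.solve(C_vw, d)
    w = Ci_d / (d.conj() @ Ci_d)                  # MVDR weights (assumed standard form)
    lam_x = np.real(w.conj() @ (C_Y_hat - C_vw) @ w)
    return lam_x, w

# Exact-model check (illustrative values, M = 2).
d = np.array([1.0, 0.7 * np.exp(0.5j)])
C_iso = np.array([[1.0, 0.6], [0.6, 1.0]])
C_W = 0.1 * np.eye(2)
C_Y = 2.0 * np.outer(d, d.conj()) + 0.5 * C_iso + C_W
lam_x_ml, w = ml_target_variance(C_Y, d, C_iso, C_W, lam_v_ml=0.5)
```

Because the weights are distortionless towards d, subtracting the reverberation and noise covariances leaves exactly the target variance when the model covariance is exact.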
(82) The spectral variances λ.sub.X(m) and λ.sub.V(m) have several usages as exemplified in the following sections B3 and B4.
(83) B3. Direct-to-Reverberation Ratio Estimation
(84) The ratio λ.sub.X(m)/λ.sub.V(m) can be seen as an estimate of the direct-to-reverberation ratio (DRR). The DRR correlates with the distance to the sound source [Hioka et al.; 2011], and is also linked to speech intelligibility. Having a DRR estimate available on board a hearing aid allows the hearing aid to change to a relevant processing strategy, or to inform the hearing aid user that the hearing aid finds the processing conditions difficult, etc.
(85) B4. Dereverberation—Special Case with No (or Low) Additive Noise (C.sub.W=0)
(86) In this special case, the target signal is disturbed by reverberation, but no additional noise.
(87) A common strategy for dereverberation in the time-frequency domain is to suppress the time-frequency tiles where the target-to-reverb ratio is small and maintain the time-frequency tiles where the target-to-reverb ratio is large. The perceptual result of such processing is a target signal where the reverberation has been reduced. The crucial component in any such system is to determine from the available reverberant signal which time-frequency tiles are dominated by reverberation, and which are not.
(88) As before, reverberant microphone signals are decomposed into a time-frequency representation using analysis filter banks. The proposed method (shaded box) is applied to the filter bank output to estimate spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m) as a function of time and frequency. We assume that the noisy microphone signals are passed through a linear beamformer with weights collected in the vector w(m,k). This beamformer may or may not be an MVDR beamformer. If an MVDR beamformer is desired, then the MVDR beamformer of the proposed method (inside the shaded ML.sub.est-box) can be reused. The beamformer output is given by
{tilde over (Y)}(m)={tilde over (X)}(m)+{tilde over (V)}(m),
where
{tilde over (Y)}(m)=w(m).sup.HY(m),
{tilde over (X)}(m)=w(m).sup.HX(m),
and
{tilde over (V)}(m)=w(m).sup.HV(m),
where, as before, we skipped the frequency index k for notational convenience.
(89) We are interested in estimates of the power of the target component and of the late-reverberation component entering the single-channel post-processing filter. These can be found using our estimated spectral variances as
{tilde over (λ)}.sub.X,ml(m)=E|w(m).sup.HX(m)|.sup.2=λ.sub.X,ml(m)|w(m).sup.Hd(m)|.sup.2,
and
{tilde over (λ)}.sub.V,ml(m)=E|w(m).sup.HV(m)|.sup.2=λ.sub.V,ml(m)w(m).sup.HC.sub.isow(m),
respectively.
(90) So, the power of the target component and of the late-reverberation component entering the single-channel post-processing filter can be found from our maximum-likelihood estimates of spectral variances, λ.sub.X(m) and λ.sub.V(m), and quantities which are otherwise available.
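The two expressions above may be sketched as follows. The function name and the toy values (a two-microphone averaging beamformer) are hypothetical and serve only to illustrate the mapping from λ.sub.X,ml(m) and λ.sub.V,ml(m) to the variances at the beamformer output.

```python
import numpy as np

def postfilter_input_variances(w, d, c_iso, lambda_x, lambda_v):
    """Target and late-reverberation power at the beamformer output."""
    # lambda_X_tilde = lambda_X * |w^H d|^2
    lam_x_tilde = lambda_x * np.abs(np.vdot(w, d)) ** 2
    # lambda_V_tilde = lambda_V * w^H C_iso w (real for Hermitian C_iso)
    lam_v_tilde = lambda_v * np.real(np.vdot(w, c_iso @ w))
    return float(lam_x_tilde), float(lam_v_tilde)

# Toy example: averaging beamformer, identical transfer functions.
w = np.array([0.5, 0.5], dtype=complex)
d = np.array([1.0, 1.0], dtype=complex)
c_iso = np.array([[1.0, 0.5], [0.5, 1.0]])
lam_x_tilde, lam_v_tilde = postfilter_input_variances(w, d, c_iso, 2.0, 0.5)
```

In this toy case w.sup.Hd = 1, so the target variance passes unchanged, while the reverberation variance is scaled by w.sup.HC.sub.isow = 0.75.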
(91) The single-channel post-processing filter then uses the estimates {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) to find an appropriate gain g.sub.SC(m) to apply to the beamformer output, {tilde over (Y)}(m). That is, g.sub.SC(m) may generally be expressed as a function of {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) and potentially other parameters. For example, for a Wiener gain function, we have (e.g., [Loizou; 2013])
(92)
whereas for the Ephraim-Malah gain function [Ephraim-Malah; 1984], we have
g.sub.em(m)=ƒ({tilde over (λ)}.sub.X,ml(m)/{tilde over (λ)}.sub.V,ml(m),|{tilde over (Y)}(m)|.sup.2/{tilde over (λ)}.sub.V,ml(m)).
(93) Many other possible gain functions exist, but they are typically a function of both {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m), and potentially other parameters.
(94) Finally, the gain function g.sub.SC(m) is applied to the beamformer output {tilde over (Y)}(m) to result in the dereverberated time-frequency tile {circumflex over (X)}(m), i.e.,
{circumflex over (X)}(m)=g.sub.SC(m){tilde over (Y)}(m),
as also disclosed in section A above.
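The post-filtering step may be sketched as below for a single time-frequency tile. The Wiener gain expression is not reproduced in this text, so the sketch assumes the textbook form, i.e. the ratio of the target variance to the sum of the target and reverberation variances. The function names and the `floor` guard are hypothetical.

```python
def wiener_gain(lam_x_tilde, lam_v_tilde, floor=1e-12):
    """Textbook Wiener gain lam_X / (lam_X + lam_V) (assumed form).

    `floor` is a hypothetical guard against division by zero.
    """
    return lam_x_tilde / max(lam_x_tilde + lam_v_tilde, floor)

def apply_postfilter(y_tilde, lam_x_tilde, lam_v_tilde):
    """X_hat(m) = g_SC(m) * Y_tilde(m) for one time-frequency tile."""
    return wiener_gain(lam_x_tilde, lam_v_tilde) * y_tilde

# A tile with three times more target power than reverberation power
# receives a gain of 0.75.
x_hat = apply_postfilter(1.0 + 2.0j, 3.0, 1.0)
```

The gain is real-valued and lies between 0 and 1, so the phase of the beamformer output is left unchanged.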
B5. Dereverberation—General Case with Additive Noise (C.sub.W≠0)
(95) In the general case, the target signal is disturbed by both reverberation and additive noise. Analogously to the previous section, we are interested in the spectral variances of all signal components entering the single-channel postfilter. As above, the spectral variances of the target and the reverberation component can be found from the maximum-likelihood estimates as
{tilde over (λ)}.sub.X,ml(m)=E|w(m).sup.HX(m)|.sup.2=λ.sub.X,ml(m)|w(m).sup.Hd(m)|.sup.2,
and
{tilde over (λ)}.sub.V,ml(m)=E|w(m).sup.HV(m)|.sup.2=λ.sub.V,ml(m)w(m).sup.HC.sub.isow(m),
respectively.
(96) Furthermore, the spectral variance of the additive noise component entering the single-channel postfilter is given by
λ.sub.W(m)=E|w(m).sup.HW(m)|.sup.2=w(m).sup.HC.sub.Ww(m).
(97) Generally speaking, the single-channel postfilter gain is a function of {tilde over (λ)}.sub.X,ml(m), {tilde over (λ)}.sub.V,ml(m), λ.sub.W(m), and potentially other parameters. For example, one could define the total spectral disturbance as the sum of the reverberation and noise variances,
λ.sub.dist(m)={tilde over (λ)}.sub.V,ml(m)+λ.sub.W(m).
(98) Then a signal-to-total-disturbance ratio would be given by
ξ(m)={tilde over (λ)}.sub.X,ml(m)/λ.sub.dist(m).
(99) With this, new versions of the Wiener gain function or the Ephraim-Malah gain function could be defined analogously to the description above. However, rather than suppressing only the reverberation component, these new gain functions suppress the reverberation and the additive noise component jointly.
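A sketch of such a joint gain is given below. It takes the signal-to-total-disturbance ratio with the target variance in the numerator, as the name of the ratio implies, and assumes a Wiener-type gain ξ/(1+ξ). The function name and the toy values are hypothetical, and a strictly positive total disturbance is assumed.

```python
import numpy as np

def joint_wiener_gain(w, c_w, lam_x_tilde, lam_v_tilde):
    """Wiener-type gain suppressing reverberation and additive noise jointly.

    Assumes lam_v_tilde + w^H C_W w > 0.
    """
    # lambda_W = w^H C_W w, the additive-noise power at the beamformer output
    lam_w = float(np.real(np.vdot(w, c_w @ w)))
    lam_dist = lam_v_tilde + lam_w        # total spectral disturbance
    xi = lam_x_tilde / lam_dist           # signal-to-total-disturbance ratio
    return xi / (1.0 + xi), xi

# Toy example: averaging beamformer and spatially white additive noise.
w = np.array([0.5, 0.5], dtype=complex)
c_w = 0.5 * np.eye(2)
gain, xi = joint_wiener_gain(w, c_w, lam_x_tilde=2.0, lam_v_tilde=0.375)
```

With C.sub.W=0 the total disturbance reduces to the reverberation variance, and the gain coincides with the Wiener-type gain of the special case without additive noise.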
(100)
(101)
(102)
(103) a) Providing or receiving a time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two, in a number of frequency bands and a number of time instances, k being a frequency band index and m being a time index;
b) Estimating spectral variances or scaled versions thereof λ.sub.V, λ.sub.X of said first noise signal component v and said target signal component x, respectively, as a function of frequency index k and time index m, said estimates of λ.sub.V and λ.sub.X being jointly optimal in maximum likelihood sense.
(104) The maximum likelihood optimization is based (exclusively) on the following statistical assumptions: that the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) (and optionally W.sub.i(k,m)) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) (and optionally w.sub.i(n)) are zero-mean, complex-valued Gaussian distributed; that each of them is statistically independent across time m and frequency k; and that X.sub.i(k,m) and V.sub.i(k,m) (and optionally W.sub.i(k,m)) are mutually uncorrelated.
(105) The method is—in general—based on the assumption that characteristics (e.g. spatial characteristics) of the target and noise signal components are known.
(106) The assumptions regarding the characteristics of the target and noise signal components are e.g. that the direction to the target signal relative to the input units is known (fixed d) and that the spatial fingerprint of the first noise signal component is also known, e.g. isotropic (C.sub.v=C.sub.iso). In case a second, additive noise component is present, it is assumed that its characteristics in the form of an inter input covariance matrix C.sub.w are known.
(107) The invention is defined by the features of the independent claim(s). Preferred embodiments are defined in the dependent claims. Any reference numerals in the claims are intended to be non-limiting for their scope.
(108) Some preferred embodiments have been shown in the foregoing, but it should be stressed that the invention is not limited to these, but may be embodied in other ways within the subject-matter defined in the following claims and equivalents thereof.
REFERENCES
(109) US2009248403A
WO12159217A1
US2013343571A1
US2010246844A1
[Braun&Habets; 2013] S. Braun and E. A. P. Habets, “Dereverberation in noisy environments using reference signals and a maximum likelihood estimator”, Presented at the 21.sup.st European Signal Processing Conference (EUSIPCO 2013), 5 pages (EUSIPCO 2013 1569744623).
[Schaub; 2008] Arthur Schaub, “Digital hearing Aids”, Thieme Medical. Pub., 2008.
[Haykin; 2001] S. Haykin, “Adaptive Filter Theory,” Fourth Edition, Prentice Hall Information and System Sciences Series, 2001.
[Hioka et al.; 2011] Y. Hioka, K. Niwa, S. Sakauchi, K. Furuya, and Y. Haneda, “Estimating Direct-to-Reverberant Energy Ratio Using D/R Spatial Correlation Matrix Model”, IEEE Trans. Audio, Speech, and Language Processing, Vol. 19, No. 8, November 2011, pp. 2374-2384.
[Loizou; 2013] P. C. Loizou, “Speech Enhancement: Theory and Practice,” Second Edition, February, 2013, CRC Press.
[Ephraim-Malah; 1984] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, December 1984, pp. 1109-1121.
[Kjems&Jensen; 2012] U. Kjems, J. Jensen, “Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement”, 20th European Signal Processing Conference (EUSIPCO 2012), pp. 295-299, 2012.
[Ye&DeGroat; 1995] H. Ye and R. D. DeGroat, “Maximum likelihood DOA estimation and asymptotic Cramér-Rao bounds for additive unknown colored noise,” IEEE Transactions on Signal Processing, vol. 43, no. 4, pp. 938-949, 1995.
[Shimitzu et al.; 2007] Hikaru Shimizu, Nobutaka Ono, Kyosuke Matsumoto, Shigeki Sagayama, “Isotropic noise suppression in the power spectrum domain by symmetric microphone arrays”, 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 21-24, 2007, New Paltz, N.Y., pp. 54-57.