Multi-microphone method for estimation of target and noise spectral variances for speech degraded by reverberation and optionally additive noise

Abstract

The application relates to an audio processing system and a method of processing a noisy (e.g. reverberant) signal comprising first (v) and optionally second (w) noise signal components and a target signal component (x), the method comprising a) Providing or receiving a time-frequency representation Y.sub.i(k,m) of a noisy audio signal y.sub.i at an i.sup.th input unit, i=1, 2, . . . , M, where M≧2; b) Providing (e.g. predefined spatial) characteristics of said target signal component and said noise signal component(s); and c) Estimating spectral variances or scaled versions thereof λ.sub.V, λ.sub.X of said first noise signal component v (representing reverberation) and said target signal component x, respectively, said estimates of λ.sub.V and λ.sub.X being jointly optimal in maximum likelihood sense, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) (and W.sub.i(k,m)) of respective signals y.sub.i(n), and signal components x.sub.i, and v.sub.i (and w.sub.i) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m) and V.sub.i(k,m) (and W.sub.i(k,m)) are uncorrelated. An advantage of the invention is that it provides the basis for an improved intelligibility of an input speech signal. The invention may e.g. be used for hearing assistance devices, e.g. hearing aids.

Claims

1. A method of processing a noisy audio signal y(n) including a target signal component x(n) and a first noise signal component v(n), n representing time, the method comprising: providing or receiving a time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two, in a number of frequency bands and a number of time instances, k being a frequency band index and m being a time index; providing characteristics of said target signal component represented by a look vector d(k,m), whose elements (i=1, 2, . . . , M) define the frequency and time dependent absolute acoustic transfer function from a target signal source to each of the M input units, or the relative acoustic transfer function of the i.sup.th input unit to a reference input unit, or an inter input covariance matrix d(k,m)·d(k,m).sup.H; providing characteristics of said first noise signal component defined by an inter input unit covariance matrix C.sub.v(k,m); estimating spectral variances or scaled versions thereof λ.sub.V, λ.sub.X of said first noise signal component v and said target signal component x, respectively, as a function of frequency index k and time index m, said estimates of λ.sub.V and λ.sub.X being jointly optimal in maximum likelihood sense, jointly optimal being taken to mean that both of the spectral variances λ.sub.V, λ.sub.X are estimated in the same maximum likelihood estimation process, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m) and V.sub.i(k,m) are uncorrelated; and processing the noisy audio signal y.sub.i(n) based on the estimated spectral variances or scaled versions thereof to provide a noise reduced signal.

2. A method according to claim 1 wherein the noisy audio signal y.sub.i(n) comprises a reverberant signal comprising a target signal component and a reverberation signal component.

3. A method according to claim 1 wherein said characteristics of the first noise signal component v are represented by an inter input unit covariance matrix C.sub.v or a scaled version thereof and wherein said first noise signal component v.sub.i(n) is essentially spatially isotropic.

4. A method according to claim 1 wherein said first noise signal component v.sub.i(n) is constituted by late reverberations.

5. A method according to claim 1 wherein the first noise signal component is a reverberation signal component v(n), and the noisy audio signal y(n) further comprises a second noise signal component being an additive noise signal component w(n), and wherein the method further comprises providing characteristics of said second noise signal component defined by a predetermined inter input unit covariance matrix C.sub.w(k,m).

6. A method according to claim 5 wherein the noisy audio signal y.sub.i(n) at the i.sup.th input unit comprises a target signal component x.sub.i(n), a reverberation signal component v.sub.i(n), and an additive noise component w.sub.i(n).

7. A method according to claim 5 wherein the characteristics of said second noise signal component w are represented by a predetermined inter input unit covariance matrix C.sub.W of the additive noise.

8. A method according to claim 1 wherein the characteristics of the target signal are represented by a look vector d(k,m) whose elements (i=1, 2, . . . , M) define the frequency and time dependent absolute acoustic transfer function from a target signal source to each of the M input units, or the relative acoustic transfer function from the i.sup.th input unit to a reference input unit.

9. A method according to claim 8 wherein said look vector d(k,m) and said noise covariance matrix C.sub.V(k,m), and optionally C.sub.W(k,m), are determined in an off-line procedure.

10. A method according to claim 1 further comprising: estimating the inter input unit covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal based on a number D of observations.

11. A method according to claim 10 wherein said maximum-likelihood estimates of the spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) of the target signal component x and the noise signal component v, respectively, are derived from estimates of the inter-input unit covariance matrices C.sub.Y(k,m), C.sub.X(k,m), C.sub.V(k,m), and optionally C.sub.W(k,m), and the look vector d(k,m).

12. A method according to claim 1 wherein processing the noisy audio signal y.sub.i(n) based on the estimated spectral variances or scaled versions thereof to provide a noise reduced signal comprises: applying beamforming to the noisy audio signal y(n) providing a beamformed signal and single channel post filtering to the beamformed signal to suppress noise signal components from a direction of the target signal and to provide the resulting noise reduced signal.

13. A method according to claim 12 wherein said beamforming is a target signal enhancement spatial filtering based on MVDR filtering applied to the time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, to provide a beamformed signal wherein signal components from other directions than a direction of the target signal component are attenuated, while leaving signal components from the direction of the target signal component un-attenuated.

14. A method according to claim 12 wherein gain values g.sub.sc(k,m) applied to the beamformed signal in the single channel post filtering process are based on the estimates of the spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) of the target signal component x and the first noise signal component v, respectively.

15. A data processing system comprising: a processor; and a memory having stored thereon program code which when executed causes the processor to perform the method of claim 1.

16. An audio processing system for processing a noisy audio signal y comprising a target signal component x and a first noise signal component v, the audio processing system comprising: a multitude M of input units adapted to provide or to receive a time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two, in a number of frequency bands and a number of time instances, k being a frequency band index and m being a time index; a look vector d(k,m), whose elements (i=1, 2, . . . , M) define the frequency and time dependent absolute acoustic transfer function from a target signal source to each of the M input units, or the relative acoustic transfer function from the i.sup.th input unit to a reference input unit, or an inter input covariance matrix d(k,m)·d(k,m).sup.H, for the target signal component; an inter-input unit covariance matrix C.sub.v(k,m) for the first noise signal component, or scaled versions thereof; a covariance estimation unit for estimating an inter input unit covariance matrix Ĉ.sub.Y(k,m), or a scaled version thereof, of the noisy audio signal based on the time-frequency representation Y.sub.i(k,m) of the noisy audio signals y.sub.i(n); and a spectral variance estimation unit for estimating spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) or scaled versions thereof of the target signal component x and the first noise signal component v, respectively, based on said look vector d(k,m), said inter-input unit covariance matrix C.sub.v(k,m), and the covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal, or scaled versions thereof, wherein said estimates of λ.sub.V and λ.sub.X are jointly optimal in maximum likelihood sense, jointly optimal being taken to mean that both of the spectral variances λ.sub.V and λ.sub.X are estimated in the same maximum likelihood estimation process, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m) and V.sub.i(k,m) are uncorrelated; and a signal processing unit adapted to process the noisy audio signal y.sub.i(n) based on the estimated spectral variances or scaled versions thereof to provide a noise reduced signal.

17. An audio processing system according to claim 16 wherein the noisy audio signal y(n) comprises a target signal component x(n), a first noise signal component being a reverberation signal component v(n), and a second noise signal component being an additive noise signal component w(n), and wherein the audio processing system comprises a predetermined inter input unit covariance matrix C.sub.W of the additive noise.

18. An audio processing system according to claim 17 wherein the spectral variance estimation unit is configured to estimate spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) or scaled versions thereof of the target signal component x and the first noise signal component v, respectively, based on said look vector d(k,m), said inter-input unit covariance matrix C.sub.v(k,m) of the first noise component, said inter-input unit covariance matrix C.sub.W(k,m) of the second noise component, and said covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal, or scaled versions thereof, wherein said estimates of λ.sub.V and λ.sub.X are jointly optimal in maximum likelihood sense, based on the statistical assumptions that a) the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), V.sub.i(k,m), and W.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), v.sub.i(n), w.sub.i(n) are zero-mean, complex-valued Gaussian distributed, b) that each of them are statistically independent across time m and frequency k, and c) that X.sub.i(k,m), V.sub.i(k,m) and W.sub.i(k,m) are mutually uncorrelated.

19. An audio processing system according to claim 16 further comprising: one of a hearing aid, a headset, an earphone, and an ear protection device, or a combination thereof.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) The disclosure will be explained more fully below in connection with a preferred embodiment and with reference to the drawings in which:

(2) FIGS. 1A-1C schematically show a first scenario comprising a number of acoustic paths between a sound source and a receiver of sound located in a room with reverberation (FIG. 1A) and an exemplary illustration of amplitude versus time for a sound signal in the room (FIG. 1B), and a second scenario comprising a number of acoustic paths between a sound source and a receiver of sound located in a room with reverberation and additive noise (FIG. 1C),

(3) FIGS. 2A-2B schematically illustrate a conversion of a signal in the time domain to the time-frequency domain, FIG. 2A illustrating a time dependent sound signal (amplitude versus time) and its sampling in an analogue to digital converter, FIG. 2B illustrating a resulting ‘map’ of time-frequency units after a (short-time) Fourier transformation of the sampled signal,

(4) FIGS. 3A-3C show three exemplary embodiments of block diagrams of an audio processing system according to the present disclosure illustrating the proposed scheme of estimation of speech and noise spectral variances, FIGS. 3A and 3B illustrating systems adapted to handle a noisy audio signal in the form of a reverberant target speech signal and FIG. 3C illustrating a system adapted to handle a noisy audio signal in the form of a reverberant target speech signal in additive noise,

(5) FIGS. 4A-4B show a scenario wherein the method according to the present disclosure (shaded box) is used to compute gain values for a single-channel post-processing step for de-reverberation, FIG. 4A illustrating a system adapted to handle a noisy audio signal in the form of a reverberant target speech signal, FIG. 4B illustrating a system adapted to handle a noisy audio signal in the form of a reverberant target speech signal in additive noise,

(6) FIG. 5 shows an embodiment of an audio processing system according to the present disclosure,

(7) FIG. 6 shows a further embodiment of an audio processing device according to the present disclosure, and

(8) FIG. 7 shows a flow diagram illustrating a method of processing a noisy input signal according to the present disclosure.

(9) The figures are schematic and simplified for clarity, and they just show details which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference signs are used for identical or corresponding parts.

(10) Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.

DETAILED DESCRIPTION OF EMBODIMENTS

(11) FIG. 1 schematically shows a number of acoustic paths between a sound source and a receiver of sound located in a room (FIG. 1A) and an exemplary illustration of amplitude (|MAG|) versus time (Time) for a sound signal in the room (FIG. 1B).

(12) FIG. 1A schematically shows an example of an acoustically propagated signal from an audio source (S in FIG. 1A) to a listener (L in FIG. 1A) via direct (p.sub.0) and reflected propagation paths (p.sub.1, p.sub.2, p.sub.3, p.sub.4, respectively) in an exemplary location (Room). The resulting acoustically propagated signal received by a listener, e.g. via a listening device worn by the listener (at L in FIG. 1A), is a sum of the five (and possibly more, depending on the room) differently delayed and attenuated (and possibly otherwise distorted) contributions. The direct (p.sub.0) and early-reflection (here the once-reflected (p.sub.1)) propagation paths are indicated in FIG. 1A by dashed lines, whereas the ‘late reflections’ (here the 2, 3, and 4 times reflected (p.sub.2, p.sub.3, p.sub.4)) are indicated in FIG. 1A by dotted lines. FIG. 1B schematically illustrates an example of a resulting time variant sound signal (magnitude |MAG| [dB] versus time) from the sound source S as received at the listener L. In FIG. 1B a predetermined time Δt.sub.pd defining the ‘late reverberations’ is indicated. The late reverberations are in the present example taken to be those signal components that arrive at the listener a time Δt.sub.pd after they were issued by the sound source S. In other words, ‘late reverberations’ are signal components of a sound that arrive at a given input unit (e.g. the i.sup.th) a predefined time Δt.sub.pd after the first peak (p.sub.0) of the impulse response has arrived at the input unit in question. In an embodiment, the predefined time Δt.sub.pd is larger than or equal to 30 ms, such as larger than or equal to 40 ms, e.g. larger than or equal to 50 ms. In an embodiment, such ‘late reverberations’ include sound components that have been subject to two or more reflections (p.sub.2, p.sub.3, p.sub.4, . . . , as exemplified in FIG. 1), such as three or more reflections, from surfaces (e.g. walls) in the environment.
The appropriate number of reflections and/or the appropriate predefined time Δt.sub.pd separating the target signal components (dashed part of the graph in FIG. 1B) from the (undesired) reverberation (noise) signal components (dotted part of the graph in FIG. 1B) depend on the location (distance to and properties of reflective surfaces) and the distance between audio source (S) and listener (L), the effect of reverberation being smaller the smaller the distance between source and listener.

(13) FIG. 1C shows a second scenario comprising a number of acoustic paths between a sound source (S) constituting the target signal and a receiver of sound (L) located in a room (room) with reverberation (reverberation) and additive noise (AD). The characteristics (e.g. an inter input unit covariance matrix C.sub.w) of the additive noise source (AD) are assumed to be known.

(14) FIG. 2 schematically illustrates a conversion of a signal in the time domain to the time-frequency domain, FIG. 2A illustrating a time dependent sound signal (amplitude versus time) and its sampling in an analogue to digital converter, FIG. 2B illustrating a resulting ‘map’ of time-frequency units after a (short-time) Fourier transformation of the sampled signal.

(15) FIG. 2A illustrates a time dependent sound signal x(t) (amplitude (SPL [dB]) versus time (t)), its sampling in an analogue to digital converter and a grouping of time samples in frames, each comprising N.sub.s samples. The graph showing amplitude versus time (solid line in FIG. 2A) may e.g. represent the time variant analogue electric signal provided by an input transducer, e.g. a microphone, before being digitized by an analogue to digital conversion unit. FIG. 2B illustrates a ‘map’ of time-frequency units resulting from a Fourier transformation (e.g. a discrete Fourier transform, DFT) of the input signal of FIG. 2A, where a given time-frequency unit (m,k) corresponds to one DFT-bin and comprises a complex value of the signal X(m,k) in question (X(m,k)=|X|·e.sup.iφ, |X|=magnitude and φ=phase) in a given time frame m and frequency band k. In the following, a given frequency band is assumed to contain one (generally complex) value of the signal in each time frame. It may alternatively comprise more than one value. The terms ‘frequency range’ and ‘frequency band’ are used in the present disclosure. A frequency range may comprise one or more frequency bands. The time-frequency map of FIG. 2B illustrates time-frequency units (m,k) for k=1, 2, . . . , K frequency bands and m=1, 2, . . . , N.sub.M time units. Each frequency band Δf.sub.k is indicated in FIG. 2B to be of uniform width. This need not be the case though. The frequency bands may be of different width (or alternatively, frequency channels may be defined which contain a different number of uniform frequency bands, e.g. the number of frequency bands of a given frequency channel increasing with increasing frequency, the lowest frequency channel(s) comprising e.g. a single frequency band). The time intervals Δt.sub.m (time unit) of the individual time-frequency bins are indicated in FIG. 2B to be of equal size. This need not be the case though, although it is assumed in the present embodiments.
A time unit Δt.sub.m is typically equal to the number N.sub.s of samples in a time frame (cf. FIG. 2A) times the length in time t.sub.s of a sample (t.sub.s=(1/f.sub.s), where f.sub.s is a sampling frequency). A time unit is e.g. of the order of ms in an audio processing system.
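The framing and DFT step described above can be illustrated with a minimal sketch. It assumes non-overlapping, unwindowed frames of N.sub.s samples, which matches the description of FIG. 2A but is a simplification (practical filterbanks typically use overlapping, windowed frames). The function name `stft_frames`, the sampling frequency and the frame length are hypothetical values chosen for illustration, and the band index k is zero-based in the code.

```python
import numpy as np

def stft_frames(x, n_s):
    """Group x into non-overlapping frames of n_s samples and DFT each frame.

    Returns X[k, m], with k the (zero-based) frequency band index and
    m the (zero-based) time frame index.
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // n_s
    frames = x[: n_frames * n_s].reshape(n_frames, n_s)
    # One-sided DFT of each frame, then transpose to (band, frame) order.
    return np.fft.rfft(frames, axis=1).T

fs = 16000.0  # hypothetical sampling frequency f_s in Hz
n_s = 64      # hypothetical number of samples per frame N_s

# Time unit: N_s samples times the sample length t_s = 1/f_s, in ms.
frame_ms = 1000.0 * n_s / fs

# A 2 kHz test tone covering 10 frames; it falls exactly on band 2000/(fs/n_s) = 8.
n = np.arange(10 * n_s)
x = np.cos(2 * np.pi * 2000.0 * n / fs)
X = stft_frames(x, n_s)
```

With these assumed values the time unit is 4 ms, consistent with the statement that a time unit is of the order of ms.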

(16) FIG. 3A schematically shows an embodiment of an audio processing device (APD) according to the present disclosure. The audio processing device (APD) comprises a multitude M of input units (IU.sub.i, i=1, 2, . . . , M), each being adapted to provide a time-frequency representation Y.sub.i of a (time varying) noisy input signal y.sub.i at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two. The noisy input signal y.sub.i is e.g. a noisy target speech signal comprising a target speech signal component x.sub.i and a (first) noise signal component v.sub.i, which is additive and essentially uncorrelated to the target signal (e.g. a speech signal), in other words y.sub.i(n)=x.sub.i(n)+v.sub.i(n), i=1, 2, . . . , M, where n represents time. In the present context, the noisy audio signal is assumed to be a reverberant target speech signal y.sub.i comprising a target speech signal component x.sub.i and a reverberation signal component v.sub.i, as discussed in connection with FIG. 1 above. The time-frequency representation Y.sub.i(k,m) comprises a (generally complex) value of the input signal in a given frequency band k (k=1, 2, . . . , K) and time instance m (m=1, 2, . . . , N.sub.M). In the embodiment of FIG. 3A, each input unit IU.sub.i comprises an input transducer or an input terminal IT.sub.i for receiving a noisy audio signal y.sub.i (e.g. an acoustic signal or an electric signal) and providing it as an electric input signal IN.sub.i to an analysis filterbank (AFB) for providing a time-frequency representation Y.sub.i(k,m) of the corresponding electric input signal IN.sub.i, and hence of the noisy input signal y.sub.i. The audio processing device (APD) further comprises a multi-channel MVDR beamformer filtering unit (MVDR) to provide signal mvdr comprising filter weights w.sub.mvdr(k,m).
The filter weights w.sub.mvdr(k,m) are determined by the MVDR filter unit (MVDR) from a predetermined look vector d(k,m) (d) (or a scaled version thereof) and a predetermined inter-input unit covariance matrix C.sub.v(k,m) (Ĉ.sub.v) (or a scaled version thereof) for the (first) noise signal component of the noisy input signal. In an embodiment, the look vector (d) and the covariance matrix (Ĉ.sub.v) are predetermined in off-line procedures. The audio processing device (APD) further comprises a covariance estimation unit (CovEU) for estimating an inter input unit covariance matrix Ĉ.sub.Y(k,m) (or a scaled version thereof) of the noisy input signal based on the time-frequency representation Y.sub.i(k,m) of the noisy audio signals y.sub.i. The audio processing device (APD) further comprises a spectral variance estimation unit (SVarEU) for estimating spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) or scaled versions thereof of the target signal component x and the (first) noise signal component v, respectively. The estimated spectral variances λ.sub.X(k,m) and λ.sub.V(k,m) are based on the filter weights w.sub.mvdr(k,m) (signal mvdr) provided by the MVDR filter unit, the predetermined target look vector (d) and noise covariance matrix (Ĉ.sub.v) (or scaled versions thereof), and the covariance matrix Ĉ.sub.Y(k,m) of the noisy audio signal provided by the covariance estimation unit (CovEU). The spectral variance estimation unit (SVarEU) is configured to provide that the estimates of λ.sub.V and λ.sub.X are jointly optimal in maximum likelihood sense based on the statistical assumptions that the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) are zero-mean, complex-valued Gaussian distributed, that each of them are statistically independent across time m and frequency k, and that X.sub.i(k,m) and V.sub.i(k,m) are uncorrelated.

(17) In an embodiment, at least one of the M input units IU.sub.i comprises an input transducer, e.g. a microphone for converting an input sound to an electric input signal (cf. e.g. FIG. 3B). The M input units IU.sub.i may all be located in the same physical device. Alternatively, a first (IU.sub.1) of the M input units (IU.sub.i) is located in the audio processing device (APD, e.g. a hearing aid device), and a second (IU.sub.2) of the M input units (IU.sub.i) is located at a distance from the first input unit that is larger than a maximum outer dimension of the audio processing device (APD) where the first input unit (IU.sub.1) is located. In an embodiment, a first of the M input units is located in a first audio processing device (e.g. a first hearing aid device) and a second of the M input units is located in another device, the audio processing device and the other device being configured to establish a communication link between them. In an embodiment, the other device is another audio processing device (e.g. a second hearing aid device of a binaural hearing assistance system). In an embodiment, the other device is or comprises a remote control device of the audio processing device, e.g. embodied in a cellular telephone, e.g. in a SmartPhone.

(18) A. Two Microphone Maximum-Likelihood Estimation of Speech and Late-Reverberation Spectral Variances for Speech Signals in the Presence of Reverberation (Only) (FIG. 3B, 4A):

(19) Another embodiment of an audio processing device according to the present disclosure illustrating a more specific implementation (but comprising the same elements as shown and discussed in FIG. 3A) is shown in FIG. 3B. FIG. 3B shows an audio processing device (APD) for estimation of spectral variances λ.sub.x, λ.sub.v of target speech and reverberation signal components of a noisy input signal, wherein the number (M) of input units is two, and wherein the two input units (Mic.sub.1, Mic.sub.2) each comprises a microphone unit (Mic.sub.i) and an analysis filterbank (AFB in FIG. 3B). It is, as illustrated in FIG. 3A, straightforward to generalize this description to systems with more than 2 microphones (M>2). Also, the two microphones may be located in the same device (e.g. in a listening device, such as a hearing assistance device), but may alternatively be located in different (physically separate) devices, e.g. in two separate audio processing devices, such as in two separate hearing assistance devices of a binaural hearing assistance system, adapted for wirelessly communicating with each other allowing the two microphone signals to be available in the audio processing device (APD) in question. In a preferred embodiment, the audio processing device comprises at least two input units relatively closely spaced apart (within the housing of the audio processing device) and one input unit located elsewhere, e.g. in another audio processing device, e.g. a SmartPhone.

(20) In the following, the 2-microphone system is described in more detail. Let us assume that one target speaker is present in the acoustical scene, and that the signal reaching the hearing aid microphones consists of the two components a) and b) described above. The goal is to estimate the power at given frequencies and time instants of these two signal components. The signal reaching microphone number i may be written as
y.sub.i(n)=x.sub.i(n)+v.sub.i(n),
where x.sub.i(n) is the target signal component at the microphone, and v.sub.i(n) is the undesired reverberation component, which we assume is uncorrelated with the target signal x.sub.i(n), and y.sub.i(n) is the observable reverberant signal. The reverberant signal at each microphone is passed through an analysis filterbank (AFB) leading to a signal in the time-frequency domain,
Y.sub.i(k,m)=X.sub.i(k,m)+V.sub.i(k,m),
where k is a frequency index and m is a time (frame) index (and i=1, 2). For convenience, these spectral coefficients may be thought of as Discrete-Fourier Transform (DFT) coefficients.
Since all operations are identical for each frequency index, we skip the frequency index in the following for notational convenience. For example, instead of Y.sub.i(k,m), we simply write Y.sub.i(m).

(21) For a given frequency index k and time index m, noisy spectral coefficients for each microphone are collected in a vector (of size 2, since M=2; in general of size M), T indicating vector (matrix) transposition:
Y(m)=[Y.sub.1(m)Y.sub.2(m)].sup.T,
X(m)=[X.sub.1(m)X.sub.2(m)].sup.T,
and
V(m)=[V.sub.1(m)V.sub.2(m)].sup.T,
so that
Y(m)=X(m)+V(m).

(22) For a given frame index m, and frequency index k (suppressed in the notation), let d′(m)=[d′.sub.1(m) d′.sub.2(m)] denote a vector (of size 2) whose elements d.sub.1′ and d.sub.2′ represent the (generally complex-valued) acoustic transfer function from target sound source to each microphone (Mic.sub.1, Mic.sub.2), respectively. It is often more convenient to operate with a normalized version of d′(m). More specifically, let
d(m)=d′(m)/d′.sub.i(m)
denote a vector whose elements d.sub.i(m) (i=1, 2, . . . , M, here M=2) represent the relative transfer function from the target source to the i.sup.th microphone. This implies that the i.sup.th element in this vector equals one, and the remaining elements describe the acoustic transfer function from the other microphones to this reference microphone.
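The normalization of the look vector can be sketched as follows. The function name `relative_look_vector`, the argument `ref` (the zero-based index of the reference microphone) and the numerical transfer function values are hypothetical and serve only as an illustration.

```python
import numpy as np

def relative_look_vector(d_abs, ref=0):
    """Divide the absolute transfer functions d' by the entry of the
    reference microphone, so that the reference element becomes one."""
    d_abs = np.asarray(d_abs, dtype=complex)
    return d_abs / d_abs[ref]

# Hypothetical absolute transfer functions to the two microphones
# at one frequency band and time frame.
d_abs = np.array([0.8 * np.exp(-0.3j), 0.5 * np.exp(-0.9j)])
d = relative_look_vector(d_abs, ref=0)
```

In this example the second element of `d` has magnitude 0.5/0.8 and phase -0.6 rad, i.e. it describes the second microphone relative to the reference microphone.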

(23) This means that the noise-free microphone vector X(m) (which cannot be observed directly) can be expressed as
X(m)=d(m)X̄(m),
where the scalar X̄(m) is the spectral coefficient of the target signal at the reference microphone.

(24) The inter-microphone covariance matrix for the clean signal is then given by
C.sub.X(m)=λ.sub.X(m)d(m)d(m).sup.H,
where H denotes Hermitian transposition.

(25) In an embodiment, the inter-microphone covariance matrix of the late-reverberation is modelled as the covariance arising from an isotropic field,
C.sub.V(m)=λ.sub.V(m)C.sub.iso,
where C.sub.iso is the covariance matrix of the late-reverberation, and λ.sub.V(m) is the reverberation power at the reference microphone, which, obviously, is time-varying to take into account the time-varying power level of reverberation.

(26) The inter-microphone covariance matrix of the noisy signal is given by
C.sub.Y(m)=C.sub.X(m)+C.sub.V(m),
because the target and late-reverberation signals are assumed to be uncorrelated. Inserting expressions from above, we arrive at the following expression for C.sub.Y(m),
C.sub.Y(m)=λ.sub.X(m)d(m)d(m).sup.H+λ.sub.V(m)C.sub.iso.
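The covariance model above can be written out numerically as a minimal sketch for M=2. The look vector, the matrix `c_iso` (here a hypothetical inter-microphone coherence of 0.6) and the spectral variances are illustrative values, not measured data, and the function name `noisy_covariance` is an assumption.

```python
import numpy as np

def noisy_covariance(lam_x, lam_v, d, c_iso):
    """Build C_Y = lam_x * d d^H + lam_v * C_iso for one band and frame."""
    d = np.asarray(d, dtype=complex).reshape(-1, 1)
    c_x = lam_x * (d @ d.conj().T)                  # rank-one target covariance
    c_v = lam_v * np.asarray(c_iso, dtype=complex)  # scaled isotropic covariance
    return c_x + c_v

d = np.array([1.0, 0.9 * np.exp(-0.4j)])     # hypothetical relative look vector
c_iso = np.array([[1.0, 0.6], [0.6, 1.0]])   # hypothetical isotropic covariance
c_y = noisy_covariance(2.0, 0.5, d, c_iso)
```

Because the reference element of d equals one and C.sub.iso is normalized to one on its diagonal in this example, the reference-microphone entry of the result is simply λ.sub.X+λ.sub.V.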

(27) In practice, vector d(m) may be estimated in an off-line calibration procedure (if we assume the target to be in a fixed location compared to the hearing aid microphone array, i.e., if the user “chooses with the nose”), or it may be estimated online.

(28) The matrix C.sub.iso is preferably estimated off-line by exposing hearing aids mounted on a dummy head to a reverberant sound field (e.g. approximated as an isotropic field), and measuring the resulting inter-microphone covariance matrix.

(29) Given the expression above, we wish to find estimates of spectral variances λ.sub.X(m) and λ.sub.V(m). In particular, it is possible to derive the following expressions for maximum likelihood estimates of these quantities. Let

(30) $\hat{C}_{Y}(m) = \frac{1}{D} \sum_{j = m - D + 1}^{m} Y(j)\, Y(j)^{H}$
denote an estimate of the noisy inter-microphone covariance matrix C.sub.Y(m), based on D observations. Ĉ.sub.Y is determined in a unit for estimating inter-microphone covariance (CovEU in FIG. 3B). Then, the following maximum-likelihood (ml) estimates of spectral variances λ.sub.X(m) and λ.sub.V(m) can be derived:

(31) $\begin{matrix} λ_{V, ml} (m) = \frac{1}{M - 1} tr (Q_{u} (m) {\hat{C}}_{Y} (m) C_{iso}^{- 1}), \end{matrix}$
with
Q.sub.u(m)=I−d(m)(d(m).sup.HC.sub.iso.sup.−1d(m)).sup.−1d(m).sup.HC.sub.iso.sup.−1,
I being the identity matrix, and M=2 being the number of microphones.

(32) Furthermore,
λ.sub.X,ml(m)=w.sub.mvdr.sup.H(m)(Ĉ.sub.Y(m)−λ.sub.V,ml(m)C.sub.iso)w.sub.mvdr(m),
where

(33) $w_{mvdr} (m) = \frac{C_{iso}^{- 1} d (m)}{{d (m)}^{H} C_{iso}^{- 1} d (m)}$
is a vector of filter weights for a minimum-variance distortionless response (MVDR) beamformer, see e.g. [Haykin; 2001]. The filter weights w.sub.mvdr(m) (w_mvdr(m,k) in FIG. 3B) are determined in an MVDR filter unit for computing filter weights (MVDR in FIG. 3B). The spectral variances λ.sub.X(m) and λ.sub.V(m) are estimated in a unit for computing spectral variances (SVarEU in FIG. 3B).
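As an illustration only, the two estimators above may be sketched in Python with NumPy for a single frequency band. The function name ml_spectral_variances and the array layout are assumptions made for this sketch and are not part of the disclosure; C_hat stands for the estimated covariance matrix Ĉ.sub.Y(m), and C_iso is assumed to be Hermitian and invertible.

```python
import numpy as np

def ml_spectral_variances(C_hat, d, C_iso):
    """Sketch of the noise-free ML estimates (hypothetical helper name).

    C_hat : (M, M) estimated noisy covariance matrix for one band
    d     : (M,) relative transfer function (look) vector
    C_iso : (M, M) normalized isotropic covariance, Hermitian and invertible
    Returns (lam_X, lam_V, w_mvdr).
    """
    M = d.shape[0]
    C_iso_inv = np.linalg.inv(C_iso)
    # d^H C_iso^-1 d is real and positive for a Hermitian positive definite C_iso
    denom = np.real(d.conj() @ C_iso_inv @ d)
    # Q_u = I - d (d^H C_iso^-1 d)^-1 d^H C_iso^-1
    Q_u = np.eye(M) - np.outer(d, d.conj() @ C_iso_inv) / denom
    lam_V = np.real(np.trace(Q_u @ C_hat @ C_iso_inv)) / (M - 1)
    # MVDR weights built from C_iso
    w = C_iso_inv @ d / denom
    lam_X = np.real(w.conj() @ (C_hat - lam_V * C_iso) @ w)
    return lam_X, lam_V, w
```

When C_hat equals the model covariance λ.sub.X d d.sup.H+λ.sub.V C.sub.iso exactly, the sketch returns λ.sub.X and λ.sub.V unchanged. With a finite number D of observations the raw estimates may come out negative; flooring them at zero is a common practical safeguard, which the description above does not prescribe.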

(34) The two boxed equations above constitute an embodiment of our proposed method for estimating spectral variances of a target speaker in reverberation, as a function of time (index m) and frequency (suppressed index k).

(35) The spectral variances λ.sub.X(m) and λ.sub.V(m) have several usages as exemplified in the following sections A1 and A2.

(36) A1. Direct-to-Reverberation Ratio Estimation

(37) The ratio λ.sub.X(m)/λ.sub.V(m) can be seen as an estimate of the direct-to-reverberation ratio (DRR). The DRR correlates with the distance to the sound source [Hioka et al.; 2011], and is also linked to speech intelligibility. Having a DRR estimate available in a hearing assistance device allows e.g. the device to change to a relevant processing strategy, or to inform the user of the hearing assistance device that the device finds the processing conditions difficult, etc.
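A minimal sketch of how the ratio might be turned into a DRR value in decibels is given below. The function name drr_db and the small floor used to avoid division by zero are assumptions of this sketch, not features of the described embodiment.

```python
import math

def drr_db(lam_X, lam_V, floor=1e-12):
    """Direct-to-reverberation ratio in dB from the two spectral variances.

    The floor (an assumption of this sketch) guards against zero or
    negative variance estimates before taking the logarithm.
    """
    ratio = max(lam_X, floor) / max(lam_V, floor)
    return 10.0 * math.log10(ratio)
```

A device could, for example, compare such a value against a threshold when selecting a processing strategy; any such threshold would have to be chosen for the application at hand.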

(38) A2. De-Reverberation

(39) A common strategy for de-reverberation in the time-frequency domain is to suppress the time-frequency tiles where the target-to-reverb ratio is small and maintain the time-frequency tiles where the target-to-reverb ratio is large (or suppress such TF-tiles less). The perceptual result of such processing is a target signal where the reverberation has been reduced. The crucial component in any such system is to determine from the available reverberant signal which time-frequency tiles are dominated by reverberation, and which are not. FIG. 4A shows a possible way of using the proposed estimation method for de-reverberation.

(40) As before, reverberant microphone signals y.sub.i are decomposed into a time-frequency representation, using analysis filterbanks (AFB in FIG. 4A). The proposed method of processing a noisy audio signal is implemented in unit ML.sub.est (shaded box in FIG. 4A corresponding to ML.sub.est-unit in FIG. 3A), as discussed in connection with FIG. 3, and is applied to the filterbank outputs Y.sub.1(m,k), Y.sub.2(m,k) to estimate spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m) as a function of time (m) and frequency (k). We assume that the noisy microphone signals Y.sub.1(m,k), Y.sub.2(m,k) are passed through a linear beamformer (Beamformer w(m,k) in FIG. 4A) with weights collected in the vector w(m,k). It should be noted that this beamformer may or may not be an MVDR beamformer. If an MVDR beamformer is desired, then the MVDR beamformer weights of the proposed method (inside the shaded box ML.sub.est of FIG. 4A) may be re-used (e.g. using unit MVDR in FIG. 3A). The output of the beamformer is then given by
{tilde over (Y)}(m)={tilde over (X)}(m)+{tilde over (V)}(m),
where
{tilde over (Y)}(m)=w(m).sup.HY(m),
{tilde over (X)}(m)=w(m).sup.HX(m),
and
{tilde over (V)}(m)=w(m).sup.HV(m),
where, as before, the frequency index k for notational convenience has been suppressed.

(41) We are interested in estimates of the power of the target component and of the late-reverberation component entering the single-channel post-processing filter. These can be found using our estimated spectral variances as
{tilde over (λ)}.sub.X,ml(m)=E|w(m).sup.HX(m)|.sup.2=λ.sub.X,ml(m)|w(m).sup.Hd(m)|.sup.2,
and
{tilde over (λ)}.sub.V,ml(m)=E|w(m).sup.HV(m)|.sup.2=λ.sub.V,ml(m)w(m).sup.HC.sub.isow(m),
respectively.
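The two projections above may be sketched as follows for one time-frequency tile. The function name postfilter_input_powers is hypothetical, and w may be any linear beamformer weight vector, not necessarily the MVDR weights.

```python
import numpy as np

def postfilter_input_powers(w, d, C_iso, lam_X, lam_V):
    """Target and reverberation power at the beamformer output (sketch).

    w      : (M,) beamformer weights
    d      : (M,) look vector
    C_iso  : (M, M) normalized isotropic covariance matrix
    lam_X, lam_V : spectral variance estimates at the reference microphone
    """
    # lam_X * |w^H d|^2
    lam_X_tilde = lam_X * abs(w.conj() @ d) ** 2
    # lam_V * w^H C_iso w (real for Hermitian C_iso)
    lam_V_tilde = lam_V * np.real(w.conj() @ C_iso @ w)
    return lam_X_tilde, lam_V_tilde
```

For a distortionless beamformer, w.sup.Hd=1, so the target power passes through unchanged and only the reverberation power is rescaled.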

(42) So, the power of the target component and of the late-reverberation component entering the single-channel post-processing filter can be found from our maximum-likelihood estimates of spectral variances, λ.sub.X,ml(m) and λ.sub.V,ml(m), and quantities which are otherwise available.

(43) The single-channel post-processing filter then uses the estimates {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) to find an appropriate gain g.sub.SC(m) to apply to the beamformer output, {tilde over (Y)}(m). That is, g.sub.SC(m) may generally be expressed as a function of {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) and potentially other parameters. For example, for a Wiener gain function, we have (e.g., [Loizou; 2013])

(44) $g_{wiener} (m) = \frac{{\tilde{λ}}_{X, ml} (m) / {\tilde{λ}}_{V, ml} (m)}{{\tilde{λ}}_{X, ml} (m) / {\tilde{λ}}_{V, ml} (m) + 1},$
whereas for the Ephraim-Malah gain function [Ephraim-Malah; 1984], we have
g.sub.em(m)=ƒ({tilde over (λ)}.sub.X,ml(m)/{tilde over (λ)}.sub.V,ml(m),|{tilde over (Y)}(m)|.sup.2/{tilde over (λ)}.sub.V,ml(m)).

(45) Many other possible gain functions exist, but they are typically a function of both {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m), and potentially other parameters.

(46) Finally, the gain function g.sub.SC(m) is applied to the beamformer output {tilde over (Y)}(m) to result in the de-reverberated time-frequency tile {circumflex over (X)}(m), i.e.,
{circumflex over (X)}(m)=g.sub.SC(m){tilde over (Y)}(m).
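The Wiener variant of the post-filter step may be sketched as below. The function names and the floor on the denominator are assumptions of this sketch; the Ephraim-Malah gain is not reproduced here because its function ƒ is not spelled out in the description.

```python
def wiener_gain(lam_X_tilde, lam_V_tilde, floor=1e-12):
    """Wiener gain xi / (xi + 1), with xi the target-to-reverberation ratio.

    Negative target power estimates are clipped to zero and the
    denominator is floored; both are safeguards assumed for this sketch.
    """
    xi = max(lam_X_tilde, 0.0) / max(lam_V_tilde, floor)
    return xi / (xi + 1.0)

def dereverberate_tile(Y_tilde, lam_X_tilde, lam_V_tilde):
    """Apply the gain to one (complex) beamformer output coefficient."""
    return wiener_gain(lam_X_tilde, lam_V_tilde) * Y_tilde
```

The gain lies between 0 and 1: tiles dominated by reverberation are attenuated strongly, while tiles dominated by the target pass almost unchanged.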

(47) In an embodiment of the system of FIG. 4A, the Beamformer w(m,k) unit (e.g. an MVDR beamformer) and the Single-Channel Post Processing unit are implemented as a multi-channel Wiener filter (MWF).

(48) B. Two Microphone Maximum-Likelihood Estimation of Speech and Late-Reverberation Spectral Variances for Speech Signals in the Presence of Reverberation and Additive Noise (FIG. 3C, 4B):

(49) The following outline illustrates yet another embodiment of an audio processing device according to the present disclosure shown in FIG. 3C and FIG. 4B. The description follows the above description of FIG. 3B and FIG. 4A but represents a scenario where, in addition to reverberant speech, additive noise is assumed to be present. Again, FIG. 3C shows an audio processing device (APD) for estimation of spectral variances λ.sub.x, λ.sub.v of target speech and reverberation signal components of a noisy input signal (here comprising speech, reverberation and additive noise), wherein the number (M) of input units is two, and wherein the two input units (Mic.sub.1, Mic.sub.2) each comprises a microphone unit (Mic.sub.i) and an analysis filterbank (AFB in FIG. 3C). It is straightforward to generalize this description to systems with more than 2 microphones (M>2).

(50) Let us assume that one target speaker is present in the acoustical scene, and that the signal reaching the hearing aid microphones consists of the three components a), b), and c) described above. The goal is to estimate the power at given frequencies and time instants of the signal components a) and b). The observable reverberant signal y.sub.i(n) reaching microphone number i may be written as
y.sub.i(n)=x.sub.i(n)+v.sub.i(n)+w.sub.i(n),
where x.sub.i(n) is the target signal component at the microphone, v.sub.i(n) is the undesired reverberation component, and w.sub.i(n) is the additive noise component, which are all assumed to be mutually uncorrelated. The reverberant signal at each microphone is passed through an analysis filter bank leading to a signal in the time-frequency domain,
Y.sub.i(k,m)=X.sub.i(k,m)+V.sub.i(k,m)+W.sub.i(k,m),
where k is a frequency index and m is a time (frame) index. For convenience, these spectral coefficients may be thought of as Discrete-Fourier Transform (DFT) coefficients.

(51) Since all operations are identical for each frequency index, we skip the frequency index in the following for notational convenience. For example, instead of Y.sub.i(k,m), we simply write Y.sub.i(m).

(52) For a given frequency index k and time index m, noisy spectral coefficients for each microphone are collected in a vector,
Y(m)=[Y.sub.1(m)Y.sub.2(m)].sup.T,
X(m)=[X.sub.1(m)X.sub.2(m)].sup.T,
V(m)=[V.sub.1(m)V.sub.2(m)].sup.T,
and
W(m)=[W.sub.1(m)W.sub.2(m)].sup.T
so that
Y(m)=X(m)+V(m)+W(m).

(53) For a given frame index m, and frequency index k (suppressed in the notation), let
d′(m)=[d′.sub.1(m)d′.sub.2(m)]
denote the (generally complex-valued) acoustic transfer function from target sound source to each microphone. It is often more convenient to operate with a normalized version of d′(m). More specifically, let
d(m)=d′(m)/d′.sub.i(m)
denote a vector whose elements d.sub.i(m) represent the relative transfer function from the target source to the ith microphone. This implies that the ith element in this vector equals one, and the remaining elements describe the acoustic transfer function from the other microphones to this reference microphone.

(54) This means that the noise free microphone vector X(m) (which cannot be observed directly), can be expressed as
X(m)=d(m)X.sub.i(m),
where X.sub.i(m) is the spectral coefficient of the target signal at the reference microphone.

(55) The inter-microphone covariance matrix for the clean signal is then given by
C.sub.X(m)=λ.sub.X(m)d(m)d(m).sup.H,
where H denotes Hermitian transposition.

(56) We model the inter-microphone covariance matrix of the late-reverberation as the covariance arising from an isotropic field,
C.sub.V(m)=λ.sub.V(m)C.sub.iso,
where C.sub.iso is the covariance matrix of the late-reverberation, normalized to have a value of 1 at the diagonal element corresponding to reference microphone, and λ.sub.V(m) is the reverberation power at the reference microphone, which, obviously, is time-varying to take into account the time-varying power level of reverberation.

(57) Finally, we assume that the covariance matrix of the additive noise is known and time-invariant. In practice, this matrix can be estimated from noise-only signal regions preceding speech activity, using a voice-activity detector.

(58) The inter-microphone covariance matrix of the noisy and reverberant signal is then given by
C.sub.Y(m)=C.sub.X(m)+C.sub.V(m)+C.sub.W,
because the target, the late-reverberation, and the noise were assumed mutually uncorrelated. As mentioned, C.sub.W is assumed known and constant (hence the lack of time-index). Inserting expressions from above, we arrive at the following expression for C.sub.Y(m),
C.sub.Y(m)=λ.sub.X(m)d(m)d(m).sup.H+λ.sub.V(m)C.sub.iso+C.sub.W.
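For illustration, the covariance model above may be assembled as in the following sketch, which is convenient for checking an estimator against known values. The function name model_covariance is hypothetical; passing no C_W yields the noise-free model of section A.

```python
import numpy as np

def model_covariance(lam_X, lam_V, d, C_iso, C_W=None):
    """Build C_Y = lam_X d d^H + lam_V C_iso + C_W for one band (sketch).

    d     : (M,) look vector
    C_iso : (M, M) normalized isotropic covariance matrix
    C_W   : (M, M) additive-noise covariance, or None for C_W = 0
    """
    M = d.shape[0]
    if C_W is None:
        C_W = np.zeros((M, M))
    # np.outer does not conjugate, so the conjugate is applied explicitly
    return lam_X * np.outer(d, d.conj()) + lam_V * C_iso + C_W
```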

(59) In practice, vector d(m) may be estimated in an off-line calibration procedure (if we assume the target to be in a fixed location compared to the hearing aid microphone array, i.e., if the user “chooses with the nose”), or it may be estimated online.

(60) Matrix C.sub.iso is estimated offline by exposing hearing aids mounted on a dummy head to a reverberant sound field (e.g. approximated as an isotropic field), and measuring the resulting inter-microphone covariance matrix.

(61) Given the expression above, we wish to find estimates of spectral variances λ.sub.X(m) and λ.sub.V(m). In particular, it is possible to derive the following expressions for maximum likelihood estimates of these quantities. Let

(62) ${\hat{C}}_{Y} (m) = \frac{1}{D} {.Math.}_{j = m - D + 1}^{m} Y (j) {Y (j)}^{H}$
denote an estimate of the noisy inter-microphone covariance matrix C.sub.Y(m), based on D observations.
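The averaging over the D most recent frames may be sketched as follows for a single frequency band. The function name sample_covariance and the (frames × microphones) array layout are assumptions of this sketch.

```python
import numpy as np

def sample_covariance(frames, m, D):
    """Average Y(j) Y(j)^H over frames j = m-D+1, ..., m (sketch).

    frames : complex array of shape (num_frames, M), one row per frame
             for one frequency band (hypothetical layout)
    m      : index of the current frame
    D      : number of observations in the average
    """
    if m - D + 1 < 0:
        raise ValueError("not enough past frames for the requested window")
    window = frames[m - D + 1 : m + 1]        # shape (D, M)
    # Rows are Y(j)^T, so sum_j Y(j) Y(j)^H equals window^T @ conj(window)
    return window.T @ window.conj() / D
```

A recursive (exponentially weighted) average is a common alternative in real-time systems, but it is a different estimator from the block average described here.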

B1. Special Case: No Additive Noise (C.sub.W=0)

(63) We first consider the case when there is no additive noise present (C.sub.W=0), because in this case the resulting ML estimators are particularly simple. In practice, the noise is never completely absent, but the following results hold for high signal-to-noise ratios, i.e., when C.sub.W is small compared to C.sub.X(m), or in very reverberant situations, i.e., when C.sub.W is small compared to C.sub.V(m).

(64) In this case, the following maximum-likelihood estimates of spectral variances λ.sub.X(m) and λ.sub.V(m) can be derived:

(65) $\begin{matrix} λ_{V, ml} (m) = \frac{1}{M - 1} tr (Q_{u} (m) {\hat{C}}_{Y} (m) C_{iso}^{- 1}), \end{matrix}$
where
Q.sub.u(m)=I−d(m)(d(m).sup.HC.sub.iso.sup.−1d(m)).sup.−1d(m).sup.HC.sub.iso.sup.−1,
and M=2 is the number of microphones. Furthermore,
λ.sub.X,ml(m)=w.sub.mvdr.sup.H(m)(Ĉ.sub.Y(m)−λ.sub.V,ml(m)C.sub.iso)w.sub.mvdr(m),
where

(66) $w_{mvdr} (m) = \frac{C_{iso}^{- 1} d (m)}{{d (m)}^{H} C_{iso}^{- 1} d (m)}$
is a vector of filter weights for a minimum-variance distortionless response (MVDR), see e.g. [Haykin; 2001].

(67) The two boxed equations above constitute an embodiment of the proposed method in the special case of low additive noise, for estimating spectral variances of a target speaker in reverberation, as a function of time (index m) and frequency (suppressed index k). This is the same result as provided in section A above.

(68) B2. General Case: Additive Noise (C.sub.W≠0)

(69) To express the maximum likelihood estimates of the spectral variances λ.sub.X(m) and λ.sub.V(m) in this general case, we need to introduce some additional notation.

(70) First, let us introduce an M×(M−1) complex-valued blocking matrix B∈C.sup.M×(M−1) given by
[Bd]=I−d(m)(d(m).sup.Hd(m)).sup.−1d(m).sup.H,
i.e., the matrix B is given by the first M−1 columns of the matrix on the right-hand side.

(71) Also, let us define a pre-whitening matrix D∈C.sup.(M−1)×(M−1), which has the property that
(B.sup.HC.sub.WB).sup.−1=D.sup.HD.

(72) Matrix D can, e.g., be found from a Cholesky decomposition of the matrix on the left-hand side above.

(73) In any case, matrices B and D can be computed from known quantities at any time instant m.
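One way to compute B and D is sketched below. Instead of the Cholesky decomposition mentioned above, this sketch takes D as the Hermitian inverse square root of B.sup.HC.sub.WB obtained from an eigenvalue decomposition; this is a substituted technique, chosen because it satisfies the stated property and also makes the blocked and pre-whitened noise covariance D.sup.HB.sup.HC.sub.WBD equal to the identity. The function names are hypothetical, and the last element of d is assumed to be non-zero so that B has full column rank.

```python
import numpy as np

def blocking_matrix(d):
    """First M-1 columns of I - d (d^H d)^-1 d^H (sketch)."""
    M = d.shape[0]
    P = np.eye(M) - np.outer(d, d.conj()) / np.real(d.conj() @ d)
    return P[:, : M - 1]

def prewhitening_matrix(B, C_W):
    """Hermitian inverse square root of B^H C_W B (assumed choice of D).

    C_W is assumed Hermitian positive definite, so that the blocked
    matrix has strictly positive eigenvalues.
    """
    A = B.conj().T @ C_W @ B
    eigval, eigvec = np.linalg.eigh(A)
    # Scale column j of eigvec by 1/sqrt(eigval[j]), then rotate back
    return (eigvec / np.sqrt(eigval)) @ eigvec.conj().T
```

Since every column of B is orthogonal to d, the target component vanishes in the blocked domain, which is what allows λ.sub.V to be estimated without knowledge of λ.sub.X.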

(74) To describe the maximum likelihood estimates compactly, we need to introduce the signal quantities from the previous section in a blocked and whitened domain. The quantities in this new domain are denoted with ′. We define
Y′(m)=D.sup.HB.sup.HY(m),
and similarly for X′(m), V′(m), and W′(m). Covariance matrices in this blocked and pre-whitened domain are given by
C.sub.Y′(m)=D.sup.HB.sup.HC.sub.Y(m)BD,
and similarly for C.sub.X′(m), C.sub.iso′(m), C.sub.W′(m), and Ĉ.sub.Y′(m). Note that all these (square) covariance matrices have dimension M′=M−1, where M is the number of microphones.

(75) Finally, let us introduce some additional notation. Let
C.sub.Y′(m)=UΛ.sub.Y′U.sup.H
denote the eigenvalue decomposition of the (blocked and pre-whitened) covariance matrix C.sub.Y′(m), where the columns of matrix U are the eigenvectors and the eigenvalues are the diagonal elements of the diagonal matrix
Λ.sub.Y′=diag(λ.sub.y1, . . . ,λ.sub.yM′).
Similarly, let
C.sub.iso′=UΛ.sub.iso′U.sup.H
denote the eigenvalue decomposition of the (blocked and pre-whitened) matrix C.sub.iso′, such that
Λ.sub.iso′=diag(λ.sub.iso,1, . . . ,λ.sub.iso,M′)
is a diagonal eigenvalue matrix.

(76) Furthermore, let g.sub.m denote the m'th diagonal element of the matrix
U.sup.HĈ.sub.Y′(m)U.

(77) Then it can be shown that the maximum likelihood estimate λ.sub.V,ML of λ.sub.V can be found as one of the roots of the polynomial (in the variable λ.sub.V):

(78) $- \sum_{m = 1}^{M'} λ_{iso, m} (λ_{V} λ_{iso, m} + 1 - g_{m}) \prod_{k = 1, k \neq m}^{M'} {(λ_{V} λ_{iso, k} + 1)}^{2} = 0.$

(79) Specifically, λ.sub.V(m) is found as the positive, real root of the polynomial. In most cases, there is only one such root.
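The root search may be sketched as follows, taking the eigenvalues λ.sub.iso,m and the diagonal elements g.sub.m as inputs. The function name ml_reverb_variance is hypothetical. Two choices in this sketch are assumptions that go beyond the description: when several positive real roots exist, the one with the highest likelihood is kept, and when none exists the estimate is set to zero.

```python
import numpy as np

def ml_reverb_variance(lam_iso, g, tol=1e-8):
    """Positive real root of the polynomial in lam_V (sketch).

    lam_iso : eigenvalues of the blocked, pre-whitened C_iso (non-negative)
    g       : diagonal elements of U^H C_hat_Y' U
    """
    lam_iso = np.asarray(lam_iso, dtype=float)
    g = np.asarray(g, dtype=float)
    coeffs = np.zeros(1)
    for m in range(lam_iso.size):
        # lam_iso_m * (lam_V * lam_iso_m + 1 - g_m), highest power first
        term = lam_iso[m] * np.array([lam_iso[m], 1.0 - g[m]])
        for k in range(lam_iso.size):
            if k != m:
                factor = np.array([lam_iso[k], 1.0])
                # multiply by (lam_V * lam_iso_k + 1)^2
                term = np.polymul(term, np.polymul(factor, factor))
        coeffs = np.polyadd(coeffs, term)
    # The overall minus sign of the polynomial does not change its roots
    roots = np.roots(coeffs)
    candidates = [r.real for r in roots if abs(r.imag) < tol and r.real > 0]
    if not candidates:
        return 0.0  # assumed fallback: no reverberation detected

    def neg_log_likelihood(lam_V):
        s = lam_V * lam_iso + 1.0
        return float(np.sum(np.log(s) + g / s))

    return min(candidates, key=neg_log_likelihood)
```

For two microphones (M′=1) the polynomial is linear and the root reduces to (g.sub.1−1)/λ.sub.iso,1.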

(80) The corresponding maximum-likelihood estimate λ.sub.X,ML(m) of the target speech spectral variance λ.sub.X(m) can then be found from quantities in the non-blocked and non-prewhitened domain as:
λ.sub.X,ML(m)=w.sub.mvdr.sup.H(m)(Ĉ.sub.Y(m)−λ.sub.V,ML(m)C.sub.iso−C.sub.W)w.sub.mvdr(m),
where

(81) $w_{mvdr} (m) = \frac{C_{V + W}^{- 1} (m) d (m)}{{d (m)}^{H} C_{V + W}^{- 1} (m) d (m)},$
where
C.sub.V+W(m)=λ.sub.V,ML(m)C.sub.iso+C.sub.W.
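Given an estimate of λ.sub.V, the target variance may be computed as in the sketch below. The function name ml_target_variance is hypothetical, and C.sub.V+W is assumed to be invertible.

```python
import numpy as np

def ml_target_variance(C_hat, d, C_iso, C_W, lam_V):
    """lam_X estimate in the general case with additive noise (sketch).

    C_hat : (M, M) estimated noisy covariance matrix
    d     : (M,) look vector
    C_iso : (M, M) normalized isotropic covariance matrix
    C_W   : (M, M) additive-noise covariance matrix
    lam_V : reverberation variance estimate
    Returns (lam_X, w_mvdr).
    """
    C_VW = lam_V * C_iso + C_W
    # Solve C_VW x = d rather than forming the inverse explicitly
    num = np.linalg.solve(C_VW, d)
    # MVDR weights: C_VW^-1 d / (d^H C_VW^-1 d)
    w = num / (d.conj() @ num)
    lam_X = np.real(w.conj() @ (C_hat - C_VW) @ w)
    return lam_X, w
```

Because the MVDR weights satisfy w.sup.Hd=1, the estimate equals λ.sub.X exactly whenever C_hat matches the model covariance and lam_V is the true reverberation variance.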

(82) The spectral variances λ.sub.X(m) and λ.sub.V(m) have several usages as exemplified in the following sections B3 and B4.

(83) B3. Direct-to-Reverberation Ratio Estimation

(84) The ratio λ.sub.X(m)/λ.sub.V(m) can be seen as an estimate of the direct-to-reverberation ratio (DRR). The DRR correlates with the distance to the sound source [Hioka et al.; 2011], and is also linked to speech intelligibility. Having a DRR estimate available on board a hearing aid allows the hearing aid to change to a relevant processing strategy, or to inform the hearing aid user that the hearing aid finds the processing conditions difficult, etc.

(85) B4. Dereverberation—Special Case with No (or Low) Additive Noise (C.sub.W=0)

(86) In this special case, the target signal is disturbed by reverberation, but no additional noise.

(87) A common strategy for dereverberation in the time-frequency domain is to suppress the time-frequency tiles where the target-to-reverb ratio is small and maintain the time-frequency tiles where the target-to-reverb ratio is large. The perceptual result of such processing is a target signal where the reverberation has been reduced. The crucial component in any such system is to determine from the available reverberant signal which time-frequency tiles are dominated by reverberation, and which are not. FIG. 4B shows a possible way of using the proposed estimation method for dereverberation.

(88) As before, reverberant microphone signals are decomposed into a time-frequency representation, using analysis filter banks. The proposed method (shaded box) is applied to the filter bank output to estimate spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m) as a function of time and frequency. We assume that the noisy microphone signals are passed through a linear beamformer with weights collected in the vector w(m,k). This beamformer may or may not be an MVDR beamformer. If an MVDR beamformer is desired, then the MVDR beamformer of the proposed method (inside the shaded ML.sub.est-box in FIG. 4B) may be re-used. The output of the beamformer is then given by
{tilde over (Y)}(m)={tilde over (X)}(m)+{tilde over (V)}(m),
where
{tilde over (Y)}(m)=w(m).sup.HY(m),
{tilde over (X)}(m)=w(m).sup.HX(m),
and
{tilde over (V)}(m)=w(m).sup.HV(m),
where, as before, we skipped the frequency index k for notational convenience.

(89) We are interested in estimates of the power of the target component and of the late-reverberation component entering the single-channel post-processing filter. These can be found using our estimated spectral variances as
{tilde over (λ)}.sub.X,ml(m)=E|w(m).sup.HX(m)|.sup.2=λ.sub.X,ml(m)|w(m).sup.Hd(m)|.sup.2,
and
{tilde over (λ)}.sub.V,ml(m)=E|w(m).sup.HV(m)|.sup.2=λ.sub.V,ml(m)w(m).sup.HC.sub.isow(m),
respectively.

(90) So, the power of the target component and of the late-reverberation component entering the single-channel post-processing filter can be found from our maximum-likelihood estimates of spectral variances, λ.sub.X(m) and λ.sub.V(m), and quantities which are otherwise available.

(91) The single-channel post-processing filter then uses the estimates {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) to find an appropriate gain g.sub.SC(m) to apply to the beamformer output, {tilde over (Y)}(m). That is, g.sub.SC(m) may generally be expressed as a function of {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m) and potentially other parameters. For example, for a Wiener gain function, we have (e.g., [Loizou; 2013])

(92) $g_{wiener} (m) = \frac{{\tilde{λ}}_{X, ml} (m) / {\tilde{λ}}_{V, ml} (m)}{{\tilde{λ}}_{X, ml} (m) / {\tilde{λ}}_{V, ml} (m) + 1},$
whereas for the Ephraim-Malah gain function [Ephraim-Malah; 1984], we have
g.sub.em(m)=ƒ({tilde over (λ)}.sub.X,ml(m)/{tilde over (λ)}.sub.V,ml(m),|{tilde over (Y)}(m)|.sup.2/{tilde over (λ)}.sub.V,ml(m)).

(93) Many other possible gain functions exist, but they are typically a function of both {tilde over (λ)}.sub.X,ml(m) and {tilde over (λ)}.sub.V,ml(m), and potentially other parameters.

(94) Finally, the gain function g.sub.SC(m) is applied to the beamformer output {tilde over (Y)}(m) to result in the dereverberated time-frequency tile {circumflex over (X)}(m), i.e.,
{circumflex over (X)}(m)=g.sub.SC(m){tilde over (Y)}(m),
as also disclosed in section A above.
B5. Dereverberation—General Case with Additive Noise (C.sub.W≠0)

(95) In the general case, the target signal is disturbed by both reverberation and additive noise. Analogously to the previous section, we are interested in the spectral variances of all signal components, entering the single-channel postfilter. As above, the spectral variances of the target and the reverberation component can be found from the maximum-likelihood estimates as
{tilde over (λ)}.sub.X,ml(m)=E|w(m).sup.HX(m)|.sup.2=λ.sub.X,ml(m)|w(m).sup.Hd(m)|.sup.2,
and
{tilde over (λ)}.sub.V,ml(m)=E|w(m).sup.HV(m)|.sup.2=λ.sub.V,ml(m)w(m).sup.HC.sub.isow(m),
respectively.

(96) Furthermore, the spectral variance of the additive noise component entering the single-channel postfilter is given by
λ.sub.W(m)=E|w(m).sup.HW(m)|.sup.2=w(m).sup.HC.sub.Ww(m)

(97) Generally speaking, the single-channel postfilter gain is a function of {tilde over (λ)}.sub.X,ml(m), {tilde over (λ)}.sub.V,ml(m), λ.sub.W(m), and potentially other parameters. For example, one could define the total spectral disturbance as the sum of the reverberation and noise variances,
λ.sub.dist(m)={tilde over (λ)}.sub.V,ml(m)+λ.sub.W(m).

(98) Then a signal-to-total-disturbance ratio would be given by
ξ(m)={tilde over (λ)}.sub.X,ml(m)/λ.sub.dist(m).

(99) With this, new versions of the Wiener gain function or the Ephraim-Malah gain function could be defined analogously to the description above. However, rather than suppressing only the reverberation component, these new gain functions suppress the reverberation and the additive noise component jointly.
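A Wiener-type gain that suppresses reverberation and additive noise jointly may be sketched as below, taking the signal-to-total-disturbance ratio as target power over total disturbance power, as its name suggests. The function name joint_wiener_gain and the floor on the denominator are assumptions of this sketch.

```python
def joint_wiener_gain(lam_X_tilde, lam_V_tilde, lam_W_tilde, floor=1e-12):
    """Wiener gain driven by the signal-to-total-disturbance ratio (sketch).

    lam_X_tilde : target power at the beamformer output
    lam_V_tilde : reverberation power at the beamformer output
    lam_W_tilde : additive-noise power at the beamformer output, w^H C_W w
    """
    lam_dist = lam_V_tilde + lam_W_tilde          # total disturbance power
    xi = max(lam_X_tilde, 0.0) / max(lam_dist, floor)
    return xi / (xi + 1.0)
```

With lam_W_tilde set to zero the expression reduces to the Wiener gain of the noise-free special case.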

(100) FIG. 5 shows an embodiment of an audio processing system (APD) according to the present disclosure. The audio processing system (APD) comprises the same elements as shown in FIG. 3A: Input units IU.sub.i, i=1, 2, . . . , M providing time-frequency representations Y of noisy audio signals y (comprising a target signal component x and a first noise signal component v, and optionally a second, additive noise signal component w) to a maximum likelihood estimation unit ML.sub.est for estimating spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m) of the target signal component x and the first noise signal component v, respectively (or scaled versions thereof). In the embodiment of FIG. 5, input units IU.sub.i further comprise normalization filter units H.sub.i. The normalization filter units have a transfer function H.sub.i(k), which makes the source providing the electric input signal in question comparable and interchangeable with the other sources. This has the advantage that the signal contents of the individual noisy input signals y.sub.i can be compared. The i.sup.th input unit IU.sub.i (i=1, 2, . . . , M) comprises input transducer IT.sub.i for converting an input sound signal y.sub.i to an electric input signal I.sub.i or another input device for providing the electric input signal I.sub.i. Normalization filter H.sub.i (e.g. an adaptive filter) filters electric input signal I.sub.i to a normalized signal IN.sub.i (e.g. within a predetermined voltage range) and feeds the normalized time domain signal IN.sub.i to analysis filterbank AFB, which provides a time-frequency representation Y.sub.i(m,k) of the noisy input signal y.sub.i to the maximum likelihood estimation unit ML.sub.est. This makes it possible to compensate for unmatched microphones, to use different kinds of sensors (microphones, vibration sensors, optical sensors, electrodes e.g. for sensing brain waves, etc.), to compensate for different locations of sensors, etc.
The maximum likelihood estimation unit ML.sub.est further receives a predetermined target look vector (d) and noise covariance matrix (Ĉ.sub.v) (or scaled versions thereof) allowing estimation of spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m). The processing in the ML.sub.est unit is indicated in FIG. 5 to be performed in individual frequency bands k, k=1, 2, . . . , K (by the solid ‘shadow boxes’ denoted 1-K ‘behind’ the front ML.sub.est box). In an embodiment, where a second, additive noise component w.sub.i is present in the noisy input signals y.sub.i, a further predetermined noise covariance matrix (Ĉ.sub.w) for the additive noise is assumed to be provided to the maximum likelihood estimation unit ML.sub.est.

(101) FIG. 6 shows an embodiment of an audio processing device according to the present disclosure comprising the same elements as the embodiment in FIG. 5, except that the maximum likelihood estimation unit ML.sub.est for estimating spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m) forms part of a more general signal processing unit SPU comprising e.g. also a beamformer and single-channel post filtering as discussed in connection with FIG. 4 and/or other signal processing making use of spectral variances λ.sub.X,ml(m) and λ.sub.V,ml(m) (or scaled versions thereof). The signal processing unit SPU comprises a memory wherein characteristics of the target and noise signal components are stored, e.g. a predetermined target look vector (d) and first noise covariance matrix (Ĉ.sub.v, e.g. C.sub.iso) and optionally a second covariance matrix (C.sub.w) (or scaled versions thereof). The signal processing unit SPU provides enhanced, e.g. de-reverberated, signal X(m,k). The signal processing unit SPU may e.g. be configured to apply a frequency dependent gain to the resulting enhanced signal X to compensate for a hearing impairment of a user. The embodiment of FIG. 6 further comprises a synthesis filterbank SFB for converting the enhanced time-frequency domain signal X(m,k) to a time domain (output) signal OUT, which may be further processed or, as here, fed to output unit OU. The output unit may be an output transducer for converting an electric signal to a stimulus perceived by the user as an acoustic signal. In an embodiment, the output transducer comprises a receiver (speaker) for providing the stimulus as an acoustic signal to the user. The output unit OU may alternatively or additionally comprise a number of electrodes of a cochlear implant hearing device or a vibrator of a bone conducting hearing device or a transceiver for transmitting the resulting signal to another device. The embodiment of an audio processing device shown in FIG. 6 may implement a hearing assistance device.

(102) FIG. 7 shows a flow diagram illustrating a method of processing a noisy input signal according to the present disclosure. The noisy audio signal y(n) comprises a target signal component x(n) and a first noise signal component v(n) (and optionally a second additive noise component w(n)), n representing time. The method comprises the steps of

(103) a) Providing or receiving a time-frequency representation Y.sub.i(k,m) of the noisy audio signal y.sub.i(n) at an i.sup.th input unit, i=1, 2, . . . , M, where M is larger than or equal to two, in a number of frequency bands and a number of time instances, k being a frequency band index and m being a time index;
b) Estimating spectral variances or scaled versions thereof λ.sub.V, λ.sub.X of said first noise signal component v and said target signal component x, respectively, as a function of frequency index k and time index m, said estimates of λ.sub.V and λ.sub.X being jointly optimal in maximum likelihood sense.

(104) The maximum likelihood optimization is based (exclusively) on the following statistical assumptions: that the time-frequency representations Y.sub.i(k,m), X.sub.i(k,m), and V.sub.i(k,m) (and optionally W.sub.i(k,m)) of respective signals y.sub.i(n), and signal components x.sub.i(n), and v.sub.i(n) (and optionally w.sub.i(n)) are zero-mean, complex-valued Gaussian distributed, that each of them is statistically independent across time m and frequency k, and that X.sub.i(k,m) and V.sub.i(k,m) (and optionally W.sub.i(k,m)) are mutually uncorrelated.

(105) The method is—in general—based on the assumption that characteristics (e.g. spatial characteristics) of the target and noise signal components are known.

(106) The assumptions regarding the characteristics of the target and noise signal components are e.g. that the direction to the target signal relative to the input units is known (fixed d) and that the spatial fingerprint of the first noise signal component is also known, e.g. isotropic (C.sub.v=C.sub.iso). In case a second, additive noise component is present, it is assumed that its characteristics in the form of an inter input covariance matrix C.sub.w is known.

(107) The invention is defined by the features of the independent claim(s). Preferred embodiments are defined in the dependent claims. Any reference numerals in the claims are intended to be non-limiting for their scope.

(108) Some preferred embodiments have been shown in the foregoing, but it should be stressed that the invention is not limited to these, but may be embodied in other ways within the subject-matter defined in the following claims and equivalents thereof.

REFERENCES

(109) US2009248403A
WO12159217A1
US2013343571A1
US2010246844A1
[Braun&Habets; 2013] S. Braun and E. A. P. Habets, “Dereverberation in noisy environments using reference signals and a maximum likelihood estimator”, Presented at the 21.sup.st European Signal Processing Conference (EUSIPCO 2013), 5 pages (EUSIPCO 2013 1569744623).
[Schaub; 2008] Arthur Schaub, “Digital hearing Aids”, Thieme Medical. Pub., 2008.
[Haykin; 2001] S. Haykin, “Adaptive Filter Theory,” Fourth Edition, Prentice Hall Information and System Sciences Series, 2001.
[Hioka et al.; 2011] Y. Hioka, K. Niwa, S. Sakauchi, K. Furuya, and Y. Haneda, “Estimating Direct-to-Reverberant Energy Ratio Using D/R Spatial Correlation Matrix Model”, IEEE Trans. Audio, Speech, and Language Processing, Vol. 19, No. 8, November 2011, pp. 2374-2384.
[Loizou; 2013] P. C. Loizou, “Speech Enhancement: Theory and Practice,” Second Edition, February, 2013, CRC Press.
[Ephraim-Malah; 1984] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, December 1984, pp. 1109-1121.
[Kjems&Jensen; 2012] U. Kjems, J. Jensen, “Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement”, 20th European Signal Processing Conference (EUSIPCO 2012), pp. 295-299, 2012.
[Ye&DeGroat; 1995] H. Ye and R. D. DeGroat, “Maximum likelihood DOA estimation and asymptotic Cramér-Rao bounds for additive unknown colored noise,” Signal Processing, IEEE Transactions on, vol. 43, no. 4, pp. 938-949, 1995.
[Shimitzu et al.; 2007] Hikaru Shimizu, Nobutaka Ono, Kyosuke Matsumoto, Shigeki Sagayama, “Isotropic noise suppression in the power spectrum domain by symmetric microphone arrays”, 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 21-24, 2007, New Paltz, N.Y., pp. 54-57.

CPC classification: G10L2021/02166 (PHYSICS); H04R29/005 (ELECTRICITY); H04R25/30 (ELECTRICITY); H04R3/005 (ELECTRICITY); G10L2021/02082 (PHYSICS); G10L21/0232 (PHYSICS); H04R25/407 (ELECTRICITY); G10L21/0208 (PHYSICS)
International classification: H04R25/00 (ELECTRICITY); G10L21/0208 (PHYSICS); H04R29/00 (ELECTRICITY)
