SPATIO-TEMPORAL SPEECH ENHANCEMENT TECHNIQUE BASED ON GENERALIZED EIGENVALUE DECOMPOSITION

Abstract

The present invention describes a speech enhancement method using microphone arrays and a new iterative technique for enhancing noisy speech signals under low signal-to-noise-ratio (SNR) environments. A first embodiment involves the processing of the observed noisy speech both in the spatial- and the temporal-domains to enhance the desired signal component speech and an iterative technique to compute the generalized eigenvectors of the multichannel data derived from the microphone array. The entire processing is done on the spatio-temporal correlation coefficient sequence of the observed data in order to avoid large matrix-vector multiplications. A further embodiment relates to a speech enhancement system that is composed of two stages. In the first stage, the noise component of the observed signal is whitened, and in the second stage a spatio-temporal power method is used to extract the most dominant speech component. In both the stages, the filters are adapted using the multichannel spatio-temporal correlation coefficients of the data and hence avoid large matrix vector multiplications.

Claims

1: A speech enhancement method, comprising: obtaining a speech signal using at least one input microphone; calculating a whitening filter using a silence interval in the obtained speech signal; applying the whitening filter to the obtained speech signal to generate a whitened speech signal in which noise components present in the obtained speech signal are whitened; estimating a clean speech signal by applying a multi-channel filter to the whitened speech signal; and outputting the clean speech signal via an audio device.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0033] A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, in which like reference numerals refer to identical or corresponding parts throughout the several views, and in which:

[0034] FIG. 1: illustrates a block diagram of one embodiment of the present invention;

[0035] FIG. 2: illustrates a table providing an example of Pseudo Code for an Iterative Whitening process

[0036] FIG. 3: illustrates a table providing an example of Pseudo Code for an Spatio-Temporal Power Method;

[0037] FIG. 4: illustrates a table providing an example of Pseudo Code for an Algorithm Implementation of one embodiment of the claimed invention;

[0038] FIG. 5: illustrates a flow diagram of a method of one embodiment of the present invention; and

[0039] FIG. 6: illustrates a block diagram of one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0040] One embodiment of the present invention relates to a method of Spatio-Temporal Eigenfiltering using a signal model. For instance, letting s(l) denote a clean speech source signal which is measured at the output of an n-microphone array in the presence of colored noise v(l) at time instant l. The output of j.sup.th microphone is given as

[00001] $\begin{matrix} y_{j} (l) = v_{j} (l) + {.Math.}_{p = -}^{} .Math. .Math. h_{jp} .Math. s (l - p) = v_{j} (l) + x_{j} (l) & (1) \end{matrix}$

where {h.sub.jp} are the coefficients of the acoustic impulse response between the speech source and the j.sup.th microphone, and x.sub.j(l) and v.sub.j(l) are the filtered speech and noise component received at the j.sup.th microphone, respectively. The additive noise v.sub.j(l) is assumed to be uncorrelated with the clean speech signal and possesses a certain autocorrelation structure. One of the goals of the speech enhancement system is to compute a set of filters w.sub.j, j=0, . . . , n1 such that the speech component of x.sub.j(l) is enhanced while the noise component v.sub.j(l) is reduced. The filters w.sub.1 are usually finite impulse response (FIR) filters due to the finite reverberation time of the environment. In fact, acoustic impulse responses decay with time such that only a finite number of tap values h.sub.jp in Eq. (1) are essentially non-zero. The vector model of signal corresponding to an n-element microphone array can be written as

y(l)=x(l)+v(l)(2)

where y(l)=[y.sub.1(l) y.sub.2(l) . . . y.sub.n(l)].sup.T, x(l)=[x.sub.1(l) x.sub.2(l) . . . x.sub.n(l)].sup.T, and v(l)=[v.sub.1(l) v.sub.2(l) . . . v.sub.n(l)].sup.T are the observed signal, the clean speech signal and the noise signal respectively.

[0041] With regard to Spatio-Temporal Eigenfiltering, a goal is to transform the speech enhancement problem into an iterative multichannel filtering task in which the output of the multichannel filter {W.sub.p(k)} at time instant l and iteration k can be written as

[00002] $\begin{matrix} z_{k} (l) = {.Math.}_{p = 0}^{L} .Math. .Math. W_{p} (k) .Math. y (l - p) . & (3) \end{matrix}$

where {W.sub.p(k)} is the nn multichannel enhancement filter of length L at iteration k, and the n-dimensional signal z.sub.k(l) is the output of this multichannel filter. Upon filter convergence for sufficiently large k, one of the signals in z.sub.k(l) will contain a close approximation of the original signal x.sub.i(l). Equation (3) can further be written by substituting the value of y(l) as

[00003] $\begin{matrix} z_{k} (l) = {.Math.}_{p = 0}^{L} .Math. .Math. W_{p} (k) .Math. (v (l - p) + x (l - p)) . & (4) \end{matrix}$

One of the goals of the present invention is to adapt the matrix coefficient sequence {W.sub.p(k)} to maximize the signal-to-noise ratio (SNR) at the system output. To achieve this goal, the power in z.sub.k(l) at the k.sup.th iteration is given by the following expression for P(k):

[00004] $\begin{matrix} P (k) = tr .Math. {\frac{1}{N} .Math. {.Math.}_{l = N (k - 1) + 1}^{Nk} .Math. .Math. z_{k} (l) .Math. z_{k}^{T} (l)} = {.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. tr .Math. {W_{p} (k) .Math. {Ry}_{q - p} .Math. W_{q}^{T} (k)}, & (5) \end{matrix}$

where N is the length of the data sequence, the notation tr{.} corresponds to the trace of a matrix, and {Ry.sub.p} denotes the multichannel autocorrelation sequence of y and is given by

[00005] $\begin{matrix} {Ry}_{p} = \frac{1}{N} .Math. {.Math.}_{l = N (k - 1) + 1}^{NK} .Math. .Math. y (l) .Math. y^{T} (l - p), - \frac{L}{2} p \frac{L}{2} . & (6) \end{matrix}$

[0042] Note that {W.sub.p(k)} is assumed to be zero outside the range 0pL, and {Ry.sub.p} is assumed to be zero outside the range |p|(L/2). Under the assumption of uncorrelated speech and noise, the total signal power can be written as P(k)=P.sub.x(k)+P.sub.v(k), where

[00006] $\begin{matrix} P_{x} (k) = {.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. tr .Math. {W_{p} (k) .Math. {R_{X}}_{q - p} .Math. W_{q}^{T} (k)} .Math. .Math. P_{v} (k) = {.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. tr .Math. {W_{p} (k) .Math. {R_{V}}_{_{q - p}} .Math. W_{q}^{T} (k)}, & (7) .Math. (8) \end{matrix}$

[0043] The problem of SNR maximization in the presence of colored noise is closely related to the problem of the generalized eigenvalue decomposition (GEVD). This problem has also been referred to as oriented principal component analysis (OPCA) [17]. The nomenclature is consistent with the fact that the generalized eigenvectors point in directions which maximize the signal variance and minimize the noise variance. However, since both {Rx.sub.p} and {Rv.sub.p} are not directly available, the values in {Rv.sub.p} are typically estimated during an appropriate silence period of the noisy speech in which there is no speech activity. Letting the number of samples of the noise sequence be denoted as N.sub.v(<<N) then the multichannel autocorrelation sequence corresponding to the noise process can be written as

[00007] $\begin{matrix} R_{V_{p}} = \frac{1}{N_{v}} .Math. {.Math.}_{l = N_{v} (k - 1) + 1}^{N_{v} .Math. k} .Math. .Math. v (l) .Math. v^{T} (l - p), - \frac{L}{2} p \frac{L}{2} . & (9) \end{matrix}$

[0044] As for the replacement of {Rx.sub.p}, the multichannel autocorrelation sequence {Ry.sub.p} is used to find the stationary points of the following spatio-temporal power ratio:

[00008] $\begin{matrix} J ({W_{p} (k)}) = \frac{tr .Math. {{.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. W_{p} (k) .Math. {Ry}_{q - p} .Math. W_{q}^{T} (k)}}{tr .Math. {{.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. W_{p} (k) .Math. {R_{V}}_{q - p} .Math. W_{q}^{T} (k)}} . & (10) \end{matrix}$

[0045] The function J({W.sub.p(k)}) is the spatio-temporal extension of the generalized Rayleigh quotient, and the solution that maximizes equation (10) are the generalized eigenvectors (or eigenfilters) of the multichannel autocorrelation sequence pair ({Rx.sub.p}, {Ry.sub.p}). For sufficiently many iterations k, the multichannel FIR filter sequence {W.sub.p(k)} is designed to satisfy the following equations:

[00009] $\begin{matrix} {.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. W_{p} (k) .Math. R_{V_{q - p}} .Math. W_{q}^{T} (k) = {\begin{matrix} if .Math. .Math. .Math. q - p .Math. = 0 \\ 0 & otherwise .Math. \end{matrix} .Math. .Math. {.Math.}_{p = 0}^{L} .Math. .Math. {.Math.}_{q = 0}^{L} .Math. .Math. W_{p} (k) .Math. R_{V_{q - p}} .Math. W_{q}^{T} (k) = {\begin{matrix} I & if .Math. .Math. .Math. q - p .Math. = 0 \\ 0 & otherwise \end{matrix} . & (11) .Math. (12) \end{matrix}$

where and {W.sub.p} denote the generalized eigenvalues and eigenvectors of ({Rx.sub.p}, {Ry.sub.p}). This solution maximizes the energy of the speech component of the noisy mixture while minimizing the noise energy at the same time.

[0046] The present invention also addresses spatio-temporal generalized eigenvalue decomposition. The present method relies on multichannel correlation coefficient sequences of the noisy speech process and noise process defined in (6) and (9). Next, the multichannel convolution operations needed for the update of the filter sequence {W.sub.p} are defined as

[00010] $\begin{matrix} {\overline{Ry}}_{q} (k) = {\begin{matrix} {.Math.}_{p = 0}^{L} .Math. .Math. ({Ry}_{p - q}) .Math. W_{p}^{T} (k) & if .Math. - \frac{L}{2} q \frac{L}{2} \\ 0 & otherwise \end{matrix} . .Math. {Gy}_{p} (k) = {\begin{matrix} {.Math.}_{q = 0}^{L} .Math. .Math. W_{q} (k) .Math. {\overline{Ry}}_{p - q} (k) & if .Math. .Math. 0 p L \\ 0 & otherwise \end{matrix} . .Math. {\overline{R_{V}}}_{p} (k) = {\begin{matrix} {.Math.}_{p = 0}^{L} .Math. .Math. ({R_{V}}_{q - p}) .Math. W_{p}^{T} (k) & if .Math. - \frac{L}{2} q \frac{L}{2} \\ 0 & otherwise \end{matrix} . .Math. {G_{V}}_{p} (k) = {\begin{matrix} {.Math.}_{q = 0}^{L} .Math. .Math. W_{q} (k) .Math. {\overline{R_{V}}}_{p - q} (k) & if .Math. .Math. 0 p L \\ 0 & otherwise \end{matrix} . & (13) .Math. - .Math. (16) \end{matrix}$

[0047] In the above set of equations, H(.) denotes a form of multichannel weighting on the autocorrelation sequences necessary to ensure the validity of the autocorrelation sequence for an FIR filtering operations needed in the algorithm update. Through numerical simulations it has been determined that this weighting is necessary both on the autocorrelation sequence itself as well as its filtered version at each iteration of the algorithm. This weighting amounts to multiplying each element of the resultant matrix sequence by a Bartlett window centered at p=q, although other windowing functions common in the digital signal processing literature can also be used. Next, we define the scalar terms

[00011] $\begin{matrix} f_{2} (k) = \frac{1}{n} .Math. {.Math.}_{i = 1}^{n} .Math. .Math. {.Math.}_{j = 1}^{n} .Math. .Math. {.Math.}_{p = 0}^{L} .Math. .Math. .Math. g_{ijp}^{v} (k) .Math., f_{1} (k) = \frac{1}{n} .Math. {.Math.}_{i = 1}^{n} .Math. .Math. {.Math.}_{j = 1}^{n} .Math. .Math. {.Math.}_{p = 0}^{L} .Math. .Math. .Math. g_{ijp}^{y} (k) .Math., & (17) \end{matrix}$

where g.sub.ijp.sup.y(k) and g.sub.ijp.sup.v(k) are the elements of coefficient sequence G.sub.y.sub.p(k) and G.sub.v.sub.p(k) respectively. Following these definitions, define the scaled gradient [18] for the update of spatio-temporal eigenvectors as

[00012] $\begin{matrix} G_{p} (k) = \frac{f_{2} (k)}{f_{1} (k)} .Math. \overline{triu} [{Gy}_{p} (k)] + tril [G_{V_{p}} (k)], & (18) \end{matrix}$

[0048] where triu[.] with its overline denotes the strictly upper triangular part of its matrix argument and tril[.] denotes the lower triangular part of its matrix argument. In the first instantiation of the invention, the correction term in the update process is defined as

[00013] $\begin{matrix} U_{p} (k) = {.Math.}_{q = 0}^{L} .Math. (G_{p - q} (k)) .Math. W_{q} (k), .Math. 0 p L & (19) \end{matrix}$

and the final update for the weights become

[00014] $\begin{matrix} W_{p} (k + 1) = (1 +) .Math. c (k) .Math. W_{p} (k) - .Math. \frac{c (k)}{d (k)} .Math. U_{p} (k), & (20) \end{matrix}$

where

[00015] $d (k) = \frac{1}{n} .Math. {.Math.}_{i = 1}^{n} .Math. {.Math.}_{j = 1}^{n} .Math. {.Math.}_{p = 0}^{L} .Math. .Math. g_{ijp} (k) .Math., and .Math. .Math. c (k) = \frac{1}{d (k)} .$

Typically, step sizes in the range 0.350.5 have been chosen and appear to work well. The enhanced signal can be obtained from the output of this system as the first element y.sub.1(l) of the vector y(l)=[y.sub.1(l)y.sub.2(l) . . . y.sub.n(l)].sup.T at time instant l.

[0049] In Table 2 shown in FIG. 4, there is illustrated a pseudo code for the algorithm implementation in MATLAB, a common technical computing environment well-known to those skilled in the art, in which the functions starting with the letter m represent the multichannel extensions of single channel standard functions on sequences.

[0050] In addition, in a further embodiment, the present invention addresses an alternate implementation of the previously-described procedure employing a spatio-temporal whitening system with an Iterative Multichannel Noise Whitening Algorithm.

[0051] In this embodiment, a two stage speech enhancement system is used, in which the first stage acts as a noise-whitening system and the second stage employs a spatio-temporal power method on the noise-whitened signal to produce the enhanced speech. A significant advantage of the present method is its computational simplicity which makes the algorithm viable for applications on many common computing devices such as cellular telephones, personal digital assistants, portable media players, and other computational devices. Since all the processing is performed on the spatio-temporal correlation coefficient sequences, the method avoids large matrix-vector manipulations.

[0052] The first step in the present technique is to whiten the noise component of the observed noisy data. As is common in speech enhancement systems, it is assumed that access to an interval in the noisy speech where the speech is signal is absent is available. Such an interval is often referred to as the silence interval and can be detected by using a speech/silence detector or a voice activity detector (VAD). For purposes of the present invention it is assumed that the speech source is silent for N.sub.v+L+1 sample times from l=N.sub.v(k1)(L/2) to l=N.sub.v(k1)+(L/2). From this noise-only segment, it is possible to compute a whitening filter which is then applied to the rest of the noisy speech in order to whiten the noise component present in it. The present method involves designing a multichannel whitening filter of length L which iteratively whitens the spatio-temporal autocorrelation sequence corresponding to the noise process defined as

[00016] $\begin{matrix} R_{V_{p}} = \frac{1}{N_{v}} .Math. {.Math.}_{l = N_{v} (k - 1) + 1}^{N_{v} .Math. k} .Math. v (l) .Math. v^{T} (l - p), .Math. - \frac{L}{2} p \frac{L}{2}, & (21) \end{matrix}$

where N.sub.v is the number of noise samples used in the computation of the whitening filter. After sufficiently many iterations k, the multichannel FIR filter sequence {W.sub.p(k)} is designed to satisfy the following equation

[00017] $\begin{matrix} {.Math.}_{p = 0}^{L} .Math. {.Math.}_{q = 0}^{L} .Math. W_{p} (k) .Math. R_{V_{q - p}} .Math. W_{q}^{T} (k) = {\begin{matrix} I .Math. .Math. if .Math. .Math. .Math. q - p .Math. = 0 \\ 0 .Math. .Math. otherwise \end{matrix} . & (22) \end{matrix}$

where I is an nn identity matrix. Note that {W.sub.p(k)} is assumed to be zero outside the range 0pL and {Rv.sub.p} is assumed to be zero outside the

[00018] $range - \frac{L}{2} p \frac{L}{2} .$

The filter coefficient sequence {W.sub.p(k)} can be updated in terms of the following multichannel sequences of length L defined as

[00019] $\begin{matrix} {\overline{R_{V}}}_{q} (k) = {\begin{matrix} {.Math.}_{p = 0}^{L} .Math. (R_{V_{q - p}}) .Math. W_{p}^{T} (k) & if .Math. - \frac{L}{2} q \frac{L}{2} \\ 0 & otherwise \end{matrix} . & (23) \\ G_{V_{p}} (k) = {\begin{matrix} {.Math.}_{q = 0}^{L} .Math. W_{q} (k) .Math. {\overline{R_{V}}}_{p - q} (k) & if .Math. .Math. 0 p L \\ 0 & otherwise \end{matrix} . & (24) \\ {\tilde{U}}_{p} (k) = {.Math.}_{q = 0}^{L} .Math. (G_{V_{p - q}} (k)) .Math. W_{q} (k), 0 p L & (25) \end{matrix}$

[0053] and the final update for {W.sub.p} becomes

[00020] $\begin{matrix} W_{p} (k + 1) = (1 +) .Math. c (k) .Math. W_{p} (k) - .Math. \frac{c (k)}{d (k)} .Math. {\tilde{U}}_{p} (k), .Math. 0 p L & (26) \end{matrix}$

where

[00021] $d (k) = \frac{1}{n} .Math. {.Math.}_{i = 1}^{n} .Math. {.Math.}_{j = 1}^{n} .Math. {.Math.}_{p = 0}^{L} .Math. .Math. g_{ijp} (k) .Math., and .Math. .Math. c (k) = \frac{1}{d (k)}$

are the gradient scaling factors [18] chosen to stabilize the algorithm and reduce the sensitivity of the gradient based update on the step size. Typically, step sizes in the range 0.350.5 have been chosen and appear to work well. In the above set of equations, H(.) denotes a form of multichannel weighting on the autocorrelation sequences as described previously. After the filter convergence we obtain the noise-whitened signal as

[00022] $\begin{matrix} {\tilde{y}}_{k} (l) = {.Math.}_{p = 0}^{L} .Math. W_{p} (k) .Math. y (l - p) & (27) \end{matrix}$

[0054] Once the noise-whitened vector signal {tilde over (y)}.sub.k(l) is obtained, the spatio-temporal power method is applied to this vector signal in order to obtain the enhanced speech.

[0055] The present embodiment also includes a spatio-temporal power method which is the second stage in the present technique and involves the design of a multichannel filter {b.sub.p(k)}, where {b.sub.p(k)} is a (1n) vector sequence, which upon convergence yields a single channel signal {circumflex over (x)}(1) which closely resembles the clean speech signal s(l) with some delay D. The output of the multichannel filter {b.sub.p(k)} at time instant k is given as

[00023] $\begin{matrix} {\hat{s}}_{k} (l) = {.Math.}_{p = 0}^{L} .Math. b_{p} (k) .Math. \tilde{y} (l - p) & (28) \end{matrix}$

[0056] As a design criterion for the filter sequence {b.sub.p(k)}, the power of the output signal .sub.k(l), maximized, i.e.,

[00024] $\begin{matrix} maximize .Math. .Math. ({b_{p}}) = \frac{1}{2} .Math. {.Math.}_{k = 1}^{N} .Math. {\hat{s}}_{k}^{2} (l) & (29) \\ such .Math. .Math. that .Math. .Math. {.Math.}_{.Math. p = 0}^{.Math. L} .Math. .Math. b_{p} .Math. b_{p + q}^{T} =_{q}, .Math. - \frac{L}{2} q \frac{L}{2} & (30) \end{matrix}$

[0057] The constraints in (30) correspond to the paraunitary constraints on the filter {b.sub.p(k)}. Note that in the conventional power method, unit-norm constraints are often placed on the filter coefficients; however, as a recent simulation study [20] indicates, the paraunitary constraints have beneficial impact not only on the robustness of the algorithms but also on the quality of the output speech. Our method for solving (29)-(30) employs a gradient ascent procedure in which each matrix tap b.sub.p is replaced by the derivative of J(b.sub.p) with respect to b.sub.p, after which the updated coefficient sequence is adjusted to maintain the paraunitary constraints in (30). It can be shown that

[00025] $\begin{matrix} \frac{({b_{p}})}{b_{p}} = {.Math.}_{q = 0}^{L} .Math. b_{q} .Math. R_{p - q}, & (31) \end{matrix}$

where the multichannel autocorrelation sequence R.sub.p is given by

[00026] $\begin{matrix} R_{p} = \frac{1}{N} .Math. {.Math.}_{l = 1}^{N} .Math. {\tilde{y}}_{k} (l) .Math. {\tilde{y}}_{k}^{T} (l - p), .Math. - \frac{L}{2} p \frac{L}{2} . & (32) \end{matrix}$

[0058] Thus, the first step of our procedure at each iteration sets

[00027] $\begin{matrix} {\tilde{b}}_{p} (k) = {.Math.}_{q = 0}^{L} .Math. b_{q} (k) .Math. R_{p - q}, .Math. 0 p L . & (33) \end{matrix}$

[0059] At this point, the coefficient sequence {{tilde over (b)}.sub.p(k)} needs to be modified to enforce the paraunitary constraints in (30). We modify the coefficient sequence such that

{b.sub.p(k+1)}=A({tilde over (b)}.sub.0(k){tilde over (b)}.sub.1(k), . . . ,{tilde over (b)}.sub.L(k)),0pL(34)

where A is a mapping that forces {b.sub.p(k+1)} to satisfy (30) at each iteration. Such constraints can be enforced at each iteration by normalizing each complex Fourier-transformed filter weight in each filter channel by its magnitude. After sufficiently many iterations of (33)-(34), the signal .sub.k(l) closely resembles the clean speech signal at time instant l. A block diagram of the propose system is shown in FIG. 1, and in Tables 1a and 1b in FIGS. 2 and 3, respectively, pseudo code for the algorithm implementation in MATLAB have been provided. The functions starting with M represent the multichannel extensions of single channel standard functions.

[0060] FIG. 5 illustrates an example of one embodiment of the present invention. In steps 500-504 of FIG. 5 there is illustrated a speech enhancement method. Specifically, in 500 there is shown a step of obtaining a measured speech signal using at least one input microphone. In 501 there is illustrated a step of calculating a whitening filter using a silence interval in the obtained measured speech signal. In 502 there is shown a step of applying the whitening filter to the measured speech signal to generate a whitened speech signal in which noise components present in the measured speech signal are whitened. In 503 there is shown a step of estimating a clean speech signal by applying a multi-channel filter to the generated whitened speech signal. Finally, in 504 there is shown a step of outputting the clean speech signal via an audio device.

[0061] In FIG. 6 there is shown an embodiment of the invention in which a device that performs speech enhancement is shown. In FIG. 6 there is illustrated a first circuit that obtains a measured speech signal using at least one input microphone 600. The first circuit includes, for example, an input unit 610 that functions to convert the measured speech into a form usable by the second and third circuits. In addition, there is shown a second circuit which calculates a whitening filter using a silence interval in the obtained measured speech signal and applies the whitening filter to the measured speech signal to generate a whitened speech signal in which noise components present in the measured speech signal are whitened. The second circuit includes, for example, the iterative noise whitening unit 620 which calculates and uses the whitening filter using the method described above. The iterative noise whitening unit 620 also uses data from the speech/silence detector 650, which determines when no speech is included in the signal. Also illustrated in FIG. 6 is a third circuit that estimates a clean speech signal by applying a multi-channel filter to the generated whitened speech signal, and outputs the clean speech signal to an audio output device 640. The third circuit includes, for example, a Spatio-Temporal Power Unit 630 which applies a multi-channel filter to the speech signal using the method described above and outputs the clean speech signal to the output device 640.

[0062] All embodiments of the present invention conveniently may be implemented using a conventional general-purpose computer, personal media device, cellular telephone, or micro-processor programmed according to the teachings of the present invention, as will be apparent to those skilled in the computer art. The present invention may also be implemented in an attachment that works with other computational devices, such as a personal headset or recording apparatus that transmits or otherwise makes its processed audio signal available to these other computational devices in its operation. Appropriate software may readily be prepared by programmers of ordinary skill based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.

[0063] A computer or other computational device may implement the methods of the present invention, wherein the computer or computational devices housing houses a motherboard which contains a CPU, memory (e.g., DRAM, ROM, EPROM, EEPROM, SRAM, SDRAM, and Flash RAM), and other optional special purpose logic devices (e.g., ASICs) or configurable logic devices (e.g., GAL and reprogrammable FPGA). The computer or computational device also includes plural input devices, (e.g., keyboard and mouse), and a display card for controlling a monitor or other visual display device. Additionally, the computer or computational device may include a floppy disk drive; other removable media devices (e.g. compact disc, tape, electronic flash memory, and removable magneto-optical media); and a hard disk or other fixed high density media drives, connected using an appropriate device bus (e.g., a SCSI bus, an Enhanced IDE bus, an Ultra DMA bus, or another standard communications bus). The computer or computational device may also include an optical disc reader, an optical disc reader/writer unit, or an optical disc jukebox, which may be connected to the same device bus or to another device bus. Computational devices of a similar nature to the above description include, but are not limited to, cellular telephones, personal media devices, or other devices enabled with computational capability using microprocessors or devices with similar numerical computing capability. In addition, devices that interface with such systems can embody the proposed invention through their interaction with the host device.

[0064] Examples of computer readable media associated with the present invention include optical discs, hard disks, floppy disks, tape, magneto-optical disks, PROMs (e.g., EPROM, EEPROM, Flash EPROM), DRAM, SRAM, SDRAM, and so on. Stored on any one or on a combination of these computer readable media, the present invention includes software for controlling both the hardware of the computational device and for enabling the computer to interact with a human user. Such software may include, but is not limited to, device drivers, operating systems and user applications, such as development tools. Computer readable medium may store computer program instructions (e.g., computer code devices) which when executed by a computer causes the computer to perform the method of the present invention. The computer code devices of the present invention may be any interpretable or executable code mechanism, including but not limited to, scripts, interpreters, dynamic link libraries, Java classes, and complete executable programs. Moreover, parts of the processing of the present invention may be distributed (e.g., between (1) multiple CPUs or (2) at least one CPU and at least one configurable logic device) for better performance, reliability, and/or cost.

[0065] The invention may also be implemented by the preparation of application specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.

[0066] Numerous modifications and variations of the present invention are possible in light of the above teachings. Of course, the particular hardware or software implementation of the present invention may be varied while still remaining within the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

SPATIO-TEMPORAL SPEECH ENHANCEMENT TECHNIQUE BASED ON GENERALIZED EIGENVALUE DECOMPOSITION

Assignee

Inventors

Cpc classification

Classification Explorer

G10L2021/02166

PHYSICS

Classification Explorer

G10L21/0216

PHYSICS

Classification Explorer

G10L2021/02168

PHYSICS

Classification Explorer

G10L21/0208

PHYSICS

International classification

Classification Explorer

G10L21/0216

PHYSICS

Abstract

Claims

Description