Speech enhancement for target speakers
09741360 · 2017-08-22
Assignee
Inventors
Cpc classification
G10L21/0308
PHYSICS
International classification
G10L15/02
PHYSICS
G10L21/0308
PHYSICS
G10L15/14
PHYSICS
Abstract
A method of speech enhancement for target speakers is presented. A blind source separation (BSS) module is used to separate a plurality of microphone-recorded audio mixtures into statistically independent audio components. At least one of a plurality of speaker profiles is used to score and weight each audio component, and a speech mixer is used to first mix the weighted audio components, then align the mixed signals, and finally add the aligned signals to generate an extracted speech signal. Similarly, a noise mixer is used to first weight the audio components, then mix the weighted signals, and finally add the mixed signals to generate an extracted noise signal. Post processing is used to further enhance the extracted speech signal with a Wiener filtering or spectral subtraction procedure by subtracting the shaped power spectrum of the extracted noise signal from that of the extracted speech signal.
Claims
1. A method for speech enhancement for at least one of a plurality of target speakers using at least two of a plurality of audio mixtures performing on a digital computer with executable programming code and data memories comprising steps of: separating the at least two of a plurality of audio mixtures into a same number of audio components by using a blind source separation signal processor; weighting and mixing the at least two of a plurality of audio components into an extracted speech signal, wherein a plurality of speech mixing weights are generated by comparing the audio components with target speaker profile(s); weighting and mixing the at least two of a plurality of audio components into an extracted noise signal, wherein a plurality of noise mixing weights are generated by comparing the audio components with at least one of a plurality of noise profiles, or the target speaker profile(s) when no noise profile is provided; and enhancing the extracted speech signal with a Wiener filter by first shaping a power spectrum of said extracted noise signal via matching it to a power spectrum of said extracted speech signal, and then subtracting the shaped extracted noise power spectrum from the power spectrum of said extracted speech signal.
2. The method as claimed in claim 1 further comprising steps of transforming the at least two of a plurality of audio mixtures into a frequency domain representation, and separating the audio mixtures in the frequency domain with a demixing matrix for each frequency bin by an independent vector analysis module or a joint blind source separation module.
3. The method as claimed in claim 1 further comprising steps of generating the extracted speech signal by first weighting the audio components, then mixing the weighted audio components with the inverse of the demixing matrix of each frequency bin, then delaying the weighted and mixed audio components, and lastly summing the delayed, weighted and mixed audio components.
4. The method as claimed in claim 3 further comprising steps of extracting acoustic features from each audio components, providing at least one of a plurality of target speaker profiles parameterized with Gaussian mixture models (GMMs) modeling the probability density function of said acoustic features, calculating a logarithm likelihood for each audio component with the GMMs of speaker profile(s), smoothing the logarithm likelihood using an exponentially weighted moving average model, and mapping each smoothed logarithm likelihood to one of the speech mixing weights with a monotonically increasing function.
5. The method as claimed in claim 3 further comprising steps of estimating and tracking the delays among the weighted and mixed audio components using a generalized cross correlation delay estimator.
6. The method as claimed in claim 1 further comprising steps of generating the extracted noise signal by first weighting the audio components, and then adding the weighted audio components to generate the extracted noise signal.
7. The method as claimed in claim 6, wherein at least one of a plurality of noise profiles are provided, further comprising steps of extracting acoustic features from each audio component, calculating a logarithm likelihood for each audio component with Gaussian Mixture Models (GMMs) of the noise profile(s), smoothing each logarithm likelihood using an exponentially weighted moving average model, and transforming each smoothed logarithm likelihood to one of the noise mixing weights with a monotonically increasing function.
8. The method as claimed in claim 6, wherein no noise profile is provided, further comprising steps of extracting acoustic features from each audio component, calculating a logarithm likelihood for each audio component with Gaussian Mixture Models (GMMs) of speaker profile(s), smoothing the logarithm likelihood using an exponentially weighted moving average model, and transforming each smoothed logarithm likelihood to one of the noise mixing weights with a monotonically decreasing function.
9. The method as claimed in claim 1 further comprising steps of shaping the power spectrum of said extracted noise signal by approximately matching the power spectrum of said extracted noise signal to the power spectrum of said extracted speech signal during a noise dominating period, and enhancing the extracted speech signal with a Wiener filter by subtracting the shaped noise power spectrum from that of the extracted speech spectrum.
10. A system for speech enhancement for at least one of a plurality of target speakers using at least two of a plurality of audio recordings performing on a digital computer with executable programming code and data memories comprising: a blind source separation (BSS) module separating at least two of a plurality of audio mixtures into a same number of audio components in a frequency domain with a demixing matrix for each frequency bin; a speech mixer connecting to the BSS module and mixing the audio components into an extracted speech by weighting each audio component according to its relevance to target speaker profile(s), and mixing correspondingly weighted audio components; a noise mixer connecting to the BSS module and mixing the audio components into an extracted noise signal by weighting each audio component according to its relevance to noise profiles, and mixing correspondingly weighted audio components; a post processing module connecting to the speech and noise mixers and suppressing residual noise in said extracted speech signal using a Wiener filter with the extracted noise signal as a noise reference signal.
11. The system as claimed in claim 10, wherein the speech mixer comprises a speech mixer weight generator generating mixing weight for each audio component, a matrix mixer mixing the weighted audio component using an inverse of demixing matrix for each frequency bin, and a delay estimator estimating delays among the weighted and mixed audio components using a generalized cross correlation signal processor, and a delay-and-sum mixer aligning the weighted and mixed audio components and adding them to generate the extracted speech signal.
12. The system as claimed in claim 10, wherein the speech mixer further comprises an acoustic feature extractor extracting acoustic features from each audio component, a unit for calculating a logarithm likelihood of each audio component with at least one of a plurality of provided speaker profiles represented as parameters of Gaussian Mixture Models (GMMS) modelling the probability density function of said acoustic features, a unit for smoothing the logarithm likelihood using a weighted exponentially average model, and a unit transforming each smoothed logarithm likelihood to a speech mixing weight with a monotonically increasing mapping.
13. The system as claimed in claim 10, wherein the noise mixer further comprises a noise mixer weight generator generating a noise mixing weight for each audio component, and a weight-and-sum mixer weighting the audio components with the noise mixing weight and adding the weighted audio components to generate the extracted noise signal.
14. The system as claimed in claim 13, wherein the noise mixer comprises an acoustic feature extractor extracting acoustic features from each audio component, a unit for calculating a logarithm likelihood of each audio component, a unit for smoothing each logarithm likelihood using a weighted exponentially average model, and a unit for transforming each logarithm likelihood to the noise mixing weight with a monotonically increasing or decreasing function.
15. The system as claimed in claim 14, wherein at least one of a plurality of noise profiles are provided and are used to calculate the logarithm likelihood, and a monotonically increasing mapping is used to transform the smoothed logarithm likelihood to the noise mixing weight.
16. The system as claimed in claim 14, wherein no noise profile is provided, the target speaker profiles are used to calculate the logarithm likelihood, and a monotonically decreasing mapping is used to transform the smoothed logarithm likelihood to the noise mixing weight.
17. The system as claimed in claim 10, wherein the post processor comprises a module matching a power spectrum of said extracted noise signal to a power spectrum of the extracted speech signal during a noise dominating period, and the Wiener filter subtracts the matched noise power spectrum from that of the extracted speech signal to generate the enhanced speech signal spectrum.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION OF THE EMBODIMENTS
(8) Overview of the Present Invention
(9) The present invention describes a speech enhancement method for at least one of a plurality of target speakers. At least two of a plurality of microphones are used to capture audio mixtures. A blind source separation (BSS) algorithm, or an independent component analysis (ICA) algorithm, is used to separate these audio mixtures into approximately statistically independent audio components. For each audio component, at least one of a plurality of predefined target speaker profiles is used to evaluate a probability or a likelihood suggesting that the selected audio component belongs to the considered target speakers. All audio components are weighted according to the above mentioned likelihoods and mixed together to generate a single extracted speech signal that best matches the target speaker models. In a similar way, for each audio component, at least one of a plurality of noise models, or the target speaker models in the absence of noise models, are used to evaluate a probability or a likelihood suggesting that the considered audio component is noise or does not contain any speech signal from target speakers. All audio components are weighted according to the above mentioned likelihoods and mixed to generate a single extracted noise signal. Using the extracted noise signal, a Wiener filtering or a spectral subtraction is used to further suppress the residual noise and interferences in the extracted speech signal.
(10)
(11)
(12)
(13) Blind Source Separation
(14)
(15) In
(16) In general, a plurality of analysis filter banks transform a plurality of time domain audio mixtures into a plurality of frequency domain audio mixtures, which can be written as:
$x(n,t) \rightarrow X(n,k,m)$, (Equation 1)
(17) where x(n, t) is the time domain signal of the n-th audio mixture at discrete time t, and X(n, k, m) is the frequency domain signal of the n-th audio mixture, the k-th frequency bin, and the m-th frame or block. For each frequency bin, a vector is formed as X(k, m)=[X(1, k, m), X(2, k, m), . . . , X(N, k, m)], and for the m-th block, a separation matrix W(k, m) is solved to separate these audio mixtures into audio components as
$[Y(1,k,m), Y(2,k,m), \ldots, Y(N,k,m)] = W(k,m)\,X(k,m)$, (Equation 2)
(18) where N is the number of audio mixtures. A stochastic gradient descent algorithm with a small enough step size is used to solve for W(k, m). Hence, W(k, m) evolves slowly with respect to its frame index m. Forming a frequency source vector as Y(n, m)=[Y(n, 1, m), Y(n, 2, m), . . . , Y(n, K, m)], the well-known frequency permutation problem is solved by exploiting the statistical independence among different source vectors and the statistical dependence among the components of the same source vector, hence the name independent vector analysis (IVA). Scaling ambiguity is another well-known issue of a BSS implementation. One convention to remove this ambiguity is to scale the separation matrix in each bin such that all its diagonal elements have unit amplitude and zero phase.
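As an illustration only, the per-bin separation of Equation 2 and the scaling convention described above can be sketched in NumPy. The function names, the array layout (bins first), and the choice to rescale each row of the separation matrix by its own diagonal element are assumptions made for this sketch; the stochastic gradient update that actually solves for W(k, m) is not shown.

```python
import numpy as np

def normalize_demixing(W):
    """Rescale each row of every per-bin matrix so its diagonal entry becomes 1.

    W has shape (K, N, N): one N x N separation matrix per frequency bin.
    Row-wise scaling is an assumption of this sketch; it makes every diagonal
    element have unit amplitude and zero phase.
    """
    diag = np.einsum("kii->ki", W)      # (K, N) diagonal of each bin's matrix
    return W / diag[:, :, None]         # divide row i of bin k by W[k, i, i]

def separate(W, X):
    """Apply Equation 2 in every bin: Y(k) = W(k) X(k).

    X has shape (K, N): N mixtures for each of K bins in one frame.
    Returns Y with the same shape.
    """
    return np.einsum("kij,kj->ki", W, X)
```

Dividing by the diagonal assumes no diagonal element is zero, which a practical implementation would need to guard against.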
(19) Speech Mixer
(20)
(21) In
(22) A speaker profile can be a parametric model depicting the probability density function (pdf) of acoustic features extracted from the speech signal of a given speaker. Commonly used acoustic features are linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP) cepstral coefficients, and Mel-frequency cepstral coefficients (MFCC). PLP cepstral coefficients and MFCC can be directly derived from a frequency domain signal representation, and thus they are preferred choices when a frequency domain BSS is used.
(23) For each source component Y(n, m), a feature vector, say f(n, m), is extracted, and compared against one or multiple speaker profiles to generate a non-negative score, say s(n, m). A higher score suggests a better match between feature f(n, m) and the considered speaker profile(s). As a common practice in speaker recognition, the feature vector here may contain information from the current frame and previous frames. One common set of features is the MFCC, delta-MFCC and delta-delta-MFCC.
(24) The Gaussian mixture model (GMM) is a widely used finite parametric mixture model for speaker recognition, and it can be used to evaluate the required score s(n, m). A universal background model (UBM) is created to depict the pdf of acoustic features from a target population. The target speaker profiles are modeled by the same GMM, but with their parameters adapted from the UBM. Typically, only the means of the Gaussian components in the UBM are allowed to be adapted. In this way, the speaker profiles in the database 504 comprise two sets of parameters: one set of parameters for the UBM containing the means, covariance matrices and component weights of Gaussian components in the UBM, and another set of parameters for the speaker profiles only containing the adapted means of GMMs.
(25) With speaker profiles and the UBM, a logarithm likelihood ratio (LLR),
$r(n,m) = \log p[f(n,m) \mid \text{speaker profiles}] - \log p[f(n,m) \mid \text{UBM}]$, (Equation 3)
(26) is calculated. When multiple speaker profiles are used, the likelihood p[f(n, m)|speaker profiles] should be understood as the sum of the likelihoods of f(n, m) under each speaker profile. This LLR is noisy, and an exponentially weighted moving average is used to calculate a smoother LLR as
$r_s(n,m) = a\,r_s(n,m-1) + (1-a)\,r(n,m)$, (Equation 4)
(27) where 0<a<1 is a forgetting factor.
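The LLR and its smoothing can be illustrated with a minimal NumPy sketch. It assumes diagonal covariance matrices, a single speaker profile that shares the UBM's component weights and variances with only the means adapted, and it reads the moving average as a recursion on the previous frame's smoothed value. All function and parameter names are hypothetical.

```python
import numpy as np

def gmm_loglik(f, weights, means, variances):
    """Log-likelihood of feature vector f under a diagonal-covariance GMM.

    f: (D,), weights: (C,), means: (C, D), variances: (C, D).
    Diagonal covariances are an assumption of this sketch.
    """
    diff = f - means
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=1))
    peak = np.max(log_comp)                      # log-sum-exp for stability
    return peak + np.log(np.sum(np.exp(log_comp - peak)))

def llr(f, weights, variances, ubm_means, speaker_means):
    """Equation 3 for one speaker profile with means adapted from the UBM."""
    return (gmm_loglik(f, weights, speaker_means, variances)
            - gmm_loglik(f, weights, ubm_means, variances))

def smooth_llr(prev_smoothed, r, a=0.9):
    """Exponentially weighted moving average with forgetting factor 0 < a < 1."""
    return a * prev_smoothed + (1.0 - a) * r
```

A larger forgetting factor a gives a steadier but slower-reacting LLR; the value 0.9 is only a placeholder.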
(28) A monotonically increasing mapping, e.g. an exponential function, is used to map a smoothed LLR to a non-negative score s(n, m). Then for each source component, a speech mixing weight is generated as a normalized score as
$g(n,m) = s(n,m) / [s(1,m) + s(2,m) + \ldots + s(N,m) + s_0]$, (Equation 5)
(29) where $s_0$ is a proper positive offset such that g(n, m) approaches zero when all the scores are small enough to be negligible, and approaches one when s(n, m) is large enough. In this way, the speech mixing weight for an audio component is positively correlated with the amount of desired speech signal it contains.
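The mapping from smoothed LLRs to speech mixing weights in Equation 5 can be sketched as follows, taking the exponential function as the monotonically increasing mapping. The function name and the default offset value are assumptions for illustration.

```python
import numpy as np

def speech_mixing_weights(smoothed_llr, s0=1.0):
    """Map smoothed LLRs of N components to weights g(n, m) as in Equation 5.

    smoothed_llr: (N,) smoothed LLR per audio component for one frame.
    s0: positive offset; the default of 1.0 is a placeholder.
    """
    scores = np.exp(np.asarray(smoothed_llr, dtype=float))  # non-negative s(n, m)
    return scores / (np.sum(scores) + s0)
```

With this choice, a frame in which every component has a strongly negative LLR yields weights near zero, while a single strongly positive LLR pushes that component's weight toward one.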
(30) In the matrix mixer 516, the weighted audio components are mixed to generate N mixtures as
$[Z(1,k,m), Z(2,k,m), \ldots, Z(N,k,m)] = W^{-1}(k,m)\,[g(1,m)Y(1,k,m), g(2,m)Y(2,k,m), \ldots, g(N,m)Y(N,k,m)]$, (Equation 6)
(31) where $W^{-1}(k, m)$ is the inverse of W(k, m).
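A minimal sketch of the matrix mixer of Equation 6 is given below. It solves a linear system in each bin rather than forming the inverse explicitly, which is a numerical choice of this sketch and not something the description prescribes. The function name and array layout are hypothetical.

```python
import numpy as np

def matrix_mix(W, Y, g):
    """Equation 6: Z(k) = W(k)^{-1} [g(1) Y(1,k), ..., g(N) Y(N,k)] per bin.

    W: (K, N, N) separation matrices, Y: (K, N) separated components,
    g: (N,) speech mixing weights for the current frame.
    Returns Z with shape (K, N).
    """
    weighted = Y * g[None, :]                       # weight each component
    # Solve W(k) z = weighted(k) in every bin; the trailing axis makes the
    # right-hand side an explicit column vector for the batched solver.
    return np.linalg.solve(W, weighted[..., None])[..., 0]
```

When all weights equal one, the mixer simply undoes the separation and returns the original frequency domain mixtures, which is a convenient sanity check.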
(32) Finally, a delay-and-sum procedure is used to combine mixtures Z(n, k, m) into the single extracted speech signal 214, 314. Since Z(n, k, m) is a frequency domain signal, the generalized cross correlation (GCC) method is a convenient choice for delay estimation. A GCC method calculates the weighted cross correlation between two signals in the frequency domain, and searches for the delay in the time domain by converting frequency domain cross correlation coefficients into time domain cross correlation coefficients using the inverse DFT. Phase transform (PHAT) is a popular choice of GCC implementation which only keeps the phase information for time domain cross correlation calculation. In the frequency domain, a delay operation corresponds to a phase shift. Hence the extracted speech signal can be written as
$T(k,m) = \exp(j w_k d_1)\,Z(1,k,m) + \exp(j w_k d_2)\,Z(2,k,m) + \ldots + \exp(j w_k d_N)\,Z(N,k,m)$, (Equation 7)
(33) where j is the imaginary unit, $w_k$ is the radian frequency of the k-th frequency bin, and $d_n$ is the delay compensation of the n-th mixture. Note that only the relative delays among mixtures can be uniquely determined, and the mean delay can be an arbitrary value. One convention is to assume $d_1 + d_2 + \ldots + d_N = 0$ to uniquely determine a set of delays.
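The delay estimation and the delay-and-sum combination of Equation 7 can be sketched for a single frame as follows. The sketch assumes one-sided spectra of real signals (as produced by a real FFT of length frame_len), integer-sample delays, and a radian frequency of 2πk/frame_len per sample for bin k. Tracking the delays over time is not shown, and the function names are hypothetical.

```python
import numpy as np

def gcc_phat_delay(Z_ref, Z_other, frame_len):
    """Estimate how many samples Z_other lags Z_ref using GCC with PHAT weighting.

    Z_ref, Z_other: one-sided spectra of length frame_len // 2 + 1.
    """
    cross = Z_other * np.conj(Z_ref)
    cross = cross / np.maximum(np.abs(cross), 1e-12)   # PHAT: keep phase only
    cc = np.fft.irfft(cross, n=frame_len)              # time domain correlation
    peak = int(np.argmax(cc))
    # Indices past the midpoint correspond to negative lags.
    return peak if peak <= frame_len // 2 else peak - frame_len

def delay_and_sum(Z, delays, frame_len):
    """Equation 7 with the zero-mean delay convention.

    Z: (N, K) one-sided spectra of the N mixtures, delays: (N,) in samples.
    """
    d = np.asarray(delays, dtype=float)
    d = d - d.mean()                                   # d_1 + ... + d_N = 0
    w = 2.0 * np.pi * np.arange(Z.shape[1]) / frame_len
    return np.sum(np.exp(1j * w[None, :] * d[:, None]) * Z, axis=0)
```

For example, the delays passed to delay_and_sum could be obtained by calling gcc_phat_delay on each mixture against the first one, giving a delay of zero for the reference itself.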
(34) The weighting and mixing procedure here can better keep the desired speech signal than a hard switching method. For example, considering a transient stage where the desired speaker is active and the BSS has not converged yet, the target speech signal is scattered in the audio components. A hard switching procedure inevitably distorts the desired speech signals by only selecting one audio component as the output. The present method as described combines all these audio components with weights positively correlated with the amount of desired speech signals in each audio component, and hence can well preserve the target speech signals.
(35) Noise Mixer
(36)
(37) When N microphones are adopted, and thus N source components are extracted, the noise mixer weight generator generates N weights, h(1, m), h(2, m), . . . , h(N, m). Simple weighting and additive mixing generates extracted noise signal E(k, m) as
$E(k,m) = h(1,m)Y(1,k,m) + h(2,m)Y(2,k,m) + \ldots + h(N,m)Y(N,k,m)$. (Equation 8)
(38) When a noise GMM is available, the same method used for speech mixer weight generation can be used to calculate the noise mixer weights by replacing the speaker profile GMM with the noise profile GMM. When a noise GMM is unavailable, a convenient choice is to use the negated LLR of (Equation 3) as the LLR of noise, and then follow the same procedure used for speech mixer weight generation to calculate the noise mixer weights.
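For the case where no noise GMM is available, the noise mixer can be sketched as below: the smoothed speaker LLR is negated, passed through an exponential mapping and the same normalization used for the speech weights, and the components are then weighted and summed as in Equation 8. The function names and the default offset are assumptions of this sketch.

```python
import numpy as np

def noise_mixing_weights(smoothed_speaker_llr, s0=1.0):
    """Noise weights h(n, m) from the negated speaker LLR.

    Because the moving average is linear, negating the smoothed LLR is
    equivalent to smoothing the negated LLR.
    """
    scores = np.exp(-np.asarray(smoothed_speaker_llr, dtype=float))
    return scores / (np.sum(scores) + s0)

def mix_noise(h, Y):
    """Equation 8: E(k) = sum over n of h(n) * Y(n, k).

    h: (N,) noise mixing weights, Y: (N, K) separated components.
    """
    return np.sum(h[:, None] * Y, axis=0)
```

Negating the LLR before an increasing mapping is the same as applying a monotonically decreasing mapping to the speaker LLR, so components that look least like the target speaker receive the largest noise weights.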
(39) Post Processing
(40)
(41) A simple method to shape the noise spectrum is to apply a positive gain to the power spectrum of the extracted noise signal as $b(k,m)\,|E(k,m)|^2$. The equalization coefficient b(k, m) can be estimated by matching the amplitudes of $b(k,m)\,|E(k,m)|^2$ and $|T(k,m)|^2$ during the periods in which the desired speakers are inactive. For each bin, the equalization coefficient should be close to a constant in a static or slowly time varying acoustic environment. Hence, an exponentially weighted moving averaging method can be used to estimate the equalization coefficients.
(42) Another simple method for determination of the equalization coefficient of a frequency bin is simply to assign a constant to it. This simple method is preferred if no aggressive noise suppression is required.
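One possible way to estimate the equalization coefficient with an exponentially weighted moving average is sketched below. The per-bin power ratio used as the instantaneous estimate, the externally supplied speaker activity flag, and the parameter names are assumptions of this sketch; the description does not specify how speaker inactivity is detected.

```python
import numpy as np

def update_equalizer(b_prev, T, E, speaker_inactive, alpha=0.95, eps=1e-12):
    """Update b(k, m) only while the desired speakers are inactive.

    b_prev: (K,) previous coefficients, T: (K,) extracted speech spectrum,
    E: (K,) extracted noise spectrum, alpha: forgetting factor (placeholder).
    """
    if not speaker_inactive:
        return b_prev                                   # hold during speech
    ratio = np.abs(T) ** 2 / np.maximum(np.abs(E) ** 2, eps)
    return alpha * b_prev + (1.0 - alpha) * ratio
```

Holding the coefficient while the target speaker is active keeps the speech power itself from inflating the noise estimate.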
(43) The enhanced speech signal 220, 320 is given by c(k, m) T(k, m), where c(k, m) is a non-negative gain determined by the Wiener filtering or spectral subtraction. A simple spectral subtraction determines this gain as
$c(k,m) = \max[1 - b(k,m)\,|E(k,m)|^2 / |T(k,m)|^2,\ 0]$. (Equation 9)
(44) This simple method may be adequate for certain applications, such as voice recognition, but may not be sufficient for others because it introduces a watery-sounding artifact, often called musical noise. A Wiener filter using the decision-directed approach can smooth out these gain fluctuations and suppress the artifact to an inaudible level.
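The simple spectral subtraction gain of Equation 9 can be sketched in a few lines. The small floor on the speech power that avoids division by zero is an addition of this sketch, and the decision-directed Wiener filter mentioned above is not shown.

```python
import numpy as np

def spectral_subtraction_gain(T, E, b, eps=1e-12):
    """Equation 9: c(k) = max(1 - b(k) |E(k)|^2 / |T(k)|^2, 0).

    T: (K,) extracted speech spectrum, E: (K,) extracted noise spectrum,
    b: scalar or (K,) equalization coefficients.
    """
    noise_power = b * np.abs(E) ** 2
    speech_power = np.maximum(np.abs(T) ** 2, eps)
    return np.maximum(1.0 - noise_power / speech_power, 0.0)

def enhance(T, E, b):
    """Apply the gain to the extracted speech spectrum: c(k) T(k)."""
    return spectral_subtraction_gain(T, E, b) * T
```

Bins in which the shaped noise power exceeds the speech power are set to zero, which is the frame-to-frame on/off behaviour that produces the audible artifact.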
(45) It is to be understood that the above described embodiments are merely illustrative of numerous and varied other embodiments which may constitute applications of the principles of the invention. Such other embodiments may be readily devised by those skilled in the art without departing from the spirit or scope of this invention and it is our intent they be deemed within the scope of our invention.