System and method for acoustic echo cancelation using deep multitask recurrent neural networks
11521634 · 2022-12-06
Cpc classification
H04M9/08
ELECTRICITY
Abstract
A method for performing echo cancellation includes: receiving a far-end signal from a far-end device at a near-end device; recording a microphone signal at the near-end device including: a near-end signal; and an echo signal corresponding to the far-end signal; extracting far-end features from the far-end signal; extracting microphone features from the microphone signal; computing estimated near-end features by supplying the microphone features and the far-end features to an acoustic echo cancellation module including: an echo estimator including a first stack of a recurrent neural network configured to compute estimated echo features based on the far-end features; and a near-end estimator including a second stack of the recurrent neural network configured to compute the estimated near-end features based on an output of the first stack and the microphone signal; computing an estimated near-end signal from the estimated near-end features; and transmitting the estimated near-end signal to the far-end device.
Claims
1. A method for performing echo cancellation comprising: receiving, by a plurality of input layer gated recurrent units (GRUs) of a recurrent neural network, a plurality of far-end features; computing a plurality of estimated near-end features using the recurrent neural network based on a plurality of microphone features extracted from a microphone signal and the plurality of far-end features; and computing an estimated near-end signal using the plurality of estimated near-end features.
2. The method of claim 1, wherein the estimated near-end signal is computed using a plurality of estimated echo features.
3. The method of claim 2, wherein the plurality of estimated echo features are computed using the recurrent neural network.
4. The method of claim 3, wherein the plurality of far-end features are extracted from a far-end signal received, from a far-end device, at a near-end device.
5. The method of claim 4, wherein the microphone signal is recorded at a microphone of the near-end device.
6. The method of claim 5, wherein the microphone features comprise a current frame of microphone features and a causal window of a plurality of previous frames of microphone features.
7. The method of claim 1, wherein the far-end features comprise a current frame of far-end features and a causal window of a plurality of previous frames of far-end features.
8. The method of claim 2, wherein the estimated echo features comprise a current frame of estimated echo features and a causal window of a plurality of previous frames of estimated echo features, and wherein the estimated near-end features comprise a current frame of estimated near-end features and the causal window of a plurality of previous frames of estimated near-end features.
9. The method of claim 1, wherein the estimated near-end features comprise log short time Fourier transform features in logarithmic spectral space.
10. A communication device configured to perform echo cancellation, the communication device comprising: a modem; a microphone; a processor; and memory storing instructions that, when executed by the processor, cause the processor to: receive, by a plurality of input layer gated recurrent units (GRUs) of a recurrent neural network, a plurality of far-end features; compute a plurality of estimated near-end features using the recurrent neural network based on a plurality of microphone features extracted from a microphone signal and the plurality of far-end features; and compute an estimated near-end signal using the plurality of estimated near-end features.
11. The communication device of claim 10, wherein the estimated near-end signal is computed using a plurality of estimated echo features.
12. The communication device of claim 11, wherein the plurality of estimated echo features are computed using the recurrent neural network.
13. The communication device of claim 12, wherein the plurality of far-end features are extracted from a far-end signal received, from a far-end device, at the communication device via the modem.
14. The communication device of claim 13, wherein the microphone signal is recorded at the microphone of the communication device.
15. The communication device of claim 14, wherein the microphone features comprise a current frame of microphone features and a causal window of a plurality of previous frames of microphone features.
16. The communication device of claim 10, wherein the far-end features comprise a current frame of far-end features and a causal window of a plurality of previous frames of far-end features.
17. The communication device of claim 11, wherein the estimated echo features comprise a current frame of estimated echo features and a causal window of a plurality of previous frames of estimated echo features, and wherein the estimated near-end features comprise a current frame of estimated near-end features and the causal window of a plurality of previous frames of estimated near-end features.
18. The communication device of claim 10, wherein the estimated near-end features comprise log short time Fourier transform features in logarithmic spectral space.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The accompanying drawings, together with the specification, illustrate exemplary embodiments of the present disclosure, and, together with the description, serve to explain the principles of the present disclosure.
DETAILED DESCRIPTION
(17) In the following detailed description, only certain exemplary embodiments of the present disclosure are shown and described, by way of illustration. As those skilled in the art would recognize, the disclosure may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. In the figures and the below discussion, like reference numerals refer to like components.
(18)
(19) For the sake of clarity, as used herein, given v(t) as an arbitrary time-domain signal at time t: the short-time Fourier transform (STFT) complex-valued spectrum of v(t) at frame k and frequency bin f is denoted by V.sub.k,f; its phase is denoted by ∠V.sub.k,f; and its logarithmic magnitude is denoted by {tilde over (V)}.sub.k,f. {tilde over (V)}.sub.k represents the vector of logarithmic magnitudes at all frequency bins f and frame k.
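(20) The microphone signal d(t) recorded at the near-end device is modeled as the sum of the near-end signal s(t) and the acoustic echo signal y(t):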
d(t)=s(t)+y(t)
In some embodiments, the microphone signal d(t) also includes other components such as additive noise n(t) (e.g., d(t)=s(t)+y(t)+n(t)). The acoustic echo signal y(t) is a modified version of far-end speech signal x(t) and includes room impulse response (RIR) and loudspeaker distortion, both of which may cause nonlinearities in the relationship between x(t) and y(t).
(21) Broadly, the acoustic echo cancellation (AEC) problem is to retrieve the clean near-end signal s(t) after removing acoustic echoes due to detection of the far-end signal x(t) by the near-field microphone 14. Comparative systems, as shown in
(22) Aspects of embodiments of the present disclosure relate to recurrent neural network (RNN) architectures for acoustic echo cancellation (AEC). Some embodiments relate to the use of deep gated recurrent unit (GRU) networks (see, e.g., K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Proc. Empirical Methods in Natural Language Processing, 2014, pp. 1724-1734, and J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” in Proc. NIPS Deep Learning Workshop, 2014) in an encoder-decoder architecture to map the spectral features of the microphone signals d(t) and far-end signals x(t) to a hyperspace (e.g., a feature space such as logarithmic spectral space), and then decode the target spectral features of the near-end signal s(t) from the encoded hyperspace. In some embodiments, the RNN acoustic echo cancellation module is trained using multitask learning to learn an auxiliary task of estimating the echo signal y(t) in order to improve the main task of estimating the clean near-end speech signal s(t) as estimated near-end signal q(t). As discussed in more detail below, experimental results show that embodiments of the present disclosure cancel acoustic echo in both single-talk and double-talk periods with nonlinear distortions without requiring a separate double-talk detector.
(23)
(24) In the embodiment shown in
(25) For the sake of convenience, aspects of embodiments of the present disclosure will be described herein where the spectral feature vectors are computed using a 512-point short time Fourier transform (STFT) with a frame shift of 256 points (given the 16 kHz sampling rate, each frame corresponds to 32 milliseconds with a 16 millisecond shift between frames, resulting in 16 milliseconds of overlap between frames). In some embodiments, the absolute value module 214 reduces the 512-point STFT magnitude vector to 257 points by removing the conjugate symmetric half. In some embodiments, the features (e.g., the microphone signal features {tilde over (D)}.sub.k,f) are standardized to have zero mean and unit variance using the scalars calculated from the training data, as discussed in more detail below. As would be understood by one of skill in the art, the spectral feature vectors may be computed with more than 512 points or fewer than 512 points and with longer or shorter frame shifts (e.g., more overlap or less overlap between frames).
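As a non-limiting illustration, the feature extraction described above may be sketched in Python with NumPy as follows. The 512-point transform, the 256-sample shift, and the 257 retained bins follow the text; the periodic Hann window, the small constant eps added before the logarithm, and the function names log_magnitude_features and standardize are assumptions made for this sketch only.

```python
import numpy as np

N_FFT = 512  # STFT size from the text
HOP = 256    # frame shift from the text (50% overlap)

def log_magnitude_features(signal, eps=1e-8):
    """Return (log-magnitude, phase), each of shape (frames, 257)."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < N_FFT:
        signal = np.pad(signal, (0, N_FFT - len(signal)))
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    # Assumed analysis window: periodic Hann (not specified in the text).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)
    frames = np.stack([signal[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    # rfft keeps bins 0..256, i.e. it drops the conjugate-symmetric half.
    spectrum = np.fft.rfft(frames, n=N_FFT, axis=1)
    log_mag = np.log(np.abs(spectrum) + eps)  # eps avoids log(0); assumed
    return log_mag, np.angle(spectrum)

def standardize(features, mean, std):
    """Zero-mean, unit-variance scaling with statistics from training data."""
    return (features - mean) / std
```

In such a sketch, mean and std would be computed once per frequency bin over the training set and reused unchanged at inference time.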
(26) In a manner similar to that of comparative systems as discussed above with respect to
(27) In some embodiments of the present disclosure, a near-end estimator 250 accepts the microphone signal features {tilde over (D)}.sub.k,f, the far-end signal features {tilde over (X)}.sub.k,f, and the estimated echo features {tilde over (V)}.sub.k,f (or another output of the echo estimator 230) to compute estimated near-end speech features {tilde over (Q)}.sub.k,f. The estimated near-end speech features {tilde over (Q)}.sub.k,f may then be supplied to a feature inversion module or signal synthesis module 270, which may include an exponential operation module 272 (to invert the logarithmic operation applied to the input signals) and an inverse short time Fourier transform (iSTFT) module 274 to transform the estimated near-end speech features {tilde over (Q)}.sub.k,f from the feature space or hyperspace to a time domain signal q(t), which is an estimate of the near-end speech or near-end signal s(t).
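As a non-limiting illustration, the feature inversion described above (an exponential followed by an inverse STFT) may be sketched as follows. The sketch assumes that a phase array is available for each frame (for example, the phase of the microphone signal, which is an assumption of this sketch), that the analysis window sums to one at 50% overlap (e.g., a periodic Hann window), and that plain overlap-add is used for synthesis; the function name invert_features is hypothetical.

```python
import numpy as np

N_FFT = 512  # STFT size from the text
HOP = 256    # frame shift from the text

def invert_features(log_mag, phase):
    """Map (frames, 257) log-magnitudes plus phase back to a waveform."""
    # Exponential undoes the logarithm applied during feature extraction.
    spectrum = np.exp(log_mag) * np.exp(1j * phase)
    # Inverse real FFT restores one 512-sample frame per row.
    frames = np.fft.irfft(spectrum, n=N_FFT, axis=1)
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):
        # Overlap-add: consecutive frames overlap by N_FFT - HOP samples.
        out[i * HOP:i * HOP + N_FFT] += frame
    return out
```

Only the samples covered by two overlapping frames are reconstructed exactly under these assumptions; the first and last half-frames remain attenuated by the analysis window.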
(28) In various speech processing applications, using past and/or future frames of data can help in computing estimated characteristics of the current frame. In some of such speech processing applications, a fixed context window is used as the input to a fully-connected first layer of a deep neural network. In these comparative methods, the contextual information can be lost after this first layer as the information flows through deeper layers.
(29) Accordingly, some aspects of embodiments of the present disclosure use the context features for both inputs and outputs of the neural network in order to keep the contextual information available throughout the neural network. According to some embodiments, the input features for a current frame include the feature vector {tilde over (X)}.sub.k of current frame k and feature vectors ({tilde over (X)}.sub.k-1, {tilde over (X)}.sub.k-2, . . . , {tilde over (X)}.sub.k-6) of six previous frames or causal frames (k−1, k−2, . . . , k−6). According to some embodiments of the present disclosure, causal windows (using only data from previous frames, as opposed to future frames) are chosen to prevent extra latency (e.g., when using causal windows of frames there is no need to wait for the arrival of future frames k+1, k+2, . . . before processing a current frame k). The seven frames with 50% overlap of the embodiment discussed above create a receptive field of 112 ms, which is generally long enough for processing the speech signal. To incorporate context awareness, some aspects of embodiments of the present disclosure relate to the use of unrolled deep gated recurrent unit (GRU) networks with seven time-steps (or frames) for both the echo estimation module and the near-end estimation module. However, embodiments of the present disclosure are not limited thereto and may be implemented with more than six prior frames of data or fewer than six prior frames of data.
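As a non-limiting illustration, the causal window of a current frame and six previous frames may be assembled as sketched below. The function name causal_context is hypothetical, and repeating the first frame to fill the window at the start of the signal is an assumption; the text does not state how the first frames are handled.

```python
import numpy as np

def causal_context(features, n_prev=6):
    """Stack each frame with its n_prev predecessors.

    features: (frames, bins) -> (frames, n_prev + 1, bins), ordered oldest
    to newest, so the last time-step of each window is the current frame k.
    """
    features = np.asarray(features)
    n_frames = features.shape[0]
    # Assumption: frames before the start of the signal are filled by
    # repeating the first frame.
    padded = np.concatenate([np.repeat(features[:1], n_prev, axis=0), features])
    # Window k covers padded[k : k + n_prev + 1], i.e. frames k-6 .. k.
    return np.stack([padded[k:k + n_prev + 1] for k in range(n_frames)])
```

Because every window ends at frame k, no future frame is needed before frame k can be processed, which matches the latency argument above.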
(30)
(31) According to some embodiments of the present disclosure, each GRU computes its output activation in accordance with:
h.sub.k=(1−z.sub.k)⊙h.sub.k-1+z.sub.k⊙ĥ.sub.k
where ⊙ is an element-wise multiplication, and the update gates z.sub.k are:
z.sub.k=σ(W.sub.z{tilde over (X)}.sub.k+U.sub.zh.sub.k-1)
where σ is a sigmoid function. The candidate hidden state ĥ.sub.k is computed by
ĥ.sub.k=elu(W{tilde over (X)}.sub.k+U(r.sub.k⊙h.sub.k-1))
where elu is the exponential linear unit function, and the reset gates r.sub.k are computed by
r.sub.k=σ(W.sub.r{tilde over (X)}.sub.k+U.sub.rh.sub.k-1)
where U, W, U.sub.r, W.sub.r, U.sub.z, and W.sub.z are the internal weight matrices of the GRUs. In some embodiments, each of the GRUs in a given layer (e.g., each of the GRUs in layer 232) uses the same set of weights (hence the “recurrent” nature of the neural network). In some embodiments, the values of the internal weight matrices are learned through a training process, described in more detail below.
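As a non-limiting illustration, the four GRU equations above may be sketched for a single unit as follows. The dictionary keys, the function names, the zero initial hidden state, and the absence of bias terms (none appear in the equations above) are choices made for this sketch only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def elu(a):
    # np.minimum keeps exp() from overflowing for large positive inputs.
    return np.where(a > 0, a, np.exp(np.minimum(a, 0.0)) - 1.0)

def gru_step(x_k, h_prev, p):
    """One GRU update; p holds the matrices W, U, W_r, U_r, W_z, U_z."""
    z = sigmoid(p["W_z"] @ x_k + p["U_z"] @ h_prev)    # update gate z_k
    r = sigmoid(p["W_r"] @ x_k + p["U_r"] @ h_prev)    # reset gate r_k
    h_hat = elu(p["W"] @ x_k + p["U"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_hat              # output activation h_k

def run_layer(frames, params, hidden_size):
    """Unroll one layer over a window of frames with shared weights."""
    h = np.zeros(hidden_size)  # assumed zero initial state
    outputs = []
    for x_k in frames:
        h = gru_step(x_k, h, params)
        outputs.append(h)
    return np.stack(outputs)
```

Passing the same params to every time-step in run_layer reflects the weight sharing within a layer described above.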
(32)
(33) In the embodiment shown in
(34) As noted above, in the embodiment shown in
(35) In the embodiment shown in
(36) In the embodiment shown in
(37)
(38) In the particular domain of acoustic echo cancellation described here, the training data may include: far-end signals x(t); near-end signals s(t); and echo signals y(t). In some embodiments of the present disclosure, at 510, the computer system generates training data in a manner similar to that described in H. Zhang and D. Wang, “Deep Learning for Acoustic Echo Cancellation in Noisy and Double-Talk Scenarios,” in Proc. Annual Conference of the International Speech Communication Association, 2018, pp. 3239-3243. In some embodiments, the TIMIT dataset is used to generate the training data (see, e.g., F. Lamel, R. H. Kassel, and S. Seneff, “Speech database development: Design and analysis of the acoustic-phonetic corpus,” in Speech Input/Output Assessment and Speech Databases, 1989.).
(39)
(40) At 517, each utterance of a near-end speaker of the pair is padded or extended to the same length as that of its corresponding far-end signal x(t) (e.g., for each concatenated far-end signal generated in accordance with the paired far-end human speaker) by filling zeroes before and after the utterance to have the same size as the far-end signal to generate ground truth near-end signals s(t). (Embodiments of the present disclosure are not limited thereto, and, in some embodiments, noise is added to the entire padded signal.) In some embodiments, more than one far-end signal x(t) and near-end signal s(t) pair is selected for each near-end far-end pair.
(41) At 519, the computer system mixes (e.g., adds) the ground truth echo signals y(t) and the ground truth near-end signals s(t) computed for each pair to generate a corresponding training microphone signal d(t). For training mixtures, in some embodiments, the computer system generates the training microphone signals d(t) at 519 at a signal-to-echo ratio (SER) level randomly chosen from {−6, −3, 0, 3, 6} dB by mixing the near-end speech signal and echo signal. The SER level is calculated on the double-talk period as:
(42)
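The SER expression is not reproduced in this text. As a non-limiting illustration, the sketch below assumes the conventional definition, SER = 10 log10(E{s(t)^2} / E{y(t)^2}), evaluated only over the double-talk samples. The function names, the boolean mask used to select the double-talk period, and the choice to rescale the echo (rather than the near-end speech) to reach the target level are all assumptions.

```python
import numpy as np

def ser_db(near_end, echo, mask=None):
    """Signal-to-echo ratio in dB over the samples selected by mask."""
    near_end = np.asarray(near_end, dtype=np.float64)
    echo = np.asarray(echo, dtype=np.float64)
    if mask is None:
        mask = np.ones(len(near_end), dtype=bool)
    # Assumed definition: ratio of near-end power to echo power.
    return 10.0 * np.log10(np.mean(near_end[mask] ** 2) / np.mean(echo[mask] ** 2))

def mix_at_ser(near_end, echo, target_db, mask=None):
    """Scale the echo so the mixture has the target SER, then add."""
    # Amplitude gain g changes the SER by -20*log10(g).
    gain = 10.0 ** ((ser_db(near_end, echo, mask) - target_db) / 20.0)
    scaled_echo = gain * np.asarray(echo, dtype=np.float64)
    return np.asarray(near_end, dtype=np.float64) + scaled_echo, scaled_echo
```

For a zero-padded near-end signal, one possible mask is the set of samples where the near-end utterance is present, with target_db drawn at random from the SER levels listed above.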
(43)
(44)
(45) At 515-3, to simulate the loudspeaker distortion, the computer system applies a sigmoidal function such as:
(46)
where b(t)=1.5x.sub.clip(t)−0.3x.sub.clip(t).sup.2 and a=4 if b(t)>0 and a=0.5 otherwise.
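The sigmoidal function itself is not reproduced in this text. As a non-limiting illustration, the sketch below assumes a memoryless form x_nl(t) = gain · (2 / (1 + exp(−a·b(t))) − 1), with b(t) and a exactly as defined above. The overall gain of 4, the function names, and the hard-clipping helper with its threshold x_max are assumptions of this sketch.

```python
import numpy as np

def hard_clip(x, x_max):
    """Hypothetical clipping step producing x_clip(t); x_max is assumed."""
    return np.clip(x, -x_max, x_max)

def loudspeaker_distortion(x_clip, gain=4.0):
    """Assumed sigmoidal nonlinearity applied sample by sample."""
    x_clip = np.asarray(x_clip, dtype=np.float64)
    b = 1.5 * x_clip - 0.3 * x_clip ** 2   # b(t) as defined in the text
    a = np.where(b > 0, 4.0, 0.5)          # a = 4 if b(t) > 0, else 0.5
    # Assumed form: a scaled sigmoid centered so that b = 0 maps to 0.
    return gain * (2.0 / (1.0 + np.exp(-a * b)) - 1.0)
```

Under these assumptions the mapping is asymmetric: positive excursions saturate quickly (a = 4), while negative excursions are compressed more gently (a = 0.5).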
(47) According to one embodiment, at 515-5, a room impulse response (RIR) g(t) is randomly chosen from a set of RIRs, where the length of each of the RIRs is 512, the simulation room size is 4 meters×4 meters×3 meters, and a simulated microphone is fixed at the location of [2 2 1.5] meters (at the center of the room). A simulated loudspeaker is placed at seven random places with 1.5 m distance from the microphone. In some embodiments of the present disclosure, a plurality of different RIRs are also generated with different room sizes and different placements of the simulated microphone and/or simulated speaker.
(48) In some embodiments, the RIRs are generated using an image method (see, e.g., J. B. Allen, D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.) at reverberation time (T.sub.60) of 200 ms. From the generated RIRs, in some embodiments some of the RIRs are used to generate the training data (e.g., may be randomly selected) while others are reserved to generate test data.
(49) At 515-7, the output of the sigmoidal function is convolved with the randomly chosen room impulse response (RIR) g(t) in order to simulate the acoustic transmission of the distorted (nonlinear) far-end signal x.sub.nl(t) played through the loudspeaker in the room:
y.sub.nl(t)=x.sub.nl(t)*g(t)
where * indicates a convolution operation.
(50) In some embodiments, a linear acoustic path y.sub.lin(t) is simulated by only convolving the original far-end signal x(t) with the RIR g(t) to generate the echo signal, where nonlinearities such as clipping and loudspeaker distortion are not applied for this model:
y.sub.lin(t)=x(t)*g(t)
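As a non-limiting illustration, both echo models above reduce to one convolution with the RIR g(t), applied either to the distorted signal x.sub.nl(t) or to the original far-end signal x(t). In the sketch below, trimming the result to the length of the source is an assumption, and toy_rir is a hypothetical stand-in (exponentially decaying noise) that is not the image method cited above.

```python
import numpy as np

def simulate_echo(source, rir):
    """Convolve a source with an RIR: y(t) = source(t) * g(t)."""
    # Assumption: the convolution tail is trimmed to the source length.
    return np.convolve(source, rir)[:len(source)]

def toy_rir(length=512, t60=0.2, fs=16000, seed=0):
    """Hypothetical RIR stand-in: noise with a 60 dB decay over t60 seconds."""
    rng = np.random.default_rng(seed)
    t = np.arange(length) / fs
    envelope = 10.0 ** (-3.0 * t / t60)  # amplitude falls by 60 dB at t = t60
    return rng.standard_normal(length) * envelope
```

With these helpers, simulate_echo(x_nl, g) would correspond to the nonlinear model and simulate_echo(x, g) to the linear model.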
(51) Referring back to
(52) At 530, the computer system trains the neural network of the AEC 228 in accordance with the training data. In more detail, as discussed above, each of the GRUs computes its corresponding activation h from its inputs based on internal weight matrices U, W, U.sub.r, W.sub.r, U.sub.z, and W.sub.z. In addition, each of the fully connected units includes a plurality of internal weights W and biases b (e.g., applying an affine function of the form Wx+b) for mapping the inputs to the fully connected units to the outputs in feature space (e.g., STFT space).
(53) Training the neural network involves learning the internal weights of the GRUs and the fully connected units such that the output feature vectors (estimated near-end features {tilde over (Q)} and estimated echo features {tilde over (V)}) are close to the ground truth feature vectors (ground truth near-end features {tilde over (S)} and ground truth echo features {tilde over (Y)}). The difference between the output feature vectors {tilde over (Q)} and {tilde over (V)} and the ground truth feature vectors {tilde over (S)} and {tilde over (Y)} may be measured using a loss function, representing how well the neural network, as configured with the current set of internal weights, approximates the underlying data.
(54) In one embodiment, a mean absolute error (MAE) loss function is used for training the neural network. A mean absolute error is calculated between a ground-truth source (near-end signal s(t)) and a network estimated output (estimated near-end signal q(t)) in the feature domain (e.g., the STFT domain, as discussed above). Some embodiments use a weighted loss function that accounts for both the near-end signal s(t) and the echo path signal y(t) to compute the network weights. Accordingly, in one embodiment, the loss for a given frame k is computed based on the current frame and the previous six frames in accordance with:
(55)
where β is the weighting factor between the loss associated with the near-end signal and the loss associated with the echo signal, {tilde over (S)}.sub.i corresponds to the ground truth near-end features for an i-th frame, {tilde over (Q)}.sub.i corresponds to the estimated near-end features for the i-th frame, {tilde over (Y)}.sub.i corresponds to the ground truth echo features for the i-th frame, and {tilde over (V)}.sub.i corresponds to the estimated echo features for the i-th frame. In embodiments where m previous frames of data are used for context (e.g., a causal window of length m frames), the summations run from n=0 to m. For the sake of convenience, in the embodiments described in detail herein, m=6.
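The loss expression is not reproduced in this text. As a non-limiting illustration, the sketch below assumes that β weights the absolute error of the near-end estimate and (1 − β) weights the absolute error of the echo estimate, each summed over the current frame and the m previous frames. The function name, the default β of 0.5, and the use of a sum rather than a mean over frequency bins are assumptions.

```python
import numpy as np

def multitask_loss(S, Q, Y, V, beta=0.5):
    """Weighted absolute-error loss over a causal window of frames.

    S, Q: ground truth and estimated near-end features, shape (m + 1, bins).
    Y, V: ground truth and estimated echo features, shape (m + 1, bins).
    """
    near_term = np.abs(S - Q).sum()  # main task: near-end speech
    echo_term = np.abs(Y - V).sum()  # auxiliary task: echo
    return beta * near_term + (1.0 - beta) * echo_term
```

Setting beta to 1 in this sketch removes the echo term and leaves only the near-end loss.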
(56) In some embodiments of the present disclosure, the weights are computed using gradient descent and backpropagation. In particular, the weights are iteratively adjusted based on the differences between the current output of the neural network and the ground truth. In some embodiments of the present disclosure, the models are trained using AMSGrad optimization (see, e.g., J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in International Conference on Learning Representations (ICLR), 2018.), and in particular the Adam variant (see, e.g., D. P. Kingma and J. L. Ba, “Adam: a method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.) by setting β.sub.1=0.9, β.sub.2=0.999, and ϵ=10.sup.−3 for 100 epochs, with a batch size of 100. In some embodiments, the weights of all layers are initialized with the Xavier method (see, e.g., X. Glorot, and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. International Conference on Artificial Intelligence and Statistics, 2010, pp. 249-256.) and with the biases initialized to zero. In some embodiments, L2 regularization for all the weights with a regularization constant of 0.000001 is used to prevent overfitting.
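As a non-limiting illustration, one AMSGrad parameter update with the settings listed above (β.sub.1=0.9, β.sub.2=0.999, ϵ=10.sup.−3, and an L2 constant of 0.000001) may be sketched as follows. The learning rate is not stated in the text and is an assumption here, as are the function names, the omission of bias correction, and the choice to apply the L2 penalty by adding l2·w to the gradient.

```python
import numpy as np

def init_state(w):
    """Zero-initialized moment estimates for a parameter array w."""
    return {"m": np.zeros_like(w), "v": np.zeros_like(w), "v_hat": np.zeros_like(w)}

def amsgrad_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-3, l2=1e-6):
    """Return updated weights; state is modified in place."""
    grad = grad + l2 * w  # assumed form of the L2 regularization
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    # AMSGrad keeps the running maximum of v, so the step size never grows.
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    return w - lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)
```

The running maximum v_hat is the only difference from a plain Adam update in this sketch.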
(57) After training the weights of the neural network, the trained network may be tested using the test set of the training data to verify the accuracy of the network. As noted above, the test set may be formed using utterances from speakers who were not used in the training set and/or use RIRs and/or other distortions that were not present in the training set. Accordingly, the test set may be used to evaluate whether the training process has trained a neural network to perform a generalized function for acoustic echo cancellation, rather than overfitting to the particular characteristics of the training data (e.g., removing acoustic echoes characteristic of the particular human speakers or RIRs of the training data).
(58) After training the neural network and determining that the performance of the trained network is sufficient (e.g., based on the test set), the weights may be saved and used to configure a neural network running on an end-user device such as a smartphone or a tablet computer. In various embodiments of the present disclosure, the neural network of the acoustic echo cancellation module is implemented on at least one processor 1120 of the end-user device 10 (see, e.g.,
(59)
(60)
(61) Similarly, at 612, the acoustic echo cancellation system 200 receives the microphone signal d(t) and, at 614, the near-end signal feature extraction module 210 extracts microphone signal features {tilde over (D)} from the microphone signal d(t).
(62) At 620, the second stack of the neural network, corresponding to the near-end estimator 250′, computes estimated near-end features {tilde over (Q)} from the far-end features {tilde over (X)}, the echo estimator features (e.g., h), and the microphone features {tilde over (D)}. As shown in
(63) At 622, the feature inversion module 270 of the acoustic echo cancellation system 200 computes an estimated near-end signal q(t) for the current frame from the estimated near-end features {tilde over (Q)} of the current frame. As noted above, the features (e.g., the far-end signal features {tilde over (X)}, the microphone features {tilde over (D)}, and the estimated near-end features {tilde over (Q)}) may be in a feature space or hyperspace such as STFT space (e.g., spectral features or spectral domain). Accordingly, in some embodiments, the feature inversion module 270 transforms the estimated spectral features {tilde over (Q)} from the feature space to a time domain signal q(t) suitable for playback on a speaker at the far-end device. As shown in
(64)
Experimental Results
(65) To evaluate the performance of an acoustic echo cancellation system 200 as described above, experiments were performed using training data generated from the TIMIT dataset (see, e.g., F. Lamel, R. H. Kassel, and S. Seneff, “Speech database development: Design and analysis of the acoustic-phonetic corpus,” in Speech Input/Output Assessment and Speech Databases, 1989.). In some embodiments of the present disclosure, seven utterances of near-end speakers were used to generate 3,500 training mixtures where each near-end signal was mixed with five different far-end signals. From the remaining 430 speakers, 100 pairs of speakers were randomly chosen as the far-end and near-end speakers. To generate 300 testing mixtures, the same procedure as described above was followed, but with only three utterances of near-end speakers, where each near-end signal was mixed with one far-end signal. Therefore, the testing mixtures are from human speakers that were not part of the training set.
(66) Perceptual Evaluation of Speech Quality (PESQ) scores of unprocessed test mixtures for linear and nonlinear models (no echo cancellation) are shown in Table 1. The unprocessed PESQ scores are calculated by comparing the microphone signal against the near-end signal during the double-talk period.
(67) TABLE 1. PESQ scores for unprocessed test mixtures in linear and nonlinear models of acoustic path

Acoustic Path Model      Testing SER (dB)
                         0        3.5      7
Linear                   1.87     2.11     2.34
Nonlinear                1.78     2.03     2.26
(68) In some instances, echo return loss enhancement (ERLE) was used to evaluate the echo reduction that is achieved by the acoustic echo cancellation system 200 according to embodiments of the present disclosure during the single-talk situations where only the echo is present, where ERLE is defined as:
(69)
where E is the statistical expectation operation which is realized by averaging.
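The ERLE expression is not reproduced in this text. As a non-limiting illustration, the sketch below assumes the conventional definition, ERLE = 10 log10(E{d(t)^2} / E{q(t)^2}), with the expectation realized by averaging over a single-talk segment in which only echo is present. The function name and the small constant eps guarding against division by zero are assumptions.

```python
import numpy as np

def erle_db(mic, estimate, eps=1e-12):
    """Echo return loss enhancement in dB over a far-end single-talk segment."""
    mic = np.asarray(mic, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Assumed definition: microphone power over residual output power.
    return 10.0 * np.log10(np.mean(mic ** 2) / (np.mean(estimate ** 2) + eps))
```

Under this definition, a larger value indicates that more of the echo energy has been removed from the microphone signal.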
(70) To evaluate the performance of the system during the double-talk periods, we used perceptual evaluation of speech quality (PESQ). In some embodiments, PESQ is calculated by comparing the estimated near-end speech q(t) against the ground-truth near-end speech s(t) during the double-talk only periods. A PESQ score ranges from −0.5 to 4.5 and a higher score indicates better quality.
(71) In the following discussion, a frequency domain normalized least mean square (NLMS) (see, e.g., C. Faller and J. Chen, “Suppressing acoustic echo in a spectral envelope space,” IEEE Transactions on Acoustic, Speech and Signal Processing, vol. 13, no. 5, pp. 1048-1062, 2005.) is used as a comparative example. A double-talk detector (DTD) is used based on the energy of microphone signal d(t) and far-end signal x(t). In some instances, a post-processing algorithm is further based on the method presented in R. Martin and S. Gustafsson, “The echo shaping approach to acoustic echo control”, Speech Communication, vol. 20, no. 3-4, pp. 181-190, 1996. Embodiments of the present disclosure are also compared against the bidirectional long short-term memory (BLSTM) method described in H. Zhang and D. Wang, “Deep Learning for Acoustic Echo Cancellation in Noisy and Double-Talk Scenarios,” in Proc. Annual Conference of the International Speech Communication Association, 2018, pp. 3239-3243.
(72) Embodiments of the present disclosure are compared against comparative methods using a linear model of the acoustic path (e.g., linear acoustic echoes). Table 2 shows the average ERLE values and PESQ gains for the conventional NLMS filter, BLSTM, and a context-aware multitask GRU according to embodiments of the present disclosure (denoted as “CA Multitask GRU”). The PESQ gain is calculated as the difference of PESQ value of each method with respect to its unprocessed PESQ value. Table 2 also shows the results for context-aware single-task GRU (denoted as “CA Single-task GRU”) according to embodiments of the present disclosure that only uses the second stack of GRU layers with {tilde over (D)}.sub.k and {tilde over (X)}.sub.k as the inputs, where the loss function is calculated by only penalizing the network outputs against ground-truth feature vector {tilde over (S)} of near-end speech s(t). The results show that multitask GRU according to some embodiments of the present disclosure outperforms single-task GRU according to some embodiments of the present disclosure in terms of both PESQ and ERLE. It also shows that embodiments of the present disclosure outperform both conventional NLMS+Post-processing and BLSTM methods in all conditions.
(73) TABLE 2. ERLE and PESQ scores in a linear model of acoustic path

Metric       Method                     Testing SER (dB)
                                        0        3.5      7
ERLE (dB)    NLMS + Post-processing     29.38    25.88    21.97
             BLSTM                      51.61    50.04    47.42
             CA Single-task GRU         62.88    61.81    60.11
             CA Multitask GRU           64.66    64.16    62.26
PESQ gain    NLMS + Post-processing     0.93     0.81     0.68
             BLSTM                      0.80     0.78     0.74
             CA Single-task GRU         0.98     0.95     0.93
             CA Multitask GRU           1.04     1.02     0.99
(74) Embodiments of the present disclosure are also compared against comparative methods using a nonlinear model of the acoustic path (e.g., nonlinear acoustic echoes). In this set of experiments, the nonlinear ground truth echo signal y.sub.nl(t) was used to generate the microphone signals d(t), therefore the model contains both power amplifier clipping and loudspeaker distortions (e.g., corresponding to 515-3 and 515-7 of
(75) TABLE 3. ERLE and PESQ scores in nonlinear model of acoustic path

Metric       Method                     Testing SER (dB)
                                        0        3.5      7
ERLE (dB)    NLMS + Post-processing     16.76    14.26    12.33
             AES + DNN                  -        36.59    -
             CA Multitask GRU           61.79    60.52    59.47
PESQ gain    NLMS + Post-processing     0.54     0.43     0.31
             AES + DNN                  -        0.62     -
             CA Multitask GRU           0.84     0.83     0.81
(76) Embodiments of the present disclosure achieve superior echo reduction without significant near-end distortion (e.g., the spectra corresponding to the estimated near-end signal and the actual near-end signal are very similar).
(77) The performance of embodiments of the present disclosure was also evaluated in the presence of additive noise and a nonlinear model of the acoustic path. In these embodiments, when generating the training data, white noise at 10 dB SNR was added to the near-end signal s(t), with nonlinear acoustic path at 3.5 dB SER level. Embodiments of the present disclosure were then compared against a conventional NLMS+Post-processing system. As shown in Table 4 below, aspects of embodiments of the present disclosure outperform the comparative method by a large margin.
(78) TABLE 4. ERLE and PESQ scores in nonlinear model of acoustic path (SER = 3.5 dB) and additive noise (SNR = 10 dB)

Metric       Method                     Score
ERLE (dB)    NLMS + Post-processing     10.13
             CA Multitask GRU           46.12
PESQ         None                       1.80
             NLMS + Post-processing     2.01
             CA Multitask GRU           2.50
(79) In addition, the alternative hybrid embodiment discussed above was evaluated for unseen RIRs for different reverberation times and loudspeaker distances from the microphone. In this evaluation, the models were trained and tested using the same RIRs discussed above corresponding to a room size of 4 meters×4 meters×3 meters with reverberation time of 200 ms, and random loudspeaker distance of 1.5 meters from microphone and total length of 512 samples. During the testing of a hybrid system according to embodiments of the present disclosure, the loudspeaker distance was changed to 15 cm. The results of frequency domain NLMS and a hybrid method of NLMS and multitask GRU according to embodiments of the present disclosure that was trained with the above RIRs are shown in Table 5. The multitask GRU was further fine-tuned with the RIRs that were generated in multiple room sizes (small, medium, and large), various reverberation times (from 250 ms to 900 ms), and loudspeaker distance of 15 cm. The fine-tuned results are also shown in Table 5, below. These results suggest that the hybrid method according to some embodiments of the present disclosure can perform better if the model is fine-tuned with the impulse response of the target device (e.g., target end-user near-end device).
(80) TABLE 5
ERLE and PESQ scores of hybrid method

Metric      Method                                Score
ERLE (dB)   NLMS                                  14.70
ERLE (dB)   Hybrid Multitask GRU                  37.68
ERLE (dB)   Hybrid Multitask GRU (Fine-tuned)     41.17
PESQ        None                                   2.06
PESQ        NLMS                                   2.70
PESQ        Hybrid Multitask GRU                   3.23
PESQ        Hybrid Multitask GRU (Fine-tuned)      3.37
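For reference, the ERLE figures reported in the tables above may be understood through the following minimal sketch, which assumes the commonly used definition of ERLE as ten times the base-10 logarithm of the ratio of microphone signal power to residual (post-cancellation) signal power. The function name erle_db and the synthetic signals are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def erle_db(mic, residual):
    """Echo return loss enhancement in dB, assuming the common definition
    ERLE = 10*log10(E[d^2] / E[q^2]), where d is the microphone signal and
    q is the signal remaining after echo cancellation."""
    mic_power = sum(x * x for x in mic) / len(mic)
    residual_power = sum(x * x for x in residual) / len(residual)
    return 10.0 * math.log10(mic_power / residual_power)


# Synthetic example: the echo is attenuated by a factor of 100 in amplitude,
# which corresponds to a factor of 10^4 in power.
mic = [math.sin(0.05 * k) for k in range(1000)]
residual = [0.01 * x for x in mic]
value = erle_db(mic, residual)
```

In this example the power ratio is 10^4, so the computed ERLE is 40 dB; a larger ERLE indicates stronger echo attenuation.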
Additional Embodiments
(81) Some embodiments of the present disclosure are directed to different architectures for the neural network of the acoustic echo cancellation system 200.
(82)
(83) The estimated features 912 of the near-end signal are obtained directly from the output of the network 900. These features are converted back to the time domain to synthesize the estimated near-end speech signal, e.g., using the feature inversion module 270 described above. In some embodiments, for both the microphone signal d(t) and the near-end signal s(t), each sampled at a rate of 16 kHz, a frame size of 512 samples with 50% overlap was used. A 512-point short-time Fourier transform (STFT) was then applied to each frame of the input signals, resulting in 257 frequency bins. The final log-magnitude (Log-Mag) features were computed by taking the logarithm of the magnitude values. In some embodiments of the present disclosure, the log-mel-magnitude (Log-Mel-Mag) was used as the final features 912 to reduce the dimensionality of the feature space and therefore reduce the complexity of the technique applied in these embodiments. In some embodiments, the features are compressed by using an 80-dimensional Mel-transformation matrix.
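By way of a non-limiting illustration, the feature extraction described above (16 kHz sampling, 512-sample frames with 50% overlap, a 512-point STFT yielding 257 bins, and an 80-dimensional Mel compression) may be sketched as follows. The Hann window, the natural logarithm, the small constant added before the logarithm, the triangular Mel filterbank construction, and the function names are all assumptions of this example; the disclosure does not specify them.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_SIZE = 512
HOP = FRAME_SIZE // 2  # 50% overlap
N_MELS = 80


def stft_magnitude(x):
    """Frame the signal and return magnitudes of shape (n_frames, 257)."""
    x = np.asarray(x, dtype=np.float64)
    n_frames = 1 + (len(x) - FRAME_SIZE) // HOP
    window = np.hanning(FRAME_SIZE)  # assumption: window type is not stated
    frames = np.stack(
        [x[i * HOP:i * HOP + FRAME_SIZE] * window for i in range(n_frames)]
    )
    # A real-input 512-point FFT yields 512 // 2 + 1 = 257 frequency bins.
    return np.abs(np.fft.rfft(frames, n=FRAME_SIZE, axis=1))


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_matrix(n_mels=N_MELS, n_fft=FRAME_SIZE, sr=SAMPLE_RATE):
    """Triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1).

    Illustrative construction only; one of several common variants.
    """
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bin_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, center, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_freqs - lo) / (center - lo)
        falling = (hi - bin_freqs) / (hi - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


EPS = 1e-8  # assumption: guards against log(0)
x = np.random.default_rng(0).standard_normal(SAMPLE_RATE)  # 1 s stand-in signal
magnitude = stft_magnitude(x)
log_mag = np.log(magnitude + EPS)                      # Log-Mag, 257 dims per frame
log_mel = np.log(magnitude @ mel_matrix().T + EPS)     # Log-Mel-Mag, 80 dims per frame
```

In this sketch the Mel compression reduces each frame from 257 to 80 values, which illustrates the dimensionality reduction referred to above.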
(84) In order to use contextual information, in some embodiments, features for contextual frames of both input signals are also extracted and concatenated as the input features.
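One possible way of concatenating contextual frames is sketched below. This example assumes a causal context (the current frame and a number of preceding frames), zero padding before the first frame, and a hypothetical context length of two frames; none of these choices is specified in the paragraph above, and the function name add_causal_context is illustrative only.

```python
import numpy as np


def add_causal_context(features, n_context):
    """Concatenate each frame with its n_context preceding frames.

    features: array of shape (n_frames, n_dims).
    Returns an array of shape (n_frames, (n_context + 1) * n_dims) whose row t
    is [frame t - n_context, ..., frame t - 1, frame t]. Frames before the
    start of the signal are filled with zeros (an assumption of this sketch).
    """
    n_frames, n_dims = features.shape
    padded = np.concatenate([np.zeros((n_context, n_dims)), features], axis=0)
    shifted = [padded[i:i + n_frames] for i in range(n_context + 1)]
    return np.concatenate(shifted, axis=1)


# Toy features standing in for the microphone and far-end feature matrices.
mic_feats = np.arange(12, dtype=np.float64).reshape(4, 3)
far_feats = mic_feats + 100.0

mic_ctx = add_causal_context(mic_feats, n_context=2)
far_ctx = add_causal_context(far_feats, n_context=2)
# Input to the network: context-expanded features of both input signals.
network_input = np.concatenate([mic_ctx, far_ctx], axis=1)
```

Restricting the context to past frames avoids waiting for future frames, at the cost of less information per time step than a symmetric window would provide.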
(85) In various embodiments, either the log-magnitude (Log-Mag) features or the log-mel-magnitude (Log-Mel-Mag) features of the near-end speech signal were used as the target labels during training.
(86) In some embodiments, AMSGrad is used as the optimizer during training. In some embodiments, the mean absolute error (MAE) between the target labels and the output of the network is used as the loss function.
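The combination of an AMSGrad update with an MAE loss may be illustrated by the following minimal sketch, which fits a toy linear model rather than the recurrent network of the disclosure. The update rule shown follows the AMSGrad formulation without bias correction; the learning rate, the decay constants, the class and function names, and the synthetic data are assumptions of this example.

```python
import numpy as np


def mae_loss_and_grad(pred, target):
    """Mean absolute error and its subgradient with respect to pred."""
    diff = pred - target
    return np.mean(np.abs(diff)), np.sign(diff) / diff.size


class AMSGrad:
    """Minimal AMSGrad optimizer (no bias correction); illustrative only."""

    def __init__(self, shape, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)      # first-moment estimate
        self.v = np.zeros(shape)      # second-moment estimate
        self.v_hat = np.zeros(shape)  # running maximum of v

    def step(self, param, grad):
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad ** 2
        # Unlike Adam, the denominator uses the maximum of all past v values,
        # so the effective per-parameter step size never increases.
        self.v_hat = np.maximum(self.v_hat, self.v)
        return param - self.lr * self.m / (np.sqrt(self.v_hat) + self.eps)


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 4))                # toy input features
target = x @ np.array([1.0, -2.0, 0.5, 3.0])     # toy target labels
w = np.zeros(4)
opt = AMSGrad(w.shape)
losses = []
for _ in range(500):
    loss, grad_pred = mae_loss_and_grad(x @ w, target)
    losses.append(loss)
    w = opt.step(w, x.T @ grad_pred)  # chain rule through the linear model
```

In practice, a deep learning framework's built-in optimizer and loss would typically be used; the sketch only makes the update arithmetic explicit.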
(87)
(88)
(89) Accordingly, aspects of embodiments of the present disclosure relate to deep neural networks, including deep multitask recurrent neural networks, for acoustic echo cancellation (AEC). As shown in the experimental results, embodiments of the present disclosure perform well in both single-talk and double-talk periods. Some aspects of embodiments of the present disclosure relate to end-to-end multitask learning of both the echo and the near-end signal simultaneously, which improves the overall performance of the trained AEC system. In addition, some aspects of embodiments relate to the use of low-latency causal context windows to improve the context-awareness when estimating the near-end signal with the acoustic echoes removed. When compared based on reference datasets, embodiments of the present disclosure can reduce the echo more significantly than comparative techniques and are robust to additive background noise. Further, a hybrid method according to some embodiments of the present disclosure is more robust to changes in the room impulse response (RIR) and can perform well if fine-tuned by augmenting the data simulated with the impulse response of the target device (e.g., the end-user near-end device 10) in use.
(90) As such, aspects of embodiments of the present disclosure relate to echo cancellation or echo suppression using a trained deep recurrent neural network. While the present disclosure has been described in connection with certain exemplary embodiments, it is to be understood that the disclosure is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, and equivalents thereof.