Audio signal processing method and system for echo suppression using an MMSE-LSA estimator
11984107 · 2024-05-14
Assignee
Inventors
- Abdel Yussef HUSSENBOCUS (Heverlee, BE)
- Christophe Mansard (Brussels, BE)
- Stijn ROBBEN (Boutersem, BE)
Cpc classification
G10K11/178
PHYSICS
International classification
Abstract
An audio signal processing method implemented by an audio system with an audio sensor and a speaker unit includes: measuring, by the audio sensor, acoustic signals reaching the audio sensor, producing a sensor audio signal; retrieving a speaker audio signal corresponding to a speaker acoustic signal from the speaker unit while measuring the acoustic signals reaching the audio sensor to produce the sensor audio signal; converting the speaker and sensor audio signals to speaker and sensor audio spectra; estimating, based on the speaker audio spectrum, an echo audio spectrum of an echo audio signal caused by the speaker acoustic signal in the sensor audio signal; computing, based on the echo audio spectrum and the sensor audio spectrum, echo suppression gains to be applied to the sensor audio spectrum, by using an MMSE-LSA estimator; and applying the echo suppression gains to the sensor audio spectrum.
Claims
1. An audio signal processing method implemented by an audio system which comprises at least an audio sensor and a speaker unit, wherein the speaker unit is configured to convert a speaker audio signal received as input into a speaker acoustic signal which is output by the speaker unit for a user of the audio system, wherein the audio signal processing method comprises: measuring, by the audio sensor, acoustic signals reaching the audio sensor, thereby producing a sensor audio signal, retrieving the speaker audio signal corresponding to the speaker acoustic signal output by the speaker unit while measuring the acoustic signals reaching the audio sensor to produce the sensor audio signal, converting the speaker audio signal to frequency domain, thereby producing a speaker audio spectrum, converting the sensor audio signal to frequency domain, thereby producing a sensor audio spectrum, estimating, based on the speaker audio spectrum, an echo audio spectrum of an echo audio signal caused by the speaker acoustic signal in the sensor audio signal, computing, based on the echo audio spectrum and the sensor audio spectrum, echo suppression gains to be applied to the sensor audio spectrum, by using a Minimum Mean Square Error-Log Spectral Amplitude, MMSE-LSA, estimator, applying the echo suppression gains to the sensor audio spectrum.
2. The audio signal processing method according to claim 1, wherein estimating the echo audio spectrum comprises: determining a spectral transfer function of an acoustic path from the speaker unit to the audio sensor, determining the echo audio spectrum by applying the spectral transfer function to the speaker audio spectrum.
3. The audio signal processing method according to claim 2, wherein the spectral transfer function is predetermined independently from the speaker audio spectrum and the sensor audio spectrum.
4. The audio signal processing method according to claim 2, wherein the spectral transfer function is dynamically adapted based on the speaker audio spectrum and the sensor audio spectrum.
5. The audio signal processing method according to claim 4, further comprising evaluating whether a spectral transfer function updating criterion is satisfied and, responsive to the spectral transfer function updating criterion being satisfied, updating the spectral transfer function by comparing the speaker audio spectrum and the sensor audio spectrum.
6. The audio signal processing method according to claim 5, wherein evaluating whether the spectral transfer function updating criterion is satisfied comprises at least one among the following: determining whether the sensor audio signal includes a voice audio signal corresponding to a voice acoustic signal emitted by the user, wherein the spectral transfer function updating criterion is satisfied when it is determined that the sensor audio signal does not include a voice audio signal, and/or determining whether the sensor audio signal includes a noise audio signal having a noise level below a predetermined noise threshold, wherein the spectral transfer function updating criterion is satisfied when it is determined that the noise level is below said predetermined noise threshold, and/or determining whether the sensor audio signal includes an echo audio signal having an echo level above a predetermined echo threshold, wherein the spectral transfer function updating criterion is satisfied when it is determined that the echo level is above said predetermined echo threshold.
7. The audio signal processing method according to claim 5, wherein the spectral transfer function updating criterion is evaluated for a plurality of frequencies or frequency sub-bands and the spectral transfer function is updated for each frequency or frequency sub-band for which the spectral transfer function updating criterion is satisfied.
8. The audio signal processing method according to claim 1, further comprising smoothing in frequency the echo suppression gains, or the sensor audio spectrum obtained after applying the echo suppression gains.
9. The audio signal processing method according to claim 1, wherein the audio system comprises two or more audio sensors which comprise an internal sensor and an external sensor, wherein the internal sensor is arranged to measure acoustic signals which reach the internal sensor by propagating internally to a head of the user and the external sensor is arranged to measure acoustic signals which reach the external sensor by propagating externally to the user's head, wherein: echo suppression gains are computed on a first frequency band for an internal audio spectrum of an internal audio signal produced by the internal sensor, echo suppression gains are computed on a second frequency band for an external audio spectrum of an external audio signal produced by the external sensor, wherein the second frequency band is different from the first frequency band and includes frequencies which are greater than a maximum frequency of the first frequency band.
10. The audio signal processing method according to claim 1, wherein the MMSE-LSA estimator uses an exponential integral function which is approximated by a linear function.
11. An audio system comprising at least an audio sensor and a speaker unit, wherein the speaker unit is configured to convert a speaker audio signal received as input into a speaker acoustic signal for a user of the audio system, wherein the audio sensor is configured to produce a sensor audio signal by measuring acoustic signals reaching the audio sensor, wherein said audio system further comprises a processing circuit configured to: retrieve the speaker audio signal corresponding to the speaker acoustic signal output by the speaker unit while measuring the acoustic signals reaching the audio sensor to produce the sensor audio signal, convert the speaker audio signal to frequency domain, thereby producing a speaker audio spectrum, convert the sensor audio signal to frequency domain, thereby producing a sensor audio spectrum, estimate, based on the speaker audio spectrum, an echo audio spectrum of an echo audio signal caused by the speaker acoustic signal in the sensor audio signal, compute, based on the echo audio spectrum and the sensor audio spectrum, echo suppression gains to be applied to the sensor audio spectrum, by using a Minimum Mean Square Error-Log Spectral Amplitude, MMSE-LSA, estimator, apply the echo suppression gains to the sensor audio spectrum.
12. The audio system according to claim 11, wherein the processing circuit is configured to estimate the echo audio spectrum by: determining a spectral transfer function of an acoustic path from the speaker unit to the audio sensor, determining the echo audio spectrum by applying the spectral transfer function to the speaker audio spectrum.
13. The audio system according to claim 12, wherein the spectral transfer function is predetermined independently from the speaker audio spectrum and the sensor audio spectrum.
14. The audio system according to claim 12, wherein the processing circuit is configured to dynamically adapt the spectral transfer function based on the speaker audio spectrum and the sensor audio spectrum.
15. The audio system according to claim 14, wherein the processing circuit is further configured to evaluate whether a spectral transfer function updating criterion is satisfied and, responsive to the spectral transfer function updating criterion being satisfied, update the spectral transfer function by comparing the speaker audio spectrum and the sensor audio spectrum.
16. The audio system according to claim 15, wherein evaluating whether the spectral transfer function updating criterion is satisfied comprises at least one among the following: determining whether the sensor audio signal includes a voice audio signal corresponding to a voice acoustic signal emitted by the user, wherein the spectral transfer function updating criterion is satisfied when it is determined that the sensor audio signal does not include a voice audio signal, and/or determining whether the sensor audio signal includes a noise audio signal having a noise level below a predetermined noise threshold, wherein the spectral transfer function updating criterion is satisfied when it is determined that the noise level is below said predetermined noise threshold, and/or determining whether the sensor audio signal includes an echo audio signal having an echo level above a predetermined echo threshold, wherein the spectral transfer function updating criterion is satisfied when it is determined that the echo level is above said predetermined echo threshold.
17. The audio system according to claim 15, wherein the processing circuit is configured to evaluate the spectral transfer function updating criterion for a plurality of frequencies or frequency sub-bands and to update said spectral transfer function for each frequency or frequency sub-band for which the spectral transfer function updating criterion is satisfied.
18. The audio system according to claim 11, wherein the processing circuit is further configured to smooth in frequency the echo suppression gains, or the sensor audio spectrum obtained after applying the echo suppression gains.
19. The audio system according to claim 11, wherein the audio system comprises two or more audio sensors which comprise an internal sensor and an external sensor, wherein the internal sensor is arranged to measure acoustic signals which reach the internal sensor by propagating internally to a head of the user and the external sensor is arranged to measure acoustic signals which reach the external sensor by propagating externally to the user's head, wherein the processing circuit is further configured to: compute echo suppression gains on a first frequency band for an internal audio spectrum of an internal audio signal produced by the internal sensor, compute echo suppression gains on a second frequency band for an external audio spectrum of an external audio signal produced by the external sensor, wherein the second frequency band is different from the first frequency band and includes frequencies which are greater than a maximum frequency of the first frequency band.
20. The audio system according to claim 11, wherein the MMSE-LSA estimator uses an exponential integral function which is approximated by a linear function.
21. A non-transitory computer readable medium comprising computer readable code to be executed by an audio system comprising at least an audio sensor and a speaker unit, wherein the speaker unit is configured to convert a speaker audio signal received as input into a speaker acoustic signal which is output by the speaker unit for a user of the audio system, wherein said audio system further comprises a processing circuit, wherein said computer readable code causes said audio system to: measure, by the audio sensor, acoustic signals reaching the audio sensor, thereby producing a sensor audio signal, retrieve the speaker audio signal corresponding to the speaker acoustic signal output by the speaker unit while measuring the acoustic signals reaching the audio sensor to produce the sensor audio signal, convert the speaker audio signal to frequency domain, thereby producing a speaker audio spectrum, convert the sensor audio signal to frequency domain, thereby producing a sensor audio spectrum, estimate, based on the speaker audio spectrum, an echo audio spectrum of an echo audio signal caused by the speaker acoustic signal in the sensor audio signal, compute, based on the echo audio spectrum and the sensor audio spectrum, echo suppression gains to be applied to the sensor audio spectrum, by using a Minimum Mean Square Error-Log Spectral Amplitude, MMSE-LSA, estimator, apply the echo suppression gains to the sensor audio spectrum.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) The invention will be better understood upon reading the following description, given as an example that is in no way limiting, and made in reference to the figures which show:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9) In these figures, references identical from one figure to another designate identical or analogous elements. For reasons of clarity, the elements shown are not to scale, unless explicitly stated otherwise.
(10) Also, the order of steps represented in these figures is provided only for illustration purposes and is not meant to limit the present disclosure which may be applied with the same steps executed in a different order.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
(11)
(12) As illustrated by
(13) As illustrated by
(14) As illustrated by
(15) In some embodiments, the processing circuit 13 comprises one or more processors and one or more memories. The one or more processors may include for instance a central processing unit (CPU), a graphical processing unit (GPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc. The one or more memories may include any type of computer readable volatile and non-volatile memories (magnetic hard disk, solid-state disk, optical disk, electronic memory, etc.). The one or more memories may store a computer program product (software), in the form of a set of program-code instructions to be executed by the one or more processors in order to implement all or part of the steps of an audio signal processing method 20.
(16)
(17) As illustrated by
(18) The acoustic signals reaching the audio sensor 11 may or may not include a voice acoustic signal emitted by the user, with the presence of a voice activity varying over time as the user speaks.
(19) During e.g. a voice call, the speaker unit 12 typically emits a speaker acoustic signal for the user (which may or may not include the far-end speaker's voice). In this case, the acoustic signals reaching the audio sensor 11, measured during step S200, may include the speaker acoustic signal emitted by the speaker unit 12. This speaker acoustic signal, output by the speaker unit 12 when measuring the acoustic signals by the audio sensor 11, is the result of the conversion of a speaker audio signal fed as input to said speaker unit 12. As illustrated by
(20) As illustrated by
(21) Indeed, the sensor audio signal and the speaker audio signal are in time domain and steps S210 and S211 aim at performing a spectral analysis of the sensor and speaker audio signals to obtain respective audio spectra in frequency domain. In some examples, steps S210 and S211 may for instance use any time to frequency conversion method, for instance a Fast Fourier Transform (FFT), a Discrete Fourier Transform (DFT), a Discrete Cosine Transform (DCT), a wavelet transform, etc. In other examples, steps S210 and S211 may for instance use a bank of bandpass filters which filter the audio signals in respective frequency sub-bands of a same frequency band, etc.
(22) For instance, the sensor and speaker audio signals may be sampled at e.g. 16 kilohertz (kHz) and buffered into time-domain audio frames of e.g. 4 milliseconds (ms). For instance, it is possible to apply on these audio frames a 128-point DCT or FFT to produce audio spectra up to the Nyquist frequency f.sub.Nyquist, i.e. half the sampling rate (i.e. 8 kHz if the sampling rate is 16 kHz).
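The example figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are hypothetical and only the three example values come from the text. Note that a 4 ms frame at 16 kHz holds 64 samples, so a 128-point transform implies some form of zero-padding or frame overlap, which the sketch does not model.

```python
SAMPLE_RATE_HZ = 16_000   # example sampling rate from the text
FRAME_MS = 4              # example frame duration from the text
FFT_SIZE = 128            # example transform size from the text

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
nyquist_hz = SAMPLE_RATE_HZ / 2
bin_spacing_hz = SAMPLE_RATE_HZ / FFT_SIZE
# Bins 0..FFT_SIZE/2 of an FFT cover 0 Hz up to the Nyquist frequency.
num_bins = FFT_SIZE // 2 + 1

print(samples_per_frame, nyquist_hz, bin_spacing_hz, num_bins)
```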
(23) In the sequel, we assume in a non-limitative manner that the frequency band on which the sensor audio spectrum and the speaker audio spectrum are determined is composed of N discrete frequency values f.sub.n with 1≤n≤N, wherein f.sub.n-1<f.sub.n for any 2≤n≤N. For instance, f.sub.1=0 and f.sub.N=f.sub.Nyquist, but the spectral analysis may also be carried out on a frequency sub-band in [0, f.sub.Nyquist]. For instance, f.sub.1=0 and f.sub.N is lower than or equal to 4000 Hz, or lower than or equal to 3000 Hz. It should be noted that the determination of the audio spectra may be performed with any suitable spectral resolution. Also, the frequencies f.sub.n may be regularly spaced in some embodiments or irregularly spaced in other embodiments.
(24) The sensor audio spectrum X.sub.f of the sensor audio signal x.sub.t corresponds to a set of values {X.sub.f(f.sub.n), 1≤n≤N}. The speaker audio spectrum S.sub.f of the speaker audio signal s.sub.t corresponds to a set of values {S.sub.f(f.sub.n), 1≤n≤N}. Typically, the sensor audio spectrum X.sub.f corresponds to a complex spectrum such that X.sub.f(f.sub.n) comprises both: a magnitude value representative of the power of the sensor audio signal x.sub.t at frequency f.sub.n, a phase value of the sensor audio signal x.sub.t at the frequency f.sub.n.
(25) The speaker audio spectrum S.sub.f corresponds for instance also to a complex spectrum comprising both a magnitude value and a phase value for each frequency f.sub.n. However, it is possible to use only the magnitude values for the speaker audio signal such that the speaker audio spectrum S.sub.f may consist of a magnitude spectrum. In the following, we consider in a non-limitative manner that the determined speaker audio spectrum S.sub.f corresponds to a complex spectrum.
(26) For instance, if the sensor audio spectrum is computed by an FFT, then X.sub.f(f.sub.n) can correspond to FFT[x.sub.t](f.sub.n). The corresponding magnitude spectrum is designated by ‖X.sub.f‖, wherein ‖X.sub.f(f.sub.n)‖ corresponds for instance to |X.sub.f(f.sub.n)| (i.e. modulus or absolute value of X.sub.f(f.sub.n)), or to |X.sub.f(f.sub.n)|.sup.2 (i.e. power of X.sub.f(f.sub.n)). Similarly, if the speaker audio spectrum is computed by an FFT, then S.sub.f(f.sub.n) can correspond to FFT[s.sub.t](f.sub.n). The corresponding magnitude spectrum is designated by ‖S.sub.f‖, wherein ‖S.sub.f(f.sub.n)‖ corresponds for instance to |S.sub.f(f.sub.n)| (i.e. modulus or absolute value of S.sub.f(f.sub.n)), or to |S.sub.f(f.sub.n)|.sup.2 (i.e. power of S.sub.f(f.sub.n)).
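To make the two magnitude conventions concrete, here is a small sketch that computes a complex spectrum and then either its modulus or its power. A naive DFT stands in for the FFT purely to keep the example dependency-free; the function names and the test signal are hypothetical.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) DFT of a real frame; returns N complex bins."""
    n_points = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n_points)
            for t, x in enumerate(frame))
        for k in range(n_points)
    ]

def magnitude_spectrum(spectrum, power=False):
    """|X(f_n)| by default, or |X(f_n)|^2 when power=True."""
    return [abs(v) ** 2 if power else abs(v) for v in spectrum]

# Hypothetical test signal: a cosine sitting exactly on bin 2 of an 8-point DFT.
frame = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
X = dft(frame)
mag = magnitude_spectrum(X)
power_spec = magnitude_spectrum(X, power=True)
```

Whichever convention is chosen, it should be used consistently for the sensor, speaker and echo spectra, since the later steps compare them bin by bin.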
(27) It should be noted that, in some embodiments, the sensor and speaker audio spectra can optionally be smoothed over time, for instance by using exponential averaging with a configurable time constant.
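One possible reading of the exponential averaging mentioned above is a one-pole recursive average applied independently to each frequency bin. The sketch below assumes the common mapping alpha = exp(-T/tau) between the time constant tau and the frame period T; that mapping, the function names and the numeric values are illustrative assumptions, not taken from the text.

```python
import math

def smoothing_coefficient(time_constant_s, frame_period_s):
    """One-pole coefficient: the closer to 1, the slower the average moves."""
    return math.exp(-frame_period_s / time_constant_s)

def smooth_spectrum(previous, current, alpha):
    """Per-bin exponential average: alpha * previous + (1 - alpha) * current."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]

# Hypothetical settings: 20 ms time constant, 4 ms frames.
alpha = smoothing_coefficient(time_constant_s=0.020, frame_period_s=0.004)
state = [0.0, 0.0, 0.0]
for _ in range(200):                    # feed a constant magnitude spectrum
    state = smooth_spectrum(state, [1.0, 2.0, 4.0], alpha)
```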
(28) As illustrated by
(29)
(30) Basically, the spectral transfer function is an estimate of the frequency domain response of the acoustic path which includes the speaker unit 12, a propagation channel between the speaker unit 12 and the audio sensor 11, and the audio sensor 11. As for the audio spectra, the determined spectral transfer function can be composed of magnitude values or complex values. However, it is possible to use only magnitude values for the spectral transfer function. In the sequel, we consider in a non-limitative manner that the spectral transfer function is composed of magnitude values and is designated by ‖W.sub.f‖. Similarly, the estimated echo audio spectrum can be a complex spectrum or a magnitude spectrum. However, it is possible to use only magnitude values for the echo audio spectrum. In the sequel, we consider in a non-limitative manner that the estimated echo audio spectrum is a magnitude spectrum, designated by ‖E.sub.f‖. The spectral transfer function ‖W.sub.f‖ is applied to the speaker audio spectrum, on a frequency by frequency basis, for instance as follows (with f.sub.1≤f.sub.n≤f.sub.N):
‖E.sub.f(f.sub.n)‖=‖W.sub.f(f.sub.n)‖·‖S.sub.f(f.sub.n)‖
(31) The spectral transfer function ‖W.sub.f‖ may be predefined and remain static over time. For instance, the spectral transfer function may be obtained beforehand, e.g. by calibration of the acoustic path (which includes the speaker unit 12, a propagation channel between the speaker unit 12 and the audio sensor 11, and the audio sensor 11), independently from the speaker audio spectrum and the sensor audio spectrum computed for the current audio frame.
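The per-frequency product above can be sketched as follows, with magnitude spectra held as plain lists. The function name and the 4-bin example values are hypothetical.

```python
def estimate_echo_magnitude(transfer_mag, speaker_mag):
    """Per-bin product |E(f_n)| = |W(f_n)| * |S(f_n)|."""
    if len(transfer_mag) != len(speaker_mag):
        raise ValueError("spectra must have the same number of bins")
    return [w * s for w, s in zip(transfer_mag, speaker_mag)]

# Hypothetical 4-bin example values.
W = [0.5, 0.25, 0.1, 0.0]
S = [2.0, 4.0, 10.0, 3.0]
E = estimate_echo_magnitude(W, S)
```

A zero weight, as in the last bin, simply predicts no echo at that frequency regardless of the speaker level.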
(32) In preferred embodiments, and as illustrated by
(33)
(34) Hence, in preferred embodiments, and as illustrated by
(35) Basically, the spectral transfer function updating criterion aims at determining whether or not the current speaker audio spectrum and the current sensor audio spectrum are suitable for updating the spectral transfer function. As already discussed above, parameters that may be taken into account include e.g.: the amount of noise (a noisy environment, which affects only the sensor audio spectrum, degrades the accuracy of the estimation of the spectral transfer function), the amount of echo (a low echo might be difficult to distinguish from e.g. noise and/or the user's voice, and without any echo the spectral transfer function cannot be estimated), the presence of the user's voice (the user's voice, which affects only the sensor audio spectrum, degrades the accuracy of the estimation of the spectral transfer function).
(36) Accordingly, the spectral transfer function updating criterion may evaluate at least one among the presence of noise, the presence of echo and the presence of the user's voice.
(37) For instance, evaluating whether the spectral transfer function updating criterion is satisfied may comprise determining whether the sensor audio signal includes a voice audio signal corresponding to a voice acoustic signal emitted by the user, and the spectral transfer function updating criterion is satisfied when it is determined that the sensor audio signal does not include a voice audio signal. The present disclosure may use any voice activity detection method known to the skilled person, and the choice of a specific method corresponds to a specific and non-limitative embodiment of the present disclosure.
(38) Alternatively, or in combination thereof, evaluating whether the spectral transfer function updating criterion is satisfied may comprise determining whether the sensor audio signal includes a noise audio signal having a noise level (wherein the noise level is representative of the power of the noise) below a predetermined noise threshold, and the spectral transfer function updating criterion is satisfied when it is determined that the noise level is below said predetermined noise threshold. The present disclosure may use any noise level estimation method known to the skilled person, and the choice of a specific method corresponds to a specific and non-limitative embodiment.
(39) Alternatively, or in combination thereof, evaluating whether the spectral transfer function updating criterion is satisfied may comprise determining whether the sensor audio signal includes an echo audio signal having an echo level (estimated for instance by determining the level/power of the speaker audio signal) above a predetermined echo threshold, and the spectral transfer function updating criterion is satisfied when it is determined that the echo level is above said predetermined echo threshold. The present disclosure may use any echo level estimation method known to the skilled person, and the choice of a specific method corresponds to a specific and non-limitative embodiment.
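The three conditions described above may be used alone or in combination. As one illustration, the sketch below requires all three at once; that choice, the function name and the threshold values are assumptions made for the example, and the voice, noise and echo estimates are taken as given inputs from whatever detectors are used.

```python
def update_allowed(voice_detected, noise_level, echo_level,
                   noise_threshold, echo_threshold):
    """True when the current frame looks suitable for updating the transfer function."""
    return (not voice_detected
            and noise_level < noise_threshold
            and echo_level > echo_threshold)

# Hypothetical levels and thresholds (arbitrary linear power units).
ok = update_allowed(voice_detected=False, noise_level=0.01, echo_level=0.5,
                    noise_threshold=0.1, echo_threshold=0.2)
```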
(40) The spectral transfer function corresponds for instance to a set of weights {‖W.sub.f(f.sub.n)‖, 1≤n≤N}. It should be noted that the spectral transfer function updating criterion can be evaluated for the whole frequency band [f.sub.1,f.sub.N], or it can be evaluated independently for different sub-bands of the frequency band or even for each frequency f.sub.n. Hence, it is possible in some cases to update selectively only some of the weights ‖W.sub.f(f.sub.n)‖. For instance, it is possible to update only the weight ‖W.sub.f(f.sub.n)‖ (and not the weights ‖W.sub.f(f.sub.m)‖ with m≠n) if the spectral transfer function updating criterion is satisfied only for the frequency f.sub.n. In such cases, the noise level, the voice presence or the echo level are evaluated separately for each considered sub-band or each considered frequency.
(41) In some embodiments, spectral transfer functions determined successively for successive audio frames can be smoothed over time. Such a smoothing may for instance be performed using exponential averaging with a predetermined time constant. Preferably, asymmetric exponential averaging may be used with predetermined attack and release time constants to enable e.g. faster decrease of the weights ‖W.sub.f(f.sub.n)‖ compared to increase of the weights.
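A selective, per-frequency update could look like the sketch below. The update rule used here, the ratio of sensor magnitude to speaker magnitude, is only one plausible way of "comparing" the two spectra and is an assumption of this example, as are the function name and the values.

```python
def update_weights(weights, sensor_mag, speaker_mag, bin_ok, eps=1e-12):
    """Return new weights; only bins flagged in bin_ok are re-estimated."""
    updated = []
    for w, x, s, ok in zip(weights, sensor_mag, speaker_mag, bin_ok):
        if ok and s > eps:
            updated.append(x / s)   # assumed comparison: magnitude ratio
        else:
            updated.append(w)       # criterion not met (or no speaker energy): keep
    return updated

# Hypothetical 3-bin example: bin 0 is updated, bin 1 fails the criterion,
# bin 2 passes the criterion but has no speaker energy to compare against.
W = update_weights([0.5, 0.5, 0.5], [1.0, 3.0, 2.0], [4.0, 4.0, 0.0],
                   [True, False, True])
```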
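Asymmetric exponential averaging can be sketched as a one-pole average whose coefficient depends on the direction of change. In this sketch "attack" is taken to mean an increasing weight and "release" a decreasing one; that naming, the function name and the coefficient values are assumptions for illustration.

```python
def smooth_weight(previous, target, attack_coeff, release_coeff):
    """One asymmetric exponential-averaging step for a single weight.

    A coefficient near 1 means slow tracking, near 0 means fast tracking.
    """
    coeff = attack_coeff if target > previous else release_coeff
    return coeff * previous + (1.0 - coeff) * target

# Hypothetical coefficients: slow increase (attack), fast decrease (release).
ATTACK, RELEASE = 0.75, 0.25
up = smooth_weight(0.0, 1.0, ATTACK, RELEASE)     # moves 0.25 of the way up
down = smooth_weight(1.0, 0.0, ATTACK, RELEASE)   # moves 0.75 of the way down
```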
(42) As illustrated by
(43) The MMSE-LSA estimator is a well-known algorithm for denoising and for speech enhancement (i.e. for making speech sound more natural), see for instance [EPHRAIM85], the contents of which are hereby incorporated by reference in their entirety. The MMSE-LSA estimator is a recursive algorithm which minimizes the mean square error of logarithmically weighted amplitudes. Hence, the MMSE-LSA estimator requires only magnitude values (amplitudes) for the estimated echo audio spectrum and for the sensor audio spectrum (i.e. no phase values required). In the sequel, the MMSE-LSA is applied on the estimated echo audio (magnitude) spectrum ‖E.sub.f‖ and on the sensor audio (magnitude) spectrum ‖X.sub.f‖.
(44)
(45) As illustrated by
(46) In some embodiments, the SER probabilities determined successively for successive audio frames may be smoothed over time to homogenize the behavior of the system, with a tradeoff in amount of echo removed.
(47) The MMSE-LSA estimator then comprises a step S232 of computing LSA gains which are the result of a spectral LSA gain function of both the a priori SER and the a posteriori SER, which aims at finding the ideal gains that maximize the output SER.
(48) The LSA gain function is usually based on the exponential integral function, which may be rather heavy in terms of computational complexity, and not straightforward to implement. In preferred embodiments, mathematical properties of the exponential integral function are leveraged to derive a linear approximation, which can be adjusted via a single slope factor. Such a linear function approximating the exponential integral function is lighter in terms of computational complexity and gives accurate enough results in most cases.
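For reference, the exact (non-approximated) LSA gain from [EPHRAIM85] can be written as G = ξ/(1+ξ) · exp(½ · E1(v)) with v = ξγ/(1+ξ), where ξ is the a priori ratio, γ the a posteriori ratio and E1 the exponential integral. The sketch below evaluates E1 numerically with a power series and a continued fraction, and takes both SER values as given inputs. It does not reproduce the linear approximation mentioned above, whose slope factor is not specified here; the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp_integral_e1(x, max_iter=200, eps=1e-12):
    """E1(x) = integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("E1 is only evaluated here for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, max_iter + 1):
            term *= -x / k              # term is now (-x)^k / k!
            total += term / k
            if abs(term / k) < eps:
                break
        return -EULER_GAMMA - math.log(x) - total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1.0 / 1e-300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)

def lsa_gain(a_priori_ser, a_posteriori_ser):
    """LSA gain: xi/(1+xi) * exp(0.5 * E1(v)), with v = xi*gamma/(1+xi)."""
    xi, gamma = a_priori_ser, a_posteriori_ser
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))
```

With a high a priori ratio (little echo expected) the gain approaches 1, and with a low one it drops toward 0. The raw formula can exceed 1 when the a posteriori ratio is very small, so practical implementations usually bound the result.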
(49) As illustrated by
(50) The LSA gains correspond to the echo suppression gains output during step S230 of the audio signal processing method 20. The echo suppression gains (which are non-complex gains) are referred to by ‖G.sub.f(f.sub.n)‖. The echo suppression gains aim at attenuating the magnitude of frequency components of the sensor audio spectrum which are impacted by echo. Accordingly, the echo suppression gains are typically such that 0<‖G.sub.f(f.sub.n)‖≤1 (i.e. non-positive when expressed in decibels, dB: ‖G.sub.f(f.sub.n)‖.sub.dB≤0).
(51) In some embodiments, the echo suppression gains may be constrained by one or more minimum possible values predetermined for said echo suppression gains. This makes it possible, for instance, to control to some extent how much echo is suppressed and to avoid strongly attenuating non-echo content in case the amount of echo was overestimated. If we denote by ‖G.sub.min‖ the minimum possible value, then the echo suppression gains may be such that ‖G.sub.min‖≤‖G.sub.f(f.sub.n)‖. For instance, ‖G.sub.min‖.sub.dB=−30 dB or ‖G.sub.min‖.sub.dB=−10 dB or ‖G.sub.min‖.sub.dB=−6 dB. It should be noted that it is possible to have different minimum possible values associated with different frequencies or frequency sub-bands, i.e. ‖G.sub.min(f.sub.n)‖ may vary with the frequency f.sub.n. This makes it possible e.g. to be more aggressive in frequency sub-bands where echo is likely to be more present, and more conservative in frequency sub-bands where the sensor audio spectrum is likely to include e.g. voice and needs to be preserved.
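A frequency-dependent gain floor can be sketched as a simple clamp. The example assumes the gains act on amplitudes, so dB values are converted with the 20·log10 convention (a power convention would use 10·log10 instead), and it also clamps at 1 from above; both choices, like the function names and values, are assumptions of this sketch.

```python
def db_to_linear(gain_db):
    """Convert an amplitude gain in dB to a linear factor (20*log10 convention)."""
    return 10.0 ** (gain_db / 20.0)

def apply_gain_floor(gains, min_gains_db):
    """Clamp each gain to [G_min(f_n), 1]; min_gains_db holds one floor per bin."""
    return [min(1.0, max(g, db_to_linear(floor_db)))
            for g, floor_db in zip(gains, min_gains_db)]

# Hypothetical 3-bin example: a deep floor in bin 0, shallower floors elsewhere.
floored = apply_gain_floor([0.001, 0.8, 1.7], [-20.0, -6.0, -6.0])
```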
(52) As illustrated by
{tilde over (X)}.sub.f(f.sub.n)=‖G.sub.f(f.sub.n)‖·X.sub.f(f.sub.n)
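Because the gains are real and non-negative, multiplying the complex sensor spectrum by them scales each bin's magnitude and leaves its phase untouched. A minimal sketch, with a hypothetical function name and example values:

```python
import cmath

def apply_gains(gains, sensor_spectrum):
    """Scale each complex bin by its real gain; the phase is left unchanged."""
    return [g * x for g, x in zip(gains, sensor_spectrum)]

# Hypothetical 3-bin complex spectrum and gains.
X = [1 + 1j, -2j, 3 + 0j]
G = [0.5, 1.0, 0.1]
X_enh = apply_gains(G, X)
```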
(53) Of course, the enhanced sensor audio spectrum {tilde over (X)}.sub.f may be converted back to time domain to produce an enhanced sensor audio signal {tilde over (x)}.sub.t (not represented in the figures). However, the conversion to time domain may also be performed at a later stage, after performing additional processing steps in frequency domain on the enhanced sensor audio spectrum {tilde over (X)}.sub.f (for instance, filtering {tilde over (X)}.sub.f, denoising {tilde over (X)}.sub.f, combining {tilde over (X)}.sub.f with other audio spectra, etc.).
(54)
(55) In addition to the steps discussed in relation to
(56) In
(57) It should be noted that the proposed echo suppression algorithm can be used alone, or in combination with other echo cancellation algorithms.
(58) In some embodiments, the audio system 10 may comprise more than one audio sensor. In such a case, the proposed solution may be applied to mitigate echo in all, or only part of the sensor audio signals produced by the audio sensors of the audio system 10. In other words, the proposed solution is applied to at least one of the audio sensors of the audio system 10.
(59)
(60) One of the audio sensors is referred to as internal sensor 110. The internal sensor 110 is referred to as internal because it is arranged to measure voice acoustic signals which propagate internally through the user's head. For instance, the internal sensor 110 may be an air conduction sensor (e.g. microphone) to be located in an ear canal of a user and arranged on the wearable device towards the interior of the user's head, or a bone conduction sensor (e.g. accelerometer, vibration sensor). The internal sensor 110 may be any type of bone conduction sensor or air conduction sensor known to the skilled person.
(61) The other audio sensor is referred to as external sensor 111. The external sensor 111 is referred to as external because it is arranged to measure voice acoustic signals which propagate externally to the user's head (via the air between the user's mouth and the external sensor 111). The external sensor 111 is an air conduction sensor (e.g. microphone) to be located outside the ear canals of the user, or to be located inside an ear canal of the user but arranged on the wearable device towards the exterior of the user's head. The external sensor 111 may be any type of air conduction sensor known to the skilled person.
(62) For instance, if the audio system 10 is included in a pair of earbuds (one earbud for each ear of the user), then the internal sensor 110 is for instance arranged with the speaker unit 12 in a portion of one of the earbuds that is to be inserted in the user's ear, while the external sensor 111 is for instance arranged in a portion of one of the earbuds that remains outside the user's ears. In some cases, the audio system 10 may comprise two or more internal sensors 110 (for instance one or two for each earbud) and/or two or more external sensors 111 (for instance one for each earbud) and/or two or more speaker units 12 (for instance one for each earbud).
(63) In some cases, the audio signal processing method 20 may be applied to both an internal audio signal produced by the internal sensor 110 and an external audio signal produced by the external sensor 111.
(64) In some cases, it is possible to consider the same frequency band [f.sub.1, f.sub.N] for processing the internal audio signal and the external audio signal.
(65) However, in other embodiments, it is also possible to consider different frequency bands for the internal audio sensor 110 and for the external audio sensor 111. For instance, it is possible to use a first frequency band [f.sub.1,I, f.sub.N,I] for the internal sensor 110, with f.sub.1≤f.sub.1,I<f.sub.N,I≤f.sub.N, and a second frequency band [f.sub.1,E, f.sub.N,E] for the external sensor 111, with f.sub.1≤f.sub.1,E<f.sub.N,E≤f.sub.N. Hence, in such cases, echo suppression gains may be computed and applied to the internal audio spectrum only on the first frequency band, and echo suppression gains may be computed and applied to the external audio spectrum only on the second frequency band.
(66) In some cases, audio signals from the internal sensor and the external sensor can be mixed together for e.g. mitigating noise, by using the internal audio signal mainly for low frequencies while using the external audio signal for higher frequencies. Hence, in such cases, it might not be necessary to perform echo suppression in higher frequencies of the internal audio signal and, in preferred embodiments, the first and second frequency bands may be such that f.sub.N,I<f.sub.N,E. Similarly, it might not be necessary to perform echo suppression in lower frequencies of the external audio signal and, in preferred embodiments, the first and second frequency bands may be such that f.sub.1,I<f.sub.1,E. For instance, the internal audio signal is used mainly below a crossover frequency f.sub.CROSS and the external audio signal is used mainly above said crossover frequency f.sub.CROSS. The crossover frequency f.sub.CROSS may be static over time or may be dynamically adjusted based on the operating conditions, and f.sub.1,E≤f.sub.CROSS≤f.sub.N,I. Such different frequency bands may also be used even when the audio signals from the internal sensor and the external sensor are not mixed together.
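The band relations above can be illustrated by selecting, for each sensor, the bin indices on which echo suppression gains would be computed. The frequency grid, band edges and crossover value below are hypothetical and chosen only so that the stated inequalities hold.

```python
def band_indices(freqs_hz, low_hz, high_hz):
    """Indices n whose frequency f_n lies in the closed band [low_hz, high_hz]."""
    return [n for n, f in enumerate(freqs_hz) if low_hz <= f <= high_hz]

# Hypothetical grid: 125 Hz spacing from 0 Hz up to 8 kHz.
freqs = [125.0 * n for n in range(65)]
internal_bins = band_indices(freqs, 0.0, 2000.0)      # first band, internal sensor
external_bins = band_indices(freqs, 1000.0, 8000.0)   # second band, external sensor
F_CROSS_HZ = 1500.0   # lies between the external lower edge and the internal upper edge
overlap_bins = sorted(set(internal_bins) & set(external_bins))
```

Here the two bands overlap between 1000 Hz and 2000 Hz, which leaves room for the crossover frequency to move within that range.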
(67) It is emphasized that the present disclosure is not limited to the above exemplary embodiments. Variants of the above exemplary embodiments are also within the scope of the present invention.
REFERENCES
(68) [EPHRAIM85] Ephraim, Y.; Malah, D., Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 2, pp. 443-445, April 1985.