Background noise estimation using gap confidence
11587576 · 2023-02-21
Assignee
Inventors
CPC classification
H04R2227/001
ELECTRICITY
H04R3/02
ELECTRICITY
International classification
H04R3/02
ELECTRICITY
Abstract
A noise estimation method including steps of generating gap confidence values in response to microphone output and playback signals, and using the gap confidence values to generate an estimate of background noise in a playback environment. Each gap confidence value is indicative of confidence of presence of a gap at a corresponding time in the playback signal, and the estimate may be a combination of candidate noise estimates weighted by the gap confidence values. Generation of the candidate noise estimates may but need not include performance of echo cancellation. Optionally, noise compensation is performed on an audio input signal using the generated background noise estimate. Other aspects are systems configured to perform any embodiment of the noise estimation method.
Claims
1. An audio processing method, comprising: receiving microphone output signals from a microphone of a playback environment, the microphone output signals corresponding to playback content reproduced by one or more loudspeakers and detected by the microphone, the microphone output signals also corresponding to background noise in the playback environment detected by the microphone; receiving playback content values corresponding to the playback content; and generating gap confidence values in response to the microphone output signals and the playback content values, where each of the gap confidence values is for a different time, t, and is indicative of a confidence that there is a gap, at the time t, in the playback content, wherein a gap denotes a time or time interval at or in which playback content is missing or has a level less than a predetermined threshold, and wherein generating the gap confidence values includes generating a gap confidence value for each time, t, including by: determining a minimum in the playback content values for the time, t; processing the microphone output signals to determine a smoothed level of the microphone output signal for the time, t; and determining the gap confidence value for the time, t, to be indicative of how different the minimum in playback content values for the time, t, is from the smoothed level of the microphone output signals for the time, t; and generating an estimate of background noise in the playback environment using the gap confidence values.
2. The method of claim 1, wherein the microphone output signals and the playback content values are frequency banded and wherein the gap denotes a time or time interval at or in which playback content is missing or has a level less than a predetermined threshold in one or more frequency bands.
3. The method of claim 2, wherein the gap confidence value is determined for at least one of the one or more frequency bands.
4. The method of claim 3, further comprising: determining a plurality of gap confidence values for each frequency band of the one or more frequency bands; and determining a gap health value based on the plurality of gap confidence values, the gap health value indicating how up-to-date a noise estimate is for each frequency band of the one or more frequency bands.
5. The method of claim 4, wherein the plurality of gap confidence values comprise n most recent gap confidence values, wherein n is an integer.
6. The method of claim 5, wherein determining the gap health value comprises adding the n most recent gap confidence values and dividing a resulting sum by n.
7. The method of claim 4, further comprising identifying one or more frequency bands for which a noise estimate will be updated based, at least in part, on the gap health value.
8. The method of claim 7, wherein the identifying comprises identifying a first healthy frequency band having a gap health value at or above a gap health value threshold.
9. The method of claim 8, further comprising evaluating one or more frequency bands, including at least one frequency band adjacent the first healthy frequency band, to locate at least one unhealthy frequency band having a gap health value below the gap health value threshold.
10. The method of claim 9, further comprising identifying a second healthy frequency band having a gap health value at or above the gap health value threshold.
11. The method of claim 10, further comprising computing a noise estimate for at least one unhealthy frequency band between the first healthy frequency band and the second healthy frequency band.
12. The method of claim 11, wherein computing the noise estimate involves performing a linear interpolation between a noise estimate for the first healthy frequency band and a noise estimate for the second healthy frequency band.
13. The method of claim 12, wherein the linear interpolation is performed in a logarithmic domain.
14. The method of claim 1, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, each of the noise estimates is an estimate of background noise in the playback environment at a different time, t, and each of the noise estimates is a combination of candidate noise estimates for a different time interval including the time t, wherein the candidate noise estimates have been weighted by the gap confidence values.
15. The method of claim 1, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, each of the noise estimates is an estimate of background noise in the playback environment at a different time, t, and wherein generating the estimate of the background noise in the playback environment using the gap confidence values involves, for each noise estimate, weighting candidate noise estimates for a different time interval including the time t by the gap confidence values and combining the weighted candidate noise estimates to obtain a respective noise estimate.
16. The method of claim 1, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, and wherein the method also includes performing noise compensation on an audio input signal using the sequence of noise estimates.
17. The method of claim 16, wherein performing noise compensation on the audio input signal involves generation of the playback signal and wherein the method further comprises driving at least one loudspeaker with the playback signal to generate sound.
18. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to perform a method, the method comprising: receiving microphone output signals from a microphone of a playback environment, the microphone output signals corresponding to playback content reproduced by one or more loudspeakers and detected by the microphone, the microphone signals also corresponding to background noise in the playback environment detected by the microphone; receiving playback content values corresponding to the playback content; generating gap confidence values in response to the microphone output signals and the playback content values, where each of the gap confidence values is for a different time, t, and is indicative of a confidence that there is a gap, at the time t, in the playback content, wherein a gap denotes a time or time interval at or in which playback content is missing or has a level less than a predetermined threshold, and wherein generating the gap confidence values includes generating a gap confidence value for each time, t, including by: determining a minimum in the playback content values for the time, t; processing the microphone output signals to determine a smoothed level of the microphone output signals for the time, t; and determining the gap confidence value for the time, t, to be indicative of how different the minimum in playback content values for the time, t, is from the smoothed level of the microphone output signals for the time, t; and generating an estimate of background noise in the playback environment using the gap confidence values.
19. The one or more non-transitory media of claim 18, wherein the method also involves performing noise compensation on an audio input signal based, at least in part, on the estimate of background noise.
20. An apparatus, comprising: an input system configured for: receiving microphone output signals from a microphone of a playback environment, the microphone output signals corresponding to playback content reproduced by one or more loudspeakers and detected by the microphone, the microphone signals also corresponding to background noise in the playback environment detected by the microphone; and receiving playback content values corresponding to the playback content; and a noise estimation subsystem configured for generating gap confidence values in response to the microphone output signals and the playback content values, where each of the gap confidence values is for a different time, t, and is indicative of a confidence that there is a gap, at the time t, in the playback content, wherein a gap denotes a time or time interval at or in which playback content is missing or has a level less than a predetermined threshold, and wherein generating the gap confidence values includes generating a gap confidence value for each time, t, including by: determining a minimum in the playback content values for the time, t; processing the microphone output signal to determine a smoothed level of the microphone output signals for the time, t; and determining the gap confidence value for the time, t, to be indicative of how different the minimum in playback content values for the time, t, is from the smoothed level of the microphone output signals for the time, t; and generating an estimate of background noise in the playback environment using the gap confidence values.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
NOTATION AND NOMENCLATURE
(5) Throughout this disclosure, including in the claims, a “gap” in a playback signal denotes a time (or time interval) of the playback signal at (or in) which playback content is missing (or has a level less than a predetermined threshold).
(6) Throughout this disclosure, including in the claims, “speaker” and “loudspeaker” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), all driven by a single, common speaker feed (the speaker feed may undergo different processing in different circuitry branches coupled to the different transducers).
(7) Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
(8) Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X−M inputs are received from an external source) may also be referred to as a decoder system.
(9) Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
(10) Throughout this disclosure including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
DETAILED DESCRIPTION OF EMBODIMENTS
(11) Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Some embodiments of the inventive system and method are described herein with reference to
(12) The system of
(13) Noise estimation subsystem 37 of
(14) The
(15) Speaker system 29 (including at least one speaker) is coupled and configured to emit sound (in playback environment 28) in response to playback signal 25. Signal 25 may consist of a single playback channel, or it may consist of two or more playback channels. In typical operation, each speaker of speaker system 29 receives a speaker feed indicative of the playback content of a different channel of signal 25, and speaker system 29 emits sound (in playback environment 28) in response to the speaker feed(s). The sound is perceived by listener 31 (in environment 28) as a noise-compensated version of the playback content of input signal 23.
(16) The other elements of the
(17) The present disclosure will refer to the following three types of background noise:
(18) distracting noise (impulsive and infrequent events, e.g., having duration less than 0.5 second, such as a door slamming, an automobile horn sounding, or driving over a road bump);
(19) disrupting noise (short events that interfere with playback content, e.g., an airplane passing overhead, driving through a short tunnel, driving over a section of new road surface); and
(20) pervasive noise (persistent/constant noise that can start and stop, but generally remains steady, e.g., air conditioning, fans, ambient metropolitan noise, rain, kitchen appliances).
(21) In order of importance based on experimentation by the inventors, the characteristics of successful noise compensation include the following:
(22) stability (the noise estimate should not be corrupted by the playback content measured at the microphone. The noise estimate and therefore compensation gain should not fluctuate in a noticeable way due to changes in playback content. No noise estimate should track anything faster than the “disrupting” sources of noise. A noise estimate should ignore “distracting” impulsive events);
(23) fast reaction time (a good noise estimate will track only the “pervasive” sources of noise. A great noise estimate however will also be reliably able to track “disrupting” sources of noise. Reacting quickly to a change in noise conditions is highly important to the user experience); and
(24) comfortable compensation amount (noise compensation should ensure preserved intelligibility and timbre in the presence of noise. Compensating too low or too high makes the user experience unsatisfactory. Compensation is performed in a multi-band sense, with more fidelity than a bulk volume adjustment).
(25) Noise estimation using minimum following filters to track stationary noise is an established art. To perform such estimation, a minimum follower filter accumulates input samples into a sliding fixed size buffer called the analysis window, and outputs the smallest sample value in that buffer. Minimum following removes impulsive, distracting sources of noise, for both short and long analysis windows. A long analysis window (having duration on the order of 10 sec) is effective at locating a stationary noise floor (pervasive noise), as the minimum follower will hold onto minima that occur during gaps in the playback content, and in between any user's speech in the vicinity of the microphone. The longer the analysis window, the more likely it is that a gap will be found. However, this approach will follow minima regardless of whether they are actually gaps in the playback content or not. Furthermore, a long analysis window causes the system to take a long time to track upwards to increases in background noise, which becomes a significant disadvantage for noise compensation. A long analysis window will typically track pervasive sources of noise eventually, but miss out on tracking disrupting sources of noise.
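The minimum follower described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the class name `MinimumFollower`, the `push` method, and the choice to express the analysis window as a sample count are all assumptions made for the example.

```python
from collections import deque


class MinimumFollower:
    """Sliding-window minimum over the most recent `window` samples.

    Hypothetical helper for illustration; `window` is the analysis
    window length expressed as a number of samples (an assumption).
    """

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be at least 1 sample")
        self.window = window
        self._index = 0
        # (sample_index, value) pairs kept in increasing order of value,
        # so the current minimum is always at the left end.
        self._candidates = deque()

    def push(self, value: float) -> float:
        # Older samples that are not smaller than the new one can never
        # be the minimum again, so discard them.
        while self._candidates and self._candidates[-1][1] >= value:
            self._candidates.pop()
        self._candidates.append((self._index, value))
        # Expire the front candidate once it leaves the analysis window.
        if self._candidates[0][0] <= self._index - self.window:
            self._candidates.popleft()
        self._index += 1
        return self._candidates[0][1]


follower = MinimumFollower(window=3)
print([follower.push(v) for v in [5, 3, 4, 6, 7, 8, 2]])
# prints [5, 3, 3, 3, 4, 6, 2]
```

The output shows both properties noted in the text: the follower holds a minimum for as long as it stays inside the window, and it only rises once that minimum has expired, which is why a long window is slow to track increases in noise.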
(26) An important aspect of typical embodiments of the present invention is to use knowledge of the playback signal to decide when conditions are most favorable to measure the noise estimate from the microphone output (and optionally also from an echo cancelled noise estimate, generated by performing echo cancellation on the microphone output). Realistic playback signals viewed in the time-frequency domain will typically contain points where the signal energy is low, which implies that those points in time and frequency are good opportunities to measure the ambient noise conditions. An important aspect of typical embodiments of the present invention is a method of quantifying how good these opportunities are (e.g., by assigning to each of them a value to be referred to as a “gap confidence” value or “gap confidence”). Approaching the problem in this way makes noise compensation (or noise estimation) possible for many types of content without requiring an echo canceller (to generate an echo cancelled noise estimate) and lowers the requirements of an echo canceller's performance (when an echo canceller is used).
(27) Next, with reference to
(28) A microphone output signal (e.g., signal “Mic” of
(29) Each channel of the playback content (e.g., each channel of noise compensated signal 25 of
(30) The
(31) In the case that signal 25 is a multi-channel signal (comprising Z playback channels), a typical implementation of echo canceller 34 receives (from element 26) multiple streams of frequency-domain playback content values (one stream for each channel), and adapts a filter W′.sub.i (corresponding to filter W′ of
(32) In general, an echo cancelled noise estimate is obtained by applying echo cancellation (wherein the echo results from or relates to the sound/audio content of the playback signal) to the microphone output signal. As such, an echo cancelled noise estimate (echo cancelled noise estimate value) may be said to be obtained by cancelling the echo resulting from or relating to the sound (or, put differently, resulting from or relating to the audio content of the playback signal) from the microphone output signal. This may be done in the frequency domain.
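As a rough illustration of frequency-domain echo cancellation, the sketch below subtracts a predicted echo from the microphone spectrum bin by bin. It assumes that each playback channel has a single complex coefficient per frequency bin and that the coefficients have already been adapted; both are simplifications made for the example, and the function names are hypothetical.

```python
def echo_cancelled_residual(mic_bins, playback_bins_per_channel, filters_per_channel):
    """Subtract the predicted echo of every playback channel from the
    microphone spectrum.

    Assumption: one complex filter coefficient per bin and per channel.
    The adaptation of the coefficients is not shown.
    """
    residual = list(mic_bins)
    for playback_bins, filter_bins in zip(playback_bins_per_channel, filters_per_channel):
        for k, (x, w) in enumerate(zip(playback_bins, filter_bins)):
            residual[k] -= w * x  # remove this channel's echo in bin k
    return residual


def bin_energies(bins):
    """Energy (squared magnitude) of each complex bin."""
    return [abs(b) ** 2 for b in bins]


# One playback channel, two bins. The microphone picks up echo plus noise.
noise = [1 + 0j, 0.5j]
playback = [[2 + 0j, 1 + 1j]]
filters = [[0.5 + 0j, 0.25 + 0j]]
mic = [n + w * x for n, w, x in zip(noise, filters[0], playback[0])]
residual = echo_cancelled_residual(mic, playback, filters)
```

When the filter matches the echo path exactly, the residual equals the noise; in practice the match is imperfect, which is why the residual is treated only as a candidate noise estimate.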
(33) The filter coefficients of each adaptive filter employed by echo canceller 34 to generate the echo cancelled noise estimate values (i.e., each adaptive filter implemented by echo canceller 34 which corresponds to filter W′ of
(34) Optionally, echo canceller 34 is omitted (or does not operate), and thus no adaptive filter values are provided to banding element 36, and no banded adaptive filter values are provided from 36 to subsystem 43. In this case, subsystem 43 generates the gain values G in one of the ways (described below) without use of banded adaptive filter values.
(35) If an echo canceller is used (i.e. if the
(36) If no echo canceller is used (i.e., if echo canceller 34 is omitted or does not operate), the values M′res (in the description herein of
(37) In typical implementations (including that shown in
(38) In the implementation shown in
(39) minima in the values Mres (the echo canceller residual) can confidently be considered to indicate estimates of noise in the playback environment; and
(40) minima in the M (microphone output signal) values can confidently be considered to indicate estimates of noise in the playback environment.
(41) The inventors have also recognized that, at times other than during a gap in playback content, minima in the values Mres (or the values M) may not be indicative of accurate estimates of noise in the playback environment.
(42) In response to the microphone output signal (M) and the values of S.sub.min, subsystem 16 generates gap confidence values. Sample aggregator subsystem 20 is configured to use the values of M.sub.resmin (or the values of M, in the case that no echo cancellation is performed) as candidate noise estimates, and to use the gap confidence values (generated by subsystem 16) as indications of the reliability of the candidate noise estimates.
(43) More specifically, sample aggregator subsystem 20 of
(44) A simple example of subsystem 20 is a minimum follower (of gap confidence weighted samples), e.g., a minimum follower that includes candidate samples (values of M.sub.resmin) in the analysis window only if the associated gap confidence is higher than a predetermined threshold value (i.e., subsystem 20 assigns a weight of one to a sample M.sub.resmin if the gap confidence for the sample is equal to or greater than the threshold value, and subsystem 20 assigns a weight of zero to a sample M.sub.resmin if the gap confidence for the sample is less than the threshold value). Other implementations of subsystem 20 aggregate the gap confidence weighted samples in another way, e.g., by determining an average of the values of M.sub.resmin in an analysis window, each weighted by a corresponding one of the gap confidence values. An exemplary implementation of subsystem 20 which aggregates gap confidence weighted samples is (or includes) a linear interpolator/one pole smoother with an update rate controlled by the gap confidence values.
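The thresholded minimum follower described above might look like the following sketch. The class name `GatedMinimumFollower`, the sample-count window, and the decision to keep the previous estimate when the window contains no accepted sample are assumptions made for illustration.

```python
from collections import deque


class GatedMinimumFollower:
    """Minimum follower over candidates whose gap confidence meets a threshold.

    Hypothetical sketch: a candidate gets weight 1 when its gap confidence
    is at or above `threshold`, and weight 0 otherwise.
    """

    def __init__(self, window: int, threshold: float):
        self.threshold = threshold
        self._samples = deque(maxlen=window)  # (candidate, weight) pairs
        self._estimate = None  # no estimate until a sample is accepted

    def push(self, candidate: float, gap_confidence: float):
        weight = 1 if gap_confidence >= self.threshold else 0
        self._samples.append((candidate, weight))
        accepted = [c for c, w in self._samples if w == 1]
        if accepted:
            self._estimate = min(accepted)
        # With no accepted sample in the window, keep the last estimate.
        return self._estimate


follower = GatedMinimumFollower(window=3, threshold=0.5)
print(follower.push(10.0, 0.1))  # prints None (nothing accepted yet)
print(follower.push(4.0, 0.9))   # prints 4.0
print(follower.push(1.0, 0.2))   # prints 4.0 (low-confidence minimum ignored)
```

The third call shows the point of the gating: a minimum of 1.0 that arrives with low gap confidence is likely a dip in the playback content rather than the noise floor, so it does not pull the estimate down.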
(45) Subsystem 20 may employ strategies that ignore gap confidence at times when incoming samples (values of M.sub.resmin) are lower than the current noise estimate (determined by subsystem 20), in order to track drops in noise conditions even if no gaps are available.
(46) Preferably, subsystem 20 is configured to effectively hold onto noise estimates during intervals of low gap confidence until new sampling opportunities arise as determined by the gap confidence. For example, in a preferred implementation of subsystem 20, when subsystem 20 determines a current noise estimate (in one analysis window) and then the gap confidence values (generated by subsystem 16) indicate low confidence that there is a gap in playback content (e.g., the gap confidence values indicate gap confidence below a predetermined threshold value), subsystem 20 continues to output that current noise estimate until (in a new analysis window) the gap confidence values indicate higher confidence that there is a gap in playback content (e.g., the gap confidence values indicate gap confidence above the threshold value), at which time subsystem 20 generates (and outputs) an updated noise estimate. By so using gap confidence values to generate noise estimates (including by holding onto noise estimates during intervals of low gap confidence until new sampling opportunities arise as determined by the gap confidence) in accordance with preferred embodiments of the invention, rather than relying only on candidate noise estimate values output from minimum follower 14 as a sequence of noise estimates (without determining and using gap confidence values) or otherwise generating noise estimates in a conventional manner, the length for all employed minimum follower analysis windows (i.e., τ1, the analysis window length of each of minimum followers 13 and 14, and τ2, the analysis window length of aggregator 20, if aggregator 20 is implemented as a minimum follower of gap confidence weighted samples) can be reduced by about an order of magnitude over traditional approaches, improving the speed at which the noise estimation system can track the noise conditions when gaps do arise. Typical default values for the analysis window sizes are given below.
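A one-pole smoother whose update rate is controlled by gap confidence, combined with the two behaviors just described (holding the estimate when confidence is low, and following drops regardless of confidence), could be sketched as below. The function name, the `max_rate` parameter, and the exact way the confidence scales the rate are assumptions made for illustration.

```python
def update_noise_estimate(estimate, candidate, gap_confidence, max_rate=0.5):
    """Return the next noise estimate for one band (linear energy values).

    Hypothetical sketch:
    - A candidate below the current estimate is followed at the full rate,
      ignoring gap confidence, so drops in noise are tracked without gaps.
    - Otherwise the update rate is scaled by the gap confidence, so a
      confidence of zero leaves the estimate unchanged.
    """
    confidence = min(max(gap_confidence, 0.0), 1.0)
    if candidate < estimate:
        rate = max_rate
    else:
        rate = max_rate * confidence
    return estimate + rate * (candidate - estimate)


print(update_noise_estimate(10.0, 20.0, 0.0))  # prints 10.0 (held)
print(update_noise_estimate(10.0, 20.0, 1.0))  # prints 15.0
print(update_noise_estimate(10.0, 4.0, 0.0))   # prints 7.0 (drop followed)
```

Because the estimate is simply carried forward when confidence is zero, no long analysis window is needed to bridge the stretches of playback between gaps.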
(47) In a class of implementations, sample aggregator 20 is configured to report forward (i.e., to output) not only a current noise estimate but also an indication, referred to herein as “gap health,” of how up to date the noise estimate is in each frequency band. In typical implementations, gap health is a unitless measure, calculated (in one typical implementation) as:
(48) $$GH = \frac{1}{n}\sum_{i=1}^{n} \mathrm{GapConfidence}_i$$
where n is an integer, index i ranges from 1 to n, and the GapConfidence values are the most recent n gap confidence values provided by subsystem 16 to sample aggregator 20. Typically, a gap health value (e.g., a value GH) is determined for each frequency band, with subsystem 16 generating (and providing to aggregator 20) a set of gap confidence values (one for each frequency band) for each analysis window of minimum follower 13 (so that the n most recent gap confidence values in the above example of GH are the n most recent gap confidence values for the relevant band).
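The sketch below computes gap health per band as the mean of the n most recent gap confidence values, and then uses it in the way the claims outline: bands at or above a health threshold keep their noise estimate, and unhealthy bands lying between two healthy bands are filled by linear interpolation in a logarithmic (dB) domain. The class and function names, interpolating over band index rather than frequency, and treating not-yet-observed history as zero confidence are assumptions made for illustration.

```python
from collections import deque


class GapHealthTracker:
    """Per-band mean of the n most recent gap confidence values."""

    def __init__(self, num_bands: int, n: int):
        self.n = n
        self._history = [deque(maxlen=n) for _ in range(num_bands)]

    def push(self, gap_confidences):
        for history, confidence in zip(self._history, gap_confidences):
            history.append(confidence)

    def gap_health(self):
        # Dividing by n even before n values have arrived treats the
        # missing history as zero confidence (an assumption).
        return [sum(history) / self.n for history in self._history]


def fill_unhealthy_bands(noise_db, health, threshold):
    """Interpolate (in dB) across unhealthy bands between healthy neighbors.

    Unhealthy bands with no healthy band on both sides are left unchanged.
    """
    filled = list(noise_db)
    healthy = [band for band, h in enumerate(health) if h >= threshold]
    for left, right in zip(healthy, healthy[1:]):
        for band in range(left + 1, right):
            fraction = (band - left) / (right - left)
            filled[band] = noise_db[left] + fraction * (noise_db[right] - noise_db[left])
    return filled


tracker = GapHealthTracker(num_bands=2, n=4)
for frame in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    tracker.push(frame)
print(tracker.gap_health())  # prints [0.75, 0.25]
```

With a health threshold of 0.5, the first band in this example would be treated as having an up-to-date noise estimate and the second would not.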
(49) In a class of implementations, gap confidence subsystem 16 is configured to process the S.sub.min values (output from minimum follower 13) and a smoothed version (i.e., smoothed values M.sub.smoothed, output from smoothing subsystem 17 of subsystem 16) of the M values (output from gain stage 11), e.g., by comparing the S.sub.min values to the M.sub.smoothed values, in order to generate a sequence of gap confidence values. Typically, subsystem 16 generates (and provides to aggregator 20) a set of gap confidence values (one for each frequency band) for each analysis window of minimum follower 13, and the description herein pertains to generation of a gap confidence value for a particular frequency band (from values of S.sub.min and M.sub.smoothed for the band).
(50) Each gap confidence value (for one band, at one time) indicates how indicative a corresponding one of the M.sub.resmin values (i.e., the M.sub.resmin value for the same band and time) is of the noise conditions in the playback environment. Each minimum (M.sub.resmin) recognized (during a gap in playback content) by minimum follower 14 (which operates on the Mres values) can confidently be considered to be indicative of noise conditions in the playback environment. When there is no gap in playback content, a minimum (M.sub.resmin) recognized by minimum follower 14 (which operates on the Mres values) cannot confidently be considered to be indicative of noise conditions in the playback environment since it may instead be indicative of a minimum (S.sub.min) in the playback signal (S).
(51) Subsystem 16 is typically implemented to generate each gap confidence value (a value GapConfidence, for a time t) to be indicative of how different S.sub.min is from the smoothed (average) level detected by the microphone (M.sub.smoothed) at the time t. The further S.sub.min is from the smoothed (average) level detected by the microphone (M.sub.smoothed), the greater is the confidence that there is a gap in playback content at the time t, and thus the greater is the confidence that a value M.sub.resmin is representative of the noise conditions (at the time t) in the playback environment.
(52) The computation of each gap confidence value (i.e., the gap confidence value for each time, t, e.g., for each analysis window of minimum follower 13), for each band, is based on S.sub.min, the minimum followed playback content energy level at the time, t, and M.sub.smoothed, the smoothed microphone energy level at the same time, t. In a preferred embodiment, each gap confidence value output from subsystem 16 is a unitless value proportional to:
(53)
where * denotes multiplication, all the energy values (S.sub.min and M.sub.smoothed) are in the linear domain, and δ and C are tuning parameters. Typically, the value of C is associated with the amount of echo cancellation provided by an echo canceller (e.g., element 34 of
(54) The value of δ sets the required distance between the observed minimum of the playback content, and the smoothed microphone level. This parameter trades off error and stability with the update rate of the system, and will depend on how aggressive the noise compensation gains are.
(55) Using M.sub.smoothed as a point of comparison means that the current gap confidence value takes into account the severity of making an error in the estimate of the noise, given the current conditions. Generally if δ is chosen to be large enough, the operation of the noise estimator will take advantage of the following scenarios. For a fixed value of S.sub.min, an increased value of M.sub.smoothed implies that the gap confidence should increase. If M.sub.smoothed increases because the actual noise conditions increase significantly, allowing more error in the noise estimate due to residual echo is possible because the error will be small relative to the magnitude of the noise conditions. If M.sub.smoothed increases because the playback content increases in level, the impact of any error made in the noise estimate is also reduced because the noise compensator will not be performing much compensation. For a fixed value of S.sub.min, a decreased value of M.sub.smoothed implies that the gap confidence should decrease. Any errors introduced through residual echo in the microphone output signal in this situation would have a large impact on the compensation experience, as they would be large with respect to the playback content. Thus it is appropriate for the noise estimator to be more conservative in computing the gap confidence under these conditions.
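The qualitative behavior described above can be explored with a stand-in function. The expression used here is NOT the expression of the preferred embodiment; it is an assumed form chosen only because it depends on S.sub.min, M.sub.smoothed, δ and C and reproduces the stated trends (confidence rises with M.sub.smoothed for a fixed S.sub.min, and falls as δ grows). Treating δ and C as energy ratios given in dB, and the way C enters the ratio, are also assumptions.

```python
def db_to_linear_energy(db: float) -> float:
    """Convert a level in dB to a linear energy ratio (assumes 10*log10)."""
    return 10.0 ** (db / 10.0)


def gap_confidence(s_min: float, m_smoothed: float, delta_db: float, c_db: float = 0.0) -> float:
    """Illustrative gap confidence in [0, 1] for one band at one time.

    Hypothetical form: 1 - (delta * s_min) / (c * m_smoothed), floored at 0.
    s_min and m_smoothed are linear energies.
    """
    if m_smoothed <= 0.0:
        return 0.0  # no microphone energy to compare against
    delta = db_to_linear_energy(delta_db)
    c = db_to_linear_energy(c_db)
    ratio = (delta * s_min) / (c * m_smoothed)
    return max(0.0, 1.0 - ratio)


# Fixed playback minimum; a higher smoothed microphone level raises confidence.
low_mic = gap_confidence(s_min=1e-4, m_smoothed=1.0, delta_db=30.0)
high_mic = gap_confidence(s_min=1e-4, m_smoothed=10.0, delta_db=30.0)
assert low_mic < high_mic
```

Under this stand-in, raising δ from 6 dB to 30 dB lowers the confidence assigned to the same pair of levels, which matches the idea that a larger δ admits only higher quality gaps.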
(56) In applications that make strong use of echo cancellation (“AEC”), where the cost of making errors is lower, δ can be relaxed (reduced), so that the noise estimate (output from subsystem 20) is indicative of more frequent gaps. In AEC-free applications, δ can be increased in order for the noise estimate (output from subsystem 20) to be indicative of only higher quality gaps.
(57) The following table is a summary of tuning parameters of the
(58) TABLE-US-00001
Parameter | Purpose | Default (with AEC) | Default (no AEC)
δ | Required distance between playback minimum and microphone level for gap. | 6 dB | 30 dB
C | Amount of cancellation expected due to echo cancellation. | Depends on AEC. | 0 dB (i.e., C = 1 in the linear domain)
τ1 | Size of minimum follower analysis windows (of minimum followers 13 and 14) operating on microphone residual energy and playback energy. | 200 ms | 200 ms
τ2 | Size of the minimum follower-like filter (20) that processes microphone residual energy levels and corresponding confidences. | 800 ms | 800 ms
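The default tuning values listed above could be carried in a small configuration object such as the one sketched here. The class name `TuningParams`, the helper functions, and the conversion of window sizes to a frame count at a caller-supplied frame duration are assumptions; the table gives no default for C when an echo canceller is used, so that value must be supplied by the caller.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningParams:
    delta_db: float          # required playback-to-microphone distance
    c_db: float              # expected echo cancellation amount
    tau1_ms: float = 200.0   # window of the minimum followers
    tau2_ms: float = 800.0   # window of the aggregating filter

    def window_frames(self, frame_ms: float):
        """Window sizes as whole frame counts (frame_ms is an assumption)."""
        return (
            max(1, math.ceil(self.tau1_ms / frame_ms)),
            max(1, math.ceil(self.tau2_ms / frame_ms)),
        )


def defaults_with_aec(expected_cancellation_db: float) -> TuningParams:
    # C depends on the echo canceller in use, so it has no fixed default.
    return TuningParams(delta_db=6.0, c_db=expected_cancellation_db)


def defaults_without_aec() -> TuningParams:
    # 0 dB corresponds to C = 1 in the linear domain.
    return TuningParams(delta_db=30.0, c_db=0.0)


print(defaults_without_aec().window_frames(frame_ms=20.0))  # prints (10, 40)
```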
(59) All of the tuning parameters affect the update rate of the system, which is balanced against the accuracy of the system's noise estimate. Generally, as long as stability is maintained, it is better to have a faster responding system with some error present, than a conservative, slow responding system that relies on high quality gaps.
(60) The described approach to computing gap confidence (e.g., the output of subsystem 16 of
(61) With reference again to
(62) The noise compensated playback content 25 is transformed (in element 26), and downmixed and frequency banded (in element 27) to produce the values S. The microphone output signal is transformed (in element 32) and banded (in element 33) to produce the values M′. If an echo canceller (34) is employed, the residual signal (echo cancelled noise estimate values) from the echo canceller is banded (in element 35) to produce the values M′res.
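Frequency banding of transformed values can be illustrated with the sketch below, which sums bin energies inside each band. The rectangular (non-overlapping) bands, the `band_edges` layout, and the function name are assumptions; the downmixing step is omitted.

```python
def band_energies(bins, band_edges):
    """Sum the energy of complex frequency bins within each band.

    Hypothetical layout: band b covers bins band_edges[b] up to, but not
    including, band_edges[b + 1].
    """
    energies = []
    for low, high in zip(band_edges, band_edges[1:]):
        energies.append(sum(abs(b) ** 2 for b in bins[low:high]))
    return energies


# Four bins grouped into two bands: bin 0 alone, then bins 1 to 3.
spectrum = [1 + 0j, 0 + 2j, 3 + 0j, 1 + 1j]
banded = band_energies(spectrum, band_edges=[0, 1, 4])
```

Here the first band holds an energy of 1 and the second an energy of about 15 (4 + 9 + 2).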
(63) Subsystem 43 determines the calibration gain G (for each frequency band) in accordance with a microphone to digital mapping, which captures the level difference per frequency band between the playback content in the digital domain at the point (e.g., the output of time-to-frequency domain transform element 26) it is tapped off and provided to the noise estimator, and the playback content as received by the microphone. Each set of current values of the gain G is provided from subsystem 43 to noise estimator 37 (for application by gain stages 11 and 12 of the
(64) Subsystem 43 has access to at least one of the following three sources of data: factory preset gains (stored in memory 40); the state of the gains G generated (by subsystem 43) during the previous session (and stored in memory 41); if an AEC (e.g., echo canceller 34) is present and in use, banded AEC filter coefficient energies (e.g., those which determine the adaptive filter, corresponding to filter W′ of
(65) If no AEC is employed (e.g., if a version of the
(66) Thus, in some embodiments, subsystem 43 is configured such that the
(67) With reference again to
(68) The microphone to digital mapping performed by subsystem 43 to determine the gain values G captures the level difference (per frequency band) between the playback content in the digital domain (e.g., the output of time-to-frequency domain transform element 26) at the point it is tapped off for provision to the noise estimator, and the playback content as received by the microphone. The mapping is primarily determined by the physical separation and characteristics of the speaker system and microphone, as well as the electrical amplification gains used in the reproduction of sound and microphone signal amplification.
(69) In the most basic instance, the microphone to digital mapping may be a pre-stored factory tuning, measured during production design over a sample of devices, and re-used for all such devices being produced.
(70) When an AEC (e.g., echo canceller 34 of
(71) While estimated gains G′ can substitute for factory determined gains, a robust approach to determining the gain G for each band, that combines both factory gains and the online estimated gains G′, is the following:
G = max(min(G′, F + L), F − L)
where F is the factory gain for the band, G′ is the estimated gain for the band, and L is a maximum allowed deviation from the factory settings. All gains are in dB. If a value G′ exceeds the indicated range for a long period of time, this may indicate faulty hardware, and the noise compensation system may decide to fall back to safe behavior.
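The clamping rule above can be sketched per band as follows. This is a minimal illustration, not the described implementation: the function names and the `max_blocks` limit are assumptions, since the text says only that a value out of range "for a long period of time" may indicate faulty hardware.

```python
def robust_gain_db(estimated_db: float, factory_db: float, max_dev_db: float) -> float:
    """G = max(min(G', F + L), F - L), with all quantities in dB."""
    return max(min(estimated_db, factory_db + max_dev_db), factory_db - max_dev_db)


def out_of_range_too_long(estimates_db, factory_db, max_dev_db, max_blocks):
    """True if G' stays outside [F - L, F + L] for more than max_blocks consecutive blocks.

    max_blocks is a hypothetical stand-in for "a long period of time".
    """
    run = 0
    for g in estimates_db:
        # Count consecutive out-of-range estimates; reset on any in-range value.
        run = run + 1 if abs(g - factory_db) > max_dev_db else 0
        if run > max_blocks:
            return True
    return False
```

For example, with F = -20 dB and L = 6 dB, an online estimate of -12 dB is limited to -14 dB, while an estimate of -18 dB passes through unchanged.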
(72) A higher quality noise compensation experience can be maintained using a post-processing step performed (e.g., by element 39 of the
(73) An important aspect of some embodiments of the inventive noise estimation method and system is post-processing (e.g., performed by an implementation of element 39 of the
(74) In some such embodiments, the gap health as reported by the noise estimator (e.g., gap health values, for each frequency band, generated by subsystem 20 of the
(75) Stale value imputation may not be necessary in embodiments where a sufficient number of gaps are constantly available, and bands are rarely stale. Default threshold values for the simple imputation algorithm are given by the following table:
(76) TABLE-US-00002
Parameter | Default
α.sub.Healthy | 0.5
α.sub.Stale | 0.3
(77) Other methods that operate on the gap health and noise estimate values are of course possible.
(78) In some embodiments, element 39 of the
(79) Gap confidence determination (and use of the determined gap confidence data to perform noise estimation) in accordance with typical embodiments of the invention as disclosed herein enables a viable noise compensation experience (using noise estimates determined using the gap confidence values) without the need for an echo canceller, across the range of audio types encountered in media playback scenarios. Including an echo canceller to perform gap confidence determination in accordance with some embodiments of the invention can improve the responsiveness of noise compensation (using noise estimates determined using the determined gap confidence data), removing dependency on playback content characteristics. Typical implementations of the gap confidence determination, and use of the determined gap confidence data to perform noise estimation, lower the requirements placed on an echo canceller (also used to perform the noise estimation), and the significant effort involved in optimisation and testing.
(80) Removing an echo canceller from a noise compensation system: saves a large amount of development time, as echo cancellers demand a large amount of time and research to tune to ensure cancellation performance and stability; saves computation time, as large adaptive filter banks (for implementing echo cancellation) typically consume large resources and often require high precision arithmetic to run; and removes the need for a shared clock domain and time alignment between the microphone signal and the playback audio signal. Echo cancellation relies on both playback and recording signals being synchronized on the same audio clock.
(81) A noise estimator (implemented in accordance with any of the typical embodiments of the invention, e.g., without echo cancellation) can run at an increased block rate/smaller FFT size for further complexity savings. Echo cancellation performed in the frequency domain typically requires a narrow frequency resolution.
(82) When using echo cancellation (and gap confidence determination) to generate noise estimates in accordance with typical embodiments of the invention, echo canceller performance can be reduced without compromising user experience (when the user listens to noise compensated playback content, implemented using noise estimates generated in accordance with typical embodiments of the invention), since the echo canceller need only perform enough cancellation to reveal gaps in playback content, and need not maintain a high ERLE for the playback content peaks (“ERLE” here denotes echo return loss enhancement, a measure of how much echo, in dB, is removed by an echo canceller).
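The ERLE figure mentioned above is conventionally computed as the ratio, in dB, of the signal power entering the echo canceller to the residual power leaving it. A minimal sketch of that conventional definition follows; the function names are illustrative, and the measurement is assumed to be taken while the microphone signal is dominated by echo.

```python
import math


def mean_power(samples):
    """Mean squared value of a block of samples."""
    return sum(x * x for x in samples) / len(samples)


def erle_db(mic_power: float, residual_power: float) -> float:
    """Echo return loss enhancement in dB: 10 * log10(mic power / residual power)."""
    return 10.0 * math.log10(mic_power / residual_power)
```

A canceller that reduces echo power by a factor of 100 therefore has an ERLE of 20 dB.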
(83) Exemplary embodiments of the inventive method include the following:
(84) E1. A method, including steps of:
(85) during emission of sound in a playback environment, using a microphone to generate a microphone output signal, wherein the sound is indicative of audio content of a playback signal, and the microphone output signal is indicative of background noise in the playback environment and the audio content;
(86) generating (e.g., in element 16 of the
(87) generating (e.g., in element 20 of the
(88) E2. The method of claim E1, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, each of the noise estimates is an estimate of background noise in the playback environment at a different time, t, and said each of the noise estimates (e.g., each noise estimate output from element 20 of the
(89) E3. The method of claim E2, wherein the sequence of noise estimates includes a noise estimate for each said time interval, and generation of the noise estimate for each said time interval includes steps of:
(90) (a) identifying (e.g., in element 20 of the
(91) (b) generating the noise estimate for the time interval to be a minimum one of the candidate noise estimates identified in step (a).
(92) E4. The method of claim E2, wherein each of the candidate noise estimates is a minimum echo cancelled noise estimate (e.g., one of the values, M.sub.resmin, output from element 14 of the
(93) E5. The method of claim E2, wherein each of the candidate noise estimates is a minimum microphone output signal value (e.g., a value, M.sub.min, output from element 14 of the
(94) E6. The method of claim E1, wherein the step of generating the gap confidence values includes generating a gap confidence value for each time, t, including by:
(95) processing the playback signal (e.g., in element 13 of the
(96) processing the microphone output signal (e.g., in elements 11 and 17 of the
(97) determining (e.g., in element 18 of the
(98) E7. The method of claim E1, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, and also including a step of:
(99) performing noise compensation (e.g., in element 24 of the
(100) E8. The method of claim E7, wherein the step of performing noise compensation on the audio input signal includes generation of the playback signal, and wherein the method includes a step of:
(101) driving at least one speaker with the playback signal to generate said sound.
(102) E9. The method of claim E1, including steps of:
(103) performing a time-domain to frequency-domain transform on the microphone output signal, thereby generating frequency-domain microphone output data; and
(104) generating frequency-domain playback content data in response to the playback signal, and wherein the gap confidence values are generated in response to the frequency-domain microphone output data and the frequency-domain playback content data.
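The gap confidence steps of embodiment E6 (a minimum in playback level, a smoothed microphone level, and a confidence reflecting how different the two are) can be sketched for a single frequency band as follows. This is a sketch under stated assumptions: the linear ramp from 0 to 1 over `delta_db`, the one-pole smoother, and all function and parameter names are hypothetical choices made here, and the playback levels are assumed to have already been scaled by the calibration gain G so that they are comparable with microphone levels.

```python
from collections import deque


def sliding_min(levels_db, window):
    """Minimum follower: minimum of the most recent `window` values at each step."""
    out, recent = [], deque(maxlen=window)
    for level in levels_db:
        recent.append(level)
        out.append(min(recent))
    return out


def smooth(levels_db, alpha):
    """One-pole smoother; alpha in (0, 1], where 1.0 means no smoothing."""
    out, state = [], None
    for level in levels_db:
        state = level if state is None else state + alpha * (level - state)
        out.append(state)
    return out


def gap_confidence(playback_db, mic_db, window, alpha, delta_db, cancel_db=0.0):
    """Per-step confidence in [0, 1] that the playback content has a gap.

    Hypothetical mapping: confidence rises linearly as the smoothed microphone
    level exceeds the playback minimum (less expected cancellation) by up to delta_db.
    """
    s_min = sliding_min(playback_db, window)
    m_smooth = smooth(mic_db, alpha)
    conf = []
    for s, m in zip(s_min, m_smooth):
        distance = m - (s - cancel_db)  # dB by which the mic level exceeds the expected echo
        conf.append(min(1.0, max(0.0, distance / delta_db)))
    return conf
```

With `delta_db=30.0` (the no-AEC default), a block in which playback drops to -60 dB while the microphone still reads -30 dB yields a confidence of 1.0, since the microphone level at that time is presumably dominated by background noise.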
(105) Exemplary embodiments of the inventive system include the following:
(106) E10. A system, including:
(107) a microphone (e.g., microphone 30 of
(108) a noise estimation system (e.g., elements 26, 27, 32, 33, 34, 35, 36, 37, 39, and 43 of the
(109) to generate gap confidence values in response to the microphone output signal and the playback signal, where each of the gap confidence values is for a different time, t, and is indicative of confidence that there is a gap, at the time t, in the playback signal; and
(110) to generate an estimate of the background noise in the playback environment using the gap confidence values.
(111) E11. The system of claim E10, wherein the noise estimation system is configured to generate the estimate of the background noise in the playback environment such that said estimate of the background noise in the playback environment is or includes a sequence of noise estimates, each of the noise estimates is an estimate of background noise in the playback environment at a different time, t, and said each of the noise estimates (e.g., each noise estimate output from element 20 of the
(112) E12. The system of claim E11, wherein the sequence of noise estimates includes a noise estimate for each said time interval, and the noise estimation system is configured to generate the noise estimate for each said time interval including by:
(113) (a) identifying (e.g., in element 20 of
(114) (b) generating the noise estimate for the time interval to be a minimum one of the candidate noise estimates identified in step (a).
(115) E13. The system of claim E12, wherein each of the candidate noise estimates is a minimum echo cancelled noise estimate (e.g., one of the values, M.sub.resmin, output from element 14 of the
(116) E14. The system of claim E12, wherein each of the candidate noise estimates is a minimum microphone output signal value (e.g., a value, M.sub.min, output from element 14 of the
(117) E15. The system of claim E10, wherein the gap confidence values include a gap confidence value for each time, t, and the noise estimation system is configured to generate the gap confidence value for each time, t, including by:
(118) processing the playback signal (e.g., in element 13 of the
(119) processing (e.g., in elements 11 and 17 of the
(120) determining (e.g., in element 18 of the
(121) E16. The system of claim E10, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, said system also including:
(122) a noise compensation subsystem (e.g., element 24 of the
(123) E17. The system of claim E10, wherein the noise estimation system is configured:
(124) to perform a time-domain to frequency-domain transform (e.g., in elements 32 and 33 of the
(125) to generate frequency-domain playback content data (e.g., in elements 26 and 27 of the
(126) Aspects of the invention include a system or device configured (e.g., programmed) to perform any embodiment of the inventive method, and a tangible computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method or steps thereof. For example, the inventive system can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.
(127) Some embodiments of the inventive system (e.g., some implementations of the system of
(128) Alternatively, embodiments of the inventive system (e.g., some implementations of the system of
(129) Another aspect of the invention is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) any embodiment of the inventive method or steps thereof.
(130) While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.
(131) Various aspects of the present invention may be appreciated from the following enumerated example embodiments (EEEs):
(132) 1. A method, including steps of:
(133) during emission of sound in a playback environment, using a microphone to generate a microphone output signal, wherein the sound is indicative of audio content of a playback signal, and the microphone output signal is indicative of background noise in the playback environment and the audio content;
(134) generating gap confidence values in response to the microphone output signal and the playback signal, where each of the gap confidence values is for a different time, t, and is indicative of confidence that there is a gap, at the time t, in the playback signal; and
(135) generating an estimate of the background noise in the playback environment using the gap confidence values.
(136) 2. The method of EEE 1, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, each of the noise estimates is an estimate of background noise in the playback environment at a different time, t, and said each of the noise estimates is a combination of candidate noise estimates which have been weighted by the gap confidence values for a different time interval including the time t.
(137) 3. The method of EEE 2, wherein the sequence of noise estimates includes a noise estimate for each said time interval, and generation of the noise estimate for each said time interval includes steps of:
(138) (a) identifying each of the candidate noise estimates for the time interval for which a corresponding one of the gap confidence values exceeds a predetermined threshold value; and
(139) (b) generating the noise estimate for the time interval to be a minimum one of the candidate noise estimates identified in step (a).
(140) 4. The method of EEE 2 or 3, wherein each of the candidate noise estimates is a minimum echo cancelled noise estimate, M.sub.resmin, of a sequence of echo cancelled noise estimates, the sequence of noise estimates includes a noise estimate for each said time interval, and the noise estimate for each said time interval is a combination of the minimum echo cancelled noise estimates for the time interval, weighted by corresponding ones of the gap confidence values for the time interval.
(141) 5. The method of EEE 2 or 3, wherein each of the candidate noise estimates is a minimum microphone output signal value, M.sub.min, of a sequence of microphone output signal values, the sequence of noise estimates includes a noise estimate for each said time interval, and the noise estimate for each said time interval is a combination of the minimum microphone output signal values for the time interval, weighted by corresponding ones of the gap confidence values for the time interval.
(142) 6. The method of EEE 1, 2, 3, 4, or 5, wherein the step of generating the gap confidence values includes generating a gap confidence value for each time, t, including by:
(143) processing the playback signal to determine a minimum in playback signal level for the time, t;
(144) processing the microphone output signal to determine a smoothed level of the microphone output signal for the time, t; and
(145) determining the gap confidence value for the time, t, to be indicative of how different the minimum in playback signal level for the time, t, is from the smoothed level of the microphone output signal for the time, t.
(146) 7. The method of EEE 1, 2, 3, 4, 5, or 6, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, and also including a step of:
(147) performing noise compensation on an audio input signal using the sequence of noise estimates.
(148) 8. The method of EEE 7, wherein the step of performing noise compensation on the audio input signal includes generation of the playback signal, and wherein the method includes a step of:
(149) driving at least one speaker with the playback signal to generate said sound.
9. The method of EEE 1, 2, 3, 4, 5, 6, 7, or 8, including steps of:
(150) performing a time-domain to frequency-domain transform on the microphone output signal, thereby generating frequency-domain microphone output data; and
(151) generating frequency-domain playback content data in response to the playback signal, and wherein the gap confidence values are generated in response to the frequency-domain microphone output data and the frequency-domain playback content data.
(152) 10. A system, including:
(153) a microphone, configured to generate a microphone output signal during emission of sound in a playback environment, wherein the sound is indicative of audio content of a playback signal, and the microphone output signal is indicative of background noise in the playback environment and the audio content; and
(154) a noise estimation system, coupled to receive the microphone output signal and the playback signal, and configured:
(155) to generate gap confidence values in response to the microphone output signal and the playback signal, where each of the gap confidence values is for a different time, t, and is indicative of confidence that there is a gap, at the time t, in the playback signal; and
(156) to generate an estimate of the background noise in the playback environment using the gap confidence values.
(157) 11. The system of EEE 10, wherein the noise estimation system is configured to generate the estimate of the background noise in the playback environment such that said estimate of the background noise in the playback environment is or includes a sequence of noise estimates, each of the noise estimates is an estimate of background noise in the playback environment at a different time, t, and said each of the noise estimates is a combination of candidate noise estimates which have been weighted by the gap confidence values for a different time interval including the time t.
(158) 12. The system of EEE 11, wherein the sequence of noise estimates includes a noise estimate for each said time interval, and the noise estimation system is configured to generate the noise estimate for each said time interval including by:
(159) (a) identifying each of the candidate noise estimates for the time interval for which a corresponding one of the gap confidence values exceeds a predetermined threshold value; and
(160) (b) generating the noise estimate for the time interval to be a minimum one of the candidate noise estimates identified in step (a).
(161) 13. The system of EEE 11 or 12, wherein each of the candidate noise estimates is a minimum echo cancelled noise estimate, M.sub.resmin, of a sequence of echo cancelled noise estimates, the sequence of noise estimates includes a noise estimate for each said time interval, and the noise estimate for each said time interval is a combination of the minimum echo cancelled noise estimates for the time interval, weighted by corresponding ones of the gap confidence values for the time interval.
(162) 14. The system of EEE 11 or 12, wherein each of the candidate noise estimates is a minimum microphone output signal value, M.sub.min, of a sequence of microphone output signal values, the sequence of noise estimates includes a noise estimate for each said time interval, and the noise estimate for each said time interval is a combination of the minimum microphone output signal values for the time interval, weighted by corresponding ones of the gap confidence values for the time interval.
(163) 15. The system of EEE 10, 11, 12, 13, or 14, wherein the gap confidence values include a gap confidence value for each time, t, and the noise estimation system is configured to generate the gap confidence value for each time, t, including by:
(164) processing the playback signal to determine a minimum in playback signal level for the time, t;
(165) processing the microphone output signal to determine a smoothed level of the microphone output signal for the time, t; and
(166) determining the gap confidence value for the time, t, to be indicative of how different the minimum in playback signal level for the time, t, is from the smoothed level of the microphone output signal for the time, t.
(167) 16. The system of EEE 10, 11, 12, 13, 14, or 15, wherein the estimate of the background noise in the playback environment is or includes a sequence of noise estimates, said system also including:
(168) a noise compensation subsystem, coupled to receive the sequence of noise estimates, and configured to perform noise compensation on an audio input signal using the sequence of noise estimates to generate the playback signal.
(169) 17. The system of EEE 10, 11, 12, 13, 14, 15, or 16, wherein the noise estimation system is configured:
(170) to perform a time-domain to frequency-domain transform on the microphone output signal, thereby generating frequency-domain microphone output data;
(171) to generate frequency-domain playback content data in response to the playback signal; and
(172) to generate the gap confidence values in response to the frequency-domain microphone output data and the frequency-domain playback content data.
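The selection rule of EEEs 3 and 12 above (keep the candidate noise estimates whose gap confidence exceeds a threshold, then take the minimum) can be sketched for one frequency band as follows. This is a minimal illustration: the function names, the sliding-window indexing, and the choice to hold the previous estimate when no candidate qualifies are assumptions made here rather than details taken from the embodiments.

```python
def noise_estimate(candidates_db, confidences, threshold):
    """Minimum candidate whose gap confidence exceeds threshold, or None if none qualify."""
    eligible = [c for c, g in zip(candidates_db, confidences) if g > threshold]
    return min(eligible) if eligible else None


def estimate_sequence(candidates_db, confidences, window, threshold, initial_db):
    """One noise estimate per time step, using the most recent `window` candidates.

    Holding the last estimate when no candidate qualifies is a hypothetical policy.
    """
    out, last = [], initial_db
    for t in range(len(candidates_db)):
        start = max(0, t - window + 1)
        est = noise_estimate(candidates_db[start:t + 1], confidences[start:t + 1], threshold)
        last = est if est is not None else last
        out.append(last)
    return out
```

For candidates `[-20.0, -50.0, -45.0, -20.0]` with confidences `[0.1, 0.9, 0.8, 0.0]`, a two-step window, a threshold of 0.5, and an initial estimate of -40 dB, the loud low-confidence candidates are ignored and the estimate follows only the values observed during likely gaps.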