Forced gap insertion for pervasive listening
11195539 · 2021-12-07
CPC classification
G10K2210/108
PHYSICS
H04R2227/001
ELECTRICITY
G10K11/17885
PHYSICS
H03G3/32
ELECTRICITY
Abstract
A pervasive listening method including steps of inserting at least one forced gap in a playback signal (thus generating a modified playback signal), and during playback of the modified playback signal, monitoring non-playback content (e.g., including by generating an estimate of background noise) in a playback environment using output of a microphone in the playback environment. Optionally, the method includes generation of the playback signal, including by processing of (e.g., performing noise compensation on) an input signal using a result (e.g., a background noise estimate) of the monitoring of non-playback content. Other aspects are systems configured to perform any embodiment of the pervasive listening method.
Claims
1. A pervasive listening method, comprising: inserting at least one gap into at least one selected frequency band, in a selected time interval, of an audio playback signal to generate a modified playback signal; during emission of sound in a playback environment in response to the modified playback signal, generating, using a microphone in the playback environment, a microphone output signal, wherein the sound is indicative of playback content of the modified playback signal, and the microphone output signal is indicative of non-playback sound in the playback environment and the playback content; and monitoring the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal.
2. The method of claim 1, wherein each said gap is inserted into a selected frequency band, in the selected time interval, of the audio playback signal, in an effort so that any artifact, in the sound emitted in the playback environment in response to the modified playback signal, resulting from insertion of the gap has low perceptibility to a user in the playback environment, and high identifiability during performance of the monitoring.
3. The method of claim 1, wherein each said gap is inserted into a selected frequency band, in a selected time interval, of the audio playback signal such that the sound emitted in the playback environment in response to the modified playback signal is perceivable by the user without any significant artifact resulting from insertion of the gap.
4. The method of claim 1, wherein each said gap is inserted into a selected frequency band of the audio playback signal, and each said selected frequency band is determined by selection, from a set of frequency bands of the audio playback signal, implemented using perceptual freedom values indicative of expected perceptual effect of insertion of a gap in each band of the set of frequency bands.
5. The method of claim 4, wherein the perceptual freedom values are determined in accordance with at least one frequency masking consideration, such that when one of the perceptual freedom values is a near peak value for a near peak band which is near to a peak energy band of the set of frequency bands, each of the perceptual freedom values, for a band farther from the peak energy band than is said near peak band, is indicative of greater expected perceptual effect than is said near peak value.
6. The method of claim 4, wherein the perceptual freedom values are determined in accordance with at least one temporal masking consideration, such that when the audio playback signal is indicative of at least one loud playback sound event, those of the perceptual freedom values for a first time interval of the audio playback signal occurring shortly after the loud playback sound event, are indicative of lower expected perceptual effect than are those of the perceptual freedom values for a second time interval of the audio playback signal, where the second time interval is later than the first time interval.
7. The method of claim 1, wherein the pervasive listening method is a noise estimation method, the microphone output signal is indicative of background noise in the playback environment, and the monitoring includes generating an estimate of background noise in the playback environment in response to the modified playback signal and the microphone output signal.
8. The method of claim 1, wherein the monitoring includes generation of an estimate of at least one aspect of the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal, and wherein the method further comprises: generating the audio playback signal in response to the estimate of at least one aspect of the non-playback sound in the playback environment.
9. The method of claim 1, wherein each said gap is inserted into the playback signal based on need for a gap in at least one frequency band of the playback signal.
10. The method of claim 9, wherein each said gap is inserted into the playback signal in response to urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal.
11. The method of claim 9, wherein each said gap is inserted into the playback signal in response to urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal, and based on expected perceptual effect of insertion of a gap in said each band of the set of frequency bands of the playback signal.
12. The method of claim 9, wherein each said gap is inserted into the playback signal in a manner including balancing of urgency and expected perceptual effect of gap insertion, using urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal, and based on expected perceptual effect of insertion, in at least one specific time interval of the playback signal, of a gap in said each band of the set of frequency bands of the playback signal.
13. The method of claim 1, comprising: determining a probability distribution indicative of a probability for each band of a set of frequency bands of the playback signal; and in accordance with the probability distribution, randomly selecting at least one of the frequency bands of the set, and inserting a gap in each of said at least one of the frequency bands.
14. The method of claim 13, wherein the probability distribution is based on need for a gap in each said band of the set of frequency bands of the playback signal.
15. The method of claim 13, wherein the probability distribution is based on need for a gap, and expected perceptual effect of insertion of the gap, in each said band of the set of frequency bands of the playback signal.
16. The method of claim 1, comprising: generating urgency values in response to the microphone output signal and the modified playback signal, wherein the urgency values are indicative of need for a gap, in each band of a set of frequency bands of the playback signal, based on elapsed time since occurrence of a previous gap in said each band, and wherein insertion of each gap into the playback signal is at least partially based on the urgency values.
17. The method of claim 1, wherein the monitoring of the non-playback sound includes generation of background noise estimates, wherein the method further comprises: generating the audio playback signal in response to the background estimates, including by performing noise compensation on an input audio signal in response to the background estimates.
18. A system, including: a microphone, positioned and configured to generate a microphone output signal during emission of sound in a playback environment, wherein the sound is indicative of playback content of a modified playback signal, and the microphone output signal is indicative of non-playback sound in the playback environment and the playback content; a forced gap application subsystem, coupled to receive an audio playback signal, and configured to insert at least one gap into at least one selected frequency band, in a selected time interval, of the audio playback signal, thereby generating the modified playback signal; and a pervasive listening subsystem, coupled to receive the microphone output signal and the modified playback signal, and configured to monitor the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal.
19. The system of claim 18, wherein the forced gap application subsystem is configured to insert each said gap into a selected frequency band, in the selected time interval, of the audio playback signal, in an effort so that any artifact, in the sound emitted in the playback environment in response to the modified playback signal, resulting from insertion of the gap has low perceptibility to a user in the playback environment, and high identifiability during performance of the monitoring.
20. The system of claim 18, wherein the forced gap application subsystem is configured to insert each said gap into a selected frequency band of the audio playback signal, including by selecting each said selected frequency band, from a set of frequency bands of the audio playback signal, using perceptual freedom values indicative of expected perceptual effect of insertion of a gap in each band of the set of frequency bands.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
NOTATION AND NOMENCLATURE
(16) Throughout this disclosure, including in the claims, a “gap” in an audio signal (and in playback content of the audio signal) denotes a time (or time interval) of the signal at (or in) which playback content (e.g., in at least one frequency band) is missing (or has a level less than a predetermined value). The audio signal may have a banded frequency-domain representation (in each of a sequence of times, or time intervals, of the signal) comprising frequency-domain playback content in each band of a set of different frequency bands (at each time or time interval), and may have a gap in at least one of the frequency bands (at a time or time interval of the audio signal).
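By way of illustration only, the definition of a “gap” given above can be sketched in code. The sketch assumes a banded representation stored as one list of per-band levels (in dB) for each time interval, and a caller-supplied threshold standing in for the “predetermined value”. The function name `find_gaps`, the dB units, and the example levels are hypothetical and are not taken from this disclosure.

```python
from typing import List


def find_gaps(band_levels_db: List[List[float]], threshold_db: float) -> List[List[bool]]:
    """Mark each (time interval, band) cell whose playback level is below the threshold.

    band_levels_db[t][b] is the playback level, in dB, of band b in interval t.
    The dB representation and the threshold value are assumptions of this sketch.
    """
    return [[level < threshold_db for level in frame] for frame in band_levels_db]


# Two time intervals, three bands (illustrative values only).
levels = [
    [-20.0, -65.0, -30.0],
    [-70.0, -25.0, -62.0],
]
gap_mask = find_gaps(levels, threshold_db=-60.0)
```

A cell of `gap_mask` is true where the signal has a gap in that band during that interval, which matches the per-band, per-interval sense in which the term is used in this disclosure.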
(17) Throughout this disclosure, including in the claims, “speaker” and “loudspeaker” are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), all driven by a single, common speaker feed (the speaker feed may undergo different processing in different circuitry branches coupled to the different transducers).
(18) Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
(19) Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X−M inputs are received from an external source) may also be referred to as a decoder system.
(20) Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
(21) Throughout this disclosure including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
DETAILED DESCRIPTION OF EMBODIMENTS
(22) Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Some embodiments of the inventive system and method are described herein with reference to
(23) In accordance with typical embodiments of the present invention, gaps (referred to as “forced” gaps) are inserted into an audio playback signal, to introduce an intentional distortion of the audio playback content in order to provide glimpses of background noise (or other non-playback sound in the playback environment) to be monitored. Typically, forced gaps are inserted artificially in particular frequency bands in which a corresponding estimate of noise (or other non-playback sound) has gone stale (e.g., so that the forced gaps can be automatically used in accordance with the gap confidence framework described in U.S. Provisional Patent Application No. 62/663,302). In some embodiments, the distortions are carefully masked perceptually, to provide a good quality listening experience despite introduction of the forced gaps, and to implement responsive noise estimation (or another pervasive listening method) in a content-independent way even without use of an echo canceller.
(24) In some embodiments, a sequence of forced gaps is inserted in a playback signal, each forced gap in a different frequency band (or set of bands) of the playback signal, to allow a pervasive listener to monitor non-playback sound which occurs “in” each forced gap in the sense that it occurs during the time interval in which the gap occurs and in the frequency band(s) in which the gap is inserted.
(25) Introduction of a forced gap into a playback signal in accordance with typical embodiments of the invention is distinct from simplex device operation in which a device pauses a playback stream of content (e.g., in order to better hear the user and the user's environment). Introduction of forced gaps into a playback signal in accordance with typical embodiments of the invention is optimized to significantly reduce (or eliminate) perceptibility of artifacts resulting from the introduced gaps during playback, preferably so that the forced gaps have no or minimal perceptible impact for the user, but so that the output signal of a microphone in the playback environment is indicative of the forced gaps (e.g., so the gaps can be exploited to implement a pervasive listening method). By using forced gaps which have been introduced in accordance with typical embodiments of the invention, a pervasive listening system may monitor non-playback sound (e.g., sound indicative of background activity and/or noise in the playback environment) even without the use of an acoustic echo canceller.
(26) With reference to
(27)
(28)
(29) Thus, when a gap forcing operation occurs for a particular frequency band (i.e., the band centered at center frequency, f.sub.0, shown in
(30) Typical embodiments of the invention insert forced gaps in accordance with a predetermined, fixed banding structure that covers the full frequency spectrum of the audio playback signal, and includes B.sub.count bands (where B.sub.count is a number, e.g., B.sub.count=49). To force a gap in any of the bands, a band attenuation is applied in the band. Specifically, for the jth band, an attenuation, Gj, is applied over the frequency region defined by the band. In determining the number of bands and the width of each band, a tradeoff exists between perceptual impact (narrower bands with gaps are better in that they typically have less perceptual impact) and usefulness of the gaps (wider bands with gaps are better for implementing noise estimation (and other pervasive listening methods) and reducing the time (“convergence” time) required to converge to a new noise estimate (or other value monitored by pervasive listening), in all frequency bands of a full frequency spectrum, e.g., in response to a change in background noise or playback environment status). If only a limited number of gaps can be forced at once, it will take a longer time to force gaps sequentially in a large number of small bands (than to force gaps sequentially in a smaller number of larger bands), resulting in longer convergence time. Larger bands (with gaps) provide a lot of information about the background noise (or other value monitored by pervasive listening) at once, but have a larger perceptual impact.
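The convergence-time side of this tradeoff can be illustrated with simple arithmetic. The sketch below assumes that gaps are forced back to back, that every band is visited exactly once per sweep, and that each gap lasts 96 ms (the sum of the default values of t1, t2 and t3 in Table 1). The function name `full_sweep_seconds` and the `concurrent_gaps` parameter are hypothetical; actual embodiments select bands according to urgency and perceptual freedom rather than in a fixed sweep.

```python
import math


def full_sweep_seconds(band_count: int, gap_ms: float, concurrent_gaps: int = 1) -> float:
    """Time to force one gap in every band, under the assumptions stated above."""
    if band_count < 1 or concurrent_gaps < 1:
        raise ValueError("band_count and concurrent_gaps must be positive")
    # Number of successive gap slots needed to cover all bands.
    rounds = math.ceil(band_count / concurrent_gaps)
    return rounds * gap_ms / 1000.0


# Minimum, default and maximum band counts from Table 1.
for bands in (20, 49, 128):
    print(bands, "bands:", full_sweep_seconds(bands, gap_ms=96.0), "s")
```

Under these assumptions the sweep time grows in proportion to the number of bands, which is the reason a banding structure with many narrow bands converges more slowly than one with fewer, wider bands.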
(31)
(32) When assessing perceptual impact of introducing forced gaps (of the type discussed with reference to
(33) TABLE 1. Examples of parameters for forcing gaps in bands of a playback signal.
Parameter | Default | Minimum | Maximum | Units | Purpose
B.sub.count | 49 | 20 | 128 | - | Number of discrete groupings of frequency bins, referred to as “bands”.
Z | −12 | −12 | −18 | dB | Maximum attenuation applied in the forced gap in a band.
t1 | 8 | 5 | 15 | Milliseconds | Time to ramp gain down to −Z dB at the center frequency of a band once a forced gap is triggered.
t2 | 80 | 40 | 120 | Milliseconds | Time to apply attenuation −Z dB after t1 elapses.
t3 | 8 | 5 | 15 | Milliseconds | Time to ramp gain up to 0 dB after t1 + t2 elapses.
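As a non-limiting illustration of how the timing parameters of Table 1 fit together, the following sketch computes the gain applied at the center frequency of a band at a given time after a forced gap is triggered. It assumes ramps that are linear in dB, and it takes the gap depth as a positive number of decibels (12 dB by default, the magnitude of the Table 1 default for Z). The ramp shape, the function name `forced_gap_gain_db`, and its parameter names are assumptions of the sketch; Table 1 specifies only the durations and the maximum attenuation.

```python
def forced_gap_gain_db(t_ms: float, depth_db: float = 12.0,
                       t1_ms: float = 8.0, t2_ms: float = 80.0,
                       t3_ms: float = 8.0) -> float:
    """Gain (dB, zero or negative) at time t_ms after a forced gap is triggered."""
    total_ms = t1_ms + t2_ms + t3_ms
    if t_ms < 0.0 or t_ms >= total_ms:
        return 0.0  # outside the gap: playback content is unmodified
    if t_ms < t1_ms:
        # Ramp down from 0 dB toward full attenuation over t1.
        return -depth_db * (t_ms / t1_ms)
    if t_ms < t1_ms + t2_ms:
        # Hold full attenuation for t2.
        return -depth_db
    # Ramp back up to 0 dB over t3.
    return -depth_db * (1.0 - (t_ms - t1_ms - t2_ms) / t3_ms)


# Sample the envelope every 4 ms across the default 96 ms gap.
envelope = [forced_gap_gain_db(float(t)) for t in range(0, 100, 4)]
```

The envelope could then be converted to a linear gain and applied to the frequency bins of the selected band; how the attenuation tapers away from the center frequency is not modeled here.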
(34) Preferably, each forced gap introduced (in a frequency band of playback content) is introduced in accordance with a discrete selection from a predetermined banding structure (e.g., that of
(35) To implement typical embodiments of the invention, choices are made regarding in which discrete frequency band(s), of a set of B.sub.count bands of a playback signal, gap(s) should be forced, and when each such gap should be forced. We next discuss factors which pertain to such choices, including methods of quantifying and balancing both: 1. the need to force a gap in a band (a factor sometimes referred to herein as “urgency”);
(36) and 2. the degree to which forcing a gap would have perceptual impact (a factor sometimes referred to herein as “perceptual freedom”).
(37) In some embodiments of the invention, urgency and perceptual freedom estimates are determined for each of B.sub.count frequency bands of a playback signal, in an effort to insert forced gaps in a manner which minimizes overall urgency and attains an acceptably low (e.g., minimal) perceptual cost (e.g., in a non-optimal, statistical sense). For example, this may be implemented as follows. A discrete probability distribution P is defined over the B.sub.count possible outcomes (i.e., a probability is defined for selection, at a specific time, of each one of the B.sub.count bands). Once per time interval, w.sub.f, this distribution P is sampled randomly to select the band in which to insert (in the corresponding time interval) a forced gap (e.g., having parameters as described with reference to
(38)
$$P_k = \frac{P'_k}{\sum_{j=1}^{B_{count}} P'_j}$$
where
$$P'_k = \delta \, U_k + (1 - \delta) \, F_k,$$
and where U.sub.k and F.sub.k are values indicative of urgency and perceptual freedom, respectively, for the “k”th band, P′.sub.k is the (non-normalized) probability of selection of the “k”th band, δ is a parameter indicative of relative importance of urgency and perceptual freedom considerations, and the summation is over all the frequency bands (so that P.sub.k is the normalized version of P′.sub.k for the “k”th band).
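The band selection just described can be sketched as follows. The sketch computes P′.sub.k = δ*U.sub.k + (1−δ)*F.sub.k for each band, normalizes over all bands, and samples one band at random from the resulting distribution. The function names, the three-band example values, and the use of a seeded generator are hypothetical; the sketch also assumes that the urgency and perceptual freedom values are non-negative and not all zero.

```python
import random
from typing import List, Sequence


def selection_probabilities(urgency: Sequence[float], freedom: Sequence[float],
                            delta: float) -> List[float]:
    """Return the normalized probabilities P_k from U_k, F_k and delta."""
    if len(urgency) != len(freedom):
        raise ValueError("urgency and freedom must cover the same bands")
    # Non-normalized values P'_k.
    raw = [delta * u + (1.0 - delta) * f for u, f in zip(urgency, freedom)]
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("at least one band must have a positive weight")
    return [p / total for p in raw]


def pick_band(probabilities: Sequence[float], rng: random.Random) -> int:
    """Randomly select one band index in accordance with the distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


# Three-band example with illustrative values; delta = 0.5 weights both factors equally.
P = selection_probabilities([0.9, 0.1, 0.5], [0.2, 0.8, 0.5], delta=0.5)
band = pick_band(P, random.Random(0))
```

Calling `pick_band` once per time interval w.sub.f yields a sequence of bands in which, on average, bands with higher combined urgency and perceptual freedom receive forced gaps more often.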
(39)
(40) Each channel of the audio signal input to subsystem 70 is indicative of audio content (sometimes referred to herein as media content or playback content), and is intended to undergo playback to generate sound (in environment E) indicative of the audio content. Each channel of the audio signal output from subsystem 70 may be a speaker feed, or another element of the system may generate a speaker feed in response to each channel of the audio signal output from subsystem 70. The K speaker feed(s) are asserted to speaker system S (including at least one speaker) in playback environment E.
(41) Pervasive listening subsystem 71 (which, in some implementations, is a pervasive listening application running on an appropriately programmed processor) is coupled and configured to monitor sound (“non-playback” sound) in playback environment E, other than playback sound emitted from speaker system S (in response to the speaker feed(s) asserted thereto) in environment E. Specifically, microphone M in environment E captures sound in the environment E, and asserts to subsystem 71 a microphone output signal Mic indicative of the captured sound. The captured sound includes playback sound emitted from speaker system S, and non-playback sound which may be or include background noise, and/or speech by (or other sound indicative of activity, or mere presence, of) at least one human user L in the environment E.
(42) By monitoring non-playback sound in environment E which is revealed by forced gaps (i.e., in frequency bands and time intervals corresponding to the forced gaps) which have been inserted in the playback content in accordance with the invention, the performance of subsystem 71 is improved relative to the performance which it could attain without insertion of the forced gaps.
(43) Optionally, pervasive listening subsystem 71 is coupled and configured also to generate the audio playback signal which is input to forced gap applicator 70 (e.g., for the purpose of improving in some respect audio signal playback by speaker system S) in response to at least one result of the monitoring performed by said subsystem 71. Subsystem 71 may generate the audio playback signal by modifying an input audio playback signal (e.g., as does pervasive listener subsystem 72 of the system of
(44) Speaker system S (including at least one speaker) is coupled and configured to emit sound (in playback environment E) in response to each speaker feed determined by the output of subsystem 70. The output of subsystem 70 may consist of a single playback channel, or two or more playback channels. In typical operation, each speaker of speaker system S receives a speaker feed indicative of the playback content of a different channel of the output of subsystem 70. In response, speaker system S emits sound in playback environment E. Typically, the sound is perceived by at least one user (L) present in environment E.
(45) Microphone output signal “Mic” of
(46) Pervasive listening subsystem 71 is provided with the microphone output signal Mic. In response to the microphone output signal Mic, subsystem 71 monitors (or attempts to monitor) non-playback sound in environment E. The non-playback sound is sound other than the sound emitted by speaker system S. For example, the non-playback sound may be background noise and/or sound uttered by (or resulting from activity of) a user L. Subsystem 71 is also provided the K channels (which may include forced gaps) which are output from forced gap application subsystem 70. The K channels provided to subsystem 71 are sometimes referred to herein as K channels of “echo reference.” Each of the echo reference channels may contain forced gaps which have been automatically forced therein by subsystem 70, to aid subsystem 71 in its monitoring task.
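One way a pervasive listener might use the echo reference channels together with the microphone output signal is sketched below, purely as an illustration. The sketch assumes that both signals are available as time-aligned banded levels in dB, and that a cell of the echo reference below a chosen threshold indicates a gap; for such cells the microphone level is taken as a glimpse of the non-playback sound. The function name `noise_glimpses`, the threshold, and the example values are hypothetical, and acoustic delay between the echo reference and the microphone is ignored.

```python
from typing import Dict, List, Tuple


def noise_glimpses(mic_db: List[List[float]], echo_ref_db: List[List[float]],
                   gap_threshold_db: float) -> Dict[Tuple[int, int], float]:
    """Collect microphone levels in the (interval, band) cells where the reference has a gap."""
    glimpses: Dict[Tuple[int, int], float] = {}
    for t, (mic_frame, ref_frame) in enumerate(zip(mic_db, echo_ref_db)):
        for b, (mic_level, ref_level) in enumerate(zip(mic_frame, ref_frame)):
            if ref_level < gap_threshold_db:
                # No significant playback content here, so the microphone
                # level mostly reflects non-playback sound.
                glimpses[(t, b)] = mic_level
    return glimpses


# Two intervals, two bands (illustrative values only).
echo_ref = [[-20.0, -70.0], [-72.0, -18.0]]
mic = [[-22.0, -45.0], [-47.0, -20.0]]
observed = noise_glimpses(mic, echo_ref, gap_threshold_db=-60.0)
```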
(47) In typical implementations, forced gap applicator 70 inserts gaps in the audio playback signal in response to urgency data values indicative of the urgency (in each of a number of frequency bands, in each of a sequence of time intervals) for insertion of the gaps. In some implementations, forced gap applicator 70 determines the urgency data values using either a predetermined, fixed estimate of urgency for each of the frequency bands (e.g., as indicated by a probability distribution of the type described above), or an estimate of urgency for each band (in each of the time intervals) generated by forced gap applicator 70 (e.g., based on the playback signal input to applicator 70 and optionally also on history of forced gap insertion by applicator 70).
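As one hypothetical example of an urgency estimate based on the history of forced gap insertion, the sketch below derives a per-band urgency value from the time elapsed since the previous gap in that band. The linear growth, the saturation at 1.0, the 10 second saturation time, and the function name `elapsed_time_urgency` are all assumptions made for illustration; this paragraph does not prescribe a particular mapping.

```python
from typing import List, Sequence


def elapsed_time_urgency(seconds_since_gap: Sequence[float],
                         saturation_s: float = 10.0) -> List[float]:
    """Map elapsed time since the last gap in each band to an urgency in [0, 1]."""
    if saturation_s <= 0.0:
        raise ValueError("saturation_s must be positive")
    # Urgency rises linearly with elapsed time and is capped at 1.0.
    return [min(max(s, 0.0) / saturation_s, 1.0) for s in seconds_since_gap]


# Band 0 has just had a gap; band 2 has gone a long time without one.
urgency_values = elapsed_time_urgency([0.0, 5.0, 25.0])
```

Values of this kind could serve as the U.sub.k terms of the probability distribution described above, so that bands whose estimates have gone stale are selected more often.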
(48)
(49) Pervasive listening subsystem 72 (which, in some implementations, is a pervasive listening application running on an appropriately programmed processor) is coupled and configured to monitor non-playback sound in playback environment E. The non-playback sound is sound other than playback sound emitted from speaker system S (in response to the speaker feed(s) asserted thereto) in environment E. Specifically, microphone M in environment E captures sound in the environment E, and asserts to subsystem 72 a microphone output signal Mic indicative of the captured sound. The captured sound includes playback sound emitted from speaker system S, and non-playback sound. The non-playback sound may be or include background noise, and/or speech by (or other sound indicative of activity, or mere presence, of) at least one human user L in the environment E.
(50) By monitoring non-playback sound in environment E which is revealed by forced gaps (i.e., in frequency bands and time intervals corresponding to the forced gaps) inserted in the playback content by forced gap application subsystem 70 in accordance with the invention, the performance of subsystem 72 is improved relative to the performance which it could attain without insertion of the forced gaps.
(51) Pervasive listening subsystem 72 is also coupled and configured to perform audio signal processing (e.g., noise compensation) on an input audio signal (typically comprising K channels of playback content) to generate a processed audio playback signal (typically comprising K channels of processed playback content) which is input to forced gap applicator 70 (e.g., for the purpose of improving in some respect the audio signal playback by speaker system S) in response to at least one result of the monitoring performed by said subsystem 72. The processed audio playback signal is provided to forced gap applicator 70, and the output of the forced gap applicator is (or is used to generate) a set of K speaker feeds which is asserted to speaker subsystem S. One example of an implementation of subsystem 72 is noise compensation subsystem 62 together with noise estimation subsystem 64 of the
(52)
(53) In the
(54) In some implementations of subsystem 73, the urgency signal U is indicative of a fixed (time invariant) urgency value set [U.sub.0, U.sub.1, . . . U.sub.N] determined by a probability distribution defining a probability of gap insertion for each of the N frequency bands. Thus, in response to such a fixed urgency value set, subsystem 70 operates to insert fewer forced gaps (on the average) in those bands which have lower urgency values (i.e., lower probability values determined by the probability distribution), and to insert more forced gaps (on the average) in those bands which have higher urgency values (i.e., higher probability values). In some implementations of subsystem 73, the urgency signal U is indicative of a sequence of urgency value sets [U.sub.0, U.sub.1, . . . U.sub.N], e.g., a different urgency value set for each different time in the sequence. Each such different urgency value set may be determined by a different probability distribution for each of the different times. Various examples of urgency signal U and urgency values indicated thereby are described in more detail below.
(55) The
(56)
(57) Noise compensation systems (e.g., that of
(58) Typical embodiments of forced gap applicator 70 of the
(59) Without forced gap application subsystem 70, the
(60) Although some implementations of the
(61) In
(62) By use of forced gap applicator 70, the number of gaps in each channel of the compensated playback signal (output from noise compensation subsystem 62 of the
(63) In the system of any of
(64) We next describe
(65) Noise estimation subsystem 37 of
(66) Subsystem 70 of
(67) Subsystem 62 of
(68) The
(69) Speaker system S (including at least one speaker) is coupled and configured to emit sound (in playback environment E) in response to playback signal 25. Signal 25 may consist of a single playback channel, or it may consist of two or more playback channels. In typical operation, each speaker of speaker system S receives a speaker feed indicative of the playback content of a different channel of signal 25, and speaker system S emits sound (in playback environment E) in response to the speaker feed(s). The sound is perceived by user (a human listener) L (in environment E) as a noise-compensated version of the playback content of input signal 23.
(70) Next, with reference to
(71) A microphone output signal (e.g., signal “Mic” of
(72) Each channel of the playback content (e.g., each channel of noise compensated signal 25 of
(73) The
(74) In the case that signal 25 is a multi-channel signal (comprising Z playback channels), a typical implementation of echo canceller 34 receives (from element 26) multiple streams of frequency-domain playback content values (one stream for each channel), and adapts a filter W′.sub.i (corresponding to filter W′ of
(75) The filter coefficients of each adaptive filter employed by echo canceller 34 to generate the echo cancelled noise estimate values (i.e., each adaptive filter implemented by echo canceller 34 which corresponds to filter W′ of
(76) Optionally, echo canceller 34 is omitted (or does not operate), and thus no adaptive filter values are provided to banding element 36, and no banded adaptive filter values are provided from 36 to subsystem 43. In this case, subsystem 43 generates the gain values G in one of the ways (described below) without use of banded adaptive filter values.
(77) If an echo canceller is used (i.e. if the
(78) If no echo canceller is used (i.e., if echo canceller 34 is omitted or does not operate), the values M′res (in the description herein of
(79) In typical implementations (including that shown in
(80) In the implementation shown in
(81) minima in the values Mres (the echo canceller residual) can confidently be considered to indicate estimates of noise in the playback environment; and
(82) minima in the M (microphone output signal) values can confidently be considered to indicate estimates of noise in the playback environment.
(83) At times other than during a gap in playback content, minima in the values Mres (or the values M) may not be indicative of accurate estimates of noise in the playback environment.
(84) In response to the microphone output signal (M) and the values of S.sub.min, subsystem 16 generates gap confidence values. Sample aggregator subsystem 20 is configured to use the values of M.sub.resmin (or the values of M, in the case that no echo cancellation is performed) as candidate noise estimates, and to use the gap confidence values (generated by subsystem 16) as indications of the reliability of the candidate noise estimates.
(85) More specifically, sample aggregator subsystem 20 of
(86) A simple example of subsystem 20 is a minimum follower (of gap confidence weighted samples), e.g., a minimum follower that includes candidate samples (values of M.sub.resmin) in the analysis window only if the associated gap confidence is equal to or greater than a predetermined threshold value (i.e., subsystem 20 assigns a weight of one to a sample M.sub.resmin if the gap confidence for the sample is equal to or greater than the threshold value, and subsystem 20 assigns a weight of zero to a sample M.sub.resmin if the gap confidence for the sample is less than the threshold value). Other implementations of subsystem 20 aggregate the gap confidence weighted samples (values of M.sub.resmin, each weighted by a corresponding one of the gap confidence values, in an analysis window) in another manner (e.g., by determining an average of them). An exemplary implementation of subsystem 20 which aggregates gap confidence weighted samples is (or includes) a linear interpolator/one pole smoother with an update rate controlled by the gap confidence values.
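The two kinds of aggregator described above can be sketched as follows, for a single frequency band. The class `GatedMinimumFollower` keeps the most recent samples in an analysis window together with their weights (one or zero) and reports the minimum of the samples having a weight of one. The function `confidence_smoother` is a one pole smoother whose update rate scales with the gap confidence. The names, the window length expressed as a number of samples, the assumption that gap confidence lies in the range from 0 to 1, and the `max_rate` parameter are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class GatedMinimumFollower:
    """Minimum follower over samples whose gap confidence meets a threshold."""

    def __init__(self, window_len: int, confidence_threshold: float) -> None:
        # Each entry is (sample, weight); old entries fall out of the window.
        self.window: Deque[Tuple[float, int]] = deque(maxlen=window_len)
        self.threshold = confidence_threshold

    def update(self, sample: float, gap_confidence: float) -> Optional[float]:
        weight = 1 if gap_confidence >= self.threshold else 0
        self.window.append((sample, weight))
        accepted = [s for s, w in self.window if w == 1]
        # None means no trustworthy sample is currently in the window.
        return min(accepted) if accepted else None


def confidence_smoother(previous: float, sample: float, gap_confidence: float,
                        max_rate: float = 0.5) -> float:
    """One pole smoother: move toward the sample at a rate set by gap confidence."""
    alpha = max_rate * gap_confidence
    return previous + alpha * (sample - previous)


follower = GatedMinimumFollower(window_len=3, confidence_threshold=0.5)
outputs = [follower.update(s, c) for s, c in
           [(-40.0, 0.9), (-60.0, 0.1), (-45.0, 0.8), (-30.0, 0.7)]]
```

In the example, the low sample of −60 is given a weight of zero because its gap confidence is below the threshold, so it never becomes the reported minimum.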
(87) Subsystem 20 may employ strategies that ignore gap confidence at times when incoming samples (values of M.sub.resmin) are lower than the current noise estimate (determined by subsystem 20), in order to track drops in noise conditions even if no gaps are available.
(88) Preferably, subsystem 20 is configured to effectively hold onto noise estimates during intervals of low gap confidence until new sampling opportunities arise as determined by the gap confidence. For example, in a preferred implementation of subsystem 20, when subsystem 20 determines a current noise estimate (in one analysis window) and then the gap confidence values (generated by subsystem 16) indicate low confidence that there is a gap in playback content (e.g., the gap confidence values indicate gap confidence below a predetermined threshold value), subsystem 20 continues to output that current noise estimate until (in a new analysis window) the gap confidence values indicate higher confidence that there is a gap in playback content (e.g., the gap confidence values indicate gap confidence above the threshold value), at which time subsystem 20 generates (and outputs) an updated noise estimate. By so using gap confidence values to generate noise estimates (including by holding onto noise estimates during intervals of low gap confidence until new sampling opportunities arise as determined by the gap confidence) in accordance with preferred embodiments of the invention, rather than relying only on candidate noise estimate values output from minimum follower 14 as a sequence of noise estimates (without determining and using gap confidence values) or otherwise generating noise estimates in a conventional manner, the length for all employed minimum follower analysis windows (i.e., τ1, the analysis window length of each of minimum followers 13 and 14, and τ2, the analysis window length of aggregator 20, if aggregator 20 is implemented as a minimum follower of gap confidence weighted samples) can be reduced by about an order of magnitude over traditional approaches, improving the speed at which the noise estimation system can track the noise conditions when gaps do arise.
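By way of illustration only, the thresholded minimum follower and the hold behavior described above for subsystem 20 might be sketched as follows. The class name, the parameter names, and the choice to age the analysis window by sample count are assumptions made for this sketch and do not come from the described embodiments.

```python
from collections import deque


class GatedMinimumFollower:
    """Minimum follower over candidates whose gap confidence passes a threshold.

    Hypothetical sketch: a candidate gets weight one when its gap confidence
    is at or above the threshold, and weight zero otherwise.
    """

    def __init__(self, window_len, confidence_threshold, initial_estimate=None):
        # Rejected samples are stored as None so that the window still ages.
        self.window = deque(maxlen=window_len)
        self.threshold = confidence_threshold
        self.estimate = initial_estimate

    def update(self, candidate, gap_confidence):
        """Feed one candidate noise value and return the current estimate."""
        if gap_confidence >= self.threshold:
            self.window.append(candidate)
            accepted = [v for v in self.window if v is not None]
            self.estimate = min(accepted)
        else:
            # Low confidence: age the window but hold the previous estimate.
            self.window.append(None)
        return self.estimate


follower = GatedMinimumFollower(window_len=3, confidence_threshold=0.5)
outputs = [
    follower.update(10.0, 0.9),  # accepted
    follower.update(2.0, 0.1),   # rejected, estimate held
    follower.update(8.0, 0.8),   # accepted
    follower.update(9.0, 0.9),   # accepted, oldest sample leaves the window
    follower.update(1.0, 0.0),   # rejected, estimate held
    follower.update(12.0, 1.0),  # accepted
]
```

Because rejected samples never enter the minimum, the window can be kept short without low-confidence minima corrupting the estimate, which is the property the paragraph above relies on.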
(89) As noted herein, noise estimator 37 is preferably also configured to generate and report (to forced gap applicator 70) an urgency signal U indicative of urgency values. Examples of such an urgency signal (and the manner in which such examples may be generated) are described herein.
(90) With reference again to
(91) The noise compensated playback content 25 is transformed (in element 26), and downmixed and frequency banded (in element 27) to produce the values S′. The microphone output signal is transformed (in element 32) and banded (in element 33) to produce the values M′. If an echo canceller (34) is employed, the residual signal (echo cancelled noise estimate values) from the echo canceller is banded (in element 35) to produce the values Mres′.
(92) Subsystem 43 determines the calibration gain G (for each frequency band) in accordance with a microphone to digital mapping, which captures the level difference per frequency band between the playback content in the digital domain at the point (e.g., the output of time-to-frequency domain transform element 26) it is tapped off and provided to the noise estimator, and the playback content as received by the microphone. Each set of current values of the gain G is provided from subsystem 43 to noise estimator 37.
(93) Subsystem 43 has access to at least one of the following three sources of data:
(94) factory preset gains (stored in memory 40);
(95) the state of the gains G generated (by subsystem 43) during the previous session (and stored in memory 41);
(96) if an AEC (e.g., echo canceller 34) is present and in use, banded AEC filter coefficient energies (e.g., those which determine the adaptive filter, corresponding to filter W′ of
(97) If no AEC is employed (e.g., if a version of the
(98) Thus, in some embodiments, subsystem 43 is configured such that the
(99) With reference again to
(100) imputation of missing noise estimate values from a partially updated noise estimate;
(101) constraining of the shape of the current noise estimate to preserve timbre; and
(102) constraining of the absolute value of current noise estimate.
(103) The microphone to digital mapping performed by subsystem 43 to determine the gain values G captures the level difference (per frequency band) between the playback content in the digital domain (e.g., the output of time-to-frequency domain transform element 26) at the point it is tapped off for provision to the noise estimator, and the playback content as received by the microphone. The mapping is primarily determined by the physical separation and characteristics of the speaker system and microphone, as well as the electrical amplification gains used in the reproduction of sound and microphone signal amplification.
(104) In the most basic instance, the microphone to digital mapping may be a pre-stored factory tuning, measured during production design over a sample of devices, and re-used for all such devices being produced.
(105) When an AEC (e.g., echo canceller 34 of
(106) While estimated gains G′ can substitute for factory determined gains, a robust approach to determining the gain G for each band, that combines both factory gains and the online estimated gains G′, is the following:
G=max(min(G′,F+L),F−L)
where F is the factory gain for the band, G′ is the estimated gain for the band, and L is a maximum allowed deviation from the factory settings. All gains are in dB. If a value G′ exceeds the indicated range for a long period of time, this may indicate faulty hardware, and the noise compensation system may decide to fall back to safe behaviour.
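The clamping expression above can be checked with a short sketch. The function name and the example values are hypothetical; only the formula itself is taken from the text.

```python
def combine_gain_db(estimated_db, factory_db, max_deviation_db):
    """Clamp the online gain estimate G' to within +/- L dB of the factory gain F."""
    upper = factory_db + max_deviation_db
    lower = factory_db - max_deviation_db
    return max(min(estimated_db, upper), lower)


def combine_gains_db(estimated, factory, max_deviation_db):
    """Apply the clamp independently to each frequency band."""
    return [combine_gain_db(g, f, max_deviation_db) for g, f in zip(estimated, factory)]


# Factory gain of -20 dB and an allowed deviation of 6 dB in every band.
gains = combine_gains_db([-12.0, -22.0, -30.0], [-20.0, -20.0, -20.0], 6.0)
```

In this example the first and third estimates fall outside the allowed range and are pulled back to its edges, while the second passes through unchanged.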
(107) A higher quality noise compensation experience can be maintained using a post-processing step performed (e.g., by element 39 of the
(108) An aspect of some embodiments of the inventive noise estimation method and system is post-processing (e.g., performed by an implementation of element 39 of the
(109) Stale value imputation may not be necessary in embodiments where a sufficient number of gaps (including forced gaps inserted by operation of forced gap applicator 70) are constantly available, and bands are rarely stale.
(110) As noted, operation of forced gap applicator 70 may cause a sufficient number of gaps (including forced gaps) in content 25 to be present to allow implementation of a version of the
(111) saves a large amount of development time, as echo cancellers demand a large amount of time and research to tune to ensure cancellation performance and stability;
(112) saves computation time, as large adaptive filter banks (for implementing echo cancellation) typically consume large resources and often require high precision arithmetic to run; and
(113) removes the need for shared clock domain and time alignment between the microphone signal and the playback audio signal. Echo cancellation relies on both playback and recording signals to be synchronised on the same audio clock.
(114) A noise estimator (implemented in accordance with any of typical embodiments of the invention, e.g., without echo cancellation) can run at an increased block rate/smaller FFT size for further complexity savings. Echo cancellation performed in the frequency domain typically requires a narrow frequency resolution.
(115) When using echo cancellation to generate noise estimates in accordance with some embodiments of the invention (including by insertion of forced gaps into a playback signal), echo canceller performance can be reduced without compromising user experience (when the user listens to noise compensated playback content, implemented using noise estimates generated in accordance with such embodiments of the invention), since the echo canceller need only perform enough cancellation to reveal gaps (including forced gaps) in playback content, and need not maintain a high ERLE for the playback content peaks (“ERLE” here denotes echo return loss enhancement, a measure of how much echo, in dB, is removed by an echo canceller).
(116) We next describe methods (which may be implemented in any of many different embodiments of the inventive pervasive listening method) for determining urgency values or a signal (U) indicative of urgency values.
(117) An urgency value for a frequency band indicates the need for a gap to be forced in the band. We present three strategies for determining urgency values, U.sub.k, where U.sub.k denotes urgency for forced gap insertion in band k, and U denotes a vector containing the urgency values for all bands of a set of B.sub.count frequency bands:
U=[U.sub.0,U.sub.1,U.sub.2, . . . ].
(118) The first strategy (sometimes referred to herein as Method 1) determines fixed urgency values. This method is the simplest, simply allowing the urgency vector U to be a predetermined, fixed quantity. When used with a fixed perceptual freedom metric, this can be used to implement a system that randomly inserts forced gaps over time. The system of
U=[u.sub.0,u.sub.1,u.sub.2, . . . ,u.sub.X]
where X=B.sub.count, and each value u.sub.k (for k in the range from k=1 to k=B.sub.count) is a predetermined, fixed urgency value for the “k”th band. Setting all u.sub.k to 1.0 would express an equal degree of urgency in all frequency bands.
(119) The second strategy (sometimes referred to herein as Method 2) determines urgency values which depend on elapsed time since occurrence of a previous gap. Typically, one can expect that urgency gradually increases over time, and returns to a low value once either a forced or existing gap causes an update in a pervasive listening result (e.g., a background noise estimate update).
(120) Thus, the urgency value U.sub.k in each frequency band (band k) may be the number of seconds since a gap was seen (by a pervasive listener) in band k. Thus:
U.sub.k(t)=min(t−t.sub.g,U.sub.max)
where t.sub.g is the time at which the last gap was seen for band k, and U.sub.max is a tuning parameter which limits urgency to a maximum size. It should be noted that t.sub.g may update based on the presence of gaps originally present in the playback content. Urgency can be computed in this way either by a forced gap applicator (e.g., in the system of
(121) The third strategy (sometimes referred to herein as Method 3) determines urgency values which are event based. In this context, “event based” denotes dependent on some event or activity (or need for information) external to the playback environment, or detected or inferred to have occurred in the playback environment. Urgency determined by a pervasive listening subsystem may vary suddenly with the onset of new user behavior or changes in playback environment conditions. For example, such a change may cause the pervasive listener to have an urgent need to observe background activity in order to make a decision, or to rapidly tailor the playback experience to new conditions, or to implement a change in the general urgency or desired density and time between gaps in each band. Table 2 below provides a number of examples of contexts and scenarios and corresponding event-based changes in urgency:
(122) TABLE 2
CONTEXT | Conditions | Change in Urgency | Examples
User Interface | Some played out audio or other modality has requested verbal or auditory response from the user, without pausing or ducking the played out audio | Increase | Incoming message tone waiting for user to “answer” the question “Is this the song you wanted?” by uttering a response
Environment Scanning | Occasional deeper probe of background noise and what may be going on in the playback environment | Increase | When the pervasive listener has not detected any user speech or button presses for a while, it may listen closely to see if the user is still present.
Request or Metadata Indicating Quality is a Priority | Something from the user, or data available to the pervasive listener, suggests that playback audio should not have forced gaps inserted therein | Decrease | “Alexa” signature voice user says “Play this bit loud and clear”
Predictive Behaviour | Points of content that either heuristically or from population data line up with the times that users want to talk or be heard. | Increase or Decrease | 5s into playback of a new track, expect a “skip” or “turn it up” utterance, or in response to occurrence of offensive language in content look for a parent uttering “stop”
(123) A fourth strategy (sometimes referred to herein as Method 4) determines urgency values using a combination of two or more of Methods 1, 2, and 3. For example, each of Methods 1, 2, and 3 may be combined into a joint strategy, represented by a generic formulation of the following type:
U.sub.k(t)=u.sub.k*min(t−t.sub.g,U.sub.max)*V.sub.k
where u.sub.k is a fixed unitless weighting factor that controls the relative importance of each frequency band, V.sub.k is a scalar value that is modulated in response to changes in context or user behaviour that require a rapid alteration of urgency, and t.sub.g and U.sub.max are defined above. Typically, the values V.sub.k are expected to remain at a value of 1.0 under normal operation.
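A minimal sketch of the joint formulation follows, assuming times are expressed in seconds and that one last-gap time is tracked per band. The function and argument names are hypothetical. Leaving the weight and the event scale at 1.0 reduces the expression to the elapsed-time urgency of Method 2.

```python
def band_urgency(now, last_gap_time, u_max, weight=1.0, event_scale=1.0):
    """Urgency for one band: u_k * min(t - t_g, U_max) * V_k."""
    return weight * min(now - last_gap_time, u_max) * event_scale


def urgency_vector(now, last_gap_times, u_max, weights=None, event_scales=None):
    """Urgency for every band; weights and event scales default to 1.0."""
    count = len(last_gap_times)
    weights = weights if weights is not None else [1.0] * count
    event_scales = event_scales if event_scales is not None else [1.0] * count
    return [
        band_urgency(now, t_g, u_max, w, v)
        for t_g, w, v in zip(last_gap_times, weights, event_scales)
    ]


# Three bands whose last gaps were seen at t = 9 s, 5 s and 29 s; now t = 30 s.
u_time_only = urgency_vector(30.0, [9.0, 5.0, 29.0], u_max=10.0)
# The same bands with the first band de-emphasized and an event doubling urgency.
u_joint = urgency_vector(
    30.0, [9.0, 5.0, 29.0], u_max=10.0,
    weights=[0.5, 1.0, 1.0], event_scales=[2.0, 2.0, 2.0],
)
```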
(124) We next describe methods (which may be implemented in any of many different embodiments of the inventive pervasive listening method) for determining perceptual freedom values (or a signal indicative thereof) for use by a forced gap applicator to insert forced gaps in a playback signal.
(125) In this context, “F” is defined to be a “perceptual freedom” signal indicative of perceptual freedom values, f.sub.k, where each of such perceptual freedom values has a relatively large magnitude when perceptual impact of forcing a gap in a corresponding band k at a point in time is low, and a relatively smaller magnitude (smaller than the relatively large magnitude) when perceptual impact of forcing a gap in the band k at the point in time is high. For example, perceptual freedom value f.sub.k may be the inverse of the perceptual distortion introduced by a forced gap in the “k”th band.
(126) A first strategy determines fixed perceptual freedom values. For example, “F” may be a predetermined, fixed vector:
F=[f.sub.0,f.sub.1,f.sub.2, . . . ,f.sub.X]
where X=B.sub.count (the number of available bands in which a forced gap may be inserted) and value f.sub.k (for k in the range from k=1 to k=B.sub.count) is a predetermined, fixed perceptual freedom value for the “k”th band. A flat structure of f.sub.k=1.0 for all k will treat all bands equally (in the sense that forced gaps will not be inserted preferentially in specific ones of the bands in response to the identical perceptual freedom values). However, different frequency bands have intrinsic differences in perceptibility. In particular, gaps inserted in bands below 1 kHz and above 6 kHz will be more perceptually impactful than those between these frequencies. A fixed perceptual freedom vector that takes this phenomenon into consideration can be effective in some implementations of forced gap insertion.
(127) A second strategy determines perceptual freedom values using a perceptual masking curve. In this strategy, forced gaps inserted into a stream of playback content may be considered a kind of distortion. Choosing frequency bins (or bands) in which to place distortion from among a discrete set of frequency bins is a problem also encountered in the art of information hiding, and lossy audio codecs. Those skilled in the art of information hiding and lossy audio compression will be familiar with the concept of a perceptual masking curve. Such curves help indicate where distortions resulting from the addition of noise would be inaudible to a human listener.
(128) There are many known methods for determining a perceptual masking curve which takes advantage of any number of psychoacoustic effects. For example, two such methods are frequency masking and temporal masking. Examples of such methods are described in Swanson, M. D., Kobayashi, Mei, and Tewfik, Ahmed (1998), Multimedia Data-Embedding and Watermarking Technologies, Proceedings of the IEEE, Vol. 86, Issue 6, pp. 1064-1087.
(129) To compute f.sub.k values in accordance with the second strategy, we introduce a perceptual masking curve, M, which has discrete values across the B.sub.count bands.
f.sub.k=M.sub.k−E.sub.k.
(130) Next, we describe an example embodiment for determining perceptual freedom values in accordance with a perceptual mask computation. In the example, the banded playback content energies (E.sub.k) are:
E=[E.sub.0,E.sub.1,E.sub.2, . . . ], and
the aim is to produce masking threshold values (M.sub.k) for the bands:
M=[M.sub.0,M.sub.1,M.sub.2, . . . ]
such that the difference M.sub.k−E.sub.k (for the “k”th band), which is the perceptual freedom value f.sub.k for the “k”th band, is a value inversely proportional to the perceptibility of a forced gap in the “k”th band. The definition of a masking threshold here does not promise the imperceptibility of inserting a forced gap. It is well known how to use masking curves in scenarios where imperceptibility has been proven and demonstrated with controlled signals and conditions. However, the computation of perceptual freedom only requires that the curves are indicative of imperceptibility, not normative.
(131) Loud signals have the ability to mask quieter signals nearby in frequency, a phenomenon known as “frequency masking” (or “spectral masking” or “simultaneous masking”). In the example, the concept of frequency masking is applied to the banded energies E, to determine masking threshold values M.sub.k by spreading the energies in accordance with the following algorithm:
M.sub.0=E.sub.0,
M.sub.k=max(E.sub.k−1*s.sub.k−1,E.sub.k) for bands k=1,2, . . . ,B.sub.count−1, and
M.sub.k=max(M.sub.k+1*s.sub.k+1,M.sub.k) for bands k=0,1, . . . ,B.sub.count−2,
where the lines are performed sequentially (updating the value “k” in M.sub.k during each performance), and where s.sub.k are spreading factors derived from a psychoacoustic model. The spreading factors are typically proportional to the bandwidths of the corresponding frequency bands. For logarithmically spaced bands with increasing bandwidth, the following simple linear approximation is typically sufficient:
(132)
where
s.sub.k=10.sup.−1.5*s′.sup.
(133) Playback of loud signals has the ability to mask playback of quieter signals which occurs soon thereafter, a phenomenon known as “temporal masking.” In the example, we model temporal masking by a decaying exponential applied to the banded energies. In the example, forward temporal masking is applied to determine masking thresholds M.sub.k,t for masking curves (each curve for a different value of time t), where M.sub.k,t is the masking threshold for frequency band k for the curve for time t, in accordance with the following algorithm, which applies the model with an exponential truncated to T previous values of each of the above-determined masking thresholds M.sub.k:
M.sub.k,t=max(M.sub.k,t,M.sub.k,t-1*e.sup.−α,M.sub.k,t-2*e.sup.−2α, . . . ) for each different band index k,
where the maximum (“max( )”) for each band, k, is taken over the T terms (the value M.sub.k for the time t, and the values M.sub.k for each of the T−1 previous times) for that band. The parameter α in the above expression is the decay rate of the exponential which will depend on the system's block rate/sampling rate. A value of α that achieves a decay rate of 0.1 dB/ms is a sensible default value of α.
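The frequency spreading and forward temporal masking steps might be sketched as follows. Several points are assumptions made for illustration rather than statements from the text: the banded energies are treated as linear power values, the spreading factors are supplied by the caller (their defining expression is not reproduced here), the decay rate is converted to α on a power basis, and the difference M.sub.k−E.sub.k is taken in dB. All function names are hypothetical.

```python
import math


def spread_in_frequency(energies, spreading):
    """Apply the two spreading passes in the order written in the text."""
    count = len(energies)
    mask = list(energies)  # M_0 = E_0
    for k in range(1, count):  # M_k = max(E_{k-1} * s_{k-1}, E_k)
        mask[k] = max(energies[k - 1] * spreading[k - 1], energies[k])
    for k in range(0, count - 1):  # M_k = max(M_{k+1} * s_{k+1}, M_k)
        mask[k] = max(mask[k + 1] * spreading[k + 1], mask[k])
    return mask


def alpha_for_decay(db_per_ms, block_ms):
    """Decay constant per block, assuming thresholds are linear power values."""
    return db_per_ms * block_ms * math.log(10.0) / 10.0


def temporal_mask(history, alpha):
    """Forward masking over T stored curves; history[-1] is the current time."""
    newest = len(history) - 1
    bands = len(history[-1])
    return [
        max(history[newest - age][k] * math.exp(-alpha * age)
            for age in range(len(history)))
        for k in range(bands)
    ]


def perceptual_freedom_db(mask, energies, floor=1e-12):
    """f_k = M_k - E_k, with both terms converted to dB (an assumption)."""
    return [
        10.0 * math.log10(max(m, floor)) - 10.0 * math.log10(max(e, floor))
        for m, e in zip(mask, energies)
    ]


energies = [1.0, 64.0, 1.0, 1.0]
mask = spread_in_frequency(energies, spreading=[0.25, 0.25, 0.25, 0.25])
freedom = perceptual_freedom_db(mask, energies)
decayed = temporal_mask([[8.0, 1.0], [1.0, 2.0]], alpha=math.log(2.0))
```

In the example, the loud second band raises the thresholds of its two neighbors, so those neighbors receive about 12 dB of perceptual freedom, while the loud band itself and the distant fourth band receive none.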
(134) The example method of determining masking thresholds optionally includes a step of emphasizing the masking curve. In this step, the masking curve is emphasized to lift the curve upwards for low-energy bands, which typically achieves good results when the emphasized curves are used for inserting gaps. This final step is optional, and is useful if the (non-emphasized) masking curves are too conservative for the application of forced gaps. A typical implementation of the emphasizing step replaces each previously determined value M.sub.k with the following emphasized value:
M.sub.k=(∜
(135) We next describe typical aspects of probabilistic forced gap insertion implemented in accordance with some embodiments of the invention.
(136) Once urgency and perceptual freedom values have been calculated or otherwise determined, they are combined (in some embodiments of forced gap insertion) to form the (above-mentioned) discrete probability distribution P:
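(137) P.sub.k=P′.sub.k/Σ.sub.j P′.sub.j, where P′.sub.k=δ*U.sub.k+(1−δ)*F.sub.k and the summation is over all the frequency bands,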
in which the parameter δ controls the relative importance of urgency (U.sub.k) over perceptual freedom (F.sub.k). Such a probability distribution is convenient to tune and control.
(138) An example of an algorithm for selecting bands of a playback signal in which to insert forced gaps (using the probability distribution of the previous paragraph), in each frame of analysis, is as follows:
1. Compute or otherwise determine the values U.sub.k and F.sub.k for the current frame of analysis (optionally, limiting the values U.sub.k so that they do not exceed a value U.sub.max);
2. Compute the values P.sub.k (of the distribution P) from which to select (draw) bands for forced gap insertion; and
3. If at least T.sub.p seconds have passed since gaps were last forced:
a. Draw N bands randomly from the distribution P;
b. Discard any bands for which U.sub.k is below a threshold U.sub.min, or for which F.sub.k is below a threshold F.sub.min; and
c. Initiate gap forcing in the bands remaining after steps 3a and 3b.
(139) By randomly selecting from the distribution P, structured patterns of gaps are avoided, which would otherwise create perceptible artifacts of their own. Step 3b ultimately lowers the actual number of gaps created, but has the important advantage of being very easy to tune, and being highly connected to the perceptual cost of the system. Typical defaults for the values of the parameters in the example method, to optimize the general distribution shape for lower perceptible impact and timely response to urgency, are set forth in the following table:
(140)
Parameter | Default | Units | Purpose
Delta (δ) | 0.5 | Unitless [0, 1] | Scale and combine the quantities of urgency and perceptual freedom in the probability distribution P.
U.sub.max | 10.0 | Seconds | Limit the urgency quantity in magnitude.
U.sub.min | 4.0 | Seconds | The minimum period of time that must pass after a gap is seen in a band, before a gap may be inserted in that band.
F.sub.min | — | dB | Minimum level of perceptual freedom required to force a gap; should be tuned based on the specific perceptual masking curves used, and the degree to which impact on the audio can be tolerated in the system.
N | 1 | Unitless | The number of bands that are attempted to be forced at one time.
T.sub.p | 0.0 | Seconds | The amount of time that must elapse since the last gap was forced before the algorithm attempts to force new gaps.
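The band selection procedure, with the tabulated defaults, might be sketched as follows. This is an illustrative reading only. It assumes that the distribution is the normalized blend δ*U.sub.k+(1−δ)*F.sub.k, that negative blended values are clipped to zero before normalization, that draws are made with replacement (so duplicates collapse), and that a band is discarded when its perceptual freedom falls below F.sub.min, consistent with F.sub.min being a minimum required level. The function names are hypothetical, and F.sub.min has no default because the table gives none.

```python
import random


def selection_distribution(urgency, freedom, delta=0.5):
    """Normalize delta*U_k + (1 - delta)*F_k over all bands."""
    raw = [max(0.0, delta * u + (1.0 - delta) * f) for u, f in zip(urgency, freedom)]
    total = sum(raw)
    if total <= 0.0:
        return [0.0] * len(raw)  # nothing to draw from
    return [p / total for p in raw]


def choose_bands(urgency, freedom, *, seconds_since_forced, f_min,
                 delta=0.5, u_max=10.0, u_min=4.0, n=1, t_p=0.0, rng=random):
    """Return the sorted band indices in which to start forcing a gap."""
    if seconds_since_forced < t_p:
        return []  # step 3: too soon since gaps were last forced
    limited = [min(u, u_max) for u in urgency]               # step 1
    p = selection_distribution(limited, freedom, delta)      # step 2
    if not any(p):
        return []
    drawn = rng.choices(range(len(p)), weights=p, k=n)       # step 3a
    kept = {k for k in drawn
            if limited[k] >= u_min and freedom[k] >= f_min}  # step 3b
    return sorted(kept)                                      # step 3c


rng = random.Random(0)
# Only band 1 has any urgency or freedom, so it is the only band that can be drawn.
picked = choose_bands([0.0, 8.0], [0.0, 2.0], seconds_since_forced=1.0,
                      f_min=1.0, rng=rng)
# The same draw is discarded when the required freedom is raised above 2.0.
rejected = choose_bands([0.0, 8.0], [0.0, 2.0], seconds_since_forced=1.0,
                        f_min=3.0, rng=rng)
```

Passing a seeded `random.Random` instance makes the draws repeatable, which helps when tuning δ and the two thresholds.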
(141) Next, with reference to
(142) The output of subsystem 81 is provided to probability distribution subsystem 82, which is configured to determine a probability distribution, P (e.g., a fixed, time-invariant distribution, or a distribution which is updated at times corresponding to different time intervals of the mono feed). In accordance with the probability distribution, a set of N of the frequency bands (e.g., a set of N of the bands for each time interval of the mono feed) can be drawn randomly by subsystem 83, so that subsystem 84 can insert forced gaps in each set of drawn bands. Subsystem 82 is typically configured to generate (and optionally to update, for each of a number of different time intervals of the mono feed) the probability distribution P to be a distribution of the following form (which is described above in this disclosure):
(143) P.sub.k=P′.sub.k/Σ.sub.j P′.sub.j
where
P′.sub.k=δ*U.sub.k+(1−δ)*F.sub.k,
and where F.sub.k are perceptual freedom values determined by subsystem 81 (e.g., for the relevant time interval), U.sub.k are values indicative of urgency for each band (i.e., U.sub.k is the urgency value for the “k”th band), P′.sub.k is the (non-normalized) probability of selection of the “k”th band, δ is a parameter indicative of relative importance of urgency and perceptual freedom considerations, and the summation is over all the frequency bands (so that P.sub.k is the normalized version of P′.sub.k for the “k”th band).
(144) In some implementations, a banded urgency signal U, indicative of the urgency values, U.sub.k, (e.g., for a time interval of the playback signal) is provided to subsystem 82 from an external source (e.g., pervasive listener subsystem 73 of
(145) Subsystem 83 is coupled and configured to select (draw) a set of N bands randomly (once for each time interval of the mono feed) from the probability distribution P determined by subsystem 82 (for the corresponding time interval), and typically also to check that the bands of each set of drawn bands satisfy minimum requirements F.sub.min and U.sub.min (of the type described above). If the urgency value, U.sub.k, or the perceptual freedom value, F.sub.k, corresponding to a drawn band does not satisfy the relevant one of the minimum requirements, F.sub.min and U.sub.min, the band is typically discarded (no forced gap is inserted therein).
(146) Subsystem 83 is configured to notify gap application subsystem 84 of each set of bands (one set for each time interval of the mono feed determined by subsystem 80) into which forced gaps are to be inserted. In response to each such notification, subsystem 84 is configured to insert a forced gap (during the appropriate time interval) in each of the bands which have been notified thereto. The insertion of each forced gap includes computation of the forced gap gains G to be applied, and application of these gains to the K channels of playback content in the appropriate frequency band and time interval (of each channel), thereby inserting a forced gap in each such channel in which non-playback sound may be monitored (by a pervasive listener) during playback.
(147) We next describe typical forced gap application system behavior, assuming different choices of methods for determining urgency values (i.e., above-described Methods 1, 2, 3, and 4) and different choices of methods for determining perceptual freedom values (i.e., the above-described method for determining fixed perceptual freedom values, and the above-described method for determining perceptual freedom values using at least one masking curve). Table 3 (set forth below) compares the typical behavior of a forced gap application system, for the indicated choices of methods of determining urgency and perceptual freedom values.
(148) TABLE 3
Perceptual freedom method | Urgency Fixed (Method 1) | Urgency Grows in Time (Methods 2-4)
Fixed perceptual freedom values | Random forced gaps. | Forced gaps are inserted with purpose as they are needed.
Perceptual Freedom Masking Curve | Opportunistic (e.g., in bands benefiting from frequency masking and/or at times benefiting from temporal masking). It may not be known if an inserted forced gap is needed. Forced gap insertion may be performed in accordance with balancing of a random forced gap density against opportunity for low perceptual cost insertions (e.g., to insert a gap that is not needed but which has low perceptual cost). | Balanced. Forced gap insertion is performed so as to implement a balance of need against perceptual cost. Potentially allows best pervasive listening performance, but possibly at a cost of high tuning complexity.
(149) The following table describes aspects of different embodiments of forced gap insertion, which may rely on different types of masking to insert forced gaps at low perceptual cost. These aspects include factors useful in some embodiments to shape and create perceptual masking curves for computation of perceptual freedom.
(150)
Gap Masking Factor | Characteristics | Notes
Spectral Masking (i.e., Frequency Masking) | Gaps inserted in the spectral ‘shadow’ of (i.e., in bands near to) a peak energy band (a band having a playback content energy peak) at a particular (peak energy) frequency are less likely to be audible than are gaps inserted in bands farther from the peak energy band. This masking can be at least somewhat symmetric in the sense that forced gaps are less likely to be audible both at frequencies just above and just below the frequency of a peak energy band. | (none)
Temporal Masking | At a time when a particularly loud event has just happened, a human listener is likely to be immune to broadband changes (e.g., insertion of forced gaps in a wide range of frequency bands) for a short period of time afterward. This masking is very asymmetric (in that it does not apply at times before occurrence of the loud event). | Forced gaps can desirably be inserted (i.e., at low perceptual cost) at times within a short time interval just after a sudden loud event indicated by playback content.
Rhythm Masking | If a forced gap is inserted, it may be repeated (with low perceptual cost) in accordance with temporal cadence (beat) or time texture of the playback sound. | (none)
Textural Masking | If playback sound has a grainy texture over time and frequency, forced gaps may be inserted at times and frequencies corresponding to or in contrast with the playback sound texture to reduce perceptibility of the inserted gaps. | There would typically need to be some textural gaps indicated by the playback sound, so that forced gaps could be inserted in a similar statistical pattern to that of the existing textural gaps.
(151) Aspects of some embodiments of the present invention include the following:
(152) methods and system for insertion of forced sampling gaps into playback content for the purpose of improved performance of pervasive listening methods (using local microphones to capture both the playback sound and non-playback sound), without a significant perceptual impact to the user;
(153) methods and system for insertion of forced sampling gaps into playback content based on urgency or need to do so;
(154) methods and system for insertion of forced sampling gaps into playback content based on relative perceptual impact through using a masking curve;
(155) methods and system for insertion of forced sampling gaps into playback content based on balancing of relative perceptual impact of gap insertion and urgency for gap insertion;
(156) methods and system for insertion of forced sampling gaps into playback content for the purpose of improved performance of pervasive listening methods (in contrast with barge-in ducking or pausing of the playback audio), whereby defining parameters of a forced sampling gap are determined proportionally to the duration of time that components of a noise estimate have not updated, and/or whereby defining parameters of a forced sampling gap are determined through an optimization process that minimizes perceptual impact of the forced gaps by considering their proximity in time and space to the playback audio signal;
(157) methods and systems that extend noise compensation functionality through the use of forced sampling gaps, whereby the trigger to force the presence of gaps in the playback content is automatically linked to the duration of time elapsed since components of a noise estimate have updated, and/or whereby the trigger to force the presence of gaps in the playback content is requested by a secondary device or by user demand;
(158) methods and systems that extend noise compensation functionality and/or background sound awareness including by forced gap insertion, using a perceptual model for the impact of forced gap insertion, e.g., balanced against accumulated need or desire for insertion of a forced gap.
(159) We next describe examples of operation of embodiments of the inventive system (e.g., the system of
(160) 1. Noise conditions increase, while the estimate is stuck;
(161) 2. Noise conditions decrease, while the estimate is stuck; or
(162) 3. Noise conditions persist, while the estimate is stuck.
(163) In Case 3 (when noise conditions persist), the system will continue to perform compensation in the previously determined manner, but because the system cannot distinguish this case from the other cases, we consider the impact of forcing gaps during Case 3.
(164) Table 4 below sets forth assessments of the three scenarios in which forced gaps are introduced to combat stale noise estimates brought on by a lack of sampling gaps in the playback content.
(165) TABLE 4

| Case | Severity of failure to update noise compensation | Frequency | Duration of gap forcing process | Forced gap perceptibility | Urgency of forced gap insertion |
| --- | --- | --- | --- | --- | --- |
| Case 1: Noise conditions increase | Impaired intelligibility, timbre | Moderately often, based on environment | Brief when enough forced gaps are provided to update the noise estimate. | Low, masked by noise conditions. | Medium |
| Case 2: Noise conditions decrease | Playback annoyingly loud, unnecessarily impacted voice assistant performance | Moderately often, based on environment | Very brief, system unwinds rapidly through positive feedback after gaps introduced | High, no noise conditions for masking. | Very high |
| Case 3: Noise conditions persist | Suitable compensation amount | Highly frequent, the steady state for dense content | Long, throughout the time the system has converged. | High, as the system SNR is maintained by noise compensation as desired. | High |
(166) Cases 1 and 2 are expected to be short-lived events, lasting only as long as it takes for the system to reconverge (using inserted forced gaps) to an accurate noise estimate. Case 1 should reconverge quickly, as even small gaps will help the system find the increased noise conditions. Case 2 should also reconverge quickly, due to the positive feedback in compensation systems that favour lower noise estimates for stability. Case 3 will be the steady state of the system, for as long as the content is dense and poor in gaps. Hence the impact of forced gaps on audio quality should be considered predominantly for Case 3.
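The three cases can be illustrated with a toy Python simulation. It is a sketch under strong simplifying assumptions and is not the estimator described in the specification: the hypothetical function `simulate` assumes that the noise estimate can update only on frames containing a gap, and that within a gap the microphone observes the background noise level directly.

```python
def simulate(true_noise_db, gap_every=None):
    """Track a noise estimate that can only update on frames with a gap.

    true_noise_db: per-frame true background noise level in dB.
    gap_every: force a gap every `gap_every` frames; None means the
               content is dense and no gaps are available.
    Returns the per-frame noise estimate.
    """
    estimate = true_noise_db[0]  # assume the estimate starts out accurate
    history = []
    for frame, noise in enumerate(true_noise_db):
        has_gap = gap_every is not None and frame % gap_every == 0
        if has_gap:
            # Assumption: during a gap the microphone sees the noise directly.
            estimate = noise
        history.append(estimate)
    return history


increase = [40.0] * 10 + [60.0] * 10   # Case 1
decrease = [60.0] * 10 + [40.0] * 10   # Case 2
persist = [50.0] * 20                  # Case 3

stuck = simulate(increase)                    # estimate stays at 40 dB
recovered = simulate(increase, gap_every=4)   # estimate reaches 60 dB
```

Without gaps the estimate remains at its initial value in all three cases, which is why the system cannot tell them apart. With periodic forced gaps the estimate follows the change in Cases 1 and 2, while in Case 3 the gaps confirm an estimate that was already correct.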
(167) Table 4 shows a trend between urgency and the potential perceptibility of forced gaps. Higher urgency generally implies that the system is struggling to hear the background conditions, so the signal to noise ratio (SNR) of the playback content is high. A higher SNR of playback content to background noise provides less masking, making forced gaps more likely to be perceptible.
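The relationship between SNR and masking can be made concrete with a short Python sketch. The function names `snr_db` and `poorly_masked_bands`, and the 20 dB limit, are assumptions for illustration only; the specification does not state a numeric threshold.

```python
import math


def snr_db(playback_power, noise_power, eps=1e-12):
    """Playback-to-noise ratio in dB for one band (eps avoids division by zero)."""
    return 10.0 * math.log10((playback_power + eps) / (noise_power + eps))


def poorly_masked_bands(playback_powers, noise_powers, snr_limit_db=20.0):
    """Return indices of bands whose SNR exceeds an assumed limit, i.e. bands
    where background noise offers little masking for a forced gap."""
    return [
        i
        for i, (p, n) in enumerate(zip(playback_powers, noise_powers))
        if snr_db(p, n) > snr_limit_db
    ]


playback = [1.0, 1.0, 1.0]
noise = [0.5, 0.02, 0.0001]
exposed = poorly_masked_bands(playback, noise)
```

Here only the third band, with roughly 40 dB of SNR, exceeds the assumed limit; a gap forced in that band would be expected to receive the least masking from the background noise.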
(168) Exemplary embodiments of the inventive method include the following:
(169) E1. A pervasive listening method, including steps of:
(170) inserting at least one gap into at least one selected frequency band of an audio playback signal to generate a modified playback signal;
(171) during emission of sound in a playback environment in response to the modified playback signal, using a microphone in the playback environment to generate a microphone output signal, wherein the sound is indicative of playback content of the modified playback signal, and the microphone output signal is indicative of non-playback sound in the playback environment and the playback content; and
(172) monitoring the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal.
(173) E2. The method of E1, wherein each said gap is inserted into a selected frequency band, in a selected time interval, of the audio playback signal, in an effort so that any artifact, in the sound emitted in the playback environment in response to the modified playback signal, resulting from insertion of the gap has low perceptibility to a user in the playback environment, and high identifiability during performance of the monitoring.
(174) E3. The method of E1, wherein each said gap is inserted into a selected frequency band, in a selected time interval, of the audio playback signal such that the sound emitted in the playback environment in response to the modified playback signal is perceivable by the user without any significant artifact resulting from insertion of the gap.
(175) E4. The method of E1, wherein each said gap is inserted into a selected frequency band of the audio playback signal, and each said selected frequency band is determined by selection, from a set of frequency bands of the audio playback signal, implemented using perceptual freedom values indicative of expected perceptual effect of insertion of a gap in each band of the set of frequency bands.
(176) E5. The method of E4, wherein the perceptual freedom values are determined in accordance with at least one frequency masking consideration, such that when one of the perceptual freedom values is a near peak value for a near peak band which is near to a peak energy band of the set of frequency bands, each of the perceptual freedom values, for a band farther from the peak energy band than is said near peak band, is indicative of greater expected perceptual effect than is said near peak value.
(177) E6. The method of E4, wherein the perceptual freedom values are determined in accordance with at least one temporal masking consideration, such that when the audio playback signal is indicative of at least one loud playback sound event, those of the perceptual freedom values for a first time interval of the audio playback signal occurring shortly after the loud playback sound event, are indicative of lower expected perceptual effect than are those of the perceptual freedom values for a second time interval of the audio playback signal, where the second time interval is later than the first time interval.
(178) E7. The method of E1, wherein the pervasive listening method is a noise estimation method, the microphone output signal is indicative of background noise in the playback environment, and the monitoring includes generating an estimate of background noise in the playback environment in response to the modified playback signal and the microphone output signal.
(179) E8. The method of E1, wherein the monitoring includes generation of an estimate of at least one aspect of the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal, and wherein the method also includes a step of:
(180) generating the audio playback signal in response to the estimate of at least one aspect of the non-playback sound in the playback environment.
(181) E9. The method of E1, wherein each said gap is inserted into the playback signal based on need for a gap in at least one frequency band of the playback signal.
(182) E10. The method of E9, wherein each said gap is inserted into the playback signal in response to urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal.
(183) E11. The method of E9, wherein each said gap is inserted into the playback signal in response to urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal, and based on expected perceptual effect of insertion of a gap in said each band of the set of frequency bands of the playback signal.
(184) E12. The method of E9, wherein each said gap is inserted into the playback signal in a manner including balancing of urgency and expected perceptual effect of gap insertion, using urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal, and based on expected perceptual effect of insertion, in at least one specific time interval of the playback signal, of a gap in said each band of the set of frequency bands of the playback signal.
(185) E13. The method of E1, including steps of:
(186) determining a probability distribution indicative of a probability for each band of a set of frequency bands of the playback signal; and
(187) in accordance with the probability distribution, randomly selecting at least one of the frequency bands of the set, and inserting a gap in each of said at least one of the frequency bands.
(188) E14. The method of E13, wherein the probability distribution is based on need for a gap in each said band of the set of frequency bands of the playback signal.
(189) E15. The method of E13, wherein the probability distribution is based on need for a gap, and expected perceptual effect of insertion of the gap, in each said band of the set of frequency bands of the playback signal.
(190) E16. The method of E1, including a step of:
(191) generating urgency values in response to the microphone output signal and the modified playback signal, wherein the urgency values are indicative of need for a gap, in each band of a set of frequency bands of the playback signal, based on elapsed time since occurrence of a previous gap in said each band, and wherein insertion of each gap into the playback signal is at least partially based on the urgency values.
(192) E17. The method of E1, wherein the monitoring of the non-playback sound includes generation of background noise estimates, wherein the method also includes a step of:
(193) generating the audio playback signal in response to the background noise estimates, including by performing noise compensation on an input audio signal in response to the background noise estimates.
(194) E18. A system, including:
(195) a microphone, positioned and configured to generate a microphone output signal during emission of sound in a playback environment, wherein the sound is indicative of playback content of a modified playback signal, and the microphone output signal is indicative of non-playback sound in the playback environment and the playback content;
(196) a forced gap application subsystem, coupled to receive an audio playback signal, and configured to insert at least one gap into at least one selected frequency band of the audio playback signal, thereby generating the modified playback signal; and
(197) a pervasive listening subsystem, coupled to receive the microphone output signal and the modified playback signal, and configured to monitor the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal.
(198) E19. The system of E18, wherein the forced gap application subsystem is configured to insert each said gap into a selected frequency band, in a selected time interval, of the audio playback signal, in an effort so that any artifact, in the sound emitted in the playback environment in response to the modified playback signal, resulting from insertion of the gap has low perceptibility to a user in the playback environment, and high identifiability during performance of the monitoring.
(199) E20. The system of E18, wherein the forced gap application subsystem is configured to insert each said gap into a selected frequency band of the audio playback signal, including by selecting each said selected frequency band, from a set of frequency bands of the audio playback signal, using perceptual freedom values indicative of expected perceptual effect of insertion of a gap in each band of the set of frequency bands.
(200) E21. The system of E20, wherein the perceptual freedom values have been determined in accordance with at least one frequency masking consideration.
(201) E22. The system of E20, wherein the perceptual freedom values have been determined in accordance with at least one temporal masking consideration.
(202) E23. The system of E18, wherein the microphone output signal is indicative of background noise in the playback environment, and the pervasive listening subsystem is configured to generate an estimate of the background noise in the playback environment in response to the modified playback signal and the microphone output signal.
(203) E24. The system of E18, wherein the pervasive listening subsystem is coupled and configured:
(204) to generate an estimate of at least one aspect of the non-playback sound in the playback environment in response to the modified playback signal and the microphone output signal; and
(205) to generate the audio playback signal in response to the estimate of at least one aspect of the non-playback sound in the playback environment.
(206) E25. The system of E18, wherein the forced gap application subsystem is configured to insert each said gap into the playback signal based on need for a gap in at least one frequency band of the playback signal.
(207) E26. The system of E25, wherein the forced gap application subsystem is configured to insert each said gap into the playback signal in response to urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal.
(208) E27. The system of E25, wherein the forced gap application subsystem is configured to insert each said gap into the playback signal in response to urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal, and based on expected perceptual effect of insertion of a gap in said each band of the set of frequency bands of the playback signal.
(209) E28. The system of E25, wherein the forced gap application subsystem is configured to insert each said gap into the playback signal in a manner including balancing of urgency and expected perceptual effect of gap insertion, using urgency values indicative of urgency for gap insertion in each band of a set of frequency bands of the playback signal, and based on expected perceptual effect of insertion, in at least one specific time interval of the playback signal, of a gap in said each band of the set of frequency bands of the playback signal.
(210) E29. The system of E18, wherein the forced gap application subsystem is configured to:
(211) determine a probability distribution indicative of a probability for each band of a set of frequency bands of the playback signal; and
(212) in accordance with the probability distribution, randomly select at least one of the frequency bands of the set, and insert a gap in each of said at least one of the frequency bands.
(213) E30. The system of E29, wherein the probability distribution is based on need for a gap in each said band of the set of frequency bands of the playback signal.
(214) E31. The system of E29, wherein the probability distribution is based on need for a gap, and expected perceptual effect of insertion of the gap, in each said band of the set of frequency bands of the playback signal.
(215) E32. The system of E18, wherein the pervasive listening subsystem is configured to:
(216) generate urgency values in response to the microphone output signal and the modified playback signal, wherein the urgency values are indicative of need for a gap, in each band of a set of frequency bands of the playback signal, based on elapsed time since occurrence of a previous gap in said each band, and wherein the forced gap application subsystem is coupled to receive the urgency values and configured to insert each said gap into the playback signal in a manner at least partially based on the urgency values.
(217) E33. The system of E18, wherein the pervasive listening subsystem is coupled and configured to:
(218) monitor the non-playback sound including by generating background noise estimates, and
(219) generate the audio playback signal in response to the background noise estimates, including by performing noise compensation on an input audio signal in response to the background noise estimates.
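The band selection of embodiments E4, E5, and E13 to E15 (and the corresponding system embodiments) can be sketched in Python as follows. This is an illustrative sketch only. The function names, the exponential decay with distance from the peak energy band, the choice to assign zero freedom to the peak band itself, and the product of urgency and freedom as a weighting are all assumptions, not details stated in the embodiments.

```python
import math
import random


def perceptual_freedom(band_energies, spread=2.0):
    """Higher value = lower expected perceptual effect of a gap.

    Assumption: freedom is greatest in bands adjacent to the peak energy
    band (frequency masking) and decays with distance from it. The peak
    band itself is given zero freedom.
    """
    peak = max(range(len(band_energies)), key=lambda i: band_energies[i])
    freedom = []
    for i in range(len(band_energies)):
        distance = abs(i - peak)
        if distance == 0:
            freedom.append(0.0)
        else:
            freedom.append(math.exp(-(distance - 1) / spread))
    return freedom


def band_probabilities(urgency, freedom):
    """Probability per band, based on need for a gap (urgency) and
    expected perceptual effect (freedom)."""
    weights = [u * f for u, f in zip(urgency, freedom)]
    total = sum(weights)
    if total == 0.0:
        return [0.0] * len(weights)
    return [w / total for w in weights]


def choose_bands(probabilities, count=1, rng=None):
    """Randomly select bands in accordance with the distribution."""
    if rng is None:
        rng = random.Random()
    if sum(probabilities) == 0.0:
        return []
    picks = rng.choices(range(len(probabilities)), weights=probabilities, k=count)
    return sorted(set(picks))


energies = [1.0, 8.0, 2.0, 0.5, 0.1]   # band 1 is the peak energy band
urgency = [0.0, 1.0, 1.0, 1.0, 0.0]    # bands 1 to 3 have stale estimates
probs = band_probabilities(urgency, perceptual_freedom(energies))
selected = choose_bands(probs, count=3, rng=random.Random(0))
```

In this example only bands 2 and 3 can be selected: band 1 is urgent but has zero freedom under the stated assumption, and bands 0 and 4 have no urgency. Band 2, being nearer the peak, receives the larger probability.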
(220) Aspects of the invention include a system or device configured (e.g., programmed) to perform any embodiment of the inventive method, and a tangible computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method or steps thereof. For example, the inventive system can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the inventive method or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform an embodiment of the inventive method (or steps thereof) in response to data asserted thereto.
(221) Some embodiments of the inventive system (e.g., some implementations of the
(222) Another aspect of the invention is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) any embodiment of the inventive method or steps thereof.
(223) While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.