Methods and apparatus for adjusting a level of an audio signal
11296668 · 2022-04-05
Assignee
Inventors
CPC classification
H04S2420/03
ELECTRICITY
H03G3/3005
ELECTRICITY
H03G3/32
ELECTRICITY
H04S3/02
ELECTRICITY
G10L19/008
PHYSICS
H03G9/025
ELECTRICITY
International classification
H03G9/00
ELECTRICITY
H03G3/32
ELECTRICITY
H04S3/02
ELECTRICITY
G10L19/008
PHYSICS
Abstract
The invention relates to methods and apparatus for adjusting a level of an audio signal. An audio signal is divided into a plurality of frequency bands. Modification parameters, comprising amplitude scale factors, are obtained for at least one of the plurality of frequency bands. Gain factors are derived for at least one of the plurality of frequency bands, the gain factors determined based on the amplitude scale factors. The gain factors are smoothed. A level of noise is determined from noise compensation factors. The gain factors are applied to at least one of the frequency bands to generate gain adjusted frequency bands. The level of noise is adjusted based on the gain adjusted frequency bands. At least one of the frequency bands is filtered with a filter generated from filter coefficients. The plurality of frequency bands is synthesized to generate an output audio signal.
Claims
1. A method for adjusting a level of an audio signal in an audio processing apparatus, the method comprising: dividing an audio signal into a plurality of frequency bands; delaying at least one of the plurality of the frequency bands; obtaining modification parameters for at least one of the plurality of frequency bands, the modification parameters comprising amplitude scale factors, each of the amplitude scale factors respectively operating in a frequency band of the plurality of frequency bands and each amplitude scale factor representing an average energy over the frequency band and a time segment; deriving gain factors for at least one of the plurality of frequency bands, the gain factors determined based on the amplitude scale factors; smoothing the gain factors, wherein the smoothing of the gain factors is optional; determining a level of noise from noise compensation factors; applying the gain factors to at least one of the frequency bands to generate gain adjusted frequency bands; adjusting the level of noise based on the gain adjusted frequency bands; and synthesizing the plurality of frequency bands to generate an output audio signal; wherein the gain factors are both time and frequency varying.
2. The method of claim 1, further comprising band smoothing of the amplitude scale factors.
3. An audio processing apparatus for adjusting a level of an audio signal, the audio processing apparatus comprising: an analysis filterbank for dividing an audio signal into a plurality of frequency bands; a delay unit for delaying at least one of the plurality of the frequency bands; a parameter generator for obtaining modification parameters for at least one of the plurality of frequency bands, the modification parameters comprising amplitude scale factors, each amplitude scale factor respectively operating in a frequency band of the plurality of frequency bands and each amplitude scale factor representing an average energy over the frequency band and a time segment; a first processor for deriving gain factors for at least one of the plurality of frequency bands, the gain factors determined based on the amplitude scale factors; a smoother for smoothing the gain factors, wherein the smoothing of the gain factors is optional; a second processor for determining a level of noise from noise compensation factors; a first adjuster for applying the gain factors to at least one of the frequency bands to generate gain adjusted frequency bands; a second adjuster for adjusting the level of noise based on the gain adjusted frequency bands; and a synthesis filterbank for synthesizing the plurality of frequency bands to generate an output audio signal; wherein the gain factors are both time and frequency varying.
4. A non-transitory computer readable medium, storing software instructions for controlling a perceptual loudness of a digital audio signal, which when executed by one or more processors cause performance of the steps of the method of claim 1.
5. The audio processing apparatus of claim 3, further comprising a band smoother for smoothing the amplitude scale factors.
Description
DESCRIPTION OF THE DRAWINGS
BEST MODE FOR CARRYING OUT THE INVENTION
(21) Referring to the example of a feed-forward topology in
(22) In the
(23) The calculations performed by processes or devices 6, 8 and 10 (and by processes or devices 12, 14, 10′ in the
(24) Although the calculation processes or devices 6, 8 and 10 of the
(25) As an aspect of the present invention, in the example of
{circumflex over (N)}[b,t]=Ξ[b,t]N[b,t],
a time-varying, frequency-invariant scale factor Φ[t] scaling of the specific loudness as in the relationship
{circumflex over (N)}[b,t]=Φ[t]N[b,t],
a time-invariant, frequency-varying scale factor Θ[b] scaling of the specific loudness as in the relationship
{circumflex over (N)}[b,t]=Θ[b]N[b,t], or
a scale factor α scaling of the specific loudness of the audio signal as in the relationship
{circumflex over (N)}[b,t]=αN[b,t],
where b is a measure of frequency (e.g., the band number) and t is a measure of time (e.g., the block number). Multiple scalings may also be employed, using multiple instances of a particular scaling and/or combinations of particular scalings. Examples of such multiple scalings are given below. In some cases, as explained further below, the scaling may be a function of the audio signal or measure of the audio signal. In other cases, also as explained further below, when the scaling is not a function of a measure of the audio signal, the scaling may be otherwise determined or supplied. For example, a user could select or apply a time- and frequency-invariant scale factor α or a time-invariant, frequency-varying scale factor Θ[b] scaling.
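The four scalings above can be illustrated as elementwise products with different broadcasting over the band and block dimensions. The sketch below is illustrative only; the array shapes and scale-factor values are arbitrary assumptions:

```python
import numpy as np

# Toy specific loudness N[b, t]: B bands by T time blocks (values illustrative).
B, T = 4, 3
N = np.ones((B, T))

# Time- and frequency-varying scaling Xi[b, t].
Xi = np.full((B, T), 2.0)
N_hat_bt = Xi * N

# Time-varying, frequency-invariant scaling Phi[t], broadcast across bands.
Phi = np.array([0.5, 1.0, 2.0])
N_hat_t = Phi[np.newaxis, :] * N

# Time-invariant, frequency-varying scaling Theta[b], broadcast across blocks.
Theta = np.array([1.0, 2.0, 3.0, 4.0])
N_hat_b = Theta[:, np.newaxis] * N

# Time- and frequency-invariant scale factor alpha.
alpha = 0.25
N_hat = alpha * N
```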
(26) Thus, the target specific loudness may be expressed as one or more functions F of the audio signal or measure of the audio signal (the specific loudness being one possible measure of the audio signal):
{circumflex over (N)}[b,t]=F(N[b,t]).
(27) Provided that the function or functions F is invertible, the specific loudness N[b,t] of the unmodified audio signal may be calculated as the inverse function or functions F.sup.−1 of the target specific loudness {circumflex over (N)}[b,t]:
N[b,t]=F.sup.−1({circumflex over (N)}[b,t])
(28) As will be seen below, the inverse function or functions F.sup.−1 is calculated in the feedback and hybrid feed-forward/feedback examples of
(29) A “Select Function(s) and Function Parameter(s)” input for Calculate Target Specific Loudness 6 is shown to indicate that the device or process 6 may calculate the target specific loudness by applying one or more functions in accordance with one or more function parameters. For example, the Calculate Target Specific Loudness 6 may calculate the function or functions “F” of the specific loudness of the audio signal in order to define the target specific loudness. For example, the “Select Function(s) and Function Parameter(s)” input may select one or more particular functions that fall into one or more of the above types of scaling, along with one or more function parameters, such as constants (e.g., scale factors) pertaining to the functions.
(30) The scaling factors associated with a scaling may serve as a representation of the target specific loudness inasmuch as the target specific loudness may be computed as a scaling of the specific loudness, as indicated above. Thus, in the
(31) Whether employing a lookup table, a closed-form mathematical expression, or some other technique, the operation of the Generate Modification Parameters 4 (and its counterpart processes or devices 4′, 4″ and 4′″ in each of the
(32) In a playback environment having a secondary interfering audio signal, such as noise, the Calculate Modification Parameters 10 (and its counterpart processes or devices 10′, 10″ and 10′″ in each of the
(33) As mentioned above, in each of the
(34) Here and elsewhere in this document, the use of the same reference numeral indicates that the device or process may be substantially identical to another or others bearing the same reference numeral. Reference numerals bearing primes (e.g., “10′”) indicate that the device or process is similar in structure or function to, but may be a modification of, another or others bearing the same basic reference numeral or primed versions thereof.
(35) Under certain constraints, a nearly equivalent feedback arrangement of the feed-forward example of
(36) With the constraint that the function or functions F is invertible, the process or device 12 estimates the specific loudness of the unmodified audio signal by applying the inverse function F.sup.−1 to the specific loudness or partial specific loudness of the modified audio signal. The device or process 12 may calculate an inverse function F.sup.−1, as described above. This is indicated schematically in
(37) As mentioned above, in a playback environment having a secondary interfering audio signal, such as noise, the Calculate Modification Parameters 10,′ the Calculate Approximation of Specific Loudness of Unmodified Audio 12, and the Calculate Approximation of Target Specific Loudness 14 may each also receive as an optional input a measure of such a secondary interfering audio signal or the secondary interfering signal itself as one of its inputs and process or device 12 and process or device 14 may each calculate the partial specific loudness of the modified audio signal. Such optional inputs are shown in
(38) As mentioned above, hybrid feed-forward/feedback implementation examples of aspects of the invention are possible.
(39) In the
(40) The Calculate Modification Parameters 10″ may employ an error detecting device or function, such that differences between its target specific loudness and target specific loudness approximation inputs adjust the Modification Parameters so as to reduce the differences between the approximation of the target specific loudness and the “actual” target specific loudness. Such adjustments reduce the differences between the specific loudness of the unmodified audio signal and the target specific loudness, which may be implicit. Thus, the modification parameters M may be updated based on an error between the target specific loudness, computed in the feed-forward path from the specific loudness of the original audio using the function F, and the target specific loudness approximation computed in the feedback path from the specific loudness or partial specific loudness of the modified audio.
(41) In the
(42) As in the examples of
(43) The Calculate Modification Parameters 10′″ may employ an error detecting device or function, such that differences between its specific loudness and specific loudness approximation inputs produce outputs that adjust the Modification Parameters so as to reduce the differences between the approximation of the specific loudness and the “actual” specific loudness. Because the approximation of the specific loudness is derived from the specific loudness or partial specific loudness of the modified audio, which can be viewed as an approximation of the target specific loudness, such adjustments reduce the differences between the specific loudness of the modified audio signal and the target specific loudness, which is inherent in the function or functions F. Thus, the modification parameters M may be updated based on an error between the specific loudness, computed in the feed-forward path from the original audio, and the specific loudness approximation computed, using the inverse function or functions F.sup.−1, in the feedback path from the specific loudness or partial specific loudness of the modified audio. Due to the feedback path, practical implementations may introduce a delay between the update and application of the modification parameters.
(44) Although the modification parameters M in the examples of
(45) Although not critical or essential to aspects of the present invention, calculation of the specific loudness of the audio signal or the modified audio signal may advantageously employ techniques set forth in said International Patent Application No. PCT/US2004/016964, published as WO 2004/111964 A2, wherein the calculating selects, from a group of two or more specific loudness model functions, one or a combination of two or more of the specific loudness model functions, the selection of which is controlled by the measure of characteristics of the input audio signal. The description of Specific Loudness 104 of
(46) In accordance with further aspects of the invention, the unmodified audio signal and either (1) the modification parameters or (2) the target specific loudness or a representation of the target specific loudness (e.g., scale factors usable in calculating, explicitly or implicitly, target specific loudness) may be stored or transmitted for use, for example, in a temporally and/or spatially separated device or process. The modification parameters, target specific loudness, or representation of the target specific loudness may be determined in any suitable way, as, for example, in one of the feed-forward, feedback, and hybrid feed-forward feedback arrangement examples of
(47) An arrangement such as in the example of
(48) Accordingly, further aspects of the present invention are the provision of a device or process (1) that receives or plays back, from a store or transmit device or process, modification parameters M and applies them to an audio signal that is also received or (2) that receives or plays back, from a store or transmit device or process, a target specific loudness or representation of a target specific loudness, generates modification parameters M by applying the target specific loudness or representation thereof to the audio signal that is also received (or to a measure of the audio signal such as its specific loudness, which may be derived from the audio signal), and applies the modification parameters M to the received audio signal. Such devices or processes may be characterized as decoding processes or decoders; while the devices or processes required to produce the stored or transmitted information may be characterized as encoding processes or encoders. Such encoding processes or encoders are those portions of the
(49) In one aspect of the invention, as in the example of
(50) In another aspect of the invention, as in the example of
(51) When implementing the disclosed invention as a digital system, a feed-forward configuration is the most practical, and examples of such configurations are therefore described below in detail, it being understood that the scope of the invention is not so limited.
(52) Throughout this document, terms such as “filter” or “filterbank” are used herein to include essentially any form of recursive and non-recursive filtering such as IIR filters or transforms, and “filtered” information is the result of applying such filters. Embodiments described below employ filterbanks implemented by transforms.
(54) In practical embodiments, processing of the audio may be performed in the digital domain. Accordingly, the audio input signal is denoted by the discrete time sequence x[n] which has been sampled from the audio source at some sampling frequency f.sub.s. It is assumed that the sequence x[n] has been appropriately scaled so that the rms power of x[n] in decibels given by
10 log.sub.10((1/M)Σ.sub.n=0.sup.M−1x.sup.2[n]), with M the number of samples averaged,
is equal to the sound pressure level in dB at which the audio is being auditioned by a human listener. In addition, the audio signal is assumed to be monophonic for simplicity of exposition.
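The assumed calibration of x[n] can be sketched as follows. The 1 kHz test tone, the 75 dB target level, and the scaling step are hypothetical illustrations of this convention, not part of the invention:

```python
import numpy as np

fs = 44100
n = np.arange(fs)
x = 0.3 * np.sin(2 * np.pi * 1000 * n / fs)  # one second of a 1 kHz tone

def rms_db(x):
    # rms power of x[n] in decibels
    return 10 * np.log10(np.mean(x ** 2))

# Assume the audio is auditioned at 75 dB SPL; scale x so that its
# rms power in dB matches that level.
target_spl = 75.0
gain = 10 ** ((target_spl - rms_db(x)) / 20)
x_cal = gain * x
```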
(56) Analysis Filterbank 100, Transmission Filter 101, Excitation 102, Specific Loudness 104, Specific Loudness Modification 105, Gain Solver 106, and Synthesis Filterbank 110 may be described in greater detail as follows.
Analysis Filterbank 100
(57) The audio input signal is applied to an analysis filterbank or filterbank function (“Analysis Filterbank”) 100. Each filter in Analysis Filterbank 100 is designed to simulate the frequency response at a particular location along the basilar membrane in the inner ear. The Filterbank 100 may include a set of linear filters whose bandwidth and spacing are constant on the Equivalent Rectangular Bandwidth (ERB) frequency scale, as defined by Moore, Glasberg and Baer (B. C. J. Moore, B. Glasberg, T. Baer, “A Model for the Prediction of Thresholds, Loudness, and Partial Loudness,” supra).
(58) Although the ERB frequency scale more closely matches human perception and shows improved performance in producing objective loudness measurements that match subjective loudness results, the Bark frequency scale may be employed with reduced performance.
(59) For a center frequency f in hertz, the width of one ERB band in hertz may be approximated as:
ERB(f)=24.7(4.37f/1000+1) (1)
(60) From this relation a warped frequency scale is defined such that at any point along the warped scale, the corresponding ERB in units of the warped scale is equal to one. The function for converting from linear frequency in hertz to this ERB frequency scale is obtained by integrating the reciprocal of Eqn. 1:
HzToERB(f)=21.4 log.sub.10(4.37f/1000+1) (2a)
(62) It is also useful to express the transformation from the ERB scale back to the linear frequency scale by solving Eqn. 2a for f:
ERBToHz(e)=(1000/4.37)(10.sup.e/21.4−1), (2b)
(64) where e is in units of the ERB scale.
(65) The Analysis Filterbank 100 may include B auditory filters, referred to as bands, at center frequencies f.sub.c[1] . . . f.sub.c[B] spaced uniformly along the ERB scale. More specifically,
f.sub.c[1]=f.sub.min (3a)
f.sub.c[b]=f.sub.c[b−1]+ERBToHz(HzToERB(f.sub.c[b−1])+Δ) b=2 . . . B (3b)
f.sub.c[B]<f.sub.max, (3c)
where Δ is the desired ERB spacing of the Analysis Filterbank 100, and where f.sub.min and f.sub.max are the desired minimum and maximum center frequencies, respectively. One may choose Δ=1, and taking into account the frequency range over which the human ear is sensitive, one may set f.sub.min=50 Hz and f.sub.max=20,000 Hz. With such parameters, for example, application of Eqns. 3a-c yields B=40 auditory filters.
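The band-spacing recursion of Eqns. 3a-c can be sketched directly. The helper names below are hypothetical; the constant 21.4 comes from the standard Glasberg-Moore integration of the reciprocal of Eqn. 1:

```python
import math

def hz_to_erb(f):
    # ERB-scale value of frequency f in hertz (Eqn. 2a).
    return 21.4 * math.log10(4.37 * f / 1000 + 1)

def erb_to_hz(e):
    # Inverse transformation back to hertz (Eqn. 2b).
    return (10 ** (e / 21.4) - 1) * 1000 / 4.37

def band_centers(f_min=50.0, f_max=20000.0, delta=1.0):
    # Eqns. 3a-c: centers spaced every delta ERB starting at f_min,
    # stopping below f_max.
    fc = [f_min]
    while True:
        nxt = erb_to_hz(hz_to_erb(fc[-1]) + delta)
        if nxt >= f_max:
            break
        fc.append(nxt)
    return fc

fc = band_centers()  # with the stated parameters this yields B = 40 bands
```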
(66) The magnitude frequency response of each auditory filter may be characterized by a rounded exponential function, as suggested by Moore and Glasberg. Specifically, the magnitude response of a filter with center frequency f.sub.c[b] may be computed as:
H.sub.b(f)=(1+p.sub.bg)e.sup.−p.sup.bg, (4a)
g=|f−f.sub.c[b]|/f.sub.c[b], (4b)
p.sub.b=4f.sub.c[b]/ERB(f.sub.c[b]), (4c)
The magnitude responses of such B auditory filters, which approximate critical banding on the ERB scale, are shown in
(68) The filtering operations of Analysis Filterbank 100 may be adequately approximated using a finite length Discrete Fourier Transform, commonly referred to as the Short-Time Discrete Fourier Transform (STDFT), because an implementation running the filters at the sampling rate of the audio signal, referred to as a full-rate implementation, is believed to provide more temporal resolution than is necessary for accurate loudness measurements. By using the STDFT instead of a full-rate implementation, an improvement in efficiency and reduction in computational complexity may be achieved.
(69) The STDFT of input audio signal x[n] is defined as:
X[k,t]=Σ.sub.n=0.sup.N−1w[n]x[n+tT]e.sup.−j2πkn/N, (5a)
where k is the frequency index, t is the time block index, N is the DFT size, T is the hop size, and w[n] is a length N window normalized so that
Σ.sub.n=0.sup.N−1w[n]=1. (5b)
(72) Note that the variable t in Eqn. 5a is a discrete index representing the time block of the STDFT as opposed to a measure of time in seconds. Each increment in t represents a hop of T samples along the signal x[n]. Subsequent references to the index t assume this definition. While different parameter settings and window shapes may be used depending upon the details of implementation, for f.sub.s=44100 Hz, choosing N=2048, T=1024, and having w[n] be a Hanning window provides an adequate balance of time and frequency resolution. The STDFT described above may be implemented more efficiently using the Fast Fourier Transform (FFT).
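With the stated choices (f.sub.s=44100 Hz, N=2048, T=1024, Hanning window), the STDFT may be sketched via the FFT. The unit-sum window normalization and the framing details are assumptions of this sketch:

```python
import numpy as np

fs = 44100
N, T = 2048, 1024            # DFT size and hop size
w = np.hanning(N)            # Hanning window w[n]
w = w / w.sum()              # normalize so the window sums to one (assumed)

x = np.random.default_rng(0).standard_normal(fs)  # one second of noise

# X[k, t] for frequency index k and time-block index t, one FFT per hop of T.
n_blocks = (len(x) - N) // T + 1
X = np.stack([np.fft.fft(w * x[t * T : t * T + N])
              for t in range(n_blocks)], axis=1)
```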
(73) Instead of the STDFT, the Modified Discrete Cosine Transform (MDCT) may be utilized to implement the analysis filterbank. The MDCT is a transform commonly used in perceptual audio coders, such as Dolby AC-3. If the disclosed system is implemented with such perceptually coded audio, the disclosed loudness measurement and modification may be more efficiently implemented by processing the existing MDCT coefficients of the coded audio, thereby eliminating the need to perform the analysis filterbank transform. The MDCT of the input audio signal x[n] is given by:
X[k,t]=2Σ.sub.n=0.sup.N−1w[n]x[n+tT]cos((2π/N)(k+½)(n+½+N/4)) (6)
(75) Generally, the hop size T is chosen to be exactly one-half the transform length N so that perfect reconstruction of the signal x[n] is possible.
Transmission Filter 101
(76) The outputs of Analysis Filterbank 100 are applied to a transmission filter or transmission filter function (“Transmission Filter”) 101 which filters each band of the filterbank in accordance with the transmission of audio through the outer and middle ear.
Excitation 102
(77) In order to compute the loudness of the input audio signal, a measure of the audio signal's short-time energy in each filter of the Analysis Filterbank 100 after application of the Transmission Filter 101 is needed. This time and frequency varying measure is referred to as the excitation. The short-time energy output of each filter in Analysis Filterbank 100 may be approximated in Excitation Function 102 through multiplication of filter responses in the frequency domain with the power spectrum of the input signal:
E[b,t]=Σ.sub.k|H.sub.b[k]|.sup.2|P[k]|.sup.2|X[k,t]|.sup.2, (7)
where b is the band number, t is the block number, and H.sub.b[k] and P[k] are the frequency responses of the auditory filter and transmission filter, respectively, sampled at a frequency corresponding to STDFT or MDCT bin index k. It should be noted that forms for the magnitude response of the auditory filters other than that specified in Eqns. 4a-c may be used in Eqn. 7 to achieve similar results. For example, said International Application No. PCT/US2004/016964, published as WO 2004/111964 A2, describes two alternatives: an auditory filter characterized by a 12.sup.th order IIR transfer function, and a low-cost “brick-wall” band pass approximation.
(79) In summary, the output of Excitation Function 102 is a frequency domain representation of energy E in respective ERB bands b per time period t.
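The low-cost "brick-wall" band-pass approximation mentioned above can be sketched as follows: each band's excitation is the sum of the power-spectrum bins lying within half an ERB-scale bandwidth of the band center. The band centers, the white input spectrum, and the one-ERB width are illustrative assumptions:

```python
import numpy as np

fs, N = 44100, 2048
freqs = np.arange(N // 2 + 1) * fs / N  # bin center frequencies in Hz

def hz_to_erb(f):
    return 21.4 * np.log10(4.37 * f / 1000 + 1)

def excitation_brickwall(power_spectrum, centers_hz, width_erb=1.0):
    # "Brick-wall" approximation: each band simply sums the power-spectrum
    # bins lying within width_erb/2 ERB of its center frequency.
    e_bins = hz_to_erb(freqs)
    E = np.empty(len(centers_hz))
    for b, fc in enumerate(centers_hz):
        mask = np.abs(e_bins - hz_to_erb(fc)) <= width_erb / 2
        E[b] = power_spectrum[mask].sum()
    return E

# Example: flat (white) power spectrum across hypothetical band centers.
centers = [125.0, 1000.0, 4000.0]
E = excitation_brickwall(np.ones(N // 2 + 1), centers)
```

Because ERB bandwidth grows with frequency, higher bands collect more bins from a flat spectrum.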
Time Averaging (“Smoothing”) 103
(80) For certain applications of the disclosed invention, as described below, it may be desirable to smooth the excitation E[b,t] prior to its transformation to specific loudness. For example, smoothing may be performed recursively in Smoothing function 103 according to the equation:
Ē[b,t]=λ.sub.bĒ[b,t−1]+(1−λ.sub.b)E[b,t], (8)
where the time constants λ.sub.b at each band b are selected in accordance with the desired application. In most cases the time constants may be advantageously chosen to be proportionate to the integration time of human loudness perception within band b. Watson and Gengel performed experiments demonstrating that this integration time is within the range of 150-175 ms at low frequencies (125-200 Hz) and 40-60 ms at high frequencies (Charles S. Watson and Roy W. Gengel, “Signal Duration and Signal Frequency in Relation to Auditory Sensitivity” Journal of the Acoustical Society of America, Vol. 46, No. 4 (Part 2), 1969, pp. 989-997).
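The recursive smoothing can be sketched as a one-pole filter per band. The linear taper of integration times between the Watson-Gengel endpoints, and the mapping from integration time to λ.sub.b, are assumptions of this sketch:

```python
import numpy as np

fs, T = 44100, 1024
hop_seconds = T / fs  # time between successive excitation blocks

def smoothing_coeff(tau):
    # One-pole coefficient approximating an integration time of tau seconds.
    return np.exp(-hop_seconds / tau)

# Assumed per-band integration times: ~160 ms at low frequencies tapering
# linearly to ~50 ms at high frequencies, for 40 bands.
taus = np.linspace(0.160, 0.050, 40)
lam = smoothing_coeff(taus)

def smooth(E, lam, E_prev):
    # Eqn. 8: E_bar[b,t] = lam_b * E_bar[b,t-1] + (1 - lam_b) * E[b,t]
    return lam * E_prev + (1 - lam) * E
```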
Specific Loudness 104
(81) In the specific loudness converter or conversion function (“Specific Loudness”) 104, each frequency band of the excitation is converted into a component value of the specific loudness, which is measured in sone per ERB.
(82) Initially, in computing specific loudness, the excitation level in each band of Ē[b,t] may be transformed to an equivalent excitation level at 1 kHz as specified by the equal loudness contours of ISO 226:
Ē.sub.1 kHz[b,t]=T.sub.1 kHz(Ē[b,t],f.sub.c[b]), (9)
where T.sub.1 kHz(E, f) is a function that generates the level at 1 kHz, which is equally loud to level E at frequency f. In practice, T.sub.1 kHz(E, f) is implemented as an interpolation of a look-up table of the equal loudness contours, normalized by the transmission filter. Transformation to equivalent levels at 1 kHz simplifies the following specific loudness calculation.
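The interpolation of T.sub.1 kHz may be sketched as a two-stage table interpolation: first along frequency, then inverted along level. The tabulated levels below are hypothetical placeholders, not ISO 226 data:

```python
import numpy as np

# Hypothetical sampled equal-loudness data: level_db[i, j] is the level at
# frequency freqs_hz[j] that is equally loud as ref_levels_db[i] at 1 kHz.
# (A real implementation tabulates the ISO 226 contours, normalized by the
# transmission filter; these numbers are placeholders.)
freqs_hz = np.array([250.0, 1000.0, 4000.0])
ref_levels_db = np.array([20.0, 40.0, 60.0, 80.0])
level_db = np.array([[30.0, 20.0, 18.0],
                     [48.0, 40.0, 37.0],
                     [66.0, 60.0, 56.0],
                     [84.0, 80.0, 76.0]])

def t_1khz(E_db, f_hz):
    # Interpolate each contour at f_hz, then invert along level to find the
    # 1 kHz level that is equally loud to level E_db at frequency f_hz.
    col = np.array([np.interp(f_hz, freqs_hz, row) for row in level_db])
    return np.interp(E_db, col, ref_levels_db)
```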
(83) Next, the specific loudness in each band may be computed as:
N[b,t]=α[b,t]N.sub.NB[b,t]+(1−α[b,t])N.sub.WB[b,t], (10)
where N.sub.NB[b,t] and N.sub.WB[b,t] are specific loudness values based on a narrowband and wideband signal model, respectively. The value α[b,t] is an interpolation factor lying between 0 and 1 that is computed from the audio signal. Said International Application No. PCT/US2004/016964, published as WO 2004/111964 A2, describes a technique for calculating α[b,t] from the spectral flatness of the excitation. It also describes “narrowband” and “wideband” signal models in greater detail.
(84) The narrowband and wideband specific loudness values N.sub.NB[b,t] and N.sub.WB[b, t] may be estimated from the transformed excitation using the exponential functions:
(85)
where TQ.sub.1 kHz is the excitation level at threshold in quiet for a 1 kHz tone; its value may be read from the equal loudness contours.
(86) Moore and Glasberg suggest that the specific loudness should be equal to some small value instead of zero when the excitation is at the threshold of hearing. Specific loudness should then decrease monotonically to zero as the excitation decreases to zero. The justification is that the threshold of hearing is a probabilistic threshold (the point at which a tone is detected 50% of the time), and that a number of tones, each at threshold, presented together may sum to a sound that is more audible than any of the individual tones. In the disclosed application, augmenting the specific loudness functions with this property has the added benefit of making the gain solver, discussed below, behave more appropriately when the excitation is near threshold. If the specific loudness is defined to be zero when the excitation is at or below threshold, then a unique solution for the gain solver does not exist for excitations at or below threshold. If, on the other hand, specific loudness is defined to be monotonically increasing for all values of excitation greater than or equal to zero, as suggested by Moore and Glasberg, then a unique solution does exist. Loudness scaling greater than unity will always result in a gain greater than unity and vice versa. The specific loudness functions in Eqns. 11a and 11b may be altered to have the desired property according to:
(87)
where the constant λ is greater than one, the exponent q is less than one, and the constants K and C are chosen so that the specific loudness function and its first derivative are continuous at the point Ē.sub.1 kHz[b,t]=λTQ.sub.1 kHz.
(88) From the specific loudness, the overall or “total” loudness L[t] is given by the sum of the specific loudness across all bands b:
L[t]=Σ.sub.bN[b,t]. (12)
Specific Loudness Modification 105
(90) In the specific loudness modification function (“Specific Loudness Modification”) 105, the target specific loudness, referred to as {circumflex over (N)}[b,t], may be calculated from the specific loudness of SL 104.
Gain Solver 106
(91) In this example, for each band b and every time interval t, the Gain Solver 106 takes as its inputs the smoothed excitation Ē[b,t] and the target specific loudness {circumflex over (N)}[b,t] and generates gains G[b, t] used subsequently for modifying the audio. Letting the function Ψ{⋅} represent the non-linear transformation from excitation to specific loudness such that
N[b,t]=Ψ{Ē[b,t]}, (13)
the Gain Solver finds G[b, t] such that
{circumflex over (N)}[b,t]=Ψ{G.sup.2[b,t]Ē[b,t]}. (14a)
(92) The Gain Solvers 106 determine frequency- and time-varying gains which, when applied to the original excitation, result in a specific loudness that, ideally, is equal to the desired target specific loudness. In practice, the Gain Solvers determine frequency- and time-varying gains which, when applied to the frequency-domain version of the audio signal, modify the audio signal so as to reduce the difference between its specific loudness and the target specific loudness. Ideally, the modification is such that the modified audio signal has a specific loudness that is a close approximation of the target specific loudness. The solution to Eqn. 14a may be implemented in a variety of ways. For example, if a closed form mathematical expression for the inverse of the specific loudness function, represented by Ψ.sup.−1{⋅}, exists, then the gains may be computed directly by re-arranging equation 14a:
G[b,t]=√(Ψ.sup.−1{{circumflex over (N)}[b,t]}/Ē[b,t]). (14b)
(94) Alternatively, if a closed form solution for Ψ.sup.−1{⋅} does not exist, an iterative approach may be employed in which for each iteration equation 14a is evaluated using a current estimate of the gains. The resulting specific loudness is compared with the desired target and the gains are updated based on the error. If the gains are updated properly, they will converge to the desired solution. Another method involves pre-computing the function Ψ{⋅} for a range of excitation values in each band to create a look-up table. From this look-up table, one obtains an approximation of the inverse function Ψ.sup.−1{⋅} and the gains may then be computed from equation 14b. As mentioned earlier, the target specific loudness may be represented by a scaling of the specific loudness:
{circumflex over (N)}[b,t]=Ξ[b,t]N[b,t] (14c)
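The closed-form solution of Eqn. 14b and the iterative approach described above can be sketched with a toy compressive power-law standing in for the actual transformation Ψ{⋅}. The exponent 0.23, the multiplicative update rule, and the step size are assumptions for illustration only:

```python
import math

ALPHA = 0.23  # toy compressive exponent (an assumption, not the invention's model)

def psi(E):
    # Stand-in for the non-linear excitation-to-specific-loudness transform.
    return E ** ALPHA

def psi_inv(N):
    return N ** (1.0 / ALPHA)

def gain_closed_form(E, N_target):
    # Eqn. 14b: G = sqrt(psi_inv(N_target) / E)
    return math.sqrt(psi_inv(N_target) / E)

def gain_iterative(E, N_target, iters=200, step=0.5):
    # Evaluate Eqn. 14a with the current gain estimate and update on the
    # error; a multiplicative update keeps the gain positive.
    G = 1.0
    for _ in range(iters):
        err = N_target - psi(G * G * E)
        G *= math.exp(step * err)
    return G
```

For a well-behaved monotonic Ψ the iteration converges to the same gain as the closed form.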
(95) Substituting equation 13 into 14c and then 14c into 14b yields an alternative expression for the gains:
G[b,t]=√(Ψ.sup.−1{Ξ[b,t]Ψ{Ē[b,t]}}/Ē[b,t]). (14d)
(97) We see that the gains may be expressed purely as a function of the excitation Ē[b,t] and the specific loudness scaling Ξ[b,t]. Therefore, the gains may be computed through evaluation of 14d or an equivalent lookup table without ever explicitly computing the specific loudness or target specific loudness as intermediate values. However, these values are implicitly computed through use of equation 14d. Other equivalent methods for computing the modification parameters through either explicit or implicit computation of the specific loudness and target specific loudness may be devised, and this invention is intended to cover all such methods.
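The look-up-table method can be sketched in the same spirit: precompute Ψ over a grid of excitations, invert it by interpolation, and evaluate the gains of Eqn. 14d from the table. The toy power-law Ψ and the grid range are assumptions:

```python
import numpy as np

def psi(E):
    # Toy monotonic specific-loudness function (an assumption for illustration).
    return E ** 0.23

# Precompute psi over a grid of excitation values to form a look-up table.
E_grid = np.logspace(-2, 6, 4096)
N_grid = psi(E_grid)

def psi_inv_lut(N):
    # Because psi is monotonic, invert it by interpolating the table with
    # the roles of input and output swapped.
    return np.interp(N, N_grid, E_grid)

def gains_from_scaling(E, xi):
    # Eqn. 14d: G = sqrt(psi_inv(xi * psi(E)) / E), evaluated via the table,
    # without explicitly forming the specific loudness as an intermediate.
    return np.sqrt(psi_inv_lut(xi * psi(E)) / E)
```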
Synthesis Filterbank 110
(98) As described above, Analysis Filterbank 100 may be implemented efficiently through use of the Short-time Discrete Fourier Transform (STDFT) or the Modified Discrete Cosine Transform, and the STDFT or MDCT may be used similarly to implement Synthesis Filterbank 110. Specifically, letting X[k,t] represent the STDFT or MDCT of the input audio, as defined earlier, the STDFT or MDCT of the processed (modified) audio in Synthesis Filterbank 110 may be calculated as
{circumflex over (X)}[k,t]=(Σ.sub.bG[b,t]S.sub.b[k])X[k,t−d], (15)
where S.sub.b[k] is the response of the synthesis filter associated with band b, and d is the delay associated with delay block 109 in
Target Specific Loudness
(100) The behavior of arrangements embodying aspects of the invention such as the examples of
Time-Invariant and Frequency-Invariant Function Suitable for Volume Control
(101) A standard volume control adjusts the loudness of an audio signal by applying a wideband gain to the audio. Generally, the gain is coupled to a knob or slider that is adjusted by a user until the loudness of the audio is at the desired level. An aspect of the present invention allows for a more psychoacoustically consistent way of implementing such a control. According to this aspect of the invention, rather than coupling a wideband gain to the volume control so that the gain changes by the same amount across all frequency bands, which may alter the perceived spectrum, a specific loudness scaling factor is associated with the volume control adjustment. The gain in each of multiple frequency bands is then changed by an amount that takes the human hearing model into account so that, ideally, there is no change in the perceived spectrum. In the context of this aspect of the invention and an exemplary application thereof, “constant” or “time-invariant” is intended to allow for changes in the setting of a volume control scale factor from time to time, for example, by a user. Such “time-invariance” is sometimes referred to as “quasi time-invariant,” “quasi-stationary,” “piecewise time-invariant,” “piecewise stationary,” “step-wise time-invariant,” and “step-wise stationary.” Given such a scale factor, α, the target specific loudness may be calculated as the measured specific loudness multiplied by α:
{circumflex over (N)}[b,t]=αN[b,t]. (16)
(102) Because total loudness L[t] is the sum of specific loudness N[b,t] across all bands b, the above modification also scales the total loudness by a factor of α, but it does so in a way that preserves the same perceived spectrum at a particular time for changes in the volume control adjustment. In other words, at any particular time, a change in the volume control adjustment results in a change in perceived loudness but no change in the perceived spectrum of the modified audio versus the perceived spectrum of the unmodified audio.
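The effect of Eqn. 16 may be illustrated with a short numeric sketch; the band values are illustrative.

```python
import numpy as np

# Loudness-domain volume control (Eqn. 16): every band's specific
# loudness is scaled by the same factor alpha, so total loudness scales
# by alpha while the relative spectral balance is preserved.
N = np.array([1.0, 4.0, 2.0, 0.5])    # specific loudness per band (sone)
alpha = 0.25                           # volume-control scale factor

N_hat = alpha * N                      # target specific loudness
L = N.sum()                            # total loudness before
L_hat = N_hat.sum()                    # total loudness after: alpha * L
```

The band proportions N_hat/L_hat equal N/L, which is the "no change in perceived spectrum" property; a wideband electrical gain has no such guarantee, because loudness grows at different rates in different bands.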
(103)
(104) Along with the distortion of the perceived spectral balance associated with a traditional volume control there exists a second problem. A property of loudness perception, which is reflected in the loudness model reflected in Equations 11a-11d, is that loudness of a signal at any frequency decreases more rapidly as signal level approaches the threshold of hearing. As a result, the electrical attenuation required to impart the same loudness attenuation to a softer signal is less than that required for a louder signal. A traditional volume control imparts a constant attenuation regardless of signal level, and therefore soft signals become “too soft” with respect to louder signals as the volume is turned down. In many cases this results in the loss of detail in the audio. Consider the recording of a castanet in a reverberant room. In such a recording the main “hit” of the castanet is quite loud in comparison to the reverberant echoes, but it is the reverberant echoes that convey the size of the room. As the volume is turned down with a traditional volume control, the reverberant echoes become softer with respect to the main hit and eventually disappear below the threshold of hearing, leaving a “dry” sounding castanet. The loudness based volume control prevents the disappearance of the softer portions of the recordings by boosting the softer reverberant portion of the recording relative to the louder main hit so that the relative loudness between these sections remains constant. In order to achieve this effect, the multiband gains G[b,t] must vary over time at a rate that is commensurate with the human temporal resolution of loudness perception. Because the multiband gains G[b,t] are computed as a function of the smoothed excitation Ē[b,t], selection of the time constants λ.sub.b in Eqn. 8 dictates how quickly the gains may vary across time in each band b. 
As mentioned earlier, these time constants may be selected to be proportional to the integration time of human loudness perception within band b, and thus yield the appropriate variation of G[b,t] over time. It should be noted that if the time constants are chosen inappropriately (either too fast or too slow), perceptually objectionable artifacts may be introduced into the processed audio.
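The band-dependent smoothing that governs how quickly G[b,t] may vary can be sketched as a one-pole filter in the spirit of Eqn. 8; the coefficient values and the step input are illustrative.

```python
import numpy as np

def smooth_excitation(E, lam):
    """One-pole smoothing per band, in the spirit of Eqn. 8:
    E_bar[b,t] = lam[b]*E_bar[b,t-1] + (1-lam[b])*E[b,t].
    lam[b] would be chosen proportional to the loudness-integration
    time of band b; the values used below are illustrative only."""
    bands, frames = E.shape
    E_bar = np.zeros_like(E)
    prev = np.zeros(bands)
    for t in range(frames):
        prev = lam * prev + (1.0 - lam) * E[:, t]
        E_bar[:, t] = prev
    return E_bar

E = np.tile(np.array([[1.0], [1.0]]), (1, 4))   # step input in 2 bands
lam = np.array([0.0, 0.5])                       # fast band vs. slow band
E_bar = smooth_excitation(E, lam)
```

The fast band (lam = 0) tracks the step immediately, while the slow band approaches it gradually; since G[b,t] is computed from the smoothed excitation, lam.sub.b directly limits how quickly the gain in band b can change.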
Time-Invariant and Frequency-Variant Function Suitable for Fixed Equalization
(105) In some applications, one may wish to apply a fixed perceptual equalization to the audio, in which case the target specific loudness may be computed by applying a time-invariant but frequency-variant scale factor Θ[b] as in the relationship
{circumflex over (N)}[b,t]=Θ[b]N[b,t]
wherein {circumflex over (N)}[b, t] is the target specific loudness, N[b, t] is the specific loudness of the audio signal, b is a measure of frequency, and t is a measure of time. In this case, the scaling may vary from band to band. Such an application may be useful for emphasizing, for example, the portion of the spectrum dominated by speech frequencies in order to boost intelligibility.
Frequency-Invariant and Time-Variant Function Suitable for Automatic Gain and Dynamic Range Control
(106) The techniques of automatic gain and dynamic range control (AGC and DRC) are well known in the audio processing field. In an abstract sense, both techniques measure the level of an audio signal in some manner and then gain-modify the signal by an amount that is a function of the measured level. For the case of AGC, the signal is gain-modified so that its measured level is closer to a user selected reference level. With DRC, the signal is gain-modified so that the range of the signal's measured level is transformed into some desired range. For example, one may wish to make the quiet portions of the audio louder and the loud portions quieter. Such a system is described by Robinson and Gundry (Charles Robinson and Kenneth Gundry, “Dynamic Range Control via Metadata,” 107.sup.th Convention of the AES, Preprint 5028, Sep. 24-27, 1999, New York). Traditional implementations of AGC and DRC generally utilize a simple measurement of audio signal level, such as smoothed peak or root mean square (rms) amplitude, to drive the gain modification. Such simple measurements correlate to some degree to the perceived loudness of the audio, but aspects of the present invention allow for more perceptually relevant AGC and DRC by driving the gain modifications with a measure of loudness based on a psychoacoustic model. Also, many traditional AGC and DRC systems apply the gain modification with a wideband gain, thereby incurring the aforementioned timbral (spectral) distortions in the processed audio. Aspects of the present invention, on the other hand, utilize a multiband gain to shape the specific loudness in a manner that reduces or minimizes such distortions.
(107) Both the AGC and DRC applications employing aspects of the present invention are characterized by a function that transforms or maps an input wideband loudness L.sub.i[t] into a desired output wideband loudness L.sub.o[t], where the loudness is measured in perceptual loudness units, such as sone. The input wideband loudness L.sub.i[t] is a function of the input audio signal's specific loudness N[b,t]; it may be the audio signal's total loudness itself, or a temporally-smoothed version of it.
(108)
(109)
(110) The audio signal's original specific loudness N[b,t] is simply scaled by the ratio of the desired output wideband loudness to the input wideband loudness to yield an output specific loudness {circumflex over (N)}[b,t]. For an AGC system, the input wideband loudness L.sub.i[t] should generally be a measure of the long-term total loudness of the audio. This can be achieved by smoothing the total loudness L[t] across time to generate L.sub.i[t].
(111) In comparison to an AGC, a DRC system reacts to shorter-term changes in a signal's loudness, and therefore L.sub.i[t] can simply be made equal to L[t]. As a result, the scaling of specific loudness, given by L.sub.o[t]/L.sub.i[t], may fluctuate rapidly, leading to unwanted artifacts in the processed audio. One typical artifact is the audible modulation of a portion of the frequency spectrum by some other, relatively unrelated portion of the spectrum. For example, a classical music selection might contain high frequencies dominated by a sustained string note, while the low frequencies contain a loud booming timpani. Whenever the timpani hits, the overall loudness L.sub.i[t] increases, and the DRC system applies attenuation to the entire specific loudness. The strings are then heard to “pump” down and up in loudness with the timpani. Such cross-pumping in the spectrum is a problem with traditional wideband DRC systems as well, and a typical solution is to apply DRC independently to different frequency bands. The system disclosed here is inherently multiband, due to the filterbank and the calculation of specific loudness employing a perceptual loudness model, and therefore modifying the DRC system to operate in a multiband fashion in accordance with aspects of the present invention is relatively straightforward, as described next.
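The wideband loudness mapping just described may be sketched as follows. The compressive `drc_map` curve, its 2:1 log-domain ratio, the reference level, and the band values are illustrative assumptions rather than the patent's mapping.

```python
import numpy as np

def drc_map(L_i, ratio=2.0, L_ref=1.0):
    """Illustrative compressive input-to-output wideband loudness map
    (not the patent's specific curve): loudness deviations from a
    reference level are reduced by `ratio` in the log domain."""
    return L_ref * (L_i / L_ref) ** (1.0 / ratio)

# Specific loudness per band at one instant (illustrative sone values)
N = np.array([2.0, 1.0, 1.0])
L_i = N.sum()                 # for DRC, L_i[t] can simply equal L[t]
L_o = drc_map(L_i)            # desired output wideband loudness
N_hat = (L_o / L_i) * N       # scale specific loudness by L_o/L_i
```

By construction N_hat sums to L_o, so the wideband loudness hits its target while each band keeps its share of the total, which is exactly the frequency-invariant scaling described above.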
Frequency-Variant and Time-Variant Function Suitable for Dynamic Range Control
(112) The DRC system may be expanded to operate in a multiband or frequency-variant fashion by allowing the input and output loudness to vary independently with band b. These multiband loudness values are referenced as L.sub.i[b,t] and L.sub.o[b,t], and the target specific loudness may then be given by
(113)
{circumflex over (N)}[b,t]=(L.sub.o[b,t]/L.sub.i[b,t])N[b,t]
where L.sub.o[b,t] has been calculated from or mapped from L.sub.i[b, t], as illustrated in
(114) The most straightforward way of calculating L.sub.i[b, t] is to set it equal to the specific loudness N[b,t]. In this case, DRC is performed independently on every band in the auditory filterbank of the perceptual loudness model rather than in accordance with the same input versus output loudness ratio for all bands as just described above under the heading “Frequency-Invariant and Time-Variant Function Suitable for Automatic Gain and Dynamic Range Control.” In a practical embodiment employing 40 bands, the spacing of these bands along the frequency axis is relatively fine in order to provide an accurate measure of loudness. However, applying a DRC scale factor independently to each band may cause the processed audio to sound “torn apart”. To avoid this problem, one may choose to calculate L.sub.i[b, t] by smoothing specific loudness N[b, t] across bands so that the amount of DRC applied from one band to the next does not vary as drastically. This may be achieved by defining a band-smoothing filter Q(b) and then smoothing the specific loudness across all bands c according to the standard convolution sum:
(115)
L.sub.i[b,t]=Σ.sub.cQ(b−c)N[c,t] (19)
wherein N[c, t] is the specific loudness of the audio signal and Q(b−c) is the band-shifted response of the smoothing filter.
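The band-smoothing convolution may be sketched as follows; the 3-tap kernel Q is a toy stand-in for a kernel matched to the auditory filterbank spacing.

```python
import numpy as np

def band_smooth(N_t, Q):
    """Smooth specific loudness across bands per the convolution sum
    L_i[b] = sum_c Q(b-c) N[c] (Eqn. 19).  mode='same' keeps one
    output value per band."""
    return np.convolve(N_t, Q, mode="same")

N_t = np.array([0.0, 1.0, 0.0, 0.0])   # loudness concentrated in band 1
Q = np.array([0.25, 0.5, 0.25])        # normalized smoothing kernel
L_i = band_smooth(N_t, Q)
```

The loudness peak leaks into the neighboring bands, so the DRC scale factors derived from L_i change gradually from band to band instead of jumping, which is what prevents the processed audio from sounding "torn apart".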
(116) If the DRC function that calculates L.sub.o[b,t] as a function of L.sub.i[b,t] is fixed for every band b, then the type of change incurred to each band of the specific loudness N[b, t] will vary depending on the spectrum of the audio being processed, even if the overall loudness of the signal remains the same. For example, an audio signal with loud bass and quiet treble may have the bass cut and the treble boosted. A signal with quiet bass and loud treble may have the opposite occur. The net effect is a change in the timbre or perceived spectrum of the audio, which may be desirable in certain applications.
(117) However, one may wish to perform multiband DRC without modifying the average perceived spectrum of the audio. One might want the average modification in each band to be roughly the same while still allowing the short-term variations of the modifications to operate independently between and among bands. The desired effect may be achieved by forcing the average behavior of the DRC in each band to be the same as that of some reference behavior. One may choose this reference behavior as the desired DRC for the wideband input loudness L.sub.i[t]. Let the function L.sub.o[t]=DRC{L.sub.i[t]} represent the desired DRC mapping for the wideband loudness. Then let
(118)
L.sub.o[b,t]=(L̄.sub.i[b,t]/L̄.sub.i[t])DRC{(L̄.sub.i[t]/L̄.sub.i[b,t])L.sub.i[b,t]}
where L̄.sub.i[t] and L̄.sub.i[b,t] denote time-smoothed versions of the wideband and multiband input loudness, respectively.
(119) Note that the multiband input loudness is first scaled to be in the same average range as the wideband input loudness. The DRC function designed for the wideband loudness is then applied. Lastly, the result is scaled back down to the average range of the multiband loudness. With this formulation of multiband DRC, the benefits of reduced spectral pumping are retained, while at the same time preserving the average perceived spectrum of the audio.
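The matching idea just described may be sketched as follows. The use of long-term averages to define the "average range", the 2:1 compressive map, and all values are illustrative assumptions; the patent's exact formulation may differ.

```python
def matched_multiband_drc(L_i_band, L_i_wide_avg, L_i_band_avg, drc):
    """Sketch of matched multiband DRC: scale a band's input loudness
    up to the average range of the wideband loudness, apply the
    wideband DRC function, then scale the result back down."""
    scaled_up = (L_i_wide_avg / L_i_band_avg) * L_i_band
    mapped = drc(scaled_up)
    return (L_i_band_avg / L_i_wide_avg) * mapped

drc = lambda L: L ** 0.5    # illustrative 2:1 compressive map

# A quiet band whose long-term average is 1/10 of the wideband average:
L_o_band = matched_multiband_drc(L_i_band=0.1, L_i_wide_avg=10.0,
                                 L_i_band_avg=1.0, drc=drc)
```

Because each band sees the same DRC curve referenced to its own average level, the average compression applied to every band matches the wideband reference behavior, while short-term variations within each band still act independently.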
Frequency-Variant and Time-Variant Function Suitable for Dynamic Equalization
(120) Another application of aspects of the present invention is the intentional transformation of the audio's time-varying perceived spectrum to a target time-invariant perceived spectrum while still preserving the original dynamic range of the audio. One may refer to this processing as Dynamic Equalization (DEQ). With traditional static equalization, a simple fixed filtering is applied to the audio in order to change its spectrum. For example, one might apply a fixed bass or treble boost. Such processing does not take into account the current spectrum of the audio and may therefore be inappropriate for some signals, e.g., signals that already contain a relatively large amount of bass or treble. With DEQ, the spectrum of the signal is measured and the signal is then dynamically modified in order to transform the measured spectrum into an essentially static desired shape. For aspects of the present invention, such a desired shape is specified across bands in the filterbank and referred to as EQ[b]. In a practical embodiment, the measured spectrum should represent the average spectral shape of the audio, which may be generated by smoothing the specific loudness N[b,t] across time. One may refer to the smoothed specific loudness as
(121)
N̄[b,t].
(122) In order to preserve the original dynamic range of the audio, the desired spectrum EQ[b] should be normalized to have the same overall loudness as the measured spectral shape given by
(123)
{tilde over (EQ)}[b,t]=EQ[b](Σ.sub.cN̄[c,t]/Σ.sub.cEQ[c])
where {tilde over (EQ)}[b,t] is the desired spectral shape normalized to have the same overall loudness as the time-smoothed specific loudness N̄[b,t].
(124) Finally, the target specific loudness is calculated as
(125)
{circumflex over (N)}[b,t]=N[b,t]({tilde over (EQ)}[b,t]/N̄[b,t]).sup.β (23)
where β is a user-specified parameter ranging from zero to one, indicating the degree of DEQ that is to be applied. Looking at Eqn. 23, one notes that when β=0, the original specific loudness is unmodified, and when β=1, the specific loudness is scaled by the ratio of the desired spectral shape to the measured spectral shape.
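The DEQ scaling may be sketched numerically. One way to realize "unmodified" at β=0 and "scaled by the ratio of desired to measured shape" at β=1 is a power-law interpolation, assumed here; the symbol names and values are illustrative.

```python
import numpy as np

# Sketch of dynamic equalization (DEQ): move the measured average
# spectral shape toward a desired shape EQ[b], by an amount beta.
N = np.array([4.0, 1.0])            # current specific loudness per band
N_smooth = np.array([4.0, 1.0])     # time-smoothed measured shape
EQ = np.array([1.0, 1.0])           # desired (flat) shape, unnormalized

# Normalize EQ to the same overall loudness as the measured shape,
# so that the original dynamic range / overall loudness is preserved
EQ_norm = EQ * (N_smooth.sum() / EQ.sum())

beta = 1.0                          # degree of DEQ (0 = none, 1 = full)
N_hat = N * (EQ_norm / N_smooth) ** beta
```

With beta = 1 the bass-heavy measured shape is fully flattened while the total loudness (sum over bands) is unchanged; with beta = 0 the exponent zeroes out the ratio and N_hat equals N.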
(126) One convenient way of generating the desired spectral shape EQ[b] is for a user to set it equal to
Combined Processing
(127) One may wish to combine all the previously described processing, including Volume Control (VC), AGC, DRC, and DEQ, into a single system. Because each of these processes may be represented as a scaling of the specific loudness, all of them are easily combined as follows:
{circumflex over (N)}[b,t]=(Ξ.sub.VC[b,t]Ξ.sub.AGC[b,t]Ξ.sub.DRC[b,t]Ξ.sub.DEQ[b,t])N[b,t], (24)
where Ξ.sub.*[b,t] represents the scale factors associated with process “*”. A single set of gains G[b,t] may then be calculated for the target specific loudness that represents the combined processing.
(128) In some cases, the scale factors of one or a combination of the loudness modification processes may fluctuate too rapidly over time and produce artifacts in the resulting processed audio. It may therefore be desirable to smooth some subset of these scaling factors. In general, the scale factors from VC and DEQ vary smoothly over time, but the combination of the AGC and DRC scale factors may require smoothing. Let the combination of these scale factors be represented by
Ξ.sub.C[b,t]=Ξ.sub.AGC[b,t]Ξ.sub.DRC[b,t] (25)
(129) The basic notion behind the smoothing is that the combined scale factors should react quickly when the specific loudness is increasing, and that the scale factors should be more heavily smoothed when the specific loudness is decreasing. This notion corresponds to the well-known practice of utilizing a fast attack and a slow release in the design of audio compressors. The appropriate time constants for smoothing the scale factors may be calculated by smoothing across time a band-smoothed version of the specific loudness. First a band-smoothed version of the specific loudness is computed:
(130)
wherein N[c,t] is the specific loudness of the audio signal and Q(b−c) is the band-shifted response of the smoothing filter, as in Eqn. 19, above.
(131) The time-smoothed version of this band-smoothed specific loudness is then calculated as
where the band dependent smoothing coefficient λ[b,t] is given by
(132)
(133) The smoothed combined scale factors are then calculated as
where λ.sub.M[b, t] is a band-smoothed version of λ[b,t]:
(134)
(135) Band smoothing of the smoothing coefficient prevents the time-smoothed scale factors from changing drastically across bands. The described scale factor time- and band-smoothing results in processed audio containing fewer objectionable perceptual artifacts.
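The fast-attack/slow-release notion described above may be sketched as follows. The per-frame attack/release decision, the coefficient values, and the signals are illustrative assumptions; in particular, this single-band sketch omits the band smoothing of the coefficient that the text also calls for.

```python
import numpy as np

def smooth_scale_factors(xi, n_smoothed, lam_attack=0.1, lam_release=0.9):
    """Fast-attack / slow-release smoothing of combined scale factors:
    react quickly when the (band-smoothed) specific loudness is rising,
    smooth heavily when it is falling."""
    out = np.zeros_like(xi)
    prev_xi, prev_n = xi[0], n_smoothed[0]
    out[0] = xi[0]
    for t in range(1, len(xi)):
        # attack coefficient when loudness increases, release otherwise
        lam = lam_attack if n_smoothed[t] > prev_n else lam_release
        prev_xi = lam * prev_xi + (1.0 - lam) * xi[t]
        prev_n = n_smoothed[t]
        out[t] = prev_xi
    return out

xi = np.array([1.0, 0.5, 0.5, 1.0])    # raw combined scale factors
n = np.array([1.0, 2.0, 2.0, 0.5])     # band-smoothed specific loudness
xi_s = smooth_scale_factors(xi, n)
```

At frame 1 the loudness jumps, so the attenuating scale factor is adopted almost immediately (fast attack); at frame 3 the loudness drops, so the factor recovers only slowly (slow release), mirroring standard audio-compressor practice.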
Noise Compensation
(136) In many audio playback environments there exists background noise that interferes with the audio that a listener wishes to hear. For example, a listener in a moving automobile may be playing music over the installed stereo system and noise from the engine and road may significantly alter the perception of the music. In particular, for parts of the spectrum in which the energy of the noise is significant relative to the energy of the music, the perceived loudness of the music is reduced. If the level of the noise is large enough, the music is completely masked. With respect to an aspect of the current invention, one would like to choose gains G[b, t] so that the specific loudness of the processed audio in the presence of the interfering noise is equal to the target specific loudness {circumflex over (N)}[b,t]. To achieve this effect, one may utilize the concept of partial loudness, as defined by Moore and Glasberg, supra. Assume that one is able to obtain a measurement of the noise by itself and a measurement of the audio by itself. Let E.sub.N[b, t] represent the excitation from the noise and let E.sub.A[b,t] represent the excitation from the audio. The combined specific loudness of the audio and the noise is then given by
N.sub.TOT[b,t]=Ψ{E.sub.A[b,t]+E.sub.N[b,t]}, (31)
where, again, Ψ{⋅} represents the non-linear transformation from excitation to specific loudness. One may assume that a listener's hearing partitions the combined specific loudness between the partial specific loudness of the audio and the partial specific loudness of the noise in a way that preserves the combined specific loudness:
N.sub.TOT[b,t]=N.sub.A[b,t]+N.sub.N[b,t]. (32)
(137) The partial specific loudness of the audio, N.sub.A[b,t], is the value one wishes to control, and therefore one must solve for this value. The partial specific loudness of the noise may be approximated as
(138)
(139) where E.sub.TN[b,t] is the masked threshold in the presence of the noise, E.sub.TQ[b] is the threshold of hearing in quiet at band b, and κ is an exponent between zero and one.
(140) Combining Eqns. 31-33 one arrives at an expression for the partial specific loudness of the audio:
(141)
(142) One notes that when the excitation of the audio is equal to the masked threshold of the noise (E.sub.A[b, t]=E.sub.TN[b,t]), the partial specific loudness of the audio is equal to the loudness of a signal at the threshold in quiet, which is the desired outcome. When the excitation of the audio is much greater than that of the noise, the second term in Eqn. 34 vanishes, and the specific loudness of the audio is approximately equal to what it would be if the noise were not present. In other words, as the audio becomes much louder than the noise, the noise is masked by the audio. The exponent κ is chosen empirically to give a good fit to data on the loudness of a tone in noise as a function of the signal-to-noise ratio. Moore and Glasberg have found that a value of κ=0.3 is appropriate. The masked threshold of the noise may be approximated as a function of the noise excitation itself:
E.sub.TN[b,t]=K[b]E.sub.N[b,t]+E.sub.TQ[b] (35)
where K[b] is a constant that increases at lower frequency bands. Thus, the partial specific loudness of the audio given by Eqn. 34 may be represented abstractly as a function of the excitation of the audio and the excitation of the noise:
N.sub.A[b,t]=Φ{E.sub.A[b,t],E.sub.N[b,t]}. (36)
(143) A modified gain solver may then be utilized to calculate the gains G[b,t] such that the partial specific loudness of the processed audio in the presence of the noise is equal to the target specific loudness:
{circumflex over (N)}[b,t]=Φ{G.sup.2[b,t]E.sub.A[b,t],E.sub.N[b,t]} (37)
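The modified gain solver of Eqn. 37 may be sketched abstractly. The bisection below relies only on the partial specific loudness function being monotonically increasing in the audio excitation; the toy `phi` is an illustrative stand-in, not the function of Eqn. 34, and the bracket limits and iteration count are assumptions.

```python
def solve_gain(phi, E_A, E_N, N_target, g_lo=1e-3, g_hi=1e3, iters=60):
    """Find G such that phi(G**2 * E_A, E_N) == N_target (Eqn. 37),
    by bisection on a geometric grid of candidate gains."""
    for _ in range(iters):
        g_mid = (g_lo * g_hi) ** 0.5
        if phi(g_mid ** 2 * E_A, E_N) < N_target:
            g_lo = g_mid      # too quiet: raise the lower bracket
        else:
            g_hi = g_mid      # loud enough: lower the upper bracket
    return (g_lo * g_hi) ** 0.5

# Toy partial-loudness stand-in: compressive growth of the audio's
# loudness, suppressed when the noise excitation dominates
phi = lambda e_a, e_n: (e_a / (e_a + e_n)) * e_a ** 0.3

E_A, E_N = 1.0, 0.5
N_target = phi(4.0, E_N)      # target reachable with G**2 = 4, i.e. G = 2
G = solve_gain(phi, E_A, E_N, N_target)
```

Because phi is increasing in its first argument, the bracket always contains the solution and the gain converges to the value that makes the partial loudness of the processed audio, in the presence of the noise, equal to the target.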
(144)
(145) In its most basic mode of operation, the SL Modification 105 in
(146) In a practical embodiment, the measurement of the noise may be obtained from a microphone placed in or near the environment into which the audio will be played. Alternatively, a predetermined set of template noise excitations may be utilized that approximate the anticipated noise spectrum under various conditions. For example, the noise in an automobile cabin may be pre-analyzed at various driving speeds and then stored as a look-up table of noise excitation versus speed. The noise excitation fed into the Gain Solver 206 in
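The template look-up idea may be sketched as follows; the speeds, band count, and excitation values are invented for illustration.

```python
import numpy as np

# Pre-analyzed noise excitations at a few driving speeds, stored as a
# look-up table, with linear interpolation for speeds in between.
speeds = np.array([0.0, 50.0, 100.0])       # km/h
noise_templates = np.array([                # excitation per band
    [0.1, 0.05],                            # parked
    [1.0, 0.20],                            # 50 km/h
    [4.0, 0.50],                            # 100 km/h
])

def noise_excitation(speed):
    """Interpolate each band's noise excitation at the given speed."""
    return np.array([np.interp(speed, speeds, noise_templates[:, b])
                     for b in range(noise_templates.shape[1])])

E_N = noise_excitation(75.0)   # halfway between the 50 and 100 km/h rows
```

The interpolated excitation would then be fed to the gain solver in place of a live microphone measurement.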
Implementation
(147) The invention may be implemented in hardware or software, or a combination of both (e.g., programmable logic arrays). Unless otherwise specified, the algorithms included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct more specialized apparatus (e.g., integrated circuits) to perform the required method steps. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion.
(148) Each such program may be implemented in any desired computer language (including machine, assembly, or high level procedural, logical, or object oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.
(149) Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.
(150) A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, some of the steps described herein may be order independent, and thus can be performed in an order different from that described.