Mode selection for modal reverb
11043203 · 2021-06-22
Inventors
Cpc classification
G10L21/00
PHYSICS
G10L25/18
PHYSICS
G10H2250/115
PHYSICS
International classification
G06F17/00
PHYSICS
G10L25/18
PHYSICS
Abstract
Methods and systems for performing modal reverb techniques for audio signals are described. The method may involve simplifying a reverb effect to be applied to the audio signal by receiving an IR, dividing the IR into a plurality of sub-bands, using a parametric estimation algorithm to determine respective parameters of the modes included in each sub-band, aggregating the respective modes of the sub-bands into a set; and truncating the set of aggregated modes into a subset of modes. Reverberation of the audio signal may be manipulated based on an IR that itself is based on the truncated subset of modes.
Claims
1. A method for generating a modal reverb effect for manipulating an audio signal, comprising: receiving an impulse response of an acoustic space, the impulse response including a plurality of modes of vibration of the acoustic space; dividing the impulse response into a plurality of sub-bands, each sub-band of the impulse response including a portion of the plurality of modes; for each respective sub-band, using a parametric estimation algorithm, determining respective parameters of the portion of modes included in the sub-band; aggregating the respective modes of the plurality of sub-bands into a set; and truncating the set of aggregated modes into a subset of modes, wherein truncating the set of aggregated modes comprises: for each of the modes included in the set, determining a signal to mask ratio (SMR) of the mode based on a predetermined masking curve; and sorting the modes included in the set according to the SMR for each mode, wherein each mode included in the subset has an SMR greater than the SMR of each mode excluded from the subset.
2. The method of claim 1, wherein the impulse response is divided into a plurality of non-uniform sub-bands.
3. The method of claim 1, wherein dividing the impulse response into a plurality of sub-bands comprises passing the impulse response through a filter bank.
4. The method of claim 3, further comprising, for each respective sub-band signal, estimating a number of modes included in the portion of modes of the sub-band signal, wherein the filter bank includes one or more complex filters and for each sub-band has each of a passband width and a partition width narrower than the passband width, wherein the number of modes is estimated within the passband width, and wherein determining parameters of the respective modes included in the sub-band signal is performed for only the modes within the partition width.
5. The method of claim 1, further comprising, for each respective sub-band, estimating a number of modes included in the portion of modes of the sub-band.
6. The method of claim 5, wherein, for each respective sub-band, a model order of the parametric estimation algorithm applied to the sub-band is based on the estimated number of modes included in the portion of modes of the sub-band.
7. The method of claim 5, wherein estimating a number of modes included in the portion of modes of the sub-band comprises: determining a peak selection threshold for the sub-band; and determining a number of peaks detected within the sub-band that are greater than the peak selection threshold, wherein the estimated number of modes is based on the determined number of peaks.
8. The method of claim 7, wherein the sub-band is derived from a Discrete Fourier Transform (DFT) of the impulse response, and wherein determining a peak selection threshold for the sub-band comprises: detecting a maximum peak magnitude of the sub-band; and detecting a minimum peak magnitude of the sub-band, wherein the peak selection threshold is determined based at least in part on the maximum peak magnitude and the minimum peak magnitude.
9. The method of claim 8, wherein the peak selection threshold is determined based on: $t = M_{\max} - a(M_{\max} - M_{\min})$, wherein $M_{\max}$ is the maximum peak magnitude, $M_{\min}$ is the minimum peak magnitude, and $a$ is a predetermined value between 0 and 1.
10. The method of claim 1, wherein, for each respective sub-band, determining respective parameters of the portion of modes comprises, for each sub-band to which the parametric estimation algorithm is applied, determining one or more of a frequency, a decay time, an initial magnitude or an initial phase of the portion of modes included in the sub-band.
11. The method of claim 10, wherein, for each respective sub-band, determining respective parameters of the portion of modes further comprises estimating a complex amplitude for each respective mode included in the sub-band.
12. The method of claim 11, wherein the sub-band is derived from a Discrete Fourier Transform (DFT), and wherein for each mode included in the sub-band signal, estimating the complex amplitude comprises minimizing an approximation error for each of the estimated complex amplitudes of the sub-band signal.
13. The method of claim 12, wherein the approximation error is minimized for only modes of the sub-band signal that fall within a passband of a corresponding spectral filter, wherein a different spectral filter corresponds to each of the sub-band signals, and wherein the different spectral filters cover the audible spectrum and do not overlap.
14. The method of claim 1, wherein the parametric estimation algorithm is an ESPRIT algorithm.
15. The method of claim 1, wherein, for each respective sub-band, determining respective parameters of the portion of modes comprises determining a peak selection threshold for the sub-band, and wherein the parameters are determined for the modes included in the portion of modes and having an amplitude greater than the peak selection threshold.
16. The method of claim 1, wherein truncating the set into a subset of modes further comprises: receiving an input indicating a total number of modes, wherein the total number of modes is less than or equal to a number of modes included in the set; and truncating the set into a subset of modes having a number of modes equal to the total number of modes.
17. The method of claim 1, wherein the predetermined masking curve is based on a psychoacoustic model.
18. A system for generating a modal reverb effect for manipulating an audio signal, comprising: memory for storing an impulse response; and one or more processors configured to: receive an impulse response of an acoustic space, the impulse response including a plurality of modes of vibration of the acoustic space; divide the impulse response into a plurality of sub-bands, each sub-band of the impulse response including a portion of the plurality of modes; for each respective sub-band: estimate a number of modes included in the portion of modes of the sub-band; and using a parametric estimation algorithm, determine respective parameters of the portion of modes included in the sub-band signal; aggregate the respective modes of the plurality of sub-bands into a set; for each of the modes included in the set, determine a signal to mask ratio (SMR) of the mode based on a predetermined masking curve; sort the modes according to the SMR for each mode; and truncate the set of aggregated modes into a subset of modes, wherein each mode included in the subset has an SMR greater than the SMR of each mode excluded from the subset.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing aspects, features and advantages of the present invention will be further appreciated when considered with reference to the following description of exemplary embodiments and accompanying drawings, wherein like reference numerals represent like elements. In describing the embodiments of the invention illustrated in the drawings, specific terminology may be used for the sake of clarity. However, the aspects of the invention are not intended to be limited to the specific terms used.
(2)
(3)
(4)
(5)
(6)
DETAILED DESCRIPTION
(7)
(8) Various instructions are described in greater detail in connection with the flow diagrams of
(9) The system 100 may further include an interface 150 for input and output of data. For example, the IR for a given acoustic space may be input to the system via the interface 150, and a select number of modes or corresponding exponentially damped sinusoids (EDSs) and their parameters may be output via the interface 150. Alternatively or additionally, the one or more processors may be capable of performing the reverb operations, in which case a user may input desired reverb parameters via the interface 150, and a modified audio signal based on the reverb parameters may be generated and output via the interface 150. Other parameters and instructions may be provided to and from the system via the interface 150. For example, the number of modes to be identified in the IR may be a variable entered by the user. This may be used to vary the processing speed of the reverb operations depending on a preference of the user. A desired number of modes may be preset and stored in the memory 140, entered by the user via the interface 150, or both.
(10) In some examples, the system 100 may include a personal computer, laptop, tablet, or other computing device of the user, housing therein both processors and memory. Operations performed by the system are described in greater detail in connection with the routines of
(11)
(12) At block 210, the system receives an IR of a given space. The space may be a real space (whereby the IR may be a recording in response to an impulse played in the real space), or a simulated or virtual space. The IR can be broken down into the respective modes of vibration of the space simulated by the IR and these modes can be isolated and individually modified. A typical IR may include upwards of approximately 10,000 modes.
(13) At block 220, the system may divide the IR into a plurality of sub-bands. For example, the modes of the IR may be centered at various frequencies across a wide band of frequencies, generally within the range of audible frequencies (commonly considered to be about 20 Hz-20 kHz). This band may be broken up into a plurality of sub-bands, each sub-band having a bandwidth smaller than the full band of the IR. In some examples, the sub-bands may be chosen so that they do not overlap, so that all of the frequencies within the full band of the IR are accounted for, or both. If both considerations are met, then the sum of the sub-band bandwidths may equal the bandwidth of the complete IR.
(14) In some examples, the sub-bands may be chosen to have uniform bandwidth, either on a logarithmic or non-logarithmic scale. For instance, if the IR is broken up into three sub-bands, each sub-band may have an equal bandwidth. In other examples, the IR may be divided into sub-bands based on a different factor, and this may result in non-uniformity of the sub-band bandwidths. For instance, the sub-band division may be arranged to divide the modes of the complete IR approximately evenly.
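One way to picture the non-uniform division described above is the following minimal Python sketch. It assumes a list of detected mode (peak) frequencies is already available and places the band edges at quantiles of those frequencies, so that each sub-band holds roughly the same number of modes. The function name `even_count_band_edges` and the default 20 Hz and 20 kHz limits are illustrative assumptions, not part of the described system.

```python
import numpy as np

def even_count_band_edges(peak_freqs_hz, n_bands, f_lo=20.0, f_hi=20000.0):
    """Return n_bands + 1 band edges so each band holds about the same number of peaks.

    Hypothetical helper: the interior edges are quantiles of the peak frequencies.
    """
    peaks = np.sort(np.asarray(peak_freqs_hz, dtype=float))
    # Interior edges at the 1/n, 2/n, ... quantiles of the peak frequencies.
    inner = np.quantile(peaks, np.arange(1, n_bands) / n_bands)
    return np.concatenate(([f_lo], inner, [f_hi]))

# Example: eight peaks, clustered at low frequencies, split into two bands.
peaks = [30, 40, 50, 60, 1000, 2000, 4000, 8000]
edges = even_count_band_edges(peaks, n_bands=2)
counts = np.histogram(peaks, bins=edges)[0]
```

With these inputs the single interior edge falls between the low-frequency cluster and the remaining peaks, so the resulting bands are unequal in width but equal in mode count.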
(15) In some examples, dividing the complete IR may first involve down-sampling the complete IR using one or more filterbanks. The filterbanks may be configured to pass certain portions of the IR, whereby the IR may be filtered into different sub-bands.
(16) Additionally, in some examples, the down-sampling may be performed using one or more complex filters. The complex filters may retain only a positive frequency spectrum of the IR, thereby omitting unwanted portions of the filtered IR from later processing operations.
(17) At block 230, a number of modes in each respective sub-band is estimated. The estimated number of modes may inform whether the sub-bands have been divided evenly. Additionally, or alternatively, the estimated number of modes may inform a desired resolution for later operations of the routine.
(18) An example subroutine 300 for estimating a number of modes in a given sub-band is shown in the flow diagram of
(19) At block 310, a peak selection threshold for the sub-band may be determined. In some examples, the peak selection threshold may be a fixed value, such as an amplitude value representing a lowest audible volume. Amplitude values of the sub-band at sampled frequencies (e.g., using a Fourier transform method) may be determined and then compared to the peak selection threshold, whereby only those values at or above the peak selection threshold are determined to be modes of the IR.
(20) In some examples, the peak selection threshold may be determined based on characteristics of the sub-band itself. For instance, at block 312, the sub-band may be derived in the frequency domain using a discrete Fourier transform (DFT). Then, at block 314, a maximum peak magnitude of the DFT of the sub-band may be determined, and at block 316, a minimum peak magnitude of the DFT of the sub-band may be determined. At block 318, the peak selection threshold is set based on the maximum peak and the minimum peak. For instance, the formula $t = M_{\max} - a(M_{\max} - M_{\min})$ may be used to set a peak selection threshold $t$, whereby $M_{\max}$ is the maximum peak magnitude, $M_{\min}$ is the minimum peak magnitude, and $a$ is a predetermined value between 0 and 1. The predetermined value of $a$ may be 0.25.
(21) At block 320, the number of peaks detected within the sub-band that have a magnitude greater than the peak selection threshold value is counted. The remaining peaks in the DFT are disregarded as insignificant or inaudible. The counted number of peaks corresponds to the estimated number of modes in the sub-band. Stated another way, each counted peak represents a center frequency of a mode that is identified and counted in the sub-band and used in further processing steps. The remaining modes are discounted and omitted from further processing steps.
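The threshold and counting steps of subroutine 300 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: peaks are taken to be strict local maxima of the linear DFT magnitude, and peaks at or above the threshold are counted so that a sub-band with a single peak still yields one mode. The function name `estimate_mode_count` is hypothetical.

```python
import numpy as np

def estimate_mode_count(subband, a=0.25):
    """Count DFT peaks at or above t = M_max - a * (M_max - M_min)."""
    mag = np.abs(np.fft.rfft(subband))
    # A peak is a bin strictly larger than both of its neighbours (assumption).
    interior = mag[1:-1]
    is_peak = (interior > mag[:-2]) & (interior > mag[2:])
    peaks = interior[is_peak]
    if peaks.size == 0:
        return 0, None
    m_max, m_min = peaks.max(), peaks.min()
    t = m_max - a * (m_max - m_min)
    return int(np.count_nonzero(peaks >= t)), t

# Example: two strong partials and one weak partial, all on exact DFT bins.
n = np.arange(1024)
signal = (np.sin(2 * np.pi * 50 * n / 1024)
          + np.sin(2 * np.pi * 120 * n / 1024)
          + 0.1 * np.sin(2 * np.pi * 200 * n / 1024))
count, threshold = estimate_mode_count(signal)
```

In this example the weak partial falls below the threshold and is not counted, which mirrors how insignificant peaks are disregarded at block 320.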
(22) At block 330, the complete IR may be divided into sub-bands based on the number of detected peaks. This may result in non-uniform sub-bands. In order to achieve this result, an Audio FFT filter bank may be used. Each sub-band may be produced by filtering the IR with a causal N-tap finite impulse response (FIR) filter $h_r[n]$:
(23)
(24) whereby
(25)
$a_m$ is the complex amplitude and $z_m$ is the complex mode of the $m$-th of $M$ modes, and $a_{mr}$ is the complex amplitude with a scaling factor. The first N−1 samples of the signal represent a start-up transient that does not exhibit the behavior of an exponentially damped sinusoid; from sample N−1 onward, the samples follow such behavior. The filter effectively cuts out modes with center frequencies in the stopband.
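The scaling of the amplitudes can be sketched as follows. This is an illustrative derivation consistent with the definitions above, not a reproduction of the original displayed equations; it assumes the IR is modeled as a sum of $M$ modes starting at $n = 0$, and the symbol $H_r(z)$ for the transfer function of the $r$-th filter is introduced here only for illustration.

```latex
x[n] = \sum_{m=1}^{M} a_m z_m^{\,n}, \qquad n \ge 0

x_r[n] = \sum_{k=0}^{N-1} h_r[k]\, x[n-k]
       = \sum_{m=1}^{M} a_m z_m^{\,n} \sum_{k=0}^{N-1} h_r[k]\, z_m^{-k}
       = \sum_{m=1}^{M} a_{mr}\, z_m^{\,n}, \qquad n \ge N-1

a_{mr} = a_m H_r(z_m), \qquad H_r(z) = \sum_{k=0}^{N-1} h_r[k]\, z^{-k}
```

The second line holds only once every term $x[n-k]$ has a non-negative index, which is why the first N−1 output samples form the start-up transient. A mode whose frequency lies in the stopband has a small $|H_r(z_m)|$ and is therefore strongly attenuated in that sub-band.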
(26) Windowing methods, which are known in the art, allow an FIR filter to be designed by truncating an IIR filter. The act of truncation expands the bandwidth of the FIR (as compared to the IIR filter). This in turn causes the sub-band filters to overlap in frequency, as shown in
(27) In one example of the filter bank being designed using a windowing method, first a number $R$ of brickwall filters may be chosen such that the sum of all frequency responses $H_r$ of the $R$ filters is unity. Taking the inverse DTFT of the $R$ filters shows that
(28)
in which $h_r$ is the impulse response of the $r$-th filter among the $R$ filters. Because the filters are brickwall filters, each impulse response is infinitely long (IIR). Next, each channel's impulse response may be truncated via multiplication with a short window, thus creating an FIR filter. For instance, an N-tap window $w[n]$ may be used so that each sub-band channel filter becomes $w[n]h_r[n]$. So long as $w[0]$ is normalized to 1, the windowed filters still sum to $\delta[n]$, so the filter bank retains perfect reconstruction, as can be seen from the following equations:
(29)
(30) Time-domain multiplication by w[n] results in convolution between the ideal channel filter and the window in the frequency domain. This results in frequency-domain spreading of the filters, which causes the filter responses to overlap with one another in frequency. This results in a filter bank like the one shown in
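The perfect-reconstruction property of the windowed filter bank can be checked numerically with the short Python sketch below. It rests on several assumptions that are not taken from the description: the brickwall filters are complex (one-sided) and partition the interval from 0 to 2π at arbitrary edges, the window is a Hann window indexed symmetrically around n = 0 so that w[0] = 1, and causality (which would require an added delay) is ignored. The function name `windowed_brickwall_bank` is hypothetical.

```python
import numpy as np

def windowed_brickwall_bank(edges, half_len):
    """Return (n, bank), where bank[r] = w[n] * h_r[n] for ideal bands between edges."""
    n = np.arange(-half_len, half_len + 1)
    # Hann window with the zero endpoints removed; its centre value w[0] is 1.
    w = np.hanning(2 * half_len + 3)[1:-1]
    safe_n = np.where(n == 0, 1, n)  # avoid division by zero at n = 0
    bank = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Inverse DTFT of an ideal band [lo, hi): (e^{j hi n} - e^{j lo n}) / (2 pi j n).
        h = (np.exp(1j * hi * n) - np.exp(1j * lo * n)) / (2j * np.pi * safe_n)
        h = np.where(n == 0, (hi - lo) / (2 * np.pi), h)
        bank.append(w * h)
    return n, np.array(bank)

# Four non-uniform bands covering the whole frequency circle.
edges = np.array([0.0, 0.2, 0.7, 1.5, 2 * np.pi])
n, bank = windowed_brickwall_bank(edges, half_len=32)
total = bank.sum(axis=0)          # sum over r of w[n] * h_r[n]
delta = (n == 0).astype(float)    # unit impulse at n = 0
```

Because the ideal filters sum to a unit impulse and the window equals 1 at n = 0, the windowed filters sum to the same impulse even though each individual response has spread in frequency.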
(31)
(32) In the example of
(33) Returning to
(34) Because the vector matrix is in an m-dimensional space (m being the number of complex modes), the processing necessary to solve for the complex modes increases exponentially as the number of modes increases. Stated another way, the model order of the ESPRIT algorithm corresponds to the number of modes that are estimated to be included in the sub-band. This makes processing the entire IR in a single matrix intractable. But by dividing the IR into sub-bands and then applying the ESPRIT algorithm to the sub-bands individually, instead of to all of the modes of the IR collectively, and by only solving for those modes that have a magnitude greater than the peak selection threshold, the amount of processing can be significantly reduced.
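For orientation, a textbook-style ESPRIT estimate of the complex modes of a single sub-band signal might look like the Python sketch below. It is a generic illustration, not the specific implementation of the described system: it assumes a noiseless or low-noise signal, a Hankel data matrix, and a model order supplied by the caller (for example, the estimated number of modes). The function name `esprit_poles` is hypothetical.

```python
import numpy as np

def esprit_poles(x, order):
    """Estimate `order` complex poles z_m of x[n] = sum_m a_m * z_m**n."""
    x = np.asarray(x, dtype=complex)
    rows = len(x) // 2
    cols = len(x) - rows + 1
    # Hankel data matrix with entries H[i, j] = x[i + j].
    idx = np.arange(rows)[:, None] + np.arange(cols)[None, :]
    H = x[idx]
    # Signal subspace: the leading left singular vectors.
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    Us = U[:, :order]
    # Shift invariance: Us[1:] is approximately Us[:-1] @ Phi.
    Phi = np.linalg.lstsq(Us[:-1], Us[1:], rcond=None)[0]
    return np.linalg.eigvals(Phi)

# Example: two damped complex sinusoids.
true_poles = np.array([0.99 * np.exp(0.3j), 0.95 * np.exp(1.1j)])
amps = np.array([1.0 + 0.5j, 0.3 - 0.2j])
n = np.arange(64)
x = (amps[:, None] * true_poles[:, None] ** n).sum(axis=0)
est = esprit_poles(x, order=2)
est = est[np.argsort(np.angle(est))]
```

The singular value decomposition dominates the cost and grows quickly with the matrix size, which is one reason a small model order per sub-band is attractive. The frequency of each mode follows from the angle of its pole and the decay rate from its magnitude.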
(35) For a given subset of modes (e.g., modes of a given sub-band), a complex amplitude of each mode may be estimated. The estimation may be performed using a least squares method, such as the following minimization function of a, the matrix of the complex amplitudes of the modes:
(36)
whereby x is a vector of sampled modes, and E are the complex sinusoids. This function may be solved in the frequency domain by taking the DFT of x and E, respectively labeled X and Y:
(37)
Each column of Y may then be computed analytically using the geometric series:
(38)
whereby $z$ is the $n$-th sample of the $m$-th of $N$ modes, and $l$ is the $l$-th of the sampled modes collected into the vector $x$.
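The frequency-domain least squares step can be illustrated with the following Python sketch. It assumes the complex modes (poles) are already known, builds each column of Y from the closed-form geometric series of a damped exponential, and solves for the complex amplitudes with an ordinary least squares solver. The function names `dft_of_mode` and `estimate_amplitudes` are hypothetical, and the closed form assumes strictly damped modes.

```python
import numpy as np

def dft_of_mode(z, n_samples):
    """Closed-form DFT of z**n, n = 0..N-1, via the geometric series."""
    l = np.arange(n_samples)
    w = z * np.exp(-2j * np.pi * l / n_samples)
    # sum_{n=0}^{N-1} w**n = (1 - w**N) / (1 - w); valid because |z| < 1 means w != 1.
    return (1 - w ** n_samples) / (1 - w)

def estimate_amplitudes(x, poles):
    """Least squares fit of complex amplitudes a minimizing ||X - Y a||."""
    n_samples = len(x)
    X = np.fft.fft(x)
    Y = np.column_stack([dft_of_mode(z, n_samples) for z in poles])
    a, *_ = np.linalg.lstsq(Y, X, rcond=None)
    return a

# Example: recover the amplitudes of two known modes.
poles = np.array([0.99 * np.exp(0.3j), 0.95 * np.exp(1.1j)])
true_amps = np.array([1.0 + 0.5j, 0.3 - 0.2j])
n = np.arange(256)
x = (true_amps[:, None] * poles[:, None] ** n).sum(axis=0)
est_amps = estimate_amplitudes(x, poles)
```

The magnitude and angle of each estimated amplitude give the initial magnitude and initial phase of the corresponding mode.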
(39) Alternatively, magnitude and phase estimation may be performed by again resorting to a divide-and-conquer approach using spectral filters. In this approach, the magnitudes may be estimated using the minimization function:
(40)
whereby $X$ and $Y$ are DFTs of $x$ and $E$, respectively, and $H_k$ is the $k$-th spectral filter associated with the $k$-th sub-band of the plurality of sub-bands. Modes that have minimal overlap with the filter $H_k$ may be effectively ignored by removing columns from $Y$, so that only those frequencies that fall within $H_k$ need to be minimized.
(41) The bandwidth b.sub.m of each mode m included in the subset of modes may also be estimated. This may be performed for each of the sub-bands, and this may be performed using the following equation: b.sub.m=arccos(2−0.5*(e.sup.d.sup.
(42) The above equations may be applied to only those modes that fall within the passband of the spectral filter of the sub-band. For example, for the $k$-th spectral filter associated with the $k$-th sub-band, magnitude and phase may be estimated for only those modes for which the range
(43)
intersects the passband of the filter. This may simplify the function.
(44) Additionally, since estimation of the magnitude and phase for each mode is performed independently for each sub-band, the processing for each sub-band can be performed in parallel. Therefore, for a computer architecture having multiple cores with parallel processing capabilities, the mode parameter estimation can be sped up even further.
(45) The estimated parameters may be stored in the memory of the system for further computation and subsequent applications.
(46) Continuing with
(47) For example, for each of the modes included in the set, a signal-to-mask ratio (SMR) of the mode may be determined based on a predetermined masking curve, and one or more of the modes included in the set may be truncated based on the determined SMR.
(48) An example subroutine 500 for truncating the unified set of modes is shown in the flow diagram of
(49) At block 510, a masking curve may be defined. In some examples, the masking curve may be predetermined. The masking curve may be used to compare a relative magnitude of the modes, but in relation to the curve instead of solely in relation to one another. The masking curve may be a psychoacoustic model, designed to account for psychoacoustics for someone who may listen to the audio signal. One example psychoacoustic model is Psychoacoustic Model 1 from the ISO/IEC MPEG-1 Standard.
(50) In some examples, the masking curve may involve tonal maskers and noise maskers. In some cases, including Psychoacoustic Model 1, a single noise masker may be created by summing the contribution of non-tonal maskers in each critical band of a signal. Alternatively, the sum may be replaced by an average, which has been found to model the masking curve more realistically.
(51) At block 520, for each mode in the unified set, a signal-to-mask ratio (SMR) may be determined based on the frequency for each given mode. The SMR values may be stored in the memory of the system.
(52) At block 530, the modes may be sorted according to the SMR for each mode. Then, at block 540, an input indicating a total number of modes may be received, and at block 550, the unified set of modes may be truncated down to a subset of modes having the modes with the highest SMR. The number of modes included in the subset may equal the total number input. The total number input may be a number that is less than or equal to the total number of modes of vibration included in the IR. The result is a subset of modes that excludes the modes having the least effect on the IR, and that includes the modes having the greatest effect on the IR, from a psychoacoustic perspective. This means that manipulation of the modal reverb parameters based on the subset of modes may be perceived by a listener as not different (or negligibly different) from manipulation of the parameters based on a complete set of identified modes of the complete IR.
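The sort-and-truncate steps of subroutine 500 can be sketched in Python as follows. The sketch assumes that each mode is summarized by a frequency and a magnitude in dB, and that the masking curve is supplied as a callable returning a masking level in dB at each frequency. The flat curve used in the example is a placeholder chosen only for illustration; it is not Psychoacoustic Model 1. The function name `truncate_by_smr` is hypothetical.

```python
import numpy as np

def truncate_by_smr(freqs_hz, mags_db, mask_curve_db, total_modes):
    """Keep the total_modes modes with the highest signal-to-mask ratio."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    mags_db = np.asarray(mags_db, dtype=float)
    # SMR in dB: mode level minus the masking level at the mode's frequency.
    smr = mags_db - mask_curve_db(freqs_hz)
    # Sort by descending SMR, then keep the first total_modes indices.
    order = np.argsort(-smr, kind="stable")
    keep = np.sort(order[:total_modes])
    return keep, smr

# Example with a placeholder flat masking curve at -35 dB.
flat_mask = lambda f: np.full_like(f, -35.0)
keep, smr = truncate_by_smr(
    freqs_hz=[100, 1000, 5000, 12000],
    mags_db=[-20, -30, -10, -40],
    mask_curve_db=flat_mask,
    total_modes=2,
)
```

Every retained mode has an SMR at least as large as that of any discarded mode, and the requested total can be lowered or raised to trade processing cost against audio quality.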
(53) Other methods for truncating modes may be used in place of or in conjunction with the subroutine 500 of
(54) In some instances, the ESPRIT algorithm may estimate an IR of a given acoustic space to contain between 6,000-12,000 modes. The number of modes that a user may wish to truncate from the 6,000-12,000 may vary from computer to computer depending on processing power, or from user to user depending on allowable time constraints or target audio quality. The subroutine 500 of
(55) Returning to
(56) More generally, the present disclosure may enable a user to more effectively and efficiently manipulate reverberation effects of an audio recording or a portion of the audio recording. For instance, the user may wish to add an acoustic effect to a portion of the audio recording to make the recording sound as if it were played in a target acoustic space, such as a large hall or a small room. In operation, one or more processors would receive or otherwise derive an impulse response of the target acoustic space, convert the impulse response into the frequency domain, break the frequency plot into sub-bands, and then analyze each of the sub-bands—first separately and then as an aggregate—in order to select the most significant modes of the space (e.g., the subset of modes described above). The impulse response may then be simplified by discarding the remaining, less significant modes of the space. The one or more processors would then be capable of manipulating the audio signal using the simplified impulse response of the space. The result would be a modified audio recording.
(57) In this regard, reverberation is only one example of a property of the audio recording that may be modified using a simplified set of modes of vibration, although modal modification is particularly useful for manipulating reverberation. This is in part because the mapping of modes to perceptually important parameters (room size, decay time) is relatively straightforward, and because the parameters of a modal filter bank can be stably modulated at audio-rate. Other approaches for audio signal or recording manipulation may be more effective for modifying other properties of a given signal.
(58) The routines described above operate on the assumption that an IR can be represented using a sum of exponentially damped sinusoids (EDS). In this manner, the selected modes are effectively an estimation of EDS parameters of the IR, and controlling the selected modes individually approximates controlling the individual EDSs of the IR. This can achieve a wide variety of audio effects to the IR, including but not limited to morphing, spatialization, room size scaling, equalization, and so on.
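Under the EDS assumption, a simplified IR can be resynthesized directly from the retained modes. The Python sketch below is illustrative only: it assumes each mode is given as a complex amplitude and a complex pole, sums the corresponding damped exponentials, and leaves it to the caller to take the real part when the modes are not supplied in conjugate pairs. The function name `synthesize_ir` is hypothetical.

```python
import numpy as np

def synthesize_ir(amps, poles, n_samples):
    """Return h[n] = sum_m a_m * z_m**n for n = 0..n_samples-1."""
    amps = np.asarray(amps, dtype=complex)
    poles = np.asarray(poles, dtype=complex)
    n = np.arange(n_samples)
    # Each row is one exponentially damped sinusoid; summing rows gives the IR.
    return (amps[:, None] * poles[:, None] ** n).sum(axis=0)

# Example: one purely decaying mode with amplitude 2 and pole 0.5.
ir = synthesize_ir([2.0], [0.5], 4)
```

Effects such as equalization or decay-time changes then amount to editing individual amplitudes or pole magnitudes before resynthesis.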
(59) Additionally, the routines described above generally describe processing of an impulse response of a chosen acoustic space. However, those skilled in the art will appreciate that similar mode selection concepts and algorithms may be applied to other digital inputs, such as audio signals, even without the audio signals being an impulse response of a selected space. For example, an audio signal may itself have included therein an impulse response of an acoustic space in which the audio signal is recorded, and that impulse response may include a number of modes of vibration of the recording space that may be identified and selected using the techniques herein. As a further example, the audio recording may be a drum recording including a number of modes of vibration, such that application of the ESPRIT algorithm could enable the modes of vibration to be separately modified. In this manner, the present application can achieve an improved resolution for any modally modifiable audio recording.
(60) The above examples are described in the context of using the ESPRIT algorithm. However, other algorithms may be used for the parameter approximation. More generally, parametric estimation algorithms other than ESPRIT may be used to deconstruct the signal into separate components (e.g., modes, damped sinusoids, etc.) and then estimate parameters of each separate component.
(61) Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.