METHOD FOR PROCESSING AN AUDIO SIGNAL, SIGNAL PROCESSING UNIT, BINAURAL RENDERER, AUDIO ENCODER AND AUDIO DECODER
20230032120 · 2023-02-02
Inventors
CPC classification
H04S7/305
ELECTRICITY
H04S2400/03
ELECTRICITY
H04S2420/01
ELECTRICITY
H04S7/30
ELECTRICITY
G10L19/008
PHYSICS
H04S2400/01
ELECTRICITY
International classification
H04S7/00
ELECTRICITY
G10L19/008
PHYSICS
Abstract
A method for processing an audio signal in accordance with a room impulse response is described. The audio signal is processed with an early part of the room impulse response separate from a late reverberation of the room impulse response, wherein the processing of the late reverberation comprises generating a scaled reverberated signal, the scaling being dependent on the audio signal. The processed early part of the audio signal and the scaled reverberated signal are combined.
Claims
1. A method for processing an audio signal in accordance with a room impulse response, the method comprising: applying the audio signal as an input signal to an early part processor and to a late reverberation processor; receiving, by the early part processor, a direct sound and early reflections of the room impulse response and processing the audio signal using the direct sound and early reflections to obtain a processed audio signal; receiving, by the late reverberation processor, predefined reverberator parameters and processing the audio signal using the predefined reverberator parameters to obtain a reverberated signal; scaling the reverberated signal to obtain a scaled reverberated signal; and combining the processed audio signal and the scaled reverberated signal, wherein scaling the reverberated signal by the late reverberation processor comprises setting a gain factor according to a predefined correlation measure of the audio signal, the predefined correlation measure having a fixed value determined empirically on the basis of an analysis of a plurality of audio signals, and applying the gain factor to the reverberated signal, or obtaining a gain factor using a correlation analysis of the audio signal, and applying the gain factor to the reverberated signal.
2. The method of claim 1, wherein the scaling is dependent on the condition of the one or more input channels of the audio signal, wherein the condition of the one or more input channels of the audio signal comprises one or more of the number of input channels, the number of active input channels and the activity in the input channel.
3. (canceled)
4. (canceled)
5. The method of claim 1, wherein the gain factor is determined based on the condition of one or more input channels of the audio signal.
6. The method of claim 1, wherein scaling the reverberated signal comprises applying the gain factor before, during or after processing the audio signal using the predefined reverberator parameters.
7. The method of claim 5, wherein the gain factor is determined as follows:
g=c.sub.u+ρ·(c.sub.c−c.sub.u) where ρ=the predefined correlation measure for the audio signal, c.sub.u, c.sub.c=factors indicative of the condition of the one or more input channels of the audio signal, with c.sub.u referring to totally uncorrelated channels, and c.sub.c relating to totally correlated channels.
8. The method of claim 7, wherein c.sub.u and c.sub.c are determined as follows:
9. (canceled)
10. (canceled)
11. (canceled)
12. The method of claim 1, wherein the correlation analysis of the audio signal comprises determining for an audio frame of the audio signal the predefined correlation measure, and wherein the predefined correlation measure is calculated by combining correlation coefficients for a plurality of channel combinations of one audio frame, each audio frame comprising one or more time slots.
13. The method of claim 12, wherein combining the correlation coefficients comprises averaging a plurality of the correlation coefficients of the audio frame.
14. The method of claim 11, wherein determining the combined correlation measure comprises: (i) calculating an overall mean value for every channel of the audio frame, (ii) calculating a zero-mean audio frame by subtracting the mean values from every channel, (iii) calculating for a plurality of channel combinations the correlation coefficient, and (iv) calculating the predefined correlation measure as a mean of the plurality of correlation coefficients.
15. The method of claim 11, wherein the correlation coefficient for a channel combination is calculated as follows:
16. The method of claim 1, comprising delaying the scaled reverberated signal such that the start of the scaled reverberated signal matches the transition point from the early reflections to the late reverberation in the room impulse response.
17. The method of claim 1, wherein processing the audio signal by the late reverberation processor comprises downmixing the audio signal.
18. A non-transitory digital storage medium having stored thereon a computer program with program code for carrying out the method of claim 1 when being executed by a computer.
19. A signal processing unit, comprising: an input for receiving an audio signal, an early part processor receiving a direct sound and early reflections of a room impulse response and processing the audio signal using the direct sound and early reflections to obtain a processed audio signal, a late reverberation processor receiving as input signal the audio signal, wherein the late reverberation processor is to receive predefined reverberator parameters and to process the audio signal using the predefined reverberator parameters to obtain a reverberated signal and to scale the reverberated signal to obtain a scaled reverberated signal; and an output for combining the processed audio signal and the scaled reverberated signal into an output audio signal, wherein the late reverberation processor is to scale the reverberated signal by setting a gain factor according to a predefined correlation measure of the audio signal, the predefined correlation measure having a fixed value determined empirically on the basis of an analysis of a plurality of audio signals, and applying the gain factor to the reverberated signal, or obtaining a gain factor using a correlation analysis of the audio signal, and applying the gain factor to the reverberated signal.
20. The signal processing unit of claim 19, wherein the late reverberation processor comprises: a reverberator receiving the audio signal and generating a reverberated signal; and a gain stage coupled to an input or to an output of the reverberator and controlled by the gain factor.
21. The signal processing unit of claim 20, comprising a correlation analyzer generating the gain factor dependent on the audio signal.
22. The signal processing unit of claim 20, further comprising at least one of: a low pass filter coupled to the gain stage, and a delay element coupled between the gain stage and an adder, the adder further coupled to the early part processor and the output.
23. A binaural renderer, comprising the signal processing unit of claim 19.
24. An audio encoder for coding audio signals, comprising: the signal processing unit of claim 19 or the binaural renderer of claim 23.
25. An audio decoder for decoding encoded audio signals, comprising: the signal processing unit of claim 19 or the binaural renderer of claim 23.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0042] Embodiments of the present invention will be described with regard to the accompanying drawings.
DETAILED DESCRIPTION OF THE INVENTION
[0052] Embodiments of the inventive approach will now be described. The following description will start with a system overview of a 3D audio codec system in which the inventive approach may be implemented.
[0055] In an embodiment of the present invention, the encoding/decoding system depicted in
[0056] The algorithm blocks for the overall 3D audio system shown in
[0057] The pre-renderer/mixer 102 may be optionally provided to convert a channel plus object input scene into a channel scene before encoding. Functionally, it is identical to the object renderer/mixer that will be described below. Pre-rendering of objects may be desired to ensure a deterministic signal entropy at the encoder input that is basically independent of the number of simultaneously active object signals. With pre-rendering of objects, no object metadata transmission is required. Discrete object signals are rendered to the channel layout that the encoder is configured to use. The weights of the objects for each channel are obtained from the associated object metadata (OAM).
[0058] The USAC encoder 116 is the core codec for loudspeaker-channel signals, discrete object signals, object downmix signals and pre-rendered signals. It is based on the MPEG-D USAC technology. It handles the coding of the above signals by creating channel- and object-mapping information based on the geometric and semantic information of the input channel and object assignment. This mapping information describes how input channels and objects are mapped to USAC-channel elements, such as channel pair elements (CPEs), single channel elements (SCEs), low frequency effects (LFEs) and quad channel elements (QCEs), and the corresponding information is transmitted to the decoder. All additional payloads like SAOC data 114, 118 or object metadata 126 are considered in the encoder's rate control. The coding of objects is possible in different ways, depending on the rate/distortion requirements and the interactivity requirements for the renderer. In accordance with embodiments, the following object coding variants are possible: [0059] Pre-rendered objects: Object signals are pre-rendered and mixed to the 22.2 channel signals before encoding. The subsequent coding chain sees 22.2 channel signals. [0060] Discrete object waveforms: Objects are supplied as monophonic waveforms to the encoder. The encoder uses single channel elements (SCEs) to transmit the objects in addition to the channel signals. The decoded objects are rendered and mixed at the receiver side. Compressed object metadata information is transmitted to the receiver/renderer. [0061] Parametric object waveforms: Object properties and their relation to each other are described by means of SAOC parameters. The downmix of the object signals is coded with the USAC. The parametric information is transmitted alongside. The number of downmix channels is chosen depending on the number of objects and the overall data rate. Compressed object metadata information is transmitted to the SAOC renderer.
[0062] The SAOC encoder 112 and the SAOC decoder 220 for object signals may be based on the MPEG SAOC technology. The system is capable of recreating, modifying and rendering a number of audio objects based on a smaller number of transmitted channels and additional parametric data, such as OLDs (Object Level Differences), IOCs (Inter-Object Coherence), and DMGs (Downmix Gains). The additional parametric data exhibits a significantly lower data rate than necessitated for transmitting all objects individually, making the coding very efficient. The SAOC encoder 112 takes as input the object/channel signals as monophonic waveforms and outputs the parametric information (which is packed into the 3D-Audio bitstream 128) and the SAOC transport channels (which are encoded using single channel elements and are transmitted). The SAOC decoder 220 reconstructs the object/channel signals from the decoded SAOC transport channels 210 and the parametric information 214, and generates the output audio scene based on the reproduction layout, the decompressed object metadata information and optionally on the basis of the user interaction information.
[0063] The object metadata codec (see OAM encoder 124 and OAM decoder 224) is provided so that, for each object, the associated metadata that specifies the geometrical position and volume of the objects in the 3D space is efficiently coded by quantization of the object properties in time and space. The compressed object metadata cOAM 126 is transmitted to the receiver 200 as side information.
[0064] The object renderer 216 utilizes the compressed object metadata to generate object waveforms according to the given reproduction format. Each object is rendered to a certain output channel according to its metadata. The output of this block results from the sum of the partial results. If both channel based content as well as discrete/parametric objects are decoded, the channel based waveforms and the rendered object waveforms are mixed by the mixer 226 before outputting the resulting waveforms 228 or before feeding them to a postprocessor module like the binaural renderer 236 or the loudspeaker renderer module 232.
[0065] The binaural renderer module 236 produces a binaural downmix of the multichannel audio material such that each input channel is represented by a virtual sound source. The processing is conducted frame-wise in the QMF (Quadrature Mirror Filterbank) domain, and the binauralization is based on measured binaural room impulse responses.
[0066] The loudspeaker renderer 232 converts between the transmitted channel configuration 228 and the desired reproduction format. It may also be called “format converter”. The format converter performs conversions to lower numbers of output channels, i.e., it creates downmixes.
[0070] As has been described above, in a binaural renderer, for example a binaural renderer as it is depicted in
[0072] In a binaural renderer, as mentioned above, it may be desired to process the direct sound and early reflections separate from the late reverberation, mainly because of the reduced computational complexity. The processing of the direct sound and early reflections may, for example, be imprinted to the audio signal by a convolutional approach carried out by the processor 406 (see
[0073] This processing is also described in known technology reference [1]. The result of the above described approach should be perceptually as far as possible identical to the result of a convolution of the complete impulse response, the full-convolution approach described with regard to
[0074] However, it has been found out that despite these input parameters provided to the reverberator, the influence of the input audio signal on the reverberation is not fully preserved when using a synthetic reverberation approach as is described with regard to
[0075] So far, there are no known approaches that compare the amount of late reverberation with the results of the full-convolutional approach or match it to the convolutional result. There are some techniques that try to rate the quality of late reverberation or how natural it sounds. For example, in one method a loudness measure for natural sounding reverberation is defined, which predicts the perceived loudness of reverberation using a loudness model.
[0076] This approach is described in known technology reference [2], and the level can be fitted to a target value. The disadvantage of this approach is that it relies on a model of human hearing which is complicated and not exact. It also needs a target loudness to provide a scaling factor for the late reverberation that could be found using the full-convolution result.
[0077] In another method described in known technology reference [3] a cross-correlation criterion for artificial reverberation quality testing is used. However, this is only applicable for testing different reverberation algorithms, but not for multichannel audio, not for binaural audio and not for qualifying the scaling of late reverberation.
[0078] Another possible approach is to use the number of input channels at the considered ear as a scaling factor; however, this does not give a perceptually correct scaling, because the perceived amplitude of the overall sound signal depends on the correlation of the different audio channels and not just on the number of channels.
[0079] Therefore, in accordance with the inventive approach a signal-dependent scaling method is provided which adapts the level of reverberation according to the input audio signal. As mentioned above, the perceived level of the reverberation is desired to match with the level of reverberation when using the full-convolution approach for the binaural rendering, and the determination of a measure for an adequate level of reverberation is therefore important for achieving a good sound quality. In accordance with embodiments, an audio signal is separately processed with an early part and a late reverberation of the room impulse response, wherein processing the late reverberation comprises generating a scaled reverberated signal, the scaling being dependent on the audio signal. The processed early part of the audio signal and the scaled reverberated signal are combined into the output signal. In accordance with one embodiment the scaling is dependent on the condition of the one or more input channels of the audio signal (e.g. the number of input channels, the number of active input channels and/or the activity in the input channel). In accordance with another embodiment the scaling is dependent on a predefined or calculated correlation measure for the audio signal. Alternative embodiments may perform the scaling based on a combination of the condition of the one or more input channels and the predefined or calculated correlation measure.
[0080] In accordance with embodiments the scaled reverberated signal may be generated by applying a gain factor that is determined based on the condition of the one or more input channels of the audio signal, or based on the predefined or calculated correlation measure for the audio signal, or based on a combination thereof.
[0081] In accordance with embodiments, separately processing the audio signal comprises processing the audio signal with the early reflection part 301, 302 of the room impulse response 300 during a first process, and processing the audio signal with the diffuse reverberation 304 of the room impulse response 300 during a second process that is different and separate from the first process. Changing from the first process to the second process occurs at the transition time. In accordance with further embodiments, in the second process the diffuse (late) reverberation 304 may be replaced by a synthetic reverberation. In this case the room impulse response applied to the first process contains only the early reflection part 301, 302 (see
[0082] In the following an embodiment of the inventive approach will be described in further detail in accordance with which the gain factor is calculated on the basis of a correlation analysis of the input audio signal.
[0083] The reverberation branch 512 further includes a correlation analysis processor 524 that receives the input signal 504 and generates a gain factor g at its output. Further, a gain stage 526 is provided that is coupled between the reverberator 514 and the adder 510. The gain stage 526 is controlled by the gain factor g, thereby generating at the output of the gain stage 526 the scaled reverberated signal r.sub.g[k] that is applied to the adder 510. The adder 510 combines the early processed part and the reverberated signal to provide the output signal y[k] which also includes two channels. Optionally, the reverberation branch 512 may comprise a low pass filter 528 coupled between the processor 524 and the gain stage for smoothing the gain factor over a number of audio frames. Optionally, a delay element 530 may also be provided between the output of the gain stage 526 and the adder 510 for delaying the scaled reverberated signal such that it matches a transition between the early reflection and the reverberation in the room impulse response.
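The signal flow of the reverberation branch described above (reverberator 514, gain stage 526, delay element 530, adder 510) can be sketched as follows. This is a non-normative illustration, not the specified implementation; the function names and the pass-in of the reverberator as a callable are assumptions made here for clarity:

```python
import numpy as np

def reverberation_branch(x, reverberator, g, delay):
    """Sketch of the reverberation branch: reverberate the 2-channel input,
    apply the gain factor g at the gain stage, and delay the result so it
    matches the early-reflection/late-reverberation transition point.
    `reverberator` stands in for the synthetic reverberator (514)."""
    r = reverberator(x)                      # reverberated signal, shape (2, N)
    r_g = g * r                              # gain stage controlled by gain factor g
    if delay > 0:                            # optional delay element (530)
        pad = np.zeros((r_g.shape[0], delay))
        r_g = np.concatenate([pad, r_g[:, :-delay]], axis=1)
    return r_g

def binaural_output(early_part, scaled_reverb):
    """Adder (510): combine early-processed part and scaled reverberation."""
    return early_part + scaled_reverb
```

A trivial stand-in reverberator (e.g. `lambda s: 0.5 * s`) is enough to exercise the branch; in the renderer the reverberator is a frequency domain reverberator operating in the QMF domain.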
[0084] As described above,
[0085] The multichannel binaural renderer depicted in
[0086] For calculating the scaling factors, a correlation measure is introduced that is based on the correlation coefficient and in accordance with embodiments, is defined in a two-dimensional time-frequency domain, for example the QMF domain. A correlation value between −1 and 1 is calculated for each multi-dimensional audio frame, each audio frame being defined by a number of frequency bands N, a number of time slots M per frame, and a number of audio channels A. One scaling factor per frame per ear is obtained.
[0087] In the following, an embodiment of the inventive approach will be described in further detail. First of all, reference is made to the correlation measure used in the correlation analysis processor 524 of
[0088] This processing in accordance with the described embodiment is transferred to two dimensions in a time-frequency domain, for example the QMF-domain. The two dimensions are the time slots and the QMF bands. This approach is reasonable, because the data is often encoded and transmitted also in the time-frequency domain. The expectation operator is replaced with a mean operation over several time and/or frequency samples so that the time-frequency correlation measure between two zero-mean variables x.sub.m, x.sub.n in the range of (−1, 1) is defined as follows:
[0089] After the calculation of this coefficient for a plurality of channel combinations (m,n) of one audio frame, the values of ρ[m,n,t.sub.i] are combined to a single correlation measure ρ.sub.m(t.sub.i) by taking the mean of (or averaging) a plurality of correlation values ρ[m,n,t.sub.i]. It is noted that the audio frame may comprise 32 QMF time slots, and t.sub.i indicates the respective audio frame. The above processing may be summarized for one audio frame as follows: [0090] (i) First, the overall mean value
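A minimal sketch of the per-frame correlation measure is given below, assuming the audio frame is supplied as a channels-by-samples array (the QMF time/frequency samples of one frame flattened per channel). The helper name is illustrative; the steps follow the summary above: zero-mean the channels, compute the pairwise correlation coefficients, and average them:

```python
import numpy as np

def frame_correlation_measure(frame):
    """Illustrative correlation measure rho_m(t_i) for one audio frame.
    `frame` has shape (A, M*N): A channels, M time slots x N bands."""
    A = frame.shape[0]
    # (i)+(ii): overall mean per channel, then zero-mean frame
    z = frame - frame.mean(axis=1, keepdims=True)
    coeffs = []
    for m in range(A):                       # (iii): pairwise correlation coefficients
        for n in range(m + 1, A):
            denom = np.sqrt(np.sum(z[m] ** 2) * np.sum(z[n] ** 2))
            if denom > 0:
                coeffs.append(np.sum(z[m] * z[n]) / denom)
    # (iv): mean of the coefficients yields the single per-frame measure
    return float(np.mean(coeffs)) if coeffs else 0.0
```

Two identical channels yield a measure of 1, two sign-inverted channels yield −1, matching the stated value range.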
[0094] In accordance with the above described embodiment the scaling was determined based on the calculated correlation measure for the audio signal. This is advantageous, despite the additional computational resources needed, when it is desired to obtain the correlation measure for the currently processed audio signal individually.
[0095] However, the present invention is not limited to such an approach. In accordance with other embodiments, rather than calculating the correlation measure, a predefined correlation measure may be used. Using a predefined correlation measure is advantageous as it reduces the computational complexity of the process. The predefined correlation measure may have a fixed value, e.g. 0.1 to 0.9, that may be determined empirically on the basis of an analysis of a plurality of audio signals. In such a case the correlation analysis 524 may be omitted and the gain of the gain stage may be set by an appropriate control signal.
[0096] In accordance with other embodiments the scaling may be dependent on the condition of the one or more input channels of the audio signal (e.g. the number of input channels, the number of active input channels and/or the activity in the input channel). This is advantageous because the scaling can be easily determined from the input audio signal with a reduced computational overhead. For example, the scaling can be determined by simply determining the number of channels in the original audio signal that are downmixed to a currently considered downmix channel of a downmix comprising a reduced number of channels when compared to the original audio signal. Alternatively, the number of active channels (channels showing some activity in a current audio frame) downmixed to the currently considered downmix channel may form the basis for scaling the reverberated signal. This may be done in the block 524.
[0097] In the following, an embodiment will be described in detail determining the scaling of the reverberated signal on the basis of the condition of the one or more input channels of the audio signal and on the basis of a correlation measure (either fixed or calculated as above described). In accordance with such an embodiment, the gain factor or gain or scaling factor g is defined as follows:
[0098] c.sub.u is the factor that is applied if the downmixed channels are totally uncorrelated (no inter-channel dependencies). In case of using only the condition of the one or more input channels, g=c.sub.u and the predefined fixed correlation coefficient is set to zero. c.sub.c is the factor that is applied if the downmixed channels are totally correlated (the signals are weighted versions (plus phase-shift and offset) of each other). In case of using only the condition of the one or more input channels, g=c.sub.c and the predefined fixed correlation coefficient is set to one. These factors describe the minimum and maximum scaling of the late reverberation in the audio frame (depending on the number of (active) channels).
[0099] The “channel number” K.sub.in is defined, in accordance with embodiments, as follows: A multichannel audio signal is downmixed to a stereo downmix using a downmix matrix Q that defines which input channels are included in which downmix channel (size M×2, with M being the number of input channels of the audio input material, e.g. 6 channels for a 5.1 setup).
[0100] An example for the downmix matrix Q may be as follows:
[0101] For each of the two downmix channels the scaling coefficient is calculated as follows:
g=ƒ(c.sub.c,c.sub.u,ρ.sub.avg)=c.sub.u+ρ.sub.avg·(c.sub.c−c.sub.u)
with ρ.sub.avg being the average/mean value of all correlation coefficients ρ[m, n] for a number of K.sub.in·K.sub.in channel combinations [m, n] and c.sub.c, c.sub.u being dependent on the channel number K.sub.in, which may be as follows: [0102] K.sub.in may be the number of channels that are downmixed to the currently considered downmix channel k∈[1,2] (the number of rows in the downmix matrix Q in the column k that contain values unequal to zero). This number is time-invariant because the downmix matrix Q is predefined for one input channel configuration and does not change over the length of one audio input signal. [0103] E.g., when considering a 5.1 input signal the following applies: [0104] channels 1, 3, 4 are downmixed to downmix channel 1 (see matrix Q above), [0105] K.sub.in=3 in every frame (3 channels). [0106] K.sub.in may be the number of active channels that are downmixed to the currently considered downmix channel k∈[1,2] (the number of input channels where there is activity in the current audio frame and where the corresponding row of the downmix matrix Q in the column k contains a value unequal to zero, i.e., the number of channels in the intersection of the active channels and the non-zero elements in column k of Q). This number may be time-variant over the length of one audio input signal, because even if Q stays the same, the signal activity may vary over time. [0107] E.g., when considering a 5.1 input signal the following applies: [0108] channels 1, 3, 4 are downmixed to downmix channel 1 (see matrix Q above), [0109] In frame n: [0110] the active channels are channels 1, 2, 4, [0111] K.sub.in is the number of channels in the intersection {1, 4}, [0112] K.sub.in(n)=2. [0113] In frame n+1: [0114] the active channels are channels 1, 2, 3, 4, [0115] K.sub.in is the number of channels in the intersection {1, 3, 4}, [0116] K.sub.in(n+1)=3.
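The computation of the scaling coefficient for one downmix channel can be sketched as follows. The helper names and the mappings from K.sub.in to c.sub.u and c.sub.c are assumptions for illustration (the specification defines c.sub.u and c.sub.c as channel-number dependent but their exact form is not reproduced here); the downmix matrix below is a hypothetical 5.1-to-stereo example consistent with the text, in which input channels 1, 3 and 4 feed downmix channel 1:

```python
import numpy as np

def gain_factor(Q, k, active, rho_avg, c_u_of, c_c_of):
    """Illustrative evaluation of g = c_u + rho_avg * (c_c - c_u).
    Q:       M x 2 downmix matrix (rows = input channels, columns = downmix channels)
    k:       downmix channel index (0-based here)
    active:  set of 0-based input channel indices active in the current frame
    c_u_of, c_c_of: assumed mappings from the channel number K_in to c_u, c_c."""
    included = {m for m in range(Q.shape[0]) if Q[m, k] != 0}
    K_in = len(included & set(active))   # intersection of included and active channels
    c_u, c_c = c_u_of(K_in), c_c_of(K_in)
    return c_u + rho_avg * (c_c - c_u)
```

With `rho_avg = 0` this reduces to g = c.sub.u (totally uncorrelated case) and with `rho_avg = 1` to g = c.sub.c (totally correlated case), as described above.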
[0117] An audio channel (in a predefined frame) may be considered active in case it has an amplitude or an energy within the predefined frame that exceeds a preset threshold value, e.g., in accordance with embodiments, an activity in an audio channel (in a predefined frame) may be defined as follows: [0118] the sum or maximum value of the absolute amplitudes of the signal (in the time domain, QMF domain, etc.) in the frame is bigger than zero, or [0119] the sum or maximum value of the signal energy (squared absolute value of amplitudes in time domain or QMF domain) in the frame is bigger than zero.
[0120] Instead of zero, another threshold (relative to the maximum energy or amplitude) bigger than zero may also be used, e.g. a threshold of 0.01.
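The activity criterion of paragraphs [0117] to [0120] amounts to a simple per-frame threshold test, sketched below; the function name is illustrative, and the default threshold of zero follows [0118]/[0119], while a small positive value such as 0.01 may be substituted per [0120]:

```python
import numpy as np

def is_active(ch, threshold=0.0):
    """A channel is considered active in a frame when the maximum absolute
    amplitude (equivalently, the maximum signal energy) in the frame
    exceeds the threshold; the sum over the frame may be used instead."""
    return bool(np.max(np.abs(ch)) > threshold)
```

Applying this to each input channel of the current frame yields the set of active channels used when determining K.sub.in.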
[0121] In accordance with embodiments, a gain factor for each ear is provided which depends on the number of active (time-varying) or the fixed number of included channels (downmix matrix entry unequal to zero) K.sub.in in the downmix channel. It is assumed that the factor increases linearly between the totally uncorrelated and the totally correlated case. Totally uncorrelated means no inter-channel dependencies (correlation value is zero) and totally correlated means the signals are weighted versions of each other (with phase difference or offset; correlation value is one).
[0122] As mentioned above, the gain or scaling factor g may be smoothed over the audio frames by the low pass filter 528. The low pass filter 528 may have a time constant t.sub.s, which results in a smoothed gain factor g.sub.s(t) for a frame size k as follows:
[0123] The frame size k may be the size of an audio frame in time domain samples, e.g. 2048 samples.
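Since the exact filter equation is not reproduced above, the sketch below uses a one-pole (exponential) smoother as a stand-in for the low pass filter 528, with the frame size of 2048 samples from [0123]; the sampling rate and time constant values are assumptions:

```python
import math

def smooth_gains(gains, t_s=0.01, frame_size=2048, f_s=48000):
    """Hedged sketch of frame-wise gain smoothing: a one-pole smoother with
    time constant t_s applied to the per-frame gain factors. All parameter
    values are illustrative assumptions, not taken from the specification."""
    alpha = math.exp(-frame_size / (f_s * t_s))  # per-frame smoothing coefficient
    g_s, out = gains[0], []
    for g in gains:
        g_s = alpha * g_s + (1.0 - alpha) * g    # smoothed gain g_s(t_i)
        out.append(g_s)
    return out
```

A constant gain sequence passes through unchanged, while a step in the gain is spread over several frames, which is the intended effect of the smoothing.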
[0124] The left channel reverbed signal of the audio frame x(t.sub.i) is then scaled by the factor g.sub.s,left(t.sub.i) and the right channel reverbed signal is scaled by the factor g.sub.s,right(t.sub.i). The scaling factor is once calculated with K.sub.in as the number of (active non-zero or total number of) channels that are present in the left channel of the stereo downmix that is fed to the reverberator, resulting in the scaling factor g.sub.s,left(t.sub.i). Then the scaling factor is calculated once more with K.sub.in as the number of (active non-zero or total number of) channels that are present in the right channel of the stereo downmix that is fed to the reverberator, resulting in the scaling factor g.sub.s,right(t.sub.i). The reverberator gives back a stereo reverberated version of the audio frame. The left channel of the reverberated version (or the left channel of the input of the reverberator) is scaled with g.sub.s,left(t.sub.i) and the right channel of the reverberated version (or the right channel of the input of the reverberator) is scaled with g.sub.s,right(t.sub.i).
[0125] The scaled artificial (synthetic) late reverberation is applied to the adder 510 to be added to the signal 506 which has been processed with the direct sound and the early reflections.
[0126] As mentioned above, the inventive approach, in accordance with embodiments may be used in a binaural processor for binaural processing of audio signals. In the following an embodiment of binaural processing of audio signals will be described. The binaural processing may be carried out as a decoder process converting the decoded signal into a binaural downmix signal that provides a surround sound experience when listened to over headphones.
[0128] The binaural renderer module 800 (e.g., the binaural renderer 236 of
Definitions
[0129] Audio signals 802 that are fed into the binaural renderer module 800 are referred to as input signals in the following. Audio signals 830 that are the result of the binaural processing are referred to as output signals. The input signals 802 of the binaural renderer module 800 are audio output signals of the core decoder (see for example signals 228 in
TABLE-US-00001
N.sub.in: Number of input channels
N.sub.out: Number of output channels, N.sub.out=2
M.sub.DMX: Downmix matrix containing real-valued non-negative downmix coefficients (downmix gains); M.sub.DMX is of dimension N.sub.out×N.sub.in
L: Frame length measured in time domain audio samples
v: Time domain sample index
n: QMF time slot index (subband sample index)
L.sub.n: Frame length measured in QMF time slots
F: Frame index (frame number)
K: Number of QMF frequency bands, K=64
k: QMF band index (1 . . . 64)
A, B, ch: Channel indices (channel numbers of channel configurations)
L.sub.trans: Length of the BRIR's early reflection part in time domain samples
L.sub.trans,n: Length of the BRIR's early reflection part in QMF time slots
N.sub.BRIR: Number of BRIR pairs in a BRIR data set
L.sub.FFT: Length of FFT transform
ℜ(·): Real part of a complex-valued signal
ℑ(·): Imaginary part of a complex-valued signal
m.sub.conv: Vector that signals which input signal channel belongs to which BRIR pair in the BRIR data set
f.sub.max: Maximum frequency used for the binaural processing
f.sub.max,decoder: Maximum signal frequency that is present in the audio output signal of the decoder
K.sub.max: Maximum band that is used for the convolution of the audio input signal with the early reflection part of the BRIRs
a: Downmix matrix coefficient
c.sub.eq,k: Bandwise energy equalization factor
ε: Numerical constant, ε=10.sup.−20
d: Delay in QMF domain time slots
y̆.sub.ch.sup.n′,k: Pseudo-FFT domain signal representation in frequency band k
n′: Pseudo-FFT frequency index
h̆.sup.n′,k: Pseudo-FFT domain representation of BRIR in frequency band k
z̆.sub.ch,conv.sup.n′,k: Pseudo-FFT domain convolution result in frequency band k
{circumflex over (z)}.sub.ch,conv.sup.n,k: Intermediate signal: 2-channel convolutional result in QMF domain
{circumflex over (z)}.sub.ch,rev.sup.n,k: Intermediate signal: 2-channel reverberation in QMF domain
K.sub.ana: Number of analysis frequency bands (used for the reverberator)
f.sub.c,ana: Center frequencies of analysis frequency bands
N.sub.DMX,act: Number of channels that are downmixed to one channel of the stereo downmix and are active in the actual signal frame
c.sub.corr: Overall correlation coefficient for one signal frame
c.sub.corr.sup.A,B: Correlation coefficient for the combination of channels A, B
σ.sub.ŷ.sub.
[0130] The processing of the input signal is now described. The binaural renderer module operates on contiguous, non-overlapping frames of length L=2048 time domain samples of the input audio signals and outputs one frame of L samples per processed input frame of length L.
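The frame-based operation described in [0130] can be sketched as follows; the helper name `split_into_frames` and the zero-padding of the final partial frame are illustrative assumptions, not part of the specification.

```python
# Sketch of the frame-based driver described above: the renderer consumes
# contiguous, non-overlapping frames of L = 2048 time domain samples.
L = 2048  # frame length in time domain samples

def split_into_frames(signal, frame_len=L):
    """Split a channel's sample list into contiguous, non-overlapping frames,
    zero-padding the last frame so every frame has exactly frame_len samples
    (the padding behaviour is an assumption for this sketch)."""
    frames = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        frame = frame + [0.0] * (frame_len - len(frame))  # pad final frame
        frames.append(frame)
    return frames
```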
[0131] (1) Initialization and Preprocessing
[0132] The initialization of the binaural processing block is carried out before the processing of the audio samples delivered by the core decoder (see, for example, the decoder 200 described above).
[0133] (a) Reading of Analysis Values
[0134] The reverberator module 816a, 816b takes a frequency-dependent set of reverberation times 808 and energy values 810 as input parameters. These values are read from an interface at the initialization of the binaural processing module 800. In addition, the transition time 832 from early reflections to late reverberation, given in time domain samples, is read. The values may be stored in a binary file written as 32-bit little-endian float values. The read values that are needed for the processing are stated in the table below:
TABLE-US-00002
Value description | Number | Datatype
Transition length L_trans | 1 | Integer
Number of frequency bands K_ana | 1 | Integer
Center frequencies f_c,ana of frequency bands | K_ana | Float
Reverberation times RT60 in seconds | K_ana | Float
Energy values that represent the energy (amplitude to the power of two) of the late reverberation part of one BRIR | K_ana | Float
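A minimal sketch of reading this parameter block, assuming the values are stored back-to-back as little-endian 32-bit floats in the table's order; the exact file layout beyond what [0134] states, and the helper name, are assumptions.

```python
import struct

def read_analysis_values(raw: bytes):
    """Parse the reverberator parameter block described in [0134].

    Assumed layout (little-endian 32-bit floats, in table order):
    1 transition length L_trans, 1 band count K_ana, then K_ana center
    frequencies, K_ana RT60 values and K_ana energy values.
    """
    floats = struct.unpack('<%df' % (len(raw) // 4), raw)
    l_trans = int(floats[0])   # stored as float, semantically an integer
    k_ana = int(floats[1])
    pos = 2
    f_c_ana = list(floats[pos:pos + k_ana]); pos += k_ana
    rt60 = list(floats[pos:pos + k_ana]); pos += k_ana
    energies = list(floats[pos:pos + k_ana])
    return l_trans, k_ana, f_c_ana, rt60, energies
```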
[0135] (b) Reading and Preprocessing of BRIRs
[0136] The binaural room impulse responses 804 are read from two dedicated files that store individually the left and right ear BRIRs. The time domain samples of the BRIRs are stored in integer wave-files with a resolution of 24 bit per sample and 32 channels. The ordering of BRIRs in the file is as stated in the following table:
TABLE-US-00003 (channel number, speaker label):
1 CH_M_L045   2 CH_M_R045   3 CH_M_000    4 CH_LFE1
5 CH_M_L135   6 CH_M_R135   7 CH_M_L030   8 CH_M_R030
9 CH_M_180    10 CH_LFE2    11 CH_M_L090  12 CH_M_R090
13 CH_U_L045  14 CH_U_R045  15 CH_U_000   16 CH_T_000
17 CH_U_L135  18 CH_U_R135  19 CH_U_L090  20 CH_U_R090
21 CH_U_180   22 CH_L_000   23 CH_L_L045  24 CH_L_R045
25 CH_M_L060  26 CH_M_R060  27 CH_M_L110  28 CH_M_R110
29 CH_U_L030  30 CH_U_R030  31 CH_U_L110  32 CH_U_R110
[0137] If there is no BRIR measured at one of the loudspeaker positions, the corresponding channel in the wave file contains zero-values. The LFE channels are not used for the binaural processing.
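Decoding the interleaved 24-bit PCM frames of such a BRIR wave file can be sketched as follows; the function name and the scaling to [−1, 1) are illustrative assumptions, and the wave-container parsing itself is omitted.

```python
def decode_brir_frames(raw: bytes, n_channels: int = 32):
    """Decode interleaved 24-bit signed little-endian PCM (as stored in the
    BRIR wave files, 32 channels) into one float list per channel.

    Samples are scaled by 2^23 so full scale maps to [-1, 1)."""
    bytes_per_frame = 3 * n_channels
    n_frames = len(raw) // bytes_per_frame
    channels = [[] for _ in range(n_channels)]
    for f in range(n_frames):
        base = f * bytes_per_frame
        for ch in range(n_channels):
            b = raw[base + 3 * ch: base + 3 * ch + 3]
            val = int.from_bytes(b, 'little', signed=True)
            channels[ch].append(val / 2 ** 23)
    return channels
```

The LFE channels (channel numbers 4 and 10 in the table) are not used for the binaural processing and may simply be skipped after decoding.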
[0138] As a preprocessing step, the given set of binaural room impulse responses (BRIRs) is transformed from time domain filters to complex-valued QMF domain filters. The implementation of the given time domain filters in the complex-valued QMF domain is carried out according to ISO/IEC FDIS 23003-1:2006, Annex B. The prototype filter coefficients for the filter conversion are used according to ISO/IEC FDIS 23003-1:2006, Annex B, Table B.1. The time domain representation h̃_ch^v = [h̃_1^v . . . h̃_N_BRIR^v] with 1 ≤ v ≤ L_trans is processed to obtain a complex-valued QMF domain filter ĥ_ch^{n,k} = [ĥ_1^{n,k} . . . ĥ_N_BRIR^{n,k}] with 1 ≤ n ≤ L_trans,n.
[0139] (2) Audio Signal Processing
[0140] The audio processing block of the binaural renderer module 800 obtains time domain audio samples 802 for N_in input channels from the core decoder and generates a binaural output signal 830 consisting of N_out=2 channels.
[0141] The processing takes as input:
[0142] the decoded audio data 802 from the core decoder,
[0143] the complex QMF domain representation of the early reflection part of the BRIR set 804, and
[0144] the frequency-dependent parameter set 808, 810, 832 that is used by the QMF domain reverberator 816a, 816b to generate the late reverberation 826a, 826b.
[0145] (a) QMF Analysis of the Audio Signal
[0146] As the first processing step, the binaural renderer module transforms L=2048 time domain samples of the N_in-channel time domain input signal (coming from the core decoder) ỹ_ch^v = [ỹ_ch,1^v . . . ỹ_ch,N_in^v] to an N_in-channel QMF domain signal representation 802 of dimension L_n=32 QMF time slots (slot index n) and K=64 frequency bands (band index k).
[0147] A QMF analysis as outlined in ISO/IEC 14496-3:2009, subclause 4.6.18.2, with the modifications stated in ISO/IEC 14496-3:2009, subclause 8.6.4.2, is performed on a frame of the time domain signal ỹ_ch^v to obtain a frame of the QMF domain signal ŷ_ch^{n,k} = [ŷ_ch,1^{n,k} . . . ŷ_ch,N_in^{n,k}] with 1 ≤ v ≤ L and 1 ≤ n ≤ L_n.
[0148] (b) Fast convolution of the QMF domain audio signal and the QMF domain BRIRs Next, a bandwise fast convolution 812 is carried out to process the QMF domain audio signal 802 and the QMF domain BRIRs 804. A FFT analysis may be carried out for each QMF frequency band k for each channel of the input signal 802 and each BRIR 804.
[0149] Due to the complex values in the QMF domain, one FFT analysis is carried out on the real part of the QMF domain signal representation and one FFT analysis on the imaginary part of the QMF domain signal representation. The results are then combined to form the final bandwise complex-valued pseudo-FFT domain signal

y̆_ch^{n′,k} = FFT(ŷ_ch^{n,k}) = FFT(ℜ(ŷ_ch^{n,k})) + j · FFT(ℑ(ŷ_ch^{n,k}))

and the bandwise complex-valued BRIRs

h̆_1^{n′,k} = FFT(ĥ_1^{n,k}) = FFT(ℜ(ĥ_1^{n,k})) + j · FFT(ℑ(ĥ_1^{n,k})) for the left ear
h̆_2^{n′,k} = FFT(ĥ_2^{n,k}) = FFT(ℜ(ĥ_2^{n,k})) + j · FFT(ℑ(ĥ_2^{n,k})) for the right ear
[0150] The length of the FFT transform is determined according to the length of the complex-valued QMF domain BRIR filters L_trans,n and the frame length in QMF domain time slots L_n such that

L_FFT = L_trans,n + L_n − 1
[0151] The complex-valued pseudo-FFT domain signals are then multiplied with the complex-valued pseudo-FFT domain BRIR filters to form the fast convolution results. A vector m_conv is used to signal which channel of the input signal corresponds to which BRIR pair in the BRIR data set.
[0152] This multiplication is done bandwise for all QMF frequency bands k with 1 ≤ k ≤ K_max. The maximum band K_max is determined by the QMF band representing a frequency of either 18 kHz or the maximal signal frequency that is present in the audio signal from the core decoder, whichever is smaller:

f_max = min(f_max,decoder, 18 kHz).
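The band limit can be sketched as below, assuming K uniformly spaced QMF bands of width fs/(2K) each; the exact band-to-frequency mapping is an assumption of this sketch, and `qmf_band_limit` is a hypothetical helper.

```python
import math

def qmf_band_limit(f_max_decoder_hz: float, fs_hz: float, K: int = 64) -> int:
    """Return K_max, the highest QMF band processed by the fast convolution.

    Assumes each of the K uniformly spaced QMF bands covers fs/(2K) Hz, so
    the band containing frequency f is ceil(f / (fs / (2 * K)))."""
    f_max = min(f_max_decoder_hz, 18000.0)  # f_max = min(f_max,decoder, 18 kHz)
    band_width = fs_hz / (2 * K)
    return min(K, math.ceil(f_max / band_width))
```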
[0153] The multiplication results from each audio input channel with each BRIR pair are summed up in each QMF frequency band k with 1 ≤ k ≤ K_max, resulting in an intermediate 2-channel K_max-band pseudo-FFT domain signal, the pseudo-FFT convolution result z̆_ch,conv^{n′,k} = [z̆_ch,1,conv^{n′,k}, z̆_ch,2,conv^{n′,k}] in the QMF domain frequency band k.
[0154] Next, a bandwise FFT synthesis is carried out to transform the convolution result back to the QMF domain, resulting in an intermediate 2-channel K_max-band QMF domain signal with L_FFT time slots ẑ_ch,conv^{n,k} = [ẑ_ch,1,conv^{n,k}, ẑ_ch,2,conv^{n,k}] with 1 ≤ n ≤ L_FFT and 1 ≤ k ≤ K_max.
[0155] For each QMF domain input signal frame with L_n=32 time slots, a convolution result signal frame with L_n=32 time slots is returned. The remaining L_FFT−32 time slots are stored, and an overlap-add processing is carried out in the following frame(s).
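The per-frame bookkeeping of [0153]-[0155] can be illustrated with the sketch below. For clarity, a direct linear convolution stands in for the standard-defined FFT-based fast convolution (both produce identical results for one band); the frame length and filter values are toy examples.

```python
def convolve(x, h):
    """Direct linear convolution of one QMF band; stands in for the FFT-based
    fast convolution, and likewise yields len(x) + len(h) - 1 samples."""
    y = [0j] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add_frames(frames, h, L_n):
    """Convolve one QMF band frame by frame: every input frame of L_n slots
    returns L_n output slots; the remaining len(h) - 1 slots are kept in a
    tail buffer and added into the following frame(s)."""
    tail = [0j] * (len(h) - 1)
    out = []
    for frame in frames:
        y = convolve(frame, h)
        for i in range(len(tail)):   # overlap-add the stored tail
            y[i] += tail[i]
        out.extend(y[:L_n])          # emit exactly L_n slots per frame
        tail = y[L_n:]               # store the remainder for later frames
    return out, tail
```

Because convolution is linear, concatenating the per-frame outputs (plus the final tail) reproduces the convolution of the whole signal.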
[0156] (c) Generation of Late Reverberation
[0157] As a second intermediate signal 826a, 826b, a reverberation signal ẑ_ch,rev^{n,k} = [ẑ_ch,1,rev^{n,k}, ẑ_ch,2,rev^{n,k}] is generated by a frequency domain reverberator module 816a, 816b. The frequency domain reverberator 816a, 816b takes as input:
[0158] a QMF domain stereo downmix 822 of one frame of the input signal,
[0159] a parameter set that contains frequency-dependent reverberation times 808 and energy values 810.
[0160] The frequency domain reverberator 816a, 816b returns a 2-channel QMF domain late reverberation tail.
[0161] The maximum used band number of the frequency-dependent parameter set is calculated depending on the maximum frequency.
[0162] First, a QMF domain stereo downmix 818 of one frame of the input signal ŷ_ch^{n,k} is carried out to form the input of the reverberator by a weighted summation of the input signal channels.
[0163] The weighting gains are contained in the downmix matrix M_DMX. They are real-valued and non-negative, and the downmix matrix is of dimension N_out × N_in. It contains a non-zero value where a channel of the input signal is mapped to one of the two output channels.
[0164] The channels that represent loudspeaker positions on the left hemisphere are mapped to the left output channel and the channels that represent loudspeakers located on the right hemisphere are mapped to the right output channel. The signals of these channels are weighted by a coefficient of 1. The channels that represent loudspeakers in the median plane are mapped to both output channels of the binaural signal. The input signals of these channels are weighted by a coefficient a = 1/√2 ≈ 0.7071.
[0165] In addition, an energy equalization step is performed in the downmix. It adapts the bandwise energy of one downmix channel to be equal to the sum of the bandwise energy of the input signal channels that are contained in this downmix channel. This energy equalization is conducted by a bandwise multiplication with a real-valued coefficient

c_eq,k = √(p_in^k / (p_out^k + ε))

[0166] The factor c_eq,k is limited to an interval of [0.5, 2]. The numerical constant ε is introduced to avoid a division by zero. The downmix is also bandlimited to the frequency f_max; the values in all higher frequency bands are set to zero.
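The downmix of one band with energy equalization ([0163]-[0166]) might look as follows; the placement of ε in the denominator follows the stated purpose of avoiding division by zero, and the helper name is an illustrative assumption.

```python
import math

EPS = 1e-20  # numerical constant epsilon from the symbol table

def downmix_band(channel_slots, gains):
    """Downmix one QMF band of one frame into a single downmix channel and
    equalize its energy to the summed energy of the contributing inputs.

    channel_slots: one list of (complex) slot values per input channel.
    gains: the corresponding row of the downmix matrix M_DMX.
    """
    n_slots = len(channel_slots[0])
    mixed = [sum(g * ch[n] for g, ch in zip(gains, channel_slots))
             for n in range(n_slots)]
    # energy of the contributing input channels vs. energy of the mix
    p_in = sum(abs(v) ** 2 for ch, g in zip(channel_slots, gains) if g > 0
               for v in ch)
    p_out = sum(abs(v) ** 2 for v in mixed)
    c_eq = math.sqrt(p_in / (p_out + EPS))  # epsilon guards the division
    c_eq = min(2.0, max(0.5, c_eq))         # limited to [0.5, 2]
    return [c_eq * v for v in mixed]
```

For two identical in-phase channels the mix doubles the amplitude, so c_eq = √0.5 restores the summed input energy; for cancelling channels c_eq saturates at the upper limit of 2.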
[0167]
[0168] In the frequency domain reverberator, a mono downmix of the stereo input is calculated using an input mixer 900. This is done incoherently, applying a 90° phase shift on the second input channel.
[0169] This mono signal is then fed to a feedback delay loop 902 in each frequency band k, which creates a decaying sequence of impulses. It is followed by parallel FIR decorrelators that distribute the signal energy in a decaying manner into the intervals between the impulses and create incoherence between the output channels. A decaying filter tap density is applied to create the energy decay. The filter tap phase operations are restricted to four options to implement a sparse and multiplier-free decorrelator.
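A heavily simplified, real-valued sketch of one band of such a feedback delay loop is shown below; the FIR decorrelators and the sparse, multiplier-free phase operations of the actual reverberator are omitted, and the choice of the feedback gain from RT60 (60 dB decay over RT60 seconds) is an assumption consistent with the usual comb-filter design.

```python
def feedback_delay_band(x, delay_slots, rt60_s, slot_rate_hz):
    """One band of a feedback delay loop: y[n] = x[n] + g * y[n - d].

    The feedback gain g is chosen so that the resulting impulse train decays
    by 60 dB over rt60_s seconds, with slot_rate_hz QMF slots per second
    (e.g. 48000 / 64 = 750 for a 64-band QMF at 48 kHz)."""
    g = 10.0 ** (-3.0 * delay_slots / (rt60_s * slot_rate_hz))
    y = [0.0] * len(x)
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay_slots] if n >= delay_slots else 0.0)
    return y
```

Fed with a single impulse, the loop produces the decaying sequence of impulses described above, spaced delay_slots apart with amplitudes 1, g, g², and so on.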
[0170] After the calculation of the reverberation, an inter-channel coherence (ICC) correction 904 is included in the reverberator module for every QMF frequency band. In the ICC correction step, frequency-dependent direct gains g_direct and crossmix gains g_cross are used to adapt the ICC.
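The gain derivation is not given in this excerpt; the sketch below assumes the common symmetric crossmix form out_l = g_direct · l + g_cross · r, out_r = g_cross · l + g_direct · r, with power-preserving gains chosen so that two uncorrelated, equal-power inputs attain the target coherence.

```python
import math

def icc_gains(target_icc: float):
    """Direct and crossmix gains that set the coherence of two uncorrelated,
    equal-power channels to target_icc.

    With g_d = cos(theta), g_c = sin(theta) the crossmix is power preserving
    (g_d^2 + g_c^2 = 1) and the resulting coherence is sin(2 * theta)."""
    theta = math.asin(target_icc) / 2.0
    return math.cos(theta), math.sin(theta)

def icc_correct(left, right, target_icc):
    """Apply the symmetric crossmix to one band of the 2-channel signal."""
    g_d, g_c = icc_gains(target_icc)
    out_l = [g_d * l + g_c * r for l, r in zip(left, right)]
    out_r = [g_c * l + g_d * r for l, r in zip(left, right)]
    return out_l, out_r
```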
[0171] The amount of energy and the reverberation times for the different frequency bands are contained in the input parameter set. The values are given at a number of frequency points which are internally mapped to the K=64 QMF frequency bands.
[0172] Two instances of the frequency domain reverberator are used to calculate the final intermediate signal ẑ_ch,rev^{n,k} = [ẑ_ch,1,rev^{n,k}, ẑ_ch,2,rev^{n,k}]. The signal ẑ_ch,1,rev^{n,k} is the first output channel of the first instance of the reverberator, and ẑ_ch,2,rev^{n,k} is the second output channel of the second instance of the reverberator. They are combined to the final reverberation signal frame that has the dimension of 2 channels, 64 bands and 32 time slots.
[0173] The stereo downmix 822 is both times scaled 821a, 821b according to a correlation measure 820 of the input signal frame to ensure the correct scaling of the reverberator output. The scaling factor is defined as a value in the interval [√N_DMX,act, N_DMX,act], linearly depending on a correlation coefficient c_corr between 0 and 1. The correlation coefficients c_corr^{A,B} are computed from the zero-mean QMF domain signals, where σ_ŷA means the standard deviation across one time slot n of channel A, the operator {*} denotes the complex conjugate and ŷ is the zero-mean version of the QMF domain signal y in the actual signal frame.
[0174] c_corr is calculated twice: once for the plurality of channels A, B that are active in the actual signal frame F and are included in the left channel of the stereo downmix, and once for the plurality of channels A, B that are active in the actual signal frame F and are included in the right channel of the stereo downmix. N_DMX,act is the number of input channels that are downmixed to one downmix channel A (the number of matrix elements in the A-th row of the downmix matrix M_DMX that are unequal to zero) and that are active in the current frame.
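A real-valued sketch of the correlation measure and the resulting scaling factor is given below. The patent's computation operates on complex zero-mean QMF signals (with conjugation), so the plain Pearson correlation and the averaging over channel pairs used here are simplifying assumptions; the linear mapping onto [√N_DMX,act, N_DMX,act] follows [0173].

```python
import math

def frame_correlation(channels):
    """Mean absolute pairwise correlation over all channel pairs of one
    downmix group: a real-valued sketch of the c_corr computation using a
    plain zero-mean normalized cross-correlation per pair."""
    def corr(a, b):
        ma = sum(a) / len(a); mb = sum(b) / len(b)
        za = [v - ma for v in a]; zb = [v - mb for v in b]
        num = sum(x * y for x, y in zip(za, zb))
        den = math.sqrt(sum(x * x for x in za) * sum(y * y for y in zb))
        return abs(num / den) if den > 0 else 0.0
    pairs = [(i, j) for i in range(len(channels))
             for j in range(i + 1, len(channels))]
    if not pairs:
        return 1.0  # a single channel is trivially fully correlated
    return sum(corr(channels[i], channels[j]) for i, j in pairs) / len(pairs)

def reverb_scale(c_corr, n_active):
    """Map c_corr in [0, 1] linearly onto [sqrt(N_DMX,act), N_DMX,act]."""
    lo = math.sqrt(n_active)
    return lo + c_corr * (n_active - lo)
```

Fully correlated channels thus scale by N_DMX,act (amplitudes add), while uncorrelated channels scale by √N_DMX,act (energies add).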
[0175] The scaling factors then are c_scale = [c_scale,1, c_scale,2], each obtained by mapping the respective correlation coefficient linearly onto the interval [√N_DMX,act, N_DMX,act]: c_scale = √N_DMX,act + c_corr · (N_DMX,act − √N_DMX,act).
[0176] The scaling factors are smoothed over audio signal frames by a first order low pass filter, resulting in smoothed scaling factors c̃_scale = [c̃_scale,1, c̃_scale,2].
[0177] The scaling factors are initialized in the first audio input data frame by a time-domain correlation analysis with the same means.
[0178] The input of the first reverberator instance is scaled with the scaling factor c̃_scale,1 and the input of the second reverberator instance is scaled with the scaling factor c̃_scale,2.
[0179] (d) Combination of Convolutional Results and Late Reverberation
[0180] Next, the convolutional result 814, ẑ_ch,conv^{n,k} = [ẑ_ch,1,conv^{n,k}, ẑ_ch,2,conv^{n,k}], and the reverberator output 826a, 826b, ẑ_ch,rev^{n,k} = [ẑ_ch,1,rev^{n,k}, ẑ_ch,2,rev^{n,k}], for one QMF domain audio input frame are combined by a mixing process 828 that bandwise adds up the two signals. Note that the upper bands higher than K_max are zero in ẑ_ch,conv^{n,k} because the convolution is only conducted in the bands up to K_max.
[0181] The late reverberation output is delayed by an amount of d = ⌊(L_trans − 20 · 64 + 1)/64 + 0.5⌋ + 1 time slots in the mixing process.
[0182] The delay d takes into account the transition time from early reflections to late reflections in the BRIRs, an initial delay of the reverberator of 20 QMF time slots, and an analysis delay of 0.5 QMF time slots for the QMF analysis of the BRIRs, to ensure the insertion of the late reverberation at a reasonable time slot. The combined signal ẑ_ch^{n,k} at one time slot n is calculated by ẑ_ch^{n,k} = ẑ_ch,conv^{n,k} + ẑ_ch,rev^{n−d,k}.
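The bandwise mixing with the delayed reverberation ([0180]-[0182]) reduces to the following sketch, with the delay d taken as a precomputed parameter and the helper name an illustrative assumption.

```python
def combine_conv_and_reverb(conv_slots, rev_slots, d):
    """Bandwise combination z[n] = z_conv[n] + z_rev[n - d]: the reverberator
    output is delayed by d QMF time slots before being added to the
    convolution result (slots with n - d < 0 receive no reverberation yet)."""
    out = []
    for n in range(len(conv_slots)):
        rev = rev_slots[n - d] if n - d >= 0 else 0.0
        out.append(conv_slots[n] + rev)
    return out
```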
[0183] (e) QMF Synthesis of Binaural QMF Domain Signal
[0184] One 2-channel frame of 32 time slots of the QMF domain output signal ẑ_ch^{n,k} is transformed to a 2-channel time domain signal frame of length L by the QMF synthesis according to ISO/IEC 14496-3:2009, subclause 4.6.18.4.2, yielding the final time domain output signal 830, z̃_ch^v = [z̃_ch,1^v . . . z̃_ch,2^v].
[0185] In accordance with the inventive approach the synthetic or artificial late reverberation is scaled taking into consideration the characteristics of the input signal, thereby improving the quality of the output signal while taking advantage of the reduced computational complexity obtained by the separate processing. Also, as can be seen from the above description, no additional hearing models or target reverberation loudness are necessitated.
[0186] It is noted that the invention is not limited to the above described embodiment. For example, while the above embodiment has been described in combination with the QMF domain, it is noted that also other time-frequency domains may be used, for example the STFT domain. Also, the scaling factor may be calculated in a frequency-dependent manner so that the correlation is not calculated over the entire number of frequency bands, namely i ∈ [1, N], but is calculated in a number of S subsets defined as follows:

i_1 ∈ [1, N_1], i_2 ∈ [N_1+1, N_2], . . . , i_S ∈ [N_{S−1}+1, N]
[0187] Also, smoothing may be applied across the frequency bands or bands may be combined according to a specific rule, for example according to the frequency resolution of the hearing. Smoothing may be adapted to different time constants, for example dependent on the frame size or the preference of the listener.
[0188] The inventive approach may also be applied for different frame sizes, even a frame size of just one time slot in the time-frequency domain is possible.
[0189] In accordance with embodiments, different downmix matrices may be used for the downmix, for example symmetric downmix matrices or asymmetric matrices.
[0190] The correlation measure may be derived from parameters that are transmitted in the audio bitstream, for example from the inter-channel coherence in MPEG Surround or SAOC. Also, in accordance with embodiments it is possible to exclude some values of the matrix from the mean-value calculation, for example erroneously calculated values or the values on the main diagonal (the autocorrelation values), if necessitated.
[0191] The process may also be carried out at the encoder instead of in the binaural renderer at the decoder side, for example when applying a low complexity binaural profile. In this case, some representation of the scaling factors, for example the scaling factors themselves or the correlation measure between 0 and 1 and the like, is determined at the encoder, and these parameters are transmitted in the bitstream from the encoder to the decoder for a fixed downmix matrix.
[0192] Also, while the above described embodiment is described applying the gain following the reverberator 514, it is noted that in accordance with other embodiments the gain can also be applied before the reverberator 514 or inside the reverberator, for example by modifying the gains inside the reverberator 514. This is advantageous as fewer computations may be necessitated.
[0193] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
[0194] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a digital storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
[0195] Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
[0196] Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.
[0197] Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
[0198] In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
[0199] A further embodiment of the inventive method is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
[0200] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example, via the internet.
[0201] A further embodiment comprises a processing means, for example, a computer or a programmable logic device, configured to, or programmed to, perform one of the methods described herein.
[0202] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
[0203] A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
[0204] In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
[0205] While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
LITERATURE
[0206] [1] M. R. Schroeder, "Digital Simulation of Sound Transmission in Reverberant Spaces", The Journal of the Acoustical Society of America, Vol. 47, pp. 424-431 (1970), and enhanced in J. A. Moorer, "About This Reverberation Business", Computer Music Journal, Vol. 3, No. 2, pp. 13-28, MIT Press (1979).
[0207] [2] Uhle, Christian; Paulus, Jouni; Herre, Jürgen: "Predicting the Perceived Level of Late Reverberation Using Computational Models of Loudness", Proceedings, 17th International Conference on Digital Signal Processing (DSP), Jul. 6-8, 2011, Corfu, Greece.
[0208] [3] Czyzewski, Andrzej: "A Method of Artificial Reverberation Quality Testing", J. Audio Eng. Soc., Vol. 38, No. 3, 1990.