Apparatus and Method for Estimating an Inter-Channel Time Difference
20220310103 · 2022-09-29
Inventors
- Stefan Bayer (Nuernberg, DE)
- Eleni FOTOPOULOU (Nuernberg, DE)
- Markus MULTRUS (Nuernberg, DE)
- Guillaume Fuchs (Bubenreuth, DE)
- Emmanuel Ravelli (Erlangen, DE)
- Markus SCHNELL (Nuernberg, DE)
- Stefan Doehla (Erlangen, DE)
- Wolfgang Jaegers (Erlangen, DE)
- Martin DIETZ (Nuernberg, DE)
- Goran MARKOVIC (Nuernberg, DE)
Cpc classification
H04S2400/03
ELECTRICITY
H04S2420/03
ELECTRICITY
G10L19/02
PHYSICS
G10L25/18
PHYSICS
G10L19/008
PHYSICS
H04S3/008
ELECTRICITY
H04S2400/01
ELECTRICITY
International classification
G10L19/008
PHYSICS
G10L19/02
PHYSICS
G10L19/022
PHYSICS
G10L25/18
PHYSICS
Abstract
An apparatus for estimating an inter-channel time difference between a first channel signal and a second channel signal, includes: a calculator for calculating a cross-correlation spectrum for a time block from the first channel signal in the time block and the second channel signal in the time block; a spectral characteristic estimator for estimating a characteristic of a spectrum of the first channel signal or the second channel signal for the time block; a smoothing filter for smoothing the cross-correlation spectrum over time using the spectral characteristic to obtain a smoothed cross-correlation spectrum; and a processor for processing the smoothed cross-correlation spectrum to obtain the inter-channel time difference.
Claims
1. An apparatus for estimating an inter-channel time difference between a first channel signal and a second channel signal, comprising: a calculator for calculating a cross-correlation spectrum for a time block from the first channel signal in the time block and the second channel signal in the time block; a spectral characteristic estimator for estimating a characteristic of a spectrum of the first channel signal or the second channel signal for the time block; a smoothing filter for smoothing the cross-correlation spectrum over time using the spectral characteristic to acquire a smoothed cross-correlation spectrum; and a processor for processing the smoothed cross-correlation spectrum to acquire the inter-channel time difference.
2. The apparatus of claim 1, wherein the processor is configured to normalize the smoothed cross-correlation spectrum using a magnitude of the smoothed cross-correlation spectrum.
3. The apparatus of claim 1, wherein the processor is configured to calculate a time-domain representation of the smoothed cross-correlation spectrum or a normalized smoothed cross-correlation spectrum; and to analyze the time-domain representation to determine the inter-channel time difference.
4. The apparatus of claim 1, wherein the processor is configured to low-pass filter the time-domain representation and to further process a result of the low-pass filtering.
5. The apparatus of claim 1, wherein the processor is configured to perform the inter-channel time difference determination by performing a peak searching or peak picking operation within a time-domain representation determined from the smoothed cross-correlation spectrum.
6. The apparatus of claim 1, wherein the spectral characteristic estimator is configured to determine, as the spectral characteristic, a noisiness or a tonality of the spectrum; and wherein the smoothing filter is configured to apply a stronger smoothing over time with a first smoothing degree in case of a first less noisy characteristic or a first more tonal characteristic, or to apply a weaker smoothing over time with a second smoothing degree in case of a second more noisy characteristic or a second less tonal characteristic, wherein the first smoothing degree is greater than the second smoothing degree, and wherein the first noisy characteristic is less noisy than the second noisy characteristic, or the first tonal characteristic is more tonal than the second tonal characteristic.
7. The apparatus of claim 1, wherein the spectral characteristics estimator is configured to calculate, as the characteristic, a first spectral flatness measure of a spectrum of the first channel signal and a second spectral flatness measure of a second spectrum of the second channel signal, and to determine the characteristic of the spectrum from the first and the second spectral flatness measure by selecting a maximum value, by determining a weighted average or an unweighted average between the spectral flatness measures, or by selecting a minimum value.
8. The apparatus of claim 1, wherein the smoothing filter is configured to calculate a smoothed cross-correlation spectrum value for a frequency by a weighted combination of the cross-correlation spectrum value for the frequency from the time block and a cross-correlation spectral value for the frequency from at least one past time block, wherein weighting factors for the weighted combination are determined by the characteristic of the spectrum.
9. The apparatus of claim 1, wherein the processor is configured to determine a valid range and an invalid range within a time-domain representation derived from the smoothed cross-correlation spectrum, wherein at least one maximum peak within the invalid range is detected and compared to a maximum peak within the valid range, wherein the inter-channel time difference is only determined, when the maximum peak within the valid range is greater than at least one maximum peak within the invalid range.
10. The apparatus of claim 1, wherein the processor is configured to perform a peak search operation within a time-domain representation derived from the smoothed cross-correlation spectrum, to determine a variable threshold from the time-domain representation; and to compare a peak to the variable threshold, wherein the inter-channel time difference is determined as a time lag associated with a peak being in a predetermined relation to the variable threshold.
11. The apparatus of claim 10, wherein the processor is configured to determine the variable threshold as a value being equal to an integer multiple of a value among the largest 10% of values of the time-domain representation.
12. A method for estimating an inter-channel time difference between a first channel signal and a second channel signal, comprising: calculating a cross-correlation spectrum for a time block from the first channel signal in the time block and the second channel signal in the time block; estimating a characteristic of a spectrum of the first channel signal or the second channel signal for the time block; smoothing the cross-correlation spectrum over time using the spectral characteristic to acquire a smoothed cross-correlation spectrum; and processing the smoothed cross-correlation spectrum to acquire the inter-channel time difference.
13. A non-transitory digital storage medium having a computer program stored thereon to perform the method for estimating an inter-channel time difference between a first channel signal and a second channel signal, comprising: calculating a cross-correlation spectrum for a time block from the first channel signal in the time block and the second channel signal in the time block; estimating a characteristic of a spectrum of the first channel signal or the second channel signal for the time block; smoothing the cross-correlation spectrum over time using the spectral characteristic to acquire a smoothed cross-correlation spectrum; and processing the smoothed cross-correlation spectrum to acquire the inter-channel time difference, when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0049] Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
DETAILED DESCRIPTION OF THE INVENTION
[0075]
[0076] Furthermore, the time-domain representations of the left and the right channel signals are input into a calculator 1020 for calculating a cross-correlation spectrum for a time block from the first channel signal in the time block and the second channel signal in the time block. Furthermore, the apparatus comprises a spectral characteristic estimator 1010 for estimating a characteristic of a spectrum of the first channel signal or the second channel signal for the time block. The apparatus further comprises a smoothing filter 1030 for smoothing the cross-correlation spectrum over time using the spectral characteristic to obtain a smoothed cross-correlation spectrum. The apparatus further comprises a processor 1040 for processing the smoothed cross-correlation spectrum to obtain the inter-channel time difference.
[0077] Particularly, the functionalities of the spectral characteristic estimator are also reflected by
[0078] Furthermore, the functionalities of the cross-correlation spectrum calculator 1020 are also reflected by item 452 in
[0079] Correspondingly, the functionalities of the smoothing filter 1030 are also reflected by item 453 in the context of
[0080] Advantageously, the spectral characteristic estimation calculates a noisiness or a tonality of the spectrum where an advantageous implementation is the calculation of a spectral flatness measure being close to 0 in the case of tonal or non-noisy signals and being close to 1 in the case of noisy or noise-like signals.
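As a minimal sketch of such a measure, the following Python snippet computes the spectral flatness as the ratio of the geometric mean to the arithmetic mean of a magnitude spectrum. The function name `spectral_flatness`, the `eps` guard and the two toy spectra are assumptions made for illustration only and are not part of the described embodiment.

```python
import math

def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean divided by arithmetic mean of a magnitude spectrum."""
    n = len(magnitudes)
    # The geometric mean is evaluated in the log domain so that a long
    # product of small values does not underflow.
    log_sum = sum(math.log(m + eps) for m in magnitudes)
    geometric_mean = math.exp(log_sum / n)
    arithmetic_mean = sum(magnitudes) / n
    return geometric_mean / (arithmetic_mean + eps)

# Hypothetical toy spectra: a flat (noise-like) one and one with a single dominant line.
flat = spectral_flatness([1.0] * 64)
tonal = spectral_flatness([1e-6] * 63 + [1.0])
```

With these inputs, `flat` comes out close to 1 and `tonal` close to 0, which matches the behaviour described for noise-like and tonal signals.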
[0081] Particularly, the smoothing filter is then configured to apply a stronger smoothing with a first smoothing degree over time in case of a first less noisy characteristic or a first more tonal characteristic, or to apply a weaker smoothing with a second smoothing degree over time in case of a second more noisy or second less tonal characteristic.
[0082] Particularly, the first smoothing degree is greater than the second smoothing degree, where the first noisy characteristic is less noisy than the second noisy characteristic or the first tonal characteristic is more tonal than the second tonal characteristic. The advantageous implementation of the characteristic is the spectral flatness measure.
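The relation between the characteristic and the smoothing degree can be sketched as a first-order recursion per frequency bin. The weighting below, in which the flatness value `sfm` weights the current block and `1 - sfm` weights the smoothed history, follows the recursion used in the pseudo-code of this description. The function name and the two-bin toy data are assumptions for illustration.

```python
def smooth_cross_spectrum(previous, current, sfm):
    """One recursive update per frequency bin; sfm in [0, 1] weights the current block."""
    return [(1.0 - sfm) * p + sfm * c for p, c in zip(previous, current)]

previous = [1 + 0j, 0 + 1j]   # smoothed cross-correlation spectrum from past blocks
current = [0 + 1j, 1 + 0j]    # cross-correlation spectrum of the current block

# Tonal input (flatness near 0): the history dominates, i.e. strong smoothing.
tonal_update = smooth_cross_spectrum(previous, current, 0.1)
# Noise-like input (flatness near 1): the current block dominates, i.e. weak smoothing.
noisy_update = smooth_cross_spectrum(previous, current, 0.9)
```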
[0083] Furthermore, as illustrated in
[0084] As illustrated in
[0085] As illustrated in
[0086] As illustrated in
[0087] Then, as illustrated in step 1034c, the lowest value among the highest 10% or 5% of the values is multiplied by a number such as 3 in order to obtain the variable threshold.
[0088] As stated, advantageously, the highest 10% or 5% of the values are determined, but it can also be useful to determine the lowest value among the highest 50% of the values and to use a higher multiplication number such as 10. Naturally, an even smaller portion such as the highest 3% of the values can be determined, and the lowest value among these highest 3% of the values is then multiplied by a number which is, for example, equal to 2.5 or 2, i.e., lower than 3. Thus, different combinations of numbers and percentages can be used in the embodiment illustrated in
[0089] Apart from the percentages, the numbers can also vary, and numbers greater than 1.5 are advantageous.
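The percentile-based threshold can be sketched as follows. This is an illustrative reading of the text, not a reference implementation: the function name, the rounding of the cut index and the toy data are assumptions, while the default fraction of 10% and the factor 3 are taken from the examples above.

```python
def variable_threshold(values, top_fraction=0.10, factor=3.0):
    """Factor times the lowest magnitude among the highest `top_fraction` of values."""
    magnitudes = sorted(abs(v) for v in values)
    # Index of the lowest value that still belongs to the top fraction.
    cut = int(round((1.0 - top_fraction) * len(magnitudes)))
    cut = min(cut, len(magnitudes) - 1)
    return factor * magnitudes[cut]

# Hypothetical toy magnitudes 0.00, 0.01, ..., 0.99.
correlation = [0.01 * k for k in range(100)]
threshold = variable_threshold(correlation)
```

For the toy data the top 10% are the values 0.90 to 0.99, so the threshold is three times 0.90.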
[0090] In a further embodiment illustrated in
[0091] Here, about 16 subblocks are used for the valid range so that each subblock has a time lag span of 20. However, the number of subblocks can be greater than this value or lower and advantageously greater than 3 and lower than 50.
[0092] In step 1102 of
[0093] Then, in step 1105, the multiplication value a determined in block 1104 is multiplied by the average threshold in order to obtain the variable threshold that is then used in the comparison operation in block 1106. For the comparison operation, once again the time-domain representation input into block 1101 can be used or the already determined peaks in each subblock as outlined in block 1102 can be used.
[0094] Subsequently, further embodiments regarding the evaluation and detection of a peak within the time-domain cross-correlation function are outlined.
[0095] The evaluation and detection of a peak within the time-domain cross-correlation function resulting from the generalized cross-correlation (GCC-PHAT) method in order to estimate the Inter-channel Time Difference (ITD) is not always straightforward due to different input scenarios. Clean speech input can result in a low-deviation cross-correlation function with a strong peak, while speech in a noisy reverberant environment can produce a vector with high deviation and peaks with lower but still outstanding magnitude indicating the existence of an ITD. A peak detection algorithm that is adaptive and flexible enough to accommodate different input scenarios is described.
[0096] Due to delay constraints, the overall system can handle channel time alignment up to a certain limit, namely ITD_MAX. The proposed algorithm is designed to detect whether a valid ITD exists in the following cases:
[0097] Valid ITD due to outstanding peak. An outstanding peak within the [-ITD_MAX, ITD_MAX] bounds of the cross-correlation function is present.
[0098] No correlation. When there is no correlation between the two channels, there is no outstanding peak. A threshold should be defined, above which the peak is strong enough to be considered as a valid ITD value. Otherwise, no ITD handling should be signaled, meaning ITD is set to zero and no time alignment is performed.
[0099] Out of bounds ITD. Strong peaks of the cross-correlation function outside the region [-ITD_MAX, ITD_MAX] should be evaluated in order to determine whether ITDs that lie outside the handling capacity of the system exist. In this case no ITD handling should be signaled and thus no time alignment is performed.
[0100] To determine whether the magnitude of a peak is high enough to be considered as a time difference value, a suitable threshold needs to be defined. For different input scenarios, the cross-correlation function output varies depending on different parameters, e.g. the environment (noise, reverberation, etc.) and the microphone setup (AB, M/S, etc.). Therefore, it is essential to define the threshold adaptively.
[0101] In the proposed algorithm, the threshold is defined by first calculating the mean of a rough computation of the envelope of the magnitude of the cross-correlation function within the [-ITD_MAX, ITD_MAX] region (
[0102] The algorithm is described step by step below.
[0103] The output of the inverse DFT of the GCC-PHAT, which represents the time-domain cross-correlation, is rearranged from negative to positive time lags (
[0104] The cross-correlation vector is divided into three main areas: the area of interest, namely [-ITD_MAX, ITD_MAX], and the two areas outside the ITD_MAX bounds, namely time lags smaller than -ITD_MAX (max_low) and larger than ITD_MAX (max_high). The maximum peaks of the “out of bound” areas are detected and saved to be compared to the maximum peak detected in the area of interest.
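The rearrangement and the division into three areas can be sketched as below. The sketch assumes an even-length inverse-DFT output in which index k holds lag k for the first half and lag k - n for the second half; the exact indexing of the embodiment may differ. The function name `split_by_lag` and the toy vector are hypothetical.

```python
def split_by_lag(circular_corr, itd_max):
    """Reorder an inverse-DFT output to ascending lags and split it into three areas."""
    n = len(circular_corr)
    half = n // 2
    # After reordering, position 0 holds lag -half and position `half` holds lag 0.
    ordered = circular_corr[half:] + circular_corr[:half]
    centre = half
    below = ordered[:centre - itd_max]                           # lags < -ITD_MAX
    interest = ordered[centre - itd_max:centre + itd_max + 1]    # -ITD_MAX .. ITD_MAX
    above = ordered[centre + itd_max + 1:]                       # lags > ITD_MAX
    return below, interest, above

corr = [0.0] * 16
corr[3] = 1.0        # peak at lag +3 (inside the area of interest)
corr[16 - 6] = 0.5   # peak at lag -6 (outside the bounds)
below, interest, above = split_by_lag(corr, itd_max=4)
max_low = max((abs(v) for v in below), default=0.0)
max_high = max((abs(v) for v in above), default=0.0)
```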
[0105] In order to determine whether a valid ITD is present, the sub-vector area [-ITD_MAX, ITD_MAX] of the cross-correlation function is considered. The sub-vector is divided into N sub-blocks (
[0106] For each sub-block, the maximum peak magnitude peak_sub and the corresponding time lag position index_sub are found and saved.
[0107] The maximum of the local maxima peak_max is determined and will be compared to the threshold to determine the existence of a valid ITD value.
[0108] The maximum value peak_max is compared to max_low and max_high. If peak_max is lower than either of the two, then no ITD handling is signaled and no time alignment is performed. Because of the ITD handling limit of the system, the magnitudes of the out of bound peaks do not need to be evaluated.
[0109] The mean of the magnitudes of the N sub-block peaks is calculated:
$\mathrm{peak\_mean} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{peak\_sub}(n)$
[0110] The threshold thres is then computed by weighting peak_mean with an SNR-dependent weighting factor a_w:
$\mathrm{thres} = a_w \cdot \mathrm{peak\_mean}$
[0111] In cases where SNR << SNR_threshold and |thres - peak_max| < ε, the peak magnitude is also compared to a slightly more relaxed threshold (a_w = a_lowest), in order to avoid rejecting an outstanding peak with high neighboring peaks. The weighting factors could be, for example, a_high = 3, a_low = 2.5 and a_lowest = 2, while SNR_threshold could be, for example, 20 dB and the bound ε = 0.05.
[0112] Advantageous ranges are 2.5 to 5 for a_high; 1.5 to 4 for a_low; 1.0 to 3 for a_lowest; 10 to 30 dB for SNR_threshold; and 0.01 to 0.5 for ε, where a_high is greater than a_low, which in turn is greater than a_lowest.
[0113] If peak_max > thres, the corresponding time lag is returned as the estimated ITD; otherwise no ITD handling is signaled (ITD = 0).
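The steps above can be sketched in a few lines of Python. Several points are assumptions of this sketch rather than statements of the text: the function name `detect_itd`, the default of 8 sub-blocks for the short toy vector, and in particular the mapping of a_high to SNR values above SNR_threshold and of a_low to SNR values at or below it. The additional relaxed comparison with a_lowest is omitted for brevity.

```python
def detect_itd(interest, itd_max, max_low, max_high, snr_db,
               n_sub=8, a_high=3.0, a_low=2.5, snr_threshold_db=20.0):
    """Return the lag of a valid peak in `interest` (lags -itd_max..itd_max), else 0."""
    magnitudes = [abs(v) for v in interest]
    size = -(-len(magnitudes) // n_sub)                 # ceiling division
    blocks = [magnitudes[i:i + size] for i in range(0, len(magnitudes), size)]
    peak_sub = [max(block) for block in blocks]         # local maximum per sub-block
    peak_max = max(peak_sub)
    if peak_max < max_low or peak_max < max_high:
        return 0                                        # stronger peak outside the bounds
    peak_mean = sum(peak_sub) / len(peak_sub)
    # Assumption: larger factor at high SNR, smaller factor at low SNR.
    a_w = a_high if snr_db > snr_threshold_db else a_low
    thres = a_w * peak_mean
    if peak_max > thres:
        return magnitudes.index(peak_max) - itd_max     # position index -> signed lag
    return 0

itd_max = 16
clean = [0.01] * (2 * itd_max + 1)    # hypothetical low-deviation correlation
clean[itd_max + 5] = 1.0              # outstanding peak at lag +5
itd = detect_itd(clean, itd_max, max_low=0.02, max_high=0.02, snr_db=30.0)
```

In the toy example the single outstanding peak exceeds three times the mean of the sub-block peaks, so the lag 5 is returned. A stronger out-of-bound peak or a flat correlation vector yields 0.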
[0114] Further embodiments are described later on with respect to
[0115] Subsequently, an advantageous implementation of the present invention within block 1050 of
[0116] However, as stated and as illustrated in
[0117]
[0118] Advantageously, the signal aligner is configured to align the channels from the multi-channel signal using the broadband alignment parameter, before the parameter determiner 100 actually calculates the narrowband parameters. Therefore, in this embodiment, the signal aligner 200 sends the broadband aligned channels back to the parameter determiner 100 via a connection line 15. Then, the parameter determiner 100 determines the plurality of narrowband alignment parameters from a multi-channel signal that has already been aligned with respect to the broadband characteristic. In other embodiments, however, the parameters are determined without this specific sequence of procedures.
[0119]
[0120]
[0121] Specifically, the multi-channel encoder further comprises a time-spectrum converter 150 for converting a time domain multi-channel signal into a spectral representation of the at least two channels within the frequency domain.
[0122] Furthermore, as illustrated at 152, the parameter determiner, the signal aligner and the signal processor illustrated at 100, 200 and 300 in
[0123] Furthermore, the multi-channel encoder and, specifically, the signal processor further comprises a spectrum-time converter 154 for generating a time domain representation of at least the mid-signal.
[0124] Advantageously, the spectrum time converter additionally converts a spectral representation of the side signal also determined by the procedures represented by block 152 into a time domain representation, and the signal encoder 400 of
[0125] Advantageously, the time-spectrum converter 150 of
[0126] In step 156, each channel is windowed using the analysis window with overlap ranges. Specifically, each channel is windowed using the analysis window in such a way that a first block of the channel is obtained. Subsequently, a second block of the same channel is obtained that has a certain overlap range with the first block and so on, such that subsequent to, for example, five windowing operations, five blocks of windowed samples of each channel are available that are then individually transformed into a spectral representation as illustrated at 157 in
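The block-wise windowing with overlap can be sketched as follows. The sine window, the block length and the hop size are assumptions chosen only to make the example concrete; the text does not specify the window shape here, and the function name `windowed_blocks` is hypothetical.

```python
import math

def windowed_blocks(channel, block_len, hop):
    """Cut `channel` into overlapping blocks and apply a symmetric window to each."""
    # Assumed window shape: a sine window, used here only as a placeholder.
    window = [math.sin(math.pi * (i + 0.5) / block_len) for i in range(block_len)]
    blocks = []
    for start in range(0, len(channel) - block_len + 1, hop):
        segment = channel[start:start + block_len]
        blocks.append([w * x for w, x in zip(window, segment)])
    return blocks

# 24 samples, blocks of 8 samples, hop of 4 samples: consecutive blocks overlap by 4.
blocks = windowed_blocks([1.0] * 24, block_len=8, hop=4)
```

With these numbers, five windowing operations produce five blocks, each of which would then be transformed separately.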
[0127] In step 158, which is performed by the parameter determiner 100 of
[0128]
[0129] Specifically, the operations of steps 304 and 305 result in a kind of cross-fading from one block of the mid-signal or the side signal to the next block of the mid-signal or the side signal, so that even when parameter changes occur, such as changes of the inter-channel time difference parameter or the inter-channel phase difference parameter, these will nevertheless not be audible in the time domain mid/side signals obtained by step 305 in
[0130] The new low-delay stereo coding is a joint Mid/Side (M/S) stereo coding exploiting some spatial cues, where the Mid-channel is coded by a primary mono core coder, and the Side-channel is coded in a secondary core coder. The encoder and decoder principles are depicted in
[0131] The stereo processing is performed mainly in Frequency Domain (FD). Optionally some stereo processing can be performed in Time Domain (TD) before the frequency analysis. It is the case for the ITD computation, which can be computed and applied before the frequency analysis for aligning the channels in time before pursuing the stereo analysis and processing. Alternatively, ITD processing can be done directly in frequency domain. Since usual speech coders like ACELP do not contain any internal time-frequency decomposition, the stereo coding adds an extra complex modulated filter-bank by means of an analysis and synthesis filter-bank before the core encoder and another stage of analysis-synthesis filter-bank after the core decoder. In the advantageous embodiment, an oversampled DFT with a low overlapping region is employed. However, in other embodiments, any complex valued time-frequency decomposition with similar temporal resolution can be used.
[0132] The stereo processing consists of computing the spatial cues: inter-channel Time Difference (ITD), the inter-channel Phase Differences (IPDs) and inter-channel Level Differences (ILDs). ITD and IPDs are used on the input stereo signal for aligning the two channels L and R in time and in phase. ITD is computed in broadband or in time domain while IPDs and ILDs are computed for each or a part of the parameter bands, corresponding to a non-uniform decomposition of the frequency space. Once the two channels are aligned a joint M/S stereo is applied, where the Side signal is then further predicted from the Mid signal. The prediction gain is derived from the ILDs.
[0133] The Mid signal is further coded by a primary core coder. In the advantageous embodiment, the primary core coder is the 3GPP EVS standard, or a coding derived from it which can switch between a speech coding mode, ACELP, and a music mode based on an MDCT transformation. Advantageously, ACELP and the MDCT-based coder are supported by Time Domain BandWidth Extension (TD-BWE) and/or Intelligent Gap Filling (IGF) modules, respectively. The Side signal is first predicted by the Mid channel using prediction gains derived from ILDs. The residual can be further predicted by a delayed version of the Mid signal or directly coded by a secondary core coder, performed in the advantageous embodiment in the MDCT domain. The stereo processing at the encoder can be summarized by
[0134]
[0135] In particular, the signal is received by an input interface 600. Connected to the input interface 600 are a signal decoder 700 and a signal de-aligner 900. Furthermore, a signal processor 800 is connected to the signal decoder 700 on the one hand and to the signal de-aligner 900 on the other hand.
[0136] In particular, the encoded multi-channel signal comprises an encoded mid-signal, an encoded side signal, information on the broadband alignment parameter and information on the plurality of narrowband parameters. Thus, the encoded multi-channel signal on line 50 can be exactly the same signal as output by the output interface 500 of
[0137] However, importantly, it is to be noted here that, in contrast to what is illustrated in
[0138] Thus, the information on the alignment parameters can be the alignment parameters as used by the signal aligner 200 in
[0139] The input interface 600 of
[0140] The signal decoder is configured for decoding the encoded mid-signal and for decoding the encoded side signal to obtain a decoded mid-signal on line 701 and a decoded side signal on line 702. These signals are used by the signal processor 800 for calculating a decoded first channel signal or decoded left signal and for calculating a decoded second channel signal or decoded right channel signal from the decoded mid-signal and the decoded side signal, and the decoded first channel and the decoded second channel are output on lines 801 and 802, respectively. The signal de-aligner 900 is configured for de-aligning the decoded first channel on line 801 and the decoded second channel on line 802 using the information on the broadband alignment parameter and additionally using the information on the plurality of narrowband alignment parameters to obtain a decoded multi-channel signal, i.e., a decoded signal having at least two decoded and de-aligned channels on lines 901 and 902.
[0141]
[0142] In step 914, any further processing is performed that comprises using a windowing or any overlap-add operation or, generally, any cross-fade operation in order to obtain, at 915a or 915b, an artifact-reduced or artifact-free decoded signal, i.e., decoded channels that do not have any artifacts although there have been, typically, time-varying de-alignment parameters for the broadband on the one hand and for the plurality of narrowbands on the other hand.
[0143]
[0144] In particular, the signal processor 800 from
[0145] The signal processor furthermore comprises a mid/side to left/right converter 820 in order to calculate from a mid-signal M and a side signal S a left signal L and a right signal R.
[0146] However, importantly, in order to calculate L and R by the mid/side-left/right conversion in block 820, the side signal S does not necessarily have to be used. Instead, as discussed later on, the left/right signals are initially calculated only using a gain parameter derived from an inter-channel level difference parameter ILD. Generally, the prediction gain can also be considered to be a form of an ILD. The gain can be derived from the ILD but can also be directly computed. It is advantageous to no longer compute the ILD, but to compute the prediction gain directly and to transmit and use the prediction gain in the decoder rather than the ILD parameter.
[0147] Therefore, in this implementation, the side signal S is only used in the channel updater 830 that operates in order to provide a better left/right signal using the transmitted side signal S as illustrated by bypass line 821.
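The two-stage reconstruction can be illustrated with a short sketch. It assumes the common convention Mid = (L + R) / 2 and Side = (L - R) / 2, which this passage does not state explicitly, and it models the initial conversion by replacing the side signal with its prediction from the mid-signal. The function name and the sample values are hypothetical.

```python
def mid_side_to_left_right(mid, gain, side=None):
    """Rebuild L/R per bin; uses the predicted side gain*Mid unless `side` is given."""
    left, right = [], []
    for k, m in enumerate(mid):
        s = gain * m if side is None else side[k]
        left.append(m + s)    # assumes Mid = (L + R) / 2
        right.append(m - s)   # assumes Side = (L - R) / 2
    return left, right

# Initial estimate from the gain only, as in the converter stage.
left0, right0 = mid_side_to_left_right([1.0, 2.0], gain=0.5)
# Refined estimate once a decoded side signal is available, as in the updater stage.
left1, right1 = mid_side_to_left_right([1.0, 2.0], gain=0.5, side=[0.25, 0.0])
```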
[0148] Therefore, the converter 820 operates using a level parameter obtained via a level parameter input 822 and without actually using the side signal S, but the channel updater 830 then operates using the side signal S provided via bypass line 821 and, depending on the specific implementation, using a stereo filling parameter received via line 831. The signal de-aligner 900 then comprises a phase de-aligner and energy scaler 910. The energy scaling is controlled by a scaling factor derived by a scaling factor calculator 940. The scaling factor calculator 940 is fed by the output of the channel updater 830. Based on the narrowband alignment parameters received via input 911, the phase de-alignment is performed and, in block 920, based on the broadband alignment parameter received via line 921, the time de-alignment is performed. Finally, a spectrum-time conversion 930 is performed in order to obtain the decoded signal.
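One common way to realize a time shift in the DFT domain is to multiply each bin by a linear phase term. The sketch below shows this generic technique only; the text does not state that block 920 is implemented this way, and the function name and the sign convention are assumptions.

```python
import cmath

def time_shift_spectrum(spectrum, delay):
    """Circularly delay a block by `delay` samples via a linear phase term."""
    n = len(spectrum)
    shifted = []
    for k, value in enumerate(spectrum):
        # Signed bin index, so that the upper half of the spectrum is treated
        # as negative frequencies.
        f = k if k <= n // 2 else k - n
        shifted.append(value * cmath.exp(-2j * cmath.pi * f * delay / n))
    return shifted

# The DFT of a unit impulse at sample 0 is all ones; delaying it by 2 samples
# must give the DFT of a unit impulse at sample 2.
shifted = time_shift_spectrum([1 + 0j] * 8, delay=2)
```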
[0149]
[0150] Specifically, the narrowband de-aligned channels are input into the broadband de-alignment functionality corresponding to block 920 of
[0151] When
[0152] Furthermore, the DFT operations in blocks 810 correspond to element 810 in
[0153] Subsequently,
[0154] Additionally, the spectrum is also divided into different parameter bands. Each parameter band has at least one and advantageously more than one spectral line. Additionally, the parameter bands increase in width from lower to higher frequencies. Typically, the broadband alignment parameter is a single broadband alignment parameter for the whole spectrum, i.e., for a spectrum comprising all the bands 1 to 6 in the exemplary embodiment in
[0155] Furthermore, the plurality of narrowband alignment parameters are provided so that there is a single alignment parameter for each parameter band. This means that the alignment parameter for a band applies to all the spectral values within the corresponding band.
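The statement that one parameter applies to all spectral values of its band can be sketched as a simple expansion from bands to bins. The band borders and parameter values below are hypothetical and only illustrate bands that widen towards higher frequencies.

```python
def expand_to_bins(band_borders, band_values):
    """Repeat each band's parameter for every spectral line inside that band."""
    per_bin = []
    for b, value in enumerate(band_values):
        width = band_borders[b + 1] - band_borders[b]
        per_bin.extend([value] * width)
    return per_bin

borders = [0, 1, 3, 6, 10]    # hypothetical borders: band widths 1, 2, 3 and 4 lines
ipd_per_bin = expand_to_bins(borders, [0.1, 0.2, 0.3, 0.4])
```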
[0156] Furthermore, in addition to the narrowband alignment parameters, level parameters are also provided for each parameter band.
[0157] In contrast to the level parameters that are provided for each and every parameter band from band 1 to band 6, it is advantageous to provide the plurality of narrowband alignment parameters only for a limited number of lower bands such as bands 1, 2, 3 and 4.
[0158] Additionally, stereo filling parameters are provided for a certain number of bands excluding the lower bands such as, in the exemplary embodiment, for bands 4, 5 and 6, while there are side signal spectral values for the lower parameter bands 1, 2 and 3 and, consequently, no stereo filling parameters exist for these lower bands, where waveform matching is obtained using either the side signal itself or a prediction residual signal representing the side signal.
[0159] As already stated, there exist more spectral lines in higher bands such as, in the embodiment in
[0160] Nevertheless,
[0161] As illustrated, the level parameter ILD is provided for each of 12 bands and is quantized to a quantization accuracy represented by five bits per band.
[0162] Furthermore, the narrowband alignment parameters IPD are only provided for the lower bands up to a border frequency of 2.5 kHz. Additionally, the inter-channel time difference or broadband alignment parameter is only provided as a single parameter for the whole spectrum but with a very high quantization accuracy represented by eight bits for the whole band.
[0163] Furthermore, quite roughly quantized stereo filling parameters are provided represented by three bits per band and not for the lower bands below 1 kHz since, for the lower bands, actually encoded side signal or side signal residual spectral values are included.
[0164] Subsequently, an advantageous processing on the encoder side is summarized with respect to
[0165] ILD parameters, i.e., level parameters and phase parameters (IPD parameters), are calculated for each parameter band on the shifted L and R representations as illustrated at step 171. This step corresponds to step 160 of
[0166] In the final step 175, the time domain mid-signal m and, optionally, the residual signal are coded. This procedure corresponds to what is performed by the signal encoder 400 in
[0167] At the decoder in the inverse stereo processing, the Side signal is generated in the DFT domain and is first predicted from the Mid signal as:
$\widehat{\mathrm{Side}} = g \cdot \mathrm{Mid}$
[0168] where g is a gain computed for each parameter band and is a function of the transmitted Inter-channel Level Differences (ILDs).
[0169] The residual of the prediction, $\mathrm{Side} - g \cdot \mathrm{Mid}$, can then be refined in two different ways:
[0170] By a secondary coding of the residual signal:
$\widehat{\mathrm{Side}} = g \cdot \mathrm{Mid} + g_{cod} \cdot (\mathrm{Side} - g \cdot \mathrm{Mid})$
[0171] where g_cod is a global gain transmitted for the whole spectrum.
[0172] By a residual prediction, known as stereo filling, predicting the residual side spectrum with the previous decoded Mid signal spectrum from the previous DFT frame:
$\widehat{\mathrm{Side}} = g \cdot \mathrm{Mid} + g_{pred} \cdot \mathrm{Mid} \cdot z^{-1}$
[0173] where g_pred is a predictive gain transmitted per parameter band.
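The prediction and its two refinements can be sketched for the bins of one parameter band as follows. The function name, the argument names and the numeric values are assumptions for illustration; `residual` stands for the decoded residual values and `previous_mid` for the decoded Mid spectrum of the previous DFT frame.

```python
def reconstruct_side(mid, g, residual=None, g_cod=1.0, previous_mid=None, g_pred=0.0):
    """Per-bin side estimate for one parameter band."""
    side = [g * m for m in mid]                       # prediction from the Mid signal
    if residual is not None:                          # refinement by residual coding
        side = [s + g_cod * r for s, r in zip(side, residual)]
    elif previous_mid is not None:                    # refinement by stereo filling
        side = [s + g_pred * p for s, p in zip(side, previous_mid)]
    return side

predicted = reconstruct_side([2.0], g=0.5)
with_residual = reconstruct_side([2.0], g=0.5, residual=[0.25], g_cod=2.0)
with_filling = reconstruct_side([2.0], g=0.5, previous_mid=[4.0], g_pred=0.25)
```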
[0174] The two types of coding refinement can be mixed within the same DFT spectrum. In the advantageous embodiment, the residual coding is applied on the lower parameter bands, while residual prediction is applied on the remaining bands. The residual coding is, in the advantageous embodiment, as depicted in
[0175] 1. Time-Frequency Analysis: DFT
[0176] It is important that the extra time-frequency decomposition from the stereo processing done by DFTs allows a good auditory scene analysis while not significantly increasing the overall delay of the coding system. By default, a time resolution of 10 ms (twice as fine as the 20 ms framing of the core coder) is used. The analysis and synthesis windows are the same and are symmetric. The window is represented at 16 kHz of sampling rate in
[0177] 2. Stereo Parameters
[0178] Stereo parameters can be transmitted at maximum at the time resolution of the stereo DFT. At minimum, it can be reduced to the framing resolution of the core coder, i.e. 20 ms. By default, when no transient is detected, parameters are computed every 20 ms over 2 DFT windows. The parameter bands constitute a non-uniform and non-overlapping decomposition of the spectrum following roughly 2 times or 4 times the Equivalent Rectangular Bandwidths (ERB). By default, a 4 times ERB scale is used for a total of 12 bands for a frequency bandwidth of 16 kHz (32 kHz sampling rate, Super Wideband stereo).
[0179] 3. Computation of ITD and Channel Time Alignment
[0180] The ITDs are computed by estimating the Time Delay of Arrival (TDOA) using the Generalized Cross Correlation with Phase Transform (GCC-PHAT):
$ITD = \arg\max\left(\mathrm{IDFT}\left(\frac{L(f)\,R^{*}(f)}{\left|L(f)\,R^{*}(f)\right|}\right)\right)$
[0181] where L and R are the frequency spectra of the left and right channels respectively. The frequency analysis can be performed independently of the DFT used for the subsequent stereo processing or can be shared. The pseudo-code for computing the ITD is the following:
L = fft(window(l));
R = fft(window(r));
tmp = L .* conj( R );
sfm_L = prod(abs(L).^(1/length(L)))/(mean(abs(L))+eps);
sfm_R = prod(abs(R).^(1/length(R)))/(mean(abs(R))+eps);
sfm = max(sfm_L, sfm_R);
h.cross_corr_smooth = (1-sfm)*h.cross_corr_smooth+sfm*tmp;
tmp = h.cross_corr_smooth ./ abs( h.cross_corr_smooth+eps );
tmp = ifft( tmp );
tmp = tmp([length(tmp)/2+1:length(tmp) 1:length(tmp)/2+1]);
tmp_sort = sort( abs(tmp) );
thresh = 3 * tmp_sort( round(0.95*length(tmp_sort)) );
xcorr_time = abs(tmp(- ( h.stereo_itd_q_max - (length(tmp)-1)/2 - 1 ):- ( h.stereo_itd_q_min - (length(tmp)-1)/2 - 1 )));
%smooth output for better detection
xcorr_time = [xcorr_time 0];
xcorr_time2 = filter([0.25 0.5 0.25], 1, xcorr_time);
[m, i] = max(xcorr_time2(2:end));
if m > thresh
  itd = h.stereo_itd_q_max - i + 1;
else
  itd = 0;
end
[0182]
[0183] In block 451, a DFT analysis of the time domain signals for a first channel (l) and a second channel (r) is performed. This DFT analysis will typically be the same DFT analysis as has been discussed in the context of steps 155 to 157 in
[0184] A cross-correlation is then performed for each frequency bin as illustrated in block 452.
[0185] Thus, a cross-correlation spectrum is obtained for the whole spectral range of the left and the right channels.
[0186] In step 453, a spectral flatness measure (SFM) is then calculated from the magnitude spectra of L and R and, in step 454, the larger spectral flatness measure is selected. However, the selection in step 454 does not necessarily have to be the selection of the larger one. The determination of a single SFM from both channels can also be the selection and calculation of only the left channel or only the right channel, or can be the calculation of a weighted average of both SFM values.
[0187] In step 455, the cross-correlation spectrum is then smoothed over time depending on the spectral flatness measure.
[0188] Advantageously, the spectral flatness measure is calculated by dividing the geometric mean of the magnitude spectrum by the arithmetic mean of the magnitude spectrum. Thus, the values for SFM are bounded between zero and one.
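Steps 453 to 455 can be illustrated with a minimal Python sketch. It is not the codec implementation: the function names, the guard constant EPS and the use of plain lists of complex bins are assumptions made for illustration only. The geometric mean is computed in the log domain, which is numerically equivalent to the product form used in the pseudo-code but avoids underflow for long spectra.

```python
import math

EPS = 1e-12  # hypothetical guard against log(0) and division by zero


def spectral_flatness(magnitudes):
    """Geometric mean divided by arithmetic mean of a magnitude spectrum."""
    n = len(magnitudes)
    log_sum = sum(math.log(m + EPS) for m in magnitudes)
    geometric_mean = math.exp(log_sum / n)
    arithmetic_mean = sum(magnitudes) / n
    return geometric_mean / (arithmetic_mean + EPS)


def smooth_cross_spectrum(previous, left, right):
    """One update of the smoothed cross-correlation spectrum L * conj(R).

    previous, left and right are equally long lists of complex DFT bins.
    The larger SFM of both channels drives the update (step 454):
    an SFM near 1 gives weak smoothing, an SFM near 0 gives strong smoothing.
    """
    sfm = max(spectral_flatness([abs(v) for v in left]),
              spectral_flatness([abs(v) for v in right]))
    current = [l * r.conjugate() for l, r in zip(left, right)]
    smoothed = [(1.0 - sfm) * p + sfm * c for p, c in zip(previous, current)]
    return smoothed, sfm
```

A flat (noise-like) magnitude spectrum yields an SFM close to one, so the current block replaces the smoothing memory almost entirely. A spectrum dominated by a single bin yields an SFM close to zero, so the memory is largely retained.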
[0189] In step 456, the smoothed cross-correlation spectrum is then normalized by its magnitude and, in step 457, an inverse DFT of the normalized and smoothed cross-correlation spectrum is calculated. In step 458, a time domain filtering is advantageously performed. This time domain filtering can also be omitted depending on the implementation, but is advantageous as will be outlined later on.
[0190] In step 459, an ITD estimation is performed by peak-picking of the filtered generalized cross-correlation function and by performing a certain thresholding operation.
[0191] If no peak above the threshold is obtained, then ITD is set to zero and no time alignment is performed for this corresponding block.
[0192] The ITD computation can also be summarized as follows. The cross-correlation is computed in the frequency domain before being smoothed depending on the Spectral Flatness Measure (SFM). The SFM is bounded between 0 and 1. In the case of noise-like signals, the SFM will be high (i.e. around 1) and the smoothing will be weak. In the case of tone-like signals, the SFM will be low and the smoothing will become stronger. The smoothed cross-correlation is then normalized by its amplitude before being transformed back to the time domain. The normalization corresponds to the Phase Transform of the cross-correlation, and is known to show better performance than the normal cross-correlation in low noise and relatively high reverberation environments. The so-obtained time domain function is first filtered for achieving a more robust peak picking. The index corresponding to the maximum amplitude corresponds to an estimate of the time difference between the Left and Right Channel (ITD). If the amplitude of the maximum is lower than a given threshold, then the estimate of the ITD is not considered reliable and is set to zero.
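The normalization, inverse transform, filtering and thresholding of steps 456 to 459 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the codec implementation: the names inverse_dft and estimate_itd, the parameter max_lag and the symmetric lag window are hypothetical, a naive inverse DFT stands in for an FFT, and the sign convention (a positive result means the left channel lags the right for a cross-spectrum L * conj(R)) is chosen for this sketch only. The factor 3, the 95 % rank and the [0.25 0.5 0.25] smoothing taps are taken from the pseudo-code.

```python
import cmath


def inverse_dft(spectrum):
    """Naive O(N^2) inverse DFT, adequate for a small illustration."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def estimate_itd(smoothed_cross, max_lag, threshold_factor=3.0):
    """Return the lag in samples of the GCC-PHAT peak, or 0 if unreliable.

    Requires 2 * max_lag + 1 <= len(smoothed_cross).
    """
    eps = 1e-12  # hypothetical guard against division by zero
    # Phase transform: discard the magnitude, keep the phase of every bin.
    phat = [c / (abs(c) + eps) for c in smoothed_cross]
    gcc = [abs(v) for v in inverse_dft(phat)]
    n = len(gcc)
    # Threshold: a multiple of the magnitude found at the 95 % rank.
    ordered = sorted(gcc)
    rank = max(0, min(n - 1, int(0.95 * n + 0.5) - 1))
    threshold = threshold_factor * ordered[rank]
    # Restrict to the admissible lags; negative lags wrap around in the DFT.
    lags = range(-max_lag, max_lag + 1)
    window = [gcc[lag % n] for lag in lags]
    # Centred 3-tap smoothing, with zeros assumed outside the lag window.
    padded = [0.0] + window + [0.0]
    smoothed = [0.25 * padded[i] + 0.5 * padded[i + 1] + 0.25 * padded[i + 2]
                for i in range(len(window))]
    peak_index = max(range(len(smoothed)), key=smoothed.__getitem__)
    if smoothed[peak_index] > threshold:
        return lags[peak_index]
    return 0  # no reliable peak: no time alignment for this block
```

For a cross-spectrum that is a pure linear phase ramp, the inverse transform is a single impulse and the function returns the corresponding lag. For an all-zero cross-spectrum no value exceeds the threshold and the function returns zero.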
[0193] If the time alignment is applied in Time Domain, the ITD is computed in a separate DFT analysis. The shift is done as follows:
[0194] An extra delay is involved at the encoder, which is equal at most to the maximum absolute ITD which can be handled.
[0195] The variation of the ITD over time is smoothed by the analysis windowing of the DFT.
[0196] Alternatively, the time alignment can be performed in the frequency domain. In this case, the ITD computation and the circular shift are in the same DFT domain, a domain shared with the other stereo processing. The circular shift is given by:
[0197] Zero padding of the DFT windows is needed for simulating a time shift with a circular shift. The size of the zero padding corresponds to the maximum absolute ITD which can be handled. In the advantageous embodiment, the zero padding is split uniformly on both sides of the analysis windows, by adding 3.125 ms of zeros on both ends. The maximum absolute possible ITD is then 6.25 ms. In an A-B microphone setup, this corresponds in the worst case to a maximum distance of about 2.15 meters between the two microphones. The variation in ITD over time is smoothed by synthesis windowing and overlap-add of the DFT.
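The figures in this paragraph can be checked with a short Python sketch, which also illustrates why zero padding lets a circular shift stand in for a time shift. The helper names and the speed of sound of 343 m/s are assumptions made for this illustration and are not taken from the description.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at about 20 degrees C


def max_itd_ms(pad_per_side_ms):
    """Maximum absolute ITD when the same padding is added at both ends."""
    return 2.0 * pad_per_side_ms


def max_mic_distance_m(itd_ms):
    """Largest microphone spacing whose travel time fits within the ITD."""
    return itd_ms * 1e-3 * SPEED_OF_SOUND_M_PER_S


def zero_pad(block, pad):
    """Add `pad` zeros before and after the block."""
    return [0.0] * pad + list(block) + [0.0] * pad


def circular_shift(samples, shift):
    """Rotate the samples to the right by `shift` positions."""
    n = len(samples)
    return [samples[(i - shift) % n] for i in range(n)]


padded = zero_pad([1.0, 2.0, 3.0], 2)
within_padding = circular_shift(padded, 2)   # content moves into the zeros
beyond_padding = circular_shift(padded, 3)   # last sample wraps to the front
```

With 3.125 ms of padding per side, max_itd_ms gives 6.25 ms and max_mic_distance_m gives roughly 2.14 m, in line with the "about 2.15 meters" stated above. As long as the shift does not exceed the available padding, only zeros wrap around, so the circular shift equals a plain time shift. A larger shift wraps signal content to the other end of the window.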
[0198] It is important that the time shift is followed by a windowing of the shifted signal. This is a main distinction from conventional Binaural Cue Coding (BCC), where the time shift is applied on a windowed signal but the signal is not windowed further at the synthesis stage. As a consequence, in BCC any change in ITD over time produces an artificial transient/click in the decoded signal.
[0199] 4. Computation of IPDs and Channel Rotation
[0200] The IPDs are computed after time aligning the two channels, for each parameter band or at least up to a given ipd_max_band, depending on the stereo configuration.
[0201] The IPDs are then applied to the two channels for aligning their phases:
[0202] Where β=atan2 (sin(IPD.sub.i[b]) , cos(IPD.sub.i[b])+c), c=10.sup.ILD.sup.
[0203] 5. Sum-Difference and Side Signal Coding
[0204] The sum difference transformation is performed on the time and phase aligned spectra of the two channels in a way that the energy is conserved in the Mid signal.
[0205] where
[0206] is bounded between 1/1.2 and 1.2, i.e. −1.58 dB and +1.58 dB. The limitation avoids artefacts when adjusting the energy of M and S. It is worth noting that this energy conservation is less important when time and phase have been aligned beforehand. Alternatively, the bounds can be increased or decreased.
[0207] The side signal S is further predicted with M:
S′(f)=S(f)−g(ILD)M(f)
[0208] where
[0209] where c=10.sup.ILD.sup.
[0210] The residual signal S′(f) can be modeled by two means: either by predicting it with the delayed spectrum of M or by coding it directly in the MDCT domain.
[0211] 6. Stereo Decoding
[0212] The Mid signal M and Side signal S are first converted to the left and right channels L and R as follows:
$L_i[k] = M_i[k] + g\,M_i[k]$, for $band\_limits[b] \le k < band\_limits[b+1]$,
$R_i[k] = M_i[k] - g\,M_i[k]$, for $band\_limits[b] \le k < band\_limits[b+1]$,
[0213] where the gain g per parameter band is derived from the ILD parameter:
[0214] where c=10.sup.ILD.sup.
[0215] For parameter bands below cod_max_band, the two channels are updated with the decoded Side signal:
$L_i[k] = L_i[k] + cod\_gain_i \cdot S_i[k]$, for $0 \le k < band\_limits[cod\_max\_band]$,
$R_i[k] = R_i[k] - cod\_gain_i \cdot S_i[k]$, for $0 \le k < band\_limits[cod\_max\_band]$,
[0216] For higher parameter bands, the side signal is predicted and the channels updated as:
$L_i[k] = L_i[k] + cod\_pred_i[b] \cdot M_{i-1}[k]$, for $band\_limits[b] \le k < band\_limits[b+1]$,
$R_i[k] = R_i[k] - cod\_pred_i[b] \cdot M_{i-1}[k]$, for $band\_limits[b] \le k < band\_limits[b+1]$,
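The per-band reconstruction described in the preceding paragraphs can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the decoder implementation: the function name upmix_frame and its argument layout are hypothetical, the per-band gains are passed in rather than derived from the ILD, and every side contribution is assumed to enter the right channel with the opposite sign of the left channel, as in a conventional inverse sum-difference.

```python
def upmix_frame(mid, prev_mid, side, band_limits, g, cod_gain, pred_gain,
                cod_max_band):
    """Rebuild left and right spectra of one DFT frame from Mid and Side.

    mid, prev_mid, side: lists of (complex) bins of frame i, frame i-1
    and the decoded residual side signal.
    band_limits: bin index where each parameter band starts, plus the end.
    g, pred_gain: one gain per parameter band; cod_gain: one global gain.
    """
    left = list(mid)
    right = list(mid)
    for b in range(len(band_limits) - 1):
        for k in range(band_limits[b], band_limits[b + 1]):
            # Side signal predicted from the Mid signal of the same frame.
            predicted_side = g[b] * mid[k]
            if b < cod_max_band:
                # Lower bands: refine with the decoded residual side signal.
                refinement = cod_gain * side[k]
            else:
                # Higher bands: predict the residual from the previous Mid.
                refinement = pred_gain[b] * prev_mid[k]
            left[k] = mid[k] + predicted_side + refinement
            right[k] = mid[k] - predicted_side - refinement
    return left, right
```

Because both side contributions are mirrored between the channels, the sum of the two outputs in every bin is twice the Mid signal, which gives a quick sanity check for any choice of gains.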
[0217] Finally, the channels are multiplied by a complex value aiming to restore the original energy and the inter-channel phase of the stereo signal:
$L_i[k] = a \cdot e^{j2\pi\beta} \cdot L_i[k]$
R.sub.i[k]=a.Math.e.sup.j2πβ−IPD.sup.
[0218] where
[0219] where a is defined and bounded as described previously, where $\beta = \operatorname{atan2}(\sin(IPD_i[b]),\ \cos(IPD_i[b]) + c)$, and where atan2(x,y) is the four-quadrant inverse tangent of x over y.
[0220] Finally, the channels are time shifted either in the time domain or in the frequency domain, depending on the transmitted ITDs. The time domain channels are synthesized by inverse DFTs and overlap-adding.
[0221] Specific features of the invention relate to the combination of spatial cues and sum-difference joint stereo coding. Specifically, the spatial cues ITD and IPD are computed and applied on the stereo channels (left and right). Furthermore, sum-difference signals (M/S signals) are calculated and, advantageously, a prediction of S with M is applied.
[0222] On the decoder-side, the broadband and narrowband spatial cues are combined together with sum-difference joint stereo coding. In particular, the side signal is predicted with the mid-signal using at least one spatial cue such as ILD and an inverse sum-difference is calculated for getting the left and right channels and, additionally, the broadband and the narrowband spatial cues are applied on the left and right channels.
[0223] Advantageously, the encoder has a windowing and overlap-add operation with respect to the time aligned channels after processing using the ITD. Furthermore, the decoder additionally has a windowing and overlap-add operation of the shifted or de-aligned versions of the channels after applying the inter-channel time difference.
[0224] The computation of the inter-channel time difference with the GCC-PHAT method is a particularly robust method.
[0225] The new procedure is advantageous over conventional technology since it achieves low bit-rate coding of stereo audio or multi-channel audio at low delay. It is specifically designed for being robust to different natures of input signals and different setups of the multichannel or stereo recording. In particular, the present invention provides good quality for low bit rate stereo speech coding.
[0226] The advantageous procedures find use in the distribution or broadcasting of all types of stereo or multichannel audio content, such as speech and music alike, with constant perceptual quality at a given low bit rate. Such application areas are digital radio, internet streaming or audio communication applications.
[0227] An inventively encoded audio signal can be stored on a digital storage medium or a non-transitory storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
[0228] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
[0229] Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
[0230] Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
[0231] Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
[0232] Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
[0233] In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
[0234] A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
[0235] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
[0236] A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
[0237] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
[0238] In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
[0239] While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.