METHOD AND APPARATUS FOR CONTROLLING AUDIO FRAME LOSS CONCEALMENT
20220375480 · 2022-11-24
Inventors
Cpc classification
G10L19/0017
PHYSICS
G10L19/025
PHYSICS
International classification
G10L19/005
PHYSICS
G10L19/00
PHYSICS
G10L19/02
PHYSICS
G10L19/025
PHYSICS
Abstract
In accordance with an example embodiment of the present invention, disclosed is a method and an apparatus thereof for controlling a concealment method for a lost audio frame of a received audio signal. A method for a decoder of concealing a lost audio frame comprises detecting in a property of the previously received and reconstructed audio signal, or in a statistical property of observed frame losses, a condition for which the substitution of a lost frame provides relatively reduced quality. In case such a condition is detected, the concealment method is modified by selectively adjusting a phase or a spectrum magnitude of a substitution frame spectrum.
Claims
1. A frame loss concealment method, wherein a segment of a previously synthesized audio signal is used as a prototype frame in order to create a substitution frame for a lost audio frame, the method comprising: transforming the prototype frame into a frequency domain; applying a sinusoidal model to the prototype frame to identify the frequency of a sinusoidal component of the audio signal; calculating a phase shift θ.sub.k for the sinusoidal component; shifting a phase of all spectral coefficients in the prototype frame included in an interval M.sub.k around a sinusoid k by the phase shift θ.sub.k while retaining the magnitude of those spectral coefficients; randomizing phases of spectral coefficients that are not phase shifted; and creating the substitution frame by performing an inverse frequency transform of a frequency spectrum of the prototype frame.
2. The frame loss concealment method according to claim 1, wherein the phase shift θ.sub.k depends on the frequency of the sinusoidal component of the audio signal and a time shift between the prototype frame and the lost audio frame.
3. The frame loss concealment method according to claim 2, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal comprises identifying frequencies in the vicinity of peaks of spectrum related to the used frequency domain transform.
4. The frame loss concealment method according to claim 3, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
5. The frame loss concealment method according to claim 2, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
6. The frame loss concealment method according to claim 1, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal comprises identifying frequencies in the vicinity of peaks of spectrum related to the used frequency domain transform.
7. The frame loss concealment method according to claim 1, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
8. An apparatus for creating a substitution frame for a lost audio frame, the apparatus being configured to: generate a prototype frame from a segment of a previously synthesized audio signal; transform the prototype frame into a frequency domain; apply a sinusoidal model to the prototype frame to identify the frequency of a sinusoidal component of the audio signal; calculate a phase shift θ.sub.k for the sinusoidal component; shift a phase of all spectral coefficients in the prototype frame included in an interval M.sub.k around a sinusoid k by the phase shift θ.sub.k while retaining the magnitude of those spectral coefficients; randomize phases of spectral coefficients that are not phase shifted; and create the substitution frame by performing an inverse frequency transform of a frequency spectrum of the prototype frame.
9. The apparatus according to claim 8, wherein the phase shift θ.sub.k depends on the frequency of the sinusoidal component of the audio signal and a time shift between the prototype frame and the lost audio frame.
10. The apparatus according to claim 9, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal comprises identifying frequencies in the vicinity of peaks of the spectrum related to the used frequency domain transform.
11. The apparatus according to claim 10, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
12. The apparatus according to claim 9, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
13. The apparatus according to claim 8, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal comprises identifying frequencies in the vicinity of peaks of the spectrum related to the used frequency domain transform.
14. The apparatus according to claim 8, wherein the applying the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
15. An audio decoder comprising the apparatus according to claim 8.
16. A device comprising the audio decoder according to claim 15.
17. A computer program product comprising a non-transitory computer-readable medium storing instructions which, when executed on at least one processor of an apparatus for creating a substitution frame for a lost audio frame, cause the at least one processor to perform operations to: generate a prototype frame from a segment of a previously synthesized audio signal; transform the prototype frame into a frequency domain; apply a sinusoidal model to the prototype frame to identify the frequency of a sinusoidal component of the audio signal; calculate a phase shift θ.sub.k for the sinusoidal component; shift a phase of all spectral coefficients in the prototype frame included in an interval M.sub.k around a sinusoid k by the phase shift θ.sub.k while retaining the magnitude of those spectral coefficients; randomize phases of spectral coefficients that are not phase shifted; and create the substitution frame by performing an inverse frequency transform of a frequency spectrum of the prototype frame.
18. The computer program product according to claim 17, wherein the phase shift θ.sub.k depends on the frequency of the sinusoidal component of the audio signal and a time shift between the prototype frame and the lost audio frame.
19. The computer program product according to claim 18, wherein the operation to apply the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal comprises to identify frequencies in the vicinity of peaks of the spectrum related to the used frequency domain transform.
20. The computer program product according to claim 19, wherein the operation to apply the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
21. The computer program product according to claim 18, wherein the operation to apply the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
22. The computer program product according to claim 17, wherein the operation to apply the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal comprises to identify frequencies in the vicinity of peaks of the spectrum related to the used frequency domain transform.
23. The computer program product according to claim 17, wherein the operation to apply the sinusoidal model to the prototype frame to identify the frequency of the sinusoidal component of the audio signal is performed with higher resolution than a frequency resolution of the used frequency domain transform.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] For a more complete understanding of example embodiments of the present invention, reference is now made to the following description taken in connection with the accompanying drawings in which:
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
DETAILED DESCRIPTION
[0036] The new controlling scheme for the new frame loss concealment techniques described involves the following steps as shown in
[0037] 1. Detect conditions in the properties of the previously received and reconstructed audio signal or in the statistical properties of the observed frame losses for which the substitution of a lost frame according to the described methods provides relatively reduced quality, 101.
[0038] 2. In case such a condition is detected in step 1, modify the element of the methods according to which the substitution frame spectrum is calculated by $Z(m) = Y(m) \cdot e^{j\theta_k}$ by selectively adjusting the phases or the spectrum magnitudes, 102.
[0039] Sinusoidal Analysis
[0040] A first step of the frame loss concealment technique to which the new controlling technique may be applied involves a sinusoidal analysis of a part of the previously received signal. The purpose of this sinusoidal analysis is to find the frequencies of the main sinusoids of that signal, and the underlying assumption is that the signal is composed of a limited number of individual sinusoids, i.e. that it is a multi-sine signal of the following type:
$$s(n) = \sum_{k=1}^{K} a_k \cdot \cos\left(2\pi \frac{f_k}{f_s} \cdot n + \varphi_k\right)$$
[0041] In this equation $K$ is the number of sinusoids that the signal is assumed to consist of. For each of the sinusoids with index $k = 1 \ldots K$, $a_k$ is the amplitude, $f_k$ is the frequency, and $\varphi_k$ is the phase. The sampling frequency is denoted by $f_s$ and the time index of the time discrete signal samples $s(n)$ by $n$.
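As a minimal sketch of the multi-sine model just defined, the following Python snippet synthesizes $s(n)$ from given amplitudes, frequencies and phases. The function name `multi_sine` and the example parameter values are illustrative assumptions and are not taken from the described method.

```python
import math

def multi_sine(amps, freqs, phases, fs, num_samples):
    """Return s(n) = sum_k a_k * cos(2*pi*f_k/f_s * n + phi_k) for n = 0..num_samples-1."""
    return [
        sum(a * math.cos(2.0 * math.pi * f / fs * n + phi)
            for a, f, phi in zip(amps, freqs, phases))
        for n in range(num_samples)
    ]

# Hypothetical example: K = 2 sinusoids, 20 ms of signal at f_s = 16 kHz.
s = multi_sine(amps=[1.0, 0.5], freqs=[440.0, 1000.0],
               phases=[0.0, math.pi / 2], fs=16000.0, num_samples=320)
```

A frame of 320 samples at 16 kHz corresponds to the lower end of the 20-40 ms analysis frame length discussed in the text.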
[0042] It is of primary importance to estimate the frequencies of the sinusoids as exactly as possible. While an ideal sinusoidal signal would have a line spectrum with line frequencies $f_k$, finding their true values would in principle require infinite measurement time. In practice these frequencies are therefore difficult to find, since they can only be estimated from a short measurement period, which corresponds to the signal segment used for the sinusoidal analysis described herein; this signal segment is hereinafter referred to as an analysis frame. Another difficulty is that the signal may in practice be time-variant, meaning that the parameters of the above equation vary over time. Hence, on the one hand it is desirable to use a long analysis frame to make the measurement more accurate; on the other hand a short measurement period is needed in order to better cope with possible signal variations. A good trade-off is to use an analysis frame length on the order of, e.g., 20-40 ms.
[0043] A preferred possibility for identifying the frequencies of the sinusoids $f_k$ is to make a frequency domain analysis of the analysis frame. To this end the analysis frame is transformed into the frequency domain, e.g. by means of DFT or DCT or similar frequency domain transforms. In case a DFT of the analysis frame $x(n)$ is used, the spectrum is given by:
$$X(m) = \mathrm{DFT}\left(w(n) \cdot x(n)\right) = \sum_{n=0}^{L-1} e^{-j\frac{2\pi}{L} m n} \cdot w(n) \cdot x(n)$$
[0044] In this equation w(n) denotes the window function with which the analysis frame of length L is extracted and weighted. Typical window functions are e.g. rectangular windows that are equal to 1 for n∈[0 . . . L−1] and otherwise 0 as shown in
[0045] The peaks of the magnitude spectrum of the windowed analysis frame $|X(m)|$ constitute an approximation of the required sinusoidal frequencies $f_k$. The accuracy of this approximation is however limited by the frequency spacing of the DFT. With the DFT with block length $L$ the accuracy is limited to $\frac{f_s}{2L}$.
[0046] Experiments show that this level of accuracy may be too low in the scope of the methods described herein. Improved accuracy can be obtained based on the results of the following consideration:
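[0047] The spectrum of the windowed analysis frame is given by the convolution of the spectrum of the window function $W(\Omega)$ with the line spectrum of the sinusoidal model signal $S(\Omega)$, subsequently sampled at the grid points of the DFT:
$$X(m) = \int_{2\pi} \delta\left(\Omega - m \cdot \frac{2\pi}{L}\right) \cdot \left(W(\Omega) * S(\Omega)\right) \, d\Omega$$
[0048] By using the spectrum expression of the sinusoidal model signal, this can be written as
$$X(m) = \frac{1}{2} \int_{2\pi} \delta\left(\Omega - m \cdot \frac{2\pi}{L}\right) \sum_{k=1}^{K} a_k \cdot \left( W\left(\Omega + 2\pi \frac{f_k}{f_s}\right) \cdot e^{-j\varphi_k} + W\left(\Omega - 2\pi \frac{f_k}{f_s}\right) \cdot e^{j\varphi_k} \right) d\Omega$$
[0049] Hence, the sampled spectrum is given by
$$X(m) = \frac{1}{2} \sum_{k=1}^{K} a_k \cdot \left( W\left(2\pi\left(\frac{m}{L} + \frac{f_k}{f_s}\right)\right) \cdot e^{-j\varphi_k} + W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j\varphi_k} \right), \quad m = 0 \ldots L-1$$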
[0050] Based on this consideration it is assumed that the observed peaks in the magnitude spectrum of the analysis frame stem from a windowed sinusoidal signal with $K$ sinusoids where the true sinusoid frequencies are found in the vicinity of the peaks. Let $m_k$ be the DFT index (grid point) of the observed $k$-th peak, then the corresponding frequency is
$$\hat{f}_k = \frac{m_k}{L} \cdot f_s,$$
which can be regarded as an approximation of the true sinusoidal frequency $f_k$. The true sinusoid frequency $f_k$ can be assumed to lie within the interval
$$\left[\left(m_k - \tfrac{1}{2}\right) \cdot \frac{f_s}{L},\ \left(m_k + \tfrac{1}{2}\right) \cdot \frac{f_s}{L}\right].$$
[0051] For clarity it is noted that the convolution of the spectrum of the window function with the line spectrum of the sinusoidal model signal can be understood as a superposition of frequency-shifted versions of the window function spectrum, whereby the shift frequencies are the frequencies of the sinusoids. This superposition is then sampled at the DFT grid points. These steps are illustrated by the following figures.
[0052] The previous discussion and the illustration of
[0053] One preferred way to find better approximations of the frequencies f.sub.k of the sinusoids is to apply parabolic interpolation. One such approach is to fit parabolas through the grid points of the DFT magnitude spectrum that surround the peaks and to calculate the respective frequencies belonging to the parabola maxima. A suitable choice for the order of the parabolas is 2. In detail the following procedure can be applied:
[0054] 1. Identify the peaks of the DFT of the windowed analysis frame. The peak search will deliver the number of peaks K and the corresponding DFT indexes of the peaks. The peak search can typically be made on the DFT magnitude spectrum or the logarithmic DFT magnitude spectrum.
[0055] 2. For each peak $k$ (with $k = 1 \ldots K$) with corresponding DFT index $m_k$, fit a parabola through the three points $\{P_1; P_2; P_3\} = \{(m_k - 1, \log|X(m_k - 1)|);\ (m_k, \log|X(m_k)|);\ (m_k + 1, \log|X(m_k + 1)|)\}$. This results in parabola coefficients $b_k(0)$, $b_k(1)$, $b_k(2)$ of the parabola defined by
$$p_k(q) = \sum_{i=0}^{2} b_k(i) \cdot q^i.$$
[0056] This parabola fitting is illustrated in
[0057] 3. For each of the $K$ parabolas, calculate the interpolated frequency index $\hat{m}_k$ corresponding to the value of $q$ for which the parabola has its maximum. Use $\hat{f}_k = \hat{m}_k \cdot f_s / L$ as an approximation for the sinusoid frequency $f_k$.
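The three-step parabolic interpolation above can be sketched as follows. This is an illustrative Python sketch only: the helper names, the Hamming window, the test tone at 1010 Hz and the naive DFT are assumptions chosen for a self-contained example. The parabola is expressed in the offset $d = q - m_k$, which is equivalent to fitting in $q$ and keeps the arithmetic simple.

```python
import cmath
import math

def log_magnitude_spectrum(frame, window):
    """Natural-log magnitude of the DFT of the windowed frame (bins 0..L/2)."""
    L = len(frame)
    xw = [w * x for w, x in zip(window, frame)]
    spectrum = []
    for m in range(L // 2 + 1):
        X_m = sum(xw[n] * cmath.exp(-2j * math.pi * m * n / L) for n in range(L))
        spectrum.append(math.log(abs(X_m) + 1e-12))  # small floor avoids log(0)
    return spectrum

def parabolic_peak_index(logmag, m_k):
    """Interpolated peak index from a parabola through bins m_k-1, m_k, m_k+1."""
    y_prev, y_peak, y_next = logmag[m_k - 1], logmag[m_k], logmag[m_k + 1]
    # Parabola in the offset d = q - m_k: p(d) = c0 + c1*d + c2*d^2.
    c1 = 0.5 * (y_next - y_prev)
    c2 = 0.5 * (y_prev - 2.0 * y_peak + y_next)
    if c2 >= 0.0:  # not a maximum; fall back to the grid point
        return float(m_k)
    return m_k - c1 / (2.0 * c2)

# Hypothetical example: one sinusoid that does not fall on a DFT grid point.
fs, L, f_true = 8000.0, 128, 1010.0
frame = [math.cos(2.0 * math.pi * f_true / fs * n) for n in range(L)]
hamming = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]
logmag = log_magnitude_spectrum(frame, hamming)
m_k = max(range(1, L // 2), key=lambda m: logmag[m])  # strongest interior peak
f_grid = m_k * fs / L                                  # grid-point estimate
f_hat = parabolic_peak_index(logmag, m_k) * fs / L     # interpolated estimate
```

In this example the grid-point estimate `f_grid` can be off by up to $f_s/(2L)$, whereas the interpolated estimate `f_hat` lands much closer to the true frequency.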
[0058] The described approach provides good results but may have some limitations since the parabolas do not approximate the shape of the main lobe of the magnitude spectrum |W(Ω)| of the window function. An alternative scheme doing this is an enhanced frequency estimation using a main lobe approximation, described as follows.
[0059] The main idea of this alternative is to fit a function $P(q)$, which approximates the main lobe of $\left|W\left(\frac{2\pi}{L} \cdot q\right)\right|$, through the grid points of the DFT magnitude spectrum that surround the peaks and to calculate the respective frequencies belonging to the function maxima. The function $P(q)$ could be identical to the frequency-shifted magnitude spectrum $\left|W\left(\frac{2\pi}{L} \cdot (q - \hat{q})\right)\right|$ of the window function. For numerical simplicity it should however rather be, for instance, a polynomial, which allows for straightforward calculation of the function maximum. The following detailed procedure can be applied:
[0060] 1. Identify the peaks of the DFT of the windowed analysis frame. The peak search will deliver the number of peaks K and the corresponding DFT indexes of the peaks. The peak search can typically be made on the DFT magnitude spectrum or the logarithmic DFT magnitude spectrum.
[0061] 2. Derive the function $P(q)$ that approximates the magnitude spectrum $\left|W\left(\frac{2\pi}{L} \cdot q\right)\right|$ of the window function or the logarithmic magnitude spectrum $\log\left|W\left(\frac{2\pi}{L} \cdot q\right)\right|$ for a given interval $(q_1, q_2)$. The choice of the approximation function approximating the window spectrum main lobe is illustrated by
[0062] 3. For each peak $k$ (with $k = 1 \ldots K$) with corresponding DFT index $m_k$, fit the frequency-shifted function $P(q - \hat{q}_k)$ through the two DFT grid points that surround the expected true peak of the continuous spectrum of the windowed sinusoidal signal. Hence, if $|X(m_k - 1)|$ is larger than $|X(m_k + 1)|$, fit $P(q - \hat{q}_k)$ through the points $\{P_1; P_2\} = \{(m_k - 1, \log|X(m_k - 1)|);\ (m_k, \log|X(m_k)|)\}$, and otherwise through the points $\{P_1; P_2\} = \{(m_k, \log|X(m_k)|);\ (m_k + 1, \log|X(m_k + 1)|)\}$. $P(q)$ can for simplicity be chosen to be a polynomial of order 2 or 4. This renders the approximation in step 2 a simple linear regression calculation and the calculation of $\hat{q}_k$ straightforward. The interval $(q_1, q_2)$ can be chosen to be fixed and identical for all peaks, e.g. $(q_1, q_2) = (-1, 1)$, or adaptive.
[0063] In the adaptive approach the interval can be chosen such that the function $P(q - \hat{q}_k)$ fits the main lobe of the window function spectrum in the range of the relevant DFT grid points $\{P_1; P_2\}$. The fitting process is visualized in
[0064] 4. For each of the $K$ frequency shift parameters $\hat{q}_k$ for which the continuous spectrum of the windowed sinusoidal signal is expected to have its peak, calculate $\hat{f}_k = \hat{q}_k \cdot f_s / L$ as an approximation for the sinusoid frequency $f_k$.
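One possible reading of the main lobe approximation procedure is sketched below in Python. Several choices here are assumptions made for illustration and are not prescribed by the text: $P(q)$ is taken as the even order-2 polynomial $c_2 q^2$ in the logarithmic domain with a free additive level, the interval is fixed to $(-1, 1)$, a Hamming window is used (a rectangular window would have spectral zeros at $q = \pm 1$, where the logarithm is undefined), and all function names are hypothetical.

```python
import cmath
import math

def window_log_mag(window, q):
    """log|W(2*pi*q/L)| evaluated at a (possibly fractional) DFT index q."""
    L = len(window)
    W = sum(window[n] * cmath.exp(-2j * math.pi * q * n / L) for n in range(L))
    return math.log(abs(W))

def main_lobe_curvature(window, q_lo=-1.0, q_hi=1.0, steps=41):
    """Least-squares fit of log|W(q)| - log|W(0)| ~ c2 * q^2 on [q_lo, q_hi]."""
    ref = window_log_mag(window, 0.0)
    num = den = 0.0
    for i in range(steps):
        q = q_lo + (q_hi - q_lo) * i / (steps - 1)
        num += q * q * (window_log_mag(window, q) - ref)
        den += q ** 4
    return num / den

def main_lobe_peak_index(logmag, m_k, c2):
    """Shift P(q) = c2*q^2 (plus unknown level) through the two bins around the peak."""
    lo = m_k - 1 if logmag[m_k - 1] > logmag[m_k + 1] else m_k
    dy = logmag[lo + 1] - logmag[lo]
    # Solve c2*(lo + 1 - q_hat)^2 - c2*(lo - q_hat)^2 = dy for q_hat.
    return lo + 0.5 - dy / (2.0 * c2)

# Hypothetical example: same off-grid tone as a single-sinusoid test signal.
fs, L, f_true = 8000.0, 128, 1010.0
hamming = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]
xw = [hamming[n] * math.cos(2.0 * math.pi * f_true / fs * n) for n in range(L)]
logmag = [math.log(abs(sum(xw[n] * cmath.exp(-2j * math.pi * m * n / L)
                           for n in range(L))) + 1e-12)
          for m in range(L // 2 + 1)]
m_k = max(range(1, L // 2), key=lambda m: logmag[m])
c2 = main_lobe_curvature(hamming)
f_hat = main_lobe_peak_index(logmag, m_k, c2) * fs / L
```

Because `c2` depends only on the window, it can be computed once and reused for every peak and every frame, which is the numerical advantage of a polynomial $P(q)$ noted in the text.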
[0065] There are many cases where the transmitted signal is harmonic, meaning that the signal consists of sine waves whose frequencies are integer multiples of some fundamental frequency $f_0$. This is the case when the signal is very periodic, as for instance for voiced speech or the sustained tones of some musical instrument. This means that the frequencies of the sinusoidal model of the embodiments are not independent but rather have a harmonic relationship and stem from the same fundamental frequency. Taking this harmonic property into account can consequently improve the analysis of the sinusoidal component frequencies substantially.
[0066] One enhancement possibility is outlined as follows:
[0067] 1. Check whether the signal is harmonic. This can for instance be done by evaluating the periodicity of the signal prior to the frame loss. One straightforward method is to perform an autocorrelation analysis of the signal. The maximum of such an autocorrelation function for some time lag $\tau > 0$ can be used as an indicator. If the value of this maximum exceeds a given threshold, the signal can be regarded as harmonic. The corresponding time lag $\tau$ then corresponds to the period of the signal, which is related to the fundamental frequency through
$$f_0 = \frac{f_s}{\tau}.$$
[0068] Many linear predictive speech coding methods apply so-called open or closed-loop pitch prediction or CELP coding using adaptive codebooks. The pitch gain and the associated pitch lag parameters derived by such coding methods are also useful indicators: the pitch gain indicates whether the signal is harmonic, and the pitch lag provides the time lag.
[0069] A further method for obtaining f.sub.0 is described below.
[0070] 2. For each harmonic index $j$ within the integer range $1 \ldots J_{max}$, check whether there is a peak in the (logarithmic) DFT magnitude spectrum of the analysis frame within the vicinity of the harmonic frequency $f_j = j \cdot f_0$. The vicinity of $f_j$ may be defined as the delta range around $f_j$ where delta corresponds to the frequency resolution of the DFT $\frac{f_s}{L}$, i.e. the interval
$$\left[j \cdot f_0 - \frac{f_s}{2L},\ j \cdot f_0 + \frac{f_s}{2L}\right].$$
[0071] In case such a peak with corresponding estimated sinusoidal frequency $\hat{f}_k$ is present, supersede $\hat{f}_k$ by $\hat{\hat{f}}_k = j \cdot f_0$.
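The two-step harmonic enhancement can be sketched in Python as follows. The function names, the normalized autocorrelation, the lag search range, the threshold of 0.8 and the example values are all assumptions chosen for illustration; the time lag is counted in samples so that $f_0 = f_s/\tau$.

```python
import math

def autocorr_pitch_lag(signal, min_lag, max_lag, threshold=0.8):
    """Return the lag (in samples) maximizing the normalized autocorrelation,
    or None if the maximum does not exceed the threshold (signal not harmonic)."""
    best_lag, best_r = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = signal[:-lag], signal[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        r = num / den if den > 0.0 else 0.0
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag if best_r > threshold else None

def snap_to_harmonics(peak_freqs, f0, fs, L, j_max):
    """Supersede each estimate lying within f_s/(2L) of j*f0 by j*f0."""
    half_width = fs / (2.0 * L)
    snapped = list(peak_freqs)
    for j in range(1, j_max + 1):
        f_j = j * f0
        for i, f_hat in enumerate(peak_freqs):
            if abs(f_hat - f_j) <= half_width:
                snapped[i] = f_j
    return snapped

# Hypothetical example: a 200 Hz harmonic signal and slightly-off peak estimates.
fs, L = 8000.0, 128
signal = [sum(math.cos(2.0 * math.pi * h * 200.0 / fs * n) / h for h in (1, 2, 3))
          for n in range(320)]
peaks = [198.0, 405.0, 950.0]
lag = autocorr_pitch_lag(signal, min_lag=20, max_lag=60)
f0 = fs / lag if lag is not None else None
snapped = snap_to_harmonics(peaks, f0, fs, L, j_max=5) if f0 else peaks
```

The peak at 950 Hz is left untouched because it is further than $f_s/(2L)$ from any integer multiple of the detected fundamental.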
[0072] For the two-step procedure given above there is also the possibility to make the check whether the signal is harmonic and the derivation of the fundamental frequency implicitly and possibly in an iterative fashion without necessarily using indicators from some separate method. An example for such a technique is given as follows:
[0073] For each $f_{0,p}$ out of a set of candidate values $\{f_{0,1} \ldots f_{0,P}\}$ apply the procedure step 2, though without superseding $\hat{f}_k$ but with counting how many DFT peaks are present within the vicinity around the harmonic frequencies, i.e. the integer multiples of $f_{0,p}$. Identify the fundamental frequency $f_{0,p_{max}}$ for which the largest number of peaks at or around the harmonic frequencies is obtained. If this largest number of peaks exceeds a given threshold, then the signal is assumed to be harmonic. In that case $f_{0,p_{max}}$ can be assumed to be the fundamental frequency with which step 2 is then executed, leading to enhanced sinusoidal frequencies $\hat{\hat{f}}_k$. A more preferable alternative is however first to optimize the fundamental frequency $f_0$ based on the peak frequencies $\hat{f}_k$ that have been found to coincide with harmonic frequencies. Assume a set of $M$ harmonics, i.e. integer multiples $\{n_1 \ldots n_M\}$ of some fundamental frequency, that have been found to coincide with some set of $M$ spectral peaks at frequencies $\hat{f}_{k(m)}$, $m = 1 \ldots M$; then the underlying (optimized) fundamental frequency $f_{0,opt}$ can be calculated to minimize the error between the harmonic frequencies and the spectral peak frequencies. If the error to be minimized is the mean square error
$$E = \sum_{m=1}^{M} \left(n_m \cdot f_0 - \hat{f}_{k(m)}\right)^2,$$
then the optimal fundamental frequency is calculated as
$$f_{0,opt} = \frac{\sum_{m=1}^{M} n_m \cdot \hat{f}_{k(m)}}{\sum_{m=1}^{M} n_m^2}.$$
[0074] The initial set of candidate values $\{f_{0,1} \ldots f_{0,P}\}$ can be obtained from the frequencies of the DFT peaks or the estimated sinusoidal frequencies $\hat{f}_k$.
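A compact Python sketch of the implicit harmonicity check with least-squares refinement of the fundamental frequency is given below. It is illustrative only: the function names, the threshold value and the example peaks are assumptions, the candidates are simply the peak frequencies themselves, and each peak is matched to its nearest harmonic, which is equivalent to checking each harmonic's vicinity as long as $f_s/(2L)$ is smaller than half the candidate fundamental.

```python
def harmonic_matches(peak_freqs, f0, fs, L):
    """Return (harmonic number, peak frequency) pairs for peaks within
    f_s/(2L) of an integer multiple of the candidate fundamental f0."""
    half_width = fs / (2.0 * L)
    matches = []
    for f_hat in peak_freqs:
        n = round(f_hat / f0)
        if n >= 1 and abs(f_hat - n * f0) <= half_width:
            matches.append((n, f_hat))
    return matches

def optimise_f0(matches):
    """Least-squares fundamental: sum(n_m * f_m) / sum(n_m^2)."""
    num = sum(n * f_hat for n, f_hat in matches)
    den = sum(n * n for n, _ in matches)
    return num / den

def estimate_f0(peak_freqs, fs, L, threshold=2):
    """Pick the candidate with the most harmonic matches; return the refined
    fundamental, or None if the match count does not exceed the threshold."""
    if not peak_freqs:
        return None
    best = max((harmonic_matches(peak_freqs, c, fs, L) for c in peak_freqs), key=len)
    if len(best) <= threshold:
        return None
    return optimise_f0(best)

# Hypothetical example: three near-harmonic peaks and one unrelated peak.
peaks = [201.0, 399.0, 603.0, 950.0]
f0_opt = estimate_f0(peaks, fs=8000.0, L=128)
```

In this example the candidate 201 Hz collects three matches (harmonic numbers 1, 2 and 3), and the refinement yields 2808/14 Hz, roughly 200.57 Hz.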
[0075] A further possibility to improve the accuracy of the estimated sinusoidal frequencies $\hat{f}_k$ is to consider their temporal evolution. To that end, the estimates of the sinusoidal frequencies from multiple analysis frames can be combined, for instance by means of averaging or prediction. Prior to averaging or prediction, a peak tracking can be applied that connects the estimated spectral peaks to the respective same underlying sinusoids.
[0076] Applying the Sinusoidal Model
[0077] The application of a sinusoidal model in order to perform a frame loss concealment operation described herein may be described as follows.
[0078] It is assumed that a given segment of the coded signal cannot be reconstructed by the decoder since the corresponding encoded information is not available. It is further assumed that a part of the signal prior to this segment is available. Let $y(n)$ with $n = 0 \ldots N-1$ be the unavailable segment for which a substitution frame $z(n)$ has to be generated, and $y(n)$ with $n < 0$ be the available previously decoded signal. Then, in a first step a prototype frame of the available signal of length $L$ and start index $n_{-1}$ is extracted with a window function $w(n)$ and transformed into the frequency domain, e.g. by means of DFT:
$$Y_{-1}(m) = \sum_{n=0}^{L-1} y(n - n_{-1}) \cdot w(n) \cdot e^{-j\frac{2\pi}{L} n m}$$
[0079] The window function can be one of the window functions described above in the sinusoidal analysis. Preferably, in order to save numerical complexity, the frequency domain transformed frame should be identical with the one used during sinusoidal analysis.
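[0080] In a next step the sinusoidal model assumption is applied. According to that the DFT of the prototype frame can be written as follows:
$$Y_{-1}(m) = \frac{1}{2} \sum_{k=1}^{K} a_k \cdot \left( W\left(2\pi\left(\frac{m}{L} + \frac{f_k}{f_s}\right)\right) \cdot e^{-j\varphi_k} + W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j\varphi_k} \right)$$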
[0081] The next step is to realize that the spectrum of the used window function has only a significant contribution in a frequency range close to zero. As illustrated in
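$$\hat{Y}_{-1}(m) = \frac{a_k}{2} \cdot W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j\varphi_k}$$
for non-negative $m \in M_k$ and for each $k$. Herein, $M_k$ denotes the integer interval
$$M_k = \left[\operatorname{round}\left(\frac{f_k}{f_s} \cdot L\right) - m_{min,k},\ \operatorname{round}\left(\frac{f_k}{f_s} \cdot L\right) + m_{max,k}\right],$$
where $m_{min,k}$ and $m_{max,k}$ fulfill the above explained constraint such that the intervals are not overlapping. A suitable choice for $m_{min,k}$ and $m_{max,k}$ is to set them to a small integer value $\delta$, e.g. $\delta = 3$. If however the DFT indices related to two neighboring sinusoidal frequencies $f_k$ and $f_{k+1}$ are less than $2\delta$ apart, then $\delta$ is set to
$$\operatorname{floor}\left(\frac{\operatorname{round}\left(\frac{f_{k+1}}{f_s} \cdot L\right) - \operatorname{round}\left(\frac{f_k}{f_s} \cdot L\right)}{2}\right)$$
such that it is ensured that the intervals are not overlapping. The function floor(⋅) returns the largest integer that is smaller than or equal to its argument.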
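The construction of the intervals $M_k$ can be sketched as follows. This Python sketch is illustrative and deviates from the text in one labeled respect: for neighboring sinusoids it shrinks the facing interval halves to floor((gap − 1)/2) rather than floor(gap/2), an assumption made here so that an even bin distance cannot leave the two intervals sharing a boundary bin. The function name and example values are hypothetical, and the sinusoids are assumed to map to distinct DFT bins.

```python
def sinusoid_intervals(freqs, fs, L, delta=3):
    """Return non-overlapping DFT index intervals (lo, hi), one per sinusoid."""
    centres = sorted(round(f / fs * L) for f in freqs)
    m_min = [delta] * len(centres)   # extent below each centre bin
    m_max = [delta] * len(centres)   # extent above each centre bin
    for k in range(len(centres) - 1):
        gap = centres[k + 1] - centres[k]
        if gap <= 2 * delta:
            # Assumption of this sketch: (gap - 1) // 2 keeps the two facing
            # halves strictly apart even when the gap is even.
            half = max((gap - 1) // 2, 0)
            m_max[k] = min(m_max[k], half)
            m_min[k + 1] = min(m_min[k + 1], half)
    return [(max(c - lo, 0), min(c + hi, L // 2))
            for c, lo, hi in zip(centres, m_min, m_max)]

# Hypothetical example: two close sinusoids and one isolated sinusoid.
intervals = sinusoid_intervals([1000.0, 1250.0, 3000.0], fs=8000.0, L=128)
```

With $f_s$ = 8 kHz and $L$ = 128 the three sinusoids map to bins 16, 20 and 48; the first two are only 4 bins apart, so their facing interval halves shrink to one bin each while the outer halves keep the default width.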
[0082] The next step according to the embodiment is to apply the sinusoidal model according to the above expression and to evolve its $K$ sinusoids in time. The assumption that the time indices of the erased segment differ from the time indices of the prototype frame by $n_{-1}$ samples means that the phases of the sinusoids advance by
$$\theta_k = 2\pi \cdot \frac{f_k}{f_s} \cdot n_{-1}.$$
[0083] Hence, the DFT spectrum of the evolved sinusoidal model is given by:
$$Y_0(m) = \frac{1}{2} \sum_{k=1}^{K} a_k \cdot \left( W\left(2\pi\left(\frac{m}{L} + \frac{f_k}{f_s}\right)\right) \cdot e^{-j(\varphi_k + \theta_k)} + W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j(\varphi_k + \theta_k)} \right)$$
[0084] Applying again the approximation according to which the shifted window function spectra do not overlap gives:
$$\hat{Y}_0(m) = \frac{a_k}{2} \cdot W\left(2\pi\left(\frac{m}{L} - \frac{f_k}{f_s}\right)\right) \cdot e^{j(\varphi_k + \theta_k)}$$
for non-negative $m \in M_k$ and for each $k$.
[0085] Comparing the DFT of the prototype frame $Y_{-1}(m)$ with the DFT of the evolved sinusoidal model $Y_0(m)$ by using the approximation, it is found that the magnitude spectrum remains unchanged while the phase is shifted by
$$\theta_k = 2\pi \cdot \frac{f_k}{f_s} \cdot n_{-1}$$
for each $m \in M_k$. Hence, the frequency spectrum coefficients of the prototype frame in the vicinity of each sinusoid are phase-shifted in proportion to the sinusoidal frequency $f_k$ and the time difference $n_{-1}$ between the lost audio frame and the prototype frame.
[0086] Hence, according to the embodiment the substitution frame can be calculated by the following expression:
$z(n) = \mathrm{IDFT}\{Z(m)\}$ with $Z(m) = Y(m) \cdot e^{j\theta_k}$ for non-negative $m \in M_k$ and for each $k$.
[0087] A specific embodiment addresses phase randomization for DFT indices not belonging to any interval $M_k$. As described above, the intervals $M_k$, $k = 1 \ldots K$, have to be set such that they are strictly non-overlapping, which is done using some parameter $\delta$ which controls the size of the intervals. It may happen that $\delta$ is small in relation to the frequency distance of two neighboring sinusoids. In that case there is a gap between two intervals. Consequently, for the corresponding DFT indices $m$ no phase shift according to the above expression $Z(m) = Y(m) \cdot e^{j\theta_k}$ is defined. A suitable choice according to this embodiment is to randomize the phase for these indices, yielding $Z(m) = Y(m) \cdot e^{j 2\pi \, \mathrm{rand}(\cdot)}$, where the function rand(⋅) returns some random number.
[0088] It has been found beneficial for the quality of the reconstructed signals to optimize the size of the intervals M.sub.k. In particular, the intervals should be larger if the signal is very tonal, i.e. when it has clear and distinct spectral peaks. This is the case for instance when the signal is harmonic with a clear periodicity. In other cases where the signal has less pronounced spectral structure with broader spectral maxima, it has been found that using small intervals leads to better quality. This finding leads to a further improvement according to which the interval size is adapted according to the properties of the signal. One realization is to use a tonality or a periodicity detector. If this detector identifies the signal as tonal, the δ-parameter controlling the interval size is set to a relatively large value. Otherwise, the δ-parameter is set to relatively smaller values.
[0089] Based on the above, the audio frame loss concealment methods involve the following steps:
[0090] 1. Analyzing a segment of the available, previously synthesized signal to obtain the constituent sinusoidal frequencies $f_k$ of a sinusoidal model, optionally using an enhanced frequency estimation.
[0091] 2. Extracting a prototype frame $y_{-1}$ from the available previously synthesized signal and calculating the DFT of that frame.
[0092] 3. Calculating the phase shift $\theta_k$ for each sinusoid $k$ in response to the sinusoidal frequency $f_k$ and the time advance $n_{-1}$ between the prototype frame and the substitution frame. Optionally in this step the size of the interval $M_k$ may have been adapted in response to the tonality of the audio signal.
[0093] 4. For each sinusoid $k$, advancing the phase of the prototype frame DFT by $\theta_k$ selectively for the DFT indices related to a vicinity around the sinusoid frequency $f_k$.
[0094] 5. Calculating the inverse DFT of the spectrum obtained in step 4.
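Steps 2 to 5 above, together with the phase randomization for bins outside all intervals, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the sinusoid frequencies are taken as already known, a periodic Hann window and a naive DFT are used, the frame length is even, the DC and Nyquist bins are left untouched so that the output stays real, and overlapping intervals are resolved on a first-come basis. None of the function names come from the described method.

```python
import cmath
import math
import random

def dft(x):
    L = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / L) for n in range(L))
            for m in range(L)]

def idft_real(X):
    L = len(X)
    return [(sum(X[m] * cmath.exp(2j * math.pi * m * n / L) for m in range(L)) / L).real
            for n in range(L)]

def conceal_frame(prototype, window, sin_freqs, fs, n_shift, delta=3, rng=None):
    """Create a substitution frame by phase-evolving the prototype frame spectrum."""
    rng = rng or random.Random(0)
    L = len(prototype)                      # assumed even
    half = L // 2
    Y = dft([w * y for w, y in zip(window, prototype)])
    Z = list(Y)
    shifted = [False] * (half + 1)
    for f_k in sin_freqs:
        theta_k = 2.0 * math.pi * f_k / fs * n_shift   # phase advance of sinusoid k
        centre = round(f_k / fs * L)
        for m in range(max(centre - delta, 1), min(centre + delta, half - 1) + 1):
            if not shifted[m]:
                Z[m] = Y[m] * cmath.exp(1j * theta_k)  # magnitude is retained
                shifted[m] = True
    for m in range(1, half):
        if not shifted[m]:                             # bin outside every interval
            Z[m] = Y[m] * cmath.exp(2j * math.pi * rng.random())
        Z[L - m] = Z[m].conjugate()                    # keep conjugate symmetry
    return idft_real(Z)

# Hypothetical example: a stationary 1 kHz tone, advanced by 43 samples.
fs, L, f_k, n_shift = 8000.0, 128, 1000.0, 43
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / L) for n in range(L)]
prototype = [math.cos(2.0 * math.pi * f_k / fs * n) for n in range(L)]
z = conceal_frame(prototype, hann, [f_k], fs, n_shift)
expected = [hann[n] * math.cos(2.0 * math.pi * f_k / fs * (n + n_shift)) for n in range(L)]
```

For this stationary on-grid tone the substitution frame coincides with the windowed signal advanced by `n_shift` samples, which is the behaviour the phase evolution is meant to achieve.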
[0095] Signal and Frame Loss Property Analysis and Detection
[0096] The methods described above are based on the assumption that the properties of the audio signal do not change significantly during the short time between the previously received and reconstructed signal frame and a lost frame. In that case it is a very good choice to retain the magnitude spectrum of the previously reconstructed frame and to evolve the phases of the sinusoidal main components detected in the previously reconstructed signal. There are however cases where this assumption does not hold, for instance transients with sudden energy changes or sudden spectral changes.
[0097] A first embodiment of a transient detector according to the invention can consequently be based on energy variations within the previously reconstructed signal. This method, illustrated in
$$E_{left} = \sum_{n=0}^{N_{part}-1} y^2(n - n_{left}), \qquad E_{right} = \sum_{n=0}^{N_{part}-1} y^2(n - n_{right})$$
[0098] Herein $y(n)$ denotes the analysis frame, and $n_{left}$ and $n_{right}$ denote the respective start indices of the partial frames that are both of size $N_{part}$.
[0099] Now the left and right partial frame energies are used for the detection of a signal discontinuity. This is done by calculating the ratio
$$R_{l/r} = \frac{E_{left}}{E_{right}}.$$
[0100] A discontinuity with sudden energy decrease (offset) can be detected if the ratio $R_{l/r}$ exceeds some threshold (e.g. 10), 115. Similarly a discontinuity with sudden energy increase (onset) can be detected if the ratio $R_{l/r}$ is below some other threshold (e.g. 0.1), 117.
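The energy-ratio transient detector can be sketched in a few lines of Python. The function name, the use of list slices for the partial frames and the handling of a silent right partial frame are assumptions of this sketch; the thresholds 10 and 0.1 are the example values from the text.

```python
import math

def energy_transient(frame, n_part, offset_thr=10.0, onset_thr=0.1):
    """Classify a frame as 'offset', 'onset' or 'none' from the ratio of the
    energies of its first and last n_part samples."""
    left = frame[:n_part]
    right = frame[len(frame) - n_part:]
    e_left = sum(v * v for v in left)
    e_right = sum(v * v for v in right)
    if e_right == 0.0:
        # Assumption: treat silence after signal as an extreme offset.
        ratio = math.inf if e_left > 0.0 else 1.0
    else:
        ratio = e_left / e_right
    if ratio > offset_thr:
        return ratio, "offset"
    if ratio < onset_thr:
        return ratio, "onset"
    return ratio, "none"

# Hypothetical example frames of 128 samples, split into two halves.
decaying = [1.0] * 64 + [0.1] * 64
rising = [0.1] * 64 + [1.0] * 64
steady = [1.0] * 128
results = [energy_transient(f, 64)[1] for f in (decaying, rising, steady)]
```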
[0101] In the context of the above described concealment methods it has been found that the above defined energy ratio may in many cases be too insensitive an indicator. In particular, in real signals and especially music, there are cases where a tone at some frequency suddenly emerges while some other tone at some other frequency suddenly stops. Analyzing such a signal frame with the above-defined energy ratio would in any case lead to a wrong detection result for at least one of the tones, since this indicator does not discriminate between frequencies.
[0102] A solution to this problem is described in the following embodiment. The transient detection is now done in the time frequency plane. The analysis frame is again partitioned into a left and a right partial frame, 110. Though now, these two partial frames are (after suitable windowing with e.g. a Hamming window, 111) transformed into the frequency domain, e.g. by means of a N.sub.part-point DFT, 112.
$$Y_{left}(m) = \mathrm{DFT}\{y(n - n_{left})\}_{N_{part}}, \qquad Y_{right}(m) = \mathrm{DFT}\{y(n - n_{right})\}_{N_{part}}, \qquad m = 0 \ldots N_{part}-1.$$
[0103] Now the transient detection can be done frequency selectively for each DFT bin with index m. Using the powers of the left and right partial frame magnitude spectra, for each DFT index m a respective energy ratio can be calculated, 113, as
$$R_{l/r}(m) = \frac{|Y_{left}(m)|^2}{|Y_{right}(m)|^2}.$$
[0104] Experiments show that frequency selective transient detection with DFT bin resolution is relatively imprecise due to statistical fluctuations (estimation errors). It was found that the quality of the operation is enhanced when the frequency selective transient detection is made on the basis of frequency bands. Let $I_k = [m_{k-1}+1, \ldots, m_k]$ specify the k-th interval, k=1 . . . K, covering the DFT bins from $m_{k-1}+1$ to $m_k$; these intervals then define K frequency bands. The frequency group selective transient detection can now be based on the band-wise ratio between the respective band energies of the left and right partial frames:
$$R_{l/r,band}(k) = \frac{\sum_{m \in I_k} |Y_{left}(m)|^2}{\sum_{m \in I_k} |Y_{right}(m)|^2}.$$
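The band-wise detection of paragraphs [0102] to [0104] can be illustrated with the following sketch. It assumes that the ratio is formed as left band energy divided by right band energy, in line with the thresholds of the first embodiment. The function names, the symmetric Hamming window, the direct O(N²) DFT and the floor `eps` are hypothetical choices made to keep the example self-contained.

```python
import cmath
import math
from typing import List, Sequence


def hamming(n: int) -> List[float]:
    """Symmetric Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(x: Sequence[float]) -> List[float]:
    """|DFT|^2 for bins 0..N/2 of a Hamming-windowed real partial frame."""
    n = len(x)
    xw = [a * b for a, b in zip(x, hamming(n))]
    spectrum = []
    for m in range(n // 2 + 1):
        acc = sum(xw[i] * cmath.exp(-2j * math.pi * m * i / n) for i in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def band_ratios(y_left: Sequence[float], y_right: Sequence[float],
                edges: Sequence[int], eps: float = 1e-12) -> List[float]:
    """Band-wise left/right energy ratios.

    edges = [m_0, m_1, ..., m_K]; band k covers DFT bins m_{k-1}+1 .. m_k.
    """
    p_left = power_spectrum(y_left)
    p_right = power_spectrum(y_right)
    ratios = []
    for k in range(1, len(edges)):
        lo, hi = edges[k - 1] + 1, edges[k]
        e_left = sum(p_left[lo:hi + 1])
        e_right = sum(p_right[lo:hi + 1])
        ratios.append(e_left / max(e_right, eps))  # eps avoids division by zero
    return ratios
```

For a left partial frame holding a low tone and a right partial frame holding a high tone of the same amplitude, the low band yields a large ratio (offset) and the high band a small ratio (onset), whereas the two full-band energies are equal and would not reveal either event.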
[0105] It is to be noted that the interval $I_k = [m_{k-1}+1, \ldots, m_k]$ corresponds to the frequency band
$$B_k = \left[\frac{m_{k-1}+1}{N_{part}} \cdot f_s, \ldots, \frac{m_k}{N_{part}} \cdot f_s\right],$$
where f.sub.s denotes the audio sampling frequency.
[0106] The lowest lower frequency band boundary m.sub.0 can be set to 0 but may also be set to a DFT index corresponding to a larger frequency in order to mitigate estimation errors that grow with lower frequencies. The highest upper frequency band boundary m.sub.K can be set to
$$\frac{N_{part}}{2}$$
but is preferably chosen to correspond to some lower frequency at which a transient still has a significant audible effect.
[0107] One suitable choice for the frequency band widths is to make them equal, e.g. with a width of several hundred Hz. Another preferred way is to make the frequency band widths follow the size of the human auditory critical bands, i.e. to relate them to the frequency resolution of the auditory system. This means approximately making the frequency band widths equal for frequencies up to 1 kHz and increasing them exponentially above 1 kHz. Exponential increase means, for instance, doubling the frequency bandwidth when incrementing the band index k.
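A possible way to derive the band boundaries m.sub.0 . . . m.sub.K from these guidelines is sketched below. It assumes that DFT bin m of an N.sub.part-point DFT lies at frequency m·f.sub.s/N.sub.part. The function name, the base width of 250 Hz and the rounding of boundaries to the nearest bin are hypothetical choices; the text only asks for equal widths up to about 1 kHz and doubling widths above.

```python
from typing import List, Optional


def band_edges(fs: float, n_part: int, base_width_hz: float = 250.0,
               f_split_hz: float = 1000.0, f_max_hz: Optional[float] = None,
               m0: int = 0) -> List[int]:
    """Return DFT indices [m_0, ..., m_K]; band k covers bins m_{k-1}+1 .. m_k."""
    bin_hz = fs / n_part                    # frequency spacing of the DFT bins
    if f_max_hz is None:
        f_max_hz = fs / 2.0                 # default upper limit: Nyquist
    m_max = min(n_part // 2, int(f_max_hz / bin_hz))
    edges = [m0]
    width = base_width_hz
    f_hi = m0 * bin_hz                      # current upper band boundary in Hz
    while True:
        if f_hi >= f_split_hz:
            width *= 2.0                    # double the width per band above the split
        f_hi += width
        m = min(int(round(f_hi / bin_hz)), m_max)
        if m > edges[-1]:
            edges.append(m)
        if m >= m_max:
            break
    return edges
```

With `fs=16000` and `n_part=256` the bins are 62.5 Hz apart, so the four bands below 1 kHz each span four bins and the bands above grow to 8, 16, 32 and finally the remaining bins up to N.sub.part/2.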
[0108] As described in the first embodiment of the transient detector that was based on an energy ratio of two partial frames, any of the ratios related to band energies or DFT bin energies of two partial frames are compared to certain thresholds. A respective upper threshold for (frequency selective) offset detection 115 and a respective lower threshold for (frequency selective) onset detection 117 are used.
[0109] A further audio signal dependent indicator that is suitable for an adaptation of the frame loss concealment method can be based on the codec parameters transmitted to the decoder. For instance, the codec may be a multi-mode codec like ITU-T G.718. Such a codec may use particular codec modes for different signal types, and a change of the codec mode in a frame shortly before the frame loss may be regarded as an indicator for a transient.
[0110] Another useful indicator for adaptation of the frame loss concealment is a codec parameter related to a voicing property of the transmitted signal. Voicing relates to highly periodic speech that is generated by a periodic glottal excitation of the human vocal tract.
[0111] A further preferred indicator is whether the signal content is estimated to be music or speech. Such an indicator can be obtained from a signal classifier that may typically be part of the codec. In case the codec performs such a classification and makes a corresponding classification decision available as a coding parameter to the decoder, this parameter is preferably used as signal content indicator to be used for adapting the frame loss concealment method.
[0112] Another indicator that is preferably used for adaptation of the frame loss concealment methods is the burstiness of the frame losses. Burstiness of frame losses means that several frame losses occur in a row, making it hard for the frame loss concealment method to use valid recently decoded signal portions for its operation. A state-of-the-art indicator is the number n.sub.burst of observed frame losses in a row. This counter is incremented by one upon each frame loss and reset to zero upon the reception of a valid frame. This indicator is also used in the context of the present example embodiments of the invention.
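The burst loss counter n.sub.burst described above amounts to a very small piece of state. A minimal sketch, with a hypothetical class name and interface:

```python
class BurstLossCounter:
    """Counts consecutive lost frames (n_burst in the text)."""

    def __init__(self) -> None:
        self.n_burst = 0

    def update(self, frame_lost: bool) -> int:
        """Increment by one on a lost frame, reset to zero on a valid frame."""
        if frame_lost:
            self.n_burst += 1
        else:
            self.n_burst = 0
        return self.n_burst
```

Feeding the counter one flag per frame gives the current burst length, which the adaptations described below compare against a threshold.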
[0113] Adaptation of the Frame Loss Concealment Method
[0114] In case the steps carried out above indicate a condition suggesting an adaptation of the frame loss concealment operation the calculation of the spectrum of the substitution frame is modified.
[0115] While the original calculation of the substitution frame spectrum is done according to the expression $Z(m) = Y(m) \cdot e^{j\theta_k}$, now an adaptation is introduced modifying both magnitude and phase. The magnitude is modified by means of scaling with two factors α(m) and β(m) and the phase is modified with an additive phase component ϑ(m). This leads to the following modified calculation of the substitution frame:
$$Z(m) = \alpha(m) \cdot \beta(m) \cdot Y(m) \cdot e^{j(\theta_k + \vartheta(m))}.$$
[0116] It is to be noted that the original (non-adapted) frame loss concealment method is used if α(m)=1, β(m)=1, and ϑ(m)=0. These respective values are hence the default.
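The modified substitution spectrum can be sketched as below. As a simplification that is not taken from the text, the phase shift is passed as one value per DFT bin (`theta[m]`), which is assumed to already hold the shift θ.sub.k of the sinusoid whose interval contains bin m. The function name and argument layout are hypothetical.

```python
import cmath
from typing import List, Optional, Sequence


def substitution_spectrum(Y: Sequence[complex], theta: Sequence[float],
                          alpha: Optional[Sequence[float]] = None,
                          beta: Optional[Sequence[float]] = None,
                          dither: Optional[Sequence[float]] = None) -> List[complex]:
    """Z(m) = alpha(m) * beta(m) * Y(m) * exp(j * (theta(m) + dither(m))).

    Omitted factors fall back to the defaults alpha = 1, beta = 1, dither = 0,
    which reproduces the non-adapted concealment.
    """
    n = len(Y)
    alpha = [1.0] * n if alpha is None else alpha
    beta = [1.0] * n if beta is None else beta
    dither = [0.0] * n if dither is None else dither
    return [a * b * y * cmath.exp(1j * (t + d))
            for y, t, a, b, d in zip(Y, theta, alpha, beta, dither)]
```

Calling the function without the optional arguments yields the plain phase-evolved prototype spectrum, so the adaptations can be switched on per bin without changing the default path.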
[0117] The general objective with introducing magnitude adaptations is to avoid audible artifacts of the frame loss concealment method. Such artifacts may be musical or tonal sounds or strange sounds arising from repetitions of transient sounds. Such artifacts would in turn lead to quality degradations, the avoidance of which is the objective of the described adaptations. A suitable way to achieve such adaptations is to modify the magnitude spectrum of the substitution frame to a suitable degree.
[0118]
[0119] It has however been found that it is beneficial to perform the attenuation with gradually increasing degree. One preferred embodiment which accomplishes this is to define a logarithmic parameter specifying a logarithmic increase in attenuation per frame, att_per_frame. Then, in case the burst counter exceeds the threshold the gradually increasing attenuation factor is calculated by
$$\alpha(m) = 10^{c \cdot \mathrm{att\_per\_frame} \cdot (n_{burst} - thr_{burst})}.$$
[0120] Here the constant c is merely a scaling constant that allows the parameter att_per_frame to be specified, for instance, in decibels (dB).
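The gradually increasing attenuation can be sketched as follows. The text does not give a value for c; the sketch assumes c = −1/20, so that a positive att_per_frame in dB maps to an amplitude factor below 1. The function name and the default values for the threshold and the attenuation per frame are likewise hypothetical.

```python
def attenuation_factor(n_burst: int, thr_burst: int = 3,
                       att_per_frame_db: float = 3.0,
                       c: float = -1.0 / 20.0) -> float:
    """alpha = 10 ** (c * att_per_frame * (n_burst - thr_burst)) above the threshold.

    Assumption: c = -1/20 converts a positive dB value into an amplitude gain.
    At or below the threshold the default alpha = 1 (no attenuation) is kept.
    """
    if n_burst <= thr_burst:
        return 1.0
    return 10.0 ** (c * att_per_frame_db * (n_burst - thr_burst))
```

With a threshold of 3 and 6 dB per frame, the fifth consecutive lost frame would be scaled by 10^(−0.6), i.e. attenuated by 12 dB under these assumptions.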
[0121] An additional preferred adaptation is done in response to the indicator whether the signal is estimated to be music or speech. For music content in comparison with speech content it is preferable to increase the threshold thr.sub.burst and to decrease the attenuation per frame. This is equivalent to performing the adaptation of the frame loss concealment method with a lower degree. The background of this kind of adaptation is that music is generally less sensitive to longer loss bursts than speech. Hence, the original, i.e. the unmodified, frame loss concealment method is still preferable for this case, at least for a larger number of frame losses in a row.
[0122] A further adaptation of the concealment method with regard to the magnitude attenuation factor is preferably done in case a transient has been detected, i.e. when the indicator R.sub.l/r,band(k), or alternatively R.sub.l/r(m) or R.sub.l/r, has passed a threshold, 122. In that case a suitable adaptation action, 125, is to modify the second magnitude attenuation factor β(m) such that the total attenuation is controlled by the product of the two factors α(m)·β(m).
[0123] β(m) is set in response to an indicated transient. In case an offset is detected the factor β(m) is preferably chosen to reflect the energy decrease of the offset. A suitable choice is to set β(m) to the detected gain change:
$$\beta(m) = \sqrt{R_{l/r,band}(k)}, \quad \text{for } m \in I_k,\ k = 1 \ldots K.$$
[0124] In case an onset is detected it is instead found advantageous to limit the energy increase in the substitution frame. In that case the factor can be set to some fixed value of e.g. 1, meaning that there is neither attenuation nor amplification.
[0125] In the above it is to be noted that the magnitude attenuation factor is preferably applied frequency selectively, i.e. with individually calculated factors for each frequency band. In case the band approach is not used, the corresponding magnitude attenuation factors can still be obtained in an analogous way. β(m) can then be set individually for each DFT bin in case frequency selective transient detection is used on DFT bin level. Or, in case no frequency selective transient indication is used at all, β(m) can be globally identical for all m.
[0126] A further preferred adaptation of the magnitude attenuation factor is done in conjunction with a modification of the phase by means of the additional phase component ϑ(m) 127. In case for a given m such a phase modification is used, the attenuation factor β(m) is reduced even further. Preferably, even the degree of phase modification is taken into account. If the phase modification is only moderate, β(m) is only scaled down slightly, while if the phase modification is strong, β(m) is scaled down to a larger degree.
[0127] The general objective with introducing phase adaptations is to avoid too strong tonality or signal periodicity in the generated substitution frames, which in turn would lead to quality degradations. A suitable way to achieve such adaptations is to randomize or dither the phase to a suitable degree.
[0128] Such phase dithering is accomplished if the additional phase component ϑ(m) is set to a random value scaled with some control factor: ϑ(m)=a(m)·rand(·).
[0129] The random value obtained by the function rand(·) is for instance generated by some pseudo-random number generator. It is here assumed that it provides a random number within the interval [0, 2π].
[0130] The scaling factor a(m) in the above equation controls the degree by which the original phase θ.sub.k is dithered. The following embodiments address the phase adaptation by means of controlling this scaling factor. The control of the scaling factor is done in a way analogous to the control of the magnitude modification factors described above.
[0131] According to a first embodiment the scaling factor a(m) is adapted in response to the burst loss counter. If the burst loss counter n.sub.burst exceeds some threshold thr.sub.burst, e.g. thr.sub.burst=3, a value larger than 0 is used, e.g. a(m)=0.2.
[0132] It has however been found that it is beneficial to perform the dithering with gradually increasing degree. One preferred embodiment which accomplishes this is to define a parameter specifying an increase in dithering per frame, dith_increase_per_frame.
[0133] Then in case the burst counter exceeds the threshold the gradually increasing dithering control factor is calculated by
$$a(m) = \mathrm{dith\_increase\_per\_frame} \cdot (n_{burst} - thr_{burst}).$$
[0134] It is to be noted in the above formula that a(m) has to be limited to a maximum value of 1 for which full phase dithering is achieved.
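The dithering control of paragraphs [0128] to [0134] can be sketched as follows. The function names and the default increment of 0.2 per frame are hypothetical; the threshold of 3, the cap at 1 and the random range [0, 2π] follow the text. Python's `random.Random.uniform` stands in for the unspecified rand(·).

```python
import math
import random


def dither_scale(n_burst: int, thr_burst: int = 3,
                 dith_increase_per_frame: float = 0.2) -> float:
    """Scaling factor a: 0 up to the threshold, then growing linearly, capped at 1."""
    if n_burst <= thr_burst:
        return 0.0
    return min(1.0, dith_increase_per_frame * (n_burst - thr_burst))


def phase_dither(a: float, rng: random.Random) -> float:
    """Additional phase component: a times a random value in [0, 2*pi]."""
    return a * rng.uniform(0.0, 2.0 * math.pi)
```

A factor of 0 leaves the evolved phase untouched, while a factor of 1 spreads the added phase over the full circle, which corresponds to the full phase dithering mentioned above.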
[0135] It is to be noted that the burst loss threshold value thr.sub.burst used for initiating phase dithering may be the same threshold as the one used for magnitude attenuation. However, better quality can be obtained by setting these thresholds to individually optimal values, which generally means that these thresholds may be different.
[0136] An additional preferred adaptation is done in response to the indicator whether the signal is estimated to be music or speech. For music content in comparison with speech content it is preferable to increase the threshold thr.sub.burst, meaning that phase dithering for music as compared to speech is done only in case of more lost frames in a row. This is equivalent to performing the adaptation of the frame loss concealment method for music with a lower degree. The background of this kind of adaptation is that music is generally less sensitive to longer loss bursts than speech. Hence, the original, i.e. unmodified, frame loss concealment method is still preferable for this case, at least for a larger number of frame losses in a row.
[0137] A further preferred embodiment is to adapt the phase dithering in response to a detected transient. In that case a stronger degree of phase dithering can be used for the DFT bins m for which a transient is indicated either for that bin, the DFT bins of the corresponding frequency band or of the whole frame.
[0138] Part of the schemes described addresses optimization of the frame loss concealment method for harmonic signals and particularly for voiced speech.
[0139] In case the methods using an enhanced frequency estimation as described above are not realized, another possibility for optimizing the quality for voiced speech signals is to switch to some other frame loss concealment method that is specifically designed and optimized for speech rather than for general audio signals containing music and speech. In that case, the indicator that the signal comprises a voiced speech signal is used to select another speech-optimized frame loss concealment scheme rather than the schemes described above.
[0140] The embodiments apply to a controller in a decoder, as illustrated in
[0141] The decoder with its included units could be implemented in hardware. There are numerous variants of circuitry elements that can be used and combined to achieve the functions of the units of the decoder. Such variants are encompassed by the embodiments. Particular examples of hardware implementation of the decoder are implementation in digital signal processor (DSP) hardware and integrated circuit technology, including both general-purpose electronic circuitry and application-specific circuitry.
[0142] The decoder 150 described herein could alternatively be implemented e.g. as illustrated in
[0143] The technology described above may be used e.g. in a receiver, which can be used in a mobile device (e.g. mobile phone, laptop) or a stationary device, such as a personal computer.
[0144] It is to be understood that the choice of interacting units or modules, as well as the naming of the units are only for exemplary purpose, and may be configured in a plurality of alternative ways in order to be able to execute the disclosed process actions.
[0145] It should also be noted that the units or modules described in this disclosure are to be regarded as logical entities and not with necessity as separate physical entities. It will be appreciated that the scope of the technology disclosed herein fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of this disclosure is accordingly not to be limited.
[0146] Reference to an element in the singular is not intended to mean “one and only one” unless explicitly so stated, but rather “one or more.” All structural and functional equivalents to the elements of the above-described embodiments that are known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed hereby. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the technology disclosed herein, for it to be encompassed hereby.
[0147] In the preceding description, for purposes of explanation and not limitation, specific details are set forth such as particular architectures, interfaces, techniques, etc. in order to provide a thorough understanding of the disclosed technology. However, it will be apparent to those skilled in the art that the disclosed technology may be practiced in other embodiments and/or combinations of embodiments that depart from these specific details. That is, those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the disclosed technology. In some instances, detailed descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the disclosed technology with unnecessary detail. All statements herein reciting principles, aspects, and embodiments of the disclosed technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, e.g. any elements developed that perform the same function, regardless of structure.
[0148] Thus, for example, it will be appreciated by those skilled in the art that the figures herein can represent conceptual views of illustrative circuitry or other functional units embodying the principles of the technology, and/or various processes which may be substantially represented in computer readable medium and executed by a computer or processor, even though such computer or processor may not be explicitly shown in the figures.
[0149] The functions of the various elements including functional blocks may be provided through the use of hardware such as circuit hardware and/or hardware capable of executing software in the form of coded instructions stored on computer readable medium. Thus, such functions and illustrated functional blocks are to be understood as being either hardware-implemented and/or computer-implemented, and thus machine-implemented.
[0150] The embodiments described above are to be understood as a few illustrative examples of the present invention. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the scope of the present invention. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible.