Audio decoder and method for providing a decoded audio information using an error concealment based on a time domain excitation signal
10269358 · 2019-04-23
Assignee
Inventors
- Jérémie Lecomte (Fuerth, DE)
- Goran Markovic (Nuremberg, DE)
- Michael Schnabel (Geroldsgruen, DE)
- Grzegorz Pietrzyk (Nuremberg, DE)
Cpc classification
G10L19/09
PHYSICS
G10L19/08
PHYSICS
G10L19/02
PHYSICS
G10L19/005
PHYSICS
International classification
G10L19/005
PHYSICS
G10L19/02
PHYSICS
G10L19/09
PHYSICS
Abstract
An audio decoder for providing a decoded audio information on the basis of an encoded audio information includes an error concealment configured to provide an error concealment audio information for concealing a loss of an audio frame following an audio frame encoded in a frequency domain representation using a time domain excitation signal.
Claims
1. An audio decoder for providing a decoded audio information on the basis of an encoded audio information, the audio decoder comprising: an error concealment unit configured to provide an error concealment audio information for concealing a loss of an audio frame following an audio frame encoded in a frequency domain representation using a time domain excitation signal; wherein the audio decoder comprises a frequency-domain decoder core configured to apply a scale-factor-based scaling to a plurality of spectral values derived from the frequency-domain representation, and wherein the error concealment unit is configured to provide the error concealment audio information for concealing a loss of an audio frame following an audio frame encoded in a frequency domain representation comprising a plurality of encoded scale factors using a time domain excitation signal derived from the frequency domain representation; wherein the error concealment unit is configured to acquire the time domain excitation signal on the basis of the audio frame encoded in the frequency domain representation preceding a lost audio frame.
2. The audio decoder according to claim 1, wherein the error concealment unit is configured to derive the error concealment audio information on the basis of at least three partially overlapping frames or windows preceding a lost audio frame or a lost window.
3. The audio decoder according to claim 1, wherein the audio decoder comprises a frequency-domain decoder core configured to derive a time domain audio signal representation from the frequency-domain representation without using a time domain excitation signal as an intermediate quantity for the audio frame encoded in the frequency domain representation.
4. The audio decoder according to claim 1, wherein the error concealment unit is configured to acquire the time domain excitation signal on the basis of the audio frame encoded in the frequency domain representation preceding a lost audio frame, and wherein the error concealment unit is configured to provide the error concealment audio information for concealing the lost audio frame using said time domain excitation signal.
5. The audio decoder according to claim 1, wherein the error concealment unit is configured to perform an LPC analysis on the basis of the audio frame encoded in the frequency domain representation preceding the lost audio frame, to acquire a set of linear-prediction-coding parameters and the time-domain excitation signal representing an audio content of the audio frame encoded in the frequency domain representation preceding the lost audio frame; or wherein the error concealment unit is configured to perform an LPC analysis on the basis of the audio frame encoded in the frequency domain representation preceding the lost audio frame, to acquire the time-domain excitation signal representing an audio content of the audio frame encoded in the frequency domain representation preceding the lost audio frame; or wherein the audio decoder is configured to acquire a set of linear-prediction-coding parameters using a linear-prediction-coding parameter estimation; or wherein the audio decoder is configured to acquire a set of linear-prediction-coding parameters on the basis of a set of scale factors using a transform.
6. The audio decoder according to claim 1, wherein the error concealment unit is configured to acquire a pitch information describing a pitch of the audio frame encoded in the frequency domain representation preceding the lost audio frame, and to provide the error concealment audio information in dependence on the pitch information.
7. The audio decoder according to claim 6, wherein the error concealment unit is configured to acquire the pitch information on the basis of the time domain excitation signal derived from the audio frame encoded in the frequency domain representation preceding the lost audio frame.
8. The audio decoder according to claim 7, wherein the error concealment unit is configured to evaluate a cross correlation of the time domain excitation signal or the time domain signal, to determine a coarse pitch information, and wherein the error concealment unit is configured to refine the coarse pitch information using a closed loop search around a pitch determined by the coarse pitch information.
9. The audio decoder according to claim 1, wherein the error concealment unit is configured to acquire a pitch information on the basis of a side information of the encoded audio information.
10. The audio decoder according to claim 1, wherein the error concealment unit is configured to acquire a pitch information on the basis of a pitch information available for a previously decoded audio frame.
11. The audio decoder according to claim 1, wherein the error concealment unit is configured to acquire a pitch information on the basis of a pitch search performed on a time domain signal or on a residual signal.
12. The audio decoder according to claim 1, wherein the error concealment unit is configured to copy a pitch cycle of the time domain excitation signal derived from the audio frame encoded in the frequency domain representation preceding the lost audio frame one time or multiple times, in order to acquire an excitation signal for a synthesis of the error concealment audio information.
13. The audio decoder according to claim 12, wherein the error concealment unit is configured to low-pass filter the pitch cycle of the time domain excitation signal derived from the time domain representation of the audio frame encoded in the frequency domain representation preceding the lost audio frame using a sampling-rate dependent filter, a bandwidth of which is dependent on a sampling rate of the audio frame encoded in a frequency domain representation.
14. The audio decoder according to claim 12, wherein the error concealment unit is configured to change the spectral shape of a noise signal using a pre-emphasis filter wherein the noise signal is combined with the extrapolated time domain excitation signal if the audio frame encoded in a frequency domain representation preceding the lost audio frame is a voiced audio frame or comprises an onset.
15. The audio decoder according to claim 1, wherein the error concealment unit is configured to combine an extrapolated time domain excitation signal and a noise signal, in order to acquire an input signal for an LPC synthesis, and wherein the error concealment unit is configured to perform the LPC synthesis, wherein the LPC synthesis is configured to filter the input signal of the LPC synthesis in dependence on linear-prediction-coding parameters, in order to acquire the error concealment audio information.
16. The audio decoder according to claim 15, wherein the error concealment unit is configured to compute a gain of the extrapolated time domain excitation signal, which is used to acquire the input signal for the LPC synthesis, using a correlation in the time domain which is performed on the basis of a time domain representation of the audio frame encoded in the frequency domain representation preceding the lost audio frame, wherein a correlation lag is set in dependence on a pitch information acquired on the basis of the time-domain excitation signal, or using a correlation in the excitation domain.
17. The audio decoder according to claim 15, wherein the error concealment unit is configured to high-pass filter the noise signal which is combined with the extrapolated time domain excitation signal.
18. The audio decoder according to claim 1, wherein the error concealment unit is configured to predict a pitch at the end of a lost frame, and wherein the error concealment unit is configured to adapt the time domain excitation signal, or one or more copies thereof, to the predicted pitch, in order to acquire an input signal for an LPC synthesis.
19. The audio decoder according to claim 1, wherein the error concealment unit is configured to compute a gain of the noise signal in dependence on a correlation in the time domain which is performed on the basis of a time domain representation of the audio frame encoded in the frequency domain representation preceding the lost audio frame.
20. The audio decoder according to claim 1, wherein the error concealment unit is configured to modify a time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, in order to acquire the error concealment audio information.
21. The audio decoder according to claim 20, wherein the error concealment unit is configured to use one or more modified copies of the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, in order to acquire the error concealment information.
22. The audio decoder according to claim 20, wherein the error concealment unit is configured to modify the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, or one or more copies thereof, to thereby reduce a periodic component of the error concealment audio information over time.
23. The audio decoder according to claim 22, wherein the error concealment unit is configured to gradually reduce a gain applied to scale the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, or the one or more copies thereof.
24. The audio decoder according to claim 23, wherein the error concealment unit is configured to adjust the speed used to gradually reduce a gain applied to scale the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, or the one or more copies thereof, in dependence on a length of a pitch period of the time domain excitation signal, such that a time domain excitation signal input into an LPC synthesis is faded out faster for signals having a shorter length of the pitch period when compared to signals having a larger length of the pitch period.
25. The audio decoder according to claim 23, wherein the error concealment unit is configured to adjust the speed used to gradually reduce a gain applied to scale the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, or the one or more copies thereof, in dependence on a result of a pitch analysis or a pitch prediction, such that a deterministic component of a time domain excitation signal input into an LPC synthesis is faded out faster for signals having a larger pitch change per time unit when compared to signals having a smaller pitch change per time unit, and/or such that a deterministic component of a time domain excitation signal input into an LPC synthesis is faded out faster for signals for which a pitch prediction fails when compared to signals for which the pitch prediction succeeds.
26. The audio decoder according to claim 22, wherein the error concealment unit is configured to adjust a speed used to gradually reduce a gain applied to scale the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, or the one or more copies thereof, in dependence on one or more parameters of one or more audio frames preceding the lost audio frame, and/or in dependence on a number of consecutive lost audio frames.
27. The audio decoder according to claim 20, wherein the error concealment unit is configured to scale the time domain excitation signal acquired on the basis of one or more audio frames preceding the lost audio frame, or one or more copies thereof, to thereby modify the time domain excitation signal.
28. The audio decoder according to claim 20, wherein the error concealment unit is configured to time-scale the time domain excitation signal acquired on the basis of one or more audio frames preceding a lost audio frame, or the one or more copies thereof, in dependence on a prediction of a pitch for the time of the one or more lost audio frames.
29. The audio decoder according to claim 1, wherein the error concealment unit is configured to provide the error concealment audio information for a time which is longer than a temporal duration of the one or more lost audio frames.
30. The audio decoder according to claim 29, wherein the error concealment unit is configured to perform an overlap-and-add of the error concealment audio information and a time domain representation of one or more properly received audio frames following the one or more lost audio frames.
31. A method for providing a decoded audio information on the basis of an encoded audio information, the method comprising: providing an error concealment audio information for concealing a loss of an audio frame following an audio frame encoded in a frequency domain representation using a time domain excitation signal; and applying a scale-factor-based scaling to a plurality of spectral values derived from the frequency-domain representation; wherein the error concealment audio information for concealing a loss of an audio frame following an audio frame encoded in a frequency domain representation comprising a plurality of encoded scale factors is provided using a time domain excitation signal derived from the frequency domain representation; wherein the time domain excitation signal is acquired on the basis of the audio frame encoded in the frequency domain representation preceding a lost audio frame.
32. A non-transitory digital storage medium having a computer program stored thereon to perform the method according to claim 31 when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
DETAILED DESCRIPTION OF THE INVENTION
(13) 1. Audio Decoder According to
(14)
(15) The audio decoder 100 may comprise a decoding/processing 120, which provides the decoded audio information on the basis of the encoded audio information in the absence of a frame loss.
(16) The audio decoder 100 further comprises an error concealment 130, which provides an error concealment audio information. The error concealment 130 is configured to provide the error concealment audio information 132 for concealing a loss of an audio frame following an audio frame encoded in the frequency domain representation, using a time domain excitation signal.
(17) In other words, the decoding/processing 120 may provide a decoded audio information 122 for audio frames which are encoded in the form of a frequency domain representation, i.e. in the form of an encoded representation, encoded values of which describe intensities in different frequency bins. Worded differently, the decoding/processing 120 may, for example, comprise a frequency domain audio decoder, which derives a set of spectral values from the encoded audio information 110 and performs a frequency-domain-to-time-domain transform to thereby derive a time domain representation which constitutes the decoded audio information 122 or which forms the basis for the provision of the decoded audio information 122 in case there is additional post processing.
(18) However, the error concealment 130 does not perform the error concealment in the frequency domain but rather uses a time domain excitation signal, which may, for example, serve to excite a synthesis filter, such as an LPC synthesis filter, which provides a time domain representation of an audio signal (for example, the error concealment audio information) on the basis of the time domain excitation signal and also on the basis of LPC filter coefficients (linear-prediction-coding filter coefficients).
(19) Accordingly, the error concealment 130 provides the error concealment audio information 132, which may, for example, be a time domain audio signal, for lost audio frames, wherein the time domain excitation signal used by the error concealment 130 may be based on, or derived from, one or more previous, properly received audio frames (preceding the lost audio frame), which are encoded in the form of a frequency domain representation. To conclude, the audio decoder 100 may perform an error concealment (i.e. provide an error concealment audio information 132), which reduces a degradation of an audio quality due to the loss of an audio frame on the basis of an encoded audio information, in which at least some audio frames are encoded in a frequency domain representation. It has been found that performing the error concealment using a time domain excitation signal even if a frame following a properly received audio frame encoded in the frequency domain representation is lost, brings along an improved audio quality when compared to an error concealment which is performed in the frequency domain (for example, using a frequency domain representation of the audio frame encoded in the frequency domain representation preceding the lost audio frame). This is due to the fact that a smooth transition between the decoded audio information associated with the properly received audio frame preceding the lost audio frame and the error concealment audio information associated with the lost audio frame can be achieved using a time domain excitation signal, since the signal synthesis, which is typically performed on the basis of the time domain excitation signal, helps to avoid discontinuities. Thus, a good (or at least acceptable) hearing impression can be achieved using the audio decoder 100, even if an audio frame is lost which follows a properly received audio frame encoded in the frequency domain representation. 
For example, the time domain approach brings an improvement for monophonic signals, like speech, because it is closer to what is done in speech codec concealment. The usage of LPC helps to avoid discontinuities and gives a better shaping of the frames.
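The relationship between a time domain signal, a time domain excitation signal and an LPC synthesis filter can be illustrated by a minimal sketch. The function names `lpc_residual` and `lpc_synthesis`, the coefficient list `a` and the sign convention A(z) = 1 + a[0]·z^-1 + a[1]·z^-2 + ... are assumptions made for illustration only and are not taken from the embodiments. The sketch merely shows that filtering a decoded time domain signal with A(z) yields an excitation signal, and that exciting the synthesis filter 1/A(z) with this excitation signal reproduces the time domain signal.

```python
def lpc_residual(signal, a):
    """Analysis filter A(z): e[n] = x[n] + sum_k a[k] * x[n-1-k].

    `a` holds hypothetical LPC coefficients (without the leading 1).
    Samples before the start of `signal` are assumed to be zero.
    """
    residual = []
    for n, x in enumerate(signal):
        acc = x
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * signal[n - 1 - k]
        residual.append(acc)
    return residual


def lpc_synthesis(excitation, a):
    """Synthesis filter 1/A(z): y[n] = e[n] - sum_k a[k] * y[n-1-k].

    The filter memory is assumed to be zero at the start.
    """
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y


# Round trip: time signal -> excitation -> time signal.
decoded = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
coeffs = [-0.9, 0.2]
excitation = lpc_residual(decoded, coeffs)
resynthesized = lpc_synthesis(excitation, coeffs)
```

In a concealment, the excitation signal would not be fed back unchanged but extrapolated (and possibly modified) before the synthesis. The round trip above only illustrates why the synthesis continues smoothly from the preceding signal.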
(20) Moreover, it should be noted that the audio decoder 100 can be supplemented by any of the features and functionalities described in the following, either individually or taken in combination.
(21) 2. Audio Decoder According to
(22)
(23) The audio decoder 200 may typically comprise a decoding/processing 230, which may, for example, provide a decoded audio information 232 for audio frames which are properly received. In other words, the decoding/processing 230 may perform a frequency domain decoding (for example, an AAC-type decoding, or the like) on the basis of one or more encoded audio frames encoded in a frequency domain representation. Alternatively, or in addition, the decoding/processing 230 may be configured to perform a time domain decoding (or linear-prediction-domain decoding) on the basis of one or more encoded audio frames encoded in a time domain representation (or, in other words, in a linear-prediction-domain representation), like, for example, a TCX-excited linear-prediction decoding (TCX=transform-coded excitation) or an ACELP decoding (algebraic-codebook-excited-linear-prediction-decoding). Optionally, the decoding/processing 230 may be configured to switch between different decoding modes.
(24) The audio decoder 200 further comprises an error concealment 240, which is configured to provide an error concealment audio information 242 for one or more lost audio frames. The error concealment 240 is configured to provide the error concealment audio information 242 for concealing a loss of an audio frame (or even a loss of multiple audio frames). The error concealment 240 is configured to modify a time domain excitation signal obtained on the basis of one or more audio frames preceding a lost audio frame, in order to obtain the error concealment audio information 242. Worded differently, the error concealment 240 may obtain (or derive) a time domain excitation signal for (or on the basis of) one or more encoded audio frames preceding a lost audio frame, and may modify said time domain excitation signal, which is obtained for (or on the basis of) one or more properly received audio frames preceding a lost audio frame, to thereby obtain (by the modification) a time domain excitation signal which is used for providing the error concealment audio information 242. In other words, the modified time domain excitation signal may be used as an input (or as a component of an input) for a synthesis (for example, LPC synthesis) of the error concealment audio information associated with the lost audio frame (or even with multiple lost audio frames). By providing the error concealment audio information 242 on the basis of the time domain excitation signal obtained on the basis of one or more properly received audio frames preceding the lost audio frame, audible discontinuities can be avoided. 
On the other hand, by modifying the time domain excitation signal derived for (or from) one or more audio frames preceding the lost audio frame, and by providing the error concealment audio information on the basis of the modified time domain excitation signal, it is possible to consider varying characteristics of the audio content (for example, a pitch change), and it is also possible to avoid an unnatural hearing impression (for example, by fading out a deterministic (for example, at least approximately periodic) signal component). Thus, it can be achieved that the error concealment audio information 242 comprises some similarity with the decoded audio information 232 obtained on the basis of properly decoded audio frames preceding the lost audio frame, and it can still be achieved that the error concealment audio information 242 comprises a somewhat different audio content when compared to the decoded audio information 232 associated with the audio frame preceding the lost audio frame by somewhat modifying the time domain excitation signal. The modification of the time domain excitation signal used for the provision of the error concealment audio information (associated with the lost audio frame) may, for example, comprise an amplitude scaling or a time scaling. However, other types of modification (or even a combination of an amplitude scaling and a time scaling) are possible, wherein a certain degree of relationship between the time domain excitation signal obtained (as an input information) by the error concealment and the modified time domain excitation signal should remain.
(25) To conclude, the audio decoder 200 makes it possible to provide the error concealment audio information 242 such that it provides a good hearing impression even if one or more audio frames are lost. The error concealment is performed on the basis of a time domain excitation signal, wherein a variation of the signal characteristics of the audio content during the lost audio frame is considered by modifying the time domain excitation signal obtained on the basis of the one or more audio frames preceding a lost audio frame.
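A modification of the time domain excitation signal by amplitude scaling may be sketched as follows. The function name `extrapolate_excitation`, the linear gain ramp and the parameters `start_gain` and `end_gain` are hypothetical choices made for illustration. The sketch copies the last pitch cycle of a past excitation signal as often as needed and gradually reduces the gain, so that the periodic component fades over time.

```python
def extrapolate_excitation(past_excitation, pitch_lag, n_out,
                           start_gain=1.0, end_gain=0.5):
    """Repeat the last pitch cycle and fade it with a linear gain ramp.

    past_excitation: excitation samples of the preceding good frame(s)
    pitch_lag:       assumed pitch period in samples
    n_out:           number of excitation samples to generate
    """
    if not 0 < pitch_lag <= len(past_excitation):
        raise ValueError("pitch_lag must lie within the past excitation")
    cycle = past_excitation[-pitch_lag:]
    out = []
    for n in range(n_out):
        # Gain moves linearly from start_gain (first sample) to end_gain (last).
        gain = start_gain + (end_gain - start_gain) * n / max(n_out - 1, 1)
        out.append(gain * cycle[n % pitch_lag])
    return out


faded = extrapolate_excitation([0.0, 1.0, 2.0, 3.0], pitch_lag=2, n_out=5,
                               start_gain=1.0, end_gain=0.0)
# faded == [2.0, 2.25, 1.0, 0.75, 0.0]
```

A time scaling in dependence on a predicted pitch, which is also mentioned above, is not covered by this sketch.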
(26) Moreover, it should be noted that the audio decoder 200 can be supplemented by any of the features and functionalities described herein, either individually or in combination.
(27) 3. Audio Decoder According to
(28)
(29) The audio decoder 300 is configured to receive an encoded audio information 310 and to provide, on the basis thereof, a decoded audio information 312. The audio decoder 300 comprises a bitstream analyzer 320, which may also be designated as a bitstream deformatter or bitstream parser. The bitstream analyzer 320 receives the encoded audio information 310 and provides, on the basis thereof, a frequency domain representation 322 and possibly additional control information 324. The frequency domain representation 322 may, for example, comprise encoded spectral values 326, encoded scale factors 328 and, optionally, an additional side information 330 which may, for example, control specific processing steps, like, for example, a noise filling, an intermediate processing or a post-processing. The audio decoder 300 also comprises a spectral value decoding 340 which is configured to receive the encoded spectral values 326, and to provide, on the basis thereof, a set of decoded spectral values 342. The audio decoder 300 may also comprise a scale factor decoding 350, which may be configured to receive the encoded scale factors 328 and to provide, on the basis thereof, a set of decoded scale factors 352.
(30) Alternatively to the scale factor decoding, an LPC-to-scale factor conversion 354 may be used, for example, in the case that the encoded audio information comprises an encoded LPC information, rather than a scale factor information. In some coding modes (for example, in the TCX decoding mode of the USAC audio decoder or in the EVS audio decoder), a set of LPC coefficients may be used to derive a set of scale factors at the side of the audio decoder. This functionality may be achieved by the LPC-to-scale factor conversion 354.
(31) The audio decoder 300 may also comprise a scaler 360, which may be configured to apply the set of decoded scale factors 352 to the set of spectral values 342, to thereby obtain a set of scaled decoded spectral values 362. For example, a first frequency band comprising multiple decoded spectral values 342 may be scaled using a first scale factor, and a second frequency band comprising multiple decoded spectral values 342 may be scaled using a second scale factor. Accordingly, the set of scaled decoded spectral values 362 is obtained. The audio decoder 300 may further comprise an optional processing 366, which may apply some processing to the scaled decoded spectral values 362. For example, the optional processing 366 may comprise a noise filling or some other operations.
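The band-wise scaling performed by the scaler 360 may be illustrated by the following minimal sketch. The function name `apply_scale_factors` and the band layout `band_offsets` are assumptions for illustration. It is further assumed that the decoded scale factors are already available as linear gains, whereas an actual codec may define a different mapping from transmitted scale factor values to gains.

```python
def apply_scale_factors(spectral_values, band_offsets, scale_factors):
    """Scale each frequency band with its own scale factor.

    band_offsets:  start indices of the bands plus one final end index,
                   e.g. [0, 2, 4] describes the bands [0, 2) and [2, 4)
    scale_factors: one linear gain per band (assumption, see lead-in)
    """
    if len(band_offsets) != len(scale_factors) + 1:
        raise ValueError("need exactly one scale factor per band")
    scaled = list(spectral_values)
    for band, gain in enumerate(scale_factors):
        for i in range(band_offsets[band], band_offsets[band + 1]):
            scaled[i] = gain * spectral_values[i]
    return scaled


scaled_values = apply_scale_factors([1.0, 2.0, 3.0, 4.0], [0, 2, 4], [2.0, 0.5])
# scaled_values == [2.0, 4.0, 1.5, 2.0]
```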
(32) The audio decoder 300 also comprises a frequency-domain-to-time-domain transform 370, which is configured to receive the scaled decoded spectral values 362, or a processed version 368 thereof, and to provide a time domain representation 372 associated with a set of scaled decoded spectral values 362. For example, the frequency-domain-to-time domain transform 370 may provide a time domain representation 372, which is associated with a frame or sub-frame of the audio content. For example, the frequency-domain-to-time-domain transform may receive a set of MDCT coefficients (which can be considered as scaled decoded spectral values) and provide, on the basis thereof, a block of time domain samples, which may form the time domain representation 372.
(33) The audio decoder 300 may optionally comprise a post-processing 376, which may receive the time domain representation 372 and somewhat modify the time domain representation 372, to thereby obtain a post-processed version 378 of the time domain representation 372.
(34) The audio decoder 300 also comprises an error concealment 380 which may, for example, receive the time domain representation 372 from the frequency-domain-to-time-domain transform 370 and which may, for example, provide an error concealment audio information 382 for one or more lost audio frames. In other words, if an audio frame is lost, such that, for example, no encoded spectral values 326 are available for said audio frame (or audio sub-frame), the error concealment 380 may provide the error concealment audio information on the basis of the time domain representation 372 associated with one or more audio frames preceding the lost audio frame. The error concealment audio information may typically be a time domain representation of an audio content.
(35) It should be noted that the error concealment 380 may, for example, perform the functionality of the error concealment 130 described above. Also, the error concealment 380 may, for example, comprise the functionality of the error concealment 500 described taking reference to
(36) Regarding the error concealment, it should be noted that the error concealment does not happen at the same time as the frame decoding. For example, if frame n is good, a normal decoding is performed, and at the end some variables are saved that will help if the next frame has to be concealed. Then, if frame n+1 is lost, the concealment function is called with the variables coming from the previous good frame. Some variables are also updated to help with the next frame loss or with the recovery at the next good frame.
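The interplay between normal decoding and concealment described above may be sketched as a simple frame loop. The class `ConcealmentState`, the function `decode_stream` and the callables `decode_frame` and `conceal_frame` are hypothetical names introduced for illustration. A lost frame is represented by `None`, and the saved variables are reduced to the last good output and a loss counter.

```python
class ConcealmentState:
    """Variables saved after a good frame for a possible later concealment."""

    def __init__(self):
        self.last_good_output = None   # time domain output of the last good frame
        self.consecutive_losses = 0    # number of frames lost in a row


def decode_stream(frames, decode_frame, conceal_frame):
    """Decode good frames normally and conceal lost frames (given as None)."""
    state = ConcealmentState()
    output = []
    for frame in frames:
        if frame is not None:
            pcm = decode_frame(frame)
            # Save what a later concealment may need.
            state.last_good_output = pcm
            state.consecutive_losses = 0
        else:
            # Update the state first, then conceal from the saved variables.
            # conceal_frame must cope with last_good_output being None
            # if the very first frame is lost.
            state.consecutive_losses += 1
            pcm = conceal_frame(state)
        output.append(pcm)
    return output


# Toy usage: "decoding" repeats a value, "concealment" attenuates the last output.
result = decode_stream(
    [4.0, None, None, 8.0],
    decode_frame=lambda f: [f, f],
    conceal_frame=lambda s: [x * 0.5 ** s.consecutive_losses
                             for x in s.last_good_output],
)
# result == [[4.0, 4.0], [2.0, 2.0], [1.0, 1.0], [8.0, 8.0]]
```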
(37) The audio decoder 300 also comprises a signal combination 390, which is configured to receive the time domain representation 372 (or the post-processed time domain representation 378 in case that there is a post-processing 376). Moreover, the signal combination 390 may receive the error concealment audio information 382, which is typically also a time domain representation of an error concealment audio signal provided for a lost audio frame. The signal combination 390 may, for example, combine time domain representations associated with subsequent audio frames. In the case that there are subsequent properly decoded audio frames, the signal combination 390 may combine (for example, overlap-and-add) time domain representations associated with these subsequent properly decoded audio frames. However, if an audio frame is lost, the signal combination 390 may combine (for example, overlap-and-add) the time domain representation associated with the properly decoded audio frame preceding the lost audio frame and the error concealment audio information associated with the lost audio frame, to thereby have a smooth transition between the properly received audio frame and the lost audio frame. Similarly, the signal combination 390 may be configured to combine (for example, overlap-and-add) the error concealment audio information associated with the lost audio frame and the time domain representation associated with another properly decoded audio frame following the lost audio frame (or another error concealment audio information associated with another lost audio frame in case that multiple consecutive audio frames are lost).
(38) Accordingly, the signal combination 390 may provide a decoded audio information 312, such that the time domain representation 372, or a post processed version 378 thereof, is provided for properly decoded audio frames, and such that the error concealment audio information 382 is provided for lost audio frames, wherein an overlap-and-add operation is typically performed between the audio information (irrespective of whether it is provided by the frequency-domain-to-time-domain transform 370 or by the error concealment 380) of subsequent audio frames. Since some codecs have some aliasing in the overlap-and-add part that needs to be canceled, some artificial aliasing can optionally be created on the half frame that has been created to perform the overlap-and-add.
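An overlap-and-add between the end of one audio information and the beginning of the subsequent audio information may be sketched as follows. The function name `overlap_add` and the complementary linear windows are assumptions for illustration. An actual codec would typically use its own synthesis window, and the optional artificial aliasing mentioned above is not modeled.

```python
def overlap_add(previous_tail, next_head):
    """Cross-fade two equally long overlapping segments.

    previous_tail: last samples of the earlier audio information
                   (for example, the extra part of a concealed frame)
    next_head:     first samples of the following audio information
    The fade-out and fade-in weights sum to one for every sample.
    """
    if len(previous_tail) != len(next_head):
        raise ValueError("overlapping segments must have equal length")
    n = len(previous_tail)
    out = []
    for i in range(n):
        fade_in = (i + 0.5) / n
        out.append((1.0 - fade_in) * previous_tail[i] + fade_in * next_head[i])
    return out


transition = overlap_add([1.0, 1.0], [0.0, 0.0])
# transition == [0.75, 0.25]
```

Because the two weights sum to one, a signal that is identical in both segments passes through the transition unchanged.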
(39) It should be noted that the functionality of the audio decoder 300 is similar to the functionality of the audio decoder 100 according to
(40) 4. Audio Decoder 400 According to
(41)
(42) In the following, some details of the audio decoder 400 will be described.
(43) The audio decoder 400 comprises a bitstream analyzer 420 which may, for example, analyze the encoded audio information 410 and extract, from the encoded audio information 410, a frequency domain representation 422, comprising, for example, encoded spectral values, encoded scale factors and, optionally, an additional side information. The bitstream analyzer 420 may also be configured to extract a linear-prediction coding domain representation 424, which may, for example, comprise an encoded excitation 426 and encoded linear-prediction-coefficients 428 (which may also be considered as encoded linear-prediction parameters). Moreover, the bitstream analyzer may optionally extract additional side information, which may be used for controlling additional processing steps, from the encoded audio information.
(44) The audio decoder 400 comprises a frequency domain decoding path 430, which may, for example, be substantially identical to the decoding path of the audio decoder 300 according to
(45) The audio decoder 400 may also comprise a linear-prediction-domain decoding path 440 (which may also be considered as a time domain decoding path, since the LPC synthesis is performed in the time domain). The linear-prediction-domain decoding path comprises an excitation decoding 450, which receives the encoded excitation 426 provided by the bitstream analyzer 420 and provides, on the basis thereof, a decoded excitation 452 (which may take the form of a decoded time domain excitation signal). For example, the excitation decoding 450 may receive an encoded transform-coded-excitation information, and may provide, on the basis thereof, a decoded time domain excitation signal. Thus, the excitation decoding 450 may, for example, perform a functionality which is performed by the excitation decoder 730 described taking reference to
(46) It should be noted that there are different options for the excitation decoding. Reference is made, for example, to the relevant Standards and publications defining the CELP coding concepts, the ACELP coding concepts, modifications of the CELP coding concepts and of the ACELP coding concepts and the TCX coding concept.
(47) The linear-prediction-domain decoding path 440 optionally comprises a processing 454 in which a processed time domain excitation signal 456 is derived from the time domain excitation signal 452.
(48) The linear-prediction-domain decoding path 440 also comprises a linear-prediction coefficient decoding 460, which is configured to receive encoded linear prediction coefficients and to provide, on the basis thereof, decoded linear prediction coefficients 462.
(49) The linear-prediction coefficient decoding 460 may use different representations of a linear prediction coefficient as an input information 428 and may provide different representations of the decoded linear prediction coefficients as the output information 462. For details, reference is made to different Standard documents in which an encoding and/or decoding of linear prediction coefficients is described.
(50) The linear-prediction-domain decoding path 440 optionally comprises a processing 464, which may process the decoded linear prediction coefficients and provide a processed version 466 thereof.
(51) The linear-prediction-domain decoding path 440 also comprises a LPC synthesis (linear-prediction coding synthesis) 470, which is configured to receive the decoded excitation 452, or the processed version 456 thereof, and the decoded linear prediction coefficients 462, or the processed version 466 thereof, and to provide a decoded time domain audio signal 472. For example, the LPC synthesis 470 may be configured to apply a filtering, which is defined by the decoded linear-prediction coefficients 462 (or the processed version 466 thereof) to the decoded time domain excitation signal 452, or the processed version thereof, such that the decoded time domain audio signal 472 is obtained by filtering (synthesis-filtering) the time domain excitation signal 452 (or 456). The linear prediction domain decoding path 440 may optionally comprise a post-processing 474, which may be used to refine or adjust characteristics of the decoded time domain audio signal 472.
(52) The linear-prediction-domain decoding path 440 also comprises an error concealment 480, which is configured to receive the decoded linear prediction coefficients 462 (or the processed version 466 thereof) and the decoded time domain excitation signal 452 (or the processed version 456 thereof). The error concealment 480 may optionally receive additional information, like for example a pitch information. The error concealment 480 may consequently provide an error concealment audio information, which may be in the form of a time domain audio signal, in case that a frame (or sub-frame) of the encoded audio information 410 is lost. Thus, the error concealment 480 may provide the error concealment audio information 482 such that the characteristics of the error concealment audio information 482 are substantially adapted to the characteristics of a last properly decoded audio frame preceding the lost audio frame. It should be noted that the error concealment 480 may comprise any of the features and functionalities described with respect to the error concealment 240. In addition, it should be noted that the error concealment 480 may also comprise any of the features and functionalities described with respect to the time domain concealment of
(53) The audio decoder 400 also comprises a signal combiner (or signal combination) 490, which is configured to receive the decoded time domain audio signal 372 (or the post-processed version 378 thereof), the error concealment audio information 382 provided by the error concealment 380, the decoded time domain audio signal 472 (or the post-processed version 476 thereof) and the error concealment audio information 482 provided by the error concealment 480. The signal combiner 490 may be configured to combine said signals 372 (or 378), 382, 472 (or 476) and 482 to thereby obtain the decoded audio information 412. In particular, an overlap-and-add operation may be applied by the signal combiner 490. Accordingly, the signal combiner 490 may provide smooth transitions between subsequent audio frames for which the time domain audio signal is provided by different entities (for example, by different decoding paths 430, 440). However, the signal combiner 490 may also provide for smooth transitions if the time domain audio signal is provided by the same entity (for example, frequency domain-to-time-domain transform 370 or LPC synthesis 470) for subsequent frames. Since some codecs have aliasing in the overlap-and-add part that needs to be canceled, artificial aliasing can optionally be created on the extra half frame generated by the concealment in order to perform the overlap-and-add. In other words, an artificial time domain aliasing compensation (TDAC) may optionally be used.
(54) Also, the signal combiner 490 may provide smooth transitions to and from frames for which an error concealment audio information (which is typically also a time domain audio signal) is provided.
(55) To summarize, the audio decoder 400 allows decoding of audio frames which are encoded in the frequency domain and of audio frames which are encoded in the linear prediction domain. In particular, it is possible to switch between a usage of the frequency domain decoding path and a usage of the linear prediction domain decoding path in dependence on the signal characteristics (for example, using a signaling information provided by an audio encoder).
(56) Different types of error concealment may be used for providing an error concealment audio information in the case of a frame loss, depending on whether a last properly decoded audio frame was encoded in the frequency domain (or, equivalently, in a frequency-domain representation), or in the time domain (or equivalently, in a time domain representation, or, equivalently, in a linear-prediction domain, or, equivalently, in a linear-prediction domain representation).
(57) 5. Time Domain Concealment According to
(58)
(59) The error concealment 500 is configured to receive a time domain audio signal 510 and to provide, on the basis thereof, an error concealment audio information 512, which may, for example, take the form of a time domain audio signal.
(60) It should be noted that the error concealment 500 may, for example, take the place of the error concealment 130, such that the error concealment audio information 512 may correspond to the error concealment audio information 132. Moreover, it should be noted that the error concealment 500 may take the place of the error concealment 380, such that the time domain audio signal 510 may correspond to the time domain audio signal 372 (or to the time domain audio signal 378), and such that the error concealment audio information 512 may correspond to the error concealment audio information 382.
(61) The error concealment 500 comprises a pre-emphasis 520, which may be considered as optional. The pre-emphasis receives the time domain audio signal and provides, on the basis thereof, a pre-emphasized time domain audio signal 522.
(62) The error concealment 500 also comprises a LPC analysis 530, which is configured to receive the time domain audio signal 510, or the pre-emphasized version 522 thereof, and to obtain an LPC information 532, which may comprise a set of LPC parameters 532. For example, the LPC information may comprise a set of LPC filter coefficients (or a representation thereof) and a time domain excitation signal (which is adapted for an excitation of an LPC synthesis filter configured in accordance with the LPC filter coefficients, to reconstruct, at least approximately, the input signal of the LPC analysis).
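As an illustration of how an LPC analysis may yield both filter coefficients and a time domain excitation signal, the following minimal sketch uses the autocorrelation method with a Levinson-Durbin recursion and then inverse-filters the input. The choice of this method and the names autocorrelation, levinson_durbin and lpc_residual are assumptions made for illustration; the description does not prescribe a particular analysis algorithm.

```python
def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of the sample list x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for a[0..order] (a[0] = 1) of the analysis filter A(z).

    The prediction residual is e[n] = sum_k a[k] * x[n - k].
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate input: keep remaining taps at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_residual(x, a):
    """Inverse-filter x with A(z) to obtain a time domain excitation."""
    order = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
            for n in range(len(x))]
```

For a signal that was itself produced by a first-order all-pole filter, this analysis recovers the filter coefficient and returns a residual that is close to the original impulse excitation.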
(63) The error concealment 500 also comprises a pitch search 540, which is configured to obtain a pitch information 542, for example, on the basis of a previously decoded audio frame.
(64) The error concealment 500 also comprises an extrapolation 550, which may be configured to obtain an extrapolated time domain excitation signal on the basis of the result of the LPC analysis (for example, on the basis of the time-domain excitation signal determined by the LPC analysis), and possibly on the basis of the result of the pitch search.
(65) The error concealment 500 also comprises a noise generation 560, which provides a noise signal 562. The error concealment 500 also comprises a combiner/fader 570, which is configured to receive the extrapolated time-domain excitation signal 552 and the noise signal 562, and to provide, on the basis thereof, a combined time domain excitation signal 572. The combiner/fader 570 may be configured to combine the extrapolated time domain excitation signal 552 and the noise signal 562, wherein a fading may be performed, such that a relative contribution of the extrapolated time domain excitation signal 552 (which determines a deterministic component of the input signal of the LPC synthesis) decreases over time while a relative contribution of the noise signal 562 increases over time. However, a different functionality of the combiner/fader is also possible. Also, reference is made to the description below.
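The fading performed by the combiner/fader can be sketched as a sample-wise weighted sum in which the weight of the extrapolated excitation ramps down while the weight of the noise ramps up. The linear ramp, the function name combine_and_fade and the gain parameters are illustrative assumptions; as stated above, other functionalities of the combiner/fader are possible.

```python
def combine_and_fade(harmonic, noise, harm_start, harm_end,
                     noise_start, noise_end):
    """Mix two excitation signals with linearly interpolated gains.

    harmonic: extrapolated time domain excitation (deterministic part)
    noise:    noise signal of the same length
    The gains move linearly from their start value to their end value
    over the length of the segment (an assumed, hypothetical ramp).
    """
    n = len(harmonic)
    if n != len(noise):
        raise ValueError("signals must have the same length")
    out = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0
        g_h = harm_start + (harm_end - harm_start) * t
        g_n = noise_start + (noise_end - noise_start) * t
        out.append(g_h * harmonic[i] + g_n * noise[i])
    return out
```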
(66) The error concealment 500 also comprises a LPC synthesis 580, which receives the combined time domain excitation signal 572 and which provides a time domain audio signal 582 on the basis thereof. For example, the LPC synthesis may also receive LPC filter coefficients describing a LPC shaping filter, which is applied to the combined time domain excitation signal 572, to derive the time domain audio signal 582. The LPC synthesis 580 may, for example, use LPC coefficients obtained on the basis of one or more previously decoded audio frames (for example, provided by the LPC analysis 530).
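A minimal sketch of such an LPC synthesis is an all-pole filter 1/A(z) applied to the combined excitation. The coefficient convention (a[0] = 1, with the residual defined as e[n] = sum_k a[k] * x[n - k]), the function name lpc_synthesis and the optional filter memory argument are assumptions made for illustration.

```python
def lpc_synthesis(excitation, a, memory=None):
    """Filter the excitation through the all-pole synthesis filter 1/A(z).

    a:      coefficients a[0..order] with a[0] == 1
    memory: previous output samples (most recent last), used so that
            consecutive frames are filtered without a discontinuity
    """
    order = len(a) - 1
    history = list(memory) if memory is not None else [0.0] * order
    if len(history) < order:
        raise ValueError("memory must hold at least `order` samples")
    out = []
    for e in excitation:
        y = e - sum(a[k] * history[-k] for k in range(1, order + 1))
        out.append(y)
        history.append(y)
    return out
```

Passing the last output samples of the preceding frame as memory keeps the filter state continuous across frame boundaries, which matters when the concealed frame follows a properly decoded frame.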
(67) The error concealment 500 also comprises a de-emphasis 584, which may be considered as being optional. The de-emphasis 584 may provide a de-emphasized error concealment time domain audio signal 586.
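The optional pre-emphasis and de-emphasis can be illustrated as a first-order filter pair in which the de-emphasis inverts the pre-emphasis. The filter order, the coefficient value mu = 0.68 and the function names are assumptions chosen for illustration only; the description does not specify them.

```python
def pre_emphasis(signal, mu=0.68, prev=0.0):
    """y[n] = x[n] - mu * x[n-1]; `prev` is the sample before the segment."""
    out = []
    for x in signal:
        out.append(x - mu * prev)
        prev = x
    return out


def de_emphasis(signal, mu=0.68, prev=0.0):
    """y[n] = x[n] + mu * y[n-1]; inverse of pre_emphasis for the same mu."""
    out = []
    for x in signal:
        prev = x + mu * prev
        out.append(prev)
    return out
```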
(68) The error concealment 500 also comprises, optionally, an overlap-and-add 590, which performs an overlap-and-add operation of time domain audio signals associated with subsequent frames (or sub-frames). However, it should be noted that the overlap-and-add 590 should be considered as optional, since the error concealment may also use a signal combination which is already provided in the audio decoder environment. For example, the overlap-and-add 590 may be replaced by the signal combination 390 in the audio decoder 300 in some embodiments.
(69) In the following, some further details regarding the error concealment 500 will be described.
(70) The error concealment 500 according to
(71) In the following, the sub-units and functionalities of the error concealment 500 will be described in more detail.
(72) 5.1. LPC Analysis
(73) In the embodiment according to
(74) 5.2. Pitch Search
(75) There are different approaches to get the pitch to be used for building the new signal (for example, the error concealment audio information).
(76) In the context of the codec using an LTP filter (long-term-prediction filter), like AAC-LTP, if the last frame was AAC with LTP, we use this last received LTP pitch lag and the corresponding gain for generating the harmonic part. In this case, the gain is used to decide whether to build harmonic part in the signal or not. For example, if the LTP gain is higher than 0.6 (or any other predetermined value), then the LTP information is used to build the harmonic part.
(77) If there is not any pitch information available from the previous frame, then there are, for example, two solutions, which will be described in the following.
(78) For example, it is possible to do a pitch search at the encoder and transmit the pitch lag and the gain in the bitstream. This is similar to the LTP, but no filtering is applied (also no LTP filtering in the clean channel).
(79) Alternatively, it is possible to perform a pitch search in the decoder. The AMR-WB pitch search in case of TCX is done in the FFT domain. In ELD, for example, if the MDCT domain was used then the phases would be missed. Therefore, the pitch search is done directly in the excitation domain. This gives better results than doing the pitch search in the synthesis domain. The pitch search in the excitation domain is done first with an open loop by a normalized cross correlation. Then, optionally, we refine the pitch search by doing a closed loop search around the open loop pitch with a certain delta. Due to the ELD windowing limitations, a wrong pitch could be found, thus we also verify that the found pitch is correct or discard it otherwise.
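The open loop step of such a pitch search can be sketched as follows: the most recent samples of the excitation are compared with lagged copies of the past excitation, and the lag with the highest normalized cross correlation is selected. The function name open_loop_pitch, the lag range and the window length are hypothetical parameters; the closed loop refinement and the validity check mentioned above are not shown.

```python
import math


def open_loop_pitch(excitation, min_lag, max_lag, window):
    """Return (best_lag, best_score) from a normalized cross correlation.

    The last `window` samples are correlated with the segment that lies
    `lag` samples earlier, for every lag in [min_lag, max_lag].
    """
    n = len(excitation)
    if window + max_lag > n:
        raise ValueError("not enough past excitation for this lag range")
    cur = excitation[n - window:]
    energy_cur = sum(c * c for c in cur)
    best_lag, best_score = None, -1.0
    for lag in range(min_lag, max_lag + 1):
        past = excitation[n - window - lag:n - lag]
        num = sum(c * p for c, p in zip(cur, past))
        den = math.sqrt(energy_cur * sum(p * p for p in past))
        score = num / den if den > 0.0 else 0.0
        if score > best_score:  # on ties the shortest lag is kept
            best_lag, best_score = lag, score
    return best_lag, best_score
```

A caller could discard the result when the returned score falls below a chosen threshold, which is one possible way to implement the verification of the found pitch.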
(80) To conclude, the pitch of the last properly decoded audio frame preceding the lost audio frame may be considered when providing the error concealment audio information. In some cases, there is a pitch information available from the decoding of the previous frame (i.e. the last frame preceding the lost audio frame). In this case, this pitch can be reused (possibly with some extrapolation and a consideration of a pitch change over time). We can also optionally reuse the pitch of more than one frame of the past to try to extrapolate the pitch that we need at the end of our concealed frame.
(81) Also, if there is an information (for example, designated as long-term-prediction gain) available, which describes an intensity (or relative intensity) of a deterministic (for example, at least approximately periodic) signal component, this value can be used to decide whether a deterministic (or harmonic) component should be included into the error concealment audio information. In other words, by comparing said value (for example, LTP gain) with a predetermined threshold value, it can be decided whether a time domain excitation signal derived from a previously decoded audio frame should be considered for the provision of the error concealment audio information or not.
(82) If there is no pitch information available from the previous frame (or, more precisely, from the decoding of the previous frame), there are different options. The pitch information could be transmitted from an audio encoder to an audio decoder, which would simplify the audio decoder but create a bitrate overhead. Alternatively, the pitch information can be determined in the audio decoder, for example, in the excitation domain, i.e. on the basis of a time domain excitation signal. For example, the time domain excitation signal derived from a previous, properly decoded audio frame can be evaluated to identify the pitch information to be used for the provision of the error concealment audio information.
(83) 5.3. Extrapolation of the Excitation or Creation of the Harmonic Part
(84) The excitation (for example, the time domain excitation signal) obtained from the previous frame (either just computed for the lost frame or saved already in the previous lost frame for multiple frame loss) is used to build the harmonic part (also designated as deterministic component or approximately periodic component) in the excitation (for example, in the input signal of the LPC synthesis) by copying the last pitch cycle as many times as needed to get one and a half frames. To save complexity, we can also create one and a half frames only for the first lost frame and then shift the processing for subsequent frame loss by half a frame and create only one frame each. Then we have access to half a frame of overlap.
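The copying of the last pitch cycle can be sketched as a periodic repetition of the final pitch_lag samples of the past excitation until one and a half frames are filled. The function name extrapolate_excitation and the rounding of the half frame to an integer number of samples are assumptions made for illustration.

```python
def extrapolate_excitation(past_excitation, pitch_lag, frame_length):
    """Build the harmonic part by repeating the last pitch cycle.

    Returns frame_length + frame_length // 2 samples, i.e. one and a
    half frames, so that half a frame is available for the overlap.
    """
    if pitch_lag <= 0 or pitch_lag > len(past_excitation):
        raise ValueError("pitch lag must fit into the past excitation")
    cycle = past_excitation[-pitch_lag:]
    target = frame_length + frame_length // 2
    return [cycle[i % pitch_lag] for i in range(target)]
```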
(85) In case of the first lost frame after a good frame (i.e. a properly decoded frame), the first pitch cycle (for example, of the time domain excitation signal obtained on the basis of the last properly decoded audio frame preceding the lost audio frame) is low-pass filtered with a sampling rate dependent filter (since ELD covers a really broad range of sampling rate combinations, going from AAC-ELD core to AAC-ELD with SBR or AAC-ELD dual rate SBR).
(86) The pitch in a voice signal changes almost all the time. Therefore, the concealment presented above tends to create some problems (or at least distortions) at the recovery because the pitch at the end of the concealed signal (i.e. at the end of the error concealment audio information) often does not match the pitch of the first good frame. Therefore, optionally, in some embodiments an attempt is made to predict the pitch at the end of the concealed frame to match the pitch at the beginning of the recovery frame. For example, the pitch at the end of a lost frame (which is considered as a concealed frame) is predicted, wherein the target of the prediction is to set the pitch at the end of the lost frame (concealed frame) to approximate the pitch at the beginning of the first properly decoded frame following one or more lost frames (which first properly decoded frame is also called recovery frame). This could be done during the frame loss or during the first good frame (i.e. during the first properly received frame). To get even better results, it is possible to optionally reuse some conventional tools and adapt them, such as the Pitch Prediction and Pulse resynchronization. For details, reference is made, for example, to references [6] and [7].
(87) If a long-term-prediction (LTP) is used in a frequency domain codec, it is possible to use the lag as the starting information about the pitch. However, in some embodiments, it is also desired to have a better granularity to be able to better track the pitch contour. Therefore, it is advantageous to do a pitch search at the beginning and at the end of the last good (properly decoded) frame. To adapt the signal to the moving pitch, it is desirable to use a pulse resynchronization, which is present in the state of the art.
(88) 5.4. Gain of Pitch
(89) In some embodiments, it is advantageous to apply a gain on the previously obtained excitation in order to reach the desired level. The gain of the pitch (for example, the gain of the deterministic component of the time domain excitation signal, i.e. the gain applied to a time domain excitation signal derived from a previously decoded audio frame, in order to obtain the input signal of the LPC synthesis), may, for example, be obtained by doing a normalized correlation in the time domain at the end of the last good (for example, properly decoded) frame. The length of the correlation may be equivalent to two sub-frames' length, or can be adaptively changed. The delay is equivalent to the pitch lag used for the creation of the harmonic part. We can also optionally perform the gain calculation only on the first lost frame and then only apply a fadeout (reduced gain) for the following consecutive frame loss.
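The gain of pitch can be sketched as a normalized correlation between the end of the last good frame and the segment one pitch lag earlier. The particular normalization (by the energy of the delayed segment), the clipping of the result to the range from 0 to 1, and the function name pitch_gain are assumptions made for illustration.

```python
def pitch_gain(signal, pitch_lag, corr_length):
    """Estimate the gain of the periodic component at the end of `signal`.

    signal:      time domain samples of the last good frame(s)
    pitch_lag:   delay in samples, equal to the lag used for the harmonic part
    corr_length: number of samples correlated (e.g. two sub-frames)
    """
    n = len(signal)
    if corr_length + pitch_lag > n:
        raise ValueError("not enough samples for this lag and length")
    cur = signal[n - corr_length:]
    past = signal[n - corr_length - pitch_lag:n - pitch_lag]
    num = sum(c * p for c, p in zip(cur, past))
    den = sum(p * p for p in past)
    if den <= 0.0:
        return 0.0
    return min(1.0, max(0.0, num / den))  # assumed clipping range
```

With this definition a perfectly periodic signal yields a gain of one, and a signal whose amplitude halves from one pitch cycle to the next yields a gain of one half.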
(90) The gain of pitch will determine the amount of tonality (or the amount of deterministic, at least approximately periodic signal components) that will be created. However, it is desirable to add some shaped noise to not have only an artificial tone. If we get very low gain of the pitch then we construct a signal that consists only of a shaped noise.
(91) To conclude, in some cases the time domain excitation signal obtained, for example, on the basis of a previously decoded audio frame, is scaled in dependence on the gain (for example, to obtain the input signal for the LPC synthesis). Accordingly, since the time domain excitation signal determines a deterministic (at least approximately periodic) signal component, the gain may determine a relative intensity of said deterministic (at least approximately periodic) signal components in the error concealment audio information. In addition, the error concealment audio information may be based on a noise, which is also shaped by the LPC synthesis, such that a total energy of the error concealment audio information is adapted, at least to some degree, to a properly decoded audio frame preceding the lost audio frame and, ideally, also to a properly decoded audio frame following the one or more lost audio frames.
(92) 5.5. Creation of the Noise Part
(93) An innovation is created by a random noise generator. This noise is optionally further high pass filtered and optionally pre-emphasized for voiced and onset frames. As for the low pass of the harmonic part, this filter (for example, the high-pass filter) is sampling rate dependent. This noise (which is provided, for example, by a noise generation 560) will be shaped by the LPC (for example, by the LPC synthesis 580) to get as close to the background noise as possible. The high pass characteristic is also optionally changed over consecutive frame loss such that after a certain amount of frame loss there is no filtering anymore, so that only the full band shaped noise remains and a comfort noise close to the background noise is obtained.
(94) An innovation gain (which may, for example, determine a gain of the noise 562 in the combination/fading 570, i.e. a gain using which the noise signal 562 is included into the input signal 572 of the LPC synthesis) is, for example, calculated by removing the previously computed contribution of the pitch (if it exists) (for example, a scaled version, scaled using the gain of pitch, of the time domain excitation signal obtained on the basis of the last properly decoded audio frame preceding the lost audio frame) and doing a correlation at the end of the last good frame. As for the pitch gain, this could be done optionally only on the first lost frame and then fade out, but in this case the fade out could either go to 0, which results in a complete muting, or to an estimated noise level present in the background. The length of the correlation is, for example, equivalent to two sub-frames' length and the delay is equivalent to the pitch lag used for the creation of the harmonic part.
(95) Optionally, this gain is also multiplied by (1-gain of pitch) to apply as much gain on the noise to reach the energy missing if the gain of pitch is not one. Optionally, this gain is also multiplied by a factor of noise. This factor of noise is coming, for example, from the previous valid frame (for example, from the last properly decoded audio frame preceding the lost audio frame).
(96) 5.6. Fade Out
(97) Fade out is mostly used for multiple frame loss. However, fade out may also be used in the case that only a single audio frame is lost.
(98) In case of a multiple frame loss, the LPC parameters are not recalculated. Either the last computed set is kept, or LPC concealment is done by converging to a background shape. In this case, the periodicity of the signal is converged to zero. For example, the time domain excitation signal 552 obtained on the basis of one or more audio frames preceding a lost audio frame is still used, with a gain which is gradually reduced over time, while the noise signal 562 is kept constant or scaled with a gain which is gradually increasing over time, such that the relative weight of the time domain excitation signal 552 is reduced over time when compared to the relative weight of the noise signal 562. Consequently, the input signal 572 of the LPC synthesis 580 is getting more and more noise-like. Consequently, the periodicity (or, more precisely, the deterministic, or at least approximately periodic component of the output signal 582 of the LPC synthesis 580) is reduced over time.
(99) The speed of the convergence according to which the periodicity of the signal 572, and/or the periodicity of the signal 582, is converged to 0 is dependent on the parameters of the last correctly received (or properly decoded) frame and/or the number of consecutive erased frames, and is controlled by an attenuation factor, α. The factor, α, is further dependent on the stability of the LP filter. Optionally, it is possible to alter the factor α in ratio with the pitch length. If the pitch (for example, a period length associated with the pitch) is really long, then we keep α normal, but if the pitch is really short, it is typically necessary to copy the same part of the past excitation many times. This will quickly sound too artificial, and therefore it is advantageous to fade out this signal faster.
(100) Further optionally, if available, we can take into account the pitch prediction output. If a pitch is predicted, it means that the pitch was already changing in the previous frame, and then the more frames we lose, the farther we are from the truth. Therefore, it is advantageous to speed up the fade out of the tonal part a bit in this case.
(101) If the pitch prediction failed because the pitch is changing too much, it means that either the pitch values are not really reliable or that the signal is really unpredictable. Therefore, again, it is advantageous to fade out faster (for example, to fade out faster the time domain excitation signal 552 obtained on the basis of one or more properly decoded audio frames preceding the one or more lost audio frames).
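The fade out of the tonal part over consecutive lost frames can be sketched as a repeated multiplication by an attenuation factor, which is made smaller when the pitch is short or when the pitch is known to be changing. The function name tonal_gains, the flags short_pitch and pitch_changing, and the speed-up value 0.8 are hypothetical and serve only to illustrate the principle; the description does not give numerical values.

```python
def tonal_gains(initial_gain, alpha, num_lost_frames,
                short_pitch=False, pitch_changing=False, speedup=0.8):
    """Return the gain of the tonal part for each consecutive lost frame.

    alpha:          per-frame attenuation factor (0 < alpha <= 1)
    short_pitch:    fade faster because the same cycle is copied many times
    pitch_changing: fade faster because the extrapolated pitch drifts
                    away from the true pitch (or could not be predicted)
    speedup:        assumed extra attenuation applied once per active flag
    """
    factor = alpha
    if short_pitch:
        factor *= speedup
    if pitch_changing:
        factor *= speedup
    gains = []
    gain = initial_gain
    for _ in range(num_lost_frames):
        gains.append(gain)
        gain *= factor
    return gains
```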
(102) 5.7. LPC Synthesis
(103) To come back to time domain, it is advantageous to perform a LPC synthesis 580 on the summation of the two excitations (tonal part and noisy part) followed by a de-emphasis.
(104) Worded differently, it is advantageous to perform the LPC synthesis 580 on the basis of a weighted combination of a time domain excitation signal 552 obtained on the basis of one or more properly decoded audio frames preceding the lost audio frame (tonal part) and the noise signal 562 (noisy part). As mentioned above, the time domain excitation signal 552 may be modified when compared to the time domain excitation signal 532 obtained by the LPC analysis 530 (in addition to LPC coefficients describing a characteristic of the LPC synthesis filter used for the LPC synthesis 580). For example, the time domain excitation signal 552 may be a time scaled copy of the time domain excitation signal 532 obtained by the LPC analysis 530, wherein the time scaling may be used to adapt the pitch of the time domain excitation signal 552 to a desired pitch.
(105) 5.8. Overlap-and-Add
(106) In the case of a transform codec only, to get the best overlap-add we create an artificial signal for half a frame more than the concealed frame and we create artificial aliasing on it. However, different overlap-add concepts may be applied.
(107) In the context of regular AAC or TCX, an overlap-and-add is applied between the extra half frame coming from concealment and the first part of the first good frame (which could be half a frame or less for lower delay windows such as those of AAC-LD).
(108) In the special case of ELD (extra low delay), for the first lost frame, it is advantageous to run the analysis three times to get the proper contribution from the last three windows and then for the first concealment frame and all the following ones the analysis is run one more time. Then one ELD synthesis is done to be back in time domain with all the proper memory for the following frame in the MDCT domain.
(109) To conclude, the input signal 572 of the LPC synthesis 580 (and/or the time domain excitation signal 552) may be provided for a temporal duration which is longer than a duration of a lost audio frame. Accordingly, the output signal 582 of the LPC synthesis 580 may also be provided for a time period which is longer than a lost audio frame. Accordingly, an overlap-and-add can be performed between the error concealment audio information (which is consequently obtained for a longer time period than a temporal extension of the lost audio frame) and a decoded audio information provided for a properly decoded audio frame following one or more lost audio frames.
(110) To summarize, the error concealment 500 is well-adapted to the case in which the audio frames are encoded in the frequency domain. Even though the audio frames are encoded in the frequency domain, the provision of the error concealment audio information is performed on the basis of a time domain excitation signal. Different modifications are applied to the time domain excitation signal obtained on the basis of one or more properly decoded audio frames preceding a lost audio frame. For example, the time domain excitation signal provided by the LPC analysis 530 is adapted to pitch changes, for example, using a time scaling. Moreover, the time domain excitation signal provided by the LPC analysis 530 is also modified by a scaling (application of a gain), wherein a fade out of the deterministic (or tonal, or at least approximately periodic) component may be performed by the combiner/fader 570, such that the input signal 572 of the LPC synthesis 580 comprises both a component which is derived from the time domain excitation signal obtained by the LPC analysis and a noise component which is based on the noise signal 562. The deterministic component of the input signal 572 of the LPC synthesis 580 is, however, typically modified (for example, time scaled and/or amplitude scaled) with respect to the time domain excitation signal provided by the LPC analysis 530.
(111) Thus, the time domain excitation signal can be adapted to the needs, and an unnatural hearing impression is avoided.
(112) 6. Time Domain Concealment According to
(113)
(114) Moreover, it should be noted that the embodiment according to
(115) However, it should be noted that the error concealment 600 according to
(116) In the case of a switched codec (and even in the case of a codec merely performing the decoding in the linear-prediction-coefficient domain) we usually already have the excitation signal (for example, the time domain excitation signal) coming from a previous frame (for example, a properly decoded audio frame preceding a lost audio frame). Otherwise (for example, if the time domain excitation signal is not available), it is possible to do as explained in the embodiment according to
(117) If the decoder is already using some LPC parameters in the time domain, we reuse them and extrapolate a new set of LPC parameters. The extrapolation of the LPC parameters is based on the past LPC, for example the mean of the last three frames and (optionally) the LPC shape derived during the DTX noise estimation if DTX (discontinuous transmission) exists in the codec.
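The extrapolation of a new LPC parameter set can be sketched as an element-wise mean of the parameter vectors of the last three frames, optionally blended with a noise shape. The function name extrapolate_lpc, the blending weight noise_weight and the assumption that the vectors are in a representation suitable for averaging (for example, line spectral frequencies) are illustrative and not taken from the description.

```python
def extrapolate_lpc(past_params, noise_shape=None, noise_weight=0.0):
    """Extrapolate LPC parameters from up to the last three frames.

    past_params:  list of parameter vectors, oldest first
    noise_shape:  optional vector, e.g. derived during DTX noise estimation
    noise_weight: assumed blending weight in [0, 1] toward noise_shape
    """
    if not past_params:
        raise ValueError("at least one past parameter set is needed")
    recent = past_params[-3:]
    order = len(recent[0])
    mean = [sum(frame[i] for frame in recent) / len(recent)
            for i in range(order)]
    if noise_shape is None:
        return mean
    return [(1.0 - noise_weight) * m + noise_weight * s
            for m, s in zip(mean, noise_shape)]
```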
(118) All of the concealment is done in the excitation domain to get smoother transition between consecutive frames.
(119) In the following, the error concealment 600 according to
(120) The error concealment 600 receives a past excitation 610 and a past pitch information 640. Moreover, the error concealment 600 provides an error concealment audio information 612.
(121) It should be noted that the past excitation 610 received by the error concealment 600 may, for example, correspond to the output 532 of the LPC analysis 530. Moreover, the past pitch information 640 may, for example, correspond to the output information 542 of the pitch search 540.
(122) The error concealment 600 further comprises an extrapolation 650, which may correspond to the extrapolation 550, such that reference is made to the above discussion.
(123) Moreover, the error concealment comprises a noise generator 660, which may correspond to the noise generator 560, such that reference is made to the above discussion.
(124) The extrapolation 650 provides an extrapolated time domain excitation signal 652, which may correspond to the extrapolated time domain excitation signal 552. The noise generator 660 provides a noise signal 662, which corresponds to the noise signal 562.
(125) The error concealment 600 also comprises a combiner/fader 670, which receives the extrapolated time domain excitation signal 652 and the noise signal 662 and provides, on the basis thereof, an input signal 672 for a LPC synthesis 680, wherein the LPC synthesis 680 may correspond to the LPC synthesis 580, such that the above explanations also apply. The LPC synthesis 680 provides a time domain audio signal 682, which may correspond to the time domain audio signal 582. The error concealment also comprises (optionally) a de-emphasis 684, which may correspond to the de-emphasis 584 and which provides a de-emphasized error concealment time domain audio signal 686. The error concealment 600 optionally comprises an overlap-and-add 690, which may correspond to the overlap-and-add 590. However, the above explanations with respect to the overlap-and-add 590 also apply to the overlap-and-add 690. In other words the overlap-and-add 690 may also be replaced by the audio decoder's overall overlap-and-add, such that the output signal 682 of the LPC synthesis or the output signal 686 of the de-emphasis may be considered as the error concealment audio information.
(126) To conclude, the error concealment 600 substantially differs from the error concealment 500 in that the error concealment 600 obtains the past excitation information 610 and the past pitch information 640 directly from one or more previously decoded audio frames without the need to perform a LPC analysis and/or a pitch analysis. However, it should be noted that the error concealment 600 may, optionally, comprise a LPC analysis and/or a pitch analysis (pitch search).
(127) In the following, some details of the error concealment 600 will be described in more detail. However, it should be noted that the specific details should be considered as examples, rather than as essential features.
(128) 6.1. Past Pitch of Pitch Search
(129) There are different approaches to get the pitch to be used for building the new signal.
(130) In the context of a codec using an LTP filter, like AAC-LTP, if the last frame (preceding the lost frame) was AAC with LTP, the pitch information is available from the last LTP pitch lag and the corresponding gain. In this case, the gain is used to decide whether or not to build a harmonic part in the signal. For example, if the LTP gain is higher than 0.6, then the LTP information is used to build the harmonic part.
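The decision described above may, for example, be sketched as follows. The threshold of 0.6 is the example value mentioned above; the function name and the use of `None` to signal "no harmonic part" are assumptions made for illustration only.

```python
LTP_GAIN_THRESHOLD = 0.6  # example value from the description


def pitch_from_ltp(ltp_lag, ltp_gain, threshold=LTP_GAIN_THRESHOLD):
    """Return the LTP lag if a harmonic part should be built, else None.

    ltp_lag:  last LTP pitch lag in samples, or None if no LTP data exists.
    ltp_gain: LTP gain associated with that lag.
    """
    if ltp_lag is not None and ltp_gain > threshold:
        return ltp_lag
    return None
```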
(131) If we do not have any pitch information available from the previous frame, then there are, for example, two other solutions.
(132) One solution is to do a pitch search at the encoder and transmit in the bitstream the pitch lag and the gain. This is similar to the long term prediction (LTP), but we are not applying any filtering (also no LTP filtering in the clean channel).
(133) Another solution is to perform a pitch search in the decoder. The AMR-WB pitch search in case of TCX is done in the FFT domain. In TCX, for example, the MDCT domain is used, so that the phases are missing. Therefore, in an embodiment, the pitch search is done directly in the excitation domain (for example, on the basis of the time domain excitation signal used as the input of the LPC synthesis, or used to derive the input for the LPC synthesis). This typically gives better results than doing the pitch search in the synthesis domain (for example, on the basis of a fully decoded time domain audio signal).
(134) The pitch search in the excitation domain (for example, on the basis of the time domain excitation signal) is done first with an open loop by a normalized cross correlation. Then, optionally, the pitch search can be refined by doing a closed loop search around the open loop pitch with a certain delta.
(135) In implementations, more than one maximum value of the correlation is considered. If a pitch information from a previous frame which is not error prone is available, then the pitch is selected that corresponds to one of the five highest values in the normalized cross correlation domain and that is closest to the pitch of the previous frame. Then, it is also verified that the maximum found is not a wrong maximum due to the window limitation.
(136) To conclude, there are different concepts to determine the pitch, wherein it is computationally efficient to consider a past pitch (i.e. pitch associated with a previously decoded audio frame). Alternatively, the pitch information may be transmitted from an audio encoder to an audio decoder. As another alternative, a pitch search can be performed at the side of the audio decoder, wherein the pitch determination is performed on the basis of the time domain excitation signal (i.e. in the excitation domain). A two stage pitch search comprising an open loop search and a closed loop search can be performed in order to obtain a particularly reliable and precise pitch information. Alternatively, or in addition, a pitch information from a previously decoded audio frame may be used in order to ensure that the pitch search provides a reliable result.
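As a non-limiting illustration, the open loop pitch search by normalized cross correlation, including the selection among the five highest correlation values, may be sketched as follows. The function names, the window length `win` and the lag range are assumptions. The optional closed loop refinement and the check against a wrong maximum due to the window limitation are not modelled.

```python
import math


def normalized_xcorr(exc, lag, win):
    """Normalized cross correlation between the last `win` samples of the
    excitation and the segment located `lag` samples earlier."""
    n = len(exc)
    if n < win + lag:
        raise ValueError("excitation buffer too short for this lag")
    a = exc[n - win:]
    b = exc[n - win - lag:n - lag]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def open_loop_pitch(exc, min_lag, max_lag, win, prev_pitch=None, n_candidates=5):
    """Pick a pitch lag in [min_lag, max_lag].

    Without a previous pitch, the lag with the highest correlation is used.
    With a previous pitch, the lag closest to it among the `n_candidates`
    highest correlation values is used.
    """
    scores = [(normalized_xcorr(exc, lag, win), lag)
              for lag in range(min_lag, max_lag + 1)]
    if prev_pitch is None:
        return max(scores)[1]
    best = sorted(scores, reverse=True)[:n_candidates]
    return min(best, key=lambda s: abs(s[1] - prev_pitch))[1]
```

For a strongly periodic excitation, both the true pitch lag and its multiples yield high correlation values; the previous pitch then serves to pick a consistent candidate.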
(137) 6.2. Extrapolation of the Excitation or Creation of the Harmonic Part
(138) The excitation (for example, in the form of a time domain excitation signal) obtained from the previous frame (either just computed for the lost frame or saved already in the previous lost frame for multiple frame loss) is used to build the harmonic part in the excitation (for example, the extrapolated time domain excitation signal 652) by copying the last pitch cycle (for example, a portion of the time domain excitation signal 610, a temporal duration of which is equal to a period duration of the pitch) as many times as needed to get, for example, one and a half of the (lost) frame.
(139) To get even better results, it is optionally possible to reuse some tools known from state of the art and adapt them. For details, reference is made, for example, to reference [6] and [7].
(140) It has been found that the pitch in a voice signal is changing almost all the time. It has been found that, therefore, the concealment presented above tends to create some problems at the recovery because the pitch at the end of the concealed signal often does not match the pitch of the first good frame. Therefore, optionally, an attempt is made to predict the pitch at the end of the concealed frame to match the pitch at the beginning of the recovery frame. This functionality will be performed, for example, by the extrapolation 650.
(141) If LTP in TCX is used, the lag can be used as the starting information about the pitch. However, it is desirable to have a better granularity to be able to better track the pitch contour. Therefore, a pitch search is optionally done at the beginning and at the end of the last good frame. To adapt the signal to the moving pitch, a pulse resynchronization, which is present in the state of the art, may be used.
(142) To conclude, the extrapolation (for example, of the time domain excitation signal associated with, or obtained on the basis of, a last properly decoded audio frame preceding the lost frame) may comprise a copying of a time portion of said time domain excitation signal associated with a previous audio frame, wherein the copied time portion may be modified in dependence on a computation, or estimation, of an (expected) pitch change during the lost audio frame. Different concepts are available for determining the pitch change.
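The copying of the last pitch cycle may, for example, be sketched as follows. The function and parameter names are hypothetical. The sketch assumes a constant pitch, i.e. the optional pitch prediction and pulse resynchronization are not modelled.

```python
def build_harmonic_part(past_exc, pitch_lag, frame_len, extra=0.5):
    """Repeat the last pitch cycle of the past excitation.

    past_exc:  past time domain excitation samples (most recent last).
    pitch_lag: pitch period in samples.
    frame_len: length of the lost frame in samples.
    extra:     additional fraction of a frame to generate; 0.5 gives the
               "one and a half" frames mentioned in the description.
    """
    if not 0 < pitch_lag <= len(past_exc):
        raise ValueError("pitch lag must lie within the past excitation")
    last_cycle = past_exc[-pitch_lag:]
    target_len = int(frame_len * (1.0 + extra))
    return [last_cycle[n % pitch_lag] for n in range(target_len)]
```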
(143) 6.3. Gain of Pitch
(144) In the embodiment according to
(145) The gain of the pitch determines the amount of tonality that will be created, but some shaped noise will also be added to not have only an artificial tone. If a very low gain of pitch is obtained, then a signal may be constructed that consists only of a shaped noise.
(146) To conclude, a gain which is applied to scale the time domain excitation signal obtained on the basis of the previous frame (or a time domain excitation signal which is obtained for a previously decoded frame, or which is associated to the previously decoded frame) is adjusted to thereby determine a weighting of a tonal (or deterministic, or at least approximately periodic) component within the input signal of the LPC synthesis 680, and, consequently, within the error concealment audio information. Said gain can be determined on the basis of a correlation, which is applied to the time domain audio signal obtained by a decoding of the previously decoded frame (wherein said time domain audio signal may be obtained using a LPC synthesis which is performed in the course of the decoding).
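As a non-limiting illustration, a gain of the pitch derived from a correlation on the time domain audio signal may be sketched as follows. The exact formula, the analysis window `win` and the limitation of the result to the range between 0 and 1 are assumptions made for this sketch and are not taken from the paragraphs above.

```python
def pitch_gain(signal, pitch_lag, win):
    """Estimate a pitch gain from the end of a decoded time domain signal.

    The gain is the correlation between the last `win` samples and the
    segment one pitch period earlier, normalized by the energy of the
    earlier segment and limited to [0, 1] (assumed limits).
    """
    n = len(signal)
    if n < win + pitch_lag:
        raise ValueError("signal too short for this lag and window")
    current = signal[n - win:]
    delayed = signal[n - win - pitch_lag:n - pitch_lag]
    num = sum(x * y for x, y in zip(current, delayed))
    den = sum(y * y for y in delayed)
    gain = num / den if den > 0.0 else 0.0
    return min(1.0, max(0.0, gain))
```

A gain close to one indicates a strongly periodic signal, whereas a gain close to zero leads to a concealment signal consisting essentially of shaped noise.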
(147) 6.4. Creation of the Noise Part
(148) An innovation is created by a random noise generator 660. This noise is further high pass filtered and optionally pre-emphasized for voiced and onset frames. The high pass filtering and the pre-emphasis, which may be performed selectively for voiced and onset frames, are not shown explicitly in the
(149) The noise will be shaped (for example, after combination with the time domain excitation signal 652 obtained by the extrapolation 650) by the LPC to get as close to the background noise as possible.
(150) For example, the innovation gain may be calculated by removing the previously computed contribution of the pitch (if it exists) and doing a correlation at the end of the last good frame. The length of the correlation may be equivalent to two sub-frames length and the delay may be equivalent to the pitch lag used for the creation of the harmonic part.
(151) Optionally, this gain may also be multiplied by (1-gain of pitch), so that, if the gain of the pitch is not one, as much gain is applied to the noise as is needed to reach the missing energy. Optionally, this gain is also multiplied by a factor of noise. This factor of noise may come from a previous valid frame.
(152) To conclude, a noise component of the error concealment audio information is obtained by shaping noise provided by the noise generator 660 using the LPC synthesis 680 (and, possibly, the de-emphasis 684). In addition, an additional high pass filtering and/or pre-emphasis may be applied. The gain of the noise contribution to the input signal 672 of the LPC synthesis 680 (also designated as innovation gain) may be computed on the basis of the last properly decoded audio frame preceding the lost audio frame, wherein a deterministic (or at least approximately periodic) component may be removed from the audio frame preceding the lost audio frame, and wherein a correlation may then be performed to determine the intensity (or gain) of the noise component within the decoded time domain signal of the audio frame preceding the lost audio frame.
(153) Optionally, some additional modifications may be applied to the gain of the noise component.
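The computation of the innovation gain may, for example, be sketched as follows. The description only states that the pitch contribution is removed and that a correlation over two sub-frames is performed at the end of the last good frame; the use of the root mean square of the remaining signal as the gain is an assumption of this sketch, as are all function and parameter names.

```python
import math


def innovation_gain(exc, pitch_lag, gain_of_pitch, subframe_len,
                    noise_factor=1.0, scale_by_missing_energy=True):
    """Estimate the gain of the noise part from the last good frame.

    exc:           signal at the end of the last good frame (most recent last).
    pitch_lag:     lag used for the creation of the harmonic part.
    gain_of_pitch: gain of the pitch, assumed to lie in [0, 1].
    subframe_len:  sub-frame length; the analysis spans two sub-frames.
    """
    win = 2 * subframe_len
    n = len(exc)
    if n < win + pitch_lag:
        raise ValueError("buffer too short for this lag and window")
    # Remove the previously computed contribution of the pitch.
    residual = [exc[i] - gain_of_pitch * exc[i - pitch_lag]
                for i in range(n - win, n)]
    gain = math.sqrt(sum(r * r for r in residual) / win)  # assumed RMS measure
    if scale_by_missing_energy:
        gain *= (1.0 - gain_of_pitch)   # optional (1 - gain of pitch) scaling
    return gain * noise_factor          # optional factor of noise
```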
(154) 6.5. Fade Out
(155) The fade out is mostly used for multiple frame loss. However, the fade out may also be used in the case that only a single audio frame is lost.
(156) In case of multiple frame loss, the LPC parameters are not recalculated. Either the last computed one is kept or an LPC concealment is performed as explained above.
(157) A periodicity of the signal is converged to zero. The speed of the convergence is dependent on the parameters of the last correctly received (or correctly decoded) frame and the number of consecutive erased (or lost) frames, and is controlled by an attenuation factor, α. The factor, α, is further dependent on the stability of the LP filter. Optionally, the factor α can be altered in ratio with the pitch length. For example, if the pitch is really long then α can be kept normal, but if the pitch is really short, it may be desirable (or necessitated) to copy a lot of times the same part of past excitation. Since it has been found that this will quickly sound too artificial, the signal is therefore faded out faster.
(158) Furthermore, optionally, it is possible to take into account the pitch prediction output. If a pitch is predicted, this means that the pitch was already changing in the previous frame, and the more frames are lost, the further the prediction is from the truth. Therefore, it is desirable to speed up the fade out of the tonal part a bit in this case.
(159) If the pitch prediction failed because the pitch is changing too much, this means either the pitch values are not really reliable or that the signal is really unpredictable. Therefore, again we should fade out faster.
(160) To conclude, the contribution of the extrapolated time domain excitation signal 652 to the input signal 672 of the LPC synthesis 680 is typically reduced over time. This can be achieved, for example, by reducing a gain value, which is applied to the extrapolated time domain excitation signal 652, over time. The speed used to gradually reduce the gain applied to scale the time domain excitation signal 652 obtained on the basis of one or more audio frames preceding a lost audio frame (or one or more copies thereof) is adjusted in dependence on one or more parameters of the one or more audio frames (and/or in dependence on a number of consecutive lost audio frames). In particular, the pitch length and/or the rate at which the pitch changes over time, and/or the question whether a pitch prediction fails or succeeds, can be used to adjust said speed.
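As a non-limiting illustration, the fade out of the tonal part over consecutive lost frames may be sketched as follows. All numerical values (the default attenuation factor, the threshold for a "short" pitch and the speed-up factor) are hypothetical and are chosen for illustration only. The dependence of the attenuation factor on the stability of the LP filter and on the parameters of the last good frame is not modelled.

```python
def tonal_fade_gains(num_lost_frames, alpha=0.5, pitch_lag=None,
                     short_pitch_lag=64, pitch_prediction_failed=False,
                     speedup=0.5):
    """Return one gain per consecutive lost frame for the tonal part.

    alpha:                   attenuation factor applied from frame to frame.
    pitch_lag:               pitch period in samples, if known.
    short_pitch_lag:         below this lag the pitch counts as short (assumed).
    pitch_prediction_failed: fade out faster if the prediction failed.
    speedup:                 multiplier applied to alpha to fade out faster.
    """
    alpha_eff = alpha
    if pitch_lag is not None and pitch_lag < short_pitch_lag:
        alpha_eff *= speedup      # short pitch: many copies sound artificial
    if pitch_prediction_failed:
        alpha_eff *= speedup      # unreliable or unpredictable pitch
    gains = []
    gain = 1.0
    for _ in range(num_lost_frames):
        gains.append(gain)
        gain *= alpha_eff
    return gains
```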
(161) 6.6. LPC Synthesis
(162) To come back to time domain, an LPC synthesis 680 is performed on the summation (or generally, weighted combination) of the two excitations (tonal part 652 and noisy part 662) followed by the de-emphasis 684.
(163) In other words, the result of the weighted (fading) combination of the extrapolated time domain excitation signal 652 and the noise signal 662 forms a combined time domain excitation signal and is input into the LPC synthesis 680, which may, for example, perform a synthesis filtering on the basis of said combined time domain excitation signal 672 in dependence on LPC coefficients describing the synthesis filter.
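The weighted combination, the LPC synthesis and the de-emphasis may, for example, be sketched as follows. The sign convention of the LPC coefficients (a synthesis filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p), the first-order de-emphasis filter and its default coefficient `beta` are assumptions of this sketch.

```python
def lpc_synthesis(excitation, lpc, memory=None):
    """All-pole filtering: y[n] = x[n] - sum_k lpc[k-1] * y[n-k]."""
    mem = list(memory) if memory is not None else [0.0] * len(lpc)
    out = []
    for x in excitation:
        y = x - sum(a * m for a, m in zip(lpc, mem))
        out.append(y)
        mem = [y] + mem[:-1]      # mem[0] holds the most recent output
    return out


def de_emphasis(signal, beta=0.68, prev=0.0):
    """First-order de-emphasis: y[n] = x[n] + beta * y[n-1] (assumed form)."""
    out = []
    for x in signal:
        prev = x + beta * prev
        out.append(prev)
    return out


def conceal_synthesis(tonal, noise, gain_tonal, gain_noise, lpc):
    """Combine the tonal and noisy excitations, then synthesize."""
    combined = [gain_tonal * t + gain_noise * n for t, n in zip(tonal, noise)]
    return de_emphasis(lpc_synthesis(combined, lpc))
```

Here, `gain_tonal` would decrease and `gain_noise` would be adapted over time in order to realize the fading between the two excitations.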
(164) 6.7. Overlap-and-Add
(165) Since it is not known during concealment what will be the mode of the next frame coming (for example, ACELP, TCX or FD), it is advantageous to prepare different overlaps in advance. To get the best overlap-and-add if the next frame is in a transform domain (TCX or FD) an artificial signal (for example, an error concealment audio information) may, for example, be created for half a frame more than the concealed (lost) frame. Moreover, artificial aliasing may be created on it (wherein the artificial aliasing may, for example, be adapted to the MDCT overlap-and-add).
(166) To get a good overlap-and-add and no discontinuity with the future frame in the time domain (ACELP), the same is done as above but without aliasing, in order to be able to apply long overlap-add windows. Alternatively, if a square window is to be used, the zero input response (ZIR) is computed at the end of the synthesis buffer.
(167) To conclude, in a switching audio decoder (which may, for example, switch between an ACELP decoding, a TCX decoding and a frequency domain decoding (FD decoding)), an overlap-and-add may be performed between the error concealment audio information which is provided primarily for a lost audio frame, but also for a certain time portion following the lost audio frame, and the decoded audio information provided for the first properly decoded audio frame following a sequence of one or more lost audio frames. In order to obtain a proper overlap-and-add even for decoding modes which bring along a time domain aliasing at a transition between subsequent audio frames, an aliasing cancellation information (for example, designated as artificial aliasing) may be provided. Accordingly, an overlap-and-add between the error concealment audio information and the time domain audio information obtained on the basis of the first properly decoded audio frame following a lost audio frame, results in a cancellation of aliasing.
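As a non-limiting illustration, an overlap-and-add between the time portion of the error concealment audio information extending beyond the lost frame and the beginning of the first properly decoded frame may be sketched as follows for the case without time domain aliasing. The linear cross-fade window is an assumption of this sketch; a transform codec would use its own windows together with the artificial aliasing mentioned above, which is not modelled here.

```python
def overlap_add(concealed_tail, next_frame_head):
    """Cross-fade from the concealment signal to the next decoded frame.

    Both inputs cover the same time span and have the same length.
    The fade-out and fade-in weights sum to one for every sample.
    """
    n = len(concealed_tail)
    if n != len(next_frame_head):
        raise ValueError("both segments must have the same length")
    out = []
    for i in range(n):
        w = (i + 0.5) / n                     # fade-in weight of the new frame
        out.append((1.0 - w) * concealed_tail[i] + w * next_frame_head[i])
    return out
```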
(168) If the first properly decoded audio frame following the sequence of one or more lost audio frames is encoded in the ACELP mode, a specific overlap information may be computed, which may be based on a zero input response (ZIR) of a LPC filter.
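The zero input response of an LPC synthesis filter is the output which the filter continues to produce from its internal state when it is fed with zeros. A minimal sketch is given below, assuming the same (hypothetical) coefficient convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and a filter memory which holds the most recent output samples at the end of the synthesis buffer, newest first.

```python
def zero_input_response(lpc, memory, length):
    """Compute `length` samples of the zero input response of 1/A(z).

    lpc:    coefficients a_1..a_p of A(z) (assumed sign convention).
    memory: last p output samples of the synthesis, newest first.
    """
    mem = list(memory)
    out = []
    for _ in range(length):
        y = -sum(a * m for a, m in zip(lpc, mem))   # input sample is zero
        out.append(y)
        mem = [y] + mem[:-1]
    return out
```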
(169) To conclude, the error concealment 600 is well suited to usage in a switching audio codec. However, the error concealment 600 can also be used in an audio codec which merely decodes an audio content encoded in a TCX mode or in an ACELP mode.
(170) 6.8. Conclusion
(171) It should be noted that a particularly good error concealment is achieved by the above mentioned concept to extrapolate a time domain excitation signal, to combine the result of the extrapolation with a noise signal using a fading (for example, a cross-fading) and to perform an LPC synthesis on the basis of a result of a cross-fading.
(172) 7. Audio Decoder According to
(173)
(174) It should be noted that the audio decoder 1100 can be a part of a switching audio decoder. For example, the audio decoder 1100 may replace the linear-prediction-domain decoding path 440 in the audio decoder 400.
(175) The audio decoder 1100 is configured to receive an encoded audio information 1110 and to provide, on the basis thereof, a decoded audio information 1112. The encoded audio information 1110 may, for example, correspond to the encoded audio information 410 and the decoded audio information 1112 may, for example, correspond to the decoded audio information 412.
(176) The audio decoder 1100 comprises a bitstream analyzer 1120, which is configured to extract an encoded representation 1122 of a set of spectral coefficients and an encoded representation of linear-prediction coding coefficients 1124 from the encoded audio information 1110. However, the bitstream analyzer 1120 may optionally extract additional information from the encoded audio information 1110.
(177) The audio decoder 1100 also comprises a spectral value decoding 1130, which is configured to provide a set of decoded spectral values 1132 on the basis of the encoded spectral coefficients 1122. Any decoding concept known for decoding spectral coefficients may be used.
(178) The audio decoder 1100 also comprises a linear-prediction-coding-coefficient to scale-factor conversion 1140, which is configured to provide a set of scale factors 1142 on the basis of the encoded representation 1124 of linear-prediction-coding coefficients. For example, the linear-prediction-coding-coefficient to scale-factor conversion 1140 may perform a functionality which is described in the USAC standard. For example, the encoded representation 1124 of the linear-prediction-coding coefficients may comprise a polynomial representation, which is decoded and converted into a set of scale factors by the linear-prediction-coding-coefficient to scale-factor conversion 1140.
(179) The audio decoder 1100 also comprises a scaler 1150, which is configured to apply the scale factors 1142 to the decoded spectral values 1132, to thereby obtain scaled decoded spectral values 1152. Moreover, the audio decoder 1100 comprises, optionally, a processing 1160, which may, for example, correspond to the processing 366 described above, wherein processed scaled decoded spectral values 1162 are obtained by the optional processing 1160. The audio decoder 1100 also comprises a frequency-domain-to-time-domain transform 1170, which is configured to receive the scaled decoded spectral values 1152 (which may correspond to the scaled decoded spectral values 362), or the processed scaled decoded spectral values 1162 (which may correspond to the processed scaled decoded spectral values 368) and provide, on the basis thereof, a time domain representation 1172, which may correspond to the time domain representation 372 described above. The audio decoder 1100 also comprises an optional first post-processing 1174, and an optional second post-processing 1178, which may, for example, correspond, at least partly, to the optional post-processing 376 mentioned above. Accordingly, the audio decoder 1100 obtains (optionally) a post-processed version 1179 of the time domain audio representation 1172.
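The application of scale factors to decoded spectral values may, for example, be sketched as follows. The grouping of the spectral values into bands by means of `band_offsets` is a hypothetical layout chosen for illustration; the actual association between scale factors and spectral values is defined by the respective codec.

```python
def apply_scale_factors(spectral_values, scale_factors, band_offsets):
    """Scale each band of spectral values by its scale factor.

    band_offsets: start index of every band plus the end index of the last
                  band, so len(band_offsets) == len(scale_factors) + 1.
    """
    if len(band_offsets) != len(scale_factors) + 1:
        raise ValueError("need one more band offset than scale factors")
    scaled = list(spectral_values)
    for band, factor in enumerate(scale_factors):
        for k in range(band_offsets[band], band_offsets[band + 1]):
            scaled[k] = spectral_values[k] * factor
    return scaled
```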
(180) The audio decoder 1100 also comprises an error concealment block 1180 which is configured to receive the time domain audio representation 1172, or a post-processed version thereof, and the linear-prediction-coding coefficients (either in encoded form, or in a decoded form) and provides, on the basis thereof, an error concealment audio information 1182.
(181) The error concealment block 1180 is configured to provide the error concealment audio information 1182 for concealing a loss of an audio frame following an audio frame encoded in a frequency domain representation using a time domain excitation signal, and therefore is similar to the error concealment 380 and to the error concealment 480, and also to the error concealment 500 and to the error concealment 600.
(182) However, the error concealment block 1180 comprises an LPC analysis 1184, which is substantially identical to the LPC analysis 530. However, the LPC analysis 1184 may, optionally, use the LPC coefficients 1124 to facilitate the analysis (when compared to the LPC analysis 530). The LPC analysis 1184 provides a time domain excitation signal 1186, which is substantially identical to the time domain excitation signal 532 (and also to the time domain excitation signal 610). Moreover, the error concealment block 1180 comprises an error concealment 1188, which may, for example, perform the functionality of blocks 540, 550, 560, 570, 580, 584 of the error concealment 500, or which may, for example, perform the functionality of blocks 640, 650, 660, 670, 680, 684 of the error concealment 600. However, the error concealment block 1180 slightly differs from the error concealment 500 and also from the error concealment 600. For example, the error concealment block 1180 (comprising the LPC analysis 1184) differs from the error concealment 500 in that the LPC coefficients (used for the LPC synthesis 580) are not determined by the LPC analysis 530, but are (optionally) received from the bitstream. Moreover, the error concealment block 1180, comprising the LPC analysis 1184, differs from the error concealment 600 in that the past excitation 610 is obtained by the LPC analysis 1184, rather than being available directly.
(183) The audio decoder 1100 also comprises a signal combination 1190, which is configured to receive the time domain audio representation 1172, or a post-processed version thereof, and also the error concealment audio information 1182 (naturally, for subsequent audio frames) and combines said signals, using an overlap-and-add operation, to thereby obtain the decoded audio information 1112.
(184) For further details, reference is made to the above explanations.
(185) 8. Method According to
(186)
(187) 9. Method According to
(188)
(189) The method 1000 according to
(190) Moreover, it should be noted that the method according to
(191) 10. Additional Remarks
(192) In the above described embodiments, multiple frame loss can be handled in different ways. For example, if two or more frames are lost, the periodic part of the time domain excitation signal for the second lost frame can be derived from (or be equal to) a copy of the tonal part of the time domain excitation signal associated with the first lost frame. Alternatively, the time domain excitation signal for the second lost frame can be based on an LPC analysis of the synthesis signal of the previous lost frame. For example, in a codec in which the LPC may change with every lost frame, it makes sense to redo the analysis for every lost frame.
(193) 11. Implementation Alternatives
(194) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
(195) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(196) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(197) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(198) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(199) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(200) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
(201) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(202) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(203) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(204) A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
(205) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
(206) The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
(207) The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
(208) The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
(209) 12. Conclusions
(210) To conclude, while some concealment for transform domain codecs has been described in the field, embodiments according to the invention outperform conventional codecs (or decoders). Embodiments according to the invention use a change of domain for concealment (frequency domain to time or excitation domain). Accordingly, embodiments according to the invention create a high quality speech concealment for transform domain decoders.
(211) The transform coding mode is similar to the one in USAC (confer, for example, reference [3]). It uses the modified discrete cosine transform (MDCT) as a transform and the spectral noise shaping is achieved by applying the weighted LPC spectral envelope in the frequency domain (also known as FDNS, frequency domain noise shaping). Worded differently, embodiments according to the invention can be used in an audio decoder, which uses the decoding concepts described in the USAC standard. However, the error concealment concept disclosed herein can also be used in an audio decoder which is AAC-like or in any AAC family codec (or decoder).
(212) The concept according to the present invention applies to a switched codec such as USAC as well as to a pure frequency domain codec. In both cases, the concealment is performed in the time domain or in the excitation domain.
(213) In the following, some advantages and features of the time domain concealment (or of the excitation domain concealment) will be described.
(214) Conventional TCX concealment, as described, for example, taking reference to
(215) Different parts and details have been explained above, for example based on the embodiments according to
(216) To conclude, embodiments according to the invention create an error concealment which outperforms the conventional solutions.
(217) While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
BIBLIOGRAPHY
(218)
[1] 3GPP, Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions, 2009, 3GPP TS 26.290.
[2] Guillaume Fuchs et al., MDCT-based coder for highly adaptive speech and audio coding, EUSIPCO 2009.
[3] ISO/IEC DIS 23003-3 (E); Information technology - MPEG audio technologies - Part 3: Unified speech and audio coding.
[4] 3GPP, General Audio Codec audio processing functions; Enhanced aacPlus general audio codec; Additional decoder tools, 2009, 3GPP TS 26.402.
[5] Audio decoder and coding error compensating method, 2000, EP 1207519 B1.
[6] Apparatus and method for improved concealment of the adaptive codebook in ACELP-like concealment employing improved pitch lag estimation, 2014, PCT/EP2014/062589.
[7] Apparatus and method for improved concealment of the adaptive codebook in ACELP-like concealment employing improved pulse resynchronization, 2014, PCT/EP2014/062578.