Encoder, decoder and method for encoding and decoding audio content using parameters for enhancing a concealment
11735196 · 2023-08-22
Assignee
Inventors
- Jérémie Lecomte (Santa Clara, CA, US)
- Benjamin Schubert (Nuremberg, DE)
- Michael Schnabel (Geroldsgruen, DE)
- Martin Dietz (Nuremberg, DE)
Cpc classification
- G10L19/125
- G10L19/018
- G10L19/20
- G10L19/005
- G10L19/24
International classification
- G10L19/20
- G10L19/005
- G10L19/24
Abstract
Described are an encoder for coding speech-like content and/or general audio content, wherein the encoder is configured to embed, at least in some frames, parameters in a bitstream, which parameters enhance a concealment in case an original frame is lost, corrupted or delayed, and a decoder for decoding speech-like content and/or general audio content, wherein the decoder is configured to use parameters which are sent later in time to enhance a concealment in case an original frame is lost, corrupted or delayed, as well as a method for encoding and a method for decoding.
Claims
1. An apparatus for encoding audio content using a TCX (Transform Coded Excitation) coding scheme, wherein the apparatus is configured to provide a primary encoded representation of a current frame and an encoded representation of at least one error concealment helper parameter for providing a decoder-sided guided error concealment of the current frame, wherein the encoded representation of the at least one error concealment helper parameter is transmitted in-band as part of the codec payload, wherein the apparatus is configured to combine the encoded representation of the at least one error concealment helper parameter of the current frame with a primary encoded representation of a future frame into a transport packet such that the encoded representation of the at least one error concealment helper parameter of the current frame is sent with a time delay relative to the primary encoded representation of the current frame, wherein the apparatus is configured to select the at least one error concealment helper parameter based on one or more parameters representing a signal characteristic of the audio content comprised in the current frame, wherein the apparatus is configured to selectively choose between at least two modes for providing an encoded representation of the at least one error concealment helper parameter, wherein one of the at least two modes for providing an encoded representation of the at least one error concealment helper parameter is a time domain concealment mode such that the encoded representation of the at least one error concealment helper parameter comprises one or more of a TCX LTP (Long-Term Prediction) lag and a classifier information, and wherein a further one of the at least two modes for providing an encoded representation of the at least one error concealment helper parameter is a frequency domain concealment mode such that the encoded representation of the at least one error concealment helper parameter comprises one or more of an LSF 
(Line Spectral Frequency) parameter, a TCX global gain and a classifier information, wherein the apparatus is implemented, at least in part, by one or more hardware elements.
2. The apparatus according to claim 1, wherein the decoder-sided guided error concealment is an extrapolation-based error concealment.
3. The apparatus according to claim 1, wherein the selection of a mode for providing an encoded representation of the at least one error concealment helper parameter is based on parameters which comprise at least one of a frame class, a LTP (Long-Term Prediction) pitch, a LTP gain and a mode for providing an encoded representation of the at least one error concealment helper parameter of one or more preceding frames.
4. An apparatus for decoding audio content using a TCX (Transform Coded Excitation) coding scheme, wherein the apparatus is configured to receive a primary encoded representation of a current frame and/or an encoded representation of at least one error concealment helper parameter for providing a decoder-sided guided error concealment of the current frame, wherein the encoded representation of the at least one error concealment helper parameter is transmitted in-band as part of the codec payload, wherein the apparatus is configured to extract the error concealment helper parameter of a current frame from a packet that is separated from a packet in which the primary encoded representation of the current frame is comprised, wherein the apparatus is configured to use the guided error concealment for at least partly reconstructing the audio content of the current frame by using the at least one error concealment helper parameter in case the primary encoded representation of the current frame is lost, corrupted or delayed, wherein the apparatus is configured to selectively choose between at least two error concealment modes which use different encoded representations of one or more error concealment helper parameters for at least partially reconstructing the audio content using the guided error concealment, wherein one of the at least two error concealment modes, which uses different encoded representations of the at least one error concealment helper parameter, is a time domain concealment mode wherein the encoded representation of the at least one error concealment helper parameter comprises at least one of a TCX LTP (Long-Term Prediction) lag and a classifier information, and wherein a further one of the at least two error concealment modes, which use different encoded representations of one or more error concealment helper parameter, is a frequency domain concealment mode wherein the encoded representation of the at least one error concealment helper parameter 
comprises one or more of an LSF (Line Spectral Frequency) parameter, a TCX global gain and a classifier information, wherein the apparatus is implemented, at least in part, by one or more hardware elements.
5. The apparatus according to claim 4, wherein the decoder-sided guided error concealment is an extrapolation-based error concealment.
6. A system comprising the apparatus for encoding audio content of claim 1 and an apparatus for decoding audio content of claim 4.
7. A method for encoding audio content using a TCX coding scheme, the method comprising: providing a primary encoded representation of a current frame and an encoded representation of at least one error concealment helper parameter, said error concealment helper parameter for providing a decoder-sided guided error concealment of the current frame, and transmitting the encoded representation of the at least one error concealment helper parameter in-band as part of the codec payload, selecting the at least one error concealment helper parameter based on one or more parameters representing a signal characteristic of the audio content comprised in the current frame, and selectively choosing between at least two modes for providing an encoded representation of the at least one error concealment helper parameter, wherein one of the at least two modes for providing an encoded representation of the at least one error concealment helper parameter is a time domain concealment mode such that the encoded representation of the at least one error concealment helper parameter comprises one or more of a TCX LTP lag and a classifier information, and wherein a further one of the at least two modes for providing an encoded representation of the at least one error concealment helper parameter is a frequency domain concealment mode such that the encoded representation of the at least one error concealment helper parameter comprises one or more of an LSF parameter, a TCX global gain and a classifier information.
8. A method for decoding audio content using a TCX coding scheme, the method comprising: receiving a primary encoded representation of a current frame and an encoded representation of at least one error concealment helper parameter, said error concealment helper parameter for providing a decoder-sided guided error concealment of the current frame, wherein the encoded representation of the at least one error concealment helper parameter is transmitted in-band as part of the codec payload, using, at the decoder side, the guided error concealment for at least partly reconstructing the audio content of the current frame by using the at least one error concealment helper parameter in case the primary encoded representation of the current frame is lost, corrupted or delayed, selectively choosing between at least two error concealment modes which use different encoded representations of the at least one error concealment helper parameter for at least partially reconstructing the audio content using the guided error concealment, wherein one of the at least two error concealment modes, which use different encoded representations of the at least one error concealment helper parameter, is a time domain concealment mode wherein the encoded representation of the at least one error concealment helper parameter comprises at least one of a TCX LTP lag and a classifier information, and wherein a further one of the at least two error concealment modes, which use different encoded representations of the at least one error concealment helper parameter, is a frequency domain concealment mode wherein the encoded representation of the at least one error concealment helper parameter comprises one or more of an LSF parameter, a TCX global gain and a classifier information.
9. A non-transitory digital storage medium having stored thereon a computer program for performing a method of encoding audio content using a TCX coding scheme, the method comprising: providing a primary encoded representation of a current frame and an encoded representation of at least one error concealment helper parameter, said error concealment helper parameter for providing a decoder-sided guided error concealment of the current frame, and transmitting the encoded representation of the at least one error concealment helper parameter in-band as part of the codec payload, selecting the at least one error concealment helper parameter based on one or more parameters representing a signal characteristic of the audio content comprised in the current frame, and selectively choosing between at least two modes for providing an encoded representation of the at least one error concealment helper parameter, wherein one of the two modes for providing an encoded representation of the at least one error concealment helper parameter is a time domain concealment mode such that the encoded representation of the at least one error concealment helper parameter comprises one or more of a TCX LTP lag and a classifier information, and wherein a further one of the two modes for providing an encoded representation of the at least one error concealment helper parameter is a frequency domain concealment mode such that the encoded representation of the at least one error concealment helper parameter comprises one or more of an LSF parameter, a TCX global gain and a classifier information, when said computer program is run by a computer.
10. A non-transitory digital storage medium having stored thereon a computer program for performing a method of decoding audio content using a TCX coding scheme, the method comprising: receiving a primary encoded representation of a current frame and an encoded representation of at least one error concealment helper parameter for providing a decoder-sided guided error concealment of the current frame, wherein the encoded representation of the at least one error concealment helper parameter is transmitted in-band as part of the codec payload, using, at the decoder-side, the guided error concealment for at least partly reconstructing the audio content of the current frame by using the at least one error concealment helper parameter in case the primary encoded representation of the current frame is lost, corrupted or delayed, selectively choosing between at least two error concealment modes which use different encoded representations of the at least one error concealment helper parameters for at least partially reconstructing the audio content using the guided error concealment, wherein one of the at least two error concealment modes, which use different encoded representations of the at least one error concealment helper parameter, is a time domain concealment mode wherein the encoded representation of the at least one error concealment helper parameter comprises at least one of a TCX LTP lag and a classifier information, and wherein a further one of the at least two error concealment modes, which use different encoded representations of the at least one error concealment helper parameter, is a frequency domain concealment mode wherein the encoded representation of the at least one error concealment helper parameter comprises one or more of an LSF parameter, a TCX global gain and a classifier information, when said computer program is run by a computer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
(14) The encoder 1 is further configured to embed, at least in some frames 7, parameters 6 in the bitstream 5. These parameters 6 are used to enhance a concealment in case an original frame 4 is lost, corrupted or delayed.
(15) The bitstream 5 is sent to a receiver comprising a decoder.
(16) As shown in
(17) The encoder 1 is configured to delay the parameters 6 by some time and to embed the parameters 6 in a packet 9 which is encoded and sent later in time than a packet which comprises the primary frame 4b.
(18) The encoder 1 may create one or more primary frames 4b, 4c and one or more partial copies 8a, 8b. For example, at least a certain part of the audio content 2 is encoded and embedded into a primary frame 4b. The same part of the audio content 2 is analyzed by the encoder 1 as to certain signal characteristics. Based thereupon, the encoder 1 determines a selection of the one or more parameters 6 which enhance a concealment on the decoder side. These parameters 6 are embedded in a corresponding “partial copy” 8b.
(19) In other words, the primary frame 4b contains an encoded representation of at least a part of the audio content 2. The corresponding partial copy 8b contains one or more parameters 6 which are used by an error concealment at the decoder side in order to reconstruct the encoded representation of the audio content 2 in case the primary frame 4b is lost, corrupted or delayed.
(20) The primary frame 4b is packed into the transport packet 9 together with a partial copy 8a, wherein the partial copy 8a is the partial copy of an audio content that has been encoded in a primary frame 4a which has already been sent earlier in time. Accordingly, the encoder 1 has delayed the parameters 6 by some time. As can be further seen in
(21) It is an important feature that the concept described herein uses an en-/decoding scheme where, at least in some frames 8a, 8b, redundant coding parameters 6 are embedded in the bitstream 5 and transmitted to the decoder side. The redundant info (parameters 6) is delayed by some time and embedded in a packet 9 which is encoded and sent later in time, such that the info can be used in the case where the decoder already has the future frame 4b, 8a available, but the original frame 4a is lost, corrupted or delayed even more.
(22) The bitstream 5 may, for example, comprise a constant total bitrate. The encoder 1 may be configured to reduce a primary frame bitrate, i.e. the bitrate that is needed to encode a primary frame 4b, 4c, when compared to the constant total bitrate. The bitrate reduction for the primary frames 4b, 4c and a partial redundant frame coding mechanism together determine a bitrate allocation between the primary and redundant frames (partial copies) 4b, 4c, 8a, 8b to be included within the constant total bitrate of the bitstream 5. Thus, the encoder 1 is configured to provide a packet 9 containing a primary frame 4b and a partial copy 8a, wherein the size, i.e. the bitrate, of the packet 9 is at or below the constant total bitrate.
(23) In other words, the primary frame bitrate reduction and the partial redundant frame coding mechanism together determine the bitrate allocation between the primary and redundant frames 4b, 4c, 8a, 8b to be included within the constant total bitrate. The overall bitrate of a packet holding a primary frame 4b together with partial copy parameters 8a is not increased.
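The allocation described in the two paragraphs above can be sketched as follows. This is a minimal, hypothetical Python illustration: the total packet budget of 264 bits and all names are assumptions, while the partial copy sizes (29 bits for the FD mode, 18 bits for the TD modes) are taken from the embodiments described further below.

```python
# Sketch of constant-bitrate allocation between a primary frame and the
# partial redundant copy of an earlier frame riding on top of it.
TOTAL_BITS_PER_PACKET = 264  # assumed constant total bitrate per packet

PARTIAL_COPY_BITS = {
    "RF_TCXFD": 29,   # frequency domain partial copy (see below)
    "RF_TCXTD1": 18,  # time domain partial copies: 18 bits of side data
    "RF_TCXTD2": 18,
}

def allocate_bits(partial_copy_mode):
    """Split the constant packet budget: the primary frame bitrate is
    reduced by exactly the size of the attached partial copy."""
    side_bits = PARTIAL_COPY_BITS.get(partial_copy_mode, 0)
    return TOTAL_BITS_PER_PACKET - side_bits, side_bits
```

Whichever mode is chosen, the packet size never exceeds the constant total, matching paragraph (23).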
TCX-Coding Scheme
(24) According to an embodiment, the encoder 1 is part of a codec using a TCX coding scheme. The inventive encoder 1 may use TCX for coding general audio content. In the case of TCX, the partial copy 8a, 8b is used to enhance the frame loss concealment algorithm at the decoder side by transmitting some helper parameters 6.
(25) When using a transform domain codec, embedding redundant info 8a, 8b into TCX frames 4b, 4c may be chosen if:
- The frame contains a very noisy audio signal. This may be indicated by a low autocorrelation measure or by the frame classifier output being UNVOICED or UNVOICED TRANSITION; both classifications indicate a low prediction gain.
- The frame contains a noise floor with sharp spectral lines which are stationary over a longer period of time. This may be detected by a peak detection algorithm which searches for local maxima in the TCX spectrum (power spectrum or real spectrum) and compares the result with the peak detection result of the previous frame. If the peaks did not move, it is likely that there are stationary tones which can easily be concealed, after the noise spectrum has been concealed, by post-processing the spectrum with a phase extrapolator called tonal concealment. In case LTP info is present and the lag is stable over the actual and the past frame, tonal concealment [6] should be applied at the decoder.
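The stationary-peak check above can be sketched as follows. This is a simplified stand-in, not the codec's actual detector: the relative threshold of 4.0, the one-bin tolerance, and all names are illustrative assumptions.

```python
def spectral_peaks(power_spectrum, rel_threshold=4.0):
    """Indices of local maxima that stand out above the mean level of the
    spectrum (a crude proxy for the noise floor)."""
    floor = sum(power_spectrum) / len(power_spectrum)
    return [k for k in range(1, len(power_spectrum) - 1)
            if power_spectrum[k] > power_spectrum[k - 1]
            and power_spectrum[k] > power_spectrum[k + 1]
            and power_spectrum[k] > rel_threshold * floor]

def peaks_are_stationary(current_spectrum, previous_spectrum, max_shift=1):
    """True if the peaks 'did not move' between the previous and the current
    frame, i.e. stationary tones suitable for tonal concealment."""
    cur = spectral_peaks(current_spectrum)
    prev = spectral_peaks(previous_spectrum)
    if not cur or len(cur) != len(prev):
        return False
    return all(abs(c - p) <= max_shift for c, p in zip(cur, prev))
```

A frame whose peaks match those of the previous frame within one bin would, per the passage above, be a candidate for embedding redundant info.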
(26) Redundant information (parameters 6) may be:
- ISF/LSF parameters: The ISF/LSF parameter representation is used for quantization and coding of the LPC parameters. In TCX the LPC is used to represent the masking threshold. This is an important parameter and very helpful to have correctly available on the decoder side in case of a frame loss. Especially if the ISF/LSFs are coded predictively, the concealment quality will improve significantly by having this info available during concealment, because the predictor states on the decoder side will stay correct (in sync with the encoder), which leads to a very quick recovery after the loss.
- Signal classification: Signal classification is used for signaling the content types UNVOICED, UNVOICED TRANSITION, VOICED TRANSITION, VOICED and ONSET. Typically this type of classification is used in speech coding and indicates whether tonal/predictive components are present in the signal or whether the tonal/predictive components are changing. Having this information available on the decoder side during concealment may help to determine the predictability of the signal, and thus can help to adjust the amplitude fade-out speed and the interpolation speed of the LPC parameters.
- TCX global gain/level: The global gain may be transmitted to easily set the energy of the concealed frame to the correct (encoder-determined) level in case it is available.
- Window information, such as the overlap length.
- Spectral peak positions, to help tonal concealment.
(27) There is a special case where, at the encoder 1, for a frequency domain partial copy it is checked whether the signal 2 contains an onset. If the (possibly quantized) gain of the actual frame 4c is more than a certain factor (e.g. 1.6) times the gain of the previous frame 4b, and the correlation between the actual frame 4c and the previous frame 4b is low, only a limited (clipped) gain is transmitted. This avoids pre-echo artefacts in case of concealment. In case of an onset, the previous frame 4b is really uncorrelated with the actual frame 4c. Thus, the gain computed on the actual frame 4c cannot be relied upon if concealment is done based on the spectral bins of the previous frame 4b.
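The onset check can be condensed into a small sketch. The onset factor of 1.6 comes from the paragraph above; the correlation threshold, the clip factor, and all names are illustrative assumptions.

```python
def gain_for_partial_copy(cur_gain, prev_gain, correlation,
                          onset_factor=1.6, corr_threshold=0.3,
                          clip_factor=1.2):
    """If the current frame's gain exceeds the previous frame's gain by more
    than onset_factor and the inter-frame correlation is low (an onset),
    transmit only a limited (clipped) gain to avoid pre-echo artefacts
    during concealment; otherwise transmit the regular gain."""
    if cur_gain > onset_factor * prev_gain and correlation < corr_threshold:
        return clip_factor * prev_gain  # clipped gain, tied to previous frame
    return cur_gain                     # regular gain
```

Clipping to a multiple of the previous frame's gain keeps the concealed frame's energy consistent with the spectral bins it is actually built from.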
Switched Codec Scheme (TCX-ACELP)
(28) In a further embodiment, the encoder 1 is part of a switched codec, wherein the switched codec consists of at least two core coding schemes. A first core coding scheme uses ACELP and a second core coding scheme uses TCX. With reference to
(29) The encoder further comprises an ACELP processor 11 for processing ACELP-coded content 13, and a TCX processor 12 for processing TCX-coded content 14. The ACELP processor 11 is a commonly known processor using a conventional partial copy approach, wherein primary frames 15 are primary coded and redundant frames 16 are redundant-coded. The redundant frames 16 are a low-bitrate version of their corresponding primary frames 15.
(30) The TCX processor 12 processes frames that have been encoded according to the inventive concept. In a first branch 17, the encoded content 3 is provided in the form of primary frames 4b, 4c. In a second branch 18, the parameters 6 which enhance the concealment are provided in the form of “partial copies” 8a, 8b, such as shown in
(31) Still with reference to
(32) Assuming ACELP frames 15, 16 are processed using traditional partial redundant copy coding and TCX frames 4b, 4c, 8a, 8b are processed using the inventive approach, two main cases will occur where no special action is needed and the frames 4b, 4c, 8a, 8b, 15, 16 can be processed using the underlying core coder's 10 partial copy approach:
- an ACELP primary frame 15 with a partial copy 16 generated from a future ACELP frame on top, and
- a TCX primary frame 4c with a partial copy 8b generated from a future TCX frame 4b on top.
(33) However, in frames that are close to a core coder switch, two special cases can occur, namely:
- an ACELP primary frame 15 with a partial copy 8 generated from a future TCX frame on top, and
- a TCX primary frame 4 with a partial copy 16 generated from a future ACELP frame on top.
(34) For these cases, both core coders need to be configurable to create primary frames 4, 15 in combination with partial copies 8, 16 from the other coder type, without exceeding the required total size of a frame, so as to ensure a constant bitrate.
(35) Accordingly, the encoder 1 is configured to create a primary frame 4, 15 of one of the speech-like content type (ACELP) and the general audio content type (TCX) in combination with a partial copy 8, 16 of the other one of the speech-like content type and the general audio content type.
(36) However, there are more specific cases, where a more sophisticated selection of partial copies 8, 16 is appropriate, e.g.:
(37) First TCX Frame 4 After an ACELP Frame 15:
(38) If this frame 4 gets lost and thus is not available to the decoder, the inventive technique will TCX-conceal the frame 4 using partial copy information (parameters 6) that has been transported on top of another (hopefully not lost) frame. But as concealment needs a preceding frame for extrapolating the signal content, it is advantageous in this case to use ACELP concealment (as the previous frame was ACELP), which would make a TCX partial copy unnecessary. Thus it is decided already in the encoder 1 not to put a partial copy 8 on top of a TCX frame 4 after a switch.
(39) Accordingly, the encoder 1 is configured not to put a partial copy 8 on top of a TCX frame 4 after a switch, i.e. when the frame is the first TCX frame 4 after an ACELP frame 15.
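This encoder-side decision can be sketched as a single predicate; the core-type strings and the function name are illustrative assumptions.

```python
def attach_tcx_partial_copy(current_core, previous_core):
    """Decide whether to put a TCX partial copy on top of the current frame.
    The first TCX frame after an ACELP frame gets no partial copy, since
    the decoder would use ACELP concealment for that frame anyway."""
    if current_core != "TCX":
        return False                 # only TCX frames considered in this sketch
    return previous_core != "ACELP"  # skip the first TCX frame after a switch
```

The check needs only the core types of the current and previous frame, which the encoder already tracks for the switched-codec cases above.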
Signal-Adaptive Partial Copy Selection
(40) The signal (audio content) 2 can be analyzed before encoding to determine whether the usage of the inventive partial copy (using parameters 6) is favorable. For example, if the signal 2 could be concealed satisfactorily within the decoder without the help of additional partial copy info, i.e. the parameters 6, but the clean channel performance suffers because of the reduced primary frame 4 bitrate, the inventive partial copy usage (i.e. embedding parameters 6 in the bitstream 5) can e.g. be turned off, or a specifically reduced partial copy 8 can be used within the encoder 1.
(41) Accordingly, the encoder 1 is configured to analyze the signal 2 before encoding and to turn off the partial copy usage or to provide a reduced partial copy based on the analyzed signal 2.
(42) Generally, the encoder 1 is configured to provide partial redundant copies 8 which are constructed in a partial copy mode. In an embodiment, the encoder 1 is configured to choose between multiple partial copy modes which use different amounts of information and/or different parameter sets, wherein the selection of the partial copy mode is based on various parameters.
Construction of Partial Redundant Frame for TCX Frame
(43) In case of the TCX partial redundant frame type, a partial copy 8 consisting of some helper parameters 6 is used to enhance the frame loss concealment algorithm. In an embodiment, there are three different partial copy modes available, which are RF_TCXFD, RF_TCXTD1 and RF_TCXTD2. Similar to the PLC mode decision on the decoder side, the selection of the partial copy mode for TCX is based on various parameters such as the mode of the last two frames, the frame class, and the LTP pitch and gain. The parameters used for the selection of the mode may be equal to or different from the parameters for enhancing the concealment which are included in the “partial copy”.
(44) a) Frequency Domain Concealment (RF_TCXFD) Partial Redundant Frame Type
(45) According to an embodiment, at least one of the multiple partial copy modes is a frequency domain (“FD”) concealment mode, an example of which is described in the following. 29 bits are used for the RF_TCXFD partial copy mode. 13 bits are used for the LSF quantizer (e.g. for coding the LPC parameters), which is the same quantizer as used for regular low rate TCX coding. The global TCX gain is quantized using 7 bits. The classifier info (e.g. VOICED, UNVOICED, etc.) is coded on 2 bits.
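The three fields named above (13-bit LSF index, 7-bit global gain, 2-bit classifier) could be packed as in the following sketch. The field order is an assumption, and the remaining bits of the 29-bit budget, which carry signalling not described here, are not modelled.

```python
def pack_rf_tcxfd(lsf_index, gain_index, classifier_index):
    """Pack the RF_TCXFD fields into one integer:
    bits 21..9 LSF quantizer index, bits 8..2 TCX global gain,
    bits 1..0 classifier info. Layout is illustrative."""
    assert 0 <= lsf_index < (1 << 13)
    assert 0 <= gain_index < (1 << 7)
    assert 0 <= classifier_index < (1 << 2)
    return (lsf_index << 9) | (gain_index << 2) | classifier_index

def unpack_rf_tcxfd(word):
    """Inverse of pack_rf_tcxfd."""
    return (word >> 9) & 0x1FFF, (word >> 2) & 0x7F, word & 0x3
```

The round trip is lossless, which is the property the decoder-side concealment relies on when it reads these fields from the partial copy.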
(46) b) Time Domain Concealment (RF_TCXTD1 and RF_TCXTD2) Partial Redundant Frame Type
(47) According to an embodiment, at least two of the multiple partial copy modes are different time domain (“TD”) concealment modes, an example of which is described in the following. A first time domain concealment mode, namely the partial copy mode RF_TCXTD1 is selected if a frame 4c contains a transient or if the global gain of the frame 4c is (much) lower than the global gain of the previous frame 4b. Otherwise, the second time domain concealment mode, namely RF_TCXTD2 is chosen.
(48) Overall, 18 bits of side data are used for both modes: 9 bits are used to signal the TCX LTP (Long Term Prediction) lag, and 2 bits are used for signaling the classifier info (e.g. VOICED, UNVOICED, etc.).
Time Domain Concealment
(49) Depending on the implementation, the codec could be a transform domain codec only, or a switched codec (transform/time domain) using the time domain concealment described in [4] or [5]. Similar to the packet loss concealment mode decision on the decoder side described therein, the selection of the partial copy mode according to the present invention is based on various parameters, as mentioned above, e.g. the mode of the last two frames, the frame class, and the LTP pitch and gain.
(50) In case the time domain mode is chosen, the following parameters 6 can be transmitted: in case LTP data is present, the LTP lag is transmitted, and a classifier info is signaled (UNVOICED, UNVOICED TRANSITION, VOICED, VOICED TRANSITION, ONSET . . . ). Signal classification is used for signaling the content types UNVOICED, UNVOICED TRANSITION, VOICED TRANSITION, VOICED and ONSET. Typically this type of classification is used in speech coding and indicates whether tonal/predictive components are present in the signal or whether the tonal/predictive components are changing. Having this information available on the decoder side during concealment may help to determine the predictability of the signal, and thus can help to adjust the amplitude fade-out speed and the interpolation speed of the LPC parameters, and it can control the possible usage of high- or low-pass filtering of voiced or unvoiced excitation signals (e.g. for de-noising).
(51) Optionally, at least one of the following parameters 6 can also be transmitted: LPC parameters describing the full spectral range in case bandwidth extension is used for regular coding, the LTP gain, the noise level, and the pulse position.
(52) Most of the parameters 6 sent are directly derived from the actual frame 4 coded in the transform domain, so no additional complexity is caused. But if complexity is not an issue, a concealment simulation at the encoder 1 can be added to refine the parameters 6 that are sent.
(53) As mentioned above, multiple modes for the provision of the partial copy 8 can also be used. This permits sending different amounts of information or different parameter sets. For example, there are two modes for the time domain (TD). The partial copy mode TD1 could be selected if the frame 4c contains a transient or if the global gain of the frame 4c is much lower than the global gain of the previous frame 4b. Otherwise, TD2 is chosen. At the decoder, the pitch gain and the code gain will then be decreased by two different factors (0.4 and 0.7, respectively) to avoid producing a long stationary signal whenever the original signal 2 was more transient-like.
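The decoder-side attenuation just described can be sketched as follows; the factors 0.4 and 0.7 come from the paragraph above, while the function and table names are illustrative assumptions.

```python
TD_ATTENUATION = {"RF_TCXTD1": 0.4, "RF_TCXTD2": 0.7}

def attenuate_gains(partial_copy_mode, pitch_gain, code_gain):
    """Decrease the pitch gain and the code gain by the mode-dependent
    factor (0.4 for TD1, 0.7 for TD2), so that a transient-like original
    does not turn into a long stationary concealed signal."""
    factor = TD_ATTENUATION[partial_copy_mode]
    return factor * pitch_gain, factor * code_gain
```

TD1's stronger attenuation matches its selection criterion: it is chosen precisely for transient or gain-dropping frames, where a slowly decaying concealment would sound wrong.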
Multiple Frame Loss
(54) There is a further special case, namely the case of multiple frame loss. The pitch decoded from the partial copy 8b shall not be taken into account if the previous frame 4a is lost, because the pitch sent in the bitstream 5 was computed on the encoder side based on the ground truth; if the previous frame 4a is lost, the previously lost and concealed synthesis might differ considerably from the encoder ground truth. It is therefore better in general not to risk relying on the synchronicity of encoder and decoder in case of multiple frame loss, and to fix the pitch to the predicted pitch for the following lost frame instead of using the transmitted pitch.
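The rule reduces to a small decoder-side predicate; the function and parameter names are illustrative assumptions.

```python
def pitch_for_concealment(prev_frame_lost, transmitted_pitch, predicted_pitch):
    """If the previous frame was itself lost and concealed, the decoder's
    synthesis may differ from the encoder's ground truth, so ignore the
    pitch from the partial copy and fall back to the decoder's own
    predicted pitch for the following lost frame."""
    return predicted_pitch if prev_frame_lost else transmitted_pitch
```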
(55) The inventive concept of the encoder 1 shall be summarized in the following with reference to an embodiment as shown
(56) The encoder 1 receives an input signal which contains audio content 2. The audio content 2 may be speech-like content and/or general audio content such as music, background noise or the like.
(57) The encoder 1 comprises a core coder 10. The core coder 10 can use a core coding scheme for encoding speech-like content, such as ACELP, or a core coding scheme for encoding general audio content, such as TCX. The core coder 10 may also form part of a switched codec, i.e. the core coder 10 can switch between the speech-like content core coding scheme and the general audio content core coding scheme. In particular, the core coder 10 can switch between ACELP and TCX.
(58) As indicated in branch 20, the core coder 10 creates primary frames 4 which comprise an encoded representation of the audio content 2.
(59) The encoder 1 may further comprise a partial redundant frame provider 21. As indicated in branch 30, the core coder 10 may provide one or more parameters 6 to the partial redundant frame provider 21. These parameters 6 are parameters which enhance a concealment at the decoder side.
(60) Additionally or alternatively, the encoder 1 may comprise a concealment parameter extraction unit 22. The concealment parameter extraction unit 22 extracts the concealment parameters 6 directly from the audio signal, i.e. from the content 2, as indicated in branch 40. The concealment parameter extraction unit 22 provides the extracted parameters 6 to the partial redundant frame provider 21.
(61) The encoder 1 further comprises a mode selector 23. The mode selector 23 selectively chooses a concealment mode, which is also called partial redundant copy mode. Depending on the partial redundant copy mode, the mode selector 23 determines which parameters 6 are suitable for an error concealment at the decoder side.
(62) Therefore, the core coder 10 analyzes the signal, i.e. the audio content 2, and determines, based on the analyzed signal characteristics, certain parameters 24 which are provided to the mode selector 23. These parameters 24 are also referred to as mode selection parameters 24. For example, the mode selection parameters can be at least one of a frame class, the mode of the last two frames, LTP pitch and LTP gain. The core coder 10 provides these mode selection parameters 24 to the mode selector 23.
(63) Based on the mode selection parameters 24, the mode selector 23 selects a partial redundant copy mode. The mode selector 23 may selectively choose between three different partial redundant copy modes. In particular, the mode selector 23 may selectively choose between a frequency domain partial redundant copy mode and two different time domain partial redundant copy modes, e.g. TD1 and TD2, for example as described above.
(64) As indicated in branch 50, the mode selection information 25, i.e. the information regarding the selected partial redundant copy mode, is provided to the partial redundant frame provider 21. Based on the mode selection information 25, the partial redundant frame provider 21 selectively chooses parameters 6 that will be used, at the decoder side, for error concealment. Therefore, the partial redundant frame provider 21 creates and provides partial redundant frames 8 which contain an encoded representation of said error concealment parameters 6.
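The selection among the frequency domain and the two time domain partial redundant copy modes can be illustrated with a small sketch. The function, its parameter names and its threshold values are hypothetical assumptions for illustration; the standardized decision logic of the mode selector 23 is more involved.

```python
# Hypothetical sketch of the mode selector 23: choosing a partial
# redundant copy mode from the mode selection parameters 24 (frame
# class, mode of the last two frames, LTP pitch, LTP gain). The
# 0.4 threshold and the exact rules are assumptions for illustration.
def select_partial_copy_mode(frame_class, last_two_modes, ltp_pitch, ltp_gain):
    """Return 'FD', 'TD1' or 'TD2' for the partial redundant copy mode."""
    # Without a reliable pitch track, a frequency domain copy is safer.
    if ltp_pitch is None or ltp_gain < 0.4:  # assumed threshold
        return "FD"
    # Transient frames get the first time domain mode.
    if frame_class == "TRANSIENT":
        return "TD1"
    return "TD2"
```

A real implementation would also inspect the modes of the last two frames, which the simplified sketch accepts but ignores.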
(65) Stated differently, the partial redundant frame provider 21 provides signal specific partial redundant copies. These partial redundant copies are provided in partial redundant frames 8, wherein each partial redundant frame 8 contains at least one error concealment parameter 6.
(66) As indicated at the branches 20 and 60, the encoder 1 combines the primary frames 4 and the partial redundant frames 8 into the outgoing bitstream 5. In the case of a packet-based network, primary frames 4 and partial redundant frames 8 are packed together into a transport packet, which is sent in the bitstream to the decoder side. However, it is to be noted that the primary frame 4c of a current audio frame is packed into a packet 9 together with a partial redundant frame 8b (containing only the parameters 6 for enhancing a concealment) of a previous frame (i.e. a frame that has already been sent earlier in time).
(67) The bitstream 5 has a constant total bitrate. In order to ensure that the bitstream 5 is at or below the constant total bitrate, the encoder 1 controls the bitrate of the transport packet containing the combination of the primary frame and the partial redundant frame 8. Additionally or alternatively, the encoder 1 may comprise a bitrate controller 26 that takes over this functionality.
(68) In other words, the encoder 1 is configured to combine an encoded representation 8 of the at least one concealment parameter 6 of a current frame with a primary encoded representation 4 of a future frame (i.e. a frame that will be sent later in time than the current frame). Thus, the encoded representation 8 of the at least one error concealment parameter 6 of a current frame is sent with a time delay relative to the primary encoded representation 4 of this current frame.
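The piggy-backing just described can be sketched as follows; the data structures and function names are illustrative only, not the codec's actual packet format.

```python
# Sketch of the piggy-backing of partial redundant copies: the transport
# packet carrying the primary copy of frame n also carries the partial
# copy of the earlier frame n - offset. Illustrative structures only.
def packetize(primary_frames, partial_frames, fec_offset):
    packets = []
    for n, primary in enumerate(primary_frames):
        k = n - fec_offset  # earlier frame whose partial copy rides along
        packets.append({
            "primary": primary,
            "partial": partial_frames[k] if k >= 0 else None,
        })
    return packets

pkts = packetize(["P0", "P1", "P2", "P3"], ["R0", "R1", "R2", "R3"], 2)
# pkts[2] carries primary copy P2 together with partial copy R0 of frame 0
```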
(69) Stated differently, and still with reference to
DESCRIPTION OF THE DECODER
(70) According to an embodiment, the invention uses packet-switched, or packet-based networks. In this case, frames are sent in transport packets 9a, 9b, as shown in
(71) Stated differently, a partial copy 8a is an encoded representation of at least one error concealment parameter 6 of a current frame. The at least one error concealment parameter 6 has been selectively chosen by the encoder 1, as described before with reference to
(72) At the decoder 31, there may be two different cases regarding the transmitted frames 4, 8 or transport packets 9a, 9b, respectively.
Standard Decoding of Primary Encoded Representations
(73) In a first case, indicated by branch 70, the transmitted transport packets 9a, 9b are received in the correct order, i.e. in the same order as they have been sent at the encoder side.
(74) The decoder 31 comprises a decoding unit 34 for decoding the transmitted encoded audio content 2 contained in the frames. In particular, the decoding unit 34 is configured to decode the transmitted primary encoded representations 4b, 4c of certain frames. Depending on the encoding scheme of the respective frame, the decoder 31 may use the same scheme for decoding, i.e. a TCX decoding scheme for general audio content or an ACELP decoding scheme for speech-like content. Thus, the decoder 31 outputs a respectively decoded audio content 35.
Enhanced Error Concealment Using Encoded Representations of at Least One Error Concealment Parameter
(75) A second case may occur if a primary encoded representation 4 of a frame is defective, i.e. if a primary encoded representation 4 is lost, corrupted or delayed (for example because the transport packet 9a is lost, corrupted or delayed longer than a buffer length of the decoder), such as indicated by branch 80. The audio content will then have to be at least partly reconstructed by error concealment.
(76) Therefore, the decoder 31 comprises a concealment unit 36. The concealment unit 36 may use a concealment mechanism which is based on a conventional concealment mechanism, wherein, however, the concealment is enhanced (or supported) by one or more error concealment parameters 6 received from the encoder 1. According to an embodiment of the invention, the concealment unit 36 uses an extrapolation-based concealment mechanism, such as described in patent applications [4] and [5], which are incorporated herein by reference.
(77) Said extrapolation-based error concealment mechanism is used in order to reconstruct audio content that was available in a primary encoded representation 4 of a frame, in the case that this primary encoded representation 4 is defective, i.e. lost, corrupted or delayed. The inventive concept uses the at least one error concealment parameter 6 to enhance these conventional error concealment mechanisms.
(78) This shall be explained in more detail with reference to the embodiment shown in
(79) Stated differently, the encoded representation 8b of the at least one error concealment parameter 6 for reconstructing the defective audio content of the current frame is contained in transport packet 9b, while the primary encoded representation 4b of this current frame is contained in transport packet 9a.
(80) If it is detected by the decoder 31 that, for example, the primary encoded representation 4b of the current frame is defective, i.e. lost, corrupted or delayed, the defective audio content is reconstructed by using the afore-mentioned available error concealment mechanism. According to the present invention, the available error concealment mechanism is enhanced by using the at least one error concealment parameter 6 during error concealment.
(81) For this reason, the decoder 31 extracts the at least one error concealment parameter 6 from the encoded representation 8b contained in transport packet 9b. Based on the at least one parameter 6 that has been extracted, the decoder 31 selectively chooses between at least two concealment modes for at least partially reconstructing the defective audio content (in the sense that a concealed audio content is provided which is expected to be somewhat similar to the audio content of the lost primary encoded representation). In particular, the decoder 31 can choose between a frequency domain concealment mode and at least one time domain concealment mode.
(82) Frequency Domain Concealment (RF_TCXFD) Partial Redundant Frame Type
(83) In case of a frequency domain concealment mode, the encoded representation 8b of the at least one error concealment parameter 6 comprises one or more of an ISF/LSF parameter, a TCX global gain, a TCX global level, signal classifier information, window information such as the overlap length, and spectral peak positions to help tonal concealment.
(84) The respective extracted one or more parameters 6 are fed to the error concealment unit 36 which uses the at least one parameter 6 for enhancing the extrapolation-based error concealment in order to at least partially reconstruct the defective audio content. As a result, the decoder 31 outputs the concealed audio content 35.
(85) An embodiment of the present invention, which uses an example of a frequency domain concealment, is described below, wherein
(86) 29 bits are used for the RF_TCXFD partial copy mode (i.e. 29 bits are included in the encoded representation of error concealment parameters 6 and are used by the concealment unit 36). 13 bits are used for the LSF quantizer, which is the same as used for regular low-rate TCX coding. The global TCX gain is quantized using 7 bits. The classifier info is coded using 2 bits.
(87) Time Domain Concealment (RF_TCXTD1 and RF_TCXTD2) Partial Redundant Frame Type
(88) In case of a time domain concealment mode, the decoder 31 may selectively choose between at least two different time domain concealment modes in order to at least partially reconstruct the defective audio content.
(89) For example, a first mode RF_TCXTD1 is selected if the frame contains a transient or if the global gain of the frame is much lower than the global gain of the previous frame. Otherwise, a second mode RF_TCXTD2 is chosen.
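The stated decision rule between the two time domain modes can be sketched as follows; the 0.5 ratio standing in for "much lower" is an assumption, not a value taken from the codec.

```python
# Sketch of the stated rule: RF_TCXTD1 for transient frames or when the
# global gain drops sharply relative to the previous frame; RF_TCXTD2
# otherwise. The drop_ratio default of 0.5 is an assumed threshold.
def select_td_mode(is_transient, global_gain, prev_global_gain, drop_ratio=0.5):
    if is_transient or global_gain < drop_ratio * prev_global_gain:
        return "RF_TCXTD1"
    return "RF_TCXTD2"
```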
(90) In case of a time domain concealment mode, the encoded representation 8b of the at least one error concealment parameter 6 comprises one or more of an LSF parameter, a TCX LTP lag, a classifier information, LPC parameters, LTP gain, Noise Level and Pulse Position. The respective extracted one or more parameters 6 are fed to the error concealment unit 36 which uses the at least one parameter 6 for enhancing the extrapolation-based error concealment in order to at least partially reconstruct (or approximate) the defective audio content. As a result, the decoder 31 outputs the concealed audio content 35.
(91) An embodiment of the present invention, which uses an example of a time domain concealment, is described below, wherein
(92) Overall, 18 bits of side data (i.e. of parameters 6) are used for both modes: 9 bits are used to signal the TCX LTP lag and 2 bits for signaling the classifier info.
(93) The decoder 31 may be part of a codec using a TCX decoding scheme for decoding and/or concealing TCX frames, as described above. The decoder 31 may also be part of a codec using an ACELP coding scheme for decoding and/or concealing ACELP frames. In case of ACELP coding scheme, the encoded representation 8b of the at least one error concealment parameter 6 may comprise one or more of adaptive codebook parameters and a fixed codebook parameter.
(94) According to the invention, the decoder 31 identifies the type of the encoded representation of the at least one error concealment parameter 6 of a current frame 4b, and decoding and error concealment are performed based on whether only one or more adaptive codebook parameters (e.g. ACELP), only one or more fixed codebook parameters (e.g. ACELP), one or more adaptive codebook parameters and one or more fixed codebook parameters, TCX error concealment parameters 6, or Noise Excited Linear Prediction parameters are coded. If the current frame 4b or a previous frame 4a is concealed by using an encoded representation of at least one error concealment parameter 6 of the respective frame, the at least one error concealment parameter 6 of the current frame 4b, such as LSP parameters, the gain of the adaptive codebook, the fixed codebook or the BWE gain, is first obtained and then processed in combination with decoding parameters, classification information or spectral tilt from frames preceding or following the current frame 4b, in order to reconstruct the output signal 35, as described above. Finally, the frame is reconstructed based on the concealment scheme (e.g. time-domain concealment or frequency-domain concealment). The TCX partial info is decoded, but in contrast to an ACELP partial copy mode, the decoder 31 is run in concealment mode. The difference to the above-described conventional extrapolation-based concealment is that the at least one error concealment parameter 6, which is available from the bitstream 5, is used directly rather than being derived by said conventional concealment.
First EVS-Embodiment
(95) The following description passages provide a summary of the inventive concept with respect to the synergistic interaction between encoder 1 and decoder 31 using a so-called EVS (Enhanced Voice Services) Codec.
Introduction to EVS-Embodiment
(96) EVS (Enhanced Voice Services) offers partial redundancy based error robust channel aware mode at 13.2 kbps for both wideband and super-wideband audio bandwidths. Depending on the criticality of the frame, the partial redundancy is dynamically enabled or disabled for a particular frame, while keeping a fixed bit budget of 13.2 kbps.
Principles of Channel Aware Coding
(97) In a VoIP system, packets arrive at the decoder with random jitter in their arrival time. Packets may also arrive out of order. Since the decoder expects to be fed a speech packet every 20 msec to output speech samples in periodic blocks, a de-jitter buffer [6] is required to absorb the jitter in the packet arrival time. The larger the de-jitter buffer, the better its ability to absorb the jitter in the arrival time and, consequently, the fewer late-arriving packets are discarded. Voice communication is, however, also delay-critical, so it is essential to keep the end-to-end delay as low as possible in order that a two-way conversation can be sustained.
(98) The design of an adaptive de-jitter buffer reflects the above mentioned trade-offs. While attempting to minimize packet losses, the jitter buffer management algorithm in the decoder also keeps track of the delay in packet delivery as a result of the buffering. The jitter buffer management algorithm suitably adjusts the depth of the de-jitter buffer in order to achieve the trade-off between delay and late losses.
(99) With reference to
(100) The difference in time units between the transmit time of the primary copy 4a of a frame and the transmit time of the redundant copy 8a of the frame (piggy backed onto a future frame 4b) is called the FEC offset. If the depth of the jitter buffer at any given time is at least equal to the FEC offset, then it is quite likely that the future frame is available in the de-jitter buffer at the current time instance. The FEC offset is a configurable parameter at the encoder which can be dynamically adjusted depending on the network conditions.
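The stated relationship between jitter buffer depth and FEC offset can be expressed as a single predicate; the names are illustrative.

```python
# If the de-jitter buffer is at least as deep as the FEC offset, the
# future frame piggy-backing the partial copy of a lost frame is likely
# already buffered at the current time instance. Illustrative names.
def partial_copy_likely_available(buffer_depth_frames, fec_offset):
    return buffer_depth_frames >= fec_offset
```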
(101) The concept of partial redundancy in EVS with FEC offset equal to [7] is shown in
(102) The EVS channel aware mode transmits redundancy in-band as part of the codec payload as opposed to transmitting redundancy at the transport layer (e.g., by including multiple packets in a single RTP payload). Including the redundancy in-band allows the transmission of redundancy to be either channel controlled (e.g., to combat network congestion) or source controlled. In the latter case, the encoder can use properties of the input source signal to determine which frames are most critical for high quality reconstruction at the decoder and selectively transmit redundancy for those frames only. Another advantage of in-band redundancy is that source control can be used to determine which frames of input can best be coded at a reduced frame rate in order to accommodate the attachment of redundancy without altering the total packet size. In this way, the channel aware mode includes redundancy in a constant-bit-rate channel (13.2 kbps).
Bit-Rate Allocation for Primary and Partial Redundant Frame Coding
(103) Primary Frame Bit-Rate Reduction
(104) A measure of compressibility of the primary frame is used to determine which frames can best be coded at a reduced frame rate. For TCX frames, the 9.6 kbps setup is applied for WB as well as for SWB. For ACELP, the following applies. The coding mode decision coming from the signal classification algorithm is first checked. Speech frames classified for Unvoiced Coding (UC) or Voiced Coding (VC) are suitable for compression. For the Generic Coding (GC) mode, the correlation (at pitch lag) between adjacent sub-frames within the frame is used to determine compressibility. Primary frame coding of the upper band signal (i.e., from 6.4 to 14.4 kHz in SWB and 6.4 to 8 kHz in WB) in channel aware mode uses time-domain bandwidth extension (TBE). For SWB TBE in channel aware mode, a scaled-down version of the non-channel aware mode framework is used to reduce the number of bits used for the primary frame. The LSF quantization is performed using an 8-bit vector quantization in channel aware mode, while a 21-bit scalar quantization based approach is used in non-channel aware mode. The SWB TBE primary frame gain parameters in channel aware mode are encoded similarly to those of non-channel aware mode at 13.2 kbps, i.e., 8 bits for gain parameters. The WB TBE in channel aware mode uses similar encoding as used in 9.6 kbps WB TBE of non-channel aware mode, i.e., 2 bits for LSF and 4 bits for gain parameters.
(105) Partial Redundant Frame Coding
(106) The size of the partial redundant frame is variable and depends on the characteristics of the input signal. A criticality measure is also an important metric. A frame is considered critical to protect when loss of the frame would cause significant impact to the speech quality at the receiver. The criticality also depends on whether the previous frames were lost. For example, a frame may go from being non-critical to critical if the previous frames were also lost. Parameters computed from the primary copy coding, such as coder type classification information, subframe pitch lag, factor M, etc., are used to measure the criticality of a frame. The threshold to determine whether a particular frame is critical is a configurable parameter at the encoder which can be dynamically adjusted depending on the network conditions. For example, under high FER conditions it may be desirable to adjust the threshold to classify more frames as critical. Partial frame coding of the upper band signal relies on coarse encoding of gain parameters and interpolation/extrapolation of LSF parameters from the primary frame. The TBE gain parameters estimated during the primary frame encoding of the (n-FEC offset)-th frame are re-transmitted during the n-th frame as partial copy information. Depending on the partial frame coding mode, i.e., GENERIC or VOICED or UNVOICED, the re-transmission of the gain frame uses different quantization resolution and gain smoothing.
(107) The following sections describe the different partial redundant frame types and their composition.
(108) Construction of Partial Redundant Frame for Generic and Voiced Coding Modes
(109) In the coding of the redundant version of the frame, a factor M is determined based on the adaptive and fixed codebook energy.
(110)
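The equation of paragraph (110) is not reproduced in this text. Consistent with the description in paragraph (111), with M low when the fixed codebook carries most of the information and M high when the adaptive codebook does, one plausible form, offered only as an assumed reconstruction, is:

```latex
M = \frac{E(ACB)}{E(ACB) + E(FCB)}
```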
(111) In this equation, E(ACB) denotes the adaptive codebook energy and E(FCB) denotes the fixed codebook energy. A low value of M indicates that most of the information in the current frame is carried by the fixed codebook contribution. In such cases, the partial redundant copy (RF_NOPRED) is constructed using one or more fixed codebook parameters only (FCB pulses and gain). A high value of M indicates that most of the information in the current frame is carried by the adaptive codebook contribution. In such cases, the partial redundant copy (RF_ALLPRED) is constructed using one or more adaptive codebook parameters only (pitch lag and gain). If M takes intermediate values, then a mixed coding mode is selected where one or more adaptive codebook parameters and one or more fixed codebook parameters are coded (RF_GENPRED). Under Generic and Voiced Coding modes, the TBE gain frame values are typically low and exhibit little variance. Hence a coarse TBE gain frame quantization with gain smoothing is used.
(112) Construction of Partial Redundant Frame for Unvoiced Coding Mode
(113) The low bit-rate Noise Excited Linear Prediction coding scheme is used to construct a partial redundant copy for an unvoiced frame type (RF_NELP). In Unvoiced coding mode, the TBE gain frame has a wider dynamic range. To preserve this dynamic range, the TBE gain frame quantization in Unvoiced coding mode uses a similar quantization range as that of the one used in the primary frame.
(114) Construction of Partial Redundant Frame for TCX Frame
(115) In case of a TCX partial redundant frame type, a partial copy consisting of some helper parameters is used to enhance the frame loss concealment algorithm. There are three different partial copy modes available: RF_TCXFD, RF_TCXTD1 and RF_TCXTD2. Similar to the PLC mode decision on the decoder side, the selection of the partial copy mode for TCX is based on various parameters such as the mode of the last two frames, the frame class, LTP pitch and gain.
(116) Frequency Domain Concealment (RF_TCXFD) Partial Redundant Frame Type
(117) 29 bits are used for the RF_TCXFD partial copy mode. 13 bits are used for the LSF quantizer, which is the same as used for regular low-rate TCX coding. The global TCX gain is quantized using 7 bits. The classifier info is coded using 2 bits.
(118) Time Domain Concealment (RF_TCXTD1 and RF_TCXTD2) Partial Redundant Frame Type
(119) The partial copy mode RF_TCXTD1 is selected if the frame contains a transient or if the global gain of the frame is much lower than the global gain of the previous frame. Otherwise RF_TCXTD2 is chosen.
(120) Overall, 18 bits of side data are used for both modes: 9 bits are used to signal the TCX LTP lag and 2 bits for signalling the classifier info.
(121) RF_NO_DATA Partial Redundant Frame Type
(122) This is used to signal a configuration where the partial redundant copy is not sent and all bits are used towards primary frame coding.
(123) The primary frame bit-rate reduction and partial redundant frame coding mechanisms together determine the bit-rate allocation between the primary and redundant frames to be included within a 13.2 kbps payload.
Decoding
(124) At the receiver, the de-jitter buffer provides a partial redundant copy of the current lost frame if it is available in any of the future frames. If present, the partial redundant information is used to synthesize the lost frame. In the decoding, the partial redundant frame type is identified, and decoding is performed based on whether only one or more adaptive codebook parameters, only one or more fixed codebook parameters, one or more adaptive codebook parameters and one or more fixed codebook parameters, TCX frame loss concealment helper parameters, or Noise Excited Linear Prediction parameters are coded. If the current frame or the previous frame is a partial redundant frame, the decoding parameters of the current frame, such as LSP parameters, the gain of the adaptive codebook, the fixed codebook or the BWE gain, are first obtained and then post-processed according to decoding parameters, classification information or spectral tilt from previous or future frames of the current frame. The post-processed parameters are used to reconstruct the output signal. Finally, the frame is reconstructed based on the coding scheme. The TCX partial info is decoded, but in contrast to the ACELP partial copy mode, the decoder is run in concealment mode. The difference to regular concealment is simply that the parameters available from the bitstream are used directly and not derived by concealment.
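The decoding dispatch described above can be sketched as follows. The function and method names are hypothetical, and the decoder object is a stand-in for the actual EVS decoder state, not its real API.

```python
# Hypothetical dispatch: ACELP partial copies are decoded directly,
# whereas for TCX partial copies the decoder is run in (guided)
# concealment mode, using the transmitted helper parameters directly.
def decode_partial_copy(rf_frame_type, params, decoder):
    if rf_frame_type in ("RF_ALLPRED", "RF_NOPRED", "RF_GENPRED", "RF_NELP"):
        return decoder.decode_acelp_partial(rf_frame_type, params)
    if rf_frame_type in ("RF_TCXFD", "RF_TCXTD1", "RF_TCXTD2"):
        return decoder.conceal_tcx_guided(params)  # concealment mode
    raise ValueError("unknown RF frame type: " + rf_frame_type)
```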
Channel Aware Mode Encoder Configurable Parameters
(125) The channel aware mode encoder may use the following configurable parameters to adapt its operation to track the channel characteristics seen at the receiver. These parameters may be computed at the receiver and communicated to the encoder via a receiver-triggered feedback mechanism.
(126) Optimal partial redundancy offset (o): The difference in time units between the transmit time of the primary copy of a frame (n) and the transmit time of the redundant copy of that frame, which is piggy-backed onto a future frame (n+X), is called the FEC offset X. The optimal FEC offset is a value which maximizes the probability of availability of a partial redundant copy when there is a frame loss at the receiver.
(127) Frame erasure rate indicator (p) having the following values: LO(low) for FER rates <5% or HI (high) for FER>5%. This parameter controls the threshold used to determine whether a particular frame is critical or not. Such an adjustment of the criticality threshold is used to control the frequency of partial copy transmission. The HI setting adjusts the criticality threshold to classify more frames as critical to transmit as compared to the LO setting.
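The frame erasure rate indicator can be sketched directly from the stated thresholds. The text gives LO for FER rates below 5% and HI above 5%; treating exactly 5% as HI is an assumption made here for completeness.

```python
# Frame erasure rate indicator p: LO below 5% FER, HI otherwise.
# The HI setting classifies more frames as critical, increasing the
# frequency of partial copy transmission. Boundary case is assumed.
def fer_indicator(fer_percent):
    return "LO" if fer_percent < 5.0 else "HI"
```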
(128) It is noted that these encoder configurable parameters are optional, with defaults set to p=HI and o=3.
Second EVS-Embodiment
(129) The following description passages describe an exemplary embodiment of the inventive concept which is used in packet-switched networks, such as Voice-over-IP (VoIP), Voice-over-LTE (VoLTE) or Voice-over-WiFi (VoWiFi).
(130) A highly error resilient mode of the newly standardized 3GPP EVS speech codec is described. Compared to the AMR-WB codec and other conversational codecs, the EVS channel aware mode offers significantly improved error resilience in voice communication over packet-switched networks such as Voice-over-IP (VoIP) and Voice-over-LTE (VoLTE). The error resilience is achieved using a form of in-band forward error correction. Source-controlled coding techniques are used to identify candidate speech frames for bitrate reduction, leaving spare bits for transmission of partial copies of prior frames such that a constant bit rate is maintained. The self-contained partial copies are used to improve the error robustness in case the original primary frame is lost or discarded due to late arrival. Subjective evaluation results from ITU-T P.800 Mean Opinion Score (MOS) tests are provided, showing improved quality under channel impairments as well as negligible impact to clean channel performance.
Introduction
(131) In packet-switched networks, packets may be subjected to varying scheduling and routing conditions, which results in time-varying end-to-end delay. Such delay jitter cannot be handled by most conventional speech decoders and voice post-processing algorithms, which typically expect the packets to be received at fixed time intervals. Consequently, a de-jitter buffer (also referred to as Jitter Buffer Management (JBM) [8], [13]) is typically used in the receiving terminal to remove jitter and deliver packets to the decoder in the correct sequential order.
(132) The longer the de-jitter buffer, the better its ability to remove jitter and the greater the likelihood that jitter can be tolerated without discarding packets due to late arrival (or, buffer underflow). However, end-to-end delay is a key determiner of call quality in conversational voice networks, and the ability of the JBM to absorb jitter without adding excessive buffering delay is an important requirement. Thus, a trade-off exists between JBM delay and the jitter induced packet loss at the receiver. JBM designs have evolved to offer increasing levels of performance while maintaining minimal average delay [8]. Aside from delay jitter, the other primary characteristic of packet-switched networks is the presence of multiple consecutive packet losses (error bursts), which are more commonly seen than on circuit switched networks. Such bursts can result from bundling of packets at different network layers, scheduler behavior, poor radio frequency coverage, or even a slow-adapting JBM. However, the de-jitter buffer—an essential component for VoIP—can be leveraged for improved underflow prevention and more sophisticated packet loss concealment [8]. One such technique is to use forward error correction by transmitting encoded information redundantly for use when the original information is lost at the receiver.
Channel Aware Mode in the EVS Codec
(133) The EVS Channel Aware mode introduces a novel technique for transmitting redundancy in-band as part of the codec payload in a constant bitrate stream, and is implemented for wideband (WB) and super-wideband (SWB) at 13.2 kbps. This technique is in contrast to prior codecs, for which redundancy is typically added as an afterthought by defining mechanisms to transmit redundancy at the transport layer. For example, the AMR-WB RTP payload format allows for bundling of multiple speech frames to include redundancy into a single RTP payload [9]. Alternatively, RTP packets containing single speech frames can be simply re-transmitted at a later time.
(134)
Channel Aware Encoding
(135)
(136) Strongly-voiced and unvoiced frames are suitable for carrying partial copies of a previous frame with negligible perceptual impact to the primary frame quality. If the current frame is allowed to carry the partial copy, it is signaled by setting RfFlag in the bit stream to 1, or 0 otherwise. If the RfFlag is set to 1, then the number of bits, B_primary, available to encode the current primary frame is determined by compensating for the number of bits, B_RF, already used up by the accompanying partial copy, i.e., B_primary=264−B_RF at 13.2 kbps constant total bit rate. The number of bits, B_RF, can range from 5 to 72 bits depending on frame criticality and RF frame type (Section 3.2).
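The bit budget follows directly from the stated figures: 13.2 kbps at 20 ms per frame gives 264 bits, from which the partial copy's bits are subtracted when RfFlag is set. The function and variable names in this sketch are illustrative.

```python
# 13.2 kbps at 20 ms per frame gives 264 bits per frame. With RfFlag
# set, the accompanying partial copy's bits come out of the primary
# frame's budget. Names are illustrative, not from the specification.
FRAME_BITS = 13200 * 20 // 1000  # 264 bits

def primary_bit_budget(rf_flag, b_rf):
    if not rf_flag:
        return FRAME_BITS
    assert 5 <= b_rf <= 72  # partial copy size range given in the text
    return FRAME_BITS - b_rf
```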
(137) Primary Frame Coding
(138) The “primary frame coding” module 83 shown in
(139) Dietz et al. [16] give an overview of various advancements to the EVS primary modes that further improve the coding efficiency of the ACELP technology beyond the 3GPP AMR-WB coding efficiency [21]. The EVS Channel Aware mode leverages these ACELP and TCX core advancements for primary frame encoding. Additionally, as the partial copy uses a varying number of bits across frames, the primary frame encoding also needs to accommodate a correspondingly adaptive bit allocation.
(140) Redundant Frame Coding
(141) The “redundant frame (RF) coding” module 84 performs compact re-encoding of only those parameters that are critical to protect. The set of critical parameters are identified based on the frame's signal characteristics and are re-encoded at a much lower bitrate (e.g., less than 3.6 kbps). The “bit packer” module 85 arranges the primary frame bit-stream 86 and the partial copy 87 along with certain RF parameters such as RF frame type and FEC offset (see Table I) at fixed locations in the bit-stream.
(142) TABLE I. BIT ALLOCATION FOR CHANNEL AWARE CODING AT 13.2 KBPS

  Core coder                    ACELP                    TCX/IGF
  Bandwidth                 WB         SWB
  Signalling information        5 (bwidth, coder type, RfFlag)
  Primary frame: Core       181-248    169-236          232-254
  Primary frame: TBE        6          18
  Partial frame: Core       0-62       0-62             0-22
  Partial frame: TBE        0-5        0-5
  FEC offset                           2
  RF frame type                        3
(143) A frame is considered as critical to protect when loss of that frame would cause significant impact to the speech quality at the receiver. The threshold to determine whether a particular frame is critical or not is a configurable parameter at the encoder, which can be dynamically adjusted depending on the network conditions. For example, under high FER conditions it may be desirable to adjust the threshold to classify more frames as critical. The criticality may also depend on the ability to quickly recover from the loss of a previous frame. For example if the current frame depends heavily on the previous frame's synthesis, then the current frame may get re-classified from being non-critical to critical in order to arrest the error propagation in case the previous frame were to be lost at the decoder.
(144) a) ACELP Partial Frame Encoding
(145) For ACELP frames, the partial copy encoding uses one of the four RF frame types, RF_NOPRED, RF_ALLPRED, RF_GENPRED, and RF_NELP depending on the frame's signal characteristics. Parameters computed from the primary frame coding such as frame type, pitch lag, and factor τ are used to determine the RF frame type and criticality, where
(146) τ = E_ACB / (E_ACB + E_FCB),

(147) E_ACB denotes the adaptive codebook (ACB) energy, and E_FCB denotes the fixed codebook (FCB) energy. A low value of τ (e.g., 0.15 and below) indicates that most of the information in the current frame is carried by the FCB contribution. In such cases, the RF_NOPRED partial copy encoding uses one or more FCB parameters (e.g., FCB pulses and gain) only. Conversely, a high value of τ (e.g., 0.35 and above) indicates that most of the information in the current frame is carried by the ACB contribution. In such cases, the RF_ALLPRED partial copy encoding uses one or more ACB parameters (e.g., pitch lag and gain) only. If τ is in the range [0.15, 0.35], then the mixed coding mode RF_GENPRED uses both ACB and FCB parameters for partial copy encoding. For UNVOICED frames, low-bitrate noise-excited linear prediction (NELP) [16] is used to encode the RF_NELP partial copy. The upper-band partial copy coding relies on coarse encoding of gain parameters and extrapolation of LSF parameters from the previous frame [11].
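The RF frame type decision just described can be sketched as follows, taking τ as the share of the excitation energy carried by the adaptive codebook (consistent with the description above: low τ means FCB-dominated, high τ means ACB-dominated). Function names and the exact boundary handling are illustrative assumptions, not the reference implementation:

```python
# Illustrative sketch of the ACELP RF frame type decision, using the
# thresholds 0.15 and 0.35 given in the text. Names are ours.

def rf_frame_type(e_acb: float, e_fcb: float, unvoiced: bool) -> str:
    if unvoiced:
        return "RF_NELP"        # noise-excited linear prediction partial copy
    tau = e_acb / (e_acb + e_fcb)
    if tau < 0.15:
        return "RF_NOPRED"      # FCB-dominated: re-encode FCB pulses/gain only
    if tau > 0.35:
        return "RF_ALLPRED"     # ACB-dominated: re-encode pitch lag/gain only
    return "RF_GENPRED"         # mixed: re-encode both ACB and FCB parameters
```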
(148) b) TCX Partial Frame Encoding
(149) Obtaining a useful TCX partial copy would require spending many bits on coding the MDCT spectral data, which would significantly reduce the number of bits available for the primary frame and thus degrade the clean-channel quality. For this reason, the number of bits for TCX primary frames is kept as large as possible, while the partial copy carries only a set of control parameters that enables a highly guided TCX concealment.
(150) The TCX partial copy encoding uses one of three RF frame types, RF_TCXFD, RF_TCXTD1, and RF_TCXTD2. While RF_TCXFD carries control parameters for enhancing the frequency-domain concealment, RF_TCXTD1 and RF_TCXTD2 are used for time-domain concealment [20]. The TCX RF frame type selection is based on the current and previous frame's signal characteristics, including pitch stability, LTP gain, and the temporal trend of the signal. Critical parameters such as the signal classification, the LSPs, the TCX gain, and the pitch lag are encoded in the TCX partial copy.
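A hedged sketch of this mode choice: time-domain partial copies make sense when the signal is tonal and predictable (stable pitch, strong LTP gain), and the frequency-domain mode is used otherwise. The threshold value and the TD1-vs-TD2 split on the temporal trend below are illustrative assumptions, not the values used by the EVS reference code:

```python
# Illustrative TCX RF frame type selection based on the signal
# characteristics named in the text. Thresholds and names are assumptions.

def tcx_rf_type(pitch_stable: bool, ltp_gain: float, energy_rising: bool) -> str:
    if pitch_stable and ltp_gain > 0.5:            # tonal, predictable frame
        # split the two time-domain modes on the temporal trend of the signal
        return "RF_TCXTD1" if energy_rising else "RF_TCXTD2"
    return "RF_TCXFD"    # carry LSPs, TCX gain, classifier for FD concealment
```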
(151) In background noise or inactive speech frames, a non-guided frame erasure concealment is sufficient to minimize the perceptual artifacts due to lost frames. During background noise, RF_NO_DATA is signaled to indicate the absence of a partial copy in the bit-stream. In addition, the first TCX frame after a switch from ACELP also uses RF_NO_DATA, because extrapolation data is lacking in such a coding-type-switching scenario.
Channel Aware Decoding
(152)
(153) Interface with JBM
(154) As described earlier, if the N-th frame is not available (lost or delayed) at its play-out time, the JBM is checked for the availability of a future (N+K)-th frame that contains the partial redundancy of the current frame, where K ∈ {2, 3, 5, 7}. The partial copy of a frame typically arrives after the primary frame. JBM delay adaptation mechanisms are used to increase the likelihood that partial copies are available in future frames, especially for the higher FEC offsets of 5 and 7. The EVS JBM conforms to the delay-jitter requirements specified by 3GPP TS 26.114 [10] for all EVS modes, including the channel aware mode.
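The decoder-side lookup can be sketched as follows; the buffer model and the field names are illustrative, not the EVS JBM data structures:

```python
# Minimal sketch: when frame N is missing at play-out time, scan the
# de-jitter buffer for a future frame N+K (K in {2, 3, 5, 7}) whose payload
# carries the partial copy of frame N. Names are illustrative.

ALLOWED_OFFSETS = (2, 3, 5, 7)

def find_partial_copy(buffer: dict, n: int):
    """Return (future_frame_no, partial_bits), or None if no copy is available."""
    for k in ALLOWED_OFFSETS:
        frame = buffer.get(n + k)
        # the future frame must be present and carry a copy with matching offset
        if frame and frame.get("fec_offset") == k and frame.get("partial"):
            return n + k, frame["partial"]
    return None
```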
(155) In addition to the functionality described above, the EVS JBM [13] computes the channel error rate and an optimum FEC offset K that maximizes the availability of the partial redundant copy based on the channel statistics. The computed optimum FEC offset and the channel error rate can be transmitted back to the encoder through a receiver feedback mechanism (e.g., a codec mode request (CMR) [9]) to adapt the FEC offset and the rate at which the partial redundancy is transmitted, improving the end-user experience.
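A minimal sketch of picking the offset K: for each candidate, estimate from the observed packet delays the probability that frame N+K has arrived by the play-out time of frame N, and keep the K with the highest availability. The availability criterion below is a simplified assumption, not the EVS JBM algorithm:

```python
# Illustrative choice of the optimum FEC offset K from delay statistics.
# A partial copy riding in frame N+K gets a K-frame head start relative to
# frame N's play-out deadline; EVS frames are 20 ms long.

ALLOWED_OFFSETS = (2, 3, 5, 7)
FRAME_MS = 20  # EVS frame duration

def optimal_fec_offset(delays_ms: list, playout_budget_ms: float) -> int:
    """Pick the K that maximizes the share of packets arriving early enough."""
    def availability(k: int) -> float:
        # frame N+K is usable if its delay, minus the head start, fits the budget
        ok = sum(1 for d in delays_ms if d - k * FRAME_MS <= playout_budget_ms)
        return ok / len(delays_ms)
    return max(ALLOWED_OFFSETS, key=availability)
```

On a tie, `max` keeps the smallest allowed offset, which matches the intuition that a smaller K recovers the frame with less added delay.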
(156) ACELP and TCX Partial Frame Decoding
(157) The “bit-stream parser” module 98 in
(158) Furthermore, if the previous frame used a partial copy for synthesis, post-processing is performed in the current frame for a smoother evolution of the LSPs and temporal gains. The post-processing is controlled based on the frame type (e.g., VOICED or UNVOICED) and the spectral tilt estimated in the previous frame. If the current frame corresponds to a TCX partial copy, the RF parameters are used to perform a highly guided concealment.
Subjective Quality Tests
(159) Extensive testing of the EVS channel aware mode has been conducted via subjective ITU-T P.800 Mean Opinion Score (MOS) tests at an independent test laboratory with 32 naïve listeners. The tests covered both WB and SWB, using the absolute category rating (ACR) and degradation category rating (DCR) test methodologies [24], respectively. Since the channel aware mode is specifically designed to improve performance in VoLTE networks, evaluating its performance in such networks is critical for establishing the potential benefits. Therefore, testing was conducted using codec outputs from simulations in which VoLTE-like patterns of packet delays and losses were applied to received RTP packets before insertion into the de-jitter buffer. Four of these patterns, or delay-loss profiles, were derived from real-world call logs of RTP packet arrival times collected in VoLTE networks in South Korea and the United States.
(160) The resulting profiles closely mimic VoLTE network characteristics under different channel error conditions. In deriving the profiles, characteristics such as jitter, the temporal evolution of jitter, and the burstiness of errors were considered. These four profiles are identified in
(161) In addition to the VoLTE profiles, all codecs considered here were tested under error-free conditions and also for an HSPA profile included in the 3GPP MTSI specification [10] that yields about 6% frame erasure rate at the decoder. In all of the experiments, the EVS conditions used the reference EVS de-jitter buffer [13]. The AMR-WB conditions used a fixed delay buffer to convert delay-loss profiles to packet-loss profiles, such that packets experiencing a delay greater than a fixed threshold are discarded as described in the EVS performance requirements specification [14].
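The fixed-delay-buffer conversion used for the AMR-WB conditions can be sketched as follows; the representation of the input profile is an illustrative assumption:

```python
# Illustrative conversion of a delay-loss profile into a packet-loss
# profile: a packet is discarded when its delay exceeds a fixed threshold,
# and packets already lost on the channel (marked None) stay lost.

def to_loss_profile(delays_ms: list, threshold_ms: float) -> list:
    """True = packet usable at play-out, False = treated as lost."""
    return [d is not None and d <= threshold_ms for d in delays_ms]
```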
(162) The ACR scores for the WB case are shown in
(163) The performance advantage of the channel aware mode is similarly compelling in the super-wideband mode, the results for which are shown in
Conclusions
(164) The Channel Aware coding mode of the new 3GPP EVS codec offers users and network operators a highly error-resilient coding mode for VoLTE at a capacity operating point similar to the most widely used bit rates of existing deployed services based on AMR and AMR-WB. The mode gives the codec the ability to maintain a high-quality WB and SWB conversational voice service even in the presence of the high FER that may occur during network congestion, poor radio-frequency coverage, handoffs, or in best-effort channels. Despite its graceful quality degradation under high loss, the impact on quality is negligible under low-loss or even no-loss conditions. The error robustness offered by the Channel Aware mode further allows certain system-level aspects to be relaxed, such as the frequency of re-transmissions and scheduler delays. This in turn has potential benefits such as increased network capacity, reduced signaling overhead, and power savings in mobile handsets. Use of the Channel Aware mode can therefore be beneficial in most networks, without capacity impact, to ensure high-quality communications.
(165) Summarizing, the present invention exploits the fact that the coder knows about the channel quality to improve the speech/audio quality under erroneous conditions. In contrast to state-of-the-art channel aware coding, the idea is not to have a partial copy that is merely a low-bitrate version of the primary encoded frame; instead, the partial copy consists of multiple key parameters that drastically enhance the concealment. Therefore, the decoder needs to distinguish between the regular concealment mode, where all parameters are concealed, and the frame-loss mode, where the partial copy parameters are available. Special care needs to be taken for burst frame loss in cases where the concealment needs to switch between partial and full concealment.
(166) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
(167) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
(168) The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(169) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
(170) Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(171) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(172) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(173) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(174) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
(175) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(176) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(177) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(178) A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
(179) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
(180) The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
(181) The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
REFERENCES
(183)
[1] “RTP Payload for Redundant Audio Data,” Internet Engineering Task Force, RFC 2198, September 1997.
[2] M. Westerlund et al., “Forward error correction in speech coding,” U.S. Pat. No. 6,757,654, 29 Jun. 2004.
[3] C. Boutremans and J.-Y. Le Boudec, “Adaptive joint playout buffer and FEC adjustment for Internet telephony,” in Proc. IEEE INFOCOM 2003, Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies, April 2003.
[4] Patent application: “Audio decoder and method for providing a decoded audio information using an error concealment based on a time domain excitation signal.”
[5] Patent application: “Audio decoder and method for providing a decoded audio information using an error concealment modifying a time domain excitation signal.”
[6] 3GPP TS 26.448: “Codec for Enhanced Voice Services (EVS); Jitter Buffer Management.”
[7] 3GPP TS 26.442: “Codec for Enhanced Voice Services (EVS); ANSI C code (fixed-point).”
[8] D. J. Sinder, I. Varga, V. Krishnan, V. Rajendran, and S. Villette, “Recent Speech Coding Technologies and Standards,” in Speech and Audio Processing for Coding, Enhancement and Recognition, T. Ogunfunmi, R. Togneri, and M. Narasimha, Eds., Springer, 2014.
[9] J. Sjoberg, M. Westerlund, A. Lakaniemi, and Q. Xie, “RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs,” RFC 4867, April 2007. [Online]. Available: http://tools.ietf.org/html/rfc4867
[10] 3GPP TS 26.114: “Multimedia Telephony Service for IMS,” V12.7.0, September 2014.
[11] 3GPP TS 26.445: “EVS Codec Detailed Algorithmic Description; 3GPP Technical Specification (Release 12),” 2014.
[12] 3GPP TS 26.447: “Codec for Enhanced Voice Services (EVS); Error Concealment of Lost Packets (Release 12),” 2014.
[13] 3GPP TS 26.448: “EVS Codec Jitter Buffer Management (Release 12),” 2014.
[14] 3GPP Tdoc S4-130522: “EVS Permanent Document (EVS-3): EVS performance requirements,” Version 1.4.
[15] S. Bruhn et al., “Standardization of the new EVS Codec,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[16] M. Dietz et al., “Overview of the EVS codec architecture,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[17] V. Atti et al., “Super-wideband bandwidth extension for speech in the 3GPP EVS codec,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[18] G. Fuchs et al., “Low delay LPC and MDCT-based Audio Coding in EVS,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[19] S. Disch et al., “Temporal tile shaping for spectral gap filling within TCX in EVS Codec,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[20] J. Lecomte et al., “Packet Loss Concealment Technology Advances in EVS,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[21] B. Bessette et al., “The adaptive multi-rate wideband speech codec (AMR-WB),” IEEE Trans. on Speech and Audio Processing, vol. 10, no. 8, pp. 620-636, November 2002.
[22] E. Ravelli et al., “Open loop switching decision based on evaluation of coding distortions for audio codecs,” submitted to IEEE ICASSP, Brisbane, Australia, April 2015.
[23] M. Jelinek, T. Vaillancourt, and J. Gibbs, “G.718: A New Embedded Speech and Audio Coding Standard with High Resilience to Error-Prone Transmission Channels,” IEEE Communications Magazine, vol. 47, no. 10, pp. 117-123, October 2009.
[24] ITU-T P.800: “Methods for Subjective Determination of Transmission Quality,” International Telecommunication Union (ITU), Series P, August 1996.