Method and apparatus for polyphonic audio signal prediction in coding and networking systems

Abstract

A method, device, and apparatus provide the ability to predict a portion of a polyphonic audio signal for compression and networking applications. The solution involves a framework of a cascade of long term prediction filters, which by design is tailored to account for all periodic components present in a polyphonic signal. This framework is complemented with a design method to optimize the system parameters. Specialization may include specific techniques for coding and networking scenarios, where the potential of each enhanced prediction is realized to considerably improve the overall system performance for that application. One specific technique provides enhanced inter-frame prediction for the compression of polyphonic audio signals, particularly at low delay. Another specific technique provides improved frame loss concealment capabilities to combat packet loss in audio communications.

Claims

1. A method for processing an audio signal, comprising: reconstructing an approximation of the audio signal in a processor, by concealing a missing portion of the audio signal utilizing estimation of the missing portion by a plurality of cascaded long term prediction filters in the processor, wherein each of the plurality of cascaded long term prediction filters corresponds to one periodic component of the audio signal.

2. The method of claim 1, wherein the missing portion of the audio signal is missing due to packet loss during transmission, or physical damage to storage media, or corruption of stored data.

3. The method of claim 1, wherein the concealing is done at a decoder that is processing encoded data of an audio signal to reconstruct an approximation of the audio signal; and the missing portion of the audio signal corresponds to a missing portion of the encoded data.

4. The method of claim 1, further comprising adapting one or more cascaded filter parameters of the cascaded long term prediction filters to local audio signal characteristics, wherein the cascaded filter parameters comprise one or more of: a number of filters in a cascade, a time lag parameter, and a gain parameter.

5. The method of claim 4, wherein: adapting the one or more cascaded filters parameters comprises adjusting the one or more cascaded filter parameters for one or more of the plurality of cascaded long term prediction filters, at a time, while fixing all other cascaded filter parameters; and iterating over all of the cascaded long term prediction filters until a desired level of performance is met.

6. The method of claim 5, wherein: there is access to the audio signal on both sides of the missing portion to be concealed; the desired level of performance corresponds to a minimum prediction error energy; and the method further comprises predicting, based on the available audio samples on one side of the missing portion, both the missing portion and the available audio samples on an other side of the missing portion, wherein a prediction error energy is calculated for the available audio samples on the other side.

7. The method of claim 5, wherein: there is access to one or more linear combinations of audio samples on both sides of the missing portion to be concealed; the desired level of performance corresponds to a minimum prediction error energy; and the method further comprises predicting, based on the available linear combinations of audio samples on one side of the missing portion, both the missing portion and the available linear combinations of audio samples on an other side of the missing portion, wherein a prediction error energy is calculated for the available linear combinations of audio samples on the other side.

8. The method of claim 1, wherein the plurality of cascaded long term prediction filters is utilized to generate a first approximation of the missing portion from available past signal information.

9. The method of claim 8, further comprising a second plurality of cascaded long term prediction filters for operation in a reverse direction, optimized to predict a past from future audio samples, and which are utilized to generate a second approximation of the missing portion from available future signal information.

10. The method of claim 9, further comprising calculating a weighted average of the first approximation and the second approximation of the missing portion.

11. The method of claim 10, wherein weights employed for calculating the weighted average depend on a position of an approximated sample within the missing portion.

12. The method of claim 10, further comprising predicting available audio samples or linear combinations thereof on an other side of the missing portion, in both forward and reverse directions; wherein weights employed for calculating the weighted average depend on prediction errors calculated, on the other side of the missing portion, in the forward and reverse directions.

13. A device for processing an audio signal, comprising: a processor for reconstructing an approximation of the audio signal, wherein the processor comprises a plurality of cascaded long term prediction filters coupled in a cascaded manner, each of the plurality of cascaded long term prediction filters corresponds to one periodic component of the audio signal, and the processor conceals a missing portion of the audio signal by utilizing estimation of the missing portion by the plurality of cascaded long term prediction filters.

14. The device of claim 13, wherein the device adapts one or more cascaded filter parameters of the cascaded long term prediction filters to local audio signal characteristics by: adjusting the one or more cascaded filter parameters for one or more of the plurality of cascaded long term prediction filters, at a time, while fixing all other cascaded filter parameters; and iterating over all of the cascaded long term prediction filters until a desired level of performance is met.

15. The device of claim 14, wherein: there is access to the audio signal on both sides of the missing portion to be concealed; the desired level of performance corresponds to a minimum prediction error energy; and the device predicts, based on the available audio samples on one side of the missing portion, both the missing portion and the available audio samples on an other side of the missing portion, wherein a prediction error energy is calculated for the available audio samples on the other side.

16. The device of claim 14, wherein: there is access to one or more linear combinations of audio samples on both sides of the missing portion to be concealed; the desired level of performance corresponds to a minimum prediction error energy; and the device predicts, based on the available linear combinations of audio samples on one side of the missing portion, both the missing portion and the available linear combinations of audio samples on an other side of the missing portion, wherein a prediction error energy is calculated for the available linear combinations of audio samples on the other side.

17. The device of claim 13, wherein: the plurality of cascaded long term prediction filters is utilized to generate a first approximation of the missing portion from available past signal information; the device further comprises a second plurality of cascaded long term prediction filters for operation in a reverse direction, optimized to predict a past from future audio samples, and which are utilized to generate a second approximation of the missing portion from available future signal information.

18. The device of claim 17, further comprising calculating a weighted average of the first approximation and the second approximation of the missing portion.

19. The device of claim 18, wherein weights employed for calculating the weighted average depend on a position of an approximated sample within the missing portion.

20. The device of claim 18, further comprising predicting available audio samples or linear combinations thereof on an other side of the missing portion, in both forward and reverse directions; wherein weights employed for calculating the weighted average depend on prediction errors calculated, on the other side of the missing portion, in the forward and reverse directions.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Referring now to the drawings in which like reference numbers represent corresponding parts throughout:

(2) FIG. 1 illustrates a cascaded analysis filter approach in accordance with one or more embodiments of the present invention;

(3) FIG. 2 illustrates a cascaded synthesis filter approach in accordance with one or more embodiments of the present invention;

(4) FIG. 3 illustrates an encoder of an audio compression system in accordance with one or more embodiments of the present invention;

(5) FIG. 4 illustrates a decoder of an audio compression system in accordance with one or more embodiments of the present invention;

(6) FIG. 5 illustrates an application using CLTP based compression in accordance with one or more embodiments of the present invention;

(7) FIG. 6 illustrates a typical signal in accordance with one or more embodiments of the present invention;

(8) FIG. 7 illustrates an application using CLTP based frame loss concealment in accordance with one or more embodiments of the present invention;

(9) FIG. 8 is an exemplary hardware and software environment used to implement one or more embodiments of the invention; and

(10) FIG. 9 illustrates the logical flow for processing an audio signal in accordance with one or more embodiments of the invention.

DETAILED DESCRIPTION OF THE INVENTION

(11) In the following description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

(12) Overview

(13) Most audio signals contain naturally occurring periodic sounds and exploiting redundancy due to these periodic components is critical to numerous important applications such as audio compression, audio networking, audio delivery to mobile devices, and audio source separation. For monophonic audio (which consists of a single periodic component) the Long Term Prediction (LTP) tool has been used successfully. This tool capitalizes on the periodic component of the waveform by selecting a past segment as the basis for prediction of the current frame. However, as described above, most audio signals are polyphonic in nature, consisting of a mixture of periodic signals. This renders the Long Term Prediction (LTP) results sub-optimal, as the mixture period equals the least common multiple of its individual component periods, which typically extends far beyond the duration over which the signal is stationary.

(14) Instead of seeking a past segment that represents a “compromise” for incompatible component periods, embodiments of the present invention comprises a more complex filter that caters to the individual signal components. More specifically, one may note that redundancies implicit in the periodic components of a polyphonic signal may offer a significant potential for compression gains and concealment quality improvement. Embodiments of the present invention exploit such redundancies by cascading LTP filters, each corresponding to individual periodic components of the signal, to form what is referred to as a “cascaded long term prediction” (CLTP) filter. In other words, every periodic component of the signal (in the current frame) may be predicted from its immediate history (i.e., the most recent previously reconstructed segment with which it is maximally correlated) by cascading LTP filters, each corresponding to an individual periodic component.

(15) As efficacy of such prediction is dependent on effective parameter estimation, prediction parameter optimization may target mean squared error (MSE) as a basic platform. Such a basic platform may then be adapted to specific coders and their distortion criteria (e.g., the perceptual distortion criteria of MPEG AAC). To estimate such prediction parameters at acceptable complexity (while approaching optimality), embodiments of the invention employ a recursive “divide and conquer” technique to estimate the parameters of all the LTP filters. More specifically, the optimal parameters of an individual filter in the cascade are found, while fixing all other filter parameters. This process is then iterated for all filters in a loop, until convergence or until a desired level of performance is met, to obtain the parameters of all LTP filters in the cascade. In compression systems, such a technique may also be employed in a backward adaptive way (e.g., in systems that use a simple quantization MSE distortion), to minimize the side information rate, as a decoder can mimic this procedure. In alternative compression systems (e.g., MPEG AAC), parameters may be estimated in two stages, where one first employs the backward adaptive MSE minimizing method to estimate a large subset of prediction parameters (which includes lags and preliminary gains of the CLTP filter, and per band prediction activation flags). In the next stage, the gains are further refined for the current frame, with respect to the perceptual criteria, and only refinement parameters are sent as side information. Low decoder complexity and moderate decoder complexity variants for such compression systems (e.g., for the MPEG AAC) may also be employed, wherein all the parameters are sent as side information to the decoder, or most of the parameters are sent as side information to the decoder, respectively. In such variants, parameter estimation is done in two stages where one first estimates a large subset of parameters to minimize MSE and in the next stage, the parameters are fine tuned to take perceptual distortion criteria into account. For frame loss concealment, a four stage process may be employed, wherein a preliminary set of parameters for CLTP are estimated from past reconstructed samples via the recursive technique. The parameters are then further enhanced via multiplicative factors to minimize the squared prediction error across future reconstructed samples or a linear combination thereof. Another set of parameters are estimated for predicting the lost frame in the reverse direction from future samples. Finally, the two sets of predicted samples are overlap-added with a triangular window to reconstruct the lost frame, depending on prediction error for available samples or linear combination thereof on the other side of the lost frame.

(16) Such embodiments have been evaluated after incorporation within existing systems, such as within the Bluetooth Sub-band Codec and MPEG AAC low delay (LD) mode coder. Results achieved through use of such embodiments show considerable gains achieved on a variety of polyphonic signals, thereby indicating the effectiveness of such embodiments.

(17) Detailed Technical Description

(18) A simple periodic signal with pitch period N can be described as follows:
x[n]=x[n−N] (1)

(19) However, naturally occurring periodic signals are not perfectly stationary and have non-integral pitch periods. Thus, a more accurate description is
x[n]=αx[n−N]+βx[n−N+1] (2)
where α and β capture amplitude changes and approximate the non-integral pitch period via a linear interpolation. A mixture of such periodic signals along with noise models a polyphonic audio signal, as described below

(20) $\begin{matrix} x [n] = {.Math.}_{i = 0}^{P - 1} x_{i} [n] + w [n] & (3) \end{matrix}$
where P is the number of periodic components, w[n] is a noise sequence, and x.sub.i[m] are periodic signals satisfying x.sub.i[n]=α.sub.ix.sub.i[n−N.sub.i]+β.sub.ix.sub.i[n−N.sub.i+1].

(21) Embodiments of the present invention comprise a filter that minimizes the prediction error energy. When all periodic components are filtered out, the prediction error is dependent only on the noise sequence (also known as w[n]) or the change in the signal during the time period (also referred to as the innovation). The related art of LTP typically attempts to resolve this issue by using a compromise solution, which minimizes the mean squared prediction error while using the history available for prediction of a future signal. Due to non-stationary nature of the signal over long durations, using the effective period of the polyphonic signal, which is the Least Common Multiple (LCM) of the periods of its individual components, as lag of the LTP is highly sub-optimal. Further, if the LCM is beyond the history available for prediction, the related art approach defaults to attempting to find an estimate despite incompatible periods for the signal components, which adds error to the prediction using such an approach.

(22) Embodiments of the present invention minimize or eliminate these deficiencies in the related art by cascading filters such that all of the periodic components are filtered out or canceled, leaving a minimum energy prediction error dependent only on the noise sequence. Such a cascaded long term prediction (CLTP) analysis filter for polyphonic signals described in equation (3) above is given below

(23) $\begin{matrix} H (z) = {.Math.}_{i = 0}^{P - 1} (1 - α_{i} z^{- N_{i}} - β_{i} z^{- N_{i} + 1}) & (4) \end{matrix}$

(24) FIG. 1 illustrates the cascaded long term prediction (CLTP) analysis filter in accordance with one or more embodiments of the invention. System 100 comprises filters 104, 106 and 108 put together to form the analysis filter H(z) given in equation (4). Although three filters 104-108 are shown, a larger or smaller number of filters can be used without departing from the scope of the present invention. As illustrated input signal 102 is processed through filters 104-108 that are cascaded. Each LTP filter 104-108 in this structure serves to filter (i.e., remove) a portion of input signal 102 leaving a residual signal 110. Signal 102 is typically a polyphonic audio signal, but can be a single periodic signal, a signal in a different frequency band, or any signal without departing from the scope of the present invention.

(25) FIG. 2 illustrates the cascaded long term prediction (CLTP) synthesis filter in accordance with one or more embodiments of the invention. System 200 comprises filters 104, 106 and 108 put together to form the synthesis filter, 1/H(z), where H(z) is given in equation (4). Although three filters 104-108 are shown, a larger or smaller number of filters can be used without departing from the scope of the present invention. As illustrated the residual signal 110 is processed through LTP filters 104-108 (with initial states 202-206) that are cascaded. Each LTP filter 104-108 in this structure serves to reconstruct a portion of the signal to produce the output signal 208.

(26) Parameter Estimation

(27) The parameters for each filter in the cascade can be estimated in several ways within the scope of the present invention. Parameter estimation specifically adapted for the application, for example the perceptual distortion criteria of an audio coder or accounting for all available information during frame loss concealment, is crucial to the effectiveness of this technique with real polyphonic signals. However, as a starting point to solve this problem, one may first derive a minimum mean squared prediction error technique to optimize the CLTP parameter set:
N.sub.i,α.sub.i,β.sub.i∀iε{0, . . . ,P−1}

(28) A straightforward purely combinatorial approach would be to evaluate all combinations from a predefined set of values to find the one that minimizes the prediction error. This can be done by first fixing the range of pitch periods to Q possibilities, then finding the best α.sub.i, β.sub.i for each of the Q.sup.P period combination and finally selecting the period combination that minimizes the mean squared prediction error. Clearly, the complexity of this approach grows exponentially with the number of periodic components. For the modest choice of Q=100 and P=5, there are Q.sup.P=10.sup.10 combinations to be re-evaluated every time the parameters undergo updates, resulting in prohibitive computational complexity. Thus, embodiments of the invention propose a “divide and conquer” recursive estimation technique. Other approaches, such as estimation exploiting application-specific information such as expected signal frequencies and bandwidth, or other parameter estimations can be employed within the scope of the present invention.

(29) One or more embodiments perform estimation by fixing the number of periodic components that are present in the incoming signal, and estimating the parameters for one filter based on that number while maintaining unchanged the parameters of other filters. Estimating parameters for a single prediction filter is a prediction problem involving correlation of current samples with past signal samples. For a given number of periodic components, P, to estimate the jth filter parameters, N.sub.j,α.sub.i,β.sub.i, all other filters are fixed and the partial filter is defined:

(30) ${\overline{H}}_{j} (z) = \underset{\forall i, i \neq j}{.Math.} (1 - α_{i} z^{- N_{i}} - β_{i} z^{- N_{i} + 1})$
and the corresponding residue
X.sub.j(z)=X(z)H.sub.j(z)

(31) The parameters of the jth filter H.sub.j(z)=1−α.sub.iz.sup.−N.sup.i−β.sub.iz.sup.−.sup.i.sup.+1 are optimized for the residue x.sub.j[m]. This boils down to the classic LTP problem, where for a given N the values α.sub.(j,N),β.sub.(j,N) are given by

(32) $[\begin{matrix} α_{(j, N)} \\ β_{(j, N)} \end{matrix}] = {[\begin{matrix} r_{(N, N)} & r_{(N - 1, N)} \\ r_{(N - 1, N)} & r_{(N - 1, N - 1)} \end{matrix}]}^{- 1} [\begin{matrix} r_{(0, N)} \\ r_{(0, N - 1)} \end{matrix}]$
where the correlation values r.sub.(k,l) are

(33) $r_{(k, l)} = {.Math.}_{m = Y_{start}}^{Y_{end}} x_{j} [m - k] x_{j} [m - l]$
where, Y.sub.start and Y.sub.end are the limits of summation and depend on the length of the available history and the length of the current frame. Stability of the synthesis filter used in prediction may be ensured by restricting α.sub.(j,N),β.sub.(j,N) solutions to only those that satisfy the sufficient stability criteria of:
|α.sub.(j,N)|+|β.sub.(j,N)|≦1

(34) For details on estimating parameters which satisfy the sufficient stability criteria, please refer to the provisional applications incorporated by reference herein. Given α.sub.(j,N),β.sub.(j,N), the optimal N.sub.j is found as

(35) $N_{j} = \underset{N \in [N_{\min}, N_{\max}]}{\arg \min} {.Math.}_{m = Y_{start}}^{Y_{end}} {(x_{j} [m] - α_{(j, N)} x_{j} [m - N] - β_{(j, N)} x_{j} [m - N + 1])}^{2}$
where N.sub.min,N.sub.max are the lower and upper boundaries of the period search range. In the above equations, the signal can be replaced with reconstructed samples {circumflex over (x)}[m] for backward adaptive parameter estimation. The process above is now iterated over the component filters of the cascade, until convergence or until a desired level of performance is met. Convergence is guaranteed as the overall prediction error is monotonically non-increasing at every step of the iteration.

(36) Finally, the number of filters (and equivalently the estimated number of periodic components) may be optimized by repeating the above optimization process while varying this number. The combination of CLTP parameters, namely the number of periodic components and all individual filter parameters, which minimizes the prediction error energy is the complete set of CLTP parameters, according to a preferred embodiment of the invention.

(37) The CLTP embodiments described above may be adapted for compression of audio signals within the real world codecs of Bluetooth SBC and MPEG AAC or for frame loss concealment as described next.

(38) CLTP for Compression of Audio Signals

(39) As explained earlier, CLTP can be used to exploit redundancies in the periodic components of a polyphonic signal to achieve significant compression gains.

(40) FIG. 3 illustrates an encoder 300 of an audio compression system in accordance with one or more embodiments of the present invention. Input signal 102 is processed block-wise and mapped from time to frequency domain via transform 302 (or alternatively by an analysis filter bank) to generate frequency domain coefficients which, after subtraction of their predicted values 314, yield the frequency domain residual 304. Frequency selective switch 306 may then be used to select between the coefficients or the residual 304 for better prediction efficiency. The signal is then quantized with quantizer 308, encoded with entropy coder 310 and sent to bitstream multiplexer 312. The frequency domain predicted coefficients 314 are now selectively added to the quantized signal using the frequency selective switch 306, the output of which is then mapped back from frequency to time domain by the inverse transform 316 (or alternatively by a synthesis filter bank) to generate time domain reconstructed samples. These samples are buffered in delay 318, so that the previously reconstructed samples are available for encoding the current frame. The CLTP encoder parameter estimator 320 may use a combination of previously reconstructed samples from delay 318 and/or the input signal 102, to estimate parameters for the LTP filters used in system 200 and parameters of the frequency selective switch 306. Parameters which are estimated using the input signal 102 cannot be re-estimated at the decoder of an audio compression system and thus must be provided as side information, and are sent to the bitstream multiplexer 312. The system 200 predicts an entire block of audio signals by using the cascaded synthesis filter with the residual signal 110 set to zero and initial states 202-206 set such that output signal 208 for previous blocks matches the previously reconstructed samples. The output signal 208 generated for the current block is now mapped from time to frequency domain by transform 302 (or alternatively by an analysis filter bank) to generate the frequency domain predicted coefficients 314. The bitstream multiplexer 312 multiplexes all its inputs onto the bitstream 322 which is transmitted to the decoder of an audio compression system.

(41) FIG. 4 illustrates a decoder 400 of an audio compression system in accordance with one or more embodiments of the present invention. The bitstream 322 is processed through the bitstream demultiplexer 402 which separates information to be sent to the entropy decoder 404 (which subsumes a dequantizer) and to the CLTP decoder parameter estimator 406. The quantized signal is decoded using the entropy decoder 404. The frequency domain predicted coefficients 406 are then selectively added to the quantized signal using the frequency selective switch 306, the output of which is then mapped from frequency to time domain by the inverse transform 316 (or alternatively by a synthesis filter bank) to generate time domain reconstructed signal 410. This signal is buffered in delay 412, so that the previously reconstructed samples are available for decoding the current frame. The CLTP decoder parameter estimator 406 may use previously reconstructed samples from delay 412 to estimate parameters of the cascaded synthesis filters used in system 200 and parameters of the frequency selective switch 306. Alternatively, the CLTP decoder parameter estimator 406 may receive all or part of these parameters from the bitstream. The system 200 predicts an entire block of audio signals by using the synthesis filter with the residual signal 110 set to zero and initial states 202-206 set such that output signal 208 for previous blocks matches the previously reconstructed samples. The output signal 208 generated for the current block is then mapped from time to frequency domain by transform 302 (or alternatively by an analysis filter bank) to generate the frequency domain predicted coefficients 412.

(42) The above CLTP embodiments of encoder 300 and decoder 400 may represent the Bluetooth Subband Codec (SBC) system where the mapping from time to frequency domain 302 is implemented by an analysis filter bank, and inverse mapping from frequency to time domain 306 is implemented by a synthesis filter bank. The CLTP encoder parameter estimator 320 and the CLTP decoder parameter estimator 406 may operate only on previously reconstructed samples, i.e., backward adaptive prediction to minimize mean squared error as described in the provisional applications cross referenced above and incorporated by reference herein.

(43) The above CLTP embodiments of encoder 300 and decoder 400 may represent the MPEG AAC system with transform to frequency domain 302 and inverse transform from frequency domain 306 implemented by MDCT and IMDCT, respectively. The CLTP encoder parameter estimator 320 and the CLTP decoder parameter estimator 406 may be designed such that most of the parameters are estimated from previously reconstructed samples, i.e., backward adaptively to minimize mean squared error, and the remaining parameters may be adjusted to the perceptual distortion criteria of the coder and sent as side information, as described in the provisional applications cross referenced above and incorporated by reference herein. The CLTP encoder parameter estimator 320 may alternatively be used with all of the parameters estimated forward adaptively and sent as part of the bitstream to the CLTP decoder parameter estimator 406, to achieve a low decoder complexity variant, as described in the provisional applications cross referenced above and incorporated by reference herein. The CLTP encoder parameter estimator 320 may be used with most of the parameters estimated forward adaptively and sent as part of bitstream to the CLTP decoder parameter estimator 406, while small subset of parameters is estimated backward adaptively in both CLTP encoder parameter estimator 320 and CLTP decoder parameter estimator 406 to obtain a moderate decoder complexity variant as described in the provisional applications cross referenced above and incorporated by reference herein. In both the low decoder complexity variant and the moderate decoder complexity variant the parameters may be initially estimated to minimize mean squared error and then adjusted to take perceptual distortion criteria of the coder into account.

(44) FIG. 5 illustrates an application in accordance with one or more embodiments of the present invention.

(45) System 500 with antenna 502 is illustrated, where decoder 400 as described above is coupled to a speaker 506, and microphone 508 is coupled to encoder 300 as described above. System 500 can be, for example, a Bluetooth transceiver or another wireless device, or a cellular telephone device, or another device for communication of audio or other signals 114.

(46) Signal 504 received at antenna 502 is input into decoder 400, which is decoded and played back on speaker 506. Similarly, signal captured at microphone 508, is encoded with encoder 300 and sent to antenna 502 for transmission.

(47) Frame Loss Concealment and Reverse Estimation

(48) As explained earlier, Frame Loss Concealment (FLC) forms a crucial tool to mitigate unreliable networking conditions. In this regard, a frame may be lost, and it is desirable to replace/conceal the lost frame using various FLC techniques.

(49) FIG. 6 illustrates a typical signal in accordance with one or more embodiments of the present invention. Input signal 102 may comprise segment 600, missing data 602, and segment 604, where time increases as shown from left to right. As such, there may be a beginning segment 600, where signal 102 is easily received and no estimation of signal 102 is required. When signal 102 is somehow interrupted, however, missing data portion 602 of signal 102 must be estimated, or the resulting replay of signal 102 will be discontinuous. Embodiments of the present invention as described herein provide the ability and devices to estimate missing data 602, such that the resulting reconstruction of signal 102 can be a continuous signal reasonably approximating the original, or, at least, reduce the amount of missing data such that signal 102 can be continuous between segment 600 and segment 604.

(50) The CLTP synthesis system 200 may be used to predict the block of missing data by using the cascaded synthesis filter with the residual signal 110 set to zero and initial states 202-206 set such that output signal 208 for previous blocks matches the previously reconstructed samples. Further, a preliminary set of parameters for these filters may be estimated from past segment 600 to minimize mean squared error via the recursive divide and conquer technique described above. The filter parameters may then be adjusted to minimize prediction error in the future segment 604 as described in the provisional applications cross referenced above and incorporated by reference herein.

(51) When a frame of compressed audio signal is lost and compression was performed by an encoder that employs lapped-transforms [42] (e.g., MPEG AAC [1]), both the past segment 600 and future segment 604 will partly or wholly contain a linear combination of the audio signal instead of the audio signal itself. The linear combination, also known as “aliasing” [42], is introduced by the lapped-transform. Embodiments of the present invention may also exploit the information available in aliased samples for frame loss concealment, e.g., by adjusting CLTP filter parameters to minimize prediction error with respect to the available linear combination of audio samples on the other side of the missing portion.

(52) However, there are also times when the continuity of signal 102 must match segment 604, e.g., at the interface between missing data 602 and segment 604. Such a continuity may have the benefit of segment 600 such that predictions that are “forward in time” (i.e., where portions of signal 102 prior in time to the predictions) are available, and there are also occasions when segment 600 is not available. Thus, the present invention must, and can, predict missing data 602 based only on segment 604, such that the predictions are for missing data 602 that occurred prior in time to segment 604. Such predictions are commonly referred to as “reverse” or “backward” predictions for missing data 602. Such predictions are also useful to harmonize the predictions between segment 600 and segment 604, such that missing data 602 is not predicted in a discontinuous or otherwise incompatible fashion, at the interfaces between missing data 602 portion of signal 102 and segments 600 and 604. Such bi-directional predictions are further described in the cross-referenced provisional applications which are incorporated by reference herein.

(53) In other words, further improvement in concealment quality is achieved by using samples predicted in the reverse direction from the future samples. To use an approach similar to the one described above for prediction in the forward direction, a reversed set of reconstructed samples available to the FLC module, is defined as {circumflex over (x)}.sub.r[m]={circumflex over (x)}[K−1−m]. This set in the range −M.sub.f≦m<0 forms the new “past” reconstructed samples and the range K≦m<K+M.sub.p forms the new “future” reconstructed samples. Since pitch periods are assumed to be stationary close to the lost frame, one may begin with the same preliminary CLTP filter estimate (as described above) for the reverse direction and estimate a new set of multiplicative factors via parameter refinement, to form the reverse CLTP filter,

(54) 0 $H_{c}^{r} (z) = {.Math.}_{i = 0}^{P - 1} (1 - G_{i}^{r} (α_{i} z^{- N_{i}} + β_{i} z^{- N_{i} + 1}))$
The parameter refinement may be done to minimize prediction error with respect to the available audio samples or linear combination thereof on the other side of the lost frame.

(55) Given this reverse CLTP filter, another set of samples of the lost frame is generated via the ‘looped’ prediction as {tilde over (x)}.sub.r[m], 0≦m<K. Finally, the overall lost frame {tilde over (x)}.sub.o[m], 0≦m<K is generated as a weighted average of the two sets as,
{tilde over (x)}.sub.o[m]={tilde over (x)}[m]g[m]+{tilde over (x)}.sub.r[K−1−m](1−g[m])
where g[m]=(1−m/(K−1)) are the weights which are proportional to each predicted sample's distance from the set of reconstructed samples used for their generation. To ensure consistent quality of concealment, the weights may also depend on the prediction errors calculated in both directions, for available audio samples or linear combination thereof on the other side of the missing portion.

(56) FIG. 7 illustrates an application in accordance with one or more embodiments of the present invention.

(57) System 700 with antenna 702 is illustrated, where decoder 706 is coupled to one system 200 which is coupled to speaker 708, and microphone 710 is coupled to another system 200 which is coupled to encoder 712. System 700 can be, for example, a Bluetooth transceiver or another wireless device, or a cellular telephone device, or another device for communication of audio or other signals 704.

(58) Signal 704 received at antenna 702 is input into decoder 706. When this input signal is somehow interrupted, e.g., because of interference or other reasons, system 200 along with the CLTP parameter estimator 714 can provide estimations for the lost signal as described above, which is output to speaker 708. Similarly, when there is an interruption of the input from microphone 710, the second system 200 along with second CLTP parameter estimator 714 can provide an estimate of the lost signal portion as described above to encoder 712, which then encodes that estimate.

(59) Hardware Environment

(60) FIG. 8 is an exemplary hardware and software environment 800 used to implement one or more embodiments of the invention. The hardware and software environment includes a computer 802 and may include peripherals. The computer 802 comprises a general purpose hardware processor 804A and/or a special purpose hardware processor 804B (hereinafter alternatively collectively referred to as processor 804) and a memory 806, such as random access memory (RAM). The computer 802 may be coupled to, and/or integrated with, other devices, including input/output (I/O) devices such as a keyboard 812 and a cursor control device 814 (e.g., a mouse, a pointing device, pen and tablet, touch screen, multi-touch device, etc.), a display 816, a speaker 818 (or multiple speakers or a headset) and a microphone 820. In yet another embodiment, the computer 802 may comprise a multi-touch device, mobile phone, gaming system, internet enabled television, television set top box, multimedia content delivery server, or other internet enabled device executing on various platforms and operating systems.

(61) In one embodiment, the computer 802 operates by the general purpose processor 804A performing instructions defined by the computer program 810 under control of an operating system 808. The computer program 810 and/or the operating system 808 may be stored in the memory 806 and may interface with the user and/or other devices to accept input and commands and, based on such input and commands and the instructions defined by the computer program 810 and operating system 808, to provide output and results.

(62) The CLTP and parameter estimation techniques may be performed within/by computer program 810 and/or may be executed by processors 804. Alternatively, or in addition, the CLTP filters may be part of computer 802 or accessed via computer 802.

(63) Output/results may be played on speaker 818 or provided to another device for playback or further processing or action.

(64) Some or all of the operations performed by the computer 802 according to the computer program 810 instructions may be implemented in a special purpose processor 804B. In this embodiment, the some or all of the computer program 810 instructions may be implemented via firmware instructions stored in a read only memory (ROM), a programmable read only memory (PROM) or flash memory within the special purpose processor 804B or in memory 806. The special purpose processor 804B may also be hardwired through circuit design to perform some or all of the operations to implement the present invention. Further, the special purpose processor 804B may be a hybrid processor, which includes dedicated circuitry for performing a subset of functions, and other circuits for performing more general functions such as responding to computer program 810 instructions. In one embodiment, the special purpose processor 804B is an application specific integrated circuit (ASIC).

(65) Of course, those skilled in the art will recognize that any combination of the above components, or any number of different components, peripherals, and other devices, may be used with the computer 802.

(66) Logical Flow

(67) FIG. 9 illustrates the logical flow for processing an audio signal in accordance with one or more embodiments of the invention.

(68) At step 900, an audio signal is compressed/decompressed and/or a missing portion of the audio signal (e.g., due to packet loss during transmission) is concealed (e.g., by estimating the missing portion). Step 900 is performed utilizing prediction by a plurality of cascaded long term prediction filters. Each of the plurality of cascaded long term prediction filters corresponds to one periodic component of the audio signal.

(69) At step 902, further details regarding the compression/decompression/concealing processing of step 900 are configured and/or performed. Such processing/configuring may include multiple aspects as described in detail above. For example, one or more cascaded filter parameters of the cascaded long term prediction filters may be adapted to local audio signal characteristics. Such parameters may include a number of filters in a cascade, a time lag parameter, and a gain parameter (which may be sent to a decoder as side information) and/or estimated from a reconstructed audio signal. Such an adaptation may adjust cascaded filter parameters for each of the plurality of cascaded long term prediction filters, successively, while fixing all other cascaded filter parameters. The adapting/adjusting may then be iterated over all filters until a desired level of performance (e.g., a minimum prediction error energy) is met. The parameters (e.g., gain parameters) may be further adjusted to satisfy a perceptual criterion that may be obtained by calculating a noise to mask ratio.

(70) The compression of the audio signal may include time-frequency mapping (e.g., employing a MDCT and/or an analysis filter bank), quantization, and entropy coding while the decompressing may include corresponding inverse operations of frequency-time mapping (e.g., employing an inverse MDCT and/or a synthesis filter bank), dequantization, and entropy decoding. The time-frequency mapping, quantization, entropy coding, and their inverse operations, may be utilized in an MPEG AAC scheme and/or utilized in a Bluetooth wireless system.

(71) When concealing the missing portion of an audio signal, access to the audio signal samples or a linear combination thereof may exist on both sides of the missing portion. Consequently, the concealing may include predicting the missing portion based on available audio samples or linear combination thereof on one side of the missing portion, and predicting the missing portion and available audio samples or linear combination thereof on the other side, wherein a prediction error is calculated for the available audio samples or linear combination thereof on the other side. Further, a first set of filters may be utilized to generate a first approximation of the missing portion from available past signal information. A second set of filters may also be utilized to operate in a reverse direction (having been optimized to predict a past from future audio samples), and generate a second approximation of the missing portion from available future signal information. The missing portion is then concealed by a weighted average of the first and second approximations of the missing portion. The weights used for the weighted average may depend on the position of an approximated sample within the missing portion, and on the prediction errors calculated in both directions, for available audio samples or linear combination thereof on the other side of the missing portion, which are indicative of the relative quality of the first and second approximations.

REFERENCES

(72) The following references are incorporated by reference herein to the description and specification of the present application. [1] Information technology—Coding of audio-visual objects—Part 3: Audio—Subpart 4: General audio coding (GA), ISO/IEC Std. ISO/IEC JTC1/SC29 14 496-3:2005, 2005. [2] Bluetooth Specification: Advanced Audio Distribution Profile, Bluetooth SIG Std. Bluetooth Audio Video Working Group, 2002. [3] F. de Bont, M. Groenewegen, and W. Oomen, “A high quality audiocoding system at 128 kb/s,” in Proc. 98th AES Convention, February 1995, paper 3937. [4] E. Allamanche, R. Geiger, J. Herre, and T. Sporer, “MPEG-4 low delay audio coding based on the AAC codec,” in Proc. 106th AES Convention, May 1999, paper 4929. [5] J. Ojanper, M. Vaananen, and L. Yin, “Long term predictor for transform domain perceptual audio coding,” in Proc. 107th AES Convention, September 1999, paper 5036. [6] T. Nanjundaswamy, V. Melkote, E. Ravelli, and K. Rose, “Perceptual distortion-rate optimization of long term prediction in MPEG AAC,” in Proc. 129th AES Convention, November 2010, paper 8288. [9] B. S. Atal and M. R. Schroeder, “Predictive coding of speech signals,” in Proc. Conf. Commun., Processing, November 1967, pp. 360-361. [10] S. M. Kay, Modern Spectral Estimation. Englewood Cliffs, N.J.: Prentice-Hall, 1988. [11] A. de Cheveign'e, “A mixed speech F0 estimation algorithm,” in Proceedings of the 2nd European Conference on Speech Communication and Technology (Eurospeech '91), September 1991. [12] D. Giacobello, T. van Waterschoot, M. Christensen, S. Jensen, and M. Moonen, “High-order sparse linear predictors for audio processing,” in Proc. 18th European Sig. Proc. Conf., August 2010, pp. 234-238. [13] Information technology—Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s—Part 3: Audio, ISO/IEC Std. ISO/IEC JTC1/SC29 11 172-3, 1993. [14] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y. Oikawa, “ISO/IEC MPEG-2 advanced audio coding,” J. Audio Eng. Soc., vol. 45, no. 10, pp. 789-814, October 1997. [15] A. Aggarwal, S. L. Regunathan, and K. Rose, “Trellis-based optimization of MPEG-4 advanced audio coding,” in Proc. IEEE Workshop on Speech Coding, 2000, pp. 142-144. [16] “A trellis-based optimal parameter value selection for audio coding,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 14, no. 2, pp. 623-633, 2006. [17] C. Bauer and M. Vinton, “Joint optimization of scale factors and Huffman codebooks for MPEG-4 AAC,” in Proc. 6th IEEE Workshop. Multimedia Sig. Proc., September 2004. [18] R. P. Ramachandran and P. Kabal, “Pitch prediction filters in speech coding,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 4, pp. 467-477, 1989. [19] R. Pettigrew and V. Cuperman, “Backward pitch prediction for low delay speech coding,” in Conf. Rec., IEEE Global Telecommunications Conf., November 1989, pp. 34.3.1-34.3.6. [20] H. Chen, W. Wong, and C. Ko, “Comparison of pitch prediction and adaptation algorithms in forward and backward adaptive CELP systems,” in Communications, Speech and Vision, IEE Proceedings I, vol. 140, no. 4, 1993, pp. 240-245. [21] M. Yong and A. Gersho, “Efficient encoding of the long-term predictor in vector excitation coders,” Advances in Speech Coding, pp. 329-338, Dordrecht, Holland: Kluwer, 1991. [22] S. McClellan, J. Gibson, and B. Rutherford, “Efficient pitch filter encoding for variable rate speech processing,” IEEE Trans. Speech Audio Process., vol. 7, no. 1, pp. 18-29, 1999. [23] J. Marques, I. Trancoso, J. Tribolet, and L. Almeida, “Improved pitch prediction with fractional delays in CELP coding,” in Proc. IEEE Intl. Conf. Acoustics, Speech, and Sig. Proc., 1990, pp. 665-668. [24] D. Veeneman and B. Mazor, “Efficient multi-tap pitch prediction for stochastic coding,” Kluwer international series in engineering and computer science, pp. 225-225, 1993. [25] P. Kroon and K. Swaminathan, “A high-quality multirate real-time CELP coder,” IEEE J. Sel. Areas Commun., vol. 10, no. 5, pp. 850-857, 1992. [26] J. Chen, “Toll-quality 16 kb/s CELP speech coding with very low complexity,” in Proc. IEEE Intl. Conf. Acoustics, Speech, and Sig. Proc., 1995, pp. 9-12. [27] W. Kleijn and K. Paliwal, Speech coding and synthesis. Elsevier Science Inc., 1995, pp. 95-102. [28] Method of Subjective Assessment of Intermediate Quality Level of Coding Systems, ITU Std. ITU-R Recommendation, BS 1534-1, 2001. [29] R. P. Ramachandran and P. Kabal, “Stability and performance analysis of pitch filters in speech coders,” IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 7, pp. 937-946, 1987. [30] A. Said, “Introduction to arithmetic coding-theory and practice,” Hewlett Packard Laboratories Report, 2004. [31] C. Perkins, O. Hodson, and V. Hardman, “A survey of packet loss recovery techniques for streaming audio,” IEEE Network, vol. 12, no. 5, pp. 40-48, 1998. [32] S. J. Godsill and P. J. W. Rayner, Digital audio restoration: a statistical model based approach, Springer verlag, 1998. [33] J. Herre and E. Eberlein, “Evaluation of concealment techniques for compressed digital audio,” in Proc. 94th Conv. Aud. Eng. Soc, February 1993, Paper 3460. [34] R. Sperschneider and P. Lauber, “Error concealment for compressed digital audio,” in Proc. 111th Conv. Aud. Eng. Soc, November 2003, Paper 5460. [35] S. U. Ryu and K. Rose, “An mdct domain frame-loss concealment technique for mpeg advanced audio coding,” in IEEE ICASSP, 2007, pp. 1-273-1-276. [37] J. Nocedal, “Updating quasi-newton matrices with limited storage,” Mathematics of computation, vol. 35, no. 151, pp. 773-782, 1980. [38] J. Nocedal and S. J. Wright, Numerical optimization, Springer verlag, 1999. [39] I. Kauppinen and K. Roth, “Audio signal extrapolation—theory and applications,” in Proc. 5th Int. Conf. on Digital Audio Effects, September 2002, pp. 105-110. [40] P. A. A. Esquef and L. W. P. Biscainho, “An efficient model-based multirate method for reconstruction of audio signals across long gaps,” IEEE Trans. Audio, Speech, and Lang. Process., vol. 14, no. 4, pp. 1391-1400, 2006. [41] J. J. Shynk, “Adaptive IIR filtering,” IEEE ASSP Magazine, vol. 6, no. 2, pp. 4-21, 1989. [42] J. Princen, A. Johnson, and A. Bradley, “Subband/transform coding using filter bank designs based on time domain aliasing cancellation,” in Proc. IEEE Intl. Conf. Acoustics, Speech, and Sig. Proc., April 1987, pp. 2161-2164.

CONCLUSION

(73) In conclusion, embodiments of the present invention provide an efficient and effective solution to the problem of predicting polyphonic signals. The solution involves a framework of a cascade of LTP filters, which by design is tailored to account for all periodic components present in a polyphonic signal. Embodiments of the invention complement this framework with a design method to optimize the system parameters. Embodiments also specialize to specific techniques for coding and networking scenarios, where the potential of each enhanced prediction considerably improves the overall system performance for that application. The effectiveness of such an approach has been demonstrated for various commercially used systems and standards, such as the Bluetooth audio standard for low delay short range wireless communications (e.g., SNR improvements of about 5 dB), and the MPEG AAC perceptual audio coding standard.

(74) Accordingly, embodiments of the invention enable performance improvement in various audio related applications, including for example, music storage and distribution (e.g., Apple™ iTunes™ store), as well as high efficiency storage and playback devices, wireless audio streaming (especially to mobile devices), and high-definition teleconferencing (including on smart phones and tablets). Embodiments of the invention may also be utilized in areas/products that involve mixed speech and music signals as well as in unified speech-audio coding. Further embodiments may also be utilized in multimedia applications that utilize cloud based content distribution services.

(75) In addition to the above, embodiments of the invention provide an effective means to conceal the damage due to lost samples, and specifically overcomes the main challenge due to the polyphonic nature of music signals by employing a cascade of long term prediction filters (tailored to each periodic component) so as to effectively estimate all periodic components in the time-domain while fully utilizing all of the available information. Methods of the invention are capable of exploiting available information from both sides of the missing frame or lost samples to optimize the filter parameters and perform uni or bi-directional prediction of the lost samples. Embodiments of the invention also guarantee that the concealed lost frame is embedded seamlessly within the available signal. The effectiveness of such concealing has been demonstrated and has provided improved quality over existing FLC techniques. For example, gains of 20-30 points (on a scale of 0 to 100) in a standard subjective qualify measure of MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) and Segmental SNR improvements of about 7 dB have been obtained.

(76) In view of the above, embodiments of the present invention disclose methods and devices for signal estimation/prediction.

(77) Although the present invention has been described in connection with the preferred embodiments, it is to be understood that modifications and variations may be utilized without departing from the principles and scope of the invention, as those skilled in the art will readily understand. Accordingly, such modifications may be practiced within the scope of the invention and the following claims, and the full range of equivalents of the claims.

(78) This concludes the description of the preferred embodiment of the present invention. The foregoing description of one or more embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto and the full range of equivalents of the claims. The attached claims are presented merely as one aspect of the present invention. The Applicant does not disclaim any claim scope of the present invention through the inclusion of this or any other claim language that is presented or may be presented in the future. Any disclaimers, expressed or implied, made during prosecution of the present application regarding these or other changes are hereby rescinded for at least the reason of recapturing any potential disclaimed claim scope affected by these changes during prosecution of this and any related applications. Applicant reserves the right to file broader claims in one or more continuation or divisional applications in accordance within the full breadth of disclosure, and the full range of doctrine of equivalents of the disclosure, as recited in the original specification.

Method and apparatus for polyphonic audio signal prediction in coding and networking systems

Assignee

Inventors

Cpc classification

Classification Explorer

G10L19/09

PHYSICS

Classification Explorer

G10L19/005

PHYSICS

International classification

Classification Explorer

G10L19/00

PHYSICS

Classification Explorer

G10L19/005

PHYSICS

Classification Explorer

G10L19/09

PHYSICS

Abstract

Claims

Description