Frame loss correction with voice information

10431226 · 2019-10-01

Abstract

A method for processing a digital audio signal, including a series of samples distributed in consecutive frames, is implemented when decoding the signal in order to replace at least one signal frame lost during decoding. The method includes the following steps: a) searching, in a valid signal segment available when decoding, for at least one period in the signal, determined in accordance with the valid signal; b) analyzing the signal in the period, in order to determine spectral components of the signal in the period; c) synthesizing at least one frame for replacing the lost frame, by construction of a synthesis signal from: an addition of components selected from among the determined spectral components, and noise added to the addition of components. In particular, the amount of noise added to the addition of components is weighted in accordance with voice information of the valid signal, obtained when decoding.

Claims

1. A non-transitory computer readable medium storing a code of a computer program, wherein said computer program comprises instructions for implementing, when the program is executed by a processor, a method for processing a digital audio signal comprising a series of samples distributed in successive frames, the method being implemented when decoding said signal in order to replace at least one lost signal frame during decoding, the method comprising the steps of: a) searching, in a valid signal segment available when decoding, for at least one period in the signal, determined based on said valid signal, b) analyzing the signal in said period, in order to determine spectral components of the signal in said period, c) synthesizing at least one replacement for the lost frame, by constructing a synthesis signal from: an addition of components selected from among said determined spectral components, and noise added to the addition of components, wherein the amount of noise added to the addition of components is weighted based on voice information of the valid signal, obtained when decoding, wherein the voice information is supplied in a bitstream received in decoding and corresponding to said signal comprising a series of samples distributed in successive frames, wherein, in a case of frame loss in decoding, the voice information contained in a valid signal frame preceding the lost frame is used, wherein the voice information comes from an encoder generating the bitstream and determining the voice information, wherein the voice information is encoded in a single bit in the bitstream, wherein, in step a), the period is searched for in a valid signal segment of greater length in the case of voicing in the valid signal, and wherein: if the signal is voiced, the period is searched for in a valid signal segment of a duration of more than 30 milliseconds, and if not, the period is searched for in a valid signal segment of a duration of less than 30 milliseconds.

2. The non-transitory computer readable medium according to claim 1, wherein the noise signal is obtained by a residual between the valid signal and the addition of selected components.

3. The non-transitory computer readable medium according to claim 1, wherein a number of components selected for the addition is larger in the case of voicing in the valid signal than in the case of unvoicing in the valid signal.

4. The non-transitory computer readable medium according to claim 1, wherein, in step a), the period is searched for in a valid signal segment of greater length in the case of voicing in the valid signal than in the case of unvoicing in the valid signal.

5. The non-transitory computer readable medium according to claim 1, wherein a noise signal added to the addition of components is weighted by a smaller gain in the case of voicing in the valid signal, and, if the signal is voiced, a gain value is 0.25, and otherwise is 1.

6. The non-transitory computer readable medium according to claim 1, wherein the voice information comes from an encoder determining a spectrum flatness value, obtained by comparing amplitudes of the spectral components of the signal to a background noise, said encoder delivering said value in binary form in the bitstream.

7. The non-transitory computer readable medium according to claim 6, wherein a noise signal added to the addition of components is weighted by a smaller gain in the case of voicing in the valid signal than in the case of unvoicing in the valid signal, and a gain value is determined as a function of said flatness value.

8. The non-transitory computer readable medium according to claim 6, wherein said flatness value is compared to a threshold in order to determine: that the signal is voiced if the flatness value is below the threshold, and that the signal is unvoiced otherwise.

9. The non-transitory computer readable medium according to claim 1, wherein a number of components selected for the addition is larger in the case of voicing in the valid signal, and wherein: if the signal is voiced, the spectral components having amplitudes greater than those of the neighboring first spectral components are selected, as well as the neighboring first spectral components, and otherwise only the spectral components having amplitudes greater than those of the neighboring first spectral components are selected.

10. The non-transitory computer readable medium according to claim 1, wherein a noise signal added to the addition of components is weighted by a smaller gain in the case of voicing in the valid signal than in the case of unvoicing in the valid signal.

11. A device for decoding a digital audio signal comprising a series of samples distributed in successive frames, the device comprising a computer circuit for replacing at least one lost signal frame, by: a) searching, in a valid signal segment available when decoding, for at least one period in the signal, determined based on said valid signal, b) analyzing the signal in said period, in order to determine spectral components of the signal in said period, c) synthesizing at least one frame for replacing the lost frame, by constructing a synthesis signal from: an addition of components selected from among said determined spectral components, and noise added to the addition of components, the amount of noise added to the addition of components being weighted based on voice information of the valid signal, obtained when decoding, wherein the voice information is supplied in a bitstream received in decoding and corresponding to said signal comprising a series of samples distributed in successive frames, wherein, in a case of frame loss in decoding, the voice information contained in a valid signal frame preceding the lost frame is used, wherein the voice information comes from an encoder generating the bitstream and determining the voice information, wherein the voice information is encoded in a single bit in the bitstream, wherein, in step a), the period is searched for in a valid signal segment of greater length in the case of voicing in the valid signal, and wherein: if the signal is voiced, the period is searched for in a valid signal segment of a duration of more than 30 milliseconds, and if not, the period is searched for in a valid signal segment of a duration of less than 30 milliseconds.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) Other features and advantages of the invention will be apparent from examining the following detailed description and the appended drawings in which:

(2) FIG. 1 summarizes the main steps of the method for correcting frame loss in the sense of document FR 1350845;

(3) FIG. 2 schematically shows the main steps of a method according to the invention;

(4) FIG. 3 illustrates an example of steps implemented in encoding, in one embodiment in the sense of the invention;

(5) FIG. 4 shows an example of steps implemented in decoding, in one embodiment in the sense of the invention;

(6) FIG. 5 illustrates an example of steps implemented in decoding, for the pitch search in a valid signal segment Nc;

(7) FIG. 6 schematically illustrates an example of encoder and decoder devices in the sense of the invention.

DETAILED DESCRIPTION

(8) We now refer to FIG. 1, illustrating the main steps described in document FR 1350845. A series of N audio samples, denoted b(n) below, is stored in a buffer memory of the decoder. These samples correspond to samples already decoded and are therefore accessible for correcting frame loss at the decoder. If the first sample to be synthesized is sample N, the audio buffer corresponds to previous samples 0 to N−1. In the case of transform coding, the audio buffer corresponds to samples in the previous frame, which cannot be changed because this type of encoding/decoding does not provide for delay in reconstructing the signal; the implementation of a crossfade of sufficient duration to cover a frame loss is therefore not provided for.

(9) Next is a step S2 of frequency filtering, in which the audio buffer b(n) is divided into two bands, a low band LB and a high band HB, with a separation frequency denoted Fc (for example Fc=4 kHz). This filtering is preferably a delayless filtering. The size of the audio buffer is thus reduced to N′=N·Fc/fs following decimation from fs to Fc. In variants of the invention, this filtering step may be optional, the subsequent steps then being carried out on the full band.

(10) The next step S3 consists of searching the low band for a loop point and a segment p(n) corresponding to the fundamental period (or pitch) within buffer b(n) re-sampled at frequency Fc. This embodiment allows taking into account pitch continuity in the lost frame(s) to be reconstructed.

(11) Step S4 consists of breaking apart segment p(n) into a sum of sinusoidal components. For example, the discrete Fourier transform (DFT) of signal p(n) over a duration corresponding to the length of the signal can be calculated. The frequency, phase, and amplitude of each of the sinusoidal components (or peaks) of the signal are thus obtained. Transforms other than DFT are possible. For example, transforms such as DCT, MDCT, or MCLT may be applied.

(12) Step S5 is a step of selecting K sinusoidal components in order to retain only the most significant components. In one particular embodiment, the selection of components first corresponds to selecting the amplitudes A(n) for which A(n)>A(n−1) and A(n)>A(n+1), where

(13) $n \in \left[0; \frac{P'}{2}-1\right]$,
which ensures that the amplitudes correspond to spectral peaks.

(14) To do this, the samples of segment p(n) (pitch) are interpolated to obtain a segment p′(n) composed of P′ samples, where $P' = 2^{\lceil \log_2(P) \rceil} \geq P$, ceil(x) being the smallest integer greater than or equal to x. Analysis by FFT is therefore done more efficiently, over a length which is a power of 2, without modifying the actual pitch period (due to the interpolation). The FFT transform of p′(n) is calculated: $\Pi(k) = \mathrm{FFT}(p'(n))$; and, from the FFT transform, the phases $\varphi(k)$ and amplitudes $A(k)$ of the sinusoidal components are directly obtained, the normalized frequencies between 0 and 1 being given here by:

(15) $f(k) = \frac{2kP}{P'^2}, \quad \forall k \in \left[0; \frac{P'}{2}-1\right]$
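By way of illustration, here is a minimal numpy sketch of this analysis step, assuming a pitch segment p of P samples; the function name and the use of linear interpolation are illustrative choices, not specified by the source:

```python
import numpy as np

def analyze_pitch_segment(p):
    """Interpolate a pitch segment to a power-of-two length P' and
    return the amplitude A(k), phase phi(k), and normalized frequency
    f(k) of each bin of the half spectrum, per paragraphs (14)-(15)."""
    P = len(p)
    P2 = 2 ** int(np.ceil(np.log2(P)))            # P' = 2^ceil(log2(P)) >= P
    # Linear interpolation onto P' points (the interpolation method is an
    # assumption; the text only requires interpolation to length P').
    p_interp = np.interp(np.linspace(0.0, P - 1, P2), np.arange(P), p)
    spectrum = np.fft.fft(p_interp)
    k = np.arange(P2 // 2)                        # bins 0 .. P'/2 - 1
    amplitude = np.abs(spectrum[k])
    phase = np.angle(spectrum[k])
    freq = 2.0 * k * P / P2**2                    # f(k) as in paragraph (15)
    return amplitude, phase, freq
```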

(16) Next, among the amplitudes of this first selection, the components are selected in descending order of amplitude, so that the cumulative amplitude of the selected peaks is at least x % (for example x=70%) of the cumulative amplitude over typically half the spectrum at the current frame.

(17) In addition, it is also possible to limit the number of components (for example to 20) in order to reduce the complexity of the synthesis.
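Under the assumption that amplitude holds the half spectrum A(k) computed above, the two-stage selection of paragraphs (12) to (17) can be sketched as follows; the helper name is hypothetical, and the defaults x=70% and a 20-component cap come from the examples just given:

```python
import numpy as np

def select_components(amplitude, x=0.7, max_components=20):
    """Return the indices of the K most significant spectral peaks."""
    # Stage 1: keep local maxima, i.e. A(n) > A(n-1) and A(n) > A(n+1).
    n = np.arange(1, len(amplitude) - 1)
    peaks = n[(amplitude[n] > amplitude[n - 1]) & (amplitude[n] > amplitude[n + 1])]
    # Stage 2: accumulate peaks in descending amplitude until they reach
    # at least x of the cumulative amplitude of the half spectrum, with a
    # cap on the count to limit synthesis complexity.
    order = peaks[np.argsort(amplitude[peaks])[::-1]]
    target = x * amplitude.sum()
    selected, total = [], 0.0
    for idx in order:
        if total >= target or len(selected) >= max_components:
            break
        selected.append(idx)
        total += amplitude[idx]
    return np.array(selected, dtype=int)
```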

(18) The sinusoidal synthesis step S6 consists of generating a segment s(n) of a length at least equal to the size of the lost frame (T). The synthesis signal s(n) is calculated as a sum of the selected sinusoidal components:

(19) $s(n) = \sum_{k=0}^{K} A(k)\,\sin\big(f(k)\,n + \varphi(k)\big), \quad n \in \left[0; 2T + \frac{LF}{2}\right]$
where k is the index of the K peaks selected in step S5.
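This synthesis formula transcribes directly into numpy; the sketch below assumes the inputs come from the analysis and selection helpers above:

```python
import numpy as np

def synthesize(amplitude, phase, freq, selected, length):
    """Sum of the selected sinusoidal components over `length` samples,
    with `length` at least the lost-frame size T; f(k) is used directly
    as in the formula above."""
    n = np.arange(length)
    s = np.zeros(length)
    for k in selected:
        s += amplitude[k] * np.sin(freq[k] * n + phase[k])
    return s
```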

(20) Step S7 consists of noise injection (filling in the spectral regions corresponding to the lines not selected) in order to compensate for energy loss due to the omission of certain frequency peaks in the low band. One particular implementation consists of calculating the residual r(n) between the segment corresponding to the pitch, p(n), and the synthesis signal s(n), where $n \in [0; P-1]$, such that:
$r(n) = p(n) - s(n), \quad n \in [0; P-1]$

(21) This residual of size P is transformed; for example, it is windowed and repeated with overlaps between windows of varying sizes, as described in patent FR 1353551:

(22) $r'(k) = f\big(r(n)\big), \quad n \in [0; P-1] \text{ and } k \in \left[0; 2T + \frac{LF}{2}\right]$

(23) Signal s(n) is then combined with signal r′(n):

(24) $s'(n) = s(n) + r'(n), \quad n \in \left[0; 2T + \frac{LF}{2}\right]$
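The residual computation and mixing of paragraphs (20) to (24) could be sketched as follows; the windowed, overlapped repetition of FR 1353551 is replaced here by simple tiling, a deliberate simplification:

```python
import numpy as np

def inject_noise(p, s):
    """Compute the residual r(n) = p(n) - s(n) over one pitch period,
    extend it to the length of the synthesis s (the tiling stands in
    for the windowed repetition of FR 1353551), and return
    s'(n) = s(n) + r'(n)."""
    P = len(p)
    r = p - s[:P]                          # residual over one period
    reps = int(np.ceil(len(s) / P))
    r_ext = np.tile(r, reps)[:len(s)]      # placeholder extension r'(n)
    return s + r_ext
```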

(25) Step S8, applied to the high band, may simply consist of repeating the past signal.

(26) In step S9, the signal is synthesized by resampling the low band at its original frequency fs, after it has been mixed with the high band filtered in step S8 (simply repeated in step S11).

(27) Step S10 is an overlap-add to ensure continuity between the signal before the frame loss and the synthesis signal.

(28) We now describe elements added to the method of FIG. 1, in one embodiment in the sense of the invention.

(29) According to the general approach presented in FIG. 2, voice information about the signal before the frame loss, transmitted at at least one bitrate of the coder, is used in decoding (step DI-1) in order to quantitatively determine the proportion of noise to be added to the synthesis signal replacing one or more lost frames. The decoder thus uses the voice information to decrease, based on the voicing, the overall amount of noise mixed into the synthesis signal (by assigning a lower gain G(res) to the noise signal r′(k) originating from a residual in step DI-3, and/or by selecting more components of amplitude A(k) for use in constructing the synthesis signal in step DI-4).

(30) In addition, the decoder may adjust its parameters, particularly for the pitch search, to optimize the quality/complexity trade-off of the processing, based on the voice information. For example, for the pitch search, if the signal is voiced, the pitch search window Nc may be larger (in step DI-5), as we will see below with reference to FIG. 5.

(31) For determining the voicing, information may be provided by the encoder in one of two ways, at at least one bitrate of the encoder: in the form of a bit of value 1 or 0 depending on a degree of voicing identified in the encoder (received from the encoder in step DI-1 and read in step DI-2 in case of frame loss, for the subsequent processing), or as a value of the average amplitude of the peaks composing the signal in encoding, compared to a background noise.

(32) This spectrum flatness data Pl may be received as multiple bits at the decoder in optional step DI-10 of FIG. 2, then compared to a threshold in step DI-11, which amounts to determining in steps DI-1 and DI-2 whether the voicing is above or below a threshold, and deducing from this the appropriate processing, particularly for the selection of peaks and for the choice of the length of the pitch search segment.

(33) This information (whether in the form of a single bit or as a multi-bit value) is received from the encoder (at at least one bitrate of the codec), in the example described here.

(34) Indeed, with reference to FIG. 3, in the encoder, the input signal, presented in the form of frames in step C1, is analyzed in step C2. The analysis step consists of determining whether the audio signal of the current frame has characteristics that require special processing in case of frame loss at the decoder, as is the case for example with voiced speech signals.

(35) In one particular embodiment, a classification (speech/music or other) already determined at the encoder is advantageously used in order to avoid increasing the overall complexity of the processing. Indeed, in the case of encoders that can switch coding modes between speech or music, classification at the encoder already allows adapting the encoding technique employed to the nature of the signal (speech or music). Similarly, in the case of speech, predictive encoders such as the encoder of the G.718 standard also use classification in order to adapt the encoder parameters to the type of signal (sounds that are voiced/unvoiced, transient, generic, inactive).

(36) In a first particular embodiment, only one bit is reserved for frame loss characterization. It is added to the encoded stream (or bitstream) in step C3 to indicate whether the signal is a speech signal (voiced or generic). This bit is, for example, set to 1 or 0 according to the following table, based on the decision of the speech/music classifier and on the decision of the speech coding mode classifier.

(37) TABLE-US-00001

    Decision of the         Decision of the coding    Value of frame loss
    encoder's classifier    mode classifier           characterization bit
    Music                   —                         0
    Speech                  Voiced                    1
    Speech                  Not voiced                0
    Speech                  Transient                 0
    Speech                  Generic                   1
    Speech                  Inactive                  0
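Assuming the two classifier decisions are available, the bit derivation of this table reduces to a simple mapping; the function name and string labels below are illustrative:

```python
def frame_loss_bit(is_speech, coding_mode=None):
    """Map the encoder's classifier decisions to the one-bit frame loss
    characterization of the table above: music frames give 0, and among
    speech frames only voiced and generic ones give 1."""
    if not is_speech:
        return 0
    return 1 if coding_mode in ("voiced", "generic") else 0
```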

(38) Here, the term generic refers to a common speech signal (which is not a transient related to the pronunciation of a plosive, is not inactive, and is not necessarily purely voiced such as the pronunciation of a vowel without a consonant).

(39) In a second alternative embodiment, the information transmitted to the decoder in the bitstream is not binary but corresponds to a quantized measure of the ratio between the peaks and valleys in the spectrum. This ratio can be expressed as a measurement of the flatness of the spectrum, denoted Pl:

(40) $Pl = \log_2\left(\dfrac{\exp\left(\frac{1}{N}\sum_{k=0}^{N-1}\ln\big(x(k)\big)\right)}{\frac{1}{N}\sum_{k=0}^{N-1}x(k)}\right)$

(41) In this expression, x(k) is the amplitude spectrum of size N resulting from analysis of the current frame in the frequency domain (after FFT).
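This measure (the log ratio of the geometric to the arithmetic mean of the amplitude spectrum) transcribes directly into numpy; the eps guard against log(0) is an implementation detail added here, not part of the source:

```python
import numpy as np

def spectral_flatness(x, eps=1e-12):
    """Pl as defined above: 0 for a perfectly flat amplitude spectrum,
    increasingly negative as the spectrum shows pronounced peaks."""
    geometric_mean = np.exp(np.mean(np.log(x + eps)))
    arithmetic_mean = np.mean(x)
    return np.log2(geometric_mean / arithmetic_mean)
```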

(42) In an alternative, a sinusoidal analysis is provided which breaks down the signal at the encoder into sinusoidal components and noise, and the flatness measurement is obtained as a ratio between the energy of the sinusoidal components and the total energy of the frame.

(43) After step C3 (including the one bit of voice information or the multiple bits of the flatness measurement), the audio buffer of the encoder is conventionally encoded in step C4 before any subsequent transmission to the decoder.

(44) Referring now to FIG. 4, we will describe the steps implemented in the decoder in one exemplary embodiment of the invention.

(45) In the case where there is no frame loss in step D1 (NOK arrow exiting test D1 in FIG. 4), in step D2 the decoder reads the information contained in the bitstream, including the frame loss characterization information (at at least one bitrate of the codec). This information is stored in memory so it can be reused when a subsequent frame is missing. The decoder then continues with the conventional steps of decoding D3, etc., to obtain the synthesized output frame FR SYNTH.

(46) In the case where frame loss(es) occurs (OK arrow exiting test D1), steps D4, D5, D6, D7, D8, and D12 are applied, respectively corresponding to steps S2, S3, S4, S5, S6, and S11 of FIG. 1. However, a few changes are made to steps S3 and S5, respectively steps D5 (searching for a loop point for the pitch determination) and D7 (selecting sinusoidal components). Furthermore, the noise injection of step S7 of FIG. 1 is carried out with a gain determination according to two steps D9 and D10 in FIG. 4 of the decoder in the sense of the invention.

(47) In the case where the frame loss characterization information is known (when the previous frame has been received), the invention consists of modifying the processing of steps D5, D7, and D9-D10, as follows.

(48) In a first embodiment, the frame loss characterization information is binary, of a value:

(49) equal to 0 for an unvoiced signal, of a type such as music or transient,

(50) equal to 1 otherwise (see the table above).

(51) Step D5 consists of searching for a loop point and a segment p(n) corresponding to the pitch within the audio buffer resampled at frequency Fc. This technique, described in document FR 1350845, is illustrated in FIG. 5, in which: the audio buffer in the decoder is of sample size N; a target buffer BC of Ns samples is determined; the correlation search is performed over Nc samples; the correlation curve Correl has a maximum at mc; the loop point, designated Loop pt, is positioned Ns samples from the correlation maximum; and the pitch is then determined over the remaining samples p(n), up to N−1.

(52) In particular, a normalized correlation corr(n) is calculated between the target buffer segment of size Ns, between N−Ns and N−1 (of a duration of 6 ms for example), and the sliding segment of size Ns which begins between sample 0 and Nc (where Nc < N−Ns):

(53) $corr(n) = \dfrac{\sum_{k=0}^{Ns} b(n+k)\,b(N-Ns+k)}{\sqrt{\sum_{k=0}^{Ns} b(n+k)^2 \; \sum_{k=0}^{Ns} b(N-Ns+k)^2}}, \quad n \in [0; Nc]$
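A sketch of this correlation search in numpy, assuming the buffer b is long enough for every sliding segment to fit (Nc + Ns ≤ N); the function name is illustrative, and the target is taken as the last Ns samples of the buffer:

```python
import numpy as np

def pitch_search(b, Ns, Nc):
    """Normalized correlation between the target buffer (the last Ns
    samples of b) and sliding segments starting at n in [0; Nc];
    returns the correlation curve and the position of its maximum mc."""
    N = len(b)
    target = b[N - Ns:]
    corr = np.zeros(Nc + 1)
    for n in range(Nc + 1):
        seg = b[n:n + Ns]
        denom = np.sqrt(np.sum(seg**2) * np.sum(target**2))
        corr[n] = np.sum(seg * target) / denom if denom > 0.0 else 0.0
    return corr, int(np.argmax(corr))
```

As described next, the voice information would drive the choice of Nc, for example a number of samples corresponding to 33 ms when the last valid frame was flagged voiced and 28 ms otherwise.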

(54) For music signals, due to the nature of the signal, the value Nc does not need to be very large (for example Nc=28 ms). This limitation saves computational complexity during the pitch search.

(55) However, voice information from the last valid frame previously received makes it possible to determine whether the signal to be reconstructed is a voiced speech signal (mono-pitch). In such cases, with such information, it is therefore possible to increase the size of segment Nc (for example Nc=33 ms) in order to optimize the pitch search (and potentially find a higher correlation value).

(56) In step D7 in FIG. 4, sinusoidal components are selected such that only the most significant components are retained. In one particular embodiment, also presented in document FR 1350845, the first selection of components is equivalent to selecting the amplitudes A(n) where A(n)>A(n−1) and

(57) $A(n) > A(n+1), \text{ with } n \in \left[0; \frac{P'}{2}-1\right].$

(58) In the case of the invention, it is advantageously known whether the signal to be reconstructed is a speech signal (voiced or generic) and therefore has pronounced peaks and a low level of noise. Under these conditions, it is preferable to select not only the peaks (A(n) where A(n)>A(n−1) and A(n)>A(n+1), as shown above), but also to expand the selection to A(n−1) and A(n+1), so that the selected peaks represent a larger portion of the total energy of the spectrum. This modification makes it possible to lower the level of noise (and in particular the level of noise injected in steps D9 and D10 presented below) compared to the level of the signal synthesized by sinusoidal synthesis in step D8, while retaining an overall energy level sufficient to cause no audible artifacts related to energy fluctuations.
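This voicing-dependent widening of the selection could be sketched as follows; the helper is hypothetical and assumes amplitude holds the half spectrum A(n):

```python
import numpy as np

def select_peaks(amplitude, voiced):
    """Select local maxima A(n) > A(n-1), A(n) > A(n+1); when the last
    valid frame was flagged voiced, also keep the immediate neighbors
    A(n-1) and A(n+1) so the selection carries a larger share of the
    spectrum's energy."""
    n = np.arange(1, len(amplitude) - 1)
    peaks = n[(amplitude[n] > amplitude[n - 1]) & (amplitude[n] > amplitude[n + 1])]
    if voiced:
        peaks = np.unique(np.concatenate([peaks - 1, peaks, peaks + 1]))
    return peaks
```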

(59) Next, in the case where the signal is noise-free (at least at low frequencies), as is the case for a generic or voiced speech signal, we observe that the addition of noise corresponding to the transformed residual r′(n), in the sense of FR 1350845, actually degrades the quality.

(60) The voice information is therefore advantageously used to reduce the noise by applying a gain G in step D10. Signal s(n) resulting from step D8 is mixed with the noise signal r′(n) resulting from step D9, but a gain G is applied here which depends on the frame loss characterization information originating from the bitstream of the previous frame, namely:

(61) $s'(n) = s(n) + G \cdot r'(n), \quad n \in \left[0; 2T + \frac{LF}{2}\right].$

(62) In this particular embodiment, G may be a constant equal to 1 or 0.25 depending on the voiced or unvoiced nature of the signal of the previous frame, according to the table given below by way of example:

(63) TABLE-US-00002

    Value of frame loss characterization bit    0    1
    Gain G                                      1    0.25
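The mixing of step D10 with the gain of this table then reduces to the following sketch (names are illustrative):

```python
import numpy as np

def mix_with_noise(s, r, characterization_bit):
    """s'(n) = s(n) + G * r'(n), with G = 0.25 when the frame loss
    characterization bit of the last valid frame is 1 (voiced or
    generic speech) and G = 1 otherwise, per the table above."""
    G = 0.25 if characterization_bit == 1 else 1.0
    return s + G * np.asarray(r)
```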

(64) In the alternative embodiment where the frame loss characterization information has a plurality of discrete levels characterizing the flatness Pl of the spectrum, the gain G may be expressed directly as a function of the Pl value. The same is true for the bounds of segment Nc for the pitch search and/or for the number of peaks A(n) to be taken into account for synthesis of the signal.

(65) Processing such as the following can be defined as an example.

(66) The gain G is then defined directly as a function of the Pl value: $G(Pl) = 2^{Pl}$

(67) In addition, the Pl value is compared to an average threshold value of −3 dB, given that a value of 0 corresponds to a flat spectrum and −5 dB corresponds to a spectrum with pronounced peaks.

(68) If the Pl value is below the average threshold value of −3 dB (thus corresponding to a spectrum with pronounced peaks, typical of a voiced signal), then the duration of the segment for the pitch search Nc can be set to 33 ms, and the peaks A(n) such that A(n)>A(n−1) and A(n)>A(n+1) can be selected, as well as the first neighboring peaks A(n−1) and A(n+1).

(69) Otherwise (if the Pl value is above the threshold, corresponding to less pronounced peaks and more background noise, as in a music signal for example), the duration Nc can be chosen to be shorter, for example 25 ms, and only the peaks A(n) satisfying A(n)>A(n−1) and A(n)>A(n+1) are selected.
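Gathering paragraphs (66) to (69), the Pl-driven settings might be sketched as follows; the dictionary packaging and parameter names are illustrative, and the direct comparison of Pl against the dB threshold mirrors the text:

```python
def concealment_params(pl, threshold=-3.0):
    """Derive the noise gain and the pitch-search / peak-selection
    settings from the flatness value Pl (0 for a flat spectrum,
    more negative for pronounced peaks)."""
    voiced = pl < threshold               # pronounced peaks -> voiced
    return {
        "gain": 2.0 ** pl,                # G(Pl) = 2^Pl
        "Nc_ms": 33 if voiced else 25,    # longer pitch search when voiced
        "keep_neighbors": voiced,         # also select A(n-1) and A(n+1)
    }
```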

(70) The decoding can then continue by mixing the noise, weighted by the gain thus obtained, with the components selected in this manner, to obtain the synthesis signal in the low frequencies in step D13; this is added to the synthesis signal in the high frequencies obtained in step D14, in order to obtain the overall synthesis signal in step D15.

(71) Referring to FIG. 6, one possible implementation of the invention is illustrated in which a decoder DECOD (comprising for example software and hardware such as a suitably programmed memory MEM and a processor PROC cooperating with this memory, or alternatively a component such as an ASIC, or other, as well as a communication interface COM) embedded for example in a telecommunications device such as a telephone TEL, for the implementation of the method of FIG. 4, uses voice information that it receives from an encoder ENCOD. This encoder comprises, for example, software and hardware such as a suitably programmed memory MEM for determining the voice information and a processor PROC cooperating with this memory, or alternatively a component such as an ASIC, or other, and a communication interface COM. The encoder ENCOD is embedded in a telecommunications device such as a telephone TEL.

(72) Of course, the invention is not limited to the embodiments described above by way of example; it extends to other variants.

(73) Thus, for example, it is understood that the voice information may take different forms in variants. In the example described above, it may be the binary value of a single bit (voiced or not voiced), or a multi-bit value relating to a parameter such as the flatness of the signal spectrum or any other parameter that allows characterizing the voicing (quantitatively or qualitatively). Furthermore, this parameter may be determined at decoding, for example based on the degree of correlation which can be measured when identifying the pitch period.

(74) An embodiment was presented above by way of example which included a separation, into a high frequency band and a low frequency band, of the signal from preceding valid frames, in particular with a selection of spectral components in the low frequency band. This implementation is optional, however, although it is advantageous as it reduces the complexity of the processing. Alternatively, the method of frame replacement with the assistance of voice information in the sense of the invention can be carried out while considering the entire spectrum of the valid signal.

(75) An embodiment was described above in which the invention is implemented in a context of transform coding with overlap add. However, this type of method can be adapted to any other type of coding (CELP in particular).

(76) It should be noted that in the context of transform coding with overlap-add (where typically the synthesis signal is constructed over at least two frame durations because of the overlap), said noise signal can be obtained from the residual (between the valid signal and the sum of the peaks) by temporally weighting the residual. For example, it can be weighted by overlap windows, as in the usual context of encoding/decoding by transform with overlap.
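As a sketch of this temporal weighting, assuming a synthesis span of two frame durations and a sine taper over the second frame (the window shape is an assumption; the text only calls for overlap-window weighting):

```python
import numpy as np

def weight_residual(r, frame_len):
    """Leave the residual untouched over the first frame, then fade it
    out with a sine half-window over the second frame, as one would
    with overlap-add windows in transform decoding."""
    n = np.arange(len(r))
    ramp = np.clip((2 * frame_len - n) / frame_len, 0.0, 1.0)
    w = np.sin(0.5 * np.pi * ramp)        # 1 over frame 1, fades to 0
    return r * w
```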

(77) It is understood that applying a gain as a function of the voice information adds a further weighting, this time based on the voicing.