Method and arrangement for controlling smoothing of stationary background noise
09852739 · 2017-12-26
Assignee
Inventors
Cpc classification
G10L21/0308
PHYSICS
G10L19/06
PHYSICS
G10L2021/02168
PHYSICS
G10L19/087
PHYSICS
G10L21/02
PHYSICS
International classification
G10L19/087
PHYSICS
G10L21/02
PHYSICS
G10L21/0308
PHYSICS
Abstract
In a method for coding of information for enhancing a background noise representation, voice activity of an input speech signal is determined. A noisiness parameter is determined for an inactive speech signal, wherein the noisiness parameter is based on a ratio of prediction gains of two Linear Predictive Coder (LPC) prediction filters with different orders. The noisiness parameter is quantized, and the quantized noisiness parameter is encoded for transmission.
Claims
1. A method for coding of information for enhancing a background noise representation, the method comprising: determining voice activity of an input speech signal; determining a noisiness parameter for an inactive speech signal, wherein said noisiness parameter is based on a ratio of prediction gains of two Linear Predictive Coder (LPC) prediction filters with different orders; quantizing the noisiness parameter; and encoding the quantized noisiness parameter for transmission.
2. The method according to claim 1, wherein the noisiness parameter is obtained by a ratio $\sigma_{e,q}^2/\sigma_{e,p}^2$, where p>q and where $\sigma_e^2$ represents prediction error variance, and p and q represent orders of LPC analysis.
3. The method according to claim 1, wherein orders of said LPC prediction filters are 2nd and 16th.
4. The method according to claim 1, wherein said noisiness parameter is adapted in response to a detected narrowband or wideband content of said input speech signal.
5. The method according to claim 1, wherein quantization of the noisiness parameter comprises normalizing the noisiness parameter with factor μ.
6. The method according to claim 5, wherein μ=2 for wideband content and μ=0.5 for narrowband content.
7. A speech encoder, comprising: processing circuitry configured to determine voice activity of an input speech signal; the processing circuitry configured to determine a noisiness parameter for an inactive speech signal, wherein said noisiness parameter is based on a ratio of prediction gains of two Linear Predictive Coder (LPC) prediction filters with different orders; the processing circuitry configured to quantize the noisiness parameter; and the processing circuitry configured to encode the speech signal for transmission.
8. The speech encoder according to claim 7, wherein said processing circuitry is further configured to calculate prediction error variances $\sigma_{e,q}^2$ and $\sigma_{e,p}^2$, where p and q represent orders of LPC analysis, and the noisiness parameter is obtained as a ratio $\sigma_{e,q}^2/\sigma_{e,p}^2$, where p>q.
9. The speech encoder according to claim 7, wherein said processing circuitry is further configured to adapt the noisiness measure in response to a detected narrowband or wideband content of said input speech signal.
10. The speech encoder according to claim 7, wherein said processing circuitry is further configured to normalize the noisiness parameter with factor μ.
11. An anti-swirling method for coded background noise, the method comprising: receiving and decoding a coded speech signal; obtaining a voice activity indication and a noisiness parameter for said speech signal, wherein said noisiness parameter is based on a ratio of prediction gains of two Linear Predictive Coder (LPC) prediction filters with different orders; and adaptively smoothing background noise of said decoded speech signal based on said obtained noisiness parameter, wherein said smoothing operation is indirectly controlled by said noisiness parameter.
12. The method according to claim 11, wherein said smoothing operation is controlled by a further smoothing control parameter that is steered by said obtained noisiness parameter.
13. The method according to claim 11, wherein said noisiness parameter is received from an encoder, and decoded.
14. The method according to claim 11, wherein the smoothing control parameter is set to the maximum between the noisiness parameter and a smoothing control parameter used in a previous frame reduced by a step size δ.
15. The method according to claim 14, wherein the step size δ is 0.05.
16. The method according to claim 11, further comprising initiating said adaptive smoothing in response to said voice activity indication indicating inactive speech.
17. The method according to claim 16, comprising initiating said adaptive smoothing with a predetermined delay in response to a detected speech inactivity.
18. The method according to claim 17, wherein the predetermined delay is 5 frames.
19. The method according to claim 16, comprising resuming said background noise smoothing immediately in response to a detected speech inactivity after a spurious voice activity.
20. The method according to claim 19, wherein the spurious voice activity comprises a detected activity period of less than or equal to 3 frames.
21. The method according to claim 17, comprising gradually initiating said smoothing operation at the end of said delay.
22. The method according to claim 21, wherein the smoothing operation is gradually steered from inactivated to fully enabled during a phase-in period of K frames.
23. The method according to claim 22, wherein the smoothing control parameter for the phase-in period is modified as: $g^* = \frac{K-n}{K}\,\gamma_{inact} + \frac{n}{K}\,\gamma$.
24. The method according to claim 16, comprising terminating said adaptive smoothing immediately in response to detecting active speech.
25. A speech decoder, comprising: processing circuitry configured to receive and decode a coded speech signal; the processing circuitry further configured to obtain a voice activity indication and a noisiness parameter for said speech signal, said noisiness parameter being based on a ratio of prediction gains of two Linear Predictive Coder (LPC) prediction filters with different orders; and the processing circuitry further configured to adaptively smooth background noise of said decoded speech signal based on said obtained noisiness parameter, wherein said processing circuitry is adapted to be indirectly controlled by said noisiness parameter.
26. The speech decoder according to claim 25, wherein said processing circuitry is further configured to receive and decode said noisiness parameter.
27. The speech decoder according to claim 25, wherein the processing circuitry is further configured to initiate said adaptive smoothing in response to said speech signal having an inactive status.
28. The speech decoder according to claim 27, wherein said processing circuitry is further configured, in response to said speech signal having an inactive status, to initiate said adaptive smoothing with a predetermined delay.
29. The speech decoder according to claim 28, wherein said processing circuitry is further configured to gradually initiate said smoothing operation at the end of said delay.
30. The speech decoder according to claim 28, wherein said processing circuitry is further configured, in response to said speech signal having an active status, to terminate said adaptive smoothing immediately.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The invention, together with further objects and advantages thereof, may best be understood by referring to the following description taken together with the accompanying drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
ABBREVIATIONS
(9) AbS Analysis by Synthesis
(10) ADPCM Adaptive Differential PCM
(11) AMR-WB Adaptive Multi Rate Wide Band
(12) EVRC-WB Enhanced Variable Rate Wideband Codec
(13) CELP Code Excited Linear Prediction
(14) DTX Discontinuous Transmission
(15) DSVD Digital Simultaneous Voice and Data
(16) ISP Immittance Spectral Pair
(17) ITU-T International Telecommunication Union, Telecommunication Standardization Sector
(18) LPC Linear Predictive Coders
(19) LSF Line Spectral Frequency
(20) MPEG Moving Picture Experts Group
(21) PCM Pulse Code Modulation
(22) SMV Selectable Mode Vocoder
(23) VAD Voice Activity Detector
(24) VOIP Voice Over Internet Protocol
DETAILED DESCRIPTION
(25) The present invention will be described in the context of a wireless mobile speech session. However, it is equally applicable to a wired connection. Throughout the following description, the terms speech and voice will be used interchangeably. Accordingly, a speech session indicates a communication of voice/speech between at least two terminals or nodes in a telecommunication network. A speech session is assumed to always include two components, namely a speech component and a background noise component. The speech component is the actual voiced communication of the session, which can be active (e.g. one person is speaking) or inactive (e.g. the person is silent between words or phrases). The background noise component is the ambient noise from the environment surrounding the speaking person. This noise can be more or less stationary in nature.
(26) As mentioned before, one problem with speech sessions is how to improve the quality of the speech session in an environment including stationary background noise, or any noise for that matter. Known approaches frequently employ various methods of smoothing the background noise. However, there is a risk that a smoothing operation actually reduces the quality or “listenability” of the speech session by distorting the speech component, or by making the remaining background noise even more disturbing.
(27) In the course of investigations underlying the present invention, it was found that background noise smoothing is particularly useful only for certain background signals, such as car noise. For other background noise types, such as babble, office, double talker, etc., background noise smoothing does not provide the same degree of quality improvement to the synthesized signal and may even make the background noise reproduction unnatural. It was further found that “noisiness” is a suitable characterizing feature indicating whether or not background noise smoothing can provide quality enhancements. It was also found that noisiness is a more adequate feature than stationarity, which has been used in prior art methods.
(28) A main aim of the present invention is therefore to control the smoothing operation of stationary background noise gradually based on a noisiness measure or metric of the background signal. If during voice inactivity the background signal is found to be very noise-like, then a larger degree of smoothing is used. If the inactivity signal is less noise-like, then the degree of noise smoothing is reduced or no smoothing is carried out at all. The noisiness measure is preferably derived in the encoder and transmitted to the decoder where the control of the noise smoothing depends on it. However, it can also be derived in the decoder itself.
(29) Basically, with reference to
(30) According to a further embodiment of the invention, the noisiness metric describes how noise-like the signal is or how much of a random component it contains. More specifically, the noisiness measure or metric can be defined and described in terms of the predictability of the signal, where signals with strong random components are poorly predictable while those with weaker random components are more predictable. Consequently, such a noisiness measure can be defined by means of the well-known LPC prediction gain $G_p$ of the signal, which is defined as:
(31) $$G_p = \frac{\sigma_x^2}{\sigma_{e,p}^2} \qquad (1)$$
(32) Here $\sigma_x^2$ denotes the variance of the background (noise) signal and $\sigma_{e,p}^2$ denotes the variance of the LPC prediction error of this signal obtained with an LPC analysis of order p. Instead of variance, the prediction gain may also be defined by means of power or energy. It is also known that the prediction error variance $\sigma_{e,p}^2$ and the sequence of prediction error variances $\sigma_{e,k}^2$, $k = 1 \ldots p-1$, are readily obtained as by-products of the Levinson-Durbin algorithm, which is used for calculating the LPC parameters from the sequence of autocorrelation parameters of the background noise signal. Typically, the prediction gain is high for signals with a weak random component while it is low for noise-like signals.
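The by-product property mentioned above can be illustrated with a minimal Python sketch of the Levinson-Durbin recursion. The function names `levinson_durbin` and `prediction_gain` are illustrative and not taken from the source, and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p is an assumption of this sketch.

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation values r[0..order].

    Returns (a, err), where a = [1, a1, ..., a_order] are the LPC
    coefficients (assumed convention: A(z) = 1 + sum a_i z^-i) and
    err[k] is the prediction error variance after an order-k analysis
    (err[0] is the signal variance r[0]).
    """
    a = [1.0] + [0.0] * order
    err = [float(r[0])]
    for m in range(1, order + 1):
        # Reflection coefficient of stage m.
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err[m - 1]
        # Update the coefficients from order m-1 to order m.
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        # The error variance of every intermediate order falls out here.
        err.append(err[m - 1] * (1.0 - k * k))
    return a, err


def prediction_gain(err, p):
    """Prediction gain G_p = sigma_x^2 / sigma_{e,p}^2."""
    return err[0] / err[p]
```

For a strongly correlated input such as `r = [1.0, 0.9, 0.81]`, `prediction_gain(err, 2)` is well above 1, whereas for a white-noise-like input such as `r = [1.0, 0.0, 0.0]` it equals 1.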
(33) According to a preferred embodiment of the present invention a suitable similar noisiness metric is obtained by taking the ratio of the prediction gains of two LPC prediction filters with different orders p and q, where p>q:
(34) $$\mathrm{metric}(p,q) = \frac{G_p}{G_q} = \frac{\sigma_{e,q}^2}{\sigma_{e,p}^2} \qquad (2)$$
(35) This metric gives an indication of how much the prediction gain increases when increasing the LPC filter order from q to p. It delivers a high value if the signal has low noisiness and a value close to 1 if the noisiness is high. Suitable choices are q=2 and p=16, though other values for the LPC orders are equally possible.
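As a hedged illustration, the metric can be computed directly from a sequence of prediction error variances. In the sketch below, `err[k]` is assumed to hold the order-k prediction error variance (with `err[0]` the signal variance), and the name `noisiness_metric` is hypothetical.

```python
def noisiness_metric(err, p=16, q=2):
    """Ratio of prediction gains G_p / G_q = err[q] / err[p], with p > q.

    err[k] is assumed to be the prediction error variance of an
    order-k LPC analysis of the current frame.
    """
    if not 0 <= q < p < len(err):
        raise ValueError("need 0 <= q < p < len(err)")
    return err[q] / err[p]


# Flat error variances: increasing the order does not help (noise-like).
noise_like = [1.0] * 17
# Error variance keeps dropping beyond order 2 (less noise-like).
structured = [1.0, 0.75, 0.5] + [0.125] * 14

print(noisiness_metric(noise_like))   # close to 1
print(noisiness_metric(structured))   # clearly above 1
```

Since the error variance cannot increase with the analysis order, the metric is never below 1.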
(36) It is to be noted that preferably the above described noisiness metric or measure is determined or calculated at the encoder side, and subsequently transmitted to, and provided at the decoder side. However, it is equally possible (with only minor adaptation) to determine or calculate the noisiness metric based on the actual received signal at the decoder side.
(37) One advantage of calculating the metric at the encoder side is that the computation can be based on un-quantized LPC parameters and hence potentially has the best possible resolution. In addition, the calculation of the metric requires no extra computational complexity since (as explained above) the required prediction error variances are readily obtained as a by-product of the LPC analysis, which typically is carried out in any case. Calculating the metric in the encoder requires that the metric subsequently be quantized and that a coded representation of the quantized metric be transmitted to the decoder, where it is used for controlling the background noise smoothing. The transmission of the noisiness parameter requires some bit rate, e.g. 5 bits per 20 ms frame and hence 250 bps, which may appear as a disadvantage. However, considering that the noisiness parameter is only needed during speech inactivity periods, it is possible, according to a specific embodiment, to skip this transmission during active speech and to transmit it only during inactivity, when this bit rate is typically available since the codec does not require the same bit rate as during active speech. Similarly, considering the special case of a speech codec that encodes unvoiced speech sounds and inactivity sounds with some particular lower-rate mode, it may also be possible to afford this extra bit rate without extra cost.
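The bit-rate figure above follows from simple arithmetic, sketched here in Python. The parameter `inactive_fraction` (the share of frames in which the parameter is actually sent) is a hypothetical addition for illustrating the transmit-only-during-inactivity embodiment.

```python
def side_info_bitrate(bits_per_frame=5, frame_ms=20.0, inactive_fraction=1.0):
    """Average bit rate in bits per second for the noisiness parameter.

    inactive_fraction is an illustrative assumption: the fraction of
    frames in which the parameter is transmitted (1.0 = every frame).
    """
    frames_per_second = 1000.0 / frame_ms
    return bits_per_frame * frames_per_second * inactive_fraction


print(side_info_bitrate())                       # 5 bits every 20 ms
print(side_info_bitrate(inactive_fraction=0.4))  # sent in 40% of the frames
```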
(38) However, as already mentioned, it is possible to derive the noisiness measure at the decoder side based on the received and decoded LPC parameters. The well-known step-up/step-down procedures provide a way for calculating the sequence of prediction error variances from received LPC parameters, which in turn, as explained above, can be used to calculate the noisiness measure.
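A minimal sketch of such a decoder-side derivation follows. It assumes the decoded LPC coefficients are given as `[1, a1, ..., ap]` with the convention A(z) = 1 + a1 z^-1 + ... + ap z^-p. The step-down recursion recovers the reflection coefficients k_m, and because each stage scales the error variance by (1 - k_m^2), the ratio of error variances for orders q and p is the product of 1/(1 - k_m^2) over m = q+1 ... p. The function names are illustrative.

```python
def step_down(a):
    """Recover reflection coefficients [k1..kp] from a = [1, a1, ..., ap]."""
    cur = [float(c) for c in a]
    p = len(cur) - 1
    ks = [0.0] * p
    for m in range(p, 0, -1):
        k = cur[m]  # the last coefficient of the order-m filter is k_m
        if abs(k) >= 1.0:
            raise ValueError("filter is not minimum phase")
        ks[m - 1] = k
        denom = 1.0 - k * k
        # Coefficients of the order-(m-1) filter.
        cur = [1.0] + [(cur[i] - k * cur[m - i]) / denom for i in range(1, m)]
    return ks


def metric_from_lpc(a, q=2):
    """Noisiness metric sigma_{e,q}^2 / sigma_{e,p}^2 from LPC coefficients."""
    ratio = 1.0
    for k in step_down(a)[q:]:  # reflection coefficients k_{q+1} .. k_p
        ratio /= (1.0 - k * k)
    return ratio
```

Note that the absolute signal energy is not needed for the ratio, only the received LPC parameters.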
(39) It should be pointed out that, according to experimental results, the noisiness measure of the present invention is very beneficial in combination with a specific background noise smoothing method with which it was combined in a study. However, in combination with other anti-swirling methods it may be beneficial to combine the measure with stationarity measures, which are known from the prior art. One such measure with which the noisiness measure can be combined is an LPC parameter similarity metric. This metric evaluates the LPC parameters of two successive frames, e.g. by means of the Euclidean distance between the corresponding LPC parameter vectors, such as LSF parameters. This metric leads to large values if successive LPC parameter vectors are very different and can hence be used as an indication of the signal stationarity.
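The similarity metric described above can be sketched as a plain Euclidean distance between the parameter vectors of two successive frames. The function name and the example LSF values are illustrative only.

```python
import math


def lpc_param_distance(params_prev, params_cur):
    """Euclidean distance between the LPC parameter vectors (e.g. LSFs)
    of the previous and the current frame. Large values indicate a
    non-stationary signal."""
    if len(params_prev) != len(params_cur):
        raise ValueError("parameter vectors must have equal length")
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(params_prev, params_cur)))


# Hypothetical LSF vectors (in Hz) of two successive frames.
prev_lsf = [300.0, 700.0, 1200.0, 1900.0]
cur_lsf = [310.0, 690.0, 1210.0, 1890.0]
print(lpc_param_distance(prev_lsf, cur_lsf))
```

Unlike the noisiness metric, this measure needs the parameter vector of the previous frame to be kept in memory.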
(40) It is also to be noted that, besides the above mentioned conceptual difference between “noisiness” of the present invention and “stationarity” of prior art methods, there is at least one further important discriminating difference between these measures. Namely, calculating stationarity involves deriving at least a current parameter of a current frame and relating it to at least a previous parameter of some previous frame. Noisiness in contrast can be calculated as an instantaneous measure on a current frame without any knowledge of some earlier frame. The benefit is that memory for storing the state from a previous frame can be saved.
(41) The following embodiments describe ways in which anti-swirling methods can be controlled based on the provided noisiness measure. It is assumed that the smoothing operation is controlled by means of control factors and that, without limiting the generality, a control factor equal to 1 means no smoothing operation while a factor of 0 means smoothing with the fullest possible degree.
(42) According to a basic embodiment, the provided noisiness measure directly controls the degree of smoothing that is applied during the decoding of the background noise signal. It is assumed that the degree of smoothing is controlled by means of a parameter γ. Then it is for instance possible to map the noisiness metric from the above directly to γ according to the following example expression
$$\gamma = Q\{(\mathrm{metric} - 1)\cdot\mu\} + \nu \qquad (3)$$
(43) A suitable choice for ν is 0.5, and for μ a value between 0.5 and 2. It is to be noted that Q{·} denotes a quantization operator that also limits the number range such that the control factors do not exceed 1. It is further to be noted that the coefficient μ is preferably chosen depending on the spectral content of the input signal. In particular, if the codec is a wideband codec operating with a 16 kHz sampling rate and the input signal has a wideband spectrum (0-7 kHz), then the metric will lead to relatively smaller values than in the case that the input signal has a narrowband spectrum (0-3400 Hz). In order to compensate for this effect, μ should be larger for wideband content than for narrowband content. A suitable choice is μ=2 for wideband content and μ=0.5 for narrowband content. However, other values are also possible depending on the specific situation. Accordingly, the degree of the smoothing operation can be specifically calibrated by means of the parameter μ, depending on whether the signal comprises wideband content or narrowband content.
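The mapping of expression (3) can be sketched as follows. The source only states that Q quantizes and limits the range. The uniform quantizer with 32 levels (matching the 5-bit example given earlier) and the lower limit of 0 are assumptions of this sketch, as are the function names.

```python
def quantize_limited(x, upper, levels=32):
    """Assumed form of Q{.}: clip x to [0, upper], then round to the
    nearest of `levels` uniformly spaced values."""
    step = upper / (levels - 1)
    clipped = min(max(x, 0.0), upper)
    return min(round(clipped / step) * step, upper)


def smoothing_control(metric, wideband, nu=0.5):
    """gamma = Q{(metric - 1) * mu} + nu, limited so that gamma <= 1."""
    mu = 2.0 if wideband else 0.5  # values suggested in the text
    return quantize_limited((metric - 1.0) * mu, upper=1.0 - nu) + nu


print(smoothing_control(1.0, wideband=True))   # noise-like: gamma equals nu
print(smoothing_control(5.0, wideband=True))   # limited to 1, i.e. no smoothing
```

With ν = 0.5 the control factor stays between 0.5 (strongest smoothing reachable through this mapping) and 1 (no smoothing).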
(44) One important aspect affecting the quality of the reconstructed background noise signal is that the noisiness metric during inactivity periods may change quite rapidly. If the afore-mentioned noisiness metric is used to directly control the background noise smoothing, this may introduce undesirable signal fluctuations. According to a further preferred embodiment of the invention, with reference to
$$\gamma_{min} = Q\{(\mathrm{metric} - 1)\cdot\mu\} + \nu \qquad (4)$$
(45) Then the smoothing control parameter γ is set to the maximum between $\gamma_{min}$ and the smoothing control parameter γ′ used previously (i.e. in the previous frame) reduced by some amount δ:
$$\gamma = \max(\gamma_{min}, \gamma' - \delta) \qquad (5)$$
(46) The effect of this operation is that γ is steered step-wise towards $\gamma_{min}$ as long as γ is still greater than $\gamma_{min}$. Otherwise it is identical to $\gamma_{min}$. A suitable choice for this step size δ is 0.05. The described operation is visualized in
(47) Investigations by the inventors have shown that the smoothing of the background noise in direct or indirect dependency on the provided noisiness measure can provide quality enhancements of the reconstructed background noise signal. It has also been found that it is important for the quality to make sure that the smoothing operation is avoided during active speech and that the degree of smoothing of the background noise does not change too frequently and too rapidly.
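The indirect control described above, in which the smoothing control parameter is the maximum of the per-frame minimum value and the previous value reduced by δ, can be sketched as follows. The helper `track` and the initial value of 1.0 (smoothing disabled) are illustrative assumptions.

```python
def update_gamma(gamma_min, gamma_prev, delta=0.05):
    """gamma = max(gamma_min, gamma_prev - delta)."""
    return max(gamma_min, gamma_prev - delta)


def track(gamma_mins, gamma_init=1.0, delta=0.05):
    """Apply the update frame by frame and return the gamma trajectory."""
    gammas = []
    gamma = gamma_init
    for gamma_min in gamma_mins:
        gamma = update_gamma(gamma_min, gamma, delta)
        gammas.append(gamma)
    return gammas


# gamma approaches a lower gamma_min in steps of delta ...
print(track([0.5, 0.5, 0.5]))
# ... but follows an increase of gamma_min immediately.
print(track([0.5, 0.9], gamma_init=0.6))
```

The asymmetry is the point of the design: the degree of smoothing increases only gradually, while a reduction of smoothing takes effect at once.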
(48) A related aspect is the voice activity detection (VAD) operation that controls whether or not the background noise smoothing is enabled. Ideally, the VAD should detect the inactivity periods in between the active parts of the speech signal, in which the background noise smoothing is enabled. However, in reality there is no such ideal VAD, and it happens that parts of the active speech are declared inactive or that inactive parts are declared active speech. In order to address the problem that active speech may be declared inactive, it is common practice, e.g. in speech transmissions with discontinuous transmission (DTX), to add a so-called hangover period to the segments declared active. This artificially extends the periods declared active and decreases the likelihood that a frame is erroneously declared inactive. It has been found that a corresponding principle can also be applied with benefit in the context of controlling the background noise smoothing operation.
(49) According to a preferred embodiment of the invention, with reference to
(50) In order to improve the performance of the background noise smoothing further, it is found beneficial to gradually enable the background noise smoothing after the hangover period rather than turning it on too abruptly. In order to achieve such a gradual enabling, a phase-in period is defined during which the smoothing operation is gradually steered from inactivated to fully enabled. Assuming the phase-in period to be K frames long and further assuming that the current frame is the n-th frame in this phase-in period, the smoothing control parameter g* for that frame is obtained by interpolation between its original value γ and its value corresponding to deactivation of the smoothing operation ($\gamma_{inact}=1$):
(51) $$g^* = \frac{K-n}{K}\,\gamma_{inact} + \frac{n}{K}\,\gamma \qquad (6)$$
(52) It is to be noted that it is beneficial to activate phase-in periods only after hangover periods, i.e. not after spurious VAD activation.
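A minimal sketch of the phase-in interpolation is given below. It assumes that the interpolation is linear in the frame index n, with n running from 1 to K. The function name is illustrative.

```python
def phase_in(gamma, n, K, gamma_inact=1.0):
    """Control parameter g* for the n-th frame (1 <= n <= K) of the
    phase-in period, assuming linear interpolation between
    gamma_inact (smoothing off) and gamma (smoothing fully enabled)."""
    if not 1 <= n <= K:
        raise ValueError("n must lie in 1..K")
    return ((K - n) * gamma_inact + n * gamma) / K


# With gamma = 0.5 and K = 5, g* moves from near 1 down to gamma.
print([phase_in(0.5, n, 5) for n in range(1, 6)])
```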
(53)
(54) A further embodiment of a procedure implementing the described method with voice activity driven (VAD) activation of the background noise smoothing is shown in the flow chart of
(55) If, however, the VAD flag has a value equal to 0, indicating inactivity, then the inactive speech path is executed. Here, the inactive frame counter (Inact_count) is first incremented. It is then checked whether this counter is less than or equal to the hangover limit (Inact_count<=ho), in which case the execution path for the hangover period is carried out. In that case, the noise smoothing control parameter g* is set to 1, which disables the smoothing. In addition, the active frame counter is initialized with the spurious VAD activation limit (Act_count=enab_ho_lim), which means that hangover periods remain enabled in case of a subsequent spurious VAD activation. After that the procedure stops. If the inactive frame counter is larger than the hangover limit, it is checked whether the inactive frame counter is less than or equal to the hangover limit plus the phase-in limit (Inact_count<=ho+pi). If this is the case, the processing of the phase-in period is carried out, which means that the noise smoothing control parameter is obtained by means of interpolation (g*=interpolate) as described above. Otherwise, the noise smoothing control parameter is left unmodified. After that, the background noise smoothing procedure is carried out with a degree according to the noise smoothing control parameter. Subsequently, the active frame counter is reset (Act_count=0), which means that hangover periods are disabled after subsequent spurious VAD activations. After that the procedure stops.
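The inactive speech path described above can be sketched as a small per-frame state machine. Several points are assumptions of this sketch and not statements of the source: the handling of the active path (incrementing Act_count and resetting Inact_count once the activity lasts longer than enab_ho_lim frames), the linear form of the interpolation, and the phase-in length pi = 5. The values ho = 5 and enab_ho_lim = 3 follow the examples given in the claims.

```python
from dataclasses import dataclass


@dataclass
class SmoothingGate:
    ho: int = 5            # hangover limit in frames
    pi: int = 5            # phase-in limit in frames (assumed value)
    enab_ho_lim: int = 3   # spurious VAD activation limit
    inact_count: int = 0
    act_count: int = 0

    def step(self, vad_active, gamma):
        """Return g* for this frame (1.0 means smoothing disabled)."""
        if vad_active:
            # Assumed active path: smoothing stops immediately; a long
            # enough activity period re-arms the hangover.
            self.act_count += 1
            if self.act_count > self.enab_ho_lim:
                self.inact_count = 0
            return 1.0
        self.inact_count += 1
        if self.inact_count <= self.ho:
            # Hangover period: smoothing stays disabled.
            self.act_count = self.enab_ho_lim
            return 1.0
        if self.inact_count <= self.ho + self.pi:
            # Phase-in period: interpolate (assumed linear) from 1 to gamma.
            n = self.inact_count - self.ho
            g = ((self.pi - n) * 1.0 + n * gamma) / self.pi
        else:
            g = gamma
        self.act_count = 0
        return g
```

Feeding the gate a sequence such as eleven inactive frames, two active frames, and one more inactive frame shows the intended behaviour: the short activity burst is treated as spurious, so smoothing resumes without a new hangover or phase-in period.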
(56) Depending on the quality achieved with the noise smoothing procedure it may lead to quality enhancements not only during inactive speech but also during unvoiced speech which has a noise-like character. Hence, in this case the voice activity driven activation of the background noise smoothing may benefit from an extension that it is activated during not only inactive speech frames, but also unvoiced frames.
(57) A preferred embodiment of the invention is obtained by combining the methods with indirect control of background noise smoothing and with voice activity driven activation of the background noise smoothing.
(58) According to a further embodiment of the invention in connection with a scalable codec the degree of smoothing is generally reduced if the decoding is done with a higher rate layer. This is since higher rate speech coding usually has less swirling problems during background noise periods.
(59) A particularly beneficial embodiment of the present invention can be combined with a smoothing operation that uses a combination of LPC parameter smoothing (e.g. low-pass filtering) and excitation signal modification. In short, the smoothing operation comprises receiving and decoding a signal representative of a speech session, the signal comprising both a speech component and a background noise component. Subsequently, LPC parameters and an excitation signal are determined for the signal. Thereafter, the determined excitation signal is modified by reducing power and spectral fluctuations of the excitation signal to provide a smoothed output signal. Finally, an output signal is synthesized and output based on the determined LPC parameters and excitation signal. In combination with the controlling operation of the present invention, a synthesized speech signal with improved quality is provided.
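As a hedged illustration of the LPC parameter smoothing part only, the sketch below applies a first-order recursive low-pass filter to the parameter vector, weighted by the control factor. The specific filter form is an assumption of this sketch; the source only states that low-pass filtering may be used and that a control factor of 1 means no smoothing while 0 means the fullest possible degree.

```python
def smooth_lpc_params(prev_smoothed, current, gamma):
    """First-order low-pass of an LPC parameter vector (assumed form).

    gamma = 1.0 passes the current parameters through unchanged,
    gamma = 0.0 holds the previously smoothed parameters.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * c + (1.0 - gamma) * p
            for p, c in zip(prev_smoothed, current)]


# Hypothetical parameter vectors of two successive frames.
prev = [0.20, 0.40, 0.60]
cur = [0.30, 0.30, 0.70]
print(smooth_lpc_params(prev, cur, gamma=0.5))
```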
(60) An arrangement according to the present invention will be described below with reference to
(61) With reference to
(62) According to a further embodiment, also with reference to
(63) With reference to
(64) Advantages of the present invention include: an improved background noise smoothing operation; and improved control of background noise smoothing.
(65) It will be understood by those skilled in the art that various modifications and changes may be made to the present invention without departure from the scope thereof, which is defined by the appended claims.