Generation of comfort noise
11621004 · 2023-04-04
Abstract
A User Equipment (UE) is operative to generate CN (Comfort Noise) control parameters, e.g., as part of audio-decoding processing by the UE. A buffer of a predetermined size implemented in the UE is configured to store CN parameters for SID (Silence Insertion Descriptor) frames and active hangover frames. Processing circuitry of the UE is configured to determine a CN parameter subset relevant for SID frames based on the age of the stored CN parameters and on residual energies, and use the determined CN parameter subset to determine CN control parameters for a first SID frame following an active signal frame.
Claims
1. A method performed by a decoding circuit with respect to an encoded audio signal, the method comprising: determining Comfort Noise (CN) parameters to use for CN generation with respect to a transitional Silence Insertion Descriptor (SID) frame of the encoded audio signal, according to a subset of CN parameters corresponding to earlier SID or active hangover frames of the encoded audio signal, wherein the transitional SID frame is a first SID frame following an active non-hangover frame of the encoded audio signal; and determining the subset from among up to K CN parameter sets, each CN parameter set including a residual energy value and corresponding to a respective one among up to K earlier SID or active hangover frames, the determining based on at least one of: the number of consecutive active non-hangover frames of the audio encoded signal separating a most-recent one of the up to K earlier SID or active hangover frames and the transitional SID frame; or the residual energy values of the up to K earlier SID or active hangover frames.
2. The method of claim 1, wherein determining the subset comprises selecting CN parameter sets from among the up to K CN parameter sets in which the residual energy values are within a determined range of the residual energy value in a most-recent one among the up to K CN parameter sets.
3. The method of claim 2, wherein determining the subset further comprises reducing the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset, in dependence on the number of consecutive active non-hangover frames of the audio encoded signal separating the most-recent one of the up to K earlier SID or active hangover frames and the transitional SID frame.
4. The method of claim 3, wherein reducing the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset excludes one or more oldest ones among the up to K CN parameter sets from consideration for inclusion in the subset, irrespective of the residual energy values in the one or more oldest ones among the up to K CN parameter sets.
5. The method of claim 2, wherein determining the subset comprises reducing the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset, in dependence on the number of consecutive active non-hangover frames of the audio encoded signal separating the most-recent one of the up to K earlier SID or active hangover frames and the transitional SID frame.
6. The method of claim 5, wherein reducing the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset excludes one or more oldest ones among the up to K CN parameter sets from consideration for inclusion in the subset, irrespective of the residual energy values in the one or more oldest ones among the up to K CN parameter sets.
7. The method of claim 5, wherein determining the subset further comprises selecting CN parameter sets from among the non-excluded ones of the up to K CN parameter sets, in which the residual energy values are within a determined range of the residual energy value in a most-recent one among the up to K CN parameter sets.
8. A User Equipment (UE) configured for operation in a wireless communication network, the UE comprising: communication circuitry configured to receive an encoded audio signal from a radio network node in the wireless communication network; and a decoding circuit in or communicatively coupled to the communication circuitry, wherein, with respect to the encoded audio signal, the decoding circuit is configured to: determine Comfort Noise (CN) parameters to use for CN generation with respect to a transitional Silence Insertion Descriptor (SID) frame of the encoded audio signal, according to a subset of CN parameters corresponding to earlier SID or active hangover frames of the encoded audio signal, wherein the transitional SID frame is a first SID frame following an active non-hangover frame of the encoded audio signal; and determine the subset from among up to K CN parameter sets, each CN parameter set including a residual energy value and corresponding to a respective one among up to K earlier SID or active hangover frames, the determining based on at least one of: the number of consecutive active non-hangover frames of the audio encoded signal separating a most-recent one of the up to K earlier SID or active hangover frames and the transitional SID frame; or the residual energy values of the up to K earlier SID or active hangover frames.
9. The UE of claim 8, wherein the decoding circuit is configured to determine the subset by selecting CN parameter sets from among the up to K CN parameter sets in which the residual energy values are within a determined range of the residual energy value in a most-recent one among the up to K CN parameter sets.
10. The UE of claim 9, wherein the decoding circuit is configured to determine the subset further by reducing the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset, in dependence on the number of consecutive active non-hangover frames of the audio encoded signal separating the most-recent one of the up to K earlier SID or active hangover frames and the transitional SID frame.
11. The UE of claim 10, wherein the decoding circuit is configured to reduce the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset by excluding one or more oldest ones among the up to K CN parameter sets from consideration for inclusion in the subset, irrespective of the residual energy values in the one or more oldest ones among the up to K CN parameter sets.
12. The UE of claim 8, wherein the decoding circuit is configured to determine the subset by reducing the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset, in dependence on the number of consecutive active non-hangover frames of the audio encoded signal separating the most-recent one of the up to K earlier SID or active hangover frames and the transitional SID frame.
13. The UE of claim 12, wherein the decoding circuit is configured to reduce the number of CN parameter sets among the up to K CN parameter sets considered for inclusion in the subset by excluding one or more oldest ones among the up to K CN parameter sets from consideration for inclusion in the subset, irrespective of the residual energy values in the one or more oldest ones among the up to K CN parameter sets.
14. The UE of claim 12, wherein the decoding circuit is configured to determine the subset further by selecting CN parameter sets from among the non-excluded ones of the up to K CN parameter sets, in which the residual energy values are within a determined range of the residual energy value in a most-recent one among the up to K CN parameter sets.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The proposed technology, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
DETAILED DESCRIPTION
(14) The embodiments described below relate to a system of audio encoder and decoder mainly intended for speech communication applications using DTX with comfort noise for inactive signal representation. The system that is considered utilizes LP for coding of both active and inactive signal frames, where a VAD is used for activity decisions.
(15) In the encoder illustrated in
(16) The disclosed embodiments are part of an audio decoder. Such a decoder 100 is schematically illustrated in
(17) The decoder 100 also includes a buffer 200 of a predetermined size M, configured to receive and store CN parameters for SID and active mode hangover frames; a unit 300 configured to determine which of the stored CN parameters are relevant for SID based on the age of the stored CN parameters; a unit 400 configured to determine which of those CN parameters are relevant for SID based on residual energy measurements; and a unit 500 configured to use the CN parameters determined to be relevant for SID for the first SID frame following active signal frame(s).
(18) The parameters in the buffers are constrained to be recent in order to be relevant. Thereby the sizes of the buffers used for selection of relevant buffer subsets are reduced during longer periods of active coding. Additionally, the stored parameters are replaced by newer values during SID and actively coded hangover frames.
(19) By using circular buffers, the complexity and memory requirement for the buffer handling can be reduced. In such implementations, the already stored elements do not have to be moved when a new element is added. The position of the last added parameter, or parameter set, is used together with the size of the buffer to place new elements. When new elements are added, old elements might be overwritten.
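The circular buffer handling described above can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiments: the class name `CircularParamBuffer`, its fields, and the `latest` helper are hypothetical names introduced here.

```python
class CircularParamBuffer:
    """Fixed-size circular buffer for CN parameter sets (illustrative sketch)."""

    def __init__(self, size):
        self.size = size              # buffer size M
        self.slots = [None] * size    # stored elements are never moved
        self.pos = size - 1           # position of the last added element
        self.count = 0                # number of valid stored elements

    def add(self, params):
        # Advance the position index and wrap around at the buffer size.
        self.pos += 1
        if self.pos > self.size - 1:
            self.pos = 0
        self.slots[self.pos] = params  # may overwrite the oldest element
        self.count = min(self.count + 1, self.size)

    def latest(self, k):
        """Return up to k of the most recently stored elements, newest first."""
        k = min(k, self.count)
        return [self.slots[(self.pos - i) % self.size] for i in range(k)]
```

Adding an element costs constant time regardless of M, since only the position index changes and at most one slot is overwritten.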
(20) Since the buffers hold parameters from earlier SID and hangover frames they describe signal characteristics of previous audio frames that probably, but not necessarily, contain background noise. The number of parameters that are considered relevant is defined by the size of the buffer and the time, or corresponding number of frames, elapsed since the information was stored.
(21) The technology disclosed herein can be described in a number of algorithmic steps, e.g. performed at the decoder side illustrated in
(22) 1a. Step 1a (performed by the unit denoted step 1a in
(23)
The buffer position index $j \in [0, M-1]$ is increased by one prior to each buffer update and reset if the index exceeds the buffer size M, i.e.
$j = 0 \text{ if } j > M-1$ (8)
As will be described below, subsets $Q^K$ and $E^K$ of the $K_0$ latest stored elements in $Q^M$ and $E^M$, respectively, define the sets of stored parameters.
(24) 1b. Step 1b (performed by the unit denoted step 1b in
(25) During decoding of active frames, the size of subsets $Q^K$ and $E^K$ is decreased at a rate of $\gamma^{-1}$ elements per frame according to:
(26)
where $K_0$ is the number of stored elements in previous SID and hangover frames, $\eta$ is a non-negative integer and $p_A$ is the number of consecutive active non-hangover frames. The rate of decrement relates to time, where $\gamma = 25$ is feasible for 20 ms frames. This corresponds to a decrease by one element every half second while decoding active frames. The decrement rate constant $\gamma$ can potentially be defined as any positive value, but it should be chosen such that old noise characteristics that are unlikely to represent the current background noise are excluded from the subsets $Q^K$ and $E^K$. The value might for example be chosen based on the expected dynamics of the background noise. In addition, the natural length of speech bursts and the behavior of the VAD may be considered, as long sequences of consecutive active frames are unlikely. Typically, the constant would be in the range $\gamma \le 500$ for 20 ms frames, which corresponds to at most 10 seconds. As an alternative, equation (9) may be written in a more compact form as:
$K = K_0 - \eta \quad \text{for } \eta \cdot \gamma \le p_A < (\eta + 1) \cdot \gamma$ (10)
where
K.sub.0 is the number of CN parameters for SID frames and active hangover frames stored in the buffer 200,
(27) γ is a predetermined constant, and
(28) η is a non-negative integer.
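The age restriction of equation (10) can be sketched as below. The function name `subset_size` is hypothetical, and clamping the result at zero is an assumption made here (a subset size cannot be negative); the default γ=25 is the value given above for 20 ms frames.

```python
def subset_size(k0, p_a, gamma=25):
    """Age-restricted subset size K = K0 - eta.

    eta is the integer satisfying eta*gamma <= p_a < (eta+1)*gamma,
    i.e. the floor of p_a / gamma. Integer inputs are assumed.
    """
    eta = p_a // gamma
    # Assumption: the size is clamped so that it never becomes negative.
    return max(k0 - eta, 0)
```

With 20 ms frames and γ=25, `subset_size(8, 60)` gives 6: after 1.2 seconds of active non-hangover frames, the two oldest stored elements are no longer considered.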
(29) 2. Step 2 (performed by the unit denoted step 2 in
(30) At the first SID following active frames a subset of the buffer $E^K$ is selected based on the residual energies. The subset $E^S = \{E_0^S, \ldots, E_{L-1}^S\} \subseteq E^K$ of size L is defined as:
$E^S = \{E_k^K \in E^K \mid E_{k_0}^K - \gamma_1 \le E_k^K \le E_{k_0}^K + \gamma_2\}$ (11)
where
$E_{k_0}^K$ is the latest stored residual energy.
(31) Typically, $\gamma_2$ is selected from the range $\gamma_2 \in [0, 100]$ as larger values would include high residual energies compared to the latest stored residual energy $E_{k_0}^K$.
(32) It should be noted that the energies $E_k^K$ can be represented in a logarithmic domain, e.g. dB, as well as in the linear domain. With energies in the logarithmic domain, the selection of relevant buffer elements specified in equation (11) is described equivalently with energies $E_k^K$ in the linear domain as:
$E^S = \{E_k^K \in E^K \mid \tilde{\gamma}_1 \cdot E_{k_0}^K \le E_k^K \le \tilde{\gamma}_2 \cdot E_{k_0}^K\}$ (12)
where $\log(\tilde{\gamma}_1) = -\gamma_1$ and $\log(\tilde{\gamma}_2) = \gamma_2$. Suitable boundaries specifying the subset of the buffer $E^K$ are for example given by $\tilde{\gamma}_1 = 0.7$ and $\tilde{\gamma}_2 = 1.03$, or $\tilde{\gamma}_1 \in [0.5, 0.9]$ and $\tilde{\gamma}_2 \in [1.0, 1.25]$. The
corresponding vectors in the LSP buffer $Q^K$ define the subset $Q^S = \{q_0^S, \ldots, q_{L-1}^S\}$.
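The energy-based selection can be sketched in the linear domain as below, using the example boundaries 0.7 and 1.03 given above. The function name `select_by_energy` and the newest-first ordering of the input list are assumptions made for illustration. The function returns indices rather than energies so that the same indices can pick the corresponding LSP vectors.

```python
def select_by_energy(energies, gamma1_lin=0.7, gamma2_lin=1.03):
    """Select stored residual energies close to the latest stored one.

    energies: linear-domain residual energies of the age-restricted
              subset, ordered newest first (assumed ordering).
    Returns the indices k whose energy lies within
    [gamma1_lin * E_latest, gamma2_lin * E_latest].
    """
    if not energies:
        return []
    ref = energies[0]  # latest stored residual energy
    lower, upper = gamma1_lin * ref, gamma2_lin * ref
    return [k for k, e in enumerate(energies) if lower <= e <= upper]
```

For example, with stored energies `[100.0, 95.0, 60.0, 102.0, 110.0]` the selected indices are `[0, 1, 3]`: 60.0 falls below the lower bound and 110.0 exceeds the upper bound. The latest element is always selected, because both example ranges place 1.0 between the two boundary factors.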
(33) 3. Step 3 (performed by the unit denoted step 3 in
(34) To find a representative residual energy the weighted mean of the subset E.sup.S is computed as:
(35)
(36) where w.sub.k.sup.S are the elements in the subset of weights:
$w^S = \{w_j^M \in w^M\}$ for $\forall j \mid E_j^M \in E^S$
(37) For a maximum buffer size M=8 a suitable set of weights is:
$w^M = \{0.2, 0.16, 0.128, 0.1024, 0.08192, 0.065536, 0.0524288, 0.01048576\}$
This means that recent energies get more weight in the residual energy mean $\bar{E}$, which makes the energy transition between active and inactive frames smoother. Among LSP vectors in the subset $Q^S$, the median LSP vector is selected by computing the distances between all the LSP vectors in the subset buffer $Q^S$ according to:
(38)
(39) For every LSP vector, the distances to the other vectors are summed, i.e.
(40)
$\tilde{q} = \{q_l \in Q^S \mid S_l \le S_m \ \forall\, l \ne m\}$ for $l, m = 0, \ldots, L-1$ (16)
If several vectors have equal total distance, the median can be arbitrarily chosen among those vectors. As an alternative, the representative LSP vector may be determined as the mean vector of the subset $Q^S$.
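The determination of representative parameters in step 3 can be sketched as below. Two points are assumptions made here because the corresponding equations are not reproduced in this text: the weighted mean is normalized by the sum of the selected weights, and the distance between two LSP vectors is taken as the squared Euclidean distance. The function names are hypothetical.

```python
def weighted_mean_energy(energies, weights):
    """Weighted mean of the selected residual energies.

    Assumption: the mean is normalized by the sum of the selected weights.
    """
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, energies)) / total


def median_lsp(vectors):
    """Return the vector with the smallest summed distance to the others.

    Assumption: squared Euclidean distance is used as the distance measure.
    Ties are resolved by taking the first such vector.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # The distance of a vector to itself is zero, so it can stay in the sum.
    sums = [sum(dist(v, u) for u in vectors) for v in vectors]
    return vectors[sums.index(min(sums))]
```

Selecting an existing vector rather than averaging keeps the result a valid LSP vector that actually occurred in the signal, whereas the mean vector alternative mentioned above blends all selected frames.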
(41) 4. Step 4 (performed by the unit denoted step 4 in
(42) Suitable values are for example $\alpha = 0.2$ and $\beta = 0.2$ or $\beta = 0.05$. The comfort noise parameters for the first SID frame are then used by a comfort noise generator 32 to control filling of no data frames from mode selector 26 with noise based on excitations from excitation generator 34.
(43) If the subsets $Q^S$ and $E^S$ are empty, the latest extracted SID parameters may be used directly without interpolation from older noise parameters.
(44) The transmitted LSP vector $\tilde{q}_{SID}$ used in the interpolation is usually obtained in the encoder directly from the LP analysis of the current frame, i.e. no previous frames are considered. The transmitted residual energy $\bar{E}_{SID}$ is preferably obtained using LP parameters corresponding to the LSP parameters used for the signal synthesis in the decoder. These LSP parameters can be obtained in the encoder by performing steps 1-4 with a corresponding encoder side buffer. Operating the encoder in this way implies that the energy of the decoder output can be matched to the input signal energy by control of the encoded and transmitted residual energy since the decoder synthesis LP parameters are known in the encoder.
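The interpolation of step 4, including the fallback for empty subsets, might look as sketched below. The interpolation equations are not reproduced in this text, so the linear form used here (factor α or β applied to the representative parameter, and its complement applied to the decoded SID parameter) is an assumption, as are the function name `interpolate_cn` and the use of `None` to signal empty subsets.

```python
def interpolate_cn(q_rep, e_rep, q_sid, e_sid, alpha=0.2, beta=0.2):
    """Interpolate representative CN parameters with decoded SID parameters.

    q_rep, e_rep: representative LSP vector and residual energy from the
                  stored subsets, or None if the subsets are empty.
    q_sid, e_sid: LSP vector and residual energy decoded from the SID frame.
    Assumption: plain linear interpolation with factors alpha and beta.
    """
    if q_rep is None or e_rep is None:
        # Empty subsets: use the latest SID parameters directly.
        return list(q_sid), e_sid
    q = [alpha * r + (1.0 - alpha) * s for r, s in zip(q_rep, q_sid)]
    e = beta * e_rep + (1.0 - beta) * e_sid
    return q, e
```

Under this assumed form, a smaller β such as the 0.05 mentioned above would let the decoded SID energy dominate the result.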
(45)
(46) Although it is true that there will be only one first SID frame following an active signal frame, it will indirectly affect the CN parameters in following SID frames due to the smoothing/interpolation.
(47)
(48)
(49)
(50) $E^K$ of the stored CN parameters based on the number $p_A$ of consecutive active non-hangover frames, for example as described under subsection 1b above. A buffer element selector 300 is configured to select the CN parameter subset $Q^S$, $E^S$ from the age restricted subset $Q^K$, $E^K$ based on residual energies, for example as described under subsection 2 above. A comfort noise parameter estimator 400 is configured to determine representative CN parameters $\tilde{q}$, $\bar{E}$ from the CN parameter subset $Q^S$, $E^S$, for example as described under subsection 3 above. A comfort noise parameter interpolator 500 is configured to interpolate the representative CN parameters $\tilde{q}$, $\bar{E}$ with decoded CN parameters $\tilde{q}_{SID}$, $\bar{E}_{SID}$, for example as described under subsection 4 above. The obtained comfort noise control parameters $q_i$, $E_i$ for the first SID frame are then used by comfort noise generator 32 to control filling of no data frames with noise based on excitations from excitation generator 34.
(51) The steps, functions, procedures and/or blocks described herein may be implemented in hardware using any conventional technology, such as discrete circuit or integrated circuit technology, including both general-purpose electronic circuitry and application-specific circuitry.
(52) Alternatively, at least some of the steps, functions, procedures and/or blocks described herein may be implemented in software for execution by suitable processing equipment. This equipment may include, for example, one or several microprocessors, one or several Digital Signal Processors (DSP), one or several Application Specific Integrated Circuits (ASIC), video accelerated hardware or one or several suitable programmable logic devices, such as Field Programmable Gate Arrays (FPGA). Combinations of such processing elements are also feasible.
(53) It should also be understood that it may be possible to reuse the general processing capabilities already present in a network node, such as a mobile terminal or PC. This may, for example, be done by reprogramming the existing software or by adding new software components.
(54)
(55) According to an aspect of the embodiments, a decoder for generating comfort noise representing an inactive signal is provided. The decoder can operate in DTX mode and can be implemented in a mobile terminal, or by a computer program product that runs in the mobile terminal or a PC. The computer program product can be downloaded from a server to the mobile terminal.
(56)
(57)
(58) In the embodiments of the proposed technology described above the LP coefficients $\alpha_k$ are transformed to an LSP domain. However, the same principles may also be applied to LP coefficients that are transformed to an LSF, ISP or ISF domain.
(59) For codecs with attenuation of the comfort noise it can be beneficial to gradually attenuate the actively coded signal during VAD hangover frames. The energy for the comfort noise would then better match the latest actively coded frame, which further improves the perceived audio quality. An attenuation factor λ can be computed and applied to the LP residual for each hangover frame by:
(60)
where $p_{HO}$ is the number of consecutive VAD hangover frames. As an alternative, λ may be computed as:
(61)
where $L = 0.6$ and $L_0 = 6$ control the maximum attenuation and rate of attenuation. The maximum attenuation can typically be selected in the range $L \in [0.5, 1)$ and the rate control parameter $L_0$ may for example be selected such that
(62)
where $p_{HO}^{FULL}$ is the number of frames needed for maximum attenuation. $p_{HO}^{FULL}$ could for example be set to the average or maximum number of consecutive VAD hangover frames that is possible (due to the hangover addition in the VAD). Typically, this would be in the range of $p_{HO}^{FULL} \in \{1, \ldots, 15\}$ frames.
(63) It should be understood that the technology described herein can co-operate with other solutions handling the first CN frames following active signal segments. For example, it can complement an algorithm where a large change in CN parameters is allowed for high energy frames (relative to background noise level). For these frames, the previous noise characteristics might not much affect the update in the current SID frame. The described technology may then be used for frames that are not detected as high energy frames.
(64) It will be understood by those skilled in the art that various modifications and changes may be made to the proposed technology without departure from the scope thereof, which is defined by the appended claims.
ABBREVIATIONS
(65) ACELP Algebraic Code-Excited Linear Prediction
(66) AMR Adaptive Multi-Rate
(67) AMR NB AMR Narrowband
(68) AR Auto Regressive
(69) ASIC Application Specific Integrated Circuits
(70) CN Comfort Noise
(71) DFT Discrete Fourier Transform
(72) DSP Digital Signal Processors
(73) DTX Discontinuous Transmission
(74) EEPROM Electrically Erasable Programmable Read-only Memory
(75) FPGA Field Programmable Gate Arrays
(76) ISF Immittance Spectrum Frequencies
(77) ISP Immittance Spectrum Pairs
(78) LP Linear Prediction
(79) LSF Line Spectral Frequencies
(80) LSP Line Spectral Pairs
(81) MDCT Modified Discrete Cosine Transform
(82) RAM Random-access memory
(83) SAD Sound Activity Detector
(84) SID Silence Insertion Descriptor
(85) UE User Equipment
(86) VAD Voice Activity Detector