Voice audio encoding device, voice audio decoding device, voice audio encoding method, and voice audio decoding method
09767815 · 2017-09-19
International classification
G10L19/00
PHYSICS
G10L19/02
PHYSICS
Abstract
Provided are a voice audio encoding device, a voice audio decoding device, a voice audio encoding method, and a voice audio decoding method that distribute bits efficiently and improve sound quality. A dominant frequency band identification unit identifies a dominant frequency band whose norm factor value is the maximum value within the spectrum of an input voice audio signal. Dominant group determination units and a non-dominant group determination unit group all sub-bands into a dominant group that contains the dominant frequency band and a non-dominant group that contains no dominant frequency band. A group bit distribution unit distributes bits to each group on the basis of the energy and norm variance of each group. A sub-band bit distribution unit redistributes the bits that have been distributed to each group to each sub-band in accordance with the ratio of the norm to the energy of the group.
Claims
1. A speech/audio coding apparatus comprising: a receiver that receives a time-domain speech/audio input signal; a memory; and a processor that transforms the speech/audio input signal into a frequency domain; splits a frequency spectrum of the speech/audio signal to obtain a plurality of subbands; estimates an energy envelope which represents an energy level for each of the plurality of subbands; quantizes the energy envelope; determines a plurality of groups from the quantized energy envelope, each of the plurality of groups being composed of a plurality of subbands; allocates bits to the determined plurality of groups on a group-by-group basis; allocates the bits allocated to each of the plurality of groups to the plurality of subbands included in each of the groups on a subband-by-subband basis; and encodes the frequency spectrum using the bits allocated to the subbands, wherein, when determining the plurality of groups, the processor identifies one or more dominant groups which are composed of a dominant frequency subband in which an energy envelope of the frequency spectrum has a local maximum value and mutually adjacent subbands on both sides of the dominant frequency subband, the mutually adjacent subbands each forming a descending slope of an energy envelope, and identifies one or more non-dominant groups which are composed of mutually adjacent subbands other than those included in the one or more dominant groups.
2. The speech/audio coding apparatus according to claim 1, wherein the processor further calculates group-specific energy, and wherein the processor allocates, based on the calculated group-specific energy, more bits to a group when the energy is greater and allocates fewer bits to a group when the energy is smaller.
3. The speech/audio coding apparatus according to claim 1, wherein the processor allocates more bits to a subband having a greater energy envelope and allocates fewer bits to a subband having a smaller energy envelope.
4. The speech/audio coding apparatus according to claim 1, wherein a group width of the dominant group is defined as a width of a group of subbands centered on both sides of the dominant frequency band up to subbands where a descending slope of a norm coefficient value ends.
5. The speech/audio coding apparatus according to claim 1, wherein when the dominant frequency band is the highest frequency band or the lowest frequency band among available frequency bands, only one side of the descending slope is included in the dominant group.
6. A speech/audio decoding apparatus comprising: a receiver that receives encoded speech/audio data; a memory; and a processor that de-quantizes a quantized spectral envelope; determines a plurality of groups from the quantized spectral envelope, each of the plurality of groups being composed of a plurality of subbands; allocates bits to the determined plurality of groups on a group-by-group basis; allocates the bits allocated to each of the plurality of groups to the plurality of subbands included in each of the groups on a subband-by-subband basis; decodes a frequency spectrum of a speech/audio signal using the bits allocated to the subbands; applies the de-quantized spectral envelope to the decoded frequency spectrum and reproduces a decoded spectrum; and inversely transforms the decoded spectrum from a frequency domain to a time domain, wherein, when determining the plurality of groups, the processor identifies one or more dominant groups which are composed of a dominant frequency subband in which an energy envelope of the frequency spectrum has a local maximum value and mutually adjacent subbands on both sides of the dominant frequency subband, the mutually adjacent subbands each forming a descending slope of an energy envelope, and identifies one or more non-dominant groups which are composed of mutually adjacent subbands other than those included in the one or more dominant groups.
7. The speech/audio decoding apparatus according to claim 6, wherein the processor further calculates group-specific energy, and wherein the processor allocates, based on the calculated group-specific energy, more bits to the groups when the energy is greater and allocates fewer bits to the groups when the energy is smaller.
8. The speech/audio decoding apparatus according to claim 6, wherein the processor allocates more bits to subbands having a greater energy envelope and allocates fewer bits to subbands having a smaller energy envelope.
9. A speech/audio coding method comprising: receiving a time-domain speech/audio input signal; transforming the speech/audio input signal into a frequency domain; splitting a frequency spectrum of the speech/audio signal to obtain a plurality of subbands; estimating an energy envelope that represents an energy level for each of the plurality of subbands; quantizing the energy envelope; determining, from the quantized energy envelope, a plurality of groups, each of the plurality of groups being composed of a plurality of subbands; allocating bits to the determined plurality of groups on a group-by-group basis; allocating the bits allocated to each of the plurality of groups to the plurality of subbands included in each of the groups on a subband-by-subband basis; and encoding the frequency spectrum using the bits allocated to the subbands, wherein, when determining the plurality of groups, identifying one or more dominant groups which are composed of a dominant frequency subband in which an energy envelope of the frequency spectrum has a local maximum value and mutually adjacent subbands on both sides of the dominant frequency subband, the mutually adjacent subbands each forming a descending slope of an energy envelope, and identifying one or more non-dominant groups which are composed of mutually adjacent subbands other than those included in the one or more dominant groups.
10. A speech/audio decoding method comprising: receiving encoded speech/audio data; de-quantizing a quantized spectral envelope; determining a plurality of groups from the quantized spectral envelope, each of the plurality of groups being composed of a plurality of subbands; allocating bits to the determined plurality of groups on a group-by-group basis; allocating the bits allocated to each of the plurality of groups to the plurality of subbands included in each of the groups on a subband-by-subband basis; decoding a frequency spectrum of a speech/audio signal using the bits allocated to the subbands; applying the de-quantized spectral envelope to the decoded frequency spectrum and reproducing a decoded spectrum; and inversely transforming the decoded spectrum from a frequency domain to a time domain, wherein, when determining the plurality of groups, identifying one or more dominant groups which are composed of a dominant frequency subband in which an energy envelope of the frequency spectrum has a local maximum value and mutually adjacent subbands on both sides of the dominant frequency subband, the mutually adjacent subbands each forming a descending slope of an energy envelope, and identifying one or more non-dominant groups which are composed of mutually adjacent subbands other than those included in the one or more dominant groups.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
DESCRIPTION OF EMBODIMENTS
(10) Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
(11) (Embodiment)
(12)
(13) Transient detector 101 detects, from an input signal, either a transient frame corresponding to a leading edge or an end edge of speech or a stationary frame corresponding to a speech section other than that, and outputs the detection result to transformation section 102. Transformation section 102 applies, to the frame of the input signal, high-frequency resolution transformation or low-frequency resolution transformation depending on whether the detection result outputted from transient detector 101 is a transient frame or stationary frame, and acquires a spectral coefficient (or transform coefficient) and outputs the spectral coefficient to norm estimation section 103 and spectrum normalization section 105. Transformation section 102 outputs a frame configuration which is the detection result outputted from transient detector 101, that is, a transient signal flag indicating whether the frame is a stationary frame or a transient frame to multiplexer 110.
(14) Norm estimation section 103 splits the spectral coefficient outputted from transformation section 102 into bands of different bandwidths and estimates a norm (or energy) of each split band. Norm estimation section 103 outputs the estimated norm of each band to norm quantization section 104.
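The band splitting and norm estimation of paragraph (14) can be sketched in Python as follows. This is only an illustration: the band boundaries in band_edges and the use of the root-mean-square level as the norm are assumptions of this example, since the text states only that a norm (or energy) is estimated for each band.

```python
import math


def estimate_norms(coeffs, band_edges):
    """Return one norm per band, here taken as the RMS level (an assumption).

    band_edges is a hypothetical list of boundaries: band b covers
    coeffs[band_edges[b]:band_edges[b + 1]], so bands may differ in width.
    """
    norms = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = coeffs[lo:hi]
        norms.append(math.sqrt(sum(c * c for c in band) / len(band)))
    return norms


# Three bands of widths 2, 2 and 4 over an eight-coefficient toy spectrum.
toy_spectrum = [3.0, 4.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
toy_norms = estimate_norms(toy_spectrum, [0, 2, 4, 8])
```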
(15) Norm quantization section 104 determines a spectral envelope made up of norms of all bands based on norms of respective bands outputted from norm estimation section 103, quantizes the determined spectral envelope and outputs the quantized spectral envelope to spectrum normalization section 105 and norm adjustment section 106.
(16) Spectrum normalization section 105 normalizes the spectral coefficient outputted from transformation section 102 according to the quantized spectral envelope outputted from norm quantization section 104 and outputs the normalized spectral coefficient to lattice-vector coding section 108.
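A minimal sketch of the normalization in paragraph (16) is given below, assuming that normalizing means dividing each coefficient by the quantized norm of its band. The text does not spell out the operation, and the function name, the band_edges layout and the handling of zero-norm bands are hypothetical.

```python
def normalize_spectrum(coeffs, band_edges, quantized_norms):
    """Divide every coefficient by the quantized norm of its band.

    Assumption: band b covers coeffs[band_edges[b]:band_edges[b + 1]].
    Bands whose quantized norm is zero are emitted as zeros.
    """
    normalized = []
    for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        scale = quantized_norms[b]
        for c in coeffs[lo:hi]:
            normalized.append(c / scale if scale > 0 else 0.0)
    return normalized


flat = normalize_spectrum([2.0, 4.0, 3.0], [0, 2, 3], [2.0, 3.0])
```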
(17) Norm adjustment section 106 adjusts the quantized spectral envelope outputted from norm quantization section 104 based on adaptive spectral weighting and outputs the adjusted quantized spectral envelope to bit allocation section 107.
(18) Bit allocation section 107 allocates available bits for each band in a frame using the adjusted quantized spectral envelope outputted from norm adjustment section 106 and outputs the allocated bits to lattice-vector coding section 108. Details of bit allocation section 107 will be described later.
(19) Lattice-vector coding section 108 performs lattice-vector coding on the spectral coefficient normalized by spectrum normalization section 105 using the bits allocated for each band in bit allocation section 107 and outputs the lattice coding vector to noise level adjustment section 109 and multiplexer 110.
(20) Noise level adjustment section 109 estimates the level of the spectral coefficient prior to coding in lattice-vector coding section 108 and encodes the estimated level. A noise level adjustment index is determined in this way. The noise level adjustment index is outputted to multiplexer 110.
(21) Multiplexer 110 multiplexes the transient signal flag outputted from transformation section 102, quantized spectral envelope outputted from norm quantization section 104, lattice coding vector outputted from lattice-vector coding section 108 and noise level adjustment index outputted from noise level adjustment section 109, and forms a bit stream and transmits the bit stream to a speech/audio decoding apparatus.
(22)
(23) Norm de-quantization section 202 de-quantizes the quantized spectral envelope (that is, norm) outputted from demultiplexer 201, obtains a spectral envelope made up of norms of all bands and outputs the spectral envelope obtained to norm adjustment section 203.
(24) Norm adjustment section 203 adjusts the spectral envelope outputted from norm de-quantization section 202 based on adaptive spectral weighting and outputs the adjusted spectral envelope to bit allocation section 204.
(25) Bit allocation section 204 allocates available bits for each band in a frame using the spectral envelope outputted from norm adjustment section 203. That is, bit allocation section 204 recalculates the bit allocation that is needed to decode the lattice-vector code of the normalized spectral coefficient. The allocated bits are outputted to lattice decoding section 205.
(26) Lattice decoding section 205 decodes the lattice coding vector outputted from demultiplexer 201 based on a frame configuration indicated by the transient signal flag outputted from demultiplexer 201 and the bits outputted from bit allocation section 204 and acquires a spectral coefficient. The spectral coefficient is outputted to spectral-fill generator 206 and adder 207.
(27) Spectral-fill generator 206 regenerates a low-frequency spectral coefficient to which no bit has been allocated using a codebook created based on the spectral coefficient outputted from lattice decoding section 205. Spectral-fill generator 206 adjusts the level of the regenerated spectral coefficient using the noise level adjustment index outputted from demultiplexer 201. Furthermore, spectral-fill generator 206 regenerates the spectral coefficient not subjected to high-frequency coding using a low-frequency coded spectral coefficient. The level-adjusted low-frequency spectral coefficient and regenerated high-frequency spectral coefficient are outputted to adder 207.
(28) Adder 207 adds up the spectral coefficient outputted from lattice decoding section 205 and the spectral coefficient outputted from spectral-fill generator 206, generates a normalized spectral coefficient and outputs the normalized spectral coefficient to envelope shaping section 208.
(29) Envelope shaping section 208 applies the spectral envelope outputted from norm de-quantization section 202 to the normalized spectral coefficient generated by adder 207 and generates a full-band spectral coefficient (corresponding to the decoded spectrum). The full-band spectral coefficient generated is outputted to inverse transformation section 209.
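The envelope shaping of paragraph (29) can be pictured as the inverse of the encoder-side normalization. The sketch below assumes that applying the envelope means multiplying each normalized coefficient by the de-quantized norm of its band; the function name and the band_edges layout are hypothetical.

```python
def shape_envelope(normalized, band_edges, norms):
    """Scale every normalized coefficient by the norm of its band.

    Assumption: band b covers normalized[band_edges[b]:band_edges[b + 1]].
    """
    shaped = []
    for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        for c in normalized[lo:hi]:
            shaped.append(c * norms[b])
    return shaped


full_band = shape_envelope([1.0, 2.0, 1.0], [0, 2, 3], [2.0, 3.0])
```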
(30) Inverse transformation section 209 applies an inverse transform such as the inverse modified discrete cosine transform (IMDCT) to the full-band spectral coefficient outputted from envelope shaping section 208, transforms it into a time-domain signal and outputs an output signal. Here, an inverse transform with high-frequency resolution is applied in the case of a stationary frame and an inverse transform with low-frequency resolution is applied in the case of a transient frame.
(31) Next, the details of bit allocation section 107 will be described using
(32)
(33) Dominant group determining sections 302-1 to 302-N adaptively determine group widths according to input signal characteristics, centered on the dominant frequency band outputted from dominant frequency band identification section 301. More specifically, the group width is defined as the width of a group of subbands that is centered on the dominant frequency band and extends on both sides up to the subbands where the descending slope of the norm coefficient value stops. Dominant group determining sections 302-1 to 302-N determine the frequency bands included in the group widths as dominant groups and output the determined dominant groups to non-dominant group determining section 303. Note that when a dominant frequency band is located at an edge (end of the available frequency range), only one side of the descending slope is included in the group.
(34) Non-dominant group determining section 303 determines continuous subbands, other than the dominant groups outputted from dominant group determining sections 302-1 to 302-N, as non-dominant groups without dominant frequency bands. Non-dominant group determining section 303 outputs the dominant groups and the non-dominant groups to group energy calculation section 304 and norm variance calculation section 306.
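The grouping described in paragraphs (33) and (34) can be sketched as follows. This is an illustrative reading rather than the embodiment itself: the indices of the dominant frequency bands are passed in as an input because the selection performed by section 301 is not detailed here, subbands are numbered from 0, and the function name and the toy norm values are hypothetical.

```python
def build_groups(norms, dominant_bands):
    """Return (kind, first, last) tuples that cover every subband once."""
    n = len(norms)
    in_dominant = [False] * n
    groups = []
    for peak in sorted(dominant_bands):
        if in_dominant[peak]:
            continue  # already swallowed by the slope of an earlier peak
        lo = peak
        # Walk left while the norm keeps descending away from the peak.
        while lo > 0 and norms[lo - 1] < norms[lo] and not in_dominant[lo - 1]:
            lo -= 1
        hi = peak
        # Walk right in the same way; at a spectrum edge one side is empty.
        while hi < n - 1 and norms[hi + 1] < norms[hi]:
            hi += 1
        for i in range(lo, hi + 1):
            in_dominant[i] = True
        groups.append(("dominant", lo, hi))
    # Every remaining run of contiguous subbands is a non-dominant group.
    start = None
    for i in range(n + 1):
        free = i < n and not in_dominant[i]
        if free and start is None:
            start = i
        elif not free and start is not None:
            groups.append(("non-dominant", start, i - 1))
            start = None
    return sorted(groups, key=lambda g: g[1])


toy_norms = [5, 5, 5, 6, 8, 10, 8, 6, 6, 6, 6, 7, 9, 7, 7, 7]
toy_groups = build_groups(toy_norms, [5, 12])
```

With these toy values the envelope splits into five groups, two dominant and three non-dominant, and a peak located at index 0 would receive only its right-hand slope.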
(35) Group energy calculation section 304 calculates the group-specific energy of the dominant groups and the non-dominant groups outputted from non-dominant group determining section 303 and outputs the calculated energy to total energy calculation section 305 and group bit distribution section 308. The group-specific energy is calculated by the following Equation 1.
$$\mathrm{Energy}(G(k)) = \sum_{i=1}^{M} \mathrm{Norm}(i) \qquad \text{(Equation 1)}$$
(36) Here, k denotes an index of each group, Energy(G(k)) denotes energy of group k, i denotes a subband index of group k, M denotes the total number of subbands of group k and Norm(i) denotes a norm coefficient value of subband i of group k.
(37) Total energy calculation section 305 adds up all the group-specific energy outputted from group energy calculation section 304 to calculate the total energy of all groups. The calculated total energy is outputted to group bit distribution section 308. The total energy is calculated by the following Equation 2.
$$\mathrm{Energy}_{total} = \sum_{k=1}^{N} \mathrm{Energy}(G(k)) \qquad \text{(Equation 2)}$$
(38) Here, $\mathrm{Energy}_{total}$ denotes the total energy of all groups, N denotes the total number of groups in a spectrum, k denotes an index of each group, and Energy(G(k)) denotes the energy of group k.
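Equations 1 and 2 amount to two nested sums, as the following Python sketch shows. The representation of a group as a (first, last) pair of 0-based subband indices and the toy norm values are assumptions made only for this example.

```python
def group_energy(norms, group):
    """Equation 1: sum of the norm coefficients of the subbands in a group."""
    first, last = group
    return sum(norms[first:last + 1])


def total_energy(norms, groups):
    """Equation 2: sum of the group-specific energies over all groups."""
    return sum(group_energy(norms, g) for g in groups)


toy_norms = [5, 5, 5, 6, 8, 10, 8, 6, 6, 6, 6, 7, 9, 7, 7, 7]
toy_groups = [(0, 1), (2, 7), (8, 9), (10, 13), (14, 15)]
energies = [group_energy(toy_norms, g) for g in toy_groups]
energy_total = total_energy(toy_norms, toy_groups)
```

Because the groups cover every subband exactly once, the total energy equals the sum of all norm coefficients in the spectrum.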
(39) Norm variance calculation section 306 calculates the group-specific norm variance for the dominant groups and the non-dominant groups outputted from non-dominant group determining section 303, and outputs the calculated norm variance to total norm variance calculation section 307 and group bit distribution section 308. The group-specific norm variance is calculated by the following Equation 3.
$$\mathrm{Norm}_{var}(G(k)) = \mathrm{Norm}_{max}(G(k)) - \mathrm{Norm}_{min}(G(k)) \qquad \text{(Equation 3)}$$
(40) Here, k denotes an index of each group, $\mathrm{Norm}_{var}(G(k))$ denotes the norm variance of group k, $\mathrm{Norm}_{max}(G(k))$ denotes the maximum norm coefficient value of group k, and $\mathrm{Norm}_{min}(G(k))$ denotes the minimum norm coefficient value of group k.
(41) Total norm variance calculation section 307 calculates the total norm variance of all groups based on the group-specific norm variance outputted from norm variance calculation section 306. The calculated total norm variance is outputted to group bit distribution section 308. The total norm variance is calculated by the following Equation 4.
$$\mathrm{Norm}_{vartotal} = \sum_{k=1}^{N} \mathrm{Norm}_{var}(G(k)) \qquad \text{(Equation 4)}$$
(42) Here, $\mathrm{Norm}_{vartotal}$ denotes the total norm variance of all groups, N denotes the total number of groups in a spectrum, k denotes an index of each group, and $\mathrm{Norm}_{var}(G(k))$ denotes the norm variance of group k.
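Equations 3 and 4 can be sketched in the same way. Note that the norm variance defined by Equation 3 is the difference between the largest and smallest norm coefficient of a group, not a statistical variance. The (first, last) group representation with 0-based indices and the toy values are assumptions of this example.

```python
def norm_variance(norms, group):
    """Equation 3: maximum minus minimum norm coefficient within a group."""
    first, last = group
    members = norms[first:last + 1]
    return max(members) - min(members)


def total_norm_variance(norms, groups):
    """Equation 4: sum of the group-specific norm variances."""
    return sum(norm_variance(norms, g) for g in groups)


toy_norms = [5, 5, 5, 6, 8, 10, 8, 6, 6, 6, 6, 7, 9, 7, 7, 7]
toy_groups = [(0, 1), (2, 7), (8, 9), (10, 13), (14, 15)]
variances = [norm_variance(toy_norms, g) for g in toy_groups]
variance_total = total_norm_variance(toy_norms, toy_groups)
```

In this toy envelope the flat groups have a norm variance of zero, while the two groups built around a peak carry all of the total.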
(43) Group bit distribution section 308 (corresponding to a first bit allocation section) distributes bits on a group-by-group basis based on the group-specific energy outputted from group energy calculation section 304, the total energy of all groups outputted from total energy calculation section 305, the group-specific norm variance outputted from norm variance calculation section 306 and the total norm variance of all groups outputted from total norm variance calculation section 307, and outputs the bits distributed on a group-by-group basis to subband bit distribution section 309. The bits distributed on a group-by-group basis are calculated by the following Equation 5.
(44)
(45) Here, k denotes an index of each group, Bits(G(k)) denotes the number of bits distributed to group k, $\mathrm{Bits}_{total}$ denotes the total number of available bits, scale1 denotes the ratio of bits allocated by energy, Energy(G(k)) denotes energy of group k, $\mathrm{Energy}_{total}$ denotes total energy of all groups, and $\mathrm{Norm}_{var}(G(k))$ denotes a norm variance of group k.
(46) Furthermore, scale1 in equation 5 above takes on a value within a range of [0, 1] and adjusts the ratio of bits allocated by energy or norm variance. The greater the value of scale1, the more bits are allocated by energy, and in an extreme case, if the value is 1, all bits are allocated by energy. The smaller the value of scale1, the more bits are allocated by norm variance, and in an extreme case, if the value is 0, all bits are allocated by norm variance.
(47) By distributing bits on a group-by-group basis as described above, group bit distribution section 308 can distribute more bits to dominant groups and distribute fewer bits to non-dominant groups.
(48) Thus, group bit distribution section 308 can determine the perceptual importance of each group by energy and norm variance and enhance dominant groups more. The norm variance matches a masking theory and can determine the perceptual importance more accurately.
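The group-level distribution can be sketched as follows. Equation 5 is not reproduced in this text, so the sketch assumes the weighted form implied by paragraphs (43), (45) and (46): each group receives the total bit budget multiplied by scale1 times its share of the total energy plus (1 - scale1) times its share of the total norm variance. The function name, the fallback for a completely flat envelope, the fractional (unrounded) bit counts and the toy numbers are all assumptions of this example.

```python
def distribute_group_bits(bits_total, energies, variances, scale1):
    """Split bits_total over groups by energy share and norm-variance share.

    Assumed form of Equation 5:
        bits_k = bits_total * (scale1 * E_k / E_total
                               + (1 - scale1) * V_k / V_total)
    """
    energy_total = sum(energies)
    variance_total = sum(variances)
    bits = []
    for energy, variance in zip(energies, variances):
        energy_share = energy / energy_total
        # Assumption: if every group is flat there is no variance to share
        # out, so the energy share is used for both terms.
        if variance_total > 0:
            variance_share = variance / variance_total
        else:
            variance_share = energy_share
        weight = scale1 * energy_share + (1.0 - scale1) * variance_share
        bits.append(bits_total * weight)
    return bits


toy_energies = [10, 43, 12, 29, 14]   # group energies (Equation 1)
toy_variances = [0, 5, 0, 3, 0]       # group norm variances (Equation 3)
group_bits = distribute_group_bits(216, toy_energies, toy_variances, 0.5)
```

Under this assumed form the two weights sum to 1, so the group allocations add up to the total budget, and setting scale1 to 1 or 0 reproduces the two extreme cases described in paragraph (46).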
(49) Subband bit distribution section 309 (corresponding to a second bit allocation section) distributes bits to the subbands in each group based on the group-specific bits outputted from group bit distribution section 308 and outputs the bits allocated to the subbands of each group to lattice-vector coding section 108 as the bit allocation result. Here, more bits are distributed to perceptually important subbands and fewer bits are distributed to perceptually less important subbands. The bits distributed to each subband in a group are calculated by the following Equation 6.
(50)
(51) Here, $\mathrm{Bits}_{G(k)sb(i)}$ denotes the number of bits allocated to subband i of group k, i denotes a subband index of group k, $\mathrm{Bits}_{G(k)}$ denotes the number of bits allocated to group k, Energy(G(k)) denotes energy of group k, and Norm(i) denotes a norm coefficient value of subband i of group k.
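The subband-level distribution can be sketched in the same manner. Equation 6 is not reproduced in this text, so the sketch assumes the proportional form suggested by paragraph (51): each subband receives the bits of its group multiplied by the ratio of its norm coefficient to the group energy. The function name, the fractional bit counts and the toy numbers are assumptions of this example.

```python
def distribute_subband_bits(group_bits, group_norms):
    """Split the bits of one group over its subbands in proportion to norm.

    Assumed form of Equation 6:
        bits_i = group_bits * Norm(i) / Energy(G(k))
    where Energy(G(k)) is the sum of the norms in the group (Equation 1).
    Assumes the group energy is positive.
    """
    energy = sum(group_norms)
    return [group_bits * norm / energy for norm in group_norms]


# One dominant group: six subbands with the peak in the fourth position.
subband_bits = distribute_subband_bits(110.5, [5, 6, 8, 10, 8, 6])
```

The subband with the largest norm coefficient, that is, the peak of the group, receives the largest share, and the shares add up to the bits of the group.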
(52) Next, a grouping method will be described using
(53) Dominant group determining sections 302-1 to 302-N determine, as an identical dominant group, the subbands that are centered on dominant frequency bands 9 and 20 and extend on both sides up to the subbands where the descending slope of the norm coefficient value stops. In examples in
(54) Non-dominant group determining section 303 determines continuous frequency bands other than the dominant groups as non-dominant groups without the dominant frequency bands. In the example in
(55) As a result, the quantized spectral envelopes are split into five groups, that is, two dominant groups (groups 2 and 4) and three non-dominant groups (groups 1, 3 and 5).
(56) Using such a grouping method, it is possible to adaptively determine group widths according to input signal characteristics. According to this method, the speech/audio decoding apparatus also uses available quantized norm coefficients, and therefore additional information need not be transmitted to the speech/audio decoding apparatus.
(57) Note that norm variance calculation section 306 calculates a group-specific norm variance. In the examples in
(58) Next, the perceptual importance will be described. A spectrum of a speech/audio signal generally includes a plurality of peaks (mountains) and valleys. A peak is made up of a spectrum component located at a dominant frequency of the speech/audio signal (dominant sound component). The peak is perceptually very important. The perceptual importance of the peak can be determined by the difference between the energy of the peak and the energy of the valley, that is, by a norm variance. Theoretically, when a peak has sufficiently large energy compared to neighboring frequency bands, the peak should be encoded with a sufficient number of bits. If the peak is encoded with an insufficient number of bits, the coding noise that mixes in becomes conspicuous, causing sound quality to deteriorate. On the other hand, a valley is not made up of any dominant sound component of a speech/audio signal and is perceptually not important.
(59) According to the frequency band grouping method of the present embodiment, a dominant frequency band corresponds to a peak of a spectrum and grouping frequency bands means separating peaks (dominant groups including dominant frequency bands) from valleys (non-dominant groups without dominant frequency bands).
(60) Group bit distribution section 308 determines perceptual importance of a peak. In contrast to the G.719 technique in which perceptual importance is determined only by energy, the present embodiment determines perceptual importance based on both energy and norm (energy) distributions and determines bits to be distributed to each group based on the determined perceptual importance.
(61) In subband bit distribution section 309, a large norm variance in a group means that the group is one of the peaks; the peak is perceptually more important, and the norm coefficient having the maximum value should be accurately encoded. For this reason, more bits are distributed to each subband of this peak. On the other hand, a very small norm variance in a group means that the group is one of the valleys; the valley is perceptually not important and need not be accurately encoded. For this reason, fewer bits are distributed to each subband of this group.
(62) Thus, the present embodiment identifies a dominant frequency band in which a norm coefficient value in a spectrum of an input speech/audio signal has a local maximum value, groups all subbands into dominant groups including a dominant frequency band and non-dominant groups not including any dominant frequency band, distributes bits to each group based on group-specific energy and norm variances, and further distributes the bits distributed on a group-by-group basis to each subband according to a ratio of a norm to energy of each group. In this way, it is possible to allocate more bits to perceptually important groups and subbands and perform an efficient bit distribution. As a result, sound quality can be improved.
(63) Note that the norm coefficient in the present embodiment represents subband energy and is also referred to as “energy envelope.”
(64) The disclosure of Japanese Patent Application No. 2012-272571, filed on Dec. 13, 2012, including the specification, drawings and abstract is incorporated herein by reference in its entirety.
INDUSTRIAL APPLICABILITY
(65) The speech/audio coding apparatus, speech/audio decoding apparatus, speech/audio coding method and speech/audio decoding method according to the present invention are applicable to a radio communication terminal apparatus, radio communication base station apparatus, telephone conference terminal apparatus, video conference terminal apparatus and voice over Internet protocol (VoIP) terminal apparatus or the like.
REFERENCE SIGNS LIST
(66) 101 Transient detector
102 Transformation section
103 Norm estimation section
104 Norm quantization section
105 Spectrum normalization section
106, 203 Norm adjustment section
107, 204 Bit allocation section
108 Lattice-vector coding section
109 Noise level adjustment section
110 Multiplexer
201 Demultiplexer
202 Norm de-quantization section
205 Lattice decoding section
206 Spectral-fill generator
207 Adder
208 Envelope shaping section
209 Inverse transformation section
301 Dominant frequency band identification section
302-1 to 302-N Dominant group determining section
303 Non-dominant group determining section
304 Group energy calculation section
305 Total energy calculation section
306 Norm variance calculation section
307 Total norm variance calculation section
308 Group bit distribution section
309 Subband bit distribution section