Speech model parameter estimation and quantization
11715477 · 2023-08-01
Assignee
Inventors
CPC classification: G10L19/08 (PHYSICS); G10L19/087 (PHYSICS)
International classification
Abstract
Quantizing speech model parameters includes, for each of multiple vectors of quantized excitation strength parameters, determining first and second errors between first and second elements of a vector of excitation strength parameters and, respectively, first and second elements of the vector of quantized excitation strength parameters, and determining a first energy and a second energy associated with, respectively, the first and second errors. First and second weights for, respectively, the first error and the second error, are determined and are used to produce first and second weighted errors, which are combined to produce a total error. The total errors of each of the multiple vectors of quantized excitation strength parameters are compared and the vector of quantized excitation strength parameters that produces the smallest total error is selected to represent the vector of excitation strength parameters.
Claims
1. A method of quantizing speech model parameters, the method comprising: for each of multiple vectors of quantized excitation strength parameters: determining a first error between a first element of a vector of excitation strength parameters and a first element of the vector of quantized excitation strength parameters, determining a second error between a second element of the vector of excitation strength parameters and a second element of the vector of quantized excitation strength parameters, determining a first energy associated with the first error and a second energy associated with the second error, determining a first weight for the first error and a second weight for the second error such that, when the first energy is larger than the second energy, the ratio of the first weight to the second weight is less than the ratio of the first energy to the second energy, and, when the second energy is larger than the first energy, the ratio of the second weight to the first weight is less than the ratio of the second energy to the first energy, weighting the first error using the first weight to produce a first weighted error and weighting the second error using the second weight to produce a second weighted error, and combining the first weighted error and the second weighted error to produce a total error, comparing the total errors of each of the multiple vectors of quantized excitation strength parameters; and selecting the vector of quantized excitation strength parameters that produces the smallest total error to represent the vector of excitation strength parameters.
2. The method of claim 1, wherein determining the first weight and the second weight include applying a nonlinearity to the first energy and the second energy, respectively.
3. The method of claim 2, wherein the nonlinearity is a power function with an exponent between zero and one.
4. The method of claim 1, wherein the first element of the vector of excitation strength parameters corresponds to an associated frequency band and time interval, and the first weight depends on an energy of the associated frequency band and time interval and an energy of at least one other frequency band or time interval.
5. The method of claim 4, further comprising increasing the first weight when an excitation strength is different between the associated frequency band and time interval and the at least one other frequency band or time interval.
6. The method of claim 1, wherein the vector of excitation strength parameters includes a voiced strength/pulsed strength pair, and the first weight is selected such that the error between a high voiced strength/low pulsed strength pair and a quantized low voiced strength/high pulsed strength pair is less than the error between the high voiced strength/low pulsed strength pair and a quantized low voiced strength/low pulsed strength pair.
7. The method of claim 1, wherein the vector of excitation strength parameters corresponds to a MBE speech model.
8. A method of estimating speech model parameters from a digitized speech signal, the method comprising: dividing the digitized speech signal into two or more frequency band signals; determining a first preliminary excitation parameter using a first method that includes performing a nonlinear operation on at least two of the frequency band signals to produce at least two modified frequency band signals, determining weights to apply to the at least two modified frequency band signals, and determining the first preliminary excitation parameter using a first weighted combination of the at least two modified frequency band signals; determining a second preliminary excitation parameter by applying weights corresponding to the weights determined in the first method to the at least two of the frequency band signals to form a second weighted combination of at least two frequency band signals and using a second method different from the first method to determine the second preliminary excitation parameter from the second weighted combination; and using the first and second preliminary excitation parameters to determine an excitation parameter for the digitized speech signal.
9. The method of claim 8, wherein determining the weights includes examining estimated background noise energy.
10. The method of claim 8, further comprising determining a third preliminary excitation parameter by comparing energy near a peak frequency to total energy and using the first, second and third preliminary excitation parameters to determine the excitation parameter for the digitized speech signal.
11. The method of claim 10, wherein the peak frequency is determined after excluding frequencies below a threshold level.
12. The method of claim 8, further comprising determining a third preliminary excitation parameter using a measure of periodicity over less than the full bandwidth of the digitized speech signal and using the first, second and third preliminary excitation parameters to determine the excitation parameter for the digitized speech signal.
13. The method of claim 8, further comprising determining a fundamental frequency for the digitized speech signal.
14. The method of claim 13, further comprising determining a target frequency based on previous fundamental frequency estimates.
15. The method of claim 14, further comprising selecting a subharmonic of a current fundamental frequency based on proximity to the target frequency.
16. The method of claim 8, wherein the first preliminary excitation parameter is a fundamental frequency estimate.
17. The method of claim 16, wherein the fundamental frequency estimate is determined by evaluating parameters for at least a first fundamental frequency estimate and a second fundamental frequency estimate.
18. The method of claim 17, further comprising comparing a ratio of the parameter for the second fundamental frequency estimate to the parameter for the first fundamental frequency estimate to a sequence of two or more threshold parameters.
19. The method of claim 18, wherein success for a comparison results in additional parameter tests and failure results in comparing the ratio to the next threshold parameter in the sequence.
20. The method of claim 19, wherein failure of the additional parameter tests also results in comparing the ratio to the next threshold parameter in the sequence.
21. The method of claim 8, wherein the excitation parameter corresponds to a MBE speech model.
22. A speech coder configured to quantize speech model parameters, the speech coder being operable to: for each of multiple vectors of quantized excitation strength parameters: determine a first error between a first element of a vector of excitation strength parameters and a first element of the vector of quantized excitation strength parameters, determine a second error between a second element of the vector of excitation strength parameters and a second element of the vector of quantized excitation strength parameters, determine a first energy associated with the first error and a second energy associated with the second error, determine a first weight for the first error and a second weight for the second error such that, when the first energy is larger than the second energy, the ratio of the first weight to the second weight is less than the ratio of the first energy to the second energy, and, when the second energy is larger than the first energy, the ratio of the second weight to the first weight is less than the ratio of the second energy to the first energy, weight the first error using the first weight to produce a first weighted error and weight the second error using the second weight to produce a second weighted error, and combine the first weighted error and the second weighted error to produce a total error; compare the total errors of each of the multiple vectors of quantized excitation strength parameters; and select the vector of quantized excitation strength parameters that produces the smallest total error to represent the vector of excitation strength parameters.
23. The speech coder of claim 22, wherein the speech coder is operable to determine the first weight and the second weight by applying a nonlinearity to the first energy and the second energy, respectively.
24. The speech coder of claim 23, wherein the nonlinearity is a power function with an exponent between zero and one.
25. The speech coder of claim 22, wherein the first element of the vector of excitation strength parameters corresponds to an associated frequency band and time interval, and the first weight depends on an energy of the associated frequency band and time interval and an energy of at least one other frequency band or time interval.
26. The speech coder of claim 25, wherein the speech coder is further operable to increase the first weight when an excitation strength is different between the associated frequency band and time interval and the at least one other frequency band or time interval.
27. The speech coder of claim 22, wherein the vector of excitation strength parameters includes a voiced strength/pulsed strength pair, and the speech coder is operable to select the first weight such that the error between a high voiced strength/low pulsed strength pair and a quantized low voiced strength/high pulsed strength pair is less than the error between the high voiced strength/low pulsed strength pair and a quantized low voiced strength/low pulsed strength pair.
28. The speech coder of claim 22, wherein the vector of excitation strength parameters corresponds to a MBE speech model.
29. A handset or mobile radio including the speech coder of claim 22.
30. A base station or console including the speech coder of claim 22.
Description
DESCRIPTION OF DRAWINGS
DETAILED DESCRIPTION
(8) As discussed below, techniques are provided for improving speech coding and compression techniques that rely on quantization to encode speech in a way that permits the output of high quality speech even when faced with reduced transmission bandwidth or storage constraints. The techniques may be implemented with software. For example, the techniques may be incorporated in a vocoder that is implemented by, for example, a mobile radio or a cellular telephone.
(9) Vocoders typically model speech over a short interval of time as the response of a system excited by some form of excitation. Typically, an input signal s.sub.0(n) is obtained by sampling an analog input signal. For applications such as speech coding or speech recognition, the sampling rate typically ranges between 6 kHz and 48 kHz. In general, the excitation model works well for any sampling rate, with corresponding changes in the associated parameters. To focus on a short interval centered at time t, the input signal s.sub.0(n) is typically multiplied by a window w(t,n) centered at time t to obtain a windowed signal s(t,n). The window used is typically a Hamming window or Kaiser window and may be time invariant so that w(t,n)=w.sub.0(n−t) or may have characteristics which change as a function of time. The length of the window w(t,n) typically ranges between 5 ms and 40 ms. The windowed signal s(t,n) may be computed at center times of t.sub.0, t.sub.1, . . . , t.sub.m, t.sub.m+1, . . . . Typically, the interval between consecutive center times, t.sub.m+1−t.sub.m, approximates the effective length of the window w(t,n) used for these center times. The windowed signal s(t,n) for a particular center time may be referred to as a segment or frame of the input signal.
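The windowing described above can be sketched as follows. This is a minimal illustration, not part of the patent: the 20 ms frame length, the step size equal to the frame length, and the use of a Hamming window are example choices within the typical ranges given in the text.

```python
import numpy as np

def frame_signal(s0, frame_len=160, step=160):
    """Split a sampled signal into windowed segments s(t, n).

    A Hamming window w0(n - t) centered at each frame time is applied,
    matching the time-invariant case w(t, n) = w0(n - t) described above.
    frame_len = 160 samples is 20 ms at an 8 kHz sampling rate.
    """
    w = np.hamming(frame_len)
    frames = []
    for start in range(0, len(s0) - frame_len + 1, step):
        frames.append(w * s0[start:start + frame_len])
    return np.array(frames)

# Example: 1 second of a 200 Hz tone sampled at 8 kHz
s0 = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)
frames = frame_signal(s0)
```

Taking the Fourier transform of each row of `frames` would give the Short-Time Fourier Transform S(t,ω) referred to below.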
(10) For each segment of the input signal, system parameters and excitation parameters are determined. The system parameters typically model the spectral envelope or the impulse response of the system. The excitation parameters typically include a fundamental frequency (or pitch period) and a voiced/unvoiced (V/UV) parameter which indicates whether the input signal has pitch (or indicates the degree to which the input signal has pitch). For vocoders such as MBE, IMBE, and AMBE, the input signal is divided into frequency bands and the excitation parameters may also include a V/UV decision for each frequency band. High quality speech reproduction may be provided using a high quality speech model, accurate estimation of the speech model parameters, and high quality synthesis methods.
(11) The Fourier transform of the windowed signal s(t,n) may be denoted by S(t,ω) and may be referred to as the signal Short-Time Fourier Transform (STFT). If s(n) is a periodic signal with a fundamental frequency ω.sub.0 or pitch period n.sub.0, the parameters ω.sub.0 and n.sub.0 are related to each other by 2π/ω.sub.0=n.sub.0. Non-integer values of the pitch period n.sub.0 are often used in practice.
(12) A speech signal s.sub.0(n) may be divided into multiple frequency bands using bandpass filters. Characteristics of these bandpass filters are allowed to change as a function of time and/or frequency. A speech signal may also be divided into multiple bands by applying frequency windows or weightings to the speech signal STFT S(t,ω).
(13) Referring to
(14) The voiced strength V(t,ω), unvoiced strength U(t,ω), and pulsed strength P(t,ω) parameters control the proportion of quasi-periodic, noise-like, and pulsed signals in each frequency band. These parameters are functions of time (t) and frequency (ω). The voiced strength parameter V(t,ω) may vary between zero, which indicates that there is no voiced signal at time t and frequency ω, and one, which indicates that the signal at time t and frequency ω is entirely voiced. The unvoiced strength and pulsed strength parameters provide similar indications. The excitation strength parameters may be constrained in the speech synthesis system so that they sum to one (i.e., V(t,ω)+U(t,ω)+P(t,ω)=1).
(15) The vector of parameters v(t,ω) associated with the voiced strength parameter V(t,ω) includes voiced excitation parameters and voiced system parameters. The voiced excitation parameters may include a time and frequency dependent fundamental frequency ω.sub.0(t,ω) (or equivalently a pitch period n.sub.0(t,ω)).
(16) The vector of parameters u(t,ω) associated with the unvoiced strength parameter U(t,ω) includes unvoiced excitation parameters and unvoiced system parameters. The unvoiced excitation parameters may include, for example, statistics and energy distribution.
(17) The vector of parameters p(t,ω) associated with the pulsed excitation strength parameter P(t,ω) includes pulsed excitation parameters and pulsed system parameters. The pulsed excitation parameters may include one or more pulse positions n.sub.0(t,ω) and amplitudes.
(18) Referring to
(19) Analysis units 210, 215, and 220 may use the analysis methods disclosed in U.S. Pat. No. 6,912,495. Voiced strength analysis generally involves determining how periodic the signal is in a frequency band and time interval. Pulsed strength analysis involves determining how pulse-like the signal is in a frequency band and time interval. The time interval for pulsed strength analysis is generally the frame length. For voiced strength analysis, a longer time interval is generally used to span multiple periods for low fundamental frequencies. So, for low fundamental frequencies it is possible to have periodic pulses over the voiced analysis time interval but only a single pulse in the pulsed analysis time interval. Consequently, it is possible for the analysis system to produce a high pulsed strength estimate and a high voiced strength estimate for the same frequency band and center time.
(20) Referring to
(21) One implementation uses a weighted vector quantizer to jointly quantize the strength parameters from two adjacent frames using 7 bits. The strength parameters are divided into 8 frequency bands. Typical band edges for these 8 frequency bands for an 8 kHz sampling rate are 0 Hz, 375 Hz, 875 Hz, 1375 Hz, 1875 Hz, 2375 Hz, 2875 Hz, 3375 Hz, and 4000 Hz. The codebook for the vector quantizer contains 128 entries consisting of 16 quantized strength parameters for the 8 frequency bands of two adjacent frames. For each codebook index m, the error is evaluated using
(22)
E.sub.m=Σ.sub.n=0.sup.1Σ.sub.k=0.sup.7α(t.sub.n,ω.sub.k)E.sub.m(t.sub.n,ω.sub.k), (1)
(23) where
E.sub.m(t.sub.n,ω.sub.k)=max{(V(t.sub.n,ω.sub.k)−{hacek over (V)}.sub.m(t.sub.n,ω.sub.k)).sup.2,(1−{hacek over (V)}.sub.m(t.sub.n,ω.sub.k))(P(t.sub.n,ω.sub.k)−{hacek over (P)}.sub.m(t.sub.n,ω.sub.k)).sup.2}, (2)
(24) α(t.sub.n,ω.sub.k) is a frequency and time dependent weighting typically set to the energy in the speech transform S(t,ω) around time t.sub.n and frequency ω.sub.k, max(a,b) evaluates to the maximum of a or b, and {hacek over (V)}.sub.m(t.sub.n,ω.sub.k) and {hacek over (P)}.sub.m(t.sub.n,ω.sub.k) are the quantized voiced strength and quantized pulsed strength. The error E.sub.m of Equation (1) is computed for each codebook index m, and the codebook index which minimizes E.sub.m is selected. To reduce storage in the codebook, the entries are quantized so that, for a particular frequency band and time index, a value of zero is used for entirely unvoiced, one is used for entirely voiced, and two is used for entirely pulsed. The quantized strength pair ({hacek over (V)}.sub.m(t.sub.n,ω.sub.k), {hacek over (P)}.sub.m(t.sub.n,ω.sub.k)) has the values (0, 0) for unvoiced, (1, 0) for voiced, and (0, 1) for pulsed.
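The codebook search just described can be sketched as follows, using the per-band error of Equation (2) and an energy weighting α. The codebook contents here are random placeholders standing in for a trained 128-entry codebook; the shapes (2 frames × 8 bands, 128 entries) follow the text.

```python
import numpy as np

def band_error(V, P, Vq, Pq):
    """Per-band error of Equation (2): the larger of the voiced-strength
    error and the pulsed-strength error, the latter gated by (1 - Vq)."""
    return np.maximum((V - Vq) ** 2, (1.0 - Vq) * (P - Pq) ** 2)

def quantize_strengths(V, P, codebook_V, codebook_P, alpha):
    """Select the codebook index minimizing the weighted total error.
    V, P, alpha: arrays of shape (2 frames, 8 bands);
    codebook_V / codebook_P: arrays of shape (128, 2, 8)."""
    errors = (alpha * band_error(V, P, codebook_V, codebook_P)).sum(axis=(1, 2))
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
# Placeholder codebook: each band/time entry is unvoiced (0,0),
# voiced (1,0), or pulsed (0,1), as in the storage scheme above.
kind = rng.integers(0, 3, size=(128, 2, 8))
cb_V = (kind == 1).astype(float)
cb_P = (kind == 2).astype(float)
V = rng.random((2, 8))
P = rng.random((2, 8)) * (1.0 - V)   # keep strengths summing to at most one
alpha = np.ones((2, 8))              # uniform weighting for the example
m = quantize_strengths(V, P, cb_V, cb_P, alpha)
```

In the actual coder, `alpha` would be set from the speech-transform energy (or, as described later, from a nonlinear function of it).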
(25) In another approach disclosed in U.S. Pat. No. 6,912,495, the error E.sub.m(t.sub.n,ω.sub.k) of Equation (2) is replaced by
E.sub.m(t.sub.n,ω.sub.k)=γ.sub.m(t.sub.n,ω.sub.k)+β(1−{hacek over (V)}.sub.m(t.sub.n,ω.sub.k))(1−γ.sub.m(t.sub.n,ω.sub.k))(P(t.sub.n,ω.sub.k)−{hacek over (P)}.sub.m(t.sub.n,ω.sub.k)).sup.2, (3)
where
γ.sub.m(t.sub.n,ω.sub.k)=(V(t.sub.n,ω.sub.k)−{hacek over (V)}.sub.m(t.sub.n,ω.sub.k)).sup.2
(26) and β is typically set to a constant of 0.5.
(27) Listening tests of speech coding systems implemented using the methods disclosed in U.S. Pat. No. 6,912,495 indicate that quality may be increased while maintaining the same coding rate by improving on the error criteria in Equations (2) and (3). One aspect of these error criteria which may be improved relates to their behavior for quantizing a voiced strength, pulsed strength pair that has high voiced strength and low pulsed strength. When the error E.sub.m(t.sub.n,ω.sub.k) of Equation (2) is evaluated for an unvoiced element in the codebook, it simplifies to
E.sub.U(t.sub.n,ω.sub.k)=max[V(t.sub.n,ω.sub.k).sup.2,P(t.sub.n,ω.sub.k).sup.2]. (4)
(28) When the error E.sub.m(t.sub.n,ω.sub.k) of Equation (2) is evaluated for a pulsed element in the codebook, it simplifies to
E.sub.p(t.sub.n,ω.sub.k)=max[V(t.sub.n,ω.sub.k).sup.2,(1−P(t.sub.n,ω.sub.k)).sup.2]. (5)
(29) Comparing these two errors leads to
E.sub.U(t.sub.n,ω.sub.k)≤E.sub.p(t.sub.n,ω.sub.k),if P(t.sub.n,ω.sub.k)≤½. (6)
(30) So, there is no preference for a pulsed element in the codebook over an unvoiced element in the codebook for low pulsed strength (P(t.sub.n,ω.sub.k)≤½).
(31) Similarly, when the error E.sub.m(t.sub.n,ω.sub.k) of Equation (3) is evaluated for an unvoiced element in the codebook, it simplifies to
E.sub.U(t.sub.n,ω.sub.k)=V(t.sub.n,ω.sub.k).sup.2+β(1−V(t.sub.n,ω.sub.k).sup.2)P(t.sub.n,ω.sub.k).sup.2. (7)
(32) When the error E.sub.m(t.sub.n,ω.sub.k) of Equation (3) is evaluated for a pulsed element in the codebook, it simplifies to
E.sub.p(t.sub.n,ω.sub.k)=V(t.sub.n,ω.sub.k).sup.2+β(1−V(t.sub.n,ω.sub.k).sup.2)(1−P(t.sub.n,ω.sub.k)).sup.2. (8)
(33) When β<0, unvoiced elements are preferred over pulsed elements for high pulsed strengths so this is not a useful operating region. When β≥0, comparing these two errors leads to
E.sub.U(t.sub.n,ω.sub.k)≤E.sub.p(t.sub.n,ω.sub.k),if P(t.sub.n,ω.sub.k)≤½. (9)
(34) So, there is no preference for a pulsed element in the codebook over an unvoiced element in the codebook for low pulsed strength (P(t.sub.n,ω.sub.k)≤½).
(35) Listening tests indicate that preferring pulsed elements over unvoiced elements when voiced strength is high and pulsed strength is low improves the quality of the synthesized speech especially when the fundamental frequency is low. Based on these listening tests, an improved error criterion may be introduced:
E.sub.m(t.sub.n,ω.sub.k)={hacek over (V)}.sub.m(t.sub.n,ω.sub.k)E.sub.v(t.sub.n,ω.sub.k)+{hacek over (P)}.sub.m(t.sub.n,ω.sub.k)E.sub.p(t.sub.n,ω.sub.k)+{hacek over (U)}.sub.m(t.sub.n,ω.sub.k)E.sub.u(t.sub.n,ω.sub.k), (10)
where
{hacek over (U)}.sub.m(t.sub.n,ω.sub.k)=(1−{hacek over (V)}.sub.m(t.sub.n,ω.sub.k))(1−{hacek over (P)}.sub.m(t.sub.n,ω.sub.k)), (11)
E.sub.v(t.sub.n,ω.sub.k)=1−max(V.sub.m(t.sub.n,ω.sub.k),μP.sub.m(t.sub.n,ω.sub.k)), (12)
E.sub.p(t.sub.n,ω.sub.k)=1−max(ξV.sub.m(t.sub.n,ω.sub.k),P.sub.m(t.sub.n,ω.sub.k)), (13)
E.sub.u(t.sub.n,ω.sub.k)=max(V.sub.m(t.sub.n,ω.sub.k),P.sub.m(t.sub.n,ω.sub.k)), (14)
μ=A min(1,ω.sub.c/ω.sub.0), (15)
ξ=B min(1,ω.sub.c/ω.sub.0). (16)
A is typically set to a constant of 0.8, B is typically set to a constant of 0.7, and ω.sub.c is typically set to a constant of 2π/S, where S is the number of samples in a synthesis frame, which is typically about 80 for a sampling rate of 8 kHz, and the function min(a,b) evaluates to the minimum of a or b. When the novel error criterion E.sub.m(t.sub.n,ω.sub.k) of Equation (10) is evaluated for a pulsed element in the codebook, it simplifies to E.sub.p(t.sub.n,ω.sub.k) of Equation (13). When it is evaluated for an unvoiced element in the codebook, it simplifies to E.sub.u(t.sub.n,ω.sub.k) of Equation (14). So, a pulsed element is preferred over an unvoiced element for low pulsed strength and high voiced strength (V.sub.m(t.sub.n,ω.sub.k)>1/(1+ξ)). The threshold 1/(1+ξ) is ½ for fundamentals at or below the cutoff frequency ω.sub.c and approaches 1 as the fundamental increases above the cutoff. So, this error criterion achieves the behavior favored in listening tests.
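The improved error criterion of Equations (10)–(16) can be sketched directly for one band and time; the constants are the typical values given above, and the final comparison illustrates the preference for a pulsed codebook element over an unvoiced one at high voiced strength, low pulsed strength, and a low fundamental.

```python
import math

A, B = 0.8, 0.7          # typical constants from the text
S = 80                   # samples per synthesis frame at 8 kHz
w_c = 2 * math.pi / S    # cutoff frequency of Equations (15) and (16)

def improved_error(V, P, Vq, Pq, w0):
    """Error criterion of Equations (10)-(16) for one band/time.
    (V, P): estimated voiced/pulsed strengths; (Vq, Pq): quantized pair
    with (0,0) unvoiced, (1,0) voiced, (0,1) pulsed; w0: fundamental."""
    mu = A * min(1.0, w_c / w0)          # Equation (15)
    xi = B * min(1.0, w_c / w0)          # Equation (16)
    Ev = 1.0 - max(V, mu * P)            # Equation (12)
    Ep = 1.0 - max(xi * V, P)            # Equation (13)
    Eu = max(V, P)                       # Equation (14)
    Uq = (1.0 - Vq) * (1.0 - Pq)         # Equation (11)
    return Vq * Ev + Pq * Ep + Uq * Eu   # Equation (10)

# High voiced strength, low pulsed strength, fundamental below cutoff:
w0 = w_c / 2
e_pulsed = improved_error(0.9, 0.1, 0.0, 1.0, w0)
e_unvoiced = improved_error(0.9, 0.1, 0.0, 0.0, w0)
assert e_pulsed < e_unvoiced   # pulsed entry preferred, as intended
```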
(36) Listening tests of speech coding systems implemented using the methods disclosed in U.S. Pat. No. 6,912,495 indicate that quality may also be increased while maintaining the same coding rate by improving the frequency and time dependent weighting α(t.sub.n,ω.sub.k) in the error criterion of Equation (1). Listening tests indicate that setting the weights α(t.sub.n,ω.sub.k) to the energy e(t.sub.n,ω.sub.k) in the speech transform S(t,ω) around time t.sub.n, and frequency ω.sub.k tends to overweight higher energy regions relative to lower energy regions. This issue is more of a problem when smaller codebooks are used at lower bit rates.
(37) One method of reducing the weighting of a high energy region relative to a lower energy region is to set the weights α(t.sub.n,ω.sub.k) to a nonlinear function λ( ) of the energy e(t.sub.n,ω.sub.k):
α(t.sub.n,ω.sub.k)=λ(e(t.sub.n,ω.sub.k)), (17)
where the nonlinear function has the property
(38)
λ(x)/λ(y)<x/y, for x>y>0. (18)
(39) One set of nonlinear functions which satisfy the property of Equation (18) are the power functions with exponent between 0 and 1
λ(x)=x.sup.p,0<p<1. (19)
In one implementation, the power function exponent p is set to ½.
(40) In another implementation, the nonlinearity may not be applied to every frame. Typically, the nonlinearity of Equation (17) provides better quality when the energy at low frequencies is much higher than the energy at high frequencies. So, much of the quality improvement may be gained by applying the nonlinearity only when the ratio of the energy at low frequencies to the energy at high frequencies is above a threshold. For example, in one implementation, the threshold is 10. The range of low frequencies may be 0-1000 Hz and the range of high frequencies may be 1000-4000 Hz.
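The conditional weighting of Equations (17)–(19) can be sketched as follows; the exponent ½ and the threshold of 10 are the typical values given above, while the assignment of the first two of the 8 bands to the "low frequency" range is an illustrative assumption for the 0-1000 Hz region at an 8 kHz sampling rate.

```python
import numpy as np

def band_weights(e, p=0.5, ratio_threshold=10.0, low_bands=2):
    """Weights alpha of Equation (17): apply the power-function
    nonlinearity of Equation (19) only when low-frequency energy
    dominates high-frequency energy, as described above."""
    e = np.asarray(e, dtype=float)
    low = e[:low_bands].sum()
    high = e[low_bands:].sum()
    if high > 0 and low / high > ratio_threshold:
        return e ** p        # compress the dynamic range of the weights
    return e.copy()          # otherwise use the plain band energies

e = np.array([400.0, 100.0, 10.0, 8.0, 6.0, 4.0, 2.0, 1.0])
w = band_weights(e)   # low/high = 500/31 > 10, so the square root applies
```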
(41) Referring to
(42) Listening tests indicate that quality may be further improved by including models of auditory system behavior in the weight generation unit. Referring to
(43) The band masking matrix employed by the matrix multiply unit 510 models the frequency masking effects of the auditory system. The auditory system may be modeled as a filter bank consisting of band pass filters. Frequency masking experiments generally measure whether a band pass target signal at a target frequency and level is audible in the presence of a band pass masking signal at a masking frequency and level. The bandwidth of the auditory filters increases as the center frequency increases. In order to treat masking effects in a more uniform manner, it is useful to transform the frequency f in Hz to the frequency ∈ in units of the Equivalent Rectangular Bandwidth Scale (ERBS):
∈=21.4*log.sub.10(1+0.00437f). (20)
The frequency ∈ of Equation (20) is an approximation to the number of equivalent rectangular bandwidths below the frequency f. One implementation of the band masking matrix is
(44)
(45) where ∈.sub.d is the difference between the target frequency ∈.sub.j and the masking frequency ∈.sub.k, P is the peak masking (typically a constant of 0.1122), ∈.sub.p is the positive extent of the mask peak (typically a constant of 1.0), ∈.sub.n is the negative extent of the mask peak (typically a constant of 0.2), Ω.sub.p (typically a constant of 0.5) is the slope of the mask for frequencies above ∈.sub.p, and δ.sub.n (typically a constant of 0.25) is the slope of the mask for frequencies below ∈.sub.n. Typical target and masking frequencies for an 8 band implementation sampled at 8 kHz are 125 Hz, 625 Hz, 1125 Hz, 1625 Hz, 2125 Hz, 2625 Hz, 3125 Hz, and 3625 Hz. These frequencies are transformed to the ERBS scale using Equation (20) to produce ∈.sub.j and ∈.sub.k.
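The ERBS transform of Equation (20) applied to the 8 band-center frequencies listed above can be sketched as:

```python
import math

def erbs(f_hz):
    """Equation (20): frequency in Hz to Equivalent Rectangular
    Bandwidth Scale units (approximate number of ERBs below f)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

# Target/masking band-center frequencies for the 8-band, 8 kHz case
centers = [125, 625, 1125, 1625, 2125, 2625, 3125, 3625]
eps = [erbs(f) for f in centers]   # the eps_j / eps_k values
```

The differences between entries of `eps` give the ∈.sub.d values used in the band masking matrix.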
(46) The band masking matrix of Equation (21) may be normalized to make the response more uniform as a function of frequency band:
(47)
(48) Listening tests for band-pass-filtered masks and target signals with unvoiced, voiced, or pulsed excitation characteristics indicate that mask levels are reduced when mask and target signals have different excitation types when compared to mask levels when mask and target signals have the same type. In addition, listening tests indicate that mask levels are reduced for low fundamental frequencies relative to high fundamental frequencies when one signal is voiced and the other is unvoiced. In one implementation, masks are corrected to address these issues as follows:
m.sub.jk=1−max((1−a)|V(t.sub.n,ω.sub.k)−V(t.sub.n,ω.sub.j)|,(1−b)|P(t.sub.n,ω.sub.k)−P(t.sub.n,ω.sub.j)|) (23)
where
a=c.sub.0(f.sub.0−f.sub.1)+c.sub.1, (24)
b is typically a constant of 0.316, f.sub.0 is the estimated fundamental frequency in Hz, f.sub.1 is typically a constant of 125 Hz, c.sub.0 is typically a constant of 0.001145, and c.sub.1 is typically a constant of 0.316. These mask corrections may be applied to the band masking matrix of Equation (22) to produce an improved band masking matrix
M.sub.jk=m.sub.jkM.sub.jk. (25)
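The mask correction of Equations (23) and (24) can be sketched for a single pair of bands j and k; the default constants are the typical values given above.

```python
def mask_correction(Vk, Vj, Pk, Pj, f0,
                    f1=125.0, c0=0.001145, c1=0.316, b=0.316):
    """Mask correction m_jk of Equations (23) and (24): reduce the mask
    level when bands j and k have different excitation types, with a
    fundamental-frequency-dependent factor a for the voiced term."""
    a = c0 * (f0 - f1) + c1                       # Equation (24)
    return 1.0 - max((1.0 - a) * abs(Vk - Vj),    # Equation (23)
                     (1.0 - b) * abs(Pk - Pj))

# Same excitation type in both bands: no correction (m_jk = 1)
same = mask_correction(1.0, 1.0, 0.0, 0.0, f0=125.0)
# Voiced band masking an unvoiced band at a low fundamental: mask reduced
mixed = mask_correction(1.0, 0.0, 0.0, 0.0, f0=125.0)
```

At f.sub.0 = f.sub.1 = 125 Hz, a = c.sub.1, so the voiced/unvoiced correction reaches its strongest reduction, matching the observation that mask levels are reduced for low fundamentals.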
(49) The masking matrix may be applied to the output of nonlinear operation unit 505 λ(e(t.sub.n,ω.sub.k)) with a traditional matrix multiply:
μ.sub.j=Σ.sub.k=0.sup.7M.sub.jkλ(e(t.sub.n,ω.sub.k)),j=0,1, . . . ,7, (26)
(50) where μ.sub.j is the output masking level of unit 510 for band j.
(51) The nonlinear operation unit 515 applies the same nonlinearity as the nonlinear operation unit 505 to an estimate of the background noise energy in each band. The background noise energy estimate may be obtained using known methods such as those disclosed in U.S. Pat. No. 4,630,304 titled “Automatic Background Noise Estimator for a Noise Suppression System,” which is incorporated by reference. The multiply unit 520 multiplies a time decay factor with a typical value of 0.4 by a delayed version of the output of the combine unit 525. The delay unit 530 has a typical delay of 10 ms. The combine unit 525 typically takes the maximum of its inputs to produce its output. The signal to mask ratio unit 535 divides the output of the nonlinear operation unit 505 by the output of the combine unit 525. The nonlinear operation unit 540 limits its output between a typical minimum of 0.001 and a typical maximum of 8.91. The weights α(t.sub.n,ω.sub.k) of Equation (1) may be set to the output of weight generation unit 500 and used to find the best codebook index.
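The weight generation pipeline of units 505–540 can be sketched end to end. This is a structural sketch only: the masking matrix values are left to the caller (an identity-scaled placeholder is used in the example), and the previous combined mask level is passed in as explicit state in place of the 10 ms delay unit. The decay factor 0.4 and the 0.001/8.91 output limits are the typical values given above.

```python
import numpy as np

def generate_weights(energy, noise, M, state, decay=0.4,
                     p=0.5, lo=0.001, hi=8.91):
    """Sketch of weight generation unit 500.
    energy, noise: per-band speech and background-noise energies (8,);
    M: 8x8 band masking matrix; state: previous combined mask levels."""
    le = energy ** p                    # nonlinear operation unit 505
    mask = M @ le                       # matrix multiply unit 510
    ln = noise ** p                     # nonlinear operation unit 515
    decayed = decay * state             # multiply unit 520 + delay unit 530
    combined = np.maximum.reduce([mask, ln, decayed])  # combine unit 525
    alpha = np.clip(le / combined, lo, hi)  # units 535 (ratio) and 540 (limit)
    return alpha, combined              # `combined` feeds the next frame

e = np.full(8, 4.0)
n = np.full(8, 0.01)
M = np.eye(8) * 0.1122                  # placeholder: peak masking only
alpha, state = generate_weights(e, n, M, state=np.zeros(8))
```

The returned `alpha` would be used as the weights α(t.sub.n,ω.sub.k) in the codebook search of Equation (1).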
(52)
(53) Band processing A units 605 may use known methods such as those disclosed in U.S. Pat. No. 5,826,222, titled “Estimation of Excitation Parameters,” which is incorporated by reference. Band processing A units 605 divide the speech signal into different frequency bands using bandpass filters with different center frequencies. A nonlinearity is applied to the output of each bandpass filter to emphasize the fundamental frequency. The frequency domain signal T.sub.k(ω) may be produced for frequency band k by applying a window, Fourier transform, and magnitude squared to the output of the nonlinearity.
(54) The combine bands unit 610 combines the outputs of band processing A units 605 using a weighted summation. The weights may be computed by comparing the energy in a frequency band to an estimate of the background noise in that band to produce a signal to noise ratio (SNR). The weights may be determined from the estimated SNR so that weights are higher when the estimated SNR is higher. A fundamental frequency ω.sub.A may be estimated from the weighted summation T(ω), along with a probability P.sub.A that the estimated fundamental frequency is correct or an error E.sub.A that indicates how close the combined frequency domain signal is to the spectrum of a periodic signal.
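The SNR-driven band weighting described above can be sketched as follows. The text only requires that weights increase with estimated SNR; the bounded mapping snr/(1+snr) used here is an assumption, not the patented formula.

```python
import numpy as np

def snr_weights(band_energy, noise_energy, floor=1e-6):
    """Band weights for combine bands unit 610: higher estimated SNR
    gives a higher weight.  The snr/(1+snr) mapping is an assumed
    monotone, bounded choice."""
    snr = np.asarray(band_energy, float) / np.maximum(np.asarray(noise_energy, float), floor)
    return snr / (1.0 + snr)

# Bands with strong signal relative to noise get weights near 1
w = snr_weights([10.0, 1.0, 0.1], [1.0, 1.0, 1.0])
```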
(55) The band processing B units 615 use a method different from the band processing A units 605. For example, the B units may use the same bandpass filters as the A units. However, the frequency domain signal U.sub.k(ω) may be produced for frequency band k by applying a window, Fourier transform, and magnitude squared to the output of the bandpass filters directly. In another implementation, frequency domain signal U.sub.k(ω) may be produced by applying a window, Fourier transform, and magnitude squared to the speech signal s.sub.0(n) and then multiplying by a frequency domain window to select frequency band k.
(56) Combine bands unit 620 combines the outputs of band processing B units 615 using a weighted summation
(57)
(58) where γ.sub.k is a band weighting which should be similar to the band weighting selected for combine bands unit 610 in order to improve performance of the combine parameter estimates unit 625. A fundamental frequency ω.sub.B may be estimated from the weighted summation, along with a probability P.sub.B that the fundamental frequency is correct or an error E.sub.B that indicates how close the combined frequency domain signal is to the spectrum of a periodic signal. In one implementation, the fundamental frequency ω.sub.B may be estimated by maximizing a voiced energy
(59)
(60) where I.sub.n=[(n−∈)ω.sub.B,(n+∈)ω.sub.B] and ∈ has a typical value of 0.167 and N is the number of harmonics of the fundamental in the bandwidth W (typically 4 kHz). For example, the energy E.sub.v(ω.sub.B) may be evaluated for fundamental frequencies between 400 Hz and 720 Hz. The evaluation points may be uniform in frequency or log frequency with a typical number of 21. Accuracy may be increased by increasing the number of evaluation points at the expense of increased computation.
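The grid search over candidate fundamentals can be sketched as follows. Equation (28) itself is not reproduced in the text above, so the voiced-energy integrand here, summing the combined spectrum over the harmonic intervals I.sub.n, is an assumption consistent with the stated definitions of I.sub.n, ∈, and N; the 400-720 Hz grid of 21 uniform points follows the example given.

```python
import numpy as np

def voiced_energy(U, freqs, wB, eps=0.167, W=4000.0):
    """Assumed voiced energy for candidate fundamental wB (Hz): energy
    of the combined spectrum U falling inside the harmonic intervals
    I_n = [(n - eps) wB, (n + eps) wB], n = 1..N, N = harmonics in W."""
    N = int(W / wB)
    ev = 0.0
    for n in range(1, N + 1):
        in_band = (freqs >= (n - eps) * wB) & (freqs <= (n + eps) * wB)
        ev += U[in_band].sum()
    return ev

# Synthetic harmonic spectrum with a 500 Hz fundamental
freqs = np.linspace(0, 4000, 2048)
true_f0 = 500.0
U = np.zeros_like(freqs)
for h in range(1, 9):
    U += np.exp(-0.5 * ((freqs - h * true_f0) / 5.0) ** 2)

# 21 evaluation points uniform in frequency between 400 Hz and 720 Hz
grid = np.linspace(400.0, 720.0, 21)
wB = grid[np.argmax([voiced_energy(U, freqs, f) for f in grid])]
```

As the text notes, accuracy is limited by the grid spacing; the iterative refinement described next removes that limitation without extra evaluation points.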
(61) In another implementation, accuracy of the fundamental frequency estimate may be increased without additional evaluation points through the following iterative procedure
(62)
(63) where the initial estimate ω.sub.B.sup.0 starts at the evaluation point,
(64) I.sub.n=[nω.sub.B.sup.n-1−∈ω.sub.B.sup.0, nω.sub.B.sup.n-1+∈ω.sub.B.sup.0], and the fundamental estimate is updated at each harmonic. A fundamental frequency ω.sub.B may be estimated from the weighted average of the estimates at each harmonic.
(65)
(66) The error E.sub.B may be computed using
E.sub.B=1−E.sub.v(ω.sub.B)/E.sub.U (31)
where
(67)
(68) is the energy in U(ω) and the typical range of summation for m is zero to the largest value for which ω.sub.m≤(N+0.5)ω.sub.B.
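Equation (31) can be sketched as follows. This is an illustration under the same bin-summation assumption as before: E.sub.U sums U over the stated range of m, and E.sub.v is recomputed inline so the function stands alone.

```python
import numpy as np

def error_E_B(U, omega_m, omega_B, W=4000.0, eps=0.167):
    """E_B = 1 - E_v(omega_B) / E_U  (Equation (31)).
    E_U sums U over bins from zero up to (N + 0.5) * omega_B."""
    N = int(W // omega_B)  # harmonics of omega_B in bandwidth W
    # Voiced energy near each harmonic (bin summation stands in for
    # the integral; an assumption).
    E_v = sum(U[(omega_m >= (n - eps) * omega_B)
                & (omega_m <= (n + eps) * omega_B)].sum()
              for n in range(1, N + 1))
    E_U = U[omega_m <= (N + 0.5) * omega_B].sum()
    return 1.0 - E_v / E_U if E_U > 0 else 1.0
```

A perfectly periodic spectrum yields E.sub.B near zero, while broadband noise pushes E.sub.B toward one, matching the error's role as a periodicity measure.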
(69) Combine parameter estimates unit 625 combines the fundamental frequency estimates produced by combine band units 610 and 620 to produce an output fundamental frequency estimate ω.sub.0. In one implementation, the parameter estimates are combined by selecting fundamental frequency estimate ω.sub.A when the probability P.sub.A that fundamental frequency estimate ω.sub.A is correct is higher than the probability P.sub.B that fundamental frequency estimate ω.sub.B is correct, and the fundamental frequency estimate ω.sub.B is otherwise selected.
(70) In another implementation, fundamental frequency estimate ω.sub.A is selected when the error E.sub.A associated with fundamental frequency estimate ω.sub.A is less than the error E.sub.B associated with fundamental frequency estimate ω.sub.B and fundamental frequency estimate ω.sub.B is otherwise selected.
(71) In yet another implementation, fundamental frequency estimate ω.sub.A is selected when the associated error E.sub.A is below a threshold with a typical value of 0.1. Otherwise, fundamental frequency estimate ω.sub.A is selected when the error E.sub.A associated with fundamental frequency estimate ω.sub.A is less than the error E.sub.B associated with fundamental frequency estimate ω.sub.B, and fundamental frequency estimate ω.sub.B is selected otherwise.
(72) An output error E.sub.0 may be set to correspond to the error associated with the selected fundamental frequency estimate.
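The third selection rule of paragraph (71), together with the output error of paragraph (72), can be condensed as follows (an illustration; the function name and tuple return are assumptions):

```python
def combine_estimates(w_A, E_A, w_B, E_B, threshold=0.1):
    """Select w_A when its error is below the threshold (typically 0.1),
    otherwise select whichever estimate has the smaller error.
    Returns the selected fundamental and the output error E_0."""
    if E_A < threshold or E_A < E_B:
        return w_A, E_A
    return w_B, E_B
```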
(73) Advantages of using similar band weightings for combine bands units 610 and 620 may be demonstrated by considering a scenario where one or more of the bands is dominated by high energy background noise (low SNR bands) and the other bands are dominated by harmonics of the fundamental for a speech signal (high SNR bands). For this case, even though combine bands unit 610 may have a better estimate of the fundamental frequency, it may have a larger error if the low SNR bands are weighted more heavily than combine bands unit 620. This larger error may lead to the selection of the less accurate estimate of combine bands unit 620 and reduced performance.
(74) Combine parameter estimates unit 625 may use additional parameters to produce an output fundamental frequency estimate ω.sub.0. For example, in firefighting applications, voice communication may occur in the presence of loud tonal alarms. These alarms may have time varying frequencies and amplitudes which reduce the effectiveness of automatic background noise estimation methods. To improve performance in this case, the magnitude of the STFT |S(t,ω)| may be computed and, for a particular frame time t, the energy may be summed for a high frequency interval (typically 2-4 kHz) to form parameter E.sub.H which may be compared to the total energy in the frame E.sub.T to form a ratio r.sub.H=E.sub.H/E.sub.T. In addition, a low pass version E.sub.LB of the error E.sub.B of Equation (31) may be computed using a bandwidth W of 2 kHz. When the ratio r.sub.H is above a threshold (typically 0.9) and E.sub.LB is above a threshold (typically 0.2), performance may be increased by ignoring fundamental frequency estimate ω.sub.B in combine parameter estimates unit 625.
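The tonal-alarm check of paragraph (74) can be sketched as follows. This is an illustration: the low pass error E.sub.LB is taken as a precomputed input, and the function name is an assumption.

```python
import numpy as np

def ignore_w_B_for_alarm(S_mag, freqs, E_LB, r_thresh=0.9, e_thresh=0.2):
    """r_H = E_H / E_T using high-band (2-4 kHz) energy of one STFT
    frame magnitude S_mag; ignore w_B when both thresholds are met."""
    E_T = np.sum(S_mag ** 2)
    E_H = np.sum(S_mag[(freqs >= 2000) & (freqs <= 4000)] ** 2)
    r_H = E_H / E_T if E_T > 0 else 0.0
    return bool(r_H > r_thresh and E_LB > e_thresh)
```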
(75) In another implementation, the magnitude of the STFT |S(t,ω)| may be computed and the frequency at which it achieves its maximum ω.sub.p may be determined for a particular frame time t. The energy E.sub.p in an interval ∈.sub.p (typically about 156 Hz wide) around the peak frequency ω.sub.p may be compared to the total energy in the frame E.sub.T to form a ratio r.sub.p=E.sub.p/E.sub.T. When the ratio r.sub.p is above a threshold (typically 0.7) and the peak frequency ω.sub.p is above a threshold (typically 2 kHz), performance may be increased by ignoring fundamental frequency estimate ω.sub.B in combine parameter estimates unit 625.
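The peak-frequency check of paragraph (75) admits a similar sketch (an illustration; the function name and the symmetric interval around the peak are assumptions):

```python
import numpy as np

def ignore_w_B_for_peak(S_mag, freqs, width=156.0,
                        r_thresh=0.7, f_thresh=2000.0):
    """r_p = E_p / E_T, where E_p is the energy in a ~156 Hz interval
    around the spectral peak w_p of one STFT frame magnitude S_mag."""
    w_p = freqs[np.argmax(S_mag)]
    E_T = np.sum(S_mag ** 2)
    in_band = np.abs(freqs - w_p) <= width / 2
    r_p = np.sum(S_mag[in_band] ** 2) / E_T if E_T > 0 else 0.0
    return bool(r_p > r_thresh and w_p > f_thresh)
```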
(76) Quality of the synthesized signal may be improved in some cases by using additional parameters in combine parameter estimates unit 625 to produce a smoother output fundamental frequency estimate ω.sub.0 as a function of time. For example, when frequency estimate ω.sub.B is preferred over ω.sub.A, the subharmonic l of fundamental frequency estimate ω.sub.B may be selected as the output fundamental frequency estimate ω.sub.0 for the current frame if the subharmonic frequency (ω.sub.B/l) is closer to a target frequency ω.sub.T.
(77) In another implementation, thresholds T.sub.l=(l+0.5) ω.sub.T are determined based on the target frequency and the subharmonic number. When frequency estimate ω.sub.B is selected over ω.sub.A, frequency estimate ω.sub.B is compared to threshold T.sub.l for subharmonic number l=1, 2, 3, 4. The first subharmonic number for which the frequency estimate ω.sub.B is less than the threshold T.sub.l is selected to compute the output fundamental frequency estimate ω.sub.0=ω.sub.B/l.
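The threshold-based subharmonic selection of paragraph (77) can be sketched as follows. The fall-through behavior when no threshold is satisfied is not specified in the text and is an assumption here.

```python
def select_subharmonic(w_B, w_T):
    """Thresholds T_l = (l + 0.5) * w_T for l = 1..4; the first l with
    w_B < T_l gives the output estimate w_0 = w_B / l."""
    for l in (1, 2, 3, 4):
        if w_B < (l + 0.5) * w_T:
            return w_B / l
    return w_B / 4  # fall-through choice (an assumption)
```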
(78) The target frequency ω.sub.T may be selected as the previous output fundamental frequency estimate ω.sub.0 when the previous error E.sub.0 is below a threshold (typically 0.2). Otherwise, the target frequency may be set to an average output fundamental frequency estimate
(79) An average output fundamental frequency estimate may be computed from a sequence of previous output fundamental frequency estimates ω.sub.0(t.sub.n).
(80) In another implementation, only samples of the sequence ω.sub.0(t.sub.n) with error E.sub.0(t.sub.n) below a threshold (typically 0.1) are used in the computation of the average.
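The target-frequency selection of paragraphs (78)-(80) can be sketched as follows. This is an illustration: the simple mean over error-gated samples and the fallback when no sample qualifies are assumptions.

```python
def target_frequency(prev_w0, prev_E0, w0_hist, E0_hist,
                     e_prev=0.2, e_avg=0.1):
    """w_T: the previous output estimate when its error is below e_prev
    (typically 0.2); otherwise an average of past outputs whose error
    was below e_avg (typically 0.1)."""
    if prev_E0 < e_prev:
        return prev_w0
    good = [w for w, e in zip(w0_hist, E0_hist) if e < e_avg]
    # With no reliable samples, fall back to the previous estimate
    # (a fallback not specified in the text).
    return sum(good) / len(good) if good else prev_w0
```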
(81) Quality of the synthesized signal may be improved in some cases by using additional parameters in combine parameter estimates unit 625 to select between fundamental frequency estimate ω.sub.A and ω.sub.A/2 before combining with fundamental frequency estimate ω.sub.B.
(82)
(83)
(84) where I.sub.n=[(n−∈)ω.sub.A,(n+∈)ω.sub.A], ∈ has a typical value of 0.25, and N is the number of harmonics of the fundamental ω.sub.A in the bandwidth W.sub.A (typically 500 Hz).
(85)
(86) where K.sub.n=[(n−∈)ω.sub.A/2,(n+∈)ω.sub.A/2], ∈ has a typical value of 0.25, and M is the number of harmonics of the fundamental ω.sub.A/2 in the bandwidth W.sub.A (typically 500 Hz).
(87) If the voiced energy ε.sub.2 for ω.sub.A/2 is greater than the product of constant c.sub.0 and voiced energy ε.sub.1, the sub-process 700 proceeds to step 715. Otherwise, the sub-process 700 proceeds to step 805 of a sub-process 800 shown in
(88) In step 715, the fundamental track length τ is compared to a constant c.sub.1 (typically 3). The unit of the fundamental track length is typically frames and is initialized to zero. It measures the number of consecutive frames for which the fundamental frequency estimate deviates from the estimate in the previous frames by less than a percentage (typically 15%). If the fundamental track length τ is less than the constant c.sub.1, the sub-process 700 proceeds to step 730. Otherwise, the sub-process 700 proceeds to step 720.
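The track-length bookkeeping described in step 715 can be sketched as a per-frame update (an illustration; the function name is an assumption):

```python
def update_track_length(tau, w_curr, w_prev, tol=0.15):
    """Count consecutive frames whose fundamental deviates from the
    previous frame's by less than tol (typically 15%); reset to zero
    when the track breaks. tau is the current length in frames."""
    if w_prev > 0 and abs(w_curr - w_prev) < tol * w_prev:
        return tau + 1
    return 0
```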
(89) In step 720, fundamental ω.sub.A is compared with the product of constant c.sub.2 (typically 0.9) and fundamental ω.sub.1 (typically set to the fundamental estimate ω.sub.A from the previous frame). If the fundamental ω.sub.A is less than the product of constant c.sub.2 and fundamental ω.sub.1, the sub-process 700 proceeds to step 730. Otherwise, the sub-process 700 proceeds to step 725.
(90) In step 725, fundamental ω.sub.A is compared with the product of constant c.sub.3 (typically 1.1) and fundamental ω.sub.1. If the fundamental ω.sub.A is greater than the product of constant c.sub.3 and fundamental ω.sub.1, the sub-process 700 proceeds to step 730. Otherwise, the sub-process 700 proceeds to step 805 of sub-process 800.
(91) In step 730, fundamental ω.sub.A is compared with the product of constant c.sub.4 (typically 0.85) and average fundamental
(92) In step 735, fundamental ω.sub.A is compared with the product of constant c.sub.5 (typically 1.15) and average fundamental
(93) Referring to
(94) In step 810, voiced energy ε.sub.2 is compared to the product of a.sub.0 (typically 1.1) and voiced energy ε.sub.1. If voiced energy ε.sub.2 is greater than the product of a.sub.0 and voiced energy ε.sub.1, the sub-process 800 proceeds to step 815. Otherwise, the sub-process 800 proceeds to step 905 of a sub-process 900 shown in
(95) In step 815, the normalized voiced energy E.sub.2 for the previous frame is compared to the normalized voiced energy E.sub.1. The normalized voiced energy E.sub.1 for a frame is calculated as:
(96)
(97) where I=[(1−ε)ω.sub.A,W.sub.A], ∈ has a typical value of 0.5, and bandwidth W.sub.A is typically 500 Hz. If the normalized voiced energy E.sub.2 is less than the normalized voiced energy E.sub.1, the sub-process 800 proceeds to step 825. Otherwise, the sub-process 800 proceeds to step 820.
(98) In step 820, the normalized voiced energy E.sub.2 for the previous frame is compared to a constant a.sub.1 (typically 0.2). If the normalized voiced energy E.sub.2 is less than a.sub.1, the sub-process 800 proceeds to step 825. Otherwise, the sub-process 800 proceeds to step 905 of sub-process 900.
(99) In step 825, V.sub.1 (the voicing decisions for the previous frame) are compared to a.sub.2 (typically all bands unvoiced). If they are not equal, the sub-process 800 proceeds to step 830. Otherwise, the sub-process 800 proceeds to step 905 of sub-process 900.
(100) In step 830, fundamental ω.sub.2 (typically set to ω.sub.A/2) is compared to the product of constant a.sub.3 (typically 0.8) and fundamental ω.sub.1 (typically set to the fundamental estimate ω.sub.A from the previous frame). If fundamental ω.sub.2 is greater than the product of constant a.sub.3 and fundamental ω.sub.1, the sub-process 800 proceeds to step 835. Otherwise, the sub-process 800 proceeds to step 905 of sub-process 900.
(101) In step 835, fundamental ω.sub.2 is compared to the product of constant a.sub.4 (typically 1.2) and fundamental ω.sub.1. If fundamental ω.sub.2 is less than the product of constant a.sub.4 and fundamental ω.sub.1, the sub-process 800 proceeds to step 905 of sub-process 900. Otherwise, the sub-process 800 proceeds to step 1040 of sub-process 1000.
(102) Referring to
(103) In step 910, voiced energy ε.sub.2 is compared to the product of a.sub.5 (typically 1.4-0.3p.sub.3, where p.sub.3 is the predicted fundamental valid) and voiced energy ε.sub.1. The predicted fundamental valid p.sub.3 ranges from 0 to 1 and is an estimate of the validity of a predicted fundamental ω.sub.3. One method for determining predicted fundamental valid p.sub.3 initializes it to zero. Then, if normalized voiced energy E.sub.1 is less than a constant (typically 0.2) and previous normalized voiced energy E.sub.2 is less than a constant (typically 0.2) and fundamental track length τ is greater than a constant (typically 0), then predicted fundamental valid p.sub.3 is set to one; otherwise it is multiplied by a constant (typically 0.9).
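The p.sub.3 update rule just described can be condensed as follows (an illustration; the function name is an assumption):

```python
def update_p3(p3, E1, E2, tau, e_thresh=0.2, tau_thresh=0, decay=0.9):
    """Predicted-fundamental validity p_3 in [0, 1]: set to one when
    both normalized voiced energies are small and the fundamental
    track is established; otherwise decay toward zero."""
    if E1 < e_thresh and E2 < e_thresh and tau > tau_thresh:
        return 1.0
    return p3 * decay
```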
(104) If voiced energy ε.sub.2 is greater than the product of a.sub.5 and voiced energy ε.sub.1, the sub-process 900 proceeds to step 915. Otherwise, the sub-process 900 proceeds to step 1005 of sub-process 1000.
(105) In step 915, predicted fundamental valid p.sub.3 is compared to a.sub.6 (typically 0.1). If predicted fundamental valid p.sub.3 is less than a.sub.6, the sub-process 900 proceeds to step 1040 of sub-process 1000. Otherwise, the sub-process 900 proceeds to step 920.
(106) In step 920, fundamental ω.sub.2 (typically set to ω.sub.A/2) is compared to the product of constant a.sub.7 (typically 0.8) and predicted fundamental ω.sub.3. One method of generating predicted fundamental ω.sub.3 sets it to the current output fundamental frequency estimate ω.sub.0 when predicted fundamental valid p.sub.3 is set to one. The predicted fundamental for the next frame may be increased by an estimated fundamental slope. One method of generating an estimated fundamental slope sets it to the difference between the current output fundamental frequency estimate ω.sub.0 and the output fundamental frequency for the previous frame when predicted fundamental valid p.sub.3 is set to one. Otherwise, the estimated fundamental slope may be multiplied by a constant (typically 0.8).
(107) If fundamental ω.sub.2 is greater than the product of constant a.sub.7 and predicted fundamental ω.sub.3, the sub-process 900 proceeds to step 925. Otherwise, the sub-process 900 proceeds to step 1005 of sub-process 1000.
(108) In step 925, fundamental ω.sub.2 is compared to the product of a.sub.8 (typically 1.2) and predicted fundamental ω.sub.3. If fundamental ω.sub.2 is less than the product of constant a.sub.8 and predicted fundamental ω.sub.3, the sub-process 900 proceeds to step 1040 of sub-process 1000. Otherwise, the sub-process 900 proceeds to step 1005 of sub-process 1000.
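The predicted fundamental and slope maintenance of paragraph (106) can be sketched as follows. The text leaves the no-update path partly open, so carrying the decayed slope into the next-frame prediction is an assumption here, as are the function name and tuple return.

```python
def update_prediction(w3, slope, w0, w0_prev, p3, decay=0.8):
    """When p_3 is one, latch w_3 to the current output estimate and
    the slope to the frame-to-frame difference; otherwise decay the
    slope. Returns the prediction for the next frame and the slope."""
    if p3 == 1.0:
        slope = w0 - w0_prev
        w3 = w0
    else:
        slope *= decay  # typical decay constant 0.8
    return w3 + slope, slope
```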
(109) Referring to
(110) In step 1010, voiced energy ε.sub.2 is compared to the product of b.sub.0 (typically 1.0) and voiced energy ε.sub.1. If voiced energy ε.sub.2 is greater than or equal to the product of b.sub.0 and voiced energy ε.sub.1, the sub-process 1000 proceeds to step 1015. Otherwise, the sub-process 1000 proceeds to step 1020, which ends the process with no change to fundamental ω.sub.A.
(111) In step 1015, the fundamental track length τ is compared to b.sub.1 (typically 3). If the fundamental track length τ is greater than or equal to b.sub.1, the sub-process 1000 proceeds to step 1025. Otherwise, the sub-process 1000 proceeds to step 1020, which ends the process with no change to fundamental ω.sub.A.
(112) In step 1025, fundamental ω.sub.2 (typically set to ω.sub.A/2) is compared with the product of constant b.sub.2 (typically 0.8) and fundamental ω.sub.1 (typically set to the fundamental estimate ω.sub.A from the previous frame). If fundamental ω.sub.2 is greater than the product of constant b.sub.2 and fundamental ω.sub.1, the sub-process 1000 proceeds to step 1030. Otherwise, the sub-process 1000 proceeds to step 1020, which ends the process with no change to fundamental ω.sub.A.
(113) In step 1030, fundamental ω.sub.2 is compared with the product of constant b.sub.3 (typically 1.2) and fundamental ω.sub.1. If fundamental ω.sub.2 is less than the product of constant b.sub.3 and fundamental ω.sub.1, the sub-process 1000 proceeds to step 1035. Otherwise, the sub-process 1000 proceeds to step 1020, which ends the process with no change to fundamental ω.sub.A.
(114) In step 1035 (which is also reached from step 1040), fundamental ω.sub.A is set to half its value and the sub-process proceeds to step 1045, which ends the process with the ω.sub.A reduced by half.
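Steps 1010 through 1035 of sub-process 1000 can be condensed into a single decision (an illustration; the function name is an assumption, and step 1040's external entry into step 1035 is omitted):

```python
def subprocess_1000(w_A, w_1, e1, e2, tau,
                    b0=1.0, b1=3, b2=0.8, b3=1.2):
    """Halve w_A only when the half-fundamental hypothesis
    w_2 = w_A / 2 has enough voiced energy (step 1010), the track is
    established (step 1015), and w_2 stays within (b2, b3) times the
    previous fundamental w_1 (steps 1025 and 1030)."""
    w_2 = w_A / 2
    if (e2 >= b0 * e1 and tau >= b1
            and w_2 > b2 * w_1 and w_2 < b3 * w_1):
        return w_2  # step 1035: fundamental reduced by half
    return w_A      # step 1020: no change
```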
(115) The comparisons in steps 710, 810, 910, and 1010 could also be performed by computing the ratio of voiced energy ε.sub.2 to voiced energy ε.sub.1 and comparing that ratio to the parameters c.sub.0, a.sub.0, a.sub.5, and b.sub.0, respectively. While the product comparisons in steps 710, 810, 910, and 1010 provide computational benefits, the ratio comparisons may be referenced for conceptual reasons. It should be noted that the overall structure of the process of
(116) Referring to
(117)
(118) Other implementations are within the scope of the following claims.