Real-time pitch tracking by detection of glottal excitation epochs in speech signal using Hilbert envelope
11443761 · 2022-09-13
Inventors
- Prem Chand Pandey (Mumbai, IN)
- Hirak Dasgupta (Kolkata, IN)
- Nataraj Kathriki Shambulingappa (Davangere, IN)
Abstract
A technique, suitable for real-time processing, is disclosed for pitch tracking by detection of glottal excitation epochs in a speech signal. It uses the Hilbert envelope to enhance the saliency of the glottal excitation epochs and to reduce the ripples due to the vocal tract filter. The processing comprises the steps of dynamic range compression, calculation of the Hilbert envelope, and epoch marking. The Hilbert envelope is calculated using the output of an FIR-filter-based Hilbert transformer and the delay-compensated signal. The epoch marking uses a dynamic peak detector with fast rise and slow fall and nonlinear smoothing to further enhance the saliency of the epochs, followed by a differentiator or a Teager energy operator, and amplitude-duration thresholding. The technique is meant for use in speech codecs, voice conversion, speech and speaker recognition, diagnosis of voice disorders, speech training aids, and other applications involving pitch estimation.
Claims
1. A method for real-time pitch tracking by detection of glottal excitation epochs in a speech signal, the method comprising the steps of: applying a dynamic range compression on the speech signal to obtain a dynamic range compressed signal; calculating a Hilbert envelope of the dynamic range compressed signal; and marking epochs and outputting pitch periods from the Hilbert envelope, wherein marking epochs and outputting pitch periods comprises: calculating a peak envelope and a smoothed peak envelope from the Hilbert envelope to retain saliency of the epochs and reduce residual ripples in the Hilbert envelope; and marking epochs and outputting pitch periods from the smoothed peak envelope by detection of a saliency, wherein the detection of the saliency comprises: obtaining a saliency-enhanced peak envelope by processing the smoothed peak envelope to emphasize the points with high rate of change; and applying amplitude-duration thresholding on the saliency-enhanced peak envelope, with an amplitude threshold and a duration threshold, to mark the epochs and output the pitch periods by: calculating the amplitude threshold as a short-time average magnitude of the saliency-enhanced peak envelope; and calculating the duration threshold as half of the mean of the preceding ten pitch periods which are lying within a set range of 2-15 ms, and applying a lower limit of 2 ms.
2. The method as claimed in claim 1, wherein applying the dynamic range compression comprises applying a feed-forward compression, wherein applying the feed-forward compression comprises: calculating a short-time average magnitude of the speech signal to obtain a magnitude envelope; calculating a compressed envelope from the magnitude envelope; calculating a gain from the magnitude envelope and the compressed envelope; delaying the speech signal by a delay to obtain a delayed speech signal, wherein the delay is equal to the delay introduced in obtaining the magnitude envelope; and obtaining the dynamic range compressed signal from the delayed speech signal and the gain.
3. The method as claimed in claim 1, wherein calculating the Hilbert envelope comprises: obtaining a Hilbert transformed signal of the dynamic range compressed signal; delaying the dynamic range compressed signal by a delay to obtain a delayed dynamic range compressed signal, wherein the delay is equal to the delay introduced in obtaining the Hilbert transformed signal; calculating the square of the Hilbert transformed signal to obtain a squared Hilbert transformed signal; calculating the square of the delayed dynamic range compressed signal to obtain a squared delayed dynamic range compressed signal; and adding the squared Hilbert transformed signal and the squared delayed dynamic range compressed signal to obtain the Hilbert envelope.
4. The method as claimed in claim 1, wherein calculating the peak envelope of the Hilbert envelope comprises updating a peak and a valley of the Hilbert envelope, using recursive relations with fast rise and slow fall rates.
5. The method as claimed in claim 1, wherein the nonlinear smoothing is carried out by applying a two-stage median-mean filtering on the peak envelope to obtain the smoothed peak envelope.
6. The method as claimed in claim 1, wherein applying the amplitude-duration thresholding to the saliency-enhanced peak envelope, with the amplitude threshold and the duration threshold comprises: marking a point as an epoch for the saliency-enhanced peak envelope that exceeds the amplitude threshold and the time interval since the last detected epoch exceeds the duration threshold; and outputting an impulse as epoch at each epoch marking and simultaneously outputting the inter-epoch interval as the pitch period.
7. The method as claimed in claim 1, wherein obtaining the saliency-enhanced peak envelope by processing the smoothed peak envelope comprises differentiating the smoothed peak envelope.
8. The method as claimed in claim 1, wherein obtaining the saliency-enhanced peak envelope by processing the smoothed peak envelope comprises applying a Teager energy operator on the smoothed peak envelope.
9. A system for real-time pitch tracking by detection of glottal excitation epochs in a speech signal, the system comprising: a dynamic range compression module configured to apply a dynamic range compression on the speech signal to obtain a dynamic range compressed signal; a Hilbert envelope calculation module configured to calculate a Hilbert envelope of the dynamic range compressed signal; and an epoch marking and pitch detection module to process the Hilbert envelope for marking epochs and outputting pitch periods, wherein the epoch marking and pitch detection module comprises: a dynamic peak detector module configured to obtain a peak envelope from the Hilbert envelope; a nonlinear smoother module configured to calculate a smoothed peak envelope from the peak envelope; and a saliency detector module configured to mark epochs and output pitch periods from the smoothed peak envelope by detection of a saliency, wherein the saliency detector module comprises: a saliency enhancer module configured to obtain a saliency-enhanced peak envelope by processing the smoothed peak envelope to emphasize the points with high rate of change; an amplitude-duration thresholding module configured to apply amplitude-duration thresholding on the saliency-enhanced peak envelope, with an amplitude threshold and a duration threshold, to mark the epochs and output the pitch periods; an amplitude threshold calculator configured to calculate the amplitude threshold as a short-time average magnitude of the saliency-enhanced peak envelope; and a duration threshold calculator configured to calculate the duration threshold as half of the mean of the preceding ten pitch periods which are lying within a set range of 2-15 ms and apply a lower limit of 2 ms.
10. The system as claimed in claim 9, wherein the dynamic range compression module applies a feed-forward compression, wherein the dynamic range compression module comprises: a magnitude envelope estimation module configured to calculate a short-time average magnitude of the speech signal to obtain a magnitude envelope; a compressed envelope calculation module configured to calculate a compressed envelope from the magnitude envelope; a gain calculator module configured to calculate a gain from the magnitude envelope and the compressed envelope; a first delay module for delaying the speech signal by a delay to obtain a delayed speech signal, wherein the delay is equal to the delay introduced in obtaining the magnitude envelope; and a multiplier module for obtaining the dynamic range compressed signal from the delayed speech signal and the gain.
11. The system as claimed in claim 9, wherein the Hilbert envelope calculation module comprises: a Hilbert transformer module configured to obtain a Hilbert transformed signal of the dynamic range compressed signal; a second delay module for delaying the dynamic range compressed signal to obtain a delayed dynamic range compressed signal, wherein the delay introduced by the second delay module is equal to the delay introduced by the Hilbert transformer module; a first squaring module for calculating the square of the Hilbert transformed signal to obtain a squared Hilbert transformed signal; a second squaring module for calculating the square of the delayed dynamic range compressed signal to obtain a squared delayed dynamic range compressed signal; and an adder module for adding the squared Hilbert transformed signal and the squared delayed dynamic range compressed signal to obtain the Hilbert envelope.
12. The system as claimed in claim 9, wherein the dynamic peak detector module is configured to calculate the peak envelope of the Hilbert envelope, by updating a peak and a valley of the Hilbert envelope, using recursive relations with fast rise and slow fall rates.
13. The system as claimed in claim 9, wherein the nonlinear smoother module is configured to carry out nonlinear smoothing by applying a two-stage median-mean filtering on the peak envelope to obtain the smoothed peak envelope.
14. The system as claimed in claim 9, wherein the amplitude-duration thresholding module is configured to apply amplitude-duration thresholding on the saliency-enhanced peak envelope to obtain the epochs and the pitch periods, by marking a point as an epoch if the saliency-enhanced peak envelope exceeds the amplitude threshold and the time interval since the last detected epoch exceeds the duration threshold, and outputting an impulse as epoch at each epoch marking and simultaneously outputting the inter-epoch interval as the pitch period.
15. The system as claimed in claim 9, wherein the saliency enhancer module is configured to obtain the saliency-enhanced peak envelope by differentiating the smoothed peak envelope.
16. The system as claimed in claim 9, wherein the saliency enhancer module is configured to obtain the saliency-enhanced peak envelope by applying a Teager energy operator on the smoothed peak envelope.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The invention is described in detail with reference to the accompanying figures.
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
DETAILED DESCRIPTION OF THE INVENTION
(11) A method and system are disclosed for pitch tracking by detection of glottal excitation epochs in a speech signal, wherein the method permits real-time processing and is robust against high-pass filtering. The method is based on calculating the Hilbert envelope of the speech signal to enhance the excitation epochs and to suppress the ripples related to the vocal tract response. A dynamic range compression can be applied before the calculation of the Hilbert envelope, and an epoch marker may be used to detect the high-saliency points in the Hilbert envelope. The impulses corresponding to the detected epochs can then be used for pitch period estimation.
(12) The voiced speech signal can be modeled as the convolution of the impulse response of the time-varying vocal tract and glottal filter with the quasi-periodic impulse train due to glottal vibration. The speech signal $s(n)$ during voiced regions can be approximated by the short-time harmonic model as
(13)
where $b_k$ and $\theta_k$ represent the combined effect of the vocal tract and glottal filters and $\omega_0$ is the fundamental frequency. The Hilbert envelope of the speech signal $s(n)$ is the squared magnitude of the complex analytic signal $s_a(n)$, which is given as
$s_a(n) = s(n) + j\,s_h(n)$ (2)
where $s_h(n)$ is the Hilbert transform (see A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Upper Saddle River, N.J.: Prentice-Hall, 1999) of the speech signal $s(n)$. The Hilbert transform can be obtained by a $\pi/2$ phase shifter, also known as the Hilbert transformer, with the frequency and impulse responses given as
(14)
The Hilbert envelope $e_h(n)$ may be given as
$e_h(n) = s^2(n) + s_h^2(n)$ (5)
The Hilbert transform $s_h(n)$ for the speech signal $s(n)$ in Equation 1 can be given as
(15)
The Hilbert envelope $e_h(n)$ can be expressed as
(16)
The Hilbert envelope $e_h(n)$ consists of an offset and a sum of harmonics of $\omega_0$, with several harmonics in $s(n)$ contributing to the fundamental and enhancing the instants of significant excitation.
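The equations for the harmonic model, its Hilbert transform, and the resulting envelope are not reproduced in this text. As a hedged reconstruction, assuming the usual harmonic form with K harmonics whose frequencies all lie below half the sampling rate (K is a symbol introduced here for illustration), the offset-plus-harmonics structure described above can be derived as follows:

```latex
\begin{align*}
s(n)   &= \sum_{k=1}^{K} b_k \cos(k\omega_0 n + \theta_k), \\
s_h(n) &= \sum_{k=1}^{K} b_k \sin(k\omega_0 n + \theta_k), \\
e_h(n) &= s^2(n) + s_h^2(n)
        = \sum_{k=1}^{K}\sum_{l=1}^{K} b_k b_l
          \cos\bigl((k-l)\omega_0 n + \theta_k - \theta_l\bigr) \\
       &= \sum_{k=1}^{K} b_k^2
        + 2\sum_{k=2}^{K}\sum_{l=1}^{k-1} b_k b_l
          \cos\bigl((k-l)\omega_0 n + \theta_k - \theta_l\bigr).
\end{align*}
```

The first term is the constant offset. Every pair of adjacent harmonics (k - l = 1) contributes a component at the fundamental frequency, which is why the envelope retains a strong fundamental even when the fundamental itself is weak in the speech signal, for example after high-pass filtering.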
(17)
(18)
(19) The processing modules of the embodiment illustrated in
(20) The dynamic range compression serves as a pre-processing step to the Hilbert envelope calculation in order to reduce the possibility of misdetection of the epochs during low-energy speech segments. Dynamic range compression can be implemented in several ways.
(21) In the dynamic range compression module 210 as illustrated in
$a(n) = a(n-1) + [\,|s_{in}(n)| - |s_{in}(n-L)|\,]/L$ (8)
The value of $L$ is selected to correspond to a 25-ms window, i.e. $L = 25 \times 10^{-3} f_s$, where $f_s$ is the sampling frequency in Hz. For the input signal range of [−1, +1], the A-law compressed envelope is given as
(22)
A time-varying gain $g(n)$ is calculated from the magnitude envelope $a(n)$ and the compressed envelope $\tilde{a}(n)$ as
$g(n) = \tilde{a}(n)/a(n)$ (10)
The speech signal $s_{in}(n)$ is delayed by the delay introduced by the magnitude envelope estimation module and multiplied by the time-varying gain $g(n)$ to obtain the dynamic range compressed signal $s(n)$ as
$s(n) = g(n)\,s_{in}(n-(L-1)/2)$ (11)
The value of $A$ in Equation 9 is set to 40 to provide compression without an excessive increase of noise during silences, which results in a maximum gain of approximately 19 dB.
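The feed-forward compression described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: Equation 9 is not reproduced in this text, so the standard A-law characteristic is assumed; the 10 kHz sampling rate, the function names, the integer rounding of the delay, and the handling of a zero envelope are also assumptions.

```python
import math

def alaw_envelope(a, A=40.0):
    """A-law compression of an envelope value a in [0, 1].

    Assumed standard A-law form; Equation 9 itself is not shown in the text.
    """
    denom = 1.0 + math.log(A)
    if a < 1.0 / A:
        return A * a / denom
    return (1.0 + math.log(A * a)) / denom

def compress(s_in, fs=10000, A=40.0):
    """Feed-forward dynamic range compression following Equations 8, 10 and 11."""
    L = int(round(25e-3 * fs))           # 25-ms averaging window
    D = (L - 1) // 2                     # delay of the averager (integer part, assumed)
    max_gain = A / (1.0 + math.log(A))   # limiting gain as the envelope tends to zero
    a = 0.0
    out = []
    for n in range(len(s_in)):
        newest = abs(s_in[n])
        oldest = abs(s_in[n - L]) if n >= L else 0.0
        # Equation 8: recursive short-time average magnitude
        a = max(a + (newest - oldest) / L, 0.0)
        # Equation 10: gain from the magnitude and compressed envelopes
        g = alaw_envelope(min(a, 1.0), A) / a if a > 1e-12 else max_gain
        # Equation 11: apply the gain to the delay-compensated input
        delayed = s_in[n - D] if n >= D else 0.0
        out.append(g * delayed)
    return out
```

With A = 40 the limiting gain is 40/(1 + ln 40), about 8.5, which corresponds to roughly 18.6 dB and is consistent with the figure of approximately 19 dB quoted above. A full-scale input is passed with unity gain once the averaging window has filled.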
(23)
(24) The Hilbert transformer 510, used for the Hilbert envelope calculation as shown in
$s_{ht}(n) = s(n) * h_t(n)$ (12)
$s_d(n) = s(n-(M-1)/2)$ (13)
$e_{ht}(n) = s_{ht}^2(n) + s_d^2(n)$ (14)
In order to suppress the glottal and vocal tract filter responses without excessive smearing of the representation of the glottal excitation in the envelope, $M$ is empirically selected to correspond to 15 ms, i.e. $M = 15 \times 10^{-3} f_s$.
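Equations 12 to 14 can be illustrated with the following Python sketch. The FIR impulse response is not reproduced in this text, so a Hamming-windowed truncation of the ideal Hilbert transformer response is assumed here; the odd filter length, the 10 kHz sampling rate, and the function names are likewise assumptions made for illustration.

```python
import math

def hilbert_fir(M):
    """Windowed ideal Hilbert transformer with M taps, M odd (assumed design)."""
    c = (M - 1) // 2
    h = []
    for i in range(M):
        k = i - c
        # Ideal response: 2/(pi*k) for odd k, zero for even k (including k = 0)
        ideal = 0.0 if k % 2 == 0 else 2.0 / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (M - 1))  # Hamming
        h.append(ideal * window)
    return h

def hilbert_envelope(s, fs=10000):
    """Hilbert envelope from an FIR Hilbert transformer and a matched delay."""
    M = int(round(15e-3 * fs))   # filter length corresponding to 15 ms
    if M % 2 == 0:
        M += 1                   # force odd length so (M-1)/2 is an integer
    h = hilbert_fir(M)
    c = (M - 1) // 2
    env = []
    for n in range(len(s)):
        # Equation 12: convolution with the Hilbert transformer
        s_ht = sum(h[i] * s[n - i] for i in range(M) if n - i >= 0)
        # Equation 13: input delayed by the filter's group delay
        s_d = s[n - c] if n >= c else 0.0
        # Equation 14: sum of squares
        env.append(s_ht * s_ht + s_d * s_d)
    return env
```

For a unit-amplitude sinusoid well inside the passband, the envelope settles close to 1 once the filter has filled, because the delayed input and the transformed signal are in quadrature.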
(25) The epoch marking and pitch detection module 230 in the block diagram of
(26) The dynamic peak detector module 610 of
(27)
The valley d(n) tracks the time-varying offset in the Hilbert envelope, where the constants μ and ν, selected to be in the range [0,1], control the rise and fall rates. A fast rise (small μ) and slow fall (large ν) help in suppressing the ripples while retaining saliency of the epochs. In an exemplary embodiment, these values are selected as μ=0.1 and ν=0.9954 for 90% rise in one sample and 60% fall in 100 samples.
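The recursive relations for the peak and valley are not reproduced in this text. The following Python sketch shows one plausible form of a dynamic peak detector with fast rise and slow fall, using the two rate constants from the exemplary embodiment (named mu and nu in the code). The specific update rules, the initialization, and the function name are assumptions and may differ from the actual equations.

```python
def peak_envelope(e, mu=0.1, nu=0.9954):
    """Peak tracking with fast rise and slow fall (assumed recursions).

    e  : Hilbert envelope samples
    mu : rise constant, small for a fast rise
    nu : fall constant, close to 1 for a slow fall
    """
    p = d = e[0] if e else 0.0   # peak and valley start at the first sample (assumed)
    out = []
    for x in e:
        p_prev, d_prev = p, d
        if x >= p_prev:
            # Fast rise: move most of the way to the input in one sample
            p = mu * p_prev + (1.0 - mu) * x
        else:
            # Slow fall: decay gradually toward the valley
            p = nu * p_prev + (1.0 - nu) * d_prev
        if x <= d_prev:
            # Valley drops quickly to follow a falling input
            d = mu * d_prev + (1.0 - mu) * x
        else:
            # Valley creeps upward slowly, tracking the offset
            d = nu * d_prev + (1.0 - nu) * p_prev
        out.append(p)
    return out
```

Under these assumed rules, mu = 0.1 closes 90% of the gap to a new, higher input value in a single sample, while the fall toward the valley is governed by repeated multiplication by nu.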
(28)
(29) The nonlinear smoother 620 of
(30) Referring to the saliency detector module 630 of
(31) In the saliency enhancer module of the saliency detector module 630 as shown in
$y(n) = [-x(n) + 8x(n-1) - 8x(n-3) + x(n-4)]/12$ (17)
It may be noted that the differentiator may be replaced by other operations that emphasize the points with a high rate of change. One such operation is a real-time version of the Teager energy operator, given as
$y(n) = x^2(n-1) - x(n)\,x(n-2)$ (18)
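Equations 17 and 18 translate directly into code. The Python sketch below is illustrative only; the function names and the choice to output zero until enough past samples are available are assumptions.

```python
import math

def differentiate(x):
    """Causal five-point differentiator of Equation 17."""
    y = []
    for n in range(len(x)):
        if n < 4:
            y.append(0.0)   # not enough history yet (assumed start-up behavior)
        else:
            y.append((-x[n] + 8.0 * x[n - 1] - 8.0 * x[n - 3] + x[n - 4]) / 12.0)
    return y

def teager(x):
    """Causal Teager energy operator of Equation 18."""
    y = []
    for n in range(len(x)):
        if n < 2:
            y.append(0.0)   # not enough history yet (assumed start-up behavior)
        else:
            y.append(x[n - 1] * x[n - 1] - x[n] * x[n - 2])
    return y
```

Equation 17 is a five-point central difference delayed by two samples, so for a unit-slope ramp it returns 1 after start-up. For a sinusoid of amplitude B and digital frequency W, Equation 18 returns the constant B² sin²(W), which grows with both amplitude and frequency and therefore emphasizes abrupt changes.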
(32) In the saliency detector module (630) as shown in
$A_\theta(n) = A_\theta(n-1) + [\,|y(n)| - |y(n-P)|\,]/P$ (19)
where $P$ corresponds to a 10-ms window, i.e. $P = 10 \times 10^{-3} f_s$. The duration threshold $T_\theta(n)$ is calculated as half of the mean of the preceding ten pitch periods that lie within a set range, which may be 2-15 ms. A lower limit, which may be 2 ms, is applied to the duration threshold $T_\theta(n)$.
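The amplitude-duration thresholding can be sketched in Python as below. This is a hedged illustration: the 10 kHz sampling rate, the function name, the comparison of y(n) itself (rather than its magnitude) with the amplitude threshold, and the averaging over fewer than ten periods until ten are available are all assumptions not confirmed by the text.

```python
def mark_epochs(y, fs=10000):
    """Mark epochs in the saliency-enhanced envelope y; periods are in samples."""
    P = int(round(10e-3 * fs))      # 10-ms window for the amplitude threshold
    t_min = 2e-3 * fs               # lower limit of the valid range, 2 ms
    t_max = 15e-3 * fs              # upper limit of the valid range, 15 ms
    a_th = 0.0                      # amplitude threshold (Equation 19)
    t_th = t_min                    # duration threshold, starts at the lower limit
    recent = []                     # up to ten most recent in-range pitch periods
    last = None                     # index of the last detected epoch
    epochs, periods = [], []
    for n in range(len(y)):
        oldest = abs(y[n - P]) if n >= P else 0.0
        a_th = max(a_th + (abs(y[n]) - oldest) / P, 0.0)
        since = n - last if last is not None else float("inf")
        if y[n] > a_th and since > t_th:
            if last is not None:
                period = n - last
                periods.append(period)
                if t_min <= period <= t_max:
                    recent = (recent + [period])[-10:]
                    # Half of the mean of recent valid periods, floored at 2 ms
                    t_th = max(0.5 * sum(recent) / len(recent), t_min)
            epochs.append(n)
            last = n
    return epochs, periods
```

For an impulse train with a 5-ms period, the duration threshold settles at 2.5 ms, so a spurious peak arriving 0.5 ms after a true epoch is rejected even though it exceeds the amplitude threshold.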
(33) The implementation of the glottal excitation epoch detector uses a total storage of 725 variables and coefficients: 253 for magnitude envelope calculation in Equation 8, 3 for dynamic range compression in Equation 9, 1 for compressed signal in Equations 10-11, 302 for Hilbert envelope in Equations 12-14, 47 for smoothed peak in Equations 15-16 and two-stage median mean smoothing, 5 for differentiation in Equation 17, 103 for amplitude thresholding, and 11 for duration thresholding. The technique involves an algorithmic delay of 21.4 ms, consisting of 12.5 ms for compression, 7.5 ms for Hilbert envelope, and 1.4 ms for epoch marking.
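The storage and delay totals quoted above can be checked with a few lines of Python. The dictionary keys are descriptive labels chosen here. Reading the first two delay terms as half of the 25-ms and 15-ms windows is an interpretation, consistent with the group delay of a symmetric window but not stated explicitly in the text.

```python
# Storage (variables and coefficients) per processing stage, as listed in the text
storage = {
    "magnitude envelope (Eq. 8)": 253,
    "dynamic range compression (Eq. 9)": 3,
    "compressed signal (Eqs. 10-11)": 1,
    "Hilbert envelope (Eqs. 12-14)": 302,
    "smoothed peak (Eqs. 15-16, median-mean)": 47,
    "differentiation (Eq. 17)": 5,
    "amplitude thresholding": 103,
    "duration thresholding": 11,
}
total_storage = sum(storage.values())

# Algorithmic delay in ms; half-window values are an interpretation (see lead-in)
delays_ms = {
    "compression": 25.0 / 2,
    "Hilbert envelope": 15.0 / 2,
    "epoch marking": 1.4,
}
total_delay_ms = sum(delays_ms.values())

print(total_storage, round(total_delay_ms, 1))
```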
(34)
(35) The various modules disclosed in the above description can be implemented using digital signal processors, embedded microcontrollers, FPGAs (field programmable gate arrays), or ASICs (application specific integrated circuits) or a combination of such processors. Further, one, two, or more modules can be integrated into a single processor.
(36) The above description along with the accompanying drawings is intended to disclose and describe the preferred embodiment of the invention in sufficient detail to enable those skilled in the art to practice the invention. It should not be interpreted as limiting the scope of the invention. Various changes in form and detail may be made without departing from its spirit and scope.