Single-sided speech quality measurement

09786300 · 2017-10-10

Abstract

A non-intrusive speech quality estimation technique is based on statistical or probability models such as Gaussian Mixture Models (“GMMs”). Perceptual features are extracted from the received speech signal and assessed by an artificial reference model formed using statistical models. The models characterize the statistical behavior of speech features. Consistency measures between the input speech features and the models are calculated to form indicators of speech quality. The consistency values are mapped to a speech quality score using a mapping optimized using machine learning algorithms, such as Multivariate Adaptive Regression Splines (“MARS”). The technique provides competitive or better quality estimates relative to known techniques while having lower computational complexity.

Claims

1. A single-ended speech quality measurement method comprising the steps of: for each frame of a plurality of frames containing a speech signal that has been processed by network equipment, transmitted on a communications link, or both: extracting perceptual features; and classifying the frame based on the perceptual features into a class selected from a set of classes including voiced and unvoiced; and for the frames of each class: assessing the perceptual features with a statistical model of that class to generate an indicator of speech quality, the statistical model of that class being part of a reference model which includes at least one statistical model for each class of the set of classes, the reference model generated prior to extracting the perceptual features to form indicators of speech quality, including assessing at least some unvoiced frames; and employing the indicators of speech quality from different classes to produce an estimate of subjective speech quality score without reference to a corresponding speech signal that has not been processed by network equipment, transmitted on a communications link, or both.

2. The method of claim 1 including the further step of separately modeling a probability distribution of the features for each frame class and different classes of speech signals with statistical models.

3. The method of claim 2 wherein the classes include inactive.

4. The method of claim 2 including the further step of calculating a consistency measure indicative of speech quality for each class separately with a plurality of statistical models.

5. The method of claim 4 including the further step of employing the consistency measures to obtain an estimate of subjective scores.

6. The method of claim 5 including the further step of mapping the consistency measures to a speech quality score using a mapping comprising Multivariate Adaptive Regression Splines.

7. The method of claim 1 wherein the perceptual features are assessed with Gaussian Mixture Models to form indicators of speech quality.

8. Apparatus operable to provide a single-end speech quality measurement, comprising: a feature extraction module which extracts, frame-by-frame, perceptual features from a received speech signal that has been processed by network equipment, transmitted on a communications link, or both; a time segmentation module which classifies each frame based on the perceptual features into a class selected from a set of classes including voiced and unvoiced; a statistical reference model generated prior to extraction of the perceptual features, the reference model including at least one statistical model for each class of the set of classes; a consistency calculation module which, for the frames of each class, operates in response to output from the feature extraction module to assess the perceptual features with a statistical model of that class to form indicators of subjective speech quality without reference to a corresponding speech signal that has not been processed by network equipment, transmitted on a communications link, or both, including assessing at least some unvoiced frames; and a scoring module which employs the indicators of speech quality from different classes to produce a speech quality score without reference to a corresponding speech signal that has not been processed by network equipment, transmitted on a communications link, or both.

9. The apparatus of claim 8 wherein the consistency calculation module is further operable to separately model a probability distribution of the features for each class and different classes of speech signals with the statistical models.

10. The apparatus of claim 9 wherein the classes include inactive.

11. The apparatus of claim 9 wherein the consistency calculation module is further operable to calculate a consistency measure indicative of speech quality for each class separately with a plurality of Gaussian Mixture Models.

12. The apparatus of claim 11 further including a mapping module operable to employ the consistency measures to obtain an estimate of subjective scores.

13. The apparatus of claim 12 wherein the mapping module employs a mapping optimized using Multivariate Adaptive Regression Splines.

14. The apparatus of claim 8 wherein the statistical reference model includes Gaussian Mixture Models.

Description

BRIEF DESCRIPTION OF THE FIGURES

(1) FIG. 1 is a block diagram of a non-intrusive measurement technique including a statistical reference model.

DETAILED DESCRIPTION

(2) FIG. 1 illustrates a relatively easily calculable non-intrusive measurement technique. The input is a speech (“test”) signal for which a subjective quality score is to be estimated (100), e.g., a speech signal that has been processed by network equipment, transmitted on a communications link, or both. A feature extraction module (102) is employed to extract perceptual features, frame by frame, from the test signal. A time segmentation module (104) labels the feature vector of each frame as belonging to one of three possible segment classes: voiced, unvoiced, or inactive. In a separate process, statistical or probability models such as Gaussian Mixture Models are formed. The terms “statistical model” and “statistical reference model” as used herein encompass probability models, statistical probability models and the like, as those terms are understood in the art. Different models may be formed for different classes of speech signals. For instance, one class could be high-quality, undistorted speech signal. Other classes could be speech impaired by different types of distortions. A distinct model may be used for each of the segment classes in each speech signal class, or one single model may be used for a speech class with no distinction between segments. The different statistical models together comprise a reference model (106) of the behavior of speech features. Features extracted from the test signal (100) are assessed using the reference model by calculating a “consistency” measure with respect to each statistical model via a consistency calculation module (108). The consistency values serve as indicators of speech quality and are mapped to an estimated subjective score, such as Mean Opinion Score (“MOS”), degradation mean opinion score (“DMOS”), or some other type of subjective score, using a mapping module (110), thereby producing an estimated score (112).

(3) Referring now to the feature extraction module (102), perceptual linear prediction (“PLP”) cepstral coefficients serve as primary features and are extracted from the speech signal every 10 ms. The coefficients are obtained from an “auditory spectrum” constructed to exploit three psychoacoustic precepts: critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law. The auditory spectrum is approximated by an all-pole auto-regressive model, the coefficients of which are transformed to PLP cepstral coefficients. The order of the auto-regressive model determines the amount of detail in the auditory spectrum preserved by the model. Higher-order models tend to preserve more speaker-dependent information. Since the illustrated embodiment is directed to measuring quality variation due to the transmission system rather than the speaker, speaker independence is a desirable property. In the illustrated embodiment, fifth-order PLP coefficients, as described in H. Hermansky, “Perceptual linear prediction (PLP) analysis of speech,” J. Acoust. Soc. Amer., vol. 87, pp. 1738-1752, 1990 (“Hermansky”), which is incorporated by reference, are employed as speaker-independent speech spectral parameters. Other types of features, such as RASTA-PLP, may also be employed in lieu of PLP.
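The final step of the feature extraction, transforming the all-pole model coefficients to cepstral coefficients, can be illustrated with the standard recursion for an all-pole filter. The following is a minimal sketch only: the function name `ar_to_cepstrum` and the predictor sign convention H(z) = 1 / (1 - sum_k a_k z^-k) are assumptions made for illustration, the gain term is omitted, and the earlier auditory-spectrum stages described in Hermansky are not reproduced.

```python
def ar_to_cepstrum(a, n_ceps):
    """Convert all-pole predictor coefficients to cepstral coefficients.

    `a` holds a_1..a_p for the assumed model H(z) = 1 / (1 - sum_k a_k z^-k).
    Returns c_1..c_n_ceps (the gain-dependent c_0 is not computed).
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # Direct contribution of a_n (zero once n exceeds the model order).
        acc = a[n - 1] if n <= p else 0.0
        # Recursive contribution: sum over k of (k/n) * c_k * a_(n-k).
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


# Illustrative fifth-order call; the coefficient values are placeholders,
# not values taken from any real speech frame.
example = ar_to_cepstrum([0.9, -0.4, 0.2, -0.1, 0.05], 5)
```

For a single-pole model with a_1 = r, the recursion reproduces the known series r, r^2/2, r^3/3, which is a convenient sanity check.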

(4) Referring now to the time segmentation module (104), time segmentation is employed to separate the speech frames into different classes. Each class appears to exert a different influence on the overall speech quality. Time segmentation is performed using a voice activity detector (“VAD”) and a voicing detector. The VAD identifies each 10-ms speech frame as being active or inactive. The voicing detector further labels active frames as voiced or unvoiced. In the illustrated embodiment the VAD from ITU-T Rec. G.729-Annex B, A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70, International Telecommunication Union, Geneva, Switzerland, November 1996, which is incorporated by reference, is employed.
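The two-stage labeling described above can be sketched as follows. This is an illustrative sketch only: the helper names `split_frames` and `label_frames` are hypothetical, and the per-frame VAD and voicing decisions are taken as given inputs, since neither the G.729 Annex B VAD nor any particular voicing detector is implemented here.

```python
def split_frames(samples, sample_rate_hz, frame_ms=10):
    """Cut a sample sequence into consecutive 10-ms frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def label_frames(vad_active, is_voiced):
    """Combine per-frame VAD and voicing decisions into segment classes.

    vad_active[i] is True when frame i is active speech.
    is_voiced[i] is consulted only for active frames.
    """
    labels = []
    for active, voiced in zip(vad_active, is_voiced):
        if not active:
            labels.append("inactive")
        elif voiced:
            labels.append("voiced")
        else:
            labels.append("unvoiced")
    return labels
```

At an 8 kHz sampling rate, for example, each 10-ms frame holds 80 samples.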

(5) Referring to the GMM reference model (106), where u is a K-dimensional feature vector, a Gaussian mixture density is a weighted sum of M component densities as

(6) $$p(u \mid \lambda) = \sum_{i=1}^{M} \alpha_i \, b_i(u) \tag{Eq. 1}$$
where $\alpha_i \ge 0$, $i = 1, \ldots, M$, are the mixture weights, with $\sum_{i=1}^{M} \alpha_i = 1$, and $b_i(u)$, $i = 1, \ldots, M$, are K-variate Gaussian densities with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The parameter list $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ defines a particular Gaussian mixture density, where $\lambda_i = \{\mu_i, \Sigma_i, \alpha_i\}$. GMM parameters are initialized using the k-means algorithm described in A. Gersho and R. Gray, Vector Quantization and Signal Compression. Norwell, MA: Kluwer, 1992, which is incorporated by reference, and estimated using the expectation-maximization (“EM”) algorithm described in A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Statistical Society, vol. 39, pp. 1-38, 1977, which is incorporated by reference. The EM algorithm iterations produce a sequence of models with monotonically non-decreasing log-likelihood (“LL”) values. The algorithm is deemed to have converged when the difference of LL values between two consecutive iterations drops below $10^{-3}$.
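Evaluating the mixture density of Eq. 1 and checking the stated convergence criterion might look like the sketch below. It is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the covariance matrices are assumed to be diagonal (the text does not restrict them), and the density is computed in the log domain with the log-sum-exp trick for numerical stability. K-means initialization and the EM updates themselves are not shown.

```python
import math


def log_gaussian_diag(u, mean, var):
    """Log of a K-variate Gaussian density with diagonal covariance (assumed)."""
    k = len(u)
    log_det = sum(math.log(v) for v in var)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(u, mean, var))
    return -0.5 * (k * math.log(2.0 * math.pi) + log_det + maha)


def gmm_log_density(u, weights, means, variances):
    """log p(u | lambda) for the weighted sum of component densities in Eq. 1."""
    terms = [
        math.log(w) + log_gaussian_diag(u, m, v)
        for w, m, v in zip(weights, means, variances)
        if w > 0.0  # zero-weight components contribute nothing
    ]
    peak = max(terms)
    # log-sum-exp: factor out the largest term before exponentiating.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def em_converged(prev_ll, curr_ll, tol=1e-3):
    """True when the LL gain between consecutive EM iterations is below tol."""
    return (curr_ll - prev_ll) < tol
```

Working with log-densities here is also convenient downstream, because the consistency measure is defined on the logarithm of the density.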

(7) Referring specifically to the reference model (106), a GMM is used to model the PLP cepstral coefficients of each class of speech frames. For instance, consider the class of clean speech signals. Three different Gaussian mixture densities $p_{\text{class}}(u \mid \lambda)$ are trained. The subscript “class” represents either voiced, unvoiced, or inactive frames. In principle, by evaluating a statistical model at the PLP cepstral coefficients $x$ of the test signal, i.e., $p_{\text{class}}(x \mid \lambda)$, a measure of consistency between the coefficient vector and the statistical model is obtained. Voiced coefficient vectors are applied to $p_{\text{voiced}}(u \mid \lambda)$, unvoiced vectors to $p_{\text{unvoiced}}(u \mid \lambda)$, and inactive vectors to $p_{\text{inactive}}(u \mid \lambda)$.

(8) Referring now to the consistency calculation module (108), it should be noted that a simplifying assumption is made that vectors between frames are independent. Improved performance might be obtained from more sophisticated approaches that model the statistical dependency between frames, such as Markov modeling. Nevertheless, a model with low computational complexity has benefits as already discussed above. For a given speech signal whose feature vectors have been classified as described above, the consistency between the feature vectors of a class and the statistical model of that class is calculated as

(9) $$c_{\text{class}}(x_1, \ldots, x_{N_{\text{class}}}) = \frac{1}{N_{\text{class}}} \sum_{j=1}^{N_{\text{class}}} \log\big(p_{\text{class}}(x_j \mid \lambda)\big) \tag{Eq. 2}$$
where $x_1, \ldots, x_{N_{\text{class}}}$ are the feature vectors in the class, and $N_{\text{class}}$ is the number of such vectors in the statistical model class. Larger $c_{\text{class}}$ indicates greater consistency. $c_{\text{class}}$ is set to zero whenever $N_{\text{class}}$ is zero. For each class, the product of the consistency measure $c_{\text{class}}$ and the fraction of frames of that class in the speech signal is calculated. The products for all the model classes serve as quality indicators to be mapped to an objective estimate of the subjective score value.
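The consistency measure of Eq. 2 and the frame-fraction weighting can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the per-frame log-densities are assumed to have been computed already by evaluating each frame against the statistical model of its own class.

```python
def class_consistency(log_densities):
    """Eq. 2: mean log-density over the frames of one class.

    Returns zero when the class has no frames, as stated in the text.
    """
    if not log_densities:
        return 0.0
    return sum(log_densities) / len(log_densities)


def quality_indicators(log_densities_by_class):
    """Weight each class consistency by that class's share of all frames."""
    total = sum(len(v) for v in log_densities_by_class.values())
    indicators = {}
    for name, values in log_densities_by_class.items():
        fraction = len(values) / total if total else 0.0
        indicators[name] = class_consistency(values) * fraction
    return indicators


# Placeholder log-densities for a three-frame signal (illustrative values only).
demo = quality_indicators({
    "voiced": [-2.0, -4.0],
    "unvoiced": [-6.0],
    "inactive": [],
})
```

In this toy input the voiced class has a mean log-density of -3.0 over two of three frames and the unvoiced class has -6.0 over one of three frames, so both indicators come out near -2.0, while the empty inactive class yields zero.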

(10) Referring now to the mapping module (110), mapping functions which may be utilized include multivariate polynomial regression and multivariate adaptive regression splines (“MARS”), as described in J. H. Friedman, “Multivariate adaptive regression splines,” The Annals of Statistics, vol. 19, no. 1, pp. 1-141, March 1991. With MARS, the mapping is constructed as a weighted sum of basis functions, each taking the form of a truncated spline.
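The form of such a mapping, a weighted sum of truncated-spline basis functions, can be illustrated with first-order hinge functions. The sketch below shows only how a fitted model would be evaluated: the names `hinge` and `mars_predict`, the term layout, and every knot and weight in the example are hypothetical placeholders, not values trained on subjective scores. The fitting procedure described by Friedman is not shown.

```python
def hinge(x, knot, sign):
    """Truncated linear spline: max(0, sign * (x - knot)), with sign = +1 or -1."""
    return max(0.0, sign * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate a weighted sum of basis functions at indicator vector x.

    Each term is (weight, factors), where factors is a list of
    (feature_index, knot, sign) tuples whose hinges are multiplied together.
    """
    y = intercept
    for weight, factors in terms:
        basis = 1.0
        for idx, knot, sign in factors:
            basis *= hinge(x[idx], knot, sign)
        y += weight * basis
    return y


# Placeholder model with two single-factor terms (illustrative numbers only).
score = mars_predict(
    [2.0, -1.0],
    intercept=1.0,
    terms=[(0.5, [(0, 1.0, +1)]), (2.0, [(1, 0.0, -1)])],
)
```

Here `x` would hold the per-class quality indicators, and the returned value would be the estimated subjective score.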

(11) While the invention is described through the above exemplary embodiments, it will be understood by those of ordinary skill in the art that modification to and variation of the illustrated embodiments may be made without departing from the inventive concepts herein disclosed. Moreover, while the preferred embodiments are described in connection with various illustrative structures, one skilled in the art will recognize that the system may be embodied using a variety of specific structures. Accordingly, the invention should not be viewed as limited except by the scope and spirit of the appended claims.