Evaluation of speech quality in audio or video signals

Abstract

An apparatus for generating a score signal representing the quality of an audio or video signal supplied to the apparatus is proposed. The apparatus comprises: an input for supplying the audio or video signal; and a computing unit implementing a neural network, the computing unit being supplied with the audio or video signal and producing a score signal that represents the quality of the supplied audio or video signal, the score signal representing at least one predefined quality parameter of the audio or video signal. The neural network is set up by being trained with training data of a specific transmission standard and/or codec used for generating the audio or video data.

Claims

1. An apparatus for generating a score signal representing the quality of an audio or video signal supplied to the apparatus, the apparatus comprising: an input for supplying the supplied audio or video signal; and a computing unit implementing a neural network, the computing unit being supplied with the supplied audio or video signal, and producing a score signal representing the quality of the supplied audio or video signal, the score signal representing at least one predefined quality parameter of the supplied audio or video signal, the neural network being set up by being trained with training data of a specific transmission standard and/or codec used for generating data of the supplied audio or video signal; wherein the neural network simultaneously produces a plurality of quality score values according to different ITU (International Telecommunications Union) measurement methods, and wherein the supplied audio or video signal is a digital audio signal and the score signal represents simultaneously the speech quality according to at least two of the following ITU-T (International Telecommunications Union-Telecommunication Standard Sector) speech quality testing methods: PESQ (Perceptual Evaluation of Speech Quality); PEAQ (Perceptual Evaluation of Audio Quality); and POLQA (Perceptual Objective Listening Quality Analysis).

2. The apparatus of claim 1, wherein the neural network is not supplied with a reference signal.

3. The apparatus of claim 1, wherein the supplied audio or video signal is a speech signal and the score signal represents the ITU (International Telecommunications Union) P.800 value LQS (Listening Quality Subjective).

4. The apparatus according to claim 1, wherein the neural network is obtained by the following supervised learning steps: feeding a training audio or video signal to the neural network to obtain a first training output signal; feeding said training audio or video signal to an objective analytical quality testing device, together with a reference signal to obtain a second training output signal; and comparing the first and second training score signals output by the neural network and the analytical quality testing device, and using the result of the comparison for training the neural network.

5. The apparatus of claim 1, comprising a user interface for inputting information as to one or more of the transmission standard, codec and fading data as to the supplied audio or video signal.

6. The apparatus of claim 1, wherein the supplied audio or video signal is a VoIP signal.

7. An apparatus for generating a score signal representing the quality of a speech signal supplied to the apparatus, the apparatus comprising: an input for supplying a supplied speech signal; and a computing unit implementing a neural network, the computing unit being supplied with the supplied speech signal, and producing a score signal representing the quality of the supplied speech signal, the score signal representing at least one predefined quality parameter of the supplied speech signal, wherein the score signal represents the ITU (International Telecommunications Union) P.800 value LQS (Listening Quality Subjective), wherein the neural network simultaneously produces a plurality of quality score values according to different ITU (International Telecommunications Union) measurement methods, and wherein the LQS value is produced by at least two of the following ITU-T (International Telecommunications Union-Telecommunication Standard Sector) speech quality testing methods: PESQ (Perceptual Evaluation of Speech Quality); PEAQ (Perceptual Evaluation of Audio Quality); and POLQA (Perceptual Objective Listening Quality Analysis).

8. An apparatus for generating a score signal representing the quality of an audio or video signal supplied to the apparatus, wherein the apparatus implements a Siamese network and comprises: a first neural network being supplied with a supplied reference audio or video signal and designed to generate a first output signal; a second neural network being supplied with a supplied audio or video signal for which a score signal is to be generated, the second neural network being designed to generate a second output signal; and a third neural network supplied with the first and second output signal, respectively, and generating the score signal, wherein the third neural network simultaneously produces a plurality of quality score values according to different ITU (International Telecommunications Union) measurement methods, and wherein the supplied audio or video signal is a digital audio signal and the score signal represents simultaneously the speech quality according to at least two of the following ITU-T (International Telecommunications Union-Telecommunication Standard Sector) speech quality testing methods: PESQ (Perceptual Evaluation of Speech Quality); PEAQ (Perceptual Evaluation of Audio Quality); and POLQA (Perceptual Objective Listening Quality Analysis).

9. A computer-implemented method for generating a score signal representing the quality of an audio or video signal, comprising the steps of: supplying a supplied audio or video signal; and supplying a trained neural network with the supplied audio or video signal, the neural network producing a score signal representing the quality of the supplied audio or video signal, the score signal representing at least one predefined quality parameter of the supplied audio or video signal, wherein the training data for the trained neural network are specific for a transmission standard and/or codec used for generating the supplied audio or video signal, wherein the neural network simultaneously produces a plurality of quality score values according to different ITU (International Telecommunications Union) measurement methods, and wherein the supplied audio or video signal is a digital audio signal and the score signal represents simultaneously the speech quality according to at least two of the following ITU-T (International Telecommunications Union-Telecommunication Standard Sector) speech quality testing methods: PESQ (Perceptual Evaluation of Speech Quality); PEAQ (Perceptual Evaluation of Audio Quality); and POLQA (Perceptual Objective Listening Quality Analysis).

10. A computer-implemented method for generating a score signal representing the quality of an audio or video signal, comprising the steps of: supplying a speech signal; and supplying a trained neural network with the supplied speech signal, the trained neural network producing a score signal representing the quality of the supplied speech signal, the score signal representing at least one predefined quality parameter of the supplied speech signal, wherein the score signal represents the ITU (International Telecommunications Union) P.800 value LQS (Listening Quality Subjective), wherein the neural network simultaneously produces a plurality of quality score values according to different ITU (International Telecommunications Union) measurement methods, and wherein the LQS value is produced by at least two of the following ITU-T (International Telecommunications Union-Telecommunication Standard Sector) speech quality testing methods: PESQ (Perceptual Evaluation of Speech Quality); PEAQ (Perceptual Evaluation of Audio Quality); and POLQA (Perceptual Objective Listening Quality Analysis).

Description

(1) Further aspects, features and advantages of the invention will now become evident by means of the following explanation of non-limiting embodiments of the invention, when taken in conjunction with the figures of the enclosed drawings:

(2) FIG. 1 shows a prior art system for producing each of these scores,

(3) FIG. 2 shows the implementation of a single and objective model according to ITU recommendations,

(4) FIG. 3 shows the implementation of an objective model according to the PESQ algorithm,

(5) FIG. 4 shows an inventive system using a neural network for producing the LQS score,

(6) FIG. 5 shows the training of a neural network used in the invention, together with the production of training data,

(7) FIG. 6 shows the details as to a neural network used in the context of the present invention,

(8) FIG. 7 shows details as to the training of the neural network of FIG. 6, and

(9) FIG. 8 shows an alternative embodiment in which the neural network is implemented as a so-called Siamese (neural) network.

(10) The general aspects of the present invention will now be explained with reference to FIG. 4.

(11) Note that the system 2 according to the invention may be a system emulating a transmission standard. It may furthermore comprise a fading unit simulating different scenarios (urban, rural, etc.). It may furthermore comprise a video/audio analyzer. The input signal may be an analog or a digital signal.

(12) As can be seen by comparison with FIG. 1, according to this aspect the invention proposes to use a neural network 10, alternatively or in addition to the human/subjective test, for producing, for example, the LQS score parameter. In other words, according to the invention, the output signal (feedback signal) of the system 2 under test is used as the input signal for a neural network 10, thus producing the LQS score in an objective manner. Note that the LQS score may be produced by one or more of at least the following ITU methods: PESQ, PEAQ, POLQA.

(13) Other measurements can be applied in addition or alternatively to the mentioned ones.

(14) As can be seen from FIG. 4, preferably, no reference signal (input signal to the system under test) is required to be fed to the neural network 10.

(15) As can be seen in FIG. 5, the neural network is trained for a specific system 2, i.e. preferably used for a system using a specific codec and/or transmission standard.

(16) The codec may be, e.g., one or more of: G.711, G.729, G.726, iLBC, G.729a, G.723.1, G.728.
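Since the neural network is trained for a specific codec and/or transmission standard, an implementation would plausibly keep one trained model per combination and select it from user-supplied information. The following is a minimal sketch under that assumption; the registry, the `select_model` helper and the model file names are hypothetical and are not described in the source.

```python
# Hypothetical registry: one trained model per (codec, transmission standard).
# The file names are placeholders, not artifacts described in the source.
MODEL_REGISTRY = {
    ("G.711", "4G"): "lqs_g711_4g.model",
    ("G.729", "3G"): "lqs_g729_3g.model",
    ("G.723.1", "5G"): "lqs_g7231_5g.model",
}


def select_model(codec, standard):
    """Return the model identifier trained for this codec/standard pair.

    Raises KeyError instead of falling back to another model, because a
    network trained on different data was not set up for this signal.
    """
    key = (codec, standard)
    if key not in MODEL_REGISTRY:
        raise KeyError(f"no model trained for codec={codec!r}, standard={standard!r}")
    return MODEL_REGISTRY[key]


chosen = select_model("G.711", "4G")
print(chosen)
```

Failing loudly on an unknown pair is a design choice of this sketch; a real system might instead prompt the user for a supported combination.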

(17) During the training of the neural network, the output signal (degraded signal) of the system under test is used as an input signal for the neural network 10.

(18) Furthermore, the reference signal (input signal) is also fed to the objective model 6, and an output thereof is used as training data by comparing these objectively produced training data with the output of the neural network 10. Using methods known as such, these training data can be used for supervised learning of the neural network 10. Preferably, thus, this neural network 10 is trained using a specific codec and/or transmission standard (3G, 4G, 5G etc.).
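The production of training data described above can be sketched as follows. This is an illustration only: `system_under_test` and `objective_model` are toy stand-ins introduced here (additive noise and a signal-to-noise ratio in dB), not a real codec chain and not PESQ, PEAQ or POLQA. The point shown is that the reference signal is needed only to compute the target score, while the network input is the degraded signal alone.

```python
import math
import random


def objective_model(reference, degraded):
    """Toy full-reference stand-in (NOT PESQ/PEAQ/POLQA): SNR in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    return 10.0 * math.log10(signal / noise) if noise > 0 else float("inf")


def system_under_test(reference, noise_level, rng):
    """Toy stand-in for the codec/transmission chain: adds Gaussian noise."""
    return [r + rng.gauss(0.0, noise_level) for r in reference]


def make_training_set(references, noise_levels, seed=0):
    """Return (degraded signal, target score) pairs for supervised learning."""
    rng = random.Random(seed)
    dataset = []
    for reference in references:
        for level in noise_levels:
            degraded = system_under_test(reference, level, rng)
            target = objective_model(reference, degraded)
            # Only the degraded signal is kept as network input; the
            # reference is used solely to compute the target score.
            dataset.append((degraded, target))
    return dataset


# One synthetic reference signal: five cycles of a sine over 200 samples.
reference = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
dataset = make_training_set([reference], noise_levels=[0.01, 0.1, 0.5])
for degraded, target in dataset:
    print(f"{len(degraded)} samples, target score {target:.1f}")
```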

(19) FIG. 6 shows a possible implementation of the neural network according to the invention. The neural network thus serves as a model of the human ear, which is otherwise used in the human/subjective test.

(20) The neural network may be any known type of neural network, such as, for example, a DNN, CNN or RNN.

(21) The audio signal (which is the output signal/degraded signal) is fed to the input layer 20 of the neural network.
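The source does not state in which form the audio signal is presented to the input layer 20. One possibility, given here purely as an assumption, is to convert the degraded signal into a fixed-length vector of frame-wise log energies. The frame length, hop size, vector length and both helper names below are hypothetical.

```python
import math

SILENCE = math.log(1e-10)  # padding value used for missing frames


def frame_log_energies(samples, frame_len=160, hop=80):
    """Split the signal into overlapping frames; return one log energy per frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return feats


def fixed_length(feats, n):
    """Pad with SILENCE or truncate so that the vector has exactly n entries."""
    return (feats + [SILENCE] * n)[:n]


# One second of a 440 Hz tone sampled at 8 kHz as a stand-in degraded signal.
samples = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
feats = frame_log_energies(samples)
input_vector = fixed_length(feats, 128)
print(len(feats), len(input_vector))
```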

(22) The output layer 21 of the neural network 10 produces a quality score, preferably the LQS quality score, according to one or more ITU measurement methods. Thus, the neural network 10 maps the supplied audio signal to one or more quality scores.

(23) Preferably, a plurality of quality score values according to different ITU measurement methods is produced simultaneously, such as, for example, values according to PESQ, PEAQ, and/or POLQA.
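Producing several score values simultaneously can be illustrated with a network whose output layer has one unit per measurement method, so that a single forward pass yields all values. The sketch below assumes a small fully connected network with one hidden layer and untrained random weights; the layer sizes and helper names are hypothetical, and the printed numbers carry no meaning as quality scores.

```python
import math
import random

SCORE_NAMES = ("PESQ", "PEAQ", "POLQA")  # one output unit per method


def init_network(n_inputs, n_hidden, n_outputs, seed=0):
    """Create an untrained network with small random weights and zero biases."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        return {
            "w": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
            "b": [0.0] * n_out,
        }

    return {"hidden": layer(n_inputs, n_hidden), "out": layer(n_hidden, n_outputs)}


def forward(net, features):
    """Single forward pass returning all score values at once."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(net["hidden"]["w"], net["hidden"]["b"])
    ]
    outputs = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(net["out"]["w"], net["out"]["b"])
    ]
    return dict(zip(SCORE_NAMES, outputs))


net = init_network(n_inputs=4, n_hidden=8, n_outputs=len(SCORE_NAMES))
scores = forward(net, [0.1, -0.3, 0.7, 0.2])
print(scores)
```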

(24) FIG. 7 shows in a simplified manner the training of the neural network 10.

(25) The input layer of the neural network 10 is provided with the output signal (degraded signal), which may be a transmitted or stored audio file 25.

(26) This degraded output signal, together with the input signal as reference signal, is also fed to a classical (objective) measurement according to, for example, ITU standards.

(27) The output score values according to different ITU measurement methods, such as, for example, the PESQ, PEAQ and POLQA values, are then compared with the output of the neural network 10 in order to produce a signal for supervised learning of the neural network 10.
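The comparison step can be sketched as a squared-error training loop. This is a simplified illustration under stated assumptions: a linear model stands in for the neural network 10, the input features are random numbers, and the target values come from a hidden linear map that stands in for the objectively measured PESQ, PEAQ and POLQA values. None of these choices is taken from the source.

```python
import random

METHODS = ("PESQ", "PEAQ", "POLQA")


def predict(weights, features):
    """One output value per measurement method (linear stand-in for the network)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mean_squared_error(weights, dataset):
    total = 0.0
    for features, targets in dataset:
        total += sum((p - t) ** 2 for p, t in zip(predict(weights, features), targets))
    return total / (len(dataset) * len(METHODS))


def train(weights, dataset, lr=0.05, epochs=200):
    """Stochastic gradient descent on the squared error between the model
    output and the objectively produced target scores."""
    for _ in range(epochs):
        for features, targets in dataset:
            errors = [p - t for p, t in zip(predict(weights, features), targets)]
            for k, err in enumerate(errors):
                for j, x in enumerate(features):
                    weights[k][j] -= lr * err * x
    return weights


rng = random.Random(0)
# Hidden linear map: toy stand-in for the objective measurement results.
hidden_map = [[rng.uniform(-1, 1) for _ in range(4)] for _ in METHODS]
dataset = []
for _ in range(50):
    features = [rng.uniform(-1, 1) for _ in range(4)]
    dataset.append((features, predict(hidden_map, features)))

weights = [[0.0] * 4 for _ in METHODS]
loss_before = mean_squared_error(weights, dataset)
train(weights, dataset)
loss_after = mean_squared_error(weights, dataset)
print(loss_before, loss_after)
```

All three targets are fitted in the same loop, which mirrors the idea of one network learning several measurement methods at once.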

(28) In FIG. 8, an alternative approach according to the invention is shown, in which a so-called Siamese network is used for producing the (speech) quality scores, such as, for example, according to the PESQ, PEAQ or POLQA measurement method of the ITU.

(29) According to this approach, the input signal (reference signal) is fed to a first neural network 30 producing a first output signal 31.

(30) The degraded signal (transmitted audio signal) is fed to a second neural network 32, producing a second output signal 32.

(31) The first and the second output signal 31, 32, respectively, are fed to a third neural network which produces the PESQ, PEAQ or POLQA values.
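The data flow of this arrangement can be sketched as follows. The sketch assumes that the first and second neural networks share their weights, as is common for Siamese networks but not stated in the source, and that the third network receives the two output signals concatenated. Layer sizes, helper names and the untrained random weights are hypothetical.

```python
import math
import random

METHODS = ("PESQ", "PEAQ", "POLQA")


def make_layer(n_in, n_out, rng):
    """Weight matrix with small random entries (untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(layer, x, activation=None):
    out = [sum(w * v for w, v in zip(row, x)) for row in layer]
    return [activation(v) for v in out] if activation else out


rng = random.Random(0)
# Assumption: both branches use the same encoder weights.
encoder = make_layer(n_in=6, n_out=4, rng=rng)
# Third network: maps both output signals to one value per method.
head = make_layer(n_in=8, n_out=len(METHODS), rng=rng)


def siamese_scores(reference_feats, degraded_feats):
    first_output = apply_layer(encoder, reference_feats, math.tanh)   # first network
    second_output = apply_layer(encoder, degraded_feats, math.tanh)   # second network
    combined = first_output + second_output  # list concatenation of both outputs
    return dict(zip(METHODS, apply_layer(head, combined)))


reference_feats = [0.2, 0.4, -0.1, 0.0, 0.3, -0.5]
degraded_feats = [0.1, 0.5, -0.2, 0.1, 0.2, -0.4]
scores = siamese_scores(reference_feats, degraded_feats)
print(scores)
```

Unlike the single-network variant, this arrangement needs the reference signal at inference time as well.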


CPC classification

H04M3/2236 (ELECTRICITY)
H04M7/006 (ELECTRICITY)
G10L25/69 (PHYSICS)
G10L25/30 (PHYSICS)
G10L19/08 (PHYSICS)
G10L25/90 (PHYSICS)
G10L21/0208 (PHYSICS)

International classification

G10L25/69 (PHYSICS)
G10L21/0208 (PHYSICS)
H04M3/22 (ELECTRICITY)
H04M7/00 (ELECTRICITY)
G10L19/08 (PHYSICS)
G10L25/90 (PHYSICS)
