METHOD FOR RATING THE SPEECH QUALITY OF A SPEECH SIGNAL BY WAY OF A HEARING DEVICE
20220068294 · 2022-03-03
CPC Classification
H04R25/407 (Electricity)
H04R25/43 (Electricity)
Abstract
A method for rating the speech quality of a speech signal by a hearing device. An acousto-electric input transducer records sound containing the speech signal and converts it into an input audio signal. At least one articulatory and/or prosodic property of the speech signal is quantitatively acquired through analysis of the input audio signal, and a quantitative measure of the speech quality is derived based on the articulatory and/or prosodic property. A corresponding hearing device contains an acousto-electric input transducer configured to record a sound and convert it into an input audio signal, and a signal processing apparatus designed to quantitatively acquire at least one articulatory and/or prosodic property of a speech signal component contained in the input audio signal based on analysis of the input audio signal, and to derive a quantitative measure of the speech quality based on the at least one articulatory and/or prosodic property.
Claims
1. A method for rating a speech quality of a speech signal by a hearing device, the method comprising: recording a sound with an acousto-electric input transducer of the hearing device, the sound containing the speech signal from surroundings of the hearing device, and converting the sound into an input audio signal; quantitatively acquiring at least one articulatory property and/or prosodic feature of the speech signal through analysis of the input audio signal by a signal processing operation, and deriving a quantitative measure of the speech quality based on the at least one articulatory property and/or prosodic feature.
2. The method according to claim 1, the method further comprising acquiring, as articulatory property of the speech signal, at least one of: a characteristic variable correlated with the precision of predefined formants of vowels in the speech signal; a characteristic variable correlated with the dominance of consonants and/or fricatives in the speech signal; or a characteristic variable correlated with the precision of transitions from voiced and unvoiced sounds.
3. The method according to claim 2, wherein the step of acquiring the characteristic variable correlated with the dominance of consonants in the speech signal, comprises: calculating a first energy contained in a low frequency range; calculating a second energy contained in a frequency range higher than the low frequency range; and forming the characteristic variable based on a ratio, and/or a ratio weighted over the respective bandwidths of the frequency ranges, of the first energy and the second energy.
4. The method according to claim 2, wherein the step of acquiring the characteristic variable correlated with precision of the transitions from voiced and unvoiced sounds, comprises: making a distinction between voiced temporal sequences and unvoiced temporal sequences based on a correlation measurement and/or based on a zero crossing rate; ascertaining a transition from a voiced temporal sequence to an unvoiced temporal sequence or from an unvoiced temporal sequence to a voiced temporal sequence; ascertaining the energy contained in the voiced or unvoiced temporal sequence prior to the transition for at least one frequency range, and ascertaining the energy contained in the unvoiced or voiced temporal sequence following the transition for the at least one frequency range; and ascertaining the characteristic variable based on the energy prior to the transition and based on the energy following the transition.
5. The method according to claim 2, wherein the step of acquiring the characteristic variable correlated with the precision of predefined formants of vowels in the speech signal, comprises: ascertaining a signal component of the speech signal in at least one formant range in a frequency space; ascertaining a signal variable correlated with a level for the signal component of the speech signal in the at least one formant range; and ascertaining the characteristic variable based on a maximum value and/or based on a temporal stability of the signal variable correlated with the level.
6. The method according to claim 1, the method further comprising: acquiring a fundamental frequency of the speech signal in a temporally resolved manner; and ascertaining a characteristic variable characteristic of a temporal stability of the fundamental frequency as a prosodic feature of the speech signal.
7. The method according to claim 1, the method further comprising: acquiring a variable correlated with a volume in a temporally resolved manner for the speech signal; forming, over a predefined time interval, a quotient of a maximum value of the variable correlated with the volume to a mean of said variable ascertained over the predefined time interval; and ascertaining a characteristic variable as prosodic feature of the speech signal based on the quotient that is formed from the maximum value and the mean of the variable correlated with the volume over the predefined time interval.
8. The method according to claim 1, the method further comprising: ascertaining at least two characteristic variables, each characteristic of articulatory properties and/or prosodic features, based on the analysis of the input audio signal; and forming the quantitative measure of the speech quality based on a product of the ascertained at least two characteristic variables and/or based on a weighted mean of the ascertained at least two characteristic variables.
9. The method according to claim 1, the method further comprising: detecting speech activity and/or ascertaining a signal-to-noise ratio in the input audio signal before the at least one articulatory property and/or prosodic feature of the speech signal is acquired; and performing analysis regarding the at least one articulatory property and/or prosodic feature of the speech signal based on the detected speech activity or the ascertained signal-to-noise ratio.
10. A hearing device, comprising: an acousto-electric input transducer configured to record a sound from surroundings of the hearing device and to convert said sound into an input audio signal; a signal processing unit configured to: quantitatively acquire at least one articulatory property and/or prosodic feature of a component, contained in said input audio signal, of a speech signal based on analysis of said input audio signal; and derive a quantitative measure of a speech quality based on said at least one articulatory property and/or prosodic feature.
11. The hearing device according to claim 10, configured as a hearing aid.
Description
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION OF THE INVENTION
[0045] The input audio signal 8 is fed to a signal processing apparatus 10 of the hearing aid 2, in which the input audio signal 8 is processed appropriately, in particular in accordance with the audiological requirements of the user of the hearing aid 2, and is in the process for example amplified and/or compressed in terms of frequency band. The signal processing apparatus 10 is for this purpose in particular embodied by way of an appropriate signal processor (not illustrated in more detail).
[0046] The signal processing apparatus 10, by processing the input audio signal 8, in this case generates an output audio signal 12 that is converted into an output sound signal 16 of the hearing aid 2 by way of an electro-acoustic output transducer 14. The input transducer 4 is in this case preferably formed by a microphone, and the output transducer 14 is formed for example by a loudspeaker (such as for instance a balanced metal case receiver), but may also be formed by a bone conduction hearing device or the like.
[0047] The sound 6 from the surroundings of the hearing aid 2 that is acquired by the input transducer 4 contains, inter alia, a speech signal 18 from a speaker, not illustrated in more detail, and other sound components 20, which may comprise in particular directional and/or diffuse interfering noise (interfering sound or background noise), but may also contain noise that could be considered to be a payload signal depending on the situation, that is to say for example music or acoustic warning or information signals concerning the surroundings.
[0048] The signal processing operation performed on the input audio signal 8 in the signal processing apparatus 10 in order to generate the output audio signal 12 may in particular comprise suppression of the signal components that represent the interfering noise contained in the sound 6, or relative boosting of the signal components representing the speech signal 18 in relation to the signal components representing the other sound components 20. Frequency-dependent or wideband dynamic compression and/or amplification and noise suppression algorithms may in particular also be applied in this case.
[0049] In order to make the signal components in the input audio signal 8 that represent the speech signal 18 as audible as possible in the output audio signal 12 and nevertheless to be able to give the user of the hearing aid 2 the most natural possible auditory impression in the output sound 16, a quantitative measure of the speech quality of the speech signal 18 should be ascertained in the signal processing apparatus 10 for controlling the algorithms to be applied to the input audio signal 8. This is described in more detail below.
[0051] The first algorithm 25 may in particular also make provision to classify an auditory situation that is created in the sound 6, and to set individual parameters on the basis of the classification, potentially as appropriate for an auditory program provided for a specific auditory situation. In addition to this, the individual audiological requirements of the user of the hearing aid 2 may also be taken into consideration for the first algorithm 25 in order to be able to compensate for a hearing impairment of the user as well as possible by applying the first algorithm 25 to the input audio signal 8.
[0052] If, however, noteworthy speech activity is identified by the speech activity identification VAD (path “y”), then an SNR is next ascertained and compared with a predefined limit value Th.sub.SNR. If the SNR is not above the limit value, that is to say SNR≤Th.sub.SNR, then the first algorithm 25 is again applied to the input audio signal 8 in order to generate the output audio signal 12. If, however, the SNR is above the predefined limit value Th.sub.SNR, that is to say SNR>Th.sub.SNR, then a quantitative measure 30 of the speech quality of the speech signal 18 contained in the input audio signal 8 is ascertained for the further processing of the input audio signal 8 in the manner described below. Articulatory and/or prosodic properties of the speech signal 18 are quantitatively acquired for this purpose. The term speech signal component 26 contained in the input audio signal 8 should in this case be understood to mean those signal components of the input audio signal 8 that represent the speech signal 18 of the sound 6 from which the input audio signal 8 is generated by way of the input transducer 4.
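The gating described in paragraph [0052] can be sketched as a small decision function. This is a minimal illustrative sketch, not the patented implementation; the function name, string return values, and argument names are assumptions introduced here for clarity.

```python
def select_processing(vad_active, snr, th_snr):
    """Decision logic sketched after paragraph [0052] (illustrative):
    the speech-quality measure 30 is only ascertained when speech
    activity is detected AND the SNR exceeds the limit value Th_SNR;
    otherwise the first algorithm 25 remains in use."""
    if not vad_active:
        return "first_algorithm"      # path "n": no noteworthy speech activity
    if snr <= th_snr:
        return "first_algorithm"      # SNR <= Th_SNR: quality rating skipped
    return "assess_speech_quality"    # SNR > Th_SNR: ascertain measure 30
```

A caller would invoke this once per processing frame and switch the signal path accordingly.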
[0053] In order to ascertain said quantitative measure 30, the input audio signal 8 is split into individual signal paths.
[0054] For a first signal path 32 of the input audio signal 8, a centroid wavelength λ.sub.C is first of all ascertained and compared with a predefined limit value for the centroid wavelength Th.sub.λ. If it is identified, on the basis of said limit value of the centroid wavelength Th.sub.λ, that the signal components in the input audio signal 8 are of sufficiently high frequency, then the signal components are selected in the first signal path 32, possibly after appropriately selected temporal smoothing (not illustrated), for a low frequency range NF and a higher frequency range HF above the low frequency range NF. One possible split may for example be such that the low frequency range NF comprises all frequencies f.sub.N≤2500 Hz, in particular f.sub.N≤2000 Hz, and the higher frequency range HF comprises frequencies f.sub.H where 2500 Hz<f.sub.H≤10 000 Hz, in particular 4000 Hz≤f.sub.H≤8000 Hz or 2500 Hz<f.sub.H≤5000 Hz.
[0055] The selection may be made directly in the input audio signal 8 or else be made such that the input audio signal 8 is split into individual frequency bands by way of a filter bank (not illustrated), wherein individual frequency bands are assigned to the low or higher frequency range NF or HF depending on the respective band limits.
[0056] A first energy E1 is then ascertained for the signal contained in the low frequency range NF, and a second energy E2 is ascertained for the signal contained in the higher frequency range HF. A quotient QE is then formed from the second energy E2 as numerator and the first energy E1 as denominator. The quotient QE, if the low and higher frequency ranges NF, HF are selected appropriately, may then be applied as a characteristic variable 33 that is correlated with the dominance of consonants in the speech signal 18. The characteristic variable 33 thus allows a statement about an articulatory property of the speech signal components 26 in the input audio signal 8. A value of the quotient QE>>1 (that is to say QE>Th.sub.QE with a predefined limit value Th.sub.QE>>1 not illustrated in more detail) may thus for example indicate a high dominance of consonants, while a value QE<1 may indicate a low dominance.
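The quotient QE of paragraphs [0054] to [0056] can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: it assumes a precomputed magnitude spectrum with known bin frequencies, and the band limits (2500 Hz, 10 000 Hz) are one of the example splits named above.

```python
def band_energy_ratio(magnitudes, freqs, f_split=2500.0, f_max=10000.0):
    """Illustrative form of the quotient QE = E2 / E1:
    E1 is the energy in the low frequency range NF (f <= f_split),
    E2 the energy in the higher range HF (f_split < f <= f_max).
    `magnitudes` and `freqs` are parallel lists describing a
    magnitude spectrum (an assumption of this sketch)."""
    e1 = sum(m * m for m, f in zip(magnitudes, freqs) if f <= f_split)
    e2 = sum(m * m for m, f in zip(magnitudes, freqs) if f_split < f <= f_max)
    if e1 == 0.0:
        return float("inf")  # no low-band energy: ratio unbounded
    return e2 / e1           # QE >> 1 suggests dominant consonants
```

In practice the bands would come from the hearing device's filter bank, as paragraph [0055] notes, rather than from a raw spectrum.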
[0057] In a second signal path 34, a distinction 36 is made in the input audio signal 8 between voiced temporal sequences V and unvoiced temporal sequences UV based on correlation measurements and/or based on a zero crossing rate of the input audio signal 8. Based on the voiced and unvoiced temporal sequences V and UV, a transition TS from a voiced temporal sequence V to an unvoiced temporal sequence UV is ascertained. The length of a voiced or unvoiced temporal sequence may for example be between 10 and 80 ms, in particular between 20 and 50 ms.
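The zero-crossing-rate criterion mentioned in paragraph [0057] for distinguishing voiced from unvoiced sequences can be sketched as below. This is a minimal illustrative sketch, not the patented implementation; the threshold value 0.25 is an assumption chosen for the example, not taken from the source.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.
    Unvoiced (noise-like) speech tends to have a high rate,
    voiced (periodic) speech a low one."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classify_frame(frame, zcr_threshold=0.25):
    """Label a frame 'V' (voiced) or 'UV' (unvoiced); the threshold
    is a hypothetical value for illustration only."""
    return "UV" if zero_crossing_rate(frame) > zcr_threshold else "V"
```

A correlation measurement (e.g. frame autocorrelation) would typically be combined with this, as the paragraph states.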
[0058] An energy Ev for the voiced temporal sequence V prior to the transition TS and an energy En for the unvoiced temporal sequence UV following the transition TS is then in each case ascertained for at least one frequency range (for example a selection of particularly meaningful frequency bands ascertained as being suitable, for example frequency bands 16 to 23 on the Bark scale, or frequency bands 1 to 15 on the Bark scale). In this case, appropriate energies prior to and following the transition TS may in particular also be ascertained in each case separately for more than one frequency range. It is then determined how the energy changes at the transition TS, for example through a relative change ΔE.sub.TS or through a quotient (not illustrated) of the energies Ev, En prior to and following the transition TS.
[0059] The measure of the change of the energy, that is to say in this case the relative change, is then compared with a limit value Th.sub.E for the energy distribution at transitions, ascertained beforehand for good articulation. A characteristic variable 35 may in particular be formed based on a ratio of the relative change ΔE.sub.TS to said limit value Th.sub.E, or based on a relative deviation of the relative change ΔE.sub.TS from this limit value Th.sub.E. Said characteristic variable 35 is correlated with the articulation of the transitions between voiced and unvoiced sounds in the speech signal 18, and thus makes it possible to draw conclusions as to a further articulatory property of the speech signal components 26 in the input audio signal 8. As a general rule, a transition between voiced and unvoiced temporal sequences is articulated the more precisely, the faster, that is to say the more temporally localized, the change in the energy distribution across the frequency ranges relevant to voiced and unvoiced sounds takes place.
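One possible form of the characteristic variable 35 from paragraphs [0058] and [0059] can be sketched as follows. This is an illustrative sketch only: the source names several variants (ratio, relative deviation, per-band energies), and this code picks just one of them; the function name is an assumption.

```python
def transition_sharpness(e_v, e_uv, th_e):
    """Illustrative characteristic variable 35: the relative change
    of the energy at the V->UV transition, E_v before and E_uv after,
    set in ratio to a previously ascertained limit value Th_E for
    good articulation (paragraph [0059])."""
    delta_e_ts = (e_uv - e_v) / e_v     # relative change of energy at TS
    return abs(delta_e_ts) / th_e       # > 1 suggests a sharply articulated transition
```

A per-frequency-band version would call this once per band (e.g. Bark bands 16 to 23) and combine the results.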
[0060] For the characteristic variable 35, it is however also possible to consider an energy distribution into two frequency ranges (for example the abovementioned frequency ranges in accordance with the Bark scale, or else in the low and upper frequency range NF, HF), for example via a quotient of the respective energies or a comparable characteristic value, and to apply a change in the quotient or the characteristic value across the transition for the characteristic variable. A rate of change of the quotient or of the characteristic variable may thus for example be determined and compared with a reference value, ascertained beforehand as being suitable, for the rate of change.
[0061] Transitions from unvoiced temporal sequences may be considered in the same way in order to form the characteristic variable 35. The specific embodiment, in particular in terms of the frequency ranges and limit or reference values to be used, may generally be achieved based on empirical results regarding a corresponding significance of the respective frequency bands or groups of frequency bands.
[0062] In a third signal path 38, a fundamental frequency f.sub.G of the speech signal component 26 is acquired in a temporally resolved manner in the input audio signal 8, and a temporal stability 40 is ascertained for said fundamental frequency f.sub.G based on a variance of the fundamental frequency f.sub.G. The temporal stability 40 may be used as a characteristic variable 41 that allows a statement about a prosodic property of the speech signal components 26 in the input audio signal 8. A greater variance in the fundamental frequency f.sub.G may in this case be used as an indicator of better speech intelligibility, while a monotonic fundamental frequency f.sub.G indicates lower speech intelligibility.
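The variance-based stability measure of paragraph [0062] can be sketched as below. This is an illustrative sketch, not the patented implementation; treating f0 values of 0 as unvoiced frames to be skipped is an assumption of this example.

```python
def f0_variability(f0_track):
    """Illustrative characteristic variable 41: the variance of the
    tracked fundamental frequency f_G over time. A larger value
    points to livelier prosody and, per paragraph [0062], better
    intelligibility; values of 0 Hz are assumed here to mark
    unvoiced frames and are ignored."""
    valid = [f for f in f0_track if f > 0]
    mean = sum(valid) / len(valid)
    return sum((f - mean) ** 2 for f in valid) / len(valid)
```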
[0063] In a fourth signal path 42, a level LVL is acquired in a temporally resolved manner for the input audio signal 8 and/or for the speech signal component 26 contained therein, and a temporal mean MN.sub.LVL is formed over a time interval 44 that is predefined in particular based on corresponding empirical findings. The maximum MX.sub.LVL of the level LVL is also ascertained over the time interval 44. The maximum MX.sub.LVL of the level LVL is then divided by the temporal mean MN.sub.LVL of the level LVL, and a characteristic variable 45 correlated with a volume of the speech signal 18 is thus ascertained, this allowing a further statement about a prosodic property of the speech signal components 26 in the input audio signal 8. Instead of the level LVL, another variable correlated with the volume and/or the energy content of the speech signal component 26 may also be used here.
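The quotient MX.sub.LVL/MN.sub.LVL of paragraph [0063] can be sketched directly. This is a minimal illustrative sketch under the assumption that the level LVL is available as a list of per-frame values over the predefined time interval 44.

```python
def level_crest(levels):
    """Illustrative characteristic variable 45: the quotient of the
    maximum level MX_LVL over the time interval 44 to its temporal
    mean MN_LVL (paragraph [0063]). A larger quotient indicates a
    more dynamic, less monotonous delivery."""
    mn_lvl = sum(levels) / len(levels)
    mx_lvl = max(levels)
    return mx_lvl / mn_lvl
```

Any other variable correlated with volume or energy content could be substituted for the level, as the paragraph notes.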
[0064] The characteristic variables 33, 35, 41 and 45 respectively ascertained, as described, in the first to fourth signal path 32, 34, 38, 42 may then each be used individually as the quantitative measure 30 of the quality of the speech component 18 contained in the input audio signal 8, on the basis of which a second algorithm 46 is then applied to the input audio signal 8 for signal processing purposes. The second algorithm 46 may in this case be derived from the first algorithm 25 through an appropriate change of one or more signal processing parameters made on the basis of the relevant quantitative measure 30 or provide a completely standalone auditory program.
[0065] An individual value may in particular also be determined as the quantitative measure 30 of the speech quality based on the characteristic variables 33, 35, 41 or 45 ascertained as described, for example through a weighted mean or a product of the characteristic variables 33, 35, 41, 45.
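The combination step of paragraph [0065] can be sketched as follows. This is an illustrative sketch, not the patented implementation; the mode strings and the default equal weighting are assumptions of this example.

```python
def combine_measures(values, weights=None, mode="weighted_mean"):
    """Combine the characteristic variables 33, 35, 41, 45 into a
    single quantitative measure 30, either as a product or as a
    weighted mean (paragraph [0065]). Equal weights are assumed
    when none are given."""
    if mode == "product":
        result = 1.0
        for v in values:
            result *= v
        return result
    weights = weights or [1.0] * len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

The weights would in practice be chosen empirically, reflecting how strongly each articulatory or prosodic property predicts perceived quality.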
[0066] Although the invention has been described and illustrated in more detail through the preferred exemplary embodiment, the invention is not restricted to the disclosed examples, and other variations may be derived therefrom by a person skilled in the art without departing from the scope of protection of the invention.
[0067] The following is a summary list of reference numerals and the corresponding structure used in the above description of the invention:
[0068] 1 Hearing device
[0069] 2 Hearing aid
[0070] 4 Input transducer
[0071] 6 Sound from the surroundings
[0072] 8 Input audio signal
[0073] 10 Signal processing apparatus
[0074] 12 Output audio signal
[0075] 14 Output transducer
[0076] 16 Output sound
[0077] 18 Speech signal
[0078] 20 Sound components
[0079] 25 First algorithm
[0080] 26 Speech signal component
[0081] 30 Quantitative measure of speech quality
[0082] 32 First signal path
[0083] 33 Characteristic variable
[0084] 34 Second signal path
[0085] 35 Characteristic variable
[0086] 36 Distinction
[0087] 38 Third signal path
[0088] 40 Temporal stability
[0089] 41 Characteristic variable
[0090] 42 Fourth signal path
[0091] 44 Time interval
[0092] 45 Characteristic variable
[0093] 46 Second algorithm
[0094] ΔE.sub.TS Relative change (of the energy at the transition)
[0095] λ.sub.C Centroid wavelength
[0096] E1 First energy
[0097] E2 Second energy
[0098] Ev Energy (prior to the transition)
[0099] En Energy (following the transition)
[0100] f.sub.G Fundamental frequency
[0101] LVL Level
[0102] HF Higher frequency range
[0103] MN.sub.LVL Temporal mean (of the level)
[0104] MX.sub.LVL Maximum of the level
[0105] NF Low frequency range
[0106] QE Quotient
[0107] SNR Signal-to-noise ratio (SNR)
[0108] Th.sub.λ Limit value (for the centroid wavelength)
[0109] Th.sub.E Limit value (for the relative change of the energy)
[0110] Th.sub.SNR Limit value (for the SNR)
[0111] TS Transition
[0112] V Voiced temporal sequence
[0113] VAD Speech activity identification
[0114] UV Unvoiced temporal sequence