METHOD FOR PREDICTING THE INTELLIGIBILITY OF NOISY AND/OR ENHANCED SPEECH AND A BINAURAL HEARING SYSTEM
20170272870 · 2017-09-21
Assignee
Inventors
- Asger Heidemann ANDERSEN (Smorum, DK)
- Jan Mark De Haan (Smorum, DK)
- Zheng-Hua TAN (Aalborg ost, DK)
- Jesper Jensen (Smorum, DK)
- Michael Syskind Pedersen (Smorum, DK)
Cpc classification
H04R2225/51
ELECTRICITY
G10L19/00
PHYSICS
H04R25/554
ELECTRICITY
International classification
Abstract
An intrusive binaural speech intelligibility predictor system receives a target signal comprising speech in left and right essentially noise-free and noisy and/or processed versions at left and right ears of a listener. The system comprises a) first, second, third and fourth input units for providing time-frequency representations of said left and right noise-free and noisy/processed versions of the target signal, respectively; b) first and second Equalization-Cancellation stages adapted to receive and relatively time shift and amplitude adjust the left and right noise-free and noisy/processed versions, respectively, and to provide resulting noise-free and noisy/processed signals, respectively; and c) a monaural speech intelligibility predictor unit for providing final binaural speech intelligibility predictor value SI-Measure based on said resulting noise-free and noisy/processed signals. The Equalization-Cancellation stages are adapted to optimize the SI-Measure to indicate a maximum intelligibility of said noisy/processed versions of the target signal by said listener. The invention may e.g. be used in development systems for hearing aids.
Claims
1. An intrusive binaural speech intelligibility prediction system comprising a binaural speech intelligibility predictor unit adapted for receiving a target signal comprising speech in a) left and right essentially noise-free versions x.sub.l, x.sub.r and in b) left and right noisy and/or processed versions y.sub.l, y.sub.r, said signals being received or being representative of acoustic signals as received at left and right ears of a listener, the binaural speech intelligibility predictor unit being configured to provide as an output a final binaural speech intelligibility predictor value SI measure indicative of the listener's perception of said noisy and/or processed versions y.sub.l, y.sub.r of the target signal, the binaural speech intelligibility predictor unit comprising
First and second input units for providing time-frequency representations x.sub.l(k,m) and x.sub.r(k,m) of said left x.sub.l and right x.sub.r noise-free version of the target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
Third and fourth input units for providing time-frequency representations y.sub.l(k,m) and y.sub.r(k,m) of said left y.sub.l and right y.sub.r noisy and/or processed versions of the target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
A first Equalization-Cancellation stage adapted to receive and relatively time shift and amplitude adjust the left and right noise-free versions x.sub.l(k,m) and x.sub.r(k,m), respectively, and to subsequently subtract the time shifted and amplitude adjusted left and right noise-free versions x′.sub.l(k,m) and x′.sub.r(k,m) of the left and right target signals from each other, and to provide a resulting noise-free signal x(k,m);
A second Equalization-Cancellation stage adapted to receive and relatively time shift and amplitude adjust the left and right noisy and/or processed versions y.sub.l(k,m) and y.sub.r(k,m), respectively, and to subsequently subtract the time shifted and amplitude adjusted left and right noisy and/or processed versions y′.sub.l(k,m) and y′.sub.r(k,m) of the left and right target signals from each other, and to provide a resulting noisy and/or processed signal y(k,m); and
A monaural speech intelligibility predictor unit for providing final binaural speech intelligibility predictor value SI measure based on said resulting, noise-free signal x(k,m) and said resulting noisy and/or processed signal y(k,m);
Wherein said first and second Equalization-Cancellation stages are adapted to optimize the final binaural speech intelligibility predictor value SI measure to indicate a maximum intelligibility of said noisy and/or processed versions y.sub.l, y.sub.r of the target signal by said listener.
2. An intrusive binaural speech intelligibility prediction system according to claim 1 configured to repeat the calculations performed by the first and second Equalization-Cancellation stages and the monaural speech intelligibility predictor unit to optimize the final binaural speech intelligibility predictor value to indicate a maximum intelligibility of said noisy and/or processed versions of the target signal by said listener.
3. An intrusive binaural speech intelligibility prediction system according to claim 1 wherein the monaural speech intelligibility predictor unit comprises A first envelope extraction unit for providing a time-frequency sub-band representation of the resulting noise-free signal x(k,m) in the form of temporal envelopes, or functions thereof, of said resulting noise-free signal providing time-frequency sub-band signals X(q,m), q being a frequency sub-band index, q=1, 2, . . . , Q, and m being the time index; A second envelope extraction unit for providing a time-frequency sub-band representation of the resulting noisy and/or processed signal y(k,m) in the form of temporal envelopes, or functions thereof, of said resulting noisy and/or processed signal providing time-frequency sub-band signals Y(q,m), q being a frequency sub-band index, q=1, 2, . . . , Q, and m being the time index; A first time-frequency segment division unit for dividing said time-frequency sub-band representation X(q,m) of the resulting noise-free signal x(k,m) into time-frequency envelope segments x(q,m) corresponding to a number N of successive samples of said sub-band signals; A second time-frequency segment division unit for dividing said time-frequency sub-band representation Y(q,m) of the noisy and/or processed signal y(k,m) into time-frequency envelope segments y(q,m) corresponding to a number N of successive samples of said sub-band signals; A correlation coefficient unit adapted to compute a correlation coefficient {circumflex over (ρ)}(q,m) between each time frequency envelope segment of the noise-free signal and the corresponding envelope segment of the noisy and/or processed signal; A final speech intelligibility measure unit providing a final binaural speech intelligibility predictor value SI measure as a weighted combination of the computed correlation coefficients across time frames and frequency sub-bands.
4. An intrusive binaural speech intelligibility prediction system according to claim 1 comprising a binaural hearing loss model.
5. A binaural hearing system comprising left and right hearing aids adapted to be located at left and right ears of a user, and an intrusive binaural speech intelligibility prediction system according to claim 1.
6. A binaural hearing system according to claim 5, wherein each of the left and right hearing aids comprises left and right configurable signal processing units configured for processing the left and right noisy and/or processed versions y.sub.l, y.sub.r, of the target signal, respectively, and providing left and right processed signals u.sub.left, u.sub.right, respectively, and left and right output units for creating output stimuli configured to be perceivable by the user as sound based on left and right electric output signals, either in the form of the left and right processed signals u.sub.left, u.sub.right, respectively, or signals derived therefrom, wherein the binaural hearing system comprises b) a binaural hearing loss model unit operatively connected to the intrusive binaural speech intelligibility predictor unit and configured to apply a frequency dependent modification reflecting a hearing impairment of the corresponding left and right ears of the user to the electric output signals to provide respective modified electric output signals to the intrusive binaural speech intelligibility predictor unit.
7. A binaural hearing system according to claim 5 wherein each of the left and right hearing aids comprises antenna and transceiver circuitry for establishing an interaural link between them allowing the exchange of data between them, including audio and/or control data signals.
8. A method of providing a binaural speech intelligibility predictor value, the method comprising
S1. receiving a target signal comprising speech in a) left and right essentially noise-free versions x.sub.l, x.sub.r and in b) left and right noisy and/or processed versions y.sub.l, y.sub.r, said signals being received or being representative of acoustic signals as received at left and right ears of a listener, the method further comprises
S2. providing time-frequency representations x.sub.l(k,m) and y.sub.l(k,m) of said left noise-free version x.sub.l and said left noisy and/or processed version y.sub.l of the target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
S3. providing time-frequency representations x.sub.r(k,m) and y.sub.r(k,m) of said right noise-free version x.sub.r and said right noisy and/or processed version y.sub.r of the target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
S4. receiving and relatively time shifting and amplitude adjusting the left and right noise-free versions x.sub.l(k,m) and x.sub.r(k,m), respectively, and subsequently subtracting the time shifted and amplitude adjusted left and right noise-free versions x.sub.l′(k,m) and x.sub.r′(k,m), respectively, of the target signals from each other, and providing a resulting noise-free signal x(k,m);
S5. receiving and relatively time shifting and amplitude adjusting the left and right noisy and/or processed versions y.sub.l(k,m) and y.sub.r(k,m), respectively, and subsequently subtracting the time shifted and amplitude adjusted left and right noisy and/or processed versions y′.sub.l(k,m) and y′.sub.r(k,m), respectively, of the target signals from each other, and providing a resulting noisy and/or processed signal y(k,m); and
S6. providing a final binaural speech intelligibility predictor value SI measure indicative of the listener's perception of said noisy and/or processed versions y.sub.l, y.sub.r of the target signal based on said resulting noise-free signal x(k,m) and said resulting noisy and/or processed signal y(k,m);
S7. repeating steps S4-S6 to optimize the final binaural speech intelligibility predictor value SI measure to indicate a maximum intelligibility of said noisy and/or processed versions y.sub.l, y.sub.r of the target signal by said listener.
9. A method according to claim 8 wherein steps S4 and S5 each comprises providing that the relative time shift and amplitude adjustment is given by the factor:
$$\lambda = 10^{(\gamma+\Delta\gamma)/40}\, e^{j\omega(\tau+\Delta\tau)/2}$$
where τ denotes the time shift in seconds and γ denotes the amplitude adjustment in dB, and where Δτ and Δγ are uncorrelated noise sources which model imperfections of the human auditory system of a normally hearing person, and where the resulting noise-free signal x(k,m) and the resulting noisy and/or processed signal y(k,m) are given by:
$$x_{k,m} = \lambda\, x_{k,m}^{(l)} - \lambda^{-1} x_{k,m}^{(r)},$$
and
$$y_{k,m} = \lambda\, y_{k,m}^{(l)} - \lambda^{-1} y_{k,m}^{(r)},$$
respectively.
10. A method according to claim 9 wherein the uncorrelated noise sources, Δτ and Δγ, are normally distributed with zero mean and standard deviation
11. A method according to claim 8 wherein step S6 comprises
providing a time-frequency sub-band representation of the resulting noise-free signal x(k,m) in the form of temporal envelopes, or functions thereof, of said resulting noise-free signal providing time-frequency sub-band signals X(q,m), q being a frequency sub-band index, q=1, 2, . . . , Q, and m being the time index;
providing a time-frequency sub-band representation of the resulting noisy and/or processed signal y(k,m) in the form of temporal envelopes, or functions thereof, of said resulting noisy and/or processed signal providing time-frequency sub-band signals Y(q,m), q being a frequency sub-band index, q=1, 2, . . . , Q, and m being the time index;
dividing said time-frequency sub-band representation X(q,m) of the resulting noise-free signal x(k,m) into time-frequency envelope segments x(q,m) corresponding to a number N of successive samples of said sub-band signals;
dividing said time-frequency sub-band representation Y(q,m) of the noisy and/or processed signal y(k,m) into time-frequency envelope segments y(q,m) corresponding to a number N of successive samples of said sub-band signals;
computing a correlation coefficient ρ(q,m) between each time frequency envelope segment of the noise-free signal and the corresponding envelope segment of the noisy and/or processed signal;
providing a final binaural speech intelligibility predictor value SI measure as a weighted combination of the computed correlation coefficients across time frames and frequency sub-bands.
12. A method according to claim 11 wherein said time-frequency signals X(q,m), Y(q,m), q being a frequency sub-band index, q=1, 2, . . . , Q, representing temporal envelopes of the respective q.sup.th sub-band signals are power envelopes determined as
13. A method according to claim 12 wherein the power envelopes are arranged into vectors of N samples
$$\mathbf{x}_{q,m} = [X_{q,m-N+1}, X_{q,m-N+2}, \ldots, X_{q,m}]^{T}$$ and
$$\mathbf{y}_{q,m} = [Y_{q,m-N+1}, Y_{q,m-N+2}, \ldots, Y_{q,m}]^{T}$$ where the vectors $\mathbf{x}_{q,m}, \mathbf{y}_{q,m} \in \mathbb{R}^{N\times 1}$.
14. A method according to claim 13 wherein the correlation coefficient between clean and noisy/processed envelopes is determined as:
15. A method according to claim 14 wherein an N-sample estimate {circumflex over (ρ)}.sub.q,m of the correlation coefficient ρ.sub.q across the input signals is then given by:
16. A method according to claim 15 wherein the final binaural speech intelligibility predictor value is obtained by estimating the correlation coefficients, {circumflex over (ρ)}.sub.q,m, for all frames, m, and frequency bands, q, in the signal and averaging across these:
17. Use of an intrusive binaural speech intelligibility prediction system as claimed in claim 1 in a listening test for evaluating a person's intelligibility of a noisy and/or processed target signal comprising speech.
18. A data processing system comprising a processor and program code means for causing the processor to perform the steps of the method according to claim 8.
19. A tangible computer-readable medium storing a computer program comprising program code means for causing a data processing system to perform the steps of the method according to claim 8.
20. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to claim 8.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0078] The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity, and they just show details to improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effect will be apparent from and elucidated with reference to the illustrations described hereinafter in which:
[0095] The figures are schematic and simplified for clarity, and they just show details which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference signs are used for identical or corresponding parts.
[0096] Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
DETAILED DESCRIPTION OF EMBODIMENTS
[0097] The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practised without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer program, or any combination thereof.
[0098] The electronic hardware may include microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
[0099] The present application relates to the field of hearing devices, e.g. hearing aids, in particular to speech intelligibility prediction. The topic of Speech Intelligibility Prediction (SIP) has been widely investigated since the introduction of the Articulation Index (AI) [French & Steinberg; 1947], which was later refined and standardized as the Speech Intelligibility Index (SII) [ANSI S3.5-1997]. While the research interest initially came from the telephone industry, the possible application to hearing aids and cochlear implants has recently gained attention, see e.g. [Taal et al.; 2012] and [Falk et al.; 2015].
[0100] The SII predicts monaural intelligibility in conditions with additive, stationary noise. Another early and highly popular method is the Speech Transmission Index (STI), which predicts the intelligibility of speech, which has been transmitted through a noisy and distorting transmission system (e.g. a reverberant room). Many additional SIP methods have been proposed, mainly with the purpose of extending the range of conditions under which predictions can be made.
[0101] For SIP methods to be applicable in relation to binaural communication devices such as hearing aids, the operating range of the classical methods must be expanded in two ways. Firstly, they must be able to take into account the non-linear processing that typically happens in such devices. This task is complicated by the fact that many SIP methods assume knowledge of the clean speech and interferer in separation; an assumption which is not meaningful when the combination of speech and noise has been processed non-linearly. One example of a method which does not make this assumption, is the STOI measure [Taal et al.; 2011] which predicts intelligibility from a noisy/processed signal and a clean speech signal. The STOI measure has been shown to predict well the influence on intelligibility of multiple enhancement algorithms. Secondly, SIP methods must take into account the fact that signals are commonly presented binaurally to the user. Binaural auditory perception provides the user with different degrees of advantage, depending on the acoustical conditions and the applied processing [Bronkhorst; 2000]. Several SIP methods have focused on predicting this advantage. Existing binaural methods, however, can generally not provide predictions for non-linearly processed signals.
[0102] A setup of a binaural intrusive speech intelligibility predictor unit BSIP in combination with an evaluation unit EVAL is illustrated in
[0103] The clean (target) speech signals (x.sub.l, x.sub.r) as presented to the left and right ears of the listener from a given acoustic (target) source in the environment of the listener (at a given location relative to the user) may be generated from an acoustic model of the setup including measured or modelled head related transfer functions (HRTF) to provide appropriate frequency and angle dependent interaural time (ITD) and level differences (ILD). The contributions (n.sub.i,l, n.sub.i,r) as presented to the left and right ears of the listener of individual noise sources N.sub.i, i=1, 2, . . . , N.sub.s, N.sub.s being the number of noise sources considered (e.g. equal to one or more), located at different positions around the listener may likewise be determined from an acoustic model of the setup. Thereby, noisy (e.g. un-processed) signals (y.sub.l, y.sub.r) comprising the target speech as presented to the left and right ears of the listener may be provided as the sum of the respective clean (target) speech signals (x.sub.l, x.sub.r) and the noise signals (n.sub.i,l, n.sub.i,r) of individual noise sources N.sub.i, i=1, 2, . . . , N.sub.s, as presented to the left and right ears of the listener (cf. e.g.
[0104] Alternatively, the clean (target) speech signals (x.sub.l, x.sub.r) and noisy (e.g. un-processed) signals (y.sub.l, y.sub.r) as presented to the left and right ears of a listener may be measured in a specific geometric setup, e.g. using a dummy head model (e.g. performed in a sound studio with a head-and-torso-simulator (HATS, Head and Torso Simulator 4128C from Brüel & Kjær Sound & Vibration Measurement A/S)) (cf. e.g.
[0105] Hence, in an embodiment, the clean and noisy signals as presented to the left and right ears of the listener and used as inputs to the binaural speech intelligibility predictor unit are provided as artificially generated and/or measured signals.
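The artificial generation of the noisy ear signals described above, i.e. adding the contributions of the individual noise sources to the clean target signal at each ear, may be sketched as follows. This is a minimal illustration only; the function name `mix_at_ears` and the array layout (one row per noise source) are assumptions made for this example and are not part of the disclosure.

```python
import numpy as np

def mix_at_ears(x_l, x_r, noise_l, noise_r):
    """Compose noisy ear signals from clean target signals and noise contributions.

    x_l, x_r:         clean target signal at the left/right ear, shape (num_samples,)
    noise_l, noise_r: noise contributions at the left/right ear,
                      shape (num_sources, num_samples) or (num_samples,)
    Returns (y_l, y_r), the noisy (un-processed) ear signals.
    """
    noise_l = np.atleast_2d(np.asarray(noise_l, dtype=float))
    noise_r = np.atleast_2d(np.asarray(noise_r, dtype=float))
    # Sum over all noise sources and add the result to the clean target.
    y_l = np.asarray(x_l, dtype=float) + noise_l.sum(axis=0)
    y_r = np.asarray(x_r, dtype=float) + noise_r.sum(axis=0)
    return y_l, y_r
```

The clean signals passed in would themselves come from an acoustic model (e.g. HRTF-filtered speech) or from a measurement; that step is outside the scope of this sketch.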
[0109] The exemplary measure as shown in
[0111] In the embodiment of an intrusive binaural speech intelligibility prediction system shown in
[0113] In [Andersen et al.; 2015], a binaural extension of the STOI measure—the Binaural STOI (BSTOI) measure—was proposed. The BSTOI measure has been shown to predict well the intelligibility (including binaural advantage) obtained in conditions with a frontal target and a single point noise source in the horizontal plane. The BSTOI measure was also shown to predict the intelligibility of diotic speech which had been processed by ITFS (Ideal Time Frequency Segregation).
[0114] In the present application an improved version of the BSTOI measure is presented, which is computationally less demanding and, unlike BSTOI, produces deterministic results. The proposed measure has the advantage of being able to predict intelligibility in conditions where both binaural advantage and non-linear processing simultaneously influence intelligibility. To the knowledge of the present inventors, no other SIP method is capable of producing predictions in conditions where intelligibility is affected by both. We refer to the improved binaural speech intelligibility measure as the Deterministic BSTOI (DBSTOI) measure.
[0115] The DBSTOI measure scores intelligibility based on four signals: The noisy/processed signal as presented to the left and right ears of the listener and a clean speech signal, also at both ears. The clean (essentially noise-free) signal should be the same as the noisy/processed one, but with neither noise nor processing. The DBSTOI measure produces a score in the range 0 to 1. The aim is to have a monotonic correspondence between the DBSTOI measure and measured intelligibility, such that a higher DBSTOI measure corresponds to a higher intelligibility (e.g. percentage of words heard correctly).
[0116] The DBSTOI measure is based on combining a modified Equalization Cancellation (EC) stage with the STOI measure as proposed in [Andersen et al.; 2015]. Here, we introduce further structural changes in the STOI measure to allow for better integration with the EC-stage. This allows for computing the measure deterministically and in closed form, contrary to the BSTOI measure [Andersen et al.; 2015], which is computed using Monte Carlo simulation.
[0117] The structure of the DBSTOI measure is shown in
Specific Example
[0118] As a specific example of the proposed type of binaural intelligibility predictor, the DBSTOI measure is described in the following. A block diagram of the binaural speech intelligibility prediction unit providing this specific measure is shown in
[0119] An outline of the procedure of computing the DBSTOI measure is given by:
[0120] 1) The input signals are time-frequency decomposed by use of a short-time Fourier transformation. Subsequent steps are carried out in the short-time Fourier domain.
[0121] 2) The left and right ear signals are combined by means of a modified equalization stage.
[0122] Specifically:
[0123] a. The left and right ear signals are time shifted and amplitude adjusted relative to each other. This is done separately for a range of third octave bands. See equations (1) and (2) below.
[0124] b. The time shifted and amplitude adjusted left and right signals are subtracted from one another. This difference is referred to as the combined signal. The same time shifts and amplitude adjustment factors are applied for the clean signals and the noisy/processed signals. One combined clean signal and one combined noisy/processed signal is obtained in this manner. See equations (1) and (2) below.
[0125] 3) A power envelope is extracted from each third octave band for each signal (the clean and the noisy/processed one). See equation (5) below.
[0126] 4) The envelopes are arranged into short overlapping segments. See equation (8) below.
[0127] 5) The correlation coefficient is computed between each envelope segment of the clean signal and the corresponding envelope segment of the noisy/processed signal. See equation (9) below.
[0128] 6) The final measure is obtained as an average of the computed correlation coefficients across all time frames and third octave bands. See equation (15) below.
[0129] Advantageously, the time shift and amplitude adjustment factors in step 2 are determined independently for each short envelope segment and are chosen so as to maximize the correlation between the envelopes. This corresponds to the assumption that the human brain uses the information from both ears so as to make speech as intelligible as possible. The final number typically lies in the interval from 0 to 1, where 0 indicates that the noisy/processed signal is very unlike the clean signal and should be expected to be unintelligible, while numbers close to 1 indicate that the noisy/processed signal is close to the clean signal and should be expected to be highly intelligible.
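Step 1 of the outline (the short-time Fourier decomposition applied to each of the four input signals) may be sketched as follows. The frame length, hop size, FFT length and the Hann window used here are illustrative assumptions only; the present text does not specify them. Note also that the code uses zero-based bin and frame indices, whereas the text counts k from 1.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128, n_fft=512):
    """Short-time Fourier transform of a real signal.

    Returns a complex array of shape (K, M) with K = n_fft // 2 + 1 frequency
    bins (index k) and M time frames (index m).
    All default parameter values are hypothetical choices for this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([signal[m * hop: m * hop + frame_len] * window
                       for m in range(n_frames)])
    # Zero-pad each frame to n_fft samples and transform; transpose to (K, M).
    return np.fft.rfft(frames, n=n_fft, axis=1).T
```

The same function would be applied to the left and right clean signals and to the left and right noisy/processed signals, so that all four time-frequency representations share the same bin and frame grid.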
Step 1: TF Decomposition
[0130] The first step (cf. e.g. Step 1 in be the TF unit corresponding to the clean signal at the left ear in the m.sup.th time frame and the k.sup.th frequency bin (cf.
Step 2: EC Processing
[0131] The second step (cf. e.g. Step 2 in
[0132] A combined clean signal is obtained by relatively time shifting and amplitude adjusting the left and right clean signals and thereafter subtracting one from the other. The same is done for the noisy/processed signals to obtain a single noisy/processed signal. The relative time shift of τ (seconds) and amplitude adjustment of γ (dB) are given by the factor:
$$\lambda = 10^{(\gamma+\Delta\gamma)/40}\, e^{j\omega(\tau+\Delta\tau)/2} \qquad (1)$$
where Δτ and Δγ are uncorrelated noise sources which model imperfections of the human auditory system of a normally hearing person. The resulting combined clean signal is given by:
$$x_{k,m} = \lambda\, x_{k,m}^{(l)} - \lambda^{-1} x_{k,m}^{(r)} \qquad (2)$$
[0133] A combined noisy/processed TF-unit, y.sub.k,m, is obtained in a similar manner (using the same value of λ).
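The equalization and cancellation of equations (1) and (2) may be illustrated by the following sketch. The function names `ec_factor` and `ec_combine` are hypothetical. The jitter terms Δγ and Δτ default to zero here, which is a simplification for illustration; in the described measure they are random variables.

```python
import numpy as np

def ec_factor(gamma_db, tau_s, omega, d_gamma_db=0.0, d_tau_s=0.0):
    """Complex EC factor lambda of equation (1).

    gamma_db: relative amplitude adjustment in dB
    tau_s:    relative time shift in seconds
    omega:    angular frequency in rad/s (scalar or array)
    d_gamma_db, d_tau_s: jitter terms, set to zero in this simplified sketch
    """
    gain = 10.0 ** ((gamma_db + d_gamma_db) / 40.0)
    phase = np.exp(1j * omega * (tau_s + d_tau_s) / 2.0)
    return gain * phase

def ec_combine(left, right, lam):
    """Combined TF units of equation (2): lambda * left - right / lambda."""
    return lam * left - right / lam
```

Because one ear is multiplied by λ and the other by its inverse, the relative adjustment between the ears is λ squared, i.e. a gain of γ dB and a phase shift of ωτ. A component for which the right-ear TF unit equals λ squared times the left-ear TF unit is therefore cancelled completely. The same λ is applied to the clean and to the noisy/processed signal pair.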
[0134] The uncorrelated noise sources, Δτ and Δγ, are normally distributed with zero mean and standard deviation:
[0135] Following the principle introduced in [Andersen et al.; 2015], the values γ and τ are determined such as to maximize the scoring of intelligibility. This is further described below.
Step 3: Intelligibility Prediction
[0136] At this point the four input signals have been condensed to two signals: a clean signal, x.sub.k,m, and a noisy/processed signal, y.sub.k,m. We compute an intelligibility score for these signals by use of a variation of the STOI measure. For mathematical tractability, we use power envelopes rather than magnitude envelopes as originally proposed in STOI [Taal et al.; 2011]. This is also done in [Taal et al.; 2012] and appears not to have a significant effect on predictions. Furthermore, we discard the clipping mechanism contained in the original STOI, as also done in [Taal et al.; 2012]. We have seen no indication that this negatively influences results.
[0137] The clean and processed signal power envelope is determined in Q=15 third octave bands (cf. blocks Envelope extraction in
where α=10.sup.(γ+Δγ)/20 and:
$$X_{q,m}^{(l)/(r)} = \sum_{k=k_1(q)}^{k_2(q)} \left| x_{k,m}^{(l)/(r)} \right|^{2}$$
where superscript c indicates the correlation between the left and right channels and where k.sub.1(q) and k.sub.2(q) denote the lower and upper DFT bins for the q.sup.th third octave band, respectively, and ω.sub.q is the center frequency of the q.sup.th frequency band. The approximate equality is obtained by inserting (1) and (2) and assuming that the energy in each third octave band is contained at the center frequency. A similar procedure for the processed signal yields third octave power envelopes, Y.sub.q,m.
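The approximation described above can be made concrete with a short sketch. Expanding the squared magnitude of the combined TF unit of equation (2) gives α times the left power, plus the right power divided by α, minus twice the real part of a phase-rotated cross term, where α is the squared magnitude of λ. The code below computes the band power envelope directly and via this expansion. The cross term `X_c` is defined here as the sum over the band of the left TF unit times the complex conjugate of the right TF unit; this conjugation convention, the function names, and the omission of the jitter terms are assumptions of this example.

```python
import numpy as np

def band_power_envelope(tf_units, band_edges):
    """Sum |tf_units|^2 over the DFT bins k1..k2 (inclusive) of each band.

    tf_units:   complex array of shape (K, M)
    band_edges: list of (k1, k2) zero-based bin index pairs, one per band
    Returns a real array of shape (Q, M).
    """
    power = np.abs(tf_units) ** 2
    return np.stack([power[k1:k2 + 1].sum(axis=0) for k1, k2 in band_edges])

def expanded_band_power(x_l, x_r, band_edges, gamma_db, tau_s, omega_q):
    """Band power of the combined signal, assuming all energy of band q
    sits at its center frequency omega_q (jitter terms omitted)."""
    alpha = 10.0 ** (gamma_db / 20.0)
    X_l = band_power_envelope(x_l, band_edges)
    X_r = band_power_envelope(x_r, band_edges)
    # Cross term between the ears (assumed convention: left * conj(right)).
    X_c = np.stack([(x_l[k1:k2 + 1] * np.conj(x_r[k1:k2 + 1])).sum(axis=0)
                    for k1, k2 in band_edges])
    phase = np.exp(1j * np.asarray(omega_q, dtype=float)[:, None] * tau_s)
    return alpha * X_l + X_r / alpha - 2.0 * np.real(phase * X_c)
```

When λ is evaluated at the band center frequency for every bin of the band, the two computations agree exactly; with a per-bin frequency in λ they agree only approximately, which is the approximation referred to in the text.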
[0138] If we assume that the input signals are wide sense stationary stochastic processes, the power envelopes, X.sub.q,m and Y.sub.q,m are also stochastic processes, due to the stochastic nature of the input signals as well as the noise sources, Δτ and Δγ, in the EC stage. An underlying assumption of STOI is that intelligibility is related to the correlation between clean and noisy/processed envelopes (cf. e.g. [Taal et al.; 2011]):
where the expectation is taken across both input signals and the noise sources in the EC stage.
[0139] To estimate ρ.sub.q, the power envelopes are arranged into vectors of N=30 samples (cf. e.g. [Taal et al.; 2011] and blocks Short-time segmentation in
$$\mathbf{x}_{q,m} = [X_{q,m-N+1}, X_{q,m-N+2}, \ldots, X_{q,m}]^{T}. \qquad (8)$$
[0140] Similar vectors, $\mathbf{y}_{q,m} \in \mathbb{R}^{N\times 1}$, are defined for the processed signal.
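The arrangement of the power envelopes into overlapping N-sample vectors according to equation (8) may be sketched as follows. The function name `envelope_segments` and the output layout are assumptions for illustration; indices are zero-based in the code.

```python
import numpy as np

def envelope_segments(envelope, n=30):
    """Arrange band envelopes into overlapping segments of n samples.

    envelope: real array of shape (Q, M), one row per band
    Returns an array of shape (Q, M - n + 1, n). The segment at position i
    ends at frame m = i + n - 1 and holds frames m-n+1 .. m, as in equation (8).
    """
    envelope = np.asarray(envelope)
    num_bands, num_frames = envelope.shape
    if num_frames < n:
        raise ValueError("fewer frames than the segment length")
    return np.stack([envelope[:, m - n + 1: m + 1]
                     for m in range(n - 1, num_frames)], axis=1)
```

Applying the same function to the clean and to the noisy/processed envelopes yields segment pairs that are aligned in band and time.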
[0141] An N-sample estimate of ρ.sub.q across the input signals is then given by:
where μ(.Math.) denotes the mean of the entries in the given vector, E.sub.Δ is the expectation across the noise in the EC stage and 1 is the vector of all ones (cf. block Correlation coefficient in
E.sub.Δ[(x.sub.q,m−μ.sub.x.sub.
where
and similarly for the noisy/processed signal. An expression for E.sub.Δ[∥x.sub.q,m−μ.sub.x.sub.
[0142] The final DBSTOI measure is obtained by estimating the correlation coefficients, {circumflex over (ρ)}.sub.q,m, for all frames, m, and frequency bands, q, in the signal and averaging across these [Taal et al.; 2011];
where Q and M are the number of frequency bands and the number of frames, respectively (cf. block Average in
[0143] It can be shown that whenever the left and right ear inputs are identical, the DBSTOI measure produces scores which are identical to those of the monaural STOI (that is, the modified monaural STOI measure based on (5) and without clipping).
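The correlation and averaging steps may be illustrated by the following simplified sketch. It computes the plain sample correlation between mean-removed envelope segments and averages the result over bands and frames. It deliberately omits the expectation across the jitter in the EC stage that is part of the described estimate, so it is not the closed-form DBSTOI expression; the function names and the small constant `eps` (used to avoid division by zero for constant segments) are assumptions.

```python
import numpy as np

def segment_correlation(x_seg, y_seg, eps=1e-12):
    """Sample correlation coefficient along the last axis.

    x_seg, y_seg: real arrays of shape (..., N) holding clean and
                  noisy/processed envelope segments
    Returns an array of shape (...) with values in [-1, 1].
    """
    xc = x_seg - x_seg.mean(axis=-1, keepdims=True)
    yc = y_seg - y_seg.mean(axis=-1, keepdims=True)
    num = (xc * yc).sum(axis=-1)
    den = np.linalg.norm(xc, axis=-1) * np.linalg.norm(yc, axis=-1)
    return num / np.maximum(den, eps)

def average_score(rho):
    """Average the correlation coefficients across all bands and frames."""
    return float(np.mean(rho))
```

Since the correlation coefficient is unchanged when a segment is multiplied by a positive constant, scaling the combined clean and noisy/processed signals by a common factor does not change the score in this sketch.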
Determination of γ and τ
[0144] Finally, we consider the parameters γ and τ. These parameters are determined individually for each time unit, m, and third octave band, q, such as to maximize the final DBSTOI measure (cf. feedback loop from output DBSTOI to blocks Modified (⅓ octave) EC-stage in
{circumflex over (ρ)}.sub.q,m=max.sub.γτ{circumflex over (ρ)}.sub.q,m(γ,τ). (16)
[0145] In general, the optimization may be carried out by evaluating {circumflex over (ρ)}.sub.q,m for a discrete set of γ and τ values and choosing the highest value.
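The discrete search described above can be sketched as an exhaustive evaluation over a grid. This is a minimal sketch, not the implementation of the disclosure: `score_fn` is a hypothetical callable standing in for the evaluation of the correlation estimate at a given (γ, τ), and the grid values in the usage lines are arbitrary illustrations.

```python
import itertools


def maximize_over_grid(score_fn, gammas, taus):
    """Evaluate score_fn on every (gamma, tau) pair and keep the best.

    Returns a tuple (best_score, best_gamma, best_tau).
    """
    best = None
    for gamma, tau in itertools.product(gammas, taus):
        value = score_fn(gamma, tau)
        if best is None or value > best[0]:
            best = (value, gamma, tau)
    return best


# Usage with a toy objective (an assumption, for illustration only)
# that peaks at gamma = 2 dB and tau = 1 ms.
def toy_score(gamma, tau):
    return 1.0 - 0.01 * (gamma - 2) ** 2 - 1e4 * (tau - 0.001) ** 2


gammas = [-4, -2, 0, 2, 4]       # candidate level adjustments in dB
taus = [-0.001, 0.0, 0.001]      # candidate time shifts in seconds
best_score, best_gamma, best_tau = maximize_over_grid(toy_score, gammas, taus)
```

In this setting the search would be run once per time unit m and third octave band q, with the cost growing as the product of the two grid sizes.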
[0148] In the present application, a number Q of (non-uniform) frequency sub-bands with sub-band indices q=1, 2, . . . , Q is defined, each sub-band comprising one or more DFT-bins (cf. vertical Sub-band q-axis in
[0151] A target signal from target source S comprising speech (e.g. from a person or a loudspeaker) can e.g. be recorded in a recording session as left and right essentially noise-free (clean) target signals x.sub.l(n), x.sub.r(n), n being a time index, as received at the left and right hearing aids (HD.sub.L, HD.sub.R), respectively, when located at the left and right ears of the user, where each of the hearing aids comprises appropriate microphone and memory units. Likewise, a signal from a noise sound source V.sub.i can be recorded as received at the left and right hearing aids (HD.sub.L, HD.sub.R), respectively, providing noise signals v.sub.il(n), v.sub.ir(n). This can be performed for each of the sound sources V.sub.i, i=1, 2, . . . , N.sub.V. Left and right noisy and/or processed versions y.sub.l(n), y.sub.r(n) of the target signal can then be composed by mixing (addition) of the noise-free (clean) left and right target signals x.sub.l(n), x.sub.r(n), and the left and right noise signals v.sub.il(n), v.sub.ir(n), i=1, 2, . . . , N.sub.V. In other words, left and right noisy and/or processed versions y.sub.l(n), y.sub.r(n) of the target signal can be determined as y.sub.l(n)=x.sub.l(n)+Σ.sub.iv.sub.il(n), and y.sub.r(n)=x.sub.r(n)+Σ.sub.iv.sub.ir(n), respectively, with the sums taken over i=1, 2, . . . , N.sub.V. These signals x.sub.l(n), x.sub.r(n), and y.sub.l(n), y.sub.r(n) can be forwarded to the binaural speech intelligibility predictor unit and a resulting speech intelligibility predictor d.sub.bin (or respective left d.sub.bin,l and right d.sub.bin,r predictors, cf. e.g.
[0152] Alternatively, the recorded (electric) noise-free (clean) left and right target signals x.sub.l(n), x.sub.r(n), and a mixture y.sub.l(n), y.sub.r(n) of the clean target source and noise sound sources as (acoustically) received at the left and right hearing aids and picked up by microphones of the respective hearing aids can be provided to the binaural speech intelligibility predictor unit and a resulting binaural speech intelligibility predictor d.sub.bin (alternatively denoted SI measure or DBSTOI) determined. Thereby the effect on the resulting binaural speech intelligibility predictor d.sub.bin of changes in location, type and level of the noise sound sources V.sub.i can be evaluated (for a fixed sound source S).
[0153] By including a processing algorithm of a hearing aid, the binaural speech intelligibility prediction system can be used to test the effect of different algorithms on the resulting binaural speech intelligibility predictor. Alternatively or additionally, such a setup can be used to test the effect of different parameter settings of a given algorithm (e.g. a noise reduction algorithm or a directionality algorithm) on the resulting binaural speech intelligibility predictor.
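The additive composition of the noisy test signals from separately recorded clean and noise signals, as described above, can be sketched as follows. The sketch assumes that all N.sub.V noise recordings are summed onto the clean signal at each ear and that all recordings have equal length; the function name `mix_binaural` is hypothetical.

```python
def mix_binaural(x_l, x_r, noises_l, noises_r):
    """Compose noisy left/right signals by adding all noise recordings.

    x_l, x_r: clean target signals at the left and right ear.
    noises_l, noises_r: one recorded signal per noise source V_i,
    as received at the left and right ear, respectively.
    """
    # Assumption: y(n) = x(n) + sum over i of v_i(n), per ear.
    y_l = [x + sum(v[n] for v in noises_l) for n, x in enumerate(x_l)]
    y_r = [x + sum(v[n] for v in noises_r) for n, x in enumerate(x_r)]
    return y_l, y_r


# Usage with two noise sources and two samples per signal.
y_l, y_r = mix_binaural(
    [1.0, 2.0], [0.5, 0.5],
    noises_l=[[0.25, 0.5], [0.5, 0.25]],
    noises_r=[[0.5, 0.0], [0.0, 0.5]],
)
```

Because the clean and noise components are kept separate until this step, the location, type and level of each noise source can be varied without re-recording the target.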
[0154] The setup of
[0156] The test system (TEST) comprises a user interface (UI) for initiating a test and/or for displaying results of a test. The test system further comprises a processing part (PRO) configured to provide predefined test signals, including a) left and right essentially noise-free versions x.sub.l, x.sub.r of a target speech signal and b) left and right noisy and/or processed versions y.sub.left, y.sub.right of the target speech signal. The signals x.sub.l, x.sub.r, y.sub.left, y.sub.right are adapted to emulate signals as received or being representative of acoustic signals as received at left and right ears of a listener. The signals may e.g. be generated as described in connection with
[0157] The test system (TEST) comprises a (binaural) signal processing unit (BSPU) that applies one or more processing algorithms to the left and right noisy and/or processed versions y.sub.left, y.sub.right of the target speech signal and provides resulting processed signals u.sub.left and u.sub.right.
[0158] The test system (TEST) further comprises a binaural hearing loss model (BHLM) for emulating the hearing loss (or deviation from normal hearing) of a user. The binaural hearing loss model (BHLM) receives processed signals u.sub.left and u.sub.right from the binaural signal processing unit (BSPU) and provides left and right modified processed signals y.sub.l and y.sub.r, which are fed to the binaural speech intelligibility prediction unit (BSIP) as the left and right noisy and/or processed versions of the target signal. Simultaneously, the clean versions of the target speech signals x.sub.l, x.sub.r, are provided from the processing part (PRO) of the test system to the binaural speech intelligibility prediction unit (BSIP). The processed signals u.sub.left and u.sub.right may e.g. be fed to respective loudspeakers (indicated in dotted line) for acoustically presenting the signals to a listener.
[0159] The processing part (PRO) of the test system is further configured to receive the resulting speech intelligibility predictor value SI measure and to process and/or present the result of the evaluation of the listener's intelligibility of speech in the current noisy and processed signals u.sub.left and u.sub.right via the user interface UI. Based thereon, the effect of the current algorithm (or a setting of the algorithm) on speech intelligibility can be evaluated. In an embodiment, a parameter setting of the algorithm is changed in dependence on the value of the present resulting speech intelligibility predictor value SI measure (e.g. manually or automatically, e.g. according to a predefined scheme, e.g. via control signal cntr).
[0160] The test system (TEST) may e.g. be configured to apply a number of different (e.g. stored) test stimuli comprising speech located at different positions relative to the listener, and to mix it with one or more different noise sources, located at different positions relative to the listener, and having configurable frequency content and amplitude shaping. The test stimuli are preferably configurable and applied via the user interface (UI).
Intelligibility-Based Signal Selection.
[0165] Option 1) has the advantage that the hearing instrument microphone signals (y.sub.l,y.sub.r) are recorded binaurally. Hereby the spatial perception of the speech signal is essentially correct, and the spatial cues may assist the listener to better understand the target talker. Furthermore, the (potential) acoustic noise present in the microphone signals of the hearing aid user may be reduced using the external microphone signal as side information (see e.g. our co-pending European patent application EP15190783.9 filed at the European Patent Office on 20 Oct. 2015), which is incorporated herein by reference. Even so, the SNR in this enhanced signal may still be very poor compared to the SNR at the external microphone.
[0166] Option 2) has the advantage that the SNR of the signal (x) picked up at the external microphone (M) close to the mouth of the target talker (TLK) most likely is much better than the SNR at the microphones of hearing instruments (HD.sub.L, HD.sub.R). While this signal (x) can be presented to the hearing aid user (U), the disadvantage is that we only have a mono version to present, so that any binaural spatial cues have to be restored artificially (see e.g. EP15190783.9 as referred to above).
[0167] For that reason, for high signal to noise ratio situations, where intelligibility degradation is not a problem, it is better to present the processed signals originally recorded at the hearing instrument microphones. On the other hand, if the SNR is very poor, it may be an advantage to trade the spatial cues for a better signal to noise ratio.
[0168] In order to decide which signal is the best to present to the listener in a given situation, a speech intelligibility model may be used. Most existing speech intelligibility models are monaural, see e.g. the one described in [Taal et al., 2011], while a few existing ones work on binaural signals, e.g. [Beutelmann&Brand; 2006]. For the idea presented in the present application, better performance is expected with a binaural model, but the basic idea does not require a binaural model. Most speech intelligibility models assume that a clean reference is available. Based on this clean reference signal and the noisy (and potentially processed) signal, it is possible to predict the speech intelligibility of the noisy/processed signal. With the wireless microphone situation described above and depicted in
[0169] So far, a binary choice between presenting 1) the speech signal picked up by the hearing instrument microphones, and 2) the speech signal picked up by the wireless microphone has been discussed. It may be useful to generalize this idea. Specifically, one could present an appropriate combination of the two signals. In particular, for linear combinations, the presented signal u.sub.local is given by
u.sub.local=a*y.sub.local+(1−a)*x.sub.wireless,
where y.sub.local is the microphone signal of the hearing aid user (local=left or right), and x.sub.wireless is the signal (=signal x in
u.sub.left=a.sub.l*y.sub.left+(1−a.sub.l)*x.sub.lr, and
u.sub.right=a.sub.r*y.sub.right+(1−a.sub.r)*x.sub.lr.
[0170] The left and right mixing units MIXl, MIXr are configured to apply mixing constants a.sub.l, a.sub.r as indicated in the above equations via mixing control signals mx.sub.l, mx.sub.r.
[0171] In an embodiment, the binaural hearing system is configured to provide that 0<a.sub.l, a.sub.r<1. In an embodiment, the binaural hearing system is configured to provide that 0≦a.sub.l, a.sub.r≦1.
[0172] In an embodiment, a.sub.l=a.sub.r=a, and a is determined from the binaural speech intelligibility model, so that
u.sub.left=a*y.sub.left+(1−a)*x.sub.lr, and
u.sub.right=a*y.sub.right+(1−a)*x.sub.lr.
[0173] Thus the mixing control signals mx.sub.l, mx.sub.r (cf.
[0174] In an embodiment, the binaural hearing system is configured to provide that 0<a<1. In an embodiment, the binaural hearing system is configured to provide that 0≦a≦1.
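The linear combination with a common mixing constant a, and its selection by means of an intelligibility predictor, can be sketched as below. This is a minimal sketch under assumptions: `choose_mixing_constant` and `make_toy_predictor` are hypothetical names, and the toy predictor (negative squared error against a reference) is only a stand-in for the binaural speech intelligibility predictor, which is not implemented here.

```python
def mix(a, y_local, x_wireless):
    """u = a*y + (1-a)*x, sample by sample."""
    return [a * y + (1.0 - a) * x for y, x in zip(y_local, x_wireless)]


def choose_mixing_constant(predictor, y_left, y_right, x_lr, candidates):
    """Pick the candidate a whose mixed signals score highest."""
    best_a, best_score = None, None
    for a in candidates:
        u_left = mix(a, y_left, x_lr)
        u_right = mix(a, y_right, x_lr)
        score = predictor(u_left, u_right)
        if best_score is None or score > best_score:
            best_a, best_score = a, score
    return best_a, best_score


def make_toy_predictor(reference):
    """Stand-in predictor (assumption): higher when closer to reference."""
    def predictor(u_left, u_right):
        err = sum((u - r) ** 2 for u, r in zip(u_left, reference))
        err += sum((u - r) ** 2 for u, r in zip(u_right, reference))
        return -err
    return predictor


# Usage: a = 0 presents only the wireless signal, a = 1 only the
# hearing aid microphone signals; values in between blend the two.
x = [1.0, -1.0, 0.5]
y_l = [1.5, -0.5, 0.0]
y_r = [0.5, -1.5, 1.0]
best_a, best_score = choose_mixing_constant(
    make_toy_predictor(x), y_l, y_r, x, [0.0, 0.5, 1.0])
```

The binary choice between the two signals discussed earlier corresponds to restricting the candidates to 0 and 1. Note that the toy predictor ignores spatial cues, so it always favors the wireless signal; a binaural predictor would not necessarily do so.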
[0175] In an embodiment, the mixing constant(s) is(are) adaptively determined based on an estimate of the resulting left and right signals u.sub.left and u.sub.right based on an optimization of the speech intelligibility predictor provided by the BSIP unit. An embodiment of a binaural hearing system implementing an adaptive optimization of the mixing ratio of clean and noisy versions of the target signal is described in the following (
[0180] In the embodiment of
[0181] Each of the hearing aids (HD.sub.L, HD.sub.R) comprises two microphones, a signal processing unit (SPU), a mixing unit (MIX), and a loudspeaker (SP.sub.l, SP.sub.r). Additionally, one or both of the hearing aids comprise a binaural speech intelligibility unit (BSIP). The two microphones of each of the left and right hearing aids each pick up a potentially noisy (time varying) signal y(t) (cf. y.sub.1,left, y.sub.2,left and y.sub.1,right, y.sub.2,right in
[0182] Based on the binaural speech intelligibility prediction system (BSIP), the signal processing units (SPU) of each hearing aid may be (individually) adapted (cf. control signals d.sub.bin,l, d.sub.bin,r). Since, in the embodiment of
[0183] In
[0184]
S1. Providing or receiving a target signal comprising speech in a) left and right essentially noise-free versions x.sub.l, x.sub.r and in b) left and right noisy and/or processed versions y.sub.l, y.sub.r, said signals being received or being representative of acoustic signals as received at left and right ears of a listener;
S2. Providing time-frequency representations x.sub.l(k,m) and y.sub.l(k,m) of said left noise-free version x.sub.l and said left noisy and/or processed version y.sub.l of the target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
S3. Providing time-frequency representations x.sub.r(k,m) and y.sub.r(k,m) of said right noise-free version x.sub.r and said right noisy and/or processed version y.sub.r of the target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
S4. Receiving and relatively time shifting and amplitude adjusting the left and right noise-free versions x.sub.l(k,m) and x.sub.r(k,m), respectively, and subsequently subtracting the time shifted and amplitude adjusted left and right noise-free versions x′.sub.l(k,m) and x′.sub.r(k,m), respectively, of the target signals from each other, and providing a resulting noise-free signal x(k,m);
S5. Receiving and relatively time shifting and amplitude adjusting the left and right noisy and/or processed versions y.sub.l(k,m) and y.sub.r(k,m), respectively, and subsequently subtracting the time shifted and amplitude adjusted left and right noisy and/or processed versions y′.sub.l(k,m) and y′.sub.r(k,m), respectively, of the target signals from each other, and providing a resulting noisy and/or processed signal y(k,m);
S6. Providing a final binaural speech intelligibility predictor value SI measure indicative of the listener's perception of said noisy and/or processed versions y.sub.l, y.sub.r of the target signal based on said resulting noise-free signal x(k,m) and said resulting noisy and/or processed signal y(k,m);
S7. Repeating steps S4-S6 to optimize the final binaural speech intelligibility predictor value SI measure to indicate a maximum intelligibility of said noisy and/or processed versions y.sub.l, y.sub.r of the target signal by said listener.
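The time shift, amplitude adjustment and subtraction of steps S4 and S5 can be sketched for one time frame of time-frequency coefficients. The sketch assumes that the level adjustment γ (in dB) and time shift τ (in seconds) are split symmetrically between the two ears through a complex factor λ, and it omits the internal noise terms Δγ and Δτ. The name `ec_stage` and the numeric values are hypothetical.

```python
import cmath
import math


def ec_stage(s_left, s_right, omegas, gamma_db, tau):
    """Equalize and cancel one frame: out(k) = lam*s_l(k) - s_r(k)/lam.

    s_left, s_right: complex coefficients per frequency bin k.
    omegas: angular frequency (rad/s) of each bin.
    gamma_db, tau: relative level (dB) and time shift (s) applied.
    """
    out = []
    for s_l, s_r, omega in zip(s_left, s_right, omegas):
        # Assumed symmetric split: half the gain and delay per ear.
        lam = 10.0 ** (gamma_db / 40.0) * cmath.exp(1j * omega * tau / 2.0)
        out.append(lam * s_l - s_r / lam)
    return out


# Usage: the right-ear signal is a scaled, delayed copy of the left.
omegas = [2.0 * math.pi * f for f in (250.0, 500.0, 1000.0)]
x_l = [1.0 + 0.5j, -0.3 + 2.0j, 0.7 - 1.1j]
g, delay = 2.0, 0.0005
x_r = [g * cmath.exp(-1j * w * delay) * s for w, s in zip(omegas, x_l)]

# Parameters that exactly equalize the two ears cancel the component.
x = ec_stage(x_l, x_r, omegas, 20.0 * math.log10(g), -delay)
```

The usage lines only demonstrate the equalize-and-cancel mechanics. In the method, the same operation would be applied to the clean and to the noisy/processed signals, and step S7 would select the parameters that maximize the predictor rather than those that cancel a particular component.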
[0185] It is intended that the structural features of the devices described above, either in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.
[0186] As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, but intervening elements may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.
[0187] It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.
[0188] The claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.
[0189] Accordingly, the scope should be judged in terms of the claims that follow.
REFERENCES
[0190] [Andersen et al.; 2015] A. H. Andersen, J. M. de Haan, Z. Tan, and J. Jensen, “A binaural short time objective intelligibility measure for noisy and enhanced speech,” in INTERSPEECH, Dresden, Germany, September 2015, pp. 2563-2567.
[0191] [Andersen et al.; 2016] A. H. Andersen, J. M. de Haan, Z. Tan, and J. Jensen, “A method for predicting the intelligibility of noisy and non-linearly enhanced binaural speech,” to be presented at ICASSP 2016, Shanghai, China, 20-25 Mar. 2016; published in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4995-4999, 2016.
[0192] [ANSI S3.5-1997] American National Standards Institute, “S3.5-1997: Methods for calculation of the speech intelligibility index,” 1997.
[0193] [Beutelmann&Brand; 2006] R. Beutelmann and T. Brand, “Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am., vol. 120, pp. 331-342, 2006.
[0194] [Bronkhorst; 2000] A. W. Bronkhorst, “The cocktail party phenomenon: A review on speech intelligibility in multiple-talker conditions,” Acta Acustica United with Acustica, vol. 86, no. 1, pp. 117-128, January 2000.
[0195] [Falk et al.; 2015] T. H. Falk, V. Parsa, J. F. Santos, K. Arehart, O. Hazrati, R. Huber, J. M. Kates, and S. Scollie, “Objective quality and intelligibility prediction for users of assistive listening devices,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 114-124, March 2015.
[0196] [French & Steinberg; 1947] N. R. French and J. C. Steinberg, “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am., vol. 19, no. 1, pp. 90-119, January 1947.
[0197] [Durlach; 1963] N. I. Durlach, “Equalization and cancellation theory of binaural masking-level differences,” J. Acoust. Soc. Am., vol. 35, no. 8, pp. 1206-1218, August 1963.
[0198] [Durlach; 1972] N. I. Durlach, “Binaural signal detection: Equalization and cancellation theory,” in Foundations of Modern Auditory Theory Volume II, Jerry V. Tobias, Ed., pp. 371-462. Academic Press, New York, 1972.
[0199] [Taal et al.; 2011] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, pp. 2125-2136, 2011.
[0200] [Taal et al.; 2012] C. H. Taal, R. C. Hendriks, and R. Heusdens, “Matching pursuit for channel selection in cochlear implants based on an intelligibility metric,” in Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, August 2012, pp. 504-508.