MONAURAL INTRUSIVE SPEECH INTELLIGIBILITY PREDICTOR UNIT, A HEARING AID AND A BINAURAL HEARING AID SYSTEM
20170256269 · 2017-09-07
Assignee
Inventors
CPC classification
H04R25/606
ELECTRICITY
H04R25/554
ELECTRICITY
H04R2225/43
ELECTRICITY
H04R2225/55
ELECTRICITY
H04R25/70
ELECTRICITY
International classification
G10L21/02
PHYSICS
Abstract
A monaural intrusive speech intelligibility predictor unit comprises: first and second input units for providing time-frequency representations s(k,m) and x(k,m) of noise-free and noisy and/or processed versions of a target signal, respectively, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index; first and second envelope extraction units for providing time-frequency sub-band representations of the signals s.sub.j(m) and x.sub.j(m), j being a frequency sub-band index, j=1, 2, . . . , J; first and second time-frequency segment division units for dividing the time-frequency sub-band representations s.sub.j(m) and x.sub.j(m) into time-frequency segments S.sub.m and X.sub.m corresponding to a number N of successive samples of the sub-band signals; an intermediate speech intelligibility calculation unit adapted for providing intermediate speech intelligibility coefficients d.sub.m estimating an intelligibility of said time-frequency segment X.sub.m, based on said time-frequency segments S.sub.m and X.sub.m or normalized and/or transformed versions {tilde over (S)}.sub.m, and {tilde over (X)}.sub.m thereof; and a final monaural speech intelligibility calculation unit for calculating a final monaural speech intelligibility predictor d estimating an intelligibility of said noisy and/or processed version x of the target signal by combining said intermediate speech intelligibility coefficients d.sub.m, or a transformed version thereof, over time. A hearing aid comprises a monaural, intrusive intelligibility predictor unit, and a configurable signal processor adapted to control or influence the processing of one or more electric input signals representing environment sound to maximize the final speech intelligibility predictor d. A binaural hearing aid system comprises first and second hearing aids.
Claims
1. A monaural speech intelligibility predictor unit adapted for receiving a target signal comprising speech in an essentially noise-free version s and in a noisy and/or processed version x, the monaural speech intelligibility predictor unit being configured to provide as an output a final monaural speech intelligibility predictor value d indicative of a listener's perception of said noisy and/or processed version x of the target signal, the monaural speech intelligibility predictor unit comprising:
a first input unit for providing a time-frequency representation s(k,m) of said noise-free version s of the target signal, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
a second input unit for providing a time-frequency representation x(k,m) of said noisy and/or processed version x of the target signal, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
a first envelope extraction unit for providing a time-frequency sub-band representation s.sub.j(m) of the noise-free version s of the target signal representing temporal envelopes, or functions thereof, of frequency sub-band signals s.sub.j(m) of said noise-free target signal, j being a frequency sub-band index, j=1, 2, . . . , J, and m being the time index;
a second envelope extraction unit for providing a time-frequency sub-band representation x.sub.j(m) of the noisy and/or processed version x of the target signal representing temporal envelopes, or functions thereof, of frequency sub-band signals x.sub.j(m) of said noisy and/or processed version of the target signal, j=1, 2, . . . , J, and m being the time index;
a first time-frequency segment division unit for dividing said time-frequency sub-band representation s.sub.j(m) of the noise-free version s of the target signal into time-frequency segments S.sub.m corresponding to a number N of successive samples of said sub-band signals;
a second time-frequency segment division unit for dividing said time-frequency sub-band representation x.sub.j(m) of the noisy and/or processed version x of the target signal into time-frequency segments X.sub.m corresponding to a number N of successive samples of said sub-band signals;
a normalization and transformation unit configured to provide at least one normalization and/or transformation operation of rows and at least one normalization and/or transformation operation of columns of the time-frequency segments S.sub.m and X.sub.m;
an intermediate speech intelligibility calculation unit adapted for providing intermediate speech intelligibility coefficients d.sub.m estimating an intelligibility of said time-frequency segment X.sub.m, said intermediate speech intelligibility coefficients d.sub.m being based on said essentially noise-free, normalized and/or transformed time-frequency segments {tilde over (S)}.sub.m and said noisy and/or processed, normalized and/or transformed time-frequency segments {tilde over (X)}.sub.m; and
a final monaural speech intelligibility calculation unit for calculating a final monaural speech intelligibility predictor d estimating an intelligibility of said noisy and/or processed version x of the target signal by combining, e.g. by averaging, or by applying a MIN or MAX-function to, said intermediate speech intelligibility coefficients d.sub.m, or a transformed version thereof, over time.
2. A monaural speech intelligibility predictor unit according to claim 1 comprising a voice activity detector unit for indicating whether or not or to what extent a given time-segment of the essentially noise-free version s and the noisy and/or processed version x, respectively, of the target signal comprises or is estimated to comprise speech, and providing a voice activity control signal indicative thereof.
3. A monaural speech intelligibility predictor unit according to claim 1 comprising a voice activity detector unit for identifying time-segments of the essentially noise-free version s and the noisy and/or processed version x, respectively, of the target signal comprising or estimated to comprise speech, and wherein the monaural speech intelligibility predictor unit is configured to provide modified versions of the essentially noise-free version s and of the noisy and/or processed version x, respectively, of the target signal, said modified versions comprising only such time segments comprising speech or being estimated to comprise speech.
4. A monaural speech intelligibility predictor unit according to claim 1 comprising a hearing loss model unit configured to apply a modification of the said noisy and/or processed version x of the target signal reflecting a deviation from normal hearing of a relevant ear of the listener to provide a modified noisy and/or processed version x of the target signal for use together with said essentially noise-free version s of the target signal as a basis for calculating the final monaural speech intelligibility predictor d.
5. A monaural speech intelligibility predictor unit according to claim 4 wherein said hearing loss model unit is configured to add a statistically independent noise signal, which is spectrally shaped according to an audiogram of the relevant ear of the listener, to said noisy and/or processed version x of the target signal.
6. A monaural speech intelligibility predictor unit according to claim 1 adapted to extract said temporal envelope signals x.sub.j(m) and s.sub.j(m), respectively, as
x.sub.j(m)=f(√{square root over (Σ.sub.k=k1(j).sup.k2(j)|x(k,m)|.sup.2)}), j=1, . . . , J,
and correspondingly for s.sub.j(m), where k1(j) and k2(j) denote DFT bin indices corresponding to lower and higher cut-off frequencies of the j.sup.th sub-band, and where f(•) is a function.
7. A monaural speech intelligibility predictor unit according to claim 6 wherein the function f(•)=f(w), where w represents
8. A monaural speech intelligibility predictor unit according to claim 1 wherein said first and second time-frequency segment division units are configured to divide said time-frequency representations s.sub.j(m) and x.sub.j(m), respectively, into segments in the form of spectrograms corresponding to N successive samples of all sub-band signals, wherein the m.sup.th segment is defined by the J×N matrix whose j.sup.th row comprises the N successive sub-band samples z.sub.j(m−N+1), . . . , z.sub.j(m), j=1, . . . , J, z.sub.j representing s.sub.j and x.sub.j, respectively.
9. A monaural speech intelligibility predictor unit according to claim 1 comprising A first normalization and/or transformation unit adapted for providing normalized and/or transformed versions {tilde over (S)}.sub.m of said time-frequency segments S.sub.m; A second normalization and/or transformation unit adapted for providing normalized and/or transformed versions {tilde over (X)}.sub.m of said time-frequency segments X.sub.m;
10. A monaural speech intelligibility predictor unit according to claim 9, wherein the first and second normalization and/or transformation units are configured to apply one or more of the following algorithms to the time-frequency segments X.sub.m and S.sub.m, respectively, commonly denoted Z.sub.m, where the subscript time index m is skipped for simplicity in the following expressions: Normalization of rows to zero mean:
g.sub.1(Z)=Z−μ.sub.z.sup.r1.sup.T, where μ.sub.z.sup.r is a J×1 vector whose j′th entry is the mean of the j′th row of Z, hence the superscript r in μ.sub.z.sup.r, where 1 denotes an N×1 vector of ones, and where superscript T denotes matrix transposition; Normalization of rows to unit-norm:
g.sub.2(Z)=D.sup.r(Z)Z, where D.sup.r(Z)=diag(└1/√{square root over (Z(1,:)Z(1,:).sup.H)} . . . 1/√{square root over (Z(J,:)Z(J,:).sup.H)}┘), where diag(•) is a diagonal matrix with the elements of the arguments on the main diagonal, and where Z(j,:) denotes the j′th row of Z, such that D.sup.r(Z) is a J×J diagonal matrix with the inverse norm of each row on the main diagonal, and zeros elsewhere, the superscript H denotes Hermitian transposition, and pre-multiplication with D.sup.r(Z) normalizes the rows of the resulting matrix to unit-norm; Fourier transformation applied to each row
g.sub.3(Z)=ZF, where F is an N×N Fourier matrix; Fourier transformation applied to each row followed by computing the magnitude of the resulting complex-valued elements
g.sub.4(Z)=|ZF|, where |•| computes the element-wise magnitudes; The identity operator
g.sub.5(Z)=Z. Normalization of columns to zero mean:
h.sub.1(Z)=Z−1(μ.sub.z.sup.c).sup.T, where μ.sub.z.sup.c is an N×1 vector whose n′th entry is the mean of the n′th column of Z, and where 1 denotes a J×1 vector of ones; Normalization of columns to unit-norm:
h.sub.2(Z)=ZD.sup.c(Z), where D.sup.c(Z)=diag(└1/√{square root over (Z(:,1).sup.HZ(:,1))} . . . 1/√{square root over (Z(:,N).sup.HZ(:,N))}┘), where Z(:,n) denotes the n′th column of Z, such that D.sup.c(Z) is a diagonal N×N matrix with the inverse norm of each column on the main diagonal, and zeros elsewhere, and where a post-multiplication with D.sup.c(Z) normalizes the columns of the resulting matrix to unit-norm.
11. A monaural speech intelligibility predictor unit according to claim 1 wherein the intermediate speech intelligibility calculation unit is adapted to determine said intermediate speech intelligibility coefficients d.sub.m in dependence on a, e.g. linear, sample correlation coefficient d(a,b) of the elements in two K×1 vectors a and b, d(a,b) being defined by:
d(a,b)=Σ.sub.k(a(k)−μ.sub.a)(b(k)−μ.sub.b)/(√{square root over (Σ.sub.k(a(k)−μ.sub.a).sup.2)}√{square root over (Σ.sub.k(b(k)−μ.sub.b).sup.2)}),
where μ.sub.a and μ.sub.b denote the sample means of the elements of a and b, respectively.
12. A monaural speech intelligibility predictor unit according to claim 11 wherein the intermediate intelligibility index d.sub.m is defined as the average sample correlation coefficient of all columns in S.sub.m and X.sub.m, or {tilde over (S)}.sub.m and {tilde over (X)}.sub.m, respectively, i.e.,
d.sub.m=d({tilde over (S)}.sub.m(:),{tilde over (X)}.sub.m(:)), where the notation S.sub.m(:) and X.sub.m(:), or {tilde over (S)}.sub.m(:) and {tilde over (X)}.sub.m(:), represents NJ×1 vectors formed by stacking the columns of the respective matrices.
13. A monaural speech intelligibility predictor unit according to claim 1 wherein the final speech intelligibility calculation unit is adapted to calculate the final speech intelligibility predictor d from the intermediate speech intelligibility coefficients d.sub.m, optionally transformed by a function u(d.sub.m), as an average over time of said information signal x:
d=(1/M)Σ.sub.m=1.sup.M u(d.sub.m),
where M denotes the number of intermediate speech intelligibility coefficients considered.
14. A monaural speech intelligibility predictor unit according to claim 13 wherein function u(d.sub.m) is defined as
u(d.sub.m)=d.sub.m.
15. A hearing aid adapted for being located at or in left or right ears of a user, or for being fully or partially implanted in the head of the user, the hearing aid comprising a monaural speech intelligibility predictor unit according to claim 1.
16. A hearing aid according to claim 15 configured to adaptively modify the processing of an input signal to the hearing aid to maximize the monaural speech intelligibility predictor d, to enhance the user's intelligibility of an output signal of the hearing aid presented to the user.
17. A binaural hearing system comprising left and right hearing aids according to claim 15, wherein each of the left and right hearing aids comprises antenna and transceiver circuitry for allowing a communication link to be established and information to be exchanged between said left and right hearing aids, the binaural hearing system further comprising a binaural speech intelligibility prediction unit for providing a final binaural speech intelligibility measure d.sub.binaural of the predicted speech intelligibility of the user, when exposed to said sound input, based on the monaural speech intelligibility predictor values d.sub.left, d.sub.right of the respective left and right hearing aids.
18. A binaural hearing system according to claim 17, wherein the respective configurable signal processors of the left and right hearing aids are adapted to control or influence the processing of the respective electric input signals to maximize said final binaural speech intelligibility measure d.sub.binaural.
19. A method of providing a monaural speech intelligibility predictor for estimating a user's ability to understand an information signal x comprising a noisy and/or processed version of a target speech signal, the method comprising:
providing a time-frequency representation s(k,m) of a noise-free version s of the target signal, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
providing a time-frequency representation x(k,m) of said noisy and/or processed version x of the target signal, k being a frequency bin index, k=1, 2, . . . , K, and m being a time index;
providing a time-frequency sub-band representation s.sub.j(m) of the noise-free version s of the target signal representing temporal envelopes, or functions thereof, of frequency sub-band signals s.sub.j(m) of said noise-free target signal, j being a frequency sub-band index, j=1, 2, . . . , J, and m being the time index;
providing a time-frequency sub-band representation x.sub.j(m) of the noisy and/or processed version x of the target signal representing temporal envelopes, or functions thereof, of frequency sub-band signals x.sub.j(m) of said noisy and/or processed version of the target signal, j=1, 2, . . . , J, and m being the time index;
dividing said time-frequency sub-band representation s.sub.j(m) of the noise-free version s of the target signal into time-frequency segments S.sub.m corresponding to a number N of successive samples of said sub-band signals;
dividing said time-frequency sub-band representation x.sub.j(m) of the noisy and/or processed version x of the target signal into time-frequency segments X.sub.m corresponding to a number N of successive samples of said sub-band signals;
providing at least one normalization and/or transformation operation of rows and at least one normalization and/or transformation operation of columns of the time-frequency segments S.sub.m and X.sub.m;
providing intermediate speech intelligibility coefficients d.sub.m estimating an intelligibility of said time-frequency segment X.sub.m, said intermediate speech intelligibility coefficients d.sub.m being based on said essentially noise-free, normalized and/or transformed time-frequency segments {tilde over (S)}.sub.m and said noisy and/or processed, normalized and/or transformed time-frequency segments {tilde over (X)}.sub.m; and
calculating a final monaural speech intelligibility predictor d estimating an intelligibility of said noisy and/or processed version x of the target signal by combining, e.g. by averaging or applying a MIN or MAX-function, said intermediate speech intelligibility coefficients d.sub.m, or a transformed version thereof, over time.
20. A non-transitory computer-readable medium storing a computer program comprising instructions which, when executed by a computer, cause the computer to carry out the method of claim 19.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0111] The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity; they show only details that improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effects will be apparent from and elucidated with reference to the illustrations described hereinafter.
[0129] Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
DETAILED DESCRIPTION OF EMBODIMENTS
[0130] The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practised without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as "elements"). Depending upon the particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer programs, or any combination thereof.
[0131] The electronic hardware may include microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
[0132] The present application relates to the field of hearing aids or hearing aid systems.
[0133] The present disclosure relates to signal processing methods for predicting the intelligibility of speech, e.g., the output signal of a signal processing device such as a hearing aid. The intelligibility prediction is made in the form of an index that correlates highly with the fraction of words that an average listener would be able to understand from some speech material. For situations where an estimate of absolute intelligibility, i.e., the actual percentage of words understood, is desired, this index may be transformed to a number in the range 0-100 percent, see e.g. [3] for one method to do this.
[0134] The method proposed here belongs to the class of so-called intrusive methods. Methods in this class are characterized by the fact that they make their intelligibility prediction by comparing the noisy, and potentially signal processed, speech signal with a noise-free, undistorted version of the underlying speech signal; see [1, 2, 3] for examples of existing methods. The assumption that a noise-free reference signal is available is reasonable in many practically relevant situations. For example, when evaluating the impact of various hearing aid signal processing algorithms on intelligibility, one normally conducts a listening test with human subjects. In preparing such a test, the stimuli are often created artificially by explicitly adding noise signals to noise-free speech signals; in other words, noise-free signals are readily available. Hence, the proposed intelligibility prediction algorithm allows one to replace a costly and time-consuming listening test involving human subjects with machine predictions.
[0135] Much of the signal processing of the present disclosure is performed in the time-frequency domain, where a time domain signal is transformed into the (time-)frequency domain by a suitable mathematical algorithm (e.g. a Fourier transform algorithm) or filter (e.g. a filter bank).
[0138] In the present application, a number J of (non-uniform) frequency sub-bands with sub-band indices j=1, 2, . . . , J is defined, each sub-band comprising one or more DFT-bins (cf. the vertical sub-band j-axis in the figures).
[0144] The monaural speech intelligibility predictor unit (MSIP) comprises a first input unit (IU) for providing a time-frequency representation s(k,m) of said noise-free version s of the target signal from the time variant signal s(n), and a second input unit (IU) for providing a time-frequency representation x(k,m) of the noisy and/or processed version x of the target signal from the time variant signal x(n), k being a frequency bin index, k=1, 2, . . . , K, and m being a time index.
[0145] The monaural speech intelligibility predictor unit (MSIP) further comprises a first envelope extraction unit (AEU) for providing a time-frequency sub-band representation s.sub.j(m) of the noise-free version s of the target signal representing temporal envelopes, or functions thereof, of frequency sub-band signals s.sub.j(m) of said noise-free target signal from the time-frequency representation s(k,m), and a second envelope extraction unit (AEU) for providing a time-frequency sub-band representation x.sub.j(m) of the noisy and/or processed version x of the target signal representing temporal envelopes, or functions thereof, of frequency sub-band signals x.sub.j(m) of said noisy and/or processed version of the target signal from the time-frequency representation x(k,m), j=1, 2, . . . , J, and m being the time index.
[0146] The monaural speech intelligibility predictor unit (MSIP) further comprises a first time-frequency segment division unit (SDU) for dividing said time-frequency sub-band representation s.sub.j(m) of the noise-free version s of the target signal into time-frequency segments S.sub.m corresponding to a number N of successive samples of the sub-band signals s.sub.j(m), and a second time-frequency segment division unit (SDU) for dividing the time-frequency sub-band representation x.sub.j(m) of the noisy and/or processed version x of the target signal into time-frequency segments X.sub.m corresponding to a number N of successive samples of the sub-band signals x.sub.j(m).
[0147] The monaural speech intelligibility predictor unit (MSIP) further optionally comprises a first normalization and/or transformation unit (N/TU) adapted for providing normalized and/or transformed versions {tilde over (S)}.sub.m of the time-frequency segments S.sub.m, and optionally a second normalization and/or transformation unit (N/TU) adapted for providing normalized and/or transformed versions {tilde over (X)}.sub.m of the time-frequency segments X.sub.m.
[0148] The monaural speech intelligibility predictor unit (MSIP) further comprises an intermediate speech intelligibility calculation unit (ISIU) adapted for providing intermediate speech intelligibility coefficients d.sub.m estimating an intelligibility of the time-frequency segment X.sub.m, wherein the intermediate speech intelligibility coefficients d.sub.m are based on the essentially noise-free, optionally normalized and/or transformed, time frequency segments S.sub.m, {tilde over (S)}.sub.m, and the noisy and/or processed, optionally normalized and/or transformed, time-frequency segments X.sub.m, {tilde over (X)}.sub.m.
[0149] The monaural speech intelligibility predictor unit (MSIP) further comprises a final monaural speech intelligibility calculation unit (FSIU) for calculating a final monaural speech intelligibility predictor d estimating an intelligibility of the noisy and/or processed version x of the target signal by combining, e.g. by averaging or applying a MIN or MAX-function, the intermediate speech intelligibility coefficients d.sub.m, or a transformed version thereof, over time.
[0151] In order to simulate the potential decrease in intelligibility due to a hearing loss, an optional hearing loss model is included.
[0152] The proposed monaural, intrusive speech intelligibility predictor may be decomposed into a number of sub-stages, as described in the following.
Voice Activity Detection (VAD).
[0153] Speech intelligibility (SI) relates to regions of the input signal with speech activity; silence regions do not contribute to SI. Hence, the first step is to detect voice activity regions in the input signals. Since the noise-free speech signal s′(n) is available, voice activity detection is trivial. For example, in [3] the noise-free speech signal s′(n) was divided into successive frames. Speech-active frames were then identified as the ones with a frame energy within e.g. 40 dB of the frame with maximum energy. The speech-inactive frames, i.e., the ones with an energy more than e.g. 40 dB below the maximum frame energy, are then discarded from both signals, x′(n) and s′(n). Let us denote the input signals with speech activity by x(n) and s(n), respectively, where n is a discrete-time index. A voice activity detector is shown in the figures.
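The energy-based voice activity detection described above can be sketched in Python as follows. The non-overlapping frames and the 256-sample frame length are illustrative assumptions for brevity; the text (and [3]) leaves the exact framing open.

```python
import numpy as np

def remove_silent_frames(s, x, frame_len=256, dyn_range_db=40.0):
    """Energy-based VAD sketch: keep only frames of the noise-free signal s
    whose energy lies within dyn_range_db of the most energetic frame, and
    discard the same frames from the noisy/processed signal x."""
    n_frames = min(len(s), len(x)) // frame_len
    usable = n_frames * frame_len
    s, x = s[:usable], x[:usable]
    # Per-frame energy of the clean signal, in dB (epsilon avoids log(0)).
    frames = s.reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    keep = energy_db >= energy_db.max() - dyn_range_db
    mask = np.repeat(keep, frame_len)
    return s[mask], x[mask]
```

Note that the frame selection is driven entirely by the clean signal s; the same frames are then dropped from x so the two signals stay time-aligned.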
Frequency Decomposition (IU) and Envelope Extraction (AEU)
[0154] The first step is to perform a frequency decomposition of the input signals (cf. input unit IU in the figures), followed by extraction of the temporal envelopes of the resulting sub-band signals (cf. envelope extraction unit AEU in the figures).
[0155] As an example, we describe in the following how the frequency decomposition and envelope extraction can be achieved using an STFT; the described procedure is similar to the one in [3]. Let us assume, as an example, that signals are sampled with a frequency of f.sub.s=10000 Hz. First, a time-frequency representation is obtained by segmenting signals x(n) and s(n) into (e.g. 50%) overlapping, windowed frames.
[0156] Temporal envelope signals may then be extracted as
x.sub.j(m)=ƒ(√{square root over (Σ.sub.k=k1(j).sup.k2(j)|x(k,m)|.sup.2)}), j=1, . . . , J, m=1, . . . , M,
where k1(j) and k2(j) denote DFT bin indices corresponding to lower and higher cut-off frequencies of the j.sup.th sub-band, J is the number of sub-bands (e.g. 16), and M is the number of signal frames in the signal in question, and where the function ƒ(x) is included for generality. For example, for ƒ(x)=x, we get the temporal envelope used in [4]; with ƒ(x)=x.sup.2, we extract power envelopes; and with ƒ(x)=2 log x or ƒ(x)=x.sup.β, 0<β<2, we can model the compressive non-linearity of the healthy cochlea. It should be clear that other reasonable choices for ƒ(x) exist. Temporal envelope signals s.sub.j(m) for the noise-free speech signal are found in a similar manner. The same choice of ƒ(x) may be used in both cases.
[0157] As mentioned, other envelope representations may be implemented, e.g., using a Gammatone filterbank, followed by a Hilbert envelope extractor, etc., and functions ƒ(x) may be applied to these envelopes in a similar manner as described above for STFT based envelopes. In any case, the result of this procedure is a time-frequency representation in terms of sub-band temporal envelopes, x.sub.j(m) and s.sub.j(m), where j is a sub-band index, and m is a time index.
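The STFT-based decomposition and envelope extraction above can be sketched as follows. The Hann window, frame length, hop size, and one-third-octave band edges starting at 150 Hz are illustrative assumptions; the text only fixes f.sub.s=10000 Hz, 50% overlap, and J=16 as examples.

```python
import numpy as np

def subband_envelopes(x, fs=10000, frame_len=256, hop=128, n_bands=16,
                      f_min=150.0, f=lambda w: w):
    """Sketch of STFT-based sub-band temporal envelope extraction:
    Hann-windowed frames with 50% overlap, DFT, then per-frame energy
    collected over the DFT bins of each sub-band and passed through the
    optional compressive function f (identity by default)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.stack([np.fft.rfft(win * x[m * hop:m * hop + frame_len])
                     for m in range(n_frames)], axis=1)        # K x M
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    # Illustrative one-third-octave band edges: f_min * 2**(j/3).
    edges = f_min * 2.0 ** (np.arange(n_bands + 1) / 3.0)
    env = np.zeros((n_bands, n_frames))
    for j in range(n_bands):
        bins = (freqs >= edges[j]) & (freqs < edges[j + 1])
        env[j] = f(np.sqrt(np.sum(np.abs(spec[bins]) ** 2, axis=0)))
    return env
```

The returned matrix holds the sub-band envelopes x.sub.j(m), one row per sub-band j and one column per frame m.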
Time-Frequency Segments (SDU)
[0158] Next, we divide the time-frequency representations x.sub.j(m) and s.sub.j(m) into segments, i.e., spectrograms corresponding to N successive samples of all sub-band signals. For example, the m.sup.th segment for the noisy/processed signal is defined by the J×N matrix X.sub.m whose j.sup.th row is └x.sub.j(m−N+1) x.sub.j(m−N+2) . . . x.sub.j(m)┘, j=1, . . . , J.
[0159] The corresponding segment S.sub.m for the noise-free reference signal is found in an identical manner.
[0160] It should be understood that other versions of the time-segments could be used, e.g. segments shifted in time to operate on frame indices m−N/2+1 through m+N/2.
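The segment construction above can be sketched as a simple slicing operation; the default segment length N=30 is an illustrative assumption, not a value fixed by the text.

```python
import numpy as np

def tf_segment(env, m, N=30):
    """Extract the J x N time-frequency segment ending at (0-based) frame
    index m, i.e. columns m-N+1 .. m of the sub-band envelope matrix,
    matching the spectrogram-segment definition above."""
    if m < N - 1:
        raise ValueError("segment extends before the first frame")
    return env[:, m - N + 1:m + 1]
```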
Normalizations and Transformation of Time-Frequency Segments (N/TU)
[0161] The rows and columns of each segment X.sub.m and S.sub.m may be normalized/transformed in various ways (below, we show the normalizations/transformations as applied to X.sub.m; they are applied to S.sub.m in a completely analogous manner, and the same normalization/transformation is applied to both X.sub.m and S.sub.m). In particular, we consider the following row (R) normalizations/transformations:
R1) Normalization of rows to zero mean:
g.sub.1(X)=X−μ.sub.x.sup.r1.sup.T,
where μ.sub.x.sup.r is a J×1 vector whose j.sup.th entry is the mean of the j.sup.th row of X (hence the superscript r in μ.sub.x.sup.r), and where 1 denotes an N×1 vector of ones.
R2) Normalization of rows to unit-norm:
g.sub.2(X)=D.sup.r(X)X,
where
D.sup.r(X)=diag(└1/√{square root over (X(1,:)X(1,:).sup.H)} . . . 1/√{square root over (X(J,:)X(J,:).sup.H)}┘),
and where diag(•) is a diagonal matrix with the elements of the arguments on the main diagonal. Furthermore, X(j,:) denotes the j.sup.th row of X, such that D.sup.r(X) is a J×J diagonal matrix with the inverse norm of each row on the main diagonal, and zeroes elsewhere (the superscript H denotes Hermitian transposition). Pre-multiplication with D.sup.r(X) normalizes the rows of the resulting matrix to unit-norm.
R3) Fourier transformation applied to each row
g.sub.3(X)=XF,
where F is an N×N Fourier matrix.
R4) Fourier transformation applied to each row followed by computing the magnitude of the resulting complex-valued elements
g.sub.4(X)=|XF|,
where |•| computes the element-wise magnitudes.
R5) The identity operator
g.sub.5(X)=X.
[0162] We consider the following column (C) normalizations
C1) Normalization of columns to zero mean:
h.sub.1(X)=X−1(μ.sub.x.sup.c).sup.T,
where μ.sub.x.sup.c is an N×1 vector whose i.sup.th entry is the mean of the i.sup.th column of X, and where 1 denotes a J×1 vector of ones.
C2) Normalization of columns to unit-norm:
h.sub.2(X)=XD.sup.c(X), where
D.sup.c(X)=diag(└1/√{square root over (X(:,1).sup.HX(:,1))} . . . 1/√{square root over (X(:,N).sup.HX(:,N))}┘).
[0163] Here X(:,n) denotes the n.sup.th column of X, such that D.sup.c(X) is a diagonal N×N matrix with the inverse norm of each column on the main diagonal, and zeros elsewhere. Post-multiplication with D.sup.c(X) normalizes the columns of the resulting matrix to unit-norm.
[0164] The row—(R#, #=1, 2, . . . , 5) and column (C#, #=1, 2) normalizations/transformations listed above may be combined in different ways. In a preferred embodiment, at least one of row normalizations/transformations g.sub.i(•) (i=1, 2, . . . , 5) and at least one of the column normalizations/transformations h.sub.j(•) (j=1, 2) is applied (in any order).
[0165] One combination of particular interest is where, first, the rows are normalized to zero-mean and unit-norm, followed by a similar mean and norm normalization of the columns. This particular combination may be written as
{tilde over (X)}.sub.m=h.sub.2(h.sub.1(g.sub.2(g.sub.1(X.sub.m)))),
where X.sub.m is the resulting row- and column normalized matrix.
[0166] Another transformation of interest is to compute the magnitude Fourier spectrum of each row of matrix X.sub.m followed by mean- and norm-normalization of the resulting columns. With the introduced notation, this may be written simply as
{tilde over (X)}.sub.m=h.sub.2(h.sub.1(g.sub.3(X.sub.m))).
[0167] Other combinations of these normalizations/transformations may be of interest, e.g.,
{tilde over (X)}.sub.m=g.sub.2(g.sub.1(h.sub.2(h.sub.1(X.sub.m))))
(mean- and norm-standardization of the columns followed by mean- and norm-standardization of the rows), etc. As mentioned, a particular combination of row- and column-normalizations/transformations is chosen and applied to all segments X.sub.m and S.sub.m of the noisy/processed and noise-free signal, respectively.
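The first combination above, rows normalized to zero mean and unit norm followed by the same normalization of the columns, can be sketched as a self-contained numpy snippet (function names are illustrative stand-ins for g.sub.1, g.sub.2, h.sub.1, h.sub.2):

```python
import numpy as np

def row_zero_mean(X):  # g1: rows to zero mean
    return X - X.mean(axis=1, keepdims=True)

def row_unit_norm(X):  # g2: rows to unit norm
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def col_zero_mean(X):  # h1: columns to zero mean
    return X - X.mean(axis=0, keepdims=True)

def col_unit_norm(X):  # h2: columns to unit norm
    return X / np.linalg.norm(X, axis=0, keepdims=True)

def normalize_segment(X):
    """X~_m = h2(h1(g2(g1(X_m)))); the same chosen chain is applied to
    every segment X_m of the noisy/processed signal and S_m of the
    noise-free signal."""
    return col_unit_norm(col_zero_mean(row_unit_norm(row_zero_mean(X))))
```

Note that after the full chain the columns are guaranteed to have zero mean and unit norm, while the earlier row normalization is generally perturbed by the subsequent column steps.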
Estimation of Intermediate Intelligibility Coefficients (ISIU)
[0168] The time-frequency segments S.sub.m or the normalized/transformed time-frequency segments {tilde over (S)}.sub.m of the noise-free reference signal may now be used together with the corresponding noisy/processed segments X.sub.m, {tilde over (X)}.sub.m to compute an intermediate intelligibility index d.sub.m, reflecting the intelligibility of the noisy/processed signal segment X.sub.m, {tilde over (X)}.sub.m. To do so, let us first define the sample correlation coefficient d(x,y) of the elements in two K×1 vectors x and y:
d(x,y)=Σ.sub.k=1.sup.K(x.sub.k−μ.sub.x)(y.sub.k−μ.sub.y)/(√{square root over (Σ.sub.k=1.sup.K(x.sub.k−μ.sub.x).sup.2)}√{square root over (Σ.sub.k=1.sup.K(y.sub.k−μ.sub.y).sup.2)}),
where μ.sub.x and μ.sub.y denote the sample means of the entries of x and y, respectively.
[0169] Several options exist for computing the intermediate intelligibility index d.sub.m. In particular, d.sub.m may be defined as [0170] 1) the average sample correlation coefficient of the columns in {tilde over (S)}.sub.m and {tilde over (X)}.sub.m, i.e.,
d.sub.m=(1/N)Σ.sub.n=1.sup.N d({tilde over (S)}.sub.m(:,n),{tilde over (X)}.sub.m(:,n)),
or [0171] 2) the average sample correlation coefficient of the rows in {tilde over (S)}.sub.m and {tilde over (X)}.sub.m, i.e.,
d.sub.m=(1/J)Σ.sub.j=1.sup.J d({tilde over (S)}.sub.m(j,:).sup.T,{tilde over (X)}.sub.m(j,:).sup.T),
or [0172] 3) the sample correlation coefficient of all elements in {tilde over (S)}.sub.m and {tilde over (X)}.sub.m, i.e.,
d.sub.m=d({tilde over (S)}.sub.m(:),{tilde over (X)}.sub.m(:)),
where we adopted the notation {tilde over (S)}.sub.m (:) and {tilde over (X)}.sub.m(:) to represent NJ×1 vectors formed by stacking the columns of the respective matrices.
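The three options can be sketched as follows (numpy; `corr` implements the sample correlation coefficient d(x, y), the remaining function names are illustrative):

```python
import numpy as np

def corr(x, y):
    """Sample correlation coefficient d(x, y) of two equal-length vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def d_columns(S, X):
    """Option 1: average correlation of corresponding columns."""
    return float(np.mean([corr(S[:, n], X[:, n]) for n in range(S.shape[1])]))

def d_rows(S, X):
    """Option 2: average correlation of corresponding rows."""
    return float(np.mean([corr(S[j, :], X[j, :]) for j in range(S.shape[0])]))

def d_all(S, X):
    """Option 3: correlation of all elements via the stacked columns
    S(:) and X(:)."""
    return corr(S.ravel(order="F"), X.ravel(order="F"))
```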
Estimation of Final Intelligibility Coefficient (FSIU)
[0173] The final intelligibility coefficient d, which reflects the intelligibility of the noisy/processed input signal x(n), is defined as the average of the intermediate intelligibility coefficients, potentially transformed via a function u(d.sub.m), across the duration of the speech-active parts of x(n), i.e.,
d=(1/M)Σ.sub.m=1.sup.M u(d.sub.m),
where the sum runs over the M speech-active segments of x(n).
[0174] The function u(d.sub.m) could, for example, be chosen to link the intermediate intelligibility coefficients to information measures, but it should be clear that other choices exist.
[0175] The “do-nothing” function u(d.sub.m)=d.sub.m is also a possible choice (it has previously been used in the STOI algorithm [3]).
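Under these definitions the final predictor reduces to a mean of the (optionally transformed) intermediate coefficients; with the default identity ("do-nothing") transform this is the STOI-style average (a minimal sketch, illustrative names):

```python
import numpy as np

def final_predictor(d_m, u=lambda d: d):
    """Final intelligibility coefficient d: average of u(d_m) over the
    intermediate coefficients d_m of the speech-active segments.  The
    default u is the identity ("do-nothing") transform used in STOI."""
    d_m = np.asarray(d_m, dtype=float)
    return float(np.mean(u(d_m)))
```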
[0176] In the following, a noisy/reverberant speech signal x(n) which potentially has been passed through a signal processing device, e.g. in a hearing aid, is considered. An algorithm is proposed, which can predict the average intelligibility of x(n), as perceived by a group of listeners with similar hearing profiles, e.g. normal hearing or hearing impaired listeners. To achieve this, the proposed algorithm relies on the presence of the noise-free, undistorted underlying signal s(n), see
[0177]
[0178] The embodiment of
[0179]
[0180]
[0181]
[0182] The hearing aid (HD) used in the two scenarios of
[0183]
[0184] The clean target signal s is transmitted from the CELL PHONE to the hearing aid HD. The background noise v′ (Noise v′) of the car cabin is captured by the microphone(s) (IT) of the hearing aid. It can be assumed that the background noise v′ as captured is substantially equal to the noise v.sub.ed (Noise v.sub.ed) that is present at the ear drum (Ear drum) of the user (cf.
[0185]
[0186] The basic idea of the embodiment of a hearing aid in
[0187] Using a model of speech intelligibility (e.g. as disclosed in the present disclosure) in the configuration of
[0188] Preferably, the loudspeaker (or alternatively an acoustic guide element) is located in the ear canal, preferably close to the ear drum to deliver the processed signal ƒ(s) to the ear drum. Preferably, the microphone(s) of the hearing device, which is(are) used to pick up background noise v′ (cf.
[0189] In the configuration of
[0190] As an alternative to using a speech intelligibility predictor to modify (optimize) s (or as an extreme option of the present disclosure), a simple increase of the gain of the clean target signal s (i.e., f(s)=g·s, g being a gain factor, e.g. g=10) may be used to increase the signal-to-noise ratio (SNR) at the ear drum (assuming a constant level of the background (cabin) noise v.sub.ed at the ear drum). In practice, relying only on increasing the gain of the clean target signal may, however, not be attractive or possible (e.g. due to acoustic feedback problems, maximum power output limitations of the loudspeaker, or uncomfortable levels for the user, etc.). Instead, an appropriate frequency-dependent shaping of the clean target signal is generally proposed, governed by the monaural speech intelligibility predictor (including the hearing loss model (HLM), which preferably defines decisive aspects of a hearing impairment of the user of the hearing aid).
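As a toy numerical illustration (signal values assumed for the example, not taken from the disclosure): with a constant cabin-noise level v.sub.ed at the ear drum, a flat gain f(s)=g·s raises the SNR at the ear drum by exactly 20·log10(g) dB, which is why gain alone helps, but also why it quickly runs into the output-level and feedback limits noted above:

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB between a signal sequence and a noise sequence."""
    return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)        # clean target s at the ear drum (toy data)
v = 0.5 * rng.standard_normal(1000)  # constant cabin noise v_ed (toy data)
g = 10.0                             # flat gain, f(s) = g*s

improvement = snr_db(g * s, v) - snr_db(s, v)  # 20*log10(g) = 20 dB
```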
[0191]
[0192] The hearing aid (HD) exemplified in
[0193] In an embodiment, the hearing aid (HD) comprises a directional microphone system (beamformer) adapted to enhance a target acoustic source among a multitude of acoustic sources in the local environment of the user wearing the hearing aid device. In an embodiment, the directional system is adapted to detect (such as adaptively detect) from which direction a particular part of the microphone signal originates.
[0194] The hearing aid of
[0195]
[0196] ’) together with an indication of the current noise level (indicated as ‘HIGH’)). The grey shaded button Lecture mode (as described in connection with
[0197]
[0198] It is intended that the structural features of the devices described above, either in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.
[0199] As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, but intervening elements may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.
[0200] It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.
[0201] The claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.
[0202] Accordingly, the scope should be judged in terms of the claims that follow.
REFERENCES
[0203] [1] American National Standards Institute, “ANSI S3.5, Methods for the Calculation of the Speech Intelligibility Index,” New York, 1995.
[0204] [2] K. S. Rhebergen and N. J. Versfeld, “A speech intelligibility index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am., vol. 117, no. 4, pp. 2181-2192, 2005.
[0205] [3] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125-2136, September 2011.
[0206] [4] B. C. J. Moore, “Cochlear Hearing Loss: Physiological, Psychological and Technical Issues,” Wiley, 2007.
[0207] [5] R. Beutelmann and T. Brand, “Prediction of intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am., vol. 120, no. 1, pp. 331-342, April 2006.