HEARING DEVICE COMPRISING A SPEECH INTELLIGIBILITY ESTIMATOR

20220400349 · 2022-12-15

Abstract

A hearing device, e.g. a hearing aid, comprises a) an input unit configured to provide at least one time-variant electric input signal representing sound, the at least one electric input signal comprising target signal components and optionally noise signal components, the target signal components originating from a target sound source; b) a signal processing unit for processing the at least one electric input signal and providing a processed signal; c) an output unit for creating output stimuli configured to be perceivable by the user as sound based on the processed signal from the signal processing unit; d) a speech presence probability prediction unit for repeatedly providing a measure of a predicted speech presence probability of the at least one electric input signal, or of a signal originating therefrom; and e) a speech intelligibility prediction unit for repeatedly providing a current measure of a predicted speech intelligibility of the at least one electric input signal, or of a signal originating therefrom. The speech intelligibility prediction unit is configured to determine said current measure of the predicted speech intelligibility in dependence of said measure of the predicted speech presence probability. A method of operating a hearing device is further disclosed. The invention may e.g. be used in hearing aids, headsets, earpieces (ear buds), etc.

Claims

1. A hearing device adapted for being worn by a user, the hearing device comprising an input unit configured to provide at least one time-variant electric input signal representing sound, the at least one electric input signal comprising target signal components and optionally noise signal components, the target signal components originating from a target sound source; a signal processing unit for processing the at least one electric input signal and providing a processed signal; an output unit for creating output stimuli configured to be perceivable by the user as sound based on the processed signal from the signal processing unit; a speech presence probability prediction unit for repeatedly providing a measure of a predicted speech presence probability of the at least one electric input signal, or of a signal originating therefrom; a speech intelligibility prediction unit for repeatedly providing a current measure of a predicted speech intelligibility of the at least one electric input signal, or of a signal originating therefrom, and wherein said speech intelligibility prediction unit is configured to determine said current measure of the predicted speech intelligibility in dependence of said measure of the predicted speech presence probability.

2. A hearing device according to claim 1 wherein the speech intelligibility prediction unit is configured to determine said current measure of the predicted speech intelligibility as a function (ƒ(.)) of a present value and a number of past values of said measure of the predicted speech presence probability.

3. A hearing device according to claim 1 comprising a mapping unit configured to provide a mapping of said at least one electric input signal from a first domain having a first dimension to a second domain having a second dimension, wherein said mapping is a non-linear or linear mapping, and wherein said second dimension is equal to or different from said first dimension.

4. A hearing device according to claim 1 wherein said input unit is configured to provide said at least one electric input signal in a transform domain representation.

5. A hearing device according to claim 1 wherein said input unit is configured to provide said at least one electric input signal in a time-frequency representation (k,m), k being a frequency band index, m being a time index.

6. A hearing device according to claim 5 wherein the speech presence probability prediction unit is configured to determine said measure of the predicted speech presence probability in a number of time frequency units (k,m).

7. A hearing device according to claim 6 wherein the speech intelligibility prediction unit is configured to determine said current measure of the predicted speech intelligibility as a function of a present value and a number of past values of said measure of the predicted speech presence probability, wherein said present value and said number of past values amount to M×K values, where M is a number of time units and K is a number of frequency units.

8. A hearing device according to claim 2 wherein the speech intelligibility prediction unit is configured to determine said current measure of the predicted speech intelligibility in dependence of an, optionally normalized, sum of said present value and said number of past values of said measure of the predicted speech presence probability.

9. A hearing device according to claim 2 wherein the speech intelligibility prediction unit is configured to determine said current measure of the predicted speech intelligibility in dependence of a weighted sum of said present value and said number of past values of said measure of the predicted speech presence probability.

10. A hearing device according to claim 2 configured to provide that said function (ƒ(.)) is a data-driven model, learned from training data.

11. A hearing device according to claim 10 configured to provide that said function ƒ(.) is provided by a deep neural network whose parameters are learned offline—before use of the hearing device—using training data comprising estimated speech presence probabilities P.sub.k,m′, k=1, . . . , K; m′=m−M+1, . . . m, for a particular noisy or processed time segment of a speech signal along with ground truth speech intelligibility of that speech segment, k being a frequency band index, m being a time index.

12. A hearing device according to claim 1 wherein the signal processing unit comprises at least one processing algorithm configured to be applied to the at least one electric input signal or a signal or signals originating therefrom.

13. A hearing device according to claim 1 comprising a controller (CTR) configured to provide appropriate processing parameters for use in the processing of the at least one electric input signal, or a signal or signals originating therefrom, in dependence of the current measure of the predicted speech intelligibility (Î).

14. A hearing device according to claim 12 wherein the at least one processing algorithm comprises a noise reduction algorithm.

15. A hearing device according to claim 12 wherein the controller is configured to provide one or more processing parameters of the at least one processing algorithm, and wherein the one or more processing parameters is provided in dependence of the current measure of the predicted speech intelligibility.

16. A hearing device according to claim 13 wherein the input unit is configured to provide at least two time-variant electric input signals representing sound, wherein the hearing device comprises a beamformer configured to provide a beamformed signal in dependence of said at least two time-variant electric input signals and adaptively updated beamformer weights (w.sub.ij), and wherein the controller is configured to control the beamformer in dependence of the current measure of the predicted speech intelligibility (Î).

17. A hearing device according to claim 16 wherein the controller is configured to control the beamformer weights (w.sub.ij(k,m)) in dependence of the current measure of the predicted speech intelligibility (Î(m)) to increase omni-directionality of the beamformer, the higher the current measure of the predicted speech intelligibility (Î(m)).

18. A hearing device according to claim 1 being constituted by or comprising a hearing aid, a headset, an earphone, an ear protection device, or a combination thereof.

19. A method of operating a hearing device adapted for being worn by a user, the method comprising providing at least one time-variant electric input signal representing sound, the at least one electric input signal comprising target signal components, and optionally noise signal components, the target signal components originating from a target sound source; processing the at least one electric input signal and providing a processed signal; creating output stimuli configured to be perceivable by the user as sound based on the processed signal; repeatedly providing a measure of a predicted speech presence probability of the at least one electric input signal, or of a signal originating therefrom; repeatedly providing a current measure of a predicted speech intelligibility of the at least one electric input signal, or of a signal originating therefrom; and determining said current measure of the predicted speech intelligibility in dependence of said measure of the predicted speech presence probability.

20. A non-transitory computer readable medium storing a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claim 19.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0105] The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity, and they just show details to improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may be combined with any or all features of the other aspects. These and other aspects, features and/or technical effects will be apparent from and elucidated with reference to the illustrations described hereinafter in which:

[0106] FIG. 1 shows an exemplary speech intelligibility estimator (or predictor) according to the present disclosure comprising an input speech signal s(t) that is analyzed to estimate speech presence probabilities, SPPs (P.sub.k,m), and wherein the estimated SPPs are further processed to provide an estimate Î.sub.m of a current speech intelligibility,

[0107] FIG. 2 schematically illustrates speech presence probabilities (P.sub.k,m) in a time frequency domain,

[0108] FIG. 3 schematically shows that the proposed speech intelligibility index Î.sub.m for time instant m is a function of SPPs (P.sub.k,m) from the present and recent past (defined by parameter M),

[0109] FIG. 4A, 4B schematically illustrate a simple example of max-pooling with M=5, K=5, k0=1, m0=1, where

[0110] FIG. 4A) shows SPPs (P.sub.k,m) before Max-pooling; and

[0111] FIG. 4B) shows SPPs (P.sub.k,m) after Max-pooling,

[0112] FIG. 5 shows an exemplary block diagram for training of a neural network for estimating a current speech intelligibility of an input word or sentence based on current and past speech presence probabilities, and

[0113] FIG. 6A shows an exemplary speech intelligibility estimator (or predictor) according to the present disclosure, and

[0114] FIG. 6B schematically shows an embodiment of a hearing aid comprising a speech intelligibility estimator (or predictor) according to the present disclosure.

[0115] The figures are schematic and simplified for clarity, and they just show details which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference signs are used for identical or corresponding parts.

[0116] Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.

DETAILED DESCRIPTION OF EMBODIMENTS

[0117] The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer program, or any combination thereof.

[0118] The electronic hardware may include micro-electronic-mechanical systems (MEMS), integrated circuits (e.g. application specific), microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, printed circuit boards (PCB) (e.g. flexible PCBs), and other suitable hardware configured to perform the various functionality described throughout this disclosure, e.g. sensors, e.g. for sensing and/or registering physical properties of the environment, the device, the user, etc. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.

[0119] The present application relates to the field of hearing devices, e.g. hearing aids, headsets, ear buds, etc.

[0120] The present disclosure proposes to estimate the probability of speech presence in certain parts, e.g. in disjoint time-frequency regions of an input speech signal s(t), t representing time. The estimated speech presence probabilities (SPPs) from a particular temporal neighborhood are combined into a speech intelligibility index Î.sub.m, where m is a time index, that reflects the intelligibility of the signal neighborhood in question, cf. FIG. 1.

[0121] FIG. 1 shows an exemplary speech intelligibility estimator according to the present disclosure, comprising an input speech signal s(t) that is analyzed to estimate speech presence probabilities, SPPs (P.sub.k,m), and wherein the estimated SPPs are further processed to provide an estimate Î.sub.m of a current speech intelligibility. In FIG. 1, for ease of illustration, P.sub.k,m represent SPPs estimated for a time index m and a frequency channel k; SPPs estimated in other domains could be used as well (see e.g. FIG. 6A below). The SPPs are processed (e.g. integrated or combined) to form a speech intelligibility index Î.sub.m which correlates highly with the intelligibility of the speech signal, measured in the vicinity of time index m.

[0122] In the present application, the notation ‘P.sub.k,m’ and ‘P(k,m)’ is used interchangeably for the speech presence probability (depending on indices k, and m, or other indices) without any intended difference in meaning between the two.

[0123] By “intelligibility index” we mean a number (scalar) as a function of time that correlates highly with true intelligibility of the speech signal in question as a function of time—as perceived by a group of listeners or a particular individual. In other words, when the true underlying intelligibility is high at a particular point in time, so is Î.sub.m, and vice versa. In our proposal, 0≤Î.sub.m≤1, where “1” means “high intelligibility”.

[0124] In FIG. 1 it is assumed that the input speech signal is decomposed into the time-frequency domain, i.e., by a filter bank (e.g. a short-time Fourier Transform (STFT) filter bank) in order to estimate the speech presence probability P.sub.k,m in each time-frequency tile. However, the proposed idea is not necessarily limited to the time-frequency domain. For example, the proposed idea could also operate using a spectro-temporal decomposition of the incoming speech signal (see e.g. [Edraki et al.; 2020] for an example), etc. In this case, the ordinary filter bank is replaced by a spectro-temporal filter bank, and SPPs would be estimated on a short-time basis for each spectro-temporal filter channel. Other domains could be envisaged.
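For illustration, such a time-frequency decomposition could be computed with a standard STFT routine; the sampling rate, frame length and overlap below are arbitrary assumptions, and the random signal merely stands in for s(t):

```python
import numpy as np
from scipy.signal import stft

# Assumed parameters for illustration: 20 kHz sampling, 512-sample frames, 50% overlap.
fs = 20000
s = np.random.randn(fs)                      # stand-in for the input speech signal s(t)
f, t, S = stft(s, fs=fs, nperseg=512, noverlap=256)
power = np.abs(S) ** 2                       # per-tile power |S(k,m)|^2, input to SPP estimation
print(power.shape)                           # (K frequency channels, M time frames)
```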

[0125] Various input signals s(t) could be envisaged for the proposed algorithm. In one embodiment, the input signal s(t) to the proposed algorithm could be a microphone signal of the hearing device; in this case, the proposed algorithm provides an estimate Î.sub.m of the intelligibility of that microphone signal as a function of time. In another embodiment, the input to the algorithm consists of several microphone signals used for SPP estimation (see below); in this case the output Î.sub.m of the proposed algorithm typically reflects the intelligibility of one of the microphone signals (decided upon in advance, e.g. a reference microphone signal of a beamformer). In a third embodiment, the input signal s(t) to the proposed algorithm is the output signal of the hearing aid system, i.e., the signal to be presented to the hearing aid user. In this case, Î.sub.m reflects the intelligibility experienced by the hearing aid user when listening to the output signal.

[0126] SPP estimation from noisy speech signals is a well-known discipline in the area of single- and multi-microphone speech enhancement (see e.g. [Hoang et al.; 2021] for a recent proposal). Most methods work in a spectral domain, e.g., via a short-time Fourier transform, and provide SPP estimates for every time-frequency coefficient. For each time-frequency tile, the estimated probabilities tend towards 0 if no speech is present, or if speech is present but dominated by noise, and tend towards 1 if speech is clearly present.
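For intuition only, a toy single-channel SPP estimator with this qualitative behavior might look as follows; the crude noise-floor tracker and the squashing function are illustrative assumptions, not the statistical or learned methods cited in this disclosure:

```python
import numpy as np

def toy_spp(power, alpha=0.95):
    """Toy SPP estimate per time-frequency tile: track a crude noise floor
    (fast down, slow up) and squash the local SNR into [0, 1)."""
    K, M = power.shape
    noise = power[:, 0].copy()
    P = np.empty((K, M))
    for m in range(M):
        up = alpha * noise + (1 - alpha) * power[:, m]       # slow rise of the floor
        noise = np.where(power[:, m] < noise, power[:, m], up)  # instant drop of the floor
        snr = power[:, m] / np.maximum(noise, 1e-12)
        P[:, m] = 1.0 - np.exp(-np.maximum(snr - 1.0, 0.0))  # -> 0 in noise, -> 1 for clear speech
    return P
```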

[0127] FIG. 2 schematically illustrates speech presence probabilities (P.sub.k,m) in a time frequency domain.

[0128] SPP estimation methods can be categorized into the following broad classes:

[0129] a) statistical model-based methods (e.g., [Hoang et al.; 2021], Chapter 5, and the references therein), which rely on statistical assumptions wrt. speech and noise signals,

[0130] b) (deep) learning based methods, e.g., EP3598777A2, in which SPPs are estimated using a data-driven (trained) SPP estimator, and

[0131] c) hybrid methods, e.g. where monaural DNN-based SPP estimates are refined using a statistical spatial model or the other way around.

[0132] Alternatively, SPP estimation methods can be categorized into algorithms that

[0133] a) use a single input signal (see e.g. [Hoang et al.; 2021] and the references therein, and [Heymann, et al.; 2017]), and

[0134] b) use multiple input signals to estimate SPPs in one of them (see e.g. [Heymann, et al.; 2017]).

[0135] The speech intelligibility prediction scheme according to the present disclosure is not limited to use SPPs estimated for each coefficient in the time-frequency domain, although this is the domain in which SPP estimation is typically performed in the literature. It is equally possible to envisage schemes where SPPs—for example—are related to coefficients in the spectro-temporal modulation domain. In this case, the signal under analysis may be decomposed using a spectro-temporal modulation filter bank (e.g., as proposed in [Edraki et al.; 2020]), e.g. applied to a linear-amplitude or a log-amplitude spectrogram.

[0136] The present disclosure proposes to combine the estimated SPPs from the recent past to form an intelligibility index reflecting the intelligibility of the speech signal across this recent past, i.e.,


$\hat{I}_m = f\left(P_{k,m'}\right), \quad k = 1, \ldots, K; \; m' = m - M + 1, \ldots, m,$

[0137] where P.sub.k,m′ denotes the SPP estimate at the k'th frequency index and m'th time index, we assume there are K frequency channels and M observations in the recent past, and ƒ(.) denotes a function that maps the SPPs to an intelligibility index, cf. FIG. 3. Essentially, ƒ(.) is chosen as a non-decreasing map, such that if one of the elements P.sub.k,m′, k=1, . . . , K; m′=m−M+1, . . . , m, increases, then Î.sub.m does not decrease (see below for examples of the function ƒ(.)).

[0138] FIG. 3 schematically shows that the proposed speech intelligibility index Î.sub.m for time instant m is a function of SPPs (P.sub.k,m) from the present and recent past (defined by parameter M).

[0139] Alternatively, the intelligibility index Î.sub.m reflecting the intelligibility at time instant m is a function of recent past and near future SPPs wrt. the time index m. The parameter M, which defines the time duration upon which Î.sub.m is based, is chosen according to the application. Typical values of M correspond to time durations of 100 ms, 200 ms, 500 ms, 1 s, 5 s, 10 s, or more. In some applications, M could correspond to the duration of a speech signal or a set of speech signals.

[0140] The simplest form of the function ƒ(.) is an arithmetic average of the SPPs of the recent past:

[00002] $\hat{I}_m = \frac{1}{M}\frac{1}{K}\sum_{k=1}^{K}\sum_{m'=m-M+1}^{m} P_{k,m'}.$

[0141] Another, slightly more general, approach is to introduce frequency weights:

[00003] $\hat{I}_m = \frac{1}{M}\frac{1}{K}\sum_{k=1}^{K}\sum_{m'=m-M+1}^{m} w_k\, P_{k,m'},$

[0142] where w.sub.k are pre-determined weight factors, often chosen to indicate the proportionate importance of different frequency bands, i.e., 0≤w.sub.k≤1 and Σ.sub.k w.sub.k=1.
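As a sketch, the two averages above could be transcribed directly in numpy; the sizes K, M and the uniform weights are assumed example values:

```python
import numpy as np

def si_index(P, w=None):
    """P: K x M array holding P_{k,m'} for m' = m-M+1, ..., m.
    Plain arithmetic average, or the frequency-weighted variant when w is given."""
    K, M = P.shape
    if w is None:
        return P.sum() / (M * K)
    return (w[:, None] * P).sum() / (M * K)   # broadcast w_k across time frames

# Example SPP context: K=16 bands, M=50 frames; weights with 0 <= w_k <= 1, sum_k w_k = 1.
P = np.random.rand(16, 50)
w = np.ones(16) / 16
print(si_index(P), si_index(P, w))
```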

[0143] In another embodiment, the SPPs are transformed before they are combined, e.g., using a log-transform, P.sub.k,m:=log(P.sub.k,m+c1), where c1 is a small number to avoid numerical problems in computing log(.) for very low SPPs. Other transformations could be envisaged, e.g., compressive transforms such as square roots, etc.

[0144] In yet another embodiment, the SPPs are quantized before they are combined, e.g. using a 2-level quantizer, P.sub.k,m:=1 if P.sub.k,m>c2 and P.sub.k,m:=0 otherwise.
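A minimal sketch of these two pre-transformations; the constants c1 and c2 are assumed example values:

```python
import numpy as np

def log_transform(P, c1=1e-3):
    """Compressive log-transform; c1 avoids log(0) for very low SPPs."""
    return np.log(P + c1)

def quantize(P, c2=0.5):
    """2-level quantizer: 1 where P_{k,m} > c2, 0 otherwise."""
    return (P > c2).astype(float)
```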

[0145] In yet another embodiment, the SPPs P.sub.k,m are passed through a max-pool map before being combined. A max-pool map replaces a given SPP, say P.sub.k,m, by the maximal SPP in its vicinity, e.g., P.sub.k,m:=max P.sub.k′,m′, k′=k−k0, . . . , k+k0, m′=m−m0, . . . , m+m0, see FIG. 4 for a small example.

[0146] FIG. 4 shows a simple example of max-pooling with M=5, K=5, k0=1, m0=1, where

[0147] FIG. 4A) shows SPPs (P.sub.k,m) before max-pooling; and

[0148] FIG. 4B) shows SPPs (P.sub.k,m) after max-pooling.

[0149] The rationale of max-pooling of SPPs before combination is to take into account the observation that time-frequency tiles with high SPPs tend to convey more intelligibility if they are spread across the time-frequency plane, than if they are clustered. This is so, because in the former case, they are likely to be related to different formant frequencies, and, hence, be more informative, whereas in the latter case, they are likely to be related to the same formant frequency.
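A sketch of the max-pool map of paragraph [0145], using scipy's maximum filter; the nearest-edge handling at the borders is an implementation choice not specified in the disclosure:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def max_pool_spp(P, k0=1, m0=1):
    """Replace each SPP by the maximal SPP in its (2*k0+1) x (2*m0+1)
    time-frequency vicinity; the output has the same shape as the input."""
    return maximum_filter(P, size=(2 * k0 + 1, 2 * m0 + 1), mode="nearest")
```

With k0=1, m0=1 this reproduces the 3×3 vicinity of the FIG. 4 example.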

[0150] In the examples above, (potentially non-linearly transformed) SPPs are combined by addition. Obviously, other ways of combining SPPs exist, e.g., multiplication, etc.

[0151] In a final embodiment, the mapping function ƒ(.) is a data-driven model, learned from training data. In particular, ƒ(.) could be a deep neural network whose parameters could be learned offline (before actual system deployment) using training data consisting of estimated SPPs P.sub.k,m′, k=1, . . . , K; m′=m−M+1, . . . , m, for a particular noisy or processed time segment of a speech signal (i.e., the input to the data-driven model ƒ(.)), along with ground truth speech intelligibility of that speech segment, as measured in listening tests with human test subjects (i.e., the desired output of the data-driven model ƒ(.)). Details of the approach for training deep neural networks for intelligibility prediction are described in [Pedersen et al.; 2020], but this work differs from the proposed approach because it does not rely on SPPs and assumes access to a noise-free reference signal.
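As a minimal sketch (assuming PyTorch, and assumed example sizes for K and M), such a data-driven ƒ(.) could be a small feedforward network mapping the flattened K×M SPP context to a scalar in (0, 1); the layer sizes are arbitrary:

```python
import torch.nn as nn

K, M = 16, 50   # assumed numbers of frequency channels and context frames

# Sketch of a data-driven f(.): flattened K x M SPP context in,
# scalar intelligibility index out, with a sigmoid keeping it in (0, 1).
f_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(K * M, 128),
    nn.ReLU(),
    nn.Linear(128, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)
```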

[0152] The procedure for deriving an intelligibility index described so far has not taken into account any potential hearing deficits of the device user. This may not be a problem, if the intelligibility index is simply used to decide if an algorithm setting (A) leads to higher intelligibility than another setting (B). However, in general, it could be useful to incorporate the effect of a hearing loss in the intelligibility prediction algorithm. Several options exist for incorporating prior knowledge of such hearing deficits.

[0153] For example, for the statistical model based SPP estimation methods mentioned above, the hearing loss may be modelled crudely as an imaginary additive noise term, spectrally (and potentially temporally) shaped according to the hearing loss profile in question, and added to the acoustic noisy signal in question when mathematically deriving the SPPs. The derived SPPs will generally be reduced due to the presence of the imaginary noise term, reflecting the fact that certain speech cues will be harder to detect for the hearing impaired end user.

[0154] Similarly, for the learning based SPP estimation methods described above, the noise signal simulating the hearing loss is simply physically added to the noisy signals during training of the SPP estimation algorithm. As for the statistical model based SPP estimation methods, the output SPP estimates of the learning based algorithms will generally be reduced due to the presence of the additional “hearing loss” noise.
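A crude sketch of this "imaginary noise" idea in the power-spectral domain; the dB-to-power reference and the per-band audiogram mapping are illustrative assumptions, not a calibrated hearing-loss model:

```python
import numpy as np

def add_hearing_loss_noise(power, thresholds_db):
    """Add an audiogram-shaped noise floor (per-band thresholds in dB,
    relative to an assumed reference) to per-tile power before SPP estimation."""
    hl_power = 10.0 ** (np.asarray(thresholds_db) / 10.0)  # dB -> linear power per band
    return power + hl_power[:, None]                       # broadcast across time frames
```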

[0155] FIG. 5 shows an exemplary block diagram for training of an algorithm, e.g. a neural network, such as a deep neural network DNN (DNN(SPP-SI)), for estimating the speech intelligibility Î for a particular word or sentence. The trained DNN is represented by a parameter set comprising optimized weights and possibly bias and/or non-linear function parameters. The circuit for training the neural network DNN comprises an input transducer (IT), e.g. a microphone, for capturing environment sound signals and providing an (e.g. analogue or digitized) electric input signal x(n) representative thereof, n denoting time. The microphone path comprises a transform unit for transforming the electric (e.g. time domain) input signal to another domain, e.g. an analysis filter bank FBA for (possibly digitizing and) converting the time domain electric input signal x(n) to a corresponding electric input signal X(k,m) in a time-frequency representation, where k and m are frequency and time (frame) indices, respectively. The electric input signal X(k,m) is fed to the speech presence probability estimation unit (SPPE) for providing an estimate of a speech presence probability P(k,m). From a (practical) simplicity point of view, the test signals X(k,m) (associated with ground truth speech intelligibility measures I(m)) may be fed directly to the speech presence probability estimation unit (SPPE). It may, however, be advantageous to include the input stage comprising microphone(s), analysis filter bank(s) (and possible beamformers applied to the microphone signals before being fed to the speech presence probability estimation unit (SPPE)) and the SPP estimator (SPPE) in the actual training setup, to thereby resemble the subsequent processing of the acoustic input signal in the hearing aid comprising the trained speech intelligibility estimator (SIE) as much as possible.

[0156] As indicated, other 'input stage' configurations than the one shown in FIG. 5 may be used, e.g. more than one input transducer (e.g. feeding a beamformer), other transform or mapping units than the analysis filter bank(s), or a particular speech presence probability estimator (e.g. as known from the prior art, or a proprietary solution). It is, however, advantageous that the same configuration is applied in the hearing aid that is to host the speech intelligibility estimator (SIE) with the optimized (trained) parameters determined in the training process described in the following.

[0157] The training setup comprises a context unit (CONTEXT) for providing an appropriate input vector Z(k,m) to the neural network (DNN (SPP-SI)) to be trained. The context (cf. hatched part of time-frequency map denoted ‘Context’ in the top part of FIG. 5, and also comprised in the input vector denoted Z(k,m)) may be controlled via input control CTXT, e.g. via a user interface. It may, however, be predetermined, e.g. fixed in advance of the training procedure. A simple control may be to use the number of historic time-frequency tiles (or frames) that should be included in the input vector Z(k,m). A specific time may also be used to control the context applied in the training of the parameters of the speech intelligibility estimator (SIE), e.g. corresponding to the average length of a word or sentence, or a number of sentences, as pronounced by a speaker.

[0158] The training data comprises a large number of words, numbers, or sentences (e.g. hundreds or thousands, but ideally many more), e.g. spoken by a number of different speakers (e.g. at different signal-to-noise ratios, using different noise types, etc.), and associated (e.g. average) speech intelligibility measures (e.g. provided by different listeners), e.g. provided by listening tests or by an algorithm having been trained with data from listening tests, cf. e.g. EP3514792A1.

[0159] A noisy time domain training signal x(n) is passed through an analysis filter bank (FBA), providing frequency sub-band (time-frequency domain) signals X(k,m). For a particular time instant m′, noisy signals representing a particular time segment of test data (e.g. a word, a sentence, etc.) are passed through the speech presence probability estimation unit (SPPE), providing speech presence probability estimates P(k,m′) for each frequency index k=1, . . . , K.

[0160] The SPPE unit provides speech presence probability estimates P(k,m) for each time-frequency tile (k,m) to the context unit (CONTEXT) to build a desired input vector Z(k,m) to the neural network (DNN (SPP-SI)) to be trained. For a given time instant m=m′, the estimated value Î(m′) of the speech intelligibility measure is provided by the neural network using present and past values of the speech presence probability estimate, P(k,m), k=1, . . . , K; m=m′−L+1, . . . , m′, where L denotes the number of past frames used to estimate Î(m′). The number L of frames represents the 'history' of the SPP estimates that is included in the estimation of the speech intelligibility measure. With a view to the general nature of speech, the 'history' (L) may e.g. include up to 10 s of the input signal (SPP estimates), e.g. representing a few sentences.

[0161] The input vector Z(k,m) to the neural network may thus comprise a number of time frequency values of the speech presence probability P(k,m), k=1, . . . , K; m=m′−L+1, . . . , m′, (e.g. real numbers between 0 and 1) as illustrated by the top time-frequency (TF) map in FIG. 5. The values of the input vector may be subject to a functional ‘transformation’ (e.g. logarithm) before being fed to the first layer of the neural network, if appropriate. In the time-frequency map insert in the top part of FIG. 5, the frequency range represented by indices k=1, . . . , K may be the full operational range of the hearing device in question (e.g. representing a frequency range between 0 and 12 kHz (or more)), or it may represent a more limited sub-band range (e.g. where speech elements are expected to be located, e.g. between 0.5 kHz and 8 kHz, or between 1 kHz and 4 kHz). The limited sub-band range may contain a continuous range or selected sub-ranges between k=1 and k=K.

[0162] SPP input vectors Z(k,m′) for given time instants m=m′, e.g. comprising speech presence probability estimates P(k,m), k=1, . . . , K; m=m′−L+1, . . . , m′, corresponding to a word or one or more sentences, as appropriate, and corresponding ground truth speech intelligibility values I(m′) for said word or one or more sentences, are used to train the (e.g. deep) neural network (DNN (SPP-SI)). Using the neural network, we wish to provide an estimate Î(m′) 'now' (=m′) (corresponding to the current word(s) or sentence(s), if any, present in the input data) based on L observations (L time frames) up to (and including) time 'now' (see e.g. the time-frequency map insert in the top part of FIG. 5). The network parameters are collected in a set denoted by DNN*. Typically, this parameter set encompasses weight and bias values associated with each network layer. The network may be a feedforward multi-layer perceptron, a convolutional network, a recurrent network, e.g., a long short-term memory (LSTM) network, a gated recurrent unit (GRU), or combinations of these networks. Other network structures are possible. The output layer of the network may have a logistic (e.g. sigmoid) output activation function to ensure that outputs (Î(m′)) are constrained to the range 0 to 1. The network parameters may be found using standard, iterative, steepest-descent methods, e.g., implemented using back-propagation (cf. e.g. [4]), minimizing e.g. the mean-squared error (MSE), cf. the signal err(m′) provided by the optimization algorithm (COST), between the network output Î(m′) and the ground truth I(m′). The mean-squared error is computed across many training pairs of the ground truth speech intelligibility measures I(m) and noisy signals X(k,m).
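A sketch of the described training loop, assuming PyTorch; `f_net` is the sketch from paragraph [0151], and the synthetic tensors below merely stand in for real training pairs of SPP contexts and listening-test intelligibility scores:

```python
import torch

# Synthetic stand-ins for (SPP context, ground-truth intelligibility) training pairs.
K, M = 16, 50
Z_all = torch.rand(256, K, M)            # SPP contexts, values in [0, 1]
I_all = torch.rand(256, 1)               # assumed ground-truth intelligibility values
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(Z_all, I_all), batch_size=32)

opt = torch.optim.Adam(f_net.parameters(), lr=1e-3)   # iterative gradient-descent optimizer
loss_fn = torch.nn.MSELoss()                          # mean-squared error cost (COST block)

for Z, I in loader:
    opt.zero_grad()
    I_hat = f_net(Z)           # network output, kept in (0, 1) by the sigmoid output layer
    err = loss_fn(I_hat, I)    # err(m') between the output Î(m') and ground truth I(m')
    err.backward()             # back-propagation
    opt.step()
```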

[0163] FIG. 6A shows an exemplary speech intelligibility estimator (SIE) according to the present disclosure. The speech intelligibility estimator (SIE) is similar to the one shown in FIG. 1, except that the embodiment of FIG. 6A additionally comprises a mapping unit (MAP) configured to provide a mapping of the input signal (x) from a first domain having a first dimension to a second domain having a second dimension. The mapping may be a non-linear or linear mapping. The second dimension may be equal to or different from the first dimension. The second domain may have more dimensions than the first. For example, the first domain may be short-time spectrograms, e.g., a (log-)spectrogram of the recent past compared to m′="now". These short-time spectrograms may be mapped into the (temporal) modulation domain, in which case the second domain would have dimensions (time, acoustic frequency, modulation frequency), i.e., be 3-dimensional. The second domain may e.g. also be a spectro-temporal modulation domain with four dimensions (time, acoustic frequency, temporal modulation frequency, spectral modulation frequency).
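As an illustration of the first of these mappings, a log-spectrogram could be mapped to a 3-dimensional (time, acoustic frequency, modulation frequency) representation with windowed FFT magnitudes along the time axis; this is a simplified stand-in with assumed window and hop lengths, not the decomposition of [Edraki et al.; 2020]:

```python
import numpy as np

def temporal_modulation_map(log_spec, win=32, hop=16):
    """Map a K x M log-spectrogram to (time, acoustic frequency,
    modulation frequency) via windowed FFT magnitudes along the time axis."""
    K, M = log_spec.shape
    frames = [np.abs(np.fft.rfft(log_spec[:, m0:m0 + win], axis=1))
              for m0 in range(0, M - win + 1, hop)]
    return np.stack(frames)    # shape: (num_windows, K, win // 2 + 1)
```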

[0164] The mapping unit may represent a Fourier transformation or any other transform process for transformation from one domain to another domain (e.g. a Discrete Cosine Transform (DCT), the Karhunen-Loève Transform (KLT), a temporal transformation, a spectro-temporal modulation, etc.). An inverse mapping unit (or another (possibly different) mapping unit) may be applied in the hearing aid at any location appropriate for the design in question, to bring the signal in question to a domain (e.g. the time-frequency domain or the time domain) suitable for the specific solution. As indicated in FIG. 6A, the mapping unit (MAP) transforms the input (e.g. time domain) signal x to a mapped domain (e.g. transform domain) signal x(a, b, . . . ), e.g. having a higher dimension, and depending on a number (e.g. a multitude) of parameters (a, b, . . . ). The speech presence probability estimator (SPPE) is configured to provide the speech presence probability estimate P in the mapped domain (P(a, b, . . . )). The speech presence probability estimate (P(a, b, . . . )) is fed to the speech presence probability integrator (SPP-INT) providing the speech intelligibility estimate, either in the mapped domain or in the time domain, as SI estimate (Î.sub.m). Hence, the speech intelligibility estimator (SIE) may comprise an inverse mapping unit. An inverse mapping unit may e.g. form part of a neural network implementing the speech presence probability integrator (SPP-INT), as discussed in connection with FIG. 5.

[0165] FIG. 6B schematically shows an embodiment of a hearing aid (HD) comprising a speech intelligibility estimator (SIE) according to the present disclosure. The hearing aid (HD) comprises a multitude of microphones (Mi, i=1, . . . , N), each providing a different one of a multitude of electric input signals (x.sub.i(n), i=1, . . . , N, n representing time). The multitude of electric input signals are e.g. digitized and provided as digital samples (in the time domain), e.g. by corresponding analogue-to-digital converters, as appropriate. Each microphone path comprises an analysis filter bank (FB-A1, . . . , FB-AN), the N analysis filter banks providing respective electric input signals X.sub.i, i=1, . . . , N, in a time-frequency representation (k,m). The N electric input signals X.sub.i(k,m), i=1, . . . , N, are fed to a beamformer-noise reduction system (BF-NR) providing a beamformed (and possibly further noise reduced) signal Y.sub.BF(k,m). The beamformed signal is fed to a signal processing unit (HLC) for applying a number of signal processing algorithms to the beamformed signal (or a signal originating therefrom), e.g. a hearing loss compensation algorithm (compressor) for providing a frequency and level dependent gain to compensate for the user's hearing impairment. Other processing algorithms may be applied to the signal by the processing unit (HLC), providing a processed signal Y.sub.G(k,m). The processed signal Y.sub.G(k,m) in the time-frequency domain is converted to a time domain signal y.sub.G(n) by a synthesis filter bank (FB-S). The time domain signal y.sub.G(n) is forwarded to an output transducer, here a loudspeaker (SPK), for providing stimuli perceivable to the user as sound. The above described units and interconnecting signals represent a forward (audio processing) path of the hearing aid.

[0166] In the embodiment of FIG. 6B, a multitude of input transducers (microphones) and a beamformer are shown. A hearing device comprising a single input transducer (e.g. a microphone) may, however, also be provided according to the present disclosure.

[0167] The hearing aid further comprises an analysis path, here comprising a speech intelligibility estimator (SIE) according to the present disclosure. The speech intelligibility estimator (SIE) provides an estimate of speech intelligibility Î of a given electric input signal of the forward path, here shown as the time domain signal x.sub.i(n) from microphone M1. The analysis path further comprises a control unit (CTR) (e.g. a controller) configured to provide appropriate parameters for use in the signal processing of the forward path in dependence of the current estimate of speech intelligibility Î. In the embodiment of FIG. 6B, the controller is configured to control the beamformer in dependence of the estimate of speech intelligibility Î. Here, beamformer weights w.sub.ij(k,m) are determined by the controller (CTR) in dependence of Î(m). The beamformer weights may be controlled to decrease the focus of the beamformer (increase omni-directionality), the higher the estimate of speech intelligibility Î(m). Correspondingly, the beamformer weights may be controlled to increase the focus of the beamformer, the lower the estimate of speech intelligibility Î(m). A postfilter for attenuating noise in the spatially filtered signal from the beamformer may also benefit from receiving the estimate of speech intelligibility Î(m), e.g. to make noise reduction less aggressive, the higher the estimate of speech intelligibility Î(m) (and vice versa). Signal processing algorithms of the signal processing unit (HLC) may likewise benefit from information about the estimate of speech intelligibility Î(m), cf. processing control signal (PRC) from the control unit (CTR) to the signal processing unit (HLC).
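As a sketch of this control strategy (the blend thresholds I_lo and I_hi are assumed tuning values, not taken from the disclosure), the controller could interpolate between adaptive beamformer weights and fixed omni-directional weights:

```python
import numpy as np

def control_beamformer(w_bf, w_omni, I_hat, I_lo=0.4, I_hi=0.9):
    """Blend toward omni-directionality as predicted intelligibility rises:
    alpha = 0 keeps full beamforming, alpha = 1 is fully omni-directional."""
    alpha = np.clip((I_hat - I_lo) / (I_hi - I_lo), 0.0, 1.0)
    return (1.0 - alpha) * w_bf + alpha * w_omni
```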

[0168] Other signals of the forward path than the time domain electric input signal (x.sub.i(n)) may alternatively or additionally be provided with an analysis path comprising a speech intelligibility estimator (SIE) according to the present disclosure. This may e.g. be one or more of the time-frequency domain signals X.sub.i(k,m), the beamformed signal Y.sub.BF(k,m), or the processed signal (either in the time-frequency domain, Y.sub.G(k,m), or in the time domain, y.sub.G(n)), or any intermediate signal of the forward path, e.g. depending on the applied processing algorithms.

[0169] It is intended that the structural features of the devices described above, either in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.

[0170] As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, but an intervening element may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.

[0171] It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.

[0172] The claims are not intended to be limited to the aspects shown herein but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.

REFERENCES

[0173] [ANSI; 1995] American National Standards Institute, “ANSI S3.5, Methods for the Calculation of the Speech Intelligibility Index,” New York, 1995.

[0174] [Rhebergen & Versfeld; 2005] K. S. Rhebergen and N. J. Versfeld, “A speech intelligibility index based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am., vol. 117, no. 4, pp. 2181-2192, 2005.

[0175] [Taal et al.; 2011] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125-2136, September 2011.

[0176] [Edraki et al.; 2020] A. Edraki, W.-Y. Chan, J. Jensen, and D. Fogerty, “Speech Intelligibility Prediction Using Spectro-Temporal Modulation Analysis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, 2020, pp. 210-225.

[0177] [Hoang et al.; 2021] P. Hoang, Z.-H. Tan, J. M. de Haan, and J. Jensen, “Joint Maximum Likelihood Estimation of Power Spectral Densities and Relative Acoustic Transfer Functions for Acoustic Beamforming,” Proc. ICASSP 2021 (to appear).

[0178] EP3057335A1 (Oticon) 17.08.2016.

[0179] EP3220661A1 (Oticon) 20.09.2017.

[0180] EP3203473A1 (Oticon) 09.08.2017.

[0181] EP3598777A2 (Oticon) 22.01.2020.

[0182] EP3514792A1 (Oticon) 24.07.2019.

[0184] [Heymann, et al.; 2017] J. Heymann, L. Drude, and R. Haeb-Umbach, “A generic neural acoustic beamforming architecture for robust multi-channel speech processing,” Computer Speech and Language, vol. 46, November 2017, pp. 374-385.

[0185] [Pedersen et al.; 2020] M. B. Pedersen, A. H. Andersen, S. H. Jensen, and J. Jensen, “A Neural Network for Monaural Intrusive Speech Intelligibility Prediction,” ICASSP, pp. 336-340, May 2020.