Hearing device comprising a speech presence probability estimator

11503414 · 2022-11-15

Abstract

A hearing device, e.g. a hearing aid, comprises a) a multitude of input units, each providing an electric input signal representing sound in the environment of the user in a time-frequency representation, wherein the sound is a mixture of speech and additive noise or other distortions, e.g. reverberation, b) a multitude of beamformer filtering units, each being configured to receive at least two, e.g. all, of said multitude of electric input signals, each of said multitude of beamformer filtering units being configured to provide a beamformed signal representative of the sound in a different one of a multitude of spatial segments, e.g. spatial cells, around the user, c) a multitude of speech probability estimators each configured to receive the beamformed signal for a particular spatial segment and to estimate a probability that said particular spatial segment contains speech at a given point in time and frequency, wherein at least one, e.g. all, of the multitude of speech probability estimators is/are implemented as a trained neural network, e.g. a deep neural network. The invention may e.g. be used in hearing aids or communication devices, such as headsets, or telephones, or speaker phones.

Claims

1. A method of providing an estimate Î* of a speech presence probability in a sound signal comprising speech and additive noise or other distortions in a hearing device, the hearing device comprising a multitude of input units, each providing an electric input signal representing said sound signal in a time-frequency representation (k,l), the method comprising providing a subdivision of space around the user into a multitude of spatial segments (i,j); providing a speech presence indicator function, which for a given electric input signal indicates whether or not, or to which extent, speech is present in a given spatial segment (i,j), at a given frequency and time (k,l); and, for each spatial segment (i,j), providing a first database (Ψ.sub.ij) of training signals comprising a multitude of pairs of corresponding noisy beamformed signals X(k,l,θ.sub.i,r.sub.j) representative of sound in the spatial segment in question and associated values of said speech presence indicator function I(k,l,θ.sub.i,r.sub.j) in a time-frequency representation, wherein said values of said speech presence indicator function represent ground truth values; and determining optimized parameters (Ψ*.sub.ij) of an algorithm for estimating said speech presence probability by optimizing, e.g. training, it with at least some of said noisy beamformed signals X(k,l,θ.sub.i,r.sub.j) and said associated values of said speech presence indicator function I(k,l,θ.sub.i,r.sub.j) of said first database (Ψ.sub.ij), the algorithm providing corresponding estimated speech presence indicator values Î(k,l,θ.sub.i,r.sub.j), said optimization of parameters (Ψ*.sub.ij) being conducted under a constraint of minimizing a cost function of said estimated speech presence indicator values.

2. A method according to claim 1 wherein said multitude of spatial segments comprises an own voice segment including a segment around the user's mouth to allow for estimating a speech presence probability of the user of the hearing device.

3. A method according to claim 1 wherein a multitude of clean electric input signals S.sub.m(k,l), m=1, . . . , M, for each of said multitude of input units are generated (or recorded) by varying one or more of 1) the target speech source; 2) the target spatial position (θ.sub.i,r.sub.j); 3) head size; 4) input unit.

4. A method according to claim 3 wherein said noisy beamformed signals X(k,l,θ.sub.i,r.sub.j) are generated based on said clean electric input signals S.sub.m(k,l), m=1, . . . , M, by varying one or more, such as all of a) the additive noise or other distortion type, b) the signal-to-noise ratio (SNR) at which the target signal is typically observed in practice, in the application at hand, to thereby generate noisy electric input signals X.sub.m(k,l), m=1, . . . , M, corresponding to said clean electric input signals; and by exposing said noisy electric input signals to respective beamformers providing said noisy beamformed signals X(k,l,θ.sub.i,r.sub.j) representative of sound in the spatial segments in question.

5. A method according to claim 1 comprising the provision of a number of sets of semi-personalized, optimized parameters (Ψ*.sub.ij) of said algorithm for estimating a speech presence probability for a given spatial segment.

6. A method according to claim 4 wherein said sets of semi-personalized, optimized parameters (Ψ*.sub.ij) of said algorithm are associated with a corresponding number of different head models having different head dimensions or form.

7. A method according to claim 6 comprising providing that one of said sets of semi-personalized, optimized parameters (Ψ*.sub.ij) of said algorithm for said different head models is selected for use in the hearing device of the user based on test training data of sound signals played for the user while wearing the hearing device.

8. A method of operating a hearing device, the method comprising: providing a multitude of electric input signals representing sound in the environment of the user in a time-frequency representation, wherein the sound is a mixture of speech and additive noise or other distortions; providing a multitude of beamformed signals, each being representative of the sound in a different one of a multitude of spatial segments around the user, and each being based on at least two of said multitude of electric input signals; providing for each of said multitude of spatial segments an estimate of a probability P.sub.ij(k,l) that the spatial segment in question contains speech at a given point in time and frequency in dependence of the beamformed signals; and wherein at least one of the multitude of estimates of speech probability is/are provided by a trained neural network, wherein said at least one of the multitude of estimates of speech probability is provided by a neural network trained according to the method of claim 1.

9. A hearing device configured to be worn by a user, comprising: a multitude of input units, each providing an electric input signal representing sound in the environment of the user in a time-frequency representation, wherein the sound is a mixture of speech and additive noise or other distortions; a multitude of beamformer filtering units, each being configured to receive at least two of said multitude of electric input signals, each of said multitude of beamformer filtering units being configured to provide a beamformed signal representative of the sound in a different one of a multitude of spatial segments; and a multitude of speech probability estimators each configured to receive the beamformed signal for a particular spatial segment and to estimate a probability that said particular spatial segment contains speech at a given point in time and frequency; and wherein at least one of the multitude of speech probability estimators is/are implemented as a trained neural network, and the at least one of the multitude of speech probability estimators is/are implemented as a trained neural network according to the method of claim 1.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity, and they just show details to improve the understanding of the claims, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effects will be apparent from and elucidated with reference to the illustrations described hereinafter, in which:

(2) FIG. 1A illustrates a use case of an embodiment of a (single, monaural) hearing device according to the present disclosure,

(3) FIG. 1B illustrates a use case of a first embodiment of a binaural hearing system according to the present disclosure; and

(4) FIG. 1C illustrates a use case of a second embodiment of a binaural hearing system according to the present disclosure,

(5) FIG. 2 shows how, for a particular time instant l and a particular frequency index k, space is divided into cells parameterized by the angle θ and distance r to the center of the cell, with respect to the center of the user's head,

(6) FIG. 3 shows an exemplary block diagram for determining ‘ground truth’ binary speech presence indicator functions I(k,l,θ,r) from clean microphone target signals s.sub.1(n), . . . , s.sub.M(n), here M=2,

(7) FIG. 4 illustrates an exemplary block diagram for training of DNN Ψ.sub.θi,rj for estimating the speech presence probability for a particular spatial cell (θ.sub.i, r.sub.j),

(8) FIG. 5 shows an application of trained DNNs Ψ*.sub.θi,rj to noisy microphone signals to produce speech presence probability estimates Î*(k,l,θ.sub.i,r.sub.j),

(9) FIG. 6 shows how an exemplary spatial decomposition using relative acoustic transfer functions rather than absolute acoustic transfer functions results in a “pie slice” decomposition (cf. FIG. 2),

(10) FIG. 7 schematically illustrates a neural network for determining a speech presence probability estimate (SPPE) Î*(k,l,θ.sub.i,r.sub.j) from a noisy input signal in a time-frequency representation, and

(11) FIG. 8 shows a hearing device according to a first embodiment of the present disclosure,

(12) FIG. 9 shows a hearing device according to a second embodiment of the present disclosure,

(13) FIG. 10 shows an exemplary spatial decomposition focusing on estimation of own voice presence probability,

(14) FIG. 11 shows a further exemplary spatial decomposition including a number of designated cells to be used for estimation of an own voice presence probability, and

(15) FIG. 12A illustrates a scheme for generating a test database of sound data for selecting a specific set of optimized parameters of a neural network among a number of pre-determined optimized parameters for different head models, and

(16) FIG. 12B illustrates a scheme for selecting a specific set of optimized parameters of a neural network among a number of pre-determined optimized parameters for different head models using the test database of sound data determined in FIG. 12A.

(17) The figures are schematic and simplified for clarity, and they just show details, which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference signs are used for identical or corresponding parts.

(18) Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.

DETAILED DESCRIPTION OF EMBODIMENTS

(19) The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, computer program, or any combination thereof.

(20) The electronic hardware may include microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.

(21) The present application relates to the field of hearing devices, e.g. hearing aids.

(22) SPP Estimation:

(23) We consider acoustic situations as illustrated e.g. in FIGS. 1A, 1B or 1C. Specifically, we consider a user of a hearing assistive device or system; the hearing assistive device or system has access to a total of M≥2 microphones, which are typically located at/in the ears of the user, and which may be organized in a monaural (at one ear) or binaural (at both ears) configuration.

(24) FIG. 1A illustrates a use case of an embodiment of a (single, monaural) hearing device according to the present disclosure. A user (U) wears a single (monaural) hearing device (HD) at a left ear (Ear). The hearing device comprises a BTE-part adapted for being located behind an ear of the user. The hearing device comprises first and second input transducers, here front and rear microphones (FM1, RM1), providing first and second electric input signals, respectively. The two microphones are spaced a distance ΔL.sub.M (e.g. ≈10 mm) apart and define a microphone axis (MIC-DIR).

(25) The hearing device (HD) comprises a beamformer filtering unit allowing beamforming according to the present disclosure to be performed based on the first and second electric input signals.

(26) In the scenarios of FIGS. 1A and 1C, the microphone axis of the individual hearing devices (HD in FIG. 1A and HD1, HD2 in FIG. 1C) is parallel to a look direction (LOOK-DIR) defined by the nose of the user. This is achieved by mounting the hearing device(s) as illustrated, so that the body of the BTE-part (and hence the microphone axis) is substantially parallel to a front direction of the user.

(27) FIG. 1B illustrates a use case of a first embodiment of a binaural hearing system according to the present disclosure. A user (U) wears first and second hearing devices (HD1, HD2) at a left ear (Left ear) and a right ear (Right ear), respectively. The two hearing devices each comprise a BTE-part adapted for being located behind an ear of the user. Each of the first and second hearing devices (HD1, HD2) is shown to contain a single microphone (M1, M2, respectively). The microphones of the first and second hearing devices provide first and second electric input signals, respectively. The two microphones are in this embodiment located a distance a (roughly equal to a head diameter, e.g. 200 mm) apart. The microphone axis (in case the two microphone signals are processed together) is perpendicular to the look direction (LOOK-DIR) of the user (U). The first and second hearing devices (HD1, HD2) each comprise antenna and transceiver circuitry allowing the two hearing devices to exchange the respective microphone signals, or to forward their microphone signal (in full or in part) to a processing device (e.g. a remote control or a smartphone, or one of the hearing devices). The hearing system (e.g. one of, or each of, the hearing devices, or a separate processing device) comprises a beamformer filtering unit allowing beamforming according to the present disclosure to be performed based on the first and second electric input signals.

(28) FIG. 1C illustrates a use case of a second embodiment of a binaural hearing system according to the present disclosure. A user (U) wears first and second hearing devices (HD1, HD2) at a left ear (Left ear) and a right ear (Right ear), respectively, as described in connection with FIG. 1B. The two hearing devices each comprise a BTE-part adapted for being located behind an ear of the user. In the embodiment of FIG. 1C, however, each of the first and second hearing devices (HD1, HD2) comprises two microphones ((FM1, RM1) and (FM2, RM2), respectively), as discussed in connection with FIG. 1A. Each of the two pairs of microphones provides first and second electric input signals, respectively. The hearing system (e.g. one of, or each of, the hearing devices, or a separate processing device) comprises a beamformer filtering unit allowing beamforming according to the present disclosure to be performed based on at least two of the electric input signals of the first and second hearing devices. In an embodiment, each of the first and second hearing devices comprises a beamformer filtering unit providing beamforming, e.g. including estimation of speech presence probability and providing a resulting beamformed signal, according to the present disclosure. The beamforming may e.g. be based on the locally generated first and second electric input signals, or based on one or both locally generated electric input signals and one or both electric input signals from the opposite hearing device (or parts thereof, e.g. selected frequency ranges/bands). The microphone directions of the ‘local’ microphone systems of the respective first and second hearing devices (HD1, HD2) are indicated in FIG. 1C (denoted REF-DIR1 and REF-DIR2, respectively). An advantage of using microphones from both hearing devices is that a resulting beamformer can be more advanced (include more lobes of high sensitivity and/or more minima in its angular sensitivity (polar plot)).

(29) In the embodiments of FIG. 1A, 1B, 1C, the hearing device(s) are shown to comprise a ‘behind the ear’ (BTE) part wherein the microphone(s) is(are) located. Other styles of hearing devices comprising parts adapted for being located elsewhere on the head of the user (e.g. in or around ears of the user) may be applied, while still advantageously providing estimation of speech presence probability and possibly providing a resulting beamformed signal according to the present disclosure.

(30) A. Signal Model

(31) We assume that the signal x.sub.m(n) received at microphone m consists of a clean signal s.sub.m(n) and an additive noise component v.sub.m(n),
x.sub.m(n)=s.sub.m(n)+v.sub.m(n); m=1, . . . , M.   (1)

(32) Each microphone signal is passed through an analysis filter bank, leading to the time-frequency representation,
X.sub.m(k,l)=S.sub.m(k,l)+V.sub.m(k,l), m=1, . . . , M,   (2)
where k and l denote a frequency and a time index, respectively. Generally, X.sub.m(k,l), S.sub.m(k,l), V.sub.m(k,l) ∈ ℂ, i.e. they are complex-valued. Stacking microphone signals for a particular (k,l) pair in a vector, we arrive at
X(k,l)=S(k,l)+V(k,l),   (3)
where X(k,l)=[X.sub.1(k,l) . . . X.sub.M(k,l)].sup.T is an M×1 vector, superscript .sup.T denotes transposition, and where vectors S(k,l) and V(k,l) are defined similarly.
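By way of illustration, the filter-bank analysis of Eq. (2) and the per-bin stacking of Eq. (3) may be sketched as follows. This is a minimal sketch assuming an STFT-based analysis filter bank; the sampling rate, FFT size, and hop size are illustrative assumptions, not values from the disclosure.

```python
import numpy as np
from scipy.signal import stft

fs = 20_000                       # sampling rate [Hz] (assumed)
n_fft, hop = 128, 64              # analysis filter-bank settings (assumed)

def analysis_filter_bank(x_time):
    """x_time: (M, n_samples) time-domain microphone signals x_m(n).
    Returns X: (M, K, n_frames) complex coefficients X_m(k, l), cf. Eq. (2)."""
    _, _, X = stft(x_time, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return X

# Stack the M microphone values for one (k, l) pair into an M x 1 vector,
# cf. Eq. (3): X(k,l) = [X_1(k,l) ... X_M(k,l)]^T.
M = 2
x_time = np.random.randn(M, fs)   # 1 s of dummy audio
X = analysis_filter_bank(x_time)
k, l = 10, 5
X_kl = X[:, k, l].reshape(M, 1)
```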
B. Spatial Decomposition

(33) We will be interested in the spatial origin of the clean and noisy signals. Hence, we divide space into segments, e.g. cells, as shown in FIG. 2. All parameters (k,l,θ,r) are discrete-valued. In particular, for a given frequency k and time instant l, space is divided into cells parameterized by (θ,r). The pair (θ,r) represents the angle and distance, respectively, of a spatial cell with respect to the center of the user's head, and is selected from a discrete set {θ.sub.i,r.sub.j}, i=1, . . . , T, j=1, . . . , R. We consider here a 2-dimensional representation of space for simplicity; the extension to a 3-dimensional description is straightforward.

(34) To perform this spatial decomposition of the clean and noisy signals, we use spatial filters (beamformers). Specifically, to decompose the clean signal into spatial cells S(k,l,θ,r), beamformers are applied to the clean microphone signal vector S(k,l) (this is e.g. done in an off-line training phase, where the clean signal is accessible, see below for details). For example, S(k,l,θ,r) may be computed as
S(k,l,θ,r)=W.sub.S.sup.H(k,θ,r)S(k,l)   (4)
where W.sub.S(k,θ,r) ∈ ℂ.sup.M is a beamformer weight vector, given by
W.sub.S(k,θ,r)=d(k,θ,r)/(d.sup.H(k,θ,r)d(k,θ,r))   (5)
where d(k,θ,r) ∈ ℂ.sup.M is the acoustic transfer function vector whose entries are acoustic transfer functions from the spatial position (θ,r) to each microphone, and where superscript .sup.H denotes vector transposition and complex conjugation (Hermitian transposition). Note that this beamformer is time-invariant (independent of l).

(35) To compute X(k,l,θ,r) from the noisy microphone signals, a minimum variance distortionless response (MVDR) beamformer, W.sub.X(k,l,θ,r), may, for example, be applied to the noisy microphone signal vector X(k,l):

(36)
$$X(k,l,\theta,r) = W_X^H(k,l,\theta,r)\, X(k,l) \tag{6}$$
where
$$W_X(k,l,\theta,r) = \frac{C_X^{-1}(k,l)\, d(k,\theta,r)}{d^H(k,\theta,r)\, C_X^{-1}(k,l)\, d(k,\theta,r)} \tag{7}$$
and where C.sub.X ≜ E[X(k,l)X.sup.H(k,l)] is the cross-power spectral density matrix of the noisy signal, which can readily be estimated from the noisy microphone signals. Other beamformers could be used here, e.g., W.sub.S(k,θ,r) (Eq. (5)). The advantage of using the MVDR beamformer W.sub.X in Eq. (7), however, is that this beamformer preserves signal components from position (θ,r) perfectly, while maximally suppressing signal components from other positions (this reduces “leakage” of unwanted signal components into X(k,l,θ,r) and ensures an optimal estimate of the noisy signal component originating from position (θ,r)).
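A minimal NumPy sketch of the beamformers of Eqs. (5) and (7) follows; the diagonal loading added before matrix inversion is an assumption made for numerical robustness and is not part of the disclosure.

```python
import numpy as np

def clean_beamformer_weights(d):
    """W_S(k,theta,r) = d / (d^H d), cf. Eq. (5). d: (M,) complex ATF vector."""
    return d / np.vdot(d, d)

def mvdr_weights(C_X, d, diag_load=1e-6):
    """MVDR weights, cf. Eq. (7). C_X: (M, M) noisy cross-power spectral
    density matrix; d: (M,) acoustic transfer function vector.
    diag_load is an assumed regularizer, not part of the disclosure."""
    M = d.shape[0]
    C_inv_d = np.linalg.solve(C_X + diag_load * np.eye(M), d)
    return C_inv_d / (d.conj() @ C_inv_d)

# Beamformed noisy cell signal, cf. Eq. (6):
# X_cell = np.vdot(mvdr_weights(C_X, d), X_kl)   # W^H X(k,l)
```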

(37) FIG. 2 schematically illustrates how, for a particular time instant l and a particular frequency index k, space around a user (U) is divided into cells (θ.sub.i,r.sub.j) parameterized by the angle θ and distance r (e.g. to the center of the cell), with respect to the center of the user's head. The user (U) wears an exemplary binaural hearing system comprising first and second hearing devices located at left and right ears of the user, as e.g. illustrated in FIG. 1C. Values (S(k,l,θ.sub.i,r.sub.j) and S(k,l,θ.sub.i′,r.sub.j′)) of a signal S in a specific frequency band (k) at a specific time (l) are indicated for two different spatial cells (θ.sub.i,r.sub.j) and (θ.sub.i′,r.sub.j′). In an embodiment, specific values of the signal are determined for a multitude of, such as all, cells of the space around the user. The space around the user may e.g. be limited to a certain distance, e.g. r.sub.j<r.sub.max, as e.g. indicated in FIG. 2 by the outer bold dashed circle. In an embodiment, space around a user having a radial value r.sub.j larger than a (e.g. predefined) threshold value r.sub.th is represented by a single cell for each specific angular value θ.sub.i; i.e., in the illustration of FIG. 2, each ‘pie slice’ (represented by a specific value of θ) outside the bold dashed circle (in that case representing the threshold value r.sub.th) only contains one cell. Likewise, the cells of the space around the user may be of equal or different size. In an embodiment, the cell size varies with radial distance (r.sub.j) from and/or angle (θ.sub.i) around the user (U). The size of the cells may e.g. increase with increasing radial distance from the user. In an embodiment, the cell size is not uniform in an angular view, e.g. comprising smaller cells in front of the user than elsewhere. In an embodiment, the spatial segmentation is configurable, e.g. from a user interface, e.g. implemented in a remote control or as an APP of a smartphone or similar device (e.g. a tablet computer). The number of spatial segments in an angular direction around the user (each segment being defined by a specific value of θ.sub.i) is larger than or equal to two, e.g. larger than or equal to three, such as larger than or equal to four. The number of spatial segments in a radial direction around the user (each segment being defined by a specific value of r.sub.j) is larger than or equal to one, e.g. larger than or equal to two, such as larger than or equal to three.

(38) C. Speech Presence Probability (SPP) Estimation

(39) For each spatial cell and for a particular time l and frequency k, we consider the following hypotheses:
H.sub.0(k,l,θ,r): S(k,l,θ,r)=0 (Speech is absent)   (8)
H.sub.1(k,l,θ,r): S(k,l,θ,r)≠0 (Speech is present)   (9)

(40) The SPP is defined as the probability that speech is present, i.e., P(H.sub.1).

(41) In order to estimate P(H.sub.1), we define the following indicator function:

(42)
$$I(k,l,\theta,r) = \begin{cases} 1 & \text{if } S(k,l,\theta,r) \neq 0 \\ 0 & \text{if } S(k,l,\theta,r) = 0 \end{cases} \tag{10}$$

(43) To estimate P(H.sub.1), we will be interested in finding an estimate Î(k,l,θ,r) of I(k,l,θ,r) based on the (generally) noisy microphone signals. In principle, the estimate could be based on the entire observable noisy signal. In practice, however, it is mainly the noisy signal in the spectral, temporal, and spatial “neighbourhood” of (k,l,θ,r) that carries information about the speech presence in the frequency, time, and space segment (e.g. cell) (k,l,θ,r). The term ‘spectral neighbourhood’ may e.g. include frequencies within +/−100 Hz of the frequency in question. The term ‘temporal neighbourhood’ may e.g. include time instances within +/−50 ms from the current time. In an embodiment, the term ‘spatial neighbourhood’ may include space cells located within a radius of 0.4 m, such as within 0.25 m, of (e.g. the centre of) the spatial cell in question. Hence, let Z(k,l,θ,r) denote the noisy information upon which the estimate Î(k,l,θ,r) is based.

(44) Consider next the minimum mean-square estimator Î*(k,l,θ,r) of I(k,l,θ,r):

(45)
$$\hat{I}^* = \arg\min_{\hat{I}} \, E\left\{ (I - \hat{I})^2 \mid Z \right\} \tag{11}$$
where we dropped the parameter dependencies for notational convenience. Then it can be shown (details omitted) that the SPP is simply equal to Î*:
P(H.sub.1(k,l,θ,r))=Î*(k,l,θ,r)   (12)
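The omitted details follow from standard MMSE estimation theory: the conditional mean minimizes the conditional mean-square error in Eq. (11), and for a binary indicator the conditional mean equals the posterior probability. A minimal sketch:

$$\hat{I}^*(k,l,\theta,r) = E\{I \mid Z\} = 1 \cdot P(I=1 \mid Z) + 0 \cdot P(I=0 \mid Z) = P(H_1(k,l,\theta,r) \mid Z).$$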

(46) Hence, in order to find the SPP, we need to find the minimum mean-square error (MMSE) estimator Î*(k,l,θ,r) of I(k,l,θ,r). In the following, we describe a procedure to find this estimate, using supervised learning—in our example, we use deep neural networks (DNN), but other algorithmic structures could be used (e.g., estimators based on Gaussian Mixture Models, Hidden Markov Models, Support Vector Machines, etc.).

(47) Training: Finding the Parameters of a DNN MMSE Estimator

(48) For a given noisy microphone signal, X(k,l), we wish to compute the speech presence probability P(H.sub.1(k,l,θ.sub.i,r.sub.j)), i=1, . . . , T, j=1, . . . , R. From Eq. (12) it follows that this is equivalent to computing the MMSE estimates Î*(k,l,θ.sub.i,r.sub.j), i=1, . . . , T, j=1, . . . , R. We propose to find these MMSE estimates using deep neural networks (DNNs) whose parameters are found in an offline supervised learning procedure. The procedure requires access to a (large) set of training signals, i.e., examples of noisy microphone signals X(k,l) and corresponding binary speech presence variables I(k,l,θ,r). In the following, an example of how this training data is constructed will be illustrated.

(49) A. Generating Clean and Noisy Microphone Signals for Training

(50) Clean and noisy microphone signals are generated (or recorded) which vary in 1) the target speech source (different talkers, different speech signals for each talker), 2) target spatial position (θ.sub.i,r.sub.j), e.g. by generating clean microphone signals by convolving the speech signals from the point above with impulse responses from various spatial positions to microphones located on/at the ears of various persons. 3) the additive noise type (e.g., cocktail party noise, car cabin noise, competing speakers, other environmental noise, etc.), 4) the signal-to-noise ratio (SNR) at which the target signal is typically observed in practice, in the application at hand (e.g., −15 dB≤SNR≤25 dB, or −10 dB≤SNR≤30 dB), 5) head size, 6) microphone variation.

(51) A large corpus of microphone signals is generated by combining the factors described above; common to the used combinations is that they represent noisy signals which could typically be experienced in a real-life situation. Hence, if prior knowledge of any of these factors is available, then the noisy signals used for training should reflect this knowledge. If, for example, the identity of the target talker is known, then only speech signals from this particular individual should be used in point 1). Similarly, if it is known that a particular noise type is to be expected (e.g. car cabin noise in a car application), then the noise used to generate the noisy microphone signals (point 3 above) should be dominated by car noise. Advantageously, the data (microphone signals) are recorded with a hearing device or a pair of hearing devices as in the intended use case (e.g. same style, same number and location of microphones relative to the user, etc.). In an embodiment, at least some of the data are gathered by the user himself while wearing a hearing device or a pair of hearing devices fitted to him and identical or similar to the one where the data is intended to be used.
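One step of the data generation above (mixing a clean signal with noise at a prescribed SNR, point 4) may be sketched as follows; the gain computation is standard, and the uniform SNR draw mirrors the range mentioned above, but the function names and parameters are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(s, v, snr_db):
    """Scale noise v so that the resulting SNR is snr_db, then add.
    s, v: (M, n_samples) clean and noise microphone signals, cf. Eq. (1)."""
    p_s = np.mean(np.abs(s) ** 2)
    p_v = np.mean(np.abs(v) ** 2)
    g = np.sqrt(p_s / (p_v * 10.0 ** (snr_db / 10.0)))
    return s + g * v              # noisy microphone signals x_m(n)

# Example: draw an SNR from the range mentioned above (-15 dB ... +25 dB).
rng = np.random.default_rng(0)
snr_db = rng.uniform(-15.0, 25.0)
# x = mix_at_snr(s, v, snr_db)
```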

(52) B. Finding Training Pairs I(k,l,θ,r) and X(k,l)

(53) From the clean target signals generated above (i.e., Points 1 and 2), the binary speech presence indicator function I(k,l,θ,r) is computed. The procedure is illustrated in FIG. 3: a particular clean training signal (Point 1 above) from a particular target location (θ′,r′) (Point 2 above) is passed through analysis filter banks, leading to signals S.sub.m(k,l), m=1, . . . , M. The filter bank signals are then passed through beamformers (e.g. Eq. (5)) steered towards locations {θ.sub.i,r.sub.j}, i=1, . . . , T; j=1, . . . , R, resulting in signals (as functions of k and l, i.e. “spectrograms”) S(k,l,θ.sub.i,r.sub.j) for each i=1, . . . , T; j=1, . . . , R. The ground-truth indicator function I(k,l,θ.sub.i,r.sub.j) is computed by deciding if the resulting S(k,l,θ.sub.i,r.sub.j) is significantly different from 0. In practice, this may be done by comparing the signal energy in cell S(k,l,θ.sub.i,r.sub.j) with a small threshold ϵ>0:

(54)
$$I(k,l,\theta,r) = \begin{cases} 1 & \text{if } |S(k,l,\theta,r)|^2 > \epsilon \\ 0 & \text{otherwise} \end{cases} \tag{13}$$
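Eq. (13) in code form, as a minimal sketch; the threshold value ϵ is an assumed example and would in practice be tied to the signal scale.

```python
import numpy as np

def ground_truth_indicator(S_cell, eps=1e-8):
    """Binary speech presence indicator I(k,l,theta_i,r_j), cf. Eq. (13).
    S_cell: (K, n_frames) complex beamformed clean signal for one cell.
    eps: small energy threshold (assumed value)."""
    return (np.abs(S_cell) ** 2 > eps).astype(np.float32)
```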

(55) FIG. 3 shows an exemplary block diagram for determining ‘ground truth’ binary speech presence indicator functions I(k,l,θ,r) from clean microphone target signals s.sub.1(n), . . . , s.sub.M(n), here M=2.

(56) In order to train DNNs, the ground-truth binary speech presence indicator functions (Eq. (13)) are stored together with noisy versions (Points 3 and 4, above) of the particular underlying clean training signal (Points 1 and 2, above) that gave rise to the speech presence indicator function in question.

(57) The result of this procedure is a (large) collection of pairs of indicator functions I(k,l,θ,r) and noisy signals X(k,l), for which the underlying clean signal gave rise to exactly that indicator function.

(58) C. Training DNN MMSE Estimators

(59) FIG. 4 shows an exemplary block diagram for training of an algorithm, e.g. a neural network, such as a deep neural network DNN Ψ.sub.θi,rj for estimating the speech presence probability for a particular spatial cell (θ.sub.i,r.sub.j). The trained DNN is represented by the parameter set Ψ*.sub.θi,rj (see FIG. 5). The process is repeated to train independent DNNs Ψ*.sub.θi,rj for each spatial cell (θ.sub.i,r.sub.j). The circuit for training the neural network DNN Ψ.sub.θi,rj comprises a multitude M of microphones M1, . . . , MM (M≥2) for capturing environment sound signals x.sub.1(n), . . . , x.sub.M(n), n denoting time, and providing respective (e.g. analogue or digitized) electric input signals IN1, . . . , INM. Each of the microphone paths comprises an analysis filter bank FB-A1, . . . , FB-AM, respectively, for (possibly digitizing and) converting respective time domain electric input signals IN1, . . . , INM to corresponding electric input signals X.sub.1(k,l′), . . . , X.sub.M(k,l′) in a time frequency representation, where k and l′ are frequency and time (frame) indices, respectively. The electric input signals X.sub.1(k,l′), . . . , X.sub.M(k,l′) are fed to beamformer W.sub.X(k,l′,θ.sub.i,r.sub.j), and processed as described in the following.

(60) The set of pairs of indicator functions I(k,l,θ,r) and corresponding noisy signals X(k,l) are used to train DNN-based MMSE estimators of I(k,l,θ,r). The training procedure is illustrated in FIG. 4. A noisy training signal (M microphone signals) is passed through analysis filter banks, resulting in signals X.sub.1(k,l), . . . , X.sub.M(k,l). For a particular time instant l′, the noisy signals are passed through beamformers W.sub.X(k,l′,θ.sub.i,r.sub.j) steered towards a particular spatial cell (θ.sub.i,r.sub.j) (cf. Eq. (7) and FIG. 2), for each frequency index k=1, . . . , K. The resulting signal is X(k,l′,θ.sub.i,r.sub.j), which represents the part of the noisy signal originating from spatial cell (θ.sub.i,r.sub.j). Next, values of X(k,l,θ.sub.i,r.sub.j) are chosen, which are used to estimate I(k,l,θ.sub.i,r.sub.j). In particular, for a given time instant l=l′, the values I(k,l′,θ.sub.i,r.sub.j), k=1, . . . , K could be estimated using present and past noisy signal values, X(k,l″,θ.sub.i,r.sub.j), k=1, . . . , K; l″=l′−L+1, . . . , l′, where L denotes the number of past frames used to estimate I(k,l,θ.sub.i,r.sub.j). The number L of frames represents the ‘history’ of the signal that is included in the estimation of speech presence probability. With a view to the general nature of speech, the ‘history’ (L) may include up to 50 ms of the input signal, or up to 100 ms, or more, of the input signal.

(61) This set of past and present values of X(k,l,θ.sub.i,r.sub.j) (denoted Z(k,l′,θ.sub.i,r.sub.j), and provided by the unit ‘Select noisy signal context Z(k,l′,θ.sub.i,r.sub.j)’ in FIGS. 4, 5) serves as input to a (e.g. deep) neural network. In particular, the input of the DNN has a dimension corresponding to the cardinality of this set. The input to the DNN may be the (generally complex-valued) spectral values X(k,l,θ.sub.i,r.sub.j), the magnitude spectral values |X(k,l,θ.sub.i,r.sub.j)| (as exemplified in FIGS. 4, 5), the log-magnitude values log |X(k,l,θ.sub.i,r.sub.j)|, the (generally complex-valued) cepstra computed by Fourier-transforming the log-magnitude values (cf. e.g. [3]), or the magnitude values of the complex-valued cepstra. Other functions applied to the input set are obviously possible. In the time-frequency map insert in the top part of FIG. 4 (and FIG. 5), the frequency range represented by indices k=1, . . . , K may be the full operational range of the hearing device in question (e.g. representing a frequency range between 0 and 12 kHz (or more)), or it may represent a more limited sub-band range (e.g. where speech elements are expected to be located, denoted ‘speech frequencies’, e.g. between 0.5 kHz and 8 kHz, or between 1 kHz and 4 kHz). A limited ‘noisy signal context Z(k,l′,θ.sub.i,r.sub.j)’ comprising a subset of frequency bands may be delimited by k.sub.min and k.sub.max, if indices k=1, . . . , K represent the full frequency range of the device. The ‘noisy signal context’ may contain a continuous range or selected sub-ranges between k.sub.min and k.sub.max.
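A sketch of the ‘Select noisy signal context’ unit: the magnitudes of the current and L−1 past frames of the beamformed cell signal are stacked into one DNN input vector. The dimensions follow the text (K frequency bins, L frames); the flattening order is an assumption.

```python
import numpy as np

def select_noisy_context(X_cell_mag, l_now, L=10):
    """Return Z(k,l',theta_i,r_j): magnitudes |X(k,l'',theta_i,r_j)| for
    frames l'' = l'-L+1 ... l', flattened to a K*L input vector.
    X_cell_mag: (K, n_frames) array; assumes l_now >= L-1."""
    frames = X_cell_mag[:, l_now - L + 1 : l_now + 1]   # (K, L) context
    return frames.reshape(-1)                           # length K*L
```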

(62) Noisy input sets Z(k,l′,θ.sub.i,r.sub.j), e.g. comprising |X(k,l″,θ.sub.i,r.sub.j)|, k=1, . . . , K; l″=l′−L+1, . . . , l′, |X| representing the magnitude of X, and corresponding ground-truth binary speech presence functions I(k,l′,θ.sub.i,r.sub.j), k=1, . . . , K (e.g. evaluated for all l′, i.e. slid through time, while for each value of l′ considering a ‘history’ of L time frames of noisy input signals X or |X|) are used to train a (deep) neural network. Using the neural network, we wish to estimate I(k,l′,θ.sub.i,r.sub.j), k=1, . . . , K for time ‘now’ (=l′), based on L observations up to (and including) time ‘now’ (see e.g. the time-frequency map insert in FIGS. 4, 5). The network parameters are collected in a set denoted by Ψ.sub.θi,rj; typically, this parameter set encompasses weight and bias values associated with each network layer. The network may be a feedforward multi-layer perceptron, a convolutional network, a recurrent network, e.g., a long short-term memory (LSTM) network, or combinations of these networks. Other network structures are possible. The output layer of the network may have a logistic (e.g. sigmoid) output activation function to ensure that outputs are constrained to the range 0 to 1. The network parameters may be found using standard, iterative, steepest-descent methods, e.g., implemented using back-propagation (cf. e.g. [4]), minimizing the mean-squared error (cf. signal err(k,l′,θ.sub.i,r.sub.j)) between the network output Î(k,l′,θ.sub.i,r.sub.j) and the ground truth I(k,l′,θ.sub.i,r.sub.j). The mean-squared error is computed across many training pairs of the ground-truth indicator functions I(k,l,θ.sub.i,r.sub.j) (for fixed i,j) and noisy signals X(k,l).
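A compact PyTorch sketch of one per-cell estimator Ψ.sub.θi,rj and one MSE training step, matching the description above (K·L inputs, K sigmoid outputs, mean-squared error, gradient descent with back-propagation). The hidden-layer sizes, optimizer choice, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

K, L = 64, 10                              # frequency bins, context frames

# Per-cell SPP estimator: K*L inputs -> K outputs in [0, 1] (sigmoid).
dnn = nn.Sequential(
    nn.Linear(K * L, 256), nn.Tanh(),      # hidden sizes are assumptions
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, K), nn.Sigmoid(),
)
optimizer = torch.optim.SGD(dnn.parameters(), lr=1e-3)
mse = nn.MSELoss()

def training_step(Z_batch, I_batch):
    """Z_batch: (B, K*L) noisy contexts Z(k,l',theta_i,r_j);
    I_batch: (B, K) ground-truth indicators I(k,l',theta_i,r_j)."""
    optimizer.zero_grad()
    I_hat = dnn(Z_batch)                   # network output, estimated SPP
    loss = mse(I_hat, I_batch)             # mean-squared error criterion
    loss.backward()                        # back-propagation
    optimizer.step()                       # steepest-descent update
    return loss.item()
```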

(63) The resulting network for signals captured from spatial cell (θ.sub.i,r.sub.j) is denoted Ψ*.sub.θi,rj (cf. FIG. 5). Networks are trained for each spatial cell, (θ.sub.i,r.sub.j), i=1, . . . , T, j=1, . . . , R.

Application of Trained DNNs for Speech Presence Probability Estimation

(65) Once trained, the DNNs Ψ*.sub.θi,rj are stored in memory (We use the superscript * to indicate that the networks are “optimal”, i.e., have been trained). They are then applied to noisy microphone signals as outlined in FIG. 5.

(66) FIG. 5 shows an application of trained DNNs Ψ*.sub.θi,rj to noisy microphone signals to produce speech presence probability estimates Î*(k,l,θ.sub.i,r.sub.j). A number of T×R DNNs are evaluated for i=1, . . . , T, j=1, . . . , R to produce speech presence probabilities P(H.sub.1(k,l,θ.sub.i,r.sub.j))=Î*(k,l′,θ.sub.i,r.sub.j). The circuit for providing speech presence probability estimates Î*(k,l,θ.sub.i,r.sub.j) comprises (as in FIG. 4) a multitude M of microphones M1, . . . , MM (M≥2) for capturing environment sound signals x.sub.1(n), . . . , x.sub.M(n), n denoting time, and providing respective (e.g. analogue or digitized) electric input signals IN1, . . . , INM. Each of the microphone paths comprises an analysis filter bank FB-A1, . . . , FB-AM, respectively, for (possibly digitizing and) converting respective time-domain electric input signals IN1, . . . , INM to corresponding electric input signals X.sub.1(k,l′), . . . , X.sub.M(k,l′) in a time-frequency representation, where k and l′ are frequency and time (frame) indices, respectively. The electric input signals X.sub.1(k,l′), . . . , X.sub.M(k,l′) are fed to beamformer W.sub.X(k,l′,θ.sub.i,r.sub.j) (cf. block ‘Apply beamformers W.sub.X(k,l′,θ.sub.i,r.sub.j)’ in FIG. 5) providing a beamformed signal X(k,l′,θ.sub.i,r.sub.j) for each spatial segment (θ.sub.i,r.sub.j). The beamformed signal X(k,l′,θ.sub.i,r.sub.j) for a given spatial segment (θ.sub.i,r.sub.j) is fed to the context unit (cf. block ‘Select noisy signal context Z(k,l′,θ.sub.i,r.sub.j)’ in FIG. 5) providing a current frame and a number of previous frames of the beamformed signal X(k,l′,θ.sub.i,r.sub.j) for a given spatial segment (θ.sub.i,r.sub.j) as signal Z(k,l′,θ.sub.i,r.sub.j) to the optimized neural network DNN Ψ*.sub.θi,rj (cf. e.g. FIG. 7) providing the estimated speech presence probability Î*(k,l′,θ.sub.i,r.sub.j) for each spatial segment (θ.sub.i,r.sub.j) at frequency k and time l′.

(67) The use cases for the resulting speech presence probabilities Î*(k,l′,θ.sub.i,r.sub.j) are numerous. For example, they may be used for voice activity detection, i.e. to decide that speech is present if Î*(k,l′,θ.sub.i,r.sub.j)>δ.sub.1, and decide that speech is absent if Î*(k,l′,θ.sub.i,r.sub.j)<δ.sub.2, where 0≤δ.sub.2≤δ.sub.1≤1 are pre-determined parameters. In contrast to existing methods (cf. e.g. [1]), which make such decisions on a per-time-frequency-tile basis, the proposed method includes the spatial dimension in the decision.
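The thresholded decision above may be sketched as follows; the concrete values of δ.sub.1 and δ.sub.2 are assumed examples within the stated constraint 0≤δ.sub.2≤δ.sub.1≤1.

```python
def vad_decision(spp, delta1=0.7, delta2=0.3):
    """Voice activity decision from one SPP estimate I*(k,l',theta_i,r_j).
    Threshold values are assumed examples, 0 <= delta2 <= delta1 <= 1."""
    if spp > delta1:
        return "speech present"
    if spp < delta2:
        return "speech absent"
    return "undecided"                 # between the two thresholds
```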

(68) Furthermore, if speech has been determined to be present at a particular time instant l and frequency k, the physical location of the speech source may be determined, e.g., by identifying the spatial cell i=1, . . . , T, j=1, . . . , R with the highest speech presence probability (other ways of making this decision exist). This information is useful because beamformers may then be constructed (e.g., MVDR beamformers as outlined in Eq. (7)), which extract the signal originating from this particular spatial location, while maximally suppressing signals originating from other locations. Alternatively, beamformers may be constructed which are a linear combination of beamformers directed at each spatial cell (θ.sub.i,r.sub.j), where the coefficients of the linear combination are derived from the speech presence probabilities [5], cf. e.g. FIG. 9. Further, other beamformers may be constructed, based on non-linear combinations.

(69) The exposition above has focused on a 2-dimensional spatial decomposition (i.e., in spatial cells (θ.sub.i,r.sub.j)) involving acoustic transfer functions d(k,θ.sub.i,r.sub.j) (cf. Eq. (7)). It is often advantageous to use relative acoustic transfer functions
d′(k,θ.sub.i)=d(k,θ.sub.i,r.sub.j)/d.sub.0(k,θ.sub.i,r.sub.j)
where d.sub.0(k,θ.sub.i,r.sub.j) ∈ ℂ is the acoustic transfer function from spatial position (θ.sub.i,r.sub.j) to a pre-chosen reference microphone. Relative transfer functions are essentially independent of source distance (hence, the dependence on distance r.sub.j has been suppressed in the notation). Substituting relative acoustic transfer functions d′ for absolute acoustic transfer functions d everywhere in the exposition allows us to decompose space in “pie slices” (FIG. 6), and to evaluate speech presence probabilities for each pie slice (i.e., for each direction).
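The relative acoustic transfer function in code form, as a minimal sketch; the choice of reference microphone index is an assumption.

```python
import numpy as np

def relative_transfer_function(d, ref=0):
    """d'(k,theta_i) = d(k,theta_i,r_j) / d_0(k,theta_i,r_j), where d[ref]
    is the ATF to a pre-chosen reference microphone (index assumed).
    d: (M,) complex acoustic transfer function vector."""
    return d / d[ref]
```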

(70) We would then train DNNs Ψ*.sub.θi, i=1, . . . , T, which are dedicated to spatial directions (pie slices), rather than spatial cells. The usage of the resulting speech presence probabilities is completely analogous to the situation described above, where speech presence probabilities were estimated for spatial cells. The advantage of this solution is that fewer DNNs need to be trained, stored, and executed, because they are no longer dependent on hypothesized source distance.

(71) FIG. 6 shows how an exemplary spatial decomposition using relative acoustic transfer functions rather than absolute acoustic transfer functions results in a “pie slice” decomposition of space around a user (compared to the cell-based decomposition of FIG. 2). The spatial segmentation in FIG. 6 is equivalent to the spatial segmentation in FIG. 2, apart from the lack of radial partition in FIG. 6. As in FIG. 2, the user (U) wears an exemplary binaural hearing system comprising first and second hearing devices located at left and right ears of the user, as e.g. illustrated in FIG. 1C. Values (S(k,l,θ.sub.i) and S(k,l,θ.sub.i′)) of a signal S in a specific frequency band (k) at a specific time (l) are indicated for two different spatial segments corresponding to angular parameters θ.sub.i and θ.sub.i′, respectively. In an embodiment, specific values of the signal S are determined for a multitude of, such as all, segments of the space around the user. The number of segments is preferably larger than or equal to three, such as larger than or equal to four. The segments may represent a uniform angular division of space around the user, but may alternatively represent different angular ranges, e.g. a predetermined configuration, e.g. comprising a left and a right quarter-plane in front of the user and a half-plane to the rear of the user. The segments (or cells of FIG. 2) may be dynamically determined, e.g. in dependence of a current distribution of sound sources (target and/or noise sound sources).

(72) FIG. 7 schematically illustrates a neural network for determining a speech presence probability estimate (SPPE) Î*(k,l,θ.sub.i,r.sub.j) from a noisy input signal in a time-frequency representation.

(73) FIG. 7 schematically illustrates a neural network for determining an output signal (for a given spatial segment (θ.sub.i,r.sub.j)) in the form of a speech presence probability estimate Î*(k,l′) from a number (L) of time frames of the noisy input signal X(k,l′) in a time-frequency representation. A present time frame (l′) and a number L−1 of preceding time frames are stacked to a vector and used as the input layer of a neural network (together denoted Z(k,l′), cf. also the insert denoted ‘Context’ in the upper part of FIG. 4 (and FIG. 5)). Each frame comprises K (e.g. K=64 or K=128) values of a (noisy) electric input signal, e.g. X(k,l′), k=1, . . . , K in FIGS. 4, 5. The signal may be represented by its magnitude |X(k,l′)| (e.g. by ignoring its phase φ). An appropriate number of time frames is related to the correlation inherent in speech. In an embodiment, the number L−1 of previous time frames which are considered together with the present one may e.g. correspond to a time segment of duration of more than 20 ms, e.g. more than 50 ms, such as more than 100 ms. In an embodiment, the number of time frames considered (=L) is larger than or equal to 4, e.g. larger than or equal to 10, such as larger than or equal to 24. The width of the neural network is in the present application equal to K·L, which for K=64 and L−1=9 amounts to N.sub.L1=640 nodes of the input layer L1 (representing a time segment of the audio input signal of 32 ms, for a sampling frequency of 20 kHz, a number of samples per frame of 64, and assuming non-overlapping time frames). The number of nodes (N.sub.L2, . . . , N.sub.LN) in subsequent layers (L2, . . . , LN) may be larger or smaller than the number of nodes N.sub.L1 of the input layer L1, and is in general adapted to the application (in view of the available number of input data sets and the number of parameters to be estimated by the neural network). In the present case the number of nodes N.sub.LN in the output layer LN is K (e.g. 64), in that it comprises K time-frequency tiles of a frame of the probability estimate Î*(k,l′).

(74) FIG. 7 is intended to illustrate a general multi-layer neural network of any type, e.g. deep neural network, here embodied in a standard feed forward neural network. The depth of the neural network (the number of layers), denoted N in FIG. 7, may be any number and typically adapted to the application in question (e.g. limited by a size and/or power supply capacity of the device in question, e.g. a portable device, such as a hearing aid). In an embodiment, the number of layers in the neural network is larger than or equal to two or three. In an embodiment, the number of layers in the neural network is smaller than or equal to four or five.

(75) The nodes of the neural network illustrated in FIG. 7 are intended to implement the standard functions of a neural network: to multiply the values of branches from preceding nodes to the node in question with weights associated with the respective branches, and to add the contributions together to a summed value Y′.sub.v,u for node v in layer u. The summed value Y′.sub.v,u is subsequently subject to a non-linear function f, providing a resulting value Z.sub.v,u=f(Y′.sub.v,u) for node v in layer u. This value is fed to the next layer (u+1) via the branches connecting node v in layer u with the nodes of layer u+1. In FIG. 7 the summed value Y′.sub.v,u for node v in layer u (i.e. before the application of the non-linear (activation) function to provide the resulting value for node v of layer u) is expressed as:
$$Y'_{v,u} = \sum_{p=1}^{N_{L(u-1)}} w_{p,v}(u-1,u)\, Z_p(u-1)$$
where w.sub.p,v(u−1,u) denotes the weight to be applied to the branch from node p in layer u−1 to node v in layer u, and Z.sub.p(u−1) is the signal value of the p.sup.th node in layer u−1. In an embodiment, the same activation function ƒ is used for all nodes (this may not necessarily be the case, though). An exemplary non-linear activation function Z=f(Y) is schematically illustrated in the insert in FIG. 7. Typical functions used in neural networks are the sigmoid function and the hyperbolic tangent function (tanh). Other functions may be used, though, as the case may be. Further, the activation function may be parametrized.
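A NumPy rendering of the node computation above, layer by layer; tanh is used as the common activation f (one of the examples mentioned in the text), and bias terms are omitted, as in the summed-value formula above.

```python
import numpy as np

def forward_pass(z, weights, f=np.tanh):
    """Propagate an input vector z through a feedforward network.
    weights: list of (N_{L(u-1)}, N_{L(u)}) matrices w(u-1,u).
    Per layer: Y'_{v,u} = sum_p w_{p,v}(u-1,u) * Z_p(u-1), then Z = f(Y')."""
    for W in weights:
        z = f(z @ W)          # weighted sum, then non-linear activation
    return z
```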

(76) Together, the (possibly parameterized) activation function and the weights w of the different layers of the neural network constitute the parameters of the neural network. They represent the parameters that (together) are optimized in respective iterative procedures for the neural networks of the present disclosure. In an embodiment, the same activation function ƒ is used for all nodes (so in that case, the ‘parameters of the neural network’ are constituted by the weights of the layers).

(77) The neural network of FIG. 7 may e.g. represent a neural network according to the present disclosure (cf. e.g. DNN, Ψ*.sub.θirj in FIG. 5).

(78) Typically, the neural network according to the present disclosure is optimized (trained) in an offline procedure (e.g. as indicated in FIG. 4), e.g. using a model of the head and torso of a human being (e.g. the Head and Torso Simulator (HATS) 4128C from Brüel & Kjær Sound & Vibration Measurement A/S). In an embodiment, data for training the neural network (possibly in an offline procedure) may be picked up and stored while the user wears the hearing device or hearing system, e.g. over a longer period of time, e.g. days, weeks or even months. Such data may e.g. be stored in an auxiliary device (e.g. a dedicated, e.g. portable, storage device, or in a smartphone). This has the advantage that the training data are relevant for the user's normal behaviour and experience of acoustic environments.

(79) FIG. 8 schematically shows an embodiment of a hearing device according to the present disclosure. The hearing device (HD), e.g. a hearing aid, is of a particular style (sometimes termed receiver-in-the-ear, or RITE, style) comprising a BTE-part (BTE) adapted for being located at or behind an ear of a user, and an ITE-part (ITE) adapted for being located in or at an ear canal of the user's ear and comprising a receiver (loudspeaker). The BTE-part and the ITE-part are connected (e.g. electrically connected) by a connecting element (IC) and internal wiring in the ITE- and BTE-parts (cf. e.g. wiring W.sub.X in the BTE-part).

(80) In the embodiment of a hearing device in FIG. 8, the BTE part comprises two input units comprising respective input transducers (e.g. microphones) (M.sub.BTE1, M.sub.BTE2), each for providing an electric input audio signal representative of an input sound signal (S.sub.BTE) (originating from a sound field S around the hearing device). The input unit further comprises two wireless receivers (WLR.sub.1, WLR.sub.2) (or transceivers) for providing respective directly received auxiliary audio and/or control input signals (and/or allowing transmission of audio and/or control signals to other devices). The hearing device (HD) comprises a substrate (SUB) whereon a number of electronic components are mounted, including a memory (MEM), e.g. storing different hearing aid programs (e.g. parameter settings defining such programs, or parameters of algorithms, e.g. optimized parameters of a neural network) and/or hearing aid configurations, e.g. input source combinations (M.sub.BTE1, M.sub.BTE2, WLR.sub.1, WLR.sub.2), e.g. optimized for a number of different listening situations. The substrate further comprises a configurable signal processor (DSP, e.g. a digital signal processor, e.g. including a processor (e.g. PRO in FIG. 9) for applying a frequency and level dependent gain, providing feedback suppression and beamforming, filter bank functionality, and other digital functionality of a hearing device according to the present disclosure). The configurable signal processor (DSP) is adapted to access the memory (MEM) and to select and process one or more of the electric input audio signals and/or one or more of the directly received auxiliary audio input signals, based on a currently selected (activated) hearing aid program/parameter setting (e.g. automatically selected, e.g. based on one or more sensors, and/or selected based on inputs from a user interface). The mentioned functional units (as well as other components) may be partitioned in circuits and components according to the application in question (e.g. with a view to size, power consumption, analogue vs. digital processing, etc.), e.g. integrated in one or more integrated circuits, or as a combination of one or more integrated circuits and one or more separate electronic components (e.g. inductor, capacitor, etc.). The configurable signal processor (DSP) provides a processed audio signal, which is intended to be presented to a user. The substrate further comprises a front-end IC (FE) for interfacing the configurable signal processor (DSP) to the input and output transducers, etc., and typically comprising interfaces between analogue and digital signals. The input and output transducers may be individual separate components, or integrated (e.g. MEMS-based) with other electronic circuitry.

(81) The hearing device (HD) further comprises an output unit (e.g. an output transducer) providing stimuli perceivable by the user as sound based on a processed audio signal from the processor or a signal derived therefrom. In the embodiment of a hearing device in FIG. 8, the ITE-part comprises the output unit in the form of a loudspeaker (also termed a ‘receiver’) (SPK) for converting an electric signal to an acoustic (air-borne) signal, which (when the hearing device is mounted at an ear of the user) is directed towards the ear drum (Ear drum), where the sound signal (S.sub.ED) is provided. The ITE-part further comprises a guiding element, e.g. a dome (DO), for guiding and positioning the ITE-part in the ear canal (Ear canal) of the user. The ITE-part further comprises a further input transducer, e.g. a microphone (M.sub.ITE), for providing an electric input audio signal representative of an input sound signal (S.sub.ITE).

(82) The electric input signals (from input transducers M.sub.BTE1, M.sub.BTE2, M.sub.ITE) may be processed according to the present disclosure in the time domain or in the (time-) frequency domain (or partly in the time domain and partly in the frequency domain as considered advantageous for the application in question).

(83) The hearing device (HD) exemplified in FIG. 8 is a portable device and further comprises a battery (BAT), e.g. a rechargeable battery, e.g. based on Li-Ion battery technology, e.g. for energizing electronic components of the BTE- and possibly ITE-parts. In an embodiment, the hearing device, e.g. a hearing aid, is adapted to provide a frequency dependent gain and/or a level dependent compression and/or a transposition (with or without frequency compression) of one or more frequency ranges to one or more other frequency ranges, e.g. to compensate for a hearing impairment of a user.

(84) FIG. 9 shows a hearing device (HD) according to a second embodiment of the present disclosure. The lower part of FIG. 9 comprises the same elements as the block diagram described in connection with FIG. 5. The microphones M1, . . . , MM and the associated analysis filter banks FB-A1, . . . , FB-AM, together with the blocks of the upper part of FIG. 9, represent a forward path of the hearing device. The (noisy) electric input signals X.sub.1(k,l′), . . . , X.sub.M(k,l′) in a time-frequency representation are fed to the resulting beamformer W.sub.res(k,l′) (cf. block ‘Apply resulting beamformer W.sub.res(k,l′)’ in FIG. 9). The resulting beamformer W.sub.res(k,l′) provides a resulting beamformed signal Y.sub.res(k,l′), which is fed to the processor (PRO) for further signal processing, e.g. for applying processing algorithms compensating for a hearing impairment of the user (and/or for compensation of a difficult listening condition). The processor provides the processed signal Y.sub.G(k,l′), which is fed to the synthesis filter bank FB-S for conversion to the time-domain signal Y.sub.G. The time-domain signal Y.sub.G is fed to the output transducer (SPK) for conversion to an audible signal to be presented to the user.

(85) The resulting beamformer W.sub.res(k,l′) receives the electric input signals X.sub.1(k,l′), . . . , X.sub.M(k,l′) in a time frequency representation. The resulting beamformer W.sub.res(k,l′) further receives the estimated speech presence probabilities Î*(k,l′,θ.sub.i,r.sub.j) for each spatial segment (θ.sub.i,r.sub.j) from the optimized neural networks (DNN Ψ*.sub.θi,rj). The resulting beamformer W.sub.res(k,l′) in addition receives the beamformer weights w.sub.ij(k,l′) of the beamformers providing the beamformed signals X(k,l′,θ.sub.i,r.sub.j) for the respective spatial segments (θ.sub.i,r.sub.j) from the beamformer filtering unit W.sub.X(k,l′,θ.sub.i,r.sub.j). The resulting beamformed signal Y.sub.res is given by the expression:
Y.sub.res(k,l)=X(k,l)·w.sub.res(k,l).sup.T
where superscript .sup.T denotes transposition. The beamformed signal Y.sub.res is here determined as the linear combination
Y.sub.res=X.sub.1·w.sub.1,res+X.sub.2·w.sub.2,res+ . . . +X.sub.M·w.sub.M,res,
where each of the M noisy electric input signals [X.sub.1, X.sub.2, . . . X.sub.M] and the coefficients [w.sub.1,res, w.sub.2,res, . . . , w.sub.M,res] (and hence the beamformed signal Y.sub.res) are defined in a time frequency representation (k,l). The coefficients w.sub.res(k,l) of the linear combination are given by the following expression:
w.sub.res(k,l)=Σ.sub.i=1.sup.T Σ.sub.j=1.sup.R P.sub.ij(k,l)·w.sub.ij(k,l),
where k and l are frequency and time indices, respectively, T×R is the number of spatial segments (cf. e.g. FIG. 2), P.sub.ij(k,l) are the estimated speech presence probabilities Î*(k,l) for the (i,j).sup.th spatial segment, and w.sub.ij(k,l) are the beamformer weights of the (i,j).sup.th beamformer directed at the (i,j).sup.th spatial segment. The coefficients w.sub.res(k,l) of the linear combination and the beamformer weights of the individual beamformers are here each represented by an M×1 vector (M rows, 1 column), where M is the number of input units, e.g. microphones.
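A minimal NumPy sketch of the resulting beamformer described above (not from the disclosure): the signals, the per-segment beamformer weights and the SPP estimates are random stand-ins, and the weights w.sub.ij are taken as time-invariant per frequency for brevity:

```python
import numpy as np

# Illustrative dimensions: M microphones, K frequency bins, L time frames,
# and T x R spatial segments (cf. FIG. 2).
M, K, L, T, R = 3, 65, 100, 8, 2
rng = np.random.default_rng(0)

# Noisy electric input signals X_m(k, l) and per-segment beamformer weights
# w_ij(k) (here time-invariant per frequency, for brevity).
X = rng.standard_normal((M, K, L)) + 1j * rng.standard_normal((M, K, L))
w_ij = rng.standard_normal((T, R, M, K)) + 1j * rng.standard_normal((T, R, M, K))

# Estimated speech presence probabilities P_ij(k, l) from the per-segment DNNs.
P_ij = rng.uniform(0.0, 1.0, size=(T, R, K, L))

# w_res(k, l) = sum_i sum_j P_ij(k, l) * w_ij(k, l): one M-vector per tile.
w_res = np.einsum('ijkl,ijmk->mkl', P_ij, w_ij)

# Y_res(k, l) = sum_m X_m(k, l) * w_m,res(k, l): the linear combination above.
Y_res = np.sum(X * w_res, axis=0)
print(Y_res.shape)                           # (K, L)
```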
Own Voice:

(86) FIG. 10 schematically illustrates an exemplary ‘low dimensional’ spatial decomposition focusing on estimation of own voice presence probability. The spatial distribution for own voice presence probability estimation comprises at least two cells, e.g. two, three or more cells. As illustrated in FIG. 10, the spatial distribution of cells comprises three spatial volumes, denoted z1, z2, z3 (with associated beamformed signals S.sub.1, S.sub.2, S.sub.3, respectively). One of the spatial cells (z1) is located around the mouth of the user (U) of the hearing device or devices (HD1, HD2). The reference to a given spatial cell (z1, z2, z3) is intended also to refer to the signal (S.sub.1, S.sub.2, S.sub.3) estimated by the beamformer for that cell. The configuration of cells is intended to utilize the concept of the present disclosure to create beamformers that each cover a specific cell and together cover the space around the user (or a selected part thereof), and to provide respective SPP estimators for the individual spatial cells (segments). In the illustrated exemplary embodiment, the cell denoted z2 picks up sounds from behind, but close to, the user (which might mistakenly be taken as own voice). The cell denoted z3 picks up sounds from the environment around the user (exclusive of the near-field environment covered by cells z1 and z2). The cell z3 may cover the whole (remaining) space around the user, or be limited to a certain spatial angle or radius. In an embodiment of the segmentation of the space around the user, a single own voice cell (as e.g. indicated by z1 in FIG. 10) is provided. In another embodiment, a number of (e.g. smaller) cells around the user's mouth, which together constitute an own voice cell, are provided. The group of own voice cells may form part of a larger segmentation of space as e.g. exemplified in FIG. 2 or FIG. 6. The latter is illustrated in FIG. 11. The spatial segmentation of FIG. 11 is equal to the spatial segmentation of FIG. 2. A difference is that in FIG. 11, the spatial segments around the user's mouth (segments denoted S.sub.11, S.sub.12, S.sub.13, S.sub.14, S.sub.15, indicated by dotted filling) are predefined to possibly contain own voice. If the training data used to train the neural networks of a speech presence probability estimator of a given spatial cell include own voice data at various SNRs, etc., the network will be able to discriminate between own voice and other voices. In the case that the training data do not include an own voice sound source, a qualification of the speech presence probability estimate regarding its origin from own voice or other voices may be included, e.g. using a criterion related to the signal level or sound pressure level (SPL): a given SPP of an own voice cell (e.g. z1 (S.sub.1) in FIG. 10, or S.sub.11-S.sub.15 of FIG. 11) is assumed to relate to own voice if the level or SPL is above a certain ‘own voice threshold value’, and otherwise the SPP is assumed to relate to another voice, as sketched below.
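A minimal sketch of the level-based own-voice qualification just described; the ‘own voice threshold value’ and the details of the decision rule are assumptions for illustration:

```python
import numpy as np

def qualify_own_voice(spp, level_db, own_voice_threshold_db=65.0):
    """Qualify an own-voice-cell SPP estimate by signal level.

    If the level (or SPL) of a time-frequency tile exceeds the assumed
    'own voice threshold value', speech presence is attributed to the
    user's own voice; otherwise to another (more distant, weaker) talker.
    """
    is_own_voice = level_db > own_voice_threshold_db
    own_voice_spp = np.where(is_own_voice, spp, 0.0)
    return own_voice_spp, is_own_voice

# Example: two tiles of the own-voice cell z1, both with SPP = 0.9.
spp = np.array([0.9, 0.9])
level_db = np.array([72.0, 55.0])        # estimated level per tile (dB SPL)
own_voice_spp, mask = qualify_own_voice(spp, level_db)
print(own_voice_spp, mask)               # [0.9 0. ] [ True False]
```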

(87) Personalization:

(88) FIG. 12A illustrates a scheme for generating a test database of sound data for selecting a specific set of optimized parameters of a neural network among a number of pre-determined optimized parameters for different head models. FIG. 12B illustrates a scheme for selecting a specific set of optimized parameters of a neural network among a number of pre-determined optimized parameters for different head models using the test database of sound data determined in FIG. 12A.

(89) As illustrated in FIGS. 12A and 12B, from left to right, the method comprises:

(90) In FIG. 12A: S1. Provide M input transducer (essentially noise free (clean)) test signals in a time frequency representation (k,l). S2. Apply respective beamformers covering the individual spatial segments (z1, . . . , zN) around the user to provide (clean) beamformed test signals S(z) for the individual spatial segments. S3. Add various (known) amounts of noise to the clean beamformed signals to provide a test database of noisy beamformed time segments for (e.g. each of) the individual spatial segments. S4. Determine the true signal to noise ratio (SNR) of the individual time-frequency tiles of each noisy beamformed test signal. S5. Determine true speech presence measures (TSPM) of the individual time-frequency tiles of each noisy beamformed test signal.
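A minimal sketch of steps S3-S5 for a single spatial segment: because the clean beamformed signal and the added noise are both known, the true SNR of every time-frequency tile is available, and a true speech presence measure can be derived from it (here a binary indicator obtained by thresholding the true SNR at an assumed 0 dB):

```python
import numpy as np

rng = np.random.default_rng(1)
K, L = 65, 100                           # frequency bins x time frames

# S3: a clean beamformed test signal S(z) for one spatial segment, and a
# known amount of additive noise V, yielding the noisy test signal.
S = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
V = 0.5 * (rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L)))
X_noisy = S + V

# S4: the true SNR of each time-frequency tile (available since S, V known).
true_snr_db = 10.0 * np.log10(np.abs(S) ** 2 / np.abs(V) ** 2)

# S5: a true speech presence measure (TSPM) per tile, here a binary
# indicator obtained by thresholding the true SNR (0 dB is an assumption).
TSPM = (true_snr_db > 0.0).astype(float)
print(TSPM.mean())                       # fraction of tiles marked as speech
```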

(91) In FIG. 12B:

(92) Steps S1, S2, S3 of FIG. 12A (or select noisy beamformed time segments for (e.g. each of) the individual spatial segments from a test database of sound signals).

(93) S6. Apply the noisy beamformed time segments for (e.g. each of) the individual spatial segments from a test database of sound signals to optimized algorithms for different head models to provide corresponding speech presence probabilities (SPP) for each model and time segment for a given spatial segment (or all spatial segments). S7. Convert the speech presence probabilities (SPP) to (test) speech presence measures (SPM). S8. Compare the true (TSPM) and test (SPM) speech presence measures and provide a comparison speech presence measure for each spatial segment (or for all spatial segments). S9. Select an optimal head model, HM*, in dependence on the comparison speech presence measure and a cost function.
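A minimal sketch of steps S6-S9; the mean-squared-error cost, the SPP-to-SPM threshold and the dummy ‘head models’ (placeholders for the pre-trained networks DNN Ψ*) are assumptions for illustration:

```python
import numpy as np

def select_head_model(X_noisy, TSPM, head_models, threshold=0.5):
    """S6-S9: pick the head model whose SPP estimates best match the true
    speech presence measures (mean-squared-error cost is an assumption).

    `head_models` maps a model name to a callable returning SPP per tile,
    standing in for the pre-trained networks of the different head models.
    """
    costs = {}
    for name, spp_estimator in head_models.items():
        spp = spp_estimator(X_noisy)                 # S6: SPP per tile
        SPM = (spp > threshold).astype(float)        # S7: SPP -> measure
        costs[name] = np.mean((SPM - TSPM) ** 2)     # S8: compare to truth
    return min(costs, key=costs.get), costs          # S9: optimal model HM*

# Example with two dummy 'head models' (placeholders for trained networks).
rng = np.random.default_rng(2)
X_noisy = rng.standard_normal((65, 100))
TSPM = (rng.uniform(size=(65, 100)) > 0.5).astype(float)
models = {
    'HM_small': lambda X: rng.uniform(size=X.shape),  # uninformed guess
    'HM_large': lambda X: TSPM * 0.9 + 0.05,          # close to ground truth
}
best, costs = select_head_model(X_noisy, TSPM, models)
print(best, costs)                                    # HM_large wins
```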

(94) It is intended that the structural features of the devices described above, in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.

(95) As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, but an intervening element may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.

(96) It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.

(97) The claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.

(98) Accordingly, the scope should be judged in terms of the claims that follow.
