Hearing device comprising a noise reduction system
11533554 · 2022-12-20
Assignee
Inventors
- Poul Hoang (Smørum, DK)
- Jan M. DE HAAN (Smørum, DK)
- Jesper Jensen (Smørum, DK)
- Michael Syskind Pedersen (Smørum, DK)
CPC classification
G10K11/17837
PHYSICS
H04R2201/107
ELECTRICITY
G10K2210/1081
PHYSICS
International classification
H04R1/10
ELECTRICITY
G10K11/178
PHYSICS
Abstract
A hearing device adapted for being located at or in an ear of a user, or for being fully or partially implanted in the head of a user comprises a) an input unit for providing at least one electric input signal representing sound in an environment of the user, said electric input signal comprising a target speech signal from a target sound source and additional signal components, termed noise signal components, from one or more other sound sources, b) a noise reduction system for providing an estimate of said target speech signal, wherein said noise signal components are at least partially attenuated, and c) an own voice detector for repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech originating from the voice of the user. The noise signal components are identified during time segments wherein the own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user, or originates from the voice of the user with a probability above an own voice presence probability (OVPP) threshold value. A method of operating a hearing device is further disclosed.
Claims
1. A hearing aid adapted for being located at or in an ear of a user, or for being fully or partially implanted in the head of a user, the hearing aid comprising an input unit for providing at least one electric input signal representing sound in an environment of the user, said electric input signal comprising a target speech signal from a target sound source and additional signal components, termed noise signal components, from one or more other sound sources, a noise reduction system for providing an estimate of said target speech signal, wherein said noise signal components are at least partially attenuated, and an own voice detector for repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech originating from the voice of the user, wherein said hearing aid is configured to provide that said noise signal components are identified during time segments wherein said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user, or originates from the voice of the user with a probability above an own voice presence probability (OVPP) threshold value, and the target sound source is an external speaker in the environment of the hearing aid user.
2. The hearing aid according to claim 1, wherein the input unit comprises at least one microphone, each of the at least one microphone providing an electric input signal comprising said target speech signal and said noise signal components.
3. The hearing aid according to claim 2 comprising a voice activity detector for repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech.
4. The hearing aid according to claim 2 comprising one or more beamformers, and wherein the input unit is configured to provide at least two electric input signals connected to the one or more beamformers, and wherein the one or more beamformers are configured to provide at least one beamformed signal.
5. The hearing aid according to claim 2, wherein said noise signal components are additionally identified during time segments wherein said voice activity detector indicates an absence of speech in the at least one electric input signal, or a signal derived therefrom, or a presence of speech with a probability below a speech presence probability (SPP) threshold value.
6. The hearing aid according to claim 1 comprising a voice activity detector for repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech.
7. The hearing aid according to claim 6 comprising one or more beamformers, and wherein the input unit is configured to provide at least two electric input signals connected to the one or more beamformers, and wherein the one or more beamformers are configured to provide at least one beamformed signal.
8. The hearing aid according to claim 6, wherein said noise signal components are additionally identified during time segments wherein said voice activity detector indicates an absence of speech in the at least one electric input signal, or a signal derived therefrom, or a presence of speech with a probability below a speech presence probability (SPP) threshold value.
9. The hearing aid according to claim 1 comprising one or more beamformers, and wherein the input unit is configured to provide at least two electric input signals connected to the one or more beamformers, and wherein the one or more beamformers are configured to provide at least one beamformed signal.
10. The hearing aid according to claim 9, wherein the one or more beamformers comprises one or more own voice cancelling beamformers configured to attenuate signal components originating from the user's mouth, while signal components from all other directions are left unchanged or attenuated less.
11. The hearing aid according to claim 1, wherein said noise signal components are additionally identified during time segments wherein said voice activity detector indicates an absence of speech in the at least one electric input signal, or a signal derived therefrom, or a presence of speech with a probability below a speech presence probability (SPP) threshold value.
12. The hearing aid according to claim 1 comprising a voice interface for voice-control of the hearing aid or other devices or systems.
13. The hearing aid according to claim 1, wherein the target speech signal from the target sound source comprises an own voice speech signal from the hearing aid user.
14. The hearing aid according to claim 1, wherein the hearing aid further comprises a timer configured to determine a time segment of overlap between the own voice speech signal and a further speech signal.
15. The hearing aid according to claim 14, wherein the hearing aid is configured to determine whether said time segment exceeds a time limit, and if so to label the further speech signal as part of the noise signal component.
16. A binaural hearing system comprising a first and a second hearing aid as claimed in claim 1, the binaural hearing system being configured to allow an exchange of data between the first and the second hearing aids.
17. A method of operating a hearing aid adapted for being located at or in an ear of a user, or for being fully or partially implanted in the head of a user, the method comprising providing at least one electric input signal representing sound in an environment of the user, said electric input signal comprising a target speech signal from a target sound source and additional signal components, termed noise signal components, from one or more other sound sources, providing an estimate of said target speech signal, wherein said noise signal components are at least partially attenuated, repeatedly estimating whether or not, or with what probability, said at least one electric input signal, or a signal derived therefrom, comprises speech originating from the voice of the user, and identifying, by operation of said hearing aid, said noise signal components during time segments wherein said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user, or originates from the voice of the user with a probability above an own voice presence probability (OVPP) threshold value, wherein the target sound source is an external speaker in the environment of the hearing aid user.
18. A non-transitory computer readable medium on which is stored a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claim 17.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) The aspects of the disclosure may be best understood from the following detailed description taken in conjunction with the accompanying figures. The figures are schematic and simplified for clarity; they show only details which are essential to the understanding of the disclosure, while other details are left out. Throughout, the same reference numerals are used for identical or corresponding parts. The individual features of each aspect may each be combined with any or all features of the other aspects. These and other aspects, features and/or technical effects will be apparent from and elucidated with reference to the embodiments described hereinafter.
(15) Further scope of applicability of the present disclosure will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the disclosure, are given by way of illustration only. Other embodiments may become apparent to those skilled in the art from the following detailed description.
DETAILED DESCRIPTION OF EMBODIMENTS
(16) The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. Several aspects of the apparatus and methods are described by various blocks, functional units, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). Depending upon the particular application, design constraints or other reasons, these elements may be implemented using electronic hardware, a computer program, or any combination thereof.
(17) The electronic hardware may include micro-electronic-mechanical systems (MEMS), integrated circuits (e.g. application specific), microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), gated logic, discrete hardware circuits, printed circuit boards (PCB) (e.g. flexible PCBs), and other suitable hardware configured to perform the various functionality described throughout this disclosure, e.g. sensors, e.g. for sensing and/or registering physical properties of the environment, the device, the user, etc. Computer program shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
(18) The present application relates to the field of hearing devices, e.g. hearing aids.
(19) Speech enhancement and noise reduction are often needed in real-world audio applications where noise from the acoustic environment masks a desired speech signal, often resulting in reduced speech intelligibility. Examples of audio applications where noise reduction can be beneficial are hands-free wireless communication devices, e.g. headsets, automatic speech recognition systems, and hearing aids (HA). In particular, in applications such as headset communication devices, where a (‘far end’) human listener needs to understand the noisy own voice picked up by the headset microphones, noise can greatly reduce sound quality and speech intelligibility, making conversations more difficult.
(20) ‘Headset applications’ may in the present context include normal headset applications for use in communication with a ‘far end speaker’ e.g. via a network (such as office or call-centre applications) but also hearing aid applications where the hearing aid is in a specific ‘communication or telephone mode’ adapted to pick up a user's voice and transmit it to another device (e.g. a far-end-communication partner), while possibly receiving audio from the other device (e.g. from the far-end-communication partner).
(21) Noise reduction algorithms implemented in multi-microphone devices may comprise a set of linear filters, e.g. spatial filters and temporal filters, that are used to shape the sound picked up by the microphones. Spatial filters are able to alter the sound by enhancing or attenuating sound as a function of direction, while temporal filters alter the frequency response of the noisy signal to enhance or attenuate specific frequencies. To find the optimal filter coefficients, it is usually necessary to know the noise characteristics of the acoustic environment. Unfortunately, these noise characteristics are often unknown and need to be estimated online.
(22) Characteristics that are often necessary as inputs to multichannel noise reduction algorithms are e.g. the cross power spectral densities (CPSDs) of the noise. The noise CPSDs are for example needed for the minimum variance distortionless response (MVDR) and multichannel Wiener filter (MWF) beamformers which are common beamformers implemented in multi-microphone noise reduction systems.
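To make the role of the noise CPSD concrete, the following is a minimal sketch (not code from the disclosure) of how MVDR beamformer weights follow from an estimated noise CPSD matrix C_v and a steering vector d, as w = C_v⁻¹d / (dᴴC_v⁻¹d); the function name and matrix values are illustrative.

```python
import numpy as np

# Illustrative sketch (not part of the disclosure): MVDR beamformer weights
# for a single frequency bin, given a noise CPSD matrix C_v and a steering
# vector d. The weights minimize output noise power subject to w^H d = 1.
def mvdr_weights(C_v: np.ndarray, d: np.ndarray) -> np.ndarray:
    Cinv_d = np.linalg.solve(C_v, d)      # C_v^{-1} d without explicit inverse
    return Cinv_d / (d.conj() @ Cinv_d)   # normalize: distortionless response

# Example: M = 3 microphones with uncorrelated noise of unequal power.
C_v = np.diag([1.0, 2.0, 4.0]).astype(complex)
d = np.ones(3, dtype=complex)             # target impinging from broadside
w = mvdr_weights(C_v, d)
assert np.isclose(w.conj() @ d, 1.0)      # the target direction is preserved
```

Note how the noisier microphones receive smaller weights, which is exactly why an accurate noise CPSD estimate matters.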
(23) To estimate the noise statistics, researchers have developed a wide variety of estimators, e.g. [1-5]. In [1,4], a maximum likelihood (ML) estimator of the noise CPSD matrix during speech presence is proposed, assuming that the noise CPSD matrix remains identical up to a scalar multiplier. This estimator performs well when the underlying structure of the noise CPSD matrix does not change over time, e.g. for car cabin noise and isotropic noise fields, but may fail otherwise. In many realistic acoustic environments, the underlying structure of the noise CPSD matrix cannot be assumed fixed, for example when a prominent non-stationary interference noise source is present in the acoustic scene. In particular, when the interference is a competing speaker, many noise reduction systems fail at efficiently suppressing the competing speaker, as it is harder to determine whether the own voice or the competing speaker is the desired speech.
(24) In a first acoustic scene, a hearing device user 1 is shown together with a target sound source 2 and surrounding noise signal components 3.
(25) The hearing device user 1 may wear a hearing device comprising a first microphone 4 and a second microphone 5 on a left ear of the user 1, and a third microphone 6 and a fourth microphone 7 on the right ear of the user 1.
(26) The target sound source 2 may be located near the hearing device user 1 and may be configured to generate and emit a target speech signal into the environment of the user 1. The target source 2 may as such be a person, a radio, a television, etc. configured to generate a target speech signal. The target speech signal may be directed towards the user 1 or may be directed away from the user 1.
(27) The noise signal components 3 are shown to surround both the hearing device user 1 and the target sound source 2 and therefore affect the target source signal received at the hearing device user 1. The noise signal components may comprise localized noise sources (e.g. a machine, a fan, etc.), and/or distributed (diffuse, isotropic) noise sound sources.
(28) The first microphone 4, the second microphone 5, the third microphone 6 and the fourth microphone 7 may (each) provide an electric input signal comprising the target speech signal and the noise signal components 3.
(29) In a corresponding timing diagram, the speech activity of the user 1 and of the target sound source 2 is illustrated over time instances t0 to t8, together with the resulting outputs of an own voice voice activity detector (own voice VAD) and a general voice activity detector (VAD).
(31) The own voice VAD may detect that the user 1 is speaking in the time segment between t1 and t2 and in the time segment between t5 and t6. The VAD, on the other hand, will detect that speech (from both the user 1 and the target source 2) is being generated in the entire time segment from t1 to t8. However, depending on the resolution of the VAD used, there may be small breaks in detected voice activity in the segments t2 to t3, t4 to t5, and t6 to t7.
(33) In a classical approach (upper part of the timing diagram), where only the VAD is used to detect the presence of speech, the noise reduction system of the hearing device would only be updated at times where no speech is generated, as the VAD is not able to distinguish between speech from the user 1 and speech from the target source 2, i.e. from t0 to t1 and from t8 onwards.

(34) With use of an own voice VAD (lower part of the timing diagram), the noise reduction system of the hearing device may be configured to be updated not only when no speech is detected, but also when speech from the user 1 is detected by the own voice VAD, i.e. from t0 to t2, from t5 to t6, and from t8 onwards.
(35) Accordingly, noise signal components may be identified during time segments (time intervals) where said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user 1, or originates from the voice of the user 1 with a probability above an own voice presence probability (OVPP) threshold value, e.g. 60%, or 70%.
(36) Combining the own voice VAD and the VAD in the hearing device, the noise reduction system may be configured to both detect when the user 1 is speaking and when the target source 2 is speaking. Thereby, the noise reduction system may be updated during time segments where no speech signal is generated and where the user 1 is speaking, but may be prevented from updating at time segments where only the target sound source 2 is generating a target speech signal (speaking).
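The combined gating rule described above can be sketched as follows; this is a hypothetical illustration, with function name, probability inputs, and threshold values (including the OVPP threshold) chosen for the example rather than taken from the disclosure.

```python
# Hypothetical sketch of the update-gating rule described above: the noise
# estimate is updated when no speech is present at all, or when the detected
# speech is the user's own voice (OVPP above a threshold). Threshold names
# and values are illustrative, not taken from the disclosure.
def update_noise_estimate(spp: float, ovpp: float,
                          spp_thr: float = 0.5, ovpp_thr: float = 0.6) -> bool:
    no_speech = spp < spp_thr        # general VAD: no speech detected
    own_voice = ovpp > ovpp_thr      # own voice VAD: the user is speaking
    return no_speech or own_voice

assert update_noise_estimate(spp=0.1, ovpp=0.0)       # silence: update
assert update_noise_estimate(spp=0.9, ovpp=0.9)       # user speaks: update
assert not update_noise_estimate(spp=0.9, ovpp=0.1)   # target speaks: hold
```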
(37) In a second acoustic scene, the hearing device user 1 is shown together with a competing speaker 8 and surrounding noise signal components 3.

(38) As was the case in the first acoustic scene, the hearing device user 1 may wear a hearing device comprising the first microphone 4 and the second microphone 5 on the left ear, and the third microphone 6 and the fourth microphone 7 on the right ear.
(39) The competing speaker 8 may be located near the hearing device user 1 and may be configured to generate and emit a competing speech signal (i.e. an unwanted speech signal) into the environment of the user 1. The competing speaker 8 may as such be a person, a radio, a television, etc. configured to generate a competing speech signal. The competing speech signal may be directed towards the user 1 or may be directed away from the user 1.
(40) The noise signal components 3 are shown to surround both the hearing device user 1 and the competing speaker 8 and therefore affect the estimation of the own voice of the user 1, i.e. the wanted speech signal (e.g. in case the hearing device comprises or implements a headset), received at the hearing device microphones 4, 5, 6, 7.
(41) In a corresponding timing diagram, the speech activity of the user 1 and of the competing speaker 8 is illustrated over time instances t0 to t8.

(43) The own voice VAD (lower part of the timing diagram) may detect that the user 1 is speaking in the time segment between t1 and t2 and in the time segment between t5 and t6, whereas the VAD will detect that speech (from both the user 1 and the competing speaker 8) is being generated in the entire time segment from t1 to t8.

(45) In a classical approach (upper part of the timing diagram), the noise reduction system of the hearing device would only be updated at times where no speech is detected, i.e. from t0 to t1 and from t8 onwards, as the VAD is not able to distinguish between speech from the user 1 and speech from the competing speaker 8.

(46) With use of an own voice VAD (lower part of the timing diagram), the noise reduction system may be configured to be updated not only when no speech is detected, but also when speech from the user 1 is detected by the own voice VAD, i.e. from t0 to t2, from t5 to t6, and from t8 onwards.
(47) Accordingly, noise signal components (including from the competing speaker 8) may be identified during time segments where said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user 1, or originates from the voice of the user 1 with a probability above an own voice presence probability (OVPP) threshold value.
(48) Combining the own voice VAD and the VAD in the hearing device, the noise reduction system may be configured to both detect when the user 1 is speaking and when the competing speaker 8 is speaking alone. Thereby, the noise reduction system may be updated during time intervals where no speech signal is generated and where the user 1 is speaking, but may be prevented from updating at time intervals where the competing speaker 8 is generating a speech signal.
(49) In a third acoustic scene, the hearing device user 1 is shown together with the target sound source 2, the competing speaker 8, and surrounding noise signal components 3.

(50) As was the case in the previous acoustic scenes, the hearing device user 1 may wear a hearing device comprising the microphones 4, 5, 6 and 7.
(51) The target sound source 2 and the competing speaker 8 may be located near the hearing device user 1 and may be configured to generate and emit speech signals into the environment of the user 1. The target speech signal and/or the competing speaker speech signal may be directed towards the user 1 or may be directed away from the user 1.
(52) The noise signal components 3 are shown to surround both the hearing device user 1, the competing speaker 8, and the target sound source 2 and may therefore affect the target source signal received at the hearing device user 1.
(53) The first microphone 4, the second microphone 5, the third microphone 6 and the fourth microphone 7 may provide an electric input signal comprising the target speech signal, the competing speaker signal, and the noise signal components 3.
(54) In a corresponding timing diagram, the speech activity of the user 1, the target sound source 2, and the competing speaker 8 is illustrated over time instances t0 to t8.
(56) The own voice VAD will detect that the user 1 is speaking in the time interval between t1 and t2 and in the time interval between t5 and t6. The VAD, on the other hand, will detect that speech (from the user 1, the competing speaker 8, and the target source 2) is being generated in the entire time interval from t1 to t8.
(58) In a classical approach in which the VAD may be used to detect the presence of speech, the noise reduction system of the hearing device would only be updated at times where no speech is generated (from the user 1, the competing speaker 8, or the target source 2), as the VAD is not able to distinguish between speech from the user 1, the competing speaker 8, and the target source 2. Accordingly, only at times where the VAD does not detect speech, i.e. from t0 to t1 and from t8 onwards, will the noise reduction system be updated.

(59) With use of an own voice VAD, the noise reduction system of the hearing device may be configured to be updated not only when no speech is detected, but also when speech from the user 1 is detected by the own voice VAD, i.e. from t0 to t2, from t5 to t6, and from t8 onwards.
(60) Accordingly, noise signal components may be identified during time segments where said own voice detector indicates that the at least one electric input signal, or a signal derived therefrom, originates from the voice of the user 1, or originates from the voice of the user 1 with a probability above an own voice presence probability (OVPP) threshold value.
(61) Combining the own voice VAD and the VAD in the hearing device, the noise reduction system may be configured to both detect when the user 1 is speaking and when the target source 2 and the competing speaker 8 are speaking. Thereby, the noise reduction system may be updated during time intervals where no speech signal is generated and where the user 1 is speaking, but may be prevented from updating at time intervals where the target sound source 2 is generating a target speech signal.
(62) In an embodiment, the hearing device comprises an input unit (IU) comprising a multitude M of input transducers, and a noise reduction system (NRS).
(64) Each of the M input transducers receives (at its respective, different location) sound signals (s.sub.1, . . . , s.sub.M) from an input sound field (comprising environment sound). The input unit (IU) comprises M input sub-units (IU.sub.1, . . . , IU.sub.M). Each input sub-unit comprises an input transducer (IT.sub.1, . . . , IT.sub.M), e.g. a microphone, for converting an input sound signal to an electric input signal (s′.sub.1, . . . , s′.sub.M). Each input transducer may comprise an analogue-to-digital converter for converting an analogue input signal to a digital signal (with a certain sampling rate, e.g. 20 kHz, or more). Each input sub-unit further comprises an analysis filter bank for converting a time-domain (digital) signal to a number (K, e.g. >16, or >24, or >64) of frequency sub-band signals (S.sub.1(k,n), . . . , S.sub.M(k,n)), where k and n are frequency and time indices, respectively, and where k=1, . . . , K. The respective electric input signals (S.sub.1(k,n), . . . , S.sub.M(k,n)) in a time-frequency representation (k,n) are fed to the noise reduction system (NRS).
(65) The noise reduction system (NRS) is configured to provide an estimate Ŝ(k,n) of a target speech signal (e.g. the hearing aid user's own voice, and/or the voice of a target speaker in the environment of the user), wherein noise signal components are at least partially attenuated. The noise reduction system (NRS) comprises a number of beamformers, e.g. a beamformer (BF), such as an MVDR beamformer or an MWF beamformer, connected to the input unit (IU) and configured to receive the electric input signals (S.sub.1(k,n), . . . , S.sub.M(k,n)) in a time-frequency representation. The beamformer (BF) is configured to provide at least one beamformed (spatially filtered) signal, e.g. the estimate Ŝ(k,n) of the target speech signal.
(66) Directionality by beamforming is an efficient way to attenuate unwanted noise, as a direction-dependent gain can cancel noise from one direction while preserving the sound of interest impinging from another direction, thereby potentially improving the intelligibility of a target speech signal (i.e. providing spatial filtering). Typically, beamformers in hearing devices, e.g. hearing aids, have beampatterns which are continuously adapted in order to minimize noise components while sound impinging from a target direction is left unaltered. Typically, the acoustic properties of the noise signal change over time. Hence, the noise reduction system is implemented as an adaptive system, which adapts the directional beampattern in order to minimize the noise while the target sound (direction) is unaltered.
(67) The noise reduction system (NRS) of the hearing device may comprise a first noise reduction system (NRS1) and a second noise reduction system (NRS2), e.g. for concurrently estimating the user's own voice and a target sound from the environment.
(71) The first noise reduction system (NRS1) is configured to provide an estimate of the user's own voice Ŝ.sub.OV. The first noise reduction system (NRS1) may comprise an own voice maintaining beamformer and an own voice cancelling beamformer. The output of the own voice cancelling beamformer comprises the noise sources when the user speaks.
(72) The second noise reduction system (NRS2) is configured to provide an estimate of a target sound source (e.g. a voice S.sub.ENV of a speaker in the environment of the user). The second noise reduction system (NRS2) may comprise an environment target source maintaining beamformer and an environment target source cancelling beamformer, and/or an own voice cancelling beamformer. The output of the target cancelling beamformer comprises the noise sources when the target speaker speaks. The output of the own voice cancelling beamformer comprises the noise sources when the user speaks.
Example 1
(75) In the present application, a maximum likelihood estimator of the noise CPSD matrix that overcomes the limitation of the method presented in [1,4] (e.g. when a prominent interference is present in the acoustic environment) is disclosed. It is proposed to extend the noise CPSD matrix model. In the following, the signal model of the noisy observations in the acoustic scene is presented. Based on the signal model, the proposed ML estimator of the interference-plus-noise CPSD matrix is derived, and the proposed method is exemplified by application to own voice retrieval.
(76) The acoustic scene consists of a user equipped with hearing aids or a headset with access to M > 2 microphones. The microphones pick up the sound from the environment, and the noisy signal is sampled into a discrete sequence x_m(t) ∈ ℝ, t ∈ ℕ₀, for all m = 1, …, M microphones. The noisy signal at the m'th microphone is modelled as

x_m(t) = s_o(t) * d_{o,m}(t) + v_c(t) * d_m(t, θ_c) + v_{e,m}(t),  (1)

where * denotes convolution, s_o(t) is the own voice signal, v_c(t) is the interference signal (e.g. a competing speaker), v_{e,m}(t) is the additive noise at the m'th microphone, d_{o,m}(t) is the relative impulse response between the m'th microphone and the own voice source, and d_m(t, θ_c) is the relative impulse response between the m'th microphone and the interference arriving from direction θ_c ∈ Θ, where we without loss of generality assume that Θ is a discrete set of directions Θ = {−180°, …, 180°} with I elements.
(77) We apply the short-time Fourier transform (STFT) to x_m(t) to transform the noisy signal into the time-frequency (TF) domain with frame length T, decimation factor D, and analysis window w_A(t), such that

x_m(k, n) = Σ_{t=0}^{T−1} w_A(t) x_m(t + nD) e^{−j2πkt/T}

is the TF domain representation of the noisy signal, where j = √(−1), k is the frequency bin index, and n is the frame index. The signal model for the noisy observation in the TF domain then becomes

x_m(k, n) = s_o(k, n) d_{o,m}(k, n) + v_c(k, n) d_m(k, n, θ_c) + v_{e,m}(k, n),
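The STFT analysis described above can be sketched in a few lines of numpy; the window choice (Hann) and the parameter values are assumptions for illustration, not taken from the source.

```python
import numpy as np

# Minimal STFT sketch consistent with the definition above: frame length T,
# decimation (hop) D, analysis window w_A. The Hann window is an assumption.
def stft(x: np.ndarray, T: int, D: int) -> np.ndarray:
    w_A = np.hanning(T)                         # analysis window (assumed)
    n_frames = (len(x) - T) // D + 1
    frames = np.stack([w_A * x[n * D:n * D + T] for n in range(n_frames)])
    return np.fft.rfft(frames, axis=1)          # one-sided spectrum, T//2+1 bins

# A 1 kHz tone sampled at 16 kHz should peak in bin k = 1000*T/fs = 16.
x = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000.0)
X = stft(x, T=256, D=128)
assert np.argmax(np.abs(X[5])) == 16            # dominant bin of one frame
```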
(80) and, for convenience, we vectorize the noisy observation such that x(k, n) = [x_1(k, n), …, x_M(k, n)]^T and

x(k, n) = s_o(k, n) d_o(k, n) + v_c(k, n) d(k, n, θ_c) + v_e(k, n),

where d_o(k, n) = [d_{o,1}(k, n), …, d_{o,M}(k, n)]^T, d(k, n, θ_c) = [d_1(k, n, θ_c), …, d_M(k, n, θ_c)]^T, and v_e(k, n) = [v_{e,1}(k, n), …, v_{e,M}(k, n)]^T.
(82) We further assume that the relative transfer function (RTF) vectors (i.e. d_o(k, n) and d(k, n, θ_c)) remain identical over time, so we may define d_o(k) ≜ d_o(k, n) and d(k, θ_c) ≜ d(k, n, θ_c). In practice, it is often the case that s_o(k, n), v_c(k, n), and v_e(k, n) are uncorrelated random processes, meaning that the CPSD matrix of the noisy observations, C_x(k, n) = E{x(k, n) x^H(k, n)}, is given as

C_x(k, n) = λ_s(k, n) d_o(k) d_o^H(k) + λ_c(k, n) d(k, θ_c) d^H(k, θ_c) + λ_e(k, n) Γ_e(k, n),

(85) where λ_s(k, n), λ_c(k, n), and λ_e(k, n) are the power spectral densities (PSDs) of the own voice, interference, and noise, respectively. Γ_e(k, n) is the normalized noise CPSD matrix with 1 at the reference microphone index. We assume that Γ_e(k, n) is a known matrix; for approximately isotropic noise fields it can be modelled as

[Γ_e(k)]_{m,m′} = sinc(2π f_k r_{m,m′} / c),

where f_k is the centre frequency of bin k, r_{m,m′} is the distance between microphones m and m′, and c is the speed of sound.
(87) We assume that the own voice RTF vector d_o(k) is known, as it can be measured in advance, before deployment. The parameters that remain to be estimated are λ_c(k, n), λ_e(k, n), and θ_c; the proposed ML estimators of these parameters are presented in the following.
(88) To estimate the interference-plus-noise PSDs λ_c(k, n) and λ_e(k, n) and the interference direction θ_c, we first apply an own voice cancelling beamformer to obtain an interference-plus-noise-only signal (i.e. a signal containing e.g. a competing speaker and background noise, but not the user's own voice). The own voice cancelling beamformer is implemented using an own voice blocking matrix B_o(k). A common approach to find the own voice blocking matrix is to first form the orthogonal projection matrix of d_o(k) and then select the first M−1 column vectors of the projection matrix. More explicitly, let I_{M×M} be an M×M identity matrix and let I_{M×M−1} denote the first M−1 column vectors of I_{M×M}. The own voice blocking matrix is then given as

B_o(k) = (I_{M×M} − d_o(k) d_o^H(k) / (d_o^H(k) d_o(k))) I_{M×M−1},

(90) where B_o(k) ∈ ℂ^{M×(M−1)}. The own voice blocked signal, z(k, n), can be expressed as

z(k, n) = B_o^H(k) x(k, n),

(92) and the own voice blocked CPSD matrix is

C_z(k, n) = B_o^H(k) C_x(k, n) B_o(k) = λ_c(k, n) d̃(k, θ_c) d̃^H(k, θ_c) + λ_e(k, n) Γ̃_e(k, n),

where d̃(k, θ_c) ≜ B_o^H(k) d(k, θ_c) and Γ̃_e(k, n) ≜ B_o^H(k) Γ_e(k, n) B_o(k).
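The blocking matrix construction described above (orthogonal projection of d_o, keeping the first M−1 columns) can be sketched as follows; the RTF vector used in the example is illustrative.

```python
import numpy as np

# Sketch of the own voice blocking matrix: project onto the orthogonal
# complement of d_o and keep the first M-1 columns, so that B_o^H d_o = 0
# and the own voice is cancelled in z = B_o^H x.
def blocking_matrix(d_o: np.ndarray) -> np.ndarray:
    M = len(d_o)
    P = np.eye(M) - np.outer(d_o, d_o.conj()) / (d_o.conj() @ d_o)
    return P[:, :M - 1]                        # B_o in C^{M x (M-1)}

d_o = np.array([1.0, 0.5 + 0.5j, -0.3j])       # illustrative RTF vector
B_o = blocking_matrix(d_o)
assert np.allclose(B_o.conj().T @ d_o, 0)      # the own voice is blocked
```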
(94) Before presenting the ML estimators of λ_c(k, n), λ_e(k, n), and θ_c, we introduce the own voice-plus-interference blocking matrix B̃(θ_i).

(95) This step is necessary as the ML estimator of the noise PSD, λ_e(k, n), further requires that the interference is removed from the own voice blocked signal z(k, n). Forming the own voice-plus-interference blocking matrix follows a similar procedure as forming the own voice blocking matrix. The own voice-plus-interference blocking matrix can be found as

B̃(θ_i) = (I_{(M−1)×(M−1)} − d̃(k, θ_i) d̃^H(k, θ_i) / (d̃^H(k, θ_i) d̃(k, θ_i))) I_{(M−1)×(M−2)},

(97) where B̃(θ_i) ∈ ℂ^{(M−1)×(M−2)}. The own voice-plus-interference blocking matrix B̃(θ_i) is a function of direction, as the direction of the interference is generally unknown. The own voice-plus-interference blocked signal is then

z̃(k, n, θ_i) = B̃^H(θ_i) z(k, n),

(99) and the blocked own voice-plus-interference CPSD matrix is

C_z̃(k, n, θ_i) = B̃^H(θ_i) C_z(k, n) B̃(θ_i),

which reduces to C_z̃(k, n, θ_i) = λ_e(k, n) B̃^H(θ_i) Γ̃_e(k, n) B̃(θ_i) only if θ_i = θ_c.
(102) It is common to assume that the own voice, interference, and noise are temporally uncorrelated [6]. Under this assumption, the own voice blocked signal is distributed according to a circularly symmetric complex Gaussian distribution, i.e. z(k, n) ~ N_C(0, C_z(k, n)), meaning that the log-likelihood function for N observations of z(k, n), with Z(k, n) = [z(k, n−N+1), …, z(k, n)] ∈ ℂ^{(M−1)×N}, is given as

ln f(Z(k, n); λ_c, λ_e, θ_i) = −N [ (M−1) ln π + ln det C_z(k, n) + tr(C_z^{−1}(k, n) Ĉ_z(k, n)) ],

where tr(·) denotes the trace operator and

Ĉ_z(k, n) = (1/N) Σ_{l=0}^{N−1} z(k, n−l) z^H(k, n−l)

(104) is the sample estimate of the own voice blocked CPSD matrix. ML estimators of the interference-plus-noise PSDs λ_c(k, n) and λ_e(k, n) have been derived in [1,4]. The ML estimator of λ_e(k, n) is given as

λ̂_e(k, n, θ_i) = (1/(M−2)) tr( (B̃^H(θ_i) Γ̃_e(k, n) B̃(θ_i))^{−1} Ĉ_z̃(k, n, θ_i) ),

with

Ĉ_z̃(k, n, θ_i) = B̃^H(θ_i) Ĉ_z(k, n) B̃(θ_i)

(106) being the sample covariance of the own voice-plus-interference blocked signal, and the ML estimator of the interference PSD is then given as [7]

λ_c(k, n, θ_i) = w̃^H(θ_i) ( Ĉ_z(k, n) − λ̂_e(k, n, θ_i) Γ̃_e(k, n) ) w̃(θ_i),  (15)
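The noise PSD estimator above amounts to trace matching between the blocked sample covariance and the known blocked noise covariance shape; a hedged sketch follows, with function name and sanity-check values chosen for illustration.

```python
import numpy as np

# Sketch of the noise PSD estimator: once own voice and interference are
# blocked, the blocked sample covariance C_hat should equal lambda_e times
# the (known) blocked normalized noise CPSD matrix G, so lambda_e is
# recovered by trace matching over the remaining M-2 dimensions.
def estimate_noise_psd(C_hat: np.ndarray, G: np.ndarray) -> float:
    dim = C_hat.shape[0]                       # M - 2 for M microphones
    return float(np.real(np.trace(np.linalg.solve(G, C_hat)))) / dim

# Sanity check: blocked noise with true PSD 3.0 is recovered exactly.
G = np.array([[1.0, 0.2], [0.2, 1.0]], dtype=complex)
assert np.isclose(estimate_noise_psd(3.0 * G, G), 3.0)
```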
where {tilde over (w)}(θ.sub.i) is the MVDR beamformer constructed from the blocked own voice CPSD matrix i.e.
(107)
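As a non-limiting numerical illustration (the variable names and toy covariances are hypothetical), the trace-based noise PSD estimator and the beamformer-based interference PSD estimator of eq. (15) may be sketched as follows; with a synthetic blocked CPSD matrix built from known PSDs, both estimators recover the true values:

```python
import numpy as np

def blocking_matrix(d):
    """Orthonormal basis of the orthogonal complement of d (so B^H d = 0)."""
    U, _, _ = np.linalg.svd(d.reshape(-1, 1), full_matrices=True)
    return U[:, 1:]

def mvdr(d, Gamma):
    """MVDR beamformer w = Gamma^{-1} d / (d^H Gamma^{-1} d)."""
    w = np.linalg.solve(Gamma, d)
    return w / (d.conj() @ w)

# Synthetic blocked-domain setup: M - 1 = 3 channels, so M - 2 = 2
# degrees of freedom remain after the second blocking stage.
rng = np.random.default_rng(1)
d_t = rng.standard_normal(3) + 1j * rng.standard_normal(3)  # blocked steering vector
Gamma_t = np.eye(3, dtype=complex)                          # toy blocked noise covariance
lam_c_true, lam_e_true = 2.0, 0.5
C_z = lam_c_true * np.outer(d_t, d_t.conj()) + lam_e_true * Gamma_t

# Noise PSD: lambda_e = tr[(B~^H Gamma~ B~)^{-1} C_u] / (M - 2).
B_t = blocking_matrix(d_t)
C_u = B_t.conj().T @ C_z @ B_t
A = B_t.conj().T @ Gamma_t @ B_t
lam_e_hat = np.real(np.trace(np.linalg.solve(A, C_u))) / 2

# Interference PSD via the MVDR beamformer, cf. eq. (15).
w_t = mvdr(d_t, Gamma_t)
lam_c_hat = np.real(w_t.conj() @ ((C_z - lam_e_hat * Gamma_t) @ w_t))
print(round(lam_e_hat, 6), round(lam_c_hat, 6))  # 0.5 2.0
```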
(108) Inserting the ML estimates λ̂_e(k, n, θ_i) and λ̂_c(k, n, θ_i) into the likelihood function, we obtain the concentrated likelihood function
ln f(Z(k, n); λ̂_c(k, n, θ_i), λ̂_e(k, n, θ_i), θ_i).
(109) Under the assumptions that only a single interference is present in the acoustic environment and that the noisy observations across frequency bins are uncorrelated, a wideband concentrated log-likelihood function can be derived as
(110) L(n, θ_i) = Σ_{k=0}^{K−1} ln f(Z(k, n); λ̂_c(k, n, θ_i), λ̂_e(k, n, θ_i), θ_i),
(111) where K is the total number of frequency bins of the one-sided spectrum. To obtain the ML estimate of the interference direction, we maximize this function:
(112) θ̂_c(n) = argmax_{θ_i} L(n, θ_i).
(113) As θ_i belongs to a discrete set of directions, the ML estimate of θ_c is obtained through an exhaustive search over θ_i. Finally, to obtain an estimate of the interference-plus-noise CPSD matrix, we insert the ML estimates into the interference-plus-noise CPSD model, i.e.,
(114) Ĉ_v(k, n) = λ̂_c(k, n, θ̂_c) d(θ̂_c) d^H(θ̂_c) + λ̂_e(k, n, θ̂_c) Γ_e(k).
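The exhaustive search over the discrete direction set may be sketched as follows (a non-limiting, single-frequency toy: the uniform-linear-array steering model, the 5-degree grid, and the noise-free sample CPSD are illustrative assumptions). For each candidate direction, the PSDs are replaced by their ML estimates and the Gaussian log-likelihood is evaluated; the argmax recovers the true interference direction:

```python
import numpy as np

rng = np.random.default_rng(2)
M1 = 3                                    # blocked-domain dimension (M - 1)
Gamma = np.eye(M1, dtype=complex)         # toy blocked noise covariance
steer = lambda th: np.exp(1j * np.pi * np.arange(M1) * np.sin(th))

def blocking_matrix(d):
    U, _, _ = np.linalg.svd(d.reshape(-1, 1), full_matrices=True)
    return U[:, 1:]

def conc_loglik(th, C_hat):
    """Concentrated Gaussian log-likelihood of C_hat under the model
    C(th) = lam_c d(th) d(th)^H + lam_e Gamma, with lam_c and lam_e
    replaced by their ML estimates for this candidate direction."""
    d = steer(th)
    B = blocking_matrix(d)
    A = B.conj().T @ Gamma @ B
    lam_e = np.real(np.trace(np.linalg.solve(A, B.conj().T @ C_hat @ B))) / (M1 - 1)
    w = np.linalg.solve(Gamma, d)
    w = w / (d.conj() @ w)                # MVDR toward the candidate direction
    lam_c = max(np.real(w.conj() @ ((C_hat - lam_e * Gamma) @ w)), 1e-12)
    C_model = lam_c * np.outer(d, d.conj()) + lam_e * Gamma
    _, logdet = np.linalg.slogdet(C_model)
    return -(logdet + np.real(np.trace(np.linalg.solve(C_model, C_hat))))

# Noise-free sample CPSD with the interference at 30 degrees.
theta_true = np.deg2rad(30.0)
d_true = steer(theta_true)
C_hat = 4.0 * np.outer(d_true, d_true.conj()) + 0.2 * Gamma

thetas = np.deg2rad(np.arange(-90, 91, 5))
theta_hat = thetas[np.argmax([conc_loglik(t, C_hat) for t in thetas])]
print(int(round(np.rad2deg(theta_hat))))  # 30
```

A full implementation would sum the per-bin log-likelihoods over all K frequency bins before taking the argmax, as in eq. (110).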
(115) For own voice retrieval, we implement the MWF beamformer. It is well-known that the MWF can be decomposed into an MVDR beamformer and a single channel post Wiener filter [10]. The MVDR beamformer is given as
(116)
(117) and the single-channel post Wiener filter is
(118)
(119) The MWF beamformer coefficients are then found as
w.sub.MWF(k,n)=w.sub.MVDR(k,n).Math.g(k,n). (23)
(120) Finally, the own voice signal can be estimated as a linear combination of the noisy observations using the beamformer weights i.e.
y(k,n)=w.sub.MWF.sup.H(k,n)×(k,n). (24)
(121) The enhanced TF domain signal, y(k, n) is then transformed back into the time domain using the inversion STFT, such that y(t) is the retrieved own voice time domain signal.
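The MVDR-plus-post-filter decomposition above may be sketched as follows (a non-limiting toy: the identity interference-plus-noise CPSD matrix, the random steering vector, and the own voice PSD value lam_o are illustrative assumptions). Because the MVDR part is distortionless toward the own voice steering vector, the overall MWF response toward d_o reduces to the Wiener gain g:

```python
import numpy as np

def mwf_weights(d_o, C_v, lam_o):
    """MWF weights as an MVDR beamformer scaled by a single-channel
    Wiener post-filter; lam_o is the own voice PSD estimate."""
    Cinv_d = np.linalg.solve(C_v, d_o)
    denom = np.real(d_o.conj() @ Cinv_d)       # d_o^H C_v^{-1} d_o
    w_mvdr = Cinv_d / denom                    # distortionless toward d_o
    lam_res = 1.0 / denom                      # residual noise PSD after MVDR
    g = lam_o / (lam_o + lam_res)              # Wiener post-filter gain
    return w_mvdr * g, g

rng = np.random.default_rng(3)
M = 4
d_o = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # own voice steering vector
C_v = np.eye(M, dtype=complex)     # toy interference-plus-noise CPSD matrix
w_mwf, g = mwf_weights(d_o, C_v, lam_o=1.0)

# MWF response toward the own voice direction equals the Wiener gain.
print(np.isclose(w_mwf.conj() @ d_o, g))  # True
```

In a hearing device these weights would be recomputed per frequency bin k and frame n, applied to x(k, n), and the result synthesized back to the time domain by the inverse STFT.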
(122) It is intended that the structural features of the devices described above, either in the detailed description and/or in the claims, may be combined with steps of the method, when appropriately substituted by a corresponding process.
(123) As used, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well (i.e. to have the meaning “at least one”), unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, but an intervening element may also be present, unless expressly stated otherwise. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. The steps of any disclosed method are not limited to the exact order stated herein, unless expressly stated otherwise.
(124) It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” or features included as “may” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the disclosure. The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects.
(125) The claims are not intended to be limited to the aspects shown herein but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more.
(126) Accordingly, the scope should be judged in terms of the claims that follow.
REFERENCES
(127)
[1] U. Kjems and J. Jensen, “Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement,” in 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), August 2012, pp. 295-299.
[2] Yujie Gu and A. Leshem, “Robust Adaptive Beamforming Based on Interference Covariance Matrix Reconstruction and Steering Vector Estimation,” IEEE Transactions on Signal Processing, vol. 60, no. 7, pp. 3881-3885, July 2012.
[3] Richard C. Hendriks and Timo Gerkmann, “Estimation of the noise correlation matrix,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011, pp. 4740-4743, IEEE.
[4] Jesper Jensen and Michael Syskind Pedersen, “Analysis of beamformer directed single-channel noise reduction system for hearing aid applications,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, April 2015, pp. 5728-5732, IEEE.
[5] Mehrez Souden, Jingdong Chen, Jacob Benesty, and Sofiène Affes, “An Integrated Solution for Online Multichannel Noise Tracking and Reduction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2159-2169, September 2011.
[6] K. L. Bell, Y. Ephraim, and H. L. Van Trees, “A Bayesian approach to robust adaptive beamforming,” IEEE Transactions on Signal Processing, vol. 48, no. 2, pp. 386-398, February 2000.
[7] Adam Kuklasinski, Simon Doclo, Timo Gerkmann, Søren Holdt Jensen, and Jesper Jensen, “Multi-channel PSD estimators for speech dereverberation: A theoretical and experimental comparison,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, April 2015, pp. 91-95, IEEE.
[8] Mehdi Zohourian, Gerald Enzner, and Rainer Martin, “Binaural Speaker Localization Integrated Into an Adaptive Beamformer for Hearing Aids,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 3, pp. 515-528, March 2018.
[9] Hao Ye and D. DeGroat, “Maximum likelihood DOA estimation and asymptotic Cramér-Rao bounds for additive unknown colored noise,” IEEE Transactions on Signal Processing, vol. 43, no. 4, pp. 938-949, April 1995.
[10] Michael Brandstein and Darren Ward, Microphone Arrays: Signal Processing Techniques and Applications, 2001.
[11] EP2701145A1 (Retune, Oticon), 26 Feb. 2014.