Assistive listening device and human-computer interface using short-time target cancellation for improved speech intelligibility
10796692 · 2020-10-06
Inventors
CPC classification
H04S2400/15
ELECTRICITY
H04R2499/11
ELECTRICITY
H04S7/30
ELECTRICITY
G10K11/17885
PHYSICS
H04R5/027
ELECTRICITY
G10L15/1815
PHYSICS
G10K11/17873
PHYSICS
G10L15/22
PHYSICS
H04R1/04
ELECTRICITY
H04R2225/00
ELECTRICITY
G10L15/20
PHYSICS
International classification
G10L15/20
PHYSICS
H04R1/04
ELECTRICITY
H04R5/027
ELECTRICITY
H04S7/00
ELECTRICITY
G10K11/178
PHYSICS
Abstract
An assistive listening device includes a set of microphones including an array arranged into pairs about a nominal listening axis with respective distinct intra-pair microphone spacings, and a pair of ear-worn loudspeakers. Audio circuitry performs arrayed-microphone short-time target cancellation processing including (1) applying short-time frequency transforms to convert time-domain audio input signals into frequency-domain signals for every short-time analysis frame, (2) calculating ratio masks from the frequency-domain signals of respective microphone pairs, wherein the calculation of a ratio mask includes a frequency domain subtraction of signal values of a microphone pair, (3) calculating a global ratio mask from the plurality of ratio masks, and (4) applying the global ratio mask, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, thereby generating audio output signals for driving the loudspeakers. The circuitry and processing may also be realized in a machine hearing device executing a human-computer interface application.
Claims
1. An assistive listening device for use in the presence of stationary interfering sound sources and/or non-stationary interfering sound sources, comprising an array of microphones arranged into a set of microphone pairs positioned about an axis with respective distinct intra-pair microphone spacings, each microphone of the array of microphones generating a respective audio input signal; a pair of ear-worn loudspeakers; and audio circuitry configured to compute a set of time-varying filters, for real-time speech intelligibility enhancement, using causal and memoryless frame-by-frame processing, comprising (1) applying a short-time frequency transform to each of the respective audio input signals, thereby converting the respective time domain signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating a pairwise noise estimate by first subtracting the respective frequency-domain signals from a microphone pair and thereafter taking the magnitude of the difference, (3) calculating a pairwise mixture estimate by first taking the magnitudes of the respective frequency domain signals from a microphone pair, and thereafter adding the respective magnitudes and (4) calculating a pairwise ratio mask from the pairwise noise estimate and the pairwise mixture estimate for each of the respective microphone pairs, wherein the calculation of the pairwise ratio mask includes the aforementioned frequency-domain subtraction of signals, (5) calculating a global ratio mask, which is an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the set of pairwise ratio masks, with the frequency channels from each pairwise ratio mask chosen according to the frequency range(s) for which the distinct intra-pair microphone spacing provides a positive absolute phase difference; wherein when using only one pair of microphones, the singular pairwise ratio mask and the global ratio mask are 
equivalent, and (6) applying the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, or to the frequency-domain output of a beamformer, thereby suppressing both the stationary and the non-stationary interfering sound sources in real-time and generating an audio output signal for driving the loudspeakers.
2. The assistive listening device of claim 1, wherein the array of microphones includes a set of one or more pairs of microphones with predetermined intra-pair microphone spacings.
3. The assistive listening device of claim 1, wherein the array of microphones are arranged on a head-worn frame worn by a user.
4. The assistive listening device of claim 3, wherein the head-worn frame is an eyeglass frame.
5. The assistive listening device of claim 4, wherein the microphones are arranged across a front of the eyeglass frame.
6. The assistive listening device of claim 4, wherein the array of microphones includes microphones disposed on temple pieces of the eyeglass frames.
7. The assistive listening device of claim 1, wherein the array of microphones includes in-ear or near-ear microphones whose corresponding frequency-domain signals are the selected frequency-domain signals to which the global ratio mask and inverse short-time frequency transforms are applied.
8. The assistive listening device of claim 1, wherein the short-time target cancellation processing comprises calculating pairwise binary masks from the frequency-domain signals of respective microphone pairs, and wherein a global binary mask is calculated from the pair-wise binary masks.
9. The assistive listening device of claim 8, wherein the pairwise binary masks are calculated by applying a threshold value to respective ratio masks.
10. The assistive listening device of claim 9, wherein for frequency channels below a predetermined frequency, a ramped threshold value is used for a most widely spaced microphone pair to account for reduced cancellation at frequencies below the predetermined frequency.
11. The assistive listening device of claim 10, wherein the global ratio mask and global binary mask are used to calculate a thresholded ratio mask, which has the global binary mask's zero values below a specified threshold, and the global ratio mask's continuous values above said threshold. The thresholded ratio mask, and inverse short-time frequency transforms, are applied to selected ones of the frequency-domain signals, thereby suppressing both the stationary and the non-stationary interfering sound sources and generating an audio output signal for driving the loudspeakers.
12. The assistive listening device of claim 11, wherein the array of microphones includes in-ear or near-ear microphones whose corresponding frequency-domain signals are the selected frequency-domain signals to which the thresholded ratio mask and inverse short-time frequency transforms are applied.
13. A machine hearing device for generating speech signals to be used in identifying semantic content in the presence of stationary interfering sound sources and/or non-stationary interfering sound sources, and thereby allowing the performance of automated actions by related systems in response to the identified semantic content, the hearing device comprising: a set of microphones generating respective audio input signals arranged in an array having a set of microphone pairs arranged about an axis with pre-determined intra-pair microphone spacings; and audio circuitry configured to compute a set of time-varying filters, for real-time speech intelligibility enhancement, using causal and memoryless frame-by-frame processing, comprising (1) applying a short-time frequency transform to each of the respective audio input signals, thereby converting the respective time domain signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating a pairwise noise estimate by first subtracting the respective frequency-domain signals from a microphone pair and thereafter taking the magnitude of the difference, (3) calculating a pairwise mixture estimate by first taking the magnitudes of the respective frequency domain signals from a microphone pair, and thereafter adding the respective magnitudes and (4) calculating a pairwise ratio mask from the pairwise noise estimate and the pairwise mixture estimate for each of the respective microphone pairs, wherein the calculation of the pairwise ratio mask includes the aforementioned frequency-domain subtraction of signals, (5) calculating a global ratio mask, which is an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the set of pairwise ratio masks, with the frequency channels from each pairwise ratio mask chosen according to the frequency range(s) for which the distinct intra-pair microphone spacing provides a positive absolute
phase difference; wherein when using only one pair of microphones, the singular pairwise ratio mask and the global ratio mask are equivalent, and (6) applying the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms, to selected ones of the frequency-domain signals, or to the frequency-domain output of a beamformer, thereby suppressing both the stationary and the non-stationary interfering sound sources in real-time and allowing for identification of the target speech signal.
14. The machine hearing device of claim 13, wherein the array of microphones includes a set of one or more pairs of microphones with predetermined intra-pair microphone spacings.
15. The machine hearing device of claim 13, wherein the short-time target cancellation processing comprises calculating pairwise binary masks from the frequency-domain signals of respective microphone pairs, and wherein a global binary mask is calculated from the pair-wise binary masks.
16. The machine hearing device of claim 15, wherein the pairwise binary masks are calculated by applying a threshold value to respective ratio masks.
17. The machine hearing device of claim 16, wherein for frequency channels below a predetermined frequency, a ramped threshold value is used for a most widely spaced microphone pair to account for reduced cancellation at frequencies below the predetermined frequency.
18. The machine hearing device of claim 17, wherein the global ratio mask and global binary mask are used to calculate a thresholded ratio mask, which has the global binary mask's zero values below a specified threshold, and the global ratio mask's continuous values above said threshold. The thresholded ratio mask, and inverse short-time frequency transforms, are applied to selected ones of the frequency-domain signals, thereby suppressing both the stationary and the non-stationary interfering sound sources and generating an audio output signal for driving the loudspeakers.
19. An assistive listening device for use in the presence of stationary interfering sound sources and/or non-stationary interfering sound sources, comprising one or more pairs of in-ear or near-ear microphones, each microphone generating a respective audio input signal; a pair of ear-worn loudspeakers; and audio circuitry configured to compute a time-varying filter, for real-time speech intelligibility enhancement, using causal and memoryless frame-by-frame processing, comprising (1) applying a short-time frequency transform to each of the respective audio input signals, thereby converting the respective time domain signals into respective frequency-domain signals for every short-time analysis frame, (2) calculating a pairwise noise estimate by first subtracting the respective frequency-domain signals from a microphone pair and thereafter taking the magnitude of the difference, (3) calculating a pairwise mixture estimate by first taking the magnitudes of the respective frequency-domain signals from a microphone pair, and thereafter adding the respective magnitudes and (4) calculating a pairwise ratio mask from the pairwise noise estimate and the pairwise mixture estimates for a microphone pair, wherein the calculation of a pairwise ratio mask includes the aforementioned frequency-domain subtraction of signals, (5) calculating a global ratio mask, which is an effective time-varying filter with a vector of frequency channel weights for every short-time analysis frame, from the set of pairwise ratio masks, with the frequency channels from each pairwise ratio mask chosen according to the frequency range(s) for which the distinct intra-pair microphone spacing provides a positive absolute phase difference; wherein when using only one pair of microphones, the singular pairwise ratio mask and the global ratio mask are equivalent, and (6) applying the global ratio mask, or a post-processed variant thereof, and inverse short-time frequency transforms, to the frequency-domain 
signals from the in-ear or near-ear microphones, or to the frequency-domain output of a beamformer, thereby suppressing both the stationary and the non-stationary interfering sound sources in real-time and generating an audio output signal for driving the loudspeakers.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing and other objects, features and advantages will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views.
DESCRIPTION OF EMBODIMENTS
(17) The general arrangement of the device is illustrated in the accompanying drawings.
(19) Briefly, the selection/combination [36] may or may not include frequency domain signals X that are also used in the pair-wise mask calculations [26]. In an ALD implementation as described more below, it may be beneficial to apply the mask-controlled scaling [34] to signals from near-ear microphones that are separate from the microphones whose outputs are used in the pair-wise mask calculations [26]. Use of such separate near-ear microphones can help maintain important binaural cues for a user. In a computer-based implementation also described below, the mask-controlled scaling [34] may be applied to a sum of the outputs of the same microphones whose signals are used to calculate the masks.
(20) I. System Description of 6-Microphone Short-Time Target Cancellation (STTC) Assistive Listening Device (ALD).
(23) Generally, the inputs from the six forward-facing microphones [42] are used to compute a Time-Frequency (T-F) mask (i.e. time-varying filter), which is used to attenuate non-target sound sources in the Left and Right near-ear microphones [44-L], [44-R]. The device boosts speech intelligibility for a target talker [13-T] from a designated look direction while preserving binaural cues that are important for spatial hearing.
(24) The approach described herein avoids Interaural Level Difference (ILD) compensation by integrating the microphone pairs [42] into the frame [40] of a pair of eyeglasses and giving them a forward-facing half-omni directionality pattern; with this microphone placement, there is effectively no ILD and thus no ILD processing is required. One downside to this arrangement, if one were to use only these forward-facing microphones, is the potential loss of access to both head shadow ILD cues and the spectral cues provided by the pinnae (external part of the ears). However, such cues can be provided to the user by including near-ear microphones [44]. The forward-facing microphone pairs [42] are used to calculate a vector of frequency channel weights for each short-time analysis frame (i.e., a time-frequency mask); this vector of frequency channel weights is then used to filter the output of the near-ear microphones [44]. Notably, the frequency channel weights for each time slice may be applied independently to both the left and right near-ear microphones [44-L], [44-R], thereby preserving Interaural Time Difference (ITD) cues, spectral cues, and the aforementioned ILD cues. Hence, the assistive listening device described herein can enhance speech intelligibility for a target talker, while still preserving the user's natural binaural cues, which are important for spatial hearing and spatial awareness.
(25) It is noted that the ALD as described herein may be used in connection with separate Visually Guided Hearing Aid (VGHA) technology, in which a VGHA eyetracker can be used to specify a steerable look direction. Steering may be accomplished using shifts, implemented in either the time domain or frequency domain, of the Left and Right signals. The STTC processing [20-1] boosts intelligibility for a target talker [13-T] in the designated look direction and suppresses the intelligibility of non-target talkers (or distractors) [13-NT], all while preserving binaural cues for spatial hearing.
(26) STTC processing consists of a computationally efficient implementation of the target cancellation approach to sound source segregation, which involves removing target talker sound energy and computing gain functions for T-F tiles according to the degree to which each T-F tile is dominated by energy from the target or interfering sound sources. The STTC processing uses subtraction in the frequency domain to implement target cancellation, using only the Short-Time Fourier Transforms (STFTs) of signals from microphones.
(27) The STTC processing computes an estimate of the Ideal Ratio Mask (IRM), which has a transfer function equivalent to that of a time-varying Wiener filter; the IRM uses the ratio of signal (i.e., target speech) energy to mixture energy within each T-F unit:
(28) IRM(t, f) = S²(t, f) / (S²(t, f) + N²(t, f))  (1)
where S²(t, f) and N²(t, f) are the signal (i.e., target speech) energy and noise energy, respectively. The mixture energy is the sum of the signal energy and the noise energy.
(29) The time-domain mixture x_i[m] of sound at the i-th microphone is composed of both signal (s_i) and noise (ν_i) components:
x_i[m] = s_i[m] + ν_i[m]  (2)
Effecting sound source segregation amounts to an unmixing process that removes the noise (ν) from the mixture (x) and computes an estimate (ŝ) of the signal (s). Whereas the IRM is computed using oracle knowledge of both the ground-truth signal (s_i) and noise (ν_i) components, the STTC processing has access to only the mixture (x_i) at each microphone. For every pair of microphones, the STTC processing computes both a Ratio Mask (RM) and a Binary Mask (BM) using only the STFTs of the sound mixtures at each microphone. The STFT X_i[n, k] of the sound mixture x_i[m] at the i-th microphone is as follows:
(30) X_i[n, k] = Σ_m x_i[m] w[nH - m] e^(-j2πkm/F)  (3)
where w[n] is a finite-duration Hamming window; n and k are discrete indices for time and frequency, respectively; H is a temporal sampling factor (i.e., the Hop size between FFTs) and F is a frequency sampling factor (i.e., the FFT length).
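The analysis stage above can be sketched in NumPy. The function below is an illustrative, self-contained frame-by-frame STFT (Hamming window, hop size H, FFT length F) matching the parameters described in the text; the function name and defaults are assumptions, not part of the claimed device:

```python
import numpy as np

def stft(x, F=1024, H=512):
    """Frame-by-frame Short-Time Fourier Transform.

    F is the FFT length and H the hop size between frames; a
    finite-duration Hamming window is applied to each frame.
    """
    w = np.hamming(F)
    n_frames = 1 + (len(x) - F) // H
    X = np.empty((n_frames, F), dtype=complex)
    for n in range(n_frames):
        X[n] = np.fft.fft(x[n * H : n * H + F] * w, F)
    return X
```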
(31) The logic underlying the STTC processing involves computing an estimate of the noise (ν̂), so as to subtract it from the mixture (x) and compute an estimate (ŝ) of the signal (s). This filtering (i.e., subtraction of the noise) is effected through a T-F mask, which is computed via target cancellation in the frequency domain using only the STFTs. The STTC processing consists of Short-Time Fourier Transform Magnitude (STFTM) computations, computed in parallel, that yield Mixture (M̂) and Noise (N̂) estimates that can be used to approximate the IRM, and thereby compute a time-varying filter. The Mixture (M̂), Noise (N̂) and Signal (Ŝ) estimates for each T-F tile are computed as follows using the frequency-domain signals (X_i) from a pair (i = [1, 2]) of microphones:
M̂[n, k] = 0.5 (|X_1[n, k]| + |X_2[n, k]|)  (4)
N̂[n, k] = 0.5 |X_1[n, k] - X_2[n, k]|  (5)
Ŝ[n, k] = M̂[n, k] - N̂[n, k]  (6)
(32) The processing described here assumes a target talker straight ahead at 0°. With the target-talker waveforms at the two microphones in phase (i.e., time-aligned) with each other, the cancellation process can be effected via subtraction in either the time domain (e.g., x_1[m] - x_2[m]) or the frequency domain, as in the Noise (N̂) estimate shown above.
(33) The Noise estimate (N̂) is computed by subtracting the STFTs before taking their magnitude, thereby allowing phase interactions that cancel the target spectra. The Mixture estimate (M̂) takes the respective STFT magnitudes before addition, thereby preventing phase interactions that would otherwise cancel the target spectra. A Signal estimate (Ŝ) can be computed by subtracting the Noise estimate (N̂) from the Mixture estimate (M̂). The processing described in this section assumes a target talker straight ahead at 0°. However, the look direction can be steered via sample shifts implemented in the time domain prior to T-F analysis. Alternatively, these look-direction shifts could be implemented in the frequency domain.
(34) Assuming a perfect cancellation of only the target (i.e., Signal) spectra, the N̂ term contains the spectra of all non-target sound sources (i.e., Noise) in each T-F tile. The STTC processing uses the Mixture (M̂) and Noise (N̂) STFTM computations to estimate the ratio of Signal (Ŝ) (i.e., target) energy to mixture energy in every T-F tile:
(35) IRM[n, k] ≈ Ŝ[n, k] / M̂[n, k] = (M̂[n, k] - N̂[n, k]) / M̂[n, k]  (7)
(36) The Mixture (M̂) and Noise (N̂) terms are short-time spectral magnitudes used to estimate the IRM for multiple frequency channels [k] in each analysis frame [n]. The resulting Ratio Mask RM[n, k] is a vector of frequency channel weights for each analysis frame. RM[n, k] can be computed directly using the STFTs of the signals from the microphone pair:
(37) RM[n, k] = ((|X_1[n, k]| + |X_2[n, k]|) - |X_1[n, k] - X_2[n, k]|) / (|X_1[n, k]| + |X_2[n, k]|)  (8)
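As a sketch, the Mixture and Noise estimates of Eqs. (4) and (5), and the resulting pairwise ratio mask, can be computed from the two STFTs of a target-aligned microphone pair as follows; the function name and the small eps guard against division by zero are illustrative assumptions:

```python
import numpy as np

def pairwise_ratio_mask(X1, X2, eps=1e-12):
    """Pairwise ratio mask for one target-aligned microphone pair.

    Subtracting the STFTs before taking magnitudes lets the in-phase
    target spectra cancel (Noise estimate); taking magnitudes before
    adding prevents that cancellation (Mixture estimate).
    """
    M = 0.5 * (np.abs(X1) + np.abs(X2))  # Mixture estimate
    N = 0.5 * np.abs(X1 - X2)            # Noise estimate
    return np.clip((M - N) / (M + eps), 0.0, 1.0)
```

With identical inputs (a perfectly cancelled, target-only pair) the mask is all ones; with anti-phase inputs the Noise estimate equals the Mixture estimate and the mask is all zeros.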
(38) A Binary Mask BM[n, k] may also be computed using a thresholding function with a threshold value τ, which may be set to a fixed value of τ = 0.2, for example:
(39) BM[n, k] = 1, if RM[n, k] ≥ τ; BM[n, k] = 0, otherwise  (9)
(41) In the illustrated example, three microphone pairs having respective distinct spacings (e.g., 140, 80 and 40 mm) are used, and their outputs are combined via piecewise construction, as illustrated in the bottom panel of the corresponding figure.
(43) 1. Short-Time Fourier Transform (STFT) processing [50] converts each microphone signal into a frequency-domain signal
(44) 2. Ratio Mask (RM) and Binary Mask (BM) processing [52] is applied to the frequency-domain signals of microphone pairs
(45) 3. Global Ratio Mask (RM_G) and Thresholded Ratio Mask (RM_T) processing [54] uses the ratio masks of all microphone pairs
(46) 4. Output signal processing [56] uses the Thresholded Ratio Mask (RM_T) to scale/modify selected microphone signals to serve as output signal(s) [16]
(47) The above stages of processing are described in further detail below.
(48) 1. STFT Processing [50]
(49) Short-Time Fourier Transforms (STFTs) are continually calculated from frames of each input signal x_i[m] according to the following calculation:
(50) X_i[n, k] = Σ_m x_i[m] w[nH - m] e^(-j2πkm/F)  (10)
where i is the index of the microphone, w[n] is a finite-duration Hamming window; n and k are discrete indices for time and frequency, respectively; H is a temporal sampling factor (i.e., the Hop size between FFTs) and F is a frequency sampling factor (i.e., the FFT length).
(51) 2. STTC Processing [52]
(52) Pairwise ratio masks RM, one for each microphone spacing (140, 80 and 40 mm), are calculated as follows; i.e., there is a unique RM for each pair of microphones ([1,2], [3,4], [5,6]):
(53) RM_i,j[n, k] = ((|X_i[n, k]| + |X_j[n, k]|) - |X_i[n, k] - X_j[n, k]|) / (|X_i[n, k]| + |X_j[n, k]|)  (11)
(54) Pairwise Binary Masks BM are calculated as follows, using a thresholding function with threshold value τ, which in one example is a constant set to a relatively low value (τ = 0.2 on a scale of 0 to 1):
(55) BM_i,j[n, k] = 1, if RM_i,j[n, k] ≥ τ; BM_i,j[n, k] = 0, otherwise  (12)
(56) In the low frequency channels, a ramped binary mask threshold may be used for the most widely spaced microphone pair (BM_1,2) to address the issue of poor cancellation at these low frequencies. Thus, at the lowest frequencies, where cancellation is least effective, a higher threshold is used. An example of such a ramped threshold is described below.
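One possible form of such a ramped threshold, sketched below, interpolates linearly from a higher low-frequency threshold down to the nominal τ = 0.2 at a cutoff frequency. The specific ramp endpoint (0.8) and cutoff (1500 Hz) are illustrative assumptions, as the text's own example falls outside this excerpt:

```python
import numpy as np

def ramped_threshold(F=1024, fs=50000, tau=0.2, tau_low=0.8, f_cut=1500.0):
    """Per-channel binary-mask threshold for the widest microphone pair.

    Below the hypothetical cutoff f_cut the threshold ramps linearly
    from tau_low down to the nominal tau, compensating for the weaker
    target cancellation at low frequencies. Returns one threshold per
    positive-frequency channel.
    """
    freqs = np.arange(F // 2 + 1) * fs / F
    thr = np.full(freqs.shape, float(tau))
    low = freqs < f_cut
    thr[low] = tau_low - (tau_low - tau) * freqs[low] / f_cut
    return thr
```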
(57) 3. Global Ratio Mask (RM.sub.G) and Thresholded Ratio Mask (RM.sub.T) Processing [54]
(58) As mentioned above, a piecewise approach to creating a chimeric Global Ratio Mask RM_G from the individual Ratio Masks for the three microphone pairs ([1,2], [3,4], [5,6]) is used. In one example, the RM_G is constructed in a piecewise manner as follows (see the bottom panel of the corresponding figure):
RM_G[n, 1:32] = RM_1,2[n, 1:32]  (0 → 1500 Hz)
RM_G[n, 33:61] = RM_3,4[n, 33:61]  (1500 → 3000 Hz)
RM_G[n, 62:F/2] = RM_5,6[n, 62:F/2]  (3000 → F_s/2 Hz)
The illustration of piecewise selection of discrete frequency channels (k) shown above is for a sampling frequency (F_s) of 50 kHz and an FFT size (F) of 1024 samples; the discrete frequency channels used will vary according to the specified values of F_s and F. The piecewise-constructed Global Ratio Mask RM_G is also given conjugate symmetry (i.e., negative frequencies are the mirror image of positive frequencies) to ensure that the STTC processing yields a real (rather than complex) output. Additional detail is given below.
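Assuming the example channel boundaries above (F_s = 50 kHz, F = 1024), the piecewise construction with conjugate symmetry might be sketched as follows; NumPy's 0-based indexing shifts the text's 1-based channel ranges by one, and the function name is illustrative:

```python
import numpy as np

def global_ratio_mask(RM12, RM34, RM56, F=1024):
    """Piecewise ("chimeric") Global Ratio Mask.

    Each RMij has shape (n_frames, F). The channel boundaries below
    are the example values from the text (Fs = 50 kHz, F = 1024).
    """
    RMG = np.empty_like(RM12)
    RMG[:, 0:32] = RM12[:, 0:32]                    # 0 -> 1500 Hz (widest pair)
    RMG[:, 32:61] = RM34[:, 32:61]                  # 1500 -> 3000 Hz
    RMG[:, 61:F // 2 + 1] = RM56[:, 61:F // 2 + 1]  # 3000 -> Fs/2
    # Conjugate symmetry: negative-frequency channels mirror the
    # positive ones so that the masked ISTFT output is real-valued.
    RMG[:, F // 2 + 1:] = RMG[:, 1:F // 2][:, ::-1]
    return RMG
```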
(59) A singular Global Binary Mask BM_G is computed from the three Binary Masks (BM_1,2, BM_3,4, BM_5,6), where × specifies element-wise multiplication:
BM_G[n, k] = BM_1,2[n, k] × BM_3,4[n, k] × BM_5,6[n, k]  (13)
(60) Multiplication of the Global Ratio Mask RM_G with the Global Binary Mask BM_G yields a Thresholded Ratio Mask RM_T[n, k] that is used for reconstruction of the target signal in the output signal processing [56], as described below. Note that RM_T[n, k] has weights of 0 below the threshold τ and continuous soft weights at and above τ.
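A minimal sketch of the pairwise thresholding and the RM_T computation, with τ = 0.2 as in the text's example; the function names are illustrative:

```python
import numpy as np

def binary_mask(RM, tau=0.2):
    """Pairwise binary mask: 1 where the ratio mask meets the
    threshold tau, else 0."""
    return (RM >= tau).astype(float)

def thresholded_ratio_mask(RMG, pairwise_BMs):
    """RM_T = RM_G x BM_G, where the Global Binary Mask BM_G is the
    element-wise product of the pairwise binary masks (Eq. (13)).
    The result is 0 wherever any pair falls below threshold, and
    keeps RM_G's continuous soft weights elsewhere."""
    BMG = np.prod(pairwise_BMs, axis=0)
    return RMG * BMG
```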
(61) The Global Ratio Mask (RM_G), the Global Binary Mask (BM_G) and the Thresholded Ratio Mask (RM_T) are all effective time-varying filters, with a vector of frequency channel weights for every analysis frame. Any one of the three (i.e., RM_G, BM_G or RM_T) can provide an intelligibility benefit for a target talker and suppress both stationary and non-stationary interfering sound sources. RM_T is seen as the most desirable, effective and useful of the three; hence it is used for producing the output in the block diagram.
(62) 4. Output Signal Processing [56]
(63) The output signal(s) may be either stereo or monaural (mono), and these are created in correspondingly different ways as explained below.
(64) Reconstruction of Target Signal with STEREO Output
(65) Stereo output may be used, for example, in applications such as an ALD where it is important to preserve binaural cues such as ILD and ITD. The output of the STTC processing is an estimate of the target speech signal from the specified look direction. The Left and Right (i.e., stereo pair) Time-Frequency domain estimates (Y_L[n, k] and Y_R[n, k]) of the target speech signal (y_L[m] and y_R[m]) can be described as follows, where X_L and X_R are the Short-Time Fourier Transforms (STFTs) of the signals x_L and x_R from the designated Left and Right in-ear or near-ear microphones [44]:
Y_L[n, k] = RM_T[n, k] × X_L[n, k]
Y_R[n, k] = RM_T[n, k] × X_R[n, k]  (14)
Alternatively, the Global Ratio Mask (RM_G) could be used to produce the stereo output:
Y_L[n, k] = RM_G[n, k] × X_L[n, k]
Y_R[n, k] = RM_G[n, k] × X_R[n, k]  (15)
Synthesis of a stereo output (y_L[m] and y_R[m]) estimate of the target speech signal consists of taking the Inverse Short-Time Fourier Transforms (ISTFTs) of Y_L[n, k] and Y_R[n, k] and using the overlap-add method of reconstruction.
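The overlap-add synthesis can be sketched as below; the window-squared normalization used here is one common reconstruction choice and is an assumption, since the text specifies only "the overlap-add method":

```python
import numpy as np

def istft_overlap_add(Y, H=512):
    """Inverse STFT with overlap-add reconstruction.

    Y has shape (n_frames, F) and holds the masked STFT frames. Each
    frame is inverse-transformed, re-windowed, and accumulated; the
    running window-squared sum is divided out so that overlapping
    Hamming-windowed frames reconstruct the time-domain signal.
    """
    n_frames, F = Y.shape
    w = np.hamming(F)
    y = np.zeros((n_frames - 1) * H + F)
    wsum = np.zeros_like(y)
    for n in range(n_frames):
        y[n * H : n * H + F] += np.real(np.fft.ifft(Y[n])) * w
        wsum[n * H : n * H + F] += w ** 2
    return y / np.maximum(wsum, 1e-12)
```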
(66) While the Global Binary Mask BM_G could also be used to produce the stereo output, the continuously valued frequency channel weights of the RM_G and RM_T are more desirable, yielding superior speech intelligibility and speech quality relative to the BM_G; hence RM_T is used for producing the output in the block diagram.
(67) Reconstruction of Target Signal with MONO Output
(68) A mono output (denoted below with the subscript M) may be used in other applications in which preservation of binaural cues is unnecessary or less important. In one example, a mono output can be computed via an average of the STFTs across multiple microphones, where I is the total number of microphones:
(69) X_M[n, k] = (1/I) Σ_{i=1}^{I} X_i[n, k]  (16)
Y_M[n, k] = RM_T[n, k] × X_M[n, k]  (17)
Alternatively, the Global Ratio Mask (RM_G) could be used to produce the mono output:
Y_M[n, k] = RM_G[n, k] × X_M[n, k]  (18)
The mono output y_M[m] is produced by taking the Inverse Short-Time Fourier Transform (ISTFT) of Y_M[n, k] and using the overlap-add method of reconstruction.
(70) Steering the Nonlinear Beamformer's Look Direction
(71) The default target sound source look direction is straight ahead at 0°. However, if deemed necessary or useful, an eyetracker could be used to specify the look direction, which could be steered via time shifts, implemented in either the time or frequency domain, of the Left and Right signals. The STTC processing could boost intelligibility for the target talker from the designated look direction and suppress the intelligibility of the distractors, all while preserving binaural cues for spatial hearing.
(72) The sample shifts are computed independently for each pair of microphones, where F_s is the sampling rate, d is the inter-microphone spacing in meters, c is the speed of sound in meters per second and θ is the specified angular look direction in radians:
(73) shift = round(F_s · (d / c) · sin(θ))  (19)
(74) These time shifts are used both for the computation of the Ratio Masks (RMs) and for steering the beamformer used for the mono version of the STTC processing.
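A sketch of the per-pair sample-shift computation, assuming the shift is the inter-microphone delay d·sin(θ)/c expressed in samples; the rounding to the nearest whole sample and the function name are assumptions:

```python
import numpy as np

def steering_shift_samples(fs, d, theta, c=343.0):
    """Per-pair sample shift for steering the look direction.

    The inter-microphone delay for a source at angle theta (radians)
    is d*sin(theta)/c seconds, i.e. fs*d*sin(theta)/c samples.
    """
    return int(round(fs * d * np.sin(theta) / c))
```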
(77) An STTC ALD as described herein can improve speech intelligibility for a target talker while preserving Interaural Time Difference (ITD) and Interaural Level Difference (ILD) binaural cues that are important for spatial hearing. These binaural cues are not only important for effecting sound source localization and segregation; they are also important for a sense of Spatial Awareness. While the processing described herein aims to eliminate the interfering sound sources altogether, the user of the STTC ALD device could choose whether to listen to the unprocessed waveforms at the Left and Right near-ear microphones, the processed waveforms, or some combination of both. The binaural cues that remain after filtering with the Time-Frequency (T-F) mask are consistent with the user's natural binaural cues, which allows for continued Spatial Awareness with a mixture of the processed and unprocessed waveforms. The ALD user might still want to hear what is going on in the surroundings, but will be able to turn the surrounding interfering sound sources down to a comfortable and ignorable, rather than distracting, intrusive and overwhelming, sound level. For example, in some situations, it would be helpful to be able to make out the speech of surrounding talkers, even though the ALD user is primarily focused on listening to the person directly in front of them.
(78) II. System Description of 8-Microphone Short-Time Target Cancellation (STTC) Human-Computer Interface (HCI)
(84) 1. Short-Time Fourier Transform (STFT) processing [90] converts each microphone signal into a frequency-domain signal. 2. Ratio Mask (RM) and Binary Mask (BM) processing [92] is applied to the frequency-domain signals of microphone pairs. 3. Global Ratio Mask (RM_G) and Thresholded Ratio Mask (RM_T) processing [94] uses the ratio masks of all microphone pairs. 4. Output signal processing [96] uses the Thresholded Ratio Mask (RM_T) to scale/modify selected microphone signals to serve as output signal(s) [16].
(85) In the STFT processing [90], the individual STFT calculations are the same as above. Two additional STFTs are calculated for the fourth microphone pair [7, 8]. In the RM processing [92], a fourth ratio mask RM_7,8 is calculated for the fourth microphone pair:
(86) RM_7,8[n, k] = ((|X_7[n, k]| + |X_8[n, k]|) - |X_7[n, k] - X_8[n, k]|) / (|X_7[n, k]| + |X_8[n, k]|)  (20)
Also, as shown in the bottom panel of
RM.sub.G[n,1:16]=RM.sub.1,2[n,1:16](0.fwdarw.750 Hz)
RM.sub.G[n,17:32]=RM.sub.3,4[n,17:32](750.fwdarw.1500 Hz)
RM.sub.G[n,33:61]=RM.sub.5,6[n,33:61](1500.fwdarw.3000 Hz)
(87)
RM.sub.G[n,62:K]=RM.sub.7,8[n,62:K](3000 Hz.fwdarw.Nyquist)
where K denotes the highest frequency channel.
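Under these band assignments, the piecewise construction can be sketched as follows in Python. The 1-indexed channel ranges from the text are converted to 0-based NumPy slices, and the assignment of all remaining high-frequency channels to pair (7,8) is an illustrative assumption:

```python
import numpy as np

def global_ratio_mask_piecewise(rm_12, rm_34, rm_56, rm_78):
    """Assemble the global ratio mask RM_G by piecewise construction.

    Each rm_* is a vector of frequency-channel weights for one frame.
    Band edges follow the 1-indexed channel ranges in the text
    (1:16, 17:32, 33:61, remainder), converted to 0-based slices;
    the most widely spaced pair covers the lowest frequency band.
    """
    rm_g = np.empty_like(rm_12)
    rm_g[0:16]  = rm_12[0:16]    # 0 -> 750 Hz
    rm_g[16:32] = rm_34[16:32]   # 750 -> 1500 Hz
    rm_g[32:61] = rm_56[32:61]   # 1500 -> 3000 Hz
    rm_g[61:]   = rm_78[61:]     # 3000 Hz -> Nyquist (assumed)
    return rm_g
```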
(88) Similarly, the pairwise BM calculations include calculation of a fourth Binary Mask, BM.sub.7,8, for the fourth microphone pair [7, 8]:
(89)
BM.sub.7,8[n,k]=1 if RM.sub.7,8[n,k]>δ, and BM.sub.7,8[n,k]=0 otherwise (21)
(90) And the Global Binary Mask BM.sub.G uses all four BMs:
BM.sub.G[n,k]=BM.sub.1,2[n,k]BM.sub.3,4[n,k]BM.sub.5,6[n,k]BM.sub.7,8[n,k](22)
(92) For the Output Signal Reconstruction [96], both stereo and mono alternatives are possible. These are generally similar to those of
(94) Alternative Computation of the Global Ratio Mask
(95) Two alternatives are described: (1) taking the minimum (min) of pairwise masks, and (2) combining the min and piecewise approaches.
(96) 1. Taking the Minimum of Pairwise Masks
(97) As an alternative to calculating the Global Ratio Mask (RM.sub.G) from the pair-wise ratio masks through piecewise construction, it can be calculated by taking the minimum value, at each Time-Frequency (T-F) tile position [n, k], across the pair-wise ratio masks, as described in the equations below for the six and eight microphone arrays, respectively:
(98) For six microphones:
RM.sub.G[n,k]=min(RM.sub.1,2[n,k],RM.sub.3,4[n,k],RM.sub.5,6[n,k])(25)
(99) For eight microphones:
RM.sub.G[n,k]=min(RM.sub.1,2[n,k],RM.sub.3,4[n,k],RM.sub.5,6[n,k],RM.sub.7,8[n,k])(26)
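This min-based combination is essentially a one-liner in NumPy; the sketch below accepts any number of pairwise masks, whether single-frame vectors or frames-by-channels arrays:

```python
import numpy as np

def global_ratio_mask_min(*pairwise_masks):
    """Elementwise minimum across pairwise ratio masks, per equations
    (25)/(26); generalizes to any number of microphone pairs."""
    return np.minimum.reduce(np.stack(pairwise_masks))
```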
(100) There may be advantages to this alternative approach to computing the Global Ratio Mask, such as better sound source segregation and speech quality when the interfering sources are in close proximity to the target talker (e.g., 30° as opposed to 90° of angular separation).
(101) 2. Combining the Min and Piecewise Approaches
(102) The min approach described above may give improved performance when the microphone array is lined up perfectly with the target, but may set high-frequency T-F tiles to values of zero when the array is not looking directly at the target (i.e. the target is off to the left or right by a few degrees). For this situation an effective approach might involve combining the min and piecewise approaches to ensure that the most widely spaced microphone pair is not included in the computation of the high-frequency T-F tiles. An example of combining the min and piecewise approaches to compute low (0 to 1025 Hz), mid (1074 to 2490 Hz) and high frequency segments of the Global Ratio Mask (RM.sub.G) is shown below:
(103) For Low Frequency Channels (0 to 1025 Hz):
RM.sub.G[n,1:22]=min(RM.sub.1,2[n,1:22],RM.sub.3,4[n,1:22])(27a)
For Mid Frequency Channels (1074 to 2490 Hz):
RM.sub.G[n,23:52]=min(RM.sub.1,2[n,23:52],RM.sub.3,4[n,23:52],RM.sub.5,6[n,23:52])(27b)
For High Frequency Channels (2539 to 25000 Hz):
(104)
RM.sub.G[n,53:K]=min(RM.sub.3,4[n,53:K],RM.sub.5,6[n,53:K])(27c)
where K denotes the highest frequency channel.
(105) Note that the most widely spaced microphone pair, RM.sub.1,2, is not included in the computation of the high frequency channel (2539 to 25000 Hz) T-F tiles. This processing should alleviate deletion of high-frequency target T-F tiles when the target is slightly off axis (i.e. a few degrees to the left or right of the array's designated look direction).
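The combined min/piecewise construction for a 6-microphone (three-pair) array can be sketched as below. The low and mid slices follow equations (27a) and (27b) with 1-indexed channels converted to 0-based slices; the high band takes the minimum over only the two more closely spaced pairs, implementing the exclusion of RM.sub.1,2 described above, though its exact form here is an assumption:

```python
import numpy as np

def global_ratio_mask_combined(rm_12, rm_34, rm_56):
    """Combined min/piecewise global ratio mask for a 6-mic (3-pair) array.

    Channel ranges follow equations (27a)/(27b): 1-indexed 1:22 (low)
    and 23:52 (mid). The high band excludes the most widely spaced
    pair rm_12, as described in the text (assumed form).
    """
    rm_g = np.empty_like(rm_12)
    rm_g[0:22]  = np.minimum(rm_12[0:22], rm_34[0:22])             # low band
    rm_g[22:52] = np.minimum.reduce([rm_12[22:52],
                                     rm_34[22:52],
                                     rm_56[22:52]])                # mid band
    rm_g[52:]   = np.minimum(rm_34[52:], rm_56[52:])               # high band
    return rm_g
```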
(106) Piecewise-Construction of Global Binary Mask
(107) For a scenario in which the ALD's 6-microphone array has a straight-ahead look direction of 0°, but the target talker is at +5°, the Time-Frequency (T-F) tiles above 3.5 kHz may be set to zero, a result of the target waveform becoming part of the Noise estimate in the most widely spaced pair of microphones. Having the T-F tiles above 3.5 kHz set to zero does not affect the speech intelligibility metrics, because the frequencies most important for speech are below that cutoff; e.g., telephones transmit speech bandpass filtered between 300 Hz and 3.5 kHz. However, if it is desired to preserve this high-frequency speech content, a piecewise construction approach could be used for computing the Global Binary Mask (BM.sub.G). Rather than using the above element-by-element multiplication, a piecewise construction such as the following can be used:
(108) For Low to Mid Frequency Channels (0 to 2500 Hz):
BM.sub.G[n,1:52]=BM.sub.1,2[n,1:52]BM.sub.3,4[n,1:52](28a)
For Mid to High Frequency Channels (2500 to 25000 Hz):
(109)
BM.sub.G[n,53:K]=BM.sub.3,4[n,53:K]BM.sub.5,6[n,53:K](28b)
where K denotes the highest frequency channel.
(110) Embodiment in a 2-Microphone Binaural Hearing Aid.
(111) Although the devices described thus far have leveraged multiple microphone pairs to compute an effective time-varying filter that can suppress non-stationary sound sources, the approach could also be used in binaural hearing aids using only two near-ear microphones [44], as shown in
(113) The STTC processing [98] would use only the signals from the binaural microphones, the Left and Right STFTs X.sub.L[n, k] and X.sub.R[n,k] [24], to compute a Ratio Mask (RM):
(114)
RM[n,k]=(|X.sub.L[n,k]|+|X.sub.R[n,k]|-|X.sub.L[n,k]-X.sub.R[n,k]|)/(|X.sub.L[n,k]|+|X.sub.R[n,k]|)(29)
If there is only one pair of microphones, and therefore only one Ratio Mask (RM) is computed, then the Global Ratio Mask (RM.sub.G) and the single Ratio Mask (RM) are equivalent; i.e., RM.sub.G[n, k]=RM[n, k].
(115) For the output signal reconstruction [99], the RM.sub.G[n, k] T-F mask (i.e., time-varying filter) can be used to filter the signals from the Left and Right near-ear microphones [44]:
Y.sub.L[n,k]=RM.sub.G[n,k]X.sub.L[n,k]
Y.sub.R[n,k]=RM.sub.G[n,k]X.sub.R[n,k](30)
Synthesis of a stereo output (y.sub.L[m] and y.sub.R[m]) estimate of the target speech signal consists of taking the Inverse Short-Time Fourier Transforms (ISTFTs) of Y.sub.L[n, k] and Y.sub.R[n, k] and using the overlap-add method of reconstruction. The minimalist processing described here would provide a speech intelligibility benefit, for a target talker straight ahead at 0°, while still preserving binaural cues. Alternative processing might include using a Thresholded Ratio Mask (RM.sub.T), as described in the previous sections, for computing the outputs Y.sub.L and Y.sub.R.
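The masked analysis/synthesis loop can be sketched as follows, assuming a 50%-overlapped Hann window and a 512-point FFT (parameter choices not specified in the text). Dividing by the accumulated squared window gives exact reconstruction in the interior of the signal when the mask is all ones:

```python
import numpy as np

def stft(x, nfft=512, hop=256):
    """Hann-windowed short-time Fourier transform (frames x channels)."""
    w = np.hanning(nfft)
    frames = [np.fft.rfft(w * x[i:i + nfft])
              for i in range(0, len(x) - nfft + 1, hop)]
    return np.array(frames)

def istft(X, nfft=512, hop=256):
    """Inverse STFT via windowed overlap-add, normalized by the
    accumulated squared window."""
    w = np.hanning(nfft)
    y = np.zeros(hop * (len(X) - 1) + nfft)
    wsum = np.zeros_like(y)
    for n, frame in enumerate(X):
        y[n * hop:n * hop + nfft] += w * np.fft.irfft(frame, nfft)
        wsum[n * hop:n * hop + nfft] += w * w
    return y / np.maximum(wsum, 1e-12)

def reconstruct(mask, XL, XR, nfft=512, hop=256):
    """Apply a (frames x channels) T-F mask to Left/Right STFTs and
    resynthesize the stereo output, per equation (30)."""
    return istft(mask * XL, nfft, hop), istft(mask * XR, nfft, hop)
```

With an all-pass mask the round trip returns the input samples away from the edges; in practice the mask would be RM.sub.G or RM.sub.T computed frame by frame.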
(116) A Binary Mask BM[n, k] may also be computed using a thresholding function, with threshold value δ, which may be set to a fixed value of δ=0.2 for example:
(117)
BM[n,k]=1 if RM[n,k]>δ, and BM[n,k]=0 otherwise (31)
When using only one pair of microphones, the Thresholded Ratio Mask (RM.sub.T) is the product of the Ratio Mask and Binary Mask:
RM.sub.T[n,k]=RM[n,k]BM[n,k](32)
(118) For this alternative processing for the output signal reconstruction [99], when using only one pair of microphones, the RM.sub.T[n, k] T-F mask (i.e., time-varying filter) can be used to filter the signals from the Left and Right near-ear microphones [44]:
Y.sub.L[n,k]=RM.sub.T[n,k]X.sub.L[n,k]
Y.sub.R[n,k]=RM.sub.T[n,k]X.sub.R[n,k](33)
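The thresholding and masking steps above can be sketched together in Python; the strict inequality at the threshold is an assumption, and δ=0.2 follows the fixed example value given in the text:

```python
import numpy as np

def thresholded_ratio_mask(rm, delta=0.2):
    """RM_T = RM * BM, where BM binarizes RM at threshold delta,
    combining equations (31) and (32)."""
    bm = (rm > delta).astype(rm.dtype)  # assumed strict inequality
    return rm * bm
```

T-F tiles whose ratio-mask value falls at or below δ are zeroed outright, while tiles above the threshold keep their original ratio-mask weighting.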
(119) While various embodiments of the invention have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.