AUDIO DEVICE WITH DISTRACTOR SUPPRESSION
20250054479 · 2025-02-13
Inventors
- Ting-Yao CHEN (Zhubei City, TW)
- Chen-Chu HSU (Zhubei City, TW)
- YAO-CHUN LIU (Zhubei City, TW)
- Tsung-Liang CHEN (Zhubei City, TW)
Cpc classification
International classification
Abstract
An audio device is disclosed, comprising multiple microphones and an audio module. The multiple microphones generate multiple audio signals. The audio module coupled to the multiple microphones comprises a processor, a storage media and a post-processing circuit. The storage media includes instructions operable to be executed by the processor to perform operations comprising: producing multiple instantaneous relative transfer functions (IRTFs) using a known adaptive algorithm according to multiple spectral representations for multiple first sample values in current frames of the multiple audio signals; and, performing distractor suppression over the multiple spectral representations and the multiple IRTFs using an end-to-end neural network to generate a compensation mask. The post-processing circuit generates an audio output signal according to the compensation mask. Each IRTF represents a difference in sound propagation between each predefined microphone and a reference microphone of the microphones relative to sound sources.
Claims
1. An audio device, comprising: multiple microphones that generate multiple audio signals; and an audio module coupled to the multiple microphones, comprising: at least one processor; at least one storage media including instructions operable to be executed by the at least one processor to perform a set of operations comprising: producing multiple instantaneous relative transfer functions (IRTFs) using a first known adaptive algorithm according to multiple mic spectral representations for multiple first sample values in current frames of the multiple audio signals; and performing distractor suppression over the multiple mic spectral representations and the multiple IRTFs using an end-to-end neural network to generate a compensation mask; and a post-processing circuit that generates an audio output signal according to the compensation mask; wherein each IRTF represents a difference in sound propagation between each predefined microphone and a reference microphone of the multiple microphones relative to at least one sound source; and wherein each predefined microphone is different from the reference microphone.
2. The audio device according to claim 1, further comprising: a loudspeaker that converts a playback audio signal into a sound pressure signal; wherein the set of operations further comprises: producing multiple playback transfer functions (PTFs) using a second known adaptive algorithm according to the multiple mic spectral representations and a playback spectral representation for multiple second sample values in a current frame of the playback audio signal; and performing acoustic echo cancellation (AEC) over the multiple mic spectral representations, the multiple IRTFs, the playback spectral representation and the multiple PTFs using the end-to-end neural network to generate the compensation mask; wherein each PTF indicates a degree of a sound leakage from the loudspeaker to a target microphone of the multiple microphones.
3. The audio device according to claim 2, wherein the first and the second known adaptive algorithms are least mean square (LMS) algorithm.
4. The audio device according to claim 2, wherein each PTF includes multiple PTF elements corresponding to multiple frequency bands, and wherein the operation of producing the PTFs comprises: for a target frequency band of one PTF, producing a current PTF element for the one PTF using the second known adaptive algorithm according to a first corresponding sample in the playback spectral representation and a difference between an estimated sample and a second corresponding sample in a corresponding mic spectral representation for the target microphone; wherein the estimated sample is related to a product of a previous PTF element for the one PTF and the first corresponding sample.
5. The audio device according to claim 2, wherein the set of operations further comprises: performing active noise cancellation (ANC) operations over the multiple first sample values using the end-to-end neural network to generate multiple third sample values.
6. The audio device according to claim 5, wherein the end-to-end neural network comprises: a time delay neural network (TDNN); a first long short-term memory (LSTM) network coupled to the output of the TDNN; and a second LSTM network coupled to the output of the TDNN; wherein the TDNN and the first LSTM network are jointly trained to perform the ANC operations over the first sample values to generate the third sample values; wherein the TDNN and the second LSTM network are jointly trained to perform the distraction suppression over the multiple mic spectral representations and the multiple IRTFs to generate the compensation mask; and wherein the TDNN and the second LSTM network are jointly trained to perform the AEC over the multiple mic spectral representations, the multiple IRTFs, the playback spectral representation and the multiple PTFs to generate the compensation mask.
7. The audio device according to claim 5, wherein the post-processing circuit modifies a main spectral representation of the multiple mic spectral representations with the compensation mask to generate a compensated spectral representation, and generates the audio output signal according to the multiple third sample values and the compensated spectral representation.
8. The audio device according to claim 1, wherein the compensation mask comprises multiple frequency band gains, each indicating its corresponding frequency band is either speech-dominant or noise-dominant.
9. The audio device according to claim 1, wherein the end-to-end neural network is a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a time delay neural network (TDNN) or a combination thereof.
10. The audio device according to claim 1, wherein one of the multiple microphones is selected as the reference microphone according to signal-to-noise ratios or receiving spectrum ranges of the multiple microphones.
11. The audio device according to claim 1, wherein the audio output signal is sent out over a connection link.
12. The audio device according to claim 1, further comprising: an audio output circuit that is coupled to an output terminal of the post-processing circuit over a connection link and converts the audio output signal into a sound pressure signal.
13. The audio device according to claim 1, wherein the audio module and at least one of the multiple microphones are arranged at a first source device, wherein the other microphones are arranged at a second source device, and wherein the audio module connected to the at least one microphone is coupled to the other microphones over a first connection link and to a sink device over a second connection link.
14. The audio device according to claim 1, wherein the multiple microphones are respectively arranged at two different source devices, and wherein the audio module is arranged at a sink device and coupled to the multiple microphones over two different connection links.
15. The audio device according to claim 1, wherein the multiple microphones are respectively arranged at a first source device, a second source device and a sink device, and the audio module is arranged at the sink device, and wherein the audio module is coupled to the multiple microphones over three different connection links.
16. The audio device according to claim 1, wherein each IRTF includes multiple IRTF elements corresponding to multiple frequency bands, and wherein the operation of producing the IRTFs comprises: for a target frequency band of one IRTF, producing a current IRTF element for the one IRTF using the first known adaptive algorithm according to a first corresponding sample in a first mic spectral representation for the predefined microphone and a difference between an estimated sample and a second corresponding sample in a second mic spectral representation for the reference microphone; wherein the estimated sample is related to a product of a previous IRTF element for the one IRTF and the first corresponding sample.
17. An audio apparatus, comprising: two audio devices of claim 1 that are arranged at two different source devices; wherein the two audio output signals from the two audio devices are respectively sent to a sink device over a first connection link and a second connection link.
18. The apparatus according to claim 17, further comprising: an audio output circuit that is arranged at the sink device and converts the two audio output signals into two sound pressure signals.
19. The audio apparatus according to claim 17, wherein the sink device receives the two audio output signals over the first and the second connection links and delivers them over a third connection link.
20. An audio processing method, comprising: obtaining multiple instantaneous relative transfer functions (IRTFs) using a first known adaptive algorithm according to multiple mic spectral representations for multiple first sample values in current frames of multiple audio signals from multiple microphones; performing distractor suppression over the multiple mic spectral representations and the multiple IRTFs using an end-to-end neural network to generate a compensation mask; and obtaining an audio output signal according to the compensation mask; wherein each IRTF represents a difference in sound propagation between each predefined microphone and a reference microphone of the multiple microphones relative to at least one sound source; and wherein each predefined microphone is different from the reference microphone.
21. The method according to claim 20, further comprising: producing multiple playback transfer functions (PTFs) using a second known adaptive algorithm according to the multiple mic spectral representations and a playback spectral representation for multiple second sample values in a current frame of a playback audio signal for a loudspeaker; and performing acoustic echo cancellation (AEC) over the multiple mic spectral representations, the multiple IRTFs, the playback spectral representation and the multiple PTFs using the end-to-end neural network to generate the compensation mask; wherein each PTF indicates a degree of a sound leakage from the loudspeaker to a target microphone of the multiple microphones.
22. The method according to claim 21, wherein the first and the second known adaptive algorithms are least mean square (LMS) algorithm.
23. The method according to claim 21, wherein each PTF includes multiple PTF elements corresponding to multiple frequency bands, wherein the step of obtaining the PTFs comprises: for a target frequency band of one PTF, producing a current PTF element for the one PTF using the second known adaptive algorithm according to a first corresponding sample in the playback spectral representation and a difference between an estimated sample and a second corresponding sample in a corresponding mic spectral representation for the target microphone; wherein the estimated sample is related to a product of a previous PTF element for the one PTF and the first corresponding sample.
24. The method according to claim 21, further comprising: performing active noise cancellation (ANC) operations over the multiple first sample values using the end-to-end neural network to generate multiple third sample values.
25. The method according to claim 24, wherein the end-to-end neural network comprises a time delay neural network (TDNN), a first long short-term memory (LSTM) network and a second LSTM network, wherein the TDNN and the first LSTM network are jointly trained to perform the ANC operations over the first sample values to generate the third sample values, wherein the TDNN and the second LSTM network are jointly trained to perform the distraction suppression over the multiple mic spectral representations and the multiple IRTFs to generate the compensation mask, and wherein the TDNN and the second LSTM network are jointly trained to perform the AEC over the multiple mic spectral representations, the multiple IRTFs, the playback spectral representation and the multiple PTFs to generate the compensation mask.
26. The method according to claim 24, wherein the step of obtaining the audio output signal comprises: modifying a main spectral representation of the multiple mic spectral representations with the compensation mask to obtain a compensated spectral representation; and obtaining the audio output signal according to the third sample values and the compensated spectral representation.
27. The method according to claim 20, wherein the compensation mask comprises multiple frequency band gains, each indicating its corresponding frequency band is either speech-dominant or noise-dominant.
28. The method according to claim 20, further comprising: selecting one of the multiple microphones as the reference microphone according to either signal-to-noise ratios or receiving spectrum ranges of the multiple microphones.
29. The method according to claim 20, further comprising: sending out the audio output signal over a connection link.
30. The method according to claim 20, further comprising: converting the audio output signal into a sound pressure signal.
31. The method according to claim 20, further comprising: at a first source device with M microphones of the multiple microphones, carrying out the above steps of obtaining the multiple IRTFs, performing and obtaining the audio output signal based on M audio signals of the M microphones; defining the audio output signal in the first source device as a first signal; and delivering the first signal to a sink device over a first connection link; at a second source device with N microphones of the multiple microphones, carrying out the above steps of obtaining the multiple IRTFs, performing and obtaining the audio output signal based on N audio signals of the N microphones; defining the audio output signal in the second source device as a second signal; and delivering the second signal to the sink device over a second connection link; and at the sink device, receiving the first and the second signals over the first and the second connection links, where M, N>1.
32. The method according to claim 20, further comprising: at a first source device with at least one of the multiple microphones, delivering at least one audio signal from the at least one of the multiple microphones to a second source device with the other microphones over a first connection link; at the second source device, carrying out the above steps of obtaining the multiple IRTFs, performing and obtaining the audio output signal according to the multiple audio signals; and delivering the audio output signal to a sink device over a second connection link; and at the sink device, receiving the audio output signal over the second connection link; and transmitting the audio output signal over a third connection link.
33. The method according to claim 20, further comprising: at a first source device with at least one of the multiple microphones, delivering at least one audio signal from the at least one of the multiple microphones to a sink device over a first connection link; at a second source device with the other microphones, delivering the other audio signals from the other microphones to the sink device over a second connection link; and at the sink device, carrying out the above steps of obtaining the multiple IRTFs, performing and obtaining the audio output signal according to the multiple audio signals; and delivering the audio output signal over a third connection link.
34. The method according to claim 20, further comprising: at a first source device with a first portion of the multiple microphones, delivering at least one audio signal from the first portion of the multiple microphones to a sink device over a first connection link; at a second source device with a second portion of the multiple microphones, delivering at least one audio signal from the second portion of the multiple microphones to the sink device over a second connection link; at the sink device with a third portion of the multiple microphones, performing the above steps of obtaining the multiple IRTFs, performing and obtaining the audio output signal according to the multiple audio signals of the multiple microphones; and delivering the audio output signal over a third connection link; wherein the multiple microphones are divided into the first, the second and the third portions.
35. The method according to claim 20, wherein each IRTF includes multiple IRTF elements corresponding to multiple frequency bands, wherein the step of obtaining the IRTFs comprises: for a target frequency band of one IRTF, obtaining a current IRTF element for the one IRTF using the first known adaptive algorithm according to a first corresponding sample in a first mic spectral representation for the predefined microphone and a difference between an estimated sample and a second corresponding sample in a second mic spectral representation for the reference microphone; wherein the estimated sample is related to a product of a previous IRTF element for the one IRTF and the first corresponding sample.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
DETAILED DESCRIPTION OF THE INVENTION
[0030] As used herein and in the claims, the term "and/or" includes any and all combinations of one or more of the associated listed items. The use of the terms "a", "an" and "the" and similar referents in the context of describing the invention is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components with the same function are designated with the same reference numerals.
[0031] As used herein and in the claims, the term "sink device" refers to a device implemented to establish a first connection link with one or two source devices so as to receive audio data from the one or two source devices, and implemented to establish a second connection link with another sink device so as to transmit audio data to the other sink device. Examples of the sink device include, but are not limited to, a personal computer, a laptop computer, a mobile device, a wearable device, an Internet of Things (IoT) device/hub and an Internet of Everything (IoE) device/hub. The term "source device" refers to a device having an embedded microphone and implemented to originate, transmit and/or receive audio data over connection links with the other source device or the sink device. Examples of the source device include, but are not limited to, a headphone, an earbud and one side of a headset. Types of headphones and headsets include, but are not limited to, over-ear, on-ear, clip-on and in-ear-monitor types. The source device, the sink device and the connection links can be either wired or wireless. A wired connection link is made using a transmission line or cable. A wireless connection link can occur over any suitable communication link/network that enables the source devices and the sink device to communicate with each other over a communication medium. Examples of protocols that can be used to form communication links/networks include, but are not limited to, near-field communication (NFC) technology, radio-frequency identification (RFID) technology, Bluetooth, Bluetooth Low Energy (BLE), Wi-Fi technology, the Internet Protocol (IP) and Transmission Control Protocol (TCP).
[0032] A feature of the invention is to use an end-to-end neural network to simultaneously perform active noise cancellation (ANC) functions and advanced audio signal processing, e.g., noise suppression, acoustic feedback cancellation (AFC), sound amplification, distractor suppression and acoustic echo cancellation (AEC). Another feature of the invention is that the end-to-end neural network receives a time-domain audio signal and a frequency-domain audio signal for each microphone so as to gain the benefits of both time-domain signal processing (e.g., extremely low system latency) and frequency-domain signal processing (e.g., better frequency analysis). In comparison with the conventional ANC technology, which is most effective on lower frequencies of sound, e.g., between 50 and 1000 Hz, the end-to-end neural network of the invention can reduce both high-frequency noise and low-frequency noise. Another feature of the invention is to use multiple microphone signals from one or two source devices and/or a sink device and multiple IRTFs (described below) to suppress the distractor speech 230 in
[0033]
[0034] In an embodiment, the audio device 10/60/70 may be a hearing aid, e.g., of the behind-the-ear (BTE) type, in-the-ear (ITE) type, in-the-canal (ITC) type, or completely-in-the-canal (CIC) type. The microphones 11 to 1Q are used to collect ambient sound to generate Q audio signals au-1 to au-Q. The pre-processing unit 120 is configured to receive the Q audio signals au-1 to au-Q and generate audio data of current frames i of Q time-domain digital audio signals s_1[n] to s_Q[n] and Q current spectral representations F1(i) to FQ(i) corresponding to the audio data of the current frames i of the time-domain digital audio signals s_1[n] to s_Q[n], where n denotes the discrete time index and i denotes the frame index of the time-domain digital audio signals s_1[n] to s_Q[n]. The end-to-end neural network 130 receives input parameters, the Q current spectral representations F1(i) to FQ(i), and audio data for current frames i of the Q time-domain signals s_1[n] to s_Q[n], and performs ANC and AFC functions, noise suppression and sound amplification to generate a frequency-domain compensation mask stream G_1(i) to G_N(i) and audio data of the current frame i of a time-domain digital data stream u[n]. The post-processing unit 150 receives the frequency-domain compensation mask stream G_1(i) to G_N(i) and audio data of the current frame i of the time-domain data stream u[n] to generate audio data for the current frame i of a time-domain digital audio signal y[n], where N denotes the fast Fourier transform (FFT) size. The output terminal of the post-processing unit 150 is coupled to the audio output circuit 160 via a second connection link 172, such as a transmission line or a Bluetooth/Wi-Fi communication link. Finally, the audio output circuit 160 placed at a sink device or a source device converts the digital audio signal y[n] from the second connection link 172 into a sound pressure signal. Please note that the first connection links 171 and the second connection link 172 are not necessarily the same, and the audio output circuit 160 is optional.
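As an illustration of the pre-processing step described above, the following minimal sketch cuts one frame out of a time-domain signal and computes its spectral representation. It rests on assumptions that are not taken from the specification: non-overlapping frames, no analysis window, and a naive DFT in place of an FFT. The function name `frame_spectrum` and its parameters are hypothetical.

```python
import cmath
import math


def frame_spectrum(samples, frame_index, frame_size):
    """Return (frame, spectrum) for one frame of a time-domain signal.

    Assumptions (illustrative only): frames do not overlap, no window is
    applied, and the spectrum is a plain DFT with frame_size bins.
    """
    start = frame_index * frame_size
    frame = samples[start:start + frame_size]
    if len(frame) != frame_size:
        raise ValueError("frame lies outside the signal")
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
            for n, x in enumerate(frame))
        for k in range(frame_size)
    ]
    return frame, spectrum
```

In this sketch the returned `frame` plays the role of the time-domain audio data of frame i for one microphone, and `spectrum` plays the role of that microphone's current spectral representation.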
[0035]
[0036] The end-to-end neural network 130/630/730 may be implemented by a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a time delay neural network (TDNN) or any combination thereof. Various machine learning techniques associated with supervised learning may be used to train a model of the end-to-end neural network 130/630/730 (hereinafter called "model 130/630/730" for short). Example supervised learning techniques to train the end-to-end neural network 130/630/730 include, without limitation, stochastic gradient descent (SGD). In supervised learning, a function (i.e., the model 130) is created by using four sets of labeled training examples (described below), each of which consists of an input feature vector and a labeled output. The end-to-end neural network 130 is configured to use the four sets of labeled training examples to learn or estimate the function (i.e., the model 130), and then to update the model weights using the backpropagation algorithm in combination with a cost function. Backpropagation iteratively computes the gradient of the cost function relative to each weight and bias, then updates the weights and biases in the opposite direction of the gradient, to find a local minimum. The goal of learning in the end-to-end neural network 130 is to minimize the cost function given the four sets of labeled training examples.
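The gradient-based update described above can be illustrated on a deliberately tiny model. The sketch below is not the end-to-end neural network: it assumes a single linear unit y = w*x + b with a squared-error cost, and the names `sgd_fit`, `lr` and `epochs` are hypothetical. It only shows weights and biases being moved in the opposite direction of the gradient, one labeled example at a time.

```python
import random


def sgd_fit(examples, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error.

    `examples` is an iterable of (input, labeled_output) pairs.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the labeled examples in random order
        for x, y in data:
            err = (w * x + b) - y  # prediction error for this example
            # Gradient of 0.5 * err**2 is err*x for w and err for b;
            # step against the gradient.
            w -= lr * err * x
            b -= lr * err
    return w, b
```

In a multi-layer network the same update rule applies to every weight and bias, with backpropagation supplying the per-parameter gradients.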
[0037]
[0038] According to the input parameters, the end-to-end neural network 130 receives the Q current spectral representations F1(i) to FQ(i) and audio data of the current frames i of Q time-domain input streams s_1[n] to s_Q[n] in parallel, performs the ANC function and advanced audio signal processing, and generates one frequency-domain compensation mask stream (including N mask values G_1(i) to G_N(i)) corresponding to N frequency bands and audio data of the current frame i of one time-domain output sample stream u[n]. Here, the advanced audio signal processing includes, without limitation, noise suppression, AFC, sound amplification, alarm-preserving, environmental classification, direction of arrival (DOA) and beamforming, speech separation and wearing detection. For purposes of clarity and ease of description, the following embodiments are described with the advanced audio signal processing only including noise suppression, AFC, and sound amplification. However, it should be understood that the embodiments of the end-to-end neural network 130 are not so limited, but are generally applicable to other types of audio signal processing, such as environmental classification, direction of arrival (DOA) and beamforming, speech separation and wearing detection.
[0039] For the sound amplification function, the input parameters for the end-to-end neural network 130 include, without limitation, magnitude gains, a maximum output power value of the signal z[n] (i.e., the output of inverse STFT 154) and a set of N modification gains g_1 to g_N corresponding to N mask values G_1(i) to G_N(i), where the N modification gains g_1 to g_N are used to modify the waveform of the N mask values G_1(i) to G_N(i). For the noise suppression, AFC and ANC functions, the input parameters for the end-to-end neural network 130 include, without limitation, the level or strength of suppression. For the noise suppression function, the input data for a first set of labeled training examples are constructed artificially by adding various noise to clean speech data, and the ground truth (or labeled output) for each example in the first set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G_1(i) to G_N(i)) for corresponding clean speech data. For the sound amplification function, the input data for a second set of labeled training examples are weak speech data, and the ground truth for each example in the second set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G_1(i) to G_N(i)) for corresponding amplified speech data based on corresponding input parameters (e.g., including a corresponding magnitude gain, a corresponding maximum output power value of the signal z[n] and a corresponding set of N modification gains g_1 to g_N). For the AFC function, the input data for a third set of labeled training examples are constructed artificially by adding various feedback interference data to clean speech data, and the ground truth for each example in the third set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G_1(i) to G_N(i)) for corresponding clean speech data. For the ANC function, the input data for a fourth set of labeled training examples are constructed artificially by adding the direct sound data to clean speech data, and the ground truth for each example in the fourth set of labeled training examples requires N sample values of the time-domain denoised audio data u[n] for corresponding clean speech data. For speech data, a wide range of people's speech is collected, such as people of different genders, different ages, different races and different language families. For noise data, various sources of noise are used, including markets, computer fans, crowds, cars, airplanes, construction, etc. For the feedback interference data, interference data at various coupling levels between the loudspeaker 163 and the microphones 11 to 1Q are collected. For the direct sound data, the sound from the inputs of the audio devices to the user eardrums is collected among a wide range of users. During the process of artificially constructing the input data, each of the noise data, the feedback interference data and the direct sound data is mixed at different levels with the clean speech data to produce a wide range of SNRs for the four sets of labeled training examples.
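The mixing of interference with clean speech "at different levels" to cover a range of SNRs can be sketched as follows. This is a minimal illustration under stated assumptions: signals are equal-length lists of samples, SNR is defined from mean signal power, and the function name `mix_at_snr` is hypothetical rather than taken from the specification.

```python
import math


def mix_at_snr(clean, interference, snr_db):
    """Scale `interference` so clean/interference power equals snr_db, then add."""
    if len(clean) != len(interference):
        raise ValueError("signals must have the same length")
    p_clean = sum(x * x for x in clean) / len(clean)
    p_interf = sum(x * x for x in interference) / len(interference)
    if p_clean == 0.0 or p_interf == 0.0:
        raise ValueError("both signals must have non-zero power")
    # Choose the scale so that 10*log10(p_clean / (scale**2 * p_interf)) == snr_db.
    scale = math.sqrt(p_clean / (p_interf * 10.0 ** (snr_db / 10.0)))
    return [c + scale * d for c, d in zip(clean, interference)]
```

Calling the function repeatedly with different `snr_db` values on the same clean utterance would yield the kind of SNR spread described above; the same routine could be applied to noise data, feedback interference data or direct sound data.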
[0040] Regarding the end-to-end neural network 130, in a training phase, the TDNN 131 and the FD-LSTM network 132 are jointly trained with the first, the second and the third sets of labeled training examples, each labeled as a corresponding frequency-domain compensation mask stream (including N mask values G_1(i) to G_N(i)); the TDNN 131 and the TD-LSTM network 133 are jointly trained with the fourth set of labeled training examples, each labeled as N corresponding time-domain audio sample values. When trained, the TDNN 131 and the FD-LSTM network 132 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding frequency-domain mask values G_1(i) to G_N(i) for the N frequency bands, while the TDNN 131 and the TD-LSTM network 133 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding time-domain audio sample values for the current frame i of the signal u[n]. In one embodiment, the N mask values G_1(i) to G_N(i) are N band gains (bounded between Th1 and Th2; Th1<Th2) corresponding to the N frequency bands in the current spectral representations F1(i) to FQ(i). Thus, if any band gain value G_k(i) gets close to Th1, it indicates that the signal on the corresponding frequency band k is noise-dominant; if any band gain value G_k(i) gets close to Th2, it indicates that the signal on the corresponding frequency band k is speech-dominant. When the end-to-end neural network 130 is trained, the higher the SNR value in a frequency band k is, the higher the band gain value G_k(i) in the frequency-domain compensation mask stream becomes.
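How bounded band gains might act on a spectral representation can be sketched as below. The specifics are assumptions made for illustration: the bounds default to Th1 = 0 and Th2 = 1, the mask is applied by per-band multiplication, and a gain is labeled by whichever bound it lies closer to. The function names `apply_compensation_mask` and `dominance` are hypothetical.

```python
def apply_compensation_mask(spectrum, gains, th1=0.0, th2=1.0):
    """Multiply each frequency band by its gain, clipped to [th1, th2]."""
    if len(spectrum) != len(gains):
        raise ValueError("one gain is required per frequency band")
    if not th1 < th2:
        raise ValueError("th1 must be smaller than th2")
    clipped = [min(max(g, th1), th2) for g in gains]
    return [g * f for g, f in zip(clipped, spectrum)]


def dominance(gain, th1=0.0, th2=1.0):
    """Label a band by the bound its gain is closer to (midpoint rule, assumed)."""
    return "speech" if (gain - th1) >= (th2 - gain) else "noise"
```

With these defaults a band whose gain is near 0 is strongly attenuated and labeled noise-dominant, while a band whose gain is near 1 passes almost unchanged and is labeled speech-dominant.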
[0041] In brief, the low latency of the end-to-end neural network 130 between the time-domain input signals s_1[n] to s_Q[n] and the responsive time-domain output signal u[n] fully satisfies the ANC requirements (i.e., less than 50 μs). In addition, the end-to-end neural network 130 manipulates the input current spectral representations F1(i) to FQ(i) in the frequency domain to achieve the goals of noise suppression, AFC and sound amplification, thus greatly improving the audio quality. Thus, the framework of the end-to-end neural network 130 integrates and exploits cross-domain audio features by leveraging audio signals in both the time domain and the frequency domain to improve hearing aid performance.
[0042]
[0043] In the embodiment of
[0044]
[0045] A relative transfer function (RTF) represents correlation (or differences in magnitude and in phase) between any two microphones in response to the same sound source. Multiple sound sources can be distinguished by utilizing their RTFs, which describe differences in sound propagation between sound sources and microphones and are generally different for sound sources in different locations. Different sound sources, such as user speech, distractor speech and background noise, bring about different RTFs. Generally, the RTFs are used in sound source localization, speech enhancement and beamforming, such as direction of arrival (DOA) and generalized sidelobe canceller (GSC) algorithms.
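The statement that sources in different locations produce different RTFs can be illustrated with a toy propagation model. The sketch assumes free-field propagation in which each microphone receives the source with one gain and one delay, and it takes the ratio in the direction "from one microphone to the other" as a convention chosen here. The names `mic_response` and `rtf` and all numeric values are hypothetical.

```python
import cmath
import math


def mic_response(gain, delay_s, freq_hz):
    """Complex response of one microphone to a source at one frequency."""
    return gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)


def rtf(resp_from, resp_to):
    """Ratio mapping the response at one microphone onto the other."""
    return resp_to / resp_from


# Source A is closer to microphone u; source B is closer to microphone v.
FREQ_HZ = 1000.0
rtf_a = rtf(mic_response(1.0, 0.0, FREQ_HZ), mic_response(0.5, 0.00025, FREQ_HZ))
rtf_b = rtf(mic_response(0.5, 0.00025, FREQ_HZ), mic_response(1.0, 0.0, FREQ_HZ))
```

The two ratios differ in both magnitude (0.5 versus 2) and phase (a quarter cycle lag versus a quarter cycle lead at 1 kHz), which is the kind of difference that lets the sources be told apart.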
[0046] Each RTF is defined/computed for each predefined microphone 1u relative to a reference microphone 1v, where 1<=u, v<=Q and u≠v. Properly selecting the reference microphone is important as all RTFs are relative to this reference microphone. In a preferred embodiment, a microphone with a higher signal to noise ratio (SNR), such as a feedback microphone 12 of a TWS earbud 620 in
[0047] H.sub.u,v(i) denotes an IRTF from the predefined microphone 1u to the reference microphone 1v and is obtained based on audio data in the current frames i of the audio signals s.sub.u[n] and s.sub.v[n]. Each IRTF (H.sub.u,v(i)) represents a difference in sound propagation between the predefined microphone 1u and the reference microphone 1v relative to at least one sound source. Each IRTF (H.sub.u,v(i)) represents a vector including an array of N complex-valued elements: [H.sub.1,u,v(i), H.sub.2,u,v(i), . . . , H.sub.N,u,v(i)], respectively corresponding to N frequency bands for the audio data of the current frames i of the audio signals s.sub.u[n] and s.sub.v[n]. Each IRTF element (H.sub.k,u,v(i)) is a complex number that can be expressed in terms of a magnitude and a phase/angle, where 1<=k<=N. Assuming that a microphone 12 is selected as the reference microphone in
[0048] An estimated sample {circumflex over (F)}.sub.k,v(i) for the k.sup.th frequency band is generated based on a previous estimated IRTF (H.sub.k,u,v(i)) from the adaptive algorithm block 615, where {circumflex over (F)}.sub.k,v(i)=H.sub.k,u,v(i)F.sub.k,u(i). Then, the known adaptive algorithm block 615 updates the complex value of the current estimated IRTF (H.sub.k,u,v(i)) for the k.sup.th frequency band according to the input sample F.sub.k,u(i) and the error signal e(i) so as to minimize the error signal e(i) between the input sample F.sub.k,v(i) and the estimated sample {circumflex over (F)}.sub.k,v(i) for a given environment. In one embodiment, the known adaptive algorithm block 615 is implemented by a least mean square (LMS) algorithm to produce the current complex value of the current estimated IRTF (H.sub.k,u,v(i)). However, the LMS algorithm is provided by way of example and not limitation of the invention.
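The per-band update described above can be sketched as a textbook complex LMS step in Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name lms_step, the step size mu and the synthetic unit-magnitude inputs are all hypothetical, and the error is taken as the reference-microphone sample minus the estimated sample.

```python
import cmath


def lms_step(h, f_u, f_v, mu=0.5):
    """One complex LMS update for a single frequency band.

    h   : current IRTF estimate for the band
    f_u : spectral sample of the predefined microphone
    f_v : spectral sample of the reference microphone
    mu  : step size (assumed value, not from the specification)
    """
    f_v_hat = h * f_u                  # estimated reference sample
    e = f_v - f_v_hat                  # error signal
    h_next = h + mu * e * f_u.conjugate()
    return h_next, e


# Synthetic check: the reference sample is an unknown fixed ratio
# h_true times the predefined-microphone sample in every frame.
h_true = 0.5 - 0.3j
h = 0j
for i in range(50):
    f_u = cmath.exp(1j * 0.7 * i)      # unit-magnitude input sample
    f_v = h_true * f_u
    h, e = lms_step(h, f_u, f_v)
```

With unit-magnitude inputs and mu = 0.5, the estimation error halves on every frame, so h approaches h_true quickly; real signals would need a smaller or normalized step size.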
[0049] In comparison with the neural network 130, the end-to-end neural network 630 (or the TDNN 631) additionally receives (Q-1) estimated IRTFs (H.sub.u,v(i)) and one more input parameter as shown in
[0050] For the distractor suppression function, the input data for a fifth set of labeled training examples are constructed artificially by adding various distractor speech data to clean speech data, and the ground truth (or labeled output) for each example in the fifth set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G.sub.1(i)~G.sub.N(i)) for corresponding clean speech data. For the distractor speech data, various distractor speech data from various directions, different distances and different numbers of people are collected. During the process of artificially constructing the input data, the distractor speech data is mixed at different levels with the clean speech data to produce a wide range of SNRs for the fifth set of labeled training examples. The end-to-end neural network 630 is configured to use the above-mentioned five sets (i.e., from the first to the fifth sets) of labeled training examples to learn or estimate the function (i.e., the model 630), and then to update model weights using the backpropagation algorithm in combination with a cost function. In addition, in the training phase, the TDNN 631 and the FD-LSTM network 132 are jointly trained with the first, the second, the third and the fifth sets of labeled training examples, each labeled as a corresponding frequency-domain compensation mask stream (including N mask values G.sub.1(i)~G.sub.N(i)); the TDNN 631 and the TD-LSTM network 133 are jointly trained with the fourth set of labeled training examples, each labeled as N corresponding time-domain audio sample values.
When trained, the TDNN 631 and the FD-LSTM network 132 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding frequency-domain mask values G.sub.1(i)~G.sub.N(i) for the N frequency bands, while the TDNN 631 and the TD-LSTM network 133 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding time-domain audio sample values for the current frame i of the signal u[n].
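The step of mixing distractor speech with clean speech at different levels to cover a wide range of SNRs can be sketched in Python as follows. The helper names power and mix_at_snr, the synthetic sine-wave stand-ins for speech, and the chosen SNR values are assumptions for illustration; the specification does not give the mixing formula.

```python
import math


def power(x):
    """Mean power of a sequence of time-domain samples."""
    return sum(s * s for s in x) / len(x)


def mix_at_snr(clean, distractor, snr_db):
    """Scale the distractor so that clean/distractor power equals snr_db.

    Hypothetical helper; assumes equal-length, non-silent inputs.
    """
    scale = math.sqrt(power(clean) / (power(distractor) * 10 ** (snr_db / 10)))
    return [c + scale * d for c, d in zip(clean, distractor)]


# Synthetic stand-ins for clean speech and distractor speech.
clean = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
distractor = [math.sin(2 * math.pi * 13 * n / 100 + 0.3) for n in range(100)]

# One mixture per target SNR (dB); the values are illustrative only.
mixtures = {snr: mix_at_snr(clean, distractor, snr) for snr in (-5, 0, 5, 10)}
```

Each mixture would serve as input data, with the label derived from the corresponding clean speech.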
[0051]
[0052] The playback audio signal r[n] played by a loudspeaker 66 can be modeled by PTFs relative to each of the microphones 11~1Q at the source device, i.e., at the TWS earbud 620 in
[0053] Assuming that a microphone 12 is selected as the reference microphone in
[0054] An estimated sample {circumflex over (F)}.sub.k,j(i) for the k.sup.th frequency band is generated based on a previous estimated PTF (P.sub.k,j(i)) from the adaptive algorithm block 715, so that {circumflex over (F)}.sub.k,j(i)=P.sub.k,j(i)R.sub.k(i). Then, the adaptive algorithm block 715 updates the complex value of the current estimated PTF (P.sub.k,j(i)) for the k.sup.th frequency band according to the input sample R.sub.k(i) and the error signal e(i) so as to minimize the error signal e(i) between the sample F.sub.k,j(i) and the estimated sample {circumflex over (F)}.sub.k,j(i) for a given environment. In one embodiment, the known adaptive algorithm block 715 is implemented by the LMS algorithm to produce the complex value of the current estimated PTF block 711. However, the LMS algorithm is provided by way of example and not limitation of the invention.
[0055] In comparison with the neural network 630, the end-to-end neural network 730 (or the TDNN 731) additionally receives a number Q of PTFs (P.sub.1(i)~P.sub.Q(i)) and one more input parameter as shown in
[0056] For the AEC function, the input data for a sixth set of labeled training examples are constructed artificially by adding various playback audio data to clean speech data, and the ground truth (or labeled output) for each example in the sixth set of labeled training examples requires a frequency-domain compensation mask stream (including N mask values G.sub.1(i)~G.sub.N(i)) for corresponding clean speech data. For the playback audio data, various playback audio data played by different loudspeakers at the source devices or the sink device at different locations are collected. During the process of artificially constructing the input data, the playback audio data is mixed at different levels with the clean speech data to produce a wide range of SNRs for the sixth set of labeled training examples. The end-to-end neural network 730 is configured to use the above-mentioned six sets (from the first to the sixth sets) of labeled training examples to learn or estimate the function (i.e., the model 730), and then to update model weights using the backpropagation algorithm in combination with a cost function. In addition, in the training phase, the TDNN 731 and the FD-LSTM network 132 are jointly trained with the first, the second, the third, the fifth and the sixth sets of labeled training examples, each labeled as a corresponding frequency-domain compensation mask stream (including N mask values G.sub.1(i)~G.sub.N(i)); the TDNN 731 and the TD-LSTM network 133 are jointly trained with the fourth set of labeled training examples, each labeled as N corresponding time-domain audio sample values.
When trained, the TDNN 731 and the FD-LSTM network 132 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding frequency-domain mask values G.sub.1(i)~G.sub.N(i) for the N frequency bands, while the TDNN 731 and the TD-LSTM network 133 can process new unlabeled audio data, for example audio feature vectors, to generate N corresponding time-domain audio sample values for the current frame i of the signal u[n].
[0057] Each of the pre-processing unit 120, the IRTF estimator 610, the PTF estimator 710, the STFT 720, the end-to-end neural network 130/630/730 and the post-processing unit 150 may be implemented by software, hardware, firmware, or a combination thereof. In one embodiment, the pre-processing unit 120, the IRTF estimator 610, the PTF estimator 710, the STFT 720, the end-to-end neural network 130/630/730 and the post-processing unit 150 are implemented by at least one first processor and at least one first storage media (not shown). The at least one first storage media stores instructions/program codes operable to be executed by the at least one first processor to cause the at least one first processor to function as: the pre-processing unit 120, the IRTF estimator 610, the PTF estimator 710, the STFT 720, the end-to-end neural network 130/630/730 and the post-processing unit 150. In an alternative embodiment, the IRTF estimator 610, the PTF estimator 710, and the end-to-end neural network 130/630/730 are implemented by at least one second processor and at least one second storage media (not shown). The at least one second storage media stores instructions/program codes operable to be executed by the at least one second processor to cause the at least one second processor to function as: the IRTF estimator 610, the PTF estimator 710 and the end-to-end neural network 130/630/730.
[0058]
[0059]
[0060] Each of the audio modules 81/82 receives three audio signals from three microphones and a playback audio signal for one loudspeaker at the same TWS earbud, performs ANC function and the advanced audio signal processing including distractor suppression and AEC, and generates a time-domain digital audio signal y.sub.R[n]/y.sub.L[n]. Next, the TWS earbuds 810 and 820 respectively deliver their outputs (y.sub.R[n] and y.sub.L[n]) to the mobile phone 880 over two separate Bluetooth communication links. Finally, after receiving the two digital audio signals y.sub.R[n] and y.sub.L[n], the mobile phone 880 may deliver them to the stereo output circuit 160 for audio play, store them in a storage media, or deliver them to another sink device for audio communication via another communication link, such as WiFi.
[0061]
[0062] At first, the TWS right earbud 840 delivers three audio signals s.sub.1[n]~s.sub.3[n] from three microphones 11~13 to the TWS left earbud 830 over a Bluetooth communication link. Then, the TWS left earbud 830 feeds the playback audio signal r[n], three audio signals s.sub.4[n]~s.sub.6[n] from three microphones 14~16 and the three audio signals s.sub.1[n]~s.sub.3[n] to the audio module 83. The audio module 83 receives the six audio signals s.sub.1[n]~s.sub.6[n] and the playback audio signal r[n], performs ANC function and the advanced audio signal processing including distractor suppression and AEC, and generates a time-domain digital audio signal y[n]. Next, the TWS left earbud 830 delivers the digital audio signal y[n] to the mobile phone 880 over another Bluetooth communication link. Finally, after receiving the digital audio signal y[n], the mobile phone 880 may deliver it to the stereo output circuit 160 for audio play, store it in a storage media, or deliver it to another sink device for audio communication via another communication link, such as WiFi.
[0063]
[0064] At first, the TWS earbuds 840 and 850 respectively deliver six audio signals s.sub.1[n]~s.sub.6[n] from six microphones 11~16 to the mobile phone 890 over two separate Bluetooth communication links. Then, the mobile phone 890 feeds the six audio signals s.sub.1[n]~s.sub.6[n] to the audio module 84. The audio module 84 receives the six audio signals s.sub.1[n]~s.sub.6[n] and the playback audio signal r[n], performs ANC function and the advanced audio signal processing including distractor suppression and AEC, and generates a time-domain digital audio signal y[n]. Finally, the audio module 84 may deliver the signal y[n] to the stereo output circuit 160 for audio play. Otherwise, the mobile phone 890 may store it in a storage media or deliver it to another sink device for audio communication via another communication link, such as WiFi.
[0065]
[0066] In brief, the audio devices 800A~D including one of the audio modules 600 and 700 of the invention can suppress the distractor speech 230 as shown in
[0067]
TABLE 1
Distractor attenuation for open office headsets (single distractor): speech to distractor speech attenuation, SDR (dB)
Data source                                 Average of all angles   Minimum of all angles
MS Teams Spec., Open Office Designation     >=17                    >=14
MS Teams Spec., Premium                     >=23                    >=20
Baseline Result                             18                      17
The headset 900 of the invention            24                      20
[0068] As clearly shown in Table 1, the headset 900 passes the test because the speech to distractor ratios (SDRs) of the headset 900 are higher than the attenuation requirements, where the SDR describes the level ratio of the near end speech compared to the nearby distractor speech. The above test results prove the audio module 600/700 of the invention is capable of suppressing audio signals other than the (headset) user's speech.
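The comparison against Table 1 can be expressed as a short Python sketch. The threshold and result values are those listed in Table 1; the function names sdr_db and passes, the dictionary layout, and the definition of the SDR as 20·log10 of an RMS level ratio are assumptions made for illustration, since the specification only describes the SDR as a level ratio.

```python
import math


def sdr_db(speech_rms, distractor_rms):
    """Speech to distractor ratio in dB from two RMS levels.

    The 20*log10 RMS-ratio form is an assumption for illustration.
    """
    return 20 * math.log10(speech_rms / distractor_rms)


# (average of all angles, minimum of all angles) requirements, Table 1.
REQUIREMENTS = {"Open Office": (17, 14), "Premium": (23, 20)}

# Measured (average, minimum) SDR values in dB, Table 1.
RESULTS = {"baseline": (18, 17), "headset 900": (24, 20)}


def passes(result, designation):
    """True if both the average and the minimum SDR meet the requirement."""
    avg_sdr, min_sdr = result
    req_avg, req_min = REQUIREMENTS[designation]
    return avg_sdr >= req_avg and min_sdr >= req_min


headset_premium = passes(RESULTS["headset 900"], "Premium")
baseline_premium = passes(RESULTS["baseline"], "Premium")
```

Under these thresholds the headset 900 values satisfy both designations, whereas the baseline values satisfy only the Open Office designation.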
[0069] The above embodiments and functional operations can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The operations and logic flows described in
[0070] While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.