Methods and Systems for Providing Consistency in Noise Reduction during Speech and Non-Speech Periods
20170221501 · 2017-08-03
Inventors
Cpc classification
G10L2021/02168
PHYSICS
G10L21/02
PHYSICS
International classification
G10L25/93
PHYSICS
Abstract
Methods and systems for providing consistency in noise reduction during speech and non-speech periods are provided. First and second signals are received. The first signal includes at least a voice component. The second signal includes at least the voice component modified by human tissue of a user. First and second weights may be assigned per subband to the first and second signals, respectively. The first and second signals are processed to obtain respective first and second full-band power estimates. During periods when the user's speech is not present, the first weight and the second weight are adjusted based at least partially on the first full-band power estimate and the second full-band power estimate. The first and second signals are blended based on the adjusted weights to generate an enhanced voice signal. The second signal may be aligned with the first signal prior to the blending.
Claims
1. A method for audio processing, the method comprising: receiving a first signal including at least a voice component and a second signal including at least the voice component modified by at least a human tissue of a user, the voice component being speech of the user, the first and second signals including periods when the speech of the user is not present; assigning a first weight to the first signal and a second weight to the second signal; processing the first signal to obtain a first power estimate; processing the second signal to obtain a second power estimate; utilizing the first and second power estimates to identify the periods when the speech of the user is not present; for the periods that have been identified to be when the speech of the user is not present, performing one or both of decreasing the first weight and increasing the second weight so as to enhance the level of the second signal relative to the first signal; and blending, based on the first weight and the second weight, the first signal and the second signal to generate an enhanced voice signal.
2. The method of claim 1, further comprising: further processing the first signal to obtain a first full-band power estimate; further processing the second signal to obtain a second full-band power estimate; determining a minimum value between the first full-band power estimate and the second full-band power estimate; and based on the determination: increasing the first weight and decreasing the second weight when the minimum value corresponds to the first full-band power estimate; and increasing the second weight and decreasing the first weight when the minimum value corresponds to the second full-band power estimate.
3. The method of claim 2, wherein the increasing and decreasing is carried out by applying a shift.
4. The method of claim 3, wherein the shift is calculated based on a difference between the first full-band power estimate and the second full-band power estimate, the shift receiving a larger value for a larger difference value.
5. The method of claim 4, further comprising: prior to the increasing and decreasing, determining that the difference exceeds a pre-determined threshold; and based on the determination, applying the shift if the difference exceeds the pre-determined threshold.
6. The method of claim 1, wherein the first signal and the second signal are transformed into subband signals.
7. The method of claim 6, wherein, for the periods when the speech of the user is present, the assigning the first weight and the second weight is carried out per subband by performing the following: processing the first signal to obtain a first signal-to-noise ratio (SNR) for the subband; processing the second signal to obtain a second SNR for the subband; comparing the first SNR and the second SNR; and based on the comparison, assigning a first value to the first weight for the subband and a second value to the second weight for the subband, and wherein: the first value is larger than the second value if the first SNR is larger than the second SNR; the second value is larger than the first value if the second SNR is larger than the first SNR; and a difference between the first value and the second value depends on a difference between the first SNR and the second SNR.
8. The method of claim 1, wherein the second signal represents at least one sound captured by an internal microphone located inside an ear canal.
9. The method of claim 8, wherein the internal microphone is at least partially sealed for isolation from acoustic signals external to the ear canal.
10. The method of claim 1, wherein the first signal represents at least one sound captured by an external microphone located outside an ear canal.
11. The method of claim 1, further comprising, prior to the assigning, aligning the second signal with the first signal, the aligning including applying a spectral alignment filter to the second signal.
12. The method of claim 1, wherein the assigning of the first weight and the second weight includes: determining, based on the first signal, a first noise estimate; determining, based on the second signal, a second noise estimate; and calculating, based on the first noise estimate and the second noise estimate, the first weight and the second weight.
13. The method of claim 1, wherein the blending includes mixing the first signal and the second signal according to the first weight and the second weight.
14. A system for audio processing, the system comprising: a processor; and a memory communicatively coupled with the processor, the memory storing instructions, which, when executed by the processor, perform a method comprising: receiving a first signal including at least a voice component and a second signal including at least the voice component modified by at least a human tissue of a user, the voice component being speech of the user, the first and second signals including periods when the speech of the user is not present; assigning a first weight to the first signal and a second weight to the second signal; processing the first signal to obtain a first power estimate; processing the second signal to obtain a second power estimate; utilizing the first and second power estimates to identify the periods when the speech of the user is not present; for the periods that have been identified to be when the speech of the user is not present, performing one or both of decreasing the first weight and increasing the second weight so as to enhance the level of the second signal relative to the first signal; and blending, based on the first weight and the second weight, the first signal and the second signal to generate an enhanced voice signal.
15. The system of claim 14, wherein the method further comprises: further processing the first signal to obtain a first full-band power estimate; further processing the second signal to obtain a second full-band power estimate; determining a minimum value between the first full-band power estimate and the second full-band power estimate; and based on the determination: increasing the first weight and decreasing the second weight when the minimum value corresponds to the first full-band power estimate; and increasing the second weight and decreasing the first weight when the minimum value corresponds to the second full-band power estimate.
16. The system of claim 15, wherein the increasing and decreasing is carried out by applying a shift.
17. The system of claim 16, wherein the shift is calculated based on a difference of the first full-band power estimate and the second full-band power estimate, the shift receiving a larger value for a larger value difference.
18. The system of claim 17, further comprising: prior to the increasing and decreasing, determining that the difference exceeds a pre-determined threshold; and based on the determination, applying the shift if the difference exceeds the pre-determined threshold.
19. The system of claim 14, wherein the first signal and the second signal are transformed into subband signals.
20. The system of claim 19, wherein, for the periods when the speech of the user is present, the assigning the first weight and the second weight is carried out per subband by performing the following: processing the first signal to obtain a first signal-to-noise ratio (SNR) for the subband; processing the second signal to obtain a second SNR for the subband; comparing the first SNR and the second SNR; and based on the comparison, assigning a first value to the first weight for the subband and a second value to the second weight for the subband, and wherein: the first value is larger than the second value if the first SNR is larger than the second SNR; the second value is larger than the first value if the second SNR is larger than the first SNR; and a difference between the first value and the second value depends on a difference between the first SNR and the second SNR.
21. The system of claim 14, wherein the second signal represents at least one sound captured by an internal microphone located inside an ear canal.
22. The system of claim 21, wherein the internal microphone is at least partially sealed for isolation from acoustic signals external to the ear canal.
23. The system of claim 14, wherein the first signal represents at least one sound captured by an external microphone located outside an ear canal.
24. The system of claim 14, further comprising, prior to assigning, aligning the second signal with the first signal, the aligning including applying a spectral alignment filter to the second signal.
25. The system of claim 14, wherein the assigning the first weight and the second weight includes: determining, based on the first signal, a first noise estimate; determining, based on the second signal, a second noise estimate; and calculating, based on the first noise estimate and the second noise estimate, the first weight and the second weight.
26. A non-transitory computer-readable storage medium having embodied thereon instructions, which, when executed by at least one processor, perform steps of a method, the method comprising: receiving a first signal including at least a voice component and a second signal including at least the voice component modified by at least a human tissue of a user, the voice component being speech of the user, the first and second signals including periods when the speech of the user is not present; determining, based on the first signal, a first noise estimate; determining, based on the second signal, a second noise estimate; assigning, based on the first noise estimate and second noise estimate, a first weight to the first signal and a second weight to the second signal; processing the first signal to obtain a first power estimate; processing the second signal to obtain a second power estimate; utilizing the first and second power estimates to identify the periods when the speech of the user is not present; for the periods that have been identified to be when the speech of the user is not present, performing one or both of decreasing the first weight and increasing the second weight so as to enhance the level of the second signal relative to the first signal; and blending, based on the first weight and the second weight, the first signal and the second signal to generate an enhanced voice signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] Embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
[0016]
[0017]
[0018]
[0019]
[0020]
DETAILED DESCRIPTION
[0021] The present technology provides systems and methods for audio processing which can overcome or substantially alleviate problems associated with ineffective noise reduction during speech-absent periods. Embodiments of the present technology can be practiced on any earpiece-based audio device that is configured to receive and/or provide audio such as, but not limited to, cellular phones, MP3 players, phone handsets and headsets. While some embodiments of the present technology are described in reference to operation of a cellular phone, the present technology can be practiced with any audio device.
[0022] According to an example embodiment, the method for audio processing includes receiving a first audio signal and a second audio signal. The first audio signal includes at least a voice component. The second audio signal includes the voice component modified by at least a human tissue of a user, the voice component being speech of the user. The first and second audio signals may include periods when the speech of the user is not present. The first and second audio signals may be transformed into subband signals. The example method includes assigning, per subband, a first weight to the first audio signal and a second weight to the second audio signal. The example method includes processing the first audio signal to obtain a first full-band power estimate. The example method includes processing the second audio signal to obtain a second full-band power estimate. For the periods when the user's speech is not present (speech gaps), the example method includes adjusting, based at least partially on the first full-band power estimate and the second full-band power estimate, the first weight and the second weight. The example method also includes blending, based on the adjusted first weight and the adjusted second weight, the first audio signal and the second audio signal to generate an enhanced voice signal.
[0023] Referring now to
[0024] In various embodiments, the microphones 106 and 108 are either analog or digital. In either case, the outputs from the microphones are converted into synchronized pulse-code modulation (PCM) format at a suitable sampling frequency and connected to the input port of the digital signal processor (DSP) 112. The signals x.sub.in and x.sub.ex denote signals representing sounds captured by internal microphone 106 and external microphone 108, respectively.
[0025] The DSP 112 performs appropriate signal processing tasks to improve the quality of microphone signals x.sub.in and x.sub.ex. The output of DSP 112, referred to as the send-out signal (s.sub.out), is transmitted to the desired destination, for example, to a network or host device 116 (see signal identified as s.sub.out uplink), through a radio or wired interface 114.
[0026] If two-way voice communication is needed, a signal is received by the network or host device 116 from a suitable source (e.g., via the wireless or wired interface 114). This is referred to as the receive-in signal (r.sub.in) (identified as r.sub.in downlink at the network or host device 116). The receive-in signal can be coupled via the radio or wired interface 114 to the DSP 112 for processing. The resulting signal, referred to as the receive-out signal (r.sub.out), is converted into an analog signal through a digital-to-analog converter (DAC) 110 and then connected to a loudspeaker 118 in order to be presented to the user. In some embodiments, the loudspeaker 118 is located in the same ear canal 104 as the internal microphone 106. In other embodiments, the loudspeaker 118 is located in the ear canal opposite the ear canal 104. In example of
[0027]
[0028] In various embodiments, each ITE module 202 includes an internal microphone 106 and the loudspeaker 118 (shown in
[0029] In some embodiments, each of the BTE modules 204 and 206 includes at least one external microphone 108 (also shown in
[0030] In some embodiments, the seal of the ITE module(s) 202 is good enough to isolate acoustic waves coming from the outside acoustic environment 102. However, when speaking or singing, the user can hear the user's own voice reflected by the ITE module(s) 202 back into the corresponding ear canal. The sound of the user's voice can be distorted because, while traveling through the skull of the user, the high frequencies of the sound are substantially attenuated. Thus, the user hears mostly the low frequencies of the voice. The user's voice cannot be heard by the user outside of the earpieces since the ITE module(s) 202 isolate external sound waves.
[0031]
[0032] In the example in
[0033] By way of example and not limitation, suitable noise reduction methods are described by Ephraim and Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, December 1984, and U.S. patent application Ser. No. 12/832,901 (now U.S. Pat. No. 8,473,287), entitled “Method for Jointly Optimizing Noise Reduction and Voice Quality in a Mono or Multi-Microphone System,” filed on Jul. 8, 2010, the disclosures of which are incorporated herein by reference for all purposes.
[0034] In various embodiments, the microphone signals x.sub.in and x.sub.ex, with or without NR, and noise estimates (e.g., “external noise and SNR estimates” output from NT/NR module 302 and/or “internal noise and SNR estimates” output from NT/NR module 304) from the NT/NR modules 302 and 304 are sent to a microphone spectral alignment (MSA) module 306, where a spectral alignment filter is adaptively estimated and applied to the internal microphone signal x.sub.in. A primary purpose of MSA module 306, in the example in
[0035] The external microphone signal x.sub.ex, the spectrally-aligned internal microphone signal x.sub.in,align, and the estimated noise levels at both microphones 106 and 108 are then sent to a microphone signal blending (MSB) module 308, where the two microphone signals are intelligently combined based on the current signal and noise conditions to form a single output with optimal voice quality. The functionalities of various embodiments of the NT/NR modules 302 and 304, MSA module 306, and MSB module 308 are discussed in more detail in U.S. patent application Ser. No. 14/853,947, entitled “Microphone Signal Fusion”, filed Sep. 14, 2015.
[0036] In some embodiments, external microphone signal x.sub.ex and the spectrally-aligned internal microphone signal x.sub.in,align are blended using blending weights. In certain embodiments, the blending weights are determined in MSB module 308 based on the “external noise and SNR estimates” and the “internal noise and SNR estimates”.
[0037] For example, MSB module 308 operates in the frequency domain and determines the blending weights of the external microphone signal and the spectrally-aligned internal microphone signal in each frequency bin based on the SNR differential between the two signals in the bin. When a user's speech is present (for example, the user of headset 200 is speaking during a phone call) and the outside acoustic environment 102 becomes noisy, the SNR of the external microphone signal x.sub.ex becomes lower than the SNR of the internal microphone signal x.sub.in. Therefore, the blending weights are shifted toward the internal microphone signal x.sub.in. Because acoustic sealing tends to reduce the noise in the ear canal by 20-30 dB relative to the external environment, the shift can potentially provide 20-30 dB noise reduction relative to the external microphone signal. When the user's speech is absent, the SNRs of both internal and external microphone signals are effectively zero, so the blending weights become evenly distributed between the internal and external microphone signals. Therefore, if the outside acoustic environment is noisy, the resulting blended signal s.sub.out includes part of that noise. The blending of internal microphone signal x.sub.in and noisy external microphone signal x.sub.ex may result in 3-6 dB noise reduction, which is generally insufficient for extraneous noise conditions.
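By way of illustration only, the per-bin weighting described above can be sketched as follows. The logistic mapping, the function name snr_blend_weights, and the tuning constant slope_db are assumptions made for this sketch and are not taken from the disclosure; the sketch only preserves the stated properties that the signal with the higher SNR receives the larger weight, that the weight difference grows with the SNR difference, and that equal SNRs yield evenly distributed weights.

```python
import math


def snr_blend_weights(snr_ex_db, snr_in_db, slope_db=6.0):
    """Return one (w_ex, w_in) pair per frequency bin; each pair sums to 1.

    snr_ex_db / snr_in_db: per-bin SNR estimates in dB for the external and
    the spectrally-aligned internal microphone signals.
    slope_db: hypothetical constant controlling how fast the weights move
    as the SNR differential grows (an assumption of this sketch).
    """
    weights = []
    for ex, inn in zip(snr_ex_db, snr_in_db):
        # Logistic map of the SNR differential: a zero differential gives
        # 0.5 / 0.5, a positive differential favors the internal signal.
        w_in = 1.0 / (1.0 + math.exp(-(inn - ex) / slope_db))
        weights.append((1.0 - w_in, w_in))
    return weights
```

With equal SNRs in a bin the sketch returns weights of 0.5 and 0.5. An amplitude weight of 0.5 on the external signal scales its noise by 20·log10(0.5), about -6 dB, which is in line with the upper end of the 3-6 dB figure above when the noise in the internal signal is negligible.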
[0038] In various embodiments, the method includes utilizing differences between the power estimates for the external and the internal microphone signals for locating gaps in the speech of the user of headset 200. In certain embodiments, for the gap intervals, the blending weight for the external microphone signal is decreased or set to zero and the blending weight for the internal microphone signal is increased or set to one before blending of the internal microphone and external microphone signals. Thus, during the gaps in the user's speech, the blending weights are biased toward the internal microphone signal, according to various embodiments. As a result, the resulting blended signal contains a lesser amount of the external microphone signal and, therefore, a lesser amount of noise from the outside acoustic environment. When the user is speaking, the blending weights are determined based on the “noise and SNR estimates” of the internal and external microphone signals. Blending the signals during the user's speech improves the quality of the signal. For example, the blending of the signals can improve the quality of signals delivered to the far-end talker during a phone call or to an automatic speech recognition system by the radio or wired interface 114.
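By way of illustration only, one possible reading of the gap handling described above is sketched below. The detection rule (a frame is treated as a speech gap when the external full-band power exceeds the internal full-band power by more than a margin), the margin gap_threshold_db, and both function names are assumptions of this sketch; the disclosure states only that differences between the power estimates are used to locate the gaps.

```python
def is_speech_gap(p_ex_db, p_in_db, gap_threshold_db=15.0):
    """Flag a frame as a speech gap (hypothetical rule).

    Assumption: with a sealed ear canal and no user speech, external noise
    dominates, so external power exceeds internal power by a wide margin.
    gap_threshold_db is an illustrative value, not one from the text.
    """
    return (p_ex_db - p_in_db) > gap_threshold_db


def bias_for_gap(w_ex, w_in, gap):
    """Bias the blending weights fully toward the internal signal in a gap.

    Outside gaps the weights pass through unchanged. Setting the weights to
    zero and one is the limiting case named in the text; a partial
    decrease / increase would also fit the description.
    """
    if gap:
        return 0.0, 1.0
    return w_ex, w_in
```

For example, bias_for_gap(0.5, 0.5, is_speech_gap(-20.0, -45.0)) yields weights of 0.0 and 1.0, so the blended output for that frame carries none of the external microphone signal.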
[0039] In various embodiments, DSP 112 includes a microphone power spread (MPS) module 310 as shown in
[0040] In various embodiments, the MPS module 310 generates microphone power spread (MPS) estimates for the internal microphone signal and external microphone signal. The MPS estimates are provided to MSB module 308. In certain embodiments, the MPS estimates are used for a supplemental control of microphone signal blending. In some embodiments, MSB module 308 applies a global bias toward the microphone signal with significantly lower full-band power, for example, by increasing the weights for that microphone signal and decreasing the weights for the other microphone signal (i.e., shifting the weights toward the microphone signal with significantly lower full-band power) before the two microphone signals are blended.
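By way of illustration only, the global bias toward the microphone signal with significantly lower full-band power can be sketched as a shift that is applied only when the power difference exceeds a threshold and that grows with the difference. The linear ramp, the function name apply_mps_shift, and the constants threshold_db and full_shift_db are assumptions of this sketch, and the input weights are assumed to sum to one.

```python
def apply_mps_shift(w_ex, w_in, p_ex_db, p_in_db,
                    threshold_db=10.0, full_shift_db=30.0):
    """Shift (w_ex, w_in) toward the microphone with lower full-band power.

    threshold_db: hypothetical minimum power difference before any shift.
    full_shift_db: hypothetical difference at which the shift saturates.
    """
    diff = p_ex_db - p_in_db
    if abs(diff) <= threshold_db:
        return w_ex, w_in  # powers are comparable: leave the weights alone
    # Shift in [0, 1]; a larger power difference gives a larger shift.
    shift = min(1.0, (abs(diff) - threshold_db) / (full_shift_db - threshold_db))
    if diff > 0:
        # Internal signal has the lower power: move weight toward it.
        new_in = w_in + shift * (1.0 - w_in)
        return 1.0 - new_in, new_in
    # External signal has the lower power: move weight toward it.
    new_ex = w_ex + shift * (1.0 - w_ex)
    return new_ex, 1.0 - new_ex
```

With these illustrative constants, a 40 dB spread in favor of the internal microphone moves evenly distributed weights to 0.0 and 1.0, while a 5 dB spread leaves them unchanged.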
[0041]
[0042] In block 404, method 400 can proceed with assigning a first weight to the first audio signal and a second weight to the second audio signal. In some embodiments, prior to assigning the first weight and the second weight, the first audio signal and the second audio signal are transformed into subband signals and, therefore, the assigning of the weights may be performed per subband. In some embodiments, the first weight and the second weight are determined based on noise estimates in the first audio signal and the second audio signal. In certain embodiments, when the user's speech is present, the first weight and the second weight are assigned based on subband SNR estimates in the first audio signal and the second audio signal.
[0043] In block 406, method 400 can proceed with processing the first audio signal to obtain a first full-band power estimate. In block 408, method 400 can proceed with processing the second audio signal to obtain a second full-band power estimate. In block 410, during speech gaps when the user's speech is not present, the first weight and the second weight may be adjusted based, at least partially, on the first full-band power estimate and the second full-band power estimate. In some embodiments, if the first full-band power estimate is less than the second full-band power estimate, the weights are shifted toward the first audio signal, that is, the first weight is increased and the second weight is decreased. If the second full-band power estimate is less than the first full-band power estimate, the weights are shifted toward the second audio signal, that is, the second weight is increased and the first weight is decreased.
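By way of illustration only, a full-band power estimate of the kind used in blocks 406 and 408 can be sketched as the mean-square level of a frame expressed in dB, optionally smoothed over time. The disclosure does not specify the estimator; the function names, the floor value, and the smoothing factor alpha (applied here in the dB domain) are assumptions of this sketch.

```python
import math


def full_band_power_db(frame, floor=1e-12):
    """Mean-square power of one frame of PCM samples, in dB.

    floor: hypothetical lower bound that avoids log10(0) on silent frames.
    """
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, floor))


def smooth_db(prev_db, new_db, alpha=0.9):
    """First-order recursive smoothing of successive per-frame estimates.

    alpha: hypothetical smoothing factor; values closer to 1 react more slowly.
    """
    return alpha * prev_db + (1.0 - alpha) * new_db
```

Computing this estimate for the first and second audio signals on each frame gives the two values whose comparison determines the direction of the weight adjustment in block 410.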
[0044] In block 412, the first signal and the second signal can be used to generate an enhanced voice signal by being blended together based on the adjusted first weight and the adjusted second weight.
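By way of illustration only, the blending of block 412 can be sketched as a per-subband weighted sum of the two signals. The function name blend_subbands and the representation of each signal as a list of per-subband values (real or complex) are assumptions of this sketch.

```python
def blend_subbands(x_ex, x_in_aligned, weights):
    """Blend two subband signals for one frame.

    x_ex: per-subband values of the first (external) signal.
    x_in_aligned: per-subband values of the second (internal) signal after
        alignment with the first signal.
    weights: one (w_ex, w_in) pair per subband, already adjusted.
    """
    if not (len(x_ex) == len(x_in_aligned) == len(weights)):
        raise ValueError("all inputs must have one entry per subband")
    return [
        w_ex * ex + w_in * inn
        for ex, inn, (w_ex, w_in) in zip(x_ex, x_in_aligned, weights)
    ]
```

A subband whose weights are 0.0 and 1.0 passes only the aligned internal signal, which corresponds to the fully biased case during speech gaps.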
[0045]
[0046] The components shown in
[0047] Mass data storage 530, which can be implemented with a magnetic disk drive, solid state drive, or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by processor unit(s) 510. Mass data storage 530 stores the system software for implementing embodiments of the present disclosure for purposes of loading that software into main memory 520.
[0048] Portable storage device 540 operates in conjunction with a portable non-volatile storage medium, such as a flash drive, floppy disk, compact disk, digital video disc, or Universal Serial Bus (USB) storage device, to input and output data and code to and from the computer system 500 of
[0049] User input devices 560 can provide a portion of a user interface. User input devices 560 may include one or more microphones, an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, stylus, or cursor direction keys. User input devices 560 can also include a touchscreen. Additionally, the computer system 500 as shown in
[0050] Graphics display system 570 includes a liquid crystal display (LCD) or other suitable display device. Graphics display system 570 is configurable to receive textual and graphical information and process the information for output to the display device.
[0051] Peripheral devices 580 may include any type of computer support device to add additional functionality to the computer system.
[0052] The components provided in the computer system 500 of
[0053] The processing for various embodiments may be implemented in software that is cloud-based. In some embodiments, the computer system 500 is implemented as a cloud-based computing environment, such as a virtual machine operating within a computing cloud. In other embodiments, the computer system 500 may itself include a cloud-based computing environment, where the functionalities of the computer system 500 are executed in a distributed fashion. Thus, the computer system 500, when configured as a computing cloud, may include pluralities of computing devices in various forms, as will be described in greater detail below.
[0054] In general, a cloud-based computing environment is a resource that typically combines the computational power of a large grouping of processors (such as within web servers) and/or that combines the storage capacity of a large grouping of computer memories or storage devices. Systems that provide cloud-based resources may be utilized exclusively by their owners or such systems may be accessible to outside users who deploy applications within the computing infrastructure to obtain the benefit of large computational or storage resources.
[0055] The cloud may be formed, for example, by a network of web servers that comprise a plurality of computing devices, such as the computer system 500, with each server (or at least a plurality thereof) providing processor and/or storage resources. These servers may manage workloads provided by multiple users (e.g., cloud resource customers or other users). Typically, each user places workload demands upon the cloud that vary in real-time, sometimes dramatically. The nature and extent of these variations typically depend on the type of business associated with the user.
[0056] The present technology is described above with reference to example embodiments. Therefore, other variations upon the example embodiments are intended to be covered by the present disclosure.