Adaptive multi-microphone beamforming
10366701 · 2019-07-30
CPC classification
G10L2021/02165
PHYSICS
G10L21/0264
PHYSICS
International classification
G10L15/20
PHYSICS
Abstract
Provided is a method and computer program product for producing an enhanced audio signal for an output device from audio signals received by two or more microphones in close proximity to each other. For example, one embodiment of the present invention comprises the steps of receiving a first input audio signal from the first microphone, digitizing the first input audio signal to produce a first digitized audio input signal, receiving a second input audio signal from the second microphone, digitizing the second input audio signal to produce a second digitized audio input signal, using the first digitized audio input signal as a reference signal to an adaptive prediction filter, using the second digitized audio input signal as input to said adaptive prediction filter, and finally adding a prediction result signal from the adaptive prediction filter to the first digitized audio input signal to produce the enhanced audio signal. In other embodiments, any number of microphones can be used, and in all embodiments there is no requirement to detect or locate the source or direction of arrival of the input audio signals.
Claims
1. A method for producing an amplified enhanced audio signal for an output device from audio signals received by a first and a second microphone in close proximity to each other, said method comprising the steps of: receiving a first input audio signal from the first microphone; digitizing said first input audio signal to produce a first digitized audio input signal; receiving a second input audio signal from the second microphone; digitizing said second input audio signal to produce a second digitized audio input signal; using said first digitized audio input signal as an input to a first adaptive prediction filter and as a reference to a second adaptive prediction filter; using said second digitized audio input signal as an input to said second adaptive prediction filter and as a reference to said first adaptive prediction filter; adding a prediction result signal from said first adaptive prediction filter to said second digitized audio input signal to produce a second enhanced audio signal; adding a prediction result signal from said second adaptive prediction filter to said first digitized audio input signal to produce a first enhanced audio signal; applying said first enhanced audio signal as an input to a third adaptive prediction filter; applying said second enhanced audio signal as a reference to said third adaptive prediction filter; adding a prediction result from said third adaptive prediction filter to said second enhanced audio signal to form said amplified enhanced audio signal; and outputting said amplified enhanced audio signal to an output device.
2. The method of claim 1, further comprising the steps of: comparing said first enhanced audio signal to said second enhanced audio signal to determine a stronger signal and a weaker signal; and using said stronger signal as said reference signal and said weaker signal as said input signal in said applying steps.
Description
DETAILED DESCRIPTION
(11) The present invention may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of hardware components or software elements configured to perform the specified functions. For example, the present invention may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices. In addition, those skilled in the art will appreciate that the present invention may be practiced in conjunction with any number of data and voice transmission protocols, and that the system described herein is merely one exemplary application for the invention.
(12) It should be appreciated that the particular implementations shown and described herein are illustrative of the invention and its best mode and are not intended to otherwise limit the scope of the present invention in any way. Indeed, for the sake of brevity, conventional techniques for signal processing, data transmission, signaling, packet-based transmission, network control, and other functional aspects of the systems (and components of the individual operating components of the systems) may not be described in detail herein, but are readily known by skilled practitioners in the relevant arts. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in a practical communication system.
(14) Referring now to
(15) Referring now to
(16) Thus, because the two noise signals 321 and 322 remain uncorrelated, their sum does not double the sample values in the output signal 330, as the voice signal from the talker 301 does. Adding the two uncorrelated noise signals together therefore only doubles the noise energy, a noise level increase of 3 dB.
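The arithmetic behind this paragraph and paragraph (18) can be written out as follows. The symbols are introduced here for illustration only: the talker's speech $s$ arrives perfectly time-aligned at both microphones with power $P_s$, and the noise signals $n_1$ and $n_2$ are zero-mean, mutually uncorrelated, and each of power $P_n$.

$$
\begin{aligned}
E\big[(s+s)^2\big] &= 4P_s &&\Rightarrow\ 10\log_{10}4 \approx 6\ \text{dB (speech)}\\
E\big[(n_1+n_2)^2\big] &= P_n + P_n + 2E[n_1 n_2] = 2P_n &&\Rightarrow\ 10\log_{10}2 \approx 3\ \text{dB (noise)}\\
\Delta\mathrm{SNR} &= 10\log_{10}\frac{4P_s/(2P_n)}{P_s/P_n} = 10\log_{10}2 &&\approx 3\ \text{dB}
\end{aligned}
$$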
(18) In an ideal case, with a speech signal level increase of 6 dB and a noise level increase of 3 dB, the maximum SNR gain of a two-microphone delay-sum beamforming approach is 3 dB. However, as previously mentioned, this traditional method requires extremely accurate knowledge of the talker's location in order to calculate the exact time delay required to create a perfectly correlated speech signal. As would be appreciated by persons skilled in the art, it is often very difficult to detect a talker's location accurately and precisely. When such location information is inaccurate or unavailable, as is often the case when the talker is not stationary, the performance of such traditional beamforming systems and methods is dramatically reduced.
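To illustrate this sensitivity to the assumed talker location, the following is a minimal sketch of a two-microphone delay-and-sum beamformer with an integer-sample steering delay. The white-noise "speech" stand-in, the 3-sample true delay, and the function names `delay_and_sum` and `speech_gain_db` are assumptions made for illustration, not details from the patent. With the correct delay the speech component doubles (about 6 dB); with a 2-sample steering error the white test signal adds incoherently and the gain falls to roughly the 3 dB that uncorrelated noise also receives, so the SNR benefit disappears.

```python
import math
import random


def delay_and_sum(mic1, mic2, delay):
    """Advance mic2 by `delay` samples to line it up with mic1, then add."""
    n = len(mic1) - delay
    return [mic1[i] + mic2[i + delay] for i in range(n)]


def speech_gain_db(mic1, mic2, delay):
    """Power of the beamformer output relative to mic1 alone, in dB."""
    out = delay_and_sum(mic1, mic2, delay)
    p_out = sum(v * v for v in out)
    p_in = sum(v * v for v in mic1[:len(out)])
    return 10.0 * math.log10(p_out / p_in)


rng = random.Random(0)
s = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # talker signal (white stand-in)
true_delay = 3                                    # hypothetical mic1 -> mic2 delay
mic1 = s
mic2 = [0.0] * true_delay + s[:-true_delay]       # same talker, arriving later

gain_ok = speech_gain_db(mic1, mic2, true_delay)       # steering delay correct
gain_bad = speech_gain_db(mic1, mic2, true_delay + 2)  # steering delay off by 2
print(f"correct delay: {gain_ok:.2f} dB, wrong delay: {gain_bad:.2f} dB")
```

Real speech is correlated from sample to sample, so a small steering error degrades it more gradually than this white test signal, but the trend is the same.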
(19) Another difficulty with traditional delay-sum beamforming is that, due to design constraints such as required product size and other form-factor considerations, the microphones are not necessarily aligned in a straight line. This makes the talker's location even more difficult to estimate and therefore further limits the applicability of traditional methods. These types of problems are illustrated in
(20) As shown by the examples depicted in
(21) The present invention alleviates the problems found in traditional microphone beamforming methods and systems by not requiring any determination of the direction of arrival of the audio sources. Further, because the orientation of the device and the placement of the microphones are irrelevant, the present invention works equally well under all conditions and may be implemented with less complexity than traditional methods.
(23) In general, as stated by the above-referenced article, an adaptive filter is a filter that self-adjusts its transfer function according to an optimizing algorithm, adapting its performance based on the input signal. Such filters incorporate algorithms that allow the filter coefficients to adapt to the signal statistics. Adaptive techniques use algorithms that enable the adaptive filter to adjust its parameters to produce an output that matches the output of an unknown system. This algorithm employs an individual convergence factor that is updated for each adaptive filter coefficient at each iteration.
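A minimal sketch of such a filter follows, using the normalized least-mean-squares (NLMS) update as a stand-in. NLMS is a common choice, but it is an assumption here: it uses a single step size `mu` shared by all coefficients rather than the individually updated per-coefficient convergence factor described above. The class name `NLMSPredictor`, the tap count, and the step size are hypothetical. The demo identifies a hypothetical 4-tap "unknown system" `h` from its input and output.

```python
import random


class NLMSPredictor:
    """Adaptive FIR filter with a normalized LMS update (illustrative stand-in).

    step(x_n, d_n) pushes the newest input sample x_n, predicts the reference
    sample d_n, adapts the weights to reduce the prediction error, and returns
    the prediction made before adapting.
    """

    def __init__(self, num_taps=8, mu=0.5, eps=1e-8):
        self.w = [0.0] * num_taps  # filter coefficients
        self.x = [0.0] * num_taps  # input history, newest sample first
        self.mu = mu               # single step size shared by all taps
        self.eps = eps             # regularizer to avoid division by zero

    def step(self, x_n, d_n):
        self.x.insert(0, x_n)
        self.x.pop()
        y = sum(wk * xk for wk, xk in zip(self.w, self.x))  # prediction
        e = d_n - y                                          # prediction error
        g = self.mu * e / (self.eps + sum(xk * xk for xk in self.x))
        self.w = [wk + g * xk for wk, xk in zip(self.w, self.x)]
        return y


# Identify a hypothetical unknown FIR system h from its input and output.
h = [0.5, -0.3, 0.2, 0.1]
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(3000)]
d = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0) for n in range(len(x))]

f = NLMSPredictor(num_taps=8, mu=0.5)
for x_n, d_n in zip(x, d):
    f.step(x_n, d_n)

print([round(wk, 3) for wk in f.w])  # first four taps approach h, the rest approach 0
```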
(24) As shown in
(25) Referring back now to
(26) The audio signal from the second microphone input 604 is digitized by the A/D converter 608 to become the second input speech signal 682, which is the input to the adaptive prediction module 670. The prediction result 692 is subtracted from the reference signal 691 to obtain the prediction error 693. This prediction error 693 is then used to drive the adaptive prediction module 670, which adapts so as to minimize the prediction error. The sum of the first input speech signal 681 and the prediction result signal 692 forms the desired output signal 680, which is output to an output device such as a speaker, headphones or the like. Adding such highly correlated signals together results in an output signal 680 with an approximate amplification of 2.
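The two-microphone structure of this paragraph can be sketched as follows: the reference is the first microphone signal, an adaptive filter predicts it from the second microphone signal, and the output is the reference plus the prediction. The NLMS update, tap count, step size, the names `NLMSPredictor` and `adaptive_beamform_2mic`, and the toy geometry (talker nearer microphone 2, so microphone 1 hears a copy delayed by 2 samples) are assumptions, not details from the patent. No delay or direction of arrival is supplied; the filter finds the alignment itself.

```python
import random


class NLMSPredictor:
    """Adaptive FIR predictor with an NLMS update (illustrative stand-in)."""

    def __init__(self, num_taps=8, mu=0.5, eps=1e-8):
        self.w = [0.0] * num_taps
        self.x = [0.0] * num_taps  # input history, newest first
        self.mu, self.eps = mu, eps

    def step(self, x_n, d_n):
        """Predict reference d_n from inputs up to x_n, adapt, return prediction."""
        self.x.insert(0, x_n)
        self.x.pop()
        y = sum(wk * xk for wk, xk in zip(self.w, self.x))
        g = self.mu * (d_n - y) / (self.eps + sum(xk * xk for xk in self.x))
        self.w = [wk + g * xk for wk, xk in zip(self.w, self.x)]
        return y


def adaptive_beamform_2mic(ref, inp, num_taps=8, mu=0.5):
    """Output = reference + adaptive prediction of the reference from `inp`."""
    f = NLMSPredictor(num_taps, mu)
    return [r + f.step(x, r) for r, x in zip(ref, inp)]


rng = random.Random(1)
s = [rng.gauss(0.0, 1.0) for _ in range(4000)]  # talker signal (white stand-in)
mic2 = s                                        # input signal (cf. 682)
mic1 = [0.0, 0.0] + s[:-2]                      # reference signal (cf. 681)
out = adaptive_beamform_2mic(mic1, mic2)

# After convergence the output is about twice the reference (about +6 dB).
err = max(abs(out[i] - 2.0 * mic1[i]) for i in range(3000, 4000))
print(f"max deviation from 2 x reference over the last 1000 samples: {err:.2e}")
```

This toy case relies on the reference lagging the input: with a white test signal, a causal filter cannot predict samples that have not yet arrived at the other microphone.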
(27) Note that the examples herein use speech signals (such as input speech signals 681 and 682) as the desired type of signal enhanced by an embodiment of the present invention. However, in other embodiments, any type of audio signal, such as music signals and the like, can be enhanced by the improved techniques described herein without departing from the scope and breadth of the present invention.
(29) In
(30) Similarly, the digitized second input speech signal 722 is used as reference for the first adaptive prediction module 771, which takes the digitized first input speech signal 721 as input to produce an optimized prediction result signal 731 that minimizes the prediction error between the reference signal 722 and the prediction result signal 731. The sum of 731 and 722 forms the second enhanced signal 741.
(31) The second enhanced signal 741 is used as the reference signal for a second level of prediction according to an example embodiment of the present invention. The first enhanced signal 742 is the input to the third adaptive prediction module 773, which produces an optimized prediction result 733 by minimizing the prediction error between the second enhanced signal 741 and the prediction result 733. Finally, the sum of 741 and 733 is the desired output signal 798, which is subsequently output to an output audio device.
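The two-stage structure described above and in claim 1 can be sketched as follows. The first filter (module 771) predicts the second microphone signal from the first, the second filter predicts the first microphone signal from the second, and the third filter (module 773) predicts the second enhanced signal from the first enhanced signal. The NLMS update, all parameters, and the names `NLMSPredictor` and `two_stage_beamform` are assumptions. The test signal is a hypothetical broadside case: the talker is equidistant from both microphones, and microphone 2 simply has 0.8 times the gain of microphone 1.

```python
import random


class NLMSPredictor:
    """Adaptive FIR predictor with an NLMS update (illustrative stand-in)."""

    def __init__(self, num_taps=8, mu=0.5, eps=1e-8):
        self.w = [0.0] * num_taps
        self.x = [0.0] * num_taps  # input history, newest first
        self.mu, self.eps = mu, eps

    def step(self, x_n, d_n):
        """Predict reference d_n from inputs up to x_n, adapt, return prediction."""
        self.x.insert(0, x_n)
        self.x.pop()
        y = sum(wk * xk for wk, xk in zip(self.w, self.x))
        g = self.mu * (d_n - y) / (self.eps + sum(xk * xk for xk in self.x))
        self.w = [wk + g * xk for wk, xk in zip(self.w, self.x)]
        return y


def two_stage_beamform(mic1, mic2, num_taps=8, mu=0.5):
    """Cross-prediction stage followed by a second prediction stage."""
    first = NLMSPredictor(num_taps, mu)   # input: mic1, reference: mic2 (771)
    second = NLMSPredictor(num_taps, mu)  # input: mic2, reference: mic1
    third = NLMSPredictor(num_taps, mu)   # input: e_first, reference: e_second (773)
    out = []
    for x1, x2 in zip(mic1, mic2):
        e_second = x2 + first.step(x1, x2)  # second enhanced signal (741)
        e_first = x1 + second.step(x2, x1)  # first enhanced signal (742)
        out.append(e_second + third.step(e_first, e_second))  # output (798)
    return out


rng = random.Random(2)
s = [rng.gauss(0.0, 1.0) for _ in range(4000)]
mic1 = s                     # first digitized input (cf. 721)
mic2 = [0.8 * v for v in s]  # second digitized input (cf. 722), lower gain
out = two_stage_beamform(mic1, mic2)

# Each stage doubles the aligned speech, so the output approaches 4 x mic2.
err = max(abs(out[i] - 4.0 * mic2[i]) for i in range(3000, 4000))
print(f"max deviation from 4 x mic2 over the last 1000 samples: {err:.2e}")
```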
(32) It should be noted that in this example embodiment, it is assumed that there is a high level of consistency between the first input signal 721 and the second input signal 722. As such, in this example, the second enhanced signal 741 is selected to act as the reference signal to the third adaptive prediction module 773. Indeed, in most cases where the microphones that make up the microphone array are closely spaced relative to each other, this consistency is expected. However, in order to minimize any negative effects from inconsistent inputs and to maximize the performance of the present invention, another stage may be added to the embodiment shown in
(33) As shown in
(34) In this example, the better or stronger signal is identified in the first step 702, for example, the signal with the highest energy, or the signal that best satisfies other criteria as discussed above. Once this determination is made, the better signal is used as the reference signal and the other, weaker signal is used as the input signal to the third adaptive prediction module 773. In particular, if it is determined in step 702 that signal 742 is better than signal 741, then, as shown in step 704, signal 742 is used as the reference signal and signal 741 is used as the input signal to the adaptive prediction module 773. Similarly, if signal 741 is better than (or equal to) signal 742, then, as shown in step 703, signal 741 is the reference signal and signal 742 is the input signal to the adaptive prediction module 773. In practice, if the signals are equivalent and neither one is better or stronger than the other, then it makes no difference which signal is used as the reference signal and which is used as the input signal.
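The selection in steps 702 to 704 can be sketched as a small function that compares the energies of the two enhanced signals. Using energy over a whole block of samples as the criterion, and the name `choose_reference`, are illustrative assumptions; the text leaves the criterion and the span of samples it covers open.

```python
def choose_reference(sig_741, sig_742):
    """Return (reference, input) for the third adaptive prediction module.

    Energy is used as the "better signal" criterion (one option the text names).
    Ties keep sig_741 as the reference, matching step 703 (741 >= 742).
    """
    e741 = sum(v * v for v in sig_741)
    e742 = sum(v * v for v in sig_742)
    if e742 > e741:              # step 704: 742 is stronger
        return sig_742, sig_741
    return sig_741, sig_742      # step 703: 741 is stronger or equal


ref, inp = choose_reference([0.1, -0.2, 0.1], [0.5, 0.4, -0.6])
print(ref)  # [0.5, 0.4, -0.6]
```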
(35) In yet another embodiment of the present invention, this technique of
(37) The digitized second microphone input is the input speech signal 872 that is the input to the second adaptive prediction module 878. Adaptive prediction module 878 functions to minimize the prediction error signal 894 between the reference signal 851 and the prediction result 882. As shown and indicated by the ellipses in
(38) Finally, the sum of the first input speech signal 831 (also the reference signal) and the prediction result signals (such as 882 and 883) of each of the N−1 adaptive prediction filter modules forms the desired output signal 898, which is output to an output device.
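The N-microphone generalization can be sketched as one adaptive predictor per non-reference microphone, with every prediction added to the reference. The NLMS update, parameters, the names `NLMSPredictor`, `n_mic_beamform` and `delayed`, and the toy four-microphone geometry (the reference microphone hears the talker last) are assumptions for illustration.

```python
import random


class NLMSPredictor:
    """Adaptive FIR predictor with an NLMS update (illustrative stand-in)."""

    def __init__(self, num_taps=8, mu=0.5, eps=1e-8):
        self.w = [0.0] * num_taps
        self.x = [0.0] * num_taps  # input history, newest first
        self.mu, self.eps = mu, eps

    def step(self, x_n, d_n):
        """Predict reference d_n from inputs up to x_n, adapt, return prediction."""
        self.x.insert(0, x_n)
        self.x.pop()
        y = sum(wk * xk for wk, xk in zip(self.w, self.x))
        g = self.mu * (d_n - y) / (self.eps + sum(xk * xk for xk in self.x))
        self.w = [wk + g * xk for wk, xk in zip(self.w, self.x)]
        return y


def n_mic_beamform(mics, num_taps=8, mu=0.5):
    """mics[0] is the reference; output = reference + sum of N-1 predictions."""
    ref = mics[0]
    filters = [NLMSPredictor(num_taps, mu) for _ in mics[1:]]
    out = []
    for n, r in enumerate(ref):
        out.append(r + sum(f.step(m[n], r) for f, m in zip(filters, mics[1:])))
    return out


def delayed(sig, d):
    """Hypothetical helper: delay a signal by d samples, zero-filling the start."""
    return [0.0] * d + sig[:len(sig) - d]


rng = random.Random(3)
s = [rng.gauss(0.0, 1.0) for _ in range(4000)]
mics = [delayed(s, 3), s, delayed(s, 1), delayed(s, 2)]  # toy array, N = 4
out = n_mic_beamform(mics)

# With N aligned copies the output approaches N x reference.
err = max(abs(out[n] - 4.0 * mics[0][n]) for n in range(3000, 4000))
print(f"max deviation from 4 x reference: {err:.2e}")
```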
(39) In yet another embodiment of the present invention, the technique of
(40) The present invention may be implemented using hardware, software or a combination thereof and may be implemented in a computer system or other processing system. Computers and other processing systems come in many forms, including wireless handsets, portable music players, infotainment devices, tablets, laptop computers, desktop computers and the like. In fact, in one embodiment, the invention is directed toward a computer system capable of carrying out the functionality described herein. An example computer system 901 is shown in
(41) Computer system 901 also includes a main memory 906, preferably random access memory (RAM), and can also include a secondary memory 908. The secondary memory 908 can include, for example, a hard disk drive 910 and/or a removable storage drive 912, representing a magnetic disk or tape drive, an optical disk drive, etc. The removable storage drive 912 reads from and/or writes to a removable storage unit 914 in a well-known manner. Removable storage unit 914 represents magnetic or optical media, such as disks or tapes, which are read by and written to by removable storage drive 912. As will be appreciated, the removable storage unit 914 includes a computer usable storage medium having stored therein computer software and/or data.
(42) In alternative embodiments, secondary memory 908 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 901. Such means can include, for example, a removable storage unit 922 and an interface 920. Examples include a USB flash drive and interface, a program cartridge and cartridge interface (such as that found in video game devices), other types of removable memory chips and associated sockets, such as SD memory and the like, and other removable storage units 922 and interfaces 920 that allow software and data to be transferred from the removable storage unit 922 to computer system 901.
(43) Computer system 901 can also include a communications interface 924. Communications interface 924 allows software and data to be transferred between computer system 901 and external devices. Examples of communications interface 924 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 924 are in the form of signals that can be electronic, electromagnetic, optical or other signals capable of being received by communications interface 924. These signals 926 are provided to communications interface 924 via a channel 928. This channel 928 carries signals 926 and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link such as WiFi or cellular, and other communications channels.
(44) In this document, the terms computer program medium and computer usable medium are used to generally refer to media such as removable storage drive 912, a hard disk installed in hard disk drive 910, and signals 926. These computer program products are means for providing software or code to computer system 901.
(45) Computer programs (also called computer control logic or code) are stored in main memory 906 and/or secondary memory 908. Computer programs can also be received via communications interface 924. Such computer programs, when executed, enable the computer system 901 to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 904 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 901.
(46) In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 901 using removable storage drive 912, hard drive 910 or communications interface 924. The control logic (software), when executed by the processor 904, causes the processor 904 to perform the functions of the invention as described herein.
(47) In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
(48) In yet another embodiment, the invention is implemented using a combination of both hardware and software.
(49) While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.