METHOD, APPARATUS, AND COMPUTER-READABLE MEDIA FOR FOCUSSING SOUND SIGNALS IN A SHARED 3D SPACE
20170347217 · 2017-11-30
CPC classification: H04S2400/15 (ELECTRICITY)
Abstract
Focusing sound signals in a shared 3D space uses an array of physical microphones, preferably disposed evenly across a room to provide even sound coverage throughout the room. At least one processor coupled to the physical microphones does not form beams; instead, it preferably forms thousands of virtual microphone bubbles within the room. By determining the processing gain of the sound signals sourced at each of the bubbles, the location(s) of the sound source(s) in the room can be determined. This system not only improves sound by focusing on the sound source(s), but also focuses on a desired sound source more effectively (rather than steering to it) while un-focusing undesired sound sources (such as reverb and noise) instead of rejecting out-of-beam signals. This provides a full three-dimensional location and a more natural presentation of each sound within the room.
Claims
1. A method of focusing combined sound signals from a plurality of physical microphones in order to determine a processing gain for each of a plurality of virtual microphone locations in a shared 3D space, comprising: defining, by at least one processor, a plurality of virtual microphone bubbles in the shared 3D space, each bubble having location coordinates in the shared 3D space, each bubble corresponding to a virtual microphone; receiving, by the at least one processor, sound signals from the plurality of physical microphones in the shared 3D space; determining, by the at least one processor, a processing gain at each of the plurality of virtual microphone bubble locations, based on a received combination of sound signals sourced from each virtual microphone bubble location in the shared 3D space; identifying, by the at least one processor, a sound source in the shared 3D space, based on the determined processing gains, the sound source having coordinates in the shared 3D space; focusing, by the at least one processor, combined signals from the plurality of physical microphones to the sound source coordinates by adjusting a weight and a delay for signals received from each of the plurality of physical microphones; and outputting, by the at least one processor, a plurality of streamed signals comprising (i) real-time location coordinates, in the shared 3D space, of the sound source, and (ii) sound source processing gain values associated with each virtual microphone bubble in the shared 3D space.
2. The method according to claim 1, wherein there are at least four bubble locations disposed in a 3D array in the shared 3D space, and wherein the coordinates in the shared 3D space are defined in (x,y,z) coordinates.
3. The method according to claim 1, wherein a largest processing gain among the bubbles corresponds to a location of the sound source.
4. The method according to claim 1, wherein plural sound sources are within the shared 3D space, and wherein the output plurality of streamed signals includes (i) real-time location coordinates, in the shared 3D space, of each of the plurality of sound sources, and (ii) sound source processing gain values associated with the virtual microphone bubbles, for each of the sound sources in the shared 3D space.
5. The method according to claim 1, wherein the output processing gain values increase with increases in magnitude of direct sound from the sound source relative to the reverb and noise in the shared 3D space.
6. The method according to claim 1, wherein the plurality of virtual microphone bubbles includes more than one hundred microphone bubbles.
7. The method according to claim 1, wherein the processing gain comprises a signal strength when the plurality of virtual microphones is focused on the sound source divided by the signal strength when the plurality of virtual microphones is not focused on the sound source.
8. The method according to claim 1, wherein the at least one processor determines an expected propagation delay from each virtual microphone to each physical microphone.
9. The method according to claim 1, wherein the at least one processor (i) samples the signals from the plurality of physical microphones at the same time and at a fixed rate, (ii) conditions and aligns the samples in time and weights the amplitude of each sample, and (iii) combines the conditioned and aligned samples.
10. Apparatus configured to focus combined sound signals from a plurality of physical microphones in order to determine a processing gain for each of a plurality of virtual microphone locations in a shared 3D space, each of the plurality of physical microphones being configured to receive sound signals in a shared 3D space, the apparatus comprising: at least one processor configured to: define a plurality of virtual microphone bubbles in the shared 3D space, each bubble having location coordinates in the shared 3D space, each bubble corresponding to a virtual microphone; receive sound signals from the plurality of physical microphones in the shared 3D space; determine a processing gain at each of the plurality of virtual microphone bubble locations, based on a received combination of sound signals sourced from each virtual microphone bubble location in the shared 3D space; identify a sound source in the shared 3D space, based on the determined processing gains, the sound source having coordinates in the shared 3D space; focus combined signals from the plurality of physical microphones to the sound source coordinates by adjusting a weight and a delay for signals received from each of the plurality of physical microphones; and output a plurality of streamed signals comprising (i) real-time location coordinates, in the shared 3D space, of the sound source, and (ii) sound source processing gain values associated with each virtual microphone bubble in the shared 3D space.
11. The apparatus according to claim 10, wherein the at least one processor defines four bubble locations in a 3D array in the shared 3D space, and wherein the coordinates in the shared 3D space are defined in (x,y,z) coordinates.
12. The apparatus according to claim 10, wherein the at least one processor determines a location of the sound source as corresponding to a largest processing gain among the bubbles.
13. The apparatus according to claim 10, wherein plural sound sources are within the shared 3D space, and wherein the at least one processor provides the output plurality of streamed signals which include (i) real-time location coordinates, in the shared 3D space, of each of the plurality of sound sources, and (ii) sound source processing gain values associated with the virtual microphone bubbles, for each of the sound sources in the shared 3D space.
14. The apparatus according to claim 10, wherein the at least one processor provides output processing gain values which increase with increases in magnitude of direct sound from the sound source relative to the reverb and noise in the shared 3D space.
15. The apparatus according to claim 10, wherein the at least one processor defines more than one hundred microphone bubbles.
16. The apparatus according to claim 10, wherein the processing gain comprises a signal strength when the plurality of virtual microphones is focused on the sound source divided by the signal strength when the plurality of virtual microphones is not focused on the sound source.
17. The apparatus according to claim 10, wherein the at least one processor determines an expected propagation delay from each virtual microphone to each physical microphone.
18. The apparatus according to claim 10, wherein the at least one processor (i) samples the signals from the plurality of physical microphones at the same time and at a fixed rate, (ii) conditions and aligns the samples in time and weights the amplitude of each sample, and (iii) combines the conditioned and aligned samples.
19. The apparatus according to claim 10, wherein the at least one processor comprises a microphone processor and a bubble processor.
20. A program embodied in a non-transitory computer readable medium for focusing combined sound signals from a plurality of physical microphones in order to determine a processing gain for each of a plurality of virtual microphone locations in a shared 3D space, said program comprising instructions causing at least one processor to: define a plurality of virtual microphone bubbles in the shared 3D space, each bubble having location coordinates in the shared 3D space, each bubble corresponding to a virtual microphone; receive sound signals from the plurality of physical microphones in the shared 3D space; determine a processing gain at each of the plurality of virtual microphone bubble locations, based on a received combination of sound signals sourced from each virtual microphone bubble location in the shared 3D space; identify a sound source in the shared 3D space, based on the determined processing gains, the sound source having coordinates in the shared 3D space; focus combined signals from the plurality of physical microphones to the sound source coordinates by adjusting a weight and a delay for signals received from each of the plurality of physical microphones; and output a plurality of streamed signals comprising (i) real-time location coordinates, in the shared 3D space, of the sound source, and (ii) sound source processing gain values associated with each virtual microphone bubble in the shared 3D space.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED EXEMPLARY EMBODIMENTS
[0033] The present invention is directed to systems and methods that enable groups of people (participants) to join together over a network, such as the Internet or a similar electronic channel, in a remotely distributed, real-time fashion. Employing personal computers, network workstations, or other similarly connected appliances, and without face-to-face contact, participants can engage in effective audio conference meetings that utilize large multi-user rooms (spaces) with distributed participants.
[0034] Advantageously, embodiments of the present invention utilize the time domain to provide systems and methods that give remote participants the capability to focus an in-room, multi-user microphone array on a desired speaking participant and/or sound source. The present invention may be applied to any one or more shared spaces having multiple microphones, both for focusing sound-source pickup and for simulating a local sound recipient for a remote listening participant.
[0035] Focusing the microphone array preferably comprises optimizing the microphone array to maximize the processing gain at the targeted virtual microphone (X,Y,Z) position, increasing the magnitude of the desired sound source while maintaining a constant ambient noise level in the shared space, resulting in a natural audio experience. It is specifically not the process of switching microphones, and/or steering microphone beamformer array(s) to provide constant gain within the on-axis beam while rejecting off-axis signals, which results in an unnatural audio experience and inconsistent ambient noise performance.
[0036] A notable challenge to picking up sound clearly in a room, cabin, or confined space is the multipath environment, in which the sound wave arrives both directly and via many reflected paths. If the microphone is in close proximity to the source, the direct path is very much stronger than the reflected paths and dominates the signal, giving a very clean sound. In the present invention, however, it is desirable to place the microphones unobtrusively and away from the sound source, on the walls or ceiling, to keep them out of the way of the participants and occupants.
[0040] To derive the processing gains 308, the volume of the room where sound pickup is desired is preferably divided into a large number of virtual microphone positions.
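The division of the room volume into virtual microphone positions can be sketched as a regular 3-D grid. This is a minimal numpy illustration, not the patent's implementation: the room dimensions and the 32×32×8 grid shape are assumptions chosen only so that the point count matches the 8192 bubbles used elsewhere in this description.

```python
import numpy as np

def make_bubble_grid(room_dims=(6.4, 6.4, 4.0), shape=(32, 32, 8)):
    """Divide a room volume into a regular 3-D grid of virtual
    microphone (bubble) positions.  Room dimensions (metres) and grid
    shape are illustrative assumptions, not taken from the patent."""
    nx, ny, nz = shape
    xs = np.linspace(0.0, room_dims[0], nx)
    ys = np.linspace(0.0, room_dims[1], ny)
    zs = np.linspace(0.0, room_dims[2], nz)
    # meshgrid -> (3, nx, ny, nz); flatten to one (x, y, z) row per bubble
    grid = np.array(np.meshgrid(xs, ys, zs, indexing="ij"))
    return grid.reshape(3, -1).T        # shape (nx*ny*nz, 3)

bubbles = make_bubble_grid()            # 32*32*8 = 8192 positions
```

With this layout, adjacent grid points along x are about 0.2 m apart, which is consistent with the roughly 200 mm bubble spacing discussed later in this description.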
[0041] The flow chart in
[0042] The signal components 320 from each microphone element processor 301 are summed at node 304 to provide the combined microphone array 205 signal for each of the 8192 bubbles. Each bubble signal is preferably converted into a power signal at node 305 by squaring the signal samples. The power signals are then preferably summed over a given time window by the 8192 accumulators at node 307; the sums represent the signal energy over that time period.
[0043] The processing gain for each bubble is preferably calculated at node 308 by dividing the energy of each bubble signal by the energy of an ideal unfocused signal 322. The unfocused signal energy 322 is preferably calculated by summing 319 the energies of the signals from each microphone element 318 over the given time window, each weighted by the square of its maximum-ratio-combining weight; this is the energy that would be expected if all of the signals were uncorrelated.
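The energy summation and division described above can be sketched for a single bubble as follows. This is a hedged numpy illustration of nodes 304-308 and 318-322; the function name, array shapes, and the assumption that time-alignment has already been applied are mine, not the patent's.

```python
import numpy as np

def processing_gain(aligned, weights):
    """Processing-gain sketch for one bubble.

    aligned : (n_mics, n_samples) microphone signals, already delayed so
              that sound from this bubble's position is time-aligned
    weights : (n_mics,) maximal-ratio-combining weights
    """
    # Node 304: sum the weighted, aligned signals into one bubble signal.
    combined = (weights[:, None] * aligned).sum(axis=0)
    # Nodes 305/307: square and accumulate -> focused signal energy.
    focused_energy = np.sum(combined ** 2)
    # Nodes 318/319/322: expected energy if the signals were
    # uncorrelated (per-microphone energies, weights squared).
    unfocused_energy = np.sum((weights[:, None] ** 2) * aligned ** 2)
    # Node 308: processing gain = focused / unfocused energy.
    return focused_energy / unfocused_energy
```

For 12 identical (perfectly aligned) signals this returns a gain of 12; for 12 uncorrelated signals it returns a value near 1, matching the coherent-versus-orthogonal behaviour described in the next paragraph.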
[0044] Processing gain is achieved because signals from a common sound source all experience the same delay before being combined, so they add coherently: their amplitudes sum. If 12 equal-amplitude, time-aligned direct signals 101 are combined, the resulting signal has an amplitude 12× higher, or a power level 144× higher. Signals from different sources, and signals from the same source with significantly different delays, such as reverb 202 and noise 203, do not add coherently and do not experience the same gain. In the extreme, the signals are completely uncorrelated and add orthogonally: if 12 equal-amplitude orthogonal signals are added, the result has roughly 12× the power of one original signal, or about a 3.46× increase in amplitude (measured as RMS). The ratio between the 12× amplitude gain of the direct signal 101 and the 3.46× gain of the reverb 202 and noise 203 signals is the net processing gain (approximately 3.46, or 10.8 dB) of the microphone array 205 when it is focused on the sound source 107. This makes the signal sound as if the microphone 108 has moved 3.46× closer to the sound source. This example used a 12-microphone array 205, but it extends to an arbitrary number N of microphones, giving a maximum possible processing gain of sqrt(N), or 10 log(N) dB.
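The coherent-versus-orthogonal arithmetic above can be checked numerically. The sketch below uses white-noise signals purely as stand-ins for the direct signal 101 and for uncorrelated reverb/noise 202/203; the signal lengths and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 12                       # microphones in the example
length = 100_000             # samples; long enough for stable statistics

# Coherent case: 12 time-aligned copies of the same direct signal 101.
direct = rng.standard_normal(length)
coherent = sum(direct for _ in range(n))
coherent_gain = np.std(coherent) / np.std(direct)   # amplitudes add: 12x

# Incoherent case: 12 uncorrelated unit-RMS signals (reverb/noise).
incoherent = sum(rng.standard_normal(length) for _ in range(n))
incoherent_gain = np.std(incoherent)                # ~sqrt(12) ~ 3.46x

# Net processing gain of the focused array, in dB.
net_gain_db = 20 * np.log10(coherent_gain / incoherent_gain)  # ~10.8 dB
```

The coherent sum gains 12× in amplitude while the incoherent sum gains only about 3.46×, reproducing the roughly 11 dB net processing gain stated above.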
[0045] The bubble processor system 300 preferably simultaneously focuses the microphone array 205 on 8192 points 402 in 3-D space using the method described above. The energy level of a short burst of sound signal (50-100 ms) is measured at each of the 8192 virtual microphone bubble points 402 and compared to the energy level that would be expected if the signals combined orthogonally, giving the processing gain 308 at each point. The virtual microphone bubble 402 closest to the sound source 107 should experience the highest processing gain and appear as a peak in the output; once that peak is determined, the location 403 is known.
[0046] Node 306 preferably searches through the output of the processing gain unit 308 for the bubble with the highest processing gain; the (x,y,z) location of that bubble identifies the sound source location 403.
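The peak search at node 306 is a straightforward argmax over the per-bubble gains. A minimal sketch, with hypothetical coordinates and gain values chosen only for illustration:

```python
import numpy as np

def find_sound_source(gains, bubble_coords):
    """Node 306 sketch: return the (x, y, z) position and gain of the
    bubble with the highest processing gain.

    gains         : (8192,) processing gain per bubble (output of 308)
    bubble_coords : (8192, 3) the (x, y, z) position of each bubble
    """
    best = int(np.argmax(gains))
    return bubble_coords[best], gains[best]

# Hypothetical example: a gain peak at bubble index 1234.
coords = np.zeros((8192, 3))
coords[1234] = (2.0, 3.0, 1.2)
gains = np.ones(8192)
gains[1234] = 9.5
location, peak = find_sound_source(gains, coords)
```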
[0048] The Mic Element Processor 301 operates as follows.
[0049] It might be expected that reflected signals 202 would be de-correlated from the direct signal 101, since they travel a greater distance and are time-shifted relative to the desired direct signal 101. This is not true in practice: signals shifted by a small amount of time retain some correlation to each other, and what counts as a "small amount of time" depends on the frequency of the signal. Low-frequency signals de-correlate with delay much more slowly than high-frequency signals; they spread themselves over many sample points and make it hard to find the source of the sound. For this reason, it is preferable to filter off as much of the low-frequency signal as possible without losing the signal itself. High-frequency signals pose the opposite problem: they de-correlate too fast. Since there cannot be an infinite number of virtual microphone bubbles 402 in the space, there must be some significant distance between them, say 200 mm. The focus volume of each virtual microphone bubble 402 becomes smaller as the frequency increases, because a tiny shift in delay has more of an effect. If the bubble volumes get too small, a sound source may fall between two sample points and be lost. By restricting the high-frequency components, the virtual microphone bubbles 402 will preferably be big enough that sound sources 309 are not missed by a sample point in the processing algorithm. The signal is preferably filtered and passed to the microphone delay line function 3011.
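The band-limiting described above can be sketched with two first-order sections: a one-pole high-pass to remove the slowly de-correlating low frequencies, and a one-pole low-pass to restrict the high frequencies. The cutoff frequencies below are illustrative assumptions; the patent does not specify them or the filter topology.

```python
import numpy as np

def band_limit(x, fs=12000, f_lo=200.0, f_hi=4000.0):
    """Rough band-pass sketch (cutoffs are assumptions): suppress low
    frequencies that smear the bubble search and high frequencies that
    de-correlate too fast for the ~200 mm bubble spacing."""
    a_lo = np.exp(-2 * np.pi * f_lo / fs)   # pole for the high-pass leg
    a_hi = np.exp(-2 * np.pi * f_hi / fs)   # pole for the low-pass leg
    lp_state = out_state = 0.0
    y = np.empty_like(x)
    for i, s in enumerate(x):
        # Track the slow (low-frequency) content and subtract it.
        lp_state = a_lo * lp_state + (1 - a_lo) * s
        hp = s - lp_state                   # high-passed sample
        # One-pole low-pass to attenuate content above f_hi.
        out_state = a_hi * out_state + (1 - a_hi) * hp
        y[i] = out_state
    return y
```

DC and very low frequencies are driven toward zero, while a mid-band tone (e.g. 1 kHz at a 12 kHz sample rate) passes with only mild attenuation.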
[0050] A delay line 3011 is preferably provided for each microphone element 108, holding its recent signal samples.
[0051] A counter 3015, preferably clocked at more than 8192 times the microphone sample rate, counts bubble positions from 0 to 8191 and supplies this count as the index to the two lookup tables 3012 and 3014. The output of the bubble delay lookup table 3012 is preferably used to choose the tap of the delay line 3011 with the corresponding delay for that bubble. That sample is then preferably multiplied 3013 by the weight read from the weight lookup table 3014. For each sample input to the microphone element processor 301, 8192 samples are output 3018, each corresponding to the signal component for a particular virtual microphone bubble 402 in relation to that microphone element 108.
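The counter-and-lookup loop above can be sketched in software for one input sample period of one microphone. The function name and array shapes are mine; in the actual embodiment this inner loop runs in hardware, as described later.

```python
import numpy as np

def mic_element_outputs(delay_line, bubble_delay_lut, weight_lut):
    """One Mic Element Processor 301 step: produce the 8192 per-bubble
    signal components 3018 for this microphone's current sample period.

    delay_line       : 1-D array of recent samples for this mic (3011)
    bubble_delay_lut : (8192,) integer tap index per bubble (3012)
    weight_lut       : (8192,) combining weight per bubble (3014)
    """
    out = np.empty(8192)
    for bubble in range(8192):                     # counter 3015: 0..8191
        tap = bubble_delay_lut[bubble]             # fetch delay (3012)
        sample = delay_line[tap]                   # read delay line (3011)
        out[bubble] = sample * weight_lut[bubble]  # multiply by weight (3013)
    return out

# Toy usage: every bubble taps sample index 5 with unit weight.
demo = mic_element_outputs(np.arange(100.0),
                           np.full(8192, 5),
                           np.ones(8192))
```

Summing these outputs across the twelve microphones at node 304 yields the combined signal for each bubble.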
[0052] The second method by which the array may be used to improve the direct signal strength is to apply a specific weight to the output of each microphone element 108. Because the microphones 108 are not co-located, the direct sound 101 does not arrive at each microphone 108 with equal amplitude: the amplitude drops as 1/r 110, and the distance r is different for each combination of microphone 108 and virtual microphone bubble 402. This creates a problem, as mixing weaker signals 310 into the output at the same level as stronger signals 310 can actually introduce more noise 203 and reverb 202 into the system 300 than leaving them out. Maximal ratio combining is the preferable way of combining the signals 304: simply put, each signal in the combination should be weighted 3014 in proportion to the amplitude of its signal component, which yields the highest signal-to-noise ratio. Since the distance that each direct path 101 travels from each bubble position 402 to each microphone 108 is known, and since the 1/r law is known, the optimum weighting 3014 can be calculated for each microphone 108 at each of the 8192 virtual microphone points 402.
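The 1/r weighting can be sketched directly from the known geometry. The normalization step is an assumption of this sketch (maximal-ratio combining fixes only the relative weights, so the overall scale is a free choice):

```python
import numpy as np

def mrc_weights(mic_positions, bubble_position):
    """Maximal-ratio-combining weight sketch: the direct-path amplitude
    from a bubble to each microphone falls off as 1/r (110), so each
    microphone's signal is weighted in proportion to that amplitude.

    mic_positions   : (n_mics, 3) physical microphone coordinates
    bubble_position : (3,) virtual microphone bubble coordinates
    """
    r = np.linalg.norm(mic_positions - bubble_position, axis=1)
    w = 1.0 / r                  # amplitude proportional to 1/r
    return w / w.sum()           # normalization is an arbitrary choice
```

A microphone twice as far from the bubble receives half the relative weight, so weak, noisier pickups contribute proportionally less to the combined signal.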
[0054] The present embodiment is designed with a target time delay, D, 30117.
[0055] The challenge now is how to compute the 8192 sample points in real time, so that the system can pick up a sound source and focus on it as it happens; this is very computation- and memory-bandwidth-intensive. For each microphone, at each virtual microphone bubble point 402 in the room, there are five simple operations: fetch the required delay 3012 to add to this path, fetch the required weight 3014, fetch the signal from the delay line 3011, multiply the signal by the weight 3013, and add the result to the total signal 304. This embodiment is implemented for 12 microphones 205, at each of the 8192 virtual microphone sample points 402, at a base sample frequency of 12 kHz. The total operation count is 12×8192×12000×5 operations = 5.9 billion operations per second. The rest of the calculation (filters, power calculation, peak finding, etc.) is still large but insignificant compared to this number. While this operation count is possible on a high-end computer system, it is not economical. The process is therefore preferably implemented on a field-programmable gate array (FPGA) or, equivalently, an ASIC. On the FPGA, a processor core preferably performs all five basic operations in parallel in a single clock cycle, and twelve copies of the processor core are preferably provided, one for each microphone, to allow sufficient processing capability. This system can then compute 60 operations in parallel while operating at a modest clock rate of 100 MHz. A small DSP processor is preferably used for filtering and final array processing.
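The operation-count arithmetic above can be verified directly; all constants here come from the preceding paragraph:

```python
n_mics = 12          # microphones in the array 205
n_bubbles = 8192     # virtual microphone sample points 402
fs = 12_000          # base sample rate, Hz
ops_per_path = 5     # fetch delay, fetch weight, fetch sample,
                     # multiply, accumulate

# Total software operation rate: ~5.9 billion operations per second.
ops_per_second = n_mics * n_bubbles * fs * ops_per_path

# With 12 cores each doing the 5 operations per clock (60 ops/clock),
# the required FPGA clock rate is modest, just under 100 MHz.
clock_hz = ops_per_second // (n_mics * ops_per_path)
```

The exact figures are 5,898,240,000 operations per second and a 98.304 MHz clock, consistent with the "5.9 billion" and "modest clock rate of 100 MHz" stated above.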
[0060] The individual components shown in outline or designated by blocks in the attached Drawings are all well-known in the electronic processing arts, and their specific construction and operation are not critical to the operation or best mode for carrying out the invention.
[0061] While the present invention has been described with respect to what is presently considered to be the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. To the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.