Sound source localization using phase spectrum
09817100 · 2017-11-14
Assignee
Inventors
- Shankar Regunathan (Redmond, WA)
- Kazuhito Koishida (Redmond, WA)
- Harshavardhana Narayana Kikkeri (Bellevue, WA)
CPC classification
G01S3/8006
PHYSICS
G01S3/82
PHYSICS
International classification
G01S3/808
PHYSICS
G01S3/82
PHYSICS
Abstract
An array of microphones placed on a mobile robot provides multiple channels of audio signals. A received set of audio signals is called an audio segment, which is divided into multiple frames. A phase analysis is performed on a frame of the signals from each pair of microphones. If both microphones are in an active state during the frame, a candidate angle is generated for each such pair of microphones. The result is a list of candidate angles for the frame. This list is processed to select a final candidate angle for the frame. The list of candidate angles is tracked over time to assist in the process of selecting the final candidate angle for an audio segment.
Claims
1. A process for sound source localization with a plurality of pairs of microphones with known spatial relationship, comprising: receiving signals from the plurality of pairs of microphones into a memory as a plurality of frames; processing each frame of the signals from the plurality of pairs of microphones to identify when the received signals are active in the frame; computing frequency spectrum data for each frame of the received signals; for each pair of active signals in a frame, determining a candidate angle of sound arrival on the plurality of pairs of microphones using the frequency spectrum data; and selecting, for a current frame, an angle of sound arrival on the plurality of pairs of microphones from among the candidate angles determined for the current frame, by: tracking a history of candidate angles determined for multiple frames; updating the history based on the candidate angles determined for the current frame; and selecting, as the angle for the current frame, an angle from the history having a phase distortion less than or equal to a minimum phase distortion of the candidate angles and similar to a highest ranked candidate angle determined for the current frame.
2. The computer-implemented process of claim 1, wherein processing, computing, determining and selecting are performed on a per frame basis.
3. The computer-implemented process of claim 1, wherein selecting the angle from the history is further based on the selected candidate angle having a presence score greater than or equal to a maximum presence score of candidate angles in the history.
4. The process of claim 1, wherein the history comprises, for each candidate angle, a phase distortion, a presence score and a presence counter.
5. The process of claim 4, wherein updating the history comprises: for candidate angles in the history other than the selected candidate angle for the current frame, decrementing the presence counter for the candidate angles.
6. The process of claim 5, wherein updating the history comprises: for the selected candidate angle for the current frame, incrementing a presence counter for the selected candidate angle.
7. The process of claim 6, further comprising, in response to a determination that a presence counter for a selected candidate angle for the current frame exceeds a threshold, reporting the selected candidate angle for the current frame as a detected angle of sound arrival on the plurality of pairs of microphones.
8. The process of claim 4 wherein updating the history comprises: for a target candidate angle in the history having a lowest phase distortion, updating the presence score for the target candidate angle based on a candidate angle for the current frame having an angle similar to the target candidate angle.
9. The process of claim 4 wherein updating the history comprises: for the target candidate angle in the history having a lowest phase distortion, updating the phase distortion for the target candidate angle based on a candidate angle for the current frame having an angle similar to the target candidate angle.
10. The process of claim 4 wherein updating the history comprises: for the target candidate angle in the history having a lowest phase distortion, updating the angle of the target candidate angle based on a candidate angle for the current frame having an angle similar to the target candidate angle.
11. A computing machine comprising: a memory; an input for receiving signals from a plurality of pairs of microphones into the memory as a plurality of frames; a processing unit configured to process each frame of the received signals from the plurality of pairs of microphones to identify when the received signals are active in the frame and to compute frequency spectrum data for each frame of the received signals; wherein the processing unit is further configured to, for each pair of active signals in a frame, determine a candidate angle of sound arrival on the plurality of pairs of microphones using the frequency spectrum data, and to select, for a current frame, an angle of sound arrival on the plurality of pairs of microphones from among the candidate angles determined for the current frame, by: tracking a history of candidate angles determined for multiple frames; updating the history based on the candidate angles determined for the current frame; and selecting, as the angle for the current frame, an angle from the history having a phase distortion less than or equal to a minimum phase distortion of the candidate angles and similar to a highest ranked candidate angle determined for the current frame.
12. The computing machine of claim 11, wherein the processing unit is configured to process the signals on a per frame basis.
13. The computing machine of claim 11, wherein to select the angle from the history, the processing unit is further configured to select the candidate angle having a presence score greater than or equal to a maximum presence score of candidate angles in the history.
14. The computing machine of claim 11, wherein the history comprises, for each candidate angle, a phase distortion, a presence score and a presence counter.
15. The computing machine of claim 14, wherein updating the history comprises: for candidate angles in the history other than the selected candidate angle for the current frame, decrementing the presence counter for the candidate angles.
16. The computing machine of claim 15, wherein updating the history comprises: for the selected candidate angle for the current frame, incrementing a presence counter for the selected candidate angle.
17. The computing machine of claim 16, further comprising, in response to a determination that a presence counter for a selected candidate angle for the current frame exceeds a threshold, reporting the selected candidate angle for the current frame as a detected angle of sound arrival on the plurality of pairs of microphones.
18. The computing machine of claim 14 wherein updating the history comprises: for a target candidate angle in the history having a lowest phase distortion, updating the presence score for the target candidate angle based on a candidate angle for the current frame having an angle similar to the target candidate angle.
19. The computing machine of claim 14 wherein updating the history comprises: for the target candidate angle in the history having a lowest phase distortion, updating the phase distortion for the target candidate angle based on a candidate angle for the current frame having an angle similar to the target candidate angle.
20. The computing machine of claim 14 wherein updating the history comprises: for the target candidate angle in the history having a lowest phase distortion, updating the angle of the target candidate angle based on a candidate angle for the current frame having an angle similar to the target candidate angle.
Description
DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(8) The following section provides an example operating environment in which sound source localization can be implemented.
(9) Referring to
(10) While
(11) While
(12) In one application, the sound source can be a human speaker 104. The human speaker produces sounds 106 when speaking. Other sound sources can be detected, however, in this application the robot can be designed to interact with a human speaker and determining a location for the human speaker can be used as an aid in this interaction.
(13) Given this context, an example implementation of sound source localization will now be described in more detail in connection with
(14) In
(15) Referring now to
(16) A preprocessor 302 receives the input audio signals 300 and performs a variety of operations on the signals to prepare them for analysis.
(17) Such preprocessing can include a DC removal filter, which suppresses undesirable components at very low frequencies before subsequent processing. An example implementation is a first-order finite impulse response (FIR) filter, applied channel by channel. The output of the filter is computed as
x_{c,i}(n) = x_{c,i}(n) − 0.97 · x_{c,i}(n−1)

where n = 0, 1, . . . , N−1, c = 0, 1, . . . , C−1, and x_{c,i}(−1) is the last sample in the previous frame, i.e., x_{c,i}(−1) = x_{c,i−1}(N−1).
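As an illustration, the filter above can be sketched in a few lines of NumPy; the function name and vectorized form are ours, not the patent's:

```python
import numpy as np

def dc_removal(frame, prev_last_sample=0.0, alpha=0.97):
    """First-order DC-removal filter: y(n) = x(n) - alpha * x(n-1).

    `prev_last_sample` is x_{c,i-1}(N-1), the last sample of the
    previous frame, used in place of x(-1) for the current frame.
    """
    frame = np.asarray(frame, dtype=float)
    # Build the one-sample-delayed signal, seeded from the previous frame.
    shifted = np.concatenate(([prev_last_sample], frame[:-1]))
    return frame - alpha * shifted

# A constant (DC) input is strongly attenuated: every output sample
# becomes 1 - 0.97 = 0.03.
y = dc_removal(np.ones(8), prev_last_sample=1.0)
```

The filter runs independently on each of the C channels, so in practice it would be applied once per channel per frame.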
(18) Another example of preprocessing is applying a Hamming window. The Hamming window h(n) is multiplied across two frames, previous and current, and a C-channel windowed signal is generated:

(19)

x_{c,i}(n) = h(n) · [x_{c,i−1}, x_{c,i}](n), n = 0, 1, . . . , 2N−1

where [x_{c,i−1}, x_{c,i}] denotes the previous and current frames concatenated. The windowed signal, x_{c,i}(n), contains 2N samples.
(20) By applying the Hamming window, the separability of neighborhood spectrum elements can be improved so that the phase analysis described below performs better.
(21) The output 304 of the preprocessing stage, in this example the output of the Hamming window, is then input to a fast Fourier transform (FFT) processor 306. The output of the FFT processor is frequency domain data 308. In this example implementation, an FFT with size 2N can be applied to x_{c,i}(n) to obtain the complex values of the frequency spectrum X_{c,i}(k) for each channel. Because of the mirror-image property, X_{c,i}(k) has unique values in the range of frequency bins k = 0, 1, . . . , K (=N/2), which corresponds to 0, 8000/K, . . . , 8000 Hz, so the spectrum within that range is processed in the subsequent operations.
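A minimal sketch of this windowing and FFT stage, assuming the previous and current frames are concatenated before windowing (the helper name is hypothetical):

```python
import numpy as np

def windowed_spectrum(prev_frame, cur_frame):
    """Apply a Hamming window across the previous and current frames
    (2N samples total) and return the one-sided spectrum.

    np.fft.rfft of a size-2N real signal returns its N+1
    non-redundant complex bins, exploiting the mirror-image property.
    """
    x = np.concatenate([prev_frame, cur_frame])  # 2N samples
    w = np.hamming(len(x))                       # Hamming window h(n)
    return np.fft.rfft(x * w)

N = 256
X = windowed_spectrum(np.zeros(N), np.ones(N))
```

This would run once per channel per frame, producing the per-channel spectra X_{c,i}(k) consumed by the phase analysis.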
(22) The frequency domain data 308 can be subjected to further post processing for a variety of applications, such as speech recognition, as indicated by the post processing module 310. The invention is neither limited by, nor requires, such post processing.
(23) Finally, in this example implementation, the frequency domain data 308 and the input signals 300 are input to a sound source localizer 312, described in more detail below, to produce, for each frame i, the angle of sound arrival θ_i for the i-th frame.
(24) Referring now to
(25) The input audio signals 400 are input to an activity detector 402, which outputs data indicative of whether the audio signal is active in a given frame. An example implementation of the activity detector is the following:
(26) The log energy of the c-th channel at the i-th frame is computed by:

(27)

E_{c,i} = log Σ_{n=0}^{N−1} x_{c,i}²(n)

where x_{c,i}(n) is the corresponding PCM input. In an initial few frames, E_{c,i} is accumulated and its average is used to set a noise floor E^Floor_{c,i}. The noise floor of each channel is periodically forced to be replaced with a good candidate from the past few seconds. After this initial stage, the following condition is tested on a per-channel basis to determine channel activity:

E^Floor_{c,i} = min(E^Floor_{c,i−1}, E_{c,i})

E_{c,i} > E^Floor_{c,i} + E^Offset, c = 0, 1, . . . , C−1
(28) In other words, the activity detector 402 determines, in this implementation, if the log energy of the c-th channel at the i-th frame is greater than the noise floor of the channel, plus an offset. If this condition is true, then a channel state is set to active, otherwise the channel state is set to pause. Next, a frame state is obtained by combining channel states. The frame state is set to active if a few channels are detected as active, such as at least three; otherwise the frame state is set to pause.
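The per-channel and per-frame tests above can be sketched as follows; the "at least three channels" rule is taken from the description, while the exact log-energy scaling, the offset value, and the function names are assumptions:

```python
import numpy as np

def channel_active(frame, floor, offset_db=10.0):
    """Per-channel activity test: a channel is active when its log
    energy exceeds a tracked noise floor plus an offset.

    Returns (is_active, updated_floor). The floor is the running
    minimum of the log energy, as in E_floor = min(prev_floor, E).
    """
    energy = 10.0 * np.log10(np.sum(np.square(frame)) + 1e-12)
    floor = min(floor, energy)
    return energy > floor + offset_db, floor

def frame_active(channel_flags, min_active=3):
    """Frame state is active when at least `min_active` channels are
    active; otherwise the frame state is pause."""
    return sum(channel_flags) >= min_active
```

In a real deployment the floor would also be periodically refreshed from recent frames, as the description notes, so that a rising noise level does not permanently pin the floor low.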
(29) If the activity detector 402 sets a Pause frame state, as determined at 404, then the sound localizer uses the second angle tracker updater 414 (described below) and returns that the angle is undetected for that frame.
(30) If the activity detector 402 sets an active frame state, as determined at 404, then a phase analyzer 406 processes the frequency domain data (e.g., the frequency domain data 308 described above). For a pre-defined channel pair <c_a, c_b>, the cross-power spectrum is computed as:

R_{<c_a,c_b>,i}(k) = X_{c_a,i}(k) · X*_{c_b,i}(k)

and its phase spectrum is given by:

(31)

P_{<c_a,c_b>,i}(k) = ∠ R_{<c_a,c_b>,i}(k)

(32) The best sample lag between channels c_a and c_b is estimated by searching a lag range for the minimum phase distortion against theoretical values:

(33)

l_{<c_a,c_b>,i} = argmin_l D_{<c_a,c_b>,i}(l), −L_{<c_a,c_b>} ≤ l ≤ L_{<c_a,c_b>}

where D_{<c_a,c_b>,i}(l) is the phase distortion between the measured phase spectrum P_{<c_a,c_b>,i}(k) and the theoretical phase 2πkl/2N of lag l.

(34) In the foregoing, the lag range of two channels L_{<c_a,c_b>} is determined by the microphone geometry:

(35)

L_{<c_a,c_b>} = ⌈|d_{<c_a,c_b>}| · F_s / S⌉

where ⌈·⌉ is the ceiling operation, |d_{<c_a,c_b>}| is the distance between the two microphones, F_s is the sampling frequency, and S is the speed of sound.

(36) Two mirror-imaged angles (with respect to the virtual line between the two microphones, as shown in the drawings) are obtained from the estimated lag:

(37)

θ_{<c_a,c_b>,i} = ±arccos( l_{<c_a,c_b>,i} · S / (F_s · |d_{<c_a,c_b>}|) )

(38) Since these angles are relative to the position of channel pair <c_a, c_b>, they are compensated by the global angle of the channel pair Φ_{<c_a,c_b>}:

θ'_{<c_a,c_b>,i} = θ_{<c_a,c_b>,i} + Φ_{<c_a,c_b>}
(39) The above phase analysis procedure is repeated for the pre-defined channel pairs to obtain an initial list of candidates with the estimated angles and their phase distortions. Thus, the output of the phase analysis module 406 is a list of candidate angles and their phase distortions.
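A simplified sketch of the per-pair phase analysis: search candidate lags for the minimum phase distortion against the theoretical phase of each lag, then convert the best lag into two mirror-imaged angles. The specific distortion measure (sum of squared wrapped phase differences) and the speed-of-sound constant are our assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def pair_candidate_angles(Xa, Xb, mic_distance, fs, fft_size):
    """Estimate the best sample lag for one microphone pair and return
    the two mirror-imaged candidate angles (degrees) plus the minimum
    phase distortion found during the lag search.
    """
    phase = np.angle(Xa * np.conj(Xb))   # phase spectrum of cross-power
    k = np.arange(len(Xa))
    # Lag range from microphone spacing: L = ceil(d * Fs / S)
    max_lag = int(np.ceil(mic_distance * fs / SPEED_OF_SOUND))
    best_lag, best_dist = 0, np.inf
    for lag in range(-max_lag, max_lag + 1):
        theory = 2.0 * np.pi * k * lag / fft_size
        # Wrapped phase difference used as the distortion measure.
        diff = np.angle(np.exp(1j * (phase - theory)))
        dist = np.sum(diff ** 2)
        if dist < best_dist:
            best_lag, best_dist = lag, dist
    # Lag -> angle: cos(theta) = lag * S / (Fs * d); clip for safety.
    cos_t = np.clip(best_lag * SPEED_OF_SOUND / (fs * mic_distance),
                    -1.0, 1.0)
    theta = np.degrees(np.arccos(cos_t))
    return (theta, -theta), best_dist    # mirror-imaged pair
```

For a signal delayed by an integer number of samples between the two channels, this search recovers that lag exactly (within the geometric range limit), and the distortion at the true lag is essentially zero.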
(40) A candidate selection module 408 processes the list of candidate angles to refine it toward a selected angle. There are a variety of ways in which to refine this list. For example, if two or more angles are similar to, i.e., within some threshold of, each other, those angles are merged into one angle in the list, such as by averaging them, with a discounted phase distortion of D_min/M, where D_min is the minimum phase distortion among the similar angles and M is the number of similar angles. In this refinement, isolated angles are eliminated and more commonly occurring angles are assigned a smaller distortion. As a result, in later-stage processing, which selects an angle based on its distortion, the more commonly occurring angle is more likely to be selected.
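The merging rule can be sketched as below; the 10° similarity threshold is an assumed value, and the greedy grouping is one of several reasonable implementations:

```python
def merge_similar_angles(candidates, threshold=10.0):
    """Merge candidate angles within `threshold` degrees of each other.

    `candidates` is a list of (angle_deg, distortion) tuples. Each
    merged group is replaced by its average angle with a discounted
    distortion of min(distortions) / group_size, so commonly occurring
    angles receive a smaller (better) distortion.
    """
    merged, used = [], [False] * len(candidates)
    for i, (a_i, d_i) in enumerate(candidates):
        if used[i]:
            continue
        group = [(a_i, d_i)]
        used[i] = True
        for j in range(i + 1, len(candidates)):
            if not used[j] and abs(candidates[j][0] - a_i) <= threshold:
                group.append(candidates[j])
                used[j] = True
        angles = [a for a, _ in group]
        dists = [d for _, d in group]
        merged.append((sum(angles) / len(group), min(dists) / len(group)))
    return merged
```

For example, candidates at 60° and 62° merge into a single 61° entry whose distortion is halved, while an isolated 120° candidate keeps its original distortion.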
(41) The refined list of candidate angles and their phase distortions is passed on to a first angle tracker update module 410. In this module, a list of candidate angles and phase distortions over time is kept in an angle tracking list. Each entry in this list includes an angle, a phase distortion, a presence score, and a presence counter.
(42) First, an entry in the angle tracking list is identified that has the lowest phase distortion among all entries which have a non-zero presence score and have not yet been updated. Next, a target candidate is found in the candidate list that has the lowest phase distortion among all candidates and whose angle is similar to the identified entry from the angle tracking list. If such a candidate is found, then the target entry (θ_i, D_i, S_i) is updated with the candidate (θ̂, D̂) as follows:

Angle: θ_i = α·θ_{i−1} + (1−α)·θ̂, α: constant

Phase distortion: D_i = β·D_{i−1} + (1−β)·D̂, β: constant

Presence score: S_i = max(S_{i−1} + δ, 1.0), δ: constant

Otherwise, a new entry is created from the candidate, as follows:

θ_i = θ̂, D_i = D̂, S_i = S^Init

(43) This process is followed until all entries in the angle tracking list and candidate list have been evaluated. Next, the entries which have not been updated in the above procedure are updated in a way that the distortion increases and the presence score decreases. In particular,

θ_i = θ_{i−1}, D_i = γ·D_{i−1}, S_i = ρ·S_{i−1}, γ, ρ: constants

where γ is chosen so that the distortion grows and ρ so that the presence score decays.
(44) If the presence score falls below a threshold, such entries are removed from the list. The last step of the update is to scan all the entries again and merge those having similar angles.
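The two-phase update of the angle tracking list might look like the following sketch; the smoothing factor, growth/decay rates, initial score, similarity threshold, and removal threshold are all assumed values, since the description leaves the constants unspecified:

```python
class AngleTracker:
    """Minimal sketch of the angle tracking list.

    Each entry is a mutable [angle, distortion, presence_score].
    Matched entries are exponentially smoothed toward the candidate;
    unmatched candidates create new entries; non-updated entries are
    aged (distortion grows, presence score decays) and dropped when
    their score falls below a threshold.
    """

    def __init__(self, alpha=0.8, grow=1.1, decay=0.9,
                 init_score=0.5, drop_below=0.1, similar=10.0):
        self.entries = []
        self.alpha, self.grow, self.decay = alpha, grow, decay
        self.init_score, self.drop_below, self.similar = (
            init_score, drop_below, similar)

    def update(self, candidates):
        """`candidates` is a list of (angle_deg, distortion) tuples."""
        updated = set()
        # Visit candidates in order of increasing distortion.
        for angle, dist in sorted(candidates, key=lambda c: c[1]):
            match = None
            for idx, e in enumerate(self.entries):
                if idx not in updated and abs(e[0] - angle) <= self.similar:
                    match = idx
                    break
            if match is not None:
                e = self.entries[match]
                e[0] = self.alpha * e[0] + (1 - self.alpha) * angle
                e[1] = self.alpha * e[1] + (1 - self.alpha) * dist
                e[2] = max(e[2] + 0.1, 1.0)  # presence floor on update
                updated.add(match)
            else:
                self.entries.append([angle, dist, self.init_score])
                updated.add(len(self.entries) - 1)
        for idx, e in enumerate(self.entries):
            if idx not in updated:       # age non-updated entries
                e[1] *= self.grow        # distortion increases
                e[2] *= self.decay       # presence score decreases
        self.entries = [e for e in self.entries if e[2] >= self.drop_below]
```

A candidate arriving near an existing tracked angle is absorbed into that entry rather than spawning a duplicate, which is what lets the tracker stay stable as the candidate list jitters from frame to frame.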
(45) Given the angle tracking list, a final selection of an angle for the frame is made by the final selection module 412. For example, an entry is selected from the angle tracking list that meets the following criteria. First, its presence score is greater than or equal to the maximum presence score of the entries in the angle tracking list. Second, its phase distortion is less than or equal to the minimum phase distortion of the entries in the angle tracking list. Third, its angle is similar to the highest ranked candidate angle obtained from the candidate selection module. The presence counter of this entry is incremented, while that of the other entries is decremented. The counter indicates how often the entry has been selected in recent frames. When the counter of the entry exceeds a threshold, its angle θ_i is reported as a detected angle. If no entry satisfies the above conditions, Undetected is returned.
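The selection criteria can be sketched as below; the similarity and reporting thresholds are assumptions, and the presence counters are kept in a parallel list for simplicity:

```python
def select_final_angle(entries, top_candidate_angle, counters,
                       similar=10.0, report_threshold=3):
    """Pick the tracked entry with maximal presence score and minimal
    distortion whose angle is similar to the highest-ranked candidate.

    `entries`: list of (angle, distortion, presence_score) tuples;
    `counters`: parallel list of per-entry presence counters, mutated
    in place. Returns the detected angle, or None for "Undetected".
    """
    if not entries:
        return None
    max_score = max(e[2] for e in entries)
    min_dist = min(e[1] for e in entries)
    for idx, (angle, dist, score) in enumerate(entries):
        if (score >= max_score and dist <= min_dist
                and abs(angle - top_candidate_angle) <= similar):
            counters[idx] += 1                    # selected this frame
            for j in range(len(counters)):
                if j != idx:
                    counters[j] = max(0, counters[j] - 1)
            # Report only once the entry has been selected often enough.
            return angle if counters[idx] > report_threshold else None
    return None
```

The counter acts as a debouncer: an angle is only reported after it has won the selection in enough recent frames, which suppresses one-frame spurious detections.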
(46) Finally, the second angle tracker updater 414 scans the angle tracking list and updates the entries which have not been updated by the previous modules. The update formulas are the same as those used in the first angle tracker update: the distortion gets larger and the presence score gets smaller.
(47) To summarize, referring to
(48) Having now described an example implementation, a computing environment in which such a system is designed to operate will now be described. The following description is intended to provide a brief, general description of a suitable computing environment in which this system can be implemented. The system can be implemented with numerous general purpose or special purpose computing hardware configurations. Examples of well known computing devices that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
(49)
(50) With reference to
(51) Computing machine 700 may also contain communications connection(s) 712 that allow the device to communicate with other devices. Communications connection(s) 712 is an example of communication media. Communication media typically carries computer program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
(52) Computing machine 700 may have various input device(s) 714 such as a display, a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 716 such as speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
(53) The system may be implemented in the general context of software, including computer-executable instructions and/or computer-interpreted instructions, such as program modules, being processed by a computing machine. Generally, program modules include routines, programs, objects, components, data structures, and so on, that, when processed by a processing unit, instruct the processing unit to perform particular tasks or implement particular abstract data types. This system may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
(54) The terms article of manufacture, process, machine and composition of matter in the preambles of the appended claims are intended to limit the claims to subject matter deemed to fall within the scope of patentable subject matter defined by the use of these terms in 35 U.S.C. 101.
(55) Any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. It should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific implementations described above. The specific implementations described above are disclosed as examples only.