Directional sound masking
09613610 · 2017-04-04
CPC classification
H04K3/43 (ELECTRICITY)
H04K3/45 (ELECTRICITY)
H04K3/41 (ELECTRICITY)
H04K3/42 (ELECTRICITY)
G10K2210/3028 (PHYSICS)
Abstract
The invention relates to a system for masking a sound incident on a person. The system comprises a microphone sub-system for capturing the sound. The system further comprises a spectrum analyzer for determining a power attribute of the sound captured by the microphone sub-system, and a spatial analyzer for determining a directional attribute of the captured sound representative of a direction of incidence on the person. The system further comprises a generator sub-system for generating a masking sound under combined control of the power attribute and the directional attribute, for masking the incident sound.
Claims
1. A system configured for masking sound signals, comprising: microphones for capturing the sound signals at multiple locations; loudspeakers for generating at least one masking sound signal; a computer configured by computer program instructions to: determine a power attribute of a frequency spectrum of at least one sound signal of the sound signals, wherein the power attribute of the frequency spectrum is representative of a power in a frequency band of the at least one sound signal; determine a directional attribute of the at least one sound signal in the frequency band, wherein the directional attribute of the at least one sound signal is representative of a direction from which the at least one sound signal is captured; and control at least one loudspeaker of the loudspeakers to generate the at least one masking sound signal based on the power attribute and the directional attribute of the at least one sound signal such that a direction of the generated at least one masking sound signal is based on the directional attribute.
2. The system of claim 1, wherein: the microphones supply a first signal representative of the at least one sound signal; the computer supplies a second signal for control of the loudspeakers; the system comprises: an adaptive filter for receiving the second signal and supplying a filtered version of the second signal; and a subtractor for receiving the first signal, receiving the filtered version of the second signal, and supplying a third signal to the computer that is representative of a difference between the first signal and the filtered version of the second signal; and the adaptive filter has a control input for receiving the third signal for control of one or more filter coefficients of the adaptive filter.
3. The system of claim 1, comprising a sound classifier that is operative to selectively remove a pre-determined portion from the at least one sound signal before carrying out the determining of the power attribute and before carrying out the determining of the directional attribute.
4. The system of claim 1, wherein the controlling comprises: selecting the at least one loudspeaker from among the loudspeakers based on the at least one loudspeaker corresponding to at least one microphone of the microphones that captured the at least one sound signal; and controlling the at least one loudspeaker to generate the at least one masking sound signal based on the power attribute and the directional attribute of the at least one sound signal such that the direction of the at least one masking sound signal is the same as the direction from which the at least one sound signal is captured by the at least one microphone.
5. A method for masking sound signals, comprising: capturing the sound signals at multiple locations; determining a power attribute of a frequency spectrum of at least one sound signal of the sound signals, wherein the power attribute of the frequency spectrum is representative of a power in a frequency band of the at least one sound signal; determining a directional attribute of the at least one sound signal in the frequency band, wherein the directional attribute of the at least one sound signal is representative of a direction from which the at least one sound signal is captured; and generating the at least one masking sound signal based on the power attribute and the directional attribute of the at least one sound signal such that a direction of the generated at least one masking sound signal is based on the directional attribute.
6. The method of claim 5, further comprising: receiving a first signal representative of the at least one sound signal; supplying a second signal for generating the at least one masking sound signal; and adaptive filtering for reducing a contribution from the at least one masking sound signal, present in the at least one sound signal, to the second signal, wherein the adaptive filtering comprises: receiving the second signal; using an adaptive filter for supplying a filtered version of the second signal; supplying a third signal that is representative of a difference between the first signal and the filtered version of the second signal; receiving the third signal for control of one or more filter coefficients of the adaptive filter; and using the third signal for the determining of the power attribute and for the determining of the directional attribute.
7. The method of claim 5, further comprising selectively removing a pre-determined portion from the at least one sound signal before carrying out the determining of the power attribute and before carrying out the determining of the directional attribute.
8. A non-transitory computer-readable medium having instructions recorded thereon, wherein the instructions, when executed by a computer, cause the computer to perform the following: capturing sound signals at multiple locations; determining a power attribute of a frequency spectrum of at least one sound signal of the sound signals, wherein the power attribute of the frequency spectrum is representative of a power in a frequency band of the at least one sound signal; determining a directional attribute of the at least one sound signal in the frequency band, wherein the directional attribute of the at least one sound signal is representative of a direction from which the at least one sound signal is captured; and generating the at least one masking sound signal based on the power attribute and the directional attribute of the at least one sound signal such that a direction of the generated at least one masking sound signal is based on the directional attribute.
9. The non-transitory computer-readable medium of claim 8, wherein the instructions, when executed by the computer, further cause the computer to perform the following: receiving a first signal representative of the at least one sound signal; supplying a second signal for generating the at least one masking sound signal; and adaptive filtering for reducing a contribution from the at least one masking sound signal, present in the at least one sound signal, to the second signal, wherein the adaptive filtering comprises: receiving the second signal; using an adaptive filter for supplying a filtered version of the second signal; supplying a third signal that is representative of a difference between the first signal and the filtered version of the second signal; receiving the third signal for control of one or more filter coefficients of the adaptive filter; and using the third signal for the determining of the power attribute and for the determining of the directional attribute.
10. The non-transitory computer-readable medium of claim 8, wherein the instructions, when executed by the computer, further cause the computer to selectively remove a pre-determined portion from the at least one sound signal before carrying out the determining of the power attribute and before carrying out the determining of the directional attribute.
Description
BRIEF DESCRIPTION OF THE DRAWING
(1) The invention is explained in further detail, by way of example and with reference to the accompanying drawing, wherein:
(5) Throughout the Figures, similar or corresponding features are indicated by the same reference numerals.
DETAILED EMBODIMENTS
(6) The invention relates to a system and method for masking a sound incident on a person. The system comprises a microphone sub-system for capturing the sound. The system further comprises a spectrum analyzer for determining a power attribute of the sound captured by the microphone sub-system, and a spatial analyzer for determining a directional attribute of the captured sound representative of a direction of incidence on the person. The system further comprises a generator sub-system for generating a masking sound under combined control of the power attribute and the directional attribute, for masking the incident sound.
(8) The first embodiment 100 comprises a signal-processing sub-system 103 between, on the one hand, the left microphone 102 and the right microphone 104 and, on the other hand, the left loudspeaker 106 and the right loudspeaker 108. The functionality of the signal-processing sub-system 103 will now be discussed.
(9) The left microphone 102 captures sounds incident on the left microphone 102 and produces a left audio signal for a left audio channel. The left audio signal is converted to the frequency domain in a left converter 110 that produces a left spectrum. Likewise, the right microphone 104 captures sounds incident on the right microphone 104 and produces a right audio signal for a right audio channel. The right audio signal is converted to the frequency domain by a right converter 112 that produces a right spectrum. Operation of the left converter 110 and of the right converter 112 is based on, e.g., the Fast Fourier Transform (FFT).
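As an illustration only, the left converter 110 and the right converter 112 could be realized as a short-time FFT. The sketch below assumes a Hann window, 512-sample frames and a hop of 256 samples; the function name `stft` and these parameters are assumptions and are not specified in the description.

```python
import numpy as np


def stft(x: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Short-time FFT of a mono signal (hypothetical converter sketch).

    Returns one complex spectrum per row, shape (n_frames, frame_len // 2 + 1).
    Assumes len(x) >= frame_len.
    """
    window = np.hanning(frame_len)  # Hann analysis window (assumption)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # real-input FFT of every frame
```

The same function would be applied to the left audio signal and to the right audio signal, yielding the left spectrum and the right spectrum frame by frame.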
(10) The left spectrum is supplied to a set of one or more left band-pass filters 114 that determines one or more frequency bands in the left spectrum. Likewise, the right spectrum is supplied to a set of one or more right band-pass filters 116 that determines one or more frequency bands in the right spectrum. Dividing each of the left spectrum and the right spectrum into frequency bands makes it possible to process the different bands of the same spectrum separately. For example, the set of left band-pass filters 114 determines one or more frequency bands in the left spectrum, wherein each particular one of the frequency bands is associated with a particular one of the auditory band-pass filters. In a psychoacoustic model of auditory perception, the asymmetric shape of each individual auditory band-pass filter is approximated in practice by a symmetric frequency-response function, known as the Rounded Exponential (RoEx) shape. Similarly, the set of right band-pass filters 116 determines one or more frequency bands in the right spectrum, wherein each particular one of the frequency bands is associated with a particular one of the auditory band-pass filters.
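The RoEx band splitting can be sketched as a set of weightings applied to the power spectrum. This assumes the symmetric roex(p) form W(g) = (1 + pg)·exp(−pg), with g = |f − fc|/fc and p = 4fc/ERB(fc), and the equivalent rectangular bandwidth ERB(fc) = 24.7·(4.37·fc/1000 + 1) Hz of Glasberg and Moore; the helper names and the choice of center frequencies are hypothetical.

```python
import numpy as np


def erb_hz(fc):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter centred at fc (Glasberg & Moore)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def roex_weights(freqs, fc):
    """Symmetric roex(p) power weighting W(g) = (1 + p*g) * exp(-p*g), with g = |f - fc| / fc."""
    p = 4.0 * fc / erb_hz(fc)
    g = np.abs(freqs - fc) / fc
    return (1.0 + p * g) * np.exp(-p * g)


def band_powers(spectrum, freqs, centers):
    """Power of one complex spectrum frame in each roex-weighted band (centre frequencies > 0)."""
    power = np.abs(spectrum) ** 2
    return np.array([np.sum(roex_weights(freqs, fc) * power) for fc in centers])
```

For one frame of the left spectrum, `band_powers(frame, np.fft.rfftfreq(512, d=1 / fs), centers)` would return one power value per band, for a hypothetical list `centers` of center frequencies; the right spectrum and the masking-sound spectrum would be split with the same weightings so that the bands line up.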
(11) The first embodiment 100 also comprises a masking sound generator 118 that is configured for generating a signal representative of the masking sound. The masking sound signal is converted to the frequency domain by a further frequency converter 120 to generate a spectrum of the masking sound. The spectrum of the masking sound is supplied to a set of one or more further band-pass filters 122. The set of further band-pass filters 122 determines respective frequency bands in the spectrum of the masking sound that correspond with respective ones of the frequency ranges determined by the set of left band-pass filters 114 and the set of right band-pass filters 116.
(12) A particular part of the left spectrum associated with a particular frequency range, another particular part of the right spectrum associated with this particular frequency range and a further particular part of the spectrum of the masking sound associated with the particular frequency range are supplied to a particular one of a first sub-system 124, a second sub-system 126, a third sub-system 128, etc. In the following, the processing of the particular part of the left spectrum, of the other particular part of the right spectrum and of the further particular part of the spectrum of the masking sound is explained with reference to the processing by the first sub-system 124.
(13) The first sub-system 124 comprises a spectrum analyzer 130, a spatial analyzer 134 and a generator sub-system 135. The generator sub-system 135 comprises a spectrum equalizer 132 and a virtualizer 136. The second sub-system 126, the third sub-system 128, etc., have a configuration similar to that of the first sub-system 124. The generator sub-system 135 is configured to generate a masking sound under combined control of a power attribute, as determined by the spectrum analyzer 130, and a spatial attribute as determined by the spatial analyzer 134, for masking the sound as captured by the left microphone 102 and the right microphone 104.
(14) The spectrum analyzer 130 is configured for estimating, or determining, the power in the relevant one of the frequency ranges handled by the first sub-system 124, for the sound captured by the left microphone 102 and the right microphone 104 combined.
(15) The power in the relevant frequency range as determined by the spectrum analyzer 130, suitably averaged over time, is used to control the spectrum equalizer 132. The spectrum equalizer 132 is configured to adjust the power in the relevant frequency range of the masking sound under control of the power estimated by the spectrum analyzer 130 as being present in the relevant frequency range of the incident sound captured by the left microphone 102 and the right microphone 104. Optionally, the spectrum equalizer 132 is adjustable so that control parameters can be set in advance for adjusting the power in the relevant frequency range of the masking sound in dependence on the power spectrum of the relevant frequency range of the captured sound. For example, the adjustability of the spectrum equalizer 132 makes it possible to limit the ratio between the power in the frequency range of the captured sound and the power in the frequency range of the masking sound to a range between a minimum value and a maximum value. Limiting this ratio helps to create a masking sound that the user perceives as natural rather than artificial.
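A per-band spectrum analyzer and spectrum equalizer might look like the hypothetical `BandEqualizer` below. It assumes an exponential moving average for the time averaging, and it interprets the optional ratio limit as follows: the masker is left unchanged while the ratio of captured power to masker power lies between the minimum and maximum values, and is otherwise scaled just enough to bring the ratio back to the nearest limit. The smoothing factor and the ±6 dB limits are assumptions.

```python
import numpy as np


class BandEqualizer:
    """Per-band masker gain control (hypothetical helper, not taken from the description)."""

    def __init__(self, alpha=0.9, ratio_min_db=-6.0, ratio_max_db=6.0):
        self.alpha = alpha  # weight of the previous average in the exponential time average
        self.ratio_min = 10 ** (ratio_min_db / 10)  # lowest allowed captured/masker power ratio
        self.ratio_max = 10 ** (ratio_max_db / 10)  # highest allowed captured/masker power ratio
        self.avg_power = None  # time-averaged band power of the captured sound

    def update(self, left_band_power, right_band_power, masker_band_power):
        """Return the amplitude gain to apply to the masker in this band for the current frame."""
        captured = left_band_power + right_band_power  # left and right microphones combined
        if self.avg_power is None:
            self.avg_power = captured
        else:
            self.avg_power = self.alpha * self.avg_power + (1.0 - self.alpha) * captured
        masker = max(masker_band_power, 1e-12)  # avoid division by zero
        ratio = self.avg_power / masker
        limited = min(max(ratio, self.ratio_min), self.ratio_max)
        # Choose the gain so that avg_power / (gain**2 * masker) equals the limited ratio;
        # the gain is 1 whenever the ratio already lies inside the allowed range.
        return float(np.sqrt(self.avg_power / (limited * masker)))
```

One instance would be kept per frequency range, i.e., one per sub-system 124, 126, 128, and its gain would multiply the corresponding band of the masking-sound spectrum.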
(16) The spatial analyzer 134 is configured to determine a spatial attribute, e.g., a direction of incidence on the left microphone 102 and on the right microphone 104, of that particular contribution of the sound, which is captured by the left microphone 102 and the right microphone 104 and which is associated with the relevant frequency range.
(17) The spatial analyzer 134 thus performs sound localization of the contribution to the captured sound in the relevant frequency range. The expression sound localization as used in the art refers to a person's ability to identify the location of a detected sound in direction and distance. Sound localization may also refer to methods in acoustical engineering to simulate the placement of an auditory cue in a virtual three-dimensional space. In human sound localization, the concepts of interaural time difference (ITD) and interaural level difference (ILD) refer to physical quantities that enable a person to determine the lateral direction (left, right) from which a sound appears to be coming. The ITD is the difference in arrival times of a sound at the person's left ear and the person's right ear. If a sound signal arrives at the person's head from one side, the sound signal has to travel farther to reach the far ear than the near ear. This difference in path length results in a time difference between the sound's arrivals at the ears, which is detected and aids the process of identifying the direction from which the sound appears to be coming. As to the ILD, sound arriving at the person's near ear has a higher energy level than the sound arriving at the person's far ear, as the far ear is located in the acoustic shadow of the person's head, which causes a significant attenuation of the sound signal. The ILD is noticeably frequency-dependent, as the characteristic dimension of a person's head falls within the range of wavelengths of the audible spectrum. The spatial analyzer 134 is configured, e.g., to determine a quantity representative of at least one of the ITD and the ILD for the sound captured by the left microphone 102 and the right microphone 104.
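One way the spatial analyzer 134 could derive ITD and ILD quantities from the band-limited left and right spectra is sketched below, assuming that the ITD is estimated from the phase of the interaural cross-spectrum at the band's center frequency. That estimate is only unambiguous while |ITD| stays below half a period of the center frequency, so it suits low-frequency bands best; the function name and the band edges are hypothetical.

```python
import numpy as np


def band_itd_ild(left_spec, right_spec, freqs, f_lo, f_hi):
    """Estimate ITD and ILD for one frequency band [f_lo, f_hi).

    Returns (itd_seconds, ild_db); positive ITD means the sound reaches the left
    microphone first, positive ILD means it is louder at the left microphone.
    """
    band = (freqs >= f_lo) & (freqs < f_hi)
    cross = np.sum(left_spec[band] * np.conj(right_spec[band]))  # interaural cross-spectrum
    fc = 0.5 * (f_lo + f_hi)  # band centre frequency
    itd = np.angle(cross) / (2 * np.pi * fc)  # phase difference converted to a time difference
    p_left = np.sum(np.abs(left_spec[band]) ** 2)
    p_right = np.sum(np.abs(right_spec[band]) ** 2)
    ild_db = 10 * np.log10(p_left / p_right)
    return itd, ild_db
```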
(18) The virtualizer 136 is configured for generating, under combined control of the spectrum equalizer 132 and the spatial analyzer 134, a left-channel representation and a right-channel representation of a masking sound in the frequency domain and associated with the relevant frequency range. The left-channel representation is supplied to a left inverse-converter 138 for being converted to the time-domain, e.g., through an inverse FFT. The left-channel representation in the time-domain is then supplied to the left loudspeaker 106. Similarly, the right-channel representation is supplied to a right inverse-converter 140 for being converted to the time-domain, e.g., through an inverse FFT. The right-channel representation in the time-domain is then supplied to the right loudspeaker 108.
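As a simplified stand-in for the virtualizer 136, the sketch below imposes an ITD and an ILD on the equalized masker band by splitting each cue evenly between the two channels: half the delay as a linear phase and half the level difference (in dB) per ear. A practical virtualizer might instead use head-related transfer functions; this panning scheme and the function signature are assumptions, not the described implementation.

```python
import numpy as np


def virtualize_band(masker_spec, freqs, gain, itd, ild_db):
    """Render a mono masker band as left/right spectra carrying a given ITD and ILD.

    Convention (assumption): positive itd means the left channel leads,
    positive ild_db means the left channel is louder.
    """
    level = 10 ** (ild_db / 40)  # amplitude factor carrying half of the ILD per channel
    half_shift = np.exp(1j * np.pi * freqs * itd)  # linear phase carrying half of the ITD
    left = gain * level * masker_spec * half_shift
    right = gain / level * masker_spec * np.conj(half_shift)
    return left, right
```

The left and right outputs of all sub-systems would then be summed per channel and passed to the inverse converters 138 and 140.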
(19) Each respective one of the second sub-system 126 and the third sub-system 128, etc., performs similar operations for processing a respective contribution to the captured sound from a respective other frequency range. The eventual masking sound as played out at the left loudspeaker 106 and the right loudspeaker 108 then comprises the respective left-channel representation in the time domain and the respective right-channel representation in the time domain as supplied by a respective one of the first sub-system 124, the second sub-system 126, the third sub-system 128 etc.
(20) For completeness, it is remarked here that more than two microphones and more than two loudspeakers can be exploited so as to be able to determine directionality of the incident sound with higher resolution and so as to be able to play out a masking sound with a higher directional resolution. Note also that the sound, captured by the microphones, here: the left microphone 102 and the right microphone 104, may stem from two or more sources or may be incident on the microphones from multiple directions (e.g., through multiple reflections at acoustically reflecting objects within range of the microphones). The first embodiment 100 determines the power spectrum and direction of incidence per individual one of the frequency ranges and generates an eventual masking sound taking into account the multiple sources and/or multiple directions of incidence.
(21) Also, in the case of generating a binaural masking sound, some reverberation may be added so as to strengthen the user's impression that the masking sound as perceived stems from one or more sources external to the user's head.
(22) For completeness, it is remarked here that the first embodiment 100 is illustrated as including the left microphone 102 and the right microphone 104. If one or more additional microphones are present in the first embodiment 100, the output signal of each additional microphone is supplied to an additional frequency converter (not shown), and from there to an additional set of band-pass filters (not shown). Each individual one of the band-pass filters of the additional set supplies a particular output signal, indicative of a particular frequency range, to a particular one of the first sub-system 124, the second sub-system 126, the third sub-system 128, etc. Consider the specific output signal of the additional set of band-pass filters that is supplied to the first sub-system 124. The specific output signal is then supplied to the spectrum analyzer 130 and to the spatial analyzer 134, in parallel to the left output signal of the set of left band-pass filters 114 supplied to the first sub-system 124, and in parallel to the right output signal of the set of right band-pass filters 116 as supplied to the first sub-system 124.
(23) Consider now a scenario, wherein one or both of the left microphone 102 and the right microphone 104 is not acoustically well isolated from the left loudspeaker 106 and/or from the right loudspeaker 108. For example, a typical active noise-cancellation headphone has both a loudspeaker unit and a microphone unit positioned inside each of the ear cups. That is, a typical active noise-cancellation headphone has the left microphone 102 and the left loudspeaker 106 positioned inside the left ear cup, and has the right microphone 104 and the right loudspeaker 108 positioned inside the right ear cup. As a result, the masking sound reproduced by the left loudspeaker 106 will be picked up by the left microphone 102, and the masking sound reproduced by the right loudspeaker 108 will be picked up by the right microphone 104. In this case, it is necessary to remove the masking sound reproduced by the left loudspeaker 106 from the sound that is captured by the left microphone 102, and to remove the masking sound reproduced by the right loudspeaker 108 from the sound captured by the right microphone 104, so as to subject the thus modified captured sound to the signal processing carried out by the signal-processing sub-system 103.
(24) Likewise, consider another scenario, wherein the left microphone 102, the right microphone 104, the left loudspeaker 106 and the right loudspeaker 108 are positioned away from the user's ears. As a result, each individual one of the left microphone 102 and the right microphone 104 is acoustically coupled to both the left loudspeaker 106 and the right loudspeaker 108. In this case, it is necessary as well to remove the masking sound reproduced by the left loudspeaker 106 and the masking sound reproduced by the right loudspeaker 108 from the sound that is captured by each individual one of the left microphone 102 and the right microphone 104, so as to subject the thus modified captured sound to the signal processing carried out by the signal-processing sub-system 103 as discussed above with reference to the first embodiment 100.
(25) The removal of the masking sound as captured by each individual one of the left microphone 102 and the right microphone 104 can be implemented through use of adaptive filtering, as explained below.
(27) Each individual one of the microphones of the microphone sub-system 202, e.g., the specific microphone 206, may capture the sound to be masked as well as the masking sound, as reproduced by the loudspeaker sub-system 204 in the manner described above with reference to the first embodiment 100. The sound to be masked is indicated by reference numeral 208, and the masking sound by reference numeral 210.
(28) The specific microphone 206 captures the sound to be masked 208 as well as the masking sound 210 and supplies a first signal. The first signal is supplied to the signal-processing sub-system 103 via a subtractor 212. The subtractor 212 also receives a filter output signal from an adaptive filter 214 and is operative to subtract the filter output signal from the microphone signal. The output signal of the subtractor 212 is supplied to the signal-processing sub-system 103 described with reference to the first embodiment 100. The output signal of the signal-processing sub-system 103 as supplied to the loudspeaker sub-system 204 is supplied to an input of the adaptive filter 214. The adaptive filter 214 is configured for adjusting its filter coefficients under control of the output signal of the subtractor 212. Adaptive filtering techniques are well-known in the art and need not be discussed here in further detail.
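The adaptive filter 214 and the subtractor 212 can be sketched with the normalized least-mean-squares (NLMS) algorithm, one common choice among the well-known adaptive filtering techniques. The function name, the 64-tap length and the step size are assumptions; `mic` plays the role of the first signal from the specific microphone 206, `speaker` is the signal sent to the loudspeaker sub-system 204, and the returned error is the subtractor output that both steers the coefficients and feeds the signal-processing sub-system 103.

```python
import numpy as np


def nlms_cancel(mic, speaker, n_taps=64, mu=0.5, eps=1e-8):
    """Remove the picked-up masking sound from a microphone signal with an NLMS filter.

    Returns (e, w): the subtractor output e = mic - filtered speaker signal,
    and the final filter coefficients w.
    """
    w = np.zeros(n_taps)
    x_buf = np.zeros(n_taps)  # most recent loudspeaker samples, newest first
    e = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = speaker[n]
        y = w @ x_buf  # filtered version of the loudspeaker signal
        e[n] = mic[n] - y  # subtractor output
        w += mu * e[n] * x_buf / (eps + x_buf @ x_buf)  # normalized coefficient update driven by e
    return e, w
```

If no sound to be masked is present, e tends to zero once the coefficients have converged to the loudspeaker-to-microphone path; with a sound to be masked present, e approximates that sound.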
(29) Wearing headphones (or earphones) may be inconvenient. Alternatively, therefore, the loudspeakers and microphones of a system of the invention may be positioned at a distance from the head of the user. In this case, an array of two or more microphones can be used to obtain, by means of a beamforming technique, the directions of the disturbing sounds to be masked with respect to a preferably fixed position of the user's head. For example, in a hospital environment, the possible positions of the head of a patient lying in a hospital bed, erected at a fixed location in a hospital room, are usually limited to a small volume of space.
(30) A one-dimensional array of microphones can then be used to sweep (in software) a narrow (microphone-) beam pattern along an axis that has a particular orientation with respect to the patient, e.g., the horizontal axis. A two-dimensional array of microphones can then be used to sweep (in software) a narrow (microphone-) beam pattern along two axes that have different particular orientations with respect to the patient, e.g., the horizontal axis and the vertical axis.
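The software beam sweep along one axis can be illustrated with a narrowband delay-and-sum beamformer for a uniform linear array, assuming far-field (plane-wave) incidence and a speed of sound of 343 m/s. Delay-and-sum is only one of many beamforming techniques, and the function name, array geometry and angle grid are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def sweep_beam(mic_spectra, freq, spacing, angles_deg):
    """Delay-and-sum beam power of a uniform linear array at one frequency bin.

    mic_spectra: complex bin values, one per microphone
    spacing:     distance between adjacent microphones in metres
    angles_deg:  candidate steering directions, 0 degrees = broadside
    """
    m = np.arange(len(mic_spectra))
    powers = []
    for theta in np.deg2rad(angles_deg):
        delays = m * spacing * np.sin(theta) / SPEED_OF_SOUND  # plane-wave delay per microphone
        steering = np.exp(-2j * np.pi * freq * delays)
        powers.append(np.abs(np.vdot(steering, mic_spectra)) ** 2)  # vdot conjugates steering
    return np.array(powers)
```

The candidate angle with the largest beam power would serve as the estimated direction for that frequency bin; a two-dimensional array would sweep a second angle in the same way.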
(31) Note that, when using only a left microphone and a right microphone located at or near the user's ears, an implementation of the spatial analyzer 134 may be used for determining the ITD and ILD. If the microphones are positioned remote from the user's head and beamforming is used to determine the directions of the sounds to be masked, another implementation of the spatial analyzer 134 may be used that is adapted to the specific beamforming technique.
(32) When the loudspeakers are positioned away from the user's head, an implementation of the virtualizer 136 may be used so that, given the estimated incident directions of the target sounds, the masking sounds may be rendered at the same directions using the loudspeaker sub-system. This can be achieved by filtering the binaural signals with a matrix of filters to synthesize input signals for the loudspeaker array, where the filters are designed so that the transmission paths to the positions of the user's ears become relatively transparent (e.g., using cross-talk cancellation). Alternatively, beamforming can be used, wherein a filter matrix forms two narrow beams, one directed at the position of the user's left ear and the other at the position of the user's right ear. Cross-talk cancellation is known in the art. The objective of a cross-talk canceller is to reproduce a desired signal at a single target position while cancelling out the sound perfectly at all remaining target positions. The basic principle of cross-talk cancellation using only two loudspeakers and two target positions has been known for a long time. In 1966, Atal and Schroeder used physical reasoning to determine how a cross-talk canceller comprising only two loudspeakers placed symmetrically in front of a single listener could work. In order to reproduce a short pulse at the left ear only, the left loudspeaker first emits a positive pulse. This pulse must be cancelled at the right ear by a slightly weaker negative pulse emitted by the right loudspeaker. This negative pulse must then be cancelled at the left ear by another, even weaker positive pulse emitted by the left loudspeaker, and so on.
Atal and Schroeder's model assumes free-field conditions; the influence of the listener's torso, head and outer ears on the incoming sound waves is ignored (copied from a web page Cross-Talk Cancellation of the Fluid Dynamics and Acoustics Group, section Virtual Acoustics and Audio Engineering of the Institute of Sound and Vibration Research at the University of Southampton; URL=http://resource.isvr.soton.ac.uk/FDAG/VAP/html/xtalk.html).
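The Atal and Schroeder principle can be expressed per frequency as the inverse of a 2×2 matrix of acoustic transfer functions. The sketch below keeps their free-field, symmetric assumption, normalizes the ipsilateral paths to 1 and models each contralateral path as an attenuation g < 1 with an extra delay delta; the small Tikhonov regularization term and all names and values are assumptions. For this model, the left loudspeaker's filter for a left-ear signal is 1/(1 − c²) = 1 + c² + c⁴ + …, with c the contralateral transfer function, and the right loudspeaker's filter is −c/(1 − c²): exactly the train of ever weaker, alternately signed pulses in the Atal and Schroeder reasoning.

```python
import numpy as np


def crosstalk_canceller(freqs, g, delta, reg=1e-6):
    """Per-frequency 2x2 cross-talk cancellation filters for a symmetric free-field set-up.

    The ipsilateral path is normalised to 1 and the contralateral path is modelled as
    attenuation g with extra delay delta (seconds); head, torso and pinnae are ignored.
    Returns C with shape (n_freqs, 2, 2) such that H(f) @ C(f) is close to the identity.
    """
    c = g * np.exp(-2j * np.pi * freqs * delta)  # contralateral transfer function
    H = np.empty((len(freqs), 2, 2), dtype=complex)
    H[:, 0, 0] = H[:, 1, 1] = 1.0
    H[:, 0, 1] = H[:, 1, 0] = c
    # Regularised inverse (H^H H + reg I)^-1 H^H keeps the filters bounded near singularities.
    Hh = np.conj(np.transpose(H, (0, 2, 1)))
    return np.linalg.solve(Hh @ H + reg * np.eye(2), Hh)
```

Column j of C(f) gives the two loudspeaker feeds that deliver the binaural signal intended for ear j while suppressing it at the other ear.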
(33) The location(s) where the masking sound is intended to effectively mask the sound to be masked can be fixed, regardless of the direction(s) from which the sound(s) to be masked arrive(s) at the user's head. In hospital rooms, the sources of sounds to be masked, e.g., electronic monitoring systems, are mostly located to the side of, or behind, the patient's bed. In this case, masking sounds can be created that have a fixed directionality, directed only to the lateral positions and to the back, which reduces the variability of the soundscape and also reduces the computational power needed for the adaptive filtering (as some of the adaptive filters can use fixed filter coefficients).
(35) The first embodiment 100 is shown to accommodate the masking sound generator 118. The third embodiment 300 comprises one or more additional masking sound generators, e.g., a first additional masking sound generator 306 and a second additional masking sound generator 308, etc. Accordingly, instead of using a single type of masking sound for the processing at the signal-processing sub-system 103, a multitude of different masking sounds is used, a particular one of the masking sounds being tuned to a particular one of the sources that together produce the sound to be masked.