METHOD AND APPARATUS FOR PROCESSING AN AUDIO SIGNAL BASED ON EQUALIZATION FILTER
20210250686 · 2021-08-12
Assignee
Inventors
- Liyun PANG (Munich, DE)
- Fons Adriaensen (Munich, DE)
- Roman Schlieper (Hannover, DE)
- Song Li (Hannover, DE)
Cpc classification
G10K11/002
PHYSICS
International classification
G10K11/00
PHYSICS
Abstract
A method for processing an audio signal, the method including: processing the audio signal according to a pair of mouth to ear transfer functions to obtain a processed audio signal; filtering the processed audio signal, using a pair of equalization filters, to obtain a filtered audio signal, where a parameter of the equalization filter depends on an acoustic impedance of a headphone; and outputting the filtered audio signal to the headphone. Accordingly, this method counteracts the occlusion effect and provides a natural perceived sound pressure.
Claims
1. A method for processing an audio signal, comprising: processing the audio signal according to a pair of mouth to ear transfer functions to obtain a processed audio signal; filtering the processed audio signal, using a pair of equalization filters, to obtain a filtered audio signal, wherein a parameter of the equalization filter depends on an acoustic impedance of a headphone; and outputting the filtered audio signal to the headphone.
2. The method of claim 1, wherein the mouth to ear transfer function describes a transfer function from the mouth to the eardrums.
3. The method of claim 1, wherein the acoustic impedance of the headphone is measured based on an acoustic impedance tube, the acoustic impedance tube having a measurable frequency range from 20 Hz to 2 kHz.
4. The method of claim 1, wherein the parameter of the equalization filter is a gain factor of the equalization filter, and the gain factor of the equalization filter is proportional to the inverse of the acoustic impedance of the headphone.
5. The method of claim 1, wherein the pair of equalization filters is selected based on a headphone type of the headphone.
6. The method of claim 5, wherein the headphone type of the headphone is obtained based on a Universal Serial Bus (USB) Type-C information.
7. An apparatus for processing a stereo signal, the apparatus comprising processing circuitry configured to: process the audio signal according to a pair of mouth to ear transfer functions to obtain a processed audio signal; filter the processed audio signal, using a pair of equalization filters, to obtain a filtered audio signal, wherein a parameter of the equalization filter depends on an acoustic impedance of a headphone; and output the filtered audio signal to the headphone.
8. The apparatus of claim 7, wherein the mouth to ear transfer function describes a transfer function from the mouth to the eardrums.
9. The apparatus of claim 7, wherein the acoustic impedance of the headphone is measured based on an acoustic impedance tube, the acoustic impedance tube having a measurable frequency range from 20 Hz to 2 kHz.
10. The apparatus of claim 7, wherein the parameter of the equalization filter is a gain factor of the equalization filter, the gain factor of the equalization filter being proportional to the inverse of the acoustic impedance of the headphone.
11. The apparatus of claim 7, wherein the pair of equalization filters is selected based on a headphone type of the headphone.
12. The apparatus of claim 11, wherein the headphone type of the headphone is obtained based on a Universal Serial Bus (USB) Type-C information.
13. A computer-readable storage medium storing program code which, when executed by a computer, causes the computer to carry out a method comprising: processing an audio signal according to a pair of mouth to ear transfer functions to obtain a processed audio signal; filtering the processed audio signal, using a pair of equalization filters, to obtain a filtered audio signal, wherein a parameter of the equalization filter depends on an acoustic impedance of a headphone; and outputting the filtered audio signal to the headphone.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0049] To illustrate the features of the embodiments more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description show merely some embodiments, and modifications to these embodiments are possible without departing from their scope.
[0064] In the figures, identical reference signs are used for identical or functionally equivalent features.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0065] In the following description, reference is made to the accompanying drawings, which describe embodiments, and in which are shown, by way of illustration, various aspects in which the embodiments may be placed. It can be appreciated that the embodiments may be placed in other aspects and that structural or logical changes may be made without departing from the scope of the embodiments. The following descriptions, therefore, are non-limiting.
[0066] For instance, it can be appreciated that an embodiment in connection with a described method will generally also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures.
[0067] Moreover, embodiments with functional blocks or processing units are described, which are connected with each other or exchange signals. It can be appreciated that the embodiments also cover embodiments which include additional functional blocks or processing units, such as pre- or post-filtering and/or pre- or post-amplification units, that are arranged between the functional blocks or processing units of the embodiments described below.
[0068] Finally, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
[0069] A channel is a pathway for passing on information, in this context sound information. Physically, it might, for example, be a tube you speak down, or a wire from a microphone to an earphone, or connections between electronic components inside an amplifier or a computer.
[0070] A track is a physical home for the contents of a channel when recorded on magnetic tape. There can be as many parallel tracks as technology allows, but for everyday purposes there are 1, 2 or 4. Two tracks can be used for two independent mono signals in one or both playing directions, or a stereo signal in one direction. Four tracks (such as a cassette recorder) are organized to work pairwise for a stereo signal in each direction; a mono signal is recorded on one track (same track as the left stereo channel) or on both simultaneously (depending on the tape recorder or on how the mono signal source is connected to the recorder).
[0071] A mono sound signal does not contain any directional information. In an example, there may be several loudspeakers along a railway platform and hundreds around an airport, but the signal remains mono. Directional information cannot be generated simply by sending a mono signal to two “stereo” channels. However, an illusion of direction can be conjured from a mono signal by panning it from channel to channel.
[0072] A stereo sound signal may contain synchronized directional information from the left and right aural fields. Consequently, it uses at least two channels, one for the left field and one for the right field. The left channel is fed by a mono microphone pointing at the left field and the right channel by a second mono microphone pointing at the right field (there are also stereo microphones that have the two directional mono microphones built into one piece). In an example, quadraphonic stereo uses four channels, and surround stereo has, apart from left and right, at least additional channels for the anterior and posterior directions. Public and home cinema stereo systems can have even more channels, dividing the sound fields into narrower sectors.
[0073] Stereophonic sound or, more commonly, stereo, is a method of sound reproduction that creates an illusion of multi-directional audible perspective. This is usually achieved by using two or more independent audio channels through a configuration of two or more loudspeakers (or stereo headphones) in such a way as to create the impression of sound heard from various directions, as in natural hearing.
[0074] In one embodiment, the object of the audio signal processing method or audio signal processing apparatus is to improve the naturalness when using in-ear headphones by counteracting the occlusion effect and providing a sound pressure that can be perceived as natural. In an example, the user's voice is captured by the in-line microphone and convolved 402 with a pair of mouth to ear transfer functions (HmeTF) 401 for the left and right ear, taken from a recording or a database, respectively (
[0075] A head-related transfer function (HRTF) is a response that characterizes how an ear receives a sound from a point in space. As sound strikes the listener, the size and shape of the head, ears, ear canal, density of the head, size and shape of nasal and oral cavities, all transform the sound and affect how it is perceived, boosting some frequencies and attenuating others. Generally speaking, the HRTF boosts frequencies from 2-5 kHz with a primary resonance of +17 dB at 2,700 Hz.
[0076] A pair of HRTFs for two ears can be used to synthesize a binaural sound that seems to come from a particular point in space. It is a transfer function, describing how sound from a specific point will arrive at the ear (generally at the outer end of the auditory canal). HRTFs for left and right ear describe the filtering of sound by the sound propagation paths from the source to the left and right ears, respectively. The HRTF can also be described as the modifications to a sound from a direction in free air to the sound as it arrives at the eardrum.
[0077] The mouth to ear transfer function (HmeTF) describes the transfer function from the mouth to the eardrums. HmeTF can be measured non-individually by using a dummy head (head-torso with mouth simulator), or HmeTF can be measured individually by placing a smartphone or microphone close to the mouth of a user and reproducing a measurement signal. The measurement signal is acquired by microphones placed near the entrance of the blocked ear canal (120). The measurement signal can be a noise signal.
[0078] In an example, a HmeTF measurement can be made of a real room environment from the mouth to the ears of the same head. For simulation, a talker's voice is convolved in real-time with the HmeTF, so that the talker can hear the sound of his or her own voice in the simulated room environment. It can be shown by example how HmeTF measurements can be made using human subjects (by measuring the transfer function of speech) or by a head and torso simulator.
[0079] In an example, a HmeTF is measured using a head and torso simulator (HATS). The mouth simulator directivity of the HATS is similar to the mean long term directivity of conversational speech from humans, except in the high frequency range. The HATS' standard mouth microphone position (known as the ‘mouth reference point’) is 25 mm away from the ‘center of lip’ (which in turn is 6 mm in front of the face surface). A microphone is used at the mouth reference point. Rather than using the inbuilt microphones of the HATS (which are at the acoustic equivalent of the eardrum position), microphones positioned near the entrance of the ear canals are used. One reason is that a microphone setup similar to that of the HATS is used on a real person. The microphone setup on the real person includes microphones which may be similar or identical to the HATS microphones and which are placed at positions equivalent to those of the HATS. Another reason is that it is desirable to avoid measuring with ear canal resonance, as the strong resonant peaks would need to be inverted in the simulation, which would introduce noise and perhaps latency.
[0080] In another example, the HmeTF is measured by sending a swept sinusoid test signal to the mouth loudspeaker, the sound of which is recorded at the mouth and ear microphones. The sweep ranges from 50 Hz to 15 kHz, with a constant sweep rate on the logarithmic frequency scale over a period of 15 s. A signal suitable for deconvolving the impulse response from the sweep is sent directly to the recording device, along with the three microphone signals. This yields the impulse response (IR) from the signal generator to each microphone, and the transfer function from the mouth microphone to the ear microphones is obtained by dividing the latter by the former in the frequency domain. The procedure for this is, first, to take the Fourier transform of the direct sound from the mouth microphone impulse response, zero-padded to be twice the length of the desired impulse response. The direct sound is identified by the maximum absolute value peak of the mouth microphone IR, and data from −2 to +2 ms around this peak is used, with a Tukey window function applied (50% of the window is fade-in and fade-out using half periods of a raised cosine, and the central 50% has a constant coefficient of 1).
[0081] In another example, a Fourier transform window length is used for the ear microphone impulse responses, with the second half of the window zero-padded. The transfer function is obtained by dividing the cross-spectrum (conjugate of mouth IR multiplied by the ear IR) by the auto-spectrum of the mouth microphone's direct sound. Before returning to the time domain, a band-pass filter is applied to the transfer function to be within 100 Hz-10 kHz to avoid signal-to-noise ratio problems at the extremes of the spectrum (this is done by multiplying the spectrum components outside this range by coefficients approaching zero). After applying an inverse Fourier transform, the impulse response is truncated (discarding the latter half). The resulting IR for each ear is multiplied by the respective ratio of mouth-to-ear rms values of microphone calibration signals (sound pressure level of 94 dB) to compensate for differences in gain between channels of the recording system.
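As a non-limiting illustration, the division of the cross-spectrum by the auto-spectrum described above may be sketched as follows. The sketch assumes that the direct sound of the mouth microphone has already been extracted and windowed, uses a naive discrete Fourier transform for brevity, and omits the band-pass step and the calibration gain. The function names and the regularization constant eps are assumptions made for this illustration only.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, O(N^2); adequate for a short sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part of the time-domain signal."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def tukey(n, taper=0.5):
    """Tukey window: a fraction `taper` is raised-cosine fade-in/out, the rest is 1."""
    if n == 1:
        return [1.0]
    edge = taper * (n - 1) / 2.0
    window = []
    for i in range(n):
        distance = min(i, n - 1 - i)  # distance to the nearest window end
        if edge > 0 and distance < edge:
            window.append(0.5 * (1.0 - math.cos(math.pi * distance / edge)))
        else:
            window.append(1.0)
    return window

def estimate_transfer_ir(mouth_direct, ear_ir, eps=1e-12):
    """Mouth-to-ear impulse response from cross-spectrum / auto-spectrum.

    Both inputs are zero-padded to twice the desired length, and the latter
    half of the result is discarded. eps (an assumption) avoids division by zero.
    """
    n = 2 * len(ear_ir)
    mouth = list(mouth_direct) + [0.0] * (n - len(mouth_direct))
    ear = list(ear_ir) + [0.0] * (n - len(ear_ir))
    mouth_spec, ear_spec = dft(mouth), dft(ear)
    transfer = [m.conjugate() * e / (abs(m) ** 2 + eps)
                for m, e in zip(mouth_spec, ear_spec)]
    return idft(transfer)[:len(ear_ir)]

# Toy data: the mouth direct sound is a single sample of amplitude 2.
ir = estimate_transfer_ir([2.0], [0.0, 0.5, 0.25, 0.0])
```

In this toy case the estimated impulse response is the ear response scaled by the inverse of the mouth amplitude, which is the behavior expected of a ratio of ear to mouth signals.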
[0082] In another example, HmeTFs can be measured using a real person and using a microphone arrangement similar or identical to the one used in a HATS. The sound source could simply be speech, although other possibilities exist. The transfer function is calculated from a microphone near the mouth to each of the ear microphones. This approach was taken in measuring the transfer function from mouth to ear (without room reflections), and it can be used for measuring room reflections too. Advantages of using such a technique (compared to using the HATS) may include matching the individual long term speech directivity of the person; matching the head related transfer functions of the person's ears; and that the measurement system requires only minimal equipment.
[0083] In an example, the formula of the HmeTF depends on how it is measured. Generally, it is the ratio between the complex sound signal at the ear and the complex sound signal at the mouth, HmeTF=p_ear/p_mouth.
[0084] In another example, the HmeTF is measured using a real person and a smartphone. The microphone setup can be similar to the other examples, and the smartphone has to be positioned near the mouth. The smartphone acts as a sound source and as the reference microphone. The transfer function is calculated between the smartphone microphone (reference microphone) and the ear microphones. The advantage of this method is the increased bandwidth of the sound source compared with the speech of the real person.
[0085] Parameters of the equalization filter are based on the acoustic impedance of the headphone. The acoustic impedance of the headphone at low frequencies is highly correlated with the perceived occlusion effect, i.e., a high acoustic impedance corresponds to a strong occlusion effect caused by the headphone. The acoustic impedance of the headphone can be measured using a customized acoustic impedance tube, for example an acoustic impedance tube built in accordance with ISO 10534-2. The measurement tube may be built to fit the geometries of a human ear canal, for example, the inner diameter of the tube should be approx. 8 mm, and a frequency range should be between at least 60 Hz and 2 kHz. As shown in
[0086] In another example, the acoustic impedance of the headphone (Z.sub.HP) may be determined by calculating the difference between the Z.sub.OE.sup.Hp and Z.sub.OE:
Z.sub.HP=Z.sub.OE.sup.Hp−Z.sub.OE.
[0088] The curves 110, 111 in
[0089] The gain factor/shape (g) of an equalization filter is proportional to the inverse of Z.sub.HP:
g=a/Z.sub.HP,
where a is the scaling factor (proportional coefficient), which can either be selected by the user or determined from measurements of many different headphones.
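As a non-limiting illustration, the relation between the measured impedances and the gain may be sketched as follows. The sketch assumes that the impedance curves are available as levels in dB per frequency bin, so that the inverse of Z.sub.HP becomes a sign change and the scaling factor a becomes an offset of 20·log10(a). The function names and all numeric values are hypothetical and are not measurement results.

```python
import math

def headphone_impedance_db(z_occluded_db, z_open_db):
    """Z_HP = Z_OE^HP - Z_OE, evaluated per frequency bin (levels in dB)."""
    return [occluded - open_ear
            for occluded, open_ear in zip(z_occluded_db, z_open_db)]

def equalizer_gain_db(z_hp_db, a=1.0):
    """Gain proportional to the inverse of Z_HP; in dB: 20*log10(a) - Z_HP."""
    offset_db = 20.0 * math.log10(a)
    return [offset_db - z for z in z_hp_db]

# Hypothetical curves at 60 Hz, 250 Hz, 1 kHz and 2 kHz (placeholder values).
z_open_db = [100.0, 98.0, 95.0, 96.0]      # open ear, Z_OE
z_occluded_db = [116.0, 108.0, 97.0, 95.0]  # ear occluded by the headphone, Z_OE^HP

z_hp_db = headphone_impedance_db(z_occluded_db, z_open_db)
gain_db = equalizer_gain_db(z_hp_db, a=1.0)
```

With these placeholder values, the largest attenuation falls at the lowest frequency, where the impedance difference is largest.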
[0091] S21: processing the audio signal according to a pair of mouth to ear transfer functions.
[0092] S22: filtering the processed audio signal, using a pair of equalization filters.
[0093] S23: outputting the filtered audio signal to the headphone.
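As a non-limiting illustration, steps S21 to S23 may be sketched as follows. The sketch assumes that each mouth to ear transfer function and each equalization filter is available as a short finite impulse response; the function names and the coefficient values are hypothetical, and the playback path to the headphone is represented only by the returned pair of signals.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def process_own_voice(voice, hmetf_pair, eq_pair):
    """Apply S21 and S22 per ear and return the (left, right) signals for S23."""
    outputs = []
    for hmetf, eq in zip(hmetf_pair, eq_pair):
        processed = convolve(voice, hmetf)  # S21: mouth to ear transfer function
        filtered = convolve(processed, eq)  # S22: equalization filter
        outputs.append(filtered)
    return tuple(outputs)                   # S23: handed to the headphone output

# Hypothetical data: a three-sample voice signal and toy filter coefficients.
voice = [1.0, 0.0, -1.0]
hmetf_pair = ([0.5], [0.0, 0.5])            # left, right
eq_pair = ([1.0, -0.5], [1.0, -0.5])        # left, right
left, right = process_own_voice(voice, hmetf_pair, eq_pair)
```

Because both stages are linear, they could equally be merged into one filter per ear; they are kept separate here to mirror the order of the steps.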
[0094] Embodiment 1, telephone with headset (in-ear headphone or earbuds with in-line microphone) in a quiet environment.
[0096] The anti-occlusion hear-through equalization filter 12 is pre-designed based on the acoustic impedance of the headphone. Therefore, information about the headphone in use is required. This information can be obtained either manually or automatically. For example, the headphone can be selected 11 by the user manually based on the headphone category (for example, over-ear headphone, on-ear headphone) or the headphone model (for example, HUAWEI Earbud). The headphone can also be detected automatically from the information provided by the USB Type-C connection. For each headphone, the anti-occlusion hear-through equalization filter is then chosen based on its acoustic impedance, as mentioned above. For each category, a filter can be designed based on an averaged acoustic impedance, or a representative equalization filter can be used.
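As a non-limiting illustration, the selection of a pre-designed filter may be sketched as a lookup that prefers a model-specific entry and falls back to the headphone category. The table names, the function name and the category values are hypothetical placeholders; only the in-box earbuds entry reuses the example values of 3.5 kHz and 16 dB given in this description. How the model string is obtained (manual selection or USB Type-C information) is outside the sketch.

```python
# Hypothetical tables of pre-designed filter parameters.
FILTERS_BY_MODEL = {
    "in-box earbuds": {"cutoff_hz": 3500.0, "attenuation_db": 16.0},
}
FILTERS_BY_CATEGORY = {
    # Placeholder values standing in for category-averaged designs.
    "over-ear headphone": {"cutoff_hz": 1000.0, "attenuation_db": 6.0},
    "on-ear headphone": {"cutoff_hz": 1500.0, "attenuation_db": 8.0},
}

def select_filter(model=None, category=None):
    """Return filter parameters for a model, else for its category."""
    if model in FILTERS_BY_MODEL:
        return FILTERS_BY_MODEL[model]
    if category in FILTERS_BY_CATEGORY:
        return FILTERS_BY_CATEGORY[category]
    raise KeyError("no pre-designed filter for this headphone")

# A known model wins over the category; an unknown model falls back to it.
by_model = select_filter(model="in-box earbuds")
by_category = select_filter(model="unknown model", category="on-ear headphone")
```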
[0097] The shape of the filter should be proportional to the inverse of the acoustic impedance (0−Z.sub.HP in dB). Because low latency is needed, almost any low-order infinite impulse response (IIR) filter or finite impulse response (FIR) filter is suitable for the design of the anti-occlusion hear-through equalization filter.
[0099] The filter can be designed in two steps:
[0100] 1) As a starting point, the stopband attenuation can be determined from the acoustic impedance averaged from a low frequency (60 Hz) to the cut-off frequency. The cut-off frequency can be determined by the first zero crossing of the frequency-dependent acoustic impedance, seen from the low to the high frequency.
[0101] 2) The stopband attenuation and the cut-off frequency are then iterated by minimizing the error between the inverse of the acoustic impedance curve (target) and the designed frequency response (for example, using machine learning).
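As a non-limiting illustration, the two design steps may be sketched as follows. The sketch assumes that Z.sub.HP is available in dB per frequency bin, models the filter magnitude with a hypothetical smooth shelf function, and uses a plain shrinking-step search in place of the machine learning mentioned above. All function names and numeric values are assumptions made for this illustration only.

```python
def first_zero_crossing(freqs_hz, z_hp_db):
    """Frequency at which Z_HP first changes sign, linearly interpolated."""
    for i in range(len(freqs_hz) - 1):
        z0, z1 = z_hp_db[i], z_hp_db[i + 1]
        if z0 == 0.0:
            return freqs_hz[i]
        if z0 * z1 < 0.0:
            f0, f1 = freqs_hz[i], freqs_hz[i + 1]
            return f0 + (f1 - f0) * z0 / (z0 - z1)
    return freqs_hz[-1]  # no crossing found: fall back to the highest bin

def shelf_response_db(f, cutoff_hz, attenuation_db):
    """Hypothetical magnitude model: -attenuation well below cutoff, 0 dB above."""
    return -attenuation_db / (1.0 + (f / cutoff_hz) ** 4)

def fit_error(freqs_hz, target_db, cutoff_hz, attenuation_db):
    """Sum of squared differences between the model and the target curve."""
    return sum((shelf_response_db(f, cutoff_hz, attenuation_db) - t) ** 2
               for f, t in zip(freqs_hz, target_db))

def design_filter(freqs_hz, z_hp_db, rounds=20):
    target_db = [-z for z in z_hp_db]  # inverse of the impedance, in dB
    # Step 1: starting point from the zero crossing and the averaged impedance.
    cutoff = first_zero_crossing(freqs_hz, z_hp_db)
    band = [z for f, z in zip(freqs_hz, z_hp_db) if 60.0 <= f <= cutoff]
    atten = sum(band) / len(band) if band else 0.0
    # Step 2: iterate both parameters with a shrinking search step.
    step_c, step_a = 0.25 * cutoff, max(0.25 * abs(atten), 0.5)
    for _ in range(rounds):
        candidates = [(cutoff + dc, atten + da)
                      for dc in (-step_c, 0.0, step_c)
                      for da in (-step_a, 0.0, step_a)
                      if cutoff + dc > 0.0]
        cutoff, atten = min(
            candidates, key=lambda p: fit_error(freqs_hz, target_db, p[0], p[1]))
        step_c *= 0.7
        step_a *= 0.7
    return cutoff, atten

# Hypothetical impedance curve whose zero crossing lies at 3.5 kHz.
freqs_hz = [60.0, 1000.0, 3000.0, 4000.0]
z_hp_db = [16.0, 12.0, 2.0, -2.0]
cutoff_hz, attenuation_db = design_filter(freqs_hz, z_hp_db)
```

Because the unchanged parameter pair is always among the candidates, the error of the returned design is never larger than the error of the starting point from step 1.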
[0102] For example, for in-box earbuds, the cut-off frequency is 3.5 kHz and the stopband attenuation is 16 dB. The pre-designed filters can be stored in the cloud, in an online database provided to the user, or in the smartphone, for example.
[0103] Embodiment 2, telephone with headset (in-ear headphone or earbuds with in-line microphone) in a noisy environment.
[0104] As an example, a user is holding a teleconference with a headset in a noisy room, for example a restaurant or an airport. The user's own voice captured by the in-line microphone is combined with the environment noise, and this may decrease the perceived naturalness. In addition, the user does not want the remote user to hear the environment noise, as this may reduce the speech intelligibility.
[0105] Therefore, in the case of noisy environments, the captured user's voice is first decomposed into direct sound and ambient sound. The ambient sound is discarded. The extracted direct sound is filtered through a pair of HmeTFs and is further filtered through a pair of anti-occlusion hear-through equalization filters to simulate the direct sound part. The measured or synthesized late reverberation part is added to the direct part to simulate the quiet environment but with local room information. The signals are then played back to the user using headphones, and the naturalness while the user is speaking is enhanced. In addition, the extracted direct sound can be sent to the remote user to enhance the speech intelligibility.
[0106] In one embodiment, the binaural signals are the sum of direct sound, early reflections and late reverberation:
Left=d.sub.left(t)+e.sub.left(t)+l.sub.left(t)
Right=d.sub.right(t)+e.sub.right(t)+l.sub.right(t)
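As a non-limiting illustration, the summation of the three parts for one ear may be sketched as a sample-wise addition in which shorter parts are padded with zeros. The function name and the sample values are hypothetical.

```python
from itertools import zip_longest

def mix(*parts):
    """Sample-wise sum of signals; shorter parts are zero-padded at the end."""
    return [sum(samples) for samples in zip_longest(*parts, fillvalue=0.0)]

# Hypothetical left-ear components.
d_left = [1.0, 0.5]                # direct sound
e_left = [0.0, 0.25, 0.125]        # early reflections
l_left = [0.0, 0.0, 0.0, 0.0625]   # late reverberation
left = mix(d_left, e_left, l_left)
```

The right-ear signal is formed in the same way from the right-ear components.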
[0108] Applications of embodiments include any sound reproduction system or surround sound system using multiple loudspeakers. In particular, embodiments can be applied to, for example:
[0109] TV speaker systems,
[0110] car entertainment systems,
[0111] teleconference systems, and/or
[0112] home cinema systems,
[0113] where personal listening environments for one or multiple listeners are desirable.
[0114] The foregoing are only implementation manners of the present embodiments, and the embodiments are non-limiting. Any variations or replacements can be easily made by a person of ordinary skill in the art.