Method and apparatus for localizing multichannel sound signal
11445317 · 2022-09-13
Assignee
- Samsung Electronics Co., Ltd. (Suwon-si, KR)
- Korea Advanced Institute Of Science And Technology (Daejeon, KR)
Inventors
- Yoon-jae LEE (Seoul, KR)
- Young-jin Park (Daejeon, KR)
- Hyun Jo (Daejeon, KR)
- Sun-Min Kim (Yongin-si, KR)
- Young-Tae Kim (Seongnam-si, KR)
Cpc classification
H04S2420/07
ELECTRICITY
H04S2420/01
ELECTRICITY
H04S2400/11
ELECTRICITY
International classification
H04S5/00
ELECTRICITY
Abstract
Provided are a method and apparatus for localizing a multichannel sound signal. The method includes: obtaining a multichannel sound signal to which sense of elevation is applied by applying a first filter to an input sound signal; determining at least one frequency range of a dynamic cue according to change of a head-related transfer function (HRTF) indicating information regarding paths from a spatial location of an actual speaker to ears of an audience; and applying a second filter to at least one sound signal, corresponding to the determined at least one frequency range, of at least one channel in the multichannel sound signal to change the at least one sound signal so as to remove or to reduce the dynamic cue when the multichannel sound signal is output.
Claims
1. An immersive three-dimensional (3D) sound reproducing method comprising: receiving input channel audio signals including at least one height input channel signal and an input channel configuration; obtaining gains based on the input channel configuration and an output channel configuration; obtaining a first head-related transfer function (HRTF) based on the input channel configuration, to provide a sense of elevation using the output channel configuration indicating a plurality of output speakers located on a horizontal plane; obtaining a second HRTF used according to an input channel audio signal at a predetermined position, the input channel audio signal being output through at least two speakers at positions different from the predetermined position, based on the input channel configuration, the output channel configuration, and a frequency range of dynamic cue; obtaining a HRTF filter by dividing the second HRTF by the first HRTF; and elevation rendering the input channel audio signals based on the gains, the HRTF filter, to provide the sense of elevation using the output channel configuration, wherein the second HRTF includes filter coefficients for a plurality of frequency bands dividing the frequency range, wherein each of the at least one height input channel signal is outputted to at least two of output channel audio signals via at least two output speakers located on the horizontal plane, and wherein the frequency range of the dynamic cue is determined according to a change of the second head related transfer function.
2. The method of claim 1, wherein the dynamic cue represents speaker-to-listener orientation.
3. The method of claim 1, wherein the second HRTF is determined based on spatial locations of an output channel signal and an input channel signal located at a predetermined elevation.
4. A non-transitory computer readable recording medium having embodied thereon a computer program, which when executed by a processor, performs the method of claim 1.
5. The method of claim 1, wherein the second HRTF is determined based on spatial locations of an output channel signal and an input channel signal located on the horizontal plane.
6. The method of claim 1, wherein the first HRTF indicates information regarding paths from a spatial location of the plurality of output speakers to ears of an audience, and the second HRTF indicates information regarding paths from a spatial location of a virtual speaker located at a predetermined elevation to ears of the audience.
7. An immersive three-dimensional (3D) sound reproducing apparatus comprising: receiver configured to receive input channel audio signals including at least one height input channel signal and an input channel configuration; elevation renderer configured to obtain gains based on the input channel configuration and an output channel configuration, obtain a first head-related transfer function (HRTF) based on the input channel configuration, to provide a sense of elevation using the output channel configuration indicating a plurality of output speakers located on a horizontal plane, obtain a second HRTF used according to an input channel audio signal at a predetermined position, the input channel audio signal being output through at least two speakers at positions different from the predetermined position, based on the input channel configuration, the output channel configuration, and a frequency range of dynamic cue, obtain a HRTF filter by dividing the second HRTF by the first HRTF, and render the input channel audio signals based on the gains, and the HRTF filter, to provide the sense of elevation using the output channel configuration, wherein the second HRTF includes filter coefficients for a plurality of frequency bands dividing the frequency range, wherein each of the at least one height input channel signal is outputted to at least two of output channel audio signals via at least two output speakers located on the horizontal plane, and wherein the frequency range of the dynamic cue is determined according to a change of the second head related transfer function.
8. The apparatus of claim 7, wherein the dynamic cue represents speaker-to-listener orientation.
9. The apparatus of claim 7, wherein the second HRTF is determined based on spatial locations of an output channel signal and an input channel signal located on a same plane.
10. The apparatus of claim 9, wherein the same plane is the horizontal plane.
11. The apparatus of claim 7, wherein the first HRTF indicates information regarding paths from a spatial location of the plurality of output speakers to ears of an audience, and the second HRTF indicates information regarding paths from a spatial location of a virtual speaker located at a predetermined elevation to ears of the audience.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The above and other features and advantages will become more apparent by describing in detail exemplary embodiments with reference to the attached drawings in which:
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
(8) Exemplary embodiments will now be described more fully with reference to the accompanying drawings. An exemplary embodiment may, however, be embodied in many different forms and should not be construed as being limited to exemplary embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the inventive concept to those skilled in the art. Like reference numerals in the drawings denote like elements.
(9) The term “unit” or “module” used in the description of exemplary embodiments refers to a software component or a hardware component, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), that is configured to perform predetermined operations. However, the module or unit is not limited to software or hardware. The module may be configured to reside in an addressable recording medium, and may be configured to execute one or more processes. For example, the module may include components, such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, micro-code, circuits, data, databases, data formats, tables, arrays, and variables. Operations provided by the above components and modules may be achieved with a smaller number of components and modules by combining components and modules with each other, or with a larger number of components and modules by dividing the components and the modules.
(10)
(11) Referring to
(12) A signal replicating unit 20 (e.g., signal replicator) replicates the input signal and generates a multichannel sound signal, whereas a gain value adjusting unit 30 (e.g., gain value adjuster) applies a predetermined gain value to the sound signal of each channel, of the multichannel sound signal, and outputs the sound signals.
(13) An HRTF, which is included in the HRTF filter 10 and is applied to the input signal, may be a generalized HRTF indicating information regarding paths from an actual speaker to the ears of an audience. Therefore, a related art method of localizing a multichannel sound signal does not consider an HRTF that varies based on changes of locations of the ears of an audience or a change of an audience. As a result, the sense of elevation perceived by the audience deteriorates.
(14)
(15) Referring to
(16) An input sound signal 205 is input to the multichannel sound signal generating unit 210. The input sound signal 205 may include a mono sound signal and a multichannel sound signal. The input sound signal 205 may be a signal stored in a memory (e.g., a volatile storage or a non-volatile storage) or a signal transmitted from an external device (an audio receiver, an audio/video receiver, a set-top box, a television, a computer, a workstation, a tablet device, a portable device, a media storage, a media streaming device, etc.).
(17) The multichannel sound signal generating unit 210 may generate (e.g., obtain) a multichannel sound signal to which a sense of elevation is applied by applying, to the input sound signal 205, a first filter corresponding to a predetermined elevation. In detail, the first filter may include an HRTF filter.
(18) The HRTF includes information regarding paths from a spatial location of a sound source to both ears of an audience, that is, frequency transmission characteristics. The HRTF enables an audience to recognize stereoscopic sounds by using not only simple path differences, such as the interaural level difference (ILD) and the interaural time difference (ITD) between signals received by both ears, but also the phenomenon that characteristics of a complicated path, such as diffraction at the head surface and reflection by the pinna (earflap), change based on the direction in which sound propagates. The HRTF has unique characteristics in each direction in a space. Therefore, stereoscopic sounds may be generated by using the HRTF.
(19) Equation 1 below is an example of the first filter applied to the input sound signal 205 by the multichannel sound signal generating unit 210:
Second HRTF/First HRTF [Equation 1]
(20)
(21) Since an output sound signal 295 heard by the audience 410 is output by the actual speaker 430, to make the audience 410 sense that the output sound signal 295 is output by the virtual speaker 450, the second HRTF corresponding to a predetermined elevation θ is divided by the first HRTF corresponding to a horizontal surface (or elevation of the actual speaker 430).
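As a non-limiting illustration, the division in Equation 1 may be sketched as a per-frequency-bin ratio of two complex frequency responses. The function name `elevation_filter` and the small regularization term `eps` are assumptions introduced here for illustration only; `eps` merely guards against division by a near-zero bin and is not part of Equation 1.

```python
def elevation_filter(second_hrtf, first_hrtf, eps=1e-8):
    """Return Second HRTF / First HRTF for each frequency bin.

    second_hrtf: complex response for the virtual (elevated) speaker.
    first_hrtf:  complex response for the actual speaker.
    eps:         hypothetical regularization, not taken from the source.
    """
    if len(second_hrtf) != len(first_hrtf):
        raise ValueError("both HRTFs must have the same number of bins")
    result = []
    for h2, h1 in zip(second_hrtf, first_hrtf):
        h1 = complex(h1)
        # h2 / h1 written as h2 * conj(h1) / |h1|^2, with eps in the denominator
        result.append(h2 * h1.conjugate() / (abs(h1) ** 2 + eps))
    return result


# Toy three-bin responses (made-up values, for illustration only).
first = [1 + 0j, 2 + 0j, 0.5j]
second = [2 + 0j, 1 + 0j, 1 + 0j]
first_filter = elevation_filter(second, first)
```

With these toy values the ratio is approximately 2, 0.5, and -2j in the three bins.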
(22) An optimal HRTF corresponding to the predetermined elevation θ may vary from person to person. Therefore, after calculating an HRTF for some people in a group having similar characteristics (e.g., physical characteristics such as age and height, or preference characteristics such as preferred frequency bands and preferred genre of music), a representative value (e.g., an average value) may be determined as the HRTF to be applied to all people in the group. In other words, the second HRTF and the first HRTF in Equation 1 are generalized HRTFs corresponding to a predetermined elevation.
(23) The multichannel sound signal generating unit 210 may select a suitable second HRTF based on a location at which a virtual sound source is to be localized (e.g., an elevation angle). The multichannel sound signal generating unit 210 may select a second HRTF corresponding to a virtual sound source by using mapping information between the location of the virtual sound source and the HRTF. Information regarding the location of the virtual sound source may be received via a (software or hardware) module, such as an application, or may be input by a user.
(24) The frequency range determining unit 230 determines at least one frequency range of a dynamic cue according to a change of an HRTF indicating information regarding paths from the spatial location of an actual speaker to the ears of an audience.
(25) As described above, the first HRTF and the second HRTF included in the first filter are generalized HRTFs. Therefore, when the locations of the ears of an audience change, for example because the audience turns its head or moves to another position, information regarding paths from the spatial location of the actual speaker to the ears of the audience also changes. As a result, it may be difficult for the audience to receive a sense of elevation from the output sound signal 295 due to the dynamic cue based on factors including the change of the locations of the ears of the audience. The dynamic cue refers to the basis for receiving a sense of elevation of the output sound signal 295 (e.g., spectrum peaks and notches of sound pressure reaching the eardrums via which an audience recognizes a sense of elevation). Therefore, if the basis is changed, the audience is unable to receive the sense of elevation of the output sound signal 295.
(26)
(27)
(28) Referring to
(29) The L section may be determined in any of various manners. For example, the L section may be determined by comparing an HRTF at a first elevation to HRTFs at second elevations that are very close to the first elevation. Alternatively, the L section may be determined by comparing the HRTF at the first elevation to an HRTF corresponding to locations of the ears of an audience.
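As a non-limiting illustration, one way to compare two HRTFs and obtain candidate frequency ranges is sketched below. It assumes that the range of interest is where the two magnitude responses differ by more than a threshold; the function name `dynamic_cue_ranges`, the decibel inputs, and the 3 dB threshold are hypothetical choices made for this sketch and are not specified in the description.

```python
def dynamic_cue_ranges(freqs_hz, mag_a_db, mag_b_db, threshold_db=3.0):
    """Group frequency bins where two HRTF magnitudes differ noticeably.

    freqs_hz:     ascending bin frequencies in Hz.
    mag_a_db:     magnitude (dB) of the HRTF at the first elevation.
    mag_b_db:     magnitude (dB) of the HRTF it is compared against.
    threshold_db: hypothetical decision threshold, not from the source.
    Returns a list of (low_hz, high_hz) tuples.
    """
    ranges = []
    start = None
    last = None
    for f, a, b in zip(freqs_hz, mag_a_db, mag_b_db):
        if abs(a - b) > threshold_db:
            if start is None:
                start = f      # open a new range at this bin
            last = f           # extend the current range
        elif start is not None:
            ranges.append((start, last))  # close the range
            start = None
    if start is not None:
        ranges.append((start, last))
    return ranges


# Made-up magnitudes chosen so that two ranges are detected.
freqs = [500, 800, 900, 1000, 1200, 1500, 2000, 3000]
hrtf_a = [0, 0, 0, 0, 0, 0, 0, 0]
hrtf_b = [0, 5, 6, 4, 1, -7, -5, 0]
cue_ranges = dynamic_cue_ranges(freqs, hrtf_a, hrtf_b)
```

For the made-up data above, `cue_ranges` is `[(800, 1000), (1500, 2000)]`.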
(30) Referring back to
(31) In a case where a multichannel sound signal to which the second filter is applied is output, a sound signal, corresponding to a frequency range of a dynamic cue, of at least one channel in the multichannel sound signal may be changed to remove or reduce the dynamic cue. When a sound signal corresponding to a frequency range of the dynamic cue (i.e., a portion of a channel signal) is changed in a multichannel sound signal to remove or reduce the dynamic cue, an audience may receive a realistic sense of elevation even if locations of the ears of the audience change.
(32) For example, if frequency ranges of the dynamic cue are between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz, when channel signals of the respective channels included in a multichannel sound signal are output by a speaker, sound signals corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz from among the output channel signals may be changed to remove or reduce the dynamic cue.
(33) The second filter may include at least one from among a phase inverse filter for inverting a phase of sound signals included in the frequency ranges of the dynamic cue, an amplitude control filter for reducing amplitudes of sound signals included in the frequency ranges of the dynamic cue, and a delay filter for delaying the sound signals included in the frequency ranges of the dynamic cue.
(34) If the second filter is a phase inverse filter and the multichannel sound signal is a stereo sound signal, the second filtering unit 250 may invert the phase of sound signals in a left signal or a right signal in the stereo sound signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz by applying the phase inverse filter to the left signal or the right signal. If the phase of sound signals in the left signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz is inverted, when the left signal and the right signal are output by two-channel speakers, sound signals in the left signal and the right signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz cancel each other at locations of the ears of an audience, and thus the dynamic cue is removed.
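As a non-limiting illustration, a phase inverse filter restricted to given frequency ranges may be sketched with an FFT: the bins inside each range are multiplied by -1 and the signal is transformed back. This block-wise FFT approach, the function name `invert_phase_in_bands`, and the test tones are assumptions made for this sketch; the description does not specify how the filter is realized.

```python
import numpy as np


def invert_phase_in_bands(signal, sample_rate, bands_hz):
    """Invert the phase of `signal` only inside the given (low, high) bands."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    for low, high in bands_hz:
        mask = (freqs >= low) & (freqs <= high)
        spectrum[mask] *= -1.0  # a 180 degree phase shift in these bins
    return np.fft.irfft(spectrum, n=len(signal))


# One second of a stereo signal whose channels are identical (illustrative).
fs = 8000
t = np.arange(fs) / fs
in_band = np.sin(2 * np.pi * 900 * t)    # inside the 800-1000 Hz range
out_band = np.sin(2 * np.pi * 300 * t)   # outside both ranges
left = in_band + out_band
right = left.copy()

cue_bands = [(800, 1000), (1500, 2000)]
left_filtered = invert_phase_in_bands(left, fs, cue_bands)

# Idealized sum of both channels at a point equidistant from the speakers.
at_ear = left_filtered + right
```

In this idealized sum the 900 Hz component cancels while the 300 Hz component is preserved. Real listening conditions add different propagation paths for each speaker, so the cancellation at the ears would generally be partial.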
(35) Furthermore, if the second filter is an amplitude control filter, the second filtering unit 250 may remove or reduce the dynamic cue by changing amplitudes of sound signals from among channel signals of the respective channels of a multichannel sound signal, the sound signals corresponding to the frequency ranges of the dynamic cue. For example, after sound signals in a left signal and a right signal in a stereo sound signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz are divided or obtained according to frequency bands, amplitudes of the sound signals of the respective divided frequency bands may be adjusted to be different in the left signal and the right signal, and thus the dynamic cue may be reduced. Alternatively, the dynamic cue may be reduced by adjusting amplitudes of sound signals from among the channel signals of the respective channels in a multichannel sound signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz to be close to zero.
(36) Furthermore, if the second filter is a delay filter and the multichannel sound signal is a stereo sound signal, the second filtering unit 250 may apply the delay filter to a left signal or a right signal in the stereo sound signal. For example, the dynamic cue may be removed by delaying sound signals in the left signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz, such that the difference between the phase of the sound signals in the left signal and the phase of the sound signals in the right signal corresponding to the frequency ranges between 800 Hz and 1000 Hz and between 1500 Hz and 2000 Hz is 180°.
(37) If a multichannel sound signal includes signals of two or more channels (e.g., 5.1 channels or 7.1 channels), the dynamic cue may be removed or reduced by using at least one filter from among a phase inverse filter, an amplitude control filter, and a delay filter. Any of various methods for removing or reducing the dynamic cue may be employed.
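As a non-limiting worked example, the delay needed for a 180° phase difference at a single frequency f is half a period, 1/(2f). A single fixed delay produces exactly 180° at only one frequency, so the sketch below assumes the delay is chosen for the arithmetic center of a range; that choice and the function names are hypothetical and are not specified in the description.

```python
def half_cycle_delay_s(freq_hz):
    """Delay in seconds that shifts a tone at freq_hz by 180 degrees."""
    return 1.0 / (2.0 * freq_hz)


def phase_shift_deg(freq_hz, delay_s):
    """Phase shift in degrees that a pure delay produces at freq_hz."""
    return 360.0 * freq_hz * delay_s


# Range between 800 Hz and 1000 Hz, delay chosen for its center (assumption).
center_hz = (800.0 + 1000.0) / 2.0          # 900 Hz
delay = half_cycle_delay_s(center_hz)       # 1/1800 s, about 0.556 ms

shift_center = phase_shift_deg(center_hz, delay)  # 180 degrees
shift_low = phase_shift_deg(800.0, delay)         # 160 degrees
shift_high = phase_shift_deg(1000.0, delay)       # 200 degrees
```

Under this assumption the phase difference stays within 20° of 180° across the 800 Hz to 1000 Hz range, which suggests that narrower ranges, or a separate delay for each divided band, would approximate the 180° condition more closely.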
(38)
(39) Referring to
(40) The multichannel sound signal generating unit 310 may include a first filtering unit 315 and a signal replicating unit 317. The first filtering unit 315 applies a first filter to an input sound signal 305. In one or more exemplary embodiments, the first filtering unit 315 may also apply the first filter to the signal replicating unit 317. The first filter may include an HRTF filter. The signal replicating unit 317 generates (e.g., obtains) a multichannel sound signal by replicating the input sound signal 305 to which the first filter is applied. Although
(41) If the input sound signal 305 is a mono signal, the signal replicating unit 317 may generate a multichannel sound signal, such as a stereo sound signal, a 5.1 channel sound signal, or a 7.1 channel sound signal, by replicating the mono sound signal.
(42) The amplitude adjusting unit 370 adjusts amplitudes of sound signals of the respective channels of the multichannel sound signal, such that a virtual speaker is located at a predetermined position on a horizontal surface including the virtual speaker located at a predetermined elevation. To localize a multichannel sound signal, which is localized to a predetermined elevation, in a predetermined direction on the horizontal surface at the predetermined elevation, the multichannel sound signal may be localized on the horizontal surface by adjusting amplitudes of sound signals of the respective channels by applying suitable gain values to the sound signals of the respective channels. As a result, an audience may receive not only a sense of elevation, but also a directional impression from an output sound signal 395 output by a speaker.
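As a non-limiting illustration, replicating a mono signal into two channels and applying a gain value to each channel may be sketched as follows. The constant-power pan law used to derive the gains is an assumption substituted here for the unspecified "suitable gain values", and the names `constant_power_gains` and `replicate_and_pan` are hypothetical.

```python
import math


def constant_power_gains(pan):
    """Map pan in [-1, 1] (-1 = left, +1 = right) to (left_gain, right_gain).

    Constant-power law (an assumed choice): left^2 + right^2 == 1.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be in [-1, 1]")
    angle = (pan + 1.0) * math.pi / 4.0   # 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def replicate_and_pan(mono, pan):
    """Replicate a mono signal into two channels and scale each by its gain."""
    left_gain, right_gain = constant_power_gains(pan)
    left = [left_gain * s for s in mono]
    right = [right_gain * s for s in mono]
    return left, right


# Centered source: both channels receive the same gain, about 0.707.
center_left, center_right = constant_power_gains(0.0)

# Source fully to the left: the right channel is silent.
left_ch, right_ch = replicate_and_pan([1.0, -0.5, 0.25], -1.0)
```

For layouts with more than two channels, the same idea would extend to one gain value per channel, chosen according to the direction in which the sound is to be localized on the horizontal surface.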
(43)
(44) In operation S610, the multichannel sound signal localizing apparatus 200 generates a multichannel sound signal to which a sense of elevation is applied by applying a first filter corresponding to a predetermined elevation to an input sound signal. The input sound signal may include a mono sound signal or a stereo sound signal, and the multichannel sound signal may have more channels than the input sound signal.
(45) In operation S620, the multichannel sound signal localizing apparatus 200 determines a frequency range of a dynamic cue according to a change of an HRTF indicating information regarding paths from the spatial location of an actual speaker to the ears of an audience. Due to the dynamic cue according to the change of the HRTF, the sense of elevation received by an audience from a sound signal output by the speaker may deteriorate.
(46) In operation S630, the multichannel sound signal localizing apparatus 200 applies a second filter to a sound signal of at least one channel from among the multichannel sound signal. When the multichannel sound signal to which the second filter is applied is output by a speaker, a signal in the multichannel sound signal to which the second filter is applied corresponding to the frequency range of the dynamic cue is changed to remove or reduce the dynamic cue. In other words, the dynamic cue of the multichannel sound signal may be removed or reduced by the second filter, and thus a realistic sense of elevation may be provided to an audience.
(47) One or more exemplary embodiments can be written as computer programs and can be implemented in general-use digital computers that execute the programs using a computer readable recording medium.
(48) Examples of the computer readable recording medium include magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.), optical recording media (e.g., CD-ROMs, or DVDs), etc. Moreover, one or more of the above-described elements can include a processor or microprocessor executing a computer program stored in a computer-readable medium.
(49) While exemplary embodiments have been particularly shown and described above, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the following claims.