Method for providing a spatialized soundfield
11363402 · 2022-06-14
CPC classification (Section H, Electricity): H04S7/305; H04S2420/01; H04S2420/13; H04S7/308; H04S2400/11; H04R2201/405; H04S7/301; H04R2203/12
Abstract
A signal processing system and method for delivering spatialized sound by optimizing sound waveforms from a sparse array of speakers to the ears of a user. The system can define listening areas within a room or space and deliver spatialized sound to create a 3D audio effect. In a binaural mode, the speaker array provides targeted beams aimed towards a user's ears.
Claims
1. A method for producing transaural spatialized sound, comprising: receiving audio signals representing spatial audio objects; filtering each audio signal through a spatialization filter to generate an array of virtual audio transducer signals for a virtual audio transducer array representing spatialized audio; segregating the array of virtual audio transducer signals into subsets each comprising a plurality of virtual audio transducer signals, each subset being for driving a physical audio transducer situated within a physical location range of the respective subset; time-offsetting respective virtual audio transducer signals of a respective subset based on a time difference of arrival of a sound from a nominal location of respective virtual audio transducer and the physical location of the corresponding physical audio transducer with respect to a targeted ear of a listener; and combining the time-offset respective virtual speaker signals of the respective subset as a physical audio transducer drive signal.
2. The method according to claim 1, further comprising abating a peak amplitude of the combined time-offset respective virtual audio transducer signals to reduce saturation distortion of the physical audio transducer.
3. The method according to claim 1, wherein said filtering comprises processing at least two audio channels with a graphic processing unit configured to act as an audio signal processor.
4. The method according to claim 1, wherein the array of virtual audio transducer signals is a linear array of 12 virtual audio transducers.
5. The method according to claim 1, wherein the virtual audio transducer array is a linear array having at least 3 times a number of virtual audio transducer signals as physical audio transducer drive signals.
6. The method according to claim 1, wherein each subset is a non-overlapping adjacent group of virtual audio transducer signals.
7. The method according to claim 6, wherein each subset is a non-overlapping adjacent group of at least 6 virtual audio transducer signals.
8. The method according to claim 1, wherein each subset has a virtual audio transducer with a location which overlaps a represented location range of another subset of virtual audio transducer signals.
9. The method according to claim 8, wherein the overlap is one virtual audio transducer signal.
10. The method according to claim 1, wherein the array of virtual audio transducer signals is a linear array having 12 virtual audio transducer signals, divided into two non-overlapping groups of 6 adjacent virtual audio transducer signals each, which are respectively combined to form 2 physical audio transducer drive signals.
11. The method according to claim 10, wherein the corresponding physical audio transducer for each group is located between the 3rd and 4th virtual audio transducer of the adjacent group of 6 virtual audio transducer signals.
12. The method according to claim 1, wherein said filtering comprises cross-talk cancellation.
13. The method according to claim 1, wherein said filtering is performed using reentrant data filters.
14. The method according to claim 1, further comprising receiving a signal representing an ear location of the listener.
15. The method according to claim 1, further comprising tracking a movement of the listener, and adapting the filtering dependent on the tracked movement.
16. The method according to claim 1, further comprising adaptively assigning virtual audio transducer signals to respective subsets.
17. The method according to claim 1, further comprising: adaptively determining a head related transfer function of a listener; filtering according to the adaptively determined head related transfer function; sensing a characteristic of a head of the listener; and adapting the head related transfer function in dependence on the characteristic.
18. A system for producing transaural spatialized sound, comprising: an input configured to receive audio signals representing spatial audio objects; a spatialization audio data filter, configured to process each audio signal to generate an array of virtual audio transducer signals for a virtual audio transducer array representing spatialized audio, the array of virtual audio transducer signals being segregated into subsets each comprising a plurality of virtual audio transducer signals, each subset being for driving a physical audio transducer situated within a physical location range of the respective subset; a time-delay processor, configured to time-offset respective virtual audio transducer signals of a respective subset based on a time difference of arrival of a sound from a nominal location of respective virtual audio transducer and the physical location of the corresponding physical audio transducer with respect to a targeted ear of a listener; and a combiner, configured to combine the time-offset respective virtual speaker signals of the respective subset as a physical audio transducer drive signal.
19. The system according to claim 18, further comprising at least one of: a peak amplitude abatement filter configured to reduce saturation distortion of the physical audio transducer of the combined time-offset respective virtual audio transducer signals; a limiter configured to reduce saturation distortion of the physical audio transducer of the combined time-offset respective virtual audio transducer signals; a compander configured to reduce saturation distortion of the physical audio transducer of the combined time-offset respective virtual audio transducer signals; and a phase rotator configured to rotate a relative phase of at least one virtual audio transducer signal.
20. A system for producing spatialized sound, comprising: an input configured to receive audio signals representing spatial audio objects; at least one automated processor, configured to: process each audio signal through a spatialization filter to generate an array of virtual audio transducer signals for a virtual audio transducer array representing spatialized audio, the array of virtual audio transducer signals being segregated into subsets each comprising a plurality of virtual audio transducer signals, each subset being for driving a physical audio transducer situated within a physical location range of the respective subset; time-offset respective virtual audio transducer signals of a respective subset based on a time difference of arrival of a sound from a nominal location of respective virtual audio transducer and the physical location of the corresponding physical audio transducer with respect to a targeted ear of a listener; and combine the time-offset respective virtual speaker signals of the respective subset as a physical audio transducer drive signal; and at least one output port configured to present the physical audio transducer drive signals for respective subsets.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(16) In binaural mode, the speaker array provides two sound outputs aimed towards the primary listener's ears. The inverse filter design method derives from a mathematical simulation in which a model approximating the real-world speaker array is created and virtual microphones are placed throughout the target sound field. A target function across these virtual microphones is created or requested. By solving the inverse problem with regularization, stable and realizable inverse filters are created for each speaker element in the array. The source signals are convolved with these inverse filters for each array element.
(17) In a second beamforming, or wave field synthesis (WFS), mode, the transform processor array provides sound signals representing multiple discrete sources to separate physical locations in the same general area. Masking signals may also be dynamically adjusted in amplitude and time to optimally mask a listener's signal of interest, rendering it unintelligible outside the intended listening area.
(18) The WFS mode also uses inverse filters. Instead of aiming just two beams at the listener's ears, this mode uses multiple beams aimed or steered to different locations around the array.
(19) The technology involves a digital signal processing (DSP) strategy that allows for both binaural rendering and WFS/sound beamforming, either separately or simultaneously in combination. As noted above, the virtual spatialization is then combined for a small number of physical transducers, e.g., 2 or 4.
(20) For both binaural and WFS mode, the signal to be reproduced is processed by filtering it through a set of digital filters. These filters may be generated by numerically solving an electro-acoustical inverse problem. The specific parameters of the specific inverse problem to be solved are described below. In general, however, the digital filter design is based on the principle of minimizing, in the least squares sense, a cost function of the type J=E+βV
(21) The cost function is a sum of two terms: a performance error E, which measures how well the desired signals are reproduced at the target points, and an effort penalty βV, which is a quantity proportional to the total power that is input to all the loudspeakers. The positive real number β is a regularization parameter that determines how much weight to assign to the effort term. Note that, according to the present implementation, the cost function may be applied after the summing, and optionally after the limiter/peak abatement function is performed.
(22) By varying β from zero to infinity, the solution changes gradually from minimizing the performance error only to minimizing the effort cost only. In practice, this regularization works by limiting the power output from the loudspeakers at frequencies at which the inversion problem is ill-conditioned. This is achieved without affecting the performance of the system at frequencies at which the inversion problem is well-conditioned. In this way, it is possible to prevent sharp peaks in the spectrum of the reproduced sound. If necessary, a frequency dependent regularization parameter can be used to attenuate peaks selectively.
(23) Wave Field Synthesis/Beamforming Mode
(24) WFS sound signals are generated for a linear array of virtual speakers, which define several separated sound beams. In WFS mode operation, different source content from the loudspeaker array can be steered to different angles by using narrow beams to minimize leakage to adjacent areas during listening. As shown in
(25) When the virtual speaker signals are combined, a significant portion of the spatial sound cancellation ability is lost; however, it is at least theoretically possible to optimize the sound at each of the listener's ears for the direct (i.e., non-reflected) sound path.
(26) In the WFS mode, the array provides multiple discrete source signals. For example, three people could be positioned around the array listening to three distinct sources with little interference from each other's signals.
(27) The WFS mode signals are generated through the DSP chain as shown in
(29) An M×N matrix H(f) is computed, which represents the electro-acoustical transfer function between each loudspeaker of the array and each control point, as a function of the frequency f, where H_p,l corresponds to the transfer function between the l-th speaker (of N speakers) and the p-th control point 92. These transfer functions can either be measured or defined analytically from an acoustic radiation model of the loudspeaker. One example of a model is given by an acoustical monopole, given by the following equation:
(30) H_p,l(f) = e^(−j2πf·r_l,p/c) / (4π·r_l,p)
(31) where c is the speed of sound propagation, f is the frequency, and r_l,p is the distance between the l-th loudspeaker and the p-th control point.
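The monopole model above can be sketched in NumPy. The 12-element, 38 mm-spaced geometry below is borrowed from the worked example later in this description; the control-point (ear) placement is an illustrative assumption.

```python
import numpy as np

def monopole_matrix(speaker_pos, control_pos, f, c=343.0):
    """Electro-acoustical transfer matrix H(f) under the monopole model:
    H[p, l] = exp(-j*2*pi*f*r/c) / (4*pi*r), where r is the distance
    between the l-th loudspeaker and the p-th control point."""
    # pairwise distances r_l,p, shape (M control points, N speakers)
    r = np.linalg.norm(control_pos[:, None, :] - speaker_pos[None, :, :],
                       axis=-1)
    return np.exp(-2j * np.pi * f * r / c) / (4 * np.pi * r)

# 12-speaker linear array, 38 mm spacing, centered on the origin
spk = np.stack([(np.arange(12) - 5.5) * 0.038, np.zeros(12)], axis=1)
# two control points (e.g., a listener's ears) 0.5 m in front of the array
ears = np.array([[-0.09, 0.5], [0.09, 0.5]])
H = monopole_matrix(spk, ears, f=1000.0)   # shape (2, 12)
```

The same routine evaluated over a grid of frequencies yields the matrices needed by the filter-design step below.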
(32) Instead of correcting for time delays after the array signals are fully defined, it is also possible to use the correct speaker location while generating the signal, to avoid reworking the signal definition.
(33) A more advanced analytical radiation model for each loudspeaker may be obtained by a multipole expansion, as is known in the art. (See, e.g., V. Rokhlin, "Diagonal forms of translation operators for the Helmholtz equation in three dimensions", Applied and Computational Harmonic Analysis, 1:82-93, 1993.)
(34) A vector p(f) is defined with M elements representing the target sound field at the locations identified by the control points 92 and as a function of the frequency f. There are several choices of the target field. One possibility is to assign the value of 1 to the control point(s) that identify the direction(s) of the desired sound beam(s) and zero to all other control points.
(35) The digital filter coefficients are defined in the frequency (f) domain or digital-sampled (z)-domain and are the N elements of the vector a(f) or a(z), which is the output of the filter computation algorithm. The filter may have different topologies, such as FIR, IIR, or other types. The vector a is computed by solving, for each frequency f or sample parameter z, a linear optimization problem that minimizes, e.g., the following cost function
J(f) = ‖H(f)a(f) − p(f)‖² + β‖a(f)‖²
(36) The symbol ‖·‖ indicates the L² norm of a vector, and β is a regularization parameter, whose value can be defined by the designer. Standard optimization algorithms can be used to numerically solve the problem above.
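This cost function has a closed-form minimizer per frequency bin, which a short NumPy sketch can illustrate; the random transfer matrix and the β values here are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def spatialization_filters(H, p, beta):
    """Minimize J(f) = ||H a - p||^2 + beta*||a||^2 for one frequency bin.
    The closed-form minimizer is a = (H^H H + beta*I)^(-1) H^H p; as
    beta -> 0 this tends to the minimum-L2-norm least-squares solution."""
    n = H.shape[1]
    return np.linalg.solve(H.conj().T @ H + beta * np.eye(n),
                           H.conj().T @ p)

rng = np.random.default_rng(0)
H = rng.standard_normal((2, 12)) + 1j * rng.standard_normal((2, 12))
p = np.array([1.0, 0.0])                 # beam toward the first control point
a_small = spatialization_filters(H, p, beta=1e-6)
a_large = spatialization_filters(H, p, beta=1e3)
# heavier regularization trades reproduction accuracy for less array effort
```

Sweeping β per frequency implements the frequency-dependent regularization mentioned above: large β where the inversion is ill-conditioned, small β where it is well-conditioned.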
(37) Referring now to
(38) For each sound source 102, the input signal is filtered through a set of N digital filters 104, with one digital filter 104 for each loudspeaker of the array. These digital filters 104 are referred to as “spatialization filters”, which are generated by the algorithm disclosed above and vary as a function of the location of the listener(s) and/or of the intended direction of the sound beam to be generated.
(39) The digital filters may be implemented as finite impulse response (FIR) filters; however, greater efficiency and better modelling of response may be achieved using other filter topologies, such as infinite impulse response (IIR) filters, which employ feedback or re-entrancy. The filters may be implemented in a traditional DSP architecture, or within a graphics processing unit (GPU, developer.nvidia.com/vrworks-audio-sdk-depth) or audio processing unit (APU, www.nvidia.com/en-us/drivers/apu/). Advantageously, the acoustic processing algorithm may be expressed as a ray-tracing, transparency, and scattering model.
(40) For each sound source 102, the audio signal filtered through the n-th digital filter 104 (i.e., corresponding to the n-th loudspeaker) is summed at combiner 106 with the audio signals corresponding to the different audio sources 102 but to the same n-th loudspeaker. The summed signals are then output to loudspeaker array 108.
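The per-source filter-and-sum structure just described can be sketched as follows; the array sizes are toy values, and a real implementation would use partitioned FFT convolution rather than direct `np.convolve` for efficiency.

```python
import numpy as np

def filter_and_sum(sources, filters):
    """sources: (S, L) mono signals; filters: (S, N, T) FIR spatialization
    filters, one length-T filter per (source, loudspeaker) pair.
    Returns (N, L+T-1) loudspeaker feeds: each source is filtered through
    its own N filters, then the results are summed per loudspeaker."""
    S, N, T = filters.shape
    out = np.zeros((N, sources.shape[1] + T - 1))
    for s in range(S):
        for n in range(N):
            out[n] += np.convolve(sources[s], filters[s, n])
    return out

sources = np.zeros((2, 8)); sources[:, 0] = 1.0   # two impulse sources
filters = np.ones((2, 4, 3)) * 0.5                # toy 4-loudspeaker bank
feeds = filter_and_sum(sources, filters)          # shape (4, 10)
```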
(42) The PBEP 112 allows the listener to perceive sound information contained in the lower part of the audio spectrum by generating higher-frequency sound material that provides the perception of the lower frequencies. Since the PBE processing is non-linear, it is important that it comes before the spatialization filters 104. If the non-linear PBEP block 112 were inserted after the spatial filters, its effect could severely degrade the creation of the sound beam.
(43) It is important to emphasize that the PBEP 112 is used in order to compensate (psycho-acoustically) for the poor directionality of the loudspeaker array at lower frequencies rather than compensating for the poor bass response of single loudspeakers themselves, as is normally done in prior art applications.
(44) The DRCE 114 in the DSP chain provides loudness matching of the source signals so that adequate relative masking of the output signals of the array 108 is preserved. In the binaural rendering mode, the DRCE used is a 2-channel block which makes the same loudness corrections to both incoming channels.
(45) As with the PBEP block 112, because the DRCE 114 processing is non-linear, it is important that it comes before the spatialization filters 104. If the non-linear DRCE block 114 were to be inserted after the spatial filters 104, its effect could severely degrade the creation of the sound beam. However, without this DSP block, psychoacoustic performance of the DSP chain and array may decrease as well.
(46) Another optional component is a listener tracking device (LTD) 116, which allows the apparatus to receive information on the location of the listener(s) and to dynamically adapt the spatialization filters in real time. The LTD 116 may be a video tracking system which detects the listener's head movements, or can be another type of motion sensing system as is known in the art. The LTD 116 generates a listener tracking signal which is input into a filter computation algorithm 118. The adaptation can be achieved either by re-calculating the digital filters in real time or by loading a different set of filters from a pre-computed database. Alternate user localization techniques include radar (e.g., heartbeat detection), lidar tracking, RFID/NFC tracking, breath sounds, etc.
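One way to realize the pre-computed-database option is to quantize the tracked head position and cache one filter set per grid cell, so filters are recomputed only when the listener crosses into a new cell. The grid size and key scheme below are illustrative assumptions, not specified in the text.

```python
def quantize_position(x, y, grid=0.05):
    """Snap a tracked head position (metres) to a coarse grid so a
    pre-computed filter set can be looked up instead of re-solved."""
    return (round(x / grid) * grid, round(y / grid) * grid)

class FilterDatabase:
    """Tiny cache mapping quantized listener positions to filter sets.
    Filters are computed once per grid cell and reused as the listener
    moves within that cell."""
    def __init__(self, compute):
        self.compute = compute   # callable: quantized position -> filters
        self.cache = {}
    def lookup(self, x, y):
        key = quantize_position(x, y)
        if key not in self.cache:
            self.cache[key] = self.compute(key)
        return self.cache[key]

calls = []
db = FilterDatabase(lambda key: (calls.append(key), key)[1])
f1 = db.lookup(0.0, 0.0)
f2 = db.lookup(0.01, 0.01)   # same grid cell: cache hit, no recompute
f3 = db.lookup(0.10, 0.0)    # new cell: filters recomputed
```

Interpolating between neighboring cells, or re-solving in real time as the text also suggests, would smooth transitions at cell boundaries.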
(48) Binaural Mode
(49) The DSP for the binaural mode involves the convolution of the audio signal to be reproduced with a set of digital filters representing a Head-Related Transfer Function (HRTF).
(52) The binaural mode signal processing chain, shown in
(53) In the binaural mode, the invention generates sound signals feeding a virtual linear array. The virtual linear array signals are combined into speaker driver signals. The speakers provide two sound beams aimed towards the primary listener's ears—one beam for the left ear and one beam for the right ear.
(55) As described with reference to
(56) For each sound source 32, the input signal is filtered through two digital filters 34 (HRTF-L and HRTF-R) representing a left and right Head-Related Transfer Function, calculated for the angle at which the given sound source 32 is intended to be rendered to the listener. For example, the voice of a talker can be rendered as a plane wave arriving from 30 degrees to the right of the listener. The HRTF filters 34 can be either taken from a database or computed in real time using a binaural processor. After the HRTF filtering, the processed signals corresponding to different sound sources but to the same ear (left or right) are merged together at combiner 35. This generates two signals, hereafter referred to as "total binaural signal-left" ("TBS-L") and "total binaural signal-right" ("TBS-R"), respectively.
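The HRTF filtering and merge into TBS-L/TBS-R can be sketched as below; the toy impulse "HRTFs" (a bare 2-sample interaural delay) stand in for measured or database filters.

```python
import numpy as np

def total_binaural_signals(sources, hrtf_l, hrtf_r):
    """sources: (S, L) mono signals; hrtf_l/hrtf_r: (S, T) FIR filters.
    Each source is filtered through its left/right HRTF pair, then all
    left-ear (and, separately, all right-ear) results are merged."""
    T = hrtf_l.shape[1]
    L = sources.shape[1] + T - 1
    tbs_l, tbs_r = np.zeros(L), np.zeros(L)
    for s in range(sources.shape[0]):
        tbs_l += np.convolve(sources[s], hrtf_l[s])
        tbs_r += np.convolve(sources[s], hrtf_r[s])
    return tbs_l, tbs_r

src = np.zeros((1, 8)); src[0, 0] = 1.0    # one impulse source
hl = np.array([[1.0, 0.0, 0.0, 0.0]])      # toy left-ear HRTF
hr = np.array([[0.0, 0.0, 1.0, 0.0]])      # toy right-ear HRTF: 2-sample ITD
tbs_l, tbs_r = total_binaural_signals(src, hl, hr)
```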
(57) Each of the two total binaural signals, TBS-L and TBS-R, is filtered through a set of N digital filters 36, one for each loudspeaker, computed using the algorithm disclosed below. These filters are referred to as "spatialization filters". It is emphasized for clarity that the set of spatialization filters for the right total binaural signal is different from the set for the left total binaural signal.
(58) The filtered signals corresponding to the same n-th virtual speaker but for the two different ears (left and right) are summed together at combiners 37. These are the virtual speaker signals, which feed the combiner system, which in turn feeds the physical speaker array 38.
(59) The algorithm for the computation of the spatialization filters 36 for the binaural modality is analogous to that used for the WFS modality described above. The main difference from the WFS case is that only two control points are used in the binaural mode. These control points correspond to the location of the listener's ears and are arranged as shown in
(60) The 2×N matrix H(f) is computed using elements of the electro-acoustical transfer functions between each loudspeaker and each control point, as a function of the frequency f. These transfer functions can be either measured or computed analytically, as discussed above. A 2-element vector p is defined. This vector can be either [1,0] or [0,1], depending on whether the spatialization filters are computed for the left or right ear, respectively. The filter coefficients for the given frequency f are the N elements of the vector a(f) computed by minimizing the following cost function
J(f) = ‖H(f)a(f) − p(f)‖² + β‖a(f)‖²
(61) If multiple solutions are possible, the solution is chosen that corresponds to the minimum value of the L² norm of a(f).
(63) It is important to emphasize that the PBEP 52 is used in order to compensate (psycho-acoustically) for the poor directionality of the loudspeaker array at lower frequencies rather than compensating for the poor bass response of single loudspeakers themselves.
(64) The DRCE 54 in the DSP chain provides loudness matching of the source signals so that adequate relative masking of the output signals of the array 38 is preserved. In the binaural rendering mode, the DRCE used is a 2-channel block which makes the same loudness corrections to both incoming channels.
(65) As with the PBEP block 52, because the DRCE 54 processing is non-linear, it is important that it comes before the spatialization filters 36. If the non-linear DRCE block 54 were to be inserted after the spatial filters 36, its effect could severely degrade the creation of the sound beam. However, without this DSP block, psychoacoustic performance of the DSP chain and array may decrease as well.
(66) Another optional component is a listener tracking device (LTD) 56, which allows the apparatus to receive information on the location of the listener(s) and to dynamically adapt the spatialization filters in real time. The LTD 56 may be a video tracking system which detects the listener's head movements or can be another type of motion sensing system as is known in the art. The LTD 56 generates a listener tracking signal which is input into a filter computation algorithm 58. The adaptation can be achieved either by re-calculating the digital filters in real time or by loading a different set of filters from a pre-computed database.
(68) WFS and binaural mode processing can be combined into a single device to produce total sound field control. Such an approach would combine the benefits of directing a selected sound beam to a targeted listener, e.g., for privacy or enhanced intelligibility, and separately controlling the mixture of sound that is delivered to the listener's ears to produce surround sound. The device could process audio using binaural mode or WFS mode in the alternative or in combination. Although not specifically illustrated herein, the use of both the WFS and binaural modes would be represented by the block diagrams of
Example
(69) A 12-channel spatialized virtual audio array is implemented in accordance with U.S. Pat. No. 9,578,440. This virtual array provides signals for driving a linear or curvilinear equally-spaced array of, e.g., 12 speakers situated in front of a listener. The virtual array is divided into two or four subsets. In the case of two, the left six signals are directed to the left physical speaker, and the right six signals are directed to the right physical speaker. The virtual signals are then summed, with at least two intermediate processing steps.
(70) The first intermediate processing step compensates for the time difference between the nominal location of each virtual speaker and the physical location of the speaker transducer. For example, the virtual speaker closest to the listener is assigned a reference delay, and the farther virtual speakers are assigned increasing delays. In a typical case, the virtual array is situated such that the time differences for adjacent virtual speakers vary incrementally, though a more rigorous analysis may be implemented. At a 48 kHz sampling rate, the difference between the nearest and farthest virtual speaker may be, e.g., 4 samples.
(71) The second intermediate processing step limits the peaks of the signal, in order to avoid over-driving the physical speaker or causing significant distortion. This limiting may be frequency selective, so that only a frequency band is affected by the process. This step should be performed after the delay compensation. For example, a compander may be employed. Alternately, presuming only rare peaking, a simple limiter may be employed. In other cases, a more complex peak abatement technology may be employed, such as a phase shift of one or more of the channels, typically based on a predicted peaking of the signals, which are delayed slightly from their real-time presentation. Note that this phase shift alters the time delay applied in the first intermediate processing step; however, when the physical limit of the system is reached, a compromise is necessary.
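As one concrete choice of simple limiter (an assumption, since the text leaves the implementation open), a tanh soft clipper bounds the summed signal while leaving quiet passages nearly untouched:

```python
import numpy as np

def soft_limit(x, threshold=1.0):
    """Peak abatement by tanh soft clipping: output magnitude never
    exceeds `threshold`, and since tanh(u) ~ u for |u| << 1, signals
    well below the threshold pass through almost unchanged."""
    return threshold * np.tanh(np.asarray(x, dtype=float) / threshold)

peaks = soft_limit(np.array([5.0, -7.0, 0.01]))
```

A lookahead compander or a frequency-selective limiter, as the text suggests, would replace this block in a production chain.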
(72) With a virtual line array of 12 speakers and 2 physical speakers, the physical speaker locations are between elements 3-4 and 9-10. If s is the center-to-center distance between speakers, then the distance from the center of the array to the center of each real speaker is A = 3s. The left speaker is offset −A from the center, and the right speaker is offset +A.
(73) The second intermediate processing step is principally a downmix of the six virtual channels, with a limiter, compressor, or other peak-abatement process applied to prevent saturation or clipping. For example, the left channel is:
L_out = Limit(L_1 + L_2 + L_3 + L_4 + L_5 + L_6)
(74) and the right channel is
R_out = Limit(R_1 + R_2 + R_3 + R_4 + R_5 + R_6)
(75) Before the downmix, the difference in delays between the virtual speakers and the listener's ears, compared to the physical speaker transducer and the listener's ears, needs to be taken into account. This delay can be significant, particularly at higher frequencies, since the ratio of the length of the virtual speaker array to the wavelength of the sound increases. To calculate the distance from the listener to each virtual speaker, assume that the speakers n are numbered 1 to 6, where 1 is the speaker closest to the center and 6 is the farthest from center. The distance from the center of the array to the speaker is d = ((n−1)+0.5)·s. Using the Pythagorean theorem, the distance from the speaker to the listener can be calculated as follows:
d_n = √(l² + (((n−1)+0.5)·s)²)
(76) The distance from the real speaker to the listener is
d_r = √(l² + (3s)²)
(77) The sample delay for each speaker can be calculated from the difference between the two listener distances. This can then be converted to samples (assuming the speed of sound is 343 m/s and the sample rate is 48 kHz):
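The two distance formulas and the sample conversion can be checked numerically; the constants (343 m/s, 48 kHz) and the geometry (38 mm spacing, 500 mm listener distance) are those given in this example.

```python
import math

def sample_delay(n, s, l, c=343.0, fs=48000):
    """Delay, in samples, between the n-th virtual speaker (n = 1..6,
    counted outward from the array center) and the physical speaker,
    as seen from a listener at perpendicular distance l (metres) from
    the array; s is the center-to-center speaker spacing (metres)."""
    d_n = math.sqrt(l**2 + (((n - 1) + 0.5) * s) ** 2)   # virtual speaker
    d_r = math.sqrt(l**2 + (3 * s) ** 2)                 # physical speaker
    return (d_n - d_r) / c * fs

# far virtual speaker of the worked example: 38 mm spacing, listener 0.5 m away
d6_delay = sample_delay(6, 0.038, 0.5)   # about 4 samples
```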
(78) Δ_n = (d_n − d_r)/c · f_s, with c = 343 m/s and f_s = 48 kHz
(79) This can lead to a significant delay between listener distances. For example, if the virtual array inter-speaker distance is 38 mm, and the listener is 500 mm from the array, the delay from the virtual far-left speaker (n=6) to the real speaker is:
(80) d_6 − d_r ≈ 542 mm − 513 mm ≈ 29 mm, or about 4 samples
(81) At higher audio frequencies, e.g., 12 kHz, an entire wave cycle is 4 samples, so the difference amounts to a 360° phase shift. See Table 1.
(82) Thus, when combining the signals for the virtual speakers into the physical speaker signal, the time offset is preferably compensated based on the displacement of the virtual speaker from the physical one. The time offset may also be accomplished within the spatialization algorithm, rather than as a post-process.
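Putting the example's processing steps together for one subset — delay compensation, downmix, and peak limiting — can be sketched as below. Nearest-sample integer delays and a hard clip are simplifying assumptions standing in for the more careful fractional-delay and peak-abatement processing described above.

```python
import numpy as np

def combine_subset(virtual, delays, threshold=1.0):
    """virtual: (K, L) virtual speaker signals for one subset; delays:
    K per-channel compensation delays in whole samples. Channels are
    shifted by their delays, summed into one physical speaker drive
    signal, and hard-clipped to +/- threshold."""
    K, L = virtual.shape
    out = np.zeros(L + max(delays))
    for ch, d in zip(virtual, delays):
        out[d:d + L] += ch
    return np.clip(out, -threshold, threshold)

# toy subset of two constant channels, the second delayed by 2 samples
drive = combine_subset(np.full((2, 4), 0.3), delays=[0, 2])
```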
(83) The invention can be implemented in software, hardware or a combination of hardware and software. The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium can be any data storage device that can store data which can thereafter be read by a computing device. Examples of the computer readable medium include read-only memory, random-access memory, CD-ROMs, magnetic tape, optical data storage devices, and carrier waves. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
(84) The many features and advantages of the present invention are apparent from the written description and, thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.