SYSTEM AND METHOD FOR PROVIDING A SPATIALIZED SOUNDFIELD

20220014868 · 2022-01-13

    Abstract

    A signal processing system and method for delivering spatialized sound, comprising: a spatial mapping sensor, configured to map an environment, to determine at least a position of at least one listener and at least one object; a signal processor configured to: transform a received audio program according to a spatialization model comprising parameters defining a head-related transfer function, and an acoustic interaction of the object, to form spatialized audio; generate an array of audio transducer signals for an audio transducer array representing the spatialized audio; and a network port configured to communicate physical state information for the at least one listener through a digital packet communication network.

    Claims

    1. A spatialized sound system, comprising: a spatial mapping sensor, configured to map an environment, to determine at least a position of at least one listener and an object; a signal processor configured to: transform a received audio program according to a spatialization model comprising spatialization model parameters defining at least one head-related transfer function, and an acoustic interaction of the object, to form spatialized audio; and generate an array of audio transducer signals for an audio transducer array representing the spatialized audio; and a network port configured to communicate physical state information for the at least one listener through a digital packet communication network.

    2. The spatialized sound system according to claim 1, wherein the spatial mapping sensor comprises a radar sensor having an antenna array.

    3. The spatialized sound system according to claim 1, further comprising a single housing for both the audio transducer array and the spatial mapping sensor.

    4. The spatialized sound system according to claim 1, wherein the signal processor is further configured to determine a body pose of the at least one listener.

    5. The spatialized sound system according to claim 1, wherein the signal processor is further configured to determine a movement of the at least one listener.

    6. The spatialized sound system according to claim 1, wherein the signal processor is further configured to determine an interaction between two listeners.

    7. The spatialized sound system according to claim 1, wherein the physical state information communicated through the network port is not personally identifiable for the at least one listener.

    8. The spatialized sound system according to claim 1, further comprising at least one media processor configured to control the network port to receive media content selectively dependent on the physical state information transmitted by the spatialized sound system.

    9. The spatialized sound system according to claim 1, further comprising a microphone configured to receive audio feedback, wherein the spatialization model parameters are further dependent on the audio feedback.

    10. The spatialized sound system according to claim 9, wherein the signal processor is further configured to filter the audio feedback for a command from the at least one listener, and to respond to the listener command.

    11. The spatialized sound system according to claim 1, wherein the at least one listener comprises a first listener and a second listener, wherein the signal processor is further configured to determine a location of each of the first listener and the second listener within the environment, and to transform the audio program with the spatialization model, to generate the array of audio transducer signals for the audio transducer array representing the spatialized audio, selectively dependent on the respective location for each of the first listener and the second listener and the respective head-related transfer function for each of the first listener and the second listener.

    12. The spatialized sound system according to claim 11, wherein the signal processor is further configured to transform each of a first audio program and a second audio program according to the spatialization model, to generate the array of audio transducer signals for the audio transducer array representing the spatialized audio to deliver the first audio program to the first listener while suppressing the second audio program at the location of the first listener, and to deliver the second audio program to the second listener while suppressing the first audio program at the location of the second listener, selectively dependent on respective locations and head-related transfer functions for the first listener and the second listener, and at least one acoustic reflection off the object.

    13. The spatialized sound system according to claim 1, further comprising at least one automated processor configured to perform a statistical analysis of the physical state information over time for a plurality of listeners.

    14. The spatialized sound system according to claim 1, wherein the audio transducer array comprises a linear array of at least four audio transducers, and the signal processor is configured to perform cross-talk cancellation between ears of at least two different listeners.

    15. A spatialized sound method, comprising: mapping an environment with a spatial mapping sensor to produce physical state information, defining at least a position of at least one listener and at least one object; receiving an audio program to be delivered to the at least one listener; transforming the audio program with a spatialization model dependent on the physical state information, to generate an array of audio transducer signals for an audio transducer array representing spatialized audio, the spatialization model comprising spatialization model parameters defining a head-related transfer function for the at least one listener, and an acoustic interaction of the object; and communicating the physical state information for the at least one listener through a network port to a digital packet communication network.

    16. The spatialized sound method according to claim 15, wherein the spatial mapping sensor comprises an imaging radar sensor having an antenna array.

    17. The spatialized sound method according to claim 15, further comprising: determining a dynamically-changing body state and a head-related transfer function for each of a plurality of listeners concurrently in the environment; and transforming the audio program with the spatialization model, to generate the array of audio transducer signals for the audio transducer array representing the spatialized audio, selectively dependent on the location of each respective listener and the location of the object, and the respective head-related transfer function for each respective listener, while suppressing crosstalk of the spatialized audio targeted to each respective listener at the locations of the other listeners.

    18. The spatialized sound method according to claim 15, further comprising: receiving audio feedback through at least one microphone, wherein the spatialization model parameters are further dependent on the audio feedback; filtering the audio feedback for a listener command; and responding to the listener command.

    19. The spatialized sound method according to claim 15, further comprising performing a statistical analysis of the physical state information for a plurality of listeners at a remote server.

    20. A spatialized sound method, comprising: determining a physical state of at least one listener in an environment comprising at least one acoustically-interactive object; receiving an audio program to be delivered to the at least one listener and metadata associated with the audio program; transforming the audio program with a spatialization model, to generate an array of audio transducer signals for an audio transducer array representing a spatialized audio program configured dependent on the associated metadata, and the object, the spatialization model comprising spatialization model parameters defining a head-related transfer function for the at least one listener dependent on the determined physical state of the at least one listener; and reproducing the spatialized audio program with a speaker array.

    21. The spatialized sound method according to claim 20, wherein the physical state is determined with a radar sensor, a lidar sensor, or an acoustic sensor, further comprising communicating data from the radar sensor, lidar sensor, or acoustic sensor to a remote server.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0256] FIGS. 1A and 1B show diagrams illustrating the wave field synthesis (WFS) mode operation used for private listening (FIG. 1A) and the use of WFS mode for multi-user, multi-position audio applications (FIG. 1B).

    [0257] FIG. 2 is a block diagram showing the WFS signal processing chain.

    [0258] FIG. 3 is a diagrammatic view of an exemplary arrangement of control points for WFS mode operation.

    [0259] FIG. 4 is a diagrammatic view of a first embodiment of a signal processing scheme for WFS mode operation.

    [0260] FIG. 5 is a diagrammatic view of a second embodiment of a signal processing scheme for WFS mode operation.

    [0261] FIGS. 6A-6E are a set of polar plots showing measured performance of a prototype speaker array with the beam steered to 0 degrees at frequencies of 10000, 5000, 2500, 1000 and 600 Hz, respectively.

    [0262] FIG. 7A is a diagram illustrating the basic principle of binaural mode operation.

    [0263] FIG. 7B is a diagram illustrating binaural mode operation as used for spatialized sound presentation.

    [0264] FIG. 8 is a block diagram showing an exemplary binaural mode processing chain.

    [0265] FIG. 9 is a diagrammatic view of a first embodiment of a signal processing scheme for the binaural modality.

    [0266] FIG. 10 is a diagrammatic view of an exemplary arrangement of control points for binaural mode operation.

    [0267] FIG. 11 is a block diagram of a second embodiment of a signal processing chain for the binaural mode.

    [0268] FIGS. 12A and 12B illustrate simulated frequency domain and time domain representations, respectively, of predicted performance of an exemplary speaker array in binaural mode measured at the left ear and at the right ear.

    [0269] FIG. 13 shows the relationship between the virtual speaker array and the physical speakers.

    [0270] FIG. 14 shows a schematic representation of a spatial sensor-based spatialized audio adaptation system.

    DETAILED DESCRIPTION

    [0271] In binaural mode, the speaker array provides two sound outputs aimed towards the primary listener's ears. The inverse filter design method derives from a mathematical simulation in which a speaker array model approximating the real-world system is created and virtual microphones are placed throughout the target sound field. A target function across these virtual microphones is created or requested. By solving the inverse problem using regularization, stable and realizable inverse filters are created for each speaker element in the array. The source signals are convolved with these inverse filters for each array element.

    [0272] In a beamforming, or wave field synthesis (WFS), mode, the transform processor array provides sound signals representing multiple discrete sources to separate physical locations in the same general area. Masking signals may also be dynamically adjusted in amplitude and time to provide optimized masking and lack of intelligibility of the listener's signal of interest. The WFS mode also uses inverse filters. Instead of aiming just two beams at the listener's ears, this mode uses multiple beams aimed or steered to different locations around the array.

    [0273] The technology involves a digital signal processing (DSP) strategy that allows for both binaural rendering and WFS/sound beamforming, either separately or simultaneously in combination. As noted above, the virtual spatialization is then combined into signals for a small number of physical transducers, e.g., 2 or 4.

    [0274] For both binaural and WFS mode, the signal to be reproduced is processed by filtering it through a set of digital filters. These filters may be generated by numerically solving an electro-acoustical inverse problem. The specific parameters of the specific inverse problem to be solved are described below. In general, however, the digital filter design is based on the principle of minimizing, in the least squares sense, a cost function of the type J=E+βV.

    [0275] The cost function is a sum of two terms: a performance error E, which measures how well the desired signals are reproduced at the target points, and an effort penalty βV, which is a quantity proportional to the total power that is input to all the loudspeakers. The positive real number β is a regularization parameter that determines how much weight to assign to the effort term. Note that, according to the present implementation, the cost function may be applied after the summing, and optionally after the limiter/peak abatement function is performed.

    [0276] By varying β from zero to infinity, the solution changes gradually from minimizing the performance error only to minimizing the effort cost only. In practice, this regularization works by limiting the power output from the loudspeakers at frequencies at which the inversion problem is ill-conditioned. This is achieved without affecting the performance of the system at frequencies at which the inversion problem is well-conditioned. In this way, it is possible to prevent sharp peaks in the spectrum of the reproduced sound. If necessary, a frequency dependent regularization parameter can be used to attenuate peaks selectively.
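
The effect of the regularization parameter described above can be sketched numerically. The following Python/numpy sketch (illustrative only, not part of the claimed system; the transfer matrix values are hypothetical) solves the regularized least-squares problem for two values of β and shows that the larger value limits loudspeaker effort on an ill-conditioned problem:

```python
import numpy as np

def solve_regularized(H, p, beta):
    # Minimize J = ||H a - p||^2 + beta ||a||^2 via the normal equations.
    n = H.shape[1]
    return np.linalg.solve(H.conj().T @ H + beta * np.eye(n), H.conj().T @ p)

# Toy ill-conditioned transfer matrix; all values are hypothetical.
rng = np.random.default_rng(0)
H = rng.standard_normal((8, 6)) @ np.diag([1.0, 1.0, 1.0, 0.5, 1e-3, 1e-4])
p = np.zeros(8)
p[0] = 1.0  # target: unit response at one control point, zero elsewhere

a_small_beta = solve_regularized(H, p, beta=1e-8)
a_large_beta = solve_regularized(H, p, beta=1e-1)

# Increasing beta limits the loudspeaker effort ||a||^2 at the cost of
# reproduction accuracy, taming the ill-conditioned directions.
effort_small = np.linalg.norm(a_small_beta)
effort_large = np.linalg.norm(a_large_beta)
```

A frequency-dependent β, as mentioned above, would simply pass a different beta at each frequency bin.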

    [0277] Wave Field Synthesis/Beamforming Mode

    [0278] WFS sound signals are generated for a linear array of virtual speakers, which define several separated sound beams. In WFS mode operation, different source content from the loudspeaker array can be steered to different angles by using narrow beams to minimize leakage to adjacent areas during listening. As shown in FIG. 1A, private listening is made possible using adjacent beams of music and/or noise delivered by loudspeaker array 72. The direct sound beam 74 is heard by the target listener 76, while beams of masking noise 78, which can be music, white noise or some other signal that is different from the main beam 74, are directed around the target listener to prevent unintended eavesdropping by other persons within the surrounding area. Masking signals may also be dynamically adjusted in amplitude and time to provide optimized masking and lack of intelligibility of the listener's signal of interest as shown in later figures which include the DRCE DSP block.

    [0279] When the virtual speaker signals are combined, a significant portion of the spatial sound cancellation ability is lost; however, it is at least theoretically possible to optimize the sound at each of the listener's ears for the direct (i.e., non-reflected) sound path.

    [0280] In the WFS mode, the array provides multiple discrete source signals. For example, three people could be positioned around the array listening to three distinct sources with little interference from each other's signals. FIG. 1B illustrates an exemplary configuration of the WFS mode for multi-user/multi-position application. With only two speaker transducers, full control for each listener is not possible, though through optimization, an acceptable result (improved over stereo audio) is available. As shown, array 72 defines discrete sound beams 73, 75 and 77, each with different sound content, to each of listeners 76a and 76b. While both listeners are shown receiving the same content (each of the three beams), different content can be delivered to one or the other of the listeners at different times. When the array signals are summed, some of the directionality is lost, and in some cases, inverted. For example, where a set of 12 speaker array signals are summed to 4 speaker signals, directional cancellation signals may fail to cancel at most locations. However, adequate cancellation is preferably available for an optimally located listener.

    [0281] The WFS mode signals are generated through the DSP chain as shown in FIG. 2. Discrete source signals 801, 802 and 803 are each convolved with inverse filters for each of the loudspeaker array signals. The inverse filters are the mechanism that allows the steering of localized beams of audio, optimized for a particular location according to the specification in the mathematical model used to generate the filters. The calculations may be done in real time to provide on-the-fly optimized beam steering capabilities which would allow the users of the array to be tracked with audio. In the illustrated example, the loudspeaker array 812 has twelve elements, so there are twelve filters 804 for each source. The resulting filtered signals corresponding to the same n.sup.th loudspeaker signal are added at combiner 806, whose resulting signal is fed into a multi-channel soundcard 808 with a DAC corresponding to each of the twelve speakers in the array. The twelve signals are then divided into channels, i.e., 2 or 4, and the members of each subset are then time adjusted for the difference in location between the physical location of the corresponding array signal, and the respective physical transducer, and summed, and subject to a limiting algorithm. The limited signal is then amplified using a class D amplifier 810 and delivered to the listener(s) through the two or four speaker array 812.

    [0282] FIG. 3 illustrates how spatialization filters are generated. Firstly, it is assumed that the relative arrangement of the N array units is given. A set of M virtual control points 92 is defined where each control point corresponds to a virtual microphone. The control points are arranged on a semicircle surrounding the array 98 of N speakers and centered at the center of the loudspeaker array. The radius of the arc 96 may scale with the size of the array. The control points 92 (virtual microphones) are uniformly arranged on the arc with a constant angular distance between neighboring points.
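
The control-point geometry described above can be generated directly. The following sketch (illustrative only; the point count and radius are hypothetical values) places M virtual microphones uniformly on a semicircle centered on the array:

```python
import numpy as np

def control_points(m, radius, center=(0.0, 0.0)):
    # m virtual microphones, uniformly spaced on a semicircle of the given
    # radius, centered at the center of the loudspeaker array.
    angles = np.linspace(0.0, np.pi, m)  # constant angular distance
    cx, cy = center
    return np.column_stack((cx + radius * np.cos(angles),
                            cy + radius * np.sin(angles)))

pts = control_points(m=16, radius=1.5)  # radius may scale with array size
```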

    [0283] An M×N matrix H(f) is computed, which represents the electro-acoustical transfer function between each loudspeaker of the array and each control point, as a function of the frequency f, where H.sub.p,l corresponds to the transfer function between the l.sup.th speaker (of N speakers) and the p.sup.th control point 92. These transfer functions can either be measured or defined analytically from an acoustic radiation model of the loudspeaker. One example of a model is an acoustical monopole, given by the following equation:

    [00066] H.sub.p,l(f)=exp [−j2πfr.sub.p,l/c]/(4πr.sub.p,l)

    [0284] where c is the speed of sound propagation, f is the frequency and r.sub.p,l is the distance between the l.sup.th loudspeaker and the p.sup.th control point.
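
The monopole transfer matrix above can be computed as follows (a sketch, not the claimed implementation; the array and control-point coordinates are hypothetical):

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s (assumed value)

def monopole_matrix(speakers, points, f):
    # M x N matrix with H[p, l] = exp(-j 2*pi*f*r_pl / c) / (4*pi*r_pl),
    # where r_pl is the distance from the l-th speaker to the p-th point.
    r = np.linalg.norm(points[:, None, :] - speakers[None, :, :], axis=-1)
    return np.exp(-2j * np.pi * f * r / C) / (4 * np.pi * r)

# Hypothetical geometry: a 4-element linear array and 8 control points.
speakers = np.column_stack((np.linspace(-0.3, 0.3, 4), np.zeros(4)))
points = np.column_stack((np.linspace(-1.0, 1.0, 8), np.full(8, 1.5)))
H = monopole_matrix(speakers, points, f=1000.0)
```

The magnitude of each entry falls off as 1/(4πr), and the phase encodes the propagation delay, consistent with the equation above.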

    [0285] Instead of correcting for time delays after the array signals are fully defined, it is also possible to use the correct speaker location while generating the signal, to avoid reworking the signal definition.

    [0286] A more advanced analytical radiation model for each loudspeaker may be obtained by a multipole expansion, as is known in the art. (See, e.g., V. Rokhlin, “Diagonal forms of translation operators for the Helmholtz equation in three dimensions”, Applied and Computational Harmonic Analysis, 1:82-93, 1993.)

    [0287] A vector p(f) is defined with M elements representing the target sound field at the locations identified by the control points 92 and as a function of the frequency f. There are several choices of the target field. One possibility is to assign the value of 1 to the control point(s) that identify the direction(s) of the desired sound beam(s) and zero to all other control points.

    [0288] The digital filter coefficients are defined in the frequency (f) domain or digital-sampled (z)-domain and are the N elements of the vector a(f) or a(z), which is the output of the filter computation algorithm. The filter may have different topologies, such as FIR, IIR, or other types. The vector a is computed by solving, for each frequency f or sample parameter z, a linear optimization problem that minimizes, e.g., the following cost function


    J(f)=∥H(f)a(f)−p(f)∥.sup.2+β∥a(f)∥.sup.2

    [0289] The symbol ∥ . . . ∥ indicates the L.sup.2 norm of a vector, and β is a regularization parameter, whose value can be defined by the designer. Standard optimization algorithms can be used to numerically solve the problem above.
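
The per-frequency optimization above, followed by a return to the time domain, can be sketched as below. This is an illustrative design loop under stated assumptions (an rfft frequency grid, hypothetical random transfer matrices), not the patent's implementation:

```python
import numpy as np

def design_fir_filters(H_f, p_f, beta, n_taps):
    # H_f: (F, M, N) transfer matrices on an rfft frequency grid;
    # p_f: (F, M) target fields. Solves the regularized problem
    # min ||H a - p||^2 + beta ||a||^2 at each frequency and returns
    # (N, n_taps) real FIR taps via the inverse real FFT.
    F, M, N = H_f.shape
    A = np.empty((F, N), dtype=complex)
    for k in range(F):
        Hk = H_f[k]
        A[k] = np.linalg.solve(Hk.conj().T @ Hk + beta * np.eye(N),
                               Hk.conj().T @ p_f[k])
    return np.fft.irfft(A, n=n_taps, axis=0).T

# Hypothetical data: 33 rfft bins (64 taps), 6 control points, 4 speakers.
rng = np.random.default_rng(1)
H_f = rng.standard_normal((33, 6, 4)) + 1j * rng.standard_normal((33, 6, 4))
p_f = np.zeros((33, 6))
p_f[:, 0] = 1.0  # beam toward the first control point
taps = design_fir_filters(H_f, p_f, beta=1e-3, n_taps=64)
```

An IIR realization, as the text notes, could approximate the same responses with fewer coefficients.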

    [0290] Referring now to FIG. 4, the input to the system is an arbitrary set of audio signals (from A through Z), referred to as sound sources 102. The system output is a set of audio signals (from 1 through N) driving the N units of the loudspeaker array 108. These N signals are referred to as “loudspeaker signals”.

    [0291] For each sound source 102, the input signal is filtered through a set of N digital filters 104, with one digital filter 104 for each loudspeaker of the array. These digital filters 104 are referred to as “spatialization filters”, which are generated by the algorithm disclosed above and vary as a function of the location of the listener(s) and/or of the intended direction of the sound beam to be generated.

    [0292] The digital filters may be implemented as finite impulse response (FIR) filters; however, greater efficiency and better modelling of response may be achieved using other filter topologies, such as infinite impulse response (IIR) filters, which employ feedback or re-entrancy. The filters may be implemented in a traditional DSP architecture, or within a graphic processing unit (GPU, developer.nvidia.com/vrworks-audio-sdk-depth) or audio processing unit (APU, www.nvidia.com/en-us/drivers/apu/). Advantageously, the acoustic processing algorithm may be implemented as a ray tracing, transparency, and scattering model.

    [0293] For each sound source 102, the audio signal filtered through the n.sup.th digital filter 104 (i.e., corresponding to the n.sup.th loudspeaker) is summed at combiner 106 with the audio signals corresponding to the different audio sources 102 but to the same n.sup.th loudspeaker. The summed signals are then output to loudspeaker array 108.
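
The filter-and-sum structure just described can be sketched as follows (illustrative only; the source signals and impulse filters are hypothetical placeholders for real spatialization filters):

```python
import numpy as np

def filter_and_sum(sources, filters):
    # sources: (S, T) discrete source signals; filters: (S, N, L) FIR
    # spatialization filters. Each source is convolved with its filter for
    # the n-th loudspeaker, and all contributions feeding the same
    # loudspeaker are summed, as at combiner 106.
    S, T = sources.shape
    _, N, L = filters.shape
    out = np.zeros((N, T + L - 1))
    for s in range(S):
        for n in range(N):
            out[n] += np.convolve(sources[s], filters[s, n])
    return out

# Hypothetical example: 2 sources, 4 loudspeakers, unit-impulse filters.
sources = np.vstack((np.ones(100), np.zeros(100)))
filters = np.zeros((2, 4, 8))
filters[:, :, 0] = 1.0  # pass-through (impulse) filters
out = filter_and_sum(sources, filters)
```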

    [0294] FIG. 5 illustrates an alternative embodiment of the WFS mode signal processing chain of FIG. 4, which includes optional components: a psychoacoustic bandwidth extension processor (PBEP) and a dynamic range compressor and expander (DRCE). These provide more sophisticated dynamic range and masking control, customization of filtering algorithms to particular environments, room equalization, and distance-based attenuation control.

    [0295] The PBEP 112 allows the listener to perceive sound information contained in the lower part of the audio spectrum by generating higher-frequency sound material that provides the perception of lower frequencies. Since the PBEP processing is non-linear, it is important that it comes before the spatialization filters 104. If the non-linear PBEP block 112 were inserted after the spatial filters, its effect could severely degrade the creation of the sound beam. It is important to emphasize that the PBEP 112 is used in order to compensate (psycho-acoustically) for the poor directionality of the loudspeaker array at lower frequencies, rather than compensating for the poor bass response of single loudspeakers themselves, as is normally done in prior art applications. The DRCE 114 in the DSP chain provides loudness matching of the source signals so that adequate relative masking of the output signals of the array 108 is preserved. In the binaural rendering mode, the DRCE used is a 2-channel block which makes the same loudness corrections to both incoming channels. As with the PBEP block 112, because the DRCE 114 processing is non-linear, it is important that it comes before the spatialization filters 104. If the non-linear DRCE block 114 were inserted after the spatial filters 104, its effect could severely degrade the creation of the sound beam. However, without this DSP block, psychoacoustic performance of the DSP chain and array may decrease as well.

    [0296] Another optional component is a listener tracking device (LTD) 116, which allows the apparatus to receive information on the location of the listener(s) and to dynamically adapt the spatialization filters in real time. The LTD 116 may be a video tracking system which detects the listener's head movements or can be another type of motion sensing system as is known in the art. The LTD 116 generates a listener tracking signal which is input into a filter computation algorithm 118. The adaptation can be achieved either by re-calculating the digital filters in real time or by loading a different set of filters from a pre-computed database. Alternate user localization techniques include radar (e.g., heartbeat detection) or lidar tracking, RFID/NFC tracking, breath sounds, etc.

    [0297] FIGS. 6A-6E are polar energy radiation plots of the radiation pattern of a prototype array being driven by the DSP scheme operating in WFS mode at five different frequencies, 10,000 Hz, 5,000 Hz, 2,500 Hz, 1,000 Hz, and 600 Hz, and measured with a microphone array with the beams steered at 0 degrees.

    [0298] Binaural Mode

    [0299] The DSP for the binaural mode involves the convolution of the audio signal to be reproduced with a set of digital filters representing a Head-Related Transfer Function (HRTF).

    [0300] FIG. 7A illustrates the underlying approach used in binaural mode operation, where an array of speaker locations 10 is defined to produce specially-formed audio beams 12 and 14 that can be delivered separately to the listener's ears 16L and 16R. Using this mode, cross-talk cancellation is inherently provided by the beams. However, this is not available after summing and presentation through a smaller number of speakers.

    [0301] FIG. 7B illustrates a hypothetical video conference call with multiple parties at multiple locations. When the party located in New York is speaking, the sound is delivered as if coming from a direction that would be coordinated with the video image of the speaker in a tiled display 18. When the participant in Los Angeles speaks, the sound may be delivered in coordination with the location in the video display of that speaker's image. On-the-fly binaural encoding can also be used to deliver convincing spatial audio over headphones, avoiding the apparent mis-location of the sound that is frequently experienced in prior art headphone set-ups.

    [0302] The binaural mode signal processing chain, shown in FIG. 8, consists of multiple discrete sources, in the illustrated example, three sources: sources 201, 202 and 203, which are then convolved with binaural Head Related Transfer Function (HRTF) encoding filters 211, 212 and 213 corresponding to the desired virtual angle of transmission from the nominal speaker location to the listener. There are two HRTF filters for each source—one for the left ear and one for the right ear. The resulting HRTF-filtered signals for the left ear are all added together to generate an input signal corresponding to sound to be heard by the listener's left ear. Similarly, the HRTF-filtered signals for the listener's right ear are added together. The resulting left and right ear signals are then convolved with inverse filter groups 221 and 222, respectively, with one filter for each virtual speaker element in the virtual speaker array. The virtual speakers are then combined into a real speaker signal, by a further time-space transform, combination, and limiting/peak abatement, and the resulting combined signal is sent to the corresponding speaker element via a multichannel sound card 230 and class D amplifiers 240 (one for each physical speaker) for audio transmission to the listener through speaker array 250.

    [0303] In the binaural mode, the invention generates sound signals feeding a virtual linear array. The virtual linear array signals are combined into speaker driver signals. The speakers provide two sound beams aimed towards the primary listener's ears—one beam for the left ear and one beam for the right ear.

    [0304] FIG. 9 illustrates the binaural mode signal processing scheme for the binaural modality with sound sources A through Z.

    [0305] As described with reference to FIG. 8, the inputs to the system are a set of sound source signals 32 (A through Z) and the output of the system is a set of loudspeaker signals 38 (1 through N), respectively.

    [0306] For each sound source 32, the input signal is filtered through two digital filters 34 (HRTF-L and HRTF-R) representing a left and right Head-Related Transfer Function, calculated for the angle at which the given sound source 32 is intended to be rendered to the listener. For example, the voice of a talker can be rendered as a plane wave arriving from 30 degrees to the right of the listener. The HRTF filters 34 can be either taken from a database or can be computed in real time using a binaural processor. After the HRTF filtering, the processed signals corresponding to different sound sources but to the same ear (left or right) are merged together at combiner 35. This generates two signals, hereafter referred to as “total binaural signal-left”, or “TBS-L” and “total binaural signal-right” or “TBS-R” respectively.
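
The formation of the two total binaural signals can be sketched as below (illustrative only; the impulse "HRTFs" are hypothetical stand-ins for measured or database HRTF filters):

```python
import numpy as np

def total_binaural_signals(sources, hrtf_l, hrtf_r):
    # sources: (S, T) source signals; hrtf_l, hrtf_r: (S, L) FIR
    # approximations of each source's left/right HRTF for its render angle.
    # Each source is filtered through its HRTF pair and the per-ear
    # results are merged, yielding TBS-L and TBS-R.
    S, T = sources.shape
    L = hrtf_l.shape[1]
    tbs_l = np.zeros(T + L - 1)
    tbs_r = np.zeros(T + L - 1)
    for s in range(S):
        tbs_l += np.convolve(sources[s], hrtf_l[s])
        tbs_r += np.convolve(sources[s], hrtf_r[s])
    return tbs_l, tbs_r

# Identity (impulse) HRTFs, for illustration only: each ear then receives
# the plain sum of the sources.
sources = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
imp = np.array([[1.0, 0.0], [1.0, 0.0]])
tbs_l, tbs_r = total_binaural_signals(sources, imp, imp)
```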

    [0307] Each of the two total binaural signals, TBS-L and TBS-R, is filtered through a set of N digital filters 36, one for each loudspeaker, computed using the algorithm disclosed below. These filters are referred to as “spatialization filters”. It is emphasized for clarity that the set of spatialization filters for the right total binaural signal is different from the set for the left total binaural signal.

    [0308] The filtered signals corresponding to the same n.sup.th virtual speaker but for two different ears (left and right) are summed together at combiners 37. These are the virtual speaker signals, which feed the combiner system, which in turn feed the physical speaker array 38.

    [0309] The algorithm for the computation of the spatialization filters 36 for the binaural modality is analogous to that used for the WFS modality described above. The main difference from the WFS case is that only two control points are used in the binaural mode. These control points correspond to the location of the listener's ears and are arranged as shown in FIG. 10. The distance between the two points 42, which represent the listener's ears, is in the range between 0.1 m and 0.3 m, while the distance between each control point and the center 46 of the loudspeaker array 48 can scale with the size of the array used, but is usually in the range between 0.1 m and 3 m.

    [0310] The 2×N matrix H(f) is computed using elements of the electro-acoustical transfer functions between each loudspeaker and each control point, as a function of the frequency f. These transfer functions can be either measured or computed analytically, as discussed above. A 2-element vector p is defined. This vector can be either [1,0] or [0,1], depending on whether the spatialization filters are computed for the left or right ear, respectively. The filter coefficients for the given frequency f are the N elements of the vector a(f) computed by minimizing the following cost function


    J(f)=∥H(f)a(f)−p(f)∥.sup.2+β∥a(f)∥.sup.2

    [0311] If multiple solutions are possible, the solution is chosen that corresponds to the minimum value of the L.sup.2 norm of a(f).
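
The two-control-point solve can be sketched as follows (illustrative only; the 2×N transfer matrix is hypothetical, and a small β makes the minimizer unique while approximating the minimum-norm choice described above):

```python
import numpy as np

def binaural_filters(H, ear, beta=1e-9):
    # H: 2 x N matrix of transfer functions from the N loudspeakers to the
    # two ear-position control points. The target vector p is [1, 0] for
    # the left ear or [0, 1] for the right. With beta > 0 the minimizer of
    # J(f) = ||H a - p||^2 + beta ||a||^2 is unique; as beta -> 0 it
    # approaches the minimum L2-norm solution.
    p = np.array([1.0, 0.0]) if ear == "left" else np.array([0.0, 1.0])
    N = H.shape[1]
    return np.linalg.solve(H.conj().T @ H + beta * np.eye(N),
                           H.conj().T @ p)

# Hypothetical 2 x 4 transfer matrix: the reproduced field should be close
# to 1 at the targeted ear and close to 0 at the other ear.
rng = np.random.default_rng(2)
H = rng.standard_normal((2, 4))
a = binaural_filters(H, ear="left")
field = H @ a
```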

    [0312] FIG. 11 illustrates an alternative embodiment of the binaural mode signal processing chain of FIG. 9 which includes the use of optional components including a psychoacoustic bandwidth extension processor (PBEP) and a dynamic range compressor and expander (DRCE). The PBEP 52 allows the listener to perceive sound information contained in the lower part of the audio spectrum by generating higher-frequency sound material that provides the perception of lower frequencies. Since the PBEP processing is non-linear, it is important that it comes before the spatialization filters 36. If the non-linear PBEP block 52 were inserted after the spatial filters, its effect could severely degrade the creation of the sound beam.

    [0313] It is important to emphasize that the PBEP 52 is used in order to compensate (psycho-acoustically) for the poor directionality of the loudspeaker array at lower frequencies rather than compensating for the poor bass response of single loudspeakers themselves.

    [0314] The DRCE 54 in the DSP chain provides loudness matching of the source signals so that adequate relative masking of the output signals of the array 38 is preserved. In the binaural rendering mode, the DRCE used is a 2-channel block which makes the same loudness corrections to both incoming channels.

    [0315] As with the PBEP block 52, because the DRCE 54 processing is non-linear, it is important that it comes before the spatialization filters 36. If the non-linear DRCE block 54 were to be inserted after the spatial filters 36, its effect could severely degrade the creation of the sound beam. However, without this DSP block, psychoacoustic performance of the DSP chain and array may decrease as well.

    [0316] Another optional component is a listener tracking device (LTD) 56, which allows the apparatus to receive information on the location of the listener(s) and to dynamically adapt the spatialization filters in real time. The LTD 56 may be a video tracking system which detects the listener's head movements or can be another type of motion sensing system as is known in the art. The LTD 56 generates a listener tracking signal which is input into a filter computation algorithm 58. The adaptation can be achieved either by re-calculating the digital filters in real time or by loading a different set of filters from a pre-computed database.
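The database-lookup form of this adaptation might be sketched as follows; the function and data-structure names are illustrative, not from the specification:

```python
import numpy as np

def nearest_filter_set(listener_pos, filter_db):
    """Pick the pre-computed spatialization filter set whose stored listener
    position is closest to the position reported by the listener tracking
    device (the alternative to re-calculating the filters in real time).

    listener_pos : (x, y, z) position from the tracking signal
    filter_db    : dict mapping (x, y, z) tuples to pre-computed filter sets
    """
    key = min(filter_db,
              key=lambda p: np.linalg.norm(np.subtract(p, listener_pos)))
    return filter_db[key]
```

A finer-grained database, or interpolation between neighboring filter sets, would reduce audible switching artifacts as the listener moves.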

    [0317] FIGS. 12A and 12B illustrate the simulated performance of the algorithm for the binaural modes. FIG. 12A illustrates the simulated frequency domain signals at the target locations for the left and right ears, while FIG. 12B shows the time domain signals. Both plots show the clear ability to target one ear, in this case, the left ear, with the desired signal while minimizing the signal detected at the listener's right ear.

    [0318] WFS and binaural mode processing can be combined into a single device to produce total sound field control. Such an approach would combine the benefits of directing a selected sound beam to a targeted listener, e.g., for privacy or enhanced intelligibility, and separately controlling the mixture of sound that is delivered to the listener's ears to produce surround sound. The device could process audio using binaural mode or WFS mode in the alternative or in combination. Although not specifically illustrated herein, the use of both the WFS and binaural modes would be represented by the block diagrams of FIG. 5 and FIG. 11, with their respective outputs combined at the signal summation steps by the combiners 37 and 106. The use of both WFS and binaural modes could also be illustrated by the combination of the block diagrams in FIG. 2 and FIG. 8, with their respective outputs added together at the last summation block immediately prior to the multichannel soundcard 230.

    EXAMPLE 1

    [0319] A 12-channel spatialized virtual audio array is implemented in accordance with U.S. Pat. No. 9,578,440. This virtual array provides signals for driving a linear or curvilinear equally-spaced array of, e.g., 12 speakers situated in front of a listener. The virtual array is divided into two or four groups. In the case of two groups, the “left” signals (e.g., 6) are directed to the left physical speaker, and the “right” signals (e.g., 6) are directed to the right physical speaker. The virtual signals are then summed, with at least two intermediate processing steps.

    [0320] The first intermediate processing step compensates for the time difference between the nominal location of the virtual speaker and the physical location of the speaker transducer. For example, the virtual speaker closest to the listener is assigned a reference delay, and the further virtual speakers are assigned increasing delays. In a typical case, the virtual array is situated such that the time differences for adjacent virtual speakers are incrementally varying, though a more rigorous analysis may be implemented. At a 48 kHz sampling rate, the difference between the nearest and furthest virtual speaker may be, e.g., 4 cycles.

    [0321] The second intermediate processing step limits the peaks of the signal, in order to avoid over-driving the physical speaker or causing significant distortion. This limiting may be frequency selective, so that only a particular frequency band is affected by the process. This step should be performed after the delay compensation. For example, a compander may be employed. Alternately, presuming only rare peaking, a simple limiter may be employed. In other cases, a more complex peak abatement technology may be employed, such as a phase shift of one or more of the channels, typically based on a predicted peaking of the signals, which are delayed slightly from their real-time presentation. Note that this phase shift alters the time delay of the first intermediate processing step; however, when the physical limit of the system is reached, a compromise is necessary.

    [0322] With a virtual line array of 12 speakers, and 2 physical speakers, the physical speaker locations are between elements 3-4 and 9-10. If (s) is the center-to-center distance between speakers, then the distance from the center of the array to the center of each real speaker is: A=3 s. The left speaker is offset −A from the center, and the right speaker is offset A.

    [0323] The second intermediate processing step is principally a downmix of the six virtual channels, with a limiter and/or compressor or other process to provide peak abatement, applied to prevent saturation or clipping. For example, the left channel is:


    L.sub.out=Limit(L.sub.1+L.sub.2+L.sub.3+L.sub.4+L.sub.5+L.sub.6)

    [0324] and the right channel is


    R.sub.out=Limit(R.sub.1+R.sub.2+R.sub.3+R.sub.4+R.sub.5+R.sub.6)
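The two intermediate processing steps and the downmix for one physical speaker might be sketched as follows; this is an illustrative simplification in which the limiter is a per-block peak scaler, whereas a practical system might substitute a compander or frequency-selective limiting:

```python
import numpy as np

def compensate_delay(x, delay_samples):
    # First intermediate step: delay the virtual-channel signal so that all
    # virtual speakers appear time-aligned at the physical transducer.
    return np.concatenate([np.zeros(delay_samples),
                           x[:len(x) - delay_samples]])

def limit(x, threshold=1.0):
    # Second intermediate step (simplified): scale the block down only when
    # its peak would over-drive the speaker.
    peak = np.max(np.abs(x))
    return x * (threshold / peak) if peak > threshold else x

def physical_speaker_signal(virtual_channels, delays):
    # L_out = Limit(L_1 + L_2 + ... + L_6), applied after per-channel delays.
    aligned = [compensate_delay(x, d)
               for x, d in zip(virtual_channels, delays)]
    return limit(np.sum(aligned, axis=0))
```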

    [0325] Before the downmix, the difference in delay between each virtual speaker and the listener's ears, compared to that between the physical speaker transducer and the listener's ears, needs to be taken into account. This delay becomes significant particularly at higher frequencies, since the ratio of the length of the virtual speaker array to the wavelength of the sound increases. To calculate the distance from the listener to each virtual speaker, assume that the speakers, n, are numbered 1 to 6, where 1 is the speaker closest to the center and 6 is the farthest from center. The distance from the center of the array to speaker n is d=((n−1)+0.5)*s. Using the Pythagorean theorem, with the listener at distance l in front of the array, the distance from the speaker to the listener can be calculated as follows:


    d.sub.n=√{square root over (l.sup.2+(((n−1)+0.5)*s).sup.2)}

    [0326] The distance from the real speaker to the listener is


    d.sub.r=√{square root over (l.sup.2+(3*s).sup.2)}

    [0327] The system, in this example, is intended to deliver spatialized audio to each of two listeners within the environment. A radar sensor, e.g., a Vayyar 60 GHz sensor, is used to locate the respective listeners. venturebeat.com/2018/05/02/vayyar-unveils-a-new-sensor-for-capturing-your-life-in-3d/. Various types of analysis can be performed to determine which objects represent people, versus inanimate objects, and, for the people, what the orientation of their heads is. For example, depending on power output and proximity, the radar can detect heartbeat (and therefore whether the person is facing toward or away from the sensor, for a person with normal anatomy). Limited degrees of freedom of limbs and torso can also assist in determining anatomical orientation, e.g., limits on joint flexion. With localization of the listener, the head location is determined, and based on the orientation of the listener, the location of the ears is inferred. Therefore, using a generic HRTF and the inferred ear locations, spatialized audio can be directed to a listener. For multiple listeners, the optimization is more complex, but based on the same principles. The acoustic signal to be delivered at a respective ear of a listener is maximized, with acceptable distortion, while minimizing perceptible acoustic energy at the listener's other ear and at the ears of other listeners. A perception model may be imposed to permit non-obtrusive white or pink noise, in contrast to voice, narrowband or harmonic sounds, which may be perceptually intrusive.

    [0328] The SLAM sensor also permits modelling of the inanimate objects, which can reflect or absorb sound. Therefore, both direct line-of sight paths from the transducers to the ear(s) and reflected/scattered paths can be employed within the optimization. The SLAM sensor permits determination of static objects and dynamically moving objects, and therefore permits the algorithm to be updated regularly, and to be reasonably accurate for at least the first reflection of acoustic waves between the transducer array and the listeners.

    [0329] The sample delay for each speaker can be calculated from the difference between the two distances. This can then be converted to samples (assuming the speed of sound is 343 m/s and the sample rate is 48 kHz):

    delay=((d.sub.n−d.sub.r)/(343 m/s))*48000 Hz

    [0330] The difference between the virtual and physical speaker distances can lead to a significant delay. For example, if the virtual array inter-speaker distance is 38 mm, and the listener is 500 mm from the array, the delay from the virtual far-left speaker (n=6) relative to the real speaker is:

    d.sub.n=√{square root over (0.5.sup.2+(5.5*0.038).sup.2)}=0.541 m

    d.sub.r=√{square root over (0.5.sup.2+(3*0.038).sup.2)}=0.513 m

    delay=((0.541−0.513)/343)*48000≈4 samples
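This worked example can be checked numerically (the variable names are illustrative; full floating-point precision gives d.sub.n≈0.542 m, which the text truncates to 0.541 m):

```python
import math

# Parameters from the example: 38 mm virtual inter-speaker spacing,
# listener 0.5 m in front of the array, speed of sound 343 m/s, 48 kHz.
s, l, c, fs = 0.038, 0.5, 343.0, 48000

def virtual_distance(n):
    # d_n = sqrt(l^2 + (((n - 1) + 0.5) * s)^2), n = 1..6 from array center
    return math.sqrt(l**2 + (((n - 1) + 0.5) * s)**2)

d_6 = virtual_distance(6)                 # far-left virtual speaker
d_r = math.sqrt(l**2 + (3 * s)**2)        # physical speaker at offset 3s
delay = (d_6 - d_r) / c * fs              # delay in samples, ≈ 4
```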

    [0331] At higher audio frequencies, e.g., 12 kHz, an entire wave cycle is 4 samples, so the difference amounts to a 360° phase shift. See Table 1.

    [0332] Thus, when combining the signals for the virtual speakers into the physical speaker signal, the time offset is preferably compensated based on the displacement of the virtual speaker from the physical one. The time offset may also be accomplished within the spatialization algorithm, rather than as a post-process.

    EXAMPLE 2

    [0333] FIG. 14 demonstrates the control flow for using intelligent spatial sensor technology in a spatialized audio system. The sensor detects the location of listeners around the room. This information is passed to an AI/facial recognition component, which determines how best to present the audio to those listeners. This may involve the use of cloud services for processing. The cloud services are accessed through a network communication port via the Internet. The processing for determining how best to present 3D sound to each listener, to increase the volume to specific listeners (e.g. hearing-impaired listeners), or other effects based on the user's preferences, may be performed locally within a sound bar or its processor, remotely in a server or cloud system, or in a hybrid architecture spanning both. The communication may be wired or wireless (e.g., WiFi or Bluetooth).

    [0334] Incoming streaming audio may contain metadata that the intelligent loudspeaker system control would use for automated configuration. For example, 5.1 or 7.1 surround sound from a movie would cause the speaker to produce a spatialized surround mode aimed at the listener(s) (single, double or triple binaural beams). If the audio stream were instead a news broadcast, the control could auto-select Mono Beaming mode (the width of the beam dependent on the position of the listener(s)), plus the option to add speech enhancement equalization; or a narrow, high sound pressure level beam could be aimed at a listener who is hard of hearing (with or without equalization), and a large portion of the room could be ‘filled’ with defined wave field synthesis derived waves (e.g., a “Stereo Everywhere” algorithm). Numerous configurations are possible by modifying speaker configuration parameters such as filter type (narrow, wide, asymmetrical, dual/triple beams, masking, wave field synthesis), target distance, equalization, head-related transfer function, lip sync delay, speech enhancement equalization, etc. Furthermore, a listener could enhance a specific configuration by automatically enabling bass boost in the case of a movie or game, but disabling it in the case of a newscast or music.
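A hypothetical sketch of such metadata-driven configuration follows; the mode names and parameter fields are illustrative, not defined by the specification:

```python
# Illustrative mapping from stream content type to speaker configuration
# parameters; a deployed system would draw these from the actual metadata
# vocabulary of the streaming service.
CONTENT_MODES = {
    "movie_5_1": {"filter_type": "dual_beam",  "bass_boost": True,  "speech_eq": False},
    "news":      {"filter_type": "mono_beam",  "bass_boost": False, "speech_eq": True},
    "music":     {"filter_type": "wave_field_synthesis",
                  "bass_boost": False, "speech_eq": False},
}

def configure(metadata, user_prefs=None):
    """Select configuration parameters from stream metadata, then let the
    listener's preference profile override individual fields (e.g.,
    disabling bass boost for a newscast)."""
    params = dict(CONTENT_MODES.get(metadata.get("content_type"),
                                    CONTENT_MODES["music"]))
    if user_prefs:
        params.update(user_prefs)
    return params
```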

    [0335] The type of program may be determined automatically or manually. In a manual implementation, the user selects a mode through a control panel, remote control, speech recognition interface, or the like. FIG. 14 shows that the smart filter algorithm may also receive metadata, which may be, for example, a stream of codes which accompany the media, which define a target sonic effect or sonic type, over a range of changing circumstances. Thus, in a movie, different scenes or implied sound sources may encode different sonic effects. It is noted that these cannot be directly or simply encoded in the source media, as the location and/or acoustic environment is not defined until the time of presentation, and different recipients will have different environments. Therefore, a real-time spatialization control system is employed, which receives a sensor signal or signals defining the environment of presentation and listener location, to modify the audio program in real time to optimize the presentation. It is noted that the same sensors may also be used to control a 3D television presentation to ensure proper image parallax at viewer locations. The sensor data may be a visual image type, but preferably, the sensors do not capture visual image data, which minimizes the privacy risk if that data is communicated outside of the local control system. As such, the sensor data, or a portion thereof, may be communicated to a remote server or for cloud processing with consumer acceptance. The remote or cloud processing allows application of a high level of computational complexity to map the environment, including correlations of the sensor data to acoustic interaction. This process may not be required continuously, but may be updated periodically without explicit user interaction.

    [0336] The sensor data may also be used for accounting, marketing/advertising, and other purposes independent of the optimization of presentation of the media to a listener. For example, a fine-grained advertiser cost system may be implemented, which charges advertisers for advertisements that were listened to, but not for those in which no awake listener was available. The sensor data may therefore convey listener availability and sleep/wake state. The sleep/wake state may be determined by movement, or in some cases, by breathing and heart rate. The sensor may also be able to determine the identity of listeners, and link the identity of the listener to their demographics or user profile. The identity may therefore be used to target different ads to different viewing environments, and perhaps different audio programs to different listeners. For example, it is possible to target different listeners with different language programs if they are spatially separated. Where multiple listeners are in the same environment, a consensus algorithm may optimize a presentation of a program for the group, based on the identifications and in some cases their respective locations.

    [0337] Generally, the beam steering control may be any spatialization technology, though the real-time sensor permits modification of the beam steering to in some cases reduce complexity where it is unnecessary, with a limiting case being no listener present, and in other cases, a single listener optimally located for simple spatialized sound, and in other cases, higher complexity processing, for example multiple listeners receiving qualitatively different programs. In the latter case, processing may be offloaded to a remote server or cloud, permitting use of a local control that is computationally less capable than a “worst case” scenario.

    [0338] The loudspeaker control preferably receives far-field inputs from a microphone or microphone array, and performs speech recognition on received speech in the environment, while suppressing response to media-generated sounds. The speech recognition may be Amazon Alexa, Microsoft Cortana, Hey Google, or the like, or may be a proprietary platform. For example, since the local control includes a digital signal processor, a greater portion of the speech recognition, or the entirety of the speech recognition, may be performed locally, with processed commands transmitted remotely as necessary. This same microphone array may be used for acoustic tuning of the system, including room mapping and equalization, listener localization, and ambient sound neutralization or masking.

    [0339] Once the best presentation has been determined, the smart filter generation uses techniques similar to those described above, and otherwise known in the art, to generate audio filters that will best represent the combination of audio parameter effects for each listener. These filters are then uploaded to a processor of the speaker array for rendering, if this is a distinct processor.

    [0340] Content metadata provided by various streaming services can be used to tailor the audio experience based on the type of audio, such as music, movie, game, and so on, and the environment in which it is presented, and in some cases based on the mood or state of the listener. For example, the metadata may indicate that the program is an action movie. In this type of media, there are often high-intensity sounds intended to startle, which may be directional or non-directional. For example, the changing direction of a moving car may be more important than the accuracy of the position of the car in the soundscape, and therefore the spatialization algorithm may optimize the motion effect over the positional effect. On the other hand, some sounds, such as a nearby explosion, may be non-directional, and the spatialization algorithm may instead optimize loudness and crispness over spatial effects for each listener. The metadata need not be rigidly predefined, and the content producer may have considerable freedom over the algorithm(s) employed.

    [0341] Thus, according to one aspect, the desired left and right channel separation for a respective listener is encoded by metadata associated with a media presentation. Where multiple listeners are present, the encoded effect may apply for each listener, or may be encoded to be different for different listeners. A user preference profile may be provided for a respective listener; the media is then presented according to the user preferences in addition to the metadata. For example, a listener may have a different hearing response in each ear, and the preference may be to normalize the audio for the listener's response. In other cases, different respective listeners may have different preferred sound separation, indicated by their preference profiles. According to another embodiment, the metadata encodes a “type” of media, and the user profile maps the media type to a user-preferred spatialization effect or spatialized audio parameters.

    [0342] As discussed above, the spatial location sensor has two distinct functions: location of persons and objects for the spatialization process, and user information which can be passed to a remote service provider. The remote service provider can then use the information, which includes the number and location of persons (and perhaps pets) in the environment proximate to the acoustic transducer array, as well as their poses, activity state, response to content, etc., and may include inanimate objects. The local system and/or remote service provider may also employ the sensor for interactive sessions with users (listeners), which may be games (similar to Microsoft Xbox with Kinect, or Nintendo Wii), exercise, or other types of interaction.

    [0343] Preferably, the spatial sensor is not a camera, and as such avoids the personal privacy issues raised by having such a sensor with remote communication capability. The sensor may be a radar (e.g., imaging radar, MIMO WiFi radar [WiVi, WiSee]), lidar, Microsoft Kinect sensor (which includes cameras), ultrasonic imaging array, camera, infrared sensing array, passive infrared sensor, or other known sensor.

    [0344] The spatial sensor may determine a location of a listener in the environment, and may also identify a respective listener. The identification may be based on video pattern recognition in the case of a video imager, a characteristic backscatter in the case of radar or radio frequency identification, or other known means. Preferably the system does not provide a video camera, and therefore the sensor data may be relayed remotely for analysis and storage, without significant privacy violation. This, in turn, permits mining of the sensor data, for use in marketing, and other purposes, with low risk of damaging misuse of the sensor data.

    [0345] The invention can be implemented in software, hardware or a combination of hardware and software. The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium can be any data storage device that can store data which can thereafter be read by a computing device. Examples of the computer readable medium include read-only memory, random-access memory, CD-ROMs, magnetic tape, optical data storage devices, and carrier waves. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

    [0346] The many features and advantages of the present invention are apparent from the written description and, thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.