Sound spatialization with room effect
09848274 · 2017-12-19
Assignee
Inventors
CPC classification
H04S2400/03
ELECTRICITY
H04S2420/01
ELECTRICITY
H04S7/30
ELECTRICITY
G10L19/008
PHYSICS
International classification
H04S7/00
ELECTRICITY
Abstract
A method of sound spatialization, in which at least one filtering process, including summation, is applied to at least two input signals, the filtering process comprising: the application of at least one first room effect transfer function, the first transfer function being specific to each input signal, and the application of at least one second room effect transfer function, the second transfer function being common to all input signals. The method comprises a step of weighting at least one input signal with a weighting factor, said weighting factor being specific to each of the input signals.
Claims
1. A method of sound spatialization, wherein at least one block-based filtering process, with summation, is applied to at least two input signals, said filtering process comprising: applying at least one first room effect transfer function, said first transfer function being constructed from at least one first part and being specific to each input signal, and applying at least one second room effect transfer function, said second transfer function being constructed from at least one second part and being common to all input signals, wherein the method comprises: weighting at least one input signal with a weighting factor, said weighting factor being specific to each of the input signals; wherein at least one output signal of said method is given by applying a formula of the type:
2. The method according to claim 1, wherein said first and second transfer functions are respectively representative of: direct sound propagations and the first sound reflections of said propagations; and a diffuse sound field present after said first reflections, and wherein the method comprises: the application of first transfer functions respectively specific to the input signals, and the application of a second transfer function, identical for all input signals, and resulting from a general approximation of a diffuse sound field effect.
3. The method according to claim 2, comprising a preliminary step of constructing said first and second transfer functions from impulse responses incorporating a room effect, said preliminary step comprising, for the construction of a first transfer function, the operations of: determining a start time of the presence of direct sound waves, determining a start time of the presence of said diffuse sound field after the first reflections, and selecting, in an impulse response, a portion of the response which extends temporally between said start time of the presence of direct sound waves to said start time of the presence of the diffuse field, said selected portion of the response corresponding to said first transfer function.
4. The method according to claim 3, wherein the second transfer function is constructed from a set of portions of impulse responses temporally starting after said start time of the presence of the diffuse field.
5. The method according to claim 3, wherein said second transfer function is given by applying a formula of the type:
6. The method according to claim 3, wherein said filtering process includes the application of at least one compensating delay corresponding to a time difference between said start time of the direct sound waves and said start time of the presence of the diffuse field.
7. The method according to claim 6, wherein said first and second room effect transfer functions are applied in parallel to said input signals and wherein said at least one compensating delay is applied to the input signals filtered by said second transfer functions.
8. The method according to claim 1, wherein an energy correction gain factor is applied to the weighting factor.
9. The method according to claim 1, wherein it comprises a step of decorrelating the input signals prior to applying the second transfer functions, and wherein at least one output signal of said method is obtained by applying a formula of the type:
10. The method according to claim 1, wherein it comprises a step of determining an energy correction gain factor as a function of input signals and wherein at least one output signal is obtained by applying a formula of the type:
11. The method according to claim 1, wherein said weight is given by applying a formula of the type:
12. A non-transitory computer-readable storage medium with an executable program stored thereon, wherein the program instructs a microprocessor to perform steps of the method according to claim 1.
13. A sound spatialization device, comprising at least one filter with summation applied to at least two input signals, said filter using: at least one first room effect transfer function, said first transfer function being constructed from at least one first part and being specific to each input signal, and at least one second room effect transfer function, said second transfer function being constructed from at least one second part and being common to all input signals, wherein it comprises weighting modules for weighting at least one input signal with a weighting factor, said weighting factor being specific to each of the input signals; wherein at least one output signal of said device is given by applying a formula of the type:
14. An audio signal decoding module, comprising the spatialization device according to claim 13, the decoded sound signals being the input signals.
Description
(1) Other features and advantages of the invention will be apparent from reading the following detailed description of embodiments of the invention and from reviewing the drawings.
(9) Still referring to
(10) Here, the cooperation between hardware and software elements produces a technical effect resulting in savings in the complexity of the spatialization, for substantially the same audio rendering (same sensation for a listener), as discussed below.
(11) We now refer to
(12) In a first step S21, the data are prepared. This preparation is optional; the signals may be processed in step S22 and subsequent steps without this pre-processing.
(13) In particular, this preparation consists of truncating each BRIR to ignore the inaudible samples at the beginning and end of the impulse response.
(14) For the truncation at the start of the impulse response TRUNC S, in step S211, this preparation consists of determining a direct sound waves start time, and may be implemented by the following steps: A cumulative sum of the energies of each of the BRIR filters l is calculated; typically, this energy is calculated by summing the squares of the amplitudes of samples 1 to j, with j in [1; J] and J being the number of samples of a BRIR filter. The energy value valMax of the maximum-energy filter (among the filters for the left ear and for the right ear) is determined. For each of the speakers l, we calculate the index at which the energy of the BRIR filter l exceeds a threshold in dB calculated relative to valMax (for example valMax − 50 dB). The truncation index iT retained for all BRIR is the minimum index among all these BRIR indices and is considered to be the direct sound waves start time.
(15) The resulting index iT therefore corresponds to the number of samples to be ignored at the start of each BRIR. A sharp truncation at the start of the impulse response using a rectangular window can lead to audible artifacts if it cuts into a higher-energy segment. It may therefore be preferable to apply an appropriate fade-in window; however, if precautions have been taken in choosing the threshold, such windowing becomes unnecessary, as it would be inaudible (only inaudible signal is cut).
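As an illustration, the threshold search of step S211 can be sketched as follows. This is a minimal sketch in Python; the helper name `truncation_start_index` and the list-of-lists representation of the BRIR set are assumptions for illustration, not part of the patent.

```python
def truncation_start_index(brirs, threshold_db=-50.0):
    """Find the first sample index at which any BRIR becomes audible.

    `brirs` is a list of impulse responses (lists of floats), one per
    speaker/ear.  Cumulative energies are compared against a threshold
    relative to the maximum filter energy valMax (e.g. valMax - 50 dB).
    """
    # Total energy of each filter; valMax is the largest of them.
    energies = [sum(x * x for x in h) for h in brirs]
    val_max = max(energies)
    threshold = val_max * 10.0 ** (threshold_db / 10.0)

    indices = []
    for h in brirs:
        cumulative = 0.0
        idx = len(h)  # default: threshold never reached
        for j, x in enumerate(h):
            cumulative += x * x
            if cumulative > threshold:
                idx = j
                break
        indices.append(idx)
    # The minimum index over all BRIR is kept, so that no audible
    # sample of any filter is discarded.
    return min(indices)
```

Because the minimum index over all BRIR is retained, the truncation is safe for every filter in the set.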
(16) The synchrony between BRIR makes it possible to apply a constant delay for all BRIR for the sake of simplicity in implementation, even if it is possible to optimize the complexity.
(17) Truncation of each BRIR to ignore inaudible samples at the end of the impulse response TRUNC E, in step S212, may be performed starting with steps similar to those described above but adapted for the end of the impulse response. A sharp truncation at the end of the impulse response using a rectangular window can lead to audible artifacts on the impulse signals where the tail of the reverberation could be audible. Thus, in one embodiment, a suitable fade-out window is applied.
(18) In step S22, a synchronistic isolation ISOL A/B is performed. This synchronistic isolation consists of separating, for each BRIR, the "direct sound and first reflections" portion (Direct, denoted A) from the "diffuse sound" portion (Diffuse, denoted B). The processing performed on the diffuse portion may advantageously differ from that performed on the direct portion, to the extent that it is preferable to have a higher quality of processing on the direct portion than on the diffuse portion. This makes it possible to optimize the quality/complexity ratio.
(19) In particular, to achieve this synchronistic isolation, a unique sampling index iDD, common to all BRIR (hence the term "synchronistic"), is determined, from which onward the rest of the impulse response is considered to correspond to a diffuse field. The impulse responses BRIR(l) are therefore partitioned into two parts, A(l) and B(l), whose concatenation restores BRIR(l).
(20) BRIR^{g/d}(l) = [A^{g/d}(l), B^{g/d}(l)]
(21) The index iDD may be specific to the room for which the BRIR were determined. Calculation of this index may therefore depend on the spectral envelope, on the correlation of the BRIR, or on the echogram of these BRIR. For example, iDD can be determined by a formula of the type iDD = √(V_room), where V_room is the volume of the room in which the BRIR were measured.
(22) In one embodiment, iDD is a fixed value, typically 2000 samples. Alternatively, iDD varies, preferably dynamically, depending on the environment in which the input signals are captured.
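The synchronistic isolation of step S22 amounts to cutting every impulse response at the same index iDD. A minimal sketch, assuming a plain list representation and a hypothetical helper name `split_brir`:

```python
def split_brir(brir, i_dd):
    """Partition one BRIR into its direct/first-reflections part A and
    its diffuse part B at the common index iDD (step S22).  The
    concatenation of A and B restores the original filter.
    """
    a = brir[:i_dd]  # direct sound and first reflections
    b = brir[i_dd:]  # diffuse field
    return a, b
```

The property that A followed by B restores the full BRIR is what allows the two parts to be filtered separately and recombined with a compensating delay.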
(23) The output signal for the left (g) and right (d) ears, denoted O^{g/d}, is therefore written:
(24) O^{g/d} = Σ_{l=1}^{L} I(l)*A^{g/d}(l) + z^{−iDD}·Σ_{l=1}^{L} I(l)*B^{g/d}(l)
(25) where z^{−iDD} corresponds to the compensating delay of iDD samples.
(26) This delay is applied to the signals by storing the values calculated for Σ_{l=1}^{L} I(l)*B^{g/d}(l) in temporary memory (for example a buffer) and retrieving them at the desired moment.
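Such a compensating delay can be sketched with a FIFO buffer of iDD samples. The `CompensatingDelay` class below is an illustrative sketch, not the patented implementation:

```python
from collections import deque

class CompensatingDelay:
    """Delay line of iDD samples, realizing z^{-iDD} on a sample stream
    by buffering the diffuse-path output."""

    def __init__(self, i_dd):
        # Pre-fill with zeros so the first iDD outputs are silence.
        self.buffer = deque([0.0] * i_dd)

    def push(self, sample):
        # Store the freshly computed diffuse-path sample and retrieve
        # the one computed iDD samples earlier.
        self.buffer.append(sample)
        return self.buffer.popleft()
```

Each freshly computed diffuse-path sample is stored, and the sample computed iDD steps earlier is retrieved at the desired moment.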
(27) In one embodiment, the sampling indexes selected for A and B may also take into account the frame lengths in the case of integration into an audio encoder. Indeed, a typical frame size of 1024 samples can lead to choosing a length of 1024 samples for A and 2048 samples for B, ensuring that B is indeed a diffuse field area for all the BRIR.
(28) In particular, it may be advantageous that the size of B is a multiple of the size of A, because if the filtering is implemented by FFT blocks, then the calculation of an FFT for A can be reused for B.
(29) A diffuse field is characterized by the fact that it is statistically identical at all points of the room. Its frequency response therefore varies very little with the speaker to be simulated. The invention exploits this feature to replace the diffuse filters B(l) of all the BRIR with a single "mean" filter B_mean^{g/d}, in order to greatly reduce the complexity due to the multiple convolutions. For this, again referring to
(30) In step S23B1, the value of the mean filter B_mean^{g/d} is calculated. It is extremely rare for the entire system to be perfectly calibrated, so a weighting factor can be applied, carried forward into the input signal, in order to achieve a single convolution per ear for the diffuse field part. The BRIR are therefore separated into energy-normalized filters, the normalization gain √(E_{B^{g/d}(l)}) being carried into the weighting factor:
(31) B^{g/d}(l) = √(E_{B^{g/d}(l)}) · B_norm^{g/d}(l), where E_{B^{g/d}(l)} is the energy of B^{g/d}(l)
(32) Next, we approximate B_norm^{g/d}(l) with a single mean filter B_mean^{g/d} which is no longer a function of the speaker l, but which can itself also be energy-normalized:
(33) B_mean^{g/d} = √(E_{B_mean^{g/d}}) · B_mean,norm^{g/d}
(34) In one embodiment, this mean filter may be obtained by averaging temporal samples. Alternatively, it may be obtained by any other type of averaging, for example by averaging the power spectral densities.
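The temporal-averaging variant can be sketched as follows (the helper name `mean_filter` and the list representation of the energy-normalized filters are illustrative assumptions):

```python
def mean_filter(normalized_filters):
    """Average the energy-normalized diffuse filters B_norm(l) sample
    by sample to obtain a single filter B_mean (temporal averaging;
    averaging power spectral densities is an alternative)."""
    length = len(normalized_filters[0])
    count = len(normalized_filters)
    return [sum(f[n] for f in normalized_filters) / count
            for n in range(length)]
```

The result no longer depends on the speaker index l, which is what permits a single convolution per ear.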
(35) In one embodiment, the energy E_{B_mean^{g/d}} of the mean filter is calculated as the sum of the squares of its samples:
(36) E_{B_mean^{g/d}} = Σ_n (B_mean^{g/d}(n))²
(37) The energy can be calculated over all samples corresponding to the diffuse field part.
(38) In step S23B2, the value of the weighting factor W^{g/d}(l) is calculated. A single weighting factor to be applied to the input signal is calculated, incorporating the normalizations of the diffuse filters and of the mean filter:
(39) W^{g/d}(l) = √(E_{B^{g/d}(l)} / E_{B_mean^{g/d}})
(40) As the mean filter is constant across the speakers l, it can be factored out of the sum:
(41) Σ_{l=1}^{L} I(l)*B^{g/d}(l) ≈ (Σ_{l=1}^{L} W^{g/d}(l)·I(l)) * B_mean^{g/d}
(42) Thus, the L convolutions with the diffuse field part are replaced by a single convolution with a mean filter, with a weighted sum of the input signal.
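The replacement of the L convolutions by a single one relies on the bilinearity of convolution: if each diffuse filter is modeled as a weighted copy of the mean filter, the weights can be moved onto the input signals. A toy numerical check, in which all signal values, weights, and names are illustrative only:

```python
def convolve(x, h):
    """Naive full convolution of two sequences (for illustration)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# If every diffuse filter is a scaled copy of the mean filter,
# B(l) = W(l) * B_mean, the L convolutions sum(I(l) * B(l)) collapse
# into one convolution of the weighted input sum with B_mean.
b_mean = [0.5, 0.25, 0.125]
weights = [1.0, 0.7, 0.3]
inputs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]

# L separate convolutions with B(l) = W(l) * B_mean ...
lhs = [0.0] * (3 + 3 - 1)
for w, sig in zip(weights, inputs):
    b_l = [w * s for s in b_mean]
    for n, v in enumerate(convolve(sig, b_l)):
        lhs[n] += v

# ... equal a single convolution of the weighted input sum with B_mean.
mix = [sum(w * sig[n] for w, sig in zip(weights, inputs))
       for n in range(3)]
rhs = convolve(mix, b_mean)

assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```

In the real method the approximation is not exact, since B(l) is only approximately a scaled copy of B_mean; the toy case above shows the algebraic identity being exploited.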
(43) In step S23B3, we can optionally calculate a gain G correcting the gain of the mean filter B_mean^{g/d}. Indeed, when the input signals are convolved with the non-approximated filters B^{g/d}(l), which are decorrelated from one another, the filtered signals to be summed are also decorrelated, regardless of the correlation between the input signals. Conversely, when the input signals are convolved with the approximated mean filter, the energy of the signal resulting from the summation of the filtered signals depends on the correlation existing between the input signals.
(44) For example:
(45) * if all the input signals I(l) are identical and of unitary energy, and the filters B(l) are all decorrelated (being diffuse fields) and of unitary energy, we have:
(46) E(Σ_{l=1}^{L} I(l)*B(l)) = L
(47) * if all the input signals I(l) are decorrelated and of unitary energy, and the filters B(l) are all of unitary energy but are replaced with identical filters
(48) B(l) = B_mean for all l,
we have:
(49) E(Σ_{l=1}^{L} I(l)*B_mean) = L
(50) because the energies of the decorrelated signals are added.
(51) This case is equivalent to the preceding one, in the sense that the signals resulting from the filtering are all decorrelated: through the input signals in the first case, and through the filters in the second.
(52) * if all the input signals I(l) are identical and of unitary energy, and the filters B(l) are all of unitary energy but are replaced with identical filters
(53) B(l) = B_mean for all l,
we have:
(54) E(Σ_{l=1}^{L} I(l)*B_mean) = E(L·(I*B_mean)) = L²
(55) because the amplitudes of the identical signals are summed, so their energies are added in quadrature.
(56) So, if two speakers are active simultaneously and are supplied with decorrelated signals, then no gain is obtained by applying steps S23B1 and S23B2 in comparison to the conventional method. If two speakers are active simultaneously and are supplied with identical signals, then a gain of 10·log10(L²/L) = 10·log10(2²/2) ≈ 3.01 dB is obtained by applying steps S23B1 and S23B2 in comparison to the conventional method. If three speakers are active simultaneously and are supplied with identical signals, then the gain is 10·log10(3²/3) ≈ 4.77 dB.
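These decibel figures follow from the ratio L²/L between the energy obtained with identical input signals and the energy obtained with decorrelated ones. A short check (the helper name `excess_gain_db` is an illustrative assumption):

```python
import math

def excess_gain_db(num_speakers):
    """Energy excess, in dB, of the mean-filter approximation when the
    L active speakers are fed identical signals: amplitudes sum, so the
    energy grows as L**2 instead of L."""
    l = num_speakers
    return 10.0 * math.log10(l ** 2 / l)
```

For L = 2 this gives about 3.01 dB and for L = 3 about 4.77 dB, matching the values above.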
(57) The cases mentioned above correspond to the extreme cases of identical or decorrelated signals. These cases are realistic, however: a source positioned in the middle of two speakers, virtual or real, will provide an identical signal to both speakers (for example with a VBAP (“vector-based amplitude panning”) technique). In the case of positioning within a 3D system, the three speakers can receive the same signal at the same level.
(58) Thus, we can apply a compensation in order to achieve consistency with the energy of binauralized signals.
(59) Ideally, this compensation gain G is determined according to the input signals (G(I(l))) and is applied to the sum of the weighted input signals:
(60) O_B^{g/d} = G(I(l)) · ((Σ_{l=1}^{L} W^{g/d}(l)·I(l)) * B_mean^{g/d})
(61) The gain G(I(l)) may be estimated by calculating the correlation between each of the signals. It may also be estimated by comparing the energies of the signals before and after summation. In this case, the gain G can vary dynamically over time, depending for example on the correlations between the input signals, which themselves vary over time.
(62) In a simplified embodiment, it is possible to set a constant gain, for example G = −3 dB = 10^{−3/20}, which eliminates the need for a correlation estimation, which can be costly. The constant gain G can then be applied offline to the weighting factors (thus giving
(63) G·W^{g/d}(l))
or to the filter B_mean^{g/d}, which eliminates the application of an additional gain on the fly.
(64) Once the transfer functions A and B are isolated and the filter B_mean^{g/d} (and optionally the weights W^{g/d}(l) and the gain G) is calculated, these transfer functions and filters are applied to the input signals.
(65) In a first embodiment, described with reference to
(66) Alternatively, with reference to
(67) In a second embodiment, the gain G is applied prior to summation of the input signals, meaning during the weighting steps (steps M4B1 to M4BL).
(68) In a third embodiment, a decorrelation is applied to the input signals. The signals are thus decorrelated after convolution by the filter B_mean^{g/d}, regardless of the original correlations between the input signals. An efficient implementation of the decorrelation can be used (for example, a feedback delay network) to avoid the use of expensive decorrelation filters.
(69) Thus, under the realistic assumption that BRIR of 48,000 samples in length can be: truncated between sample 150 and sample 3222 by the technique described in step S21, and broken into two parts, a direct field A of 1024 samples and a diffuse field B of 2048 samples, by the technique described in step S22,
(70) then the complexity of the binauralization can be approximated by:
C_inv = C_invA + C_invB = (L+2)·(6·log2(2·NA)) + (L+2)·(6·log2(2·NB)), where NA and NB are the sample sizes of A and B.
(71) Thus, for nBlocks = 10, Fs = 48000, L = 22, NA = 1024, and NB = 2048, the complexity per multichannel signal sample for this FFT-based convolution is C_inv = 3312 multiply-add operations.
(72) However, logically this result should be compared to a simple solution that implements truncation only, meaning for nBlocks = 10, Fs = 3072 (the truncated filter length), L = 22:
C_trunc = (L+2)·(nBlocks)·(6·log2(2·Fs/nBlocks)) = 13339
(73) There is therefore a complexity factor of 19049/3312 ≈ 5.75 between the prior art without truncation (the same block formula with Fs = 48000 gives C_conv ≈ 19049) and the invention, and a complexity factor of 13339/3312 ≈ 4 between the prior art with truncation and the invention.
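The complexity figures above can be reproduced directly from the two formulas. The helper names below are illustrative; the constants follow the text (L = 22, NA = 1024, NB = 2048, nBlocks = 10):

```python
import math

def c_inv(l, na, nb):
    """Complexity of the two-part filtering: direct part of NA samples
    plus diffuse part of NB samples (formula of paragraph (70))."""
    return (l + 2) * 6 * math.log2(2 * na) + (l + 2) * 6 * math.log2(2 * nb)

def c_block(l, n_blocks, fs):
    """Complexity of a block FFT convolution over filters of fs samples,
    as used for the prior-art comparisons."""
    return (l + 2) * n_blocks * 6 * math.log2(2 * fs / n_blocks)

inv = c_inv(22, 1024, 2048)        # 3312 multiply-adds
prior = c_block(22, 10, 48000)     # about 19049
truncated = c_block(22, 10, 3072)  # about 13339
```

Rounding the results gives 3312, 19049, and 13339 multiply-adds, hence ratios of about 5.75 and 4.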
(74) If the size of B is a multiple of the size of A, and the filtering is implemented by FFT blocks, then the calculation of an FFT for A can be reused for B. We therefore need L FFTs over NA points, used both for the filtering by A and by B, two inverse FFTs over NA points to obtain the temporal binaural signal, and the multiplications of the frequency spectra.
(75) In this case, the complexity can be approximated (leaving out the additions; the (L+1) term corresponds to the multiplications of the spectra, L for A and 1 for B) by:
C_inv2 = (L+2)·(6·log2(2·NA)) + (L+1) = 1607
(76) With this approach, we gain a further factor of 2, and therefore factors of approximately 8 and 12 in comparison to the truncated and non-truncated prior art, respectively.
(77) The invention can have direct applications in the MPEG-H 3D Audio standard.
(78) Of course, the invention is not limited to the embodiment described above; it extends to other variants.
(79) For example, an embodiment has been described above in which the Direct signal A is not approximated by a mean filter. Of course, one can use a mean filter of A to perform the convolutions (steps S4A1 to S4AL) with the signals coming from the speakers.
(80) An embodiment based on the processing of multichannel content generated for L speakers was described above. Of course, the multichannel content may be generated by any type of audio source, for example voice, a musical instrument, any noise, etc.
(81) An embodiment based on formulas applied in a certain computational domain (for example the transform domain) was described above. Of course, the invention is not limited to these formulas, and these formulas can be modified to be applicable in other computational domains (for example time domain, frequency domain, time-frequency domain, etc.).
(82) An embodiment was described above based on BRIR values determined in a room. Of course, one can implement the invention for any other type of acoustic environment (for example a concert hall, the open air, etc.).
(83) An embodiment was described above based on the application of two transfer functions. Of course, one can implement the invention with more than two transfer functions. For example, one can synchronistically isolate a portion relative to the directly emitted sounds, a portion relative to the first reflections, and a portion relative to the diffuse sounds.