Audio encoding device and method
11632626 · 2023-04-18
Assignee
Inventors
Cpc classification
H04R2430/21
ELECTRICITY
H04S2400/15
ELECTRICITY
G10L19/02
PHYSICS
H04R3/02
ELECTRICITY
H04S2420/11
ELECTRICITY
H04S3/02
ELECTRICITY
International classification
H04S3/02
ELECTRICITY
G10L19/008
PHYSICS
G10L19/02
PHYSICS
Abstract
A method and a device encode N audio signals from N microphones, where N≥3. For each pair of the N audio signals, an angle of incidence of direct sound is estimated. A-format direct sound signals are derived from the estimated angles of incidence by deriving from each estimated angle an A-format direct sound signal. Each A-format direct sound signal is a first-order virtual microphone signal, for example a cardioid signal.
Claims
1. An audio encoding device, for encoding N audio signals, from N microphones where N≥3, the audio encoding device comprising: a delay estimator configured to estimate angles of incidence of direct sound by estimating, for each pair of the N audio signals, an angle of incidence of the direct sound, and a beam deriver configured to derive A-format direct sound signals from the estimated angles of incidence by deriving, from each of the estimated angles of incidence, a respective one of the A-format direct sound signals, each of the A-format direct sound signals being a first-order virtual microphone signal; and an encoder configured to encode the A-format direct sound signals in first-order ambisonic B-format direct sound signals by applying a transformation matrix to the A-format direct sound signals, wherein N=3, wherein the audio encoding device comprises a short time Fourier transformer configured to perform a short time Fourier transformation on each of the N audio signals x.sub.1, x.sub.2, x.sub.3, resulting in N short time Fourier transformed audio signals X.sub.1[k,i], X.sub.2[k,i], X.sub.3[k,i], wherein the delay estimator is configured to: determine cross spectra of each pair of the short time Fourier transformed audio signals according to:
X.sub.12[k,i]=α.sub.XX.sub.1[k,i]X*.sub.2[k,i]+(1−α.sub.X)X.sub.12[k−1,i],
X.sub.13[k,i]=α.sub.XX.sub.1[k,i]X*.sub.3[k,i]+(1−α.sub.X)X.sub.13[k−1,i], and
X.sub.23[k,i]=α.sub.XX.sub.2[k,i]X*.sub.3[k,i]+(1−α.sub.X)X.sub.23[k−1,i], determine an angle of the complex cross spectrum of each pair of the short time Fourier transformed audio signals according to:
δ.sub.12[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.12[k,i],
δ.sub.13[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.13[k,i], and
δ.sub.23[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.23[k,i], if i≤i.sub.alias or
δ.sub.12[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.12[k,i],
δ.sub.13[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.13[k,i], and
δ.sub.23[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.23[k,i], if i>i.sub.alias estimate the delay in seconds according to:
2. The audio encoding device according to claim 1, wherein the beam deriver is configured to: determine cardioid directional responses according to:
A.sub.12[k,i]=D.sub.12[k,i]X.sub.1[k,i],
A.sub.13[k,i]=D.sub.13[k,i]X.sub.1[k,i], and
A.sub.23[k,i]=D.sub.23[k,i]X.sub.1[k,i], wherein: D is a cardioid directional response, and A is an A-format direct sound signal of the A-format direct sound signals.
3. The audio encoding device according to claim 2, wherein the encoder is configured to encode the A-format direct sound signals to the first-order ambisonic B-format direct sound signals according to:
4. The audio encoding device according to claim 1, comprising a direction of arrival estimator configured to estimate a direction of arrival from the first-order ambisonic B-format direct sound signals, and a higher order ambisonic encoder configured to encode higher order ambisonic B-format direct sound signals using the first-order ambisonic B-format direct sound signals and the estimated direction of arrival, wherein higher order ambisonic B-format direct sound signals have an order higher than one.
5. The audio encoding device according to claim 4, wherein the direction of arrival estimator is configured to estimate the direction of arrival according to:
6. The audio encoding device according to claim 5, wherein the higher order ambisonic B-format direct sound signals comprise second order ambisonic B-format direct sound signals limited to two dimensions, wherein the higher order ambisonic encoder is configured to encode the second order ambisonic B-format direct sound signals according to: denotes “defined as”, Φ is an elevation angle, and θ is an azimuth angle.
7. The audio encoding device according to claim 1, comprising a microphone matcher configured to perform a matching of the N frequency domain audio signals, resulting in N matched frequency domain audio signals.
8. The audio encoding device according to claim 7, comprising a diffuse sound estimator configured to estimate a diffuse sound power, and a de-correlation filter bank configured to perform a de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power.
9. The audio encoding device according to claim 8, wherein the diffuse sound estimator is configured to estimate the diffuse sound power according to:
10. The audio encoding device according to claim 9, wherein the de-correlation filter bank is configured to perform the de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power:
{tilde over (D)}.sub.W[k,i]=DFR.sub.Ww.sub.uU.sub.1P.sub.2D-diff[k,i],
{tilde over (D)}.sub.X[k,i]=DFR.sub.Xw.sub.uU.sub.2P.sub.2D-diff[k,i], and
{tilde over (D)}.sub.Y[k,i]=DFR.sub.Yw.sub.uU.sub.3P.sub.2D-diff[k,i], wherein:
11. The audio encoding device according to claim 1, comprising an adder, which is configured to add channel-wise, the first-order ambisonic B-format direct sound signals and the higher order ambisonic B-format direct sound signals, and/or the diffuse sound signals, resulting in complete ambisonic B-format signals.
12. The audio encoding device according to claim 1, wherein the delay estimator is configured to estimate the angle of incidence for each pair of the N audio signals based on a travelling time delay between the pair of audio signals.
13. The audio encoding device according to claim 1, wherein the delay estimator is configured to estimate the angle of incidence for each pair of the N audio signals based on a delay in seconds and a delay in samples between the pair of audio signals.
14. An audio recording device comprising the N microphones configured to record the N audio signals, and the audio encoding device according to claim 1.
15. A method for encoding N audio signals, from N microphones where N≥3, the method comprising: estimating angles of incidence of direct sound by estimating for each pair of the N audio signals an angle of incidence of the direct sound, deriving A-format direct sound signals from the estimated angles of incidence by deriving, from each of the estimated angles of incidence, a respective one of the A-format direct sound signals, each of the A-format direct sound signals being a first-order virtual microphone signal, and encoding the A-format direct sound signals in first-order ambisonic B-format direct sound signals by applying a transformation matrix to the A-format direct sound signals, wherein N=3, wherein the encoding further comprises performing a short time Fourier transformation on each of the N audio signals x.sub.1, x.sub.2, x.sub.3, resulting in N short time Fourier transformed audio signals X.sub.1[k,i], X.sub.2[k,i], X.sub.3[k,i], wherein the method further comprises: determining cross spectra of each pair of the short time Fourier transformed audio signals according to:
X.sub.12[k,i]=α.sub.XX.sub.1[k,i]X.sub.2.sup.*[k,i]+(1−α.sub.X)X.sub.12[k−1,i],
X.sub.13[k,i]=α.sub.XX.sub.1[k,i]X.sub.3.sup.*[k,i]+(1−α.sub.X)X.sub.13[k−1,i], and
X.sub.23[k,i]=α.sub.XX.sub.2[k,i]X.sub.3.sup.*[k,i]+(1−α.sub.X)X.sub.23[k−1,i], determining an angle of the complex cross spectrum of each pair of the short time Fourier transformed audio signals according to:
δ.sub.12[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.12[k,i],
δ.sub.13[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.13[k,i],
δ.sub.23[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.23[k,i], if i≤i.sub.alias
or
δ.sub.12[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.12[k,i],
δ.sub.13[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.13[k,i],
δ.sub.23[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.23[k,i], if i>i.sub.alias estimating the delay in seconds according to:
16. A non-transitory computer readable storage medium comprising a computer program with a program code, which is configured to be executed by a computer to cause the computer to perform the method according to claim 15.
Description
BRIEF DESCRIPTION OF DRAWINGS
(1) The present disclosure is explained in detail below in relation to embodiments of the present disclosure and with reference to the enclosed drawings.
DETAILED DESCRIPTION
(12) First, we demonstrate the construction and general function of an embodiment of the first aspect and second aspect of the present disclosure along
(13) In
(14) The audio recording device 1 comprises a number of N≥3 microphones 2, which are connected to the audio encoding device 3. The audio encoding device 3 comprises a delay estimator 11, which is connected to the microphones 2. The audio encoding device 3 moreover comprises a beam deriver 12, which is connected to the delay estimator. Furthermore, the audio encoding device 3 comprises an encoder 13, which is connected to the beam deriver 12. Note that the encoder 13 is an optional feature with regard to the first aspect of the present disclosure.
(15) In order to determine ambisonic B-format direct sound signals, the microphones 2 record N≥3 audio signals. These audio signals are preprocessed by components integrated into the microphones 2, in this diagram. For example, a transformation into the frequency domain is performed. This will be shown in more detail along
(16) In
(17) The direction-of-arrival estimator 20 estimates a direction of arrival from the first-order ambisonic B-format direct sound signals and hands it to the higher order ambisonic encoder 21. The higher order ambisonic encoder 21 encodes higher order ambisonic B-format direct sound signals, using the first-order ambisonic B-format direct sound signals and the estimated direction of arrival as an input. The higher order ambisonic B-format direct sound signals have a higher order than 1.
(18) Moreover, the audio encoding device 3 comprises a microphone matcher 30, which performs a matching of the N frequency domain audio signals output by the short-time Fourier transformers 10a, 10b, 10c, resulting in N matched frequency domain audio signals. Connected to the microphone matcher 30, the audio encoding device 3 moreover comprises a diffuse sound estimator 31, which is configured to estimate a diffuse sound power based upon the N matched frequency domain audio signals. Furthermore, the audio encoding device 3 comprises a de-correlation filter bank 32, which is connected to the diffuse sound estimator 31 and configured to perform a de-correlation of the diffuse sound power by generating three orthogonal diffuse sound components from the diffuse sound estimate power.
(19) Finally, the audio encoding device 3 comprises an adder 40, which adds the first-order B-format direct sound signals provided by the encoder 13, the higher order ambisonic B-format signals provided by the higher order encoder 21 and the diffuse sound components provided by the de-correlation filter bank 32. The sum signal is handed to an inverse short-time Fourier transformer 41, which performs an inverse short-time Fourier transformation to achieve the final ambisonic B-format signals in the time domain.
(20) In the following, along
(21) In
(22) Especially, the propagation of direct sound following a ray from a sound source to a pair of microphones in the free-field is considered in
(23) In
(24) The following algorithm aims at estimating the angle of incidence of direct sound based on the cross-correlation between both recorded microphone signals x.sub.1 and x.sub.2, and parametrically derives gain filters that generate beams focusing in specific directions.
(25) A phase estimation between both recording microphones is carried out at each time-frequency tile. The microphone time-frequency representations X.sub.1 and X.sub.2 of the microphone signals are obtained using an N.sub.STFT-point short-time Fourier transform (STFT). The delay relation between the two microphones can be derived from the cross-spectrum:
X.sub.12[k,i]=α.sub.XX.sub.1[k,i]X*.sub.2[k,i]+(1−α.sub.X)X.sub.12[k−1,i], (2)
where * denotes the complex conjugate operator. The forgetting factor α.sub.X is determined by:
(26)
where T.sub.X is an averaging time constant in seconds and f.sub.s is the sampling frequency. The phase response is defined as the angle of the complex cross-spectrum X.sub.12, derived from the ratio between its imaginary and real parts:
(27)
where j is the imaginary unit, that satisfies j.sup.2=−1.
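The recursive smoothing in (2) and the subsequent phase extraction can be sketched for a single frequency bin as follows. This is only an illustration: the forgetting factor is passed in by the caller, and the function names `update_cross_spectrum` and `phase_of` are hypothetical and do not appear in the disclosure.

```python
import cmath

def update_cross_spectrum(x12_prev, x1, x2, alpha):
    """One step of (2): X12[k,i] = alpha*X1*conj(X2) + (1-alpha)*X12[k-1,i]."""
    return alpha * x1 * x2.conjugate() + (1.0 - alpha) * x12_prev

def phase_of(x12):
    """Angle of the complex cross-spectrum, in radians within [-pi, pi]."""
    return cmath.phase(x12)

# Example: in one bin, X1 leads X2 by 0.5 rad in every frame.
x1 = cmath.rect(1.0, 1.2)
x2 = cmath.rect(1.0, 0.7)
x12 = 0j
for _ in range(200):
    x12 = update_cross_spectrum(x12, x1, x2, alpha=0.1)
psi12 = phase_of(x12)
```

Because the smoothing only averages the complex products over frames, a stationary phase difference is preserved while frame-to-frame fluctuations are attenuated.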
(28) Unfortunately, analogous to the Nyquist frequency in temporal sampling, a microphone array has a restriction on the minimum spatial sampling rate. Using two microphones, the smallest wavelength of interest is given by:
λ.sub.alias=2d.sub.mic (5)
corresponding to a maximum frequency,
(29)
up to which the phase estimation is unambiguous. Above this frequency, the measured phase is still obtained following (4), but with an ambiguity of an integer multiple l of 2π:
{tilde over (ψ)}.sub.12[k,i]=ψ.sub.12[k,i]+2π.Math.l[i]. (7)
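As a numerical illustration of (5), the aliasing frequency follows from f = c/λ with λ.sub.alias=2d.sub.mic. The sketch below assumes a speed of sound of 343 m/s and a hypothetical spacing of 10 cm; the mapping from frequency to bin index is also an assumption, inferred from the scaling factor (N.sub.STFT/2+1)/(iπ) that the text uses for the delay conversion.

```python
import math

def aliasing_frequency(d_mic, c=343.0):
    """f_alias = c / lambda_alias, with lambda_alias = 2 * d_mic from (5)."""
    return c / (2.0 * d_mic)

def aliasing_bin(f_alias, fs, n_stft):
    """Highest bin index at or below f_alias.

    Assumes bin i lies at i * fs / (2 * (n_stft / 2 + 1)) Hz, which is an
    assumption and not a formula stated in the text.
    """
    return int(math.floor(f_alias * 2.0 * (n_stft / 2 + 1) / fs))

f_alias = aliasing_frequency(d_mic=0.1)            # 10 cm spacing (hypothetical)
i_alias = aliasing_bin(f_alias, fs=48000.0, n_stft=1024)
```

For this spacing the phase is unambiguous only up to about 1.7 kHz, which is why the high frequency handling described next is needed.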
(30) Because the maximum travelling time between the two microphones of the array is given by d.sub.mic/c, the bounds of the integer l are given by:
(31)
(32) A high frequency extension is provided based on equation (8) to constrain an unwrapping algorithm. The unwrapping aims at correcting the phase angle {tilde over (ψ)}.sub.12[k,i] by adding a multiple l[k,i] of 2π when the absolute jump between two consecutive elements, |{tilde over (ψ)}.sub.12[k,i]−{tilde over (ψ)}.sub.12[k,i−1]|, is greater than or equal to the jump tolerance of π. The estimated unwrapped phase ψ.sub.12 is obtained by limiting the multiples l to their physically possible values. Even if the phase is aliased at high frequency, its slope still follows the same principles as the delay estimation at low frequency. For the purpose of delay estimation, it is then sufficient to integrate the unwrapped phase ψ.sub.12 over a number of frequency bins in order to derive its slope for later delay estimation:
(33)
where N.sub.hf stands for the frequency bandwidth on which the phase is integrated.
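The unwrapping step described above can be sketched as follows, under stated assumptions. The function `unwrap_phase` and its optional `l_max` argument are hypothetical; `l_max[i]` stands in for the per-bin bound on the integer l, whose exact expression is not reproduced here.

```python
import math

def unwrap_phase(wrapped, l_max=None):
    """Unwrap a phase sequence over frequency bins with a jump tolerance of pi.

    l_max, if given, lists the largest allowed |l| per bin (hypothetical
    stand-in for the physical bound on l).
    """
    unwrapped = [wrapped[0]]
    l = 0
    for i in range(1, len(wrapped)):
        jump = wrapped[i] - wrapped[i - 1]
        if jump >= math.pi:        # upward jump: phase wrapped downwards
            l -= 1
        elif jump <= -math.pi:     # downward jump: phase wrapped upwards
            l += 1
        if l_max is not None:
            l = max(-l_max[i], min(l_max[i], l))
        unwrapped.append(wrapped[i] + 2.0 * math.pi * l)
    return unwrapped

# A linear phase (pure delay) with a slope of 0.9 rad per bin, wrapped to [-pi, pi).
true_phase = [0.9 * i for i in range(20)]
wrapped = [(p + math.pi) % (2.0 * math.pi) - math.pi for p in true_phase]
unwrapped = unwrap_phase(wrapped)
```

The example recovers the linear phase because its slope per bin stays below the jump tolerance; a steeper slope would need the additional constraint on l to be resolved.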
(34) For each frequency bin i, dividing by the corresponding physical frequency, the delay δ.sub.12[k,i], expressed in number of samples, is obtained from the previously derived phase:
δ.sub.12[k,i]=(N.sub.STFT/2+1)/(iπ)ψ.sub.12[k,i] if i≤i.sub.alias
(35) otherwise:
δ.sub.12[k,i]=(N.sub.STFT/2+1)/(iπ)Ψ.sub.12[k,i], (10)
where i.sub.alias is the frequency bin corresponding to the aliasing frequency (6). The delay in seconds is:
(36)
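The conversion in (10) from phase to a delay in samples can be checked with a short sketch. The conversion to seconds shown here, a division by the sampling frequency f.sub.s, is an assumption consistent with the units; the function names are hypothetical.

```python
import math

def delay_in_samples(phase, i, n_stft):
    """(10): delta[k,i] = (N_STFT/2 + 1) / (i * pi) * phase, for bin i >= 1."""
    return (n_stft / 2 + 1) / (i * math.pi) * phase

def delay_in_seconds(delta_samples, fs):
    """Assumed conversion: delay in samples divided by the sampling frequency."""
    return delta_samples / fs

n_stft, fs = 1024, 48000.0
i = 10
# Phase that a pure 3-sample delay produces at bin i under the scaling of (10).
phase = math.pi * i * 3.0 / (n_stft / 2 + 1)
delta = delay_in_samples(phase, i, n_stft)
tau = delay_in_seconds(delta, fs)
```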
(37) The derived delay relates directly to the angle of incidence of sound emitted by a sound source, as illustrated in
(38)
where d.sub.mic is the distance between the two microphones and c is the speed of sound in air.
(39) In free-field, for direct sound, the directional response of a cardioid microphone pointing to the side of the array is built as a function of the estimated angle of incidence:
(40)
(41) By applying the gain D to the input spectrum X.sub.1, a virtual cardioid signal can be retrieved from the direct sound of the input microphone signals. This corresponds to the function of the beam deriver 12.
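A minimal sketch of this step is given below under explicitly assumed conventions, since the angle and cardioid expressions are not reproduced here. It assumes a far-field plane wave with delay τ = (d.sub.mic/c)·cos θ, where θ is measured from the axis joining the two microphones, and a cardioid pointing along that axis with response (1 + cos θ)/2. The disclosure may use a different angle convention; the function names are hypothetical.

```python
import math

def angle_of_incidence(tau, d_mic, c=343.0):
    """Assumed far-field relation: cos(theta) = c * tau / d_mic."""
    ratio = max(-1.0, min(1.0, c * tau / d_mic))  # clamp against estimation noise
    return math.acos(ratio)

def cardioid_gain(theta):
    """Assumed first-order cardioid response pointing at theta = 0."""
    return 0.5 * (1.0 + math.cos(theta))

d_mic = 0.1                                            # hypothetical spacing in metres
tau = d_mic / 343.0 * math.cos(math.radians(60.0))     # delay of a source at 60 degrees
theta = angle_of_incidence(tau, d_mic)
gain = cardioid_gain(theta)

x1 = complex(0.2, -0.4)     # one time-frequency tile of X1 (made-up value)
a12 = gain * x1             # virtual cardioid signal: gain D applied to X1
```

The gain is real-valued, so applying it changes only the magnitude of the tile and leaves the phase of X.sub.1 untouched.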
(42) In
(43) In
(44) In the following, the conversion from A-format direct sound signals to B-format direct sound signals is shown. This corresponds to the function of the encoder 13.
(45) The following Table lists the Ambisonic B-format channels and their spherical representation D(θ,ϕ) up to third order, normalized with the Schmidt semi-normalization (SN3D), where θ and ϕ are the azimuth and elevation angles, respectively:
(46) TABLE-US-00001
Order  Channel  SN3D Definition: D(θ, ϕ) =
0      W        1
1      X        cos θ cos ϕ
1      Y        sin θ cos ϕ
1      Z        sin ϕ
2      R        (3 sin.sup.2 ϕ − 1)/2
2      S        {square root over (3/2)} cos θ sin 2ϕ
2      T        {square root over (3/2)} sin θ sin 2ϕ
2      U        {square root over (3/2)} cos 2θ cos.sup.2 ϕ
2      V        {square root over (3/2)} sin 2θ cos.sup.2 ϕ
3      K        sin ϕ (5 sin.sup.2 ϕ − 3)/2
3      L        {square root over (3/8)} cos θ cos ϕ (5 sin.sup.2 ϕ − 1)
3      M        {square root over (3/8)} sin θ cos ϕ (5 sin.sup.2 ϕ − 1)
3      N        {square root over (15/2)} cos 2θ sin ϕ cos.sup.2 ϕ
3      O        {square root over (15/2)} sin 2θ sin ϕ cos.sup.2 ϕ
3      P        {square root over (5/8)} cos 3θ cos.sup.3 ϕ
3      Q        {square root over (5/8)} sin 3θ cos.sup.3 ϕ
(47) These spherical harmonics form a set of orthogonal basis functions and can be used to describe any function on the surface of a sphere.
(48) Without loss of generality, three microphones, the minimum number, are considered and placed in the horizontal XY-plane, for instance disposed at the edges of a mobile device as illustrated in
(49) The three possible unordered microphone pairs are defined as:
pair 1 ≜ mic2 → mic1
pair 2 ≜ mic3 → mic2
pair 3 ≜ mic1 → mic3
(50) The look direction (Θ=0) being defined by the X-axis, their direction vectors are:
(51)
(52) The directions of the pairs in the horizontal plane are:
(53)
(54) The microphone spacings are:
(55)
(56) The gain (13) resulting from the angle of incidence estimation is applied to each pair leading to cardioid directional responses:
∀n∈[1 . . . 3],A.sub.p.sub.
(57) The three resulting cardioids are pointing in the three directions θ.sub.p.sub.
(58) Assuming that the obtained cardioids are coincident, the corresponding first order Ambisonic B-format signals can be computed by means of linear combination of the spectra A.sub.p.sub.
(59)
(60) The inverse of the matrix in (18) converts the cardioids to Ambisonic B-format:
(61)
(62) The first order Ambisonic B-format normalized directional responses R.sub.W, R.sub.X, and R.sub.Y, are shown in
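The conversion from three coincident cardioids to the horizontal first-order channels W, X and Y can be sketched as below. The sketch assumes that each cardioid is the linear combination 0.5·W + 0.5·cos(θ.sub.n)·X + 0.5·sin(θ.sub.n)·Y, which is a common cardioid model and not necessarily the exact matrix of the disclosure. The three pointing directions are hypothetical values chosen for illustration.

```python
import math

def cardioid_matrix(directions):
    """Rows map (W, X, Y) to a cardioid pointing at each direction (assumed model)."""
    return [[0.5, 0.5 * math.cos(t), 0.5 * math.sin(t)] for t in directions]

def invert_3x3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("cardioid directions do not span the horizontal plane")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]

def a_to_b(a_signals, directions):
    """Apply the inverse cardioid matrix to obtain (W, X, Y)."""
    inv = invert_3x3(cardioid_matrix(directions))
    return [sum(inv[r][n] * a_signals[n] for n in range(3)) for r in range(3)]

directions = [math.radians(d) for d in (0.0, 120.0, 240.0)]   # hypothetical
source = math.radians(40.0)
# Cardioid responses to a unit plane wave arriving from the source direction.
a = [0.5 * (1.0 + math.cos(source - t)) for t in directions]
w, x, y = a_to_b(a, directions)
```

For a plane wave from azimuth 40 degrees the recovered channels are W = 1, X = cos 40° and Y = sin 40°, matching the first-order entries of the Table at zero elevation.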
(63) In the following, the determining of higher order ambisonic B-format signals is shown. This corresponds to the function of the direction-of-arrival estimator 20 and the higher order ambisonic encoder 21.
(64) In the previous derivation of the first order ambisonic B-format signals R.sub.W, R.sub.X, and R.sub.Y for the direct sound, no explicit direction of arrival (DOA) of sound was computed. Instead, the directional responses of the three signals R.sub.W, R.sub.X, and R.sub.Y have been obtained from the A-format cardioid signals A.sub.p.sub.
(65) In order to obtain the higher order (e.g. second and third) ambisonic B-format signals, an explicit DOA is derived based on the two first order ambisonic B-format signals R.sub.X and R.sub.Y as:
(66)
(67) Again, assuming three omnidirectional microphones in the horizontal plane (φ=0), the channels of interest as defined in the ambisonic definition in the Table are limited to:
order 0: W
order 1: X, Y
order 2: R, U, V
order 3: L, M, P, Q
(68) The other channels are null since they are modulated by sin φ, with φ=0. For each of the channels listed above, the directional responses are thus derived by replacing the azimuth angle Θ with the estimated DOA Θ.sub.XY. For instance, considering second order (assuming no elevation, i.e. φ=0):
(69)
(70) The resulting ambisonic channels, R.sub.R, R.sub.U, R.sub.V, R.sub.L, R.sub.M, R.sub.P, and R.sub.Q, contain only the direct sound components of the sound field.
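The idea of deriving an azimuth from R.sub.X and R.sub.Y and reusing it in the higher order patterns can be sketched as follows. The use of a two-argument arctangent on real-valued directional responses is an assumption, since the DOA expression itself is not reproduced here. The constant factors from the Table are deliberately left out, so only the azimuth dependence of U and V at zero elevation is shown.

```python
import math

def estimate_doa(r_x, r_y):
    """Assumed DOA estimate: azimuth of the (R_X, R_Y) vector, real-valued inputs."""
    return math.atan2(r_y, r_x)

def second_order_horizontal_pattern(theta):
    """Azimuth dependence of U and V at zero elevation, without the Table's constants."""
    return math.cos(2.0 * theta), math.sin(2.0 * theta)

# First-order responses of a source at 30 degrees azimuth in the horizontal plane.
r_x = math.cos(math.radians(30.0))
r_y = math.sin(math.radians(30.0))
theta_xy = estimate_doa(r_x, r_y)
u_pattern, v_pattern = second_order_horizontal_pattern(theta_xy)
```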
(71) Now, the handling of diffuse sound is shown. This corresponds to the diffuse sound estimator 31 and the de-correlation filter bank 32 of
(72) In
(73) In
(74) The previous derivation of the ambisonic B-format signals is only valid under the assumption of direct sound. It does not hold for diffuse sound. In the following, a method for obtaining an equivalent diffuse sound for Ambisonic B-format signals is given. Long enough after the direct sound and a number of early reflections, numerous reflections are themselves reflected in the space, creating a diffuse sound field. A diffuse sound field is mathematically understood as independent sounds having the same energy and coming from all directions, as illustrated in
(75) It is assumed that X.sub.1 and X.sub.2 can be modelled as:
X.sub.1[k,i]=S[k,i]+N.sub.1[k,i],
X.sub.2[k,i]=a[k,i]S[k,i]+N.sub.2[k,i], (22)
where a[k,i] is a gain factor, S[k,i] is the direct sound in the left channel, and N.sub.1[k,i] and N.sub.2[k,i] represent diffuse sound. From (22) it follows that:
E{X.sub.1X*.sub.1}=E{SS*}+E{N.sub.1N*.sub.1}
E{X.sub.2X*.sub.2}=a.sup.2E{SS*}+E{N.sub.2N*.sub.2}
E{X.sub.1X*.sub.2}=aE{SS*}+E{N.sub.1N*.sub.2}. (23)
(76) It is reasonable to assume that the amount of diffuse sound in both microphone signals is the same, i.e. E{N.sub.1N*.sub.1}=E{N.sub.2N*.sub.2}=E{NN*}. Furthermore, the normalized cross-correlation coefficient between N.sub.1 and N.sub.2 is denoted Φ.sub.diff and can be obtained from Cook's formula:
(77)
Finally, (23) can be rewritten as
E{X.sub.1X*.sub.1}=E{SS*}+E{NN*}
E{X.sub.2X*.sub.2}=a.sup.2E{SS*}+E{NN*}
E{X.sub.1X*.sub.2}=aE{SS*}+Φ.sub.diffE{NN*}. (25)
(78) Elimination of E{SS*} and a in (25) yields the quadratic equation:
AE{NN*}.sup.2+BE{NN*}+C=0 (26)
with
A=1−Φ.sub.diff.sup.2,
B=2Φ.sub.diffE{X.sub.1X*.sub.2}−E{X.sub.1X*.sub.1}−E{X.sub.2X*.sub.2},
C=E{X.sub.1X*.sub.1}E{X.sub.2X*.sub.2}−E{X.sub.1X*.sub.2}.sup.2. (27)
(79) The power estimate of diffuse sound, denoted P.sub.diff, is then the physically possible one of the two solutions of (26). The other solution, which yields a diffuse sound power larger than the microphone signal power, is discarded as physically impossible, i.e.:
(80)
(81) Note that the contribution of the direct sound can then be computed directly as:
P.sub.dir[k,i]=P.sub.X.sub.
(82) This corresponds to the function of the diffuse sound estimator 31.
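The quadratic (26) with the coefficients (27) can be solved numerically as sketched below. The sketch assumes real-valued power and cross-power estimates and |Φ.sub.diff| < 1, and it selects the smaller root, which is the one that does not exceed the microphone signal power. The function name and the synthetic values are hypothetical.

```python
import math

def diffuse_power(p11, p22, p12, phi_diff):
    """Smaller root of A*P**2 + B*P + C = 0 with A, B, C from (27)."""
    a = 1.0 - phi_diff ** 2
    b = 2.0 * phi_diff * p12 - p11 - p22
    c = p11 * p22 - p12 ** 2
    disc = max(b * b - 4.0 * a * c, 0.0)   # guard against tiny negative round-off
    return (-b - math.sqrt(disc)) / (2.0 * a)

# Synthetic check built from the model (25): direct power 2.0, diffuse power 0.5.
s_pow, n_pow, gain, phi = 2.0, 0.5, 0.8, 0.3
p11 = s_pow + n_pow                    # E{X1 X1*}
p22 = gain ** 2 * s_pow + n_pow        # E{X2 X2*}
p12 = gain * s_pow + phi * n_pow       # E{X1 X2*}
p_diff = diffuse_power(p11, p22, p12, phi)
p_dir = p11 - p_diff                   # direct power, from the first line of (25)
```

With these inputs the two roots are 0.5 and roughly 3.05; the larger one exceeds the signal power of 2.5 in the first microphone and is therefore the discarded solution.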
(83) By definition, the Ambisonic B-format signals are obtained by projecting the sound field onto the spherical harmonics basis defined in the previous table. Mathematically, the projection corresponds to the integration of the sound field signal over the spherical harmonics.
(84) As illustrated in
D.sub.W⊥D.sub.X⊥D.sub.Y. (30)
(85) Note that this property no longer holds for direct sound, since a sound source emitting from only one direction, projected onto the same basis, will result in a single gain equal to the directional responses at the incidence angle of the sound source, leading to non-orthogonal, or in other terms, correlated components R.sub.W, R.sub.X, and R.sub.Y.
(86) However, here, considering a distribution of three omnidirectional microphones, the single diffuse sound estimate (28) is equivalent for all three microphones (or all three microphone pairs). Therefore, there is no possibility to retrieve the native diffuse sound components of the Ambisonic B-format signals, i.e. D.sub.W, D.sub.X, and D.sub.Y, as they would be obtained separately by projection of the diffuse sound field onto the spherical harmonics basis.
(87) Instead of getting the exact diffuse sound Ambisonic B-format signals, an alternative is to generate three orthogonal diffuse sound components from the single known diffuse sound estimate P.sub.diff. This way, even if the diffuse sound components do not correspond to the native Ambisonic B-format obtained by projection, the most perceptually important property of orthogonality (enabling localization and spatialization) is preserved. This can be achieved by using de-correlation filters.
(88) The de-correlation filters are derived from a Gaussian noise sequence u of given length l.sub.u. A Gram-Schmidt process applied to this sequence leads to N.sub.u orthogonal sequences U.sub.1, U.sub.2, …, U.sub.N.sub.
(89) Given the length l.sub.u of the Gaussian noise sequence u, the de-correlation filters are shaped such that they have an exponential decay over time, similar to reverberation in a room. To do so, the sequences U.sub.1, U.sub.2, …, U.sub.N are multiplied with an exponential window w.sub.u with a time constant corresponding to the reverberation time RT.sub.60:
(90)
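The construction of the de-correlation filters can be sketched as follows, under two stated assumptions. First, the N.sub.u orthogonal sequences are obtained here by orthonormalizing N.sub.u independently drawn Gaussian sequences, which is one possible reading of the Gram-Schmidt step. Second, the window is taken as an amplitude envelope that falls by 60 dB after RT.sub.60 seconds, which follows from the definition of RT.sub.60 but may differ from the exact window of the disclosure. All names and parameter values are hypothetical.

```python
import math
import random

def gram_schmidt(sequences):
    """Orthonormalize a list of equal-length sequences (modified Gram-Schmidt)."""
    basis = []
    for v in sequences:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))   # b has unit norm
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis

def exponential_window(length, rt60, fs):
    """Assumed envelope: amplitude drops by 60 dB after rt60 seconds."""
    return [10.0 ** (-3.0 * n / (rt60 * fs)) for n in range(length)]

rng = random.Random(0)                     # fixed seed for reproducibility
length, n_u = 256, 3
noise = [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(n_u)]
u = gram_schmidt(noise)
w_u = exponential_window(length, rt60=0.5, fs=48000.0)
filters = [[wn * un for wn, un in zip(w_u, seq)] for seq in u]
```

Note that the window is applied after the orthogonalization, so the shaped filters are exactly orthogonal only before windowing and approximately orthogonal afterwards.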
(91) In
(92) The exponential decay of the de-correlation filters, illustrated in
(93) Finally, the resulting de-correlation filters are modulated by the diffuse-field responses of the ambisonic B-format channels they correspond to. This way the amount of diffuse sound in each ambisonic B-format channel matches the amount of diffuse sound of a natural B-format recording. The diffuse-field response DFR is the average of the corresponding spherical harmonic directional-response-squared contributions considering all directions, i.e.:
(94)
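The averaging described above can be approximated numerically. The sketch below averages the squared directional response over the full sphere with a midpoint rule and cos ϕ area weighting, using the first-order definitions from the Table. Whether the disclosure averages over the sphere or only over the horizontal circle, and whether a square root is applied afterwards, is not recoverable from the text, so this is an illustration of the averaging only. The function name is hypothetical.

```python
import math

def diffuse_field_response(d, n_az=360, n_el=180):
    """Average of d(theta, phi)**2 over the sphere (midpoint rule)."""
    total = 0.0
    weight = 0.0
    for a in range(n_az):
        theta = 2.0 * math.pi * (a + 0.5) / n_az
        for e in range(n_el):
            phi = -math.pi / 2.0 + math.pi * (e + 0.5) / n_el
            area = math.cos(phi)            # surface element weighting
            total += d(theta, phi) ** 2 * area
            weight += area
    return total / weight

dfr_w = diffuse_field_response(lambda theta, phi: 1.0)
dfr_x = diffuse_field_response(lambda theta, phi: math.cos(theta) * math.cos(phi))
dfr_y = diffuse_field_response(lambda theta, phi: math.sin(theta) * math.cos(phi))
```

Under this sphere average the omnidirectional channel W has a response of 1, while the dipole channels X and Y each come out at about 1/3.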
(95) In the three-microphone case (N.sub.u=3), the resulting de-correlation filters are:
{tilde over (D)}.sub.W[k,i]=DFR.sub.Ww.sub.uU.sub.1P.sub.2D-diff[k,i],
{tilde over (D)}.sub.X[k,i]=DFR.sub.Xw.sub.uU.sub.2P.sub.2D-diff[k,i],
{tilde over (D)}.sub.Y[k,i]=DFR.sub.Yw.sub.uU.sub.3P.sub.2D-diff[k,i]. (33)
(96) This way, because the orthogonality between all three diffuse sound components is preserved, any further processing using the generated B-format, e.g. conventional ambisonic decoding, will work on diffuse sound too.
(97) Finally, both direct and diffuse sound contributions have to be mixed together in order to generate the full Ambisonic B-format. Given the assumed signal model, the direct and diffuse sounds are, by definition, orthogonal, too. Thus the complete Ambisonic B-format signals are obtained using a straightforward addition:
B.sub.W[k,i]=R.sub.W[k,i]+{tilde over (D)}.sub.W[k,i],
B.sub.X[k,i]=R.sub.X[k,i]+{tilde over (D)}.sub.X[k,i],
B.sub.Y[k,i]=R.sub.Y[k,i]+{tilde over (D)}.sub.Y[k,i]. (34)
This addition is performed by the adder 40 of
(98) After this addition, only the inverse short-time Fourier transformation remains to be performed by the inverse short-time Fourier transformer 41 in order to obtain the output B-format ambisonic signals.
(99) Finally, in
(100) Note that the audio encoding device according to the first aspect of the present disclosure as well as the audio recording device according to the second aspect of the present disclosure relate very closely to the audio encoding method according to the third aspect of the present disclosure. Therefore, the elaborations along
(101) These encoded signals are fully compatible with conventional Ambisonic B-format signals, and thus, can be used as input for Ambisonic B-format decoding or any other processing. The same principle can be applied to retrieve full higher order Ambisonic B-format signals with both direct and diffuse sounds contributions.
(102) Abbreviations and Notations
(103) TABLE-US-00002
Abbreviation  Definition
VR            Virtual Reality
DirAc         Directional Audio Coding
DOA           Direction Of Arrival
STFT          Short-Time Fourier Transform
SN3D          Schmidt semi-Normalization 3D
DFR           Diffuse-Field Response
SNR           Signal to Noise Ratio
HOA           High Order Ambisonic
(104) TABLE-US-00003
Notation                                     Definition
x.sub.1, x.sub.2                             Both recorded microphone signals
X.sub.1[k, i]                                STFT of x.sub.1 in frame k and frequency bin i
S[k, i]                                      STFT of source signal
N.sub.1[k, i]                                Diffuse noise in microphone 1
α.sub.X                                      Forgetting factor
T.sub.X                                      Averaging time constant
X.sub.12[k, i]                               Cross-spectrum of microphone signals 1 and 2
f.sub.s                                      Sampling frequency
f.sub.alias                                  Aliasing frequency
d.sub.mic                                    Distance between both microphones
E{ }                                         Expectation operator
θ and ϕ                                      Azimuth and elevation angles
P.sub.diff                                   Power estimate of diffuse noise
R.sub.W, R.sub.X, R.sub.Y                    First order Ambisonic components
R.sub.R, R.sub.U, R.sub.V, R.sub.L, R.sub.M, R.sub.P, and R.sub.Q   Higher order Ambisonic components
P.sub.2D-diff                                Power estimate of diffuse noise in 2D
U.sub.1, U.sub.2, …, U.sub.N.sub.
(105) The present disclosure is not limited to the examples and especially not to a specific number of microphones. The characteristics of the exemplary embodiments can be used in any advantageous combination.
(106) The present disclosure has been described in conjunction with various embodiments herein. However, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the internet or other wired or wireless communication systems.