Apparatus and method for microphone positioning based on a spatial power density
10284947 · 2019-05-07
Assignee
Inventors
- Giovanni Del Galdo (Heroldsberg, DE)
- Oliver Thiergart (Forchheim, DE)
- Fabian Kuech (Erlangen, DE)
- Emanuel Habets (Spardorf, NL)
- Alexandra Craciun (Erlangen, DE)
Cpc classification
H04S2400/15
ELECTRICITY
G10L19/12
PHYSICS
G10L15/10
PHYSICS
International classification
Abstract
An apparatus for microphone positioning includes a spatial power distribution determiner and a spatial information estimator. The spatial power distribution determiner is adapted to determine a spatial power density indicating power values for a plurality of locations of an environment based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment. The spatial information estimator is adapted to estimate acoustic spatial information based on the spatial power density.
Claims
1. An apparatus for microphone positioning, comprising: a spatial power density determiner for determining a spatial power density indicating power values for a plurality of locations of an environment based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment, and a spatial information estimator for estimating a position of a microphone based on the spatial power density, wherein the spatial information estimator comprises a sound scene center estimator for estimating a position of a center of a sound scene in the environment, wherein the spatial information estimator furthermore comprises a microphone position calculator for determining the position of the microphone based on the position of the center of the sound scene, wherein the spatial information estimator comprises an orientation determiner for determining an orientation of the microphone, wherein the orientation determiner is adapted to determine the orientation of the microphone based on the spatial power density, wherein the spatial power density determiner is adapted to determine the spatial power density by applying the formula
2. The apparatus according to claim 1, wherein the microphone position calculator is adapted to calculate the position of the microphone, wherein the microphone is a virtual spatial microphone.
3. The apparatus according to claim 1, wherein the sound scene center estimator is adapted to calculate a center of gravity of the spatial power density for estimating the center of the sound scene.
4. The apparatus according to claim 1, wherein the sound scene center estimator is configured to determine a power delay profile based on the spatial power density and to determine a root mean squared delay based on the power delay profile for each one of a plurality of locations in the environment, and wherein the sound scene center estimator is configured to determine the location of the plurality of locations as the center of the sound scene, which comprises the minimum root mean squared delay of the root mean squared delays of the plurality of locations.
5. The apparatus according to claim 1, wherein the microphone position calculator is adapted to determine a broadest-width line of a plurality of lines through the center of the sound scene in the environment, wherein each of the plurality of lines through the center of the sound scene is associated with an energy width, and wherein the broadest-width line is defined as the line of the plurality of lines through the center of the sound scene comprising the largest energy width, wherein the microphone position calculator is adapted to determine the position of the microphone such that a second line, which passes through the center of the sound scene and the position of the microphone, is orthogonal to the broadest-width line.
6. The apparatus according to claim 5, wherein energy width of a considered line of the plurality of lines indicates a largest width of a segment on the considered line, such that the first point of the segment limiting the segment, and such that a different second point of the segment limiting the segment, comprise both a power value indicated by the spatial power density, that is greater than or equal to a predefined power value.
7. The apparatus according to claim 1, wherein the microphone position calculator is configured to apply a singular value decomposition to a matrix comprising a plurality of columns, wherein the columns of the matrix indicate positions of locations in the environment relative to the center of the sound scene, and wherein the columns of the matrix only indicate the positions of locations comprising power values indicated by the spatial power density that are greater than a predefined threshold value, or the columns of the matrix only indicate the positions of locations comprising power values indicated by the spatial power density that are greater than or equal to a predefined threshold value.
8. The apparatus according to claim 1, wherein the orientation determiner is adapted to determine the orientation of the microphone such that the microphone is oriented towards the center of the sound scene.
9. An apparatus for generating a virtual output signal, comprising: an apparatus for microphone positioning, comprising: a spatial power density determiner for determining a spatial power density indicating power values for a plurality of locations of an environment based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment, and a spatial information estimator for estimating a position of a microphone based on the spatial power density, wherein the spatial information estimator comprises a sound scene center estimator for estimating a position of a center of a sound scene in the environment, wherein the spatial information estimator furthermore comprises a microphone position calculator for determining the position of the microphone based on the position of the center of the sound scene, wherein the spatial information estimator comprises an orientation determiner for determining an orientation of the microphone, wherein the orientation determiner is adapted to determine the orientation of the microphone based on the spatial power density, wherein the spatial power density determiner is adapted to determine the spatial power density by applying the formula
10. A method for microphone positioning, comprising: determining a spatial power density indicating power values for a plurality of locations of an environment based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment, and estimating a position of a microphone based on the spatial power density, and determining an orientation of the microphone, wherein estimating the position of the microphone based on the spatial power density is conducted by estimating a position of a center of a sound scene in the environment, and by determining the position of the microphone based on the position of the center of the sound scene, wherein determining the orientation of the microphone is conducted based on the spatial power density, wherein determining the spatial power density is conducted by applying the formula
11. A non-transitory computer-readable medium comprising a computer program for implementing the method for microphone positioning, said method comprising: determining a spatial power density indicating power values for a plurality of locations of an environment based on sound source information indicating one or more power values and one or more position values of one or more sound sources located in the environment, and estimating a position of a microphone based on the spatial power density, and determining an orientation of the microphone, wherein estimating the position of the microphone based on the spatial power density is conducted by estimating a position of a center of a sound scene in the environment, and by determining the position of the microphone based on the position of the center of the sound scene, wherein determining the orientation of the microphone is conducted based on the spatial power density, wherein determining the spatial power density is conducted by applying the formula
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
DETAILED DESCRIPTION OF THE INVENTION
(33) The effective sound sources may, e.g., be equal to the instantaneous point-like sound sources (IPLS) as described below for the apparatus for generating an audio output signal of a virtual microphone at a configurable virtual position.
(34) At the output, the position and location of the one or more virtual microphones are returned. In the following, the term physical source is used to describe a real source from the sound scene, e.g., a talker, whereas the term effective sound source (ESS), (also referred to as sound source), is used to describe a sound event which is active in a single time or time-frequency bin, as also used for the IPLS described below with respect to the apparatus for generating an audio output signal of a virtual microphone at a configurable virtual position.
(35) Moreover, it should be noted that the term sound source covers both physical sources and effective sound sources.
(36) The input of the apparatus according to the embodiment of
(37) [20] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E. A. P. Habets. Generating virtual microphone signals using geometrical information gathered by distributed arrays. In Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA '11), Edinburgh, United Kingdom, May 2011.
(38) For example, this information can be comprised in the output 106 in
(39) Regarding the apparatus for microphone positioning, different operating modes can become active during a certain time interval, each implying various scenarios for the positioning and orientating of the one or more virtual microphones. An apparatus for microphone positioning can be employed for a plurality of application scenarios:
(40) In a first application scenario, N omnidirectional virtual microphones may be placed inside the sound scene (see
(41) In a second application scenario, a single virtual microphone is positioned in the acoustic center of the sound scene. For example, omnidirectional virtual microphones, cardioid virtual microphones, or a virtual spatial microphone (such as a B-format microphone) is placed such that all participants are captured optimally (
(42) In a third application scenario, one spatial microphone is placed outside the sound scene. For example, a virtual stereo microphone is placed such that a broad spatial image is obtained, as illustrated in
(43) In a fourth application scenario, the optimal orientation of the virtual microphone is estimated while the virtual microphone is located at a fixed position (predetermined position), for example the position and directivity of the virtual microphone might be predefined and only the orientation is calculated automatically.
(44) It should be noted that all of the above applications may include temporal adaptability. For instance, the virtual spot microphone's position/orientation follows one talker as the talker moves in the room.
(45) In
(46) [21] Ville Pulkki. Spatial sound reproduction with directional audio coding. J. Audio Eng. Soc, 55(6):503-516, June 2007.
(47) The metric can be expressed either with respect to all of the inputs 91, . . . , 9N, (for example, a constant value of the metric for all inputs may be used), or, can be defined differently for each input 91, . . . , 9N. The outputs 15, 16 of the apparatus of
(49) The computation of the SPD for a time-frequency bin (k, n) may be done according to the formula
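(50) $$\Gamma(x,y,z,k,n) = \sum_{i=1}^{N} \mathrm{power}_i(k,n) \cdot g(\gamma_i,\, x - x_{\mathrm{ESS}_i},\, y - y_{\mathrm{ESS}_i},\, z - z_{\mathrm{ESS}_i},\, k,\, n), \quad (1)$$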
wherein $(x, y, z)$ represent the coordinates of the system and $x_{\mathrm{ESS}_i}$, $y_{\mathrm{ESS}_i}$, $z_{\mathrm{ESS}_i}$ are the coordinates of the effective sound source $i$. The significance metric 103 $\gamma_i$ represents an indicator of how reliable the position estimates of each effective sound source are. By default, the significance metric may be equal to 1. It should be noted here that power, and coordinates $x_{\mathrm{ESS}_i}$, $y_{\mathrm{ESS}_i}$ and $z_{\mathrm{ESS}_i}$ correspond to input 9i in
(51) The SPD generated by the spatial power distribution main processing unit 31 (for instance in
(52) When computing the position and/or orientation of the one or more virtual microphones, an optional parameter which depends on the SPD may be employed. This parameter may refer to e.g., forbidden and/or advantageous regions of the room where to place the virtual microphones (VM), or, may refer to the SPD, choosing specific SPD ranges, which satisfy some predetermined rules.
(53) As can be seen in formula (1), $g$ is a function of the significance metric $\gamma$ (or rather $\gamma_i$) in space, which has, by default, a value equal to 1. Otherwise, $\gamma$ may be used to take different contributions into account. For example, if $\sigma^2$ is the variance of the position estimation, then, e.g., $\gamma$ may be set to
(54) $$\gamma = \frac{1}{\sigma^2}.$$
(55) Alternatively, the average diffuseness $\Psi$ computed at the microphone arrays can be employed, resulting in $\gamma = 1 - \Psi$.
(56) By this, $\gamma$ may be chosen such that it decreases for more unreliable estimates and increases for more reliable ones.
(57) A plurality of possibilities exist for constructing function g. Two examples particularly useful in practice are:
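(58) $$g(\gamma, x, y, z) = \delta(x) \cdot \delta(y) \cdot \delta(z), \quad (2)$$
$$g(\gamma, s) = \frac{1}{(2\pi)^{3/2}\, |\Sigma_\gamma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(s-\mu)^T\, \Sigma_\gamma^{-1}\, (s-\mu)\right). \quad (3)$$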
(59) In the first function, $\delta(x)$, $\delta(y)$ and $\delta(z)$ indicate delta functions (see
$$\Sigma_\gamma = E[(s-\mu)(s-\mu)^T], \quad (4)$$
which is dependent on the choice of $\gamma$ for the scenario where $\gamma = 1/\sigma^2$, having in mind that, for example, for the 1D case:
$$\sigma^2 = E[(x-\mu_x)^2]. \quad (5)$$
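The computation of the SPD may, e.g., be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it is restricted to 2D, it uses an isotropic Gaussian for $g$ (covariance $\sigma_i^2 I$, with $\gamma_i = 1/\sigma_i^2$), and the function name `spatial_power_density` and its parameters are hypothetical.

```python
import numpy as np

def spatial_power_density(grid_xy, src_xy, src_power, sigma2):
    """SPD at each grid point: sum_i power_i * g(x - x_i, y - y_i).

    Assumption: g is an isotropic 2D Gaussian whose variance sigma2[i]
    is the inverse of the significance metric gamma_i.
    """
    grid_xy = np.asarray(grid_xy, float)      # (P, 2) evaluation points
    src_xy = np.asarray(src_xy, float)        # (N, 2) effective sound source positions
    src_power = np.asarray(src_power, float)  # (N,)   power of each source
    sigma2 = np.asarray(sigma2, float)        # (N,)   position-estimate variances
    diff = grid_xy[:, None, :] - src_xy[None, :, :]        # (P, N, 2)
    d2 = np.sum(diff ** 2, axis=-1)                        # squared distances (P, N)
    g = np.exp(-0.5 * d2 / sigma2) / (2.0 * np.pi * sigma2)
    return g @ src_power                                   # (P,)
```

A reliable position estimate (small variance) then contributes a narrow, high peak to the SPD, whereas an unreliable estimate contributes a wide, flat one.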
(60) As can be seen in formula (3), function g can be described by a distribution function around the effective sound source positions given by the inputs 91 . . . 9N, where e.g., the significance metric is the inverse of the variance of a Gaussian distribution. If the estimate of a sound source position has a high reliability, the corresponding distribution will be rather narrow, whereas a more unreliable estimate would correspond to a high variance and would therefore yield a wide distribution, see for example,
(63) The spatial information estimator 22 of
(64) The sound scene center estimator 41 provides an estimate of the sound scene center. The output of the sound scene center estimator 41 is then provided as input to the microphone position/orientation calculator 44. The microphone position/orientation calculator 44 performs the actual estimation of the final position 15 and/or orientation 16 of the one or more virtual microphones according to the operating mode which characterizes the target application.
(65) Embodiments of the sound scene center estimator are now explained in more detail. In order to obtain the center of the sound scene, several possible concepts exist.
(66) According to a first concept of a first embodiment, the center of the sound scene is obtained by computing the center of gravity of the SPD $\Gamma(x,y,z)$. The value of $\Gamma(x,y,z)$ may be interpreted as the existing mass at point $(x,y,z)$ in space.
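The center of gravity may, e.g., be computed on a sampled SPD as in the following sketch. The sampling of the SPD at discrete points and the name `center_of_gravity` are assumptions made for illustration only.

```python
import numpy as np

def center_of_gravity(points, spd):
    """Power-weighted mean position; the SPD values act as point masses."""
    points = np.asarray(points, float)  # (P, D) sample positions, D = 2 or 3
    spd = np.asarray(spd, float)        # (P,)   SPD value at each position
    total = spd.sum()
    if total <= 0:
        raise ValueError("SPD contains no power")
    return (points * spd[:, None]).sum(axis=0) / total
```

For two points at (0, 0) and (2, 0) with SPD values 1 and 3, the estimate lies at (1.5, 0), i.e., closer to the stronger point.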
(67) According to a second concept of a second embodiment, the position in space with a minimum time dispersion of the channel shall be found. This is achieved by considering the root mean squared (RMS) delay spread. At first, for each point in space $p=(x_0, y_0)$, a power delay profile (PDP) $A_p(\tau)$ is computed based on the SPD $\Gamma(x, y, z)$, for instance using
(68)
where $\tau = \sqrt{(x-x_0)^2+(y-y_0)^2}/c$
(69) From $A_p(\tau)$, the RMS delay is then calculated using the following equation:
(70)
where
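The second concept may, e.g., be sketched as follows. Since the equations are not reproduced above, the sketch assumes the common definition of the RMS delay spread (the power-weighted standard deviation of the delay); the function names, the discrete sampling of the SPD and the speed of sound of 343 m/s are likewise assumptions.

```python
import numpy as np

def rms_delay(candidate, points, spd, c=343.0):
    """RMS delay spread seen from `candidate` (assumed standard definition)."""
    tau = np.linalg.norm(points - candidate, axis=1) / c  # delay of each SPD sample
    w = spd / spd.sum()                                   # normalized power delay profile
    mean_tau = np.sum(w * tau)                            # mean delay
    return np.sqrt(np.sum(w * (tau - mean_tau) ** 2))

def center_by_min_rms_delay(candidates, points, spd, c=343.0):
    """Return the candidate position with the minimum RMS delay."""
    delays = [rms_delay(p, points, spd, c) for p in candidates]
    return candidates[int(np.argmin(delays))]
```

The candidate positions may, for instance, be the grid points on which the SPD itself is sampled.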
(71) According to a third concept of a third embodiment, which may be employed as an alternative to sound scene center estimation, a circle-integration is proposed. For example, in the 2D case, the SPD $\Gamma(x, y)$ is convolved with a circle $C_{(r,o)}$, according to the following formula:
$$g(x,y) = \Gamma(x,y) * C_{(r,o)}(x,y),$$
wherein $r$ is the radius of the circle, and wherein $o$ defines the center of the circle. The radius $r$ may either be constant or may vary depending on the power value in the point $(x,y)$. For example, high power in the point $(x,y)$ may correspond to a large radius, whereas low power may correspond to a small radius. Additional dependencies on the power may also be possible. One such example would be to convolve the circle with a bivariate Gaussian function before using it for constructing function $g(x, y)$. According to such an embodiment, the covariance matrix of the bivariate Gaussian function becomes dependent on the power in the position $(x,y)$, i.e., high power corresponds to low variance, whereas low power corresponds to high variance.
(72) Once g(x, y) is computed, the center of the sound scene may be determined according to the following formula:
(73)
(74) In further embodiments, this concept is extended to 3D by employing a 3D convolution of $\Gamma(x, y, z)$ with a sphere, analogously.
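The circle-integration may, e.g., be sketched for the 2D case as follows. The sketch rests on several assumptions: the circle is treated as a filled disk of constant radius (given in grid cells), the SPD is sampled on a regular grid, and the center is taken as the position where the convolved function is largest. All names are hypothetical.

```python
import numpy as np

def disk_kernel(radius_cells):
    """Binary disk of the given radius, in grid cells."""
    r = int(radius_cells)
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (xx ** 2 + yy ** 2 <= r ** 2).astype(float)

def convolve_same(spd, kernel):
    """Zero-padded 'same'-size convolution; the disk is symmetric, so
    correlation and convolution coincide."""
    kr = kernel.shape[0] // 2
    padded = np.pad(spd, kr)
    out = np.zeros(spd.shape, dtype=float)
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            if kernel[di, dj]:
                out += kernel[di, dj] * padded[di:di + spd.shape[0],
                                               dj:dj + spd.shape[1]]
    return out

def sound_scene_center(spd, radius_cells):
    """Grid index at which the disk-integrated SPD is largest (assumed criterion)."""
    g = convolve_same(np.asarray(spd, float), disk_kernel(radius_cells))
    return tuple(int(i) for i in np.unravel_index(np.argmax(g), g.shape))
```

With this criterion, a cluster of moderately strong sources may outweigh a single isolated peak, which a plain maximum search on the SPD would not achieve.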
(76) Different concepts may be applied for calculating a microphone position, e.g.:
(77) optimization based on projected energy width,
(78) optimization based on principal component analysis.
(79) It may, for illustrative purposes be assumed, that the position of the microphone is computed according to the application scenario of
(80) The concepts for estimating the position of the virtual microphones according to embodiments, which were previously enumerated, will now be described in more detail in the following.
(81) The optimization based on projected energy width defines a set of M equally spaced lines which pass through the center of the sound scene. For each of these lines, in e.g. a 2D scenario, the SPD $\Gamma(x,y)$ is orthogonally projected onto the line and summed up.
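The optimization based on projected energy width may, e.g., be sketched in 2D as follows. The histogram-based projection, the relative threshold defining the energy width, and the distance rule `0.5 * width / tan(opening_angle / 2)` are assumptions chosen for illustration; the function names are hypothetical.

```python
import numpy as np

def broadest_line(points, spd, center, n_lines=36, n_bins=64, rel_threshold=0.1):
    """Return (angle, width) of the line through `center` with the largest energy width."""
    rel = np.asarray(points, float) - np.asarray(center, float)
    spd = np.asarray(spd, float)
    half_len = np.linalg.norm(rel, axis=1).max()
    edges = np.linspace(-half_len, half_len, n_bins + 1)
    best_angle, best_width = 0.0, -1.0
    for theta in np.arange(n_lines) * np.pi / n_lines:    # M equally spaced lines
        u = np.array([np.cos(theta), np.sin(theta)])
        t = rel @ u                                       # orthogonal projection onto the line
        profile, _ = np.histogram(t, bins=edges, weights=spd)  # summed projected power
        strong = np.nonzero(profile >= rel_threshold * profile.max())[0]
        width = edges[strong[-1] + 1] - edges[strong[0]]  # outermost bins above threshold
        if width > best_width:
            best_angle, best_width = theta, width
    return best_angle, best_width

def microphone_position(center, angle, width, opening_angle):
    """Place the microphone on the normal of the broadest line (assumed geometry)."""
    normal = np.array([-np.sin(angle), np.cos(angle)])
    distance = 0.5 * width / np.tan(0.5 * opening_angle)
    return np.asarray(center, float) + distance * normal
```

The line through the center and the resulting microphone position is orthogonal to the broadest-width line, so that the microphone faces the widest extent of the sound scene.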
(83) The distance at which the VM is positioned may be computed, for example, based on geometric considerations together with the opening angle of the virtual microphone. This is illustrated by
(84) According to another embodiment, the described optimization concept based on projected energy may be extended to 3D. In this case, M.sup.2 equally spaced planes (in azimuthal and elevation direction) are defined instead of M lines. The width then corresponds to the diameter of the circle which comprises the largest part of the projected energy. The final position is obtained by placing the VM on the normal to the plane surface of the largest circle diameter. According to an embodiment, the distance from the center of the sound scene to the VM position may be computed again, similarly as in the 2D case, that is using geometric considerations and the opening angle specified by the operating mode.
(85) According to another embodiment, optimization based on a principal component analysis is employed. The optimization based on a principal component analysis-like processing directly uses the information available from the SPD. At first, the SPD $\Gamma(x,y,z)$ is quantized and a threshold-selective filter is applied to the quantized data set. By this, all points which have energy levels smaller than a certain threshold are discarded. Afterwards, the remaining points $h_i=[h_{x,i}, h_{y,i}, h_{z,i}]^T$ are mean-centered (i.e., the mean-centered points represent the coordinates of the i-th effective source minus the coordinates of the sound scene center), and are then reorganized in a data matrix $H$ as follows:
(86) $$H = \begin{bmatrix} h_{x,1} & h_{x,2} & \cdots & h_{x,N} \\ h_{y,1} & h_{y,2} & \cdots & h_{y,N} \\ h_{z,1} & h_{z,2} & \cdots & h_{z,N} \end{bmatrix},$$
where $N$ defines the number of points after thresholding. Then, the singular value decomposition (SVD) is applied to $H$, such that it is factorized into the following product:
$$H = U \cdot \Sigma \cdot V^T.$$
(87) The first column of $U$ represents the principal component, which has the highest variability of the data set. The second column of $U$ is orthogonal to the first and represents the direction on which we want to place the VM. The width is implicitly given by the first singular value in the matrix $\Sigma$. Knowing the width, as well as the direction, we can compute the position and orientation of the VM as in the optimization method based on projected energy width, explained above with reference to
(88) In another embodiment, these methods are applied to a 2D problem, which is straightforward, as one merely needs to ignore/remove the z axis component from the equations and considerations.
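The principal-component-based processing may, e.g., be sketched for the 2D case as follows, using `numpy.linalg.svd`. The function name and the discrete sampling of the SPD are assumptions made for illustration.

```python
import numpy as np

def pca_directions(points, spd, center, threshold):
    """Threshold the SPD samples, center them and factorize H = U * S * V^T.

    Returns the principal direction, the direction orthogonal to it
    (on which the VM is placed) and the first singular value.
    """
    points = np.asarray(points, float)            # (P, 2) sample positions
    spd = np.asarray(spd, float)                  # (P,)   SPD values
    keep = spd >= threshold                       # discard low-energy points
    H = (points[keep] - np.asarray(center, float)).T   # (2, N), one column per point
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, 0], U[:, 1], s[0]
```

Note that the sign of each column of `U` is arbitrary, so the second direction only fixes the line on which the VM is placed, not the side of the sound scene.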
(89) For other applications, such as the application scenario of
(91) In the following, orientation estimation will be described. The optimization approaches based on projected energy width as well as on principal component analysis compute the orientation of the virtual microphone 15 implicitly, since the virtual microphone is assumed to be oriented towards the center of the sound scene.
(92) For some other application scenarios, however, it may be suitable to calculate the orientation explicitly, for example, in an application scenario, wherein the optimal orientation of the virtual microphone is estimated, wherein the virtual microphone is located at a fixed position. In this case, the orientation should be determined, such that the virtual microphone picks up most of the energy in the sound scene.
(93) According to an embodiment, to determine the orientation of a virtual microphone, at first, the possible directions $\varphi$ are sampled and integration over the energy on each of these directions is performed. The following function of $\varphi$ is obtained:
$f(\varphi)=\int_{0}^{r_{\max}}$
where $r_{\max}$ is defined as the maximum distance from the VM and controls the VM's pick-up pattern. Then, the final orientation $\phi$ of the VM is computed as:
(94)
where $w_{\phi}(\varphi)$ is a weighting function based on the input characteristics of the VM. E.g., $w_{\phi}(\varphi)$ may be the function which defines how the energy coming from direction $\varphi$ is scaled given a certain viewing direction $\phi$ and a specific pick-up pattern of the VM.
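The orientation estimation may, e.g., be sketched in 2D as follows. Because the equations are not reproduced above, the sketch makes several assumptions: the energy per direction is accumulated from SPD samples within $r_{\max}$ into angular bins, the weighting is a cardioid pattern, and the orientation is the viewing direction with the largest weighted energy. All names are hypothetical.

```python
import numpy as np

def best_orientation(points, spd, mic_pos, r_max, n_dirs=72):
    """Viewing direction (radians) that picks up the most weighted energy."""
    rel = np.asarray(points, float) - np.asarray(mic_pos, float)
    spd = np.asarray(spd, float)
    dist = np.linalg.norm(rel, axis=1)
    inside = (dist > 0) & (dist <= r_max)                 # only samples within r_max
    angles = np.arctan2(rel[inside, 1], rel[inside, 0])
    step = 2.0 * np.pi / n_dirs
    dirs = -np.pi + step * np.arange(n_dirs)              # sampled directions
    bins = np.round((angles + np.pi) / step).astype(int) % n_dirs
    f = np.bincount(bins, weights=spd[inside], minlength=n_dirs)  # energy per direction
    # Assumed cardioid weighting: w[look, phi] = 0.5 * (1 + cos(phi - look))
    w = 0.5 * (1.0 + np.cos(dirs[None, :] - dirs[:, None]))
    scores = w @ f
    return dirs[int(np.argmax(scores))]
```

Other pick-up patterns may be modeled by replacing the weighting matrix `w` accordingly.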
(95) In the following, an apparatus for generating an audio output signal to simulate a recording of a virtual microphone at a configurable virtual position in an environment is explained. An apparatus for microphone positioning according to one of the above described embodiments may be employed to determine the virtual position for the apparatus for generating the audio output signal.
(98) In embodiments, the sound event localization in space, as well as describing the position of the virtual microphone may be conducted based on the positions and orientations of the real and virtual spatial microphones in a common coordinate system. This information may be represented by the inputs 121 . . . 12N and input 104 in
(99) The output of the apparatus or a corresponding method may be, when desired, one or more sound signals 105, which may have been picked up by a spatial microphone defined and placed as specified by 104. Moreover, the apparatus (or rather the method) may provide as output corresponding spatial side information 106 which may be estimated by employing the virtual spatial microphone.
(101) In the following, position estimation of a sound events position estimator according to an embodiment is described in more detail.
(102) Depending on the dimensionality of the problem (2D or 3D) and the number of spatial microphones, several solutions for the position estimation are possible.
(103) If two spatial microphones exist in 2D (the simplest possible case), a simple triangulation is possible.
(104) [13] R. Roy, A. Paulraj, and T. Kailath, Direction-of-arrival estimation by subspace rotation methods - ESPRIT, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Stanford, Calif., USA, April 1986, or (root) MUSIC, see
(105) [14] R. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986
(106) to the pressure signals transformed into the time-frequency domain.
(107) In
(108) The triangulation fails when the two lines 430, 440 are exactly parallel. In real applications, however, this is very unlikely. Nevertheless, not all triangulation results correspond to a physical or feasible position for the sound event in the considered space. For example, the estimated position of the sound event might be too far away or even outside the assumed space, indicating that the DOAs probably do not correspond to any sound event which can be physically interpreted with the used model. Such results may be caused by sensor noise or excessively strong room reverberation. Therefore, according to an embodiment, such undesired results are flagged such that the information computation module 202 can treat them properly.
(110) Similarly to the 2D case, the triangulation may fail or may yield unfeasible results for certain combinations of directions, which may then also be flagged, e.g. to the information computation module 202 of
(111) If more than two spatial microphones exist, several solutions are possible. For example, the triangulation explained above, could be carried out for all pairs of the real spatial microphones (if N=3, 1 with 2, 1 with 3, and 2 with 3). The resulting positions may then be averaged (along x and y, and, if 3D is considered, z).
(112) Alternatively, more complex concepts may be used. For example, probabilistic approaches may be applied as described in
(113) [15] J. Michael Steele, Optimal Triangulation of Random Samples in the Plane, The Annals of Probability, Vol. 10, No. 3 (August, 1982), pp. 548-553.
(114) According to an embodiment, the sound field may be analyzed in the time-frequency domain, for example, obtained via a short-time Fourier transform (STFT), in which $k$ and $n$ denote the frequency index and time index, respectively. The complex pressure $P_v(k, n)$ at an arbitrary position $p_v$ for a certain $k$ and $n$ is modeled as a single spherical wave emitted by a narrow-band isotropic point-like source, e.g. by employing the formula:
$$P_v(k,n) = P_{IPLS}(k,n) \cdot \gamma(k, p_{IPLS}(k,n), p_v), \quad (1)$$
where $P_{IPLS}(k, n)$ is the signal emitted by the IPLS at its position $p_{IPLS}(k, n)$. The complex factor $\gamma(k, p_{IPLS}, p_v)$ expresses the propagation from $p_{IPLS}(k, n)$ to $p_v$, e.g., it introduces appropriate phase and magnitude modifications. Here, the assumption may be applied that in each time-frequency bin only one IPLS is active. Nevertheless, multiple narrow-band IPLSs located at different positions may also be active at a single time instance.
(115) Each IPLS either models direct sound or a distinct room reflection. Its position $p_{IPLS}(k, n)$ may ideally correspond to an actual sound source located inside the room, or a mirror image sound source located outside, respectively. Therefore, the position $p_{IPLS}(k, n)$ may also indicate the position of a sound event.
(116) Please note that the term real sound sources denotes the actual sound sources physically existing in the recording environment, such as talkers or musical instruments. On the contrary, with sound sources or sound events or IPLS we refer to effective sound sources, which are active at certain time instants or at certain time-frequency bins, wherein the sound sources may, for example, represent real sound sources or mirror image sources.
(120) Both the actual sound source 153 of
(122) This single-wave model is accurate only for mildly reverberant environments, given that the source signals fulfill the W-disjoint orthogonality (WDO) condition, i.e. the time-frequency overlap is sufficiently small. This is normally true for speech signals, see, for example,
(123) [12] S. Rickard and Z. Yilmaz, On the approximate W-disjoint orthogonality of speech, in Acoustics, Speech and Signal Processing, 2002. ICASSP 2002. IEEE International Conference on, April 2002, vol. 1.
(124) However, the model also provides a good estimate for other environments and is therefore also applicable for those environments.
(125) In the following, the estimation of the positions $p_{IPLS}(k, n)$ according to an embodiment is explained. The position $p_{IPLS}(k, n)$ of an active IPLS in a certain time-frequency bin, and thus the position of a sound event in that time-frequency bin, is estimated via triangulation on the basis of the direction of arrival (DOA) of sound measured in at least two different observation points.
(126)
(127)
Here, $\varphi_1(k, n)$ represents the azimuth of the DOA estimated at the first microphone array, as depicted in
$$e_1(k,n) = R_1 \cdot e_1^{POV}(k,n),$$
$$e_2(k,n) = R_2 \cdot e_2^{POV}(k,n), \quad (3)$$
where $R$ are coordinate transformation matrices, e.g.,
(128) $$R_1 = \begin{bmatrix} c_{1,x} & -c_{1,y} \\ c_{1,y} & c_{1,x} \end{bmatrix}, \quad (4)$$
when operating in 2D and $c_1=[c_{1,x}, c_{1,y}]^T$. For carrying out the triangulation, the direction vectors $\mathbf{d}_1(k, n)$ and $\mathbf{d}_2(k, n)$ may be calculated as:
$$\mathbf{d}_1(k,n) = d_1(k,n)\, e_1(k,n),$$
$$\mathbf{d}_2(k,n) = d_2(k,n)\, e_2(k,n), \quad (5)$$
where $d_1(k, n)=\|\mathbf{d}_1(k, n)\|$ and $d_2(k,n)=\|\mathbf{d}_2(k, n)\|$ are the unknown distances between the IPLS and the two microphone arrays. The following equation
$$p_1 + \mathbf{d}_1(k,n) = p_2 + \mathbf{d}_2(k,n) \quad (6)$$
may be solved for $d_1(k, n)$. Finally, the position $p_{IPLS}(k, n)$ of the IPLS is given by
$$p_{IPLS}(k,n) = d_1(k,n)\, e_1(k,n) + p_1. \quad (7)$$
(129) In another embodiment, equation (6) may be solved for $d_2(k, n)$ and $p_{IPLS}(k, n)$ is analogously computed employing $d_2(k, n)$.
(130) Equation (6) provides a solution when operating in 2D, unless $e_1(k, n)$ and $e_2(k, n)$ are parallel. However, when using more than two microphone arrays or when operating in 3D, a solution cannot be obtained when the direction vectors $\mathbf{d}$ do not intersect. According to an embodiment, in this case, the point which is closest to all direction vectors $\mathbf{d}$ is computed and the result can be used as the position of the IPLS.
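The triangulation of equations (6) and (7), as well as the closest-point computation for non-intersecting direction vectors, may, e.g., be sketched as follows. The least-squares formulation of the closest point and the function names are assumptions made for illustration.

```python
import numpy as np

def triangulate_2d(p1, e1, p2, e2, eps=1e-9):
    """Solve p1 + d1*e1 = p2 + d2*e2 for d1, d2 and return p1 + d1*e1.

    Returns None when the DOA unit vectors are (nearly) parallel.
    """
    p1, e1 = np.asarray(p1, float), np.asarray(e1, float)
    p2, e2 = np.asarray(p2, float), np.asarray(e2, float)
    A = np.column_stack([e1, -e2])          # unknowns: [d1, d2]
    if abs(np.linalg.det(A)) < eps:
        return None                         # parallel lines: flag as failed
    d1, d2 = np.linalg.solve(A, p2 - p1)
    return p1 + d1 * e1

def closest_point_to_lines(origins, directions):
    """Point minimizing the summed squared distance to all lines (2D or 3D)."""
    dim = len(origins[0])
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for p, e in zip(origins, directions):
        e = np.asarray(e, float) / np.linalg.norm(e)
        P = np.eye(dim) - np.outer(e, e)    # projector orthogonal to the line
        A += P
        b += P @ np.asarray(p, float)
    return np.linalg.solve(A, b)
```

A negative distance `d1` or `d2` would indicate a position behind an array; such results may be flagged in the same way as positions outside the assumed space.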
(131) In an embodiment, all observation points $p_1, p_2, \ldots$ should be located such that the sound emitted by the IPLS falls into the same temporal block $n$. This requirement may simply be fulfilled when the distance $\Delta$ between any two of the observation points is smaller than
(132) $$\Delta_{\max} = c\, \frac{n_{FFT}\,(1-R)}{f_s}, \quad (8)$$
where $n_{FFT}$ is the STFT window length, $0 \le R < 1$ specifies the overlap between successive time frames and $f_s$ is the sampling frequency. For example, for a 1024-point STFT at 48 kHz with 50% overlap ($R=0.5$), the maximum spacing between the arrays to fulfill the above requirement is $\Delta=3.65$ m.
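The numerical example may, e.g., be checked with the following sketch. It assumes that the bound equals the distance travelled by sound during one STFT hop, i.e., $c \cdot n_{FFT}(1-R)/f_s$, and a speed of sound of 343 m/s; the function name is hypothetical.

```python
def max_array_spacing(n_fft, overlap, fs, c=343.0):
    """Distance (m) travelled by sound during one STFT hop of n_fft*(1-overlap) samples."""
    return c * n_fft * (1.0 - overlap) / fs

spacing = max_array_spacing(n_fft=1024, overlap=0.5, fs=48000.0)
```

With these values the spacing evaluates to roughly 3.66 m, which agrees with the figure stated above up to the assumed speed of sound.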
(133) In the following, an information computation module 202, e.g. a virtual microphone signal and side information computation module, according to an embodiment is described in more detail.
(136) To compute the audio signal of the virtual microphone, the geometrical information, e.g. the position and orientation of the real spatial microphones 121 . . . 12N, the position, orientation and characteristics of the virtual spatial microphone 104, and the position estimates of the sound events 205 are fed into the information computation module 202, in particular, into the propagation parameters computation module 501 of the propagation compensator 500, into the combination factors computation module 502 of the combiner 510 and into the spectral weights computation unit 503 of the spectral weighting unit 520. The propagation parameters computation module 501, the combination factors computation module 502 and the spectral weights computation unit 503 compute the parameters used in the modification of the audio signals 111 . . . 11N in the propagation compensation module 504, the combination module 505 and the spectral weighting application module 506.
(137) In the information computation module 202, the audio signals 111 . . . 11N may at first be modified to compensate for the effects given by the different propagation lengths between the sound event positions and the real spatial microphones. The signals may then be combined to improve for instance the signal-to-noise ratio (SNR). Finally, the resulting signal may then be spectrally weighted to take the directional pick up pattern of the virtual microphone into account, as well as any distance dependent gain function. These three steps are discussed in more detail below.
(138) Propagation compensation is now explained in more detail. In the upper portion of
(139) The lower portion of
(140) The signals at the two real arrays are comparable only if the relative delay Dt12 between them is small. Otherwise, one of the two signals needs to be temporally realigned to compensate the relative delay Dt12, and possibly, to be scaled to compensate for the different decays.
(141) Compensating the delay between the arrival at the virtual microphone and the arrival at the real microphone arrays (at one of the real spatial microphones) changes the delay independently of the localization of the sound event, making this compensation superfluous for most applications.
(142) Returning to
(143) The propagation compensation module 504 is configured to use this information to modify the audio signals accordingly. If the signals are to be shifted by a small amount of time (compared to the time window of the filter bank), then a simple phase rotation suffices. If the delays are larger, more complicated implementations may be used.
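The phase rotation mentioned for small time shifts can be sketched as follows. This is a minimal illustration, assuming a uniform STFT in which bin k has center frequency k·f.sub.s/n.sub.FFT; the function name and parameters are hypothetical and do not describe the implementation of module 504.

```python
import cmath
import math


def phase_rotate(frame, delay_s, fs_hz, n_fft):
    """Delay one STFT frame by delay_s seconds using a per-bin phase rotation.

    frame[k] is the complex coefficient of frequency bin k, counted from 0.
    Only reasonable when delay_s is small compared to the analysis window.
    """
    shifted = []
    for k, coeff in enumerate(frame):
        freq_hz = k * fs_hz / n_fft  # assumed center frequency of bin k
        shifted.append(coeff * cmath.exp(-2j * math.pi * freq_hz * delay_s))
    return shifted


# Delay a flat spectrum by one sample at 8 kHz with an 8-point transform.
rotated = phase_rotate([1 + 0j] * 5, delay_s=1 / 8000, fs_hz=8000, n_fft=8)
```

The magnitudes are left untouched; only the phase of each bin changes, which is why this approach is limited to shifts that are short relative to the window.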
(144) The outputs of the propagation compensation module 504 are the modified audio signals expressed in the original time-frequency domain.
(145) In the following, a particular estimation of propagation compensation for a virtual microphone according to an embodiment will be described with reference to
(146) In the embodiment that is now explained, it is assumed that at least a first recorded audio input signal, e.g. a pressure signal of at least one of the real spatial microphones (e.g. the microphone arrays), is available, for example, the pressure signal of a first real spatial microphone. We will refer to the considered microphone as the reference microphone, to its position as the reference position p.sub.ref and to its pressure signal as the reference pressure signal P.sub.ref(k, n). However, propagation compensation may be conducted not only with respect to a single pressure signal, but also with respect to the pressure signals of a plurality or of all of the real spatial microphones.
(147) The relationship between the pressure signal P.sub.IPLS(k, n) emitted by the IPLS and a reference pressure signal P.sub.ref(k, n) of a reference microphone located in p.sub.ref can be expressed by formula (9):
P_{\mathrm{ref}}(k,n) = P_{\mathrm{IPLS}}(k,n) \cdot \gamma(k, \mathbf{p}_{\mathrm{IPLS}}, \mathbf{p}_{\mathrm{ref}}), \qquad (9)
(148) In general, the complex factor γ(k, p.sub.a, p.sub.b) expresses the phase rotation and amplitude decay introduced by the propagation of a spherical wave from its origin in p.sub.a to p.sub.b. However, practical tests indicated that considering only the amplitude decay in γ leads to plausible impressions of the virtual microphone signal with significantly fewer artifacts compared to also considering the phase rotation.
(149) The sound energy which can be measured in a certain point in space depends strongly on the distance r from the sound source, in
(150) Assuming that the first real spatial microphone is the reference microphone, then p.sub.ref=p.sub.1. In
s(k,n) = \lVert \mathbf{s}(k,n) \rVert = \lVert \mathbf{p}_1 + \mathbf{d}_1(k,n) - \mathbf{p}_v \rVert. \qquad (10)
(151) The sound pressure P.sub.v(k, n) at the position of the virtual microphone is computed by combining formulas (1) and (9), leading to
(152)
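The formula referred to in this paragraph is not reproduced in the present text. A plausible reconstruction is sketched below under the assumption that formula (1) models the pressure at the virtual microphone in the same way that formula (9) models the reference pressure, with γ denoting the complex propagation factor of formula (9). Dividing the two relations cancels the unknown source signal P.sub.IPLS(k, n):

```latex
% Assumed form of formula (1): pressure at the virtual microphone position p_v
P_v(k,n) = P_{\mathrm{IPLS}}(k,n) \cdot \gamma(k, \mathbf{p}_{\mathrm{IPLS}}, \mathbf{p}_v)

% Formula (9): pressure at the reference microphone position p_ref
P_{\mathrm{ref}}(k,n) = P_{\mathrm{IPLS}}(k,n) \cdot \gamma(k, \mathbf{p}_{\mathrm{IPLS}}, \mathbf{p}_{\mathrm{ref}})

% Eliminating P_IPLS gives the presumed formula (11)
P_v(k,n) = P_{\mathrm{ref}}(k,n) \cdot
  \frac{\gamma(k, \mathbf{p}_{\mathrm{IPLS}}, \mathbf{p}_v)}
       {\gamma(k, \mathbf{p}_{\mathrm{IPLS}}, \mathbf{p}_{\mathrm{ref}})}
```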
(153) As mentioned above, in some embodiments, the factors γ may only consider the amplitude decay due to the propagation. Assuming for instance that the sound pressure decreases with 1/r, then
(154)
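Under the 1/r assumption, the ratio of the two propagation factors reduces to the ratio of the two distances, so the reference pressure would be scaled by d.sub.1/s, where d.sub.1 is the distance from the reference microphone to the sound event and s is the distance of formula (10). The following sketch illustrates this amplitude-only compensation; it is a reconstruction under that assumption, and the function and argument names are hypothetical.

```python
import math


def compensate_amplitude(p_ref, ref_pos, source_pos, virtual_pos):
    """Scale the reference pressure to the virtual microphone position.

    Assumes a pure 1/r amplitude decay and ignores the phase rotation.
    Positions are coordinate tuples in a common coordinate system.
    """
    d1 = math.dist(ref_pos, source_pos)     # reference microphone to sound event
    s = math.dist(virtual_pos, source_pos)  # virtual microphone to sound event
    if s == 0.0:
        raise ValueError("virtual microphone coincides with the sound event")
    return p_ref * (d1 / s)


# Reference microphone 5 m from the event, virtual microphone 2 m from it.
p_v = compensate_amplitude(2 + 0j, ref_pos=(0.0, 0.0),
                           source_pos=(3.0, 4.0), virtual_pos=(3.0, 2.0))
```

Moving the virtual microphone closer to the sound event than the reference microphone raises the level, and moving it away lowers the level.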
(155) When the model in formula (1) holds, e.g., when only direct sound is present, then formula (12) can accurately reconstruct the magnitude information. However, in the case of purely diffuse sound fields, e.g., when the model assumptions are not met, the presented method yields an implicit dereverberation of the signal when moving the virtual microphone away from the positions of the sensor arrays. In fact, as discussed above, in diffuse sound fields, we expect that most IPLS are localized near the two sensor arrays. Thus, when moving the virtual microphone away from these positions, we likely increase the distance s=‖s‖ in
(156) By conducting propagation compensation on the recorded audio input signal (e.g. the pressure signal) of the first real spatial microphone, a first modified audio signal is obtained.
(157) In embodiments, a second modified audio signal may be obtained by conducting propagation compensation on a recorded second audio input signal (second pressure signal) of the second real spatial microphone.
(158) In other embodiments, further audio signals may be obtained by conducting propagation compensation on recorded further audio input signals (further pressure signals) of further real spatial microphones.
(159) Now, combining in blocks 502 and 505 in
(160) Possible solutions for the combination comprise:
Weighted averaging, e.g., considering SNR, or the distance to the virtual microphone, or the diffuseness which was estimated by the real spatial microphones. Traditional solutions, for example, Maximum Ratio Combining (MRC) or Equal Gain Combining (EQC) may be employed, or
Linear combination of some or all of the modified audio signals to obtain a combination signal. The modified audio signals may be weighted in the linear combination to obtain the combination signal, or
Selection, e.g., only one signal is used, for example, dependent on SNR or distance or diffuseness.
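Two of the listed options, weighted averaging and selection, can be sketched for a single time-frequency bin as follows. The weights and scores stand in for whatever quality measure is chosen (SNR, distance or diffuseness); the function names are hypothetical and the sketch does not describe module 505.

```python
def combine_weighted(signals, weights):
    """Weighted average of the propagation-compensated signals of one bin."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * x for w, x in zip(weights, signals)) / total


def combine_select(signals, scores):
    """Selection: keep only the signal with the highest score."""
    best = max(range(len(signals)), key=lambda i: scores[i])
    return signals[best]


bin_values = [1 + 0j, 3 + 0j]
averaged = combine_weighted(bin_values, weights=[1.0, 3.0])
selected = combine_select(bin_values, scores=[0.2, 0.9])
```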
(161) The task of module 502 is, if applicable, to compute parameters for the combining, which is carried out in module 505.
(162) Now, spectral weighting according to embodiments is described in more detail. For this, reference is made to blocks 503 and 506 of
(163) For each time-frequency bin the geometrical reconstruction allows us to easily obtain the DOA relative to the virtual microphone, as shown in
(164) The weight for the time-frequency bin is then computed considering the type of virtual microphone desired.
(165) In case of directional microphones, the spectral weights may be computed according to a predefined pick-up pattern. For example, according to an embodiment, a cardioid microphone may have a pick up pattern defined by the function g(theta),
g(theta)=0.5+0.5 cos(theta),
where theta is the angle between the look direction of the virtual spatial microphone and the DOA of the sound from the point of view of the virtual microphone.
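As a brief illustration, the cardioid pick-up pattern g(theta) can be evaluated per time-frequency bin as below. The function name is hypothetical, and theta is taken in radians.

```python
import math


def cardioid_weight(theta_rad):
    """Cardioid pick-up pattern g(theta) = 0.5 + 0.5 cos(theta)."""
    return 0.5 + 0.5 * math.cos(theta_rad)


front = cardioid_weight(0.0)         # sound arriving from the look direction
side = cardioid_weight(math.pi / 2)  # sound arriving from the side
rear = cardioid_weight(math.pi)      # sound arriving from behind
```

Sound from the look direction passes unattenuated, sound from the side is halved in amplitude, and sound from behind is suppressed.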
(166) Another possibility is artistic (non physical) decay functions. In certain applications, it may be desired to suppress sound events far away from the virtual microphone with a factor greater than the one characterizing free-field propagation. For this purpose, some embodiments introduce an additional weighting function which depends on the distance between the virtual microphone and the sound event. In an embodiment, only sound events within a certain distance (e.g. in meters) from the virtual microphone should be picked up.
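A distance-dependent weighting of this kind might look as follows. The hard cutoff corresponds to picking up only sound events within a certain distance; the optional linear fade-out and all names are illustrative assumptions that do not appear in the text.

```python
def distance_weight(distance_m, max_distance_m, rolloff_m=0.0):
    """Weight for a sound event at distance_m from the virtual microphone.

    Returns 1 inside max_distance_m and 0 outside. If rolloff_m > 0
    (a hypothetical extension), the weight fades linearly to 0 over
    rolloff_m metres beyond max_distance_m instead of dropping abruptly.
    """
    if distance_m <= max_distance_m:
        return 1.0
    if rolloff_m > 0.0 and distance_m < max_distance_m + rolloff_m:
        return 1.0 - (distance_m - max_distance_m) / rolloff_m
    return 0.0


near = distance_weight(1.0, max_distance_m=2.0)
far = distance_weight(3.0, max_distance_m=2.0)
fading = distance_weight(2.5, max_distance_m=2.0, rolloff_m=1.0)
```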
(167) With respect to virtual microphone directivity, arbitrary directivity patterns can be applied for the virtual microphone. In doing so, one can for instance separate a source from a complex sound scene.
(168) Since the DOA of the sound can be computed in the position p.sub.v of the virtual microphone, namely
(169)
where c.sub.v is a unit vector describing the orientation of the virtual microphone, arbitrary directivities for the virtual microphone can be realized. For example, assuming that P.sub.v(k,n) indicates the combination signal or the propagation-compensated modified audio signal, then the formula:
\tilde{P}_v(k,n) = P_v(k,n) \, [1 + \cos(\varphi_v(k,n))] \qquad (14)
calculates the output of a virtual microphone with cardioid directivity. The directional patterns, which can potentially be generated in this way, depend on the accuracy of the position estimation.
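The directivity step can be sketched as below. The formula for the DOA at the virtual microphone is not reproduced in the present text; the sketch assumes it is the angle between the vector s(k, n), pointing from the virtual microphone to the sound event, and the unit look direction c.sub.v. The function names are hypothetical.

```python
import math


def doa_angle(s_vec, c_v):
    """Angle between s (virtual microphone to sound event) and unit vector c_v.

    Assumed form: arccos(s . c_v / ||s||), with c_v of unit length.
    """
    norm_s = math.hypot(*s_vec)
    if norm_s == 0.0:
        raise ValueError("sound event coincides with the virtual microphone")
    cos_phi = sum(a * b for a, b in zip(s_vec, c_v)) / norm_s
    return math.acos(max(-1.0, min(1.0, cos_phi)))  # clamp against rounding


def cardioid_output(p_v, s_vec, c_v):
    """Apply formula (14): P_v multiplied by [1 + cos(DOA angle)]."""
    return p_v * (1.0 + math.cos(doa_angle(s_vec, c_v)))


look = (1.0, 0.0)
from_front = cardioid_output(1 + 0j, (2.0, 0.0), look)
from_side = cardioid_output(1 + 0j, (0.0, 3.0), look)
from_rear = cardioid_output(1 + 0j, (-2.0, 0.0), look)
```

Because the angle is derived from the estimated event position, errors in the position estimate translate directly into errors of the realized pattern.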
(170) In embodiments, one or more real, non-spatial microphones, for example, an omnidirectional microphone or a directional microphone such as a cardioid, are placed in the sound scene in addition to the real spatial microphones to further improve the sound quality of the virtual microphone signals 105 in
(171) In a further embodiment, computation of the spatial side information of the virtual microphone is realized. To compute the spatial side information 106 of the microphone, the information computation module 202 of
(172) The output of the spatial side information computation module 507 is the side information of the virtual microphone 106. This side information can be, for instance, the DOA or the diffuseness of sound for each time-frequency bin (k, n) from the point of view of the virtual microphone. Another possible item of side information could, for instance, be the active sound intensity vector Ia(k, n) which would have been measured in the position of the virtual microphone. How these parameters can be derived will now be described.
(173) According to an embodiment, DOA estimation for the virtual spatial microphone is realized. The information computation module 120 is adapted to estimate the direction of arrival at the virtual microphone as spatial side information, based on a position vector of the virtual microphone and based on a position vector of the sound event as illustrated by
(174)
\mathbf{h}(k,n) = \mathbf{s}(k,n) - \mathbf{r}(k,n).
(175) The desired DOA a(k, n) can now be computed for each (k, n) for instance via the definition of the dot product of h(k, n) and v(k,n), namely
a(k,n) = \arccos\!\left( \frac{\mathbf{h}(k,n) \cdot \mathbf{v}(k,n)}{\lVert \mathbf{h}(k,n) \rVert \, \lVert \mathbf{v}(k,n) \rVert} \right).
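The dot-product computation can be sketched as follows, taking h(k, n) as the difference between the position vector s(k, n) of the sound event and the position vector r(k, n) of the virtual microphone, and v(k, n) as the reference direction against which the angle is measured. These interpretations and the function name are assumptions made for illustration.

```python
import math


def doa_at_virtual_mic(s_vec, r_vec, v_vec):
    """DOA a = arccos(h . v / (||h|| ||v||)) with h = s - r."""
    h = [si - ri for si, ri in zip(s_vec, r_vec)]
    norm = math.hypot(*h) * math.hypot(*v_vec)
    if norm == 0.0:
        raise ValueError("h and v must both be non-zero vectors")
    cos_a = sum(hi * vi for hi, vi in zip(h, v_vec)) / norm
    return math.acos(max(-1.0, min(1.0, cos_a)))  # clamp against rounding


# Sound event at (1, 1), virtual microphone at the origin, reference along x.
angle = doa_at_virtual_mic((1.0, 1.0), (0.0, 0.0), (1.0, 0.0))
```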
(176) In another embodiment, the information computation module 120 may be adapted to estimate the active sound intensity at the virtual microphone as spatial side information, based on a position vector of the virtual microphone and based on a position vector of the sound event as illustrated by
(177) From the DOA a(k, n) defined above, we can derive the active sound intensity Ia(k, n) at the position of the virtual microphone. For this, it is assumed that the virtual microphone audio signal 105 in
\mathbf{I}_a(k,n) = -\frac{1}{2\rho} \, |P_v(k,n)|^2 \, [\cos a(k,n), \; \sin a(k,n)]^T,
where [ ].sup.T denotes a transposed vector, rho is the air density, and P.sub.v (k, n) is the sound pressure measured by the virtual spatial microphone, e.g., the output 105 of block 506 in
(178) If the active intensity vector shall be computed expressed in the general coordinate system but still at the position of the virtual microphone, the following formula may be applied:
\mathbf{I}_a(k,n) = \frac{1}{2\rho} \, |P_v(k,n)|^2 \, \frac{\mathbf{h}(k,n)}{\lVert \mathbf{h}(k,n) \rVert}.
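Both intensity expressions can be sketched in two dimensions as below. The sketch reads the scale factor as 1/(2 rho), uses an air density of 1.2 kg/m^3 as a default, and keeps the signs of the two formulas as they are printed; all three choices, like the function names, are assumptions.

```python
import math


def active_intensity_local(p_v, doa_rad, rho=1.2):
    """Intensity in the virtual microphone's own coordinate system.

    Computes -(1 / (2 rho)) |P_v|^2 [cos a, sin a]; rho is an assumed air density.
    """
    scale = -abs(p_v) ** 2 / (2.0 * rho)
    return (scale * math.cos(doa_rad), scale * math.sin(doa_rad))


def active_intensity_general(p_v, h_vec, rho=1.2):
    """Intensity in the general coordinate system.

    Computes (1 / (2 rho)) |P_v|^2 h / ||h||, with the sign as printed.
    """
    norm_h = math.hypot(*h_vec)
    if norm_h == 0.0:
        raise ValueError("h must be a non-zero vector")
    scale = abs(p_v) ** 2 / (2.0 * rho)
    return tuple(scale * hi / norm_h for hi in h_vec)


local = active_intensity_local(2 + 0j, doa_rad=0.0, rho=1.0)
general = active_intensity_general(2 + 0j, (0.0, 3.0), rho=1.0)
```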
(179) The diffuseness of sound expresses how diffuse the sound field is in a given time-frequency slot (see, for example, [2]). Diffuseness is expressed by a value ψ, wherein 0≤ψ≤1. A diffuseness of 1 indicates that the total sound field energy of a sound field is completely diffuse. This information is important e.g. in the reproduction of spatial sound.
(180) Traditionally, diffuseness is computed at the specific point in space in which a microphone array is placed.
(181) According to an embodiment, the diffuseness may be computed as an additional parameter to the side information generated for the Virtual Microphone (VM), which can be placed at will at an arbitrary position in the sound scene. By this, an apparatus that also calculates the diffuseness besides the audio signal at a virtual position of a virtual microphone can be seen as a virtual DirAC front-end, as it is possible to produce a DirAC stream, namely an audio signal, direction of arrival, and diffuseness, for an arbitrary point in the sound scene. The DirAC stream may be further processed, stored, transmitted, and played back on an arbitrary multi-loudspeaker setup. In this case, the listener experiences the sound scene as if he or she were in the position specified by the virtual microphone and were looking in the direction determined by its orientation.
(182)
(183) A diffuseness computation unit 801 of an embodiment is illustrated in
(184) Let E.sub.dir.sup.(SM1) to E.sub.dir.sup.(SM N) and E.sub.diff.sup.(SM1) to E.sub.diff.sup.(SM N) denote the estimates of the energies of direct and diffuse sound for the N spatial microphones computed by energy analysis unit 810. If P.sub.i is the complex pressure signal and ψ.sub.i is the diffuseness for the i-th spatial microphone, then the energies may, for example, be computed according to the formulae:
E_{\mathrm{dir}}^{(\mathrm{SM}\,i)} = (1 - \psi_i) \cdot |P_i|^2
E_{\mathrm{diff}}^{(\mathrm{SM}\,i)} = \psi_i \cdot |P_i|^2
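The energy split for one spatial microphone and one time-frequency bin can be sketched as follows; the function name is hypothetical.

```python
def split_energy(pressure, diffuseness):
    """Split |P_i|^2 into direct and diffuse energy using the diffuseness."""
    if not 0.0 <= diffuseness <= 1.0:
        raise ValueError("diffuseness must lie between 0 and 1")
    power = abs(pressure) ** 2
    return (1.0 - diffuseness) * power, diffuseness * power


e_dir, e_diff = split_energy(3 + 4j, diffuseness=0.25)
```

The two parts always sum to the total energy of the bin.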
(185) The energy of diffuse sound should be equal in all positions, therefore, an estimate of the diffuse sound energy E.sub.diff.sup.(VM) at the virtual microphone can be computed simply by averaging E.sub.diff.sup.(SM1) to E.sub.diff.sup.(SM N), e.g. in a diffuseness combination unit 820, for example, according to the formula:
(186)
(187) A more effective combination of the estimates E.sub.diff.sup.(SM1) to E.sub.diff.sup.(SM N) could be carried out by considering the variance of the estimators, for instance, by considering the SNR.
(188) The energy of the direct sound depends on the distance to the source due to the propagation. Therefore, E.sub.dir.sup.(SM1) to E.sub.dir.sup.(SM N) may be modified to take this into account. This may be carried out, e.g., by a direct sound propagation adjustment unit 830. For example, if it is assumed that the energy of the direct sound field decays with 1 over the distance squared, then the estimate for the direct sound at the virtual microphone for the i-th spatial microphone may be calculated according to the formula:
(189)
(190) Similarly to the diffuseness combination unit 820, the estimates of the direct sound energy obtained at different spatial microphones can be combined, e.g. by a direct sound combination unit 840. The result is E.sub.dir.sup.(VM), e.g., the estimate for the direct sound energy at the virtual microphone. The diffuseness at the virtual microphone ψ.sup.(VM) may be computed, for example, by a diffuseness sub-calculator 850, e.g. according to the formula:
(191)
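The chain formed by units 820 to 850 can be sketched end to end as below. The formulas themselves are not reproduced in the present text, so the sketch rests on three assumptions: the diffuse energies are averaged with equal weights, each direct energy is scaled by the squared ratio of the microphone-to-event distance and the virtual-microphone-to-event distance before being averaged, and the diffuseness is the ratio of diffuse energy to total energy. All names are hypothetical.

```python
def virtual_mic_diffuseness(e_dir, e_diff, dist_mic_to_event, dist_vm_to_event):
    """Estimate the diffuseness at the virtual microphone for one bin.

    e_dir, e_diff: direct and diffuse energies at the N spatial microphones.
    dist_mic_to_event: distance of each spatial microphone to the sound event.
    dist_vm_to_event: distance of the virtual microphone to the sound event.
    """
    n = len(e_dir)
    # Diffuse energy is taken to be equal everywhere: plain average (assumed).
    e_diff_vm = sum(e_diff) / n
    # Direct energy is assumed to decay with 1 / distance^2.
    e_dir_at_vm = [(d / dist_vm_to_event) ** 2 * e
                   for d, e in zip(dist_mic_to_event, e_dir)]
    e_dir_vm = sum(e_dir_at_vm) / n  # equal-weight combination (assumed)
    total = e_diff_vm + e_dir_vm
    if total <= 0.0:
        raise ValueError("total energy must be positive")
    return e_diff_vm / total


psi_vm = virtual_mic_diffuseness(e_dir=[8.0, 2.0], e_diff=[1.0, 3.0],
                                 dist_mic_to_event=[1.0, 2.0],
                                 dist_vm_to_event=2.0)
```

A weighting based on the variance of the estimators, as suggested for the diffuse energies, could replace the equal-weight averages without changing the structure of the sketch.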
(192) As mentioned above, in some cases, the sound events position estimation carried out by a sound events position estimator fails, e.g., in case of a wrong direction of arrival estimation.
(193) Additionally, the reliability of the DOA estimates at the N spatial microphones may be considered. This may be expressed e.g. in terms of the variance of the DOA estimator or SNR. Such information may be taken into account by the diffuseness sub-calculator 850, so that the VM diffuseness 103 can be artificially increased in case the DOA estimates are unreliable. In fact, as a consequence, the position estimates 205 will also be unreliable.
(194)
(195) Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
(196) The inventive decomposed signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
(197) Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
(198) Some embodiments according to the invention comprise a non-transitory data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
(199) Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
(200) Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
(201) In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
(202) A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
(203) A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
(204) A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
(205) A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
(206) In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
(207) While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
REFERENCES
(208)
[1] Michael A. Gerzon, Ambisonics in multichannel broadcasting and video, J. Audio Eng. Soc., 33(11):859-871, 1985.
[2] V. Pulkki, Directional audio coding in spatial sound reproduction and stereo upmixing, in Proceedings of the AES 28.sup.th International Conference, pp. 251-258, Piteå, Sweden, Jun. 30-Jul. 2, 2006.
[3] V. Pulkki, Spatial sound reproduction with directional audio coding, J. Audio Eng. Soc., vol. 55, no. 6, pp. 503-516, June 2007.
[4] C. Faller, Microphone Front-Ends for Spatial Audio Coders, in Proceedings of the AES 125.sup.th International Convention, San Francisco, October 2008.
[5] M. Kallinger, H. Ochsenfeld, G. Del Galdo, F. Küch, D. Mahne, R. Schultz-Amling, and O. Thiergart, A spatial filtering approach for directional audio coding, in Audio Engineering Society Convention 126, Munich, Germany, May 2009.
[6] R. Schultz-Amling, F. Küch, O. Thiergart, and M. Kallinger, Acoustical zooming based on a parametric sound field representation, in Audio Engineering Society Convention 128, London, UK, May 2010.
[7] J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart, Interactive teleconferencing combining spatial audio object coding and DirAC technology, in Audio Engineering Society Convention 128, London, UK, May 2010.
[8] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, 1999.
[9] A. Kuntz and R. Rabenstein, Limitations in the extrapolation of wave fields from circular measurements, in 15th European Signal Processing Conference (EUSIPCO 2007), 2007.
[10] A. Walther and C. Faller, Linear simulation of spaced microphone arrays using b-format recordings, in Audio Engineering Society Convention 128, London, UK, May 2010.
[11] U.S. 61/287,596: An Apparatus and a Method for Converting a First Parametric Spatial Audio Signal into a Second Parametric Spatial Audio Signal.
[12] S. Rickard and Z. Yilmaz, On the approximate W-disjoint orthogonality of speech, in Acoustics, Speech and Signal Processing, 2002. ICASSP 2002. IEEE International Conference on, April 2002, vol. 1.
[13] R. Roy, A. Paulraj, and T. Kailath, Direction-of-arrival estimation by subspace rotation methods - ESPRIT, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Stanford, Calif., USA, April 1986.
[14] R. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.
[15] J. Michael Steele, Optimal Triangulation of Random Samples in the Plane, The Annals of Probability, Vol. 10, No. 3 (August, 1982), pp. 548-553.
[16] F. J. Fahy, Sound Intensity, Essex: Elsevier Science Publishers Ltd., 1989.
[17] R. Schultz-Amling, F. Küch, M. Kallinger, G. Del Galdo, T. Ahonen and V. Pulkki, Planar microphone array processing for the analysis and reproduction of spatial audio using directional audio coding, in Audio Engineering Society Convention 124, Amsterdam, The Netherlands, May 2008.
[18] M. Kallinger, F. Küch, R. Schultz-Amling, G. Del Galdo, T. Ahonen and V. Pulkki, Enhanced direction estimation using microphone arrays for directional audio coding, in Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, May 2008, pp. 45-48.
[19] R. K. Furness, Ambisonics - An overview, in AES 8.sup.th International Conference, April 1990, pp. 181-189.
[20] Giovanni Del Galdo, Oliver Thiergart, Tobias Weller, and E. A. P. Habets, Generating virtual microphone signals using geometrical information gathered by distributed arrays, in Third Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA '11), Edinburgh, United Kingdom, May 2011.
[21] Ville Pulkki, Spatial sound reproduction with directional audio coding, J. Audio Eng. Soc., 55(6):503-516, June 2007.