Method for conversion, stereophonic encoding, decoding and transcoding of a three-dimensional audio signal
11232802 · 2022-01-25
CPC classification
H04S3/00; H04S2400/03; H04S2420/01; H04S5/02; H04R5/04; G10L19/02; H04S2420/11; H04S3/02
International classification
G10L19/02; G10L19/008; H04R5/04
Abstract
Methods for converting, encoding, decoding and transcoding an acoustic field, more particularly a first-order Ambisonics three-dimensional acoustic field.
Claims
1. A method for converting a first-order Ambisonics signal into a spherical field made up of a plurality of monochromatic progressive plane waves, by a computer programmed to perform the following operations when encoding the spherical field to obtain an encoded stereophonic signal for any frequency from among a plurality of frequencies, the method comprising: separating said Ambisonics signal into three components comprising: a first complex vectorial component (A), corresponding to a mean acoustic intensity vector of said Ambisonics signal, a second complex vectorial component (B), a complex coefficient of which is equal to subtraction of the pressure wave generated by the component A from a pressure component of said Ambisonics signal, and a direction of which is modified as a function of a random process, a third complex vectorial component (C) corresponding to a subtraction of a pressure gradient generated by the component A from a pressure gradient of said Ambisonics signal, phases of which are modified as a function of a random process, and each of three axial components of which assumes, as direction, a vector derived from a random process; grouping said first, second and third vectorial components A, B and C into a total vector and a total complex coefficient describing said spherical field, wherein: the total complex coefficient is equal to the sum of the complex coefficients corresponding to said first, second and third vectorial components, and the total vector is equal to the sum of the directions of said three components, weighted by the magnitude of the complex coefficients corresponding to said three components; and outputting an encoded stereophonic signal based on the total complex coefficient and the total vector.
2. The method for converting a first-order Ambisonics signal to a spherical field according to claim 1, wherein said second vectorial component B is assigned an arbitrary and predefined direction of origin with negative elevations.
3. A method for converting a first-order Ambisonics signal into a spherical field made up of a plurality of monochromatic progressive plane waves, comprising, for any frequency from among a plurality of frequencies: separating said Ambisonics signal into: a first complex vectorial component (A), determined by a complex coefficient and a direction, said first complex vectorial component being obtained by: (a1) determining a divergence value, calculated as the ratio between a mean acoustic intensity of said Ambisonics signal and the square of the magnitude of a pressure component of said Ambisonics signal, said ratio being saturated at a maximum value of 1, (a2) determining a complex coefficient corresponding to the pressure component of said Ambisonics signal, (a3) determining the direction of said first vectorial component (A), calculated by a weight, as a function of said divergence value, between the direction of a mean acoustic intensity vector and the direction of a vector generated by a random process; and a second complex vectorial component (C), determined by a complex coefficient and a direction, said second complex vectorial component being obtained by: (c1) determining three axial complex components of the pressure gradient of said Ambisonics signal, (c2) determining three axial complex components of the pressure gradient that would be generated by a monochromatic progressive plane wave, the complex coefficient of which would be that of the pressure of the Ambisonics signal multiplied by the divergence value, and the direction of which would be that of the mean acoustic intensity vector, (c3) subtracting the result of said (c2) from the result of said (c1), and (c4) changing the phases and direction vectors of the three axial components of the result of said (c3), as a function of a random process, to obtain the complex coefficients and the directions of said second vectorial component (C); grouping said first and second
vectorial components A and C into a total vector and a total complex coefficient describing said spherical field, wherein: the total complex coefficient is equal to the sum of the complex coefficients corresponding to said first and second vectorial components, and the total vector is equal to the sum of the directions of said first and second vectorial components, weighted by the magnitude of the complex coefficients corresponding to said two components; and outputting an encoded stereophonic signal based on the total complex coefficient and the total vector.
4. The method for converting a first-order Ambisonics signal into a spherical field according to claim 1, further comprising encoding said spherical field to obtain the encoded stereophonic signal by determining panorama and phase difference values from spherical spatial coordinates describing said spherical field, for any frequency from among a plurality of frequencies, determining the position of the singularity Ψ in the inter-channel domain by analyzing the panorama and phase difference values and moving said singularity from a preceding position of said singularity such that said singularity is not positioned on a useful signal, determining a phase correspondence Φ.sub.Ψ(panorama,phasediff) corresponding to each pair of complex coefficients derived from said spherical field, and determining a table of complex coefficient pairs c.sub.L and c.sub.R, for any frequency from a plurality of frequencies, from complex coefficients derived from the spherical field c.sub.s, the phase correspondence, and the phase difference values, said complex coefficients c.sub.L and c.sub.R being combined to obtain said encoded stereophonic signal.
Description
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION
(14) The techniques described hereinafter deal with data that assume the form of complex frequency coefficients. These coefficients represent a frequency band over a reduced temporal window. They are obtained using a technique called the short-term Fourier transform (STFT), and may also be obtained using similar transforms, such as those from the family of complex wavelet transforms (CWT), complex wavelet packet transforms (CWPT), the modified discrete cosine transform (MDCT) or the modulated complex lapped transform (MCLT), etc. Each of these transforms, applied on successive, overlapping windows of the signal, has an inverse transform making it possible, from the complex frequency coefficients representing all of the frequency bands of the signal, to obtain a signal in temporal form.
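As an illustration of this analysis/synthesis scheme, a minimal numpy sketch of an STFT and its overlap-add inverse is given below (Hann window, 50% overlap; the function names and parameters are illustrative and not taken from the patent):

```python
import numpy as np

def stft(signal, win_len=1024, hop=512):
    """Short-term Fourier transform: one row of complex frequency
    coefficients per overlapping, Hann-windowed temporal window."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (n_frames, win_len//2 + 1)

def istft(coeffs, win_len=1024, hop=512):
    """Inverse transform by windowed overlap-add, normalized by the
    accumulated squared window so interior samples reconstruct exactly."""
    window = np.hanning(win_len)
    frames = np.fft.irfft(coeffs, n=win_len, axis=1) * window
    out = np.zeros(hop * (len(frames) - 1) + win_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + win_len] += frame
        norm[i * hop : i * hop + win_len] += window ** 2
    return out / np.maximum(norm, 1e-12)
```

Any of the other transforms cited (CWT, MDCT, MCLT, …) could play the same role, provided the analysis/synthesis pair reconstructs the temporal signal.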
(15) In the present document, we define:
(16)
(17) Using one of the time-to-frequency transforms previously described, two channels in temporal form, forming for example a stereophonic signal, can be transformed to the frequency domain into two tables of complex coefficients. The complex frequency coefficients of the two channels can be paired, so as to have one pair for each frequency or frequency band from a plurality of frequencies, and for each temporal window of the signal.
(18) Each pair of complex frequency coefficients can be analyzed using two metrics, combining information from the two stereophonic channels, which are introduced below: the panorama and the phase difference, which together form what will be called the “inter-channel domain” in the remainder of the present document. The panorama of two complex frequency coefficients c.sub.1 and c.sub.2 is defined as the ratio between the difference in their powers and the sum of their powers:
(19) panorama(c.sub.1,c.sub.2)=(|c.sub.1|.sup.2−|c.sub.2|.sup.2)/(|c.sub.1|.sup.2+|c.sub.2|.sup.2) (1)
(20) The panorama thus assumes values in the interval [−1,1]. If the two coefficients simultaneously have a nil magnitude, there is no signal in the frequency band that they represent, and the use of the panorama is not relevant.
(21) The panorama applied to a stereophonic signal made up of two left (L) and right (R) channels will thus be, for the respective coefficients of the two channels c.sub.L and c.sub.R, not simultaneously nil:
(22) panorama(c.sub.L,c.sub.R)=(|c.sub.L|.sup.2−|c.sub.R|.sup.2)/(|c.sub.L|.sup.2+|c.sub.R|.sup.2) (2)
(23) The panorama is thus equal to, inter alia: 1 for a signal completely contained in the left channel, i.e., c.sub.R=0, −1 for a signal completely contained in the right channel, i.e., c.sub.L=0, 0 for a signal of equal magnitude on both channels.
(24) Knowing a panorama and a total power p makes it possible to determine the magnitudes of the two complex frequency coefficients:
(25) |c.sub.1|=√(p(1+panorama)/2), |c.sub.2|=√(p(1−panorama)/2) (3)
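The panorama and the recovery of the two magnitudes from a panorama and a total power p can be sketched as follows (a sketch assuming p = |c.sub.1|² + |c.sub.2|²; the function names are illustrative):

```python
import numpy as np

def panorama(c_1, c_2):
    """Panorama of a pair of complex coefficients: ratio of the power
    difference to the power sum, with values in [-1, 1]."""
    p1, p2 = abs(c_1) ** 2, abs(c_2) ** 2
    return (p1 - p2) / (p1 + p2)

def magnitudes_from(pan, p):
    """Recover the two magnitudes from a panorama value and a total
    power p (assumed here to be |c_1|^2 + |c_2|^2)."""
    return np.sqrt(p * (1 + pan) / 2), np.sqrt(p * (1 - pan) / 2)
```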
(26) One variant of the formulation of the panorama is as follows:
(27)
(28) With this formulation, knowing a panorama and a total power p makes it possible to determine the magnitudes of the two complex frequency coefficients:
(29)
(30) The phase difference is also defined between two complex frequency coefficients c.sub.1 and c.sub.2 that are both not nil as follows:
phasediff(c.sub.1,c.sub.2)=arg(c.sub.2)−arg(c.sub.1)+k2π (6)
where k∈Z such that phasediff(c.sub.1,c.sub.2)∈]−π, π].
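The wrap of the phase difference into ]−π, π] can be implemented directly:

```python
import numpy as np

def phasediff(c_1, c_2):
    """Phase difference arg(c_2) - arg(c_1), wrapped into ]-pi, pi]
    by subtracting the appropriate multiple of 2*pi."""
    d = np.angle(c_2) - np.angle(c_1)
    return d - 2 * np.pi * np.ceil((d - np.pi) / (2 * np.pi))
```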
(31) In the rest of this document, we consider the three-dimensional Cartesian coordinate system with axes (X,Y,Z) and coordinates (x,y,z). The azimuth is the angle in the plane (z=0), measured from the axis X toward the axis Y (trigonometric direction), in radians. A vector v has an azimuth coordinate a when the half-plane (y=0, x≥0), having undergone a rotation around the axis Z by an angle a, contains the vector v. A vector v has an elevation coordinate e when, in the half-plane (y=0, x≥0) having undergone a rotation around the axis Z by an angle a, it makes an angle e, positive toward the top, with a non-nil vector of the half-line defined by the intersection between said half-plane and the horizontal plane (z=0).
(32) An azimuth and elevation unit vector a and e will have, for Cartesian coordinates:
(33) (cos(a)cos(e), sin(a)cos(e), sin(e)).sup.T (7)
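A sketch of the azimuth/elevation-to-Cartesian conversion under the conventions just defined:

```python
import numpy as np

def unit_vector(azimuth, elevation):
    """Cartesian coordinates of the unit vector with the given azimuth
    (around Z, from X toward Y) and elevation (positive upward)."""
    return np.array([np.cos(azimuth) * np.cos(elevation),
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation)])
```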
(34) In this Cartesian coordinate system, a signal expressed in the form of a “First Order Ambisonics” (FOA) field, i.e., in first-order spherical harmonics, is made up of four channels W, X Y, Z, corresponding to the pressure and pressure gradient at a point in space in each of the directions: the channel W is the pressure signal the channel X is the signal of the pressure gradient at the point along the axis X the channel Y is the signal of the pressure gradient at the point along the axis Y the channel Z is the signal of the pressure gradient at the point along the axis Z
(35) Normalization standard of the spherical harmonics can be defined as follows: a monochromatic progressive plane wave (MPPW) with complex frequency component c and direction of origin of the unitary vector {right arrow over (v)} with Cartesian coordinates (v.sub.x, v.sub.y, v.sub.z) or azimuth and elevation coordinates (a, e) will create, for each channel, a coefficient of equal phase, but altered magnitude:
(36) c.sub.W=c/√2, c.sub.X=c v.sub.x, c.sub.Y=c v.sub.y, c.sub.Z=c v.sub.z (8)
or respectively
(37) c.sub.W=c/√2, c.sub.X=c cos(a)cos(e), c.sub.Y=c sin(a)cos(e), c.sub.Z=c sin(e) (9)
the whole being expressed to within a normalization factor. By linearity of the time-frequency transforms, the expression of the equivalents in the temporal domain is trivial. Other normalization standards exist, which are for example presented by Daniel in “Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia” [Representation of acoustic fields, application to the transmission and reproduction of complex sound scenes in a multimedia context] (Doctoral Thesis, Université Paris 6, Jul. 31, 2001).
(38) The concept of “divergence” makes it possible to simulate, in the FOA field, a source moving inside the unitary sphere of the directions: the divergence is a real parameter with values in [0,1], a divergence div=1 will position the source on the surface of the sphere like in the previous equations, and divergence div=0 will position the source at the center of the sphere. Thus, the coefficients of the FOA field are as follows:
(39) c.sub.W=c/√2, c.sub.X=c div v.sub.x, c.sub.Y=c div v.sub.y, c.sub.Z=c div v.sub.z (10)
or respectively
(40) c.sub.W=c/√2, c.sub.X=c div cos(a)cos(e), c.sub.Y=c div sin(a)cos(e), c.sub.Z=c div sin(e) (11)
the whole being expressed to within a normalization factor. By linearity of the time-frequency transforms, the expression of the equivalents in the temporal domain is trivial.
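The encoding of a monochromatic progressive plane wave into FOA coefficients, including the divergence parameter, can be sketched as follows. The normalization (chosen so that p = √2·c.sub.W, consistent with the intensity formula used later) is an assumption, the text itself working only to within a normalization factor:

```python
import numpy as np

def foa_coefficients(c, direction, div=1.0):
    """FOA coefficients (cW, cX, cY, cZ) of a monochromatic progressive
    plane wave with complex coefficient c arriving from the unit vector
    `direction`, with divergence div in [0, 1]. Normalization is an
    assumption such that p = sqrt(2) * cW."""
    vx, vy, vz = direction
    return np.array([c / np.sqrt(2), c * div * vx, c * div * vy, c * div * vz])
```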
(41) One preferred implementation of the invention comprises a first method for converting such a FOA field into complex coefficients and spherical coordinates. This first method performs a lossy, perceptually based conversion of the FOA field to a format made up of complex frequency coefficients and their spatial correspondence in azimuth and elevation coordinates (or a unit-norm Cartesian vector). Said method is based on a frequency representation of the FOA signals obtained after temporal windowing and time-to-frequency transform, for example through the use of the short-term Fourier transform (STFT).
(42) The following method is applied on each group of four complex coefficients corresponding to a frequency “bin”, i.e., the complex coefficients of the frequency representation of each of the channels W, X, Y, Z that correspond to the same frequency band, for any frequency or frequency band from among a plurality of frequencies. An exception is made for the frequency bin(s) corresponding to the continuous component (due to the “padding” applied to the signal before time-to-frequency transform, the following few frequency bins can also be affected).
(43) References c.sub.W, c.sub.X, c.sub.Y, c.sub.Z denote the complex coefficients corresponding to a considered frequency “bin”. An analysis is done to separate the content of this frequency band into three parts: a part A corresponding to a monochromatic progressive plane wave (MPPW), directional, a part B corresponding to a diffuse pressure wave, a part C corresponding to a standing wave.
(44) To understand the separation, the following examples are given: An analysis leading to a separation in which only part A is non-nil can be obtained with a signal coming from a MPPW as described in equation 8 or equation 9. An analysis leading to a separation in which only part B is non-nil can be obtained with two MPPWs (of equal frequency), in phase, and with opposite directions of origin (only c.sub.W then being non-nil). An analysis leading to a separation in which only part C is non-nil can be obtained with two MPPWs (of equal frequency), out of phase, and with opposite directions of origin (only c.sub.X, c.sub.Y, c.sub.Z then being non-nil).
(45) Hereinafter, the three parts are grouped together in order to obtain a whole signal.
(46) Regarding part A defined above, the mean-intensity vector of the signal of the FOA field is examined. In “Instantaneous intensity” (AES Convention 81, November 1986), Heyser gives a frequency-domain formulation of the active part of the acoustic intensity, which can be expressed in all three dimensions:
{right arrow over (I)}.sub.x,y,z=½ Re[p {right arrow over (u)}.sub.x,y,z*] (12)
where: {right arrow over (I)}.sub.x,y,z is the mean-intensity three-dimensional vector, oriented toward the origin of the MPPW, of magnitude proportional to the square of the magnitude of the MPPW, the operator Re[{right arrow over (v)}] designates the real part of the vector {right arrow over (v)}, i.e., the vector of the real parts of the components of the vector {right arrow over (v)}, p is the complex coefficient corresponding to the pressure component, i.e., p=√2 c.sub.W, {right arrow over (u)}.sub.x,y,z is the three-dimensional vector made up of the complex coefficients corresponding to the pressure gradients respectively along the axes X, Y, and Z, i.e., {right arrow over (u)}.sub.x,y,z=(c.sub.X, c.sub.Y, c.sub.Z).sup.T, and the operator {right arrow over (v)}* is the conjugation operator of the complex components of the vector.
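A direct transcription of equation 12, with p = √2·c.sub.W:

```python
import numpy as np

def mean_intensity(c_w, c_x, c_y, c_z):
    """Active (mean) acoustic intensity vector I = 1/2 Re[p u*],
    with p = sqrt(2)*cW and u = (cX, cY, cZ)."""
    p = np.sqrt(2) * c_w
    u = np.array([c_x, c_y, c_z])
    return 0.5 * np.real(p * np.conj(u))
```

For a unit plane wave arriving from (1,0,0) (c.sub.W = 1/√2, c.sub.X = 1), this yields a vector of magnitude ½ pointing toward the origin of the wave, as the text states.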
(47) One thus obtains for part A, for each frequency “bin” except that or those corresponding to the continuous component:
(48)
(49) Furthermore, regarding part B defined above, let the complex coefficient c.sub.w′ be the result of the subtraction of the complex coefficient corresponding to the signal extracted in part A (i.e., via equation 8) from the original coefficient c.sub.w:
(50)
(51) It is possible to define several behavior modes for the determination of part B. In a first, spherical conversion mode, which retains all of the directions of origin at negative elevations and is therefore particularly suitable for virtual reality, part B is expressed as
(52)
(53)
{right arrow over (v.sub.s)}=(1−s)×{right arrow over (r.sub.w)}+s×[cos(e.sub.w),0,sin(e.sub.w)].sup.T (17)
One obtains:
(54)
(55) Lastly, regarding part C, let the complex coefficients c.sub.x′, c.sub.y′, and c.sub.z′ be the results of the subtraction of the complex coefficients corresponding to the signal extracted from part A (i.e., the coefficients obtained with the equation) from the original coefficients c.sub.x, c.sub.y, and c.sub.z:
(56) c.sub.x′=c.sub.x−c.sub.a a.sub.x, c.sub.y′=c.sub.y−c.sub.a a.sub.y, c.sub.z′=c.sub.z−c.sub.a a.sub.z
where a.sub.x, a.sub.y, a.sub.z are the Cartesian coefficients of the vector {right arrow over (a)}.
(57) One obtains:
(58)
where {right arrow over (r)}.sub.x, {right arrow over (r)}.sub.y, and {right arrow over (r)}.sub.z are vectors depending on the frequency or the frequency band, described hereinafter.
(59) The separate parts A, B and C are grouped together in a direction of origin vector {right arrow over (v)}.sub.total and a complex coefficient c.sub.total:
(60)
where ϕ.sub.x, ϕ.sub.y and ϕ.sub.z are phases that will be defined later in the present document.
(61) The first conversion method described above does not take into account any divergence that may have been introduced during FOA panning. A second preferred implementation makes it possible to take the divergence into account.
(62) For part A, {right arrow over (I)}.sub.x,y,z obtained by equation 12 is considered. The divergence div is calculated as follows:
(63)
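The divergence computation can be sketched from claim 3 (the ratio between the mean acoustic intensity and the squared magnitude of the pressure component, saturated at 1); the exact normalization of the ratio is an assumption here, chosen so that a pure plane wave yields div = 1 under the p = √2·c.sub.W convention:

```python
import numpy as np

def divergence(c_w, c_x, c_y, c_z):
    """Divergence estimate: mean-intensity magnitude over squared pressure
    magnitude, saturated at 1 (normalization of the ratio is an assumption
    such that a pure plane wave gives exactly 1)."""
    p = np.sqrt(2) * c_w
    intensity = 0.5 * np.real(p * np.conj(np.array([c_x, c_y, c_z])))
    pressure_power = abs(c_w) ** 2
    if pressure_power == 0.0:
        return 0.0
    return min(1.0, np.linalg.norm(intensity) / pressure_power)
```

For the two in-phase, opposite-direction MPPWs of paragraph (44), the gradients cancel and the divergence is 0; for a single MPPW it is 1.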
(64) From div, c.sub.w and {right arrow over (I)}.sub.x,y,z, the direction {right arrow over (a)} and the coefficient c.sub.a are calculated:
{right arrow over (a.sub.0)}=Norm {right arrow over (I)}.sub.x,y,z|(0,0,0).sup.T (23)
(65) In a first spherical embodiment, the unitary direction vector {right arrow over (a)}.sub.spherical is calculated as follows:
{right arrow over (a.sub.spherical)}=Norm div {right arrow over (a.sub.0)}+(1−div){right arrow over (r.sub.w)}|(1,0,0).sup.T (24)
(66) In a second hemispherical embodiment, the unitary direction vector {right arrow over (a)}.sub.hemispherical is calculated as follows:
{right arrow over (a.sub.1)}=div {right arrow over (a.sub.0)} (25)
(67) One defines {right arrow over (p)} as the vector {right arrow over (a.sub.1)} projected onto the horizontal plane:
{right arrow over (p)}={right arrow over (a.sub.1)}−({right arrow over (a.sub.1)}·(0,0,1).sup.T)(0,0,1).sup.T (26)
where · denotes the scalar product, and one defines its norm p:
p=∥{right arrow over (p)}∥ (27)
(68) One also defines h:
h=√{square root over (1−p.sup.2)} (28)
{right arrow over (a.sub.2)}={right arrow over (a.sub.1)}−(1−p)(h−{right arrow over (a.sub.1)}·(0,0,1).sup.T)(0,0,1).sup.T (29)
then, if the coordinate in Z of {right arrow over (a.sub.2)} is less than −h, it is set to −h. One defines hdiv:
hdiv=∥{right arrow over (a.sub.2)}∥ (30)
(69) Then lastly {right arrow over (a)}.sub.hemispherical:
{right arrow over (a.sub.hemispherical)}=Norm {right arrow over (a.sub.2)}+(1−hdiv){right arrow over (r.sub.w)}|(1,0,0).sup.T (31)
(70) Modes midway between the spherical mode and the hemispherical mode can be built, indexed by a coefficient s ∈ [0,1], 0 for the spherical mode and 1 for the hemispherical mode:
{right arrow over (a)}=(1−s){right arrow over (a)}.sub.spherical+s{right arrow over (a)}.sub.hemispherical (32)
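Equations 24 to 32 can be gathered into one sketch of the direction of part A. The Norm(·)|f notation is read here as "normalize, or fall back to f when the argument is nil" (an assumption), and r_w is the per-bin random direction vector defined later in the text:

```python
import numpy as np

Z = np.array([0.0, 0.0, 1.0])

def norm_or(v, fallback):
    """Normalize v; return the fallback direction when v is (near) nil."""
    n = np.linalg.norm(v)
    return v / n if n > 1e-12 else np.asarray(fallback, float)

def direction(a0, div, r_w, s):
    """Direction of part A, blending the spherical (s=0) and hemispherical
    (s=1) modes per equations 24-32 (a sketch under the assumptions above)."""
    # spherical mode: interpolate between intensity direction and r_w
    a_sph = norm_or(div * a0 + (1 - div) * r_w, (1.0, 0.0, 0.0))
    # hemispherical mode: shrink by div, push toward the horizontal plane
    a1 = div * a0
    p_vec = a1 - np.dot(a1, Z) * Z            # horizontal projection (eq. 26)
    p = np.linalg.norm(p_vec)                 # eq. 27
    h = np.sqrt(max(0.0, 1.0 - p ** 2))       # eq. 28
    a2 = a1 - (1 - p) * (h - np.dot(a1, Z)) * Z   # eq. 29
    if a2[2] < -h:                            # clamp Z coordinate at -h
        a2[2] = -h
    hdiv = np.linalg.norm(a2)                 # eq. 30
    a_hemi = norm_or(a2 + (1 - hdiv) * r_w, (1.0, 0.0, 0.0))   # eq. 31
    return (1 - s) * a_sph + s * a_hemi       # eq. 32
```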
(71) The complex frequency coefficient is in turn:
c.sub.a=c.sub.w√{square root over (2)} (33)
(72) Furthermore, it will be noted that there is no part B, since the latter is fully taken into account by the divergence in part A.
(73) Lastly, regarding part C, let the complex coefficients c.sub.x′, c.sub.y′, and c.sub.z′ be the results of the subtraction of the complex coefficients corresponding to the signal extracted from part A (i.e., the coefficients obtained with the equation), in its direction without divergence, from the original coefficients c.sub.x, c.sub.y, and c.sub.z:
(74) c.sub.x′=c.sub.x−div c.sub.a a.sub.0x, c.sub.y′=c.sub.y−div c.sub.a a.sub.0y, c.sub.z′=c.sub.z−div c.sub.a a.sub.0z
where a.sub.0x, a.sub.0y, a.sub.0z are the Cartesian components of the vector {right arrow over (a.sub.0)}. One obtains:
(75)
where {right arrow over (r)}.sub.x, {right arrow over (r)}.sub.y, and {right arrow over (r)}.sub.z are vectors depending on the frequency band, described hereinafter.
(76) The separate parts A and C are definitively grouped together in a direction of origin vector {right arrow over (v)}.sub.total and a complex coefficient c.sub.total:
(77)
where ϕ.sub.x, ϕ.sub.y and ϕ.sub.z are phases that will be defined later in the present document.
(78) Regarding the direction vectors for the diffuse parts, reference is made above to: vectors {right arrow over (r)}.sub.w, {right arrow over (r)}.sub.x, {right arrow over (r)}.sub.y, {right arrow over (r)}.sub.z, and phases ϕ.sub.x, ϕ.sub.y and ϕ.sub.z.
(79) These vectors and phases are responsible for establishing a diffuse nature of the signal, to which they give the direction and of which they modify the phase. They depend on the processed frequency band, i.e., there is a vector and phase set for each frequency “bin”. In order to establish this diffuse nature, they result from a random process, which makes it possible to smooth them spectrally, and temporally if it is desired for them to be dynamic.
(80) The process of obtaining these vectors is as follows. For each frequency or frequency band, a set of unitary vectors {right arrow over (r.sub.0)}.sub.w, {right arrow over (r.sub.0)}.sub.x, {right arrow over (r.sub.0)}.sub.y, {right arrow over (r.sub.0)}.sub.z and phases ϕ.sub.0x, ϕ.sub.0y and ϕ.sub.0z is generated from a pseudorandom process: the unitary vectors are generated from an azimuth drawn from a uniform real-number pseudorandom generator in ]−π, π] and an elevation taken as the arcsine of a real number drawn from a uniform pseudorandom generator in [−1, 1]; the phases are obtained using a uniform pseudorandom generator of real numbers in ]−π, π]. The frequencies or frequency bands are then swept from those corresponding to the low frequencies toward those corresponding to the high frequencies, to spectrally smooth the vectors and phases using the following procedure, for the vectors {right arrow over (r)}.sub.w(b), where b is the index of the frequency or the frequency band:
(81)
(82)
(83) The vectors of the lowest frequencies, for example those corresponding to the frequencies below 150 Hz, are modified to be oriented in a favored direction, for example and preferably (1,0,0).sup.T. To that end, the generation of the random vectors {right arrow over (r.sub.0)}.sub.w, {right arrow over (r.sub.0)}.sub.x, {right arrow over (r.sub.0)}.sub.y, {right arrow over (r.sub.0)}.sub.z is modified: it then consists of generating a random unitary vector, determining a vector (m n.sup.b, 0, 0).sup.T, where m is a factor greater than 1, for example 8, and n is a factor less than 1, for example 0.9, making it possible to decrease the preponderance of this vector relative to the random unitary vector as the index b of the frequency bin increases, then summing the two vectors and normalizing the result.
(84) The spectral smoothing to obtain the vectors {right arrow over (r)}.sub.w, {right arrow over (r)}.sub.x, {right arrow over (r)}.sub.y, {right arrow over (r)}.sub.z is unchanged.
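The generation of the per-bin random unit vectors, including the frontal bias of the lowest bins, can be sketched as follows (the spectral smoothing step is omitted; m and n use the example values 8 and 0.9, and the function name is illustrative):

```python
import numpy as np

def random_directions(n_bins, rng=None, m=8.0, n=0.9):
    """Per-bin random unit vectors for the diffuse parts: azimuth uniform
    in ]-pi, pi], elevation as the arcsine of a uniform value in [-1, 1]
    (uniform on the sphere), then biased toward the frontal direction
    (1,0,0) by the vector (m*n**b, 0, 0) and renormalized."""
    rng = rng or np.random.default_rng()
    az = rng.uniform(-np.pi, np.pi, n_bins)
    el = np.arcsin(rng.uniform(-1.0, 1.0, n_bins))
    v = np.stack([np.cos(az) * np.cos(el),
                  np.sin(az) * np.cos(el),
                  np.sin(el)], axis=1)
    v[:, 0] += m * n ** np.arange(n_bins)   # frontal bias, fading with b
    return v / np.linalg.norm(v, axis=1, keepdims=True)
```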
(85) As an alternative to the procedure for generating random vectors, the vectors {right arrow over (r)}.sub.w, {right arrow over (r)}.sub.x, {right arrow over (r)}.sub.y, {right arrow over (r)}.sub.z and phases ϕ.sub.x, ϕ.sub.y and ϕ.sub.z can be determined by impulse response measurements: it is possible to obtain them by analyzing complex frequency coefficients derived from multiple sound captures of the first-order spherical field, using signals emitted by speakers, in phase all the way around the measurement point for {right arrow over (r)}.sub.w, and on either side and out of phase along the axes X, Y and Z for {right arrow over (r)}.sub.x, {right arrow over (r)}.sub.y and {right arrow over (r)}.sub.z respectively and ϕ.sub.x, ϕ.sub.y and ϕ.sub.z respectively.
(86) For the frequency (frequencies) or frequency band(s) corresponding to the continuous component, the processing is separate. It will be noted that, due to the padding, the continuous state corresponds to one or more frequencies or frequency bands: if there is no padding, only the first frequency or frequency band undergoes the processing defined below; if there is 100% padding (which doubles the length of the signal before the time-to-frequency transform), the first two frequencies or frequency bands are subject to the processing defined below (as well as the “negative” frequency or frequency band, which is conjugate-symmetric with respect to the second frequency or frequency band); if there is 300% padding (which quadruples the length of the signal before the time-to-frequency transform), the first four frequencies or frequency bands are subject to the processing defined below (as well as the “negative” frequencies or frequency bands, which are conjugate-symmetric with respect to the second, third and fourth frequencies or frequency bands); the other padding cases follow the same logic.
(87) This or these frequencies or frequency bands have a real, non-complex value, which does not make it possible to determine the phase of the signal for the corresponding frequencies; the direction analysis is therefore not possible. However, as shown by the psychoacoustic literature, a human being cannot perceive a direction of origin for the low frequencies in question (those below 80 to 100 Hz in the case at hand). It is thus sufficient to analyze only the pressure wave, i.e., the coefficient c.sub.w, and to choose an arbitrary, frontal direction of origin: (1,0,0).sup.T. Thus, the representation in the spherical domain of the first frequency band(s) is:
(88)
(89) In order to guarantee the correspondence between the spherical coordinates and the inter-channel domain, the Scheiber sphere, which corresponds, in the field of optics, to the Stokes-Poincaré sphere, is used hereinafter.
(90) The Scheiber sphere symbolically represents the magnitude and phase relations of two monochromatic waves, i.e., also two complex frequency coefficients representing these waves. It is made up of two half-circles joining the opposite points L and R, each half-circle being derived from a rotation around the axis LR of the frontal arc in bold by an angle β and representing a phase difference value β∈]−π, π]. The frontal half-circle represents a nil phase difference. Each point of the half-circle represents a distinct panorama value, with a value close to 1 of the points close to L, and a value close to −1 for the points close to R.
(91)
(92) Regarding the conversion from the inter-channel domain to the spherical coordinates, the coordinate system of the Scheiber sphere is spherical with polar axis Y, and it is possible to express the coordinates in X, Y, Z as a function of the panorama and the phase difference:
(93)
(94) The azimuth and elevation spherical coordinates for such Cartesian coordinates are obtained by the following method:
(95)
(96) Thus, given a pair of complex frequency coefficients, their relationship establishing a panorama and a phase difference, it is possible to determine a direction of origin of a sound signal on a sphere. This conversion also makes it possible to determine the magnitude of the complex frequency coefficient of the monophonic signal, but the determination of its phase is not established by the above method and will be specified hereinafter.
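Since the conversion equations are elided in the source, the sketch below reconstructs a plausible mapping from a (panorama, phase difference) pair to a direction on the Scheiber sphere, under the stated convention (polar axis Y, L at +Y and R at −Y, frontal half-circle at nil phase difference); treat the exact formulas as an assumption:

```python
import numpy as np

def interchannel_to_spherical(pan, phase_diff):
    """Direction on the Scheiber sphere for a (panorama, phase difference)
    pair: pan fixes the Y coordinate (L at +Y, R at -Y), phase_diff rotates
    the frontal half-circle around the LR axis (a reconstruction)."""
    r = np.sqrt(1.0 - pan ** 2)
    x = r * np.cos(phase_diff)
    y = pan
    z = r * np.sin(phase_diff)
    azimuth = np.arctan2(y, x)
    elevation = np.arcsin(np.clip(z, -1.0, 1.0))
    return azimuth, elevation
```

A center signal (pan = 0, nil phase difference) lands at the frontal point (azimuth 0, elevation 0), and a pure left signal (pan = 1) at azimuth 90°, consistent with the azimuth correspondence discussed below.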
(97) It is possible to obtain the reciprocal of the conversion previously described, i.e., the conversion from the spherical coordinates to the inter-channel domain:
(98)
or, in spherical coordinates:
(99)
(100) Thus, given the complex coefficient of a monophonic signal and its direction of origin, it is possible to determine the magnitudes of two complex coefficients as well as their phase difference, but, as seen above, the determination of their absolute phase is not established by the above method.
(101) According to the presentation done by Peter Scheiber in “Analyzing Phase-Amplitude Matrices” (JAES, 1971), the azimuths 90° and −90° correspond to the left (L) and right (R) speakers, which are typically located respectively at the azimuths 30° and −30° on either side facing the listener. Thus, to respect this spatial correspondence, which naturally allows compatibility with the stereo and mastered surround formats, a conversion to the spherical domain can be followed by a piecewise affine modification of the azimuth coordinates: any azimuth a∈[−90°, 90°] is stretched into the interval [−30°, 30°] in an affine manner, any azimuth a∈[90°, 180°] is stretched into the interval [30°, 180°] in an affine manner, and any azimuth a∈]−180°, −90°] is stretched into the interval ]−180°, −30°] in an affine manner.
(102) Following the same principle, a conversion from the spherical domain can then naturally be preceded by the inverse conversion: any azimuth a∈[−30°, 30°] is stretched into the interval [−90°, 90°] in an affine manner, any azimuth a∈[30°, 180°] is stretched into the interval [90°, 180°] in an affine manner, and any azimuth a∈]−180°, −30°] is stretched into the interval ]−180°, −90°] in an affine manner.
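The two piecewise affine azimuth warps can be sketched as follows (in degrees; the interval boundaries follow the mapping described above, and the function names are illustrative):

```python
def stretch_azimuth(a):
    """Piecewise affine azimuth warp (degrees) mapping the Scheiber
    loudspeaker azimuths +/-90 to the physical +/-30 positions."""
    if -90.0 <= a <= 90.0:
        return a / 3.0                               # [-90, 90] -> [-30, 30]
    if a > 90.0:
        return 30.0 + (a - 90.0) * (150.0 / 90.0)    # [90, 180] -> [30, 180]
    return -30.0 + (a + 90.0) * (150.0 / 90.0)       # ]-180, -90] -> ]-180, -30]

def unstretch_azimuth(a):
    """Inverse warp: physical azimuths back to the Scheiber sphere."""
    if -30.0 <= a <= 30.0:
        return a * 3.0
    if a > 30.0:
        return 90.0 + (a - 30.0) * (90.0 / 150.0)
    return -90.0 + (a + 30.0) * (90.0 / 150.0)
```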
(103) In “Understanding the Scheiber Sphere” (MCS Review, Vol. 4, No. 3, Winter 1983), Sommerwerck illustrates this principle of correspondence between physical space and Scheiber sphere, said principle therefore being obvious to any person in light of the state of the art. These azimuth conversions are illustrated in
(104) In the context of the determination of the phase correspondence, the objective is to produce a fully determined correspondence between a pair of complex frequency coefficients (inter-channel domain) on the one hand and a complex frequency coefficient and spherical coordinates on the other hand (spherical domain).
(105) As seen above, the correspondence previously established does not make it possible to determine the phase of the complex frequency coefficients, but only the phase difference in the pair of complex frequency coefficients of the inter-channel domain.
(106) It is then a matter of determining the appropriate correspondence for the phases, i.e., how to define the phase of a coefficient in the spherical domain as a function of the position in the inter-channel domain (panorama, phasediff), as well as the absolute phase of said coefficients (which will be represented by an intermediate phase value, as will be seen later).
(107) A representation of a phase correspondence is established in the form of a two-dimensional map of the phases in the inter-channel domain, with the panorama on the x-axis over the value domain [−1, 1], and the phase difference on the y-axis over the value domain ]−π, π]. This map shows the pairs of complex coefficients of the inter-channel domain obtained by conversion from a coefficient of the spherical domain: having a phase ϕ=0, the other input and output phases being obtained to within an identical rotation, and having spherical coordinates, which are in bijection with a panorama and a phase difference, chosen hereinafter as the coordinates of the map.
(108) The pairs of coefficients are shown locally, the map therefore shows a field of complex coefficient pairs. The choice of a phase correspondence corresponds to the local rotation of the complex plane containing the pair of complex frequency coefficients. One can see that the map is a two-dimensional representation of the Scheiber sphere, to which the phase information is added.
(109)
(110) The criterion chosen to design a correspondence is that of spatial continuity of the phase of the signal, i.e., that an infinitesimal change in position of a sound source must result in an infinitesimal change of the phase. The phase continuity criterion imposes constraints on a phase correspondence at the edges of the domain: the top and bottom of the domain are adjacent, since the phase loops to within 2π; thus, the values must be identical at the top and bottom of the domain. All of the values to the left of the domain (respectively, all of the values to the right of the domain) correspond to the vicinity of the point L (respectively, the point R) of the sphere of the locations. To guarantee continuity around these points on the sphere, the phase of the complex frequency coefficient having the greater magnitude must be constant. The phase of the complex frequency coefficient having the smaller magnitude is then imposed by the phase difference; it performs a rotation of 2π when a curve is traveled around the point L or R of the sphere, but this is not problematic, since its magnitude is canceled out at the phase discontinuity point, leading to continuity of the complex frequency coefficient.
(111)
(112)
(113) Hereinafter we consider the field of tangent vectors generated by the coefficient of the left channel c.sub.L; the considerations are identical for the field of tangent vectors generated by the coefficient of the right channel c.sub.R. For the purposes of the demonstration, we modify the field of vectors in the immediate vicinity of L using a real factor that cancels it out at L, in order to guarantee the continuity of the vector field; this in no way modifies the phases and therefore the correspondence of the phases.
(114) According to the Poincaré-Hopf theorem, the sum of the indices of the isolated zeros of a vector field is equal to the Euler-Poincaré characteristic of the surface. In the case at hand, the sphere carrying the vector field has an Euler-Poincaré characteristic of 2. Yet by construction, the vector field derived from c.sub.L vanishes through the modification around L with an index of 1, as can be seen in
(115) The method disclosed in the present invention resolves this issue of phase continuity. It is based on the observation that in real cases, the entire sphere is not fully and simultaneously traveled over by signals. A phase correspondence discontinuity located at a point of the sphere traveled by signals (fixed signals or spatial trajectories of signals) will cause a phase discontinuity. A phase correspondence discontinuity located at a point of the sphere not traveled by signals does not cause a phase discontinuity. Without a priori knowledge of the signals, a discontinuity at a fixed point cannot guarantee that no signal will pass through that point. A discontinuity at a moving point may, however, "avoid" being traveled by a signal, if its location depends on the signal. This moving discontinuity point may be part of a dynamic phase correspondence that is continuous at any other point of the sphere. The principle of dynamic phase correspondence based on avoidance of the spatial location of the signal by the discontinuity is thus established. We will establish such a phase correspondence based on this principle, other phase correspondences being possible.
(116) A phase correspondence Φ (panorama, phasediff) function is defined that is used in both conversion directions, from the inter-channel domain to the spherical domain and in the reverse direction; the panorama and the phase difference are obtained in the original domain or in the arrival domain of these two conversions as previously indicated. This function describes the phase difference between the spherical domain and the inter-channel domain:
Φ(panorama,phasediff)=ϕ.sub.s−ϕ.sub.i (44)
where ϕ.sub.s is the phase of the complex frequency coefficient of the spherical domain, and ϕ.sub.i is the intermediate phase of the inter-channel domain:
ϕ.sub.i=arg(c.sub.L)+½·phasediff=arg(c.sub.R)−½·phasediff (45)
where c.sub.L and c.sub.R are the complex frequency coefficients of the inter-channel domain. The phase correspondence function is dynamic, i.e., it varies from one temporal window to the next. This function is built with a dynamic singularity, situated at a point Ψ=(panorama.sub.singularity, phasediff.sub.singularity) of the inter-channel domain defined by a panorama value panorama.sub.singularity in [−½, ½] and a phase difference value phasediff.sub.singularity in ]−π, −π/2]. This corresponds to a zone situated behind the listener, at a slight height. It is possible to choose other zones at random. The singularity is initially located at the center of said zone, at a position Ψ.sub.0 that is called the "anchor" hereinafter. It is possible to choose other initial locations of the anchor at random within said zone. The panorama and phase difference values corresponding to the singularity are noted in the index of the phase correspondence function. A formulation of a phase correspondence function creating only one singularity is as follows. If phasediff≥−π/2:
Φ.sub.Ψ(panorama,phasediff)=−½·panorama·phasediff (46)
If phasediff<−π/2 and panorama≤−½:
Φ.sub.Ψ(panorama,phasediff)=−½·panorama·phasediff+(panorama+1)(2·phasediff+π) (47)
(117) If phasediff<−π/2 and panorama≥½:
Φ.sub.Ψ(panorama,phasediff)=−½·panorama·phasediff+(panorama−1)(2·phasediff+π) (48)
If phasediff<−π/2 and panorama ∈ ]−½, ½[, i.e., if the coordinates of the point are inside the zone of the singularity, then its coordinates are projected from the point Ψ onto the edge of the zone, and the preceding formulas are used with the coordinates of the projected point. If the point is situated exactly on Ψ despite the precautions, any point of the edge of the zone can be used.
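As a concrete illustration, the piecewise function of equations (46) to (48), together with the projection rule for points inside the singularity zone, can be sketched as follows. This is a non-authoritative Python sketch: the rectangular zone boundaries [−½, ½] × ]−π, −π/2] are taken from the text, and the projection is implemented as a straight ray from Ψ through the point to the zone edge, which is one plausible reading of "projected from the point Ψ on the edge of the zone":

```python
import math


def project_from_singularity(panorama, phasediff, pan_s, pd_s):
    """Project a point of the singularity zone from Ψ onto the zone edge,
    along the ray starting at Ψ=(pan_s, pd_s) and passing through the point."""
    dx, dy = panorama - pan_s, phasediff - pd_s
    ts = []
    if dx > 0:
        ts.append((0.5 - pan_s) / dx)        # right edge, panorama = +1/2
    if dx < 0:
        ts.append((-0.5 - pan_s) / dx)       # left edge, panorama = -1/2
    if dy > 0:
        ts.append((-math.pi / 2 - pd_s) / dy)  # top edge, phasediff = -pi/2
    if dy < 0:
        ts.append((-math.pi - pd_s) / dy)      # bottom edge, phasediff = -pi
    t = min(t for t in ts if t > 0)
    return pan_s + t * dx, pd_s + t * dy


def phase_correspondence(panorama, phasediff, pan_s, pd_s):
    """Piecewise phase correspondence Φ_Ψ of equations (46)-(48)."""
    if phasediff < -math.pi / 2 and -0.5 < panorama < 0.5:
        # Inside the singularity zone: use the projected coordinates.
        if panorama == pan_s and phasediff == pd_s:
            panorama = 0.5                   # any edge point is acceptable
        else:
            panorama, phasediff = project_from_singularity(
                panorama, phasediff, pan_s, pd_s)
    base = -0.5 * panorama * phasediff       # equation (46)
    if phasediff < -math.pi / 2 and panorama <= -0.5:
        return base + (panorama + 1) * (2 * phasediff + math.pi)  # (47)
    if phasediff < -math.pi / 2 and panorama >= 0.5:
        return base + (panorama - 1) * (2 * phasediff + math.pi)  # (48)
    return base
```

One can check numerically that the function is continuous across the phasediff = −π/2 boundary, as required by the continuity criterion of paragraph (110).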
(118) In order to prevent the point of the singularity Ψ from being situated, spatially speaking, close to a signal, it is moved in the zone in order to “flee” the location of the signal, processing window after processing window. To that end, preferably before calculating the phase correspondence, all of the frequency bands are analyzed in order to determine their respective panorama and phase difference location in the inter-channel domain, and for each one, a change vector is calculated, intended to move the point of the singularity. For example, in a favored implementation of the present invention, the change resulting from a frequency band can be calculated as follows:
(119)
(120) As norm of the change vector, where N is the number of frequency bands and d is the distance between the point Ψ and the point of coordinates (panorama, phasediff), if d≠0, 0 otherwise, and
(121)
(122) As direction of the change vector, if d≠0, {right arrow over (0)} otherwise. Preferably, for better avoidance of the trajectories, it is possible to apply a slight rotation in the plane to {right arrow over (u)}.sub.Ψ(panorama, phasediff), for example of π/16 for a sampling frequency of 48000 Hz, for sliding windows of 2048 samples and 100% padding (the value of the rotation angle being adapted based on these factors). This is useful for example when a source has a linear trajectory that passes through the point Ψ.sub.0, so that the singularity bypasses the source on one side. The change vector is then:
{right arrow over (F)}.sub.Ψ(panorama,phasediff)=f.sub.Ψ(panorama,phasediff){right arrow over (u)}.sub.Ψ(panorama,phasediff) (51)
(123) The change vectors derived from all of the frequency bands are next added, and to this sum, a vector to return the singularity to the anchor Ψ.sub.0 is added, formulated for example as follows:
{right arrow over (F)}.sub.Ψ0= 1/10(Ψ.sub.0−Ψ) (52)
where the factor 1/10 is modified according to the sampling frequency, the size of the window and the padding rate like for the rotation. The resulting change vector Σ {right arrow over (F)} is applied to the singularity in the form of a simple vector addition to a point:
Ψ←Ψ+Σ{right arrow over (F)} (53)
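The per-band change vectors and the update of equations (51) to (53) can be sketched as follows. The exact norm and direction formulas of paragraphs (119) to (122) are not reproduced in this excerpt, so a hypothetical inverse-distance norm 1/(N·d) and the unit vector pointing from the band's location toward Ψ are used as stand-ins; the slight π/16 rotation and the 1/10 anchor restoring factor are taken from the text:

```python
import math


def update_singularity(psi, psi_anchor, band_points, rot=math.pi / 16):
    """One per-window update of the singularity Ψ (equations (51)-(53)).
    band_points holds the (panorama, phasediff) location of each frequency
    band; psi_anchor is Ψ0.  The norm f_Ψ is a hypothetical stand-in."""
    px, py = psi
    n = len(band_points)
    fx = fy = 0.0
    cos_r, sin_r = math.cos(rot), math.sin(rot)
    for (bx, by) in band_points:
        dx, dy = px - bx, py - by        # direction pushing Ψ away from the band
        d = math.hypot(dx, dy)
        if d == 0.0:
            continue                      # the norm is defined as 0 when d == 0
        ux, uy = dx / d, dy / d
        # slight in-plane rotation, for better avoidance of linear trajectories
        ux, uy = ux * cos_r - uy * sin_r, ux * sin_r + uy * cos_r
        f = 1.0 / (n * d)                 # hypothetical norm f_Ψ (not in excerpt)
        fx += f * ux
        fy += f * uy
    # restoring vector toward the anchor Ψ0, equation (52)
    fx += (psi_anchor[0] - px) / 10.0
    fy += (psi_anchor[1] - py) / 10.0
    return (px + fx, py + fy)             # Ψ ← Ψ + ΣF, equation (53)
```

When no bands are present, the singularity simply relaxes toward its anchor, which matches the "idle" behavior described in paragraph (124).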
(124) Thus, when idle, one obtains the phase correspondence map (700) of
(125)
(126) As described above in the present document, a signal expressed in the spherical domain is characterized, for any frequency or frequency band, by an azimuth and an elevation, a magnitude and a phase.
(127) Implementations of the present invention include a means for transcoding from the spherical domain to a given audio format chosen by the user. Several techniques are presented as examples, but their adaptation to other audio formats will be trivial for a person familiar with the state of the art of sound rendering or encoding of the sound signal.
(128) A first-order spherical harmonic (or First-Order Ambisonic, FOA) transcoding may be done in the frequency domain. For each complex coefficient c corresponding to a frequency band, knowing the corresponding azimuth a and elevation e, four complex coefficients w, x, y, z corresponding to the same frequency band can be generated using the following formulas:
(129)
(130) The coefficients w, x, y, z obtained for each frequency band are assembled to respectively generate frequency representations W, X, Y and Z of four channels, and the application of the frequency-to-time transform (inverse of that used for the time-to-frequency transform), any clipping, then the overlap of the successive time windows obtained makes it possible to obtain four channels that are a first-order spherical harmonic temporal representation of the three-dimensional audio signal. A similar approach can be used for transcoding to a higher-order Ambisonics (HOA) format of order greater than or equal to 2, by completing equation (54) with the encoding formulas for the considered order.
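The encoding formulas of equation (54) are not reproduced in this excerpt. The following sketch uses the standard first-order Ambisonics encoding gains for a plane wave from azimuth a and elevation e; conventions such as SN3D or FuMa may apply an additional scaling to the W channel, so the patent's own normalization may differ:

```python
import math


def foa_encode(c, azimuth, elevation):
    """Encode one frequency-band complex coefficient c of the spherical
    domain into first-order Ambisonics coefficients w, x, y, z, using the
    standard encoding gains (normalization convention assumed)."""
    w = c                                             # omnidirectional component
    x = c * math.cos(azimuth) * math.cos(elevation)   # front-back axis
    y = c * math.sin(azimuth) * math.cos(elevation)   # left-right axis
    z = c * math.sin(elevation)                       # vertical axis
    return w, x, y, z
```

Applying this to every frequency band yields the W, X, Y, Z frequency representations assembled in paragraph (130).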
(131) Transcoding to a surround 5.0 format including five left, center, right, rear left and rear right channels can be done as follows.
(132) For each frequency or frequency band, the coefficients c.sub.L, c.sub.c, c.sub.R, c.sub.Ls, c.sub.Rs, respectively corresponding to the speakers usually called L, C, R, Ls, Rs, are calculated as follows, from the azimuth and elevation coordinates a and e of the direction of origin vector and the complex frequency coefficient c.sub.s. The gains g.sub.L, g.sub.C, g.sub.R, g.sub.Ls, g.sub.Rs are defined as the gains that will be applied to the coefficient c.sub.s to obtain the complex frequency coefficients of the output coefficient tables, as well as two gains g.sub.B and g.sub.T corresponding to virtual speakers allowing a redistribution of the signals into "Bottom", i.e., with a negative elevation, and "Top", i.e., with a positive elevation, toward the other speakers.
(133)
then gains g.sub.B and g.sub.T are redistributed between the other coefficients:
(134)
(135) Lastly, the frequency coefficients of the various channels are obtained by:
(136)
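The gain formulas of paragraphs (133) to (136) are not reproduced in this excerpt, so the following is a purely hypothetical stand-in rather than the patent's own method: pairwise constant-power panning between adjacent ITU-R BS.775 speaker azimuths, with the elevation energy (the virtual bottom/top gains) redistributed equally over the five real speakers:

```python
import math

# ITU-R BS.775 speaker azimuths in degrees (positive = left), sorted
SPK = [("Rs", -110.0), ("R", -30.0), ("C", 0.0), ("L", 30.0), ("Ls", 110.0)]


def pan_50(c, azimuth, elevation):
    """Hypothetical 5.0 transcoding sketch for one frequency band: returns
    a dict of complex coefficients per speaker.  Angles in degrees."""
    a = (azimuth + 180.0) % 360.0 - 180.0       # wrap azimuth to [-180, 180)
    gains = {name: 0.0 for name, _ in SPK}
    # adjacent speaker pairs on the circle, including the rear wrap Ls->Rs
    pairs = list(zip(SPK, SPK[1:])) + [
        ((SPK[-1][0], SPK[-1][1]), (SPK[0][0], SPK[0][1] + 360.0))]
    for (n1, a1), (n2, a2) in pairs:
        aa = a if a >= a1 else a + 360.0
        if a1 <= aa <= a2:
            t = (aa - a1) / (a2 - a1)           # position inside the pair
            gains[n1] = math.cos(t * math.pi / 2)   # constant-power law
            gains[n2] = math.sin(t * math.pi / 2)
            break
    # redistribute the virtual top/bottom gain equally over the 5 speakers
    gv = abs(math.sin(math.radians(elevation)))
    gh = math.cos(math.radians(elevation))
    return {n: c * (gh * g + gv / math.sqrt(5)) for n, g in gains.items()}
```

A source straight ahead at zero elevation goes entirely to the center channel, as expected of any 5.0 panning law.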
(137) Transcoding into an L-C-R-Ls-Rs 5.0 multichannel audio format, to which a T zenith channel ("top" or "voice of God" channel) is added, can also be done in the frequency domain. During the redistribution of the gains of the virtual channels, only the redistribution of the "bottom" gain g.sub.B is then done:
(138)
and the frequency coefficients of the various channels are obtained by:
(139)
(140) The six complex coefficients thus obtained for each frequency band are assembled to respectively generate frequency representations of six channels L, C, R, Ls, Rs and T, and the application of the frequency-to-time transform (inverse of that used for the time-to-frequency transform), any clipping, then the overlap of the successive time windows obtained makes it possible to obtain six channels in the temporal domain.
(141) Furthermore, for a format having any spatial arrangement of the channels, it will advantageously be possible to apply a three-dimensional VBAP algorithm to obtain the desired channels, while guaranteeing, if needed, a good triangulation of the sphere by adding virtual channels that are redistributed toward the final channels.
(142) A transcoding of a signal expressed in the spherical domain toward a binaural format may also be done. It may for example be based on the following elements: a database including, for a plurality of frequencies, for a plurality of directions in space, and for each ear, the complex coefficients (magnitude and phase) of the Head-Related Transfer Function (HRTF) filters expressed in the frequency domain; a projection of said database in the spherical domain to obtain, for a plurality of directions and for each ear, a complex coefficient for each frequency from among a plurality of frequencies; a spatial interpolation of said complex coefficients, for any frequency from among a plurality of frequencies, so as to obtain a plurality of complex spatial functions continuously defined on the unit sphere, for each frequency from among a plurality of frequencies. This interpolation can be done in a bilinear or spline manner, or via spherical harmonic functions.
(143) One thus obtains a plurality of functions on the unit sphere, for any frequency, describing the frequency behavior of said HRTF database for any point of the spherical space. Since, for any frequency from a plurality of frequencies, it is established that said spherical signal is described by a direction of origin (azimuth, elevation) and a complex coefficient (magnitude, phase), said interpolation-projection next makes it possible to perform the binauralization operation of the spherical signal, as follows: for each frequency and for each ear, given the direction of origin of said spherical signal, one establishes the value of said complex spatial function previously established by projection and interpolation, resulting in a HRTF complex coefficient; for each frequency and for each ear, said HRTF complex coefficient is then multiplied by the complex coefficient corresponding to the spherical signal, resulting in a left ear frequency signal and a right ear frequency signal; a frequency-to-time transform is then done, yielding a dual-channel binaural signal.
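The per-band multiplication step described above can be sketched as follows. The database layout (`hrtf_db` as a list of direction/coefficient entries) is a hypothetical structure introduced for illustration, and nearest-neighbour lookup stands in for the bilinear, spline or spherical-harmonic interpolation described in the text:

```python
import math


def binauralize_band(c, azimuth, elevation, hrtf_db):
    """Per-band binauralization sketch: pick the HRTF coefficient pair for
    the band's direction of origin and multiply it by the band's complex
    coefficient.  hrtf_db: list of (azimuth, elevation, h_left, h_right)
    entries for one frequency (hypothetical layout)."""

    def angular_dist(a1, e1, a2, e2):
        # great-circle angle between two directions on the unit sphere
        s = (math.sin(e1) * math.sin(e2)
             + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
        return math.acos(max(-1.0, min(1.0, s)))

    _, _, h_l, h_r = min(
        hrtf_db, key=lambda e: angular_dist(azimuth, elevation, e[0], e[1]))
    # one left-ear and one right-ear frequency coefficient for this band
    return c * h_l, c * h_r
```

Repeating this for every band, then applying the frequency-to-time transform, yields the dual-channel binaural signal of paragraph (143).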
(144) Furthermore, the spherical harmonic formats are often used as intermediate formats before decoding on speaker constellations or decoding by binauralization. The multichannel formats obtained via VBAP rendering are also subject to binauralization. Other types of transcoding can be obtained by using standard spatialization techniques such as pairwise panning with or without horizontal layers, SPCAP, VBIP or even WFS. It is lastly necessary to note the possibility of changing the orientation of the spherical field, by altering the direction vectors using simple geometric operations (rotations around an axis, etc.). By applying this capability, it is possible to perform an acoustic compensation of the rotation of the listener's head, if it is captured by a head-tracking device, just before applying a rendering technique. This method allows a perceptual gain in location precision of the sound sources in space; this is a known phenomenon in the field of psychoacoustics: small head movements allow the human auditory system to better locate sound sources.
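The orientation change just described can be sketched as follows, here restricted to compensating the listener's head yaw by offsetting every direction-of-origin azimuth; a full implementation would apply a 3-D rotation matrix covering pitch and roll as well. The function name and the yaw-only restriction are illustrative assumptions:

```python
import math


def rotate_direction_yaw(azimuth, elevation, head_yaw):
    """Rotate one direction-of-origin vector (azimuth, elevation), in
    radians, by the opposite of the listener's head yaw, then wrap the
    azimuth back into ]-pi, pi]."""
    a = azimuth - head_yaw
    return math.atan2(math.sin(a), math.cos(a)), elevation
```

Applying this to every band's spherical coordinates just before the rendering step implements the head-tracking compensation as close as possible to the output.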
(145) By applying conversion techniques between the two domains that were previously described, the encoding of a spherical signal can be done as follows. The spherical signal is made up of temporally successive tables each corresponding to a representation over a temporal window of the signal, these windows overlapping. Each table is made up of pairs (complex frequency coefficient, coordinates on the sphere in azimuth and elevation), each pair corresponding to a frequency band. The original spherical signal is obtained from spatial analysis techniques like those described, which convert an FOA signal into a spherical signal. The encoding makes it possible to obtain temporally successive pairs of complex frequency coefficient tables, each table corresponding to a channel, for example left (L) and right (R).
(146)
(147)
(148) The representation in the form of temporally successive pairs of complex frequency coefficient tables is generally not kept as is; the application of the appropriate frequency-to-time inverse transform (the inverse of the direct transform used upstream), such as the frequency-to-time part of the short-term Fourier transform, makes it possible to obtain a pair of channels in the form of temporal samples.
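The reconstruction step just described can be sketched as follows, assuming full complex spectra and a rectangular synthesis window; windowing and normalization details depend on the transform actually chosen. A naive O(n²) inverse DFT keeps the sketch self-contained, where a real implementation would use an inverse FFT:

```python
import cmath
import math


def istft_overlap_add(frames, hop):
    """Frequency-to-time reconstruction sketch: each table of complex
    frequency coefficients is inverse-transformed, then the successive
    time windows are overlapped and added.  frames: list of equal-length
    complex spectra; hop: step in samples between successive windows."""
    n = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for f_idx, spec in enumerate(frames):
        for t in range(n):
            # inverse DFT; the time signal is real, so keep the real part
            s = sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n)) / n
            out[f_idx * hop + t] += s.real
    return out
```

With 50% overlap (hop = n/2) and a suitable analysis/synthesis window pair, the overlapping windows sum back to the original pair of temporal channels.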
(149) Pursuant to the domain conversion techniques previously described, the decoding of a stereo signal encoded with the technique previously presented can be done as follows. The input signal being in the form of a pair of channels that are generally temporal, a transform such as the short-term Fourier transform is used to obtain temporally successive pairs of complex frequency coefficient tables, each coefficient of each table corresponding to a frequency band. In each pair of tables corresponding to a temporal window, the coefficients corresponding to the same frequency band are paired. The decoding makes it possible to obtain, for each temporal window, a spherical representation of the signal, in pair table form (complex frequency coefficient, coordinates on the sphere in azimuth and elevation). Here is the sequence of the decoding technique for each temporal window successively processed, illustrated in
(150)
(151) A pair table (complex frequency coefficient, coordinates on the sphere in azimuth and elevation) is obtained, each pair corresponding to a frequency band. This spherical representation of the signal is generally not kept as is, but undergoes transcoding based on broadcasting needs: it is thus possible, as was seen above, to perform transcoding (or “rendering”) to a given audio format, for example binaural, VBAP, planar or three-dimensional multi-channel, first-order Ambisonics (FOA) or higher-order Ambisonics (HOA), or any other known spatialization method as long as the latter makes it possible to use the spherical coordinates to steer the desired position of a sound source.
(152) Since large quantities of stereo content are encoded in surround form with a mastering technique, and since the coordinates of the mastering points are generally positioned at consistent positions in the inter-channel domain, the decoding of such surround content works, with a few absolute positioning defects of the sources. Therefore, in general, stereo content not intended to be played on a device other than a pair of speaker systems may advantageously be processed using the decoding method, resulting in a 2D or 3D upmix of the content, the term "upmix" referring to the processing of a signal so that it can be broadcast on devices having more speaker systems than the number of original channels, each speaker system receiving a signal that is specific to it, or its virtualized equivalent in a headset.
INDUSTRIAL APPLICATIONS OF THE INVENTION
(153) The stereophonic signal resulting from the encoding of a three-dimensional audio field can be reproduced suitably without decoding on a standard stereophonic listening device, for example audio headset, sound bar or audio system. Said signal can also be processed by the mastered surround content multichannel decoding systems that are commercially available without audible artifacts appearing.
(154) The decoder according to the invention is versatile: it makes it possible simultaneously to decode content specially encoded for it, to decode content pre-existing in the mastered surround format (for example, cinematographic sound content) in a relatively satisfactory manner, and to upmix stereo content. It thus immediately finds its utility, embedded via software or hardware (for example in the form of a chip) in any system dedicated to sound broadcasting: television, hi-fi audio system, living room or home cinema amplifier, audio system on board a vehicle equipped with a multichannel broadcasting system, or even any system broadcasting for listening in headphones, via binaural rendering, optionally with head-tracking, such as a computer, a mobile telephone or a portable digital audio player. A listening device with crosstalk cancellation also allows binaural listening without headphones from at least two speakers, and allows surround or 3D listening to sound content decoded by the invention with binaural rendering. The decoding algorithm described in the present invention makes it possible to rotate the sound space by acting on the direction of origin vectors of the obtained spherical field, the direction of origin being that which would be perceived by a listener located at the center of said sphere; this capacity makes it possible to implement tracking of the listener's head (or head-tracking) in the processing chain as close as possible to the rendering, which is an important element to reduce the lag between the movements of the head and their compensation in the audible signal.
(155) An audio headset itself may embed the described decoding system in one embodiment of the present invention, optionally by adding head-tracking and binaural rendering functions.
(156) The prerequisite processing and content broadcasting infrastructure is already in place for the application of the present invention, for example the stereo audio connector technology, stereophonic digital encoding such as MPEG-2 Layer 3 or AAC, the FM or DAB stereo radio broadcasting techniques, or the wireless, cable or IP stereophonic video broadcasting standards.
(157) The encoding in the format presented in this invention is done at the end of multichannel or 3D mastering (finalization), from a FOA field via a conversion to a spherical field like one of those presented in this document or from another technique. The encoding may also be done on each source added to the sound mixing, independently of one another, using spatialization or panoramic tools embedding the described method, which makes it possible to perform 3D mixing on digital audio workstations only supporting 2 channels. This encoded format may also be stored or archived on any medium only comprising two channels, or for size compression purposes.
(158) The decoding algorithm makes it possible to obtain a spherical field, which may be altered, by deleting the spherical coordinates while only keeping the complex frequency coefficients, in order to obtain a mono downmix. This process may be implemented by software, or hardware to embed it in an electronic chip, for example embedded in monophonic FM listening devices.
(159) Furthermore, the content of video games and virtual reality or augmented reality systems may be stored in stereo encoded form, then decoded to be spatialized again by transcoding, for example in FOA field form. The availability of direction of origin vectors also makes it possible to manipulate the sound field using geometric operations, allowing for example zooms, or distortions that follow the sound environment, such as projecting the sphere of the directions onto the inside of a room of a video game, then deforming the direction of origin vectors by parallax. A video game or other virtual reality or augmented reality system having a surround or 3D audio format as its internal sound format may also encode its content before broadcasting; as a result, if the final listening device of the user implements the decoding method disclosed in the present invention, it provides a three-dimensional spatialization, and if the device is an audio headset implementing head-tracking (tracking the orientation of the listener's head), the binaural customization and the head-tracking allow dynamic immersive listening.
(160) The embodiments of the present invention can be carried out in the form of one or more computer programs, said computer programs running on at least one computer or on at least one embedded signal-processing circuit, locally, remotely or in a distributed manner (for example in the context of a "cloud" infrastructure).