OPTIMISED CODING OF AN ITEM OF INFORMATION REPRESENTATIVE OF A SPATIAL IMAGE OF A MULTICHANNEL AUDIO SIGNAL
20230260522 · 2023-08-17
Abstract
A method for optimised coding of a multichannel sound signal. The method includes: coding at least one audio signal channel from the original multichannel signal; dividing the original multichannel signal into frequency sub-bands; determining one covariance matrix for each frequency sub-band, representative of a spatial image of the original multichannel signal; decomposing the determined covariance matrices into eigenvalues; coding by quantisation of the parameters from the decomposition into eigenvalues including both eigenvalues and eigenvectors. Also provided are a decoding method for decoding the parameters from the decomposition into eigenvalues of the covariance matrix of the original multichannel signal, and coding and decoding devices implementing the respective methods.
Claims
1. A method for coding an original multichannel sound signal, the method being implemented by a coding device and comprising: coding at least one audio signal channel originating from the original multichannel sound signal; dividing the original multichannel sound signal into frequency sub-bands; determining a covariance matrix per frequency sub-band, representative of a spatial image of the original multichannel sound signal; decomposing the determined covariance matrices into eigenvalues; and coding by quantizing the parameters resulting from the decomposition into eigenvalues comprising both eigenvalues and eigenvectors.
2. The method as claimed in claim 1, wherein the eigenvalues are ordered before quantization and the quantizing is performed by a differential scalar quantization.
3. The method as claimed in claim 1, wherein the covariance matrix is decomposed into eigenvalues using the following steps: obtaining a matrix of eigenvectors Q such that C=QΛQ.sup.T, where C is the covariance matrix and Λ=diag(λ.sub.1, . . . , λ.sub.K) is a diagonal matrix of eigenvalues; modifying the matrix of eigenvectors as a function of a determinant value of the matrix of eigenvectors Q; converting the matrix of eigenvectors Q into the domain of generalized Euler angles; the generalized Euler angles that are obtained forming part of the parameters to be quantized.
4. The method as claimed in claim 3, wherein the generalized Euler angles are quantized by uniform quantization.
5. The method as claimed in claim 1, wherein the covariance matrix is decomposed into eigenvalues using the following steps: obtaining a matrix of eigenvectors Q such that C=QΛQ.sup.T, where C is the covariance matrix and Λ=diag(λ.sub.1, . . . , λ.sub.K) is a diagonal matrix of eigenvalues; modifying the matrix of eigenvectors as a function of a determinant value of the matrix of eigenvectors Q; converting the matrix of eigenvectors Q into the domain of quaternions; at least one quaternion that is obtained forming part of the parameters to be quantized.
6. The method as claimed in claim 5, wherein the quaternions are quantized by spherical vector quantization.
7. A method for decoding an original multichannel sound signal, the method being implemented by a decoding device and comprising: decoding at least one coded channel of the original multichannel sound signal and obtaining a decoded multichannel signal; dividing the decoded multichannel signal into frequency sub-bands; decoding parameters resulting from a decomposition of covariance matrices of the original multichannel sound signal into eigenvalues; determining the covariance matrices of the original multichannel sound signal from the decoded parameters; determining a covariance matrix, per frequency sub-band, of the decoded multichannel signal; determining a set of corrections to be made to the decoded signal based on the covariance matrices of the original multichannel sound signal and the covariance matrices of the decoded multichannel signal; and correcting the decoded multichannel signal using the determined set of corrections.
8. A coding device comprising: a processing circuit configured to code an original multichannel sound signal by: coding at least one audio signal channel originating from the original multichannel sound signal; dividing the original multichannel sound signal into frequency sub-bands; determining a covariance matrix per frequency sub-band, representative of a spatial image of the original multichannel sound signal; decomposing the determined covariance matrices into eigenvalues; and coding by quantizing the parameters resulting from the decomposition into eigenvalues comprising both eigenvalues and eigenvectors.
9. A decoding device comprising: a processing circuit configured to decode an original multichannel sound signal by: decoding at least one coded channel of the original multichannel sound signal and obtaining a decoded multichannel signal; dividing the decoded multichannel signal into frequency sub-bands; decoding parameters resulting from a decomposition of covariance matrices of the original multichannel sound signal into eigenvalues; determining the covariance matrices of the original multichannel sound signal from the decoded parameters; determining a covariance matrix, per frequency sub-band, of the decoded multichannel signal; determining a set of corrections to be made to the decoded signal based on the covariance matrices of the original multichannel sound signal and the covariance matrices of the decoded multichannel signal; and correcting the decoded multichannel signal using the determined set of corrections.
10. A non-transitory computer readable storage medium storing a computer program comprising instructions for executing a method of coding an original multichannel sound signal when the instructions are executed by a processing circuit of a coding device, wherein the method comprises: coding at least one audio signal channel originating from the original multichannel sound signal; dividing the original multichannel sound signal into frequency sub-bands; determining a covariance matrix per frequency sub-band, representative of a spatial image of the original multichannel sound signal; decomposing the determined covariance matrices into eigenvalues; and coding by quantizing the parameters resulting from the decomposition into eigenvalues comprising both eigenvalues and eigenvectors.
11. The coding device as claimed in claim 8, wherein the processing circuit comprises: a processor; and a non-transitory computer readable medium comprising instructions stored thereon which when executed by the processor configure the coding device to code the multichannel sound signal.
12. The decoding device as claimed in claim 9, wherein the processing circuit comprises: a processor; and a non-transitory computer readable medium comprising instructions stored thereon which when executed by the processor configure the decoding device to decode the multichannel sound signal.
13. A non-transitory computer readable storage medium storing a computer program comprising instructions for executing a method of decoding an original multichannel sound signal when the instructions are executed by a processing circuit of a decoding device, wherein the method comprises: decoding at least one coded channel of the original multichannel sound signal and obtaining a decoded multichannel signal; dividing the decoded multichannel signal into frequency sub-bands; decoding parameters resulting from a decomposition of covariance matrices of the original multichannel sound signal into eigenvalues; determining the covariance matrices of the original multichannel sound signal from the decoded parameters; determining a covariance matrix, per frequency sub-band, of the decoded multichannel signal; determining a set of corrections to be made to the decoded signal based on the covariance matrices of the original multichannel sound signal and the covariance matrices of the decoded multichannel signal; and correcting the decoded multichannel signal using the determined set of corrections.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0079] Other features and advantages of the invention will become more clearly apparent on reading the following description of particular embodiments, which are provided by way of simple illustrative and non-limiting examples, and of the appended drawings.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0085] A reminder is given here of the known technique for encoding (in the acoustic sense) a sound source in the ambisonic format. A mono sound source may be artificially spatialized by multiplying the associated signal by the values of the spherical harmonics associated with its direction of origin (assuming the signal is carried by a plane wave) in order to obtain the same number of ambisonic components. This involves computing the coefficients for each spherical harmonic for a position determined in azimuth θ and in elevation ϕ in the desired order:
$$B = Y(\theta,\phi) \cdot s$$
where s is the mono signal to be spatialized and Y(θ,ϕ) is the encoding vector defining the coefficients of the spherical harmonics associated with the direction (θ, ϕ) for the Mth order. One example of an encoding vector is given below for the 1st order with the SN3D convention and the order of the SID or FuMa channels:
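By way of illustration, a first-order encoding vector may be sketched as follows. This is a minimal sketch, assuming the SN3D normalization and the channel order W, X, Y, Z, with azimuth and elevation expressed in radians; the function names `foa_encoding_vector` and `encode_mono` are hypothetical and introduced only for this example.

```python
import math


def foa_encoding_vector(azimuth, elevation):
    """First-order encoding vector Y(theta, phi).

    Assumptions: SN3D normalization, channel order W, X, Y, Z,
    angles in radians.
    """
    return [
        1.0,                                      # W (omnidirectional)
        math.cos(azimuth) * math.cos(elevation),  # X
        math.sin(azimuth) * math.cos(elevation),  # Y
        math.sin(elevation),                      # Z
    ]


def encode_mono(signal, azimuth, elevation):
    """Return B = Y(theta, phi) . s as a list of K=4 component signals."""
    y = foa_encoding_vector(azimuth, elevation)
    return [[coeff * sample for sample in signal] for coeff in y]


# A mono source placed at azimuth 90 degrees, elevation 0 degrees.
B = encode_mono([0.5, -0.25, 1.0], math.radians(90.0), 0.0)
```

With this direction, the source energy appears in the W and Y components only, X and Z being zero to within rounding.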
[0086] Other normalization conventions (for example: maxN, N3D) and channel orders (for example: ACN) exist, and the various embodiments are then adapted according to the convention used for the order or the normalization of the ambisonic components (FOA or HOA). This is tantamount to modifying the order of the rows of Y(θ,φ) or multiplying these rows by predefined constants.
[0087] For higher orders, the coefficients Y(θ,ϕ) of the spherical harmonics may be found in the book by B. Rafaely, Fundamentals of Spherical Array Processing, Springer, 2015. In general, for an order M, there are K=(M+1).sup.2 ambisonic signals.
[0088] Likewise, a reminder will be given here of a few concepts regarding ambisonic rendering by loudspeakers. An ambisonic sound is not meant to be listened to as such; for immersive listening on loudspeakers or on headphones, a “decoding” step in the acoustic sense, also called rendering (performed by a “renderer”), has to be carried out. Consideration is given to the case of N (virtual or physical) loudspeakers distributed over a sphere, typically with a unit radius, and whose directions (θ.sub.n, ϕ.sub.n), n=0, . . . , N−1, in terms of azimuth and elevation, are known. Decoding, as considered here, is a linear operation that consists in applying a matrix D to the ambisonic signals B in order to obtain the signals s.sub.n of the loudspeakers, which may be combined into a matrix S=[s.sub.0, . . . , s.sub.N-1], S=D·B, where
[0089] The matrix D may be decomposed into row vectors d.sub.n, that is to say
$$D = \begin{bmatrix} d_0 \\ \vdots \\ d_{N-1} \end{bmatrix}$$
[0090] d.sub.n may be seen as a weighting vector for the nth loudspeaker, used to recombine the components of the ambisonic signal and compute the signal played on the nth loudspeaker: s.sub.n=d.sub.n·B.
[0091] There are multiple methods for “decoding” in the acoustic sense. What is known as the “basic decoding” method, also called “mode-matching”, is based on the encoding matrix E associated with all of the directions of virtual loudspeakers:
$$E = [\,Y(\theta_0,\phi_0)\ \dots\ Y(\theta_{N-1},\phi_{N-1})\,]$$
According to this method, the matrix D is typically defined as the pseudo-inverse of E:
$$D = \mathrm{pinv}(E) = E^{T}(E \cdot E^{T})^{-1}$$
[0092] As an alternative, the method that may be called the “projection” method gives similar results for certain regular distributions of directions, and is described by the equation:
[0093] In the latter case, it may be seen that, for each direction of index n,
[0094] In the context of this invention, such matrices will serve as a directional beamforming matrix that describes how to obtain signals characteristic of directions in space in order to perform an analysis and/or spatial transformations.
[0095] In the context of the present invention, it is useful to describe the reciprocal conversion for passing from the loudspeaker domain to the ambisonic domain. The successive application of the two conversions should exactly reproduce the original ambisonic signals if no intermediate modification is applied in the loudspeaker domain. The reciprocal conversion is therefore defined as bringing into play the pseudo-inverse of D:
$$\mathrm{pinv}(D) \cdot S = D^{T}(D \cdot D^{T})^{-1} \cdot S$$
When the number of loudspeakers N is equal to the number of ambisonic components K=(M+1).sup.2, the matrix D is of size K×K and is able to be inverted under certain conditions; in this case: B=D.sup.−1·S
[0096] In the case of the “mode-matching” method, it appears that pinv(D)=E. In some variants, other methods for decoding using D may be used, with the corresponding inverse conversion E; the only condition to be met is that the combination of the decoding using D and the inverse conversion using E should give a perfect reconstruction (when no intermediate processing operation is performed between the acoustic decoding and the acoustic encoding).
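The perfect-reconstruction condition may be checked numerically. The following is a minimal sketch, assuming first-order SN3D components in the order W, X, Y, Z and eight virtual loudspeakers placed at the vertices of a cube (both assumptions are made for illustration only). The matrix D is computed as the pseudo-inverse of E in the form E.sup.T(E·E.sup.T).sup.−1, and the product E·D is then expected to be the K×K identity. The helper names are hypothetical.

```python
import math


def encoding_vector(az, el):
    # First-order SN3D, W/X/Y/Z order (assumption for this sketch).
    return [1.0, math.cos(az) * math.cos(el),
            math.sin(az) * math.cos(el), math.sin(el)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Eight virtual loudspeakers at the vertices of a cube.
el = math.atan(1.0 / math.sqrt(2.0))  # about 35.26 degrees
directions = [(math.radians(az), sign * el)
              for sign in (1, -1) for az in (45, 135, 225, 315)]

# E is K x N: one column Y(theta_n, phi_n) per loudspeaker.
E = transpose([encoding_vector(az, e) for az, e in directions])
Et = transpose(E)

# Mode-matching decoding matrix D (N x K), pseudo-inverse of E.
D = matmul(Et, inverse(matmul(E, Et)))

# Decoding with D followed by re-encoding with E should be the identity.
ED = matmul(E, D)
```

For layouts in which E·E.sup.T is poorly conditioned, a regularization term would be added before the inversion.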
[0097] Such variants are for example given by:
[0098] “mode-matching” decoding, with a regularization term, in the following form: D.sup.T(D·D.sup.T+ε.sub.D I).sup.−1, where ε.sub.D is a low value (for example 0.01),
[0099] “in phase” or “max-rE” decoding, known from the prior art,
[0100] or variants in which the distribution of the directions of the loudspeakers is not regular over the sphere.
[0101] The method described below is based on transmitting a spatial image representation in the form of a covariance matrix and correcting spatial degradations, in particular to ensure that the spatial image of the decoded signal is as close as possible to the original signal. Unlike known parametric coding approaches for stereo or multichannel signals, in which perceptual cues are coded, the invention is not based on a perceptual interpretation of spatial image information, since the ambisonic domain is not directly “hearable”.
[0102] In the embodiment described below, coding is carried out with an optional downmix/upmix and uses a map of the original ambisonic sound scene, hereinafter called the spatial image. Upon coding, a certain number of channels (preferably lower than the number of input channels) are transmitted to the decoder. These channels may be a subset of the original channels (for example: W or X channel, 4 channels of the 3D FOA, 3 channels of the planar FOA, etc.) or a re-mastering of the input channels (for example: stereo downmix resulting from an FOA input). In addition to these channels, the encoder transmits information resulting from a map of the original sound scene. This information may be defined for a single frequency band (for example: 0-16000 or 100-14000 Hz for a signal sampled at 32 kHz) but, in the preferred embodiment, the spectrum is divided into sub-bands (which may be derived from existing Bark or Mel divisions or other divisions, as described later). According to one embodiment of the invention, the spatial image of the original sound scene is a covariance matrix as defined later. Optimized coding of this covariance matrix is provided in order to reduce the coding rate of this representation of a spatial image, especially when it is defined by sub-bands.
[0103] Upon decoding, the received and decoded signals are optionally extended by an “upmix” (by decorrelation), described below. Depending on the type of information received, a map of the “degraded” sound scene is produced. A transformation operation is determined in order to recreate the original sound scene. This transformation is determined at the decoder based on the received and decoded original map (information from the spatial image of the original signal) and on the degraded map (information from the spatial image of the decoded multichannel signal).
[0104] In some variants, the downmix/upmix may be replaced by direct coding of the channels, for example multimono or multistereo.
[0106] The original multichannel signal B of dimension K×L (that is to say K components of L time or frequency samples) is at the input of the encoder.
[0107] What is of interest here is the case of a multichannel signal with an ambisonic representation, as described above. The invention may also be applied to other types of multichannel signal, such as a B-format signal with modifications, such as for example the suppression of certain components (for example: suppression of the 2nd-order R component so as to keep only 8 channels) or the matrixing of the B-format in order to pass to an equivalent domain (called “Equivalent Spatial Domain”) as described in the 3GPP TS 26.260 specification—another example of matrixing is given by “channel mapping 3” of the IETF Opus codec and in the 3GPP TS 26.918 specification (clause 6.1.6.3).
[0108] In the embodiment thus described, the input signal is sampled at 32 kHz. The encoder operates in frames that are preferably 20 ms long, that is to say L=640 samples per frame at 32 kHz. In some variants, other frame lengths and sampling frequencies are possible (for example L=480 samples per frame of 10 ms at 48 kHz).
[0109] In one preferred embodiment, the spatial image is coded in sub-bands in the frequency domain after a temporal short-term discrete Fourier transform (STFT) (on one or more bands), but, in some variants, the invention may be implemented in sub-bands by applying a real or complex filter bank to process sub-bands in the time domain, or using another type of transform such as the modified discrete cosine transform (MDCT) or the Modulated Complex Lapped Transform (MCLT).
[0110] A block 110 for reducing the number of channels (DMX) is optionally implemented. This consists for example, for a 1st-order ambisonic input signal, in keeping only the W channel and, for an ambisonic input signal of order >1, in keeping only the first 4 ambisonic components W, X, Y, Z (therefore in truncating the signal to the 1st order). Other types of downmix (selection of a subset of channels and/or matrixing, use of “delay-sum beamforming”) may be implemented without this modifying the method according to the invention.
[0111] Block 111 codes the audio signal b′.sub.k, k=1, . . . , K.sub.dmx (where K.sub.dmx≤K) of B′ at the output of block 110.
[0112] In one preferred embodiment, block 111 uses multi-mono coding (COD) with a variable allocation, in which the core codec is the standard 3GPP EVS codec. In this multi-mono approach, each channel b′.sub.k is coded separately by one instance of the codec; however, in some variants, other coding methods are possible, for example multi-stereo coding or joint multichannel coding. This therefore gives, at the output of this coding block 111, at least one coded channel of an audio signal resulting from the original multichannel signal, in the form of a bitstream that is sent to the multiplexer 140.
[0113] Block 120 extracts a given frequency band (which may correspond to the full-band signal or to a restricted band) or carries out a division into multiple frequency sub-bands. In some variants, the extraction of a given band or the division into sub-bands may reuse equivalent processing operations performed in blocks 110 or 111.
[0114] In general, the division into sub-bands may be uniform or non-uniform.
[0115] In one preferred embodiment, when the signal is not coded in a single frequency band, the channels of the original multichannel audio signal are divided into frequency sub-bands using frequency intervals defined on the Bark scale.
[0116] The Bark scale is defined over the following 24 intervals (in Hz) for a signal sampled at 32 kHz:
[20, 100], [100, 200], [200, 300], [300, 400], [400, 510], [510, 630], [630, 770], [770, 920], [920, 1080], [1080, 1270], [1270, 1480], [1480, 1720], [1720, 2000], [2000, 2320], [2320, 2700], [2700, 3150], [3150, 3700], [3700, 4400], [4400, 5300], [5300, 6400], [6400, 7700], [7700, 9500], [9500, 12000], [12000, 16000]
[0117] This predefined division may be modified for a different sampling frequency in order to use a different number of bands, for example by keeping only 21 bands at 16 kHz and by changing the last interval to [6400, 8000], or by adding a band [16000, 20000] at 48 kHz. This division into sub-bands, which is implemented in the domain of the short-term discrete Fourier transform (STFT) computed on 20 ms frames with windowing over 30 ms (10 ms of past signal), is tantamount to band-pass filtering in the Fourier domain. In some variants, it is possible to apply a filter bank with or without critical sampling in order to obtain real or complex signals corresponding to the sub-bands. It will be noted that the operation of dividing into sub-bands generally involves a processing delay that depends on the type of filter bank that is implemented; according to the invention, temporal alignment may be applied before or after coding-decoding and/or before the extraction of spatial image information, such that the spatial image information is properly time-aligned with the corrected signal.
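The division into Bark sub-bands in the STFT domain may be illustrated by mapping each interval to a range of frequency bins. This is a minimal sketch, assuming a transform length of 960 points (30 ms at 32 kHz) and rounding of the band edges to the nearest bin; both choices, and the name `band_bins`, are assumptions made for this example.

```python
# Bark band edges in Hz for a signal sampled at 32 kHz (24 intervals).
BARK_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 16000]


def band_bins(edges_hz, sample_rate, fft_size):
    """Return one (first_bin, end_bin) pair per band, end_bin exclusive."""
    bounds = [round(f * fft_size / sample_rate) for f in edges_hz]
    return list(zip(bounds[:-1], bounds[1:]))


# Assumed transform length: 960 points, i.e. 30 ms at 32 kHz.
bands = band_bins(BARK_EDGES_HZ, 32000, 960)

for band_index, (start, stop) in enumerate(bands):
    print(band_index, start, stop)
```

The bins of each channel falling in the range of a given band are then the samples over which the covariance matrix of that sub-band is computed.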
[0118] The remainder of the description describes the various coding and decoding steps as though a processing operation in the complex frequency domain were involved. The real-valued case is also described as a variant.
[0119] Block 121 determines (Inf. B) information representative of a spatial image of the original multichannel signal.
[0120] In the embodiment described here, the information representative of the spatial image of the original multichannel signal is a covariance matrix of the input channels B in each frequency band determined by block 120. It will be noted that, for simplicity, the description omits the sub-band index for the matrix C. In the preferred embodiment, the invention is implemented in a complex-value transform domain, the covariance being computed as follows:
$$C = \mathrm{Re}(B \cdot B^{H})$$
to within a normalization factor.
[0121] This matrix is computed as follows in the real case:
$$C = B \cdot B^{T}$$
to within a normalization factor.
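The two covariance formulas may be illustrated by a single routine. This is a minimal sketch in which B is a list of K rows of L real or complex samples for one sub-band; the normalization factor is omitted, and the name `covariance` is hypothetical.

```python
def covariance(B):
    """Return C = Re(B . B^H) for a K x L matrix given as a list of rows.

    For real-valued input this reduces to C = B . B^T.
    No normalization factor is applied.
    """
    K = len(B)
    return [[sum((x * complex(y).conjugate()).real
                 for x, y in zip(B[i], B[j]))
             for j in range(K)]
            for i in range(K)]


# Two channels, two complex frequency bins.
B = [[1 + 1j, 2 + 0j],
     [0 + 1j, 1 - 1j]]
C = covariance(B)  # [[6.0, 3.0], [3.0, 3.0]]
```

The result is real and symmetric, which is the property relied upon by the eigenvalue decomposition.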
[0122] In the cases of a multichannel signal in the time domain, the covariance may be estimated recursively (sample by sample) in the following form:
$$C_{ij}(n) = \frac{n}{n+1}\,C_{ij}(n-1) + \frac{1}{n+1}\,b_i(n)\,b_j(n)$$
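The recursive estimate may be sketched as follows, with n counted from 0 so that the first sample initializes the matrix. The function name `update_covariance` is hypothetical; after the last sample the result equals the average of the outer products b(n)·b(n).sup.T.

```python
def update_covariance(C, b, n):
    """One recursion step: C(n) = n/(n+1) C(n-1) + 1/(n+1) b(n) b(n)^T."""
    K = len(b)
    w_old = n / (n + 1)
    w_new = 1 / (n + 1)
    return [[w_old * C[i][j] + w_new * b[i] * b[j] for j in range(K)]
            for i in range(K)]


# Three time samples of a two-channel signal.
samples = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
C = [[0.0, 0.0], [0.0, 0.0]]
for n, b in enumerate(samples):
    C = update_covariance(C, b, n)
```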
[0123] In some variants, operations of temporally smoothing the covariance matrix may be used.
[0124] In some variants, the covariance matrix C may be regularized before quantization in the form C+εI or by applying thresholding to the diagonal coefficients of C in order to ensure a minimum value ε (for example ε=10.sup.−9 if the input ambisonic signals are amplitude-normalized over the interval +/−1).
[0125] The covariance matrix C (of size K×K) is, by definition, symmetric, K being the number of ambisonic components.
[0126] Block 130 quantizes the coefficients of the matrix.
[0128] Thus, according to the invention, the covariance matrix is coded using the following steps:
[0129] It is assumed at this stage that the covariance matrix has been estimated and that it has been modified (regularized) in order to ensure that no eigenvalue is zero. This may be achieved by replacing the values Cii of the diagonal of C with Cii=max(Cii, ε), where ε is a low value fixed for example at 10.sup.−9 (if the values of the ambisonic signal in the time domain are defined in the interval +/−1). In some variants, it is possible to modify the matrix C=C+εI, where I is the identity matrix.
[0130] The covariance matrix C (thus regularized) is decomposed into eigenvalues in step S1, in the form: C=QΛQ.sup.T
where Q is an orthogonal matrix (with in particular det Q=+/−1) and Λ=diag(λ.sub.1, . . . , λ.sub.K) is a diagonal matrix of eigenvalues. Without loss of generality, it is assumed that λ.sub.1≥ . . . ≥λ.sub.K≥0. It will be noted that the regularization of C, if applied, guarantees that the eigenvalues are strictly positive.
[0131] Multiple methods are known from the prior art for carrying out this factorization: (iterative) QR decomposition, Householder transformation, Givens rotations or variants of these methods, such as the “sorted QR” decomposition. It does not matter which method is chosen. If the eigenvalues λ.sub.i (i=1 . . . K) returned by the chosen method are not all positive or are not ordered in descending order, then, according to the invention: if λ.sub.i<0, the sign of λ.sub.i and of the associated eigenvector is inverted; and the eigenvalues are permuted if necessary in order to comply with the constraint λ.sub.1≥ . . . ≥λ.sub.K≥0, the same permutations being applied to the eigenvectors (columns) in Q.
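Step S1 followed by the ordering of the eigenvalues may be sketched as follows. This is a minimal sketch using a cyclic Jacobi iteration (a sequence of Givens rotations), chosen here only because it is compact; any of the methods listed above could be substituted. The input is assumed to be a real symmetric matrix that has already been regularized, and the function names are hypothetical.

```python
import math


def jacobi_eigen(C, sweeps=50, tol=1e-24):
    """Return (eigenvalues, Q) with C ~= Q diag(eigenvalues) Q^T."""
    n = len(C)
    A = [list(map(float, row)) for row in C]
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p][q], with |theta| <= pi/4.
                d = A[q][q] - A[p][p]
                num = 2.0 * A[p][q]
                if d < 0.0:
                    num, d = -num, -d
                theta = 0.5 * math.atan2(num, d)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # A <- A G (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):   # A <- G^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
                for k in range(n):   # Q <- Q G (accumulate eigenvectors)
                    qkp, qkq = Q[k][p], Q[k][q]
                    Q[k][p] = c * qkp - s * qkq
                    Q[k][q] = s * qkp + c * qkq
    return [A[i][i] for i in range(n)], Q


def ordered_eigen(C):
    """Eigenvalues in descending order, eigenvector columns permuted alike."""
    lam, Q = jacobi_eigen(C)
    order = sorted(range(len(lam)), key=lambda i: -lam[i])
    lam_sorted = [lam[i] for i in order]
    Q_sorted = [[row[i] for i in order] for row in Q]
    return lam_sorted, Q_sorted


lam, Q = ordered_eigen([[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 5.0]])
```

For this example the eigenvalues are 5, 3 and 1, returned in that order with the corresponding eigenvectors as the columns of Q.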
[0132] In step S2, the determinant of the matrix of eigenvectors Q is computed and it is determined whether det Q=−1. If this is the case (Y in step S2), Q is modified in step S3, preferably by inverting the sign of the eigenvector associated with the lowest eigenvalue so as to obtain a rotation matrix (orthogonal matrix with det Q=+1). The matrix of eigenvectors Q obtained after steps S2 and S3 is therefore called the “rotation matrix” below.
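Steps S2 and S3 may be sketched as follows. This is a minimal sketch, assuming that the eigenvectors are the columns of Q and that they are ordered by descending eigenvalue, so that the last column is the eigenvector associated with the lowest eigenvalue. The determinant is computed by Gaussian elimination for any size K; the function names are hypothetical.

```python
def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    n = len(m)
    a = [list(map(float, row)) for row in m]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap changes the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return det


def to_rotation(Q):
    """Flip the sign of the last column of Q when det Q = -1."""
    if determinant(Q) < 0.0:
        last = len(Q) - 1
        return [list(row[:last]) + [-row[last]] for row in Q]
    return [list(row) for row in Q]


# An orthogonal matrix with determinant -1 (a permutation of two axes).
Q = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
R = to_rotation(Q)
```

Inverting the sign of one eigenvector leaves the product QΛQ.sup.T unchanged, so the covariance matrix is still represented exactly.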
[0133] For the case of the planar FOA (three channels), step S2 is adapted to compute the determinant of a 3×3 matrix and, for the 3D FOA (4 channels), the determinant of a 4×4 matrix is used. One exemplary embodiment is given for the 4×4 case in APPENDIX 3.
[0134] In step S4, the matrix Q resulting from step S3 or from step S1, depending on the value of the determinant in step S2, is converted. This conversion takes place either in the domain of Euler angles (K=3) or generalized Euler angles (K>3), or in the domain of quaternions in the case of the FOA (K=3 or 4).
[0135] The conversion into Euler angles (for K=3) is for example given in Appendix I of the article by K. Shoemake, Animating Rotation with Quaternion Curves Proc. SIG-GRAPH 1985, p. 245-254. It will be recalled that there are variants for defining the Euler angles according to the chosen axes of rotation (X,Y,Z) and according to whether or not the axes are fixed. In some variants of the invention, it is possible to use variants for defining the Euler angles other than the one adopted in the article by K. Shoemake for the conversion.
[0136] The conversion into generalized Euler angles (for K>3) is for example detailed in the article D. K. Hoffman, R. C. Raffenetti, and K. Ruedenberg, “Generalization of Euler Angles to N-Dimensional Orthogonal Matrices,” Journal of Mathematical Physics, vol. 13, no. 4, pp. 528-533, 1972. This parametrization based on generalized Euler angles is general and applies to any dimension.
[0137] In some variants, in the case K=3, it is possible to convert the rotation matrix Q (after steps S1 and S2) into a single unit quaternion. One exemplary embodiment is given in Appendix I of the article by Ken Shoemake, Animating Rotation with Quaternion Curves Proc. SIG-GRAPH 1985, p. 245-254.
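The conversion of a 3×3 rotation matrix into a unit quaternion may be sketched as follows. This is a minimal sketch of a widely used branch-on-largest-term conversion, which is not necessarily identical to the procedure of the cited article; the component order (w, x, y, z) and the name `rotation_to_quaternion` are assumptions made for this example.

```python
import math


def rotation_to_quaternion(R):
    """Convert a 3x3 rotation matrix (det = +1) to a unit quaternion.

    Returns (w, x, y, z). The branch is selected so that the divisor s
    is never close to zero.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    if trace > 0.0:
        s = 2.0 * math.sqrt(trace + 1.0)
        w = 0.25 * s
        x = (R[2][1] - R[1][2]) / s
        y = (R[0][2] - R[2][0]) / s
        z = (R[1][0] - R[0][1]) / s
    elif R[0][0] > R[1][1] and R[0][0] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2])
        w = (R[2][1] - R[1][2]) / s
        x = 0.25 * s
        y = (R[0][1] + R[1][0]) / s
        z = (R[0][2] + R[2][0]) / s
    elif R[1][1] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2])
        w = (R[0][2] - R[2][0]) / s
        x = (R[0][1] + R[1][0]) / s
        y = 0.25 * s
        z = (R[1][2] + R[2][1]) / s
    else:
        s = 2.0 * math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1])
        w = (R[1][0] - R[0][1]) / s
        x = (R[0][2] + R[2][0]) / s
        y = (R[1][2] + R[2][1]) / s
        z = 0.25 * s
    return (w, x, y, z)


# Rotation of 90 degrees about the third axis.
q = rotation_to_quaternion([[0.0, -1.0, 0.0],
                            [1.0, 0.0, 0.0],
                            [0.0, 0.0, 1.0]])
```

Since q and −q represent the same rotation, a sign convention is still needed before quantization with a hemispherical dictionary.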
[0138] In the case K=4, a double unit quaternion parametrization of Q is also possible; the double quaternion conversion is given for example in the article P. Mahé, S. Ragot, S. Marchand, “First-Order Ambisonic Coding with PCA Matrixing and Quaternion-Based Interpolation”, Proc. DAFx, Birmingham, UK, September 2019.
[0139] In step S5, the parameters obtained in step S4 are quantized. For Euler angles (K=3) or generalized Euler angles (K>3), denoted in APPENDIX 1 as angles [i] (i=1, . . . , 6 for the example K=4), in the preferred embodiment, a scalar quantization is applied for example with a quantization step (denoted in APPENDIX 1 as “stepSize”) that is identical for each angle. A budget of 5 bits for an angle defined over an interval of length π and of 6 bits for an angle defined over an interval of length 2π is defined, for example, thereby giving a budget of 33 bits for 6 generalized Euler angles (3 angles on 6 bits and 3 angles on 5 bits). A pseudo-code carrying out this quantization operation is given in APPENDIX 1. In the case K=3, with 3 Euler angles, there would be for example a budget of 17 bits (6+6+5 bits for 2 angles defined over an interval of length 2π and an angle over an interval of length π). In some variants, other methods for quantizing Euler angles may be used.
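The uniform scalar quantization of the angles may be sketched as follows. This is a minimal sketch and not the pseudo-code of APPENDIX 1: it assumes that the angles on 6 bits lie in [−π, π) and those on 5 bits in [−π/2, π/2), and that out-of-range indices are clamped. With 5 bits over π and 6 bits over 2π, the step is π/32 in both cases, consistent with a step size identical for each angle. The function names and the assignment of bit widths to the individual angles are hypothetical.

```python
import math

STEP_SIZE = math.pi / 32  # pi / 2**5 == 2*pi / 2**6


def quantize_angle(angle, lower, bits):
    """Uniform scalar quantization of an angle in [lower, lower + 2**bits * step)."""
    levels = 1 << bits
    index = int(round((angle - lower) / STEP_SIZE))
    return max(0, min(levels - 1, index))  # clamp to the available indices


def dequantize_angle(index, lower):
    return lower + index * STEP_SIZE


# Hypothetical layout for K=4: three angles on 6 bits, three on 5 bits.
layout = [(-math.pi, 6)] * 3 + [(-math.pi / 2, 5)] * 3
angles = [1.0, -2.5, 0.3, 0.7, -1.2, 0.0]

indices = [quantize_angle(a, lower, bits)
           for a, (lower, bits) in zip(angles, layout)]
decoded = [dequantize_angle(i, lower)
           for i, (lower, _) in zip(indices, layout)]
total_bits = sum(bits for _, bits in layout)  # 33
```

Away from the upper edge of each interval, the reconstruction error is at most half a step, that is to say π/64.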
[0140] In the case K=3, if the rotation matrix Q is converted into a single unit quaternion, this quaternion is preferably coded with a hemispherical spherical vector quantization dictionary in dimension 4. In one exemplary embodiment, the vertices of a polytope of dimension 4 may be taken, preferably using the vertices of a truncated (7200 vertices) or omnitruncated (14400 vertices) 600-cell as defined in the literature or even the 7200 vertices of a “120-cell snub” whose code words (coordinates in dimension 4) are available for example in: http://paulbourke.net/geometry/hyperspace/120cell_snub.ascii.gz (beginning of the file, lines 2-7201).
[0141] Spherical vector quantization is carried out by simple comparison by scalar product in dimension 4 with code words (typically normalized to unit norm). The exhaustive search for the nearest neighbor may be carried out efficiently by taking into account the possible permutations of one and the same code word in the dictionary. According to the invention, it is possible to truncate the dictionary into a hemisphere in order to retain, in the search for the nearest neighbor, only the code words whose last (fourth) component is positive (or negative according to the alternative convention that may be used in some variants). In some variants, the truncation by the sign may be carried out on one of the other three components of the unit quaternion. In some variants, the quantization dictionary might not be truncated to a hemisphere.
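The nearest-neighbor search by scalar product with a hemispherical dictionary may be sketched as follows. This is a minimal sketch using a toy dictionary of 9 unit code words whose fourth component is positive; this dictionary is hypothetical and far smaller than the polytope-based dictionaries mentioned above. The input quaternion is first moved into the same hemisphere by a sign change, which does not alter the rotation it represents.

```python
import itertools


def build_toy_codebook():
    """Toy hemispherical dictionary: unit vectors with positive 4th component."""
    words = [(0.0, 0.0, 0.0, 1.0)]
    for head in itertools.product((0.5, -0.5), repeat=3):
        words.append(head + (0.5,))
    return words


def quantize_quaternion(q, codebook):
    """Return the index of the code word with the largest scalar product."""
    if q[3] < 0.0:
        q = tuple(-v for v in q)  # q and -q represent the same rotation
    scores = [sum(a * b for a, b in zip(q, word)) for word in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


codebook = build_toy_codebook()
idx_a = quantize_quaternion((0.1, -0.1, 0.05, -0.98), codebook)
idx_b = quantize_quaternion((0.5, 0.5, 0.5, 0.5), codebook)
idx_c = quantize_quaternion((-0.5, -0.5, -0.5, -0.5), codebook)
```

For unit-norm code words, maximizing the scalar product is equivalent to minimizing the Euclidean distance, which is why no distance computation is needed.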
[0142] No reminder is given here of the known principles of spherical vector quantization with the use of “leaders”, which are for example defined in the article by C. Lamblin and J.-P. Adoul, Algorithme de quantification vectorielle sphérique à partir du réseau de Gosset d'ordre 8. [Spherical vector quantization algorithm based on 8th-order Gosset lattice] Ann. Télécommun., vol. 43, no. 3-4, pp. 172-186, 1988 (Lamblin, 1988). Here, the scalar product is computed for all elements in the dictionary (with or without restriction to the hemisphere) and the number of computations may be equivalently reduced to a subset by listing the signed or unsigned “leaders” in a pre-computed table. The computation of the quantization index is given either by the index in the exhaustive table or by the addition of a permutation index and a cardinality offset, according to approaches known to those skilled in the art. One example of spherical quantization (which may be easily adapted) is found in clause 6.6.9 of ITU-T Recommendation G.729.1.
[0143] In the case K=4, in the case of double quaternions, the pair of unit quaternions q.sub.1 and q.sub.2 is quantized by a spherical quantization dictionary in dimension 4; by convention, q.sub.1 is quantized with a hemispherical dictionary (because q.sub.1 and −q.sub.1 correspond to one and the same 3D rotation) and q.sub.2 is quantized with a spherical dictionary. Examples of dictionaries may be given by predefined points in polyhedra of dimension 4. The quantization dictionaries for q.sub.1 and q.sub.2 may be interchanged for the quantization. The quantization is implemented as explained above by repeating the case K=3 for q.sub.1 and q.sub.2 with a hemispherical and a spherical dictionary.
[0144] In step S5, the matrix of eigenvalues is also coded. According to the invention, the eigenvalues are ordered such that
$$\lambda_1 \ge \dots \ge \lambda_K \ge 0$$
[0145] In one exemplary embodiment, a differential scalar quantization on a logarithmic scale is used.
[0146] One example of quantization is that of coding λ.sub.1 in absolute terms on 5 bits, and then coding each difference (in dB) between λ.sub.k and λ.sub.k-1 on 4 bits, that is to say a budget of 5+3×4=17 bits for K=4. One exemplary embodiment is given in APPENDIX 4 using a logarithm in base 2; in some variants, a base 10 (or other base) may be used. In some variants, other implementations of the logarithmic scalar quantization may be used.
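A differential scalar quantization of the ordered eigenvalues on a logarithmic scale may be sketched as follows. This is a minimal sketch and not the embodiment of APPENDIX 4: the base-2 step of 1.5, the lowest representable level of 2.sup.−30 and the function names are assumptions, and the bit widths are free parameters. The call below uses 5 bits for λ.sub.1 and 4 bits per difference, so that K=4 totals 17 bits. Each difference is coded relative to the previously decoded value, which avoids error accumulation.

```python
import math


def quantize_eigenvalues(lams, abs_bits, diff_bits, step=1.5, floor_log2=-30.0):
    """Quantize strictly positive eigenvalues sorted in descending order."""
    level = round((math.log2(lams[0]) - floor_log2) / step)
    level = max(0, min((1 << abs_bits) - 1, level))
    indices = [level]
    prev = floor_log2 + level * step  # decoded log2 of the previous value
    for lam in lams[1:]:
        drop = round((prev - math.log2(lam)) / step)
        drop = max(0, min((1 << diff_bits) - 1, drop))  # drops are >= 0
        indices.append(drop)
        prev -= drop * step
    return indices


def dequantize_eigenvalues(indices, step=1.5, floor_log2=-30.0):
    log_val = floor_log2 + indices[0] * step
    decoded = [2.0 ** log_val]
    for drop in indices[1:]:
        log_val -= drop * step
        decoded.append(2.0 ** log_val)
    return decoded


lams = [1.0, 0.25, 0.01, 1e-4]
indices = quantize_eigenvalues(lams, abs_bits=5, diff_bits=4)
decoded = dequantize_eigenvalues(indices)
```

Because each coded difference is non-negative, the decoded eigenvalues remain in descending order, as required for the reconstruction of the covariance matrix.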
[0147] It is also possible to use a vector quantization after having converted the eigenvalues into the logarithmic domain, for example by using a Pyramidal Vector Quantization (PVQ) described in the article T. Fischer, “A pyramid vector quantizer,” IEEE transactions on information theory, vol. 32, no. 4, p. 568-583, 1986, or in variants (as in the Opus codec defined in IETF RFC 6716). Vector quantization uses only one quadrant of the possible code words because the eigenvalues are positive and ordered, and therefore code word indexing is able to be simplified to account for these two constraints. For the case of PVQ, one preferred exemplary embodiment scales the eigenvalues before applying the search to a pyramid face of dimension 4.
[0148] In some variants, it is possible to normalize the eigenvalues so as to code only K−1 normalized eigenvalues λ.sub.2/λ.sub.1, . . . , λ.sub.K/λ.sub.1. A scalar quantization is then used on a logarithmic scale on 14 bits for K=4. In this case, the same normalization constraint should be applied to the decoding on the covariance matrix computed on the decoded signal. The exemplary embodiment may be adapted to code differential indices directly.
[0149] In some variants, the eigenvalues resulting from the decomposition of the matrix C may be quantized predictively using an inter-frame or intra-frame prediction. In other variants, if the coding uses a division into multiple sub-bands, it is possible to use a joint quantization of the eigenvalues of all of the sub-bands.
[0150] The quantization indices of the rotation matrix and of the matrix of eigenvalues are sent to the multiplexer (block 140).
[0151] The quantized values (index_angle[i], etc.) are sent to the multiplexer 140.
[0152] In the exemplary implementation for the 4-channel FOA case (with 6 generalized Euler angles coded on 33 bits and 4 eigenvalues coded on 17 bits), this therefore gives a budget of 50 bits (that is to say 2.5 kbit/s) to code a covariance matrix of size 4×4 in each sub-band. By way of example, if a division into sub-bands is defined with respectively 4, 6, 12 or 24 sub-bands and if a covariance matrix is transmitted for each of the sub-bands, this gives a rate of “meta-data” describing the spatial image of 10, 15, 30, or 60 kbit/s.
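By way of illustration, the arithmetic of this bit budget may be checked as follows. The frame rate of 50 frames per second (20 ms frames) is an assumption inferred from the equivalence between 50 bits and 2.5 kbit/s; the constant and function names are hypothetical.

```python
ANGLE_BITS = [5, 5, 6, 5, 6, 6]   # 5 + excess_bit[i], as in APPENDIX 1
EIGENVALUE_BITS = 17              # budget stated for the 4 eigenvalues
FRAMES_PER_SECOND = 50            # assumption: one matrix per 20 ms frame

bits_per_matrix = sum(ANGLE_BITS) + EIGENVALUE_BITS


def metadata_rate_kbps(n_subbands):
    """Rate of spatial metadata when one matrix is sent per sub-band and frame."""
    return n_subbands * bits_per_matrix * FRAMES_PER_SECOND / 1000


rates = [metadata_rate_kbps(n) for n in (4, 6, 12, 24)]
```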
[0153] The decoder illustrated in
[0154] Block 160 decodes (Q−1) the covariance matrix in each band or sub-band defined by the encoder or other information representative of the spatial image of the original signal. In order not to overload the notations, the decoded covariance matrix is also denoted C like in the encoder.
[0155] Block 160 implements the steps illustrated in
[0156] If the matrix Q has been coded in the domain of generalized Euler angles, block 160 may decode, in S′1, the quantization indices of the generalized Euler angles. In the (4-channel) FOA case, the corresponding pseudocode is given in APPENDIX 2. The same approach is easily adapted to the case of three Euler angles for K=3 or in the general case K>3.
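By way of non-limiting illustration, the uniform scalar quantization of the 6 generalized Euler angles (APPENDIX 1) and the corresponding decoding (APPENDIX 2) may be sketched in runnable form as follows. The function names and the test angles are hypothetical.

```python
import math

MIN_ANGLE = [-math.pi / 2, -math.pi / 2, -math.pi, -math.pi / 2, -math.pi, -math.pi]
MAX_ANGLE = [math.pi / 2, math.pi / 2, math.pi, math.pi / 2, math.pi, math.pi]
EXCESS_BIT = [0, 0, 1, 0, 1, 1]


def step_size(i):
    bits = 5 + EXCESS_BIT[i]
    return (MAX_ANGLE[i] - MIN_ANGLE[i]) / (1 << bits)


def quantize_angle(i, angle):
    """Encoder side: index of the nearest reconstruction level, with wrap-around."""
    bits = 5 + EXCESS_BIT[i]
    index = int((angle - MIN_ANGLE[i]) / step_size(i) + 0.5)
    return index % (1 << bits)


def dequantize_angle(i, index):
    """Decoder side: reconstruction level associated with the index."""
    return index * step_size(i) + MIN_ANGLE[i]


angles = [0.3, -1.0, 2.5, 0.0, -3.0, 1.2]
indices = [quantize_angle(i, a) for i, a in enumerate(angles)]
angles_q = [dequantize_angle(i, k) for i, k in enumerate(indices)]
```

The step is the same (π/32) for all six angles, since the angles defined over an interval of width 2π receive one additional bit.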
[0157] If the matrix Q has been coded in the domain of quaternions, the one or more quantization indices, corresponding for example to a code word in a quantization dictionary in dimension 4, is or are decoded (possibly restricted to one hemisphere by restricting the sign of one of the components of the unit quaternion in the dictionary).
[0158] In step S′2, block 160 reconstructs the decoded matrix Q by applying the conversion of generalized Euler angles or one or more quaternions to a rotation matrix, for example in accordance with the abovementioned articles for the encoder portion.
[0159] The eigenvalues are also decoded in S′1, so as to obtain Λ=diag(λ.sub.1, . . . , λ.sub.K), and then the covariance matrix is computed in step S′3: C=QΛQ.sup.T.
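By way of non-limiting illustration, step S′3 may be sketched as follows for a toy case K=2, in which the rotation matrix is defined by a single angle. The helper names are hypothetical, and the same product applies unchanged to K=4.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def reconstruct_covariance(q, eigenvalues):
    """Compute C = Q . diag(eigenvalues) . Q^T."""
    k = len(eigenvalues)
    lam = [[eigenvalues[i] if i == j else 0.0 for j in range(k)] for i in range(k)]
    return matmul(matmul(q, lam), transpose(q))


theta = math.pi / 6
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
C = reconstruct_covariance(Q, [4.0, 1.0])
```

The reconstructed matrix is symmetric and its trace is equal to the sum of the decoded eigenvalues.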
[0160] Block 170 of
[0161] The decoding implemented in block 170 makes it possible to obtain a decoded audio signal {circumflex over (B)}′, which is sent as input to upmix block 171. Block 171 thus implements a step (UPMIX) of increasing the number of channels. In one embodiment of this step, for the channel of a mono signal {circumflex over (B)}′, this consists in convolving the signal {circumflex over (B)}′ using various spatial impulse responses that implement power-normalized all-pass decorrelator filters on the various channels of the signal {circumflex over (B)}′. In some variants, the signal {circumflex over (B)}′ may also be convolved using spatial room impulse responses (SRIR); these SRIRs are set to the original ambisonic order of B. In other variants, the decorrelation will be implemented in a transformed domain or in sub-bands (by applying a real or complex filter bank).
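By way of non-limiting illustration, an upmix of a mono signal by all-pass decorrelation may be sketched as follows. The Schroeder all-pass section used here is a simple stand-in chosen for brevity and is not necessarily the filter of the embodiment; the delays, the gain g and the function names are hypothetical.

```python
def schroeder_allpass(x, delay, g):
    """All-pass section y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y


def upmix_mono(w, delays=(7, 11, 13), g=0.5):
    """Return the mono channel followed by one decorrelated channel per delay."""
    return [list(w)] + [schroeder_allpass(w, d, g) for d in delays]


impulse = [1.0] + [0.0] * 499
channels = upmix_mono(impulse)
energies = [sum(v * v for v in ch) for ch in channels]
```

Since each section is all-pass, the energy of each added channel is equal to that of the input (up to truncation of the response), which corresponds to the power normalization mentioned above.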
[0162] The upmix will add a number of channels K.sub.up so as to obtain K.sub.dmx+K.sub.up=K, where K is the number of channels of the original signal. In one particular embodiment, with an FOA downmix signal, K.sub.dmx=1 (the W channel) and K.sub.up=3.
[0163] Block 172 implements a step (SB) of dividing into sub-bands in a transformed domain. In some variants, a filter bank may be applied in order to obtain signals in the time or frequency domain. A reverse step, in block 191, recombines the sub-bands in order to reconstruct a decoded signal at output.
[0164] In the preferred embodiment, the decorrelation of the signal (block 171) is implemented before the division into sub-bands (block 172), but it is entirely possible, in some variants, to interchange these two blocks. The only condition to be verified is that of ensuring that the decorrelation is adapted to the predefined band or sub-bands.
[0165] Block 175 determines (Inf {circumflex over (B)}) information representative of a spatial image of the decoded multichannel signal in a manner similar to what was described for block 121 (for the original multichannel signal), this time applied to the decoded multichannel signal {circumflex over (B)} obtained at output of block 171.
[0166] Similarly to what was described for block 121, in one embodiment, this information is a covariance matrix of the channels of the decoded multichannel signal.
[0167] In one embodiment, in the STFT domain, the complex case will be used in which Ĉ=Re({circumflex over (B)}.Math.{circumflex over (B)}.sup.H) to within a normalization factor.
[0168] This covariance matrix is obtained as follows in the real case: Ĉ={circumflex over (B)}.Math.{circumflex over (B)}.sup.T to within a normalization factor.
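By way of non-limiting illustration, the computation of the covariance matrix in the real case and in the complex case may be sketched as follows. The normalization by the number of samples (or of frequency bins) is one possible choice of normalization factor and is an assumption of this sketch, as is the function name.

```python
def covariance(channels):
    """channels: K sequences of N real or complex values.

    Returns the K x K matrix Re(B . B^H) / N, which reduces to
    B . B^T / N when the values are real.
    """
    n = len(channels[0])
    return [[sum((x * y.conjugate()).real for x, y in zip(ci, cj)) / n
             for cj in channels] for ci in channels]


# Real case: two orthogonal channels of equal power
c_real = covariance([[1.0, -1.0, 1.0, -1.0],
                     [1.0, 1.0, -1.0, -1.0]])

# Complex case (for example STFT coefficients of one sub-band)
c_cplx = covariance([[1 + 1j, 1 - 1j],
                     [1j, 1 + 0j]])
```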
[0169] The matrix Ĉ may optionally be normalized by the term Ĉ.sub.11 associated with the W channel, if a similar normalization is applied to the matrix C.
[0170] In some variants, operations of temporally smoothing the covariance matrix may be used. In the cases of a multichannel signal in the time domain, the covariance may be estimated recursively (sample by sample).
[0171] In some variants, the covariance matrix Ĉ of the decoded signal may be decomposed into eigenvalues (ordered as in the encoder) and the eigenvalues may be normalized by the largest eigenvalue.
[0172] From the information representative of the spatial images of the original multichannel signal (Inf. B) and of the decoded multichannel signal (Inf. {circumflex over (B)}), respectively, for example, the covariance matrices C and Ĉ, block 180 implements a step of determining (Det.Corr) a set of corrections per sub-band (in at least one band).
[0173] For this purpose, a transformation matrix T to be applied to the decoded signal is determined, such that the spatial image modified after applying the transformation matrix T to the decoded signal {circumflex over (B)} is the same as that of the original signal B.
[0174]
[0175] What is sought is therefore a matrix T that satisfies the following equation: T.Math.Ĉ.Math.T.sup.T=C where C=B.Math.B.sup.T is the covariance matrix of B and Ĉ={circumflex over (B)}.Math.{circumflex over (B)}.sup.T is the covariance matrix of {circumflex over (B)}, in the current frame.
[0176] In this embodiment, a factorization known as a Cholesky factorization is used to solve this equation.
[0177] Given a matrix A of size n×n, the Cholesky factorization consists in determining a (lower or upper) triangular matrix L such that A=LL.sup.T (real case) and A=LL.sup.H (complex case). For the decomposition to be possible, the matrix A should be a positive definite symmetric matrix (real case) or positive definite Hermitian matrix (complex case); in the real case, the diagonal coefficients of L are strictly positive.
[0178] In the real case, a matrix M of size n×n is said to be positive definite symmetric if it is symmetric (M.sup.T=M) and positive definite (x.sup.TMx>0 for any value of x∈R.sup.n\{0}).
[0179] A symmetric matrix M is positive definite if and only if all of its eigenvalues are strictly positive (λ.sub.i>0). If the eigenvalues are only non-negative (λ.sub.i≥0), the matrix is said to be positive semi-definite.
[0180] A matrix M of size n×n is said to be positive definite Hermitian if it is Hermitian (M.sup.H=M) and positive definite (z.sup.HMz is a real >0 for any value of z∈C.sup.n\{0}).
[0181] The Cholesky factorization is for example used to find a solution to a system of linear equations of the type Ax=b. For example, in the complex case, it is possible to transform A into LL.sup.H using the Cholesky factorization, to solve Ly=b and then to solve L.sup.Hx=y.
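By way of non-limiting illustration, the solution of a system Ax=b by Cholesky factorization may be sketched as follows in the real case (A=LL.sup.T). The function names are hypothetical; the complex case is obtained by replacing the transpose with the conjugate transpose.

```python
import math


def cholesky(a):
    """Lower triangular L such that A = L . L^T (A symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_cholesky(a, b):
    """Solve A x = b by solving L y = b, then L^T x = y."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # forward substitution
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # backward substitution
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


x = solve_cholesky([[4.0, 2.0], [2.0, 3.0]], [2.0, 1.0])
```

A positive semi-definite matrix with a zero eigenvalue causes the factorization to fail in this sketch, which is why the matrix is required to be positive definite.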
[0182] In equivalent fashion, the Cholesky factorization may be written as A=U.sup.TU (real case) and A=U.sup.HU (complex case), where U is an upper triangular matrix.
[0183] In the embodiment described here, without loss of generality, only the case of a Cholesky factorization with a triangular matrix L is dealt with.
[0184] The Cholesky factorization thus makes it possible to decompose a matrix C=L.Math.L.sup.T into two triangular matrices on the condition that the matrix C is positive definite symmetric. This gives the following equation:
T.Math.{circumflex over (L)}.Math.{circumflex over (L)}.sup.TT.sup.T=L.Math.L.sup.T.
[0185] Identification is used to find:
T.Math.{circumflex over (L)}=L
[0186] That is to say:
T=L.Math.{circumflex over (L)}.sup.−1
[0187] Since the covariance matrices C and Ĉ are generally positive semi-definite matrices, the Cholesky factorization cannot be used as such.
[0188] It will be noted here that, when the matrices L and {circumflex over (L)} are lower (respectively upper) triangular, the transformation matrix T is also lower (respectively upper) triangular.
[0189] Block 210 thus forces the covariance matrix C to be positive definite. This modification of the matrix C may be omitted for the decoded covariance matrix if the quantization guarantees that the eigenvalues are indeed non-zero. If it is used, it is possible to replace the values of the diagonal Cii with max(Cii, ε), where ε is a low value fixed for example at 10.sup.−9 (if the values of the ambisonic signal in the time domain are defined in the interval +/−1). In some variants, ε is added (Fact. C for factorization of C) to the coefficients of the diagonal of the matrix in order to guarantee that the matrix is actually positive definite: C=C+εI, and I is the identity matrix.
[0190] Similarly, block 220 forces the covariance matrix Ĉ to be positive definite, by replacing the values of the diagonal Ĉii with max(Ĉii, ε), where ε is a low value fixed for example at 10.sup.−9 (if the values of the ambisonic signal in the time domain are defined in the interval +/−1) or by modifying this matrix in the form Ĉ=Ĉ+εI. In the preferred embodiment, this conditioning of the covariance matrices is integrated into the blocks 121 (at the encoder) for the matrix C and 175 (at the decoder) for the matrix Ĉ.
[0191] Once the two covariance matrices C and Ĉ are conditioned (regularized) to be positive definite, block 230 computes the associated Cholesky factorizations and finds (Det.T) the optimum transformation matrix T in the form
T=L.Math.{circumflex over (L)}.sup.−1.
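By way of non-limiting illustration, the conditioning of the two covariance matrices (here in the variant that adds ε to the diagonal) and the determination of the transformation matrix T may be sketched as follows in the real case. The function names and the 2×2 test matrices are hypothetical.

```python
import math


def cholesky(a):
    """Lower triangular L such that A = L . L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / l[j][j]
    return l


def invert_lower(l):
    """Inverse of a lower triangular matrix, column by column."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1.0 / l[j][j]
        for i in range(j + 1, n):
            inv[i][j] = -sum(l[i][k] * inv[k][j] for k in range(j, i)) / l[i][i]
    return inv


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def regularize(c, eps=1e-9):
    """Return C + eps * I."""
    return [[c[i][j] + (eps if i == j else 0.0) for j in range(len(c))]
            for i in range(len(c))]


def transformation_matrix(c, c_hat, eps=1e-9):
    """T = L . L_hat^-1, with C = L L^T and C_hat = L_hat L_hat^T."""
    l = cholesky(regularize(c, eps))
    l_hat = cholesky(regularize(c_hat, eps))
    return matmul(l, invert_lower(l_hat))


C = [[4.0, 2.0], [2.0, 3.0]]
C_hat = [[2.0, 1.0], [1.0, 2.0]]
T = transformation_matrix(C, C_hat)
T_t = [list(row) for row in zip(*T)]
R = matmul(matmul(T, C_hat), T_t)   # expected to be close to C
```

In this sketch, T is lower triangular and T.Math.Ĉ.Math.T.sup.T reproduces C up to a deviation of the order of ε.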
[0192] In this embodiment, it is possible for the relative difference in energy between the decoded ambisonic signal and the corrected ambisonic signal to be very large, in particular at high frequencies, which may be strongly deteriorated by encoders such as multi-mono EVS coding. In order to avoid excessively amplifying certain frequency areas, a regularization term may be added. Block 240 optionally takes responsibility for normalizing (Norm. T) this correction.
[0193] In the preferred embodiment, a normalization factor is therefore computed so as not to amplify frequency areas.
[0194] From the covariance matrix Ĉ of the coded and then decoded multichannel signal and from the transformation matrix T, it is possible to compute the covariance matrix of the corrected signal as:
R=T.Math.Ĉ.Math.T.sup.T
[0195] Only the value of the first coefficient R.sub.00 of the matrix R, corresponding to the omnidirectional component (W channel), is retained in order to be applied, as normalization factor, to T and avoid an increase in the overall gain due to the correction matrix T:
{circumflex over (B)}.sub.corr=T.sub.norm.Math.{circumflex over (B)}
T.sub.norm=g.sub.norm.Math.T
with
g.sub.norm=√{square root over (Ĉ.sub.00/R.sub.00)}
where Ĉ.sub.00 corresponds to the first coefficient of the covariance matrix of the decoded multichannel signal.
[0196] In some variants, the normalization factor g.sub.norm may be determined without computing the whole matrix R, since it is enough to compute only a subset of matrix elements in order to determine R.sub.00 (and therefore g.sub.norm).
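By way of non-limiting illustration, the normalization factor may be computed from the first row of T only, as sketched below. The function names and the 2×2 test values are hypothetical.

```python
import math


def normalization_gain(t, c_hat):
    """g_norm = sqrt(C_hat[0][0] / R[0][0]), with R = T . C_hat . T^T.

    Only R[0][0] is evaluated; it depends on the first row of T alone.
    """
    k = len(c_hat)
    r00 = sum(t[0][i] * c_hat[i][j] * t[0][j]
              for i in range(k) for j in range(k))
    return math.sqrt(c_hat[0][0] / r00)


def apply_gain(t, g):
    return [[g * v for v in row] for row in t]


T = [[2.0, 0.0], [1.0, 1.0]]
C_hat = [[1.0, 0.5], [0.5, 2.0]]
g_norm = normalization_gain(T, C_hat)
T_norm = apply_gain(T, g_norm)
```

After this normalization, the energy of the first (W) channel of the corrected signal is equal to that of the decoded signal.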
[0197] The matrix T or T.sub.norm thus obtained in each band or sub-band corresponds to the corrections to be made to the decoded multichannel signal in block 190 of
[0198] Block 190 performs the step of correcting the decoded multichannel signal by applying, in each band or sub-band, the transformation matrix T or T.sub.norm directly to the decoded multichannel signal, in the ambisonic domain (preferably in the transformed domain), in order to obtain the corrected output ambisonic signal ({circumflex over (B)} corr).
[0199] Even though the invention applies to the ambisonic case, in some variants, it is possible to convert other formats (multichannel, object, etc.) into ambisonic in order to apply the methods implemented according to the various embodiments described. One exemplary embodiment of such a conversion from a multichannel or object format to an ambisonic format is described in FIG. 2 of the 3GPP TS 26.259 specification (v15.0.0).
[0200]
[0201] The coding device DCOD comprises a processing circuit typically including:
[0202] a memory MEM1 for storing instruction data of a computer program within the sense of the invention (these instructions possibly being distributed between the encoder DCOD and the decoder DDEC);
[0203] an interface INT1 for receiving an original multichannel signal B, for example an ambisonic signal distributed over various channels (for example four 1st-order channels W, Y, Z, X) with a view to compression-coding it within the sense of the invention;
[0204] a processor PROC1 for receiving this signal and processing it by executing the computer program instructions stored in the memory MEM1, with a view to coding it; and
[0205] a communication interface COM1 for transmitting the coded signals via the network.
[0206] The decoding device DDEC comprises its own processing circuit, typically including:
[0207] a memory MEM2 for storing instruction data of a computer program within the sense of the invention (these instructions possibly being distributed between the encoder DCOD and the decoder DDEC, as indicated above);
[0208] an interface COM2 for receiving the coded signals from the network RES with a view to compression-decoding them within the sense of the invention;
[0209] a processor PROC2 for processing these signals by executing the computer program instructions stored in the memory MEM2, with a view to decoding them; and
[0210] an output interface INT2 for delivering the corrected decoded signals ({circumflex over (B)} Corr), for example in the form of ambisonic channels W . . . X, with a view to rendering them.
[0211] Of course, this
APPENDIX 1
[0212]
min_angle[6]={−PI_2,−PI_2,−PI,−PI_2,−PI,−PI}
max_angle[6]={PI_2,PI_2,PI,PI_2,PI,PI}
excess_bit[6]={0,0,1,0,1,1}
bits=5+excess_bit[i]
stepSize=(max_angle[i]−min_angle[i])/(1<<bits)
index_angle[i]=int((angles[i]−min_angle[i])/stepSize+0.5)
index_angle[i]=index_angle[i] % (1<<bits)
APPENDIX 2
[0213]
min_angle[6]={−PI_2,−PI_2,−PI,−PI_2,−PI,−PI}
max_angle[6]={PI_2,PI_2,PI,PI_2,PI,PI}
excess_bit[6]={0,0,1,0,1,1}
bits=5+excess_bit[i]
stepSize=(max_angle[i]−min_angle[i])/(1<<bits)
angles_q[i]=index_angle[i]*stepSize+min_angle[i]
APPENDIX 3
[0214] Computation of the determinant d=det M in literal form for a matrix M=[aij] of size 4×4:
d=a11*a22*a33*a44+a11*a24*a32*a43+a11*a23*a34*a42−a11*a24*a33*a42−a11*a22*a34*a43−a11*a23*a32*a44
−a12*a21*a33*a44−a12*a23*a34*a41−a12*a24*a31*a43+a12*a24*a33*a41+a12*a21*a34*a43+a12*a23*a31*a44
+a13*a21*a32*a44+a13*a22*a34*a41+a13*a24*a31*a42−a13*a24*a32*a41−a13*a21*a34*a42−a13*a22*a31*a44
−a14*a21*a32*a43−a14*a22*a33*a41−a14*a23*a31*a42+a14*a23*a32*a41+a14*a21*a33*a42+a14*a22*a31*a43
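By way of non-limiting illustration, the literal form corresponds to a Laplace expansion along the first row, which may be sketched recursively as follows for any matrix size. The function name and the test matrices are hypothetical.

```python
def det(m):
    """Determinant by Laplace (cofactor) expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        # minor: remove row 0 and column j
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
reflection = [row[:] for row in identity]
reflection[3][3] = -1.0            # orthogonal matrix that is not a rotation

d_rot = det(identity)              # +1 for a rotation matrix
d_ref = det(reflection)            # -1 for a reflection
```

The sign of the determinant thus distinguishes an orthogonal matrix that is a rotation (d=+1) from one that includes a reflection (d=−1).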
APPENDIX 4
[0215] Assuming conditioning of matrix C by ε=10.sup.−9 (the interval of the indices is adapted as a function of this value):
[0216] Quantization:
index_val[i]=round(½*log2(λi)), i=1, . . . ,K
index_val[i]=clip(index_val[i],[−15,37]) # saturation in the interval [−15,37]
diff_index_val[i]=index_val[i−1]−index_val[i], i=2, . . . ,K
diff_index_val[i]=clip(diff_index_val[i],[0,7]) # saturation in the interval [0,7]
[0217] Decoding:
index_val[i]=index_val[i−1]−diff_index_val[i], i=2, . . . ,K
λi=2^(2*index_val[i])
[0218] Although the present disclosure has been described with reference to one or more examples, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the disclosure and/or the appended claims.