Audio source parameterization
11152014 · 2021-10-19
Assignee
Inventors
CPC classification
G10L21/0308
PHYSICS
H04S2400/15
ELECTRICITY
H04S2400/11
ELECTRICITY
G10L19/08
PHYSICS
International classification
G10L21/0308
PHYSICS
H04S3/00
ELECTRICITY
G10L19/08
PHYSICS
Abstract
The present document describes a method (600) for estimating source parameters of audio sources (101) from mix audio signals (102). The mix audio signals (102) comprise a plurality of frames. The mix audio signals (102) are representable as a mix audio matrix in a frequency domain and the audio sources (101) are representable as a source matrix in the frequency domain. The method (600) comprises updating (601) an un-mixing matrix (221) which is configured to provide an estimate of the source matrix from the mix audio matrix, based on a mixing matrix (225) which is configured to provide an estimate of the mix audio matrix from the source matrix. Furthermore, the method (600) comprises updating (602) the mixing matrix (225) based on the un-mixing matrix (221) and based on the mix audio signals (102). In addition, the method (600) comprises iterating (603) the updating steps (601, 602) until an overall convergence criterion is met.
Claims
1. A method of estimating source parameters of J audio sources from I mix audio signals, with I,J>1, wherein the I mix audio signals comprise a plurality of frames, wherein the I mix audio signals are represented as a mix audio matrix in a frequency domain, wherein the J audio sources are represented as a source matrix in the frequency domain, wherein the method comprises: receiving the I mix audio signals that are captured by microphones at different places within an acoustic environment; for a frame n, updating an un-mixing matrix which is configured to provide an estimate of the source matrix from the mix audio matrix, based on a mixing matrix which is configured to provide an estimate of the mix audio matrix from the source matrix; updating the mixing matrix based on the un-mixing matrix and based on the I mix audio signals for the frame n, by updating the mixing matrix with a non-negative multiplier multiplying previous values of the mixing matrix, wherein the non-negative multiplier is determined based at least in part on the un-mixing matrix and the I mix audio signals; and iterating the updating steps of the un-mixing matrix and the mixing matrix until an overall convergence criterion is met, wherein the method further comprises determining a covariance matrix of the audio sources; the un-mixing matrix is updated based on the covariance matrix of the audio sources; and the covariance matrix of the audio sources is determined based on the mix audio matrix and based on the un-mixing matrix; and boosting, attenuating or leveling one or more audio sources in the J audio sources using the estimated source parameters in one or more audio processing applications, wherein the estimated source parameters include the mixing matrix.
2. The method of claim 1, wherein the method comprises determining a covariance matrix of the I mix audio signals based on the mix audio matrix; and the mixing matrix is updated based further on the covariance matrix of the I mix audio signals.
3. The method of claim 2, wherein the covariance matrix R.sub.XX,fn of the I mix audio signals for frame n and for a frequency bin f of the frequency domain is determined based on an average of covariance matrices of frames of the I mix audio signals within a window around the frame n; a covariance matrix of a frame k is determined based on X.sub.fkX.sub.fk.sup.H; and X.sub.fn is the mix audio matrix for frame n and for the frequency bin f.
4. The method of claim 2, wherein determining the covariance matrix of the I mix audio signals comprises normalizing the covariance matrix for the frame n and for a frequency bin f such that a sum of energies of the I mix audio signals for the frame n and for the frequency bin f is equal to a pre-determined normalization value.
5. The method of claim 1, wherein the covariance matrix R.sub.SS,fn of the audio sources for frame n and for a frequency bin f of the frequency domain is determined based on R.sub.SS,fn=Ω.sub.fnR.sub.XX,fnΩ.sub.fn.sup.H; R.sub.XX,fn is a covariance matrix of the I mix audio signals; and Ω.sub.fn is the un-mixing matrix.
6. The method of claim 1, wherein the method comprises determining a covariance matrix of noises within the I mix audio signals; and the un-mixing matrix is updated based on the covariance matrix of noises within the I mix audio signals.
7. The method of claim 1, wherein a covariance matrix of noises is determined based on the I mix audio signals; and/or the covariance matrix of noises is proportional to the trace of a covariance matrix of the I mix audio signals; and/or the covariance matrix of noises is determined such that only a main diagonal of the covariance matrix of noises comprises non-zero matrix terms; and/or a magnitude of the matrix terms of the covariance matrix of noises decreases with an increasing number q of iterations of the method.
8. The method of claim 1, wherein updating the un-mixing matrix comprises improving an un-mixing objective function which is dependent on the un-mixing matrix; and/or updating the mixing matrix comprises improving a mixing objective function which is dependent on the mixing matrix.
9. The method of claim 8, wherein the un-mixing objective function and/or the mixing objective function comprises one or more constraint terms; and a constraint term is dependent on a desired property of the un-mixing matrix or the mixing matrix.
10. The method of claim 9, wherein the mixing objective function comprises one or more of: a constraint term which is dependent on a non-negativity of matrix terms of the mixing matrix; a constraint term which is dependent on a number of non-zero matrix terms of the mixing matrix; a constraint term which is dependent on a correlation between different columns or different rows of the mixing matrix; and/or a constraint term which is dependent on a deviation of the mixing matrix for the frame n from a mixing matrix for a preceding frame.
11. The method of claim 9, wherein the un-mixing objective function comprises one or more of a constraint term which is dependent on a degree to which the un-mixing matrix provides a covariance matrix of the audio sources from a covariance matrix of the I mix audio signals, such that non-zero matrix terms of the covariance matrix of the audio sources are concentrated towards the main diagonal; a constraint term which is dependent on a degree of invertibility of the un-mixing matrix; and/or a constraint term which is dependent on a degree of orthogonality of column vectors or row vectors of the un-mixing matrix.
12. The method of claim 9, wherein the one or more constraint terms are included into the un-mixing objective function and/or the mixing objective function using one or more constraint weights, respectively, to increase or reduce an impact of the one or more constraint terms on the un-mixing objective function and/or on the mixing objective function.
13. The method of claim 8, wherein the un-mixing objective function and/or the mixing objective function are improved in an iterative manner until a sub convergence criterion is met, to update the un-mixing matrix and/or the mixing matrix, respectively.
14. The method of claim 13, wherein improving the mixing objective function comprises repeatedly multiplying the mixing matrix with a multiplier matrix until the sub convergence criterion is met; and the multiplier matrix is dependent on the un-mixing matrix and on the I mix audio signals.
15. The method of claim 14, wherein the multiplier matrix is dependent on
16. The method of claim 13, wherein improving the un-mixing objective function comprises repeatedly adding a gradient to the un-mixing matrix until the sub convergence criterion is met; and the gradient is dependent on a covariance matrix of the I mix audio signals.
17. The method of claim 1, wherein the method comprises determining the mix audio matrix by transforming the I mix audio signals from a time domain to the frequency domain.
18. The method of claim 17, wherein the mix audio matrix is determined using a short-term Fourier transform.
19. The method of claim 1, wherein an estimate of the source matrix for the frame n and for a frequency bin f is determined as S.sub.fn=Ω.sub.fnX.sub.fn; an estimate of the mix audio matrix for the frame n and for the frequency bin f is determined based on X.sub.fn=A.sub.fnS.sub.fn; S.sub.fn is an estimate of the source matrix; Ω.sub.fn is the un-mixing matrix; A.sub.fn is the mixing matrix; and X.sub.fn is the mix audio matrix.
20. The method of claim 1, wherein the overall convergence criterion is dependent on a degree of change of the mixing matrix between two successive iterations.
21. The method of claim 1, wherein the method comprises, initializing the mixing matrix based on an un-mixing matrix determined for a frame preceding the frame n and based on the I mix audio signals for the frame n.
22. The method of claim 1, wherein the method comprises, subsequent to meeting the convergence criterion, performing post-processing on the mixing matrix to determine one or more source parameters with regards to the audio sources.
23. A non-transitory storage medium comprising a software program that, when executed by a processor, causes the processor to perform operations comprising: receiving the I mix audio signals that are captured by microphones at different places within an acoustic environment; estimating source parameters of J audio sources from I mix audio signals, with I,J>1, wherein the I mix audio signals comprise a plurality of frames, wherein the I mix audio signals are represented as a mix audio matrix in a frequency domain, wherein the J audio sources are represented as a source matrix in the frequency domain, the estimating comprising, for a frame n: updating an un-mixing matrix which is configured to provide an estimate of the source matrix from the mix audio matrix, based on a mixing matrix which is configured to provide an estimate of the mix audio matrix from the source matrix; updating the mixing matrix based on the un-mixing matrix and based on the I mix audio signals for the frame n, by updating the mixing matrix with a non-negative multiplier multiplying previous values of the mixing matrix, wherein the non-negative multiplier is determined based at least in part on the un-mixing matrix and the I mix audio signals; and iterating the updating steps of the un-mixing matrix and the mixing matrix until an overall convergence criterion is met, wherein the estimating further comprises determining a covariance matrix of the audio sources; the un-mixing matrix is updated based on the covariance matrix of the audio sources; and the covariance matrix of the audio sources is determined based on the mix audio matrix and based on the un-mixing matrix; and boosting, attenuating or leveling one or more audio sources in the J audio sources using the estimated source parameters in one or more audio processing applications, wherein the estimated source parameters include the mixing matrix.
24. A system for estimating source parameters of J audio sources from I mix audio signals, with I,J>1, wherein the I mix audio signals comprise a plurality of frames, wherein the I mix audio signals are represented as a mix audio matrix in a frequency domain, wherein the J audio sources are represented as a source matrix in the frequency domain, wherein the system comprises a mix audio signal receiver which is configured to receive the I mix audio signals that are captured by microphones at different places within an acoustic environment; the system comprises a parameter learner which is configured, for a frame n, to update an un-mixing matrix which is configured to provide an estimate of the source matrix from the mix audio matrix, based on a mixing matrix which is configured to provide an estimate of the mix audio matrix from the source matrix; and update the mixing matrix based on the un-mixing matrix and based on the I mix audio signals for the frame n, by updating the mixing matrix with a non-negative multiplier multiplying previous values of the mixing matrix, wherein the non-negative multiplier is determined based at least in part on the un-mixing matrix and the I mix audio signals; the system comprises a source pre-processor which is configured to determine a covariance matrix of the audio sources; the parameter learner is configured to update the un-mixing matrix based on the covariance matrix of the audio sources; the system is configured to cause the parameter learner to update the mixing matrix and the un-mixing matrix in a repeated manner until an overall convergence criterion is met; and the source pre-processor is configured to determine the covariance matrix of the audio sources based on the mix audio matrix and based on the un-mixing matrix; the system comprises an audio signal processor which is configured to boost, attenuate or level one or more audio sources in the J audio sources using the estimated source parameters in one or more audio 
processing applications, wherein the estimated source parameters include the mixing matrix.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The invention is explained below in an exemplary manner with reference to the accompanying drawings.
DETAILED DESCRIPTION
(8) As outlined above, the present document is directed at the estimation of source parameters of audio sources from mix audio signals.
(9) The following notations are used in the present document: A∘B denotes an element-wise product of two matrices A and B;
(10) A./B
denotes an element-wise division of two matrices A and B; B.sup.−1 denotes a matrix inversion of matrix B; B.sup.H denotes the transpose of B if B is a real-valued matrix and denotes a conjugate transpose of B if B is a complex-valued matrix; and 1 denotes a matrix of suitable dimension with all ones.
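The notation above has direct NumPy analogues; the following sketch (the use of NumPy and the concrete matrices are illustrative, not part of the document) demonstrates each operator:

```python
import numpy as np

# NumPy analogues of the notation used in this document.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 1.0], [1.0, 2.0]])

elem_prod = A * B            # element-wise product A∘B
elem_div = A / B             # element-wise division A./B
B_inv = np.linalg.inv(B)     # matrix inversion B^-1

C = np.array([[1 + 1j, 2], [0, 1 - 1j]])
C_H = C.conj().T             # conjugate transpose C^H for a complex matrix

ones = np.ones((2, 2))       # the all-ones matrix denoted "1"
```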
(11) The mixing process from the J unknown audio sources 101 to the I observed mix audio signals 102 may be modeled in matrix form as:
X.sub.fn=A.sub.fnS.sub.fn+B.sub.fn (1)
where S.sub.fn are matrices of dimension J×1, representing STFTs of J unknown audio sources (referred to herein as source matrices), A.sub.fn are matrices of dimension I×J, representing mixing parameters, which can be frequency-dependent and time-varying (referred to herein as mixing matrices), and B.sub.fn are matrices of dimension I×1, representing additive noise plus diffusive ambience signals (referred to herein as noise matrices).
(12) Likewise, the inverse mixing process from the observed mix audio signals 102 to the unknown audio sources 101 may be modeled in a similar matrix form as:
{tilde over (S)}.sub.fn=Ω.sub.fnX.sub.fn (2)
where {tilde over (S)}.sub.fn are matrices of dimension J×1, representing STFTs of J estimated audio sources (referred to herein as estimated source matrices), and Ω.sub.fn are matrices of dimension J×I, representing inverse mixing parameters or un-mixing parameters (referred to herein as the un-mixing matrices).
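Equations (1) and (2) for a single time-frequency tile can be sketched numerically as follows; the sizes I=3 and J=2, the random data, and the use of a pseudo-inverse of A as the un-mixing matrix are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 3, 2  # I mix audio signals, J audio sources (illustrative sizes)

# Equation (1): X_fn = A_fn S_fn + B_fn for one time-frequency tile.
S = rng.standard_normal((J, 1))            # source matrix (J x 1)
A = np.abs(rng.standard_normal((I, J)))    # non-negative mixing matrix (I x J)
B = 0.01 * rng.standard_normal((I, 1))     # noise matrix (I x 1)
X = A @ S + B                              # mix audio matrix (I x 1)

# Equation (2): S_tilde = Omega_fn X_fn, with Omega a J x I un-mixing matrix.
# Here Omega is taken as the pseudo-inverse of A purely for illustration.
Omega = np.linalg.pinv(A)
S_est = Omega @ X                          # estimated source matrix (J x 1)
```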
(13) In the present document, an unsupervised learning method and system 200 for estimating source parameters for the use in different subsequent audio processing tasks is described. Meanwhile, if prior-knowledge is available, the method and system 200 may be extended to incorporate the prior information within the learning scheme. The source parameters may include the mixing and un-mixing parameters A.sub.fn, Ω.sub.fn, and/or estimated spectral and temporal parameters of the unknown audio sources 101.
(14) The system 200 may include the following modules:
a mix pre-processor 201, which is adapted to process the mix audio signals 102 and which outputs processed covariance matrices R.sub.XX,fn 222 of the mix audio signals 102;
a mixing parameter learner 202, which is adapted to take at a first input 211 the covariance matrices 222 of the mix audio signals 102 and the un-mixing parameters Ω.sub.fn 221, and to provide at a first output 213 the mixing parameters or the mixing matrix A.sub.fn 225. Alternatively or in addition, the mixing parameter learner 202 is adapted to take at a second input 212 the mixing parameters A.sub.fn 225, the output signals 224 of the source pre-processor 203 and possibly the covariance matrices 222 of the mix audio signals 102, and to provide at a second output 214 the un-mixing parameters or the un-mixing matrix Ω.sub.fn 221;
a source pre-processor 203, which is adapted to take as input the covariance matrices 222 of the mix audio signals 102 and the un-mixing parameters Ω.sub.fn 221. In addition, the input may include prior knowledge 223, if available, about the audio sources 101 and/or the noises, which may be used to regulate the covariance matrices. The source pre-processor 203 outputs covariance matrices R.sub.SS,fn of the audio sources 101 and covariance matrices R.sub.BB,fn of the noises; and
an iterative processor 204, which is adapted to iteratively apply modules 202 and 203 until one or more convergence criteria are met. Subsequent to convergence, the learned source parameters (for example, the mixing parameters A.sub.fn 225) may be passed on for post-processing.
(15) Table 1 illustrates example inputs and outputs of the parameter learner 202.
(16) TABLE 1. Inputs and outputs of the parameter learner 202:
For the observed mix audio signals — first input: covariance matrices output from the mix pre-processor; first input: Ω.sub.fn, the un-mixing parameters, initially set with random values or with prior information about the mix (if available) and subsequently the feedback from the second output; first output: A.sub.fn.
For the unknown audio sources — second input: covariance matrices output from the source parameter regulator, and those from noise parameter estimation; second input: A.sub.fn, the mixing parameters, being the feedback from the first output of the parameter learner; second output: Ω.sub.fn.
(17) In the following, examples for the different modules of the system 200 are described.
(18) The mix pre-processor 201 may read in I mix audio signals 102 and may apply a time domain to frequency domain transform (such as a STFT transform) to provide the frequency-domain mix audio matrix X.sub.fn. The covariance matrices R.sub.XX,fn 222 of the mix audio signals 102 may be calculated as below:
(19) R.sub.XX,fn=(1/T)Σ.sub.kX.sub.fkX.sub.fk.sup.H (3)
with the sum taken over the T frames k of an analysis window around the frame n,
where n is the current frame index, and where T is the frame count of the analysis window of the transform.
(20) In addition, the covariance matrices 222 of the mix audio signals 102 may be normalized by the energy of the mix audio signals 102 per TF tiles, so that the sum of all normalized energies of the mix audio signals 102 for a given TF tile is one:
(21) R.sub.XX,fn←R.sub.XX,fn/(trace(R.sub.XX,fn)+ε.sub.1) (4)
where ε.sub.1 is a relatively small value (for example, 10.sup.−6) to avoid division by zero, and trace(.Math.) returns the sum of the diagonal entries of the matrix within the bracket.
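The windowed covariance computation and its energy normalization might be sketched as follows; the data layout (a list of per-frame I×1 STFT vectors for one frequency bin) and the centering of the window are illustrative assumptions, not mandated by the document:

```python
import numpy as np

def mix_covariance(X_frames, n, T, eps=1e-6):
    """Covariance of the mix audio signals for frame n.

    Averages X_fk X_fk^H over a window of about T frames around n, then
    normalizes by the trace so that the summed energy per TF tile is one.
    """
    lo = max(0, n - T // 2)
    hi = min(len(X_frames), n + T // 2 + 1)
    R = sum(X_frames[k] @ X_frames[k].conj().T for k in range(lo, hi)) / (hi - lo)
    return R / (np.trace(R).real + eps)   # energy normalization per TF tile

rng = np.random.default_rng(1)
frames = [rng.standard_normal((3, 1)) + 1j * rng.standard_normal((3, 1))
          for _ in range(10)]
R_xx = mix_covariance(frames, n=5, T=4)
```

After normalization the trace of the returned matrix is (up to the ε guard) equal to one, matching the stated pre-determined normalization value.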
(22) The source pre-processor 203 may be adapted to calculate the audio sources' covariance matrices R.sub.SS,fn as:
R.sub.SS,fn=Ω.sub.fnR.sub.XX,fnΩ.sub.fn.sup.H (5)
(23) It may be assumed that the noises in each mix audio signal 102 are uncorrelated to each other, which does not limit the generality from the practical point of view. Hence, the noises' covariance matrices are diagonal matrices, wherein all diagonal entries may be initialized as being proportional to the trace of mix covariance matrices of the mix audio signals 102 and wherein the proportionality factor may decrease along the iteration times of the iterative processor:
(24) R.sub.BB,fn=((Q−q)/Q)·trace(R.sub.XX,fn)·I (6)
where Q is the overall iteration times and q is the current iteration count during the iterative processing.
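A sketch of a noise covariance whose diagonal entries are proportional to the trace of the mix covariance and decay over iterations; the linear decay (Q − q)/Q and the constant c0 are illustrative choices, not taken from the document:

```python
import numpy as np

def noise_covariance(R_xx, q, Q, c0=0.1):
    """Diagonal noise covariance R_BB whose magnitude decays with iteration q.

    Diagonal entries are proportional to trace(R_XX); the proportionality
    factor c0 * (Q - q) / Q decreases as the iteration count q grows.
    """
    I = R_xx.shape[0]
    factor = c0 * (Q - q) / Q
    return factor * np.trace(R_xx).real * np.eye(I)

R_xx = np.eye(3)
R_bb_first = noise_covariance(R_xx, q=0, Q=10)   # largest noise floor
R_bb_last = noise_covariance(R_xx, q=10, Q=10)   # noise floor fully annealed
```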
(25) If prior knowledge 223 about the audio sources 101 and/or noises is available, advanced methods may be adopted within the source pre-processor 203.
(26) The mixing parameter learner 202 may implement a learning method that determines the mixing and un-mixing parameters 225, 221 for the audio sources 101 by minimizing and/or optimizing a cost function (or objective function). The cost function may depend on the mix audio matrices and the mixing parameters. In an example, such a cost function for learning the mixing parameters A.sub.fn (or A, when omitting the frequency index f and the frame index n) may be defined as below:
(27) E(A)=∥X−AS∥.sub.F.sup.2 (7)
where ∥.Math.∥.sub.F represents the Frobenius norm.
(28) The cost function for learning the un-mixing parameters Ω.sub.fn (or Ω) may be defined in the same manner. The input to the cost function is changed by replacing A with Ω and replacing X with S. Thus, the cost function may depend on the source matrices and the un-mixing parameters. In an example corresponding to the example of equation (7):
(29) E(Ω)=∥S−ΩX∥.sub.F.sup.2 (8)
(30) Alternatively, notably if the noise model is to be taken into account, a cost function using the minus log-likelihood may be used, such as:
(31)
where Ā=R.sub.BB,fn.sup.−1A.sub.fn, and where R.sub.BB,fn is the covariance matrix of the noise signals. Typically, R.sub.BB,fn is a diagonal matrix, if the noises are considered to be uncorrelated signals. It can be observed that the cost function of equation (9) is in the same form as the cost functions of equations (7) and (8).
(32) Different optimization techniques may be applied to learn the mixing parameters and/or un-mixing parameters. In particular, the problem of learning the mixing/un-mixing parameters may be considered as the minimization problems:
A=argmin E(A) (10)
Ω=argmin E(Ω) (11)
(33) The system 200 may use an inverse-matrix method by solving ∇E=0 to determine optimized values of the mixing parameters as follows:
A=R.sub.XXΩ.sup.H(ΩR.sub.XXΩ.sup.H).sup.−1 (12)
Ω=R.sub.SSA.sup.H(AR.sub.SSA.sup.H+R.sub.BB).sup.−1 (13)
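The inverse-matrix updates of equations (12) and (13) translate directly into code; the random test data below is illustrative. Note that, by construction, Ω multiplied by the A of equation (12) yields the J×J identity:

```python
import numpy as np

def update_mixing(R_xx, Omega):
    """Equation (12): A = R_XX Omega^H (Omega R_XX Omega^H)^-1."""
    OH = Omega.conj().T
    return R_xx @ OH @ np.linalg.inv(Omega @ R_xx @ OH)

def update_unmixing(R_ss, A, R_bb):
    """Equation (13): Omega = R_SS A^H (A R_SS A^H + R_BB)^-1."""
    AH = A.conj().T
    return R_ss @ AH @ np.linalg.inv(A @ R_ss @ AH + R_bb)

rng = np.random.default_rng(2)
I, J = 3, 2
X = rng.standard_normal((I, 8))
R_xx = X @ X.T / 8                        # mix covariance (illustrative)
Omega = rng.standard_normal((J, I))
A = update_mixing(R_xx, Omega)
R_ss = Omega @ R_xx @ Omega.conj().T      # equation (5)
R_bb = 0.01 * np.eye(I)                   # small diagonal noise covariance
Omega_new = update_unmixing(R_ss, A, R_bb)
```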
(34) The successful and efficient design and implementation of the mixing parameter learner 202 typically depends on an appropriate use of regularization, pre-processing and post-processing based on prior knowledge 223. For this purpose, one or more constraints may be taken into account within the mixing parameter learner 202, thereby enabling the extraction and/or identification of physically significant and meaningful hidden source parameters.
(35)
(36) Example constraints 311 for learning the mixing parameters A:
Non-negativity constraint: all learned mixing parameters A may be constrained to be positive values or zero. In practice, especially for processing mix audio signals 102 created in a studio, such as movies and TV programs, it may be valid to assume that the mixing parameters A are non-negative. As a matter of fact, negative mixing parameters are rare if not impossible for content creation in a studio environment. A mixing parameter learner 202, 302 which does not make use of the non-negativity constraint may cause audible artifacts, spatial distortions and/or instability. For example, spurious out-of-phase audio sources may be generated within the system 200 if no non-negativity constraint is imposed. Such out-of-phase audio sources typically introduce audible artifacts, an energy build-up and spatial distortions when performing post-processing such as up-mixing.
Sparseness constraint: a sparseness constraint may bias the mixing parameter learner 202, 302 towards sparse solutions of A, meaning mixing matrices A with an increased number of zero entries. This property is typically beneficial in the context of unsupervised learning, when information such as the number of audio sources 101 is unknown. For example, when the number of audio sources 101 is over-estimated (meaning higher than the actual number of audio sources 101), the unconstrained learner 202, 302 may output a mixing matrix A which is a legitimate solution but with a number of non-zero elements that is higher than in the optimal solution. Such additional non-zero elements typically correspond to spurious audio sources which may introduce instability and artifacts in the context of post-processing 205. Such non-zero elements may be removed by imposing the sparseness constraint.
Uncorrelatedness constraint: the uncorrelatedness constraint may bias the parameter learner 202, 302 towards solutions with uncorrelated columns within the mixing matrix A. This constraint may be used for screening out spurious audio sources in unsupervised learning.
Combined sparseness and uncorrelatedness constraint: it may be beneficial for the learner 202, 302 to apply a dimension-specific sparseness constraint, which means that A is assumed to be sparse only along a first dimension rather than a second dimension. Such dimension-specific sparseness may be achieved by imposing both the sparseness and the uncorrelatedness constraints.
Consistency constraint: domain knowledge indicates that the mixing matrix A typically exhibits a consistency property along time, which means that the mixing parameters of a current frame are typically consistent with the mixing parameters of a previous frame, without abrupt changes.
(37) Moreover, for learning the un-mixing parameters Ω, one or more of the following constraints may be enforced within the learner 202, 302:
Diagonalizability constraint: a diagonalizability constraint may force the parameter learner 202, 302 to search for solutions of Ω such that the un-mixing matrix diagonalizes R.sub.SS, which means that the diagonalizability constraint favors the estimation of audio sources 101 which are uncorrelated to each other. The assumption of uncorrelatedness among the audio sources 101 typically enables the unsupervised learning system 200 to converge promptly to meaningful audio sources 101. That is, a respective constraint term may depend on the capacity of the un-mixing matrix to provide the covariance matrix R.sub.SS of the audio sources from the covariance matrix R.sub.XX of the mix audio signals such that non-zero matrix terms of the covariance matrix of the audio sources are concentrated towards the main diagonal (e.g., the constraint term may depend on a degree of diagonality of R.sub.SS). A degree of diagonality may be determined based on the metric defined below.
Invertibility constraint: the invertibility constraint regarding the un-mixing parameters may be used as a constraint which prevents the convergence of the minimizer of the cost function to a zero solution.
Orthogonality constraint: orthogonality may be used to reduce the space within which the learner 202, 302 is operating, thereby further speeding up the convergence of the learning system 200.
(38) While a cost function may include terms such as the Frobenius norm as expressed in equations (7) and (8) or the minus log-likelihood term as expressed in equation (9), other cost functions may be used instead of or in addition to the cost functions as described in the present document. Especially, additional constraint terms may be used to regulate the learning for fast convergence and improved performance. For example, the constrained cost function may be given by
E(A)=∥X−AS∥.sub.F.sup.2+E.sub.uncorr+E.sub.sparse (14)
where E.sub.uncorr is a term for the uncorrelatedness constraint:
E.sub.uncorr=α.sub.uncorr∥A1∥.sub.F.sup.2 (15)
and E.sub.sparse is a term for the sparseness constraint:
(39) E.sub.sparse=α.sub.sparseΣ.sub.i,jA.sub.ij, subject to A.sub.ij≥0, ∀i,j (16)
(40) The level of the uncorrelatedness and/or the sparsity may be increased with the increase of the regularization coefficients α.sub.uncorr and/or α.sub.sparse. By way of example, α.sub.uncorr∈[0,10] and α.sub.sparse∈[0.0, 0.5].
(41) An example constrained learner 302 may use the inverse-matrix method by solving ∇E=0 to determine optimized values of the mixing parameters as follows:
A=(R.sub.XXΩ.sup.H−α.sub.sparse1)(ΩR.sub.XXΩ.sup.H+α.sub.uncorr1).sup.−1 (17)
(42) However, there may be limitations for the inverse-matrix method with regards to the constraints. A possible method for enforcing a non-negativity constraint is to make A=A.sub.+ after each calculation of equation (17), where a positive component A.sub.+ and a negative component A.sub.− of a matrix A are respectively defined as follows:
(43) A.sub.+=(|A|+A)/2, A.sub.−=(|A|−A)/2 (18)
where |A| denotes the element-wise absolute value of A.
(44) Such a method for imposing non-negativity may not necessarily converge to the global optimum. On the other hand, if the non-negativity constraint is not enforced, meaning if the condition A.sub.ij≥0, ∀i,j in equation (16) does not hold, it may be difficult to impose the L1-norm sparseness constraint, as defined in equation (16).
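The positive/negative decomposition used for the non-negativity projection can be sketched directly from the definitions of A.sub.+ and A.sub.−:

```python
import numpy as np

def pos_neg_parts(A):
    """Split A into non-negative components with A = A_plus - A_minus:
    A_plus = (|A| + A) / 2 and A_minus = (|A| - A) / 2, element-wise."""
    A_plus = (np.abs(A) + A) / 2
    A_minus = (np.abs(A) - A) / 2
    return A_plus, A_minus

A = np.array([[1.0, -2.0], [0.0, 3.0]])
A_plus, A_minus = pos_neg_parts(A)
# Enforcing non-negativity after equation (17) amounts to keeping A_plus.
```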
(45) Instead of or in addition to using the inverse-matrix method, an unsupervised iterative learning method may be used, which is flexible with regards to imposing different constraints. This method may be used to discover a structure underlying the observed mix audio signals 102, to extract meaningful parameters, and to identify a useful representation of the given data. The iterative learning method may be implemented in a relatively simple manner.
(46) It may be relevant to solve the problem by multiplicative updates when constraints such as L1-norm sparseness are imposed, since a closed form solution no longer exists. Furthermore, given non-negative initialization and non-negative multipliers, the multiplicative iterative learner naturally enforces a non-negativity constraint. In addition, the multiplicative update approach also provides stability for ill-conditioned situations. It leads the learner 202 to output robust and stable mixing parameters A given ill-conditioned ΩR.sub.XXΩ.sup.H. Such an ill-conditioned situation may occur frequently in unsupervised learning, especially when the number of audio sources 101 is over-estimated, or when the estimated audio sources 101 are highly correlated to each other. In these cases, the matrix ΩR.sub.XXΩ.sup.H is singular (having a lower rank than its dimension), so that using the inverse-matrix method in equations (12) and (13) may lead to numerical issues and may become unstable.
(47) When using the multiplicative update approach, current values of the mixing parameters are obtained by iteratively updating previous values of the mixing parameters with a non-negative multiplier. For the purpose of illustration only, the current values of the mixing parameters may be derived from the previous values of the mixing parameters with a non-negative multiplier as follows:
(48) A←A∘(D.sub.−+ε1)./(AM+D.sub.++ε1) (19)
with D.sub.+ and D.sub.− the positive and negative components of D,
where M=ΩR.sub.XXΩ.sup.H+α.sub.uncorr1, D=−R.sub.XXΩ.sup.H+α.sub.sparse1, and where ε is a small value (typically ε=10.sup.−8) to avoid zero-division. In the above, α.sub.sparse and/or α.sub.uncorr may be zero.
(49) When α.sub.sparse=0 and α.sub.uncorr=0, the above-mentioned update approach is identical to an un-constrained learner without a sparseness constraint or uncorrelatedness constraint. The uncorrelatedness level and sparsity level may be pronounced by increasing the regularization coefficients or constraint weights α.sub.uncorr and α.sub.sparse. These coefficients may be set empirically depending on the desired degree of uncorrelatedness and/or sparseness. Typically, α.sub.uncorr∈[0, 10] and α.sub.sparse∈[0.0, 0.5]. Alternatively, optimal regularization coefficients may be learned based on a target metric such as a signal-to-distortion ratio. It may be shown that the optimization of the cost function E(A) using the multiplicative update approach is convergent.
(50) Although M is typically diagonalizable and positive definite, the mixing parameters obtained via the inverse-matrix method as given by equations (12) or (17) may not necessarily be positive. In contrast, when updating mixing parameter values through an update factor that is a positive multiplier according to equation (19) non-negativity in the optimization process of the mixing parameters may be ensured, provided that the initial values of the mixing parameters are non-negative. The mixing parameters obtained using a multiplicative-update method according to equation (19) may remain zero provided the initial values of the mixing parameters are zero.
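A sketch of the multiplicative-update learner for A follows. The exact placement of the ε terms and the split of the gradient AM + D into positive and negative parts are reconstructions, the non-negative test data is illustrative, and the objective helper exists only to check that the update does not increase the Frobenius cost:

```python
import numpy as np

def frob_objective(A, R_xx, Omega):
    # E(A) = ||X - A Omega X||_F^2 expressed through the covariance R_XX
    # (up to a constant factor); used here only to verify descent.
    M0 = Omega @ R_xx @ Omega.T
    return np.trace(R_xx - A @ Omega @ R_xx - R_xx @ Omega.T @ A.T + A @ M0 @ A.T)

def multiplicative_update(A, R_xx, Omega, a_sparse=0.0, a_uncorr=0.0,
                          iters=200, eps=1e-8):
    # A is repeatedly multiplied by the ratio of the negative and positive
    # parts of the gradient A M + D, with M = Omega R_XX Omega^H + a_uncorr*1
    # and D = -R_XX Omega^H + a_sparse*1 as in the text. Given non-negative
    # data and a non-negative initial A, the iterate stays non-negative.
    I_dim, J = A.shape
    M = Omega @ R_xx @ Omega.conj().T + a_uncorr * np.ones((J, J))
    numer = R_xx @ Omega.conj().T            # negative part of D
    ones = np.ones((I_dim, J))
    for _ in range(iters):
        A = A * (numer + eps) / (A @ M + a_sparse * ones + eps)
    return A

rng = np.random.default_rng(3)
I_mix, J_src = 3, 2
A_true = np.abs(rng.standard_normal((I_mix, J_src)))
Omega = np.abs(rng.standard_normal((J_src, I_mix)))   # non-negative for the sketch
R_xx = A_true @ A_true.T + 0.01 * np.eye(I_mix)
A0 = np.abs(rng.standard_normal((I_mix, J_src))) + 0.1
A_est = multiplicative_update(A0, R_xx, Omega)
```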
(51) The multiplicative update method may be extended for a learner 202, 302 without the non-negativity constraint, meaning that A is allowed to contain both non-negative and negative entries: A=A.sub.+−A.sub.−. For the purpose of illustration only, the current values of the mixing parameters may be derived by updating its non-negative part and negative part separately as follows:
(52)
where D.sub.p=−R.sub.XXΩ.sup.H−A.sub.−M+α.sub.sparse1, D.sub.n=R.sub.XXΩ.sup.H−A.sub.+M+α.sub.sparse1, M=ΩR.sub.XXΩ.sup.H+α.sub.uncorr1, and ε is a small value (typically ε=10.sup.−8) to avoid zero-division.
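The split update can be sketched as follows. This is a hedged NumPy illustration: the exact per-part update rule is not reproduced above, so the sketch assumes each part is updated with the same NMF-style multiplier as before, i.e. A.sub.+ ← A.sub.+ ⊙ max(−D.sub.p, 0) ⊘ (A.sub.+M + max(D.sub.p, 0) + ε) and analogously for A.sub.− with D.sub.n; function name and iteration count are illustrative.

```python
import numpy as np

def multiplicative_update_signed_A(Ap, An, Omega, Rxx, alpha_sparse=0.1,
                                   alpha_uncorr=1.0, eps=1e-8, n_iter=50):
    """Sketch: update the non-negative part Ap and negative part An of A = Ap - An.

    Ap, An : (I, J) non-negative matrices holding the positive/negative parts
    Omega  : (J, I) un-mixing matrix
    Rxx    : (I, I) covariance matrix of the mix audio signals
    """
    I, J = Ap.shape
    M = Omega @ Rxx @ Omega.conj().T + alpha_uncorr * np.ones((J, J))
    RxOH = Rxx @ Omega.conj().T
    for _ in range(n_iter):
        # D_p and D_n as defined in the text
        Dp = -RxOH - An @ M + alpha_sparse * np.ones((I, J))
        Dn = RxOH - Ap @ M + alpha_sparse * np.ones((I, J))
        # assumed per-part multiplicative updates (both factors stay non-negative)
        Ap = Ap * np.maximum(-Dp, 0.0) / (Ap @ M + np.maximum(Dp, 0.0) + eps)
        An = An * np.maximum(-Dn, 0.0) / (An @ M + np.maximum(Dn, 0.0) + eps)
    return Ap - An  # reassemble the signed mixing matrix
```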
(53) As shown in
(54)
(55) An example implementation of the constrained learner 302 for the mixing parameters using the multiplicative method is shown in Table 2:
(56) TABLE-US-00002
TABLE 2
Input: Ω, R.sub.XX, A.sub.f,n−1 (if n > 1)
Initialize: // initialize A with learned values from previous frames;
            // if no history data is available, use random non-negative values
(57) In the above, α.sub.sparse and/or α.sub.uncorr may be zero.
(58) The multiplicative updater may be applied for learning un-mixing parameters Ω in a similar manner. In
(59) TABLE-US-00003
TABLE 3
Input: A, R.sub.SS, R.sub.XX, R.sub.BB
Initialize: // initialize Ω with Example method I using Eq. (13)
Ω = R.sub.SSA.sup.H (AR.sub.SSA.sup.H + R.sub.BB).sup.−1
Iteration: for iter = 1: iteration_times, do:
// Update Ω by enforcing the diagonalizability constraint, where: //
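The initialization step of Table 3, Eq. (13), translates directly into NumPy. This sketch covers only the initialization, not the constrained iteration that follows; the function name is illustrative.

```python
import numpy as np

def init_unmixing(A, Rss, Rbb):
    """Initialize the un-mixing matrix per Eq. (13):
    Omega = Rss A^H (A Rss A^H + Rbb)^{-1}.

    A   : (I, J) mixing matrix
    Rss : (J, J) covariance matrix of the audio sources
    Rbb : (I, I) covariance matrix of the noises
    Returns the (J, I) un-mixing matrix Omega.
    """
    AH = A.conj().T
    # The Rbb term regularizes the inversion (Wiener-style solution)
    return (Rss @ AH) @ np.linalg.inv(A @ Rss @ AH + Rbb)
```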
(60) The convergence for the iterative processor 204 in
(61) As such, the iterative processor 204 of
(62) In the following, example post-processing 205 is described. The audio sources' position metadata may be directly estimated from the mixing parameters A. Provided that non-negativity has been enforced when determining the mixing parameters A, each column of the mixing matrix represents the panning coefficients of the corresponding audio source. The square of the panning coefficients may represent the energy distribution of an audio source 101 within the mix audio signals 102. Thus, the position of an audio source 101 may be estimated as the energy-weighted center of mass: P.sub.j=Σ.sub.i=1.sup.Iw.sub.ijP.sub.i, where P.sub.j is the spatial position of the j-th audio source, where P.sub.i is the position corresponding to the i-th mix audio signal 102, and where w.sub.ij is the energy distribution of the j-th audio source in the i-th mix audio signal:
(63)
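The center-of-mass estimate can be sketched as follows. Since the weight equation is not reproduced above, the sketch assumes w.sub.ij = a.sub.ij.sup.2/Σ.sub.i a.sub.ij.sup.2 (squared panning coefficients normalized per source), consistent with the statement that the square of the panning coefficients represents the energy distribution; the function name and the speaker-position array are illustrative.

```python
import numpy as np

def estimate_source_positions(A, speaker_positions):
    """Energy-weighted center-of-mass estimate P_j = sum_i w_ij P_i.

    A                 : (I, J) non-negative mixing matrix (panning coefficients)
    speaker_positions : (I, d) spatial positions of the I mix audio signals
    Returns a (J, d) array of estimated source positions.
    Assumes w_ij = a_ij^2 / sum_i a_ij^2 (normalized squared panning
    coefficients), which is not stated explicitly in the text.
    """
    W = A ** 2
    W = W / (W.sum(axis=0, keepdims=True) + 1e-12)  # normalize each source column
    return W.T @ speaker_positions                  # weighted sum over channels
```

For example, a source panned entirely to one channel is placed at that channel's position, and a source panned equally to two channels lands midway between them.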
(64) Alternatively or in addition, the spatial position of each audio source 101 may be estimated by reversing the Center of Mass Amplitude Panning (CMAP) algorithm and by using:
(65)
where α.sub.distance is a weight of a constraint term in CMAP which penalizes firing speakers that are far from the audio sources 101, and where α.sub.distance is typically set to 0.01.
(66) The position metadata estimated for conventional channel-based mix audio signals (such as 5.1 and 7.1 multi-channel signals) typically contains only 2D (two-dimensional) information (x and y), since the mix audio signals only contain horizontal channels. The height coordinate z may be estimated with a pre-defined hemisphere function:
(67)
where
(68)
are relative distances between the position of an audio source (x, y) and the center of the space (0.5, 0.5), and where h.sub.max is the maximum object height which typically ranges from 0 to 1.
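Since the hemisphere function itself is not reproduced above, the following sketch uses one plausible choice consistent with the stated constraints (z peaks at h.sub.max above the center (0.5, 0.5) and falls to zero at the walls): z = h.sub.max·sqrt(max(0, 1 − dx.sup.2 − dy.sup.2)). The exact formula of the patent may differ; the function name and scaling are assumptions.

```python
import numpy as np

def estimate_height(x, y, h_max=0.5):
    """Hypothetical hemisphere function for the height z of a 2D position (x, y).

    dx, dy are relative distances of (x, y) from the room center (0.5, 0.5),
    scaled so that |dx| = 1 at the walls. z equals h_max above the center and
    falls to 0 at the boundary of the unit hemisphere. This is an assumed
    formula, not necessarily the patented one.
    """
    dx = (x - 0.5) / 0.5
    dy = (y - 0.5) / 0.5
    return h_max * np.sqrt(max(0.0, 1.0 - dx * dx - dy * dy))
```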
(69)
(70) The method 600 includes updating 601 an un-mixing matrix 221 which is adapted to provide an estimate of the source matrix from the mix audio matrix, based on a mixing matrix 225 which is adapted to provide an estimate of the mix audio matrix from the source matrix. Furthermore, the method 600 includes updating 602 the mixing matrix 225 based on the un-mixing matrix 221 and based on the I mix audio signals 102. In addition, the method 600 includes iterating 603 the updating steps 601, 602 until an overall convergence criterion is met.
(71) By repeatedly and alternately updating the mixing matrix 225 based on the un-mixing matrix 221 and then using the updated mixing matrix 225 to update the un-mixing matrix 221, a precise mixing matrix 225 may be determined, thereby enabling the determination of precise source parameters of the audio sources 101. The method 600 may be performed for different frequency bins f of the frequency domain and/or for different frames n.
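The alternating iteration of method 600 can be sketched structurally as follows, for one frame and frequency bin. The helpers `update_unmixing` and `update_mixing` are placeholders for the update steps 601 and 602, and the relative-change test is one possible overall convergence criterion; all names are illustrative.

```python
import numpy as np

def estimate_parameters(Rxx, A0, update_unmixing, update_mixing,
                        tol=1e-6, max_iter=100):
    """Sketch of the alternating iteration of method 600.

    Rxx             : covariance matrix of the mix audio signals (passed through)
    A0              : initial mixing matrix
    update_unmixing : callable (A, Rxx) -> Omega, placeholder for step 601
    update_mixing   : callable (Omega, Rxx, A) -> A, placeholder for step 602
    """
    A = A0
    for _ in range(max_iter):
        Omega = update_unmixing(A, Rxx)       # step 601: update un-mixing matrix
        A_new = update_mixing(Omega, Rxx, A)  # step 602: update mixing matrix
        change = np.linalg.norm(A_new - A) / (np.linalg.norm(A) + 1e-12)
        A = A_new
        if change < tol:                      # overall convergence criterion
            break
    return A, Omega
```

In practice the loop would be run per frequency bin f and per frame n, as noted above.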
(72) The methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may for example be implemented as software running on a digital signal processor or microprocessor. Other components may for example be implemented as hardware and/or as application specific integrated circuits.
(73) The signals encountered in the described methods and systems may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, for example the Internet.
(74) Various aspects of the present invention may be appreciated from the following enumerated example embodiments (EEEs):

EEE 1. A method (600) for estimating source parameters of J audio sources (101) from I mix audio signals (102), with I,J>1, wherein the I mix audio signals (102) comprise a plurality of frames, wherein the I mix audio signals (102) are representable as a mix audio matrix in a frequency domain, wherein the J audio sources (101) are representable as a source matrix in the frequency domain, wherein the method (600) comprises, for a frame n, updating (601) an un-mixing matrix (221) which is configured to provide an estimate of the source matrix from the mix audio matrix, based on a mixing matrix (225) which is configured to provide an estimate of the mix audio matrix from the source matrix; updating (602) the mixing matrix (225) based on the un-mixing matrix (221) and based on the I mix audio signals (102) for the frame n; and iterating (603) the updating steps (601, 602) until an overall convergence criterion is met.

EEE 2. The method (600) of EEE 1, wherein the method (600) comprises determining a covariance matrix (222) of the mix audio signals (102) based on the mix audio matrix; and the mixing matrix (225) is updated based on the covariance matrix (222) of the mix audio signals (102).

EEE 3. The method (600) of EEE 2, wherein the covariance matrix R.sub.XX,fn (222) of the mix audio signals (102) for frame n and for a frequency bin f of the frequency domain is determined based on an average of covariance matrices of frames of the mix audio signals (102) within a window around the frame n; the covariance matrix of a frame k is determined based on X.sub.fkX.sub.fk.sup.H; and X.sub.fn is the mix audio matrix for frame n and for the frequency bin f.

EEE 4. 
The method (600) of any of EEEs 2 to 3, wherein determining the covariance matrix (222) of the mix audio signals (102) comprises normalizing the covariance matrix (222) for the frame n and for a frequency bin f such that a sum of energies of the mix audio signals (102) for the frame n and for the frequency bin f is equal to a pre-determined normalization value.

EEE 5. The method (600) of any previous EEE, wherein the method (600) comprises determining a covariance matrix (224) of the audio sources (101) based on the mix audio matrix and based on the un-mixing matrix (221); and the un-mixing matrix (221) is updated based on the covariance matrix (224) of the audio sources (101).

EEE 6. The method (600) of EEE 5, wherein the covariance matrix R.sub.SS,fn (224) of the audio sources (101) for frame n and for a frequency bin f of the frequency domain is determined based on R.sub.SS,fn=Ω.sub.fnR.sub.XX,fnΩ.sub.fn.sup.H; R.sub.XX,fn is a covariance matrix (222) of the mix audio signals (102); and Ω.sub.fn is the un-mixing matrix (221).

EEE 7. The method (600) of any previous EEE, wherein the method (600) comprises determining a covariance matrix (224) of noises within the mix audio signals (102); and the un-mixing matrix (221) is updated based on the covariance matrix (224) of noises within the mix audio signals (102).

EEE 8. The method (600) of EEE 7, wherein the covariance matrix (224) of noises is determined based on the mix audio signals (102); and/or the covariance matrix (224) of noises is proportional to the trace of a covariance matrix (222) of the mix audio signals (102); and/or the covariance matrix (224) of noises is determined such that only a main diagonal of the covariance matrix (224) of noises comprises non-zero matrix terms; and/or a magnitude of the matrix terms of the covariance matrix (224) of noises decreases with an increasing number q of iterations of the method (600).

EEE 9. 
The method (600) of any previous EEE, wherein updating (601) the un-mixing matrix (221) comprises improving an un-mixing objective function which is dependent on the un-mixing matrix (221); and/or updating (602) the mixing matrix (225) comprises improving a mixing objective function which is dependent on the mixing matrix (225).

EEE 10. The method (600) of EEE 9, wherein the un-mixing objective function and/or the mixing objective function comprises one or more constraint terms; and a constraint term is dependent on a desired property of the un-mixing matrix (221) or the mixing matrix (225).

EEE 11. The method (600) of EEE 10, wherein the mixing objective function comprises one or more of a constraint term which is dependent on non-negativity of the matrix terms of the mixing matrix (225); a constraint term which is dependent on a number of non-zero matrix terms of the mixing matrix (225); a constraint term which is dependent on a correlation between different columns or different rows of the mixing matrix (225); and/or a constraint term which is dependent on a deviation of the mixing matrix (225) for frame n from a mixing matrix (225) for a preceding frame.

EEE 12. The method (600) of any of EEEs 10 to 11, wherein the un-mixing objective function comprises one or more of a constraint term which is dependent on a capacity of the un-mixing matrix (221) to provide a covariance matrix (224) of the audio sources (101) from a covariance matrix (222) of the mix audio signals (102), such that non-zero matrix terms of the covariance matrix (224) of the audio sources (101) are concentrated towards the main diagonal; a constraint term which is dependent on a degree of invertibility of the un-mixing matrix (221); and/or a constraint term which is dependent on a degree of orthogonality of column vectors or row vectors of the un-mixing matrix (221).

EEE 13. 
The method (600) of any of EEEs 10 to 12, wherein the one or more constraint terms are included into the un-mixing objective function and/or the mixing objective function using one or more constraint weights, respectively, to increase or reduce an impact of the one or more constraint terms on the un-mixing objective function and/or on the mixing objective function.

EEE 14. The method (600) of any of EEEs 9 to 13, wherein the un-mixing objective function and/or the mixing objective function are improved in an iterative manner until a sub-convergence criterion is met, to update the un-mixing matrix (221) and/or the mixing matrix (225), respectively.

EEE 15. The method (600) of EEE 14, wherein improving the mixing objective function comprises repeatedly multiplying the mixing matrix (225) with a multiplier matrix until the sub-convergence criterion is met; and the multiplier matrix is dependent on the un-mixing matrix (221) and on the mix audio signals (102).

EEE 16. The method (600) of EEE 15, wherein the multiplier matrix is dependent on
(75)