SOUND-SOURCE SIGNAL ESTIMATE APPARATUS, SOUND-SOURCE SIGNAL ESTIMATE METHOD, AND PROGRAM

Abstract

The transfer function estimation device includes: a correlation matrix computing unit 43 computing a correlation matrix of N frequency domain signals y(f,l); a signal space basis vector computing unit 44 obtaining M vectors v.sub.1(f), . . . , v.sub.M(f) from eigenvectors of the correlation matrix from highest in the order of corresponding eigenvalues; and a plural RTF estimation unit 45 determining t.sub.1(f), . . . , t.sub.M(f) that satisfy the relationship of Expression (1), determining a matrix D(f) that is not a zero matrix and that makes u.sub.1(f), . . . , u.sub.M(f) defined by Expression (2) sparse in a time direction, determining c.sub.1,1(f), . . . , c.sub.M,N(f) that satisfy the relationship of Expression (3), and outputting c.sub.1(f)/c.sub.1,j(f), . . . , c.sub.M(f)/c.sub.M,j(f) as a relative transfer function, where j is an integer of 1 or more and not more than N.

Claims

1. A transfer function estimation device comprising: a correlation matrix determiner configured to determine a correlation matrix of N frequency domain signals y(f,l) corresponding to N time domain signals picked up by N microphones that form a microphone array, where N is an integer of 2 or more, f is a frequency index, and l is a frame index; a signal space basis vector determiner configured to obtain M vectors v.sub.1(f), . . . , v.sub.M(f) from eigenvectors of the correlation matrix from highest in an order of corresponding eigenvalues, where M is an integer of 2 or more; and a plural RTF estimator configured to determine t.sub.1(f), . . . , t.sub.M(f) that satisfy a relationship of: $Y(f,l) = [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$, [Formula 41] where Y(f,l)=[y(f,l+1), . . . , y(f,l+L)], L being an integer of 2 or more, $\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = D(f) \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 42] determine a matrix D(f) that is not a zero matrix and that makes u.sub.1(f), . . . , u.sub.M(f) defined by an expression above sparse in a time direction, determine c.sub.1,1(f), . . . , c.sub.M,N(f) that satisfy a relationship of:
[c.sub.1(f), . . . ,c.sub.M(f)]=[v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f)
c.sub.i(f)=[c.sub.i,1(f), . . . ,c.sub.i,N(f)].sup.T, i=1, . . . ,M, [Formula 43] and output c.sub.1(f)/c.sub.1,j(f), . . . , c.sub.M(f)/c.sub.M,j(f) as a relative transfer function, where j is an integer of 1 or more and not more than N.

2. The transfer function estimation device according to claim 1, wherein the plural RTF estimator determines a matrix D(f) that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1, in a condition in which diagonal elements of the matrix D(f) are fixed to a predetermined value.

3. The transfer function estimation device according to claim 1, wherein, where A.sup.H is a Hermitian transpose of a matrix A, I.sub.M is an M×M unit matrix, ∥t.sub.i(f)∥.sub.2 is an L2 norm of t.sub.i(f), and t.sub.ni(f)=t.sub.i(f)/∥t.sub.i(f)∥.sub.2, where i=1, . . . , M, the plural RTF estimator determines a matrix A that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1 and that satisfies a following condition: $\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = A \begin{bmatrix} t_{n1}(f) \\ \vdots \\ t_{nM}(f) \end{bmatrix}, \quad A^{H} A = I_{M}$, [Formula 44] and determines a matrix D(f) defined by a following expression: $D(f) = A \begin{bmatrix} 1/\|t_{1}(f)\|_{2} & & 0 \\ & \ddots & \\ 0 & & 1/\|t_{M}(f)\|_{2} \end{bmatrix}$, [Formula 45] using the determined matrix A.

4. A transfer function estimation method comprising: determining, by a correlation matrix determiner, a correlation matrix of N frequency domain signals y(f,l) corresponding to N time domain signals picked up by N microphones that form a microphone array, where N is an integer of 2 or more, f is a frequency index, and l is a frame index; obtaining, by a signal space basis vector determiner, eigenvectors v.sub.1(f), . . . , v.sub.M(f) of the correlation matrix, where M is an integer of 2 or more and not more than N; and determining, by a plural RTF estimator, t.sub.1(f), . . . , t.sub.M(f) that satisfy a relationship of: $Y(f,l) = [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$, [Formula 46] where Y(f,l)=[y(f,l+1), . . . , y(f,l+L)], L being an integer of 2 or more, $\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = D(f) \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 47] determining a matrix D(f) that is not a zero matrix and that makes u.sub.1(f), . . . , u.sub.M(f) defined by an expression above sparse in a time direction, determining c.sub.1,1(f), . . . , c.sub.M,N(f) that satisfy a relationship of:
[c.sub.1(f), . . . ,c.sub.M(f)]=[v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f)
c.sub.i(f)=[c.sub.i,1(f), . . . ,c.sub.i,N(f)].sup.T, i=1, . . . ,M, [Formula 48] and outputting c.sub.1(f)/c.sub.1,j(f), . . . , c.sub.M(f)/c.sub.M,j(f) as a relative transfer function, where j is an integer of 1 or more and not more than N.

5. A computer-readable non-transitory recording medium storing computer-executable program instructions that when executed by a processor cause a computer system to: determine, by a correlation matrix determiner, a correlation matrix of N frequency domain signals y(f,l) corresponding to N time domain signals picked up by N microphones that form a microphone array, where N is an integer of 2 or more, f is a frequency index, and l is a frame index; obtain, by a signal space basis vector determiner, eigenvectors v.sub.1(f), . . . , v.sub.M(f) of the correlation matrix, where M is an integer of 2 or more and not more than N; and determine, by a plural RTF estimator, t.sub.1(f), . . . , t.sub.M(f) that satisfy a relationship of: $Y(f,l) = [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$, [Formula 46] where Y(f,l)=[y(f,l+1), . . . , y(f,l+L)], L being an integer of 2 or more, $\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = D(f) \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 47] determine a matrix D(f) that is not a zero matrix and that makes u.sub.1(f), . . . , u.sub.M(f) defined by an expression above sparse in a time direction, determine c.sub.1,1(f), . . . , c.sub.M,N(f) that satisfy a relationship of:
[c.sub.1(f), . . . ,c.sub.M(f)]=[v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f)
c.sub.i(f)=[c.sub.i,1(f), . . . ,c.sub.i,N(f)].sup.T, i=1, . . . ,M, [Formula 48] and output c.sub.1(f)/c.sub.1,j(f), . . . , c.sub.M(f)/c.sub.M,j(f) as a relative transfer function, where j is an integer of 1 or more and not more than N.

6. The transfer function estimation method according to claim 4, wherein the plural RTF estimator determines a matrix D(f) that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1, in a condition in which diagonal elements of the matrix D(f) are fixed to a predetermined value.

7. The transfer function estimation method according to claim 4, wherein, where A.sup.H is a Hermitian transpose of a matrix A, I.sub.M is an M×M unit matrix, ∥t.sub.i(f)∥.sub.2 is an L2 norm of t.sub.i(f), and t.sub.ni(f)=t.sub.i(f)/∥t.sub.i(f)∥.sub.2, where i=1, . . . , M, the plural RTF estimator determines a matrix A that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1 and that satisfies a following condition: $\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = A \begin{bmatrix} t_{n1}(f) \\ \vdots \\ t_{nM}(f) \end{bmatrix}, \quad A^{H} A = I_{M}$, [Formula 44] and determines a matrix D(f) defined by a following expression: $D(f) = A \begin{bmatrix} 1/\|t_{1}(f)\|_{2} & & 0 \\ & \ddots & \\ 0 & & 1/\|t_{M}(f)\|_{2} \end{bmatrix}$, [Formula 45] using the determined matrix A.

8. The computer-readable non-transitory recording medium according to claim 5, wherein the plural RTF estimator determines a matrix D(f) that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1, in a condition in which diagonal elements of the matrix D(f) are fixed to a predetermined value.

9. The computer-readable non-transitory recording medium according to claim 5, wherein, where A.sup.H is a Hermitian transpose of a matrix A, I.sub.M is an M×M unit matrix, ∥t.sub.i(f)∥.sub.2 is an L2 norm of t.sub.i(f), and t.sub.ni(f)=t.sub.i(f)/∥t.sub.i(f)∥.sub.2, where i=1, . . . , M, the plural RTF estimator determines a matrix A that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1 and that satisfies a following condition: $\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = A \begin{bmatrix} t_{n1}(f) \\ \vdots \\ t_{nM}(f) \end{bmatrix}, \quad A^{H} A = I_{M}$, [Formula 44] and determines a matrix D(f) defined by a following expression: $D(f) = A \begin{bmatrix} 1/\|t_{1}(f)\|_{2} & & 0 \\ & \ddots & \\ 0 & & 1/\|t_{M}(f)\|_{2} \end{bmatrix}$, [Formula 45] using the determined matrix A.

Description

BRIEF DESCRIPTION OF DRAWINGS

[0036] FIG. 1 is a diagram for explaining a beamforming technique.

[0037] FIG. 2 is a diagram for explaining an MVDR method.

[0038] FIG. 3 is a diagram for explaining an existing technique for estimating an RTF.

[0039] FIG. 4 is a diagram illustrating an example of a functional configuration of the transfer function estimation device of this invention.

[0040] FIG. 5 is a diagram illustrating an example of processing steps of the transfer function estimation method of this invention.

[0041] FIG. 6 is a diagram illustrating an example of a functional configuration of a computer.

DESCRIPTION OF EMBODIMENTS

[0042] Hereinafter, one embodiment of this invention will be described in detail. Constituent units having the same functions in the drawings are given the same reference numerals to omit repetitive description.

[0043] [Transfer Function Estimation Device and Method]

[0044] The transfer function estimation device includes, as illustrated in FIG. 4, a microphone array 41, a short-time Fourier transform unit 42, a correlation matrix computing unit 43, a signal space basis vector computing unit 44, and a plural RTF estimation unit 45, for example.

[0045] The transfer function estimation method is realized, for example, by each of the constituent units of the transfer function estimation device performing the processing from step S2 to step S5 described below and illustrated in FIG. 5.

[0046] Below, the constituent units of the transfer function estimation device will each be described.

[0047] The microphone array 41 is configured by N microphones. N is any integer of 2 or more. The time domain signal picked up by each microphone is input to the short-time Fourier transform unit 42.

[0048] The short-time Fourier transform unit 42 performs short-time Fourier transform on each input time domain signal to generate a frequency domain signal y(f,l) (step S2). Here, f is the frequency index, and l is the frame index. y(f,l) represents an N-dimensional vector having N elements of frequency domain signals Y.sub.1(f,l), . . . , Y.sub.N(f,l) corresponding to N time domain signals picked up by N microphones. The generated frequency domain signals y(f,l) are output to the correlation matrix computing unit 43, signal space basis vector computing unit 44, and plural RTF estimation unit 45.
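As an illustration of step S2, the following is a minimal sketch of a multichannel short-time Fourier transform using NumPy. The function name `stft_multichannel`, the Hann window, and the frame length and hop size are assumptions made for this example; the description does not specify them.

```python
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """x: real array of shape (N, num_samples), one row per microphone.

    Returns an array of shape (F, L, N) so that result[f, l] is the
    N-dimensional vector y(f,l).
    """
    window = np.hanning(frame_len)  # assumed analysis window
    n_frames = 1 + (x.shape[1] - frame_len) // hop
    frames = np.stack(
        [x[:, l * hop:l * hop + frame_len] * window for l in range(n_frames)],
        axis=1,
    )                                     # shape (N, L, frame_len)
    spec = np.fft.rfft(frames, axis=-1)   # shape (N, L, F)
    return spec.transpose(2, 1, 0)        # shape (F, L, N)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4096))        # 3 hypothetical microphone signals
y = stft_multichannel(x)
```

Here `y[f, l]` plays the role of y(f,l) for frequency index f and frame index l.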

[0049] When the number of sound sources is M, which is an integer of 2 or more and not more than N, the frequency domain signal y(f,l) is expressed by Expression (1) below; M=2 is one example. The number of sound sources M is predetermined based on other information such as a video image. Alternatively, the number of sound sources M may be obtained by any existing method, such as the method described in NPL 2 or a method that estimates the number of significant eigenvalues from the distribution of the eigenvalues of a correlation matrix.

[Formula 12]

y(f,l)=g.sub.1(f)s.sub.1(f,l)+ . . . +g.sub.M(f)s.sub.M(f,l) (1)

[0050] Here, s.sub.i(f,l) represents the sound of the i-th sound source, where i=1, . . . , M, and g.sub.i(f) represents the transfer characteristic from the i-th sound source to each of the microphones forming the microphone array 41.
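The mixing model of Expression (1) can be illustrated with synthetic data for a single frequency bin. This is a sketch only: the function `simulate_mixture`, the random transfer characteristics, and the 30% activity rate of the sources are assumptions chosen for the example, not values from the description.

```python
import numpy as np

def simulate_mixture(N=4, M=2, L=200, seed=0):
    """Return (Y, G, S) for one frequency bin, with Y = G @ S."""
    rng = np.random.default_rng(seed)
    # Hypothetical transfer characteristics: column i is g_i(f).
    G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
    # Hypothetical source spectra: row i is s_i(f,l), sparse in time.
    S = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
    S = S * (rng.random((M, L)) < 0.3)
    # Column l of Y is y(f,l) = g_1(f) s_1(f,l) + ... + g_M(f) s_M(f,l).
    Y = G @ S
    return Y, G, S

Y, G, S = simulate_mixture()
```

Because every column of Y is a combination of the M columns of G, the matrix Y has rank M in this noise-free setting.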

[0051] The correlation matrix computing unit 43 computes a correlation matrix of the frequency domain signal y(f,l) that is a pickup signal containing a mixture of speeches of several speakers (step S3). More particularly, the correlation matrix computing unit 43 computes a correlation matrix of N frequency domain signals y(f,l) corresponding to N time domain signals picked up by the N microphones that form the microphone array. The computed correlation matrix is output to the signal space basis vector computing unit 44.

[0052] The correlation matrix computing unit 43 computes the correlation matrix by the processing similar to that of the correlation matrix computing unit 23, for example.
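A minimal sketch of step S3 for one frequency bin follows. It assumes that the correlation matrix is estimated as the time average of y(f,l)y.sup.H(f,l) over the available frames; the processing of the correlation matrix computing unit 23 is not reproduced in this section, so this estimator is an assumption.

```python
import numpy as np

def correlation_matrix(Y):
    """Y: (N, L) array whose columns are y(f,l) for one frequency bin."""
    L = Y.shape[1]
    # Time average of the outer products y(f,l) y(f,l)^H.
    return (Y @ Y.conj().T) / L

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 50)) + 1j * rng.standard_normal((3, 50))
R = correlation_matrix(Y)
```

The result is an N×N Hermitian, positive semidefinite matrix, which is what the eigendecomposition in step S4 relies on.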

[0053] The signal space basis vector computing unit 44 decomposes the correlation matrix into eigenvectors and eigenvalues, and obtains eigenvectors v.sub.1(f), . . . , v.sub.M(f) in the same number as the number of sound sources M, from highest in the order of absolute values of the eigenvalues (step S4). In other words, the signal space basis vector computing unit 44 obtains M vectors v.sub.1(f), . . . , v.sub.M(f) from the eigenvectors of the correlation matrix from highest in the order of corresponding eigenvalues.

[0054] Expression (1) means that the frequency domain signal y(f,l), which is an N-dimensional signal vector, necessarily exists in the space spanned by the M vectors g.sub.1(f), . . . , g.sub.M(f). Eigendecomposition of the correlation matrix of the frequency domain signals y(f,l) produces only M eigenvalues with significantly large absolute values, the remaining N-M eigenvalues being substantially 0. The space spanned by the vectors g.sub.1(f), . . . , g.sub.M(f) coincides with the space spanned by v.sub.1(f), . . . , v.sub.M(f). There is generally no one-to-one correspondence between g.sub.1(f), . . . , g.sub.M(f) and v.sub.1(f), . . . , v.sub.M(f), but each of g.sub.1(f), . . . , g.sub.M(f) is expressed by a linear sum of v.sub.1(f), . . . , v.sub.M(f) (see, for example, Reference Literature 1).

[0055] [Reference Literature 1] S. Markovich, S. Gannot, and I. Cohen, “Multichannel Eigenspace Beamforming in a Reverberant Noisy Environment With Multiple Interfering Speech Signals”, IEEE Trans. on Audio, Speech, and Language Processing, 17, 7, pp. 1071-1086, 2009.
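The subspace property described above can be checked numerically. The sketch below uses synthetic, noise-free data; the function name `signal_space_basis` and all test values are assumptions for illustration.

```python
import numpy as np

def signal_space_basis(R, M):
    """Return the M eigenvectors of R with the largest |eigenvalue|."""
    w, V = np.linalg.eigh(R)               # R is Hermitian
    order = np.argsort(np.abs(w))[::-1]    # highest absolute eigenvalue first
    return V[:, order[:M]], w[order]

rng = np.random.default_rng(1)
N, M, L = 4, 2, 300
G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
S = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
Y = G @ S
R = Y @ Y.conj().T / L

V, w = signal_space_basis(R, M)
# Part of each g_i(f) that lies outside span{v_1(f), ..., v_M(f)}.
residual = G - V @ (V.conj().T @ G)
```

In this noise-free case the residual vanishes and the N-M trailing eigenvalues are numerically zero, which matches the statement that each g.sub.i(f) is a linear sum of v.sub.1(f), . . . , v.sub.M(f).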

[0056] The plural RTF estimation unit 45 estimates the RTFs by extracting the information of this linear sum.

[0057] More specifically, the plural RTF estimation unit 45 first decomposes Y(f,l), which is composed of frequency domain signals y(f,l) of continuous L frames where L is an integer of 2 or more:

Y(f,l)=[y(f,l+1), . . . ,y(f,l+L)], [Formula 13]

[0058] using the eigenvectors v.sub.1(f), . . . , v.sub.M(f) extracted by the signal space basis vector computing unit 44 into the following formula:

$Y(f,l) \rightarrow [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 14]

[0059] Here, t.sub.i(f), where i=1, . . . , M, represents a 1×L vector computed by the following formula.

t.sub.i(f)=v.sub.i.sup.H(f)Y(f,l) [Formula 15]

[0060] Here, v being a given vector, v.sup.H is a vector that is the complex conjugate of the transpose of v.
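Formula 15 amounts to one matrix product per frequency bin. The sketch below, on synthetic noise-free data (all values are assumptions for illustration), computes the rows t.sub.i(f) and confirms that they reproduce Y(f,l) as in Formula 14.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, L = 4, 2, 100
G = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
S = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
Y = G @ S                                   # Y(f,l), shape (N, L)

w, V_all = np.linalg.eigh(Y @ Y.conj().T / L)
V = V_all[:, ::-1][:, :M]                   # eigenvectors of the M largest eigenvalues

T = V.conj().T @ Y                          # row i is t_i(f) = v_i^H(f) Y(f,l)
Y_rec = V @ T                               # right-hand side of Formula 14
```

With noise present, `Y_rec` would only approximate `Y`, because the components outside the signal space are discarded.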

[0061] Suppose that t.sub.1(f), . . . , t.sub.M(f) are converted into u.sub.1(f), . . . , u.sub.M(f) by an M×M matrix D(f). Assuming that the source signals are voice signals, for example, the sparsity of the signals is reduced when the voices are mixed together. Therefore, if D(f) that makes u.sub.1(f), . . . , u.sub.M(f) as sparse as possible in the time direction is determined, it is expected that u.sub.1(f), . . . , u.sub.M(f) will be close to the respective speakers' voices before mixing.

[0062] Therefore, the sparsity of u.sub.1(f), . . . , u.sub.M(f) is measured with an L1 norm to obtain a cost function. The plural RTF estimation unit 45 solves the following optimization problem:

$\text{Minimize } \|u_{1}(f)\|_{1} + \cdots + \|u_{M}(f)\|_{1}, \quad \begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = D(f) \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 16]

[0063] under the following constraint:

D.sub.i,i(f)=1 (i=1, . . . ,M) [Formula 17]

[0064] to determine D(f). Here, restricting the diagonal elements of D(f) to 1 prevents D(f) from becoming a zero matrix. The diagonal elements of D(f) may be restricted to predetermined values other than 1. In this case, the diagonal elements may differ from one another. Namely, there may be i, jϵ[1, . . . , M] where

D.sub.i,i(f)≠D.sub.j,j(f). [Formula 18]

[0065] With the main diagonal elements of D(f) fixed to predetermined values in this way, the plural RTF estimation unit 45 determines D(f) that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1. Since the cost function is convex and the constraint is linear, this is a convex optimization problem, so any local minimum is also a global minimum.
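One way to approximate this minimization numerically is sketched below. The description does not name a solver for this problem, so iteratively reweighted least squares (IRLS) is used here purely as an assumed stand-in; the function name `sparsify_unit_diagonal` and the test data are also hypothetical. Because both the cost and the constraint D.sub.i,i(f)=1 separate by row, each row of D(f) is optimized independently.

```python
import numpy as np

def sparsify_unit_diagonal(T, n_iter=50, eps=1e-8):
    """Approximately minimise sum_i ||u_i||_1 with u = D @ T and D[i, i] = 1."""
    M = T.shape[0]
    D = np.eye(M, dtype=complex)
    for i in range(M):
        others = [k for k in range(M) if k != i]
        B, b = T[others], T[i]              # u_i = b + d @ B
        best_d = np.zeros(M - 1, dtype=complex)
        best_cost = np.abs(b).sum()         # cost of the starting point D = I
        d = best_d
        for _ in range(n_iter):
            r = b + d @ B
            # Square roots of the IRLS weights 1 / max(|r|, eps).
            sw = 1.0 / np.sqrt(np.maximum(np.abs(r), eps))
            d = np.linalg.lstsq((B * sw).T, -(b * sw), rcond=None)[0]
            cost = np.abs(b + d @ B).sum()
            if cost < best_cost:            # keep the best iterate seen so far
                best_d, best_cost = d, cost
        D[i, others] = best_d
    return D

rng = np.random.default_rng(3)
L = 400
S = rng.standard_normal((2, L)) + 1j * rng.standard_normal((2, L))
S = S * (rng.random((2, L)) < 0.2)          # hypothetical sparse sources
Q = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
T = Q @ S                                   # stand-in for t_1(f), t_2(f)
D = sparsify_unit_diagonal(T)
U = D @ T
```

Keeping the best iterate ensures that the returned D(f) never has a larger L1 cost than the identity matrix it starts from; a dedicated convex solver could be substituted without changing the interface.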

[0066] Using the 1×L matrix S.sub.i(f,l) of the source signal

S.sub.i(f,l)=[s.sub.i(f,l+1), . . . ,s.sub.i(f,l+L)](i=1, . . . ,M), [Formula 19]

[0067] Y(f,l) can be written as follows.

$Y(f,l) = [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix} = [v_{1}(f), \ldots, v_{M}(f)] D^{-1}(f) \begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = [g_{1}(f), \ldots, g_{M}(f)] \begin{bmatrix} S_{1}(f,l) \\ \vdots \\ S_{M}(f,l) \end{bmatrix}$ [Formula 20]

[0068] This is defined as below.

[c.sub.1(f), . . . ,c.sub.M(f)]=[v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f) [Formula 21]

[0069] If the mixed voice signal is decomposed by D(f) favorably, S.sub.i(f,l) and u.sub.i(f), where i=1, . . . , M, will substantially match each other except for the scaling. Namely, it is expected that the directions of the vectors will be substantially aligned. At the same time, it is expected that the directions of c.sub.i(f) and g.sub.i(f), where i=1, . . . , M, will be substantially aligned, too. Accordingly, if:

c.sub.i(f)=[c.sub.i,1(f), . . . ,c.sub.i,N(f)].sup.T, [Formula 22]

[0070] where i=1, . . . , M, and the j-th microphone is the reference microphone, j being an integer of 1 or more and not more than N, then c.sub.i(f)/c.sub.i,j(f) is the estimate of the relative transfer function relating to the i-th sound source.

[0071] In this way, with L being an integer of 2 or more and Y(f,l)=[y(f,l+1), . . . , y(f,l+L)], the plural RTF estimation unit 45 determines t.sub.1(f), . . . , t.sub.M(f) that satisfy the relationship of the following.

$Y(f,l) = [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 23]

$\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = D(f) \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 24]

[0072] Then, a matrix D(f) that is not a zero matrix and that makes u.sub.1(f), . . . , u.sub.M(f) defined by the expression above sparse in the time direction is determined. Next, c.sub.1,1(f), . . . , c.sub.M,N(f) that satisfy the relationship of:

[c.sub.1(f), . . . ,c.sub.M(f)]=[v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f)

c.sub.i(f)=[c.sub.i,1(f), . . . ,c.sub.i,N(f)].sup.T, i=1, . . . ,M [Formula 25]

[0073] are determined. Then, c.sub.1(f)/c.sub.1,j(f), . . . , c.sub.M(f)/c.sub.M,j(f) are output, where j is an integer of 1 or more and not more than N, as a relative transfer function.
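The final step, from D(f) and the eigenvectors to the relative transfer functions, can be sketched as follows. The function name `relative_transfer_functions` is hypothetical, and the index `j` is zero-based in the code, so `j=0` corresponds to the first microphone. To make the result checkable, the example builds an ideal D(f) for which [v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f) equals the true transfer characteristics; in practice D(f) comes from the sparsity optimization.

```python
import numpy as np

def relative_transfer_functions(V, D, j=0):
    """Columns of the result are c_i(f) / c_{i,j}(f), i = 1, ..., M."""
    C = V @ np.linalg.inv(D)      # [c_1(f), ..., c_M(f)] = V D^{-1}(f)
    return C / C[j, :]            # divide column i by its j-th element

rng = np.random.default_rng(5)
N, M = 4, 2
# Orthonormal basis standing in for v_1(f), ..., v_M(f).
V, _ = np.linalg.qr(rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M)))
# Hypothetical transfer characteristics lying in the span of V.
G = V @ (rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
D = np.linalg.inv(V.conj().T @ G)   # ideal D(f), for which V D^{-1}(f) = G
rtf = relative_transfer_functions(V, D, j=0)
```

Dividing by the reference element removes the unknown scaling of each c.sub.i(f), so the row of the reference microphone is identically 1.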

VARIATION EXAMPLE

[0074] In the optimization described above, when determining u.sub.1(f), . . . , u.sub.M(f) from the time-varying vectors t.sub.1(f), . . . , t.sub.M(f) with the matrix D(f), D(f) is determined such as to make u.sub.1(f), . . . , u.sub.M(f) sparsest in the time direction. For this purpose, the sparsity of u.sub.1(f), . . . , u.sub.M(f) is measured with L1 norms.

[0075] However, the L1 norm used in this way reduces not only when u.sub.1(f), . . . , u.sub.M(f) become sparse in the time direction but also when the amplitudes of u.sub.1(f), . . . , u.sub.M(f) become smaller. Therefore, minimization of the L1 norm does not necessarily always provide a sparsest signal.

[0076] To achieve a sparse signal more reliably, therefore, D(f) is determined such as to make the signal u.sub.1(f), . . . , u.sub.M(f) sparsest under a constraint that the signal power of the signal u.sub.1(f), . . . , u.sub.M(f) is constant.

[0077] Specifically, the plural RTF estimation unit 45 first normalizes the time-varying vectors t.sub.1(f), . . . , t.sub.M(f) so that their respective L2 norms become 1 to obtain normalized time-varying vectors. Namely, the plural RTF estimation unit 45 calculates t.sub.ni(f)=t.sub.i(f)/∥t.sub.i(f)∥.sub.2, where i=1, . . . , M. ∥t.sub.i(f)∥.sub.2 is the L2 norm of t.sub.i(f). The normalized time-varying vectors are expressed as (t.sub.n1(f), . . . , t.sub.nM(f)).

[0078] Next, the plural RTF estimation unit 45 solves the optimization problem that uses the L1 norm as a cost function to determine a matrix A. Namely, the plural RTF estimation unit 45 determines the matrix A that minimizes |u.sub.1(f)|.sub.1+ . . . +|u.sub.M(f)|.sub.1 and that satisfies the following condition, using t.sub.n1(f), . . . , t.sub.nM(f).

$\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = A \begin{bmatrix} t_{n1}(f) \\ \vdots \\ t_{nM}(f) \end{bmatrix}, \quad A^{H} A = I_{M}$ [Formula 26]

[0079] Here, A.sup.H is the Hermitian transpose (conjugate transpose) of the matrix A, and I.sub.M is the M×M unit matrix. The elements of the matrix A can be written as follows. Each element of the matrix A may also be called a coefficient.

$A = \begin{bmatrix} \alpha_{1,1} & \cdots & \alpha_{1,M} \\ \vdots & \ddots & \vdots \\ \alpha_{M,1} & \cdots & \alpha_{M,M} \end{bmatrix}$ [Formula 27]

[0080] This optimization problem can be solved by applying the Alternating Direction Method of Multipliers (ADMM) (see, for example, Reference Literature 2).

[0081] [Reference Literature 2] S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein, “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers”, Foundations and Trends in Machine Learning, Vol. 3, No. 1 (2010) 1-122.

[0082] Using the matrix A, the sparsest signal is expressed as follows.

$\begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = A \begin{bmatrix} t_{n1}(f) \\ \vdots \\ t_{nM}(f) \end{bmatrix} = A \begin{bmatrix} 1/\|t_{1}(f)\|_{2} & & 0 \\ & \ddots & \\ 0 & & 1/\|t_{M}(f)\|_{2} \end{bmatrix} \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix}$ [Formula 28]

[0083] Here, if:

$D(f) = A \begin{bmatrix} 1/\|t_{1}(f)\|_{2} & & 0 \\ & \ddots & \\ 0 & & 1/\|t_{M}(f)\|_{2} \end{bmatrix}$, [Formula 29]

[0084] then the relationship

$Y(f,l) = [v_{1}(f), \ldots, v_{M}(f)] \begin{bmatrix} t_{1}(f) \\ \vdots \\ t_{M}(f) \end{bmatrix} = [v_{1}(f), \ldots, v_{M}(f)] D^{-1}(f) \begin{bmatrix} u_{1}(f) \\ \vdots \\ u_{M}(f) \end{bmatrix} = [g_{1}(f), \ldots, g_{M}(f)] \begin{bmatrix} S_{1}(f,l) \\ \vdots \\ S_{M}(f,l) \end{bmatrix}$ [Formula 30]

[0085] is established. Thus, by using the D(f) described above, the relative transfer function of each sound source can be estimated by the method similar to the foregoing.
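The variation can be sketched with a simple ADMM-style iteration. The particular splitting used here (an auxiliary variable U constrained to equal A t.sub.n, a unitary Procrustes update for A via the singular value decomposition, and complex soft thresholding for U) is an assumption made for this example and is not taken from the description or from Reference Literature 2 as a prescribed algorithm. Because the unitary constraint is not convex, the iteration is a heuristic, and the function name `sparsify_unitary` and the parameter `kappa_scale` are hypothetical.

```python
import numpy as np

def soft_threshold(x, kappa):
    """Complex soft thresholding: shrink magnitudes by kappa, keep phases."""
    mag = np.maximum(np.abs(x), 1e-12)
    return x * np.maximum(1.0 - kappa / mag, 0.0)

def sparsify_unitary(T, n_iter=200, kappa_scale=0.1):
    """Heuristically minimise ||A @ Tn||_1 subject to A^H A = I."""
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    Tn = T / norms                              # rows t_n1(f), ..., t_nM(f)
    M = T.shape[0]
    kappa = kappa_scale * np.abs(Tn).mean()     # plays the role of 1/rho
    A = np.eye(M, dtype=complex)
    U = Tn.copy()
    Lam = np.zeros_like(Tn)                     # scaled dual variable
    best_A, best_cost = A, np.abs(Tn).sum()
    for _ in range(n_iter):
        # A-update: unitary Procrustes problem, solved with an SVD.
        P, s, Qh = np.linalg.svd((U - Lam) @ Tn.conj().T)
        A = P @ Qh
        # U-update: proximal step of the L1 norm.
        U = soft_threshold(A @ Tn + Lam, kappa)
        # Dual update.
        Lam = Lam + A @ Tn - U
        cost = np.abs(A @ Tn).sum()
        if cost < best_cost:                    # keep the best unitary iterate
            best_A, best_cost = A, cost
    D = best_A @ np.diag(1.0 / norms.ravel())   # Formula 29
    return best_A, D

rng = np.random.default_rng(4)
L = 400
S = rng.standard_normal((2, L)) + 1j * rng.standard_normal((2, L))
S = S * (rng.random((2, L)) < 0.2)              # hypothetical sparse sources
Q = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
T = Q @ S                                       # stand-in for t_1(f), t_2(f)
A, D = sparsify_unitary(T)
```

Every iterate of A is a product of unitary factors from the SVD, so the constraint A.sup.H A=I.sub.M holds exactly up to rounding, and the returned D(f) satisfies D(f)t=A t.sub.n as in Formula 28.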

[0086] Namely, using the determined D(f) and eigenvectors v.sub.1(f), . . . , v.sub.M(f), the plural RTF estimation unit 45 determines c.sub.1,1(f), . . . , c.sub.M,N(f) that satisfy the relationship of the following.

[c.sub.1(f), . . . ,c.sub.M(f)]=[v.sub.1(f), . . . ,v.sub.M(f)]D.sup.−1(f)

c.sub.i(f)=[c.sub.i,1(f), . . . ,c.sub.i,N(f)].sup.T, i=1, . . . ,M [Formula 31]

[0087] Then, c.sub.1(f)/c.sub.1,j(f), . . . , c.sub.M(f)/c.sub.M,j(f) are output, where j is an integer of 1 or more and not more than N, as a relative transfer function.

[0088] The pickup signal contains noise, so that the time-varying vectors t.sub.1(f), . . . , t.sub.M(f) calculated from the pickup signal also contain noise-originated components as well as source-originated components.

[0089] In the method described above, the time-varying vectors are normalized. Before normalization, however, the norms of t.sub.1(f), . . . , t.sub.M(f) take various values depending on the circumstances. Looking at a particular frequency f, when there are equal amounts of the component of the first sound source and the component of the m-th sound source, the norms of t.sub.1(f), . . . , t.sub.M(f) show close values. Here, m is an integer from 2 to M.

[0090] When, however, the component of the second sound source is significantly smaller than that of the first sound source, for example, the norm of t.sub.2(f) becomes very small as compared to that of t.sub.1(f). In such a case, the normalized time-varying vector t.sub.n2(f), which is the normalized t.sub.2(f), may contain only a very small component originating from the second sound source, the remainder being mostly noise.

[0091] Using such t.sub.n2(f) may possibly cause large deterioration of the estimation of RTF.

[0092] For this reason, an upper limit may be provided to the coefficient related to the normalized time-varying vector t.sub.n2(f), when the norm of t.sub.2(f) is very small relative to t.sub.1(f), to inhibit deterioration of the RTF estimate.

[0093] The plural RTF estimation unit 45 determines such an upper limit in the following manner.

[0094] First, it is assumed that t.sub.1(f) and t.sub.2(f) each contain an equal amount of noise.

[0095] The plural RTF estimation unit 45 sets the norm ratios θ.sub.1, θ.sub.2 when normalizing the time-varying vectors as follows.

$\theta_{1} = \frac{\|t_{n1}(f)\|_{2}}{\|t_{1}(f)\|_{2}}, \quad \theta_{2} = \frac{\|t_{n2}(f)\|_{2}}{\|t_{2}(f)\|_{2}}$ [Formula 32]

[0096] t.sub.1(f) and t.sub.2(f) are determined from the eigenvectors of the correlation matrix. Since the eigenvalue related to t.sub.1(f) is larger than the eigenvalue related to t.sub.2(f), ∥t.sub.1(f)∥.sub.2≥∥t.sub.2(f)∥.sub.2. After the normalization, the norms are both 1, so that θ.sub.1≤θ.sub.2.

[0097] There is the following relationship, where Δt.sub.n1(f) and Δt.sub.n2(f) respectively represent the noise contained in the normalized time-varying vectors (t.sub.n1(f), t.sub.n2(f)).

$\frac{\|\Delta t_{n1}(f)\|_{2}}{\|\Delta t_{n2}(f)\|_{2}} = \frac{\theta_{1}}{\theta_{2}}$ [Formula 33]

[0098] Since θ.sub.1≤θ.sub.2, ∥Δt.sub.n2(f)∥.sub.2≥∥Δt.sub.n1(f)∥.sub.2.
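The relationship of Formula 33 follows from the equal-noise assumption stated above. The short derivation below is added for illustration; Δt.sub.1(f) and Δt.sub.2(f), denoting the noise contained in t.sub.1(f) and t.sub.2(f) before normalization, are symbols introduced here and do not appear in the original text.

```latex
\begin{aligned}
t_{ni}(f) &= \theta_{i}\, t_{i}(f)
  \;\Rightarrow\; \Delta t_{ni}(f) = \theta_{i}\, \Delta t_{i}(f), \quad i = 1, 2, \\
\|\Delta t_{1}(f)\|_{2} &= \|\Delta t_{2}(f)\|_{2}
  \;\Rightarrow\;
  \frac{\|\Delta t_{n1}(f)\|_{2}}{\|\Delta t_{n2}(f)\|_{2}}
  = \frac{\theta_{1}\,\|\Delta t_{1}(f)\|_{2}}{\theta_{2}\,\|\Delta t_{2}(f)\|_{2}}
  = \frac{\theta_{1}}{\theta_{2}}.
\end{aligned}
```

The first line uses the fact that the normalized vectors have unit norm, so θ.sub.i=1/∥t.sub.i(f)∥.sub.2.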

[0099] Now, when the sparse signal vector u.sub.1(f) is expressed using coefficients α.sub.1,1 and α.sub.1,2 as:

u.sub.1(f)=α.sub.1,1t.sub.n1(f)+α.sub.1,2t.sub.n2(f), [Formula 34]

[0100] the error contained in u.sub.1(f) is as follows.

|α.sub.1,1|.sup.2∥Δt.sub.n1(f)∥.sub.2.sup.2+|α.sub.1,2|.sup.2∥Δt.sub.n2(f)∥.sub.2.sup.2 [Formula 35]

[0101] The size of the coefficient α.sub.1,2 is limited so that this error is at most T times ∥Δt.sub.n1(f)∥.sub.2.sup.2. Namely, the upper limit of the coefficient α.sub.1,2 is set by:

$\begin{aligned} |\alpha_{1,1}|^{2} \|\Delta t_{n1}(f)\|_{2}^{2} + |\alpha_{1,2}|^{2} \|\Delta t_{n2}(f)\|_{2}^{2} &\leq T \|\Delta t_{n1}(f)\|_{2}^{2} \\ |\alpha_{1,2}|^{2} &\leq (T - |\alpha_{1,1}|^{2}) \frac{\|\Delta t_{n1}(f)\|_{2}^{2}}{\|\Delta t_{n2}(f)\|_{2}^{2}} = (T - |\alpha_{1,1}|^{2}) \frac{\theta_{1}^{2}}{\theta_{2}^{2}} \\ |\alpha_{1,2}| &\leq \sqrt{T - |\alpha_{1,1}|^{2}}\, \frac{\theta_{1}}{\theta_{2}}, \end{aligned}$ [Formula 36]

[0102] where T is a predetermined positive number. It is desirable to use a value of 100 or more for T. Since |α.sub.1,1|.sup.2<<T, the upper limit may be specified by the following instead of the above.

$|\alpha_{1,2}| \leq \sqrt{T}\, \frac{\theta_{1}}{\theta_{2}}$ [Formula 37]

[0103] Providing an upper limit to the coefficient α.sub.1,2 related to the normalized time-varying vector t.sub.n2(f) this way increases the estimation accuracy of RTF.

[0104] When the number M of sound sources is larger than 2, the norm ratios θ.sub.1, θ.sub.2, . . . , θ.sub.M when normalizing time-varying vectors are given as:

$\theta_{1} = \frac{\|t_{n1}(f)\|_{2}}{\|t_{1}(f)\|_{2}}, \quad \theta_{2} = \frac{\|t_{n2}(f)\|_{2}}{\|t_{2}(f)\|_{2}}, \quad \ldots, \quad \theta_{M} = \frac{\|t_{nM}(f)\|_{2}}{\|t_{M}(f)\|_{2}}$, [Formula 38]

[0105] and the m′-th (1≤m′≤M) extracted signal is expressed by coefficients α.sub.m′,1, . . . , α.sub.m′,M as follows:

u.sub.m′(f)=α.sub.m′,1t.sub.n1(f)+α.sub.m′,2t.sub.n2(f)+ . . . +α.sub.m′,Mt.sub.nM(f) [Formula 39]

[0106] In this case, the plural RTF estimation unit 45 may determine the upper limit for the size of the coefficient α.sub.m′,m by the following.

$|\alpha_{m',m}| \leq \sqrt{T}\, \frac{\theta_{1}}{\theta_{m}} \quad (2 \leq m \leq M)$ [Formula 40]
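The upper limits of Formula 40 depend only on the norms of the time-varying vectors before normalization, since θ.sub.m=1/∥t.sub.m(f)∥.sub.2 when the normalized vectors have unit norm. The sketch below computes them; the function name `coefficient_upper_limits` and the test vectors are assumptions for illustration, and `T_factor` stands for the predetermined positive number T.

```python
import numpy as np

def coefficient_upper_limits(T_vectors, T_factor=100.0):
    """Upper limits sqrt(T) * theta_1 / theta_m for m = 2, ..., M.

    T_vectors: (M, L) array whose rows are t_1(f), ..., t_M(f).
    """
    norms = np.linalg.norm(T_vectors, axis=1)
    theta = 1.0 / norms                     # theta_m with ||t_nm(f)||_2 = 1
    return np.sqrt(T_factor) * theta[0] / theta[1:]

# Hypothetical vectors with L2 norms 10, 1 and 0.1.
T_vectors = np.array([[10.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.1]])
limits = coefficient_upper_limits(T_vectors, T_factor=100.0)
```

With these values the limits are 1.0 for m=2 and 0.1 for m=3: the weaker a component is relative to the first one, the smaller the coefficient it is allowed.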

[0107] When the number of sound sources is M, the plural RTF estimation unit 45 estimates, at each frequency, M relative transfer function vectors c.sup.m(f)=c.sub.m(f)/c.sub.m,j(f), where m=1, . . . , M. The relative transfer function vector c.sup.m(f) is the m-th relative transfer function vector generated by the plural RTF estimation unit 45.

[0108] Here, the correspondence between the relative transfer functions from index 1 to index M and the sound sources, i.e., the correspondence between the indexes m′ of u.sub.m′(f) (1≤m′≤M) and the sound sources, is not necessarily the same at every frequency. Therefore, it is necessary to determine, at each frequency, the index σ(f,m) of the sound source to which u.sub.m(f) corresponds. This is called permutation solution.

[0109] A permutation solution unit 46 may perform this permutation solution. The permutation solution may be realized, for example, by the method described in Reference Literature 3.

[0110] [Reference Literature 3] H. Sawada, S. Araki, S. Makino, “MLSP 2007 Data Analysis Competition: Frequency-Domain Blind Source Separation for Convolutive Mixtures of Speech/Audio Signals”, IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2007), pp. 45-50, August 2007.

[0111] At a given frequency f, the relative transfer function vector c.sup.m(f) corresponds to u.sub.m(f). By permutation solution, this relative transfer function vector c.sup.m(f) corresponds to the σ(f,m)-th sound source.

[0112] While the embodiment and variation example have been described above, it should be understood that specific configurations are not limited to those of the embodiment and any design changes or the like made without departing from the scope of this invention shall be included in this invention.

[0113] Various processing steps described above in the embodiment may not only be executed in chronological order in accordance with the description, but also be executed in parallel or individually in accordance with the processing capacity of the device executing the processing, or in accordance with necessity.

[0114] [Program and Recording Medium]

[0115] When the various processing functions of each of the devices described above are to be realized by a computer, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each of the devices described above are realized on the computer. For example, the various processing steps described above may be performed by loading the program to be executed into a recording unit 2020 of the computer illustrated in FIG. 6, and by causing the control unit 2010, input unit 2030, output unit 2040, etc., to operate.

[0116] The program that describes the processing contents may be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

[0117] This program may be distributed by selling, transferring, leasing, etc., a portable recording medium such as a DVD, CD-ROM and the like on which this program is recorded, for example. Moreover, this program may be distributed by storing the program in a memory device of a server computer, and by forwarding this program from the server computer to another computer via a network.

[0118] A computer that executes such a program may, for example, first temporarily store the program recorded on a portable recording medium, or the program forwarded from a server computer, in its own memory device. When executing the processing, the computer reads out the program stored in its own memory device and executes the processing in accordance with the read-out program. As an alternative form of executing this program, the computer may read out the program directly from a portable recording medium and execute the processing in accordance with the program. Further, every time a program is forwarded from a server computer to this computer, the processing in accordance with the received program may be executed successively. In an alternative configuration, instead of forwarding the program from a server computer to this computer, the processing described above may be executed by a service known as ASP (Application Service Provider), which realizes the processing functions only through execution instructions and acquisition of results. It should be understood that the program in this embodiment includes information to be provided for processing by a computer based on the program (such as data that defines the processing of a computer, though it is not a direct instruction to the computer).

[0119] Note that, instead of configuring the device by executing a predetermined program on a computer as in this embodiment, at least some of these processing contents may be realized by hardware.

REFERENCE SIGNS LIST

[0120] 41 Microphone array
[0121] 42 Short-time Fourier transform unit
[0122] 43 Correlation matrix computing unit
[0123] 44 Signal space basis vector computing unit
[0124] 45 Estimation unit
