SOUND SOURCE SEPARATION APPARATUS, SOUND SOURCE SEPARATION METHOD, AND PROGRAM
20230079569 · 2023-03-16
Abstract
A sound source separation device (10) acquires, from a mixed signal including sounds that came from a plurality of sound sources, a separated signal including an emphasized sound for every sound source. A signal conversion unit (1) converts the mixed signal into the frequency domain. A separated signal estimation unit (2) acquires the separated signals from the mixed signal using an optimized filter. A gradient calculation unit (3) calculates the gradient of a cost function using the mixed signal and the separated signals. A filter update unit (4) optimizes the filter to fulfill separating, for every sound source, a sound emitted from the sound source, and to fulfill having, for every sound source, strong directivity in a direction of the sound source compared with a direction not of the sound source. A signal inverse conversion unit (5) converts the separated signals into the time domain.
Claims
1. A sound source separation device comprising a processor configured to execute a method comprising: acquiring a separated signal from a mixed signal including sounds that came from a plurality of sound sources, wherein the separated signal includes an emphasized sound for a sound source, using a separation filter optimized to: fulfill separating, for a first sound source, a sound emitted from the first sound source, and fulfill having, for the first sound source, strong directivity in a direction of the first sound source compared with a direction not of the first sound source.
2. The sound source signal separation device according to claim 1, wherein the separation filter is obtained by optimizing a likelihood of a target sound source and an index value that represents having strong directivity toward the sound source, based on a single cost function.
3. The sound source signal separation device according to claim 2, wherein the cost function is defined by the following equations, where t={1, . . . , T} represents a time frame, n={1, . . . , N} represents a sound source, f={1, . . . , F} represents a frequency bin, p(y.sub.tn.sup.(k)) is a stochastic model to which conforms a vector y.sub.tn.sup.(k) that collects a separated signal of a frequency domain in a dimension of the frequency bin, W.sub.f.sup.(k) is a separation matrix whose rows contain a separation filter at a present time k, γ is a weight hyperparameter, a.sub.θf is an array manifold vector assuming the target sound source came from a direction of arrival θ={1, . . . , Θ} by plane wave, and B.sub.f is a scaling matrix,
4. The sound source signal separation device according to claim 2, wherein the separation filter is optimized based on frequency characteristics of the sounds emitted from the sound sources.
5. The sound source signal separation device according to claim 4, wherein the separation filter is optimized by calculating the following equations, where f.sub.1 and f.sub.2 are predetermined frequencies, the outline character I is an indicator function, a.sub.θf is an array manifold vector assuming the target sound source came from a direction of arrival θ by plane wave, B.sub.f is a scaling matrix, and W.sub.f.sup.(k) is a separation matrix whose rows contain a separation filter at a present time k,
6. A computer implemented method for acquiring sound source separation, the method comprising: acquiring a separated signal from the mixed signal including sounds that came from a plurality of sound sources, using a separation filter optimized to: fulfill separating, for a first sound source, a sound emitted from the first sound source, and fulfill having, for the first sound source, strong directivity in a direction of the first sound source compared with a direction not of the first sound source, wherein the separated signal includes an emphasized sound for every sound source.
7. A computer-readable non-transitory recording medium storing computer-executable program instructions that when executed by a processor cause a computer to execute a method comprising: acquiring a separated signal from the mixed signal including sounds that came from a plurality of sound sources, using a separation filter optimized to: fulfill separating a sound emitted from a sound source of the plurality of sources, and fulfill having strong directivity in a direction of the sound source compared with a direction not of the sound source, wherein the separated signal includes an emphasized sound for the sound source.
8. The sound source signal separation device according to claim 3, wherein the separation filter is optimized based on frequency characteristics of the sounds emitted from the sound sources.
9. The computer implemented method according to claim 6, wherein the separation filter is obtained by optimizing a likelihood of a target sound source and an index value that represents having strong directivity toward the sound source, based on a single cost function.
10. The computer implemented method according to claim 9, wherein the cost function is defined by the following equations, where t={1, . . . , T} represents a time frame, n={1, . . . , N} represents a sound source, f={1, . . . , F} represents a frequency bin, p(y.sub.tn.sup.(k)) is a stochastic model to which conforms a vector y.sub.tn.sup.(k) that collects a separated signal of a frequency domain in a dimension of the frequency bin, W.sub.f.sup.(k) is a separation matrix whose rows contain a separation filter at a present time k, γ is a weight hyperparameter, a.sub.θf is an array manifold vector assuming the target sound source came from a direction of arrival θ={1, . . . , Θ} by plane wave, and B.sub.f is a scaling matrix,
11. The computer implemented method according to claim 9, wherein the separation filter is optimized based on frequency characteristics of the sounds emitted from the sound sources.
12. The computer implemented method according to claim 10, wherein the separation filter is optimized based on frequency characteristics of the sounds emitted from the sound sources.
13. The computer implemented method according to claim 11, wherein the separation filter is optimized by calculating the following equations, where f.sub.1 and f.sub.2 are predetermined frequencies, the outline character I is an indicator function, a.sub.θf is an array manifold vector assuming the target sound source came from a direction of arrival θ by plane wave, B.sub.f is a scaling matrix, and W.sub.f.sup.(k) is a separation matrix whose rows contain a separation filter at a present time k,
14. The computer-readable non-transitory recording medium according to claim 7, wherein the separation filter is obtained by optimizing a likelihood of a target sound source and an index value that represents having strong directivity toward the sound source, based on a single cost function.
15. The computer-readable non-transitory recording medium according to claim 14, wherein the cost function is defined by the following equations, where t={1, . . . , T} represents a time frame, n={1, . . . , N} represents a sound source, f={1, . . . , F} represents a frequency bin, p(y.sub.tn.sup.(k)) is a stochastic model to which conforms a vector y.sub.tn.sup.(k) that collects a separated signal of a frequency domain in a dimension of the frequency bin, W.sub.f.sup.(k) is a separation matrix whose rows contain a separation filter at a present time k, γ is a weight hyperparameter, a.sub.θf is an array manifold vector assuming the target sound source came from a direction of arrival θ={1, . . . , Θ} by plane wave, and B.sub.f is a scaling matrix,
16. The computer-readable non-transitory recording medium according to claim 14, wherein the separation filter is optimized based on frequency characteristics of the sounds emitted from the sound sources.
17. The computer-readable non-transitory recording medium according to claim 15, wherein the separation filter is optimized based on frequency characteristics of the sounds emitted from the sound sources.
18. The computer-readable non-transitory recording medium according to claim 16, wherein the separation filter is optimized by calculating the following equations, where f.sub.1 and f.sub.2 are predetermined frequencies, the outline character I is an indicator function, a.sub.θf is an array manifold vector assuming the target sound source came from a direction of arrival θ by plane wave, B.sub.f is a scaling matrix, and W.sub.f.sup.(k) is a separation matrix whose rows contain a separation filter at a present time k,
Description
BRIEF DESCRIPTION OF DRAWINGS
[0013]
[0014]
[0015]
DESCRIPTION OF EMBODIMENTS
[0016] Hereinafter, embodiments of this invention will be described in detail. Note that the same reference numerals are given to constituent elements having the same function in the drawings, and redundant description will be omitted.
Embodiments
[0017] Embodiments of this invention are a sound source separation device and method for executing an audio processing algorithm for separating each target sound source from a mixed signal composed of a plurality of mixed sound source signals. This audio processing algorithm includes (1) a signal conversion step of converting a mixed signal that is defined in the time domain into a mixed signal of the frequency domain, (2) a separated signal estimation step of estimating a separated signal of the frequency domain at a present time k, by applying a separation filter that is estimated at the present time k to the mixed signal of the frequency domain derived in the signal conversion step, (3) a gradient calculation step of calculating respective gradients of the likelihood relating to the separation filter that is estimated at the present time k and regularization that is based on the direction of arrival, using the mixed signal of the frequency domain derived in the signal conversion step and the separated signal of the frequency domain derived in the separated signal estimation step, (4) a filter update step of updating the separation filter, using the gradients calculated in the gradient calculation step, and (5) a signal inverse conversion step of converting the separated signal of the frequency domain derived in the separated signal estimation step into a separated signal that is defined in the time domain.
[0018] A sound source separation device 10 of an embodiment is an audio signal processing device that receives input of a mixed signal of the time domain that includes sounds that came from a plurality of sound sources, and outputs a separated signal of the time domain that includes an emphasized sound for every sound source. As illustrated in
[0019] The sound source separation device 10 is, for example, a special device constituted by a special program being loaded onto a known or dedicated computer having a Central Processing Unit (CPU) and a main storage device (Random Access Memory (RAM)), and the like. The sound source separation device 10 executes various processing under the control of the central processing unit, for example. Data input to the sound source separation device 10 and data obtained by the various processing is stored in the main storage device, for example, and data stored in the main storage device is read out to the central processing unit and utilized in other processing as required. The processing units of the sound source separation device 10 may be constituted at least in part by hardware such as an integrated circuit.
[0020] The processing procedure of the sound source separation method that is executed by the sound source separation device 10 of an embodiment will be described, with reference to
[0021] In this embodiment, the number N of sound sources and the number M of microphones are known. Also, the input of the sound source separation device 10 is a mixed signal X.sub.tm∈R of the time domain that is acquired from an m∈{1, . . . , M}th microphone. Here, t∈{1, . . . , T} represents each time frame, and T represents the maximum time frame. Also, R is the set of all real numbers.
[0022] In step S1, the signal conversion unit 1 converts the mixed signal X.sub.tm of the time domain input to the sound source separation device 10 into a mixed signal x.sub.ftm∈C of the frequency domain, using the Short-Time Fourier Transform (STFT) or the like. Here, f∈{1, . . . , F} represents each frequency bin, and F represents the maximum frequency bin. Also, C is the entire set of complex numbers. The signal conversion unit 1 outputs the mixed signal x.sub.ftm of the frequency domain to the separated signal estimation unit 2 and the gradient calculation unit 3.
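The conversion of step S1 can be sketched as follows. This is a minimal illustration, assuming a Hann window, a frame length of 512 samples and a hop of 256 samples; none of these values, nor the function name `stft`, come from the embodiment, which only requires a short-time Fourier transform "or the like".

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Naive STFT of a multichannel signal (illustrative parameters).

    x: real array of shape (num_samples, M), one column per microphone.
    Returns a complex array of shape (F, T, M) with F = frame_len // 2 + 1.
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (x.shape[0] - frame_len) // hop
    # Cut the signal into overlapping, windowed frames: (T, frame_len, M).
    frames = np.stack(
        [x[t * hop : t * hop + frame_len] * window[:, None] for t in range(num_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)   # (T, F, M)
    return spec.transpose(1, 0, 2)       # (F, T, M), matching the index order x_ftm

rng = np.random.default_rng(0)
mixture = rng.standard_normal((16000, 2))  # hypothetical 1 s recording, 2 microphones
X = stft(mixture)
print(X.shape)
```

The output is indexed as (frequency bin, time frame, microphone), which mirrors the subscript order of x.sub.ftm.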
[0023] In step S2, the separated signal estimation unit 2 first creates a separation matrix W.sub.f.sup.(k)=[w.sub.1f.sup.(k), . . . , w.sub.Nf.sup.(k)].sup.T∈C.sup.N×M whose rows contain the separation filters w.sub.nf.sup.(k)∈C.sup.1×M that are estimated at the present time k. Note that .sup.⋅T represents transposition. Next, the separated signal estimation unit 2 estimates a separated signal y.sub.ftn.sup.(k) of the frequency domain at the present time k, by calculating the matrix product of the separation matrix W.sub.f.sup.(k) and a vector x.sub.ft=[x.sub.ft1, . . . , x.sub.ftM].sup.T∈C.sup.M×1 of the mixed signal x.sub.ftm of the frequency domain. Specifically, the separated signal estimation unit 2 calculates equation (1).
[Math. 1]
y.sub.ft.sup.(k)=W.sub.f.sup.(k)x.sub.ft (1)
[0024] Here, y.sub.ft.sup.(k)=[y.sub.ft1.sup.(k), . . . , y.sub.ftN.sup.(k)].sup.T∈C.sup.N×1. The separation filter w.sub.nf.sup.(k) will output a separated signal y.sub.ftn.sup.(k) of the frequency domain that corresponds to an n∈{1, . . . , N}th sound source from the mixed signal vector x.sub.ft of the frequency domain. The separated signal estimation unit 2 outputs the separated signal y.sub.ftn.sup.(k) of the frequency domain to the gradient calculation unit 3.
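Equation (1) is a per-bin, per-frame matrix product, which can be sketched as below. The array layout (F, T, M) and the identity initialization are assumptions made for illustration only.

```python
import numpy as np

def separate(W, X):
    """Apply y_ft = W_f x_ft for every frequency bin f and time frame t.

    W: (F, N, M) separation matrices, X: (F, T, M) mixture. Returns (F, T, N).
    """
    return np.einsum("fnm,ftm->ftn", W, X)

F, T, N, M = 4, 10, 2, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))
W = np.tile(np.eye(N, M, dtype=complex), (F, 1, 1))  # identity start (an assumption)
Y = separate(W, X)
```

With an identity separation matrix the output simply reproduces the mixture, which is a convenient sanity check before any optimization has run.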
[0025] In step S3, the gradient calculation unit 3 calculates the gradient of the likelihood relating to the separation filter w.sub.nf.sup.(k) that is estimated at the present time k and the gradient of regularization that is based on the direction of arrival, using the mixed signal x.sub.ftm of the frequency domain which is the output result of the signal conversion unit 1 and the separated signal y.sub.ftn.sup.(k) of the frequency domain which is the output result of the separated signal estimation unit 2. The gradient calculation unit 3 outputs the gradients to the filter update unit 4. Hereinafter, the method of calculating the gradients will be described in detail.
[0026] First, a negative log likelihood L.sub.NLL.sup.(k) at the present time k is defined as in equation (2) in relation to the mixed signal vector x.sub.tm=[x.sub.1tm, . . . , x.sub.Ftm].sup.T that collects the mixed signal x.sub.ftm of the frequency domain in the dimension of the frequency bin.
[0027] Equation (2) can be written as in equation (3), taking the linear constraint equation (1) into consideration.
[0028] Here, y.sub.tn.sup.(k) represents a separated signal vector [y.sub.1tn.sup.(k), . . . , y.sub.Ftn.sup.(k)].sup.T∈C.sup.F×1 that collects the separated signal y.sub.ftn.sup.(k) of the frequency domain in the dimension of the frequency bin. p(y.sub.tn.sup.(k)) represents a stochastic model to which the separated signal vector y.sub.tn.sup.(k) conforms. Note that the stochastic model that is used here is generally the independent Laplacian distribution model (e.g., see NPL 1) or the like, although there is no particular restriction to the model in the present invention.
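For orientation, the widely used independent vector analysis form of this cost is sketched below. It is an assumption rather than a reproduction of equations (2) and (3), which do not appear in this text: it supposes N=M so that det W.sub.f exists, and it uses a spherical Laplacian prior as one example of p(y.sub.tn.sup.(k)).

```latex
\mathcal{L}_{\mathrm{NLL}}^{(k)}
  = -\frac{1}{T}\sum_{t=1}^{T}\sum_{n=1}^{N}\log p\!\left(\mathbf{y}_{tn}^{(k)}\right)
    - 2\sum_{f=1}^{F}\log\left|\det \mathbf{W}_f^{(k)}\right|,
\qquad
-\log p\!\left(\mathbf{y}_{tn}^{(k)}\right)
  = \left\lVert \mathbf{y}_{tn}^{(k)} \right\rVert_2 + \mathrm{const.}
```

In this form the log-determinant term arises from the change of variables y.sub.ft=W.sub.f x.sub.ft of equation (1), and it prevents the trivial solution in which W.sub.f shrinks toward zero.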
[0029] The gradient of the likelihood relating to the separation filter w.sub.nf.sup.(k)∈W.sub.f.sup.(k) that is estimated at the present time k is derived by differentiating equation (3) with respect to the complex conjugate W.sub.f* of the separation matrix. Specifically, the gradient calculation unit 3 calculates equation (4).
[0030] Here, E[⋅] represents calculating the expected value of ⋅, and .sup.⋅H represents the Hermitian transpose.
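A gradient of this kind can be sketched under explicit assumptions. The cost used below is the common independent vector analysis cost with a spherical Laplacian prior and N=M; it is not taken from equation (4), and the names `nll_cost` and `nll_gradient` are hypothetical. The expectation E[⋅] is approximated by the average over time frames.

```python
import numpy as np

def nll_cost(W, X):
    """Assumed cost: mean_t sum_n ||y_tn||_2 - 2 sum_f log|det W_f| (needs N == M)."""
    Y = np.einsum("fnm,ftm->ftn", W, X)
    norms = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0))       # (T, N): ||y_tn||_2
    logdet = np.log(np.abs(np.linalg.det(W)))             # (F,)
    return norms.sum(axis=1).mean() - 2.0 * logdet.sum()

def nll_gradient(W, X, eps=1e-12):
    """Wirtinger gradient of nll_cost with respect to conj(W_f).

    W: (F, N, N), X: (F, T, N). Returns an array of shape (F, N, N).
    """
    Y = np.einsum("fnm,ftm->ftn", W, X)
    norms = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps  # guard against division by zero
    phi = Y / (2.0 * norms[None])                          # score function of the prior
    T = X.shape[1]
    data_term = np.einsum("ftn,ftm->fnm", phi, X.conj()) / T  # E[phi(y_ft) x_ft^H]
    W_inv_H = np.linalg.inv(W).conj().transpose(0, 2, 1)      # W_f^{-H}
    return data_term - W_inv_H
```

The two terms mirror the structure described above: an expected value involving a Hermitian transpose, and a term that stems from the log-determinant of the separation matrix.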
[0031] Regularization that is based on the direction of arrival with respect to the separation filter w.sub.nf.sup.(k)∈W.sub.f.sup.(k) that is estimated at the present time k is also considered, and the gradient thereof is calculated. Here, regularization is defined as the composite function of simple functions g.sub.1 to g.sub.5, as in equation (5).
[Math. 5]
L.sub.norm.sup.(k)=(g.sub.1∘g.sub.2∘g.sub.3∘g.sub.4∘g.sub.5)({W.sub.f.sup.(k)}.sub.f=1.sup.F) (5)
[0032] Here, g.sub.1 to g.sub.5 are defined as follows.
[0033] Here, ψ.sub.θf=[ψ.sub.1θf, . . . , ψ.sub.Nθf].sup.T represents a beam pattern relating to the direction of arrival θ∈{1, . . . , Θ} in a frequency bin f of the separation filter w.sub.nf.sup.(k)∈W.sub.f.sup.(k). a.sub.θf=[a.sub.1θf, . . . , a.sub.Mθf].sup.T represents an array manifold vector assuming the target sound source came from the direction of arrival θ by plane wave. B.sub.f=diag[b.sub.1, . . . , b.sub.N] is a scaling matrix that compensates for the scale of the separation matrix W.sub.f.sup.(k) being indeterminate during optimization. The projection back technique (Reference Literature 1), for example, has been proposed for this purpose, although there is no particular restriction to the technique in the present invention.
[0034] Also, ⊙ represents the Hadamard product, and ⋅* represents a complex conjugate.
[0036] [Reference Literature 1] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by back-propagating errors,” Cognitive Modeling, vol. 5, no. 3, pp. 1, 1988.
[0037] The beam pattern at the present time k is calculated by g.sub.3∘g.sub.4∘g.sub.5 within this regularization. The beam pattern is a feature amount that represents the characteristics of the separation filter, and can be rendered as a two-dimensional heat map with the direction of arrival θ on the x-axis, the frequency bin f on the y-axis, and the sensitivity value ψ.sub.θf on the z-axis (shown as color; e.g., red for high sensitivity and blue for low sensitivity). The maximum sensitivity with respect to the direction of arrival θ is then acquired with the max function of g.sub.2. In other words, this is equivalent to finding the direction of arrival θ at which the red band running in the y-axis direction of the heat map is most intense. As a result, the direction in which the separation filter w.sub.nf.sup.(k)∈W.sub.f.sup.(k) at the present time k is to form the maximum sensitivity, that is, the direction of arrival of the target sound source, is estimated implicitly. Finally, the extent to which the maximum sensitivity can be formed in that direction of arrival is calculated using g.sub.1. Note that although g.sub.1 simply takes the form of an L.sub.2 norm, the value of the maximum sensitivity ultimately converges on 1, and thus g.sub.1 may conceivably be formulated as g.sub.1=∥h.sub.1−1∥.sub.2.sup.2. However, it has been found empirically that this makes the regularization stricter and the optimization unstable. Thus, it is basically desirable to use g.sub.1=∥h.sub.1∥.sub.2.sup.2 as in equation (6).
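The chain from scaling to beam pattern to maximum sensitivity can be sketched as follows. The definitions of g.sub.1 to g.sub.5 are not reproduced in this text, so the decomposition below is only one plausible reading: the uniform linear array geometry, the averaging of sensitivity over frequency, and the names `steering_vectors` and `directivity_regulariser` are all assumptions.

```python
import numpy as np

def steering_vectors(freqs_hz, angles_rad, mic_pos_m, c=343.0):
    """Plane-wave array manifold for a linear array (hypothetical geometry).

    Returns A of shape (F, Theta, M).
    """
    delay = np.sin(angles_rad)[:, None] * mic_pos_m[None, :] / c   # (Theta, M) seconds
    return np.exp(-2j * np.pi * freqs_hz[:, None, None] * delay[None])

def directivity_regulariser(W, A, B=None):
    """Return (value, theta_hat) for W of shape (F, N, M) and A of shape (F, Theta, M)."""
    F, N, M = W.shape
    if B is None:
        B = np.tile(np.eye(N, dtype=complex), (F, 1, 1))  # no rescaling (assumption)
    scaled = B @ W                                   # scaling by B_f
    psi = np.einsum("fnm,fqm->fqn", scaled, A)       # beam pattern, shape (F, Theta, N)
    power = (psi * psi.conj()).real                  # Hadamard product with the conjugate
    per_direction = power.mean(axis=0)               # (Theta, N): assumed frequency pooling
    theta_hat = per_direction.argmax(axis=0)         # implicit direction estimate per source
    h1 = per_direction.max(axis=0)                   # maximum sensitivity per source
    return float(np.sum(h1 ** 2)), theta_hat         # squared L2 norm of h1

freqs = np.linspace(500.0, 3000.0, 6)
angles = np.deg2rad(np.arange(-90, 91, 10))          # 19 candidate directions
mics = 0.05 * np.arange(4)                           # 4 microphones, 5 cm spacing
A = steering_vectors(freqs, angles, mics)
W = A[:, 12:13, :].conj() / mics.size                # delay-and-sum filter aimed at +30 degrees
value, theta_hat = directivity_regulariser(W, A)
```

For a filter steered at +30 degrees, the estimated direction index points at that angle and the maximum sensitivity equals 1, in line with the remark that the maximum sensitivity ultimately converges on 1. Whether the quantity is added to or subtracted from the overall cost depends on the sign convention of the cost function, which this sketch does not fix.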
[0038] Since regularization L.sub.norm.sup.(k) is represented as a composite function of the simple functions g.sub.1 to g.sub.5, the gradient of regularization L.sub.norm.sup.(k) can be calculated as in equations (11) to (14), by using back propagation that is based on the chain rule used by neural networks and the like.
[0039] Here, the outline character I is an indicator function, and represents propagating only the calculation result relating to the direction of arrival of maximum sensitivity {circumflex over (θ)}=argmax.sub.θ{h.sub.2,θ}.sub.θ=1.sup.Θ as the gradient. f.sub.1 and f.sub.2 are predetermined frequencies.
[0041] Also, in the present invention, equation (14) is proposed as an approximation of ∂L.sub.norm.sup.(k)/∂W.sub.f*. This enables the frequency characteristics of the target sound source to be incorporated when calculating the gradients. For example, since the main frequency band of the human voice is 500 to 3000 Hz, it is possible to calculate the gradients with consideration for only this frequency band by setting f.sub.1=500 Hz and f.sub.2=3000 Hz.
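The band restriction can be pictured as an indicator over frequency bins. The sketch below assumes that bins are spaced uniformly from 0 Hz to half the sampling rate and that the indicator simply zeroes the regularization gradient outside [f.sub.1, f.sub.2]; the function name `band_mask` and the 16 kHz sampling rate are illustrative choices, not part of the embodiment.

```python
import numpy as np

def band_mask(num_bins, sample_rate, f1_hz=500.0, f2_hz=3000.0):
    """Indicator over frequency bins: 1 inside [f1_hz, f2_hz], 0 elsewhere."""
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, num_bins)  # centre frequency per bin
    return ((bin_freqs >= f1_hz) & (bin_freqs <= f2_hz)).astype(float)

mask = band_mask(num_bins=257, sample_rate=16000)

# Placeholder regularization gradient of shape (F, N, M); only in-band bins survive.
grad_norm = np.ones((257, 2, 2), dtype=complex)
masked_grad = mask[:, None, None] * grad_norm
```

With 257 bins at 16 kHz the bin spacing is 31.25 Hz, so the speech band of 500 to 3000 Hz covers roughly bins 16 to 96.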
[0042] Ultimately, a gradient ∂L.sup.(k)/∂W.sub.f* at the present time k is represented as in equation (15), as the weighted linear summation of the gradient ∂L.sub.NLL.sup.(k)/∂W.sub.f* of the negative log likelihood and the gradient ∂L.sub.norm.sup.(k)/∂W.sub.f* of regularization that is based on the direction of arrival.
[0043] Here, γ is a weight hyperparameter. Accordingly, a cost function L.sup.(k) at the present time k is defined by equation (16) from equations (3) and (5).
[0044] In step S4-1, the filter update unit 4 updates a separation filter W.sub.f.sup.(k) at the present time k using the natural gradient method as in equation (17), for example, based on the gradient ∂L.sup.(k)/∂W.sub.f* at the present time k which is the output result of the gradient calculation unit 3, and calculates a separation filter W.sub.f.sup.(k+1) at the next time k+1.
[0045] Here, α represents the update step size. Ultimately, the separated signal y.sub.ftn.sup.(k+1) of the frequency domain that the separated signal estimation unit 2 outputs once the separation filter W.sub.f.sup.(k+1) is no longer updated is the frequency-domain representation of the target sound source to be derived. The filter update unit 4 outputs the separation filter W.sub.f.sup.(k+1) to the separated signal estimation unit 2.
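One common form of a natural gradient update is sketched below. It assumes that the combined gradient is the likelihood gradient plus γ times the regularization gradient, and that the natural gradient is obtained by right-multiplying with W.sub.f.sup.H W.sub.f; the exact forms of equations (15) and (17) are not reproduced in this text, and the default values of γ and α are arbitrary.

```python
import numpy as np

def natural_gradient_step(W, grad_nll, grad_norm, gamma=0.1, alpha=0.05):
    """One update W^(k) -> W^(k+1) for W of shape (F, N, M).

    grad_nll, grad_norm: gradients with respect to conj(W_f), same shape as W.
    """
    grad = grad_nll + gamma * grad_norm        # weighted sum of the two gradients (assumed form)
    W_H = W.conj().transpose(0, 2, 1)          # Hermitian transpose per frequency bin
    return W - alpha * grad @ W_H @ W          # natural gradient correction

F, N = 3, 2
W = np.tile(np.eye(N, dtype=complex), (F, 1, 1))
g_nll = np.full_like(W, 0.5 + 0.5j)            # placeholder gradient values
g_norm = np.zeros_like(W)
W_next = natural_gradient_step(W, g_nll, g_norm)
```

The sign with which the regularization gradient enters depends on whether directivity is rewarded or penalized in the cost, so γ may need the opposite sign under a different convention.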
[0046] In step S4-2, the filter update unit 4 determines whether updating of the separation filter is completed. If updating is completed, the processing advances to step S5. If updating is not completed, the processing returns to step S2. It may be determined that updating is completed when the amount by which the separation filter is updated falls below a predetermined value, or when the separation filter has been updated a predetermined number of times, for example.
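The two completion criteria mentioned here, a small update amount or a fixed number of updates, can be combined in a loop such as the following. The callable `step` stands in for one pass through steps S2 to S4-1 and is hypothetical, as are the tolerance and iteration limit.

```python
import numpy as np

def optimise(W0, step, tol=1e-6, max_iter=200):
    """Iterate W <- step(W) until the update is small or max_iter is reached.

    Returns the final matrix and the number of updates performed.
    """
    W = W0
    for k in range(max_iter):
        W_next = step(W)
        change = np.linalg.norm(W_next - W)   # size of this update over all bins
        W = W_next
        if change < tol:                      # criterion 1: update amount below threshold
            return W, k + 1
    return W, max_iter                        # criterion 2: predetermined number of updates

# Toy stand-in for an update rule: each call halves the matrix.
W_final, num_updates = optimise(np.ones((2, 2, 2), dtype=complex), lambda W: 0.5 * W)
```

Using both criteria together guards against an update rule that never settles below the threshold.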
[0047] In step S5, the signal inverse conversion unit 5 converts the separated signal y.sub.ftn.sup.(k+1) of the frequency domain which is the output result of the separated signal estimation unit 2 into a separated signal y.sub.tn∈R of the time domain, using the inverse short-time Fourier transform. The signal inverse conversion unit 5 outputs the separated signal y.sub.tn of the time domain as the output of the sound source separation device 10.
[0048] The present invention proposes differentiable regularization for implicitly incorporating utilization of the direction of arrival into optimization, and proposes a simple novel optimization technique that takes both estimation of the separation filter and utilization of the direction of arrival into consideration in the optimization framework at the same time. Also, the regularization term proposed by the present invention is differentiable, and thus can be readily incorporated as an error term in a model premised on the gradient method such as a deep neural network.
[0049] Although embodiments of the present invention are described above, the specific configurations are not limited to these embodiments, and design modification and so forth are naturally intended to be included in the invention as appropriate to the extent that they do not depart from the spirit of the invention. The various types of processing described in the embodiments may not only be executed chronologically in accordance with the written order but may also be executed in parallel or individually as required or according to the processing capacity of the device that executes the processing.
[0050] [Computer Program, Recording Medium]
[0051] In the case where the various types of processing functions of the devices described with the above embodiments are realized by a computer, the processing contents of the functions that the devices are to be provided with are described by a computer program. The various types of processing functions of the above devices are realized on a computer, by causing this program to be loaded onto a storage unit 1020 of the computer shown in
[0052] The program describing the processing contents can be recorded to a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, such as a magnetic recording device, an optical disc, and the like.
[0053] Also, distribution of this program is performed by, for example, selling, transferring, leasing and the like a portable recording medium such as a DVD and CD-ROM on which the program is recorded. Furthermore, a configuration may be adopted in which this program is distributed by storing the program on a storage device of a server computer, and transferring the stored program to other computers from the server computer via a network.
[0054] The computer that executes such a program first stores the program recorded on the portable recording medium or the program transferred from the server computer temporarily in an auxiliary recording unit 1050 which is a non-transitory storage device provided in the computer, for example. When processing is to be executed, this computer then loads the program stored in the auxiliary recording unit 1050 which is a non-transitory storage device provided in the computer onto the storage unit 1020 which is a transitory storage device, and executes processing that conforms to the loaded program. Also, as other execution modes of the program, the computer may be configured to load a program directly from the portable recording medium and execute processing that conforms to the loaded program, and may, furthermore, be configured such that, every time a program is transferred to the computer from the server computer, processing that conforms to the received program is executed. A configuration may also be adopted whereby a program is not transferred to the computer from the server computer, and the above-mentioned processing is executed by a so-called ASP (Application Service Provider) service that realizes processing functions through only execution instructions and result acquisition. Note that a program in this mode includes information provided for use in processing by an electronic computer and equivalent to a program (data, etc., that is not a direct instruction to the computer but has the characteristic of regulating processing by the computer).
[0055] Although, in this mode, the device is constituted by executing a predetermined program on a computer, at least some of the processing contents may be realized in a hardware manner.