Method for audio source separation and corresponding apparatus
09734842 · 2017-08-15
Abstract
Separation of speech and background from an audio mixture by using a speech example, generated from a source associated with a speech component in the audio mixture, to guide the separation process.
Claims
1. A method of audio source separation from an audio signal comprising a mix of a background component and a speech component, wherein said method is based on a non-negative matrix partial co-factorization, the method comprising: producing a speech example relating to a speech component in the audio signal; converting said speech example and said audio signal to non-negative matrices representing their respective spectral amplitudes; receiving a first set of characteristics of the audio signal and a second set of characteristics of the produced speech example; estimating parameters for configuration of said separation, said received first set of characteristics and said received second set of characteristics being used for modeling mismatches between the speech example and the speech component, said mismatches comprising a temporal synchronization mismatch, a pitch mismatch and a recording conditions mismatch; obtaining an estimated speech component and an estimated background component of the audio signal by separation of the speech component from the audio signal through filtering of the audio signal using the estimated parameters; the first and the second set of received characteristics being at least one of a tessitura, a prosody, a dictionary built from phonemes, a phoneme order, or recording conditions.
2. The method according to claim 1, wherein said speech example is produced by a speech synthesizer.
3. The method according to claim 2, wherein said speech synthesizer receives as input subtitles that are related to said audio signal.
4. The method according to claim 2, wherein said speech synthesizer receives as input at least a part of a movie script related to the audio signal.
5. The method according to claim 1, further comprising dividing the audio signal and the speech example into blocks, each block representing a spectral characteristic of the audio signal and of the speech example.
6. A device for separating, through non-negative matrix partial co-factorization, audio sources from an audio signal comprising a mix of a background component and a speech component, comprising: a speech example producer configured to produce a speech example relating to a speech component in said audio signal; a converter configured to convert said speech example and said audio signal to non-negative matrices representing their respective spectral amplitudes; a parameter estimator configured to estimate parameters for configuring said separating by a separator, said parameter estimator receiving a first set of characteristics of the audio signal and a second set of characteristics of the produced speech example, wherein said first set of characteristics and said second set of characteristics serve for modeling by said parameter estimator mismatches between the speech example and the speech component, said mismatches comprising a temporal synchronization mismatch, a pitch mismatch and a recording conditions mismatch; the separator being configured to separate the speech component of the audio signal by filtering of the audio signal using said parameters estimated by the parameter estimator, to obtain an estimated speech component and an estimated background component of the audio signal; the first and the second set of received characteristics being at least one of a tessitura, a prosody, a dictionary built from phonemes, a phoneme order, or recording conditions, the synchronization mismatch between the speech example and the speech component being at least one of a temporal mismatch between the speech example and the speech component, a mismatch between distributions of phonemes between the speech example and the speech component, a mismatch between a distribution of pitch between the speech example and the speech component, or a recording conditions mismatch between the speech example and the speech component.
7. The device according to claim 6, further comprising a divider configured to divide the audio signal and the speech example into blocks, each block representing a spectral characteristic of the audio signal and of the speech example.
8. The device according to claim 6, further comprising a speech synthesizer configured to produce said speech example.
9. The device according to claim 8, wherein said speech synthesizer is further configured to receive as input subtitles that are related to the audio signal.
10. The device according to claim 8, wherein said speech synthesizer is further configured to receive as input at least a part of a movie script related to the audio signal.
Description
4. LIST OF FIGURES
(1) Further advantages of the disclosure will appear through the description of particular, non-restrictive embodiments of the disclosure.
(2) The embodiments will be described with reference to the accompanying figures.
5. DETAILED DESCRIPTION
(10) One of the objectives of the present disclosure is the separation of speech signals from background audio in single-channel or multi-channel mixtures such as a movie audio track. For simplicity of explanation, the description hereafter concentrates on the single-channel case. The skilled person can easily extend the algorithm to the multichannel case, where a spatial model accounting for the spatial locations of the sources is added. The background audio component of the mixture comprises, for example, music, background speech, background noise, etc. The disclosure presents a workflow and an example algorithm where available textual information associated with the speech signal comprised in the mixture is used as auxiliary information to guide the source separation. Given the associated textual information, a sound that mimics the speech in the mixture (hereinafter referred to as the “speech example”) is generated via, for example, a speech synthesizer or a human speaker. The mimicked sound is then time-synchronized with the mixture and incorporated into an NMF (Non-negative Matrix Factorization) based source separation system. State-of-the-art source separation has been briefly discussed above. Many approaches use a PLCA (Probabilistic Latent Component Analysis) modeling framework or a Gaussian Mixture Model (GMM), which are, however, less flexible than the NMF model for investigating the deep structure of a sound source. Prior art also considers manual annotation of source activity, i.e. indicating when each source is active in a given time-frequency region of the spectrum. However, such manual annotation is difficult and time-consuming.
(11) The disclosure also concerns a new NMF-based signal modeling technique, referred to as Non-negative Matrix Partial Co-Factorization (NMPCF), that can handle the structure of audio sources and the recording conditions. A corresponding parameter estimation algorithm that jointly handles the audio mixture and the generated guide source (the speech example) is also disclosed.
(12) Characteristics of the speech example comprise: a tessitura (a dictionary of harmonic spectral shapes of pitches), a prosody (temporal activations of the pitches), a dictionary of phoneme spectral shapes with their temporal activations, and a recording-condition filter. Characteristics of the audio mixture comprise: the same characteristics as above for the speech example, plus a background spectral dictionary and the corresponding background temporal activations.
(15) The blocks are matrices containing information about the audio signal, each matrix (or block) carrying information about a specific characteristic of the signal, e.g. intonation, tessitura, or phoneme spectral envelopes. Each block models one spectral characteristic of the signal. These blocks are then estimated jointly in the so-called NMPCF framework described in this disclosure. Once they are estimated, they are used to compute the estimated sources.
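By way of illustration, the conversion of the speech example and of the audio signal into non-negative matrices of spectral amplitudes could be sketched in Python as follows. The use of scipy, the FFT size, and the choice of power spectrograms (a common companion of the Itakura-Saito divergence used further below) are assumptions of this sketch, not requirements of the disclosure.

```python
import numpy as np
from scipy.signal import stft

def to_nonneg_spectrogram(x, fs, n_fft=1024):
    """Convert a time-domain signal into a non-negative matrix of
    spectral amplitudes (here, a power spectrogram) plus its complex
    STFT, kept for the final Wiener filtering stage."""
    _, _, X = stft(x, fs=fs, nperseg=n_fft)
    return np.abs(X) ** 2, X

# Hypothetical waveforms `example` and `mixture` (not from the patent):
# V_Y, Y = to_nonneg_spectrogram(example, fs)   # speech example
# V_X, X = to_nonneg_spectrogram(mixture, fs)   # audio mixture
```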
(16) From the combination of both sets of characteristics, the time-frequency variations between the speech example and the speech component in the audio mixture can be modeled.
(17) In the following, a model is introduced in which the speech example shares linguistic characteristics with the speech in the audio mixture, such as tessitura, dictionary of phonemes, and phoneme order. The speech example is related to the mixture so that it can serve as a guide during the separation process. In a second step 31, the characteristics are jointly estimated, through a combination of NMF and source-filter modeling on the spectrograms. In a third step 32, source separation is performed using the characteristics obtained in the second step, thereby obtaining the estimated speech and estimated background, classically through Wiener filtering.
(20) The previously discussed characteristics can be translated into mathematical terms by using an excitation-filter model of speech production combined with an NMPCF model, as described hereunder.
(21) The excitation part of this model represents the tessitura and the prosody of the speech such that: the tessitura 408 is modeled by a matrix W_p^E in which each column is a harmonic spectral shape corresponding to a pitch; the prosody 404 and 410, representing the temporal activations of the pitches, is modeled by a matrix whose rows represent the temporal distributions of the corresponding pitches, denoted H_Y^E 410 for the speech example and H_S^E 404 for the audio mix.
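The disclosure does not detail how the columns of W_p^E are built. One plausible construction, assuming one pitch per semitone and Gaussian-smoothed harmonic partials (both assumptions of this sketch):

```python
import numpy as np

def pitch_dictionary(n_freq, fs, n_fft, f0_min=80.0, n_pitches=60):
    """Build W_p^E: each column is a harmonic spectral shape for one
    pitch, pitches spaced one semitone apart from f0_min upward."""
    freqs = np.arange(n_freq) * fs / n_fft
    f0s = f0_min * 2.0 ** (np.arange(n_pitches) / 12.0)
    W = np.zeros((n_freq, n_pitches))
    for k, f0 in enumerate(f0s):
        for h in range(1, int(fs / 2 / f0) + 1):  # partials below Nyquist
            W[:, k] += np.exp(-0.5 * ((freqs - h * f0) / (0.05 * f0)) ** 2)
    return W / (W.sum(axis=0, keepdims=True) + 1e-12)
```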
(22) The filter part of the excitation-filter model of speech production represents the dictionary of phonemes and their temporal distribution such that: the dictionary of phonemes 407 is modeled by a matrix W_Y^φ whose columns represent spectral shapes of phonemes; the temporal distribution of phonemes 409 is modeled by a matrix whose rows represent temporal distributions of the corresponding phonemes: H_Y^φ for the speech example and H_Y^φ D for the audio mix (as previously mentioned, the order of the phonemes is considered to be the same, but the speech example and the audio mix are considered not to be perfectly synchronized).
(23) For the recording conditions 403 and 411, a stationary filter is used, denoted w_Y 411 for the speech example and w_S 403 for the audio mixture.
(24) The background in the audio mixture is modeled by a matrix W_B 405, a dictionary of background spectral shapes, and the corresponding matrix H_B 406 representing its temporal activations.
(25) Finally, the temporal mismatch 402 between the speech example and the speech part of the mixture is modeled by a matrix D (that can be seen as a Dynamic Time Warping (DTW) matrix).
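The construction of D is not specified beyond the DTW analogy. One plausible initialization, under the assumption that D is a non-negative alignment matrix concentrated in a band around the main diagonal, which the update iterations then refine into a soft time-warping path:

```python
import numpy as np

def init_sync_matrix(n_frames_example, n_frames_mix, bandwidth=10):
    """Initialize the synchronization matrix D (one column per mixture
    frame) as a normalized band around the anticipated alignment."""
    D = np.zeros((n_frames_example, n_frames_mix))
    slope = n_frames_example / n_frames_mix
    for j in range(n_frames_mix):
        c = int(j * slope)
        lo = max(0, c - bandwidth)
        hi = min(n_frames_example, c + bandwidth + 1)
        D[lo:hi, j] = 1.0
    return D / (D.sum(axis=0, keepdims=True) + 1e-12)
```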
(26) The two parts of the excitation-filter model of speech production can then be summarized by these two equations:
(27) V̂_Y = (W_p^E H_Y^E) ⊙ (W_Y^φ H_Y^φ) ⊙ (w_Y i^T)
V̂_X = (W_p^E H_S^E) ⊙ (W_Y^φ H_Y^φ D) ⊙ (w_S i^T) + W_B H_B    (1)
(28) where ⊙ denotes the entry-wise (Hadamard) product and i is a column vector whose entries all equal one, the recording conditions being assumed unchanged over time.
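For concreteness, equation (1) transcribes directly into NumPy. In this sketch the argument names mirror the symbols above (Wp for W_p^E, Wphi for W_Y^φ, wY and wS as column vectors of shape (F, 1)); the dimensions are left to the caller:

```python
import numpy as np

def model_spectrograms(Wp, HY_E, HS_E, Wphi, HY_phi, D, wY, wS, WB, HB):
    """Compose the NMPCF estimates V_Y_hat (speech example) and
    V_X_hat (audio mixture) of equation (1)."""
    ones_Y = np.ones((1, HY_E.shape[1]))   # i^T, example time frames
    ones_X = np.ones((1, HS_E.shape[1]))   # i^T, mixture time frames
    V_Y_hat = (Wp @ HY_E) * (Wphi @ HY_phi) * (wY @ ones_Y)
    V_X_hat = ((Wp @ HS_E) * (Wphi @ (HY_phi @ D)) * (wS @ ones_X)
               + WB @ HB)
    return V_Y_hat, V_X_hat
```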
(29) Parameter estimation can be derived according to either Multiplicative Update (MU) or Expectation-Maximization (EM) algorithms. The example embodiment described hereafter is based on a derived MU parameter estimation algorithm in which the Itakura-Saito divergence between the spectrograms V_Y and V_X and their estimates V̂_Y and V̂_X is minimized (in order to get the best approximation of the characteristics), via the so-called cost function (CF):
CF = d_IS(V_Y | V̂_Y) + d_IS(V_X | V̂_X)
(30) where
(31) d_IS(V | V̂) = Σ_{f,n} [ V_{fn} / V̂_{fn} − log(V_{fn} / V̂_{fn}) − 1 ]
is the Itakura-Saito (“IS”) divergence.
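A direct transcription of the divergence and the cost function; the EPS floor guarding against division by zero is an assumed implementation detail, not part of the model:

```python
import numpy as np

EPS = 1e-12

def is_divergence(V, V_hat):
    """Itakura-Saito divergence d_IS(V | V_hat), summed over all bins."""
    R = (V + EPS) / (V_hat + EPS)
    return float(np.sum(R - np.log(R) - 1.0))

def cost_function(V_Y, V_Y_hat, V_X, V_X_hat):
    """CF = d_IS(V_Y | V_Y_hat) + d_IS(V_X | V_X_hat)."""
    return is_divergence(V_Y, V_Y_hat) + is_divergence(V_X, V_X_hat)
```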
(32) Note that a constraint can optionally be set over the matrices W_Y^φ, w_Y and w_S to allow only smooth spectral shapes in these matrices. This constraint takes the form of a factorization of the matrices by a matrix P that contains elementary smooth shapes (blobs), such that:
W_Y^φ = P E^φ,  w_Y = P e_Y,  w_S = P e_S
(33) where P is a matrix of frequency blobs, and E^φ, e_Y and e_S are encodings used to construct W_Y^φ, w_Y and w_S, respectively.
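The exact blob shape is not specified in the text. One plausible construction of P, assuming regularly spaced Gaussian bumps along the frequency axis:

```python
import numpy as np

def blob_matrix(n_freq, n_blobs, width=None):
    """Build P: each column is a smooth Gaussian bump, so that any
    product P @ E can only contain smooth spectral shapes."""
    centers = np.linspace(0, n_freq - 1, n_blobs)
    if width is None:
        width = n_freq / n_blobs
    f = np.arange(n_freq)[:, None]
    P = np.exp(-0.5 * ((f - centers[None, :]) / width) ** 2)
    return P / P.sum(axis=0, keepdims=True)
```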
(34) In order to minimize the cost function CF, its gradient is cancelled out: the gradient is computed with respect to each parameter, and each parameter is multiplicatively updated by the ratio of the negative part to the positive part of that gradient. The derived multiplicative update (MU) rules (the rules (2) to (10) referenced in the program hereafter) successively update:
the prosody characteristic 410, H_Y^E, of the speech example;
the prosody characteristic 404, H_S^E, of the audio mix;
the dictionary of phonemes, W_Y^φ = P E^φ;
the characteristic 409 of the temporal distribution of phonemes, H_Y^φ, of the speech example;
the characteristic D 402, the matrix of synchronization between the speech example and the audio mix;
the example channel filter, w_Y = P e_Y;
the mixture channel filter, w_S = P e_S;
the characteristic H_B 406, representing the temporal activations of the background in the audio mix;
the characteristic W_B 405, the dictionary of background spectral shapes of the background in the audio mix.
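The individual rules (2) to (10) are lengthy and all follow the pattern just described. For orientation only, here is a minimal sketch of the standard multiplicative updates for the Itakura-Saito divergence applied to a plain factorization V ≈ WH (the classical IS-NMF case); the joint NMPCF rules extend this same negative-over-positive pattern to every factor of equation (1), which this sketch does not attempt to reproduce. The EPS floor is an assumed implementation detail.

```python
import numpy as np

EPS = 1e-12  # numerical floor (assumption, not part of the model)

def is_nmf_mu_step(V, W, H):
    """One multiplicative-update step decreasing d_IS(V | WH).

    Each factor is scaled by the ratio of the negative to the positive
    part of the gradient of the IS divergence, the other factor being
    held fixed -- the same pattern the NMPCF rules apply per parameter.
    """
    V_hat = W @ H + EPS
    W *= ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T + EPS)
    V_hat = W @ H + EPS
    H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat) + EPS)
    return W, H
```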
(53) Then, once the model parameters are estimated (i.e. via the above-mentioned update rules), the STFT of the speech component in the audio mix can be reconstructed in the reconstruction function 44 via well-known Wiener filtering:
(54) Ŝ_ij = ( V̂_S,ij / V̂_X,ij ) · X_ij    (11)
(55) where A_ij is the entry value of matrix A at row i and column j, X is the STFT of the mixture, V̂_S is the speech-related part of V̂_X, i.e. (W_p^E H_S^E) ⊙ (W_Y^φ H_Y^φ D) ⊙ (w_S i^T), and V̂_B = W_B H_B is its background-related part, so that V̂_X = V̂_S + V̂_B.
(56) The estimated speech component 201 is thereby obtained. The STFT of the estimated background audio component 202 is then obtained by:
(57) B̂_ij = ( V̂_B,ij / V̂_X,ij ) · X_ij    (12)
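A direct transcription of equations (11) and (12), followed by inverse STFTs to return to the time domain. The scipy call, the matching analysis parameters, and the EPS floor are assumptions of this sketch:

```python
import numpy as np
from scipy.signal import istft

def wiener_separate(X, V_S_hat, V_B_hat, fs, n_fft=1024):
    """Wiener filtering of the mixture STFT X (equations (11), (12)):
    V_S_hat and V_B_hat are the speech- and background-related parts
    of the estimated mixture spectrogram V_X_hat."""
    V_X_hat = V_S_hat + V_B_hat + 1e-12
    S_hat = (V_S_hat / V_X_hat) * X   # estimated speech STFT (11)
    B_hat = (V_B_hat / V_X_hat) * X   # estimated background STFT (12)
    _, speech = istft(S_hat, fs=fs, nperseg=n_fft)
    _, background = istft(B_hat, fs=fs, nperseg=n_fft)
    return speech, background
```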
(58) A program for estimating the parameters can have the following structure:
(59)
Compute V_Y and V_X;                               // compute the spectrograms of the example (V_Y) and of the mixture (V_X)
Initialize V̂_Y and V̂_X;                           // and all the parameters constituting them, according to (1)
For step = 1 to N                                  // iteratively update the parameters
    Update the parameters constituting V̂_Y and V̂_X;   // according to (2), ..., (10)
End for;
Wiener-filter the audio mixture based on the parameters comprised in V̂_Y and V̂_X;   // according to (11) and (12)
Output the separated sources.
(61) As will be appreciated by one skilled in the art, aspects of the present principles can be embodied as a system, method or computer readable medium. Accordingly, aspects of the present principles can take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code and so forth), or an embodiment combining hardware and software aspects that can all generally be referred to herein as a “circuit”, “module” or “system”. Furthermore, aspects of the present principles can take the form of a computer readable storage medium. Any combination of one or more computer readable storage medium(s) can be utilized.
(62) Thus, for example, it will be appreciated by those skilled in the art that the diagrams presented herein represent conceptual views of illustrative system components and/or circuitry embodying the principles of the present disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable storage media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
(63) A computer readable storage medium can take the form of a computer readable program product embodied in one or more computer readable medium(s) and having computer readable program code embodied thereon that is executable by a computer. A computer readable storage medium as used herein is considered a non-transitory storage medium, given its inherent capability to store the information therein as well as its inherent capability to provide retrieval of the information therefrom. A computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. It is to be appreciated that the following, while providing more specific examples of computer readable storage mediums to which the present principles can be applied, is merely an illustrative and not exhaustive listing, as is readily appreciated by one of ordinary skill in the art: a portable computer diskette; a hard disk; a read-only memory (ROM); an erasable programmable read-only memory (EPROM or Flash memory); a portable compact disc read-only memory (CD-ROM); an optical storage device; a magnetic storage device; or any suitable combination of the foregoing.