Device, method, and program for analyzing speech signal
11798579 · 2023-10-24
International classification
G10L15/14
Abstract
A parameter included in the fundamental frequency pattern of a voice can be estimated from that pattern with high accuracy, and the fundamental frequency pattern of the voice can in turn be reconstructed from the parameter included in it. A learning unit 30 learns a deep generation model including an encoder and a decoder: the encoder regards a parameter included in the fundamental frequency pattern in a voice signal as a latent variable of the deep generation model and estimates the latent variable from the fundamental frequency pattern in the voice signal, on the basis of parallel data of the fundamental frequency pattern in the voice signal and the parameter included in it; the decoder reconstructs the fundamental frequency pattern in the voice signal from the latent variable.
Claims
1. A computer-implemented method for estimating aspects of speech signal in voice data, the method comprising: learning a deep generation model, wherein the deep generation model comprises: an encoder, wherein the encoder estimates a first parameter included in a first fundamental frequency pattern of a first speech signal in a first input voice data, the first parameter corresponds to a latent variable of the deep generation model, and the learning of the deep generation model includes updating the latent variable of the deep generation model based on parallel data between the first fundamental frequency pattern of the first speech signal and the first parameter included in the first fundamental frequency pattern of the first speech signal as training data, and a decoder, wherein the decoder reconstructs, based on the latent variable of the deep generation model, the first fundamental frequency pattern of the first speech signal in the first input voice data, wherein the latent variable of the deep generation model corresponds to the first parameter included in the first fundamental frequency pattern of the first speech signal; estimating, based on a second fundamental frequency pattern of a second speech signal in a second input voice data for encoding and subsequently for reconstructing, a second parameter included in the second fundamental frequency pattern using the encoder of the learnt deep generation model; and estimating, based on the second parameter included in the second fundamental frequency pattern of speech signal in the second input voice data, the second fundamental frequency pattern using the decoder of the deep generation model to reconstruct the second fundamental frequency pattern associated with the second input voice data.
2. The computer-implemented method of claim 1, wherein the first parameter in the first fundamental frequency pattern of speech signal in the first voice data represents at least one of: an accent of voice in the first voice data, or musical notes representing a musical piece signal associated with the first voice data as a singing voice.
3. The computer-implemented method of claim 1, the method further comprising: receiving the first voice data, wherein the first voice data includes singing voice data; receiving the first fundamental frequency pattern based on the received first voice data, wherein the first fundamental frequency pattern includes vibrato and overshoot associated with the received singing voice data; generating the first parameter included in the first fundamental frequency pattern of speech signal in the received singing voice data, wherein the first parameter represents musical notes associated with the received singing voice data; learning, based on a combination including the first fundamental frequency pattern associated with the first voice data and the first parameter, the deep generation model; synthesizing voice data of the singer based on the learnt deep generation model; and outputting the synthesized voice data.
4. The computer-implemented method of claim 1, the method further comprising: maximizing an output of an objective function for learning the deep generation model, wherein the objective function is based at least on: a distance between an output of the encoder having the first fundamental frequency pattern of speech signal in the first voice data as an input and a prior distribution of the first parameter represented using a state sequence of a path-restricted hidden Markov model (HMM), and an output of the decoder having the latent variable, output from the encoder, as an input.
5. The computer-implemented method of claim 4, wherein each of the encoder and the decoder is configured using a convolutional neural network.
6. The computer-implemented method of claim 4, wherein the first voice data represents learning data, wherein the first voice data and the second voice data are distinct, and wherein the first voice data and the third voice data are distinct.
7. The computer-implemented method of claim 4, wherein the first fundamental frequency pattern of speech signal in the first voice data relates to one or more of: an interrogative sentence based on the ending of an utterance sentence, an intention of a speaker represented by the first voice data, a melody of a singer represented by the first voice data, and an emotion of the singer represented by the first voice data.
8. A system for estimating aspects of voice data, the system comprising: a processor; and a memory storing computer-executable instructions that when executed by the processor cause the system to: learn a deep generation model, wherein the deep generation model comprises: an encoder, wherein the encoder estimates a first parameter included in a first fundamental frequency pattern of a first speech signal in a first input voice data, the first parameter corresponds to a latent variable of the deep generation model, and the learning of the deep generation model includes updating the latent variable of the deep generation model based on parallel data between the first fundamental frequency pattern of the first speech signal and the first parameter included in the first fundamental frequency pattern of the first speech signal as training data, and a decoder, wherein the decoder reconstructs, based on the latent variable of the deep generation model, the first fundamental frequency pattern of the first speech signal in the first input voice data, wherein the latent variable of the deep generation model corresponds to the first parameter included in the first fundamental frequency pattern of the first speech signal; estimate, based on a second fundamental frequency pattern of a second speech signal in a second input voice data for encoding and subsequently for reconstructing, a second parameter included in the second fundamental frequency pattern using the encoder of the learnt deep generation model; and estimate, based on the second parameter included in the second fundamental frequency pattern of speech signal in the second input voice data, the second fundamental frequency pattern using the decoder of the deep generation model to reconstruct the second fundamental frequency pattern associated with the second input voice data.
9. The system of claim 8, the computer-executable instructions when executed further causing the system to: maximize an output of an objective function for learning the deep generation model, wherein the objective function is based at least on: a distance between an output of the encoder having the first fundamental frequency pattern of the first voice data as an input and a prior distribution of the first parameter represented using a state sequence of a path-restricted hidden Markov model (HMM), and an output of the decoder having the latent variable, output from the encoder, as an input.
10. The system of claim 8, wherein each of the encoder and the decoder is configured using a convolutional neural network.
11. The system of claim 8, wherein the first voice data is learning data, wherein the first voice data and the second voice data are distinct, and wherein the first voice data and the third voice data are distinct.
12. The system of claim 8, wherein the first fundamental frequency pattern of speech signal in the first voice data relates to one or more of: an interrogative sentence based on the ending of an utterance sentence, an intention of a speaker represented by the first voice data, a melody of a singer represented by the first voice data, and an emotion of the singer represented by the first voice data.
13. The system of claim 8, wherein the first parameter in the first fundamental frequency pattern of speech signal in the first voice data represents at least one of: an accent of voice in the first voice data, or musical notes representing a musical piece signal associated with the first voice data as a singing voice.
14. The system of claim 8, the computer-executable instructions when executed further causing the system to: receive the first voice data, wherein the first voice data includes singing voice data; generate the first fundamental frequency pattern based on the received first voice data, wherein the first fundamental frequency pattern includes vibrato and overshoot associated with the received singing voice data; generate the first parameter included in the first fundamental frequency pattern of speech signal in the received singing voice data, wherein the first parameter represents musical notes associated with the received singing voice data; learn, based on a combination including the first fundamental frequency pattern associated with the first voice data and the first parameter, the deep generation model; synthesize voice data of the singer based on the learnt deep generation model; and output the synthesized voice data.
15. A computer-readable non-transitory recording medium storing computer-executable instructions that when executed by a processor cause a computer system to: learn a deep generation model, wherein the deep generation model comprises: an encoder, wherein the encoder estimates a first parameter included in a first fundamental frequency pattern of a first speech signal in a first input voice data, the first parameter corresponds to a latent variable of the deep generation model, and the learning of the deep generation model includes updating the latent variable of the deep generation model based on parallel data between the first fundamental frequency pattern of the first speech signal and the first parameter included in the first fundamental frequency pattern of the first speech signal as training data, and a decoder, wherein the decoder reconstructs, based on the latent variable of the deep generation model, the first fundamental frequency pattern of the first speech signal in the first input voice data, wherein the latent variable of the deep generation model corresponds to the first parameter included in the first fundamental frequency pattern of the first speech signal; estimate, based on a second fundamental frequency pattern of a second speech signal in a second input voice data for encoding and subsequently for reconstructing, a second parameter included in the second fundamental frequency pattern using the encoder of the learnt deep generation model; and estimate, based on the second parameter included in the second fundamental frequency pattern of speech signal in the second input voice data, the second fundamental frequency pattern using the decoder of the deep generation model to reconstruct the second fundamental frequency pattern associated with the second input voice data.
16. The computer-readable non-transitory recording medium of claim 15, wherein each of the encoder and the decoder is configured using a convolutional neural network.
17. The computer-readable non-transitory recording medium of claim 15, wherein the first fundamental frequency pattern of speech signal in the first voice data relates to one or more of: an interrogative sentence based on the ending of an utterance sentence, an intention of a speaker represented by the first voice data, a melody of a singer represented by the first voice data, and an emotion of the singer represented by the first voice data.
18. The computer-readable non-transitory recording medium of claim 15, the computer-executable instructions when executed further causing the system to: maximize an output of an objective function for learning the deep generation model, wherein the objective function is based at least on: a distance between an output of the encoder having the first fundamental frequency pattern of speech signal in the first voice data as an input and a prior distribution of the first parameter represented using a state sequence of a path-restricted hidden Markov model (HMM), and an output of the decoder having the latent variable, output from the encoder, as an input.
19. The computer-readable non-transitory recording medium of claim 18, wherein the first parameter in the first fundamental frequency pattern of speech signal in the first voice data represents at least one of: an accent of voice in the first voice data, or musical notes representing a musical piece signal associated with the first voice data as a singing voice.
20. The computer-readable non-transitory recording medium of claim 18, the computer-executable instructions when executed further causing the system to: receive the first voice data, wherein the first voice data includes singing voice data; generate the first fundamental frequency pattern based on the received first voice data, wherein the first fundamental frequency pattern includes vibrato and overshoot associated with the received singing voice data; generate the first parameter included in the first fundamental frequency pattern of speech signal in the received singing voice data, wherein the first parameter represents musical notes associated with the received singing voice data; learn, based on a combination including the first fundamental frequency pattern associated with the first voice data and the first parameter, the deep generation model; synthesize voice data of the singer based on the learnt deep generation model; and output the synthesized voice data.
Description
DESCRIPTION OF EMBODIMENTS
(5) Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The technology proposed in an embodiment of the present invention belongs to the technical field of signal processing and is a signal processing technology for solving the inverse problem of estimating a parameter included in the fundamental frequency pattern of a voice from that pattern, as well as the corresponding forward problem.
(6) Here, related technologies 1 and 2 in an embodiment of the present invention will be described.
(7) <Related Technology 1: F0 Pattern Generation Process Model of Voice>
(8) First, an F0 pattern generation process model of a voice will be described.
(9) As a model describing the process of generating the F0 pattern of a voice, the fundamental frequency (F0) pattern generation process model of Fujisaki (the Fujisaki model) is known (NPL 1). The Fujisaki model is a physical model that describes the F0 pattern generation process in terms of the motion of the thyroid cartilage. In the Fujisaki model, the time variation of F0 is interpreted as being caused by the sum of the vocal cord stretches produced by two independent motions of the thyroid cartilage (a translational motion and a rotational motion), and the F0 pattern is modeled on the assumption that the vocal cord stretch and the logarithmic value y(t) of the F0 pattern are in a proportional relation. The F0 pattern x_p(t) generated by the translational motion of the thyroid cartilage is referred to as the phrase component, and the F0 pattern x_a(t) generated by the rotational motion is referred to as the accent component. In the Fujisaki model, the F0 pattern y(t) is obtained by adding a baseline component μ_b, determined by the physical limitations of the vocal cords, to these components, and is represented as follows.
[Formula 1]
y(t) = x_p(t) + x_a(t) + μ_b (1)
(10) The two components are modeled as the outputs of second-order critically damped systems and are represented as follows.
(11)
x_p(t) = G_p(t)*u_p(t) (2)
x_a(t) = G_a(t)*u_a(t) (3)
G_p(t) = α^2 t e^(−αt) (t ≥ 0), G_p(t) = 0 (t < 0) (4)
G_a(t) = β^2 t e^(−βt) (t ≥ 0), G_a(t) = 0 (t < 0) (5)
(12) (* represents a convolution operation with respect to time t). Here, u_p(t) is called the phrase command function and is composed of a string of delta functions (phrase commands), and u_a(t) is called the accent command function and is composed of a string of rectangular pulses (accent commands). These command strings are subject to the constraints that a phrase command is generated at the beginning of an utterance, that two phrase commands are not generated consecutively, and that two different commands are not generated at the same time. In addition, α and β are the natural angular frequencies of the phrase control mechanism and the accent control mechanism, and it is empirically known that they take approximately α = 3 rad/s and β = 20 rad/s regardless of the speaker and the content of the utterance.
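As a concrete illustration of Equations (1) to (5), the following minimal NumPy sketch synthesizes a log-F0 pattern from hand-chosen commands; the frame rate, command timings, and amplitudes are illustrative assumptions, not values from the embodiment.

```python
import numpy as np

# Minimal sketch of the Fujisaki model, Equations (1)-(5).
# Command timings and amplitudes below are illustrative assumptions.
fs = 100.0                       # frames per second
dt = 1.0 / fs
t = np.arange(0.0, 3.0, dt)      # a 3-second utterance

alpha, beta = 3.0, 20.0          # natural angular frequencies (rad/s)
mu_b = np.log(120.0)             # baseline component (log Hz)

# Impulse responses of the second-order critically damped systems, Eqs. (4)-(5).
G_p = alpha**2 * t * np.exp(-alpha * t)
G_a = beta**2 * t * np.exp(-beta * t)

# Phrase commands: delta functions; accent commands: rectangular pulses.
u_p = np.zeros_like(t)
u_p[int(0.1 * fs)] = 0.5 / dt    # unit-area discrete delta of amplitude 0.5
u_a = np.zeros_like(t)
u_a[int(0.5 * fs):int(1.2 * fs)] = 0.3

# Phrase and accent components via convolution, Eqs. (2)-(3).
x_p = np.convolve(u_p, G_p)[:len(t)] * dt
x_a = np.convolve(u_a, G_a)[:len(t)] * dt

y = x_p + x_a + mu_b             # log-F0 pattern, Eq. (1)
print(np.exp(y).max())           # peak F0 in Hz
```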
(13) <Related Technology 2: F0 Pattern Generation Process Model of Singing Voice>
(14) Next, an F0 pattern generation process model of a singing voice will be described.
(15) The control of the abrupt rises and falls of the fundamental frequency involved in the melody of a singing voice, and periodic oscillations such as vibrato, cannot be represented by a critically damped system such as the above-described Fujisaki model. Accordingly, in an F0 control model of a singing voice,
(16)
H(s) = Ω^2/(s^2 + 2ζΩs + Ω^2) (6)
(17) various oscillatory phenomena, namely overdamping (ζ > 1), damped oscillation (0 < ζ < 1, corresponding to overshoot), critical damping (ζ = 1), and sustained oscillation (ζ = 0, corresponding to vibrato), are represented by adjusting the damping factor ζ in the transfer function of the second-order system represented by Equation (6) above, using the control parameters (the damping factor ζ and a natural frequency Ω).
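To make the role of ζ concrete, the following SciPy sketch computes the step response of the second-order transfer function of Equation (6) in each of the four regimes; the natural frequency Ω and the damping values are illustrative assumptions.

```python
import numpy as np
from scipy import signal

# Step responses of H(s) = Ω^2 / (s^2 + 2ζΩs + Ω^2), Equation (6), for
# several damping factors ζ; Ω here is an illustrative value only.
Omega = 2 * np.pi * 6.0          # natural frequency (rad/s), e.g. about 6 Hz
t = np.linspace(0.0, 2.0, 1000)

for zeta, regime in [(2.0, "overdamping"),
                     (0.3, "damped oscillation (overshoot)"),
                     (1.0, "critical damping (Fujisaki model)"),
                     (0.0, "sustained oscillation (vibrato)")]:
    sys = signal.TransferFunction([Omega**2], [1.0, 2 * zeta * Omega, Omega**2])
    _, y = signal.step(sys, T=t)
    print(f"zeta={zeta}: {regime}, peak response {y.max():.2f}")
```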
Principle According to Embodiment of the Present Invention
(18) A technology of an embodiment of the present invention includes learning processing and estimation processing.
(19) <Learning Processing>
(20) In the learning processing, it is assumed that parallel data of an F0 pattern (e.g., the F0 pattern of a voice) and a parameter included in the F0 pattern (e.g., phrase and accent components), or data of which only a part is such parallel data, is provided.
(21) First, the latent variable z is assumed to be the parameter that governs the F0 pattern generation process; in the case of the Fujisaki model, for example, it corresponds to the phrase and accent components. The conditional probability distribution P_θ(x|z) of an F0 pattern x given z is approximated by a decoder described by a neural network, and its posterior probability P_θ(z|x) then corresponds to the inverse problem of estimating z when an F0 pattern x is given. Since it is difficult to obtain this posterior probability exactly, the conditional probability distribution Q_φ(z|x) of z given x is approximated by an encoder described by a neural network. The encoder and the decoder are learnt so that Q_φ(z|x) becomes consistent with the true posterior probability P_θ(z|x) ∝ P_θ(x|z)P(z). The logarithmic marginal probability density function log P_θ(x) of the F0 pattern x is represented as follows.
(22)
L(θ,φ;x) = E_(z~Q_φ(z|x))[log P_θ(x|z)] − D_KL[Q_φ(z|x) || P(z)] (7)
log P_θ(x) = L(θ,φ;x) + D_KL[Q_φ(z|x) || P_θ(z|x)] (8)
(23) Here, D_KL[·||·] represents the Kullback-Leibler (KL) divergence. It can be seen from Equation (8) that, since log P_θ(x) does not depend on φ, the KL divergence between P_θ(z|x) and Q_φ(z|x) can be minimized by maximizing L(θ,φ;x) with respect to θ and φ. In the conventional, typical VAE, Q_φ(z|x) and P_θ(x|z) are each assumed to be a single normal distribution (NPLs 5 and 6).
(24) [NPL 5] Diederik P. Kingma and Max Welling, "Auto-Encoding Variational Bayes", arXiv preprint arXiv:1312.6114, 2013.
(25) [NPL 6] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther, "Ladder Variational Autoencoders", in Advances in Neural Information Processing Systems, 2016, pp. 3738-3746.
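For concreteness, the following PyTorch sketch implements the lower bound of Equation (8) with one-dimensional convolutional encoder and decoder, in the manner of the conventional VAE of NPLs 5 and 6: Q_φ(z|x) and P_θ(x|z) are Gaussian and, for brevity, a standard normal prior stands in for P(z) (the HMM-structured prior is discussed next). The single-layer networks and the name F0VAE are illustrative assumptions, not the embodiment's architecture.

```python
import torch
import torch.nn as nn

class F0VAE(nn.Module):
    """Minimal VAE over log-F0 sequences; one Conv1d layer per side for brevity."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv1d(1, 2, kernel_size=5, padding=2)  # -> mean, log-var of z
        self.dec = nn.Conv1d(1, 2, kernel_size=5, padding=2)  # -> mean, log-var of x

    def forward(self, x):                       # x: (batch, 1, frames)
        z_mu, z_logvar = self.enc(x).chunk(2, dim=1)
        z = z_mu + torch.randn_like(z_mu) * (0.5 * z_logvar).exp()  # reparameterization
        x_mu, x_logvar = self.dec(z).chunk(2, dim=1)
        # Reconstruction term E[log P_theta(x|z)] (Gaussian, constants dropped).
        rec = -0.5 * (x_logvar + (x - x_mu) ** 2 / x_logvar.exp()).sum()
        # Closed-form D_KL[Q_phi(z|x) || N(0, I)] used here in place of the HMM prior.
        kl = -0.5 * (1 + z_logvar - z_mu ** 2 - z_logvar.exp()).sum()
        return -(rec - kl)                      # negative of L(theta, phi; x)

x = torch.randn(8, 1, 200)                      # a batch of 200-frame log-F0 patterns
loss = F0VAE()(x)
loss.backward()
```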
(26) Here, a specific form of the prior distribution P(z) can be designed by setting the latent variable z to a specific variable of interest. For example, when the latent variable z is associated with the phrase and accent components as described above, P(z) can be represented as P(z) = Σ_s P(z|s)P(s), where s represents a state sequence of the path-restricted HMM.
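As a sketch of how such a prior can be evaluated, the following NumPy code computes log P(z) = log Σ_s P(z|s)P(s) with the forward algorithm; the two-state topology, Gaussian emissions, and all numerical values are illustrative stand-ins, not the path-restricted HMM of the embodiment.

```python
import numpy as np

def log_prior(z, log_pi, log_A, means, var):
    """Forward algorithm: log P(z) = log sum_s P(z|s)P(s) for a Gaussian-emission HMM."""
    # Per-frame emission log-likelihoods, shape (T, num_states).
    log_b = -0.5 * (np.log(2 * np.pi * var) + (z[:, None] - means) ** 2 / var)
    log_alpha = log_pi + log_b[0]
    for k in range(1, len(z)):
        # Log-sum-exp over predecessor states, then add the emission term.
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0) + log_b[k]
    return np.logaddexp.reduce(log_alpha)

log_pi = np.log([0.9, 0.1])                    # start mostly in the "command off" state
log_A = np.log([[0.95, 0.05],                  # restricted transitions between states
                [0.10, 0.90]])
z = np.array([0.0, 0.0, 0.3, 0.3, 0.3, 0.0])   # a toy accent-like latent sequence
print(log_prior(z, log_pi, log_A, means=np.array([0.0, 0.3]), var=0.01))
```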
(27) <Estimation Processing>
(28) In the processing of estimating the parameter z included in a given F0 pattern x from that pattern, a posterior distribution over z is obtained using the above-described encoder Q_φ(z|x), and its mean sequence is regarded as z. The processing of estimating the F0 pattern x from the parameter z included in the F0 pattern is performed using the above-described decoder P_θ(x|z). Since the encoder and the decoder are described by a CNN, the repeated execution required in the conventional technology is unnecessary and the operations within each batch of the CNN can be performed in parallel, so fast estimation can be achieved.
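In terms of the F0VAE sketch above (hypothetical interfaces, not the embodiment's), the two estimation directions of paragraph (28) reduce to the following.

```python
import torch

@torch.no_grad()
def estimate_parameter(model, x):
    """Inverse problem: the posterior mean of Q_phi(z|x) is regarded as the parameter z."""
    z_mu, _ = model.enc(x).chunk(2, dim=1)
    return z_mu

@torch.no_grad()
def reconstruct_f0(model, z):
    """Forward problem: the mean of P_theta(x|z) is the reconstructed F0 pattern."""
    x_mu, _ = model.dec(z).chunk(2, dim=1)
    return x_mu
```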
(29) <System Configuration>
(30) As shown in the drawings, the voice signal analysis apparatus 100 includes an input unit 10, an operation unit 20, and an output unit 90.
(32) The input unit 10 receives parallel data of a fundamental frequency pattern in a voice signal and a parameter included in the fundamental frequency pattern in the voice signal. In addition, the input unit 10 receives a parameter included in a fundamental frequency pattern in a voice signal that is an estimation object. Further, the input unit 10 receives the fundamental frequency pattern in the voice signal that is the estimation object.
(33) Meanwhile, a fundamental frequency pattern is acquired by extracting the fundamental frequency from a voice signal using known fundamental frequency extraction processing.
(34) The operation unit 20 includes a learning unit 30, a deep generation model storage unit 40, a parameter estimation unit 50, and a fundamental frequency pattern estimation unit 60.
(35) The learning unit 30 learns a deep generation model including an encoder and a decoder described below. Encoder: regards a parameter included in the fundamental frequency pattern in a voice signal as a latent variable of the deep generation model on the basis of the parallel data, received through the input unit 10, of the fundamental frequency pattern in the voice signal and the parameter included in it, and estimates the latent variable from the fundamental frequency pattern in the voice signal. Decoder: reconstructs the fundamental frequency pattern in the voice signal from the latent variable.
(36) Specifically, the decoder P_θ(x|z) and the encoder Q_φ(z|x) of the deep generation model are learnt such that the objective function of the aforementioned Equation (8) is maximized, the objective function being defined using a distance between the output of the encoder, which takes the fundamental frequency pattern in the voice signal as an input, and the prior distribution of the parameter represented using a state sequence of a path-restricted hidden Markov model (HMM), together with the output of the decoder, which takes the latent variable as an input.
(37) Here, the state sequence of the path-restricted hidden Markov model (HMM) is a state sequence s estimated from the fundamental frequency pattern, composed of the state s_k of the HMM at each time k.
(38) Here, in the path-restricted HMM, the transitions permitted between states are restricted along a predetermined path, as shown in the drawings.
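Under these definitions, one way to realize the learning objective of paragraph (36) is sketched below: because the KL term of Equation (8) has no closed form for an HMM-structured prior, it is estimated here with a single sample as log Q_φ(z|x) − log P(z). This is an illustration under the same assumptions as the earlier F0VAE code, not the embodiment's exact procedure.

```python
import math
import torch

def neg_lower_bound(x, encoder, decoder, log_prior_fn):
    """Monte Carlo estimate of -L(theta, phi; x) for an arbitrary prior log P(z).

    encoder/decoder follow the F0VAE sketch (each returns mean and log-variance);
    log_prior_fn(z) returns log P(z), e.g. computed by the HMM forward algorithm.
    """
    z_mu, z_logvar = encoder(x).chunk(2, dim=1)
    z = z_mu + torch.randn_like(z_mu) * (0.5 * z_logvar).exp()    # z ~ Q_phi(z|x)
    x_mu, x_logvar = decoder(z).chunk(2, dim=1)
    log2pi = math.log(2.0 * math.pi)
    log_q = (-0.5 * (z_logvar + log2pi + (z - z_mu) ** 2 / z_logvar.exp())).sum()
    log_p_xz = (-0.5 * (x_logvar + log2pi + (x - x_mu) ** 2 / x_logvar.exp())).sum()
    # Single-sample estimate of D_KL[Q_phi(z|x) || P(z)] = log_q - log P(z).
    return -(log_p_xz - (log_q - log_prior_fn(z)))
```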
(39) In addition, each of the decoder P_θ(x|z) and the encoder Q_φ(z|x) of the deep generation model is configured using a convolutional neural network.
(40) The decoder P_θ(x|z) and the encoder Q_φ(z|x) of the deep generation model trained by the learning unit 30 are stored in the deep generation model storage unit 40.
(41) The parameter estimation unit 50 estimates, from the fundamental frequency pattern in the voice signal that is the estimation object input thereto, the parameter included in the fundamental frequency pattern using the encoder Q_φ(z|x) of the deep generation model, and outputs the parameter to the output unit 90.
(42) The fundamental frequency pattern estimation unit 60 estimates the fundamental frequency pattern from the parameter included in the fundamental frequency pattern in the voice signal that is the estimation object input thereto, using the decoder P_θ(x|z) of the deep generation model, and outputs the fundamental frequency pattern to the output unit 90.
(43) <Operation of Voice Signal Analysis Apparatus>
(44) Next, the operation of the voice signal analysis apparatus 100 according to an embodiment of the present invention will be described. First, when the input unit 10 receives parallel data of a fundamental frequency pattern in a voice signal and a parameter included in the fundamental frequency pattern in the voice signal, the learning unit 30 of the voice signal analysis apparatus 100 learns a deep generation model including an encoder Q_φ(z|x), which estimates the latent variable from the fundamental frequency pattern in the voice signal, and a decoder P_θ(x|z), which reconstructs the fundamental frequency pattern in the voice signal from the latent variable, and stores the deep generation model in the deep generation model storage unit 40.
(45) Next, when the input unit 10 receives a fundamental frequency pattern in a voice signal that is an estimation object, the parameter estimation unit 50 of the voice signal analysis apparatus 100 estimates the parameter included in the fundamental frequency pattern in the voice signal that is the estimation object from the fundamental frequency pattern using the encoder Q_φ(z|x) of the deep generation model and outputs the parameter to the output unit 90.
(46) In addition, when the input unit 10 receives a parameter included in a fundamental frequency pattern in a voice signal that is an estimation object, the fundamental frequency pattern estimation unit 60 of the voice signal analysis apparatus 100 estimates the fundamental frequency pattern from the parameter included in the fundamental frequency pattern of the input voice signal that is the estimation object using the decoder P_θ(x|z) of the deep generation model and outputs the fundamental frequency pattern to the output unit 90.
Effects of Experiments of Present Embodiment
Example 1 of Effects
(47) An F0 pattern was extracted from a voice signal, and data of the phrase and accent components was generated manually from the F0 pattern. After the aforementioned model (deep generation model) was learnt using the parallel data of the F0 pattern and the phrase and accent components, experiments of estimating the phrase and accent components from the F0 pattern through the estimation processing, and of estimating the F0 pattern from the phrase and accent components, were performed, and it was confirmed to what degree the estimated F0 pattern and phrase and accent components restored the originals.
Example 2 of Effects
(48) An F0 pattern was extracted from a singing voice signal, and notes were extracted from the corresponding musical piece signal to generate parallel data. After a singer-dependent version of the aforementioned model (deep generation model) was learnt for each singer using the parallel data of the F0 pattern and the notes, experiments of estimating an F0 pattern from the notes through the estimation processing were performed, and it was confirmed to what degree the estimated F0 pattern restored the original F0 pattern.
(49) As described above, according to the voice signal analysis apparatus according to an embodiment of the present invention, it is possible to estimate a parameter included in the fundamental frequency pattern of a voice from the fundamental frequency pattern with high accuracy, and to reconstruct the fundamental frequency pattern of the voice from the parameter included in it, by learning the deep generation model including the encoder and the decoder described below. Encoder: regards a parameter included in the fundamental frequency pattern in a voice signal as a latent variable of the deep generation model on the basis of parallel data of the fundamental frequency pattern in the voice signal and the parameter included in it, and estimates the latent variable from the fundamental frequency pattern in the voice signal. Decoder: reconstructs the fundamental frequency pattern in the voice signal from the latent variable.
(50) Meanwhile, the present invention is not limited to the above-described embodiment and various modifications and applications can be made without departing from the spirit and scope of the present invention.
REFERENCE SIGNS LIST
(51)
10 Input unit
20 Operation unit
30 Learning unit
40 Deep generation model storage unit
50 Parameter estimation unit
60 Fundamental frequency pattern estimation unit
90 Output unit
100 Voice signal analysis apparatus