Method of generating estimated value of local inverse speaking rate (ISR) and device and method of generating predicted value of local ISR accordingly
11200909 · 2021-12-14
Assignee
Inventors
- Chen-Yu Chiang (Hsinchu, TW)
- Guan-Ting Liou (Hsinchu, TW)
- Yih-Ru Wang (Hsinchu, TW)
- Sin-Horng Chen (Hsinchu, TW)
Cpc classification
G06N7/01
PHYSICS
G10L15/02
PHYSICS
G10L15/14
PHYSICS
International classification
G10L25/00
PHYSICS
G10L15/14
PHYSICS
Abstract
A method is disclosed. The proposed method includes: providing an initial speech corpus including plural utterances; based on a condition of maximum a posteriori (MAP), according to respective sequences of syllable duration, syllable duration prosodic state, syllable tone, base-syllable type, and break type of the k.sup.th utterance, using a probability of an ISR of the k.sup.th utterance x.sub.k to estimate an estimated value {circumflex over (x)}.sub.k of the x.sub.k; and through the MAP condition, according to respective sequences of syllable duration, syllable duration prosodic state, syllable tone, base-syllable type, and break type of the given l.sup.th breath group/prosodic phrase group (BG/PG) of the k.sup.th utterance, using a probability of an ISR of the l.sup.th BG/PG of the k.sup.th utterance x.sub.k,l to estimate an estimated value {circumflex over (x)}.sub.k,l of the x.sub.k,l wherein the {circumflex over (x)}.sub.k,l is the estimated value of local ISR, and a mean of a prior probability model of the {circumflex over (x)}.sub.k,l is the {circumflex over (x)}.sub.k.
Claims
1. A method of generating an estimated value of a local inverse speaking rate (ISR), comprising: providing an initial speech corpus including plural utterances, a baseline speaking rate-dependent hierarchical prosodic module (SR-HPM), plural linguistic features corresponding to the plural utterances, plural raw utterance-based ISRs, and plural observed prosodic-acoustic features (PAFs) to train the baseline SR-HPM and to label each the utterance in the initial speech corpus with a prosodic tag having a break type and a prosodic state to obtain a first prosody labeled speech corpus; based on a condition of maximum a posteriori (MAP), according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the k.sup.th one of the plural utterances, using a probability of a first ISR of the k.sup.th utterance x.sub.k to estimate an estimated value {circumflex over (x)}.sub.k of the x.sub.k; through the MAP condition, according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of an l.sup.th breath group/prosodic phrase group (BG/PG) of the k.sup.th utterance, using a probability x.sub.k,l of a second ISR of the l.sup.th BG/PG of the k.sup.th utterance to estimate an estimated value {circumflex over (x)}.sub.k,l of the x.sub.k,l, wherein the {circumflex over (x)}.sub.k,l is the estimated value of the local ISR, and a prior probability model of the {circumflex over (x)}.sub.k,l has a mean being the {circumflex over (x)}.sub.k; and through the MAP condition, according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of an m.sup.th prosodic phrase (PPh) of the l.sup.th BG/PG of the k.sup.th utterance, using a probability of a third ISR of the m.sup.th PPh of the l.sup.th BG/PG of the k.sup.th utterance x.sub.k,l,m to estimate an estimated value {circumflex over (x)}.sub.k,l,m of the x.sub.k,l,m, wherein the {circumflex over (x)}.sub.k,l,m is the estimated value of the local ISR, and a prior probability model of the {circumflex over (x)}.sub.k,l,m has a mean being the {circumflex over (x)}.sub.k,l.
2. The method according to claim 1, wherein the prior probability model of the {circumflex over (x)}.sub.k,l and the prior probability model of the {circumflex over (x)}.sub.k,l,m are a first Gaussian distribution and a second Gaussian distribution respectively, the baseline SR-HPM includes speaking rate (SR) normalization functions (NFs) and five prosodic main sub-models, and the step of training the baseline SR-HPM further includes: constructing the baseline SR-HPM by a Prosody Labeling and Modeling (PLM) algorithm; training the NFs with the plural linguistic features, the plural observed PAFs, and the plural raw utterance-based ISRs; engaging a normalization of the plural observed PAFs by the trained NFs to obtain plural SR normalized PAFs; training the five prosodic main sub-models by the plural SR normalized PAFs, the plural linguistic features and the plural raw utterance-based ISRs; and using the PLM algorithm to label each the utterance in the initial speech corpus with the break type and the prosodic state to obtain the prosodic tag of each the utterance and to produce the first prosody labeled speech corpus.
3. The method according to claim 2, wherein the five prosodic main sub-models are a syntactic-prosodic sub-model, a prosodic state sub-model, a syllable prosodic-acoustic sub-model, a break-acoustic sub-model and a prosodic state syntactic sub-model, the break type includes a break tag sequence, each one of components in the break tag sequence is selected from a group consisting of a BG/PG boundary prosodic break, a PPh boundary prosodic break, a first type prosodic word (PW) prosodic break with an F0 reset, a second type PW prosodic break with a perceived short pause, a third type PW prosodic break with a preboundary syllable duration lengthening, a normal prosodic break within a PW, and a tightly coupled syllable juncture prosodic break within a PW, the prosodic state includes a fundamental frequency prosodic state sequence, a syllable duration prosodic state sequence and an energy level prosodic state sequence, the prosodic tag is used to label each the utterance in the initial speech corpus with a four-level prosodic structure including four prosodic components of a syllable, a PW, a PPh and a BP/GP to describe the four-level prosodic structure accordingly, and to obtain the first prosody labeled speech corpus, the baseline SR-HPM is built up by using the plural raw utterance-based ISRs, and prosodic variations of each of the plural utterances are assumed to be controlled by the same SR.
4. The method according to claim 3, wherein the prior probability model of the first Gaussian distribution has a variation set to be a statistical variance of raw BG/PG-based ISRs of plural BG/PGs included in C utterances, a selection condition of the C utterances is that the C utterances are those C utterances having utterance-based ISRs being the closest ones to the {circumflex over (x)}.sub.k in the first prosody labeled speech corpus, and the prior probability model of the second Gaussian distribution has a variation set to be a statistical variance of raw PPh-based ISRs of plural PPhs included in D BG/PGs, a selection condition of the D BG/PGs is that the D BG/PGs are the D BG/PGs having BG/PG-based ISRs being the closest ones to the {circumflex over (x)}.sub.k,l in the first prosody labeled speech corpus.
5. The method according to claim 4, further comprising: re-training the baseline SR-HPM with the estimated value of the local ISR to obtain a re-trained SR-HPM, wherein the syntactic-prosodic, the prosodic state, the syllable prosodic-acoustic and the break-acoustic sub-models being all influenced by the SR, and the NFs are re-trained; according to the estimated value of the local ISR, labeling all the utterances in the first prosody labeled speech corpus by an estimated prosodic tag having an estimated break type and an estimated prosodic state with the PLM algorithm to obtain a second prosody labeled speech corpus; and using the estimated value of the local ISR and the re-trained SR-HPM to construct and train a PPh ISR prediction module, wherein the PPh ISR prediction module provides a predicting feature required by generating a predicted value of a PPh-based ISR, includes a neural network, and uses a regression scheme of the neural network to train the PPh ISR prediction module, the neural network has a hidden layer, an activation function and an output layer, and the neural network has plural input features including an ISR of utterance (ISR_Utt), a syllable number of utterance (#S_Utt), a BG/PG number of utterance (#B_Utt), a syllable number of current BG/PG (#S_B), a forward position of normalized BG/PG (Pos_B), a PPh number of current BG/PG (#P_B), a syllable number of current PPh (#S_P) and a forward position of normalized PPh (Pos_P), wherein the predicted value is a predicted value of the local ISR, the activation function is a hyperbolic tangent, the output layer is a node outputting the predicting feature, and the forward position is defined as (l−1)/(L−1), where L is the BG/PG number of utterance, and l is the forward position of BG/PG.
6. The method according to claim 5, further comprising: providing a first feature, linguistic features of a given utterance and a given utterance-based ISR to generate a predicted prosodic tag having a predicted break type and a predicted prosodic state of the given utterance, wherein the first feature is a feature required to generate the predicted prosodic tag, and is provided by the re-trained SR-HPM; using the predicting feature and the predicted prosodic tag to generate the predicted value of the local ISR; providing a second feature and the predicted prosodic tag to generate a predicted value of an SR-normalized prosodic-acoustic feature (PAF), wherein the second feature is a feature required to generate the predicted value of the SR-normalized PAF, and is provided by the re-trained SR-HPM; and providing a third feature and the predicted value of the local ISR to denormalize the SR-normalized PAF so as to generate a synthesized PAF, wherein the third feature is a feature required to denormalize the SR-normalized PAF, and is provided by the re-trained SR-HPM, and the synthesized PAF includes a syllable pitch contour, a syllable duration, a syllable energy level and a duration of silence between syllables.
7. An apparatus using the method according to claim 1 to generate a predicted value of the local ISR, comprising: a second prosody labeled speech corpus obtained from labeling each the utterance in the first prosody labeled speech corpus by an estimated prosodic tag having an estimated break type and an estimated prosodic state according to the estimated value of the local ISR; a re-trained SR-HPM receiving each the estimated value of each the local ISR and each the estimated prosodic tag, and obtained by re-training the baseline SR-HPM accordingly; a prosodic tag predictor receiving a first feature, a given utterance-based ISR and linguistic features of a given utterance to generate a predicted prosodic tag having a predicted break type and a predicted prosodic state; a PPh ISR prediction module receiving plural input features and each the estimated value of each the local ISR, having a neural network, using the neural network to train the PPh ISR prediction module, and outputting a predicting feature required by generating the predicted value of the local ISR; and a local ISR predictor receiving the predicting feature and the predicted prosodic tag to generate the predicted value of the local ISR.
8. The apparatus according to claim 7, further comprising a predicted value generator of an SR-normalized PAF and a synthesized PAF generator, wherein the predicted value generator of the SR-normalized PAF receives the predicted prosodic tag and a second feature to generate a predicted value of the SR-normalized PAF, and the synthesized PAF generator receives the predicted value of the local ISR, the predicted value of the SR-normalized PAF and a third feature, and denormalizes the predicted value of the SR-normalized PAF to generate a synthesized PAF, the first feature, the second feature and the third feature are respectively required to generate the predicted prosodic tag, to generate the predicted value of the SR-normalized PAF, and to denormalize the predicted value of the SR-normalized PAF, the first to the third features are provided by the re-trained SR-HPM, the neural network has a hidden layer, an activation function and an output layer, the PPh ISR prediction module is trained by using a regression scheme of the neural network, and the neural network has plural input features including an ISR of utterance (ISR_Utt), a syllable number of utterance (#S_Utt), a BG/PG number of utterance (#B_Utt), a syllable number of current BG/PG (#S_B), a forward position of normalized BG/PG (Pos_B), a PPh number of current BG/PG (#P_B), a syllable number of current PPh (#S_P) and a forward position of normalized PPh (Pos_P), wherein the activation function is a hyperbolic tangent, the output layer is a node, and the forward position is defined as (l−1)/(L−1), where L is the BG/PG number of utterance, and l is the forward position of BG/PG.
9. A method using the method according to claim 1 to generate a predicted value of the local ISR, comprising: labeling each the utterance in the first prosody labeled speech corpus by an estimated prosodic tag having an estimated break type and an estimated prosodic state according to each the estimated value of each the local ISR to generate a second prosody labeled speech corpus; receiving each the estimated value of each the local ISR and each the estimated prosodic tag so as to train the baseline SR-HPM into a re-trained SR-HPM; providing a first feature, linguistic features of a given utterance, and a given utterance-based ISR to generate a predicted prosodic tag having a predicted break type and a predicted prosodic state; providing a PPh ISR prediction module having a neural network; causing the PPh ISR prediction module to receive plural input features and each the estimated value of each the local ISR, using the neural network to train the PPh ISR prediction module, and outputting a predicting feature required by generating the predicted value of the local ISR; and using the predicting feature and the predicted prosodic tag to generate the predicted value of the local ISR.
10. The method according to claim 9, further comprising: generating a predicted value of the SR-normalized PAF via the predicted prosodic tag and a second feature; and denormalizing the predicted value of the SR-normalized PAF via a third feature and the predicted value of the local ISR to generate a synthesized PAF, wherein the first feature, the second feature and the third feature are respectively required to generate the predicted prosodic tag, to generate the predicted value of the SR-normalized PAF, and to denormalize the predicted value of the SR-normalized PAF, the first to the third features are provided by the re-trained SR-HPM, and the synthesized PAF includes a syllable pitch contour, a syllable duration, a syllable energy level and a duration of silence between syllables.
11. A method of generating an estimated value of a local inverse speaking rate (ISR), comprising: providing an initial speech corpus including plural utterances; based on a condition of maximum a posteriori (MAP), according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the k.sup.th one of the plural utterances, using a probability of a first ISR of the k.sup.th utterance x.sub.k to estimate an estimated value {circumflex over (x)}.sub.k of the x.sub.k; through the MAP condition, according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the l.sup.th breath group/prosodic phrase group (BG/PG) of the k.sup.th utterance, using a probability of a second ISR of the l.sup.th BG/PG of the k.sup.th utterance x.sub.k,l to estimate an estimated value {circumflex over (x)}.sub.k,l of the x.sub.k,l, wherein the {circumflex over (x)}.sub.k,l is the estimated value of the local ISR, and a prior probability model of the {circumflex over (x)}.sub.k,l has a mean being the {circumflex over (x)}.sub.k; and through the MAP condition, according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the m.sup.th prosodic phrase (PPh) of the l.sup.th BG/PG of the k.sup.th utterance, using a probability of a third ISR of the m.sup.th PPh of the l.sup.th BG/PG of the k.sup.th utterance x.sub.k,l,m to estimate an estimated value {circumflex over (x)}.sub.k,l,m of the x.sub.k,l,m, wherein the {circumflex over (x)}.sub.k,l,m is the estimated value of the local ISR, and a prior probability model of the {circumflex over (x)}.sub.k,l,m has a mean being the {circumflex over (x)}.sub.k,l.
12. The method according to claim 11, wherein the prior probability model of the {circumflex over (x)}.sub.k,l and the prior probability model of the {circumflex over (x)}.sub.k,l,m are a first Gaussian distribution and a second Gaussian distribution respectively, the step of providing the initial speech corpus including plural utterances further comprises providing plural linguistic features corresponding to plural utterances, plural raw utterance-based ISRs, and plural observed prosodic-acoustic features (PAFs) to train a baseline speaking rate-dependent hierarchical prosodic module (SR-HPM) and to label each the utterance in the initial speech corpus including the plural utterances with a prosodic tag having a break type and a prosodic state to obtain a first prosody labeled speech corpus, the baseline SR-HPM includes speaking rate (SR) normalization function (NFs) and five prosodic main sub-models, and the step of training the baseline SR-HPM further includes: constructing the baseline SR-HPM by a Prosody Labeling and Modeling (PLM) algorithm; training the NFs with the plural linguistic features, the plural observed PAFs, and the plural raw utterance-based ISRs; engaging a normalization of the plural observed PAFs by the trained NFs to obtain plural SR-normalized PAFs; training the five prosodic main sub-models by the plural SR-normalized PAFs, the plural linguistic features and the plural raw utterance-based ISRs; and using the PLM algorithm to label each the utterance in the initial speech corpus with the break type and the prosodic state to obtain the prosodic tag of each the utterance and to produce the first prosody labeled speech corpus.
13. The method according to claim 12, wherein the five prosodic main sub-models are a syntactic-prosodic sub-model, a prosodic state sub-model, a syllable prosodic-acoustic sub-model, a break-acoustic sub-model and a prosodic state syntactic sub-model, the break type includes a break tag sequence, each one of components in the break tag sequence is selected from a group consisting of a BG/PG boundary prosodic break, a PPh boundary prosodic break, a first type PW prosodic break with an F0 reset, a second type PW prosodic break with a perceived short pause, a third type PW prosodic break with a preboundary syllable duration lengthening, a normal prosodic break within a PW, and a tightly coupled syllable juncture prosodic break within a PW, the prosodic state includes a pitch prosodic state sequence, a syllable duration prosodic state sequence and an energy level prosodic state sequence, the prosodic tag is used to label each the utterance in the initial speech corpus with a four-level prosodic structure including four prosodic components of a syllable, a PW, a PPh and a BP/GP to describe the four-level prosodic structure accordingly, and to obtain the first prosody labeled speech corpus, the baseline SR-HPM is built up by using the plural raw utterance-based ISRs, and prosodic variations of the whole utterance are assumed to be controlled by the same SR.
14. The method according to claim 13, wherein the prior probability model of the first Gaussian distribution has a variation set to be a statistical variance of raw BG/PG-based ISRs of plural BG/PG included in C utterances, a selection condition of the C utterances is that the C utterances are those C utterances having the utterance-based ISRs being the most closest ones to the {circumflex over (x)}.sub.k in the first prosody labeled speech corpus, and the prior probability model of the second Gaussian distribution has a variation set to be a statistical variance of raw PPh-based ISRs of plural PPhs included in D BG/PGs, a selection condition of the D BG/PGs is that the D BG/PGs are those having the BG/PG-based ISRs being the closest ones to the {circumflex over (x)}.sub.k,l in the first prosody labeled speech corpus.
15. The method according to claim 14, further comprising: re-training the baseline SR-HPM with the estimated value of the local ISR to obtain a re-trained SR-HPM, wherein the syntactic-prosodic, the prosodic state, the syllable prosodic-acoustic and the break-acoustic sub-models being all influenced by the SR, and the NFs are re-trained; according to the estimated value of the local ISR, labeling all the utterances in the first prosody labeled speech corpus by an estimated prosodic tag having an estimated break type and an estimated prosodic state with the PLM algorithm to obtain a second prosody labeled speech corpus; and using the estimated value of the local ISR and the re-trained SR-HPM to construct and train a PPh ISR prediction module, wherein the PPh ISR prediction module provides a predicting feature required by generating a predicted value of a PPh-based ISR, includes a neural network, and uses a regression scheme of the neural network to train the PPh ISR prediction module, the neural network has a hidden layer, an activation function and an output layer, and the neural network has plural input features including an ISR of utterance (ISR_Utt), a syllable number of utterance (#S_Utt), a BG/PG number of utterance (#B_Utt), a syllable number of current BG/PG (#S_B), a forward position of normalized BG/PG (Pos_B), a PPh number of current BG/PG (#P_B), a syllable number of current PPh (#S_P) and a forward position of normalized PPh (Pos_P), wherein the predicted value is a predicted value of the local ISR, the activation function is a hyperbolic tangent, the output layer is a node outputting the predicting feature, and the forward position is defined as (l−1)/(L−1), where L is the BG/PG number of utterance, and l is the forward position of BG/PG.
16. The method according to claim 15, further comprising: providing a first feature, linguistic features of a given utterance and a given utterance-based ISR to generate a predicted prosodic tag having a predicted break type and a predicted prosodic state of the given utterance, wherein the first feature is required to generate the predicted prosodic tag, and is provided by the re-trained SR-HPM; using the predicting feature and the predicted prosodic tag to generate the predicted value of the local ISR; providing a second feature and the predicted prosodic tag to generate a predicted value of an SR-normalized prosodic-acoustic feature (PAF), wherein the second feature is required to generate the predicted value of the SR normalized PAF, and is provided by the re-trained SR-HPM; and providing a third feature and the predicted value of the local ISR to denormalize the SR normalized PAF so as to generate a synthesized PAF, wherein the third feature is required to denormalize the SR normalized PAF, and is provided by the re-trained SR-HPM, and the synthesized PAF includes a syllable pitch contour, a syllable duration, a syllable energy level and a duration of silence between syllables.
17. A method of generating an estimated value of a local inverse speaking rate (ISR), comprising: providing an initial speech corpus including plural utterances; based on a condition of maximum a posteriori (MAP), according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the k.sup.th one of the plural utterances, using a probability of a first ISR x.sub.k of the k.sup.th one of the plural utterances to estimate an estimated value {circumflex over (x)}.sub.k of the x.sub.k; and through the MAP condition, according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the l.sup.th breath group/prosodic phrase group (BG/PG) of the k.sup.th utterance, using a probability of a second ISR of the l.sup.th BG/PG of the k.sup.th utterance x.sub.k,l to estimate an estimated value {circumflex over (x)}.sub.k,l of the x.sub.k,l, wherein the {circumflex over (x)}.sub.k,l is the estimated value of the local ISR, and a prior probability model of the {circumflex over (x)}.sub.k,l has a mean being the {circumflex over (x)}.sub.k.
18. The method according to claim 17, further comprising through the MAP condition, according to respective sequences of a syllable duration, a syllable tone, a base-syllable type, and a break type of the m.sup.th prosodic phrase (PPh) of the l.sup.th BG/PG of the k.sup.th utterance, using a probability of a third ISR of the m.sup.th PPh of the l.sup.th BG/PG of the k.sup.th utterance x.sub.k,l,m to estimate an estimated value {circumflex over (x)}.sub.k,l,m of the x.sub.k,l,m, the estimated value of the local ISR is reset as {circumflex over (x)}.sub.k,l,m, wherein a prior probability model of the {circumflex over (x)}.sub.k,l,m has a mean being the {circumflex over (x)}.sub.k,l, the prior probability model of the {circumflex over (x)}.sub.k,l and the prior probability model of the {circumflex over (x)}.sub.k,l,m are a first Gaussian distribution and a second Gaussian distribution respectively, and the providing an initial speech corpus including plural utterances step further comprises providing plural linguistic features corresponding to plural utterances, plural raw utterance-based ISRs, and plural observed prosodic-acoustic features (PAFs) to train a baseline speaking rate-dependent hierarchical prosodic module (SR-HPM) and to label each the utterance in an initial speech corpus including the plural utterances with a prosodic tag having a break type and a prosodic state to obtain a first prosody labeled speech corpus.
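The PPh ISR prediction module recited in claims 5, 8 and 15 (a neural network with a hidden layer, a hyperbolic tangent activation and a single output node, fed with eight input features) can be illustrated with the following Python sketch. It is only an illustration under assumptions: the function names, the linear output node, the weight values and the handling of the single-unit case L = 1 (which the claims do not define) are hypothetical, and Pos_P is assumed to be computed in the same way as Pos_B.

```python
import math

def forward_position(index, count):
    """Normalized forward position (l - 1) / (L - 1) of the l-th unit among L units."""
    if count == 1:
        return 0.0  # assumption: the claims do not define the case L = 1
    return (index - 1) / (count - 1)

def predict_pph_isr(features, w_hidden, b_hidden, w_out, b_out):
    """One hidden tanh layer followed by a single (assumed linear) output node."""
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Hypothetical feature vector in the order listed in the claims:
# ISR_Utt, #S_Utt, #B_Utt, #S_B, Pos_B, #P_B, #S_P, Pos_P
features = [
    0.21,                    # ISR of utterance (second/syllable)
    40,                      # syllable number of utterance
    3,                       # BG/PG number of utterance
    15,                      # syllable number of current BG/PG
    forward_position(2, 3),  # Pos_B of the 2nd of 3 BG/PGs
    4,                       # PPh number of current BG/PG
    5,                       # syllable number of current PPh
    forward_position(1, 4),  # Pos_P of the 1st of 4 PPhs (assumed analogous)
]

# Untrained placeholder weights for two hidden units (hypothetical values).
w_hidden = [[0.0] * 8, [0.0] * 8]
b_hidden = [0.0, 0.0]
w_out = [1.0, 1.0]
b_out = 0.25
predicted = predict_pph_isr(features, w_hidden, b_hidden, w_out, b_out)
```

In practice the weights would come from the regression training on the estimated local ISRs; the placeholder values above only exercise the forward pass.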
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Other objectives, advantages and the efficacy of the present invention are described in detail below through the preferred embodiments, with reference to the accompanying drawings.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
(10) The present invention will now be described more specifically with reference to the following embodiments. It is to be noted that the following descriptions of the preferred embodiments of this invention are presented herein for purposes of illustration and description only; they are not intended to be exhaustive or to be limited to the precise form disclosed.
(11) The present invention provides a method of estimating the local inverse speaking rate (ISR). Starting from the existing speaking rate-dependent hierarchical prosodic module (SR-HPM), a large training corpus is analyzed via the SR-HPM to extract its prosodic structure and to estimate, for each prosodic phrase (PPh), features such as the ISR, the prosodic state, the tone influence factor and the syllable type influence factor, which assist the local SR estimation. Removing these influence factors frees the SR estimate from bias, makes the estimated SR match the speaking rate as perceived, and allows a reasonable segment unit tied to the prosodic structure to be used in the estimation. Using such a segment unit makes the estimated SR follow the variation of the human speaking rate within a prosodic range rather than vary randomly, and provides a unit over which a reasonable SR can be estimated. Finally, a hierarchical ISR estimation module is built to carry out the local SR estimation; it stabilizes the local ISR estimate by using the ISR estimated over the wider range of the upper level. The proposed method mainly aims to: solve the inability of the conventional ISR estimation method to estimate a local ISR, which stems from the lack of a defined, reasonable SR estimation segment; reduce the bias in SR estimation caused by the conventional method's neglect of the influences of the text content and the prosodic structure, so that the estimated ISR approaches the SR variation of actual speech more closely; and provide a local ISR estimation method that uses prosodic information as features.
(13) As shown in
(15) The structure of local ISR estimation.
(16) The present invention proposes a method of estimating a local ISR using a four-layer prosodic structure.
(17) In the present invention, the SR estimation method uses the ISR (second/syllable) instead of the SR (syllable/second) because the ISR is more convenient to use as a PAF in Text-to-Speech (TTS) applications. Here, the PPh is used, as an example, as the segment unit for estimating the local ISR. Through the SR-HPM, a large training corpus is analyzed to extract its prosodic structure, and the ISR of every PPh is estimated, which in turn further reinforces the robustness of the SR-HPM.
(19) The details of the proposed system will be described as follows, and as shown in
(20) 1. Training a baseline SR-HPM 304 and labeling the initial speech corpus 301 with the prosodic tag having the break type and the prosodic state to obtain the first prosody-labeled speech corpus 303; and
(21) 2. Estimating the local SR 305.
(22) In the first step (referring to
(23) In the second step, the PPh ISRs, that is, the local ISRs defined by the present invention, are estimated based on the MAP condition. This local ISR (PPh ISR) is assumed to be an ISR that deviates from that of the prosodic units of the upper layer (utterance or BG/PG). Intuitively, the ISR of a PPh can be estimated as the mean syllable duration of all the syllables in the PPh; the ISR estimated by this simple method is named the raw local ISR. However, a PPh usually contains only a small number of syllables, so this estimate is easily inaccurate. Thus, the raw utterance-based ISR cannot adequately represent the actual ISR. Therefore, the present invention provides a hierarchical MAP estimation method that estimates the local ISR sequentially from the highest layer to the bottom layer, so that the estimated ISR does not differ too much from the ISR of the upper-layer prosodic unit (utterance or BG/PG). This method also considers the syllable tone, the base-syllable type and the prosodic structure found by the baseline SR-HPM to curb the deviation of the estimate.
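The raw ISR mentioned above, namely the mean syllable duration of a prosodic unit, can be sketched in Python as follows. This is only an illustration: the function name raw_isr and the duration values are hypothetical, and syllable durations are assumed to be given in seconds.

```python
from statistics import fmean

def raw_isr(syllable_durations):
    """Raw ISR (second/syllable): mean syllable duration of one prosodic unit."""
    if not syllable_durations:
        raise ValueError("a prosodic unit must contain at least one syllable")
    return fmean(syllable_durations)

# Hypothetical utterance made of two PPhs; values are syllable durations in seconds.
pph_durations = [
    [0.20, 0.22, 0.18],
    [0.30, 0.26],
]

# Raw local (PPh-based) ISRs: each relies on very few syllables.
raw_pph_isrs = [raw_isr(durations) for durations in pph_durations]

# Raw utterance-based ISR: pools every syllable of the utterance.
raw_utt_isr = raw_isr([d for durations in pph_durations for d in durations])
```

With so few syllables per PPh, a single lengthened syllable shifts the raw local ISR noticeably, which is the instability the hierarchical MAP estimation is meant to curb.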
(25) Estimation Method of the Local ISR
(26) Here, the utterance-based ISR {circumflex over (x)}.sub.k is estimated first, and the statistical data of {circumflex over (x)}.sub.k is then used as the prior probability to assist the estimation of the BG/PG-based ISR {circumflex over (x)}.sub.k,l. Finally, the statistical data of {circumflex over (x)}.sub.k,l is used as the prior probability to assist the estimation of the PPh-based ISR {circumflex over (x)}.sub.k,l,m, which is the local ISR of the present invention. The proposed method is described in detail below, step by step.
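The top-down order of estimation can be illustrated with a deliberately simplified Python sketch. It assumes that every syllable duration is Gaussian around the ISR of its unit with a known variance (noise_var), so that the MAP estimate under a Gaussian prior has a closed form. It leaves out the syllable tone, base-syllable type and prosodic state terms, as well as the normalization performed by the SR-HPM, so it is not the estimator of the present invention itself. All names and numeric values are hypothetical.

```python
def gaussian_map(durations, prior_mean, prior_var, noise_var):
    """Closed-form MAP estimate of a Gaussian mean under a Gaussian prior."""
    n = len(durations)
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + sum(durations) / noise_var) / precision

def hierarchical_isr(utterance, prior_var_bg, prior_var_pph, noise_var):
    """Estimate ISRs top-down: utterance, then BG/PG, then PPh.

    utterance: list of BG/PGs; each BG/PG is a list of PPhs;
               each PPh is a list of syllable durations in seconds.
    """
    all_syllables = [d for bg in utterance for pph in bg for d in pph]
    # Utterance level: maximum likelihood, i.e. the plain mean in this sketch.
    x_k = sum(all_syllables) / len(all_syllables)
    levels = []
    for bg in utterance:
        bg_syllables = [d for pph in bg for d in pph]
        # BG/PG level: the prior mean is the utterance-based estimate.
        x_kl = gaussian_map(bg_syllables, x_k, prior_var_bg, noise_var)
        # PPh level: the prior mean is the BG/PG-based estimate.
        x_klm = [gaussian_map(pph, x_kl, prior_var_pph, noise_var) for pph in bg]
        levels.append((x_kl, x_klm))
    return x_k, levels

# Hypothetical utterance with one BG/PG that contains two PPhs.
utterance = [[[0.20, 0.22, 0.18], [0.30, 0.26]]]
x_k, levels = hierarchical_isr(utterance, 1e-3, 1e-3, 1e-3)
```

In this sketch each estimate is pulled from the raw mean of its own unit toward the estimate of the layer above, and the pull is stronger when the unit has few syllables or the prior variance is small.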
(27) 1. Estimation of Utterance-Based ISR {circumflex over (x)}.sub.k
(28) We assume that the prior probability density function is a Gaussian distribution, and thus the maximum likelihood condition and the following mathematical formula are used to estimate the utterance-based ISR {circumflex over (x)}.sub.k:
(29)
where sd.sub.k={sd.sub.k,n}.sub.n=1˜N.sub.
sd.sub.k,n=sd.sub.k,n′+γ.sub.t.sub.
(30) where γ.sub.t.sub.
(31)
where p(q.sub.k|sd.sub.k,x.sub.k′,t.sub.k,s.sub.k,B.sub.k) represents the posterior probability of syllable duration prosodic state q.sub.k={q.sub.k,n}.sub.n=1˜N.sub.
p(sd.sub.k|q.sub.k,x.sub.k,t.sub.k,s.sub.k,B.sub.k)≈p(sd.sub.k|q.sub.k,x.sub.k,t.sub.k,s.sub.k)=Π.sub.n=1.sup.N.sup.
(32) To simplify the equation, assume that the syllable duration prosodic state is only related to the break type sequence B.sub.k which labels the prosodic structure, so the posterior probability can be simplified as:
p(q.sub.k|sd.sub.k,x.sub.k′,t.sub.k,s.sub.k,B.sub.k)≈p(q.sub.k|B.sub.k)=Π.sub.n=1.sup.N.sup.
(33) where probability p(q.sub.k,n|B.sub.k) can be estimated through the probability p(q.sub.k,n|q.sub.k,n-1,B.sub.k,n-1,B.sub.k,n) by using the forward-backward algorithm. APs γ.sub.t.sub.
(34) 2. Estimation of the BG/PG-Based ISR {circumflex over (x)}.sub.k,l
(35) Next, the BG/PG-based ISR of the l.sup.th BG/PG of the k.sup.th utterance, x.sub.k,l, is estimated. The estimated value {circumflex over (x)}.sub.k,l is derived by using the estimated ISR of the k.sup.th utterance, {circumflex over (x)}.sub.k, as the mean of the prior probability of x.sub.k,l. Its mathematical equation can be expressed as:
(36)
where sd.sub.k,l={sd.sub.k,l,n}.sub.n=1˜N.sub.
sd.sub.k,l,n=sd.sub.k,l,n′+γ.sub.t.sub.
where γ.sub.t.sub.
(37)
where p(q.sub.k,l|sd.sub.k,l,x.sub.k,l′,t.sub.k,l,s.sub.k,l,B.sub.k,l) represents the posterior probability of the prosodic state; it is also assumed that the syllable duration prosodic state is only related to the break type sequence which labels the prosodic structure, so the posterior probability can be simplified as:
p(q.sub.k,l|sd.sub.k,l,x.sub.k,l′,t.sub.k,l,s.sub.k,l,B.sub.k,l)≈Π.sub.n=1.sup.N.sup.
(38) p(sd.sub.k,l|q.sub.k,l,x.sub.k,l,t.sub.k,l,s.sub.k,l,B.sub.k,l) is the likelihood function as shown in Eq. (10):
p(sd.sub.k,l|q.sub.k,l,x.sub.k,l,t.sub.k,l,s.sub.k,l,B.sub.k,l)≈Π.sub.n=1.sup.N.sup.
where probability p(q.sub.k,l,n|B.sub.k,l) can be estimated through the probability p(q.sub.k,l,n|q.sub.k,l,n-1,B.sub.k,l,n-1,B.sub.k,l,n) by using the forward-backward algorithm. APs γ.sub.t.sub.
(39) The prior probability p(x.sub.k,l) is a Gaussian distribution, that is x.sub.k,l˜N({circumflex over (x)}.sub.k,v.sub.x.sub.
(40)
(41) k.sub.c represents the c.sub.th smallest utterance index having a difference (|{circumflex over (x)}.sub.k−{circumflex over (x)}.sub.k.sub.
(42) 3. Estimation of the Local/PPh-Based ISR {circumflex over (x)}.sub.k,l,m
(43) Next, the local/PPh-based ISR is estimated via the MAP condition. The BG/PG-based ISR of the l.sup.th BG/PG of the k.sup.th utterance is used as the prior probability to estimate the PPh-based ISR of the m.sup.th PPh of the l.sup.th BG/PG of the k.sup.th utterance, {circumflex over (x)}.sub.k,l,m. Similar to the estimation method of the BG/PG-based ISR {circumflex over (x)}.sub.k,l, the mathematical equation of the estimation of {circumflex over (x)}.sub.k,l,m can be expressed as:
(44)
where sd.sub.k,l,m={sd.sub.k,l,m,n}.sub.n=1˜N.sub.
(45)
(46) {k.sup.(l.sup.
(47)
with the BG/PG-based ISR, {circumflex over (x)}.sub.k,l, and that is the l.sub.d BG/PG of the k.sup.(l.sup.
(48)
represents the raw PPh-based ISR of m.sup.th PPh ISR of the l.sub.d BG/PG of the k.sup.(l.sup.
(49) Embodiments of Local ISR
(50)
(51) Regarding the training phase, as shown in
(52) As above-mentioned, the present invention provides a method of generating an estimated value of a local inverse speaking rate (ISR) in the training phase, the method includes: providing plural linguistic features corresponding to plural utterances, plural raw utterance-based ISRs, and plural observed prosodic-acoustic features (PAFs) to train a baseline speaking rate-dependent hierarchical prosodic module (referring to
(53) In the method of generating an estimated value of the ISR above, the prior probability model of the {circumflex over (x)}.sub.k,l and the prior probability model of the {circumflex over (x)}.sub.k,l,m are a first Gaussian distribution and a second Gaussian distribution respectively. The baseline SR-HPM 304/504 includes speaking rate normalization functions (NFs) 103 and five main prosodic sub-models 104 (referring to
(54) As shown in
(55) A variation of a prior probability model of the first Gaussian distribution is set to be a statistical variance (referring to
(56) As shown in
(57) In
(58) As shown in
(59) The method shown in
(60) As shown in
(61) As shown in
(62)
(63) As shown in
(64) In
(65) In the present invention, several prosody generation experiments are conducted to verify that the estimated local ISRs are meaningful and could accurately describe the speaker's speaking rate variations. Two experiments are designed: an oracle one and a real one. Regarding the prosody generation of the oracle prosody generation experiment, the correct break type sequence is given, then the PAFs are generated, and this sequence is the break type sequence generated by the given trained SR-HPM. In other words, the correct prosodic structure is given, and then the PAFs are synthesized. The PAFs of the real prosody generation experiment are predicted through the real and entire prosody generation procedure, wherein the break type sequence is generated by the break type and prosodic state prediction or the prediction of prosodic tag 511 (referring to
(66) The purpose of the oracle experiment is to examine whether the estimated local ISR can accurately model the prosodic variations in terms of objective measures. The objective measures used here are the root-mean-square error (RMSE) and the correlation coefficient calculated between the true and generated PAFs. We compare the performances of the utterance-based, BG/PG-based and PPh-based ISR estimations, and of the associated estimation methods: RAW, EM, and EM-MAP. The RAW method simply estimates the ISR by averaging the syllable durations of a prosodic unit. The EM-MAP method estimates the local ISR by Eq. (6) and Eq. (14), while the EM method estimates the local ISR by Eq. (6) and Eq. (14) without the prior probability p(x). It can be seen from Table 1(a) that, regarding the generation of PAFs, estimation with EM-MAP in general yielded lower RMSEs and higher correlation coefficients than estimation with EM or with RAW. In particular, the PPh-based ISR obtained by EM-MAP possesses the lowest RMSE and the highest correlation coefficient for sd (syllable duration) and sp (syllable pitch contour).
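The two objective measures named above can be computed as in the following minimal sketch. It uses the standard definitions of RMSE and of the Pearson correlation coefficient; the function names and the sample values in the usage lines are hypothetical and are not data from Table 1.

```python
import math


def rmse(true_vals, pred_vals):
    """Root-mean-square error between true and generated feature values."""
    if len(true_vals) != len(pred_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_vals, pred_vals)]
    return math.sqrt(sum(squared) / len(squared))


def correlation(true_vals, pred_vals):
    """Pearson correlation coefficient between true and generated values."""
    if len(true_vals) != len(pred_vals) or len(true_vals) < 2:
        raise ValueError("inputs must have equal length of at least two")
    n = len(true_vals)
    mean_t = sum(true_vals) / n
    mean_p = sum(pred_vals) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_vals, pred_vals))
    var_t = sum((t - mean_t) ** 2 for t in true_vals)
    var_p = sum((p - mean_p) ** 2 for p in pred_vals)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


# Hypothetical true and generated syllable durations.
true_sd = [0.21, 0.18, 0.25, 0.30]
generated_sd = [0.20, 0.19, 0.27, 0.28]
sd_rmse = rmse(true_sd, generated_sd)
sd_cor = correlation(true_sd, generated_sd)
```

A lower RMSE together with a higher correlation coefficient indicates that the generated PAFs follow the true PAFs more closely, which is how the configurations in Table 1 are compared.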
(67) Table 1(b) shows the RMSEs and correlation coefficients between the PAFs predicted by the real prosody generation experiments and the true PAFs. We compare the results of three configurations of real prosody generation: UTT-based RAW, UTT-based EM, and PPh-based EM-MAP. The UTT-based RAW results are obtained from the PAFs generated by the baseline SR-HPM with the raw utterance-based ISR. The UTT-based EM results are obtained from the utterance-based ISR estimated by Eq. (1) and the PAFs generated by the SR-HPM re-trained with the linguistic features and the utterance-based EM-estimated ISR. The PPh-based EM-MAP results are obtained by using the SR-HPM re-trained with the PPh-based ISR estimated by Eq. (14), together with the local ISR predictor, so that the PPh-based ISR is generated by the PPh-based prediction module. As shown in Table 1(b), PPh-based EM-MAP has the best overall performance, most clearly for syllable duration (RMSE of 48.0 versus 49.1 and 48.8, and correlation coefficient of .783 versus .770 and .773), while the se and pd results are nearly identical across the three configurations. An informal listening test confirmed that the synthesized speech of the new method using PPh-based ISR estimates is more vivid than that of the existing SR-HPM method using a given utterance-based ISR.
(68) Table 2 shows the prediction results of the PPh ISR prediction module. The same structure is also used to predict the BG/PG-based ISR, and the influences of various prosody-related features on the SR estimation of the local ISR are tested, wherein NN is a neural network, and the total residual error (TRE) is reported for both the training set and the test set. The results show that adding the number of syllables included in the prosodic unit of the local ISR predictor helps the prediction of the SR, mainly because a faster unit usually includes more syllables and a slower unit usually includes fewer syllables. This indicates that the number of syllables included in the unit is related to the SR. Since the prosody is also influenced by the SR, using the prosody-related features can indeed assist the estimation of the SR.
(69) TABLE 1 RMSEs and correlation coefficients between the predicted and true PAFs under the conditions of (a) with correct break and correct local ISR, and (b) with predicted break and predicted local ISR.

(a)
                    UTT-based.sup.a             BG/PG-based.sup.b                         PPh-based.sup.c
                    RAW.sup.d     EM.sup.e      RAW           EM            EM-MAP.sup.f  RAW           EM            EM-MAP
RMSE      sd.sup.g  48.2          47.7          47.7          48.3          47.2          48.0          46.2          45.4
          sp.sup.h  .1472         .1467         .1650         .1469         .1469         .1472         .1465         .1463
          se.sup.i  3.54          3.53          3.56          3.56          3.52          3.57          3.55          3.56
          pd.sup.j  55.2          55.2          58.2          56.8          55.5          61.9          60.6          59.6
COR.sup.k sd        .779          .784          .783          .784          .790          .786          .802          .810
          sp.sup.l  .776          .776          .775          .774          .776          .773          .780          .779
                    .815          .814          .815          .815          .816          .815          .815          .816
                    .631          .631          .634          .631          .632          .633          .633          .632
                    .524          .524          .524          .524          .527          .526          .525          .527
          se        .887          .888          .887          .887          .890          .887          .887          .887
          pd        .954          .954          .948          .951          .954          .941          .943          .945

(b)
                    RMSE                            COR
                    sd      sp      se      pd      sd      sp.sup.m                 se      pd
UTT-based RAW       49.1    .1597   3.63    88.2    .770    [.727 .774 .600 .494]    .881    .881
UTT-based EM        48.8    .1580   3.63    87.4    .773    [.731 .773 .602 .501]    .882    .881
PPh-based EM-MAP    48.0    .1578   3.63    87.6    .783    [.734 .775 .602 .498]    .883    .880

.sup.aUTT-based: SR-HPM with utterance-based ISR.
.sup.bBG/PG-based: SR-HPM trained with BG/PG-based ISR.
.sup.cPPh-based: SR-HPM trained with PPh-based ISR.
.sup.dRAW: Raw ISR obtained by simply averaging syllable duration.
.sup.eEM: ISR estimated with EM algorithm without the prior p(x).
.sup.fEM-MAP: ISR estimated by the EM algorithm with MAP criterion.
.sup.gsd: second, .sup.hsp: logHz, .sup.ise: dB, .sup.jpd: second
.sup.kCOR: correlation coefficient
.sup.lsp: CORs of four-dimensional logF0 contour
(70) TABLE-US-00002 TABLE 2 the total residual errors (TRE) of the predicted BG/PG and PPh ISR ISR_Utt TREs # B_Utt # P_B Training/ Pos_B # S_Utt # S_B Pos_P # S_P Test BG/PG v 1.09/1.20 NN v v 1.12/1.24 v v 1.14/1.18 v v v 1.02/1.14 PPh v v v v 0.93/0.98 NN v v v v v 0.89/0.94
(71) According to the aforementioned descriptions, the present invention discloses an ISR estimation method that uses a hierarchical structure to combine a prosodic model with a prosodic structure. This solves the problem that the conventional ISR estimation method cannot estimate the local ISR of a small region, because it cannot use the prosodic structure to provide a reasonable estimation range of the SR. The method also provides a clean ISR estimation that is not influenced by the text and the prosodic structure, which solves the problem that the ISR estimation is easily biased by the SR influence factors, so that the ISR can be estimated more accurately to model the prosodic variety in speech. The ISR estimation can be used in the areas of Speech Synthesis, Speech Recognition, and Natural Language Processing as training features, or used in analytical applications, and thus has non-obviousness and novelty.
(72) While the invention has been described in terms of what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention need not be limited to the disclosed embodiments. Therefore, it is intended to cover various modifications and similar configurations included within the spirit and scope of the appended claims, which are to be accorded with the broadest interpretation so as to encompass all such modifications and similar structures.