Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech synthesizing

09837084 · 2017-12-05

Abstract

A speech-synthesizing device includes a hierarchical prosodic module, a prosody-analyzing device, and a prosody-synthesizing unit. The hierarchical prosodic module generates at least a first hierarchical prosodic model. The prosody-analyzing device receives a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature, and generates at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the first hierarchical prosodic model. The prosody-synthesizing unit synthesizes a second prosodic feature based on the hierarchical prosodic module, the low-level linguistic feature and the prosodic tag.

Claims

1. A speech-synthesizing device, comprising: a hierarchical prosodic module generating at least a first hierarchical prosodic model; a prosody structure analyzing device, receiving a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature, and generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the first hierarchical prosodic model, wherein the prosodic tag includes a prosodic break sequence describing at least an inter-syllable pause duration and a prosodic state sequence defining at least a syllable pitch contour, a syllable duration and a syllable energy level, and describes a Mandarin Chinese prosodic hierarchical structure including a syllable, a prosodic word, a prosodic phrase and one of a breath group and a prosodic phrase group; a prosody-synthesizing unit synthesizing a second prosodic feature based on the hierarchical prosodic module, the low-level linguistic feature and the prosodic tag; a prosodic feature extractor receiving a speech input and the low-level linguistic feature, segmenting the speech input to form a segmented speech, and generating the first prosodic feature based on the low-level linguistic feature and the segmented speech; and a prosody-synthesizing device, wherein the first hierarchical prosodic model is generated based on a first speech speed, and wherein, when the prosody-synthesizing device is to generate a second speech speed different from the first speech speed, the first hierarchical prosodic model is replaced with a second hierarchical prosodic model having the second speech speed, the prosody-synthesizing unit changes the second prosodic feature to a third prosodic feature, and the speech-synthesizing device generates a speech synthesis based on the third prosodic feature and the low-level linguistic feature.

2. A speech-synthesizing device as claimed in claim 1, further comprising: an encoder receiving the prosodic tag and the low-level linguistic feature to generate a code stream; and a decoder receiving the code stream, and restoring the prosodic tag and the low-level linguistic feature.

3. A speech-synthesizing device as claimed in claim 2, wherein the encoder includes a first codebook providing an encoding bit corresponding to the prosodic tag and the low-level linguistic feature so as to generate the code stream, and the decoder includes a second codebook providing the encoding bit to reconstruct the code stream into the prosodic tag and the low-level linguistic feature.

4. A speech-synthesizing device as claimed in claim 2, further comprising: a prosody-synthesizing device receiving the prosodic tag and the low-level linguistic feature reconstructed by the decoder to generate the second prosodic feature including the syllable pitch contour, the syllable duration, the syllable energy level and the inter-syllable pause duration.

5. A speech-synthesizing device as claimed in claim 4, wherein the second prosodic feature is reconstructed by a superposition module.

6. A speech-synthesizing device as claimed in claim 4, wherein the inter-syllable pause duration is reconstructed by looking up a codebook.

7. A method for synthesizing a speech, comprising steps of: providing a hierarchical prosodic module, a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature; generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the hierarchical prosodic module, wherein the prosodic tag includes a prosodic break sequence describing at least an inter-syllable pause duration and a prosodic state sequence defining at least a syllable pitch contour, a syllable duration and a syllable energy level, and describes a Mandarin Chinese prosodic hierarchical structure including a syllable, a prosodic word, a prosodic phrase and one of a breath group and a prosodic phrase group; and outputting the speech according to the prosodic tag.

8. A method as claimed in claim 7, further comprising steps of: providing an input speech; segmenting the input speech to generate a segmented input speech; extracting a prosodic feature from the segmented input speech according to the low-level linguistic feature to generate the first prosodic feature; analyzing the first prosodic feature to generate the prosodic tag; encoding the prosodic tag to form a code stream; decoding the code stream; synthesizing a second prosodic feature based on the low-level linguistic feature and the prosodic tag; and outputting the speech based on the low-level linguistic feature and the second prosodic feature.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a schematic diagram showing a speech-synthesizing apparatus according to one embodiment of the present invention;

(2) FIG. 2 is a schematic diagram showing a Mandarin Chinese speech hierarchical prosodic structure according to one embodiment of the present invention;

(3) FIG. 3 shows a flow chart of utilizing a HMM-based speech synthesizer to generate the synthesized speech according to one embodiment of the present invention;

(4) FIGS. 4A-4B are schematic diagrams showing examples of speaker-dependent and speaker-independent prosodic features, both original and reconstructed after encoding, according to one embodiment of the present invention; and

(5) FIGS. 5A-5D are schematic diagrams showing the waveforms and pitch contours of speech synthesized at different speeds, transformed after encoding the original speech and prosodic information, according to one embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

(6) The present invention will now be described more specifically with reference to the following embodiments. It is to be noted that the following descriptions of preferred embodiments of this invention are presented herein for the purposes of illustration and description only; they are not intended to be exhaustive or to limit the invention to the precise form disclosed.

(7) To achieve the aforementioned objective, the present invention employs a hierarchical prosodic module in a prosody encoding apparatus whose block diagram is shown in FIG. 1. Referring to FIG. 1, the speech-synthesizing apparatus 10 includes a speech segmentation and prosodic feature extractor 101, a hierarchical prosodic module 102, a prosodic structure analysis unit 103, an encoder 104, a decoder 105, a prosodic feature synthesizer unit 106, a speech synthesizer 107, a prosodic structure analysis device 108, a prosodic feature synthesizer device 109, a prosodic message encoding device 110 and a prosodic message decoding device 111.

(8) The basic concepts of the present invention are set forth below. First, a speech signal and its corresponding low-level linguistic feature A1 are input into the speech segmentation and prosodic feature extractor 101, which performs syllable boundary division on the input speech using an acoustic model and obtains syllable prosodic features for use by the prosodic structure analysis unit 103.

(9) The main function of the hierarchical prosodic module 102 is to describe the prosodic hierarchical structure of Mandarin Chinese, including a syllable prosodic-acoustic model, a syllable juncture prosodic-acoustic model, a prosodic state model, and a break-syntax model.

(10) The main function of the prosodic structure analysis unit 103 is to use the hierarchical prosodic module 102 to analyze the prosodic feature A3, which is generated by the speech segmentation and prosodic feature extractor 101, and then to represent the speech prosody by prosodic structures in terms of prosodic tags.

(11) The main function of the encoder 104 is to encode the messages necessary for the reconstruction of the speech prosody into a bit stream. Those messages include the prosodic tag A4 generated by the prosodic structure analysis unit 103 and the input low-level linguistic feature A1.

(12) The main function of the decoder 105 is to decode the bit stream A5 so as to recover the prosodic tag A6 required by the prosodic feature synthesizer unit 106 and the low-level linguistic feature A1.

(13) The main function of the prosodic feature synthesizer unit 106 is to make use of the decoded prosodic tag A6 and the low-level linguistic feature A1 to synthesize and reconstruct the speech prosodic feature A7, with the input from the hierarchical prosodic module 102 as side information.

(14) The main function of the speech synthesizer 107 is to synthesize the speech with the reconstructed prosodic feature A7 and the low-level linguistic feature A1 based on the hidden Markov model.

(15) The prosodic structure analysis device 108 comprises the hierarchical prosodic module 102 and the prosodic structure analysis unit 103, and takes advantage of the prosodic structure analysis unit 103 while using the hierarchical prosodic module 102 to represent the prosodic feature A3 of the speech input by prosodic structures in terms of prosodic tags A4.

(16) The prosodic feature synthesizer device 109 comprises the hierarchical prosodic module 102 and the prosodic feature synthesizer unit 106, and takes advantage of the prosodic feature synthesizer unit 106, while using the hierarchical prosodic module 102 as a side information provider, to generate a second prosodic feature A7 using inputs of the second prosodic tag A6 and the low-level linguistic feature A1 reconstructed by the decoder 105.

(17) The prosodic message encoding device 110 comprises the speech segmentation and prosodic feature extractor 101, the hierarchical prosodic module 102, the prosodic structure analysis unit 103, the encoder 104 and the prosodic structure analysis device 108. The prosodic message encoding device 110 firstly uses the speech segmentation and prosodic feature extractor 101 to segment the input speech by the low-level linguistic feature A1 and to obtain a first prosodic feature A3. Then the prosodic structure analysis device 108 generates a first prosodic tag A4 based on the first prosodic feature A3, the low-level linguistic feature A1 and a high-level linguistic feature A2. The encoder 104 then forms a code stream A5 based on the first prosodic tag A4 and the low-level linguistic feature A1.

(18) The prosodic message decoding device 111 comprises the hierarchical prosodic module 102, the decoder 105, the prosodic feature synthesizer unit 106, the speech synthesizer 107 and the prosodic feature synthesizer device 109. The decoder 105 decodes the code stream A5, generated from the prosodic message encoding device 110, to reconstruct a second prosodic tag A6 and the low-level linguistic feature A1, which are used to synthesize a second prosodic feature A7 by the prosodic feature synthesizer device 109. The second prosodic feature A7 is then used to generate the output speech by the speech synthesizer 107.

(19) The equations set forth hereinafter are for introducing some preferred embodiments according to the present invention. The following equation is employed by the prosodic structure analysis unit 103 for representing the speech prosody by prosodic structures in terms of prosodic tags. The method is to input the prosodic acoustic feature sequence (A) and the linguistic feature sequence (L) into the prosodic structure analysis unit 103, which may output the best prosodic tag sequence (T). The best prosodic tag sequence (T) can be used for representing the prosodic features of the speech and then for later encoding. The corresponding mathematical equation is:

(20)
$$
T^* = \{B^*, P^*\} = \arg\max_{T} P(T\mid A,L) = \arg\max_{T} P(T,A\mid L) = \arg\max_{T} P(A\mid T,L)\,P(T\mid L)
$$
$$
= \arg\max_{B,P} P(X,Y,Z\mid B,P,L)\,P(B,P\mid L) \approx \arg\max_{B,P} \underbrace{P(X\mid B,P,L)\,P(Y,Z\mid B,L)\,P(P\mid B)\,P(B\mid L)}_{\text{hierarchical prosodic model}}
$$
wherein $A = \{X, Y, Z\} = \{A_1^N\} = \{X_1^N, Y_1^N, Z_1^N\}$ is the prosodic acoustic feature sequence, $N$ is the number of syllables in the speech, and $X$, $Y$ and $Z$ denote the syllable-based prosodic acoustic feature, the inter-syllable prosodic acoustic feature and the differential prosodic acoustic feature, respectively.
$L = \{POS, PM, WL, t, s, f\} = \{L_1^N\} = \{POS_1^N, PM_1^N, WL_1^N, t_1^N, s_1^N, f_1^N\}$ is the linguistic feature sequence, wherein $\{POS, PM, WL\}$ is the high-level linguistic feature sequence, with $POS$, $PM$ and $WL$ denoting the part-of-speech sequence, the punctuation mark sequence and the word length sequence, respectively; $\{t, s, f\}$ is the low-level linguistic feature sequence, with $t$, $s$ and $f$ denoting tone, base-syllable type and syllable final type, respectively.
$T = \{B, P\}$ is the prosodic tag sequence, where $B = \{B_1^N\}$ is the prosodic break sequence, $P = \{p, q, r\}$ is the prosodic state sequence, and $p$, $q$ and $r$ denote the syllable pitch, syllable duration and syllable energy prosodic state sequences, respectively.

(21) The prosodic tag sequence describes the Mandarin Chinese prosodic hierarchical structure modeled by the hierarchical prosodic module 102. Referring to FIG. 2, the structure includes 4 types of prosodic constituents: syllable (SYL), prosodic word (PW), prosodic phrase (PPh), and breath group or prosodic phrase group (BG/PG). The prosodic break $B_n$, where the subscript n denotes the syllable index, describes the break type between syllable n and syllable n+1. There are in total seven prosodic break types for describing the boundaries of the 4 types of prosodic constituents. The other prosodic tag, the prosodic state $P = \{p, q, r\}$, represents the aggregated effect on the syllable prosodic acoustic features resulting from the upper-level prosodic constituents PW, PPh and BG/PG.

(22) Hierarchical Prosodic Module
$$P(X\mid B,P,L)\,P(Y,Z\mid B,L)\,P(P\mid B)\,P(B\mid L)$$

(23) The hierarchical prosodic module is realized with 4 sub-models: the syllable prosodic-acoustic model $P(X\mid B,P,L)$, the syllable juncture prosodic-acoustic model $P(Y,Z\mid B,L)$, the prosodic state model $P(P\mid B)$ and the break-syntax model $P(B\mid L)$.

(24) The syllable prosodic-acoustic model P(X|B,P,L) can be approximated with the following sub-models:

(25)
$$
P(X\mid B,P,L) \approx P(sp\mid B,p,t)\,P(sd\mid B,q,t,s)\,P(se\mid B,r,t,f) \approx \prod_{n=1}^{N} P(sp_n\mid B_{n-1}^{n}, p_n, t_{n-1}^{n+1})\,P(sd_n\mid q_n, s_n, t_n)\,P(se_n\mid r_n, f_n, t_n)
$$
wherein $P(sp_n\mid B_{n-1}^{n}, p_n, t_{n-1}^{n+1})$, $P(sd_n\mid q_n, s_n, t_n)$ and $P(se_n\mid r_n, f_n, t_n)$ respectively denote the pitch contour model, the duration model and the energy level model of the n-th syllable; $t_n$, $s_n$ and $f_n$ respectively denote the tone, base-syllable and final types of the n-th syllable; and $B_{n-1}^{n} = (B_{n-1}, B_n)$ and $t_{n-1}^{n+1} = (t_{n-1}, t_n, t_{n+1})$ respectively denote the surrounding prosodic break sequence and tone sequence.

(26) In this embodiment, the three sub-models take more factors into account. Those factors are combined by means of superimposing. Taking the pitch contour of the n-th syllable for example, one may obtain the formula:
$$sp_n = sp_n^r + \beta_{t_n} + \beta_{p_n} + \beta^f_{B_{n-1},tp_{n-1}} + \beta^b_{B_n,tp_n} + \mu_{sp}$$
where $sp_n = [\alpha_{0,n}, \alpha_{1,n}, \alpha_{2,n}, \alpha_{3,n}]$ is a four-dimensional vector representing the pitch contour observed from the n-th syllable. The coefficients can be derived from:

(27)
$$\alpha_{j,n} = \frac{1}{M_n+1} \sum_{i=0}^{M_n} F_n(i)\,\phi_j\!\left(\frac{i}{M_n}\right), \quad j = 0,\ldots,3$$
(28) where $F_n(i)$ is the pitch of the i-th frame of the n-th syllable, $M_n+1$ is the number of frames of the n-th syllable having pitch, and $\phi_j(i/M_n)$ is the j-th orthogonal basis function.
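As an illustration of the coefficient extraction above, the sketch below projects a syllable's frame-pitch sequence onto a discrete orthonormal polynomial basis and expands it back. The text does not specify the basis functions $\phi_j$; here they are assumed, for illustration, to be Legendre-like polynomials orthonormalized under the averaging inner product of Eq. (27), so the round trip is exact for contours lying in the basis span:

```python
import numpy as np

def orthonormal_poly_basis(M, J=4):
    """Discrete polynomial basis phi_j(i/M), j = 0..J-1, built by Gram-Schmidt
    so that (1/(M+1)) * sum_i phi_j(i) * phi_k(i) = delta_jk (assumed basis)."""
    x = np.arange(M + 1) / M                 # normalized frame positions i/M
    V = np.vander(x, J, increasing=True).T   # rows: 1, x, x^2, x^3
    basis = []
    for v in V:
        for b in basis:
            v = v - (v @ b) / (M + 1) * b    # remove projection onto earlier vectors
        v = v / np.sqrt((v @ v) / (M + 1))   # normalize under the averaging inner product
        basis.append(v)
    return np.array(basis)

def pitch_to_coeffs(F, J=4):
    """alpha_j = (1/(M+1)) * sum_i F(i) * phi_j(i/M), as in Eq. (27)."""
    M = len(F) - 1
    phi = orthonormal_poly_basis(M, J)
    return phi @ np.asarray(F) / (M + 1)

def coeffs_to_pitch(alpha, M):
    """Synthesis side: F'(i) = sum_j alpha_j * phi_j(i/M)."""
    phi = orthonormal_poly_basis(M, len(alpha))
    return np.asarray(alpha) @ phi
```

For a real syllable, `F` would be the pitch values of the voiced frames; with this basis, $\alpha_{0,n}$ equals the mean pitch level of the syllable.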

(29) $sp_n^r$ is the modeling residual of $sp_n$; $\beta_{t_n}$ and $\beta_{p_n}$ are the affecting factors of tone and prosodic state, respectively; $\beta^f_{B_{n-1},tp_{n-1}}$ and $\beta^b_{B_n,tp_n}$ are the forward and backward coarticulation affecting factors, respectively; and $\mu_{sp}$ is the global mean of the pitch vector. Assuming $sp_n^r$ is zero-mean and normally distributed, the data can be expressed with a Gaussian distribution:
$$P(sp_n\mid B_{n-1}^{n}, p_n, t_{n-1}^{n+1}) = N(sp_n;\ \beta_{t_n} + \beta_{p_n} + \beta^f_{B_{n-1},tp_{n-1}} + \beta^b_{B_n,tp_n} + \mu_{sp},\ R_{sp})$$
It is noted that $sp_n^r$ is a noise-like residual signal of very small deviation, so the data can be modeled with a normal distribution. Likewise, the syllable duration model $P(sd_n\mid q_n, s_n, t_n)$ and the syllable energy level model $P(se_n\mid r_n, f_n, t_n)$ can be expressed as follows:
$$P(sd_n\mid q_n, s_n, t_n) = N(sd_n;\ \gamma_{t_n} + \gamma_{s_n} + \gamma_{q_n} + \mu_{sd},\ R_{sd})$$
$$P(se_n\mid r_n, f_n, t_n) = N(se_n;\ \omega_{t_n} + \omega_{f_n} + \omega_{r_n} + \mu_{se},\ R_{se})$$

(30) where $sd_n$ and $se_n$ are the observed duration and energy level of the n-th syllable, respectively, and $\gamma_x$ and $\omega_x$ respectively represent the affecting factors of syllable duration and syllable energy level associated with factor x.

(31) The syllable-juncture prosodic-acoustic model $P(Y,Z\mid B,L)$ describes the inter-syllable acoustic characteristics specific to different break types and surrounding linguistic features, and can be approximated with the following 5 sub-models:

(32)
$$
P(Y,Z\mid B,L) \approx P(pd, ed, pj, dl, df\mid B,L) \approx \prod_{n=1}^{N-1} P(pd_n, ed_n, pj_n, dl_n, df_n\mid B,L)
$$
$$
\approx \prod_{n=1}^{N-1} \Big\{ g(pd_n;\ \alpha_{B_n,L_n}, \eta_{B_n,L_n})\, N(ed_n;\ \mu_{ed,B_n,L_n}, \sigma^2_{ed,B_n,L_n})\, N(pj_n;\ \mu_{pj,B_n,L_n}, \sigma^2_{pj,B_n,L_n})\, N(dl_n;\ \mu_{dl,B_n,L_n}, \sigma^2_{dl,B_n,L_n})\, N(df_n;\ \mu_{df,B_n,L_n}, \sigma^2_{df,B_n,L_n}) \Big\}
$$

(33) The aforementioned formulas describe the pause duration $pd_n$, the energy-dip level $ed_n$, the normalized pitch jump $pj_n$, and two normalized syllable lengthening factors ($dl_n$ and $df_n$) across the n-th syllable juncture.

(34) The prosodic state model $P(P\mid B)$ is modeled by three sub-models:

(35)
$$
P(P\mid B) = P(p\mid B)\,P(q\mid B)\,P(r\mid B) \approx P(p_1)\,P(q_1)\,P(r_1) \prod_{n=2}^{N} P(p_n\mid p_{n-1}, B_{n-1})\,P(q_n\mid q_{n-1}, B_{n-1})\,P(r_n\mid r_{n-1}, B_{n-1})
$$

(36) The break-syntax model P(B|L) can be described as follows:

(37)
$$P(B\mid L) \approx \prod_{n=1}^{N-1} P(B_n\mid L_n)$$
where $P(B_n\mid L_n)$ is the break type model for the n-th juncture, and $L_n$ denotes the linguistic feature of the n-th syllable.

(38) The probabilities can be estimated by many methods; the present embodiment uses a decision-tree algorithm for the estimation. A sequential optimization algorithm is used to train the prosodic models, and the maximum likelihood criterion is used to generate the prosodic tags.

(39) Prosodic Structure Analysis Unit

(40) The prosodic structure analysis unit labels the hierarchical prosodic structure of the input speech, that is, it looks for the best prosodic tag $T = \{B, P\}$ based on the prosodic-acoustic feature vector sequence (A) and the linguistic feature sequence (L). The formula is:

(41)
$$T^* = \{B^*, P^*\} = \arg\max_{B,P} Q$$
where $Q = P(B\mid L)\,P(P\mid B)\,P(X\mid B,P,L)\,P(Y,Z\mid B,L)$.

(42) The prosodic structure analysis unit obtains the best solution through the iterative method set forth below:

(43) (1) Initialization: For i=0, the best prosodic break type sequence can be found by:

(44)
$$B^i = \arg\max_{B} P(Y,Z\mid B,L)\,P(B\mid L)$$
(2) Iteration: Obtain the prosodic break type sequence and the prosodic state sequence by iterating the following three steps:
Step 1: Given $B^{i-1}$, re-label the prosodic state sequence of each utterance by the Viterbi algorithm so as to maximize the value of Q:

(45)
$$P^i = \arg\max_{P} P(X\mid B^{i-1}, P, L)\,P(Y,Z\mid B^{i-1}, L)\,P(P\mid B^{i-1})\,P(B^{i-1}\mid L)$$
Step 2: Given $P^i$, re-label the break type sequence of each utterance by the Viterbi algorithm so as to maximize the value of Q:

(46)
$$B^i = \arg\max_{B} P(X\mid B, P^i, L)\,P(Y,Z\mid B, L)\,P(P^i\mid B)\,P(B\mid L)$$
Step 3: If the value of Q has converged, exit the iteration process. Otherwise, increase the value of i by 1 and then go back to Step 1.
(3) Termination: The best prosodic tags are $B^* = B^i$ and $P^* = P^i$.
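The initialization/iteration/termination procedure above can be sketched as a coordinate ascent on Q. In the toy sketch below the four sub-model scores are random stand-ins (not the trained models), breaks are simplified to one per syllable rather than per juncture, and the state variables p, q, r are collapsed into a single state; it illustrates only the alternating maximization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, NB, NP = 6, 2, 3                      # syllables, break types, state values

# Random stand-ins for the four sub-model log scores.
logP_X  = rng.normal(size=(N, NB, NP))   # log P(X_n | B_n, p_n)
logP_YZ = rng.normal(size=(N, NB))       # log P(Y_n, Z_n | B_n)
logP_BL = rng.normal(size=(N, NB))       # log P(B_n | L_n)
logP_tr = rng.normal(size=(NB, NP, NP))  # log P(p_n | p_{n-1}, B_{n-1})
logP_p1 = rng.normal(size=NP)            # log P(p_1)

def Q(B, P):
    """Total log score of a labeling (B, P)."""
    q = logP_p1[P[0]]
    for n in range(N):
        q += logP_X[n, B[n], P[n]] + logP_YZ[n, B[n]] + logP_BL[n, B[n]]
        if n > 0:
            q += logP_tr[B[n-1], P[n-1], P[n]]
    return q

def viterbi_states(B):
    """Step 1: best state sequence P given breaks B, by the Viterbi algorithm."""
    delta = logP_p1 + logP_X[0, B[0]]
    back = np.zeros((N, NP), dtype=int)
    for n in range(1, N):
        cand = delta[:, None] + logP_tr[B[n-1]]          # [p_prev, p_cur]
        back[n] = np.argmax(cand, axis=0)
        delta = cand[back[n], np.arange(NP)] + logP_X[n, B[n]]
    P = np.zeros(N, dtype=int)
    P[-1] = int(np.argmax(delta))
    for n in range(N - 1, 0, -1):
        P[n-1] = back[n, P[n]]
    return P

def best_breaks(P):
    """Step 2: best break at each position given states P (separable here)."""
    B = np.zeros(N, dtype=int)
    for n in range(N):
        s = logP_X[n, :, P[n]] + logP_YZ[n] + logP_BL[n]
        if n < N - 1:
            s = s + logP_tr[:, P[n], P[n+1]]
        B[n] = int(np.argmax(s))
    return B

# Initialization (i = 0): breaks from juncture and break-syntax terms only.
B = np.argmax(logP_YZ + logP_BL, axis=1)
q_prev = -np.inf
for _ in range(20):
    P = viterbi_states(B)                # Step 1
    B = best_breaks(P)                   # Step 2
    q = Q(B, P)
    if q <= q_prev + 1e-12:              # Step 3: convergence check
        break
    q_prev = q
B_star, P_star = B, P
```

Because Step 1 maximizes Q over P exactly given B, and Step 2 over B given P, the value of Q is non-decreasing across iterations, which guarantees convergence of the loop.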

(47) Coding the Prosodic Messages

(48) It is appreciated from the hierarchical prosodic module 102 that the syllable pitch contour $sp_n$, the syllable duration $sd_n$ and the syllable energy level $se_n$ are linear combinations of multiple factors, which include low-level linguistic features such as the tone $t_n$, the base-syllable type $s_n$ and the final type $f_n$. The others are prosodic tags indicating the hierarchical prosodic structure (obtained by the prosodic structure analysis unit 103): the prosodic break-type tag $B_n$ and the prosodic state tags $p_n$, $q_n$ and $r_n$. Thus, the syllable pitch contour, the syllable duration and the syllable energy level can be obtained by simply coding and transmitting these factors. The following formulas are applied by the prosodic feature synthesizer unit 106 to reconstruct the three prosodic acoustic features from these factors:
$$sp_n' = \beta_{t_n} + \beta_{p_n} + \beta^f_{B_{n-1},tp_{n-1}} + \beta^b_{B_n,tp_n} + \mu_{sp}$$
$$sd_n' = \gamma_{t_n} + \gamma_{s_n} + \gamma_{q_n} + \mu_{sd}$$
$$se_n' = \omega_{t_n} + \omega_{f_n} + \omega_{r_n} + \mu_{se}$$
Notably, the three modeling residuals $sp_n^r$, $sd_n^r$ and $se_n^r$ may be neglected because their variances are all small. The three means $\mu_{sp}$, $\mu_{sd}$ and $\mu_{se}$ are sent in advance to the decoder as side information.
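The reconstruction formulas above are plain superpositions of affecting patterns (APs). A minimal decoder-side sketch follows; the AP tables are made-up illustrations, whereas in the codec they are trained offline and transmitted once as side information:

```python
import numpy as np

# Made-up AP tables (decoder side information in the real codec).
beta_t = {1: np.array([0.10, 0.02, -0.01, 0.00]),    # tone APs (4-dim pitch vectors)
          2: np.array([-0.05, 0.08, 0.00, 0.01])}
beta_p = {0: np.array([-0.30, 0.00, 0.00, 0.00]),    # pitch prosodic-state APs
          1: np.array([0.25, 0.01, 0.00, 0.00])}
mu_sp = np.array([5.30, 0.00, -0.02, 0.00])          # global pitch-vector mean

gamma_t = {1: 8.0, 2: -3.0}                          # duration APs (ms)
gamma_s = {7: 12.0}
gamma_q = {0: -15.0, 1: 10.0}
mu_sd = 200.0                                        # global mean duration (ms)

def reconstruct_sp(t, p, ap_f, ap_b):
    """sp' = beta_t + beta_p + beta^f + beta^b + mu_sp (residual sp^r dropped)."""
    return beta_t[t] + beta_p[p] + ap_f + ap_b + mu_sp

def reconstruct_sd(t, s, q):
    """sd' = gamma_t + gamma_s + gamma_q + mu_sd (residual sd^r dropped)."""
    return gamma_t[t] + gamma_s[s] + gamma_q[q] + mu_sd
```

Only the discrete tags (tone, states, break types) travel in the bit stream; the tables themselves are sent once.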

(49) The pause duration $pd_n$ is modeled by the syllable juncture pause duration sub-model $g(pd_n;\ \alpha_{B_n,L_n}, \eta_{B_n,L_n})$, which describes the variation of the syllable juncture pause duration $pd_n$ under the influence of contextual linguistic features and break type, and is organized into 7 break-type-dependent decision trees (BDTs). For each break type, a decision tree determines the probability density function (pdf) of the syllable juncture pause duration according to the contextual linguistic features. Here, all pdfs are assumed to be Gamma distributed. In this coding scheme, all parameters of the sub-model are trained in advance and sent to the decoder as side information. In the encoder 104, the break type of the current syllable juncture and the leaf node of the corresponding decision tree in which the juncture resides are determined by the prosody analysis operation. Only two symbols, i.e., the break type and the leaf-node index, need to be encoded and sent to the decoder 105. The decoder 105 reconstructs the syllable-juncture pause duration as the mean of the pdf of the leaf node in which the juncture resides. Those distributions are the side information used for transmitting the inter-syllable pause duration. Thus, the pause duration between syllables can be represented by merely the leaf-node index and the prosodic break type $B_n$. Notably, the leaf-node index corresponding to each syllable can be obtained from the prosodic structure analysis unit 103, while the syllable-juncture pause duration can be reconstructed in the prosodic feature synthesizer unit 106 by looking up the BDT for the corresponding mean $\mu_{T_n}^{pd}$, based on the leaf-node index and the prosodic break type information.
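The decoder side of this scheme reduces to a table lookup keyed by the two transmitted symbols. The break types and leaf means below are hypothetical placeholders, not trained values:

```python
# Hypothetical side information: per break type, the mean pause duration (ms)
# of each leaf node of its break-type-dependent decision tree (BDT). In the
# codec these are the means of the trained leaf-node Gamma pdfs.
bdt_leaf_mean_ms = {
    "B2-1": [40.0, 75.0, 110.0],
    "B3":   [180.0, 260.0, 350.0, 480.0, 620.0],
}

def decode_pause_duration(break_type, leaf_index):
    """Reconstruct the juncture pause as the mean of the leaf-node pdf;
    only (break_type, leaf_index) is transmitted in the bit stream."""
    return bdt_leaf_mean_ms[break_type][leaf_index]
```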

(50) In summary, the symbols to be encoded by the encoder 104 include: the tone $t_n$, the base-syllable type $s_n$, the final type $f_n$, the break type tag $B_n$, the three prosodic-state tags $(p_n, q_n, r_n)$ and the index of the occupied leaf node $T_n$ in the corresponding BDT. The encoder 104 encodes each of the aforementioned symbol types with a different bit length, and eventually composes bit streams which are sent to the decoder 105 to be decoded and then transmitted to the prosodic feature synthesizer unit 106 to be reconstructed into prosodic messages for speech synthesis by the speech synthesizer 107. Aside from the bit streams, some features of the hierarchical prosodic module 102 are regarded as side information for restoring the prosodic features, including the affecting patterns (APs) $\{\beta_t, \beta_p, \beta^f_{B,tp}, \beta^b_{B,tp}, \mu_{sp}\}$ of the syllable pitch-contour sub-model, the APs $\{\gamma_t, \gamma_s, \gamma_q, \mu_{sd}\}$ of the syllable duration sub-model, the APs $\{\omega_t, \omega_f, \omega_r, \mu_{se}\}$ of the syllable energy level sub-model, and the means $\{\mu_{T_n}^{pd}\}$ of the leaf-node pdfs of the syllable juncture pause duration sub-model.
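A sketch of the fixed-width symbol packing implied by Table 2, using the speaker-independent widths and simplifying the break-type-dependent leaf-index widths to a single 3-bit field; the symbol names are illustrative, not from the text:

```python
# Per-symbol bit widths (speaker-independent case of Table 2; the leaf-node
# index width actually varies with break type, simplified here to 3 bits).
BITS = [("tone", 3), ("base_syllable", 9), ("pitch_state", 4),
        ("dur_state", 4), ("energy_state", 4), ("break", 3), ("bdt_leaf", 3)]

def encode_syllable(symbols):
    """Pack one syllable's symbols MSB-first into a bit string."""
    out = ""
    for name, width in BITS:
        value = symbols[name]
        assert 0 <= value < (1 << width), f"{name} out of range"
        out += format(value, f"0{width}b")
    return out

def decode_syllable(bits):
    """Inverse of encode_syllable: slice the stream back into symbols."""
    symbols, pos = {}, 0
    for name, width in BITS:
        symbols[name] = int(bits[pos:pos + width], 2)
        pos += width
    return symbols
```

The widths sum to 30 bits per syllable, matching the speaker-independent maximum given in Table 2.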

(51) Speech Synthesis

(52) The task of the speech synthesizer 107 is to synthesize speech with HMM-based speech synthesis technology based on the base-syllable type, the syllable pitch contour, the syllable duration, the syllable energy level and the pause duration between syllables. HMM-based speech synthesis is a technology known to the person skilled in the art.

(53) FIG. 3 shows a schematic diagram of generating a synthesized speech with an HMM-based speech synthesizer. First, the state durations for each syllable segment are generated by the HMM state duration and voiced/unvoiced generator 303 with the HMM state duration model 301:
$$d_{n,c} = \mu_{n,c} + \rho\,\sigma_{n,c}^2, \quad c = 1,\ldots,C$$
wherein $\mu_{n,c}$ and $\sigma_{n,c}^2$ respectively denote the mean and the variance of the Gaussian model for the c-th HMM state of the n-th syllable, and $\rho$ is an elongation coefficient, which can be obtained from the following formula:

(54)
$$\rho = \left( sd_n' - \sum_{c=1}^{C} \mu_{n,c} \right) \Big/ \left( \sum_{c=1}^{C} \sigma_{n,c}^2 \right)$$
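The elongation adjustment can be sketched numerically as follows; the per-state means and variances are made-up values, and the target is the reconstructed syllable duration $sd_n'$:

```python
import numpy as np

def state_durations(mu, var, sd_target):
    """d_c = mu_c + rho * var_c, with the elongation coefficient rho chosen
    so that the state durations sum exactly to the target syllable duration."""
    rho = (sd_target - mu.sum()) / var.sum()
    return mu + rho * var

# Illustrative 3-state syllable model (means and variances in ms, made up).
mu  = np.array([30.0, 55.0, 40.0])
var = np.array([16.0, 36.0, 25.0])
d = state_durations(mu, var, sd_target=140.0)  # stretch 125 ms up to 140 ms
```

States with larger variance absorb more of the stretch, which is the point of weighting the adjustment by $\sigma_{n,c}^2$.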

(55) Notably, $sd_n'$ denotes the syllable duration reconstructed by the prosodic feature synthesizer unit 106. Once the voiced/unvoiced state of each HMM state is determined, the HMM state voiced/unvoiced model 302 and the HMM state duration model 301 together can be used to obtain the duration of voiced sound within a syllable, that is, the number of frames $M_n'+1$. Further, the syllable pitch contour can be reconstructed at the logarithm pitch contour and excitation signal generator 306 based on the following formula:

(56)
$$F_n'(i) = \sum_{j=0}^{3} \alpha_{j,n}'\,\phi_j\!\left(\frac{i}{M_n'}\right), \quad i = 0,\ldots,M_n'$$
wherein $\alpha_{j,n}'$ denotes the j-th dimension of the syllable pitch contour vector reconstructed by the prosodic feature synthesizer unit 106, i.e.:
$$sp_n' = [\alpha_{0,n}', \alpha_{1,n}', \alpha_{2,n}', \alpha_{3,n}']$$

(57) Afterwards, the excitation signal required by the MLSA synthesis filter 307 can be generated from the reconstructed logarithm pitch contour. On the other hand, the spectrum information of each frame is the MGC parameter generated by the frame MGC generator 305 using the HMM acoustic model 304, given the HMM state duration, the voiced/unvoiced information, the break type, the prosodic-state tag, the base-syllable type and the syllable energy level. The energy level of each syllable is adjusted to the level reconstructed by the prosodic feature synthesizer unit 106. Finally, the excitation signal and the MGC parameters of each frame are input into the MLSA filter 307 so as to synthesize speech.

(58) Experimental Results

(59) Table 1 shows the key statistics of the experimental corpora, which include two major portions: (1) the single-speaker Treebank corpus; and (2) the multi-speaker Mandarin Chinese continuous speech database TCC300, which are respectively used for evaluating the coding performance of the speaker-dependent and the speaker-independent embodiments illustrated in FIG. 1.

(60) TABLE 1

| Corpus | Subset | Usage | No. of Speakers | No. of Utterances | No. of Syllables | Length (Hours) |
|---|---|---|---|---|---|---|
| Treebank | TrainTB | Training of the hierarchical prosodic module, the acoustic model for forced alignment, and the models for the HMM-based speech synthesizer | 1 | 376 | 51,868 | 3.9 |
| Treebank | TestTB | Evaluation of prosodic coding | 1 | 44 | 3,898 | 0.3 |
| TCC300 | TrainTC1 | Training of acoustic models for forced alignment | 274 | 8,036 | 300,728 | 23.9 |
| TCC300 | TrainTC2 | Training of the hierarchical prosodic module | 164 | 962 | 106,955 | 8.3 |
| TCC300 | TestTC | Evaluation of prosodic coding | 19 | 226 | 26,357 | 2.4 |

(61) Table 2 shows the codeword length required by each encoding symbol.

(62) TABLE 2

| Symbol | Symbol Count | Bit Count |
|---|---|---|
| Tone $t_n$ | 5 | 3 |
| Base-syllable type $s_n$ | 411 | 9 |
| Syllable pitch prosodic state $p_n$ | 16 | 4 |
| Syllable duration prosodic state $q_n$ | 16 | 4 |
| Syllable energy prosodic state $r_n$ | 16 | 4 |
| Prosodic break $B_n$ | 7 | 3 |
| BDT leaf node, B0/B1/B2-1/B2-2/B2-3/B3/B4 | 5/7/3/2/4/3/1 (SI); 3/9/3/9/5/11/9 (SD) | 3/3/2/1/2/2/0 (SI); 2/4/2/4/3/4/4 (SD) |
| Total bit count per syllable (maximum) | | 30 (SI); 31 (SD) |

(63) Table 3 displays the parameter count for the side information.

(64) TABLE 3

| Type of Parameters | Parameter Count |
|---|---|
| Tone affecting parameters $\beta_t/\gamma_t/\omega_t$ | 20/5/5 |
| Forward and backward coarticulation affecting parameters $\beta^f_{B,tp}/\beta^b_{B,tp}$ | 720/720 |
| Prosodic state affecting parameters $\beta_p/\gamma_q/\omega_r$ | 16/16/16 |
| Whole-corpus averages $\mu_{sp}/\mu_{sd}/\mu_{se}$ | 1/1/1 |
| Base-syllable type and syllable final type affecting parameters $\gamma_s/\omega_f$ | 411/40 |
| Average BDT leaf-node pause durations $\mu_{T_n}^{pd}$ | 25 (SI)/49 (SD) |
| Total | 1997 (SI)/2021 (SD) |

(65) Table 4 shows the root-mean-square errors (RMSE) of the prosodic features reconstructed by the prosodic feature synthesizer unit 106. It is appreciated from Table 4 that those errors are relatively small.

(66) TABLE 4

| Corpus | Subset | Syllable Pitch Contour (Hz/semitone) | Syllable Duration (ms) | Syllable Energy Level (dB) | Pause Duration (ms) |
|---|---|---|---|---|---|
| Treebank | TrainTB | 16.2/1.42 | 4.81 | 0.68 | 38.7 |
| Treebank | TestTB | 15.7/1.22 | 4.74 | 0.70 | 30.9 |
| TCC300 | TrainTC2 | 12.1/1.26 | 8.54 | 1.05 | 46.9 |
| TCC300 | TestTC | 11.7/1.13 | 12.49 | 1.86 | 63.0 |

(67) Table 5 shows the bit rate performance of the present invention. The average speaker-dependent and speaker-independent transmission bit rates are 114.9±4.78 bits per second and 114.9±14.9 bits per second respectively; both are very low. FIGS. 4A and 4B illustrate examples of speaker-dependent (401, 402, 403 and 404) and speaker-independent (405, 406, 407 and 408) prosodic features respectively, including both original and reconstructed ones. Those features include the speaker-dependent syllable pitch level 401, syllable duration 402, syllable energy level 403 and syllable-juncture pause duration 404 (without B0 and B1 for conciseness), and the speaker-independent syllable pitch level 405, syllable duration 406, syllable energy level 407 and syllable-juncture pause duration 408. From FIGS. 4A and 4B, it is appreciated that the reconstructed prosodic features are very close to the original ones.

(68) TABLE 5 (transmission bit rate, in bits per second)

| Corpus | Average ± Std. Deviation | Maximum | Minimum |
|---|---|---|---|
| Treebank, TrainTB | 116 ± 5.25 | 131.5 | 91.5 |
| Treebank, TestTB | 114.9 ± 4.78 | 124.1 | 99.1 |
| TCC300, TrainTC2 | 113.3 ± 9.2 | 138.0 | 66.1 |
| TCC300, TestTC | 114.9 ± 14.9 | 158.8 | 84.7 |
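The statistics in Table 5 can be derived from per-utterance bit counts and durations. A sketch of how such statistics could be computed; the sample tuples below are illustrative numbers, not measurements from the patent's corpora:

```python
def bitrate_stats(utterances):
    """Average, std. deviation, maximum and minimum transmission bit rate
    (bits per second) over utterances given as (bit_count, duration_s)."""
    rates = [bits / dur for bits, dur in utterances]
    avg = sum(rates) / len(rates)
    var = sum((r - avg) ** 2 for r in rates) / len(rates)  # population variance
    return avg, var ** 0.5, max(rates), min(rates)

# Hypothetical encoded utterances: (encoded bits, duration in seconds).
sample = [(1150, 10.0), (2300, 20.0), (580, 5.0)]
avg, std, hi, lo = bitrate_stats(sample)
```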

(69) Examples of Speech Rate Conversion

(70) The prosodic encoding method according to the present invention also provides a systematic speech-rate conversion platform. The method includes replacing the hierarchical prosodic module 102 having the original speech rate with another hierarchical prosodic module 102 having a target speech rate in the prosodic feature synthesizer unit 106. The statistical data of the training corpora for the on-site test are shown in Table 6. The speaker-dependent training corpus for the experimental test was recorded at a normal speed. In addition to the normal-speed corpus, corpora of different speech rates, namely a fast-speed corpus and a slow-speed corpus, were prepared, and their corresponding hierarchical prosodic modules can be constructed by the same training method as that for the normal-speed one. FIG. 5A illustrates the waveform 501 and pitch contour 502 of the original speech. FIG. 5B illustrates the waveform 505 and pitch contour 506 of the prosodic information after encoding and synthesizing. FIG. 5C illustrates the waveform 509 and pitch contour 510 of speech whose rate is converted to a faster one. FIG. 5D illustrates the waveform 513 and pitch contour 514 of speech whose rate is converted to a slower one. The straight-line portions in FIGS. 5A-5D indicate the positions of syllable segmentation (shown with the Mandarin Chinese pronunciations 503, 507, 511 and 515) and the syllable segmentation time information 504, 508, 512 and 516. As can be appreciated from FIGS. 5A-5D, there are significant differences in syllable duration and pause duration among the normal-speed, faster-speed and slower-speed speeches. In an informal listening test, the prosody of the synthesized speech at the different speech rates sounds fluent and natural.

(71) TABLE 6

| Corpus | No. of Utterances | Syllable Count | Total Length (hours) | Articulation Rate = (Syllable Count)/(Total Syllable Duration in s) | Speech Rate = (Syllable Count)/(Total Length of Utterances in s) |
|---|---|---|---|---|---|
| FastTB | 368 | 50,691 | 3.4 | 5.52 | 4.40 |
| TrainTB | 376 | 51,868 | 3.9 | 5.05 | 3.82 |
| TestTB | 44 | 3,895 | 0.3 | 4.89 | 3.78 |
| SlowTB | 372 | 51,231 | 6.0 | 3.78 | 2.46 |
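The two rate definitions in the Table 6 header can be stated directly as code. A sketch with a hypothetical utterance; the function names and numbers are assumptions for illustration:

```python
def articulation_rate(syllable_count, total_syllable_seconds):
    """Syllables per second of summed syllable duration (pauses excluded),
    per the Table 6 definition."""
    return syllable_count / total_syllable_seconds

def speech_rate(syllable_count, total_utterance_seconds):
    """Syllables per second of total utterance length (pauses included),
    per the Table 6 definition."""
    return syllable_count / total_utterance_seconds

# Hypothetical utterance: 20 syllables over 5 s total, of which 4 s is
# actual syllable articulation (1 s of pauses); articulation rate is
# therefore the higher of the two, as in every row of Table 6.
assert articulation_rate(20, 4.0) == 5.0
assert speech_rate(20, 5.0) == 4.0
```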

(72) While the invention has been described in terms of what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention need not be limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, which are to be accorded the broadest interpretation so as to encompass all such modifications and similar structures.

Embodiments

(73) 1. A speech-synthesizing device, comprising:

(74) a hierarchical prosodic module generating at least a first hierarchical prosodic model;

(75) a prosody-analyzing device, receiving a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature, and generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the first hierarchical prosodic model; and

(76) a prosody-synthesizing unit synthesizing a second prosodic feature based on the hierarchical prosodic module, the low-level linguistic feature and the prosodic tag.

(77) 2. A speech-synthesizing device of Embodiment 1, further comprising:

(78) a prosodic feature extractor receiving a speech input and the low-level linguistic feature, segmenting the speech input to form a segmented speech, and generating the first prosodic feature based on the low-level linguistic feature and the segmented speech.

(79) 3. A speech-synthesizing device of Embodiment 2 further comprising a prosody-synthesizing device, wherein the first hierarchical prosodic model is generated based on a first speech speed, on a condition that when the prosody-synthesizing device is going to generate a second speech speed being different from the first speech speed, the first hierarchical prosodic model is replaced with a second hierarchical prosodic model having the second speech speed and the prosody-synthesizing unit changes the second prosodic feature to a third prosodic feature.
4. A speech-synthesizing device of Embodiment 3, wherein the speech-synthesizing device generates a speech synthesis based on the third prosodic feature and the low-level linguistic feature.
5. A speech-synthesizing device of Embodiment 1, further comprising:

(80) an encoder receiving the prosodic tag and the low-level linguistic feature to generate a code stream; and

(81) a decoder receiving the code stream, and restoring the prosodic tag and the low-level linguistic feature.

(82) 6. A speech-synthesizing device of Embodiment 5, wherein the encoder includes a first codebook providing an encoding bit corresponding to the prosodic tag and the low-level linguistic feature so as to generate the code stream, and the decoder includes a second codebook providing the encoding bit to reconstruct the code stream into the prosodic tag and the low-level linguistic feature.
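Embodiment 6 has the encoder and decoder share codebooks mapping between symbols and encoding bits. A hedged sketch using a fixed-length code over an illustrative prosodic-break inventory; the actual codebooks, symbol inventories and bit allocations are assumptions, as this passage does not specify them:

```python
# Minimal sketch of a shared-codebook encoder/decoder for prosodic tags.
# The break-tag inventory and 3-bit fixed-length code are illustrative
# assumptions, not the patent's actual codebooks.

BREAK_TYPES = ["B0", "B1", "B2", "B3", "B4"]  # hypothetical break tags
BITS = 3                                      # fixed-length code width

def encode(tags):
    """Map each break tag to its codebook index, packed as a bit string."""
    return "".join(format(BREAK_TYPES.index(t), f"0{BITS}b") for t in tags)

def decode(stream):
    """Restore the tag sequence from the code stream with the same codebook."""
    return [BREAK_TYPES[int(stream[i:i + BITS], 2)]
            for i in range(0, len(stream), BITS)]

stream = encode(["B2", "B0", "B3"])
assert decode(stream) == ["B2", "B0", "B3"]  # lossless round trip
```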
7. A speech-synthesizing device of Embodiment 5, further comprising:

(83) a prosody-synthesizing device receiving the prosodic tag and the low-level linguistic feature reconstructed by the decoder to generate the second prosodic feature including a syllable pitch contour, a syllable duration, a syllable energy level and an inter-syllable pause duration.

(84) 8. A speech-synthesizing device of Embodiment 7, wherein the second prosodic feature is reconstructed by a superposition module.
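Embodiment 8 reconstructs the second prosodic feature by a superposition module. A hedged sketch of an additive superposition, assuming a feature such as the syllable pitch level is modeled as a corpus-wide mean plus the affecting patterns of the kinds listed in Table 3 (tone, prosodic state, coarticulation); all numeric values below are made up for illustration:

```python
def superpose(mean, *affecting_patterns):
    """Reconstruct a prosodic feature by adding affecting patterns
    (e.g. tone, prosodic state, forward/backward coarticulation)
    to the corpus-wide mean."""
    return mean + sum(affecting_patterns)

# Illustrative syllable pitch level; the components are hypothetical:
pitch = superpose(5.0,   # corpus mean, cf. mu_sp in Table 3
                  0.3,   # tone affecting pattern, cf. beta_t
                  -0.1,  # prosodic-state affecting pattern, cf. beta_p
                  0.05)  # coarticulation affecting pattern
```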

(85) 9. A speech-synthesizing device of Embodiment 7, wherein the inter-syllable pause duration is reconstructed by looking up a codebook.

(86) 10. A prosodic information encoding apparatus, comprising:

(87) a speech segmentation and prosodic feature extracting device receiving a speech input and a low-level linguistic feature to generate a first prosodic feature;

(88) a prosodic structure analysis unit receiving the first prosodic feature, the low-level linguistic feature and a high-level linguistic feature, and generating a prosodic tag based on the first prosodic feature, the low-level linguistic feature and the high-level linguistic feature; and

(89) an encoder receiving the prosodic tag and the low-level linguistic feature to generate a code stream.

(90) 11. A code stream generating apparatus, comprising:

(91) a prosodic feature extractor generating a first prosodic feature;

(92) a hierarchical prosodic module providing a prosodic structure meaning for the first prosodic feature; and

(93) an encoder generating a code stream based on the first prosodic feature having the prosodic structure meaning,

(94) wherein the hierarchical prosodic module has at least two parameters selected from the group consisting of a syllable duration, a pitch contour, a pause timing, a pause frequency, a pause duration and a combination thereof.
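Embodiment 11 enumerates the parameter types the hierarchical prosodic module carries. An illustrative container for them, assuming simple per-syllable lists; the class name, field names and units are hypothetical, not the patent's data layout:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HierarchicalProsodicModule:
    """Illustrative grouping of the parameter types named in Embodiment 11;
    the field names and list-of-floats representation are assumptions."""
    syllable_durations: List[float] = field(default_factory=list)  # ms
    pitch_contour: List[float] = field(default_factory=list)       # semitones
    pause_durations: List[float] = field(default_factory=list)     # ms

m = HierarchicalProsodicModule(syllable_durations=[210.0, 185.0],
                               pause_durations=[120.0])
```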

(95) 12. A method for synthesizing a speech, comprising steps of:

(96) providing a hierarchical prosodic module, a low-level linguistic feature, a high-level linguistic feature and a first prosodic feature;

(97) generating at least a prosodic tag based on the low-level linguistic feature, the high-level linguistic feature, the first prosodic feature and the hierarchical prosodic module; and

(98) outputting the speech according to the prosodic tag.

(99) 13. A method of Embodiment 12, further comprising steps of:

(100) providing an input speech;

(101) segmenting the input speech to generate a segmented input speech;

(102) extracting a prosodic feature from the segmented input speech according to the low-level linguistic feature to generate the first prosodic feature;

(103) analyzing the first prosodic feature to generate the prosodic tag;

(104) encoding the prosodic tag to form a code stream;

(105) decoding the code stream;

(106) synthesizing a second prosodic feature based on the low-level linguistic feature and the prosodic tag; and

(107) outputting the speech based on the low-level linguistic feature and the second prosodic feature.

(108) 14. A prosodic structure analysis unit, comprising:

(109) a first input terminal receiving a first prosodic feature;

(110) a second input terminal receiving a low-level linguistic feature;

(111) a third input terminal receiving a high-level linguistic feature; and

(112) an output terminal, wherein the prosodic structure analysis unit generates a prosodic tag at the output terminal based on the first prosodic feature, the low-level and the high-level linguistic features.

(113) 15. A speech-synthesizing device, comprising:

(114) a decoder receiving a code stream and restoring the code stream to generate a low-level linguistic feature and a prosodic tag;

(115) a hierarchical prosodic module receiving the low-level linguistic feature and the prosodic tag to generate a second prosodic feature; and

(116) a speech synthesizer generating a synthesized speech based on the low-level linguistic feature and the second prosodic feature.

(117) 16. A prosodic structure analysis apparatus, comprising:

(118) a hierarchical prosodic module generating a hierarchical prosodic model; and

(119) a prosodic structure analysis unit receiving a first prosodic feature, a low-level linguistic feature and a high-level linguistic feature, and generating a prosodic tag based on the first prosodic feature, the low-level and the high-level linguistic features and the hierarchical prosodic model.

(120) 17. A prosodic structure analysis apparatus of Embodiment 16, wherein the low-level linguistic feature includes a base-syllable type, a syllable-final type, and a tone type of a language.

(121) 18. A prosodic structure analysis apparatus of Embodiment 16, wherein the high-level linguistic feature includes a word, a part of speech and a punctuation mark.

(122) 19. A prosodic structure analysis apparatus of Embodiment 16, wherein the first prosodic feature includes a syllable pitch contour, a syllable duration, a syllable energy level and a syllable juncture pause duration.

(123) 20. A prosodic structure analysis apparatus of Embodiment 16, wherein the prosodic structure analysis unit performs an optimization algorithm by referring to the low-level linguistic feature and the high-level linguistic feature to generate the prosodic tag.