Methods, apparatus and data structure for cross-language speech adaptation
09798653 · 2017-10-24
CPC classification: G06F12/00 (PHYSICS); G06F40/58 (PHYSICS)
Abstract
Adapted speech models produce fluent synthesized speech in a voice that sounds as if the speaker were fluent in a language in which the speaker is actually non-fluent. A full speech model is obtained based on fluent speech in the language spoken by a first person who is fluent in the language. A limited set of utterances is obtained in the language spoken by a second person who is non-fluent in the language but able to speak the limited set of utterances in the language. The full speech model of the first person is then processed with the limited set of utterances of the second person to produce an adapted speech model. The adapted speech model may be stored in a multi-lingual speech model as a child node of a root with an associated language selection question and branches pointing to the adapted speech model and other speech models, respectively.
Claims
1. A system comprising: data storage for storing: a full speech model based on speech in a language spoken by a first person who is fluent in the language, a limited set of utterances in a fluent language of a second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person, and a full speech model of the second person based on speech by the second person, and a processor configured to implement: a cross-language speech adapter that processes the full speech model based on speech in the language spoken by the first person and the limited set of utterances in the fluent language of the second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person and outputs an adapted speech model, the processing including applying at least one transformation to the full speech model according to the limited set of utterances to produce the adapted speech model, and a tree combination unit, the tree combination unit combining the full speech model of the second person based on speech by the second person and the adapted speech model with Text-to-Speech (TTS) engine files of the adapted speech model and the full speech model of the second person, wherein the transformation includes a plurality of: (1) a constrained maximum likelihood linear regression (CMLLR) transformation, (2) a MLLR adaptation of the mean (MLLRMEAN) transformation, (3) a variance MLLR (MLLRVAR) transformation, and (4) a maximum a posteriori (MAP) linear regression transformation.
2. The system of claim 1 further comprising a text-to-speech (TTS) engine.
3. The system of claim 2 wherein the text-to-speech (TTS) engine outputs fluent synthesized speech.
4. The system of claim 3 wherein the text-to-speech (TTS) engine receives a multi-lingual phoneme stream.
5. The system of claim 4 wherein the multi-lingual phoneme stream was transformed from multi-lingual text by a text processor.
6. A method comprising: receiving at an input interface of a computer system having at least a processor and a memory in addition to the input interface and an output interface, a full speech model based on speech in a language spoken by a first person who is fluent in the language; receiving at the input interface, a limited set of utterances in a fluent language of a second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person; applying, in the computer system, a transformation technique with an adaptation module to the full speech model according to the limited set of utterances to produce a plurality of adapted speech models, wherein a cross-language speech adapter processes the full speech model based on speech in the language spoken by the first person and the limited set of utterances in the fluent language of the second person based on speech spoken by the second person who is non-fluent in the language spoken by the first person and outputs an adapted speech model, the processing including applying at least one transformation to the full speech model according to the limited set of utterances to produce the adapted speech model; and synthesizing, in the computer system, speech using each of the plurality of adapted speech models to generate a plurality of synthesized speech samples, wherein the transformation technique includes a plurality of: (1) a constrained maximum likelihood linear regression (CMLLR) transformation, (2) a MLLR adaptation of the mean (MLLRMEAN) transformation, (3) a variance MLLR (MLLRVAR) transformation, and (4) a maximum a posteriori (MAP) linear regression transformation.
7. The method of claim 6 wherein a plurality of speech samples are presented to the adaptation module for selection of the one of the plurality of transformations that produced a synthesized speech sample having a voice that most closely resembles the voice of the second person and sounds as if the second person were fluent in the language.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The foregoing and other objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
DETAILED DESCRIPTION
(10) In step 204, the cross-language speech adapter 110 processes the full speech model 120 of the first person 125 with the limited set of utterances 130 of the second person 135 to produce the adapted speech model 140 that can be used (e.g., by a TTS engine) to produce fluent synthesized speech in the language 122 in a voice that sounds as if the second person 135 were fluent in the language 122.
(11) In certain embodiments, the limited set of utterances 130 obtained in step 201 includes 30 or fewer utterances spoken by the second person 135. An example limited set of phonetically balanced utterances, as understood in the art for training a speech model, is provided below.
(12) It's very possible that's the spa Oscar likes.
(13) Wealth creation and quality of life go hand in hand.
(14) The survey is required by federal law.
(15) Several hundred yesterday registered to buy one via the Internet.
(16) Prices vary in different locations, because of local economic conditions.
(17) They have no chance to learn.
(18) He hid the shirt under the bed.
(19) This brings the monthly total to seven hundred twenty two.
(20) The coins all had bees carved on them.
(21) This is what they had known was coming.
(22) We hope you enjoy it.
(23) Many fishermen face bankruptcy as a result.
(24) But their impact can't be measured in pure numbers alone.
(25) The location will be online in London.
(26) The fine print of the NBA policy suggests why.
(27) The central idea in the act is balance, not assertion.
(28) Teachers have been appalled by the language, he said.
(29) After all we, as taxpayers, are the client.
(30) Michael was in the middle of the partnership.
(31) She is already working on a new novel, about poker.
(32) Coalition forces reported no casualties overnight.
(33) The chefs all gather their equipment and take a bow.
(34) The officials now acknowledge that those tactics were wrong.
(35) Roger Penrose, at Nuance dot COM.
(36) He also allowed alternative Americans.
(37) Welcome both old friends and newcomers.
(38) Nobody could be that weird.
(39) I was ready to go on reading.
(40) Everyone wants a dog just like her.
(41) And the issue is more complex than it seems.
(42) In an example embodiment, to process the full speech model 120 of the first person 125 with the limited set of utterances 130 of the second person 135, the cross-language speech adapter 110, in step 205, adapts the full speech model 120 with voice features 133 of the limited set of utterances 130 to produce the adapted speech model 140. In certain example embodiments, in step 206, to adapt the full speech model 120 with voice features 133 of the limited set of utterances 130, the cross-language speech adapter 110 produces a voice feature map 137 that describes, for each voice feature 133 of the second person 135 in the limited set of utterances 130 to be applied to the full speech model 120 of the first person 125, an adaptation of the voice feature 133 of the second person 135 in the limited set of utterances 130 to a corresponding voice feature 123 of the first person 125 in the full speech model 120 to produce the adapted speech model 140. In step 207, the cross-language speech adapter 110 may store voice feature entries 138a, 138b in the voice feature map 137 for at least one of spectrum, pitch, maximum voice frequency and delta-delta coefficients.
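The voice feature map described above can be sketched in code. This is a minimal illustration only; the class and function names are hypothetical, and the feature values are toy numbers, not measurements from the patent.

```python
from dataclasses import dataclass

# Hypothetical sketch of the voice feature map 137: each entry pairs a
# voice feature of the second person (from the limited set of utterances)
# with the corresponding feature of the first person (from the full
# speech model), e.g. spectrum, pitch, maximum voice frequency, or
# delta-delta coefficients.

@dataclass
class VoiceFeatureEntry:
    feature_name: str     # e.g. "pitch" or "maximum_voice_frequency"
    source_value: float   # feature measured in the limited utterances
    target_value: float   # corresponding feature in the full speech model

def build_voice_feature_map(source_features, target_features):
    """Pair up features of the second person with those of the first."""
    return [
        VoiceFeatureEntry(name, source_features[name], target_features[name])
        for name in source_features
        if name in target_features
    ]

# Toy values for illustration only.
feature_map = build_voice_feature_map(
    {"pitch": 210.0, "maximum_voice_frequency": 4800.0},
    {"pitch": 120.0, "maximum_voice_frequency": 5200.0},
)
```

Each entry then describes how one voice feature of the second person is adapted toward the corresponding feature in the full speech model.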
(44) Note that the terms “first” and “second” are used to differentiate the speakers of a particular full speech model 320-1-320-N and a speaker of the limited set of utterances 330-1-330-M. However, the speaker of each full speech model 320-1-320-N need not be the same person (i.e., any or all of the full speech models 320-1-320-N may be based on fluent speech spoken by a unique person). Likewise, the number of full speech models 320-1-320-N is not necessarily the same as the number of limited sets of utterances 330-1-330-M (i.e., there may be more or fewer limited sets of utterances 330-1-330-M than the number of full speech models 320-1-320-N).
(45) The system 300 also includes a full speech model of the second person 380 based on speech by the second person 335 (i.e., the person who spoke the limited set of utterances 330-1-330-M). In certain embodiments, the full speech model 380 may be based on fluent speech spoken by the second person (i.e., in the native language of the second person) or an adapted speech model 340-1-340-N. Note, however, that speech synthesized based on the full speech model 320-1-320-N based on fluent speech spoken by the first person 325 will sound like the first person 325. Similarly, speech synthesized based on the full speech model of the second person 380 based on fluent speech spoken by the second person 335 will sound like the second person 335.
(46) The full speech model 320-1-320-N may then be passed to the cross-language speech adapter 310, along with the corresponding limited set of utterances 330-1-330-M (e.g., corresponding to the language 1-N spoken in the full speech model), for processing into the adapted speech model 340-1-340-N for the language 1-N spoken in the full speech model (as will be discussed in greater detail below).
(47) The adapted speech model 340-1, for example, generated and output by the cross-language speech adapter 310, is passed along with the full speech model of the second person 380 to a tree combination unit 385. The tree combination unit 385 combines TTS engine files of the adapted speech model 340-1 and the full speech model 380 of the second person 335 by generating a new root node 365 for a multi-lingual speech model 360 according to language differentiator features 372-1. Branches are then created to each child node 370-0, 370-1. Each non-leaf node (i.e., the root node 365, although other intermediate nodes may be present) in the multi-lingual speech model 360 includes a question. For example, in the example multi-lingual speech model 360 illustrated, the root node 365 question is “Is it language 1?” (i.e., the language modeled by the first adapted speech model 340-1). The two branches (e.g., “yes” and “no”) from the root node 365 allow selection of a child node 370-0, 370-1 according to their answer to the parent language selector node or respective language differentiator feature 372-1, 382.
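The tree combination step described above can be sketched as follows. This is an illustrative sketch, assuming hypothetical node and model names; it shows only the structural idea of a new root with a language selection question and two branches.

```python
# Hypothetical sketch of the tree combination unit 385: a new root node
# holds the language selection question, its "yes" branch points to the
# adapted speech model and its "no" branch to the full speech model of
# the second person.

class Node:
    def __init__(self, question=None, model=None):
        self.question = question   # set on non-leaf nodes
        self.model = model         # set on leaf (child) nodes
        self.yes = None
        self.no = None

def combine_trees(adapted_model, full_model, language_tag):
    """Build a root with a language question and two child nodes."""
    root = Node(question=f"Is it {language_tag}?")
    root.yes = Node(model=adapted_model)   # the language in question
    root.no = Node(model=full_model)       # any other language
    return root

tree = combine_trees("adapted_model_340_1", "full_model_380", "language 1")
```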
(48) It should be understood that the multi-lingual speech model 360 may include a root node, one or more intermediate non-leaf nodes and a plurality of child nodes, with each child node storing one speech model (e.g., full speech model or adapted speech model) according to a respective language differentiator feature assigned to the speech model. For example, in the system 300 illustrated, each adapted speech model 340-1-340-N and the full speech model 380 may be stored at a respective child node according to its assigned language differentiator feature.
(49) Referring back to the system 300, the text processor 395 converts the multi-lingual text 390 into the multi-lingual phoneme stream 397 for the TTS engine 350.
(50) For example, the text processor 395 may parse the multi-lingual text 390 into a phoneme string according to a configuration file that stores language tag information. The text processor 395 is able to identify a language characteristic of each word and assign a respective language tag to each phoneme converted from the word. In operation, the text processor 395 then converts the phoneme string with the language tag into the multilingual phoneme stream 397 with tri-phone context, prosodic features and language differentiator features added. For example, each line of the phoneme stream may correspond to one phoneme with particular prosodic features and language differentiator features.
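The parsing step above can be sketched in code. This is a toy illustration assuming a hypothetical grapheme-to-phoneme lexicon; real text processors use full pronunciation dictionaries and language identification, and also attach tri-phone context and prosodic features.

```python
# Illustrative sketch of the text processor 395: each word of
# multi-lingual text is mapped to phonemes, and every phoneme carries a
# language tag (language differentiator feature). The lexicon below is
# a toy stand-in, not a real pronunciation dictionary.

TOY_LEXICON = {
    "hello": ("en", ["HH", "AH", "L", "OW"]),
    "bonjour": ("fr", ["B", "OH", "N", "ZH", "UU", "R"]),
}

def to_phoneme_stream(text):
    """Convert text into (phoneme, language_tag) pairs, one per phoneme."""
    stream = []
    for word in text.lower().split():
        lang, phonemes = TOY_LEXICON.get(word, ("unknown", []))
        stream.extend((p, lang) for p in phonemes)
    return stream

stream = to_phoneme_stream("hello bonjour")
```

Each line of the resulting stream corresponds to one phoneme tagged with its language, mirroring the per-phoneme language tags described above.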
(51) The TTS engine 350 loads the multi-lingual speech model or tree 360 and receives the multi-lingual phoneme stream 397. In hidden Markov model (HMM)-based TTS, the final speech models are all in binary tree form. Each tree includes a root node, non-leaf nodes and leaf nodes. It should be understood that there may be many layers of non-leaf nodes in the HMM. Each non-leaf node has two branches and two child nodes, and has an attached question regarding the language or a prosodic feature. The decision of which child node is assigned to which branch depends on how the child node answers the question in the parent node. The leaf nodes include real models, such as Gaussian distributions, including means, covariances and weights. During synthesis, a TTS engine 350 searches the model for the proper leaf node for a particular phoneme according to its prosodic features or the answers to the questions attached to the non-leaf nodes.
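The tree traversal just described can be sketched as follows. This is a minimal sketch under assumed names; the questions and the leaf parameters (a toy Gaussian mean and variance) are illustrative, not taken from the patent.

```python
# Minimal sketch of the binary decision tree traversal: each non-leaf
# node asks a yes/no question about a phoneme's features, and traversal
# ends at a leaf holding real model parameters (here, a toy Gaussian
# mean and variance).

def make_leaf(mean, var):
    return {"mean": mean, "var": var}

def make_node(question, yes, no):
    return {"question": question, "yes": yes, "no": no}

def find_leaf(node, features):
    """Walk the tree, answering each node's question, until a leaf."""
    while "question" in node:
        node = node["yes"] if node["question"](features) else node["no"]
    return node

# Root asks the language question; a lower node asks a prosodic question.
tree = make_node(
    lambda f: f["language"] == "language1",
    make_node(lambda f: f["stressed"],
              make_leaf(1.0, 0.1), make_leaf(0.5, 0.2)),
    make_leaf(0.0, 0.3),
)

leaf = find_leaf(tree, {"language": "language1", "stressed": True})
```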
(52) For each phoneme 398 in the multi-lingual phoneme stream 397, the TTS engine 350 searches the multi-lingual speech model 360 for the appropriate speech model 340-1, 380 according to the language differentiator features 399 of the respective phoneme 398 of the multi-lingual phoneme stream 397. For example, for each line of phoneme stream, the TTS engine 350 first finds the language differentiator feature 399 for a phoneme 398. The TTS engine 350 then follows the appropriate branch of the multi-lingual speech model or tree 360 to match other prosodic features within the branch to determine which model or leaf-node is going to be used during speech synthesis.
(53) The language differentiator feature 399 enables the TTS engine 350 to search the multi-lingual speech model 360 for the appropriate language branch of the multi-lingual speech model 360. For example, if the language differentiator feature 399 indicates that the respective phoneme 398 to be synthesized by the TTS engine 350 is in language 1, the branch labeled “yes” is followed from the root node 365 to the child node 370-1 for the adapted speech model 340-1. Likewise, if the language differentiator feature 399 indicates that the respective phoneme 398 to be synthesized by the TTS engine 350 is not in language 1, the branch labeled “no” is followed from the root node 365 to the child node 370-0 for the full speech model 380.
(54) The TTS engine 350 then obtains the selected speech model 340-1, 380 and applies it to respective portions of the text 390 to produce fluent synthesized speech 355. For portions of the text 390 in respective different languages, the TTS engine 350 may obtain a speech model 340-1, 380 in each of the respective languages and apply each respective speech model 340-1, 380 to each portion of the text 390 in the respective language to synthesize a voice that sounds as if the second person 335 were fluent in each of the respective languages. It should be understood that the text processor 395 may assign a plurality of language differentiator features 399 while parsing the text 390, and that selection of a speech model 340-1, 380 may repeatedly switch between the speech models available in the multi-lingual speech model 360.
(56) Following the identification of the corresponding models in the model space of the first person, the unseen model 422-2 is then used to build a regression tree (420) (i.e., a set of well-organized transition matrices from the limited set of utterances spoken by the second person). These regression trees put the HMM in a third form 422-3 and act as a bridge when passed, along with a limited set of utterances (e.g., limited set of utterances 330-1-330-M), to the adaptation module 430.
(57) The adaptation module 430 processes the regression tree 422-3 and the limited set of utterances 330 according to the transformation technique 425 to produce transforms 435. The tree structure of the HMM in the third form 422-3 is not altered during the transformation. The transforms 435 are then passed to a TTS engine converter 440 with the HMM for the full speech model in the third form 422-3 to generate an adapted speech model (e.g., adapted speech model 340-1).
(58) In general, the adaptation technique moves the full speech model 320 from one language space to another according to the limited set of utterances 330. The limited set of utterances 330 has two roles: (1) to determine the base points to be transformed; and (2) to calculate a transition matrix. The adaptation module 430 then moves all models from the language space of the full speech model 320 to the language space of the limited set of utterances 330 via the transition matrix. The adaptation technique 425, the full speech model 320 and the limited set of utterances 330, as inputs to the adaptation module 430, will determine the transfer of the models from the language space of the full speech model 320 to the language space of the limited set of utterances 330. It should be understood that the more utterances or model samples, the more base points the transition matrix can have.
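The core of an MLLR-style mean adaptation is an affine transform of each Gaussian mean. The sketch below shows that single step under stated assumptions: the transition matrix and bias values are toy numbers, not transforms estimated from the limited set of utterances.

```python
# Hedged sketch of the mean transformation underlying MLLR-style
# adaptation: each Gaussian mean of the full speech model is moved into
# the target speaker's space as mu' = A @ mu + b, where A (the
# transition matrix) and b (a bias) would be estimated from the limited
# set of utterances. The values below are toy numbers.

def adapt_mean(mean, A, b):
    """Apply mu' = A @ mu + b, written out without external libraries."""
    return [
        sum(A[i][j] * mean[j] for j in range(len(mean))) + b[i]
        for i in range(len(A))
    ]

A = [[1.0, 0.1], [0.0, 0.9]]   # toy transition matrix
b = [0.5, -0.2]                # toy bias vector
adapted = adapt_mean([2.0, 1.0], A, b)
```

More utterances yield more base points from which A and b can be estimated, as noted above.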
(60) In step 210, in certain embodiments, the adaptation module 430 may apply a plurality of the transformation techniques 425 (i.e., CMLLR, MLLRMEAN, MLLRVAR and MAP) to the full speech model 320 according to the limited set of utterances 330 to produce a plurality of adapted speech models 340. Speech may be synthesized, in step 211, using each of the plurality of adapted speech models 340 to generate a plurality of synthesized speech samples. In step 212, the plurality of speech samples may be presented for selection of the one of the plurality of transformation techniques 425 that produced a synthesized speech sample having a voice that most closely resembles the voice of the second person 335 and sounds as if the second person 335 were fluent in the language.
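Steps 210 through 212 can be sketched as a simple apply-synthesize-select loop. The synthesis and scoring functions here are placeholders (in the patent, selection is by listening to the samples); the quality values are invented for illustration.

```python
# Sketch of steps 210-212: apply each transformation technique, generate
# a synthesized speech sample from each resulting adapted model, and
# select the technique whose sample best matches the target voice. The
# score function stands in for human (or automatic) judgment.

TECHNIQUES = ["CMLLR", "MLLRMEAN", "MLLRVAR", "MAP"]

def select_best(synthesize, score):
    """Return the technique whose synthesized sample scores highest."""
    samples = {t: synthesize(t) for t in TECHNIQUES}
    return max(samples, key=lambda t: score(samples[t]))

# Toy stand-ins: each technique "synthesizes" a similarity value directly.
toy_quality = {"CMLLR": 0.7, "MLLRMEAN": 0.9, "MLLRVAR": 0.6, "MAP": 0.8}
best = select_best(lambda t: toy_quality[t], lambda s: s)
```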
(62) In step 214, the tree combination unit 385 assigns respective language differentiator features (e.g., language differentiator features 372-1, 382) to the full speech model 380 and the adapted speech model 340-1. In step 215, the tree combination unit 385 generates a new root node 365 of the multi-lingual speech model 360, the root node 365 including an associated language selection question. In step 216, the tree combination unit 385 then attaches the full speech model 380 and the adapted speech model 340-1 to respective branches from the root node 365 of the multi-lingual speech model 360 according to respective answers to the associated language selection question determined by the respective language differentiator features 372-1, 382 assigned to the speech model 380 and the adapted speech model 340-1.
(63) While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.