METHOD AND APPARATUS FOR EXEMPLARY MORPHING COMPUTER SYSTEM BACKGROUND

Abstract

Method and apparatus for reducing a size of databases required for recorded speech data.

Claims

1. An exemplary computer system configured to obtain diphones and morph them into a target speakers voice, comprising a first module configured as a speech recognizer, a second module configured as a pitch extractor, a third module configured as a diphone database, and a fourth module configured as a unit selector.

2. The speech recognizer of claim 1, where the speech recognizer is configured to obtain an audio waveform from a first source speaker and converts said audio waveform into a sequence of phonemes.

3. The pitch extractor of claim 1 where the pitch extractor is configured to determine the pitch contour of each diphone.

4. The unit selector of claim 1 where the unit selector is configured to obtain the sequence of phonemes from the speech recognizer, select a first diphone from the speech recognizer, obtain a list of potential candidate matches from the Diphone database, compare the transcription labels of the first diphone with the transcripts labels of each potential candidate diphone.

5. The method of claim 4 of comparing transcription labels by comparing the consonants, determining the consonant distance, assigning a weight, comparing the vowels, determining the vowel distance, assigning a lesser weight.

6. The unit selector of claim 1 where the unit selector is configured compare pitch contours between the first diphone and the potential candidate diphones comprising the method of calculating the difference between the pitch at the beginning and end of the first diphone, calculating the pitch at the beginning and end of each potential candidate diphone, then calculate the difference.

7. The unit selector of claim 1 where the unit selector is configured to compare the speaking rate between the first diphone and the potential candidates diphones comprising the method of calculating the difference between the speaking rate of the first diphone and the speaking rate of each of the potential candidate diphones.

8. The unit selector of claim 1 where the unit selector is configured to match the formants of the surrounding diphones comprising the method of calculating the difference between the formants at the end of the first diphone with the formants at the beginning of each potential candidate diphone.

9. The method of matching formants in claim 8 further comprising the method of calculating the difference between formants for the first three formants of both the first diphone and each candidate diphone.

10. The unit selector of claim 1 where the unit selector is configure to match the pitch of the surrounding diphones comprising the method of calculating the difference between the pitch at the end of the first diphone with the pitch at the beginning of each potential candidate diphone.

11. The unit selector of claim 1 where the unit selector makes a weighted average of the difference between the first diphone and each potential candidate diphone.

12. The method of claim 11 where the label matching is weighted at 42%.

13. The method of claim 11 where the pitch contour match is weighted at 25%.

14. The method of claim 11, where the pitch matching is weighted at 8%

15. The method of claim 11, where the speaking rate matching is weighted at 17%.

16. The method of claim 11, where the formant matching is weighted at 8%.

17. The unit selector of claim 1 where the unit selector re-scores the diphone matches.

18. An exemplary computer system configured to obtain diphones and morph them into a target speakers voice, comprising a first module configured as a pronunciation generator, a second module configured as an intonation generator, a third module configured as a diphone database, and a fourth module configured as a unit selector.

19. The pronunciation generator of claim 18, where the pronunciation generator is configured to convert the written text into the phonetic alphabet.

20. The pitch extractor of claim 18 where the intonation generator is configured to generate the pitch from the typed text.

21. The unit selector of claim 18 where the unit selector is configured to obtain the sequence of phonemes from the pronunciation generator and intonation generator, select a first diphone, obtain a list of potential candidate matches from the Diphone database, and compare the transcription labels of the first diphone with the transcripts labels of each potential candidate diphone.

22. The method of claim 21 of comparing transcription labels by comparing the consonants, determining the consonant distance, assigning a weight, comparing the vowels, determining the vowel distance, assigning a lesser weight.

23. The unit selector of claim 18 where the unit selector is configured compare pitch contours between the first diphone and the potential candidate diphones comprising the method of calculating the difference between the pitch at the beginning and end of the first diphone, calculating the pitch at the beginning and end of each potential candidate diphone, then calculate the difference.

24. The unit selector of claim 18 where the unit selector is configured to compare the speaking rate between the first diphone and the potential candidates diphones comprising the method of calculating the difference between the speaking rate of the first diphone and the speaking rate of each of the potential candidate diphones.

25. The unit selector of claim 18 where the unit selector is configured to match the formants of the surrounding diphones comprising the method of calculating the difference between the formants at the end of the first diphone with the formants at the beginning of each potential candidate diphone.

26. The method of matching formants in claim 8 further comprising the method of calculating the difference between formants for the first three formants of both the first diphone and each candidate diphone.

27. The unit selector of claim 18 where the unit selector is configure to match the pitch of the surrounding diphones comprising the method of calculating the difference between the pitch at the end of the first diphone with the pitch at the beginning of each potential candidate diphone.

28. The unit selector of claim 18 where the unit selector makes a weighted average of the difference between the first diphone and each potential candidate diphone.

29. The method of claim 28 where the label matching is weighted at 42%.

30. The method of claim 28 where the pitch contour match is weighted at 25%.

31. The method of claim 28, where the pitch matching is weighted at 8%

32. The method of claim 28, where the speaking rate matching is weighted at 17%.

33. The method of claim 28, where the formant matching is weighted at 8%.

34. The unit selector of claim 18 where the unit selector re-scores the diphone matches.

Description

DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 illustrates a system level overview of the computer system

[0013] FIG. 2 illustrates a flow diagram of one embodiment of the invention.

[0014] FIG. 3 illustrates a flow diagram of one embodiment of the invention.

[0015] FIG. 4 illustrates a flow diagram of one embodiment of the invention.

[0016] FIG. 5 illustrates a flow diagram of one embodiment of the invention.

[0017] FIG. 6 illustrates a flow diagram of one embodiment of the invention.

[0018] FIG. 7 illustrates a flow diagram of one embodiment of the invention.

[0019] FIG. 8 illustrates a flow diagram of one embodiment of the invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0020] FIG. 1 illustrates a system level overview of one embodiment of the exemplary computer system configured to convert audio or written speech into output audio of a desired voice. In one embodiment of the invention, Source 110 is a audible speech. ASR 130 creates a phoneme list from Source 110's speech and Pitch Extractor 135 extracts the pitch from Source 110's speech.

[0021] In another embodiment of the invention, Source 110 is typed words along with phonetic information. Phonetic Generator 120 converts the written text into the phonetic alphabet. Intonation Generator 125 generates the pitch from the typed text.

[0022] In both embodiments of the invention, Unit Selector 145 compares the generated diphones of Source 110 with the candidate diphones of Diphone Database 150 to select and output the best match.

[0023] FIG. 2 illustrates a flow diagram of one embodiment of the computer system selecting the best match for the subject diphone. At step 210 the computer system compares the phonetic transcription, i.e. the label, of the subject diphone from the original speech to the phonetic transcription of each potential diphone match in the diphone database and determines the quality of the match, i.e. the label match lm. As step 220, the computer system compares the pitch contour (pc) of the subject unit with each of the potential matches to determine how close they are to each other. This difference is delta_pc.

[0024] At step 230 the computer system compares the speaking rate (sr), aka duration, of the phone, to the speaking rates of each of the potential diphone matches. This difference is delta_sr.

[0025] At step 240, the computer system considers the first three formants (fm1, fm2, fm3) of the diphones which surround both the subject diphone as well as each of the potential matches. Specifically, the computer system matches the first 3 formants, i.e. delta_fm1, delta fm2, delta_fm3.

[0026] At step 250, the computer system matches the pitches (p) of the subject diphone with the potential target diphones. The difference between the pitch is delta_p.

[0027] At step 260 the computer system does a weighted average of the quality of the match for each of the five characteristics

[0028] FIG. 3 illustrates a flow diagram of the process of label matching between Source 101's diphones and the target speakers diphones located in Diphone Database 140.

[0029] At step 310, Unit Selector 145 obtains a diphone from either Phonetic Generator 120 or ASR 130. Unit Selector 145 obtains a list of candidate matches to the target speaker's voice from Diphone Database 140 at step 320. Generating this list of candidate matches is well known to someone skilled in the art of speech morphology.

[0030] At Step 330 Unit Selector 145 compares the consonant portions of the original subject diphone with the consonant portion or each potential diphone match. Step 330 assigns one of three weighting number to represent the consonant difference cd; “0”, which means the consonant portions are identical, i.e. there is no phonetic difference between the consonants; “1”, which means the consonant portions are distinct, but in the same phoneme class and “3” or higher, which means the consonant portions are distinct and in different phoneme classes.

[0031] Similarly, at step 340, Unit Selector 145 compares the vowel portions of both Source 101's diphone with the vowel portion or each potential diphone candidate match. Similar to Step 330, Step 340 assigns one of three weighting number to represent the vowel difference vd; “0”, which means the vowel portions are identical, i.e. there is no phonetic difference between the vowel, “½”, which means the vowel portions are distinct, but in the same phoneme class and “1½ ”, which means the consonant portions are distinct and in different phoneme classes. Since vowels are easier to morph than consonants, they are given less weight.

[0032] At step 350, Unit Selector 145 computes the quality of the label matches (lm) between Source 101's diphone and each of potential diphone candidate matches from Diphone Database 140. The label match weighting factor lm equals the sum of the consonant distance cd and the vowel distance vd.

lm=cd+vd ####EQ001###

[0033] At step 360, lm is normalized. In the specific embodiment, the normalization factor is 150, to ensure that lm is in the single digits.

[0034] FIG. 4 illustrates a flow diagram of Unit Selector 145 comparing the pitch contour pc of Source 101's subject diphone with the pitch contours of each of the potential target diphones from Diphone Database 140.

[0035] At step 410, Unit Selector 145 measures the pitch at the beginning and end of the source speaker's diphone and obtains the difference, i.e. the delta_pitch_source. At step 420, Unit Selector 145 measures the pitch at the beginning and end of each of the potential target diphones and obtains the difference for each diphone, i.e. delta_pitch_target.

[0036] At step 430, Unit Selector 145 computes the difference between the delta pitch of Source 101's diphone to the delta pitch of each of the target to obtain the delta pitch contour between the source speaker's diphone and each of the potential diphone matches for the target speaker.

delta_pitch=delta_pitch_target−delta_pitch_source ####EQ0002####

[0037] At step 440, the difference is normalized to be on the same order as the label match weighting factor, i.e. between “0” and “1”. In the current embodiment the normalization factor is 50.

[0038] FIG. 5 illustrates a flow diagram of Unit Selector 145 matching the pitches between the source speaker's diphone and each of the potential diphones of the target speaker.

[0039] At step 510, Unit Selector 145 measures the pitch of the end of the preceding diphone in the output speech. At step 520, Unit Selector 145 measures the pitch of each potential diphone match.

[0040] At step 530, Unit Selector 140 determines the absolute value of the difference between the pitch at the end of the preceding diphone in the output speech and the pitch at the beginning of each of the potential output diphones. At step 540, the difference is normalized to be on the same order as the label match weighting factor lm and the pitch contour weighting factor pc. In the specific embodiment the normalization factor is 150.

[0041] FIG. 6 illustrates a flow diagram of Unit Selector 145 matching the first three formants between the source speaker's diphone with the first three formants of each of potential diphones candidates of the target speaker.

[0042] At step 610, Unit Selector 145 measures the first three formants of the end of the preceding diphone in the output speech. At step 620, Unit Selector 145 measures the first three formants of each potential diphone match.

[0043] At step 630, Unit Selector 145 determines the difference between each of the first three formants at the end of the preceding diphone in the output speech and the first three formants at the beginning of each of the potential output diphones, i.e. delta_fm. At step 640, this difference is normalized.

[0044] At step 710 Unit Selector 145 obtains measure the durations of both the diphone from Source 101 and the candidate target diphones at step 720. At step 730 Unit Selector 145 calculates the difference between the durations.

[0045] FIG. 8 illustrates a flow diagram of determining which of the potential matches of Diphone Database 140 the best match. At step 810, Unit Selector 145 assigns a score to each of the potential diphone candidate matches which correspond to the quality of the match with the subject diphone; i.e. the lower the score the better the match. The score is calculated as follows

Score=(delta_pc*0.3)+(delta_pitch*0.1)+(lm*0.5)+(delta_fm*0.1)+(delta_sr*0.2) ###EQ003###

[0046] At step 820, Unit Selector 145 selects the target diphone that has the lowest score. This is repeated for each diphone in from Source 110 until a string of the best Diphones has been selected.

[0047] At step 830, Unit Selector does a backward match to rescore and determine if better matches can be found. The mechanics of a backwards match are known to one versed in the art of speech morphology.

METHOD AND APPARATUS FOR EXEMPLARY MORPHING COMPUTER SYSTEM BACKGROUND

Inventors

Cpc classification

Classification Explorer

G10L15/02

PHYSICS

Classification Explorer

G10L2021/0135

PHYSICS

Classification Explorer

G10L25/51

PHYSICS

Classification Explorer

G10L21/013

PHYSICS

Classification Explorer

G10L2015/025

PHYSICS

Classification Explorer

G10L25/90

PHYSICS

Classification Explorer

G10L2015/022

PHYSICS

Classification Explorer

G10L13/06

PHYSICS

International classification

Classification Explorer

G10L21/013

PHYSICS

Classification Explorer

G10L25/51

PHYSICS

Classification Explorer

G10L15/02

PHYSICS

Classification Explorer

G10L25/90

PHYSICS

Abstract

Claims

Description