METHOD AND APPARATUS FOR DESIGNATING A SOUNDALIKE VOICE TO A TARGET VOICE FROM A DATABASE OF VOICES

20170301340 · 2017-10-19

Abstract

A soundalike system to improve speech synthesis by training a text-to-speech engine on a voice like the target speaker's voice.

Claims

1. A computerized system optimized to identify which voice, from a collection of voices, is the most similar to a target speaker's voice, comprising a first module configured to receive a collection of voices; a second module configured to train a database of voices, where training the database is building a mathematical model of the database; a fourth module configured to build a mathematical model of each voice contained within said database; a fifth module configured to cluster the voices contained in said database, where the voices are clustered based upon the voice features; a sixth module configured to calculate a single i-vector for each cluster of voices; a seventh module configured to calculate a single i-vector for the target voice; an eighth module configured to identify the cluster most likely to contain the soundalike voice; a ninth module configured to calculate the i-vectors of each voice within said cluster; and a tenth module configured to identify the soundalike voice.

Description

DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 is a schematic diagram of the soundalike computer system.

[0015] FIG. 2 illustrates a high-level flow diagram of the soundalike selection process.

[0016] FIG. 3 illustrates a flow diagram of training the Database 125.

[0017] FIG. 4 illustrates a K-Means clustering.

[0018] FIG. 5 illustrates a flow diagram of the soundalike system creating a mathematical model for the database at the cluster level and calculating the i-vector of the target voice.

[0019] FIG. 6 illustrates a flow diagram of Group Selector 175 determining which group contains the soundalike voice.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0020] FIG. 1 illustrates a block diagram for selecting a voice, from a database of voices, which is substantially similar to a target voice.

[0021] The soundalike system in FIG. 1 may be implemented as a computer system 110; a computer comprising several modules, i.e. computer components embodied as either software modules, hardware modules, or a combination of software and hardware modules, whether separate or integrated, working together to form an exemplary computer system. The computer components may be implemented as a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks. A unit or module may advantageously be configured to reside on the addressable storage medium and configured to execute on one or more processors or microprocessors. Thus, a unit or module may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided for in the components and units may be combined into fewer components and units or modules or further separated into additional components and units or modules.

[0022] Input 120 is a module configured to receive the Voice 120b from an Audio Source 120a. The Audio Source 120a may be one of several sources including, but not limited to, a Human 121 speaking, Streamed Speech 122 or, preferably, the Database 125 containing human speech, i.e. voices. The Audio Source 120a may also be a live person speaking into a microphone, synthesized speech, etc.

[0023] DB Trainer 130 is a module configured to train a database by extracting the Mel Frequency Cepstral Coefficients (MFCCs) from the Voice 120b in Audio Source 120a and using the extracted MFCCs to create the DB Model 130a of the database.

[0024] Individual Voice Modeler 140 is a module configured to build a mathematical model of each individual voice obtained from Audio Source 120a.

[0025] Voice Clusterer 150 is a module configured to cluster, i.e. classify, voices from Audio Source 120a into two or more groups, each a Group 150a, by characteristics inherent in each voice, including, but not limited to, gender, pitch and speed.

[0026] Group I-Vector 160 is a module configured to calculate a single i-vector for each Group 150a.

[0027] Target Voice Calculator 170 is a module configured to calculate the i-vector of the target voice, the Target i-Vector 170a.

[0028] Group Selector 175 is a module configured to select the Group 150a closest to the Target i-Vector 170a, e.g. the Group 150a with the smallest Euclidean distance to the Target i-Vector 170a or with the highest probability score.

[0029] Individual i-Vector 180 is a module configured to calculate the i-vector of each Voice 180a within the selected Group 150a.

[0030] Voice Selector 190 is a module configured to select the Voice 180a with the smallest Euclidean distance to the Target i-Vector 170a.

[0031] FIG. 2 illustrates a high-level flow diagram of the soundalike selection process. At step 210, the soundalike system trains the database. At step 220, the soundalike system builds mathematical models of each voice within the database. At step 230, the soundalike system groups the voices, i.e. creates clusters of voices, based on similarities between the voices, e.g. pitch, speed, etc. At step 240, the soundalike system creates mathematical models of each cluster. At step 260, the soundalike system selects the cluster most likely to contain the soundalike voice. At step 270, the soundalike system selects the voice from within the selected cluster that is closest to the target voice.

[0032] FIG. 3 illustrates a flow diagram of training the Database 125. At step 310, the Input 120 receives the Voice 120b from Database 125. The Database 125 should contain enough Voice 120b data to be statistically significant. Optimally, Database 125 should contain at least 300 voices, each voice having spoken 300 sentences of 5 to 6 seconds duration per sentence. Thus Database 125 will have 450,000 to 540,000 seconds, or approximately 125 to 150 hours, of voice data.

[0033] The Database 125 needs to be trained. Training a database means building a mathematical model to represent the database. In speech synthesis, the ultimate result of training for soundalike is creating i-vectors at the cluster and speaker levels. An i-vector is a final low-dimensional representation of a speaker. At step 320, the DB Trainer 130 divides the human speech into a plurality of frames, Frames 130a, each Frame 130a being generally the length of a single phoneme, or 30 milliseconds. At step 325, DB Trainer 130 calculates N Mel Frequency Cepstral Coefficients, or MFCCs, for each Frame 130a, where N corresponds to the number of features extracted, i.e. the number of features in the target voice, such as pitch, speed, etc., which will be matched against the voices in the Database 125. In the preferred embodiment, DB Trainer 130 calculates 42 MFCCs per Frame 130a over a sliding window equal in length to a single Frame 130a, which increments by ½ the length of Frame 130a.
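By way of non-limiting illustration only, the framing of step 320 with a half-frame increment may be sketched as follows. The function name, the 16 kHz sample rate and the handling of a trailing partial frame are assumptions made for the example and are not taken from the embodiments; the MFCC calculation itself is omitted.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate_hz: int,
                      frame_ms: float = 30.0) -> List[List[float]]:
    """Split a signal into frames of frame_ms that advance by half a frame.

    Consecutive frames therefore overlap by 50%. A trailing partial frame
    is dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    if frame_len < 2:
        raise ValueError("frame must span at least two samples")
    hop = frame_len // 2  # increment of one half of the frame length
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


# Usage: one second of a hypothetical 16 kHz signal.
signal = [0.0] * 16000
frames = split_into_frames(signal, 16000)
print(len(frames), "frames of", len(frames[0]), "samples")
```

In this sketch, each frame would subsequently be passed to an MFCC routine that yields one feature vector per frame.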

[0034] At step 330, the DB Trainer 130 uses the extracted MFCCs from Database 125 to create UBM 130b, a universal background model of the Database 125. Creating a universal background model is within the scope of one skilled in the art of speech synthesis. The UBM 130b results in three matrices, the Weight 135a, the Means 135b and the Variance 135c.
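By way of non-limiting illustration only, the three matrices may be held and evaluated as sketched below. The sketch assumes that the universal background model is a Gaussian mixture model with diagonal covariances, which is a common choice but is not stated in the embodiments; the class and function names are hypothetical, and training of the model is not shown.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagonalGmm:
    """Hypothetical container for the three matrices of a UBM."""
    weights: List[float]          # one mixture weight per component
    means: List[List[float]]      # one mean vector per component
    variances: List[List[float]]  # one diagonal variance vector per component


def component_log_density(x: Sequence[float], mean: Sequence[float],
                          var: Sequence[float]) -> float:
    """Log of a diagonal-covariance Gaussian density evaluated at x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def frame_log_likelihood(gmm: DiagonalGmm, x: Sequence[float]) -> float:
    """Log-likelihood of one feature vector (e.g. one frame of MFCCs)."""
    terms = [math.log(w) + component_log_density(x, m, v)
             for w, m, v in zip(gmm.weights, gmm.means, gmm.variances)]
    peak = max(terms)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Usage: a toy two-component model over one-dimensional features.
ubm = DiagonalGmm(weights=[0.5, 0.5], means=[[-5.0], [5.0]],
                  variances=[[1.0], [1.0]])
print(frame_log_likelihood(ubm, [4.0]))
```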

[0035] Subsequent to modeling the Database 125, each Voice 120b must be modeled. At step 340, the Individual Voice Modeler 140 builds a mathematical model for each Voice 120b using a Maximum A Posteriori, or MAP, algorithm which combines the UBM 130b with the extracted MFCCs from each Voice 120b. Building a mathematical model of a single voice using a Maximum A Posteriori algorithm is within the ordinary scope of one skilled in the art of speech synthesis.
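By way of non-limiting illustration only, one widely used form of MAP adaptation shifts only the means of the universal background model toward the frames of a single voice. The sketch below assumes a diagonal-covariance Gaussian mixture and a relevance factor of 16; both are assumptions of the example, as are all function names, and the embodiments do not specify which parameters are adapted.

```python
import math
from typing import List, Sequence


def log_gauss(x: Sequence[float], mean: Sequence[float],
              var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def responsibilities(x, weights, means, variances) -> List[float]:
    """Posterior probability of each UBM component given frame x."""
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances,
                    relevance: float = 16.0) -> List[List[float]]:
    """Return speaker-adapted means; weights and variances are unchanged."""
    dim = len(means[0])
    counts = [0.0] * len(means)               # soft frame count per component
    sums = [[0.0] * dim for _ in means]       # responsibility-weighted sums
    for x in frames:
        post = responsibilities(x, weights, means, variances)
        for k, g in enumerate(post):
            counts[k] += g
            for d in range(dim):
                sums[k][d] += g * x[d]
    adapted = []
    for k, mean in enumerate(means):
        # Same as alpha * E[x] + (1 - alpha) * mean, alpha = n / (n + relevance)
        adapted.append([(sums[k][d] + relevance * mean[d])
                        / (counts[k] + relevance) for d in range(dim)])
    return adapted


# Usage: 16 frames near the second component pull only that mean.
speaker_frames = [[6.0]] * 16
print(map_adapt_means(speaker_frames, [0.5, 0.5], [[-5.0], [5.0]],
                      [[1.0], [1.0]]))
```

Components that receive few frames from the voice stay close to the universal background model, which is why a short recording can still yield a usable per-voice model in this formulation.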

[0036] In another embodiment, Individual Voice Modeler 140 creates a mathematical model of each voice directly using the universal background model. Building individual voice mathematical models using the universal background model algorithm is within the scope of one skilled in the art of speech synthesis.

[0037] FIG. 4 illustrates a K-Means clustering. Applying a clustering algorithm is within the scope of one skilled in the art of speech synthesis. In the preferred embodiment, the clustering algorithm is a k-means algorithm. K-means stores k centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid. K-means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
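By way of non-limiting illustration only, the two alternating steps may be sketched as follows. Initializing the centroids from the first k points and retaining the previous centroid of an empty cluster are simplifying assumptions of the example, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(points: Sequence[Sequence[float]], k: int,
           max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Cluster points into k groups; returns (centroids, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # naive initialization
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # (1) assign each point to the nearest current centroid
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # (2) move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


# Usage: six two-dimensional points forming two obvious groups.
data = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
centers, assignment = kmeans(data, k=2)
print(assignment)
```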

[0038] There is no well-defined value for “k”, but experimentally, between 40 and 50 clusters is ideal for a database containing millions of voices.

[0039] FIG. 4 illustrates a sample of k=2, i.e. two clusters (e.g. male and female voices).

[0040] Once the number of clusters has been determined, the soundalike system builds a cluster model. A cluster model is a mathematical representation of each cluster within the selected database. A cluster model allows all of the voices within the cluster to be represented with a single mathematical model.

[0041] FIG. 5 illustrates a flow diagram of the soundalike system creating a mathematical model for the database at the cluster level and calculating the i-vector of the target voice. At step 510, Group I-Vector 160 selects a single cluster of voices. At step 520, Group I-Vector 160 selects the MFCCs from all of the voices within the selected cluster. At step 530, the feature vectors, or MFCCs, are combined together using any number of mathematical combinations. In the preferred embodiment, at step 530, Group I-Vector 160 simply creates the Matrix 160a by stacking the vectors, although other combinations such as summation, averages, means, etc. can be applied. A universal background model algorithm is applied to the Matrix 160a. At step 540, Group I-Vector 160 calculates the i-vector of the selected cluster. The result is the mathematical model of the selected cluster. Group I-Vector 160 repeats these steps for each cluster in Database 125.
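By way of non-limiting illustration only, the stacking of step 530 and one alternative combination may be sketched as follows. The sketch assumes that each voice is stored as a list of per-frame MFCC vectors and that stacking places one frame per row; the function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List

FrameMatrix = List[List[float]]  # one row of MFCCs per frame


def stack_cluster_features(voice_mfccs: Dict[str, FrameMatrix],
                           members: List[str]) -> FrameMatrix:
    """Stack the frame vectors of every voice in a cluster into one matrix."""
    matrix: FrameMatrix = []
    for voice_id in members:
        matrix.extend(voice_mfccs[voice_id])  # append this voice's rows
    return matrix


def mean_feature_vector(matrix: FrameMatrix) -> List[float]:
    """Alternative combination: average the rows into a single vector."""
    return [sum(col) / len(matrix) for col in zip(*matrix)]


# Usage: two voices with two-coefficient frames (toy values).
features = {"voice_a": [[1.0, 2.0], [3.0, 4.0]], "voice_b": [[5.0, 6.0]]}
stacked = stack_cluster_features(features, ["voice_a", "voice_b"])
print(len(stacked), "rows;", "mean =", mean_feature_vector(stacked))
```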

[0042] At step 550, the Target Voice Calculator 170 extracts the MFCCs of the target voice over a plurality of frames, each frame being approximately 20 ms, the length of a phoneme. In the preferred embodiment, the MFCCs are calculated over a sliding window equal in length to a single Frame 130a.

[0043] At Step 560, the Target i-Vector 165 is calculated by applying the universal background model to the MFCCs of the Voice 120b. Calculating an i-Vector is within the scope of someone skilled in the art of speech synthesis.

[0044] FIG. 6 illustrates a flow diagram of Group Selector 175 determining which group contains the soundalike voice. At step 610, Group Selector 175 calculates the Euclidean distance between the i-vector of each group and the Target I-Vector 165. At Step 620, Group Selector 175 selects the Group with the lowest Euclidean distance to the Target I-Vector 165.
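By way of non-limiting illustration only, steps 610 and 620 may be sketched as follows. The i-vectors are assumed to have been calculated already and are represented as plain lists of numbers; the function name and the group identifiers are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def select_group(target_ivector: Sequence[float],
                 group_ivectors: Dict[str, Sequence[float]]
                 ) -> Tuple[str, float]:
    """Return the identifier of the nearest group and its distance."""
    if not group_ivectors:
        raise ValueError("at least one group is required")
    # Step 610: Euclidean distance from the target to every group i-vector
    distances = {group_id: math.dist(target_ivector, vec)
                 for group_id, vec in group_ivectors.items()}
    # Step 620: the group with the lowest distance is selected
    best = min(distances, key=distances.get)
    return best, distances[best]


# Usage: two toy groups in a two-dimensional i-vector space.
groups = {"group_0": [0.0, 0.0], "group_1": [3.0, 4.0]}
print(select_group([3.0, 3.0], groups))
```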

[0045] Once the Group 175a has been selected, the i-vectors of each individual voice must be calculated.

[0046] At step 630, Individual I-Vector 180 selects the Voice 120b within Group 175a. At step 640, Individual I-Vector 180 calculates the i-vector of each Voice 120b.

[0047] At step 650, Voice Selector 190 compares the i-vector of each voice in Group 175a with the Target I-Vector 165 and selects the closest i-vector as the soundalike voice. In the preferred embodiment of the invention, the soundalike system selects the Voice 120b with the smallest Euclidean distance to the target voice as the soundalike voice.
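By way of non-limiting illustration only, the two-stage selection of FIG. 6 may be sketched as follows. The callable ivector_of stands in for the i-vector calculation of step 640 and is hypothetical, as are the function name and the data layout; the point of the sketch is that individual i-vectors are calculated only for voices in the selected group.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def find_soundalike(target: Vector,
                    group_ivectors: Dict[str, Vector],
                    voices_by_group: Dict[str, List[str]],
                    ivector_of: Callable[[str], Vector]) -> Tuple[str, str]:
    """Return (selected group, soundalike voice) for a target i-vector."""
    # Stage 1 (steps 610-620): nearest group by Euclidean distance
    group = min(group_ivectors,
                key=lambda g: math.dist(target, group_ivectors[g]))
    # Stage 2 (steps 630-650): nearest voice within that group only
    best_voice, best_dist = None, math.inf
    for voice_id in voices_by_group[group]:
        d = math.dist(target, ivector_of(voice_id))
        if d < best_dist:
            best_voice, best_dist = voice_id, d
    if best_voice is None:
        raise ValueError("selected group contains no voices")
    return group, best_voice


# Usage: toy i-vectors; the lookup table replaces a real i-vector extractor.
table = {"v1": [0.0, 1.0], "v2": [1.0, 0.0],
         "v3": [9.0, 9.0], "v4": [10.0, 10.0]}
result = find_soundalike([9.6, 9.6],
                         {"low": [0.5, 0.5], "high": [9.5, 9.5]},
                         {"low": ["v1", "v2"], "high": ["v3", "v4"]},
                         table.__getitem__)
print(result)
```

Under this arrangement the number of individual i-vectors calculated per query is bounded by the size of one group rather than the size of the whole database.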