SYSTEMS AND METHODS FOR ADAPTING HUMAN SPEAKER EMBEDDINGS IN SPEECH SYNTHESIS
20220335925 · 2022-10-20
Inventors
- Cong ZHOU (Fremont, CA, US)
- Xiaoyu Liu (San Mateo, CA, US)
- Michael Getty Horgan (San Francisco, CA, US)
- Vivek Kumar (Foster City, CA, US)
Abstract
Novel methods and systems for adapting a voice cloning synthesizer for a new speaker using real speech data are disclosed. Utterances from one or more target speakers are parameterized and are used to initialize an embedding vector for use with a voice synthesizer, by means of clustering the utterance data and determining the centroid of the data, using a speaker identification neural network, and/or by finding the closest stored embedded vector to the utterance data.
Claims
1. A method to synthesize a voice in a target style, comprising: receiving as input at least one waveform, each corresponding to an utterance in the target style; extracting features of the at least one waveform to create at least one embedding vector; clustering the at least one embedding vector producing at least one cluster, each cluster having a centroid; determining the centroid of a cluster of the at least one cluster; designating the centroid of the cluster as an initial embedding vector for a speech synthesizer; and adapting the speech synthesizer based on at least the initial embedding vector, thereby producing a synthesized voice in the target style.
2. The method of claim 1, further comprising: pre-processing the at least one waveform to remove non-language sounds and silence.
3. The method of claim 1, wherein each cluster has a threshold distance from its centroid and the adapting further comprises fine-tuning based on the at least one embedding vector of the target style in the threshold distance.
4. The method of claim 1, wherein the speech synthesizer is a neural network.
5. The method of claim 1, wherein the extracting features further comprises combining sample embedding vectors extracted from window samples of a waveform of the at least one waveform to produce an embedding vector for the waveform.
6. The method of claim 5, wherein the combining comprises averaging the sample embedding vectors.
7. The method of claim 1, wherein the input is from a film or video source.
8. The method of claim 1, wherein the target style comprises a speaking style of a target person.
9. The method of claim 8, wherein the target style further comprises at least one of age, accent, emotion, and acting role.
11. The method of claim 8, wherein the target person is an actor and the target style is the target person at an age younger than their current age.
12. The method of claim 1, further comprising receiving as the input further waveforms, each corresponding to an utterance in a second style different than the target style; and extracting features of the further waveforms to create at least a second embedding vector; wherein the clustering further includes clustering on the second embedding vector.
13. The method of claim 12, further comprising determining an expected number of clusters prior to the clustering, wherein the clustering is based on the expected number of clusters.
14. The method of claim 13, wherein the determining an expected number of clusters uses a statistical analysis of the input.
15. A method to synthesize a voice in a target style, comprising: receiving as input at least one waveform, each corresponding to an utterance in the target style; extracting features of the at least one waveform to create at least one embedding vector; calculating vector distances on an embedding vector of the at least one embedding vector to determine embedding vector distances to each of a plurality of known embedding vectors; determining a known embedding vector of the known embedding vectors with a shortest distance from the embedding vector; designating the known embedding vector as an initial embedding vector for a speech synthesizer; adapting the speech synthesizer based on the initial embedding vector; and synthesizing a voice in the target style with the adapted speech synthesizer.
16. A method to synthesize a voice in a target style, comprising: receiving as input at least one waveform, each corresponding to an utterance in the target style; extracting features of the at least one waveform to create at least one embedding vector; using a voice identification system on an embedding vector of the at least one embedding vector, producing a known embedding vector corresponding to a voice identified by the voice identification system as being a closest correspondence to the embedding vector; designating the known embedding vector as an initial embedding vector for a speech synthesizer; adapting the speech synthesizer based on the initial embedding vector; and synthesizing a voice in the target style with the adapted speech synthesizer.
17. The method of claim 16, wherein the voice identification system is a neural network.
18. The method of claim 1, further comprising updating a voice synthesizer table with the initial embedding vector.
19. A non-transitory computer readable medium configured to perform on a computer the method of claim 1.
20. A device configured to perform the method of claim 1.
Description
BRIEF DESCRIPTION OF DRAWINGS
DETAILED DESCRIPTION
[0025] As used herein, a voice “style” refers to any grouping of waveform parameters that distinguishes one source and/or context from another. Examples of “styles” include differentiating between different speakers. A style can also refer to differences in the waveform parameters for a single speaker speaking in different contexts. The different contexts can include, for example, the speaker speaking at different ages (e.g. a person speaking as a teenager sounds different than they do when middle-aged, so those would be two different styles), the speaker speaking in different emotional states (e.g. angry vs. sad vs. calm), the speaker speaking in different accents or languages, the speaker speaking in different business or social contexts (e.g. talking with friends vs. talking with family vs. talking with strangers), actors speaking when playing different roles, or any other contextual difference that would affect a person's mode of speaking (and, therefore, generally produce different voice waveform parameters). So, for example, person A speaking in a British accent, person B speaking in a British accent, and person A speaking in a Canadian accent would be considered three different “styles”.
[0026] As used herein, “waveform parameters” refer to quantifiable information that can be derived from an audio waveform (digital or analog). The derivation can be made in the time and/or frequency domain. Examples include pitch, amplitude, pitch variation, amplitude variation, phasing, intonation, phoneme duration, phoneme sequence alignment, mel-scale pitch, spectra, mel-scale spectra, etc. Some or all of the parameters can also be values derived from the input audio waveform that have no specifically understood meaning (e.g. a combination or transformation of other values). In practice, the waveform parameters can refer to both directly measured parameters and estimated parameters.
[0027] As used herein, an “utterance” is a relatively short sample of speech, typically the equivalent of a line of dialog from a screenplay (e.g. a phrase, sentence, or series of sentences over a few seconds).
[0028] As used herein, a “voice synthesizer” is a machine learning model that can convert an input of text or speech into an output of that text or speech spoken with particular qualities that the model has learned. The voice synthesizer uses an embedding vector for a particular “identity” of output speaking style. See, e.g., Chen, Y., et al., “Sample efficient adaptive text-to-speech,” in International Conference on Learning Representations, 2019.
[0030] The waveforms from the target source(s) are then parameterized (110) by feature extraction into a number of waveform parameters, such that a vector is formed for each utterance. The number of parameters depends on the input for the voice synthesizer (135), and can be any number (such as 32, 64, 100, or 500).
[0031] These vectors can be used to determine an initialization vector (115) to go into the embedding vector table (125), a listing of all styles that can be used by the voice synthesizer (135) for training a new model for cloning. Additionally, some or all of the vectors can be used as tuning data (120) for fine-tuning the voice synthesizer (135). The voice synthesizer (135) adapts a machine learning model, such as a neural network, to take language input (130) in the form of voice audio or text and produce an output waveform (140) of synthesized speech in a style of the target source (105). Adaptation of the model can be performed by updating the model and the embedding vector through stochastic gradient descent.
[0032] One example of parameterization is phoneme sequence alignment estimation. This can be performed with a forced aligner (e.g. Gentle™) based on a speech recognition system (e.g. Kaldi™). This converts audio to Mel-frequency cepstral coefficient (MFCC) features, converts text to known phonemes through a dictionary, and then aligns the MFCC features with the phonemes. The output contains 1) a sequence of phonemes and 2) the timestamp/duration of each phoneme. Based on the phonemes and phoneme durations, one can compute the statistics of phoneme duration and the frequency of phonemes being spoken as parameters.
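The duration statistics described above can be sketched as follows. This is a minimal illustration assuming a hypothetical aligner output of (phoneme, start, end) tuples; the format and values are stand-ins, not the output of any particular aligner:

```python
import numpy as np

# Hypothetical forced-aligner output: (phoneme, start_s, end_s) tuples.
alignment = [("HH", 0.00, 0.08), ("AH", 0.08, 0.20),
             ("L", 0.20, 0.29), ("OW", 0.29, 0.52)]

durations = np.array([end - start for _, start, end in alignment])
stats = {
    "mean_duration": float(durations.mean()),   # average phoneme length (s)
    "var_duration": float(durations.var()),     # spread of phoneme lengths
    # Speaking rate: phonemes per second over the utterance.
    "phonemes_per_second": len(alignment) / float(alignment[-1][2]),
}
```

Statistics of this kind feed directly into the style-count estimation described later.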
[0033] Another example of parameterization is pitch estimation, or pitch contour extraction. This can be done with a program such as the WORLD vocoder (DIO and Harvest pitch trackers) or the CREPE neural-net pitch estimator. For example, one can extract pitch every 5 ms, so that every 1 s of input speech yields 200 floating-point numbers in sequence representing absolute pitch values. Taking the log of these numbers and then normalizing them for each target speaker produces a contour centered around 0.0 (e.g., values like “0.5”) instead of absolute pitch values (e.g. 200.0 Hz). A system like the WORLD pitch estimator uses high-level temporal characteristics of the speech. It first applies low-pass filters with different cutoff frequencies; if a filtered signal consists only of the fundamental frequency, it forms a sine wave, and the fundamental frequency can be obtained from the period of that sine wave. Zero-crossing and peak/dip intervals can be used to choose the best fundamental-frequency candidate. Because the contour shows the pitch variation, one can calculate the variance of the normalized contour to quantify how much variation is in the waveform.
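The log-and-normalize step above can be sketched as follows; the f0 values are made-up stand-ins for a real pitch tracker's output:

```python
import numpy as np

# Hypothetical pitch track: one f0 estimate (Hz) per 5 ms frame.
f0_hz = np.array([190.0, 200.0, 210.0, 205.0, 195.0])

log_f0 = np.log(f0_hz)              # log scale compresses the pitch range
contour = log_f0 - log_f0.mean()    # per-speaker normalization -> values near 0.0
pitch_variation = contour.var()     # how much the speaker varies their tone
```

A real implementation would also exclude unvoiced frames (where the tracker reports no f0) before normalizing.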
[0034] Another example of parameterization is amplitude derivation. This can be done, for example, by first calculating the short-time Fourier transform (STFT) of the waveform to get the spectra of the waveform. A mel filter can be applied to the spectra to get mel-scale spectra, and these can be log-scale converted to log-mel-scale spectra. Parameters such as absolute loudness and amplitude variance can be calculated from the log-mel-scale spectra.
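The STFT-to-log-mel pipeline above can be sketched as follows. This is a simplified illustration with a crude triangular mel filterbank; a production system would use a vetted DSP library rather than this hand-rolled version:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft // 2 + 1) * mel_to_hz(mels) / (sr / 2.0)).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

def amplitude_features(wave, sr=16000, n_fft=512, hop=128, n_mels=20):
    # Windowed frames -> STFT power spectra.
    frames = np.stack([wave[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(wave) - n_fft + 1, hop)])
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = spectra @ mel_filterbank(n_mels, n_fft, sr).T   # mel-scale spectra
    log_mel = np.log(mel + 1e-10)                         # log-mel-scale spectra
    loudness = log_mel.mean()                             # absolute loudness proxy
    amp_variance = log_mel.mean(axis=1).var()             # frame-level variation
    return loudness, amp_variance
```

The returned loudness and variance values are examples of the amplitude parameters mentioned above.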
[0035] In some embodiments, the parameterization step (110) includes labeling the data from the speaker. Since this is based on the source, the labeling step can be performed for the data en masse rather than piece-by-piece. Note that data labelled for a single speaker could contain multiple styles of speaking.
[0036] In some embodiments, the parameterization (110) includes phoneme extraction and alignment with the input waveform. An example of this process is to transcribe the waveforms into text (manually or by an automatic speech recognition system), convert the text to a sequence of phonemes by a dictionary search (for example, using the t2p Perl script), and then align the phoneme sequences with the waveforms. A timestamp (starting time and ending time) can be associated with each phoneme (for example, using the Montreal Forced Aligner to convert audio to MFCC features and create an alignment between the MFCC features and phonemes). For this, the output contains: 1) a sequence of phonemes and 2) the timestamp/duration of each phoneme.
[0038] In one embodiment, the initialization can be performed by clustering.
[0039] In some embodiments, the number of clusters is determined using a statistical analysis of the input that attempts to represent the number of distinct styles in the input data. In some embodiments, statistics of phoneme and tri-phone duration (indicating how fast the speaker is speaking), statistics of pitch variance (indicating how dramatically the speaker changes tone), and statistics of absolute loudness (indicating how loudly the speaker is talking) are analyzed as features to estimate the number of spoken styles (clusters), e.g. by calculating one mean and one variance for each of the feature sequences and then roughly estimating how many mean/variance clusters those statistics form.
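One possible heuristic along these lines could be sketched as follows. The gap-counting rule and the `gap_factor` threshold are assumptions for illustration; the disclosure does not specify an exact procedure:

```python
import numpy as np

def estimate_num_styles(per_utterance_means, gap_factor=3.0):
    """Rough style-count estimate: sort per-utterance feature means and count
    gaps much larger than the median gap. A heuristic stand-in for "roughly
    estimating how many mean/variance clusters there are"."""
    vals = np.sort(np.asarray(per_utterance_means, dtype=float))
    gaps = np.diff(vals)
    if len(gaps) == 0:
        return 1  # a single utterance can only evidence one style
    typical = np.median(gaps) + 1e-12
    # Each unusually large gap separates two groups of feature values.
    return 1 + int(np.sum(gaps > gap_factor * typical))
```

In practice one would combine such estimates across the duration, pitch-variance, and loudness features rather than rely on a single feature.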
[0040] In some embodiments, the number of clusters is determined automatically by the clustering algorithm for certain data. A clustering algorithm (225) is performed on the data to find clusters of input. This can be, for example, a k-means or Gaussian mixture model (GMM) clustering algorithm. With the clusters identified, the centroid of each cluster is determined (230). The centroids are used as initialized embedding vectors for each cluster/style for training/adapting the synthesizer (235) for that style. The input data labeled for that style that falls within the corresponding cluster variance from the corresponding centroid (inside the cluster space) can be used as the fine-tuning data (240) for the synthesizer adaptation (235).
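The clustering and centroid step could be sketched with a minimal k-means (Lloyd's algorithm) implementation; in practice a library implementation such as scikit-learn's KMeans or a GMM would typically be used instead:

```python
import numpy as np

def kmeans_centroids(vectors, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means. Each returned centroid can serve as the
    initialized embedding vector for one style."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each utterance vector to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

The `labels` output also identifies which utterances fall in each cluster, i.e. the candidates for that style's fine-tuning data.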
[0041] Some embodiments of synthesizer adaptation (235) only adapt the speaker embedding vector. For example, let the training objective be: p(x_t | x_1…x_(t−1), emb, c, w), where x_t is the sample at time t, x_1…x_(t−1) is the sample history, emb is the embedding vector, c is the conditioning information containing the extracted conditioning features (e.g. pitch contour, phoneme sequence with timestamps, etc.), and w represents the weights of the conditional SampleRNN. Fix c and w and perform stochastic gradient descent on emb only. Once the training reaches convergence, stop training. The updated emb is assigned to the target speaker (the new speaker).
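The embedding-only update can be illustrated with a toy stand-in for the synthesizer: a fixed linear map takes the place of the conditional SampleRNN (an assumption for illustration only), and gradient descent updates emb while the weights stay frozen:

```python
import numpy as np

# Toy stand-in: frozen weights W map an embedding to a prediction. Only emb
# is updated (c and w stay fixed), mirroring embedding-only adaptation.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))     # frozen "model weights" (w)
target = rng.standard_normal(8)     # stands in for the target-speaker data
emb = np.zeros(4)                   # initial embedding (e.g. a cluster centroid)

lr = 0.05
for _ in range(2000):
    residual = W @ emb - target     # prediction error under the frozen weights
    emb -= lr * (W.T @ residual)    # gradient step on emb only; W is untouched
# emb now approaches the least-squares optimum for the frozen model.
```

In a real system the same pattern appears as freezing the network's parameters and passing only the embedding to the optimizer.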
[0042] In some embodiments of synthesizer adaptation (235), the speaker embedding vector is adapted first, then the model (all or part) is updated directly. For example, let the training objective be: p(x_t | x_1…x_(t−1), emb, c, w), where x_t is the sample at time t, x_1…x_(t−1) is the sample history, emb is the embedding vector, c is the conditioning information containing the extracted conditioning features (e.g. pitch contour, phoneme sequence with timestamps, etc.), and w represents the weights of the conditional SampleRNN. Fix c and w and perform stochastic gradient descent on emb only. Once the training of emb reaches convergence, start stochastic gradient descent on w. Alternatively, once the training of emb reaches convergence, start stochastic gradient descent on the last output layer of the conditional SampleRNN. Optionally, train a few steps (e.g. 1000 steps) of gradient updates. The updated w and emb are assigned together to the target speaker (the new speaker).
[0043] As used herein, training reaching “convergence” refers to a subjective determination that the training shows no substantial improvement. For speech cloning, this can include listening to the synthesized speech and making a subjective evaluation of its quality. When training a synthesizer, both the training-set loss curve and the validation-set loss curve can be monitored; if the validation loss does not decrease for some threshold number of epochs (e.g. 2 epochs), the learning rate can be decreased (e.g. by 50%).
[0044] In some embodiments, only the speaker embedding is adapted in the adaptation stage. The loss curve can be monitored and a subjective evaluation made to determine whether training has reached convergence. If there is no subjective improvement, training can be stopped and the rest of the model fine-tuned at a low learning rate (e.g. 1×10^−6) for a few gradient update steps. Again, subjective evaluation can be used to determine when to stop training, and to gauge the efficacy of the training procedure.
[0045] Different approaches could be used to select the most appropriate number of clusters. In some embodiments, pitch analysis can be performed to determine the number of clusters. Preprocessing such as silence trimming and non-phonetic region trimming (similar to the filtering (210) shown in
[0048] The parameterized vectors (110) can be compared by distance (505) to the values of the embedding vector table (125) to determine the closest vector from the table, which is used as the initialized embedding vector (510) to adapt the synthesizer (235). Either an arbitrary (e.g. the first generated) parameterized vector can be used for the distance calculations (505), or an average parameterized vector can be built from multiple parameterized vectors and used for the distance calculations (505). The more embedding vectors from the table (125) that are used for the distance calculations (505), the greater the accuracy of the resulting initialized embedding vector (510), since that provides a greater probability that a voice style very close to the input is available. The adaptation (235) can also be fine-tuned (520) from the parameterized vectors (110). The adaptation (235) can update the embedding vector based on the fine-tuning (520) for entry into the embedding vector table (125), or the initialized embedding vector (510) can be populated into the table (125) with a new identification relating it to the new style.
[0049] Vector distance calculations can include Euclidean distance, vector dot product, and/or cosine similarity.
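The nearest-vector lookup using one of these distances could be sketched as follows (cosine similarity shown; Euclidean distance or a plain dot product substitute directly):

```python
import numpy as np

def nearest_embedding(query, table):
    """Return the index of the stored embedding closest to the query vector,
    by cosine similarity. The winning row would seed the synthesizer's
    initialized embedding vector."""
    table = np.asarray(table, dtype=float)
    q = query / np.linalg.norm(query)                      # unit-normalize query
    rows = table / np.linalg.norm(table, axis=1, keepdims=True)
    sims = rows @ q                                        # cosine similarities
    return int(sims.argmax())
```

For Euclidean distance one would instead take `argmin` over `np.linalg.norm(table - query, axis=1)`.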
[0051] If it is the same, the parameterized vectors (605) are run through the voice ID system (610) to “identify” which entry in the voice ID database (625) matches the utterances. The speaker is not normally in the voice ID database at this point, but if the table has a large number of entries (for example, 30,000), then the speaker identified from the table (625) should be a close match to the style of the utterances. This means that the embedding vector from the voice ID database (625) selected by the voice ID model (610) can be used as an initialized embedding vector to adapt the voice synthesizer (235). As with the other initialization methods, this can be fine-tuned with the parameterized vectors (605) for the utterances.
[0052] If the parameters for the voice ID system are different from the parameters of the synthesizer, the method is largely the same, but the initialized embedding vector will have to be looked up from the database (625) in a form appropriate for the synthesizer (235), and the fine-tuning data (120) will have to go through feature extraction separate from the voice ID parameterization (605).
[0053] In some embodiments, the feature extraction for the utterances can be done by combining extracted vectors from shorter segments of the longer utterance.
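The window-and-average combination could be sketched as follows; `embed_fn` is a hypothetical placeholder for whatever per-window feature extractor the system actually uses:

```python
import numpy as np

def utterance_embedding(waveform, embed_fn, window=400, hop=200):
    """Slide a window over the utterance, embed each window with embed_fn,
    and average the per-window embeddings into one utterance-level vector
    (averaging is the combining method of claim 6)."""
    windows = [waveform[i:i + window]
               for i in range(0, len(waveform) - window + 1, hop)]
    return np.mean([embed_fn(w) for w in windows], axis=0)
```

Overlapping windows (hop smaller than window) smooth the estimate across segment boundaries.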
[0054] According to some embodiments, a voice synthesizer system can be as shown in
[0056] A number of embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other embodiments are within the scope of the following claims.
[0057] The present disclosure is directed to certain implementations for the purposes of describing some innovative aspects described herein, as well as examples of contexts in which these innovative aspects may be implemented. However, the teachings herein can be applied in various different ways. Moreover, the described embodiments may be implemented in a variety of hardware, software, firmware, etc. For example, aspects of the present application may be embodied, at least in part, in an apparatus, a system that includes more than one device, a method, a computer program product, etc. Accordingly, aspects of the present application may take the form of a hardware embodiment, a software embodiment (including firmware, resident software, microcodes, etc.) and/or an embodiment combining both software and hardware aspects. Such embodiments may be referred to herein as a “circuit,” a “module”, a “device”, an “apparatus” or “engine.” Some aspects of the present application may take the form of a computer program product embodied in one or more non-transitory media having computer readable program code embodied thereon. Such non-transitory media may, for example, include a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. Accordingly, the teachings of this disclosure are not intended to be limited to the implementations shown in the figures and/or described herein, but instead have wide applicability.