EFFICIENT CIRCUIT FOR SAMPLING
20230368774 · 2023-11-16
Inventors
Cpc classification
G10H2250/211
PHYSICS
G10H2250/311
PHYSICS
G10L13/02
PHYSICS
G10H2250/161
PHYSICS
International classification
Abstract
According to this disclosure, a method of synthesizing an audio stream sample using a processor is provided. The method comprises: generating a set of unnormalized log probabilities using a neural network, each unnormalized log probability associated with a possible value for the audio stream sample, sampling a Gumbel distribution for each of the unnormalized log probabilities, adding the samples from the Gumbel distribution to each of the respective unnormalized log probabilities to generate a set of modified log probabilities, each modified log probability associated with a possible value for the audio stream sample, and selecting the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities as the audio stream sample.
Claims
1. A method of synthesizing an audio stream sample using a processor comprising: generating a set of unnormalized log probabilities using a neural network, each unnormalized log probability associated with a possible value for the audio stream sample; sampling a Gumbel distribution for each of the unnormalized log probabilities; adding the samples from the Gumbel distribution to each of the respective unnormalized log probabilities to generate a set of modified log probabilities, each modified log probability associated with a possible value for the audio stream sample; and selecting the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities as the audio stream sample.
2. A method according to claim 1 wherein the set of unnormalized log probabilities is generated as an array wherein an index of each unnormalized log probability in the array is associated with a respective possible value for the audio stream sample.
3. A method according to claim 1, wherein the audio stream sample is an N-bit number, wherein optionally N is at least 8, 16, 32, or 64.
4. A method according to claim 1, wherein sampling the Gumbel distribution for each of the unnormalized log probabilities comprises: generating a random number using a Pseudo Random Number Generator (PRNG) circuit; and looking up an address in a lookup table based on the random number, wherein the lookup table comprises samples from a Gumbel distribution.
5. A method according to claim 4, wherein the PRNG circuit comprises a Linear-Feedback Shift Register (LFSR) circuit configured to generate the random number.
6. A method according to claim 5, wherein the audio stream sample is an N-bit number, and the random number generated by the LFSR circuit is an M-bit random number, where M is less than N.
7. A method according to claim 1, wherein a data bus provides the set of unnormalized log probabilities from the neural network to the processor in parallel, wherein the samples from the Gumbel distribution are added to the unnormalized log probabilities in parallel.
8. (canceled)
9. A method according to claim 1, wherein selecting the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities comprises using a plurality of comparator circuits arranged as a comparator tree structure, each comparator circuit arranged to compare two modified log probabilities and select the possible value of the audio stream sample associated with the largest modified log probability.
10-12. (canceled)
13. An audio stream synthesizing circuit for synthesizing an audio stream sample, the audio stream synthesizing circuit configured to receive a set of unnormalized log probabilities from a neural network, each unnormalized log probability associated with a possible value for the audio stream sample, wherein the audio stream synthesizing circuit comprises: a Gumbel distribution sampling circuit configured to generate a plurality of samples of the Gumbel distribution; an adding circuit configured to add the plurality of samples of the Gumbel distribution to the set of unnormalized log probabilities to generate a set of modified log probabilities, each modified log probability associated with a possible value for the audio stream sample; and a value selecting circuit configured to select the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities as the audio stream sample.
14. An audio stream synthesizing circuit according to claim 13, wherein the set of unnormalized log probabilities is received as an array wherein an index of each unnormalized log probability in the array is associated with a respective possible value for the audio stream sample.
15. An audio stream synthesizing circuit according to claim 13, wherein the audio stream sample is an N-bit number, wherein optionally N is at least 8, 16, 32, or 64.
16. An audio stream synthesizing circuit according to claim 13, wherein the Gumbel distribution sampling circuit comprises: a lookup table circuit comprising samples from a Gumbel distribution; and a Pseudo Random Number Generator (PRNG) circuit configured to generate random numbers corresponding to addresses of the lookup table circuit.
17. An audio stream synthesizing circuit according to claim 16, wherein the PRNG circuit comprises a Linear-Feedback Shift Register (LFSR) circuit configured to generate the random number.
18. An audio stream synthesizing circuit according to claim 17, wherein the audio stream sample is an N-bit number, and the random number generated by the LFSR circuit is an M-bit random number, where M is less than N.
19. An audio stream synthesizing circuit according to claim 13, further comprising: a data bus, wherein the audio stream synthesizing circuit is configured to receive the set of unnormalized log probabilities from the neural network in parallel using the data bus, wherein the adding circuit is configured to add the samples from the Gumbel distribution to the unnormalized log probabilities in parallel.
20. An audio stream synthesizing circuit according to claim 19, wherein the audio stream sample is an N-bit number, and the data bus is configured to provide less than 2^N unnormalized log probabilities of the set of unnormalized log probabilities in parallel per clock cycle of the audio stream synthesizing circuit.
21. An audio stream synthesizing circuit according to claim 13, wherein the value selecting circuit comprises a plurality of comparator circuits arranged as a comparator tree structure, each comparator circuit configured to compare two modified log probabilities and select the possible value of the audio stream sample associated with the largest modified log probability.
22. An audio stream synthesizing circuit according to claim 13, wherein a clock cycle of the audio stream synthesizing circuit has a frequency of at least 250 MHz, wherein optionally an audio stream sample is generated from a set of unnormalized log probabilities in less than 200 ns, or less than 190 ns, 180 ns, or 170 ns.
23. An audio stream synthesizing circuit according to claim 13, wherein the audio stream synthesizing circuit is implemented as Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).
24-26. (canceled)
Description
BRIEF DESCRIPTION OF THE FIGURES
[0047] Aspects of the present disclosure will be described, by way of example only, with reference to the following drawings, in which:
[0048]
[0049]
[0050]
DETAILED DESCRIPTION
[0051] According to an embodiment of this disclosure, a speech stream synthesizing circuit 1 is provided. The speech stream synthesizing circuit 1 is configured to receive a set of unnormalized log probabilities for possible values of the speech stream sample from a neural network and generate a speech stream sample. The speech stream synthesizing circuit 1 may be provided as part of a Text To Speech (TTS) system for synthesizing human-sounding speech from a text input. For example, according to one embodiment of the disclosure, the speech stream synthesizing circuit 1 may be provided as part of a system configured to implement the WaveNet algorithm.
[0052] The WaveNet algorithm is an autoregressive neural network which has a general structure as shown in
[0053] A single output value needs to be selected from this distribution in order to have a sequence of integer samples for the waveform. For example, for a speech stream with a bit depth of 8, p produces a vector of length 256 denoting the probability of each possible sample.
[0054] In some embodiments, the audio features h may comprise one or more mel spectrograms. The mel spectrogram may be generated by a neural network. For example, in some embodiments the mel spectrograms may be generated by a Tacotron 2 neural network model. Methods for generating audio features h are well known to the skilled person, so are not discussed in detail herein. The audio features may be generated by other components of the speech synthesizer circuit 1, or may be provided to the speech synthesizer circuit 1 from another circuit.
[0055] The Neural Network Core of the TTS system shown in
[0056]
[0057]
[0058] The speech stream synthesizing circuit 1 of
[0059] The Gumbel distribution sampling circuit 10 is configured to generate a plurality of samples of the Gumbel distribution.
[0060] The PRNG circuit is configured to generate a random number. In this disclosure, the term random number encompasses pseudo random numbers generated by a PRNG circuit 12 and the like. As such, the term random number includes numbers that are truly random, and also a sequence of pseudo random numbers.
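By way of illustration only, the following Python sketch models one PRNG option recited in the claims, a Linear-Feedback Shift Register. The 16-bit register width, the tap positions (16, 14, 13, 11), the seed, and the function names `lfsr16_step` and `lfsr16_stream` are assumptions chosen for the example; the disclosure does not specify them here.

```python
from typing import List


def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one shift.

    Taps 16, 14, 13, 11 are an assumption for this sketch; they give a
    maximal-length sequence of 2**16 - 1 states.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_stream(seed: int, count: int) -> List[int]:
    """Return `count` successive register states starting after `seed`."""
    state = seed & 0xFFFF
    if state == 0:
        # An all-zero register never leaves the zero state.
        raise ValueError("seed must be non-zero")
    states = []
    for _ in range(count):
        state = lfsr16_step(state)
        states.append(state)
    return states


if __name__ == "__main__":
    # Hypothetical seed; any non-zero 16-bit value works.
    print([hex(s) for s in lfsr16_stream(0xACE1, 4)])
```

Successive register states differ by a single shift and are therefore strongly correlated, so a practical design might take only some of the bits, or advance the register several shifts per output. That choice is outside the scope of this sketch.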
[0061] The PRNG circuit 12 of
[0062] The Lookup Table circuit 14 comprises samples from a Gumbel distribution. In the embodiment of
is one of the random numbers generated by the PRNG circuit, the entries in the lookup table circuit 14 store values for:
[0063] In the embodiment of
[0064] The Gumbel distribution sampling circuit 10 is configured to output samples of the Gumbel distribution. The Gumbel distribution sampling circuit 10 is configured to output a sample of the Gumbel distribution for each of the 2^N unnormalized log probabilities used to generate a speech stream sample. The samples may be output from the Gumbel distribution sampling circuit 10 sequentially, or in parallel. In the embodiment of
[0065] In some embodiments, such as shown in
[0066] Input bus 40 is configured to transfer the set of unnormalized log probabilities generated by the Neural Network Core to the adding circuit 20. The input bus 40 is configured to transfer the set of unnormalized log probabilities in parallel. In some embodiments, all of the set of unnormalized log probabilities for calculating a speech stream sample are transferred in a single clock cycle of the speech stream synthesizer 1. In other embodiments, at least some of the set of unnormalized log probabilities are transferred in a single clock cycle of the speech stream synthesizer 1. In the embodiment of
[0067] Whilst the embodiment of
[0068] Each unnormalized log probability transferred by the input bus 40 may be provided as a BFP number. In some embodiments, each unnormalized log probability may be provided as at least a 32, 64, or 128 bit BFP number. In some embodiments, each unnormalized log probability may be provided in the same format as the numbers generated by the Gumbel distribution sampling circuit 10. For example, in the embodiment of
[0069] The adding circuit 20 is a circuit configured to add the plurality of samples of the Gumbel distribution to the set of unnormalized log probabilities to generate a set of modified log probabilities. In the embodiment of
[0070] The adding circuit 20 may comprise a plurality of adders. In the embodiment of
[0071] In some embodiments, the adding circuit 20 may also comprise a modified log probability lookup table. The results of the adders may be stored in the modified log probability lookup table of the adding circuit 20 for output to the value selection circuit 30. Each modified log probability may be stored as a BFP number in the modified log probability lookup table. In the embodiment of
[0072] The adding circuit 20 is configured to output the set of modified log probabilities to the value selection circuit 30. The adding circuit 20 may output the modified log probabilities in series, or in parallel. In the embodiment of
[0073] The value selection circuit 30 is configured to select the possible value of the speech stream sample associated with the largest modified log probability from the set of modified log probabilities as the speech stream sample.
[0074] In the embodiment of
[0075] Where the complete set of modified log probabilities is provided to the value selecting circuit 30 over multiple clock cycles (such as in the embodiment of
[0076] The value selecting circuit 30 is configured to keep track of the possible value for the speech stream sample value associated with each modified log probability. In some embodiments, the index of each modified log probability provided as part of the array to the value selecting circuit 30 may be stored along with its associated modified log probability in each layer. For example, the index and associated modified log probability may be stored in one or more lookup tables.
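The index tracking described above can be sketched in software as follows. This is a behavioural model under stated assumptions, not the circuit itself: the names `compare`, `comparator_tree` and `select_value` are hypothetical, the `width` parameter stands in for the number of values delivered per clock cycle, and the running maximum carried between chunks is an assumed way of handling a set that arrives over several cycles.

```python
from typing import List, Optional, Tuple

# (modified log probability, index); the index is the candidate sample value.
Pair = Tuple[float, int]


def compare(a: Pair, b: Pair) -> Pair:
    """Model one comparator: keep the pair with the larger value."""
    return a if a[0] >= b[0] else b


def comparator_tree(pairs: List[Pair]) -> Pair:
    """Reduce one non-empty chunk layer by layer, halving it each time."""
    layer = list(pairs)
    while len(layer) > 1:
        nxt = [compare(layer[i], layer[i + 1])
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])  # odd element passes through unchanged
        layer = nxt
    return layer[0]


def select_value(modified: List[float], width: int) -> int:
    """Return the index of the largest entry, consuming `width` per step."""
    if not modified or width < 1:
        raise ValueError("need at least one value and a positive width")
    best: Optional[Pair] = None  # assumed running-maximum register
    for start in range(0, len(modified), width):
        chunk = [(v, start + k)
                 for k, v in enumerate(modified[start:start + width])]
        winner = comparator_tree(chunk)
        best = winner if best is None else compare(best, winner)
    return best[1]


if __name__ == "__main__":
    values = [0.1, 2.5, -1.0, 2.4, 0.0, 3.7, 1.1, -0.3, 0.9]
    print(select_value(values, width=4))
```

Because each comparator forwards the index together with the value, no second pass is needed to recover which candidate sample value won. In this sketch, ties resolve to the lower index.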
[0077] The value selecting circuit 30 is configured to select a final value for the speech stream sample based on the largest modified log probability from the set of modified log probabilities and output the final value as the next speech stream sample. The speech stream sample is output as an N-bit number. This process is statistically equivalent to sampling from the distribution p which is derivable from the set of unnormalized log probabilities provided by the Neural Network Core.
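The statistical equivalence stated above can be checked with a short derivation. The notation is an assumption introduced for this sketch: l_i denotes the i-th unnormalized log probability, G_i an independent standard Gumbel sample, and p_k the softmax probability of value k.

```latex
% Standard Gumbel: CDF F(g) = exp(-e^{-g}), density f(g) = e^{-g} exp(-e^{-g}).
% Value k is selected when l_k + G_k exceeds l_i + G_i for every i \neq k.
\begin{aligned}
P\bigl(\arg\max_i (l_i + G_i) = k\bigr)
  &= \int_{-\infty}^{\infty} f(z - l_k) \prod_{i \neq k} F(z - l_i)\, dz \\
  &= \int_{-\infty}^{\infty} e^{l_k} e^{-z}
     \exp\Bigl(-e^{-z} \sum_{i} e^{l_i}\Bigr)\, dz \\
  &= e^{l_k} \int_{0}^{\infty} \exp\Bigl(-t \sum_{i} e^{l_i}\Bigr)\, dt
     \qquad (t = e^{-z}) \\
  &= \frac{e^{l_k}}{\sum_{i} e^{l_i}} = p_k .
\end{aligned}
```

The result depends only on the differences between the l_i, which is why the unnormalized log probabilities can be used directly without first computing the normalizing sum.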
[0078] The speech stream synthesizing circuit 1 may be configured to calculate a plurality of speech stream samples over time. As such, the speech stream synthesizing circuit 1 may repeat the functionality described above in order to generate a continuous stream of speech samples. The speech stream samples generated may resemble human speech due to the statistical method used to sample the unnormalized log probabilities calculated by the Neural Network Core.
[0079] In some embodiments, the speech stream synthesizer circuit 1 may be implemented on a Field Programmable Gate Array. For example, a Gumbel distribution sampling circuit 10 for a single value may be implemented on a FPGA using 20 Flip-Flops (FF) and 49 Lookup Tables (LUT). In the embodiment of
[0080] In the embodiment of
[0081] The adding circuit 20 can be implemented by reusing logic in the rest of the speech stream synthesizer circuit 1. For example, in some embodiments, a speech stream synthesizer circuit 1 may comprise an adding circuit which is configured to perform other computational operations. The adding circuit 20 can thus be implemented by time sharing the use of the adding circuit 20 with other parts of the speech stream synthesizing circuit 1. For example, in some embodiments, the adding circuit 20 may be provided as part of a circuit configured to perform dot product matrix operations. As such, the elementwise addition of the adding circuit 20 may be performed by time sharing an adding circuit 20 with other parts of the speech stream synthesizing circuit 1. Of course, in other embodiments of the disclosure, the speech stream synthesizing circuit 1 may include an adding circuit 20 which is dedicated to the elementwise addition step.
[0082] Accordingly, the speech stream synthesizing circuit 1 of
[0083] In other embodiments, a speech stream synthesizing circuit 1 designed to work on a single data width bus would require approximately 84 FFs and 49 LUTs - an extremely small circuit suitable for an embedded application.
[0084] Latency is also critical for a WaveNet implementation, as the full network has a stringent latency budget of 62.5 μs in which to complete, so that the result can feed back into the next input of the computation. The speech stream synthesizer circuit 1 implementation of
[0085] Accordingly, a speech stream synthesizing circuit 1 is provided. The speech stream synthesizing circuit 1 is capable of generating speech stream samples from a set of unnormalized log probabilities by sampling the set of unnormalized log probabilities with low latency. That is to say, the speech stream synthesizing circuit 1 calculates each speech stream sample within a timeframe suitable for outputting e.g. 16 kHz bandwidth audio. For example, in some WaveNet implementations, the complete TTS system may have a latency budget of about 62.5 μs to complete the generation of a speech stream sample. The speech stream synthesizing circuit 1 in the embodiment of
[0086] Next, a method of synthesizing a speech stream sample using a processor will be described with reference to
[0087] The method comprises generating a set of unnormalized log probabilities for possible values of the speech stream sample using a neural network. As described above, a Neural Network Core may generate a set of unnormalized log probabilities that are provided to the speech stream synthesizing circuit 1 by the input bus 40.
[0088] The method also comprises sampling a Gumbel distribution for each of the unnormalized log probabilities of the set of unnormalized log probabilities. The Gumbel distribution samples may be generated by the Gumbel distribution sampling circuit 10 discussed above.
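One way to model the lookup-table approach to this sampling step in software is sketched below. The table contents are an assumption: each M-bit address r is mapped to the midpoint u = (r + 0.5) / 2^M of its interval, and the entry stores the standard Gumbel inverse CDF, -ln(-ln(u)). The exact mapping used by the lookup table circuit 14 may differ, and the names `build_gumbel_lut` and `gumbel_from_random` are hypothetical.

```python
import math
from typing import List


def build_gumbel_lut(m_bits: int) -> List[float]:
    """Precompute 2**m_bits Gumbel samples, one per table address.

    Assumption: address r maps to u = (r + 0.5) / 2**m_bits, which keeps
    u strictly inside (0, 1) so that both logarithms are defined.
    """
    size = 1 << m_bits
    return [-math.log(-math.log((r + 0.5) / size)) for r in range(size)]


def gumbel_from_random(lut: List[float], random_number: int) -> float:
    """Use the low bits of a random number as the table address."""
    return lut[random_number & (len(lut) - 1)]


if __name__ == "__main__":
    lut = build_gumbel_lut(8)          # 256-entry table, M = 8 assumed
    print(gumbel_from_random(lut, 0x1234))
```

Moving the two logarithms into a precomputed table means the run-time cost of each Gumbel sample is a single memory read, at the price of quantizing the distribution to 2^M distinct values.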
[0089] The method also comprises adding the samples from the Gumbel distribution to each of the respective unnormalized log probabilities to generate a set of modified log probabilities. The adding of the samples may be performed by the adding circuit 20 described above.
[0090] The method also comprises selecting the possible value of the speech stream sample with the largest modified log probability from the set of modified log probabilities as the speech stream sample. This step may be performed by the value selection circuit 30 discussed above.
[0091] The method according to embodiments of this disclosure is not limited to the speech stream synthesizing circuit 1 discussed above. For example, the method according to embodiments of this disclosure may be performed by a processor such as a central processing unit (CPU). As such, it will be appreciated that methods according to this disclosure may be performed on dedicated hardware (e.g. a hardware accelerator), or methods may be performed using a software implementation. For example, methods according to the disclosure may be performed by a processor (e.g. a CPU) executing a set of instructions stored in a memory.
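A minimal software sketch of the four method steps, as they might run on a CPU, is given below. It assumes the unnormalized log probabilities are already available as a Python list, and it uses the standard library generator in place of the PRNG circuit and lookup table. The function names, the clamping constant and the example inputs are illustrative choices, not taken from the disclosure.

```python
import math
import random
from typing import List


def sample_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel sample by inverting the CDF."""
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def synthesize_sample(logits: List[float], rng: random.Random) -> int:
    """Add Gumbel noise to each logit and return the index of the largest."""
    modified = [l + sample_gumbel(rng) for l in logits]
    return max(range(len(modified)), key=modified.__getitem__)


def softmax(logits: List[float]) -> List[float]:
    """Reference distribution p, used here only to check the sampler."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 1.0, 2.0, -1.0]     # hypothetical network output
    draws = 20000
    counts = [0] * len(logits)
    for _ in range(draws):
        counts[synthesize_sample(logits, rng)] += 1
    print([round(c / draws, 3) for c in counts])
    print([round(p, 3) for p in softmax(logits)])
```

The returned index plays the role of the speech stream sample value. The `softmax` helper is not part of the method; it is included only so that the empirical selection frequencies can be compared against the distribution p.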
[0092] It will also be appreciated that the embodiments in this description relate to the generation of a speech stream sample by a speech stream synthesizing circuit 1. The present disclosure is not, however, limited to the synthesis of speech stream samples as discussed above. As such, the skilled person will appreciate that the methods and systems of this disclosure may equally be applied to the synthesis of audio samples from a set of unnormalized log probabilities provided by a neural network. For example, a neural network may provide a set of unnormalized log probabilities for the synthesis of audio samples including: music samples, speech samples, or noise cancellation samples.