EFFICIENT CIRCUIT FOR SAMPLING

20230368774 · 2023-11-16

    Abstract

    According to this disclosure, a method of synthesizing an audio stream sample using a processor is provided. The method comprises: generating a set of unnormalized log probabilities using a neural network, each unnormalized log probability associated with a possible value for the audio stream sample, sampling a Gumbel distribution for each of the unnormalized log probabilities, adding the samples from the Gumbel distribution to each of the respective unnormalized log probabilities to generate a set of modified log probabilities, each modified log probability associated with a possible value for the audio stream sample, and selecting the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities as the audio stream sample.

    Claims

    1. A method of synthesizing an audio stream sample using a processor comprising: generating a set of unnormalized log probabilities using a neural network, each unnormalized log probability associated with a possible value for the audio stream sample; sampling a Gumbel distribution for each of the unnormalized log probabilities; adding the samples from the Gumbel distribution to each of the respective unnormalized log probabilities to generate a set of modified log probabilities, each modified log probability associated with a possible value for the audio stream sample; and selecting the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities as the audio stream sample.

    2. A method according to claim 1 wherein the set of unnormalized log probabilities is generated as an array wherein an index of each unnormalized log probability in the array is associated with a respective possible value for the audio stream sample.

    3. A method according to claim 1, wherein the audio stream sample is an N-bit number, wherein optionally N is at least 8, 16, 32, or 64.

    4. A method according to claim 1, wherein sampling the Gumbel distribution for each of the unnormalized log probabilities comprises: generating a random number using a Pseudo Random Number Generator (PRNG) circuit; and looking up an address in a lookup table based on the random number, wherein the lookup table comprises samples from a Gumbel distribution.

    5. A method according to claim 4, wherein the PRNG circuit comprises a Linear-Feedback Shift Register (LFSR) circuit configured to generate the random number.

    6. A method according to claim 5, wherein the audio stream sample is an N-bit number, and the random number generated by the LFSR circuit is an M-bit random number, where M is less than N.

    7. A method according to claim 1, wherein a data bus provides the set of unnormalized log probabilities from the neural network to the processor in parallel, wherein the samples from the Gumbel distribution are added to the unnormalized log probabilities in parallel.

    8. (canceled)

    9. A method according to claim 1, wherein selecting the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities comprises using a plurality of comparator circuits arranged as a comparator tree structure, each comparator circuit arranged to compare two modified log probabilities and select the possible value of the audio stream sample associated with the largest modified log probability.

    10-12. (canceled)

    13. An audio stream synthesizing circuit for synthesizing an audio stream sample, the audio stream synthesizing circuit configured to receive a set of unnormalized log probabilities from a neural network, each unnormalized log probability associated with a possible value for the audio stream sample, wherein the audio stream synthesizing circuit comprises: a Gumbel distribution sampling circuit configured to generate a plurality of samples of the Gumbel distribution; an adding circuit configured to add the plurality of samples of the Gumbel distribution to the set of unnormalized log probabilities to generate a set of modified log probabilities, each modified log probability associated with a possible value for the audio stream sample; and a value selecting circuit configured to select the possible value of the audio stream sample associated with the largest modified log probability from the set of modified log probabilities as the audio stream sample.

    14. An audio stream synthesizing circuit according to claim 13, wherein the set of unnormalized log probabilities is received as an array wherein an index of each unnormalized log probability in the array is associated with a respective possible value for the audio stream sample.

    15. An audio stream synthesizing circuit according to claim 13, wherein the audio stream sample is an N-bit number, wherein optionally N is at least 8, 16, 32, or 64.

    16. An audio stream synthesizing circuit according to claim 13, wherein the Gumbel distribution sampling circuit comprises: a lookup table circuit comprising samples from a Gumbel distribution; and a Pseudo Random Number Generator (PRNG) circuit configured to generate random numbers corresponding to addresses of the lookup table circuit.

    17. An audio stream synthesizing circuit according to claim 16, wherein the PRNG circuit comprises a Linear-Feedback Shift Register (LFSR) circuit configured to generate the random number.

    18. An audio stream synthesizing circuit according to claim 17, wherein the audio stream sample is an N-bit number, and the random number generated by the LFSR circuit is an M-bit random number, where M is less than N.

    19. An audio stream synthesizing circuit according to claim 13, further comprising: a data bus, wherein the audio stream synthesizing circuit is configured to receive the set of unnormalized log probabilities from the neural network in parallel using the data bus, wherein the adding circuit is configured to add the samples from the Gumbel distribution to the unnormalized log probabilities in parallel.

    20. An audio stream synthesizing circuit according to claim 19, wherein the audio stream sample is an N-bit number, and the data bus is configured to provide less than 2^N unnormalized log probabilities of the set of unnormalized log probabilities in parallel per clock cycle of the audio stream synthesizing circuit.

    21. An audio stream synthesizing circuit according to claim 13, wherein a value selecting module comprises a plurality of comparator circuits arranged as a comparator tree structure, each comparator circuit configured to compare two modified log probabilities and select the possible value of the audio stream sample associated with the largest modified log probability.

    22. An audio stream synthesizing circuit according to claim 13, wherein a clock cycle of the audio stream synthesizing circuit has a frequency of at least 250 MHz, wherein optionally an audio stream sample is generated from a set of unnormalized log probabilities in less than 200 ns, or less than 190 ns, 180 ns, or 170 ns.

    23. An audio stream synthesizing circuit according to claim 13, wherein the audio stream synthesizing circuit is implemented as Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).

    24-26. (canceled)

    Description

    BRIEF DESCRIPTION OF THE FIGURES

    [0047] Aspects of the present disclosure will be described, by way of example only, with reference to the following drawings, in which:

    [0048] FIG. 1 shows a block diagram representative of the WaveNet algorithm;

    [0049] FIG. 2 shows a block diagram of a method for generating a speech stream sample from a set of unnormalized log probabilities;

    [0050] FIG. 3 shows a block diagram of a speech stream synthesizing circuit according to an embodiment of the disclosure.

    DETAILED DESCRIPTION

    [0051] According to an embodiment of this disclosure, a speech stream synthesizing circuit 1 is provided. The speech stream synthesizing circuit 1 is configured to receive a set of unnormalized log probabilities for possible values of the speech stream sample from a neural network and generate a speech stream sample. The speech stream synthesizing circuit 1 may be provided as part of a Text To Speech (TTS) system for synthesizing human sounding speech from a text input. For example, according to one embodiment of the disclosure, the speech stream synthesizing circuit 1 may be provided as part of a system configured to implement the WaveNet algorithm.

    [0052] The WaveNet algorithm is an autoregressive neural network which has a general structure as shown in FIG. 1. The WaveNet algorithm generates audio waveforms. A speech stream is a sequence of integer values (x_t), or samples, which can have values in a given range that is defined by the audio bit depth. For example, a common speech waveform bit depth is 8, such that a given speech stream sample can have one of 256 possible values or levels ([0, 255]). Let x_t be the speech stream sample at step t in the sequence. Let h denote the audio features input to the WaveNet algorithm. The WaveNet model produces a probability distribution over all possible values given all previous samples and the audio features:

    $$p(x_{t+1} \mid x_1, \ldots, x_t, h)$$

    [0053] A single output value needs to be selected from this distribution in order to have a sequence of integer samples for the waveform. For example, for a speech stream with a bit depth of 8, p produces a vector of length 256 denoting the probability of each possible sample.

    [0054] In some embodiments, the audio features h may comprise one or more mel spectrograms. The mel spectrogram may be generated by a neural network. For example, in some embodiments the mel spectrograms may be generated by a Tacotron 2 neural network model. Methods for generating audio features h are well known to the skilled person, so are not discussed in detail herein. The audio features may be generated by other components of the speech synthesizer circuit 1, or may be provided to the speech synthesizer circuit 1 from another circuit.

    [0055] The Neural Network Core of the TTS system shown in FIG. 1 does not provide p such that it can be directly sampled. Rather, the Neural Network Core generates a set of unnormalized log probabilities (often known as logits). p can be calculated from the set of unnormalized log probabilities, for example by applying the softmax function, but it is computationally expensive to do so. The embodiments of the disclosure provide a method for sampling the set of unnormalized log probabilities which is equivalent to sampling p without the computational expense of calculating p.
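
    As an illustration of the two approaches contrasted above, the following Python sketch draws a value either by forming softmax weights and inverse-transform sampling, or by adding Gumbel noise to each logit and taking the largest. It is a minimal software sketch rather than the circuit of the disclosure; the function names, the `TINY` guard and the use of Python's `random` module are assumptions made for illustration.

```python
import math
import random

TINY = 1e-300  # illustrative guard that keeps log() away from zero (not from the source)


def sample_softmax(logits, rng):
    """Baseline: form softmax weights, then inverse-transform sample."""
    peak = max(logits)                                # subtract the peak for stability
    weights = [math.exp(l - peak) for l in logits]    # one exponential per possible value
    threshold = rng.random() * sum(weights)           # same as normalizing, then drawing u
    running = 0.0
    for value, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return value
    return len(logits) - 1                            # rounding guard at the top end


def sample_gumbel_max(logits, rng):
    """Gumbel-max: add Gumbel noise to each logit and keep the largest."""
    best_value, best_score = 0, -math.inf
    for value, logit in enumerate(logits):
        u = max(rng.random(), TINY)                   # uniform in (0, 1)
        score = logit - math.log(-math.log(u))        # logit + Gumbel sample
        if score > best_score:
            best_value, best_score = value, score
    return best_value
```

    Over many draws both functions return each value with approximately the same frequency, while the second needs only an addition and a comparison per logit once the noise samples are available.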

    [0056] FIG. 2 shows an overview block diagram of the process for converting the set of unnormalized log probabilities generated by the Neural Network Core of the TTS system into a speech stream sample. The value selection block receives the set of unnormalized log probabilities from the Neural Network Core and outputs an integer value as the next speech stream sample. The speech stream synthesizing circuit 1 is configured to provide the functionality of the value selection block shown in FIG. 2. Of course, the speech stream synthesizing circuit 1 may also include additional circuitry for implementing other parts of a TTS algorithm, for example other components configured to implement parts of a WaveNet algorithm (e.g. the Neural Network Core). As such, in some embodiments, the speech stream synthesizing circuit 1 may include all the components of a TTS system. In other embodiments the speech stream synthesizing circuit 1 may provide a circuit which is dedicated to performing the value selection block as part of a TTS system.

    [0057] FIG. 3 shows a block diagram of a circuit to implement the value selection block of FIG. 2. As such, FIG. 3 shows a diagram of a speech stream synthesizing circuit 1 according to an embodiment of the disclosure. The speech stream synthesizing circuit 1 is configured to generate a speech stream sample from a set of unnormalized log probabilities provided by a neural network. In the embodiment of FIG. 3, the speech stream synthesizing circuit 1 may be configured to generate a speech stream sample from a set of unnormalized log probabilities provided by a Neural Network Core via the WaveNet algorithm. The speech stream sample generated may be an N-bit number, where N is a positive integer. For example, in some embodiments, N may be at least: 64, 32, 16, 8 or 4.

    [0058] The speech stream synthesizing circuit 1 of FIG. 3 comprises a Gumbel distribution sampling circuit 10, an adding circuit 20 and a value selecting circuit 30. In some embodiments of the disclosure, for example as shown in FIG. 3, the speech stream synthesizing circuit may also comprise an input bus 40.

    [0059] The Gumbel distribution sampling circuit 10 is configured to generate a plurality of samples of the Gumbel distribution.

    [0060] The PRNG circuit 12 of the Gumbel distribution sampling circuit 10 is configured to generate a random number. In this disclosure, the term random number encompasses pseudo random numbers generated by a PRNG circuit 12 and the like. As such, the term random number includes numbers that are truly random, and also a sequence of pseudo random numbers.

    [0061] The PRNG circuit 12 of FIG. 3 may be implemented as a Linear-feedback Shift Register (LFSR) circuit. The LFSR circuit is configured to generate a (pseudo) random number. The random number generated is an M-bit positive integer. In the embodiment of FIG. 3, M=7, although in other embodiments other values may be used. The 7-bit random number may be generated using 13 shift registers arranged in an LFSR circuit. LFSR circuits are known to the skilled person and so are not further discussed herein. The length of the LFSR determines how often the PRNG sequence repeats. In the embodiment of FIG. 3, for a 7-bit number, a sequence of 13 shift registers produces a sequence of random numbers that are sufficiently random for the purposes of sampling a Gumbel distribution. As such, a (pseudo) random number may be generated by the LFSR circuit in a computationally efficient manner.
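
    A software model of such a PRNG might look like the following Python sketch. The disclosure does not state which feedback taps are used or how the 7 bits are taken from the 13 registers; the taps (13, 4, 3, 1), the use of the low 7 state bits and the function names are all assumptions chosen for illustration.

```python
def lfsr13_step(state):
    """Advance a 13-bit Fibonacci LFSR by one shift (assumed taps 13, 4, 3, 1)."""
    feedback = ((state >> 12) ^ (state >> 3) ^ (state >> 2) ^ state) & 1
    return ((state << 1) | feedback) & 0x1FFF        # keep 13 bits


def lfsr13_numbers(seed, count, m_bits=7):
    """Return `count` M-bit numbers read from the low bits of the LFSR state."""
    state = seed & 0x1FFF
    if state == 0:
        raise ValueError("an all-zero LFSR state never changes")
    mask = (1 << m_bits) - 1
    numbers = []
    for _ in range(count):
        state = lfsr13_step(state)                   # one shift per output (assumption)
        numbers.append(state & mask)
    return numbers
```

    With these taps the 13-bit state visits every non-zero value before repeating, so the sequence has a period of 8191 shifts. Because only one shift separates consecutive outputs in this sketch, neighbouring numbers share bits; a design could shift several times per output if that matters.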

    [0062] The Lookup Table circuit 14 comprises samples from a Gumbel distribution. In the embodiment of FIG. 3, the lookup table circuit 14 contains 2^M entries (where M is the bit depth of the random number generated by the PRNG circuit 12). In the embodiment of FIG. 3, M=7, and so the lookup table circuit 14 comprises 128 entries. Each entry in the lookup table circuit 14 is addressable by one of the 2^M random numbers generated by the PRNG circuit 12. Each entry in the lookup table circuit 14 comprises a sample from the Gumbel distribution. As such, where:

    $$y_i \in [0, 2^N - 1]$$

    is one of the random numbers generated by the PRNG circuit 12, the entries in the lookup table circuit 14 store values for:

    $$-\log\left(-\log\left(\frac{y_i}{2^N - 1}\right)\right).$$
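
    The table contents can be sketched in Python by evaluating the inverse Gumbel cumulative distribution, -log(-log(u)), at one point per table address. The mapping from address to u used here, (y + 0.5) / 2^M, is an assumption chosen so that the first and last entries stay finite; it is not taken from the disclosure, and the function names are hypothetical.

```python
import math

M_BITS = 7  # bit depth of the random number in the embodiment of FIG. 3


def build_gumbel_table(m_bits=M_BITS):
    """Precompute one Gumbel sample per possible M-bit random number."""
    size = 1 << m_bits
    table = []
    for y in range(size):
        u = (y + 0.5) / size                      # assumed midpoint mapping into (0, 1)
        table.append(-math.log(-math.log(u)))     # inverse Gumbel CDF
    return table


def gumbel_sample(table, random_number):
    """Use the random number as the table address."""
    return table[random_number % len(table)]
```

    Building the table once moves both logarithms out of the per-sample path, so drawing a Gumbel sample at run time reduces to a single table read.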

    [0063] In the embodiment of FIG. 3, each value may be stored in the lookup table circuit 14 as a block floating point (BFP) number. Block floating-point arithmetic is a form of floating-point arithmetic that can be used on fixed-point processors. For BFP numbers, a block of numbers is assigned a single exponent (rather than each number having its own exponent, as in floating-point). The exponent is typically determined by the number in the block with the largest magnitude. In the embodiment of FIG. 3, each value may be stored as at least a 24 bit BFP number. In some embodiments, each value may be stored as at least a 32, 64, or 128 bit BFP number. Of course, in other embodiments each value could be stored in other known data formats (e.g. fixed point representations or floating point representations). In the embodiment of FIG. 3, a BFP number is stored to allow for simplified addition of numbers in the adding circuit 20.
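
    To illustrate why a shared exponent simplifies addition, the Python sketch below encodes a block of values as signed integer mantissas with one exponent and then adds two members of the block using integer arithmetic only. The exact BFP layout is not given in the disclosure; the signed 24-bit mantissa, round-to-nearest quantization and the function names are assumptions.

```python
import math


def bfp_encode(values, mantissa_bits=24):
    """Quantize a block of floats to signed integer mantissas plus one shared exponent."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return [0] * len(values), 0
    _, exponent = math.frexp(largest)              # largest = f * 2**exponent, 0.5 <= f < 1
    shared_exp = exponent - (mantissa_bits - 1)    # largest magnitude fills the mantissa
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    mantissas = [min(hi, max(lo, round(math.ldexp(v, -shared_exp)))) for v in values]
    return mantissas, shared_exp


def bfp_decode(mantissas, shared_exp):
    """Convert mantissas back to floats using the shared exponent."""
    return [math.ldexp(m, shared_exp) for m in mantissas]


# Logits and Gumbel samples encoded as one block, so they share an exponent.
logits = [1.5, -2.25, 0.125]
gumbel = [0.5, 0.75, -1.0]
mantissas, shared_exp = bfp_encode(logits + gumbel)
sums = [a + b for a, b in zip(mantissas[:3], mantissas[3:])]   # integer adds only
modified = bfp_decode(sums, shared_exp)
```

    Because both operands carry the same exponent, no alignment shift is needed before the add; the sum may need one extra bit of headroom beyond the mantissa width.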

    [0064] The Gumbel distribution sampling circuit 10 is configured to output samples of the Gumbel distribution. The Gumbel distribution sampling circuit 10 is configured to output a sample of the Gumbel distribution for each of the 2^N unnormalized log probabilities used to generate a speech stream sample. The samples may be output from the Gumbel distribution sampling circuit 10 sequentially, or in parallel. In the embodiment of FIG. 3, the samples from the lookup table circuit 14 are output in parallel to the adding circuit 20. In some embodiments, the parallel output of the Gumbel distribution sampling circuit 10 may have the same number of outputs as the number of parallel inputs on the input bus 40. For example, in the embodiment of FIG. 3, the Gumbel distribution sampling circuit 10 outputs 120 samples of the Gumbel distribution in parallel to the adding circuit 20.

    [0065] In some embodiments, such as shown in FIG. 3, the set of unnormalized log probabilities are sent from the Neural Network Core over input bus 40. The Neural Network Core provides the set of unnormalized log probabilities and the associated possible values for the speech stream sample as an array. That is to say, the set of unnormalized log probabilities and the associated possible values for the speech stream sample are structured as an array. In some embodiments, the set of unnormalized log probabilities are provided as an array where the index of the array represents the possible value of the speech stream sample associated with each unnormalized log probability. In other embodiments, a two dimensional array could be used to provide the set of unnormalized log probabilities and the associated possible values for the speech stream sample.

    [0066] Input bus 40 is configured to transfer the set of unnormalized log probabilities generated by the Neural Network Core to the adding circuit 20. The input bus 40 is configured to transfer the set of unnormalized log probabilities in parallel. In some embodiments, all of the set of unnormalized log probabilities for calculating a speech stream sample are transferred in a single clock cycle of the speech stream synthesizer 1. In other embodiments, at least some of the set of unnormalized log probabilities are transferred in a single clock cycle of the speech stream synthesizer 1. In the embodiment of FIG. 3, the input bus 40 is configured to transfer 120 unnormalized log probabilities from the set to the adding circuit 20 each clock cycle, although other widths of data bus may also be used. As such, the input bus 40 is configured to transfer less than 2^N unnormalized log probabilities per clock cycle. Accordingly, the complete set of unnormalized log probabilities for a single speech stream sample are transferred over a plurality of clock cycles.
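
    For example, with N=8 there are 256 unnormalized log probabilities, so a 120-wide bus needs three clock cycles (120, 120 and 16 values). The Python sketch below models that split while keeping each value's array index alongside it. How a partially filled final transfer is handled in hardware is not stated in the disclosure; the function name and the short final group are assumptions.

```python
def bus_beats(logits, bus_width=120):
    """Split the logits into per-clock-cycle groups of (index, logit) pairs."""
    indexed = list(enumerate(logits))
    return [indexed[i:i + bus_width] for i in range(0, len(indexed), bus_width)]
```

    Each returned group stands for the values presented to the adding circuit in one clock cycle, and the stored index is the possible value of the speech stream sample associated with that logit.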

    [0067] Whilst the embodiment of FIG. 3 uses a parallel input bus 40 to transfer a set of unnormalized log probabilities over multiple clock cycles, in other embodiments a data bus may be provided to transfer the set of unnormalized log probabilities in a single clock cycle. In other embodiments, a serial data bus could be used to transfer the set of unnormalized log probabilities one at a time (i.e. one unnormalized log probability per clock cycle).

    [0068] Each unnormalized log probability transferred by the input bus 40 may be provided as a BFP number. In some embodiments, each unnormalized log probability may be provided as at least a 32, 64, or 128 bit BFP number. In some embodiments, each unnormalized log probability may be provided in the same format as the numbers generated by the Gumbel distribution sampling circuit 10. For example, in the embodiment of FIG. 3, each unnormalized log probability may be provided as a 24 bit BFP number.

    [0069] The adding circuit 20 is a circuit configured to add the plurality of samples of the Gumbel distribution to the set of unnormalized log probabilities to generate a set of modified log probabilities. In the embodiment of FIG. 3, the adding circuit 20 is configured to add the set of unnormalized log probabilities transferred by the input bus 40 to the samples of the Gumbel distribution output by the Gumbel distribution sampling circuit 10. In the embodiment of FIG. 3, the adding circuit 20 is configured to perform an element-wise addition of each of the unnormalized log probabilities to a respective sample from the Gumbel distribution sampling circuit 10. As discussed above, the input bus 40 and the Gumbel distribution sampling circuit 10 output data in parallel, with the same number of parallel outputs. Thus, the adding circuit 20 may efficiently perform the addition in an element-wise manner.

    [0070] The adding circuit 20 may comprise a plurality of adders. In the embodiment of FIG. 3, the adding circuit 20 comprises 120 adders. Each adder is configured to add a Gumbel sample to a respective unnormalized log probability. By adding a sample from the Gumbel distribution to each of the unnormalized log probabilities, a set of modified log probabilities is calculated. The set of modified log probabilities comprises 2^N modified log probabilities, with each modified log probability having an associated possible value for the speech stream sample. Where the set of unnormalized log probabilities is provided as an array, the set of modified log probabilities may also be generated as an array. As such, the index of the array of modified log probabilities may provide the associated possible value for the speech stream sample.

    [0071] In some embodiments, the adding circuit 20 may also comprise a modified log probability lookup table. The results of the adders may be stored in the modified log probability lookup table of the adding circuit 20 for output to the value selection circuit 30. Each modified log probability may be stored as a BFP number in the modified log probability lookup table. In the embodiment of FIG. 3, each of the modified log probabilities is stored in the same format as the respective unnormalized log probability input (i.e. 24-bit BFP), although in other embodiments any suitable format for the values may be used.

    [0072] The adding circuit 20 is configured to output the set of modified log probabilities to the value selection circuit 30. The adding circuit 20 may output the modified log probabilities in series, or in parallel. In the embodiment of FIG. 3, the adding circuit 20 outputs the modified log probabilities in parallel. For example, in the embodiment of FIG. 3, the adding circuit 20 outputs 120 modified log probabilities from the set of modified log probabilities per clock cycle. As such, the parallel output of the adding circuit 20 has the same width as the parallel inputs from the input bus 40 and the Gumbel distribution sampling circuit 10. Such a configuration allows for a computationally efficient implementation of the speech stream synthesizer 1.

    [0073] The value selection circuit 30 is configured to select the possible value of the speech stream sample associated with the largest modified log probability from the set of modified log probabilities as the speech stream sample.

    [0074] In the embodiment of FIG. 3, the value selecting circuit 30 comprises a plurality of comparator circuits. Each comparator circuit is configured to compare two modified log probabilities and select the possible value associated with the largest modified log probability. The plurality of comparator circuits are arranged as a comparator tree structure. In the comparator tree structure, the comparator circuits are arranged in a series of layers. The outputs of two comparator circuits in the first layer are used as inputs to a comparator circuit in the second layer. The comparator tree structure includes sufficient layers to allow a single possible value to be selected from all the inputs to the first layer. In the embodiment of FIG. 3, the value selecting circuit 30 comprises a first layer comprising at least 60 comparator circuits. Accordingly, all 120 values of the modified log probabilities may be compared by the value selecting circuit 30 at the same time. Seven layers are provided in the embodiment of FIG. 3 to reduce the number of modified log probabilities for consideration from 120 to 1.
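
    The layered reduction described above can be modelled in Python as repeated pairwise comparison, with each entry carrying its associated possible value. This is a behavioural sketch only; the tuple layout, the tie-breaking rule (the earlier input wins) and the pass-through of an unpaired input are assumptions.

```python
def comparator(a, b):
    """Two-input comparator: keep the (score, value) pair with the larger score."""
    return a if a[0] >= b[0] else b


def comparator_tree(pairs):
    """Reduce (score, value) pairs layer by layer; return (winner, layer_count)."""
    layer = list(pairs)
    if not layer:
        raise ValueError("at least one input is required")
    layers = 0
    while len(layer) > 1:
        reduced = [comparator(layer[i], layer[i + 1])
                   for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            reduced.append(layer[-1])     # an unpaired input passes to the next layer
        layer = reduced
        layers += 1
    return layer[0], layers
```

    For 120 inputs the layer sizes are 60, 30, 15, 8, 4, 2 and 1, which matches the seven layers of the embodiment of FIG. 3.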

    [0075] Where the complete set of modified log probabilities is provided to the value selecting circuit 30 over multiple clock cycles (such as in the embodiment of FIG. 3), the value selecting circuit may include a selected value lookup table to store the output of the comparator tree as an intermediate result from the value selecting circuit. As such, an output (a modified log probability and associated possible value for the speech stream sample) from the comparator tree for one clock cycle may be stored by the value selecting circuit 30 as an intermediate result for comparison against outputs calculated in subsequent clock cycles.
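
    A behavioural Python sketch of this multi-cycle selection is given below. Each group of pairs stands for the modified log probabilities delivered in one clock cycle, and a single stored pair plays the role of the intermediate result. The built-in `max` stands in for the comparator tree, and the function name and data layout are assumptions.

```python
def select_value(beats):
    """Pick the value with the largest score across several clock cycles.

    `beats` is an iterable of lists of (modified_log_probability, value) pairs,
    one list per clock cycle.
    """
    best = None                                          # stored intermediate result
    for beat in beats:
        cycle_best = max(beat, key=lambda pair: pair[0])  # stand-in for the comparator tree
        if best is None or cycle_best[0] > best[0]:
            best = cycle_best                            # keep the running winner
    if best is None:
        raise ValueError("no modified log probabilities were provided")
    return best[1]
```

    Only one score and one value need to be held between cycles, regardless of how many cycles the complete set takes to arrive.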

    [0076] The value selecting circuit 30 is configured to keep track of the possible value for the speech stream sample value associated with each modified log probability. In some embodiments, the index of each modified log probability provided as part of the array to the value selecting circuit 30 may be stored along with its associated modified log probability in each layer. For example, the index and associated modified log probability may be stored in one or more lookup tables.

    [0077] The value selecting circuit 30 is configured to select a final value for the speech stream sample based on the largest modified log probability from the set of modified log probabilities and output the final value as the next speech stream sample. The speech stream sample is output as an N bit number. This process is statistically equivalent to sampling from the distribution p which is derivable from the set of unnormalized log probabilities provided by the Neural Network core.
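
    The equivalence relied on here is the standard Gumbel-max property. Writing l_i for the unnormalized log probabilities, g_i for independent standard Gumbel samples and u_i for uniform random numbers (symbols introduced here for illustration only), it can be stated as:

```latex
g_i = -\log\left(-\log u_i\right), \qquad u_i \sim \mathrm{Uniform}(0, 1)

\Pr\left[\arg\max_i \left(l_i + g_i\right) = k\right]
  = \frac{e^{l_k}}{\sum_j e^{l_j}}
  = p(x_{t+1} = k \mid x_1, \ldots, x_t, h)
```

    The identity is exact for continuous Gumbel samples; with samples drawn from a finite lookup table the match is approximate.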

    [0078] The speech stream synthesizing circuit 1 may be configured to calculate a plurality of speech stream samples over time. As such, the speech stream synthesizing circuit 1 may repeat the functionality described above in order to generate a continuous stream of speech samples. The speech stream samples generated may resemble human speech due to the statistical method used to sample the unnormalized log probabilities calculated by the Neural Network Core.

    [0079] In some embodiments, the speech stream synthesizer circuit 1 may be implemented on a Field Programmable Gate Array (FPGA). For example, a Gumbel distribution sampling circuit 10 for a single value may be implemented on an FPGA using 20 Flip-Flops (FF) and 49 Lookup Tables (LUT). In the embodiment of FIG. 3, where 120 samples of the Gumbel distribution are calculated in parallel, the Gumbel distribution sampling circuit 10 uses 2400 FF and 5880 LUT to operate on the full 120 data values on a single clock cycle. Each LUT is able to store 64 bits of information.

    [0080] In the embodiment of FIG. 3, the comparator tree of the value selection circuit 30 may be implemented to store (64+32+16+8+4+2+1) intermediate results for both modified log probabilities and associated index (or possible value of speech stream sample), at 32 bits each. Additional register stages may also be provided to improve pipelining. As such, the comparator tree of the value selection circuit 30 may be implemented using 32 bits*(64+32+16+8+4+2+1)*2 = 8,576 FFs (assuming 2 pipeline stages per comparator).

    [0081] The adding circuit 20 can be implemented by reusing logic in the rest of the speech stream synthesizer circuit 1. For example, in some embodiments, a speech stream synthesizer circuit 1 may comprise an adding circuit which is configured to perform other computational operations. The adding circuit 20 can thus be implemented by time sharing the use of the adding circuit 20 with other parts of the speech stream synthesizing circuit 1. For example, in some embodiments, the adding circuit 20 may be provided as part of a circuit configured to perform dot product matrix operations. As such, the elementwise addition of the adding circuit 20 may be performed by time sharing an adding circuit 20 with other parts of the speech stream synthesizing circuit 1. Of course, in other embodiments of the disclosure, the speech stream synthesizing circuit 1 may include an adding circuit 20 which is dedicated to the elementwise addition step.

    [0082] Accordingly, the speech stream synthesizing circuit 1 of FIG. 3 may be configured to process 120 unnormalized log probabilities in parallel using circuit resources of around 8,816 FF and 5,880 LUT. This circuit size is much smaller than that which would be required to sample 120 unnormalized log probabilities values using the softmax function and inverse transform sampling due, at least in part, to the reduction in the number of computing operations performed by methods according to this disclosure.

    [0083] In other embodiments, a speech stream synthesizing circuit 1 designed to work on a single data width bus would require approximately 84 FFs and 49 LUTs - an extremely small circuit suitable for an embedded application.

    [0084] Latency is also critical for a WaveNet implementation, as the full network has a stringent latency budget of 62.5 μs to complete so that the result can feed back into the next input of the computation. The speech stream synthesizer circuit 1 implementation of FIG. 3 is very fast. A computational model of the speech stream synthesizer circuit 1 discussed above estimates that a set of unnormalized log probabilities (N=8) can be processed in 20 clock cycles. At a clock rate of 250 MHz this requires 160 ns of additional computation from the 62.5 μs budget.

    [0085] Accordingly, a speech stream synthesizing circuit 1 is provided. The speech stream synthesizing circuit 1 is capable of generating speech stream samples from a set of unnormalized log probabilities by sampling the set of unnormalized log probabilities with low latency. That is to say, the speech stream synthesizing circuit 1 calculates each speech stream sample within a timeframe suitable for outputting e.g. 16 kHz bandwidth audio. For example, in some WaveNet implementations, the complete TTS system may have a latency budget of about 62.5 μs to complete the generation of a speech stream sample. The speech stream synthesizing circuit 1 in the embodiment of FIG. 3 can generate a speech stream sample in about 20 clock cycles overall. Thus, for a clock cycle frequency of around 250 MHz, the speech stream synthesizing circuit 1 can generate speech stream samples from a set of unnormalized log probabilities in around 160 ns. As such, the speech stream synthesizing circuit 1 provides value selection functionality for a TTS system using a computationally efficient hardware implementation.

    [0086] Next, a method of synthesizing a speech stream sample using a processor will be described with reference to FIG. 3. As such, the method described below may be performed by the speech stream synthesizing circuit 1 described above.

    [0087] The method comprises generating a set of unnormalized log probabilities for possible values of the speech stream sample using a neural network. As described above, a Neural Network Core may generate a set of unnormalized log probabilities that are provided to the speech stream synthesizing circuit 1 by the input bus 40.

    [0088] The method also comprises sampling a Gumbel distribution for each of the unnormalized log probabilities of the set of unnormalized log probabilities. The Gumbel distribution samples may be generated by the Gumbel distribution sampling circuit 10 discussed above.

    [0089] The method also comprises adding the samples from the Gumbel distribution to each of the respective unnormalized log probabilities to generate a set of modified log probabilities. The adding of the samples may be performed by the adding circuit 20 described above.

    [0090] The method also comprises selecting the possible value of the speech stream sample with the largest modified log probability from the set of modified log probabilities as the speech stream sample. This step may be performed by the value selection circuit 30 discussed above.

    [0091] The method according to embodiments of this disclosure is not limited to the speech stream synthesizing circuit 1 discussed above. For example, the method according to embodiments of this disclosure may be performed by a processor such as a central processing unit (CPU). As such, it will be appreciated that methods according to this disclosure may be performed on dedicated hardware (e.g. a hardware accelerator), or methods may be performed using a software implementation. For example, methods according to the disclosure may be performed by a processor (e.g. a CPU) executing a set of instructions stored in a memory.

    [0092] It will also be appreciated that the embodiments in this description relate to the generation of a speech stream sample by a speech stream synthesizing circuit 1. It will be appreciated that the present disclosure is not limited to the synthesis of speech stream samples as discussed above. As such, the skilled person will appreciate that the methods and systems of this disclosure may equally be applied to the synthesis of audio samples from a set of unnormalized log probabilities provided by a neural network. For example a neural network may provide a set of unnormalized log probabilities for the synthesis of audio samples including: music samples, speech samples, or noise cancellation samples.