Hardware accelerated discretized neural network

Abstract

An innovative low-bit-width device may include a first digital-to-analog converter (DAC), a second DAC, a plurality of non-volatile memory (NVM) weight arrays, one or more analog-to-digital converters (ADCs), and a neural circuit. The first DAC is configured to convert a digital input signal into an analog input signal. The second DAC is configured to convert a digital previous hidden state (PHS) signal into an analog PHS signal. The NVM weight arrays are coupled to the first DAC and the second DAC and are configured to compute vector matrix multiplication (VMM) arrays based on the analog input signal and the analog PHS signal. The one or more ADCs are coupled to the plurality of NVM weight arrays and are configured to convert the VMM arrays into digital VMM values. The neural circuit is configured to process the digital VMM values into a new hidden state.

Claims

1. A method comprising: converting a digital input signal into an analog input signal; computing, using a plurality of non-volatile memory (NVM) weight arrays, a plurality of vector matrix multiplication (VMM) arrays based on the analog input signal, wherein multiple parallel NVM cells in the plurality of NVM weight arrays represent one synaptic weight element; converting the VMM arrays into digital VMM values; processing the digital VMM values through at least one activation function unit, wherein processing the digital VMM values calculates a new memory cell state; and feeding the new memory cell state as an input for processing the digital VMM values during a next cycle.

2. The method of claim 1, wherein processing the digital VMM values through at least one activation function unit comprises: processing the digital VMM values into a forget gate value, an input gate value, an output gate value, and a new candidate memory cell value; calculating a hidden state based on the forget gate value, the input gate value, the output gate value, and the new candidate memory cell value; and feeding the hidden state as an input for computing the plurality of VMM arrays during the next cycle.

3. The method of claim 2, wherein calculating the hidden state is further based on the new memory cell state of a current cycle.

4. The method of claim 1, wherein the multiple parallel NVM cells include at least three parallel NVM cells per one synaptic weight element.

5. The method of claim 1, further comprising: averaging, before ADC quantization, redundant runs of the plurality of VMM arrays.

6. The method of claim 5, wherein at least three redundant runs are used for averaging redundant runs of the plurality of VMM arrays.

7. The method of claim 5, wherein at least five redundant runs are used for averaging redundant runs of the plurality of VMM arrays.

8. The method of claim 1, further comprising: averaging redundant runs of a plurality of activation function units to determine a plurality of activation function unit values; and processing the plurality of activation function unit values through element-wise calculations to calculate a hidden state.

9. The method of claim 1, wherein the NVM weight arrays comprise resistive cross-point arrays.

10. A device comprising: at least one digital-to-analog converter (DAC) configured to convert a digital input signal into an analog input signal; a plurality of non-volatile memory (NVM) weight arrays configured to compute a plurality of vector matrix multiplication (VMM) arrays based on the analog input signal, wherein: the plurality of NVM weight arrays is coupled to the at least one DAC; and multiple parallel NVM cells in the plurality of NVM weight arrays represent one synaptic weight element; at least one analog-to-digital converter (ADC) coupled to the plurality of NVM weight arrays, wherein the at least one ADC is configured to convert the VMM arrays into digital VMM values; and a neural circuit configured to: process the digital VMM values through at least one activation function unit, wherein processing the digital VMM values calculates a new memory cell state; and feed the new memory cell state as an input for processing the digital VMM values during a next cycle.

11. The device of claim 10, wherein the plurality of NVM weight arrays comprises a plurality of resistive cross-point arrays.

12. The device of claim 10, wherein: at least one array from the plurality of NVM weight arrays includes a plurality of junctions; and each junction of the plurality of junctions includes multiple parallel NVM cells.

13. The device of claim 10, wherein the multiple parallel NVM cells include at least three parallel NVM cells per one synaptic weight element.

14. The device of claim 10, wherein: the at least one ADC comprises a plurality of ADCs; the neural circuit comprises a plurality of activation function units coupled to the plurality of ADCs; and the plurality of activation function units is configured to receive and process the digital VMM values.

15. The device of claim 14, wherein: the neural circuit further comprises arithmetic circuitry coupled to the plurality of activation function units; and the arithmetic circuitry is configured to generate a hidden state based on an output received from each activation function unit of the plurality of activation function units.

16. The device of claim 15, further comprising: at least one averaging component within the neural circuit and configured to average redundant runs of each activation function unit of the plurality of activation function units to determine a plurality of activation function unit values as output to the arithmetic circuitry.

17. The device of claim 10, further comprising: at least one analog integrate and average component situated between the plurality of NVM weight arrays and the at least one ADC and configured to average, before ADC quantization, redundant runs of the plurality of VMM arrays.

18. The device of claim 17, wherein the at least one analog integrate and average component is configured to average at least three redundant runs for each VMM array of the plurality of VMM arrays.

19. The device of claim 17, wherein the at least one analog integrate and average component is configured to average at least five redundant runs for each VMM array of the plurality of VMM arrays.

20. A circuit, comprising: means for converting a digital input signal into an analog input signal; means for computing, using a plurality of non-volatile memory (NVM) weight arrays, a plurality of vector matrix multiplication (VMM) arrays based on the analog input signal; means for converting the VMM arrays into digital VMM values; means for processing the digital VMM values through at least one activation function unit to calculate a new memory cell state; and means for feeding the new memory cell state as an input for processing the digital VMM values during a next cycle.

Description

BRIEF DESCRIPTION OF DRAWINGS

(1) The techniques introduced herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.

(2) FIG. 1 depicts a schematic of an example NVM cross-point array configured to accelerate vector matrix multiplication using Ohm's law.

(3) FIG. 2 depicts an architecture of an example NVM weight array accelerated LSTM unit.

(4) FIG. 3 depicts an example Penn Treebank dataset result.

(5) FIG. 4 depicts an example Penn Treebank dataset result with an exploration of the bit-widths of weight and ADC/DAC.

(6) FIG. 5 depicts a graph showing the example effect of ADC noise on the Penn Treebank embodiment.

(7) FIG. 6 depicts a graph showing the example effect of weight noise on the Penn Treebank embodiment.

(8) FIG. 7 depicts a schematic of a further example architecture of an NVM weight array accelerated LSTM unit that uses redundant runs to address ADC noise.

(9) FIG. 8 depicts example results of the method of using redundant runs to address ADC noise.

(10) FIG. 9 depicts various example configurations for suppressing weight noise.

(11) FIG. 10 depicts a graph showing example results obtained from using multiple parallel NVM cells per weight to address the weight noise effect.

(12) FIG. 11 depicts a flowchart of an example method for quantized processing of inputs.

(13) FIG. 12 depicts a flowchart of an example method for calculating a hidden state.

(14) FIG. 13 depicts a graph showing example classification accuracy relative to bit precision.

(15) FIG. 14 depicts an example low-bit-width processing architecture.

DESCRIPTION

(16) This application discloses an innovative low-bit-width architecture that includes systems, methods, and other aspects that can be trained, process inputs, and provide predictions efficiently. An example implementation includes an LSTM unit based on NVM (non-volatile memory) weight arrays that can accelerate VMM (vector matrix multiplication) operations. Innovative aspects on the bit precision of the NVM weights and periphery circuit components (ADCs and DACs) are disclosed, as are approaches for addressing noise effects coming from the real hardware device. Various circuits are also provided for various disclosed implementations of a quantized LSTM unit.

(17) Beneficially, the technology described herein can effectively quantize LSTM neural networks and includes a hardware design that provides state-of-the-art machine learning while lowering memory size and computation complexity. Specifically, by way of example, the NVM weights, analog-to-digital converter(s) (ADCs), digital-to-analog converters (DACs), and NVM cross-point arrays described herein can accelerate the VMM operations that are heavily used in most machine learning algorithms for artificial neural networks, including but not limited to LSTM, CNN and MLP. However, it should be understood that the innovative technology described herein is generally applicable to any type of non-volatile memory architecture, such as but not limited to NAND-type flash memory, NOR-type flash memory, phase-change random access memory (PCRAM), resistive random-access memory (ReRAM), spin-transfer torque random access memory (STT-RAM), magnetoresistive random-access memory (MRAM), Ferroelectric RAM (FRAM), phase change memory (PCM), etc.

(18) While natural language processing is discussed in various implementations provided herein, the technology is applicable to a variety of use cases, such as speech recognition, natural language processing, signal processing and interpretation, data security, general classification, image recognition, recommendations, and prediction, and can receive and process any suitable inputs for such use cases. By way of example, the quantized architecture described herein can be configured to receive and interpret data streams, sensor data, and/or other data inputs and process them to provide contextually relevant predictions, such as behavioral predictions. For instance, the technology may be implemented as hardware and/or software in a portable electronic device that is coupled to one or more sensors. In further examples, the quantized architecture can be used for video analysis, hand-written digit stroke recognition, human activity recognition, etc.

(19) In a more specific example, a quantized LSTM device, as described herein, may be embedded in a client device to provide it with more robust artificial intelligence (AI) functionality. Such an implementation would, for instance, not require the device to have a network data connection to transmit data over the Internet to a server (e.g., to the cloud) so the data can be processed with machine learning logic. Instead, a device equipped with a quantized LSTM device can beneficially provide offline AI functionality, unlike current digital assistant solutions (e.g., Siri™, Google Voice™, Alexa™, etc.), which are unable to function when network instability or interruptions occur. Moreover, devices equipped with such low-power embedded hardware can run deep neural network algorithms directly on power-limited and/or processing-limited systems, such as mobile devices and self-driving cars.

(20) Example sensors may include, but are not limited to, photo sensors, gyroscopes, accelerometers, heart rate monitors, position sensors, touch sensors, capacitive sensors, thermometers, sound sensors, light sensors, proximity sensors, thermocouples, motion sensors, transceivers, etc. Example devices coupled to and/or including the sensors and/or the quantization-aware devices processing the sensor data from the sensors may include, but are not limited to storage drives, portable electronic devices (e.g., personal computers, tablets, phones, wearables, digital assistants), voice activated devices, Internet-of-things (IOT) devices, vehicle computers, servers, storage racks, etc.

(21) The technology may receive input from the one or more sensors, efficiently process the inputs with the low-bit-width architecture described herein, learn from the processed inputs, and provide predictions based on the processing. In some cases, an implementation may receive and process raw or pre-processed sensor data received from the one or more sensors, although other variations are also possible.

(22) As a further example, FIG. 13 depicts a graph 1300 depicting classification accuracy of a quantization-aware trained prediction unit according to the implementations described herein. In particular, the prediction unit was trained with sensor data reflecting six daily activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying, and then used to classify new sensor data reflecting the various different activities of a person. As shown in the graph, the prediction unit is capable of providing much more accurate (e.g., 70%+ in this use case) predictions of user activity using an architecture with low bit precision (e.g., 1-5 in this case), whereas the background art (which uses full precision numbers for training, floating point baseline) was unable to provide similarly accurate predictions at comparable low bit width/precision levels.

(23) FIG. 2 depicts an example of a quantized LSTM device 200, and FIG. 14 depicts a further example architecture of the device 200. As shown in FIG. 2, the device 200 may include a plurality of DACs 208a . . . 208n (also individually or collectively 208), such as a first DAC configured to convert a digital input signal 204 into an analog input signal and a second DAC configured to convert a digital previous hidden state (PHS) signal 206 into an analog PHS signal. The DACs 208 may be coupled to provide input to a plurality of memory arrays 212a . . . 212n (also individually or collectively 212). In some embodiments, each memory array 212 may be coupled to a DAC 208 (dedicated or shared) to receive input therefrom. For example, one DAC 208 may supply all of the memory arrays 212, each memory array 212 may have a dedicated DAC 208, or some memory arrays 212 may share a DAC 208.

(24) The plurality of memory arrays 212a . . . 212n may be coupled to a plurality of ADCs 216a . . . 216n (also individually or collectively 216), and the plurality of ADCs 216a . . . 216n may be coupled to a plurality of activation components 218a . . . 218n. Advantageously, various components of the device 200 may be quantized. For instance, an output of one or more of the first DAC, the second DAC, the plurality of NVM weight arrays, and the ADCs may be quantized to various degrees, as discussed elsewhere herein (e.g., to about 4 bits or less).

(25) In some embodiments, the activation components 218a . . . 218n may be the same components or different components. As depicted, the activation components 218 comprise a forget gate 218a, an input gate 218b, a new candidate memory cell 218c, and an output gate 218n. The forget gate 218a, the input gate 218b, the new candidate memory cell 218c, and the output gate 218n may be connected to logic units that perform operations on their output.

(26) As further shown in FIG. 14, architecture 1400 of the device 200 may include logical circuitry 1414 (e.g., multiplier and adder circuitry in this embodiment) that is coupled to the memory arrays 212a . . . 212n to receive their output and process it. The multiplier and adder circuitry 1414 may be coupled to the buffer array 1402 to store data. In some embodiments, the multiplier and adder circuitry 1414 may store the states it computes in the buffer array 1402 for access by the DACs 208 and/or other components. The multiplier and adder circuitry 1414 may be coupled to the activation function unit 1426, which may comprise the activation components (such as the activation components 218 in FIG. 2), and which may send signals to and receive signals from the multiplier and adder circuitry 1414.

(27) Returning to FIG. 2, the activation components 218, the arithmetic circuitry 240 (which may comprise the multipliers, adders, and/or any other suitable logic), and/or other suitable components may collectively make up a neural circuit 214 that provides the machine learning functionality described herein, in association with the other components that are described. For instance, the neural circuit 214 may be configured to process the digital VMM values into a new hidden state.

(28) While the implementations depicted in FIGS. 2 and 14 reflect a device in which the components are coupled by a communications bus, wiring, and/or other connection components, it should be understood that other variations are contemplated where one or more of the components may be distributed across devices and coupled via a network using networking hardware.

(29) In the implementation depicted in FIG. 2, which reflects the architecture of an example NVM weight array-accelerated LSTM unit, the forget gate 218a may be connected to multiplier 226, the input gate 218b and the new candidate memory cell 218c may be coupled to multiplier 228, and the output gate 218n may be coupled to multiplier 230. The multipliers 226, 228, and 230 respectively perform multiplication operations and provide their output to downstream components to which they are coupled. In particular, multipliers 226 and 228 provide their output to adder 232, which in turn performs addition on the output and provides it to a scaling component 220 (e.g., tanh function). The tanh function can scale the output and output it as a new memory cell state 222. The new memory cell state is communicated to the multiplier 230, which multiplies it with the output of the output gate 218n. The output of the multiplier 230 embodies a new hidden state 224 which is provided as an input for the next operational cycle (206). The new memory cell state 222 is also provided as an input to the multiplier 226 for the next operational cycle.

(30) As shown by the shading in FIG. 2, certain elements of the quantized LSTM device 200 can be quantized. The various example quantization levels are further described below. In a more specific non-limiting example, as demonstrated in two example natural language processing tasks described herein, a 4 bit NVM weight cell along with at least a 2 bit ADC/DAC in the LSTM unit can deliver performance comparable to that of a floating-point baseline. For a simpler dataset for character level prediction, a 2 bit NVM weight cell along with a 2 bit ADC/DAC also does not show noticeable degradation in performance. While ADC read noise and NVM weight noise can both harm the training results, these issues can be addressed using filters, employing redundant runs, and/or using multiple parallel NVM cells as one synaptic weight element, which can average out the weight noise caused by device variations.

(31) Forward and backward propagation may be used in the quantized LSTM device 200 during training or inference. For instance, and not by way of limitation, during training and inference, forward propagation may be used to quantize the weights, internal activations (e.g., ADCs), and input/output (e.g., DACs). Additionally or alternatively, during training, backward propagation may be implemented using a straight-through estimator (STE) to propagate the gradients (using a floating-point number for a weight update).
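As a minimal sketch of the straight-through estimator idea described above, the following Python example trains a toy linear model whose forward pass uses quantized weights while the update is applied to floating-point weights. The quantizer (uniform, fixed range of -1 to 1), the identity-matrix inputs, the learning rate, and all names are assumptions made for illustration only and are not taken from the disclosure.

```python
import numpy as np

def quantize(w, bits=2, lo=-1.0, hi=1.0):
    """Hypothetical uniform quantizer with 2**bits levels on [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((np.clip(w, lo, hi) - lo) / step) * step

# Toy problem (assumed): identity inputs, targets representable at 2 bits.
X = np.eye(4)
w_true = np.array([1.0, -1.0, 1.0 / 3.0, -1.0 / 3.0])
y = X @ w_true

w_fp = np.zeros(4)   # floating-point weights kept for the update
lr = 0.1
for _ in range(200):
    w_q = quantize(w_fp)                 # forward pass uses quantized weights
    err = X @ w_q - y
    grad_wq = X.T @ err / len(X)         # gradient w.r.t. the quantized weights
    w_fp = np.clip(w_fp - lr * grad_wq, -1.0, 1.0)  # STE: d(w_q)/d(w_fp) treated as 1

final_loss = float(np.mean((X @ quantize(w_fp) - y) ** 2))
```

The rounding step has zero gradient almost everywhere, so the sketch simply passes the gradient through it unchanged and accumulates small updates in `w_fp` until a quantization boundary is crossed.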

(32) In an example hardware-accelerated quantized LSTM embodiment, the forward propagation operation of the LSTM unit contains 4 vector-matrix multiplications, 5 nonlinear activations, 3 element-wise multiplications, and 1 element-wise addition. As shown in Equations (1)-(4), the hidden state of the previous time step h.sub.t-1 is concatenated with the input of the current step x.sub.t to form the total input vector that is fed into the weight arrays W.sub.f, W.sub.i, W.sub.o and W.sub.c to perform the VMM. The VMM results can be passed into 4 nonlinear activation function units 218, respectively, to get the values of the forget gate f.sub.t, input gate i.sub.t, output gate o.sub.t and new candidate memory cell c_c.sub.t. The new memory cell c.sub.t is comprised of the new information desired to be added, obtained by multiplying the new candidate memory cell c_c.sub.t with the input gate i.sub.t, and the old information desired to be retained, obtained by multiplying the old memory cell c.sub.t-1 with the forget gate f.sub.t, as shown in Equation (5). The final hidden state h.sub.t is calculated by the multiplier 230 by multiplying the output gate o.sub.t and the activation of the new memory cell c.sub.t, as shown in Equation (6). During backpropagation, the values of W.sub.f, W.sub.i, W.sub.o and W.sub.c are updated according to the training algorithm, usually based on stochastic gradient descent.
$$f_t = \mathrm{sigmoid}([x_t, h_{t-1}]\, W_f) \tag{1}$$
$$i_t = \mathrm{sigmoid}([x_t, h_{t-1}]\, W_i) \tag{2}$$
$$o_t = \mathrm{sigmoid}([x_t, h_{t-1}]\, W_o) \tag{3}$$
$$c\_c_t = \tanh([x_t, h_{t-1}]\, W_c) \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot c\_c_t \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
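The forward step in Equations (1)-(6) can be sketched in floating-point NumPy as follows. This is an illustrative software model only, without the DAC, NVM array, or ADC stages; the function names, the array sizes, and the random weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c):
    """One forward step following Equations (1)-(6)."""
    xh = np.concatenate([x_t, h_prev])   # total input vector [x_t, h_{t-1}]
    f_t = sigmoid(xh @ W_f)              # (1) forget gate
    i_t = sigmoid(xh @ W_i)              # (2) input gate
    o_t = sigmoid(xh @ W_o)              # (3) output gate
    cc_t = np.tanh(xh @ W_c)             # (4) new candidate memory cell
    c_t = f_t * c_prev + i_t * cc_t      # (5) new memory cell
    h_t = o_t * np.tanh(c_t)             # (6) new hidden state
    return h_t, c_t

# Usage with assumed sizes: 3 inputs, 4 hidden units, 5 time steps.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n_in + n_hid, n_hid))
                      for _ in range(4))
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W_f, W_i, W_o, W_c)  # h and c feed the next cycle
```

The step uses 4 vector-matrix products, 5 nonlinear activations (3 sigmoid, 2 tanh), 3 element-wise multiplications, and 1 element-wise addition, matching the operation count given above.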

(33) In an example NVM weight array-accelerated LSTM unit, the 4 vector-matrix multiplications that calculate the forget gate, input gate, output gate, and new candidate memory cell can be accelerated by NVM weight arrays, as shown in FIG. 2. Four (4) weight arrays representing W.sub.f, W.sub.i, W.sub.o and W.sub.c can be concatenated into a whole NVM array to calculate the VMM results in parallel. Because the input x.sub.t 204 and the previous hidden state h.sub.t-1 206 are in the form of analog voltages after being processed by the DACs 208, and the NVM weight arrays 212 are resistive cross-point arrays, the VMM results are in the form of analog currents that can go through the ADCs 216 to be converted into digital voltages. The digital voltages representing the VMM results can then be fed into different activation function units 218 (either sigmoid or tanh) to get the final values of the forget gate f.sub.t, input gate i.sub.t, output gate o.sub.t and new candidate memory cell c_c.sub.t, which can later be processed in other hardware components to generate the new hidden state h.sub.t (224). The new hidden state can then be fed into the DAC(s) 208 in the next cycle as part of the total input vector.
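The cross-point computation described above can be illustrated with an idealized model in which each cell contributes a current equal to its input voltage times its conductance (Ohm's law) and the currents on each column add up. The sketch below assumes ideal cells, arbitrary units, and non-negative conductances; any mapping of signed weights onto conductances is outside this sketch, and the names are hypothetical.

```python
import numpy as np

def crosspoint_vmm(v_in, G):
    """Column currents of an ideal cross-point array: I_j = sum_i V_i * G_ij."""
    return v_in @ G

n_in, n_hid = 3, 2
rng = np.random.default_rng(1)

# Four conductance arrays standing in for W_f, W_i, W_o and W_c (assumed values).
G_f, G_i, G_o, G_c = (rng.uniform(0.0, 1.0, size=(n_in + n_hid, n_hid))
                      for _ in range(4))

# Concatenate along the columns so one read yields all four VMM results.
G_all = np.hstack([G_f, G_i, G_o, G_c])

v = rng.uniform(0.0, 1.0, size=n_in + n_hid)   # analog voltages for [x_t, h_{t-1}]
i_all = crosspoint_vmm(v, G_all)               # analog column currents
i_f, i_i, i_o, i_c = np.split(i_all, 4)        # per-gate currents before the ADCs
```

Because the four arrays share the same input rows, concatenating them column-wise gives the same per-gate results as four separate multiplications.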

(34) Advantageously, a quantized LSTM neural network based on the NVM array architecture can provide accuracy performance that is comparable with that of a floating-point baseline (32 bit) implementation, even when lower bit-width NVM cells along with ADC/DACs are used. This beneficially can reduce costs and resource utilization as typically the higher the bit-width of the ADC or DAC, the higher the cost and area/power consumption. Further, in an NVM-specific implementation in which there may be limitations on the available number of stable resistance states on a single NVM cell, the technology described herein can lower the quantization bit precision of the weights. This enables use of a wider class of NVMs, including those NVMs typically not suited for high-precision bit level (e.g., 32-bit) implementations. As mentioned above, even though ReRAM and PCM can achieve almost continuous incremental conductance change, achieving 32 bit precision of weight is not realistic, while MRAM and NOR Flash are mostly binary-type memory cells.

(35) Depending on the implementation, the output of some or all of the highlighted blocks (associated with the “quantized” label at the bottom) in FIG. 2 can be quantized to a value less than 32 bit, such as between 1 and 16 bit, such as 8 bit, 4 bit, 2 bit, or any other suitable value. In some embodiments, 4 bit or a value less than 4 bit is used. Note that in some embodiments the outputs of the activation units 218 may also be quantized naturally by the digital circuits that implement the activation functions, such as lookup tables (LUTs).
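To make the bit-width figures concrete, the following sketch shows one possible uniform quantizer with 2^bits levels over a fixed range. The disclosure does not specify a quantization scheme, so the uniform spacing, the clipping range, and the function name are assumptions.

```python
import numpy as np

def quantize_uniform(x, bits, lo, hi):
    """Map x onto 2**bits evenly spaced levels in [lo, hi] (assumed scheme)."""
    levels = 2 ** bits - 1                       # number of steps between lo and hi
    step = (hi - lo) / levels
    x = np.clip(x, lo, hi)                       # saturate out-of-range values
    return lo + np.round((x - lo) / step) * step

w = np.linspace(-1.0, 1.0, 101)
w_4bit = quantize_uniform(w, bits=4, lo=-1.0, hi=1.0)   # 16 levels, e.g. NVM weights
w_2bit = quantize_uniform(w, bits=2, lo=-1.0, hi=1.0)   # 4 levels, e.g. ADC/DAC values
```

With this scheme a 4 bit value takes one of 16 levels and a 2 bit value one of 4 levels, which is the sense in which the bit-widths above bound the number of stable states a cell or converter must resolve.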

(36) Example Bit Precision Requirement on LSTM Weight Array and Circuit Components.

(37) To evaluate the performance of an example implementation of the disclosed quantized LSTM neural network based on the NVM array architecture, various natural language processing tasks may be used, such as Penn Treebank and national name prediction. As described herein, various example bit precisions of the weights and ADC/DACs were used and compared with a floating-point baseline. The input embeddings and output embeddings may or may not be quantized depending on the use case.

(38) Penn Treebank.

(39) The Penn Treebank dataset, in the following example, contains 10K unique words from Wall Street Journal material annotated in Treebank style. As with the Treebank corpus, the task is to predict the next word so the performance is measured in perplexity per word (PPW). The perplexity is roughly the inverse of the probability of correct prediction. The hidden state size is fixed at 300.
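As an illustration of the perplexity metric, the sketch below uses the common definition of perplexity as the exponential of the average negative log-probability assigned to the correct next token. The disclosure only describes the metric informally, so this exact formula and the function name are assumptions.

```python
import numpy as np

def perplexity(p_correct):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    p = np.asarray(p_correct, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))

# A model that always gives the correct word probability 0.25 has perplexity 4.
ppw_uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Mixed probabilities: the geometric mean of 0.5 and 0.125 is 0.25, so again 4.
ppw_mixed = perplexity([0.5, 0.125])
```

Under this definition the perplexity equals the inverse of the geometric mean of the probabilities given to the correct predictions, consistent with the informal description above.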

(40) FIG. 3 depicts a Penn Treebank dataset result in graph 300. As can be seen from FIG. 3, as training progresses, the validation perplexity continues decreasing for the floating-point (FP) baseline, 2 bit weight 2 bit ADC/DAC, and 4 bit weight 4 bit ADC/DAC cases. The 1 bit weight 2 bit ADC/DAC example case shows a less successful training as the validation perplexity fluctuates and does not converge, while the 4 bit weight 4 bit ADC/DAC case produces a training result competitive with the FP baseline without noticeable degradation. Stated another way, FIG. 3 shows that perplexity does not converge for the 1 bit weight 2 bit ADC/DAC case while the other bit-width configurations produce successful training. It also shows that the 4 bit weight 4 bit ADC/DAC case can produce a training result close to that of the FP baseline.

(41) To fully explore the bit-width requirement on the weights and ADC/DAC, all combinations of bit precision ranging from 1 to 4 bit were tested. FIG. 4 depicts a Penn Treebank dataset result in graph 400 with full exploration of the bit-widths of weight and ADC/DAC, in which the PPW is measured as the validation perplexity after 10 epochs of training. As shown, a 4 bit weight along with at least a 2 bit ADC/DAC may be desirable to achieve a result comparable with the floating-point baseline (less than 5% perplexity increase). It can also be loosely concluded that the bit precision of the weight plays a relatively more important role than the bit precision of the ADC/DAC for the general performance of the LSTM network, as the PPW achieved at 1 bit weight 2 bit ADC/DAC is higher than that achieved at 2 bit weight 1 bit ADC/DAC. A similar phenomenon can be observed by comparing the 2 bit weight 4 bit ADC/DAC case performance and the 4 bit weight 2 bit ADC/DAC case performance. Therefore, improving the resolution of the conductance levels of the NVM cells may be a higher priority than using high precision peripheries, although both could be applicable in some cases.

(42) Character Prediction.

(43) A simpler task than the Penn Treebank is the national name prediction, where the next character is predicted instead of the next word. The perplexity metric here is per character. The hidden state size is fixed at 256. After 8,000 training iterations, the training perplexity and accuracy were measured. As can be seen from Table I, in terms of both training perplexity and accuracy, 2 bit weight 2 bit ADC/DAC is sufficient to produce a result within 5% degradation compared to the floating-point baseline (32 bit) case. As compared to the result from the Penn Treebank, a lower bit precision on the weight and ADC/DAC is needed in this example case for the simpler character prediction task. To conclude and summarize from both tasks, a 4 bit weight 4 bit ADC/DAC can ensure almost-zero degradation for the online training performance. Such bit-width requirements also naturally help to ensure the performance of the inference, whose result is not shown here, although other combinations of lower bit weight and bit ADC/DAC values can also produce results within acceptable parameters depending on the implementation.

(44) TABLE I. NATIONAL NAME PREDICTION RESULT: TRAINING PERPLEXITY AND ACCURACY AT DIFFERENT BIT-WIDTH CONFIGURATIONS

LSTM configuration               Training accuracy (%)    Training perplexity (per character)
Floating point baseline          85.09                    1.52
1 bit weight + 1 bit ADC/DAC     72.82                    2.27
2 bit weight + 2 bit ADC/DAC     83.6                     1.58
4 bit weight + 4 bit ADC/DAC     85                       1.55

(45) Example Effect of Device and Circuit Noise.

(46) In addition to the low bit precision of NVM weight cells and ADC/DAC circuit components, non-ideal effects coming from the hardware may be considered. For instance, the hardware noise can be broadly classified into read noise and write noise. The read noise can be reflected on the ADC noise when a readout operation is performed during forward propagation, while the write noise can be reflected on the weight noise after the weight update is performed during back propagation.

(47) Example Effect of ADC Noise.

(48) The ADC read noise can distort the correct VMM result. To simply model the ADC noise coming mainly from the transistors within the ADCs, an additive noise term may be added to the values at the forget gate, input gate, output gate and new candidate memory cell before the ADC quantization and activation function units. The noise follows a Gaussian distribution with a standard deviation proportional to the total input current range. For example, at the forget gate:
$$f_t = \mathrm{sigmoid}([x_t, h_{t-1}]\, W_f + Z) \tag{7}$$
$$Z \sim N(0, \sigma^2), \quad \sigma = \alpha\,(I_{\max} - I_{\min}) \tag{8}$$

(49) Z is the ADC noise vector with the same dimension as [x.sub.t, h.sub.t-1] W.sub.f. It follows a Gaussian distribution with zero mean and a standard deviation σ ranging from 0 to 20% of the maximum input signal range I.sub.max−I.sub.min. The percentage α of the input VMM signal range is defined as the ADC noise ratio. Using α from 0 to 20% may be realistic for actual ADC hardware, depending on the use case, although other values may apply.
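The ADC noise model of Equations (7) and (8) can be sketched as follows. The signal range limits are passed in explicitly here because the disclosure does not state how I.sub.max and I.sub.min are obtained; that choice, the example values, and the function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_adc_noise(vmm, alpha, i_min, i_max, rng):
    """Add Z ~ N(0, sigma^2) with sigma = alpha * (i_max - i_min), per Eq. (8)."""
    sigma = alpha * (i_max - i_min)
    return vmm + rng.normal(0.0, sigma, size=np.shape(vmm))

rng = np.random.default_rng(0)
xh = rng.uniform(-1.0, 1.0, size=8)            # stands in for [x_t, h_{t-1}]
W_f = rng.uniform(-1.0, 1.0, size=(8, 4))

vmm = xh @ W_f                                 # clean VMM result
f_clean = sigmoid(vmm)
# Noise is injected before the activation, as in Eq. (7); alpha = 10% here.
f_noisy = sigmoid(add_adc_noise(vmm, alpha=0.1, i_min=-8.0, i_max=8.0, rng=rng))
```

Setting `alpha` to 0 recovers the noise-free forget gate, and sweeping it up to 0.2 covers the 0 to 20% range discussed above.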

(50) FIG. 5 depicts a graph 500 showing the effect of ADC noise on the Penn Treebank embodiment. Some example bit-widths configurations for weight and ADC/DAC were considered in these results to compare with floating point baseline. As can be seen from FIG. 5, the influence of ADC noise on the training performance is quite small, especially when the ADC bit-width is low, such as 2 bit. The experiment was run on the Penn Treebank corpus measuring the validation perplexity after 10 epochs of training.

(51) Effect of Weight Noise.

(52) Similarly, the effect of weight noise caused by NVM device variations may also be considered. Due to mostly extrinsic fabrication issues or intrinsic device stochastic nature, the spatial device-to-device variation may be relevant when it comes to NVM array operations. Instead of programming the resistance to the desired values, the actual resistance values of different cells can deviate from the ideal values, especially when there is no read-verify after programming. And this can potentially harm the training or inference result. To model the weight noise, an additive noise term may be added to the values of the weight arrays. The noise follows a Gaussian distribution with a standard deviation proportional to the total weight range. For example, at the forget gate:
f.sub.t=sigmoid([x.sub.t,h.sub.t-1](W.sub.f+Z)) (9)
Z˜N(0,σ.sup.2),σ=β(w.sub.max−w.sub.min) (10)

(53) Z is the weight noise matrix with the same dimension as W.sub.f. It follows a Gaussian distribution with zero mean and a standard deviation σ ranging from 0 to 20% of the total weight range w.sub.max−w.sub.min. The ratio β of σ to the weight range is defined as the weight noise ratio. Using β from 0 to 20% may be realistic for actual NVM device performance in some cases, although other values may apply.
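
The weight-noise model of equations (9) and (10) differs from the ADC noise model in that the noise is added to every weight before the multiplication rather than to the VMM result. The following is a minimal sketch under stated assumptions: the helper name `add_weight_noise`, the 100-by-100 matrix, and the weight range of −1 to 1 are hypothetical choices made for illustration.

```python
import random
import statistics

def add_weight_noise(w, beta, w_min, w_max, rng):
    """Equations (9) and (10): return W + Z with Z ~ N(0, sigma^2) drawn per weight."""
    sigma = beta * (w_max - w_min)                     # equation (10)
    return [[wij + rng.gauss(0.0, sigma) for wij in row] for row in w]

# Hypothetical ideal weights, uniformly spread over an assumed range [-1, 1].
rng = random.Random(1)
w_ideal = [[rng.uniform(-1.0, 1.0) for _ in range(100)] for _ in range(100)]
w_noisy = add_weight_noise(w_ideal, 0.1, -1.0, 1.0, rng)   # beta = 10%

# Empirical check: the per-weight deviation should have a spread near beta * range = 0.2.
deviations = [n - i for rn, ri in zip(w_noisy, w_ideal) for n, i in zip(rn, ri)]
measured_sigma = statistics.pstdev(deviations)
```

The noisy matrix can then be used in place of W.sub.f in the gate computation, so that the same perturbation affects every input presented to the array.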

(54) FIG. 6 depicts a graph 600 showing the effect of weight noise on the Penn Treebank embodiment. Some example bit-width configurations for the weights and ADC/DAC were considered in the results for comparison with a floating-point baseline. As can be seen from FIG. 6, with the same Penn Treebank experiment setup, the weight noise appears to have a more harmful effect on the LSTM network training performance than the ADC noise.

(55) Example Noise Tolerance Techniques.

(56) Advantageously, although not required and depending on the use case, the following approaches can be used without modifying the training algorithms and without post error correction methods, which usually introduce significant latency, space, and power overhead. In particular, the approaches instead add reasonable redundancy in either running cycles or area in exchange for better LSTM performance, although hybrid approaches may also be used depending on the use case.

(57) Using Redundant Runs.

(58) To address the ADC read noise, an ADC noise component can be added, such as an averaging component. In some embodiments, redundant runs can be used to average the results before the ADC quantization and activation function units, as indicated by the averaging blocks 702 (e.g., analog integrate and average blocks) in FIG. 7, which depicts a schematic of a further example architecture of an NVM weight array accelerated LSTM device that uses redundant runs to address the ADC noise. As shown, analog integrate and average units 702 are added between the memory arrays 212 and the ADCs 216 (e.g., after the NVM arrays 212 and before the ADCs 216) so that the values of the forget gate, input gate, output gate, and new candidate memory cell can be an averaged result from redundant runs, and can then be used for subsequent element-wise calculations. In some further embodiments, suitable averaging blocks could be situated elsewhere, such as between the activation units and the element-wise calculations.

(59) The approach was tested on the Penn Treebank corpus with a 4-bit weight and 4-bit ADC/DAC configuration, and the results show that, for 20% ADC noise, using 3 or 5 redundant runs is sufficient to improve the training performance to some extent. FIG. 8 depicts an example graph 800 showing a result of using redundant runs to address the ADC noise effect. While the improvement in the illustrated case is moderate, the performance degradation was not severe to begin with. In severe cases, the improvement may be commensurately greater.
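
The benefit of redundant runs can be illustrated numerically. The sketch below assumes, as a simplification not stated in the description, that the read noise is independent from run to run; under that assumption, averaging N runs is expected to shrink the residual noise by roughly a factor of the square root of N. The function names and the trial count are hypothetical.

```python
import random
import statistics

def noisy_read(value, sigma, rng):
    """One analog read of a VMM result with additive Gaussian read noise."""
    return value + rng.gauss(0.0, sigma)

def averaged_read(value, sigma, n_runs, rng):
    """Average n_runs redundant reads before quantization (the role of blocks 702)."""
    return sum(noisy_read(value, sigma, rng) for _ in range(n_runs)) / n_runs

def residual_noise(sigma, n_runs, trials, rng):
    """Empirical standard deviation of the averaged read around a true value of 0."""
    reads = [averaged_read(0.0, sigma, n_runs, rng) for _ in range(trials)]
    return statistics.pstdev(reads)

rng = random.Random(2)
sigma = 0.2                     # e.g., 20% ADC noise on an assumed unit signal range
spread = {n: residual_noise(sigma, n, 4000, rng) for n in (1, 3, 5)}
```

Here `spread[1]`, `spread[3]`, and `spread[5]` hold the residual noise for 1, 3, and 5 runs, which makes the trade between extra running cycles and noise suppression explicit.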

(60) Using Multiple Parallel NVM Cells as One Synapse.

(61) To address the weight noise/device variation issue, multiple NVM cells can be connected in parallel to represent one synaptic weight element, instead of just using one NVM cell as one synaptic weight element. Such an implementation in the resistive cross-point array is shown in FIG. 9, which depicts various example configurations 900 and 950 for suppressing weight noise (e.g., parallel NVM weight cells). In particular, (a) shows a first approach of using one NVM cell to represent one synaptic weight, and (b) shows a further approach that uses three parallel NVM cells to represent one weight element as an example implementation (three is shown as an example only as other numbers can be used). The variation or noise effect can be statistically averaged out by taking the summed current from the multiple parallel cells.

(62) A simulation of a 10% weight noise example case shows that using just 3 or 5 parallel NVM cells can improve the training performance significantly. FIG. 10 depicts a graph 1000 showing example results obtained from using multiple parallel NVM cells per weight to address the weight noise effect. The simulation was run on the Penn Treebank corpus with a 4-bit weight and 4-bit ADC/DAC configuration. By optimizing the layout of these parallel NVM devices, such as by sharing the wordlines (WLs) and bitlines (BLs) shown in FIG. 9, the area overhead can be advantageously reduced to a relatively small amount.
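
The statistical averaging provided by parallel cells can be sketched as follows. Two assumptions are made here that the description does not specify: every cell of a synapse is programmed toward the same target value with an independent Gaussian deviation, and the summed current is normalized by the number of cells. All names and numeric values are hypothetical.

```python
import random
import statistics

def program_cells(target, sigma, k, rng):
    """Program k parallel cells toward one target; each lands with its own deviation."""
    return [target + rng.gauss(0.0, sigma) for _ in range(k)]

def effective_weight(cells):
    """Summed contribution of the parallel cells, normalized by cell count (assumed scaling)."""
    return sum(cells) / len(cells)

def weight_spread(target, sigma, k, n_synapses, rng):
    """Empirical standard deviation of the effective weight over many synapses."""
    weights = [effective_weight(program_cells(target, sigma, k, rng))
               for _ in range(n_synapses)]
    return statistics.pstdev(weights)

rng = random.Random(3)
sigma = 0.1 * 2.0               # beta = 10% of an assumed weight range of 2.0
spread = {k: weight_spread(0.5, sigma, k, 5000, rng) for k in (1, 3, 5)}
```

Under these assumptions the spread of the effective weight falls roughly with the square root of the number of parallel cells, which is the area-for-accuracy trade described above.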

(63) FIG. 11 depicts a flowchart of an example method 1100 for quantized processing of inputs. In block 1102, the method 1100 converts a digital input signal into an analog input signal. The digital signal, in some embodiments, may comprise sensor data from one or more sensors coupled to, for instance, a DAC 208 (e.g., directly or via intervening components). In block 1104, the method 1100 converts a digital previous hidden state (PHS) signal into an analog PHS signal. For example, a DAC 208 may receive the previous state from the buffer array 1402 and/or directly from an arithmetic logic unit (e.g., 1414). In block 1106, the method 1100 computes, using a plurality of non-volatile memory (NVM) weight arrays (e.g., 212), a plurality of vector matrix multiplication (VMM) arrays based on the analog input signal and the analog PHS signal. In some implementations, the NVM weight arrays have a bit-width less than 32 bits and/or may comprise resistive cross-point arrays. In block 1108, the method 1100 converts the VMM arrays into digital VMM values, and in block 1110, the method 1100 processes the digital VMM values into a new hidden state. In some implementations, converting the VMM arrays into digital VMM values may comprise adding an ADC noise component.
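
Blocks 1102 through 1108 of the method 1100 can be sketched as a small software model. This is an illustration under assumptions, not the hardware behavior: the DAC and ADC are modeled as ideal uniform converters, the signal ranges are arbitrary, and the names `dac`, `adc`, `vmm`, and `vmm_step` are introduced here for the example.

```python
def dac(code, bits, lo, hi):
    """Map an integer code in [0, 2**bits - 1] to an analog level in [lo, hi]."""
    levels = (1 << bits) - 1
    return lo + (hi - lo) * code / levels

def adc(value, bits, lo, hi):
    """Clip an analog value to [lo, hi] and return the nearest integer code."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)

def vmm(vec, mat):
    """Multiply a length-n vector by an n-by-m matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def vmm_step(x_codes, h_codes, weights, bits, in_range, out_range):
    x_analog = [dac(c, bits, *in_range) for c in x_codes]    # block 1102
    h_analog = [dac(c, bits, *in_range) for c in h_codes]    # block 1104
    currents = vmm(x_analog + h_analog, weights)             # block 1106
    return [adc(i, bits, *out_range) for i in currents]      # block 1108

# Hypothetical 4-bit example: one input element, one hidden-state element, one output.
codes = vmm_step([15], [0], [[1.0], [0.0]], 4, (-1.0, 1.0), (-2.0, 2.0))
```

The resulting digital codes would then be passed to the activation function units in block 1110. Noise terms such as those of equations (7) through (10) could be inserted between the VMM and the ADC call, or applied to `weights`, to study their effect in this model.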

(64) FIG. 12 depicts a flowchart of an example method 1200 for calculating a hidden state. In block 1202, the method 1200 processes the digital VMM values into a forget gate value, an input gate value, an output gate value, and a new candidate memory cell value. In block 1204, the method 1200 calculates the new hidden state based on the forget gate value, the input gate value, the output gate value, and the new candidate memory cell value. On a subsequent cycle, the new hidden state may be input as the previous hidden state for the next iteration of the method 1100, such as in block 1104.
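
Blocks 1202 and 1204 can be sketched with the standard LSTM element-wise update, which is assumed here to match the gate equations given earlier in the description. The function name `lstm_elementwise` and its argument layout (pre-activation values per gate, already converted back from digital VMM values) are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_elementwise(f_pre, i_pre, o_pre, g_pre, c_prev):
    """Turn pre-activation VMM values into gate values, then update cell and hidden state."""
    f = [sigmoid(v) for v in f_pre]        # forget gate value
    i = [sigmoid(v) for v in i_pre]        # input gate value
    o = [sigmoid(v) for v in o_pre]        # output gate value
    g = [math.tanh(v) for v in g_pre]      # new candidate memory cell value
    # New memory cell state: keep part of the old state, add part of the candidate.
    c_new = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # New hidden state: output gate applied to the squashed cell state.
    h_new = [ot * math.tanh(ct) for ot, ct in zip(o, c_new)]
    return h_new, c_new

# Hypothetical single-element example with all pre-activations at zero.
h_new, c_new = lstm_elementwise([0.0], [0.0], [0.0], [0.0], [2.0])
```

On the next cycle, `h_new` would be converted by the second DAC as the previous hidden state and `c_new` would be supplied as `c_prev`.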

(65) The foregoing description, for purpose of explanation, has been described with reference to various embodiments and examples. However, the illustrative discussions above are not intended to be exhaustive or to limit the claimed invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The various embodiments and examples were chosen and described in order to best explain the principles of the innovative technology described herein and its practical applications, to thereby enable others skilled in the art to utilize the innovative technology with various modifications as may be suited to the particular use contemplated.
