CROSS COUPLED CAPACITOR ANALOG IN-MEMORY PROCESSING DEVICE
20230229870 · 2023-07-20
Inventors
Cpc classification
H03K25/02
ELECTRICITY
International classification
Abstract
A system for performing analog multiply-and-accumulate (MAC) operations employs at least one cross coupling capacitor processing unit (C3PU). A system includes a wordline to which an analog input voltage is applied, a voltage supply line having a supply voltage (VDD), a bitline, a clock signal line, a current integrator op-amp connected to the bitline and to the clock signal line, and a C3PU connected to the wordline. The C3PU includes a CMOS transistor and a capacitive unit. The capacitive unit includes a cross coupling capacitor and a gate capacitor. The cross coupling capacitor is connected between the wordline and the gate terminal of the CMOS transistor. The gate capacitor is connected between the gate terminal and ground. The CMOS transistor is configured to conduct a current that is proportional to voltage applied to the gate terminal.
Claims
1. A system for performing analog multiply-and-accumulate (MAC) operations, the system comprising: a first wordline to which a first analog input voltage is applied; a voltage supply line having a supply voltage (VDD); a first bitline; a clock signal line; a first current integrator op-amp connected to the first bitline and to the clock signal line; and a first cross coupling capacitor processing unit (C3PU) connected to the first wordline, wherein the first C3PU comprises: a first C3PU CMOS transistor comprising a first C3PU gate terminal, a first C3PU VDD terminal connected to the voltage supply line, and a first C3PU current output terminal connected to the first bitline; and a first C3PU capacitive unit comprising a first C3PU cross coupling capacitor and a first C3PU gate capacitor, wherein the first C3PU cross coupling capacitor is connected between the first wordline and the first C3PU gate terminal, and wherein the first C3PU gate capacitor is connected between the first C3PU gate terminal and ground, wherein the first C3PU CMOS transistor is configured to conduct a current that is proportional to voltage applied to the first C3PU gate terminal.
2. The system of claim 1, further comprising: a second wordline to which a second analog input voltage is applied; a second C3PU connected to the second wordline, wherein the second C3PU comprises: a second C3PU CMOS transistor comprising a second C3PU gate terminal, a second C3PU VDD terminal connected to the voltage supply line, and a second C3PU current output terminal connected to the first bitline; and a second C3PU capacitive unit comprising a second C3PU cross coupling capacitor and a second C3PU gate capacitor, wherein the second C3PU cross coupling capacitor is connected between the second wordline and the second C3PU gate terminal, and wherein the second C3PU gate capacitor is connected between the second C3PU gate terminal and ground, wherein the second C3PU CMOS transistor is configured to conduct a current that is proportional to voltage applied to the second C3PU gate terminal.
3. The system of claim 2, comprising: an array of M×N C3PUs, including the first C3PU and the second C3PU, arranged in a crossbar architecture comprising M rows, N columns, wherein each of M and N is an integer number equal to 2 or greater, and wherein each of the array of M×N C3PUs comprises: a respective CMOS transistor comprising a respective gate terminal, a respective VDD terminal connected to the voltage supply line, and a respective current output terminal; and a respective C3PU capacitive unit comprising a respective C3PU cross coupling capacitor and a respective C3PU gate capacitor, wherein the respective C3PU cross coupling capacitor is connected between the respective wordline and the respective C3PU gate terminal, and wherein the respective C3PU gate capacitor is connected between the respective C3PU gate terminal and ground, wherein the respective CMOS transistor is configured to conduct a current that is proportional to voltage applied to the respective gate terminal; M wordlines, including the first wordline and the second wordline; N bitlines, including the first bitline; and N current integrator op-amps, including the first current integrator op-amp, wherein: each of the C3PUs in each respective column of the C3PUs has a current output terminal that is connected to a respective bitline of the N bitlines for the respective column of the C3PUs; and each of the C3PUs in each respective row of the C3PUs is connected to a respective wordline of the M wordlines for the respective row of the C3PUs; and the array of C3PUs is connected to the supply voltage line; and each of the bitlines of the N bitlines is connected to a respective one of the N current integrator op-amps.
4. The system of claim 3, wherein the array of M×N C3PUs comprises five rows and four columns.
5. The system of claim 3, wherein: the VDD is within a range from 0.1-0.5 V; the analog input voltage is within a range from 0.1-1 V; an equivalent capacitance of the capacitive unit is within a range from 0.1-1; a bias voltage provided by a wordline of the M wordlines, is within a range of 0-1 V; and a size of each respective CMOS transistor is 200 nm±1000 nm/60 nm±100 nm.
6. The system of claim 3 wherein: the VDD is 0.3 V; the analog input voltage is within a range from 0.5-1 V; an equivalent capacitance of each respective capacitive unit is within a range from 0.5-0.75 Femto-Farad; and a bias voltage, provided by a wordline of the M wordlines, is 1 V.
7. The system of claim 1, wherein the CMOS transistor is configured to conduct current corresponding to a gate voltage applied to the CMOS transistor falling in a range of 0.45-0.75 V.
8. The system of claim 1 wherein the CMOS transistor is configured to conduct a drain-source current that is linearly proportional to a gate voltage applied to the CMOS transistor.
9. The system of claim 1 wherein a non-linear mode of the CMOS transistor corresponds to a gate voltage applied to the CMOS transistor falling in a range of 0.25-0.45 V, the non-linear mode corresponding to a drain-source current conducted by the CMOS transistor of less than 100 nA.
10. The system of claim 1, wherein the analog input voltage is modulated.
11. The system of claim 1, wherein the analog input voltage has a modulated pulse width.
12. The system of claim 11, further comprising a voltage-to-time converter (VTC) that generates the analog input voltage from an input voltage.
13. A method of mapping a crossbar architecture comprising N columns of M cross coupling capacitive units (C3PUs) to an artificial neural network (ANN), where ‘N’ and ‘M’ are positive integers greater than one, the method comprising: mapping A rows of the crossbar architecture to A input nodes of an input layer of the ANN, where A is an integer greater than one and less than M; mapping the A input nodes and a first bias node to B hidden nodes of a hidden layer, where B is an integer greater than one and less than A; mapping the B hidden nodes and a second bias node to B output nodes of an output layer; applying A input voltages to the A input nodes; generating a plurality of weighting factors; determining a minimum weight value, such that none of the weighting factors are less than zero; and generating an output measurement based on the A input voltages.
14. The method of claim 13, wherein generating the output measurement comprises normalizing and mapping a feature set comprising A features to A voltage values.
15. The method of claim 13, wherein generating the output measurement comprises mapping the plurality of weighting factors to a plurality of capacitance ratios corresponding to an array of C3PUs making up the crossbar architecture.
16. The method of claim 15, wherein mapping the plurality of weighting factors to a plurality of capacitance ratios corresponding to the array of C3PUs comprises: generating the weighting factors by training a simulated ANN using the A voltage values in a simulated crossbar architecture.
17. The method of claim 13, wherein generating the output measurement further comprises: applying an M×N weight matrix comprising the weighting factors and the minimum weight value to the A input voltages, according to the mapping of the input layer to the hidden layer; generating B voltage levels for the B hidden nodes at least in part by summing and integrating over time N output currents generated by the N columns of C3PUs; generating B output voltages by applying an N×N weight matrix comprising the weighting factors according to the mapping of the hidden layer to the output layer; and classifying a feature set based at least in part on the B output voltages, the feature set corresponding to the A inputs to the input layer.
18. The method of claim 17, wherein classifying the feature set comprises: integrating and summing the B output voltages; and applying a sigmoid activation function to a result of integrating and summing the B output voltages.
19. The method of claim 13, further comprising converting each of the A input voltages into an analog input voltage having a modulated pulse width via a respective voltage-to-time converter (VTC).
Description
DETAILED DESCRIPTION
[0027] In the following description, various embodiments of the present invention will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the present invention may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
[0028] According to various embodiments of the present disclosure, techniques for in-memory computing (IMC) can include implementations of synaptic memory that is utilized for weight storage in an artificial neural network in an analog system. According to certain specific embodiments, to implement analog MAC operation, a cross-coupling capacitor processing unit (C3PU) is provided having a circuit design using a crossbar architecture.
[0029] Example C3PU Circuit and Operation
[0030] The following sections discuss the design details and operation of an example C3PU. A coupling capacitance is used to apply a voltage to the gate of the transistor. Current is passed through the transistor based on the voltage applied to the gate of the transistor.
[0031] Turning now to the drawing figures in which similar reference identifiers refer to similar elements,
[0032] The value of Vg determines the operational mode of the transistor 102 and affects its trans-conductance value and hence its linearity.
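The dependence of the operating mode on Vg can be illustrated with a minimal sketch. It assumes an ideal capacitive divider between the cross coupling capacitor and the capacitances to ground (the corresponding equation is not reproduced in this text), and it uses the gate-voltage ranges recited in claims 7 and 9. The function names and the optional `c_b` term are hypothetical.

```python
def gate_voltage(v_in, c_c, c_g, c_b=0.0):
    """Ideal capacitive divider (assumption): fraction of v_in coupled onto the gate.

    c_c: cross coupling capacitance, c_g: gate capacitance,
    c_b: optional additional capacitance to ground (hypothetical default of 0).
    All capacitances must use the same unit.
    """
    ratio = c_c / (c_c + c_b + c_g)
    return ratio * v_in


def operating_region(v_g):
    """Classify v_g using the ranges recited in claims 7 and 9.

    The shared boundary at 0.45 V is assigned to the linear range here,
    which is an arbitrary choice made for this sketch.
    """
    if 0.45 <= v_g <= 0.75:
        return "linear"
    if 0.25 <= v_g < 0.45:
        return "non-linear"
    return "outside stated ranges"


if __name__ == "__main__":
    for c_c, c_g in [(3.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
        v_g = gate_voltage(1.0, c_c, c_g)
        print(f"Cc={c_c}, Cg={c_g}: Vg={v_g:.3f} V -> {operating_region(v_g)}")
```

With a 1 V input, capacitance ratios between 0.5 and 0.75 keep Vg inside the 0.45-0.75 V range in this idealized model.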
[0033] To overcome these issues, which significantly affect the functionality of the C3PU multiplier, the analog input voltage can be processed in the time domain rather than the voltage domain. This can be achieved using a voltage-to-time converter (VTC) 106 as shown in
[0034] Presenting the data Vin in the time domain has several advantages, because both time and capacitance scale better with technology than voltage does. In addition, the time domain has less variation and provides better noise immunity than the voltage domain, where the signal-to-noise ratio is degraded by voltage scaling.
[0036] During the sampling phase as shown in
as given in Eq. 5. The Iavg value depends on the amount of charge stored in the capacitors, which varies linearly with Vin given that VDDvtc is fixed. Thus, t.sub.d has a linear relationship with Vin. Equation 6 gives the time delay when Vin=VDDvtc, which depends on the difference between VDDvtc and Vsp.
[0038] The VTC circuit 106 was designed, implemented, and simulated in 65 nm industry standard CMOS technology. The input voltage is set between 0.1 V and 1.0 V at VDDvtc=1.0 V so that linear voltage-to-time conversion is achieved. The capacitors C1 and C2 and the transistor M4 are sized to support a minimum time delay of 165 ps at the minimum Vin of 0.1 V. Metal-insulator-metal (MIM) capacitors of C1=27 fF and C2=10 fF are utilized. Transistor M4, sized at 500 nm/140 nm and controlled by a gate voltage of Vb=0.5 V, acts as a 14 μA current source. The inverter is sized to provide the desired Vsp: the aspect ratio of M9 is 5 times the aspect ratio of M8, such that Vsp=0.35 V. Table 1 summarizes the specifications of the VTC design.
TABLE 1 Specifications of the VTC.
Parameter                                Value
VDD.sub.vtc (V)                          1
V.sub.in (V)                             [0-1]
C.sub.1 (fF)                             27
C.sub.2 (fF)                             10
W.sub.1,2,5,6/L.sub.1,2,5,6 (nm/nm)      600/60
W.sub.3,7/L.sub.3,7 (nm/nm)              200/60
W.sub.4/L.sub.4 (nm/nm)                  500/140
W.sub.8/L.sub.8 (nm/nm)                  200/60
W.sub.9/L.sub.9 (μm/nm)                  1/60
V.sub.b (V)                              0.5
V.sub.sp (V)                             0.35
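The linear voltage-to-time relationship can be sketched as a straight-line interpolation. This is only an idealized behavioral model, not the circuit equations: it assumes the 165 ps delay at Vin=0.1 V stated above and the approximately 2 ns maximum pulse width at Vin=1.0 V reported elsewhere in this description, and the names `pulse_width_ps`, `T_MIN_PS` and `T_MAX_PS` are hypothetical.

```python
# Endpoints taken from the description; the straight line between them is an assumption.
V_MIN, V_MAX = 0.1, 1.0        # input voltage range in volts
T_MIN_PS = 165.0               # pulse width at V_MIN, in picoseconds
T_MAX_PS = 2000.0              # approximate pulse width at V_MAX, in picoseconds


def pulse_width_ps(v_in):
    """Idealized linear VTC: map an input voltage to a pulse width in picoseconds."""
    if not V_MIN <= v_in <= V_MAX:
        raise ValueError("v_in is outside the range assumed for linear conversion")
    slope = (T_MAX_PS - T_MIN_PS) / (V_MAX - V_MIN)
    return T_MIN_PS + slope * (v_in - V_MIN)


if __name__ == "__main__":
    for v in (0.1, 0.55, 1.0):
        print(f"Vin={v:.2f} V -> t_pw of about {pulse_width_ps(v):.0f} ps")
```

A real VTC would deviate from this line because of process variation and mismatch, so the sketch is useful only for reasoning about ranges.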
[0040] To quantify the impact of process variation on the pulse width value, a Monte Carlo SPICE simulation with 200 samples and a mismatch model was performed.
[0041] Example C3PU Crossbar Architecture for IMC Applications
[0043] The example 5×4 C3PU crossbar architecture 200 operates in two phases: computation and isolation. In the computation phase, when the clock signal V.sub.clk=1, the MAC operation is performed by multiplying the pulse widths of V.sub.pw,i by the capacitance ratios C.sub.c,ij/(C.sub.c,ij+C.sub.b,ij+C.sub.g,ij). The transistors then convert each product into a current, and the currents are summed on each bitline. The summed currents are integrated over the period t.sub.1-t.sub.2 using a virtual ground current integrator op-amp to provide the outputs as voltage levels V.sub.1-4, as given in Eq. 7.
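The computation phase can be illustrated with a simplified behavioral sketch. It is not Eq. 7, which is not reproduced in this text. It assumes that each transistor conducts a current proportional to its gate voltage (capacitance ratio times pulse amplitude) for the duration of its input pulse, and that an ideal integrator divides the accumulated charge by C.sub.j. The transconductance `g_m`, the integrator capacitance `c_int` and their default values are hypothetical, and the sign inversion of a real integrator is ignored.

```python
def mac_outputs(t_pw, ratios, v_pw=1.0, g_m=1e-6, c_int=10e-15):
    """Idealized crossbar MAC.

    t_pw:   list of M pulse widths in seconds, one per wordline.
    ratios: M x N nested list of capacitance ratios Cc / (Cc + Cb + Cg).
    v_pw:   pulse amplitude in volts.
    g_m:    hypothetical transconductance in A/V (linear-region assumption).
    c_int:  hypothetical integrator capacitance C_j in farads.
    Returns a list of N output voltages.
    """
    m, n = len(ratios), len(ratios[0])
    if len(t_pw) != m:
        raise ValueError("one pulse width is required per wordline")
    outputs = []
    for j in range(n):
        # Charge delivered to bitline j: sum over rows of current times pulse width.
        charge = sum(g_m * ratios[i][j] * v_pw * t_pw[i] for i in range(m))
        outputs.append(charge / c_int)
    return outputs


if __name__ == "__main__":
    pulse_widths = [1e-9, 2e-9]
    capacitance_ratios = [[0.5, 0.75],
                          [0.5, 0.5]]
    print(mac_outputs(pulse_widths, capacitance_ratios))
```

In this model each column output is proportional to the dot product of the pulse-width vector and that column of capacitance ratios, which is the vector-matrix product the crossbar is meant to realize.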
[0044] The value of the output voltages depends on two main parameters: a) the time t.sub.1-t.sub.2 over which the current is accumulated and b) the capacitor size C.sub.j. The time t.sub.1-t.sub.2 can be fixed and represents the pulse width of the clock. This time is set to be greater than the maximum pulse width of V.sub.pw,i. The maximum pulse width of V.sub.pw is approximately 2 ns when the input voltage is at its maximum, V.sub.in=1 V. Thus, the pulse width of the clock can be set to 3 ns to ensure that the currents are fully computed and accumulated. In addition, the C.sub.j size determines the scaling factor required for V.sub.1-4 to approximately reach the expected output levels. The scaling factor is calculated by dividing the obtained MAC output voltages V.sub.1-4 by the expected values, and the C.sub.j size is set accordingly. Once the approximate voltages are achieved, the C3PU elements are isolated from the outputs by setting V.sub.clk=0 to enter the isolation phase. The isolation phase is essential to allow the VTC to operate and to initialize the output stage of the virtual ground op-amp 203. The period T taken to perform the MAC calculations, including computation and isolation time, is 6 ns. Table 2 shows the specifications of the C3PU crossbar architecture 200.
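The sizing procedure described above can be sketched as follows. The sketch assumes that the per-output ratios are averaged into a single scaling factor (the text does not say how they are combined) and relies on the fact that the output of an ideal integrator is inversely proportional to its capacitance. All function names are hypothetical.

```python
def scaling_factor(obtained, expected):
    """Average ratio of obtained to expected output voltages.

    Averaging is an assumption of this sketch; entries with a zero expected
    value are skipped to avoid division by zero.
    """
    ratios = [o / e for o, e in zip(obtained, expected) if e != 0]
    if not ratios:
        raise ValueError("no non-zero expected values to compare against")
    return sum(ratios) / len(ratios)


def resized_capacitor(c_j, factor):
    """An ideal integrator output scales as 1/C_j, so C_j is multiplied by the factor."""
    return c_j * factor


def clock_is_long_enough(t_clk, pulse_widths):
    """The clock pulse must be longer than the longest input pulse."""
    return t_clk > max(pulse_widths)


if __name__ == "__main__":
    factor = scaling_factor([2.0, 4.0], [1.0, 2.0])
    print("scaling factor:", factor)
    print("resized C_j (F):", resized_capacitor(1e-15, factor))
    print("3 ns clock sufficient:", clock_is_long_enough(3e-9, [165e-12, 2e-9]))
```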
TABLE 2 5 × 4 C3PU Crossbar Specifications
Parameter            Value
VDD (V)              0.3
V.sub.in (V)         1
V.sub.pw (V)         1
t.sub.pw (ns)        0-2
X.sub.eq             0.5-0.75
V.sub.g (V)          0.5-0.75
T (ns)               6
Transistor size      500 nm/60 nm
[0045] The 5×4 C3PU crossbar architecture 200 can be implemented employing 65 nm technology. The input voltages can be fed to the C3PU crossbar architecture 200 for 30 continuous clock cycles. Each cycle can have different sets of input voltage levels that are converted into modulated pulse width signals.
[0046] In order to evaluate the 5×4 C3PU crossbar architecture 200, 5×4 fixed-point (FXP) crossbar units were implemented using an ASIC design flow in 65 nm CMOS. Table 3 compares the performance of the 3×3-bit, 4×4-bit, 8×4-bit and 8×8-bit FXP crossbars with that of the 5×4 C3PU crossbar 200. The error of the C3PU crossbar 200, 5.6%, is close to the error of the 8×4-bit MAC unit, 6.52%. However, the C3PU crossbar 200 consumes 3.4 times less energy and 2.4 times less area than the 8×4-bit MAC unit.
TABLE 3 Evaluation of 5 × 4 FXP crossbar MAC units with different input and weight resolutions.
MAC Unit Type    Energy (fJ/MAC)    Error (%)    Area (μm.sup.2/MAC)
3 × 3-bit        60.9               64.7         127.7
4 × 4-bit        107                10           246.2
8 × 4-bit        226.2              6.52         655.8
8 × 8-bit        526                0.74         1380.7
C3PU             66.4               5.6          277.1
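The energy and area ratios quoted for the 8×4-bit unit follow directly from the Table 3 values, as the short check below shows. The dictionary layout and the function name are illustrative choices only.

```python
# (energy in fJ/MAC, error in %, area in um^2/MAC), copied from Table 3.
TABLE_3 = {
    "3x3-bit": (60.9, 64.7, 127.7),
    "4x4-bit": (107.0, 10.0, 246.2),
    "8x4-bit": (226.2, 6.52, 655.8),
    "8x8-bit": (526.0, 0.74, 1380.7),
    "C3PU": (66.4, 5.6, 277.1),
}


def relative_to_c3pu(unit):
    """Return (energy ratio, area ratio) of a fixed-point unit relative to the C3PU."""
    energy, _, area = TABLE_3[unit]
    c3pu_energy, _, c3pu_area = TABLE_3["C3PU"]
    return energy / c3pu_energy, area / c3pu_area


if __name__ == "__main__":
    for name in ("3x3-bit", "4x4-bit", "8x4-bit", "8x8-bit"):
        e_ratio, a_ratio = relative_to_c3pu(name)
        print(f"{name}: energy x{e_ratio:.2f}, area x{a_ratio:.2f}")
```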
[0047] C3PU Demonstrator For ANN Applications
[0048] The advantage of the C3PU 100 is demonstrated by accelerating the MAC operations found in an ANN using the iris flower data set. The data set consists of 150 samples in total, divided equally among the three classes of the iris flower: Setosa, Versicolour, and Virginica. Each sample holds the following features, all in cm: sepal length, sepal width, petal length, and petal width. The architecture of the ANN consists of two layers: four nodes for the input layer, each representing one of the input features, followed by three hidden neurons and, lastly, three output neurons, one for each class. In order to implement the MAC operations in the ANN, the iris features are treated as the first operands and are mapped into voltage values. The weights are treated as the second operands and are stored as capacitance ratios in the capacitive unit of the C3PU. A simple linear mapping algorithm is used between the neural weights and the capacitance ratios.
[0049] The training phase is performed offline using MATLAB, with the data set split 80% for training and 20% for testing. Post-training weights can be positive or negative. Because a capacitance ratio cannot be negative, the weights are shifted by the magnitude of the minimum weight value w.sub.min before they are mapped into capacitance ratio values. After the inputs are multiplied by the shifted weights, the effect of the shift is removed by subtracting the term $\sum_{i=1}^{n} IN_i \times |w_{\min}|$ from each MAC result, where IN.sub.i is the i-th input to the hidden/output layer and n is the number of input nodes. Mapping this operation onto the C3PU architecture requires adding a column to the hidden and output crossbars to store the w.sub.min value in each layer.
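The shift-and-correct scheme can be checked numerically with a small sketch. It treats the crossbar as an ideal vector-matrix multiplier, stores |w.sub.min| in an extra column, and subtracts that column's output from the others. The function names and the example weights are hypothetical.

```python
def shift_weights(weights):
    """Shift all weights by |w_min| so that none is negative; return (shifted, offset)."""
    w_min = min(min(row) for row in weights)
    offset = abs(w_min) if w_min < 0 else 0.0
    shifted = [[w + offset for w in row] for row in weights]
    return shifted, offset


def mac(inputs, weights):
    """Ideal vector-matrix product: one output per weight column."""
    n_cols = len(weights[0])
    return [sum(x * row[j] for x, row in zip(inputs, weights)) for j in range(n_cols)]


def mac_with_shift(inputs, weights):
    """Compute inputs x weights using only non-negative stored values."""
    shifted, offset = shift_weights(weights)
    # Extra column holding the offset in every row, as described in the text.
    augmented = [row + [offset] for row in shifted]
    raw = mac(inputs, augmented)
    correction = raw[-1]          # equals sum_i IN_i * |w_min|
    return [value - correction for value in raw[:-1]]


if __name__ == "__main__":
    x = [1.0, 0.5]
    w = [[0.5, -0.25],
         [-0.5, 1.0]]
    print("direct: ", mac(x, w))
    print("shifted:", mac_with_shift(x, w))
```

Both lines print the same values, which confirms that subtracting the extra column's output removes the effect of the shift.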
[0051] Once V.sub.1-4 are generated, the classifier switches to phase 2 in order to pass them to the second layer. Before that, the impact of the shift operation applied to the weights needs to be removed by subtracting V.sub.4 from V.sub.1-3. The resulting differences are then passed through a ReLU activation function. In the ANN classifier, the subtraction operation and the ReLU function are implemented in the time domain. To achieve this, V.sub.1-4 are first converted to pulse width modulated signals using VTCs and then passed to the time domain subtractor and ReLU activation function to generate V.sub.o-pw1-3. Because of the subtraction operation, these output signals may have small pulse widths that do not correspond to the expected subtraction outputs. Therefore, the pulse widths of V.sub.o-pw1-3 are scaled by a constant factor that depends on the expected subtraction output from the ANN using MATLAB and the observed outputs from the ANN using C3PU. After that, the scaled pulse width signals V.sub.o-pw1-3-s are fed to the 4×4 C3PU weight matrix. The output voltages from the weight matrix, V.sub.o1-4, are passed to the subtractor and then to a softmax function in order to generate the proper class based on the input features.
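The processing between the two weight matrices can be sketched behaviorally. The sketch operates on plain numbers rather than pulse widths, so it does not model the time domain circuits; it only shows the order of operations: subtract the last output from the others, apply ReLU, then scale by a constant. The function names and the `scale` value are hypothetical.

```python
def relu(x):
    """Rectified linear unit: negative values are clipped to zero."""
    return x if x > 0 else 0.0


def hidden_activations(v_out, scale=1.0):
    """Subtract the last (shift-correction) output from the others, apply ReLU, scale.

    v_out: outputs of the first weight matrix; the last entry is the
           shift-correction column, such as V4 in the 5x4 example.
    scale: hypothetical constant used to restore the expected output range.
    """
    *signals, reference = v_out
    return [scale * relu(v - reference) for v in signals]


if __name__ == "__main__":
    print(hidden_activations([0.5, 0.25, 0.75, 0.5], scale=2.0))
```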
[0053] The ANN classifier has been designed and simulated in 65 nm CMOS technology with a supply voltage of 1 V, except for the 5×4 and 4×4 weight matrices, which operate at a supply voltage of 0.3 V. The input voltages V.sub.in1-4 have a range of 0.0 V to 1.0 V, in addition to V.sub.bias=1.0 V. The five input voltages are converted into modulated pulse width signals V.sub.pw1-5 that have pulse widths in the range of 165 ps to 2 ns. The modulated pulse width input signals V.sub.o1-4 of the second weight matrix have a pulse width in the range of 1.6 ns to 7.5 ns. The pulse width T.sub.1 of V.sub.clk is set to 3 ns and the pulse width T.sub.2 of ˜V.sub.clk-d is set to 9 ns. The example ANN classifier using C3PU shown in
[0054] The advantage of utilizing a cross-coupling capacitor as both a storage and a processing element is that it provides high-density, low-energy storage while also performing computation. One operand in the C3PU can be stored in the capacitive unit, while the second operand can be a modulated pulse width signal generated by a voltage-to-time converter. The multiplication outputs can be converted to an output current using CMOS transistors and then integrated using a current integrator op-amp. The 5×4 C3PU crossbar 200 was developed to process all data simultaneously, realizing fully parallel vector-matrix multiplication in one cycle. The energy consumption of the 5×4 C3PU is 66.4 fJ/MAC at a 0.3 V supply voltage, with an error of 5.6% in 65 nm technology. The inference accuracy of the ANN architecture has been evaluated using the example C3PU on an iris flower data set, achieving a 90% classification accuracy.
[0055] Other variations are within the spirit of the present invention. Thus, while the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention, as defined in the appended claims.
[0056] The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. The term “connected” is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
[0057] Preferred embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.
[0058] All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.