INTEGRATED CIRCUITS FOR LARGE-SCALE TRANSISTOR TENSOR OPERATIONS

20260072645 · 2026-03-12

Assignee

Teracore Systems, Inc. (Cupertino, CA, US)

Inventors

Cpc classification

International classification

Abstract

An integrated circuit includes a plurality of current-mode computation (CMC) branches, each including a plurality of CMC cells and a branch summation line. An individual CMC cell includes at least one computation transistor that produces a CMC output current that is a function of a channel current of the computation transistor. A branch summation line receives CMC cell output currents produced by a plurality of CMC cells, and produces a branch output current that is a current-mode summation of all received CMC cell output currents. A layer-one current summation circuit receives the branch output current produced by at least one branch summation line, and produces a layer-one current summation circuit output current that is a function of the received branch output currents. The layer-one current summation circuit may be field-programmable. In operation, the integrated circuit can perform transistor tensor operations by programming one or more layer-one current summation circuits to combine all branch output currents received by those layer-one current summation circuits. Using embodiments of the present invention, millions, billions, trillions or more of the CMC cells can be field-programmed to execute computations in parallel to support transistor tensor operations.

Claims

1. An integrated circuit comprising: a plurality of current-mode computation (CMC) branches, wherein each of the CMC branches comprises: a plurality of CMC cells, wherein each of the CMC cells comprises at least one computation transistor and produces a CMC cell output current that is a function of a channel current of the computation transistor; a branch summation line comprising an electrical conductor that is electrically coupled to, and receives the CMC cell output currents produced by, a plurality of the CMC cells in that individual CMC branch, and that produces a branch output current that is a current-mode summation of the CMC cell output currents electrically coupled thereto; a plurality of layer-one current summation circuits, wherein each of the layer-one current summation circuits is electrically coupled to, and receives the branch output current produced by, at least one branch summation line, and further comprises a feedback circuit that can limit at least an amplitude of the branch output current received thereto, and produces a layer-one current summation circuit output current that is a function of the branch output currents received thereto; a plurality of layer-two summation lines, wherein each of the layer-two summation lines comprises an electrical conductor that is electrically coupled to, and receives the layer-one current summation circuit output currents produced by, a plurality of layer-one current summation circuits, and that produces a layer-two summation line output current that is a current-mode summation of the received layer-one current summation circuit output currents; and a plurality of layer-two current summation circuits, wherein each of the layer-two current summation circuits is electrically coupled to, and receives the layer-two summation line output current produced by, at least one layer-two summation line, and produces a layer-two current summation circuit output current that is a function of the received layer-two summation line output currents.

2. (canceled)

3. The integrated circuit of claim 1, further comprising: a plurality of layer-three summation lines, wherein each of the layer-three summation lines comprises an electrical conductor that is electrically coupled to, and receives the layer-two current summation circuit output currents produced by, a plurality of layer-two current summation circuits, and that produces a layer-three summation line output current that is a current-mode summation of the received layer-two current summation circuit output currents; and a plurality of layer-three current summation circuits, wherein each of the layer-three current summation circuits is electrically coupled to, and receives the layer-three summation line output current produced by, at least one layer-three summation line, and produces a layer-three current summation circuit output current that is a function of the received layer-three summation line output currents.

4. The integrated circuit of claim 1, wherein the computation transistor of each of said CMC cells comprises an MOS transistor.

5. The integrated circuit of claim 4, further comprising a select transistor having a source or drain terminal that is electrically coupled to a gate terminal of a computation transistor of each of said CMC cells.

6. The integrated circuit of claim 5, further comprising a storage-control capacitor electrically coupled to the gate terminal of a computation transistor of each of said CMC cells.

7. The integrated circuit of claim 6, further comprising a storage-control capacitor electrically coupled to the gate terminal of a computation transistor of each of said CMC cells while the other terminal of the storage-control capacitor is electrically coupled to a field-programmable CMC cell gate voltage control signal.

8. The integrated circuit of claim 1, wherein a computation transistor of each of said CMC cells comprises a native transistor.

9. The integrated circuit of claim 1, wherein a computation transistor of each of said CMC cells comprises a programmable-threshold-voltage transistor.

10. The integrated circuit of claim 9, wherein said programmable-threshold-voltage transistor comprises a floating-gate transistor.

11. The integrated circuit of claim 1, further comprising a select transistor having a source or drain terminal that is electrically coupled to gate terminals of computation transistors of a plurality of CMC cells.

12. The integrated circuit of claim 1, wherein each of said CMC cells is field-programmable to operate in multiple modes.

13. The integrated circuit of claim 12, wherein each of said CMC cells is field-programmable to operate in a multiplier mode.

14. The integrated circuit of claim 12, wherein each of said CMC cells is field-programmable to operate in a ReLU mode.

15. The integrated circuit of claim 12, wherein each of said CMC cells is field-programmable to operate in a rectifier mode.

16. The integrated circuit of claim 1, wherein the integrated circuit comprises multiple semiconductor dice that are connected by inter-dice connections.

17. The integrated circuit of claim 1, wherein the integrated circuit is field-programmable to disable parts of the circuits.

18. The integrated circuit of claim 1, wherein: each of said layer-one current summation circuits is field-programmable; and the integrated circuit can perform transistor tensor operations by programming one or more of said layer-one current summation circuits to combine branch output currents received by those layer-one current summation circuits.

19. The integrated circuit of claim 1, wherein: each of said layer-two current summation circuits is field-programmable; and the integrated circuit can perform transistor tensor operations by programming one or more of said layer-two current summation circuits to combine layer-two summation line output currents received by those layer-two current summation circuits.

20. The integrated circuit of claim 3, wherein: each of said layer-three current summation circuits is field-programmable; and the integrated circuit can perform transistor tensor operations by programming one or more of said layer-three current summation circuits to combine the layer-three summation line output currents received by those layer-three current summation circuits.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0063] FIG. 1(a) is a typical schematic symbol of a metal-oxide-semiconductor (MOS) transistor;

[0064] FIG. 1(b) is a typical schematic symbol of a floating-gate transistor;

[0065] FIGS. 1(c-e) show examples of the current vs voltage (I-V) relationship of a typical MOS transistor;

[0066] FIG. 1(f) is a symbolic diagram illustrating the principle of current-mode summation;

[0067] FIGS. 2(a-d) are simplified symbolic diagrams of examples of prior art floating-gate transistor arrays configured to support neural network computations;

[0068] FIG. 3(a) is a simplified symbolic block diagram of one embodiment of a layer-two block of the present invention;

[0069] FIG. 3(b) is a simplified symbolic block diagram of one embodiment of a layer-one block in FIG. 3(a) that uses current-mode computation memory cells;

[0070] FIG. 3(c) is a simplified schematic diagram of one embodiment of a layer-one input circuit (VI1) in FIG. 3(b);

[0071] FIG. 3(d) is a simplified schematic diagram of one embodiment of a layer-one current summation circuit (IS1) in FIG. 3(b);

[0072] FIG. 3(e) is a simplified schematic diagram of one embodiment of a layer-two current summation circuit (IS2);

[0073] FIG. 3(f) is a simplified schematic diagram of one embodiment of an interface output circuit (VOUT) in FIG. 3(e);

[0074] FIG. 3(g) is a simplified schematic diagram of one embodiment of a layer-one gate voltage input circuit (VI1g) in FIG. 3(b);

[0075] FIG. 3(h) is a simplified symbolic block diagram of one embodiment of a layer-one block in FIG. 3(a);

[0076] FIGS. 3(i-m) are simplified schematic diagrams of embodiments of current-mode computation cells;

[0077] FIG. 4(a) is a simplified symbolic block diagram of one embodiment of a layer-three block of the present invention;

[0078] FIG. 4(b) is a simplified schematic diagram of one embodiment of a layer-three current summation circuit (IS3);

[0079] FIG. 4(c) is a simplified schematic diagram of one embodiment of a layer-one current summation circuit (IS1) that can be used when the layer-one array is configured in saturation mode;

[0080] FIG. 5(a) is a simplified symbolic block diagram of one embodiment of a multiple-layer current summation configuration of the present invention;

[0081] FIG. 5(b) is a simplified symbolic block diagram of one embodiment of a multiple-layer input configuration of the present invention;

[0082] FIG. 5(c) is a simplified symbolic block diagram of one embodiment of a system configuration of the present invention;

[0083] FIGS. 5(d-e) show simplified embodiments of inter-dice connections;

[0084] FIG. 6(a) is a simplified schematic diagram for a two-transistor CMC cell that can have positive or negative values;

[0085] FIG. 6(b) is a simplified schematic diagram for a current subtraction circuit that allows synapses to have positive or negative values;

[0086] FIG. 6(c) is a simplified schematic diagram for a dual polarity current mirror circuit;

[0087] FIG. 6(d) is a simplified schematic diagram for a four-transistor synapse cell that can have positive or negative values in both inputs and outputs;

[0088] FIG. 6(e) is a simplified flow chart for implementing double-precision parameters;

[0089] FIG. 7(a) is a simplified flow chart for programming a programmable-threshold voltage transistor;

[0090] FIG. 7(b) is a simplified flow chart for programming programmable-threshold voltage transistors to support double-precision parameter values;

[0091] FIG. 8(a) is a simplified flow chart for one exemplary learning algorithm of an embodiment of the present invention;

[0092] FIG. 8(b) is a simplified flow chart for one exemplary learning algorithm of an embodiment of the present invention that uses current sample-and-hold circuits;

[0093] FIG. 8(c) is a simplified schematic diagram of one exemplary current sample-and-hold circuit that can support the algorithm in FIG. 8(b);

[0094] FIG. 8(d) is a simplified flow chart for one exemplary output function learning algorithm of an embodiment of the present invention;

[0095] FIG. 9(a) is a simplified block diagram of an exemplary convolution matrix computation circuit of an embodiment of the present invention;

[0096] FIG. 9(b) shows schematic symbols of the convolution computation cells in FIG. 9(a);

[0097] FIG. 9(c) is a simplified schematic diagram of the current summation circuit in FIG. 9(a); and

[0098] FIGS. 9(d-e) are simplified schematic diagrams of exemplary circuits that execute pooling computations.

DETAILED DESCRIPTION

[0099] As discussed in previous sections, prior art floating-gate transistor arrays cannot support large-scale transistor matrix-vector operations due to the non-ideal effects listed in Table 1. These non-ideal effects increase rapidly with the size of the transistor arrays. Embodiments of the present invention overcome these problems by distributing computation cells into a large number of small units and combining the computations executed in the different units to carry out full large-scale computations. This architecture also reduces power consumption, solves the current overload problem, and provides configurability.

[0100] FIG. 3(a) is a simplified symbolic block diagram of a layer-two block (B.sup.(2)) of an embodiment of the present invention. This layer-two block (B.sup.(2)) comprises P columns and U rows of layer-one blocks (B.sub.m,n), as shown in FIG. 3(a), where P and U are positive integers, n is an integer greater than or equal to 1 and less than or equal to U, and m is an integer greater than or equal to 1 and less than or equal to P.

[0101] FIG. 3(b) is a simplified schematic diagram for one embodiment of a layer-one block in FIG. 3(a). This layer-one block (B.sub.m,n) comprises a two-dimensional array of N columns and M rows of CMC cells that are configured to support transistor matrix-vector operations, where N and M are positive integers, i is an integer greater than or equal to 1 and less than or equal to N, and j is an integer greater than or equal to 1 and less than or equal to M. In this embodiment, each CMC cell comprises one programmable-threshold voltage transistor (M.sub.i,j), which is a floating-gate transistor that is used as a computation transistor and as a memory device. The drain terminals of a subset of the floating-gate transistors (M.sub.1,j, M.sub.2,j, . . . , M.sub.i-1,j, M.sub.i,j, . . . M.sub.N-1,j, M.sub.N,j) are connected by a conductor line (Vd.sub.j) to a drain voltage input circuit (VI1) (312), as shown in FIG. 3(b). A conductor line (e.g., Vd.sub.1, Vd.sub.2, . . . , Vd.sub.j, . . . , Vd.sub.M) that controls the drain voltages of a subset of the computation transistors in the layer-one block (B.sub.m,n) will be called a drain voltage input line in the following discussions. The number (N) of transistors connected to each drain voltage input line is designed to be low enough such that the aforementioned parasitic-parameter-induced problems are negligible, allowing the drain voltage input line to be considered an equipotential electrical conductor line. FIG. 3(c) shows a simplified schematic diagram for an exemplary sample-and-hold (S&H) circuit that can serve the function of the drain voltage input circuit (VI1) that drives a drain voltage input line. In sample mode, when the field-programmable control signal Smp is activated, transistor Me4 is turned on such that the voltage (Vhi) on the storage capacitor (Ch) is substantially equal to the input voltage (Vdp), as shown in FIG. 3(c).
A unity-gain amplifier (321) senses the voltage on its input node (Vhi) and drives its output (Vh) to the same voltage, which equals Vdp during sample mode. In hold mode, when the field-programmable control signal Smp is deactivated, transistor Me4 is turned off such that the voltage on the storage capacitor (Ch) is held at the previously sampled value even when the input voltage Vdp changes. When the field-programmable enable signal (ENI) is activated and the field-programmable reset signal (Rst) is deactivated, transistor Me5 is turned on, transistor Mrst is turned off, and the drain voltage input line (Vd.sub.j) is driven by the unity-gain amplifier (321) to its output voltage Vh. When the field-programmable enable signal (ENI) is deactivated and the field-programmable reset signal (Rst) is activated, transistor Me5 is turned off, transistor Mrst is turned on, and the drain voltage input line (Vd.sub.j) is reset to voltage Vs. The number of transistors (N) connected to the drain voltage input line (Vd.sub.j) is designed to be low enough such that the R*C delay time caused by the parasitic parameters of the input conductor line (Vd.sub.j) is no more than a fraction of the intrinsic delay time of the input circuit (VI1).
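By way of illustration only, the sample, hold, enable, and reset behavior described for the drain voltage input circuit (VI1) can be sketched as an idealized behavioral model. The class name, the method name, and the use of None for an undriven line are assumptions made for this sketch and do not appear in the figures. Amplifier offset, capacitor leakage, and the case where ENI and Rst are activated together are not modeled.

```python
class SampleAndHold:
    """Idealized behavioral model of the S&H input circuit (VI1)."""

    def __init__(self, vs=0.0):
        self.vs = vs    # reset voltage Vs
        self.vhi = vs   # voltage on storage capacitor Ch (assumed initial value)

    def step(self, vdp, smp, eni, rst):
        """Return the voltage placed on the drain voltage input line.

        Returns None when the line is neither driven nor reset
        (an assumption; the text does not describe this case).
        """
        if smp:                 # sample mode: Me4 on, Ch tracks Vdp
            self.vhi = vdp
        vh = self.vhi           # ideal unity-gain amplifier: Vh equals Vhi
        if eni and not rst:     # Me5 on, Mrst off: line driven to Vh
            return vh
        if rst and not eni:     # Me5 off, Mrst on: line reset to Vs
            return self.vs
        return None
```

In this sketch, sampling 0.8 V and then calling step again with Smp deactivated and a different Vdp still returns 0.8 V, which corresponds to the hold behavior described above.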

[0102] The gate terminals of the same subset of the floating-gate transistors (M.sub.1,j, M.sub.2,j, . . . , M.sub.i-1,j, M.sub.i,j, . . . M.sub.N-1,j, M.sub.N,j) that share the same drain voltage input line (Vd.sub.j) in the layer-one block (B.sub.m,n) are connected by another conductor line (Vg.sub.j) to a gate voltage input circuit (VI1g) (313), as shown in FIG. 3(g). A conductor line (e.g., Vg.sub.1, Vg.sub.2, . . . , Vg.sub.j, . . . , Vg.sub.M) that controls the gate voltages of a subset of the computation transistors in the layer-one block (B.sub.m,n) will be called a gate voltage input line in the following discussions. In this embodiment, the sample-and-hold (S&H) circuit (VI1g) in FIG. 3(g) is nearly identical to the S&H circuit in FIG. 3(c), except that one terminal of its storage capacitor (Ch) is connected to a field-programmable reference voltage (Vsg) instead of a fixed voltage (Vs). When Vsg=Vs, VI1g is identical to VI1 and serves the same function of a voltage sample-and-hold circuit, based on the same operating principles as the circuit in FIG. 3(c). In hold mode, a change in Vsg causes a change in the output voltage (Vg.sub.j), providing a convenient way to execute training algorithms of an embodiment of the present invention. The number of transistors (N) connected to the gate voltage input line (Vg.sub.j) is designed to be low enough such that the R*C delay time caused by the parasitic parameters of the gate voltage input line (Vg.sub.j) is no greater than a fraction of the intrinsic delay time of the gate voltage input circuit (VI1g).
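The hold-mode dependence of the gate line voltage on Vsg can be illustrated with a simple charge-conservation sketch. It assumes that the held node (Vhi) is floating in hold mode and that the unity-gain amplifier is ideal. The function name and the parasitic capacitance parameter c_par are hypothetical and are introduced only for this illustration.

```python
def held_output_after_vsg_change(v_sampled, vsg_at_sample, vsg_now,
                                 ch=1.0, c_par=0.0):
    """Ideal hold-mode output of VI1g after the reference Vsg moves.

    v_sampled:     voltage held on node Vhi when sampling ended
    vsg_at_sample: value of Vsg when sampling ended
    vsg_now:       present value of Vsg
    ch:            storage capacitance Ch
    c_par:         hypothetical parasitic capacitance from Vhi to fixed nodes
    """
    # Charge on the floating node is conserved, so the node follows Vsg
    # through the capacitive divider formed by Ch and c_par.
    coupling = ch / (ch + c_par)
    return v_sampled + coupling * (vsg_now - vsg_at_sample)
```

With c_par equal to zero the output shifts by exactly the change in Vsg, so under these assumptions a small step on Vsg perturbs every gate voltage on the line by the same amount without resampling the input.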

[0103] The source terminals of a subset of the floating-gate transistors (M.sub.i,1, M.sub.i,2, . . . , M.sub.i,j-1, M.sub.i,j, . . . M.sub.i,M-1, M.sub.i,M) in the layer-one block (B.sub.m,n) are connected by a branch summation line (315) to a layer-one current summation circuit (IS1) (311), as shown in FIG. 3(b), where j is an integer greater than or equal to 1 and less than or equal to M. During computation, feedback circuits in the layer-one current summation circuits (IS1) fix the voltages on the branch summation lines to a predefined voltage (V.sub.s). Under this configuration, the channel current (Ids.sub.i,j) of the transistor (M.sub.i,j) at the i'th column and j'th row is a function of the threshold voltage (V.sub.Ti,j), drain voltage (Vd.sub.j), source voltage (Vs), and gate voltage (Vg.sub.j) of that transistor, and can be written as Ids.sub.i,j((Vd.sub.j-V.sub.s), (Vg.sub.j-Vs-V.sub.Ti,j)). The number of transistors (M) connected to each branch summation line (Is.sub.i) is designed to be low enough that the total parasitic leakage current on the conductor line is negligible relative to the transistor channel currents. Under these conditions, the branch output current (Is.sub.i) carried by the branch summation line (315) to the layer-one current summation circuit (311) is equal to the current-mode summation of the channel currents (Ids.sub.i,j) of all the transistors (M.sub.i,1, M.sub.i,2, . . . , M.sub.i,j-1, M.sub.i,j, . . . M.sub.i,M-1, M.sub.i,M) connected to the branch summation line:

[00017] ${Is}_{i} = \sum_{j=1}^{M} {Ids}_{i,j}(({Vd}_{j} - Vs), ({Vg}_{j} - Vs - V_{Ti,j})).$
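The branch summation above can be evaluated numerically once a device model is chosen. The following is a minimal sketch that assumes a textbook long-channel square-law MOS model; that model, the parameter k, and both function names are assumptions for illustration and are not taken from this specification, which leaves the function Ids.sub.i,j general.

```python
def channel_current(vds, vov, k=1.0):
    """Assumed long-channel square-law model (not from the specification).

    vds: drain-to-source voltage, i.e. (Vd_j - Vs)
    vov: gate overdrive, i.e. (Vg_j - Vs - VT_ij)
    k:   hypothetical transconductance parameter in A/V^2
    """
    if vov <= 0.0 or vds <= 0.0:
        return 0.0                                 # treated as cut off
    if vds < vov:
        return k * (vov * vds - 0.5 * vds * vds)   # triode region
    return 0.5 * k * vov * vov                     # saturation region


def branch_output_current(vd, vg, vt_branch, vs=0.0, k=1.0):
    """Is_i: sum over rows j of Ids_ij((Vd_j - Vs), (Vg_j - Vs - VT_ij))."""
    return sum(
        channel_current(vd_j - vs, vg_j - vs - vt_ij, k)
        for vd_j, vg_j, vt_ij in zip(vd, vg, vt_branch)
    )
```

Any other device model can be substituted for channel_current without changing the summation itself, which is the only part that corresponds directly to the equation.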

[0104] In this embodiment, the transistor M.sub.i,j serves the functions of a current-mode computation (CMC) cell that executes a scalar function of two scalar inputs, (Vd.sub.j-V.sub.s) and (Vg.sub.j-Vs-V.sub.Ti,j), to generate a CMC cell output current (Ids.sub.i,j) that is a function of the channel current (Ids.sub.i,j) of the computation transistor (M.sub.i,j). A plurality of CMC cells (M.sub.i,1, M.sub.i,2, . . . , M.sub.i,j-1, M.sub.i,j, . . . M.sub.i,M-1, M.sub.i,M) are coupled to a branch summation line (315) to form a current-mode computation (CMC) branch (CB.sub.i). This CMC branch (CB.sub.i) executes a transistor vector-vector operation on two input vectors (Vd.sub.j-Vs).sup.M and (Vg.sub.j-Vs-V.sub.Ti,j).sup.M, and produces a branch output current (Is.sub.i). The branch summation line (315) is an electrical conductor that is electrically coupled to, and receives the CMC cell output currents (Ids.sub.i,1, Ids.sub.i,2, . . . , Ids.sub.i,j-1, Ids.sub.i,j, . . . , Ids.sub.i,M-1, Ids.sub.i,M) produced by, a plurality of CMC cells (M.sub.i,1, M.sub.i,2, . . . , M.sub.i,j-1, M.sub.i,j, . . . M.sub.i,M-1, M.sub.i,M) in the CMC branch (CB.sub.i), and produces a branch output current (Is.sub.i) that is a current-mode summation of all received CMC cell output currents. Under ideal conditions, the current-mode summation of a plurality of electrical currents (Ids.sub.i,1, Ids.sub.i,2, . . . , Ids.sub.i,j-1, Ids.sub.i,j, . . . , Ids.sub.i,M-1, Ids.sub.i,M) equals the sum of those currents (Σ.sub.j Ids.sub.i,j), while in reality non-ideal effects such as parasitic leakage currents or timing delays may introduce inaccuracies. The layer-one current summation circuit (IS1) is electrically coupled to, and receives the branch output current (Is.sub.i) produced by, at least one branch summation line (315), and produces a layer-one current summation circuit output current (Io.sub.i) that is a function of the received branch output currents (Is.sub.i).
Combining the functions of a plurality of the CMC branches (CB.sub.1, CB.sub.2, . . . , CB.sub.i, . . . , CB.sub.N), the layer-one block in FIG. 3(b) executes transistor matrix-vector operations. Combining the functions of a large number of layer-one blocks (B.sub.m,n) in a multiple-layer architecture, millions, billions, trillions or more of the CMC cells can be field-programmed to execute computations in parallel using their computation transistors, together with the branch summation lines that are coupled to those CMC cells and the layer-one current summation circuits that are coupled to those branch summation lines, to execute large-scale multiple-level transistor tensor operations that can comprise millions, billions, trillions or more scalar computations.
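As an illustrative aid, the way N CMC branches sharing the same M input lines form a matrix-vector operation can be sketched as follows. The function names are hypothetical, and ideal_product is an idealized stand-in for the cell function chosen only so that the result can be checked against an ordinary matrix-vector product; an actual computation transistor executes the more general function Ids.sub.i,j.

```python
def layer_one_block_outputs(cell_fn, vd, vg, vt, vs=0.0):
    """Return [Is_1, ..., Is_N] for an N-branch, M-row layer-one block.

    cell_fn(x, w): scalar function executed by each CMC cell
    vd, vg:        length-M lists of drain and gate input line voltages
    vt:            N lists of length M; vt[i][j] is the threshold voltage
                   of the cell in branch i and row j
    """
    return [
        sum(cell_fn(vd[j] - vs, vg[j] - vs - vt_i[j]) for j in range(len(vd)))
        for vt_i in vt
    ]


def ideal_product(x, w):
    """Hypothetical idealized cell function used only for illustration."""
    return x * w
```

For example, with vd = [2.0, 3.0], vg = [1.0, 1.0], and vt = [[0.5, 0.0], [1.0, 0.5]], the effective weights (Vg.sub.j-Vs-V.sub.Ti,j) are [[0.5, 1.0], [0.0, 0.5]], and the sketch returns the two branch outputs 4.0 and 1.5, which is the product of that weight matrix with the input vector.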

[0105] The layer-one block in FIG. 3(b) has two input lines for each row of transistors: one drain voltage input line and one gate voltage input line. This configuration provides the flexibility to field-program the computation mode of each computation transistor, as listed in the following table (Table 2).

TABLE-US-00002 TABLE 2
Gate voltage input line | Drain voltage input line | Transistor computation mode
Constant                | Input                    | Multiplier
Input                   | Constant                 | ReLU
Input                   | Input                    | Rectifier
Input                   | Vs                       | Saturation
Vs                      | Don't care               | Turned off
Source also at Vs       | Vs                       | Turned off

[0106] When the voltage on the gate voltage input line is set to a constant voltage and the drain voltage input line is used to provide input signals, the computation transistors on the row operate in multiplier mode. When the voltage on the drain voltage input line is fixed and the gate voltage input line is used to provide input signals, the computation transistors on the row operate in ReLU mode. When both the drain voltage input line and the gate voltage input line are set to provide the same input voltage, the computation transistors on the row operate in rectifier mode. When the gate voltage input line is set to source voltage Vs, the computation transistors on the row are all turned off. When the drain voltage input line is set to source voltage Vs and the branch summation line is also set to voltage Vs, the computation transistors on the row are all turned off. When the drain voltage input line is set to source voltage Vs, and the branch summation lines are coupled to the layer-one saturation mode current summation circuit (IS1) in FIG. 4(c), the floating-gate transistors on the row operate in saturation mode. In other words, this configuration provides the flexibility to program a selection of the scalar functions for computation transistors.
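The mode selection described above can be summarized as a small lookup sketch. The function name, the string labels for the line settings, and the saturation_summation flag are hypothetical names introduced for this illustration. The sketch assumes that an "input" setting on both lines means the same input voltage is applied to both.

```python
def computation_mode(gate_line, drain_line, saturation_summation=False):
    """Map row line settings to the computation mode described for Table 2.

    gate_line, drain_line: "constant", "input", or "vs"
    saturation_summation:  True if the branch summation lines feed the
                           saturation mode summation circuit of FIG. 4(c)
    """
    if gate_line == "vs":
        return "turned off"                  # drain line is a don't-care
    if drain_line == "vs":
        # With the branch summation line also held at Vs, no current flows.
        return "saturation" if saturation_summation else "turned off"
    if gate_line == "constant" and drain_line == "input":
        return "multiplier"
    if gate_line == "input" and drain_line == "constant":
        return "ReLU"
    if gate_line == "input" and drain_line == "input":
        return "rectifier"                   # same input voltage on both lines
    raise ValueError("combination not listed in Table 2")
```

Combinations that the description does not list, such as constant voltages on both lines, raise an error in this sketch rather than being assigned a mode.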

[0107] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. In the above embodiments, while each current-mode computation (CMC) cell comprises one floating-gate transistor, the CMC cell can also use other types of programmable-threshold voltage transistors. For example, instead of trapping electrical charges in an isolated conductor, it is also possible to trap electrical charges in the insulator layer that forms the gate insulator, creating a programmable-threshold voltage transistor. Each CMC cell can have multiple transistors. CMC cells can be arranged in different geometries instead of simple two-dimensional arrays. The layer-one blocks can be arranged in different geometries instead of two-dimensional arrays. The inputs and outputs can be configured in various directions. Instead of having two sets of input lines, the layer-one blocks can have one set of input lines or more than two sets of input lines. The layer-one block in the above embodiment does not have multiplexers at its inputs and outputs, but multiplexers can be used in embodiments of the present invention to improve array efficiency. A branch summation line does not have to be a line; it can be of any shape, and it can also be a combination of electrical conductors of different shapes or materials. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0108] FIG. 3(d) is a simplified schematic diagram for one embodiment of the layer-one current summation circuit (IS1) in FIG. 3(b). A high gain amplifier (331) fixes the voltage on the branch summation line (315) to a predefined reference voltage V.sub.s and provides an output (Vgm) that controls the gate voltages of three matched transistors (Mm1, Mm2, Mm3) whose source terminals are connected together to the same voltage at Vss. The voltage V.sub.s also can be field-programmable. When the field-programmable input enable signal (Ens) is activated, select transistor Me1 is turned on and the amplifier (331) adjusts the current flowing through Mm1 to be equal to the branch output current (Is.sub.i) by adjusting the gate voltage (Vgm) of matched transistors. Matched transistor Mm2 has the same gate voltage, source voltage, and properties as transistor Mm1. Therefore, when transistor Me2 is turned on by field-programmable enable signal EN2, the channel current flowing through Mm2 is designed to be equal to (Wm2/Wm1)*Is.sub.i, where Wm2 is the effective channel width of transistor Mm2 and Wm1 is the effective channel width of transistor Mm1. Similarly, when transistor Me3 is turned on by field-programmable enable signal EN3, the channel current flowing through Mm3 is designed to be equal to (Wm3/Wm1)*Is.sub.i, where Wm3 is the effective channel width of transistor Mm3. Matched transistors (Mm1, Mm2, Mm3) are arranged in a configuration known as current mirrors in the art of circuit design. The layer-one current summation circuit output current (Io.sub.i) is therefore proportional to the branch output current (Is.sub.i):

[00018] ${Io}_{i} = K_{i} * {Is}_{i}$

[0109] K.sub.i is a scale factor depending on the design of the circuit IS1 and the status of the field-programmable enable signals (Ens, EN2, EN3). For example, when EN2 is activated and EN3 is deactivated, K.sub.i=(Wm2/Wm1); when EN2 is deactivated and EN3 is activated, K.sub.i=(Wm3/Wm1). Having multiple matched transistors (Mm2, Mm3) allows the scale factor K.sub.i to be field-programmable, and we can use more matched transistors to increase the range of selection for the scale factor. The layer-one current summation circuit output current (Io.sub.i) is coupled to the layer-two summation line. If all the field-programmable enable signals (EN2, EN3) are deactivated, this output is isolated from the layer-two summation line.
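The selection of the scale factor K.sub.i by the enable signals can be sketched as follows. The function names are hypothetical, and the input enable signal Ens is assumed to be activated. The text specifies K.sub.i only when exactly one of EN2 and EN3 is activated; treating the case where both are activated as the sum of the two mirror ratios is an assumption of this sketch, based on both mirror outputs feeding the same output node.

```python
def layer_one_scale_factor(wm1, wm2, wm3, en2, en3):
    """K_i from the mirror width ratios and the enable signals EN2, EN3."""
    k = 0.0                 # both disabled: output isolated from layer two
    if en2:
        k += wm2 / wm1      # mirror Mm2 contributes (Wm2/Wm1) * Is_i
    if en3:
        k += wm3 / wm1      # mirror Mm3 contributes (Wm3/Wm1) * Is_i
    return k                # both enabled: summed ratios (assumption)


def layer_one_output_current(is_i, wm1, wm2, wm3, en2, en3):
    """Io_i = K_i * Is_i for an ideal, unsaturated current mirror."""
    return layer_one_scale_factor(wm1, wm2, wm3, en2, en3) * is_i
```

With Wm1 = 1, Wm2 = 2, and Wm3 = 4, for instance, activating only EN2 gives K.sub.i = 2 and activating only EN3 gives K.sub.i = 4.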

[0110] A small enough layer-one block yields negligible parasitic leakage current on the branch summation lines. The CMC cell output currents can be adjusted to be small while still meeting the required signal-to-noise ratio. This not only saves operating power, but also mitigates the current overload problem. The layer-one current summation circuit (IS1) is also designed to prevent the current overload problem. First, M is chosen to be low enough such that when all computation memory cells along the branch summation line are turned on at their maximum currents, the layer-one block still does not exhibit current overload. In addition, the output range of the amplifier (331) is designed so that the circuit outputs a current proportional to the branch output current (Is.sub.i) only when Is.sub.i is less than a pre-calibrated maximum value (Is.sub.max). When Is.sub.i is close to Is.sub.max, the amplifier (331) can no longer hold the voltage of the branch summation line at Vs, which reduces the gate-to-source voltage of all transistors connected to the branch summation line and therefore the branch output current. Meanwhile, the current mirror output current (Io.sub.i) is no longer proportional to the input current, so that the maximum value of the output current is also limited, which prevents the current overload problem at upper layers. In this way, the amplifier (331) and transistor Mm1 form a feedback circuit that can limit the amplitude of the received branch output current (Is.sub.i). Transistor Mm1 and the maximum output voltage of the high gain amplifier (331) define Is.sub.max. Adjusting the size of transistor Mm1 adjusts Is.sub.max. It may be desirable to make the size of Mm1 field-programmable. Such a design effectively prevents the current overload problem and provides a way to limit operating power. The computation remains accurate as well because when Is.sub.i is close to Is.sub.max, the output current is limited by the output function.
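The two overload safeguards described above can be illustrated with a rough sketch. The hard clamp at Is.sub.max is an idealization assumed here; the actual circuit is described as limiting gradually as Is.sub.i approaches Is.sub.max. The row-count rule is one possible reading of the requirement that M be low enough, not a formula given in this specification. Both function names are hypothetical.

```python
def limited_layer_one_output(is_i, is_max, k_i):
    """Idealized output current: proportional below Is_max, clamped above.

    The hard clamp is an assumption; the described feedback circuit
    limits the current smoothly near Is_max.
    """
    return k_i * min(is_i, is_max)


def max_rows_without_overload(is_max, i_cell_max):
    """Largest M for which M cells at maximum current stay within Is_max.

    Both arguments must use the same unit; integer units (for example
    nanoamperes) avoid floating-point rounding in the division.
    """
    return int(is_max // i_cell_max)
```

For example, with Is.sub.max equal to 1000 units and a maximum cell current of 8 units, the second function returns 125 rows.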

[0111] When the input functions are operating in saturation mode, the polarity of the channel currents is changed and the layer-one saturation mode current summation circuit shown in the simplified schematic diagram in FIG. 4(c) can be used. This circuit has the same n-channel current mirrors as those in FIG. 3(d). The difference is that its input stage is a p-channel current mirror formed by two matched transistors (Mp7, Mp8) and two enable transistors (Mpe7, Mpe8), as shown in FIG. 4(c). When field-programmable enable signals Enp7 and Enp8 are both activated, the branch output current (Is.sub.i) is duplicated to the input of the n-channel current mirrors, providing a layer-one saturation mode current summation circuit output current (Io.sub.i) that is compatible with the circuit in FIG. 3(d). In saturation mode, the channel currents of MOS transistors are not sensitive to the value of the drain voltage. The saturation mode circuit therefore does not need a feedback mechanism to fix the voltage of the branch summation line in this case. Because a current mirror is significantly faster than a high gain amplifier (331), saturation mode can achieve better performance than other modes. The circuit in FIG. 4(c) is also able to prevent the current overload problem because current mirror circuits by nature have upper limits on the amplitudes of their output currents.

[0112] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For instance, in the above embodiments, the output currents produced by the layer-one current summation circuits (IS1) of FIG. 3(d) or the layer-one saturation mode current summation circuits of FIG. 4(c) are proportional to the branch output current (Is.sub.i). For more general cases, Io.sub.i can be another function of Is.sub.i instead of a simple proportional relationship. The layer-one current summation circuits (in either form), VI1, and VI1g can be designed in multiple ways. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0113] To avoid the non-ideal parasitic parameter induced problems listed in Table 1, the layer-one block (B.sub.m,n) for embodiments of the present invention typically has a small transistor array (for example: 128 rows by 64 columns) that when combined with other blocks can support large-scale transistor tensor operations. FIG. 3(a) is a simplified symbolic block diagram showing an embodiment of a layer-two block (B.sup.(2)) for embodiments of the present invention. The layer-two summation lines (301) are coupled to a plurality of layer-one blocks (B.sub.m,n) as shown in FIG. 3(a). An individual layer-two summation line comprises an electrical conductor that is electrically coupled to, and receives the layer-one current summation circuit output currents produced by, a plurality of layer-one current summation circuits (IS1 or IS1) in the layer-one blocks (B.sub.m,n), and that produces a layer-two summation line output current that is a current-mode summation of all received layer-one current summation circuit output currents. FIG. 3(e) is a simplified schematic diagram for one embodiment of the layer-two current summation circuit (IS2) in FIG. 3(a). In this embodiment, four matched transistors (Mp1, Mp2, Mp3, Mp4) and four enable transistors (Mpe1, Mpe2, Mpe3, Mpe4) are configured as current mirrors, as shown in FIG. 3(e). P-channel transistors are used in IS2 because, in this embodiment, the polarity of the input current (Is.sup.(2)) is opposite to that of the layer-one current summation circuits (IS1). When the field-programmable enable signals Enp1 and Enp2 are activated and Enp3 is deactivated, the layer-two summation circuit output current Io.sup.(2)=(Wp2/Wp1)Is.sup.(2), where Wp2 is the effective channel width of transistor Mp2 and Wp1 is the effective channel width of transistor Mp1. 
When the field-programmable enable signals Enp1 and Enp3 are activated and Enp2 is deactivated, the layer-two summation circuit output current Io.sup.(2)=(Wp3/Wp1)Is.sup.(2), where Wp3 is the effective channel width of transistor Mp3. When the field-programmable enable signals Enp1 and Enp4 are activated, a layer-two output circuit input current Iso=(Wp4/Wp1)Is.sup.(2) is produced as the input current to the interface output circuit (Vout) illustrated in FIG. 3(f), where Wp4 is the effective channel width of transistor Mp4. IS2 is also able to prevent the current overload problem because current mirror circuits by nature have an upper limit on the output currents. An individual layer-two current summation circuit (IS2) is electrically coupled to, and receives the layer-two summation line output current (Is.sup.(2)) produced by, at least one layer-two summation line (301), and produces a layer-two current summation circuit output current (Io.sup.(2)) that is a function of all received layer-two summation line output currents.
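The ratio selection described above for IS2 can be illustrated with a minimal behavioral sketch. It assumes an ideal current mirror whose output equals the width ratio times the input current, and it assumes that a disabled leg contributes no current; the function name, the dictionary layout, and the numeric widths are hypothetical and are not taken from FIG. 3(e).

```python
def mirror_output(i_in, widths, enabled, output_leg):
    """Ideal current-mirror model: Io = (W_out / W_in) * I_in.

    Assumption: the input leg (Mp1) and the selected output leg must both
    be enabled; otherwise the output current is taken to be zero.
    """
    if not (enabled.get("Mp1", False) and enabled.get(output_leg, False)):
        return 0.0
    return (widths[output_leg] / widths["Mp1"]) * i_in


# Hypothetical effective channel widths (arbitrary units).
widths = {"Mp1": 1.0, "Mp2": 2.0, "Mp3": 4.0, "Mp4": 1.0}

# Enp1 and Enp2 activated, Enp3 deactivated: Io = (Wp2/Wp1) * Is.
io_via_mp2 = mirror_output(10e-6, widths, {"Mp1": True, "Mp2": True, "Mp3": False}, "Mp2")

# Enp1 and Enp3 activated, Enp2 deactivated: Io = (Wp3/Wp1) * Is.
io_via_mp3 = mirror_output(10e-6, widths, {"Mp1": True, "Mp2": False, "Mp3": True}, "Mp3")
```

In this idealized model the field-programmable enable signals simply select which width ratio scales the layer-two summation line output current.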

[0114] FIG. 3(f) is a simplified schematic diagram for one embodiment of the interface output circuit (Vout) in FIG. 3(e). The layer-two output circuit input current (Iso) in FIG. 3(e) passes through a variable impedance formed by a diode (D1) and 4 field-programmable variable resistors (VR1, VR2, VR3, VR4) to generate a voltage (Vi4) connected to the input of a programmable gain amplifier (351) which drives a layer-two output voltage (Avo) that is proportional to the voltage (Vi4) at the input of the amplifier (351). The gain of this programmable gain amplifier (351) is designed to be adjustable by a reference transistor (fgTa), as shown in FIG. 3(f). This reference transistor (fgTa) matches the computation transistors used in CMC cells such that it has matched temperature dependence in channel current. Using this reference transistor (fgTa) as a thermometer to adjust the gain of the amplifier (351) can mitigate temperature induced variations. The field-programmable gate voltage (Vga) allows field-programmable gain control of the amplifier (351). Avo can be used as input voltage for the next level of neural network computation; it is also connected to an analog-to-digital converter (ADC) to provide digital outputs (DVo) to digital interfaces. The field-programmable variable resistors (VR1, VR2, VR3, VR4) can typically be implemented by transistors with field-programmable gate voltages. These variable resistors (VR1, VR2, VR3, VR4) allow the interface output circuit (Vout) to support different output functions. For example, when VR2 and VR3 are set to zero, Vi4 is a sigmoid function with minimum level controlled by the field-programmable voltage Vref; when VR1 and VR4 are set to zero, the output is a ReLU function; when VR3 and VR4 are set to zero, the output is a linear function. This design allows field-programmable individual selection of output functions for transistor tensor operations.

[0115] The layer-two summation lines (301) and layer-two current summation circuits (IS2) provide effective ways to expand the number of inputs for transistor tensor operations. For example, if a layer-one block can support 128 inputs, and a layer-two summation line (301) is coupled to 64 layer-one current summation circuits (IS1), then the layer-two block with two-layer current summation architecture can support transistor tensor operations with up to 8192 inputs. If fewer inputs are needed, unused rows or layer-one blocks are disabled. If more inputs are needed, then more columns of layer-one blocks or layer-three summation lines can be used. Biases can be implemented as one layer-one row with constant inputs.

[0116] Layer-two summation lines (301) also can suffer parasitic parameter induced non-ideal effects similar to those in Table 1. It is important to choose a proper number (U) of layer-one current summation circuits (IS1) coupled to each layer-two summation line to avoid such parasitic induced errors. Because current mirrors are not sensitive to the voltage on input lines, parasitic voltage differences are less important. The width and the length of the layer-two summation line should be designed properly to avoid significant R*C delay. The current amplification factor in the layer-one current summation circuit (IS1) should be chosen properly to have a good signal-to-noise ratio and optimum power consumption.

[0117] The layer-two block in FIG. 3(a) also has layer-two input circuits (VI2) that drive layer-two input lines (303). These layer-two input circuits (VI2) also can be implemented by S&H circuits similar to that (VI1) of FIG. 3(c). The layer-two input lines (303) are coupled to the inputs of layer-one input circuits (VI1, VI1g) in layer-one blocks (B.sub.m,n). Therefore, the voltages on the layer-one input lines (Vd.sub.j, Vg.sub.j) in each layer-one block can be controlled by layer-two input circuits (VI2). This multiple-layer input configuration supports field-programmable input configurations that approach the flexibility of prior art software configurations. Such flexible multiple-layer input configurations can be used to duplicate the same sets of inputs into different layer-one blocks that have different branch summation lines in order to expand the number of outputs. For example, if a layer-one block can support 64 outputs, and each layer-two input line (303) is coupled to 32 sets of layer-one input circuits (VI1, VI1g), then the layer-two block with two-layer input architecture can support transistor tensor operations with up to 2048 outputs. If fewer outputs are needed, unused columns or layer-one blocks are disabled. If more outputs are needed, more columns of layer-one blocks or layer-three input lines can be used to handle more outputs.
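The capacity figures quoted for the two-layer architecture follow from simple multiplication. The sketch below reproduces that arithmetic for the example numbers given in the text (128 inputs and 64 outputs per layer-one block, a fan-in of 64 on each layer-two summation line, a fan-out of 32 on each layer-two input line); the function names are illustrative only.

```python
def max_inputs(inputs_per_layer_one_block, is1_per_layer_two_summation_line):
    """Upper bound on inputs when branch results are combined on one layer-two summation line."""
    return inputs_per_layer_one_block * is1_per_layer_two_summation_line


def max_outputs(outputs_per_layer_one_block, vi1_sets_per_layer_two_input_line):
    """Upper bound on outputs when one layer-two input line feeds several layer-one blocks."""
    return outputs_per_layer_one_block * vi1_sets_per_layer_two_input_line


two_layer_inputs = max_inputs(128, 64)    # 128 * 64
two_layer_outputs = max_outputs(64, 32)   # 64 * 32
```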

[0118] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For example, the layer-two, layer-three, and other upper-layer summation lines and input lines can be in different shapes (do not need to be a line) or travel along different directions. These summation lines can be a combination of different layers of metals, vias, contacts, and other components. Blocks in different layers do not need to have the same dimensions or structures, and they can be arranged in different geometries instead of two-dimensional arrays. IS2, IS3, and VOUT can be designed in multiple ways. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein. The input lines of the layer-one block in FIG. 3(b) are always horizontal to the branch summation lines; such configuration can support transistor matrix-vector (TMV) operations, while layer-one blocks for embodiments of the present invention can be configured in other ways to support other tensor operations such as TVV or TMM operations.

[0119] FIG. 3(h) shows another layer-one block (370) that comprises a plurality of current-mode computation (CMC) branches (CM.sub.1, CM.sub.2, . . . , CM.sub.i-1, CM.sub.i, . . . , CM.sub.N), where i is an integer greater than or equal to 1 and less than or equal to N. An individual CMC branch (CM.sub.i) comprises (a) a plurality of CMC cells (C.sub.i,1, C.sub.i,2, . . . , C.sub.i,j-1, C.sub.i,j, . . . , C.sub.i,M), where an individual CMC cell comprises at least one computation transistor and produces a CMC cell output current that is a function of the channel current of the computation transistor, such as the exemplary CMC cells illustrated in FIGS. 3(i-m); and (b) a branch summation line (371) that comprises an electrical conductor that is electrically coupled to, and receives the CMC cell output currents produced by, a plurality of the CMC cells (C.sub.i,1, C.sub.i,2, . . . , C.sub.i,j-1, C.sub.i,j, . . . , C.sub.i,M) in the CMC branch (CM.sub.i), and that produces a branch output current (Is.sub.i) that is a current-mode summation of all received CMC cell output currents. The branch summation line (371) is connected to a layer-one current summation circuit (IS1) that receives the branch output current (Is.sub.i) produced by the branch summation line (371), and produces a layer-one current summation circuit output current (Io.sub.i) that is a function of the received branch output current. In this embodiment, each layer-one current summation circuit (IS1) is coupled to one CMC branch (CM.sub.i), while a layer-one current summation circuit can also be coupled to multiple CMC branches. Multiplexers can sometimes be used between the CMC branches and the layer-one current summation circuits.
Field-programmable selection circuits can be used to selectively activate a subset of or all of the layer-one and layer-two current summation circuits in order to combine the computation capabilities of different CMC branches to support parts of or all of a large-scale transistor tensor operation.

[0120] The input connections to the CMC branches (CM.sub.1, CM.sub.2, . . . , CM.sub.i-1, CM.sub.i, . . . , CM.sub.N) are not shown in FIG. 3(h) because there are many possible configurations to arrange the input connections; different input configurations can support different types of transistor tensor operations. For example, the layer-one block (B.sub.m,n) shown in FIG. 3(b) is a special case when the input lines (Vd.sub.1, Vd.sub.2, . . . , Vd.sub.j, . . . , Vd.sub.M; Vg.sub.1, Vg.sub.2, . . . , Vg.sub.j, . . . , Vg.sub.M) connected to its CMC branches (CB.sub.1, CB.sub.2, . . . , CB.sub.i-1, CB.sub.i, . . . , CB.sub.N-1, CB.sub.N) all travel in a horizontal direction and are shared by every CMC branch in the layer-one block; such a configuration is suitable to support transistor matrix-vector operations. FIG. 9(a) shows an embodiment of a different configuration that is suitable to support convolution operations.

[0121] FIG. 3(i) shows the schematic symbol for an embodiment of a CMC cell (991) that can be used for the layer-one block (370) illustrated in FIG. 3(h). In this embodiment, the CMC cell (991) comprises one computation transistor (MM.sub.i,j) that is controlled by the gate-to-source voltage (VG.sub.i,j) and drain-to-source voltage (VD.sub.i,j) and outputs its channel current (I.sub.i,j) as the CMC cell output current. The channel current (I.sub.i,j) is a function of VG.sub.i,j and VD.sub.i,j according to the I-V relationship of the computation transistor (MM.sub.i,j). If the computation transistor is operating in triode region, we have

[00019] $I_{i, j} = K_{n} * {VD}_{i, j} * ({VG}_{i, j} - V_{T})$

where V.sub.T is the threshold voltage of the computation transistor (MM.sub.i,j). In this embodiment, the branch output current (Is.sub.i) of CMC branch CM.sub.i is the current-mode summation of CMC cell output currents (I.sub.i,1, I.sub.i,2, . . . , I.sub.i,j, . . . , I.sub.i,M) of all the CMC cells (C.sub.i,1, C.sub.i,2, . . . , C.sub.i,j, . . . , C.sub.i,M) coupled to the branch summation line (371) as:

[00020] ${Is}_{i} = \sum_{j=1}^{M} I_{i, j} = \sum_{j=1}^{M} [K_{n} * {VD}_{i, j} * ({VG}_{i, j} - V_{T})],$

where we assume the V.sub.T of all computation transistors is the same because they are matched transistors manufactured by IC technology. The above equation shows that the summation current is proportional to the dot product of two input vectors (VD.sub.i,1, . . . , VD.sub.i,j, . . . , VD.sub.i,M) and ((VG.sub.i,1-V.sub.T), . . . , (VG.sub.i,j-V.sub.T), . . . , (VG.sub.i,M-V.sub.T)). In this embodiment, a CMC branch executes the transistor vector-vector operation of two input vectors and produces a branch output current to represent the transistor vector-vector computation results. The computation transistor (MM.sub.i,j) can use body effects as a way to reduce parasitic leakage currents. One convenient approach is to use a native transistor as the computation transistor (MM.sub.i,j). A native transistor is an MOS transistor with a threshold voltage close to zero. Native transistors can be manufactured by proper adjustment of threshold voltage implant dosage, by adjusting the storage charge of a field-programmable-threshold voltage transistor, or by other methods. Using native transistors, we have Is.sub.i=Σ.sub.j=1.sup.M I.sub.i,j=Σ.sub.j=1.sup.M[K.sub.n*VD.sub.i,j*VG.sub.i,j].
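The relationship between the branch output current and the dot product can be checked numerically. The following is a minimal sketch that assumes the idealized triode-region expression given above, with identical K.sub.n and V.sub.T for every cell; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def triode_cell_current(vd, vg, k_n, v_t=0.0):
    """Idealized CMC cell output current: I = K_n * VD * (VG - V_T)."""
    return k_n * vd * (vg - v_t)


def branch_output_current(vd_vec, vg_vec, k_n, v_t=0.0):
    """Current-mode summation of all cell currents on one branch summation line."""
    if len(vd_vec) != len(vg_vec):
        raise ValueError("both input vectors must have the same length M")
    return sum(triode_cell_current(vd, vg, k_n, v_t) for vd, vg in zip(vd_vec, vg_vec))


# Hypothetical values: K_n in A/V^2, voltages in volts.
k_n, v_t = 1e-4, 0.5
vd = [0.1, 0.2, 0.05]
vg = [1.0, 0.8, 1.5]

is_i = branch_output_current(vd, vg, k_n, v_t)

# The same quantity written as K_n times the dot product of VD and (VG - V_T).
dot = sum(d * (g - v_t) for d, g in zip(vd, vg))
```

With V.sub.T set to zero the same function models the native-transistor case, where the branch output current is proportional to the dot product of the VD and VG vectors directly.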

[0122] Each CMC branch (CM.sub.i) in FIG. 3(h) with the CMC cells illustrated in FIG. 3(i) is able to execute the transistor vector-vector operation of two vectors of size M. The input vectors of a CMC branch can be a set of various kinds of parameters. For example, the vector can be parts of or all of external input vectors, parts of or all of internal vectors generated during deep level transistor tensor operations, a set of data stored in CMC memory cells, parts of or all of Hadamard parameters used for transistor convolution operations, parts of signed data, parts of multiple-precision data, or many other types of parameters. If the size of the input vector is larger than M, we can use layer-two summation lines to combine CMC branches in multiple layer-one blocks to execute dot products of large vectors. The number of layers can be increased to accommodate larger vector sizes. If vector components have both positive and negative numbers, the methods described in FIGS. 6(a-d) can be used to support negative components. If the computations require greater accuracy, the methods described in FIG. 6(e) can be used. With field-programmable configurations, CMC branches for embodiments of the present invention are able to support a wide variety of dot product computations. Since all tensor operations are based on vector dot products, CMC branches for embodiments of the present invention are therefore able to support vector-vector, matrix-vector, matrix-matrix, and other kinds of tensor operations of various size, level, and rank while achieving unprecedented performance by parallel computations.
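The way a dot product larger than M can be split across several CMC branches and recombined on a layer-two summation line can be sketched as follows. This is a behavioral illustration only: it assumes native transistors (V.sub.T of zero), a normalized K.sub.n, and unity gain in the layer-one current summation circuits; the function name and the block size are hypothetical.

```python
def dot_via_branches(x, w, branch_size, k_n=1.0):
    """Split a long dot product into per-branch partial sums, then add the partials.

    Each partial sum stands for one branch output current; the final sum
    stands for the layer-two summation line output current.
    """
    if len(x) != len(w):
        raise ValueError("input vectors must have the same length")
    partials = []
    for start in range(0, len(x), branch_size):
        xs = x[start:start + branch_size]
        ws = w[start:start + branch_size]
        partials.append(k_n * sum(a * b for a, b in zip(xs, ws)))
    return sum(partials), partials


# A length-10 vector handled by branches that hold only 4 cells each.
x = [float(v) for v in range(1, 11)]
w = [1.0] * 10
total, partials = dot_via_branches(x, w, branch_size=4)
```

The last branch is only partly used in this example, which corresponds to leaving unused cells disabled.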

[0123] In addition, the CMC cell in FIG. 3(i) not only can operate in the triode region to support multiplier mode, but can also operate in other modes such as ReLU, saturation, or rectifier modes. In other words, CMC branches for embodiments of the present invention can support transistor vector-vector operations using scalar functions other than multiplication, such as ReLU, saturation, or rectifier functions. Since all transistor tensor operations are based on transistor vector-to-vector operations, CMC branches for embodiments of the present invention are therefore able to support transistor vector-vector, transistor matrix-vector, transistor matrix-matrix, and other kinds of transistor tensor operations of various size, depth, and rank while achieving unprecedented performance by parallel computations executed by CMC cells.

[0124] Transistor tensor convolution operations also can be supported by CMC branches as illustrated by the embodiment in FIG. 9(a). In this embodiment, components of a 3 by 3 kernel matrix are used as an input vector (W.sub.11, W.sub.12, W.sub.13, W.sub.21, W.sub.22, W.sub.23, W.sub.31, W.sub.32, W.sub.33) to a plurality of CMC branches (CG.sub.11, CG.sub.12, . . . , CG.sub.17, . . . , CG.sub.31, CG.sub.32, . . . , CG.sub.37, . . . ). In this embodiment, the index (ij) of each CMC branch (CG.sub.ij) is the same as the index of the starting component (A.sub.ij) of the input matrix. In FIG. 9(a), each CMC cell is represented by a rectangular box with two numbers; the number on the left (A.sub.ij) represents which element from the input matrix is provided to the input terminal of the CMC cell; and the number on the right represents which element in the convolution matrix is provided to the other input terminal of the CMC cell. For example, FIG. 9(b) shows the schematic view of one of the CMC cells (901) in one of the CMC branches (CG.sub.33). This CMC cell (901) comprises one computation transistor (MW.sub.32); the drain-to-source voltage of the computation transistor is controlled to be equal to the element A.sub.54 of the input matrix; the gate-to-source voltage of the computation transistor is controlled to be (W.sub.32+V.sub.T), where V.sub.T is the threshold voltage of the computation transistor (MW.sub.32). Therefore, the CMC cell output current (I.sub.c) of the CMC cell (901) is I.sub.c=A.sub.54*W.sub.32, which is received by the branch summation line (903) of the CMC branch (CG.sub.33). FIG. 9(b) is another embodiment that shows the schematic view of another CMC cell (902) in another CMC branch (CG.sub.36).
This CMC cell (902) comprises one computation transistor (MW.sub.23); the drain-to-source voltage of the computation transistor is controlled to be equal to the element A.sub.48 of the input matrix; the gate-to-source voltage of the computation transistor is controlled to be (W.sub.23+V.sub.T), where V.sub.T is the threshold voltage of the computation transistor (MW.sub.23); therefore, the output current (I.sub.c) of the CMC cell (902) is I.sub.c=A.sub.48*W.sub.23, which is collected by the branch summation line (904) of the CMC branch (CG.sub.36). The CMC cells are therefore executing the functions of a multiplier. The CMC cells also can operate in other operation modes such as ReLU mode, saturation mode, or rectifier mode. Each component of the input matrix can be used as the input of up to 9 CMC cells in 9 different CMC branches. For example, A.sub.3,4 is the input to the 9 CMC cells marked with shaded rectangles in FIG. 9(a). When inputs to the CMC branches are configured as shown in FIG. 9(a), CMC branch (CG.sub.ij) executes one transistor Hadamard product that is equivalent to the transistor vector-vector operation of two input vectors (W.sub.11, W.sub.12, W.sub.13, W.sub.21, W.sub.22, W.sub.23, W.sub.31, W.sub.32, W.sub.33) and (A.sub.i,j, A.sub.i,j+1, A.sub.i,j+2, A.sub.i+1,j, A.sub.i+1,j+1, A.sub.i+1,j+2, A.sub.i+2,j, A.sub.i+2,j+1, A.sub.i+2,j+2). Therefore, the CMC branches in FIG. 9(a) can support convolution operations and calculate large numbers of transistor Hadamard products in parallel.
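The input arrangement of FIG. 9(a), as described above, can be sketched in software: branch CG.sub.ij pairs kernel element W.sub.pq with input element A.sub.i+p-1,j+q-1 and sums the nine products. The sketch below assumes ideal multiplier-mode cells with a normalized proportionality constant; the function names are hypothetical, and indices are 1-based to match the text.

```python
def paired_input_index(i, j, p, q):
    """Input-matrix element paired with kernel element W_pq in branch CG_ij (1-based)."""
    return (i + p - 1, j + q - 1)


def convolution_branches(a, w):
    """Return {(i, j): branch output} for every branch CG_ij that fits inside the input matrix.

    Each branch output is the sum of the elementwise products of the kernel
    and the kernel-sized window of the input matrix that starts at A_ij.
    """
    kh, kw = len(w), len(w[0])
    rows, cols = len(a), len(a[0])
    out = {}
    for i in range(1, rows - kh + 2):
        for j in range(1, cols - kw + 2):
            total = 0
            for p in range(1, kh + 1):
                for q in range(1, kw + 1):
                    r, c = paired_input_index(i, j, p, q)
                    total += w[p - 1][q - 1] * a[r - 1][c - 1]
            out[(i, j)] = total
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
w = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
branch_outputs = convolution_branches(a, w)
```

The index helper reproduces the two cells discussed above: W.sub.32 in branch CG.sub.33 is paired with A.sub.54, and W.sub.23 in branch CG.sub.36 is paired with A.sub.48. In hardware all branches operate in parallel; the nested loops here only enumerate them.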

[0125] FIG. 9(c) shows the schematic diagram for an exemplary circuit that can be used as the convolution current summation circuits (S.sub.ij) in FIG. 9(a). The first half (921) of the circuit is the same as the layer-one current summation circuit (IS1) in FIG. 3(d), which uses current mirrors to generate a convolution current summation circuit output current (Io.sub.ij) that is proportional to the branch output current (Is.sub.ij) provided by the branch summation line. In addition, a current-to-voltage converter (922) generates a convolution current summation circuit output voltage (AV.sub.ij) that is proportional to the branch output current (Is.sub.ij). The gain of this current-to-voltage converter (922) is field-programmable and is temperature compensated using a matched transistor (923) as a thermometer.

[0126] FIG. 9(d) shows the schematic diagram for an exemplary pulling circuit (929) that takes the current-mode summation of the convolution current summation circuit output currents (Io.sub.11, Io.sub.12, Io.sub.13, Io.sub.21, Io.sub.22, Io.sub.23, Io.sub.31, Io.sub.32, Io.sub.33) using a pulling summation line (925), and uses a current-to-voltage converter (926) to generate an output voltage (VA.sub.11) that is proportional to the summation of the convolution current summation circuit output current (Io.sub.11+Io.sub.12+Io.sub.13+Io.sub.21+Io.sub.22+Io.sub.23+Io.sub.31+Io.sub.32+Io.sub.33). The gain of the current-to-voltage converter (926) is field-programmable and temperature compensated using a matched transistor (927) as a thermometer. Therefore, this circuit (929) can support average-pulling operations.

[0127] FIG. 9(e) shows the schematic diagram for an exemplary pulling circuit (920) that connects the anode terminals of 9 native rectifiers (D.sub.11, D.sub.12, D.sub.13, D.sub.21, D.sub.22, D.sub.23, D.sub.31, D.sub.32, D.sub.33) to the convolution current summation circuit output voltages (AV.sub.11, AV.sub.12, AV.sub.13, AV.sub.21, AV.sub.22, AV.sub.23, AV.sub.31, AV.sub.32, AV.sub.33) of 9 convolution current summation circuits (S.sub.ij); the cathode terminals of the native rectifiers are connected together to one terminal of a load resistor (924) and to the input of a voltage amplifier (928), as shown in FIG. 9(e). Each native rectifier (D.sub.11, D.sub.12, D.sub.13, D.sub.21, D.sub.22, D.sub.23, D.sub.31, D.sub.32, D.sub.33) comprises one native n-channel MOS transistor which has a threshold voltage close to zero. The drain and gate terminals of the native MOS transistor are connected together as the anode terminal of the native rectifier; the source terminal of the native MOS transistor is used as the cathode terminal of the native rectifier. In this configuration, the cathode voltage is always greater than or equal to the anode voltage of the native rectifier. The input voltage to the voltage amplifier (928) is therefore equal to the maximum voltage among all input voltages (AV.sub.11, AV.sub.12, AV.sub.13, AV.sub.21, AV.sub.22, AV.sub.23, AV.sub.31, AV.sub.32, AV.sub.33). The output voltage (VM.sub.11) is proportional to the maximum of the input voltages. The gain of the voltage amplifier (928) is field-programmable and temperature compensated using a matched transistor as a thermometer. Therefore, this circuit (920) can support max-pulling operations.
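The average-pulling and max-pulling operations described for FIGS. 9(d) and 9(e) (usually called average pooling and max pooling in the neural-network literature) reduce to a scaled sum and a scaled maximum. The sketch below is a behavioral model only: it assumes ideal native rectifiers with zero threshold voltage, and the gain values are hypothetical stand-ins for the field-programmable gains of the converter (926) and the amplifier (928).

```python
def average_pulling(output_currents, gain):
    """Model of circuit (929): output voltage proportional to the sum of the input currents."""
    return gain * sum(output_currents)


def max_pulling(output_voltages, gain):
    """Model of circuit (920): the shared cathode node follows the largest anode voltage."""
    return gain * max(output_voltages)


# Nine hypothetical inputs, as for a 3 by 3 group of convolution current summation circuits.
io = [1e-6 * k for k in range(1, 10)]          # currents Io_11 ... Io_33, in amperes
av = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]  # voltages AV_11 ... AV_33, in volts

# Choosing a gain of 1/9 (in arbitrary V/A units) makes the summed output an average.
va_11 = average_pulling(io, gain=1.0 / 9.0)
vm_11 = max_pulling(av, gain=2.0)
```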

[0128] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For example, in the above embodiment the kernel matrix parameters are represented by gate voltages of the computation transistor in the CMC cells; the drain voltage can also be used for the same purpose. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein. The above embodiments in FIGS. 9(a-e) are simplified examples for convolutions using a 3 by 3 kernel matrix, but the same principles are applicable to wide varieties of tensor convolution operations. For example, to support a 4 by 4 kernel, each CMC branch should have at least 16 CMC cells and each input matrix element can be the input to 16 different CMC branches. The pulling circuits in FIG. 9(d, e) can have more or fewer inputs. Other types of pulling circuits also can be used. To support a standard convolution over a color image, the kernel is typically three dimensional, such as 5 by 5 by 3. Each CMC branch in this case must comprise at least 75 CMC cells. We can also use a layer-two summation line to combine more than one CMC branch to support one transistor Hadamard product. Architectures for embodiments of the present invention can also be field programmed to support multiple levels of convolution tensor operations and/or pulling operations, in combination with multiple levels of transistor matrix-vector operations of different dimensions. Such operations can be done in one integrated circuit die or by a combination of multiple dice of integrated circuits.
Tensor operations can be executed in parallel and/or pipelined using integrated circuits in architectures for embodiments of the present invention, achieving performances that are orders of magnitude greater than prior art GPUs and TPUs. It is highly desirable to connect multiple dice of integrated circuits of embodiments of the present invention using inter-dice connections illustrated in FIG. 5(d, e). It is also highly desirable to combine other types of integrated circuits such as image sensors, CPUs, GPUs, TPUs, image displays, and other ICs with ICs of embodiments of the present invention by inter-dice connections.

[0129] FIG. 3(j) shows the schematic symbol for another embodiment of a CMC cell (992) that can be used in the layer-one block (B.sub.m,n) illustrated in FIG. 3(h). In this embodiment, the CMC cell (992) comprises one field-programmable-threshold voltage transistor (FM.sub.i,j) that is controlled by the gate-to-source voltage (VG.sub.i,j), drain-to-source voltage (VD.sub.i,j), and the storage charge stored in its floating-gate (Q.sub.i,j). Its channel current (I.sub.i,j) is the CMC cell output current. The channel current (I.sub.i,j) is a function of (VG.sub.i,jV.sub.T0C.sub.eQ.sub.i,j) and VD.sub.i,j according to the I-V relationship of the transistor. A CMC branch equipped with such CMC cells (992) is therefore able to support transistor vector-vector operations of two input vectors (VD.sub.i,1, . . . , VD.sub.i,j, . . . , VD.sub.i,M) and ((VG.sub.i,1V.sub.T0C.sub.eQ.sub.i,1), . . . , (VG.sub.i,jV.sub.T0C.sub.eQ.sub.i,j), . . . , (VG.sub.i,MV.sub.T0C.sub.eQ.sub.i,M)). The most common embodiment of field-programmable-threshold voltage transistors are floating-gate MOS transistors. The floating-gate transistor (FM.sub.i,j) not only can execute scalar functions supported by common MOS transistors as a computation transistor, but can also serve as a memory device for storing operation parameters.

[0130] FIG. 3(k) shows the schematic symbol for another embodiment of a CMC cell (993) that can be used in FIG. 3(h). This CMC cell comprises one computation transistor (MD.sub.i,j) that executes scalar computations and provides its channel current (I.sub.i,j) as the cell output current of the CMC cell. One select transistor (MW.sub.i,j) provides the gate voltage (VG.sub.i,j) of the computation transistor from a cell gate voltage input line (BL) when the select transistor is activated by its gate select signal (WL) and isolates the gate terminal of the computation transistor from the cell gate voltage input line (BL) when the select transistor is deactivated by WL. This select transistor (MW.sub.i,j) can be an MOS transistor (including a native transistor), a field effect transistor, or other types of transistors; it also can be a variable resistor controlled by WL. This CMC cell (993) is able to support all transistor tensor operations that the CMC cell (991) in FIG. 3(i) can support. In addition, this CMC cell (993) also functions as a memory device by holding the value of the gate voltage of the computation transistor (MD.sub.i,j). However, it cannot hold the voltage indefinitely due to non-ideal effects such as leakage currents. To maintain the value of the gate voltage, the value can be refreshed by rewriting the same value periodically, similar to how memory is refreshed in dynamic random access memory (DRAM). This CMC cell (993) is therefore a current-mode computation dynamic memory cell.

[0131] FIG. 3(l) shows the schematic symbol for another embodiment of a CMC cell (994) that is nearly identical to the CMC dynamic memory cell (993) in FIG. 3(k). This cell however has a storage-control capacitor (C.sub.i,j); one terminal of (C.sub.i,j) is coupled to the gate terminal of the computation transistor (MD.sub.i,j) and the other terminal is coupled to a field-programmable CMC cell gate voltage control signal (GC). When the voltage on GC is a constant, the storage-control capacitor (C.sub.i,j) behaves as a storage capacitor that helps to reduce the influence of leakage currents and voltage coupling problems, similar to the functions of the storage capacitors in DRAM memory cells. When GC is controlled as an input signal, then the CMC cell in FIG. 3(l) functions as a floating gate device that can support the functions of the CMC cell in FIG. 3(j), where VG.sub.i,j behaves as the floating gate and GC as the gate of the floating gate device, except that the storage charge of this device may need to be refreshed periodically.

[0132] FIG. 3(m) shows the schematic symbol for an embodiment of a plurality of CMC cells (995, 996, 997) that share one select transistor (MW.sub.s) that provides the gate voltage (VG.sub.s) of the computation transistors in those CMC cells (995, 996, 997) from a cell gate voltage input line (BL) when the select transistor is activated by its gate select signal (WL) and isolates the gate terminals of those computation transistors from the cell gate voltage input line (BL) when the select transistor is deactivated by WL. The CMC cells (995, 996, 997) can be in different CMC branches. The circuit in FIG. 3(m) is also a type of CMC dynamic memory cell because it can store and maintain data with dynamic refresh procedures. Such a design is useful when the same input is shared by multiple CMC cells, such as when performing transistor convolution tensor operations. For CMC dynamic memory cells of embodiments of the present invention, reduction of transistor leakage currents is more important than transistor performance. Use of body effects by proper selection of substrate voltages can reduce leakage currents.

[0133] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. The transistor in the CMC cell shown in FIG. 3(i) is an n-channel MOS transistor, while other types of transistors such as p-channel transistors, field-effect transistors (FET), bipolar transistors, and combinations of other types of electrical devices also can be used in CMC cells. The CMC cell output current does not have to be the drain-to-source current; a source-to-drain current also can be used as the output current. Combinations of channel currents from multiple sources also can be used as the CMC cell output current. Current art IC technologies typically manufacture transistors that are optimized for speed. For transistors used in CMC cells of embodiments of the present invention, low leakage and low power are of greater priority. Body effects can be used to reduce leakage currents of the transistors in the CMC cells. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0134] FIG. 4(a) is a simplified symbolic block diagram of an embodiment of a layer-three block (B.sup.(3)) of the present invention. This layer-three block (B.sup.(3)) comprises R columns and T rows of layer-two blocks (B.sup.(2).sub.r,t), where R and T are positive integers, r is an integer greater than or equal to 1 and less than or equal to R, and t is an integer greater than or equal to 1 and less than or equal to T. The layer-three summation lines (401) are coupled to the layer-two current summation circuit (IS2) inside those layer-two blocks (B.sup.(2).sub.r,t). Embodiments of layer-two blocks are illustrated in FIG. 3(a, h), while an embodiment of layer-two current summation circuit (IS2) is illustrated in FIG. 3(e). An individual layer-three summation line (401) comprises an electrical conductor that is electrically coupled to, and receives the layer-two current summation circuit output currents produced by, a plurality of layer-two current summation circuits (IS2), and that produces a layer-three summation line output current (Is.sup.(3)) that is a current-mode summation of the received layer-two current summation circuit output currents (ideally, Is.sup.(3)=Σ.sub.s Io.sup.(2).sub.s). This layer-three block (B.sup.(3)) comprises a plurality of layer-three current summation circuits (IS3), wherein an individual layer-three current summation circuit is electrically coupled to, and receives the layer-three summation line output current (Is.sup.(3)) produced by, at least one layer-three summation line (401), and produces a layer-three current summation circuit output current (Io.sup.(3)) that is a function of all received layer-three summation line output currents (Is.sup.(3)). FIG. 4(b) is a simplified schematic diagram of one embodiment of the layer-three current summation circuit (IS3) in FIG. 4(a).
In this embodiment, four matched transistors (Mn1, Mn2, Mn3, Mn4) and four enable transistors (Mne1, Mne2, Mne3, Mne4) are connected as current mirrors, as shown in FIG. 4(b). N-channel transistors are used in IS3 because the polarity of the input current (Is.sup.(3)) is opposite to that of the layer-two current summation circuits (IS2). In addition, two matched p-channel transistors (Mp5, Mp6) and two enable transistors (Mpe5, Mpe6) are connected as a current mirror to duplicate the amplitude and invert the polarity of the current that flows through Mn4, as shown in FIG. 4(b). When the field-programmable enable signals ENn1 and ENn2 are activated and ENn3 is deactivated, the layer-three summation circuit output current Io.sup.(3)=(Wn2/Wn1)Is.sup.(3), where Wn2 is the effective channel width of transistor Mn2 and Wn1 is the effective channel width of transistor Mn1. When the field-programmable enable signals ENn1 and ENn3 are activated and ENn2 is deactivated, the output current Io.sup.(3)=(Wn3/Wn1)Is.sup.(3), where Wn3 is the effective channel width of transistor Mn3. When the field-programmable enable signals ENn1, ENn4, ENp5, and ENp6 are activated, a current Iso3=(Wn4/Wn1)Is.sup.(3) is sent to the interface output circuit (Vout), as illustrated in FIG. 4(b), where Wn4 is the effective channel width of transistor Mn4. The function of the interface output circuit (Vout) in this embodiment can be the same as that in FIG. 3(f). IS3 is also able to prevent current overload because current mirror circuits by nature have an upper limit on the maximum value of their output currents.

[0135] The layer-three block (B.sup.(3)) in FIG. 4(a) also has layer-three input circuits (VI3) that drive layer-three input lines (403). These layer-three input circuits (VI3) also can be implemented by S&H circuits similar to that of FIG. 3(c). The layer-three input lines (403) and layer-three input circuits (VI3) provide field-programmable controls on the layer-two input circuits (VI2) in all the layer-two blocks (B.sup.(2).sub.r,t). Therefore, the voltages on the layer-one input lines (Vd.sub.j, Vg.sub.j) in all layer-one blocks included in this layer-three block (B.sup.(3)) can be controlled by those layer-three input circuits (VI3). This multiple-layer input configuration supports field-programmable input configurations that approach the flexibility of prior art software configurations. Such flexible multiple-layer input configurations can be used to duplicate the same sets of inputs into different layer-one blocks that have different branch summation lines in order to expand the number of outputs. If the number of inputs is greater than the capacity of one layer-two block, the layer-three current summation circuits (IS3) can be used to form a three-layer current summation structure that supports more inputs without significant increases in delay time. If three layers are insufficient, more layers can be added until the requirement is fulfilled. Similarly, layer-three input circuits (VI3) can be used to expand the number of output neurons.

[0136] FIG. 5(a) is a simplified block diagram for an embodiment of a multiple-layer summation line architecture of the present invention described in FIGS. 3(a-h) and in FIGS. 4(a-c). A subset of CMC cells (501) represented symbolically by small circles in FIG. 5(a) are coupled to a branch summation line (503). All the cell output currents (502) from those CMC cells (501) flow into the branch summation line (503), which carries the current-mode summation of all the cell output currents to a layer-one current summation circuit (IS1). This circuit produces a layer-one current summation circuit output current (504) that is a function of the summation of all CMC cell output currents (502) from the CMC cells (501) coupled to the branch summation line (503). All layer-one current summation circuit output currents (504) of the layer-one current summation circuits (IS1) are received by a layer-two summation line (505) and are input to a layer-two current summation circuit (IS2) that produces a layer-two current summation circuit output current (507) that is a function of the current-mode summation of the layer-one current summation circuit output currents received by the layer-two summation line (505). Similarly, the layer-two current summation circuit output currents (507) are received by a layer-three summation line (508) such that the summation of the IS2 output currents is input to a layer-three current summation circuit (IS3). This multiple layer architecture can be expanded to an arbitrary number of layers. If the number of inputs can be supported in layer-two, IS2 can send the result of current summation to an interface output circuit (506) that produces the inputs for the next level transistor tensor operation or produces digitized outputs to a digital interface. If a third layer is needed, then outputs are produced at the layer-three interface output circuit (509). Additional layers can be added as needed; outputs are then produced at the final layer.
If one semiconductor die cannot provide the needed computation memory cells, upper layer summation lines or input lines can be implemented by inter-dice connections to expand parallel transistor computations beyond die boundaries, as illustrated in FIG. 5(d, e). The architecture illustrated in FIG. 5(a) provides flexibility to configure CMC cells to support large-scale transistor tensor operations. Prudent designs of the summation lines in each layer can mitigate problems induced by non-ideal parasitic parameters such that the delay time introduced by each layer is only the delay time of current mirrors. With this architecture, large-scale transistor tensor operations can be executed within a few gate delays.
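
As a non-limiting illustration, the multiple-layer summation described above can be sketched as a behavioral software model. The sketch is not a circuit description: the names `summation_circuit` and `tree_output` are hypothetical, and each current summation circuit is assumed, for illustration only, to behave as an ideal current mirror with a gain and an optional upper limit on its output current.

```python
def summation_circuit(line_current, gain=1.0, limit=float("inf")):
    """Assumed behavior of IS1/IS2/IS3: scale the summed current, then clip it."""
    return min(gain * line_current, limit)


def tree_output(node, gain=1.0, limit=float("inf")):
    """Evaluate a nested list of CMC cell output currents.

    A number is one CMC cell output current. A list is one summation line:
    its children are summed (current-mode summation) and the total is passed
    through the summation circuit attached to that line.
    """
    if isinstance(node, (int, float)):
        return float(node)
    line_current = sum(tree_output(child, gain, limit) for child in node)
    return summation_circuit(line_current, gain, limit)


# Two branch summation lines feeding one layer-two summation line.
branches = [[1.0, 2.0], [3.0, 4.0]]
print(tree_output(branches))             # ideal mirrors, no limiting
print(tree_output(branches, limit=5.0))  # every summation circuit clips at 5.0
```

Adding a layer in this model corresponds to adding one more level of list nesting, which mirrors the statement that the architecture can be expanded to an arbitrary number of layers.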

[0137] FIG. 5(b) is a simplified block diagram for an embodiment of input control architecture of the present invention described in FIGS. 3(a-h) and in FIGS. 4(a-c). Layer-three input circuits (VI3) drive layer-three input lines (513) under field-programmable controls to reach layer-two input circuits (VI2). Layer-two input circuits (VI2) drive layer-two input lines (512) under field-programmable controls to reach layer-one input circuits (VI1, VI1g). Layer-one drain voltage and gate voltage input circuits (VI1, VI1g) drive layer-one input lines inside layer-one blocks (B.sub.i,j) to configure CMC cells. All input circuits (VI3, VI2, VI1, VI1g) in the various layers drive only a limited number of active devices, mitigating problems induced by non-ideal parasitic parameters. If the number of outputs is greater than the number of outputs in layer-one blocks, the inputs are duplicated to other layer-one blocks that have different current summation trees, effectively expanding the overall number of outputs. Careful designs for the input lines in each layer can mitigate problems induced by non-ideal parasitic parameters such that the delay time introduced by each layer is the delay time of input circuits (VI3, VI2, VI1, VI1g): one analog gate delay per layer. The architecture illustrated in FIG. 5(b) provides flexibility to configure CMC cells to support transistor tensor operations of various sizes. With this architecture, large-scale transistor tensor operations can be executed in parallel within a few gate delays.

[0138] A transistor tensor operation of embodiments of the present invention comprises many field-programmable options to allow flexibility in configuration. These options are typically selected by register write, scan chain, or memory write operations to memory devices distributed around an integrated circuit (IC) of embodiments of the present invention. The procedures to define those available options require detailed knowledge of actual designs, and such details can differ between IC designs. Therefore, a packaged IC chip of embodiments of the present invention typically comprises integrated circuits of different functions packaged in the same chip, as shown by the symbolic block diagram in FIG. 5(c). In this embodiment, multiple layer current-mode computation integrated circuits of embodiments of the present invention (521) are packaged in the same package with a CPU or GPU (522), a flash EPROM (523), a system memory (524) which is typically a dynamic random-access memory (DRAM) IC, and an image sensor or image display IC (528). The multiple layer CMC IC (521) of embodiments of the present invention can have multiple stacked ICs as described in the above sections. FIG. 5(d) shows a simplified embodiment of horizontal inter-dice communication lines (537) between different IC dice (531, 532, 533, 534). FIG. 5(e) shows a simplified embodiment of through-die inter-dice communication lines (547) between vertically stacked IC dice (541, 542, 543, 544). The microprocessor (522) executes firmware stored in the flash EPROM (523) to configure the CMC IC (521) through a digital interface (525). The system memory (524) is used for temporary storage to achieve better performance. The flash EPROM (523) also can store information such as which parts in the CMC ICs (521) are defective, which blocks are used by which application, and the parameters needed to configure the CMC IC (521).
Libraries of firmware circuits are also stored in the flash EPROM (523) to support functions similar to those of an operating system. The CMC ICs (521) also can send data to the microprocessor (522) through a digital interface (525) such that the microprocessor (522) can assist in executing computations that are not built in the ICs computing the transistor tensor operations. This digital interface (525) also provides the capability to communicate with external systems through an external digital interface (526). The CMC ICs (521) can also have a direct communication interface (527) with image sensor and/or image display ICs (528). This direct communication interface (527) can be a digital interface or an analog interface.

[0139] The architectures illustrated in FIGS. 3(a-h), FIGS. 4(a-c), and FIGS. 5(a-c) provide flexibility to configure CMC cells to support transistor tensor operations of a wide variety of sizes. For example, if a layer-one array comprises 128 inputs and 64 outputs, while each upper layer summation line connects 32 lower layer current summation circuits and each upper layer input line connects 32 lower layer input circuits, then a layer-two block can support a transistor tensor operation of up to 8 million parameters; a layer-three block can support up to 8 billion parameters; a layer-four block would be able to support up to 8 trillion parameters. If all the devices in an integrated circuit are insufficient to support the parameters of the transistor tensor operation being executed, the analog or digital output circuits allow convenient interfaces to other integrated circuits to expand into multiple-dice operations. For smaller computations, the unused layer-one blocks can be configured to support computations at different levels, or computations of other applications. Under this architecture, multi-leveled large-scale transistor tensor operations can be implemented with a flexible field-programmable configuration. Each level of transistor tensor operations can be executed in parallel and finished in a few gate delays, and multiple-level transistor tensor operations can be executed in parallel with a few gate delays at each level. Using the sample-and-hold circuits, one can choose to pipeline multiple-level computations to achieve higher throughputs or disable unused circuits to save power.
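
The capacity figures in the example above can be checked with a short calculation. The sketch below assumes, as one reading of the example, that each additional layer multiplies the number of reachable layer-one arrays by the summation fan-out times the input fan-out (32 x 32 = 1,024); the function name `block_parameters` is hypothetical.

```python
def block_parameters(layer, inputs=128, outputs=64, sum_fanout=32, input_fanout=32):
    """Number of parameters (CMC cells) reachable from one block at a given layer."""
    if layer < 1:
        raise ValueError("layer must be >= 1")
    per_layer_one_array = inputs * outputs   # 128 x 64 = 8,192 cells
    growth_per_layer = sum_fanout * input_fanout  # 32 x 32 = 1,024 lower blocks
    return per_layer_one_array * growth_per_layer ** (layer - 1)


for layer in (2, 3, 4):
    print(f"layer-{layer} block: {block_parameters(layer):,} parameters")
```

Under these assumptions the exact counts are 8,388,608, 8,589,934,592, and 8,796,093,022,208, which round to the 8 million, 8 billion, and 8 trillion figures quoted in the text.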

[0140] To better understand the performance gains of a multiple-layer summation circuit of embodiments of the present invention, consider a three-layer network of an embodiment of the present invention that comprises 10 billion CMC cells. Delay time for each computation is approximately that of one analog gate plus two current mirrors, which is about 2 nanoseconds. When all 10 billion computation-summations are executed in parallel within 2 nanoseconds, 5×10.sup.18 computation-summations are executed per second. For a four-layer network of embodiments of the present invention, the performance can reach 5×10.sup.21 computation-summations per second or higher.
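
The throughput estimate above follows from dividing the number of cells by the delay time. The sketch below reproduces that arithmetic with exact integers; the helper names are hypothetical. The text does not state the cell count for the four-layer case, so the second helper only works backwards from the quoted rate under the assumption that the delay stays at about 2 nanoseconds.

```python
NS_PER_SECOND = 1_000_000_000


def summations_per_second(num_cells: int, delay_ns: int) -> int:
    """Rate when every cell completes one computation-summation per delay period."""
    return num_cells * NS_PER_SECOND // delay_ns


def cells_for_rate(rate_per_second: int, delay_ns: int) -> int:
    """Cell count implied by a given rate at a given delay (inverse of the above)."""
    return rate_per_second * delay_ns // NS_PER_SECOND


# Three-layer example from the text: 10 billion cells, about 2 ns per computation.
print(summations_per_second(10 * 10**9, 2))   # 5 x 10^18 per second

# Cell count implied by 5 x 10^21 per second, assuming the same 2 ns delay.
print(cells_for_rate(5 * 10**21, 2))          # 10^13 cells
```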

[0141] Under programmable controls, the CMC cells can be configured to operate in multiplier mode such that a deep learning model that comprises a sequence of transistor tensor operations of embodiments of the present invention can be fully compatible with existing digital neural network devices. That means the results of an existing deep learning model can be ported into a device of an embodiment of the present invention to support the same functions. In addition, each individual neuron can be configured to operate in its own operation mode, such as multiplier mode, ReLU mode, rectifier mode, or saturation mode, to adapt to the nature of individual neurons and potentially achieve better results than prior art digital neural network devices. Similarly, the output function for each individual output neuron also can be configured to its own specific function to potentially achieve better results.

[0142] Because summations are commutative, if one CMC branch is defective, we can mark and disable that CMC branch and use a different CMC branch to replace its functions; if one layer-one block is defective, we can mark and disable that block and use a different layer-one block to replace its functions; defects at any layer can be handled the same way. The marks for defective parts can be stored in the flash EPROM (523) such that other applications can avoid known defective parts. This flexible redundancy architecture significantly improves yield, and reduces costs.
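
The redundancy scheme above can be illustrated by a small remapping routine of the kind firmware might run using the defect marks stored in the flash EPROM (523). This is a sketch only: the function name `assign_branches` and the use of integer indices for branches are assumptions made for illustration, not a description of the actual firmware.

```python
def assign_branches(num_logical, num_physical, defective):
    """Map logical CMC branches onto physical branches, skipping marked defects.

    Because current summation does not depend on which physical branch carries
    a term, any healthy branch can stand in for a defective one.
    """
    healthy = [p for p in range(num_physical) if p not in defective]
    if len(healthy) < num_logical:
        raise ValueError("not enough healthy branches for this configuration")
    return dict(zip(range(num_logical), healthy))


# Physical branch 1 is marked defective, so logical branch 1 moves to physical 2.
print(assign_branches(num_logical=3, num_physical=5, defective={1}))
```

The same mapping idea applies at any layer, for example by treating layer-one blocks rather than branches as the remappable units.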

[0143] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. Skillful circuit designers will be able to design circuits in wide varieties of ways. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0144] In the above embodiments, channel currents of computation transistors always flow in the same direction, meaning all parameters are positive. For some cases, it is desirable to have both positive and negative parameter values. FIG. 6(a) is a simplified schematic diagram for a CMC cell that comprises two computation transistors (fgTp, fgTn) with gate terminals coupled to the same input at voltage Vg. The source terminal of fgTp and the drain terminal of fgTn are coupled to a branch summation line (Is) which is forced at a voltage Vs, as shown in FIG. 6(a). The drain terminal of fgTp is coupled to a constant drain voltage (Vdp) greater than Vs, causing the channel current (Ip) of fgTp to flow into the branch summation line (Is); the source terminal of fgTn is coupled to a constant source voltage (Vsn) less than Vs, causing the channel current (In) of fgTn to flow away from the branch summation line (Is), as shown in FIG. 6(a). The overall channel current of this two-transistor cell is therefore Ip−In, which can be positive when Ip>In or negative when Ip<In. Two transistors configured as the embodiment in FIG. 6(a) to support one computation are therefore able to model both positive and negative parameter values.

[0145] Another embodiment to support both positive and negative values is to use the same layer-one block as discussed in previous embodiments while taking the difference of branch output currents from two nearby columns using the current subtraction circuit (611) illustrated by the simplified symbolic embodiment in FIG. 6(b). In this embodiment, two matched transistors (Mpp, Mpn) and two enable transistors (Mppe, Mpne) are configured as a current mirror. The input of the current mirror is connected to one branch summation line with a branch output current Isp and the other end of the current mirror is connected to another branch summation line with a branch output current Isn, as shown in FIG. 6(b). When field-programmable enable signals ENpp and ENpn are both activated, the output current of this current subtraction circuit (611) is Isp−Isn. Using such current subtraction circuits (611), each scalar computation in the layer-one block is modeled by a CMC cell along one branch summation line and another CMC cell along the other branch summation line that have the same input voltages, resulting in scalar computations that can have positive or negative results.

[0146] In the above embodiment, the current summation circuit needs to support input currents of different polarities. FIG. 6(c) is a simplified schematic diagram for an exemplary dual polarity current mirror (DPCM). In this embodiment, matched p-channel transistors (Mp7, Mp8, Mp9) and enable transistors (Mpe7, Mpe8, Mpe9) are configured as p-channel current mirrors; matched n-channel transistors (Mn7, Mn8, Mn9) and enable transistors (Mne7, Mne8, Mne9) are configured as n-channel current mirrors, as shown in FIG. 6(c). When the input current (Isi) is positive, current flows through diode DN to the n-channel current mirrors while diode DP blocks the current such that the output currents of the p-channel current mirrors are approximately zero. The resulting output current (Ino) is therefore completely generated by the n-channel current mirrors. When the input current (Isi) is negative, current flows through diode DP to the p-channel current mirrors while diode DN blocks the current such that the output currents of the n-channel current mirrors are approximately zero. The resulting output current (Ino) is therefore completely generated by the p-channel current mirrors. Therefore, this dual polarity current mirror (DPCM) works for input currents of both polarities. The diodes DN and DP can be implemented by transistors configured in rectifier modes.

[0147] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For example, different types of CMC cells can be designed to support both positive and negative parameters. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0148] In the above embodiments, input values are always positive. It is often desirable to have both positive and negative input values. FIG. 6(d) is a simplified schematic for a CMC cell that comprises four matched transistors (Mpp, Mpn, Mnp, Mnn). The source terminals of Mpp and Mpn are connected to the branch summation line (Isp) that is connected to the current subtraction circuit in FIG. 6(b). The drain terminals of Mpp and Mnp are connected to the same input line (Vdip); the drain terminals of Mpn and Mnn are connected to another input line (Vdin); the gate terminals are all connected to the same gate voltage Vg, as shown in FIG. 6(d). The source terminals of Mnp and Mnn are connected to another branch summation line (Isn) that is connected to the input of the current subtraction circuit in FIG. 6(b). The contribution of this CMC cell (631) to the overall summation current is therefore Idpp+Idpn−Idnp−Idnn, where Idpp is the channel current of computation transistor Mpp, Idpn is the channel current of computation transistor Mpn, Idnp is the channel current of computation transistor Mnp, and Idnn is the channel current of computation transistor Mnn. Computation transistor Mpp and computation transistor Mnn are programmed to have the same storage charges. Therefore, when all inputs are equal, Idpp is equal to Idnn. Similarly, computation transistor Mnp and computation transistor Mpn are programmed to have equal storage charges; when all inputs are the same, Idnp equals Idpn. For a positive input at voltage Vdinp, Vdip is set to voltage Vdinp and Vdin is set to Vs. The contribution of this CMC cell (631) is (Idpp−Idnp). For a negative input at voltage Vdinp, Vdip is set to voltage Vs and Vdin is set to voltage Vdinp. The contribution of this CMC cell (631) is (Idpn−Idnn). Since the input voltages are equivalent under both conditions, we have (Idpn−Idnn)=−(Idpp−Idnp), which is equivalent to a negative input.
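
The sign reversal described above can be checked with a toy numerical model. The sketch below is illustrative only: it assumes a linear transistor whose channel current is proportional to its drain-to-source voltage, with a gain standing in for the programmed storage charge, and all function names are hypothetical. Mpp and Mnn share one gain and Mnp and Mpn share another, as in the text.

```python
def transistor_current(gain, v_drain, v_source):
    """Toy linear model (an assumption): current proportional to Vds, never negative."""
    return gain * max(v_drain - v_source, 0.0)


def cell_contribution(gain_a, gain_b, vdip, vdin, vs):
    """Idpp + Idpn - Idnp - Idnn for the four-transistor cell of FIG. 6(d)."""
    idpp = transistor_current(gain_a, vdip, vs)  # Mpp: drain on Vdip, source on Isp
    idpn = transistor_current(gain_b, vdin, vs)  # Mpn: drain on Vdin, source on Isp
    idnp = transistor_current(gain_b, vdip, vs)  # Mnp: drain on Vdip, source on Isn
    idnn = transistor_current(gain_a, vdin, vs)  # Mnn: drain on Vdin, source on Isn
    return (idpp + idpn) - (idnp + idnn)


def signed_input_contribution(gain_a, gain_b, magnitude, positive, vs=0.0):
    """Drive Vdip for a positive input and Vdin for a negative input."""
    vdinp = vs + magnitude
    if positive:
        return cell_contribution(gain_a, gain_b, vdip=vdinp, vdin=vs, vs=vs)
    return cell_contribution(gain_a, gain_b, vdip=vs, vdin=vdinp, vs=vs)


pos = signed_input_contribution(3.0, 1.0, 0.5, positive=True)
neg = signed_input_contribution(3.0, 1.0, 0.5, positive=False)
print(pos, neg)  # equal magnitude, opposite sign
```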

[0149] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For example, one can allow drain input voltage to be less than Vs to achieve positive and negative inputs. This works if the amplitudes of the input voltages are small. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0150] FIG. 7(a) shows a flow chart for an example procedure to program and set a desired amount of storage charge on a floating-gate transistor. The first step is to determine if the floating-gate transistor needs to be erased. If the floating-gate transistor's storage charge is too high, we execute an erase operation by putting a low voltage at its gate terminal and high voltages at its source/drain terminals to pull storage charges out of the floating-gate via tunneling. The next step is to read the floating-gate transistor to measure its threshold voltage (VT) or channel current (Id) to determine if its storage charge is already at the desired value. If not, we execute one short pulse of the programming operation by setting a high electrical field at the channel region of the floating-gate transistor to inject electrons into its floating-gate through hot electron effects. After each short pulse of programming, we measure VT or Id again and repeat the same procedure until the desired threshold voltage is achieved. The read operation used in this application also can be used to measure the value of the threshold voltage of each individual floating-gate transistor and output the result through a digital interface to external devices. This can be used to duplicate the training results to other devices.
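
The erase, read, and pulse-until-verified sequence described above can be sketched in software as follows. This is a behavioral toy, not a device model: `ToyFloatingGate`, its erased threshold voltage, and the fixed threshold shift per programming pulse are all assumptions chosen for illustration.

```python
class ToyFloatingGate:
    """Stand-in for a floating-gate transistor (all values are illustrative)."""

    ERASED_VT = 0.5  # assumed threshold voltage after an erase operation

    def __init__(self, vt):
        self.vt = vt

    def erase(self):
        self.vt = self.ERASED_VT

    def program_pulse(self, step=0.005):
        self.vt += step  # assume each short pulse raises VT by a fixed step

    def read_vt(self):
        return self.vt


def program_to_target(cell, target_vt, tolerance=0.01, max_pulses=10_000):
    """Erase if the stored charge is too high, then pulse and verify until done."""
    if target_vt - tolerance < cell.ERASED_VT:
        raise ValueError("target is below what an erased cell can reach")
    if cell.read_vt() > target_vt + tolerance:
        cell.erase()
    pulses = 0
    while cell.read_vt() < target_vt - tolerance:  # read-verify after each pulse
        if pulses >= max_pulses:
            raise RuntimeError("cell did not reach the target threshold voltage")
        cell.program_pulse()
        pulses += 1
    return pulses


cell = ToyFloatingGate(vt=2.0)
print(program_to_target(cell, target_vt=1.2), cell.read_vt())
```

The pulse step is kept smaller than the tolerance so that the loop cannot overshoot the acceptance window, which is one simple way to reflect the use of short programming pulses.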

[0151] Using the programming method described in FIG. 7(a), the threshold voltage of a floating-gate transistor can be controlled in analog values, but the floating-gate transistors programmed in this way typically cannot achieve the same precision as digital computations. One method to improve precision is illustrated by the simplified flow chart in FIG. 6(e). To achieve double precision, select one column of CMC cells in combination with a different column of CMC cells. For example, if the layer-one current summation circuit (IS1) is that of FIG. 3(d), configure the layer-one current summation circuit (IS1) of the first column with EN2 activated and EN3 deactivated. Configure the layer-one current summation circuit of another column with EN3 activated and EN2 deactivated, with the outputs of both layer-one current summation circuits connected to the same layer-two summation line. With this configuration, the combined output current is

[00021] $\mathrm{Isc} = (\mathrm{Wm2}/\mathrm{Wm1}) \cdot \mathrm{Is}_{1} + (\mathrm{Wm3}/\mathrm{Wm1}) \cdot \mathrm{Is}_{2}$

where Is.sub.1 is the branch output current from the first column, Is.sub.2 is the branch output current from the second column, and Isc is the combined summation current from both columns. If we assume (Wm2/Wm1) is designed to be equal to 16, and (Wm3/Wm1) is designed to be equal to 1, then we have Isc=16*Is.sub.1+Is.sub.2. In this embodiment, the CMC cells in column 1 represent the most significant four bits, while those in column 2 provide the least significant four bits of the combined synapse value to achieve 8-bit precision. Therefore, a double precision parameter value is achieved.
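
The coarse and fine column weighting above can be illustrated numerically. The sketch below assumes, purely for illustration, that each column contributes an integer current level from 0 to 15 (four bits) and uses the 16:1 mirror ratios from the example; `split_parameter` and `combined_current` are hypothetical names.

```python
COARSE_RATIO = 16  # (Wm2/Wm1) in the example
FINE_RATIO = 1     # (Wm3/Wm1) in the example


def split_parameter(value):
    """Split an 8-bit parameter into coarse (upper 4 bits) and fine (lower 4 bits)."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in 8 bits")
    return value >> 4, value & 0xF


def combined_current(is1, is2):
    """Isc = (Wm2/Wm1)*Is1 + (Wm3/Wm1)*Is2 for the two-column configuration."""
    return COARSE_RATIO * is1 + FINE_RATIO * is2


coarse, fine = split_parameter(0xA7)
print(coarse, fine, combined_current(coarse, fine))
```

Every 8-bit value is recovered exactly by this split and recombination, which is the sense in which two four-bit columns provide 8-bit precision.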

[0152] FIG. 7(b) is a simplified flow chart that shows an exemplary procedure to program floating-gate transistors configured to support double-precision accuracy. By measuring Id on column 1, the floating-gate transistor in the first column is programmed first to reach a predefined coarse value. Next, the combined channel current is measured and the floating-gate transistor in column 2 is programmed until the desired value is achieved.

[0153] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For example, triple precision can be achieved by combining three columns, and quadruple precision can be achieved by combining four columns. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0154] When training neural network models that use operations that are non-differentiable or do not have simple closed-form expressions for their derivatives using programmable-threshold voltage transistors, the partial derivative should be taken against threshold voltages (∂C.sub.o/∂V.sub.Tj,k.sup.(i)) to determine optimal changes in threshold voltages. As shown by the above embodiments, the procedures to change the threshold voltages of floating-gate transistors can be time-consuming, which can add up during training. Changing the storage charge of one floating-gate transistor and restoring the storage charge to its original value is not only time-consuming but also inaccurate. To more efficiently calculate (∂C.sub.o/∂V.sub.Tj,k.sup.(i)), we observe that the channel current of a programmable-threshold voltage transistor is a function of (V.sub.gs−V.sub.Tj,k), where V.sub.gs is the gate-to-source voltage and V.sub.Tj,k is the threshold voltage. While changing V.sub.Tj,k is complex, changing V.sub.gs is simple. When (V.sub.gs−V.sub.Tj,k) is fixed, the behavior of the programmable-threshold voltage transistor is the same. Therefore, we have ∂C.sub.o/∂V.sub.Tj,k=−∂C.sub.o/∂V.sub.gs. During training, we can simulate a change in threshold voltage by changing V.sub.gs in the opposite direction. This results in an efficient method for approximating the derivatives required for training models using non-differentiable operations.
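
The relationship between the two derivatives can be checked numerically. Because the channel current depends only on the difference between gate-to-source voltage and threshold voltage, raising the threshold by a small amount has the same effect as lowering the gate voltage by that amount, so the derivative with respect to threshold voltage is the negative of the derivative with respect to gate voltage. The sketch below assumes a square-law transistor and a squared-error cost purely for illustration; neither is specified by the text, and all names are hypothetical.

```python
import math


def channel_current(vgs, vt, k=1.0):
    """Assumed square-law model: the current depends only on (vgs - vt)."""
    overdrive = max(vgs - vt, 0.0)
    return k * overdrive ** 2


def cost(vgs, vt, target=0.3):
    """Assumed squared-error cost on the channel current."""
    return (channel_current(vgs, vt) - target) ** 2


def central_difference(f, x, h=1e-4):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


VGS, VT = 1.5, 0.7
dC_dVt = central_difference(lambda vt: cost(VGS, vt), VT)
dC_dVgs = central_difference(lambda vgs: cost(vgs, VT), VGS)
print(dC_dVt, dC_dVgs)
assert math.isclose(dC_dVt, -dC_dVgs, rel_tol=1e-6)
```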

[0155] FIG. 8(a) is a flow chart for simplified exemplary training processes of embodiments of the present invention. Starting from a batch of input vectors with corresponding labels, the final output and cost of each input is determined by executing a forward pass on each input. Note that the size of a batch can range anywhere from a single vector to the entirety of the training set. Next, using the circuits of an embodiment of the present invention, one parameter or the values of a set of orthogonal parameters is changed while keeping all other parameters fixed. Next, a forward pass is executed, and the change in the cost function due to the changed parameter(s) is calculated to determine the partial derivative(s) of the cost function with respect to the parameter(s) under this test, as shown in FIG. 8(a). The same procedure is repeated for all parameters to calculate the derivative of the cost with respect to the parameters. Once the process has gone through all inputs in the batch, an approximate gradient of the cost function with respect to all parameters is calculated by aggregating the previously computed gradients for each input. With the deltas for each parameter known, program/erase operations are executed to change the storage charges of programmable-threshold voltage transistors and/or configuration parameters according to the approximated gradient. This procedure is repeated until the cost of the model reaches an acceptable threshold.
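
The perturb-and-measure training procedure above can be sketched in software as follows. The sketch is a toy under stated assumptions: the "forward pass" is a plain weighted sum rather than a circuit, the cost is squared error, the update rule is simple gradient descent, and the names `forward`, `cost`, `estimate_gradient`, `train`, and `BATCH` are hypothetical.

```python
def forward(weights, x):
    """Toy forward pass: a weighted sum standing in for the circuit output."""
    return sum(w * xi for w, xi in zip(weights, x))


def cost(weights, x, label):
    return (forward(weights, x) - label) ** 2


def estimate_gradient(weights, batch, delta=1e-4):
    """Change one parameter at a time, rerun the forward pass, and measure the
    change in cost; per-input results are aggregated by averaging."""
    grad = [0.0] * len(weights)
    for x, label in batch:
        base = cost(weights, x, label)
        for k in range(len(weights)):
            perturbed = list(weights)
            perturbed[k] += delta
            grad[k] += (cost(perturbed, x, label) - base) / delta
    return [g / len(batch) for g in grad]


def train(weights, batch, lr=0.3, threshold=1e-6, max_steps=1000):
    """Repeat estimate-and-update until the mean cost is acceptably small."""
    weights = list(weights)
    for _ in range(max_steps):
        mean_cost = sum(cost(weights, x, y) for x, y in batch) / len(batch)
        if mean_cost < threshold:
            break
        grad = estimate_gradient(weights, batch)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Labels generated by the weights [2, -1]; training should approach them.
BATCH = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
print(train([0.0, 0.0], BATCH))
```

In hardware the update step would correspond to program/erase operations on the transistors, and the number of forward passes per step grows with the number of parameters, which is why the text emphasizes that each forward pass completes in a few gate delays.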

[0156] FIG. 8(b) is a flow chart for simplified exemplary training processes of embodiments of the present invention used to calculate the partial derivative (∂C.sub.o/∂V.sub.Tj,k) by determining (∂C.sub.o/∂V.sub.gs) and the partial derivatives of output functions. For this embodiment, the layer-one gate voltage input circuit (VI1g) is the one in FIG. 3(g). Using this VI1g, the gate voltage along one row of floating-gate transistors can be conveniently changed by changing Vsg, as shown in FIG. 3(g). During normal operations, Vsg is set to be the same as Vs. During the partial derivative calculation procedures described in FIG. 8(b), the sampling select signal (Smp) is deactivated while Vsg of the selected VI1g can be changed, which will change the gate voltage (Vg.sub.j) applied on the gate terminals of selected floating-gate transistors. After the partial derivative calculation, Vsg is changed back to Vs to restore original conditions. In the meantime, the current-sample-and-hold (IS&H) circuit shown in FIG. 8(c) is used as the layer-two current summation circuit (IS2). This IS&H is nearly identical to the IS2 shown in FIG. 3(e), except that a multiplexer (MUX) is added to control the current mirror gate voltage (Vmo) as shown in FIG. 8(c). When the mux select signal SMi is activated and the other mux select signal SMh is deactivated, Vmo is determined by the input current (Is.sup.(2)) induced voltage (Vmi) so that this IS&H operates exactly the same as the circuit in FIG. 3(e). In the meantime, a voltage sample-and-hold (VS&H) circuit can be activated to sample Vmi so that the output of the VS&H (Vmh) can capture the value of Vmi and hold the value when the VS&H is in hold mode.
When the mux select signal SMi is deactivated and the other mux select signal SMh is activated, Vmo is determined by the VS&H hold voltage (Vmh) so that the output currents (Io.sup.(2) and Iso) are the same as the values they had when the VS&H last sampled Vmi, and are independent of the input current (Is.sup.(2)); in this situation, the output currents are held at their previous values.

[0157] Using VI1g in FIG. 3(g) and IS&H in FIG. 8(c), the training algorithm of embodiments of the present invention can be described by the exemplary flow chart in FIG. 8(b). Execute computation using current parameters, sample-and-hold the current mirror input voltage by the VS&H in the IS&H, then turn off the select muxes in related IS&Hs; the output currents of those IS&Hs will remain the same, so that all the neurons will remain the same throughout the whole large-scale multiple-level function matrix. While the output currents are held at the same values, disturb the gate voltage of all the floating-gate transistors connected to the same input, which will change the layer-two summation currents (Is.sup.(2)) related to the selected input without disturbing the rest of the entire function matrix network; use the VS&H circuits in IS&Hs to record the current mirror voltage (Vmi) of the disturbed summation currents (Is.sup.(2)); after that, turn off the Vg disturbance by returning the Vsg value back to Vs in the layer-one gate voltage input circuit (VI1g) so that Is.sup.(2) will return to undisturbed values, then turn on SMi of all selected muxes in related IS&Hs so that everything returns to undisturbed conditions. Now we are ready to determine (∂C.sub.o/∂V.sub.gs) one by one for all the floating-gate transistors connected to the selected input neuron, as illustrated in FIG. 8(b). Starting from the first output neuron, change the corresponding IS&H connected to the output neuron to hold mode by deactivating SMi and activating SMh of the mux in the IS&H of the selected output neuron, so that its output current is the disturbed output current, while all the other output neurons remain unchanged. In these ways, (∂C.sub.o/∂V.sub.gs) and therefore (∂C.sub.o/∂V.sub.Tj,k.sup.(i)) can be determined conveniently at high performance. Such procedures can be repeated until all (∂C.sub.o/∂V.sub.Tj,k.sup.(i)) of all floating-gate transistors are determined, as shown in FIG. 8(b).

[0158] A current-sample-and-hold (IS&H) circuit is an electrical circuit that supports a hold operation mode and a sample operation mode and provides an electrical output current. During hold mode, an IS&H holds its output current at a constant value that is independent of its input or inputs. During sample mode, the IS&H detects and maintains the parameter or parameters needed to supply the correct output current for the next hold mode operation. IS&H circuits are very useful in supporting training algorithms of embodiments of the present invention that change gate voltages instead of threshold voltages in order to determine partial derivatives. IS&H circuits also can be used to support pipeline operations to increase transistor tensor operation throughput. When IS&H circuits are used as the current summation circuits (IS1, IS2, IS3, . . . ), they can also save power by holding correct output currents while shutting down the computation array, effectively mitigating the ISPC problem.
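
The sample and hold modes described above can be captured in a small behavioral model. The sketch is an assumption-laden simplification: it stores the output current directly, whereas the circuit of FIG. 8(c) holds a current mirror gate voltage, and the class and method names are hypothetical.

```python
class CurrentSampleAndHold:
    """Behavioral model of an IS&H circuit (illustrative only)."""

    def __init__(self, gain=1.0):
        self.gain = gain       # assumed current mirror ratio
        self.holding = False   # False: sample mode, True: hold mode
        self._held = 0.0       # state captured during sample mode

    def sample(self):
        self.holding = False

    def hold(self):
        self.holding = True

    def output(self, input_current):
        if not self.holding:
            # Sample mode: follow the input and remember the resulting output.
            self._held = self.gain * input_current
        # Hold mode: the output ignores the present input.
        return self._held


ish = CurrentSampleAndHold(gain=2.0)
print(ish.output(1.5))  # sample mode follows the input
ish.hold()
print(ish.output(9.0))  # hold mode keeps the previously sampled output
```

Holding the outputs in this way is what allows the upstream computation array to be disturbed, pipelined, or shut down without changing what later stages receive.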

[0159] While the preferred embodiments have been illustrated and described herein, other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. For example, cost functions other than mean-squared error can be used. The aforementioned voltage sample-and-hold and current sample-and-hold circuits are oversimplified for clarity; actual S&H circuits require more complex designs. The flow charts in the above embodiments are also oversimplified for clarity; the sequences of training processes can be executed in a wide variety of ways. It is to be understood that there are many other possible modifications and implementations so that the scope of the invention is not limited by the specific embodiments discussed herein.

[0160] The output functions of embodiments of the present invention also can be optimized flexibly with high performance, as illustrated by the flow chart in FIG. 8(d). In this embodiment, we assume the output function is determined by the output circuit shown in FIG. 3(f). Since all the parameters (Vga, VR1, VR2, VR3, VR4) of the output function in FIG. 3(f) are programmable, it is trivial to implement a small change in each individual parameter to calculate the partial derivative against each individual parameter, as illustrated in FIG. 8(d).

[0161] While specific embodiments of the invention have been illustrated and described herein, it is realized that other embodiments of the invention may incorporate modifications and changes to the embodiments described herein. It is therefore to be understood that the appended claims are intended to cover all modifications and changes that fall within the true spirit and scope of the invention.