General purpose neural processor

Abstract

A computer processor includes an on-chip network and a plurality of tiles. Each tile includes an input circuit to receive a voltage signal from the network, and a crossbar array, including at least one neuron. The neuron includes first and second bit lines, a programmable resistor connecting the voltage signal to the first bit line, and a comparator to receive inputs from the two bit lines and to output a voltage, when a bypass condition is not active. Each tile includes a programming circuit to set a resistance value of the resistor, a pass-through circuit to provide the voltage signal to an input circuit of a first additional tile, when a pass-through condition is active, a bypass circuit to provide values of the bit lines to a second additional tile, when the bypass condition is active; and at least one output circuit to provide an output signal to the network.

Claims

1. A comparator configured for use in a computer processor, the comparator comprising: an input stage, comprising: a first input line; a second input line; a first input calibration line comprising a transistor and a programmable resistor and configured to add current to the first input line; a second input calibration line comprising a transistor and a programmable resistor and configured to add current to the second input line; a first output calibration line comprising a transistor and a programmable resistor and configured to drain current from the first input line; a second output calibration line comprising a transistor and a programmable resistor and configured to drain current from the second input line; a first plurality of parallel diodes configured to receive current from the first input line, wherein a voltage operating range across the first plurality of parallel diodes is configurable by selecting a subset of the first plurality of diodes to be activated; and a second plurality of diodes configured to receive current from the second input line, wherein a voltage operating range across the second plurality of parallel diodes is configurable by selecting a subset of the second plurality of diodes to be activated; an amplifier stage, electrically coupled to the input stage; and an output stage, electrically coupled to the amplifier stage.

2. The comparator as recited in claim 1, the amplifier stage comprising: a first amplifier stage, comprising: a first differential amplifier having a first voltage output and a second voltage output; and a first transmission gate electrically connecting the first voltage output to the second voltage output; and a second amplifier stage, comprising: a second differential amplifier; and a second transmission gate electrically connecting the second differential amplifier to an output stage; wherein the amplifier stage is configured such that whenever the first transmission gate is open, the second transmission gate is closed.

3. A method of calibrating the comparator of claim 1, wherein the comparator includes a plurality of input lines including the first input line and the second input line, and a plurality of calibration lines including the first input calibration line, the second input calibration line, the first output calibration line, and the second output calibration line, the method comprising: (a) repeatedly performing until a minimum number of +1 input values and −1 input values are set to on the processes of: setting weights associated with the input lines such that there are an equal number of +1 input values and −1 input values; setting each input line to be on; simultaneously cycling one +1 weight and one −1 weight by turning the +1 input off and then back on, while simultaneously turning the −1 input on and then back off, and recording the resulting output values; and reducing the number of input lines that are set to on by deactivating one input line having a +1 value and deactivating one input line having a −1 value; (b) adjusting programmable resistor values of the calibration lines based on the recorded output values; (c) repeatedly performing processes (a) and (b) until the output values indicate that acceptable operation has been reached.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a diagram of a neuron in a neuromorphic computing architecture;

(2) FIG. 2 is a block diagram of a neuromorphic processor architecture 301 in accordance with an embodiment of the present invention;

(3) FIG. 3 is a diagram illustrating a single-ended voltage architecture;

(4) FIG. 4 is a diagram illustrating a differential-voltage architecture;

(5) FIG. 5 is a diagram illustrating a differential-voltage architecture with high impedance;

(6) FIG. 6 is a block diagram of a comparator in accordance with an embodiment of the present invention;

(7) FIG. 7 is a simplified schematic of elements of a comparator in accordance with an embodiment of the present invention;

(8) FIG. 8 is a schematic of an input stage of a comparator in accordance with an embodiment of the present invention;

(9) FIG. 9 is a graph of a simulation performed for a comparator incorporating the elements of FIG. 8;

(10) FIG. 10 is a schematic of elements of an amplifier stage of a comparator in accordance with an embodiment of the present invention;

(11) FIGS. 11 and 12 are timing simulations illustrating operation of comparators according to embodiments of the present invention;

(12) FIGS. 13-15 are graphs of output delay times for comparators according to embodiments of the present invention; and

(13) FIGS. 16 and 17 are diagrams illustrating tile architectures according to embodiments of the present invention.

DETAILED DESCRIPTION

(14) Definitions: The term “memristor” refers to an electrical component whose resistance can be increased or decreased in a controllable fashion by modifying voltage/current inputs.

(15) Tiled Arrays

(16) The variety of applications known in the art of neuromorphic computing calls for a variety of crossbar array sizes for efficient mapping of neural nets onto the hardware. Finding a single array size that can fit this variety, desirable for a general purpose neural processor, is problematic. Tiled arrays may be used to overcome this limitation, as discussed in detail below. A tile can be a complete array (when small arrays are efficient), and tiles can also be combined to create much larger arrays, which are necessary for certain neural networks. This novel contribution can be used as a key component of a general purpose neural processor.

(17) Comparator Design

(18) The tiled array architecture is made effective by the use of a 1/High Z neuron architecture, and the design of a compact, power efficient, fast comparator. Many memristor-based neural nets use a very fast, but large and power hungry analog-to-digital converter (ADC) for neuron evaluation. This would be a problematic approach for the tiled array concept. A comparator design is described herein that can efficiently support small tiles.

(19) Architecture and Key Components

(20) Because of their programmable conductance capability, memristors can function as dynamic weights in neuromorphic computing designs. Referencing FIG. 1, memristors can be programmed as the weights (w.sub.i) applied to the inputs (x.sub.i). Since the programming process can be controlled using feedback from the output, it is possible for memristor-based designs to be modified during operation, integrating learning into the system.

(21) FIG. 2 is a block diagram of a neuromorphic processor architecture 301 in accordance with an embodiment of the present invention. This architecture 301 can incorporate on-chip learning to enable algorithm flexibility, and it incorporates 1 transistor, 1 memristor (1T1M) array cells to provide for more precise control of the memristor conductance values during programming using known programming circuits 305. Comparators 309 are used for the evaluation circuit to enable a digital communication network 311 and digital input circuits 307. This building block can be replicated multiple times to create a full chip version for large neural processing applications. Unless otherwise specified, the array is a 256×64 array; this refers to the number of inputs and neurons. The number of columns (128 in this case) may be twice that, in the event a differential current architecture is used. This array size may be fabricated using, e.g., 45 nm technology.

(22) Memristor arrays implementing a threshold gate network (TGN) can be organized in multiple ways. Three approaches in particular are discussed below.

(23) 1. Single-Ended Voltage

(24) FIG. 3 is an example of a single-ended voltage architecture (SV) 401. Every input 403 and its inverse 405, along with any bias inputs (not shown), is connected to a single voltage rail 407 through a memristor 409 (represented by the blue circle) with a specific programmed conductance. This architecture 401 has been used to design and analyze embedded neural network processors. G.sub.0 represents an “off” state, or extremely low conductance, while G.sub.1 represents an “on” state, or high conductance. A typical on/off ratio can be ≥100. In our TGN, all the weights are integer values, which can be represented in this simplified SV architecture as one or more inputs with each neuron weight w.sub.i=G.sub.1. For example, w.sub.i=2 can be represented by two inputs with w.sub.i=G.sub.1 (for some applications, over 90% of the weights may be −1, 0, or +1). This representation is used here merely to simplify the analysis. For the actual design a single memristor is programmed with a conductance value (G.sub.i) that represents the desired weight for that particular input and neuron. The voltage V.sub.in is compared to a reference voltage with a differential voltage comparator to create the threshold gate. The circuit in FIG. 3 can represent any neuron and set of inputs in a TGN by selecting the proper values for α.sub.11, α.sub.10, α.sub.01, and α.sub.00.

(25) 2. Differential Voltage

(26) FIG. 4 is an example of a differential voltage architecture (DV) 501. Every input and bias is connected through a separate conductance to each of a positive voltage rail 503 and a negative voltage rail 505. The equivalent circuits 511 for this architecture are shown. The two voltages are compared with a differential voltage comparator 507 to create the threshold gate.

(27) A differential voltage architecture should also have improved common mode noise rejection. This indicates DV may provide benefits over SV in certain embodiments.

(28) 3. Differential Current with 1/High Z Inputs

(29) FIG. 5 is an example of a differential current architecture 701 using 1/High Z inputs (DZ). When the input=0, the output of the row driver circuit is a high impedance node (High Z), rather than V.sub.ss. Every input and bias is connected through a separate conductance to each of a positive rail 703 and a negative rail 705. The outputs 707 are two currents, each summed separately on its own bit line. The two currents are compared using a differential current comparator 709 to create the threshold gate. Only input=1 conditions provide current (and consume power).

(30) Circuit analysis indicates the 1/High Z differential current architecture (DZ) is power efficient compared to the SV and DV architectures. It also has a desirable property for circuit analysis: the current for each input is directly proportional to the weighted input for the neuron. This enables certain mathematical properties of the TGN to be verified as correctly implemented via simple analysis or simulation of the circuits.
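
The proportionality between bit-line current and weighted input can be illustrated with a minimal Python sketch of one DZ neuron. It assumes ideal devices and bit lines held at a fixed potential; the read voltage `V_READ`, the unit conductance `G_ON`, and the function names are illustrative assumptions, not values from this disclosure. An input of 0 is modeled as a High Z row that contributes no current.

```python
from typing import Sequence, Tuple

V_READ = 0.1   # volts; assumed voltage across an active cell (hypothetical)
G_ON = 1e-6    # siemens; assumed conductance per unit of weight (hypothetical)


def bit_line_currents(inputs: Sequence[int], weights: Sequence[int]) -> Tuple[float, float]:
    """Return (I_plus, I_minus) for one neuron.

    A positive weight w is stored as w * G_ON on the positive bit line,
    a negative weight as |w| * G_ON on the negative bit line.
    Rows with input 0 are High Z and add no current.
    """
    i_plus = 0.0
    i_minus = 0.0
    for x, w in zip(inputs, weights):
        if x == 0:
            continue  # High Z row: no current, no power
        if w > 0:
            i_plus += V_READ * G_ON * w
        elif w < 0:
            i_minus += V_READ * G_ON * (-w)
    return i_plus, i_minus


def threshold_gate(inputs: Sequence[int], weights: Sequence[int]) -> int:
    """Differential current comparison: output 1 when I_plus exceeds I_minus."""
    i_plus, i_minus = bit_line_currents(inputs, weights)
    return 1 if i_plus > i_minus else 0
```

Under these assumptions the differential current equals `V_READ * G_ON` times the signed weighted sum of the active inputs, which is the property that makes the TGN behavior easy to check by analysis or simulation.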

(31) Comparator Design

(32) The comparator is an important element of the architecture. The speed of this circuit is one of the main factors in estimating the neural network throughput (the router network is another important factor). Since the comparator sinks the currents from the array, it has to be large enough to handle the total current while still being able to discriminate a minimum difference (ΔGmin=1 μS). This can have a major impact on the overall area and timing, and can limit the number of inputs allowed into a single neuron. It is also a significant consumer of the overall power. A comparator architecture 901 in accordance with an embodiment of the present invention is shown in FIG. 6. Two input currents enter an input stage 903, where they are transduced into differential voltages. The difference between these two voltages is amplified at an amplifier stage 905 to create an output, provided by an output stage 907, which can be buffered and latched for driving the data onto the communication network. The desired design will be compact, low power, and fast. If this can be achieved, a comparator circuit can be used for each neuron, instead of multiplexing as is often used. The inventor has developed an exemplary comparator that meets these specifications. For 45 nm processing technology, this comparator is compact (≈55 μm.sup.2), low power (≈15 μW), and fast (≈250 MHz).

(33) Comparator Architecture

(34) FIG. 7 provides a simplified schematic of elements of a comparator 1001 in accordance with an embodiment of the present invention. The input stage 1002 is built using an FET with drain and gate connected to create a diode-connected FET (one each for the positive and negative inputs). The amplifier stage 1003 uses simple 5 FET differential voltage amplifiers. Two amplifiers are used so that the output voltage of the amplifier stage is driven to ≈V.sub.dd or V.sub.ss as needed. The output stage (not shown) uses four inverters in series to drive the output load, which includes wire capacitance and the input gate of the router/switch. Data latching is enabled by the use of a controllable transmission gate between inverters 2 and 3.

(35) For traditional applications this design would be impractical for at least two reasons:

(36) 1. The maximum current input and minimum current difference that can be sensed are inversely related, limiting the operating range of the design; and

(37) 2. The design is very sensitive to device mismatches (such as small V.sub.t differences).

(38) In using the comparator for neural net applications, however, we can take advantage of some conditions that are not typically available.

(39) 1. The weights in the neural network must be programmed, and are therefore known in advance. For any given set of weights, the comparator needs to operate correctly in only a subset of the total range required.

(40) 2. The memristors are programmable conductance devices that can be used to ensure correct operation even under device mismatch conditions.

(41) We take advantage of this knowledge by modifying the base comparator design (see FIG. 8). The first modification is to modify the input circuits by using multiple diode-connected FETs 1101 in parallel for each of the two inputs. These parallel FETs have control gates 1102 that enable one or more diodes to be active, depending on the total weight for the neuron. The control PFETs have W/L=270/45 nm. The estimated voltage drop across these control PFETs will be <5 mV. The total weight (maximum possible conductance) is known in advance since the memristors need to be programmed. This enables a proper number of diodes to be activated, enabling the design to operate in its desired range under most input conditions. The second modification is to include additional 1T1M cells in the design. One set of two cells 1103 (G.sub.s.sup.+, G.sub.s.sup.−) is connected to the bit lines (like a bias or data input). The other set of two cells 1103 (G.sub.p.sup.+, G.sub.p.sup.−) is in parallel with the set of diodes. These memristors can be programmed in a manner similar to the network weights, and enable modification of the differential voltages (V.sup.+, V.sup.−) to compensate for device and parameter mismatches.

(42) The series memristors have greater effect when V.sup.+ (or V.sup.−) is low; the parallel memristors have greater effect for higher V.sup.+ (or V.sup.−). The memristor values would most likely be found as part of a chip calibration procedure. This procedure would be done before setting the desired programming weights into the array, and uses a majority function for this purpose:

(43) 1. Set the weights to create equal numbers of +1 and −1 values, and set all inputs high.

(44) 2. During each major time interval, cycle one +1 weight, and then one −1 weight by turning the +1 input off, then on, simultaneously turning the −1 input on, then off (each for one clock cycle of 5 ns, giving a total time interval of 10 ns). FIG. 9 shows a simulation in which a time interval of 10 ns was used. The bottom plot is the comparator output, using a 5 ns clock. The output signals are 100% correct and very clean.

(45) 3. After this, reduce both the total positive and negative weights by 1 (or any other equal decrement).

(46) 4. Repeat until the “common mode” weight (i.e., the base number of negative and positive weights) is a minimum.

(47) 5. Based on the outputs, adjust the G.sub.p and G.sub.s devices as follows:

(48) High common mode weights that create “0” errors require an increase in G.sub.p.sup.+, while low common mode weights that create “1” errors require an increase in G.sub.s.sup.+. G.sub.p.sup.− and G.sub.s.sup.− would be adjusted if the opposite conditions exist.

(49) Repeat the procedure until acceptable operation is reached.

(50) This procedure assigns a value of 1 to each correct output, −1 to each incorrect output, and adjusts the comparator bias memristors until the total value equals the number of outputs measured (fully correct functionality). This procedure can be modified in many ways. For example, heavier emphasis can be given to getting correct values for high total conductance values and ignoring incorrect values at very low conductance values (or specific biases can be used to ensure extremely low conductance levels are never seen). Other optimizing algorithms can be used as desired. Using 45 nm design technology, simulations have shown that up to ±10 mV (20 mV total) V.sub.t mismatch and up to ±5 nm (10 nm total) dimensional mismatch can be tolerated between the critical pairs of devices in this differential design (N1A and N1B in the input stage in FIG. 8, and the input FETs for the amplifiers). An example simulation for this level of mismatch is shown in FIG. 9. The top plot shows V.sup.+ and V.sup.−, while the middle plot is the differential voltage (ΔV.sub.diode=V.sup.+−V.sup.−). The differential voltage is almost always negative here. Without the additional 1T1M cells, a comparator with these levels of mismatch would almost always be incorrect. With the addition of the 1T1M cells and the procedure, the comparator output can still be 100% correct.
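
The calibration procedure can be sketched in Python as follows. This is only an outline under stated assumptions: the `Cal` record, the function names, the midpoint split between "high" and "low" common mode weights, the unit adjustment step, and the `toy_comparator` model are all hypothetical and are not taken from this disclosure. In particular, the handling of the "opposite conditions" for G.sub.p.sup.− and G.sub.s.sup.− is one possible reading.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Cal:
    """Programmed levels of the four comparator bias memristors (arbitrary units)."""
    gp_plus: int = 0
    gp_minus: int = 0
    gs_plus: int = 0
    gs_minus: int = 0


Record = Tuple[int, int, int]  # (common mode weight, expected bit, observed bit)
Measure = Callable[[int, int, Cal], int]


def sweep(measure: Measure, cm_max: int, cm_min: int, cal: Cal) -> List[Record]:
    """Steps 1-4: walk the common mode weight down and record each output."""
    records = []
    for cm in range(cm_max, cm_min - 1, -1):
        for expected in (0, 1):  # net sum of -1, then +1 (assumed test pattern)
            records.append((cm, expected, measure(cm, expected, cal)))
    return records


def score(records: List[Record]) -> int:
    """+1 for each correct output, -1 for each incorrect output."""
    return sum(1 if obs == exp else -1 for _, exp, obs in records)


def adjust(records: List[Record], cal: Cal, cm_split: float) -> None:
    """Step 5: raise each bias device at most once per pass."""
    errors = {(cm >= cm_split, exp) for cm, exp, obs in records if obs != exp}
    if (True, 1) in errors:    # high common mode, "0" error
        cal.gp_plus += 1
    if (False, 0) in errors:   # low common mode, "1" error
        cal.gs_plus += 1
    if (True, 0) in errors:    # assumed opposite condition at high common mode
        cal.gp_minus += 1
    if (False, 1) in errors:   # assumed opposite condition at low common mode
        cal.gs_minus += 1


def calibrate(measure: Measure, cm_max: int, cm_min: int, max_passes: int = 50):
    """Repeat sweep and adjust until every recorded output is correct."""
    cal = Cal()
    cm_split = (cm_max + cm_min) / 2
    for _ in range(max_passes):
        records = sweep(measure, cm_max, cm_min, cal)
        if score(records) == len(records):
            return cal, True
        adjust(records, cal, cm_split)
    return cal, False


def toy_comparator(cm: int, expected: int, cal: Cal) -> int:
    """Hypothetical mismatched comparator, used only to exercise the loop."""
    net = 1 if expected == 1 else -1
    if cm > 4:
        offset = -3 + cal.gp_plus - cal.gp_minus
    else:
        offset = 2 - cal.gs_plus + cal.gs_minus
    return 1 if net + offset > 0 else 0


cal, ok = calibrate(toy_comparator, cm_max=8, cm_min=1)
```

With the toy model above the loop stops once the score equals the number of recorded outputs. A real implementation would replace `toy_comparator` with measurements from the chip and would write the resulting levels through the programming circuits.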

(51) Another design modification used to improve the comparator performance is illustrated in FIG. 10. As the two currents I.sup.+ and I.sup.− flow into the diode, they set the diode voltages 1301 (V.sup.+ and V.sup.−). The small difference in the currents caused by the differing weighted sums creates a small voltage difference. This differential voltage is amplified by amplifier 1310 to create a much larger voltage difference between V.sub.gate1 1302 and V.sub.amp1 (V.sub.gate1 stays relatively constant, while V.sub.amp1 swings over a large range). The second amplifier 1311 is used to drive its output 1303 (V.sub.amp2) nearly to V.sub.dd or V.sub.ss. The speed of this basic design is mainly dependent on the first amplifier, and is primarily determined by two factors:

(52) 1. How fast can the bias current change V.sub.amp1?

(53) 2. How much does the bias current need to change V.sub.amp1?

(54) This can be expressed using the fact that V.sub.amp1 has a node capacitance, and therefore
I=C*ΔV/Δt
or
Δt=C*ΔV/I
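
As a numerical illustration of this relation, the short Python sketch below evaluates Δt for two voltage swings. The node capacitance and bias current are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the design, and the results are not intended to reproduce the simulated delays.

```python
def switching_delay(capacitance_f: float, delta_v: float, bias_current_a: float) -> float:
    """Return delta_t = C * delta_V / I in seconds."""
    if bias_current_a <= 0:
        raise ValueError("bias current must be positive")
    return capacitance_f * delta_v / bias_current_a


C_NODE = 10e-15  # farads; hypothetical capacitance of the V_amp1 node
I_BIAS = 1e-6    # amperes; hypothetical amplifier bias current

# A 500 mV swing (for example 900 mV down to 400 mV) versus a 50 mV swing.
slow = switching_delay(C_NODE, 0.5, I_BIAS)
fast = switching_delay(C_NODE, 0.05, I_BIAS)
```

For fixed C and I, the delay scales linearly with ΔV, so reducing the swing by a factor of ten reduces the delay by the same factor.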

(55) To reduce Δt, you can increase I or reduce ΔV. The first factor is essentially a design optimization: higher bias currents can swing the output faster, but take more power and create a larger design (increased capacitance) that will slow down the amplifier. Larger amplifier input transistors will also slow down the rate at which the diode can swing the input voltages, but that is a smaller, secondary influence. The second factor is input dependent. In a situation where the previous weighted sum is highly negative and the current weighted sum=+1 (or where the previous sum is highly positive and the current sum=−1), the ΔV.sub.amp1 value is very high, and the final V.sub.amp1 voltage will be very close to V.sub.gate1. The bias current will take a relatively long time Δt to switch V.sub.amp1 past V.sub.gate1. Only then will the second amplifier switch as well. The second factor however, can be managed by the architecture. FIG. 10 depicts the changes to the amplifier architecture, and FIGS. 11 and 12 are timing simulations that display the effect of these changes.

(56) By adding a transmission gate connecting V.sub.amp1 to V.sub.gate1 1304, we can controllably force V.sub.amp1 to be very close to V.sub.gate1. ΔV.sub.amp1 will be very small (and relatively constant under all input conditions). This is done by using a strobe signal that turns this T-gate on during the early part of the comparison operation, and turns it off during the later part. The diodes are always on. While this T-gate is on, a second T-gate 1305 (connected to V.sub.amp2) is turned off, and the T-gate in the output driver (not shown) is turned on, which keeps the previous output valid and avoids V.sub.out glitching. The T-gates are turned off (or on, respectively) during the later portion of the comparison operation. The simulations in FIGS. 11 and 12 are for the same array and inputs, with equal time scales. The comparator in the first plot (FIG. 11) does not have the T-gate/T.sub.strobe feature. Since the previous sum value is very positive (ΔV.sub.diode≈6.0 mV), V.sub.amp1 must swing very far (from 900 mV to 400 mV) to cross V.sub.gate1. This takes a long time, making Δt large (6.865 ns). In the second plot (FIG. 12), the comparator with the T-gate/T.sub.strobe included drives V.sub.amp1 to drop very close to V.sub.gate1 almost instantly. This greatly reduces the delay (Δt=1.559 ns).

(57) FIGS. 13-15 show how this affects the comparator speed. Without the T-gates (FIG. 13), the speed is dependent on both the previous inputs and the current inputs. With the new architecture (FIGS. 14 and 15), V.sub.amp1 always starts very close to V.sub.gate1 and the worst case time delay is drastically reduced. The comparator speed is now relatively independent of the inputs. The T.sub.strobe time does increase the comparator delay for very fast transitions, but these do not define the comparator speed. T.sub.strobe=1 ns appears to provide the best balance.

(58) The use of the 1/High Z differential current architecture (DZ), and the comparator design described above, enables an important architectural option, which we call a tile. One of the major difficulties in trying to design a general purpose neural processor is that the desired array sizes span a wide range. Just to use a few exemplary applications in the realm of neuromorphic computing (these applications are described, e.g., in D. J. Mountain, “A general purpose neural processor,” dissertation, University of Maryland, Baltimore County, 2017), the MNIST application maps well to 256×64 arrays, the CSlite decoder stage naturally fits into an 8×256 array, the CSlite detector stage has one layer in the network that requires 512×32 arrays (fewer than 512 inputs could not be mapped), and the AES-256 State Machine would prefer a 16×16 array mapping. Finding a single array size that can efficiently map all of these is a daunting task. The availability of tiles makes it more practical. The DZ architecture and our comparator allow for the use of control FETs to divert the differential current to specific diodes at the input stage of the comparator. This means that we can add one set of two additional control FETs per comparator 1901 (per neuron) that enable the current to be passed to a comparator in a different array. Keep in mind that the current being passed represents the weighted sum of the inputs. Therefore the function of the neural net is maintained. The second array is evaluating its inputs plus the inputs from the first array. The two arrays are then combined into a single neuron (or set of neurons). This concept is illustrated in FIG. 16. This feature means that smaller arrays (tiles) can be connected to form much larger arrays. For example, four 64×16 tiles can be connected together to make a 128×32 array (see FIG. 17). Here the inputs go into Tile 1 and Tile 3; they are also sent across to Tile 2 and Tile 4. The comparators in Tile 1 and Tile 2 are shut down, and the current (the sum of the weighted inputs) is passed to the comparators in Tile 3 and Tile 4, which now sum up all of the weighted inputs to create the final outputs.
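
The four-tile example can be checked with a small Python sketch that compares a tiled evaluation against a single 128×32 array. The layout is an assumption made for illustration: Tile 1 and Tile 2 hold the first 64 inputs, Tile 3 and Tile 4 hold the remaining 64, and the bypassed currents from Tiles 1 and 2 are added at the comparators of Tiles 3 and 4. The signed weighted sum stands in for the differential current, and all function names are hypothetical.

```python
import random
from typing import List

Matrix = List[List[int]]  # weights[input][neuron]


def weighted_sums(inputs: List[int], weights: Matrix) -> List[int]:
    """Signed weighted sum per neuron; stands in for I+ minus I- on the bit lines."""
    n_neurons = len(weights[0])
    return [sum(x * row[j] for x, row in zip(inputs, weights)) for j in range(n_neurons)]


def compare(sums: List[int]) -> List[int]:
    """Threshold each neuron: 1 if the net sum is positive, else 0."""
    return [1 if s > 0 else 0 for s in sums]


def tiled_outputs(inputs: List[int], full: Matrix, rows: int = 64, cols: int = 16) -> List[int]:
    """Evaluate a 2x2 arrangement of tiles cut from the full weight matrix."""
    def block(r: int, c: int) -> Matrix:
        return [row[c * cols:(c + 1) * cols] for row in full[r * rows:(r + 1) * rows]]

    top_in, bottom_in = inputs[:rows], inputs[rows:]
    tile1, tile2, tile3, tile4 = block(0, 0), block(0, 1), block(1, 0), block(1, 1)

    # Tiles 1 and 2: comparators shut down, currents passed on (bypass).
    passed_left = weighted_sums(top_in, tile1)
    passed_right = weighted_sums(top_in, tile2)

    # Tiles 3 and 4: own currents plus the bypassed currents, then compare.
    left = [a + b for a, b in zip(passed_left, weighted_sums(bottom_in, tile3))]
    right = [a + b for a, b in zip(passed_right, weighted_sums(bottom_in, tile4))]
    return compare(left) + compare(right)


rng = random.Random(0)
full_weights = [[rng.choice([-1, 0, 1]) for _ in range(32)] for _ in range(128)]
inputs = [rng.randint(0, 1) for _ in range(128)]

tiled = tiled_outputs(inputs, full_weights)
monolithic = compare(weighted_sums(inputs, full_weights))
```

Because the passed current is simply added to the receiving tile's own current, `tiled` and `monolithic` agree for every neuron in this idealized model.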

(59) The design optimization is now to find the optimum tile size, not the optimum array size. This new architectural option also greatly expands the set of possible solutions. Without this, no array smaller than 512 inputs could have been used for the general purpose neural processor design for the applications discussed above. Instead, tiles that are very small (8×2 or 16×1, for example) are possible solutions.
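
To make the tile-size trade-off concrete, the following Python sketch counts how many tiles a single array of each exemplary size would occupy for a candidate tile size, and what fraction of the occupied cells is actually used. The array sizes are those quoted above; the function names and the simple ceiling-based counting (which ignores routing and any sharing of tiles between arrays) are assumptions made for illustration.

```python
from math import ceil
from typing import Tuple

Size = Tuple[int, int]  # (inputs, neurons)

APPLICATION_ARRAYS = {
    "MNIST": (256, 64),
    "CSlite decoder": (8, 256),
    "CSlite detector": (512, 32),
    "AES-256 state machine": (16, 16),
}


def tiles_needed(array: Size, tile: Size) -> int:
    """Number of tiles required to cover one array of the given size."""
    return ceil(array[0] / tile[0]) * ceil(array[1] / tile[1])


def utilization(array: Size, tile: Size) -> float:
    """Fraction of cells in the occupied tiles that the array actually uses."""
    used = array[0] * array[1]
    available = tiles_needed(array, tile) * tile[0] * tile[1]
    return used / available


for name, array in APPLICATION_ARRAYS.items():
    for tile in [(8, 2), (16, 1), (64, 16)]:
        print(name, tile, tiles_needed(array, tile), round(utilization(array, tile), 3))
```

Small tiles cover every listed array with full or near-full utilization at the cost of many tiles, while larger tiles need fewer tiles but leave cells unused for the narrow arrays.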

(60) The tile concept is further enhanced by the ability to control the current (and therefore the power) in the unused portions of the tile (unit cells, comparators). Simulations of a 256×32 array show that the active power can be completely eliminated. The leakage power is an extremely small fraction (much less than 1%) of the total. The input circuits may need to send the input value across a tile to a neighboring tile, through another driver/latch circuit. This adds a small delay (≈30 ps per tile). As long as the number of horizontal tiles connected is reasonable (10 or less), the effect on performance is small.
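
The size of this effect can be estimated with a few lines of Python. The sketch assumes one ≈30 ps delay per tile crossed and compares the total with a 4 ns period, which corresponds to the ≈250 MHz comparator speed quoted earlier; treating that figure as the clock period is an assumption made for illustration, and the names are hypothetical.

```python
TILE_HOP_DELAY_S = 30e-12      # seconds; approximate delay per tile, from the text
CLOCK_PERIOD_S = 1 / 250e6     # seconds; assumed period for a 250 MHz rate


def input_forwarding_delay(n_tiles: int) -> float:
    """Total added delay when an input is forwarded across n_tiles tiles."""
    return n_tiles * TILE_HOP_DELAY_S


delay_10 = input_forwarding_delay(10)
fraction_10 = delay_10 / CLOCK_PERIOD_S
```

Under these assumptions, ten horizontally connected tiles add about 300 ps, which is under a tenth of the assumed period.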

(61) Another point to be made is that the control PFETs added to the comparator design need to pass a large amount of current, and are therefore large (W/L=1200/45 nm). This keeps the ΔV.sub.ds below 3 mV. This adds about 1.9 μm.sup.2 in area to the comparator. A more important issue is that all the tiles need to have comparators, which are relatively large. This is because a tile needs to be an array itself, not just part of a larger array. The compact low power comparator design disclosed herein makes this practical.

(62) While the above description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms illustrated can be made without departing from the spirit of the disclosure. As will be recognized, certain embodiments described herein can be embodied within a form that may not provide all of the features and benefits set forth herein, as some features can be used or practiced separately from others. The scope of the invention is indicated by the appended claims rather than the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
