Hybrid accumulation method in multiply-accumulate for machine learning
11615256 · 2023-03-28
Inventors
Cpc classification
G06F2207/5523
PHYSICS
International classification
Abstract
Methods for performing mixed-mode Multiply-Accumulate (MAC) functions in an integrated circuit (IC) are disclosed. By performing part of the MAC operation spatially and in parallel, and part of it temporally and serially, the number of MAC operations can be programmed in the serial/temporal MAC segment as a multiple of the parallel/spatial MAC segment. Such a trait provides a degree of flexibility in programming the mixed-mode MAC function. A Programmable-Hybrid-Accumulation (PHA) method performs the accumulation function of the MAC IC by transforming the accumulation signal into a hybrid accumulation signal. The hybrid accumulation signal is comprised of a Most-Significant-Portion (MSP) and a Least-Significant-Portion (LSP), wherein the portions of the hybrid accumulation signal can be programmed in accordance with cost-performance objectives of an end application. Transforming the accumulated signal into a hybrid signal, and utilizing the PHA method, keeps the signal magnitudes bounded, which prevents signal overflow while accumulation cycles proceed. Arranging a mixed-signal MAC in accordance with the PHA method can, among other benefits, help to limit the peak-to-peak analog signal swings, which enhances performance attributes such as lower current consumption, faster speed, lower power supply voltage, and a wider signal accumulation range before power supply operating headroom conditions are breached.
Claims
1. A method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method comprising: Time multiplexing an apparatus to perform a sequence of a plurality of j steps comprising: Starting the time multiplexing of the apparatus at a beginning j step; Feeding a plurality of digital weight (P.sub.I) signals and a respective plurality of digital activation (Q.sub.I) signals to a plurality of digital-input to current analog-output multiplier (iMULT) apparatuses to generate a respective plurality of current analog output (P.sub.IQ.sub.I) signals corresponding to a j step of the plurality of j steps; Coupling the (P.sub.I) signals and the (Q.sub.I) signals to generate a current summation output (ΣP.sub.IQ.sub.I) signal corresponding to the j step of the plurality of j steps; Feeding a digital offset (OFS) signal to a current-mode offset digital-to-analog converter (DAC.sub.OFS) to generate an offset current (I.sub.OFS) signal corresponding to the j step of the plurality of j steps; Coupling the (ΣP.sub.IQ.sub.I) signal and I.sub.OFS signal to generate a partial current Multiply-Accumulate (iMAC.sub.P) signal corresponding to the j step of the plurality of j steps; Memorizing the (iMAC.sub.P) signal corresponding to the j step of the plurality of j steps by at least one of current-mode sample-and-hold (iSH) apparatus and current-mode Analog-to-Digital Converter (iADC) apparatus; Accumulating the (iMAC.sub.P) signals corresponding to the j step and j−1 step of the plurality of j steps by at least one of a current-mode accumulator (iACC) apparatus communicating with the (iSH) apparatus and a memory (MEM) apparatus communicating with the (iADC) apparatus; Incrementing j by 1 if j<n−1, and returning to the beginning step, otherwise exiting the time multiplexing state machine if j=n−1; Generating a final current Multiply-Accumulate (ΣΣP.sub.IJQ.sub.IJ) signal corresponding to j=n−1; Wherein 1≤n≤100; Wherein 1≤j+1≤100; Wherein each of a 
PQ.sub.MIN signal≤the (P.sub.IQ.sub.I) signal≤a PQ.sub.MAX signal; Wherein a R.sub.I/j signal is half of the sum of the PQ.sub.MIN signal and the PQ.sub.MAX signal; Wherein the (OFS) signal is programmed to be substantially equal to the R.sub.I/j signal; and Wherein the plurality of (iMULT) apparatuses is configured as at least one of digital-input to current analog-output Binary Neural Network multiplier (iBNN) apparatuses and digital-input to current analog-output multiplying current-mode Digital-to-Analog Converters (iDAC) apparatuses.
2. The method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit of claim 1, the method further comprising: Processing at least one of the (P.sub.IQ.sub.I) signals, (ΣP.sub.IQ.sub.I) signals, and the (ΣΣP.sub.IJQ.sub.IJ) signal differentially.
3. The method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit of claim 1, the method further comprising: Combining the (ΣΣP.sub.IJQ.sub.IJ) signal with a bias (C.sub.K) signal to generate an activation (FK{ΣΣP.sub.IJQ.sub.IJ±C.sub.K}) signal.
4. The method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit of claim 1, the method further comprising: Performing a Programmable Hybrid Accumulation (PHA) operation by generating a Least-Significant-Portion (LSP) of the (ΣΣP.sub.IJ.Q.sub.IJ) signal by subtracting at least one mod signal (P) from the (ΣΣP.sub.IJ.Q.sub.IJ) signal when (ΣΣP.sub.IJ.Q.sub.IJ>P) is detected, and keeping track of such a detection in an event counter to generate a Most-Significant-Portion (MSP) of the (ΣΣP.sub.IJ.Q.sub.IJ)) signal.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The subject matter presented herein is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and illustrations, and in which like reference numerals, without regard to whether a reference numeral is written with upper or lower case letters, e.g., ADD.sub.1A and ADD.sub.1A, refer to similar elements, and in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
SUMMARY OF THE DISCLOSURE
(20) An aspect of the present disclosure is a method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method comprising: multiplying simultaneously a first plurality of digital signal pairs, each digital signal pair comprising a digital weight (P.sub.I) signal and a digital activation (Q.sub.I) signal, to generate a first plurality of analog signal pair product (P.sub.I1.Q.sub.I1) signals; summing simultaneously the P.sub.I1.Q.sub.I1 signals to generate a first analog summation (ΣP.sub.I1.Q.sub.I1) signal; multiplying simultaneously a subsequent plurality of digital signal pairs, each subsequent digital signal pair comprising a digital weight (P.sub.IJ) signal and a digital activation (Q.sub.IJ) signal, to generate a subsequent plurality of pairs of analog products (P.sub.IJ.Q.sub.IJ) signals; and storing and accumulating serially the P.sub.IJ.Q.sub.IJ signals with the ΣP.sub.IJ.Q.sub.IJ signal to serially generate a multiply-accumulate (ΣΣP.sub.IJ.Q.sub.IJ) signal. Another aspect of the present disclosure is the method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method further comprising: processing at least one of the P.sub.I1.Q.sub.I1 signals, the ΣP.sub.I1.Q.sub.I1 signal, and the ΣΣP.sub.IJ.Q.sub.IJ signal differentially. Another aspect of the present disclosure is the method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method further comprising: processing at least one of the P.sub.I1.Q.sub.I1 signals, the ΣP.sub.I1.Q.sub.I1 signal, and the ΣΣP.sub.IJ.Q.sub.IJ signal in at least one of the analog domain and the digital domain. Another aspect of the present disclosure is the method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method further comprising: generating at least one of the P.sub.IJ.Q.sub.IJ signals in a binary neural network (BNN).
Another aspect of the present disclosure is the method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method further comprising: subtracting a running average (R̂.sub.I/J) signal from at least one of the ΣΣP.sub.IJ.Q.sub.IJ signal and the ΣP.sub.I1.Q.sub.I1 signal. Another aspect of the present disclosure is the method of performing a spatial-temporal multiply-accumulate (MAC) operation in an integrated circuit, the method further comprising: combining the ΣΣP.sub.IJ.Q.sub.IJ signal with an offset (C.sub.K) signal to generate an activation (F.sub.K{ΣΣP.sub.IJ.Q.sub.IJ±C.sub.K}) signal.
(21) An aspect of the present disclosure is a method of performing a Programmable Hybrid Accumulation (PHA) operation in an integrated circuit, the method comprising: multiplying simultaneously a first plurality of digital signal pairs, each digital signal pair comprising a digital weight (P.sub.I) signal and a digital activation (Q.sub.I) signal, to generate a first plurality of analog signal pair product (P.sub.I1.Q.sub.I1) signals; summing simultaneously the P.sub.I1.Q.sub.I1 signals to generate a first analog summation (ΣP.sub.I1.Q.sub.I1) signal; multiplying simultaneously a subsequent plurality of digital signal pairs, each subsequent digital signal pair comprising a digital weight (P.sub.IJ) signal and a digital activation (Q.sub.IJ) signal, to generate a subsequent plurality of pairs of analog products (P.sub.IJ.Q.sub.IJ) signals; storing and accumulating serially the P.sub.IJ.Q.sub.IJ signals with the ΣP.sub.IJ.Q.sub.IJ signal to serially generate a multiply-accumulate (ΣΣP.sub.IJ.Q.sub.IJ) signal; and generating a Least-Significant-Portion (LSP) of the ΣΣP.sub.IJ.Q.sub.IJ signal by subtracting at least one mod signal (P) from the ΣΣP.sub.IJ.Q.sub.IJ signal when ΣΣP.sub.IJ.Q.sub.IJ>P is detected, and keeping track of such a detection in an event counter to generate the Most-Significant-Portion of the ΣΣP.sub.IJ.Q.sub.IJ signal.
(22) Another aspect of the present disclosure is the method of performing a Programmable Hybrid Accumulation (PHA) operation in an integrated circuit, the method further comprising: processing at least one of the P.sub.I1.Q.sub.I1 signals, the ΣP.sub.I1.Q.sub.I1 signal, and the ΣΣP.sub.IJ.Q.sub.IJ signal in at least one of the analog domain and the digital domain. Another aspect of the present disclosure is the method of performing a Programmable Hybrid Accumulation (PHA) operation in an integrated circuit, the method further comprising: processing at least one of the P.sub.I1.Q.sub.I1 signals, the ΣP.sub.I1.Q.sub.I1 signal, and the ΣΣP.sub.IJ.Q.sub.IJ signal differentially.
(23) Another aspect of the present disclosure is the method of performing a Programmable Hybrid Accumulation (PHA) operation in an integrated circuit, the method further comprising: processing at least one of the P.sub.I1.Q.sub.I1 signals, the ΣP.sub.I1.Q.sub.I1 signal, and the ΣΣP.sub.IJ.Q.sub.IJ signal in at least one of (i) switched capacitor voltage mode and (ii) switched current mode.
DETAILED DESCRIPTION
(24) Numerous embodiments are described in the present application and are presented for illustrative purposes only and are not intended to be exhaustive. The embodiments were chosen and described to explain principles of operation and their practical applications. The present disclosure is not a literal description of all embodiments of the disclosure(s). The described embodiments are also not limiting in any sense. One of ordinary skill in the art will recognize that the disclosed embodiment(s) may be practiced with various modifications and alterations, such as structural, logical, and electrical modifications. For example, the present disclosure is not a listing of features which must necessarily be present in all embodiments. On the contrary, a variety of components are described to illustrate a wide variety of possible embodiments of the present disclosure(s). Although features of the disclosed embodiments may be described with reference to one or more particular embodiments or drawings, it should be understood that such features are not limited to usage in any one or more particular embodiments or drawings with reference to which they are described, unless expressly specified otherwise. The scope of any inventions is defined by the claims.
(25) Although process (or method) steps may be described or claimed in a particular sequential order, such processes may be configured to work in different orders. In other words, any sequence or order of steps that may be explicitly described or claimed do not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order possible. Further, some steps may be performed simultaneously despite being described or implied as occurring non-simultaneously (e.g., because one step is described after another step). Moreover, the illustration of a process by its depiction in a drawing does not imply that the illustrated process is exclusive of other variations and modifications thereto and does not imply that the illustrated process or any of its steps are necessary to the embodiment(s). In addition, although a process may be described as including a plurality of steps, that does not imply that all or any of the steps are essential or required. Various other embodiments within the scope of the described disclosure(s) include other processes that omit some or all the described steps. In addition, although a circuit may be described as including a plurality of components, aspects, steps, qualities, characteristics or features, that does not indicate that any or all the plurality are essential or required. Various other embodiments may include other circuit elements or limitations that omit some or all the described plurality. In U.S. applications, only those claims specifically citing “means for” or “step for” should be construed in the manner required under 35 U.S.C. § 112(f).
(26) Throughout this disclosure, the following nomenclatures or abbreviations may be utilized: the term FET is field-effect-transistor; MOS is metal-oxide-semiconductor; MOSFET is MOS FET; PMOS is p-channel MOS; NMOS is n-channel MOS; BiCMOS is bipolar and MOS on the same chip; SPICE is a Simulation Program with Integrated Circuit Emphasis which is an industry standard circuit simulation program; micro is μ which is 10.sup.−6; nano is n which is 10.sup.−9; and pico is p which is 10.sup.−12. Bear in mind that V.sub.DD (as a positive power supply) and V.sub.SS (as a negative power supply) may be applied to circuits, blocks, or systems in this disclosure, but may not be shown for clarity of illustration. The V.sub.SS may be connected to a negative power supply or to ground (zero) potential. The body terminal of MOSFETs can be connected to their respective source terminals or to the MOSFET's respective power supplies, V.sub.DD and V.sub.SS.
(27) Most-Significant-Bit is MSB, Least-Significant-Bit is LSB. For example, a reference signal S.sub.REF=1V processed in a 7-bit system has 2.sup.7=128 increments, and thus each LSB is 1V/128≈7.8 mV. Most-Significant-Portion is MSP, and Least-Significant-Portion is LSP, wherein the Portions in MSP and LSP can be programmed in accordance with the cost-performance objectives of an end-application.
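The LSB arithmetic above can be checked with a short sketch. The function name `lsb_size` is hypothetical and introduced only for illustration; the values (S.sub.REF=1V, 7 bits) are the ones used in the example.

```python
def lsb_size(s_ref: float, bits: int) -> float:
    """Size of one LSB for a full-scale reference s_ref split into 2**bits increments."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    return s_ref / (2 ** bits)


# Example from the text: S_REF = 1 V in a 7-bit system (128 increments).
lsb = lsb_size(1.0, 7)
print(f"{lsb * 1e3:.4f} mV")  # 7.8125 mV, i.e. roughly 7.8 mV
```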
(28) Compute-In-Memory is CIM, Binary Neural Network is BNN, Artificial Neural Network is ANN, Multiply-Accumulate is MAC, Multiply-Add is MAD, Accumulator is ACC, Modulo Operator is MOD, Programmable-hybrid-accumulator is PHA, Sample and Hold is SH, Analog-to-Digital Converter is ADC, Digital-to-Analog Converter is DAC, Logical Exclusive OR is XOR, Logical Exclusive NOR is XNOR, Static Random Access Memory is SRAM, Dynamic Random Access Memory is DRAM, Erasable Programmable Read-Only Memory is EPROM, Electrically Erasable Programmable Read-Only Memory is EEPROM, Multiplexer is MUX, Comparator is CMP, Amplifier is AMP, Switch is SW, Capacitor is C, Resistor is R, Reference Signal is S.sub.R or S.sub.REF, Clock is CLK, Reference Current is I.sub.R or I.sub.REF, Reference Voltage is V.sub.R or V.sub.REF, event counter is EC or ec, Average of R.sub.i is R̂.sub.i (sometimes also referred to as μ), standard deviation of a probability distribution function is sigma (σ), and activation function may be a sigmoid or a sign function (SIG) or Rectified Linear Unit function (ReLU) or their variations or other activation functions, Σ denotes summation or addition,
(29) $\sum_{i=1}^{n} P_i \times Q_i$
is parallel or spatial summation of n multiplications of P.sub.i by Q.sub.i that produce R.sub.i=P.sub.i×Q.sub.i, and time-multiplexed or serial (temporal) summation of m of
(30) $\sum_{i=1}^{n} P_i \times Q_i$
is
(31) $\sum_{j=1}^{m} \sum_{i=1}^{n} P_{i,j} \times Q_{i,j}$
wherein i spans from 1 to n, and j spans from 1 to m.
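As a worked illustration of the spatial and temporal summations defined above, consider a small case with n=2 parallel products and m=2 serial steps. The numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
% Hypothetical values: step j=1 uses P = (1, 2), Q = (5, 6);
%                      step j=2 uses P = (3, 4), Q = (7, 8).
\begin{align*}
\sum_{i=1}^{2} P_{i,1} Q_{i,1} &= 1 \cdot 5 + 2 \cdot 6 = 17 \\
\sum_{i=1}^{2} P_{i,2} Q_{i,2} &= 3 \cdot 7 + 4 \cdot 8 = 53 \\
\sum_{j=1}^{2} \sum_{i=1}^{2} P_{i,j} Q_{i,j} &= 17 + 53 = 70
\end{align*}
```

Each inner sum corresponds to one spatial (parallel) step, and the outer sum corresponds to the temporal (serial) accumulation of those partial results.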
(32) Keep in mind that for descriptive clarity, illustrations of this disclosure may be simplified, and their improvements beyond simple illustrations would be obvious to one skilled in the art. For example, it would be obvious for one skilled in the art that MOSFET current sources can be cascoded for higher output impedance and lower sensitivity to power supply variations, whereas throughout this disclosure current sources may be depicted with a single MOSFET for clarity of illustration. It would also be obvious to one skilled in the art that a circuit schematic illustrated in this disclosure may be arranged with NMOS transistors or arranged in a complementary version utilizing transistors such as PMOS.
(33) The illustrated circuit schematics of embodiments described in the sections that follow may have the following benefits, some of which are outlined here to avoid repetition in each section in the interest of clarity and brevity:
(34) First, a temporal-spatial MAC method described in this disclosure can provide some programmability in the number of MAC operations, for example by arranging a fixed bank of 16-channel parallel MAC operations that communicate with an accumulator that is time multiplexed. In an application that requires 250 MAC operations, such a temporal-spatial MAC IC can be clocked (programmed to be time-multiplexed) 16 times sequentially (16×16), which provides a total of 256 MAC operations to cover the objective minimum of 250 MAC operations (requiring the 6 excess multiplications to be suppressed in some manner). In another example, when 40 MAC operations are needed for another application, a temporal-spatial MAC IC can be clocked 3 times sequentially (16×3), which provides a total of 48 MAC operations to cover the objective minimum of 40 MAC operations (similarly requiring the 8 excess multiplications to be suppressed in some manner).
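The clock-count arithmetic in the two examples above can be sketched as follows. The function name `plan_cycles` and its return layout are hypothetical, introduced only for illustration; the 16-channel bank width is taken from the example.

```python
def plan_cycles(required_macs: int, channels: int = 16) -> tuple[int, int, int]:
    """Return (cycles, total_macs, excess_macs) for a fixed parallel bank.

    cycles      -- number of sequential (temporal) steps needed
    total_macs  -- MAC operations actually provided (cycles * channels)
    excess_macs -- surplus multiplications that would need to be suppressed
    """
    if required_macs < 1 or channels < 1:
        raise ValueError("required_macs and channels must be positive")
    cycles = -(-required_macs // channels)  # integer ceiling division
    total = cycles * channels
    return cycles, total, total - required_macs


print(plan_cycles(250))  # (16, 256, 6)
print(plan_cycles(40))   # (3, 48, 8)
```

How the excess multiplications are suppressed (for example, by feeding zero-valued inputs) is left open in the text and is not modeled here.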
(35) Second, accumulating multiplied signals can overflow a register in a digital adder, or cause analog outputs to breach a power supply V.sub.DD operating headroom. To overcome an overflow of MAC signals at the output of an accumulator, a programmable-hybrid-accumulator (PHA) method is disclosed. The PHA method described in this disclosure can keep the output of an accumulator within a programmed range (to avoid overflow, underflow, or breaching the operating power supply V.sub.DD and/or V.sub.SS limits) without substantially hindering the precision of the accumulator. A PHA circuit can monitor the output value of the accumulator and subtract a reference signal (S.sub.REF) value, or a finer set of S.sub.REF increments, or a signal or set of signals proportional to S.sub.REF, from the input of the accumulator when, for example, an output of the accumulator exceeds a value proportional to S.sub.REF. Also, another embodiment of the PHA circuit can vary the gain of the accumulator each time the output of the accumulator exceeds such limits as apportioned by a P value or a finer set of P incremental values. As a plurality of product signals are accumulated, and each time the output of the accumulator breaches its programmed limits, a signal (e.g., event counter) can be generated to represent a Most-Significant-Portion (MSP) of the final value of the accumulator signal. Also, as a plurality of product signals are accumulated, in concert with the MSP, a residual Least-Significant-Portion (LSP) signal can be generated representing the LSP of the final value of the accumulator output signal. In summary, by effectively transforming an objective accumulating signal of a MAC IC into a HYBRID SIGNAL (comprising an MSP signal and an LSP signal), the span of the accumulating signal can widen without either breaching the V.sub.DD and/or V.sub.SS operating headroom or causing overflow and/or underflow conditions.
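A minimal behavioral sketch of the hybrid accumulation described above is given here. It is not a circuit model: it assumes non-negative product signals, a single mod value P, and subtraction applied to the running value rather than at the accumulator input. The names `pha_accumulate`, `p_mod`, `msp`, and `lsp` are hypothetical.

```python
def pha_accumulate(products, p_mod):
    """Accumulate products while keeping the running value bounded.

    Each time the running value exceeds p_mod, p_mod is subtracted and an
    event counter is incremented. The counter plays the role of the MSP and
    the bounded residue plays the role of the LSP.
    Assumes non-negative products and p_mod > 0 (illustrative assumptions).
    """
    if p_mod <= 0:
        raise ValueError("p_mod must be positive")
    lsp = 0
    msp = 0  # event counter
    for r in products:
        lsp += r
        while lsp > p_mod:  # detection of "accumulated signal > P"
            lsp -= p_mod
            msp += 1
    return msp, lsp


msp, lsp = pha_accumulate([7, 9, 5, 12], p_mod=10)
print(msp, lsp)             # 3 3
print(msp * 10 + lsp)       # 33, the same as sum([7, 9, 5, 12])
```

In this sketch the full accumulated value can always be recovered as msp × p_mod + lsp, while the stored residue never stays above p_mod after a step completes.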
(36) Third, in end-applications where an activation function such as a magnitude comparator receives an LSP signal (sufficient in scale to compute a minimum cost function), a smaller LSP signal allows, for example, a smaller magnitude comparator, which can save silicon die area and improve convergence speed.
(37) Fourth, for a given operating current, a PHA method enables limiting peak-to-peak signal output swing of a mixed-mode accumulator (representing the accumulator's LSP signal) which can make it faster given that smaller peak-to-peak analog and or mixed-signal swings slew and or settle faster compared to ones with wider swings.
(38) Fifth, because a PHA method enables limiting peak-to-peak current signal output swing of a mixed-signal current-mode accumulator (representing the accumulator's LSP signal), the power consumption can be lowered given the bounded current-swings internal to the accumulator.
(39) Sixth, also because a PHA method enables limiting peak-to-peak signal output swing of a mixed-signal accumulator, static and dynamic signal dependent current draw from power supplies (e.g., V.sub.DD and V.sub.SS and Ground) can be more bounded and stable, which could help relax power supply design considerations surrounding the accumulator.
(40) Seventh, moreover because a PHA method enables limiting peak-to-peak signal output swing of a mixed-signal accumulator, charge-pumps or multiple power supplies can be avoided which would otherwise be needed to power on-and-off the internal switch capacitors that must remain in compliance with larger peak-to-peak signal swings during an accumulation cycle.
(41) Eighth, for the embodiment in which a significant portion of product signal accumulation is processed in analog and stored in analog memory, the need for digital memory and its associated read-write cycles is eliminated, which can materially lower dynamic power consumption associated with read-write cycles into and out of digital memory.
(42) Ninth, because a PHA method enables the bounding of analog and mixed-signal peak-to-peak swings, power consumption and speed can be improved. For example, for a current-mode PHA circuit (arranged based on a PHA method) in which peak-to-peak output current spans are limited, the current consumption of such PHA circuit can be bounded as well as be nearly independent of the number of current signals that are to be accumulated. For a voltage-mode PHA circuit in which peak-to-peak output voltage swings are bounded, the speeds (i.e., slew rate and settling time) of such PHA circuit for a given current consumption, can be optimized for higher speeds and lower power consumption. Moreover, distortions attributed to wide peak-to-peak signal swing can be minimized considering the bounded output span of a PHA circuit. Moreover, peak-to-peak signal swing-dependent charge injections may also be reduced in a PHA circuit (voltage or current mode) given the bounded magnitude of accumulated signals at the output of the mod-hybrid-accumulator.
(43) Tenth, because voltage swings are small in current mode signal processing, the disclosed mixed-signal current-mode circuit designs can enable high speed signal processing. Moreover, because current mode signal processing can be made fast, the disclosed mixed-signal current-mode circuit designs can provide a choice of trade-off and flexibility between running at moderate speeds and operating with low currents to save on power consumption.
(44) Eleventh, the disclosed mixed-signal current-mode and switch-capacitor voltage-mode or charge-transfer-mode circuit designs can be arranged on a silicon die near memory to facilitate Compute-In-Memory (CIM) operation. Such an arrangement reduces the read/write cycles into and out of memory and thus lowers overall dynamic power consumption.
(45) Twelfth, performance of some of the disclosed mixed-signal current-mode circuit embodiments can be arranged to be independent of resistors and capacitor values and their normal variations in manufacturing. As such, manufacturing die yield can perform to specifications mostly independent of passive resistor or capacitor values and their respective manufacturing variations, which could otherwise reduce die yield and increase cost.
(46) Thirteenth, because voltage swings are small in current mode signal processing, the disclosed mixed-signal current-mode circuit designs can operate with low power supply voltage.
(47) Fourteenth, also because voltage swings are small in current mode signal processing, the disclosed mixed-signal current-mode circuit embodiments can enable internal analog signals to span between full-scale and zero-scale (e.g., a summing node of a MAC or analog input of an Analog-To-Digital-Converter or analog input of a comparator) which enables a full-scale dynamic range that is less restrictive of power supply voltage V.sub.DD levels.
(48) Fifteenth, the disclosed mixed-signal voltage-mode or current-mode circuit designs can be manufactured on low-cost standard and conventional Complementary-Metal-Oxide-Semiconductor (CMOS) fabrication, which are more mature, readily available, and process node portable than “bleeding-edge” technologies, thereby facilitating embodiments of ICs having relatively more rugged reliability, multi-source manufacturing flexibility, and lower manufacturing cost.
(49) Sixteenth, digital addition and digital subtraction can occupy a larger die area than similar analog operations. Because the disclosed circuit embodiments can operate in current mode, the function of addition in current mode simply requires the coupling of output current ports. For arrangements of the disclosed circuit embodiments that can operate in switched capacitor voltage-mode, the function of addition simply requires the coupling of capacitors that carry the intended charges or voltages. Thus, the disclosed embodiments can be arranged in smaller die areas and cost less.
(50) Seventeenth, multiplications or XOR/XNOR functions can be performed in mixed signals which can save area and reduce costs.
(51) Eighteenth, as noted earlier, digital addition and subtraction functions occupy large die areas and can be expensive. Some embodiments in the present disclosure eliminate the digital adding function of bitwise count of logic state ‘1’ required in BNNs by performing the population counting of states of ‘1’s in analog, mixed-mode, or both. Coupling together equally sized current sources or equally sized (equally charged) capacitors can potentially perform the bitwise count in mixed-mode, analog, or both more effectively, thereby taking less area, consuming less power, and costing less.
(52) Nineteenth, some of the disclosed mixed-signal current-mode circuit embodiments utilized in BNNs can help reduce inaccuracies attributed to the function of addition that stems from random but normal manufacturing variations (e.g., random transistor mismatches in normal fabrication). In the disclosed mixed-signal circuit embodiments wherein equally sized current-sources or equally sized capacitors are utilized, any non-linearity due to the non-systematic random statistical contribution of mismatches (of adding or deducting an incremental current or a voltage/charge stored on equally sized capacitors) roughly equals the square root of the sum of the squares of such non-systematic random mismatches. The benefit of this attenuated impact of imperfections due to random manufacturing variations attributed to equally sized current sources or equally sized capacitors on overall accuracy, is an inherent advantage of some of the disclosed embodiments which can improve manufacturing yield to specifications that is passed on to the BNNs.
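The root-sum-square behavior mentioned above can be illustrated numerically. This sketch assumes the unit mismatches are independent and zero-mean, which the text does not state explicitly; the function names and the 1% per-unit mismatch figure are hypothetical.

```python
import math


def rss_mismatch(unit_sigmas):
    """Combined random error of independent units: square root of the sum of squares."""
    return math.sqrt(sum(s * s for s in unit_sigmas))


def relative_error(n_units, unit_sigma):
    """Combined random error relative to the sum of n_units nominally equal units.

    unit_sigma is each unit's mismatch expressed as a fraction of one unit.
    """
    return rss_mismatch([unit_sigma] * n_units) / n_units


# 16 equally sized units, each with a hypothetical 1% random mismatch:
print(rss_mismatch([0.01] * 16))   # about 0.04 units, versus 0.16 if errors added linearly
print(relative_error(16, 0.01))    # about 0.0025, i.e. 0.25% of the 16-unit total
```

Under these assumptions the combined error grows as the square root of the number of units while the signal grows linearly, so the relative error shrinks as more equal units are summed.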
(53) Twentieth, cascoding current sources can help increase output impedance and reduce sensitivity of output currents to power supply variations, but requires two cascoded transistors. Some of the disclosures herein can utilize power supply desensitization circuits for a current source that is not cascoded (e.g., single MOSFET current source).
(54) Twenty-first, for some of the disclosures herein, because each unit of the cumulative current or voltage signal represents the bitwise count of logic state ‘1’ (as an analog current through equally sized current sources or as an equally sized charge or voltage stored on equally sized capacitors), the incremental summation of a plurality of output current or voltage signals is thermometer-like. Accordingly, the disclosed mixed-signal current-mode or switch capacitor voltage-mode or charge-mode circuit embodiments provide monotonic incremental accumulation of adding current or voltage or charge signals which is beneficial for convergence on minimum cost function during training cycles of machine learning ICs.
(55) Twenty-second, some of the disclosed mixed-signal current-mode or switch-capacitor voltage-mode or charge-mode circuit embodiments utilized here have the option of enabling a meaningful portion of the computation circuitry to shut itself off (i.e., ‘smart self-power-down’) in the face of no incoming signal so that the remaining computation circuits can remain ‘always on’ while consuming low stand-by current.
(56) Twenty-third, the disclosed MAC IC embodiments may be optimized for cost-performance with a smaller size accumulator and activation function circuit for kinds of incoming signals in which the signal population follows a predictable statistical distribution profile (e.g., Gaussian distribution with an average and a sigma).
(57) Twenty-fourth, utilizing a PHA method in a MAC IC facilitates having an option of communicating only a portion of an accumulated signal such as an LSP signal (to subsequent layers) that is pertinent to the machine learning operation of the edge-based device in the field (e.g., finding minimum-cost function). Given that an LSP signal would inherently represent a smaller portion of the objective final accumulated signal, the storage of such LSP signal (analog or digital) would be smaller, cheaper, and faster to access with lower dynamic power consumption. Bear in mind that a large part of power consumption of MAC ICs is due to read-and-write cycles in and out of memory. The disclosed MAC IC embodiments provide additional solutions, besides compute-in-memory (CIM), that transform the final accumulating signals and represent them as hybrid signals, and could enable applications to process, store, read, and write only pertinent and SMALLER/PARTIAL segments of such hybrid (final accumulating) signals which could provide substantial savings in dynamic power consumption.
(58) Twenty-fifth, the cumulative errors of mixed-signal accumulators can be substantially reduced by, for example, performing part of the accumulation in analog mode and part of the accumulation in digital mode.
Section 1A—Description of FIG. 1A
(59)
(60) A plurality of n pairs of input signals P.sub.i and Q.sub.i are multiplied spatially (e.g., in parallel) via the MULT.sub.1A circuit block, whose plurality of output signals P.sub.i×Q.sub.i=R.sub.i are added via the ADD.sub.1A circuit block to generate a spatial partial summation
(61) $\sum_{i=1}^{n} P_i \times Q_i = \sum_{i=1}^{n} R_i$
signal. Then, a temporal (e.g., serial) accumulation of
(62) $\sum_{i=1}^{n} P_i \times Q_i$
signals via the ACC.sub.1A+REG.sub.1A circuit block generates a final summation in m sequences (e.g., in series/time-multiplexed) wherein the final summation is
(63) $\sum_{j=1}^{m} \sum_{i=1}^{n} P_{i,j} \times Q_{i,j}$
An activation function (F.sub.1A) receives the
(64) $\sum_{j=1}^{m} \sum_{i=1}^{n} P_{i,j} \times Q_{i,j}$
signal and a programmed offset signal (C.sub.K), and generates an activation signal,
(65) $F_K\{\sum_{j=1}^{m} \sum_{i=1}^{n} P_{i,j} \times Q_{i,j} \pm C_K\}$
Note, the sign of C.sub.K can be either + or − depending on training algorithms and the objectives of an end-application.
(66) The MULT.sub.1A, ADD.sub.1A, ACC.sub.1A, REG.sub.1A, and F.sub.1A circuit blocks and the signals traversing through them can be digital, analog, or mixed-mode, or a combination thereof, and the said signals may be arranged differentially or in a single-ended fashion, depending on cost-performance requirements of a target application. For example, the MULT.sub.1A, and ADD.sub.1A circuit blocks can be arranged in differential mixed-signal current-mode, while ACC.sub.1A, and REG.sub.1A can be arranged in switching current-mode differentially (mixed-signal sampling), and the F.sub.1A can be arranged as a differential analog comparator to perform a sign or sigmoid function.
(67) One of the benefits of the temporal-spatial MAC arrangement is that m can be programmed for the temporal accumulator (comprising ACC.sub.1A and REG.sub.1A) to be clocked by j for up to m times depending on the application requirements.
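The temporal-spatial arrangement described above can be sketched behaviorally as follows. This is a minimal illustration and not the disclosed circuit: the function names, the use of Python lists for the P.sub.i and Q.sub.i signals, and the default sign-type activation are assumptions made only for this example.

```python
from typing import Callable, Iterable, Sequence, Tuple


def spatial_partial_sum(p: Sequence[float], q: Sequence[float]) -> float:
    """Spatial step (MULT + ADD): multiply n pairs and add the products."""
    if len(p) != len(q):
        raise ValueError("P and Q must hold the same number of signals")
    return sum(pi * qi for pi, qi in zip(p, q))


def temporal_spatial_mac(
    batches: Iterable[Tuple[Sequence[float], Sequence[float]]],
    c_k: float = 0.0,
    activation: Callable[[float], int] = lambda x: 1 if x >= 0 else -1,
) -> Tuple[float, int]:
    """Temporal step (ACC + REG): accumulate one spatial partial sum per
    sequence j, then apply the activation to the offset final summation.

    The number of batches plays the role of the programmable m.
    """
    acc = 0.0
    for p, q in batches:  # one batch per temporal sequence j
        acc += spatial_partial_sum(p, q)
    return acc, activation(acc + c_k)


# Two temporal sequences (m = 2) of two pairs each (n = 2).
total, sign = temporal_spatial_mac([([1, 2], [3, 4]), ([1, 1], [1, 1])], c_k=-20)
```

Changing the number of batches passed in changes m without touching the spatial step, which mirrors the programmability described in the paragraph above.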
(68) Keep in mind that some of the benefits summarized in the earlier section titled DETAILED DESCRIPTION are applicable here.
Section 1B—Description of FIG. 1b
(69)
(70) Here, a comparator (MC.sub.1B) compares the value of the final accumulated signal ΣΣR.sub.ij to an offset signal (B.sub.K) divided by a normalization gain factor (G.sub.K), wherein the normalized offset signal (B.sub.K/G.sub.K) can be compiled and/or programmed.
(72) Also, note that some of the benefits summarized in section 1A and the earlier section titled DETAILED DESCRIPTION are applicable here.
Section 1C—Description of FIG. 1c
(73)
(74) Bear in mind that some of the benefits summarized in section 1B and the earlier section titled DETAILED DESCRIPTION are applicable here.
Section 2A—Description of FIG. 2A
(75)
(76) In the simplified embodiment of
(77) Note that the output of the accumulator here would be a signed value whose average is approximately zero scale.
(78) Moreover, here the activation function input compares the accumulated summation signal with a (compiled and programmed) normalized, biased, and offset signal C.sub.K=B.sub.K/G.sub.K−{circumflex over (R)}.sub.l, wherein B.sub.K is an offset signal, G.sub.K is a normalization gain factor, and {circumflex over (R)}.sub.l is the average of the predictably distributed set of R.sub.i=P.sub.i×Q.sub.i.
(80) For applications in which the profile of a set of R.sub.i=P.sub.i×Q.sub.i signals follows a distribution with an average {circumflex over (R)}.sub.l and a sigma σ, an accumulator with a smaller bit-width can be arranged for the ACC.sub.2A+REG.sub.2A circuit block, and a smaller bit-width magnitude comparator (e.g., for the activation function) can likewise be arranged, which saves on IC die area and lowers cost.
(81) Additionally, notice that some of the benefits summarized in section 1A and the earlier section titled DETAILED DESCRIPTION are applicable here.
Section 2B—Description of FIG. 2b
(82)
(83) The simplified embodiment of
(84) Let's take an example of a 6-bit accumulator, in which the summation signal has an unsigned full-scale value of integer 64 and an unsigned average value of integer 32. Here, the integer 32 value corresponds to one-half of the XOR outputs in a logical 1 state (population counter at Half-Scale=HS), and the integer 64 value corresponds to all XOR outputs in a logical 1 state (population counter at Full-Scale=FS). Note that the output of the 6-bit accumulator could also be a signed value whose average would track to approximately zero scale. Let's for a moment ignore the −{circumflex over (R)}.sub.l/j accumulator offset for clarity of illustration. Assume a summation signal having an integer value of 67 is fed (in its digital form) to the 6-bit accumulator. The accumulator computes a (small bit-width residue digital word corresponding to the) residual integer value 3 (67 modulo 2.sup.6), and the Carry Output (C.sub.O) of the accumulator ACC.sub.2B's MSB is activated (which may be ignored or utilized as an MSP event counter depending on the needs of the end-application). The small bit-width residue digital word corresponding to the residual value of 3 is then fed into a small bit-width magnitude comparator (CMP), which saves on area and cost.
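The 6-bit example above can be checked with a short behavioral sketch. The function name and the choice to count carry-outs in an integer are illustrative assumptions, and non-negative integer inputs are assumed.

```python
from typing import Iterable, Tuple


def accumulate_wrapping(samples: Iterable[int], bits: int = 6) -> Tuple[int, int]:
    """Accumulate non-negative integers in a `bits`-wide register.

    Returns (residue, carry_count): the residue is what stays in the
    narrow accumulator, and carry_count is how many times the carry
    output would have fired (usable as an MSP event count).
    """
    mask = (1 << bits) - 1
    acc = 0
    carry_count = 0
    for sample in samples:
        total = acc + sample
        carry_count += total >> bits  # carries out of the MSB
        acc = total & mask            # residue modulo 2**bits
    return acc, carry_count


residue, carries = accumulate_wrapping([67])
# 67 modulo 2**6 leaves a residue of 3 with one carry-out event
```

Feeding the same total in pieces (for example 40 followed by 27) leaves the same residue and carry count, since only the running sum modulo 2.sup.6 is kept.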
(87) Also, please refer to the benefits summarized in section 1C and the earlier section titled DETAILED DESCRIPTION that are applicable here.
Section 3A—Description of FIG. 3A
(88)
(89) Here, the section inside the dashed-line box (XOR.sub.3A+ADD.sub.3A) illustrates a circuit schematic of a single-ended current-mode multiply-accumulate (iMAC) for binarized neural networks (BNN, see U.S. Pat. No. 10,915,298 issued Feb. 9, 2021). The mixed-signal current-mode XOR circuit block is comprised of a plurality of equally sized current sources (e.g., N′2.sub.3A) that are selected by a plurality (of 2.sup.n) pairs of digital words (P.sub.i and Q.sub.i). In the embodiment depicted in
(90)
current signal.
(91) The iDAC.sub.3A generates an offset current signal ˜−{circumflex over (R)}.sub.l/j that is added to the ΣR.sub.i current signal; that sum is digitized via iADC.sub.3A.
(93) Note that the output of iADC.sub.3A is a signed digital signal that is (temporally) accumulated by the ACC.sub.3A+REG.sub.3A, which can be time-multiplexed for a programmable j number of times.
(94) Also, note that the digital output of ACC.sub.3A+REG.sub.3A is a signed digital word (the accumulated summation) that averages approximately zero-scale.
(96) A magnitude comparator (MC.sub.3A) performs an activation function to generate a sign signal (S.sub.ij−C.sub.K) by comparing the accumulated digital word with a normalized, biased, and offset digital word C.sub.K=B.sub.K/G.sub.K−{circumflex over (R)}.sub.l, wherein B.sub.K is an offset signal, G.sub.K is a normalization gain factor, and {circumflex over (R)}.sub.l is the average of a predictably distributed set of R.sub.i=P.sub.i×Q.sub.i.
(98) Similar to the example previously described in section 2B, the ACC.sub.3A+REG.sub.3A, as well as the MC.sub.3A functions, can be arranged with a smaller bit-width to improve cost-performance of the BNN IC embodiment disclosed here, when the set of output signal values (R.sub.i=P.sub.i×Q.sub.i) of the MULT function follows a distribution having an average ({circumflex over (R)}.sub.l) and a sigma (σ), for example and without limitation, a predictable distribution such as a Gaussian distribution.
(99) One of the benefits of the disclosed temporal-spatial MAC arrangement is reduction of the accumulated error signal during accumulation. This is because each ΣR.sub.i signal (which is an analog summation signal) gets digitized by the ADC with a fresh start, and the digital accumulation performs each of the sequential temporal summations with each fresh batch of the ADC's digital output data in the digital-mode, which minimizes and/or limits carry-over residual or cumulative errors.
(101) Additionally, please refer to the benefits summarized in section 2B and the earlier section titled DETAILED DESCRIPTION that are applicable here.
Section 3B—Description of FIG. 3b
(102)
(103) In this embodiment, a plurality of pairs of input signals (P.sub.i and Q.sub.i) are multiplied (via aMULT.sub.3B), wherein R.sub.i=P.sub.i×Q.sub.i, and added via aADD.sub.3B to produce a summation output ΣR.sub.i, which is digitized via an ADC.sub.3B. In other embodiments, the signals are mixed-signal, analog, or both.
(105) Note that the set of output signal (R.sub.i=P.sub.i×Q.sub.i) values of the MULT.sub.3B follows a distribution with an average ({circumflex over (R)}.sub.l) and a sigma (σ), and that the output of ADC.sub.3B is arranged as being unsigned. In another embodiment, the signals are signed.
(106) The explanations provided in section 2A (that pertained to
(107) One of the benefits of the disclosed temporal-spatial MAC arrangement is a substantial reduction of the total accumulated error signal during accumulation. This is because each batch of the ΣR.sub.i signal (which is an analog summation signal) gets digitized by the ADC with a fresh start, and the digital accumulation performs each of the sequential temporal summations with each fresh batch of the ADC's digital output data in the digital-mode, which minimizes any carry-over residual or cumulative errors.
(109) Also, please refer to the benefits summarized in section 2A and the earlier section titled DETAILED DESCRIPTION that are applicable here.
Section 4A—Description of FIG. 4A
(110)
(111) A full description of the schematic illustrated in
(112) An embodiment such as the circuit illustrated in
(113) Also, please refer to the benefits of current-mode signal processing that are summarized in the earlier section titled DETAILED DESCRIPTION and are applicable here.
Section 4B—Description of FIG. 4b
(114)
(115) A full description of the schematic illustrated in
(116) An embodiment such as the circuit illustrated in
(117) Please take note of the benefits of current-mode signal processing that are summarized in the earlier section titled DETAILED DESCRIPTION and are applicable here.
Section 5A—Description of FIG. 5A
(118)
(119) The PHA method arranges a modulo operating (MOD.sub.5A) circuit block around an accumulation feed-back loop that facilitates keeping the output value of the accumulator (ACC.sub.5A) from overflowing, while the programmable-hybrid-accumulator generates a Most-Significant-Portion (MSP) of the accumulated signal as well as a Least-Significant-Portion (LSP) of the accumulated signal, and wherein the LSP of the accumulated signal communicates with an activation function (F.sub.5A).
(120) Modulo operation is amply discussed in the literature. A basic summary is provided here. Further discussions, citations, and references can be found in reports such as Daan Leijen, 2001, “Division and Modulus for Computer Scientists.” In computation, modulo operation returns the remainder or signed remainder of a division, after one number is divided by another (called the modulus of the operation). Given two positive numbers ‘a’ and ‘n’, ‘a’ modulo ‘n’ (abbreviated as ‘a mod n’) is the remainder of the Euclidean division of ‘a’ by ‘n’, where ‘a’ is the dividend and ‘n’ is the divisor.
(121) A similar mathematical computational system called a residue numeral system (RNS), and variations of RNS, represents integers by their values modulo several pairwise coprime integers referred to as the moduli. Note that an RNS requires a set of moduli, not just one divisor, as later explained. An RNS can uniquely represent numbers from 0 to the product of the moduli minus 1. This representation is allowed by the Chinese remainder theorem (CRT) and variations of CRT, which assert that if N is the product of the moduli, then there is, in an interval of length N, exactly one integer having any given set of modular values. The arithmetic of an RNS is also called multi-modular arithmetic. To explain the RNS further, suppose ‘a’ and ‘m’ are any two integers with ‘m’ not zero. We say ‘r’ is a residue of ‘a’ modulo ‘m’ if a≡r (mod m). This is the same as ‘m’ divides a−r, or a=r+q×m for some integer ‘q’. The division algorithm tells us that there is a unique residue ‘r’ satisfying 0≤r<|m|, and this remainder ‘r’ is called the least non-negative residue of ‘a’ modulo ‘m’.
(122) The MOD.sub.5A and ACC.sub.5A circuit block embodiment of
(123) For an objective accumulated analog value of 4 that is programmed with a modulo analog value of 3 (‘4 mod 3’), the final analog output value at the output of the programmable-hybrid-accumulator would evaluate to an analog value of 1, because 4 divided by 3 has a quotient of 1 and a remainder of an analog value of 1. Here, the objective accumulated analog output value of 4 corresponds to an MSP that is the quotient 1, and the analog residual value of 1 is the LSP of the output of the programmable-hybrid-accumulator. In arranging the programmable-hybrid-accumulator with a single comparator, for example in one embodiment, when the output signal of the programmable-hybrid-accumulator is greater than an analog value of 3, the comparator enables a value of 3 to be subtracted from the programmable-hybrid-accumulator's input, thus bringing the output of the programmable-hybrid-accumulator back into a bounded range (avoiding overflow) by limiting the peak-to-peak signal swing at the output of the accumulator (e.g., less than an analog value of 3).
(124) As such, when arranging an accumulator in accordance with one embodiment of the PHA method, the accumulator can continue accumulating a plurality of product signals while the output of the accumulator can be kept from overflowing or breaching the operating V.sub.DD power supply limit. This is critical in low power machine learning applications (e.g., sub-1V). Note also that the programmable-hybrid-accumulator (PHA) circuit can be equipped with an event counter (EC) that keeps track of the number of times the output of the PHA detects an analog value greater than 3, in this example. This function can be beneficial for an accumulator that can be programmed to accumulate a wide range of a plurality of product signals while the MAC IC continues operating at low V.sub.DD levels. Such a trait, accordingly, enables a Machine Learning IC to optimize an objective cost-function by zooming in on the LSP of the accumulated product while concurrently having the option to capture the MSP of the output of the accumulator, if the inference or training functions require it.
(125) In another example, consider an objective accumulation target that corresponds to an analog value of 12, wherein the programmable-hybrid-accumulator is programmed to a modulo value of 4 (i.e., ‘12 mod 4’), which would evaluate to 0. This is because the division of 12 by 4 has a quotient of 3 and a remainder of 0. Here, as the output of the accumulator moves from zero-scale towards the objective analog value of 12, the PHA's comparator would trigger 3 times (e.g., each time the output of the PHA has risen to 4), and an analog value of 4 would be subtracted from the input of the programmable-hybrid-accumulator each time, which causes the output of the PHA to converge on zero (because nothing remains of 12 after subtracting 4 three times). As such, the LSP of the output of the PHA has an analog value of 0, while the quotient of 3 is the MSP of the output of the PHA.
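The two worked examples above (‘4 mod 3’ and ‘12 mod 4’) can be reproduced with a short behavioral sketch. It is a simplification made for illustration: the function name is hypothetical, the accumulation is assumed to proceed in unit steps, the comparator threshold is taken as the modulo value minus LSB/2, and the subtraction is applied immediately rather than on the following clock cycle.

```python
from typing import Iterable, Tuple


def pha_accumulate(steps: Iterable[float], modulo: float, lsb: float = 1.0) -> Tuple[int, float]:
    """Accumulate `steps` while keeping the running value bounded.

    Returns (msp, lsp): msp counts comparator events (the quotient),
    lsp is the bounded residue left in the accumulator.
    """
    lsp = 0.0
    msp = 0
    for step in steps:
        lsp += step
        # Comparator: subtract the programmed modulo value whenever the
        # accumulator output reaches it (threshold assumed at modulo - LSB/2).
        while lsp > modulo - lsb / 2:
            lsp -= modulo
            msp += 1
    return msp, lsp


msp_a, lsp_a = pha_accumulate([1] * 4, modulo=3)    # '4 mod 3'
msp_b, lsp_b = pha_accumulate([1] * 12, modulo=4)   # '12 mod 4'
# In both cases msp * modulo + lsp recovers the objective accumulated value.
```

The pair (msp, lsp) carries the same information as the unbounded sum, while the accumulator itself never holds more than roughly one modulo value.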
(126) In
(127)
(128) Setting aside the contribution of the MOD.sub.5A function initially (e.g., when the feed-back signal Z.sub.4=0, Z′.sub.2=Z.sub.1−Z.sub.4.fwdarw.Z.sub.1), the Z′.sub.2.fwdarw.Z.sub.1 signal is received by a ACC.sub.5A circuit which is depicted as an accumulator function block F.sub.5A(Z.sub.2) with a feedback around it illustrating a temporal accumulation of a time-multiplexed series of inputted Z′.sub.2 signals, while a Z.sub.3 signal would be in the making at the output of the ACC.sub.5A circuit.
(129) While streams of Z.sub.1 signals are received by the ACC.sub.5A+MOD.sub.5A circuit that is the mod-hybrid-accumulator, the MOD.sub.5A section (that is placed around a negative feedback loop of the ACC.sub.5A) receives the Z.sub.3 signals from the output of ACC.sub.5A which can then be processed through a F.sub.5A(Z.sub.3) functional block with a single-level or multi-level modulo P.sub.f signal(s), and in one embodiment, adjustable depending on a cost-performance objective.
(130) For example, in a single-level modulo operation, the F.sub.5A (Z.sub.3) can be arranged to function as follows: if Z.sub.3>P.sub.f−LSB/2, then Z.sub.4=P.sub.f, otherwise Z.sub.4=0. Bear in mind that the input to the ACC.sub.5A is Z′.sub.2=Z.sub.1−Z.sub.4, and if Z.sub.3>P.sub.f−LSB/2, then Z.sub.4=P.sub.f, which translates to Z′.sub.2=Z.sub.1−P.sub.f. This is the mechanism by which the accumulation process progresses, while a plurality of programmed P.sub.f values are accordingly and sequentially subtracted (by design) from the input of ACC.sub.5A so that the ACC.sub.5A output does not overflow.
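The single-level rule above can be sketched as a clocked loop in which Z.sub.4 is derived from the previous accumulator output and subtracted from the next input. This is a behavioral sketch under stated assumptions: the function name and the returned trace are illustrative, the accumulator is assumed to start at zero, and one Z.sub.1 sample is assumed per clock cycle.

```python
from typing import Iterable, List, Tuple


def mod_hybrid_accumulator(
    z1_stream: Iterable[float], p_f: float, lsb: float = 1.0
) -> Tuple[float, int, List[float]]:
    """Clocked model of ACC + single-level MOD around a feedback loop.

    Returns (z3, event_count, trace) where trace holds Z3 after each cycle.
    """
    z3 = 0.0
    event_count = 0
    trace: List[float] = []
    for z1 in z1_stream:
        # F(Z3): feedback value decided from the current accumulator output.
        z4 = p_f if z3 > p_f - lsb / 2 else 0.0
        if z4:
            event_count += 1          # event counter (EC) tracks the MSP
        z2_prime = z1 - z4            # Z'2 = Z1 - Z4
        z3 = z3 + z2_prime            # temporal accumulation
        trace.append(z3)
    return z3, event_count, trace


z3, ec, trace = mod_hybrid_accumulator([1] * 12, p_f=4)
# Invariant of this model: ec * p_f + z3 equals the sum of all Z1 inputs.
```

Because the subtraction lags the comparison by one cycle in this model, the last Z.sub.3 value can still sit at the threshold when the stream ends; a final comparison would move that last P.sub.f from the LSP into the event count without changing the invariant.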
(131) Also, refer to section 5B which provides another embodiment of single-level modulo operations, and sections 6A and 6B which describe other embodiments of multiple-level finer modulo operations.
(132) The final Z.sub.3 signal represents the Least-Significant-Portion (LSP) of the final accumulated Z.sub.1 value in mod P.sub.f, whereas the event counter (EC) represents the Most-Significant-Portion (MSP) of the final accumulated Z.sub.1 value, that is, the quotient, which is the number of times the event Z.sub.3>P.sub.f−LSB/2 is detected up to the final accumulation of Z.sub.1. Accordingly, Z.sub.3 represents the LSP of the final accumulated summation signal, and the number of times the event counter EC signal is triggered during the accumulation cycle represents the MSP of the final accumulated summation signal.
(135) A programmed and compiled bias signal (C.sub.K) and the final Z.sub.3 are received by an activation function (F.sub.5A) circuit block, which generates an activation signal, f.sub.K{Z.sub.3±C.sub.K}. Note again that the sign of C.sub.K can be either + or −, depending on training algorithms and the objectives of an end-application.
(136) In addition to the benefits of the programmable-hybrid-accumulator noted above, please take note of some of the other benefits summarized in the earlier section titled DETAILED DESCRIPTION that are applicable here.
Section 5B—Description of FIG. 5b
(137)
(138) Notice that for clarity of description and illustration, the embodiment of the MULT.sub.5B+ADD.sub.5B circuit blocks depicts only 4 pairs of input signals that are multiplied and then added together. In other embodiments, a substantially larger number of pairs of signals can be processed in accordance with cost-performance trade-offs and objectives of an end-application.
(139) Here, the output of MULT.sub.5B & ADD.sub.5B communicates with a programmable-hybrid-accumulator (MOD.sub.5B & ACC.sub.5B) circuit block, arranged for a single-level modulo operation programmed at a P.sub.1 value.
(140) The ACC.sub.5B section of the MOD.sub.5B & ACC.sub.5B block is comprised of two cascaded, out-of-phase sample-and-holds (SH.sub.5B & SH′.sub.5B), wherein the output of the second SH′.sub.5B is fed back to the input of the first SH.sub.5B to perform the function of signal accumulation.
(141) For a single-level modulo operation, the MOD.sub.5B function (arranged around the feed-back loop of ACC.sub.5B) is comprised of a comparator that causes a single-level P.sub.1 value to be subtracted from the input of ACC.sub.5B when the output value of ACC.sub.5B exceeds a value of P.sub.1−LSB/2, and wherein P.sub.1 can be programmed to a value (here, but not necessarily) proportional to full-scale (e.g., P.sub.1=V.sub.REF). In other words, if Z.sub.3>P.sub.1−LSB/2, then Z.sub.4=P.sub.1, otherwise Z.sub.4=0. Notice that arranging such a single-level modulo operation (with a comparator whose output controls whether to subtract a P.sub.1 value or not) is equivalent to a one-bit ADC functioning as a single comparator whose output controls a one-bit DAC that has a full-scale of −P.sub.1.
(142) Note that Z.sub.3 represents the LSP of the final accumulated summation signal, and the number of times the event counter EC signal is triggered during the temporal accumulation cycle represents the MSP of the final accumulated summation signal.
(145) In addition to the benefits summarized in section 5A, also relevant to the embodiment disclosed in
Section 6A—Description of FIG. 6A
(146)
(147) The ACC.sub.6A function is comprised of two cascaded out-of-phase sample-and-holds (SH.sub.6A & SH′.sub.6A) with the output of the second SH fed-back to the input of the first SH, like the embodiment disclosed in
(148) MOD.sub.6A is arranged to perform a finer modulo operation (arranged around the feed-back loop of ACC.sub.6A) wherein the Z.sub.3 signal is monitored and compared against multiple levels or incremental segments (e.g., P−LSB/2, 2P−LSB/2, and 3P−LSB/2) for signal processing by m.sub.1, m.sub.2, m.sub.3, and multiplexer (MUX.sub.6A) functional blocks in accordance with the following programmed arrangement:
(149) If Z.sub.3>1P−LSB/2, then Z.sub.4=Z.sub.3−1P, otherwise Z.sub.4=Z.sub.3.
(150) If Z.sub.3>2P−LSB/2, then Z.sub.4=Z.sub.3−2P, otherwise Z.sub.4=Z.sub.3.
(151) If Z.sub.3>3P−LSB/2, then Z.sub.4=Z.sub.3−3P, otherwise Z.sub.4=Z.sub.3.
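The three rules above overlap as written (a Z.sub.3 above 3P−LSB/2 also satisfies the first two), so the sketch below assumes the multiplexer selects the highest level that is exceeded. That selection order, the function name, and the `levels` parameter are assumptions for illustration only.

```python
from typing import Tuple


def multilevel_mod(z3: float, p: float, lsb: float = 1.0, levels: int = 3) -> Tuple[float, int]:
    """Return (z4, k): Z4 = Z3 - k*P for the largest k in 1..levels with
    Z3 > k*P - LSB/2, or (Z3, 0) when no level is exceeded.

    k is the number of P-sized segments removed in this step and could be
    added to an event counter to track the MSP.
    """
    for k in range(levels, 0, -1):      # check the highest level first
        if z3 > k * p - lsb / 2:
            return z3 - k * p, k
    return z3, 0


z4, k = multilevel_mod(9, p=4)
# 9 exceeds 2P - LSB/2 = 7.5 but not 3P - LSB/2 = 11.5, so two segments are removed
```

Raising `levels` models the stacking of additional comparison levels mentioned in the text, at the cost of more comparators in a physical implementation.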
(152) Although there is an IC cost-performance trade-off limit, note that there is no functional limit to the amount of stacking that could occur here, and the illustration in
(153) The signal Z.sub.2 is received by the accumulator ACC.sub.6A, wherein Z.sub.2=Z.sub.1+Z.sub.4, and wherein Z.sub.1=ΣR.sub.i with R.sub.i=P.sub.i×Q.sub.i for i ranging from 1 to n. Also keep in mind that a final accumulated Z.sub.3 value represents a (programmed) least significant portion of the final accumulated summation signal value.
(156) In the embodiment of
(157) As discussed earlier, the Z.sub.3 signal can represent the LSP of the final accumulated summation signal, and the event counter EC signal can represent the MSP of the final accumulated summation signal. In another embodiment, which portion of the signal is treated as Least Significant or Most Significant (e.g., LSP and MSP) can be programmed in accordance with the cost-performance objectives of an end application.
(160) In addition to the benefits summarized in section 5A, also relevant to the embodiment disclosed in
Section 6B—Description of FIG. 6b
(161)
(162) The ACC.sub.6B function here is comprised of two cascaded out-of-phase sample-and-holds (SH.sub.6B & SH′.sub.6B) the 2.sup.nd SH's output fed-back to the 1.sup.st SH's input, like those disclosed in the embodiments of
(163) The aMOD.sub.6B is also arranged here to perform finer modulo operations (also arranged around the feed-back loop of aACC.sub.6B), wherein as the accumulation cycle proceeds, the Z.sub.3 signal is monitored and compared against multiple levels or incremental segments that are processed via ADC.sub.6B, D.sub.6B, and DAC.sub.6B.
(164) In
(165)
wherein R.sub.i=P.sub.i×Q.sub.i for i ranging from 1 to n, Z′.sub.2=Z.sub.1−Z.sub.4, and Z.sub.2=Z′.sub.2+Z.sub.3.
(166) As noted in previous sections, the Z.sub.3 signal value at the end of the accumulation cycle can represent the LSP of the final accumulated summation signal, and the event counter EC signal (e.g., number of triggered instances) can represent the MSP of the final accumulated summation signal.
(169) In yet another embodiment, the ADC.sub.6B and DAC.sub.6B can be arranged to program which portion of the signal is treated as Least Significant or Most Significant (e.g., LSP and MSP), in accordance with the cost-performance objectives of an end application.
(170) The ADC plus DAC of
(171) In addition to the benefits summarized in section 6A, also relevant to the embodiment disclosed in
Section 6C—Description of FIG. 6c
(172)
(173) Here, a differential ADC.sub.6C (including for example a differential comparator depicting a 1-bit ADC) can perform the activation function corresponding to F.sub.5A of
(174) In
(175)
wherein R.sub.i(z)=P.sub.i(z)×Q.sub.i(z) for i ranging from 1 to n.
(176) The Z.sub.1 differential signal is received by a differential analog accumulator (aACC.sub.6C) and as the accumulation cycle proceeds it generates Z.sub.3 while a differential analog modulo functional (aMOD.sub.6C) circuit block arranged around the feed-back loop of aACC.sub.6C performs the modulo operations. As discussed in earlier sections, besides subtracting a P value or multiples of P values from Z.sub.3 (when needed to keep Z.sub.3 from overflowing), the aMOD.sub.6C circuit also keeps track of the number of times Z.sub.3 exceeds the programmed single-level P values or set of multiple-levels of P values (or incremental segments) which can trigger an Event-Counter (EC).
(177) Accordingly, the differential Z.sub.3 signal value at the end of the accumulation cycle can represent the LSP of the final accumulated summation signal, and the event counter EC signal (e.g., number of triggered instances) can represent the MSP of the final accumulated summation signal.
(180) A differential bias signal is generated by an analog bias circuit (aBIAS.sub.6C) to offset the differential Z.sub.3 signal value before it is fed into a differential ADC.sub.6C (including for example a differential comparator as a 1-bit ADC) which can perform the activation function (corresponding to F.sub.5A of
(181) In addition to the benefits summarized in section 5A, also relevant to the embodiment disclosed in
Section 7A—Description of FIG. 7A
(182)
(183) The signal that is to be accumulated V.sub.M (z) flows through a C.sub.M (z.sup.−1) that translates to V.sub.M(z)×C.sub.M (z.sup.−1)=V′.sub.M(z) where C.sub.M is a capacitance.
(184) A V.sub.P signal flows through a −C.sub.P (1−z.sup.−1) that translates to V.sub.P×C.sub.P (1−z.sup.−1)=V′.sub.p(z) where C.sub.P is a capacitance, and where V.sub.P=0 if V.sub.O (z)<V.sub.REF, else V.sub.P=V.sub.REF
(185) Combining V′.sub.p(z) and V′.sub.m(z) in the signal flow diagram translates to V′.sub.m(z)−V′.sub.p(z), and when taken through (1/C.sub.F)/(1−z.sup.−1), it yields a V.sub.O(z) signal in accordance with the following formulation:
(186)
wherein
(187)
(188)
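Working only from the three signal-flow terms stated in this section, namely V′.sub.M(z)=V.sub.M(z)×C.sub.M(z.sup.−1), V′.sub.P(z)=V.sub.P×C.sub.P(1−z.sup.−1), and the integrator term (1/C.sub.F)/(1−z.sup.−1), one consistent reading of the resulting output is sketched below. It is a reconstruction under those stated terms rather than the formulation of the original figure, and V.sub.P is treated here as a sampled sequence, which is an assumption.

```latex
\begin{aligned}
V_O(z) &= \frac{1/C_F}{1 - z^{-1}}
          \left[\, V_M(z)\, C_M\, z^{-1} \;-\; V_P(z)\, C_P \left(1 - z^{-1}\right) \right] \\
       &= \frac{C_M}{C_F}\,\frac{z^{-1}}{1 - z^{-1}}\, V_M(z) \;-\; \frac{C_P}{C_F}\, V_P(z)
\end{aligned}
```

Read this way, the V.sub.M path is a delayed integration scaled by C.sub.M/C.sub.F, while the V.sub.P path appears at the output as a direct subtraction scaled by C.sub.P/C.sub.F, which is consistent with subtracting a V.sub.REF-proportional level whenever V.sub.O(z) reaches V.sub.REF.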
(189) Similarly, as discussed in earlier sections with respect to other embodiments, at each instance during the accumulation phase when V.sub.O(z) exceeds a programmed signal level (e.g., V.sub.REF), an event is registered by an event counter to keep track of the MSP of the signal that is being accumulated.
(190) In one embodiment, to keep V.sub.O(z) from exceeding a programmed signal limit, the disclosed signal flow can be altered to adjust the gain of the integrator, which can be accomplished by changing C.sub.F instead of subtracting a signal proportional to V.sub.REF from V.sub.O(z).
(191) In another embodiment, to improve precision, instead of the V.sub.p signal getting scaled by
(192)
capacitor ratio and V.sub.M signal getting scaled by a different capacitor ratio
(193)
the clock phases in the V.sub.M(z) path can be re-programmed and altered to sample a −V.sub.p equivalent voltage, thereby providing for the signal through the accumulation function and the signal through ratio
(194)
Section 7B—Description of FIG. 7b
(195)
(196) Applying superposition, let's consider an instance when V.sub.O (z)≤V.sub.REF−LSB/2 and the mixed-mode aMOD.sub.7B circuit does not contribute to V.sub.O (z), and let us presume that charges across capacitors Cm.sub.7B and Cf.sub.7B are properly initialized. Accordingly, a first batch of V.sub.M (z) signal is received and loaded onto the switch capacitor network of the mixed-mode aACC.sub.7B circuit, wherein the aACC.sub.7B is arranged for SW1.sub.7B and SW4.sub.7B to be in phase (ϕ1) with one another and both as being arranged out of phase with SW2.sub.7B and SW3.sub.7B which are in phase (
(197)
until V.sub.O (z) exceeds a programmed signal level proportional to V.sub.REF when the aMOD.sub.7B circuit kicks-in.
(198) Assuming that charges across capacitors, including Cp.sub.7B, are properly initialized, if and when V.sub.O (z)>V.sub.REF−LSB/2, COMP.sub.7B enables SW5.sub.7B to charge the Cp.sub.7B capacitor to V.sub.REF which translates to subtracting
(199)
from V.sub.O(z), wherein V.sub.p can be programmed in proportion to V.sub.REF. Also, the output of COMP.sub.7B triggers the start of counting of the event counter that generates the EC signal.
(200) Accordingly, and as described in section 7A and illustrated in the signal flow diagram of
(201)
wherein
(202)
(203) As noted earlier, the differential V.sub.O (z) signal value at the end of the accumulation cycle can represent the LSP of the final accumulated summation signal, and the event counter EC signal (e.g., number of triggered instances) can represent the MSP of the final accumulated summation signal.
(206) Note that in
(207) As noted in section 7A, instead of subtracting a signal proportional to V.sub.REF from V.sub.O(z), the disclosed embodiment has the flexibility of keeping V.sub.O(z) from exceeding a programmed signal magnitude limit by altering the gain of the integrator, which can be accomplished by, for example, changing C.sub.F.
(208) In addition to the benefits of sharing circuitry between the mixed-mode switch capacitor based aMOD.sub.7B and aACC.sub.7B circuits which saves silicon die area and improves matching, please take note of some of the other benefits summarized in the earlier section titled DETAILED DESCRIPTION that are applicable here.
Section 8A—Description of FIG. 8A
(209)
(210) Cost-performance of the disclosed BNN is improved by intertwining functions via sharing capacitors and switches among functional blocks. The disclosed PHA in
(211) The disclosed BNN in one embodiment is arranged in a single-ended voltage-mode mixed-signal switch-capacitor configuration for descriptive and illustrative clarity, but as discussed in prior sections, a differential embodiment utilizing, for example, differential input-output amplifiers and comparators would provide additional benefits such as better noise rejection, lower drift, less offset, better power supply rejection, and lower switch charge injection.
(212) The embodiment of
(213) Notice that the circuit operation of aACC.sub.8A and aMOD.sub.8A are like their respective counterparts described and illustrated in section 7B and
(214) In
(215) In the BNN of
(216) Similar to the description provided in section 7B, (at the output ports of uMULT.sub.8A) the respective bitwise ‘1’ signals attributed to R.sub.i (z)=
(217)
is accumulated, integrated, and held on a feedback capacitor Cf.sub.8A. When a plurality of batches of
(218)
are time-multiplexed onto the accumulator j times, then at the end of the accumulation cycle, a hybrid signal
(219)
is generated that is comprised of an LSP.sub.ij of
(220)
represented at the Vo(z) output of AMP.sub.8A in the aACC.sub.8A circuit, and as an MSP.sub.ij of
(221)
represented at the EC output of CMP.sub.8A in the aMOD.sub.8A circuit.
(222) In another embodiment, to further improve performance (referring to section 7B), the term
(223)
can be supplied by the same mixed-mode aADD.sub.8A circuit which is the plurality of Cmi.sub.8A switched-capacitor network. In yet another embodiment, instead of utilizing an independent C.sub.P capacitor that imposes a new mismatch variable, when V.sub.O (z)>V.sub.REF, different clock phases for Cmi.sub.8A capacitors may be programmed to be switched to V.sub.REF full-scale (e.g., V.sub.P=V.sub.REF) which helps to eliminate C.sub.p (and its attributable mismatches) by substituting the
(224)
term with the aggregate of Cmi.sub.8A switch capacitor values or
(225)
(226) In yet another embodiment, instead of subtracting a signal proportional to V.sub.REF from V.sub.O(z), V.sub.O(z) may be kept from exceeding a programmed signal magnitude limit by altering the gain of the integrator/accumulator, which can be accomplished by, for example, changing C.sub.F.
(227) In addition to the benefits outlined in section 7B, please take note of some of the other benefits summarized in the earlier section titled DETAILED DESCRIPTION that are applicable here.
Section 8B—Description of FIG. 8b
(228)
(229)
(230) As discussed in section 3A, the segment inside the dashed-line boxes (bMULT.sub.8B and aADD.sub.8B) illustrates a circuit schematic of a single-ended current-mode multiply-accumulate (iMAC) for binarized neural networks (BNN, see U.S. Pat. No. 10,915,298 issued Feb. 9, 2021). A plurality (of 2.sup.n) pairs of digital words (P.sub.i and Q.sub.i) selectively enable a mixed-signal current-mode XOR′.sub.8B circuit to generate a plurality of R.sub.i(z)=P.sub.i⊕Q.sub.i signals, which consequently enable a selected set of equally sized current sources (e.g., N3′.sub.8B). By coupling the output ports of the selected set of N3′.sub.8B current sources, an equivalent bitwise population count function is performed in analog/mixed-mode, which generates a ΣR.sub.i(z) current signal.
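The bitwise population count described above has a simple digital equivalent, sketched here for illustration. The function name and the `unit_current` parameter (standing in for the current of one of the equally sized sources) are assumptions for this example and do not model any circuit non-ideality.

```python
from typing import Sequence


def xor_popcount_current(
    p_bits: Sequence[int], q_bits: Sequence[int], unit_current: float = 1.0
) -> float:
    """Sum of R_i = P_i XOR Q_i, scaled by one unit current per selected source."""
    if len(p_bits) != len(q_bits):
        raise ValueError("P and Q must have the same number of bits")
    selected = sum(p ^ q for p, q in zip(p_bits, q_bits))  # population count
    return selected * unit_current


i_sum = xor_popcount_current([1, 0, 1, 1], [0, 0, 1, 0])
# two bit positions differ, so two unit currents are summed
```

Coupling the outputs of the selected current sources in the circuit corresponds to the `sum` in this sketch: each differing bit pair contributes one unit to the total.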
(232) The accumulator section of programmable-hybrid-accumulator of
(233) Setting aside the role of the aMOD.sub.8B circuit for now, via an enabled S1.sub.8B switch, the first batch of summed ΣR.sub.i(z) current signals is fed onto an input-port (diode-connected gate-drain terminal of P3.sub.8B) of a first switching current iSH comprising P2.sub.8B, S3.sub.8B, C2.sub.8B, and P3.sub.8B. The output port of the first iSH (drain terminal of P3.sub.8B) is then fed onto the input-port (diode-connected gate-drain terminal of N6.sub.8B) of a second switching current iSH comprising N6.sub.8B, S4.sub.8B, C3.sub.8B, and N5.sub.8B. The output port (drain terminal of N5.sub.8B) of the second iSH is coupled back to the input port of the first iSH to form a switching current-mode integrator/accumulator (aACC.sub.8B). In one embodiment, S3.sub.8B and S4.sub.8B are controlled by out-of-phase, non-overlapping clocks.
(235) Here is the role of the aMOD.sub.8B circuit, which is comprised of P1.sub.8B, S2.sub.8B, C1.sub.8B, I.sub.R1.sub.8B, I.sub.R2.sub.8B, and N4.sub.8B: On each cycle, the accumulated batch of summed ΣR.sub.i(z) current signals is sampled and held via S2.sub.8B and C1.sub.8B onto P1.sub.8B, whose current is compared with I.sub.R1.sub.8B. When the current through P1.sub.8B exceeds the I.sub.R1.sub.8B current, N4.sub.8B turns on and steers an I.sub.R2.sub.8B current onto the drain port of P3.sub.8B, which effectively subtracts I.sub.R2.sub.8B from the drain current of P3.sub.8B before it is fed back (accumulated) onto the first iSH in the next clock cycle. The value of the I.sub.R1.sub.8B=I.sub.R2.sub.8B currents can be programmed proportional to a reference signal S.sub.REF=I.sub.REF.
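The compare-and-subtract behavior of the aMOD.sub.8B circuit can be sketched as below. The sketch assumes I.sub.R1.sub.8B=I.sub.R2.sub.8B=`i_ref`, collapses the sample, compare, and subtract timing into one step per cycle, and counts the subtraction events as a stand-in for the most significant portion. That counting, the function name, and the example values are assumptions and are not taken from the circuit description.

```python
def hybrid_accumulate(batches, i_ref):
    """Accumulate batch currents while keeping the held current bounded.

    batches: per-cycle summed currents (arbitrary units).
    i_ref:   programmed reference, assumed equal to both I_R1 and I_R2.
    Returns (msp_events, lsp_current).
    """
    lsp = 0.0  # current held in the accumulator loop
    msp = 0    # number of times the reference was subtracted (assumed MSP)
    for i_batch in batches:
        lsp += i_batch
        if lsp > i_ref:   # P1 current exceeds I_R1
            lsp -= i_ref  # N4 steers I_R2 to subtract from the loop
            msp += 1
    return msp, lsp


msp_events, lsp_current = hybrid_accumulate([3.0, 4.0, 2.0, 5.0], i_ref=8.0)
```

Only one subtraction is modeled per cycle, matching the single comparison described above, so this sketch presumes that no single batch exceeds `i_ref`.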
(237) Accordingly, the final accumulated signal magnitude is bounded by the programmed magnitude of I.sub.R1.sub.8B=I.sub.R2.sub.8B, which can result in the following benefits: First, it helps save on power consumption since bounded current spans translate to bounded current consumption. Second, by effectively transforming the accumulating signal (at the output of the MAC IC) into a hybrid signal (comprising an MSP and an LSP), the span of the accumulating signal is widened without either breaching the V.sub.DD operating headroom or causing an overflow condition. Third, a faster accumulator response can be achieved given the smaller/bounded range of the accumulating signal movement. Fourth, there can be less signal-dependent charge injection within the accumulator/integrator circuit since the accumulated signals swing less. In another embodiment, a switching current differential aMOD.sub.8B and aACC.sub.8B would substantially mitigate some of the performance limitations of the single-ended aMOD.sub.8B and aACC.sub.8B that are illustrated here for the sake of clarity.
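To make the bounded-swing benefit concrete, the following sketch compares the peak level of a plain running sum against the peak level when a reference is subtracted each time it is exceeded. It is a behavioral illustration only. The function name, the batch values, and the reference value are hypothetical.

```python
def peak_levels(batches, i_ref):
    """Return (plain_peak, bounded_peak) for the same input sequence."""
    plain = bounded = 0.0
    plain_peak = bounded_peak = 0.0
    for i_batch in batches:
        plain += i_batch
        bounded += i_batch
        plain_peak = max(plain_peak, plain)
        # Record the bounded level before the subtraction takes effect,
        # since that is the largest value the loop must carry.
        bounded_peak = max(bounded_peak, bounded)
        if bounded > i_ref:
            bounded -= i_ref
    return plain_peak, bounded_peak


plain_peak, bounded_peak = peak_levels([3.0, 4.0, 2.0, 5.0, 4.0, 3.0], i_ref=8.0)
```

For this input the plain sum climbs to 21 units, while the bounded loop never carries more than 10 units, which stays below the reference plus the largest single batch.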
(238) The drain port of N7.sub.8B mirrors the N5.sub.8B accumulated current, representing an LSP.sub.ij of the accumulated ΣR.sub.i(z) signal at the current output port I.sub.O of the aACC.sub.8B circuit, and an MSP.sub.ij of the accumulated ΣR.sub.i(z) signal is represented at the EC port of the aMOD.sub.8B circuit.
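One way to read the two output portions together is sketched below. It assumes the most significant portion can be expressed as a count of reference subtractions and the least significant portion as the residual current at I.sub.O. This encoding, the function name, and the values are assumptions made for illustration.

```python
def reconstruct_total(msp_events: int, lsp_current: float, i_ref: float) -> float:
    """Combine the assumed MSP event count and LSP residual into one total.

    msp_events:  number of times the reference was subtracted (assumed encoding).
    lsp_current: residual current left in the accumulator.
    i_ref:       programmed reference current subtracted on each event.
    """
    return msp_events * i_ref + lsp_current


total = reconstruct_total(msp_events=1, lsp_current=6.0, i_ref=8.0)
```

Under this assumed encoding, one subtraction event with an 8-unit reference and a 6-unit residual corresponds to a 14-unit accumulated signal, even though the loop itself never had to hold 14 units.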
(241) Please take note of some of the other benefits, including those attributed to current mode signal processing, summarized in the earlier section titled DETAILED DESCRIPTION that are applicable here.