Digital approximate squarer for machine learning
11416218 · 2022-08-16
Inventors
Cpc classification
G06F2207/5523
PHYSICS
G06F17/17
PHYSICS
International classification
Abstract
Digital approximate squarer (aSQR)s utilizing apparatuses, circuits, and methods are described in this disclosure. The disclosed aSQR methods can operate asynchronously and or synchronously. For applications where low precisions is acceptable, fewer interpolations can yield less precise square approximation, which can be computed faster and with lower power consumption. Conversely, for applications where higher precision are required, more interpolations steps can generate more precise square approximation. By utilizing the disclosed aSQR method, precision objectives of a squarer approximation function can be programmed real-time and on the fly, which enables optimizing for power consumption and speed of squaring, in addition to optimize for the approximate squarer's die size and cost.
Claims
1. An approximate digital squaring (aSQR) method in a digital state machine in an integrated circuit (IC), the digital state machine in the integrated circuit (IC) comprising at least one of an input port, an output port, a digital memory apparatus, a digital adder apparatus, a digital subtractor apparatus, a digital divider apparatus, a digital accumulator apparatus, and a digital maximum selector apparatus, the method comprising: receiving a digital input word (D) at the input port, wherein the digital input word (D) has a digital value spanning from zero scale (ZS) to full scale (FS); programming a number of digital interpolation clock steps (n) in the digital state machine; initializing each digital word of a digital array O in the digital memory apparatus to zero, wherein the digital array O is comprised of a plurality of digital words O.sub.j; initializing each digital word of a digital array P in the digital memory apparatus to zero, wherein the digital array P is comprised of a plurality of digital words P.sub.j; programming the digital state machine beginning at a first digital interpolation clock step of j=1 and ending at a last digital interpolation clock step of j=n, generating and storing a sequence of digital words into the plurality of digital words O.sub.j in the digital memory apparatus by, at each clock step j: generating, in the digital divider apparatus, an array of offset values equal to full scale (FS) divided by 2.sup.j to generate a digital offset equal to
2. The approximate digital squaring (aSQR) method in a digital state machine in an integrated circuit (IC), the digital state machine in the integrated circuit (IC) comprising at least one of the input port, the output port, the digital memory apparatus, the digital adder apparatus, the digital subtractor apparatus, the digital divider apparatus, the digital accumulator apparatus, and the digital maximum selector apparatus of claim 1, the method further comprising: beginning at the first digital interpolation clock step of j=1 and ending at the last digital interpolation clock step of j=n, generating a S.sub.n signal that is a summation of a product of P.sub.j and ½.sup.(j-2), in the digital adder apparatus, equal to
3. The approximate digital squaring (aSQR) method in a digital state machine in an integrated circuit (IC), the digital state machine in the integrated circuit (IC) comprising at least one of the input port, the output port, the digital memory apparatus, the digital adder apparatus, the digital subtractor apparatus, the digital divider apparatus, the digital accumulator apparatus, and the digital maximum selector apparatus of claim 2, the method further comprising: wherein the Sn signal represents a first S.sub.n signal; generating another Sn signal, wherein the another S.sub.n signal represents a second S.sub.n signal; and subtracting, in at least one of the digital subtractor apparatus and a second digital subtractor apparatus, the first Sn signal from the second Sn signal.
4. A method of generating an approximate square (aSQR) of input signals in an approximate squarer integrated circuit (IC), the approximate squarer integrated circuit (IC) comprising at least one of a subtracting apparatus, a summation apparatus, a maximum-selecting apparatus, and a scaling apparatus, the method comprising: generating, during a first clock step (J1), a first offsetted output signal (O1) by: generating, in the scaling apparatus, a one-half full scale signal (HS) by scaling a full scale signal (FS) by a factor of one-half; subtracting, in the subtracting apparatus, the one-half full scale signal (HS) from an input signal (D), wherein the input signal (D) has a maximum range equal to the full-scale signal (FS); selecting, during the first clock step (J1), in the maximum-selecting apparatus, a first positive P signal (P1) that is a maximum of the first offsetted output signal (O1) and a zero-scale signal (ZS), wherein the input signal (D) has a minimum range equal to the zero-scale signal (ZS); generating, during a second clock step (J2), a second offsetted output signal (O2) by: generating, in the scaling apparatus, a one-fourth full scale signal (QS) by scaling the full scale signal (FS) by a factor of one-fourth; generating, in the scaling apparatus, a double first positive P signal (P1) by scaling the first positive P signal (P1) by a factor of two; subtracting, in the subtracting apparatus, the one-fourth full scale signal (QS) and the double first positive P signal (P1) from the input signal (D); selecting, during the second clock step (J2), in the maximum-selecting apparatus, a second positive P signal (P2) that is the maximum of the second offsetted output signal (O2) and the zero-scale signal (ZS); and generating, during the second clock step (J2), a second approximate square signal (S2) by summing, in the summing apparatus, the double first positive P signal (P1), and the second positive P signal (P2).
5. The method of generating an approximate square (aSQR) of input signals in an approximate squarer integrated circuit (IC), the approximate squarer integrated circuit (IC) comprising at least one of the subtracting apparatus, the summation apparatus, the maximum-selecting apparatus, and the scaling apparatus of claim 4, the method further comprising: generating, during a third clock step (J3), a third offsetted output signal (O3) by: generating, in the scaling apparatus, a one-eighth full scale signal (ES) by scaling the full scale signal (FS) by a factor of one-eighth; generating, in the summing apparatus, a P1+P2 signal by summing the first positive P signal (P1) and the second positive P signal (P2); generating, in the scaling apparatus, a double P1+P2 signal by scaling the P1+P2 signal by a factor of two; subtracting, in the subtracting apparatus, the double P1+P2 signal and the one-eighth full scale signal (ES) from the input signal (D); selecting, during the third clock step (J3), in the maximum-selecting apparatus, a third positive P signal (P3) that is a maximum of the third offsetted output signal (O3) and the zero-scale signal (ZS); generating, during the third clock step (J3), a third approximate square signal (S3) by: generating, in the scaling apparatus, a one-half third positive P signal (P3) by scaling the third positive P signal (P3) by a factor of one-half; and summing, in the summation apparatus, the double first positive P signal (P1), the second positive P signal (P2), and the one-half third positive P signal (P3).
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The subject matter presented herein is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and illustrations, and in which like reference numerals refer to similar elements, and in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
DETAILED DESCRIPTION
(15) Numerous embodiments are described in the present application and are presented for illustrative purposes only and is not intended to be exhaustive. The embodiments were chosen and described to explain principles of operation and their practical applications. The present disclosure is not a literal description of all embodiments of the disclosure(s). The described embodiments also are not, and are not intended to be, limiting in any sense. One of ordinary skill in the art will recognize that the disclosed embodiment(s) may be practiced with various modifications and alterations, such as structural, logical, and electrical modifications. For example, the present disclosure is not a listing of features which must necessarily be present in all embodiments. On the contrary, a variety of components are described to illustrate the wide variety of possible embodiments of the present disclosure(s). Although particular features of the disclosed embodiments may be described with reference to one or more particular embodiments and/or drawings, it should be understood that such features are not limited to usage in the one or more particular embodiments or drawings with reference to which they are described, unless expressly specified otherwise. The scope of the disclosure is to be defined by the claims.
(16) Although process (or method) steps may be described or claimed in a particular sequential order, such processes may be configured to work in different orders. In other words, any sequence or order of steps that may be explicitly described or claimed does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order possible. Further, some steps may be performed simultaneously despite being described or implied as occurring non-simultaneously (e.g., because one step is described after the other step). Moreover, the illustration of a process by its depiction in a drawing does not imply that the illustrated process is exclusive of other variations and modifications thereto, does not imply that the illustrated process or any of its steps are necessary to the embodiment(s). In addition, although a process may be described as including a plurality of steps, that does not imply that all or any of the steps are essential or required. Various other embodiments within the scope of the described disclosure(s) include other processes that omit some or all of the described steps. In addition, although a circuit may be described as including a plurality of components, aspects, steps, qualities, characteristics and/or features, that does not indicate that any or all of the plurality are essential or required. Various other embodiments may include other circuit elements or limitations that omit some or all of the described plurality.
(17) Throughout this disclosure, the terms FET is field-effect-transistor; MOS is metal-oxide-semiconductor; MOSFET is MOS FET; PMOS is p-channel MOS; NMOS is n-channel MOS; BiCMOS is bipolar CMOS; LSP of a signal is the Least-Significant-Portion of the signal; MSP of the signal is the Most-Significant-Portion of the signal; and the sum of the MSP of the signal plus the LSP of the signals is equal to the whole signal; and the MSP or LSP can be represented in analog or digital form or combination thereof; MSB is Most-Significant-Bit and LSB is Least-Significant-Bit; SPICE is Simulation Program with Integrated Circuit Emphasis which is an industry standard circuit simulation program; micro is μ which is 10.sup.−6; nano is n which is 10.sup.−9; and pico is p which is 10.sup.−12. Bear in mind that V.sub.DD (as a positive power supply) and V.sub.SS (as a negative power supply) are applied to all the circuitries, block, or systems in this disclosure, but may not be shown for clarity of illustrations. The V.sub.SS may be connected to a negative power supply or to the ground (zero) potential. Body terminal of MOSFETs can be connected to their respective source terminals or to the MOSFET's respective power supplies, V.sub.DD and V.sub.SS.
(18) Keep in mind that for descriptive clarity, illustrations of the disclosed inventions are simplified, and their improvements beyond simple illustrations would be obvious to one skilled in the arts.
Section 1A—Description of FIG. 1A
(19)
(20) In section A1′ and
(21) Some of the benefits of the aSQR method operating synchronously is summarized below:
(22) First, the aSQR method enables a digital IC state machine to perform on-the fly or pre-programming of precision versus power consumption, and speed of an approximate squarer. The lower the precision requirement, the faster the squaring and the lower the power consumption per the squaring operation. As such, the precision of squarer approximation can be traded off with cost, speed, and power consumption depending on application cots-performance objectives.
(23) Second, relatively speaking while addition (subtraction) occupy a large area in the digital domain, a digital IC state machine arranged in accordance with the disclosed aSQR method utilizes fewer adders compared to conventional digital IC squarers. Instead, the disclosed aSQR method requires functions such as multiply or divide by two, that can be implemented by a simple shift to the right or left in the digital domain, which takes a small die area. Moreover, the aSQR method utilizes functions such as adding or subtracting a fixed digital value (in proportion to an input digital word's full scale), which also take a relatively small area.
(24) Third, the disclosed digital IC approximate squaring can be arranged to perform approximate multiplication by utilizing the quarter square method. Accordingly, digital IC multiplication can be performed by deducting the square of subtraction of two digital words (x, y) from the square of their summation as in (x+y).sup.2+(x−y).sup.2=4xy.
(25) Fourth, the disclosed digital IC approximate squaring can be arranged, in the back-end, to performed square and accumulate (SAC) and multiply and accumulate (MAC) functions in mixed-mode. For example, plurality of outputs of approximate digital IC squarers or approximate digital IC multipliers can be inputted to plurality of current mode Digital-to-Analog-Converters (iDACs), wherein by the function of summation (e.g., adding two multiplications) can be performed simply by coupling together the current output terminals of plurality of iDACs.
Section 1A′—Description of FIG. 1A′
(26)
(27) In
(28) In the first interpolation (n=1), the D.sub.i word is shifted by half of full-scale to arrange a digital word O.sub.1 which is a D.sub.i word that is digitally offset by half of FS. As such, O.sub.1=D.sub.i−FS×2.sup.−1. Then, a maximum of the O.sub.1 word and zero-scale is selected that outputs a P.sub.1 digital word which is a positive word P.sub.1=max (O.sub.1, ZS). Accordingly, the P.sub.1 word represents the positive portion of the D.sub.i word above ½ of FS.
(29) In the second interpolation (n=2) stage, the D.sub.i word is shifted by a sum of 2×P.sub.1 word and a quarter of full-scale to generate a digital word O.sub.2 which is a D.sub.i word that is offset down by FS×2.sup.−2+2×P.sub.1. That is to say O.sub.2=D.sub.i−(FS×2.sup.−2+2×P.sub.1). Then, a maximum of the O.sub.2 word and zero-scale is selected that generates a P.sub.2 word which is a positive word with respect to zero-scale or P.sub.2=max (O.sub.2,ZS). Accordingly, the P.sub.2 word represents the positive portion of the D.sub.i word above the sum of ¼ of FS and 2×P.sub.1 word. Here at the second interpolation point, an approximate squared digital word S.sub.2=D.sub.i2.sup.2 (that is an approximate representation of the square of the D.sub.i word) is be generated by summing the binarily scaled P.sub.1 and P.sub.2 words. Stated mathematically, D.sub.i.sup.2≈S.sub.2=D.sub.i2.sup.2=2.sup.1×P.sub.1+2.sup.0×P.sub.2. As depicted in
(30) When a squarer with greater than 93.6% of precision is required, another interpolation (n=3) can be implemented in accordance to the aSQR method. In the third interpolation stage, the D.sub.i word is shifted by a sum of 2×(P.sub.1+P.sub.2) word and one eighth of full-scale to generate a digital word O.sub.3 which is a D.sub.i word that is offset by FS×2.sup.−3+2×(P.sub.1+P.sub.2). That is to say O.sub.3=D.sub.i−{FS×2.sup.−3+2×(P.sub.1+P.sub.2)}. Then, the maximum of the O.sub.3 word and zero-scale is selected which generates a P.sub.3 word that is a positive word with respect to zero-scale or P.sub.3=max (O.sub.3, ZS). Accordingly, the P.sub.3 word represents the positive portion of the D.sub.i word above the sum of ⅛ of FS and 2×(P.sub.1+P.sub.2) word. Here at the third interpolation point, an approximate squared digital word S.sub.3=D.sub.i3.sup.2 (that is an approximate representation of the square of the D.sub.i word) is be generated by summing the binarily proportioned P.sub.1, P.sub.2, and P.sub.3 words. Stated mathematically, D.sub.i.sup.2≈S.sub.3=D.sub.i3.sup.2=2.sup.1×P.sub.1+2.sup.0×P.sub.2+2.sup.−1×P.sub.3≈D.sub.i3.sup.2. As depicted in
(31) Where a squarer with greater than 98.4 of precision is required, another interpolation (n=4) can be implemented in accordance to the aSQR method. In the fourth interpolation stage, the D.sub.i word is shifted by a sum of 2×(P.sub.1+P.sub.2+P.sub.3) word and one sixteenth of full-scale to arrange a digital word O.sub.4 which is a D.sub.i word that is offset by FS×2.sup.−4+2×(P.sub.1+P.sub.2+P.sub.3). Put differently, O.sub.4=D.sub.i−{FS×2.sup.−4+2×(P.sub.1+P.sub.2+P.sub.3)}. Then, a maximum of the O.sub.4 word and zero-scale is selected that generates a P.sub.4 word which is a positive word with respect to zero-scale or P.sub.4=max (O.sub.4, ZS). Accordingly, the P.sub.4 word represents the positive portion of the D.sub.i word above the sum of 1/16 of FS and 2×(P.sub.1+P.sub.2+P.sub.3) word. Here again at the fourth interpolation point, an approximate squared digital word S.sub.4=D.sub.i4.sup.2 (that is an approximate representation of the square of the D.sub.i word) is be generated by summing binarily proportioned P.sub.1, P.sub.2, P.sub.3, and P.sub.4 words. Stated mathematically, D.sub.i.sup.2≈S.sub.4=D.sub.i4.sup.2=2.sup.1×P.sub.1+2.sup.0×P.sub.2+2.sup.−1×P.sub.3+2.sup.−2×P.sub.4. As depicted in
(32) Again, if an approximate squarer with higher precision than 0.4% (accurate to ˜8-bits) is required, another interpolation (n=5) can be implemented in accordance to the aSQR method. As such, the D.sub.i word can be shifted by a sum of 2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4) words and 1/32 of full-scale to arrange a digital word O.sub.5 which is a D.sub.i word that is offset by FS×2.sup.−5+2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4). Said differently, O.sub.5=D.sub.i−{FS×2.sup.−5+2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4)}. Then, a maximum of the O.sub.5 word and zero-scale is selected that generates a P.sub.5 word which is a positive word with respect to zero-scale or P.sub.5=max (O.sub.5, ZS). Accordingly, the P.sub.5 word represents the positive portion of the D.sub.i word above the sum of 1/32 of FS and 2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4) word. Here again, an approximate squared digital word S.sub.5=D.sub.i5.sup.2 can be arranged by summing binarily proportioned P.sub.1, P.sub.2, P.sub.3, P.sub.4, and P.sub.s words, wherein S.sub.5=D.sub.i5.sup.2 word is an approximate representation of the square of the D.sub.i word. Stated mathematically, D.sub.i.sup.2≈S.sub.5=D.sub.i5.sup.2=2.sup.1×P.sub.1+2.sup.0×P.sub.2+2.sup.−1×P.sub.3+2.sup.−2×P.sub.4+2.sup.−3×P.sub.5. As depicted in
(33) Similarly, if an approximate squarer with better than ˜0.1% precision (accuracy of ˜10-bits) is needed, another interpolation (n=6) can be implemented in accordance to the aSQR method. As such, the D.sub.i word is shifted by a sum of 2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4+P.sub.5) words and 1/64 of full-scale to arrange a digital word O.sub.6 which is a D.sub.i word that is offset by FS×2.sup.−6+2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4+P.sub.5). Stated differently, O.sub.6=D.sub.i−{FS×2.sup.−6+2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4+P.sub.5)}. Then, a maximum of the O.sub.6 word and zero-scale is selected that generates a P.sub.6 word which is a positive word with respect to zero-scale or P.sub.6=max (O.sub.6, ZS). Accordingly, the P.sub.6 word represents the positive portion of the D.sub.i word above the sum of 1/64 of FS and 2×(P.sub.1+P.sub.2+P.sub.3+P.sub.4+P.sub.5) words. Here again at the sixth interpolation point, an approximate squared digital word S.sub.6=D.sub.i6.sup.2 (that is an approximate representation of the square of the D.sub.i word) is be generated by summing binarily proportioned P.sub.1, P.sub.2, P.sub.3, P.sub.4, and P.sub.5 words. Stated mathematically, D.sub.i.sup.2≈S.sub.5=D.sub.i5.sup.2=2.sup.1×P.sub.12.sup.0×P.sub.2+2.sup.−1×P.sub.3+2.sup.−2×P.sub.4+2.sup.−3×P.sub.5+2.sup.−4×P.sub.6. As depicted in
(34) Form the above description, it becomes clear that if an approximate squarer with better than ˜0.025% precision (accuracy of ˜12-bits) is needed, then more interpolation (n>6) can be implemented in accordance to the aSQR method.
(35) In summary, the benefits of:
(36) First, full adders occupy large area in the digital domain, generally speaking. The aSQR method can be implemented in the digital domain with fewer adders (compared to a conventional digital squarer) which makes it more area efficient.
(37) Second, implementing the aSQR method requires a number multiply or divide by two operations which can be implemented inexpensively in the digital domain by a shift right or left operation, respectively.
(38) Third, utilizing the aSQR method having more interpolations, the peak-to-peak digital value of sequential P.sub.i digital words diminish, which can help reduced the overall logic gate-count of its implementation.
(39) Fourth, the aSQR method generates a number of points (digital words) that exactly (represent) fit the square function, and linearly interpolates in-between those points. The larger the number of interpolation (n), the greater number of points that exactly fit an ideal square function and thus the less the error associated with linearly interpolating in between those exact fit points.
(40) Fifth, fewer gates in a digital circuit generally go hand-in-hand with lower dynamic power consumption and faster speed. As such, since the aSQR method requires fewer gates for implementing a square function, it can function with higher speed and lower dynamic power consumption compared to convocational digital IC squarer implementations, for a given resolution.
(41) Sixth, the disclosed digital IC approximate squaring can be arranged to perform approximate multiplication by utilizing the quarter square method. Accordingly, digital IC multiplication can be performed by deducting the square of subtraction of two digital words (x, y) from the square of their summation as in (x+y).sup.2+(x−y).sup.2=4xy.
(42) Seventh, the disclosed digital IC approximate squaring can be arranged to performed square and accumulate (SAC) and multiply and accumulate (MAC) functions in mixed-mode. For example, plurality of outputs of approximate digital IC squarers or approximate digital IC multipliers can be inputted to plurality of current mode Digital-to-Analog-Converters (iDACs), wherein by the function of summation (e.g., adding two multiplications) can be performed simply by coupling together the current output terminals of plurality of iDACs.
Section 1A″—Description of FIG. 1A″
(43)
(44) The horizontal axis shows the digital input word Di spanning from zero scale (ZS) at zero milli-seconds (ms) to full scale (FS) at 10 ms.
(45) The vertical axis shows the percent (%) of inaccuracy of the squarer approximation (S.sub.2 to S.sub.6) as compared to an ideal square (D.sub.i).
(46) Bear in mind that for sake of clarity (e.g., avoid over-lapping graphs), some of the error waveforms in the upper and lower graphs of
(47) The lower part of
(48) The upper part of
Section 1B & 1B′—Description of FIGS. 1B & 1B′
(49)
(50) The A and B are 1-bit wide digital input ports, Ci is the carry-in 1-bit port port, So is the summation output 1-bit port, and Co is the carry-out 1-bit port.
Section 1C & 1D—Description of FIGS. 1C & 1D
(51)
(52) The a1 to a4 (a1:a4) are the first 4-bit wide input port and b1 to b4 (b1:b4) are the second 4-bit wide input port of the 4-bit wide full adder of
Section 1E & 1F—Description of FIGS. 1E & 1F
(53)
(54) The a1 to a6 (a1:a6) are the first 6-bit wide input port and b1 to b6 (b1:b6) are the second 6-bit wide input port of the 6-bit wide full adder of
Section 1G & 1H—Description of FIGS. 1G & 1H
(55)
(56) The a1 to a8 (a1:a8) are the first 8-bit wide input port and b1 to b8 (b1:b8) are the second 8-bit wide input port of the 8-bit wide full adder of
Section 2A—Description of FIG. 2A
(57)
(58) Here, the digital input word Di is an 8-bit wide word (D1:D8) where D1 is the Most-Significant-Bit (MSB) and D8 is the Least-Significant-bit (LSB).
(59) In the asynchronous embodiment of aSQR method depicted in
(60) The D1/
(61) The D1/
(62) The D1/
(63) The D2/
(64) The 4-bit full adder 4FA.sub.2a adds the 2-bit wide digital word Z7:Z8 (with proper scaling via arranging a1:a2=0) to the 4-bit wide Y5: Y8 digital word. Then, the Q1:Q4 four-bit wide digital output word of 4FA.sub.2a (with proper scaling via arranging a1′:a2′=0) is added the 6-bit wide digital word X3:X8 through the 6-bit full adder 6FA.sub.2a. Next, the Q1:Q6 six-bit wide digital output word of 6FA.sub.2a (with proper scaling via arranging a1″:a2″=0) is added the 8-bit wide digital word W2:W8 (with b8=0) through the 8-bit full adder 8FA.sub.2a.
(65) The 8-bit digital output word Q1:Q8 of 8FA.sub.2a represents the equivalent to the S4 digital word of
(66) The SPICE simulation of digital design in
(67) It is obvious to one skilled in the art that other combination logic designs can be implemented in accordance with the aSQR method. Moreover, it is known by those skilled in the arts that for asynchronous logic, alternative digital IC embodiments (e.g., flip-flops, clocked latches, etc.) may be utilized to prevent (e.g., adder output, etc.) glitches due to intermediate digital values rippling through the stages of digital IC logic paths. Also, keep in mind that for clarity of illustration of
(68) The benefits of approximate squarer summarized in sections 1A and 1A′ are applicable here to
Section 3A—Description of FIG. 3A
(69)
(70) The horizontal axis shows the digital input word Di spanning from zero scale (ZS) at zero milli-seconds (ms) to full scale (FS) at 50 μs.
(71) The vertical axis shows the percent (%) of inaccuracy of the asynchronous squarer of