Parallel self-timed adder (PASTA)

Abstract

A parallel self-timed adder (PASTA) is disclosed. It is based on recursive formulation and uses only half adders for performing multi-bit binary addition. Theoretically the operation is parallel for those bits that do not need any carry chain propagation. Thus the new approach attains logarithmic performance without any special speed-up circuitry or look-ahead schema. The corresponding CMOS implementation of the design along with completion detection unit is also presented. The design is regular and does not have any practical limitations of fan-ins or fan-outs or complex interconnections. Thus it is more suitable for adoption in fast adder implementation in high-performance processors. The performance of the implementation is tested using SPICE circuit simulation tool by linear technology. Simulation results show its superiority over cascaded circuit adders. A constant time carry propagation is also achieved using the proposed implementation by tuning the CMOS parameters.

Claims

1. A parallel self-timed digital adder, without look-ahead, the adder being an electronic processor having the structure of a digital logic circuit for performing arithmetic operations, the adder comprising: two or more single bit adders, each single bit adder being a half-adder; and two multiplexers connected with each of the two or more half-adders, thereby forming two or more adder blocks, the multiplexers being operable to selectively provide inputs to the half-adders with which they are connected; wherein the adder as an initial condition generates the result of a single bit summation as the first operation, whereafter in subsequent operations the sum bit from each adder block is connected recursively with itself through one of the multiplexers for addition with the carry in from the previous adder block and wherein the multiplexers enable the timing of inputs provided to the half adders for said subsequent operations at each adder block to be tuned to avoid delays in carry propagation from the previous adder block; and further wherein each adder block transfers any carry generated thereby modifying its own recursive carry to zero, and the arithmetic operations terminate when the carry signals from all of the two or more adder blocks are zero, signaling completion of the arithmetic operations.

2. The adder in accordance with claim 1, wherein the initial condition for the arithmetic operations can be defined as
S.sub.i.sup.0=a.sub.ib.sub.i
C.sub.i.sup.0=a.sub.ib.sub.i where S.sub.i.sup.j and C.sub.i.sup.j are the Sum and Carry respectively for i.sup.th bit at the j.sup.th recursion.

3. The adder in accordance with claim 1, wherein the initial condition for the arithmetic operations can be defined as
S.sub.i.sup.0=a.sub.ib.sub.i
C.sub.i.sup.0=a.sub.ib.sub.i where S.sub.i.sup.j and C.sub.i.sup.j are the Sum and Carry respectively for i.sup.th bit at the j.sup.th recursion and wherein the j.sup.th iteration for the recursive addition can be found as
S.sub.i.sup.j=S.sub.i.sup.j1C.sub.i1.sup.j1
C.sub.i.sup.j=S.sub.i.sup.j1C.sub.i1.sup.j1.

4. The adder in accordance with claim 1, wherein the digital logic circuit is implemented into CMOS (Complementary Metal-Oxide-Semiconductor) design.

5. The adder in accordance with claim 1, wherein the connection between the multiplexers and half-adders is parallel for the number of bits that do not require carry propagation.

Description

BRIEF DESCRIPTION OF THE DRAWING/FIGURES

(1) FIG. 1 is a block diagram of the recursive parallel self-timed adder.

(2) FIG. 2 is an implementation of the recursive parallel self-timed adders using CMOS transistors.

(3) FIG. 3 is the simulated timing diagram from 8 bit PASTA implementation. (a) The worst-case carry propagation while adding operands (FF).sub.16 and (1).sub.16, (b) The best-case carry propagation while adding operands (F7).sub.16 and (0).sub.16, and (c) The average-case carry propagation while adding operands (F7).sub.16 and (11).sub.16. S8 is the carry out signal for the entire block. (d) Timing Diagram for addition of (FFFF FFFF).sub.16 and (1).sub.2 operands from 32 bit constant carry propagation PASTA implementation.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

(4) A first embodiment of the parallel self-timed adder is presented in FIG. 1. Multi-bit adders are often constructed from single bit adders using combinatorial and sequential circuits for asynchronous or synchronous design. The sequential circuits are often serial/chain adders and are not the match for high speed combinatorial adders.

(5) Let a.sub.n1a.sub.n2 . . . a.sub.0 and b.sub.n1b.sub.n2 . . . b.sub.0 be two n-bit binary numbers with sum and carry denoted by S.sub.n1S.sub.n2 . . . S.sub.0 and c.sub.nc.sub.n1 . . . c.sub.0 where 0.sup.th bit represents the least significant bit. Basic single bit adders are now discussed.

(6) Single Bit Adders:

(7) Single bit Half-Adders (HA) and Full-Adders (FA) are the fundamental building blocks for nearly all high-speed adders. A single bit HA for i.sup.th bit addition is logically formulated as follows:
S.sub.i=a.sub.ib.sub.i
c.sub.i+1=a.sub.ib.sub.i (1)

(8) According to delay model by A. Tyagi. A reduced-area scheme for carry-select adders. IEEE Trans. Comput., 42(10):1162-1170, October 1993; simple logic gates (AND, OR, NAND, NOR, NOT) have 1 unit of associated gate delay and XOR/XNOR have 2 units of gate delay. Thus, the gate level delays associated with S.sub.i and c.sub.i bits are 2 and 1 respectively. The gate level area complexity for HAs is hence 2+1=3.

(9) A single bit full adder implementation additionally takes consideration of the carry-in input from the preceding single bit unit and formulated as follows:
S.sub.i=a.sub.ib.sub.ic.sub.i
c.sub.i+1=a.sub.ib.sub.i+(a.sub.ib.sub.i)c.sub.i (2)
The gate level delay associated with S.sub.i, and c.sub.i, bits are 4. The gate level area complexity for FAs is 7.

(10) The recursive binary addition formula for addition of A and B is presented as follows.

(11) Recursive Formula for Binary Addition:

(12) Let S.sub.i.sup.j and C.sub.i.sup.j be the Sum and Carry respectively for i.sup.th bit at the j.sup.th recursion. The initial condition for the addition operation can now be defined as follows:
S.sub.i.sup.0=a.sub.ib.sub.i
C.sub.i.sup.0=a.sub.ib.sub.i (3)

(13) The j.sup.th iteration for the recursive addition can be found as follows:
S.sub.i.sup.j=S.sub.i.sup.j1C.sub.i1.sup.j1 (4)
C.sub.i.sup.j=S.sub.i.sup.j1C.sub.i1.sup.j1 (5)

(14) The recursion is terminated at the k.sup.th iteration when the following condition is met.
C.sub.n.sup.kC.sub.n1.sup.kC.sub.0.sup.k=0 (6)

(15) Using the formulae presented in equations (3)-(6), a fast adder will now be designed. At first the correctness of the recursive formulation will be proved inductively by the following observation and subsequent theorem.

(16) Observation 1: In a single bit adder with no carry in, the maximum obtainable result is 2.

(17) Explanation. It is obvious that the sum cannot exceed the maximum sum obtained by two highest possible operands and hence should be equal or less than 2.

(18) The significance of this observation is that for individual i.sup.th bit adder, the case of having S.sub.i=1 and C.sub.i=1 (decimal value of 3) is impossible as it will exceed the maximum of the sum of two inputs which is 2 (binary 10). Thus the only valid (S, C) forms by i.sup.th bit adder are (0, 0), (0, 1) and (1, 0).

(19) Theorem 1: The recursive formulation of (3), (4), (5) and (6) will produce correct sum for any number of bits and will terminate at finite time.

(20) Proof. We prove the correctness of the algorithm by induction on terminating condition.

(21) Basis: For operands A, B such that c.sub.i.sup.0=0 for i, i [0. . . n], the proposed recursive formulation produces correct results in parallel by single bit computation time and terminates instantly as condition (6) is met.

(22) Induction: Assume C.sub.i.sup.k0 for i. Let j be such a bit for which C.sub.j.sup.k=1. First we show that it will be killed in the (k+1).sup.th iteration and next we will show that it will be successfully transmitted to next higher bit in the (k+1).sup.th iteration.

(23) According to Observation 1, (S.sub.j.sup.k, C.sub.j.sup.k), (S.sub.j+1.sup.k, C.sub.j+1.sup.k) could be in any of (0, 0), (0, 1) or (1, 0) forms. As C.sub.j.sup.k=1, it implies that S.sub.j.sup.k=0. Hence, from equation (5), C.sub.j.sup.k+1=0 for any input condition between 0 to j1 bits.

(24) We now consider the next higher bit (S.sub.j+1.sup.k, C.sub.j+1.sup.k) at k.sup.th iteration. By observation 1, it could be in any of (0, 0), (0, 1) or (1, 0) forms. In the (k+1).sup.th iteration, the (0, 0) and (0, 1) forms from k.sup.th iteration will correctly produce output of (1, 0) following equation (4) and (5) and hence carry will be absorbed in (j+1).sup.th bit. For (1, 0) form, the carry is supposed to propagate through this bit level as the sum value is 1. By applying (4) and (5), we find C.sub.j+1.sup.K+1=1. Thus the carry propagation/killing will be correctly performed by j.sup.th bit adder.

(25) Finally, there is one extra bit adder block for carry out of the n-bit adder. This will have initial output (S,.sub.n.sup.0, C.sub.n.sup.0)=(0, 0). Any carry chain is hence bound to end up at this bit and produce output (1, 0), if it is not already killed by any previous bit levels during earlier iteration(s). Thus all the single bit adders will successfully kill or transfer the carries to the next level until being killed at the n.sup.th bit carry out block. This ensures that terminating condition is always reached by the recursive formulation. QED.

(26) The mathematical form presented above is valid under the condition that the iterations take place simultaneously and the signals will be available synchronously from the previous level. This implicates a clocked design. However, the complexity is supposed to rise for a clocked circuit. In the next section we present a pseudo-sequential feedback circuit for the implementation of the proposed recursive formulation.

(27) Architecture of PASTA

(28) The general architecture of the proposed recursive adder is presented in FIG 1.

(29) The selection bits for 2 input multiplexers will be a single 0 to 1 pulse (denoted SEL in the CMOS implementation diagram). It will initially select the actual inputs during SEL =0 and will switch on to feedback/carry paths for subsequent iterations (SEL=1). The feedback path from the HAs enable the recursion to continue until the terminating condition is met.

(30) CMOS Implementation of PASTA

(31) The CMOS implementation of the proposed embodiment is shown in FIG. 2.

(32) One particular practical issue is synchronization of the carry transitions during the recursion. The recursion will be implemented by a single pulse of the carry signal. However, this implies there will be switching transients for the rising and falling edge of the carry signals from one block to the next. To avoid the pitfall of switching twice (or multiple times) for the same signal the sum and carry outputs are separated by an extra multiplexer that tunes the delays associated in the feedback path and helps avoid the switching transients due to feedback path of Sum bits.

(33) The termination signal following equation (6) can be generated by the CMOS implementation as shown in FIG. 2(d). In addition to the and of C.sub.i's, the negative of SEL signal is used to ensure that the termination cannot be turned on accidentally during the initial selection phase of the actual inputs.

(34) Though, the TERM signal is the only block where as many as n +2 interconnections are needed it will not create any fan-in problem as all the connections are parallel except a single pull-up transistor.

(35) Performance Analysis

(36) It is evident from the architecture and implementation of FIG. 1 and FIG. 2 that with the exception of multiplexers, PASTA is simpler than most simple ripple carry adders as it uses HAs while RCAs use FAs. Consequently, it is practical and will not suffer from any delays from complex circuitry and interconnection of wires. The architecture is also highly regular for implementation on VLSI chips. Regularity is a key design objective for VLSI designers and hence it is addressed by many earlier researchers for efficient adder implementation as clarified by R. P. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Trans. Comput., 31(3):260-264, March 1982.

(37) Logic and time complexity of available adders along with PASTA are shown in Table 1. Though the theoretical limit of PASTA is similar to existing logarithmic algorithms, it achieves the same performance supporting a regular structure with constant fan-in and fan-out. Thus, it is better for the VLSI implementation than the prefix algorithms. Moreover, it is shown by SPICE simulations in the next section that it is possible to achieve constant time carry propagation by PASTA implementation. This phenomenon can be utilized for a future constant time parallel adder.

(38) TABLE-US-00001 TABLE 1 Gate Transistor Time Type Count Count Complexity Area RCA 7n 28n O(n) O(n) CLA 14n 48n 22 O(lg n) O(n) BKA 10n O(lg n) O(n lg n) Delay Insensitive 30n 42n O(lg n) O(n) RCA Delay Insensitive 39.5n 2 72.5n 2 O(lg lg n) O(n) CLA with Speed up Circuit PASTA 9n 34n O(lg n) O(n)
SPICE Simulation Results

(39) The specified CMOS circuit is simulated using Linear Technology SPICE version 4.04i. The 50 nm fourth generation Berkeley Short-channel IGFET Model (BSIM4) is used. Initially, the outcome of 8 bit adders are presented to show the practical realization of the prototype implementation. In practicality, the situation is complicated due to the length of the carry signal in effect for the next block. If the duration is not properly tuned or quite large this could feed carry to the next block for multiple transitions before eventually settling down to zero. This is similar to race condition as the final outcome is not predictable. Consequently, it is important to tune the MOS dimensions for proper synchronization.

(40) In FIG. 3, the timing diagrams for different signals are shown for the 8 bit PASTA implementation. Only carries (C.sub.0-C.sub.7) and carry out of the entire block (S.sub.8) are shown to maintain the clarity of the figure. Terminating signal (TERM) and custom character (SELECT) signal for MUXes are also displayed. As evident from the figure, the adder can perform addition in spite of the fact that there are effectively two transition periods (rise time and fall time) of the carry-in signals from the previous stages. The timing can be tuned so that the sum signal starts changing after the full rise of the carry input. This is also the purpose of the introduction of Multiplexers producing the sum bit.

(41) The worst-case, best-case and average case for maximum, minimum and average length carry propagation is highlighted in the timing diagrams of FIG. 3 (a), (b) and (c). The best-case addition avoids any unnecessary delay as it is generated without any propagation delay while the worst-case involves cascaded delays by each bit adders. The average case shows how the separate carry propagation is limited within their chain and thus proceeds in parallel. Especially noticeable is the fact that C.sub.2 and C.sub.6 are overlapping at near exact times as they are at the same distance from the start of the carry chain from bit 0 and bit 4 respectively.

(42) As the proposed approach is a very basic one without any lookahead scheme and further optimization, we compare the performances of similar chained schemes of RCA and Delay Insensitive RCA (DIRCA). The results are displayed in Table 2. For the average case we have used the expected carry length for n bit binary numbers as found by G. W. Reitwiesner, The Determination of Carry Propagation Length for Binary Addition, IRE Trans. On Electronic Computers, vol. EC-9, pp. 35-38, March 1960. The delay in case of PASTA is computed from the switching time (when SEL changes to 1 from 0) of the multiplexers.

(43) TABLE-US-00002 TABLE 2 Worst Case Best Case Avg. Case Delay (ns) Delay (ns) Delay (ns) RCA 2.67 N/A N/A DIRCA 1.81 1.81 1.81 PASTA 2.75 0.19 1.25

(44) It is to be noted that DIRCA architecture is not able to perform better in the best/average cases. This is due to the fact that dual rail signals are reset at the beginning of the computation and require propagation from previous completed stages to produce successful completion signal. We have used the completion signal as provided by Fu-C. Cheng, Practical Design and Performance Evaluation of Completion Detection Circuit, In proceedings of the Intl. Conf. on Comp. Design (ICCD), pp. 354-359, October 1998.

(45) The results clearly indicate the potential of the new PASTA as it performs best among the cascaded logic designs. It is due to the truly parallel theoretical basis of the design for independent carry chains.

(46) However, the biggest advantage that could be reaped out of the proposed design could possibly be a truly constant time parallel adder. It is found that the cascading delay for successive carry propagation can be totally avoided by tuning the MOS dimensions. The timing diagram for single carry propagation for a 32 bit adder circuit for operands A=(FFFF FFFF).sub.16 and B=(1).sub.16 is shown in FIG. 3(d). The completion signal (TERM) and the carry out bit (S.sub.32) do not require little more than a single bit delay.

(47) For clarity only a few carry signals are displayed (C.sub.0, C.sub.1, C.sub.15 and C.sub.31). From C.sub.2-C.sub.31 all carry signals follow nearly same timing. The addition thus merely takes 1.29 ns to complete for the worst case propagation condition in 32 bit adder.

(48) It has been a theory that the delay could be reduced to that of single gate delay by tuning MOS parameters for parallel connections. However, it was not possible with earlier adder designs to achieve constant time carry propagation which involves complex circuitry.

Parallel self-timed adder (PASTA)

Inventors

Cpc classification

Classification Explorer

G06F17/10

PHYSICS

Classification Explorer

G06F7/506

PHYSICS

International classification

Classification Explorer

G06F7/50

PHYSICS

Classification Explorer

G06F7/506

PHYSICS

Classification Explorer

G06F17/10

PHYSICS

Abstract

Claims

Description