OPTIMAL METASTABILITY-CONTAINING SORTING VIA PARALLEL PREFIX COMPUTATION

20210349687 · 2021-11-11

    Abstract

    In order to provide smaller, faster and less error-prone circuits for sorting possibly metastable inputs, a novel sorting circuit is provided. According to the invention, the circuit is metastability-containing.

    Claims

    1. A sorting circuit, characterized in that the circuit is metastability-containing.

    2. The circuit of claim 1, comprising one or more sub circuits for comparing each prefix pair of at least two input strings.

    3. The circuit of claim 2, further comprising a sub circuit for inferring the output bits, based on the result of the comparison.

    4. The circuit of claim 2, wherein the input strings are Gray coded.

    5. The circuit of claim 4, wherein the Gray code is a binary reflected Gray code.

    6. The circuit of claim 1, wherein the sorting circuit is a 2-sort circuit for sorting two input strings.

    7. The circuit of claim 1, wherein the sorting circuit is a sorting network for sorting n strings.

    8. The circuit of claim 1, wherein a size of the sorting circuit is within the order of the size of the input strings O(B).

    9. The circuit of claim 1, wherein buffers are used to bound the fan-out.

    10. The circuit of claim 9, wherein the number of buffers used is twice the length (B) of an input string.

    11. The circuit of claim 9, wherein a fan-out of the sorting circuit is constant.

    12. The circuit of claim 1, wherein a depth of the sorting circuit is within the order of ┌log B┐, wherein B is the number of bits in an input string.

    13. A transistor-level implementation of the circuit of claim 1.

    Description

    [0011] FIG. 1 shows standard transistor-level implementations of inverter (left), NAND (center), and NOR (right) gates in CMOS technology. The latter can be turned into AND and OR, respectively, by appending an inverter.

    [0012] FIG. 2 shows a finite state machine determining which of two Gray code inputs g, h∈𝔹.sup.B is larger. In each step, the machine receives g.sub.ih.sub.i as input. State encoding is given in square brackets.

    [0013] FIG. 3 shows an example for a computation of the 2-sort(9) circuit arising from the inventive construction for fan-out f=3. The inputs are g=101010110 and h=101M10000; see Table 10 for s.sub.M.sup.(i)(g, h) and the output. We labeled each ⋄.sub.M by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Gray boxes indicate recursive steps of the PPC construction; see also FIG. 7 for a larger PPC circuit using the one here in its “right” top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted.

    [0014] FIG. 4 shows the recursion tree T4 (center). Right nodes are depicted black, left nodes gray and leaves are depicted white. The recursive patterns applied at left and right nodes are shown on the left and right, respectively. At the root and its left child, we have that B̂=B/2; for other nodes, B̂ gets halved for each step further down the tree (where the leaves simply wire their single input to their single output). The left pattern comes in different variants. The basic construction does not incorporate the gray buffers; these will be needed in Section 5.2 to reduce fan-out. The gray wire with index B̂+1 is present only if B is odd; this never occurs in PPC(C,T.sub.b), but becomes relevant when initially applying the left pattern exclusively for k∈ℕ steps (see Theorem 5.8), reducing the size of the resulting circuit at the expense of increasing its depth by k.

    [0015] FIG. 5 shows a comparison of the balanced recursion from [19] and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas “upper bound” refers to the bound from Corollary 5.7; the fan-out 3 curves show that the unbalanced strategy also performs better for the construction from Theorem 5.18 (for f=3 and k=0) that we derive next.

    [0016] FIG. 6 shows construction of PPC(C,T.sub.4)′. On the left, we see the recursion tree, with the aggregation trees separated and shown at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also necessitate duplicating some gates in the aggregation trees.

    [0017] FIG. 7 shows PPC(.sup.3)(C,T.sub.4). Right recursion steps R.sub.r are marked with dark gray, left recursion steps with light gray. The steps at the root (above) and aggregation trees (below) are not marked explicitly. Duplicated gates are depicted in a layered fashion. Dashed lines indicate that a wire is not participating in a recursive step.

    [0018] FIG. 8 shows a dependence of the size of the modified construction on f. For comparison, the upper bound from Corollary 5.7 on the circuit with unbounded fan-out is shown as well.

    [0019] FIG. 9 shows an XMUX circuit according to an embodiment of the invention, used to implement ⋄.sub.M and out.sub.M.

    [0020] FIG. 10 shows constructing 2-sort(B) from out.sub.M and PPCM(B−1).

    [0021] FIG. 11 shows an excerpt from a simulation for 4-bit inputs, where X=M. The rows show (from top to bottom) the inputs g and h, both outputs of the simple non-containing circuit, and both outputs of our design. Inputs g and h are randomly generated valid strings. Columns 1 and 3 show that the simpler design fails to implement a 2-sort(4) circuit.

    [0022] FIG. 12 shows a comparison of the inventive solution PPC Sort to a standard non-containing one. For the latter, the unexpected delay reduction at B=16 is the result of automatic optimization with more powerful gates, which the inventive solution does not use.

    DETAILED DESCRIPTION

    [0023] We set [N]:={0, . . . , N−1} for N∈ℕ and [i, j]:={i, i+1, . . . , j} for i, j∈ℕ, i≤j. We denote 𝔹:={0,1} and 𝔹.sub.M:={0,1,M}. For a B-bit string g∈𝔹.sub.M.sup.B and i∈[1,B], denote by g.sub.i its i-th bit, i.e., g=g.sub.1g.sub.2 . . . g.sub.B. We use the shorthand g.sub.i,j:=g.sub.i . . . g.sub.j, where i, j∈[1,B] and i≤j. Let par(g) denote the parity of g∈𝔹.sup.B, i.e., par(g)=Σ.sub.i=1.sup.Bg.sub.i mod 2. For a function f and a set A, we abbreviate f(A):={f(y)|y∈A}.

    [0024] A standard binary representation of inputs is unsuitable: uncertainty of the input values may be arbitrarily amplified by the encoding. E.g., representing a value unknown to be 11 or 12, which are encoded as 1011 resp. 1100, would result in the bit string 1MMM, i.e., a string that is metastable in every position in which the two strings differ. However, 1MMM may represent any number in the interval from 8 to 15, amplifying the initial uncertainty of being in the interval from 11 to 12. An encoding that does not lose precision for consecutive values is Gray code.

    [0025] A B-bit binary reflected Gray code, rg.sub.B:[N]→𝔹.sup.B, is defined recursively. For simplicity (and without loss of generality) we set N:=2.sup.B. A 1-bit code is given by rg.sub.1(0)=0 and rg.sub.1(1)=1. For B>1, we start with the first bit fixed to 0, counting with rg.sub.B−1(⋅) for the first 2.sup.B−1 codewords, then toggle the first bit to 1, and finally “count down” rg.sub.B−1(⋅) while keeping the first bit fixed at 1, cf. Table 1. Formally, this yields for x∈[N]

    [00001] rg_B(x) := 0 rg_{B−1}(x) if x∈[2^{B−1}], and 1 rg_{B−1}(2^B−1−x) if x∈[2^B]\[2^{B−1}].
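For illustration, the recursive definition of rg.sub.B can be sketched in software as follows; modelling codewords as bit strings and the function name `rg` are our choices, not part of the specification.

```python
# Sketch of the B-bit binary reflected Gray code rg_B (recursive definition).
def rg(B: int, x: int) -> str:
    """Return the B-bit binary reflected Gray code word of x in [0, 2^B)."""
    if B == 1:
        return str(x)
    half = 2 ** (B - 1)
    if x < half:
        return "0" + rg(B - 1, x)              # first half: prefix 0, count up
    return "1" + rg(B - 1, 2 ** B - 1 - x)     # second half: prefix 1, count down
```

Consecutive codewords differ in exactly one bit, which is the property the construction relies on.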

    [0026] As each B-bit string is a codeword, the code is a bijection and the encoding function also defines the decoding function. Denote by ⟨⋅⟩:𝔹.sup.B→[N] the decoding function of a Gray code string, i.e., for x∈[N], ⟨rg.sub.B(x)⟩=x.
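The decoding function ⟨⋅⟩ can be sketched by inverting the recursion: a leading 0 keeps the order of the tail, a leading 1 reflects it. The helper names `rg` and `degray` are ours.

```python
# Sketch of the encoding rg_B and its inverse, the decoding function <.>.
def rg(B, x):
    if B == 1:
        return str(x)
    h = 2 ** (B - 1)
    return "0" + rg(B - 1, x) if x < h else "1" + rg(B - 1, 2 ** B - 1 - x)

def degray(g: str) -> int:
    """Invert rg: a leading 0 keeps the order, a leading 1 reflects it."""
    if not g:
        return 0
    rest = degray(g[1:])
    return rest if g[0] == "0" else 2 ** len(g) - 1 - rest
```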

    [0027] For two binary reflected Gray code strings g, h∈𝔹.sup.B, we define their maximum and minimum as

    [00002] (max^rg{g, h}, min^rg{g, h}) := (g, h) if ⟨g⟩≥⟨h⟩, and (h, g) if ⟨g⟩<⟨h⟩.

    [0028] For example:


    max.sup.rg{0011,0100}=max.sup.rg{rg.sub.B(2),rg.sub.B(7)}=0100,


    min.sup.rg{0111,0101}=min.sup.rg{rg.sub.B(5),rg.sub.B(6)}=0111.

    [0029] Inputs to the sorting circuit may have some metastable bits, which means that the respective signals behave out-of-spec from the perspective of Boolean logic. However, they are valid strings in the sense of the invention. Valid strings have at most one metastable bit. If this bit resolves to either 0 or 1, the resulting string encodes either x or x+1 for some x, cf. Table 2.

    [0030] More formally, if B∈ℕ and N=2.sup.B, the set of valid strings of length B is defined as

    [00003] S_rg^B := rg_B([N]) ∪ ⋃_{x∈[N−1]} {rg_B(x) * rg_B(x+1)}.

    [0031] The operator * is called the superposition and is defined as

    [00004] ∀i∈{1, . . . , B}: (x*y)_i := x_i if x_i=y_i, and M otherwise.
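The superposition operator and the resulting set of valid strings can be sketched directly from these definitions; the names `rg`, `superpose`, and `valid_strings` are illustrative.

```python
# Sketch: superposition * and the valid-string set S_rg^B.
def rg(B, x):
    if B == 1:
        return str(x)
    h = 2 ** (B - 1)
    return "0" + rg(B - 1, x) if x < h else "1" + rg(B - 1, 2 ** B - 1 - x)

def superpose(x, y):
    """Bitwise *: keep agreeing bits, mark disagreements metastable (M)."""
    return "".join(a if a == b else "M" for a, b in zip(x, y))

def valid_strings(B):
    """All codewords plus superpositions of consecutive codewords."""
    codes = [rg(B, x) for x in range(2 ** B)]
    return set(codes) | {superpose(codes[x], codes[x + 1])
                         for x in range(2 ** B - 1)}
```

Since consecutive codewords differ in one bit, a valid string has at most one metastable bit, matching the text above.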

    [0032] The specification of max.sup.rg and min.sup.rg may be extended to valid strings in the above sense by taking all possible resolutions of metastable bits into account. More particularly, in order to extend the specification of max.sup.rg and min.sup.rg to valid strings, the metastable closure (Stephan Friedrichs, Matthias Függer, and Christoph Lenzen. Metastability-Containing Circuits. Transactions on Computers, 67, 2018) is used. The metastable closure of an operator on binary inputs extends it to inputs that may contain metastable bits, by considering all possible stable resolutions of the inputs, applying the operator and taking the superposition of the results.
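The metastable closure can be modelled by brute force over all stable resolutions; this is a behavioral sketch of the definition, not a circuit implementation, and all helper names are ours.

```python
from itertools import product

# Sketch of the metastable closure: resolve all M bits, apply the stable
# operator, and superpose the results.
def res(s):
    """All stable resolutions of a possibly-metastable string."""
    return ["".join(c) for c in product(*("01" if b == "M" else b for b in s))]

def superpose_all(ws):
    """Bitwise superposition of a list of equal-length strings."""
    return "".join(w[0] if len(set(w)) == 1 else "M" for w in zip(*ws))

def closure(op):
    """Lift an operator on stable strings to metastable inputs."""
    def op_m(*args):
        outs = [op(*c) for c in product(*(res(a) for a in args))]
        return superpose_all(outs)
    return op_m
```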

    [0033] The closure is the best one can achieve w.r.t. containing metastability with clocked logic using standard registers, i.e., when f.sub.M(x).sub.i=M, no such implementation can guarantee that the i.sup.th output stabilizes in a timely fashion.

    [0034] If one wants to construct a circuit computing the maximum and minimum of two valid strings, allowing one to build sorting networks for valid strings, one also needs to answer the question of what it means to ask for the maximum or minimum of valid strings. To this end, suppose a valid string is rg.sub.B(x)*rg.sub.B(x+1) for some x∈[N−1], i.e., the string contains a metastable bit that makes it uncertain whether the represented value is x or x+1. If one waits for metastability to resolve, the string will stabilize to either rg.sub.B(x) or rg.sub.B(x+1). Accordingly, it makes sense to consider rg.sub.B(x)*rg.sub.B(x+1) to lie “in between” rg.sub.B(x) and rg.sub.B(x+1), resulting in the following total order on valid strings (cf. Table 2).

    [0035] Definition (<). A total order < is defined on valid strings as follows. For g, h∈𝔹.sup.B, g<h⇔⟨g⟩<⟨h⟩. For each x∈[N−1], we define rg.sub.B(x)<rg.sub.B(x)*rg.sub.B(x+1)<rg.sub.B(x+1). We extend the resulting relation on S.sub.rg.sup.B×S.sub.rg.sup.B to a total order by taking the transitive closure. Note that this also defines ≤, via g≤h⇔(g=h∨g<h).
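This total order can be modelled by mapping each valid string to an integer key: stable codewords rg.sub.B(x) to 2x, and the superposition of rg.sub.B(x) and rg.sub.B(x+1) to 2x+1, strictly in between. The key encoding and all names are our illustrative choices.

```python
# Sketch of the total order < on valid strings via integer keys.
def degray(g):
    if not g:
        return 0
    r = degray(g[1:])
    return r if g[0] == "0" else 2 ** len(g) - 1 - r

def order_key(s):
    """Stable string -> 2<x>; superposition of x and x+1 -> 2x+1."""
    if "M" not in s:
        return 2 * degray(s)
    lo = min(degray(s.replace("M", "0")), degray(s.replace("M", "1")))
    return 2 * lo + 1  # strictly between the two resolutions
```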

    [0036] We intend to sort with respect to this order. It turns out that implementing a 2-sort circuit w.r.t. this order amounts to implementing the metastable closure of max.sup.rg and min.sup.rg. In other words, max.sub.M.sup.rg and min.sub.M.sup.rg are the max and min operators w.r.t. the total order on valid strings shown in Table 2, e.g.,


    max.sub.M.sup.rg{1001,1000}=rg.sub.4(15)=1000,


    max.sub.M.sup.rg{0M10,0010}=rg.sub.4(3)*rg.sub.4(4)=0M10, and


    max.sub.M.sup.rg{0M10,0110}=rg.sub.4(4)=0110.

    [0037] Hence, our task is to implement max.sub.M.sup.rg and min.sub.M.sup.rg.
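A brute-force sketch of max.sub.M.sup.rg and min.sub.M.sup.rg as the metastable closure of (max.sup.rg, min.sup.rg) reproduces the examples above; this is a specification model (names are ours), not the circuit the invention constructs.

```python
from itertools import product

# Sketch: max_M^rg / min_M^rg as the closure of (max^rg, min^rg),
# evaluated by enumerating all stable resolutions.
def degray(g):
    if not g:
        return 0
    r = degray(g[1:])
    return r if g[0] == "0" else 2 ** len(g) - 1 - r

def res(s):
    return ["".join(c) for c in product(*("01" if b == "M" else b for b in s))]

def superpose_all(ws):
    return "".join(w[0] if len(set(w)) == 1 else "M" for w in zip(*ws))

def maxmin_m(g, h):
    outs = [(a, b) if degray(a) >= degray(b) else (b, a)
            for a, b in product(res(g), res(h))]
    return (superpose_all([o[0] for o in outs]),
            superpose_all([o[1] for o in outs]))
```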

    [0038] Definition (2-sort(B)). For B∈ℕ, a 2-sort(B) circuit is specified as follows.

    [0039] Input: g, h∈S.sub.rg.sup.B

    [0040] Output: g′, h′∈S.sub.rg.sup.B

    [0041] Functionality: g′=max.sub.M.sup.rg{g, h}, h′=min.sub.M.sup.rg{g, h}.

    [0042] FIG. 1 shows standard transistor-level implementations of inverter (left), NAND (center), and NOR (right) gates in CMOS technology. The latter can be turned into AND and OR, respectively, by appending an inverter.

    [0043] The invention seeks to use standard components and combinational logic only. In particular, the behavior of basic gates on metastable inputs may be specified via the metastable closure of their behavior on binary inputs, cf. Table 3, using the standard notational convention that a+b=OR.sub.M(a, b) and ab=AND.sub.M(a, b). In this logic, most familiar identities hold: AND and OR are associative, commutative, and distributive, and DeMorgan's laws hold. However, the law of the excluded middle naturally becomes void: in general, OR.sub.M(x, NOT.sub.M(x))≠1, as for x=M we get OR.sub.M(M, M)=M.

    [0044] It can be shown that the basic CMOS gates shown in FIG. 1 behave according to this logic, i.e. that they implement the truth tables given in Table 3, thereby justifying the model.
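The closure semantics of the basic gates coincide with Kleene-style three-valued truth tables, which the following sketch writes out explicitly (function names are ours); the assertions below check DeMorgan's law, associativity, and the failure of the excluded middle mentioned above.

```python
# Sketch of the three-valued gate behavior (closure of NOT, AND, OR).
def not_m(a):
    return {"0": "1", "1": "0", "M": "M"}[a]

def and_m(a, b):
    if a == "0" or b == "0":
        return "0"                      # a single 0 input forces the output
    return "1" if a == b == "1" else "M"

def or_m(a, b):
    if a == "1" or b == "1":
        return "1"                      # a single 1 input forces the output
    return "0" if a == b == "0" else "M"
```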

    [0045] FIG. 2 shows a finite state machine determining which of two Gray code inputs g, h∈𝔹.sup.B is larger. In each step, the machine receives g.sub.ih.sub.i as input. State encoding is given in square brackets.

    [0046] More particularly, FIG. 2 depicts a finite state machine performing a four-valued comparison of two Gray code strings. In each step of processing inputs g, h∈𝔹.sup.B, it is fed the pair of i-th input bits g.sub.ih.sub.i. In the following, we denote by s.sup.(i)(g, h) the state of the machine after i steps, where s.sup.(0)(g,h):=00 is the starting state. For ease of notation, we will omit the arguments g and h of s.sup.(i) whenever they are clear from context.

    [0047] Because the parity keeps track of whether the remaining bits are compared w.r.t. the standard or “reflected” order, the state machine performs the comparison correctly w.r.t. the meaning of the states indicated in FIG. 2.
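The comparison the state machine performs can be sketched as a scan: the parity of the common prefix decides whether the first differing bit is read in normal or reflected order. This models the comparison only, not the state encoding of FIG. 2; names are ours.

```python
# Sketch of the Gray code comparison underlying the FSM of FIG. 2.
def rg(B, x):
    if B == 1:
        return str(x)
    h = 2 ** (B - 1)
    return "0" + rg(B - 1, x) if x < h else "1" + rg(B - 1, 2 ** B - 1 - x)

def gray_cmp(g, h):
    """Return -1, 0, or 1 as <g> is smaller than, equal to, or larger than <h>."""
    parity = 0
    for gi, hi in zip(g, h):
        if gi == hi:
            parity ^= int(gi)          # parity of the equal prefix bits
            continue
        g_wins = (gi == "1") != bool(parity)   # odd parity flips the rule
        return 1 if g_wins else -1
    return 0
```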

    [0048] For all i∈[1,B], we have that

    [00005] max^rg{g, h}_i min^rg{g, h}_i = out(⋄_{j=1}^{i−1} g_jh_j, g_ih_i).

    [0049] In order to extend this approach to potentially metastable inputs, all involved operators are replaced by their metastable closure: for i∈[1, B], (i) compute s.sub.M.sup.(i), (ii) determine max.sup.rg{g, h}.sub.i and min.sup.rg{g, h}.sub.i according to Table 4, and finally (iii) exploit associativity of the operator computing the state s.sup.(i) in the PPC framework. Thus, we only need to implement ⋄.sub.M and out.sub.M (both of constant size), plug them into the framework, and immediately obtain an efficient circuit.

    [0050] The reader may raise the question why we compute s.sub.M.sup.(i) for all i∈[0,B−1] instead of computing only s.sub.M.sup.(B) with a simple tree of ⋄.sub.M elements, which would yield a smaller circuit. Since s.sub.M.sup.(B) is the result of the comparison of the entire strings, it could be used to compute all outputs, i.e., we could compute the output by out.sub.M(s.sub.M.sup.(B), g.sub.ih.sub.i) instead of out.sub.M(s.sub.M.sup.(i−1), g.sub.ih.sub.i). However, in case of metastability, this may lead to incorrect results: e.g., for g=0M1 and h=001, we have that s.sub.M.sup.(3)=00*01=0M and out.sub.M(0M, g.sub.2h.sub.2)=MM, yet min.sub.M.sup.rg{g, h}.sub.2=0 (see Tables 6 and 7).
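The effect can be checked numerically for g=0M1, h=001: the closure of min.sup.rg has a stable 0 in bit 2, while a circuit that compares the whole strings once and then muxes every bit on the single (possibly metastable) select signal outputs M there. The mux-based baseline and all names are our illustrative model.

```python
from itertools import product

# Sketch: specified closure output vs. a naive whole-string-compare-then-mux
# circuit on the example g = 0M1, h = 001.
def degray(g):
    if not g:
        return 0
    r = degray(g[1:])
    return r if g[0] == "0" else 2 ** len(g) - 1 - r

def res(s):
    return ["".join(c) for c in product(*("01" if b == "M" else b for b in s))]

def superpose_all(ws):
    return "".join(w[0] if len(set(w)) == 1 else "M" for w in zip(*ws))

g, h = "0M1", "001"
# Specified output: closure of min^rg over all stable resolutions.
min_m = superpose_all([min(a, b, key=degray) for a, b in product(res(g), res(h))])

# Naive circuit: one whole-string comparison yields the select signal...
sel = superpose_all(["1" if degray(a) > degray(b) else "0"
                     for a, b in product(res(g), res(h))])

def mux_m(s, a, b):
    """Closure of 'b if s else a' on single bits."""
    outs = {bb if ss == "1" else aa
            for ss, aa, bb in product(res(s), res(a), res(b))}
    return outs.pop() if len(outs) == 1 else "M"

# ...and every output bit is muxed on that one select.
naive_min = "".join(mux_m(sel, gi, hi) for gi, hi in zip(g, h))
```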

    [0051] While it is not obvious that this approach yields correct outputs, it may be formally proven that: (P1) ⋄.sub.M is associative; (P2) repeated application of ⋄.sub.M computes s.sub.M.sup.(i); (P3) applying out.sub.M to s.sub.M.sup.(i−1) and g.sub.ih.sub.i results, for all valid strings, in max.sub.M.sup.rg{g, h}.sub.i min.sub.M.sup.rg{g, h}.sub.i. This yields the desired correctness. Regarding the first point, we note that the statement that ⋄.sub.M is associative does not depend on B. In other words, it can be verified by checking for all possible x, y, z∈𝔹.sub.M.sup.2 whether (x⋄.sub.My)⋄.sub.Mz=x⋄.sub.M(y⋄.sub.Mz).

    [0052] While it is tractable to manually verify all 3.sup.6=729 cases (exploiting various symmetries and other properties of the operator), doing so is tedious and prone to errors. Instead, it was verified by a short computer program that both evaluation orders result in the same outcome, proving the desired associativity of the operator.

    [0053] For the convenience of the reader, Table 6 gives the truth table of ⋄.sub.M. It can be shown that repeated application of this operator to the input pairs g.sub.jh.sub.j, j∈[1, i], actually results in s.sub.M.sup.(i). This is closely related to the elegant recursive structure of binary reflected Gray code, leading to the important observation that if a valid string has a metastable bit at position m, then the B−m following bits are the maximum codeword of a (B−m)-bit code.

    [0054] It may be observed that for g∈S.sub.rg.sup.B, if there is an index 1≤m<B such that g.sub.m=M then g.sub.m+1,B=10.sup.B−m−1.
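This structural observation can be checked exhaustively by generating all valid strings with a metastable bit and inspecting the suffix after the M; the sketch below (names are ours) does so for B=6.

```python
# Exhaustive check: in a valid string, the bits after a metastable bit at
# position m (1-indexed, m < B) form 1 followed by zeros.
def rg(B, x):
    if B == 1:
        return str(x)
    h = 2 ** (B - 1)
    return "0" + rg(B - 1, x) if x < h else "1" + rg(B - 1, 2 ** B - 1 - x)

def superpose(x, y):
    return "".join(a if a == b else "M" for a, b in zip(x, y))

def suffix_ok(B):
    codes = [rg(B, x) for x in range(2 ** B)]
    for x in range(2 ** B - 1):
        v = superpose(codes[x], codes[x + 1])
        m = v.index("M")            # exactly one differing bit
        if m < B - 1 and v[m + 1:] != "1" + "0" * (B - m - 2):
            return False
    return True
```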

    [0055] The reasoning is based on distinguishing two main cases: one is that s.sub.M.sup.(i) contains at most one metastable bit, the other that s.sub.M.sup.(i)=MM. Each of these cases can be proven by technical statements.

    [0056] It may further be observed that if |res(s.sub.M.sup.(i))|≤2 for any i∈[B+1], then res(s.sub.M.sup.(i))=⋄.sub.j=1.sup.i res(g.sub.jh.sub.j).

    [0057] The operator out: 𝔹.sup.2×𝔹.sup.2→𝔹.sup.2 is the operator given in Table 4, computing max.sup.rg{g, h}.sub.i min.sup.rg{g, h}.sub.i out of s.sup.(i−1) and g.sub.ih.sub.i. For the convenience of the reader, we provide the truth table of out.sub.M in Table 7.

    [0058] FIG. 3 shows an example for a computation of the 2-sort(9) circuit arising from the inventive construction for fan-out f=3. The inputs are g=101010110 and h=101M10000; see Table 10 for s.sub.M.sup.(i)(g, h) and the output. More particularly, Table 10 shows an example run of the FSM in FIG. 2 on inputs g=101010110 and h=101M10000. We drop s.sub.M.sup.(9), as it is not needed to compute g′.sub.9h′.sub.9. We labeled each ⋄.sub.M by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Gray boxes indicate recursive steps of the PPC construction; see also FIG. 7 for a larger PPC circuit using the one here in its “right” top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted.

    [0059] In order to derive a small circuit from the above, we make use of the PPC framework by Ladner and Fischer. They described a generic method that is applicable to any finite state machine translating a sequence of B input symbols to B output symbols, to obtain circuits of size O(B) and depth O(log B). They reduce the problem to a parallel prefix computation (PPC) task by observing that each input symbol defines a restricted transition function, whose compositions evaluated on the starting state yield the state of the machine after the corresponding number of steps. This matches our needs, as we need to determine s.sub.M.sup.(i) for each i∈[B]. However, their generic construction involves large constants. Fortunately, we have established that ⋄.sub.M:𝔹.sub.M.sup.2×𝔹.sub.M.sup.2→𝔹.sub.M.sup.2 is an associative operator, permitting us to directly apply the circuit templates for associative operators they provide for computing s.sub.M.sup.(i)=g.sub.1h.sub.1⋄.sub.M . . . ⋄.sub.M g.sub.ih.sub.i for all i∈[B]. Accordingly, only these templates are discussed.

    [0060] We revisit the part of the framework relevant to our construction, also providing a minor improvement on their results in the process. To this end, we first formally specify the PPC task for the special case of associative operators.

    [0061] Definition 5.1 (PPC.sub.⊕(B)). For an associative operator ⊕: D×D→D and B∈ℕ, a PPC.sub.⊕(B) circuit is specified as follows.

    [0062] Input: d∈D.sup.B,

    [0063] Output: π∈D.sup.B,

    [0064] Functionality: π.sub.i=⊕.sub.j=1.sup.id.sub.j for all i∈[1, B].
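The PPC specification, realized via the left recursive pattern of FIG. 4 (pair up neighbours, recurse on the half-length instance, then fill in the remaining prefixes), can be sketched as follows; the function name `ppc` is ours.

```python
# Sketch of PPC via the left recursive pattern: pair adjacent inputs,
# compute prefixes of the pairs recursively, then expand.
def ppc(op, d):
    if len(d) == 1:
        return list(d)
    pairs = [op(d[i], d[i + 1]) for i in range(0, len(d) - 1, 2)]
    p = ppc(op, pairs)                 # prefixes of the paired inputs
    out = [d[0]]
    for k in range(1, len(d)):
        # even positions come from p; odd positions combine once more
        out.append(p[k // 2] if k % 2 else op(p[k // 2 - 1], d[k]))
    return out
```

This uses B−1 applications of the operator at the top level plus the recursive instance, matching the size bound discussed below.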

    [0065] In our case, ⊕=⋄.sub.M and D=𝔹.sub.M.sup.2. The method by Ladner et al. provides a family of recursive constructions of PPC.sub.⊕ circuits. They are obtained by combining two different recursive patterns.

    [0066] FIG. 4 shows the recursion tree T4 (center). Right nodes are depicted black, left nodes gray and leaves are depicted white. The recursive patterns applied at left and right nodes are shown on the left and right, respectively. At the root and its left child, we have that B̂=B/2; for other nodes, B̂ gets halved for each step further down the tree (where the leaves simply wire their single input to their single output). The left pattern comes in different variants. The basic construction does not incorporate the gray buffers; these will be needed to reduce fan-out. The gray wire with index B̂+1 is present only if B is odd; this never occurs in PPC(C,T.sub.b), but becomes relevant when initially applying the left pattern exclusively for k∈ℕ steps, reducing the size of the resulting circuit at the expense of increasing its depth by k.

    [0067] More particularly, suppose that C and P are circuits implementing ⊕ and PPC.sub.⊕(┌B/2┐) for some B∈ℕ, respectively. Then applying the recursive pattern given at the left of FIG. 4 (i) with B̂:=B and without the rightmost gray wire if B is even, and (ii) with B̂:=B−1 if B is odd, yields a PPC.sub.⊕(B) circuit. It has depth 2d(C)+d(P) and size at most (B−1)|C|+|P|. Moreover, the last output is at depth at most d(C)+d(P) of the circuit.

    [0068] The second recursive pattern, shown in FIG. 4c, avoids increasing the depth of the circuit beyond the necessary d(C) for each level of recursion. Assume for now that B is a power of 2. We represent the recursion as a tree T.sub.b, where b:=log B, given in the center of FIG. 4. It has depth b with all leaves at this depth, and there are two types of non-leaf nodes: right nodes (filled in black) have two children, a left and a right node, whereas left nodes (filled in gray) have a single child, which is a right node. T.sub.b is essentially a Fibonacci tree in disguise.

    [0069] Definition. T.sub.0 is a single leaf. T.sub.1 consists of the (right) root and two attached leaves. For b≥2, T.sub.b can be constructed from T.sub.b−1 and T.sub.b−2 by taking a (right) root r, attaching the root of T.sub.b−1 as its right child, a new left node l as the left child of r, and then attaching the root of T.sub.b−2 as the (only) child of l.

    [0070] The recursive construction is now defined as follows. A right node applies the pattern given in FIG. 4 to the right, where R.sub.l is the circuit (recursively) defined by the subtree rooted at the left child, R.sub.r is the circuit (recursively) defined by the subtree rooted at the right child, and B̂=2.sup.b−d−1, where d∈[b] is the depth of the node. A left node applies the pattern on the left, where the recursively used circuit R.sub.c is defined by the subtree rooted at its child and B̂=2.sup.b−d, where d∈[b] is the depth of the node. The base case for a single input and output is simply a wire connecting the input to the output, for both patterns. As b=log B and each recursive step cuts the number of inputs and outputs in half, the base case applies if and only if the node is a leaf. Note that the figure shows the recursive patterns at the root and its left child, where B̂=2.sup.b−1 is always even (i.e., in this recursive pattern, the gray wire with index B̂+1 is never present); when applying the patterns to nodes further down the tree, B and B̂ are scaled down by a factor of 2 for every step towards the leaves.

    [0071] In the following, denote by PPC(C,T.sub.b) the circuit that results from applying the recursive construction described above to the base circuit C implementing ⊕. Moreover, we refer to the i.sup.th input and output of the subcircuit corresponding to node ν∈T.sub.b as d.sub.i.sup.ν and π.sub.i.sup.ν, respectively.

    [0072] It may be shown that if C implements ⊕, then PPC(C,T.sub.b) is a PPC.sub.⊕(2.sup.b) circuit and PPC(C,T.sub.b) has depth b·d(C).
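The claim can be sketched in software by applying the two patterns along T.sub.b and tracking the depth at which each output is produced (with d(C)=1). The tree shape follows the Definition above; the node representation and all names are our modelling choices.

```python
# Sketch of PPC(C, T_b): right nodes split and combine, left nodes pair up.
# Values carry (prefix, depth) so the b*d(C) depth claim can be checked.
def tree(b):
    if b == 0:
        return "leaf"
    if b == 1:
        return ("R", "leaf", "leaf")
    return ("R", ("L", tree(b - 2)), tree(b - 1))

def op(a, b):
    """Combine prefix values and record one gate delay."""
    return (a[0] + b[0], max(a[1], b[1]) + 1)

def apply_node(node, d):
    if node == "leaf":
        return list(d)                       # a single wire
    if node[0] == "R":                       # right pattern: halve, combine
        m = len(d) // 2
        left = apply_node(node[1], d[:m])
        right = apply_node(node[2], d[m:])
        return left + [op(left[-1], r) for r in right]
    pairs = [op(d[i], d[i + 1]) for i in range(0, len(d), 2)]  # left pattern
    p = apply_node(node[1], pairs)
    out = [d[0]]
    for k in range(1, len(d)):
        out.append(p[k // 2] if k % 2 else op(p[k // 2 - 1], d[k]))
    return out
```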

    [0073] It remains to bound the size of the circuit. Denote by F.sub.i, i∈ℕ, the i-th Fibonacci number, i.e., F.sub.1=F.sub.2=1 and F.sub.i+1=F.sub.i+F.sub.i−1 for all i≥2. Then it may be shown that PPC(C,T.sub.b) has size (2.sup.b+2−F.sub.b+5+1)|C|.

    [0074] Asymptotically, the subtractive term F.sub.b+5 is negligible, as F.sub.b+5∈(1/√5+o(1))·((1+√5)/2).sup.b+5⊆O(1.62.sup.b); however, unless B is large, the difference is substantial. We also get a simple upper bound for arbitrary values of B. To this end, we “split” in the recursion such that the left branch is “complete,” while applying the same splitting strategy on the right. This is where our construction differs from and improves on the method of Ladner et al., who perform a balanced split and obtain an upper bound of 4B|C| on the circuit size.

    [0075] It follows that for B∈ℕ and a circuit C implementing ⊕, setting b:=┌log B┐, a PPC.sub.⊕(B) circuit of depth ┌log B┐d(C) and size smaller than (5B−2.sup.b−F.sub.b+3)|C|≤(4B−F.sub.b+3)|C| exists.
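As an arithmetic sanity check (with |C| counted as 1, names ours): with b=┌log B┐ the unbalanced bound 5B−2.sup.b−F.sub.b+3 stays strictly below the balanced 4B bound of Ladner et al., since 2.sup.b≥B.

```python
import math

# Compare the unbalanced-split size bound with the balanced 4B bound.
def fib(n):
    """Fibonacci numbers with F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def unbalanced_bound(B):
    b = math.ceil(math.log2(B))
    return 5 * B - 2 ** b - fib(b + 3)
```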

    [0076] We remark that one can give more precise bounds by making case distinctions regarding the right recursion, which for the sake of brevity we omit here. Instead, we computed the exact numbers for B≤70.

    [0077] FIG. 5 shows a comparison of the balanced recursion from Ladner et al. and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas “upper bound” refers to the above-given bound; the fan-out 3 curves show that the unbalanced strategy also performs better for the construction (for f=3 and k=0) we derive next.

    [0078] The construction derived from iterative application of the above results can be combined with PPC(C,T.sub.b), achieving the following trade-off; note that if B=2.sup.b for b∈N, then F.sub.┌log B┐−k+3 can be replaced by F.sub.b−k+5.

    [0079] Suppose C implements ⊕. For all k∈[┌log B┐+1] and B∈ℕ, there is a PPC.sub.⊕(B) circuit of depth (┌log B┐+k)d(C) and size at most

    [00006] ((2+1/2^{k−1})B − F_{┌log B┐−k+3}) |C|.

    [0080] FIG. 6 shows the construction of PPC(C,T.sub.4)′. On the left, we see the recursion tree, with the aggregation trees separated and shown at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also necessitate duplicating some gates in the aggregation trees.

    [0081] The optimal depth construction incurs an excessively large fan-out of Θ(B), as the last output of left recursive calls needs to drive all the copies of C that combine it with each of the corresponding right call's outputs. This entails that, despite its lower depth, it will not result in circuits of smaller physical delay than simply recursively applying the construction from FIG. 4a. Naturally, one can insert buffer trees to ensure a constant fan-out (and thus constantly bounded ratio between delay and depth), but this increases the depth to Θ(log.sup.2 B+d(C)log B).
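The depth cost of naive buffering can be sketched with a small helper (names ours): a buffer tree with fan-out f driving n loads needs ┌log.sub.f n┐ levels, so with Θ(B) loads this adds Θ(log B) depth per recursion level, i.e., Θ(log.sup.2 B) overall.

```python
import math

# Depth of a buffer tree with fan-out f driving n loads.
def buffer_tree_depth(n, f):
    depth = 0
    while n > 1:
        n = math.ceil(n / f)   # one buffer level drives f successors each
        depth += 1
    return depth
```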

    [0082] We now modify the recursive construction to ensure a constant fan-out, at the expense of a limited increase in size of the circuit. The result is the first construction that has size O(B), optimal depth, and constant fan-out.

    [0083] In the following, we denote by f≥3 the maximum fan-out we are trying to achieve, where we assume that gates or memory cells providing the input to the circuit do not need to drive any other components. For simplicity, we consider C to be a single gate.

    [0084] We proceed in two steps. First, we insert 2B buffers into the circuit, ensuring that the fan-out is bounded by 2 everywhere except at the gate providing the last output of each subcircuit corresponding to a right node. In the second step, we will resolve this by duplicating such gates sufficiently often, recursively propagating the changes down the tree. Neither of these changes will affect the output of the circuit or its depth, so the main challenges are to show our claim on the fan-out and to bound the size of the final circuit.

    [0085] Step 1: Almost Bounding Fan-Out by 2

    [0086] Before proceeding to the construction in detail, we need some structural insight on the circuit.

    [0087] For node υ∈T.sub.b, define its range R.sub.υ and left-count a.sub.υ recursively as follows.
    [0088] If υ is the root, then R.sub.υ=[1,2.sup.b] and a.sub.υ=0.
    [0089] If υ is the left child of p with R.sub.p=[i, i+j], then R.sub.υ=[i, i+(j+1)/2] and a.sub.υ=a.sub.p.
    [0090] If υ is the right child of right node p with R.sub.p=[i, i+j], then R.sub.υ=[i+(j+1)/2+1, i+j] and a.sub.υ=a.sub.p.
    [0091] If υ is the right child of left node p, then R.sub.υ=R.sub.p and a.sub.υ=a.sub.p+1.

    [0092] Suppose the subcircuit of PPC(C,T.sub.b) represented by node υ∈T.sub.b in depth d∈[b+1] has range R.sub.υ=[i, i+j].
    [0093] Then
    [0094] (i) it has 2.sup.b−d inputs,
    [0095] (ii) j=2.sup.(b−d+α.sub.υ)−1,
    [0096] (iii) if υ is a right node, all its inputs are outputs of its children's subcircuits, and
    [0097] (iv) if υ is a left node or leaf, only its even inputs are provided by its child (if it has one) and for odd k∈[1,2.sup.b−d], we have that

    [00007] d_k^υ = ⊕_{k′=i+(k−1)·2^{α_υ}}^{i+k·2^{α_υ}−1} d_{k′}.

    [0098] This leads to an alternative representation of the circuit PPC(C,T.sub.b), see FIG. 6, in which we separate the gates in the recursive pattern from FIG. 4a that occur before the subcircuit R.sub.c. Adding the buffers we need in our construction, this results in the modified patterns given in FIG. 6b. The separated gates appear at the bottom of FIG. 6a: for each leaf υ of T.sub.b, there is a tree of depth α.sub.υ aggregating all of the circuit's inputs from its range. Each non-root node in an aggregation tree provides its output to its parent. In addition, one of the two children of an inner node in the tree must provide its output as an input to one of the subcircuits corresponding to a node of T.sub.b, cf. Property (iv) above.

    [0099] From this representation, we will derive that the following modifications of PPC(C,T.sub.b) result in a PPC.sub.⊕(2.sup.b) circuit PPC(C,T.sub.b)′, for which a fan-out larger than 2 exclusively occurs on the last outputs of subcircuits corresponding to nodes of T.sub.b.
    [0100] 1) Add a buffer on each wire connecting a non-root node of any of the aggregation trees to its corresponding subcircuit (see FIG. 6a).
    [0101] 2) For the subcircuit corresponding to left node l with range R.sub.l=[i, i+j], add for each even k≤j (i.e., each even k but the maximum of j+1) a buffer before output π.sub.k.sup.l (see bottom of FIG. 6b).
    [0102] 3) For each right node r with range [i, i+j], add a buffer before output π.sub.(j+1)/2.sup.r (see top of FIG. 6b).

    [0103] With the exception of gates providing the last output of subcircuits corresponding to nodes of T.sub.b (blue in FIG. 6b), the fan-out of PPC(C,T.sub.b)′ is 2. Buffers or gates driving an output of the circuit drive nothing else.

    [0104] It remains to count the inserted buffers. The following helper statement will be useful for this, but also later on.

    [0105] Denote by L.sub.b⊆T.sub.b the set of leaves of T.sub.b. Then |L.sub.b|=F.sub.b+2 and Σ.sub.υ∈L.sub.b 2.sup.α.sup.υ=2.sup.b.

    [0106] Next, consider the recurrence given by L′.sub.0=1, L′.sub.1=2, and L′.sub.b=L′.sub.b−1+2L′.sub.b−2 for b≥2; the factor of 2 assigns twice the weight to the subtree rooted at the child of the root's left child, thereby ensuring that each leaf is accounted for with weight 2.sup.α.sup.ν. This recurrence has solution 2.sup.b.
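Both counting claims can be checked directly from the recursive definition of T.sub.b (leaf counts follow the Fibonacci recursion of the Definition; the weighted recurrence L′ is as stated above; function names are ours).

```python
# Check |L_b| = F_{b+2} and that L'_b = L'_{b-1} + 2 L'_{b-2} solves to 2^b.
def leaves(b):
    if b <= 1:
        return b + 1          # T_0 has 1 leaf, T_1 has 2
    return leaves(b - 1) + leaves(b - 2)

def weighted(b):
    if b <= 1:
        return b + 1          # L'_0 = 1, L'_1 = 2
    return weighted(b - 1) + 2 * weighted(b - 2)

def fib(n):                   # F_1 = F_2 = 1
    a, c = 0, 1
    for _ in range(n):
        a, c = c, a + c
    return a
```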

    [0107] Denote by s the size of a buffer. Then


    |PPC(C,T.sub.b)′|=|PPC(C,T.sub.b)|+(2.sup.b+2.sup.b−1−F.sub.b+3)s.

    [0108] Step 2: Bounding Fan-Out by f

    [0109] In the second step, we need to resolve the issue of the high fan-out of the last output of each recursively used subcircuit in PPC(C,T.sub.b)′. Our approach is straightforward. Starting at the root of T.sub.b and progressing downwards, we label each node ν with a value a(ν) that specifies a sufficient number of additional copies of the last output of the subcircuit represented by ν to avoid fan-out larger than f. At right nodes, this is achieved by duplicating the gate computing this output sufficiently often, marked blue in FIG. 6b (top). For left nodes, we simply require the same number of duplicates to be provided by the subcircuit represented by their child (i.e., we duplicate the blue wire in the bottom recursive pattern shown in FIG. 6b). Finally, for leaves, we require a sufficient number of duplicates of the root of their aggregation tree; this, in turn, may require duplicating their descendants in the aggregation tree.
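The effect of bounding fan-out by duplication can be isolated in a small model: to drive m consumers from one signal with fan-out at most f, one inserts a tree of buffers. The following sketch is our own illustration of this principle, not the construction of the claims.

```python
def buffers_needed(m, f):
    """Number of buffers needed so that one signal can drive m consumers
    while no gate or buffer drives more than f loads: consumers are
    grouped into batches of f, each batch driven by one buffer, and the
    buffers themselves are grouped recursively until at most f drivers
    remain to be fed by the original signal."""
    assert f >= 2 and m >= 1
    buffers, loads = 0, m
    while loads > f:                  # the source itself may drive f loads
        batch = -(-loads // f)        # ceil(loads / f) buffers on this level
        buffers += batch
        loads = batch
    return buffers
```

For m=10 consumers and f=3 this yields 6 buffers; the count grows roughly like m/(f−1), the same flavor as the 1/(f−1) and 1/(f−2) factors appearing in the size bounds.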

    [0110] We start by defining a(υ) and then argue how to use these values for modifying the circuit to obtain our fan-out f circuit. Afterwards, we will analyze the increase in size of the circuit compared to PPC(C,T.sub.b)′.

    [0111] Definition (a(ν)). Fix b∈N.sub.0. For ν∈T.sub.b in depth d∈[b+1], define

    [00008] a(ν):=0 if ν is the root; a(ν):=(a(p)+2.sup.b−d)/f if ν is the left child of p; a(ν):=a(p)/f if ν is the right child of right node p; and a(ν):=a(p) if ν is the (only) child of left node p.
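A direct transcription of this definition may clarify it; we read the last branch as a(ν)=a(p), matching paragraph [0109], where left nodes pass their duplicate count on to their child. All names are our own encoding.

```python
def a_label(case, parent_a=0.0, b=0, d=0, f=3):
    """a(v) as a function of the parent label a(p), the node's position
    ('case'), its depth d, and the fan-out parameter f. The case names
    are our own encoding of the four branches of the definition."""
    if case == 'root':
        return 0.0
    if case == 'left_child':              # v is the left child of p
        return (parent_a + 2 ** (b - d)) / f
    if case == 'right_child_of_right':    # v is the right child of right node p
        return parent_a / f
    if case == 'child_of_left':           # v is the (only) child of left node p
        return parent_a
    raise ValueError(case)
```

The floor └a(ν)┘ is taken only where the values are used, as in the statements that follow.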

    [0112] Suppose that for each leaf ν∈T.sub.b, there are └a(ν)┘ additional copies of the root of the aggregation tree, and for each right node ν∈T.sub.b, we add └a(ν)┘ gates that compute (copies of) the last output of their corresponding sub circuit of PPC(C,T.sub.b)′. Then we can wire the circuit such that all gates that are not in aggregation trees have fan-out at most f, and each output of the circuit is driven by a gate or buffer driving only this output.

    [0113] It remains to modify the aggregation trees so that sufficiently many copies of the roots' output values are available.

    [0114] Consider an aggregation tree corresponding to leaf ν∈T.sub.b and fix f≥3. We can modify it such that the fan-out of all its non-root nodes becomes at most f, there are └a(ν)┘ additional gates computing the same output as the root, and at most fa(ν)/(f−2)+(2.sup.α.sup.ν−1)/(f−1) gates are added.

    [0115] Finally, we need to count the total number of gates we add when implementing these modifications to the circuit.

    [0116] For f≥3, define PPC.sup.(f)(C,T.sub.b) by modifying PPC(C,T.sub.b)′ as described in paragraphs [0112] and [0114] above. Then, with λ.sub.1:=(1+√5)/4, |PPC.sup.(f)(C,T.sub.b)| is bounded by

    [00009] |PPC(C,T.sub.b)|+2.sup.b(1/(2f−2)+2/(f−2)+O(λ.sub.1.sup.b/f.sup.2))·|C|.

    [0117] We summarize our findings in the following:

    [0118] Suppose that C implements ⊕, buffers have size s and depth at most d(C), and set λ.sub.1:=(1+√5)/4. Then for all k∈[b+1], b∈N.sub.0, and f≥3, there is a PPC.sub.⊕(2.sup.b) circuit of fan-out f, depth (b+k)d(C), and size at most

    [00010] (2.sup.b+1+2.sup.b−k(2+(5f−6)/(2f.sup.2−6f+4)+O(λ.sub.1.sup.b/f.sup.2)))·|C|+(2.sup.b+2.sup.b−k−1)s.

    [0119] Due to space constraints, we refrain from analyzing the size of the construction for values of B that are not powers of 2. However, in FIG. 8 we plot the exact bounds (without buffers) for k=0 and selected values of f against B.

    [0120] FIG. 7 shows, as an example for the overall resulting construction, PPC.sup.(3)(C,T.sub.4). Right recursion steps R.sub.r are marked with dark gray, left recursion steps with light gray. The steps at the root (above) and aggregation trees (below) are not marked explicitly. Duplicated gates are depicted in a layered fashion. Dashed lines indicate that a wire is not participating in a recursive step.

    [0121] FIG. 8 shows a dependence of the size of the modified construction on f. For comparison, the upper bound on the circuit with unbounded fan-out is shown as well.

    [0122] FIG. 9 shows an XMUX circuit according to an embodiment of the invention, used to implement ⋄.sub.M and out.sub.M.

    [0123] First, we need to specify implementations of the sub circuits computing ⋄.sub.M and out.sub.M.

    [0124] From Tables 5a and 5b, for s, b∈B.sup.2 we can extract the Boolean formulas


    (s⋄b).sub.1=s.sub.1s.sub.2+s.sub.1b.sub.1+s.sub.2b.sub.1


    (s⋄b).sub.2=s.sub.1s.sub.2+s.sub.1b.sub.2+s.sub.2b.sub.2


    out(s,b).sub.1=s.sub.1b.sub.2+s.sub.2b.sub.1+b.sub.1b.sub.2


    out(s,b).sub.2=s.sub.1b.sub.2+s.sub.2b.sub.1+b.sub.1b.sub.2.

    [0125] In general, realizing a Boolean formula f by replacing negation, multiplication, and addition by inverters, AND gates, and OR gates, respectively, does not result in a circuit implementing f.sub.M..sup.1 However, we can easily verify that the above formulas are disjunctions of all prime implicants of their respective functions. As one can verify by hand, these formulas evaluate to the truth tables given in Tables 6 and 7, and in this special case the resulting circuits do implement the closure, provided that the gates behave as in Table 3, which the implementations given in FIG. 2 do. Using the distributive laws (recall that these also hold in Kleene logic), the above formulas can be rewritten as


    (s⋄b).sub.1=s.sub.1(s.sub.2+b.sub.1)+s.sub.2b.sub.1


    (s⋄b).sub.2=s.sub.2(s.sub.1+b.sub.2)+s.sub.1b.sub.2


    out(s,b).sub.1=b.sub.1(b.sub.2+s.sub.2)+b.sub.2s.sub.1


    out(s,b).sub.2=b.sub.2(b.sub.1+s.sub.1)+b.sub.1s.sub.2.

    .sup.1 For instance, (s⋄b).sub.1=s.sub.1b.sub.1+s.sub.2b.sub.1 as a Boolean formula, but the two expressions differ when evaluated on s.sub.1=s.sub.2=1 and b.sub.1=M. The circuits resulting from the two formulas are implementations of a multiplexer (with select bit b.sub.1) and its closure, respectively.
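The footnote's evaluation can be reproduced in three-valued Kleene logic: AND and OR act as min and max over the ordering 0 &lt; M &lt; 1, matching the worst-case gate behavior assumed for M. The numeric encoding M = 0.5 is our own.

```python
M = 0.5  # metastable/unknown value, ordered between 0 and 1

def AND(*xs):
    return min(xs)  # a 0 input forces the output to 0 even under M

def OR(*xs):
    return max(xs)  # a 1 input forces the output to 1 even under M

def with_consensus(s1, s2, b1):
    # (s <> b)_1 as printed in paragraph [0124], including the term s1*s2
    return OR(AND(s1, s2), AND(s1, b1), AND(s2, b1))

def reduced(s1, s2, b1):
    # the shortened formula from the footnote, without the term s1*s2
    return OR(AND(s1, b1), AND(s2, b1))

# For s1 = s2 = 1 and b1 = M the two formulas differ:
assert with_consensus(1, 1, M) == 1  # the extra term masks the metastable bit
assert reduced(1, 1, M) == M
```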

    [0126] We see that, in fact, a single circuit with suitably wired (and possibly negated) inputs can implement all four operations. As for sel.sub.1=sel.sub.2 the circuit implements a multiplexer with select bit sel.sub.1, we refer to it as an extended multiplexer, or XMUX for short. Its functionality is specified by


    XMUX(sel.sub.1,sel.sub.2,x,y):=y(x+sel.sub.2)+x sel.sub.1.

    [0127] Table 8 lists how to map inputs to compute ⋄.sub.M and out.sub.M.

    [0128] We note that this circuit is not a particularly efficient XMUX implementation; a transistor-level implementation would be much smaller. However, our goal here is to verify correctness and give some initial indication of the size of the resulting circuits; a fully optimized ASIC circuit is beyond the scope of this article. The size of the implementation may be slightly reduced by moving negations. Due to space limitations, we refrain from detailing this modification here, but note that FIG. 12 and Table 9 take it into account.

    [0129] We now have all the pieces in place to assemble a containing 2-sort(B) circuit. As stated above, ⋄.sub.M is associative. Thus, from a given implementation of ⋄.sub.M (e.g., two copies of the circuit from FIG. 9 with appropriate wiring and negation, cf. Table 8) we can construct PPC.sub.⋄.sub.M (B−1) circuits of small depth and size, as shown above. We can combine such a circuit with an out.sub.M implementation (again, two XMUXes with appropriate wiring and negation will do) to obtain our 2-sort(B) circuit.
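The high-level structure of this assembly can be illustrated in software, deliberately ignoring metastability and Gray coding: plain binary strings are used, and the prefix states are computed sequentially rather than by a PPC circuit. The comparison-state operator below plays the role of ⋄ and the per-position output function plays the role of out; all names are ours.

```python
def state(gi, hi):
    """Comparison outcome contributed by one bit position:
    'eq' (bits equal), 'gt' (g-bit larger), 'lt' (h-bit larger)."""
    return 'eq' if gi == hi else ('gt' if gi > hi else 'lt')

def diamond(s, t):
    """Combine the states of two adjacent segments; associative,
    since the first decided comparison wins."""
    return t if s == 'eq' else s

def two_sort(g, h):
    """Sort two equal-length binary strings bitwise; returns (max, min)."""
    assert len(g) == len(h)
    out_hi, out_lo = [], []
    prefix = 'eq'  # state of the bits before the current position
    for gi, hi in zip(g, h):
        if prefix == 'gt':     # g already known to be larger
            out_hi.append(gi); out_lo.append(hi)
        elif prefix == 'lt':   # h already known to be larger
            out_hi.append(hi); out_lo.append(gi)
        else:                  # still equal: larger current bit goes to max
            out_hi.append(max(gi, hi)); out_lo.append(min(gi, hi))
        prefix = diamond(prefix, state(gi, hi))
    return ''.join(out_hi), ''.join(out_lo)
```

For example, two_sort('0110', '0101') returns ('0110', '0101'), i.e., 6 and 5. In the circuit, the sequential loop over diamond is replaced by a parallel prefix computation, which is what enables logarithmic depth.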

    [0130] FIG. 10 shows constructing 2-sort(B) from out.sub.M and PPCM(B−1).

    [0131] The correctness of this construction follows from the above explanations, where we can plug in any PPC.sub.⋄.sub.M(B−1) circuit. For the circuits derived by relying on the XMUX circuit from FIG. 9, we independently confirmed this via simulation.

    [0132] More particularly, we implemented the design given in FIG. 10 on register-transfer level, using the PPC.sub.⋄.sub.M(B−1) circuit described above for k=0..sup.3 Quartus by Altera was used for design entry, which in our case mainly consists of checking that the implementation is correct. After design entry, we used ModelSim by Altera for behavioral simulation. Note that we must not simulate the preprocessed Quartus output, because this preprocessing may compromise metastability-containing behavior. Instead, we simulate the pure VHDL. Metastable signals are simulated using the VHDL signal value X, because its behavior matches the worst-case behavior assumed for M.

    [0133] FIG. 11 shows an excerpt from a simulation for 4-bit inputs, where X=M. The rows show (from top to bottom) the inputs g and h, both outputs of the simple non-containing circuit, and both outputs of our design. Inputs g and h are randomly generated valid strings. Columns 1 and 3 show that the simpler design fails to implement a 2-sort(4) circuit.

    [0134] For the implementation of PPC.sub.⋄.sub.M(B−1) we used the basic circuits, i.e., we did not make use of the extension to constant fan-out. We exhaustively checked the design from FIG. 10 for B up to 12 (and all k accordingly). Simulation shows that the design works correctly across several levels of recursion; e.g., when regarding B=1 and B=2 as simple base cases, B=12 implies three levels of recursion for both patterns. We refrained from simulating the constant fan-out construction, because it simply replicates intermediate results without adding functionality.

    [0135] After behavioral simulation, we continue with a comparison of our design to a standard sorting approach, Bin-comp(B). As mentioned earlier, the 2-sort(B) implementation given in FIG. 9 is slightly optimized by pulling out a negation from the operators in every recursive step [3]. After design entry as described above, we use Encounter RTL Compiler for synthesis and Encounter for place and route. Both tools are part of the Cadence tool set, and in both steps we use the NanGate 45 nm Open Cell Library as the standard cell library.

    [0136] Since metastability-containing circuits may include additional gates that are not required in traditional Boolean logic, Boolean optimization may compromise metastability-containing properties. Accordingly, we were forced to disable optimization during synthesis of the circuits.

    [0137] FIG. 12 shows a comparison of the inventive solution PPC Sort to a standard non-containing one. For the latter, the unexpected delay reduction at B=16 is the result of automatic optimization with more powerful gates, which the inventive solution does not use.

    [0138] As a binary benchmark, Bin-comp was used. In short, Bin-comp consists of a simple VHDL statement comparing two binary-encoded inputs and outputting the maximum and the minimum, respectively. It follows the same design process as 2-sort, but then undergoes optimization using a more powerful set of basic gates. For example, the standard cell library provides prebuilt multiplexers; these multiplexers are used by Bin-comp, but not by 2-sort. We stress that these more powerful gates provide optimized implementations of multiple Boolean functions, yet each of them is still counted as a single gate. Thus, comparing our design to the binary design in terms of gate count, area, and delay disfavors our solution. Moreover, we noticed that the optimization routine switches to employing more powerful gates when going from B=8 to B=16 (cf. FIG. 12), resulting in a decrease of the delay of the Bin-comp implementation.

    [0139] Nonetheless, our design performs comparably to the non-containing binary design in terms of delay, cf. FIG. 12 and Table 9. This is quite notable, as our design can be improved further by optimizing it on the transistor level, with significant expected gains. The same applies to gate count and area, where a notable gap remains. Recall, however, that the Bin-comp design hides complexity by using more advanced gates and does not contain metastability.

    [0140] We emphasize that we refrained from optimizing the design by making use of all available gates or devising transistor-level implementations, since such an approach is tied to the utilized library or requires design of standard cells.

    [0141] In conclusion, we demonstrated that efficient metastability-containing sorting circuits are possible. Our results indicate that optimized implementations can achieve the same delay as non-containing solutions, without a dramatic increase in circuit size. This is of high interest to the intended application that motivated us to design MC sorting circuits: fault-tolerant high-frequency clock synchronization. Sorting is a key step in envisioned implementations of the Lynch-Welch algorithm with improved precision of synchronization. The efficient MC sorting networks presented in this article make the complete elimination of synchronizer delay possible, enabling an increase in the rate at which clock corrections are applied and significantly reducing the negative impact of the phase drift of local clock sources on the precision of the algorithm.

    [0142] More generally speaking, MC circuits like those presented here are of interest in mixed signal control loops whose performance depends on very short response times. When analog control is not desirable, traditional solutions incur synchronizer delay before being able to react to any input change. Using MC logic saves the time for synchronization, while metastability of the output corresponds to the initial uncertainty of the measurement; thus, the same quality of the computational result can be achieved in shorter time. Note that our circuits are purely combinational, so they can be used in both clocked and asynchronous control logic.

    [0143] Examples of such control loops are clock synchronization circuits, but MC has been shown to be useful for adaptive voltage control and for fast routing with an acceptably low probability of data corruption as well.