Delay element, delay element chain and fast all-digital clock frequency adaptation circuit for voltage droop tolerance
11520370 · 2022-12-06
Assignee
Inventors
- Christoph Lenzen (Saarbrücken, DE)
- Matthias Függer (Cachan, FR)
- Ben Wiederhake (Saarbrücken, DE)
- Attila Kinali (Saarbrücken, DE)
- Mordechai Medina (Tel Aviv, IL)
Cpc classification
G06F1/08
PHYSICS
G11C7/222
PHYSICS
G11C5/14
PHYSICS
H03L7/0818
ELECTRICITY
International classification
G06F1/08
PHYSICS
Abstract
A circuit for delaying an electric signal (CI), comprises an input for the electric signal (CI); an input for a control signal (EI); a first storage element (U5) for storing the control signal; a delay element for delaying the electric signal; and an output for the delayed electric signal (CO). According to the invention, the electric signal is delayed, based on the stored control signal. The delay circuit is employed in a fast all-digital clock frequency adaptation circuit for voltage droop tolerance.
Claims
1. A frequency adaptation circuit comprising: one or more circuits for delaying an electric signal; and a voltage droop detector, wherein each circuit for delaying an electric signal comprises: an input for the electric signal; an input for a control signal; a first storage element for storing the control signal; a second storage element for the storing the control signal; an output for outputting the control signal stored in the second storage element; a delay element for delaying the electric signal; and an output for the delayed electric signal, wherein the electric signal is delayed, based on the stored control signal, characterized in that the first storage element is a metastability-masking flip-flop, and wherein a digital phase accumulation module accumulates the control signal value output by a delay circuit into its phase offset.
2. The frequency adaptation circuit according to claim 1, wherein the phase accumulation module skips a clock cycle whenever its accumulated phase reaches a full period.
3. The frequency adaptation circuit according to claim 1, wherein the input clock of the phase accumulation module is four times the system clock.
4. The frequency adaptation circuit according to claim 1, wherein the electric signal is a clock signal.
5. The frequency adaptation circuit according to claim 1, wherein the first storage element is clocked by a delayed clock signal.
6. The frequency adaptation circuit according to claim 1, wherein the delay element is an inverter.
7. The frequency adaptation circuit according to claim 1, further comprising a pulse shaping module for shaping the electric signal.
8. The frequency adaptation circuit according to claim 1, wherein the control signal is a digital control signal.
9. The frequency adaptation circuit according to claim 1, wherein the second storage element is clocked by the delayed electric signal.
10. The frequency adaptation circuit according to claim 1, wherein the frequency of the generated clock output is decreased by a configurable factor.
Description
(1) The invention also presents a simplification of the circuit that uses only one backpropagation rail instead of two, reducing the necessary guardband further and making it easier to find a drop-in replacement for the droop detection mechanism.
DETAILED DESCRIPTION
(15) The description starts with the specification of a correct frequency adaptation module (FAM). Then, a circuit FAM-I is specified and shown to be a correct implementation of the frequency adaptation module. The circuit FAM-I consists of the submodules Droop Detector (DD), Delay Element (DE), and Phase Accumulator (φ).
(16) All module specifications are stated as a list of input assumptions (Ix) and output constraints (Cy). A module is correct if it fulfills all (Cy) if all (Ix) hold.
(17) One input signal is a clock signal with a fixed nominal frequency (which can be chosen much higher than the derived system clock); the other is the supply voltage. The clock signal is modeled by a sequence of times $(\tau_i^{\uparrow})_{i \in \mathbb{N}}$, where $\tau_i^{\uparrow}$ is the time at which the $i$-th rising input clock edge occurs; analogously, $\tau_i^{\downarrow}$ is the time of the $i$-th falling input clock edge. The supply voltage is given by $V_{CC}: \mathbb{R}_{\ge 0} \to [V_{min}, V_{max}]$, where $V_{CC}(t)$ is the voltage at time $t$.
(18) The input is required to be well-behaved.

Assumption of well-separated input. The input clock fulfills

$$\tau_i^{\uparrow} \ge 0, \quad \forall i \in \mathbb{N}: \tau_{i+1}^{\uparrow} - \tau_i^{\uparrow} \in [T_s^-, T_s^+], \quad \text{and} \quad \forall i \in \mathbb{N}: \tau_i^{\downarrow} - \tau_i^{\uparrow} \in [T_s^-/2, T_s^+/2], \tag{I1}$$

where $T_s^-$ and $T_s^+$ are the minimum and maximum duration of the “short” clock pulses it provides. The above essentially means a 50% duty cycle of the input clock, although this requirement can be relaxed.

Assumption on droops. The supply voltage satisfies

$$\forall t, t' \ge 0: |V_{CC}(t) - V_{CC}(t')| \le K\,|t - t'|, \tag{I2}$$

i.e., $K$ bounds how steep a droop can be.

The only output is the clock signal, which must slow down appropriately during a voltage droop. The output is modeled by the sequence of times $({\tau'}_i^{\uparrow})_{i \in \mathbb{N}}$, where ${\tau'}_i^{\uparrow}$ is the time at which the $i$-th rising output clock edge occurs; $({\tau'}_i^{\downarrow})_{i \in \mathbb{N}}$ is defined analogously. One will also need $T_l^-$ and $T_l^+$, the desired minimum and maximum period of the slowed-down clock, which has “long” periods to accommodate increased switching times during droops. In summary, $T_s^- < T_s^+ < T_l^- < T_l^+$.

The frequency adaptation module is said to be correct if, given (I1) and (I2), it fulfills constraints (C1) and (C2).

Guarantee of well-separated output. Output clock edges are well-separated, i.e.,

$${\tau'}_0^{\uparrow} \ge \tau_0^{\uparrow} \quad \text{and} \quad \forall i \in \mathbb{N}: {\tau'}_{i+1}^{\uparrow} - {\tau'}_i^{\uparrow} \ge T_s^-. \tag{C1}$$

A 50% duty cycle of the output clock is not required, but bounds will be shown for the inventive solution later on.

Guarantee of well-shifted output. The output clock always runs fast when the supply voltage has been sufficiently high during the previous cycle, and it runs slow when the supply voltage was too low during the previous cycle:

$$\left(\forall t \in [{\tau'}_{i-1}^{\uparrow}, {\tau'}_i^{\uparrow}]: V_{CC}(t) \ge V_{high}\right) \Rightarrow {\tau'}_i^{\uparrow} - {\tau'}_{i-1}^{\uparrow} \in [T_s^-, T_s^+], \quad \text{and}$$
$$\left(\exists t \in [{\tau'}_{i-1}^{\uparrow}, {\tau'}_i^{\uparrow}]: V_{CC}(t) \le V_{low}\right) \Rightarrow {\tau'}_i^{\uparrow} - {\tau'}_{i-1}^{\uparrow} \ge T_l^-. \tag{C2}$$

The voltages $V_{low}$ and $V_{high}$ define what is considered a droop. No implementation can work for arbitrarily close $V_{low}$ and $V_{high}$. In summary, $V_{min} < V_{low} < V_{high} < V_{max}$. While this specification does not explicitly require it, the proposed system also guarantees an amortized minimum frequency of $1/T_l^+$; in the absence of metastability in the constructed delay chain, in fact no clock period is longer than $T_l^+$, and for a chain of length $n$, the maximum clock period is $T_l^+ + n(T_l^+ - T_s^+)$. These requirements and guarantees, especially (C2), could be phrased differently. The inventors attempted to capture a broad set of interpretations. Given more information about the specifics of the desired requirements and guarantees, the analysis can be tailored towards them, yielding slightly better results.
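Constraints such as (C1) and (C2) can be checked mechanically on a recorded trace. The following is a minimal Python sketch, assuming edge times on an integer grid (e.g., picoseconds) and a supply voltage given as a callable that is sampled at a fixed number of points per period. The function names, the sampling approach, and all numeric values are illustrative assumptions and not part of the specification.

```python
def check_c1(in_rising, out_rising, ts_min):
    """(C1): first output edge not before first input edge, periods >= T_s^-."""
    if out_rising[0] < in_rising[0]:
        return False
    return all(b - a >= ts_min for a, b in zip(out_rising, out_rising[1:]))


def check_c2(out_rising, vcc, v_low, v_high, ts_min, ts_max, tl_min, samples=100):
    """(C2), checked approximately by sampling vcc(t) within each output period."""
    for a, b in zip(out_rising, out_rising[1:]):
        volts = [vcc(a + (b - a) * k / samples) for k in range(samples + 1)]
        period = b - a
        # Supply high throughout the period: the period must be a short one.
        if all(v >= v_high for v in volts) and not ts_min <= period <= ts_max:
            return False
        # Supply too low at some point: the period must be a long one.
        if any(v <= v_low for v in volts) and period < tl_min:
            return False
    return True


# Hypothetical trace in picoseconds: the supply droops right after t = 1000.
vcc = lambda t: 1.0 if t <= 1000 else 0.9
inputs = [0, 250, 500, 750, 1000, 1250, 1500]
good = [0, 250, 500, 750, 1000, 1313, 1626]  # slows down during the droop
bad = [0, 250, 500, 750, 1000, 1250, 1500]   # keeps running fast
```

In this sketch, the trace `good` passes both checks, whereas `bad` satisfies (C1) but fails (C2) because its periods stay short while the supply is below the low threshold.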
(19) Central to the proposed solution are flip-flops with x-masking outputs, for x∈{0,1}: a flip-flop whose output is x while it is internally metastable. Note that such a flip-flop only produces full-swing, fast transitions at its output, with no glitches or prolonged intermediate voltage levels: when metastability resolves to 1−x, it produces a (possibly arbitrarily late) transition from x to 1−x; if metastability resolves to x, its output remains at x. Such flip-flops can be realized by successive high/low-threshold inverters; see e.g. [D. J. Kinniment, A. Bystrov, and A. V. Yakovlev, “Synchronization circuit performance,” IEEE SSC, vol. 37, no. 2, pp. 202-209, 2002], [D. J. Kinniment, Synchronization and arbitration in digital systems. John Wiley & Sons, 2008], and [M. Függer, A. Kinali, C. Lenzen, and B. Wiederhake, “Fast all digital clock frequency adaptation circuit for voltage droop tolerance,” in IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), 2018].
(20) Next, an abstract implementation of a frequency adaptation module, called FAM-I, will be presented, that consists of (i) a droop detector, (ii) a configurable delay chain comprising n≥1 conditional delay elements, and (iii) a digital phase accumulator.
(21)
(22) The three modules of FAM-I are specified and interconnected as follows: (1) The Droop Detector DD continuously provides two single-bit digital measurement values of $V_{CC}$ at its outputs $\bar{E}_O^F$ and $\bar{E}_O^S$. Note that these may be unstable or transitioning when being sampled, i.e., they could induce metastability of storage elements. The outputs are low-active, i.e., $\bar{E}_O^* = 0$ indicates the presence of a voltage droop and the request to slow down the clock (i.e., to have a long clock period), $\bar{E}_O^* = 1$ indicates its absence and the request for a fast clock (period), and $\bar{E}_O^* = M$ indicates an unstable signal. In case of such an unstable signal, $\bar{E}_O^F$ must be 1-masking (called fast-masking), and $\bar{E}_O^S$ 0-masking (called slow-masking). The output values are used for setting the rightmost delay element. (2) Each conditional Delay Element DE delays the clock signal, from input $C_I$ to output $C_O$, based on a possibly metastable input $\bar{E}_I^F$, the delay enable signal: if $\bar{E}_I^F = 1$, by a short delay within $[T_s^-, T_s^+]$, and if $\bar{E}_I^F = 0$, by a long delay within $[T_l^-, T_l^+]$. Potential uncertainties in $\bar{E}_I^F$ due to unstable or metastable input are transformed into delay uncertainties. Several of these building blocks are combined into a pipeline that is fed from right to left, as depicted in
(23) What follows is a detailed specification of the modules. Delays in all module specifications are given in terms of time ranges. This not only captures standard jitter and imbalance within the circuit, but also accounts for the effect of a voltage droop on the frequency adaptation module itself. For example, the present model accounts for the fact that a delay element operating in long delay mode propagates the clock signal with delay $T_l^-$ in the presence of full Vdd, and with a delay of up to $T_l^+$ in the presence of a voltage droop. This makes it possible to capture clock pulse shrinking and stretching effects caused by voltage droops as observed in [Bowman et al., ibid]. For succinctness and in the interest of readability, however, single variables will be used instead of intervals for a time range in the following, with the understanding that the timing analysis has to respect the respective upper and lower bounds. For example, $T_s$ will be written instead of the interval $[T_s^-, T_s^+]$, $d \le T_s$ instead of $d \le T_s^-$, $d \ge T_s$ instead of $d \ge T_s^+$, and $d = T_s$ instead of $d \in [T_s^-, T_s^+]$. Also needed are the common timing parameters that boil down to properties of the underlying storage elements: $t_{set}$, $t_{hold}$, $t_{prop}$, and $t_{ofs}$, which are the setup, hold, and propagation times of the circuits, as well as the offset between the active clock edge and the time the input is captured.
(24) The behavior of module φ (phase accumulator) is modeled in a straightforward way. The component has an internal state (the accumulated phase shift) and two inputs: the single-bit signal $\bar{E}_I$ indicating whether to increase the phase offset, and the clock signal $Clk_{in}$ generated by the source clock, e.g., an external free-running quartz oscillator. It outputs a clock signal $C_O$ derived from $Clk_{in}$, whose pulses are phase-shifted appropriately. Specifically, this means that the module has to add phase shift values, handle overflow as clock gating, and be able to complete this within $T_s^-$ time even during a voltage droop. According to the invention, this can be achieved by a simple and fast circuit.
(25) Formally, let the sequences $\tau_i^{\uparrow}$, $\tau_i^{\downarrow}$, $r_{i,0}^{\uparrow}$, $r_{i,0}^{\downarrow}$ be the times of the rising and falling edges of the input and output clock signals, respectively (the 0 indicates that φ is the “0th” element of the delay chain). It is assumed that (I1) holds for Module φ's clock input. The variable $b_{i,0}$ denotes the digital interpretation of $\bar{E}_I$ around time $r_{i,0}^{\downarrow}$, i.e., for $b \in \{0,1\}$, $b_{i,0} = b$ if $\forall t \in [-t_{set}, t_{hold}]: \bar{E}_I(r_{i,0}^{\downarrow} + t_{ofs} + t) = b$ (where $\bar{E}_I$ is scaled accordingly).

Assumption of metastability-free input. There always is such a value $b$, which will be argued to hold with high probability later:

$$b_{i,0} \in \{0,1\}. \tag{I3}$$
(26) The total shift count can now be defined as $B_i = \sum_{k=0}^{i-1} (1 - b_{k,0})$. The Phase Accumulator is correct if, given (I1) and (I3), conditions (C3) and (C4) hold.

Guarantee of well-shifted output. Let $Q$ be the quotient of the clock period increase, i.e., $T_l/T_s = 1 + 1/Q$, and assume $Q \in \mathbb{N}$. The output clock $C_O$ is shifted according to the amount indicated by all previous rounds' $b_{i,0}$:

$$q_i = \lfloor B_i / Q \rfloor, \quad r_i = B_i - q_i \cdot Q, \quad \text{and} \quad r_{i,0}^{\uparrow} = \tau_{i+q_i}^{\uparrow} + r_i \cdot T_s / Q, \tag{C3}$$
$$r_{i,0}^{\downarrow} - r_{i,0}^{\uparrow} = T_s/2. \tag{C4}$$
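The behavior required of the phase accumulator can be illustrated with a small behavioral model. The following Python sketch assumes that each accumulated shift request delays the output edge by $T_s/Q$ and that every $Q$-th request skips one input clock cycle (in line with the clock-skipping described in claim 2). The function name, the integer time grid, and the example values are assumptions made for illustration only.

```python
def phase_accumulator_edges(in_rising, enables, q, ts):
    """Rising output edges of an idealized phase accumulator.

    in_rising: rising input clock edges (integers, e.g. picoseconds)
    enables:   sampled delay-enable bits b_{i,0} (0 requests a phase shift)
    q:         number of phase steps per input period (Q)
    ts:        nominal input period T_s, assumed divisible by q
    """
    out = []
    shifts = 0  # B_i: shift requests accumulated before pulse i
    for i, b in enumerate(enables):
        skipped, rem = divmod(shifts, q)  # q_i and r_i
        if i + skipped >= len(in_rising):
            break  # ran out of input edges
        out.append(in_rising[i + skipped] + rem * ts // q)
        shifts += 1 - b
    return out


# Hypothetical example: T_s = 400, Q = 4, four consecutive shift requests.
edges = phase_accumulator_edges(list(range(0, 4000, 400)), [1, 0, 0, 0, 0, 1, 1], q=4, ts=400)
periods = [b - a for a, b in zip(edges, edges[1:])]
```

With these inputs, the four shift requests produce four long periods of $T_s(1 + 1/Q) = 500$, after which the output is aligned with the input clock again, one input cycle later.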
The delay element DE has three inputs $\bar{E}_I^F$, $\bar{E}_I^S$, and $C_I$, and three corresponding outputs $\bar{E}_O^F$, $\bar{E}_O^S$, and $C_O$, connected like a REQ/ACK pipeline. Clock output $C_O$ is the clock input $C_I$, potentially delayed by an additional time of up to $T_s/Q$. Inputs $\bar{E}_I^*$ provide the delay enable, representing the (low-active) decision whether the clock needs to be delayed or not. Outputs $\bar{E}_O^*$ propagate this delay enable backwards in the chain, at the occurrence of the next local falling edge of C. The input $\bar{E}_I^F$ is used for the internal decision whether to add delay, whereas $\bar{E}_I^S$ is propagated to both outputs $\bar{E}_O^*$. Distinguishing between the local and forwarded “copy” of the delay enable is relevant only if the input is unstable, a case that will be handled carefully using metastability masking techniques.
(27) Formally, it is required that the input signal at C.sub.I is a “clean” clock signal, i.e., it has sharp edges between periods of strong-high and strong-low signals (as the invention considers unstable inputs, this will need to be shown to be true in the proof of correctness); the module guarantees the same for its clock output C.sub.O. Denote by r.sub.i,j.sup.↑ and r.sub.i,j.sup.↓ the sequences of times of the rising and falling output clock edge of the j.sup.th delay element, respectively. Therefore, r.sub.i,j−1* is the occurrence of the respective rising/falling input clock edge.
(28) Observe that $r_{i,j-1}^{\uparrow}$ and $r_{i,j-1}^{\downarrow}$ fully describe the clock input $C_I$ to the $j$-th element, where the first element receives $r_{i,0}^{\uparrow}$ and $r_{i,0}^{\downarrow}$ from φ. The following requirements are made.

Assumption of well-separated input.

$$r_{i,j-1}^{\uparrow} - r_{i-1,j-1}^{\uparrow} \ge T_s^- \quad \text{and} \tag{I4}$$
$$r_{i,j-1}^{\downarrow} - r_{i,j-1}^{\uparrow} = T_s/2, \tag{I5}$$

i.e., the clock period is at least $T_s^-$ and the high time is $T_s/2$.

(29) Then the same guarantees are ensured for the clock output.

Guarantee of well-separated output.

$$r_{i,j}^{\uparrow} - r_{i-1,j}^{\uparrow} \ge T_s^- \quad \text{and} \tag{C5}$$
$$r_{i,j}^{\downarrow} - r_{i,j}^{\uparrow} = T_s/2. \tag{C6}$$
(30) It remains to specify how the module responds to the delay enable inputs. To this end, for *∈{S, F} one defines b.sub.i,j* as the digital abstraction of the respective signal at the input port E.sub.I* of the j.sup.th delay element, using the mapping
(31)
where Ē.sub.I* is scaled such that 1 represents a strong-high, 0 a strong-low, and M any voltage in between. Intuitively, b.sub.i,j* is the resulting state of a flip-flop with input Ē.sub.I* latched at time r.sub.i,j.sup.↓+t.sub.ofs, where M represents metastability resulting from a setup/hold time violation or otherwise unclean signal.
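The digital abstraction described here can be sketched in code. The following Python example assumes a mapping analogous to the definition of $b_{i,0}$ given earlier for the phase accumulator: the value is 0 or 1 if the signal is stable at that value throughout the setup/hold window around the capture time, and M otherwise. The function name and the integer time grid are assumptions for illustration.

```python
def digital_abstraction(signal, t_latch, t_set, t_hold):
    """Abstract `signal` around t_latch to 0, 1, or 'M'.

    signal:  callable returning 0, 1, or 'M' for an integer time
    t_latch: capture time, i.e. the falling clock edge plus t_ofs
    """
    window = range(t_latch - t_set, t_latch + t_hold + 1)
    seen = {signal(t) for t in window}
    if seen == {0}:
        return 0
    if seen == {1}:
        return 1
    return "M"  # changed or undefined within the setup/hold window


# Hypothetical enable signal that rises at t = 100 (arbitrary time units).
rising = lambda t: 1 if t >= 100 else 0
```

Sampling `rising` well before or well after t = 100 yields a clean 0 or 1, whereas a capture time whose window straddles the transition yields M.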
(32) As the outputs $\bar{E}_O^*$ are fed to the module to the left, $b_{i,j-1}^*$ is given in terms of $\bar{E}_O^*$ latched at time $r_{i,j-1}^{\downarrow} + t_{ofs}$.
(33) With this, one may require: Assumption of proper masking.
b.sub.i,j.sup.Sb.sub.i,j.sup.F∈{00,0M,01,M1,11} (I6)
(34) Also, if the element adds delay, one needs the guarantee that the one to the left (providing C.sub.I as its clock output) does the same on the next clock pulse, as otherwise one would have to choose T.sub.s conservatively, defeating the purpose of the present construction. Hence, one also demands: Assumption of delayed input.
$$b_{i,j-1}^F = 0 \Rightarrow r_{i+1,j-1}^{\uparrow} - r_{i,j-1}^{\uparrow} \ge T_l^-. \tag{I7}$$

One now uses $b_{i-1,j}^F$ to decide whether or not to delay the $i$-th clock pulse; $b_{i-1,j}^S$, on the other hand, is used to forward the delay enable. If $b_{i-1,j}^F = M$, one is satisfied with ensuring (C1)-(C3), where (C3) is achieved by guaranteeing that $b_{i-1,j}^F = M \Rightarrow b_{i-1,j}^S = b_{i,j-1}^F = 0$ by masking metastability. If $b_{i-1,j}^S = M$, one guarantees that $b_{i-1,j}^F = 1$ by masking metastability. Both properties together (captured by (C10)) ensure that if a delay enable input causes any delay for a pulse $i$, then it is guaranteed to delay all following pulses by $T_s/Q$ time, which lies at the heart of the correctness proof.

Guarantee of delayed output and delay propagation.

$$b_{i,j}^F = 1 \Rightarrow r_{i+1,j}^{\uparrow} - r_{i,j}^{\uparrow} \ge T_s^+ \tag{C7}$$
$$b_{i,j}^F = 0 \Rightarrow r_{i+1,j}^{\uparrow} - r_{i,j}^{\uparrow} \ge T_l^+ \tag{C8}$$
$$b_{i,j}^S = b \in \{0,1\} \Rightarrow b_{i+1,j-1}^S = b_{i+1,j-1}^F = b \tag{C9}$$
$$b_{i+1,j-1}^S\, b_{i+1,j-1}^F \in \{00, 0M, 01, M1, 11\}. \tag{C10}$$
(35) Formally, the Delay Element is correct if (C5)-(C10) hold, given that (I4)-(I7) hold.
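As an illustration of the masking condition and of how the local enable translates into delay, consider the following Python sketch. It assumes the abstract behavior described earlier (a short delay for a sampled 1, a long delay for a sampled 0) and that a metastable local enable results in a delay anywhere between the two; the function names and numeric values are hypothetical.

```python
VALID_PAIRS = {"00", "0M", "01", "M1", "11"}  # allowed (b^S, b^F) combinations


def properly_masked(b_s, b_f):
    """True if the pair (b^S, b^F) satisfies the masking condition (I6)/(C10)."""
    return f"{b_s}{b_f}" in VALID_PAIRS


def delay_bounds(b_f, ts_min, ts_max, tl_min, tl_max):
    """Bounds on the delay applied for a sampled local enable b^F."""
    if b_f == 1:
        return (ts_min, ts_max)  # fast: short delay
    if b_f == 0:
        return (tl_min, tl_max)  # slow: long delay
    return (ts_min, tl_max)      # metastable: uncertainty becomes delay uncertainty
```

The set of valid pairs excludes, for example, a metastable forwarded value combined with a local 0, as well as the combination of a forwarded 1 with a local 0.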
(36) Finally, the Droop Detector module DD provides a discrete, but potentially unstable or metastable, value indicating whether a droop has occurred; see e.g. [A. Muhtaroglu et al., ibid] for an implementation. To enable the inventive masking strategy, however, the invention uses a high and a low output threshold to generate two signals $\bar{E}_O^*$, $* \in \{S, F\}$, which are fed as $\bar{E}_I^*$ to the rightmost delay element. It is required that (C10) holds for this element; straightforward ways to ensure this are to use two identical detectors with different thresholds, exploiting the assumption that $V_{CC}$ changes at most at rate $K$, or to use a detector with (at least) three-valued output.
(37) Moreover, the detector's output must indicate whether a voltage droop may be imminent. Accordingly, one requires for a correct DD module that if (I2) holds then (C10) (for any i+1∈N and j−1=n), (C11), and (C12) hold: Guarantee of droop detection.
$$V_{CC}(t) < V_{low} + (1 + n/Q)\, T_s K \Rightarrow \bar{E}_O^*(t) = 0 \quad \text{and} \tag{C11}$$
$$V_{CC}(t) \ge V_{high} \Rightarrow \bar{E}_O^*(t) = 1. \tag{C12}$$
(38) The specifics of the implementation of the detector are of no concern here. However, note that it is crucial that the detector's delay is small, as it adds to the response time of the circuit and thus affects the steepness K of droops that can be tolerated. This favors simple implementations.
(39) The requirements (C11) and (C12) yield that the required gap between V.sub.low and V.sub.high is
V.sub.high−V.sub.low>(1+n/Q)T.sub.sK.
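To get a feeling for the magnitude of this gap, the bound can be evaluated numerically. The Python sketch below uses hypothetical values (a chain of n = 7 elements, Q = 4, a period of 1 ns, and a maximum droop slope of 20 mV/ns); none of these numbers is claimed to match a particular design, and the function names are illustrative.

```python
def min_threshold_gap(n, q, ts, k):
    """Lower bound (exclusive) on V_high - V_low in volts."""
    return (1 + n / q) * ts * k


def max_droop_slope(n, q, ts, v_high, v_low):
    """Upper bound (exclusive) on the slope K in V/s for given thresholds."""
    return (v_high - v_low) / ((1 + n / q) * ts)


# Hypothetical numbers: 7 delay elements, Q = 4, T_s = 1 ns, K = 20 mV/ns.
gap = min_threshold_gap(n=7, q=4, ts=1e-9, k=20e6)
slope = max_droop_slope(n=7, q=4, ts=1e-9, v_high=1.0, v_low=0.945)
```

Under these assumptions the thresholds must be more than about 55 mV apart; conversely, a 55 mV gap limits the tolerable slope to just under 20 mV/ns. A longer chain or a slower clock widens the required gap proportionally.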
(40) To show that the FAM-I is a correct implementation of the FAM, it may be proven that all input requirements of the FAM-I's submodules are fulfilled. More formally, for the FAM-I with correct implementations of its submodules and a chain of n≥1 delay elements, it may be proven that if (I1) and (I3) hold, the input requirements (I4), (I5), (I6), and (I7) hold for each delay element.
(41) As a corollary, one obtains that for the FAM-I with correct implementations of its submodules, and a chain of n≥1 delay elements, it may be proven that if (I1) and (I3) hold, property (C1) holds and the output clock hightime is within [T.sub.s.sup.−/2, T.sub.s.sup.+/2]. The output clock period is at most (1+n/Q)T.sub.s.sup.+ and amortized (1+1/Q)T.sub.s.sup.+=T.sub.l.sup.+.
(42) It may now be shown that the FAM-I reacts to voltage droops as required by (C2). From the above, one already has that all delay elements' input and output requirements are fulfilled; specifically delay element n's output guarantees hold. It remains to be shown that the DD module correctly senses a droop and passes on this information to delay element n, which then reacts with an according phase shift.
(43) More formally, for the FAM-I with correct implementations of its submodules, and a chain of n≥1 delay elements, it may be proven that if the delay constraints t.sub.ofs≥t.sub.set and t.sub.ofs+t.sub.hold≤T.sub.s/2, (I1), (I2), and (I3) are fulfilled, then property (C2) holds.
(44) Overall correctness follows from the above together with (I3), i.e., the chain being long enough to ensure that metastability is always resolved before reaching φ.
(45) More formally: if the delay constraints t.sub.ofs≥t.sub.set and t.sub.ofs+t.sub.hold≤T.sub.s/2, (I1), (I2), and (I3) hold, then the FAM-I with correct implementations of its submodules, and a chain of n≥1 delay elements, is correct.
(46) Notably, the chain length n does not influence correctness assuming that no metastability occurs, but is of course relevant to ensure (I3) indeed holds. The delay chain achieves this by acting as a synchronizer chain of length n.
(47) Circuits for the Phase Accumulator φ and the Delay Element that fulfill the modules' specifications will be presented next.
(48) Circuit for Phase Accumulator. The phase accumulator behaves like a phase accumulator in a numerically controlled oscillator (NCO).
(49)
(50) A natural implementation is to provide the phase accumulator with an input clock frequency of $Q/T_s$ and, with each active input clock transition, add a constant phase offset (plus an externally provided potential phase shift), thereby generating the output clock. Such a design, however, has the drawback that a phase accumulator with an output frequency of, say, 2 GHz must internally run a counter at $Q \cdot 2$ GHz $= 8$ GHz (i.e., for $Q = 4$), thereby typically representing the frequency bottleneck of the overall FAM design. In addition, one might want to run the whole frequency adaptation circuit at a higher frequency than the system, as this decreases the time required to respond to a droop; dividing the output clock yields a system clock that adapts even faster to droops, while only a very small part of the circuit runs at the higher frequency.
(51)
(52) Their design is based on a tapped delay-locked loop (DLL) and a MUX that allows to select among the taps, thereby applying the required phase shift; see
(53) Lemma. The circuit φ-DLL-I in
(54) Proof. The DLL, formed by the phase detector PD and the starved inverter chain, makes sure that the tapped inverter outputs $r \in \{0,1,2,3\}$ correspond to clock Clk phase-shifted by $2\pi r/Q$. The 2-bit counter increments modulo $Q$, triggered by the falling output clock edge of $C_O$, given that the delay enable $\bar{E}_I = 0$. From input constraint (I3) one has that $\bar{E}_I$ is either stable 0 or stable 1, but not in transition while being sampled. Each counter increment results in an additional phase shift of $2\pi/Q$ for the next rising clock edge, thereby ensuring (C3). Finally, (C4) is guaranteed by the fact that phase shifts are only applied after falling output clock edges and before the occurrence of the next rising output clock edge, together with input constraint (I1).
(55)
(56)
(57)
(58) Circuit for Delay Element. Consider the circuit DE-I in
t.sub.prop<T.sub.s/2−δ.sub.DE−(T.sub.l−T.sub.s)−t.sub.ofs, (6)
t.sub.set<T.sub.s/2−t.sub.ofs, and t.sub.hold<δ.sub.DE+t.sub.prop. (7)
(59) in, to U4, and finally to PS. The pulse shaper then reshapes the pulse to $T_s/2$ width, which triggers the flip-flop U5 to sample $\bar{E}_I^F$. In the example execution it is assumed that $\bar{E}_I^F = 1$ is sampled, resulting in $b_{i,j}^F = 1$.
(60) Likewise, via U3, to U4, and finally to PS. Again, the pulse shaper reshapes the pulse to $T_s/2$ width, triggering U5 to sample $\bar{E}_I^F$. In the present case $\bar{E}_I^F = 0$ is sampled, resulting in $b_{i,j}^F = 0$.
(61) Observe that this triggers a delayed propagation along the short path. Also note that a metastable U5 cannot delay propagation beyond the slow path. The remainder of the execution is analogous to the cases before.
(62) Lemma. The circuit DE-I in
(63) Proof. One proves the claim by induction over the pulse number i, where apart from the properties (C5)-(C10) it is claimed that U5 and U6 attain states b.sub.i,j.sup.F and b.sub.i,j.sup.S when being latched by the falling outgoing clock edge.
(64) Combining the above, one obtains correctness of the FAM implementation. Note, however, that correctness relies on requirement (I3). Given the present circuit implementation, (I3) corresponds to the fact that the delay enable propagated through the n delay elements from the DD module to the φ module is not metastable when it arrives. From the fact that stable register values are propagated correctly, i.e., again result in stable register states of the element to the left, one deduces that metastability can only propagate through the chain when the register U6 of delay element j resolves exactly when register U6 of element j−1 latches its input; i.e., the chain acts as a synchronizer chain of length n. The overall probability of a failure can thus be bounded analogous to failure of an n-stage synchronizer. Specifically, as the chain of registers contains no logic gates, one can assume that T.sub.w=t.sub.set+t.sub.hold and the available metastability resolution time T.sub.res=nT.sub.s−(n−1)T.sub.w.
(65) For example, one may assume worst-case conditions for the droop detector ($f_d = f_c$). Using the values for common ASIC synchronizers ($\tau = 31.6$ ps, $T_w = 8$ ps) and a chain running at a high clock speed ($n = 5$, $f_c = 4$ GHz), this achieves a good MTBF:
(66)
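The order of magnitude can be estimated with the textbook synchronizer model, in which the mean time between failures is $e^{T_{res}/\tau} / (T_w f_c f_d)$. The Python sketch below applies that commonly used formula to the parameter values quoted above (a resolution time constant of 31.6 ps, $T_w$ = 8 ps, n = 5, $f_c = f_d$ = 4 GHz). It assumes $T_s = 1/f_c$ and uses the expression for $T_{res}$ stated earlier; both the choice of formula and the function names are assumptions, so the result is only an estimate.

```python
import math


def resolution_time(n, t_s, t_w):
    """Available metastability resolution time of an n-stage chain."""
    return n * t_s - (n - 1) * t_w


def synchronizer_mtbf(t_res, tau, t_w, f_c, f_d):
    """Textbook MTBF estimate in seconds for a synchronizer."""
    return math.exp(t_res / tau) / (t_w * f_c * f_d)


f_c = 4e9  # chain clock frequency in Hz; worst case assumes f_d = f_c
t_res = resolution_time(n=5, t_s=1 / f_c, t_w=8e-12)
mtbf_seconds = synchronizer_mtbf(t_res, tau=31.6e-12, t_w=8e-12, f_c=f_c, f_d=f_c)
mtbf_years = mtbf_seconds / (365.25 * 24 * 3600)
```

Because the MTBF grows exponentially with $T_{res}$, adding one more delay element (and thus roughly one more period of resolution time) increases the estimate by many orders of magnitude.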
(67) Apart from the delay constraints, this is the only technology-dependent aspect of the inventive approach. Hence, it is very easy to transfer the inventive design to different technologies. In particular, the length of the delay chain is simply the length of a synchronizer chain of sufficient MTBF for the respective technology and application.
(68) The previous construction used two backward rails, which essentially propagate the same signal, but with different masking applied. It imposes the requirement that the droop detector provides two output signals, only one of which may induce metastability of the corresponding storage element when it is latched. While the constraint on the output of the detector may be straightforward to satisfy, it has negative impact on performance: To guarantee that not both capturing storage elements become metastable, the respective voltage thresholds for when the detector's outputs transition between 0 and 1 need to be sufficiently separated; however, via constraints (C11) and (C12), this entails that K (i.e., the maximum steepness of a droop) or the difference between V.sub.high and V.sub.low (and thus the minimum voltage under which a clock period of T.sub.s is sufficient) becomes smaller.
(69) According to a further embodiment of the invention, one can simplify the interface to the droop detector and resolve this performance issue at the same time. The general idea is to separate the flip-flops U5 and U6 of the delay element into their constituent latches, “merge” the master latches into one, and ensure the separation by exploiting that, when intransparent, the (single) master latch can only stabilize either to 0 or to 1 (as opposed to the two master latches of the flip-flops U5 and U6 from
(70)
(71) To formalize this, the specifications of the droop detector and delay element modules are adapted to match the system description given by
(72) In the following, all flip-flop parameters refer to the flip-flops given by the master/slave pairs U7/U5 and U7/U6, respectively, which are assumed to be equal due to symmetry.
(73) Module φ (Phase Accumulator). The specification of the phase accumulator remains unchanged.
(74) Simplified Delay Element (sDE). The delay element has clock input $C_I$ and clock output $C_O$. It receives a delay enable input $\bar{E}_I$ and provides a delay enable output $\bar{E}_O$.
(75) In order to specify the delay element similarly to before, it is most convenient to specify $b_{i,j}^*$ similarly as well. However, these values are now derived from the same input signal $\bar{E}_I$, with metastability masking taking place entirely within the element. Accordingly, with the same definitions of $r_{i,j}^{\uparrow}$ and $r_{i,j}^{\downarrow}$ as before, one integrates (C10) into the definition:
(76)
(77) A correct (modified) delay element then guarantees (C5)-(C9), granted that (I4), (I5), and (I7) hold.
(78) Simplified Droop Detector (sDD). The specification of the droop detector is changed so that there is only a single output $\bar{E}_O$ that needs to satisfy (C11′) and (C12′).

Guarantee of droop detection.

$$V_{CC}(t) < V_{low} + (1 + n/Q)\, T_s K \Rightarrow \bar{E}_O(t) = 0 \tag{C11′}$$
$$V_{CC}(t) \ge V_{high} \Rightarrow \bar{E}_O(t) = 1 \tag{C12′}$$
(79) Correctness of the sFAM-I given in
(80) Corollary. If t.sub.ofs≥t.sub.set, t.sub.ofs+t.sub.hold≤T.sub.s/2, (I1), (I2), and (I3) hold, then the sFAM-I in
(81) Concerning the implementation of the modified delay elements given in
t.sub.prop<T.sub.s/2−δ.sub.DE−(T.sub.l−T.sub.s)−t.sub.ofs,
t.sub.set<T.sub.s/2−t.sub.ofs, and t.sub.hold<δ.sub.DE+t.sub.prop.
(82) Corollary. The circuit sDE-I in
(83) The circuit was implemented and simulated by the inventors, both in a high-level logic simulator using VHDL and in Spice, in three different variants, demonstrating that the required design constraints can be met for clock frequencies between 1 GHz and 3.3 GHz in 65 nm.
(84) Based on the circuit specification and constraints above, the design entry in VHDL followed a standard approach. All sub-circuits used back-annotated gate delays, after synthesis in the UMC 65 nm process, and their constraints were met.
(85) For synthesis, all flip-flops and gates were used from the UMC library. Delay elements were modeled using chains of minimal sized inverters with small RC elements in between (in the order of 100Ω and 10 fF, respectively).
(86) The first variant is using the phase accumulator φ-Div-I as discussed in [Függer et al., 2018, ibid] with a 4 GHz input clock resulting in a 1 GHz output clock frequency.
(87) As expected from the circuits presented in Section 3, the critical path is in the Module φ, the phase accumulator, as this part of the circuit runs at four times the clock frequency of the remaining parts. For maximum speed, the proper alignment of 4*Clk and 1*Clk is vital. The delay added on 1*Clk and 4*Clk by a naively implemented divide-by-4 circuit would easily consume the slack at the inputs of U1 and U2. In case this is handled properly, the critical path in the circuit is the loop from U2, C.sub.O, via the up-counter and its output r[1:0] back to the multiplexer and the inputs of U1 and U2. The simulations showed that the circuit could be clocked well in excess of 4.5 GHz resulting in an output clock frequency of over 1.1 GHz. Adding some margins, it was decided to use a clock of 1 GHz for the simulations.
(88) The complete circuit consists of the phase accumulator φ-Div-I as shown in
(89)
(90)
(91) The top-most graph shows the supply voltage and its drop to 0.95 V. The second graph “E” shows the simulated output of the droop detector. A delay of 1 ns was assumed for the droop detector. The third graph “C_out” denotes the clock output of the inventive circuit. The remaining graphs are pairs of the delay enable and clock signals passed between the delay elements, with corresponding signals shown in the same color, backwards from the clock output to the phase accumulator: “E7” and “C7” are the enable and clock signal between the last and second last delay element, “E6” and “C6” the signals between the second and third last, etc. The signals “E1” and “C1” are between the phase accumulator and the first delay element.
(92) As can be seen, the output clock frequency adapts to the droop detect signal within a single clock cycle, both at the start and the end of the droop. The delay enable trickles backward in the chain and finally gets absorbed in the phase accumulator. As the droop lasts for approximately 9 clock cycles, this results in two clock cycles being dropped. Note that, because there are seven delay elements in the chain, the phase accumulator has just seen the delay enable signal by the time the droop is over. Yet the output clock immediately resumes its high-frequency operation, thus minimizing the performance impact of the droop.
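The relation between the droop length and the number of dropped cycles can be reproduced with a small behavioral model. It is a sketch only, assuming that every delayed clock cycle adds one quarter of a period to the accumulated phase (in line with the 2-bit counter output r[1:0] of φ-Div-I) and that a cycle is skipped whenever a full period has accumulated; the function name is hypothetical.

```python
def dropped_cycles(delayed_cycles: int, steps_per_period: int = 4):
    """Behavioral sketch of a phase accumulator.

    Each delayed cycle adds one phase step (assumed: 1/steps_per_period of
    a clock period). When a full period has accumulated, one clock cycle
    is skipped and the phase wraps around.
    Returns (number of skipped cycles, residual phase in steps).
    """
    phase = 0
    dropped = 0
    for _ in range(delayed_cycles):
        phase += 1
        if phase == steps_per_period:
            phase = 0
            dropped += 1
    return dropped, phase


if __name__ == "__main__":
    # A droop lasting about 9 clock cycles, as in the simulation above.
    print(dropped_cycles(9))  # (2, 1)
```

With these assumptions, nine delayed cycles accumulate two full periods plus one residual quarter step, which matches the two dropped clock cycles observed in the simulation.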
(93) In the simulation, output requirement (C2) is violated. This is a consequence of the almost instantaneous voltage droop, violating constraint (I2); there is no time for the circuit to react before the voltage is too low. If the droop were less steep, the voltage would remain sufficiently high until the clock speed is adapted. However, the input given here was deliberately chosen to clearly visualize the response time between the detection threshold V.sub.low+(1+n/Q)T.sub.sK (cf. (C11)) being reached and sFAM-I adapting the clock period, which is independent of the steepness of the droop.
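The threshold expression above can be evaluated numerically to see how the required margin grows with the steepness K. This merely evaluates the formula as written; the symbols keep the meaning they have in (C11), which is not restated here, and all numeric values, units and the function name are hypothetical.

```python
def detection_threshold_v(v_low: float, n: int, q: int,
                          t_s_ns: float, k_v_per_ns: float) -> float:
    """Evaluate V_low + (1 + n/Q) * T_s * K.

    Units are an assumption of this sketch: volts, nanoseconds and
    volts per nanosecond.
    """
    return v_low + (1 + n / q) * t_s_ns * k_v_per_ns


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the trend in K.
    for k in (0.005, 0.01, 0.02):
        v = detection_threshold_v(v_low=0.95, n=7, q=4,
                                  t_s_ns=1.0, k_v_per_ns=k)
        print(f"K = {k:.3f} V/ns -> threshold {v:.4f} V")
```

The margin above V.sub.low scales linearly with K, so a steeper droop calls for a higher detection threshold, while the reaction time of the circuit itself stays the same.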
(94) Replacing the phase accumulator with the implementation based on a tapped DLL, φ-DLL-I from [K. Wilcox et al., ibid], the clock speed can be increased significantly, as the phase accumulator does not need to run at four times the clock frequency of the remaining circuit.
(95) Using this approach, the phase accumulator can operate at frequencies well above 4 GHz. The element that limits the clock frequency thus shifts from the phase accumulator to the delay elements DE-I or sDE-I, respectively, and the pulse shaping module.
(96) The delay selection within the delay element needs to ensure that the delay difference between the two paths, T.sub.l−T.sub.s, matches the delay steps within the phase accumulator φ-DLL-I, i.e. that prolonging the current clock cycle does not cut into the next clock cycle and violate (I4). A major problem here is the slight asymmetry in rise and fall times of the gates and their slew-rate-dependent delay. Both lead to a change of high and low times of the clock pulses, which has to be compensated by the pulse shaping module, thus reducing its slack. The slew rate dependence also induces different delays during droops, which in turn requires additional slack in order to satisfy (I4) and (I5).
(97) Similar problems arise from the delay paths in the pulse shaping module, although in this case the issue is to match the delays of the first and second half of the pulse shaper. In order to get a well-defined output pulse, the pulse shaping module is required to generate a pulse with a high time of T.sub.s/2. Additionally, the second half needs to have a delay strictly lower than the first half. Matching the delay elements such that, including the delays through the NAND gates, these conditions are met for both the rising and falling edge, even during a droop and with changing slew rates, ultimately limits the clock frequency in the present implementation.
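The matching requirement on T.sub.l−T.sub.s can be expressed as a simple check over operating corners. The sketch below assumes that the long-path delay, the short-path delay and the phase accumulator step have been extracted (e.g. from simulation) for each corner; the function names and all numbers are hypothetical.

```python
def step_mismatch_ps(t_long_ps: float, t_short_ps: float,
                     phase_step_ps: float) -> float:
    """Difference between the delay element's step (T_l - T_s) and the
    phase accumulator's step, in picoseconds."""
    return (t_long_ps - t_short_ps) - phase_step_ps


def worst_mismatch_ps(corners) -> float:
    """Largest absolute mismatch over all (T_l, T_s, phase step) corners.

    This is the amount of slack that would have to be reserved so that a
    prolonged cycle does not cut into the next one.
    """
    return max(abs(step_mismatch_ps(*corner)) for corner in corners)


if __name__ == "__main__":
    corners = [
        (400.0, 300.0, 100.0),  # nominal supply: steps matched
        (445.0, 333.0, 104.0),  # during a droop: delays grow unequally
    ]
    print(worst_mismatch_ps(corners))  # 8.0
```

In this hypothetical example the steps are matched at nominal supply, but the droop corner leaves an 8 ps mismatch that has to be covered by additional slack.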
(98) Pushing the circuit close to its limit (or rather beyond its limit, see below), a maximum clock frequency of 3.3 GHz was achieved, limited by the stability of the pulse shaping module.
(99)
(100) To ensure a more realistic setting, a faster droop detection circuit with a reaction time of 300 ps was also assumed. The graph shows nicely that the circuit quickly responds to the droop, just like variant 1.
(101) However, there are two issues with the circuit's behavior. First, clearly visible, there is a gap with lost clock pulses in the output clock, forming around 8 ns. The source of this issue is the restoration of the proper supply voltage, which speeds up the buffers in the delay line of the φ-DLL-I, so that there is one shortened clock cycle. The rise/fall time dependence in the delay elements and pulse shapers then leads to a contraction of the low part of the clock pulse. The contraction continues until one pulse shaper rejects the pulse, which leads to the gap. One may expect this to be another consequence of violating (C2), as the power supply rising from 0.95 V to its nominal 1.1 V within 10 ps results in too rapid a change in switching times. However, the effect remains pronounced even when assuming a smaller slope of the droop.
(102) Secondly, there is an overly short pulse, barely visible, in C.sub.out around 5.8 ns. Its source is a slightly too long high time at the output of C2 at about 3.8 ns. This is due to a slight mismatch in the delays of rise and fall times within the delay element, which becomes exaggerated by the voltage droop.
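The mechanism behind the gap, a low phase that shrinks until a pulse shaper rejects it, can be illustrated with a toy model. It assumes, purely for illustration, that each stage shortens the low time by a fixed amount and that a pulse is rejected once its low time falls below a minimum width; the function name and all numbers are hypothetical, and only the chain length of seven is taken from the simulated circuit.

```python
from typing import Optional


def first_rejecting_stage(low_time_ps: float,
                          contraction_per_stage_ps: float,
                          min_low_time_ps: float,
                          stages: int) -> Optional[int]:
    """Return the 1-based index of the first stage that rejects the pulse,
    or None if the pulse survives the whole chain.

    Assumption of this sketch: the contraction is constant per stage.
    """
    for stage in range(1, stages + 1):
        low_time_ps -= contraction_per_stage_ps
        if low_time_ps < min_low_time_ps:
            return stage
    return None


if __name__ == "__main__":
    # Small contraction: the pulse passes all seven stages.
    print(first_rejecting_stage(150.0, 5.0, 100.0, stages=7))   # None
    # Larger contraction: the pulse is rejected part-way through the chain.
    print(first_rejecting_stage(150.0, 12.0, 100.0, stages=7))  # 5
```

Even a small per-stage asymmetry accumulates along the chain, which is why a single shortened cycle can end in a lost pulse several stages later.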
(103) Both issues show that the circuit, although the initial design suggested that it should work correctly, violates the constraints of the inventive design. This is a consequence of the standard design approach not taking into account the dynamic voltage conditions during a droop. In principle, using conservative bounds on the timing behavior of circuit elements within a voltage range of [V.sub.min; V.sub.max] and a (maximum) rate of change of input voltage of K is a feasible approach to ensure correct operation of the circuit. Unfortunately, such conservative bounds would substantially reduce the frequency at which the inventive FAM implementation can run. Thus, special care has to be taken in order to ensure correct and fast operation of the circuit under dynamic conditions.
(104) According to the invention, there are two ways to handle the problems induced by the supply voltage. One is to use circuits whose speed depends less on power supply voltage variations, e.g. current mode logic (CML). On the one hand, the increased static power consumption of CML is of less concern here than in more general circuits, as this circuit is constantly switching. On the other hand, the potentially higher supply voltage requirements might be an issue. Regardless of the solution employed, such an approach is likely to be more technology-dependent and is beyond the scope of this description.
(105) In contrast, the second approach is very straightforward. One may use a separate, stabilized power supply for the clock circuit to avoid the performance impact of varying supply voltage. This is common practice for clocking circuits and, due to the relatively small size of the present FAM implementations, much easier to achieve than a stable power supply for the entire chip.
(106)
CONCLUSION
(107) High-frequency voltage droops consume a significant fraction of the clock guardband. A circuit was proposed that makes it possible to react to steep and high-amplitude droops without the need to halt the clock. The circuit is based on detecting droops and propagating this information along a delay line, back to a DCO that accounts for the respective phase offset. The clock signal travels in the opposite direction through the delay line. Care had to be taken in handling metastability: embodiments of the invention make use of masking flip-flops, ensuring that no glitches are introduced in the clock signal.
(108) The inventive design may be verified by correctness proofs. It was synthesized in UMC 65 nm, and VHDL and Spice simulations were run with a 1 GHz and 3.3 GHz input clock, respectively; the results are in accordance with theoretical predictions. At high speeds, second-order effects of the circuit become an issue, and an appropriate design methodology has to be chosen to counteract those effects.