Method and apparatus for performing clock and data recovery (CDR)

11108536 · 2021-08-31

Abstract

A method for implementing efficient clock recovery in a multilane high-speed Serializer/Deserializer (SerDes) system having M interleaved lanes, wherein the clock recovery has a non-recursive architecture.

Claims

1. A method for efficient multilane Serializer/Deserializer (SerDes) systems, the method comprising: receiving a data stream, wherein the data stream comprises a plurality of interleaved lanes, wherein each of the plurality of interleaved lanes comprises a plurality of samples; determining, via a non-recursive architecture, an error signal for each of the plurality of interleaved lanes, wherein the error signal for each of the plurality of interleaved lanes is available during a full interleaving cycle; generating a correction signal based upon the error signal for each of the plurality of interleaved lanes; and sampling the data stream based upon the correction signal.

2. The method according to claim 1, wherein determining the error signal for each of the plurality of interleaved lanes further comprises executing an early-late algorithm for each lane.

3. The method according to claim 1, wherein determining the error signal for each of the plurality of interleaved lanes further comprises employing a Time Error Detection (TED) algorithm.

4. The method according to claim 3, wherein the TED algorithm further comprises an early-late algorithm.

5. The method according to claim 1, wherein the full interleaving cycle comprises a time during which the error signal for each of the plurality of interleaved lanes is determined.

6. The method according to claim 1, wherein the plurality of samples are extracted from an analog signal.

7. The method according to claim 6, further comprising generating a digital signal based upon the sampled data stream.

8. The method according to claim 1, wherein sampling the data stream based upon the correction signal further comprises augmenting a sampling rate of Clock and Data Recovery (CDR) circuitry based upon the correction signal.

9. An efficient multilane Serializer/Deserializer (SerDes) system comprising: circuitry configured to receive a data stream, wherein the data stream comprises a plurality of interleaved lanes, wherein each of the plurality of interleaved lanes comprises a plurality of samples; a non-recursive architecture configured to determine an error signal for each of the plurality of interleaved lanes, wherein the error signal for each of the plurality of interleaved lanes is available during a full interleaving cycle; circuitry configured to generate a correction signal based upon the error signals for each of the plurality of interleaved lanes; and circuitry configured to sample the data stream based upon the correction signal.

10. The system according to claim 9, wherein the non-recursive architecture is configured to determine the error signal for each of the plurality of interleaved lanes by executing an early-late algorithm for each lane.

11. The system according to claim 9, wherein the non-recursive architecture is configured to determine the error signal for each of the plurality of interleaved lanes by employing a Time Error Detection (TED) algorithm.

12. The system according to claim 11, wherein the TED algorithm further comprises an early-late algorithm.

13. The system according to claim 9, wherein the full interleaving cycle comprises a time during which the determination of the error signal for each of the plurality of interleaved lanes occurs.

14. The system according to claim 9, wherein the plurality of samples are extracted from an analog signal.

15. The system according to claim 14, further comprising generating a digital signal based upon the sampled data stream.

16. The system according to claim 9, wherein the circuitry configured to sample the data stream based upon the correction signal is further configured to augment a sampling rate of Clock and Data Recovery (CDR) circuitry based upon the correction signal.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) In the drawings:

(2) FIG. 1 is a block diagram of a prior art CDR solution for a single lane;

(3) FIG. 2 is a block diagram of a prior art CDR with M parallel lanes;

(4) FIG. 3 shows an existing CDR implementation scheme for one lane;

(5) FIG. 4 shows a single lane modified CDR scheme;

(6) FIG. 5A is an exemplary embodiment of an interleaved 40-lane CDR according to the invention;

(7) FIG. 5B shows an exemplary configuration of an operator of FIG. 5A, for n=0;

(8) FIG. 5C shows an exemplary configuration of another operator of FIG. 5A, for n=0; and

(9) FIG. 6 is an alternative exemplary embodiment of an interleaved 40-lane CDR according to the invention.

(10) In the drawings, “Pulse Gain Detector Output”, “VCO_gain”, and “Kpd”, are used interchangeably.

DETAILED DESCRIPTION OF THE INVENTION

(11) In many single-lane systems, a CDR circuit is used at the receiver end to recover the serial clock, align the sampling times, and sample the symbol waveforms at some “optimal” instants. In the case of a single-lane system, “optimal” may be taken in the sense that the average error rate on this single lane is minimized. This requires both recovering the clock frequency and dynamically adjusting the sampling instant in order to compensate on-the-fly for the effect of jitter on the incoming data. To compensate for the relative frequency drift between the transmit clock and the receive clock, which results in phase accumulation, the CDR must include an integrator, and is therefore implemented using a 2nd-order Phase Locked Loop (PLL). One effective prior art CDR solution for a single lane is shown in FIG. 1. In the circuit illustrated in the figure, a source clock 10, at a frequency close to the estimated symbol rate (or, without loss of generality, a multi-phase set of clocks at a lower frequency), is fed into a phase interpolator (PI) 11. The analog signal is sampled by the Analog-to-Digital Converter (ADC) 12 according to the timing provided by the transitions of the clock 10, fine-adjusted on-the-fly by the phase interpolator 11. The phase interpolator 11 can introduce delays in the clock waveform, thus both tuning the average clock frequency and dynamically controlling the instant when the ADC samples the analog signal (indicated by arrow 13). Based on the sampled values, the digital signal is recovered, and the “phase error” (time error) with respect to the “optimal” sampling instant is estimated. In turn, the phase error is fed into the 2nd-order loop 14, thus acting as a correction signal that makes the system stabilize so that the slow-moving average phase error value over the ongoing serial stream of symbols is minimized. This stable state is referred to as one in which the PLL is “locked”.

(12) However, when the system must transfer interleaved data, namely when it implements parallel lanes, averaging the error over the whole serial data may not be desirable, since it may result, for instance, in one lane working with a low error rate while another works marginally or not at all. Moreover, in the case of a single data channel mentioned in the background section, one may want to exploit the underlying lanes to reduce the overall noise level. Thus, in order to carry out a correct optimization, one must continuously detect the data from all the lanes in parallel, compute the phase error for each lane separately, and then build a correction signal that makes the system stabilize so that all the lanes are optimized in some sense; alternatively, different optimization priorities may be given to different lanes. This dictates modifying the CDR of FIG. 1 to the form schematically shown in FIG. 2. In this figure, as well as throughout this description, the same numerals are used to indicate the same components. However, as further illustrated hereinafter, the implementation of the multi-lane system in the form of FIG. 2, which is based on a 2nd-order loop architecture, is extremely demanding in terms of both hardware duplication and cost. The invention seeks to remedy this problem by providing a minimal-hardware and low-cost solution.

(13) FIG. 3 shows an existing CDR implementation scheme for one lane, which corresponds to the block diagram of FIG. 1 and, without loss of generality, uses a specific error-computing algorithm that will be discussed later. As shown in the figure, the serial symbols enter the ADC 12 and are sampled according to the PI-controlled clock 10. The time error detector consists of a pair of Finite Impulse Response (FIR) filters, 30 and 30′, staggered by one sample shift, which effectively yields the difference between symbol samples spaced approximately one clock period apart. The FIR filters 30 and 30′ are shown as 4-tap implementations for the sake of simplicity, but may be implemented with any desired number of taps. The FIR outputs are subtracted (as shown at 31), and the resulting value is multiplied by the signed value of the detected symbol voltage. This arrangement performs a mathematical action equivalent to an absolute operator. The resulting value 32 entering the 2nd-order loop is the “time error” (phase error) signal 15 of FIG. 1. The phase error so generated makes the PLL system 14 stabilize to the “lock” state mentioned hereinbefore.

(14) In order to better illustrate an exemplary embodiment of the interleaved multi-lane CDR according to the invention, a few modifications are introduced into the existing single-lane implementation. The improved schematic diagram is shown in FIG. 4. The modifications introduced in FIG. 4, which will be easily understood by the skilled person, are as follows:
a) The upper FIR filter B(z), indicated by numeral 40, may be any interpolation filter. One possible filter implementation is the “raised cosine” form. The raised cosine filter is well suited because its impulse response (IR) decays rapidly, so that few coefficients suffice for effective filtering, and its shifted IRs suffer no inter-symbol interference (ISI).
b) The lower filter Flip(B(z)), indicated by numeral 41, is a “flipped” version of the upper filter. In other words, if the coefficients of B(z) 40 are {b.sub.0, b.sub.1, b.sub.2, b.sub.3}, and the coefficients of Flip(B(z)) 41 are {c.sub.0, c.sub.1, c.sub.2, c.sub.3}, then c.sub.0=b.sub.3, c.sub.1=b.sub.2, c.sub.2=b.sub.1, and c.sub.3=b.sub.0. Using this approach, B(z) and Flip(B(z)) make it possible to implement the anti-symmetric filter required for the time error detector (TED) while reducing the filter complexity.
c) The block denoted by Sign(x.sub.−2) and indicated by numeral 42 simply returns the values ±1, according to the sign of the sample value that multiplies the coefficient b.sub.2 of B(z) 40. This provides a good estimate of the sign of the sample; moreover, since no ISI is present, the middle point of the IR of a raised cosine filter coincides with one single sample. If the value whose sign is returned belongs to the sample x.sub.n, the returned sign is denoted as Q(x.sub.n)∈{±1}. For 4 taps, on average, the sign of x.sub.−2 can replace the absolute operator mentioned before.
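The anti-symmetry obtained from B(z) and Flip(B(z)) in item b) can be checked numerically. The following is a minimal sketch and not the circuit of FIG. 4: the coefficient values are hypothetical (chosen only because they are exactly representable in floating point), and the helper names `flip` and `combined_ted_taps` are illustrative assumptions.

```python
def flip(taps):
    """Coefficients of Flip(B(z)): the taps of B(z) in reverse order."""
    return list(reversed(taps))


def combined_ted_taps(b):
    """Taps of B(z) - z^-1 * Flip(B(z)).

    The returned list applies to the samples [x_0, x_-1, x_-2, x_-3, x_-4]
    (newest first), because the lower filter is staggered by one sample.
    """
    c = flip(b)
    upper = list(b) + [0.0]   # B(z) acts on x_0 .. x_-3
    lower = [0.0] + c         # Flip(B(z)) acts on x_-1 .. x_-4
    return [u - l for u, l in zip(upper, lower)]


# Hypothetical 4-tap coefficients {b_0, b_1, b_2, b_3}; not from the source.
B_TAPS = [0.125, 0.75, 1.0, 0.25]

taps = combined_ted_taps(B_TAPS)
# Expected structure: [b_0, b_1 - b_3, 0, -(b_1 - b_3), -b_0]
print(taps)
```

In this sketch the center tap cancels (b.sub.2 − b.sub.2 = 0) and the remaining taps are mirror images with opposite signs, so only b.sub.0 and the difference b.sub.1 − b.sub.3 need to be realized as multipliers.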

(15) A detailed analysis of the operation of the CDR of FIG. 4 will assist in better understanding the invention. The following analysis assumes that the input to ADC 12 consists of a sequence of interleaved multi-lane symbols. In fact, the circuit of FIG. 4 treats the interleaved input as if it were a single-lane input, and behaves according to the single-lane block diagram of FIG. 1. In the following description, the index 0 is used to denote present values, and negative indices to denote previous values. However, it should be understood that in all that follows, the analysis holds for any set of (time) shifted indices.
a) For the sake of clarity, the analysis is carried out for a set of five consecutive input samples {x.sub.−4, x.sub.−3, x.sub.−2, x.sub.−1, x.sub.0}. These samples constitute the sequential inputs delivered by the ADC 12 to the block 43 (Time Error Detector (TED)).
b) The present value that constitutes the input to block 44 (2nd-order loop (PLL)) is denoted as PLL_in.sub.0. This value is derived in a straightforward way from FIG. 4 and has the form
$$\mathrm{PLL\_in}_0 = Q(x_{-2})\left[(x_0 b_0 + x_{-1} b_1 + x_{-2} b_2 + x_{-3} b_3) - (x_{-1} c_0 + x_{-2} c_1 + x_{-3} c_2 + x_{-4} c_3)\right]$$

(16) Substituting c.sub.0=b.sub.3, c.sub.1=b.sub.2, c.sub.2=b.sub.1, and c.sub.3=b.sub.0 for the coefficients of the flipped filter, the b.sub.2 terms cancel and one finally obtains
$$\mathrm{PLL\_in}_0 = Q(x_{-2})\left[b_0 (x_0 - x_{-4}) + (b_1 - b_3)(x_{-1} - x_{-3})\right]$$

(17) It should be noted that only the sign of the central sample x.sub.−2 (not its amplitude) has an effect in this expression.
c) PLL_in.sub.0 is in fact the “time error” (phase error) signal 15 of FIG. 1, and corresponds to a well-known algorithm denoted as “Early-Late”, which is based on the assumption that if there is a point near x.sub.−2 where a symbol pulse has maximal absolute amplitude, then samples taken at symmetrical distances from x.sub.−2 should have similar amplitudes. PLL_in.sub.0 takes the following values:
1) Its value is zero if the PI 11 has set the clock position so that the sampling occurs at the point of maximal absolute amplitude of the symbol at the ADC 12 input. This is the desired sampling instant, since it is the point where the “pulse narrowing/expanding” effect due to jitter and drift has minimal influence on the amplitude.
2) Its value is positive if the PI 11 has set the clock position so that the sampling occurs before the symbol at the ADC 12 input reaches its maximal absolute amplitude. This is denoted as an “Early” sampling.
3) Its value is negative if the PI 11 has set the clock position so that the sampling occurs after the symbol at the ADC 12 input reaches its maximal absolute amplitude. This is denoted as a “Late” sampling.
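The Early-Late behavior listed above can be illustrated with a short numerical sketch. Several things here are assumptions rather than content of the source: an isolated Gaussian pulse stands in for the symbol waveform, the sample spacing is one symbol period, and the filter coefficients are hypothetical values chosen with b.sub.0 > 0 and b.sub.1 > b.sub.3 so that the sign convention of items 1) to 3) holds. The function names are illustrative.

```python
import math


def sign(v):
    """Q(x): +1 for a non-negative sample, -1 otherwise."""
    return 1.0 if v >= 0 else -1.0


def ted_two_filters(x, b):
    """Direct form: B(z) minus the one-sample-staggered Flip(B(z)).

    x is ordered [x_-4, x_-3, x_-2, x_-1, x_0].
    """
    xm4, xm3, xm2, xm1, x0 = x
    c = b[::-1]  # flipped coefficients c_k = b_(3-k)
    upper = x0 * b[0] + xm1 * b[1] + xm2 * b[2] + xm3 * b[3]
    lower = xm1 * c[0] + xm2 * c[1] + xm3 * c[2] + xm4 * c[3]
    return sign(xm2) * (upper - lower)


def ted_simplified(x, b):
    """Simplified form after substituting the flipped coefficients."""
    xm4, xm3, xm2, xm1, x0 = x
    return sign(xm2) * (b[0] * (x0 - xm4) + (b[1] - b[3]) * (xm1 - xm3))


def pulse_samples(offset, amplitude=1.0, width=1.0):
    """Five samples of an assumed Gaussian pulse peaking at time 0.

    The central sample x_-2 is taken at time `offset` (in symbol periods);
    a negative offset means sampling before the peak.
    """
    return [amplitude * math.exp(-((offset + k) ** 2) / (2 * width ** 2))
            for k in (-2, -1, 0, 1, 2)]


# Hypothetical coefficients with b_0 > 0 and b_1 > b_3; not from the source.
B_TAPS = [0.125, 0.75, 1.0, 0.25]

on_time = ted_simplified(pulse_samples(0.0), B_TAPS)
early = ted_simplified(pulse_samples(-0.2), B_TAPS)
late = ted_simplified(pulse_samples(0.2), B_TAPS)
early_negative_symbol = ted_simplified(pulse_samples(-0.2, amplitude=-1.0), B_TAPS)
```

Under these assumptions, `on_time` is zero, `early` is positive, and `late` is negative; `early_negative_symbol` is also positive, which shows the role of the factor Q(x.sub.−2) in making the result independent of the symbol polarity.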

(18) In view of the above description, when the circuit of FIG. 4 is used with a multi-lane interleaved input, the error signal again makes the (phase interpolator-controlled) clock position stabilize so that the global average error is optimized, which leaves the multi-lane optimization problem unsolved. The reason lies in the fact that the circuit of FIG. 4 has a recursive architecture, in which a full re-computation is carried out at each new step, and all the values of previous states are lost. In order to perform a multi-lane optimization, one needs to implement hardware resources for each parallel lane, so as to keep the values belonging to all the sequential states for all the lanes until a full interleaving cycle has completed. Thus, if one wishes to use a CDR circuit scheme similar to FIG. 4 to perform an optimization over M lanes, several high-speed circuits, including multipliers, adders, and memories, must be duplicated M times, which results in a large amount of high-speed hardware, with the associated cost and current consumption.

(19) The invention addresses the abovementioned problem by providing multi-lane CDR circuits that have a non-recursive architecture while still performing the PLL action as before. To illustrate how this is done, an accurate mathematical expression describing the PI input as a function of the input samples from the ADC must first be established. Accordingly, the invention provides a non-recursive hardware circuit that performs the same PLL task as in FIG. 4, together with multi-lane optimization, while requiring a modest hardware investment as compared to the prior art. In the context of this invention, the term “non-recursive architecture” refers to a hardware architecture adapted to keep the values belonging to all the sequential states for all the lanes available until a full interleaving cycle has completed. This result is accomplished, inter alia, by a thorough analysis of the recursive behavior of FIG. 4, followed by rearranging, swapping, and consolidating adders and multipliers so as to lower the number of operators, thus leading to an economical hardware implementation. The invention will be illustrated hereinafter through exemplary embodiments thereof, it being understood that it admits different practical hardware solutions, and therefore the embodiments described herein are merely illustrative and are not intended to limit the invention in any way.

(20) Referring now to FIG. 4, the lower branch in block 44 (2nd-order loop (PLL)) consists of an integrator, and the final integration value is found at the output of the delay block denoted by z.sup.−1.
a) The final integration value resulting at the end of the previous interleaving cycle is denoted by xi.sub.−1.
b) Kpd is a multiplying factor that translates amplitude to phase. For simplicity, and for the purposes of this explanation, it can be taken to equal unity.
c) The l-th recursive value at the output of block 44 is denoted by PLL_out.sub.l.
d) Block 45 (ACC phase) is an adder that sums up the recursive values PLL_out.sub.l.
e) The value at the input of the phase interpolator PI 11 at the end of the present full interleaving cycle is denoted by PI.sub.in.

(21) A straightforward computation of the signal PI.sub.in yields the following result (Eq. 1):

(22)
$$
\begin{aligned}
PI_{in} &= \sum_{l=0}^{M-1} \mathrm{PLL\_out}_l \\
&= M \cdot (K_i \cdot \mathit{xi}_{-1}) + (K_i + K_p) \cdot \sum_{l=0}^{M-1} \mathrm{PLL\_in}_l + K_i \cdot \sum_{l=0}^{M-1} \big((M-1) - l\big) \cdot \mathrm{PLL\_in}_l \\
&= M \cdot (K_i \cdot \mathit{xi}_{-1}) + \sum_{l=0}^{M-1} \mathrm{PLL\_in}_l \cdot \big(K_p + K_i (M - l)\big)
\end{aligned}
$$

(23) The above result may be rearranged in the form (Eq. 2):
$$
\begin{aligned}
PI_{in} = {} & M \cdot (K_i \cdot \mathit{xi}_{-1}) \\
& + (K_i + K_p) \cdot \{ b_0 \cdot [Q(x_{M-3}) \cdot (x_{M-1} - x_{M-5})] + (b_1 - b_3) \cdot [Q(x_{M-3}) \cdot (x_{M-2} - x_{M-4})] \} \\
& + (2K_i + K_p) \cdot \{ b_0 \cdot [Q(x_{M-4}) \cdot (x_{M-2} - x_{M-6})] + (b_1 - b_3) \cdot [Q(x_{M-4}) \cdot (x_{M-3} - x_{M-5})] \} \\
& + \dots \\
& + (M \cdot K_i + K_p) \cdot \{ b_0 \cdot [Q(x_{-2}) \cdot (x_0 - x_{-4})] + (b_1 - b_3) \cdot [Q(x_{-2}) \cdot (x_{-1} - x_{-3})] \}
\end{aligned}
$$

(24) Equation 2 can be implemented using circuits based on non-recursive elements, so that the values for each state in the interleaving cycle are preserved during the whole cycle, while the multiplications and additions are rearranged and recombined, thereby reducing the number of required hardware operators.
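The equivalence between the step-by-step loop and the closed form of Eq. 1 can be checked with a minimal sketch. The loop model used here is an assumption inferred from the form of Eq. 1, since FIG. 4 is not reproduced in this text: the integrator accumulates PLL_in, the loop output is K.sub.p times the input plus K.sub.i times the integrator state, and Kpd is taken as unity. The function names and numeric values are illustrative.

```python
def recursive_pi_in(pll_in, kp, ki, xi_prev):
    """Run the assumed 2nd-order loop one step at a time.

    Each step overwrites the integrator state, so earlier states are lost;
    the accumulator plays the role of the phase adder that sums PLL_out_l.
    """
    integrator = xi_prev
    acc = 0.0
    for error in pll_in:
        integrator += error                    # integrator branch
        acc += kp * error + ki * integrator    # PLL_out_l added to the sum
    return acc


def non_recursive_pi_in(pll_in, kp, ki, xi_prev):
    """Closed form: every PLL_in_l is weighted by K_p + K_i * (M - l)."""
    m = len(pll_in)
    weighted = sum(e * (kp + ki * (m - l)) for l, e in enumerate(pll_in))
    return m * ki * xi_prev + weighted


# Hypothetical values, chosen only for illustration.
errors = [1.0, -2.0, 0.5, 3.0]
stepwise = recursive_pi_in(errors, kp=0.5, ki=0.125, xi_prev=2.0)
closed_form = non_recursive_pi_in(errors, kp=0.5, ki=0.125, xi_prev=2.0)
```

Both calls return the same value for any input sequence under this loop model. The closed form needs each PLL_in value only once, with a fixed weight that depends on its position in the interleaving cycle, which is what allows the per-lane values to stay available for the whole cycle.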

(25) In the following examples, “Kpd”, “VCO_gain” and “pulse_gain_detector's output” have the same meaning.

Example 1

(26) An exemplary embodiment of an interleaved 40-lane CDR is shown in FIGS. 5A-5C.

(27) With reference to FIG. 5A, the following hardware operators, which are all straightforward applications of digital adders and multipliers, well-known to any person skilled in the art, are defined below:

(28) a) BOX-(n+1), n=0, 1, 2, . . . , 39: this hardware operator, the first of which is indicated in the figure by numeral 50, accepts 5 input samples indexed {x.sub.n-4, x.sub.n-3, x.sub.n-2, x.sub.n-1, x.sub.n}, and outputs two values, Q(x.sub.n-2)·(x.sub.n−x.sub.n-4) and Q(x.sub.n-2)·(x.sub.n-1−x.sub.n-3).

(29) FIG. 5B shows an exemplary operator 50 configuration for n=0.

(30) b) Twin I/O multiplier operator 51: this hardware operator accepts the two values from the BOX-(n+1) operator, and returns both values multiplied by K.sub.p+(40−n)·K.sub.i at its output.

(31) FIG. 5C shows an exemplary operator configuration for n=0.

(32) c) additional standard multipliers and adders are also used, and Kpd is renamed as VCO_gain.

(33) As can be readily appreciated, the “time error” (phase error) values for all lanes are available at all times during the full interleaving cycle, while the overall PLL functionality is maintained, with no recursive computations.
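The operators of Example 1 can be modeled in software as a rough sketch. This is one possible arrangement under stated assumptions and not the wiring of FIG. 5A: Kpd (VCO_gain) is taken as unity, the coefficient multiplications by b.sub.0 and (b.sub.1 − b.sub.3) are applied once after the weighted sums, the loop model of the reference function is inferred from Eq. 1, and the sample values, coefficients, and function names are hypothetical.

```python
def sign(v):
    """Q(x): +1 for a non-negative sample, -1 otherwise."""
    return 1.0 if v >= 0 else -1.0


def box(x, n):
    """BOX-(n+1): two sign-corrected differences from samples x_(n-4)..x_n."""
    q = sign(x(n - 2))
    return q * (x(n) - x(n - 4)), q * (x(n - 1) - x(n - 3))


def pi_in_from_boxes(x, b, kp, ki, xi_prev, m=40):
    """Non-recursive evaluation of Eq. 2 over one interleaving cycle."""
    outer_sum = 0.0
    inner_sum = 0.0
    lane_errors = []  # every lane's time error stays available
    for n in range(m):
        outer, inner = box(x, n)
        lane_errors.append(b[0] * outer + (b[1] - b[3]) * inner)
        weight = kp + (m - n) * ki  # twin I/O multiplier for index n
        outer_sum += weight * outer
        inner_sum += weight * inner
    pi_in = m * ki * xi_prev + b[0] * outer_sum + (b[1] - b[3]) * inner_sum
    return pi_in, lane_errors


def pi_in_recursive(lane_errors, kp, ki, xi_prev):
    """Reference: assumed step-by-step loop, for comparison only."""
    integrator, acc = xi_prev, 0.0
    for error in lane_errors:
        integrator += error
        acc += kp * error + ki * integrator
    return acc


# Hypothetical test data: 44 samples covering x_-4 .. x_39.
samples = [((7 * k) % 11 - 5) / 4 for k in range(44)]


def x(n):
    """Sample x_n for n in -4..39."""
    return samples[n + 4]


B_TAPS = [0.125, 0.75, 1.0, 0.25]  # hypothetical coefficients
pi_in, lane_errors = pi_in_from_boxes(x, B_TAPS, kp=0.5, ki=0.125, xi_prev=2.0)
reference = pi_in_recursive(lane_errors, kp=0.5, ki=0.125, xi_prev=2.0)
```

In this sketch `pi_in` matches `reference`, while `lane_errors` holds all 40 per-lane values at the end of the cycle. Deferring the two coefficient multiplications until after the sums is one example of the kind of operator consolidation the description refers to.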

Example 2

(34) An alternative embodiment of the circuit of FIG. 5 is shown in FIG. 6. As compared to the embodiment of FIG. 5, in this embodiment Equation 2 is implemented with a different rearrangement of the operators, thus yielding a different hardware configuration that performs the same task as in Example 1. The specific hardware implementation may be chosen among the possible ones so as to make the best use of the available hardware resources in each specific case.