Just a couple of days ago as I pen these words, I was chatting with Bryan Hoyer from Align Engineering. After slaving away for years in their secret underground bunker, these little rapscallions have just come out of “Stealth Mode”. As part of their public launch, the folks at Align have announced a patented breakthrough technology called Align Lock Loop (ALL), which allows every LVDS input/output (I/O) pair in an FPGA to be used as a complete SERDES (Serializer/Deserializer) solution. This forms the basis for implementing fast, simple, and very affordable chip-to-chip and board-to-board communication without using large numbers of I/O pins and without involving intensive engineering that makes your eyes water.
In fact, I was so excited about the ALL concept that I decided to pen this brief technology introduction/backgrounder. Bryan promises that Align will follow this with a full-up “How To” article in the not-so-distant future (after the “proof-in-silicon” technology demonstration that is currently planned for sometime in Q4 2007).
Phase Lock Loops (PLLs) and Clock Data Recovery (CDR)
Before we leap into the fray with gusto and abandon, it’s well worth spending a few moments reminding ourselves of the role played by PLL and CDR functions, because these concepts will be important to our future discussions.
A PLL is a closed-loop electronic control system/function that can be used for frequency control by generating an output signal with a fixed relation to the phase of an input (“reference”) signal. For the purposes of these discussions, both the input and output signals will be considered to be clock signals. The simplest form of PLL generates an output clock with the same frequency and phase as the input clock (Fig 1).
1. A generic PLL.
By means of a feedback path coupled with a phase detector, the PLL responds to both the frequency and the phase of the input signal, automatically raising or lowering the frequency of a controlled oscillator until it is synchronized to the input/reference signal in both frequency and phase.
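The feedback behaviour described here can be sketched numerically. The following is only a minimal model under stated assumptions, not a description of any real PLL circuit: phase is tracked as a plain number of cycles (no wrap-around), the loop filter is a simple proportional-plus-integral stage, and the function name `simulate_pll` and the gains `kp` and `ki` are hypothetical values chosen so that the loop settles quickly.

```python
def simulate_pll(f_ref, f_free, steps=2000, dt=1e-3, kp=20.0, ki=200.0):
    """Toy phase-domain PLL model (illustrative assumptions throughout).

    f_ref  : reference clock frequency, in Hz
    f_free : free-running frequency of the controlled oscillator, in Hz
    Returns the final oscillator frequency and the final phase error (cycles).
    """
    ref_phase = 0.25   # reference starts a quarter of a cycle ahead
    osc_phase = 0.0
    integ = 0.0        # integral term of the loop filter
    f_osc = f_free
    for _ in range(steps):
        err = ref_phase - osc_phase          # phase detector
        integ += ki * err * dt               # loop filter, integral path
        f_osc = f_free + kp * err + integ    # controlled oscillator
        ref_phase += f_ref * dt              # both phases advance one step
        osc_phase += f_osc * dt
    return f_osc, ref_phase - osc_phase


f_osc, phase_err = simulate_pll(f_ref=10.0, f_free=9.0)
print(f"oscillator settled at {f_osc:.6f} Hz, phase error {phase_err:.2e} cycles")
```

In this sketch the oscillator starts 1 Hz slow and a quarter of a cycle behind, and the loop pulls it up until both the frequency difference and the phase difference have decayed to essentially zero.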
As described in an associated Wikipedia Article, a good analogy is the tuning of a string on a guitar. Using a tuning fork as a reference signal, the tension of the string is adjusted up or down until the beat frequency is inaudible, thereby indicating that both the tuning fork and the guitar string are vibrating at the same frequency. If the guitar string is perfectly tuned and in phase with the tuning fork and maintained there, it may be described as being in phase-lock with the fork.
Now, to the uninitiated, this may not at first appear to be too mind-bending. A common knee-jerk reaction is “woopee-doo-dah” – we’ve used a lot of complex circuitry to take an existing clock signal and generate a new one that looks exactly the same. But, of course, there’s a lot more to it than this. Consider the case of “jitter” for example, in which the rising and falling edges of the input clock may wander back and forth slightly. The PLL removes this jitter and generates a nice “clean and shiny” clock signal.
Sidebar: How important is jitter? Very! Consider, for example, an Analog-to-Digital Converter (ADC), whose role in life is to sample data at specific times. Now assume that this sampled data is fed into a Digital Signal Processing (DSP) algorithm/function, which works on the assumption that all of the samples are taken at regular intervals as determined by some clock signal. Not surprisingly, if the clock signal is subject to jitter, the overall quality of the resulting data is degraded. In order to address this issue, some “clock cleaner” PLL chips/functions can reduce jitter down into the femtosecond range.
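To put a rough number on this, a commonly used rule of thumb (not taken from this article) bounds the signal-to-noise ratio of a sampled full-scale sine wave when clock jitter is the only error source: SNR = -20·log10(2π·f·t_j), where f is the input frequency and t_j is the RMS jitter. The sketch below simply evaluates that expression; the function name and the example values are assumptions for illustration.

```python
import math


def jitter_limited_snr_db(f_in_hz, jitter_rms_s):
    """Upper bound on SNR (dB) for a full-scale sine input when sampling-clock
    jitter is assumed to be the only source of error."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * jitter_rms_s)


# Hypothetical example: a 100 MHz input sampled with 1 ps and with 100 fs RMS jitter.
for jitter in (1e-12, 100e-15):
    snr = jitter_limited_snr_db(100e6, jitter)
    print(f"{jitter:.0e} s RMS jitter -> at most {snr:.1f} dB SNR")
```

Under these assumptions, cutting the jitter from 1 ps to 100 fs raises the ceiling from roughly 64 dB to roughly 84 dB, which is why femtosecond-class clock cleaners matter for high-resolution converters.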
And, of course, there’s more, because the output frequency from the PLL may be a higher (multiplied) or lower (divided) version of the input frequency.
Sidebar: As a simple example of why we might wish to use a multiplying PLL function, consider the case where an ADC chip is required to sample data at 100 MHz. Also consider that several chips may be driven by a common clock. One approach would be to use a 100 MHz clock generator to feed the various chips, but running a 100 MHz clock signal around a board isn’t a lot of fun. An alternative would be to employ a 25 MHz clock generator at the board level, and for each of the chips to use a 4× PLL to generate 100 MHz signals for internal consumption.
In fact, there are a variety of PLL functions as follows (these are presented in terms of increasing complexity, which equates to more gates/transistors, silicon area, power consumption, and so forth):
- Base-Level PLL: In this, the simplest case, the output from the PLL has the same frequency and phase as the input signal (the phase can be adjusted as required by means of the feedback path).
- Integer Mult (m) PLL: The output frequency from the PLL is some integer multiple ‘m’ of the input frequency.
- Integer Div (n) PLL: The output frequency is the input frequency divided by some integer divisor ‘n’.
- Integer Mult/Div (m/n) PLL: The output frequency is generated as a combination of an integer multiple ‘m’ and an integer divisor ‘n’ of the input frequency (this is often achieved by using two or more PLLs in tandem).
- Fractional Mult/Div (m/n) PLL: As for the previous case, except that the multiplication ‘m’ and division ‘n’ values may be fractional (non-integer) values.
- Clock Data Recovery (CDR) PLL: In this case, a clock signal is embedded in – and recovered from – a stream of data.
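The frequency relationships in the multiplying and dividing cases above all reduce to f_out = f_in × m / n. As a small sketch (the helper name `pll_output_hz` is hypothetical, and exact fractions are used only to avoid rounding surprises):

```python
from fractions import Fraction


def pll_output_hz(f_in_hz, m=1, n=1):
    """Ideal output frequency of a PLL that multiplies by m and divides by n."""
    return Fraction(f_in_hz) * Fraction(m) / Fraction(n)


# Base-level PLL: output equals input.
print(pll_output_hz(25_000_000))                      # 25000000
# Integer multiply: the 25 MHz board clock and 4x PLL from the sidebar.
print(pll_output_hz(25_000_000, m=4))                 # 100000000
# Integer multiply/divide.
print(pll_output_hz(25_000_000, m=4, n=5))            # 20000000
# Fractional multiply (m = 2.5, expressed exactly as 5/2).
print(pll_output_hz(25_000_000, m=Fraction(5, 2)))    # 62500000
```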
As we shall see, the CDR case is of particular interest to us in the context of these discussions. The idea here is that, as opposed to having separate clock and data signals, the clock is embedded in (and can be derived from) the data stream itself.
As a starting point, let’s consider a data stream that consists of alternating 0s and 1s (e.g. 010101010101. . .) as illustrated in Fig 2(a), where each 0-to-1 and 1-to-0 transition occurs on a clock edge from a reference clock that is embedded in the transmitter.
Obviously, recovering the clock is not too complex a task in this case (to all intents and purposes, this data stream is the clock).
2. Embedding the clock in the data stream.
By comparison, consider the more complex data stream illustrated in Fig 2(b). Once again, any data transitions between 0 and 1 values (and vice versa) occur on “clock edges” corresponding to a reference clock that is embedded in the transmitter, but the data stream itself may comprise a “random” sequence of 0s and 1s. In this case, the CDR function embedded in the receiver will have to be much more sophisticated.
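One way to get a feel for what a CDR function has to do is the purely digital sketch below. It is an assumption-laden toy, not how any particular device works: the receiver is assumed to oversample the line `osr` times per bit, every data transition is treated as a bit boundary that re-aligns the recovered clock, and each bit is sampled in the middle of its cell. The name `recover_bits` is hypothetical.

```python
def recover_bits(samples, osr):
    """Recover data bits from an oversampled 0/1 stream (toy CDR sketch).

    samples : sequence of 0/1 line samples, nominally `osr` samples per bit
    osr     : nominal oversampling ratio (samples per bit)
    """
    bits = []
    count = 0              # samples elapsed since the last assumed bit boundary
    prev = samples[0]
    for s in samples:
        if s != prev:      # a transition marks a bit boundary...
            count = 0      # ...so re-align the recovered clock to it
            prev = s
        if count == osr // 2:
            bits.append(s)         # recovered clock edge: sample mid-bit
        count = (count + 1) % osr  # free-run until the next transition
    return bits


# Ideal stream: bits 1, 0, 0, 1 at exactly four samples per bit.
print(recover_bits([1] * 4 + [0] * 8 + [1] * 4, osr=4))            # [1, 0, 0, 1]
# Slightly irregular bit lengths (4, 5, 4, 4 and 3 samples) are still recovered.
print(recover_bits([1] * 4 + [0] * 5 + [1] * 8 + [0] * 3, osr=4))  # [1, 0, 1, 1, 0]
```

Because the recovered clock is only corrected when the data changes, a long run of identical bits lets any timing error accumulate unchecked, so the receiver depends on the data stream containing enough transitions.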
8b/10b encoding
Now, we’re getting a little ahead of ourselves here, because the concept of 8b/10b (and related encoding schemes) doesn’t really come into play until we start talking about the SERDES and ALL techniques. The reason we are going to introduce it at this time is that it is relevant in the context of the CDR functions presented in the previous topic.
As a starting point, let’s assume that we are using some high-speed serial transceiver technique, and also that we are transmitting an ideal signal consisting of a series of alternating 0s and 1s as illustrated in Fig 3.
3. Transmitting and receiving an “ideal” high-speed serial signal.
For the purposes of this simple example, the signal generated by the transmitter is shown as being a pure square wave; in the real world, however, this signal would have significant analog characteristics. Also, the signal as “seen” by the receiver would be phase-shifted from that shown in Fig 3; we’ve aligned the signals here to better illustrate which bits at the transmitter and receiver are associated with each other.
Now, when we are talking about high-speed signals with data rates as high as gigabits-per-second, the tracks linking the transmitting and receiving chips (and the pins on the chips) absorb a lot of the signal’s high-frequency content, which means that the receiver “sees” only a drastically attenuated version of the original signal.
The end result at these extreme frequencies is that the signal coming out of the transmitting chip is horrible, and it’s even worse by the time it reaches the receiver, but we digress. The point here is that the signal as “seen” by the receiver in Fig 3 still oscillates above and below some median level, which means the receiver can detect it and pull useful information (such as the data and the recovered clock) from it.
Now let’s consider what might happen if we were to modify the previous data stream such that it commences by transmitting a series of three consecutive 1 values as illustrated in Fig 4.
4. The effect of transmitting a series of identical bits.
In this case (and remembering that this is an overly-pessimistic scenario intended only to provide us with something to talk about), the signal as “seen” by the receiver continues to rise throughout the course of the first three bits. This takes the signal above its “median” value, which means that even when the signal returns to its 010101. . . sequence, the receiver will actually continue to “see” a never-ending series of 1s.
The point of all of this is that 8b/10b refers to an encoding scheme in which original 8-bit (256-value) characters/symbols are mapped into 10-bit (1,024-value) characters/symbols. This means that each of the original 8-bit characters/symbols can have a number of 10-bit counterparts. The result is that even if the transmitter wishes to transmit a group of 0s or 1s, it can select between different 10-bit symbols so as to ensure that the overall sequence ends up “hovering” around the median value.
In addition to ensuring a constant DC value as discussed above, 8b/10b encoding is also used to ensure enough state changes to facilitate clock recovery by the receiver. Last but not least, some of the 10-bit codes can be used as control characters; for example, to announce the start and end of a “frame”.
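The idea of choosing between alternative 10-bit symbols to keep the stream “hovering” around the median can be sketched as follows. Note the assumptions: the two-entry table below is a made-up toy, not the real 8b/10b code table, and the selection rule (pick whichever candidate brings the running count of 1s minus 0s closest to zero) is a simplification of the running-disparity bookkeeping a real encoder performs. All names are hypothetical.

```python
# Toy code table (NOT the real 8b/10b table): each symbol maps to one or more
# 10-bit candidate words. "P" has an unbalanced pair, "Q" a single balanced word.
TOY_TABLE = {
    "P": ("1101011010", "0010100101"),   # six 1s / four 0s, and its complement
    "Q": ("1010010110",),                # five 1s / five 0s
}


def disparity(word):
    """Number of 1s minus number of 0s in a bit string."""
    return word.count("1") - word.count("0")


def encode(symbols, table):
    """Encode symbols, steering the running disparity back toward zero."""
    running = 0
    words = []
    for sym in symbols:
        # Choose the candidate that leaves the running total closest to zero.
        word = min(table[sym], key=lambda w: abs(running + disparity(w)))
        running += disparity(word)
        words.append(word)
    return words, running


words, final = encode(["P", "P", "P", "Q", "P"], TOY_TABLE)
print(words)
print("final running disparity:", final)
```

Even though the same unbalanced symbol is sent repeatedly, the encoder alternates between its two representations, so the excess of 1s over 0s never grows beyond one word's worth.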
The evolution of I/O
When I was a bright-eyed, bushy-tailed young engineer, transmitting signals from one chip to another was so much simpler than it is today. In those now-far-off times, we were typically working with Transistor-Transistor Logic (TTL), whose signals swung between 0V and 5V. Furthermore, generally speaking, we were working with clock frequencies of only a few hundred kHz – how well I can remember the excitement when our clock speeds started to approach 1 MHz (at that time we would have laughed our socks off if anyone had talked in terms of gigahertz clock frequencies and data rates of gigabits-per-second).
But, once again, we are “wandering off into the weeds”, so let’s bypass those days of yore, leap forward to when the use of CMOS became prevalent, and briefly summarize the evolution of different types of I/O as follows:
- Parallel asynchronous CMOS [No PLLs or CDRs]
- Parallel synchronous CMOS [No PLLs or CDRs]
- Parallel source-synchronous CMOS [Requires PLLs]
- Parallel source-synchronous LVDS [Requires PLLs]
- Serial source-synchronous LVDS [Requires PLLs]
- XCVR-based* multi-gigabit SERDES [Requires CDRs]
*Just in case you aren’t familiar with this terminology, in ham radio jargon, X can stand for trans (from the Latin, meaning “across” or “through”), so XCVR is an abbreviation for “transceiver”.
We’ll next take a very quick peek at each of these cases to remind ourselves of the most salient points…
Parallel asynchronous CMOS: Let’s start by considering an FPGA acting as a master device communicating (reading and/or writing data) with some form of slave device. In this case the master device will be fed by an external clock and the two components will be connected by a parallel data bus augmented by some control signals; for example, Chip Select (CS) and Read/Write (R/W) as illustrated in Fig 5.
5. Parallel asynchronous CMOS (single slave device).
The advantage of this scheme is that it’s relatively simple. One disadvantage is that it utilizes a lot of pins; another is that it requires a number of clock cycles to set up the R/W and CS lines and then perform the read/write operation.
If multiple slave devices are required, the data bus and R/W signals are copied to all of the slaves; meanwhile a unique CS signal is required by each slave as illustrated in Fig 6.
6. Parallel asynchronous CMOS (multiple slave devices).
Of course, as opposed to the master device generating unique CS signals, it could output an address value that was externally decoded to generate the CS signals, but we’re trying to keep things as simple as possible.
Parallel synchronous CMOS: This is very similar to the previous case, except that the clock is fed to all of the devices as illustrated in Fig 7.
7. Parallel synchronous CMOS (multiple slave devices).
The advantage of this scheme is that each read/write operation requires only a single clock cycle; the disadvantage is that we now have to balance the clock lines so as to ensure that all of the devices “see” the clock at the same time. This introduces a new level of system complexity, where the “balancing requirements” become tighter and tighter as the clock frequency increases.
Parallel source-synchronous CMOS: OK, let’s change things around a little. For the purpose of the following examples, let’s assume that we are trying to establish communications between an FPGA and an Analog-to-Digital Converter (ADC) chip. The idea here is that our master device (the FPGA) wishes to upload a continuous stream of sample data from the slave device (the ADC).
Now assume that the data rate we require is so high that we can no longer guarantee our ability to synchronize operations between the two devices using a common clock. One solution is to use a source synchronous technique, in which the slave device produces its own clock that travels in parallel with the data as illustrated in Fig 8.
8. Parallel source-synchronous CMOS (single slave device).
Note that only the data and clock signals are shown in Fig 8; any additional control signals have been omitted for simplicity. Also note that the use of a single system clock to drive both devices as illustrated in Fig 8 is only one scenario; it is also common for both devices to be fed by separate clocks.
Using this approach, the clock generated by the slave device suffers the same delay and drift as its data, thereby facilitating the receiving device’s ability to reliably retrieve that data.
Observe that the Source Synchronous Control (SSC) logic is relatively small compared to a PLL block. In the case of the slave (ADC), both the PLL and SSC would be implemented as hard-wired logic; by comparison, in the case of the FPGA, the PLL would be implemented as a hard macro while the SSC would be implemented using the device’s programmable fabric.
One disadvantage of this approach is the proliferation of PLLs – first, we need a PLL in the slave to lock onto (and remove jitter from) the external clock; second, we need a PLL in the master to lock onto (and remove jitter from) the clock signal generated by the slave.
Furthermore, each new slave device will require an additional PLL macro in the FPGA. Even worse, each slave device behaves as though it were the only device (and clock generator) in the world. The result as “seen” from the FPGA’s perspective is multiple clock domains – one for each slave.
Parallel source-synchronous LVDS: This is almost identical in concept to the previous topic. The only significant difference is the fact that the clock signal and each data signal are presented as LVDS (Low-Voltage Differential Signaling) pairs.
There are three key advantages associated with using LVDS: (a) low power consumption, (b) low “outbound” (radiated) Electromagnetic Interference (EMI) emissions, and (c) a greater tolerance to “inbound” EMI (noise). The primary disadvantage is that each signal consumes two pins on each device.
Serial source-synchronous LVDS: In the case of a serial source synchronous LVDS scheme, we require a minimum of three signals: Clock, Data, and Frame as illustrated in Fig 9.
9. Serial source-synchronous LVDS (single slave device).
As for each of the other source synchronous techniques, the FPGA will require a PLL to process the clock from each of its slave devices.
XCVR-based multi-gigabit SERDES: At the top of the “food chain” we find high-speed serial communications schemes, such as PCI Express, as illustrated in Fig 10, in which the data signal includes an embedded clock. Observe that PCI Express was originally conceived as a board-to-board technique (chip-to-chip incarnations followed later). Thus, standard implementations employ a separate clock for each device.
10. XCVR-based multi-gigabit SERDES (single slave device).
In this case, a minimum implementation consists of a single (×1) “lane” comprising a transmit path and a receive path, each of which uses a special-purpose differential signal pair. Higher bandwidths may be achieved by using multiple lanes, which is why it is common to see ×1, ×4, ×8, etc. references in this context.
At the current time, the bandwidth for a single lane is typically quoted as 2.5 or 5.0 gigabits-per-second, but this can be a little misleading (in the case of 10 gigabits-per-second solutions, these are formed using four × 2.5 gigabits-per-second lanes).
The problem is that the data stream will use some form of encoding, such as the 8b/10b scheme introduced earlier in this paper (in the case of networks, a related 64b/66b scheme is typically employed, but we will assume the use of 8b/10b for the purposes of these discussions). This means that – for every 8 bits of data we wish to transmit or receive – we actually end up using 10 bits. Thus, in the case of a 2.5 gigabits-per-second Link Rate, the corresponding Data Rate is actually 2.5 / 10 * 8 = 2.0 gigabits-per-second.
While we’re here, it’s probably worth noting that, in the case of 10 Gig Ethernet, the Link Rate = 4 × 3.125 Gbps, while the Data Rate = 4 × 2.5 Gbps. The point is that one has to be careful when making comparisons, because network folks commonly quote the Data Rate, whereas other folks often quote the Link Rate (which is 25% higher for 8b/10b encoding).
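The Link Rate versus Data Rate arithmetic above can be captured in a couple of lines. This is just a sketch of the calculation described in the text; the function name and parameter names are assumptions.

```python
def data_rate_gbps(link_rate_gbps, payload_bits=8, coded_bits=10, lanes=1):
    """Usable data rate for a link whose line code carries `payload_bits`
    of data in every `coded_bits` transmitted (8b/10b by default)."""
    return lanes * link_rate_gbps * payload_bits / coded_bits


print(data_rate_gbps(2.5))              # single 2.5 Gbps lane with 8b/10b -> 2.0
print(data_rate_gbps(3.125, lanes=4))   # 10 Gig Ethernet as 4 x 3.125 Gbps -> 10.0

# Overhead of the Link Rate relative to the Data Rate for 8b/10b.
print(f"{(10 / 8 - 1):.0%}")            # 25%
```

Passing `payload_bits=64, coded_bits=66` gives the equivalent figure for the 64b/66b scheme mentioned earlier, where the overhead is much smaller.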
Observe that the control (“Ctrl”) logic is relatively small compared to a CDR block. Once again, in the case of the slave (ADC), both the CDR and the “Ctrl” functions would be implemented as hard-wired logic; by comparison, in the case of the FPGA, the CDR would be implemented as a hard macro while the “Ctrl” functions would be implemented using the device’s programmable fabric.
The advantages of multi-gigabit SERDES solutions are that they are extremely fast and require a low pin count. The disadvantages are that they are expensive in terms of dollars, silicon area, power consumption, and complexity to use; also that their high-speed pins are typically dedicated to their hardwired CDR macros.