Accelerating 32 GT/s PCIe 5.0 Designs

Gary Ruggles, Sr. Product Marketing Manager, Synopsys

The transition from older PCI Express (PCIe) technologies to the latest Revision 5.0 is on an accelerated path, with system-on-chip (SoC) designers seeing a much faster rollout than they did with PCIe 4.0. In a recent Synopsys webinar, viewers’ responses to a survey showed that while many PCIe 4.0 design starts are well underway, some designers are leapfrogging PCIe 4.0 and moving directly to PCIe 5.0 designs. The survey also showed that many of those not yet moving to PCIe 5.0 designs will be doing so in the next 12 months.

PCIe bandwidth has been doubling with each generation and is now moving from 16 GT/s PCIe 4.0 to today’s 32 GT/s PCIe 5.0. The recent release of version 0.9 of the PCIe 5.0 Base Specification locks in the functional changes to the specification, allowing designers to confidently start their designs.

In addition to the bandwidth doubling, the specification delivers some new features such as equalization bypass modes to enable faster link bring-up, precoding support to help avoid burst errors that could result from the higher decision feedback equalization (DFE) tap ratio, and loopback enhancements to allow for crosstalk simulation. With the rapid adoption of PCIe 5.0 technology, SoC designers should understand and consider some of the key design challenges they will face, such as increased channel loss, complex controller considerations, PHY and controller integration, packaging and signal integrity issues, and modeling and testing requirements. This article outlines the design challenges of moving to a PCIe 5.0 interface and how to successfully overcome the challenges using proven IP that is designed and tested to meet the key features of PCIe 5.0 at 32 GT/s.

The Channel

Doubling the data rate from 16 GT/s to 32 GT/s also doubles the Nyquist frequency to 16 GHz, making frequency-dependent insertion losses worse. In addition, increased capacitive coupling at higher frequencies adds more interference or noise to the signal, making the crosstalk worse than it was in PCIe 4.0 channels. These factors combine to make the PCIe 5.0 channel the most challenging Non-Return-to-Zero (NRZ) channel SoC designers have faced.

The PCB material selected (FR4, Megtron, Tachyon, or iSpeed, for example) has a huge impact on the insertion loss across various reaches. Figure 1 shows a simple example of insertion loss for a 16-inch trace across different PCB materials at both 16 GT/s (8 GHz Nyquist) and 32 GT/s (16 GHz Nyquist) data rates. FR4, a common and widely used material, has an insertion loss that grows from 19.34 dB at 8 GHz Nyquist (Gen 4 data rate) to 33.44 dB at 16 GHz Nyquist (Gen 5 data rate). For this reason, FR4 is impractical for PCIe 5.0 systems: 16 inches is not very long, and the board loss is only a fraction of the total channel loss, which cannot exceed approximately 36 dB as defined by the PCIe 5.0 specification and which also includes packages, multiple PCBs, connectors, and so on. Real-world PCIe 5.0 systems require better materials than FR4.
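As a rough illustration of the budget arithmetic above, the following Python sketch uses the FR4 figures quoted for Figure 1 and the approximately 36 dB channel limit. The linear per-inch scaling and the function names are simplifying assumptions made for illustration; they are not taken from the specification.

```python
# Figures quoted in the text for a 16-inch FR4 trace (Figure 1).
TRACE_LENGTH_IN = 16.0
FR4_LOSS_DB = {8.0: 19.34, 16.0: 33.44}  # Nyquist frequency (GHz) -> loss (dB)
PCIE5_CHANNEL_BUDGET_DB = 36.0  # approximate total channel loss limit at 32 GT/s


def nyquist_ghz(data_rate_gtps: float) -> float:
    """Nyquist frequency of an NRZ signal is half the data rate."""
    return data_rate_gtps / 2.0


def loss_per_inch(total_loss_db: float, length_in: float = TRACE_LENGTH_IN) -> float:
    """Assumption: insertion loss scales linearly with trace length."""
    return total_loss_db / length_in


def remaining_budget_db(board_loss_db: float,
                        budget_db: float = PCIE5_CHANNEL_BUDGET_DB) -> float:
    """Loss left over for packages, connectors, and any other boards."""
    return budget_db - board_loss_db


if __name__ == "__main__":
    for rate in (16.0, 32.0):
        freq = nyquist_ghz(rate)
        loss = FR4_LOSS_DB[freq]
        print(f"{rate:.0f} GT/s: Nyquist {freq:.0f} GHz, "
              f"{loss_per_inch(loss):.2f} dB/inch on FR4")
    left = remaining_budget_db(FR4_LOSS_DB[16.0])
    print(f"Budget left after 16 inches of FR4 at 32 GT/s: {left:.2f} dB")
```

With these inputs, a single 16-inch FR4 trace leaves only about 2.56 dB of the 36 dB budget for everything else in the channel.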

Figure 1: Insertion loss of the channel material increases significantly from PCIe 4.0 at 16 GT/s to PCIe 5.0 at 32 GT/s

Besides channel materials, the channel configuration strongly influences both the total insertion loss and the overall bumpiness of the channel, as each transition from one material to another induces signal reflections. As an example, one of the simplest channels is a chip-to-chip interface across a substrate or board without any additional connectors, which may have a smooth insertion loss curve. However, when more connectors are added along the way, the channel performance can quickly get worse. For example, a real-world chip-to-chip channel may include a mezzanine connector, or two connectors using a riser card and an add-in card, or more than two backplane connectors and a mezzanine connector. Each time a connector is added to the channel, the transmitter and receiver have to overcome the additional channel loss, and must be able to equalize interferers that can show up many Unit Intervals from the main cursor. This typically requires a complex multi-tap DFE receiver design with fixed and floating taps to fully equalize the channel and open the eye at 32 GT/s.

Designers will do their best to anticipate these challenges and design a robust system with sufficient margin, ensuring error-free data transmission. For PCIe 5.0 designs, it is important for designers to be able to assess the real receiver margin in the actual system by utilizing RX lane margining, introduced in the PCIe 4.0 specification. While the PCIe 4.0 specification required only RX lane margining for timing (horizontal eye opening), the PCIe 5.0 specification (32 GT/s) also requires RX lane margining for voltage (eye height) to help ensure system robustness.

Controller Considerations

When configuring a PCIe 5.0 controller, data payload size is important for optimizing performance and throughput. Due to a relatively fixed overhead in each packet, typically about 20 to 24 Bytes per transaction layer packet (TLP), small payloads are inefficient, so the controller must allow a maximum payload size large enough to meet the required throughput. While the PCIe specification defines payload sizes up to 4096 Bytes, the industry average is typically 256 Bytes. However, it is up to the designers to select the maximum data payload size that reaches the ideal performance level for their target application, while understanding the potential limitations of their PCIe link partner’s supported payloads. Designers must also account for per-packet overhead when estimating achievable throughput: the TLP header, LCRC, sequence and framing, potentially ECRC, and the loss due to 128b/130b encoding.
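The effect of payload size on efficiency can be sketched with a few lines of Python. This is a simplified model that assumes back-to-back TLPs with a fixed 24-Byte overhead (the upper end of the range quoted above) and 128b/130b encoding. It ignores DLLPs, flow control, and other link traffic, so real throughput will be lower. The function names are illustrative.

```python
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def tlp_efficiency(payload_bytes: int, overhead_bytes: int = 24) -> float:
    """Fraction of raw link bandwidth that carries payload.

    Assumption: every TLP carries a full payload plus a fixed overhead.
    """
    packet_efficiency = payload_bytes / (payload_bytes + overhead_bytes)
    return packet_efficiency * ENCODING_EFFICIENCY


def effective_gbytes_per_s(payload_bytes: int, lanes: int = 16,
                           rate_gtps: float = 32.0,
                           overhead_bytes: int = 24) -> float:
    """Approximate payload throughput per direction in GB/s."""
    raw_gbytes_per_s = rate_gtps * lanes / 8  # one bit per transfer per lane
    return raw_gbytes_per_s * tlp_efficiency(payload_bytes, overhead_bytes)


if __name__ == "__main__":
    for payload in (64, 128, 256, 512, 4096):
        print(f"{payload:5d} B payload: "
              f"{tlp_efficiency(payload) * 100:5.1f}% efficient, "
              f"{effective_gbytes_per_s(payload):5.1f} GB/s on x16 at 32 GT/s")
```

Under these assumptions, efficiency rises steeply between 64 and 256 Bytes and flattens for larger payloads, which is consistent with 256 Bytes being a common choice.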

To achieve the best performance in a PCIe 5.0 system, the designer must determine the maximum number of outstanding non-posted requests (NPR) and ensure a sufficient number of tags is provided. The number of tags is a property of the controller, and hence, it must be set correctly based on the system requirements. The latest version of the PCIe 5.0 specification allows for 10-bit tags, which enables up to 768 unique tags (reduced from the expected limit of 1024 due to reservation of some bit values). Selecting too few tags has an adverse effect on performance. As the total round-trip time, or latency, increases, so does the number of tags required to maintain the maximum performance at 32 GT/s. The number of tags required is also affected by the payload size and the minimum read request size at which maximum throughput must be maintained. For PCIe 5.0, the required number of tags is also higher because the system throughput is higher at 32 GT/s.
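The relationship between latency, read request size, and tag count can be approximated with a bandwidth-delay calculation: enough reads must be outstanding to cover the data that the link can deliver during one round trip. The Python sketch below is a first-order estimate only. It uses raw link bandwidth with no encoding or packet overhead, and the 1000 ns latency is a hypothetical value, not a figure from this article or from Figure 2.

```python
import math

MAX_10BIT_TAGS = 768  # limit quoted in the text for 10-bit tags


def link_bytes_per_ns(rate_gtps: float, lanes: int) -> float:
    """Raw per-direction bandwidth; 1 GB/s is 1 byte/ns. Overheads ignored."""
    return rate_gtps * lanes / 8


def tags_needed(rate_gtps: float, lanes: int, round_trip_ns: float,
                read_request_bytes: int) -> int:
    """Outstanding reads needed to keep the link busy for one round trip."""
    bytes_in_flight = link_bytes_per_ns(rate_gtps, lanes) * round_trip_ns
    return math.ceil(bytes_in_flight / read_request_bytes)


if __name__ == "__main__":
    latency_ns = 1000.0  # hypothetical round-trip latency
    for rate in (16.0, 32.0):
        for request in (64, 128, 256, 512):
            n = tags_needed(rate, 16, latency_ns, request)
            note = " (exceeds 10-bit tag limit)" if n > MAX_10BIT_TAGS else ""
            print(f"x16 at {rate:.0f} GT/s, {request:3d} B reads: {n} tags{note}")
```

For the hypothetical 1000 ns latency and 256-Byte reads on a x16 link, this estimate gives 125 tags at 16 GT/s and 250 tags at 32 GT/s, showing how the requirement doubles with the data rate.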

Figure 2: Number of tags needed to achieve maximum throughput for PCIe 4.0 and PCIe 5.0 links

PHY and Controller Integration

An ideal situation is to implement a complete PHY and controller IP solution from a single vendor. When mixing and matching solutions from different vendors, designers must consider certain integration challenges. Intel has defined a specification called the PHY Interface for PCIe (PIPE) to help with such integration; however, changes to the PIPE specification make it important to understand this interface and its implementation. The PIPE 4.4.1 interface does not support the PCIe 5.0 technology explicitly because it requires additional register bits to handle the higher speed. If designers want to use this version of the PIPE specification, they and their IP vendors must manage many technical details, which can be cumbersome. The new PIPE 5.1.1 specification delivers the first true support for PCIe 5.0 technology with many new features, of which designers must have a comprehensive understanding:

  • The Low Pin Count Interface simplifies the PHY-controller interface by moving what used to be side-band pins to register bits. This concept was originally introduced to support a limited set of pins for PCIe 4.0 RX lane margining signals and has been greatly expanded in PIPE 5.1.1, offering a vastly simplified interface.

  • The SerDes Architecture effectively moves much of the Physical Coding Sublayer (PCS) functionality from the PHY into the controller and has been added as a “required” mode for PIPE 5.1.1. The SerDes Architecture facilitates the use of multi-standard PHYs that do not need to be encumbered with the PCS functionality. Retention of the original PIPE architecture is recommended for PCIe 5.0, but is not required, so support for SerDes Architecture becomes an important factor to consider.

  • A 64-bit PIPE option is added, but only for the SerDes Architecture. This can allow for lower speed operation of the PIPE interface, but it is not practical for 16-lane implementations due to the lack of availability of 1024-bit controllers. Synopsys supports the 64-bit PIPE, even when operating in the original PIPE architecture.

There has always been a tradeoff between the data path width and the frequency at which timing must be closed at the PIPE interface. For PCIe 5.0, some of the options designers may have had for PCIe 4.0 are no longer available. At 32 GT/s the PIPE interface must be at least 32 bits wide to avoid timing closure beyond 1 GHz. The 64-bit PIPE interface can be an option, allowing timing to be closed at 500 MHz, but not for the widest interfaces. To understand this, consider a few configurations shown in Table 1. For PCIe 5.0 at 32 GT/s, 16-bit PIPE can be ruled out, because it requires 2 GHz timing closure, which would be extremely difficult or impossible to achieve. This leaves options of 32-bit or 64-bit PIPE. However, if designers are taking advantage of the maximum available throughput by implementing x16 links, then only one option is left: a 512-bit controller with a 32-bit PIPE interface and 1 GHz timing closure. The alternative, 64-bit PIPE, would require a 1024-bit controller architecture, which is currently unavailable from any IP vendor.

Table 1: Finding a feasible implementation tradeoff of speed and width becomes critical when closing timing

Thus, for x16 links operating at 32 GT/s, a 512-bit controller is mandatory, making it vital for designers to use a silicon-proven and tested 512-bit controller IP architecture. Moving to a 512-bit architecture also means multiple data packets per clock cycle are possible. The controller architecture must therefore correctly handle serializing and ordering the TLPs to avoid unnecessary complications in the designer’s application logic. This calls for a proven 512-bit solution, preferably one that has demonstrated successful timing closure across the PIPE interface at 1 GHz using standard libraries (as opposed to costlier high-speed libraries).
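The width and frequency tradeoff discussed in this section can be reproduced with simple arithmetic: the PIPE clock is the line rate divided by the per-lane PIPE width, and the controller data path is the lane count multiplied by that width. The Python sketch below applies the two limits stated in the text (timing closure no faster than 1 GHz and no controller wider than 512 bits). The function and constant names are illustrative.

```python
MAX_PIPE_CLOCK_MHZ = 1000.0  # text: avoid timing closure beyond 1 GHz
MAX_CONTROLLER_BITS = 512    # text: 1024-bit controllers are unavailable


def pipe_clock_mhz(rate_gtps: float, pipe_width_bits: int) -> float:
    """Per-lane PIPE clock: line rate divided by PIPE data width."""
    return rate_gtps * 1000.0 / pipe_width_bits


def controller_width_bits(lanes: int, pipe_width_bits: int) -> int:
    """Controller data path width needed to serve all lanes each clock."""
    return lanes * pipe_width_bits


def feasible_pipe_widths(rate_gtps: float, lanes: int,
                         widths=(16, 32, 64)) -> list:
    """PIPE widths that satisfy both the clock and controller-width limits."""
    return [w for w in widths
            if pipe_clock_mhz(rate_gtps, w) <= MAX_PIPE_CLOCK_MHZ
            and controller_width_bits(lanes, w) <= MAX_CONTROLLER_BITS]


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        for width in (16, 32, 64):
            print(f"x{lanes:<2d} {width}-bit PIPE: "
                  f"{pipe_clock_mhz(32.0, width):6.0f} MHz, "
                  f"{controller_width_bits(lanes, width):4d}-bit controller")
        print(f"  feasible widths for x{lanes}: "
              f"{feasible_pipe_widths(32.0, lanes)}")
```

For a x16 link at 32 GT/s, the only width that passes both checks is 32 bits, which matches the conclusion above.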

Packaging and Signal Integrity

For packaging and signal integrity, new insertion loss and crosstalk specifications must be set and met to accommodate the faster 32 GT/s data rate and resulting 16 GHz Nyquist frequency. Trace length and routing must be carefully managed within the package form factor to avoid crosstalk violations and meet the new insertion loss specifications. Power distribution is also a big factor: 32 GT/s designs have higher inrush currents (di/dt), so package inductance must be reduced to keep voltage noise at the same level.
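The inductance requirement follows from the basic relation V = L × di/dt for the noise voltage across a supply inductance. The Python sketch below shows the scaling only. The noise budget and di/dt values are hypothetical numbers chosen for illustration and do not come from any PCIe specification or package design.

```python
def supply_noise_v(inductance_h: float, di_dt_a_per_s: float) -> float:
    """Voltage noise across a supply inductance: V = L * di/dt."""
    return inductance_h * di_dt_a_per_s


def max_inductance_h(noise_budget_v: float, di_dt_a_per_s: float) -> float:
    """Largest inductance that keeps noise within the given budget."""
    return noise_budget_v / di_dt_a_per_s


if __name__ == "__main__":
    noise_budget_v = 0.020  # hypothetical 20 mV noise budget
    slow_di_dt = 1.0e9      # hypothetical 1 A/ns
    fast_di_dt = 2.0e9      # hypothetical doubling of di/dt
    for label, di_dt in (("slower edge", slow_di_dt),
                         ("faster edge", fast_di_dt)):
        limit_ph = max_inductance_h(noise_budget_v, di_dt) * 1e12
        print(f"{label}: inductance limit {limit_ph:.1f} pH")
```

If di/dt doubles while the noise budget stays fixed, the allowable inductance halves.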

Reflections and crosstalk are more challenging at 32 GT/s data rates, and all the discontinuities in the signal path, such as vertical interconnect accesses (vias), ball grid array (BGA) balls, connectors, and DC blocking capacitors, must be carefully analyzed. Improper transmitter and receiver routing in the via regions will increase crosstalk between adjacent signals or lanes. Designers must try to maintain the maximum spacing for traces even in such crowded via regions to avoid crosstalk.

As the data rate increases, the power supply current demand increases in amplitude and frequency; however, the basic challenge of maintaining a stable supply voltage remains the same. For example, a power state change in one lane may create an inrush current that causes a large spike in the supply voltage seen by another lane running in continuous transmit mode. Designers must be able to conduct proper analysis of the power delivery network to:

  • Verify that noise with all lanes running meets AC ripple specifications through adequate decoupling capacitance and package/board inductance

  • Check that on-board filter components have optimal frequency response and improve them as needed

  • Verify mode change in one lane doesn’t impact operation in another lane

  • Understand packaging and signal integrity issues and, when necessary, work with companies experienced in designing packages and boards for such high data rates

Modeling and Testing

The only way to accurately simulate PCIe 5.0 systems is to use Input/output Buffer Information Specification Algorithmic Modeling Interface (IBIS-AMI) models for the PHY TX and RX interfaces. Designers can combine IBIS-AMI models from their PHY IP provider together with the models for package, PCB, and connectors into a complete channel model to run an accurate system simulation. Figure 3 shows a comparison between an IBIS-AMI model simulation (on the left) and a real, measured eye diagram (on the right) through a system board simulation. IBIS-AMI simulations match actual silicon data with good accuracy.

Figure 3: IBIS-AMI models are mandatory for accurate results during system simulations

For production devices, manufacturing testing at 32 GT/s requires fast tests that can verify links, typically using built-in loopback modes, pattern generators and receivers that are incorporated into the PHY and controller IP. Some test setups may utilize the built-in oscilloscope capability typically incorporated into the PCIe 5.0 PHY IP as well. Robust system testing should take advantage of the PCIe controller IP solution’s built-in debug, error-injection, and statistical capabilities. This helps ensure that firmware and software can correctly anticipate any potential real-world system issues that may be encountered.

For PHY testing, when designers need more details on their 32 GT/s PHY performance, a high-speed oscilloscope is typically used to measure parameters such as TX jitter. Moving to 32 GT/s means that the oscilloscope bandwidth also needs to be higher, but how much higher? Even though the signal rise time drives this requirement, real-world PHYs typically have some rise time limitations to keep power realistic. For this reason, a 50 GHz oscilloscope will typically have sufficient bandwidth for proper analysis of 32 GT/s signals [1].

Summary

While the adoption of 32 GT/s PCIe 5.0 technology is on an accelerated pace, SoC designers must understand and handle a few design challenges as they make the shift. 32 GT/s designs have challenging NRZ channels that are extremely lossy and bumpy with many discontinuities, with insertion loss reaching 36 dB and beyond. The PCIe PHY design must encompass unique architectures with a proven analog front-end, continuous time linear equalizer, and advanced multi-tap decision feedback equalizer that seamlessly work together to mitigate design issues. Integrating the PHY and controller requires more careful planning to ensure compatibility at the PIPE interface and to facilitate timing closure at 1 GHz.

Several PCIe 5.0 controller configuration options must be carefully selected and managed to achieve maximum performance. Architectural tradeoffs should be explored to balance maximum payload size, read request size, number of tags, and other important controller configuration settings.

Careful signal and power integrity analysis must be carried out for chips and packages, and the whole channel must be simulated to ensure performance targets are met at 32 GT/s.

These new challenges can be mitigated or eliminated by partnering with Synopsys, a proven and trusted IP partner with a track record of many years of success in developing high-quality PCIe IP. The Synopsys DesignWare® IP complete solution for PCIe 5.0 includes controllers, PHYs and verification IP. The silicon-proven IP supports the PIPE 4.4.1 and 5.1.1 specifications, using architectures allowing more than 36 dB channel loss and enabling straightforward 1 GHz timing closure. The controller is highly configurable with support for multiple data path widths, including a tested, silicon-proven 512-bit architecture, and provides the industry’s most extensive RAS-DES features to enable seamless bring-up and debug. The silicon-proven solution, already adopted by many customers, provides the full IBIS-AMI models needed to accurately simulate PCIe systems.

 

[1] From “Real-time oscilloscope analysis for 28/32-Gbps SerDes measurements,” a white paper by Brig Asay, Agilent Technologies, December 17, 2012