Designing for Effective Use of PCIe 6.0 Bandwidth

Richard Solomon, TMM, Synopsys

The PCI Express (PCIe) 6.0 specification introduced a new 64GT/s link speed, doubling the previously available PCIe bandwidth.  Designers adopting this new link speed need to be aware of significant changes that affect SoC designs beyond the obvious need to double on-chip bandwidth.  This article highlights those changes, which go well beyond the 64GT/s link speed itself.

PCIe 6.0 Electricals at 64GT/s - Fundamental Shift In Mechanism

In order to achieve this 64GT/s link speed, PCIe 6.0 utilizes Pulse Amplitude Modulation 4-level (PAM4) signaling, which encodes 4 voltage levels (2 bits) in the same Unit Interval (UI) as 32GT/s PCIe.  Figure 1 shows how this results in three eyes compared to the previous single eye.


Figure 1: PCIe 6.0 PAM4 signaling results in three eyes compared to the NRZ signal

 

Unsurprisingly, the reduced eye height and eye width increase the new link’s susceptibility to errors.  To mitigate this, the 6.0 specification implements a number of new features when operating at 64GT/s.  When mapping the four voltage levels to digital values, Gray coding is used to minimize errors within each UI, and precoding is applied by transmitters to minimize burst errors.  While these features are important, they have little impact on the digital logic of a PCIe 6.0 controller.  A further required feature, Forward Error Correction (FEC), is applied at the digital level to improve the bit error rate (BER) and has a significant impact on both the PCIe protocol and controller design.
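As a quick illustration of why Gray coding helps, the sketch below uses an assumed level-to-bit assignment (the normative mapping is defined in the PCIe 6.0 specification) to show that adjacent PAM4 voltage levels differ by only one bit, so a one-level slicing error corrupts a single bit rather than two.

```python
# Illustrative Gray-coded PAM4 mapping (level assignment assumed for
# illustration; the normative mapping is defined in the PCIe 6.0 spec).
GRAY_PAM4 = {0b00: -3, 0b01: -1, 0b11: +1, 0b10: +3}

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 2-bit symbols."""
    return bin(a ^ b).count("1")

# With Gray coding, adjacent voltage levels always differ by exactly one bit,
# so a one-level receiver slicing error corrupts one bit, not two.
levels = sorted(GRAY_PAM4, key=GRAY_PAM4.get)
for lo, hi in zip(levels, levels[1:]):
    assert hamming(lo, hi) == 1
print("adjacent PAM4 levels differ by exactly one bit")
```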

A New Generation of Protocol

In order to minimize the overhead of the additional FEC data, PCIe 6.0 introduces a new “FLIT-Mode” of operation, required at 64GT/s, which fundamentally changes PCIe data transfer from a symbol-based system to a larger quantum of transfer known as a FLIT (flow control unit).  This change requires substantial new design inside PCIe controllers.  As a consequence, PCIe 6.0 also introduces an entirely new header format used when operating in FLIT-Mode.  The new header simplifies decoding, better separates PCIe attributes, and allows for enhancements such as 14-bit tag support – compared to 10-bit tag support in PCIe 5.0.  When operating at 64GT/s, FLIT-Mode uses unencoded data (referred to as “1b1b encoding”), in contrast to the 128b/130b encoding used for link speeds of 8GT/s through 32GT/s and the classic 8b10b encoding used for link speeds of 2.5GT/s and 5GT/s.
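To make the FLIT overhead concrete, the rough bookkeeping below uses the field sizes commonly described for the 256-byte PCIe 6.0 FLIT (236 bytes of TLP data, 6 bytes of Data Link Layer payload, 8 bytes of CRC, and 6 bytes of FEC); treat the numbers as illustrative rather than normative.

```python
# Rough FLIT bookkeeping (field sizes as commonly described for the PCIe 6.0
# 256-byte FLIT; consult the specification for the normative layout).
FLIT_BYTES = 256
TLP_BYTES, DLP_BYTES, CRC_BYTES, FEC_BYTES = 236, 6, 8, 6
assert TLP_BYTES + DLP_BYTES + CRC_BYTES + FEC_BYTES == FLIT_BYTES

tlp_fraction = TLP_BYTES / FLIT_BYTES
print(f"TLP payload fraction per FLIT: {tlp_fraction:.1%}")   # ~92.2%
```

Amortizing fixed CRC and FEC fields over a fixed-size FLIT is what keeps the error-correction overhead modest, which is the motivation for requiring FLIT-Mode at 64GT/s.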

An unfortunate but expected consequence of these protocol changes is a significant increase in the silicon area consumed by a 64GT/s PCIe 6.0 controller compared to a 32GT/s controller of the same configuration.  Supporting 1b1b encoding adds a third Physical Layer path (alongside 8b10b and 128b/130b), along with additional logic in the Data Link Layer to handle the FLIT structure and related changes.  The new optimized headers used in FLIT-Mode also require new logic in the Transaction Layer, further adding to the gate count increase over 32GT/s solutions.

As has been common with the last few generations of PCIe, the move to 64GT/s increases the typical PIPE datapath width in order to maintain the same maximum clock frequency as the previous generation.  The typical PIPE width now doubles to 64 bits per lane to keep the clock at a 1GHz maximum, with an associated increase in silicon area.  Consequently, 4-lane designs are now at 256 bits and 8-lane designs at 512 bits – both generally acceptable sizes for current 32GT/s implementations.  However, 16-lane designs now need 1024-bit datapaths, which gives rise to a new concern for certain users.
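A back-of-the-envelope check (illustrative arithmetic only) shows how the 64-bit-per-lane PIPE width keeps the controller clock at 1GHz and how the aggregate datapath scales with lane count:

```python
# Illustrative arithmetic: keeping the controller clock at 1GHz at 64GT/s
# implies 64 bits of PIPE datapath per lane, and the total width scales
# directly with lane count.
PIPE_BITS_PER_LANE = 64      # typical per-lane PIPE width at 64GT/s
RATE_GBPS_PER_LANE = 64      # 64GT/s, 1b1b (unencoded) data in FLIT-Mode

clock_ghz = RATE_GBPS_PER_LANE / PIPE_BITS_PER_LANE          # 1.0 GHz
for lanes in (4, 8, 16):
    print(f"x{lanes}: {lanes * PIPE_BITS_PER_LANE}-bit datapath at {clock_ghz:.1f} GHz")
# x4 -> 256-bit, x8 -> 512-bit, x16 -> 1024-bit
```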

Most current CPU designs use 64-byte (512-bit) cache lines, so utilizing datapaths wider than that is extremely unappealing due to the need to process two (or more) cache lines per clock cycle.  Such designers will most likely seek solutions which can maintain 512-bit datapath widths – even if more than one application interface is required as a result.  Even designs not using such a CPU for data movement may require multiple internal agents in order to saturate their 64GT/s PCIe link, so they too may find multi-port interfaces attractive.  Of course, the additional buffers and the more complex ordering logic required for multi-port interfaces result in a further increase in the gate count of such 64GT/s PCIe controllers.

Multiple Packets Per Clock Cycle

Datapath widths greater than 128 bits can result in SoCs needing to process more than one PCIe packet per clock cycle.  The smallest PCIe Transaction Layer Packet (TLP) can be considered to be 3 DWORDs (12 bytes) plus the 4-byte LCRC, for a total of 16 bytes (128 bits).  At 8GT/s, a 500MHz, 16-bit PIPE interface to the PCIe PHY was most common, meaning that link widths of 8 lanes or below (16 bits / lane * 8 lanes = 128 bits) would transfer at most a single complete packet per clock.  However, 16-lane implementations (16 bits / lane * 16 lanes = 256 bits) could encounter two complete packets per clock.  Table 1 shows that the problem worsens as link speed goes up, with the number of complete packets per clock growing correspondingly, thereby affecting more and more designs with each new link speed increase.


Table 1: Datapath widths increase with link speed, causing more configurations to exceed the 128-bit threshold
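The arithmetic behind Table 1 can be reconstructed roughly as follows (a simplified sketch that only counts minimum-size TLPs against the raw datapath width):

```python
# Simplified sketch behind Table 1: how many complete minimum-size TLPs
# (3 DWORD header + 4-byte LCRC = 16 bytes = 128 bits) fit in one clock
# for a given datapath width.
MIN_TLP_BITS = 16 * 8
for datapath_bits in (128, 256, 512, 1024):
    packets = datapath_bits // MIN_TLP_BITS
    print(f"{datapath_bits}-bit datapath: up to {packets} complete TLP(s) per clock")
```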

Multiple Application Interfaces – Improving Small Packet Link Utilization

In addition to the factors discussed above, utilizing multiple application interfaces provides a significant improvement in overall performance when transfer sizes are smaller than the interface width.  This happens because application interfaces are almost always designed to transfer no more than one packet per clock cycle, meaning small packets leave a lot of that width unused.  Thus multiple narrower interfaces are more efficient for smaller packets than a single wider interface would be. 
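A toy model captures the effect, under the assumptions that each application interface starts at most one packet per clock, interfaces split the total datapath evenly, and only payload bytes are counted:

```python
# Toy utilization model: each interface starts at most one packet per clock,
# interfaces split the datapath evenly, and only payload bytes are counted.
def link_utilization(payload_bytes: int, datapath_bits: int, n_interfaces: int = 1) -> float:
    total_bytes_per_clk = datapath_bits // 8
    if_bytes_per_clk = total_bytes_per_clk // n_interfaces
    delivered = n_interfaces * min(payload_bytes, if_bytes_per_clk)
    return min(1.0, delivered / total_bytes_per_clk)

print(link_utilization(64, 1024, n_interfaces=1))   # 0.5 - small packets starve one wide interface
print(link_utilization(64, 1024, n_interfaces=2))   # 1.0 - two narrower interfaces keep the link full
print(link_utilization(128, 1024, n_interfaces=1))  # 1.0 - or use larger packets
```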

Figure 2 shows transmit link utilization in 64GT/s FLIT-Mode on a Synopsys PCI Express 6.0 controller IP sending a continuous stream of Posted TLPs.  For larger datapath widths, it’s clear that larger packets are required to maintain full link utilization with a single application interface – with 128-byte payloads needed for 1024-bit interfaces.


Figure 2: Transmit link utilization for various payload sizes and datapath widths in 64GT/s FLIT-Mode with a single application interface

 

When the Synopsys controller is instead configured for two application interfaces and the same traffic patterns are run, there is a marked improvement – with 64-byte payloads now yielding full link utilization even for 1024-bit datapaths, as seen in Figure 3.


Figure 3: Transmit link utilization for various payload sizes and datapath widths in 64GT/s FLIT-Mode with a two-application-interface configuration

Relaxed Ordering – A Crucial Factor in 64GT/s Utilization

PCIe ordering rules require posted transactions, such as memory writes, to remain in order unless either the Relaxed Ordering (RO) or ID Ordering (IDO) attribute is set in the packet header.  A posted transaction with RO set is allowed to pass any previous posted transaction, while one with IDO set is only allowed to pass previous transactions with different RequesterIDs, meaning they came from different logical agents on the PCIe link.
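In rough pseudocode, the pass-permission check for posted transactions looks like the sketch below (simplified; the full ordering rules in the PCIe specification cover additional transaction types and attributes):

```python
# Minimal sketch of the pass-permission check for posted transactions
# (simplified; the full PCIe ordering rules cover more cases).
from dataclasses import dataclass

@dataclass
class PostedTlp:
    requester_id: int
    relaxed_ordering: bool = False   # RO attribute
    id_ordering: bool = False        # IDO attribute

def may_pass(newer: PostedTlp, older: PostedTlp) -> bool:
    if newer.relaxed_ordering:
        return True          # RO: may pass any earlier posted transaction
    if newer.id_ordering and newer.requester_id != older.requester_id:
        return True          # IDO: may pass traffic from a different requester
    return False             # otherwise strict posted ordering applies

# A strongly-ordered write may never pass an earlier posted write from the
# same requester - the property relied on in Example 4 below.
assert not may_pass(PostedTlp(0x0100), PostedTlp(0x0100, relaxed_ordering=True))
```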

The following scenarios show how these attributes are critically important for reaching full 64GT/s PCIe performance.  Even SoC designers whose implementation may not suffer from these issues should consider that their link partner on PCIe may – and that by appropriately setting the RO and/or IDO attributes, their SoC could see improved performance from that link partner!  The examples, shown in Tables 2-5, all utilize a sequence of 4 PCIe memory writes of 256 bytes each, representing delivery of a 1KB payload to address 1000, followed by a 4-byte PCIe memory write representing delivery of a “successful completion” indication to address 7500.  Each row of a table represents a chunk of time, while the three columns indicate (from left to right) the arrival of the transaction at the PCIe pins, the application interface, and the SoC memory.  Any scenario in which the successful completion indication arrives in memory before all 4 memory writes represents a failure, as software could proceed with data processing immediately upon receipt of the indication – and therefore prior to delivery of the correct data!

Example 1: One application interface clearly works correctly so long as its bandwidth at least equals PCIe bandwidth.


Table 2: Single full-rate application interface results in correctly delivered data

 

Example 2: Dual interfaces will generally fail, as there is no guarantee of arrival order between two independent paths to memory in the SoC.


Table 3: Dual half-rate application interfaces shown failing due to arrival of successful completion prior to arrival of all data

 

Example 3: Forcing strongly-ordered traffic to a single interface avoids the out-of-order arrival but quickly falls behind the PCIe link by virtue of being unable to use the full internal bandwidth.


Table 4: Dual half-rate application interfaces shown failing due to inability to deliver data at full speed

 

Example 4: When the link partner marks the data payload packets as RO and the successful completion packet as strongly ordered, two half-rate interfaces can transfer successfully.  Notice that while the RO payload data arrives out of order, the non-RO write to address 7500 is not allowed to pass the payload writes and thus is not sent to the application interface until all the preceding writes have been sent.


Table 5: Dual half-rate application interfaces shown succeeding by using Relaxed Ordering on payload data

 

With some careful attention to the types of data being sent, SoC designers can set the RO attribute in their outbound data streams and dramatically improve PCIe link performance.  The IDO ordering attribute provides similar benefits in many cases, and most PCIe implementations can apply it to every packet they transmit! 

Packets with IDO set are only allowed to pass previous transactions with different RequesterIDs, which means the packets came from different logical agents on the PCIe link.  Most endpoint implementations (both single-function and multi-function) are indifferent to their data ordering relative to traffic to/from other PCIe endpoints, since they’re generally only communicating with the root complex.  Likewise, most root complexes don’t generally mix a single stream of traffic among multiple endpoints, so in both situations there is no concern with ordering relative to the RequesterID(s) of other devices.  Similarly, most multi-function endpoints are indifferent to data ordering between their own functions, so there is no concern with ordering between their own RequesterIDs either.  As a result, most implementations can already set IDO on all transactions they initiate.
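A transmit-side attribute policy following this reasoning could be as simple as the hypothetical helper below (illustrative only, not a Synopsys API): IDO on everything the device initiates, RO on bulk payload writes only, and strong ordering on control or flag writes.

```python
# Hedged policy sketch (hypothetical helper, not a Synopsys API): set IDO on
# everything the device initiates, and RO on bulk payload writes only, so
# control/"flag" writes stay strongly ordered behind their payload.
def tx_attributes(is_bulk_payload: bool) -> dict:
    return {
        "IDO": True,                   # safe for most endpoints and root complexes
        "RO": bool(is_bulk_payload),   # never set on the completion-indication write
    }

print(tx_attributes(True))    # payload write -> {'IDO': True, 'RO': True}
print(tx_attributes(False))   # flag write    -> {'IDO': True, 'RO': False}
```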

Small Packet Inefficiencies

While most devices have little or no control over their traffic patterns, it’s important to realize just how little bandwidth can be achieved with small packets.  Parameters such as Maximum Payload Size and Round Trip Time (RTT) are used by Synopsys CoreConsultant to configure buffer sizes, the number of outstanding PCIe tags, and other critical parameters in the PCIe 6.0 controller.  Figures 4 and 5 show data taken from simulations of Synopsys’ 64GT/s x4 controller configured for a 512-byte maximum payload size and 1000ns RTT, sweeping across a range of payload size and RTT values.  If the same sweeps were repeated with the controller configured for smaller values of either parameter, performance would drop once the swept values passed the range for which the controller was optimized.

Figure 4: Posted packet inefficiencies for small sizes, swept across a range of Round Trip Times


Figure 5: Non-Posted packet inefficiencies for small sizes, swept across a range of Round Trip Times
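The sensitivity to payload size and RTT follows from simple bandwidth-delay arithmetic.  The sketch below uses illustrative numbers for a 64GT/s x4 link (an assumed ~92% FLIT efficiency, with header and flow-control overhead ignored) to estimate how many non-posted requests must be outstanding to keep the link busy:

```python
import math

# Illustrative bandwidth-delay arithmetic for a 64GT/s x4 link (assumed ~92%
# FLIT efficiency; header and flow-control overhead ignored).
LANES, RATE_GBPS = 4, 64
FLIT_EFFICIENCY = 236 / 256          # TLP bytes per 256-byte FLIT (assumed)
RTT_NS, PAYLOAD_BYTES = 1000, 512

bytes_per_ns = LANES * RATE_GBPS / 8 * FLIT_EFFICIENCY     # ~29.5 bytes/ns
in_flight = bytes_per_ns * RTT_NS                          # bytes that must be in flight
tags_needed = math.ceil(in_flight / PAYLOAD_BYTES)
print(f"~{in_flight:.0f} bytes in flight -> ~{tags_needed} outstanding requests")
```

Configuring too few tags or too little buffering relative to numbers like these caps throughput regardless of link speed, which is why these parameters are set at controller configuration time.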

Conclusions for Optimizing 64GT/s Designs

SoC designers implementing 64GT/s PCIe interfaces should make sure they support the relaxed ordering attributes as a critical part of enabling high performance throughout the 64GT/s ecosystem.  Set the relaxed ordering attributes in transmitted data whenever possible: RO on payload writes but not on their associated control writes, and IDO on all packets unless the application has unusual requirements.  Also consider leveraging the relaxed ordering attributes to allow reordering of data in the receive path when appropriate.

Support for multiple application interfaces is becoming more widespread, and SoC designers should consider them when application interface bandwidth is lower than the line rate and/or when typical traffic patterns are smaller than the application interface width.

Designers implementing 64GT/s PCIe for x4 and wider links need to pay attention to cases where multiple packets arrive per clock cycle, and should consider multiple application interfaces depending on their typical traffic size.

All 64GT/s implementers should be prepared for 1GHz (or faster) design implementation, and all should be sure to check their assumptions with pre-silicon performance simulation.

Synopsys offers a complete PCIe 6.0 solution, including controller, PHY, and verification IP, that supports the relaxed ordering attributes, PAM4 signaling, FLIT-Mode, the L0p power state, architectures up to 1024 bits, and options for multiple application interfaces, all to make the transition to 64GT/s PCIe design easier.