This issue appeared when PCIe 3.0 introduced 8 GT/s signaling. There are several ways to explain why it happens; we will consider the simplest high-level view. The smallest PCIe Transaction Layer Packet (TLP) consists of three 32-bit “DWORDs” plus a single 32-bit LCRC value, for a total of 128 bits. At 8 GT/s it is common for the PIPE interface to run 16 bits wide per lane at 500 MHz, so consider an 8-lane (x8) implementation: 8 lanes * 16 bits yields a 128-bit datapath. This width equals the size of a minimum-sized packet, so at most one complete packet can be on the interface in any cycle. When the link is 16 lanes (x16), however, the datapath must be 256 bits wide, and two complete packets can arrive in a single cycle.
The problem worsens with PCIe 4.0’s 16 GT/s, because keeping the PHY interface from exceeding 500 MHz requires 32 bits per lane. A x4 16 GT/s link still fits into the 128-bit datapath (32 bits * 4 lanes), but a x8 link (32 bits * 8 lanes = 256 bits) now exhibits the same two-packets-per-cycle issue that a x16 8 GT/s link did. A x16 link running at 16 GT/s would require 512 bits (32 bits * 16 lanes) and could therefore carry as many as four packets in each cycle.
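The arithmetic in the two paragraphs above can be sketched in a few lines of Python. This is only an illustration of the simplified view used here: the function names are hypothetical, the 500 MHz PIPE clock is the assumption stated in the text, and framing and encoding overhead are ignored.

```python
MIN_TLP_BITS = 128     # 3 DWORDs + 32-bit LCRC, as described above
PIPE_CLOCK_MHZ = 500   # assumed PHY interface clock ceiling from the text


def bits_per_lane(rate_gtps: int, clock_mhz: int = PIPE_CLOCK_MHZ) -> int:
    """Per-lane PIPE width needed to carry rate_gtps at clock_mhz."""
    return rate_gtps * 1000 // clock_mhz


def datapath_bits(rate_gtps: int, lanes: int) -> int:
    """Total datapath width for a link of the given rate and lane count."""
    return bits_per_lane(rate_gtps) * lanes


def max_packets_per_cycle(rate_gtps: int, lanes: int) -> int:
    """Complete minimum-sized TLPs that fit in one cycle.

    A result of 0 means a minimum-sized packet spans more than one cycle.
    """
    return datapath_bits(rate_gtps, lanes) // MIN_TLP_BITS


if __name__ == "__main__":
    for rate, lanes in [(8, 8), (8, 16), (16, 4), (16, 8), (16, 16)]:
        print(f"{rate} GT/s x{lanes}: {datapath_bits(rate, lanes)} bits, "
              f"up to {max_packets_per_cycle(rate, lanes)} packet(s) per cycle")
```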
Because application logic in an SoC generally depends on receiving packets one at a time, the PCI Express controller designer must choose how to handle multiple packets per cycle.
Option 1: Limit interface width
Limiting the interface width to 128 bits removes the multiple-packet issue but requires a higher clock frequency: 8 GT/s x16 = 1 GHz; 16 GT/s x8 = 1 GHz; 16 GT/s x16 = 2 GHz. While this is easy from an architectural and RTL design standpoint for both the controller designer and the controller user (SoC designer), gate-level implementation will be extremely challenging. The difficulty of closing timing from flop to flop is apparent, but finding RAMs of suitable speed in the desired sizes may be even harder. Fast memories are often not large memories.
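The frequencies quoted above follow from dividing the raw link bandwidth by the datapath width. A minimal sketch, assuming the same simplification as the text (no encoding or framing overhead) and using a hypothetical helper name:

```python
def required_clock_ghz(rate_gtps: float, lanes: int, width_bits: int = 128) -> float:
    """Clock frequency (GHz) needed to move rate_gtps * lanes Gb/s
    across a datapath that is width_bits wide."""
    return rate_gtps * lanes / width_bits


if __name__ == "__main__":
    for rate, lanes in [(8, 16), (16, 8), (16, 16)]:
        print(f"{rate} GT/s x{lanes} at 128 bits: "
              f"{required_clock_ghz(rate, lanes)} GHz")
```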
Option 2: Provide multiple packet paths
A second option for the controller designer is to push the problem to the controller user (SoC designer) and insist that the user accept multiple packets per cycle. This forces the controller user to multi-thread an application interface that is traditionally single-threaded, and pushes significant responsibility for obeying PCIe ordering rules back into the application. While this option theoretically provides the best possible performance, the cost to the controller user makes it impractical.
Option 3: Serialize the data stream
The controller designer can instead guarantee never to issue multiple packets per clock by using internal buffering and logic duplication to hold back any packets that would appear simultaneously and present them to the application in sequence. This is the easiest approach for the controller user (SoC designer) but comes at an implementation cost to the controller designer. It also opens up the possibility of lower than theoretical maximum performance: a continuous stream of back-to-back minimum-sized packets must eventually fill the controller’s buffers and force backpressure on the PCI Express link.
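The buffering behavior described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any real controller: the `serialize` function, the fixed buffer depth, and the one-packet-per-cycle drain are all hypothetical, and real links apply backpressure through flow control rather than by stopping mid-cycle.

```python
from collections import deque


def serialize(cycles, depth):
    """Toy model of a serializing buffer.

    cycles: list of lists; each inner list holds the packets arriving
            in that cycle (up to two for a 256-bit datapath).
    depth:  assumed buffer capacity in packets.

    At most one packet per cycle is handed to the application, in arrival
    order. Returns (delivered, stall_cycle), where stall_cycle is the first
    cycle in which a packet found the buffer full, or None if that never
    happened.
    """
    fifo = deque()
    delivered = []
    for cycle, packets in enumerate(cycles):
        for packet in packets:
            if len(fifo) == depth:
                return delivered, cycle   # buffer full: backpressure needed
            fifo.append(packet)
        if fifo:
            delivered.append(fifo.popleft())  # one packet per cycle out
    while fifo:                               # input ended: drain the rest
        delivered.append(fifo.popleft())
    return delivered, None


if __name__ == "__main__":
    # Bursty traffic: order is preserved and the buffer never fills.
    print(serialize([[0, 1], [], [2], [3, 4], []], depth=3))
    # Continuous minimum-sized packets, two per cycle: the buffer fills.
    print(serialize([[0, 1], [2, 3], [4, 5], [6, 7]], depth=3))
```

With two arrivals and one departure per cycle, occupancy grows by one packet each cycle, so a deeper buffer only delays the point at which backpressure is needed.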
Option 4: Combination of Option 1 (Limit interface width) and Option 3 (Serialize the data stream)
The most attractive option is a combination of limiting interface width and serialization to balance implementation complexity against maximum clock frequency. As shown in Table 1, the controller designer can optimize for an attainable frequency by adjusting the datapath width.
For example, a 16 GT/s x16 link might be designed at 256 bits and accept two packets per cycle to limit the maximum frequency to 1 GHz, versus the 2 GHz that would be required to avoid the issue altogether. Likewise, running at 1 GHz might be more attractive than the complexity of handling four packets per cycle, which a 512-bit datapath would require.
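One way to reason about this trade-off is to pick the narrowest datapath whose clock stays at or below a target frequency, then see how many packets per cycle the serialization logic must handle. The sketch below is illustrative only: `pick_width`, the candidate widths, and the frequency targets are assumptions, and encoding overhead is ignored as elsewhere in this discussion.

```python
MIN_TLP_BITS = 128  # minimum-sized TLP, as described earlier


def pick_width(rate_gtps, lanes, fmax_ghz, widths=(128, 256, 512)):
    """Return (width_bits, clock_ghz, packets_per_cycle) for the narrowest
    candidate width whose required clock does not exceed fmax_ghz,
    or None if no candidate qualifies."""
    for width in widths:
        clock_ghz = rate_gtps * lanes / width
        if clock_ghz <= fmax_ghz:
            return width, clock_ghz, width // MIN_TLP_BITS
    return None


if __name__ == "__main__":
    for fmax in (2.0, 1.0, 0.5):
        print(f"16 GT/s x16, fmax {fmax} GHz ->", pick_width(16, 16, fmax))
```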