DesignWare Technical Bulletin

Implementing Large Numbers of Virtual Functions with PCI Express SR-IOV

By Richard Solomon, Technical Marketing Manager, PCI Express and contributor to the SR-IOV specification

One of the most powerful features of PCI Express for today’s data centers is I/O virtualization. I/O virtualization improves the performance of enterprise servers by giving virtual machines direct access to hardware I/O devices. The specification that has gained traction in the market is Single-Root I/O Virtualization (SR-IOV). The SR-IOV specification allows one PCI Express device to present itself to the host as multiple distinct “virtual” devices. This is done through a new PCI Express capability structure added to a traditional PCI Express function (i.e., a Physical Function (PF)). The PF provides control over the creation and allocation of new Virtual Functions (VFs). VFs share the device’s underlying hardware and PCI Express link (Figure 1). A key feature of the SR-IOV specification is that VFs are very lightweight so that many of them can be implemented in a single device.

Figure 1: VFs present independent views and configuration of the hardware underlying the PF, while still using one PCIe interface to move data

Virtual Function Requirements

SoC architects often struggle to determine how many virtual functions to provision in their devices. A minimal implementation of a single VF requires around 1,000 bits of storage just to cover the lightweight definition. If data center-oriented features such as Advanced Error Reporting (AER) and Message Signaled Interrupts (MSI/MSI-X) are needed in each VF, the number of bits can grow by a factor of 3 to 4. Additional logic is required in the SoC application logic to virtualize the SoC’s mission-critical functionality, but that is outside the scope of this article. At the very least, SoC architects should provision one VF for every virtual machine the device is expected to support.
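As a rough illustration of the sizing arithmetic above, the following Python sketch multiplies out the figures quoted in this section (about 1,000 bits for a minimal VF, growing by a factor of 3 to 4 with AER and MSI/MSI-X). The function name and constants are illustrative assumptions rather than part of any specification or product, and the estimate covers configuration storage only, not the application logic.

```python
# Rough configuration-storage estimate built from the figures quoted above.
# All names here are hypothetical and exist only for illustration.
BASE_BITS_PER_VF = 1_000      # "around 1,000 bits" for a minimal VF
FEATURE_GROWTH = (3, 4)       # AER plus MSI/MSI-X: growth "by a factor of 3 to 4"


def vf_storage_bits(num_vfs, full_featured=False):
    """Return a (low, high) estimate of storage bits for num_vfs VFs."""
    if num_vfs < 0:
        raise ValueError("num_vfs must be non-negative")
    if full_featured:
        low_factor, high_factor = FEATURE_GROWTH
    else:
        low_factor = high_factor = 1
    base = num_vfs * BASE_BITS_PER_VF
    return base * low_factor, base * high_factor


if __name__ == "__main__":
    for count in (64, 256, 1024):
        low, high = vf_storage_bits(count, full_featured=True)
        print(f"{count:5d} VFs: {low / 8 / 1024:.1f} to {high / 8 / 1024:.1f} KiB")
```

Running the script prints an approximate storage range for a few VF counts, which can help frame the flip-flop versus bulk-storage trade-off discussed later.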

Consider a four-socket server with 8-core processors: it’s quite feasible for that system to run 32 virtual machines (one per core), or 64 or more with hyper-threading. For some types of applications, such as high-end networking or storage processors, it may make sense to sub-divide SoC resources even further into individual VFs controlling those resources. This allows the virtual machine manager to partition the SoC’s capacity unequally among virtual machines by allocating different numbers of VFs to each one. Architects for these types of devices could easily be faced with a requirement for hundreds or even thousands of VFs!
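The provisioning arithmetic in this example, and the idea of handing out VFs unequally, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weighting scheme, the function names, and the virtual machine labels are hypothetical, and a real virtual machine manager may use an entirely different policy.

```python
def max_virtual_machines(sockets, cores_per_socket, threads_per_core=1):
    """One virtual machine per hardware thread, as in the example above."""
    return sockets * cores_per_socket * threads_per_core


def allocate_vfs(total_vfs, weights):
    """Split total_vfs among VMs in proportion to weights, at least one VF each.

    weights maps a (hypothetical) VM name to a positive integer weight.
    """
    if not weights:
        raise ValueError("at least one virtual machine is required")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    if total_vfs < len(weights):
        raise ValueError("need at least one VF per virtual machine")

    spare = total_vfs - len(weights)          # VFs left after one per VM
    total_weight = sum(weights.values())
    shares = {vm: 1 + spare * w // total_weight for vm, w in weights.items()}

    # Integer division may leave a few VFs unassigned; give them to the
    # most heavily weighted VMs first.
    leftover = total_vfs - sum(shares.values())
    for vm in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[vm] += 1
    return shares


if __name__ == "__main__":
    print(max_virtual_machines(4, 8))         # one VM per core
    print(max_virtual_machines(4, 8, 2))      # with two threads per core
    print(allocate_vfs(64, {"db": 4, "web": 2, "batch": 1, "test": 1}))
```

Guaranteeing one VF per virtual machine before dividing the remainder mirrors the minimum provisioning rule stated in the previous section.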

Determining the Right Storage Solution for Virtual Functions

The PCIe configuration space has traditionally been implemented in simple flip-flop-based registers. This is a good fit due to the potential for a PCIe device to have six or so address decoders, various control bits, numerous error and other status bits – all of which operate completely independently.

Flip-flop-based address decoders minimize latency, while flip-flop-based control and status registers can be routed directly to/from the relevant logic, all simplifying the designer’s work and making for straightforward synthesis. Unfortunately, as the number of VFs increases, and as the number of PCIe capabilities per VF increases (particularly register-heavy features such as AER and MSI-X), the gate cost of a register implementation can become burdensome. Adding a couple of hundred fully featured VFs to a PCIe controller could add as many as 2 to 3 million gates to a design!

Since the SR-IOV specification was written to support over 64 thousand VFs in a single device, the PCI-SIG put a lot of effort into enabling implementations other than directly mapping to flip-flops. Wherever possible, control and status functionality for all VFs was consolidated in their associated PF. All the PCI Express link-level controls fall into this category, as one VF shouldn’t take down the link it shares with other VFs. Only controls that absolutely must be implemented individually for each VF (such as Bus Master Enable) are replicated. Address decoding is greatly simplified by contiguously locating all the VF copies of each PF region, so only two additional decoders per region are required for any number of VFs, rather than an additional decoder per VF (Figure 2). Because of that effort, most of the storage for a VF can be fairly slow and high latency in comparison to the same-clock-cycle access time of directly mapped flip-flops.

Figure 2: Memory address decoding for VFs is greatly simplified because the VF memory regions are specified to be contiguously located
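To make the benefit of contiguous VF regions concrete, here is a minimal Python sketch of the decode step. It assumes each VF owns an equally sized, power-of-two slice of one contiguous window, so a single range check plus a shift and mask identifies the target VF. The function and parameter names are hypothetical, and the sketch models the arithmetic only, not any particular controller's decoder hardware.

```python
def decode_vf_access(address, vf_bar_base, vf_region_size, num_vfs):
    """Map a memory address to (vf_index, offset_within_vf), or None on a miss.

    Assumes VF n (0-based in this sketch) owns the address range
    [vf_bar_base + n * vf_region_size, vf_bar_base + (n + 1) * vf_region_size).
    """
    if vf_region_size <= 0 or vf_region_size & (vf_region_size - 1):
        raise ValueError("vf_region_size must be a power of two")

    # One comparison against the whole window covers every VF at once.
    window_end = vf_bar_base + vf_region_size * num_vfs
    if not (vf_bar_base <= address < window_end):
        return None

    relative = address - vf_bar_base
    shift = vf_region_size.bit_length() - 1   # log2 of the region size
    return relative >> shift, relative & (vf_region_size - 1)


if __name__ == "__main__":
    base, size, count = 0x9000_0000, 0x4000, 64   # illustrative values only
    print(decode_vf_access(base + 3 * size + 0x10, base, size, count))
    print(decode_vf_access(base - 4, base, size, count))
```

The cost of this decode does not grow with the number of VFs, which is the property the contiguous layout is meant to provide.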

With some care, PCI Express controller designers can segregate a small amount of high-speed, low-latency storage (e.g., flip-flops) per VF, and add logic that merges in data from slower bulk storage and, in some cases, data from the physical function (Figure 3). This naturally comes at the cost of increased latency for system reads and writes of the device’s configuration space, but those registers are accessed at initialization time and are not part of the device’s performance path for I/O.

Figure 3: Merging data from the slower bulk storage with information from the underlying physical function enables storage of a VF configuration space outside the PCI Express controller
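The merge step can be modeled as a per-register bit mask that says which bits come from fast per-VF storage, which are mirrored from the PF, and which fall through to bulk storage. The Python sketch below is one hypothetical way to express that, not a description of any actual controller. The only real register detail used is that Bus Master Enable is bit 2 of the Command register at configuration offset 0x04; the PF-sourced offset 0x70 and all data values are made up for illustration.

```python
MASK32 = 0xFFFF_FFFF


def merged_config_read(vf, offset, layout, fast_regs, pf_regs, bulk_read):
    """Assemble one 32-bit configuration read for a VF from three sources.

    layout    : offset -> (fast_mask, pf_mask); any other bits come from bulk
    fast_regs : (vf, offset) -> value held in per-VF flip-flops
    pf_regs   : offset -> value held by the physical function
    bulk_read : callable(vf, offset) -> value from slow bulk storage
    """
    fast_mask, pf_mask = layout.get(offset, (0, 0))
    bulk_mask = ~(fast_mask | pf_mask) & MASK32

    value = fast_regs.get((vf, offset), 0) & fast_mask
    value |= pf_regs.get(offset, 0) & pf_mask
    if bulk_mask:                       # skip the slow access when not needed
        value |= bulk_read(vf, offset) & bulk_mask
    return value


if __name__ == "__main__":
    layout = {
        0x04: (0x0000_0004, 0x0000_0000),  # Bus Master Enable kept in flops
        0x70: (0x0000_0000, 0x0000_FFFF),  # hypothetical link bits from the PF
    }
    fast_regs = {(2, 0x04): 0x0000_0004}
    pf_regs = {0x70: 0x1234_ABCD}

    def bulk_read(vf, offset):
        return 0xA5A5_0000 | vf            # stand-in for an SRAM lookup

    for off in (0x04, 0x70, 0x100):
        print(hex(off), hex(merged_config_read(2, off, layout, fast_regs,
                                               pf_regs, bulk_read)))
```

Keeping the masks in a table means the split between fast storage, PF data, and bulk storage can be changed without touching the merge logic itself.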

All accesses to and from an SR-IOV device’s main memory address space can proceed at the same speed whether the configuration space is implemented in flip-flops or slower storage. Configuration space accesses are only made when the host first initializes the device, during link-level error handling, and at other times when virtual machine operation is going to be suspended anyway. Note also that the SR-IOV specification presumes the virtual machine manager will trap and intercept every configuration space access from a virtual machine, so adding even a few dozen clock cycles in hardware would be lost in the noise of the overall system configuration space access time.

On-Chip Memory vs. CPU-Based Storage

Since the speed of configuration access isn’t important to device performance, PCI Express controller designers may choose to locate the bulk storage in on-chip SRAMs, or even in dedicated off-chip memory such as DDR memory. Given today’s high-density silicon processes, fairly large SRAMs can be implemented at very moderate area cost – certainly in comparison to the millions of gates required for a flip-flop-based implementation of a multi-thousand VF device! This should be very encouraging news to SoC designers faced with such high VF counts, but it still requires a moderately large amount of physical memory dedicated to SR-IOV. In cases where the system will not use all the VFs a device is capable of provisioning, or in cases where a single piece of silicon is intended for use in multiple products of differing capabilities, it would be advantageous to not waste that memory.

Some SoC architectures already include dedicated memories that their architects can easily partition, so that any memory not needed for SR-IOV can be repurposed for the underlying application logic. In other architectures, though, the SoC architect may wish to consider whether bulk storage can be offloaded to a local CPU via an interrupt-type mechanism (Figure 4). In this scheme, the PCI Express controller presents a bulk-storage read or write request that is actually serviced by the local CPU, which returns the data or acknowledgement to the controller hardware. This provides a very straightforward mechanism for partitioning RAM, as the SoC firmware can choose how much CPU memory to allocate without requiring any additional hardware or multi-port memories to avoid waste. Devices using this approach also have great flexibility in PCIe feature support, as most of the VF features can be executed in firmware. With a different firmware build, or just a different configuration, the same hardware can support different SKUs having different VF counts, AER support or not, MSI-X support or not, etc.

Figure 4: Bulk storage can be implemented directly in on-chip memory, or indirectly via a local CPU servicing read/write requests
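The CPU-serviced option can be pictured as a small request and completion exchange between the controller and firmware. The Python sketch below models only the firmware side, under assumptions that are not taken from any real product: the request fields, the class names, and the (ok, data) completion format are all hypothetical. It shows how firmware might size its backing memory from the number of VFs actually enabled.

```python
from dataclasses import dataclass


@dataclass
class BulkRequest:
    """Hypothetical request the controller posts before interrupting the CPU."""
    vf: int
    offset: int            # byte offset into the VF's bulk register space
    is_write: bool
    data: int = 0


class FirmwareBulkStore:
    """Firmware-side handler backing VF registers with ordinary CPU memory."""

    def __init__(self, enabled_vfs, dwords_per_vf):
        # Memory is sized by what this SKU or configuration enables,
        # not by the maximum VF count the silicon could support.
        self.dwords_per_vf = dwords_per_vf
        self.mem = [[0] * dwords_per_vf for _ in range(enabled_vfs)]

    def service(self, req):
        """Handle one request and return an (ok, data) completion."""
        index, misaligned = divmod(req.offset, 4)
        in_range = (0 <= req.vf < len(self.mem)
                    and 0 <= index < self.dwords_per_vf)
        if misaligned or not in_range:
            return False, 0
        if req.is_write:
            self.mem[req.vf][index] = req.data & 0xFFFF_FFFF
            return True, 0
        return True, self.mem[req.vf][index]


if __name__ == "__main__":
    store = FirmwareBulkStore(enabled_vfs=4, dwords_per_vf=16)
    print(store.service(BulkRequest(vf=1, offset=0x08, is_write=True, data=0xDEADBEEF)))
    print(store.service(BulkRequest(vf=1, offset=0x08, is_write=False)))
    print(store.service(BulkRequest(vf=9, offset=0x08, is_write=False)))
```

Because the backing store is an ordinary firmware data structure, changing the VF count or the per-VF register footprint in this model is a constructor argument rather than a hardware change.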

Summary

SoC architects and PCI Express controller designers should evaluate their total virtual function count requirement, and based on the count and their silicon area constraints, determine whether a flip-flop, SRAM, or CPU-based storage solution is right for their application. Synopsys offers a silicon-proven DesignWare® IP solution for PCI Express that is compliant with the latest PCI-SIG and SR-IOV specifications and offers the flexibility to implement thousands of VFs in a mix of flip-flop-based registers, RAMs, or with the use of a local CPU. The SR-IOV implementation in the DesignWare PCI Express IP is configurable and scalable, enabling designers to improve time-to-market. Learn more about Synopsys’ DesignWare IP for PCI Express Single Root I/O Virtualization at http://www.synopsys.com/dw/ipdir.php?ds=dwc_pci_express_sriov