CXL 2.0 and 3.0 for Storage and Memory Applications

Richard Solomon

Oct 16, 2022 / 11 min read

Table of Contents

PCIe vs CXL for Memory/Storage
Advantages of CXL for Emerging HPC Applications Memory Composability and Disaggregation
CXL for Memory Disaggregation and Composability
Summary

For many years, memory and storage have been clearly distinct things. Memory is a short-term place to hold data while a nearby CPU or accelerator processes that data. It is effectively working memory placed near the elements that need the data, providing rapid access with very low latency. Typical examples include SRAM and DRAM in its various forms: DDR, LPDDR and, more recently, HBM. Besides providing rapid, low-latency access, these devices are also volatile: they must remain powered on to retain their data.

Storage, on the other hand, is long-term, non-volatile memory: its data is retained even when the device is powered off, but its access times and latency are much greater than those of traditional memory devices. Typical examples include hard disk drives (HDDs) and solid-state drives (SSDs). Table 1 compares some of the characteristics of memory and storage.

	Memory	Storage
Examples	DDR, LPDDR, HBM	SSD, HDD
Proximity to CPU	Near or embedded	Farther
Access time	Fast, low latency	Slower, higher latency
Permanence	Volatile, requires power	Persistent, no power required
Capacity	Limited by physical constraints	Not inherently limited
Data access size	Byte	Blocks: pages or sectors (kBytes)
Interface to CPU	Various JEDEC DDR standards	PCIe with NVMe, other (SATA, SAS)

Table 1: Characteristics of memory versus storage

Years ago, before the advent of SSDs, the differences between memory and storage were simple and stark. Memory meant random access memory (RAM), and storage meant magnetic media (disk drives or magnetic tape). The physical differences between the two lead to a further difference, captured in Table 1 as data access size. Memory can be read or written a byte at a time, while storage, largely due to its rotating disk structure, has a minimum storage unit of a sector (typically 512 bytes for HDDs). When SSDs replaced HDDs, the nature of the SSD design meant that, even though they were not rotating magnetic media, they still could not be read or written a byte at a time like RAM.

SSDs store data in a matrix of electrical cells organized into rows called pages, and pages are grouped together to form blocks. SSDs can only write to empty pages within a block. The net result is that SSDs read data a page at a time, and they can write at the page level only if the surrounding cells are empty; otherwise an entire block must be erased before a page can be written. Reading or writing a page can mean moving 16 kB of data, so these devices are poorly suited to cache-like applications where small amounts of data must be accessed frequently while working on a problem.
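The cost of this granularity mismatch can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the 16 kB page size quoted above and a 64-byte request standing in for a small, cache-like access. The function names are illustrative only, and no real SSD controller behavior (erase cycles, garbage collection, caching) is modeled.

```python
PAGE_BYTES = 16 * 1024   # page size quoted in the text (16 kB)
SMALL_ACCESS_BYTES = 64  # assumed size of one small, cache-like access


def bytes_moved(request_bytes: int, unit_bytes: int) -> int:
    """Bytes a device must transfer when it can only move whole units."""
    units = -(-request_bytes // unit_bytes)  # ceiling division
    return units * unit_bytes


def amplification(request_bytes: int, unit_bytes: int) -> float:
    """Ratio of bytes actually moved to bytes the application asked for."""
    return bytes_moved(request_bytes, unit_bytes) / request_bytes


if __name__ == "__main__":
    # Page-granular device versus a device that can move exactly 64 bytes.
    print(amplification(SMALL_ACCESS_BYTES, PAGE_BYTES))
    print(amplification(SMALL_ACCESS_BYTES, SMALL_ACCESS_BYTES))
```

Under these assumptions, a 64-byte access on a page-granular device moves a full 16 kB page, which is why page-oriented media suit bulk transfers far better than frequent small accesses.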

With new types of persistent memory like Intel’s Optane Technology and others offering non-volatility and reduced access times approaching DRAM, the line between memory and storage is beginning to blur, and this is opening up interesting possibilities.

PCIe vs CXL for Memory/Storage

PCI Express (PCIe) implementations are expandable and hierarchical with embedded switches or switch chips allowing one root port to interface with multiple endpoints, such as multiple storage devices (as well as other endpoints like Ethernet cards and display drivers). However, limitations of these implementations are seen in large systems with isolated memory pools that require heterogeneous computing where the processor and accelerator share the same data and memory space in a single 64-bit address space. The lack of a cache coherency mechanism makes memory performance for these applications inefficient and latency less than acceptable when compared to alternative implementations using CXL.

While PCIe typically transfers a large block of data through a direct memory access (DMA) mechanism using load-store semantics (load = read, store = write), CXL uses a dedicated CXL.mem protocol for short data exchanges and extremely low latency.

While the introduction of PCIe 6.0.1 at 64GT/s helps increase the bandwidth available for storage applications with minimal or no increase in latency, the lack of coherency still limits PCIe applications like traditional SSDs, which are block storage devices. For these storage applications, NVMe, which uses PCIe as the transport interface, has become the dominant SSD technology. Next-generation SSDs, with CXL interfaces instead of PCIe, are currently being developed.

Table 2 shows a summary of some of the important characteristics for storage applications for PCIe versus CXL. This article highlights CXL’s main characteristics for high-performance computing storage applications.

Feature	PCI Express	CXL
Max bandwidth	32GT/s x16 for PCIe 5.0; 64GT/s x16 for PCIe 6.0	32GT/s x16 for CXL 2.0; 64GT/s x16 for CXL 3.0
Coherency	None	Supported; Host-managed
Latency	100’s of ns	10’s of ns
Cacheability	PCIe address space typically NON-cacheable	CXL address space cacheable by definition
Switching	Chips and embedded	Embedded in CXL 2.0 and 3.0; future chips may be expected
Topologies	Host to Device, Switched	Host to device, switched, fabrics
Memory Access	DMA typically	Dedicated CXL.mem
Transfer sizes	PCIe optimized for larger data payloads; Traditional block storage (512B, 1KB, 2KB, 4KB); lower overhead for non-cached data	CXL optimized for 64B cacheline transfers; Fixed size offers low latency
Storage standards	PCIe-based Flash memory (NVMe)	Emerging SSD and DRAM with CXL interface, promising for many new types of memory/storage applications
Datapath	32b – 512b, 1024b	Natively 512b, 1024b
Implementation	PCIe only	CXL controller with support for PCIe; no Home Agent needed in devices
Applications	Non coherent data movement applications, large DMA block transfers, traditional storage controllers; NVMe	Linear storage: Byte-addressable (vs block or sector) SSD successors; Computational memory
Ecosystem	Massive and well established up to PCIe 5.0	Limited adoption so far; CXL expected to accelerate through 2022 and 2023

Table 2: Characteristics of PCIe versus CXL for storage applications
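The "Max bandwidth" row of Table 2 can be turned into raw link throughput with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes one bit per lane per transfer and ignores encoding, FLIT and protocol overhead, so achievable throughput will be lower. The function name is illustrative.

```python
def raw_gbytes_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw bandwidth in GB/s for one direction of a link.

    Assumes each transfer carries one bit per lane and ignores
    encoding and protocol overhead.
    """
    return gt_per_s * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for name, rate in [("PCIe 5.0 / CXL 2.0", 32), ("PCIe 6.0 / CXL 3.0", 64)]:
        print(f"{name}: x16 raw = {raw_gbytes_per_s(rate, 16)} GB/s per direction")
```

Because CXL runs over the same physical layer as PCIe, the raw numbers are identical for the matching generations; the differences in Table 2 come from the protocol layered on top.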

Advantages of CXL for Emerging HPC Applications Memory Composability and Disaggregation

Memory pooling introduced by CXL 2.0 has been calculated to theoretically enable support for CXL-attached memory of at least 1.28 petabytes (PB), and with multi-level switching and other features introduced in CXL 3.0, it could be even higher. This opens the door for new approaches to solving large computation problems, in which multiple hosts can work on massive problems while accessing the entire dataset simultaneously. As an example, with access to a petabyte of memory, whole new models can be created and coded to work on complex problems, like modeling climate change, with an assumption that the system can handle the entire problem all at once, rather than breaking the problem down into smaller pieces.

Advanced fabric capabilities introduced in CXL 3.0 are a shift from previous generations and their traditional tree-based architectures. The new fabric supports up to 4,096 nodes, each able to communicate with one another via a port-based routing (PBR) mechanism. A node can be a CPU host, a CXL accelerator (with or without memory), a PCIe device, or a Global Fabric Attached Memory (GFAM) device. A GFAM device is a Type 3 device that effectively acts as a shared pool of memory whose I/O space is owned either by a single Host or by a Fabric Manager. After configuration, other Hosts and devices on the CXL Fabric can directly access the pooled memory of the GFAM device. The GFAM device opens up an array of new possibilities for building systems from compute and memory elements arranged to satisfy the needs of specific workloads. For example, with access to a terabyte or a petabyte of memory, it’s possible to create whole new models to tackle complex challenges such as mapping the human genome.

Table 3 shows some of the key features that are driving the adoption of CXL for memory and storage applications.

Feature	When Introduced
Coherency and Low latency	Introduced in CXL 1.0/1.1
Switching	Introduced in CXL 2.0 as single-level switching for CXL.mem; expanded to multi-level switching for all protocols in CXL 3.0
Memory pooling & sharing	Pooling introduced in CXL 2.0 with MLD support; sharing added in CXL 3.0
Fabrics	Introduced in CXL 3.0

Table 3: Key CXL characteristics for storage applications

Traditionally, there have been only a couple of options to attach additional memory to accelerators or other SoCs. The most common method is to add DDR memory channels supporting more standard DDR memory modules. The only other viable option was to integrate memory together with the SoC within the same package. With CXL it becomes possible to put memory onto something that is very much like the PCIe bus (CXL uses PCIe PHYs and electricals). This enables a system to support more memory modules using a card with a standard CXL interface, without the need for additional DDR channels. Figure 1 shows an example of how this can vastly increase the memory accessible to the SoC, both in amount (GB) and in type (RAM or persistent memory). Using this approach, memory begins to look like a pooled resource and can be accessible to multiple hosts via switching, which was first introduced in CXL 2.0 and vastly expanded in CXL 3.0.

Diagram showing CXL's ability to enable media independence with a single interface

Figure 1: CXL enables media independence with a single interface such as DDR3/4/5, LPDDR 3/4/5, Persistent Memory/Storage

As can be seen in Figure 1, CXL can solve a problem that has blocked the development of expandable pools of memory accessible to multiple systems: it does away with proprietary interconnects, so that any CPU, GPU or tensor processing unit (TPU) that needs access to additional memory can be designed with an industry-standard CXL interface. CXL will eventually permit connection to a vast array of memory modules, including SSDs, DDR DRAM, and emerging persistent memories. The combination of CXL’s low latency, coherency, and memory pooling and sharing capabilities makes it a viable technology for allowing system architects to create large pools of both volatile and persistent memory that extend into multiple infrastructure pools, becoming a true shared resource.

Another advantage of the approach shown in Figure 1 is that the SoC pins devoted to CXL interfaces do not have to be dedicated to memory interfaces—they can be used to connect anything with a CXL interface, including additional CXL switches, GFAM devices, or chip-to-chip interconnects.

At the 2022 Flash Memory Summit it was clear that CXL is emerging as the leading architecture for pooling and sharing connected memory devices targeting both DRAM and NAND flash devices. Many large SSD companies are either introducing or talking about plans for flash-based SSDs with a CXL interface, and others are discussing their memory controllers or other memory products that are featuring CXL as the high-speed interface to memory. See Figure 1.

The CXL Consortium has now acquired the assets of Gen-Z and OpenCAPI, further enhancing the scope and types of applications that CXL can handle.

Diagram showing CXL's ability to enable fine grained memory allocation and share among multiple hosts

Figure 2: CXL enables fine grained memory allocation (pooling) and sharing among multiple Hosts

CXL for Memory Disaggregation and Composability

The advantages of CXL are many, but two in particular are worth highlighting: memory disaggregation and composability. Memory disaggregation refers to the capability of spreading memory around to various devices while still allowing sharing and coherency by multiple servers, such that memory is no longer aggregated and devoted to a single device or server. Composability refers to the capability to allocate the disaggregated memory to particular CPUs or TPUs as needed, with the result that memory utilization can be increased substantially. This enhanced utilization offers a critical improvement over current systems, where actual memory utilization in real systems, as measured by Microsoft and highlighted in The Next Platform article, MICROSOFT AZURE BLAZES THE DISAGGREGATED MEMORY TRAIL WITH ZNUMA, can be on the order of only 40%, with most Virtual Machines (VMs) utilizing less than half of the memory allocated to them by their hypervisor.

With CXL 2.0 and CXL 3.0, which include switching, a host can access memory from one or more devices that form a pool. It’s important to note that in this kind of pooled configuration, only the resources themselves and not the contents of the memory are shared among the hosts: each region of memory can only belong to a single coherency domain. Memory sharing, which has been added to the CXL 3.0 specification, actually allows individual memory regions within pooled resources to be shared between multiple hosts. Figure 2 shows an example illustrating memory pooling and memory sharing within a single system.
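The distinction between pooling and sharing can be illustrated with a toy model. The sketch below is purely conceptual: `Region`, `MemoryPool`, and the host names are hypothetical, and nothing here corresponds to a real CXL Fabric Manager API or to how coherency is actually enforced in hardware. It only captures the ownership rule described above, where a pooled region belongs to one host at a time while a shared region may be assigned to several.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    size_gb: int
    shareable: bool = False  # True models CXL 3.0-style sharing (assumption)
    hosts: set = field(default_factory=set)


class MemoryPool:
    """Toy pool: tracks which hosts each region is assigned to."""

    def __init__(self, regions):
        self.regions = list(regions)

    def assign(self, index: int, host: str) -> None:
        region = self.regions[index]
        owned_by_other = bool(region.hosts) and host not in region.hosts
        if owned_by_other and not region.shareable:
            # Pooling only: a region belongs to a single host at a time.
            raise ValueError(f"region {index} already belongs to another host")
        region.hosts.add(host)

    def release(self, index: int, host: str) -> None:
        self.regions[index].hosts.discard(host)

    def visible_gb(self, host: str) -> int:
        return sum(r.size_gb for r in self.regions if host in r.hosts)


if __name__ == "__main__":
    pool = MemoryPool([Region(64), Region(64), Region(128, shareable=True)])
    pool.assign(0, "hostA")  # pooled: exclusive to hostA
    pool.assign(1, "hostB")  # pooled: exclusive to hostB
    pool.assign(2, "hostA")  # shared: both hosts see the same region
    pool.assign(2, "hostB")
    print(pool.visible_gb("hostA"), pool.visible_gb("hostB"))
```

Releasing a pooled region and assigning it to a different host models the composability described in this section: capacity moves between hosts without any physical change to the system.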

CXL can also enable computational memory, where attached memory devices can carry out some computation directly on the memory contents without the involvement of a Host or accelerator.

The ultimate goal is 100% disaggregation and composability, in which all memory attached to a system can be utilized by any attached device and is all available as a pooled resource.

To achieve this goal of 100% disaggregation and composability, a system needs to be able to discover and enumerate every device within the system, including servers, accelerators, memory expansion devices, and other devices with shareable memory. This requires rack-level device discovery and identification of 100% of the devices (servers, memory pools, accelerators, and storage devices), whether already composed or as yet unassigned. This can only be accomplished using the PCIe and CXL capabilities, since fabrics like Ethernet and InfiniBand can’t support fine-grained discovery, disaggregation and composition.

The approach to dynamically create flexible hardware configurations capable of meeting different workload requirements is often referred to as Composable Disaggregated Infrastructure (CDI), and it becomes possible using the low latency fabrics now enabled by CXL. This capability can effectively permit an entire rack to be configured and act like a server.

Summary

CXL is rapidly becoming the interface of choice for managing and sharing large amounts of memory coherently among multiple Devices and Hosts. It is enabling a true heterogeneous composable and disaggregated architecture supporting more than just memory. The CXL 3.0 spec expands on previous versions of CXL, doubling the per-lane bandwidth to 64 GT/s with no added latency, while adding multi-level switching, efficient peer-to-peer communications, and memory sharing.

To ease and accelerate adoption of the latest CXL protocol, Synopsys offers a complete CXL IP solution, encompassing controller with IDE Security Module, PHY, and verification IP, to deliver secure, low-latency, high-bandwidth interconnection for AI, machine learning, and high-performance computing applications, including storage.

Built on silicon-proven Synopsys PCIe IP, our CXL IP solution lowers integration risks for device and host applications and helps designers achieve the benefits that CXL 3.0 brings to SoCs for data-intensive applications. As an early CXL contributor, Synopsys had early access to the latest specification, enabling our engineers to deliver a more mature solution. Already, Synopsys has delivered CXL 2.0 and 3.0 solutions with IDE support to several customers, including for next generation SSD and advanced memory applications with proven silicon in customer products and successful third-party interoperability demonstrated in hardware.