CXL is emerging as a potential boon for HPC applications, but a software framework needs to be developed around the hardware ecosystem that SoC designers are delivering. This article examines the CXL specifications and speculates on opportunities to enable or accelerate novel applications in high-performance computing.

To review some basics of the CXL specification: there are three protocols (CXL.io, CXL.cache, and CXL.mem) and three device types that use different combinations of them.  CXL.io is functionally equivalent to PCIe and provides an enhanced but non-coherent load/store interface for I/O devices, handling device discovery, configuration, and initialization.  CXL.cache allows a protocol-enabled device to coherently cache host memory and access it through low-latency request/response transactions.  Lastly, CXL.mem provides the inverse capability of CXL.cache, giving the host processor low-latency load/store access to CXL-attached devices that either are or contain memory.

CXL Device Types

As mentioned earlier, these three protocols combine in different ways to enable three device types, conveniently known as Types 1, 2, and 3.  Type 1 devices combine the CXL.io and CXL.cache protocols to allow devices without internal memory, such as smart NICs or accelerators, to directly control and coherently access an area of system memory.  Type 2 devices are like Type 1 devices but have attached or intrinsic memory; they use all three protocols so that the host can allocate an area of device memory and the device can allocate an area of host memory, with hardware-supported coherency between the two.  Whether a region of device-attached memory is optimized for host access or for device access is known as host-bias or device-bias mode, respectively.  Finally, a Type 3 device is a memory device, supported by the CXL.io and CXL.mem protocols, offering byte-addressable memory semantics across a wide variety of both volatile and persistent media such as DRAM and NVRAM, as well as memory expanders.  With fabrics, hosts can access memory attached to other systems and dedicated memory expanders as if it were local memory.  These three device types are summarized in Figure 1 below.

Figure 1: CXL Device Types. Taken from the Compute Express Link Specification r.3.0, v1.0
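As a quick, purely illustrative way to keep these combinations straight, the C sketch below encodes the protocol mix of each device type as a bitmask. The names and layout are hypothetical and are not taken from any CXL header or driver API; they simply restate the table in Figure 1.

```c
/* Illustrative sketch only: which CXL protocols each device type negotiates.
 * Identifiers are hypothetical, not from any CXL specification header. */
#include <stdio.h>

enum cxl_protocol {
    CXL_IO    = 1 << 0,   /* non-coherent load/store, PCIe-equivalent      */
    CXL_CACHE = 1 << 1,   /* device coherently caches host memory          */
    CXL_MEM   = 1 << 2,   /* host load/store access to device memory       */
};

static const struct { const char *name; unsigned protocols; } cxl_types[] = {
    { "Type 1 (accelerator/NIC, no local memory)", CXL_IO | CXL_CACHE },
    { "Type 2 (accelerator with memory)",          CXL_IO | CXL_CACHE | CXL_MEM },
    { "Type 3 (memory expander)",                  CXL_IO | CXL_MEM },
};

int main(void)
{
    for (size_t i = 0; i < sizeof(cxl_types) / sizeof(cxl_types[0]); i++)
        printf("%-45s io:%d cache:%d mem:%d\n", cxl_types[i].name,
               !!(cxl_types[i].protocols & CXL_IO),
               !!(cxl_types[i].protocols & CXL_CACHE),
               !!(cxl_types[i].protocols & CXL_MEM));
    return 0;
}
```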

CXL.mem and CXL.cache offer reduced latency compared with native PCIe transactions: a commonly cited metric is ~100 ns for PCIe 5.0 versus 20-40 ns for CXL 2.0 operations. The simplest and most immediately obvious application of this reduced latency is memory expansion using Type 3 devices: an application thread that can access memory external to the system all but eliminates the chance of a job failing due to insufficient memory.  This capability, known as memory pooling, was enhanced in CXL 2.0 with support for switch-attached memory.  Pooling not only allows a system to draw memory from other sources, it also lets systems offer unused local memory to others, improving utilization and lowering up-front system costs.  Memory sharing, introduced in the CXL 3.0 specification, allows multiple hosts to access the same allocation of CXL-attached memory.  CXL 3.0 also defines fabric-attached memory expanders: devices that can contain various types of memory for pooling and sharing and can implement local memory tiering to optimize the performance characteristics of a pool on behalf of the host.  This creates an interesting alternative to the SHMEM model defined by Cray Research, giving multiple hosts very low-latency access to a shared memory pool. Because the interconnect is a native bus medium, it can outperform SHMEM library routines while also offering a potentially much simpler programming model for parallel computing on that shared pool.
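As a rough illustration of that simpler programming model, the sketch below assumes a Type 3 expander that Linux exposes as a CPU-less NUMA node (the node id is an assumption; check with `numactl --hardware` on the target system) and allocates a pool on that node which every thread accesses with ordinary loads and stores rather than put/get library calls.

```c
/* Minimal sketch, assuming a CXL Type 3 expander surfaced as NUMA node
 * CXL_NODE (an assumption for this example).
 * Build with: gcc -fopenmp demo.c -lnuma */
#include <numa.h>        /* numa_available, numa_alloc_onnode, numa_free */
#include <omp.h>
#include <stdio.h>

#define CXL_NODE 2                      /* assumed node id of the expander */

int main(void)
{
    if (numa_available() < 0 || CXL_NODE > numa_max_node()) {
        fprintf(stderr, "NUMA/CXL node not available\n");
        return 1;
    }

    size_t n = 1UL << 24;               /* 16 Mi doubles (~128 MiB)        */
    double *pool = numa_alloc_onnode(n * sizeof(double), CXL_NODE);
    if (!pool) { perror("numa_alloc_onnode"); return 1; }

    /* Ordinary load/store access from all threads -- no put/get calls.    */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        pool[i] = (double)i;

    printf("pool[42] = %.1f (resident on node %d)\n", pool[42], CXL_NODE);
    numa_free(pool, n * sizeof(double));
    return 0;
}
```

Pages are placed on the expander node on first touch; beyond that, the code is indistinguishable from working with local DRAM, which is the point of the load/store semantic.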

Another intrinsic value of CXL's decreased latency, both within and between systems, is its potential to facilitate device-to-device memory transactions, such as those between multiple GPUs in one or more systems, without the need for or expense of a proprietary secondary bus or software layer to interconnect those devices.  For smaller-model AI training workloads, this can deliver an immediate performance impact.  Fabric-attached accelerators that expose remote hardware directly to shared-memory systems could usher in a new era of AI training, particularly in the datacenter, as symmetric peer-device communication features are introduced to CXL and remove the requirement for ongoing CPU involvement.

Overall, CXL fabrics create an opportunity for server disaggregation, so that no individual server limits an application workflow that requires resources not locally resident in its architecture.  When memory can be concentrated into a fabric-attached expander, the need for dedicated (and isolated) memory within each system can be relaxed.  The additive bandwidth of the dedicated memory bus and CXL helps to address core memory-bandwidth starvation, allowing individual servers to be designed and configured with more of a focus on performance than capacity.  The keys to enabling this vision are the development of low-latency CXL switching and flexible memory-tiering support built into both supporting software and expander hardware.
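To make the software side of that tiering concrete, the hedged sketch below demotes a cold buffer from local DRAM to a CXL-attached NUMA node using the standard Linux mbind interface. The node id and buffer size are assumptions for illustration; recent Linux kernels can also demote cold pages between tiers automatically, so explicit placement like this is only one of several possible approaches.

```c
/* Hypothetical sketch: demote a cold buffer to a CXL-attached NUMA node.
 * Assumes the expander is exposed as node CXL_NODE; verify with
 * `numactl --hardware`.  Build with: gcc -std=gnu11 demote.c -lnuma */
#include <numaif.h>      /* mbind, MPOL_BIND, MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CXL_NODE 2              /* assumed node id of the CXL expander */

static int demote_to_cxl(void *buf, size_t len)
{
    unsigned long nodemask = 1UL << CXL_NODE;
    /* Rebind the range to the CXL node and migrate resident pages there. */
    return mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                 MPOL_MF_MOVE);
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 64UL * 1024 * 1024;            /* 64 MiB "cold" buffer   */
    void *buf = aligned_alloc(page, len);
    memset(buf, 0, len);                        /* fault pages into DRAM  */

    if (demote_to_cxl(buf, len) != 0)
        perror("mbind");                        /* e.g. node not present  */
    else
        printf("buffer demoted to NUMA node %d\n", CXL_NODE);

    free(buf);
    return 0;
}
```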

Integrity and Data Encryption for CXL  

With the introduction of external switching in CXL 2.0 and enhanced fabrics in CXL 3.0, robust bus security becomes paramount as data travels over exposed cables outside of the server.  For this reason, Integrity and Data Encryption (IDE) Security IP Modules are available for PCIe and CXL controllers to ensure that both data integrity and privacy are protected even if the data can be accessed by third-party intruders or observers.

Synopsys Secure CXL Controllers, integrated with configurable, standards-compliant IDE Security Modules, help designers protect data transfer in their SoCs against tampering and physical attacks. The Synopsys CXL 3.0 IDE Security Module and the CXL 2.0 IDE Security Module provide confidentiality, integrity, and replay protection for FLITs in the case of the CXL.cache and CXL.mem protocols and for Transaction Layer Packets (TLPs) in the case of CXL.io. They match the controllers' data interface bus widths and lane configurations and are optimized for area, performance, and latency (as low as zero cycles for CXL.cache/.mem skid mode).

Looking Ahead

After many years of attempted standardization on a coherent protocol, including OpenCAPI, Gen-Z, and others, the industry seems to have coalesced around CXL.  With a streaming interface known as the Credited eXtensible Stream (CXS) protocol, it looks like CXL controllers will also embrace a method for providing symmetrical coherence to multiprocessor architectures by encapsulating an updated version of CCIX, which originally suffered from higher latency in its native form due to increased coherency overhead, especially for small write operations.  CXS.B (the CXL-hosted version of CXS) addresses that problem by providing a pair of streaming channels dedicated to symmetrical communication between CPUs.  In fact, it may even be possible to envision extensible SMP processing using CPUs from multiple computers over a CXL fabric!

Summary

The development and implementation of CXL is reshaping the landscape of high-performance computing, offering promising advancements in latency reduction, memory pooling, and server disaggregation that could drive a new era of computational power and efficiency as the software framework evolves to support the growing hardware ecosystem. Synopsys is a leading provider of PCIe and CXL PHY, Controller + IDE, and Verification IP, with integration and validation expertise from more than 1,800 designs, reducing risk and accelerating time to market for SoC engineers.
