What role does CXL play in AI? Depends on who you ask
CXL has struggled to find its place in AI, given competing Nvidia components. But some argue the interconnect could help companies get the most out of their GPUs.
At Nvidia GTC 2024, GPUs and AI took center stage. But another technology that some believe could help solve GPU bottlenecks was absent: Compute Express Link. CXL has been touted for years as a way to increase memory for data center devices, including accelerators such as GPUs. But if this year's GTC is any indicator, CXL appears unimportant in the AI era.
Offstage, however, the debate about CXL's role in AI is more nuanced. Some argue that CXL has a limited place in the discussion, given Nvidia's lack of support. Others, including memory software vendor MemVerge, memory supplier Micron and hardware vendor Supermicro, are working to prove it has one.
At GTC, MemVerge, Micron and Supermicro demonstrated how CXL can increase GPU utilization for large language models without adding more processing units. CXL does this by expanding the memory pool available to the GPU, supplementing high-bandwidth memory at a lower cost than scaling out the infrastructure with more GPUs or more HBM. The tradeoff, however, is performance.
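In broad strokes, the approach treats CXL-attached DRAM as a larger, slower tier behind the GPU's own memory, spilling data there when on-package memory runs out. The sketch below is a minimal illustration of that placement idea only; the Tier class, capacities and bandwidth figures are assumptions made for the example, not MemVerge's software.

```python
# Illustrative only: a toy capacity-based tiering policy, not MemVerge's product.
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_gb: float       # total capacity of the tier (assumed)
    bandwidth_gb_s: float    # rough, assumed bandwidth figure
    used_gb: float = 0.0

    def try_place(self, size_gb: float) -> bool:
        """Accept the allocation if it still fits in this tier."""
        if self.used_gb + size_gb <= self.capacity_gb:
            self.used_gb += size_gb
            return True
        return False


# Assumed capacities and bandwidths, purely for illustration.
gpu_hbm = Tier("GPU HBM", capacity_gb=80, bandwidth_gb_s=3000)
cxl_dram = Tier("CXL-attached DRAM", capacity_gb=512, bandwidth_gb_s=60)


def place(buffer_name: str, size_gb: float) -> str:
    """Keep data in fast GPU memory while it fits; spill the rest to CXL memory."""
    for tier in (gpu_hbm, cxl_dram):
        if tier.try_place(size_gb):
            return f"{buffer_name}: {size_gb} GB -> {tier.name}"
    return f"{buffer_name}: {size_gb} GB -> does not fit; scale out"


print(place("model weights", 60))
print(place("KV cache", 40))   # overflows the GPU tier, lands in CXL memory
```

The point the demo makes is that the overflow allocation would otherwise force the purchase of another GPU; with CXL it lands in cheaper DRAM instead, at the cost of bandwidth.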
Nvidia has gone in a different direction. The GPU maker has its own NVLink, an interconnect designed specifically to enable a high-bandwidth connection between its GPUs. While CXL offers general-purpose capabilities for expanding the memory footprint and pooling the memory of processors, it's absent from some of the most sought-after GPUs.
CXL for AI is dead?
CXL first came on the scene in 2019 and was seen as a potential way to overcome siloed memory and the memory limitations of CPUs. Since then, CXL use cases have grown and the interconnect can now enable memory sharing between multiple hosts as well as provide expanded bandwidth and device capabilities.
At last month's Memory Fabric Forum 2024, MemVerge highlighted CXL as a potential AI fabric capable of connecting compute, networking and storage. MemVerge, a member of the CXL Consortium along with Nvidia, builds software that plays a key role in developing CXL use cases.
The traditional data center layout of the x86 era has x86 CPU servers connected to storage through an Ethernet networking fabric, according to Charles Fan, CEO of MemVerge. But the AI era will see GPU-based servers built around HBM that connect to storage and likely use NVLink or Ultra Ethernet as the interconnection between GPUs and memory pools.
"CXL could play a role as the AI fabric as well," Fan said.
But Dylan Patel and Jeremie Eliahou Ontiveros, two analysts at independent research firm SemiAnalysis, don't believe CXL will make the leap to AI. In a new article, they argue that while CXL comes with potential benefits to servers in general, those benefits don't translate to Nvidia GPUs due to their limited shoreline area -- how much room Nvidia GPUs have for connectivity -- and Nvidia's proclivity for its own NVLink.
I/O for chips comes from the chip's edges, and two of the four edges of Nvidia GPUs are dedicated to HBM, according to Patel and Ontiveros. That leaves two edges for connectivity, where Nvidia is more likely to choose its own NVLink and NVLink-C2C -- an interconnect to Grace CPUs -- over CXL. Both protocols are proprietary to Nvidia and deliver more bandwidth than CXL, which is an open standard.
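For a rough sense of that bandwidth gap, the comparison below uses approximate, publicly cited aggregate figures for recent parts; the numbers are ballpark illustrations and do not come from the article.

```python
# Rough aggregate (bidirectional) bandwidths in GB/s for recent generations.
# Approximate public figures, used purely for illustration.
links_gb_s = {
    "NVLink (H100, 18 links per GPU)": 900,
    "NVLink-C2C (Hopper GPU to Grace CPU)": 900,
    "PCIe 5.0 x16 (the PHY that CXL 1.1/2.0 runs over)": 128,
}

pcie = links_gb_s["PCIe 5.0 x16 (the PHY that CXL 1.1/2.0 runs over)"]
for name, bandwidth in links_gb_s.items():
    print(f"{name}: ~{bandwidth} GB/s ({bandwidth / pcie:.1f}x a PCIe 5.0 x16 slot)")
```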
But Fan said the argument focuses on GPU-to-GPU connectivity for AI workloads only, whereas CXL provides a broader set of capabilities.
"The GPU-to-GPU communication was not the initial design or use case of the CXL standard," Fan said. Instead, CXL addresses bandwidth and capacity expansion.
Nvidia supports both NVLink to connect to other Nvidia GPUs and PCIe to connect to other devices, and the PCIe interface could also carry CXL, Fan said. In fact, rival GPU vendor AMD makes chips that use PCIe almost exclusively. As with Nvidia GPUs, Fan sees a future in which both interconnects coexist.
More than one use case
Marc Staimer, president of Dragon Slayer Consulting, agreed that focusing on GPU-to-GPU communication, important for generative AI, is too limited in scope and was never the intended target for CXL technology.
"CXL is not only aimed at solving the GPU problem," he said.
There are two main aspects of generative AI, Staimer said. First is training, which requires significant bandwidth -- often provided by GPUs -- to ensure large amounts of data are read in parallel at maximum speed. Second is inferencing, where trained language models might rely on retrieval-augmented generation (RAG), an AI framework that enables the use of additional data sets to improve accuracy.
One RAG technology is the vector database, which stores high-dimensional representations of data such as images and text that can be used to augment a query as needed with no additional training.
"Databases run in CPUs and in memory," he said. "And the more memory you have, the better."
CXL can expand the memory footprint, letting the entire vector database run in-memory, Staimer said. Running a database in-memory means that there is no need to go to storage to retrieve data, thus increasing the database speed. However, he noted that generative AI is still a small part of total data center spending whereas CXL can be used more broadly to expand data center memory, providing lower costs and better memory utilization.
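At its core, a vector database is a nearest-neighbor search over embeddings, and keeping the whole index in memory is what makes lookups fast. The fragment below is a generic NumPy illustration of that idea, not any particular product's API; the index size and dimensions are made up.

```python
import numpy as np

# Toy in-memory "vector index": each row is an embedding of a document or image.
# Sizes are made up; real indexes can hold many millions of vectors, which is
# where extra (for example, CXL-attached) memory becomes useful.
rng = np.random.default_rng(0)
index = rng.standard_normal((100_000, 768)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)


def retrieve(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar stored vectors (cosine similarity)."""
    q = query / np.linalg.norm(query)
    scores = index @ q                    # one pass over the in-memory index
    return np.argsort(scores)[-k:][::-1]


hits = retrieve(rng.standard_normal(768).astype(np.float32))
print("top matches:", hits)               # these would be passed to the LLM as context
```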
Patel and Ontiveros also see value in CXL's memory expansion and memory pooling benefits outside of AI, specifically for massively increasing DRAM utilization. They wrote these capabilities "could save billions" for each cloud provider.
But Fan believes it's too early to exclude CXL from AI workloads, given how quickly AI is advancing and how new use cases are still being discovered. One such use case could be expanding the pool of memory behind the HBM on the GPUs to sustain utilization of the processor.
Expanding HBM
HBM is stacked synchronous dynamic random-access memory that is attached directly to a processor -- a CPU, an application-specific integrated circuit or, more commonly, a GPU -- rather than sitting alongside it. The stacked design increases bandwidth and lowers power consumption. However, HBM is limited in capacity and expensive. Interest in HBM has risen because bandwidth is key for AI, and HBM supplies the highest bandwidth available.
But CXL could expand GPU memory capacity beyond the limits of HBM. At GTC, MemVerge, Micron and Supermicro demonstrated the potential to overcome the memory wall problem in AI, which refers to the limited capacity of memory on the GPU and the bandwidth of transfers to it, according to Fan.
"Growth of the size of the model as well as growth of the computational power of the GPUs outpace the memory capacity on the GPUs," Fan said.
One fix is to scale out the number of GPUs used. But doing so is both expensive and reliant on processing units that are currently in high demand, Fan said. Another fix is to offload or expand the memory through CXL, which would be cheaper and would forgo the need for more GPUs or denser HBM, according to Fan.
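A back-of-envelope calculation makes the tradeoff concrete. All figures below are hypothetical placeholders, not measurements from the demo.

```python
# Hypothetical numbers purely to illustrate the capacity arithmetic; they are
# not taken from the MemVerge/Micron/Supermicro demonstration.
model_and_cache_gb = 200      # memory the workload needs (assumed)
hbm_per_gpu_gb = 80           # on-package memory per GPU (assumed)
gpus_on_hand = 1

hbm_available = gpus_on_hand * hbm_per_gpu_gb
shortfall = max(0, model_and_cache_gb - hbm_available)

# Option 1: scale out -- buy enough extra GPUs just to gain their HBM.
extra_gpus = -(-shortfall // hbm_per_gpu_gb)     # ceiling division

# Option 2: keep the existing GPU and place the overflow in CXL-attached DRAM.
cxl_needed_gb = shortfall

print(f"Shortfall: {shortfall} GB")
print(f"Scale out: {extra_gpus} additional GPUs for their HBM alone")
print(f"CXL expansion: {cxl_needed_gb} GB of slower, cheaper DRAM instead")
```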
Switching to slower CXL memory does affect performance compared with HBM. But the combined technology from MemVerge, Micron and Supermicro showed that GPU utilization was up significantly, resulting in tasks completing faster, he said.
It should be noted that the GPUs used in their setup utilized GDDR6 -- Graphics Double Data Rate 6 -- memory, not HBM. Regardless, MemVerge said that the memory-expanding effect would be the same.
Adam Armstrong is a TechTarget Editorial news writer covering file and block storage hardware, and private clouds. He previously worked at StorageReview.com.