Meta trained one of its AI models, called Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta learned 78% of those hiccups were caused by hardware issues such as GPU and host component failures.
Hardware issues like these don’t just cause job interruptions. They can also lead to silent data corruption (SDC): data loss or inaccuracies that often go undetected for extended periods.
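As an illustrative sketch only (not Meta’s actual detection method), one software-level way to catch transient SDC is redundant execution: run the same computation twice and compare the results, since a healthy device produces identical outputs both times.

```python
import numpy as np

def checked_matmul(a: np.ndarray, b: np.ndarray, rtol: float = 1e-5) -> np.ndarray:
    """Run the same multiplication twice and compare the results.

    A mismatch between the two runs signals a transient hardware fault
    that would otherwise pass silently into downstream data.
    """
    first = a @ b
    second = a @ b
    if not np.allclose(first, second, rtol=rtol):
        raise RuntimeError("silent data corruption suspected: duplicate runs disagree")
    return first
```

Redundant execution roughly doubles compute cost, which is why production fleets typically reserve it for periodic screening rather than every operation.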
While Meta’s pre-training interruptions were unexpected, they shouldn’t be entirely surprising. AI models like Llama 3 have massive processing demands that require colossal computing clusters. For training alone, AI workloads can require hundreds of thousands of nodes and associated GPUs working in unison for weeks or months at a time.
The intensity and scale of AI processing and switching create a tremendous amount of heat, voltage fluctuations, and noise, all of which place unprecedented stress on computational hardware. The GPUs and underlying silicon can degrade more rapidly than they would under normal (or what used to be normal) conditions. Performance and reliability wane accordingly.
This is especially true for sub-5 nanometer process technologies, where silicon degradation and faulty behavior have been observed both at manufacturing test and in the field.
But what can be done about it? How can unanticipated interruptions and SDC be mitigated? And how can chip design teams ensure optimal performance and reliability as the industry pushes forward with newer, bigger AI workloads that demand even more processing capacity and scale?
Certain AI innovators like Meta have established monitoring and diagnostics capabilities to improve the availability and reliability of their computing environments. But with processing demands, hardware failures, and SDC issues on the rise, there is a distinct need for test and telemetry capabilities at deeper levels — all the way down to the silicon and multi-die packages within each XPU/GPU as well as the interconnects that bring them together.
The key is silicon lifecycle management (SLM) solutions that help ensure end-to-end reliability, availability, and serviceability (RAS), from design and manufacturing to bring-up and in-field operation.
With better visibility, monitoring, and diagnostics at the silicon level, design teams can detect emerging reliability issues, diagnose their root causes, and take corrective action before failures disrupt workloads.
As a global leader in silicon-to-systems design, Synopsys offers SLM IP and analytics solutions that help improve silicon health and provide operational metrics at each phase of the system lifecycle.
This includes environmental monitoring for understanding and optimizing silicon performance based on the operating environment of the device; structural monitoring to identify performance variations from design to in-field operation; and functional monitoring to track the health and anomalies of critical device functions.
Our SLM IP and analytics solutions include:
Process, voltage, and temperature monitors
Path margin monitors
Clock and delay monitors
UCIe monitor, test, and repair
High-speed access and test
High-bandwidth memory (HBM) external test and repair
SLM hierarchical subsystem
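Telemetry from monitors like these is only useful if software acts on it. As a hypothetical illustration (the class, field names, and thresholds below are invented for this sketch, not a Synopsys API), a minimal threshold check on environmental monitor readings might look like this:

```python
from dataclasses import dataclass

@dataclass
class PVTSample:
    # One reading from an on-die process/voltage/temperature monitor
    temperature_c: float
    voltage_v: float

def classify(sample: PVTSample,
             temp_limit_c: float = 105.0,
             vmin_v: float = 0.68) -> str:
    """Map a raw PVT sample to an action category.

    Real SLM analytics are far richer; this simply flags samples
    that leave an assumed safe operating envelope.
    """
    if sample.temperature_c > temp_limit_c:
        return "throttle"      # reduce clocks before thermal damage accrues
    if sample.voltage_v < vmin_v:
        return "droop-alert"   # supply droop can cause timing failures
    return "ok"
```

In practice such checks run continuously in firmware or a management controller, with the resulting events fed into fleet-level analytics.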
With the scale and processing demands of AI devices and workloads on the rise, system reliability, silicon health, and SDC issues are becoming more widespread. While there is no single solution or antidote for avoiding these issues, deeper and more comprehensive test, repair, and telemetry — at the silicon level — can help mitigate them. The ability to detect or predict in-field chip degradation is particularly valuable, enabling corrective action before sudden or catastrophic system failures occur.
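One generic way to turn in-field telemetry into a prediction (a sketch of the general idea, not any vendor’s algorithm) is to fit a trend to a slowly degrading metric, such as path timing margin, and extrapolate when it will cross a failure threshold:

```python
import numpy as np

def predict_margin_exhaustion(hours, margins_ps, threshold_ps: float = 0.0):
    """Fit a linear trend to path-margin telemetry and estimate the
    operating hour at which margin crosses the failure threshold.

    Returns the estimated hour, or None if the margin is not degrading.
    """
    slope, intercept = np.polyfit(hours, margins_ps, 1)
    if slope >= 0:
        return None  # margin stable or improving; no exhaustion predicted
    return (threshold_ps - intercept) / slope
```

A prediction like this gives operators a window to schedule repair or replacement during planned maintenance rather than after a sudden failure.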
Silicon-level test, repair, and telemetry deliver the end-to-end visibility and RAS that will only grow in importance as we push forward in the age of AI.
This article originally appeared in EDN.