Addressing Hardware Failures and Silent Data Corruption in the AI Infrastructure Buildout

Shankar Krishnamoorthy, Yervant Zorian

Jul 23, 2025 / 4 min read

In 2024, Meta trained its Llama 3 model and published the results in a widely covered paper. During a 54-day period of pre-training, the run experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta attributed about 78% of the unexpected interruptions to confirmed or suspected hardware issues, such as GPU and host component failures.
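
As a rough back-of-the-envelope check on those figures, the short sketch below works out what they imply per hour of training. The variable names are illustrative, and applying the 78% share to the 419 unexpected interruptions is an assumption stated in the code.

```python
# Figures quoted above for the 54-day Llama 3 pre-training window.
TOTAL_INTERRUPTIONS = 466
UNEXPECTED_INTERRUPTIONS = 419
HARDWARE_SHARE = 0.78  # assumption: share applies to the unexpected interruptions
TRAINING_DAYS = 54

unexpected_share = UNEXPECTED_INTERRUPTIONS / TOTAL_INTERRUPTIONS
hardware_interruptions = round(UNEXPECTED_INTERRUPTIONS * HARDWARE_SHARE)
hours_between_unexpected = TRAINING_DAYS * 24 / UNEXPECTED_INTERRUPTIONS

print(f"Unexpected share of interruptions: {unexpected_share:.1%}")
print(f"Hardware-related interruptions (approx.): {hardware_interruptions}")
print(f"Mean hours between unexpected interruptions: {hours_between_unexpected:.1f}")
```

Under these assumptions, the run saw an unexpected interruption roughly every three hours, and about 327 of them were hardware related.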

Hardware issues like these don’t just cause job interruptions. They can also lead to silent data corruption (SDC): data loss or inaccuracies that often go undetected for extended periods.

While Meta’s pre-training interruptions were unexpected, they shouldn’t be entirely surprising. AI models like Llama 3 have massive processing demands that require colossal computing clusters. For training alone, AI workloads can require hundreds of thousands of nodes and associated GPUs working in unison for weeks or months at a time.

The intensity and scale of AI processing and switching create a tremendous amount of heat, voltage fluctuations, and noise, all of which place unprecedented stress on computational hardware. The GPUs and underlying silicon can degrade more rapidly than they would under normal (or what used to be normal) conditions. Performance and reliability wane accordingly.

This is especially true for sub-5 nm process technologies, where silicon degradation and faulty behavior have been observed both at manufacturing and in the field.

But what can be done about it? How can unanticipated interruptions and SDC be mitigated? And how can chip design teams ensure optimal performance and reliability as the industry pushes forward with newer, bigger AI workloads that demand even more processing capacity and scale?


Ensuring silicon reliability, availability, and serviceability (RAS)

Certain AI innovators like Meta have established monitoring and diagnostics capabilities to improve the availability and reliability of their computing environments. But with processing demands, hardware failures, and SDC issues on the rise, there is a distinct need for test and telemetry capabilities at deeper levels — all the way down to the silicon and multi-die packages within each XPU/GPU as well as the interconnects that bring them together.

The key is silicon lifecycle management (SLM) solutions that help ensure end-to-end RAS, from design and manufacturing to bring-up and in-field operation.

With better visibility, monitoring, and diagnostics at the silicon level, design teams can:

  • Gain telemetry-based insights into why chips are failing or why SDC is occurring.
  • Identify voltage or timing degradation, overheating, and mechanical failures in silicon components, multi-die packages, and high-speed interconnects.
  • Conduct more precise thermal and power characterization for AI workloads.
  • Detect, characterize, and resolve failures caused by radiation, voltage noise, and other mechanisms that can lead to undetected bit flips and SDC.
  • Improve silicon yield, quality, and in-field RAS.
  • Implement reliability-focused techniques, such as triple modular redundancy and dual-core lockstep, during the register-transfer level (RTL) design phase to mitigate SDC.
  • Establish an accurate pre-silicon aging simulation methodology to detect sensitive or vulnerable circuits and replace them with aging-resilient circuits.
  • Improve outlier detection on reliability models, which helps minimize in-field SDC.
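
Triple modular redundancy, mentioned in the list above, is normally implemented in RTL. Purely as a software illustration of the voting principle, the sketch below keeps three copies of a word, takes a bitwise 2-of-3 majority, and flags any disagreement. The function names are hypothetical and do not correspond to any Synopsys API.

```python
from typing import Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Return the voted value and whether any copy disagreed with it."""
    voted = majority_vote(a, b, c)
    mismatch = not (a == voted and b == voted and c == voted)
    return voted, mismatch


golden = 0b1011_0010
corrupted = golden ^ (1 << 4)  # simulate a single bit flip in one copy
value, mismatch = tmr_read(golden, corrupted, golden)
print(f"voted={value:#010b} mismatch={mismatch}")
```

The voter masks a fault in any single copy, and the mismatch flag turns what would have been a silent error into an observable event that telemetry can log.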

Synopsys SLM solutions

As a global leader in silicon-to-systems design, Synopsys offers SLM IP and analytics solutions that help improve silicon health and provide operational metrics at each phase of the system lifecycle.

This includes environmental monitoring for understanding and optimizing silicon performance based on the operating environment of the device; structural monitoring to identify performance variations from design to in-field operation; and functional monitoring to track the health and anomalies of critical device functions. 

Our SLM IP and analytics solutions include:

Process, voltage, and temperature monitors

  • Help ensure optimal operation while maximizing performance, power, and reliability.
  • Provide highly accurate, distributed monitoring throughout the die, enabling thermal management via frequency throttling.
  • Available on process nodes from 28 nm to 3 nm.
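
To illustrate how distributed temperature readings might drive frequency throttling, here is a minimal sketch of a step-down and step-up policy with hysteresis. The thresholds, frequency steps, and function name are hypothetical values chosen for illustration, not product parameters.

```python
from typing import Sequence

# Hypothetical thresholds and frequency steps, for illustration only.
THROTTLE_ABOVE_C = 95.0  # step down when the hottest sensor exceeds this
RESTORE_BELOW_C = 85.0   # step up only after cooling below this (hysteresis)
FREQ_STEPS_MHZ = [1000, 1400, 1800]  # lowest to highest


def next_freq_index(sensor_temps_c: Sequence[float], current_index: int) -> int:
    """Pick the next frequency step from the hottest on-die sensor reading."""
    hottest = max(sensor_temps_c)
    if hottest > THROTTLE_ABOVE_C:
        return max(current_index - 1, 0)
    if hottest < RESTORE_BELOW_C:
        return min(current_index + 1, len(FREQ_STEPS_MHZ) - 1)
    return current_index  # inside the hysteresis band: hold the current step


index = len(FREQ_STEPS_MHZ) - 1  # start at the highest frequency
for temps in ([70.0, 88.0, 97.5], [80.0, 90.0, 92.0], [60.0, 75.0, 84.0]):
    index = next_freq_index(temps, index)
    print(f"hottest={max(temps):.1f} C -> {FREQ_STEPS_MHZ[index]} MHz")
```

The gap between the two thresholds keeps the clock from oscillating between steps when the die temperature hovers near a single limit.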

Path margin monitors

  • Measure timing margin of 1000+ synthetic and functional paths (in-test and in-field).
  • Enable silicon performance optimization based on actual margins.
  • Automated path selection, IP insertion, and scan generation.
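
One way margin readings could feed in-field prediction is to fit a trend and extrapolate it. The sketch below fits a straight line to periodic margin samples and estimates how many hours remain before the margin reaches a guard band. The linear aging model, the function name, and the sample data are all assumptions made for illustration; real degradation is generally not linear.

```python
from typing import Optional, Sequence


def hours_until_margin_exhausted(
    hours: Sequence[float],
    margins_ps: Sequence[float],
    guard_band_ps: float = 0.0,
) -> Optional[float]:
    """Fit margin = slope * t + intercept and extrapolate to the guard band.

    Returns the hours remaining after the last sample, or None if the
    fitted margin is not shrinking.
    """
    n = len(hours)
    mean_t = sum(hours) / n
    mean_m = sum(margins_ps) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(hours, margins_ps))
    slope = sxy / sxx  # least-squares slope in ps per hour
    if slope >= 0:
        return None
    intercept = mean_m - slope * mean_t
    crossing = (guard_band_ps - intercept) / slope
    return max(crossing - hours[-1], 0.0)


# Hypothetical samples: margin shrinking by 2 ps every 1,000 hours.
sample_hours = [0.0, 1000.0, 2000.0, 3000.0]
sample_margins_ps = [50.0, 48.0, 46.0, 44.0]
remaining = hours_until_margin_exhausted(sample_hours, sample_margins_ps, 10.0)
print(f"Estimated hours until guard band: {remaining:.0f}")
```

With the sample data, the fitted margin reaches the 10 ps guard band about 17,000 hours after the last reading, which is the kind of lead time that allows corrective action before a failure.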

Clock and delay monitors

  • Measure the delay between the edges of one or more signals.
  • Check the quality of clock duty cycle.
  • Track memory read access time with built-in self-test (BIST).
  • Characterize digital delay lines.
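
As a small worked example of a duty cycle check, the sketch below derives the duty cycle from three consecutive clock edge times and compares it with a target. The function names, the 50% target, and the 5% tolerance are hypothetical choices, not values taken from any monitor specification.

```python
def duty_cycle(rising_ps: float, falling_ps: float, next_rising_ps: float) -> float:
    """Fraction of the clock period spent high, from three consecutive edges."""
    period = next_rising_ps - rising_ps
    if period <= 0 or not (rising_ps <= falling_ps <= next_rising_ps):
        raise ValueError("edges must be ordered: rising, falling, next rising")
    return (falling_ps - rising_ps) / period


def duty_cycle_ok(duty: float, target: float = 0.5, tolerance: float = 0.05) -> bool:
    """True when the duty cycle is within the tolerance of the target."""
    return abs(duty - target) <= tolerance


# Hypothetical 1 GHz clock (1,000 ps period) that is high for 460 ps.
measured = duty_cycle(0.0, 460.0, 1000.0)
print(f"duty cycle = {measured:.0%}, within tolerance: {duty_cycle_ok(measured)}")
```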

UCIe monitor, test, and repair

  • Monitor the signal integrity of die-to-die UCIe lane(s).
  • Generate algorithmic BIST patterns to detect interconnect fault types, including lane-to-lane crosstalk.
  • Perform cumulative lane repair with redundancy allocation (upon manufacturing and in-field).
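
Cumulative lane repair with redundancy allocation can be pictured as a remap table that only ever grows: repairs made at manufacturing are kept, and lanes that fail later in the field draw on whatever spares remain. The sketch below is a simplified software model under that assumption. The class name, lane numbers, and spare count are hypothetical and are not taken from the UCIe specification, and failures of the spare lanes themselves are not modeled.

```python
from typing import Dict, Iterable, List


class LaneRepair:
    """Tracks which spare lane replaces each failed lane, cumulatively."""

    def __init__(self, spare_lanes: Iterable[int]) -> None:
        self._free_spares: List[int] = list(spare_lanes)
        self.remap: Dict[int, int] = {}  # failed lane -> spare lane

    def repair(self, failed_lanes: Iterable[int]) -> bool:
        """Allocate spares for newly failed lanes and keep earlier repairs.

        Returns False, leaving the state unchanged, if the spares run out.
        """
        new_failures = [
            lane for lane in sorted(set(failed_lanes)) if lane not in self.remap
        ]
        if len(new_failures) > len(self._free_spares):
            return False
        for lane in new_failures:
            self.remap[lane] = self._free_spares.pop(0)
        return True


link = LaneRepair(spare_lanes=[64, 65])      # hypothetical spare lane IDs
at_manufacturing = link.repair([3])          # lane 3 fails at manufacturing test
in_field = link.repair([3, 17])              # lane 17 fails later in the field
out_of_spares = link.repair([20])            # no spare lanes are left
print(link.remap, at_manufacturing, in_field, out_of_spares)
```

When the allocation fails, the link can no longer be repaired at full width, which is the point at which a system would need to fall back to a degraded mode or flag the part for service.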

High-speed access and test

  • Enable testing over functional interfaces (PCIe, USB, SPI, etc.).
  • For in-field operation as well as wafer sort, final test, and system-level test.
  • Can be used in conjunction with automated test equipment (ATE).
  • Facilitate in-field remote diagnoses and lower-cost test via reduced pin count.

High-bandwidth memory (HBM) external test and repair

  • Comprehensive, silicon-proven DRAM stack test, repair, and diagnostics engine.
  • Support third-party HBM DRAM stack provider solutions.
  • High-performance die-to-die interconnect test and repair support.
  • Operate in conjunction with HBM PHY and support a range of HBM protocols and configurations.

SLM hierarchical subsystem

  • Automated hierarchical SLM and test manageability solution for systems-on-chip (SoCs).
  • Automated integration of and access to all IP/cores with in-system scheduling.
  • Pre-validated, ready-to-use ATE patterns with pattern porting.

Silicon test and telemetry in the age of AI

With the scale and processing demands of AI devices and workloads on the rise, system reliability, silicon health, and SDC issues are becoming more widespread. While there is no single solution or antidote for avoiding these issues, deeper and more comprehensive test, repair, and telemetry — at the silicon level — can help mitigate them. The ability to detect or predict in-field chip degradation is particularly valuable, enabling corrective action before sudden or catastrophic system failures occur.

By delivering end-to-end visibility and RAS, silicon test, repair, and telemetry will become increasingly important as the industry pushes forward in the age of AI.

 

This article originally appeared in EDN.

 
