Addressing Hardware Failures and Silent Data Corruption in the AI Infrastructure Buildout

Shankar Krishnamoorthy, Yervant Zorian

Jul 23, 2025 / 4 min read

Table of Contents

Silicon reliability, availability, and serviceability (RAS)
Synopsys silicon lifecycle management (SLM) solutions
Silicon test and telemetry in the age of AI

Meta trained one of its AI models, Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta attributed about 78% of the unexpected interruptions to confirmed or suspected hardware issues, such as GPU and host component failures.
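
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the interruptions were spread evenly over the 54 days and that the 78% share applies to the 419 unexpected interruptions; both are simplifying assumptions made for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
DAYS = 54
TOTAL_INTERRUPTIONS = 466
UNEXPECTED = 419
HARDWARE_SHARE = 0.78  # assumed to apply to the unexpected interruptions

hours = DAYS * 24
unexpected_fraction = UNEXPECTED / TOTAL_INTERRUPTIONS
# Assumes an even spread of interruptions across the run (illustrative only).
mean_hours_between_unexpected = hours / UNEXPECTED
hardware_related = round(UNEXPECTED * HARDWARE_SHARE)

print(f"Unexpected share of interruptions: {unexpected_fraction:.1%}")
print(f"Mean time between unexpected interruptions: {mean_hours_between_unexpected:.1f} h")
print(f"Approx. hardware-related interruptions: {hardware_related}")
```

Under these assumptions, the cluster hit an unexpected interruption roughly every three hours, and about 327 of them traced back to hardware.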

Hardware issues like these don’t just cause job interruptions. They can also lead to silent data corruption (SDC): data loss or inaccuracies that often go undetected for extended periods.

While Meta’s pre-training interruptions were unexpected, they shouldn’t be entirely surprising. AI models like Llama 3 have massive processing demands that require colossal computing clusters. For training alone, AI workloads can require hundreds of thousands of nodes and associated GPUs working in unison for weeks or months at a time.

The intensity and scale of AI processing and switching create a tremendous amount of heat, voltage fluctuations, and noise, all of which place unprecedented stress on computational hardware. The GPUs and underlying silicon can degrade more rapidly than they would under normal (or what used to be normal) conditions. Performance and reliability wane accordingly.

This is especially true for sub-5 nanometer process technologies, where silicon degradation and faulty behavior have been observed both at manufacturing and in the field.

But what can be done about it? How can unanticipated interruptions and SDC be mitigated? And how can chip design teams ensure optimal performance and reliability as the industry pushes forward with newer, bigger AI workloads that demand even more processing capacity and scale?

Ensuring silicon reliability, availability, and serviceability (RAS)

Certain AI innovators like Meta have established monitoring and diagnostics capabilities to improve the availability and reliability of their computing environments. But with processing demands, hardware failures, and SDC issues on the rise, there is a distinct need for test and telemetry capabilities at deeper levels — all the way down to the silicon and multi-die packages within each XPU/GPU as well as the interconnects that bring them together.

The key is silicon lifecycle management (SLM) solutions that help ensure end-to-end RAS, from design and manufacturing to bring-up and in-field operation.

With better visibility, monitoring, and diagnostics at the silicon level, design teams can:

Gain telemetry-based insights into why chips are failing or why SDC is occurring.
Identify voltage or timing degradation, overheating, and mechanical failures in silicon components, multi-die packages, and high-speed interconnects.
Conduct more precise thermal and power characterization for AI workloads.
Detect, characterize, and resolve radiation, voltage noise, and mechanism failures that can lead to undetected bit flips and SDC.
Improve silicon yield, quality, and in-field RAS.
Implement reliability-focused techniques, such as triple modular redundancy (TMR) and dual-core lockstep, during the register-transfer level (RTL) design phase to mitigate SDC.
Establish an accurate pre-silicon aging simulation methodology to detect sensitive or vulnerable circuits and replace them with aging-resilient circuits.
Improve outlier detection on reliability models, which helps minimize in-field SDC.
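
To make the two redundancy techniques in the list above concrete, here is a minimal behavioral sketch. In practice these are implemented in RTL, not software; the function names and the example word are hypothetical and exist only to show the logic. Triple modular redundancy votes across three copies of a value, while dual-core lockstep compares two copies and raises a flag on a mismatch.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote across three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def lockstep_mismatch(primary: int, shadow: int) -> bool:
    """Dual-core lockstep check: flag an error when the two results differ."""
    return primary != shadow


golden = 0b1011_0010
upset = golden ^ 0b0000_1000  # one copy suffers a single bit flip

# TMR masks the fault: the two good copies outvote the corrupted one.
assert tmr_vote(golden, upset, golden) == golden

# Lockstep only detects the fault; it cannot tell which core is wrong.
assert lockstep_mismatch(golden, upset)
assert not lockstep_mismatch(golden, golden)
```

The trade-off is visible even in this toy model: TMR corrects a single faulty copy at roughly three times the logic cost, whereas lockstep costs about twice the logic and can only detect the error.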

[Infographic: Synopsys silicon lifecycle management solution]

Synopsys SLM solutions

As a global leader in silicon-to-systems design, Synopsys offers SLM IP and analytics solutions that help improve silicon health and provide operational metrics at each phase of the system lifecycle.

This includes environmental monitoring for understanding and optimizing silicon performance based on the operating environment of the device; structural monitoring to identify performance variations from design to in-field operation; and functional monitoring to track the health and anomalies of critical device functions.

Our SLM IP and analytics solutions include:

Process, voltage, and temperature monitors

Help ensure optimal operation while maximizing performance, power, and reliability.
Highly accurate and distributed monitoring throughout the die, enabling thermal management via frequency throttling.
Available on process nodes from 28 nm to 3 nm.
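
As a rough illustration of thermal management via frequency throttling, the sketch below steps a clock frequency down when the hottest of several distributed sensors crosses a limit and steps it back up once the die cools. The class, thresholds, and frequency steps are hypothetical values chosen for the example, not part of any Synopsys product interface.

```python
from dataclasses import dataclass


@dataclass
class ThrottleController:
    """Hypothetical frequency governor driven by on-die temperature readings."""
    freq_steps_mhz: tuple = (2000, 1600, 1200, 800)  # illustrative values
    hot_c: float = 95.0   # step down at or above this temperature
    cool_c: float = 85.0  # step back up at or below this temperature
    level: int = 0        # index into freq_steps_mhz

    def update(self, sensor_temps_c: list[float]) -> int:
        """Return the new frequency given readings from distributed sensors."""
        hottest = max(sensor_temps_c)  # the hottest spot on the die governs
        if hottest >= self.hot_c and self.level < len(self.freq_steps_mhz) - 1:
            self.level += 1
        elif hottest <= self.cool_c and self.level > 0:
            self.level -= 1
        return self.freq_steps_mhz[self.level]


ctrl = ThrottleController()
for temps in ([70, 80, 75], [90, 96, 88], [92, 97, 90], [88, 90, 86], [80, 84, 79]):
    print(ctrl.update(temps))
```

The gap between the two thresholds provides hysteresis, so the frequency does not oscillate when the temperature hovers near a single limit.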

Path margin monitors

Measure timing margin of 1000+ synthetic and functional paths (in-test and in-field).
Enable silicon performance optimization based on actual margins.
Automated path selection, IP insertion, and scan generation.
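
One way to picture optimization based on actual margins: given measured path delays, the margin of each path is the clock period minus its delay, and the slowest path sets the highest safe clock frequency. The sketch below assumes a simple fixed guard band; the path names, delay values, and guard band are made up for illustration.

```python
def timing_margins(path_delays_ns: dict[str, float], clock_period_ns: float) -> dict[str, float]:
    """Margin (slack) per path: clock period minus measured path delay."""
    return {name: clock_period_ns - delay for name, delay in path_delays_ns.items()}


def max_safe_frequency_mhz(path_delays_ns: dict[str, float], guard_band_ns: float = 0.05) -> float:
    """Highest clock frequency that keeps the slowest path inside the guard band."""
    slowest = max(path_delays_ns.values())
    return 1000.0 / (slowest + guard_band_ns)


# Hypothetical measured delays for three monitored paths.
delays = {"p0": 0.40, "p1": 0.45, "p2": 0.35}
margins = timing_margins(delays, clock_period_ns=0.5)
worst_path = min(margins, key=margins.get)

print(f"Worst path: {worst_path}, margin {margins[worst_path]:.3f} ns")
print(f"Max safe frequency: {max_safe_frequency_mhz(delays):.0f} MHz")
```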

Clock and delay monitors

Measure the delay between the edges of one or more signals.
Check the quality of clock duty cycle.
Track memory read access time using built-in self-test (BIST).
Characterize digital delay lines.
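
The duty-cycle check reduces to simple arithmetic on edge timestamps: time spent high divided by total time. On silicon this measurement is done by the monitor hardware; the function below is only a sketch of the calculation, with hypothetical timestamps in picoseconds.

```python
def duty_cycle(rising_ps: list[float], falling_ps: list[float]) -> float:
    """Average duty cycle from paired edge timestamps.

    rising_ps[i] is the i-th rising edge and falling_ps[i] is the falling
    edge that follows it. One extra rising edge closes the final period.
    """
    if len(rising_ps) != len(falling_ps) + 1:
        raise ValueError("need one more rising edge than falling edges")
    high_time = sum(f - r for r, f in zip(rising_ps, falling_ps))
    total_time = rising_ps[-1] - rising_ps[0]
    return high_time / total_time


# Two full clock periods of 1000 ps each, with slightly short high phases.
ratio = duty_cycle([0, 1000, 2000], [480, 1490])
print(f"Duty cycle: {ratio:.1%}")
```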

UCIe monitor, test, and repair

Monitor the signal integrity of die-to-die UCIe lane(s).
Generate algorithmic BIST patterns to detect interconnect fault types, including lane-to-lane crosstalk.
Perform cumulative lane repair with redundancy allocation (at manufacturing and in the field).
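
The idea behind cumulative lane repair can be sketched as a remapping problem: logical lanes are assigned to healthy physical lanes, and each newly detected fault is added to the set already known. The function name, the lane counts, and the simple skip-over mapping below are assumptions for illustration and do not reproduce the repair scheme defined in the UCIe specification.

```python
def allocate_lanes(num_logical: int, num_physical: int, faulty: set[int]) -> dict[int, int]:
    """Map each logical lane to a healthy physical lane, skipping faulty ones.

    Raises RuntimeError when the spare lanes are exhausted.
    """
    healthy = [p for p in range(num_physical) if p not in faulty]
    if len(healthy) < num_logical:
        raise RuntimeError("not enough healthy lanes to repair the link")
    return dict(zip(range(num_logical), healthy))


# Cumulative repair: faults found in the field are added to those found at test.
faulty = {3}                              # found at manufacturing test
mapping = allocate_lanes(16, 18, faulty)  # 16 data lanes, 2 spares (illustrative)
faulty |= {9}                             # a second lane degrades in the field
mapping = allocate_lanes(16, 18, faulty)
print(mapping)
```

With two spares in this example, a third faulty lane would exhaust the redundancy and the function would raise an error instead of returning a mapping.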

High-speed access and test

Enable testing over functional interfaces (PCIe, USB, SPI, etc.).
For in-field operation as well as wafer sort, final test, and system-level test.
Can be used in conjunction with automated test equipment (ATE).
Facilitate in-field remote diagnoses and lower-cost test via reduced pin count.

High-bandwidth memory (HBM) external test and repair

Comprehensive, silicon-proven DRAM stack test, repair, and diagnostics engine.
Support third-party HBM DRAM stack provider solutions.
High-performance die-to-die interconnect test and repair support.
Operate in conjunction with HBM PHY and support a range of HBM protocols and configurations.

SLM hierarchical subsystem

Automated hierarchical SLM and test manageability solution for systems-on-chip (SoCs).
Automated integration of and access to all IP/cores with in-system scheduling.
Pre-validated, ready-to-use ATE patterns with pattern porting.

[Image: AI infrastructure hardware failures and silent data corruption]

Silicon test and telemetry in the age of AI

With the scale and processing demands of AI devices and workloads on the rise, system reliability, silicon health, and SDC issues are becoming more widespread. While there is no single solution or antidote for avoiding these issues, deeper and more comprehensive test, repair, and telemetry — at the silicon level — can help mitigate them. The ability to detect or predict in-field chip degradation is particularly valuable, enabling corrective action before sudden or catastrophic system failures occur.
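
As a simple illustration of predicting degradation from telemetry, the sketch below fits a straight line to periodic timing-margin readings and extrapolates when the margin would cross a minimum threshold. Real aging behavior is generally not linear, and the readings, units, and threshold here are hypothetical; treat this as a toy model of the idea, not a production method.

```python
def hours_until_threshold(hours: list[float], margin_ps: list[float], threshold_ps: float):
    """Fit a least-squares line to margin readings and extrapolate to a threshold.

    Returns the estimated hours remaining after the last reading, or None
    if the fitted trend is flat or improving.
    """
    n = len(hours)
    mean_t = sum(hours) / n
    mean_m = sum(margin_ps) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    if sxx == 0:
        raise ValueError("need readings taken at two or more distinct times")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(hours, margin_ps))
    slope = sxy / sxx
    if slope >= 0:
        return None  # no downward trend to extrapolate
    intercept = mean_m - slope * mean_t
    crossing = (threshold_ps - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical in-field readings: margin shrinks by 2 ps every 1000 hours.
remaining = hours_until_threshold([0, 1000, 2000, 3000], [50, 48, 46, 44], threshold_ps=40)
print(f"Estimated hours until margin reaches threshold: {remaining:.0f}")
```

For these sample readings the estimate is approximately 2,000 hours, which is the kind of lead time that would allow a part to be drained and replaced before it fails.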

Because they deliver end-to-end visibility and RAS, silicon test, repair, and telemetry will become increasingly important as we push forward in the age of AI.

This article originally appeared in EDN.
