Ensuring the Health and Reliability of Multi-Die Systems

Guy Cortez, Manuel Mota, Randy Fish, Yervant Zorian

Oct 11, 2023 / 7 min read

From generative AI tools that rapidly produce chatbot responses to high-performance computing (HPC) applications enabling financial forecasting and weather modeling, it’s clear we’re in a whole new realm of processing power demand. Given these compute-intensive workloads, monolithic SoCs are no longer capable of meeting today’s processing needs. Engineering ingenuity, however, has answered the call with the emergence of multi-die systems, a masterpiece of heterogeneous integration in a single package that delivers new levels of system power and performance, yield advantages, and faster addition of system functionality.

With so much riding on multi-die systems, how do you ensure their health and reliability through their lifecycles?

Chip testing is essential for any silicon design. Multi-die systems, in particular, require thorough testing from the die through the system level, including all of the interconnects (like Universal Chiplet Interconnect Express (UCIe)) that tie each component together. In this blog post, we’ll take a closer look at the unique issues for multi-die systems and how testing and silicon lifecycle management can ensure that these complex designs will work reliably as intended. You can also gain additional insights by watching our on-demand, six-part webinar series, “Requirements for Multi-Die System Success.” The series covers multi-die system trends and challenges, early architecture design, co-design and system analysis, die-to-die connectivity, verification, and system health.


Thorough Chip Testing from Die to System

Many factors can affect a chip’s performance. Temperature, aging, and degradation are just a few. The stakes are even higher for multi-die systems, where one failed die can cause the entire system to fail—a costly outcome. Filtering out defects at the die level is a good first step. Each die that is developed will undergo its own testing process to ensure very low defective parts per million (DPPM). Test automation flows provide testing and diagnostic capabilities for digital, memory, and analog segments of devices. The challenge is to balance the number of test patterns required with the associated costs and, ultimately, the need to get the desired results.
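
To see why per-die DPPM matters so much once dies are combined, here is a minimal Python sketch. It assumes that defects escape test independently on each die and that a single defective die makes the whole package defective. The function name and the example DPPM figures are illustrative, not measured data.

```python
def system_defect_ppm(die_dppm):
    """Estimate defective systems per million from per-die DPPM values.

    Assumes independent die defects and that one bad die fails the system.
    """
    p_all_good = 1.0
    for dppm in die_dppm:
        if not 0 <= dppm <= 1_000_000:
            raise ValueError("DPPM must be between 0 and 1,000,000")
        # Probability that this particular die is good.
        p_all_good *= 1.0 - dppm / 1_000_000
    return (1.0 - p_all_good) * 1_000_000


# Hypothetical package: one logic die plus four memory dies.
example = [200, 50, 50, 50, 50]
print(f"{system_defect_ppm(example):.1f} defective systems per million")
```

Under these assumptions, the system-level figure is close to the sum of the per-die figures when each is small, so every die added to the package tightens the DPPM budget for the others.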

While examining each die is important, so too is evaluating the system at, well, the system level. Multi-die systems can bring together dies, or chiplets, from different process nodes and for different purposes. As a result, one system could contain dies that run at different temperatures or dissipate heat at different levels. Electromagnetic interference between the dies as well as electromigration could become problematic, too.

Multi-die systems benefit from a thorough pre-assembly testing step to identify known good dies (KGD). Advanced design-for-test (DFT) capabilities built into the design blocks can assess the dies. Once individual dies have been tested and, if needed, repaired, the design can be assembled and bonded. After the memory and logic dies are partially or fully bonded, the interconnects can be tested.

Enhancing Power and Performance of Chiplet Interconnects

Die-to-die interfaces enable dies to be placed side by side or, for even greater density, stacked in a 2.5D or 3D package. When the interfaces—functional blocks that provide the data interface between two dies—are able to deliver high bandwidth, high power efficiency, and low latency, they can enhance the system’s performance.

Die-to-die connectivity is often based on high-speed interfaces such as UCIe, which is poised to become the interconnect standard of choice for multi-die systems. It’s the industry’s only standard with a complete tool suite for the die-to-die interface. Suited for 2D and 2.5D packages (as well as 3D packages in the future), UCIe supports the bulk of designs today from 8 Gbps to 16 Gbps per pin, ideal for high-bandwidth applications from networking to hyperscale data centers. For 3D designs, interconnect-level risks are higher because the interconnects are shorter, making the through-silicon vias (TSVs) more fragile.

What multi-die system designers need to avoid are stuck-at faults, opens, or shorts in the interconnects, while assuring proper behavior from timing and voltage perspectives. Since very high-speed signals are involved, signal integrity is an important parameter indicating the effectiveness of data sharing between the dies. As such, measuring and monitoring to detect signal degradation is essential. UCIe mandates redundant lanes between the two sides of the PHY, enabling repairs through the extra lanes. All dies in a system based on UCIe must be accessed, tested, and repaired through the UCIe channel, which enables monitoring of ongoing issues in the dies.
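
The idea of repairing through redundant lanes can be sketched in a few lines of Python. This is only an illustration of the concept, not the remapping procedure defined in the UCIe specification: the function name, the lane numbering, and the spare-lane labels are all hypothetical.

```python
def plan_lane_repair(failed_lanes, spare_lanes):
    """Map each failed lane to a spare lane.

    Returns a dict {failed_lane: spare_lane}, or None when there are
    more failed lanes than spares, meaning the link cannot be repaired.
    """
    failed = sorted(set(failed_lanes))
    spares = sorted(set(spare_lanes))
    if len(failed) > len(spares):
        return None
    # Pair failed lanes with spares in ascending order.
    return dict(zip(failed, spares))


# Hypothetical link with two spare lanes, labelled R0 and R1.
print(plan_lane_repair([3, 7], ["R0", "R1"]))
print(plan_lane_repair([1, 2, 3], ["R0", "R1"]))
```

The second call returns None because three lanes have failed and only two spares exist, which is the point at which a die or package would have to be rejected rather than repaired.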

Post-bond testing can address interconnect-level issues that warrant the need to switch interconnect lanes. Algorithmic tests are also available to check for interconnect defects. Different sets of algorithms apply to 2.5D and 3D interconnects, and the tests are based on the defectivities of the interconnect. The fault model dictates which algorithmic test to apply.
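
As a rough illustration of an algorithmic interconnect test, the Python sketch below drives walking-one and walking-zero patterns across a small link and reports the lanes that come back wrong. The fault model is a simplifying assumption: stuck-at faults force a lane to a fixed value, an open is treated as stuck-at-0, and a short resolves to the logical AND of the two shorted lanes. Production algorithms for 2.5D and 3D interconnects are more elaborate, and all names here are hypothetical.

```python
def transmit(pattern, stuck=None, shorts=()):
    """Hypothetical link model: return the bits seen at the receiver."""
    received = list(pattern)
    for a, b in shorts:
        # Assumed wired-AND behavior for two shorted lanes.
        received[a] = received[b] = pattern[a] & pattern[b]
    for lane, value in (stuck or {}).items():
        # Stuck-at fault (an open is modelled as stuck-at-0).
        received[lane] = value
    return received


def walking_patterns(width):
    """Yield walking-one patterns, then walking-zero patterns."""
    for i in range(width):
        yield [1 if j == i else 0 for j in range(width)]
    for i in range(width):
        yield [0 if j == i else 1 for j in range(width)]


def find_faulty_lanes(width, link):
    """Return sorted lanes whose received bit ever differs from the driven bit."""
    faulty = set()
    for pattern in walking_patterns(width):
        received = link(pattern)
        faulty.update(i for i in range(width) if received[i] != pattern[i])
    return sorted(faulty)


# Lane 2 stuck at 0, lanes 0 and 3 shorted together.
print(find_faulty_lanes(4, lambda p: transmit(p, stuck={2: 0}, shorts=[(0, 3)])))
```

The list of faulty lanes is exactly the input a repair step needs in order to decide which redundant lanes to switch in.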

Intelligent Monitoring and Analysis Through the System’s Lifecycle

Multi-die systems feature tiny microbumps that are placed close together, making testing via physical probing impossible. For example, for UCIe, microbump pitch is 25 to 55 micrometers, while probe pitch is typically 90 micrometers. A better solution is to conduct electronic probing through built-in self-test (BIST). BIST can detect soft or hard errors requiring corrective action. Alternatively, dedicated wafer-based testing pads, integrated at the pre-assembly phase, can be used.

When the system is in development as well as in the field, a silicon lifecycle management (SLM) methodology that integrates sensors and monitors on the dies to assess various parameters, such as temperature, voltage, aging, and degradation, becomes useful. SLM IP technology integrated with analytics intelligence can turn high volumes of data collected from device sensors and monitors into actionable insights for system optimization.

Consider how SLM technology can identify thermal issues, which are a concern for individual dies and multi-die systems alike. Without real workloads, these are difficult issues to evaluate during the in-design phase. When you add in the complexity of a 2.5D or 3D architecture, then it’s really hard to know the thermal profile of the final design. Here’s a situation where SLM can help. On-chip monitors that are strategically placed on the die can open the door to analytics providing deeper insights into the thermal characteristics of dies and can signal the need to adjust placements to address heat dissipation. Similarly, knowing more about thermal effects might lead to a decision to slow down the data rates in the system’s High-Bandwidth Memory component. Or, there may be ways to mitigate heat dissipation via the software. With monitors providing data, designers can analyze and determine the best course for correction.
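
A small Python sketch shows the kind of analysis that monitor data makes possible. It assumes each die reports a list of temperature samples in degrees Celsius and applies a deliberately simple policy: flag any die whose peak exceeds a limit, and fall back to a lower memory data rate while it does. The die names, readings, limit, and data rates are invented for illustration and do not come from any real SLM product.

```python
from statistics import mean


def summarize(readings):
    """Return {die: (mean_c, peak_c)} from {die: [samples in Celsius]}."""
    return {die: (mean(s), max(s)) for die, s in readings.items() if s}


def hot_dies(readings, limit_c):
    """Return the sorted names of dies whose peak temperature exceeds limit_c."""
    summary = summarize(readings)
    return sorted(die for die, (_, peak) in summary.items() if peak > limit_c)


def pick_data_rate(peak_c, limit_c, nominal_gbps, reduced_gbps):
    """Use the reduced data rate while the peak is over the limit."""
    return reduced_gbps if peak_c > limit_c else nominal_gbps


# Hypothetical monitor readings for a three-die package.
readings = {
    "logic": [71.0, 78.5, 83.0],
    "hbm": [88.0, 96.5, 101.0],
    "io": [55.0, 58.0],
}
print(hot_dies(readings, limit_c=95.0))
print(pick_data_rate(max(readings["hbm"]), 95.0, nominal_gbps=6.4, reduced_gbps=4.8))
```

A real flow would feed far richer data into analytics, but even this simple view separates the two responses described above: changing the design (placement) versus changing the operating point (data rate or software behavior).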

SLM technology also provides traceability, the ability to trace an issue back to its root cause regardless of when in the lifecycle the end product exhibits it. For example, if a yield excursion is detected any time during the test manufacturing process, the ability to determine whether the problem stems from a certain wafer or die, across every wafer or die manufactured during a certain time period, or from the fab can be vitally important, especially in multi-die systems where packaging can be very expensive. The faster you find the problem, the faster you can go to market and reduce your costs. A good SLM solution should be able to identify root cause within a matter of minutes, compared to manual methods that can take days or weeks.
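
The first step in tracing a yield excursion is usually to slice test results by manufacturing attribute. The Python sketch below assumes each test record is a dictionary carrying a lot, a wafer, and a pass/fail flag; the field names and the threshold are hypothetical, and a real SLM database would hold many more attributes.

```python
from collections import defaultdict


def fail_rate_by(records, key):
    """Return {group: fraction of failing records}, grouped by records[key]."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if not record["passed"]:
            fails[record[key]] += 1
    return {group: fails[group] / totals[group] for group in totals}


def excursions(records, key, threshold):
    """Return the sorted groups whose fail rate exceeds the threshold."""
    rates = fail_rate_by(records, key)
    return sorted(group for group, rate in rates.items() if rate > threshold)


# Hypothetical test log: wafer W2 fails far more often than its neighbors.
records = (
    [{"lot": "L1", "wafer": "W1", "passed": True}] * 4
    + [{"lot": "L1", "wafer": "W2", "passed": False}] * 3
    + [{"lot": "L1", "wafer": "W2", "passed": True}]
    + [{"lot": "L2", "wafer": "W3", "passed": True}] * 3
    + [{"lot": "L2", "wafer": "W3", "passed": False}]
)
print(excursions(records, key="wafer", threshold=0.5))
```

Running the same function with key="lot" or a time-window field distinguishes a single bad wafer from a problem that spans a lot or a production period.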

Traceability also includes the case where the end product is already deployed in the field but starts to exhibit unexpected and potentially catastrophic failures, potentially requiring a recall. This return merchandise authorization (RMA) case can take advantage of SLM and the entire ecosystem of testing all the way back through manufacturing to identify root cause as well as “like” devices in the field that may still exhibit the same behavior, enabling the product owner to proactively recall devices before they fail or adjust the operating voltage or frequency of the devices to prolong their lives.
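
Finding “like” devices comes down to matching manufacturing attributes. As a minimal sketch, assume each fielded device is recorded with an identifier plus its lot and wafer; the function below returns the other devices that share those attributes with a returned unit. The record layout and the choice of matching keys are assumptions made for illustration.

```python
def like_devices(fleet, failed_id, keys=("lot", "wafer")):
    """Return sorted IDs of devices matching the failed device on every key."""
    by_id = {device["id"]: device for device in fleet}
    if failed_id not in by_id:
        raise KeyError(f"unknown device: {failed_id}")
    reference = by_id[failed_id]
    return sorted(
        device["id"]
        for device in fleet
        if device["id"] != failed_id
        and all(device[k] == reference[k] for k in keys)
    )


# Hypothetical fleet; device A has been returned as an RMA.
fleet = [
    {"id": "A", "lot": "L1", "wafer": "W1"},
    {"id": "B", "lot": "L1", "wafer": "W1"},
    {"id": "C", "lot": "L1", "wafer": "W2"},
    {"id": "D", "lot": "L2", "wafer": "W1"},
]
print(like_devices(fleet, "A"))
print(like_devices(fleet, "A", keys=("lot",)))
```

Widening or narrowing the keys changes how many devices are considered at risk, which is the trade-off between a broad recall and a targeted voltage or frequency adjustment.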

The last phase of testing is on the stack itself. Here, “known good system” is the operative phrase, as testing teams aim to determine whether their multi-die system will work well—and find ways to monitor, analyze, and fix issues when needed. IEEE Std 1838-2019 provides a modular test access architecture, enabling testing of dies and interconnect layers between adjacent stacked dies.

For stacked architectures, some testing needs to be pushed downstream, while more intelligent testing remains upstream in the process. For example, assessing for high temperatures at the die level isn’t feasible. Instead, temperature tests on multi-die systems are most effective when performed after stacking. Failures uncovered at this point can be fixed depending on their location. Temperature tests at the wafer level are also possible, though these can be rather expensive. Designers of high-end systems may opt to perform these tests. The ability to monitor and gather this important data gives design, manufacturing, and test teams the ability to make decisions on how to achieve the best quality of results.

Automation and Intelligence Drive Higher Quality Multi-Die Systems

To address the needs we’ve discussed and help move the next wave of semiconductor innovation forward, Synopsys provides our Multi-Die Solution, which accelerates heterogeneous integration in a single package. The comprehensive solution includes elements for testing, diagnosing, repairing, calibrating, and improving operational metrics through a system’s lifecycle. Traceability and analytics for in-design, in-ramp, in-production, and in-field optimization can drive improvements in yield, quality, and reliability as well as a reduction in costs. In addition, Synopsys.ai, our AI-driven chip design suite, features the first autonomous AI application for semiconductor test. Synopsys TSO.ai optimizes test program generation in complex designs for maximum defect coverage with fewer test patterns.

Multi-die systems are fast becoming mainstream as chip designers seek ways to deliver the high bandwidth and performance demanded by compute-intensive workloads. Automated test flows and analytics intelligence can drive quality and reliability levels higher for these systems. That’s great news for the applications that are enhancing our world, from generative AI to HPC.
