Managing thermal complexity and optimizing power are major priorities in SoC design. This is hard enough on a single die; it is considerably harder across multiple dies in a system, especially as the system ages. Inserting the right monitors into your design is foundational to mitigating heat and voltage problems and to achieving long-term device success in both HPC and the data center.
Process, voltage, and temperature (PVT) monitors have been used in the field for years for on-chip voltage and power management, most commonly in the form of dynamic voltage and frequency scaling (DVFS). Sometimes these monitors simply track temperature and trigger a shut-off when readings trend toward a catastrophic outcome. In fact, nearly 100% of designs at 16nm and below, along with 100% of data center chips, use PVT monitors.
During wafer sort testing, you get the first results from these monitors, and the data can be used immediately. At this point, you understand your thermal profile and can apply more test sequences to monitor voltage values across the die. In addition, you can run analytics on test, PVT, and path margin monitor IP data, then go back into the design environment to understand the real margins you are seeing in silicon and correlate them with your models. The better the modeling, the more you can trim your margins to increase performance or reduce power without sacrificing reliability, availability, and serviceability (RAS).
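As a rough illustration of correlating silicon margins with model predictions, the sketch below compares modeled and measured timing slack per path margin monitor. The monitor names (`pmm0` and so on), the picosecond values, and the helper `margin_gaps` are all hypothetical and chosen only for illustration; real monitor IP exposes its data through vendor-specific tooling.

```python
from statistics import mean

# Hypothetical per-monitor timing slack in picoseconds (illustrative values only).
modeled_slack_ps = {"pmm0": 40.0, "pmm1": 55.0, "pmm2": 30.0}
measured_slack_ps = {"pmm0": 62.0, "pmm1": 70.0, "pmm2": 33.0}


def margin_gaps(modeled, measured):
    """Return measured-minus-modeled slack for each monitor present in both sets."""
    return {name: measured[name] - modeled[name] for name in modeled if name in measured}


gaps = margin_gaps(modeled_slack_ps, measured_slack_ps)

# A positive gap means silicon has more slack than the model predicted.
# The smallest gap bounds how much margin could be trimmed everywhere.
worst_case_gap = min(gaps.values())
average_gap = mean(gaps.values())

print(gaps)
print(f"average gap: {average_gap:.1f} ps, worst-case gap: {worst_case_gap:.1f} ps")
```

In this toy data set, the monitor with the smallest gap is the limiting one: any margin reduction applied across the die would have to stay within that value.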
To anticipate problems before they occur, you can set thresholds. For temperature monitors, the threshold sets the point where you start managing the temperature back down. This works because thermal response is usually relatively slow, which leaves time to react. The more aggressive your thresholds (that is, the sooner they trip), the earlier you can act. Voltage monitors can be used similarly, even though what you are monitoring is a bit different.
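A minimal sketch of such a threshold policy for a single temperature monitor follows. The class name `ThermalGuard`, the specific temperatures, and the action names are assumptions made for illustration. The separate release threshold (hysteresis) is also an addition not described above; it is a common way to keep the policy from toggling when the reading hovers near the trip point.

```python
from dataclasses import dataclass


@dataclass
class ThermalGuard:
    """Hypothetical threshold policy for one temperature monitor."""

    throttle_c: float = 95.0   # assumed: start managing temperature back down
    release_c: float = 85.0    # assumed: resume normal operation at or below this
    shutdown_c: float = 110.0  # assumed: last-resort shut-off point
    throttled: bool = False

    def update(self, temp_c: float) -> str:
        """Return the action for the latest reading: normal, throttle, or shutdown."""
        if temp_c >= self.shutdown_c:
            self.throttled = True
            return "shutdown"
        if temp_c >= self.throttle_c:
            self.throttled = True
        elif temp_c <= self.release_c:
            self.throttled = False
        # Between release_c and throttle_c the previous state is kept.
        return "throttle" if self.throttled else "normal"


guard = ThermalGuard()
readings = [80.0, 96.0, 90.0, 84.0, 112.0]
actions = [guard.update(t) for t in readings]
print(actions)
```

Lowering `throttle_c` makes the policy act earlier at the cost of throttling more often. A voltage monitor could follow the same pattern with droop thresholds in place of temperatures.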
In the early ramp phase, you produce only a small number of chips, just to make sure the chip is functional and to confirm that it is hitting target yields before launching into full production. You collect data from the test and diagnostic results that come off the fab early on, as well as all the data throughout product manufacturing. You may identify systematic issues to address during this time.
Once your device is deployed in the field, you’ll want to take advantage of the latest strategies to see how it is functioning in use as it ages. For this, new capabilities are emerging, including in-field scans with Intel’s Sapphire Rapids. You can also insert a silicon lifecycle management (SLM) software agent into a device that’s local to the system for ongoing edge analytics, as well as problem mitigation. In-field silicon management is an area where much innovation is happening right now, and new capabilities are on the near-term horizon.