How to Save Power and Keep HPC and Data Center Applications Cool

Rita Horner, Shekhar Kapoor, William Ruby

Aug 15, 2023 / 6 min read

AI is ushering in a new era of innovation where applications are more compute and memory intensive than ever. They involve a bewildering number of arithmetic operations, not to mention neural network training and inference, all of which can consume a lot of power. Applications such as generative AI and high-performance computing (HPC) mean your power bill in the data center will soar.

Data center service providers are struggling to keep their electricity costs down while keeping their systems operating. Data centers now consume an estimated 3% to 5% of the world's electricity, and roughly 40% to 45% of a data center's electricity goes to cooling alone. It's a challenge to find packaging and cooling solutions that keep parts functional over the long haul. And keeping the total cost of ownership down for data center customers while staying within provisioned power limits can strain the bottom line.

How can you better manage power and thermal profiles in HPC and the data center? Read on to learn about system power reduction for demanding applications to keep your cool.


Why Multi-Die Systems Present Power and Thermal Benefits and Challenges in HPC and the Data Center

While the drive to lower power used to be the purview of mobile applications, today it's also important for thermal management in HPC and the data center. Multi-die systems, which heterogeneously integrate multiple dies in a package, are fast becoming the engine for higher performance, higher levels of integration, and more optimized system solutions. While multi-die systems deliver higher performance in smaller power envelopes, they also pack a high density of transistors into compact spaces, which concentrates the heat. In fact, the power density of some of today's devices can rival a nuclear reactor or a rocket nozzle!

Here are just a few examples of things to consider when designing multi-die systems for HPC and the data center:

  • Components: It's important to keep your components cool to keep them functioning. Today, it's not unheard of for full circuit boards to be submerged in liquid coolant tanks in the data center to keep them from overheating.
  • Switching activity: As high-speed designs advance to 3nm or even 2nm process nodes, greater levels of switching activity become a central power and thermal management issue (see the first-order model after this list).
  • Glitch power: Getting glitch power under control is key to avoiding system failure or disruption.
  • Power distribution: Optimizing the power distribution networks for your requirements and having measures in place to prevent overheating is an important part of your planning process.
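
To see why switching activity looms so large, recall the standard first-order model of dynamic power in CMOS logic:

$$P_{dyn} \approx \alpha \, C \, V_{dd}^{2} \, f$$

where \(\alpha\) is the switching activity factor, \(C\) is the switched capacitance (gates plus wires), \(V_{dd}\) is the supply voltage, and \(f\) is the clock frequency. Each item above attacks one of these terms: clock gating and glitch reduction lower \(\alpha\), while placement, routing, and power distribution choices lower \(C\) and resistive losses.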

In multi-die systems, you have models to help you meet requirements for performance, energy efficiency, and even reliability. But because of the high power density in these systems, you must also consider heat and power at the architecture level.

Why Striking the Right Power and Performance Balance Is Critical for ROI

There is a delicate balance between power and performance when optimizing ROI for the data center. Cutting-edge applications demand more throughput than ever, but all that performance requires power, and power costs money.

Also figuring into the equation:

  • Provisioned power – Data center companies contract with utility companies for their power ahead of using it. Any increase in power beyond what the contract provisions is delivered at a premium. Moreover, if they don't use the power they expected to, they still have to pay for it.
  • Total Cost of Ownership (TCO) – Data center customers choose data centers based on TCO, which equates to the total time spent utilizing the compute power multiplied by the total resources used. If data centers can save time or resources for their customers, they make their data center more attractive. But the way to achieve this is by increasing the throughput of their compute engines, which also increases power.
  • Even small power savings can have a big impact in aggregate – Imagine you can save 15 watts of power in a chip, which may not seem like much relative to its total power; but if you are deploying 1,000,000 units, that equates to saving 15 megawatts. Every little bit counts (see the sketch after this list).
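
To make the aggregate math concrete, here is a minimal back-of-the-envelope sketch in Python. The utilization factor and electricity rate are illustrative assumptions, not figures from any specific deployment:

```python
# Back-of-the-envelope aggregate power savings (all inputs are assumptions).
watts_saved_per_chip = 15      # e.g., from clock gating and logic optimization
units_deployed = 1_000_000     # fleet size across the data center
utilization = 0.8              # fraction of time the chips are active (assumed)
price_per_kwh = 0.10           # USD per kilowatt-hour (assumed utility rate)

total_watts_saved = watts_saved_per_chip * units_deployed
print(f"Aggregate savings: {total_watts_saved / 1e6:.1f} MW")    # 15.0 MW

hours_per_year = 24 * 365
kwh_saved = total_watts_saved / 1000 * hours_per_year * utilization
print(f"Annual energy saved: {kwh_saved / 1e6:.1f} GWh")         # ~105 GWh
print(f"Annual cost saved:   ${kwh_saved * price_per_kwh:,.0f}") # ~$10.5M
```

And that is before counting the cooling overhead that every saved watt no longer generates.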

Having a highly accurate projection of power consumption, built on the right data and insight, is critical. Any uncertainty around power consumption due to workload characteristics, glitch power, switching activity, power distribution, and more needs to be resolved as early as possible.

Radically Shift Left in Design Methodology to Manage Today’s Power and Thermal Requirements

To ensure the reliability of your multi-die systems, power and thermal impact must be considered and planned for at every stage of the design process. This includes the architectural phase, which is a radical shift left in design methodology.

The architectural phase

When designing multi-die systems for HPC and the data center, it's not just hardware design methodology that matters; software development matters, too. Optimizing software code can have a major impact on the energy efficiency and power consumption of your design once it's deployed.

These benefits extend to the compiler as well: how it translates the language (the code) into machine instructions, and how those instructions are sequenced and executed, all affect energy. A simple illustration follows.
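
As a toy illustration, here is the transformation a compiler calls loop-invariant code motion, written out by hand in Python. The function names and data are ours, purely for demonstration; the same principle applies at the machine-instruction level, where fewer executed operations generally means less dynamic energy:

```python
import math
import timeit

def unoptimized(xs, base):
    # base / 180 is loop-invariant, but it is recomputed on every
    # iteration; more executed operations means more dynamic energy.
    return [x * (base / 180) for x in xs]

def optimized(xs, base):
    # Hoist the invariant out of the loop once, as an optimizing
    # compiler would via loop-invariant code motion.
    scale = base / 180
    return [x * scale for x in xs]

xs = list(range(100_000))
t_slow = timeit.timeit(lambda: unoptimized(xs, math.pi), number=100)
t_fast = timeit.timeit(lambda: optimized(xs, math.pi), number=100)
print(f"recomputed each iteration: {t_slow:.3f}s, hoisted once: {t_fast:.3f}s")
```

An optimizing compiler performs this kind of motion automatically when it can prove the expression never changes, which is why both compiler settings and code structure figure into energy efficiency.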

Then there is the hardware architecture, which goes hand in hand with the software. When you design the hardware architecture, you need to think through various questions to optimize it for energy efficiency. For example, do you really need all that memory?

The logic design phase

When you are in the logic design phase of development, you are looking for wasted-power conditions. Areas that matter here include clock gating efficiency, memory access redundancy, data-path transitions, and so on. How do you control your memory access? How do you control computation? After all, you only want to compute when you need to; you don't want to waste energy on computations the system will never use. A toy model of the clock-gating idea follows.
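
Here is a toy Python model of clock gating (not RTL; the register width and data-valid probability are assumptions). When the clock is gated off during idle cycles, the flops see far fewer clock events, and every avoided event is avoided switching power:

```python
import random

def clock_events(cycles, data_valid_prob, gated, flops=32):
    """Toy model of a 32-bit register bank. Ungated, every flop sees a
    clock edge every cycle; gated, the clock only reaches the flops on
    cycles when new data actually arrives. Each clock event charges
    capacitance, so fewer events means less dynamic power."""
    events = 0
    for _ in range(cycles):
        data_valid = random.random() < data_valid_prob
        if not gated or data_valid:
            events += flops
    return events

random.seed(0)
ungated = clock_events(10_000, data_valid_prob=0.1, gated=False)
gated = clock_events(10_000, data_valid_prob=0.1, gated=True)
print(f"clock events saved by gating: {100 * (1 - gated / ungated):.0f}%")  # ~90%
```

The savings scale with how rarely the data is actually valid, which is why clock gating efficiency is measured against real workload activity.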

The implementation phase to signoff

During the implementation phase, you can perform additional optimizations. Examples include logic restructuring, mixed-Vt and multi-bit flop cell mapping, layer re-assignment, and placement changes to reduce wire capacitance. These optimizations typically yield another 5% or 10% in power savings, as opposed to the orders-of-magnitude greater savings available from optimizing the architecture. But remember, every fraction of a percent can equate to huge savings in aggregate. These steps carry you all the way to a signoff flow driven from top to bottom by power optimization.

As you can see, there is a great deal to consider in design planning. You will also need to determine the conducting materials you will use and the models you need to have. Given all the thermal- and stress-related impacts, you need to carefully consider materials at the substrate (or interposer) level. And up front, you need to determine the kind of analysis required around the rules and the process design kits (PDKs), from the software side, the hardware side, and into the design side, on every level.

To get power and thermal management right, you also need to think ahead to when your system is deployed in the field. System monitoring and predictive analytics can help you understand the dynamic effects throughout the flow. You also need to gain insight into optimizations based on real working data to inform design changes and to address potential issues. Predictive failure analysis, redundancy measures, and other strategies can help mitigate possible failures. Defining your lifecycle timeframe up front, designing in the right monitors, and getting the right information and insights will help you prevent or respond to potential problems down the road, enabling reliability and functionality for as long as possible. A minimal sketch of such a monitor follows.
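
As a minimal sketch of what such in-field monitoring might look like, the toy Python monitor below flags both hard thermal-limit violations and slow upward drift. The thresholds, window size, and crude slope estimate are illustrative assumptions, not any product's algorithm:

```python
from collections import deque

def thermal_monitor(readings, window=100, limit_c=95.0, drift_warn=0.05):
    """Toy in-field thermal monitor: flag hard limit violations and slow
    upward drift that may predict a future one. All thresholds and the
    endpoint slope estimate are illustrative assumptions."""
    recent = deque(maxlen=window)
    for i, temp_c in enumerate(readings):
        recent.append(temp_c)
        if temp_c >= limit_c:
            yield (i, "OVER_LIMIT", temp_c)
        elif len(recent) == window:
            slope = (recent[-1] - recent[0]) / window  # degrees C per sample
            if slope >= drift_warn:
                yield (i, "DRIFT_WARNING", slope)

# Example: steady 70C with a slow drift starting at sample 500 (synthetic data).
samples = [70.0 + max(0, i - 500) * 0.06 for i in range(1000)]
for event in thermal_monitor(samples):
    print(event)  # drift is flagged long before the hard limit is reached
    break
```

The point of predictive analytics is exactly this lead time: catching the drift gives you room to throttle, re-balance, or schedule maintenance before a hard failure.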

Get a Comprehensive Solution to Minimize Power Consumption in Multi-Die Systems

The comprehensive and scalable Synopsys Multi-Die System Solution for fast heterogeneous integration offers technologies to help minimize power consumption and heat dissipation. From the architecture phase through design, IP, and full system validation, verification, emulation, and signoff, the Synopsys End-to-End Energy Efficiency design automation software suite can help reduce risk and accelerate time to market for HPC and data center chips. In addition, Synopsys Silicon Lifecycle Management can help you mitigate power and thermal problems in the field and ensure better reliability.

Our deep experience and insights come from working closely with customers, seeing their challenges from different angles, and helping them address these challenges. Our long-term relationships with some of the industry leaders in the HPC and data center markets and partnerships across the ecosystem have resulted in our ability to invest early in solving issues and providing unified, integrated, and elegant solutions and signoff capabilities for power and thermal management.

To optimize the power and thermal profiles for your HPC and data center systems, design using a holistic approach. Put the power and thermal lens on every stage of development. There is no silver bullet. However, thinking about the power and thermal implications from the get-go and continuing to address them throughout the entire process can help you achieve the smallest power envelope possible while mitigating the challenges of power density and meeting your performance goals.
