Estimating Power Early & Accurately for Smart Vision SoCs

Derya Eker, ARC Processors Engineering Manager, Synopsys
Diego Gonzalez Montes, ARC Processors R&D Engineer, Synopsys

Introduction

Today’s high-end systems-on-chip (SoCs) must handle increasingly compute-intensive workloads while carefully balancing power-to-performance tradeoffs. Demand for wide deployment of artificial intelligence (AI) and deep learning is surging. Face recognition is now standard in mobile phones and is extending to smart wearables. Identifying objects and surroundings in augmented- and virtual-reality headsets pushes the envelope further. Self-driving cars apply deep learning to interpret, predict, and respond to data from their surroundings for safer, smarter autonomous driving.

To optimize for both power and performance, hardware and software are becoming more tightly intertwined. Designers must make key architectural choices, such as hardware/software workload partitioning and IP vendor selection, in the early phases of product development. Today’s SoCs represent a multi-million-dollar investment, so accurate estimation of SoC power is critical to whether a chip succeeds or fails.

Variables in Power Estimation

By the nature of deep learning applications, most of the processing elements in a chip stay busy for long periods to sustain compute throughput. The power dissipation must fit within the power budget of the target device, whether a smartphone or an autonomous car. Battery lifetime, thermal issues affecting reliability, packaging, and cooling all impose additional constraints. Designing to hit the power budget is critical, so accurate and predictable power estimation early on is increasingly important.

Before discussing the challenges of estimating power both accurately and as early as possible, let’s briefly look at how designers calculate power. Power dissipation (Ptotal) of a device can be split into two components: dynamic power consumption (Pdynamic), caused by switching activity, and static power consumption (Pstatic). The main parameters that impact these two components are summarized in Figure 1.
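As a rough illustration of these two components, the standard first-order CMOS power model can be sketched as below. The numbers are invented for illustration only and are not tied to any particular process node or Synopsys product.

```python
def dynamic_power(alpha, c_eff, vdd, freq):
    """First-order CMOS dynamic power: P_dyn = alpha * C_eff * Vdd^2 * f.

    alpha: average switching activity factor (0..1)
    c_eff: effective switched capacitance in farads
    vdd:   supply voltage in volts
    freq:  clock frequency in hertz
    """
    return alpha * c_eff * vdd ** 2 * freq


def static_power(i_leak, vdd):
    """Static (leakage) power: P_stat = I_leak * Vdd."""
    return i_leak * vdd


# Hypothetical operating point for illustration only.
p_dyn = dynamic_power(alpha=0.15, c_eff=2e-9, vdd=0.8, freq=1e9)  # 0.192 W
p_stat = static_power(i_leak=0.05, vdd=0.8)                       # 0.040 W
p_total = p_dyn + p_stat                                          # 0.232 W
```

The quadratic dependence on Vdd and the linear dependence on frequency and switching activity are why clock gating, voltage scaling, and activity reduction are such effective levers on dynamic power, while leakage current drives the static term.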

Figure 1: Variables in power consumption to consider in vision SoC designs

Figure 2 shows the impact of process technology scaling on both dynamic power and static power. As designs move to smaller technology nodes, dynamic power goes down; however, static power becomes more dominant due to increased leakage current (left). In addition, the threshold voltage (Vth) of cells within the same technology node (right) affects frequency/delay and leakage power.

To achieve higher frequency and performance, designers might use low-Vth cells. This increases leakage current, which is part of static power. There is a constant need during the design process to make trade-offs that balance power and performance.

Figure 2: Impact of process technology scaling on dynamic and static power

Potential Power Reduction Techniques

Designers can apply a wide range of power reduction techniques:

  • Clock gating reduces switching activity when no processing is happening.
  • Power gating shuts down large portions of the design that are inactive to also reduce leakage.
  • During physical design implementation, the usage of cells with different threshold voltages is balanced to achieve optimum timing and power.
  • Floorplan optimizations reduce congestion and improve timing, which can lead to lower routing capacitance.
  • At the architectural level, designers can optimize memory management and reduce data movement in the system. Although it does not reduce the IP power by itself, reducing DRAM bandwidth has a big impact on SoC power.

As performance and power are closely interlinked in the AI and vision domain, individual metrics for performance or power do not give the full picture. Many factors affect the accuracy and correctness of power estimation, so the conditions under which power is estimated must be explicitly clarified. Let’s start with the commonly used metrics:

  • Power (Watt). Dynamic power and performance are closely related. Designers can achieve almost any arbitrarily low power value just by decreasing the amount of compute done per unit of time. For high-compute applications, low power does not necessarily mean a more energy-efficient system. The metric of “power” alone is meaningless unless it is linked to the system-level application performance, like cycle count or frames per second of a known workload.
  • Energy efficiency (Ops/Watt). This metric does combine performance and power, but the definition of the operation (Ops) is often ambiguous for application-specific IP. Derived metrics like MAC/s/Watt reduce convolutional neural network (CNN) graphs to just the multiplications, when in practice one of the biggest architectural challenges in AI is the high amount of data that needs to be constantly moved around the system to feed the multipliers optimally. This is not clearly accounted for in an Ops/Watt metric. Other operations, such as activation functions and pooling, which may take a relevant percentage of the processing cycles, are also excluded. The Ops/Watt metric also assumes ideal (or at least constant) hardware utilization, which is unlikely to happen for all layers and feature map dimensions in a real-world application.
  • Energy (Joules/frame). This metric goes beyond just MAC operations and incorporates both the type and characteristics of all layers in the graph being benchmarked. It also reflects how efficiently the graph has been mapped into the hardware for both compute and data movement.
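The difference between these metrics can be made concrete with a small sketch. Dividing average power by throughput converts raw wattage into Joules per frame; the two hypothetical accelerators below (numbers invented for illustration) show that the lower-power design can still be the less energy-efficient one.

```python
# name: (average power in watts, throughput in frames per second)
designs = {
    "low-power, low-throughput": (0.5, 10),
    "higher-power, high-throughput": (2.0, 60),
}

# Joules/frame = average power / frames per second.
energy_per_frame = {
    name: power_w / fps for name, (power_w, fps) in designs.items()
}

for name, joules in energy_per_frame.items():
    print(f"{name}: {joules * 1000:.1f} mJ/frame")
# The 0.5 W design burns 50 mJ/frame; the 2 W design only ~33 mJ/frame.
```

This is why a bare "Watts" figure, without a known workload and its achieved frame rate, says little about how efficient a design really is.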

Energy in terms of Joules per frame for representative graphs is the most accurate metric to evaluate CNN applications’ power consumption. However, computing the average power per frame is challenging. In many cases, to maximize throughput, systems process multiple images simultaneously, either in batch or pipeline mode. Since power and performance are closely related, power measurements should be done with the right batch size and/or when the pipeline is in a steady state. Processing only a single frame can take hundreds of millions of cycles, and reaching the correct steady state for measurement requires many more, sometimes on the order of billions. State-of-the-art simulation tools cannot handle this kind of workload in a reasonable time, not even for the smallest graphs.
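The steady-state requirement can be sketched with hypothetical numbers: energy spent while the pipeline fills completes no frames, so a naive whole-run average skews the Joules-per-frame figure compared with measuring only over a steady-state window.

```python
# Hypothetical measurement trace (illustrative numbers only).
fill_energy = 0.5      # J spent filling the pipeline; no frames completed yet
steady_energy = 8.0    # J measured over the steady-state window
steady_frames = 100    # frames completed in that window

# Naive average: fold the pipeline-fill overhead into the per-frame figure.
naive = (fill_energy + steady_energy) / steady_frames   # 0.085 J/frame

# Correct average: measure only over the steady-state window.
steady = steady_energy / steady_frames                  # 0.080 J/frame
```

The longer the measured steady-state window, the smaller the fill-phase bias, which is exactly why billions of cycles may be needed before a representative Joules-per-frame number can be read off.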

Instead, designers often measure the energy efficiency of a single convolution layer of a graph such as the multi-layered SegNet graph (Figure 3). However, a common pitfall is to extrapolate that result to a full graph. Taking such shortcuts can be misleading for several reasons:

  • Different convolution layers with different dimensions will have different hardware utilization, which affects overall power.
  • A convolution layer can be used in different positions in a graph with each position imposing a different input data distribution.
  • Different graph architectures affect the input data distribution and the type of processing to be done, e.g., feature maps and weights with values around zero may require less power per multiplication.

Hence, depending on position or graph architecture, the same layer may require a different amount of energy. In addition, other layers, such as activation functions, element-wise operations, and deconvolution also need to be accounted for.
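A small sketch makes the extrapolation error concrete. The per-layer figures below are invented for illustration: utilization and energy-per-MAC vary with layer dimensions and data distribution, so scaling one layer’s energy-per-MAC by the graph’s total MAC count misestimates the true energy.

```python
# (name, MAC count, measured energy in joules) -- hypothetical values.
layers = [
    ("conv1",    1.0e9, 0.020),  # well-utilized early layer
    ("conv2",    2.0e9, 0.055),  # poorer utilization -> more J per MAC
    ("pool+act", 0.1e9, 0.006),  # non-MAC layers still consume energy
    ("deconv",   0.8e9, 0.030),
]

# Ground truth: sum of per-layer measurements.
true_energy = sum(energy for _, _, energy in layers)

# Naive extrapolation from conv1's energy-per-MAC alone.
total_macs = sum(macs for _, macs, _ in layers)
extrapolated = (layers[0][2] / layers[0][1]) * total_macs

print(f"measured: {true_energy:.3f} J, extrapolated: {extrapolated:.3f} J")
```

Under these assumed numbers the extrapolation underestimates full-graph energy by roughly 30%, because the best-utilized layer is not representative of the rest of the graph.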

Figure 3: SegNet architecture implements multiple layers. Depending on position or graph architecture, the same layer may require a different amount of energy, so no single layer can be extrapolated to represent the entire graph

Orthogonal to the stimuli used, power estimation accuracy is greatly affected by the applied power estimation methodology combined with the abstraction level of the design that is measured:

  • RTL: functionally correct and relatively quick to execute in an HDL simulator, but contains no power information. For power estimation, switching activity must be annotated onto a netlist, and it will never cover 100% of the pins.
  • Synthesis-level netlist: all logic is mapped to cells, but the clock tree implementation is incomplete, so capacitance cannot be accurately modeled. This limits the accuracy of power estimation.
  • Full-layout gate-level netlist: all physical implementation details are complete, and the wire models of all layers can be used to extract actual load capacitance. This further increases the accuracy of power estimation.

The more implementation detail that is captured, the more accurate the power estimate becomes. RTL simulation of a small synthetic benchmark may complete in minutes, but for a netlist it can take hours or days. Simulating a very deep CNN graph with all implementation details included may require weeks. This simulation-time challenge increases the risk that IP vendors skip such detailed power analysis and accurate power estimation. The result is that actual power consumption may exceed the power budget: a clear product risk that manifests later, during the SoC power sign-off phase.

Accurate & Earlier Power Estimation for Smart Vision SoCs

To execute billions of cycles of a CNN graph on a full-layout netlist for maximum accuracy in power measurements, simulation tools are simply not enough. Synopsys’ ZeBu emulation platform provides a solution that helps both IP developers and SoC designers compute power accurately for hundreds of millions of cycles in a matter of minutes or hours instead of weeks or months. The ZeBu server also supports advanced use modes, including power management verification, comprehensive debug, integration with Synopsys’ verification ecosystem, hybrid emulation with virtual prototypes, and architectural exploration and optimization. Access to a ZeBu emulator therefore enables easy exploration of power/performance tradeoffs with application software on various candidate hardware architectures, as well as efficient, sign-off-quality power estimates, helping to tune the power consumption of all elements in a system during the different stages of the design cycle. Designers using Synopsys’ DesignWare® ARC® EV7x Vision Processors are adopting the ZeBu software-based power estimation and sign-off flow to get the most accurate and realistic power estimates when using the EV7x processor to handle high-performance deep learning applications.

Conclusion

Estimating the power consumption of IP blocks for AI applications in an SoC is a challenge. Designers need to carefully consider all aspects of the power estimation process to ensure that the decisions they make early on keep them within their power budgets when silicon comes back. Implementing a design on an emulation system like the ZeBu Server is a more accurate means of estimating and tuning power consumption than deriving estimates from a single convolution layer.