Issue 1, 2013
Power-Area Tradeoffs for Parallel Signal Processing Architectures
Teams tackling datapath designs for high-speed signal processing must create architectures that meet the application’s performance needs without breaking the power budget. Using parallel architectures to achieve throughput with lower clock speeds can drastically lower power consumption. However, these tradeoffs are not easy to measure accurately, especially early in the design cycle. Chris Eddington, senior product and technical marketing manager, FPGA and Systems Group, Synopsys, explains how algorithm designers can use high-level design tools with Design Compiler and Power Compiler to quickly explore the power and area of parallel signal processing architectures.
High-speed digital signal processing is the mainstay of a broad range of applications. As well as being bread-and-butter work for communications and telecoms design teams, fast signal processing plays an important part in mil/aero and test and measurement designs. Whether a design team is tackling a cellular infrastructure project or a high-speed ADC design, creating multi-gigahertz signal processing designs challenges them to create datapath architectures that can support high-performance processing while consuming low power.
For many signal-processing blocks, using parallel structures is a major tool in balancing power and performance. Parallel architectures allow the use of lower clock speeds, which result in less power consumption while maintaining data throughput [1]. Of course, all engineering tradeoffs come at a cost, and the price to pay for more parallelism is an increase in chip area and more time spent exploring the architectural options available and analyzing potential power savings. Typically, decisions taken during design of the architecture have the biggest impact on the final power and performance of the chip.
Challenge: Accurate Power Exploration for Algorithm Developers
Accurately measuring the power impact of an architecture decision is difficult at any level, but it is especially problematic for algorithm developers who may not be familiar with the intricacies of RTL flows. Also, many have argued that there is simply not enough information at the transaction and algorithm level to accurately measure power [2].
Synopsys’ Design Compiler® and Power Compiler™ flow provides a solution for RTL and gate-level power optimization and analysis given the activity data of the design [3]. This solution is usually accurate enough for exploring the relative architecture tradeoff curves. For algorithm designers, however, the ASIC design flow may be out of reach, as it requires:
- Design RTL
- RTL testbench
- Generation of activity data (SAIF file)
- Logic synthesis constraints for power and SAIF (SDC file)
- Compiled memories (Optional)
Developing and verifying RTL for a specific architecture can take significant time and effort: days and perhaps weeks. In addition, because algorithm and FPGA/ASIC designers must collaborate to create a specific architecture, the bandwidth of expertise can become a bottleneck. Therefore, exploring the performance and relative power consumption of different architectures can be impractical (Figure 1).
Figure 1: Power tradeoffs for signal processing architectures are difficult to explore accurately from algorithm specification, usually because of the effort required to implement and test a specific architecture.
Tools and IP for Signal Processing Power Analysis
Most signal processing designs start in a high-level design environment like MATLAB®/Simulink®, and the main challenge for architecture exploration is the effort required to create the RTL and power data.
However, Synopsys’ Synphony Model Compiler (MC) provides a more automated way to drive this flow from the MATLAB/Simulink environment (Figure 2).
Figure 2: The Synphony MC design flow produces implementation and verification RTL from a high-level algorithm model. When targeting an ASIC, a choice of power analysis constraints can be added for automating the use of activity data and power optimizations in the logic synthesis flow.
From a MATLAB/Simulink model, algorithm and ASIC designers can use Synphony MC to generate a testbench; doing so automatically produces the scripts needed to drive VCS®, Synopsys’ functional verification solution, and produce a Switching Activity Interchange Format (SAIF) file containing the switching activity data. Using Synopsys’ Design Compiler and Power Compiler, the ASIC designer can optimize the circuit using the activity data from VCS and produce an accurate estimate of the power consumption based on the stimulus provided by the testbench. The user has control over parallelism, architecture, and choices in multirate clocking implementation.
ASIC Low-Power Implementation Flow
Synphony MC uses scripts to direct Synopsys’ low-power optimization tools, which enables ASIC design teams to produce effective results at the gate level without having expert knowledge of the low-power flow. The ASIC low-power RTL implementation flow estimates a design’s power dissipation and optimizes it for power during the synthesis process.
The Power Compiler tool estimates power dissipation from the power characterization specified in the target library and from switching activity, which is available from the VCS RTL simulation. Without switching activity information, the tool falls back on default switching activity data.
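As a rough illustration of how capacitance data and switching activity combine into a power number, the sketch below applies the first-order textbook model for switching power, 0.5 × C × Vdd² × toggle rate, summed over nets. It is not Power Compiler's actual algorithm. The net capacitances, the supply voltage, and the `DEFAULT_TOGGLE_RATE_HZ` fallback are hypothetical values chosen only for illustration.

```python
# First-order switching-power estimate; a sketch, not Power Compiler's algorithm.
# All numeric values below are hypothetical.

DEFAULT_TOGGLE_RATE_HZ = 1.0e8  # assumed fallback when a net has no recorded activity


def switching_power_watts(nets, vdd_volts, default_rate_hz=DEFAULT_TOGGLE_RATE_HZ):
    """Sum 0.5 * C * Vdd^2 * toggle_rate over all nets.

    nets: iterable of (load_capacitance_farads, toggles_per_second or None).
    A toggle rate of None means no activity was recorded for that net,
    so the default rate is used instead.
    """
    total = 0.0
    for cap_farads, toggle_rate_hz in nets:
        rate = default_rate_hz if toggle_rate_hz is None else toggle_rate_hz
        total += 0.5 * cap_farads * vdd_volts ** 2 * rate
    return total


# Hypothetical nets: (capacitance in farads, toggles per second from simulation)
example_nets = [
    (10e-15, 1.0e9),  # busy datapath net
    (5e-15, 2.0e8),   # quieter control net
    (8e-15, None),    # no activity recorded; falls back to the default
]
estimate = switching_power_watts(example_nets, vdd_volts=1.0)
print(f"Estimated switching power: {estimate * 1e6:.2f} uW")
```

The fallback branch shows why simulation-derived activity matters: any net without recorded toggles is charged at an assumed rate, so the estimate is only as good as that assumption.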
The algorithm design team can use Simulink simulations to estimate power for various functional modes of the design because the flow enables the generation of switching activity data based on the Simulink simulations.
The flow provides algorithm designers with a quick estimate of the power dissipation of their design at a very early stage in the project. The designers can then optimize the micro-architecture of their design to meet the power targets.
Design Compiler and Power Compiler call on a range of gate-level optimizations, such as clock-gating and operand isolation, and use specific DesignWare® minPower components. The DesignWare minPower components support power-optimized datapath architectures, which enable synthesis of logic that suppresses switching activity and glitches.
Design Example: A Parallel FFT
The fast Fourier transform (FFT) is common to many high-speed DSP algorithms. The parallel FFT (PFFT) is a way to implement the FFT with more parallel processing and slower clock speeds for a given throughput requirement. Table 1 illustrates the results of investigating four different parallel FFT architectures using Synphony MC. Lower power per frame is achieved using more parallelism for a given throughput at the expense of additional area. The relative dynamic power is estimated using the activity data generated by the Synphony MC testbench for a fixed number of frames.
Figure 3: The Synphony MC PFFT block is configurable to process up to 32 samples at a time. It generates a parallel micro-architecture optimized for user-selected targets.
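One way to picture how a transform can consume several samples per clock is the standard decimation-in-time decomposition: the frame is split into interleaved lanes, each lane is transformed independently, and the results are recombined with twiddle factors. The sketch below is a behavioral illustration only and is not necessarily the micro-architecture that the Synphony MC PFFT block generates. The function names and the `lanes` parameter are hypothetical, and a direct DFT stands in for each lane's FFT to keep the example short.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here as the per-lane transform and as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def parallel_dft(x, lanes):
    """Compute the DFT of x from `lanes` independent sub-DFTs (decimation in time)."""
    n = len(x)
    if lanes < 1 or n % lanes != 0:
        raise ValueError("frame length must be a positive multiple of the lane count")
    sub_len = n // lanes
    # Lane r receives samples r, r + lanes, r + 2*lanes, ...
    # so each lane only has to accept one sample per (slower) clock.
    sub_spectra = [dft(x[r::lanes]) for r in range(lanes)]
    # Recombine: X[k] = sum over r of W_N^(r*k) * X_r[k mod sub_len]
    return [
        sum(
            cmath.exp(-2j * cmath.pi * r * k / n) * sub_spectra[r][k % sub_len]
            for r in range(lanes)
        )
        for k in range(n)
    ]


# Check the lane-based result against the direct transform on a small frame.
frame = [complex(i % 5, (i * 3) % 7) for i in range(16)]
reference = dft(frame)
result = parallel_dft(frame, lanes=4)
max_error = max(abs(a - b) for a, b in zip(reference, result))
print(f"max deviation from direct DFT: {max_error:.2e}")
```

Each lane's transform runs on a sequence that is `lanes` times shorter and arrives `lanes` times more slowly, which is the property that lets a hardware implementation trade extra area for a lower clock rate.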
Table 1: Parallel FFT power results in TSMC 40-nm LP using Synphony Model Compiler’s ASIC power estimation flow. Lower power per frame is achieved using more parallelism for a given throughput at the expense of additional area. Power is estimated using activity data generated by the Synphony MC testbench for a fixed number of frames.
While throughput is maintained at 2 GS/s across all the design variants, adding parallelism to the architecture makes it possible to significantly reduce the clock rate with a corresponding saving in dynamic power. Although leakage power increases because of the increased gate count, the increase is negligible compared to the considerable savings in dynamic power. Exploring and analyzing the four architectures in Table 1 manually would take many days or weeks, but can take only hours using the Synphony MC-based flow.
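The relationship between throughput, parallelism, and clock rate is simple arithmetic: the required clock rate is the sample throughput divided by the number of samples processed per clock. The sketch below uses the 2 GS/s figure from the text. The lane counts are illustrative assumptions (bounded by the 32-sample limit of the PFFT block) and are not necessarily the four variants in Table 1.

```python
def required_clock_hz(throughput_samples_per_s, samples_per_clock):
    """Clock rate needed to sustain a throughput when several samples are processed per clock."""
    if samples_per_clock < 1:
        raise ValueError("samples_per_clock must be at least 1")
    return throughput_samples_per_s / samples_per_clock


THROUGHPUT = 2.0e9  # 2 GS/s, as stated in the text

# Hypothetical lane counts, chosen only for illustration.
for lanes in (4, 8, 16, 32):
    clock_mhz = required_clock_hz(THROUGHPUT, lanes) / 1e6
    print(f"{lanes:2d} samples/clock -> {clock_mhz:.1f} MHz")
```

The clock rate alone does not determine power, since the wider datapath adds gates and leakage; that net effect is what the activity-based estimates are meant to capture.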
Synphony MC includes a comprehensive datapath model library, which makes it quick and easy for designers to get started. Designers can start with a reference block and either use it out of the box or modify its parameters for a particular application. Users can build fully custom designs out of multiple blocks and mix standard HDL code as part of a design as well.
Power analysis and optimization are critical to today’s ASIC designs. By using parallel architectures and lower clock speeds, datapath designers can significantly reduce power in their designs. However, making tradeoffs between power and performance by manually changing the datapath architecture is time consuming and error prone, and it makes it difficult to accurately estimate the power consumption across a range of architectures.
While ASIC implementation and simulation tools provide powerful techniques for power estimation and optimization at the gate level, most algorithm designers are not familiar enough with ASIC implementation tools to easily measure the power impact of architecture choices on their designs. It is difficult for design teams to compare power, performance, and area for different micro-architectures, assess the tradeoffs, and choose the best implementation for their requirements.
Synphony MC provides an easy-to-use encapsulation of the power analysis and optimization functionality available in downstream implementation tools, and it enables algorithm designers to easily compare and evaluate different micro-architectures for their design. ASIC and algorithm designers can quickly explore power/area tradeoffs with high-level design flows that support activity data generation and accurate power analysis. Design teams tackling high-speed signal processing designs are using Synphony MC today for both ASIC and FPGA implementation.
For more information about implementing FFT algorithms on FPGA devices, read the article “Multi-Gigahertz FPGA Signal Processing.”
[1] DSP Architecture Design Essentials, Markovic, Brodersen
[2] ‘Early and accurate’ power analysis: myth or reality?
[3] Power Compiler DS
About the Author
Chris Eddington is Sr. Technical Marketing Manager for High-Level Synthesis at Synopsys and has over 20 years of experience in ASIC and FPGA design. He has held various roles in technical marketing, algorithm development, and IC design at semiconductor companies that develop video and audio conferencing ICs and wireless communications systems. He holds an MS degree in engineering from the University of Southern California and an undergraduate degree in Physics and Math from Principia College.