Innovative Ideas for Predictable Success
      Issue 1, 2010


Technology Update
Exploring Low Power Algorithm Architectures
Chris Eddington and Josefina Hobbs, Synopsys, used Synphony High Level Synthesis and the Eclypse Low Power Solution to rapidly explore a range of low power architectures, trading off timing, area, and power. In this article they describe the fundamentals, the relevant tools from system-level down to gate-level, and present results for an example wireless algorithm implemented in a 90nm ASIC.

The decisions that designers take during the early stages of an ASIC development project often have the greatest impact on the final outcome. Selecting a suitable architecture is a case in point: it greatly affects performance, area and power. While circuit-level optimizations can help to shave percentage points from power budgets, architectural decisions can reduce the power by orders of magnitude.

One problem that designers face is how to tell whether they have the best architecture for a particular algorithm. To make matters worse, these days, algorithms are so complex that manual effort to code and verify different architectures exceeds the time and resources available in the development cycle. Furthermore, when design teams are looking to squeeze every last microwatt from a design, manual estimates are often not good enough to gauge subtle changes at the architectural stages of a design. Exploring different architectures manually is very time consuming, and just not accurate enough when designing algorithms in today’s ASIC technologies.

Automating Architecture Exploration
Designers can use the Synopsys Synphony High Level Synthesis (HLS) environment to rapidly and accurately explore a number of different architectures much earlier in the development cycle. Because Synphony provides HLS flows from The MathWorks’ MATLAB and Simulink, designers can explore architectures from a familiar environment, which can shave months off the design cycle and lead to more efficient verification and system validation during the implementation phase.

Synphony HLS allows designers to rapidly generate all the necessary RTL and verification files to perform accurate speed, area, and power analysis with the Eclypse™ Low Power Solution. Once a high level model is available, architects can specify various architecture constraints and target technologies, and the Synphony HLS engine will automatically apply architecture transformations and optimizations and generate the optimized RTL, logic synthesis scripts, testbenches, and verification scripts. This data can be easily used with Design Compiler®, VCS®, PrimeTime® and other Eclypse solution components to investigate the architecture performance. This enables the design team to identify tradeoffs and find the best opportunities to make significant power, performance, and area savings.

Figure 1. Synphony High Level Synthesis Design Flow

Starting From Higher Abstraction Levels
The main benefit of Synphony HLS is that it lets design teams explore architectures very early in the design cycle. With a MATLAB and Simulink high-level synthesis flow, teams can start exploring different architectures while writing significantly less fixed-point RTL, C code, and verification scripting.

Architectural Transformations
Synphony HLS automates several architectural transformations:

Retiming and Pipelining
Synphony HLS performs re-timing based on the target technology. It can automatically calculate any additional stages of pipelining required and synthesize those in the implementation. It will use either a fast estimation mode that uses a built-in approximation of the ASIC technology, or it can use Design Compiler in conjunction with target technology models to produce more accurate timing data.
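
As a rough illustration of the pipelining calculation, the sketch below estimates how many stages a combinational path needs so that each stage fits in one clock period. The function name, the 30 ns path delay, and the 0.5 ns register overhead are hypothetical, and the sketch assumes the logic can be split evenly across stages; it is not a description of the algorithm Synphony HLS actually uses.

```python
import math


def pipeline_stages(path_delay_ns: float, clock_period_ns: float,
                    register_overhead_ns: float = 0.0) -> int:
    """Estimate the number of pipeline stages (at least 1) for a path.

    Assumes the combinational delay can be divided evenly between stages,
    which real logic rarely allows exactly.
    """
    usable_ns = clock_period_ns - register_overhead_ns
    if usable_ns <= 0:
        raise ValueError("clock period must exceed the register overhead")
    return max(1, math.ceil(path_delay_ns / usable_ns))


# Hypothetical numbers: a 30 ns path clocked at 70 MHz (period of about 14.29 ns).
period_ns = 1e3 / 70.0
stages = pipeline_stages(30.0, period_ns, register_overhead_ns=0.5)
extra_register_ranks = stages - 1  # pipeline registers that would be added
print(stages, extra_register_ranks)
```

A technology-accurate flow would replace the hypothetical delay figures with timing data from the target library.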

Resource Sharing and Scheduling
Designers can use Synphony HLS to control how their designs use hardware resources. This lets them reduce area at the cost of a higher clock speed (and higher power) or lower throughput. Synphony does this by scheduling multiple operations onto a reduced set of hardware resources, a technique also referred to as resource sharing.

For example, if a design contains hundreds of multiply operations, Synphony HLS can automatically introduce a higher system clock and map these operations onto a single system multiplier. It will also synthesize the control logic and multiplexing to enable sharing of the multiplier resource over multiple clock cycles.
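
The arithmetic behind this kind of sharing can be sketched as follows. The function name and the example figures are illustrative assumptions, and the model ignores the control logic, multiplexing, and memory that sharing adds.

```python
import math


def multipliers_needed(multiplies_per_sample: int, sample_rate_hz: float,
                       clock_hz: float) -> int:
    """Physical multipliers needed when each one is time-shared.

    A multiplier clocked at clock_hz can perform floor(clock / sample_rate)
    multiplies in one sample period, so the operations are divided among
    that many time slots.
    """
    slots = math.floor(clock_hz / sample_rate_hz)
    if slots < 1:
        raise ValueError("clock must be at least as fast as the sample rate")
    return math.ceil(multiplies_per_sample / slots)


# Hypothetical: 100 multiplies per sample at a 1 MHz sample rate.
print(multipliers_needed(100, 1e6, 1e6))    # fully parallel
print(multipliers_needed(100, 1e6, 10e6))   # 10 time slots per multiplier
print(multipliers_needed(100, 1e6, 100e6))  # one shared multiplier
```

The same relationship shows why sharing pays off most in slow sample rate domains: the larger the ratio of system clock to sample rate, the more operations a single multiplier can absorb.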

IP-Level Optimizations
Synphony HLS provides a high level IP model library for math, signal processing, communications and video applications. When users build their designs from this library, the Synphony HLS engine will perform IP-level optimizations across instances of these blocks. Depending on the constraints, such as sample rates and target technology, Synphony HLS will automatically choose different architectures or micro-architectures for IP-level implementation.

Target Technology Optimizations
High level synthesis requires knowledge of the design performance in the target technology in order to make optimization choices. Synphony HLS has an Advanced Timing Mode that can directly utilize Design Compiler to more accurately characterize the performance of various optimization parameters. This leads to better architecture and target technology optimizations.

This advanced characterization capability includes the effects of DesignWare® arithmetic optimizations such as adders and multipliers and can also include the impact of using DesignWare minPower components. DesignWare minPower components focus specifically on minimizing the power contributions in a design such as glitch power, dynamic power and high-performance data power. DesignWare minPower components enable DC Ultra® to automatically infer power-optimized datapath architectures, thus reducing both dynamic and leakage power.

Memory Management
The Synphony HLS implementations can have memories from various sources, such as user instantiated or tool-inferred from optimizations. For FPGA targets, these will be written to infer the vendor memory hardware. For ASIC targets, the tool gives designers a choice to extract memories and separate them out into a module, and then replace them with compiler-generated memories from a specialist memory vendor. This is important for low power architecture implementation because specialist vendors can provide memories optimized for power and area.

Multi-Rate Optimizations and Clock Circuit Generation
Synphony HLS provides a high level of abstraction for multi-rate algorithm design and can automatically optimize and implement these types of algorithms. This often results in architectures that have multiple clock domains.

Multi-rate clocking has the potential to make a big impact on power, so it is an important part of exploring different architectures. Synphony HLS is aware of multi-rate architectures, and it can optimize across different rates and implement different clocking strategies. Synphony HLS will take care of synthesizing the clock circuit for generating all the various clocks in a design.

If a design has multiple synchronous sample rates, the design team has a choice of clocking strategies. One option is to drive each sample rate domain from a dedicated clock, e.g., from a PLL or a clock divider circuit. The other option is to clock every sample rate domain from a single fast clock and use “clock enables” to control when the registers in the slower domains update.

Synphony HLS supports both of these clocking strategies. It can synthesize the clock and reset circuits, separate them out from the core design and make them available at the top-level module.
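
The difference between the two strategies can be seen in a small behavioral model. This is a simplified sketch: the function names and the divide ratio of 4 are hypothetical, and the edge count ignores any clock gating that logic synthesis may insert later.

```python
from typing import List


def clock_enable_pattern(divide_ratio: int, fast_cycles: int) -> List[int]:
    """Return 1 on the fast-clock cycles where a slow-domain register updates."""
    if divide_ratio < 1:
        raise ValueError("divide_ratio must be >= 1")
    return [1 if cycle % divide_ratio == 0 else 0 for cycle in range(fast_cycles)]


def clock_pin_edges(strategy: str, divide_ratio: int, fast_cycles: int) -> int:
    """Count active clock edges seen at a slow-domain register's clock pin."""
    if strategy == "dedicated":
        return fast_cycles // divide_ratio  # the pin sees only the divided clock
    if strategy == "enabled":
        return fast_cycles                  # the pin sees every fast-clock edge
    raise ValueError("strategy must be 'dedicated' or 'enabled'")


pattern = clock_enable_pattern(4, 12)
print(pattern)       # the enable is high on one fast cycle in four
print(sum(pattern))  # number of register updates in 12 fast cycles
print(clock_pin_edges("dedicated", 4, 12), clock_pin_edges("enabled", 4, 12))
```

Both strategies update the register equally often; what differs is how often its clock pin toggles, which is one reason the two can differ in power.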

Design Case Study: Digital Downconverter
Designers use downconverter algorithms in applications such as cell phones and base stations. We used Synphony HLS to explore a number of different architectures for a digital downconverter typical of cellular applications. In brief, the algorithm requires three-stage decimation filtering, a complex mixer, and a digital frequency synthesizer, for a total of 124 multiplies. It runs at multiple sample rates: 70 MHz, 1.094 MHz, 0.547 MHz and 0.273 MHz.
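
To relate the four sample rates to each other, the short calculation below derives the ratio between successive rates. The published rates are rounded, so the integer ratios it produces are an inference and not figures stated by the authors; the function name is our own.

```python
from typing import List, Sequence


def rate_ratios(rates_mhz: Sequence[float]) -> List[int]:
    """Nearest-integer ratio between each sample rate and the next slower one."""
    return [round(fast / slow) for fast, slow in zip(rates_mhz, rates_mhz[1:])]


rates_mhz = [70.0, 1.094, 0.547, 0.273]  # sample rates quoted in the article
stage_ratios = rate_ratios(rates_mhz)

# Ratio of the fastest clock to each slower rate, e.g. for sizing clock enables.
enable_ratios = [round(rates_mhz[0] / slow) for slow in rates_mhz[1:]]

print(stage_ratios)
print(enable_ratios)
```

The rounded ratios are consistent with a three-stage decimation chain, although the article does not state the per-stage decimation factors.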

Figure 2. Digital Downconverter Block Diagram

We synthesized three architectures using automatic global pipelining and folding, and then applied different clock strategies to each of these, which gave us six architectures in total. The target technology was a 90nm TSMC library.

Baseline Architecture (base_dclk)
We generated all of these architectures with retiming enabled. If the circuit needs to be re-timed to meet the sample rate clocks for the target technology (a 90 nanometer ASIC library in this case), Synphony will automatically add any pipeline registers required.

Fold x1 with Dedicated Clocks (fold1_dclk)
For this design variant, we enabled resource sharing and scheduling in our clock domains using the fastest sample clock. In this case, our design had a 70 MHz clock with several full-clock domains. The optimizations included resource sharing, scheduling and area optimization in the lower clock domains.

We also enabled memory extraction for this design. Synphony will infer memory for different variables and storage elements in the high-level model as it shares resources. It will merge the variables into a memory, which is a more efficient way to implement storage than, say, using an array of registers. This approach allowed us to use a compiled memory for implementation, which provides further power and area improvements.

Fold x2 with Dedicated Clocks (fold2_dclk)
For this design we used a faster system clock (140 MHz) to enable resource sharing and scheduling across the whole design. In this case the two mixer multiplies were resource shared in the architecture.

Enabled Clocks (base_eclk, fold1_eclk, fold2_eclk)
In addition to synthesizing dedicated clock versions of these three architectures, we also synthesized enabled clock versions (_eclk) to investigate how the power varied for each clocking strategy. Base_eclk, fold1_eclk and fold2_eclk are the corresponding architectures with the enabled clock circuit.
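
The six variants are simply the cross product of three architectures and two clocking strategies. A minimal sketch that enumerates them, using the names from this article (the script itself is illustrative and not part of the Synphony flow):

```python
from itertools import product

architectures = ["base", "fold1", "fold2"]  # baseline, fold x1, fold x2
clocking = ["dclk", "eclk"]                 # dedicated clocks, enabled clocks

variants = [f"{arch}_{clk}" for arch, clk in product(architectures, clocking)]
print(variants)
print(len(variants))
```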

Unified Verification Flow
We generated all these architectures from a single high level model. Designers can also use Synphony HLS to create simulation testbenches at the MATLAB and Simulink level, and to automatically generate the RTL verification files for the optimized RTL architecture. This allows faster and more consistent verification across the architecture choices and more accurate power analysis using activity data (described next).

Switching Activity for Accurate Power Analysis
Traditional power estimation techniques employ vectorless analysis. However, power estimates can vary by a factor of two or more depending on switching activity, so it is essential to have a real-world representation of switching activity to enable accurate power calculations. The RTL testbench generation of Synphony HLS makes it easy to create realistic RTL simulations that will generate activity data, which can be used for more accurate logic synthesis and power analysis. In this example, we created waveforms in the high level model of multiple input channels and configured the DDC to tune into one of them. Synphony HLS automatically created VCS scripts for all these architectures to run these simulations. By running them we generated representative switching activity from VCS, using the Switching Activity Interchange Format (SAIF) for the data.
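
To see why activity data matters, consider the standard first-order model of dynamic power, in which each signal transition dissipates about half of C·V² of energy. The sketch below computes a toggle rate from a bit trace and applies that model. The traces, the 1 pF capacitance, and the 1.0 V supply are hypothetical values chosen for illustration; they are not data from the downconverter design, and a real flow would take activity from the SAIF file.

```python
from typing import Sequence


def toggle_rate(trace: Sequence[int]) -> float:
    """Fraction of cycle boundaries at which a 1-bit signal changes value."""
    if len(trace) < 2:
        return 0.0
    toggles = sum(a != b for a, b in zip(trace, trace[1:]))
    return toggles / (len(trace) - 1)


def dynamic_power_w(toggles_per_cycle: float, capacitance_f: float,
                    vdd_v: float, clock_hz: float) -> float:
    """First-order switching power: 0.5 * toggle rate * C * V^2 * f."""
    return 0.5 * toggles_per_cycle * capacitance_f * vdd_v ** 2 * clock_hz


busy = [0, 1, 0, 1, 0, 1, 0, 1]   # hypothetical node toggling every cycle
quiet = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical node toggling once

for name, trace in (("busy", busy), ("quiet", quiet)):
    rate = toggle_rate(trace)
    power = dynamic_power_w(rate, 1e-12, 1.0, 70e6)
    print(name, rate, power)
```

In this toy case the two nodes differ in switching power by a factor of seven at the same clock rate, which is why estimates based on assumed activity can be far from the measured figure.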

Logic Synthesis
Synphony HLS creates a set of inputs to enable DC Ultra to perform logic synthesis. These include RTL, design constraints, compilation scripts, and the SAIF switching activity data (via VCS). Synphony can also extract the memories so that designers can choose between synthesizing register arrays or using more efficient compiled memories.

In DC Ultra, we turned on register retiming, clock gating, data path optimization and dynamic power optimization. All these techniques are easily enabled and disabled with compile variables and command options.

Although we had already performed some high-level re-timing, we turned on the re-timing during logic synthesis to get even further refinement of register placement. Even on our small design, we saw an incremental improvement in the area results when we performed low-level register re-timing. Clock gating during logic synthesis is automatic – it takes any synchronous enabled clocked logic and automatically infers clock gating logic. Clock gating is a well-established, easy to implement technique that can significantly reduce dynamic power.

The runtime for each design variant was about five minutes, so for logic synthesis we could get through all six of our architectures and analyze all of the results in about half an hour – that’s a very fast turnaround.

The results are as expected. The baseline architecture with the dedicated clock occupies the most area because there’s not much resource sharing or area optimization. There are some logic-level and IP-level optimizations that occur in the filters and the digital frequency synthesizer. In this case we get the lowest power results of all the architectures, but the highly parallel architecture takes more area.

Figure 3. Initial Power and Area Results for Six Architectures

Using a dedicated clock means that a PLL or a dedicated clock source separately drives the 70 MHz clock, the 1 MHz clock and the other lower-rate clocks. In general, this gives the lowest power implementation, because the slower domains run at lower clock speeds and their signals switch less frequently. In the enabled clock variant of this architecture, a fast system clock drives the whole design and clock enables select the slower rates; this variant consumes a little more power.

Applying folding reduces the area but requires faster clocks. As a result, the power increases significantly: typically a 30-40% jump in the fold-by-1 case. The fold-by-1 cases (fold1_eclk and fold1_dclk) are special because the fastest clock rate didn’t increase; it is simply used in the slower sample rate domains to implement resource sharing and scheduling. This is a win-win situation in the sense that we get an area reduction with a smaller clock power increase, because we didn’t increase the maximum clock rate.

However, as we start increasing the system clock to enable more resource sharing and scheduling, we see the power rise quite significantly. On the last two designs, fold-by-2, we’re using a 140 MHz clock and only sharing the two multipliers to demonstrate the diminishing returns of resource sharing when we don’t have sufficient resources to share. Synphony allows us to generate that architecture so we can see how its results compare. We can easily conclude it’s not a good choice: the area reduction is minimal because we don’t have enough resources to benefit from sharing in the faster clock domain.

Synphony allows design teams to explore architectures for multi-rate algorithms very quickly. It performs high-level synthesis and integrates well with the Eclypse Low Power Solution. Within days of having the high-level algorithm model, a design team can start making important decisions about architectures to be used for implementation, generate a verification framework, and estimate the associated costs to the overall system and product.

We could also have looked at results for 65 nm or 45 nm technology very easily by simply targeting the high-level synthesis to each of those technologies. Designers can optimize across technologies very quickly once they have a high-level Synphony model of their algorithm.

Architectural exploration using Synphony HLS provides the additional benefit of enabling a faster handoff to layout. The final architecture provides the right scripts, test benches and constraints to hand off to the implementation team.

Chris Eddington
Product Marketing Director for High Level Synthesis and System Level Products

Chris Eddington is the Product Marketing Director for High Level Synthesis products at Synopsys. Chris has 20 years of experience in ASIC and FPGA design for communications and multimedia products. His previous role was Technical Marketing Director for high speed networking ICs at Mellanox Technologies, and prior to that he held various roles as Lead IC Designer for VOIP processors, video conferencing ICs, and wireless communications systems. He holds an MS degree in engineering from the University of Southern California and an undergraduate degree in Physics and Math from Principia College.

Josefina Hobbs
Technical Solutions Architect, Low Power Solutions Marketing

Josefina Hobbs brings nearly 18 years of high-tech design and applications expertise to her role as technical solutions architect, low power solutions marketing, at Synopsys. Hobbs is currently responsible for the deployment of the Eclypse Low Power Solution.

She joined Synopsys in 1998 as an applications consultant, supporting a broad spectrum of design and verification products and specializing in low power. While at Synopsys, she has been a key contributor to the successful launch and adoption of Synopsys’ advanced multi-voltage capabilities since their inception.

Hobbs came to Synopsys from IBM, where she first served as a board designer for mobile products. She later moved into the PowerPC organization, where she did system design and graphics chip design. She holds two patents on remote power-wakeup technology for PowerPC.

Hobbs holds a bachelor’s degree in electrical engineering from Duke University and a master’s degree in computer engineering from the University of Texas at Austin.

©2010 Synopsys, Inc. Synopsys and the Synopsys logo are registered trademarks of Synopsys, Inc. All other company and product names mentioned herein may be trademarks or registered trademarks of their respective owners and should be treated as such.
