Optimizing Performance and Power in CPUs, GPUs, and DSPs at 28 nm

Synopsys Logic Library IP for Processor Core Implementation Team

Jul 22, 2013 / 8 min read

Table of Contents

Why Different PPA Goals for CPU, GPU and DSP Cores?
Logic Libraries for High-Performance Core Optimization
Sequential Cells
Combinational Cells
Additional Options
Clock Cells
Minimizing Power in CPU, GPU and DSP Cores
Power Optimization Toolbox
Multi-bit Flip-Flops
Additional Flops
Taking Advantage of Multiple VTs and Channel Lengths for Power Optimization
Conclusion

The 28-nm process technology provides opportunities to optimize CPU, GPU and DSP processor core implementations to achieve better performance, power and area (PPA) results. This article provides guidelines for establishing 28-nm core design targets, selecting a design kit of standard cells that handles the diverse requirements of CPUs, GPUs and DSPs, and using implementation best practices to achieve PPA targets most efficiently with these cells.

Why Different PPA Goals for CPU, GPU and DSP Cores?

Within a single SoC design, CPU, GPU and DSP cores may co-exist and are often optimized to different points along the performance, power and area axes. For example, CPUs are typically tuned first for high performance at the lowest possible power, while GPUs, because of the amount of silicon area they occupy, are usually optimized for small area and low power. GPUs can take advantage of parallel algorithms that reduce the required operating frequency, but that parallelism increases silicon area, and a GPU can account for up to 40% of the logic on an SoC. Depending on the application, a DSP core may be optimized for performance, as in base station applications with many signals, or for area and power, as in handset applications.

3D Graph Showing CPU Optimization in Performance, Power, Area

Figure 1: Primary PPA Targets for CPUs, GPUs and DSPs

Logic Libraries for High-Performance Core Optimization

Used together, synthesizable cores, today’s high-performance standard cell libraries and EDA tools can achieve an optimal solution without having to design a new library for every processor implementation. To optimally harden your high-performance core, you need the following components in your standard cell library:

High-performance sequential cells
High-performance combinational cells
High-performance clock cells
Power minimization libraries and cells for non-critical paths

Sequential Cells

The setup time plus the delay (clock-to-output) time of a flip-flop is sometimes referred to as the “dead” or “black hole” time. Like clock uncertainty, this time eats into every clock cycle and reduces the time available for useful computational work. Multiple sets of high-performance flip-flops are required to manage this dead time optimally. Delay-optimized flops rapidly launch signals into critical path logic clusters, and setup-optimized flops serve as capture registers that extend the usable portion of the clock cycle. Synthesis and routing optimization tools can be constrained to use these multiple flip-flop sets for maximum speed, resulting in a 15-20% performance improvement.
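The dead-time arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a timing model: the `FlopTiming` type, the `useful_time_ps` helper and every number below are assumptions made up for the example and are not taken from any real library. It is not meant to reproduce the 15-20% figure quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlopTiming:
    """Hypothetical per-flop timing numbers, in picoseconds."""
    clk_to_q_ps: float  # clock edge to valid output (matters for the launch flop)
    setup_ps: float     # data-stable time before the clock edge (capture flop)


def useful_time_ps(period_ps: float, launch: FlopTiming, capture: FlopTiming,
                   uncertainty_ps: float) -> float:
    """Time left in one cycle for combinational logic between two flops."""
    dead_time_ps = launch.clk_to_q_ps + capture.setup_ps
    return period_ps - dead_time_ps - uncertainty_ps


# Illustrative values only (assumptions, not library data).
general = FlopTiming(clk_to_q_ps=90.0, setup_ps=50.0)
delay_opt = FlopTiming(clk_to_q_ps=60.0, setup_ps=55.0)  # fast launch
setup_opt = FlopTiming(clk_to_q_ps=95.0, setup_ps=25.0)  # short setup

period = 1000.0  # 1 GHz clock
baseline = useful_time_ps(period, general, general, uncertainty_ps=60.0)
tuned = useful_time_ps(period, delay_opt, setup_opt, uncertainty_ps=60.0)
gained_ps = tuned - baseline
print(f"baseline={baseline} ps, tuned={tuned} ps, gained={gained_ps} ps")
```

With these made-up numbers, launching from the delay-optimized flop and capturing in the setup-optimized flop leaves more of the cycle for logic than using a general-purpose flop at both ends.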

CPU Optimization Diagram Showing Critical Logic Flow

Figure 2: Sequential cells are used to resolve high-performance core design challenges. Multiple flop variants enable targeted optimization.

Combinational Cells

Optimizing register-to-register paths requires a rich standard cell library that includes the appropriate functions, drive strengths, and implementation variants. Even though the NAND gate is functionally complete, meaning that any logic function can in principle be built from NAND gates alone, a rich set of optimized functions (NAND, NOR, AND, OR, inverters, buffers, XOR, XNOR, MUX, adders, compressors, etc.) is necessary for synthesis to create high-performance implementations. Advanced synthesis and place-and-route tools can take advantage of a rich set of drive strengths to optimally handle the different fanouts and loads created by the design topology and the physical distances between cells.

CPU Optimization Chart Showing Various Flop Types

Figure 3: Combinational, synthesis-friendly cells for resolving high-performance core design challenges include tapered cells, bubble cells, beta-ratio variants and AOI/OAI cells

Additional Options

Cells with multiple threshold voltages (Vt) and channel lengths offer additional options for the tools, as do variants of these cell functions, such as tapered cells that are optimized for minimal delay in typical processor critical paths. Having these critical-path-efficient cells and computationally efficient cells, such as AOIs and OAIs, available from the standard cell library provider is critical, but so is having a design flow tuned to take advantage of these enhanced cell options. Additionally, high drive-strength variants of these cells must be designed with special layout considerations to manage electromigration effectively when operating at GHz speeds.

To help the tools make the correct choices in selecting cells and minimize cycle time, it is often necessary to employ don’t_use lists to temporarily “hide” specific cells from the tools. Grouping multiple signals with similar constraints and loads can also make a major difference in synthesis efficiency. Attaining the absolute maximum performance out of a design requires the tools and flows to be pushed at different steps in the design flow (e.g., initial synthesis, incremental synthesis, clock tree synthesis, placement, routing, physical optimization). Optimization techniques can typically provide a 15-20% performance improvement.
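As a rough sketch of how a don’t_use list might be derived, the Python below hides cells whose drive strength falls outside a chosen window. The cell names, the `_X<n>` suffix convention and the helper functions are all hypothetical; real libraries use their own naming schemes, and real flows apply such lists through the tool's own commands rather than through a script like this.

```python
import re
from typing import Iterable, List

# Assumed naming convention for this example only: <function>_X<drive strength>.
DRIVE_RE = re.compile(r"_X(\d+)$")


def drive_strength(cell: str) -> int:
    """Extract the drive strength from a hypothetical cell name."""
    match = DRIVE_RE.search(cell)
    if match is None:
        raise ValueError(f"no drive strength suffix in {cell!r}")
    return int(match.group(1))


def dont_use_list(cells: Iterable[str], min_drive: int, max_drive: int) -> List[str]:
    """Cells to hide: drive strength below min_drive or above max_drive."""
    return sorted(c for c in cells
                  if not min_drive <= drive_strength(c) <= max_drive)


def visible_cells(cells: Iterable[str], hidden: Iterable[str]) -> List[str]:
    """Cells the tool is still allowed to pick, in their original order."""
    hidden_set = set(hidden)
    return [c for c in cells if c not in hidden_set]


# Made-up library contents for illustration.
LIBRARY = ["INV_X1", "INV_X2", "INV_X8", "INV_X32",
           "NAND2_X1", "NAND2_X4", "NAND2_X16", "AOI21_X2"]

hidden = dont_use_list(LIBRARY, min_drive=2, max_drive=8)
allowed = visible_cells(LIBRARY, hidden)
print("hidden: ", hidden)
print("allowed:", allowed)
```

Because the list is only a temporary filter, a later pass can widen the drive-strength window and release the hidden cells again.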

Clock Cells

High-performance clock driver variants are tuned to provide the minimum delay to reduce clock latency and minimize clock uncertainty caused by skew and process variability. These can include clock buffers tuned for symmetrical rise/fall times and clock inverters for minimum power. Clock tree synthesis tools must be robust enough to handle the PPA tradeoffs of these variants in order to use them effectively.

Graph Showing CPU Optimization Channel Length Impact

Figure 4: Long-channel clock drivers exhibit low On-Chip Variability (OCV) and low power

Intelligent use of integrated clock gating cells (ICGs) in multiple functional and drive strength variants is critical to minimizing clock tree power, which can easily consume 25%-50% of the dynamic power in an SoC.

Minimizing Power in CPU, GPU and DSP Cores

Mobile communications, multimedia and consumer SoCs must achieve high performance while consuming minimal energy in order to extend battery life and fit into lower cost packaging. Power is often the most important constraint, so the design challenge becomes getting the best possible performance within the available power budget. Each new silicon process generation brings a new set of challenges for providers of logic library and memory compiler IP and a new set of opportunities to create more power-efficient IP.

Power Optimization Toolbox

A power optimization toolbox consists of all the logic cell functions needed to implement the power optimization techniques for the SoC. These techniques include clock gating, shut down, deep sleep, multiple voltage domains, dynamic voltage and frequency scaling (DVFS), state retention and voltage biasing. The toolbox contains all of the necessary circuits to perform power optimization functions and provides the annotations needed by tools in the design flow to validate the design correctly.

Diagram of CPU Optimization Kit Components for Power Management

Figure 5: Synopsys’ Power Optimization Kit (POK) includes power gates to control power to a block, isolation cells to manage signals from powered-down blocks, retention registers (both balloon style and live latch style) to maintain state in powered-down blocks, and level shifters to translate signals between voltage domains.

Multi-bit Flip-Flops

Using multi-bit flip-flops is an effective method to reduce clock power consumption. Multi-bit flip-flops can significantly reduce the number of individual loads on the clock tree, reducing the overall dynamic power used in the clock tree. Area and leakage power savings can also be achieved because the flip-flops in a multi-bit cell share a single set of clock inverters instead of each carrying its own.

Diagram of CPU Optimization Master-Slave Latch System

Figure 6: Combining two single-bit flops into a dual flop with shared clocking

Additional Flops

Multi-bit flip-flops provide a set of additional flops that have been optimized for power and area with a minor tradeoff in performance and placement flexibility. The flops share a common clock pin, which decreases the overall clock loading of the N flops in the multi-bit flop cell, reduces area with a corresponding reduction in leakage, and reduces dynamic power on the clock tree significantly (up to 50% for a dual flop, more for quad or octal).

Multi-bit flip-flops are typically used in blocks that are not in the critical path of the highest chip operating frequency. They range from small, bus-oriented registers of SoC configuration data that are only clocked at power up, to major datapaths that are clocked every cycle and with a number of variants in between. SoC designers use the replacement ratio, measured by how many of the standard flops in the design can be replaced by their multi-bit equivalents and the resulting PPA improvements, to determine their overall chip power and area savings. The single-bit flip-flops to be replaced with multi-bit flip-flops must have the same function (clock edge, set/reset, and scan configuration).
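The same-function requirement and the replacement ratio can be illustrated with a small Python sketch. It assumes a simplified view in which flops are paired into dual flops purely by matching function; the `Flop` type, the field values and the pairing strategy are hypothetical, and the sketch ignores placement proximity and timing, which a real flow would also have to take into account.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Flop:
    name: str
    clock_edge: str  # "rise" or "fall"
    set_reset: str   # e.g. "none", "async_reset"
    scan: bool       # True if the flop is a scan flop


def pair_into_dual_flops(flops: List[Flop]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Pair flops with identical function; return (pairs, leftover singles)."""
    groups: Dict[Tuple[str, str, bool], List[str]] = defaultdict(list)
    for f in flops:
        groups[(f.clock_edge, f.set_reset, f.scan)].append(f.name)

    pairs: List[Tuple[str, str]] = []
    singles: List[str] = []
    for names in groups.values():
        for i in range(0, len(names) - 1, 2):
            pairs.append((names[i], names[i + 1]))
        if len(names) % 2:  # odd group size leaves one flop unpaired
            singles.append(names[-1])
    return pairs, singles


# Made-up design fragment for illustration.
design = [
    Flop("a", "rise", "none", True),
    Flop("b", "rise", "none", True),
    Flop("c", "rise", "none", True),
    Flop("d", "rise", "async_reset", True),
    Flop("e", "rise", "async_reset", True),
]

pairs, singles = pair_into_dual_flops(design)
replacement_ratio = 2 * len(pairs) / len(design)  # fraction of flops replaced
clock_sinks_after = len(pairs) + len(singles)     # one clock pin per cell
print(pairs, singles, replacement_ratio, clock_sinks_after)
```

In this toy case four of the five flops are merged, so the clock tree drives three clock pins instead of five.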

Taking Advantage of Multiple VTs and Channel Lengths for Power Optimization

At 28 nm, High K Metal Gate (HKMG) technology provides process improvements that make it a very attractive node to use for building high-performance/power-efficient SoCs. PolySiON processes that use much of the same manufacturing equipment provide very cost-effective alternative silicon. Many of these silicon processes support multiple transistor gate lengths at the same gate pitch. This process feature enables multi-channel libraries without the area penalty of designing to the worst-case channel length to achieve footprint compatibility. These interchangeable libraries facilitate late-stage leakage recovery performed by automatic place-and-route tools and very fine granularity in power optimization. Additional VT cells (ultra-high VT, ultra-low VT) provide even more granularity, but with increased costs.
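Late-stage leakage recovery with footprint-compatible variants can be pictured with the toy Python model below. Everything in it is an assumption for illustration: the variant names, the relative delay and leakage numbers, and above all the idea that each instance has its own independent slack. In a real design slack is shared along timing paths, so a place-and-route tool has to re-time after each swap.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple


class Variant(NamedTuple):
    name: str
    delay_factor: float  # delay relative to the fastest variant
    leakage: float       # relative leakage


# Hypothetical variants, ordered from fastest/leakiest to slowest/least leaky.
VARIANTS = [
    Variant("lvt_min", 1.00, 10.0),
    Variant("svt_min", 1.15, 3.0),
    Variant("svt_max", 1.30, 1.5),
    Variant("hvt_max", 1.50, 0.5),
]


@dataclass(frozen=True)
class Instance:
    name: str
    base_delay_ps: float  # delay when built with the fastest variant
    slack_ps: float       # slack treated as private to this instance (simplification)


def recover_leakage(instances: List[Instance]) -> Dict[str, str]:
    """Pick, per instance, the least leaky variant whose extra delay fits the slack."""
    choice: Dict[str, str] = {}
    for inst in instances:
        chosen = VARIANTS[0]
        for variant in VARIANTS:
            extra_ps = inst.base_delay_ps * (variant.delay_factor - 1.0)
            if extra_ps <= inst.slack_ps:
                chosen = variant  # later entries in VARIANTS leak less
        choice[inst.name] = chosen.name
    return choice


# Made-up instances for illustration.
cells = [
    Instance("u1", base_delay_ps=100.0, slack_ps=0.0),   # critical: stays fast
    Instance("u2", base_delay_ps=100.0, slack_ps=20.0),
    Instance("u3", base_delay_ps=40.0, slack_ps=50.0),
]
swaps = recover_leakage(cells)
print(swaps)
```

Because the variants are assumed to share a footprint, each swap in this model changes only the cell's delay and leakage and does not disturb placement or routing.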

CPU Optimization Performance Chart from Synopsys DesignWare Technical Bulletin

Figure 7: This 28-nm graph plots the relative leakage (at the leakage corner) of a library on the vertical axis and the relative performance of a library (at the signoff corner) on the horizontal axis. The graph shows the leakage advantages of the mid and max channel cells and the performance advantages of low and ultra-low VT cells.

With all of the possible library options, the amount of data presented to the synthesis and place-and-route tools can seem overwhelming. The aggressive use of don’t_use lists (initially hiding both very low and very high drive strength cells), combined with the proper sequencing of libraries, provides an efficient methodology for identifying the set of high-speed and high-density logic libraries and memory compilers that achieves the best performance and power tradeoffs at minimum cost. These methodologies are effective on many different circuit types (CPUs, GPUs, high-speed interfaces), but the details depend on the specific circuit configuration and process options being used. With a good understanding of the synthesis and place-and-route flow, one can quickly determine the optimal library combination and sequence for a given configuration of a design.

Table Comparing CPU Optimization Techniques for Different Performance Levels

Table 1: Library selection and sequence recommendations for synthesizing blocks to achieve different levels of target circuit performance

Acquiring specific libraries for each different type and configuration of CPU, GPU and DSP core implemented on an SoC can be inefficient and costly. A properly designed portfolio of logic cells and memory instances can deliver optimal PPA results in processor core hardenings if it includes a full selection of efficient logic circuit functions, the right set of variants and the right set of drive strength granularities.

Conclusion

The 28-nm process generation brings a new set of challenges for SoC designers, who can take advantage of advances in logic libraries to achieve the best processor core implementation for their specific application. A single set of performance-, power- and area-optimized standard cells and memories that enables this multi-dimensional optimization can significantly reduce the design effort of hardening cores to SoC-specific requirements. Synopsys provides a unique blend of IP, tools, design flows and expert services to help design teams achieve their processor and SoC PPA goals in the shortest possible time.
