Subscribe to DesignWare Technical Bulletin
Optimizing CPUs, GPUs and DSPs for High Performance and Low Power at 28 nm
By Ken Brock, Product Marketing Manager, Logic Libraries
The 28-nm process technology provides opportunities to optimize CPU, GPU and DSP processor core implementations to achieve better performance, power and area (PPA) results. This article provides guidelines for establishing 28-nm core design targets, selecting a design kit of standard cells to handle the diversity of requirements of CPUs, GPUs and DSPs and using implementation best practices to achieve PPA targets most efficiently with these cells.
Why Different PPA Goals for CPU, GPU and DSP Cores?
Within a single SoC design, CPU, GPU and DSP cores may co-exist and are often optimized to different points along the performance, power and area axes. For example, CPUs are typically tuned first for high performance at the lowest possible power, while GPUs, because of the amount of silicon area they occupy, are usually optimized for small area and low power. GPUs can exploit parallel algorithms to run at a lower operating frequency, but the added parallelism increases silicon area, and a GPU can account for up to 40% of the logic on an SoC. Depending on the application, a DSP core may be optimized for performance, as in base station applications with many signals, or for area and power, as in handset applications.
Figure 1: Primary PPA Targets for CPUs, GPUs and DSPs
Logic Libraries for High-Performance Core Optimization
Together, synthesizable cores, today’s high-performance standard cell libraries and EDA tools can achieve an optimal solution without a new library having to be designed for every processor implementation. To optimally harden your high-performance core, you need the following components in your standard cell library:
- High-performance sequential cells
- High-performance combinational cells
- High-performance clock cells
- Power minimization libraries and cells for non-critical paths
The setup time plus the clock-to-output delay of a flip-flop is sometimes referred to as the “dead” or “black hole” time. Like clock uncertainty, this time eats into every clock cycle, taking time that could otherwise be spent on useful computational work. Multiple sets of high-performance flip-flops are required to manage this dead time optimally. Delay-optimized flops rapidly launch signals into critical-path logic clusters, and setup-optimized flops capture signals at the end of those paths, extending the usable part of the clock cycle. Synthesis and routing optimization tools can be constrained to use these multiple flip-flop sets for maximum speed, resulting in a 15-20% performance improvement.
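The effect of dead time on the clock cycle can be sketched with simple arithmetic. The example below is only an illustration: the function names and all picosecond values are assumptions made up for this sketch, not data from any library, and it is not meant to reproduce the improvement figures quoted above.

```python
def usable_logic_time_ps(clock_period_ps, clk_to_q_ps, setup_ps, uncertainty_ps=0.0):
    """Time left in one clock cycle for combinational logic, in picoseconds."""
    dead_time_ps = clk_to_q_ps + setup_ps  # launch delay plus capture setup
    return clock_period_ps - dead_time_ps - uncertainty_ps


def max_frequency_ghz(logic_delay_ps, clk_to_q_ps, setup_ps, uncertainty_ps=0.0):
    """Highest clock frequency at which a register-to-register path meets setup."""
    min_period_ps = logic_delay_ps + clk_to_q_ps + setup_ps + uncertainty_ps
    return 1000.0 / min_period_ps  # a 1000 ps period corresponds to 1 GHz


# Hypothetical numbers: same logic delay, general-purpose flops vs. a
# delay-optimized launch flop paired with a setup-optimized capture flop.
baseline_ghz = max_frequency_ghz(logic_delay_ps=800, clk_to_q_ps=120, setup_ps=50, uncertainty_ps=30)
tuned_ghz = max_frequency_ghz(logic_delay_ps=800, clk_to_q_ps=90, setup_ps=30, uncertainty_ps=30)

print(f"baseline: {baseline_ghz:.3f} GHz, tuned flops: {tuned_ghz:.3f} GHz")
```

In this sketch, every picosecond removed from the launch delay or the setup time is returned directly to the logic between the registers.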
Figure 2: Sequential cells are used to resolve high-performance core design challenges. Multiple flop variants enable targeted optimization.
Optimizing register-to-register paths requires a rich standard cell library that includes the appropriate functions, drive strengths, and implementation variants. Even though NAND is functionally complete, so any logic function can in principle be built from NAND gates alone, a rich set of optimized functions (NAND, NOR, AND, OR, inverters, buffers, XOR, XNOR, MUX, adders, compressors, etc.) is necessary for synthesis to create high-performance implementations. Advanced synthesis and place-and-route tools can take advantage of a rich set of drive strengths to optimally handle the different fanouts and loads created by the design topology and the physical distances between cells.
Figure 3: Combinational, synthesis-friendly cells for resolving high-performance core design challenges include tapered, bubble, beta ratios and AOI/OAI
Cells with multiple threshold voltages (Vt) and channel lengths give the tools additional options, as do variants of these cell functions such as tapered cells, which are optimized for minimal delay in typical processor critical paths. Having these critical path-efficient cells and computationally efficient cells, such as AOIs and OAIs, available from the standard cell library provider is critical, but so is having a design flow tuned to take advantage of them. Additionally, high drive-strength variants of these cells must be designed with special layout considerations to manage electromigration effectively when operating at GHz speeds.
To help the tools make the correct choices in selecting cells and minimize cycle time, it is often necessary to employ don’t_use lists to temporarily “hide” specific cells from the tools. Grouping multiple signals with similar constraints and loads can also make a major difference in synthesis efficiency. Attaining the absolute maximum performance out of a design requires the tools and flows to be pushed at different steps in the design flow (e.g., initial synthesis, incremental synthesis, clock tree synthesis, placement, routing, physical optimization). Optimization techniques can typically provide a 15-20% performance improvement.
High-performance clock driver variants are tuned to provide the minimum delay to reduce clock latency and minimize clock uncertainty caused by skew and process variability. These can include clock buffers tuned for symmetrical rise/fall times and clock inverters for minimum power. Clock tree synthesis tools must be robust enough to handle the PPA tradeoffs of these variants in order to use them effectively.
Figure 4: Long-channel clock drivers exhibit low On-Chip Variability (OCV) and low power
Intelligent use of integrated clock gating cells (ICGs) in multiple functional and drive strength variants is critical to minimizing clock tree power, which can easily consume 25%-50% of the dynamic power in an SoC.
Minimizing Power in CPU, GPU and DSP Cores
Mobile communications, multimedia and consumer SoCs must achieve high performance while consuming a minimal amount of energy, both to extend battery life and to fit into lower cost packaging. Power optimization is often the most important constraint, so the design challenge becomes getting the best performance possible within the available power budget. Each new silicon process generation brings a new set of challenges for providers of logic library and memory compiler IP and a new set of opportunities to create more power-efficient IP.
Power Optimization Toolbox
A power optimization toolbox consists of all the logic cell functions needed to implement the power optimization techniques for the SoC. These techniques include clock gating, shut down, deep sleep, multiple voltage domains, dynamic voltage and frequency scaling (DVFS), state retention and voltage biasing. The toolbox contains all of the necessary circuits to perform power optimization functions and provides the annotations needed by tools in the design flow to validate the design correctly.
Figure 5: Synopsys’ Power Optimization Kit (POK) includes power gates to control power to a block, isolation cells to manage signals from powered-down blocks, retention registers (both balloon style and live latch style) to maintain state in powered-down blocks, and level shifters to translate signals between voltage domains.
Using multi-bit flip-flops is an effective method to reduce clock power consumption. Multi-bit flip-flops can significantly reduce the number of individual loads on the clock tree, reducing the overall dynamic power used in the clock tree. Area and leakage power savings can also be achieved simply by having the flip-flops in a cell share a single set of clock inverters.
Figure 6: Combining two single-bit flops into a dual flop with shared clocking
Multi-bit flip-flops provide a set of additional flops that have been optimized for power and area with a minor tradeoff in performance and placement flexibility. The flops share a common clock pin, which decreases the overall clock loading of the N flops in the multi-bit flop cell, reduces area with a corresponding reduction in leakage, and reduces dynamic power on the clock tree significantly (up to 50% for a dual flop, more for quad or octal).
Multi-bit flip-flops are typically used in blocks that are not in the critical path of the highest chip operating frequency. They range from small, bus-oriented registers of SoC configuration data that are only clocked at power up, to major datapaths that are clocked every cycle and with a number of variants in between. SoC designers use the replacement ratio, measured by how many of the standard flops in the design can be replaced by their multi-bit equivalents and the resulting PPA improvements, to determine their overall chip power and area savings. The single-bit flip-flops to be replaced with multi-bit flip-flops must have the same function (clock edge, set/reset, and scan configuration).
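The compatibility rule and the replacement ratio described above can be sketched as a small grouping routine. This is a simplified illustration under stated assumptions: the `Flop` record, its field names and the `timing_critical` flag are hypothetical, requiring a shared clock net is an added assumption that follows from the shared clock pin, and physical proximity, which a real flow would also consider, is ignored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flop:
    name: str
    clock_net: str
    clock_edge: str   # "rise" or "fall"
    set_reset: str    # e.g. "none", "async_reset"
    scan: bool
    timing_critical: bool = False


def plan_multibit_merge(flops, bits_per_cell=2):
    """Group compatible, non-critical flops into candidate multi-bit cells.

    Returns (groups, leftovers): each group lists the flops that would share
    one multi-bit cell; leftovers remain single-bit flops.
    """
    buckets = defaultdict(list)
    leftovers = []
    for f in flops:
        if f.timing_critical:
            leftovers.append(f.name)  # keep critical flops single-bit
            continue
        key = (f.clock_net, f.clock_edge, f.set_reset, f.scan)
        buckets[key].append(f.name)

    groups = []
    for names in buckets.values():
        full = len(names) - len(names) % bits_per_cell
        for i in range(0, full, bits_per_cell):
            groups.append(names[i:i + bits_per_cell])
        leftovers.extend(names[full:])  # not enough partners for a full cell
    return groups, leftovers


def replacement_ratio(flops, groups):
    """Fraction of the original flops that end up inside multi-bit cells."""
    return sum(len(g) for g in groups) / len(flops) if flops else 0.0


flops = [
    Flop("cfg0", "clk", "rise", "async_reset", True),
    Flop("cfg1", "clk", "rise", "async_reset", True),
    Flop("cfg2", "clk", "rise", "async_reset", True),
    Flop("dp0", "clk", "rise", "none", True),
    Flop("dp1", "clk", "rise", "none", True),
    Flop("crit0", "clk", "rise", "none", True, timing_critical=True),
]
groups, leftovers = plan_multibit_merge(flops)
clock_pins_after = len(groups) + len(leftovers)
print(groups, leftovers, replacement_ratio(flops, groups), clock_pins_after)
```

In this made-up example, four of the six flops are merged into two dual cells, so the clock tree drives four clock pins instead of six.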
Taking Advantage of Multiple VTs and Channel Lengths for Power Optimization
At 28 nm, High K Metal Gate (HKMG) technology provides process improvements that make it a very attractive node to use for building high-performance/power-efficient SoCs. PolySiON processes that use much of the same manufacturing equipment provide very cost-effective alternative silicon. Many of these silicon processes support multiple transistor gate lengths at the same gate pitch. This process feature enables multi-channel libraries without the area penalty of designing to the worst-case channel length to achieve footprint compatibility. These interchangeable libraries facilitate late-stage leakage recovery performed by automatic place-and-route tools and very fine granularity in power optimization. Additional VT cells (ultra-high VT, ultra-low VT) provide even more granularity, but with increased costs.
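Late-stage leakage recovery with footprint-compatible cells can be pictured as spending leftover timing slack on lower-leakage variants. The sketch below is a toy greedy model for a single path, not how any particular place-and-route tool works: the variant names and all delay and leakage numbers are hypothetical, and a real tool would evaluate slack across all paths through each cell.

```python
# Hypothetical delay/leakage per footprint-compatible variant of one cell function.
VARIANTS = {
    "lvt_min": {"delay_ps": 10.0, "leakage_nw": 100.0},
    "svt_min": {"delay_ps": 13.0, "leakage_nw": 30.0},
    "svt_max": {"delay_ps": 15.0, "leakage_nw": 15.0},
    "hvt_max": {"delay_ps": 20.0, "leakage_nw": 4.0},
}


def recover_leakage(path_cells, slack_ps, variants=VARIANTS):
    """Greedily swap cells on one path to lower-leakage variants.

    path_cells: list of (cell_name, current_variant) pairs.
    Each step applies the swap with the best leakage saved per picosecond
    of added delay that still fits in the remaining slack.
    Returns (assignment dict, remaining slack in ps).
    """
    assignment = dict(path_cells)
    remaining = slack_ps
    while True:
        best = None  # (gain, cell, variant, extra_delay)
        for cell, current in assignment.items():
            cur = variants[current]
            for cand_name, cand in variants.items():
                extra = cand["delay_ps"] - cur["delay_ps"]
                saved = cur["leakage_nw"] - cand["leakage_nw"]
                if saved <= 0 or extra > remaining:
                    continue  # no leakage benefit, or it would violate timing
                gain = saved / extra if extra > 0 else float("inf")
                if best is None or gain > best[0]:
                    best = (gain, cell, cand_name, extra)
        if best is None:
            return assignment, remaining
        _, cell, cand_name, extra = best
        assignment[cell] = cand_name
        remaining -= extra


def total_leakage_nw(assignment, variants=VARIANTS):
    return sum(variants[v]["leakage_nw"] for v in assignment.values())


path = [("u1", "lvt_min"), ("u2", "lvt_min"), ("u3", "lvt_min")]
recovered, slack_left = recover_leakage(path, slack_ps=12.0)
print(recovered, slack_left, total_leakage_nw(recovered))
```

Because the variants are assumed to share a footprint, each swap in this model changes only delay and leakage, which is what makes such substitutions practical late in the flow.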
Figure 7: This 28-nm graph plots the relative leakage (at the leakage corner) of a library on the vertical axis and the relative performance of a library (at the signoff corner) on the horizontal axis. The graph shows the leakage advantages of the mid and max channel cells and the performance advantages of low and ultra-low VT cells.
With all of the possible library options, the amount of data presented to the synthesis and place-and-route tools can seem overwhelming. The aggressive use of don’t_use lists (initially hiding both very low and very high drive strength cells) and the proper sequencing of libraries provide an efficient methodology for identifying the set of high-speed and high-density logic libraries and memory compilers that achieves the best performance and power tradeoffs at minimum cost. These methodologies are effective on many different circuit types (CPUs, GPUs, high-speed interfaces), but the best choices depend on the specific circuit configuration and process options being used. With a good understanding of the synthesis and place-and-route flow, one can quickly determine the optimal library combination and sequence for a given configuration of a design.
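Building an initial don’t_use list by drive strength can be sketched as a simple filter over cell names. The naming convention here (a trailing `_X<drive>` suffix, with `P` standing for a decimal point) and the drive thresholds are assumptions chosen for illustration; actual library naming and sensible cutoffs vary, and the resulting list would still need to be passed to the tools in whatever form they expect.

```python
import re

# Assumed naming convention: FUNCTION_X<drive>, e.g. INV_X0P5 means drive 0.5.
DRIVE_RE = re.compile(r"_X(\d+(?:P\d+)?)$")


def drive_strength(cell_name):
    """Return the drive strength encoded in a cell name, or None if absent."""
    match = DRIVE_RE.search(cell_name)
    if match is None:
        return None
    return float(match.group(1).replace("P", "."))


def initial_dont_use(cell_names, min_drive=1.0, max_drive=8.0):
    """List cells to hide at first: very low and very high drive strengths."""
    hidden = []
    for name in cell_names:
        drive = drive_strength(name)
        if drive is not None and (drive < min_drive or drive > max_drive):
            hidden.append(name)
    return hidden


cells = ["INV_X0P5", "INV_X1", "INV_X4", "INV_X16", "NAND2_X2", "BUF_X32", "TIEHI"]
dont_use = initial_dont_use(cells)
print(dont_use)
```

Cells hidden this way can be released again in later optimization steps by widening the thresholds, which matches the idea of hiding them only temporarily.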
Table 1: Library selection and sequence recommendations for synthesizing blocks to achieve different levels of target circuit performance
Acquiring specific libraries for each different type and configuration of CPU, GPU and DSP core implemented on an SoC can be inefficient and costly. A properly designed portfolio of logic cells and memory instances can deliver optimal PPA results in processor core hardenings if it includes a full selection of efficient logic circuit functions, the right set of variants and the right set of drive strength granularities.
The 28-nm process generation brings a new set of challenges for SoC designers, who can take advantage of advances in logic libraries to achieve the best processor core implementation for their specific application. A single set of performance-, power- and area-optimized standard cells and memories that enables this multi-dimensional optimization can significantly reduce the design effort of hardening cores to SoC-specific requirements. Synopsys provides a unique blend of IP, tools, design flows and expert services to help design teams achieve their processor and SoC PPA goals in the shortest possible time. For more information on how Synopsys logic libraries can benefit your system, visit http://www.synopsys.com/dw/ipdir.php?ds=dwc_standard_cell