Innovative Ideas for Predictable Success
      Issue 3, 2010



Technology Update
Extending Static Timing Analysis Beyond 500 Million Instances
PrimeTime® HyperScale technology extends static timing analysis (STA) to designs of up to 500 million instances, and improves block and chip timing convergence, performance and capacity. Bernadette Mortell, Synopsys, explains how design teams can benefit from HyperScale’s 5-10X faster runtimes and 5-10X smaller memory footprint for full-chip analysis.

Consider that performing complex signal-integrity based timing analysis on a 100 million instance design can take up to 24 hours using today’s multicore solutions. Clearly, a 10X speedup in STA would be a compelling benefit for a design of this size. But since large design sizes of 10 to 50 million instances are more common today, it’s reassuring that the latest technology in PrimeTime offers similar benefits to mainstream designs.

For example, when using PrimeTime HyperScale technology on a customer design of 15 million instances, we saw chip-level runtime fall from 3 hours to 20 minutes and the memory required fall from 23 GB to just 3.6 GB. That speedup offers a significant improvement in turnaround time for mainstream designs, especially considering that designers increasingly perform STA with multiple signoff scenarios.
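To make these improvement factors concrete, the figures quoted above can be reduced to simple ratios. The short Python sketch below does only that arithmetic; the function name `improvement_factor` is illustrative and is not part of any PrimeTime interface.

```python
def improvement_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    if before <= 0 or after <= 0:
        raise ValueError("both measurements must be positive")
    return before / after


# Figures quoted in the article for the 15 million instance design.
runtime_speedup = improvement_factor(before=3 * 60, after=20)  # minutes
memory_reduction = improvement_factor(before=23.0, after=3.6)  # GB

print(f"runtime: {runtime_speedup:.1f}X faster")    # runtime: 9.0X faster
print(f"memory:  {memory_reduction:.1f}X smaller")  # memory:  6.4X smaller
```

Both ratios fall inside the 5-10X range cited for HyperScale technology.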

Our recent introduction of PrimeTime HyperScale technology is the latest deliverable in a development program that ensures we continue to provide analysis solutions where performance and capacity scale with the needs of designers.

PrimeTime’s Approach to Performance and Capacity
We have taken a multi-pronged approach to enhancing performance and capacity in PrimeTime. First, we sped up the core PrimeTime technology. We re-wrote code to take advantage of new programming approaches, and focused on improving signal integrity (SI) timing analysis runs where users were seeing the longest runtimes. These enhancements improved the performance of PrimeTime and PrimeTime SI running on a single core by more than 3X.

Next, we enhanced PrimeTime for multicore compute environments. We added support for threaded multicore analysis, which lets users apply the available cores on a single machine to a single timing run. Unique to PrimeTime, we also added support for distributed multicore analysis, which partitions the design so that design teams can take advantage of multiple machines in a compute farm. Distributed multicore technology enables design teams to analyze large designs using machines with even modest amounts of memory. Using four cores improves runtimes by a further 2X. Taking both approaches, we’ve been able to deliver a solution that helps designers whatever the makeup of their compute environments.

Using multiple cores to run software applications is a trend that is set to continue. We are seeing machines with more cores and memory become available, and we expect to continue to improve PrimeTime’s multicore capabilities so that performance continues to improve with the number of cores. However, we do expect to reach a point where multicore technology will deliver diminishing returns in performance. This happens when more time is spent transferring data to and from memory than actually processing it. That is why we continue to invest in other technologies that complement our multicore initiative.

Introducing HyperScale Technology
A common approach to deal with complexity is to break a big problem down into a set of smaller problems and deal with them hierarchically. In large scale design the use of hierarchical implementation flows is common because it provides capacity and performance improvements over flat techniques. PrimeTime HyperScale technology directly addresses the STA needs of those already doing hierarchical implementation. HyperScale is suited to design teams working on 40 and 28 nm technologies doing hierarchical design implementation on designs in excess of 10 million instances.

More Productive Block and Chip Design
In a typical hierarchical design flow, the initial chip specification is used to create block-level timing budgets for each of the blocks that will be physically implemented. Block designers then take these budgets and begin the implementation process. As place and route proceeds, they add further constraints in order to achieve their block-level timing goals. Timing analysis at the block level reveals whether they have closed timing on the block, but it lacks the timing context of the full chip.

The design team repeats this process for all of the blocks in the design, and then integrates them. At this point, it is possible that the interaction between blocks will cause the timing to change. The changes may require edits inside already completed blocks. Consequently, it can take many iterations at block level and chip level to close timing. One of our objectives in introducing PrimeTime HyperScale technology is to reduce the time it takes to achieve timing convergence at the block level, and enable faster full-chip analysis and timing closure. This is achieved by maintaining up-to-date timing context information for the chip level and the block level at all times. Block designers no longer have to use the initial static timing budget to optimize their designs. They can see if their blocks are meeting their timing goals with a full understanding of the chip-level timing environment.

Working from a static budget is like waking up in the morning in a room with no windows and a completely independent climate-controlled environment. You decide what to wear for the day based on your current surroundings and what you knew of the weather the day before. But overnight a storm may have blown in, and it may be hailing. The HyperScale solution is equivalent to giving you a window that opens to the outside, and access to weather measurements that are accurate for every place you plan to visit that day.

How HyperScale Technology Works
HyperScale technology takes advantage of the physical hierarchical implementation process by re-using the block-level timing for the chip-level analysis. HyperScale technology uses the information from the block timing to generate block timing contexts, which completely describe the timing for the block including all signal integrity effects. Unlike past approaches to hierarchical analysis, PrimeTime HyperScale block contexts are automatically generated and do not require model validation. They are automatically saved as part of the save session process, ready to be used for the chip-level analysis. HyperScale contexts give the user the same accurate timing results as flat timing analysis, which eliminates the disadvantages associated with earlier model-based approaches. Users can also see inside blocks when required, for example to review a particular critical timing path that includes both top-level and block-level logic.

HyperScale offers design teams a number of important benefits:

Runtime and capacity: HyperScale chip-level timing analysis is fast. In the 2010 releases, average runtimes are 5X faster than flat full-chip timing analysis, and the memory required is 5X smaller than the peak memory experienced in a flat full-chip flow. In the 2011 release, both factors are planned to improve to 10X.

Accuracy and visibility: It maintains full accuracy, including accounting for signal integrity effects on the block boundaries. It gives the user full visibility inside blocks on a block-by-block basis, which is ideal for analysis of critical paths that may traverse several blocks of the design.

No model generation or validation: HyperScale automatically generates block timing contexts, so design teams are not burdened with creating or validating timing models.

Pinpoints when and where timing is outside scope: As PrimeTime HyperScale technology loads in block-level timing contexts to run chip-level timing analysis, it updates the chip-level context to analyze whether each block is within its timing scope. Blocks that meet timing at block level can still have chip-level context violations when integrated into the chip layout. HyperScale technology will report any block that is out of scope.

No manual re-budgeting: It automatically updates the block timing context, so that no re-budgeting is required to drive timing closure ECOs. Without HyperScale technology, the process of re-budgeting is manual, which can take minutes or hours, depending on the extent of the re-budgeting process.

Informs place and route: IC Compiler can use information from PrimeTime HyperScale technology to guide the timing ECO process, which assists in achieving block- and chip-level timing closure. Providing focused fixing information means that IC Compiler will need fewer iterations to close timing.

Easier constraint management: HyperScale technology lets designers close timing at the block level with full visibility of top-level timing. This eliminates the need to merge constraints for chips and blocks. Instead of having to build a flat set of chip-level timing constraints, designers can use all the block-level timing constraints together with the constraints that are uniquely associated with the chip level.
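One way to picture the out-of-scope check listed among the benefits above is as a comparison between what a block assumed about its boundary and what the chip-level analysis actually observes. The sketch below is conceptual only: the article does not describe how PrimeTime determines scope, so the `PortContext` record, the `out_of_scope_ports` function, the port names and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortContext:
    """Hypothetical record: latest arrival time (ns) a block assumed at an input port."""
    port: str
    assumed_arrival_ns: float


def out_of_scope_ports(block_context, chip_arrivals_ns, tolerance_ns=0.0):
    """Return ports whose chip-level arrival is later than the block assumed."""
    violations = []
    for ctx in block_context:
        actual = chip_arrivals_ns.get(ctx.port)
        if actual is None:
            continue  # port has no chip-level arrival in this toy example
        if actual > ctx.assumed_arrival_ns + tolerance_ns:
            violations.append(ctx.port)
    return violations


# Invented numbers for illustration only.
block_context = [PortContext("data_in", 0.80), PortContext("valid_in", 0.60)]
chip_arrivals_ns = {"data_in": 0.75, "valid_in": 0.72}

print(out_of_scope_ports(block_context, chip_arrivals_ns))  # ['valid_in']
```

In this toy case the block closed timing against its own assumptions, yet `valid_in` arrives later at chip level than the block expected, so the block would be flagged for attention.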

Scalable Results
We have benchmarked HyperScale technology on a number of customer designs. Figures 1 and 2 show results for both runtime speedup and memory reduction.

We performed the analysis measurements on identical hardware, with and without HyperScale technology.


Figure 1: Comparing performance speedup across a range of designs


Figure 2: Comparing capacity improvement across a range of designs (memory – GB)

While we have created HyperScale technology to extend STA to meet the needs of 500 million instance designs, the results show just how effective it can be in improving productivity for relatively small designs. Using HyperScale technology reduces runtime from 57 minutes to 8 minutes for a 5 million instance design. In this case HyperScale technology makes it feasible for the design team to sign off against many scenarios in a day.
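To put "many scenarios in a day" into rough numbers, the sketch below counts how many runs fit into 24 hours. It assumes, for illustration only, that every scenario takes the quoted runtime, that runs execute back to back on one machine, and that there is no setup time between them; the helper name `runs_per_day` is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(minutes_per_run: int) -> int:
    """Whole back-to-back runs that fit into 24 hours on one machine."""
    if minutes_per_run <= 0:
        raise ValueError("minutes_per_run must be positive")
    return MINUTES_PER_DAY // minutes_per_run


flat_runs = runs_per_day(57)       # quoted runtime without HyperScale
hyperscale_runs = runs_per_day(8)  # quoted runtime with HyperScale

print(f"without HyperScale: {flat_runs} runs per day")     # 25
print(f"with HyperScale:    {hyperscale_runs} runs per day")  # 180
```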

Figures 3 and 4 show our projections for runtime and capacity based on the early results with customer designs.


Figure 3: Scaling capacity with runtime


Figure 4: Scaling capacity with machine memory

Today, a 100 million instance design will run in 256 GB of memory when the full flat design has to be loaded into memory along with all of its associated constraints. Our projections show that HyperScale technology will enable a 500 million instance design to load into the same 256 GB of memory. As we continue to improve our multicore technology, PrimeTime HyperScale technology performance and capacity will improve further as the two technologies are used together.

Summary
PrimeTime HyperScale technology extends STA beyond 500 million instances, delivers 5-10X speed and capacity benefits, and improves the block and chip timing convergence process.

HyperScale technology enhances the Galaxy™ Implementation Platform by providing more precise timing context to drive timing closure in IC Compiler. It works with existing PrimeTime features like signal integrity (SI) analysis, advanced on-chip variation (AOCV) analysis, multi-scenario analysis and threaded multicore analysis, to improve overall timing closure turnaround time.

Bernadette Mortell
Bernadette (Bernie) Mortell is senior product marketing manager for PrimeTime suite at Synopsys.


©2010 Synopsys, Inc. Synopsys and the Synopsys logo are registered trademarks of Synopsys, Inc. All other company and product names mentioned herein may be trademarks or registered trademarks of their respective owners and should be treated as such.

