Technology Update
Using Parallelism to Double Simulation Speed
Usha Gaira, Synopsys, explains how verification teams can use VCS® multicore technology to speed up their simulations.
Verification teams recognize the value of using parallelism to improve the throughput of lengthy simulations wherever they can. They are used to 'divide and conquer' strategies, splitting verification tasks, sharing them across their teams and running several time-consuming simulation tasks in parallel on multiple computers.
A comprehensive verification environment may comprise a number of distinct tasks: running checkers and monitors, applying constraints, evaluating assertions, measuring coverage, performing debug and dumping waveforms. Application-level parallelism (ALP) exploits these distinct verification tasks by running them in parallel across multiple CPUs.
The latest compute infrastructures are built on multicore, multi-threading machines. To get the best performance from multicore hardware, verification engineers need to use software that can exploit the parallelism inherent within it.
The latest design trends within the engineering community are also moving towards parallelism. Increasingly, engineers are producing SoC architectures that consist of multiple cores. If verification teams can partition the design so that they can verify each core or functional block on independent processors running in parallel, they can achieve higher verification throughput.
Partitioning a design into smaller chunks also makes life easier for verification engineers because each piece of the design will more readily fit into computer memory. Verification engineers refer to this partitioning approach as design-level parallelism (DLP).
Ideally, multicore simulation technology should support both DLP and ALP techniques. Part of that support includes ensuring that single-core and multicore simulations produce consistent results. The tools can do this by synchronizing communication across multiple cores and carefully managing shared memory.
VCS Multicore Technology
Synopsys’ VCS® multicore technology allows designers to identify performance bottlenecks and share time-consuming activities across multiple cores to achieve faster functional verification and debug. Supporting both ALP and DLP, it combines user-assisted partitioning and load balancing, event synchronization and memory optimization, and provides a performance boost for chip-level and system-level verification of SoCs.
Figure 1: VCS multicore technology
To enable parallel simulation, the verification environment must identify the dependent and independent events, and select a partitioning scheme that delivers the best performance.
Discrete event simulation techniques use a number of partitions, or process objects – one or several objects for each processor. The objects on different processors communicate through event messages. Independent events do not need to communicate, and a good partitioning scheme is one that can maximize the number of independent events on all processors. The simulation must preserve the order in which the dependent events are processed so that no event can occur unless all events on which it depends have already occurred. The simulation can process independent events in an arbitrary order or concurrently.
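The partition-and-event-list scheme described above can be sketched in a few lines of Python. This is an illustrative model, not VCS code: each `Partition` stands in for a process object with its own event list, and the global loop shows why dependent events at different timestamps must stay ordered, while same-time events in different partitions are independent and may run in any order or concurrently.

```python
import heapq

class Partition:
    """One process object: a slice of the model with its own event list."""
    def __init__(self, name):
        self.name = name
        self.events = []  # min-heap ordered by timestamp

    def post(self, time, action):
        # id(action) breaks timestamp ties without comparing callables
        heapq.heappush(self.events, (time, id(action), action))

def simulate(partitions, until):
    """Advance all partitions together, one timestamp at a time."""
    now = 0
    while now < until:
        pending = [p.events[0][0] for p in partitions if p.events]
        if not pending:
            break
        now = min(pending)            # next timestamp that is safe to process
        for p in partitions:          # same-time events: any order is fine
            while p.events and p.events[0][0] == now:
                _, _, action = heapq.heappop(p.events)
                action(now)
    return now
```

Running two partitions through this loop shows that an event at time 2 always executes before any event at time 5, regardless of which partition it lives in.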
Dynamic load balancing improves the performance of parallel simulations. The simplest dynamic load balancing technique uses a multi-partition approach where the number of process objects generated before simulation is a multiple of the number of processors. The simulation assigns these objects to processors at the beginning – each of them represents a part of the simulation model and has its own event list. If the simulator detects load imbalances, it migrates objects between the processors to get a better balance.
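The object-migration idea behind that multi-partition approach can be sketched as a greedy rebalancing pass. The Python below is a conceptual illustration, with hypothetical load units that a real simulator would measure during the run; it migrates the busiest processor's lightest object whenever doing so narrows the imbalance.

```python
def rebalance(processors, threshold=1.25):
    """Migrate process objects from overloaded to underloaded processors.

    `processors` maps a processor id to a list of (object_name, load)
    pairs, where load is that object's share of recent simulation work
    (hypothetical units). Objects move while a processor's total load
    exceeds the mean by more than `threshold`.
    """
    def total(pid):
        return sum(load for _, load in processors[pid])

    mean = sum(total(p) for p in processors) / len(processors)
    moves = []
    while True:
        busiest = max(processors, key=total)
        idlest = min(processors, key=total)
        if not processors[busiest] or total(busiest) <= threshold * mean:
            break
        # move the lightest object: it corrects without overshooting
        obj = min(processors[busiest], key=lambda o: o[1])
        if total(idlest) + obj[1] >= total(busiest):
            break  # migration would not narrow the gap
        processors[busiest].remove(obj)
        processors[idlest].append(obj)
        moves.append((obj[0], busiest, idlest))
    return moves
```

With three equal objects on one processor and one light object on another, a single migration is enough to bring the busiest processor under the threshold.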
VCS multicore technology uses a conservative synchronization method to coordinate the partitions in a parallel simulation. In this approach, VCS advances simulation time only when doing so cannot violate event dependencies: all partitions execute their events for the current time step in parallel, but none moves ahead until every partition has finished that step. All events in the simulation therefore progress forward in simulation time in a lockstep fashion.
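That lockstep behaviour can be illustrated with a thread barrier. This is a conceptual Python model, not the VCS implementation: each worker thread stands in for a partition, and the barrier marks the point where no partition may enter time t+1 until all partitions have finished time t.

```python
import threading

def run_lockstep(partitions, steps):
    """Conservative synchronization sketch: every partition finishes its
    events for time t before any partition starts time t+1.

    `partitions` maps a partition name to a dict of {time: [event, ...]}
    (hypothetical events); returns the interleaved execution trace.
    """
    barrier = threading.Barrier(len(partitions))
    trace = []
    lock = threading.Lock()

    def worker(name, events):
        for t in range(steps):
            for ev in events.get(t, []):
                with lock:
                    trace.append((t, name, ev))
            barrier.wait()  # no one enters t+1 until all finish t

    threads = [threading.Thread(target=worker, args=(n, e))
               for n, e in partitions.items()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return trace
```

Within a time step the partitions may interleave freely, but the trace's timestamps are always nondecreasing, which is exactly the ordering guarantee conservative synchronization provides.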
Traditionally, event simulators slow down by a factor of two or three while dumping waveform data. By distributing waveform dumping, verification engineers can accomplish a 2X productivity boost during the debug phase of the project. In addition, engineers can execute a large number of covergroups and/or SystemVerilog assertions concurrently using VCS multicore. The compiler automatically partitions the design for ALP simulation (Figure 2).
Figure 2: Application-level parallelism
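As a rough illustration of application-level parallelism, the sketch below feeds one simulation trace to several verification applications concurrently instead of serially. The application functions here are hypothetical stand-ins for coverage collection, assertion checking and waveform dumping, not VCS APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_apps_in_parallel(trace, apps):
    """ALP sketch: run several verification applications over the same
    simulation trace concurrently, one worker per application.

    `apps` maps an application name to a callable taking the trace;
    returns each application's result keyed by name.
    """
    with ThreadPoolExecutor(max_workers=len(apps)) as pool:
        futures = {name: pool.submit(fn, trace) for name, fn in apps.items()}
        return {name: f.result() for name, f in futures.items()}
```

The point of the sketch is structural: the simulation engine produces the trace once, and the independent analysis tasks no longer serialize behind one another inside a single process.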
Partitioning, load balancing, event synchronization and memory optimization all affect VCS multicore technology performance.
Figure 3: Design-level parallelism
Design Partitioning Strategies
VCS multicore helps verification engineers to partition their designs to achieve fast simulation from design-level parallelism. For the best results, engineers should aim to achieve even load balancing and minimum communication between partitions. Designs should avoid clock dependencies between blocks because any clock dependency prevents the simulator from running the blocks in parallel.
Slave partitions are those that the verification engineer specifies. All parts of the design not covered by any of the specified partitions together form an implicitly defined master partition.
For the best performance, clock logic should be in the master partition or else replicated in each slave partition.
Verification engineers should select partitions that require significant time spent in simulation, because overall performance depends on how active each partition is. Creating a simulation profile of the design as it simulates on a single processor enables verification engineers to understand which instances use the most CPU time, and therefore which blocks are the best candidates for partitions.
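One way to turn such a single-processor profile into balanced partitions is a greedy longest-processing-time packing. The sketch below is illustrative only, with hypothetical instance names and CPU-time shares rather than real profiler output:

```python
import heapq

def pick_partitions(profile, cores):
    """Greedy longest-processing-time assignment of profiled instances.

    `profile` maps an instance name to its share of single-core CPU time;
    the heaviest instances are placed first, each onto the currently
    lightest of `cores` slave partitions, to even out the load.
    """
    bins = [(0.0, i, []) for i in range(cores)]
    heapq.heapify(bins)
    for name, cost in sorted(profile.items(), key=lambda kv: -kv[1]):
        load, i, members = heapq.heappop(bins)  # lightest partition so far
        members.append(name)
        heapq.heappush(bins, (load + cost, i, members))
    return {i: members for _, i, members in bins}
```

For a profile where two CPU cores dominate simulation time, the heuristic places them on separate partitions and fills in the lighter blocks around them.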
For mixed-language designs, VHDL parts should run within the master partition while slave partitions can include Verilog parts.
While multicore designs will naturally satisfy many of these guidelines, verification engineers can achieve similar benefits with single-core designs if they satisfy the same rules.
To further tune the performance gains, VCS multicore technology provides an intelligent profiler that highlights the performance bottlenecks. The profiler produces graphs (for example, Figure 4) that enable verification engineers to infer load balance, wait times, communication times, execution deltas and the amount of parallelism that the simulator can exploit.
Figure 4: Processor segment totals
Figure 4 shows a situation where both master and slave partitions are doing about the same amount of work.
Modeling Style Guidelines
Use of certain HDL modeling styles improves VCS multicore performance. Key technical recommendations include:
- When cross-scope references are essential, bind them as early as possible in the compilation and simulation cycle, forming the path names with explicit strings. Dynamically constructing path names is an inefficient modeling style for VCS multicore simulation.
- Avoid defining tasks/functions in one partition and calling them inside another partition. Similarly, avoid defining a Verilog memory or array in one partition and reading or writing it in another.
- Eliminate use of bi-directional ports/primitives on a partition boundary.
- In general terms, stepping down a level of abstraction increases simulation effort by an order of magnitude. When simulation executes on a single processor, an n-fold increase in simulation effort means an n-fold increase in simulation time. However, with parallel simulation, increasing the level of detail can yield more units of work capable of executing concurrently, thus allowing VCS multicore to simulate faster than a comparable single-processor simulation.
- As models increase in detail and simulation effort, the potential for parallel speed-up continues to improve relative to a single processor. However, internal information dependencies within models limit the actual parallelism. This means that a thousand-fold increase in simulation effort resulting from increasing model detail seldom actually yields a thousand-fold increase in simulation parallelism.
- Back annotating delay or parasitic information into an HDL simulation increases simulation accuracy at the expense of performance. Adding detail to the model adds to the computation required, whether using a single-processor or a multicore architecture. Detailed timing models tend to distribute evaluations at a wider range of instants in the simulation time domain, which decreases the potential for parallelism. Verification engineers can simulate back-annotated designs faster by simplifying the timing model or changing other timing-related parameters.
- Verification engineers can improve VCS multicore performance by ensuring that there are significantly more processes, tasks, or their equivalents than processors running simulation. Common explicit or implicit sensitivity lists among processes reduce scheduling overhead.
- Minimizing the number of drivers associated with implicitly or explicitly resolved signals also boosts performance.
Harnessing the Power of Multicore
Design teams can now integrate hundreds of millions of transistors on a single die. Because simulator performance decreases as complexity increases, simulation is becoming more of a bottleneck within the verification process. VCS multicore technology cuts already lengthy verification time in half by harnessing the power of multicore CPUs. It allows users to identify performance bottlenecks and distribute time-consuming activities across multiple cores for faster verification. VCS multicore technology performs load balancing, event synchronization and memory optimization automatically, making it easy to deploy.
VCS multicore technology extends the native VCS testbench optimizations to run on multicore CPUs. The parallel execution includes the design under test as well as verification applications such as testbench, assertion, coverage and debug. Design-level parallelism allows a user to concurrently simulate multiple instances of a core, several partitions of a large design, or a combination of the two. Application-level parallelism allows users to run testbenches, assertions, coverage, and debug concurrently on multiple cores.
Usha Gaira is a Staff Corporate Applications Engineer at Synopsys, and has previously worked in verification at Zaiq and in design at ST Microelectronics.
©2010 Synopsys, Inc. Synopsys and the Synopsys logo are registered trademarks of Synopsys, Inc. All other company and product names mentioned herein may be trademarks or registered trademarks of their respective owners and should be treated as such.