Issue 2, 2011
Faster Simulation without More Hardware
How can design and verification engineers improve the simulation performance of modern high-performance CPU cores? Neil McKenzie, Mitchell Poplingher, Lloyd Cha, and Dhruba Chandra of AMD and Vijay Akkaraju of Synopsys looked at various techniques – VCS® technology upgrades, elimination of transparent latches in the design, compile-time and run-time optimization of SystemVerilog covergroups, and optimization of C++ PLI routines – that resulted in simulations running approximately twice as fast.
As the highest performing microprocessor designs become exponentially more complex, so too does the effort required to verify them. This verification bottleneck can slow down the overall design cycle, causing major problems for many design teams today because faster time to market remains a high priority.
The conventional way to run more simulation cycles is to add more hardware. While hardware upgrades are a necessary part of maintaining a large compute server cluster, improving simulation performance can save millions of dollars in capital expenditures and operating costs. So we set out to run more test cases per unit time without using any extra hardware.
We identified four basic categories of simulation improvements:
- Simulator (Synopsys VCS) technology improvements
- Elimination of transparent latches
- Coding improvements in functional coverage constructs
- Coding improvements in C++ infrastructure
To quantify the impact of each improvement, we first created a measurement environment by isolating a set of compute nodes from the overall compute server cluster.
The metric we used to measure VCS simulation performance was simulated cycles per second (CPS). CPS measurements from multiple simulation runs that are launched on the compute server cluster have a great deal of variability. To demonstrate the variation in CPS typical of a large compute server cluster, we ran more than 600 different simulation runs of the same test case (Figure 1). We normalized CPS to make the arithmetic average across all runs equal to 100. The lowest CPS measured was 5 and the highest was 167. In other words, the slowest simulation was slower than the mean by a factor of 20, and the fastest simulation ran 1.67 times faster than the mean.
Figure 1: Normalized CPS as measured from 600 simulation runs of the same test case. The blue bars represent 300 simulations of CPU core variant X1 and the red bars represent 300 simulations of core variant X2, sorted in order of increasing CPS. Simulation identifiers for the two core variants range from 1 to 300; note that X1 sim ID 1 and X2 sim ID 1 represent different simulation runs.
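The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function name `normalize_cps` and the raw readings are hypothetical, and the only idea taken from the article is scaling CPS so that the arithmetic mean equals 100.

```python
from statistics import mean


def normalize_cps(raw_cps, target_mean=100.0):
    """Scale raw CPS values so that their arithmetic mean equals target_mean."""
    scale = target_mean / mean(raw_cps)
    return [value * scale for value in raw_cps]


# Hypothetical raw CPS readings for one test case (not data from the article).
raw = [12.0, 180.0, 240.0, 368.0]
normalized = normalize_cps(raw)

slowest, fastest = min(normalized), max(normalized)
print(f"mean of normalized CPS: {mean(normalized):.1f}")
print(f"slowest run is {100.0 / slowest:.1f}x slower than the mean")
print(f"fastest run is {fastest / 100.0:.2f}x the mean")
```

Applied to the article's own extremes, the same ratios give 100 / 5 = 20 for the slowest run and 167 / 100 = 1.67 for the fastest.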
To reduce the variability of CPS measurements, we isolated a small sub-set of the compute server cluster:
- All compute nodes in the isolated sub-set consisted of identically configured hardware, using CPUs from the same manufacturer in the same series, at the same clock frequency, and using the same amount of main memory.
- We allowed only one job per compute node in the isolated sub-set to run at a time.
The standard check-in qualification suite for the CPU core (we’ll call this QS1) consists of approximately 200 tests. By using the isolated set of homogeneous machines and averaging the CPS for all 200 simulations in the QS1 qualification suite, we achieved repeatable results: the variation in CPS from multiple QS1 regressions run on the same snapshot of the design dropped to less than 1%. Once we established and refined the measurement environment, we had a very high degree of confidence that CPS improvements made to the simulation testbench would be reflected in core simulations run on the entire compute server cluster.
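A repeatability check of this kind might look like the sketch below. The article does not say how variation was computed, so the definition used here (the range of suite-level averages divided by their mean) is an assumption, and all numbers and function names are hypothetical.

```python
from statistics import mean


def suite_cps(per_test_cps):
    """Average CPS over every test in one regression of the suite."""
    return mean(per_test_cps)


def relative_spread(suite_averages):
    """(max - min) / mean of the suite-level averages, as a fraction.

    Assumed definition of 'variation'; the article does not specify one.
    """
    return (max(suite_averages) - min(suite_averages)) / mean(suite_averages)


# Hypothetical: three regressions of the same design snapshot, four tests each.
regressions = [
    [98.0, 101.0, 100.0, 101.0],
    [99.0, 100.0, 100.5, 100.5],
    [100.0, 100.0, 99.5, 101.5],
]
averages = [suite_cps(run) for run in regressions]
spread = relative_spread(averages)

print(f"suite averages: {averages}")
print(f"relative spread: {spread:.2%}")
```

Averaging across the whole suite smooths out per-test noise, which is why the suite-level figure can be far more stable than any single run.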
How to Improve Simulation Performance
Let’s look at each of our four approaches in detail. What net improvement in simulated CPS could we achieve?
VCS Technology Improvements
The baseline testbench used VCS version 2008.12-8. After upgrading to VCS version 2009.12-14, average CPS rose by 15%. We observed this level of improvement both with and without functional coverage enabled.
Transparent Latch Modeling
Modern ASIC designs make extensive use of power-saving techniques such as clock gating and block-level power gating. These techniques are essential for reducing power consumption and heat dissipation. They also present a challenge for verification. To detect an interesting sub-set of X-propagation hazards in the presence of clock gating, and to model the back-end implementation more accurately, the build script had a compile-time option to instantiate some of the registers (D flip-flops) in the design as transparent latches. VCS, like other HDL simulators, models edge-triggered logic much more efficiently than it models transparent latches. Simulation speeds improved by 30% after these transparent latches were eliminated. In response, we ran the bulk of the simulations with the compile-time option to model transparent latches disabled, and waited until just before final tapeout to run large regressions with the compile-time option enabled to catch any latent X-propagation problems.
SystemVerilog Covergroup Coding Improvements
When a CPU core project transitions from the architectural design phase into the implementation phase, the project’s developers want to find bugs as rapidly as possible. In this phase of the project, simulation speed is paramount and there is little emphasis on coverage collection. When the project approaches tapeout, the developers redirect their focus toward coverage closure. During this phase of the project, a majority of the VCS simulations are run with functional coverage enabled. For complex CPU cores, such as those that execute the x86 instruction set, it can take weeks or even months of simulation runs to achieve functional coverage closure.
The trade-off with enabling functional coverage is that while it provides essential information about the implementation and the test stimulus, it can also slow simulation speeds significantly. Initially, we measured a slow-down of about 45% with functional coverage enabled. It became clear that optimizing simulation performance with functional coverage enabled was a top priority.
We used the VCS profiling tool to determine which modules consumed the greatest simulation time. To our surprise, the profiling report showed that 30 of the top 34 time-consuming modules were functional coverage modules, which meant functional coverage overhead was widely distributed rather than isolated to just a few modules. Reasons why functional coverage had such a large amount of overhead include:
- The covergroups made extensive use of wildcarded bins, which we learned later were inefficiently implemented in VCS.
- Many covergroups and coverpoints were automatically generated from documentation by scripts, and some of the auto-generated covergroups contained more than 10,000 variables.
- Many of these covergroups were sampled on every clock cycle even if the monitored unit was completely idle.
In response to the results from the VCS profiling tool, where possible we:
- Re-coded covergroups to remove the wildcarded bins,
- Improved the scripts to reduce the size of the auto-generated covergroups, and
- Re-wrote covergroups to eliminate unnecessary sampling.
To complement these static (compile-time) improvements, we also added a run-time optimization. After reset, the testbench executes a start-up sequence of x86 instructions that is common to all simulations for each instance of the core. There are two parts to the start-up sequence: boot ROM code, which initializes low-level model-specific registers (MSRs), and shell code, which initializes the secure virtual machine (SVM) environment including page table entries and the virtual machine control block (VMCB). The start-up sequence typically accounts for 10% to 50% of the overall run time of the simulation. The optimization keeps the functional coverage sample clock turned off until the test gets past the end of the shell code. Optionally, the optimization can be disabled when the test is launched to capture the functional coverage events during the start-up sequence as well.
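The effect of gating the sample clock can be illustrated with a toy Python model. This is not the testbench code, which would be written in SystemVerilog; the class `CoverageSampler`, the flag `sample_during_startup`, and the cycle counts are all hypothetical and only mimic the behavior described above.

```python
class CoverageSampler:
    """Toy model of a coverage sample clock gated on start-up completion."""

    def __init__(self, sample_during_startup=False):
        # sample_during_startup=True models launching a test with the
        # optimization disabled, so start-up events are also captured.
        self.sample_during_startup = sample_during_startup
        self.startup_done = False
        self.samples = 0

    def mark_startup_done(self):
        """Called once the test gets past the end of the shell code."""
        self.startup_done = True

    def on_clock(self):
        """Sample coverage only if allowed on this clock edge."""
        if self.startup_done or self.sample_during_startup:
            self.samples += 1


def run(total_cycles, startup_cycles, sample_during_startup=False):
    """Simulate a test and return the number of coverage samples taken."""
    sampler = CoverageSampler(sample_during_startup)
    for cycle in range(total_cycles):
        if cycle == startup_cycles:
            sampler.mark_startup_done()
        sampler.on_clock()
    return sampler.samples


# Hypothetical test: 1000 cycles, of which the first 300 are start-up.
gated = run(total_cycles=1000, startup_cycles=300)
ungated = run(total_cycles=1000, startup_cycles=300, sample_during_startup=True)
print(f"samples with gating: {gated}, without gating: {ungated}")
```

In this model the sampling work saved is proportional to the fraction of the run spent in start-up, which matches the observation that shorter tests gain the most.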
The combination of the static (compile-time) and start-up sequence (run-time) optimizations improved CPS by 28% overall. Correspondingly, the overhead of functional coverage decreased from 45% to about 30%. The run-time optimization is most effective for short tests, in which the start-up sequence makes up the largest fraction of overall simulation time. For longer-running tests, we observed a correspondingly smaller increase in CPS.
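The relationship between the 28% gain and the drop in overhead can be checked with a short calculation. The sketch assumes that overhead means the fractional CPS loss relative to a run with coverage disabled, and that the coverage-disabled CPS is unchanged by these optimizations; the function name is hypothetical.

```python
def overhead_after_speedup(initial_overhead, speedup):
    """Return coverage overhead after coverage-enabled CPS improves by `speedup`.

    Overhead is taken to be the fractional CPS loss relative to the same
    simulation with coverage disabled (an assumed definition).
    """
    relative_cps = (1.0 - initial_overhead) * (1.0 + speedup)
    return 1.0 - relative_cps


new_overhead = overhead_after_speedup(initial_overhead=0.45, speedup=0.28)
print(f"overhead after optimization: {new_overhead:.1%}")
```

Under these assumptions, a run at 55% of coverage-disabled speed rises to 0.55 × 1.28 = 0.704, leaving an overhead of 29.6%, consistent with the "about 30%" reported above.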
Recoding of C++ PLI Routines
One of the things we did in recoding C++ PLI routines was reduce the frequency of callbacks in the CPU core testbench. Most of the checkers and monitors in the CPU core environment are implemented in C++, and a smaller fraction is in SystemVerilog. The checkers and monitors written in C++ are hooked up to the testbench using the older VPI interface because the bulk of these C++ routines evolved from prior CPU projects that existed before SystemVerilog became standard practice in the industry. We compared execution time spent on each of the C++ modules by running VCS with profiling enabled. We observed that the configuration manager (CfgMgr) module is the most taxing of all the C++ modules (Figure 2). CfgMgr maintains and provides access to shadow copies of the configuration registers and checks the expected values against the actual (programmed) values on reading the registers. It is also used for pre-loading the configuration state determined by the randomization flow.
Figure 2: Relative execution time profile of C++ modules in the CPU core testbench.
The CfgMgr module is called from the RTL side using the VPI interface with the rising clock edge as the trigger for the callback. The CfgMgr check is redundant as long as the RTL register values do not change. Since the frequency of value change for these registers is much lower than the number of clock edges in any test simulation, we changed the way callbacks are added for the CfgMgr module. Instead of having each callback triggered by clock edge, we added callbacks for RTL paths of each field in each of these registers that are triggered if we see a value change.
This technique reduced the number of callbacks during test simulation from the total number of clock edges to the total number of value changes in these registers. We observed an average CPS improvement of 3.5% in the QS1 qualification suite. The longer a given simulation runs, the more its simulation performance benefits from this change. This effect contrasts with improvements in the functional coverage implementation because the latter benefits more from shorter tests.
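The reduction in callback count can be illustrated with a small Python model. It does not use the VPI itself, and it tracks one register value rather than individual fields; the trace and the function name `count_callbacks` are hypothetical.

```python
def count_callbacks(register_trace):
    """Return (per-clock callbacks, value-change callbacks) for a trace.

    register_trace holds one register snapshot per rising clock edge.
    """
    per_clock = len(register_trace)
    value_change = 0
    previous = None
    for snapshot in register_trace:
        # A value-change callback fires only when the snapshot differs
        # from the one seen on the previous clock edge.
        if previous is not None and snapshot != previous:
            value_change += 1
        previous = snapshot
    return per_clock, value_change


# Hypothetical trace: a configuration register written twice in 1000 cycles.
trace = [0x0] * 400 + [0x1] * 500 + [0x3] * 100
per_clock, value_change = count_callbacks(trace)
print(f"clock-edge callbacks: {per_clock}, value-change callbacks: {value_change}")
```

Because configuration registers change rarely, the value-change count stays nearly flat as the test gets longer while the clock-edge count grows linearly, which is consistent with longer simulations benefiting more.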
The other thing we did in recoding C++ PLI routines was replace VPI with Direct Kernel Interface (DKI), an API for SystemVerilog that is only supported on VCS. It has less simulation overhead and a smaller memory footprint than the VPI interface. Therefore, we were motivated to use DKI in place of VPI whenever possible in our environment.
The majority of the callbacks use the read_only_sync type. We replaced the read_only_sync and no_sync VPI callbacks with DKI callbacks, so the new environment had a mix of DKI and VPI callbacks. With this change, we measured an average performance improvement of 7% in QS1.
A 112% Net Improvement in CPS
| Type of modification | Net improvement in CPS |
|---|---|
| VCS upgrade (2008.12-8 to 2009.12-14) | 15% |
| Eliminate transparent latches | 30% |
| Functional coverage implementation | 28% |
| C++ change: CfgMgr | 3.5% |
| C++ change: VPI to DKI | 7% |
Table 1: How modifying the testbench improved CPS.
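The 112% headline figure can be reproduced by compounding the individual gains reported in the sections above: 15% for the VCS upgrade, 30% for eliminating transparent latches, 28% for functional coverage, 3.5% for CfgMgr, and 7% for VPI to DKI. The sketch assumes that the gains multiply and that all of them apply to the same coverage-enabled run; the article does not state how the total was derived.

```python
from math import prod

# Per-modification CPS gains as reported in the text of the article.
improvements = {
    "VCS upgrade": 0.15,
    "Eliminate transparent latches": 0.30,
    "Functional coverage implementation": 0.28,
    "C++ change: CfgMgr": 0.035,
    "C++ change: VPI to DKI": 0.07,
}

# Assumption: independent gains compound multiplicatively.
speedup = prod(1.0 + gain for gain in improvements.values())
net_improvement = speedup - 1.0

print(f"combined speedup: {speedup:.2f}x")
print(f"net improvement: {net_improvement:.0%}")
```

The product is about 2.12, a net improvement of roughly 112%, which matches the "approximately twice as fast" summary.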
From the results summarized in Table 1, we can conclude that:
- Inefficiencies in simulation were spread across many different facets of the simulation environment. If we focused on only one inefficiency, we would have missed finding a number of other optimizations.
- Isolating a group of identically configured compute nodes for benchmarking was essential. When measured on the entire compute cluster, small improvements to the testbench environment made no noticeable difference because of the high variability in simulation performance.
- The run-time performance degradation with functional coverage enabled was much higher than initially anticipated. Functional coverage closure is required for final sign-off and can take weeks or even months of simulation runs to achieve, which made it an important facet of the test environment to optimize.
The combination of compile-time and run-time optimizations to the functional coverage modules yielded a 28% improvement in CPS. However, not all simulation runs use functional coverage, and the remaining modifications (the C++ part of the testbench and the VCS upgrade) increased CPS across all simulation runs.
HDL simulation performance directly affects the bottom line of companies that design and implement high-performance, high-volume ASICs. Because verification closure is an essential phase of the ASIC development cycle, improving simulation speed means running more tests in the same timeframe, which can ultimately shorten the time to market. Transistor counts grow exponentially over time, and each ASIC generation demands exponentially more verification resources in the form of hardware needed to run the test cases, simulation time, or both. It is important to extract more simulation cycles from existing compute server clusters instead of throwing more hardware at the problem.