Issue 4, 2012
Improving Compute Farm Efficiency for EDA
Many IT managers report that the average utilization of their compute farms is just 50-60%. Neel Desai, product marketing manager, Lynx Design System, explains how new technology can improve EDA runtimes by helping design teams make better use of their available compute resources and potentially delay expensive hardware upgrades.
Chip design teams are under constant pressure to meet shorter project schedules. As they have come to depend increasingly on automated tools for design synthesis, place and route, and verification, a critical factor in their ability to meet project deadlines is the turnaround time of their EDA tools: the time from job submission to getting results.
Recognizing the need for speed, EDA vendors like Synopsys invest heavily in R&D to optimize tool performance with each successive software release. Many chip companies have amassed thousands of CPUs in server farms to give their design teams the ability to create and iterate as fast as possible. But EDA tool performance counts for little if design teams are unable to take full advantage of the hardware resources at their disposal. Unfortunately, designers often lose productivity waiting for hardware, for example, when their synthesis job sits idle in a queue waiting for a suitable machine in the compute farm to become available.
Many design teams assume that the answer to this problem is always bigger and faster machines, but making better use of the existing compute farm is often more cost-effective.
Synopsys operates its own server farms and, through its services business, interacts with IT and CAD managers from many chip companies, who in turn manage their own compute farm environments. This experience has given Synopsys valuable insight into compute farm efficiency. While chip companies invest millions of dollars to provide extensive server farms for their design teams, they often do not take full advantage of that compute power. There is a practical reason why.
The Compute Farm Environment
Compute farms typically include many different types of compute resources. The reason is that few businesses can afford to upgrade their entire IT estates with the latest high-performance machines in one pass. Instead, IT managers usually upgrade their server farms by adding new machines over time. The latest machines have faster CPUs and more memory, while legacy machines offer lower performance. The challenge for IT teams is to match the business' compute demand to the available resources so that they achieve high levels of utilization, or in other words, to get the most efficient use out of the server farm so that design teams can meet their project schedules.
Today, IT managers use various procedures to optimize the use of compute farm resources. Typically, they define different job-based queues and specify the number of available slots (the number of jobs a machine can run, which is usually one per processor core). Relatively few IT managers specify or require memory restrictions. Where memory restrictions are implemented, the queues may enforce different memory limits. If a job exceeds the memory limit, "thrashing" will occur (i.e., the machine is constantly reading from and writing to disk), which slows the job down and results in suboptimal use of the compute resource.
Altruism and Compute Farms Don't Mix!
An analysis of nearly a million compute server jobs shows that designers overestimate the memory requirement 80% of the time (Figure 1). Overestimating the required amount of memory leads to underutilization of compute farm slots, because memory reserved for a job but never used cannot be given to other jobs. Fewer designers underestimate the amount of memory required, but this is potentially just as damaging to server performance: it can cause the system to run out of memory and again lead to thrashing. Underestimating memory also reduces the efficiency of the entire farm because jobs run slower and hold tool licenses for longer. Either way, overestimating or underestimating memory is bad news for server farm utilization.
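The effect of overestimation on slot usage can be illustrated with a little arithmetic. The sketch below is a minimal illustration, not a model of any particular scheduler: it assumes the scheduler reserves the full requested memory for each job, and the machine size (16 cores, 64 GB) and the request sizes are hypothetical values chosen for the example.

```python
def usable_slots(cores: int, machine_mem_gb: float, requested_mem_gb: float) -> int:
    """Number of jobs one machine can host at once.

    Assumes one slot per core and that the scheduler reserves the full
    requested memory for every job (an assumption made for this sketch).
    """
    if requested_mem_gb <= 0:
        raise ValueError("requested memory must be positive")
    slots_by_memory = int(machine_mem_gb // requested_mem_gb)
    return min(cores, slots_by_memory)


# Hypothetical machine: 16 cores and 64 GB of memory.
CORES, MEM_GB = 16, 64

# The jobs really need about 4 GB each, but the designer asks for 16 GB.
accurate = usable_slots(CORES, MEM_GB, requested_mem_gb=4)
inflated = usable_slots(CORES, MEM_GB, requested_mem_gb=16)

print(f"accurate requests: {accurate} of {CORES} slots usable")
print(f"inflated requests: {inflated} of {CORES} slots usable")
```

With these numbers, the inflated requests leave 12 of the 16 slots empty even though the machine has enough memory for all of the jobs.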
Figure 1: Comparing estimated memory requirements for compute farm jobs submitted across multiple design projects
While batch job schedulers automate the submission of jobs to available compute resources, it is up to each individual designer to manually select the appropriate resource type for his or her tool run. More often than not, this is the weak link in the resource efficiency chain. The last thing designers want is for their jobs to fail because they underestimated the resource requirement. The last thing IT managers want is for compute slots to sit underutilized because designers overestimated the resource requirement.
In environments where a compute job has to be submitted with a memory requirement, designers quite reasonably overestimate that requirement so that their jobs land on a system with more memory. Even when the resource requirements of a job clearly change, designers typically rerun the same scripts without updating the memory requirements they specify.
In using a shared server farm environment, designers compete for the use of resources, attempting to grab the best machines for their jobs. This approach inevitably increases demand on newer, higher-performance machines with large amounts of memory and leaves older, lower-performance machines with smaller amounts of memory sitting idle.
Improving Compute Farm Efficiency
One key to improving compute farm efficiency is to improve utilization by taking the guesswork out of predicting the resource requirements for individual tool runs. By more accurately matching each job's memory needs to the available queues and slots, jobs will spend less time queuing for the required compute resources and chip companies will improve turnaround times across all of their design teams. In addition, accurate prediction of job resource needs can be automated.
Improving compute farm efficiency benefits businesses in several ways. If the availability of compute resources is perceived to be a bottleneck, the obvious solution is to increase capital expenditure on faster, bigger machines with more memory. However, simply getting more out of the existing resources can postpone the time when the business has to invest in hardware upgrades.
By spending less time waiting for EDA tool runs to complete, design teams can accelerate their design schedules, squeeze in more tool runs to improve the quality of the results, or both.
Compute Farm Resource Optimization
By using a closed-loop system to continually assess jobs as designers submit them, it is possible to accurately match resources to the needs of each job. Synopsys has developed a compute farm resource optimization engine that analyses each job's resource utilization and uses the information it gathers to modify the job's parameters for future submission. Based on use in real design projects, the results show that making scheduling decisions computer-driven rather than manual enables better optimization of the compute farm.
Figure 2: Adaptive Resource Optimizer
The ARO Algorithm
Synopsys' Adaptive Resource Optimizer (ARO) algorithm monitors job usage patterns and uses the information it gathers over the course of a number of job runs to optimize the use of resources for future jobs. Designers can configure ARO to monitor memory use and then dynamically submit the job to specific queues depending on the job's needs and available resources.
Initially, ARO operates in "learn" mode for a pre-definable number of runs in order to measure the actual memory used. The optimizer will predict a realistic memory requirement for subsequent runs, which enables the use of the "right size" queue for the job, with an appropriate memory footprint. ARO may have to increase or decrease the memory requirement depending on whether the designer's initial request was an under- or over-estimate.
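The learn-then-predict behavior described above can be sketched in a few lines. This is not Synopsys' ARO implementation, whose internals the article does not describe. The class name `MemoryLearner`, the five-run learning window, the 20% headroom and the rule "largest observed peak plus headroom" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryLearner:
    """Hypothetical learn-mode predictor for one recurring job."""
    learn_runs: int = 5        # assumed number of runs to observe first
    headroom: float = 1.2      # assumed 20% safety margin over the peak
    observed_gb: List[float] = field(default_factory=list)

    def record(self, peak_gb: float) -> None:
        """Store the peak memory actually used by a finished run."""
        self.observed_gb.append(peak_gb)

    def request_for_next_run(self, designer_request_gb: float) -> float:
        """Memory to request for the next submission.

        While still learning, pass the designer's own request through.
        Afterwards, use the largest observed peak plus headroom, which may
        be lower or higher than what the designer originally asked for.
        """
        if len(self.observed_gb) < self.learn_runs:
            return designer_request_gb
        return max(self.observed_gb) * self.headroom


learner = MemoryLearner()
designer_request_gb = 32.0                     # hypothetical over-estimate
peaks_gb = [3.1, 3.4, 2.9, 3.6, 3.3, 3.2]      # hypothetical measured peaks

for run, peak in enumerate(peaks_gb, start=1):
    request = learner.request_for_next_run(designer_request_gb)
    print(f"run {run}: request {request:.2f} GB, measured peak {peak} GB")
    learner.record(peak)
```

In this example the first five runs are submitted with the designer's 32 GB request, and the sixth is submitted with about 4.32 GB. A real optimizer would likely use a more robust statistic than the maximum and would need a policy for jobs whose memory use changes abruptly.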
Having used ARO extensively within its own compute farms for the past two years, Synopsys has seen how it can improve compute farm utilization. Figure 3 shows a job's pending and turnaround times over the course of a number of job runs. In this example, ARO monitors the job over the first five runs and begins to optimize on the sixth run. ARO succeeds in reducing the average turnaround time from 204 seconds to 70 seconds, and the job pending time (i.e., the time a job sits in the queue waiting for a resource) from 144 seconds to 15 seconds, a reduction of about 90%.
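The percentages behind these figures can be checked directly. The short calculation below uses only the four timings quoted above; the helper name `percent_reduction` is ours.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


turnaround = percent_reduction(204, 70)   # average turnaround time, seconds
pending = percent_reduction(144, 15)      # job pending time, seconds

print(f"turnaround time reduced by {turnaround:.1f}%")  # 65.7%
print(f"pending time reduced by {pending:.1f}%")        # 89.6%
```

Most of the turnaround improvement comes from the shorter wait in the queue: pending time falls by 129 seconds, out of a total turnaround saving of 134 seconds.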
Figure 3: Trend report showing job pending and turnaround time for a number of runs
ARO can also be configured to dynamically change the job submission queue based on an optimized memory requirement or the estimated runtime of a job (Table 1). ARO's "queue adaptivation" feature matches each job to the right queue, which increases the likelihood of getting the job scheduled sooner. Taking the job queuing scheme shown in Table 1 as an example, ARO places a big-memory job that requires more than 128 GB of memory in the "bigmem" queue, irrespective of its runtime requirement. ARO then considers the runtime history in allocating other jobs to queues and, if necessary, overrides the users' original queue selection, which was based on their own runtime estimates. In most server farms, there are fewer resources available to support long jobs than short ones, so ensuring that the short job queues are properly utilized goes a long way toward improving the overall utilization of the farm.
Table 1: Example job queuing scheme
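A queue-selection rule of this kind might look like the following sketch. Only the "bigmem" queue and its 128 GB threshold come from the example in the text. The runtime queues ("short", "normal", "long") and their hour limits are hypothetical placeholders and are not the values from Table 1.

```python
BIGMEM_THRESHOLD_GB = 128.0   # from the article's example

# Hypothetical runtime-based queues as (name, maximum runtime in hours),
# ordered from shortest to longest. These are NOT the Table 1 values.
RUNTIME_QUEUES = [("short", 1.0), ("normal", 12.0)]
LONG_QUEUE = "long"           # hypothetical catch-all for longer jobs


def choose_queue(predicted_mem_gb: float, predicted_runtime_h: float) -> str:
    """Pick a queue from the predicted memory and runtime of a job."""
    # Memory is checked first: a big-memory job goes to "bigmem"
    # irrespective of its runtime.
    if predicted_mem_gb > BIGMEM_THRESHOLD_GB:
        return "bigmem"
    # Otherwise use the shortest runtime queue the job fits into.
    for name, max_hours in RUNTIME_QUEUES:
        if predicted_runtime_h <= max_hours:
            return name
    return LONG_QUEUE


# A designer may have asked for the "long" queue to be safe; if the runtime
# history says the job finishes in 30 minutes, it is routed to "short".
print(choose_queue(predicted_mem_gb=8, predicted_runtime_h=0.5))
print(choose_queue(predicted_mem_gb=200, predicted_runtime_h=0.5))
```

Checking memory before runtime mirrors the behavior described in the text, where the memory requirement alone decides placement in the "bigmem" queue.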
Improving Compute Farm Utilization
Synopsys and its customers have realized multiple benefits from using ARO. For example, on select server farms where ARO has been deployed, the Synopsys IT team has seen job pending times fall by as much as 75% and per-job turnaround times fall by as much as 10%. Aggregated across thousands of jobs, this can result in significant productivity gains.
Synopsys ARO is available with the Lynx Design System – Synopsys' chip design environment that includes production-proven RTL-to-GDSII design flows using the Galaxy™ Implementation Platform. Design teams can integrate third-party tools within the Lynx Design System and use ARO to schedule jobs across all of the tools within their design infrastructure. Synopsys has validated ARO with LSF, SGE and UGE, and can adapt the optimizer to work with design teams' proprietary environments.
About the Author
Neel Desai is a product marketing manager for Synopsys' Lynx Design System and Professional Services. He has over 17 years of EDA and semiconductor experience spanning both technical and marketing responsibilities, including eight years as the product marketing manager for Design Compiler, Synopsys' flagship synthesis product. Neel received his BSEE from the University of Bombay, India, his MSEE from Pennsylvania State University and his MBA from Santa Clara University.