High-performance workloads such as EDA that scale on cloud infrastructure must be able to recover from a spot VM termination signal, so that no processing time is lost when a long-running job is interrupted. The most common solution is to build checkpoint-restore functionality into the tools. Several Synopsys tools offer this capability, and users have learned over the years to use it well for their needs.
However, having checkpoint-restore available does not by itself enable spot. Spot adds a more stringent constraint on the deployment architecture: it provides only a very limited window in which to take a snapshot of the tool’s runtime memory state. AWS offers a two-minute warning, which in practice may be much shorter, and Microsoft Azure currently offers only a 30-second notice.
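To make the warning window concrete, here is a minimal sketch of how a job wrapper might detect an AWS spot interruption notice and work out how much time is left. It assumes the EC2 instance metadata service (IMDSv2) and its `spot/instance-action` path. The helper names `seconds_remaining` and `fetch_interruption_notice` are hypothetical and are not part of any Synopsys tool.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

# EC2 instance metadata service; only reachable from inside an instance.
IMDS = "http://169.254.169.254/latest"


def seconds_remaining(notice_body, now):
    """Seconds between `now` and the action time in an interruption notice.

    `notice_body` is the JSON text served at spot/instance-action, for example
    {"action": "terminate", "time": "2024-01-01T12:02:00Z"}.
    """
    notice = json.loads(notice_body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def fetch_interruption_notice(timeout=1.0):
    """Return the notice body, or None if no interruption is scheduled."""
    # IMDSv2 requires a short-lived session token before any metadata read.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no notice has been issued yet
            return None
        raise
```

A wrapper would typically poll `fetch_interruption_notice()` every few seconds and, once a notice appears, compare `seconds_remaining(...)` against the time a checkpoint is expected to take before deciding whether to start one.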
Not all EDA tools are created equal, and tools with a smaller memory footprint can successfully checkpoint their state within this warning window. Users who run verification jobs on the Synopsys VCS® functional verification solution have leveraged the tool’s built-in checkpoint-restore capabilities to run on spot and have reduced costs significantly. Similarly, for library characterization on the Synopsys PrimeLib library characterization solution, typical distributed jobs run for only a few minutes and their runtime state has a very small footprint, so customers have successfully used spot instances by simply ignoring the failures and restarting those jobs.
The challenge is more pronounced for high-memory workloads such as timing analysis, physical verification, physical design, or RTL-to-Gates implementation. The runtime state for these workloads can run to several hundred gigabytes, and the time needed to checkpoint it is much longer than the warning window that cloud providers offer for spot. As a result, jobs that are terminated while running on spot cannot be restored, because no state was saved, and several hours of runtime and compute cost can go to waste. For these workloads, checkpoint-restore capability alone is not enough to use spot effectively.
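A rough back-of-the-envelope check shows why the window is too short. The sketch below estimates a lower bound on checkpoint time as state size divided by sustained write throughput and compares it with the warning windows mentioned above. The 1 GB/s throughput and the 5 GB and 300 GB state sizes are illustrative assumptions, not measured figures for any tool, and the function names are hypothetical.

```python
def checkpoint_seconds(state_gb, write_gb_per_s):
    """Lower bound on checkpoint time: state size over sustained write rate."""
    return state_gb / write_gb_per_s


def fits_warning_window(state_gb, write_gb_per_s, window_s):
    """True if the estimated checkpoint time fits inside the warning window."""
    return checkpoint_seconds(state_gb, write_gb_per_s) <= window_s


# Warning windows taken from the text above, in seconds.
WINDOWS_S = {"AWS": 120, "Azure": 30}

# Assumed values for illustration only.
ASSUMED_WRITE_GB_PER_S = 1.0
ASSUMED_STATE_SIZES_GB = (5, 300)

for state_gb in ASSUMED_STATE_SIZES_GB:
    needed = checkpoint_seconds(state_gb, ASSUMED_WRITE_GB_PER_S)
    for provider, window_s in WINDOWS_S.items():
        ok = fits_warning_window(state_gb, ASSUMED_WRITE_GB_PER_S, window_s)
        verdict = "fits" if ok else "does not fit"
        print(f"{state_gb} GB needs about {needed:.0f} s: "
              f"{verdict} in the {provider} {window_s} s window")
```

Under these assumptions, a small-footprint job checkpoints comfortably within either window, while a 300 GB state needs about 300 seconds, well beyond both. Real checkpoint times also include serialization overhead, so the estimate is optimistic.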