Before looking at how system downtime can impact electronic design automation (EDA) workloads, we will first review the broad design-for-failure principles.
Visualization
It is not enough to know what your systems look like with 100% uptime. You should also prepare for changes in the cloud environment that arise from downtime incidents. To prepare for failure, you must visualize real-time and future states.
Dependencies
Dependencies can change during an incident. If your essential tools go down, you must have a plan for how to move forward with minimal disruptions until those services come back online.
To create an effective design-for-failure approach, you must understand the types of data that persist, where the data persists, and what replication methods are available. With visualization and documentation, you can determine how your organization will respond if your system's dependencies fail. By creating redundancy among your dependent components, you can prevent single points of failure from crippling or collapsing your system.
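As a minimal sketch of the documentation step described above, the following Python snippet takes a dependency inventory and flags any dependency backed by fewer than two instances. The service and instance names are hypothetical placeholders, not taken from any real environment, and a real inventory would likely come from your own asset records rather than a hard-coded dictionary.

```python
from typing import Dict, List

# Hypothetical inventory (assumption): each dependency is mapped to the
# instances that can currently serve it.
DEPENDENCIES: Dict[str, List[str]] = {
    "license-server": ["lic-a"],
    "job-scheduler": ["sched-a", "sched-b"],
    "shared-storage": ["nfs-a"],
}


def single_points_of_failure(deps: Dict[str, List[str]]) -> List[str]:
    """Return the names of dependencies with fewer than two instances."""
    return sorted(name for name, instances in deps.items() if len(instances) < 2)


if __name__ == "__main__":
    for name in single_points_of_failure(DEPENDENCIES):
        print(f"No redundancy for: {name}")
```

Counting instances is only a starting point. It does not show whether the redundant instances share a region, a network path, or a storage backend, so the result is best treated as a checklist for further review.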
Resiliency
For your design-for-failure strategy, you should consider multi-regional cloud solutions. Deploying across several regions of a single provider protects against a regional outage, while spreading workloads across multiple cloud providers also protects against a provider-wide outage. AWS, Microsoft Azure, and Google Cloud all offer multi-cloud and hybrid service options. When designing for failure, your organization should strike the right balance between the control it keeps over its environment and the effort it invests in preparing for the worst.
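One way to picture multi-regional resiliency is a simple failover rule: try regions in order of preference and use the first one that reports healthy. The sketch below assumes a hypothetical list of region names and a stubbed health check. In practice the check would query your provider's or your own monitoring, which is not shown here.

```python
from typing import Callable, Dict, List, Optional


def first_healthy(regions: List[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in regions:
        if is_healthy(region):
            return region
    return None  # No region is available; the caller must handle this case.


# Stubbed health status (assumption): a real check would probe each region.
STATUS: Dict[str, bool] = {"region-a": False, "region-b": True}

if __name__ == "__main__":
    chosen = first_healthy(["region-a", "region-b"], lambda r: STATUS.get(r, False))
    print(f"Routing workloads to: {chosen}")
```

The function returns `None` rather than raising when every region is down, which forces the calling code to decide what a total outage should look like.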
Stakeholders
You should include a variety of stakeholders in the design-for-failure planning process, including IT leadership, cloud architects, and application DevOps teams, and ensure that they have input, access, and alignment throughout.
When an incident occurs, your teams may not be able to respond effectively unless they can collaborate. Uninformed stakeholders won’t be able to participate fully in the planning or response stages. You can use visuals to communicate downtime's potential and actual effects to a broader internal and external audience. Designing for failure allows you to role-play incident response and plan for different scenarios while keeping stakeholders up to speed.