Why AI Hardware Demands the Highest Verifiable QoR

Bradley Geden

Mar 12, 2020 / 6 min read

Artificial intelligence (AI) is poised to give semiconductors an enormous boost. That’s partly because more chips will be sold. But hardware will also get a bigger share of the dollars in the hardware/software stack, because, without the right hardware, software execution will be sluggish. McKinsey sees AI silicon sales quadrupling by 2025, landing somewhere in the range of $18-21 billion per year. A bigger share of that hardware will be in the cloud for training and cloud inference, but hardware for edge inference will see faster growth. So, the race is on for creating the most effective training and inference hardware engines.

But AI poses huge challenges for silicon designers in balancing performance, power, and area (PPA). Aggressive optimizations implemented throughout the RTL-to-GDSII flow help to improve overall PPA results while allowing faster design closure. Those optimizations, however, are of little value if they cannot be verified through formal equivalence checking before the design is fully committed, to ensure that overall functionality is correct.

Challenging AI Architectures

AI designs place extreme demands upon silicon and designers. The focus of functionality is on general matrix multiplications (GEMMs) and convolutions. These require intensive multiply-accumulate (MAC) operations, with large amounts of data movement that drive up power. A careful balance must be struck between computation and memory, with high memory capacity and bandwidth required to avoid limiting performance due to data movement bottlenecks.
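
To make the MAC workload concrete, here is a minimal Python sketch of a GEMM written as explicit multiply-accumulate steps. The function name gemm_mac and the small example matrices are illustrative assumptions, not taken from any particular accelerator or library; real hardware performs these MACs in parallel arrays rather than in nested loops.

```python
def gemm_mac(a, b):
    """Multiply an M x K matrix by a K x N matrix using explicit MAC steps.

    Returns the result matrix and the number of MAC operations performed.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate
                macs += 1
            c[i][j] = acc
    return c, macs


# Illustrative 2x3 by 3x2 example (values chosen arbitrarily).
a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c, macs = gemm_mac(a, b)
print(c, macs)
```

The MAC count is M x N x K, and each MAC reads two operands, so without operand reuse the data traffic grows just as fast as the arithmetic. That is the computation-versus-memory balance described above.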

A full AI solution involves a complete software stack atop the hardware, along with design tools and an AI Software Development Kit (SDK). But it’s the hardware that’s critical for scaling. Google claims that, without its Tensor Processing Unit (TPU), the growth in Google Assistant would have required that the data center size be doubled. According to McKinsey, the majority of AI hardware will be an SoC or, like the TPU, an ASIC.

PPA refers to the balance and tradeoffs between speed, power, and silicon area (which means cost). For AI, each one of these can be important:

  • Inference, whether in the cloud or at the edge, is often a real-time operation for applications like voice assistants – making speed paramount. Cloud training isn’t a real-time operation, but it’s so compute-intensive that speed remains critical.
  • Power is a critical characteristic at the edge, where the operation may be carried out in battery-powered devices. Power is less of a priority in the cloud, although it must still be kept in check to hold down data-center energy and cooling costs.
  • Cost is also important for edge applications, where low-cost hardware may support AI functionality. Cost is less of a concern in the cloud, where hardware may be re-used for many different clients, and where that use can be monetized by the cloud provider.

Keeping the balance between these three attributes is a really tough job. It may mean trying multiple architectures, and, for each of those, it places a huge burden on the EDA tools.

Design Optimizations Help with PPA

To help drive better PPA results, some tools have implemented specific AI-oriented optimizations that take place throughout the RTL-to-GDSII flow from synthesis to implementation. AI designs are computationally intensive and have significant data-path content. Data-path optimization enables designers to realize efficiencies by combining data-path operations, eliminating redundancies, and optimizing PPA. Retiming and boundary optimizations provide further improvements in PPA. Boundary optimizations work on the interface between two abutting blocks. As an example, if there’s an inverter on the output of one block and an inverter on the input of the next block, then both inverters can be removed.
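
The inverter example can be sketched in a few lines of Python. The two blocks below (block_a, block_b) and their logic are hypothetical, chosen only to show a back-to-back inverter pair at a block boundary; they do not model how any synthesis tool represents a netlist.

```python
from itertools import product


def block_a(x, y):
    # Hypothetical block A: an AND gate followed by an inverter on its output port.
    return not (x and y)


def block_b(p, z):
    # Hypothetical block B: an inverter on its input port feeding an OR gate.
    return (not p) or z


def before_optimization(x, y, z):
    # Block A drives block B, so two inverters sit back to back at the boundary.
    return block_b(block_a(x, y), z)


def after_optimization(x, y, z):
    # Both boundary inverters removed: the AND gate drives the OR gate directly.
    return (x and y) or z


# Compare the two versions on every input combination.
inputs = list(product([False, True], repeat=3))
still_equivalent = all(
    before_optimization(*v) == after_optimization(*v) for v in inputs
)
print(still_equivalent)
```

Two gates disappear and the function is unchanged, but the optimization crosses a hierarchy boundary, which is exactly the kind of structural change a later equivalence check has to account for.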

AI designs, due to the significant amount of cross-connection between arithmetic units, are heavily congested. Even if you can achieve timing and power goals on the original synthesis, if a design cannot be routed, then it’s useless. Bringing congestion information into the flow as early as possible allows designers to take an early look at the impact of optimizations on congestion. At a minimum, a synthesis solution should be using the same engine as the implementation solution for congestion analysis. Taking this one step further, if the two solutions are operating on a unified data model, then it’s possible to bring even more downstream optimizations upstream and vice versa.

Figure 1: Advanced optimizations in synthesis utilized in AI designs.

Advanced optimizations make extensive changes to the design structure. Ideally, all of the changes are made perfectly, and the original functionality remains. But the cost of a mask re-spin and the opportunity cost of being late to market mean that designers will want to start running equivalence checking as early as possible in the design process and keep running it through the synthesis iterations. That way, as they experiment with various optimizations, they ultimately end up with a design that has the highest Quality of Results (QoR)/PPA and can still be verified.

Advanced Equivalence Checking Provides the Necessary Proof

The only way to be confident, and to give all stakeholders that confidence, that overall functionality remains unperturbed through the design flow and optimization is through formal logic equivalence checking. But some of the more sophisticated optimizations may obscure internal points that must be proven equivalent. This can lead to a proof that takes a very long time to complete, or that might not complete at all, forcing design engineers to switch off optimizations and sacrifice their PPA goals.

The speed with which the proof can be obtained is critical. Since multiple synthesis runs may be needed, the formal proof will run many times. It can’t become a significant rate-determining step, or else schedule pressures will tempt design teams to omit the proof, putting the design at risk. In order to make the proofs easier and faster, the equivalence-checking engine must understand the optimizations so that it can then prove functional equivalence. This “knowledge” gives the checker hints as to the changes that have happened; the duty then falls on the checker to prove that the changes made to the design at each step of the optimization process have not resulted in a change in functionality.
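
As a rough illustration of what the checker has to establish, the Python sketch below compares a reference data path against two rewritten versions by brute force over all inputs. This is an assumption-laden toy: the 4-bit width, the function names, and the specific rewrites are invented for illustration, and production equivalence checkers rely on formal techniques rather than exhaustive simulation, which does not scale beyond tiny designs.

```python
from itertools import product

WIDTH = 4                 # assumed operand and result width in bits
MASK = (1 << WIDTH) - 1   # results wrap to WIDTH bits, as in fixed-width hardware


def reference(a, b, c):
    # Reference data path: two multipliers and an adder.
    return (a * b + a * c) & MASK


def shared_multiplier(a, b, c):
    # Valid data-path rewrite: one adder and one multiplier.
    return (a * ((b + c) & MASK)) & MASK


def buggy_rewrite(a, b, c):
    # Invalid rewrite: OR only matches addition when b and c share no set bits.
    return (a * (b | c)) & MASK


def find_counterexample(golden, revised):
    """Return the first input where the two designs differ, or None if equivalent."""
    for a, b, c in product(range(1 << WIDTH), repeat=3):
        if golden(a, b, c) != revised(a, b, c):
            return (a, b, c)
    return None


print(find_counterexample(reference, shared_multiplier))
print(find_counterexample(reference, buggy_rewrite))
```

Even at 4 bits the loop visits 4,096 input combinations, and the count grows exponentially with width. That is why a practical checker needs information about which optimizations were applied instead of rediscovering the correspondence on its own.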

As tape-out approaches, last-minute engineering change orders (ECOs) are inevitable. Larger, more complex ECOs that cannot be implemented manually need to go through synthesis as the first step in the ECO process. There is no point in pushing the PPA limits if optimizations must be switched off during ECO synthesis for the ECO tool to generate a patch. An ECO solution must maintain the optimizations made during original synthesis but avoid a complete run through the entire toolchain. It is important that the equivalence checker be able to guide the ECO process, ensuring that, even after localized fixes have been made, the optimizations and all other functionality, unrelated to the ECO, remain intact.
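
One way to picture the "unrelated functionality remains intact" requirement is an output-by-output comparison of the design before and after the ECO. The sketch below is purely illustrative: the designs pre_eco and post_eco, their output names, and the change to the "flag" output are hypothetical, and a real ECO flow works on netlists with formal proofs rather than Python functions.

```python
from itertools import product


def pre_eco(a, b, c):
    # Hypothetical original design with three outputs.
    return {
        "sum": a ^ b ^ c,
        "carry": (a & b) | (c & (a ^ b)),
        "flag": a & b & c,
    }


def post_eco(a, b, c):
    # Hypothetical patched design: only "flag" is meant to change.
    # "carry" is implemented with a different structure but the same function.
    return {
        "sum": a ^ b ^ c,
        "carry": (a & b) | (a & c) | (b & c),
        "flag": (a & b) | c,
    }


def changed_outputs(old, new, n_inputs):
    """Return the sorted names of outputs that differ for at least one input."""
    changed = set()
    for bits in product([0, 1], repeat=n_inputs):
        old_out, new_out = old(*bits), new(*bits)
        changed.update(name for name in old_out if old_out[name] != new_out[name])
    return sorted(changed)


print(changed_outputs(pre_eco, post_eco, 3))
```

If the returned list contained anything beyond the output the ECO targets, the patch would have disturbed logic it was not supposed to touch.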

Figure 2: ECO flow needs to use the same advanced optimizations as original synthesis

AI Competition is Fierce

AI is the hot new area, and everyone wants an AI story. For some, it will mean utilizing AI platforms and differentiating in software. For others, it will mean creating a custom chip. Both scenarios mean the sale of more chips into more systems.

During this early phase of AI maturation, everyone is competing for the architecture that provides the fastest performance, the lowest power, and the lowest cost. While individual designs will involve specific tradeoffs amongst these three metrics, everyone will be working hard to optimize all three as much as possible, trying to trade off as little as possible.

Effective EDA tools are critical for that effort. The advanced optimizations they can provide will boost PPA results without requiring extensive manual intervention, keeping your development schedule marching forward. But the cost of any design re-spin means that you need to achieve the highest QoR while still being able to verify your design.

Running an equivalence check provides that proof. But the optimizations are sophisticated enough that your checker must be aware of them in order to prove their validity. Tools like Synopsys’ Formality® logic equivalence checking solution can understand the advanced optimizations used in the Design Compiler® synthesis solution or the Fusion Compiler™ RTL-to-GDSII solution, prove the correctness of those optimizations, and complete their checks quickly. That gives you confidence that your design will move forward with PPA results that will make your AI chip stand out in a crowded market.
