CNN engines (neural network processors) are a new beast that only a battle-tested champion can slay. While implementing these blocks, going down the wrong path can be catastrophic to the project schedule. Therefore, designing with foundation IP blocks that offer flexibility for course correction during the design cycle is necessary for a successful product rollout.
Physical design implementation of machine learning blocks usually needs floorplan iterations to determine the best placement of macros and logical hierarchies within the given die area. The iterations may require modifying the aspect ratio of both the core area and the macros themselves so that slower macros can be placed near their logical hierarchies. Dealing with this iterative churn requires a wide variety of memory cuts from compilers that trade off timing against aspect ratio. The relative placement of logic hierarchies, in turn, is affected by the routing track resources available to fit those hierarchies in a given space. While the top routing layer restrictions are already defined for a block by the top-level designer, adjusting the power-ground (PG) grid to the specific logic library can optimize core density. Lack of a recipe to start PG grid design can cause implementation delays.
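To make the timing-versus-aspect-ratio trade-off concrete, here is a minimal sketch of how a floorplan iteration might filter candidate memory cuts. All names, the column-mux geometry model, and every number are illustrative assumptions, not figures from any real memory compiler.

```python
# Hypothetical sketch: screening memory-compiler cuts by aspect ratio and
# access time during floorplan iterations. The geometry and timing models
# below are illustrative assumptions, not real compiler data.
from dataclasses import dataclass

@dataclass
class MemoryCut:
    words: int
    bits: int
    mux: int          # column-mux factor: higher mux -> wider, shorter cut
    access_ns: float  # illustrative access time

    @property
    def aspect_ratio(self) -> float:
        # Toy model: column muxing trades height for width.
        width = self.bits * self.mux
        height = self.words / self.mux
        return width / height

def cuts_fitting(cuts, max_ar, max_access_ns):
    """Keep cuts that meet both a floorplan aspect-ratio cap and timing."""
    return [c for c in cuts
            if c.aspect_ratio <= max_ar and c.access_ns <= max_access_ns]

# Toy trend: higher mux shortens bitlines, so access gets faster but wider.
candidates = [MemoryCut(4096, 32, m, 0.9 - 0.03 * m) for m in (4, 8, 16)]
viable = cuts_fitting(candidates, max_ar=1.0, max_access_ns=0.70)
# The squarest cut (mux=4) misses timing, the fastest (mux=16) is too
# wide for this floorplan slot; only the mux=8 cut satisfies both.
```

A richer compiler offering widens the `viable` set, which is exactly what gives the floorplanner room to course-correct without re-opening the macro selection.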
Congestion in MACs
With the macros fixed, the challenge then becomes managing placement and routing congestion in the logic area while tuning the design for power-per-MHz targets. MAC blocks are notorious for routing congestion caused by higher-than-normal pin density and high net connectivity. When represented as schematics, MAC blocks have a naturally triangular shape as data passes through them, which is why optimal results are often achieved by hand-placing the datapath elements. EDA tools have made good progress in bridging the gap between a full-custom layout of the MAC block and an algorithmically derived placement of individual cells that also honors process design rules at shrinking geometries. However, some EDA tools require compatible standard cell structures to complete the solution. Whether hand-placing the elemental blocks or relying on the tool to do so, the need for larger multi-input elemental adders, multiplexers, compressors, and sequential cells in different circuit topologies and sizes is evident.
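The multi-input compressors mentioned above are what give a MAC datapath its triangular reduction shape. As a bit-level sketch (pure Python for clarity, not RTL, and not any vendor's cell library), a 3:2 compressor reduces three addends to two without propagating carries, and a tree of them collapses many partial products before one final carry-propagate add:

```python
# Illustrative model of 3:2 compression (carry-save addition), the kind of
# multi-input elemental cell a MAC datapath is built from.

def compress_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Reduce three addends to two (sum, carry) with no carry propagation.

    Per bit: sum = a XOR b XOR c; carry = majority(a, b, c), shifted left.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def csa_tree_sum(operands: list[int]) -> int:
    """Sum many partial products through a tree of 3:2 compressors,
    finishing with a single carry-propagate addition."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        s, carry = compress_3_2(a, b, c)
        ops += [s, carry]
    return sum(ops)  # the final carry-propagate adder

# Four illustrative multiply partial results fed into the reduction tree:
partial_products = [x * y for x, y in [(3, 5), (7, 2), (4, 4), (6, 6)]]
assert csa_tree_sum(partial_products) == sum(partial_products)
```

Because each 3:2 stage halves-and-a-bit the operand count, the wiring fans in toward the final adder, mirroring the triangular schematic shape that makes hand-placement of these cells so effective.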
Achieving Differentiated PPA in Time
Design timelines are long, and tapeout schedules are short. The squeeze in design cycles means there is no time to ramp up a design team on advanced process nodes. Integrating validated IP that has seen silicon success, along with a tool recipe that can start the implementation process on day one, is a “must have” to win in competitive markets. With ever-shrinking nodes, there is a limited choice of foundries to use for an AI-powered SoC. Even more limited is the choice of logic library and memory design kits provided by the foundry. This raises the question: how will the PPA of a design using a foundry default design kit differentiate itself in the market? A truly unique IP solution is crucial to pushing the PPA envelope beyond what is achieved by commonly known optimization tactics and a basic toolkit for implementation.