Multi-die systems have found their way into the semiconductor mainstream, providing an avenue for faster scaling of system functionality with reduced risk and time to market. For compute-intensive applications such as AI, ADAS, hyperscale data centers, and high-performance computing, multi-die systems have become a system architecture of choice.
Composed of multiple dies, also called chiplets, in a single package, multi-die systems meet the bandwidth and power demands of applications with ever-growing functionality. Their modular architecture also lets chip designers address multiple end-market opportunities through faster development of product variants.
Compared with their monolithic counterparts, multi-die systems add layers of software complexity, starting with how engineering resources are organized. A monolithic SoC is typically developed by a single team under one organizational structure, following a pre-determined schedule. A multi-die system, by contrast, is built from individual dies that may come from different teams on different schedules and may be manufactured on different foundry process nodes, so the traditional single-team flow no longer applies.
From a technical standpoint, a system that a decade ago would have spanned multiple PCBs can now be implemented as a single multi-die system, and the growth in software complexity follows directly. Beyond bringing up the software stack on the hardware of each die, the entire system must be integrated and tested. Data moves between dies over complex die-to-die connectivity interfaces such as Universal Chiplet Interconnect Express (UCIe), whose protocol layers are often based on PCIe or CXL, and these interfaces require correct software setup and programming. The picture gets particularly complicated with the inclusion of software hypervisors, where hardware interfaces are shared through complex virtualized functions and para-virtualized software drivers, such as those defined by the single root input/output virtualization (SR-IOV) standard. Such setups are hard to develop and debug in a single PCIe host/device pair, let alone in a multi-die system containing many of these interfaces. Debugging is challenging both technically and organizationally: visibility is limited because a small number of debug interfaces must serve ever-growing chip functionality, and a single issue may require teams from different companies to work together to determine the best way forward.
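To make the software burden concrete, the sketch below simulates the kind of bring-up sequence that die-to-die interface software must perform: enable each link, poll for training completion, and verify every interface before system-level traffic can flow. The register names, offsets, and training behavior here are entirely hypothetical, invented for illustration; they are not taken from the UCIe specification or any real device.

```python
# Hypothetical register map for one die-to-die link (illustrative only,
# not from the UCIe specification).
LINK_CTRL = 0x00    # bit 0 = enable link training
LINK_STATUS = 0x04  # bit 0 = link trained

class DieToDieLink:
    """Simulated register interface for a single die-to-die link."""
    def __init__(self):
        self.regs = {LINK_CTRL: 0, LINK_STATUS: 0}

    def write(self, addr, value):
        self.regs[addr] = value
        # In this simulation, enabling the link immediately completes training;
        # real hardware would take time and could fail.
        if addr == LINK_CTRL and value & 0x1:
            self.regs[LINK_STATUS] |= 0x1

    def read(self, addr):
        return self.regs[addr]

def bring_up(link, retries=3):
    """Enable a link and poll its status until training completes."""
    link.write(LINK_CTRL, 0x1)
    for _ in range(retries):
        if link.read(LINK_STATUS) & 0x1:
            return True
    return False

# A multi-die system repeats this sequence for every interface; software must
# sequence and verify each one before higher-level protocol traffic can flow.
links = [DieToDieLink() for _ in range(4)]
print(all(bring_up(link) for link in links))
```

Even in this toy form, the sequencing and verification logic multiplies with each interface, and in a real system it must also coordinate with hypervisor and driver configuration on both sides of every link.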
Ensuring that the end application performs correctly involves running the software on an emulation system for tens of billions of cycles prior to production. No decision about the hardware components of a multi-die system should be made without considering the software. And because the software continues to evolve, testing is a continuous effort.