Relying on traditional design processes will not produce the high-performance, market-leading AI solutions that every company aims for. Designers must consider a wide array of semiconductor solutions. A 2018 Semico market report states that “Architectures for both training and inference are continually being refined to arrive at the optimum configuration to deliver the right level of performance.”
Datacenter architectures include GPUs, FPGAs, ASICs, CPUs, accelerators, and High-Performance Computing (HPC) solutions, while the mobile market is a potpourri of heterogeneous on-chip processing solutions such as ISPs, DSPs, multi-core application processors, and audio and sensor processing subsystems. These heterogeneous solutions are paired with proprietary SDKs to accommodate AI and deep learning capabilities. In addition, the automotive market varies widely based on the autonomy level targeted: as can be expected, Level 5 autonomous SoCs must support far more bandwidth and compute capability than Level 2+ autonomous SoCs.
The three consistent challenges across these AI designs include:
- Adding specialized processing capabilities that are much more efficient at performing the necessary math, such as matrix multiplications and dot products (see the sketch after this list)
- Efficient memory access for processing the unique coefficients, such as weights and activations, needed for deep learning
- Reliable, proven real-time interfaces for chip-to-chip, chip-to-cloud, sensor data, and accelerator-to-host connectivity
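To make the first challenge concrete, the kernel most AI accelerators target is dense matrix multiplication. The following is a minimal NumPy sketch (the layer sizes are arbitrary, illustrative assumptions) showing how a fully connected layer reduces to dot products, and how quickly the multiply-accumulate count grows:

```python
import numpy as np

# A single dense (fully connected) layer is a matrix multiplication:
# every output element is the dot product of an input row and a weight column.
batch, n_in, n_out = 32, 1024, 4096            # illustrative layer sizes

x = np.random.randn(batch, n_in).astype(np.float32)   # input activations
w = np.random.randn(n_in, n_out).astype(np.float32)   # trained weights

y = x @ w                                      # the kernel accelerators optimize

# Each output element costs n_in multiply-accumulates (MACs):
macs = batch * n_in * n_out
print(f"{macs / 1e6:.0f}M MACs for one layer")  # ~134M
```

A general-purpose core works through these multiply-accumulates a few at a time; a dedicated MAC array can execute thousands per cycle, which is the efficiency gap specialized processing exploits.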
One of the biggest hurdles for machine learning algorithms is that the memory access and processing capabilities of traditional SoC architectures are not as efficient as needed. For example, popular von Neumann architectures, in which compute stalls while data shuttles between memory and processor, have been criticized as not effective enough for AI, resulting in a race to build a better machine through SoC system design.
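One way to see this bottleneck is a back-of-the-envelope roofline check: if a layer performs few operations per byte fetched from memory, the memory interface, not the compute units, sets the ceiling. The peak ratings below are illustrative assumptions, not any particular chip:

```python
# Illustrative peak ratings (assumed, not a real chip):
peak_flops = 10e12     # 10 TFLOP/s of raw compute
peak_bw    = 100e9     # 100 GB/s of DRAM bandwidth

# Same dense layer as above: FLOPs vs. bytes that must cross the memory bus.
flops = 2 * 32 * 1024 * 4096                             # multiply + add per MAC
bytes_moved = 4 * (32 * 1024 + 1024 * 4096 + 32 * 4096)  # fp32 inputs + weights + outputs

intensity = flops / bytes_moved                     # ~15 FLOP per byte
attainable = min(peak_flops, intensity * peak_bw)   # roofline limit
print(f"{intensity:.1f} FLOP/B -> {attainable / 1e12:.2f} TFLOP/s attainable of 10")
```

In this sketch the layer is memory-bound and attains only a fraction of peak compute, which is exactly the inefficiency new AI architectures try to design out.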
Those fortunate enough to be designing second- and third-generation AI-targeted SoCs have added more efficient AI hardware accelerators and/or added capabilities to existing ISPs and DSPs to address neural network challenges.
However, simply adding an efficient matrix multiplication accelerator or a high-bandwidth memory interface is proving helpful but insufficient for market leadership in AI, reinforcing the need to optimize the entire system design specifically for AI.
Machine learning and deep learning apply to a wide variety of applications, so designers vary widely in how they define the objective of a specific hardware implementation. In addition, the math underlying machine learning is changing rapidly, making architectural flexibility a strong requirement. Vertically integrated companies may be able to narrow the scope of a design to a specific purpose, enabling deeper optimization, while still retaining the flexibility to accommodate additional, evolving algorithms.
Finally, benchmarking across AI algorithms and chips is still in its infancy, as discussed in The Linley Group’s Microprocessor Report article “AI Benchmarks Remain Immature”:
“Several popular benchmark programs evaluate CPU and graphics performance, but even as AI workloads have become more common, comparing AI performance remains a challenge. Many chip vendors quote only peak execution rate in floating-point operations per second or, for integer-only designs, operations per second. But like CPUs, deep-learning accelerators (DLAs) often operate well below their peak theoretical performance owing to bottlenecks in the software, memory, or some other part of the design. Everyone agrees performance should be measured when running real applications, but they disagree on what applications and how to run them.” (January 2019)
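The gap the report describes is easy to quantify: divide the operations a real model actually needs by the measured inference time, and compare the result against the quoted peak. All figures below are assumed for illustration:

```python
# All figures assumed for illustration.
peak_tops = 8.0              # vendor-quoted peak, trillions of ops per second
ops_per_inference = 7.7e9    # ops for a roughly ResNet-50-class model (assumed)
measured_latency = 5e-3      # seconds per inference on real hardware (assumed)

achieved_tops = ops_per_inference / measured_latency / 1e12
print(f"achieved {achieved_tops:.2f} TOPS = {achieved_tops / peak_tops:.0%} of peak")
```

Utilization in the low double digits, as in this sketch, is why peak ratings alone say little about real application performance.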
Interesting new benchmarks are beginning to address specific markets. For example, MLPerf is currently tackling the effectiveness of training on AI SoCs and has plans to expand. While this is a great start, training is only a small subset of the many different markets, algorithms, frameworks, and compression techniques that affect a system’s results.
Another organization, AI-Benchmark, focuses on benchmarking the AI capabilities of mobile phones. Mobile phones use a handful of chipsets, and some early-generation versions include no AI acceleration beyond their traditional processors, relying instead on AI-specific software development kits (SDKs). These benchmarks show that leveraging existing, non-AI-optimized processing solutions does not provide the required throughput.
The processor or array of processors selected typically has a maximum rating in operations per second, or a specific top frequency for a given process technology, and processor performance is further dictated by the capability of each instruction. Interface IP (PCIe®, MIPI, DDR) and foundation IP (logic libraries, memory compilers), on the other hand, have maximum theoretical memory bandwidth and data throughput levels that are, in the case of interface IP, often defined by standards organizations.
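For interface IP, those theoretical maximums follow directly from the standard. As a simple example (assuming a single 64-bit channel of DDR4-3200), peak bandwidth is just the data rate times the bus width:

```python
# Peak theoretical bandwidth from the standard's data rate and bus width
# (example: one 64-bit channel of DDR4-3200).
data_rate = 3200e6          # transfers per second (3200 MT/s)
bus_width_bytes = 64 / 8    # 64-bit channel = 8 bytes per transfer

peak_bw = data_rate * bus_width_bytes
print(f"{peak_bw / 1e9:.1f} GB/s peak")   # 25.6 GB/s; sustained traffic is lower
```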
However, the true performance of a system is not the sum of these parts; it lies in the ability to properly connect processors, memory interfaces, and data pipes together. Total system performance is determined by the capabilities of each integrated piece and by how well they are optimized to work together.
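A simple way to reason about this is to model each integrated piece as a throughput limit: the end-to-end pipeline can run no faster than its slowest link. The stage names and numbers below are hypothetical:

```python
# Hypothetical per-stage throughput limits for one inference data path (GB/s):
stages = {
    "sensor input (MIPI)":   6.0,
    "DRAM interface (DDR)": 25.6,
    "AI accelerator":       40.0,
    "host link (PCIe)":     15.8,
}
bottleneck = min(stages, key=stages.get)  # slowest stage gates the whole system
print(f"pipeline is gated by {bottleneck!r} at {stages[bottleneck]} GB/s")
```

In this sketch the accelerator’s headline rating is irrelevant; the sensor interface gates everything, which is why system-level optimization matters more than any single component’s peak.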
While designers have made rapid advances in the processors, SDKs, math, and other contributing aspects of AI SoC design, these same changes have made apples-to-apples comparisons difficult.