For Artificial Intelligence (AI) applications – like pedestrian detection for an autonomous vehicle or image quality enhancement for a digital still camera – trained neural networks have surpassed programmed digital signal processors (DSPs) in performance, efficiency and algorithmic flexibility. That doesn't mean DSPs aren't needed in AI processing. In fact, just the opposite. Neural network accelerators paired with vector DSPs make a powerful combination for AI subsystems across a wide range of applications.
It's important to consider neural network processing techniques – like convolutional neural networks (CNNs) or transformers – separately from the hardware needed to run these models. There are many options for implementing them: any processor that can perform multiplications and move large amounts of data around can eventually execute these computation-heavy models. With good quantization techniques, the 32-bit floating-point weights of a trained neural network can be run on 8-bit integer controllers or processors with little to no accuracy degradation.
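As a rough illustration of what such quantization involves, the sketch below shows a simplified per-tensor symmetric FP32-to-INT8 scheme – not the method of any particular toolchain, and the random weights stand in for a trained layer. Production flows typically add per-channel scales, calibration data and sometimes quantization-aware retraining.

```python
import numpy as np

# Illustrative sketch: symmetric post-training quantization of FP32 weights to INT8.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0.0, 0.2, size=(64, 64)).astype(np.float32)  # stand-in for trained weights

scale = np.abs(weights_fp32).max() / 127.0          # one scale factor for the whole tensor
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

# Dequantize to estimate the error introduced by the 8-bit representation.
weights_dequant = weights_int8.astype(np.float32) * scale
max_abs_error = np.abs(weights_fp32 - weights_dequant).max()
print(f"scale = {scale:.6f}, max abs error = {max_abs_error:.6f}")
```

The maximum error is bounded by half a quantization step, which is why well-conditioned networks lose little accuracy when moved to INT8.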
That means a CNN inference can be processed on a CPU, a GPU, a DSP or even a lowly microcontroller and attain the same accuracy. The choice of processor matters much more when real-time performance is important – measured in frames per second, where a frame is an uncompressed image – in other words, when there is only a limited time to process each frame before the next one arrives. There are many real-time applications where high levels of performance are crucial: imagine a car barreling down on a pedestrian at 70 MPH while trying to decide whether to apply the brakes. Multiple cameras, high resolution and minimum latency all drive the need for maximum computational efficiency to make a life-or-death decision.
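To put rough numbers on that budget – the 30 fps camera rate below is an assumption for illustration, not a figure from this article – at 70 MPH a vehicle covers about a meter during each frame interval, so every late frame costs real stopping distance.

```python
# Back-of-the-envelope latency budget for the braking example above (illustrative numbers).
speed_mph = 70
speed_m_per_s = speed_mph * 0.44704        # 70 MPH is roughly 31.3 m/s
frame_rate_fps = 30                        # assumed camera frame rate
frame_budget_s = 1.0 / frame_rate_fps      # ~33 ms to finish inference before the next frame

distance_per_frame_m = speed_m_per_s * frame_budget_s
print(f"frame budget = {frame_budget_s * 1e3:.1f} ms, "
      f"car travels = {distance_per_frame_m:.2f} m per frame")
```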
Figure 1: AI applications span a wide range of performance requirements from a few GOPS to thousands of TOPS.
Figure 2: Different combinations of vector DSP and neural network performance.
Figure 3: The closely coupled combination of the Synopsys ARC VPX5 and ARC NPX6.
The ARC VPX DSP IP excels at parallel DSP processing based on a very long instruction word (VLIW)/single instruction, multiple data (SIMD) architecture and is optimized for the power, performance and area (PPA) requirements of embedded workloads. The VPX family can be configured for floating-point and multiple integer formats, including INT8 operations for AI inference. A range of performance is available: the VPX family operates on 128-bit (VPX2, VPX2FS), 256-bit (VPX3, VPX3FS) and 512-bit (VPX5, VPX5FS) vector words and can scale from one to four cores. This provides from 16 to 512 INT8 MACs per cycle (using the dual-MAC configuration on a four-core VPX5).
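A quick sanity check of those endpoints, assuming the MAC count is simply INT8 lanes (vector width / 8 bits) times MACs per lane times cores, which matches the figures quoted above:

```python
# Illustrative check of the VPX INT8 MAC scaling quoted above.
def vpx_int8_macs_per_cycle(vector_bits, cores=1, macs_per_lane=1):
    """INT8 lanes = vector width / 8 bits; dual-MAC configurations double the MACs per lane."""
    lanes = vector_bits // 8
    return lanes * macs_per_lane * cores

print(vpx_int8_macs_per_cycle(128))                             # VPX2, single MAC: 16
print(vpx_int8_macs_per_cycle(512, cores=4, macs_per_lane=2))   # four-core VPX5, dual MAC: 512
```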
The ARC NPX NPU IP is dedicated to neural network processing and is likewise optimized for the PPA requirements of real-time applications. The family scales from a 4,096 MAC-per-cycle version to a 96K MAC-per-cycle version, which can then be scaled further across multiple instances. The NPX6 family can scale from 1 to 1,000s of TOPS of AI performance on a single SoC. It has also been optimized for the latest CNN models and for the emerging transformer class of models.
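The TOPS figures follow from the MAC count and the clock: each MAC counts as two operations (multiply plus add). The sketch below uses assumed clock frequencies and instance counts purely for illustration; they are not Synopsys specifications.

```python
# Rough TOPS estimate from MACs per cycle and clock frequency (assumed values).
def tops(macs_per_cycle, clock_ghz, instances=1):
    return macs_per_cycle * 2 * clock_ghz * instances / 1e3   # 2 ops per MAC, result in TOPS

print(f"{tops(4096, 1.0):.1f} TOPS")                     # smallest NPX6 config at an assumed 1.0 GHz
print(f"{tops(96 * 1024, 1.3):.1f} TOPS")                # 96K-MAC config at an assumed 1.3 GHz
print(f"{tops(96 * 1024, 1.3, instances=8):.0f} TOPS")   # multiple instances push into the 1,000s
```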
As seen in Figure 3, the VPX and NPX families are closely integrated. ARCsync is additional RTL that provides interrupt control between the processors. Data passes over an external NoC or AXI bus that is usually readily available in an SoC design. While the two processors can operate completely independently, the VPX5 can also reach into the NPX6's L2 memory as needed.
The close integration of the VPX5 and NPX6 is also supported by a common software development toolchain, ARC MetaWare MX, which supports any combination of NPX and VPX. SoC architects can choose the right combination of DSP performance and AI performance from these scalable processor families to maximize performance and minimize area overhead. For AI-heavy workloads, the rule of thumb for a big-AI, small-DSP configuration is one VPX5 for every 8K to 16K NPX MACs (depending on the models and workload), as the sizing sketch below illustrates. For an NPX6-64K configuration, at least four VPX5 cores are recommended.
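The rule of thumb is easy to turn into a first-pass sizing check. This is a sketch of the guidance above, not a Synopsys sizing tool; the function name and default ratio are illustrative only.

```python
import math

# Sketch of the sizing rule of thumb: one VPX5 per 8K-16K NPX MACs,
# depending on the models and workload.
def recommended_vpx5_cores(npx_macs, macs_per_vpx5=16 * 1024):
    return max(1, math.ceil(npx_macs / macs_per_vpx5))

print(recommended_vpx5_cores(64 * 1024))                          # NPX6-64K at 16K per VPX5 -> 4 cores
print(recommended_vpx5_cores(64 * 1024, macs_per_vpx5=8 * 1024))  # heavier DSP demand -> 8 cores
```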
It's true that neural network processing has replaced DSP processing for specific tasks like pedestrian detection, but the SIMD capabilities of a vector DSP, combined with its DSP and AI-support capabilities, make it a valuable part of an AI system. As the demand for AI processing continues to grow in embedded applications, the combination of an NPU for AI processing and a vector DSP for NPU support and DSP processing is the best recommendation for a flexible design that helps future-proof an AI SoC for rapidly evolving AI workloads.