Author: Gordon Cooper, Embedded Vision Processors Product Marketing Manager, Synopsys
Computer vision technology took a dramatic leap forward after the introduction of new deep learning techniques in 2012, when AlexNet, an early convolutional neural network, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Figure 2). The competition prioritized accuracy, and in subsequent years each new winner pushed the top-1 and top-5 classification results (the accuracy of the graph's best guesses of what was in the image) until they surpassed human capabilities for the specific task of classifying images into 1,000 categories. ImageNet winners accomplished these results by throwing more computational complexity at the problem and using 32-bit floating point calculations executed on banks of GPUs. Increased compute performance helped achieve increased detection accuracy.
Convolutional Neural Networks (CNNs) have become the standard for object detection in modern computer vision. To oversimplify, a CNN algorithm is trained to break down an object such as a pedestrian into a pattern of curves, angles, and other components, store that data in its weights or coefficients, and then search images for those patterns to identify objects with surprising accuracy.
As engineers looked to apply these ImageNet CNN graphs (VGG16, GoogleNet, ResNet, etc.) to practical embedded vision applications, it became clear that ImageNet submissions had not been constrained by embedded AI SoC limits such as limited power budgets, restricted memory bandwidth, tight latency requirements and small silicon area targets. In addition, ImageNet winners were not measured against real-time requirements like meeting a target frame rate. To transition computer vision from an academic exercise to practical applications, all these issues needed to be addressed. Embedded engineers need to find a way to meet the high performance and accuracy requirements of computer vision while dealing with embedded limitations. Embedded vision processors are designed to provide the best computer vision performance with the smallest area and power penalties.
A first-order measure of computer vision performance is tera-operations per second (TOPS). Tera (10^12) is a big number, driven by the number of pixels that need to be processed and the complexity of deep learning algorithms like CNNs. Operations per second measures how much work the processor can complete each second. A simple calculation of peak TOPS for a given vision processor is 2 x the number of multiply-accumulators (MACs) x the processor frequency in Hz, divided by 10^12. The multiplication by two is used because a MAC is counted as two operations in one cycle: a multiply and an accumulate.
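As a quick check of this calculation, the following minimal sketch computes peak TOPS from a MAC count and a clock frequency. The function name `peak_tops` is invented for this illustration, and the result is a theoretical peak that assumes every MAC does useful work on every cycle.

```python
def peak_tops(num_macs: int, freq_hz: float) -> float:
    """Theoretical peak tera-operations per second.

    Each MAC is counted as two operations per cycle
    (one multiply and one accumulate).
    """
    ops_per_second = 2 * num_macs * freq_hz
    return ops_per_second / 1e12


# 64 MACs at 800 MHz
print(f"{peak_tops(64, 800e6):.2f} TOPS")    # 0.10 TOPS
# 3520 MACs at 1.2 GHz
print(f"{peak_tops(3520, 1.2e9):.2f} TOPS")  # 8.45 TOPS
```

Sustained throughput on a real graph will normally be lower than this peak, since it depends on how well the MACs are kept fed with data.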
Different computer vision applications require different levels of performance, but as a general trend, performance requirements are increasing. Facial recognition on mid-end smartphones might require less than 1 TOPS of performance. Mid-end applications such as augmented reality, surveillance, and automotive rear cameras generally need between 1 and 10 TOPS performance. On the high end are automotive front cameras used for safety-critical applications, microservers, and data centers, which can require 10 to 100 TOPS performance, or more. Embedded vision processors have been increasing their number of MACs to drive up their TOPS performance to provide a scalable solution for all these vision applications.
When Synopsys introduced its DesignWare® ARC® EV5x Vision Processor IP in 2015, it offered 64 MAC/cycle at 800MHz for about 0.1 TOPS. The EV6x, released one year later, included 880 MAC accelerators and offered about 1.3 TOPS at 800MHz. In 2017, the EV6x improved to 3520 MACs at 1.2GHz for about 8.5 TOPS of neural network performance.
In 2019, Synopsys introduced the EV7x Embedded Vision Processor IP with a deep neural network (DNN) accelerator (Figure 3). The DNN accelerator has up to 14,080 MACs and can execute all CNN graphs, including the latest, most complex graphs and custom graphs, and offers new support for batched long short-term memories (LSTMs) for applications that require time-based results. In addition to the DNN accelerator, the EV7x includes a vision engine for low-power, high-performance vision, simultaneous localization and mapping (SLAM), and DSP algorithms. Combining the performance of the EV7x DNN and the EV7x vision engine, the EV7x can scale up to 35 TOPS performance. Compared with the roughly 0.1 TOPS of the EV5x, this is about a 350x improvement (an increase of about 35,000%) in four years.
Adding MACs to an AI accelerator increases neural network engine performance to meet a range of real-world computer vision applications. However, that is only the first part of the story. In fact, adding MACs to an accelerator is the easiest aspect of scaling neural network graph performance. More challenging is: how do we make sure those MACs are kept busy? An ideal system is neither compute bound (lacking performance) nor I/O bound (lacking the necessary memory bandwidth). For a 4x increase in MACs, some increase in internal memory and some additional I/O bandwidth will need to be considered (Figure 4). But these can impact power or area of the vision processor. The best way to minimize bandwidth is to apply both hardware and software techniques to limit the data that needs to go to or from external memory.
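The balance between compute bound and I/O bound can be illustrated with a simple roofline-style estimate. This is only a sketch under stated assumptions: the function names, the bandwidth figure and the operations-per-byte figure are all hypothetical and do not describe any particular processor.

```python
def memory_limited_tops(bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Throughput (in TOPS) that the external memory system can feed."""
    return bandwidth_gb_s * 1e9 * ops_per_byte / 1e12


def attainable_tops(peak: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Throughput is capped by whichever is lower: compute or memory."""
    return min(peak, memory_limited_tops(bandwidth_gb_s, ops_per_byte))


def bottleneck(peak: float, bandwidth_gb_s: float, ops_per_byte: float) -> str:
    """Name the limiting resource for the given configuration."""
    if peak <= memory_limited_tops(bandwidth_gb_s, ops_per_byte):
        return "compute bound"
    return "I/O bound"


# Hypothetical system: 10 GB/s of external bandwidth and
# 200 operations performed per byte fetched.
for peak in (1.0, 4.0):  # a 4x increase in MACs
    print(peak, attainable_tops(peak, 10, 200), bottleneck(peak, 10, 200))
```

In this toy case, quadrupling the MACs only doubles the attainable throughput, because the design moves from compute bound to I/O bound. Raising the operations performed per byte fetched, which is what bandwidth-reduction techniques aim for, is what lets the extra MACs pay off.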
There are many techniques to improve performance and limit bandwidth. Quantization converts the 32-bit floating point coefficients and data to a smaller integer format (8 bits is currently the popular choice), cutting bandwidth to one quarter. Lossless compression of feature maps (the intermediate outputs from each layer of the CNN graph), applied as they are written to external memory and reversed as they are read back, can reduce bandwidth by as much as 40%. Sparsity (looking for and avoiding the zeros in the data) and coefficient pruning (finding out which near-zero coefficients can be set to zero) are two more bandwidth reduction techniques.
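To make quantization, pruning and sparsity concrete, here is a minimal sketch in plain Python. It assumes a simple symmetric linear quantization scheme and a fixed pruning threshold; the function names and coefficient values are made up for illustration, and production toolchains use more elaborate, accuracy-aware methods.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


def prune(values, threshold):
    """Set coefficients whose magnitude is below the threshold to zero."""
    return [0.0 if abs(v) < threshold else v for v in values]


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Made-up coefficients for illustration only.
coeffs = [0.82, -0.003, 0.41, 0.0007, -0.65, 0.002, 0.0, 0.12]
pruned = prune(coeffs, threshold=0.01)
q, scale = quantize_int8(pruned)

bytes_fp32 = 4 * len(coeffs)  # 32-bit floats
bytes_int8 = 1 * len(q)       # 8-bit integers
print(q)
print(f"storage: {bytes_fp32} bytes -> {bytes_int8} bytes")
print(f"sparsity after pruning: {sparsity(pruned):.2f}")
```

The 8-bit version occupies a quarter of the original storage, and the zeros introduced by pruning are values that sparsity-aware hardware can skip.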
In addition to these hardware techniques, new CNN graphs have been developed to achieve the accuracy of earlier graphs like ResNet or GoogleNet with significantly fewer computations. MobileNet (v1 and v2) and DenseNet are two examples of more modern CNN classification graphs. However, while both are more computationally efficient, only MobileNet is well suited for embedded vision in AI SoCs. DenseNet's topology requires extensive reuse of the feature maps, which increases bandwidth and memory requirements significantly. MobileNet, on the other hand, achieves nearly the same accuracy with significantly smaller coefficient and bandwidth requirements.
The pace of research in neural networks is rapid so new techniques continue to emerge. Synopsys’ new EV7x Embedded Vision Processor IP introduces two advanced techniques for bandwidth reduction. First, direct memory access (DMA) broadcasting distributes coefficients or data during layer computations within a CNN graph across groups of MACs. If each group of MACs can work on the same set of coefficients, the coefficients can be read once and distributed via DMA to each group, thereby minimizing bandwidth.
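The bandwidth effect of broadcasting can be sketched with a toy traffic count. This models only how many coefficient bytes are fetched from external memory for one layer; it is not a description of how the EV7x DMA is implemented, and the function name, layer shape and group count are hypothetical.

```python
def coefficient_reads(coeff_bytes: int, num_mac_groups: int, broadcast: bool) -> int:
    """Bytes of coefficients fetched from external memory for one layer."""
    if broadcast:
        # Read once, then distribute the same data to every MAC group.
        return coeff_bytes
    # Without broadcasting, each MAC group fetches its own copy.
    return coeff_bytes * num_mac_groups


# Hypothetical layer: 3x3 convolution, 64 input and 64 output channels,
# 8-bit coefficients (one byte each).
layer_coeff_bytes = 3 * 3 * 64 * 64
groups = 4  # hypothetical number of MAC groups sharing the coefficients

without_broadcast = coefficient_reads(layer_coeff_bytes, groups, broadcast=False)
with_broadcast = coefficient_reads(layer_coeff_bytes, groups, broadcast=True)
print(without_broadcast, with_broadcast)  # 147456 36864
```

The saving scales with the number of MAC groups that can share the same coefficients, which is why the technique matters more as the MAC count grows.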
A second technique, multi-level layer fusion, expands on the concept of layer merging. Layer merging combines the convolution calculations with the non-linear activation function and pooling (down-sampling) of a CNN. Multi-level layer fusion combines groups of merged layers to minimize the number of feature maps that need to be written to external memory. Both DMA broadcasting and multi-level layer fusion combine advanced hardware features and software support. Applied to the EV7x's new DNN accelerator, DMA broadcasting and multi-level layer fusion contribute to up to a 67% performance improvement and a 47% bandwidth reduction over the previous architecture, based on standard CNN graphs running on the 3520 MAC architecture.
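The effect of fusing layers on external memory traffic can be illustrated with a simple count. The sketch below assumes that layers are fused in consecutive fixed-size groups and that only the output of the last layer in each group leaves the chip; the function name and feature map sizes are hypothetical, and a real compiler would choose the groups based on available internal memory.

```python
def external_traffic(feature_map_bytes, group_size):
    """Bytes of feature maps moved to and from external memory.

    feature_map_bytes[i] is the output size of layer i. Layers are fused in
    consecutive groups of `group_size`; intermediate maps inside a group
    stay on chip. The final layer's output is always written out.
    """
    total = 0
    last = len(feature_map_bytes) - 1
    for i, size in enumerate(feature_map_bytes):
        is_group_end = (i + 1) % group_size == 0 or i == last
        if is_group_end:
            total += size      # write the feature map out
            if i != last:
                total += size  # the next group reads it back in
    return total


# Hypothetical output sizes, in bytes, for a six-layer graph.
maps = [800_000, 800_000, 400_000, 400_000, 200_000, 200_000]

unfused = external_traffic(maps, group_size=1)
fused = external_traffic(maps, group_size=3)
print(unfused, fused)  # 5400000 1000000
```

With these made-up sizes, fusing three layers at a time removes most of the feature map traffic, at the cost of holding the intermediate maps in internal memory.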
The newest generation of vision processors, which applies these techniques, makes it easier for embedded developers to meet their power, area and performance budgets when designing life-changing products based on vision and AI.
To move from research to practical reality, facial recognition algorithms need to execute on low-power, always-on hardware. Imagine paying at a parking meter with your face, using hardware built into the meter. Facial detection algorithms can run in an always-on mode on an ultra-low-power microcontroller such as the ARC EM9D low-power microcontroller IP. When a face is detected, the EV71 with DNN880 can be woken up to perform a quick facial recognition pass to see if the face can be recognized, and then quickly turned off to conserve power. To protect the confidentiality of biometric data and to protect the CNN graph's topologies and coefficients, embedded vision processors such as the EV7x include optional high-speed advanced encryption standard (AES) encryption.
To enable robots or drones to move through a crowded environment, perhaps on their way to deliver your lunch from a local restaurant or a package from your favorite store, multiple vision techniques need to be applied. SLAM is an algorithm coming out of robotics research that uses camera inputs to map the environment around the robot and the robot's position in that environment. With SLAM alone, the robot can detect an object but can't identify it. That's where CNNs come in, as CNNs are great at identifying objects. Combining SLAM with a CNN makes the robot much smarter about its environment. An EV72, with two vector processing units (VPUs) and the optional DNN3520, is well suited to a robotics or augmented reality application that combines SLAM on the VPUs to map objects with the deep neural network accelerator to identify the mapped objects.
Self-driving cars present additional challenges for embedded developers. Not only is the number of cameras in a car increasing, but the image resolution of each camera is also increasing. And for a car to take over from a human, it has to operate with the utmost reliability, including high levels of fault detection and redundancy. Embedding a vision processor with up to 35 TOPS performance brings self-driving cars a bit closer to reality. An EV74 with four VPUs combined with the large DNN14K provides the performance needed for automotive front camera and pedestrian detection while meeting ISO 26262 functional safety guidelines (Figure 5). To meet performance requirements beyond 35 TOPS, perhaps for a multi-camera automotive pedestrian detection system, multiple instances of EV74DNN14K can be connected over a network-on-chip (NoC). Because each EV7x processor delivers up to 35 TOPS, fewer instances are needed to reach 100 TOPS than with competitive solutions. Fewer instances reduce NoC traffic, removing a potential performance bottleneck. All these bandwidth reduction techniques pay off for low-end applications as well. Facial detection might only require 1 TOPS or less, but it is extremely power sensitive.