Embedding Vision in Next-Generation SoCs

Mike Thompson, Sr. Product Marketing Manager, Synopsys

Computer vision is the acquisition, processing, analysis and understanding of real-world visual information with computers. Until recently this was practical only on PCs and other high-end machines, but advances in microprocessor technology are now enabling designers to integrate computer vision into SoCs. The resulting practical and widely deployable embedded vision functionality is showing up in emerging consumer applications such as home surveillance, gaming systems, automotive driver-assistance systems, smart glasses and augmented reality. This is giving rise to a whole new class of processors designed specifically for embedded vision, offering very high vision performance at the low power-consumption levels required by embedded applications. Embedded vision technology is being included in a growing number of SoC designs and will have a profound impact on many aspects of our lives.

Components in a Typical Vision System

Embedded vision applications vary, but a typical vision system uses a similar sequence of distinct steps to process and analyze image data. This sequence is referred to as a vision pipeline, and the computational requirements of its stages differ vastly, as shown in Figure 1:

Figure 1: Vision pipeline

At the front of the pipeline, it is common to see algorithms with simple data-level parallelism and regular computations. However, in the middle portion, the data-level parallelism and the data structures are more complex, and the computation is less regular, requiring more control. At the back end, the algorithms are more general-purpose in nature.

The Challenges of Embedding Vision

Although technology has made very powerful microprocessor architectures available, implementing a computer vision algorithm on embedded platforms remains a challenging task.

For example, constructing a typical image pyramid (a set of progressively scaled copies of an image, used so that objects can be detected at different sizes) for a VGA frame (640x480) requires 10-15 million instructions per frame. Multiply this by 30 frames per second and a processor capable of 300-450 MIPS is needed just to handle this preliminary processing step, let alone the more advanced tasks required later in the pipeline.
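
As a rough illustration of what this front-end step involves, the sketch below uses OpenCV (a library discussed later in this article) to build a simple pyramid with cv::pyrDown(). The level count and the synthetic input frame are placeholders, and real detectors often use finer scale steps than pyrDown's fixed 2x; the point is that every level must be reprocessed for every frame, which is where the hundreds of MIPS come from.

// Illustrative sketch only: a small image pyramid built with OpenCV.
// cv::pyrDown() blurs the image and halves each dimension per level.
#include <opencv2/imgproc.hpp>
#include <cstdio>
#include <vector>

int main()
{
    cv::Mat frame(480, 640, CV_8UC1, cv::Scalar(128));    // stand-in VGA frame
    std::vector<cv::Mat> pyramid{frame};
    for (int level = 1; level < 5; ++level) {              // level count is arbitrary
        cv::Mat down;
        cv::pyrDown(pyramid.back(), down);                  // Gaussian blur + 2x downsample
        pyramid.push_back(down);
    }
    for (const cv::Mat& m : pyramid)
        std::printf("%d x %d\n", m.cols, m.rows);           // 640x480, 320x240, ...
}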

This is not a problem for a desktop processor with a GPU that costs hundreds of dollars and consumes tens of watts, but it is a very different story for a resource-constrained embedded processor. Some specific challenges encountered by embedded vision systems include:

  • Power consumption: Embedded applications are often constrained to power budgets in the hundreds of milliwatts or less.
  • Computational requirements: Vision applications have extremely high computational requirements, ranging from a few Giga Operations per Second (GOPS) to several hundred GOPS.
  • Memory usage: The on-chip memory for an embedded application is limited due to power and cost, but vision processing tasks require large buffers to store image data.
  • The dynamic nature of the market: The vision market is developing and changing rapidly, making hardwired, fixed-function implementations too limiting.

These challenges emphasize the need for a flexible, configurable, low-power embedded vision platform with user-level programmability.

Embedded Vision Requires Accuracy and High-Quality Results

There are many algorithms used to implement embedded vision, but implementations based on Convolutional Neural Networks (CNNs) are delivering better results than other vision algorithms. CNN-based systems attempt to replicate how biological vision works: they are designed to recognize visual patterns directly from pixel images with minimal pre-processing, can recognize patterns with extreme variability, and are robust to distortions and simple geometric transformations.

Fundamentally, a CNN is a chain of processing steps that takes an image of a certain size as input and produces a decision or classification (usually in the form of a probability) as output. Each network is customized for the particular task it needs to accomplish. In addition to detection, classification and localization of virtually any kind of object, CNNs can also be used in many other areas such as scene recognition and labeling.
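
To make the "chain of processing steps" concrete, here is a minimal, self-contained sketch of a CNN-style forward pass: one convolution with a ReLU activation, one pooling step, and a fully connected layer whose outputs are turned into class probabilities with softmax. All sizes, weights and the class count are toy values chosen purely for illustration; this is not the structure of any particular network or product.

// Toy CNN-style forward pass (illustration only):
// convolution + ReLU -> 2x2 max pooling -> fully connected -> softmax.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Image = std::vector<std::vector<float>>;   // single-channel feature map

// 3x3 convolution ("valid" padding) followed by a ReLU activation
Image convRelu(const Image& in, const float k[3][3])
{
    int h = int(in.size()) - 2, w = int(in[0].size()) - 2;
    Image out(h, std::vector<float>(w, 0.f));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.f;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    acc += in[y + ky][x + kx] * k[ky][kx];   // one MAC each
            out[y][x] = std::max(0.f, acc);                  // ReLU
        }
    return out;
}

// 2x2 max pooling halves each dimension
Image pool(const Image& in)
{
    int h = int(in.size()) / 2, w = int(in[0].size()) / 2;
    Image out(h, std::vector<float>(w));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y][x] = std::max({in[2*y][2*x],   in[2*y][2*x+1],
                                  in[2*y+1][2*x], in[2*y+1][2*x+1]});
    return out;
}

// Fully connected layer with toy weights, followed by softmax
std::vector<float> classify(const Image& feat, int numClasses)
{
    std::vector<float> flat;
    for (const auto& row : feat) flat.insert(flat.end(), row.begin(), row.end());
    std::vector<float> scores(numClasses, 0.f);
    for (int c = 0; c < numClasses; ++c)
        for (size_t i = 0; i < flat.size(); ++i)
            scores[c] += flat[i] * 0.01f * float(c + 1);     // toy weights
    float maxS = *std::max_element(scores.begin(), scores.end()), sum = 0.f;
    for (float& s : scores) { s = std::exp(s - maxS); sum += s; }
    for (float& s : scores) s /= sum;                        // softmax -> probabilities
    return scores;
}

int main()
{
    Image img(32, std::vector<float>(32, 0.5f));             // toy 32x32 "image"
    const float kernel[3][3] = {{0.1f,0.1f,0.1f},{0.1f,0.1f,0.1f},{0.1f,0.1f,0.1f}};
    std::vector<float> probs = classify(pool(convRelu(img, kernel)), 3);
    for (size_t c = 0; c < probs.size(); ++c)
        std::printf("class %zu: %.3f\n", c, probs[c]);       // one probability per class
}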

CNN Implementation in Embedded Systems

Applications that use CNNs are computationally intensive. Their requirements depend on the frame size and the number of scales in the image pyramid, as well as the size of the CNN input image and its filters. For example, for a VGA frame and an image pyramid using eight scales with a 1.25x downscale factor and a typical 4-layer CNN, 60-70 million multiply-and-accumulate (MAC) operations are required per frame for a relatively simple application, or about two billion MACs for 30 frames per second processing. This number is too high even for the fastest embedded processors. For more complex applications the number can be as high as 800 million to one billion MACs per frame.
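
One rough way to see where such MAC counts come from is to multiply, for every pyramid scale and every convolution layer, the output size by the filter size and the channel counts. The sketch below does exactly that. The 4-layer configuration is an arbitrary assumption and the estimate ignores inter-layer subsampling and border effects, so the printed numbers only illustrate the form of the calculation, not the specific figures quoted above.

// Back-of-envelope MAC estimator (illustration only): per convolution layer,
// MACs ~ outputWidth * outputHeight * kernel^2 * inputChannels * outputChannels,
// summed over every layer and every pyramid scale.
#include <cstdio>

int main()
{
    const double fps = 30.0, downscale = 1.25;
    const int scales = 8;
    // Hypothetical layers: {kernel size, input channels, output channels}
    const int layers[4][3] = {{5, 1, 4}, {3, 4, 8}, {3, 8, 8}, {1, 8, 2}};

    double macsPerFrame = 0.0;
    double w = 640.0, h = 480.0;                  // VGA input
    for (int s = 0; s < scales; ++s) {
        for (const auto& l : layers)              // every layer at this scale
            macsPerFrame += w * h * l[0] * l[0] * l[1] * l[2];
        w /= downscale;                           // next, smaller pyramid level
        h /= downscale;
    }
    std::printf("~%.0f million MACs per frame, ~%.2f billion MACs/s at %.0f fps\n",
                macsPerFrame / 1e6, macsPerFrame * fps / 1e9, fps);
}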

Currently, CNN-based architectures are mainly mapped onto CPU/GPU architectures, which are not suitable for low-power, low-cost embedded products. But this is changing with new embedded vision processors such as the recently introduced Synopsys DesignWare® Embedded Vision (EV) Processors.

These new vision processors implement CNNs on a dedicated programmable multicore engine that is optimized for efficient execution of convolutions and the associated data movement. The engine is organized to optimize dataflow and performance, using a systolic array of cores called processing elements (PEs). One PE passes its results directly to another through FIFO buffers, without first storing the data to and reloading it from main memory, which effectively eliminates shared-memory bottlenecks. Vision processors offer flexibility in how the PEs are interconnected, allowing the designer to create different application-specific processing chains by altering the number and interconnection of the cores used to process each CNN stage or layer.
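
The following is only a software analogy of that hardware arrangement, with hypothetical stages standing in for convolution, activation and pooling: each "PE" consumes values from its input FIFO and pushes results straight into the next stage's FIFO, so intermediate data never takes a round trip through main memory.

// Software analogy of a chained-PE dataflow (the real PEs are hardware cores).
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

int main()
{
    using Fifo = std::queue<float>;
    // Each "PE" is modeled as a function applied between two FIFOs; the
    // stages are placeholders for convolution, activation, pooling, etc.
    std::vector<std::function<float(float)>> stages = {
        [](float v) { return v * 0.5f; },          // PE0: toy convolution stage
        [](float v) { return v > 0 ? v : 0; },     // PE1: toy activation stage
        [](float v) { return v + 1.0f; }           // PE2: toy pooling/output stage
    };

    std::vector<Fifo> fifos(stages.size() + 1);
    for (int i = 0; i < 8; ++i) fifos[0].push(float(i - 4));   // input stream

    for (size_t s = 0; s < stages.size(); ++s)                 // run each PE in turn
        while (!fifos[s].empty()) {
            fifos[s + 1].push(stages[s](fifos[s].front()));    // pass result downstream
            fifos[s].pop();
        }

    while (!fifos.back().empty()) {                            // drain the final FIFO
        std::printf("%.1f ", fifos.back().front());
        fifos.back().pop();
    }
    std::printf("\n");
}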

This new class of embedded vision processor is not restricted to CNN-based object detection. Other algorithms, such as histogram of oriented gradients (HOG), Viola-Jones and SIFT, can be implemented as well, giving designers the flexibility to tailor the processor to their application.

Efficient Embedded Vision Implementation

A real-world embedded vision processor must not only detect objects; it must also interface with sensors and peripherals, synchronize tasks, and handle communications with a host processor. Such tasks cannot be run efficiently on the specialized CNN PE cores, which are optimized for dataflow computation. An effective embedded vision processor must therefore be heterogeneous, including one or more RISC cores in addition to the CNN cores.

Software Considerations

All components of an embedded vision processor need to be easily programmable by the developer, providing enough flexibility for the system to be adjusted and/or re-targeted as market requirements change. However, a standard C/C++ compiler alone is not enough to program an embedded vision processor easily. A large amount of complex software is needed, and starting from scratch is not realistic. Developers need a set of high-level tools and standard libraries, such as OpenCV and OpenVX, that work in conjunction with and complement the underlying C/C++ tool chain.

OpenCV is an open-source computer vision software library containing more than 2,500 functions that, when used by the high-level application, facilitate tasks such as object detection and tracking, image stitching, 3D reconstruction and machine learning. OpenVX is a low-level programming framework for computer vision, specifically targeted at embedded and real-time systems, that enables performance- and power-optimized vision processing.
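
As a small example of how a developer might exercise one of the classical algorithms mentioned earlier through OpenCV, the sketch below runs the library's built-in HOG-based pedestrian detector on a single frame. The file name is a placeholder, and on an embedded target such calls would typically be routed to the vision processor through an optimized library backend rather than run on the host CPU.

// Minimal OpenCV sketch: HOG-based pedestrian detection on one frame.
#include <opencv2/imgcodecs.hpp>
#include <opencv2/objdetect.hpp>
#include <cstdio>
#include <vector>

int main()
{
    cv::Mat frame = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // placeholder input
    if (frame.empty()) return 1;

    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    std::vector<cv::Rect> people;
    hog.detectMultiScale(frame, people);         // scans an internal image pyramid

    for (const cv::Rect& r : people)
        std::printf("person at (%d, %d) %dx%d\n", r.x, r.y, r.width, r.height);
}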

Figure 2: OpenCV and OpenVX to accelerate vision system development

Synopsys Embedded Vision Processor

At more than 1,000 GOPS/watt, the Synopsys EV Processor Family, which includes the EV52 and EV54 processors, offers high-performance vision processing at the low power-consumption levels needed for embedded applications. The processors combine two or four general-purpose RISC cores with up to eight specialized CNN processing elements. This heterogeneous combination of cores enables designers to build embedded vision systems that balance flexibility, high performance and low power, making it practical to embed efficient vision processing in their SoC designs.

Conclusion

Computer vision is rapidly being deployed in embedded applications, giving rise to a new class of vision processors that offer the specialized performance required for vision but at power consumption levels that are appropriate for embedded applications. Executing advanced CNN algorithms, these processors deliver results that are approaching human recognition capabilities. Processors like the Synopsys DesignWare EV Family coupled with emerging open-source vision programming tools are changing the SoC paradigm by enabling vision to be embedded and giving designers a powerful new tool that they can use to differentiate their products.