High Accuracy Computer Vision with ISPs and Vision Processors

Gordon Cooper, Product Marketing Manager, Synopsys

Convolutional Neural Networks (CNNs) have gotten a lot of justified buzz for being the state-of-the-art technique for computer vision tasks such as object detection and recognition. For applications on the edge requiring vision processing, such as mobile phones, autonomous vehicles and augmented reality devices, a dedicated vision processor with a CNN accelerator can maximize performance while minimizing power consumption. However, the accuracy of object detection depends on the quality of the image that is fed into the CNN engine or accelerator. For the best results, designers must ensure that the image coming from the camera is corrected and enhanced before it reaches the CNN. For example, images captured at dusk might suffer from a lack of differentiation between the objects in the scene and the background. One possible way to improve such an image is a ‘normalization’ pre-processing step.

In an example vision pipeline (Figure 1), light passes through the camera lens and strikes a CMOS sensor, which represents each pixel as a voltage proportional to the light intensity. The output of the CMOS sensor is fed into an image signal processor (ISP), which corrects lens distortions and makes color corrections. The corrected image is then passed on to the vision processor for vision processing and object detection.

Figure 1: A typical vision system from camera to output of Convolutional Neural Network

One important task for the ISP is color image demosaicing (Figure 2). Most digital cameras get their inputs using a single image sensor overlaid with a color filter array (CFA). The raw output has a green appearance to the naked eye when using the most common color filter array – the Bayer pattern. The filter pattern is 50% green, 25% red and 25% blue – the extra green helps mimic the physiology of the human eye, which is more sensitive to green light. Demosaicing makes the images look more natural by interpolating the missing color values at each pixel from nearby pixels. This pixel-by-pixel processing of a two-dimensional (2D) image is a good example of the types of processing an ISP must perform.

Figure 2: Demosaicing a Bayer pattern image to a normal RGB image requires two-dimensional pixel processing
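
As a minimal sketch of the interpolation step described above, the following Python fragment fills in the missing green value at each red or blue site of a Bayer mosaic by averaging the neighboring green samples (bilinear interpolation). The RGGB layout, the function names and the border handling are assumptions chosen for illustration, and the image is assumed to be at least 2x2; production ISPs typically use more elaborate, edge-aware methods.

```python
def is_green(row, col):
    """In the RGGB Bayer layout assumed here, green sits where row + col is odd."""
    return (row + col) % 2 == 1


def interpolate_green(mosaic):
    """Return a full-resolution green plane from a single-channel Bayer mosaic.

    Green sites keep their measured value. Red and blue sites take the
    average of their in-bounds up/down/left/right neighbors, which are
    always green in a Bayer pattern.
    """
    height, width = len(mosaic), len(mosaic[0])
    green = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if is_green(r, c):
                green[r][c] = float(mosaic[r][c])
                continue
            neighbors = [
                mosaic[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < height and 0 <= nc < width
            ]
            green[r][c] = sum(neighbors) / len(neighbors)
    return green


# 4x4 mosaic in RGGB order: row 0 is R G R G, row 1 is G B G B, and so on.
mosaic = [
    [10, 20, 10, 40],
    [60, 30, 80, 30],
    [10, 20, 10, 40],
    [60, 30, 80, 30],
]
green = interpolate_green(mosaic)
```

The red and blue planes can be filled in the same way from their own, sparser neighbors. Every output pixel depends only on a small 2D neighborhood, which is why this kind of kernel maps well onto parallel hardware.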

Some camera manufacturers embed ISP capabilities in their camera module. Others will design their own hardwired ISP. To execute computer vision algorithms such as object detection or facial recognition on the image outputs of the ISPs, a separate CNN engine – either on-chip or off-chip – is required. Modern vision processors, such as the Synopsys EV62 (Figure 3), include both vector DSP capabilities and a neural network accelerator or engine. The vision processor’s vector DSP offers capabilities that are well suited to executing ISP functions.

Figure 3: Synopsys’ DesignWare ARC EV62 Vision Processor with CNN option includes two vector DSP cores and one tightly integrated neural network engine.

A vector DSP can perform simultaneous multiply-accumulates on different streams of data. For example, a vector DSP with a 512-bit-wide word can perform 32 parallel 16-bit multiplies or 16 parallel 32-bit multiplies. Vector DSPs can combine their inherent parallelism with a power- and area-optimized architecture to provide a highly efficient 2D image processing solution for embedded vision applications.
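
To make the idea of parallel multiply-accumulates concrete, the sketch below models a vector register as a Python list and applies one multiply-accumulate to every lane per call. It is a scalar simulation written for illustration, not EV6x code: the function names are hypothetical, and the lane count of 32 is simply borrowed from the example above.

```python
LANES = 32  # illustrative lane count; real hardware fixes this by word and element width


def vector_mac(acc, a, b):
    """Model one vector multiply-accumulate: every lane computes acc + a * b."""
    return [acc_i + a_i * b_i for acc_i, a_i, b_i in zip(acc, a, b)]


def filter_row(pixels, taps):
    """Apply a 1D filter to LANES adjacent output pixels, one vector MAC per tap."""
    acc = [0] * LANES
    for k, tap in enumerate(taps):
        window = pixels[k:k + LANES]  # LANES neighboring input pixels, shifted by k
        acc = vector_mac(acc, window, [tap] * LANES)
    return acc


row = list(range(LANES + 2))           # 34 input pixels: 0, 1, ..., 33
smoothed = filter_row(row, [1, 2, 1])  # unnormalized 3-tap smoothing filter
```

A scalar loop would need 96 multiply-accumulates to produce these 32 filtered pixels; in this model the same work takes three vector operations, one per filter tap.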

A programmable vision processor requires a robust software tool chain and relevant library functions. The EV62 is supported by the DesignWare® ARC® MetaWare EV Development Toolkit, which includes software development tools based on the OpenVX™, OpenCV, and OpenCL C embedded vision standards. Synopsys’ OpenVX implementation extends the standard library of OpenVX kernels with new kernels that offer OpenCV-like functionality within the optimized, pipelined OpenVX execution environment. For vision processing, OpenVX provides both a framework and optimized vision algorithms – image functions implemented as “kernels,” which are combined to form an image processing application expressed as a graph. The standard and extended OpenVX kernels have been ported and optimized for the EV6x so that designers can take advantage of the parallelism of the vector DSP.
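
The kernel-and-graph idea can be sketched in a few lines of Python. This is a toy model only: the `Graph` class and the two kernels are hypothetical names invented for illustration, none of them are OpenVX API calls, and the graph here is a simple chain, whereas a real OpenVX graph is built through the C API and may branch.

```python
class Graph:
    """Toy stand-in for a vision graph: kernels run in the order they were added."""

    def __init__(self):
        self.nodes = []

    def add_node(self, name, kernel):
        """Append a named kernel (any function from image to image)."""
        self.nodes.append((name, kernel))
        return self

    def process(self, image):
        """Feed the image through every kernel in turn and return the result."""
        for _name, kernel in self.nodes:
            image = kernel(image)
        return image


def crop_top_left(image, height=2, width=2):
    """Hypothetical kernel: keep only the top-left height x width region."""
    return [row[:width] for row in image[:height]]


def invert(image, max_value=255):
    """Hypothetical kernel: invert 8-bit pixel intensities."""
    return [[max_value - p for p in row] for row in image]


graph = Graph().add_node("crop", crop_top_left).add_node("invert", invert)
result = graph.process([[0, 10, 20], [30, 40, 50], [60, 70, 80]])
```

Describing the whole application up front, rather than calling image functions one at a time, is what gives an implementation room to pipeline and optimize the kernels for the target hardware.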

Figure 4 shows an example of an OpenVX graph that uses a combination of standard and extended OpenVX kernels. In this example, the output of the demosaicing is run through distortion correction (remap), image scaling, and image normalization, and the image is cropped during the distortion correction (remap) step. Normalization adjusts the range of pixel intensity values to correct for poor contrast caused by, for example, low light or glare.

Figure 4: An OpenVX graph for implementing an ISP on a vision processor
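
One common way to adjust the range of pixel intensities is a linear min-max stretch, sketched below in Python. This is an assumption about what a normalization kernel might do, not a description of the kernel in Synopsys' library; the function name and the 8-bit output range are chosen for illustration.

```python
def normalize(image, out_min=0, out_max=255):
    """Linearly stretch pixel intensities so they span [out_min, out_max]."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A flat image has no contrast to stretch; map everything to out_min.
        return [[out_min for _ in row] for row in image]
    scale = (out_max - out_min) / (hi - lo)
    return [[round(out_min + (p - lo) * scale) for p in row] for row in image]


# A low-contrast patch, such as one captured at dusk: values span only 50..90.
dusk = [[50, 60], [80, 90]]
stretched = normalize(dusk)
```

After the stretch, the darkest pixel maps to 0 and the brightest to 255, so the differences between neighboring pixels are larger in the image handed to the CNN.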

Because the EV62 has two vision processor CPUs and a dedicated CNN engine, it can do double duty. One vision processor can execute the ISP algorithms while the second runs other computer vision algorithms in parallel with or in support of the CNN engine. An EV64 has four vision processor CPUs for even more parallel processing capability.

There is a steady stream of new academic papers on CNN use cases, including proposals to train the CNN to accept uncorrected (pre-ISP) images, which suggests that the ISP may eventually no longer be required as a pre-processing step for the CNN. However, there are plenty of use cases where the image from the sensor needs correction for human viewing. One example is automotive rear-view cameras, which, in addition to using CNNs for detection and collision avoidance algorithms, still need to show a clear picture on the in-cockpit display for the human operator to process.