CNN-based applications are computationally intensive. The workload depends on the frame size, the number of levels in the image pyramid, the size of the CNN input image and the filter sizes. For example, for a VGA frame, an image pyramid of eight scales with a 1.25x downscale factor and a typical 4-layer CNN, a relatively simple application requires 60-70 million multiply-and-accumulate (MAC) operations per frame, or about two billion MACs per second when processing 30 frames per second. This load is too high even for the fastest embedded processors. For more complex applications, the count can reach 800 million to one billion MACs per frame.
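The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`pyramid_sizes`, `conv_layer_macs`, `macs_per_second`) are made up for this example, VGA is taken as 640x480, and the layer dimensions in the usage lines are hypothetical rather than those of the 4-layer network mentioned above.

```python
def pyramid_sizes(width, height, scales=8, factor=1.25):
    """Return the (width, height) of each pyramid level, largest first."""
    sizes = []
    for level in range(scales):
        shrink = factor ** level
        # Round to whole pixels; real implementations may round differently.
        sizes.append((max(1, round(width / shrink)),
                      max(1, round(height / shrink))))
    return sizes


def conv_layer_macs(out_h, out_w, in_ch, out_ch, kernel):
    """MACs for one convolution layer: one k*k*in_ch dot product
    per output pixel and per output channel."""
    return out_h * out_w * out_ch * in_ch * kernel * kernel


def macs_per_second(macs_per_frame, fps=30):
    """Scale a per-frame MAC count to a per-second rate."""
    return macs_per_frame * fps


# Pyramid for a VGA frame (assumed 640x480), eight scales, 1.25x downscale.
sizes = pyramid_sizes(640, 480)
total_pixels = sum(w * h for w, h in sizes)

# Hypothetical layer: 28x28 output, 1 input channel, 6 filters of 5x5.
example_layer = conv_layer_macs(28, 28, 1, 6, 5)

# Per-frame range quoted in the text, scaled to 30 frames per second.
low, high = macs_per_second(60_000_000), macs_per_second(70_000_000)
```

With the per-frame range from the text, `low` and `high` bracket the "about two billion MACs" per second figure. The pyramid helper shows why the number of scales matters: every extra level adds another full set of candidate windows for the CNN to evaluate.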
Currently, CNNs are mainly mapped onto CPU/GPU architectures, which are not suitable for low-power, low-cost embedded products. This is changing with the recent market introduction of embedded vision processors such as Synopsys’ DesignWare® Embedded Vision (EV) Processors.
These new vision processors implement CNN on a dedicated programmable multicore engine that is optimized for efficient execution of convolutions and the associated data movement. The engine is organized to optimize dataflow and performance, using a systolic array of cores called processing elements (PEs). One PE passes its results directly to the next through FIFO buffers, without first storing the data to main memory and loading it back, which effectively eliminates shared-memory bottlenecks. Vision processors offer flexibility in how the PEs are interconnected, allowing the designer to easily create different application-specific processing chains by altering the number and interconnection of the cores used to process each CNN stage or layer.
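The FIFO-chained dataflow described above can be modeled with a toy software simulation. This is a minimal sketch, not the EV processor's programming model: the `ProcessingElement` class, `build_chain`, `run_chain` and the two stand-in stage functions are all hypothetical, and a `deque` merely plays the role of a hardware FIFO.

```python
from collections import deque


class ProcessingElement:
    """Hypothetical PE: applies one function to items from its input FIFO."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inbox = deque()     # FIFO feeding this PE
        self.downstream = None   # input FIFO of the next PE, or the output FIFO

    def step(self):
        """Process at most one item; return True if any work was done."""
        if not self.inbox:
            return False
        item = self.inbox.popleft()
        # The result goes straight into the next FIFO, not a shared buffer.
        self.downstream.append(self.func(item))
        return True


def build_chain(stages):
    """Wire PEs in sequence; 'stages' is a list of (name, function) pairs."""
    pes = [ProcessingElement(name, func) for name, func in stages]
    output = deque()
    for pe, nxt in zip(pes, pes[1:]):
        pe.downstream = nxt.inbox
    pes[-1].downstream = output
    return pes, output


def run_chain(pes, output, inputs):
    """Feed inputs to the first PE and step all PEs until the chain drains."""
    pes[0].inbox.extend(inputs)
    # Step the last PE first so each item advances one stage per cycle.
    while any([pe.step() for pe in reversed(pes)]):
        pass
    return list(output)


# Stand-in stage functions (assumptions, not real CNN layers).
stages = [
    ("conv", lambda x: 2 * x - 3),   # placeholder for a convolution result
    ("relu", lambda x: max(0, x)),   # simple non-linearity
]
pes, output = build_chain(stages)
result = run_chain(pes, output, [0, 1, 2, 3])
```

Creating a different processing chain in this model amounts to editing the `stages` list, which loosely mirrors how a designer would change the number and interconnection of cores assigned to each CNN stage.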
This new class of embedded vision processor is not restricted to object detection with CNN. Other algorithms, such as histogram of oriented gradients (HOG), Viola-Jones and SURF, can be implemented as well, giving designers the flexibility to tailor the processor to their application.