For over a decade, convolutional neural networks (CNNs) were the most popular deep-learning models for vision processing applications. As they evolved, CNNs accurately supported an expanding set of use cases, including image classification, object detection, semantic segmentation (labeling every pixel in an image by class), and panoptic segmentation (labeling every pixel by class while also distinguishing the individual object instances those pixels belong to).
Although the computer vision application landscape was understandably dominated by CNNs, a new algorithm category, initially developed for natural language processing (NLP) tasks such as translation and queries, has made serious inroads beyond applications like ChatGPT. Known as the transformer, this deep-learning model pioneered by Google beats CNNs in accuracy with essentially one modification: image patches take the place of language tokens. Moreover, Google's vision transformer (ViT), an optimized model based on the original transformer architecture, outperforms a comparable CNN while using roughly a quarter of the computational resources.
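To make the patch idea concrete, the sketch below shows how an image can be cut into fixed-size patches and projected into token embeddings, so a transformer can treat patches the way an NLP model treats words. It is a minimal illustration in PyTorch, not Google's released ViT code; the class name, the layer sizes, and the use of a strided convolution for the projection are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one as a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # One strided convolution both cuts the image into patches and
        # linearly projects each patch to an embedding vector.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        x = self.proj(x)                       # (batch, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (batch, 196, 768) patch tokens

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                            # torch.Size([1, 196, 768])
```

The resulting sequence of patch tokens is then fed to a standard transformer encoder, exactly as a sequence of word embeddings would be.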
This is because transformers use attention mechanisms and sparsity to improve training, focus on the most relevant data, and bolster inference capabilities. With these techniques, transformers can learn and understand more complex patterns, whereas CNNs typically process each data frame in isolation, with no sense of what came before or after. Nevertheless, it is important to note that while transformers deliver higher accuracy, CNNs achieve significantly higher throughput in frames per second (FPS). That is why transformers are frequently paired with CNNs: together they bolster both the speed and the accuracy of vision processing applications.
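The attention step the paragraph refers to can be sketched in a few lines. Every patch token produces a query, a key, and a value, and each output token is a weighted mix of all values, so every position can draw on the whole image at once rather than only a local neighborhood. The function and variable names below are illustrative, and the example omits the multi-head and masking details of a full implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of patch tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project tokens to queries, keys, values
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)        # how strongly each token attends to the rest
    return weights @ v                         # every output mixes information from all tokens

dim = 64
x = torch.randn(1, 196, dim)                   # e.g. 196 patch tokens from a 224x224 image
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                               # torch.Size([1, 196, 64])
```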
MobileViT, introduced by Apple in early 2022, is an example of this approach. Essentially, MobileViT merges transformer and CNN features to create a lightweight model for vision classification. Compared with the CNN-only MobileNet at the same model size (about 6 million parameters), the combination of transformer and convolution delivers three percent higher accuracy.
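To illustrate the kind of hybrid design MobileViT represents, the following sketch pairs a convolution branch, which captures cheap local features, with a transformer encoder layer, which captures global context, and then fuses the two. This is a simplified illustration under assumed layer sizes, not Apple's published MobileViT architecture; the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Toy CNN-plus-transformer block: local convolution fused with global attention."""
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        # Local branch: a small convolution over the feature map.
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Global branch: a standard transformer encoder layer over flattened positions.
        self.global_attn = nn.TransformerEncoderLayer(d_model=channels,
                                                      nhead=num_heads,
                                                      batch_first=True)
        # Fuse the local and global branches back into one feature map.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                          # x: (batch, C, H, W)
        local = self.local(x)
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)  # (batch, H*W, C) position tokens
        global_feat = self.global_attn(tokens)     # attention across all positions
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, global_feat], dim=1))

out = HybridBlock()(torch.randn(1, 64, 16, 16))
print(out.shape)                                   # torch.Size([1, 64, 16, 16])
```

The design choice mirrors the trade-off described above: the convolution keeps the parameter count and per-frame cost low, while the attention layer supplies the global context that pure CNNs lack.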