Embracing Transformers for Real-Time Vision Processing

Gordon Cooper

Jul 18, 2022 / 7 min read

Table of Contents

Transformers Applied to Vision
Introducing Vision Transformers and Shifted Windows Transformers
Implementing Transformers
Summary

Transformers, first proposed in a Google research paper in 2017, were initially designed for natural language processing (NLP) tasks. Recently, researchers applied transformers to vision applications, a field dominated for the last decade by convolutional neural networks (CNNs), and got interesting results. Transformers have proven surprisingly adaptable to vision tasks like image classification and object detection. These results have earned transformers a place next to CNNs in vision tasks that aim to improve machines’ understanding of the world for future applications like context-aware video inference.

In 2012, a CNN called AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual computer vision competition. The task was to have a machine learn to “classify” images into 1,000 different categories (based on the ImageNet dataset). AlexNet achieved a top-5 error rate of 15.3%. Previous winners, based on a traditional programming model, had top-5 error rates around 26% (see Figure 1). Subsequent years were dominated by CNNs. In 2016 and 2017, the winning CNNs achieved better-than-human accuracy and the majority of participants achieved over 95% accuracy, prompting ImageNet to roll out a new, more difficult challenge in 2018. The dominance of CNNs in ILSVRC drove a flurry of research applying CNNs to real-time vision applications. While accuracy continued to improve, efficiency also improved by 10x between ResNet in 2015 and EfficientNet in 2020. Real-time vision applications require not just accuracy but also higher performance (inferences per second or frames per second (fps)), a smaller model size (which reduces bandwidth requirements), and power and area efficiency.
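The top-5 error rate quoted above counts a prediction as wrong only when the true label is missing from the model's five highest-scoring classes. A minimal sketch of that metric follows; the function name `top_k_error` and the toy scores are illustrative assumptions, not ILSVRC data or tooling.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Hypothetical toy data: 3 samples, 6 classes (not real ImageNet scores).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.10, 0.20, 0.20, 0.10],
    [0.00, 0.10, 0.20, 0.30, 0.35, 0.05],
]
labels = [1, 5, 0]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(f"top-1 error: {top1:.2f}, top-2 error: {top2:.2f}")
```

Raising `k` can only lower the error, which is why top-5 figures look more forgiving than top-1 figures for the same model.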

Figure 1: ILSVRC results highlight the significant improvements in accuracy for vision classification introduced by AlexNet, a convolutional neural network.

Classification is a building block for more complicated, and more useful, vision applications like object detection (finding the location of the object in the two-dimensional image), semantic segmentation (grouping/labeling every pixel in an image), and panoptic segmentation (both identifying object locations and labeling/grouping every pixel in every object).

Transformers, as first introduced in Google Brain’s 2017 paper, were designed to improve upon recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for NLP tasks like translation, question answering and conversational AI. RNNs and LSTMs have been used to process sequential data (e.g., digitized language and speech), but their architectures are not easily parallelizable and are thus typically very bandwidth-limited and difficult to train. The structure of a transformer has several advantages over RNNs and LSTMs. Unlike RNNs and LSTMs, which must read a string of text sequentially, transformers are significantly more parallelizable and can read in a complete sequence of words at once, allowing them to better learn contextual relationships between words in a text string.

A popular transformer for NLP, released in late 2018 by Google, is Bidirectional Encoder Representations from Transformers (BERT). BERT significantly improved results for a variety of NLP tasks and is popular enough to be included in MLCommons’ MLPerf neural network inference benchmark suite. In addition to high accuracy, transformers are much easier to train, making huge transformers possible. MTM, GPT-3, T5, ALBERT, RoBERTa and Switch AS are just some of the large transformers tackling NLP tasks. Generative Pre-trained Transformer 3 (GPT-3), introduced in 2020 by OpenAI, uses deep learning to produce human-like text and does this so accurately that it can be difficult to determine whether the text was written by a human.

Transformers like BERT can be successfully applied in other application domains with promising results for embedded use. AI models that can be trained on broad data and applied to a wide range of applications have been dubbed foundation models. One domain in which transformers have had surprising success is vision.

Transformers Applied to Vision

Something remarkable happened in 2021. The Google Brain team applied their transformer model to image classification. There is a big difference between a sequence of words and a two-dimensional image, but the Google Brain team cut the image into small patches, flattened the pixels of each patch into a vector, and fed the resulting vectors into the transformer. The results were surprising. Without any modification to the model, the transformer beat the state-of-the-art CNNs of the time in classification accuracy. While accuracy isn’t the only metric for real-time vision applications (power, cost (area) and inferences/sec are also important), it was a significant result in the vision world.
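The patch step described above can be sketched in a few lines. This is an illustrative outline only: the function name `image_to_patch_vectors`, the nested-list image layout and the 2x2 patch size are assumptions made for the example, and practical implementations typically also apply a learned projection and position information to each vector, which is omitted here.

```python
def image_to_patch_vectors(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    vectors = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append every channel of this pixel
            vectors.append(vec)
    return vectors


# Hypothetical 4x4 RGB image whose pixel values encode their own coordinates.
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
tokens = image_to_patch_vectors(image, patch=2)
print(len(tokens), "tokens of length", len(tokens[0]))
```

Each resulting vector plays the role that a word embedding plays in NLP: the transformer sees a sequence of tokens and has no built-in notion of the two-dimensional layout they came from.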

Figure 2: Comparing transformer and CNN structures

It’s helpful when comparing CNNs and transformers to understand their similar structures. In Figure 2, a transformer’s structure consists of the boxes on the left side of the image. For comparison, we draw a similar structure for CNNs using typical CNN constructs like those found in ResNet – a 1x1 convolution with element-wise addition. We find the feed forward portion of the transformer is functionally identical to the 1x1 convolution of the CNN. These are matrix-matrix multiplies that apply a linear transformation on every point in the feature map.
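The equivalence between a 1x1 convolution and a per-token linear layer can be checked directly. The sketch below is a simplified illustration: it uses a single weight matrix without bias, whereas a transformer feed-forward block typically stacks two such layers with a nonlinearity between them. All names and values are hypothetical.

```python
def linear(vec, weights):
    """Apply a weight matrix (out_channels x in_channels) to one feature vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def conv1x1(feature_map, weights):
    """1x1 convolution: the same linear map at every point of an H x W x C map."""
    return [[linear(pixel, weights) for pixel in row] for row in feature_map]


def feed_forward(tokens, weights):
    """One feed-forward layer: the same linear map applied to every token."""
    return [linear(token, weights) for token in tokens]


# Hypothetical 2x2 feature map with 3 channels, and a 2x3 weight matrix.
feature_map = [[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [0, 1, 0]]]
weights = [[1, 0, -1],
           [2, 1, 0]]

conv_out = conv1x1(feature_map, weights)
# Flatten the map into a token sequence, as a transformer would see it.
tokens = [pixel for row in feature_map for pixel in row]
ff_out = feed_forward(tokens, weights)
print(conv_out[0][0], ff_out[0])
```

Because both paths reduce to one matrix multiply over all points, hardware built for dense matrix-matrix multiplication can serve either.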

The difference between transformers and CNNs is in how each mixes information from neighboring pixels. This happens in the transformer’s multi-head attention and the convolutional network’s 3x3 convolution. For CNNs, the information that is mixed in is based on the fixed spatial location of each pixel, as we see in Figure 3. For a 3x3 convolution, a weighted sum is calculated over nine pixels – the center pixel and its eight neighbors.

Figure 3: Illustrating the difference between how a CNN’s convolution and a transformer’s attention networks mix in features of other tokens/pixels.
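For the convolution side of Figure 3, the weighted sum at one output position can be written out explicitly. This is a minimal single-channel sketch with hypothetical values; it ignores padding, stride and multiple channels.

```python
def conv3x3_at(image, kernel, y, x):
    """Weighted sum over the 3x3 neighborhood centered on (y, x)."""
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # The weight depends only on the relative offset (dy, dx),
            # never on the content of the pixel it multiplies.
            total += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
    return total


# Hypothetical single-channel image and two example kernels.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box_kernel = [[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1]]
laplacian_kernel = [[0, 1, 0],
                    [1, -4, 1],
                    [0, 1, 0]]

box_sum = conv3x3_at(image, box_kernel, 1, 1)
laplacian = conv3x3_at(image, laplacian_kernel, 1, 1)
print(box_sum, laplacian)
```

The kernel weights are fixed after training and tied to position offsets, which is the contrast with attention that the surrounding text draws.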

The transformer’s attention mechanism mixes in data not just based on location but based on learned properties. Transformers – during training – can learn to pay attention to other pixels. Attention networks have greater ability to learn and express more complex relationships.
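The content-based mixing described above is commonly implemented as scaled dot-product attention. The sketch below shows a single head and, as a simplifying assumption, uses the tokens themselves as queries, keys and values; a trained transformer would first pass them through learned projections. The function names and token values are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Output is a weighted mix of all value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical tokens: the first and third are similar, the second is not.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attention(tokens, tokens, tokens)
print([round(w, 3) for w in weights[0]])
```

Here the first token attends more strongly to the third token than to the second, even though the second is its nearer neighbor in the sequence: the mixing weights follow content rather than position.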

Introducing Vision Transformers and Shifted Windows Transformers

New transformers are emerging specifically for vision tasks. Vision Transformers (ViTs), specializing in image classification, are now beating CNNs in accuracy (although to achieve this accuracy, ViTs need to be trained with very large data sets). ViTs also require far more computation, which lowers their fps performance.

Transformers are also being applied to object detection and semantic segmentation. Swin (shifted window) transformers provide state-of-the-art accuracy for object detection (COCO) and semantic segmentation (ADE20K). While CNNs are typically applied to still images – with no knowledge of previous or future frames – transformers can be applied across video frames. Variants of Swin can be directly applied to video for uses like action classification. Applying transformers’ attention separately on time and on space has given state-of-the-art results on the Kinetics-400 and Kinetics-600 action classification benchmarks.

MobileViT (Figure 4), introduced in early 2022 by Apple, provides an interesting mix of transformers and convolutions. It combines transformer and CNN features to create a lightweight model for vision classification targeting mobile applications. Compared to the CNN-only MobileNet, this combination has 3% higher accuracy for the same model size (6M coefficients). Although MobileViT outperforms MobileNet in accuracy, it is still slower than CNN implementations on today’s mobile phones, which support CNNs but were not optimized for transformers. To take advantage of the benefits of transformers, future AI accelerators for vision will need better transformer support.

Figure 4: MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer (image source: https://arxiv.org/abs/2110.02178)

Despite the demonstrated successes of transformers for vision tasks, it is unlikely that convolutional networks are going to go away any time soon. There are still trade-offs between the two approaches – transformers bring higher accuracy but at much lower fps performance, and they require a lot more computation and data movement. To avoid the weaknesses of each, transformers and CNNs can be combined to produce flexible solutions that show great promise.

Implementing Transformers

Although there are architectural similarities, it would be unrealistic to hope that an accelerator designed specifically for CNNs will be efficient at executing transformers. At a minimum, architectural enhancements need to be considered to handle the attention mechanism.

An example of an AI accelerator that was designed to handle both CNNs and transformers efficiently is the ARC® NPX6 NPU IP from Synopsys. The NPX6’s computation units (Figure 5) include a convolution accelerator which is designed to handle matrix-matrix multiplications critical to both CNNs and transformers. The tensor accelerator is also critical, as it was designed to handle all other non-convolution Tensor Operator Set Architecture (TOSA) operations including transformer operations.

Figure 5: Synopsys ARC NPX6 NPU IP

Summary

Transformers for vision have made rapid advancements and are here to stay. These attention-based networks outperform CNN-only networks in accuracy. Models that combine vision transformers with convolutions, like MobileViT, are more efficient at inference. This new class of neural network models is opening the door to future AI tasks like full visual perception, which requires knowledge that may not easily be acquired by vision only. Transformers combined with CNNs are leading the way to next-generation AI. Choosing architectures that support both CNNs and transformers will be critical to SoC success for emerging AI applications.
