Why Next‑Generation NPUs Are Essential for Physical AI

Gordon Cooper

Apr 22, 2026 / 9 min read

Advances in transformers and multimodal models now give robots and autonomous systems contextual awareness far beyond traditional perception stacks. The latest AI models can describe scenes, follow natural‑language commands, and generate robotic actions. As industry leaders project the arrival of a “billion‑robot economy,” the compute engines enabling these devices must support rapidly evolving AI capabilities without exceeding edge power budgets.

When the first NPUs appeared, they were optimized for the compute load of convolutional neural networks (CNNs) while minimizing power and area. AI research has evolved significantly since then, producing newer CNN models, transformer models, generative AI models, and now physical AI models. Chip makers targeting edge or physical AI applications still need the power and performance efficiency that an NPU provides for AI acceleration, but they also need an enhanced NPU that supports these newest models.

Physical AI: Bringing Multimodal Intelligence Into the Real World

What is physical AI? Physical AI marks the latest evolution in embedded computing, enabling systems that sense their surroundings, interpret complex environments, and generate actions in real time. Think edge AI with motion – like a robot, a car, or a drone. Applying physical AI models, these autonomous machines can perceive their environment through multiple sensors, understand instructions using generative AI, and – based on their understanding of these multi-modal inputs – perform complex actions like moving a robot arm to pick up an object.

Figure 1: Physical AI is Edge Device AI with moving parts 

Physical AI also represents a superset of all the processing previously required of an NPU. A robot or autonomous vehicle perceives the world through multiple sensors – usually a camera, and perhaps audio, radar, LIDAR, or others. Vision processing here closely resembles the work performed by a CNN or transformer model. The robot must also decode commands – perhaps an audio prompt – and reason about the intent of those commands, much as a large language model decodes a text prompt and produces a result.

What's new for physical AI is the idea of applying motion or an action to an end device. Unlike Generative AI models running on smartphones or AI PCs—which operate without mechanical actuation—Physical AI devices integrate motors, actuators or other control components. Based on the robot’s understanding of these multi-modal inputs (vision, audio prompt), the Physical AI model is trained to decode the resulting tokens into robotic action. This is the physical part of physical AI.

VLA Models: The Cognitive Engine Behind Next‑Generation Robotics

Vision‑Language‑Action (VLA) models represent the next stage of AI evolution for robotics and autonomous systems. These models fuse visual understanding with language‑based reasoning and robotic control, enabling systems to interpret scenes, follow instructions, and generate control actions—all within a single multimodal model framework.

Pioneering efforts such as Google DeepMind’s RT‑2 have demonstrated how action tokens can be treated similarly to language tokens, allowing the model to translate natural‑language commands and image inputs into executable robot behaviors. This reduces dependency on hand‑engineered control stacks and increases the robot’s ability to generalize across unseen tasks.

VLA models elevate the expectations placed on edge compute: transformers must run at video frame rates, actions must be generated within tens of milliseconds, and multimodal fusion must occur locally to maintain safety and responsiveness. These requirements underscore the need for NPUs that combine high throughput with low latency and transformer‑optimized scheduling.

Figure 2: Google DeepMind introduces RT‑2, the first VLA model 

VLA workloads also impose strict performance requirements on Physical AI systems. Robotic control loops commonly operate at frequencies between 10 and 100 Hz, leaving only a few milliseconds for perception, reasoning, and action decoding. Models such as SmolVLA illustrate the complexity of this workload: each frame may require more than a dozen iterative inference cycles to generate a stable action trajectory.

This iterative processing creates intense pressure on both memory bandwidth and internal compute scheduling. Multimodal inputs must be processed simultaneously, transformer blocks must produce outputs in tight time windows, and the system must avoid stalls that introduce latency spikes.
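As a rough illustration, here is a back-of-envelope sketch (in Python, with purely illustrative numbers, not measured figures for any particular NPU) of how little time each inference pass gets inside such a control loop:

```python
# Hypothetical latency-budget check for a VLA control loop.
# All numbers are illustrative assumptions, not measured results.

def per_iteration_budget_ms(control_hz: float,
                            inference_iterations: int,
                            overhead_ms: float = 2.0) -> float:
    """Time available for each transformer inference pass.

    control_hz: frequency of the robot control loop (e.g., 10-100 Hz)
    inference_iterations: iterative passes needed per action (e.g., ~12)
    overhead_ms: budget reserved for sensor I/O and actuation
    """
    frame_budget_ms = 1000.0 / control_hz - overhead_ms
    return frame_budget_ms / inference_iterations

# A 50 Hz control loop with 12 iterative decoding passes leaves
# (20 ms - 2 ms) / 12 = 1.5 ms per transformer inference.
print(per_iteration_budget_ms(50.0, 12))  # -> 1.5
```

At these budgets, even a single scheduling stall can push an action past its deadline, which is why deterministic latency matters as much as raw throughput.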

Figure 3: GR00T N1, a Vision Language Action Model for humanoid robots  

These characteristics make VLA a defining benchmark for next‑generation NPUs. To meet these constraints, NPUs must offer optimized dataflow for transformers, support block‑based quantization formats, deliver high‑efficiency matmul operations, and ensure deterministic performance under continuous load.


New Block-Based Quantization Formats Minimize Bandwidth

A legacy NPU for perceptive AI typically imported 32-bit floating-point coefficients from a TensorFlow or PyTorch framework through ONNX and then quantized them, with minimal accuracy loss, to an 8-bit integer (INT8) data type for CNNs or perhaps a 16-bit floating-point (FP16) data type for transformers. That approach worked well for compute-bound algorithms.
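A minimal sketch of that legacy quantization step, assuming simple symmetric per-tensor INT8 quantization (illustrative only, not any specific toolchain's algorithm):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization, as commonly used for CNN weights.

    Returns the INT8 tensor and the FP32 scale needed to dequantize.
    """
    scale = np.abs(weights).max() / 127.0  # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
reconstructed = q.astype(np.float32) * scale  # small accuracy loss for CNNs
print(np.abs(w - reconstructed).max())
```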

For large language models and the VLMs and VLAs that have followed, minimizing the bandwidth of incoming coefficients (parameters) is critical to the performance of these memory-bound algorithms. These models ship with pre-quantized coefficients, compressing high-precision weights (16-bit or 32-bit floating point) into lower-bit representations (e.g., 4-bit) to reduce memory footprint and accelerate inference. Readily available models on open-source projects like llama.cpp and Hugging Face use a mix of data types for the coefficients: 4-bit, 5-bit, 6-bit, and 8-bit.

The table below shows some examples of pre-quantized block-based data types used by common LLMs. 

You can see that each LLM uses a different combination of block data types to balance performance and bandwidth. These pre-compressed block data types can be supported either natively in hardware or in software at the cost of a significant number of instruction cycles.
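To make the format concrete, here is a simplified sketch of block-based 4-bit quantization in the spirit of llama.cpp's Q4_0 format – 32-element blocks, each with one shared scale. The real GGUF bit layout (nibble packing, offsets) differs; this only shows the principle:

```python
import numpy as np

BLOCK = 32  # llama.cpp-style block size

def quantize_q4_blocks(weights: np.ndarray):
    """Simplified Q4_0-style block quantization (not the exact GGUF layout).

    Each block of 32 weights gets one FP16 scale; values are stored
    as 4-bit integers in [-8, 7].
    """
    blocks = weights.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

# Storage cost: 32 x 4 bits + one 16-bit scale = 144 bits per block,
# i.e., 4.5 bits/weight vs 16 for FP16 -- a ~3.6x bandwidth reduction.
w = np.random.randn(4096 * 4096).astype(np.float32)
q, s = quantize_q4_blocks(w)
print(q.shape, s.shape)
```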

Scalable NPU Performance for a Diverse Edge Landscape

From a performance point of view, CNNs are mostly multiply-accumulates, so TOPS is a good first-order metric. An automotive AI system may need very high performance – thousands of TOPS – while an AI-enabled toy cares most about price and extremely low power, with performance secondary. The problem with TOPS is that transformers are more than just multiply-accumulates: they also require softmax and other computations, so frames per second or inferences per second is a better metric. For LLMs, VLMs, or VLAs, the right metrics are tokens per second (higher is better) and time to first token (lower is better).
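A back-of-envelope sketch shows why tokens per second for LLM decode is bounded by bandwidth rather than TOPS (all numbers here are illustrative assumptions, not measured figures for any device):

```python
def decode_tokens_per_sec(param_count_b: float,
                          bits_per_weight: float,
                          dram_gbps: float) -> float:
    """Rough upper bound on LLM decode speed for a memory-bound NPU.

    Each generated token must stream every weight from memory once,
    so tokens/sec <= bandwidth / model size.
    """
    model_bytes = param_count_b * 1e9 * bits_per_weight / 8
    return dram_gbps * 1e9 / model_bytes

# A 7B-parameter model at 4.5 bits/weight over a 50 GB/s bus:
# ~3.9 GB of weights -> a ceiling of ~12.7 tokens/sec, no matter
# how many TOPS the MAC array delivers.
print(decode_tokens_per_sec(7, 4.5, 50))
```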

Figure 4: Performance requirements for Physical AI scale depending on the application   

Physical AI spans a wide spectrum of devices, from compact consumer robotics to multi‑sensor autonomous platforms requiring hundreds or thousands of tera operations per second (TOPS). A modern NPU must therefore scale efficiently across this range while maintaining a consistent programming model.

Crucially, this scalability ensures that as AI models evolve—from LLMs to multimodal models to VLA systems—design teams can maintain software continuity while upgrading hardware capabilities across product generations.

The Increasing Need for a Unified NPU Architecture

Not every NPU can handle Physical AI workloads and VLA models or provide support for the latest pre-quantized block data types. Legacy NPUs were designed to prioritize the computations needed for perceptive AI – that is, perceiving the world using convolutional neural networks and, later, transformer models to perform object detection or semantic segmentation. Examples of vision transformers include the ViT and Swin models.

The AI challenge, however, isn’t just about computations. Data movement and bandwidth are critical – especially with the large parameter requirements of generative AI models. Wide bandwidth interfaces, data compression in the DMA and bandwidth reduction techniques are all tools needed by an NPU to best process generative AI models.

Many edge applications – autonomous vehicles, robotics, surveillance, wearables – are demanding an NPU that can efficiently handle both perceptive and generative models. This convergence places conflicting requirements on NPUs. Perceptive AI is compute‑bound, relying heavily on MAC throughput. Generative AI is memory‑bound, requiring high bandwidth and efficient handling of quantized model weights. Without an architecture designed to balance both, edge systems face performance bottlenecks, elevated power draw, or inability to run new models.
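A roofline-style check makes the compute-bound/memory-bound distinction concrete: compare a workload's arithmetic intensity against the machine balance of the accelerator. The sketch below uses hypothetical figures:

```python
def bound_type(flops: float, bytes_moved: float,
               peak_tops: float, peak_gbps: float) -> str:
    """Roofline-style classification of a workload.

    A layer is memory-bound when its arithmetic intensity (FLOPs per
    byte moved) falls below the machine balance (peak FLOPs per byte
    of bandwidth), and compute-bound otherwise.
    """
    intensity = flops / bytes_moved
    machine_balance = (peak_tops * 1e12) / (peak_gbps * 1e9)
    return "compute-bound" if intensity > machine_balance else "memory-bound"

# Illustrative: a CNN layer reuses each weight many times (high
# intensity); LLM decode touches each weight only once (low intensity).
print(bound_type(flops=2e9, bytes_moved=4e6, peak_tops=10, peak_gbps=50))  # compute-bound
print(bound_type(flops=2e9, bytes_moved=1e9, peak_tops=10, peak_gbps=50))  # memory-bound
```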

A unified AI accelerator must therefore deliver efficient compute for CNNs and transformers while supporting the evolving quantization formats and memory patterns of modern LLMs, VLMs, and VLA models. New NPU designs, such as the enhanced NPX6, demonstrate that a single architecture can meet these combined demands while simplifying system design and software integration.

Figure 5: Enhanced NPUs must support both Perceptive AI and Generative AI 

Enhanced NPX6 NPU IP supports Physical AI

Supporting the simultaneous demands of perceptive AI, generative AI, and VLA workloads requires NPUs built around a set of foundational architectural capabilities. The enhanced NPX6 meets the computational demands as well as the bandwidth, quantization support, and latency management that Physical AI systems require.

Computationally, the NPX6 was designed from scratch for efficient processing of transformer models, which are the building blocks of both generative AI and physical AI models. In the NPX6 architecture, the same computational units handle both convolutional neural networks and transformers, and the same internal data paths carry either integer or optional floating-point data types, minimizing area.

While the NPX6 already supports a wide external bus, recent enhancements to the NPX6's optional DMA Data Compression unit add support for pre-quantized block-based data types including Q4, Q5, Q6, and Q8. This unit also supports OCP Microscaling (MX) formats including MXFP8, MXFP6, MXFP4 and MXINT8. Supporting these data types in hardware minimizes bandwidth, which improves performance in memory-bound systems and lowers power. The DMA Data Compression option preserves these bandwidth savings without wasting instruction cycles decompressing bit-packed coefficients in software.

The NPX6 family offers this scalability through configurations ranging from 1K MACs for small, power-limited devices to 96K MACs and beyond for high-performance robotics and automotive applications. This range lets system designers tailor compute, bandwidth, and latency characteristics to their specific power and area constraints, supporting everything from one TOPS to the thousands of TOPS needed in high-end automotive applications.
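The arithmetic behind such scaling is straightforward: each MAC contributes two operations (a multiply and an add) per cycle. The sketch below uses a hypothetical clock frequency for illustration; it is not an NPX6 specification:

```python
def peak_tops(num_macs: int, clock_ghz: float) -> float:
    """Peak throughput: each MAC performs 2 ops (multiply + add) per cycle."""
    return 2 * num_macs * clock_ghz * 1e9 / 1e12

# Illustrative scaling at an assumed 1.0 GHz clock:
print(peak_tops(1_000, 1.0))   # 1K MACs  -> ~2 TOPS
print(peak_tops(96_000, 1.0))  # 96K MACs -> ~192 TOPS; multiple
                               # instances scale to thousands of TOPS
```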

Figure 6: Block diagram of the NPX6 NPU IP family   

Key architectural characteristics of the NPX6 NPU IP family include:

  • Power‑ and area‑efficient operation: Essential for thermal‑restricted robotics and battery‑powered systems.
  • Transformer‑optimized compute units: Enables fast execution of attention, normalization, and matmul operations central to generative AI.
  • Scalability across product families: Allows designers to move from 1K to 96K MAC configurations without rewriting software.
  • Efficient multi‑level memory hierarchy: Reduces data movement costs and enables high‑bandwidth operation for memory‑intensive LLMs.
  • Virtualization and isolated multi‑model support: Supports mixed‑criticality workloads such as simultaneous driver‑assistance perception and in‑cabin LLM interaction.
  • Comprehensive toolchain support: Ensures compatibility with ONNX, GGUF, llama.cpp, and emerging model formats to reduce integration effort (see the ONNX export sketch after this list).
  • Functional safety support: Features dedicated to ISO 26262 compliance (NPX6 FS family).
  • New support for advanced 4‑/5‑/6‑/8‑bit quantized coefficients: Critical for running pre‑quantized LLMs and VLMs at scale without sacrificing model compatibility.
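As an illustration of the toolchain item above, here is a minimal example of the standard framework-to-ONNX hand-off using PyTorch's torch.onnx.export. The model and file name are hypothetical stand-ins, not part of any NPX6 workflow:

```python
import torch

# Minimal example of the framework-to-ONNX hand-off that NPU toolchains
# typically consume; torch.onnx.export is the standard PyTorch API.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example image-shaped input
torch.onnx.export(
    model,
    dummy_input,
    "perception_block.onnx",  # hypothetical output file name
    input_names=["image"],
    output_names=["features"],
)
```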

Advancing Edge Intelligence Through Unified, Transformer‑Ready NPU Architectures

The rapid emergence of multimodal, transformer‑based generative AI—combined with real‑time robotics and autonomous systems—demands a new class of AI accelerators. These NPUs must be flexible enough to support perception and generative reasoning, efficient enough for sustained edge operation, and scalable enough to support diverse device categories.

Architectures such as the enhanced NPX6 embody this shift by providing the bandwidth, quantization flexibility, virtualization support, and transformer‑native execution required for Physical AI. By unifying predictive and generative AI within a single scalable NPU architecture, designers can accelerate innovation across robotics, automotive systems, smart devices, and next‑generation edge products.
