

# Highly Efficient Programming Environment for Handling AI Workloads

Tom Michiels, System Architect, Synopsys

ARC<sup>®</sup> Processor Summit 2022

### Agenda



- The AI Programming Challenge
- Optimizations For Programming AI-Enabled SoCs
- Quantifying The Benefits



## The AI Programming Challenge



#### Popular & Emerging Neural Networks Are Still Evolving



© 2022 Synopsys, Inc.

### AI Software Runs On a Spectrum Of Hardware Types

CPU, GPU, DSPs, NPUs, AI Accelerators...

| Hardware    | Performance | Area<br>Efficiency | Power<br>Efficiency | Flexibility | Typical Programming<br>Model |
|-------------|-------------|--------------------|---------------------|-------------|------------------------------|
| CPU         | *           | *                  | *                   | ****        | C/C++ code                   |
| GPU         | ****        | *                  | *                   | ****        | OpenCL or CUDA               |
| FPGA        | **          | ***                | *                   | ***         | Vendor Specific              |
| DSP         | ***         | ***                | ***                 | ***         | C/C++ or OpenCL C            |
| NPU         | ****        | ****               | ****                | ***         | Vendor Specific              |
| Accelerator | ***         | ****               | ****                | **          | Hardwired or Special SDK     |

Ideally, your NNs will take advantage of any AI-enabled hardware



#### Wide Variety Of Performance For AI Edge Devices





Robotics / Drones



- Driver monitoring system
- Surveillance
- Facial recognition



- ADAS Front Cameras
- ADAS LiDAR/Radar
- High end surveillance
- High-end smartphones
- DTV
- HPC
- Microservers (inference)
- Data center (inference)

#### 10 to 1000+ TOPS

Same programming environment to serve multiple domains



AIoT

#### Deep Learning Performance Outpacing Memory



- Moore's Law: CPU performance outpacing memory access speed
- GPUs initiated Deep Learning in 2012, widening the gap
- Deep Learning accelerators outpacing GPUs
- Goal: reduce data movement
  - Innovative heterogeneous memory architectures required
  - From on-chip memory compilers to high bandwidth HBM2

#### Limited memory bandwidth requires optimized data movements



### **Competing Machine Learning Frameworks**

Lack of Programming Model Standardization for AI Algorithms



#### Programming model should support all popular frameworks





#### 5 Optimizations For Programming AI-Enabled SoCs

1. Quantization
2. Multi-level Layer Fusion and Multi-level Tiling
3. Feature Map Compression/Decompression
4. Structured Sparsity
5. Feature Map Partitioning



#### NN Applications Use Wide Range Of Data Representations



- FP32 typical format used in GPUs for NN model training
- FP16 and BF16 are NOT needed for accuracy over INT8/16; they ease the transition from GPU by avoiding the need to retrain models
- FP8 has more traction for training than inference
- **INT16** provides accuracy 'insurance' for radar and super resolution (at reduced performance)
- INT8 standard for neural network object detection
- **INT4** can save bandwidth; not very popular yet
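The trade-off between bit width and accuracy in the list above can be illustrated with a minimal sketch. It assumes symmetric, per-tensor quantization with a single scale factor, which is one common scheme and not necessarily the one used by any specific toolchain. The function names and the sample weights are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one scale per tensor (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1           # 7 for INT4, 127 for INT8, 32767 for INT16
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]


def max_abs_error(values, num_bits):
    """Worst-case round-trip error for the given bit width."""
    q, scale = quantize_symmetric(values, num_bits)
    return max(abs(a - b) for a, b in zip(values, dequantize(q, scale)))


# Hypothetical weights, chosen only for illustration
weights = [0.82, -0.31, 0.0045, -1.27, 0.56]
for bits in (4, 8, 16):
    print(f"INT{bits}: max abs error = {max_abs_error(weights, bits):.6f}")
```

Each halving of the bit width halves the storage and bandwidth per value but coarsens the quantization step, which is why the wider formats act as accuracy "insurance".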



## Mixed Precision Quantization Enables Optimized Accuracy with Minimum Bandwidth Impact



<u>Mixed-Precision</u> Quantized Model
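As a rough sketch of how a mixed-precision plan could be derived, the example below picks the narrowest integer width per layer that keeps the round-trip quantization error under a tolerance. The selection rule, the layer names, the values, and the tolerance are all assumptions made for illustration; a real flow would more likely judge end-to-end model accuracy on a calibration set.

```python
def quant_error(values, num_bits):
    """Worst-case round-trip error of symmetric per-tensor quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0.0
    scale = max_abs / qmax
    return max(abs(v - round(v / scale) * scale) for v in values)


def choose_precision(layers, tolerance, candidates=(8, 16)):
    """Pick the narrowest candidate width that meets the tolerance for each layer."""
    plan = {}
    for name, values in layers.items():
        plan[name] = candidates[-1]           # fall back to the widest format
        for bits in candidates:
            if quant_error(values, bits) <= tolerance:
                plan[name] = bits
                break
    return plan


# Hypothetical layers: the second mixes one large value with several small ones
layers = {
    "conv1": [0.5, -0.25, 0.75, -1.0],
    "upsample": [12.0, 0.04, -0.03, 0.02],
}
plan = choose_precision(layers, tolerance=0.01)
print(plan)
```

Only the layer that needs the extra range is promoted to 16 bits, so the bandwidth cost of higher precision is paid where it matters and nowhere else.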



#### Techniques for Minimizing Bandwidth Requirements

- Multi-level Layer Fusion
  - Merging multiple folded layers into single primitives reduces feature map bandwidth
  - Merged layers can be fused into layer groups and tiled, taking advantage of L1 and L2 memories
- Coefficient Pruning and Compression
  - Coefficients with a zero value are skipped/counted; a compressed coefficient bitstream is created offline
  - Compression ratio can be increased through pruning and retraining
- Feature Map Compression
  - Lossless runtime compression and decompression of feature maps to external memory
  - Approx. 40% feature-map bandwidth reduction, exploiting sparsity
- Layer, Frame based and Feature Map Partitioning with DMA Broadcasting
  - Broadcast of common data across slices to minimize bandwidth of coefficients and feature-maps loading
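To make the feature map compression bullet above concrete, here is a minimal sketch of one lossless scheme that exploits sparsity: a presence bit per element plus the non-zero values. This is a generic bitmask encoding chosen for illustration; the slides do not describe the actual hardware format, and the sample feature map is hypothetical.

```python
def compress_bitmask(fmap):
    """Lossless encoding: one presence bit per element plus the non-zero values."""
    mask = [1 if v != 0 else 0 for v in fmap]
    values = [v for v in fmap if v != 0]
    return mask, values


def decompress_bitmask(mask, values):
    """Rebuild the original feature map from the mask and the packed values."""
    it = iter(values)
    return [next(it) if bit else 0 for bit in mask]


def compressed_bytes(mask, values, bytes_per_value=1):
    """Size of the encoding: packed mask bits plus the stored values."""
    mask_bytes = -(-len(mask) // 8)           # ceiling division
    return mask_bytes + len(values) * bytes_per_value


# Hypothetical 8-bit feature map with many zeros, e.g. after a ReLU
fmap = [0, 3, 0, 0, 7, 0, 1, 0, 0, 0, 5, 2, 0, 0, 9, 4]
mask, values = compress_bitmask(fmap)
saving = 1 - compressed_bytes(mask, values) / len(fmap)
print(f"{len(fmap)} bytes -> {compressed_bytes(mask, values)} bytes ({saving:.0%} saved)")
```

The saving depends entirely on how many zeros the feature map contains, so a dense map would grow slightly because of the mask overhead.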







#### Feature Map Partitioning / DMA Broadcasting


#### Advanced Data Bandwidth Reduction Techniques

Multi-level Layer Fusion and Multi-level Tiling
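The bandwidth effect of fusing layers and tiling them can be estimated with a simple traffic model. The sketch below is a back-of-the-envelope calculation under stated assumptions: it ignores coefficient traffic and tile halos, and the feature map sizes are hypothetical. The 384 KB L1 size is the per-core figure quoted later in this deck.

```python
def external_traffic_bytes(fmap_sizes, fused):
    """Bytes moved to or from external memory for a chain of layers.

    fmap_sizes[0] is the input of the first layer, fmap_sizes[-1] the output
    of the last one, and the entries in between are intermediate feature maps.
    """
    if fused:
        # Only the first input and the last output leave the chip
        return fmap_sizes[0] + fmap_sizes[-1]
    # Every intermediate map is written once and read back once
    return fmap_sizes[0] + 2 * sum(fmap_sizes[1:-1]) + fmap_sizes[-1]


def tile_fits(tile_fmap_sizes, l1_bytes):
    """A fused tile must hold each producer/consumer pair on chip at once."""
    pairs = zip(tile_fmap_sizes, tile_fmap_sizes[1:])
    return all(a + b <= l1_bytes for a, b in pairs)


# Hypothetical three-layer group with 8-bit feature maps (sizes in bytes)
sizes = [1_000_000, 4_000_000, 4_000_000, 1_000_000]
unfused = external_traffic_bytes(sizes, fused=False)
fused = external_traffic_bytes(sizes, fused=True)
print(f"unfused: {unfused} B, fused: {fused} B")

l1 = 384 * 1024
for tiles in (16, 32):
    per_tile = [s // tiles for s in sizes]
    print(f"{tiles} tiles fit in L1: {tile_fits(per_tile, l1)}")
```

The model shows why tiling and fusion go together: fusion removes the intermediate traffic, but only if the tiles are small enough for the intermediate data to stay in on-chip memory.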





#### Data Compression/Decompression

- Coefficient Pruning
  - Coefficients with a zero value are skipped/counted
  - Decompression done between local VM memory and NN datapath registers
  - Offline coefficient pruning (with retraining) can increase proportion of zero coefficients
    - Support of structured and unstructured sparsity
- Feature map compression/decompression
  - Runtime compression and decompression
  - NN core DMA supports HW compression mode
  - Typical measured bandwidth reduction of 40 to 45%
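The "skipped/counted" handling of zero coefficients can be sketched as a zero run-length encoding, where each non-zero coefficient is stored together with the number of zeros that precede it. This is a generic scheme used here for illustration; the actual compressed bitstream format is not described in the slides, and the coefficient values are hypothetical.

```python
def compress_zero_runs(coeffs):
    """Encode coefficients as (zeros_skipped, value) pairs plus a trailing zero count."""
    encoded, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1                          # count the zero instead of storing it
        else:
            encoded.append((run, c))
            run = 0
    return encoded, run


def decompress_zero_runs(encoded, trailing_zeros):
    """Expand the pairs back into the full coefficient vector."""
    out = []
    for run, value in encoded:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * trailing_zeros)
    return out


# Hypothetical pruned coefficient vector
coeffs = [0, 0, 5, 0, -3, 0, 0, 0, 7, 0]
encoded, trailing = compress_zero_runs(coeffs)
print(encoded, trailing)
```

Pruning followed by retraining increases the share of zeros, which lengthens the runs and shrinks the encoded stream.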





#### Structured Sparsity Can Improve Performance 2X

- Sparsity takes advantage of a matrix of numbers that includes many zeros or values that will not significantly impact a calculation
- Exploits sparsity in coefficients
  - Flexible use of sparsity in coefficient vectors in channel dimension
  - Effective speedup of 1.4X~1.8X with almost no accuracy loss
- Doubles the effective MACs on applicable layers
- Requires pruning and retraining
  - No accuracy loss for key model families:
    e.g. ResNet, ResNeXt, DenseNet, <u>BERT, GNMT</u>
  - Other models may have accuracy vs. performance tradeoffs
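A minimal sketch of structured pruning follows. It assumes an N:M pattern (keep the 2 largest-magnitude coefficients in every group of 4), which is a widely used form of structured sparsity; the slides do not state which pattern the hardware uses, so the pattern, function names, and values are assumptions.

```python
def prune_n_of_m(weights, keep=2, group=4):
    """Zero all but the `keep` largest-magnitude values in each group of `group`."""
    pruned = []
    for start in range(0, len(weights), group):
        block = weights[start:start + group]
        # Indices of the block sorted by descending magnitude
        order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
        kept = set(order[:keep])
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(block))
    return pruned


def sparse_dot(weights, activations):
    """Dot product that spends a multiply only on non-zero weights."""
    macs, total = 0, 0.0
    for w, a in zip(weights, activations):
        if w != 0.0:
            total += w * a
            macs += 1
    return total, macs


# Hypothetical coefficient vector along the channel dimension
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
pruned = prune_n_of_m(weights)
_, macs = sparse_dot(pruned, [1.0] * len(pruned))
print(pruned, f"{macs} of {len(weights)} multiplies needed")
```

Because the zeros fall in a predictable pattern, hardware can skip them without irregular indexing, which is how half the multiplies translate into doubled effective MACs on applicable layers. Retraining after pruning is what recovers the accuracy.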







#### Feature-map partitioning

- Higher throughput, up to *N*X
- Lower latency, up to *N*X, due to parallel processing of a layer
- Significant bandwidth reduction (via DMA broadcasting)

#### Feature-map partitioning – contd.

Spatial partitioning: Reuse weights across cores through a broadcast DMA

Input features



Weights / Coefficients
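Spatial partitioning can be sketched with a one-dimensional convolution: each core computes a slice of the output, and all cores use the same kernel, which is the data a broadcast DMA would deliver once. The 1-D setting, the function names, and the explicit halo handling are simplifying assumptions for illustration.

```python
def conv1d_valid(signal, kernel):
    """'Valid' 1-D correlation: output length is len(signal) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def spatial_partition(signal, kernel, num_cores):
    """Split the output range across cores; every core uses the same kernel."""
    k = len(kernel)
    out_len = len(signal) - k + 1
    per_core = -(-out_len // num_cores)       # ceiling division
    result = []
    for core in range(num_cores):
        start = core * per_core
        stop = min(start + per_core, out_len)
        if start >= stop:
            break                             # more cores than work
        # Each slice carries k - 1 extra "halo" samples from its neighbour
        tile = signal[start:stop + k - 1]
        result.extend(conv1d_valid(tile, kernel))   # kernel is the broadcast data
    return result


signal = [1, 4, 2, 8, 5, 7, 3, 6, 9, 0]
kernel = [1, 0, -1]
print(spatial_partition(signal, kernel, num_cores=4))
```

The partitioned result equals the single-core result, while the weights are fetched from external memory once instead of once per core.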



#### Feature-map partitioning – contd.

Channel partitioning: Reuse features across cores through a broadcast DMA

Input features



Weights / Coefficients
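Channel partitioning is the mirror image: each core owns a slice of the output channels (and therefore a slice of the weights), while every core reads the same input features. The sketch below models a layer as a matrix-vector product, which is a simplifying assumption; the names and values are hypothetical.

```python
def matvec(weight_rows, features):
    """One output channel per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight_rows]


def channel_partition(weight_rows, features, num_cores):
    """Give each core a slice of output channels; all cores read the same features."""
    per_core = -(-len(weight_rows) // num_cores)    # ceiling division
    outputs = []
    for core in range(num_cores):
        rows = weight_rows[core * per_core:(core + 1) * per_core]
        outputs.extend(matvec(rows, features))      # features are the broadcast data
    return outputs


# Hypothetical layer: 4 output channels, 3 input channels
weights = [[1, 0, 2], [0, 1, 1], [3, -1, 0], [2, 2, 2]]
features = [5, 6, 7]
print(channel_partition(weights, features, num_cores=2))
```

Here the features are loaded once and broadcast, and each core only needs its own share of the coefficients.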



### Quantifying the Benefits



### Synopsys Introduces ARC NPX6 NPU and MetaWare MX



- Scalable NPX6 architecture
  - 1 to 24 core NPU up to 96K MACs (440 TOPS\*)
  - Multi-NPU support (up to eight for 3500 TOPS\*)
- Trusted software tools scale with the architecture
- Convolution accelerator: MAC utilization improvements with emphasis on modern network structures
- Generic Tensor accelerator: flexible activation and support of Tensor Operator Set Architecture (TOSA)
- Memory hierarchy: high-bandwidth L1 and L2 memories
- DMA broadcast lowers external memory bandwidth requirements and improves latency

\* 1.3 GHz, 5 nm FFC worst-case conditions using sparse EDSR model
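For orientation, the dense peak throughput implied by the MAC count and clock can be computed directly. The sketch assumes the usual convention of two operations per MAC per cycle (one multiply, one add) and reads "96K" as 96 × 1024, which matches 24 cores of 4096 MACs each. It does not attempt to reproduce the quoted 440 TOPS, which according to the footnote is measured with a sparse model.

```python
def dense_peak_tops(num_macs, clock_hz, ops_per_mac=2):
    """Peak tera-operations per second, counting a MAC as multiply plus add."""
    return num_macs * ops_per_mac * clock_hz / 1e12


macs_96k = 24 * 4096                          # 24 cores x 4096 MACs = 96K MACs
dense = dense_peak_tops(macs_96k, 1.3e9)
print(f"dense peak: {dense:.1f} TOPS")
print(f"quoted 440 TOPS / dense peak = {440 / dense:.2f}x")
```

The ratio between the quoted figure and this dense peak falls inside the 1.4X to 1.8X effective speedup range given for structured sparsity earlier in the deck, which is consistent with the footnote, though the slides do not spell out the derivation.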



## Modular Toolkit Supports Control, DSP, Vision and ML Software Development



DesignWare® ARC® MetaWare MX Development Toolkit





- Integrated toolkit provides optimizing compilers, debugger, libraries and a simulator for development on ARC processors
- Includes Vector DSP and Linear Algebra Libraries (BLAS/LAPACK) and MATLAB Plug-In for Model-Based Design Environment
- MetaWare Neural Network SDK for enabling and optimizing Machine Learning and inference applications
- Includes simulation platforms for early software development and architectural exploration with MetaWare Virtual Platforms SDK
- Development of Computer Vision for pre- & post-processing eased with MetaWare Vision SDK

## Benchmark Performance vs. L2 CSM size and DDR Bandwidth

Result for selected NPX6-32K config – without structured sparsity

- NPX6 configuration: 8 NN cores \* 4096 MACs per core
- NN core internal memory (L1): 384 KB per NN core
- Cluster Shared Memory (L2): 0 to 16 MB
- Ext. DRAM bandwidth (L3): 16, 32, 64, 128, 256 GB/s
- 8-bit data

Figure: scaled\_yolo5 (960x544) on NPU32K (384 KB)




#### Performance Gains Obtained with Structured Sparsity

% FPS improvement with structured sparsity:

| Graph                | NPX6-4K                                       | NPX6-16K                                      | NPX6-64K                                      |
|----------------------|-----------------------------------------------|-----------------------------------------------|-----------------------------------------------|
| Inception v3         | 151%                                          | 142%                                          | 124%                                          |
| Inception v3 FHD     | 148%                                          | 148%                                          | 148%                                          |
| ResNet-50 v1.5       | 146%                                          | 147%                                          | 128%                                          |
| ResNet-50 v1.5 FHD   | 142%                                          | 147%                                          | 147%                                          |
| MobileNet v2         | 124%                                          | 133%                                          | 114%                                          |
| MobileNet v2 FHD     | 120%                                          | 121%                                          | 117%                                          |
| Yolo v3              | 152%                                          | 171%                                          | 165%                                          |
| Yolo v3 FHD          | 165%                                          | 164%                                          | 168%                                          |
| SSD-ResNet34         | 167%                                          | 171%                                          | 171%                                          |
| SSD-MobileNet        | 151%                                          | 138%                                          | 115%                                          |
| DeepLab v3           | 127%                                          | 129%                                          | 128%                                          |
| EDSR                 | 200%                                          | 191%                                          | 190%                                          |
| SRGAN                | 176%                                          | 173%                                          | 171%                                          |
| BERT_large           | 128%                                          | 135%                                          | 147%                                          |
| BERT_large (batch=4) | 128%                                          | 163%                                          | 166%                                          |
| Vit_B_16             | 144%                                          | 128%                                          | 154%                                          |
| Vit_L_16             | 132%                                          | 145%                                          | 149%                                          |
| Vit_H_16             | 129%                                          | 145%                                          | 144%                                          |
| swin_tiny            | 148%                                          | 148%                                          | 134%                                          |
| swin_small           | 156%                                          | 158%                                          | 136%                                          |
| swin_base            | 153%                                          | 163%                                          | 143%                                          |





#### **Open Neural Network Exchange**

The open standard for machine learning interoperability



 Helps solve the challenge of hardware dependency related to AI models

- Open format to represent both deep learning and traditional models
- Defines a common set of operators and file format
- AI developers can use models with a variety of frameworks, tools, runtimes, and compilers
- Enables deploying same AI models to multiple HW-accelerated targets





#### Support for Different Programming Frameworks



- MetaWare NN Compiler integrates with standard frameworks
- Automatic mapping to NPX6 and VPX5 vector DSP with no manual optimization required
  - User-driven optimization options:
    e.g. Latency, throughput, bandwidth
- Generated code can run on multiple development platforms
  - Fast Performance Models (FPM)
  - ZeBu H/W Emulator
  - HAPS FPGA board



#### State-Of-The-Art System Level Modeling And Analysis

Architecture Design

Software Development

Power profiling

Benchmarking & Profiling

#### Fast Performance Model

- Fast cycle-based Performance Model of NPX6 (and VPX5 cores)
- Integrated with Platform Architect simulation environments
- Virtualizer Virtual Prototyping
  - VDK (Virtualizer Development Kit) as an early software development platform

#### ZeBu Emulation

- Accurate performance and power modeling

#### HAPS Prototyping

- NPX6 mapped to a HAPS board provides cycle-accurate performance for benchmarking and software development











### Summary

- AI programming is a challenge amid evolving neural networks, the absence of a standard programming model, and the wide spectrum of HW types. A key challenge is limited memory bandwidth
- Synopsys advanced optimizations for AI include Mixed Precision Quantization to increase accuracy, Data Bandwidth Reduction techniques like multi-level tiling, Feature Map Partitioning to minimize bandwidth requirements, and Structured Sparsity utilization
- Synopsys MetaWare MX Development Toolkit supports different programming frameworks, different HW targets, is extensible, and includes state-of-the-art system level modeling





## Thank You