Neural Processing Units (NPUs), specialized processors designed for artificial intelligence (AI) neural networks and deep learning tasks, are being forced to evolve as technology progresses from Convolutional Neural Networks (CNNs) to transformer models, and now to Generative AI (GenAI) models. The growing number of parameters and the insatiable bandwidth requirements of GenAI – particularly Large Language Models (LLMs) – are driving changes in the data formats used in embedded AI hardware implementations. The changes include a trend toward lower-precision and floating-point formats such as the emerging OCP Microscaling (MX) data types.
Back in 2012, CNNs surged ahead of digital signal processing solutions to become the de facto standard for vision processing tasks like classification and object detection. The starting point for training and inference of CNN algorithms was the Floating-Point 32 (FP32) data type, but it did not take long for inference engines to find ways to optimize the power and area of CNN engines. This was especially important for edge device applications. With minimal loss in accuracy, Integer 8 (INT8) became the standard format for CNN algorithms in high-throughput use cases. TensorFlow – the dominant AI framework at the time – provided robust support for INT8, although using the data type required post-training quantization and calibration.
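As a rough illustration of what post-training quantization and calibration involve, the sketch below (plain NumPy, not tied to TensorFlow's actual quantization flow) maps FP32 weights to INT8 with a symmetric per-tensor scale derived from the observed value range:

```python
import numpy as np

def quantize_int8(weights_fp32):
    """Symmetric per-tensor post-training quantization to INT8.

    The scale is calibrated from the maximum absolute value so the
    FP32 range maps onto [-127, 127] with minimal clipping.
    """
    scale = max(float(np.max(np.abs(weights_fp32))) / 127.0, 1e-12)
    q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# Example with a hypothetical 3x3 convolution kernel
w = np.random.randn(3, 3).astype(np.float32)
w_q, s = quantize_int8(w)
print("max quantization error:", np.max(np.abs(w - dequantize_int8(w_q, s))))
```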
When transformer neural networks emerged in 2017 with Google’s “Attention Is All You Need” paper, the addition of attention mechanisms made transformers more sensitive to INT8 quantization than the CNNs used for image classification. FP16 and Brain Float 16 (BF16) became common fallback data types for transformers.
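One reason BF16 works well as a fallback is that it keeps FP32's 8-bit exponent and simply shortens the mantissa from 23 bits to 7 bits, preserving dynamic range while halving storage. Here is a minimal NumPy sketch of that relationship (using truncation for clarity, whereas real converters typically round to nearest even):

```python
import numpy as np

def fp32_to_bf16_bits(x):
    """Convert FP32 to BF16 by keeping only the upper 16 bits
    (sign + 8-bit exponent + 7 mantissa bits)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b):
    """Expand BF16 bits back to FP32 by zero-filling the low mantissa bits."""
    return (np.asarray(b, dtype=np.uint32) << 16).view(np.float32)

x = np.array([3.14159, 1e-8, 65504.0], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # values survive with reduced precision
```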
Transformers enabled the current era of GenAI models, but GenAI models’ parameter counts are several orders of magnitude larger than those of CNNs and many Vision Transformers. A typical CNN algorithm might require 25M parameters, while ChatGPT requires 175B parameters. This has created a mismatch between compute and memory bandwidth requirements for an NPU. As you can see in Figure 1, the performance of GPUs targeting AI neural network workloads is growing much faster than interconnect bandwidth capabilities.
Figure 1: The growth of AI performance (TOPS) is outpacing the growth in interconnect bandwidth (GB/s)
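To put that parameter-count gap into perspective, a back-of-the-envelope footprint calculation helps; the parameter counts below come from the text, while the bytes-per-parameter values are illustrative assumptions:

```python
def weight_footprint_gb(num_params, bytes_per_param):
    """Approximate weight storage only, ignoring activations and KV cache."""
    return num_params * bytes_per_param / 1e9

# ~25M parameters for a typical CNN, ~175B for a ChatGPT-class LLM (from the text)
print("CNN @ INT8  :", weight_footprint_gb(25e6, 1), "GB")     # ~0.025 GB
print("LLM @ FP16  :", weight_footprint_gb(175e9, 2), "GB")    # ~350 GB
print("LLM @ MXFP4 :", weight_footprint_gb(175e9, 0.5), "GB")  # ~88 GB, excluding shared scales
```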
GPUs are typically used for AI training and server workloads, while NPUs are the AI processors of choice for AI inference, where low power and small area are important requirements. This mismatch between compute capabilities and interface bandwidth is a greater challenge for NPUs as they take on GenAI workloads. NPUs used in edge devices typically have LPDDR5 memory interfaces, which are limited in bandwidth compared to the HBM interfaces often used in server applications.
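Because each generated token of an LLM streams essentially the full set of weights, decode throughput is bounded by memory bandwidth rather than compute. The sketch below makes that roofline-style bound explicit; the model size and the LPDDR5/HBM3 bandwidth figures are representative assumptions, not numbers taken from this article:

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on LLM decode rate when every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

model_size_gb = 7e9 * 2 / 1e9  # a 7B-parameter model stored in FP16 (~14 GB), illustrative

# Representative peak bandwidths (assumed for illustration):
lpddr5_x64_gb_s = 6400e6 * 8 / 1e9  # one 64-bit LPDDR5 channel at 6400 MT/s, ~51 GB/s
hbm3_stack_gb_s = 819.0             # roughly one HBM3 stack

print("LPDDR5-bound:", max_tokens_per_second(lpddr5_x64_gb_s, model_size_gb), "tokens/s")
print("HBM3-bound  :", max_tokens_per_second(hbm3_stack_gb_s, model_size_gb), "tokens/s")
```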
There are multiple ways that bandwidth can be reduced in an NPU, and one of the most direct is adopting smaller data formats.
In 2023, the OCP Microscaling Formats (MX) specification was published, introducing three floating-point formats and one integer format (MXFP8, MXFP6, MXFP4, and MXINT8). The MXFP8 format is adopted from the OCP 8-bit Floating Point specification (OFP8). Figure 2 shows these data types.
The four MX-compliant data types in Figure 2 all have an 8-bit exponent that is shared across a block of 32 numbers, which reduces the memory footprint and improves hardware performance and efficiency, in turn reducing overhead and operational costs. Another benefit of the MX data types is that FP32 or FP16 weights and activations can be “direct cast” (compressed/quantized) into these floating-point MX formats during offline compilation.
Figure 2: Microscaling (MX) data types from the OCP MX Specification v1.0.
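As a conceptual illustration of the shared-exponent idea and the direct cast described above, the following sketch quantizes one block of 32 FP32 values into an MXINT8-style representation with a single power-of-two scale. It is a simplification for clarity, not a bit-exact implementation of the OCP MX specification:

```python
import numpy as np

BLOCK_SIZE = 32  # one shared scale per block of 32 elements, per the MX concept

def mx_int8_direct_cast(block_fp32):
    """Direct-cast one block of FP32 values to a shared exponent + int8 elements.

    The shared scale is a power of two chosen so the largest element in the
    block fits into the int8 range. Simplified; not bit-exact OCP MX.
    """
    assert block_fp32.size == BLOCK_SIZE
    max_abs = float(np.max(np.abs(block_fp32)))
    shared_exp = 0 if max_abs == 0 else int(np.ceil(np.log2(max_abs / 127.0)))
    scale = 2.0 ** shared_exp
    elems = np.clip(np.round(block_fp32 / scale), -127, 127).astype(np.int8)
    return shared_exp, elems

def mx_int8_decode(shared_exp, elems):
    """Reconstruct FP32 values from the shared exponent and int8 elements."""
    return elems.astype(np.float32) * (2.0 ** shared_exp)

block = np.random.randn(BLOCK_SIZE).astype(np.float32)
exp, q = mx_int8_direct_cast(block)
print("max reconstruction error:", np.max(np.abs(block - mx_int8_decode(exp, q))))
```

Storage drops from 32 x 32 bits per block to 32 x 8 bits plus one 8-bit shared exponent, which is where the bandwidth and footprint savings come from.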
The need for these smaller data types in GenAI model implementations is changing the requirements for NPU architectures. Because narrow bit-width data formats reduce the computational and storage costs of GenAI models, NPUs must support these new and emerging formats.
Figure 3 shows Synopsys’ Processor IP offering for SoCs adding AI capabilities. The NPX6 NPU IP provides an efficient, scalable AI inference engine. The VPX DSP IP, a family of VLIW/SIMD processors, targets a broad range of signal processing applications and can process custom neural network layers in addition to performing pre- and post-processing for neural network models.
Figure 3: The NPX6 NPU IP and the VPX DSP IP provide an integrated solution for neural network processing, future-proofing, and pre- and post-processing.
The Synopsys NPX IP and VPX IP families now include new AI Data Compression Options which, when combined with a floating point unit (FPU) option, add support for INT4, BF16, OCP-FP8 and OCP-MX data compression to any ARC NPX NPU IP processor or VPX DSP IP processor. These AI Data Compression Options are fully compliant with the OCP specifications based on the OCP 8-bit Floating Point Specification (OFP8) (Rev. 1.0, Approved: June 20, 2023), and the OCP Microscaling Formats (MX) Specification (Version 1.0, Sep 2023).
The AI Data Compression Options perform data format conversions quickly in the DMA, decompressing data when moving from system memory into internal memory and compressing data when moving from internal memory to system memory. As an example, for the NPX6, an MXFP6 format would be converted to an FP16 format for internal processing. The use of FP16 for internal calculations does not limit overall performance because LLMs executing on the NPX6 NPU IP are not compute bound; the bottleneck is bandwidth. Figure 4 below shows the data types supported by the enhanced NPX6 NPU IP and VPX DSP IP. Many of these data types are supported in the DMA, and the table shows the internal data path each data type is mapped to.
Figure 4: Data types supported by the enhanced Synopsys ARC NPX6 NPU IP and Synopsys ARC VPX DSP IP families.
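To illustrate the kind of conversion the DMA performs on the way into internal memory, the sketch below decodes narrow, block-scaled floating-point elements and applies the block's shared power-of-two scale to produce values suitable for a wider FP16-class internal data path. The element parameters (an E2M3 layout with exponent bias 1 for an FP6-style element) are illustrative assumptions, not a description of the NPX6 hardware:

```python
def decode_small_float(bits, exp_bits, man_bits, bias):
    """Decode one narrow float element laid out as sign | exponent | mantissa.

    Handles normal and subnormal encodings; special values are omitted for
    brevity since the smallest formats encode few or none of them.
    """
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def decompress_block(elements, shared_exp, exp_bits=2, man_bits=3, bias=1):
    """Conceptual DMA-style decompression of one MX block to wider floats.

    Each narrow element is decoded and multiplied by the block's shared
    power-of-two scale; the results would then feed the internal FP16 path.
    """
    scale = 2.0 ** shared_exp
    return [decode_small_float(b, exp_bits, man_bits, bias) * scale
            for b in elements]

# Decode a few hypothetical 6-bit codes that share an exponent of -2
print(decompress_block([0b001100, 0b101100, 0b000011], shared_exp=-2))
```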
Since the VPX and NPX support the same data types, it is easy to transfer parameters or activations between processors using the new formats. Integrating these data types into the DMA reduces bandwidth and memory footprint. Another benefit of supporting multiple data types in the DMA is the ability to connect the processor IP directly to converters. For example, a 10-bit A/D converter can be connected to the NPX or VPX, and the hardware will automatically map its samples to an internal data type, saving a software conversion step.
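For a sense of the per-sample conversion work that direct converter connectivity avoids, here is a hypothetical software mapping of unsigned 10-bit ADC codes to a signed 16-bit internal type; the actual hardware mapping in the NPX and VPX may differ:

```python
import numpy as np

def adc10_to_int16(raw_samples):
    """Map unsigned 10-bit ADC codes (0..1023) to a signed 16-bit internal type.

    Subtract the mid-scale offset and shift left so the signal spans the
    int16 range; this is the kind of per-sample loop that the DMA-side
    format mapping removes from software.
    """
    raw = np.asarray(raw_samples, dtype=np.int32)
    return ((raw - 512) << 6).astype(np.int16)

print(adc10_to_int16([0, 512, 1023]))  # -> [-32768      0  32704]
```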
As GenAI models continue to evolve, they should follow the same trajectory that CNN models followed. Models will surge in parameter sizes until a satisfactory level of accuracy and efficiency is gained, and then research will pivot to optimizations, making the models more suitable for edge device applications. Both the enhanced Synopsys ARC NPX6 NPU IP and Synopsys ARC VPX DSP IP are available now for SoC designers who are interested in AI capabilities, including GenAI.