Issue 1, 2013
Multi-Gigahertz FPGA Signal Processing
How can an FPGA clocked at 500 MHz support gigasample per second data throughput rates? Chris Eddington, senior product and technical marketing manager in the FPGA and Systems Group at Synopsys and Baijayanta Ray, corporate application engineer for the Synphony Model Compiler product at Synopsys, explain how design teams can quickly and easily create parallel architectures to enable multi-gigahertz signal processing applications on FPGA devices. The data reproduced in this article was first published in Xilinx’s Xcell Journal.
The growth of wireless communications continues to put pressure on the wireless spectrum. As a result, there is an increasing need for bandwidth monitoring, and ultra-fast conversion from the time domain to the frequency domain plays a significant role in enabling high-speed signal processing. Consequently, design teams are seeking faster ways to implement the core building block used in gigahertz-speed signal processing: the fast Fourier transform (FFT).
One way to speed up FFT processing is to use more parallelism in the hardware architecture. In fact, this approach is essential because the highest achievable clock rates in an advanced FPGA are only several hundred megahertz, yet the target data sample rates for modern monitoring applications are measured in gigasamples per second (GS/s). Combining wideband ADCs with state-of-the-art FPGAs, such as Xilinx’s Virtex®-7, enables design teams to reach gigasample data rates at megahertz chip clock frequencies.
Communications protocols are becoming increasingly packetized, resulting in decreasing signal duty cycles. This requires a dramatic decrease in scan repeat time, which is only possible with low-latency FFT cores. This is another reason to use parallel FFTs; the latency scales down almost proportionally to the ratio of sample rate to clock speed.
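As a rough illustration of the sizing argument above, the following Python sketch estimates how many samples per clock a design would need to accept. The function name and the power-of-two rounding are assumptions made for this example (radix-2 structures are convenient to widen in powers of two), and the 7 GS/s and 500 MHz figures are illustrative inputs rather than values taken from the tables.

```python
def required_parallelism(sample_rate_hz: int, clock_hz: int) -> int:
    """Smallest power-of-two number of samples per clock that sustains the rate."""
    if sample_rate_hz <= 0 or clock_hz <= 0:
        raise ValueError("rates must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    minimum = -(-sample_rate_hz // clock_hz)
    # Round up to a power of two (an assumption of this sketch).
    parallelism = 1
    while parallelism < minimum:
        parallelism *= 2
    return parallelism


if __name__ == "__main__":
    # Illustrative values: a 7 GS/s input stream and a 500 MHz fabric clock.
    # 7e9 / 500e6 = 14 samples per clock, which rounds up to 16.
    print(required_parallelism(7_000_000_000, 500_000_000))
```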
Parallel FFT Architecture Choices
The Radix2 Multi-Path Delay Commutator kernel (Radix2-MDC) is one approach to FFT implementation that is highly suited to creating scalable, parallel architectures, especially for FPGA devices.
Figure 1a shows an FFT implementation based on the Radix2-MDC approach for an FFT of length 16, which exploits two parallel data streams and uses pipelining to ensure that there is the correct “distance” between the multi-path delay of each data element.
Figure 1a: Radix2-MDC kernel (Radix2 Multi-Path Delay Commutator)
Figure 1b: Using Radix2-MDC modularity to create parallel FFT architectures
Design teams can implement parallel variants of the architecture by increasing the width of the datapath and vector operations (Figure 1b).
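To make the radix-2 arithmetic behind these kernels concrete, here is a minimal software sketch of a radix-2 decimation-in-frequency FFT. It models only the butterfly and twiddle arithmetic, stage by stage; it does not model the delay and commutator elements of the MDC hardware, and the function names are chosen for this example only. Each butterfly consumes two samples at once, which is the property a two-stream datapath exploits.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_radix2_dif(samples):
    """Radix-2 decimation-in-frequency FFT; length must be a power of two."""
    n = len(samples)
    stages = n.bit_length() - 1
    if n < 1 or (1 << stages) != n:
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in samples]
    half = n // 2
    while half >= 1:  # one loop iteration per radix-2 stage
        for start in range(0, n, 2 * half):
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half]
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * half))
                data[start + k] = a + b                   # upper butterfly output
                data[start + k + half] = (a - b) * twiddle  # lower output, rotated
        half //= 2
    # In-place DIF leaves results in bit-reversed order; restore natural order.
    return [data[bit_reverse(k, stages)] for k in range(n)]


if __name__ == "__main__":
    # A length-16 complex tone at bin 3 should produce a single peak at index 3.
    tone = [cmath.exp(2j * cmath.pi * 3 * i / 16) for i in range(16)]
    spectrum = fft_radix2_dif(tone)
    print(max(range(16), key=lambda k: abs(spectrum[k])))  # prints 3
```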
The complex multiply operation is fundamental to FFT implementation, and how this is performed also affects the scalability of the architecture. Common choices include the 4-multiply (4M) and 3-multiply (3M) structures. For the lowest area, designers choose the 3M complex multiply, but for the best performance, 4M is the preferred choice. However, designers should also consider the target FPGA’s available DSP structures before making a final decision, as these may affect how efficiently different microarchitectures map to the final hardware.
An alternative to the multi-path delay approach for FFT implementation is the single-path delay feedback (SDF) FFT. Flow control and dynamic length reconfiguration are easier to implement with MDC structures, whereas incorporating a flow control (stall) signal into an SDF structure typically leads to a significant reduction in maximum throughput.
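The difference between the two structures can be sketched in a few lines of Python. The function names are chosen for this example, and integer operands stand in for fixed-point values. The 3M form trades one multiplier for three extra additions, which is why it tends to save area while lengthening the arithmetic path.

```python
def complex_multiply_4m(ar, ai, br, bi):
    """Direct form: four multiplies and two additions."""
    real = ar * br - ai * bi
    imag = ar * bi + ai * br
    return real, imag


def complex_multiply_3m(ar, ai, br, bi):
    """Shared-product form: three multiplies and five additions."""
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    real = k1 - k3  # equals ar*br - ai*bi
    imag = k1 + k2  # equals ar*bi + ai*br
    return real, imag


if __name__ == "__main__":
    # (3 + 4j) * (1 - 2j) = 11 - 2j in both forms.
    print(complex_multiply_4m(3, 4, 1, -2))  # prints (11, -2)
    print(complex_multiply_3m(3, 4, 1, -2))  # prints (11, -2)
```

When one operand is a twiddle factor known in advance, the terms `bi - br` and `br + bi` can be precomputed, leaving three multiplies and three additions at run time.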
FFT cores are available off the shelf, but these typically consist of “streaming” or “block” architectures capable of processing a maximum of one sample per clock. This constraint limits throughput to the maximum FPGA clock speed, which is insufficient for the sampling rates demanded by modern monitoring applications.
A parallel FFT architecture accepts multiple samples per clock and processes them concurrently to deliver multiple output samples per clock (Figure 2). While throughput increases, the penalty is increased area.
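A simple way to picture the multiple-samples-per-clock interface is to group the sample stream into fixed-width words, one word per clock. The sketch below does this and estimates how long one frame takes to enter the core. The helper names and the 500 MHz clock are assumptions for illustration, and the result is only the frame input time, not the full latency of any particular core.

```python
def to_parallel_words(samples, parallelism):
    """Group a sample stream into words of `parallelism` samples (one per clock)."""
    if parallelism <= 0 or len(samples) % parallelism != 0:
        raise ValueError("frame length must be a positive multiple of parallelism")
    return [samples[i:i + parallelism] for i in range(0, len(samples), parallelism)]


def frame_input_time_ns(fft_length, parallelism, clock_hz):
    """Time needed to clock one full frame into the core, in nanoseconds."""
    cycles = fft_length // parallelism
    return cycles * 1e9 / clock_hz


if __name__ == "__main__":
    frame = list(range(1024))
    words = to_parallel_words(frame, 16)
    print(len(words))                                    # 64 clock cycles per frame
    print(frame_input_time_ns(1024, 16, 500_000_000))    # 128.0 ns at an assumed 500 MHz
    print(frame_input_time_ns(1024, 1, 500_000_000))     # 2048.0 ns for one sample per clock
```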
Figure 2: Parallel FFT architectures process multiple samples at a time to achieve greater throughput than the target device clock frequency
Table 1 below shows tradeoffs for a typical Virtex-7 FPGA design. The data shows that as parallel throughput increases, multiplier (area) utilization increases with a slightly lower multiple (better than linear). Slower system clocks and timing closure yield sub-linear throughput growth as parallelism increases; on modern FPGAs, this degradation is getting smaller. Overall, it’s possible to achieve better-than-linear throughput/area growth. Finally, latency decreases as parallelism increases.
Table 1: Typical performance and area tradeoffs for parallel FFTs on Virtex-7 class devices
Note that the specific numbers in Table 1 are only valid for a given target and configuration of the FFT, which, in this case, is length = 1024, 16-bit input, dynamic length programmability (4-1024), and flow control. Flow control is very important for applications such as spectral monitoring, where side channel information is often utilized to change the FFT size and hence the resolution bandwidth, or to temporarily stall the FFT while other operations, such as acquisition, are going on. Theoretically, the design team can implement flow control by inserting buffers before the transform operation. In practice, for acquisition-driven operations like spectral monitoring, it is difficult to estimate the size of the buffer required, which results in the use of large, fast, and expensive memory banks.
Case Study: A 1024-length Parallel FFT for Virtex-7 and Virtex-5 FPGAs
We used Synopsys’ Synphony Model Compiler (MC), a model-based high-level synthesis tool, to demonstrate how parallel FFT architectures can map to commercially available FPGA devices. We chose to explore a set of parallel FFT results for Xilinx Virtex-7 and Virtex-5 FPGAs, using the configuration shown in Table 2.
Table 2: Case study FFT configuration parameters
Synphony MC’s library includes a parallel FFT IP block, which design engineers can use to target advanced FPGA devices like Virtex-7. The FFT is a custom IP block, which uses arithmetic primitives to create an architecture based on user-specified options such as length, precision, flow control, and dynamic programmability.
The vector support and custom block methodology in Synphony MC allowed us to create a concise, parameterizable parallel Radix2-MDC block. We used this block to instantiate parallel FFTs using different configurations, such as the subset of a 1024x16 parallel FFT datapath using the ‘PR2MDC’ custom blocks shown in Figure 3.
The PR2MDC blocks incorporate I/O that consists of a length 32 vector, which represents 16 complex samples of real and imaginary values, and a length and twiddle factor address parameter. Although the fixed-point word length grows by one bit at each stage, it is easy to adjust this by using the block’s parameters. Each block implements the parallel Radix2-MDC function (Figure 1b) using Synphony MC arithmetic operators such as add, shift and multiply. Designers can use Synphony MC to target specific microarchitectures available on their selected FPGA device. For this design, we chose a 4M complex multiply, which maps to the Xilinx DSP48E block using 18x25 multiply mode.
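The one-bit-per-stage growth mentioned above is easy to tabulate. The sketch below assumes no scaling or rounding is applied between stages, which is the case the block parameters would let a designer change; the function name is chosen for this example.

```python
def stage_word_lengths(fft_length, input_bits):
    """Word length after each radix-2 stage, growing one bit per stage."""
    stages = fft_length.bit_length() - 1
    if fft_length < 2 or (1 << stages) != fft_length:
        raise ValueError("length must be a power of two, at least 2")
    return [input_bits + stage for stage in range(1, stages + 1)]


if __name__ == "__main__":
    # A 1024-point FFT has 10 radix-2 stages; 16-bit inputs reach 26 bits.
    widths = stage_word_lengths(1024, 16)
    print(len(widths), widths[-1])  # prints 10 26
```

Left unadjusted, a 16-bit input would therefore need 26-bit words at the output of a 1024-point transform, which matters when matching operands to a fixed-width DSP multiplier.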
The block is designed to daisy chain the twiddle factor lookup for performance and flow control capability. The Synphony MC parallel FFT core shows no significant timing degradation between turning flow control on or off or using dynamic length programmability, which makes the throughput scalability it offers on advanced FPGA devices a useful feature when building real-time spectral monitoring applications.
Figure 3: The Synphony Model Compiler Parallel FFT IP block user interface and the resulting micro-architecture
We used Synphony MC to generate RTL optimized for Virtex-5 and Virtex-7 targets and Synopsys Synplify Pro with Xilinx ISE 14.2 for place and route. Table 3 shows the area and timing results following place and route. The actual scalability matches our theoretical predictions and the FPGA family capabilities. We achieved over 7 GS/s data throughput using a 16x parallel configuration with only approximately 6.3x the resources. The better-than-linear area growth is due to multiply operations that become fixed-coefficient as parallelism increases, which implement more efficiently in logic rather than using a full multiply in a DSP unit. This applies predominantly to the last K-point FFT (Table 3).
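The figures quoted above can be turned into two back-of-envelope numbers: the throughput gained per unit of area, and the minimum clock implied by the quoted throughput. The sketch below uses only the 16x, 6.3x, and 7 GS/s values from the text; since those are stated as "approximately" and "over", the results are approximate bounds rather than measured values.

```python
def throughput_per_area_gain(parallelism, area_ratio):
    """How much more throughput each unit of area delivers versus the 1x design."""
    return parallelism / area_ratio


def min_clock_hz(throughput_samples_per_s, parallelism):
    """Lowest clock frequency that sustains the throughput at this parallelism."""
    return throughput_samples_per_s / parallelism


if __name__ == "__main__":
    print(round(throughput_per_area_gain(16, 6.3), 2))  # prints 2.54
    print(min_clock_hz(7e9, 16) / 1e6)                  # prints 437.5 (MHz)
```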
Table 3: Parallel FFT results on Virtex-7 485T (-3) and Virtex-5 SX95T (-2)
Gigasample-per-second applications, such as spectral monitoring, require parallel FFT architectures to achieve the necessary throughput on devices clocked at a few hundred megahertz. The case study presented here shows that a parallel FFT design using the Radix2-MDC architecture can achieve over 7 GS/s throughput on FPGA devices in a 16x parallel configuration.
We developed the parallel FFT designs in a high-level design flow using Synphony Model Compiler. This enabled us to fine-tune the implementation for DSP hardware mapping across different Xilinx devices. Using Synphony MC, it was possible to explore the eight architectures characterized in Table 3 in less than a day.
For ASIC design, Synphony MC integrates with the Synopsys low power implementation flow. For more information, read the article Power-Performance Tradeoffs for Signal Processing Architectures.
About the Authors
Chris Eddington is Sr. Technical Marketing Manager for High-Level Synthesis at Synopsys and has over 20 years of experience in ASIC and FPGA design. He has held various roles in technical marketing, algorithm development and IC design at semiconductor companies that develop video and audio conferencing ICs and wireless communications systems. He holds an MS engineering degree from the University of Southern California and an undergraduate degree in Physics and Math from Principia College.
Baijayanta Ray is a corporate application engineer for the Synphony Model Compiler product at Synopsys. He received his PhD in wireless communication from Indian Institute of Technology, Kharagpur in 2010. Prior to Synopsys, he worked at National Instruments as part of the team that developed Signal Intelligence solutions.