Simplifying DSP Hardware Development
within a MATLAB-based Design Flow
Historically, exploiting FPGA or ASIC implementation of DSP algorithms has been the domain of companies with highly-skilled designers and large budgets. Now, a new generation of tools is bringing hardware-based DSP implementation within the reach of a much wider community through a new paradigm in DSP design: algorithmic synthesis. Eric Cigan, AccelChip, Inc. and Aaik van der Poel, Synopsys, cover the design flow issues in detail in the second part of this two-part article.
Part 1: A Quick Review
In the first part of this article, we made a case for enhanced productivity: the time-consuming nature of the traditional design flow, with a manual gap between algorithm development and implementation, is no longer acceptable. We also explained why algorithm developers prefer to use the MATLAB language and analysis environment, and why it is far removed from what hardware designers like to see as a starting point.
Now we take a peek under the hood of what algorithmic synthesis actually does when it automatically converts fixed-point MATLAB functional descriptions to Register Transfer Level (RTL) hardware descriptions, thus bridging the "exclusive knowledge" gap with an automated, repeatable and risk-reducing flow step. We will also highlight how the risk of misinterpretation can be decreased even further when we move from prototyping in FPGAs to production in ASICs.
Addressing New Terminology
A simple metaphor will help to put into context some potentially unfamiliar terms. The process of what happens during algorithmic synthesis can be compared to managing any project.
A project (design) has a desired outcome. There is a pool (library) of people (resources) available to accomplish the tasks (operations) that need to be performed in order to make the project successful. There will be certain outside influences, like cost, completion date and holidays (constraints, like area, timing and throughput), applied to the project that will force the manager (e.g., the AccelChip DSP Synthesis tool) to give certain tasks (operations) to certain team members (resources) based on experience, expertise and availability. Starting certain tasks will depend on others being finished, as in a project Gantt chart (data flow dependencies, or Data Flow Graph - DFG). In order to minimize the project cost (area) and maximize team performance (timing and/or throughput), tasks are assigned (allocated) to certain people and must start and finish at the appropriate time (scheduling). When things fall into place, and the constraints are not overly tight, the project will finish (automatically creating Control state-machine, Data-Path, and Memory Inference) with the desired results (RTL and Gate-Level netlists) on time and within budget (meet the design specifications). Because team members communicate in different languages (e.g., the MATLAB language for algorithms, and VHDL and Verilog during Implementation), a high quality translator (AccelChip) that adheres to industry standards (Design Compiler and Design Compiler FPGA) is needed.
Algorithmic Synthesis by Example
A simple algorithm is useful in illustrating the algorithmic synthesis process:
Y = (a*b) + c + (d - e) * f
In reality, a designer will assign constraints to this (untimed) algorithm. Constraint requirements often conflict, and compromises are needed. Algorithmic synthesis allows us to explore the design space quickly, without changing the original source description, so we can choose the solution that meets all constraints. An example of conflicting constraints is asking for an architecture that uses the minimum amount of silicon area (cheap) and also for an architecture that delivers the maximum throughput at the maximum speed.
A minimum amount of silicon area implies that a minimal set of resources will be used: one multiplier and one adder perform the four multiply and add operations, and one subtractor performs the subtraction. Because one resource is used for multiple operations, those operations cannot execute at the same time, so we have to control the usage of the resource. This determines the number of cycles, or schedule: three clock cycles in this case. To control the usage of the shared resources, a finite state machine is automatically created. The actual hardware components are mapped (allocated) onto the tasks, and the data-path portion of the design is created automatically. The result is one hardware component of each type (add, sub, mult) in three clock cycles as shown in Figure 1(a).
Figure 1: Alternative architectures: (a) Architecture 1 minimum Area. (b) Architecture 2 Maximum throughput.
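The minimum-area schedule described above can be reproduced with a small resource-constrained list scheduler. The sketch below is illustrative Python, not the algorithm used by any particular synthesis tool. The operation labels (m1, s1, a1, m2, a2) and the assumption that every operation takes exactly one clock cycle are ours.

```python
# Data flow graph for Y = (a*b) + c + (d - e) * f.
# Each entry: operation label -> (resource type, operations it depends on).
# The labels are hypothetical names chosen for this sketch.
DFG = {
    "m1": ("mult", []),            # a * b
    "s1": ("sub",  []),            # d - e
    "a1": ("add",  ["m1"]),        # (a*b) + c
    "m2": ("mult", ["s1"]),        # (d - e) * f
    "a2": ("add",  ["a1", "m2"]),  # final sum
}

def list_schedule(dfg, available):
    """Assign each operation to a clock cycle (1-based), assuming every
    operation takes one cycle and at most available[type] units exist."""
    schedule = {}
    cycle = 0
    while len(schedule) < len(dfg):
        cycle += 1
        used = {}
        for op, (rtype, deps) in dfg.items():
            if op in schedule:
                continue
            # Ready only if every predecessor finished in an earlier cycle.
            ready = all(schedule.get(d, cycle) < cycle for d in deps)
            if ready and used.get(rtype, 0) < available[rtype]:
                schedule[op] = cycle
                used[rtype] = used.get(rtype, 0) + 1
    return schedule

min_area = list_schedule(DFG, {"mult": 1, "add": 1, "sub": 1})
print(min_area, "-> cycles:", max(min_area.values()))
```

With one unit of each type the scheduler places the first multiply and the subtract in cycle 1, the first add and the second multiply in cycle 2, and the final add in cycle 3, matching the three-cycle schedule in the text.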
If we want to create a design with maximum throughput and speed, the algorithmic synthesis application has to know the performance of the actual hardware building blocks at our disposal (the technology library). We will assume that the smallest delays are found in the adders and the subtractor, and that both are 4ns. So if we choose a 5ns clock, those components will be able to finish before the next set of data appears at their inputs. This gives us the schedule from Figure 1(b). Note that the multipliers now extend over multiple clock cycles. A multicycle component could be used, but since one of the requirements is high throughput, a pipelined component is chosen. Using this component introduces some extra hardware in the form of registers that store the intermediate results so the multiplier can accept the next set of inputs. We see that the clock frequency has gone up by a factor of three at the cost of using two multipliers (which are also more costly than the ones used before). After the first pass, outputs will appear every 5ns. (Architecture 1 would have data appear at the output every 15ns, so we have also markedly improved throughput.)
In this brief introduction to algorithmic synthesis, we have seen two vastly different architectures being created automatically, each from a given set of constraints and without changing the original source code. Parts of the design that are often very time consuming to design by hand are generated automatically, including the control FSM, datapath and memory inference.
The Floating-Point to Fixed-Point Conversion Process
In MATLAB, all math operations are performed using 64-bit floating-point arithmetic by default. As noted above, fixed-point arithmetic is generally preferable for silicon implementation, but requires careful management of bit-width to prevent overflow and underflow conditions from disrupting the algorithm’s effectiveness. To automate the process, tools should analyze the floating-point MATLAB to determine the following fixed-point arithmetic properties:
- Bit Growth – To preserve the accuracy of an operation, the outputs of operators are allowed to grow in width. For example, the output of a multiplication is sized based on the sizes of its input arguments.
- Constant Inferring – Tools such as AccelChip DSP Synthesis will automatically determine the number of bits required to represent constant values. 
- Port Size Inferring – Based on the input stimulus vectors, tools can determine the required size of the input ports to support the dynamic range of the design.
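The first two items in the list can be sketched in a few lines. The growth rules used here (product width equals the sum of the operand widths; sum width equals the wider operand plus one carry bit) are the usual worst-case rules for unsigned integer arithmetic and are assumed for illustration, not taken from the AccelChip tool. All function names are hypothetical.

```python
def mult_width(wa, wb):
    """Full-precision product of wa-bit and wb-bit unsigned operands."""
    return wa + wb

def add_width(wa, wb):
    """Full-precision sum: the wider operand plus one carry bit."""
    return max(wa, wb) + 1

def const_int_bits(value):
    """Bits needed to hold a non-negative integer constant."""
    return max(1, int(value).bit_length())

print(mult_width(8, 8))      # 16
print(add_width(16, 12))     # 17
print(const_int_bits(200))   # 8
```

Applied along a data path, these rules show why widths increase from inputs to outputs unless bits are deliberately trimmed.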
Figure 2: MATLAB algorithm implementing 3-D vector rotation
In the case of the 3-D vector rotation example (Figure 2 above), for instance, there are two inputs to the design function: ang, a 3-vector that contains the three rotation angles, and R, a 3-vector with the Cartesian coordinates of the vector that will be rotated. The AccelChip tool analyzes the input stimulus vectors in the script file, examines the range of ang and R, and uses this information to determine the number of bits needed to hold the maximum value reached. It also analyzes the fractional parts to determine how many bits would be needed to represent the input stimulus exactly and uses this to determine the number of fractional bits required. Following these heuristics, the AccelChip tool allocates 9 bits of I/O for each component of ang and 1 bit for each component of R. This automation shows the merit in using stimulus data that is as representative as possible of the system that will be implemented in hardware. (In this case, however, it is likely the designer would have intended that the components of the variable R should be able to have more than one bit!) Alternatively, the AccelChip DSP Synthesis tool allows the user to override the automated floating-to-fixed-point conversion process through the use of design directives that explicitly specify quantization properties.
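A minimal sketch of stimulus-driven port sizing follows, to show why unrepresentative stimulus leads to undersized ports. The stimulus values and function names are hypothetical, and the heuristic is a simplified stand-in for whatever the tool actually does.

```python
from fractions import Fraction

def int_bits(stimulus):
    """Bits for the largest integer magnitude seen in the stimulus."""
    peak = max(int(abs(x)) for x in stimulus)
    return max(1, peak.bit_length())

def frac_bits(stimulus, max_bits=32):
    """Smallest number of fractional bits that represents every stimulus
    value exactly, capped at max_bits."""
    for f in range(max_bits + 1):
        scale = 2 ** f
        if all((Fraction(x) * scale).denominator == 1 for x in stimulus):
            return f
    return max_bits

narrow = [0.0, 1.0, 1.0, 0.0]      # hypothetical, unrepresentative stimulus
wider = [0.0, 1.0, 37.25, 12.5]    # hypothetical, broader stimulus
print(int_bits(narrow), frac_bits(narrow))
print(int_bits(wider), frac_bits(wider))
```

The narrow stimulus yields a one-bit port with no fractional bits, while the broader one yields six integer bits and two fractional bits for the same port.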
The tool accounts for bit growth through the algorithm as variables propagate, and it follows this process out to the output ports. Table 1 below shows a summary of the bit-widths and other fixed-point attributes for the 3-D vector rotation example.
Table 1 - Inference of port width and bit growth in 3-D vector rotation example
As a result of this automation, designers spend far less time with the tedious tracking of fixed-point arithmetic and can focus on the effectiveness of the algorithm at a higher level of abstraction.
DSP Silicon Intellectual Property
DSP algorithms in all applications frequently embed mathematical functions such as digital filtering, signal transformations, encoding/decoding, etc. These functions are readily available in a domain-specific analysis and simulation tool such as MATLAB for the purpose of algorithm development. Liberated from developing these mathematical functions on their own, algorithm developers can focus their energies on producing algorithms that differentiate products rather than recoding an FFT model in “C.”
Commercial IP developers have focused on development of cores that implement these essential mathematical functions. As shown in Table 2 below, a single core implementing an FFT may have numerous implementation choices that allow that mathematical function to be optimally suited to the particular application. These choices let the designer optimize a core’s performance – as measured in throughput and latency – and resource utilization / area.
Table 2 - Example of implementation alternatives for an FFT IP core
In theory, a designer can seek IP from dozens of silicon vendors and independent IP developers, choosing IP that is ideally suited to the application. In practice, however, DSP IP from all these sources has been developed piecemeal, leaving the designer with the challenge of stitching the design together and developing the design testbench. Worse yet, some DSP IP is intended only for use in a specific vendor’s silicon, leading to problems if the designer wishes to retarget from one vendor to another.
Table 3 - Representative selection of commercial DSP cores
Designers should seek commercial solutions that offer a wide range of IP for the given application. Table 3 shows the variety of cores that are available for DSP applications from AccelChip.
Use of IP in 3-D Vector Rotation Example
In the case of the example, the sine and cosine functions from the algorithm in Part 1/Figure 4 can be implemented using cores from AccelWare IP toolkits. As shown in the figure, the sine and cosine algorithms are written as two distinct function calls in MATLAB; however, since computation of sine and cosine functions in hardware uses the same intermediate quantities, AccelChip can synthesize hardware that computes both sine and cosine with a single core. The user can additionally specify whether the sine and cosine should be computed using CORDIC approximation, bipartite tables, or a linearly-interpolated lookup table. This integration of IP and algorithmic synthesis gives designers greater ability to optimize a design for area, performance, noise, and other constraints.
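To illustrate why one core can serve both functions, here is a minimal rotation-mode CORDIC sketch in Python. It is a textbook formulation in floating point, not the AccelWare implementation; a hardware version would use fixed-point values and replace the multiplications by powers of two with shifts. The function name and the iteration count are assumptions.

```python
import math

def cordic_sin_cos(angle, iterations=16):
    """Rotation-mode CORDIC for angles in [-pi/2, pi/2].
    A single iteration loop produces both sine and cosine."""
    # Elementary rotation angles atan(2^-i) and the accumulated gain.
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Start on the x axis, pre-scaled so the result has unit length.
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0       # rotate toward residual angle 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return y, x                            # (sine, cosine)

s, c = cordic_sin_cos(math.pi / 6)
print(s, c)
```

Both outputs come from the same x/y register pair, which is the sense in which sine and cosine share intermediate quantities.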
Algorithmic and Hardware Design Exploration
Design exploration means the process of starting with a nominal design and then proceeding to alter it in a structured fashion in order to enhance throughput or reduce the resources consumed. In the case of algorithmic synthesis, there are two levels at which this can be accomplished: algorithmic design exploration that is performed by the algorithm developer, and hardware design exploration that is performed by the system designer or hardware designer.
Algorithmic Design Exploration
Once the algorithm developer has made the design functionally correct, the next task is to make tradeoffs that preserve the fidelity of the design while making improvements for optimal implementation. One form of algorithmic design exploration is achieved by reducing the number of bits used to represent numbers, as described above. A second method, however, is to choose between different core implementations of functions in the DSP algorithm. While many IP providers offer fixed cores that support only one implementation or micro-architecture, some provide cores with a choice of micro-architectures.
As an example, Figure 3 shows the effect of different micro-architectures to implement a square root function. Depending on whether the designer chooses a bipartite table or linearly-interpolated lookup table, the design will exhibit different cycle times, resource usage, noise characteristics and so on. The choice of which core is preferable is a function of both the needs of the application as well as the resources available.
Figure 3: Effect of two different micro-architectures for square root core
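As a rough illustration of how one micro-architecture parameter trades resources against accuracy, the sketch below models a linearly-interpolated lookup table for square root and measures its worst-case error for several table sizes. The input range [1, 4], the table sizes and all function names are assumptions made for this example; it does not model the bipartite-table alternative or any vendor core.

```python
import math

def build_table(entries, lo=1.0, hi=4.0):
    """Sample sqrt at entries+1 evenly spaced points on [lo, hi]."""
    step = (hi - lo) / entries
    return [math.sqrt(lo + k * step) for k in range(entries + 1)], step

def sqrt_interp(x, table, step, lo=1.0):
    """Linearly interpolate between the two table entries around x."""
    pos = (x - lo) / step
    k = min(int(pos), len(table) - 2)   # clamp so k+1 stays in range
    return table[k] + (pos - k) * (table[k + 1] - table[k])

def worst_error(entries, samples=1000):
    """Largest absolute error over evenly spaced test points in [1, 4]."""
    table, step = build_table(entries)
    xs = [1.0 + 3.0 * i / samples for i in range(samples + 1)]
    return max(abs(sqrt_interp(x, table, step) - math.sqrt(x)) for x in xs)

for n in (8, 16, 64):
    print(n, "entries -> worst error", worst_error(n))
```

A larger table costs more memory but lowers the error, which is the kind of tradeoff a designer weighs when choosing between core micro-architectures.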
Hardware Design Exploration
While the MATLAB language is highly expressive in terms of mathematical behavior, it does not uniquely define the implementation of each command and function. For instance, if a multiply-accumulate function is performed within a loop that is executed 16 times, should it be synthesized to a single hardware multiply-accumulate for minimum hardware, to 16 multiply-accumulates for ideal throughput, or should it be somewhere in between?
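The choice posed by the 16-iteration loop can be put in numbers with an idealized cost model. The sketch assumes each multiply-accumulate unit completes one operation per clock cycle and ignores the extra adders needed to combine partial sums; the function name is hypothetical.

```python
import math

def mac_tradeoff(iterations=16, unroll=1):
    """Return (MAC units, clock cycles) for a loop of `iterations`
    multiply-accumulates unrolled by a factor of `unroll`."""
    return unroll, math.ceil(iterations / unroll)

for u in (1, 2, 4, 8, 16):
    units, cycles = mac_tradeoff(16, u)
    print(f"unroll={u:2d}  MAC units={units:2d}  cycles={cycles:2d}")
```

The two extremes are one unit taking 16 cycles and 16 units taking one cycle, with the intermediate unroll factors covering the "somewhere in between" cases.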
Algorithmic synthesis tools such as AccelChip DSP Synthesis allow the use of design directives, annotations to the MATLAB M-file that tell the algorithmic synthesis tool how to implement MATLAB commands and functions in hardware. An example of how directives can be used can be seen in the 3-D vector rotation example. The MATLAB command
R_rot=T_gamma*T_beta*T_alpha*R
as shown in Part 1/Figure 4 computes a product of three 3x3 matrices and a 3-vector. Because they require indexing through rows and columns and sequential computation of inner products, vector and matrix multiplications can be considered to be implicit loops. Design directives can be placed on vector or matrix multiplication, allowing loops to be unrolled by row, by column or by inner product. Table 4 shows how adjusting the level of implicit loop unrolling may be used to trade off throughput and estimated resources.
Table 4 - Effect of Unroll directives on 3-D vector rotation example
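The implicit loops hidden in a matrix-vector product become visible when the multiplication is written out. The Python sketch below is only an illustration of the loop structure; the placeholder matrices are ours, and the right-to-left grouping is chosen for brevity (it gives the same mathematical result as the MATLAB expression by associativity).

```python
def matvec(T, R):
    """3x3 matrix times 3-vector, written as the explicit loops that a
    single matrix multiplication operator hides."""
    out = [0.0, 0.0, 0.0]
    for row in range(3):           # outer loop: candidate for unrolling by row
        acc = 0.0
        for col in range(3):       # inner product: multiply-accumulate
            acc += T[row][col] * R[col]
        out[row] = acc
    return out

# Placeholder rotation: 90 degrees about the z axis, then two identities.
T_alpha = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T_beta = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_gamma = T_beta
R = [1.0, 0.0, 0.0]

R_rot = matvec(T_gamma, matvec(T_beta, matvec(T_alpha, R)))
print(R_rot)
```

Each of the two loops is a place where an unroll directive could replicate hardware: unrolling the outer loop computes several rows at once, while unrolling the inner loop parallelizes a single inner product.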
Specifying the Verification Environment
Fixed-Point Verification
Conversion to fixed-point arithmetic can mean a loss of precision, so it is necessary to verify that the fixed-point algorithm still meets specifications such as noise floors, bandwidth, and so on. But if this verification means creation of a new simulation model, designers can waste precious time. What many designers fail to realize is that floating-point MATLAB can be readily converted to fixed-point; this process is significantly easier than creating a new fixed-point model in an alternative language such as C. MATLAB models fixed-point precision by performing the arithmetic in the floating-point domain and using “constraints” to define the fixed-point representation of the data objects. These fixed-point constraints are specified through the functions “quantizer()” and “quantize()” provided with the Fixed-Point Toolbox for MATLAB. By inserting the quantize() and quantizer() functions into existing M-files, the designer can allocate the bit-widths to be used, define whether a number is signed or unsigned, and define the treatment of overflow and underflow conditions.
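The idea of modeling fixed-point values inside floating-point arithmetic can be sketched as follows. This is a Python stand-in written for illustration; it is not the Fixed-Point Toolbox API, and the function name, parameter names and rounding mode (round half up) are assumptions.

```python
import math

def quantize_value(x, word_bits, frac_bits, signed=True, overflow="saturate"):
    """Emulate fixed-point storage of x: round to the nearest multiple of
    2^-frac_bits, then saturate or wrap into the word_bits range."""
    scale = 1 << frac_bits
    code = math.floor(x * scale + 0.5)            # round half up
    lo = -(1 << (word_bits - 1)) if signed else 0
    hi = (1 << (word_bits - 1)) - 1 if signed else (1 << word_bits) - 1
    if overflow == "saturate":
        code = max(lo, min(hi, code))
    else:                                         # two's-complement wrap
        code = (code - lo) % (1 << word_bits) + lo
    return code / scale

print(quantize_value(0.3, 8, 4))                    # 0.3125
print(quantize_value(100, 8, 4))                    # 7.9375
print(quantize_value(100, 8, 4, overflow="wrap"))   # 4.0
```

The three calls show the three effects named in the text: loss of precision from a limited number of fractional bits, and two different treatments of overflow.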
Tools can leverage the “quantizer()” and “quantize()” functions of MATLAB by generating a fixed-point copy of the floating-point MATLAB source as necessary to set the required data path bit-widths. This approach has the benefit that it follows the accepted, de facto standard approach of MATLAB rather than inventing a new, proprietary approach. This approach is effective both for single operations, such as multiplying two variables, and also for more complex operations like vector and array operations where iterative multiply-accumulator operations take place. The fixed-point MATLAB model can then be easily simulated and its results can be compared with the original floating-point M-file. The resulting stimulus and results from the fixed-point simulation may then serve as the “golden” result that can drive the downstream hardware design flow.
RTL Verification
Effective hardware implementation of DSP algorithms invariably requires the designer to verify that the hardware matches the original design. In conventional DSP design, once the algorithm is developed the hardware designer has two tasks: to develop an RTL model for the hardware that will be synthesized and to develop a testbench, a software model that can be simulated with the RTL to assess whether it matches the algorithm developer’s intent.
When algorithmic synthesis tools are used, the need for a complete testbench becomes even more important as a means to verify the design. Because the original form of the DSP algorithm is in floating-point and it must be converted to fixed point, an additional level of verification is required. There will be some differences between floating- and fixed-point because of the effect of overflow and underflow conditions, so this level of verification must allow for an error tolerance between floating- and fixed-point designs. Once this translation is complete, however, the fixed-point design becomes the reference design against which the RTL model must be verified.
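The first verification level, comparing floating-point and fixed-point results within an error tolerance, might look like the sketch below. The algorithm (a multiply followed by an add), the 8 fractional bits, the stimulus and the tolerance are all hypothetical choices for illustration.

```python
def to_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2^-frac_bits."""
    return round(x * (1 << frac_bits)) / (1 << frac_bits)

def algorithm_float(a, b, c):
    """Floating-point reference model."""
    return a * b + c

def algorithm_fixed(a, b, c, frac_bits=8):
    """Same algorithm with every input and intermediate value quantized."""
    q = lambda v: to_fixed(v, frac_bits)
    return q(q(q(a) * q(b)) + q(c))

def verify(stimulus, tolerance):
    """Return (passed, worst absolute error) over the stimulus set."""
    worst = max(abs(algorithm_float(*s) - algorithm_fixed(*s))
                for s in stimulus)
    return worst <= tolerance, worst

stimulus = [(0.1, 0.2, 0.3), (0.5, 0.25, 0.125), (0.9, 0.9, 0.9)]
passed, worst = verify(stimulus, tolerance=2 ** -6)
print(passed, worst)
```

Once the fixed-point model passes this check, its outputs become the reference, and the comparison against RTL simulation results would be an exact match rather than a tolerance test.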
Designers should look for tools that automatically generate the testbench and that work with their preferred HDL simulator. The verification environment should ensure that the fixed-point algorithm is equivalent to the RTL, and ultimately the gate-level design. All derivatives of this design must conform to certain design metrics. These metrics are: Functionality, Fidelity, Bit Accuracy, Latency, and Throughput.
The Next Step: FPGA/ASIC Synthesis
The next step in implementing the algorithm is to further optimize the RTL description created by the constraint-driven algorithmic synthesis and convert it into a description ready for physical implementation in silicon. This is done through the use of FPGA and ASIC synthesis applications. The choice of hardware in which to implement the algorithm depends largely on the goal we have for its usage. FPGAs can be used and reused, which makes them an excellent choice for system prototyping. Because of their flexibility, FPGAs can be used for in-field or in-product updates when working with algorithms that are dealing with standards, or designs that are in flux. ASICs, on the other hand, consume less power, cost less per part, and are often capable of handling higher clock frequencies, but have much higher upfront costs and are more expensive and time-consuming to change once created.
When considering FPGAs for prototyping the most important question to ask is:
“How can I make absolutely sure that the (expensive) ASIC I am going to create behaves and functions the same as the FPGA prototype I created, tested and approved?”
Given the different technology basis for ASIC and FPGA we cannot ever be totally sure, but we can eliminate as many discrepancy risks as possible. (Figure 4 summarizes the difficulties in targeting for FPGAs and ASICs.) One obvious risk is using different tools for similar tasks, like FPGA and ASIC synthesis.
Figure 4: A unified synthesis environment reduces time-to-results and risk of misinterpretations
The industry standard for synthesizing an ASIC is Design Compiler® (DC) from Synopsys. Every ASIC vendor supports it, and it has proven itself over the last 15 years with more than 125,000 successfully completed production designs. Risk is reduced further, and quick time to results is provided, through the highly correlated toolset of the Galaxy™ platform. This provides for quick and successful convergence to silicon implementation. Using a tool other than Design Compiler FPGA (DC FPGA) for the FPGA prototype could bring unnecessary risk. Among other things, the original RTL source and provided constraints may be interpreted differently, creating slightly different implementations. This would nullify the reason we chose to prototype in the first place: to make sure that our design assumptions were correct by proving them in (changeable) silicon. Disappointments, cost overruns, and slipped schedules can be mitigated by using a unified environment for both the FPGA prototype and the production ASIC.
All this risk avoidance is moot if the synthesis tools don’t provide the quality of results (QoR) in a reasonable turn-around time (TAT). One of the reasons DC and DC FPGA perform so well is the use of Adaptive Optimization Technology (AOT). The technology results in the best performing designs that are created in the least amount of time. AOT works by addressing three optimization opportunities:
First it takes a global look at all the different synthesis optimization schemes available for a certain silicon technology and maps those out against the algorithm design to determine which scheme is best applied to each part of the design. Then it takes the chosen schemes and optimizes them in place to create the maximum impact on the different parts of the design. Finally it reorders the now optimized schemes in such a way that combined, they create the best performing designs in the least amount of time. The time saved can now be used to run more micro-architecture explorations on the algorithm under design, by changing the various constraints to produce the most cost effective design that fully meets the overall system specifications.
In the past, DSP design flows for hardware have lacked much of the automation that is expected in software-based implementations, and have not leveraged the widespread use of MATLAB. Numerous steps in the design process have now been automated by a new class of algorithmic synthesis products such as the AccelChip DSP Synthesis tool. Tasks such as the conversion from floating-point to fixed-point arithmetic can be accelerated by tools that account for bit growth and provide mechanisms for trimming bits.
Intellectual property cores for DSP can be integrated with algorithmic synthesis so that designers can perform design space exploration to meet design requirements while reducing costs. Algorithmic synthesis can also be readily combined with unified RTL synthesis tools for FPGA and ASIC to minimize the likelihood of deviations between FPGA prototypes and production ASICs. Verification flows for algorithmic synthesis allow use of existing MATLAB and RTL simulation products through use of self-checking RTL testbenches.
AccelChip Inc. white paper, “Automatic Conversion of Floating-Point to Fixed-Point MATLAB,” January 13, 2004.
Eric Cigan
Product Marketing Manager, AccelChip Inc.
As product marketing manager for AccelChip Inc., Eric Cigan is responsible for product planning and promotion for the AccelChip product family. He has more than fifteen years' experience in the EDA industry.
He was most recently at Mentor Graphics, Inc., where he managed product marketing and business development in Mentor's hardware/software co-verification business. Prior to this, Cigan held positions as aerospace/automotive segment manager for Analogy (now part of Synopsys) and product marketing manager, account manager and research engineer at Integrated Systems, Inc. (now part of Wind River).
Cigan began his career in control system design at the Lockheed Missiles & Space Company and the Charles Stark Draper Laboratory. Cigan holds S.B. and S.M. degrees in Mechanical Engineering from the Massachusetts Institute of Technology.
Aaik van der Poel
Group Marketing Manager, Synthesis, Synopsys
Aaik van der Poel currently oversees various aspects of the Synopsys (structured) ASIC, Behavioral, C-based, and FPGA synthesis product lines, and has over 20 years of EDA marketing, sales and support experience.
He joined Synopsys in 2000 and was responsible for the CoCentric SystemC synthesis product line introduction. Prior to joining Synopsys he held senior marketing and application positions at Mentor Graphics and Tektronix Europe and was a chip designer at ICN design house in the Netherlands. Van der Poel holds an M.S. in Electrical Engineering from the University of Twente in the Netherlands and a patent on isochronous (large) system design.
AccelChip and AccelWare are registered trademarks of AccelChip Inc. Design Compiler, Synopsys and the Synopsys logo are registered trademarks of Synopsys, Inc., and Galaxy is a trademark of Synopsys, Inc. MATLAB is a trademark of The MathWorks, Inc. All other trademarks or registered trademarks mentioned in this document are the intellectual property of their respective owners.
©2007 Synopsys, Inc. Synopsys and the Synopsys logo are registered trademarks of Synopsys, Inc. All other company and product names mentioned herein may be trademarks or registered trademarks of their respective owners and should be treated as such.