Performance of Different Multipliers in the DesignWare Building Block IP
-Arshid H Syed Sr. CAE
Multipliers are some of the most important components in datapath design. The DesignWare Library contains a variety of technology-independent, high-quality and high-performance IP blocks.
This article explores and compares the performance and functionality of the combinational, pipelined and sequential multipliers available in the DesignWare Building Block IP (DWBB).
Combinational Multiplier (DW02_mult): In this multiplier, widths of operands are parameterizable, and it supports both signed and unsigned operation. For more information on its usage please see the datasheet:
http://www.synopsys.com/dw/doc.php/doc/dwf/datasheets/dw02_mult.pdf
There are multiple implementations of the combinational multiplier: csa (carry-save array), wall (Booth-recoded Wallace-tree) and nbw (non-Booth recoded Wallace-tree for bit-width <41). These traditional implementations are fixed or static, and choosing between them allows the user to make area versus delay trade-offs. More information on these implementations is available in many computer arithmetic textbooks.
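To illustrate why a tree organization keeps the delay low, the following Python sketch models the reduction of partial products with 3:2 carry-save adders at the word level. It is a behavioral illustration under simplifying assumptions (unsigned operands, no Booth recoding, rows grouped three at a time). It is not the DesignWare implementation, and all function names are made up for this example.

```python
def partial_products(a, b, width):
    """One shifted copy of `a` for every set bit of the unsigned operand `b`."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def carry_save_add(x, y, z):
    """Word-level 3:2 compressor: x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def tree_multiply(a, b, width):
    """Reduce the partial products level by level until two rows remain.

    Returns the product and the number of carry-save levels that were used.
    """
    rows = partial_products(a, b, width)
    levels = 0
    while len(rows) > 2:
        next_rows = []
        # Compress every complete group of three rows into two rows.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fit into a group pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
        levels += 1
    # The final two rows are added by a carry-propagate adder.
    return sum(rows), levels


product, levels = tree_multiply(201, 173, 8)
print(product, levels)
```

In this model, 8 rows need four carry-save levels and 16 rows need six, so the depth grows much more slowly than the number of rows. A linear array, by contrast, adds one row per level.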
Starting with the 2004.12 release, a new implementation named "pparch" (parallel-prefix architecture) is available. This implementation is flexible and is dynamically generated based on context, e.g., area and timing constraints, and technology library. It exploits the characteristics of different implementations and generates the optimal architecture.
The following compares the speed and area of the static implementations with those of the flexible "pparch" implementation.
Synthesis results depend on the constraints and technology libraries used. The results in Table 1 were obtained under the following conditions and constraints:
- Designs – simple, unsigned multiplication operation a*b (widths= 8, 16, 32 and 64)
- Design Compiler version – Y-2006.06-SP2
- Library – TSMC 90 nm
Flow & Constraints:
read_verilog simple_design.v
set_max_delay -from [all_inputs] -to [all_outputs] 0
set_max_delay 0 [all_outputs]
set_max_area 0
compile
#reports generation
Table 1: Synthesis results for DW02_mult
| | csa* | wall | pparch | csa* | wall | pparch | pparch |
| Bit-width | Delay (ns) | Delay (ns) | Delay (ns) | Area (gates)** | Area (gates)** | Area (gates)** | Throughput (MOPS)*** |
| 8 | 1.72 | 1.20 | 0.98 | 1179 | 1157 | 1203 | 1020 |
| 16 | 3.25 | 1.60 | 1.48 | 4954 | 4153 | 3270 | 675 |
| 32 | 6.27 | 2.14 | 1.92 | 19046 | 13213 | 11500 | 520 |
| 64 | 12.58 | 2.86 | 2.54 | 58266 | 42156 | 35466 | 393 |
* All implementations except "csa" require a DesignWare license
** 1 nand2x1 gate = 2.4192 Lib Area
*** Throughput in million operations per second (MOPS) = 1000 / delay in ns
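Table 1 shows that the flexible "pparch" implementation gives the shortest delay at every bit-width and the smallest area for widths of 16 bits and above. At 8 bits its area is slightly larger than that of "csa" and "wall".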
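The footnote conversions for Table 1 can be reproduced with a few lines of Python. This is only a sketch: the function names are illustrative, and the truncation to whole MOPS is an assumption inferred from the table values (for example, 1000 / 1.48 = 675.7 is listed as 675).

```python
NAND2X1_LIB_AREA = 2.4192  # library area of one nand2x1 gate (Table 1 footnote)


def gate_equivalents(lib_area):
    """Convert a reported library area into nand2x1 gate equivalents."""
    return lib_area / NAND2X1_LIB_AREA


def throughput_mops(delay_ns):
    """Million operations per second when one result is produced every delay_ns."""
    return int(1000 / delay_ns)  # truncation is assumed from the table values


# pparch delays in ns, copied from Table 1.
PPARCH_DELAY_NS = {8: 0.98, 16: 1.48, 32: 1.92, 64: 2.54}

for width, delay in PPARCH_DELAY_NS.items():
    print(width, throughput_mops(delay))
```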
Depending on system requirements, a designer may choose an alternative multiplier from the DesignWare Building Block IP: pipelined multipliers (DW_mult_pipe or DW02_mult_n_stage, where n is 2, 3, 4, 5 or 6) or sequential multipliers (DW_mult_seq).
Please refer to the application note (AN 96-002) for information on throughput of combinational and pipelined multipliers:
http://www.synopsys.com/dw/doc.php/doc/dwf/manuals/dw_fdn_appnotes.pdf
Pipelined Multipliers
DW02_mult_n_stage: These multipliers are hard coded for n = 2, 3, 4, 5 and 6. The widths of the operands are parameterizable, and both signed and unsigned operation are supported. Automatic pipeline retiming ensures optimal placement of pipeline registers within the multiplier to achieve maximum throughput. For more information, please refer to the data sheets available at:
http://www.synopsys.com/dw/doc.php/doc/dwf/datasheets/math_arith_overview.pdf
DW_mult_pipe: The widths of the operands and number of pipeline stages are parameterizable in this multiplier, and it supports both signed and unsigned operation. Automatic pipeline retiming ensures optimal placement of pipeline registers within the multiplier to achieve maximum throughput. Also, it has parameterizable stall and reset modes. For more information please see the data sheet:
http://www.synopsys.com/dw/doc.php/doc/dwf/datasheets/dw_mult_pipe.pdf
The recommended synthesis methodology for pipelined designs is described in guideline 12 of the following white paper, "RTL Coding Guidelines for Datapath Synthesis":
http://www.synopsys.com/coding_guidelines.pdf
Here is the sample synthesis script:
read_verilog multiplier_instantiation.v
set clk_per 2
create_clock [find port inst_clk] -period $clk_per
compile
set_dont_touch *_reg_reg* true
set clk_per 1
create_clock [find port inst_clk] -period $clk_per
set_max_area 0
optimize_registers
compile -incr
#reports generation
DW_mult_pipe provides stall and reset modes, which DW02_mult_n_stage does not have.
The delay and area differences between DW02_mult_n_stage and DW_mult_pipe are marginal (see Tables 2 and 3), since the underlying implementation is the same for both.
Table 2: Synthesis results for 2-stage pipelined multipliers
| | DW02_mult_2_stage | | | DW_mult_pipe (2 stages) | | |
| Bit-width | Delay* | Area** | Throughput*** | Delay* | Area** | Throughput*** |
| 8 | 1.07 | 1093 | 934 | 1.05 | 1176 | 952 |
| 16 | 1.16 | 3653 | 862 | 1.16 | 3627 | 862 |
| 32 | 1.36 | 12196 | 735 | 1.40 | 12212 | 714 |
| 64 | 1.67 | 44352 | 598 | 1.63 | 44415 | 613 |
Table 3: Synthesis results for 3-stage pipelined multipliers
| | DW02_mult_3_stage | | | DW_mult_pipe (3 stages) | | |
| Bit-width | Delay* | Area** | Throughput*** | Delay* | Area** | Throughput*** |
| 8 | 1.0 | 1253 | 1000 | 1.00 | 1197 | 1000 |
| 16 | 1.0 | 3877 | 1000 | 1.01 | 3911 | 990 |
| 32 | 1.11 | 13740 | 900 | 1.02 | 13838 | 980 |
| 64 | 1.29 | 45535 | 775 | 1.22 | 47305 | 819 |
* Delay in ns
** 1 nand2x1 gate = 2.4192 Lib Area
*** Throughput in million operations per second (MOPS) = 1000 / delay in ns
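As a rough way to read Tables 2 and 3 against Table 1, the Python sketch below computes the throughput gain and the area overhead of the 64-bit pipelined multipliers relative to the combinational "pparch" result. The numbers are copied from the tables. The comparison itself is an illustration rather than a result from the original measurements, since the runs used different constraints, and the names are made up for this example.

```python
# (delay_ns, area_gates) for 64-bit designs, copied from Tables 1 to 3.
COMBINATIONAL_PPARCH = (2.54, 35466)
PIPELINED = {
    "DW02_mult_2_stage": (1.67, 44352),
    "DW02_mult_3_stage": (1.29, 45535),
}


def compare(baseline, candidate):
    """Return (throughput gain, area ratio) of candidate over baseline.

    Throughput is 1000 / delay, so the gain reduces to a ratio of delays.
    """
    base_delay, base_area = baseline
    delay, area = candidate
    return base_delay / delay, area / base_area


for name, result in PIPELINED.items():
    gain, area_ratio = compare(COMBINATIONAL_PPARCH, result)
    print(f"{name}: throughput x{gain:.2f}, area x{area_ratio:.2f}")
```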
Sequential Multiplier (DW_mult_seq): DW_mult_seq is a sequential multiplier designed for low area, area-time trade-off, or high frequency (small cycle time) applications.
The widths of the operands and the number of clock cycles are parameterizable, and both signed and unsigned operation are supported. The registered input/output mode and the reset mode are also parameterizable. For more information please see the data sheet:
http://www.synopsys.com/dw/doc.php/doc/dwf/datasheets/dw_mult_seq.pdf
Flow and Constraints:
read_verilog multiplier_instantiation.v
# parameters: width, tc=0, input/output_mode = 1,
# num_cycles = 3, 4 etc
set clk_per 1
create_clock [find port inst_clk] -period $clk_per
set_max_area 0
compile
compile -incr
#reports generation
Table 4: Synthesis results for sequential multipliers (DW_mult_seq)
| | DW_mult_seq (3 cycles) | | |
| Bit-width | Delay* | Area** | Throughput*** |
| 8 | 1.20 | 783 | 277 |
| 16 | 1.69 | 2410 | 197 |
| 32 | 2.09 | 7425 | 159 |
| 64 | 2.64 | 25078 | 126 |
| | DW_mult_seq (4 cycles) | | |
| Bit-width | Delay* | Area** | Throughput*** |
| 8 | 1.00 | 625 | 250 |
| 16 | 1.48 | 1828 | 168 |
| 32 | 1.85 | 6127 | 135 |
| 64 | 2.38 | 19703 | 105 |
* Delay in ns
** 1 nand2x1 gate = 2.4192 Lib Area
*** Throughput in million operations per second (MOPS) = 1000 / (delay in ns * cycles)
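Because a sequential multiplier needs several clock cycles per result, its throughput formula includes the cycle count. The Python sketch below applies the footnote formula to the delays in Table 4. The function and variable names are illustrative, and truncating to whole MOPS is an assumption inferred from the table values.

```python
def sequential_throughput_mops(cycle_time_ns, cycles):
    """MOPS when one multiplication occupies `cycles` clock periods."""
    return int(1000 / (cycle_time_ns * cycles))


# Cycle times in ns per bit-width, copied from Table 4 and keyed by cycle count.
TABLE_4_DELAY_NS = {
    3: {8: 1.20, 16: 1.69, 32: 2.09, 64: 2.64},
    4: {8: 1.00, 16: 1.48, 32: 1.85, 64: 2.38},
}

for cycles, delays in TABLE_4_DELAY_NS.items():
    for width, delay in delays.items():
        print(cycles, width, sequential_throughput_mops(delay, cycles))
```

In Table 4, going from three to four cycles lowers both the area and the throughput at every bit-width, which is the area-time trade-off this component is intended for.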
Conclusion:
The DesignWare Library offers a wide variety of multipliers: combinational, pipelined and sequential. Users can select among them based on system requirements such as delay, area and throughput.
Please check the following link for the complete list of DesignWare Building Blocks, including Floating Point Components:
http://www.synopsys.com/dw/doc.php/doc/dwf/intro.pdf