Example:
1) Calculation of pixel coordinates and transformations in 3DGraphics requires the following arithmetic calculations:
Consider a point with (x, y, z) coordinates.
Then, the length of the vector (from origin) is:
l = √ x^{2} + y^{2} + z^{2}
To implement the above expression, 3 multipliers, 2 adders and a square root function are required.
Further examination shows that dividers come into the picture to calculate the unit vector, each coordinate is divided by the vector length x/l, y/l and z/l.
Example:
2) Consider calculations of 3D objects, like intersection of objects, where inverse of matrices are used. To get an inverse of a matrix, the determinant is calculated by adders and multipliers, and then all the elements of the transposed matrix are divided by the determinant. Hence the use of arithmetic operations in 3D graphic applications is extensive.
P = 

, then inverse of P is 1/determinant [transpose of P] 
Number Representation:
Based on the representation of numbers, arithmetic circuits are divided into FixedPoint and FloatingPoint.
Sign  8bit Based Exponent  23Bit Normalized Fraction 
[ 31]  [ 30 : 23 ]  [22:0] 
The following table lists several differences between representations:
Fixed Point Arithmetic  FloatingPoint Arithmetic 
Simple circuit  Complex circuit (due to rounding and normalization) 
Small area and faster, thus suitable for volume production  Large area and slower. It used to be suitable to low volume production. But with the advancement of technology, production costs are down. 
Less accurate (the result is truncated if it exceeds the size)  More accurate (high precision) 
Smaller range of values can be handled  Wider range of values can be handled 
The objective of this paper is to explore different divider types which can be effectively used in datapath designs. Among basic arithmetic operations, division is the most complex to implement due to its iterative nature. Quotient digit selection, overflow conditions, division by zero, shifts and subtract operation, contribute to the complexity of the dividers. Apart from complexity, dividers consume more area and introduce delay, making the design bigger and slower.
Performance of Arithmetic Circuits
Delay, Area and Power are used to evaluate the performance of arithmetic circuits.
One of the other quantitative measurements of performance of arithmetic circuits is “throughput”. In general, throughput is million operations executed per second (MOPS), which is 1000/delay in nano seconds.
The dividers can be categorized as combinational and sequential.
Combinational Divider
This is a basic divider in which the outputs (quotient and reminder) are the function of current inputs only. In combinational dividers (both fixed and floating point), the successive division operations start only after the preceding division operation is completed as shown in the Fig 1:
There are many ways to implement a combinational divider, as described in academic textbooks. Dividers can be implemented by shifting and subtracting. One of the common algorithms used for division is NewtonRaphson method and digitrecurrence methods.
The combinational divider is generally constructed with a series of carry propagation adders (CPA). Thus, the size and especially the propagation delay increases nonlinearly with the width of the operands (as the operand width increases, more and wider CPAs are used). So, when applying a stringent delay constraint, optimizing these cascading CPAs on the same combinational path is timeconsuming and complex.
Due to the reasons mentioned above, combinational dividers may be suitable for designs where bitwidth is smaller, typically around 16 bits or below. Throughput of combinational dividers drastically decreases with the increase in the bitwidth of the operation. The other disadvantages of combinational dividers are, once designed; there is no user control to speed up the operation or reduce the area.
Sequential Divider
In this type of divider, the outputs (quotient and reminder) are the function of the current inputs and past inputs as well. This makes use of some storage elements (memory) and clocking to store the past inputs. Here the division is performed one bit per clock cycle (or subcycles). In these types of dividers, the partial reminders are stored and division is performed by successively subtracting a set of numbers, each being 0 or a shifted version of the divisor.
In sequential dividers (both fixed and floating point), the completion of division operation is based on the number of clock cycles configured for. Therefore, the throughput is 1000/ (delay * number of clock cycles), so as you increase the number of cycles, the delay is reduced, but not as much as the increase of number of clock cycles. So, the product of delay and number of clock cycles becomes large, which results in less throughput.
The advantages of the sequential divider are:
The disadvantage of sequential dividers is that throughput becomes low as you increase the number of cycles to perform the division operation.
The basic diagram of a floating point sequential divider is shown in Fig 2:
The successive division operation starts only after the previous division operation is complete
as shown in Fig 3:
Pipelined Divider
For repeated division operations, register pipelines can be inserted to provide increased throughput over combinational or sequential dividers. For example, if you have inserted 4 pipeline registers (called a 5stage divider), it will have a latency of four clock cycles, which means that result of a division operation is available on the fifth clock cycle after the input values are clocked. However, because of its pipelined design, it can begin a new division operation on each clock cycle.
Based on the application and requirements, the number of pipelines can be built into the design.
In summary, the following table compares these three types of dividers:
Combinatorial Divider  Sequential Divider  Pipelined Divider 
Basic divider designed with a series of carry propagation adders  Combinational divider is one of the subblocks  Combinational divider is one of the subblocks 
Large area  Smaller area  Larger area 
Large delay  Smaller delay  Smaller delay 
Large throughput for smaller bitwidth (< 16)  Small throughput  Gives maximum throughput 
Runtime is higher for large bitwidths (>32)  Runtime is faster  Runtime is faster 
Suitable for designs with smaller bitwidths and low speed designs where throughput is important  Suitable for area centric and low throughput designs.  Suitable for delay centric designs 
DesignWare® Library has more than 160 technologyindependent, highquality, highperformance Building Block IP. The DesignWare Library Datapath and Building Block IP are tightly integrated into Design Compiler (DC) and are part of the DC installation. The Library has a number of arithmetic, combinational, sequential, and floating point components, and many other blocks such as FIFOs.
There are a number of different types of dividers available in DesignWare Library:
DW_div (Combinational Integer Divider): has configurable bitwidth, supports signed and unsigned operation and it can be inferable using function call.
DW_div_pipe (Pipelined Divider): The widths of the operands and number of pipeline stages are parameterizable in this divider, and it supports both signed and unsigned operation. Automatic pipeline retiming ensures optimal placement of pipeline registers within the divider to achieve maximum throughput. Also, it has parameterizable stall and reset modes.
DW_div_seq (Sequential Divider): The widths of the operands and number of clock cycles are parameterizable and it supports both signed and unsigned data operation. Also, it has parameterizable registered input /output mode and reset mode.
DW_fp_div (Combinational Floating Point Divider): The precision format is parameterizable for either IEEE single, double precision, or a userdefined custom format, accuracy conforms to IEEE 754 Floating Point standard.
DW_fp_div_seq (Floating Point Sequential Divider): The precision is parameterizable for either IEEE single, double precision, or a userdefined custom format. Accuracy conforms to IEEE 754 Floating Point standard, parameterized number of clock cycles, registered or unregistered input and outputs, internal register for the partial pipelining.
Data sheets for all the above components are available at: http://www.synopsys.com/dw/buildingblock.php
As mentioned earlier, division operations are more complex and have bigger area and longer delay when compared with other basic operations like addition or multiplication.
The combinational floating point divider DW_fp_div, which uses combinational divider (DW_div) as its basic block, has similar issues of large area and long delays, making it unsuitable for applications where area is critical.
The DesignWare Library floating point sequential divider, DW_fp_div_seq, is suitable for low area and high frequency applications. The final division result of DW_fp_div_seq is same as DW_fp_div, but consumes less area.
The following are the features of DW_fp_div_seq:

There are three parameters that control the registers. input_mode and output_mode place the input and output registers. early_mode makes a bypass from the input of the input register to the output of the input register. internal_reg enables the partial pipeline operation. Since the integer sequential divider consumes multiple clock cycles, the input data cannot be pipelined during the operation of the integer sequential divider. Once the integer sequential divider finishes the operation and passes the result to the normalization block, it can receive the next data from the input.
Figure 6: Simulation waveforms of DW_fp_div_seq with the following parameters (Input/output_mode = 0, early_start = 0, internal_reg = 1 and num_cyc = 5) 
Area and delay for combinational vs. sequential dividers (DW_fp_div vs DW_fp_div_seq)
To evaluate the advantages and disadvantages of sequential dividers versus combinational floating point dividers, a performance bench mark has been done. The area and delay performance depends on the context, constraints, and technology library. Note that the following data is derived from the simple instantiation of the components, so results could be different (possibly better) when components are used in the actual design.
The following is the recommended methodology for synthesizing DesignWare Components:
// search path setup target
// link and technology library setup
// read_verilog ./instantiation.v
// delay constraints
// area constraints
compile_ultra
compile –incr
// reports and netlist generation
The benchmark was done with 2007.03SP4 version of DC with TSMC 90nm library. The X and Yaxis represents delay in (ns) and area (lib area) respectively. The following curves are generated with several successive timing constraints with minimum area constraint.
Figure 7  Area vs. Delay of DW_fp_div (bitwidth = 16) 
Fig 7 shows the performance of 16bit floating point combinational divider DW_fp_div. Notice the sharp increase in the area as you go below 10 ns delay.
Figure 8  Area vs. Delay of DW_fp_div_seq (bitwidth = 16, num_cyc = 4) 
Figure 9  Area vs. Delay of DW_fp_div_seq (bitwidth = 16, num_cyc = 5) 
Fig 8 & 9 shows the performance of 16bit floating point sequential divider, DW_fp_div_seq, with num_cycles 4 and 5 respectively. Notice that the area is around 30% less for smaller delays when compared with the combinational divider. Thus, it gives additional control over area by using the num_cycles parameter.
The following figures compare the DW_fp_div, and DW_fp_div_seq with 32bit operands and different num_cyc values.
Figure 10  Area vs. Delay of DW_fp_div (bitwidth = 32) 
Figure 11  Area vs. Delay of DW_fp_div_seq (bitwidth = 32, num_cyc = 4) 
Figure 12  Area vs. Delay of DW_fp_div_seq (bitwidth = 32, num_cyc = 5) 
Of all the arithmetic operations, division is one of the most complicated and it often ends up in the critical path of many designs. Dividers can be implemented in a variety of ways as previously explained. Combinational dividers are useful for smaller bitwidths and lower frequency designs. For area centric and high frequency designs sequential dividers are most suitable.