
Markus Willems, *Product Marketing Manager, ARC Processor IP*, Synopsys

Traditionally, digital signal processors (DSPs) have featured architectures tailored to specific signal processing workloads. The combination of very long instruction word (VLIW) and single instruction multiple data (SIMD) architectures provides the parallel throughput needed for high computational performance, with data generally being 16-, 24-, or 32-bit fixed point. This was well suited to the algorithms used in applications such as voice/audio and communications. For recent generations of algorithms, however, the requirements have been changing, driven by higher dynamic range, productivity, and reliability demands.

The latest advanced driver assistance systems (ADAS) and RADAR/LiDAR applications require high accuracy in all parts of the computation path. At the front end, large-point FFTs, such as 4K, 8K, and above, cannot maintain bit accuracy in fixed point (overflows occur), so the highest-performance RADAR systems need floating-point data types, either single- or half-precision.
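The overflow problem follows directly from FFT bit growth: an N-point fixed-point FFT can grow output magnitudes by a factor of up to N, i.e. log2(N) extra integer bits are needed to stay bit-accurate. A back-of-the-envelope sketch in Python (the helper name `fft_bit_growth` is illustrative, not from any vendor library):

```python
import math

def fft_bit_growth(n_points, input_bits):
    """Worst-case bits needed at the FFT output.

    An N-point FFT can grow output magnitudes by up to a factor of N,
    so log2(N) extra integer bits are required to avoid overflow.
    """
    return input_bits + int(math.log2(n_points))

# 16-bit ADC samples through increasingly large FFTs:
for n in (1024, 4096, 8192):
    print(f"{n:5d}-point FFT needs {fft_bit_growth(n, 16)} bits")
```

With 16-bit samples, an 8K FFT already needs 29 bits of headroom, which is more than a 24-bit fixed-point data path comfortably provides, whereas the 8-bit exponent of float32 absorbs the growth automatically.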

Another example is predictive modeling for faster/smoother system responses, which is becoming more prevalent in a variety of automotive, automation and avionics applications. Predictive modelling uses a complex statistical model to predict an outcome based on various inputs from a wide array of sensors. Machine learning algorithms are often useful for matching states of the predictive modeling to help learn expected outcomes and further improve response times and output quality. The predictive modeling is best implemented in linear algebra-based algorithms using single-precision floating point to provide the dynamic range on computation data. The mathematical computation can be classified as linear algebra operations, which require a lot of computation on the DSP, to be accompanied by high-performance calculations of neural networks, most often using 8-bit integer data types.

Certain algorithms are very sensitive to the dynamic range. For example, implementing a matrix inversion using fixed point arithmetic requires significant scaling efforts, and typically requires certain constraints on the value ranges allowed for the individual matrix elements. That is why floating-point arithmetic is strongly preferred for such algorithms.
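A small Python sketch of this effect, using an almost-singular 2x2 matrix and a simulated Q15 quantization of its entries (the helper names are illustrative):

```python
def quantize(x, frac_bits):
    """Round to a signed fixed-point grid with the given fractional bits."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

def inv2x2(m):
    """Closed-form inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Nearly singular matrix: the true determinant is only 1e-4, so tiny
# quantization errors in the entries are hugely amplified in the inverse.
m_float = [[1.0, 1.0], [1.0, 1.0001]]
m_q15 = [[quantize(v, 15) for v in row] for row in m_float]

print(inv2x2(m_float)[0][0])  # ~10001 in floating point
print(inv2x2(m_q15)[0][0])    # ~10924 after Q15 quantization, ~9% off
```

Making this matrix safe for fixed point would require pre-scaling and range constraints on the entries, exactly the kind of effort that floating point avoids.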

Signal processing algorithms are often developed and analyzed using model-based approaches, such as MATLAB® or Simulink®, starting with floating-point arithmetic. Initially, all focus is on algorithmic functionality, with floating-point data types providing the maximum flexibility, with no need to spend effort on scaling and quantization.

*Figure 1: Ways of mapping signal processing algorithms to different DSP architectures*

If the implementation has to be done on a fixed-point architecture, two options exist, as illustrated in Figure 1: either emulate the floating-point behavior (which takes many cycles per operation), or convert the floating-point algorithm to a fixed-point algorithm (a time-consuming and tedious process). It is often during this phase that the algorithm design team realizes that the selected algorithm is not well suited for a fixed-point implementation, for the reasons mentioned above, and has to consider alternatives. Obviously, avoiding the float-to-fixed conversion results in a significant productivity improvement, eliminating the quantization process and providing a bridge from model-based design to final implementation.
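As a taste of what the conversion effort entails, here is a minimal Q15 quantization step in Python, with saturation and a check of the round-trip error (the function names are illustrative; a real conversion also involves scaling analysis across the entire data flow):

```python
def float_to_q15(x):
    """Convert a float in [-1, 1) to a Q15 (int16) value with saturation."""
    v = int(round(x * 32768.0))
    return max(-32768, min(32767, v))  # clip instead of wrapping around

def q15_to_float(v):
    return v / 32768.0

signal = [0.5, -0.25, 0.999, -1.0, 0.123456]
roundtrip = [q15_to_float(float_to_q15(s)) for s in signal]
errors = [abs(a - b) for a, b in zip(signal, roundtrip)]
print(max(errors))  # bounded by half a quantization step, 2**-16
```

Every intermediate variable in a real algorithm needs such a format decision, and every overflow or scaling mistake must be found by simulation, which is why the conversion is so tedious.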

Ready-to-use libraries are key to productivity. Prominent examples are BLAS (basic linear algebra subprograms) and LAPACK (Linear Algebra Package). These libraries are widely used for many signal processing applications, and they are provided in floating point.
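On the host side, these are the same libraries NumPy and SciPy dispatch to: a matrix product ends up in a BLAS gemm routine, and `np.linalg.solve` calls LAPACK's `*gesv` (assuming a NumPy build linked against BLAS/LAPACK, which is the default):

```python
import numpy as np

# Solve a x = b in single precision; np.linalg.solve uses LAPACK gesv.
a = np.array([[3.0, 1.0], [1.0, 2.0]], dtype=np.float32)
b = np.array([9.0, 8.0], dtype=np.float32)
x = np.linalg.solve(a, b)

# Products like this are dispatched to a BLAS gemm routine.
residual = a @ x - b
print(x)         # [2. 3.]
print(residual)  # ~0
```

Because BLAS and LAPACK are specified in floating point, a DSP with native floating-point support can run ports of these libraries directly, without a quantization step.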

Reliability is closely related to productivity. Floating-point arithmetic is defined by the IEEE 754 standard. Compliance implies that one can expect the same result when executing an algorithm as a plain functional model (a MATLAB model or C code) running on a host computer, on an instruction-set simulator of the DSP, or on the hardware itself (in the latter two cases executing a binary generated by a compiler targeting the DSP processor). Key to this is the availability of IEEE-compliant compilers for all these target systems.

As already mentioned, floating-point arithmetic can be executed on fixed-point hardware, in which case the floating-point behavior is emulated, but this approach results in a significant performance degradation. That might be acceptable if floating-point operations account for only a very limited portion of the workload. However, this is typically not the case for digital signal processing applications. (It is the typical case for classical controllers, whose payload is mainly integer control code, which is why many embedded controllers are integer machines.) An efficient implementation of floating-point algorithms requires dedicated floating-point hardware units, i.e., native floating-point support.
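To make the emulation cost concrete, here is a deliberately simplified soft-float multiply for positive, normal float32 values, using integer operations only, roughly what a fixed-point core must execute per floating-point multiply. A real emulation library additionally handles signs, zeros, subnormals, infinities, NaNs, and IEEE rounding, which costs further cycles:

```python
import struct

def f32_bits(x):
    """Reinterpret a float32 as its 32-bit integer encoding."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b):
    """Reinterpret a 32-bit integer encoding as a float32."""
    return struct.unpack('<f', struct.pack('<I', b))[0]

def softfloat_mul(x, y):
    """Multiply two positive, normal float32 values with integer ops only."""
    xb, yb = f32_bits(x), f32_bits(y)
    # 1. Unpack exponents and mantissas (restore the hidden leading 1).
    xe, xm = (xb >> 23) & 0xFF, (xb & 0x7FFFFF) | 0x800000
    ye, ym = (yb >> 23) & 0xFF, (yb & 0x7FFFFF) | 0x800000
    # 2. Integer-multiply the 24-bit mantissas (48-bit product).
    m = xm * ym
    e = xe + ye - 127
    # 3. Normalize: the product of two values in [1, 2) lies in [1, 4).
    if m & (1 << 47):
        m >>= 1
        e += 1
    # 4. Truncate back to 23 mantissa bits (real emulation must also round).
    m = (m >> 23) & 0x7FFFFF
    return bits_f32((e << 23) | m)

print(softfloat_mul(1.5, 2.5))  # 3.75
```

Even this stripped-down version needs roughly a dozen integer operations, shifts, and branches for a single multiply that a native floating-point unit completes in one issue.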

Historically, the main argument against floating-point architectures has been their impact on power, performance, and area (PPA). Floating-point units used more gates per functional unit than fixed-point units. Overall performance, as measured in cycles per second, suffered from a lower fmax, as the floating-point units were often on the critical path during synthesis optimization. These arguments no longer hold with modern DSP architectures and the latest process technologies.

Synopsys’ DesignWare® ARC® VPX DSP IP is a family of VLIW/SIMD processors. It supports multiple vector lengths, 128-bit, 256-bit and 512-bit, making it a versatile solution for a wide range of signal-processing and AI applications.

VPX DSPs feature a 4-way VLIW architecture optimally balanced to achieve high performance with low power consumption. Each VPX DSP core integrates a high-performance 32-bit scalar pipeline and a multi-slot vector processing unit supporting 8-bit, 16-bit, and 32-bit SIMD computations. Each VPX DSP core is capable of executing one scalar and three vector instructions per cycle. The VPX DSPs are supported by configurable instruction and data caches for scalar operations and vector closely-coupled memory (VCCM) with single cycle access for vector processing.

*Figure 2: Ready-to-use library functions, optimized for the VPX architecture and implemented in floating point*

Each VPX DSP core has up to three parallel floating-point processing pipelines, including two optional IEEE 754-compliant vector floating-point units (VFPUA, VFPUB) that support both full- (32-bit) and half- (16-bit) precision floating-point operations. The VPX cores also have the option to add a dedicated vector floating-point pipe (VFFC) that accelerates an extensive set of linear and non-linear algebra math functions, including division, √x, 1/√x, sin(x), cos(x), log₂(x), 2^x, e^x, and arctan(x). Such operations can be started every four cycles and execute in parallel with the standard math operations, as illustrated in Figure 3.

“Optional” refers to the fact that the VPX processor can be configured by the user to match the specific PPA requirements of the target application. This configurability allows for a profiling-based comparison of configurations with and without native floating-point support, which will be used for the following PPA analysis.

*Figure 3: Non-blocking pipeline executes three floating point operations in two issue slots*

Table 1 shows the results for two configurations, one without the VFPU units, the other with both VFPUA and VFPUB enabled. Both were synthesized for TSMC 12nm FFC. Adding the floating-point units results in an area increase of 10% - 15% when considering the cell logic only, and an increase of 5.6% - 11% if reasonable RAM size is taken into account. It is up to the designer to judge whether this moderate area increase is acceptable for the given application.

*Table 1: Relative area numbers comparing VPX variants with and without floating point units added*

In this context, performance refers to the elapsed time it takes to complete the execution of a certain algorithm. It is the number of cycles required for such an algorithm, divided by the clock frequency. For VPX, the maximum frequency (fmax) is not impacted by whether or not the VFPU units are included, thanks to the optimizations that went into the floating-point hardware units. Therefore, the performance will only depend on the number of cycles it takes to execute a certain algorithm. If key kernels require floating-point arithmetic (e.g. to cover the dynamic range), a fixed-point processor has to emulate the floating-point behavior. This results in a significant performance degradation. The impact on the overall performance will then depend on the relative cycle count of these kernels vs. the overall cycle count of the entire application.
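The relationship is simple enough to write down; the cycle counts below are made-up illustrative numbers, not VPX benchmarks:

```python
def exec_time_us(cycles, fmax_mhz):
    """Elapsed time = cycle count / clock frequency (microseconds for MHz)."""
    return cycles / fmax_mhz

# With fmax unchanged, a 10x cycle overhead from floating-point emulation
# translates directly into 10x elapsed time.
print(exec_time_us(50_000, 1000))   # native FP kernel:   50.0 us
print(exec_time_us(500_000, 1000))  # emulated FP kernel: 500.0 us
```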

Especially for battery-powered devices, energy might be the more relevant metric. It takes into account the number of cycles needed to complete a certain task and the power per cycle. For example, reducing the cycle count by 50%, thanks to sophisticated instructions, might result in energy savings of up to 50%, while peak power remains the same. That said, power of course remains an important parameter, as it determines thermal effects and the need for special cooling.
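In formula form: energy = power x time = power x (cycles / frequency). The numbers below are illustrative, not measured VPX data:

```python
def energy_mj(cycles, power_mw, fmax_mhz):
    """Energy in millijoules: power (mW) x elapsed time (s)."""
    seconds = cycles / (fmax_mhz * 1e6)
    return power_mw * seconds

base = energy_mj(1_000_000, 200, 1000)  # 0.2 mJ
half = energy_mj(500_000, 200, 1000)    # 0.1 mJ: half the cycles at the
                                        # same peak power halves the energy
print(base, half)
```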

As mentioned before, when you have to emulate floating point using a fixed-point architecture, the cycle count will increase significantly, and so will the energy consumption. The more relevant case is a scenario where the algorithm designer spent the effort to convert the floating-point code into fixed-point code, so the comparison is about fixed-point code executing on a fixed-point architecture vs. floating-point code executing on a floating-point architecture.

*Table 2: Total dynamic energy comparing float32 and int32*

Table 2 shows some benchmarking results, featuring real application kernels provided by customers. It distinguishes between ALU-intensive kernels (dominated by logical operations) and MAC-intensive kernels (dominated by multiply/accumulate operations). Fixed-point operations use int32 data types; floating-point operations use float32 data types.

For MAC-intensive operations, integer results in a 2.7% higher energy consumption than floating point. Why is this? Integer takes about 9% more cycles than floating point, mainly due to the cycles necessary for shifting and rounding. On a normalized scale, per cycle, integer is slightly more energy efficient than floating point, which might be due to the energy spent in multipliers.

For ALU-intensive operations, the picture is slightly different. Integer results in an 8% higher energy consumption than floating-point while taking 0.5% fewer cycles than floating point. As the cycle count is almost equal for both int and float, the energy difference indicates that floating point operations consume less power for the given instructions.

Depending on the application, the ratios between float and int might differ. And there will be algorithms that can be implemented with int16 data types, with half-precision (HP) floating point, or with any mix of those. Yet these benchmarks show that, thanks to the very sophisticated floating-point units of VPX, the difference in energy consumption will be very limited.

There is hardly any signal processing application that does not take advantage of some kind of AI functionality. AI algorithms are compute intensive, with multiply / accumulate (MAC) operations dominating. For any embedded application, AI algorithms rely on int8 and int16 data types. Depending on the performance requirements, AI functionality is executed on a dedicated accelerator, or on a processor. DSP processors offer excellent MAC support, as this is also needed for “classical” filtering operations. To enable the efficient implementation of AI algorithms, VPX provides native support for int8 and int16 data types. This comes in addition to native support for floating point.
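The dominant inner loop of such AI workloads is an int8 multiply-accumulate with a wide accumulator. A Python sketch (illustrative only; on VPX this maps onto vector MAC instructions):

```python
def dot_int8(a, b, acc_bits=32):
    """int8 x int8 dot product with a wide accumulator.

    Each product of two int8 values fits in 16 bits, so a 32-bit
    accumulator can absorb tens of thousands of MACs without overflow.
    """
    lo, hi = -(1 << (acc_bits - 1)), (1 << (acc_bits - 1)) - 1
    acc = 0
    for x, y in zip(a, b):
        assert -128 <= x <= 127 and -128 <= y <= 127, "not int8 range"
        acc += x * y
        assert lo <= acc <= hi, "accumulator overflow"
    return acc

print(dot_int8([100, -50, 127], [2, 3, -1]))  # 200 - 150 - 127 = -77
```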

For AI, software tool support is key. Graph mapping tools convert graphs described in frameworks such as TensorFlow and ONNX so they can execute on the underlying architecture. Synopsys’ NN SDK toolkit targets both the new NPX AI accelerators and the VPX processor IP. Using the NN SDK, one can explore whether a VPX-only implementation is sufficient or a combination of VPX and NPX is required.

An increasing number of signal processing applications uses floating point arithmetic, driven by sophisticated algorithms, and overall design productivity. Modern DSP processor architectures such as Synopsys’ VPX offer native floating-point support, optimized for PPA-efficiency. Thanks to the configurability of VPX it is possible to add floating point units as an option. This provides maximum design flexibility, to find the ideal balance between algorithm requirements, programming productivity, and implementation efficiency.