Faster, More Accurate Data Computation with a New Generation of DSPs

Graham Wilson, Product Marketing Manager, ARC Processors, Synopsys

Traditionally, Digital Signal Processors (DSPs) have had architectures constrained to specific signal processing applications. The combination of Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) architectures provides the parallel throughput needed for high computation performance, with data generally being 16-, 24-, or 32-bit fixed point. This was well suited to the algorithms used in applications such as voice/audio and communications. In previous generations of DSP algorithm development, system and software developers would develop system algorithms in MathWorks MATLAB® and, when moving an algorithm to the DSP, would convert the floating-point output of MATLAB to fixed point. This was deemed necessary because fixed-point data types were considered the domain of ‘real’ DSP developers, and because fixed point covered the computation needs in terms of range and accuracy with minimal hardware usage and power consumption.

Evolution to Greater Throughput with Floating Point

For recent generations of algorithms and computation, the requirements have been evolving for multiple reasons, one being time-to-market: products are differentiated with custom software algorithms while market windows for new products grow more aggressive. As a result, current systems may skip the conversion from floating point to fixed point and simply run floating point on the DSP. The need for higher accuracy in data computation is also driving new requirements. For example, newer applications such as automotive ADAS and RADAR/LiDAR require high accuracy in every part of the computation path. At the front end, larger FFTs, such as 4K, 8K, and higher point sizes, cannot maintain bit accuracy in fixed point (they overflow), so for the highest performance, floating point data types are needed, in either single or half precision.
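To see why large fixed-point FFTs overflow, note that the worst-case bit growth of a radix-2 FFT is one bit per stage, or log2(N) bits in total. The short C program below is an illustrative back-of-the-envelope check, not VPX-specific code, counting how many magnitude bits remain in a 16-bit word once those guard bits are reserved:

```c
#include <stdio.h>

/* Worst-case bit growth of an N-point radix-2 FFT is log2(N) bits:
 * each butterfly stage can double the magnitude of its inputs. A Q15
 * sample has 15 magnitude bits, so the guard bits quickly exhaust a
 * 16-bit word, forcing per-stage scaling (which discards precision)
 * or a move to wider or floating-point data types. */
int main(void) {
    const int word_bits = 16;   /* fixed-point container size */
    const int sign_bits = 1;
    for (int n = 256; n <= 8192; n *= 2) {
        int stages = 0;
        for (int m = n; m > 1; m >>= 1)
            stages++;           /* stages = log2(n) */
        printf("%5d-point FFT: %2d guard bits, %2d of %d magnitude bits left\n",
               n, stages, word_bits - sign_bits - stages,
               word_bits - sign_bits);
    }
    return 0;
}
```

At 4K points, the 12 guard bits leave only 3 of a Q15 sample's 15 magnitude bits, which is why floating point, with its exponent absorbing the dynamic range, becomes the practical choice.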

Predictive modeling for faster/smoother system responses is becoming more prevalent in a variety of automotive applications, especially ADAS, powertrain, and motor control/management. Predictive modeling uses a statistical model to predict an outcome from various inputs. The benefit of this type of implementation is that the DSP can run highly complex models and generate outputs more quickly, in response to a wider array of sensor and environmental inputs. Machine learning algorithms are also useful for matching states of the predictive model, helping to learn expected outcomes and further improve response times and output quality. Predictive modeling is best implemented as linear algebra-based algorithms using single-precision floating point, which provides the range and accuracy needed for the computation data; the mathematics classifies as linear algebra operations, which demand substantial computation from the DSP.
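As an illustration of the linear algebra underneath such models, the sketch below implements the predict step of a Kalman-style tracker (x' = A x, P' = A P A^T + Q) in single precision. It is a minimal plain-C sketch; the 4-element state size and function names are illustrative assumptions, not a VPX API:

```c
#include <stddef.h>

#define N 4  /* state dimension (illustrative) */

/* C = A * B for N x N single-precision matrices: the kind of dense
 * linear algebra kernel a predictive model runs on every sensor tick. */
static void matmul(float A[N][N], float B[N][N], float C[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

/* Kalman-style predict step: x' = A x, P' = A P A^T + Q.
 * Single precision provides the range without the overflow
 * management a fixed-point version would need. */
void predict(float A[N][N], float x[N], float P[N][N], float Q[N][N]) {
    float xn[N] = {0}, T[N][N], At[N][N];

    for (size_t i = 0; i < N; i++)          /* x' = A x */
        for (size_t k = 0; k < N; k++)
            xn[i] += A[i][k] * x[k];
    for (size_t i = 0; i < N; i++)
        x[i] = xn[i];

    for (size_t i = 0; i < N; i++)          /* At = A^T */
        for (size_t j = 0; j < N; j++)
            At[i][j] = A[j][i];

    matmul(A, P, T);                        /* T = A P   */
    matmul(T, At, P);                       /* P = T A^T */
    for (size_t i = 0; i < N; i++)          /* P += Q    */
        for (size_t j = 0; j < N; j++)
            P[i][j] += Q[i][j];
}
```

Every step reduces to matrix-vector and matrix-matrix products, exactly the dense floating point workload described above.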

Applications with high processing requirements, such as 5G wireless communications at data rates of 10 Gigabits per second, and automotive ADAS, RADAR, and LiDAR at over 200 Gigabytes per second of data throughput, need a DSP with very wide vector computation as well as highly parallel execution. These driving factors change the architecture and instruction set architecture (ISA) requirements on traditional DSP cores: the DSP must offer a very high level of computation throughput, including both single- and half-precision floating point. Traditional DSP processors, whose architectures and ISAs were initially focused on fixed-point data types, add floating point units that are often not optimal in terms of processing and power consumption. They also have limited instruction support for linear algebra, perhaps SQRT and/or 1/SQRT for matrix operations; for the complete range of linear algebra algorithms, the mathematical operations have to be emulated in software.

The Next-Generation DSP Architecture for a Data-Centric World

A new generation of DSP core is needed to meet these computation requirements in terms of data throughput for floating point and linear algebra. This new DSP is the DesignWare® ARC® VPX5 Processor IP, in which floating point and linear algebra vector computation are part of the native architecture, implemented efficiently with vector SIMD and VLIW techniques to provide an ultra-high level of parallel processing. The DesignWare ARC VPX5 Processor IP has four dimensions of parallel execution (Figure 1).

Figure 1: The 4 dimensions of parallel execution on DesignWare ARC VPX5 Processor IP

1st Dimension: Multiple SIMD Computation Engines for Floating Point

The fundamental vector length is 512 bits, which enables SIMD computation on 8-, 16-, and 32-bit integer data or half- and single-precision floating point data; a single 512-bit vector holds 64, 32, or 16 such elements, respectively. All the SIMD engines compute on 512-bit vectors, and this defines the upper computation capability of the ARC VPX5 processor.

For integer data of 8-, 16-, and 32-bit length, there are dual 512-bit SIMD computation engines coupled with three ALU processing units. This gives very high levels of computation for algorithms such as machine learning convolution operations (5x5, 3x3, and 1x1).
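As a reference point, the multiply-accumulate pattern inside such a convolution looks like the scalar C sketch below, with 8-bit data and weights feeding a 32-bit accumulator; the dimensions and layout are illustrative assumptions, and the dual integer SIMD engines execute many of these MACs per cycle:

```c
#include <stdint.h>

/* Scalar reference for a 3x3 convolution with 8-bit data and weights
 * and a 32-bit accumulator -- the multiply-accumulate pattern that
 * wide integer SIMD engines execute many lanes at a time. Image
 * dimensions and row-major layout are illustrative, not a VPX API. */
void conv3x3_i8(const int8_t *in, int w, int h,
                const int8_t k[9], int32_t *out) {
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            int32_t acc = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += (int32_t)in[(y + ky) * w + (x + kx)] *
                           k[(ky + 1) * 3 + (kx + 1)];
            out[y * w + x] = acc;
        }
    }
}
```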

For half- and single-precision floating point computation there are three vector SIMD computation engines, all supporting the maximum vector length of 512 bits. Two of them handle “regular” floating point vector operations, providing ultra-high performance for DSP functions such as FFTs and matrix operations.
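The core operation behind the FFT workloads these engines target is the radix-2 butterfly; a scalar single-precision reference is sketched below, while the vector engines evaluate a full 512-bit vector of such butterflies at a time:

```c
#include <complex.h>

/* One radix-2 decimation-in-time FFT butterfly in single precision.
 * An N-point FFT performs (N/2)*log2(N) of these; vector FP engines
 * evaluate many of them per cycle across SIMD lanes. */
static inline void butterfly(float complex *a, float complex *b,
                             float complex w) {
    float complex t = (*b) * w;  /* twiddle multiply */
    *b = *a - t;
    *a = *a + t;
}
```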

The third vector SIMD floating point engine is dedicated to linear algebra mathematical functions. This dedicated engine allows the offload and parallel computation of mathematical functions and will be further described in the 4th Dimension section.

2nd Dimension: Flexibility with Multi-Task Issue VLIW

Flexible allocation in the 4-issue VLIW scheme enables the processor to issue the maximum possible number of parallel operations. The VLIW scheme was developed in tight collaboration with the software compiler, so the compiler pre-allocates operations compiled from the original C code. The compiler, coupled with the VLIW architecture, enables operations to execute across the multiple SIMD engines in parallel.

As an example, Figure 2 shows how the compiler works with the VLIW allocation scheme to use just two VLIW slots to achieve parallel execution across three floating point SIMD engines, with optimum slot allocation and reduced instruction code size. The two vector SIMD floating point engines have a zero-cycle insertion delay, so vector data can be loaded into them every cycle. The linear algebra vector SIMD engine has an insertion delay of four cycles, so after data is loaded there is a wait of three extra cycles before new vector data can be loaded. The compiler pre-allocates across the VLIW slots for these different insertion delays, giving effective parallel execution across all three vector SIMD floating point engines; a plain-C illustration of this independence follows Figure 2.

Figure 2: Compiler allocation for three parallel vector FPU execution
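The same independence can be seen in plain C. In the hedged sketch below, the reciprocal square root maps naturally to the multi-cycle math engine while the multiply-add work suits the zero-insertion-delay FP engines; because the two computations are independent within and across iterations, the compiler is free to overlap them across VLIW slots as Figure 2 shows. The engine mapping is the compiler's job and is not expressed in the source:

```c
#include <math.h>

/* The reciprocal-sqrt work (math engine, multi-cycle insertion delay)
 * and the multiply-add work (regular FP engines, back-to-back issue)
 * are independent, so a VLIW compiler can schedule them in parallel
 * instead of serializing them. */
void normalize_and_scale(const float *x, const float *y,
                         const float *gain, float *out, int n) {
    for (int i = 0; i < n; i++) {
        float r = 1.0f / sqrtf(x[i] * x[i] + y[i] * y[i]); /* math engine */
        float g = gain[i] * 0.5f + 0.5f;                   /* regular FPU */
        out[i] = g * r;                                    /* join results */
    }
}
```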

3rd Dimension: Configurable to Single, Dual, and Quad Core

In parallel with the multiple vector SIMD computation engines and VLIW allocation, the DesignWare ARC VPX5 Processor IP allows scaling from a single core to dual- and quad-core configurations. This enables the computation performance of a single-core VPX5 to be doubled or quadrupled as needed to meet higher computation requirements. The DesignWare ARC MetaWare Development Toolkit fully supports code compilation and execution across multi-core configurations. Additionally, semaphores are part of the product, so multi-core task execution and synchronization are supported.

Data movement is a key aspect of the VPX5 product for single-, dual-, and quad-core configurations. A 2D Direct Memory Access (DMA) engine, configurable with up to four channels, provides up to 512-bit transfers per cycle. The DMA can be configured to move data in parallel between the data memories of the respective cores, local cluster memory, or in/out over the external AXI bus. This high-performance DMA complements the high computation throughput of the VPX5 processor, allowing the vector SIMD engines to constantly access new vector data in the local, tightly coupled vector data memory on each core.
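A typical use of such a DMA is double buffering, so that the compute engines never stall waiting for external memory. The sketch below illustrates the ping-pong pattern with a deliberately hypothetical dma_start()/dma_wait() API; the real VPX5 DMA is programmed through the MetaWare toolchain:

```c
#include <stddef.h>

/* Hypothetical DMA API for illustration only: dma_start() kicks off an
 * asynchronous transfer on one channel, dma_wait() blocks until it is
 * complete. */
void dma_start(int ch, void *dst, const void *src, size_t bytes);
void dma_wait(int ch);

void process_block(float *buf, size_t n);  /* vector compute kernel */

/* Ping-pong double buffering: while the SIMD engines process one tile
 * in tightly coupled vector memory, the DMA fetches the next tile. */
void stream(const float *ext_src, float vbuf[2][1024], size_t tiles) {
    dma_start(0, vbuf[0], ext_src, sizeof(vbuf[0]));       /* prefetch tile 0 */
    for (size_t t = 0; t < tiles; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        dma_wait(0);                                       /* tile t arrived  */
        if (t + 1 < tiles)                                 /* fetch tile t+1  */
            dma_start(0, vbuf[nxt], ext_src + (t + 1) * 1024, sizeof(vbuf[0]));
        process_block(vbuf[cur], 1024);                    /* compute tile t  */
    }
}
```

While the engines process tile t, the DMA fetches tile t+1 into the other buffer, hiding the transfer latency entirely whenever compute time exceeds transfer time.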

4th Dimension: Linear Algebra Computation

Many new-generation algorithms use mathematical equations and calculations that rely on linear algebra base functions for their computation throughput. Examples are object tracking and identification, predictive modeling, and some filtering operations. Given this trend, the VPX processors are unique in offering a dedicated vector SIMD floating point computation engine purely for linear algebra. This engine accelerates mathematical functions such as division, SQRT, 1/SQRT, log2(x), 2^x, sine, cosine, and arctan in hardware, executing them across a SIMD vector for very high performance.
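A representative workload is Cartesian-to-polar conversion in RADAR post-processing, where every sample needs a square root and an arctangent. The scalar C reference below shows the operations involved; on the VPX5 the dedicated engine evaluates them across full SIMD vectors instead of emulating them lane by lane in software:

```c
#include <math.h>

/* Scalar reference for Cartesian-to-polar conversion. Each iteration
 * needs a square root and an arctangent -- exactly the functions a
 * dedicated vector math engine accelerates as full SIMD vectors. */
void cart_to_polar(const float *re, const float *im,
                   float *mag, float *phase, int n) {
    for (int i = 0; i < n; i++) {
        mag[i]   = sqrtf(re[i] * re[i] + im[i] * im[i]);
        phase[i] = atan2f(im[i], re[i]);
    }
}
```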

What Does This Mean in Terms of Performance Numbers?

With this 4-dimension parallel processing capability, the DesignWare ARC VPX5 Processor IP addresses the floating point and linear algebra processing needs of high-throughput applications and offers industry-leading performance compared to other DSP processors of similar architecture. For example, in its maximum configuration, a VPX5 can deliver 512 half-precision floating point operations per cycle, which is 768 GFLOPS at 1.5 GHz (512 × 1.5 GHz). The VPX5 also delivers 16 math floating point results per cycle for linear algebra operations. For 8-bit integer data, as used in machine learning algorithms, the VPX5 offers up to 512 MACs per cycle.

The VPX5 processor is supported by the DesignWare ARC MetaWare Development Toolkit, which offers full compiler, debug, and simulation platforms. This allows developers to compile C-code algorithms quickly and efficiently onto the processing engines within the VPX5 core. The cycle-equivalent simulation platforms let developers evaluate cycle counts and check for optimum performance on key algorithms and routines. In addition to the DSP libraries provided, the DesignWare ARC MetaWare Development Toolkit includes linear algebra and Machine Learning Inference (MLI) libraries, so developers can port code through API interfaces to the libraries and achieve peak performance very quickly. For MLI, various neural network base computation components are provided for high-performance software AI computation.

Conclusion

The DesignWare ARC VPX5 Processor IP is the next-generation DSP designed to meet the data computation needs of processing-intensive applications. Its four dimensions of parallel processing for floating point, AI, and linear algebra algorithms allow the DesignWare ARC VPX5 processor to offer ultra-high performance for applications such as automotive ADAS sensor nodes (RADAR and LiDAR), 5G New Radio (NR) baseband modems, powertrain, engine management, robotics, motor control, and 5G automotive communication (5G C-V2X). With an industry-leading 512 FLOPs/cycle and a unique 16 math FLOPs/cycle for linear algebra, the VPX5 offers system developers a DSP with the performance to meet the needs of next-generation high-computation algorithms. With the ARC MetaWare Development Toolkit and its DSP and math libraries, developers can quickly port C-code algorithms and achieve optimum performance for faster time-to-market.