1st Dimension: Multiple SIMD Computation Engines for Floating Point
The fundamental vector length is 512 bits, enabling SIMD computation on 8-, 16-, and 32-bit integer data or on half- and single-precision floating point data. All the SIMD engines compute with 512-bit vectors; this defines the peak computation capability of the ARC VPX5 processor.
For 8-, 16-, and 32-bit integer data there are dual 512-bit SIMD computation engines, coupled with three ALU processing units. This delivers very high computation rates for algorithms such as machine learning convolution operations (5x5, 3x3, and 1x1 kernels).
For half- and single-precision floating point computation there are three vector SIMD engines, all supporting the maximum vector length of 512 bits. Two of these engines handle “regular” floating point vector operations, providing ultra-high performance for DSP functions such as FFTs and matrix operations.
The third vector SIMD floating point engine is dedicated to linear algebra mathematical functions. It allows these functions to be offloaded and computed in parallel, and is described further in the 4th Dimension section.
2nd Dimension: Flexibility with Multi-Task Issue VLIW
The flexible 4-issue VLIW scheme enables the processor to execute the maximum possible number of operations in parallel. The VLIW scheme was developed in tight collaboration with the software compiler, so the compiler pre-allocates operations compiled from the original C-code program. The compiler, coupled with the VLIW architecture, enables operations to execute across the multiple SIMD engines in parallel.
As an example, Figure 2 shows how the compiler works with the VLIW allocation scheme, using just two VLIW slots to achieve parallel execution across all three floating point SIMD engines with optimum slot allocation and reduced instruction code size. The two “regular” vector SIMD floating point engines have a zero-cycle insertion delay, so new vector data can be loaded into them every cycle. The linear algebra vector SIMD engine has an insertion delay of four cycles, so after data is loaded there is a wait of three extra cycles before new vector data can be loaded. The compiler pre-allocates across the VLIW slots to account for these different insertion delays, giving effective parallel execution across all three vector SIMD floating point engines.