Some of the design decisions taken for Tgauss:
- Vector data path to process six bytes (= two RGB pixels) in parallel:
  - vmul, vmac instructions
  - vector register files A, B with cyclic buffer access
- Separate vector memory for the input/output image
- Separate vector memory for buffering lines between the horizontal and vertical filter phases
- Split 32-bit register file into 16-bit Rh and Rl components for filter coefficient storage
- 3-level hardware loop
- Memory interfaces to support delays in image memory response
- Two pipeline execution stages (E1, E2) for higher frequency synthesis
- Instruction level parallelism (ILP) tuned to the needs of the application: a load from line buffer memory can be executed in parallel with a vector data path operation
Tgauss comes in two versions: one performing the horizontal and vertical phases in an iterative sequence, the other performing two horizontal phases followed by two vertical phases, which reduces the line buffer memory load traffic by a factor of 2.
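The two filter phases above can be sketched in plain C as a behavioral illustration (this is my own sketch, not Tgauss code; the filter size, coefficient scaling, and buffer layout are example choices):

```c
#include <stdint.h>

/* Separable filter flow: a horizontal pass writes filtered rows into a
 * line buffer, and a vertical pass consumes TAPS buffered lines to
 * produce one output row. */

#define W    320   /* image width (example)  */
#define TAPS 9     /* 9x9 separable filter   */

/* Horizontal pass over one row of one color channel.
 * The inner multiply-accumulate is what vmul/vmac vectorize. */
static void hfilter(const uint8_t *row, int16_t *out,
                    const int16_t coef[TAPS]) {
    for (int x = 0; x < W; x++) {
        int32_t acc = 0;
        for (int t = 0; t < TAPS; t++) {
            int xi = x + t - TAPS / 2;
            if (xi < 0)  xi = 0;          /* clamp at image borders */
            if (xi >= W) xi = W - 1;
            acc += coef[t] * row[xi];
        }
        out[x] = (int16_t)(acc >> 8);     /* example Q8 scaling */
    }
}

/* Vertical pass: combine TAPS buffered lines into one output row. */
static void vfilter(int16_t lines[TAPS][W], uint8_t *out,
                    const int16_t coef[TAPS]) {
    for (int x = 0; x < W; x++) {
        int32_t acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += coef[t] * lines[t][x];
        out[x] = (uint8_t)(acc >> 8);
    }
}
```

Performing two horizontal passes back to back, as in the second Tgauss version, means each buffered line is loaded once instead of twice by the vertical phase, which is where the halved load traffic comes from.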
To illustrate the performance, the results for a 9x9 filter with RGB processing:
- 11.9 cycles/pixel = 920 kcycles/frame (320 x 240)
- 9 ms/frame @ 100 MHz
- 133 instructions
Smaller filters need fewer cycles, e.g. a separable 3x3 filter takes ~6 cycles/pixel
The example indicates options for further speed-up:
- Increase vector size
- Reduce loop size by more ILP, e.g. parallel store to XM, LM
Tcom8 is an example processor that illustrates processor concepts suited for certain communication kernels, especially those containing FFT and FIR operations.
Starting from a scalar processor model, the datapath architecture extension for Tcom8 adds SIMD vector operations on 8-component vectors, along with the following new storage:
- 4 vector word registers of 8 x 16 bit (V), split in 2 tuples (VA, VB)
- 4 vector accumulator registers of 8 x 40 bit (W), split in 2 tuples (WA, WB)
The vector memory is split into two parts, VMA and VMB, which allow a simultaneous read and write to different parts: (read VMA || write VMB) or (write VMA || read VMB).
A 16 x 16 bit vector multiplication takes two cycles and supports:
- pure 16 x 16 mult
- real part of a complex 16 x 16 mult
- imaginary part of a complex 16 x 16 mult (takes the real part as additional input)
- 16 x 16 multiply-accumulate with a 40-bit accumulator
A dedicated division instruction (both scalar and vector) is also available.
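A behavioral sketch of the two-cycle complex multiplication (my own C code, not the Tcom8 ISA; the Q15 fixed-point scaling is an example choice):

```c
#include <stdint.h>

/* Complex 16 x 16 multiplication split over two instructions/cycles:
 * the first produces the real part, the second produces the imaginary
 * part and merges in the real part forwarded from the first cycle. */

typedef struct { int16_t re, im; } cplx16;

/* cycle 1: real part of a*b, kept at full 32-bit precision */
static int32_t cmul_re(cplx16 a, cplx16 b) {
    return (int32_t)a.re * b.re - (int32_t)a.im * b.im;
}

/* cycle 2: imaginary part; 're' is the additional input from cycle 1 */
static cplx16 cmul_im(cplx16 a, cplx16 b, int32_t re) {
    cplx16 r;
    r.re = (int16_t)(re >> 15);                     /* Q15 scaling */
    r.im = (int16_t)(((int32_t)a.re * b.im
                    + (int32_t)a.im * b.re) >> 15);
    return r;
}
```

On the hardware this pattern runs per vector lane, so one 8-component complex product completes every two cycles per multiplier array.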
In addition, specific instructions and parallel formats have been provided to efficiently map FFT and FIR applications (as explained below).
To illustrate the performance of the core (and of course of the automatically generated compiler), the model comes with example code for FFT and FIR:
- For a 256 point FFT, the performance is about 570 cycles
- The inner loops make optimal use of the hardware:
- In steady state, it is possible to perform 2 complex butterflies and 2 complex mults per cycle, illustrating the power of the architecture specialization for the FFT algorithm
- The FFT scales at every stage. Dynamic scaling, depending on the data, could be considered as a future extension
- Specific instructions have been added for this application, in particular:
- radix 2 butterfly
- vector transpose
- For a 32 tap FIR, the performance is about 6 cycles per sample
- Specific instructions have been added for this application, in particular:
- vector element select and broadcast
- vector concatenate and vector window select
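The two kernels these instructions accelerate can be sketched in plain C (behavioral illustrations of my own, not Tcom8 code; the float butterfly and the lane/tap counts are example choices):

```c
#include <stdint.h>

/* --- Radix-2 DIT butterfly with twiddle factor w:
 *     a' = a + w*b,  b' = a - w*b
 * Floats are used for readability; the hardware operates on 16-bit
 * fixed-point vectors with 40-bit accumulation. --- */

typedef struct { float re, im; } cf;

static void butterfly(cf *a, cf *b, cf w) {
    cf t = { w.re * b->re - w.im * b->im,     /* complex mult w*b */
             w.re * b->im + w.im * b->re };
    cf a0 = *a;
    a->re = a0.re + t.re;  a->im = a0.im + t.im;
    b->re = a0.re - t.re;  b->im = a0.im - t.im;
}

/* --- FIR inner-loop pattern: select one coefficient, broadcast it to
 *     all 8 lanes, and MAC it against a sliding window of samples. --- */

#define VLEN 8

static void fir_block(const int16_t *x, const int16_t *h, int taps,
                      int32_t y[VLEN]) {
    for (int l = 0; l < VLEN; l++) y[l] = 0;
    for (int t = 0; t < taps; t++) {
        int16_t c = h[t];                      /* element select + broadcast */
        for (int l = 0; l < VLEN; l++)         /* one vector MAC             */
            y[l] += (int32_t)c * x[l + t];     /* vector window select       */
    }
}
```

In `fir_block`, the window `x[l + t]` is the software view of the vector concatenate and window-select instructions: two adjacent sample vectors are concatenated and an 8-element window is extracted at offset `t`.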
Featuring: SHA-256
This model highlights the design of a programmable accelerator as an alternative to fixed-function RTL, using the SHA-256 cryptographic hash function as an example. SHA-2 comes in different variants (SHA-224, -256, -384, -512, -512/224, -512/256), making it a good candidate for a flexible (because programmable), yet dedicated crypto engine.
As with all models, the SHA-256 crypto processor comes in source code. In addition, the model library includes a slide deck that describes the design process, starting from an existing 32-bit MCU, all the way to the final SHA-256 architecture.
Some of the architecture design steps taken:
- Starting with an initial 32-bit MCU
- Removing unneeded hardware elements, such as hardware multipliers
- Introduction of a special-purpose functional unit that implements the most time-consuming computations, especially the compression step in the SHA-256 transform function. This step contains many simple operations whose sequential execution in software is inefficient, but whose implementation in hardware is inexpensive
- Adding zero-overhead loop capabilities, automatically selected by the compiler
- Adding post-increment addressing mode
- Reorganizing memory, separating program memory and data memory, and adding a separate K-memory for the round constants, which can be implemented as a 32-bit-wide ROM
This reorganization enables 3-way instruction-level parallelism (ILP) that performs two loads and one arithmetic operation in parallel. The ILP can be encoded in 32-bit instructions, so the impact on the required program memory is limited.
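The compression step targeted by the special-purpose unit is one SHA-256 round as defined in FIPS 180-4, sketched here in plain C. Executed as scalar instructions, each round costs a long sequence of rotates, XORs, ANDs, and additions; the dedicated unit evaluates this cluster in hardware:

```c
#include <stdint.h>

/* One SHA-256 compression round (FIPS 180-4).
 * s[0..7] holds the working variables a..h, k is the round constant
 * (read from the K-memory ROM), w is the message schedule word. */

static uint32_t rotr(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

static void sha256_round(uint32_t s[8], uint32_t k, uint32_t w) {
    uint32_t S1  = rotr(s[4], 6) ^ rotr(s[4], 11) ^ rotr(s[4], 25);
    uint32_t ch  = (s[4] & s[5]) ^ (~s[4] & s[6]);
    uint32_t t1  = s[7] + S1 + ch + k + w;
    uint32_t S0  = rotr(s[0], 2) ^ rotr(s[0], 13) ^ rotr(s[0], 22);
    uint32_t maj = (s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]);
    uint32_t t2  = S0 + maj;
    /* rotate the working variables */
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
}
```

Every operation in the round is cheap in gates (rotates are pure wiring), which is exactly the property that makes the step expensive as sequential software but inexpensive as a single-cycle functional unit.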