Markus Willems, Sr. Product Marketing Manager, Synopsys
5G is the next generation of wireless communication technology, promising to increase throughput and reduce latency by up to two orders of magnitude compared to 4G. To achieve this, the 5G standard defines new algorithms and protocols for both advanced handsets and converged fixed/mobile network applications.
While development teams have been researching 5G for quite some time, the 5G standard for new radio (NR) only converged in 2018. With 5G network deployment expected to begin in 2020, the system-on-chip (SoC) development window is very short, and over the past two years chip makers have been racing to develop 5G SoCs in time to hit the market when 5G technology makes its debut. Rather than waiting for the standard to mature, chip makers have opted for more software-programmable solutions in blocks that traditionally would be implemented as fixed-function hardware; e.g., in the physical layer and the digital front-end. At the same time, 5G’s daunting throughput and latency requirements call for acceleration at different levels of the communication protocol. As an example, Layer-2 processing, which traditionally was done on standard processors, now requires a level of performance that standard processors cannot deliver.
Application-specific instruction-set processors (ASIPs) successfully bridge the gap between highly optimized fixed-function hardware implementations and standard processor IP (Figure 1). As a result, for almost any 5G SoC, ASIPs are the key implementation choice for blocks in the architecture that require the performance of specialized hardware but also the programmability and flexibility of processor IP.
Depending on the requirements, an ASIP can be developed to execute one specific system module, such as forward error correction, or it can cover an entire subsystem, such as a vector DSP for Layer-1 baseband processing. In the first case, the programmability still allows for algorithmic variants of the module (e.g., programming it for LDPC, Viterbi, or Polar coding). In either case, the designer can make tradeoffs to balance performance, flexibility, energy consumption, reusability (or generality), and design time.
With ASIP Designer, the ASIP is described in nML, a structured architecture description language that concisely describes processor architectures at the same level of abstraction as a programmer’s manual. The language is used to define the structural characteristics of the design (memories, registers, functional units, connectivity, etc.) and the instruction-set architecture. nML also enables users to describe the cycle- and bit-accurate behavior of the datapaths and I/O interfaces.
ASIP Designer allows software developers to immediately develop and profile C/C++ software on candidate architectures. This is possible because ASIP Designer provides a fully featured SDK (step 1 in Figure 2) that is automatically adapted to the processor architecture described in nML. The SDK includes an optimizing C/C++ compiler, an assembler/disassembler, a linker, cycle-accurate and instruction-accurate instruction-set simulators, and a graphical debugger.
The compiler adapts to the details of each candidate architecture thanks to its patented retargeting technology. Classical compiler frameworks, such as GCC or LLVM, require someone to develop an architecture-specific compiler backend, and that work must be repeated for every candidate architecture. The immediate availability of a compiler enables rapid iteration: a “compiler-in-the-loop” methodology for architectural exploration (step 2 in Figure 2).
The compiler-in-the-loop methodology means that software engineers can give feedback to the ASIP design engineer, and that the processor’s dynamic performance can be studied and optimized. Making these kinds of adaptations and trade-offs at this level of abstraction is much more efficient than attempting them once an RTL description has been generated.
Once a designer is confident that the modelled ASIP meets the desired performance for the selected algorithms, they can use ASIP Designer to generate synthesizable RTL for implementation-level refinement and detailed verification using standard flows (step 3 in Figure 2). Designers can use Synopsys’ Design Compiler to generate a gate-level description and estimate the circuit’s power and area, or use place-and-route tools such as Synopsys’ IC Compiler to identify the risk of routing congestion. This “synthesis-in-the-loop” approach enables educated decisions and avoids surprises later in the design process. Should the designer face problems during implementation, they can go back to the nML description to make adjustments. Because nML is the single source for both, the SDK and RTL remain in sync.
ASIPs are deployed in many of the upcoming 5G SoCs, both in base stations and mobile terminals. Key application domains include those that require massive signal processing, such as in digital front-ends and baseband processing in Layer-1, but also in accelerating Layer-2 control functionality.
As illustrated in Figure 1, ASIPs address a wide range of architectures, with ASIP Designer customers designing across the entire spectrum. Such 5G-specific ASIPs include wide-vector DSPs with specialized data types, memory and register configurations, and instruction sets, delivering higher performance per mW than standard off-the-shelf DSPs. While ASIP Designer customers have already rolled out several ASIP architectures for 5G, this article illustrates the same concepts by means of two Synopsys-owned example designs available to licensees of ASIP Designer.
PrimeCore is a processor tuned for FFT/DFT operations. It supports FFTs with all power-of-2 sizes from 8 to 2048, and DFTs with all prime-factorizable sizes from 6 to 1536. Figure 3 illustrates the PrimeCore architecture.
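The supported transform sizes can be sketched as a small size check. This is an illustration only: the text does not define "prime-factorizable", so the sketch assumes it means sizes whose only prime factors are 2, 3, and 5, which is the usual convention for mixed-radix DFTs in cellular uplinks. The function names are hypothetical and are not part of any PrimeCore API.

```python
def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def fft_sizes(lo=8, hi=2048):
    """Power-of-2 FFT sizes in the range quoted in the text."""
    return [n for n in range(lo, hi + 1) if is_power_of_two(n)]


def radix_factors(n, radices=(2, 3, 5)):
    """Split n into small-radix factors, or return None if it does not factor.

    ASSUMPTION: 'prime-factorizable' is read here as 'factors completely
    into 2, 3 and 5'; the article itself does not list the radices.
    """
    if n < 1:
        return None
    factors = []
    for r in radices:
        while n % r == 0:
            factors.append(r)
            n //= r
    return factors if n == 1 else None


if __name__ == "__main__":
    print(fft_sizes())            # the nine sizes 8, 16, ..., 2048
    print(radix_factors(1296))    # 1296 = 2^4 * 3^4
    print(radix_factors(7))       # None: 7 is not built from 2, 3, 5
```

Under this assumption, the 1296-point DFT quoted later factors into four radix-2 and four radix-3 stages, or any regrouping of them such as radix-4 or radix-6 stages.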
PrimeCore is a 256-bit, 8-lane SIMD architecture with three vector data-path units, all processing complex fixed-point operands. VU0 performs dedicated butterfly operations, reading data from a tailored register file. VU1 performs vector multiplications and additions, and VU2 is specialized for radix-6 butterfly calculations. The architecture features two vector memories, one of which is assigned to the coefficients used by VU1. Load/load-store operations happen in parallel to the vector operations, resulting in up to 5-way instruction-level parallelism. Though highly specialized, the architecture is entirely C-programmable, with the ASIP Designer-generated compiler fully exploiting the parallelism. A few data points illustrate the performance: a 256-point FFT takes 172 cycles, a 2048-point FFT takes 1189 cycles, and a 1296-point DFT takes 798 cycles. Synthesis (16nm FF) results in 350K gates for a 700 MHz clock.
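To put the quoted cycle counts in perspective, they can be converted into latencies with simple arithmetic. The sketch below uses only the figures given above and assumes the core actually runs at the 700 MHz synthesis target with no stalls outside the quoted cycle counts; the helper names are illustrative.

```python
# Figures quoted in the text; everything derived is a back-of-envelope estimate.
CLOCK_HZ = 700e6      # assumed operating clock = synthesis target
SIMD_BITS = 256
LANES = 8

# transform name -> (number of points, cycle count from the text)
CYCLES = {
    "256-point FFT": (256, 172),
    "2048-point FFT": (2048, 1189),
    "1296-point DFT": (1296, 798),
}


def bits_per_lane(simd_bits=SIMD_BITS, lanes=LANES):
    """Width of one complex operand in the SIMD vector."""
    return simd_bits // lanes


def latency_us(cycles, clock_hz=CLOCK_HZ):
    """Time for one transform in microseconds."""
    return cycles / clock_hz * 1e6


def samples_per_cycle(points, cycles):
    """Average number of input samples processed per clock cycle."""
    return points / cycles


if __name__ == "__main__":
    print(f"bits per complex lane: {bits_per_lane()}")
    for name, (points, cycles) in CYCLES.items():
        print(f"{name}: {latency_us(cycles):.2f} us, "
              f"{samples_per_cycle(points, cycles):.2f} samples/cycle")
```

Under these assumptions, the three transforms take roughly 0.25 µs, 1.70 µs, and 1.14 µs. The 32 bits per lane would be consistent with, for example, 16-bit real and 16-bit imaginary parts, although the text does not state the exact fixed-point format.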
The second example processor is tuned for the minimum mean square error (MMSE) equalizer algorithm, as used for 5G NR channel equalization in base stations. This algorithm is heavily dominated by matrix operations, with the matrix elements being complex floating-point numbers. The resulting architecture is a very wide (4096-bit) SIMD architecture with 4-way instruction-level parallelism. Special attention has been given to an efficient memory concept. For cost reasons, a single-port memory is used, which results in one complex multiply-accumulate operation per two memory accesses and in pipelined operation. To handle the triangular matrices that are specific to this algorithm, a number of specialized indexed addressing modes have been implemented. Again, this architecture is fully C-programmable, with ASIP Designer’s compiler handling the pipelined operations. Given the nested loop structure of the MMSE algorithm, the compiler’s unique ability to perform software pipelining for both the inner and outer loops leads to a significant reduction in cycle count.
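As a rough illustration of why triangular matrices and nested loops dominate this workload, the sketch below computes a textbook MMSE estimate, x = (H^H H + σ²I)^-1 H^H y, using a Cholesky factorization stored in packed lower-triangular form. The article does not say which factorization or memory layout the example processor uses, so the Cholesky approach, the packed indexing formula, the unit-power symbol assumption, and all function names are assumptions made for illustration, not the Synopsys implementation.

```python
import math


def tri_index(i, j):
    """Offset of element (i, j), with j <= i, in a row-packed lower triangle."""
    return i * (i + 1) // 2 + j


def cholesky_packed(a, n):
    """Return packed L with A = L * L^H for a Hermitian positive-definite A."""
    L = [0j] * (n * (n + 1) // 2)
    for i in range(n):                      # outer loop over rows
        for j in range(i + 1):              # inner loop over the triangle
            s = a[i][j] - sum(L[tri_index(i, k)] * L[tri_index(j, k)].conjugate()
                              for k in range(j))
            if i == j:
                L[tri_index(i, i)] = complex(math.sqrt(s.real))
            else:
                L[tri_index(i, j)] = s / L[tri_index(j, j)]
    return L


def solve_packed(L, b, n):
    """Solve (L * L^H) x = b by forward and backward substitution."""
    z = [0j] * n
    for i in range(n):                      # forward: L z = b
        s = b[i] - sum(L[tri_index(i, k)] * z[k] for k in range(i))
        z[i] = s / L[tri_index(i, i)]
    x = [0j] * n
    for i in reversed(range(n)):            # backward: L^H x = z
        s = z[i] - sum(L[tri_index(k, i)].conjugate() * x[k]
                       for k in range(i + 1, n))
        x[i] = s / L[tri_index(i, i)]
    return x


def mmse_equalize(H, y, noise_var):
    """MMSE estimate of x from y = H x + noise, assuming unit-power symbols."""
    m, n = len(H), len(H[0])                # m receive antennas, n layers
    A = [[sum(H[r][i].conjugate() * H[r][j] for r in range(m))
          + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(H[r][i].conjugate() * y[r] for r in range(m)) for i in range(n)]
    return solve_packed(cholesky_packed(A, n), b, n)


if __name__ == "__main__":
    H = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j], [0.3 + 0j, 0.1 + 0.2j]]
    x = [1 + 1j, -1 + 1j]
    y = [sum(H[r][c] * x[c] for c in range(2)) for r in range(3)]
    print(mmse_equalize(H, y, 0.0))         # noise-free case: recovers x up to rounding
```

The triangular inner loops have trip counts that depend on the outer loop index, which is the kind of access pattern that specialized indexed addressing and software pipelining across both loop levels would target.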