ASIP eUpdate, October 2017

Synopsys’ solution to efficiently design and implement your own application-specific instruction-set processor (ASIP) when you can’t find suitable processor IP, or when hardware implementations require more flexibility.

ASIP Designer

This bi-annual newsletter provides you with easy access to ASIP-related resources. This issue includes the following topics:

 

Technology Feature: Fast Context Switching and Multi-Threading

ASIP Designer™ comes with a rich set of example processor models provided in source code, which serve both as a modeling reference and as a starting point for customer designs.

In previous issues of the eUpdate Newsletter we covered data-level parallelism and instruction-level parallelism, including the corresponding example models. In this issue we look at example processor models that demonstrate how to do a fast context switch. What these models have in common is that registers are duplicated per context. This duplication makes it possible to quickly switch from one thread of execution to another. The motivations for a fast context switching mechanism are diverse, but broadly speaking we can distinguish between:

  • Fast context switching upon an interrupt, or

  • Switching between the multiple contexts in a multi-threaded processor

Before going into these cases, let us look at how to model a processor that has multiple register contexts. In the nML model, only one set of registers is declared, and these registers are used to define the instructions. Additional copies of a register, sometimes called shadow registers, can easily be created by adding the additional_register_contexts property to the processor model. Assume that a processor has a register file R[4] and a single-field register SP. The following property then results in the automatic creation of three shadow registers for both R[4] and SP, for a total of four contexts:

                     additional_register_contexts : 3 ;

If we want to keep a single copy of certain registers that is shared by all contexts, we must add the following exclude property:

                    exclude_from_additional_register_contexts : SP ;
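
The effect of these two properties can be illustrated with a small behavioral model. The following is a minimal Python sketch, not nML and not code generated by ASIP Designer; the class RegisterContexts and its methods are hypothetical names introduced only for this illustration. It assumes one additional copy of each register per extra context, except for registers listed as shared.

```python
class RegisterContexts:
    """Toy model of duplicated register contexts (illustration only)."""

    def __init__(self, registers, additional_contexts, shared=()):
        # registers maps a register name to its number of fields,
        # e.g. {"R": 4, "SP": 1} for register file R[4] and register SP.
        self.num_contexts = 1 + additional_contexts
        self.shared = set(shared)
        # Duplicated registers get one copy per context; shared ones get one copy.
        self.banks = {
            name: [[0] * size
                   for _ in range(1 if name in self.shared else self.num_contexts)]
            for name, size in registers.items()
        }
        self.context_select = 0  # plays the role of the context select signal

    def _bank(self, name):
        # Shared registers ignore the context select signal.
        index = 0 if name in self.shared else self.context_select
        return self.banks[name][index]

    def read(self, name, field=0):
        return self._bank(name)[field]

    def write(self, name, value, field=0):
        self._bank(name)[field] = value


# Three additional contexts, with SP excluded from duplication.
regs = RegisterContexts({"R": 4, "SP": 1}, additional_contexts=3, shared=["SP"])
regs.write("R", 11, field=2)   # writes R[2] of context 0
regs.write("SP", 0x100)        # writes the single shared SP
regs.context_select = 1
regs.write("R", 22, field=2)   # writes R[2] of context 1
r2_ctx1, sp_ctx1 = regs.read("R", 2), regs.read("SP")
regs.context_select = 0
r2_ctx0 = regs.read("R", 2)
```

In this sketch, R[2] holds a different value in each context, while SP reads the same value regardless of the selected context.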

Once the multiple register contexts have been created, we must decide how to switch from one context to the next. Using Figure 1 below for reference, we will look at different ways to generate the context select signal that steers the context multiplexor CM.

Figure 1: Multiple register contexts: the instruction set registers (colored green) are declared in nML, the registers for the additional contexts (colored yellow) are generated automatically  

Synopsys’ Tnano 16-bit microcontroller example is well suited to illustrate fast context switching upon accepting an interrupt. In this example, the context select signal is driven by the processor control unit (PCU). When the PCU decides to accept an interrupt, the following actions are taken:

  • A jump to the start address of the interrupt service routine is initiated

  • Issuing of instructions is stopped until the first instruction of the service routine is fetched, which flushes the pipeline

  • In the meantime, the context select signal is set to a new context, so that from then on the instructions of the service routine operate on the shadow registers

This achieves a fast context switch, which takes only a few clock cycles.
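
The three actions above can be sketched as a toy cycle-level model. This is a hedged Python illustration, not the Tnano PCU: the class TinyPcu, the two-cycle fetch latency, and the choice of context 1 for the service routine are all assumptions made for this example, and the return from the interrupt is omitted.

```python
class TinyPcu:
    """Toy program control unit; names and cycle counts are illustrative only."""

    def __init__(self, isr_address, isr_context=1, fetch_latency=2):
        self.isr_address = isr_address      # start of the service routine
        self.isr_context = isr_context      # assumed shadow context for the ISR
        self.fetch_latency = fetch_latency  # assumed cycles until the ISR is fetched
        self.pc = 0
        self.context_select = 0
        self.issue_blocked_cycles = 0

    def accept_interrupt(self):
        self.pc = self.isr_address                      # jump to the service routine
        self.issue_blocked_cycles = self.fetch_latency  # stop issuing (pipeline drains)
        self.context_select = self.isr_context          # select the shadow registers

    def step(self):
        """Advance one cycle; return the issued PC, or None while issue is blocked."""
        if self.issue_blocked_cycles > 0:
            self.issue_blocked_cycles -= 1
            return None
        issued = self.pc
        self.pc += 1
        return issued


pcu = TinyPcu(isr_address=0x40)
trace = [pcu.step(), pcu.step()]          # two instructions of the main program
pcu.accept_interrupt()
trace += [pcu.step() for _ in range(4)]   # blocked cycles, then the service routine
```

No register values are saved or restored in this sketch: only the context select value changes, which is why the switch costs no more than the cycles needed to fetch the first service routine instruction.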

Multi-threading architectures can execute multiple threads concurrently. One popular use of this technique is to hide latencies of the pipelined data path and/or latencies of the memory. Instead of waiting for a value that is still being computed on the data path or that is being loaded from the memory, the processor switches to another thread. When control returns to the first thread at a later point, that value will have been computed or loaded.

Such multi-threading is addressed by a second example processor, ILX. ILX is a variant of the DLX 32-bit microcontroller, to which interleaved multi-threading has been added. The original DLX has a pipeline with the following stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (ME) and register write back (WB). Registers are written in the WB stage and read in the ID stage, so there is a three-cycle dependency hazard. In the DLX model, the pipeline is fully protected by adding either a register bypass or stalling logic for each hazard. In contrast, the ILX model executes four threads and uses a round-robin schedule in which the instructions from the different threads are interleaved cycle by cycle. This is depicted in Figure 2.

Figure 2: Interleaved multi-threading with four threads

As Figure 2 shows, two instructions from the red thread are now separated by three instructions from other threads. Due to the multi-threading, the hazard is no longer present, and the bypasses and stalling logic can be removed. Instead, we now need to control the context select signal in each stage separately. As you can see in Figure 3, in cycle 3, the red thread is in the WB stage while the yellow thread is in the ID (read) stage.

Figure 3: Context selection: in a specific cycle, the register reads in ID and the register writes in WB belong to a different context

The register write address and the register read address therefore belong to different contexts. When ILX’s PCU issues a new instruction, it tags that instruction with a context identifier. The instruction and its context identifier then flow through the pipeline side by side.
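
The tagging scheme can be illustrated with a small simulation. The following Python sketch is an assumption-laden toy, not the ILX model: the function simulate is a hypothetical name, threads are numbered 0 to 3 instead of being colored, cycles are counted from the first instruction fetch (which may differ from the numbering in the figures), and each pipeline slot holds only the context tag of the instruction in that stage.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def simulate(num_threads, num_cycles):
    """Return, per cycle, a dict mapping each stage to the context tag it holds."""
    pipeline = [None] * len(STAGES)  # None marks a stage that is still empty
    history = []
    for cycle in range(num_cycles):
        # Round-robin issue: a new instruction tagged with the next thread's
        # context enters IF, and every older tag advances by one stage.
        pipeline = [cycle % num_threads] + pipeline[:-1]
        history.append(dict(zip(STAGES, pipeline)))
    return history


history = simulate(num_threads=4, num_cycles=8)
for cycle, stages in enumerate(history):
    print(cycle, stages)
```

Once the pipeline is full, the tag in ID always differs from the tag in WB, so the register read port and the register write port must each be steered by the tag of the instruction in their own stage.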

Now suppose that we are in an environment where one thread – for example the blue thread – terminates and there are no other tasks to fill its slot.  We can then consider utilizing this time slot for the remaining threads, as shown in Figure 4.

Figure 4:  Interleaved multi-threading with fewer than four threads

As there are now only two cycles between two instructions of the same thread, we need to add a bypass to protect against this hazard.  If more threads are removed from the schedule, more hazards will emerge and more pipeline protection logic must be added to the design. 
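
The relation between the number of interleaved threads and the remaining hazard can be sketched with a short calculation. This Python example is a simplified illustration under stated assumptions: it uses the five-stage pipeline described above, assumes strict round-robin issue, and assumes that a register read only sees a value written in an earlier cycle (no bypass and no write-through within the same cycle). The function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]
READ_STAGE = STAGES.index("ID")   # registers are read in ID
WRITE_STAGE = STAGES.index("WB")  # registers are written in WB


def stall_cycles_without_bypass(num_threads):
    """Stall cycles needed between two dependent, consecutive instructions
    of the same thread when no bypass is present."""
    write_cycle = WRITE_STAGE             # producer is issued in cycle 0
    read_cycle = num_threads + READ_STAGE  # consumer is issued num_threads cycles later
    # The read must happen strictly after the write.
    return max(0, write_cycle - read_cycle + 1)


stalls = {n: stall_cycles_without_bypass(n) for n in (1, 2, 3, 4)}
print(stalls)
```

Under these assumptions, four threads need no protection, three threads leave a one-cycle hazard, and a single thread exposes the full three-cycle hazard of the original pipeline.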

This scenario is covered by a third example processor, PLX, which shows how to implement the combination of interleaved multi-threading and a fully protected pipeline.

For more details on these models or any other models, send an email to asipsupport@synopsys.com.

What’s New in ASIP Designer?

2017.03 & 2017.09 Release Update 

Since the January 2017 newsletter, ASIP Designer has seen a number of enhancements and extensions. The following is an extract, sorted by categories (please refer to the official Release Notes for the comprehensive list).

Labs and Tutorials

  • New “Getting Started” hands-on lab: by designing a small processor from scratch, it provides an easy-to-follow introduction to the concepts, languages and tools of ASIP Designer

  • New ASIP Programmer™ hands-on lab, tuned towards embedded software developers, introducing the features of the ASIP Programmer GUI and the advanced software debugging capabilities

Example Models

  • New PLX example, with combined block-based and interleaved multi-threading using a protected pipeline (see also the Technology Feature section above)

  • New Tzscale example, implementing the RISC-V ISA with a 3-stage pipeline (Z-scale reference), with little-endian program memory, including on-chip debugging support

  • New Flexible Accelerator example, explaining how to build an optimized yet fully programmable accelerator, illustrated using the SHA256 algorithm

  • New I/O interface example, demonstrating a multi-cycle load into a wide register

  • New SystemC example, demonstrating the integration of an ISS into both Virtualizer and OSCI

Processor Modeling

  • 64-bit address space support

  • Simplified instantiations of I/O-modules

Simultaneous Hardware / Software Debugging

  • New micro-stepping feature allows stepping through nML statements and behavioral code (nML-PDG). This enables highly efficient debugging of (multi-cycle) functional units, program control unit (PCU), and I/O interfaces

  • New waveform viewer to inspect the value of signals and registers in the processor model, including those inside the PCU and I/O interfaces

Simulators and Verification

  • Further acceleration for instruction-accurate simulators, using block-based just-in-time (JIT) transformations, with a speed-up of up to 25x compared to non-JIT simulations

C/C++ Compiler

  • LLVM version updated from 3.9 to 5.0

  • New vector predication method, through guarded memory intrinsics

RTL Generation and Synthesis Support, FPGA Prototyping

  • New and modified options to further enhance an engineering change order (ECO) flow

  • Support for prototyping on HAPS80 systems, including the support of Synopsys ProtoCompiler

Additional Resources

Customer References 

Cognitive Systems is a startup that developed an innovative security system, based on wireless signal analysis. The application required a chipset that covers a wide spectrum between 650MHz and 4GHz, supporting a large variety of wireless standards. Read why Cognitive Systems decided on an ASIP, and how they managed to design a complex SIMD/VLIW DSP in less than 12 months, with a small team.

 

White Papers

  • Software Development Kits (SDKs) for Proprietary Processors – Why They Matter, What It Takes to Develop Them

    In order to develop a proprietary processor that can stand the test of time, a highly functional SDK must be developed. The complexity, cost and duration of SDK development vary depending on the architecture of the processor and the skillset of the SDK developers. In this paper, we analyze the requirements for an SDK. We then introduce a tool-based methodology for SDK development based on Synopsys’ ASIP Designer tool suite.

  • Rapid Architectural Exploration in Designing Application-Specific Processors

    Architectural exploration is at the heart of any ASIP design approach. Designers need to rapidly explore the impact of different architectural choices on power consumption and performance, ideally using real-world application C-code as part of the design flow. This white paper explains the architectural tradeoffs that are available to an ASIP designer, how to trade off performance vs. area, and why an ASIP design can still maintain full C-programmability while being optimized for a certain application domain.

  • Designing ASIPs in Multicore SoCs

    Modern SoCs integrate dozens of complex system functions, each requiring its own optimal balance of performance, flexibility, energy consumption, communication, and design time. The traditional model of a (configurable) general-purpose processor core with a number of fixed hardware accelerators no longer suffices. ASIPs can offer the best balance for each system function, and thus form the basis of new generations of multicore SoCs.