Optimizing high-end embedded designs with the ARC HS Processor Family

By Mike Thompson, Senior Product Marketing Manager, ARC Processors, Synopsys

 

The constant demand for higher performance with power budgets that are fixed or declining makes designing high-end embedded applications a challenge, sometimes forcing designers to make tradeoffs and deliver products that fall short of their design goals. Synopsys’ ARC® HS Family of embedded processors delivers more than 4,200 DMIPS (per core) at less than 80 mW to give designers the performance needed within their power budget. The ARC HS Family has been developed specifically for embedded applications and the processors feature many optimizations that enable designers to achieve their high-end embedded goals with plenty of headroom for future designs.

The Changing World

Technology is shaping and altering the world around us: reality is being augmented and “virtual reality” is becoming the norm; video is becoming more immersive, offering 3D effects and 4K resolution, with 8K on the horizon; cars are a technology showcase that, in a few years, will conceivably take over the driving for us. Our ability to interact with technology through touch, speech, and gesture is increasing at exponential rates, and information exchange is accelerating from person to person and machine to machine.

Power-Performance Paradox

These advances in technology are being enabled by the next generation of embedded microprocessors, which are being called on to deliver levels of performance that were unthinkable just a few years ago. Increasingly, new designs are being developed with multiple processors to achieve the performance required. But this higher performance comes at the price of increased power consumption, which is a problem for chip designers who have to increase the performance of their products while keeping to the same or lower power budgets. Battery-powered mobile products require longer battery life, and even products that are plugged in face power constraints due to heat and the desire to be green.

This power-performance paradox is made even more challenging because the typical approach to delivering more performance from a processor is to increase the transistor budget for the design, which increases both its power consumption and size. Many of the new generation of processors implement superscalar or multithreading schemes to achieve higher performance. These architectures can deliver on total performance, but they lack performance efficiency (DMIPS/mW, DMIPS/mm²), so they consume considerably more power for the work they do. Also, because of their size, they are limited to only modest gains in maximum clock speed over the previous generation of processors that they replace. What is needed is a processor that delivers on total performance, can be clocked at GHz speeds, and uses power sparingly. It is a difficult task to design a small, efficient processor that offers enough performance for today plus headroom for future design growth.

The ARC HS Embedded Solution

Synopsys’ ARC processors have been licensed by more than 170 companies and are shipped in more than 1.3 billion chips each year. As a result, Synopsys has a high level of interaction with embedded SoC designers and a clear understanding of the power-performance challenges that they face. To help address these challenges, Synopsys introduced the ARC HS Family of 32-bit processors for high-end embedded applications. ARC HS is a new generation of high-speed processors built on the advanced ARCv2 architecture. The HS Family delivers the high-end performance needed for today’s most advanced designs, with plenty left over for future use. But performance wasn’t the only design goal. The HS processors also offer performance efficiency to address SoC designs’ power-performance paradox.

ARC HS Processors for Embedded Applications

The new HS Family delivers on performance. The processors offer 2.2 GHz speeds with more than 4,200 DMIPS of performance per core on 28-nm processes (typical silicon), all while consuming less than 80 mW of power. This is more than twice the performance, with lower power consumption, of competitive processors. In some cases, ARC HS offers higher performance at less than half the power consumption of its competition. The two products in the HS Family are each optimized for specific uses. The HS34 features closely coupled instruction and data memory and is designed for hard real-time applications, while the HS36 offers support for instruction and data caches and is designed for high-end use. Both products are also available in dual-core and quad-core versions capable of delivering more than 15,000 DMIPS of total performance on 28-nm processes.
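To put the headline figures in terms of the efficiency metrics discussed earlier, the short calculation below derives DMIPS/mW and DMIPS/MHz from the numbers quoted above. It is only a back-of-the-envelope sketch: the quoted values are bounds ("more than" and "less than"), so the results are lower bounds, and the quad-core figure assumes ideal linear scaling, which is an assumption made for illustration and not a claim from the product data.

```python
# Figures quoted in this article for one ARC HS core on a 28-nm process.
DMIPS_PER_CORE = 4200   # "more than 4,200 DMIPS", used here as a lower bound
POWER_MW = 80           # "less than 80 mW", used here as an upper bound
CLOCK_MHZ = 2200        # 2.2 GHz


def dmips_per_mw(dmips, power_mw):
    """Performance efficiency with respect to power."""
    return dmips / power_mw


def dmips_per_mhz(dmips, clock_mhz):
    """Performance delivered per MHz of clock frequency."""
    return dmips / clock_mhz


efficiency = dmips_per_mw(DMIPS_PER_CORE, POWER_MW)    # at least 52.5 DMIPS/mW
per_mhz = dmips_per_mhz(DMIPS_PER_CORE, CLOCK_MHZ)     # roughly 1.9 DMIPS/MHz

# Hypothetical ideal scaling across four cores (assumption, not a measurement).
quad_core_ideal = 4 * DMIPS_PER_CORE                   # 16800 DMIPS

print(f"{efficiency:.1f} DMIPS/mW, {per_mhz:.2f} DMIPS/MHz, "
      f"{quad_core_ideal} DMIPS (ideal quad-core)")
```

The ideal quad-core figure of 16,800 DMIPS is consistent with the "more than 15,000 DMIPS" quoted for the quad-core versions.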

The HS processors are designed for use in high-end applications such as solid-state drives, connected appliances, automotive controllers, media players, digital TV, set-top boxes and home networking, and have a range of features that make them well suited for embedded applications. They are highly configurable, enabling users to tailor each instance on an SoC to maximize performance while minimizing power consumption and area. All of the ARC products support the addition of custom instructions that let users integrate their proprietary hardware accelerators into the processor to further increase performance and add competitive differentiation to their SoC product. As a build-time option, the HS processors offer a second register file that supports fast interrupt handling and context switching without the need to save or restore core registers. The processors support DSP instructions and have SIMD capabilities for use in signal processing applications. The HS Family also features a robust interrupt architecture supporting up to 240 independent interrupts with 16 levels of priority, as well as automatic save and restore to simplify interrupt handlers.
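The interrupt limits mentioned above (240 interrupts, 16 priority levels) can be illustrated with a small behavioral model. This is a sketch and not the ARC HS hardware: the class and method names are hypothetical, and the model assumes that level 0 is the highest priority and that ties are broken by the lowest interrupt number.

```python
NUM_INTERRUPTS = 240        # from the article
NUM_PRIORITY_LEVELS = 16    # from the article


class InterruptController:
    """Toy model: selects which pending interrupt should be serviced next."""

    def __init__(self):
        # Assumption: every interrupt starts at the lowest priority (level 15).
        self.priority = [NUM_PRIORITY_LEVELS - 1] * NUM_INTERRUPTS
        self.pending = set()

    def set_priority(self, irq, level):
        if not 0 <= irq < NUM_INTERRUPTS:
            raise ValueError("interrupt number out of range")
        if not 0 <= level < NUM_PRIORITY_LEVELS:
            raise ValueError("priority level out of range")
        self.priority[irq] = level

    def raise_irq(self, irq):
        if not 0 <= irq < NUM_INTERRUPTS:
            raise ValueError("interrupt number out of range")
        self.pending.add(irq)

    def next_irq(self, current_level=None):
        """Return the interrupt to service, or None.

        If a handler is already running at current_level, only a strictly
        higher-priority (numerically lower) interrupt may preempt it.
        """
        if not self.pending:
            return None
        irq = min(self.pending, key=lambda i: (self.priority[i], i))
        if current_level is not None and self.priority[irq] >= current_level:
            return None
        self.pending.discard(irq)
        return irq
```

For example, if interrupt 200 is configured at level 1 and interrupt 5 at level 3, and both are pending, the model services interrupt 200 first and lets interrupt 5 wait until no handler of level 3 or higher priority is running.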

Improved Performance and Power with ARC HS 10-Stage Pipeline

The ARC HS Family is built on a high-speed, power-efficient 10-stage pipeline (Figure 1). This single-issue, scalar pipeline minimizes size and power consumption. To improve performance, the processor supports limited out-of-order execution for long-latency instructions. Instructions graduate when they advance from commit to writeback without a result. Graduated instructions are kept in a buffer and receive a unique identification tag. The buffer can hold up to eight instructions in flight, and when their results become available they request retirement. The HS processors have sophisticated, high-accuracy branch prediction with early detection of mispredicted branches. They also have a late-stage ALU in the 9th stage, which allows the processing of some instructions to be deferred from the early ALU in the 6th stage, for example when a branch or interrupt requires the pipeline to be flushed. In these cases, the pipeline continues to process instructions on the back end while the front end is being reloaded. This significantly reduces the load-to-use penalty, and for many instructions can eliminate it.

Figure 1: The ARC HS Processors are built on a 10-stage pipeline 
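The graduation mechanism described above can be sketched as a small tagged buffer. The model below is illustrative only: the names are hypothetical, and it assumes that the tag is simply a free slot number and that the pipeline stalls when all eight entries are occupied. Neither detail is stated in this article.

```python
class GraduationBuffer:
    """Toy model of a buffer for instructions that graduate without a result."""

    CAPACITY = 8  # "up to eight instructions in flight"

    def __init__(self):
        self._free_tags = list(range(self.CAPACITY))
        self._in_flight = {}  # tag -> destination register name

    def graduate(self, dest_reg):
        """Record an instruction whose result is not ready yet.

        Returns its unique tag, or None when the buffer is full
        (assumption: the pipeline would stall in that case).
        """
        if not self._free_tags:
            return None
        tag = self._free_tags.pop(0)
        self._in_flight[tag] = dest_reg
        return tag

    def retire(self, tag, value, regfile):
        """Write the late result to the register file and free the tag."""
        dest_reg = self._in_flight.pop(tag)
        regfile[dest_reg] = value
        self._free_tags.append(tag)


regfile = {}
buf = GraduationBuffer()
load_tag = buf.graduate("r1")   # long-latency load
div_tag = buf.graduate("r2")    # long-latency divide
buf.retire(div_tag, 7, regfile)     # the divide finishes first
buf.retire(load_tag, 42, regfile)   # the load result arrives later
```

Because each entry is identified by its tag and not by its position, results can retire in a different order from the one in which the instructions graduated, which is the limited out-of-order behavior the text describes.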

The HS Family processors have a parallel load/store pipeline starting at the 6th stage to improve performance for data handling. They support 64-bit loads and stores to and from register pairs to move data faster. They feature non-aligned load and store accesses that use banked DCCM and data cache memories, allowing them to complete without extra cycle penalties. There is also an optional low-latency memory port for fast access to peripherals and memory. This port supports single-cycle access to all peripheral registers or memory on an SoC and reduces system latency by moving this traffic off the multilayer AMBA bus. The processors further improve data movement by supporting I/O coherency, with data cache snooping and a programmable address space that keep the cache coherent with memory shared with peripherals without interfering with normal cache operations.

Flexible, Configurable Architecture

The HS Family is highly configurable, enabling users to add their own proprietary hardware accelerators to the processor. The hardware can be used to dramatically increase performance, lower power consumption, or add special differentiating features to the processor. Up to 190 custom instructions can be added, but the capability is more than just the addition of instructions: up to 28 registers can be added to the register file to be used as sources and destinations for the additional instructions. Condition and status codes can also be added, and memory-mapped blocks are supported. This is a complete capability: extensions can be either blocking or non-blocking, and out-of-order completion is supported. Designers can add custom instructions to an HS Family processor with the ARChitect tool that is delivered with every ARC product. ARChitect has a four-step wizard that makes it straightforward to add a user’s proprietary Verilog hardware to the processor.

The configurability of the processors also includes a range of optional hardware to accelerate computation and processing:

  • Multiply and multiply-accumulate options include support for a 64-bit multiplier.
  • The configurable radix-4 hardware divider enables the user to select the number of clocks needed to complete the operation, trading latency against area and power consumption.
  • An IEEE-754 compliant floating-point unit (FPU) supports single- or double-precision operations, or both. The FPU takes advantage of the RISC pipeline and delivers very good floating-point performance at about 10% of the power consumption of a floating-point coprocessor.
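To show what "radix 4" means for a divider, the sketch below performs unsigned division by retiring two quotient bits per step, so a 32-bit divide needs 16 steps where a bit-at-a-time divider needs 32. It is a simplified software model written for illustration: the function name is hypothetical, and it does not describe the actual ARC HS divider (hardware dividers often use a redundant digit set such as SRT, which this model does not).

```python
def radix4_divide(dividend, divisor, width=32):
    """Unsigned division producing one base-4 quotient digit per step.

    Returns (quotient, remainder, steps).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % 2:
        raise ValueError("width must be even for radix-4 stepping")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")

    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - 2, -1, -2):
        # Bring down the next two bits of the dividend.
        remainder = (remainder << 2) | ((dividend >> shift) & 0b11)
        # Pick the largest digit in 0..3 whose multiple fits in the remainder.
        digit = 0
        for candidate in (3, 2, 1):
            if candidate * divisor <= remainder:
                digit = candidate
                break
        remainder -= digit * divisor
        quotient = (quotient << 2) | digit
        steps += 1
    return quotient, remainder, steps


print(radix4_divide(100, 7))  # (14, 2, 16)
```

In hardware, the comparisons against 1x, 2x and 3x the divisor can be evaluated in parallel or spread over several clocks, which is one plausible way a configurable divider could trade latency for area and power.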

Multicore ARC HS Advantages

For high-performance applications, both the ARC HS34 and HS36 processors are available in dual-core and quad-core versions (Figure 2). The multicore versions feature inter-core hardware that facilitates message passing, interrupt handling, semaphores and debug. The inter-core message passing uses a centralized SRAM that is shared by all cores, with round-robin arbitration to manage simultaneous accesses. The inter-core interrupt capability allows each core to generate interrupts to the other cores, and each core can receive acknowledgments from any other core. The inter-core semaphores are provided for synchronization across shared resources. The inter-core debugger can simultaneously or selectively halt, run, or reset any combination of the cores. Designed to increase performance, the multicore implementations have a 64-bit global real-time counter to synchronize multiple threads.

Figure 2: Dual-core ARC HS implementation
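Round-robin arbitration, which the multicore versions use to manage simultaneous accesses to the shared message-passing SRAM, can be sketched in a few lines. The model below is a generic round-robin arbiter with hypothetical names. It assumes one grant per cycle and that core 0 is considered first after reset, details that are not specified in this article.

```python
class RoundRobinArbiter:
    """Grants a shared resource to one requesting core per cycle."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # Assumption: start so that core 0 is considered first.
        self._last_granted = num_cores - 1

    def grant(self, requests):
        """Return the core granted this cycle, or None if nobody requests.

        The search starts just after the most recently granted core, so a
        core that was just served goes to the back of the line.
        """
        requesting = set(requests)
        for offset in range(1, self.num_cores + 1):
            core = (self._last_granted + offset) % self.num_cores
            if core in requesting:
                self._last_granted = core
                return core
        return None


arbiter = RoundRobinArbiter(num_cores=4)
# Cores 0 and 2 request the shared SRAM on three consecutive cycles.
grants = [arbiter.grant({0, 2}) for _ in range(3)]
print(grants)  # [0, 2, 0]
```

The rotating starting point is what prevents starvation: no core can be granted twice in a row while another core is still waiting.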

Test Bus Interface

All ARC HS processors support the STAR Memory System® memory test bus interface, enabling high-quality memory test coverage and yield enhancement through memory repair. The HS Family also supports ECC on all memories, and a memory protection unit supporting up to 16 partitions in the memory space is available.
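The role of a memory protection unit with a limited number of partitions can be illustrated with a simple access check. This is a conceptual sketch with hypothetical names: the region fields, the first-match rule for overlapping regions, and the default-deny behavior for unmapped addresses are assumptions made for the example and are not taken from the ARC HS documentation.

```python
from dataclasses import dataclass

MAX_REGIONS = 16  # "up to 16 partitions", from the article


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    readable: bool
    writable: bool
    executable: bool

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


class MemoryProtectionUnit:
    """Toy model: checks an access against a small table of regions."""

    def __init__(self):
        self.regions = []

    def add_region(self, region):
        if len(self.regions) >= MAX_REGIONS:
            raise ValueError("no free MPU regions")
        self.regions.append(region)

    def allows(self, addr, access):
        """access is 'r', 'w' or 'x'. Assumption: the first match decides."""
        for region in self.regions:
            if region.contains(addr):
                return {"r": region.readable,
                        "w": region.writable,
                        "x": region.executable}[access]
        return False  # assumption: unmapped addresses are denied


mpu = MemoryProtectionUnit()
mpu.add_region(Region(0x0000, 0x1000, True, False, True))   # code: read/execute
mpu.add_region(Region(0x2000, 0x1000, True, True, False))   # data: read/write
```

With these two regions, a write to address 0x0010 is rejected because the code region is not writable, while a write to address 0x2800 is allowed.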

ARC HS Tools

The HS Family is supported with a complete suite of development tools.

The ARChitect tool, delivered with every ARC processor, enables rapid processor configuration with an intuitive and easy-to-use GUI. Even though the ARC processors feature a large number of configurable options, they can be configured in less than one hour using the ARChitect tool. The output from ARChitect is Verilog RTL source, the makefiles for the nSIM and xCAM simulators, synthesis scripts for the build tools, configured setup files for the MetaWare compiler, and test benches. The ARC tool suite is fully integrated and the output from all of the tools integrates with the other tools, so the configuration of the processor and any custom instruction extensions are recognized and used by all of the tools.

The MetaWare Development Toolkit is a full and integrated software development environment supporting the HS Family and all ARC Processors. MetaWare includes a highly optimized C/C++ compiler, a debugger that can be used to debug real and virtual targets, and an instruction set simulator. The debugger supports debugging of up to 256 processors in a single session. It supports simultaneous debug of the dual-core and quad-core versions of the HS Family. The source, disassembly, registers and variables for each processor can be viewed side by side or one at a time. The MetaWare Development Toolkit is housed in an Eclipse IDE and can be used with the SmaRT and real-time trace (RTT) options that are available for the ARC HS Family.

The RTT option for the HS Family supports multiple CPUs and is compliant with the Nexus 5001 standard. The RTT option is configurable and can use existing system storage memory, probe memory or a combination of both. There are both on-chip and off-chip capture modes. The capture elements have programmable filters and compression modules to reduce output bandwidth.

The xCAM tool supports 100% cycle-accurate simulation of the HS Family processors. The tool supports the generation of an unlimited number of configurations of an ARC processor and can be used in conjunction with the ARChitect and MetaWare tools. Synopsys also offers the nSIM Pro instruction set simulator, which offers cycle-close simulation at very high speeds. Its NCAM mode supports processor-centric algorithmic development and optimization.

Summary

The HS Family is a new generation of high-speed ARC processors that features unrivaled performance efficiency, delivering more than 4,200 DMIPS per core at less than 80 mW of power consumption. To address the power-performance paradox, the HS processors offer a customizable solution that can be highly optimized for each instance on an SoC. Designed for high-end embedded applications, the HS Family is scalable and flexible, with a full range of features to address a broad range of embedded requirements. For these applications, the HS processors offer higher performance and lower power than competitive solutions.