Issue 3, 2013
Solving the Power-Performance Paradox for High-End Embedded Processors
ARC HS is Synopsys’ family of CPU cores for embedded applications that demand 32-bit RISC performance with minimal power consumption and area. Mike Thompson, Senior Product Marketing Manager for ARC Processors, Synopsys, explains how ARC HS achieves these goals while preserving the code density, configurability and extensibility that define the ARC processor architecture.
Teams designing integrated circuits for performance-intensive, high-end embedded applications face a persistent engineering challenge: how to meet their performance goals while staying within the power budget. Good design practice is all about making engineering trade-offs, but performance demands keep rising while power budgets keep shrinking. Satisfying both requirements can be very difficult without compromising important aspects of a design. As a result, a key decision for any embedded design is the choice of a suitable processor core. The choice of CPU influences the overall system architecture and how it performs, so it is a decision that design teams must make diligently.
DesignWare ARC HS Family Overview
The DesignWare® ARC® HS 32-bit processor family uses the next-generation ARCv2 instruction set architecture (ISA), and the HS cores are software compatible with ARC EM4 and ARC EM6 cores. The ARC HS processors deliver higher performance than competitive cores designed for high-end embedded applications and do so with much lower power consumption, which can significantly improve overall system-level efficiency.
The new HS family delivers 3,100 DMIPS at 40% less power than competitors’ cores, which means that designers can more easily meet the requirements for their current designs with plenty of performance left over for future projects. The processor family also supports the addition of user-defined, custom instructions that enable design teams to add their proprietary hardware to the HS processor pipeline. This feature can be used to further increase performance or reduce power consumption, and it enables designers to easily use the HS family processors with their existing, proven hardware solutions.
To improve system-level efficiency, the HS architecture incorporates auxiliary registers and closely coupled memory, which enable design teams to bring on-chip peripheral functions and memory into the processor with single-cycle access. This reduces system latency and can significantly improve system-level performance and efficiency.
Applications and Performance
The ARC HS family includes two new CPU cores, the ARC HS34 and HS36.
The ARC HS34 has been designed for real-time embedded applications, such as:
- Solid-state drives
- Home gateways
- Home networking
- Mobile products
The HS34 is based on a Harvard architecture and stores instructions and data in separate, closely coupled memories (CCMs), which can be accessed at the CPU speed and are fully deterministic.
The ARC HS36 core is targeted at higher-end products, including:
- Digital cameras and tablets
- Digital TVs
- Set-top boxes
- Automobile infotainment systems
- Autonomous or semi-autonomous networked devices – the “Internet of Things”
Designers can configure the ARC HS36 with separate instruction and data CCMs and with instruction and data caches. The cache sizes are user configurable from 4KB to 64KB. The HS family features a combination of 32-bit and 16-bit instructions to achieve the smallest possible memory footprint, reducing code size by up to 40%.
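To illustrate how mixing 16-bit and 32-bit encodings shrinks a program, the following Python sketch computes code size for a given instruction mix. It is a simplified model rather than a description of the ARCv2 encoding: it assumes every instruction occupies exactly 2 or 4 bytes, that the instruction count is unchanged, and that the baseline is all 32-bit code. The function names are illustrative.

```python
def code_size_bytes(n_instructions: int, n_16bit: int) -> int:
    """Size of a program in which n_16bit instructions use the 2-byte form
    and the remainder use the 4-byte form (simplified model)."""
    if not 0 <= n_16bit <= n_instructions:
        raise ValueError("n_16bit must be between 0 and n_instructions")
    return 2 * n_16bit + 4 * (n_instructions - n_16bit)


def reduction_vs_all_32bit(n_instructions: int, n_16bit: int) -> float:
    """Fractional code-size saving relative to an all-32-bit baseline."""
    if n_instructions <= 0:
        raise ValueError("n_instructions must be positive")
    baseline = 4 * n_instructions
    return (baseline - code_size_bytes(n_instructions, n_16bit)) / baseline


if __name__ == "__main__":
    # Hypothetical mix: 1,000 instructions, 800 of them in 16-bit form.
    print(code_size_bytes(1000, 800))
    print(reduction_vs_all_32bit(1000, 800))
```

Under these assumptions, each instruction moved to the 16-bit form saves half of its original size, so the saving equals half the share of 16-bit instructions. The article does not state the baseline behind its "up to 40%" figure, so this shows only the mechanism.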
Benchmark information for the ARC HS family cores is listed in Table 1 and shows the high level of performance, low power consumption (<58 mW at 1.6 GHz) and small area that can be achieved in a 28-nm HPM process.
Table 1: Key performance benchmarks for the ARC HS34 processor
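The headline figures can be turned into efficiency ratios with a few lines of Python. This is a back-of-the-envelope sketch: it assumes that the 3,100 DMIPS figure quoted earlier applies at the same 1.6 GHz operating point as the <58 mW power figure, which the article implies but does not state outright. Because 58 mW is an upper bound, the power-related results are bounds as well.

```python
def efficiency(dmips: float, power_mw: float, freq_mhz: float) -> dict:
    """Derive common efficiency ratios from headline benchmark figures."""
    if power_mw <= 0 or freq_mhz <= 0:
        raise ValueError("power and frequency must be positive")
    return {
        "dmips_per_mhz": dmips / freq_mhz,
        "dmips_per_mw": dmips / power_mw,
        "mw_per_mhz": power_mw / freq_mhz,
    }


# Figures quoted in the article; treating them as one operating point
# is an assumption.
result = efficiency(dmips=3100, power_mw=58, freq_mhz=1600)

if __name__ == "__main__":
    print(f"{result['dmips_per_mhz']:.2f} DMIPS/MHz")         # about 1.94
    print(f"at least {result['dmips_per_mw']:.1f} DMIPS/mW")  # about 53.4
    print(f"at most {result['mw_per_mhz']:.3f} mW/MHz")       # about 0.036
```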
Architectural Features of the ARC HS Family
Figure 1 shows the main functional blocks that make up the ARC HS cores, including the functions that are specific to each core, the optional blocks and the blocks that are available for creating custom instruction extensions. Design teams can extend and configure the ARC HS cores to balance power and performance based on the specific requirements of each instance on an SoC. They can extend the hardware with new instructions, timers, interrupts, multipliers, hardware divide and a loop counter, or omit features to save power and area. They can also decide how best to implement aspects of the architecture. For example, teams can configure the 32-bit register file with 16 or 32 registers and implement it in flip-flops or memory cells. These and many other options on the HS cores can be selected using the ARChitect processor configuration tool.
Figure 1: DesignWare ARC HS block diagram
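A handful of the configuration choices mentioned in this article can be captured as a small validated record. The Python sketch below is purely illustrative: the class and field names are hypothetical and do not reflect the ARChitect tool's actual options or file format. Only the permitted values (16 or 32 registers, flip-flops or memory cells, 4KB to 64KB caches) come from the article.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_REGISTER_COUNTS = (16, 32)                          # from the article
ALLOWED_REGISTER_STORAGE = ("flip-flops", "memory-cells")   # from the article
CACHE_KB_MIN, CACHE_KB_MAX = 4, 64                          # from the article


@dataclass(frozen=True)
class HsCoreConfig:
    """Hypothetical record of a few HS core configuration choices."""
    register_count: int = 32
    register_storage: str = "flip-flops"
    icache_kb: Optional[int] = None   # None means no instruction cache
    dcache_kb: Optional[int] = None   # None means no data cache

    def __post_init__(self) -> None:
        if self.register_count not in ALLOWED_REGISTER_COUNTS:
            raise ValueError("register file must have 16 or 32 registers")
        if self.register_storage not in ALLOWED_REGISTER_STORAGE:
            raise ValueError("storage must be flip-flops or memory-cells")
        for name in ("icache_kb", "dcache_kb"):
            size = getattr(self, name)
            if size is not None and not CACHE_KB_MIN <= size <= CACHE_KB_MAX:
                raise ValueError(f"{name} must be between 4 and 64")


if __name__ == "__main__":
    # A small-area configuration: 16 registers, 8KB instruction cache only.
    print(HsCoreConfig(register_count=16, icache_kb=8))
```

Validating at construction time means an unsupported combination fails immediately rather than surfacing later in a build flow.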
The ARC HS Family’s High-Performance Pipeline
The ARC HS family has a 10-stage pipeline and is based on a 32-bit Harvard architecture (Figure 2), which has been designed to deliver high performance with simple, scalar execution.
Figure 2: The ARC HS family’s 10-stage instruction pipeline
The 10-stage pipeline implements a number of features that boost performance and minimize power by saving clock cycles, including:
- Dynamic branch prediction
- Early detection of mispredicted branches
- No load-to-use penalty for common integer instructions
- Out-of-order retirement for load, divide and user-defined instructions
- Optional instructions to load and store 64-bit doublewords, eliminating the need to execute successive 32-bit loads and stores for sequential data
A noteworthy example of how the ARC HS pipeline saves clock cycles is its detection of mispredicted branches in stage 7. Instead of letting the mispredicted instructions proceed through the final three stages, the architecture immediately flushes and reloads the pipeline, saving clock cycles that would otherwise be wasted, burning power and reducing system performance.
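The benefit of resolving mispredicted branches early can be estimated with a simple model. The Python sketch below assumes that, without early detection, a mispredicted branch would be resolved only in the last pipeline stage, so each early detection saves the remaining stages. The pipeline depth (10) and detection stage (7) come from the article. The count of mispredicted branches is a hypothetical input.

```python
PIPELINE_DEPTH = 10   # stages, from the article
RESOLVE_STAGE = 7     # stage where mispredictions are detected, from the article


def cycles_saved_per_mispredict(depth: int = PIPELINE_DEPTH,
                                resolve_stage: int = RESOLVE_STAGE) -> int:
    """Cycles saved by flushing at resolve_stage instead of the last stage."""
    if not 1 <= resolve_stage <= depth:
        raise ValueError("resolve_stage must lie within the pipeline")
    return depth - resolve_stage


def total_cycles_saved(mispredicted_branches: int) -> int:
    """Total cycles saved over a run with the given number of mispredictions."""
    if mispredicted_branches < 0:
        raise ValueError("mispredicted_branches must be non-negative")
    return mispredicted_branches * cycles_saved_per_mispredict()


if __name__ == "__main__":
    print(cycles_saved_per_mispredict())   # 3
    # Hypothetical workload with 50,000 mispredicted branches.
    print(total_cycles_saved(50_000))      # 150000
```

A real pipeline has additional effects (refill latency, prediction accuracy, stalls) that this model ignores.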
The ARC HS CPUs incorporate a number of other key performance features that enhance the operation of the pipeline.
Arithmetic and DSP Functions
- Enhanced radix-4 divider that operates in 4 to 19 clock cycles
- New hardware multiplier accelerates 64-bit multiplication
- Optional instructions support digital signal processing and single-instruction multiple-data (SIMD) operations
- Optional floating point unit (FPU) performs 32-bit/64-bit floating-point operations in hardware
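The variable latency of a radix-4 divider can be illustrated with a software model. A radix-4 algorithm produces one base-4 quotient digit (two bits) per iteration, so a 32-bit operand needs at most 16 iterations. The Python sketch below is a generic restoring radix-4 division that skips leading zero digits of the dividend. That early-exit rule is an assumption chosen to show why latency can depend on the operands. The article gives only the 4-to-19-cycle range and does not describe the HS divider's internals.

```python
def radix4_divide(dividend: int, divisor: int, width: int = 32):
    """Unsigned restoring division, one radix-4 digit (2 bits) per iteration.

    Returns (quotient, remainder, iterations).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    limit = 1 << width
    if not (0 <= dividend < limit and 0 < divisor < limit):
        raise ValueError("operands must fit in the given width")

    # Assumed early exit: skip leading zero base-4 digits of the dividend.
    iterations = max(1, (dividend.bit_length() + 1) // 2)

    quotient = 0
    remainder = 0
    for i in reversed(range(iterations)):
        # Bring down the next two dividend bits.
        remainder = (remainder << 2) | ((dividend >> (2 * i)) & 0b11)
        # Pick the largest digit q in 0..3 with q * divisor <= remainder.
        q = 0
        while q < 3 and (q + 1) * divisor <= remainder:
            q += 1
        remainder -= q * divisor
        quotient = (quotient << 2) | q
    return quotient, remainder, iterations


if __name__ == "__main__":
    print(radix4_divide(100, 7))          # (14, 2, 4)
    print(radix4_divide(0xFFFFFFFF, 1))   # 16 iterations
```

In hardware, fixed setup and completion cycles would be added on top of the iteration count. How many the HS divider uses is not stated in the article.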
An optional memory protection unit (MPU) guards designated memory regions from accidental or unauthorized accesses.
The HS family supports the industry-standard ARM® AMBA® AXI™ and AHB™ interfaces with configurable 32- or 64-bit wide buses. Alternatively, designers can use the 32-bit auxiliary bus to access the peripheral functions and provide single-cycle access to the CPU core.
The processors support up to 240 user-configurable interrupts with 16 priority levels. To simplify interrupt handlers, they can automatically save and restore registers to the stack when entering or exiting an interrupt routine. Faster context switches are possible by adding a second register file when configuring the core. The CPU can then switch between two different contexts without saving and restoring the registers each time.
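The interplay between interrupt numbers and priority levels can be sketched as a small arbiter model in Python. The limits of 240 interrupts and 16 priority levels come from the article. Everything else is an assumption made for illustration: interrupts are numbered 0 to 239, level 0 is the highest priority, ties go to the lowest interrupt number, and a pending interrupt preempts only code running at a strictly lower priority. The function name is hypothetical.

```python
from typing import Dict, Optional

MAX_INTERRUPTS = 240    # from the article
PRIORITY_LEVELS = 16    # from the article


def select_interrupt(pending: Dict[int, int],
                     running_priority: Optional[int] = None) -> Optional[int]:
    """Pick the pending interrupt to service next, or None.

    pending maps interrupt number -> priority level (0 = highest, assumed).
    running_priority is the level of the handler currently executing, if any.
    """
    for irq, prio in pending.items():
        if not 0 <= irq < MAX_INTERRUPTS:
            raise ValueError(f"interrupt number {irq} out of range")
        if not 0 <= prio < PRIORITY_LEVELS:
            raise ValueError(f"priority level {prio} out of range")
    if not pending:
        return None
    # Lowest priority value wins; ties go to the lowest interrupt number.
    irq, prio = min(pending.items(), key=lambda item: (item[1], item[0]))
    if running_priority is not None and prio >= running_priority:
        return None   # nothing pending outranks the running handler
    return irq


if __name__ == "__main__":
    print(select_interrupt({5: 3, 17: 1, 9: 1}))       # 9
    print(select_interrupt({5: 3}, running_priority=2))  # None
```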
Development and Verification Environment
Synopsys provides a comprehensive hardware and software development environment to help design teams get to market faster. In addition, a large number of third-party developers support the ARC architecture with hardware and software development tools. The ARC Access Program expands the choice of embedded software and hardware solutions available for DesignWare ARC processor cores. This program builds on the ecosystem of third parties supporting the ARC architecture with software development tools, real-time operating systems, middleware and semiconductor IP.
The ARC MetaWare compiler generates optimized code for the HS family with 15% better code density than previous ARC processors. It automatically uses the 16-bit instructions wherever it finds opportunities in the C/C++ source code to further reduce code size.
Adding a parallel trace port enables the use of real-time trace with ARC HS devices. Designers can choose to trace data from different sources, including the program counter, core registers, memory interfaces and auxiliary interface, and store it in system memory, probe memory or both. Alternatively, design teams can select a more compact real-time trace module, which uses the JTAG port.
Synopsys supports the ARC HS family with two simulators:
- nSIM: partial cycle accuracy instruction-set simulator
- xCAM: full cycle accuracy for profiling, optimization and verification
Developers can also prototype the ARC HS devices with Synopsys’ HAPS® FPGA-based Prototyping System.
Summary: ARC HS Family
Synopsys developed the ARC HS family to meet the needs of design teams looking for high performance in embedded applications without compromising power budgets. The HS family’s 10-stage scalar pipeline delivers excellent performance while avoiding the ballooning gate count (and inflated power consumption) typical of the more complex superscalar and multi-threaded pipeline microarchitectures that offer similar levels of performance. By focusing on key features (Table 2) that make a tangible difference to performance, while enabling design teams to configure the cores to meet their exact performance, power and area goals, Synopsys now offers an outstanding solution for high-performance embedded applications.
Table 2: Summary of DesignWare ARC HS key features
The new HS family is part of Synopsys’ broad ARC processor portfolio that is used by more than 170 customers worldwide who collectively ship more than 1.3 billion ARC-based chips annually.
About the Author
Mike Thompson is the senior manager of product marketing for ARC processors at Synopsys where he is responsible for the HS family and signal processing solutions. Mike has more than 30 years of experience in both the design and support of microprocessors, microcontrollers, IP cores and the development of embedded applications and tools. He has worked previously for Virage Logic, Actel, MIPS, ZiLOG, Philips/Signetics and AMD. He has a BSEE from Northern Illinois University and an MBA from Santa Clara University.