Many methods have been used to increase processor performance. Designers have deepened pipelines for years to cope with the limits of memory speed. For example, the DesignWare® ARC® HS Processor's 10-stage pipeline with two-cycle memory access can be clocked at 1.8 GHz (worst case) in 16 FFC processes. There are limits to how fast embedded designs are clocked, however, so adding more stages to a processor’s pipeline is of limited benefit. This may change in the future, but currently a 10-stage pipeline is optimal for embedded designs.
Superscalar implementations are a good tradeoff in terms of performance gain versus the increased area and power. Moving from a single-issue architecture to dual-issue can increase RISC performance by as much as 40% with limited increases in area and power. This is a good tradeoff for an embedded processor. Moving to tri-issue or quad-issue will further increase area and power but offer lower increases in performance. Performance at any cost is never the goal of an embedded processor.
Adding Out-of-Order (OoO) execution can increase performance for embedded applications without increasing clock speeds. Typically, a CPU that supports full OoO is overkill for embedded applications, and a limited approach gives an optimal performance increase without inflating the size of the processor. Limited OoO is commonly used on high-end embedded processors.
Caching is used to bring memory closer to the processor, thus increasing performance. Cache has single-cycle access for the processor, and the performance improvement comes from information already being in cache when it is needed. Frequently used code and data are kept in Level 1 (L1) cache. Less frequently used code and data are kept in slower Level 2 (L2) cache or external memory and accessed as needed. For multicore processors, maintaining coherency between the L1 data caches also improves performance. L1 caching and coherency are common in embedded processors, while L2 (and Level 3) caches are used only for higher-end applications.
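The benefit of caching can be estimated with the standard average memory access time (AMAT) relation: hit time plus miss rate times miss penalty. The Python sketch below is illustrative only. The function names and all cycle counts and miss rates are assumptions chosen for the example, not figures for any particular processor.

```python
def average_memory_access_time(hit_cycles, miss_rate, miss_penalty_cycles):
    """Return AMAT in cycles: hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


def two_level_amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory_penalty):
    """AMAT for an L1 cache backed by an L2 cache and external memory."""
    # An L1 miss costs one L2 access, which may itself miss to memory.
    l1_miss_penalty = average_memory_access_time(l2_hit, l2_miss_rate, memory_penalty)
    return average_memory_access_time(l1_hit, l1_miss_rate, l1_miss_penalty)


if __name__ == "__main__":
    # Hypothetical values: single-cycle L1, 10-cycle L2, 100-cycle external memory.
    l1_only = average_memory_access_time(1, 0.05, 100)
    with_l2 = two_level_amat(1, 0.05, 10, 0.2, 100)
    print(f"L1 only:   {l1_only:.2f} cycles per access")
    print(f"L1 and L2: {with_l2:.2f} cycles per access")
```

Under these assumed numbers, a low L1 miss rate keeps the average access close to the single-cycle hit time, and an L2 cache reduces the cost of the misses that remain.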
Embedded designs are seeing increased use of multiple processors. A few years ago, a typical system-on-chip (SoC) had one to two processors. Today more than five processors is common even for low-end designs, and the number is increasing. To support this, processors for mid-range and high-end embedded applications offer multicore implementations. Processors supporting two, four, and eight CPU cores are available. Using Linux or another operating system enables programmers to get smooth operation across the CPU cores while balancing the execution to increase performance.
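How much a multicore implementation helps depends on how much of the workload can run in parallel. A rough way to see this is Amdahl's law, sketched below in Python. The 80% parallel fraction is an assumption for illustration, and the model ignores operating system scheduling and coherency overhead.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` CPU cores per Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload in which 80% of the execution time parallelizes.
    for cores in (1, 2, 4, 8):
        print(f"{cores} cores: {amdahl_speedup(0.8, cores):.2f}x")
```

With this assumed workload, eight cores give a speedup of roughly 3.3x rather than 8x, because the serial portion sets the ceiling.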
The use of hardware accelerators is increasing in embedded designs. They offer high performance with minimum power and area while offloading the processor. The main drawback to hardware accelerators is that they are not programmable. Adding accelerators to work in conjunction with a processor can mitigate this. Unfortunately, existing processors have limited or no capabilities to support hardware accelerators. Some processors, such as ARC processors, support custom instructions that enable the user to add hardware to the processor pipeline. While custom instructions are attractive, hardware accelerators offer additional benefits and, when used in conjunction with a processor, can offer a significant performance improvement.
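The gain from pairing an accelerator with a processor can be approximated with a simple offload model. The Python sketch below is a back-of-the-envelope estimate, not a description of any specific accelerator interface. The function name, the offloaded fraction, the accelerator speedup, and the overhead term are all assumptions for illustration.

```python
def offload_speedup(offload_fraction, accel_speedup, overhead_fraction=0.0):
    """Estimate overall speedup when part of a workload moves to an accelerator.

    offload_fraction:  share of original runtime moved to the accelerator (0..1)
    accel_speedup:     how much faster the accelerator runs that share (> 0)
    overhead_fraction: added data transfer and synchronization time, expressed
                       as a share of the original runtime (>= 0)
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if accel_speedup <= 0 or overhead_fraction < 0:
        raise ValueError("accel_speedup must be > 0 and overhead_fraction >= 0")
    # New runtime, normalized so the original runtime is 1.0.
    new_time = (1.0 - offload_fraction) + offload_fraction / accel_speedup + overhead_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical case: half the runtime offloaded to a 10x faster accelerator.
    print(f"no overhead: {offload_speedup(0.5, 10):.2f}x")
    print(f"5% overhead: {offload_speedup(0.5, 10, 0.05):.2f}x")
```

The overhead term is why the coupling between processor and accelerator matters: in this model, time spent moving data and synchronizing directly reduces the benefit of the offload.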
There are challenges to increasing processor performance for embedded applications. Deeper pipelines, superscalar implementations, and OoO execution all help, but they can only go so far. Caching and coherency are already widespread, so further gains there are unlikely. A path to higher performance that embedded designers are already pursuing is to implement more CPU cores and hardware accelerators in their designs.