Flexible Processing for all IoT End Node Devices

By: Graham Wilson, Sr. Product Marketing Manager, ARC Processors, Synopsys

Introduction

Originally, the Internet of Things (IoT) was a catch-all term that most people interpreted as covering almost all connected devices. However, as the market matured and application use models and requirements became more defined, many industry observers began to think of IoT applications as falling into two segments: Critical IoT and Massive IoT. Critical IoT covers mission-critical applications such as automotive communication, industrial machines, and medical procedures where low latency is critical, while Massive IoT covers the billions of connected devices that include end-node devices. These end-node devices usually have tight power and cost constraints.

End-node system-on-chip (SoC) device operation can be summarized in four main functional areas:

Sensing, which ranges from environmental sensing (for example, temperature, humidity, and chemical composition) at very low rates of one sample per second or per minute, to motion, audio, voice, and vision sensing, which can have up to 100B samples per second.

Computation, which includes system control, synchronization, machine learning/artificial intelligence (AI), digital signal processing, data encryption, and running of the operating system (OS).

Communication, which includes support for a range of wireless communications standards, as shown in Figure 1.

Figure 1: Wireless communications standards supported by end-node devices

Security, which is driven by increasing concerns about data breaches and other security risks in end-node devices. Common security features in these devices include tamper resistance, prevention of side channel accesses, execution of a Trusted Execution Environment (TEE), encryption, and implementation of safety islands.

With the tight cost and power budgets of end-node IoT devices, reducing the number of processors in the design is important. A single core that can provide all the required functionality, including controller functionality for system synchronization, a real-time operating system (RTOS), a communications PHY interface, security, and encryption, would be ideal. However, the core should also perform DSP operations such as front-end signal processing, sensor data filtering, wireless communications, and PHY computation. Finding a single processor that can meet all these requirements is a challenge, and while an end-node device might require any of these functions, it is not common for a single application to require every function. Therefore, a highly configurable core that can be tailored to meet the performance and computation throughput requirements while consuming very little power and area is the best solution. 

Single-core Implementation

Synopsys’ DesignWare® ARC® EM9D core is uniquely positioned to deliver both controller and DSP functionality within a very small footprint. Based on a three-stage pipeline micro-architecture, the EM9D processor can achieve up to 4.0 CoreMark/MHz and 1.8 DMIPS/MHz.

With the ARC EM9D processor, multiple operations are fused into one instruction and executed in a single cycle. This yields high computation throughput and a very small instruction memory footprint. A fused instruction can, for example, load multiple data vectors from memory, perform an operation on the data (e.g., multiply-accumulate), auto-update the memory pointers, and store the result, all in one instruction. This enables the core to perform up to seven operations in a single cycle. The ARC EM9D processor can perform two MACs per cycle, allowing high vector data computation throughput. The ARC MetaWare Compiler fully supports the fused instructions and automatically maps C code to them.
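As a rough illustration of the kind of C loop that fused load/MAC/pointer-update instructions target, here is a minimal sketch of a fixed-point dot product written in portable C. The function name `dot_q15` is hypothetical, no ARC-specific intrinsics are used, and whether a particular compiler maps this loop to two MACs per cycle is an assumption that depends on the toolchain and core configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two 16-bit vectors with a wide accumulator.
 * Each iteration of the main loop contains two independent
 * multiply-accumulates, two pairs of loads, and index updates:
 * the pattern a dual-MAC core with fused instructions could,
 * in principle, issue as a single instruction per pair. */
int64_t dot_q15(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t acc = 0;
    size_t i = 0;

    /* Main loop: two MACs per iteration. */
    for (; i + 1 < n; i += 2) {
        acc += (int32_t)x[i] * y[i];
        acc += (int32_t)x[i + 1] * y[i + 1];
    }

    /* Tail: handle an odd final element. */
    if (i < n) {
        acc += (int32_t)x[i] * y[i];
    }
    return acc;
}
```

The loop is written without any manual scheduling; the point is only that the loads, multiplies, accumulations, and index updates are all visible to the compiler in one simple loop body.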

The ARC EM9D processor has an optimized instruction set architecture (ISA) for end-node IoT applications. For example, a set of instructions for data streaming in and out of core data memory allows data bits to be read and written directly from/to data memory without pre-packaging bits into words, which is ideal for connecting to low data-rate sensor interfaces.
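To show what "pre-packaging bits into words" costs on a core without such streaming instructions, the sketch below reads an arbitrary number of bits from a byte buffer in plain C. The type `bit_reader_t` and function `read_bits` are hypothetical names for illustration; this is the software work that bit-streaming instructions are meant to avoid, not a description of how the EM9D implements them.

```c
#include <stddef.h>
#include <stdint.h>

/* Software bit reader over a byte buffer, MSB first. */
typedef struct {
    const uint8_t *buf;  /* packed sensor data */
    size_t bit_pos;      /* index of the next unread bit */
} bit_reader_t;

/* Read nbits (at most 32) from the stream. Every bit costs a shift,
 * a mask, and an index update when done in software. */
uint32_t read_bits(bit_reader_t *r, unsigned nbits)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < nbits; i++) {
        size_t byte = r->bit_pos >> 3;
        unsigned shift = 7u - (unsigned)(r->bit_pos & 7u);
        v = (v << 1) | ((uint32_t)(r->buf[byte] >> shift) & 1u);
        r->bit_pos++;
    }
    return v;
}
```

A sensor that delivers, say, 3-bit or 10-bit fields would call `read_bits` once per field; with hardware bit-streaming support that per-bit loop would not be needed.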

Because of this architecture, optimized ISA, and high data throughput, the ARC EM9D is a highly capable DSP. For example, the EM9D core can execute a facial detection CNN computation in software in only 40 kcycles.

Getting the Most with Processor Configurability and Extensibility

When looking at optimizing a core in terms of performance, size, and power consumption, data memory interfaces are key. The data memory interface (load/store units) defines the amount of data loaded and stored and the frequency of these operations. Also, these units are quite large in terms of physical implementation. The ability to optimize this interface offers an advantage by giving designers the ability to balance power consumption and area against performance requirements.

The EM9D processor has a fully configurable data memory interface, supporting from one to three closely coupled data memories (DCCM, XCCM, and YCCM). These memory regions are fully supported by the MetaWare Compiler, which eliminates the need for manual data vector allocation. Fused instructions can perform a computation and access all three memory regions in parallel in a single cycle, offering very high performance if needed. This configurability allows the SoC developer to tune the core memory interface to meet computation throughput, area, and power requirements. For example, configuring the EM9D with three physical data memories offers three times the computation performance, while reducing core/memory power consumption by up to 40%.
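One way to see where a threefold figure can come from is a simple access-count model, sketched below. It assumes a memory-bound kernel such as an element-wise multiply, which needs two loads and one store per element, and assumes each data memory serves one access per cycle. The function `cycles_per_element` is hypothetical and is a back-of-the-envelope model, not a Synopsys performance figure.

```c
/* Cycles needed per element for a memory-bound kernel, assuming each
 * data memory serves one access per cycle and accesses can be spread
 * evenly across memories. Precondition: memories >= 1. */
unsigned cycles_per_element(unsigned accesses, unsigned memories)
{
    return (accesses + memories - 1u) / memories;  /* ceiling division */
}
```

Under these assumptions, three accesses per element take three cycles with a single data memory and one cycle with three, which matches a threefold throughput gain for kernels of this shape.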

Along with data memory size and configuration, instruction memory size is another important factor affecting system area and power consumption. The EM9D processor offers around 15% to 20% smaller code size than competitive processors, out of the box. This is due to the highly efficient ARCv2DSP ISA, coupled with efficient instruction mapping and scheduling by the compiler. On top of this, the fused instructions significantly reduce code size, and hence the required instruction memory size.

In addition to optimizing the core and memories, SoC system integration of the DSP is important for optimal performance, power, and area. End-node IoT SoCs can range from quite simple to highly complex and sometimes the traditional modular SoC interconnect system adds gate count, milliwatts, and cycle budget overhead that can also be optimized. Synopsys’ ARC processors are fully configurable and extensible, and offer the widest range of system and hardware connectivity schemes of available IP processor cores in the industry.

Peripheral hardware blocks can be connected to the processor via a dedicated peripheral interface for a ‘bus-less’ design that enables zero latency for data throughput intensive blocks. The core register bank can be extended in size and hardware blocks can directly connect to these registers, allowing core software control/status update of these hardware blocks. In addition, by using ARC Processor EXtension (APEX) technology, designers can add custom registers and interfaces in the form of an RTL description to the ISA. These connection schemes give SoC developers further flexibility to once again tune the system architecture to meet performance, power, and area goals.

To further optimize performance, an optional µDMA controller can be added to the processor. This µDMA engine is controlled directly from the ARC EM9D processor but operates in parallel with core execution, offloading heavy data movement from the core.
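A common software pattern for exploiting a DMA engine that runs in parallel with the core is double (ping-pong) buffering, sketched below. The function `dma_start_copy` is a hypothetical stand-in that simply calls `memcpy`; it is not the µDMA programming interface, and a real driver would start an asynchronous transfer and wait for completion before the buffer is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 4u  /* samples per block (illustrative) */

/* Hypothetical stand-in for starting a DMA transfer. Here it copies
 * synchronously; on hardware it would return immediately. */
static void dma_start_copy(int16_t *dst, const int16_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Placeholder for the core's work on one block. */
static int32_t process_block(const int16_t *buf, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Sum a stream of nblocks blocks using two alternating buffers:
 * while the core processes one buffer, the next block is fetched
 * into the other. */
int32_t sum_stream_pingpong(const int16_t *src, size_t nblocks)
{
    int16_t buf[2][BLOCK];
    int32_t total = 0;

    if (nblocks == 0) {
        return 0;
    }
    dma_start_copy(buf[0], src, BLOCK);  /* prefetch the first block */

    for (size_t b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks) {
            /* Start fetching the next block into the idle buffer. */
            dma_start_copy(buf[(b + 1) & 1u], src + (b + 1) * BLOCK, BLOCK);
        }
        /* On hardware: wait here until block b's transfer has completed. */
        total += process_block(buf[b & 1u], BLOCK);
    }
    return total;
}
```

With a truly asynchronous engine, the transfer of block b+1 overlaps the processing of block b, so the core spends its cycles on computation rather than on moving data.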

Figure 2 shows an example of how this system architecture optimization can greatly improve performance, power consumption, and area.

Figure 2: Implementing a bus-less design using ARC and an APEX interface improves PPA

Conclusion

With all these features and configuration options, the ARC EM9D processor is proven to be ideal for IoT end-node applications that require both control and digital signal processing capabilities, for the following key reasons:

  • IoT end-node device operations fall primarily in the areas of sensing, computation, communications, and security, ideally running on one low-cost processing core
  • These functions require control and DSP operations
  • ARC EM9D/11D processors offer the ideal mix of high-performance control and DSP functions in a very small, ultra-low-power processor
  • Configurability in data memory interfaces allows SoC developers to tune performance goals within power and die size constraints
  • The EM9D high-performance ISA and fused instructions offer leading-edge performance, with a very small code size and small core area
  • Even greater improvements in performance and reductions in power consumption and area can be gained with EM9D processor system connection schemes

The ARC EM9D offers industry-leading performance per area, coupled with the tight integration options for system peripherals and hardware accelerator blocks. This gives end-node device SoC developers the ability to tune the ARC EM9D configuration, memory size, and system connectivity to meet their performance, area, and power requirements.