Accelerating IoT Applications with a Data Fusion IP Subsystem

By Rich Collins, Product Marketing Manager, Synopsys

The vast expansion of Internet of Things (IoT) edge devices is increasing demand for low-power, “always-on” functions such as sensor fusion, image and voice detection, gesture recognition, and audio playback. Supporting this fusion of data sources requires an efficient combination of RISC and DSP processing.

This article shows how leveraging an integrated, pre-verified data fusion subsystem that is optimized for efficient DSP performance and ultra-low energy consumption can accelerate the development of high-performance, cost-optimized IoT systems and speed time to market.

Sensor fusion to data fusion

Combining basic sensor elements into a higher-order function is called sensor fusion. For example, modern smartphones commonly combine input from an accelerometer, compass, and gyroscope to track 3D motion. The number of systems incorporating sensor fusion technology continues to grow rapidly as semiconductor suppliers push to integrate sensor interfaces into many of their SoC offerings.
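As a rough illustration of what fusing two sensors can look like, the sketch below uses a complementary filter, one common fusion technique and not necessarily what any particular device uses. It tracks a single tilt angle from a gyroscope and an accelerometer; the compass is left out for brevity. The function names, the blend factor alpha, and the sample values are all assumptions made for this example.

```python
import math

def accel_tilt_deg(ax, az):
    """Tilt angle in degrees implied by gravity on the accelerometer x/z axes."""
    return math.degrees(math.atan2(ax, az))

def complementary_filter(angle_deg, gyro_rate_dps, ax, az, dt, alpha=0.98):
    """One fusion step: weight the integrated gyro heavily, and let the
    accelerometer slowly pull the estimate back to correct gyro drift."""
    gyro_angle = angle_deg + gyro_rate_dps * dt
    return alpha * gyro_angle + (1.0 - alpha) * accel_tilt_deg(ax, az)

# Hypothetical scenario: device held still at 30 degrees of tilt,
# gyroscope reporting a constant 0.5 deg/s bias, 100 Hz sample rate.
true_tilt = math.radians(30.0)
angle = 0.0
for _ in range(1000):
    angle = complementary_filter(
        angle, 0.5, math.sin(true_tilt), math.cos(true_tilt), dt=0.01
    )

print(f"fused tilt estimate: {angle:.2f} degrees")
```

Integrating the biased gyroscope alone would drift by 5 degrees over these 10 seconds, while the fused estimate settles within a fraction of a degree of the true tilt.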

In addition to sensor processing, today’s IoT applications demand ever more integrated functionality, including voice and gesture recognition, audio playback, and basic image detection. These functions need a higher level of DSP processing capability, yet they must run with the lowest energy consumption possible. This broader fusion of data sources, referred to here as data fusion, has become a standard requirement in IoT edge devices addressing applications such as wearables, personal health and fitness devices, and wireless headsets and speakers.

Advantages of an integrated subsystem

The advantages of increased integration can differentiate a silicon vendor’s device. A typical "integrated" solution today incorporates the various data source interfaces into a microcontroller-like architecture. This architecture, shown in Figure 1, generally includes a CPU connected through a standard on-chip bus (typically AMBA based) to peripheral interfaces (ADC, SPI, I2C) and on-chip memories (ROM, RAM, eFlash). Because all of the peripherals sit on the bus, transactions between the processor and peripherals take three to seven clocks or more due to bus latency and traffic, which costs both performance and energy.

Figure 1: Discrete implementation vs. integrated subsystem

An integrated IP subsystem with a DesignWare® ARC® EM processor offers distinct advantages to help ease integration effort while reducing on-chip latency and energy consumption compared to typical bus-based systems.

ARC EM processors provide industry-leading power/performance efficiency, saving critical battery life for IoT edge devices. The ARC EM DSP processors add DSP instructions, as well as MUL/MAC hardware, to the baseline RISC processor for always-on functions such as voice/gesture recognition and audio playback. The availability of licensable options, such as an FPU, MPU, and microDMA, gives customers flexibility in making implementation choices.

An ARC processor-based subsystem implementation can eliminate the interface to an on-chip bus by replacing load/store instructions to the I/O peripherals with register move instructions. The peripheral block registers are mapped using the ARC processor’s auxiliary bus. This effectively pulls the I/O peripheral interface functionality into the CPU complex, eliminating the buses and bridges. In a similar manner, both instruction and data memories can be closely coupled to the processor, eliminating the external bus and reducing access latencies.
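To get a feel for why removing the bus matters, the sketch below is a back-of-the-envelope cycle model. The three-to-seven clock range for bus transactions comes from the text above. The one-clock cost for a register move and the count of 1800 accesses are assumptions chosen purely for illustration, not measured figures for any ARC configuration.

```python
def io_access_cycles(num_accesses, cycles_per_access):
    """Total clocks spent on peripheral register accesses."""
    return num_accesses * cycles_per_access

BUS_CYCLES_BEST = 3    # from the article: three to seven clocks or more per bus transaction
BUS_CYCLES_WORST = 7
AUX_MOVE_CYCLES = 1    # ASSUMPTION: a register move completes in a single clock

ACCESSES = 1800        # hypothetical: one peripheral register access per message byte

bus_best = io_access_cycles(ACCESSES, BUS_CYCLES_BEST)
bus_worst = io_access_cycles(ACCESSES, BUS_CYCLES_WORST)
aux_move = io_access_cycles(ACCESSES, AUX_MOVE_CYCLES)

print(f"bus-based I/O:     {bus_best} to {bus_worst} clocks")
print(f"register-move I/O: {aux_move} clocks (under the one-clock assumption)")
```

Under these assumptions, the bus-based path spends three to seven times as many clocks on I/O alone, before accounting for any contention from other bus traffic.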

ARC processors and subsystems also support adding any combination of hardware extensions to the core: CPU extension registers, auxiliary extension registers, or memory mapped blocks. Designers can add 32-bit custom instructions as well.

Leveraging these high-level configuration and extension concepts enables end-customers to create highly optimized implementations.

Tightly integrated DMA improves power and performance

One of the many configurable subsystem options is a tightly integrated microDMA engine. This DMA controller allows system resources and peripherals to access memory independent of the processor, even during processor sleep modes. This can translate into real savings on cycle count and dynamic power.

To quantify this value, two basic subsystems using ARC EM processors were compared: one with the tightly coupled DMA and one without. Using an integrated subsystem SPI peripheral, an 1800-byte message was transmitted in loopback mode (primary Tx -> primary Rx). The CPU was clocked at 10 MHz and the instruction and data memories (ICCM & DCCM) were 32KB each for both implementations. Effective cycle count and dynamic power were measured in each case.

Figure 2 shows the first subsystem implementation without the microDMA engine. Figure 3 shows the second implementation, which includes the tightly coupled microDMA engine.

Results are shown in Table 1 below. For a relatively small area penalty (the microDMA logic adds ~10K logic gates and some memory overhead), the number of required CPU cycles decreases dramatically, as expected. More significantly, the dynamic power of the subsystem is reduced 8X, from 56µW to 7µW.

Figure 2: Subsystem with ARC EM processor and no integrated DMA

Figure 3: Subsystem with ARC EM processor and tightly integrated DMA

| Total cycles: 576K | Subsystem without integrated DMA | Subsystem with tightly integrated DMA |
| --- | --- | --- |
| CPU cycles | 211K (37% of total cycles) | 0.128K (0.02% of total cycles) |
| Area (NAND equiv. gates) | 397K (47K logic/350K memories) | 417K (57K logic/360K memories) |
| Dynamic power | 56µW (22.5µW CPU/33.5µW memories) | 7µW (2.3µW CPU/1.0µW memories/3.7µW DMA) |

Table 1: Area/Power comparison of subsystems with/without tightly coupled DMA
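The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below recomputes the CPU share of total cycles and the power ratio from the reported numbers, and also derives the transfer time at the 10 MHz clock. The per-transfer energy values are not reported in the article: they are derived here under the assumption that the reported dynamic power is the average over the whole 576K-cycle transfer.

```python
TOTAL_CYCLES = 576_000        # from Table 1
CLOCK_HZ = 10_000_000         # CPU clocked at 10 MHz

subsystems = {
    "without DMA": {"cpu_cycles": 211_000, "power_uw": 56.0},
    "with DMA": {"cpu_cycles": 128, "power_uw": 7.0},
}

transfer_s = TOTAL_CYCLES / CLOCK_HZ   # duration of the 1800-byte loopback transfer

def summarize(entry):
    """Return (CPU share of total cycles in percent, energy per transfer in microjoules)."""
    cpu_share_pct = 100.0 * entry["cpu_cycles"] / TOTAL_CYCLES
    # ASSUMPTION: reported dynamic power is the average over the full transfer.
    energy_uj = entry["power_uw"] * transfer_s   # microwatts * seconds = microjoules
    return cpu_share_pct, energy_uj

for name, entry in subsystems.items():
    share, energy = summarize(entry)
    print(f"{name}: CPU busy {share:.2f}% of cycles, ~{energy:.2f} uJ per transfer")

power_ratio = subsystems["without DMA"]["power_uw"] / subsystems["with DMA"]["power_uw"]
print(f"transfer time: {transfer_s * 1e3:.1f} ms, dynamic power ratio: {power_ratio:.0f}X")
```

Because both subsystems take the same total number of cycles, the 8X power reduction carries over directly to an 8X reduction in energy per transfer under this assumption.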

An optimized data fusion subsystem

Leveraging the basic subsystem concepts above, Synopsys has developed an IP subsystem targeting the fast-growing IoT edge device market. It specifically addresses “always-on” applications that require robust DSP performance for functions such as complex sensor fusion, voice and gesture recognition, image detection, and audio playback, while staying within the constrained power envelope of a battery-operated device.

The DesignWare Smart Data Fusion IP Subsystem (Figure 4) is designed to efficiently process data from numerous digital and analog sensors, either as the main processing element in an MCU, or as an offload engine for the host processor in a larger SoC. The fully configurable IP subsystem includes an ARC EM5D, EM7D, EM9D or EM11D processor. This family of low-power cores combines RISC and DSP instructions and hardware to manage the extensive processing required by advanced data fusion algorithms and to improve performance for a range of audio formats including MP3, SBC, OPUS and AAC LC.

The subsystem's integrated microDMA controller enables memory and peripheral access during processor sleep modes. In addition, the subsystem incorporates highly optimized I/O peripherals, including multiple SPI, I2C, and analog-to-digital converter interfaces, further lowering gate count and energy consumption.

To ease software development, the subsystem includes software drivers and a rich library of off-the-shelf DSP functions supporting filtering, correlation, matrix/vector, decimation/interpolation and complex math operations. Designers can implement these sensor-specific DSP functions in hardware using a combination of native DSP instructions and tightly coupled hardware accelerators to boost performance efficiency and reduce power consumption.
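For readers less familiar with these DSP building blocks, the sketch below shows two of them, filtering and decimation, in plain Python. It is not the subsystem's library API: the function names and the sample data are hypothetical. The inner loop is a multiply-accumulate, the kind of operation that dedicated MUL/MAC hardware is meant to speed up.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: each output is a multiply-accumulate over
    the most recent len(taps) input samples (earlier history treated as zero)."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]   # the MAC operation
        out.append(acc)
    return out

def decimate(samples, factor):
    """Keep every factor-th sample, reducing the data rate."""
    return samples[::factor]

# Hypothetical input: a step to 4.0, smoothed by a 4-tap moving average.
taps = [0.25, 0.25, 0.25, 0.25]
signal = [4.0] * 8

filtered = fir_filter(signal, taps)
reduced = decimate(filtered, 2)

print(filtered)
print(reduced)
```

Filtering before decimating is the usual order, since the low-pass filter limits aliasing when the sample rate is reduced.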

Additionally, Synopsys' embARC Open Software Platform gives software developers online access to a comprehensive suite of free and open-source software that accelerates code development for the subsystem.

Figure 4: Synopsys DesignWare Smart Data Fusion IP Subsystem

Smart Data Fusion IP Subsystem benchmarks

For a complex sensor hub implementation, a common set of signal processing functions is typically required. These functions include complex and scalar math, matrix functions, filtering, interpolation, and transforms.

To analyze the performance of the Data Fusion IP Subsystem, a library of these functions was run on both the Data Fusion Subsystem and a microcontroller based on a competitor’s processor (40LP, typical process/conditions). The total number of clock cycles required to complete the benchmark was calculated in both cases.

Across the board, the competitor processor required significantly more clock cycles to complete the tasks. The additional clock cycles translate directly into additional energy (power consumed over the “life” of the task). On average (as seen in Figure 5), the energy consumption was more than 2X greater for the competitor’s implementation. For IoT devices demanding minimal energy consumption to save on battery life, the Smart Data Fusion IP Subsystem provides significantly more efficient processing for typical sensor functions.

Figure 5: Competitive comparison of typical fusion functions

Summary

The rapidly expanding IoT edge device market continues to push boundaries on integration, cost, and performance. Battery operation imposes a constrained power envelope, but increasing demands for “always-on” functionality combining complex sensor fusion with biometric input (voice, image, touch) drive both RISC and DSP performance requirements. Designers need to integrate more of these functions to eliminate board-level components and reduce cost.

Synopsys’ Smart Data Fusion IP Subsystem combines the unique capabilities of the ARC EMxD CPUs with tightly coupled peripheral interfaces and hardware accelerators along with software drivers and libraries in an integrated IP offering, providing significant gains in overall performance, while reducing software footprint, silicon area and power in embedded IoT systems.