Optimize Data Center Power & Performance Using Computational Storage with Processor IP

Scott Durrant, Strategic Marketing Manager, Synopsys and Rich Collins, Product Marketing Director, Synopsys

Introduction

According to Fortune Business Insights, service providers are increasing storage spending at a rate of about 25% annually to facilitate management of increased data over the next few years. That translates to about $85 billion in 2022, growing to nearly $300 billion in annual data center storage spending by 2027. At the same time as they are increasing data storage, data center operators want to reduce energy expenses and the carbon footprint associated with their operations. Therefore, service providers are focusing their investments on higher performance and lower power compute capabilities that reduce data movement because data movement itself is an energy intensive process. For example, every minute, more than 500 hours of content are uploaded to YouTube – content that must be stored, searched, and processed almost instantaneously, and with the least power consumption possible. A key technology to improve data handling is computational storage, which brings compute capability to the SSD storage solution to boost system performance and reduce the amount of data transferred between storage and application processors.

Part of the drive toward computational storage assumes increased use of SSD flash. Wikibon predicts that SSD flash capacity shipments are going to grow at over 30% per year for the next five years or more. This growth creates a lot of opportunity in the storage market for suppliers who choose to invest in product leadership. 

Comparing Traditional vs Computational Storage

To understand the shift from traditional storage and processing solutions to computational storage, consider an example from the United States Environmental Protection Agency (US EPA). The EPA tests air quality in hundreds of different cities around the United States every hour to measure the amount of various pollutants. These measurements are stored in a database that contains millions of records and is constantly growing.

An example query on this database would be a report of the number of times that sulfur dioxide, a common pollutant, exceeds what the EPA considers to be a healthy level (75 parts per billion or lower). Fewer than 1 in 1000 of the records in this database - fewer than 1/10 of 1% - contain data of interest. To generate the report in a traditional system, the host server runs a query (Figure 1) that copies a portion of the database from the storage server SSDs to the allocated DRAM associated with the host processor. The host CPU then scans all of those records, finds the ones of interest, and repeats the process with the next portion of the database until the system extracts all of the information of interest from millions of records.


Figure 1: Comparing traditional compute data transfers with computational storage.

With a computational storage system, the solid state drives in the storage server are replaced with computational storage drives that have processing capability built in. The host server sends a request to the storage server to provide records of interest. The processors within each of those computational storage drives pre-process the information, find the records that contain reports of pollutants in excess of 75 parts per billion, and return only the relevant information instead of moving the entire database of millions of records. The host processor then performs any necessary post-processing and returns the report. Using a computational storage system consumes much less network bandwidth because only a fraction of the database is sent over the network. In addition, it requires far fewer host CPU cycles because the host CPU only needs to look at the records of interest and not the entire database.
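The difference in data movement between the two approaches can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 12-byte record layout, the synthetic readings, and the function names are hypothetical and do not represent the EPA's actual database or any real drive interface. Only the 75 parts per billion threshold and the roughly 1-in-1000 hit rate come from the example above.

```python
import random
import struct

# Hypothetical record layout: station id, hour index, SO2 reading in ppb.
RECORD = struct.Struct("<IIf")
THRESHOLD_PPB = 75.0  # level quoted in the example above


def make_database(n_records, seed=0):
    """Build synthetic records where roughly 1 in 1000 exceeds the threshold."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_records):
        high = rng.random() < 0.001
        ppb = rng.uniform(76.0, 150.0) if high else rng.uniform(0.0, 75.0)
        rows.append(RECORD.pack(i % 500, i, ppb))
    return rows


def is_hit(row):
    """True when the packed record's reading is above the threshold."""
    return RECORD.unpack(row)[2] > THRESHOLD_PPB


def traditional_query(rows):
    """Every record crosses the network; the host CPU does the filtering."""
    bytes_moved = sum(len(r) for r in rows)
    hits = [r for r in rows if is_hit(r)]
    return hits, bytes_moved


def computational_query(rows):
    """The drive filters in place; only matching records cross the network."""
    hits = [r for r in rows if is_hit(r)]
    bytes_moved = sum(len(r) for r in hits)
    return hits, bytes_moved


if __name__ == "__main__":
    db = make_database(1_000_000)
    _, moved_traditional = traditional_query(db)
    _, moved_computational = computational_query(db)
    print("traditional bytes moved:  ", moved_traditional)
    print("computational bytes moved:", moved_computational)
```

Both functions return the same records; what changes is where the filter runs and therefore how many bytes have to leave the storage server.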

Processing Data in Computational Storage Devices

In a computational storage system, as the amount of storage and number of drives grows, so does the number of computational processors in the storage devices. Therefore, the processing capabilities scale along with storage. The computational storage processors can be optimized to the specific workload for an additional performance boost.
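As a rough illustration of that scaling, the sketch below gives each drive its own scan routine and lets the host fan a request out to all of them. The names drive_scan and query_all_drives are hypothetical, and the thread pool is only a stand-in for independent drive processors: it models the structure of the request, not real hardware parallelism or performance.

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD_PPB = 75.0  # threshold from the EPA example


def drive_scan(shard):
    """Runs on one drive's processor: keep only readings above the threshold."""
    return [ppb for ppb in shard if ppb > THRESHOLD_PPB]


def query_all_drives(shards):
    """Host fans the request out to every drive and merges the small replies.

    One worker per shard models one processor per drive, so adding a drive
    adds a scanner along with the capacity.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        replies = list(pool.map(drive_scan, shards))  # order of drives is preserved
    return [ppb for reply in replies for ppb in reply]
```

The host-side merge only touches the records each drive returned, so its work grows with the number of matches rather than with the total size of the database.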

Figures 2 and 3 describe how the data can flow in a computational storage transaction.

First, consider how the request arrives. In a traditional system, the host request would come through the host interface into the SSD controller to request the data itself, potentially a huge amount of data that would be pulled from the SSD to DRAM and then processed by the host processor. With a computational storage drive, the host instead sends a simple high-level command to the computational storage processor asking for a transaction to begin.

Second, the computational storage processor receives and analyzes the command from the host and then initiates the read request to the DRAM. The request tells the storage processor to build the transfer descriptor (step 3), which is then dispatched to the appropriate flash channel to acquire the read data from the NAND flash elements (step 4).


Figure 2: Computational storage drive data flow from host to dispatching the descriptor

Next, the data read in response to the computational storage processor's request is brought in from the NAND flash channel for analysis. The processor looks for the data or key match that was requested. If it finds a matching record, it sends that record to DDR DRAM (step 6). That data is then packaged in the host interface protocol and transferred by DMA through the host interface to host memory, where it can be processed or used by the host processor (step 7).

Once complete, the computational storage processor sends a message back to the host processor indicating either that the transaction is complete and the data is available or, if the process did not result in a match, that an error occurred (step 8).


Figure 3: Computational storage drive data flow from read data to successful completion indication
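The transaction in Figures 2 and 3 can be summarized as a short sketch. Everything here is a simplified assumption for illustration: the Drive and TransferDescriptor classes, the (key, value) record shape, and the Python list standing in for host memory are hypothetical, and the step labels on stages that the text does not number explicitly are inferred. It is not a model of a real SSD controller or host interface protocol.

```python
from dataclasses import dataclass, field


@dataclass
class TransferDescriptor:
    """Hypothetical descriptor: which flash channel and which pages to read."""
    channel: int
    first_page: int
    page_count: int


@dataclass
class Drive:
    # channels[c] is a list of pages; each page is a list of (key, value) records.
    channels: list
    dram: list = field(default_factory=list)

    def handle_command(self, key, host_memory):
        # Steps 1-2: the host's high-level command arrives and is analyzed.
        self.dram.clear()
        for channel, pages in enumerate(self.channels):
            # Step 3: build a transfer descriptor for this flash channel.
            desc = TransferDescriptor(channel, 0, len(pages))
            # Step 4: dispatch the descriptor to read the NAND pages.
            last = desc.first_page + desc.page_count
            data = self.channels[desc.channel][desc.first_page:last]
            # Step 5 (inferred label): analyze the read data on the drive.
            for page in data:
                for record in page:
                    if record[0] == key:
                        # Step 6: only matching records are placed in DRAM.
                        self.dram.append(record)
        # Step 7: transfer the matches to host memory (stand-in for DMA).
        host_memory.extend(self.dram)
        # Step 8: report completion, or an error when nothing matched.
        return "complete" if self.dram else "error: no match"
```

A host-side call might look like `status = drive.handle_command("SO2", host_memory)`; the host then reads only the matched records from `host_memory` once the status indicates completion.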

Using computational storage reduces the amount of data that is sent from the local storage (NAND Flash) to DRAM for processing by the host. In the US EPA example, only one in one thousand records would need to have data stored in DRAM, freeing up the host processor to focus on the most important data. 

Why AI in Computational Storage?

For the purposes of computational storage, AI is taking what is traditionally seen in brain functionality and neurons, translating that into mathematical functions, and then creating specialized hardware, accelerators, and neural network engines that can process data.

Computational storage offers a better way to manage the types of data that AI applications use. When data characterization is needed, so are training models. The system can program the processor and then adjust the processor in real time as needed. Training and inference create the common high-level definition of what's required in any AI application. But the question remains, why would an application require AI in storage?

Today’s systems generate a lot of data at the edge. Applying AI techniques at the edge, instead of sending that data back to the cloud, is becoming increasingly important due to the power, performance, and dollar costs of data movement. The value of computational storage is in reducing data movement, which is also important to optimizing AI applications. Computational storage in AI applications isolates the AI processing offline within the local storage, and then moves only the required data to the host or the data center.

Processor IP for Computational Storage

Certain types of applications can benefit from computational storage such as processor offloading, video transcoding, and search for text, images, or video. Image classification and object detection and classification in automotive applications can benefit from computational storage on the road. Each of these applications can use machine learning, encryption, and/or compression to simplify or reduce the amount of data that needs to be transported around the system to the host processor.

After determining whether computational storage benefits the system, designers consider which processors to use to manage the data. Putting more computational capabilities into the computational storage drive requires higher processing capability than systems not using computational storage.

Synopsys offers a range of processor IP that is particularly well suited for computational storage for a few reasons. DesignWare® ARC® Processor IP offers a very flexible, scalable architecture. The broad portfolio of processors ranges from low-end three-stage pipeline processors all the way up to much higher-end 10-stage pipeline real-time and embedded application processors. Finally, Synopsys’ embedded vision processors offer a neural network accelerator that can help with the AI portions of the processing.


Figure 4: A computational storage drive can include multiple ARC processors for different functions

Adding computational requirements to the computational storage drive can rapidly increase the processing needs. To support the requirements, the DesignWare ARC HS6x processors feature a dual-issue, 64-bit superscalar architecture (Figure 5). The processors offer outstanding performance, delivering up to 6.1 CoreMark/MHz with a small area footprint and low power consumption.

The ARC HS6x processors are based on the advanced ARCv3 instruction set architecture (ISA) and pipeline, which provides leadership power efficiency and code density. The processors feature a 52-bit physical address space and can directly address memories up to 4.5 Petabytes (4.5 x 10^15 bytes) in size.
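The 4.5 Petabyte figure follows directly from the 52-bit address width, as the short calculation below shows. It assumes byte-addressable memory, which the text does not state explicitly; the variable names are for illustration only.

```python
ADDRESS_BITS = 52  # physical address width quoted above

# Assuming one byte per address, the addressable range is 2^52 bytes.
addressable_bytes = 2 ** ADDRESS_BITS

decimal_pb = addressable_bytes / 10**15  # decimal petabytes (10^15 bytes)
binary_pib = addressable_bytes / 2**50   # binary pebibytes (2^50 bytes)

print(f"{addressable_bytes} bytes = {decimal_pb:.1f} PB = {binary_pib:.0f} PiB")
```

In decimal units the range rounds to 4.5 PB, and in binary units it is exactly 4 PiB.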

As you push more of the external computation beyond the storage access control into the local storage processor, you will need to provision additional processing headroom to support the required workloads. ARC HS6x cores are well suited to provide this additional processing capability.

Figure 5: ARC HS6x Processor

For applications requiring even higher performance, multicore processor versions of the HS6x are available with support for up to 12 ARC HS6x CPU cores and up to 16 hardware accelerators in a single, coherent processor cluster. 

Conclusion

The transition from traditional storage architectures to computational storage is ongoing. In traditional storage systems, the host processor handles all of the storage requests and data copies from storage to DRAM. This is inherently less efficient than computational storage. As we move towards in-storage or computational storage architectures, manipulating data is done locally on the drive itself. In this way, instead of the host having to receive and analyze the data, the host can initiate the request and let the computational storage drive, with an integrated ARC processor, handle the pre-processing. The host waits for an indication that the processing is complete and the data subset is ready for the host. This reduces power consumption and accelerates performance as a much smaller amount of data is transferred between the host and the storage devices.