Optimizing Memory for CPU, GPU and DSP Cores

Prasad Saggurti

Apr 21, 2014 / 6 min read

Table of Contents

Abstract
Different Strokes for Different Folks
CPUs and the Need for Speed
Fight the Power (Dissipation)
Conclusion

Abstract

Systems-on-chip (SoCs) targeted at the mobile phone, tablet and Smart TV segments use many different kinds of cores: CPUs, GPUs and DSPs. Each has different performance, power and area (PPA) requirements. CPUs need high speed, while GPUs seek to drive down area and power. Base-station applications require high-speed DSP implementations, while most others only require speeds under 400 MHz. This article presents memory design techniques that enable SoC designers to meet all of these seemingly conflicting goals concurrently.

Different Strokes for Different Folks

Mobile SoCs used in smartphone, tablet and Smart TV applications include multiple CPUs, GPUs and DSPs, each driving PPA in a different direction. SoC designers push CPUs towards higher performance while trying to keep power and area within budget. Target speeds of 2 GHz are common for application processors on 28-nm processes.

In contrast, designers typically try to reduce the area and power consumption of GPUs and DSPs and only require them to run at moderate speeds of around 400 MHz. When greater performance is required, additional processors are added to keep the dynamic power in check. This article describes the optimized memories contained in Synopsys’ DesignWare® High Performance Core (HPC) Design Kit and shows how they help deliver improved PPA for CPU, GPU and DSP cores.

Even though CPU, GPU, and DSP cores have different PPA targets (as seen in Figure 1), they exist on the same die and need a comprehensive set of IP building blocks such as logic libraries and embedded memories as well as tuned design flows to achieve optimal results.

Figure 1: CPUs, GPUs, and DSPs have different PPA targets

CPUs and the Need for Speed

Two kinds of bottlenecks show up when attempting to increase CPU speed: paths that go through memories and paths that don’t. For paths that do not go through memories, refer to the DesignWare Technical Bulletin article entitled Optimizing CPUs, GPUs and DSPs for High Performance and Low Power at 28-nm, which describes the available standard cells and how best to use them to enable best-in-class CPU, GPU, and DSP implementations.

For paths that involve memory instantiations, constraints need to be placed on setup and access times in addition to cycle times. Different CPU cores also place different restrictions on setup and access times. For example, timing-critical paths that end in a memory require a fast memory setup time, while paths that start from a memory need a fast access time (see Figure 2). In such situations, the input clock to the memory can be phase shifted at the SoC level so that either the setup time or the access time is improved at the cost of the other. However, some paths both originate from and end in a memory. In these cases, the memories need to keep the sum of the access and setup times low so that both ends of the path are addressed.
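The trade-off described above can be illustrated with a simplified slack model. This is only a sketch under stated assumptions: it ignores flip-flop clock-to-output and setup times, clock uncertainty, and wire delay, and all function names and delay values are hypothetical rather than taken from any Synopsys tool or datasheet. A positive `shift` means the memory clock arrives later than the core clock.

```python
# Simplified slack model for paths around a memory (all times in ns).
# Assumption: flop clock-to-q, flop setup, and clock uncertainty are ignored.

def slack_into_memory(period, logic_delay, mem_setup, shift):
    """Flop -> logic -> memory input. A later memory clock relaxes setup."""
    return period + shift - logic_delay - mem_setup

def slack_out_of_memory(period, logic_delay, mem_access, shift):
    """Memory output -> logic -> flop. A later memory clock eats into this path."""
    return period - shift - mem_access - logic_delay

def slack_memory_to_memory(period, logic_delay, mem_access, mem_setup):
    """Memory -> logic -> memory. The shift cancels; only access + setup matters."""
    return period - mem_access - logic_delay - mem_setup

if __name__ == "__main__":
    period = 0.5  # 2 GHz target, hypothetical delays below
    for shift in (0.0, 0.05):
        s_in = slack_into_memory(period, logic_delay=0.42, mem_setup=0.10, shift=shift)
        s_out = slack_out_of_memory(period, logic_delay=0.15, mem_access=0.28, shift=shift)
        print(f"shift={shift:.2f} ns: into-memory slack={s_in:+.3f}, out-of-memory slack={s_out:+.3f}")
    s_mm = slack_memory_to_memory(period, logic_delay=0.10, mem_access=0.28, mem_setup=0.10)
    print(f"memory-to-memory slack={s_mm:+.3f} (independent of shift)")
```

In this model, shifting the memory clock moves slack from one side of the memory to the other, while a memory-to-memory path gains nothing from the shift, which is why the sum of access and setup times has to be low for such paths.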

Figure 2: Fast memory access and setup times increase CPU clock frequency

Most CPU cores define the required memory access and setup times as a percentage of the cycle time, which gives memory designers a way to compute the required access and setup times for a target CPU clock frequency. Chip designers, in turn, can evaluate candidate memories directly against the specified CPU clock frequency, which makes selecting the best memories for their design a fairly straightforward exercise.
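As a minimal sketch of that calculation, the snippet below converts a target clock frequency and percentage budgets into absolute access and setup time limits. The function name and the 60% and 20% budgets are hypothetical placeholders; actual percentages come from the CPU core's implementation guidelines.

```python
def memory_timing_budget(clock_mhz, access_pct, setup_pct):
    """Return (cycle_ns, max_access_ns, max_setup_ns) for a target clock.

    access_pct and setup_pct are the fractions of the cycle time, in percent,
    that the CPU core allows for memory access and setup (assumed inputs).
    """
    cycle_ns = 1000.0 / clock_mhz
    max_access_ns = cycle_ns * access_pct / 100.0
    max_setup_ns = cycle_ns * setup_pct / 100.0
    return cycle_ns, max_access_ns, max_setup_ns

if __name__ == "__main__":
    # Hypothetical budgets for a 2 GHz application processor.
    cycle, access, setup = memory_timing_budget(2000, access_pct=60, setup_pct=20)
    print(f"cycle={cycle:.3f} ns, access<={access:.3f} ns, setup<={setup:.3f} ns")
```

A candidate memory instance can then be screened by comparing its characterized access and setup times against these limits at the relevant process, voltage and temperature corner.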

To prevent test logic from becoming a roadblock to higher operating frequency, many CPU cores now provide a separate bus for testing the memories without impacting the normal functional path. When SoC designers consider a memory test solution for their CPU, they should therefore choose a memory built-in self-test (BIST) and repair solution that supports this approach seamlessly and in an automated manner.

Fight the Power (Dissipation)

Memory designers should incorporate long-channel devices wherever possible when building custom cache memory instances for CPUs and DSPs, as this reduces leakage power in these high-speed memories. This technique is also useful when building memories for GPUs.

By nature, GPUs are datapath-intensive designs and involve a lot of FIFO usage. These FIFOs are built out of memories that have one read port and one write port. Traditionally, FIFOs were built using 8-transistor bitcells to support asynchronous clocks for the read and write ports. However, today’s GPU cores have a single clock across the whole SoC, and memory designers can take advantage of this to reduce area and leakage power. Instead of using 8-transistor bitcells and supporting two asynchronous clocks in the two-port memory, designers can use 6-transistor bitcells and support a single clock going to both the read and write ports. Both the read and the write operation then need to complete in a single memory clock cycle, which increases the minimum cycle time. Even with this increase, 6-transistor FIFOs can support a 400 MHz clock frequency, resulting in substantial area and leakage savings on each GPU core. In the example illustrated in Table 1, replacing a High-Density Two-Port Register File (HD 2P RF), which uses an 8-transistor bitcell, with an Ultra High-Density Two-Port Register File (UHD 2P RF), which uses a 6-transistor bitcell, reduces area by almost 50% and leakage by a third.
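The cycle-time argument can be sketched with a rough model. It assumes that the 8-transistor design performs the read and the write concurrently on separate ports (so the slower operation sets the cycle), while the single-clock 6-transistor design performs them back to back within one cycle (so their sum sets the cycle). Both the model and the read and write times below are hypothetical and are not taken from Table 1.

```python
def max_frequency_mhz(min_cycle_ns):
    """Convert a minimum cycle time in ns to a maximum clock frequency in MHz."""
    return 1000.0 / min_cycle_ns

def min_cycle_8t(read_ns, write_ns):
    """Assumed model: independent ports, read and write overlap in time."""
    return max(read_ns, write_ns)

def min_cycle_6t_single_clock(read_ns, write_ns):
    """Assumed model: read and write are sequenced within one clock cycle."""
    return read_ns + write_ns

def meets_target(min_cycle_ns, target_mhz=400.0):
    """True if the memory can run at least as fast as the target clock."""
    return max_frequency_mhz(min_cycle_ns) >= target_mhz

if __name__ == "__main__":
    read_ns, write_ns = 1.2, 1.0  # hypothetical operation times
    for label, cycle in (
        ("8T two-port", min_cycle_8t(read_ns, write_ns)),
        ("6T single-clock", min_cycle_6t_single_clock(read_ns, write_ns)),
    ):
        print(f"{label}: min cycle {cycle:.2f} ns, "
              f"fmax {max_frequency_mhz(cycle):.0f} MHz, "
              f"meets 400 MHz: {meets_target(cycle)}")
```

A 400 MHz target corresponds to a 2.5 ns cycle, so under this model the 6-transistor option stays viable as long as the combined read and write time fits within that budget.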

Table 1: Area and leakage reduction obtained by replacing an HD 2P RF with UHD 2P RF

In addition, many single-port SRAMs are instantiated in the GPU and DSP cores, and a very small number of configurations contribute 80% of the non-FIFO area. Reducing the memory device sizes in these configurations to the point where there is just enough performance headroom for the instances yields a meaningful reduction in area and power.
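One way to find the configurations worth tuning is a simple cumulative-area ranking. The sketch below is illustrative only: the function name, configuration names, per-instance areas and instance counts are all hypothetical, and a real flow would read them from the design's memory instance list.

```python
def dominant_configs(instances, threshold=0.80):
    """Return the smallest set of configurations, largest first, whose
    combined area reaches `threshold` of the total, plus the achieved share.

    instances: iterable of (config_name, area_per_instance, instance_count).
    """
    totals = sorted(
        ((name, area * count) for name, area, count in instances),
        key=lambda item: item[1],
        reverse=True,
    )
    total_area = sum(area for _, area in totals)
    picked, running = [], 0.0
    for name, area in totals:
        if running >= threshold * total_area:
            break
        picked.append(name)
        running += area
    return picked, running / total_area

if __name__ == "__main__":
    # Hypothetical single-port SRAM configurations (words x bits).
    srams = [
        ("1024x32", 5000.0, 40),
        ("512x64", 5200.0, 30),
        ("256x16", 900.0, 20),
        ("128x8", 300.0, 25),
        ("64x32", 400.0, 10),
    ]
    configs, share = dominant_configs(srams)
    print(f"tune first: {configs} ({share:.0%} of non-FIFO area)")
```

Only the configurations returned here would need device-size tuning; the long tail of small instances contributes little area and can keep the default compiler settings.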

GPU cores, in particular, use many memory instances, which can lead to area and power inefficiency when implementing memory BIST. If implemented incorrectly, the wiring between the memories and the BIST processor can cause congestion and test speed bottlenecks. Synopsys’ DesignWare STAR Memory System overcomes this by intelligently splitting the BIST circuitry between the memory and the BIST processor, as shown in Figure 3. While this may increase the area of an individual memory instance, the resulting simplification of the BIST wiring delivers a memory subsystem with much lower area and power. Figure 4 shows the improvement in test frequency, area, power and coverage due to this partitioning of the memory BIST circuitry.

Figure 3: DesignWare STAR Memory System BIST circuit partitioning

Figure 4: Benefits of intelligent partitioning of the memory BIST circuitry with DesignWare STAR Memory System

Conclusion

Memory designers can use many techniques to help SoC designers achieve the best CPU, GPU and DSP core implementations for their specific applications. A single set of performance-, power- and area-optimized memories that enables this multi-dimensional optimization, such as those contained in the DesignWare HPC Design Kit, can significantly reduce the effort of hardening cores to SoC-specific requirements. When combined with a process-aware memory BIST solution such as the STAR Memory System, which is tailored to the requirements of CPU test buses and the distributed memories of GPUs, these memories lead to industry-leading PPA for these different cores. Synopsys provides a unique blend of IP, tools, design flows, and expert services to help design teams achieve their processor and SoC PPA goals in the shortest possible time.

