Optimized Memory for Hardening CPU, GPU and DSP Cores

By Prasad Saggurti, Senior Product Marketing Manager, Memory Compilers and Embedded Test & Repair, Synopsys

Abstract

Systems-on-chip (SoCs) targeted at the mobile phone, tablet and Smart TV segments use many different kinds of cores – CPUs, GPUs and DSPs – each with different performance, power and area (PPA) requirements. CPUs need high speed, while GPUs seek to drive down area and power. Base-station applications require high-speed DSP implementations, while most others require speeds under 400 MHz. This article presents memory design techniques that enable SoC designers to concurrently meet all of these seemingly conflicting goals.

Different Strokes for Different Folks

Mobile SoCs used in smartphone, tablet and Smart TV applications include multiple CPUs, GPUs and DSPs, each driving PPA in a different direction. SoC designers push CPUs towards higher performance while trying to keep power and area within budget. Target speeds of 2 GHz are common for application processors on 28-nm processes.

In contrast, designers typically try to reduce the area and power consumption of GPUs and DSPs, which only need to run at moderate speeds of around 400 MHz. When greater performance is required, additional processor cores are added rather than raising the clock frequency, keeping dynamic power in check. In this article we describe the optimized memories contained in Synopsys’ DesignWare® High Performance Core (HPC) Design Kit and show how they help deliver improved PPA for CPU, GPU and DSP cores.

Even though CPU, GPU, and DSP cores have different PPA targets (as seen in Figure 1), they exist on the same die and need a comprehensive set of IP building blocks such as logic libraries and embedded memories as well as tuned design flows to achieve optimal results.

Figure 1: CPUs, GPUs, and DSPs have different PPA targets

CPUs and the Need for Speed

There are two kinds of bottlenecks that show up when attempting to increase CPU speed: paths that go through memories and paths that don’t. For managing paths that do not go through memories, refer to the DesignWare Technical Bulletin article entitled Optimizing CPUs, GPUs and DSPs for High Performance and Low Power at 28-nm, which describes the available standard cells and how best to use them to enable best-in-class CPU, GPU and DSP implementations. For paths that involve memory instantiations, constraining the cycle time is not enough; the memory’s setup and access times must be constrained as well, and different CPU cores place different restrictions on them. Timing-critical paths that end in a memory require a fast setup time, while paths that start from a memory need a fast access time (see Figure 2). In such situations, the input clock to the memory can be phase shifted at the SoC level, improving either the setup or the access time at the cost of the other. Some paths, however, both originate from and end in a memory; for these, the memory must keep the sum of its access and setup times low so that both ends of the path are addressed.

Figure 2: Fast memory access and setup times increase CPU clock frequency
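To make the tradeoff concrete, the following Python sketch works through the budgeting with assumed delay values; real access, setup and logic delays come from memory characterization and static timing analysis, not from these numbers.

```python
# Illustrative memory-path timing budget; every value below is an
# assumption for this sketch, not characterized silicon data.
t_cycle = 500.0   # ps, one period of a 2 GHz CPU clock
t_access = 230.0  # ps, memory clock-to-data-out (access time)
t_setup = 130.0   # ps, memory data-in setup time
t_logic = 120.0   # ps, combinational logic between two memories

# Memory-to-memory path: phase shifting cannot help, because moving the
# memory clock gives margin to one end of the path only by taking the
# same amount from the other. The path closes only if the sum fits.
slack = t_cycle - (t_access + t_logic + t_setup)
print(f"memory-to-memory slack: {slack:+.0f} ps")  # +20 ps with these numbers

# Advancing the memory clock by `shift` ps helps paths launched from the
# memory while costing paths captured by it, and vice versa.
shift = 50.0
time_after_access = t_cycle - t_access + shift  # paths starting at the memory
time_before_setup = t_cycle - t_setup - shift   # paths ending at the memory
print(f"time left after access: {time_after_access:.0f} ps")
print(f"time left before setup: {time_before_setup:.0f} ps")
```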

Most CPU cores define the required memory access and setup times as a percentage of the cycle time, which gives memory designers a direct way to compute the required values for a target CPU clock frequency. It also lets chip designers evaluate candidate memories against the specified CPU clock frequency, making the selection of the best memories for a design a fairly straightforward exercise.
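As a worked example of this percentage-based specification, suppose a CPU core’s integration guide allotted 45% of the cycle to memory access and 25% to setup; those fractions are invented here purely for illustration.

```python
# Hypothetical budget fractions; each CPU core's documentation defines
# its own percentages, so treat these constants as placeholders.
f_cpu_ghz = 2.0                  # target CPU clock from the text above
t_cycle_ps = 1000.0 / f_cpu_ghz  # 500 ps cycle time

ACCESS_FRACTION = 0.45           # assumed: access time <= 45% of cycle
SETUP_FRACTION = 0.25            # assumed: setup time <= 25% of cycle

print(f"access time budget: {ACCESS_FRACTION * t_cycle_ps:.0f} ps")  # 225 ps
print(f"setup time budget:  {SETUP_FRACTION * t_cycle_ps:.0f} ps")   # 125 ps
```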

To prevent test from becoming a roadblock to higher operating frequencies, many CPU cores now provide a separate bus for testing the memories without impacting the normal functional path. When SoC designers consider a memory test solution for their CPU, they should therefore choose a memory built-in self-test (BIST) and repair solution [1] that supports this approach seamlessly and automatically.

Fight the Power (Dissipation)

Memory designers should incorporate long-channel devices wherever possible when building custom cache memory instances for CPUs and DSPs, as this reduces leakage power in these high-speed memories. The same technique is also useful when building memories for GPUs.

By nature, GPUs are datapath-intensive designs that make heavy use of FIFOs. These FIFOs are built out of memories with one read port and one write port. Traditionally, FIFOs used 8-transistor bitcells so they could support asynchronous clocks on the read and write ports. In today’s GPU cores, however, a single clock is distributed across the whole SoC, and memory designers can exploit it to reduce area and leakage power: instead of 8-transistor bitcells serving two asynchronous clocks, they can use 6-transistor bitcells with a single clock driving both the read and write ports. Both the read and the write operation must then complete within a single memory clock cycle, which increases the minimum cycle time. Even with this increase, 6-transistor FIFOs can support a 400 MHz clock frequency, yielding substantial area and leakage savings on each GPU core. In the example illustrated in Table 1, replacing a High-Density Two-Port Register File (HD 2P RF), which uses an 8-transistor bitcell, with an Ultra High-Density Two-Port Register File (UHD 2P RF), which uses a 6-transistor bitcell, reduces area by almost 50% and leakage by a third.

Table 1: Area and leakage reduction obtained by replacing an HD 2P RF with UHD 2P RF 
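A back-of-the-envelope check shows why the single-clock 6T approach still clears a 400 MHz target even though both operations must fit in one cycle. The per-operation timings below are assumptions for this sketch; actual numbers come from the memory compiler’s characterization data.

```python
# Assumed phase timings for a 6T-based two-port register file.
t_read_ns = 1.0   # ns, read phase within one memory cycle (assumed)
t_write_ns = 1.2  # ns, write phase within the same cycle (assumed)

# With one clock driving both ports, the read and the write are performed
# back-to-back inside a single cycle, so the minimum cycle time is the
# sum of the two phases rather than the larger of them.
t_cycle_ns = t_read_ns + t_write_ns
f_max_mhz = 1000.0 / t_cycle_ns
print(f"6T single-clock f_max = {f_max_mhz:.0f} MHz")  # ~455 MHz

# An 8T bitcell with independent ports would be limited only by the
# slower port (~833 MHz here), but at a ~400 MHz GPU/DSP target that
# extra headroom is unnecessary, while the 6T cell saves area and leakage.
```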

In addition, GPU and DSP cores instantiate many single-port SRAMs, and a very small number of configurations typically accounts for 80% of the non-FIFO memory area. Downsizing the devices in these memories to the point where just enough performance headroom remains yields a meaningful reduction in area and power, as the sketch below illustrates.
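As a sketch of how such an 80/20 analysis might look, the following snippet groups a chip’s memory instances by configuration and ranks each configuration’s share of the total area; the instance list and areas are invented for illustration.

```python
from collections import defaultdict

# Invented (configuration, instance area) pairs standing in for a real
# SoC's non-FIFO single-port SRAM instance list.
instances = [
    ("4096x32", 12000.0), ("4096x32", 12000.0), ("4096x32", 12000.0),
    ("2048x64", 11500.0), ("2048x64", 11500.0),
    ("512x22", 2100.0), ("256x16", 1400.0), ("128x8", 900.0),
]

area_by_config = defaultdict(float)
for config, area in instances:
    area_by_config[config] += area

total = sum(area_by_config.values())
for config, area in sorted(area_by_config.items(), key=lambda kv: -kv[1]):
    print(f"{config:>8}: {100.0 * area / total:5.1f}% of non-FIFO SRAM area")

# The one or two configurations at the top of this ranking are the
# candidates for device downsizing: reducing device sizes there, until
# just enough performance headroom remains, captures most of the savings.
```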

GPU cores, in particular, instantiate memory in many places, which can make memory BIST area- and power-inefficient: if implemented incorrectly, the wiring between the memories and the BIST processor leads to congestion and test-speed bottlenecks. Synopsys’ DesignWare STAR Memory System overcomes this by intelligently splitting the BIST circuitry between the memory and the BIST processor, as shown in Figure 3. While this may slightly increase the area of an individual memory instance, the resulting simplification of the BIST wiring delivers a memory subsystem with much lower area and power. Figure 4 shows the improvements in test frequency, area, power and coverage that result from this innovative partitioning of the memory BIST circuitry.

Figure 3: DesignWare STAR Memory System BIST circuit partitioning 

Figure 4: Benefits of intelligent partitioning of the memory BIST circuitry with DesignWare STAR Memory System

Conclusion

There are many techniques memory designers can use to enable SoC designers to achieve the best CPU, GPU and DSP core implementations for their specific applications. A single set of performance-, power- and area-optimized memories that enables this multi-dimensional optimization, such as those in the DesignWare HPC Design Kit, can significantly reduce the effort of hardening cores to SoC-specific requirements. Combined with a process-aware memory BIST solution like the STAR Memory System, which is tailored to the requirements of CPU test buses and the distributed memories of GPUs, these memories deliver industry-leading PPA across the different cores. Synopsys provides a unique blend of IP, tools, design flows and expert services to help design teams achieve their processor and SoC PPA goals in the shortest possible time.

Learn more about how Synopsys memories can benefit your SoC design.

[1] Synopsys’ DesignWare STAR Memory System supports this approach.