Brett Murdock, Product Marketing Director, Synopsys
In January 2022, JEDEC released the new standard JESD238, High Bandwidth Memory (HBM3) DRAM. The HBM3 standard offers several feature enhancements over the existing HBM2E standard (JESD235D), including support for larger densities, higher speed operation, increased bank count, enhanced Reliability, Availability, Serviceability (RAS) capabilities, a lower power interface and a new clocking architecture. HBM3 memories will soon be found in HPC applications such as AI, graphics, networking and potentially even automotive. This article covers the key features of the HBM3 standard: high capacity, low power, improved channel and clocking architecture, and more advanced RAS options. Some of these features are highlighted in Figure 1.
HBM2E has an upper limit of 16 Gb per device, which can be implemented in a 12-high stack for a total density of 24 GB. We’ve yet to see any 12-high HBM2E stacks in the market, but the standard allows for them. The HBM3 standard enables devices with up to 32 Gb of density and stacks up to 16 high for a total of 64 GB of storage, roughly 2.7x the HBM2E maximum. Synopsys expects 16 GB and 24 GB HBM3 devices in 8-high and 12-high stack options to hit the market soon.
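The capacity figures above follow directly from die density and stack height. The short Python sketch below reproduces the arithmetic; the function name is illustrative and not part of any standard.

```python
def stack_capacity_gbytes(device_gbits: int, stack_height: int) -> float:
    """Total stack capacity in gigabytes for a stack of identical dies."""
    return device_gbits * stack_height / 8  # 8 bits per byte


hbm2e_max = stack_capacity_gbytes(device_gbits=16, stack_height=12)
hbm3_max = stack_capacity_gbytes(device_gbits=32, stack_height=16)
growth = hbm3_max / hbm2e_max

print(f"HBM2E maximum: {hbm2e_max:.0f} GB")
print(f"HBM3 maximum:  {hbm3_max:.0f} GB")
print(f"Growth:        {growth:.2f}x")

# The near-term configurations mentioned above, built from 16 Gb dies.
print(f"8-high, 16 Gb dies:  {stack_capacity_gbytes(16, 8):.0f} GB")
print(f"12-high, 16 Gb dies: {stack_capacity_gbytes(16, 12):.0f} GB")
```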
To support the larger density devices, HBM3 increases the number of banks available when moving from a 12-high stack to a 16-high stack, offering a maximum bank count of 64 banks, an increase of 16 banks over the 48 available in a 12-high stack.
The HBM3 standard has a top speed of 6.4 Gbps, roughly 1.8x the 3.6 Gbps top speed of HBM2E.
It is not unreasonable to expect a second generation of HBM3 devices in the not-too-distant future. One need only look at the speed history of HBM2/2E, of DDR5 (6400 Mbps upgraded to 8400 Mbps), and of LPDDR5, which maxed out at 6400 Mbps and quickly gave way to LPDDR5X operating at 8533 Mbps. HBM3 above 6.4 Gbps? It’s just a matter of when.
In addition to increasing capacity and speed, HBM3 keeps a focus on power efficiency. HBM2E already offers the lowest energy per bit transferred, largely because it is an unterminated interface, but HBM3 substantially improves on HBM2E. HBM3 decreases the core voltage to 1.1 V, compared with HBM2E’s 1.2 V core voltage. In addition to the 100 mV core supply drop, HBM3 reduces the IO signaling voltage to 400 mV from HBM2E’s 1.2 V.
HBM2E defined a channel as a 128-bit interface composed of two 64-bit pseudo-channels. While the pin interface is defined on a per-channel basis, designers see pseudo-channels as a critical feature when it comes to accessing the memory from the system. HBM2E’s burst length to the pseudo-channels is 4 beats, allowing memory access in 32-byte packets (8 bytes wide, 4 beats per access), which is equivalent in size to most GPU cache lines.
HBM3 has kept the overall interface size the same for the HBM DRAMs: 1024 bits of data. However, this 1024-bit interface is now divided into 16 64-bit channels or, more importantly, 32 32-bit pseudo-channels. Since the width of the pseudo-channels has been reduced to 4 bytes, the burst length of accesses to the memory has increased to 8 beats, maintaining a 32-byte packet size for memory accesses.
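The channel arithmetic can be checked with a few lines of Python. This is only a sketch of the numbers quoted above; the class and field names are illustrative assumptions, not terms from the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoChannelLayout:
    name: str
    interface_bits: int       # total data width of the stack interface
    pseudo_channel_bits: int  # data width of one pseudo-channel
    burst_length: int         # beats per access

    @property
    def pseudo_channels(self) -> int:
        return self.interface_bits // self.pseudo_channel_bits

    @property
    def packet_bytes(self) -> int:
        # Bytes per beat multiplied by beats per access.
        return (self.pseudo_channel_bits // 8) * self.burst_length


HBM2E = PseudoChannelLayout("HBM2E", 1024, 64, 4)
HBM3 = PseudoChannelLayout("HBM3", 1024, 32, 8)

for layout in (HBM2E, HBM3):
    print(f"{layout.name}: {layout.pseudo_channels} pseudo-channels, "
          f"{layout.packet_bytes}-byte packets")
```

Halving the pseudo-channel width while doubling the burst length keeps the access granularity at 32 bytes and doubles the number of independently addressable pseudo-channels.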
Doubling the number of pseudo-channels will be a performance improvement over HBM2E. Combined with the increase in data rate, HBM3 can provide a substantial increase in performance over HBM2E.
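To put a number on the data rate contribution, the sketch below computes the theoretical peak bandwidth of one stack from the 1024-bit interface width and the top per-pin data rates quoted earlier. It is an upper bound under ideal conditions, not a measured figure, and the function name is illustrative.

```python
INTERFACE_BITS = 1024  # data width of one HBM stack, same for HBM2E and HBM3


def peak_bandwidth_gbytes_per_s(pin_rate_mbps: int) -> float:
    """Theoretical peak bandwidth of one stack in GB/s."""
    total_mbits_per_s = INTERFACE_BITS * pin_rate_mbps
    return total_mbits_per_s / 8 / 1000  # bits to bytes, mega to giga


hbm2e_bw = peak_bandwidth_gbytes_per_s(3600)
hbm3_bw = peak_bandwidth_gbytes_per_s(6400)

print(f"HBM2E peak: {hbm2e_bw:.1f} GB/s")
print(f"HBM3 peak:  {hbm3_bw:.1f} GB/s")
print(f"Ratio:      {hbm3_bw / hbm2e_bw:.2f}x")
```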
Some things are carried forward from HBM2E into HBM3, such as DBI(ac) and parity on the data bus. Other things have changed such as Command and Address (CA) parity moving from being encoded in the command to being a separate signal on the CA bus.
One of the biggest changes for RAS in HBM3 is how error correcting code (ECC) is handled. Let’s start by examining the host side of ECC.
HBM2E provides an option for the host to enable a sideband ECC implementation by allowing the DM signal to be repurposed as an ECC storage location. Referencing HBM2E's pseudo-channel size, this gives the user a very familiar ECC option, much like that of DDR4 ECC DIMMs – supporting 64-bits of data and 8-bits of ECC.
HBM3 has changed this ECC approach in a few ways. The first is the removal of the DM signal entirely. If a system needs to transfer fewer than 32 bytes of data to the memory, it will require a read-modify-write operation, which can be detrimental to performance.
With the removal of the DM signal from the HBM3 standard comes the addition of two ECC signals per pseudo-channel. This doesn’t quite provide the user with the same SECDED ECC capability, because the user must consider the entire packet access of 32 bytes of data (4 bytes of data over 8 beats) and 2 bytes of check bits (2 bits over 8 beats), which together form a 34-byte / 272-bit code word.
The HBM3 standard considers the device side as well, requiring the HBM3 DRAMs to have on-die ECC. The on-die ECC is constructed using 272-bit data words and 32 check bits, forming a 304-bit code word. The data word size for the HBM3 DRAMs is the code word size used by the host. As a result, the HBM3 DRAMs protect not only the data but also the host-generated check bits.
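The two layers of ECC nest inside each other, which is easiest to see by working through the bit counts. The Python sketch below only reproduces the sizes quoted above; the constant names are illustrative, and it does not model the actual ECC code used by any device.

```python
# Per pseudo-channel, per access (values taken from the text above).
DATA_BITS_PER_BEAT = 32  # 4 bytes of data per beat
ECC_BITS_PER_BEAT = 2    # two ECC signals per pseudo-channel
BURST_LENGTH = 8         # beats per access
ON_DIE_CHECK_BITS = 32   # added by the DRAM's on-die ECC

# Host side: data plus host-generated check bits.
host_data_bits = DATA_BITS_PER_BEAT * BURST_LENGTH
host_check_bits = ECC_BITS_PER_BEAT * BURST_LENGTH
host_code_word_bits = host_data_bits + host_check_bits

# Device side: the whole host code word is treated as the data word.
device_data_word_bits = host_code_word_bits
device_code_word_bits = device_data_word_bits + ON_DIE_CHECK_BITS

print(f"Host code word:   {host_data_bits} + {host_check_bits} "
      f"= {host_code_word_bits} bits ({host_code_word_bits // 8} bytes)")
print(f"Device code word: {device_data_word_bits} + {ON_DIE_CHECK_BITS} "
      f"= {device_code_word_bits} bits")
```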
The HBM3 standard provides the results of the ECC operation on a real time basis. Each pseudo-channel includes two severity signals which provide information on a burst access when reading from the HBM3 DRAM. The information provided is one of four responses – the data provided didn’t have errors, the data provided had a single error corrected, the data provided had multiple errors corrected or the data provided had uncorrected errors.
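A host might represent the four responses as an enumeration and track the worst result seen over a series of reads. The sketch below is a hypothetical illustration: the numeric values are placeholders chosen so that larger means more severe, and they are not the signal encoding defined in JESD238. The helper names are also assumptions.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    # Placeholder values, ordered by severity. NOT the JESD238 encoding.
    NO_ERROR = 0
    SINGLE_CORRECTED = 1
    MULTI_CORRECTED = 2
    UNCORRECTED = 3


def worst_severity(reports: Iterable[Severity]) -> Severity:
    """Return the most severe report, or NO_ERROR if there were no reads."""
    return max(reports, default=Severity.NO_ERROR)


def data_is_usable(severity: Severity) -> bool:
    """Corrected data can be consumed; uncorrected data cannot be trusted."""
    return severity is not Severity.UNCORRECTED


reads = [Severity.NO_ERROR, Severity.SINGLE_CORRECTED, Severity.NO_ERROR]
summary = worst_severity(reads)
print(summary.name, data_is_usable(summary))
```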
HBM3 DRAM devices also support Error Check and Scrub (ECS) when the device is in Self Refresh or when the host issues a Refresh all bank command. The results of the ECS operation may be obtained by accessing ECC transparency registers via the IEEE standard 1500 Test Access Port (TAP).
A new RAS feature in the HBM3 standard is support for Refresh Management (RFM) or Adaptive Refresh Management (ARFM). Typically, RFM/ARFM is used as a technique to counter row hammer, whether intentional or unintentional. Row hammer occurs when repeated accesses to a DRAM row or row region disturb unaccessed nearby rows, compromising the data in those rows. Using information in the HBM3 DRAM, HBM3 controllers can determine when additional refresh management is required to mitigate row hammer.
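The general idea behind controller-side refresh management can be sketched as a per-bank activation counter that triggers extra refresh activity once a threshold is reached. The Python example below is a deliberately simplified, hypothetical model: the class name, the threshold value and the reset-to-zero behavior are assumptions for illustration and do not reflect the counters or thresholds defined in the HBM3 standard.

```python
from collections import defaultdict


class ActivationTracker:
    """Hypothetical per-bank activation counter for refresh management."""

    def __init__(self, threshold: int) -> None:
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        self.threshold = threshold  # assumed value, normally device-provided
        self.counts = defaultdict(int)

    def activate(self, bank: int) -> bool:
        """Record one row activation.

        Returns True when the controller should schedule refresh
        management for this bank; the counter is then cleared
        (a simplification for this sketch).
        """
        self.counts[bank] += 1
        if self.counts[bank] >= self.threshold:
            self.counts[bank] = 0
            return True
        return False


tracker = ActivationTracker(threshold=4)
events = [tracker.activate(bank=0) for _ in range(5)]
print(events)
```

In this model the fourth activation of bank 0 requests refresh management and the count starts again from zero; other banks are tracked independently.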
One of the key features of HBM3 is the new clocking scheme. In all previous generations of HBM, a single clock from the host to the device essentially synchronized the interface between the host and device. This clock signal (CK) was used to set the transfer rate of the CA signals passing from the host to device. In addition, it fixed the rate at which data (DQ) and the data strobes (WDQS/RDQS) were transferred between the host and device (writes) or the device and host (reads).
When considering HBM2E, both the clock signal and the data strobes operate at a maximum rate of 1.8 GHz, so the maximum effective rate of information transfer on the CA interface is 3.6 Gbps just as with the data.
HBM3 changes the clocking architecture by decoupling the traditional clock signal from the host to the device and the data strobe signals. In fact, while the new maximum rate of WDQS and RDQS in HBM3 is 3.2 GHz to enable a data transfer rate of up to 6.4 Gbps, the fastest rate the CK will run from the host to the device is only 1.6 GHz (even when the data channels are operating at 6.4 Gbps).
Decoupling the clock signal from the strobes allows the clock signal to run significantly slower than the data strobes. The maximum transfer rate of information on the CA bus is now 3.2 Gbps since the CA clock has been capped at a maximum rate of 1.6 GHz. While HBM2E requires a CA transfer rate of 3.6 Gbps, HBM3 only requires a CA transfer rate of 3.2 Gbps.
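Because information is transferred on both edges of the relevant clock or strobe, each transfer rate is twice the corresponding frequency. The sketch below reproduces the figures above in Python; the dictionary layout and names are illustrative assumptions.

```python
def double_data_rate_mbps(clock_mhz: int) -> int:
    """Per-pin transfer rate when data moves on both clock edges."""
    return 2 * clock_mhz


# Maximum frequencies quoted in the text, in MHz.
CLOCKING = {
    "HBM2E": {"ck_mhz": 1800, "strobe_mhz": 1800},  # coupled
    "HBM3": {"ck_mhz": 1600, "strobe_mhz": 3200},   # decoupled
}

rates = {
    name: {
        "ca_mbps": double_data_rate_mbps(clk["ck_mhz"]),
        "dq_mbps": double_data_rate_mbps(clk["strobe_mhz"]),
    }
    for name, clk in CLOCKING.items()
}

for name, r in rates.items():
    print(f"{name}: CA {r['ca_mbps']} Mbps, DQ {r['dq_mbps']} Mbps")
```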
The decision to decouple the CA clock and the data strobes affects not only the interface between the host and the device, it also affects the interface of the HBM3 controller and HBM3 PHY inside the host.
Inside a typical host, a controller and a PHY communicate with the external memory. The interface between the controller and PHY is commonly implemented with a specification known as the DDR PHY Interface (DFI). The DFI specification allows SoC designers to separate the design of the HBM3 controller, which typically converts system commands into HBM commands, and the HBM3 PHY, which typically converts the digital domain on the SoC to the analog domain of the host-to-device interface. Having a defined interface between the HBM3 controller and HBM3 PHY provides designers and integrators a clear delineation for splitting design teams between digital (controllers) and analog (PHYs).
In a high-performance HBM2E solution, in addition to bandwidth, latency is also a focus for the controller and PHY. In an HBM2E system the clock and the strobes run at the same frequency – up to 1.8 GHz. The lowest latency solution for an HBM2E system is to use a DFI 1:1 frequency ratio – keeping the controller, DFI, PHY and memory all running on the same 1.8 GHz clock.
The new HBM3 clocking architecture enables the user to keep focus on a low-latency, high-performance solution when migrating from HBM2E to HBM3. As noted above, the highest defined frequency for the CA bus with HBM3 is 1.6 GHz while the data strobes operate at 3.2 GHz. This enables users to implement a DFI 1:1:2 frequency ratio for an HBM3 controller and PHY. In this case, the controller, DFI, PHY and memory clock all run at 1.6 GHz while the strobe frequency is 3.2 GHz. This gives designers a DFI 1:1 frequency ratio for the command and address interface and a DFI 1:2 frequency ratio for the data, all of which minimize latency.
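The frequency ratios can be expressed as multipliers on the controller clock. The following Python sketch derives the clock plan for the two cases described above; the function and key names are illustrative and do not come from the DFI specification.

```python
def dfi_clock_plan(controller_mhz: int, ca_ratio: int, data_ratio: int) -> dict:
    """Derive memory-side clocks from the controller clock.

    ca_ratio and data_ratio are memory-side clock cycles per
    controller clock cycle (1 for 1:1, 2 for 1:2).
    """
    return {
        "controller_mhz": controller_mhz,
        "ck_mhz": controller_mhz * ca_ratio,
        "strobe_mhz": controller_mhz * data_ratio,
        # Double data rate: two beats per strobe cycle.
        "dq_beats_per_controller_cycle": 2 * data_ratio,
    }


hbm2e_plan = dfi_clock_plan(1800, ca_ratio=1, data_ratio=1)  # DFI 1:1
hbm3_plan = dfi_clock_plan(1600, ca_ratio=1, data_ratio=2)   # DFI 1:1:2

for name, plan in (("HBM2E", hbm2e_plan), ("HBM3", hbm3_plan)):
    print(name, plan)
```

With the 1:1:2 ratio, the HBM3 controller runs at a lower frequency than its HBM2E counterpart yet moves four data beats per pin in each controller cycle, so an 8-beat access completes in two controller cycles.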
The HBM3 standard offers several improvements over the HBM2E standard. Some are expected improvements – larger, faster and lower power devices. Some are unexpected – channel architecture changes, RAS improvements and an updated clocking methodology. Cumulatively, the new standard offers users a significantly improved HBM memory for the next generation of SoCs.
Synopsys offers a complete HBM3 IP solution including controller, PHYs available in leading technology nodes, and Verification IP. Synopsys is an active member of JEDEC helping to drive development and adoption of the newest memory standards. Synopsys’ configurable memory interface IP solutions can be tailored to meet the exact requirements of SoCs for applications such as graphics, cloud computing, networking, AI and potentially automotive.