The video footage captured by a home’s security doorbell or a building’s surveillance system. Financial and business operations modeling. Scientific and medical research. Augmented reality and virtual reality. The list goes on when it comes to application areas that rely on HPC. The proliferation of data collected by our devices and systems, AI-fueled analytics, the availability of massive amounts of compute resources, and the cloud are converging to make it possible to quickly derive useful, actionable insights, making HPC an integral part of a much wider array of applications than when the very first supercomputers emerged in the 1940s.
A typical HPC infrastructure today consists of three key elements: compute, network, and storage. Each requires a certain level of performance, latency, power efficiency, scalability, productivity, and security. Let’s take a closer look at each element:
- Compute consists of CPUs and GPUs, accelerators, network on chip (NoC), and compute servers. This is where the high-performance data processing takes place. Complex multi-core or even multi-die system architectures, large memories with fast access, high-bandwidth I/O interfaces, power/cooling management, and security are key characteristics. In-chip monitoring and analysis can support RAS goals.
- Network consists of switches and routers, adapters, bridges, repeaters, the network interface card (like a SmartNIC), and optical and electrical interconnects. This element delivers high-performance connectivity, ideally with high throughput, low latency, energy efficiency, configurability and scalability, real-time monitoring and reporting, and security. Debug capabilities, forward error correction (FEC), and IP can support RAS requirements.
- Storage consists of a solid-state drive (SSD) or hard disk drive (HDD), storage area network (SAN), and network attached storage (NAS). Ideally, the storage element should provide high-bandwidth storage, reduced data transfer energy and latency, flexibility, scalability, reliability, and security. Capabilities like built-in self-test (BIST), error correction code (ECC), and redundancy can facilitate high levels of RAS.
There are two primary types of HPC systems: the homogeneous machines and the hybrid ones. Homogeneous machines only have CPUs. Hybrids, by contrast, have both GPUs and CPUs, with GPUs running the tasks and CPUs overseeing the computation.
HPC clusters can be composed of large numbers of servers, where the total physical size, energy use, or heat output of the computing cluster might become a serious issue. Furthermore, there are requirements for dedicated communications among the servers that are somewhat unique to clusters.
Because small design differences amount to large benefits when multiplied by the number of servers in the clusters, we are seeing the emergence of server designs that are optimized for HPC. Sometimes these are designs targeted at large, public Web operators, such as search engine firms, that deliver similar benefits in HPC clusters. However, they can also offer features only appropriate for HPC users. For example, if the system were designed to provide the cluster interconnect differently, there might be potential for significant cabling reductions.