Cross-Generation Reliability Study of NVIDIA Volta and Ampere GPUs on Supercomputers: Similarities and Differences

Authors

  • Dhatri Parakal, Department of Computer Science, George Mason University, Fairfax, VA
  • Zhu Zhu, Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL
  • Lishan Yang, Department of Computer Science, George Mason University, Fairfax, VA

Abstract

High-performance computing (HPC) workloads, such as training large AI models, hurricane prediction simulations, and climate modeling, rely on GPUs computing correctly to keep computing costs low and power efficiency high. By characterizing Double-Bit Errors (DBEs) in GPUs and tracking GPU reliability across architecture generations, we can gauge how changes made between generations have affected reliability and which of those changes were effective or ineffective.
We analyze the Oak Ridge Leadership Computing Facility (OLCF) Summit Supercomputer GPU Snapshots dataset, an extensive two-year record of double-bit error occurrences in the 27,648 Tesla V100 GPUs of the Summit supercomputer, one of the top 10 supercomputers in the world. Our analysis of the V100 data focuses on several error characteristics, including GPU location within a Summit node, the number of daily errors, and the time between errors. We use Pandas and NumPy for data organization and analysis and Matplotlib for data visualization to gain better insight into the overall reliability and possible error patterns of the V100 GPU. Furthermore, we compare visualizations of the V100 data with those of its successor, the A100 GPU, to perform a cross-generation study. Preliminary data suggest that a GPU's location and its history of failures affect the likelihood of future DBEs. Previous studies and our initial results indicate that 1) bursty error patterns exist in both generations and 2) the V100 is less reliable than the newer A100.
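The three error characteristics named above (errors per day, time between errors, and GPU location within a node) could be computed with Pandas along the following lines. This is only a minimal sketch under stated assumptions: the column names `timestamp` and `gpu_slot`, the sample rows, and the 600-second burst threshold are all hypothetical and do not reflect the actual schema or contents of the OLCF dataset or the authors' method.

```python
import pandas as pd

# Hypothetical DBE event log. Column names and values are illustrative only;
# they are NOT the schema or contents of the OLCF Summit dataset.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 00:10",
                "2020-01-01 00:12",
                "2020-01-01 05:00",
                "2020-01-03 09:30",
                "2020-01-03 09:31",
            ]
        ),
        "gpu_slot": [0, 0, 3, 0, 5],  # assumed GPU position within a node
    }
).sort_values("timestamp", ignore_index=True)

# Number of errors per calendar day; days without errors appear as 0.
daily_errors = events.set_index("timestamp").resample("D").size()

# Time between consecutive errors across the whole log, in seconds.
time_between = events["timestamp"].diff().dropna().dt.total_seconds()

# Error count per GPU position within a node.
errors_by_slot = events.groupby("gpu_slot").size()

# A simple burstiness indicator: the fraction of gaps below a threshold.
BURST_GAP_SECONDS = 600  # assumed threshold, chosen only for illustration
burst_fraction = (time_between < BURST_GAP_SECONDS).mean()

print(daily_errors)
print(errors_by_slot)
print(f"fraction of gaps under {BURST_GAP_SECONDS}s: {burst_fraction:.2f}")
```

A large share of short gaps alongside many zero-error days is one informal sign of the bursty pattern described in the abstract; the resulting series can be passed directly to Matplotlib for plotting.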

Published

2024-10-13

Section

College of Engineering and Computing: Department of Computer Science