An Investigation on Numerical Bugs in GPU Programs Towards Automated Bug Detection

Abstract

General-purpose graphics processing unit (GPU) computing has emerged as a leading parallel computing paradigm, offering significant performance gains in domains such as scientific computing and deep learning. However, GPU programs are susceptible to numerical bugs, which can lead to incorrect results or crashes. These bugs are difficult to detect, debug, and fix because they depend on specific input values or types and because reliable error-checking mechanisms and oracles are absent. Additionally, the unique programming conventions of GPUs complicate identifying the root causes of bugs, while fixing them requires domain-specific knowledge of GPU computing and numerical libraries. Understanding the characteristics of GPU numerical bugs is therefore crucial for developing effective solutions. In this paper, we conduct a comprehensive study of GPU programming numerical bugs (GPU-NBs) by analyzing 397 real-world bug samples from GitHub. We identify the common root causes and symptoms of these bugs, the input patterns and test oracles that trigger them, and the strategies used to fix them. We also present GPU-NBDetect, a preliminary tool designed to detect numerical bugs across six distinct bug categories. GPU-NBDetect detected a total of 226 bugs across 186 mathematical functions in four libraries, 60 of which were confirmed by developers. Our findings lay the groundwork for developing detection and prevention techniques for GPU numerical bugs and offer insights for building more effective debugging and auto-repair tools.

Publication
In *the ACM SIGSOFT International Symposium on Software Testing and Analysis*.