MENDNet: Just-in-time Fault Detection and Mitigation in AI Systems with Uncertainty Quantification and Multi-Exit Networks


Hardware faults in AI accelerators, particularly in accelerator memory, can alter the parameters of pre-trained deep neural networks (DNNs), leading to errors that compromise performance. Just-in-time (JIT) fault detection and mitigation are therefore crucial. However, existing fault detection and mitigation approaches either interrupt continuous execution or introduce significant latency, making them poorly suited to JIT implementation. To circumvent this issue, this paper explores uncertainty quantification in deep neural networks as the basis for an efficient and novel fault detection approach in AI accelerators. Furthermore, to mitigate the impact of such faults, we propose MENDNet, which leverages the properties of multi-exit neural networks, coupled with the proposed uncertainty quantification framework. By tuning the confidence threshold for inference at each exit and leveraging an energy-based uncertainty quantification metric, MENDNet can make accurate predictions even in the presence of faults in the accelerator. When evaluated on state-of-the-art network-dataset configurations and with multiple combinations of fault rate and fault position, our proposed approach achieves up to 80.42% improvement in accuracy over a traditional DNN implementation, thereby improving the reliability of the AI accelerator in mission mode.
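
As an illustration only, the idea of combining per-exit confidence thresholds with an energy-based uncertainty score can be sketched as follows. This is a minimal sketch under stated assumptions and not the MENDNet implementation: the function names, the threshold values, and the acceptance rule (take the first exit whose softmax confidence is high enough and whose energy is low enough, otherwise fall back to the last exit) are all hypothetical. The energy score uses the common free-energy form E = -T log sum_k exp(z_k / T) over the logits z, where a lower value indicates a more confident output.

```python
import math
from typing import List, Sequence, Tuple


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Free-energy score E = -T * log(sum(exp(z / T))); lower means more confident."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def softmax_confidence(logits: Sequence[float]) -> Tuple[int, float]:
    """Return (predicted class index, maximum softmax probability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    best = max(range(len(exps)), key=lambda i: exps[i])
    return best, exps[best] / sum(exps)


def multi_exit_predict(
    exit_logits: List[Sequence[float]],
    conf_thresholds: Sequence[float],
    energy_threshold: float,
) -> Tuple[int, int]:
    """Hypothetical acceptance rule: return (exit index, class) for the first
    exit that passes both checks, or the last exit's prediction otherwise."""
    for idx, (logits, tau) in enumerate(zip(exit_logits, conf_thresholds)):
        label, conf = softmax_confidence(logits)
        # Trust an exit only if it is confident AND its energy looks normal.
        if conf >= tau and energy_score(logits) <= energy_threshold:
            return idx, label
    last = len(exit_logits) - 1
    return last, softmax_confidence(exit_logits[last])[0]


# Example with made-up logits for a two-exit network and three classes.
logits_per_exit = [[0.1, 0.2, 0.0], [8.0, 0.5, 0.2]]
chosen_exit, predicted_class = multi_exit_predict(
    logits_per_exit, conf_thresholds=[0.9, 0.9], energy_threshold=-5.0
)
print(chosen_exit, predicted_class)
```

In this sketch the first exit is rejected because its output is nearly uniform, so the prediction comes from the second exit. In practice the thresholds would have to be calibrated on fault-free data for each exit, and the values used here are placeholders.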

* The first two authors contributed equally.

In The Design Automation Conference 2024.