Failure modes and failures mitigation in GPGPUs: a reference model and its application


Francesco Terrosi, Andrea Ceccarelli and Andrea Bondavalli

Institution(s)

University of Florence

Presentation type

Technical presentation

Abstract

General Purpose GPUs (GPGPUs) are highly susceptible to both transient and permanent faults. This is a serious concern for their safe and reliable use in many domains, from autonomous driving to High Performance Computing. The research and industrial communities have responded vigorously to this issue by analyzing the impact of failures and devising failure mitigation strategies. This has led to the definition of several failure modes and mitigation approaches. Unfortunately, these often rest on different foundations and are not easy to position in a consistent view. This work elaborates a GPGPU failure model, identifies its relations to GPGPU failure modes and components, and then analyzes the mitigations proposed in the literature. By proposing a unified view of failures and mitigations, the resulting model i) positions each research work on the subject, ii) makes the current gaps easy to identify, and iii) acts as a reference for further research on GPGPU failures.


Additional material

  • Presentation slides: [pdf]