Presentation title
Failure modes and failures mitigation in GPGPUs: a reference model and its application
Authors
Francesco Terrosi, Andrea Ceccarelli and Andrea Bondavalli
Institution(s)
University of Florence
Presentation type
Technical presentation
Abstract
General Purpose GPUs (GPGPUs) are highly susceptible to both transient and permanent faults. This is a serious concern for their safe and reliable use in many domains, from autonomous driving to High Performance Computing. The research and industrial communities have responded vigorously to this issue by analyzing the impact of failures and devising failure mitigation strategies. This has led to the definition of several failure modes and mitigation approaches. Unfortunately, these often rest on different foundations and are not easy to position within a consistent view. This work elaborates a model of GPGPU failures, identifies how failures relate to GPGPU failure modes and components, and then analyzes the mitigations proposed in the literature. By proposing a unified view of failures and mitigations, the resulting model i) positions each research contribution on the subject, ii) makes current gaps easy to identify, and iii) acts as a reference for further research on GPGPU failures.
Additional material
- Presentation slides: [pdf]