Featherweight Soft Error Resilience for GPUs 2022


Zhang Y., Jung C.

55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Illinois, United States Of America, 1 - 05 October 2022, pp.245-262

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1109/micro56248.2022.00030
  • City: Illinois
  • Country: United States Of America
  • Page Numbers: pp.245-262
  • Istanbul Technical University Affiliated: No

Abstract

This paper presents Flame, a hardware/software co-designed resilience scheme for protecting GPUs against soft errors. For low-cost yet high-performance resilience, Flame uses acoustic sensors for error detection and idempotent processing for recovery. That is, Flame seeks to correct any sensor-detected error by re-executing the idempotent region in which it occurred. To achieve this, each idempotent region must be confirmed error-free before execution moves on to the next region. This so-called soft error verification takes the sensors' worst-case detection latency (WCDL) to verify each finished region. Rather than waiting for the WCDL at each region end, which incurs too much performance overhead, Flame proposes WCDL-aware warp scheduling that hides the error verification delay (i.e., the WCDL) behind the GPU's inherent massive warp-level parallelism. When a warp hits an idempotent region boundary, Flame deschedules the warp and switches to one of the other ready warps, as if the region boundary were a regular long-latency operation that triggers warp switching. By leveraging the GPU's inherent latency-hiding ability, Flame can completely eliminate the verification delay without significant hardware modification. The experimental results demonstrate that the performance overhead of Flame is near zero, i.e., 0.6% on average across 34 GPU benchmark applications.
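
The scheduling idea in the abstract can be illustrated with a toy cycle-count model. This is a minimal sketch under strong assumptions, not Flame's implementation: the function name `simulate`, its parameters, the single-issue core, the fixed region length, and the "most remaining regions first" selection policy are all hypothetical choices made for illustration, and the numbers in the usage note below are not taken from the paper.

```python
def simulate(num_warps, regions_per_warp, region_cycles, wcdl, hide_latency):
    """Return the cycle at which every region of every warp is verified.

    Toy model (assumption): one warp issues per cycle, each idempotent
    region takes `region_cycles` issue cycles, and a finished region needs
    `wcdl` further cycles before it counts as verified.
    """
    remaining = [regions_per_warp] * num_warps  # regions left per warp
    progress = [0] * num_warps                  # cycles done in current region
    ready_at = [0] * num_warps                  # earliest cycle a warp may issue
    cycle = 0
    while any(r > 0 for r in remaining):
        candidates = [w for w in range(num_warps)
                      if remaining[w] > 0 and ready_at[w] <= cycle]
        if not candidates:
            # Every unfinished warp is still being verified: the core idles.
            cycle = min(ready_at[w] for w in range(num_warps) if remaining[w] > 0)
            continue
        # Hypothetical policy: favor the warp with the most work left.
        w = max(candidates, key=lambda i: remaining[i])
        progress[w] += 1
        cycle += 1
        if progress[w] == region_cycles:        # region boundary reached
            progress[w] = 0
            remaining[w] -= 1
            if hide_latency:
                # Deschedule only this warp; other ready warps keep issuing.
                ready_at[w] = cycle + wcdl
            else:
                # Naive verification: the whole core waits out the WCDL.
                cycle += wcdl
    # The last region of each warp must also finish verification.
    return max(cycle, max(ready_at))
```

In this toy model, `simulate(4, 3, 10, 20, hide_latency=False)` gives 360 cycles, while the same call with `hide_latency=True` gives 140 cycles, i.e. the 120 cycles of useful work plus a single trailing verification window. With only one warp the two modes take the same time, which reflects the abstract's point that the delay is hidden by warp-level parallelism rather than removed by the hardware.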