SC12 SCHEDULE: NOV 10-16, 2012

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

SESSION: Fault Detection and Analysis

EVENT TYPE: Papers

TIME: 11:30AM - 12:00PM

SESSION CHAIR: Pedro C. Diniz

AUTHOR(S): David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, Ron Brightwell

ROOM: 255-BC

ABSTRACT:
Faults have become the norm rather than the exception for high-end computing clusters. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that allow applications to compute incorrect results. This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications, and investigates the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI messages between replicas, we study the best-suited protocols for detecting and correcting corrupted MPI messages. Using our fault injector, we observe that even a single error can have profound effects on applications by causing a cascading pattern of corruption, which in most cases spreads to all other processes. Results indicate that our consistency protocols can successfully protect applications experiencing even high rates of silent data corruption.
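The replica-comparison model in the abstract can be illustrated with a small sketch. This is not the authors' protocol or any MPI API; it is a hypothetical, in-memory illustration assuming each replica of a sender hands over the payload it would have sent, and that the receiver compares hashes. The names `digest` and `vote` are invented for this example. With two replicas a mismatch can only be detected; with three or more, a strict majority can also supply the corrected payload.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence, Tuple


def digest(payload: bytes) -> str:
    """Hash a message payload so replicas can be compared cheaply."""
    return hashlib.sha256(payload).hexdigest()


def vote(replica_msgs: Sequence[bytes]) -> Tuple[bool, Optional[bytes]]:
    """Compare the payloads produced by redundant copies of one sender.

    Returns (mismatch_detected, payload), where payload is the version
    agreed on by a strict majority of replicas, or None when no strict
    majority exists (detection without correction).
    Hypothetical helper for illustration only.
    """
    if not replica_msgs:
        raise ValueError("need at least one replica message")

    counts = Counter(digest(m) for m in replica_msgs)
    mismatch = len(counts) > 1  # any disagreement signals corruption

    winner, votes = counts.most_common(1)[0]
    if votes * 2 > len(replica_msgs):  # strict majority required
        for m in replica_msgs:
            if digest(m) == winner:
                return mismatch, m
    return mismatch, None


if __name__ == "__main__":
    good = b"\x01\x02\x03"
    bad = b"\x01\x02\x07"  # simulated corruption in one replica
    print(vote([good, bad, good]))  # triple redundancy: detect and correct
    print(vote([good, bad]))        # dual redundancy: detect only
```

Under these assumptions, dual redundancy flags the disagreement but cannot tell which copy is correct, while triple redundancy lets the receiver discard the outlier.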

Chair/Author Details:

Pedro C. Diniz (Chair) - University of Southern California

David Fiala - North Carolina State University

Frank Mueller - North Carolina State University

Christian Engelmann - Oak Ridge National Laboratory

Rolf Riesen - IBM Ireland

Kurt Ferreira - Sandia National Laboratories

Ron Brightwell - Sandia National Laboratories
