SC12 Presentation

SCHEDULE: NOV 10-16, 2012


Fault Prediction Under the Microscope - A Closer Look Into HPC Systems

SESSION: Fault Detection and Analysis

EVENT TYPE: Papers

TIME: 11:00AM - 11:30AM

SESSION CHAIR: Pedro C. Diniz

AUTHOR(S): Ana Gainaru, Franck Cappello, William Kramer, Marc Snir

ROOM: 255-BC

ABSTRACT:
A large percentage of computing capacity in today's large high-performance computing systems is wasted due to failures. As a consequence, current research focuses on fault tolerance strategies that aim to minimize the effects of faults on applications. A complement to this approach is failure avoidance, where the occurrence of a fault is predicted and preventive measures are taken. For this, monitoring systems require a reliable prediction system to give information on what will be generated and at what location. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA toolkit with an adaptive and overall more efficient prediction module. To this end, a large part of the paper is devoted to a detailed analysis of the prediction method, applied to two large-scale systems. Furthermore, we analyze the prediction's impact on current checkpointing strategies and highlight future improvements and directions.

Chair/Author Details:

Pedro C. Diniz (Chair) - University of Southern California

Ana Gainaru - University of Illinois at Urbana-Champaign

Franck Cappello - INRIA

William Kramer - National Center for Supercomputing Applications

Marc Snir - University of Illinois at Urbana-Champaign

