BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20121115T180000Z
DTEND:20121115T183000Z
LOCATION:255-BC
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: A large percentage of computing capacity in today's large high-performance computing systems is wasted due to failures. As a consequence, current research focuses on fault tolerance strategies that aim to minimize the effects of faults on applications. A complement to this approach is failure avoidance, where the occurrence of a fault is predicted and preventive measures are taken. For this, monitoring systems require a reliable prediction system to give information on what will be generated and at what location. In this paper, we merge signal analysis concepts with data mining techniques to extend the ELSA toolkit to offer an adaptive and overall more efficient prediction module. To this end, a large part of the paper is focused on a detailed analysis of the prediction method, by applying it to two large-scale systems. Furthermore, we analyze the prediction's impact on current checkpointing strategies and highlight future improvements and directions.
SUMMARY:Fault Prediction Under the Microscope - A Closer Look Into HPC Systems
PRIORITY:3
END:VEVENT
END:VCALENDAR