BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20121113T213000Z
DTEND:20121113T220000Z
LOCATION:255-EF
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: As the capability and component count of systems increase, the mean time between failures (MTBF) correspondingly decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach suffers from high overhead due to contention for PFS resources. A promising solution to this problem is multi-level checkpointing. However, while multi-level checkpointing is successful on today’s machines, it is not expected to be sufficient for exascale-class machines, where total memory sizes and failure rates are predicted to be orders of magnitude higher. Our solution is a system that combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and a model describing its performance. Our experiments show that our system can improve efficiency by a factor of 1.1 to 2.0 on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
SUMMARY:Design and Modeling of a Non-Blocking Checkpointing System
PRIORITY:3
END:VEVENT
END:VCALENDAR
