BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20121113T203000Z
DTEND:20121113T210000Z
LOCATION:255-EF
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: High-performance computing (HPC) systems use checkpoint and restart to tolerate failures. Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for overloaded PFS resources. These overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure.=0A=0ATo alleviate these problems, we demonstrate a scalable checkpoint-restart system, mcrEngine. mcrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Experimental results with large-scale application checkpoints show that mcrEngine reduces checkpointing overhead by up to 87% and recovery overhead by up to 62% over a baseline with no aggregation or compression.
SUMMARY:McrEngine - A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
PRIORITY:3
END:VEVENT
END:VCALENDAR