SC12 Presentation

SCHEDULE: NOV 10-16, 2012


McrEngine - A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

SESSION: Checkpointing

EVENT TYPE: Papers, Best Student Paper Finalists

TIME: 1:30PM - 2:00PM

SESSION CHAIR: Frank Mueller

AUTHOR(S): Tanzima Z. Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann

ROOM: 255-EF

ABSTRACT:
High-performance computing (HPC) systems use checkpoint and restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for overloaded PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure. To alleviate these problems, we demonstrate a scalable checkpoint-restart system, mcrEngine. mcrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries (e.g., HDF5 and netCDF), and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Experimental results with large-scale application checkpoints show that mcrEngine reduces checkpointing overhead by up to 87% and recovery overhead by up to 62% over a baseline with no aggregation or compression.
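ILLUSTRATIVE SKETCH:
The abstract contrasts "simple concatenation and compression" with aggregation that uses knowledge of the data semantics. The following is a minimal Python sketch of that contrast, not mcrEngine's implementation. It assumes each process's checkpoint can be modeled as a dictionary mapping variable names to raw bytes, that every process stores the same variables with the same sizes, and that zlib stands in for the compressor. The variable names (temperature, cell_type) and all function names are hypothetical; a real system would read variable metadata through HDF5 or netCDF.

```python
import random
import struct
import zlib


def make_checkpoint(rank, n=2000):
    """Build a toy checkpoint for one process: variable name -> raw bytes.

    The two variables are hypothetical stand-ins for application state.
    """
    rng = random.Random(rank)
    temperature = [300.0 + rng.random() for _ in range(n)]
    cell_type = [rng.randrange(4) for _ in range(n)]
    return {
        "temperature": struct.pack(f"{n}d", *temperature),
        "cell_type": struct.pack(f"{n}i", *cell_type),
    }


def concat_then_compress(checkpoints):
    """Baseline: append each process's whole checkpoint, then compress once."""
    blob = b"".join(
        b"".join(ckpt[name] for name in sorted(ckpt)) for ckpt in checkpoints
    )
    return zlib.compress(blob)


def data_aware_compress(checkpoints):
    """Group same-named variables across processes, then compress each group."""
    names = sorted(checkpoints[0])
    return {
        name: zlib.compress(b"".join(ckpt[name] for ckpt in checkpoints))
        for name in names
    }


def restore(merged, num_procs):
    """Undo data_aware_compress, assuming equal-sized variables per process."""
    per_rank = [{} for _ in range(num_procs)]
    for name, blob in merged.items():
        raw = zlib.decompress(blob)
        size = len(raw) // num_procs
        for rank in range(num_procs):
            per_rank[rank][name] = raw[rank * size:(rank + 1) * size]
    return per_rank


if __name__ == "__main__":
    ckpts = [make_checkpoint(rank) for rank in range(4)]
    baseline = concat_then_compress(ckpts)
    merged = data_aware_compress(ckpts)
    print("baseline bytes:  ", len(baseline))
    print("data-aware bytes:", sum(len(blob) for blob in merged.values()))
```

Running the script prints the compressed size of each layout so the two can be compared. How much grouping helps depends on the data and the compressor, so the toy numbers say nothing about the figures reported in the abstract.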

Chair/Author Details:

Frank Mueller (Chair) - North Carolina State University

Tanzima Z. Islam - Purdue University

Kathryn Mohror - Lawrence Livermore National Laboratory

Saurabh Bagchi - Purdue University

Adam Moody - Lawrence Livermore National Laboratory

Bronis R. de Supinski - Lawrence Livermore National Laboratory

Rudolf Eigenmann - Purdue University

