SC12 Home > SC12 Schedule > SC12 Presentation - Programming Model Extensions for Resilience in Extreme Scale Computing

SCHEDULE: NOV 10-16, 2012

When viewing the Technical Program schedule, on the far righthand side is a column labeled "PLANNER." Use this planner to build your own schedule. Once you select an event and want to add it to your personal schedule, just click on the calendar icon of your choice (outlook calendar, ical calendar or google calendar) and that event will be stored there. As you select events in this manner, you will have your own schedule to guide you through the week.

Programming Model Extensions for Resilience in Extreme Scale Computing

SESSION: Research Poster Reception

EVENT TYPE: Posters and Electronic Posters

TIME: 5:15PM - 7:00PM

SESSION CHAIR: Torsten Hoefler

AUTHOR(S):Saurabh Hukerikar, Pedro C. Diniz, Robert F. Lucas

ROOM:East Entrance

ABSTRACT:
System resilience is a key challenge to building extreme scale systems. A large number of HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience in which we propose a set of programming model extensions and develop a runtime inference framework that can reason about the context and significance of faults, as they occur, to the application programmer's fault tolerance expectations. We demonstrate using a set accelerated fault injection experiments the validity of our approach with a set of real scientific and engineering codes. Our experiments show that a cross-layer approach that explicitly engages the programmer in expressing fault tolerance knowledge which is then leveraged across the layers of system abstraction can significantly improve the dependability of long running HPC applications.

Chair/Author Details:

Torsten Hoefler (Chair) - ETH Zurich

Saurabh Hukerikar - University of Southern California

Pedro C. Diniz - University of Southern California

Robert F. Lucas - University of Southern California

Add to iCal  Click here to download .ics calendar file

Add to Outlook  Click here to download .vcs calendar file

Add to Google Calendarss  Click here to add event to your Google Calendar

Programming Model Extensions for Resilience in Extreme Scale Computing

SESSION: Research Poster Reception

EVENT TYPE:

TIME: 5:15PM - 7:00PM

SESSION CHAIR: Torsten Hoefler

AUTHOR(S):Saurabh Hukerikar, Pedro C. Diniz, Robert F. Lucas

ROOM:East Entrance

ABSTRACT:
System resilience is a key challenge to building extreme scale systems. A large number of HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience in which we propose a set of programming model extensions and develop a runtime inference framework that can reason about the context and significance of faults, as they occur, to the application programmer's fault tolerance expectations. We demonstrate using a set accelerated fault injection experiments the validity of our approach with a set of real scientific and engineering codes. Our experiments show that a cross-layer approach that explicitly engages the programmer in expressing fault tolerance knowledge which is then leveraged across the layers of system abstraction can significantly improve the dependability of long running HPC applications.

Chair/Author Details:

Torsten Hoefler (Chair) - ETH Zurich

Saurabh Hukerikar - University of Southern California

Pedro C. Diniz - University of Southern California

Robert F. Lucas - University of Southern California

Add to iCal  Click here to download .ics calendar file

Add to Outlook  Click here to download .vcs calendar file

Add to Google Calendarss  Click here to add event to your Google Calendar