SCHEDULE: NOV 10-16, 2012
An overview of fault-tolerant techniques for HPC
SESSION: An overview of fault-tolerant techniques for HPC
EVENT TYPE: Tutorials
TIME: 1:30PM - 5:00PM
Presenter(s): Thomas Hérault, Yves Robert
ROOM: 355-F
ABSTRACT:
Resilience is a critical issue for large-scale platforms.
This tutorial provides a comprehensive survey of
fault-tolerant techniques for high-performance computing.
It is organized along four main topics:
(i) An overview of failure types (software/hardware, transient/fail-stop)
and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) Application-specific techniques, such as
ABFT (algorithm-based fault tolerance) for grid-based algorithms
or fixed-point convergence for iterative applications;
(iii) General-purpose techniques, which include several checkpoint
and rollback recovery protocols, possibly combined with replication; and
(iv) An evaluation and comparison of relevant execution scenarios
through quantitative models
(from Young's approximation
to Daly's formulas and recent work).
The tutorial is open to all SC12 attendees who are interested
in the current status and expected promise of fault-tolerant approaches
for scientific applications. There are no audience prerequisites:
background will be provided for all protocols and probabilistic models.
Only the last part of the tutorial, devoted to assessing the future
of the methods, will involve more advanced analysis tools.
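As a rough illustration of the kind of quantitative model the abstract mentions, Young's first-order approximation is commonly stated as a checkpoint period of sqrt(2 * C * MTBF), where C is the time to write one checkpoint and MTBF is the platform's mean time between failures. The sketch below is not taken from the tutorial material; the function name and the sample values are assumptions chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Return Young's first-order estimate of the checkpoint period.

    checkpoint_cost: time to write one checkpoint (seconds).
    mtbf: platform mean time between failures (seconds).
    Both names are illustrative, not from the tutorial.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical values: a 50 s checkpoint on a platform with a 10,000 s MTBF.
    period = young_interval(checkpoint_cost=50.0, mtbf=10_000.0)
    print(f"Checkpoint roughly every {period:.0f} s")
```

With these assumed values the estimate is 1000 seconds. The approximation is usually considered reasonable only when the checkpoint cost is small compared with the MTBF; refinements such as Daly's formulas address the cases where it is not.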
Chair/Presenter Details:
Thomas Hérault - University of Tennessee, Knoxville
Yves Robert - ENS Lyon