BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20121111T203000Z
DTEND:20121112T000000Z
LOCATION:355-F
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Resilience is a critical issue for large-scale platforms. =0AThis tutorial provides a comprehensive survey on=0Afault-tolerant techniques for high-performance computing. =0AIt is organized along four main topics:=0A(i) An overview of failure types (software/hardware, transient/fail-stop),=0Aand typical probability distributions (Exponential, Weibull, Log-Normal);=0A(ii) Application-specific techniques, such as =0AABFT for grid-based algorithms =0Aor fixed-point convergence for iterative applications;=0A(iii) General-purpose techniques, which include several checkpoint =0Aand rollback recovery protocols, possibly combined with replication; and=0A(iv) Relevant execution scenarios will be evaluated =0Aand compared through quantitative models =0A(from Young's approximation=0Ato Daly's formulas and recent work).=0A=0AThe tutorial is open to all SC12 attendees who are interested =0Ain the current status and expected promise of fault-tolerant approaches=0Afor scientific applications. There are no audience prerequisites:=0A background will be provided for all protocols and probabilistic models.=0AOnly the last part of the tutorial devoted to assessing the future=0Aof the methods will involve more advanced analysis tools.
SUMMARY:An overview of fault-tolerant techniques for HPC
PRIORITY:3
END:VEVENT
END:VCALENDAR