SCHEDULE: NOV 10-16, 2012
A Study of DRAM Failures in the Field
SESSION: Fault Detection and Analysis
EVENT TYPE: Papers
TIME: 10:30AM - 11:00AM
SESSION CHAIR: Pedro C. Diniz
AUTHOR(S): Vilas Sridharan, Dean Liberty
ROOM: 255-BC
ABSTRACT:
Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM is a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM is warranted. We present a study of 11 months of DRAM errors in a large high-performance computing cluster. Our goal is to understand the failure modes, rates, and fault types experienced by DRAM in production settings.
We draw several conclusions from our study. First, DRAM failure modes appear dominated by permanent faults. Second, DRAMs are susceptible to large multi-bit failures, such as failures that affect an entire DRAM row or column. Third, some DRAM failures can affect shared board-level circuitry, disrupting accesses to other DRAM devices that share the same circuitry. Finally, we find that chipkill functionality is extremely effective, reducing the node failure rate from DRAM errors by over 36x.
Chair/Author Details:
Pedro C. Diniz (Chair) - University of Southern California
Vilas Sridharan - AMD
Dean Liberty - AMD