SC12 SCHEDULE: NOV 10-16, 2012

A Study of DRAM Failures in the Field

SESSION: Fault Detection and Analysis

EVENT TYPE: Papers

TIME: 10:30AM - 11:00AM

SESSION CHAIR: Pedro C. Diniz

AUTHOR(S): Vilas Sridharan, Dean Liberty

ROOM: 255-BC

ABSTRACT:
Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM is a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM is warranted. We present a study of 11 months of DRAM errors in a large high-performance computing cluster. Our goal is to understand the failure modes, rates, and fault types experienced by DRAM in production settings. We draw several conclusions from our study. First, DRAM failure modes appear dominated by permanent faults. Second, DRAMs are susceptible to large multi-bit failures, such as failures that affect an entire DRAM row or column. Third, some DRAM failures can affect shared board-level circuitry, disrupting accesses to other DRAM devices that share the same circuitry. Finally, we find that chipkill functionality is extremely effective, reducing the node failure rate from DRAM errors by over 36x.

Chair/Author Details:

Pedro C. Diniz (Chair) - University of Southern California

Vilas Sridharan - AMD

Dean Liberty - AMD

