BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20121115T173000Z
DTEND:20121115T180000Z
LOCATION:255-BC
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM is a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM is warranted. We present a study of 11 months of DRAM errors in a large high-performance computing cluster. Our goal is to understand the failure modes, rates, and fault types experienced by DRAM in production settings.=0A=0AWe draw several conclusions from our study. First, DRAM failure modes appear dominated by permanent faults. Second, DRAMs are susceptible to large multi-bit failures, such as failures that affect an entire DRAM row or column. Third, some DRAM failures can affect shared board-level circuitry, disrupting accesses to other DRAM devices that share the same circuitry. Finally, we find that chipkill functionality is extremely effective, reducing the node failure rate from DRAM errors by over 36x.
SUMMARY:A Study of DRAM Failures in the Field
PRIORITY:3
END:VEVENT
END:VCALENDAR