Memory scrubbing
Memory scrubbing consists of reading from each computer memory location, correcting bit errors (if any) with an error-correcting code (ECC), and writing the corrected data back to the same location.[1]
Due to the high integration density of modern computer memory chips, individual memory cell structures have become small enough to be vulnerable to cosmic rays and alpha particle emission. The errors caused by these phenomena are called soft errors. Over 8% of dual in-line memory modules (DIMMs) experience at least one correctable error per year.[2] Soft errors can affect both DRAM- and SRAM-based memories. The probability of a soft error at any individual memory bit is very small. However, given the large amount of memory that modern computers, especially servers, are equipped with, and their extended periods of uptime, the probability of a soft error somewhere in the installed memory is significant.[citation needed]
The information in an ECC memory is stored with enough redundancy to correct a single-bit error per memory word. Hence, an ECC memory can support scrubbing of the memory content: if the memory controller scans systematically through the memory, single-bit errors can be detected, the erroneous bit can be located using the ECC check bits, and the corrected data can be written back to the memory.
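The detect, locate, and correct steps described above can be sketched with an extended Hamming code. This is a minimal illustration, not the layout any particular memory controller uses: the bit ordering and the function names `check_bits_needed`, `encode`, and `decode` are assumptions made for the example. For 64 data bits the Hamming condition 2^r >= 64 + r + 1 gives r = 7 check bits, and one extra overall parity bit brings the word to 72 bits.

```python
def check_bits_needed(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(data: int, data_bits: int = 64) -> list:
    """Return an extended-Hamming codeword as a list of bits.

    Index 0 is the overall parity bit, power-of-two indices are Hamming
    check bits, and all other indices carry data bits (illustrative layout).
    """
    n = data_bits + check_bits_needed(data_bits)
    word = [0] * (n + 1)
    shift = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            word[pos] = (data >> shift) & 1
            shift += 1
    syndrome = 0
    for pos in range(1, n + 1):
        if word[pos]:
            syndrome ^= pos
    p = 1
    while p <= n:                        # set check bits so the syndrome is 0
        if syndrome & p:
            word[p] = 1
        p <<= 1
    word[0] = sum(word) % 2              # make the overall parity even
    return word


def decode(word: list):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2
    if overall == 1:                     # odd number of flips: assume one
        if syndrome >= len(word):
            return "uncorrectable", None
        word[syndrome] ^= 1              # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome != 0:                  # even number of flips, nonzero syndrome
        return "uncorrectable", None
    else:
        status = "ok"
    data, shift = 0, 0
    for pos in range(1, len(word)):
        if pos & (pos - 1):
            data |= word[pos] << shift
            shift += 1
    return status, data


codeword = encode(0xDEADBEEFCAFEF00D)
damaged = list(codeword)
damaged[17] ^= 1                         # simulate a single soft error
print(len(codeword), decode(damaged)[0])
```

A scrubber would write the re-encoded corrected word back to the same address whenever `decode` reports `corrected`, so that a later second flip in the same word is again a single-bit error.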
Overview
It is important to check each memory location periodically, and frequently enough that multiple bit errors within the same word remain unlikely: single-bit errors can be corrected, but multiple-bit errors are not correctable with the usual (as of 2008) ECC memory modules.
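How quickly the risk grows with the scrub interval can be estimated with a simple model. The sketch below assumes independent upsets arriving as a Poisson process at a constant rate per bit, ignores the rare case of two upsets cancelling on the same bit, and uses a purely hypothetical rate; the function name `p_multi_bit` is illustrative. Under these assumptions the chance that a word collects two or more upsets between scrubs grows roughly with the square of the interval.

```python
import math


def p_multi_bit(rate_per_bit_hour: float, interval_hours: float,
                word_bits: int = 72) -> float:
    """Probability that one word collects >= 2 upsets within one scrub interval.

    Evaluates exp(-mu) * sum_{k>=2} mu**k / k! term by term, which stays
    accurate for the very small mu that realistic rates produce.
    """
    mu = rate_per_bit_hour * interval_hours * word_bits   # expected upsets per word
    term = mu * mu / 2.0                                   # k = 2 term
    total = 0.0
    for k in range(3, 103):
        total += term
        term *= mu / k
    return math.exp(-mu) * total


HYPOTHETICAL_RATE = 1e-12   # upsets per bit per hour, chosen only for illustration
for hours in (1, 24, 24 * 30):
    print(hours, p_multi_bit(HYPOTHETICAL_RATE, hours))
```

Multiplying the result by the number of words in the system gives the expected number of words that would hold an uncorrectable error at the end of an interval under the same assumptions.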
In order not to disturb regular memory requests from the CPU and thus prevent decreasing performance, scrubbing is usually only done during idle periods. As the scrubbing consists of normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program.
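The cost of a given scrub period can be estimated with simple arithmetic. The following sketch assumes one full pass per period spread evenly over time; the memory size and the peak bandwidth figure are hypothetical values chosen for illustration, and the function names are not taken from any tool.

```python
def scrub_traffic(memory_bytes: float, period_hours: float,
                  write_back_every_word: bool = False) -> float:
    """Average scrub traffic in bytes per second for one full pass per period.

    Every word is read once; writes are counted only if the scrubber
    rewrites every word rather than just the corrected ones.
    """
    reads = memory_bytes / (period_hours * 3600.0)
    return 2.0 * reads if write_back_every_word else reads


def bandwidth_share(memory_bytes: float, period_hours: float,
                    peak_bytes_per_s: float,
                    write_back_every_word: bool = False) -> float:
    """Fraction of the peak memory bandwidth consumed by scrubbing."""
    return scrub_traffic(memory_bytes, period_hours,
                         write_back_every_word) / peak_bytes_per_s


TIB = 2 ** 40
HYPOTHETICAL_PEAK = 200e9   # bytes per second, illustrative only

daily = bandwidth_share(TIB, 24, HYPOTHETICAL_PEAK)
hourly = bandwidth_share(TIB, 1, HYPOTHETICAL_PEAK)
print(f"daily pass: {daily:.4%} of peak, hourly pass: {hourly:.4%} of peak")
```

Shortening the period raises traffic proportionally, which is the trade-off behind making the scrub period configurable.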
The normal memory reads issued by the CPU or DMA devices are checked for ECC errors, but due to data locality reasons they can be confined to a small range of addresses, leaving other memory locations untouched for a very long time. These locations can become vulnerable to more than one soft error, while scrubbing ensures the checking of the whole memory within a guaranteed time.
On some systems, not only the main memory (DRAM-based) can be scrubbed but also the CPU caches (SRAM-based). On most systems the scrubbing rates for both can be set independently. Because cache is much smaller than the main memory, the scrubbing for caches does not need to happen as frequently.
Because memory scrubbing increases reliability, it is classified as a reliability, availability and serviceability (RAS) feature.
Variants
There are usually two variants, known as patrol scrubbing and demand scrubbing. Both perform memory scrubbing and the associated error correction (where possible); the main difference is how they are initiated and executed. Patrol scrubbing runs automatically when the system is idle, while demand scrubbing performs the error correction when the data is actually requested from main memory.[3]
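The difference between the two variants can be illustrated with a toy model. In this sketch the ECC logic is replaced by a stand-in (a hidden reference copy that reveals how many bits have flipped), and the class `ToyEccMemory` and its methods are hypothetical names used only for the illustration. Demand scrubbing repairs only the addresses that happen to be read, while a patrol pass visits every address.

```python
class ToyEccMemory:
    """Toy memory; a reference copy stands in for real ECC check bits."""

    def __init__(self, words):
        self._reference = list(words)   # stand-in for ECC redundancy
        self.cells = list(words)        # what is "physically" stored

    def flip(self, addr, bit):
        """Simulate a soft error by flipping one stored bit."""
        self.cells[addr] ^= 1 << bit

    def _check(self, addr, write_back):
        flipped = bin(self.cells[addr] ^ self._reference[addr]).count("1")
        if flipped > 1:
            raise RuntimeError(f"uncorrectable error at address {addr}")
        if flipped == 1 and write_back:
            self.cells[addr] = self._reference[addr]   # write corrected data back
        return self._reference[addr]                   # corrected value is returned

    def read(self, addr, demand_scrub=True):
        """Normal read; with demand scrubbing the fix is also written back."""
        return self._check(addr, write_back=demand_scrub)

    def patrol_scrub(self):
        """Background sweep over every address, correcting as it goes."""
        for addr in range(len(self.cells)):
            self._check(addr, write_back=True)

    def latent_errors(self):
        """Addresses whose stored contents currently differ from the reference."""
        return [a for a, (c, r) in enumerate(zip(self.cells, self._reference))
                if c != r]


mem = ToyEccMemory([0xA5] * 8)
mem.flip(1, 0)                 # error in a frequently read location
mem.flip(6, 3)                 # error in a rarely read location
for addr in (0, 1, 2):         # the workload only touches a few addresses
    mem.read(addr)
after_demand = mem.latent_errors()
mem.patrol_scrub()
after_patrol = mem.latent_errors()
print(after_demand, after_patrol)
```

In the toy run the rarely read address keeps its error until the patrol pass reaches it, which is the situation where a second flip in the same word would become uncorrectable.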
See also
- Data scrubbing, a general category containing memory scrubbing
- Soft error, an important reason for doing memory scrubbing
- Error detection and correction, a general theory used for memory scrubbing
- Memory refresh, which preserves information stored in memory
References
- Ronald K. Burek. "The NEAR Solid-State Data Recorders". Johns Hopkins APL Technical Digest. 1998.
- "DRAM Errors in the Wild: A Large-Scale Field Study".
- "Supermicro X9SRA motherboard manual" (PDF). Supermicro. March 5, 2014. pp. 4–10. Retrieved February 22, 2015.
Memory scrubbing
Fundamentals
Definition
Memory scrubbing is a process in computing systems equipped with error-correcting code (ECC) memory, in which the system periodically reads data from each location in random-access memory (RAM), detects any bit errors using the associated ECC check bits, corrects single-bit errors on the fly, and writes the corrected data back to the same location to prevent error accumulation.[1] This technique primarily addresses soft errors, which are transient bit flips caused by external factors such as cosmic rays or alpha particles from packaging materials and which do not permanently damage the memory cells. In contrast, hard errors result from physical degradation or defects in the memory hardware and typically require replacement rather than correction.[4][5]
By relying on ECC that provides single-error correction and double-error detection (SECDED), memory scrubbing maintains data integrity in volatile memory, reducing the risk of uncorrectable multi-bit errors that could lead to system crashes or silent data corruption.[6] A common implementation of ECC in server-grade RAM uses a 72-bit word consisting of 64 data bits and 8 check bits, allowing the memory controller to identify and fix a single flipped bit while flagging multi-bit errors for further action.[7]
Unlike data scrubbing, which applies broadly to non-volatile storage systems such as RAID arrays to verify and repair parity inconsistencies across disks, memory scrubbing specifically targets RAM for proactive bit-flip correction.[1] Memory scrubbing operates in two main modes, patrol scrubbing and demand scrubbing.
Historical Background
Memory scrubbing emerged in the late 1970s and 1980s alongside the adoption of error-correcting code (ECC) memory in mainframe computers to address soft errors caused by cosmic rays and other transient phenomena in large-scale systems. Early implementations focused on periodically reading and correcting single-bit errors in RAM to prevent accumulation into uncorrectable multi-bit failures, building on ECC principles that dated back to the 1950s but gained practical use in high-reliability environments like IBM mainframes. By the mid-1980s, research demonstrated the reliability benefits of soft error scrubbing in single-error-protected RAM systems, showing it could significantly improve mean time to failure depending on error rates and scrubbing intervals.[8][9]
In the 1990s, memory scrubbing techniques advanced with the proliferation of server hardware, particularly through Intel's support for ECC in the Pentium Pro processor introduced in 1995, which enabled error correction in high-end workstations and servers to handle growing memory densities. This era saw scrubbing integrated into enterprise systems to mitigate accumulating errors in DRAM.[10]
The 2010s marked expanded applications in specialized domains, such as NASA's space missions, where scrubbing was employed in radiation-hardened systems to counter single-event upsets from cosmic radiation in SDRAM and other memories. Research during this period also advanced scrubbing for field-programmable gate arrays (FPGAs), exemplified by Microchip's PolarFire family announced in 2016, which incorporated ECC and scrubbing to handle configuration memory errors in harsh environments.[11][12]
In the 2020s, developments extended scrubbing to non-volatile memories like NOR flash for boot file preservation, as demonstrated in NASA's 2021 application for the Descent and Landing Computer, where it ensured data integrity against radiation-induced corruption during space operations. These evolutions reflect ongoing adaptations to denser, more vulnerable memory technologies while prioritizing reliability in mission-critical systems.[13]
Types
Patrol Scrubbing
Patrol scrubbing is a proactive memory integrity mechanism that automatically scans the entire memory system, or designated subsections of it, at fixed intervals, reading data from each address, applying error-correcting code (ECC) to detect and fix correctable errors, and rewriting the corrected data without requiring user intervention.[1][14] This background process leverages hardware engines in the memory controller or platform to perform read-modify-write operations across the memory array, mitigating the risk of error accumulation over time.[15]
It operates during low-activity periods to minimize performance overhead, utilizing idle cycles in dynamic random-access memory (DRAM) when the system is otherwise unoccupied. In server environments like Dell PowerEdge systems, intervals are configurable through BIOS settings, with the standard mode executing a complete scrub once every 24 hours and an extended mode performing it hourly for heightened reliability.[16]
Patrol scrubbing specifically targets correctable errors (CEs) to prevent their progression to uncorrectable errors that could lead to data loss or system failure. In the Linux Error Detection and Correction (EDAC) framework, it proactively scrubs the full address range in the background, correcting single-bit errors detected via ECC before they compound.[1]
This technique is prevalent in volatile memory such as DRAM for server and high-reliability applications, and its application has expanded to persistent memory since 2020, including Intel Optane (discontinued in 2022), where periodic hardware-driven scrubs ensured ongoing data consistency across non-volatile storage.[14][17][18] As a complementary approach to demand scrubbing, patrol scrubbing provides scheduled, system-wide maintenance during idle times rather than reactive corrections.
Extensions include application-aware variants, as in US patent 12332740B2, which prioritize patrol scrubbing of regions associated with critical tasks based on application or tenant needs.[19][1]
Demand Scrubbing
Demand scrubbing is an error correction technique in ECC-enabled memory systems that activates reactively upon detection of correctable errors during normal data access or in response to explicit software triggers, targeting only the affected memory regions rather than the full address space.[3] This method ensures immediate remediation by performing a read-modify-write operation: upon encountering a single-bit error in a read transaction, the memory controller corrects the data using the ECC and writes the fixed version back to the same location.[20] In systems like Cisco UCS servers, demand scrubbing is enabled in the BIOS to handle such errors transparently during processor-initiated memory reads for data or instructions, preventing error accumulation without interrupting ongoing operations.[20]
Operational details include integration with hardware and software interfaces for precise control. For instance, in HPE servers, demand scrubbing can be toggled via BIOS settings to enable writing corrected data back to memory immediately after a correctable error detection, complementing broader error management strategies.[21] In the Linux kernel, the EDAC (Error Detection and Correction) subsystem supports on-demand scrubbing through a sysfs interface, allowing userspace applications to initiate targeted scrubs on specific memory banks or address ranges, such as by writing to /sys/devices/system/edac/mc/mcX/scrub for a designated device.[1] Additionally, ACPI Address Range Scrubbing (ARS) enables platform firmware to notify the OS of error-prone regions in persistent memory (e.g., NVDIMMs), triggering kernel-initiated scrubs on those exact ranges to clear latent errors before they propagate.[22]
This approach prioritizes error handling in high-risk or recently errored areas, such as after an ECC interrupt signals a correctable fault, thereby focusing resources efficiently. In NVIDIA DRIVE platforms, for example, the safety cluster performs demand scrubbing on DRAM locations flagged by correctable error interrupts, ensuring reliability in safety-critical environments without full-memory scans. Demand scrubbing was developed as a targeted complement to proactive patrol methods, providing responsive correction for event-driven scenarios.[23]
Implementation
Hardware Components
ECC-enabled memory modules form the foundational hardware for memory scrubbing, providing the necessary redundancy to detect and correct errors. These modules, such as Dual In-line Memory Modules (DIMMs), typically incorporate an 8-bit error-correcting code (ECC) for every 64 bits of data, enabling single-error correction and double-error detection through Hamming-based algorithms.[24][25] ECC DIMMs feature an additional memory chip per side compared to non-ECC variants, dedicating space for check bits that support scrubbing operations. This structure ensures that soft errors, like single event upsets (SEUs), can be identified and repaired without data loss.
Memory controllers, often integrated into processors or standalone chipsets, execute the core scrubbing mechanism via read-modify-write (RMW) cycles. In Intel Xeon-based systems, these controllers perform patrol scrubbing by systematically reading memory locations, applying ECC correction if errors are found, and rewriting corrected data to prevent error accumulation.[15][14] Similarly, AMD EPYC processors include integrated memory controllers that support proactive error scrubbing through dedicated hardware paths.[26] Upon detecting a correctable error, the controller automatically initiates scrubbing to maintain data integrity without host intervention.[27]
Architectural integration of scrubbers occurs at the chipset level, where dedicated engines handle background operations. For example, patrol scrub engines in AMD EPYC and Intel Xeon platforms scan memory during low-utilization periods, offloading the process from the CPU to minimize performance impact.[15] In radiation-hardened environments, such as space applications, Field-Programmable Gate Arrays (FPGAs) embed SEU correction logic directly into their configuration memory, using techniques like triple modular redundancy (TMR) to mitigate radiation-induced bit flips.[28][29] These FPGAs provide built-in scrubbing capabilities that continuously monitor and restore affected bits, enhancing reliability in high-radiation settings.[30]
Hardware offload extends scrubbing to storage devices, reducing system-level overhead. KIOXIA's RAID Offload technology, developed in the 2020s for enterprise NVMe SSDs, enables data scrubbing directly within the drive, verifying and correcting errors using onboard parity without taxing host CPU or memory resources.[31] This approach alleviates bandwidth pressure on the host by performing inspections where the data resides, supporting efficient rebuilds in degraded RAID configurations.[32] In non-volatile contexts, NOR flash memory interfaces via Serial Peripheral Interface (SPI) facilitate scrubbing in critical applications, such as NASA's boot file preservation systems, where read, write, and erase cycles detect and repair radiation errors.[33]
Performance considerations for scrubbing hardware focus on efficient resource use. Background operations typically consume less than 0.1% of memory bandwidth for standard daily scrubbing intervals on large memory systems, ensuring minimal interference with active workloads while maintaining error rates below thresholds for reliable operation.[34] These engines are configurable through firmware, allowing adjustments to scrub rates based on system demands and environmental factors.[35]
Software Support
Software support for memory scrubbing encompasses operating system drivers, firmware interfaces, and management tools that facilitate the control, scheduling, and monitoring of scrubbing operations. In the Linux kernel, the Error Detection and Correction (EDAC) subsystem provides core support for memory scrubbing through dedicated drivers that interact with hardware controllers. The generic EDAC scrub control, introduced in kernel version 6.15, offers a standardized sysfs interface under /sys/bus/edac/devices/<dev-name>/scrubX/ for managing scrubbers, allowing userspace applications to enable or disable patrol (background) and demand (on-demand) scrubbing modes. This abstraction supports various hardware backends, such as CXL and ACPI RAS2, enabling configurable scrub rates, often expressed in hours for full memory passes, to balance reliability and performance.[1]
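A read-only sketch of how a userspace tool might inspect such a scrub directory is shown below. The attribute file names in `SCRUB_ATTRS` are assumptions based on the kernel's ABI documentation and should be verified against the running kernel; the helper names are hypothetical. The sketch only reads files and changes no settings, and on a system without the interface it simply reports nothing.

```python
from pathlib import Path

# Assumed attribute names; check the kernel ABI documentation for your version.
SCRUB_ATTRS = (
    "enable_background",
    "current_cycle_duration",
    "min_cycle_duration",
    "max_cycle_duration",
)


def find_scrubbers(root="/sys/bus/edac/devices"):
    """Return scrub directories matching <dev-name>/scrubX under root."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p for p in base.glob("*/scrub*") if p.is_dir())


def read_scrub_settings(scrub_dir):
    """Return {attribute: text} for the attribute files that are readable."""
    settings = {}
    for name in SCRUB_ATTRS:
        try:
            settings[name] = (Path(scrub_dir) / name).read_text().strip()
        except OSError:
            continue        # attribute missing or not readable on this system
    return settings


for scrub_dir in find_scrubbers():
    print(scrub_dir, read_scrub_settings(scrub_dir))
```

Keeping the root directory as a parameter makes the helpers easy to exercise against a temporary directory tree, which is also how they can be tested without EDAC hardware.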
Management tools extend this support into firmware and application layers. BIOS and UEFI settings commonly include options to configure memory patrol scrubbing and associated rates, such as enabling/disabling the feature or setting intervals like 24 hours for complete memory coverage, as implemented in platforms from vendors like Dell, HPE, and Lenovo. For real-time systems, application-aware scheduling enhances scrubbing by prioritizing critical tasks; for instance, a patented technique dynamically adjusts patrol scrubbing based on workload sensitivity to minimize latency impacts in embedded environments.[36][37]
ACPI specifications further standardize address range scrubbing (ARS), defined in the ACPI 6.4 standard as a process to inspect specified memory regions for correctable or uncorrectable errors, with results reported to the operating system for proactive management. In Windows Server environments, integration occurs via Windows Management Instrumentation (WMI) for ECC monitoring, leveraging classes like Win32_PhysicalMemory to detect error correction capabilities and querying Windows Hardware Error Architecture (WHEA) logs for corrected memory events, though direct scrub control remains hardware-dependent.
Error logging mechanisms ensure visibility into scrubbing outcomes. In Linux, EDAC drivers hook into the kernel's logging system, reporting scrubbed errors—such as corrected single-bit flips—to syslog or dmesg for analysis and alerting. For Arm-based systems, a dedicated memory scrubbing algorithm optimizes SRAM scrubbing by sequentially reading, verifying with ECC, and rewriting affected locations, with errors logged via platform-specific hooks to facilitate reliability tracking in resource-constrained devices.[38][39]