Application checkpointing
from Wikipedia

Checkpointing is a technique that provides fault tolerance for computing systems. It involves saving a snapshot of an application's state, so that it can restart from that point in case of failure. This is particularly important for long-running applications that are executed in failure-prone computing systems.

Checkpointing in distributed systems


In the distributed computing environment, checkpointing is a technique that helps tolerate failures that would otherwise force a long-running application to restart from the beginning. The most basic way to implement checkpointing is to stop the application, copy all the required data from the memory to reliable storage (e.g., parallel file system), then continue with execution.[1] In the case of failure, when the application restarts, it does not need to start from scratch. Rather, it will read the latest state ("the checkpoint") from the stable storage and execute from that point. While there is ongoing debate on whether checkpointing is the dominant I/O workload on distributed computing systems, the general consensus is that checkpointing is one of the major I/O workloads.[2][3]

There are two main approaches for checkpointing in the distributed computing systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of two-phase commit protocol algorithm. In the uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need for establishing a consistent state (i.e., no missing messages or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called domino effect).[4][5]

Implementations for applications


Save State


One of the original and now most common means of application checkpointing was a "save state" feature in interactive applications, in which the user could save the state of all variables and other data and either continue working or exit the application, restarting it later and restoring the saved state. This was implemented through a "save" command or menu option in the application. It also became standard practice to ask users with unsaved work, when exiting an application, whether they wanted to save it before doing so.

This functionality became extremely important for usability in applications in which a particular task could not be completed in one sitting (such as playing a video game expected to take dozens of hours) or in which the work was done over a long period of time (such as entering rows of data into a spreadsheet).

The problem with save state is that it requires the operator of a program to request the save. For non-interactive programs, including automated or batch-processed workloads, checkpointing itself also had to be automated.

Checkpoint/Restart


As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several different files, the need for the application to be restartable at some point without the need to rerun the entire job from scratch became imperative. Thus the "checkpoint/restart" capability was born, in which after a number of transactions had been processed, a "snapshot" or "checkpoint" of the state of the application could be taken. If the application failed before the next checkpoint, it could be restarted by giving it the checkpoint information and the last place in the transaction file where a transaction had successfully completed. The application could then restart at that point.

Checkpointing tends to be expensive, so it was generally not done with every record, but at an interval that struck a reasonable compromise between the cost of a checkpoint and the value of the computer time needed to reprocess a batch of records. Thus the number of records processed for each checkpoint might range from 25 to 200, depending on cost factors, the relative complexity of the application, and the resources needed to successfully restart the application.
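The record-counting scheme described above can be illustrated with a short sketch. The following C program is a hypothetical illustration rather than code from any particular batch system: the file names, the checkpoint interval, and the "running total" state are all assumptions made for the example. It processes fixed-size records from an input file, writes a checkpoint (the running total and the index of the last completed record) every CHECKPOINT_INTERVAL records, and on startup reads any existing checkpoint so it can resume where it left off.

```c
#include <stdio.h>
#include <stdlib.h>

#define CHECKPOINT_INTERVAL 100        /* records processed between checkpoints */
#define CKPT_FILE "batch.ckpt"

/* Application state worth preserving: last completed record and a running total. */
struct checkpoint {
    long next_record;   /* index of the first record not yet processed */
    double total;       /* accumulated result so far */
};

/* Write the state to a temp file, then rename, so a crash mid-write
 * cannot corrupt the previous valid checkpoint. */
static void save_checkpoint(const struct checkpoint *st) {
    FILE *f = fopen(CKPT_FILE ".tmp", "wb");
    if (!f) { perror("checkpoint"); return; }
    fwrite(st, sizeof *st, 1, f);
    fclose(f);
    rename(CKPT_FILE ".tmp", CKPT_FILE);
}

/* Return 1 and fill *st if a previous checkpoint exists, 0 otherwise. */
static int load_checkpoint(struct checkpoint *st) {
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;
    int ok = fread(st, sizeof *st, 1, f) == 1;
    fclose(f);
    return ok;
}

int main(void) {
    struct checkpoint st = { .next_record = 0, .total = 0.0 };
    if (load_checkpoint(&st))
        printf("restarting at record %ld\n", st.next_record);

    FILE *in = fopen("transactions.dat", "rb");
    if (!in) { perror("transactions.dat"); return 1; }
    /* Skip records that were already processed before the failure. */
    fseek(in, (long)(st.next_record * sizeof(double)), SEEK_SET);

    double record;
    while (fread(&record, sizeof record, 1, in) == 1) {
        st.total += record;                 /* "process" one transaction */
        st.next_record++;
        if (st.next_record % CHECKPOINT_INTERVAL == 0)
            save_checkpoint(&st);           /* snapshot after a batch of records */
    }
    save_checkpoint(&st);                   /* final checkpoint at end of job */
    printf("total = %f after %ld records\n", st.total, st.next_record);
    fclose(in);
    return 0;
}
```

On restart the program rereads the last checkpoint and seeks past the records already accounted for, so at most one checkpoint interval of work is repeated.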

Fault Tolerance Interface (FTI)


FTI is a library that aims to give computational scientists an easy way to perform checkpoint/restart in a scalable fashion.[6] FTI leverages local storage plus multiple replication and erasure-coding techniques to provide several levels of reliability and performance. FTI provides application-level checkpointing that allows users to select which data needs to be protected, in order to improve efficiency and avoid wasting space, time, and energy. It offers a direct data interface so that users do not need to deal with files and/or directory names. All metadata is managed by FTI transparently for the user. If desired, users can dedicate one process per node to overlap the fault tolerance workload with the scientific computation, so that post-checkpoint tasks are executed asynchronously.
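A minimal sketch of how an MPI code might use this interface is shown below. The call names (FTI_Init, FTI_Protect, FTI_Status, FTI_Recover, FTI_Snapshot, FTI_Finalize, FTI_COMM_WORLD, FTI_INTG, FTI_DBLE) follow FTI's documented API, but the configuration file path, data sizes, and loop structure are illustrative assumptions, not a verified example from the library's distribution.

```c
#include <mpi.h>
#include <fti.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* FTI reads its checkpoint levels, intervals and paths from a config file
       and provides FTI_COMM_WORLD, which excludes any dedicated FT processes. */
    FTI_Init("config.fti", MPI_COMM_WORLD);

    static double field[N];
    int step = 0;

    /* Register only the data that must survive a failure. */
    FTI_Protect(0, &step, 1, FTI_INTG);
    FTI_Protect(1, field, N, FTI_DBLE);

    /* On a restart, FTI_Status() reports that a checkpoint exists and
       FTI_Recover() reloads the protected buffers from it. */
    if (FTI_Status() != 0) {
        FTI_Recover();
    }

    for (; step < 10000; step++) {
        /* ... application work on field[] using FTI_COMM_WORLD ... */

        /* FTI_Snapshot() writes a checkpoint whenever the configured interval
           for one of the reliability levels expires. */
        FTI_Snapshot();
    }

    FTI_Finalize();
    MPI_Finalize();
    return 0;
}
```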

Berkeley Lab Checkpoint/Restart (BLCR)


The Future Technologies Group at Lawrence Berkeley National Laboratory developed a hybrid kernel/user implementation of checkpoint/restart called BLCR. Its goal was to provide a robust, production-quality implementation that checkpoints a wide range of applications without requiring changes to application code.[7] BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. Its work is broken down into four main areas: Checkpoint/Restart for Linux (CR), checkpointable MPI libraries, a resource management interface to checkpoint/restart, and development of process management interfaces.

DMTCP


DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets.[8] It does not modify the user's program or the operating system. Among the applications supported by DMTCP are Open MPI, Python, Perl, and many other programming and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g., no OpenGL or video). Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (ensuring that old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and mmap/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.[9]

Collaborative checkpointing


Some recent protocols perform collaborative checkpointing by storing fragments of the checkpoint in nearby nodes.[10] This is helpful because it avoids the cost of storing to a parallel file system (which often becomes a bottleneck for large-scale systems) and uses storage that is closer.[citation needed] This has found use particularly in large-scale supercomputing clusters. The challenge is to ensure that, when a checkpoint is needed during recovery from a failure, the nearby nodes holding its fragments are available.[citation needed]

Docker


Docker and the underlying technology contain a checkpoint and restore mechanism.[11]

CRIU


CRIU is a user-space checkpoint/restore tool for Linux.[12]

Implementation for embedded and ASIC devices


Mementos


Mementos is a software system that transforms general-purpose tasks into interruptible programs for platforms with frequent interruptions such as power outages. It was designed for batteryless embedded devices such as RFID tags and smart cards, which rely on harvesting energy from ambient background sources. Mementos frequently senses the available energy in the system and decides whether to checkpoint the program because of impending power loss or to continue computation. When a checkpoint is taken, the data is stored in non-volatile memory. When the energy becomes sufficient for reboot, the data is retrieved from non-volatile memory and the program continues from the stored state. Mementos has been implemented on the MSP430 family of microcontrollers.[13]
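The decision loop described above can be sketched schematically in C. This is a hypothetical illustration of the general "sense energy, then checkpoint if power loss looks imminent" pattern, not Mementos' actual instrumentation: the helper read_supply_voltage(), the nv_state region, the magic value, and the threshold are all assumptions standing in for a platform ADC routine and a linker-placed non-volatile memory segment.

```c
#include <stdint.h>
#include <string.h>

#define NV_MAGIC 0xC0DEC0DEu
#define VOLTAGE_THRESHOLD_MV 2200   /* checkpoint when supply drops below this */

struct app_state {
    uint32_t loop_index;
    int32_t  accumulator;
};

/* Pretend this struct lives in non-volatile memory (e.g. FRAM or flash). */
static struct { uint32_t magic; struct app_state state; } nv_state;

static uint16_t read_supply_voltage(void) {
    /* On real hardware this would sample the supply rail via the ADC. */
    return 3000;
}

static void checkpoint(const struct app_state *s) {
    memcpy(&nv_state.state, s, sizeof *s);   /* copy volatile state out */
    nv_state.magic = NV_MAGIC;               /* commit marker written last */
}

static int restore(struct app_state *s) {
    if (nv_state.magic != NV_MAGIC) return 0;
    memcpy(s, &nv_state.state, sizeof *s);
    return 1;
}

int main(void) {
    struct app_state s = {0, 0};
    restore(&s);                              /* resume if a checkpoint exists */

    for (; s.loop_index < 100000; s.loop_index++) {
        s.accumulator += (int32_t)s.loop_index;   /* the long-running computation */

        /* Periodic energy check: save state only when power loss looks imminent,
         * analogous to the trigger points Mementos inserts at compile time. */
        if (read_supply_voltage() < VOLTAGE_THRESHOLD_MV)
            checkpoint(&s);
    }
    return 0;
}
```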

Idetic


Idetic is a set of automatic tools that helps application-specific integrated circuit (ASIC) developers automatically embed checkpoints in their designs. It targets high-level synthesis tools and adds the checkpoints at the register-transfer level (Verilog code). It uses a dynamic programming approach to locate low-overhead points in the design's state machine. Since hardware-level checkpointing involves writing the contents of dependent registers to non-volatile memory, the optimal points are those requiring the minimum number of registers to be stored. Idetic has been deployed and evaluated on an energy-harvesting RFID tag device.[14]

from Grokipedia
Application checkpointing is a fault-tolerance mechanism in computing systems that involves periodically saving a snapshot of a running application's state, such as memory contents, processor registers, and open file descriptors, to persistent storage, allowing the application to be restarted or resumed from that exact point after a failure, interruption, or migration. This technique is essential for ensuring the reliability of long-running computations, particularly in environments prone to hardware faults, power outages, or resource reallocations. In high-performance computing (HPC) and distributed systems, application checkpointing enhances system utilization by enabling jobs to exceed walltime limits through restart capabilities, supporting preemptive scheduling for time-critical tasks, and facilitating maintenance operations without complete job loss. It originated from early research in the 1980s, with foundational algorithms like the Chandy-Lamport method for consistent global states in distributed environments, and has since evolved to address challenges in uniprocessor and parallel processing, with recent advances as of 2025 including transparent checkpointing for GPU-accelerated workloads using tools like CRIUgpu to support AI and machine learning applications. Beyond fault tolerance, it supports additional functionalities such as process migration across nodes, debugging by replaying saved states, and job swapping in memory-constrained systems.

Checkpointing approaches vary by system scope and implementation: in uniprocessor settings, it typically captures process memory (globals, heap, stack) and CPU registers, either at the operating system level or transparently at the user level; in distributed systems, coordinated methods synchronize processes to form consistent global checkpoints, while message-logging protocols (pessimistic or optimistic) record communications to enable recovery. Tools like DMTCP provide user-space checkpointing without kernel modifications, ideal for serial or threaded applications, whereas MPI-agnostic frameworks such as MANA scale to large parallel jobs on modern HPC clusters. However, overheads, often dominated by I/O costs, remain a key challenge: the optimal checkpoint interval is often chosen to be proportional to the square root of the product of the checkpoint cost and the mean time to failure in order to minimize overall overhead, and incremental methods reduce storage and time expenses compared to full checkpoints.

Fundamentals

Definition and Purpose

Application checkpointing is the process of saving the state of a running application or process to persistent storage, allowing it to be reconstructed and resumed from that point after an interruption such as a system failure, process migration, or suspension. This technique is essential for enabling fault tolerance in long-running computations, where unexpected failures could otherwise result in significant lost progress and downtime.

The core purpose of application checkpointing extends beyond mere recovery to support resource optimization and system reliability. It facilitates load balancing by permitting the migration of processes to underutilized nodes, enables rollback to prior states, and can capture state before a system enters low-energy modes. In high-performance computing environments, for instance, checkpointing allows a job interrupted by a hardware crash to restart seamlessly, minimizing recomputation and improving overall system resilience.

Key state components captured during checkpointing include the application's memory (such as heap, stack, and data segments), processor registers, and operating-system-associated elements like file descriptors, open network connections, and pending signals. Unlike data backups, which preserve static files and configurations for archival purposes, checkpointing targets this volatile runtime state to ensure executable continuity. In distributed systems, checkpointing requires coordination across multiple processes to maintain global consistency, though the details of such approaches vary.

Historical Development

Application checkpointing traces its origins to early multiprogramming systems, where it served as a mechanism for job suspension and efficient restart following interruptions or failures. In IBM's OS/360 operating system, released in the mid-1960s and enhanced over the following years, the Advanced Checkpoint/Restart facility allowed programmers to designate points during job execution to save the state of main storage, registers, and data sets, enabling resumption without reprocessing prior steps in multiprogrammed environments like Multiprogramming with a Variable Number of Tasks (MVT). This approach addressed the limitations of rerunning failed jobs from the beginning by supporting recovery from abnormal terminations, marking an early shift toward more resilient long-running computations in batch and multiprogramming contexts.

By the 1980s, checkpointing advanced in fault-tolerant systems, particularly for emerging supercomputers and distributed environments, to mitigate hardware failures in computationally intensive tasks. Research during this decade focused on rollback-recovery techniques, with seminal work on coordinated checkpointing protocols emerging to capture consistent global states across distributed processes. A key milestone was the 1985 Chandy-Lamport snapshot algorithm, which provided a method for creating distributed snapshots without requiring global synchronization, influencing subsequent fault-tolerance designs in distributed systems. These developments were driven by the growing scale of computing systems, where the mean time between failures decreased, necessitating reliable state preservation for uninterrupted operation.

The 1990s and 2000s saw checkpointing proliferate with the rise of high-performance computing (HPC), fueled by the expansion of parallel and distributed computing as documented in the TOP500 list starting in 1993. As supercomputer sizes grew from hundreds to thousands of processors, the need for fault tolerance intensified due to higher failure rates; coordinated and uncoordinated checkpointing protocols became standard to minimize downtime in long-running simulations. Evolution drivers included the transition from batch-oriented to interactive and persistent workloads, alongside the influence of message-passing interfaces like MPI, which integrated checkpoint-restart for scalability in distributed environments.

In the 2010s and 2020s, checkpointing integrated deeply with virtualization and container technologies, responding to the demands of cloud-native computing. Virtual machine checkpointing enabled live migration and rapid recovery in virtualized infrastructures, while container support facilitated lightweight state capture for orchestrated workloads. These advancements addressed exascale challenges, where systems exceeding 10^18 floating-point operations per second face frequent faults, prompting optimized multi-level checkpointing to reduce overhead and storage costs in massive-scale HPC. Such integration also enhances resilience across hybrid environments facing the usual challenges of distributed systems.

Core Principles and Mechanisms

Local Checkpointing Techniques

Local checkpointing techniques enable the capture and restoration of an application's state on a single node, focusing on individual processes or process groups without inter-node coordination. These methods primarily involve saving the process memory image, resource handles, and execution context to allow resumption after interruptions such as failures or migrations. Key approaches include memory dumping, which captures the process's address space, and state capture for handles such as files and signals, often implemented through system calls or kernel interventions. Hybrid methods combine user-space and kernel-space operations to balance transparency and efficiency.

Memory dumping techniques form the core of local checkpointing by serializing the process's virtual memory. One common method uses the fork() system call to create a child process that serves as a snapshot, allowing the parent to continue execution while the child dumps its memory pages to storage; this leverages copy-on-write semantics to minimize overhead during the fork itself. Alternatively, ptrace() enables a tracer process to attach to the target, pause it, and read memory regions directly, facilitating precise control over the dumping process without forking. These approaches ensure that both dirty and clean pages are accounted for, though untouched pages may still incur costs due to page table traversal. For multi-threaded or multi-process applications on a single node, dumping extends to the process forest, recording shared memory segments and process hierarchies to maintain consistency.

Capturing handle states is essential for preserving non-memory resources during local checkpointing. File handles are saved by querying kernel structures for descriptors, positions, and attributes, ensuring open files can be reopened or repositioned upon restart without data loss. Signal states, including pending signals and masks, are recorded to replicate the process's signal-handling context, preventing loss of asynchronous events. These elements are typically extracted via system calls such as getdents() for files or sigprocmask() for signals, integrated into the dumping phase. In hybrid user-kernel approaches, user-space libraries intercept and log these states, while kernel modules provide low-level access for completeness, reducing the need for full kernel modifications.

The checkpointing process begins with pre-checkpoint preparation, where the application is quiesced to stabilize its state, often by pausing execution with signals like SIGSTOP or by using control groups to halt I/O operations and flush buffers. State serialization follows, writing the memory image, data, and CPU registers to disk or other storage in a structured format, such as ELF images for executables. Post-restart validation involves loading the checkpoint, recreating the process hierarchy (e.g., via recursive fork() calls), restoring handles, and verifying consistency through checksums or rollback mechanisms if inconsistencies arise; partial restores may be discarded to ensure atomicity. This sequence minimizes downtime, typically under 1 ms per process for small images, excluding I/O.

Basic coordinated local checkpointing algorithms emphasize synchronous pausing and saving to avoid inconsistencies within a single node. In a coordinated scheme, an initiator (e.g., a master process) signals all related processes to pause simultaneously, dumps their states in a consistent order, and confirms completion before resuming; this prevents internal dependencies, such as data races, from corrupting the snapshot.
For single-process scenarios, the algorithm simplifies to a self-coordinated pause-dump-resume cycle, inherently avoiding domino effects since no external message dependencies exist, unlike in distributed settings where orphan messages can cascade rollbacks. Algorithms like DumpForest analyze process trees in linear time (O(n) for n processes) to ensure hierarchical consistency during serialization.

Overhead in local checkpointing arises from pausing, dumping, and I/O, with memory-intensive applications experiencing 10-20% runtime slowdown at high checkpoint frequencies due to page handling and storage costs. For instance, dumping 1 GB of memory can take around 300-500 ms, scaling linearly with size, while lightweight incremental methods reduce this by tracking only modified bytes, achieving about 15% throughput degradation overall. Space overhead matches memory usage, often compressed to 50-80% of the original size, making it viable for frequent checkpoints in fault-prone environments.
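The fork()-based memory dumping described earlier in this section can be sketched in a few lines of C. This is a minimal illustration under simplifying assumptions: a single static array stands in for "the application state", and the checkpoint file names are made up for the example; a real tool would walk the full address space (e.g., via /proc/self/maps) rather than a known buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define N (1 << 20)
static double state[N];

/* Fork a child that inherits a copy-on-write snapshot of the parent's memory
 * and writes it to disk, while the parent returns immediately and keeps computing. */
static void checkpoint_async(long step) {
    pid_t pid = fork();
    if (pid == 0) {                       /* child: dump the frozen snapshot */
        char path[64];
        snprintf(path, sizeof path, "ckpt-%04ld.bin", step);
        FILE *f = fopen(path, "wb");
        if (f) {
            fwrite(state, sizeof(double), N, f);
            fclose(f);
        }
        _exit(0);                         /* never return into the application */
    } else if (pid < 0) {
        perror("fork");
    }
    /* parent: continues mutating its own pages; copy-on-write keeps the
     * child's view of memory consistent with the instant of the fork. */
}

int main(void) {
    for (long step = 0; step < 100; step++) {
        for (long i = 0; i < N; i++)
            state[i] += 1.0;              /* the ongoing computation */
        if (step % 10 == 0)
            checkpoint_async(step);
    }
    while (wait(NULL) > 0)                /* reap any remaining checkpoint children */
        ;
    return 0;
}
```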

Distributed Checkpointing Approaches

Distributed checkpointing strategies address the need for consistent global states across multiple nodes in networked systems, particularly in message-passing environments like MPI, where processes communicate asynchronously. These approaches build on local checkpointing techniques by incorporating inter-node coordination to ensure recovery from failures without inconsistencies in distributed computations. The primary methods include coordinated, communication-induced, and uncoordinated checkpointing, each balancing consistency, overhead, and scalability differently.

In coordinated checkpointing, all processes pause simultaneously via a global barrier synchronization to record their states, forming a consistent global checkpoint that includes no in-flight messages. This approach guarantees a valid recovery line, avoiding issues like orphan messages (those received after a process rolls back but sent before the sender's failure) or messages lost in transit during a crash. However, it introduces significant synchronization overhead, as the barrier can delay the entire system, making it less suitable for large-scale deployments. Communication-induced checkpointing, by contrast, triggers checkpoints based on message events without requiring a global halt, reducing synchronization costs while maintaining consistency.

A seminal example is the Chandy-Lamport algorithm (1985), which uses marker messages to capture a consistent global snapshot in message-passing systems. In this marker-based protocol, an initiator process records its local state and sends markers along all outgoing channels; upon receiving its first marker on an incoming channel, a process records its own state and forwards markers on its outgoing channels, and it records each incoming channel's state as the messages received after it saved its own state but before the marker arrived on that channel. The algorithm assumes FIFO channels, and extensions address non-FIFO channels and zigzag dependencies (causal chains that could form inconsistent cuts in the state graph) by ensuring the snapshot represents a consistent partial ordering of events. The algorithm supports rollback-recovery by storing checkpoints and channel states in stable storage, allowing restarts from the snapshot without cascading rollbacks.

Uncoordinated checkpointing permits processes to save states independently at their convenience, minimizing immediate coordination but risking the domino effect, where inconsistent checkpoints lead to a chain of rollbacks due to unresolved dependencies, orphan messages, or lost messages in MPI-like models. To mitigate this, rollback-recovery schemes often combine uncoordinated checkpoints with message logging to stable storage, replaying nondeterministic events upon restart. These protocols address zigzag dependencies by reconstructing causal histories during recovery.

Scalability remains a key challenge across approaches, particularly in large clusters, where coordinated methods incur overhead proportional to the number of nodes (n) from global synchronization, potentially dominating computation time. For instance, barrier-based coordination can introduce delays scaling linearly or worse with system size, exacerbating I/O bottlenecks during simultaneous writes. Hybrid methods, such as communication-induced protocols integrated with selective coordination, alleviate these issues for large-scale systems by localizing most checkpointing while ensuring global consistency only when necessary, thus improving performance in environments with thousands of nodes.
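The per-process rules of the marker protocol can be sketched as follows. This is an illustrative single-process sketch, not a complete distributed implementation: the message transport is stubbed out with a print statement, and the channel count, buffer sizes, and integer "local state" are assumptions made for the example.

```c
#include <stdio.h>

#define NCHAN 4                       /* incoming channels of this process */

enum msg_kind { DATA, MARKER };
struct msg { enum msg_kind kind; int payload; };

static int  state_recorded = 0;       /* has this process saved its state yet? */
static int  local_state_snapshot;     /* saved copy of the local state */
static int  local_state = 42;         /* the "real" application state */
static int  recording[NCHAN];         /* still recording messages on channel c? */
static int  channel_state[NCHAN][64]; /* in-flight messages captured per channel */
static int  channel_len[NCHAN];

static void send_marker_on_all_outgoing(void) {
    printf("send MARKER on every outgoing channel\n");   /* transport stub */
}

static void record_local_state(void) {
    local_state_snapshot = local_state;
    state_recorded = 1;
    send_marker_on_all_outgoing();
    for (int c = 0; c < NCHAN; c++)   /* begin recording every incoming channel; */
        recording[c] = 1;             /* the marker's own channel is closed by the caller */
}

/* Called for every message arriving on incoming channel c. */
static void on_receive(int c, struct msg m) {
    if (m.kind == MARKER) {
        if (!state_recorded)
            record_local_state();     /* first marker: snapshot now; channel c is empty */
        recording[c] = 0;             /* a marker closes the recording of channel c */
    } else {
        if (state_recorded && recording[c])
            channel_state[c][channel_len[c]++] = m.payload;  /* in-flight message */
        local_state += m.payload;     /* normal application processing */
    }
}

int main(void) {
    record_local_state();                     /* a process may also initiate the snapshot */
    on_receive(1, (struct msg){DATA, 7});     /* captured as channel-1 in-flight state */
    on_receive(1, (struct msg){MARKER, 0});   /* marker closes channel 1 */
    printf("snapshot: state=%d, channel 1 captured %d message(s)\n",
           local_state_snapshot, channel_len[1]);
    return 0;
}
```

The key property illustrated is that the snapshot records the local state plus exactly those messages that were "in flight" when the cut was taken, so the union of all process and channel states forms a consistent global state.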

Use Cases and Applications

High-Performance Computing Environments

In high-performance computing (HPC) environments, application checkpointing is essential for enabling fault tolerance in long-running simulations, such as those in climate modeling and other scientific domains, where computations can span days or weeks on large-scale clusters. These applications often involve massive parallel processing across thousands of nodes, making them vulnerable to hardware failures that could otherwise force complete restarts and waste significant resources. By periodically saving the application state, checkpointing allows simulations to resume from the last valid point, preserving computational progress in domains that require sustained computation and iterative refinement.

Checkpointing integrates with HPC job schedulers such as SLURM to facilitate automatic restarts upon failure or time limit expiration, enhancing workflow efficiency in batch-oriented systems. For MPI-based applications, which dominate parallel HPC workloads, checkpointing captures communicator states and message logs to ensure coordinated recovery across distributed processes without losing synchronization. This integration supports requeueing mechanisms in schedulers, where failed jobs are automatically resubmitted with the latest checkpoint, minimizing manual intervention and improving resource utilization on supercomputers.

The primary benefit of checkpointing in HPC is the reduction of wall-clock time lost to failures, particularly in petascale and exascale systems, where the mean time between failures (MTBF) can drop to as low as 10 hours in petascale configurations and minutes in exascale setups. In operational exascale systems like Frontier (available since 2022), checkpointing remains essential for achieving high utilization rates despite frequent failures. For instance, models indicate that checkpointing at optimal intervals can recover a significant portion of progress upon failure, markedly improving overall job efficiency compared to full restarts from the beginning. Checkpointing is also critical for large-scale machine learning training workloads, where it enables recovery from node failures in distributed deep learning jobs on exascale systems, reducing training time overheads. The U.S. Department of Energy (DOE) emphasized this in its 2010s initiatives for resilient exascale computing, funding research into advanced checkpointing to achieve high utilization rates (e.g., 80%) despite frequent interrupts, as traditional methods alone would be unsustainable at exascale.

Despite these advantages, checkpointing in HPC faces significant challenges from I/O bottlenecks in parallel file systems like Lustre and GPFS, where simultaneous writes from thousands of nodes can overwhelm metadata servers and network bandwidth, leading to delays of minutes or more per checkpoint. In Lustre, striping patterns and object storage targets often become saturated during coordinated checkpoints, exacerbating contention in large-scale runs. Similarly, GPFS experiences bottlenecks from metadata-intensive operations, prompting the development of specialized filesystems like PLFS to aggregate and optimize checkpoint I/O patterns for better throughput.

Cloud and Containerized Systems

In cloud and containerized environments, application checkpointing facilitates the seamless relocation of workloads across virtual machines (VMs) and containers, enhancing availability and resource efficiency in elastic infrastructures. Live migration, a key use case, allows running containers to be transferred between hosts without interruption, leveraging tools like Checkpoint/Restore In Userspace (CRIU) to capture and restore states, including memory contents and file descriptors. This is particularly valuable for handling interruptions in preemptible resources, such as AWS EC2 Spot Instances, where a two-minute reclamation warning triggers checkpointing to save simulation states to persistent storage like Amazon FSx for Lustre, enabling resumption on new instances and achieving up to 90% cost savings compared to on-demand pricing.

Another prominent application is autoscaling in cloud-native architectures, where checkpointing preserves application state during dynamic scaling events, preventing lost work in distributed systems. For instance, in containerized environments using RDMA networks, checkpointing supports load balancing by migrating active workloads transparently, maintaining low-latency communication without application modifications. Integration with orchestration platforms like Kubernetes occurs through CRI plugins, such as those in CRI-O and containerd, which invoke CRIU to checkpoint pods and restore them on different nodes. CRIU specifically addresses network state challenges by repairing TCP connections during restore, using kernel features to reconstruct socket states, queues, and options, though it requires consistent IP addressing to avoid reconnection failures.

The benefits of checkpointing in these systems include enabling zero-downtime updates and deployments, as migrated containers resume execution with minimal pause, supporting high-availability setups in production clouds. In serverless computing, checkpointing has gained traction as a way to mitigate cold starts (the latency from initializing idle functions) by restoring pre-warmed snapshots, with reported systems achieving median end-to-end latency reductions of 37.2% across benchmarks by optimizing snapshot timing. This approach aligns with the expansion of serverless paradigms in the late 2010s, when checkpoint-restore techniques emerged to handle evictable resources and reduce initialization overheads.

Despite these advantages, challenges persist, particularly with dynamic IP addressing in cloud environments: migrated containers often require IP preservation to re-establish TCP connections seamlessly, as address changes can lead to session disruptions and increased overhead from reconnection logic. Security concerns arise in shared storage for checkpoints, as these files contain sensitive states vulnerable to unauthorized access in multi-tenant clouds; techniques like encryption and access controls are essential, but misconfigurations can expose data during migration.

Key Implementations

Transparent User-Level Tools

Transparent user-level tools for application checkpointing operate entirely within user space, requiring no modifications to the operating system kernel or the application's source code. These tools achieve transparency by intercepting system calls and library functions at runtime, typically using techniques such as dynamic library preloading (e.g., LD_PRELOAD on Linux) to insert wrappers around key APIs. This approach enables checkpointing of multithreaded and distributed applications across heterogeneous environments, making it particularly suitable for scenarios where kernel access is restricted or impractical.

A prominent example is DMTCP (Distributed MultiThreaded CheckPointing), introduced in 2007, which provides transparent checkpointing for both single-host and cluster-based computations. DMTCP employs dynamic proxying for network communications, including MPI and TCP connections, to facilitate seamless restarts even after process migration or failure. By virtualizing process identifiers (PIDs) and file descriptors, DMTCP ensures that restarted processes perceive a consistent environment, supporting applications and MPI implementations such as MPICH2 and Open MPI without recompilation.

The core mechanisms in these tools revolve around process virtualization and dependency management. Syscall interception targets operations such as process creation, file access, and socket calls in order to track and virtualize resources like processes, files, and network sockets. For instance, DMTCP uses wrappers for libc functions to monitor and replay these calls during restart, while avoiding interception of high-frequency syscalls like read and write to minimize overhead. External dependencies, such as third-party libraries, are handled through extensible plugins that capture and restore library-specific state, ensuring compatibility with diverse software stacks. Collaborative checkpointing extends this by enabling state sharing among cluster nodes, where processes exchange checkpoint fragments directly to reduce central coordination and improve resilience in decentralized environments.

These tools offer significant advantages in portability and ease of deployment, as they operate independently of kernel versions and can run on standard distributions without elevated privileges. DMTCP, for example, demonstrates potential for transferability between different machines or even from clusters to desktops. A practical case is its adoption in the 2010s for debugging distributed applications, where developers could pause executions at error points, analyze states, and resume without altering code, as seen in high-energy physics simulations and scientific workflows.

Despite these benefits, transparent user-level tools incur limitations, including potential performance impacts from syscall wrapping, which can introduce overhead (typically under 2% for compute-bound tasks but several times higher for I/O-intensive workloads) due to the additional interception layers. DMTCP's reliance on a centralized coordinator may also create bottlenecks in very large clusters, though runtime overhead remains negligible for most applications. Additionally, support for advanced features like RDMA or full graphical interfaces is often incomplete, restricting applicability in certain high-performance scenarios.
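The library-preloading interposition technique that such tools rely on can be illustrated with a small wrapper around open(). This is a generic sketch of the LD_PRELOAD mechanism, not DMTCP's actual wrapper code; the library name and log message are illustrative.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdarg.h>
#include <fcntl.h>
#include <sys/types.h>

/* Build as a shared object and preload it in front of libc, e.g.:
 *   gcc -shared -fPIC -o libtrack.so track.c -ldl
 *   LD_PRELOAD=./libtrack.so ./application
 */

static int (*real_open)(const char *, int, ...) = NULL;

int open(const char *path, int flags, ...) {
    /* Look up the real libc implementation the first time we are called. */
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {              /* open() takes a third argument only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    int fd = real_open(path, flags, mode);

    /* A checkpointer would record (fd, path, flags, offset) in a table here,
     * so the descriptor can be reopened and repositioned at restart time. */
    fprintf(stderr, "[track] open(\"%s\") -> fd %d\n", path, fd);
    return fd;
}
```

The same pattern, applied to fork, socket, connect, and related calls, is what lets a user-level checkpointer maintain the tables of virtualized PIDs, file descriptors, and connections described above.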

Kernel and System-Level Tools

Kernel and system-level tools for application checkpointing integrate directly with the operating system kernel to achieve low-overhead state capture and restoration, primarily targeting Linux environments for efficient handling of processes and system resources. These tools leverage kernel subsystems to freeze applications, serialize memory and file descriptors, and manage namespace isolation, enabling seamless checkpointing without extensive user-space modifications.

One seminal implementation is Berkeley Lab Checkpoint/Restart (BLCR), introduced in 2003 as a kernel module designed for system-level checkpointing of scientific applications on Linux clusters. However, BLCR is deprecated and incompatible with recent kernels. BLCR employs the kernel's freezing capabilities, such as sending SIGSTOP signals to quiesce processes, followed by dumping the process image, including memory pages, open files, and signal handlers, into a checkpoint file. Upon restart, it reconstructs the state by loading the image and resuming execution, with support for handling kernel objects like device drivers that user-space tools often cannot access. This kernel integration provided advantages in efficiency and completeness, particularly for multi-threaded applications, as demonstrated in early 2000s high-performance computing (HPC) clusters, where BLCR enabled preemptive scheduling and fault tolerance for long-running jobs on Linux-based supercomputers.

A more recent advancement is Checkpoint/Restore In Userspace (CRIU), developed starting in 2011 by the OpenVZ team as a userspace tool that relies on kernel facilities for comprehensive process migration. CRIU utilizes the kernel freezer subsystem, often via the cgroup freezer, to halt processes non-intrusively, then serializes the state by reading memory pages and resources through the /proc filesystem, capturing elements like network connections and mounts. Restoration involves injecting the saved image into a new pid namespace, allowing isolated restarts that preserve the original process tree and dependencies. Key advantages include its ability to checkpoint kernel-managed objects such as namespaces and cgroups, facilitating container migration in environments such as Docker without kernel modifications beyond standard features. In the 2020s, CRIU has seen widespread adoption in OpenVZ (now Virtuozzo) for live container relocation, supporting scalable migration in production systems. These kernel-supported tools offer lower overhead than purely user-level alternatives, which prioritize portability but incur higher interception costs.

Specialized Fault-Tolerance Frameworks

Specialized fault-tolerance frameworks provide application-integrated solutions for checkpointing in parallel and distributed systems, enabling tailored resilience mechanisms that leverage application-specific knowledge to optimize overhead and recovery. These frameworks typically require explicit integration via APIs, allowing developers to define checkpoint scopes, storage strategies, and recovery behaviors, which contrasts with more general-purpose tools by prioritizing efficiency in high-failure-rate environments like large-scale simulations.

The Fault Tolerance Interface (FTI), developed in 2011, exemplifies a multi-level checkpointing framework for high-performance computing (HPC) applications, supporting coordinated saves across distributed processes with options for compression to reduce data volume. FTI organizes checkpoints into tiers: in-memory (Level 1) for rapid, low-overhead saves; local disk (Level 2) for node-local persistence; parallel file systems (Level 3) for shared durability; and Reed-Solomon encoding (Level 4) for fault-resilient redundancy across nodes, ensuring recovery from multiple failures without full recomputation. Its application-integrated API allows selective protection of datasets, such as key variables or arrays in scientific codes, while providing capabilities for testing resilience under simulated failures like node crashes or network partitions. This multi-tier storage minimizes global I/O contention by prioritizing faster local tiers for frequent checkpoints, with compression via algorithms like zlib for the compressible patterns common in simulations. FTI's customizability to application semantics enables optimizations like asynchronous checkpointing and deduplication of redundant data blocks, further lowering I/O demands in bandwidth-limited HPC settings. Post-2015, FTI has been used in various HPC projects, supporting fault-tolerant workflows in climate modeling. These integrations demonstrate its role in enhancing resilience for petascale and exascale systems by reducing checkpoint overheads to around 8% in tested benchmarks.

An earlier example, from the early 2000s, is the coordinated checkpointing mechanism for MPI applications, as implemented in automated application-level tools that synchronize state saves across processes to ensure consistency without orphan messages. This approach, detailed in a 2003 study, uses compiler-assisted wrappers to insert checkpoint calls at communication barriers, saving process states to stable storage in a globally consistent snapshot, which is particularly suited for tightly coupled scientific codes like those using MPI for parallel linear algebra. By coordinating via MPI collectives, it avoids rollback inconsistencies but requires minimal code annotations to handle application-specific data.

For broader applicability, the Scalable Checkpoint/Restart (SCR) library, introduced around 2010, acts as a generic wrapper for scientific codes, providing multi-level checkpointing through a simple API that abstracts storage backends like RAM, burst buffers, or parallel filesystems. As of 2022, SCR has been enhanced in SCR-Exa for next-generation exascale computing. SCR enables coordinated or uncoordinated modes, with resilient checkpoints using replication to tolerate silent errors, and has been integrated into codes such as those in the LLNL suite for plasma physics simulations. Its design emphasizes scalability, caching recent checkpoints locally to defer expensive I/O and achieving significant speedups in restart times for large jobs compared to single-level methods.
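A sketch of how an MPI code might drive the SCR library is shown below. The call names (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize) follow SCR's classic documented interface, but the exact signatures, the SCR_MAX_FILENAME constant, and the per-rank file handling here should be treated as assumptions made for illustration rather than a verified example.

```c
#include <mpi.h>
#include <scr.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    SCR_Init();                              /* initialize SCR after MPI */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... application work ... */

        int need = 0;
        SCR_Need_checkpoint(&need);          /* SCR weighs checkpoint cost vs. failure risk */
        if (need) {
            SCR_Start_checkpoint();

            /* Ask SCR where this rank's checkpoint file should actually be
             * written (RAM disk, burst buffer, or parallel file system). */
            char name[256], path[SCR_MAX_FILENAME];
            snprintf(name, sizeof name, "rank_%d.ckpt", rank);
            SCR_Route_file(name, path);

            FILE *f = fopen(path, "wb");
            int valid = (f != NULL);
            if (f) {
                fwrite(&step, sizeof step, 1, f);  /* application state */
                fclose(f);
            }
            SCR_Complete_checkpoint(valid);  /* every rank reports success or failure */
        }
    }

    SCR_Finalize();                          /* flush cached checkpoints before MPI ends */
    MPI_Finalize();
    return 0;
}
```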
While these frameworks offer advantages such as reduced I/O through tiered storage and application-aware optimizations, they generally necessitate code modifications for integration, limiting their use in legacy or black-box applications without developer intervention. In HPC environments, they integrate with MPI runtimes to trigger checkpoints at synchronization boundaries, enhancing overall job throughput under failure rates exceeding one per day.

Implementations for Resource-Constrained Devices

Embedded System Solutions

Embedded systems, characterized by severe constraints on memory, power, and processing resources, require specialized checkpointing approaches to ensure reliability under unpredictable conditions such as intermittent power supplies. Unlike general-purpose systems, embedded checkpointing prioritizes minimal overhead to avoid draining limited batteries or exceeding scarce non-volatile storage, often leveraging techniques tailored to devices like microcontrollers in IoT sensors.

A seminal solution is Mementos, a software system developed in 2011 that enables long-running computations on RFID-scale devices by transforming general-purpose programs into interruptible ones through energy-aware state checkpointing. Mementos uses compile-time instrumentation to insert energy checks and runtime saving of state to non-volatile memory, selectively capturing only modified state so execution can resume after power failures without full memory dumps. This selective approach achieves significant space efficiency, reducing checkpoint sizes by orders of magnitude compared to naive full-state saves and enabling operation on devices with as little as 1 KB of flash storage.

Key mechanisms in embedded checkpointing include checkpoint-on-failure strategies that trigger state saves to durable storage, such as flash or other non-volatile memory, only when power loss is imminent, minimizing proactive overhead in energy-scarce scenarios. To handle frequent power cycles, systems employ task graphs that model program execution as a graph of atomic tasks, allowing partial resumption from the last completed node once power returns and ensuring forward progress without re-executing stable prefixes. Complementary techniques such as thread migration (transferring execution context between cores or devices for load balancing) and state compression further optimize resource use; for instance, compression algorithms can reduce checkpoint footprints by 50-90% in constrained environments by encoding redundant or predictable state patterns.

These solutions offer distinct advantages for resource-constrained devices, enabling reliable operation of IoT applications like wireless sensor networks, where failures from power fluctuations are common, without requiring hardware modifications or OS support. For example, Mementos demonstrates acceptable cycle overheads while ensuring complete program recovery on platforms like the MSP430 microcontroller, making it viable for tasks that span multiple power cycles. More recent advancements (circa 2018) introduce adaptive dynamic checkpointing that adjusts intervals based on power failure predictions, achieving up to 2.25× speedup over earlier systems like Mementos in intermittent computing scenarios. In battery-powered embedded devices, such checkpointing has gained prominence since the 2010s alongside the rise of energy harvesting, where ambient sources such as solar or radio-frequency energy power untethered nodes, allowing checkpointing to bridge execution gaps and extend operational lifetimes.

ASIC and Hardware-Specific Methods

Application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) enable hardware-integrated checkpointing methods tailored to the unique constraints of custom or reconfigurable hardware, particularly in power-constrained or fault-prone environments. These approaches leverage the inherent parallelism and low-level control of hardware to capture and restore computational state with minimal overhead, contrasting with software-based techniques by embedding checkpoint logic directly into the design.

A seminal method for ASICs is Idetic, which automates the insertion of checkpoints into designs for transiently powered systems. Developed by Mirhoseini et al., Idetic analyzes the control data flow graph (CDFG) of an application using dynamic programming to optimally place checkpoints that minimize recomputation energy and overhead. It generates a finite-state machine (FSM) to manage state transitions and employs on-chip registers as buffers to snapshot critical data, storing it in non-volatile memory (NVM) for persistence across power failures. Upon recovery, restoration occurs by retrieving the last valid checkpoint and resuming from the corresponding FSM state, enabling long-running computations like cryptographic algorithms on energy-harvesting platforms such as RFID tags. This yields low overhead, with area increases under 5%, energy under 11%, and execution time under 16%, while adaptive checkpointing based on supply voltage can improve overall efficiency by up to 1.3× compared to static strategies.

For FPGAs, hardware checkpointing often utilizes shadow execution techniques to achieve near-zero runtime overhead during state capture. As described by Koch et al., the shadow scan chain (SHC) method duplicates flip-flops into a parallel shadow register set, allowing the primary logic to continue execution uninterrupted while the shadow chain copies the state. This enables rapid restoration by reloading the shadow state into the primary registers upon fault detection, supported by reconfigurable logic for custom recovery paths. Compared to traditional FPGA readback methods, SHC reduces checkpoint latency dramatically, for instance from 84,000 cycles to 1,188 cycles for a DES56 module, offering 10- to 100-fold speedups in state transfer for complex designs. These mechanisms are particularly effective in distributed fault-tolerant systems where FPGAs serve as nodes requiring predictable recovery.

The primary advantages of ASIC and FPGA checkpointing lie in low-latency state capture and recovery, essential for real-time systems where delays could compromise safety. In ASICs, these methods handle both digital and analog states natively, capturing mixed-signal data in on-chip buffers without software intervention, which is critical for sensor-integrated designs. For FPGAs, reconfigurable logic facilitates dynamic adjustment of checkpoint frequency, ensuring compliance with timing constraints in fault-prone environments. Such capabilities support applications in autonomous vehicles, where hardware faults demand sub-millisecond recovery to maintain control loops. In the 2020s, these hardware-specific methods have gained traction in edge AI deployments on fault-prone hardware, such as battery-less sensors or radiation-exposed devices. This hardware focus complements the software approaches used in embedded systems by offloading checkpoint overhead to dedicated logic, achieving higher reliability in resource-limited settings.

Challenges and Future Directions

Common Limitations and Trade-offs

Application checkpointing, while essential for fault tolerance, incurs significant storage overhead due to the need to capture full state dumps, often on the order of the application's memory footprint or more, amounting to gigabytes per process in large-scale high-performance computing (HPC) environments. This overhead is exacerbated in memory-intensive applications, where saving the entire application state, including heap, stack, and file descriptors, requires substantial disk or network storage resources.

Time costs associated with checkpointing, primarily from I/O operations and pauses that halt execution, can add notable overhead to overall runtime. In petascale systems, these costs can manifest as up to 20% application slowdown due to I/O contention during checkpoint writes. For non-deterministic applications, such as those using MPI primitives like MPI_ANY_SOURCE, achieving consistency during restart poses additional challenges, as message delivery order may vary, potentially leading to invalid execution states unless coordinated global snapshots or message logging are employed.

A key trade-off in checkpointing involves frequency versus overhead: more frequent checkpoints reduce the potential work lost on failure but amplify cumulative time and storage costs, while infrequent checkpoints minimize routine overhead at the risk of longer recovery times. This balance is often modeled using space-time optimization frameworks such as the Young/Daly formula, which approximates the optimal checkpoint interval as √(2μC), where μ is the mean time between failures and C is the cost of writing a single checkpoint.
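As a worked illustration of the Young/Daly estimate (the numbers below are illustrative, not measurements from any particular system):

```latex
% Young/Daly first-order optimum with illustrative values:
% checkpoint cost C = 5 min, mean time between failures \mu = 24 h = 1440 min.
\tau_{\mathrm{opt}} = \sqrt{2\,\mu\,C}
                    = \sqrt{2 \times 1440\ \text{min} \times 5\ \text{min}}
                    = 120\ \text{min}
```

At that interval the job spends roughly C/τ_opt ≈ 4% of its time writing checkpoints, and an average failure costs about half an interval (one hour) of recomputation plus the restart time; checkpointing much more or much less often than this makes one of the two cost terms dominate.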