Application checkpointing

Checkpointing is a technique that provides fault tolerance for computing systems. It involves saving a snapshot of an application's state, so that it can restart from that point in case of failure. This is particularly important for long-running applications that are executed in failure-prone computing systems.

In distributed computing environments, checkpointing helps tolerate failures that would otherwise force a long-running application to restart from the beginning. The most basic way to implement checkpointing is to stop the application, copy all the required data from memory to reliable storage (e.g., a parallel file system), and then continue execution. In the case of failure, the restarted application does not need to start from scratch; instead, it reads the latest saved state ("the checkpoint") from stable storage and resumes execution from that point. While there is ongoing debate over whether checkpointing is the dominant I/O workload on distributed computing systems, there is general consensus that it is one of the major I/O workloads.
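
As a rough illustration, the following Python sketch shows this stop-save-continue cycle for a single process. The file name, state layout, and checkpoint interval are assumptions chosen for the example; a real system would write to a parallel file system rather than a local file.

    import os
    import pickle

    CHECKPOINT = "app.ckpt"   # hypothetical path on reliable storage

    def save_checkpoint(state):
        # Write to a temporary file first, then rename atomically, so a
        # crash mid-write never corrupts the previous checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, CHECKPOINT)

    def load_checkpoint():
        # On restart, resume from the latest checkpoint if one exists;
        # otherwise begin from the initial state.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "total": 0}

    state = load_checkpoint()
    while state["step"] < 1_000_000:
        state["total"] += state["step"]      # stand-in for real work
        state["step"] += 1
        if state["step"] % 10_000 == 0:      # checkpoint interval
            save_checkpoint(state)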

There are two main approaches to checkpointing in distributed computing systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated approach, processes must ensure that their checkpoints form a consistent global state, usually by means of a two-phase commit protocol. In the uncoordinated approach, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint at fixed time intervals is not sufficient to ensure global consistency. The need to establish a consistent state (i.e., one with no missing or duplicated messages) may force some processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints; in the most extreme case, the only consistent state found is the initial one (the so-called domino effect).
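
The following Python sketch mimics the two-phase pattern of coordinated checkpointing on a single machine. The Process class, its method names, and the commit rule are illustrative assumptions; a real implementation would exchange these prepare/commit messages over the network.

    class Process:
        def __init__(self, pid):
            self.pid = pid
            self.state = {"pid": pid, "work_done": 0}
            self.tentative = None
            self.permanent = None

        def prepare(self):
            # Phase 1: quiesce (stop sending application messages) and
            # take a tentative checkpoint of local state.
            self.tentative = dict(self.state)
            return True   # ack: ready to commit

        def commit(self):
            # Phase 2: make the tentative checkpoint permanent.
            self.permanent = self.tentative

        def abort(self):
            # Discard the tentative checkpoint; the previously committed
            # checkpoint set remains the recovery line.
            self.tentative = None

    def coordinated_checkpoint(processes):
        # The coordinator commits only if every process acknowledges, so
        # the saved checkpoints always form a consistent global state.
        acks = [p.prepare() for p in processes]
        if all(acks):
            for p in processes:
                p.commit()
            return True
        for p in processes:
            p.abort()
        return False

    procs = [Process(i) for i in range(4)]
    print(coordinated_checkpoint(procs))   # -> True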

One of the original, and now most common, means of application checkpointing was the "save state" feature of interactive applications, in which the user could save the state of all variables and other data and either continue working or exit the application and restore the saved state when restarting later. This was implemented through a "save" command or menu option in the application. In many cases it became standard practice to ask users who exited an application with unsaved work whether they wanted to save it before doing so.

This functionality became extremely important for usability in applications where a task could not be completed in one sitting (such as playing a video game expected to take dozens of hours) or where work was done over a long period of time (such as entering rows of data into a spreadsheet).

The problem with save state is that it requires the operator of a program to request the save. For non-interactive programs, including automated or batch-processed workloads, checkpointing also had to be automated.

As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several other files, it became imperative that the application be restartable at an intermediate point without rerunning the entire job from scratch. Thus the "checkpoint/restart" capability was born: after a number of transactions had been processed, a "snapshot" or "checkpoint" of the application's state could be taken. If the application failed before the next checkpoint, it could be restarted from that checkpoint information plus the last position in the transaction file where a transaction had completed successfully.
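
A minimal Python sketch of that pattern, assuming a line-oriented transaction file; the file names, the process() stub, and the checkpoint interval are hypothetical.

    import json
    import os

    CKPT = "batch.ckpt"           # hypothetical checkpoint file
    INPUT = "transactions.txt"    # hypothetical transaction file
    INTERVAL = 100                # records between checkpoints

    def process(record):
        pass   # hypothetical per-record work against the master files

    def load_position():
        # On restart, resume just after the last checkpointed record.
        if os.path.exists(CKPT):
            with open(CKPT) as f:
                return json.load(f)["offset"]
        return 0

    def save_position(offset):
        with open(CKPT + ".tmp", "w") as f:
            json.dump({"offset": offset}, f)
        os.replace(CKPT + ".tmp", CKPT)

    with open(INPUT) as f:
        f.seek(load_position())
        count = 0
        while True:
            line = f.readline()
            if not line:
                break
            process(line)
            count += 1
            if count % INTERVAL == 0:
                save_position(f.tell())   # where to restart next time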

Checkpointing tends to be expensive, so it was generally not done for every record, but at some reasonable compromise between the cost of a checkpoint and the cost of the computer time needed to reprocess a batch of records. Thus the number of records processed per checkpoint might range from 25 to 200, depending on cost factors, the relative complexity of the application, and the resources needed to restart it successfully.
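
One classical way to formalize this compromise (not stated above) is Young's first-order approximation for the optimal checkpoint interval; the figures in the sketch below are purely illustrative.

    import math

    def optimal_interval(checkpoint_cost, mean_work_between_failures):
        # Young's approximation: interval = sqrt(2 * C * M), where C is
        # the cost of one checkpoint and M is the mean amount of work
        # between failures, both in the same units (records here).
        return math.sqrt(2 * checkpoint_cost * mean_work_between_failures)

    # Assumed figures: a checkpoint costing as much as 5 records of work,
    # with a failure expected every 2,000 records, suggests checkpointing
    # about every 141 records, within the 25-200 range quoted above.
    print(round(optimal_interval(5, 2000)))   # -> 141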
