A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS

FAISAL SHAHZAD

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

MARKUS WITTMANN

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

MORITZ KREUTZER

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

THOMAS ZEISER

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

GEORG HAGER

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

and

GERHARD WELLEIN

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

https://doi.org/10.1142/S0129626413400112

Abstract

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as hardware parallelism. As a result, the mean time to failure (MTTF) of the whole system decreases. A fault-tolerant environment is thus indispensable for running large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it attractive, but it typically introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare the overheads of various C/R techniques by implementing them in two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied.
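To make the idea of application-based asynchronous checkpointing via a dedicated thread more concrete, the following is a minimal single-process sketch in Python. It is not taken from the paper and does not use the SCR or Damaris APIs. The class `AsyncCheckpointer`, the helpers `write_checkpoint` and `read_checkpoint`, and the toy solver `run` are all hypothetical names introduced for illustration. The sketch assumes the application state is small enough to copy in memory, so that the solver can keep iterating while a background thread writes the copy to disk.

```python
import copy
import os
import pickle
import threading


def write_checkpoint(state, path):
    """Write state to a temporary file, then rename it over the target.

    The rename keeps the previous checkpoint intact if a failure
    interrupts the write.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Load a previously written checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


class AsyncCheckpointer:
    """Hypothetical helper: writes checkpoints on a dedicated thread."""

    def __init__(self, path):
        self.path = path
        self._thread = None

    def save(self, state):
        # Allow only one write in flight: wait for the previous one first.
        self.wait()
        # Snapshot the state so the solver may modify it during the write.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(
            target=write_checkpoint, args=(snapshot, self.path)
        )
        self._thread.start()

    def wait(self):
        """Block until the pending checkpoint write (if any) has finished."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def run(steps, checkpoint_every, path):
    """Toy iterative solver that restarts from `path` if a checkpoint exists."""
    if os.path.exists(path):
        state = read_checkpoint(path)  # restart case
    else:
        state = {"step": 0, "values": [0.0] * 4}  # fresh start
    checkpointer = AsyncCheckpointer(path)
    while state["step"] < steps:
        state["values"] = [v + 1.0 for v in state["values"]]  # stand-in for compute
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            checkpointer.save(state)
    checkpointer.wait()  # make sure the last checkpoint is on disk
    return state
```

Calling `run(10, 5, "ckpt.bin")` and later `run(15, 5, "ckpt.bin")` resumes the second run from step 10 instead of step 0. A synchronous variant would call `write_checkpoint` directly in the loop and stall the solver for the full duration of the write. The asynchronous variant pays for the in-memory copy up front and hides the file-system write behind subsequent iterations.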

Keywords: