Realization of An Efficient Concurrent Partial Snapshot Algorithm for Large-scale and Dynamic Distributed Systems

Rentaro Watanabe, Yonghwan Kim, Fukuhito Ooshita, Hirotsugu Kakugawa, Toshimitsu Masuzawa

Abstract


This paper presents a new algorithm for taking a partial snapshot in a dynamic distributed system. A global snapshot is a collection of local states of all the nodes in the system and plays important roles in many applications such as checkpoint-based rollback recovery. But it becomes impractical and instead the partial snapshot becomes useful when the distributed system becomes large or dynamic. In the partial snapshot, all the nodes do not store their states, but instead, a node stores its state only when it has causal relation to the node initiating the snapshot algorithm.

Moriya and Araragi presented an efficient partial snapshot algorithm for a dynamic distributed system where nodes can join to and leave from the system. But they consider only the case where a single node initiates the snapshot algorithm. Kim et al. modified the algorithm so that multiple nodes can concurrently initiate the snapshot algorithm. This paper presents another method to allow the multiple concurrent initiators. Both the algorithms merge the partial snapshot processes initiated by the initiators and finally one of the initiators coordinates the process for taking the partial snapshot. The new algorithm executes the merge more distributedly than the previous one, and succeeds to reduce the number of messages to take the partial snapshot. 


Keywords


Distributed system; Fault tolerance; Checkpoint-based rollback recovery; Snapshot algorithm; Partial-snapshot algorithm

Full Text:

PDF

Refbacks

  • There are currently no refbacks.