Evolving fault-tolerance in Hadoop with robust auto-recovering JobTracker

Nobuyuki Kuromatsu, Masao Okita, Kenichi Hagihara


Hadoop is a popular open source software for supporting a large scale distributed data processing. While it achieves high reliability, the job scheduler, named JobTracker, remains the single point of failure. If the JobTracker fails to stop during a job execution, the job is canceled immediately and all of intermediate results are lost. We propose an auto-recovery system against the fail-stop without additional hardware. Our recovery mechanism is based on a checkpoint method. A snapshot of the JobTracker is stored on a distributed file system periodically. When the system detects the fail-stop by using timeout, it automatically recovers the JobTracker by a snapshot. The key feature of our system is a transparent recovery such that a job execution continues during a temporary fail-stop of the JobTracker and completes itself with a little rollback. The system achieves fault-tolerance for the JobTracker with overheads less than 4.3% of the total execution time. It reduces the reassigned tasks caused by a rollback compared to a nae rollback.


Fault-torelance; checkpoint and recovery; Hadoop; master-worker

Full Text:



  • There are currently no refbacks.