incubator-hama-dev mailing list archives

From "ChiaHung Lin" <chl...@nuk.edu.tw>
Subject Re: Checkpointer Process
Date Tue, 25 Oct 2011 07:37:21 GMT
Here are some thoughts on why the checkpointer should be programmed as a separate process. The idea
centers on isolation. Because faults will occur, it is critical to ensure that failures and errors
do not adversely affect other parts of the system. Also, performing user tasks and saving
data to HDFS are two different concerns, so our goal is to ensure that user tasks keep
working even if the checkpointing process fails. As long as user tasks keep doing
their job smoothly, a failure of the checkpointing process can be ignored.

Four options were considered previously:

1.) The checkpointer runs in the same process as the bsp task.
2.) A separate checkpointing process per bsp task on each machine.
3.) A separate checkpointing process per machine.
4.) Checkpointing processes in the form of a server farm.

The problem with the first option is that if the checkpointing process fails, user tasks may fail as
well, which is unwanted behaviour for users. The fourth has the problem that, if both processes fail,
recovery affects arbitrary user tasks. The second and third are similar, except that
the second option minimizes the number of user tasks affected if both processes fail. Running the
checkpointer as a separate process has the advantage that if only the checkpointing process fails,
no recovery is necessary. For example, suppose a BSP job performs its tasks from supersteps 1
to 10. At the same time, a separate checkpointing process stands by. In the first 3 supersteps,
both processes work well. After superstep 4, the checkpointing process fails, but the
user task keeps doing its work. At superstep 7, the checkpointer is back (e.g.
restarted). If the user task keeps working until it finishes, there is no need to perform recovery
in this case. If the bsp task fails after the checkpointing process is back, the system has a chance
to recover from the latest snapshot.
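
The superstep example above can be sketched as a small simulation. This is only an illustration,
not Hama code: the function run_job, its parameters, and the choice to model the checkpointer
outage as supersteps 4 to 6 are all assumptions made for this sketch.

```python
def run_job(total_supersteps, checkpointer_down, task_fails_at=None):
    """Simulate one bsp task with a separate checkpointer process.

    All names here are hypothetical; this is not the Hama API.
    checkpointer_down: set of supersteps during which the checkpointer is dead.
    task_fails_at: superstep at which the user task crashes, or None.
    Returns (finished, latest_snapshot, restart_from).
    """
    latest_snapshot = None
    for step in range(1, total_supersteps + 1):
        if step == task_fails_at:
            # The task crashed: resume right after the latest snapshot,
            # or from superstep 1 if no snapshot was ever taken.
            restart_from = 1 if latest_snapshot is None else latest_snapshot + 1
            return False, latest_snapshot, restart_from
        # The user task completes this superstep regardless of the checkpointer.
        if step not in checkpointer_down:
            latest_snapshot = step  # a snapshot is saved only while the checkpointer is up
    return True, latest_snapshot, None


# Assumed outage: the checkpointer is dead during supersteps 4 to 6, back at 7.
outage = {4, 5, 6}

# Only the checkpointer fails: the job finishes and no recovery is needed.
print(run_job(10, outage))

# The task fails at superstep 9, after the checkpointer is back:
# recovery can start from the snapshot taken at superstep 8.
print(run_job(10, outage, task_fails_at=9))
```

If the task instead failed during the outage (say at superstep 6), the sketch would fall back to
the snapshot from superstep 3, which is the cost of the checkpointer having been down.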

I understand the current implementation is not perfect. But it would be good if we could work
toward this direction because, to the best of my knowledge, this is the recommended approach.

-----Original message-----
From:Thomas Jungblut <thomas.jungblut@googlemail.com>
To:hama-dev@incubator.apache.org
Date:Fri, 14 Oct 2011 15:54:10 +0200
Subject:Checkpointer Process

Hi all.
My idea:
Given YARN and multitasking, we should consider moving the Checkpointer
process into the BSPPeer itself instead of a single process.

It would be great if we could discuss what would be the real advantage and
disadvantage of integrating it in the same process / a daemon process.

-- 
Thomas Jungblut
Berlin <thomas.jungblut@gmail.com>


--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan
