incubator-hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Checkpointer Process
Date Tue, 25 Oct 2011 08:22:15 GMT
> 1.) Checkpointer runs in the same process as the bsp task.
> 2.) A separate checkpointing process per bsp task on each machine.
> 3.) A separate checkpointing process per machine.
> 4.) Checkpointing processes in the form of a server farm.

When a task fails, all tasks will be restarted from the previous
checkpoint data, right?

I'm +1 for the first idea. I believe this way is simple and reliable.
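
To make the restart question concrete, here is a minimal sketch of which
snapshot a rollback would use in the superstep scenario quoted below. It is
not Hama code: the function names are hypothetical, and it assumes one
snapshot per completed superstep while the checkpointer is up, no snapshot
for the superstep in which the task fails, and that all tasks roll back
together.

```python
from typing import Optional, Set


def last_usable_checkpoint(failed_at: int, checkpointer_down: Set[int]) -> Optional[int]:
    """Return the latest superstep with a snapshot written before the failure.

    Assumption: a snapshot exists for superstep s only if the checkpointer
    was running during s. Returns None when no snapshot exists at all.
    """
    # Walk backwards from the superstep just before the failure.
    for step in range(failed_at - 1, 0, -1):
        if step not in checkpointer_down:
            return step
    return None


def supersteps_to_redo(failed_at: int, checkpointer_down: Set[int]) -> int:
    """Count the supersteps that must be re-executed after rolling back."""
    snapshot = last_usable_checkpoint(failed_at, checkpointer_down)
    # With no snapshot, the job restarts from superstep 1.
    resume_from = 1 if snapshot is None else snapshot + 1
    return failed_at - resume_from + 1


if __name__ == "__main__":
    down = {5, 6}  # checkpointer fails after superstep 4, is back at 7
    print(last_usable_checkpoint(9, down), supersteps_to_redo(9, down))
    print(last_usable_checkpoint(6, down), supersteps_to_redo(6, down))
```

Under these assumptions, a checkpointer outage only costs extra re-execution
if a bsp task also fails before a newer snapshot has been written.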

2011/10/25 ChiaHung Lin <chl501@nuk.edu.tw>:
> Just some thoughts on why to program the checkpointer as a separate process. The idea is
> centered on isolation. Because faults will occur, ensuring that failures/errors do not
> adversely affect other parts of the system becomes critical. Also, performing user tasks and
> saving data to hdfs are two different concerns, so our goal is to ensure user tasks continue
> to work even if the checkpointing process fails. As long as user tasks keep running
> smoothly, a failure of the checkpointing process can be ignored.
>
> There were 4 options considered previously:
>
> 1.) Checkpointer runs in the same process as the bsp task.
> 2.) A separate checkpointing process per bsp task on each machine.
> 3.) A separate checkpointing process per machine.
> 4.) Checkpointing processes in the form of a server farm.
>
> The problem with the first option is that if the checkpointing process fails, user tasks may
> fail as well, which is unwanted behaviour for users. The fourth has the problem that, if both
> processes fail, recovery affects arbitrary user tasks. The second and third are similar,
> except that the second option minimizes the number of user tasks affected if both processes fail.
> Running the checkpointer as a separate process has the advantage that if only the checkpointing
> process fails, no recovery is necessary. For example, suppose a BSP job performs its tasks from
> supersteps 1 to 10 while a separate checkpointing process stands by. In the first
> 3 supersteps, both processes work well. After superstep 4, the checkpointing process
> fails, but the user task continues doing its work. At superstep 7, the checkpointer
> is back (e.g. restarted). If the user task keeps working until it finishes, there is no need
> to perform recovery in this case. If the bsp task fails after the checkpointing process is back,
> the system has a chance to recover from the latest snapshot.
>
> I understand the current implementation is not perfect. But it would be good if we
> could work in this direction because, to the best of my knowledge, this is the recommended approach.
>
> -----Original message-----
> From:Thomas Jungblut <thomas.jungblut@googlemail.com>
> To:hama-dev@incubator.apache.org
> Date:Fri, 14 Oct 2011 15:54:10 +0200
> Subject:Checkpointer Process
>
> Hi all.
> My idea:
> With YARN and multitasking, we should consider moving the Checkpointer
> process into the BSPPeer itself instead of a single process.
>
> It would be great if we could discuss the real advantages and
> disadvantages of integrating it into the same process versus a daemon process.
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
