hama-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chia-Hung Lin <cli...@googlemail.com>
Subject Re: Recovery Issues
Date Wed, 14 Mar 2012 13:00:09 GMT
It would be simpler for the first version of fault tolerance  to focus
on rollback to the level at the beginning of start of a superstep. For
example, when a job executes to the middle of 12th superstep, then the
task fails. The system should restart that task (assume checkpoint on
every superstep and fortunately the system have the latest snapshot of
e.g. the 11th superstep for that task) and put the necessary messages
(checkpointed at the 11th superstep that would be transferred to the
12th superstep) back to the task so that the failed task can restart
from the 12th superstep.

If later on the community wants the checkpoint to the level at
specific detail within a task. Probably we can exploit something like
incremental checkpoint[1] or Memento pattern, which requires users'
assistance, to restart a task for local checkpoint.

[1]. Efficient Incremental Checkpointing of Java Programs.
hal.inria.fr/inria-00072848/PDF/RR-3810.pdf

On 14 March 2012 15:29, Edward J. Yoon <edwardyoon@apache.org> wrote:
> If user do something with communication APIs in a while read.next()
> loop or in memory, recovery is not simple. So in my opinion, first of
> all, we should have to separate (or hide) the data handlers to
> somewhere from bsp() function like M/R or Pregel. For example,
>
> ftbsp(Communicator comm);
> setup(DataInput in);
> close(DataOutput out);
>
> And then, maybe we can design the flow of checkpoint-based (Task
> Failure) recovery like this:
>
> 1. If some task failed to execute setup() or close() functions, just
> re-attempt, finally return "job failed" message.
>
> 2. If some task failed in the middle of processing, and it should be
> re-launched, the statuses of JobInProgress and TaskInProgress should
> be changed.
>
> 3. And, in a every step, all tasks should check whether status of
> rollback to earlier checkpoint or keep running (or waiting to leave
> barrier).
>
> 4. Re-launch a failed task.
>
> 5. Change the status to RUNNING from ROLLBACK or RECOVERY.
>
> On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
> <thomas.jungblut@googlemail.com> wrote:
>> Ah yes, good points.
>>
>> If we don't have a checkpoint from the current superstep we have to do a
>> global rollback of the least known messages.
>> So we shouldn't offer this configurability through the BSPJob API, this is
>> for specialized users only.
>>
>> One more issue that I have in mind is how we would be able to recover the
>>> values of static variables that someone would be holding in each bsp job.
>>> This scenario is a problem if a user is maintaining some static variable
>>> state whose lifecycle spans across multiple supersteps.
>>>
>>
>> Ideally you would transfer your shared state through the messages. I
>> thought of making a backup function available in the BSP class where
>> someone can backup their internal state, but I guess this is not how BSP
>> should be written.
>>
>> Which does not mean that we don't want to provide this in next releases.
>>
>> Am 12. März 2012 09:01 schrieb Suraj Menon <menonsuraj5@gmail.com>:
>>
>>> Hello,
>>>
>>> I want to understand single task rollback. So consider a scenario, where
>>> all tasks checkpoint every 5 supersteps. Now when one of the tasks failed
>>> at superstep 7, it would have to recover from the checkpointed data at
>>> superstep 5. How would it get messages from the peer BSPs at superstep 6
>>> and 7?
>>>
>>> One more issue that I have in mind is how we would be able to recover the
>>> values of static variables that someone would be holding in each bsp job.
>>> This scenario is a problem if a user is maintaining some static variable
>>> state whose lifecycle spans across multiple supersteps.
>>>
>>> Thanks,
>>> Suraj
>>>
>>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <
>>> thomas.jungblut@googlemail.com> wrote:
>>>
>>> > I guess we have to slice some issues needed for checkpoint recovery.
>>> >
>>> > In my opinion we have two types of recovery:
>>> > - single task recovery
>>> > - global recovery of all tasks
>>> >
>>> > And I guess we can simply make a rule:
>>> > If a task fails inside our barrier sync method (since we have a double
>>> > barrier, after enterBarrier() and before leaveBarrier()), we have to do
a
>>> > global recovery.
>>> > Else we can just do a single task rollback.
>>> >
>>> > For those asking why we can't do just always a global rollback: it is too
>>> > costly and we really do not need it in any case.
>>> > But we need it in the case where a task fails inside the barrier (between
>>> > enter and leave) just because a single rollbacked task can't trip the
>>> > enterBarrier-Barrier.
>>> >
>>> > Anything I have forgotten?
>>> >
>>> >
>>> > --
>>> > Thomas Jungblut
>>> > Berlin <thomas.jungblut@gmail.com>
>>> >
>>>
>>
>>
>>
>> --
>> Thomas Jungblut
>> Berlin <thomas.jungblut@gmail.com>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

Mime
View raw message