hama-dev mailing list archives

From Suraj Menon <menonsur...@gmail.com>
Subject Re: Recovery Issues
Date Mon, 12 Mar 2012 08:01:07 GMT

I want to understand single-task rollback. Consider a scenario in which
all tasks checkpoint every 5 supersteps. If one of the tasks fails at
superstep 7, it has to recover from the data checkpointed at superstep 5.
How would it then get the messages from its peer BSP tasks for supersteps 6
and 7?
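
To make the arithmetic in this scenario concrete, here is a minimal sketch.
It assumes checkpoints are taken at every superstep that is a multiple of
the interval and that the failure happens after the most recent checkpoint
completed. The class and method names are hypothetical and are not part of
the Hama API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hama code: works out where a failed task
// restarts and which supersteps it has to re-execute.
final class RollbackSketch {

    // Most recent checkpointed superstep at or before the failure,
    // assuming a checkpoint at every multiple of `interval`.
    static int lastCheckpoint(int failedSuperstep, int interval) {
        return (failedSuperstep / interval) * interval;
    }

    // Supersteps the restarted task must replay. These are exactly the
    // supersteps for which it needs its peers' messages again.
    static List<Integer> superstepsToReplay(int failedSuperstep, int interval) {
        List<Integer> replay = new ArrayList<>();
        int start = lastCheckpoint(failedSuperstep, interval) + 1;
        for (int s = start; s <= failedSuperstep; s++) {
            replay.add(s);
        }
        return replay;
    }

    public static void main(String[] args) {
        System.out.println(lastCheckpoint(7, 5));      // restart point
        System.out.println(superstepsToReplay(7, 5));  // supersteps to replay
    }
}
```

With an interval of 5 and a failure at superstep 7, the task restarts from
superstep 5 and has to replay supersteps 6 and 7, which is where the
question about the peers' messages comes from.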

Another issue I have in mind is how we would recover the values of static
variables that a user might hold in a BSP job. This is a problem whenever
a user maintains static variable state whose lifecycle spans multiple
supersteps.
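
A small illustration of the static-variable problem follows. It uses plain
Java serialization as a stand-in for a checkpoint, which is only an
assumption for the sake of the example and may not be how Hama writes its
checkpoints. All class and method names are hypothetical. Java
serialization does not save static fields, so only the instance field
survives the simulated restart.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch, not Hama code.
final class StaticStateSketch {

    static final class TaskState implements Serializable {
        private static final long serialVersionUID = 1L;
        static int staticCounter;  // static: not written by serialization
        int instanceCounter;       // instance field: written by serialization
    }

    // Stand-in for writing a checkpoint.
    static byte[] checkpoint(TaskState state) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        }
        return bytes.toByteArray();
    }

    // Stand-in for reading a checkpoint back after a failure.
    static TaskState restore(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (TaskState) in.readObject();
        }
    }

    // Returns {instanceCounter, staticCounter} as seen after recovery.
    static int[] demo() {
        try {
            TaskState state = new TaskState();
            state.instanceCounter = 5;
            TaskState.staticCounter = 5;
            byte[] snapshot = checkpoint(state);

            // Simulate the task restarting in a fresh JVM, where the
            // static field starts again from its default value.
            TaskState.staticCounter = 0;

            TaskState recovered = restore(snapshot);
            return new int[] {recovered.instanceCounter, TaskState.staticCounter};
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("instance=" + result[0] + " static=" + result[1]);
    }
}
```

In this sketch the instance counter comes back as 5 while the static
counter stays at 0, so any state a user keeps in static variables would
have to be checkpointed explicitly or moved into instance state.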


On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> I guess we have to split up the issues that checkpoint recovery involves.
> In my opinion we have two types of recovery:
> - single task recovery
> - global recovery of all tasks
> And I guess we can simply make a rule:
> If a task fails inside our barrier sync method (since we have a double
> barrier, that is after enterBarrier() and before leaveBarrier()), we have
> to do a global recovery.
> Otherwise we can just do a single task rollback.
> For those asking why we can't just always do a global rollback: it is too
> costly and we really do not need it in every case.
> But we do need it when a task fails inside the barrier (between enter and
> leave), because a single rolled-back task cannot trip the enterBarrier()
> barrier on its own.
> Anything I have forgotten?
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
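
The rule proposed in the quoted message can be sketched roughly as follows.
The enum and method names are hypothetical. Only enterBarrier() and
leaveBarrier() are taken from the message itself, and this is an
illustration of the rule rather than Hama's actual recovery code.

```java
// Hypothetical sketch of the proposed recovery rule, not Hama code.
final class RecoveryRuleSketch {

    // Where the task was when it failed.
    enum TaskPhase {
        COMPUTING,       // outside the barrier sync method
        INSIDE_BARRIER   // after enterBarrier() and before leaveBarrier()
    }

    enum RecoveryKind {
        SINGLE_TASK_ROLLBACK,
        GLOBAL_RECOVERY
    }

    // A failure inside the double barrier forces a global recovery;
    // any other failure only needs the failed task to roll back.
    static RecoveryKind choose(TaskPhase phaseAtFailure) {
        return phaseAtFailure == TaskPhase.INSIDE_BARRIER
                ? RecoveryKind.GLOBAL_RECOVERY
                : RecoveryKind.SINGLE_TASK_ROLLBACK;
    }

    public static void main(String[] args) {
        System.out.println(choose(TaskPhase.COMPUTING));
        System.out.println(choose(TaskPhase.INSIDE_BARRIER));
    }
}
```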
