hama-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Recovery Issues
Date Wed, 14 Mar 2012 07:29:40 GMT
If user do something with communication APIs in a while read.next()
loop or in memory, recovery is not simple. So in my opinion, first of
all, we should have to separate (or hide) the data handlers to
somewhere from bsp() function like M/R or Pregel. For example,

ftbsp(Communicator comm);
setup(DataInput in);
close(DataOutput out);

And then, maybe we can design the flow of checkpoint-based (Task
Failure) recovery like this:

1. If some task failed to execute setup() or close() functions, just
re-attempt, finally return "job failed" message.

2. If some task failed in the middle of processing, and it should be
re-launched, the statuses of JobInProgress and TaskInProgress should
be changed.

3. And, in a every step, all tasks should check whether status of
rollback to earlier checkpoint or keep running (or waiting to leave
barrier).

4. Re-launch a failed task.

5. Change the status to RUNNING from ROLLBACK or RECOVERY.

On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
<thomas.jungblut@googlemail.com> wrote:
> Ah yes, good points.
>
> If we don't have a checkpoint from the current superstep we have to do a
> global rollback of the least known messages.
> So we shouldn't offer this configurability through the BSPJob API, this is
> for specialized users only.
>
> One more issue that I have in mind is how we would be able to recover the
>> values of static variables that someone would be holding in each bsp job.
>> This scenario is a problem if a user is maintaining some static variable
>> state whose lifecycle spans across multiple supersteps.
>>
>
> Ideally you would transfer your shared state through the messages. I
> thought of making a backup function available in the BSP class where
> someone can backup their internal state, but I guess this is not how BSP
> should be written.
>
> Which does not mean that we don't want to provide this in next releases.
>
> Am 12. März 2012 09:01 schrieb Suraj Menon <menonsuraj5@gmail.com>:
>
>> Hello,
>>
>> I want to understand single task rollback. So consider a scenario, where
>> all tasks checkpoint every 5 supersteps. Now when one of the tasks failed
>> at superstep 7, it would have to recover from the checkpointed data at
>> superstep 5. How would it get messages from the peer BSPs at superstep 6
>> and 7?
>>
>> One more issue that I have in mind is how we would be able to recover the
>> values of static variables that someone would be holding in each bsp job.
>> This scenario is a problem if a user is maintaining some static variable
>> state whose lifecycle spans across multiple supersteps.
>>
>> Thanks,
>> Suraj
>>
>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <
>> thomas.jungblut@googlemail.com> wrote:
>>
>> > I guess we have to slice some issues needed for checkpoint recovery.
>> >
>> > In my opinion we have two types of recovery:
>> > - single task recovery
>> > - global recovery of all tasks
>> >
>> > And I guess we can simply make a rule:
>> > If a task fails inside our barrier sync method (since we have a double
>> > barrier, after enterBarrier() and before leaveBarrier()), we have to do a
>> > global recovery.
>> > Else we can just do a single task rollback.
>> >
>> > For those asking why we can't do just always a global rollback: it is too
>> > costly and we really do not need it in any case.
>> > But we need it in the case where a task fails inside the barrier (between
>> > enter and leave) just because a single rollbacked task can't trip the
>> > enterBarrier-Barrier.
>> >
>> > Anything I have forgotten?
>> >
>> >
>> > --
>> > Thomas Jungblut
>> > Berlin <thomas.jungblut@gmail.com>
>> >
>>
>
>
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Mime
View raw message