incubator-hama-dev mailing list archives

From Suraj Menon <menonsur...@gmail.com>
Subject Recovery Issues
Date Wed, 14 Mar 2012 18:20:41 GMT
Hello,

+1 on Chiahung's comment on getting things right up to the beginning of a
superstep on failure.
I think we are all over the place (or at least I am) in our thinking on
fault tolerance. I want to take a step back.
First, we should decide what kinds of faults we have to recover from. I am
putting these in points below:

1. Hardware failure. - In my opinion, this is the most important failure
cause we should be working on.
Why? - Most other kinds of faults would happen because of errors in the
user's program logic or bugs, if any, in the Hama framework. I will get to
these scenarios in the next point.
Picking up from Chiahung's mail, say we have 15 bsp tasks running in
parallel for the 12th superstep, and the nodes running tasks 5, 6 and 7
failed during execution. We would have to roll those bsp tasks back from the
data checkpointed in superstep 11, based on the task id, attempt id and job
number. This means the BSPChild executor in the GroomServer has to be told
that the BSP Peer should be started in recovery mode and should not assume
the superstep is -1. It should then load the input message queue with the
checkpointed messages, and it would be given the partition that holds the
checkpointed data for its task id. All other tasks would wait at the sync
barrier until the recovered tasks 5, 6 and 7 enter the sync of the 12th
superstep, and then the job continues.
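
To make that flow concrete, here is a rough sketch of what the recovery
handling could look like on the scheduling side. Every name in it
(RecoveryDirective, checkpointPartitionFor, groomFor) is a placeholder
invented for illustration, not existing Hama API:

  // Sketch only -- class and method names below are hypothetical.
  void recoverFailedTasks(List<TaskAttemptID> failedTasks,
                          long lastCheckpointedSuperstep) {
    for (TaskAttemptID task : failedTasks) {
      // Directive telling the GroomServer to start the BSPPeer in recovery
      // mode (superstep = 11 here) instead of fresh (superstep = -1).
      RecoveryDirective directive = new RecoveryDirective(
          task,                                   // task id + attempt id + job id
          lastCheckpointedSuperstep,              // e.g. 11
          checkpointPartitionFor(task, lastCheckpointedSuperstep));
      groomFor(task).dispatch(directive);
    }
    // The healthy tasks stay blocked at the superstep-12 barrier until the
    // recovered tasks 5, 6 and 7 enter the same sync.
  }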

2. Software failure.

a) If a user's bsp job fails or throws a RuntimeException while running,
the odds are that if we recover his task state, the task will fail again,
because we would be recovering his error state. Hadoop has a feature where
a configurable percentage of failed tasks can be tolerated during a
computation. I think we should be open to the idea of users wanting to be
oblivious to certain data.
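
A knob like that could be exposed through a job property; the property name
below is made up for illustration only, and nothing like it exists in Hama
yet:

  // Hypothetical property, modelled on Hadoop's tolerated-failure setting.
  import org.apache.hadoop.conf.Configuration;

  public class FailureToleranceExample {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Let up to 5% of the BSP tasks fail without failing the whole job.
      conf.setInt("bsp.job.max.failed.tasks.percent", 5);
      System.out.println(conf.getInt("bsp.job.max.failed.tasks.percent", 0));
    }
  }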

b) If we have a JVM error on the GroomServer or in a BSPPeer process, we can
retry the execution of the tasks as in the case of hardware failure.

I agree with Chiahung that we should focus on the recovery logic and how to
start it. With the current design, it could be handled with the following
changes:

1. The GroomServer should get a new directive for a recovery task (different
from a new-task directive).
2. BSPPeerImpl should have a new flavor of initialization for a recovery
task (a rough sketch follows below), where:
    - it has a non-negative superstep to start with
    - the partition and input file come from the checkpointed data
    - the messages are read from the partition and the input queue is filled
with them if superstep > 0
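
A rough sketch of that second flavor of initialization, just to pin the idea
down; the field and method names (superstep, localQueue, the byte[] message
form) are placeholders rather than current BSPPeerImpl code:

  // Hypothetical recovery-mode initialization for BSPPeerImpl.
  void initializeForRecovery(long checkpointedSuperstep,
                             Iterable<byte[]> checkpointedMessages) {
    // 1. Start from the checkpointed superstep, not from -1.
    this.superstep = checkpointedSuperstep;

    // 2. The partition and input split come from the checkpointed data,
    //    not from a fresh read of the job input.

    // 3. Refill the local input queue from the checkpointed messages,
    //    but only if we are past superstep 0.
    if (checkpointedSuperstep > 0) {
      for (byte[] message : checkpointedMessages) {
        localQueue.add(message);
      }
    }
  }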

We should provide a Map to users to store the global state that should be
recovered on failure.
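
For example, something as simple as a string-keyed map that the framework
checkpoints and restores along with the messages; the getRecoverableState()
accessor below is invented for illustration and does not exist in Hama:

  // Hypothetical user-facing API for recoverable global state.
  Map<String, Writable> state = peer.getRecoverableState();
  state.put("verticesProcessed", new LongWritable(42));
  // On recovery, the framework would repopulate this map from the last
  // checkpoint, so the value survives a task restart.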
I will have some time from tomorrow evening onwards; I shall get things more
organized then.

Thanks,
Suraj

On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin <clin4j@googlemail.com>
wrote:
>
> It would be simpler for the first version of fault tolerance to focus
> on rollback to the level of the beginning of a superstep. For
> example, say a job executes to the middle of the 12th superstep and then a
> task fails. The system should restart that task (assuming a checkpoint on
> every superstep, so that fortunately the system has the latest snapshot of,
> e.g., the 11th superstep for that task) and put the necessary messages
> (checkpointed at the 11th superstep, which would be transferred to the
> 12th superstep) back to the task so that the failed task can restart
> from the 12th superstep.
>
> If later on the community wants checkpointing at a more specific level of
> detail within a task, we can probably exploit something like
> incremental checkpointing [1] or the Memento pattern, which requires users'
> assistance, to restart a task from a local checkpoint.
>
> [1]. Efficient Incremental Checkpointing of Java Programs.
> hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
>
> On 14 March 2012 15:29, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > If a user does something with the communication APIs in a while read.next()
> > loop or in memory, recovery is not simple. So in my opinion, first of
> > all, we should separate (or hide) the data handlers away from the bsp()
> > function, like M/R or Pregel do. For example,
> >
> > ftbsp(Communicator comm);
> > setup(DataInput in);
> > close(DataOutput out);
> >
> > And then, maybe we can design the flow of checkpoint-based (Task
> > Failure) recovery like this:
> >
> > 1. If some task fails to execute the setup() or close() functions, just
> > re-attempt it, and finally return a "job failed" message.
> >
> > 2. If some task fails in the middle of processing and has to be
> > re-launched, the statuses of JobInProgress and TaskInProgress should
> > be changed.
> >
> > 3. And, in every step, all tasks should check whether to roll back to an
> > earlier checkpoint or keep running (or keep waiting to leave the
> > barrier).
> >
> > 4. Re-launch a failed task.
> >
> > 5. Change the status to RUNNING from ROLLBACK or RECOVERY.
> >
> > On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
> > <thomas.jungblut@googlemail.com> wrote:
> >> Ah yes, good points.
> >>
> >> If we don't have a checkpoint from the current superstep, we have to do a
> >> global rollback of the last known messages.
> >> So we shouldn't offer this configurability through the BSPJob API; this is
> >> for specialized users only.
> >>
> >>> One more issue that I have in mind is how we would be able to recover the
> >>> values of static variables that someone would be holding in each bsp job.
> >>> This scenario is a problem if a user is maintaining some static variable
> >>> state whose lifecycle spans across multiple supersteps.
> >>>
> >>
> >> Ideally you would transfer your shared state through the messages. I
> >> thought of making a backup function available in the BSP class where
> >> someone can back up their internal state, but I guess this is not how BSP
> >> should be written.
> >>
> >> Which does not mean that we don't want to provide this in the next
> >> releases.
> >>
> >> On 12 March 2012 at 09:01, Suraj Menon <menonsuraj5@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I want to understand single task rollback. So consider a scenario where
> >>> all tasks checkpoint every 5 supersteps. Now when one of the tasks failed
> >>> at superstep 7, it would have to recover from the checkpointed data at
> >>> superstep 5. How would it get messages from the peer BSPs at superstep 6
> >>> and 7?
> >>>
> >>> One more issue that I have in mind is how we would be able to recover the
> >>> values of static variables that someone would be holding in each bsp job.
> >>> This scenario is a problem if a user is maintaining some static variable
> >>> state whose lifecycle spans across multiple supersteps.
> >>>
> >>> Thanks,
> >>> Suraj
> >>>
> >>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <
> >>> thomas.jungblut@googlemail.com> wrote:
> >>>
> >>> > I guess we have to slice some issues needed for checkpoint recovery.
> >>> >
> >>> > In my opinion we have two types of recovery:
> >>> > - single task recovery
> >>> > - global recovery of all tasks
> >>> >
> >>> > And I guess we can simply make a rule:
> >>> > If a task fails inside our barrier sync method (since we have a double
> >>> > barrier, after enterBarrier() and before leaveBarrier()), we have to do a
> >>> > global recovery.
> >>> > Else we can just do a single task rollback.
> >>> >
> >>> > For those asking why we can't just always do a global rollback: it is too
> >>> > costly and we really do not need it in every case.
> >>> > But we need it in the case where a task fails inside the barrier (between
> >>> > enter and leave), just because a single rolled-back task can't trip the
> >>> > enterBarrier barrier.
> >>> >
> >>> > Anything I have forgotten?
> >>> >
> >>> >
> >>> > --
> >>> > Thomas Jungblut
> >>> > Berlin <thomas.jungblut@gmail.com>
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> Thomas Jungblut
> >> Berlin <thomas.jungblut@gmail.com>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
