incubator-hama-dev mailing list archives

From Suraj Menon <menonsur...@gmail.com>
Subject Re: Recovery Issues
Date Fri, 16 Mar 2012 12:33:20 GMT
I am planning to add the following subtasks under HAMA-505:

1. The BSP Peer should have the ability to start with a non-zero
   superstep from a partition of checkpointed messages for that task ID
   and attempt ID (a rough sketch follows below).

2. For a configurable number of attempts, the BSPMaster should direct
   the GroomServer to run the recovery task on failure. Implement a
   recovery directive from BSPMaster to GroomServer. The task should be
   started with state RECOVERING.

3. Checkpointing can be configured to run asynchronously to the sync.
   (This is in line with the last comment left by ChiaHung.)

4. Maintain global state on the last superstep for which checkpointing
   completed successfully.

5. Should we keep stats and logs for the failed task that was
   re-attempted?

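For (1), here is a minimal sketch of what the recovery-aware
initialization could look like. CheckpointReader, localQueue, and the
configuration key are hypothetical placeholders, not existing APIs:

  // Sketch only: assumes a hypothetical CheckpointReader that can
  // locate the partition of checkpointed messages for a given
  // (task ID, attempt ID) pair.
  public void initializeForRecovery(Configuration conf) throws IOException {
    // Resume from the last checkpointed superstep instead of -1.
    this.superstep = conf.getLong("bsp.recovery.superstep", -1);
    CheckpointReader reader =
        new CheckpointReader(conf, getTaskId(), getAttemptId());
    // Refill the input queue with the checkpointed messages before the
    // peer re-enters the superstep.
    for (BSPMessage msg : reader.readMessages(this.superstep)) {
      localQueue.add(msg);
    }
  }
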
Please share your opinions.

Thanks,
Suraj

On Wed, Mar 14, 2012 at 5:19 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:

> Nice plan.
>
> On Thu, Mar 15, 2012 at 3:58 AM, Thomas Jungblut
> <thomas.jungblut@googlemail.com> wrote:
> > Great we started a nice discussion about it.
> >
> >> If the user does something with communication APIs in a while
> >> (read.next()) loop or in memory, recovery is not simple. So in my
> >> opinion, first of all, we should separate (or hide) the data handlers
> >> away from the bsp() function, as in M/R or Pregel. For example,
> >
> >
> > Yes, but do you know what? We can simply add another layer (a retry
> > layer) onto the MessageService instead of inventing new fancy methods
> > and classes. Hiding is the key.
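> >
> > Just to make that concrete, a rough sketch of the retry layer (the
> > MessageService interface and everything else here is made up for
> > illustration):
> >
> >   // Hypothetical decorator that retries sends transparently, so
> >   // that bsp() code never sees transient communication failures.
> >   public class RetryingMessageService implements MessageService {
> >     private final MessageService delegate;
> >     private final int maxRetries;
> >
> >     public RetryingMessageService(MessageService delegate, int retries) {
> >       this.delegate = delegate;
> >       this.maxRetries = retries;
> >     }
> >
> >     public void send(String peerName, BSPMessage msg) throws IOException {
> >       IOException last = null;
> >       for (int i = 0; i < maxRetries; i++) {
> >         try {
> >           delegate.send(peerName, msg);
> >           return;          // success, no retry needed
> >         } catch (IOException e) {
> >           last = e;        // transient failure, try again
> >         }
> >       }
> >       throw last;          // give up after maxRetries attempts
> >     }
> >   }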
> >
> > Otherwise, I totally agree with Chia-Hung. I will take a deeper look
> > into incremental checkpointing over the next few days.
> >
> >> a) If a user's bsp job fails or throws a runtime exception while
> >> running, the odds are that if we recover his task state again, the
> >> task would fail again, because we would be recovering his error
> >> state. Hadoop has a feature where we can tolerate failure of a
> >> certain configurable percentage of tasks during the computation. I
> >> think we should be open to the idea of users wanting to be oblivious
> >> to certain data.
> >
> >
> > Not necessarily. There are cases where user code may fail due to
> > unlucky timing, e.g. a downed service behind an API. That is unlike a
> > runtime exception such as ClassNotFoundException, which is very
> > unlikely to be fixed by several attempts.
> > I'm +1 for making it smarter than Hadoop and deciding from the
> > exception type whether something can be recovered or not. That said,
> > Hadoop is doing it right.
> >
> >> 2. BSPPeerImpl should have a new flavor of initialization for a
> >> recovery task, where:
> >>    - it has a non-negative superstep to start with
> >
> >
> > Yes, this must trigger a read of the last checkpointed messages.
> >
> >> We should provide a Map to the users to store their global state
> >> that should be recovered on failure.
> >
> >
> > This is really tricky, and I would put my effort into doing the
> > simplest thing first (sorry for the users). We can add recovery for
> > user internals later on.
> > In my opinion, internal state is not how BSP should be coded:
> > everything can be stateless, really!
> >
> > I know it is difficult at first, but I have observed that the less
> > state you store, the simpler the code gets, and that's what
> > frameworks are for.
> > So let's add this bad-style edge case in one of the upcoming releases
> > if there is really a demand for it.
> >
> > I'll take a bit of time on the weekend to write the JIRA issues for
> > the things we discussed here.
> >
> > I think we can really start implementing it right away; the simplest
> > case is very straightforward, and we can improve later on.
> >
> > On 14 March 2012 19:20, Suraj Menon <menonsuraj5@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> +1 on ChiaHung's comment about getting things right up to the
> >> beginning of the superstep on failure.
> >> I think we are all over the place (or at least I am) in our thinking
> >> on fault tolerance. I want to take a step backward.
> >> First we should decide the nature of the faults that we have to
> >> recover from. I am putting these in points below:
> >>
> >> 1. Hardware failure. In my opinion, this is the most important
> >> failure mode we should be working on.
> >> Why? Most other kinds of faults would happen because of errors in
> >> the user's program logic, or Hama framework bugs if any. I will get
> >> to these scenarios in the next point.
> >> Picking up from ChiaHung's mail, say we have 15 bsp tasks running in
> >> parallel in the 12th superstep. The nodes running tasks 5, 6, and 7
> >> failed during execution. We would have to roll those bsp tasks back
> >> from the data checkpointed in superstep 11, based on the task ID,
> >> attempt ID, and job number. This means the BSPChild executor in the
> >> GroomServer has to be told that the BSP Peer should be started in
> >> recovery mode and not assume superstep -1. It should then load the
> >> input message queue with the checkpointed messages. It would be
> >> given the partition that holds the checkpointed data for the task
> >> ID. All other tasks would wait at the sync barrier until the
> >> recovered tasks 5, 6, and 7 enter the sync of the 12th superstep to
> >> continue.
> >>
> >> 2. Software failure.
> >>
> >> a) If a user's bsp job fails or throws a runtime exception while
> >> running, the odds are that if we recover his task state again, the
> >> task would fail again, because we would be recovering his error
> >> state. Hadoop has a feature where we can tolerate failure of a
> >> certain configurable percentage of tasks during the computation. I
> >> think we should be open to the idea of users wanting to be oblivious
> >> to certain data.
> >>
> >> b) If we have a JVM error on the GroomServer or in the BSPPeer
> >> process, we can retry the execution of tasks as in the case of
> >> hardware failure.
> >>
> >> I agree with ChiaHung that we should focus on the recovery logic and
> >> how to start it. With the current design, it could be handled with
> >> the following changes (a rough sketch of the directive follows the
> >> list):
> >>
> >> 1. GroomServer should get a new directive for a recovery task
> >> (different from a new task).
> >> 2. BSPPeerImpl should have a new flavor of initialization for a
> >> recovery task, where:
> >>    - it has a non-negative superstep to start with
> >>    - the partition and input file come from the checkpointed data
> >>    - the messages are read from the partition and fill the input
> >> queue if superstep > 0
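> >>
> >> Modeled loosely on the existing launch/kill directives, the recovery
> >> directive could look roughly like this (the class, its fields, and
> >> the action type are made up; nothing like this exists yet):
> >>
> >>   // Hypothetical directive sent from BSPMaster to a GroomServer,
> >>   // telling it to re-launch a task in recovery mode.
> >>   public class RecoverTaskAction extends GroomServerAction {
> >>     private final TaskAttemptID attemptId; // task attempt to recover
> >>     private final long resumeSuperstep;    // last checkpointed superstep
> >>     private final Path checkpointDir;      // partition with its messages
> >>
> >>     public RecoverTaskAction(TaskAttemptID attemptId,
> >>         long resumeSuperstep, Path checkpointDir) {
> >>       super(ActionType.RECOVER_TASK);      // assumed new action type
> >>       this.attemptId = attemptId;
> >>       this.resumeSuperstep = resumeSuperstep;
> >>       this.checkpointDir = checkpointDir;
> >>     }
> >>   }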
> >>
> >> We should provide a Map to the users to store their global state that
> >> should be recovered on failure.
> >> I will have some time from tomorrow evening, and I shall get things
> >> more organized.
> >>
> >> Thanks,
> >> Suraj
> >>
> >> On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin <clin4j@googlemail.com>
> >> wrote:
> >> >
> >> > It would be simpler for the first version of fault tolerance to
> >> > focus on rolling back to the beginning of a superstep. For
> >> > example, when a job executes to the middle of the 12th superstep
> >> > and the task fails, the system should restart that task (assuming
> >> > a checkpoint on every superstep, and that the system fortunately
> >> > has the latest snapshot, e.g. of the 11th superstep, for that
> >> > task) and put the necessary messages (checkpointed at the 11th
> >> > superstep, to be transferred to the 12th superstep) back to the
> >> > task, so that the failed task can restart from the 12th superstep.
> >> >
> >> > If later on the community wants checkpointing at a finer level of
> >> > detail within a task, we can probably exploit something like
> >> > incremental checkpointing [1] or the Memento pattern, which
> >> > requires users' assistance, to restart a task from a local
> >> > checkpoint.
> >> >
> >> > [1]. Efficient Incremental Checkpointing of Java Programs.
> >> > hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
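> >> >
> >> > For illustration, the Memento idea in Java could be as small as
> >> > the following (all names are made up; the user supplies the
> >> > save/restore of whatever local state matters):
> >> >
> >> >   import org.apache.hadoop.io.Writable;
> >> >
> >> >   // Hypothetical user-assisted local checkpoint hook.
> >> >   public interface Checkpointable<M extends Writable> {
> >> >     M saveState();           // capture local state as a memento
> >> >     void restoreState(M m);  // reinstall it after a restart
> >> >   }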
> >> >
> >> > On 14 March 2012 15:29, Edward J. Yoon <edwardyoon@apache.org> wrote:
> >> > > If the user does something with communication APIs in a while
> >> > > (read.next()) loop or in memory, recovery is not simple. So in
> >> > > my opinion, first of all, we should separate (or hide) the data
> >> > > handlers away from the bsp() function, as in M/R or Pregel. For
> >> > > example,
> >> > >
> >> > > ftbsp(Communicator comm);
> >> > > setup(DataInput in);
> >> > > close(DataOutput out);
> >> > >
> >> > > And then, maybe we can design the flow of checkpoint-based (Task
> >> > > Failure) recovery like this:
> >> > >
> >> > > 1. If some task fails to execute the setup() or close()
> >> > > functions, just re-attempt it, and finally return a "job failed"
> >> > > message.
> >> > >
> >> > > 2. If some task fails in the middle of processing and it should
> >> > > be re-launched, the statuses of JobInProgress and TaskInProgress
> >> > > should be changed.
> >> > >
> >> > > 3. And, in every step, all tasks should check whether to roll
> >> > > back to an earlier checkpoint or keep running (or waiting to
> >> > > leave the barrier).
> >> > >
> >> > > 4. Re-launch a failed task.
> >> > >
> >> > > 5. Change the status to RUNNING from ROLLBACK or RECOVERY.
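> >> > >
> >> > > A tiny sketch of the state handling in steps 2 and 5 (the enum
> >> > > values other than RUNNING and the accessors on TaskInProgress
> >> > > are made up for illustration):
> >> > >
> >> > >   // Hypothetical task states for the recovery flow above.
> >> > >   public enum TaskState {
> >> > >     RUNNING, FAILED, ROLLBACK, RECOVERY
> >> > >   }
> >> > >
> >> > >   // Step 5: once the re-launched task has reloaded its
> >> > >   // checkpoint, it flips back to RUNNING.
> >> > >   void onRecoveryComplete(TaskInProgress tip) {
> >> > >     if (tip.getState() == TaskState.ROLLBACK
> >> > >         || tip.getState() == TaskState.RECOVERY) {
> >> > >       tip.setState(TaskState.RUNNING);
> >> > >     }
> >> > >   }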
> >> > >
> >> > > On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
> >> > > <thomas.jungblut@googlemail.com> wrote:
> >> > >> Ah yes, good points.
> >> > >>
> >> > >> If we don't have a checkpoint from the current superstep, we
> >> > >> have to do a global rollback to the last known messages.
> >> > >> So we shouldn't offer this configurability through the BSPJob
> >> > >> API; this is for specialized users only.
> >> > >>
> >> > >>> One more issue that I have in mind is how we would be able to
> >> > >>> recover the values of static variables that someone would be
> >> > >>> holding in each bsp job. This scenario is a problem if a user
> >> > >>> is maintaining some static variable state whose lifecycle
> >> > >>> spans multiple supersteps.
> >> > >>>
> >> > >>
> >> > >> Ideally you would transfer your shared state through the
> >> > >> messages. I thought of making a backup function available in
> >> > >> the BSP class where someone can back up their internal state,
> >> > >> but I guess this is not how BSP should be written.
> >> > >>
> >> > >> Which does not mean that we don't want to provide this in
> >> > >> future releases.
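> >> > >>
> >> > >> For example, instead of a static counter you would ship the
> >> > >> state to yourself every superstep; a sketch against the peer
> >> > >> API (LongMessage is a made-up message type):
> >> > >>
> >> > >>   // Carry the counter through the barrier instead of holding
> >> > >>   // it in a static field.
> >> > >>   peer.send(peer.getPeerName(), new LongMessage("counter", counter));
> >> > >>   peer.sync();
> >> > >>   counter = ((LongMessage) peer.getCurrentMessage()).getData();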
> >> > >>
> >> > >> On 12 March 2012 09:01, Suraj Menon <menonsuraj5@gmail.com> wrote:
> >> > >>
> >> > >>> Hello,
> >> > >>>
> >> > >>> I want to understand single task rollback. So consider a
> >> > >>> scenario where all tasks checkpoint every 5 supersteps. Now
> >> > >>> when one of the tasks fails at superstep 7, it would have to
> >> > >>> recover from the data checkpointed at superstep 5. How would
> >> > >>> it get the messages from the peer BSPs at supersteps 6 and 7?
> >> > >>>
> >> > >>> One more issue that I have in mind is how we would be able to
> >> > >>> recover the values of static variables that someone would be
> >> > >>> holding in each bsp job. This scenario is a problem if a user
> >> > >>> is maintaining some static variable state whose lifecycle
> >> > >>> spans multiple supersteps.
> >> > >>>
> >> > >>> Thanks,
> >> > >>> Suraj
> >> > >>>
> >> > >>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut
> >> > >>> <thomas.jungblut@googlemail.com> wrote:
> >> > >>>
> >> > >>> > I guess we have to slice out some issues needed for
> >> > >>> > checkpoint recovery.
> >> > >>> >
> >> > >>> > In my opinion we have two types of recovery:
> >> > >>> > - single task recovery
> >> > >>> > - global recovery of all tasks
> >> > >>> >
> >> > >>> > And I guess we can simply make a rule:
> >> > >>> > If a task fails inside our barrier sync method (since we
> >> > >>> > have a double barrier: after enterBarrier() and before
> >> > >>> > leaveBarrier()), we have to do a global recovery.
> >> > >>> > Else we can just do a single task rollback.
> >> > >>> >
> >> > >>> > For those asking why we can't always just do a global
> >> > >>> > rollback: it is too costly, and we really do not need it in
> >> > >>> > every case.
> >> > >>> > But we need it in the case where a task fails inside the
> >> > >>> > barrier (between enter and leave), just because a single
> >> > >>> > rolled-back task can't trip the enterBarrier barrier.
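> >> > >>> >
> >> > >>> > In pseudo-Java, the rule is simply this (recoverAllTasks and
> >> > >>> > recoverSingleTask are placeholders for whatever we build):
> >> > >>> >
> >> > >>> >   // A failure between enter and leave needs a global
> >> > >>> >   // recovery; anything else only a single-task rollback.
> >> > >>> >   boolean insideBarrier = false;
> >> > >>> >   try {
> >> > >>> >     enterBarrier();
> >> > >>> >     insideBarrier = true;
> >> > >>> >     leaveBarrier();
> >> > >>> >     insideBarrier = false;
> >> > >>> >   } catch (Exception e) {
> >> > >>> >     if (insideBarrier) {
> >> > >>> >       recoverAllTasks();    // global rollback
> >> > >>> >     } else {
> >> > >>> >       recoverSingleTask();  // cheap single-task rollback
> >> > >>> >     }
> >> > >>> >   }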
> >> > >>> >
> >> > >>> > Anything I have forgotten?
> >> > >>> >
> >> > >>> >
> >> > >>> > --
> >> > >>> > Thomas Jungblut
> >> > >>> > Berlin <thomas.jungblut@gmail.com>
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Thomas Jungblut
> >> > >> Berlin <thomas.jungblut@gmail.com>
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best Regards, Edward J. Yoon
> >> > > @eddieyoon
> >>
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <thomas.jungblut@gmail.com>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>
