hama-dev mailing list archives

From Suraj Menon <menonsur...@gmail.com>
Subject Re: Recovery Issues
Date Wed, 14 Mar 2012 19:06:18 GMT
+1

On Wed, Mar 14, 2012 at 2:58 PM, Thomas Jungblut <
thomas.jungblut@googlemail.com> wrote:

> Great, we've started a nice discussion about it.
>
> > If users do something with the communication APIs in a while
> > (read.next()) loop or in memory, recovery is not simple. So in my
> > opinion, first of all, we should separate (or hide) the data handlers
> > somewhere away from the bsp() function, like M/R or Pregel. For example,
>
>
> Yes, but you know what? We can simply add another layer (a retry layer)
> onto the MessageService instead of inventing new fancy methods and
> classes. Hiding is the key.
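>
> Something like this, perhaps (just a rough sketch; the interface and the
> message type are placeholders, not our actual API):
>
> import java.io.IOException;
>
> // Placeholder standing in for whatever the real message service
> // interface looks like; only here so the sketch is self-contained.
> interface MessageService {
>   void send(String peerName, Object msg) throws IOException;
> }
>
> // The retry layer, wrapped around the existing service.
> class RetryingMessageService implements MessageService {
>   private final MessageService delegate;
>   private final int maxRetries;
>
>   RetryingMessageService(MessageService delegate, int maxRetries) {
>     this.delegate = delegate;
>     this.maxRetries = maxRetries;
>   }
>
>   public void send(String peerName, Object msg) throws IOException {
>     IOException last = null;
>     for (int attempt = 0; attempt < maxRetries; attempt++) {
>       try {
>         delegate.send(peerName, msg);
>         return;
>       } catch (IOException e) {
>         last = e; // transient failure, try again
>       }
>     }
>     // exhausted the retries, let the task-level recovery take over
>     throw last != null ? last : new IOException("maxRetries must be >= 1");
>   }
> }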
>
> Otherwise I totally agree, Chia-Hung. I will take a deeper look into
> incremental checkpointing over the next few days.
>
> > a) If a user's BSP job fails or throws a runtime exception while
> > running, the odds are that if we recover its task state again, the task
> > would fail again because we would be recovering its error state. Hadoop
> > has a feature where we can tolerate the failure of a certain
> > configurable percentage of tasks during computation. I think we should
> > be open to the idea of users wanting to be oblivious to certain data.
>
>
> Not necessarily. There are cases where user code may fail due to unlucky
> timing, e.g. a downed service behind an API. That is not a runtime
> exception like ClassNotFoundException, which is very unlikely to be fixed
> by several attempts.
> I'm +1 for making it smarter than Hadoop and deciding, based on the
> exception type, whether something can be recovered or not. However,
> Hadoop is doing it right.
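>
> Roughly what I have in mind (a sketch only; the class name is invented):
>
> // Decide whether a failed task is worth re-attempting, based on the
> // exception that killed it. Transient I/O trouble -> retry; a missing
> // class or a similar deployment problem -> fail fast.
> final class FailureClassifier {
>   static boolean isRecoverable(Throwable t) {
>     if (t instanceof ClassNotFoundException || t instanceof LinkageError) {
>       return false; // broken deployment, retrying won't help
>     }
>     if (t instanceof java.io.IOException) {
>       return true; // e.g. a downed remote service, worth another attempt
>     }
>     // default: behave like Hadoop and allow a limited number of attempts
>     return true;
>   }
> }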
>
> > 2. BSPPeerImpl should have a new flavor of initialization for a
> > recovery task, where:
> >    - it has a non-negative superstep to start with
>
>
> Yes, this must trigger a read of the last checkpointed messages.
>
> > We should provide a Map to the users to store their global state that
> > should be recovered on failure.
>
>
> This is really, really tricky and I would put my effort into doing the
> simplest thing first (sorry for the users). We can add recovery for user
> internals later on.
> In my opinion internal state is not how BSP should be coded: everything
> can be stateless, really!
>
> I know it is difficult the first time, but I have observed that the less
> state you store, the simpler the code gets, and that's what frameworks
> are for.
> So let's add this bad-style edge case in one of the upcoming releases if
> there is really a demand for it.
>
> I'll take a bit of time on the weekend to write up the JIRA issues for
> the things we discussed here.
>
> I think we can really start implementing it right away; the simplest case
> is very straightforward and we can improve later on.
>
> On 14 March 2012 19:20, Suraj Menon <menonsuraj5@gmail.com> wrote:
>
> > Hello,
> >
> > +1 on Chia-Hung's comment on rolling back to the beginning of the
> > superstep on failure.
> > I think we are all over the place (or at least I am) in our thinking on
> > fault tolerance. I want to take a step back.
> > First we should decide the nature of the faults that we have to recover
> > from. I am putting these in points below:
> >
> > 1. Hardware failure - In my opinion, this is the most important failure
> > cause we should be working on.
> > Why? Most other kinds of faults would happen because of errors in the
> > user's program logic or Hama framework bugs, if any. I will get to these
> > scenarios in the next point.
> > Picking up from Chia-Hung's mail, say we have 15 BSP tasks running in
> > parallel in the 12th superstep. Out of them, the nodes running tasks 5,
> > 6 and 7 failed during execution. We would have to roll those BSP tasks
> > back from the checkpointed data of superstep 11, based on the task id,
> > attempt id and job number. This would mean that the BSPChild executor in
> > the GroomServer has to be told that the BSP peer should be started in
> > recovery mode and not assume superstep -1. It should then load the input
> > message queue with the checkpointed messages. It would be provided with
> > the partition that holds the checkpointed data for its task id. All
> > other tasks would be waiting in the sync barrier until the recovered
> > tasks 5, 6 and 7 enter the sync of the 12th superstep to continue.
> >
> > 2. Software failure.
> >
> > a) If a user's BSP job fails or throws a runtime exception while
> > running, the odds are that if we recover its task state again, the task
> > would fail again because we would be recovering its error state. Hadoop
> > has a feature where we can tolerate the failure of a certain
> > configurable percentage of tasks during computation. I think we should
> > be open to the idea of users wanting to be oblivious to certain data.
> >
> > b) If we have a JVM error on the GroomServer or in the BSPPeer process,
> > we can retry the execution of the tasks as in the case of hardware
> > failure.
> >
> > I agree with Chia-Hung on focusing on the recovery logic and how to
> > start it. With the current design, it could be handled with the
> > following changes.
> >
> > 1. GroomServer should get a new directive for a recovery task (different
> > from a new task).
> > 2. BSPPeerImpl should have a new flavor of initialization for a recovery
> > task (see the sketch below), where:
> >    - it has a non-negative superstep to start with
> >    - the partition and input file come from the checkpointed data
> >    - the messages are read from the partition and the input queue is
> > filled with them if superstep > 0
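> >
> > Very roughly, something like this (everything here is invented for
> > illustration, it is not the real BSPPeerImpl API):
> >
> > import java.io.IOException;
> > import java.util.ArrayDeque;
> > import java.util.List;
> > import java.util.Queue;
> >
> > // Sketch of recovery-mode initialization: instead of starting at
> > // superstep -1 with an empty queue, the peer resumes from the last
> > // checkpointed superstep and refills its local message queue.
> > class RecoveringPeer {
> >   long superstep = -1;                      // normal start value
> >   final Queue<Object> localQueue = new ArrayDeque<Object>();
> >
> >   // checkpointedMessages stands in for whatever is read back from the
> >   // checkpointed partition for this task id.
> >   void initializeForRecovery(long checkpointedSuperstep,
> >                              List<Object> checkpointedMessages)
> >       throws IOException {
> >     this.superstep = checkpointedSuperstep; // non-negative, not -1
> >     if (checkpointedSuperstep > 0) {
> >       // put back the messages checkpointed at the end of the previous
> >       // superstep before bsp() is re-entered
> >       localQueue.addAll(checkpointedMessages);
> >     }
> >   }
> > }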
> >
> > We should provide a Map to the users to store their global state that
> > should be recovered on failure.
> > I will get some time from tomorrow evening. I shall get things more
> > organized.
> >
> > Thanks,
> > Suraj
> >
> > On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin <clin4j@googlemail.com>
> > wrote:
> > >
> > > It would be simpler for the first version of fault tolerance to focus
> > > on rolling back to the beginning of a superstep. For example, when a
> > > job executes to the middle of the 12th superstep and a task fails, the
> > > system should restart that task (assuming we checkpoint on every
> > > superstep and, fortunately, the system has the latest snapshot, e.g.
> > > of the 11th superstep, for that task) and put the necessary messages
> > > (checkpointed at the 11th superstep, which would be transferred to the
> > > 12th superstep) back to the task so that the failed task can restart
> > > from the 12th superstep.
> > >
> > > If later on the community wants checkpointing at a finer level of
> > > detail within a task, we can probably exploit something like
> > > incremental checkpointing [1] or the Memento pattern, which requires
> > > the users' assistance, to restart a task from a local checkpoint.
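> > >
> > > In the spirit of the Memento pattern it could look roughly like this
> > > (a sketch only, these interfaces do not exist in Hama):
> > >
> > > // The user hands the framework an opaque, serializable snapshot of
> > > // the state the framework cannot see, and gets it back on recovery.
> > > interface StateMemento extends org.apache.hadoop.io.Writable { }
> > >
> > > interface CheckpointAssisted {
> > >   StateMemento saveState();           // called at checkpoint time
> > >   void restoreState(StateMemento m);  // called when the task restarts
> > > }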
> > >
> > > [1]. Efficient Incremental Checkpointing of Java Programs.
> > > hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
> > >
> > > On 14 March 2012 15:29, Edward J. Yoon <edwardyoon@apache.org> wrote:
> > > > If users do something with the communication APIs in a while
> > > > (read.next()) loop or in memory, recovery is not simple. So in my
> > > > opinion, first of all, we should separate (or hide) the data handlers
> > > > somewhere away from the bsp() function, like M/R or Pregel. For example,
> > > >
> > > > ftbsp(Communicator comm);
> > > > setup(DataInput in);
> > > > close(DataOutput out);
> > > >
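> > > > In code, the user-facing class could look roughly like this (only a
> > > > sketch, none of it exists yet; "Communicator" is just a placeholder
> > > > name for the messaging handle):
> > > >
> > > > import java.io.DataInput;
> > > > import java.io.DataOutput;
> > > >
> > > > interface Communicator { }
> > > >
> > > > // The framework owns input and output handling; the user only sees
> > > > // the communicator inside ftbsp().
> > > > abstract class FaultTolerantBSP {
> > > >   public void setup(DataInput in) { }    // read input, restore state
> > > >   public abstract void ftbsp(Communicator comm);
> > > >   public void close(DataOutput out) { }  // write output
> > > > }
> > > >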
> > > > And then, maybe we can design the flow of checkpoint-based (Task
> > > > Failure) recovery like this:
> > > >
> > > > 1. If some task fails to execute the setup() or close() functions,
> > > > just re-attempt it; if that keeps failing, finally return a "job
> > > > failed" message.
> > > >
> > > > 2. If some task fails in the middle of processing and it should be
> > > > re-launched, the statuses of JobInProgress and TaskInProgress should
> > > > be changed.
> > > >
> > > > 3. In every step, all tasks should check whether to roll back to an
> > > > earlier checkpoint or keep running (or keep waiting to leave the
> > > > barrier).
> > > >
> > > > 4. Re-launch the failed task.
> > > >
> > > > 5. Change the status from ROLLBACK or RECOVERY back to RUNNING.
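> > > >
> > > > As a sketch, the status handling could be something like this (the
> > > > state names are only for illustration):
> > > >
> > > > // Possible task states and the transition applied at a superstep
> > > > // boundary; none of these names exist in the code base yet.
> > > > enum TaskState { RUNNING, FAILED, ROLLBACK, RECOVERY, SUCCEEDED }
> > > >
> > > > class StateTransition {
> > > >   static TaskState onSuperstepBoundary(TaskState current,
> > > >                                        boolean rollbackRequested) {
> > > >     if (rollbackRequested && current == TaskState.RUNNING) {
> > > >       return TaskState.ROLLBACK; // step 3: roll back to a checkpoint
> > > >     }
> > > >     if (current == TaskState.ROLLBACK || current == TaskState.RECOVERY) {
> > > >       return TaskState.RUNNING;  // step 5: resume normal execution
> > > >     }
> > > >     return current;
> > > >   }
> > > > }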
> > > >
> > > > On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
> > > > <thomas.jungblut@googlemail.com> wrote:
> > > >> Ah yes, good points.
> > > >>
> > > >> If we don't have a checkpoint from the current superstep we have to
> > > >> do a global rollback to the last known messages.
> > > >> So we shouldn't offer this configurability through the BSPJob API;
> > > >> this is for specialized users only.
> > > >>
> > > >>> One more issue that I have in mind is how we would be able to
> > > >>> recover the values of static variables that someone would be
> > > >>> holding in each BSP job. This scenario is a problem if a user is
> > > >>> maintaining some static variable state whose lifecycle spans
> > > >>> multiple supersteps.
> > > >>>
> > > >>
> > > >> Ideally you would transfer your shared state through the messages. I
> > > >> thought of making a backup function available in the BSP class where
> > > >> someone can back up their internal state, but I guess this is not how
> > > >> BSP should be written.
> > > >>
> > > >> Which does not mean that we don't want to provide this in future
> > > >> releases.
> > > >>
> > > >> On 12 March 2012 09:01, Suraj Menon <menonsuraj5@gmail.com> wrote:
> > > >>
> > > >>> Hello,
> > > >>>
> > > >>> I want to understand single task rollback. So consider a scenario
> > > >>> where all tasks checkpoint every 5 supersteps. Now when one of the
> > > >>> tasks fails at superstep 7, it would have to recover from the
> > > >>> checkpointed data at superstep 5. How would it get the messages
> > > >>> from the peer BSPs at supersteps 6 and 7?
> > > >>>
> > > >>> One more issue that I have in mind is how we would be able to
> > > >>> recover the values of static variables that someone would be
> > > >>> holding in each BSP job. This scenario is a problem if a user is
> > > >>> maintaining some static variable state whose lifecycle spans
> > > >>> multiple supersteps.
> > > >>>
> > > >>> Thanks,
> > > >>> Suraj
> > > >>>
> > > >>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <
> > > >>> thomas.jungblut@googlemail.com> wrote:
> > > >>>
> > > >>> > I guess we have to slice out the issues needed for checkpoint
> > > >>> > recovery.
> > > >>> >
> > > >>> > In my opinion we have two types of recovery:
> > > >>> > - single task recovery
> > > >>> > - global recovery of all tasks
> > > >>> >
> > > >>> > And I guess we can simply make a rule:
> > > >>> > If a task fails inside our barrier sync method (since we have a
> > > >>> > double barrier, after enterBarrier() and before leaveBarrier()),
> > > >>> > we have to do a global recovery.
> > > >>> > Else we can just do a single task rollback.
> > > >>> >
> > > >>> > For those asking why we can't just always do a global rollback:
> > > >>> > it is too costly and we really do not need it in every case.
> > > >>> > But we do need it in the case where a task fails inside the
> > > >>> > barrier (between enter and leave), simply because a single
> > > >>> > rolled-back task can't trip the enterBarrier barrier.
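> > > >>> >
> > > >>> > In pseudo-Java the rule is roughly the following (a sketch only,
> > > >>> > none of these types exist yet):
> > > >>> >
> > > >>> > enum RecoveryScope { SINGLE_TASK, GLOBAL }
> > > >>> >
> > > >>> > class RecoveryPolicy {
> > > >>> >   // insideBarrierSync is true when the failure occurred after
> > > >>> >   // enterBarrier() and before leaveBarrier()
> > > >>> >   static RecoveryScope decide(boolean insideBarrierSync) {
> > > >>> >     if (insideBarrierSync) {
> > > >>> >       // the other peers are stuck in the double barrier, so
> > > >>> >       // everybody has to roll back together
> > > >>> >       return RecoveryScope.GLOBAL;
> > > >>> >     }
> > > >>> >     // otherwise only the failed task is rolled back and re-launched
> > > >>> >     return RecoveryScope.SINGLE_TASK;
> > > >>> >   }
> > > >>> > }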
> > > >>> >
> > > >>> > Anything I have forgotten?
> > > >>> >
> > > >>> >
> > > >>> > --
> > > >>> > Thomas Jungblut
> > > >>> > Berlin <thomas.jungblut@gmail.com>
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Thomas Jungblut
> > > >> Berlin <thomas.jungblut@gmail.com>
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards, Edward J. Yoon
> > > > @eddieyoon
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
>
