incubator-hama-dev mailing list archives

From Thomas Jungblut <thomas.jungb...@googlemail.com>
Subject Recovery Issues
Date Sat, 10 Mar 2012 09:11:38 GMT
I guess we have to split out some issues for the work needed for checkpoint
recovery.

In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks

And I guess we can simply make a rule:
If a task fails inside our barrier sync method (since we have a double
barrier, that means after enterBarrier() and before leaveBarrier()), we have
to do a global recovery.
Otherwise a single task rollback is enough.
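
As a rough sketch, the rule above could be written down like this. All
names here (RecoveryPolicy, TaskPhase, RecoveryType, decide) are
hypothetical and only illustrate the decision; they are not taken from the
Hama code base.

```java
// Hypothetical sketch of the proposed rule; not existing Hama code.
final class RecoveryPolicy {

    /** Where a task was in the superstep when it failed (assumed states). */
    enum TaskPhase {
        COMPUTING,      // outside the barrier sync, before enterBarrier()
        INSIDE_BARRIER  // after enterBarrier() and before leaveBarrier()
    }

    /** The two kinds of recovery named in this mail. */
    enum RecoveryType { SINGLE_TASK, GLOBAL }

    /** A failure inside the double barrier needs a global recovery. */
    static RecoveryType decide(TaskPhase phaseAtFailure) {
        return phaseAtFailure == TaskPhase.INSIDE_BARRIER
                ? RecoveryType.GLOBAL
                : RecoveryType.SINGLE_TASK;
    }
}
```

How the framework would actually detect the phase at failure time is left
open here; the sketch just assumes that it is known.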

For those asking why we can't just always do a global rollback: it is too
costly, and we do not need it in every case.
But we do need it when a task fails inside the barrier (between enter and
leave), because a single rolled-back task cannot trip the enterBarrier()
barrier on its own.
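
To illustrate why a lone task cannot trip the barrier, here is a small
sketch that uses java.util.concurrent.CyclicBarrier as a stand-in. This is
only an analogy and an assumption on my side, not our actual sync
implementation; the class and method names are made up.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo: CyclicBarrier stands in for the enterBarrier() barrier.
final class LoneTaskBarrierDemo {

    /**
     * A single task arrives at a barrier sized for `parties` tasks.
     * Returns true if the barrier trips within the timeout, false otherwise.
     */
    static boolean loneTaskTripsBarrier(int parties, long timeoutMillis) {
        CyclicBarrier enterBarrier = new CyclicBarrier(parties);
        try {
            // Only the rolled-back task arrives; the other tasks never do.
            enterBarrier.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | BrokenBarrierException e) {
            return false; // nobody else arrived, so the barrier never tripped
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag set
            return false;
        }
    }
}
```

With three parties, loneTaskTripsBarrier(3, 100) gives up after the timeout,
while with one party the barrier trips at once. The rolled-back task is in
the first situation: the other tasks are already past enterBarrier(), so
they will not arrive there again.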

Anything I have forgotten?


-- 
Thomas Jungblut
Berlin <thomas.jungblut@gmail.com>