incubator-hama-dev mailing list archives

From Thomas Jungblut
Subject Recovery Issues
Date Sat, 10 Mar 2012 09:11:38 GMT
I guess we have to break the work needed for checkpoint recovery into
separate issues.

In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks

And I guess we can simply make a rule:
If a task fails inside our barrier sync method (that is, after
enterBarrier() and before leaveBarrier(), since we have a double barrier),
we have to do a global recovery.
Otherwise we can just do a single task rollback.
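
To make the rule concrete, here is a minimal sketch in Python. It is only an
illustration: Phase, Recovery and choose_recovery are hypothetical names, not
part of Hama, and the sketch assumes we can tell whether the failed task was
between enterBarrier() and leaveBarrier() at the time of failure.

```python
from enum import Enum


class Phase(Enum):
    """Where a task was when it failed (hypothetical names)."""
    COMPUTING = "computing"    # outside the barrier sync method
    IN_BARRIER = "in_barrier"  # after enterBarrier(), before leaveBarrier()


class Recovery(Enum):
    """The two recovery types described in this message."""
    SINGLE_TASK = "single_task"  # roll back only the failed task
    GLOBAL = "global"            # roll back all tasks


def choose_recovery(phase_at_failure: Phase) -> Recovery:
    """Apply the proposed rule: global recovery only for in-barrier failures."""
    if phase_at_failure is Phase.IN_BARRIER:
        return Recovery.GLOBAL
    return Recovery.SINGLE_TASK
```

With this sketch, a failure during computation maps to Recovery.SINGLE_TASK
and a failure inside the barrier maps to Recovery.GLOBAL.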

For those asking why we can't just always do a global rollback: it is too
costly, and we do not need it in every case.
But we do need it when a task fails inside the barrier (between enter and
leave), because a single rolled-back task can't trip the

Anything I have forgotten?

Thomas Jungblut
Berlin
