incubator-hama-dev mailing list archives

From Thomas Jungblut <thomas.jungb...@googlemail.com>
Subject Fault Tolerance in 0.5.0
Date Thu, 02 Feb 2012 11:39:14 GMT
Hey,

I had a bit of time to go through the jira issues and sort out several
things related to Fault Tolerance.

Here are my results:

Fault Tolerance in Hama (all jiras related):

[HAMA-199] Add fault tolerance to BSPPeer < CLOSE, too generic
[HAMA-445] Make configurable checkpointing
[HAMA-440] Features required in recovery procedure.
[HAMA-498] BSPTask should periodically ping its parent.

Then I have split this into two main parts, "Detect Failure" and "Solve
Failure":

Detect Failure:
[HAMA-370] Failure detector for Hama < Nearly complete?
[HAMA-498] BSPTask should periodically ping its parent.
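
To make the ping idea in HAMA-498 concrete, here is a minimal sketch of how a
parent could track pings and flag silent tasks. PingMonitor, recordPing and
suspectedFailures are hypothetical names made up for illustration; this is not
Hama's actual API, and the timeout handling is only an assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; none of these names exist in Hama.
final class PingMonitor {
    private final long timeoutMillis;
    // Last time (in milliseconds) each task reported in.
    private final Map<String, Long> lastPing = new ConcurrentHashMap<>();

    PingMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the parent whenever a task pings it.
    void recordPing(String taskId, long nowMillis) {
        lastPing.put(taskId, nowMillis);
    }

    // Returns tasks whose last ping is older than the timeout.
    List<String> suspectedFailures(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastPing.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

Passing the current time in as a parameter keeps the check easy to test; a
real detector would probably read a clock and run on a timer.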

Solve Failure:
[HAMA-445] Make configurable checkpointing
> TODO:
> Groom needs functionality to restart a task
> BSPMaster needs functionality to restart a groom

There is also a MISC group, which is only loosely related.

MISC:
[HAMA-445] Make configurable checkpointing
[HAMA-440] Features required in recovery procedure.
> TODO mainly discussion:
> New BSP "interface", with a chaining of supersteps to make restarting
tasks simpler (contained in 440)
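
As a starting point for that discussion, here is a rough sketch of what
chained supersteps could look like. Superstep, SuperstepChain and runFrom are
hypothetical names, and the single long value stands in for whatever state a
checkpoint would hold; the actual proposal in 440 may look quite different:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; not an existing Hama type.
interface Superstep {
    // Takes the task state before the step and returns the state after it.
    long compute(long state);
}

final class SuperstepChain {
    private final List<Superstep> steps = new ArrayList<>();

    // Appends a superstep and returns the chain so calls can be chained.
    SuperstepChain then(Superstep step) {
        steps.add(step);
        return this;
    }

    // Runs the supersteps from startIndex to the end of the chain.
    // A restarted task would pass the index and state of its last checkpoint.
    long runFrom(int startIndex, long state) {
        for (int i = startIndex; i < steps.size(); i++) {
            state = steps.get(i).compute(state);
        }
        return state;
    }
}
```

With explicit steps, restarting a task comes down to calling runFrom with the
checkpointed superstep index instead of replaying the whole computation.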


Let's make an umbrella jira for this larger task and close 199, since this
is way too generic and too old.
We should also split 440, because it combines too many unrelated things.

Also, "Lin" is assigned to the majority of them. What is your progress? And do
you mind splitting these?

[LINKS]
https://issues.apache.org/jira/browse/HAMA-440
https://issues.apache.org/jira/browse/HAMA-199
https://issues.apache.org/jira/browse/HAMA-445
https://issues.apache.org/jira/browse/HAMA-370
https://issues.apache.org/jira/browse/HAMA-498

-- 
Thomas Jungblut
Berlin <thomas.jungblut@gmail.com>
