Subject: Re: Fault Tolerance in 0.5.0
From: "Edward J. Yoon"
To: hama-dev@incubator.apache.org
Date: Fri, 3 Feb 2012 10:23:58 +0900

+1

On Thu, Feb 2, 2012 at 8:39 PM, Thomas Jungblut wrote:
> Hey,
>
> I had a bit of time to go through the jira issues and sort out several
> things related to Fault Tolerance.
>
> Here are my results:
>
> Fault Tolerance in Hama (all jiras related):
>
> [HAMA-199] Add fault tolerance to BSPPeer < CLOSE, too generic
> [HAMA-445] Make configurable checkpointing
> [HAMA-440] Features required in recovery procedure.
> [HAMA-498] BSPTask should periodically ping its parent.
>
> Then I split this into two main parts, "Detect Failure" and "Solve
> Failure":
>
> Detect Failure:
> [HAMA-370] Failure detector for Hama < Nearly complete?
> [HAMA-498] BSPTask should periodically ping its parent.
>
> Solve Failure:
> [HAMA-445] Make configurable checkpointing
>> TODO:
>> Groom needs functionality to restart a task
>> BSPMaster needs functionality to restart a groom
>
> There is also MISC, which is not strongly related.
>
> MISC:
> [HAMA-445] Make configurable checkpointing
> [HAMA-440] Features required in recovery procedure.
>> TODO mainly discussion:
>> New BSP "interface", with chaining of supersteps to make restarting
> tasks simpler (contained in 440)
>
>
> Let's make an umbrella jira for this larger task and close 199, since it
> is way too generic and too old.
> We should also split 440, because it combines too many unrelated things.
>
> Also, "Lin" is assigned to the majority of them. What is your progress? And do
> you mind splitting these?
>
> [LINKS]
> https://issues.apache.org/jira/browse/HAMA-440
> https://issues.apache.org/jira/browse/HAMA-119
> https://issues.apache.org/jira/browse/HAMA-445
> https://issues.apache.org/jira/browse/HAMA-440
> https://issues.apache.org/jira/browse/HAMA-370
> https://issues.apache.org/jira/browse/HAMA-498
>
> --
> Thomas Jungblut
> Berlin

--
Best Regards, Edward J. Yoon
@eddieyoon