From: Thomas Jungblut
To: hama-dev@incubator.apache.org
Reply-To: hama-dev@incubator.apache.org
Date: Sat, 4 Feb 2012 15:45:13 +0100
Subject: Re: Fault Tolerance in 0.5.0

Thanks. I just "refactored" our issue tracker ;)
Hope it wasn't too spammy.

2012/2/4 Chia-Hung Lin

> +1 It's good if we have an umbrella jira so we can track it more easily.
>
> Failure detection (HAMA-370) was already done and tested on my
> machines previously.
>
> First point in HAMA-440 is not needed because it has been integrated
> into bsp task.
>
> On 3 February 2012 09:38, Edward J. Yoon wrote:
> > We can also separate the issue into two parts: 1) cluster high
> > availability and 2) fault tolerant job processing. Only HAMA-370 is
> > related to 1).
> >
> > On Fri, Feb 3, 2012 at 10:23 AM, Edward J. Yoon wrote:
> >> +1
> >>
> >> On Thu, Feb 2, 2012 at 8:39 PM, Thomas Jungblut wrote:
> >>> Hey,
> >>>
> >>> I had a bit of time to go through the jira issues and sort out several
> >>> things related to Fault Tolerance.
> >>>
> >>> Here are my results:
> >>>
> >>> Fault Tolerance in Hama (all jiras related):
> >>>
> >>> [HAMA-199] Add fault tolerance to BSPPeer < CLOSE, too generic
> >>> [HAMA-445] Make configurable checkpointing
> >>> [HAMA-440] Features required in recovery procedure.
> >>> [HAMA-498] BSPTask should periodically ping its parent.
> >>>
> >>> Then I have split this into two main parts, "Detect Failure" and
> >>> "Solve Failure":
> >>>
> >>> Detect Failure:
> >>> [HAMA-370] Failure detector for Hama < Nearly complete?
> >>> [HAMA-498] BSPTask should periodically ping its parent.
> >>>
> >>> Solve Failure:
> >>> [HAMA-445] Make configurable checkpointing
> >>>> TODO:
> >>>> Groom needs functionality to restart a task
> >>>> BSPMaster needs functionality to restart a groom
> >>>
> >>> Also here is MISC, which is not strongly related.
> >>>
> >>> MISC:
> >>> [HAMA-445] Make configurable checkpointing
> >>> [HAMA-440] Features required in recovery procedure.
> >>>> TODO mainly discussion:
> >>>> New BSP "interface", with a chaining of supersteps to make restarting
> >>>> tasks simpler (contained in 440)
> >>>
> >>>
> >>> Let's make an umbrella jira for this larger task and close 199, since
> >>> this is way too generic and too old.
> >>> We should also split 440, because it combines too many unrelated things
> >>> together.
> >>>
> >>> Also "Lin" is assigned to the majority of them. What is your progress?
> >>> And do you mind splitting these?
> >>>
> >>> [LINKS]
> >>> https://issues.apache.org/jira/browse/HAMA-440
> >>> https://issues.apache.org/jira/browse/HAMA-119
> >>> https://issues.apache.org/jira/browse/HAMA-445
> >>> https://issues.apache.org/jira/browse/HAMA-370
> >>> https://issues.apache.org/jira/browse/HAMA-498
> >>>
> >>> --
> >>> Thomas Jungblut
> >>> Berlin
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
>

--
Thomas Jungblut
Berlin