Date: Thu, 2 Feb 2012 12:39:14 +0100
Message-ID:
Subject: Fault Tolerance in 0.5.0
From: Thomas Jungblut
To: hama-dev@incubator.apache.org
Content-Type: multipart/alternative; boundary=047d7b10d1194a04fd04b7f9a5a1

--047d7b10d1194a04fd04b7f9a5a1
Content-Type: text/plain; charset=ISO-8859-1

Hey,

I had a bit of time to go through the jira issues and sort out several things related to Fault Tolerance. Here are my results:

Fault Tolerance in Hama (all jiras related):
[HAMA-199] Add fault tolerance to BSPPeer < CLOSE, too generic
[HAMA-445] Make configurable checkpointing
[HAMA-440] Features required in recovery procedure.
[HAMA-498] BSPTask should periodically ping its parent.

Then I split this into two main parts, "Detect Failure" and "Solve Failure":

Detect Failure:
[HAMA-370] Failure detector for Hama < Nearly complete?
[HAMA-498] BSPTask should periodically ping its parent.

Solve Failure:
[HAMA-445] Make configurable checkpointing
> TODO:
> Groom needs functionality to restart a task
> BSPMaster needs functionality to restart a groom

Also here is MISC, which is not strongly related.

MISC:
[HAMA-445] Make configurable checkpointing
[HAMA-440] Features required in recovery procedure.
> TODO mainly discussion:
> New BSP "interface", with a chaining of supersteps to make restarting tasks simpler (contained in 440)

Let's make an umbrella jira for this larger task and close 199, since it is way too generic and too old.
We should also split 440, because it combines too many unrelated things.

Also, most of them are assigned to "Lin". What is your progress? And do you mind splitting these?
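
To make the "chaining of supersteps" item under MISC a bit more concrete, here is a minimal sketch of how such a chain could make restarts simple. It is only an illustration under assumptions: the names Superstep, SuperstepChain, then, run and completedSteps are hypothetical and are not Hama's API, and the checkpoint is kept in memory where a real system would presumably write it to stable storage.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface (not Hama's API): the work done between two barriers. */
interface Superstep<S> {
  S compute(S state);
}

/** Hypothetical runner: executes supersteps in order and checkpoints after each one. */
final class SuperstepChain<S> {
  private final List<Superstep<S>> steps = new ArrayList<>();
  private S checkpoint;      // state saved at the last completed barrier (in memory here)
  private int nextStep = 0;  // index of the first superstep that has not completed yet

  SuperstepChain(S initialState) {
    this.checkpoint = initialState;
  }

  /** Appends a superstep to the chain. */
  SuperstepChain<S> then(Superstep<S> step) {
    steps.add(step);
    return this;
  }

  /**
   * Runs the remaining supersteps starting from the last checkpoint.
   * If a superstep throws, the checkpoint and step index are left untouched,
   * so calling run() again restarts at the failed superstep only.
   */
  S run() {
    while (nextStep < steps.size()) {
      S result = steps.get(nextStep).compute(checkpoint); // may throw on failure
      checkpoint = result; // assumption: a real system would persist this
      nextStep++;
    }
    return checkpoint;
  }

  /** Number of supersteps that finished and were checkpointed. */
  int completedSteps() {
    return nextStep;
  }
}
```

Because each superstep is a separate object with an explicit input state, a restarted task only needs the last checkpoint and the index of the next superstep; it does not have to replay the earlier ones.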
[LINKS]
https://issues.apache.org/jira/browse/HAMA-440
https://issues.apache.org/jira/browse/HAMA-119
https://issues.apache.org/jira/browse/HAMA-445
https://issues.apache.org/jira/browse/HAMA-440
https://issues.apache.org/jira/browse/HAMA-370
https://issues.apache.org/jira/browse/HAMA-498

--
Thomas Jungblut
Berlin

--047d7b10d1194a04fd04b7f9a5a1--