Date: Sat, 10 Mar 2012 10:11:38 +0100
Subject: Recovery Issues
From: Thomas Jungblut
To: hama-dev@incubator.apache.org

I guess we have to split out some of the issues needed for checkpoint recovery.

In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks

And I guess we can simply make a rule: if a task fails inside our barrier sync method (since we have a double barrier, that means after enterBarrier() and before leaveBarrier()), we have to do a global recovery. Otherwise we can just do a single task rollback.

For those asking why we can't just always do a global rollback: it is too costly, and we do not need it in every case. But we do need it when a task fails inside the barrier (between enter and leave), because a single rolled-back task can't trip the enterBarrier() barrier on its own.

Have I forgotten anything?

-- 
Thomas Jungblut
Berlin
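
A minimal sketch of the rule proposed above, in Java. The names RecoveryRule, FailurePoint, RecoveryMode and decide() are hypothetical and only illustrate the decision; they are not existing Hama classes. Only enterBarrier() and leaveBarrier() come from the mail itself.

```java
// Hypothetical sketch: none of these types are part of Hama's API.
final class RecoveryRule {

    /** Where a task was, relative to the double barrier, when it failed. */
    enum FailurePoint {
        /** Before enterBarrier() was passed, or after leaveBarrier() returned. */
        OUTSIDE_BARRIER,
        /** After enterBarrier() and before leaveBarrier(). */
        INSIDE_BARRIER
    }

    /** The two types of recovery named in the mail. */
    enum RecoveryMode {
        SINGLE_TASK,
        GLOBAL
    }

    private RecoveryRule() {
        // Static helper only, no instances.
    }

    /**
     * Applies the proposed rule: a failure inside the barrier sync forces a
     * global recovery of all tasks; any other failure only needs the failed
     * task to be rolled back.
     */
    static RecoveryMode decide(FailurePoint point) {
        return point == FailurePoint.INSIDE_BARRIER
                ? RecoveryMode.GLOBAL
                : RecoveryMode.SINGLE_TASK;
    }
}
```

How a failure would actually be classified as inside or outside the barrier is left open here; the sketch assumes that information is already available to whoever calls decide().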