From: Yuanyuan Tian
To: user@giraph.apache.org
Date: Mon, 18 Mar 2013 10:22:28 -0700
Subject: Re: about fault tolerance in Giraph

Can anyone help me answer the question?

Yuanyuan



From: Yuanyuan Tian/Almaden/IBM@IBMUS
To: user@giraph.apache.org
Date: 03/15/2013 02:05 PM
Subject: about fault tolerance in Giraph




Hi

I was testing the fault tolerance of Giraph on a long-running job. I noticed that when one of the workers threw an exception, the whole job failed without retrying the task, even though I had turned on checkpointing and there were available map slots in my cluster. Why wasn't the fault tolerance mechanism working?


I was running a version of Giraph downloaded sometime in June 2012 and I used Netty for the communication layer.


Thanks,


Yuanyuan
