giraph-user mailing list archives

From Avery Ching <ach...@apache.org>
Subject Re: about fault tolerance in Giraph
Date Mon, 18 Mar 2013 22:03:59 GMT
How many retries did you set for Hadoop map task failures?  You might 
want to try 10.

Avery
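
The retry count asked about above is Hadoop's per-map-task attempt limit. 
Below is a minimal sketch of passing it as Hadoop generic "-D" options. It 
assumes the property is named mapred.map.max.attempts (Hadoop 1.x) or 
mapreduce.map.maxattempts (Hadoop 2.x and later), and that the job is 
launched through a runner that accepts generic options. The class and 
method names are hypothetical, and the code only builds option strings; it 
does not contact a cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds Hadoop generic "-D" options for map task retries. */
public class MapRetryOptions {
    // Hadoop 1.x name for the maximum number of attempts per map task.
    public static final String MAX_MAP_ATTEMPTS_V1 = "mapred.map.max.attempts";
    // Name used by Hadoop 2.x and later for the same setting.
    public static final String MAX_MAP_ATTEMPTS_V2 = "mapreduce.map.maxattempts";

    /** Returns "-Dkey=value" options that set both property names to maxAttempts. */
    public static List<String> retryOptions(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        List<String> options = new ArrayList<>();
        options.add("-D" + MAX_MAP_ATTEMPTS_V1 + "=" + maxAttempts);
        options.add("-D" + MAX_MAP_ATTEMPTS_V2 + "=" + maxAttempts);
        return options;
    }

    public static void main(String[] args) {
        // Print the options for the value suggested in the message above.
        System.out.println(String.join(" ", retryOptions(10)));
    }
}
```

Setting both names is only a convenience for the sketch; which one a given 
cluster honors depends on its Hadoop version.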

On 3/18/13 2:38 PM, Yuanyuan Tian wrote:
> Hi Avery,
>
> I was just testing how Giraph handles fault tolerance. I wrote a 
> simple algorithm that ran without a problem. Then I artificially 
> added a line of code to throw an IOException in the 12th superstep 
> when the task ID is 0001 and the attempt ID is 0000. The job reported 
> the expected IOException, but it could not recover from it. There was 
> no retry of the failed task, even though there were empty map slots 
> left in the cluster. Eventually, the whole job failed after timing out.
>
> Yuanyuan
>
>
>
> From: Avery Ching <aching@apache.org>
> To: user@giraph.apache.org
> Date: 03/18/2013 02:09 PM
> Subject: Re: about fault tolerance in Giraph
> ------------------------------------------------------------------------
>
>
>
> Hi Yuanyuan,
>
> We haven't tested this feature in a while.  But it should work.  What 
> did the job report about why it failed?
>
> Avery
>
> On 3/18/13 10:22 AM, Yuanyuan Tian wrote:
> Can anyone help me answer the question?
>
> Yuanyuan
>
>
>
> From: Yuanyuan Tian/Almaden/IBM@IBMUS
> To: user@giraph.apache.org
> Date: 03/15/2013 02:05 PM
> Subject: about fault tolerance in Giraph
> ------------------------------------------------------------------------
>
>
>
> Hi
>
> I was testing the fault tolerance of Giraph on a long-running job. I 
> noticed that when one of the workers threw an exception, the whole job 
> failed without retrying the task, even though I had turned on 
> checkpointing and there were available map slots in my cluster. Why 
> wasn't the fault tolerance mechanism working?
>
> I was running a version of Giraph downloaded sometime in June 2012 and 
> I used Netty for the communication layer.
>
> Thanks,
>
> Yuanyuan
>
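
The fault injection described in the thread (throw an IOException in 
superstep 12, only on the first attempt of one task) can be sketched as 
follows. This is not the code used in the original test. It assumes Hadoop 
task attempt IDs of the form attempt_<jobtracker>_<job>_<m|r>_<task>_<attempt>, 
and the class name, method names, and example ID are hypothetical. In a 
real job the superstep and attempt ID would come from the running task; 
here they are plain arguments so the logic can be exercised on its own.

```java
import java.io.IOException;

/** Hypothetical fault injector: fails one task once, at one superstep. */
public class FaultInjector {
    private final long failSuperstep;
    private final int failTaskId;

    public FaultInjector(long failSuperstep, int failTaskId) {
        this.failSuperstep = failSuperstep;
        this.failTaskId = failTaskId;
    }

    /** Parses an attempt ID string and returns {taskNumber, attemptNumber}. */
    public static int[] parseTaskAndAttempt(String attemptId) {
        String[] parts = attemptId.split("_");
        if (parts.length != 6 || !parts[0].equals("attempt")) {
            throw new IllegalArgumentException("Not a task attempt ID: " + attemptId);
        }
        return new int[] {Integer.parseInt(parts[4]), Integer.parseInt(parts[5])};
    }

    /** True only for the chosen superstep, the chosen task, and attempt 0. */
    public boolean shouldFail(long superstep, String attemptId) {
        int[] ids = parseTaskAndAttempt(attemptId);
        boolean firstAttempt = ids[1] == 0;
        return superstep == failSuperstep && ids[0] == failTaskId && firstAttempt;
    }

    /** Throws the injected IOException when shouldFail is true; otherwise does nothing. */
    public void maybeFail(long superstep, String attemptId) throws IOException {
        if (shouldFail(superstep, attemptId)) {
            throw new IOException(
                "Injected failure at superstep " + superstep + " in " + attemptId);
        }
    }
}
```

Because the check is limited to attempt 0, a retried attempt of the same 
task would pass through superstep 12 without failing, which is what makes 
the test useful for observing whether a retry is scheduled at all.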

