I have tried increasing the number of retries to 10, but the result is the same: no retry at all. Isn't retrying a failed task the default behavior for Hadoop? Why isn't it working in the case of Giraph?
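For reference, here is a minimal sketch of the kind of setting I mean (the property name mapred.map.max.attempts is an assumption about the MRv1-era Hadoop in use; newer releases spell it mapreduce.map.maxattempts):

import org.apache.hadoop.conf.Configuration;

public class RetryConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Each Giraph worker/master runs as a map task, so this bounds how many
    // times a failed worker attempt can be relaunched.
    conf.setInt("mapred.map.max.attempts", 10);
    System.out.println(conf.get("mapred.map.max.attempts"));
  }
}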

Here is the message from the master:

2013-03-18 18:23:30,628 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 54 of 55 required).  This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
2013-03-18 18:23:30,628 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 12
2013-03-18 18:23:30,629 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 12
2013-03-18 18:23:30,649 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201303181655_0004
2013-03-18 18:23:30,703 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = null, exiting...
java.lang.NullPointerException
                at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1411)
                at org.apache.giraph.graph.MasterThread.run(MasterThread.java:111)
2013-03-18 18:23:30,705 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.



All workers except the one that threw the expected exception report the following error:

2013-03-18 18:20:54,107 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher
java.lang.RuntimeException: process: Disconnected from ZooKeeper, cannot recover - WatchedEvent state:Disconnected type:None path:null
                at org.apache.giraph.graph.BspService.process(BspService.java:974)
                at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
                at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2013-03-18 18:20:55,110 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server idp30.almaden.ibm.com/172.16.0.30:22181
2013-03-18 18:20:55,111 WARN org.apache.zookeeper.ClientCnxn: Session 0x13d8037f8100008 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
                at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
                at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
                at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2013-03-18 18:20:55,218 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName ytian for UID 3005 from the native implementation
2013-03-18 18:20:55,257 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalStateException: startSuperstep: KeeperException getting assignments
                at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:928)
                at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:649)
                at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:891)
                at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
                at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:396)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
                at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /_hadoopBsp/job_201303181655_0004/_applicationAttemptsDir/0/_superstepDir/2/_partitionAssignments
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
                at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
                at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
                at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:909)


Yuanyuan



From:        Avery Ching <aching@apache.org>
To:        user@giraph.apache.org
Cc:        Yuanyuan Tian/Almaden/IBM@IBMUS
Date:        03/18/2013 03:05 PM
Subject:        Re: about fault tolerance in Giraph




How many retries did you set for Hadoop map task failures? You might want to try 10.

Avery

On 3/18/13 2:38 PM, Yuanyuan Tian wrote:

Hi Avery,

I was just testing how Giraph handles fault tolerance. I wrote a simple algorithm that runs without a problem, then artificially added a line of code to throw an IOException at the 12th superstep when the task ID is 0001 and the attempt ID is 0000. The job hit the expected IOException, but it could not recover from it: the failed task was never retried, even though there were empty map slots left in the cluster. Eventually the whole job failed after timing out.
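Roughly, the injected fault looks like the sketch below. It is written against the newer Giraph BasicComputation API rather than the June 2012 code actually in use, and the mapred.task.id lookup is only an assumption about how to single out a specific task attempt, so treat it as illustrative:

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

public class FaultInjectingComputation extends
    BasicComputation<LongWritable, DoubleWritable, DoubleWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, DoubleWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // Hadoop puts the full attempt id (e.g. attempt_..._m_000001_0) into the
    // task configuration; using it to pick out task 0001, attempt 0000 is an
    // assumption, not necessarily the exact mechanism of the original test.
    String attemptId = getConf().get("mapred.task.id", "");
    if (getSuperstep() == 12 && attemptId.endsWith("_m_000001_0")) {
      throw new IOException("Injected failure to test Giraph fault tolerance");
    }
    vertex.voteToHalt();
  }
}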


Yuanyuan




From:        Avery Ching <aching@apache.org>
To:        user@giraph.apache.org
Date:        03/18/2013 02:09 PM
Subject:        Re: about fault tolerance in Giraph




Hi Yuanyuan,

We haven't tested this feature in a while, but it should work. What did the job report about why it failed?

Avery

On 3/18/13 10:22 AM, Yuanyuan Tian wrote:

Can anyone help me answer the question?


Yuanyuan




From:        Yuanyuan Tian/Almaden/IBM@IBMUS
To:        user@giraph.apache.org
Date:        03/15/2013 02:05 PM
Subject:        about fault tolerance in Giraph




Hi,


I was testing the fault tolerance of Giraph on a long-running job. I noticed that when one of the workers threw an exception, the whole job failed without retrying the task, even though I had turned on checkpointing and there were available map slots in my cluster. Why wasn't the fault-tolerance mechanism working?
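For context, checkpointing was turned on roughly as in the sketch below (the property names giraph.checkpointFrequency and giraph.checkpointDirectory and the GiraphJob class follow the current Giraph codebase and may differ in the June 2012 build; the checkpoint directory path is hypothetical):

import org.apache.giraph.job.GiraphJob;
import org.apache.hadoop.conf.Configuration;

public class CheckpointedJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    GiraphJob job = new GiraphJob(conf, "giraph-fault-tolerance-test");
    // Write a checkpoint every 2 supersteps, so a failed superstep can be
    // restarted from the latest checkpoint instead of from scratch.
    job.getConfiguration().setInt("giraph.checkpointFrequency", 2);
    // HDFS directory where the checkpoints are stored (hypothetical path).
    job.getConfiguration().set("giraph.checkpointDirectory", "/tmp/giraph/checkpoints");
    // ... computation/vertex class, input and output formats, and the number
    // of workers still have to be set before the job can actually run ...
    System.exit(job.run(true) ? 0 : -1);
  }
}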


I was running a version of Giraph downloaded sometime in June 2012, and I used Netty for the communication layer.

Thanks,


Yuanyuan