Hi Claudio...

I turned checkpointing on and executed the Giraph job:

hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -Dmapred.job.map.memory.mb=1500 \
  -Dmapred.map.child.java.opts=-Xmx1G \
  -Dgiraph.useSuperstepCounters=false \
  -Dgiraph.useOutOfCoreMessages=true \
  -Dgiraph.checkpointFrequency=1 \
  org.apache.giraph.examples.MyShortestDistance \
  -vif org.apache.giraph.examples.io.formats.MyShortestDistanceVertexInputFormat \
  -vip /user/hduser/big_input/my_line_rank_input6.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/hduser/sp_output530/sd_output \
  -w 1 \
  -mc org.apache.giraph.examples.MyShortestDistance\$MyMasterCompute


14/01/31 09:47:57 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
14/01/31 09:47:57 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
14/01/31 09:48:21 INFO job.GiraphJob: run: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201401310947_0001
14/01/31 09:49:24 INFO job.HaltApplicationUtils$DefaultHaltInstructionsWriter: writeHaltInstructions: To halt after next superstep execute: 'bin/halt-application --zkServer kanha-Vostro-1014:22181 --zkNode /_hadoopBsp/job_201401310947_0001/_haltComputation'
14/01/31 09:49:24 INFO mapred.JobClient: Running job: job_201401310947_0001
14/01/31 09:49:25 INFO mapred.JobClient:  map 100% reduce 0%
14/01/31 09:59:15 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000001_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hduser/_bsp/_checkpoints/job_201401310947_0001/4.kanha-Vostro-1014_1.metadata could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377)

    at org.apache.hadoop.ipc.Client.call(Client.java:1030)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3104)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2975)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)

Task attempt_201401310947_0001_m_000001_0 failed to report status for 600 seconds. Killing!
attempt_201401310947_0001_m_000001_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000001_0: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_0: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000001_0: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
attempt_201401310947_0001_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201401310947_0001_m_000001_0: log4j:WARN Please initialize the log4j system properly.
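(For reference: the "could only be replicated to 0 nodes, instead of 1" error above typically means HDFS had no live DataNode with free space when the checkpoint file was written. Assuming a standard single-node Hadoop 0.20/1.x setup like the one in these logs, a quick sanity check would be something along these lines:)

```shell
# Report live DataNodes and remaining HDFS capacity
# (Hadoop 0.20/1.x CLI; newer releases use "hdfs dfsadmin -report")
hadoop dfsadmin -report

# Check free local disk space under the DataNode's storage directory
# (/app/hadoop/tmp is the hadoop.tmp.dir seen in the task logs above)
df -h /app/hadoop/tmp
```

If the report shows 0 live DataNodes or near-zero remaining capacity, checkpointing with `-Dgiraph.checkpointFrequency=1` will fail this way regardless of the Giraph job itself.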
14/01/31 09:59:19 INFO mapred.JobClient:  map 50% reduce 0%
14/01/31 09:59:31 INFO mapred.JobClient:  map 100% reduce 0%
14/01/31 10:14:15 INFO mapred.JobClient:  map 50% reduce 0%
14/01/31 10:14:20 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

attempt_201401310947_0001_m_000000_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000000_0: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000000_0: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000000_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000000_0: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/01/31 10:14:30 INFO mapred.JobClient:  map 100% reduce 0%
14/01/31 10:24:14 INFO mapred.JobClient: Task Id : attempt_201401310947_0001_m_000001_1, Status : FAILED
java.lang.IllegalStateException: run: Caught an unrecoverable exception registerHealth: Trying to get the new application attempt by killing self
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: registerHealth: Trying to get the new application attempt by killing self
    at org.apache.giraph.worker.BspServiceWorker.registerHealth(BspServiceWorker.java:627)
    at org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:681)
    at org.apache.giraph.worker.BspServiceWorker.setup(BspServiceWorker.java:486)
    at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:246)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
    ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /_hadoopBsp/job_201401310947_0001/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/kanha-Vostro-1014_1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
    at org.apache.giraph.zk.ZooKeeperExt.createExt(ZooKeeperExt.java:152)
    at org.apache.giraph.worker.BspServiceWorker.registerHealth(BspServiceWorker.java:611)
    ... 11 more

Task attempt_201401310947_0001_m_000001_1 failed to report status for 600 seconds. Killing!
attempt_201401310947_0001_m_000001_1: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201401310947_0001_m_000001_1: SLF4J: Found binding in [file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201401310947_0001/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_1: SLF4J: Found binding in [jar:file:/usr/local/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201401310947_0001_m_000001_1: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
attempt_201401310947_0001_m_000001_1: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
attempt_201401310947_0001_m_000001_1: log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ClientCnxn).
attempt_201401310947_0001_m_000001_1: log4j:WARN Please initialize the log4j system properly.
14/01/31 10:24:15 INFO mapred.JobClient:  map 50% reduce 0%
14/01/31 10:24:24 INFO mapred.JobClient:  map 100% reduce 0%


Could you please suggest something to fix this failure?

Thanks
Jyoti



On Wed, Jan 29, 2014 at 10:16 PM, Claudio Martella <claudio.martella@gmail.com> wrote:
Looks like one of your workers died. If you expect such a long job, I'd suggest you turn checkpointing on.


On Wed, Jan 29, 2014 at 5:30 PM, Jyoti Yadav <rao.jyoti26yadav@gmail.com> wrote:
Thanks all for your replies.
Actually, I am working with an algorithm in which a single-source shortest path computation runs for thousands of vertices. If this computation takes 5-6 supersteps per vertex on average, then for thousands of vertices the total superstep count becomes extremely large. In that case the following error is thrown at run time:

 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=kanha-Vostro-1014, MRtaskID=1, port=30001) on superstep 19528
2014-01-28 05:11:36,852 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 19528 took 636.831 seconds ended with state WORKER_FAILURE and is now on superstep 19528
2014-01-28 05:11:39,446 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: -1

Any ideas??

Thanks
Jyoti 


On Wed, Jan 29, 2014 at 8:55 PM, Peter Grman <peter.grman@gmail.com> wrote:

Yes, but you can disable the per-superstep counters if you don't need that data; after doing so, I got to around 2000 supersteps before my algorithm stopped.

Cheers
Peter

On Jan 29, 2014 4:22 PM, "Claudio Martella" <claudio.martella@gmail.com> wrote:
The limit is currently defined by the maximum number of counters your JobTracker allows. Hence, by default, the maximum number of supersteps is around 90.

Check http://giraph.apache.org/faq.html to see how to increase it.
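(For reference, raising the limit on a Hadoop 1.x-era cluster like this one is typically done in mapred-site.xml; the property name below is my assumption based on the FAQ, and the JobTracker needs a restart for it to take effect:)

```xml
<!-- mapred-site.xml: raise the per-job counter limit so long-running
     Giraph jobs with many supersteps don't exhaust it -->
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>120000</value>
</property>
```

Alternatively, `-Dgiraph.useSuperstepCounters=false` (as in the command earlier in this thread) sidesteps the limit entirely by not emitting per-superstep counters.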


On Wed, Jan 29, 2014 at 4:12 PM, Jyoti Yadav <rao.jyoti26yadav@gmail.com> wrote:
Hi folks..

Is there any limit on the maximum number of supersteps while running a Giraph job?

Thanks
Jyoti



--
   Claudio Martella
   




--
   Claudio Martella