Hi Hassan,

Thanks for the detailed response. I am not running Giraph jobs in parallel; I am trying to run a single job with 900M edges.
I have also removed the _bsp folder before every run.

I also checked out the latest code from the Phabricator commit. Out-of-core works perfectly for 300M records, but when I increase the data set to 500M I run into the following problems.

Exception 1:
Script:
hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.1-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000 -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021 -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m org.apache.giraph.examples.ConnectedComponentsComputation   -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /VUID/input_500M -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /VUID/ouput_500M -w 4 -ca giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,mapred.map.max.attempts=2,giraph.numOutputThreads=10,giraph.numInputThreads=10,giraph.numComputeThreads=4,giraph.waitForPerWorkerRequests=true,giraph.zkSessionMsecTimeout=1200000


After superstep 1, execution gets stuck. This is when I run without out-of-core:

2016-05-17 19:11:06,536 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 2267ms
2016-05-17 19:11:55,694 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 1314ms
2016-05-17 19:11:55,695 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 47407ms
2016-05-17 19:11:58,930 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 1465ms
2016-05-17 19:12:03,222 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 2655ms
2016-05-17 19:13:00,659 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3013ms
2016-05-17 19:13:00,659 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 52769ms
2016-05-17 19:13:04,359 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 1961ms
2016-05-17 19:14:01,246 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 2925ms
2016-05-17 19:14:01,247 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 52743ms
2016-05-17 19:14:50,029 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 48410ms
2016-05-17 19:15:34,967 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 44449ms
2016-05-17 19:16:20,120 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 44219ms
2016-05-17 19:17:06,099 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 45196ms
2016-05-17 19:18:21,141 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 73103ms
2016-05-17 19:19:56,003 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 93005ms
2016-05-17 19:21:24,339 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 88099ms
2016-05-17 19:22:57,828 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 93429ms
2016-05-17 19:24:31,891 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 94049ms
2016-05-17 19:25:53,983 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: inst
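
For reference, a minimal sketch for totalling the major GC pauses in a task log like the one above. It assumes the installGCMonitoring lines have exactly the format shown here, with the duration as the last field; the function name major_gc_total_ms is hypothetical and not part of Giraph.

```shell
#!/bin/sh
# Sum the durations (in ms) of all major GCs reported in a Giraph task log.
# ASSUMPTION: each matching line ends in "duration = <N>ms", as in the log above.
major_gc_total_ms() {
  awk '/action = end of major GC/ {
         d = $NF              # last field, e.g. "47407ms"
         sub(/ms$/, "", d)    # strip the unit
         total += d
       }
       END { print total + 0 }' "$1"
}
```

Comparing the total with the wall-clock span of the log gives a rough idea of how much of the time the worker spent in full collections.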


Exception 2:

I get a heap space error after superstep 2 when I run with out-of-core enabled. With the code I had checked out from the Giraph Git trunk, however, I was able to run 600M edges successfully in memory. I am not sure why I get a heap space error with the code I checked out from your commit.


2016-05-17 18:31:14,387 INFO [main] org.apache.giraph.worker.BspServiceWorker: finishSuperstep: Completed superstep 2 with global stats (vtx=451097689,finVtx=451097689,edges=499895018,msgCount=16110,msgBytesCount=129712,haltComputation=false, checkpointStatus=NONE) and classes (computation=org.apache.giraph.examples.ConnectedComponentsComputation,incoming=org.apache.giraph.conf.DefaultMessageClasses@18388a3c,outgoing=org.apache.giraph.conf.DefaultMessageClasses@1d035be3)
2016-05-17 18:31:14,395 INFO [main-EventThread] org.apache.giraph.worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
2016-05-17 18:31:14,414 WARN [main-EventThread] org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_1463146675144_0120/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished, type=NodeDeleted, state=SyncConnected)
2016-05-17 18:31:14,414 WARN [main-EventThread] org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_1463146675144_0120/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
2016-05-17 18:31:50,368 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 35912ms
2016-05-17 18:31:50,371 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 0
2016-05-17 18:31:50,371 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 0
2016-05-17 18:32:53,125 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: last GC happened a while ago and the amount of used memory is high (used memory fraction is 0.96). Calling GC manually
2016-05-17 18:36:09,485 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: manual GC is done. It took 196.36 seconds. Used memory fraction is 0.96
2016-05-17 18:36:09,485 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 0
2016-05-17 18:36:09,485 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 0
2016-05-17 18:36:42,959 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: last GC happened a while ago and the amount of used memory is high (used memory fraction is 0.89). Calling GC manually
2016-05-17 18:37:45,432 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: call: manual GC is done. It took 62.47 seconds. Used memory fraction is 0.89
2016-05-17 18:37:45,432 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 18:37:45,432 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 5
2016-05-17 18:37:45,432 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 62656ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 33689ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 92383ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Allocation Failure, duration = 70285ms
2016-05-17 18:37:45,433 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 33473ms
2016-05-17 18:37:45,434 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = System.gc(), duration = 63ms
2016-05-17 18:37:45,434 INFO [Service Thread] org.apache.giraph.graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = System.gc(), duration = 62409ms
2016-05-17 18:37:45,438 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
	at it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap.<init>(Int2ObjectOpenHashMap.java:107)
	at it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap.<init>(Int2ObjectOpenHashMap.java:115)
	at org.apache.giraph.comm.messages.primitives.IntByteArrayMessageStore.<init>(IntByteArrayMessageStore.java:89)
	at org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStoreWithoutCombiner(InMemoryMessageStoreFactory.java:128)
	at org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStore(InMemoryMessageStoreFactory.java:178)
	at org.apache.giraph.comm.messages.InMemoryMessageStoreFactory.newStore(InMemoryMessageStoreFactory.java:54)
	at org.apache.giraph.comm.ServerData.prepareSuperstep(ServerData.java:285)
	at org.apache.giraph.comm.netty.NettyWorkerServer.prepareSuperstep(NettyWorkerServer.java:97)
	at org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:700)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:329)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)

Exception 3:

I got this when I ran 600M edges with 5 nodes and 9 workers.
I was getting this error with older versions of Giraph as well.
By older version I mean the version checked out from trunk.
I am having these issues only with huge data sets.

2016-05-17 17:56:33,576 ERROR [ooc-io-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.RuntimeException: java.io.EOFException
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:114)
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:290)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:335)
	at org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:198)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:368)
	at org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:102)
	... 6 more
2016-05-17 17:56:33,580 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an out-of-core thread terminated unexpectedly with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
2016-05-17 17:56:34,937 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:34,937 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:37,437 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:37,437 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:38,202 INFO [main] org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, task ids: [6]
2016-05-17 17:56:39,937 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:39,938 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:42,438 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:42,438 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:44,938 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:44,938 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:47,439 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:47,439 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:49,939 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:49,939 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:52,440 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:52,440 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:54,941 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:54,941 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:55,420 INFO [netty-server-worker-7] org.apache.giraph.comm.netty.handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 0, MBytesReceived = 0.0007, ave received req MBytes = 0.0001, secs waited = 52.835
2016-05-17 17:56:55,421 INFO [main] org.apache.giraph.utils.TaskIdsPermitsBarrier: waitForRequiredPermits: Waiting for 0 more aggregator requests
2016-05-17 17:56:55,421 INFO [main] org.apache.giraph.graph.GraphTaskManager: execute: 8 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 3
2016-05-17 17:56:55,421 INFO [main] org.apache.giraph.ooc.OutOfCoreEngine: startIteration: with 0 partitions in memory and 1 active threads
2016-05-17 17:56:55,422 INFO [compute-0] org.apache.giraph.ooc.OutOfCoreEngine: getNextPartition: waiting until a partition becomes available!
2016-05-17 17:56:57,442 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:57,442 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:56:59,942 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:56:59,942 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:57:02,443 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:57:02,443 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:57:04,943 INFO [check-memory] org.apache.giraph.ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
2016-05-17 17:57:04,943 INFO [check-memory] org.apache.giraph.ooc.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
2016-05-17 17:57:05,422 ERROR [compute-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread!
	at org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2016-05-17 17:57:05,423 ERROR [main] org.apache.giraph.graph.GraphMapper: Caught an unrecoverable exception Exception occurred
java.lang.IllegalStateException: Exception occurred
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:817)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:364)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread!
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250)
	... 10 more
Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread!
	at org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2016-05-17 17:57:05,424 ERROR [main] org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_1463146675144_0119/_applicationAttemptsDir/0/_superstepDir/3/_workerHealthyDir/ip-172-31-42-220.eu-west-1.compute.internal_4 on superstep 3
2016-05-17 17:57:05,427 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: run: Caught an unrecoverable exception Exception occurred
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:108)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.IllegalStateException: Exception occurred
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:817)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:364)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
	... 7 more
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread!
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250)
	... 10 more
Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread!
	at org.apache.giraph.ooc.OutOfCoreEngine.getNextPartition(OutOfCoreEngine.java:285)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:174)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2016-05-17 17:57:05,428 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task


On Tue, May 17, 2016 at 12:40 AM, Hassan Eslami <hsn.eslami@gmail.com> wrote:
Ramesh,

The out-of-core mechanism keeps spilled data in files in the local job directory, which is usually obtained from Hadoop's "mapred.job.id". This should differ from one run to the next, so runs that use the out-of-core mechanism should not conflict with each other. However, you may have manually overridden the related Hadoop/YARN config, in which case there may be a conflict. That is, if you run your jobs one after another, a later job may make decisions based on files left behind by an earlier job. This can be one reason you are getting this error. Please make sure the local job directory is different from run to run, or simply delete the "_bsp/_partitions" directory from your local job directory every time you run your job using out-of-core.
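
A minimal sketch of that cleanup step, assuming the local job directory is known and passed in by hand. The function name clean_ooc_partitions is hypothetical and not part of Giraph, and the actual location of the directory depends on your Hadoop/YARN configuration.

```shell
#!/bin/sh
# Remove leftover out-of-core partition files from a previous run.
# ASSUMPTION: "$1" is the local job directory used by the Giraph worker.
clean_ooc_partitions() {
  job_dir="$1"
  if [ -z "$job_dir" ]; then
    echo "usage: clean_ooc_partitions <local-job-dir>" >&2
    return 1
  fi
  # Only "_bsp/_partitions" is removed; everything else is left untouched.
  rm -rf -- "$job_dir/_bsp/_partitions"
}
```

This would need to run on every worker node before the job is submitted, since the spilled files are local to each machine.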

As a side note, you don't need to specify out-of-core messages (giraph.maxMessagesInMemory=100,giraph.useOutOfCoreMessages=true) anymore. Also, you can try a new out-of-core feature in which you don't have to specify the number of partitions in memory either (you can also get rid of giraph.maxPartitionsInMemory=5). This new feature is extensively tested, but is still under review and has not been pushed to the code base yet. You can access this feature here: https://reviews.facebook.net/D55479
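
As an illustration, the sketch below drops options that are no longer needed from a comma-separated -ca string. The helper name strip_ca_options is hypothetical; it only edits the text of the option list and does not know anything about Giraph itself.

```shell
#!/bin/sh
# Drop the given option keys from a comma-separated "key=value" list.
# Usage: strip_ca_options "<ca-string>" key1 key2 ...
# The body runs in a subshell so the IFS change does not leak to the caller.
strip_ca_options() (
  ca="$1"
  shift
  IFS=','
  out=""
  for pair in $ca; do
    key="${pair%%=*}"          # text before the first "="
    keep=1
    for drop in "$@"; do
      if [ "$key" = "$drop" ]; then keep=0; fi
    done
    if [ "$keep" -eq 1 ]; then
      out="${out:+$out,}$pair" # re-join the surviving pairs with commas
    fi
  done
  printf '%s\n' "$out"
)
```

For example, passing the -ca value from your script together with giraph.maxMessagesInMemory, giraph.useOutOfCoreMessages and giraph.maxPartitionsInMemory prints the same list without those three entries.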

Best,
Hassan

On Sat, May 14, 2016 at 10:46 PM, Ramesh Krishnan <ramesh.154089@gmail.com> wrote:
Thanks, Hassan. I have removed checkpointing, but I am now getting a different error.

Script:

hadoop jar /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar  org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000 -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021 -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m org.apache.giraph.examples.ConnectedComponentsComputation   -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /test/input_10M -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /test/ouput_10M -w 5 -ca giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=5,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.useOutOfCoreMessages=true,giraph.useOutOfCoreGraph=true

Exception:
2016-05-15 05:34:28,113 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallable: call: execution of IO command LoadPartitionIOCommand: (partitionId = 107, superstep = 0) failed!
2016-05-15 05:34:28,114 ERROR [ooc-io-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.RuntimeException: java.io.EOFException
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:76)
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:30)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:286)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:329)
	at org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:195)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:360)
	at org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:64)
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:72)
	... 6 more
2016-05-15 05:34:28,117 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an out-of-core thread terminated unexpectedly with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
2016-05-15 05:34:28,441 INFO [compute-0] org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: processing partition 117 is done!
2016-05-15 05:34:29,111 INFO [compute-0] org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: processing partition 27 is done!
2016-05-15 05:34:29,620 INFO [compute-0] org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: processing partition 127 is done!
2016-05-15 05:34:30,123 INFO [compute-0] org.apache.giraph.ooc.FixedOutOfCoreEngine: doneProcessingPartition: processing partition 22 is done!
2016-05-15 05:34:30,123 INFO [compute-0] org.apache.giraph.ooc.FixedOutOfCoreEngine: getNextPartition: waiting until a partition becomes available!
2016-05-15 05:34:31,123 ERROR [compute-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
	at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:69)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2016-05-15 05:34:31,124 ERROR [main] org.apache.giraph.graph.GraphMapper: Caught an unrecoverable exception Exception occurred
java.lang.IllegalStateException: Exception occurred
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:253)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:761)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:349)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:250)
	... 10 more
Caused by: java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
	at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:69)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2016-05-15 05:34:31,125 ERROR [main] org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_1463146675144_0036/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-172-31-37-39.eu-west-1.compute.internal_2 on superstep 0


On Sun, May 15, 2016 at 3:54 AM, Hassan Eslami <hsn.eslami@gmail.com> wrote:
Hi Ramesh!

Thanks for bringing this up, and thanks for trying out the new out-of-core mechanism. The new out-of-core mechanism has not been integrated with checkpointing yet. This is part of an ongoing project, and we should have the integration within a few weeks. In the meantime, you can try out-of-core without checkpointing enabled.

Best,
Hassan
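
A minimal sketch of what "out-of-core without checkpointing" could look like for the command quoted further down in this thread: it takes the -ca option list from that command and drops the two checkpoint-related entries (giraph.cleanupCheckpointsAfterSuccess and giraph.checkpointFrequency). The variable names CA_OPTS and CA_NO_CKPT are made up for illustration, and it assumes that leaving giraph.checkpointFrequency unset keeps checkpointing off (its default is understood to be 0).

```shell
#!/bin/sh
# -ca option list copied from the original command in this thread.
CA_OPTS='giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=10,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.numOutputThreads=10,giraph.useOutOfCoreMessages=true,giraph.numOutputThreads=4,giraph.numInputThreads=4,giraph.useOutOfCoreGraph=true,giraph.cleanupCheckpointsAfterSuccess=true,giraph.checkpointFrequency=1'

# Split on commas, drop every option whose name mentions "checkpoint"
# (case-insensitive), then join the remaining options with commas again.
CA_NO_CKPT=$(printf '%s' "$CA_OPTS" | tr ',' '\n' | grep -v -i 'checkpoint' | paste -sd, -)

# Print the filtered list; it can be passed as: -ca "$CA_NO_CKPT"
echo "$CA_NO_CKPT"
```

The rest of the command (jar, input/output paths, -w) would stay as it was; only the value given to -ca changes.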


On Saturday, May 14, 2016, Ramesh Krishnan <ramesh.154089@gmail.com> wrote:
Please find attached the correct logs for the concurrent exception:

2016-05-14 19:10:55,733 ERROR [ooc-io-0] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.RuntimeException: java.io.EOFException
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:76)
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:30)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readInt(DataInputStream.java:392)
	at org.apache.hadoop.io.IntWritable.readFields(IntWritable.java:47)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:286)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:329)
	at org.apache.giraph.ooc.data.OutOfCoreDataManager.loadPartitionData(OutOfCoreDataManager.java:195)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:360)
	at org.apache.giraph.ooc.io.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:64)
	at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:72)
	... 6 more
2016-05-14 19:10:55,737 INFO [ooc-io-0] org.apache.giraph.ooc.OutOfCoreIOCallableFactory: afterExecute: an out-of-core thread terminated unexpectedly with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.EOFException
2016-05-14 19:10:55,739 INFO [checkpoint-vertices-7] org.apache.giraph.ooc.FixedOutOfCoreEngine: getNextPartition: waiting until a partition becomes available!
2016-05-14 19:10:56,426 ERROR [checkpoint-vertices-6] org.apache.giraph.utils.LogStacktraceCallable: Execution of callable failed
java.lang.RuntimeException: Job Failed due to a failure in an out-of-core IO thread
	at org.apache.giraph.ooc.FixedOutOfCoreEngine.getNextPartition(FixedOutOfCoreEngine.java:81)
	at org.apache.giraph.ooc.data.DiskBackedPartitionStore.getNextPartition(DiskBackedPartitionStore.java:187)
	at org.apache.giraph.worker.BspServiceWorker$3$1.call(BspServiceWorker.java:1398)
	at org.apache.giraph.worker.BspServiceWorker$3$1.call(BspServiceWorker.java:1392)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


On Sun, May 15, 2016 at 1:02 AM, Ramesh Krishnan <ramesh.154089@gmail.com> wrote:

Hi Team,

I have the latest build of Giraph running on a 5-node cluster. When I try to use the out-of-core graph option on a huge data set of about 600 million edges, I run into the following exception. Please find below the script being executed and the exception logs. I have tried everything I can think of and could not avoid this issue; I would really appreciate your help.

Script:
hadoop jar /usr/local/giraph.back.1.2.0/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar  org.apache.giraph.GiraphRunner -Dmapreduce.task.timeout=12000000 -Dmapred.job.tracker=ip-172-31-42-220.eu-west-1.compute.internal:8021 -Dmapreduce.map.memory.mb=23480 -Dmapreduce.map.java.opts=-Xmx22480m org.apache.giraph.examples.ConnectedComponentsComputation   -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /test/input_10M -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /test/ouput_10M -w 5 -ca giraph.userPartitionCount=150,giraph.SplitMasterWorker=true,giraph.isStaticGraph=true,giraph.maxPartitionsInMemory=10,mapred.map.max.attempts=2,giraph.maxMessagesInMemory=100,giraph.numOutputThreads=10,giraph.useOutOfCoreMessages=true,giraph.numOutputThreads=4,giraph.numInputThreads=4,giraph.useOutOfCoreGraph=true,giraph.cleanupCheckpointsAfterSuccess=true,giraph.checkpointFrequency=1


Thanks,
Ramesh