Hello,

I’m trying to run a Giraph job with 180092160 vertices on an 18-node, 440G-memory cluster. I used 144 workers with the default partitioning. However, the job is always killed after superstep 0 with the following error:

2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
java.io.FileNotFoundException: File hdfs://bespinrm.umiacs.umd.edu:8020/user/hlan/_bsp/_checkpoints/job_1432262104001_0015 does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:658)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
	at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:106)
	at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
	at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:148)

The same job works fine with customized partitioning: 144 workers, with each worker partitioned in 144/72/180 by vertex id.

Also, with the default partitioning, a job with a 100051200-vertex input works fine too.
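
For reference, here is a rough back-of-envelope comparison of the two inputs. It is only a sketch: it assumes vertices are spread evenly across the 144 workers and that 440G is the total memory of the whole cluster, neither of which I have verified. The names `per_worker` and `TOTAL_MEMORY_GB` are just for illustration.

```python
# Back-of-envelope per-worker load for the two inputs mentioned above.
# Assumptions (not measured): vertices are spread evenly across workers,
# and 440G is the total memory of the whole cluster.
WORKERS = 144
TOTAL_MEMORY_GB = 440  # assumed cluster-wide total


def per_worker(vertices, workers=WORKERS):
    """Return the average number of vertices per worker."""
    return vertices / workers


failing = per_worker(180092160)   # input that dies after superstep 0
working = per_worker(100051200)   # input that completes
memory_per_worker_gb = TOTAL_MEMORY_GB / WORKERS

print(f"failing job: {failing:,.0f} vertices per worker")
print(f"working job: {working:,.0f} vertices per worker")
print(f"load ratio:  {failing / working:.2f}x")
print(f"memory per worker: {memory_per_worker_gb:.2f} GB")
```

Under these assumptions the failing input puts noticeably more vertices on each worker than the working one, but that alone would not explain why the customized partitioning succeeds with the same input.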

Could anyone help?

Many thanks

Best wishes

Hai