giraph-user mailing list archives

From Hai Lan <lanhai1...@gmail.com>
Subject Graph job self-killed after superstep 0 with large input
Date Fri, 22 May 2015 10:25:09 GMT
Hello,

I’m trying to run a Giraph job with 180,092,160 vertices on an 18-node cluster with 440 GB of memory. I used 144 workers with default partitioning. However, my job is always killed after superstep 0 with the following error:

2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
barrierOnWorkerList: Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu,
MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu,
MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu,
MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE
and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
java.io.FileNotFoundException: File hdfs://bespinrm.umiacs.umd.edu:8020/user/hlan/_bsp/_checkpoints/job_1432262104001_0015
does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:658)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
	at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:106)
	at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
	at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:148)

This job runs fine with custom partitioning across 144 workers, with each worker partitioned into 144/72/180 by vertex id.

Also, with default partitioning, a job with a 100,051,200-vertex input works fine too.
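For context, Giraph's default partitioning hashes each vertex id across the available partitions. The sketch below illustrates the idea (a simplified illustration, not the actual Giraph implementation; the class and method names here are made up for the example):

```java
// Simplified sketch of hash-based vertex-to-partition assignment,
// similar in spirit to Giraph's default hash partitioning.
// Not the actual Giraph code; names are illustrative only.
public class HashPartitionSketch {

    // Map a vertex id to one of numPartitions partitions.
    static int partitionFor(long vertexId, int numPartitions) {
        // Long.hashCode(v) folds the high and low 32 bits together;
        // Math.abs guards against a negative remainder.
        return Math.abs(Long.hashCode(vertexId) % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 144;          // one partition per worker in this setup
        long vertexId = 180_092_159L;     // an example vertex id
        System.out.println(partitionFor(vertexId, numPartitions));
    }
}
```

With uniformly distributed vertex ids this spreads vertices evenly over the 144 workers; skewed id distributions are one common reason a custom partitioner behaves differently from the default.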

Could anyone help?

Many thanks

Best wishes

Hai