giraph-user mailing list archives

From Hai Lan <>
Subject Graph job self-killed after superstep 0 with large input
Date Fri, 22 May 2015 10:25:09 GMT

I’m trying to run a Giraph job with 180092160 vertices on an 18-node cluster with 440G of
memory. I used 144 workers with the default partitioning. However, the job is always killed
after superstep 0 with the following error:

2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
barrierOnWorkerList: Missing chosen workers [Worker( <>,
MRtaskID=2, port=30002), Worker( <>,
MRtaskID=6, port=30006), Worker( <>,
MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE
and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster:
getLastGoodCheckpoint: No last good checkpoints can be found, killing the job. File hdfs://
does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(
	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(
	at org.apache.hadoop.fs.FileSystem.listStatus(
	at org.apache.hadoop.fs.FileSystem.listStatus(
	at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(
	at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(
	at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(

The same job runs fine with customized partitioning by vertex id, using 144 workers with
each worker split into 144, 72, or 180 partitions.

Also, with the default partitioning, a job with a 100051200-vertex input works fine too.
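
For scale, the average per-worker load implied by the numbers above can be estimated with a rough sketch. It assumes that vertices are spread evenly across workers, that workers are spread evenly across nodes, and that 440G is the total memory of the whole cluster rather than of each node; none of these is confirmed by the logs. The function name per_worker_load is hypothetical and is not part of Giraph.

```python
def per_worker_load(num_vertices, num_workers, num_nodes, total_mem_gb):
    """Average load figures, ASSUMING an even spread of vertices and workers."""
    return {
        # Average number of vertices each worker would hold.
        "vertices_per_worker": num_vertices / num_workers,
        # Average number of workers placed on each node.
        "workers_per_node": num_workers / num_nodes,
        # Memory share per worker, ASSUMING total_mem_gb is cluster-wide.
        "mem_gb_per_worker": total_mem_gb / num_workers,
    }


# Input that is killed after superstep 0 with default partitioning.
failing = per_worker_load(180_092_160, 144, 18, 440)
# Input that completes with default partitioning.
working = per_worker_load(100_051_200, 144, 18, 440)

print(failing)
print(working)
print("size ratio:", 180_092_160 / 100_051_200)
```

Under these assumptions the failing input averages 1250640 vertices per worker against 694800 for the working one, a factor of 1.8, with 8 workers per node and roughly 3 GB of memory per worker. Default partitioning may distribute vertices less evenly than this average suggests.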

Could anyone help?

Many thanks

Best wishes
