giraph-user mailing list archives

From "Tunvall, Fredrik" <Fredrik.Tunv...@ovum.com>
Subject RE: Master always fails on dataset
Date Fri, 18 Oct 2013 16:25:25 GMT
I will reach out right now

From: Simon McGloin [mailto:simonmcgloin@gmail.com]
Sent: Friday, October 18, 2013 12:24 PM
To: user@giraph.apache.org
Subject: Re: Master always fails on dataset

Thanks Claudio. Yes, the machines are homogeneous. Unfortunately I don't have Ganglia installed.
You were right, it is a memory issue. I've reduced the number of partitions kept in memory to 1 with
-Dgiraph.maxPartitionsInMemory=1, and now my jobs are failing because they run out of disk space
on HDFS. Each HDFS mount has 100 GB of space. I will increase the size of HDFS and order more
memory next week. Is there any way to calculate the memory requirements of a Giraph job? I
presume it depends on the algorithm being run.

On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella <claudio.martella@gmail.com> wrote:
Try decreasing the number of partitions you keep in memory. You're running out of memory.
Also, are your nodes homogeneous? It could be one particular machine swapping or something.
If you have Ganglia, try using it to check memory usage.

On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin <simonmcgloin@gmail.com> wrote:
Hey guys,

I have a problem running my Giraph job on a dataset with 20,000,000 edges and 2,000,000 vertices.
All the vertices are Text-based. The Giraph job works perfectly on smaller datasets but always
fails on larger ones. My setup is a 3-node cluster; each node has 24 cores and 24 GB of
RAM. The cluster has a total of 60 mappers, each with mapred.child.java.opts set to -Xmx1000m.
If I don't use the out-of-core option, the job fails because it runs out of Java heap space.
When I use -Dgiraph.useOutOfCoreGraph=true, the master eventually fails because a worker
disconnects from ZooKeeper. The worker just logs a warning and doesn't actually fail.
I've been using the -Dgiraph.checkpointFrequency=1 option, but this doesn't seem to restart
the mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem, let me know and
I can investigate it as such.
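
As a rough check of the figures in the message above, the following sketch works out how much heap the configured mappers commit on each node. It assumes the 60 mappers are spread evenly across the 3 nodes, which the message does not state; the function name `heap_budget` is made up for illustration.

```python
# Rough heap budget for the cluster described above.
# Figures from the message: 3 nodes, 24 GB RAM each, 60 mappers, -Xmx1000m.
# Assumption (not stated in the message): mappers are spread evenly across nodes.

NODES = 3
RAM_PER_NODE_MB = 24 * 1024
MAPPERS = 60
HEAP_PER_MAPPER_MB = 1000  # from mapred.child.java.opts=-Xmx1000m


def heap_budget(nodes, ram_per_node_mb, mappers, heap_per_mapper_mb):
    """Return (mappers per node, max heap committed per node in MB, RAM left per node in MB)."""
    mappers_per_node = mappers // nodes
    committed_mb = mappers_per_node * heap_per_mapper_mb
    headroom_mb = ram_per_node_mb - committed_mb
    return mappers_per_node, committed_mb, headroom_mb


per_node, committed, headroom = heap_budget(
    NODES, RAM_PER_NODE_MB, MAPPERS, HEAP_PER_MAPPER_MB
)
print(f"{per_node} mappers/node, {committed} MB max heap, {headroom} MB left per node")
```

Under that assumption each node commits up to 20000 MB of heap to mappers and has 4576 MB left for everything else (the operating system, Hadoop daemons, ZooKeeper, and JVM memory outside the heap). The 1000 MB cap per mapper applies to each Giraph worker regardless of how much RAM the node has.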

Below are the options I'm using and the errors I'm currently getting.
Any help or tips are appreciated,
Simon

Options:
-Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
-Dgiraph.checkpointFrequency=1
-Dgiraph.useOutOfCoreGraph=true
-Dgiraph.zkSessionMsecTimeout=600000
-Dgiraph.numComputeThreads=2
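
For readers who want to keep these settings in one place, here is a small sketch that renders them as `-D` flags. The option names and values are the ones quoted in this thread; the helper `to_d_flags` is hypothetical and how the flags reach the job launcher is outside this sketch.

```python
# Options exactly as listed above. The dict layout and the helper name
# to_d_flags are illustrative assumptions, not part of Giraph.
GIRAPH_OPTIONS = {
    "giraph.zkList": "10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181",
    "giraph.checkpointFrequency": "1",
    "giraph.useOutOfCoreGraph": "true",
    "giraph.zkSessionMsecTimeout": "600000",
    "giraph.numComputeThreads": "2",
}


def to_d_flags(options):
    """Render a mapping of option names to values as a list of -Dname=value flags."""
    return [f"-D{name}={value}" for name, value in options.items()]


# The follow-up message in this thread additionally sets maxPartitionsInMemory to 1.
follow_up = dict(GIRAPH_OPTIONS, **{"giraph.maxPartitionsInMemory": "1"})

for flag in to_d_flags(follow_up):
    print(flag)
```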

Master Log:
2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList:
0 out of 50 workers finished on superstep 1 on path /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
Missing chosen worker Worker(hostname=node1.mycompany.com,
MRtaskID=30, port=30030) on superstep 1
2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination
of superstep 1 took 78.851 seconds ended with state WORKER_FAILURE and is now on superstep
1
2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread: masterThread: Master
algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more
2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler
on thread org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException: restartFromCheckpoint:
KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more


Worker 30 log:
2013-10-17 18:19:07,309 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
writing partition edges 1927 to /data/var/hdfs/data/mapred/taskTracker/simon/jobcache/job_201310161506_0064/attempt_201310161506_0064_m_000030_0/work/_bsp/_partitions/job_201310161506_0064/partition-1927_edges
2013-10-17 18:19:45,736 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Future result
not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:19:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 40183ms for sessionid 0x341c716ad860073, closing socket connection
and attempting reconnect
2013-10-17 18:19:46,113 WARN org.apache.giraph.bsp.BspService: process: Disconnected from
ZooKeeper (will automatically try to recover) WatchedEvent state:Disconnected type:None path:null
2013-10-17 18:19:46,113 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem
with zookeeper, got event with path null, state Disconnected, event type None
2013-10-17 18:19:46,746 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server /10.10.5.105:2181
2013-10-17 18:19:46,747 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to node3.mycompany.com/10.10.5.105:2181,
initiating session
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper
service, session 0x341c716ad860073 has expired, closing socket connection
2013-10-17 18:19:46,750 WARN org.apache.giraph.bsp.BspService: process: Got unknown null path
event WatchedEvent state:Expired type:None path:null
2013-10-17 18:19:46,750 WARN org.apache.giraph.worker.InputSplitsHandler: process: Problem
with zookeeper, got event with path null, state Expired, event type None
2013-10-17 18:19:46,750 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2013-10-17 18:20:33,546 INFO org.apache.giraph.comm.netty.handler.RequestDecoder: decode:
Server window metrics MBytes/sec sent = 0, MBytes/sec received = 0.0059, MBytesSent = 0.0008,
MBytesReceived = 0.7636, ave sent req MBytes = 0, ave received req MBytes = 0.0111, secs waited
= 128.396
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Future result
not ready yet java.util.concurrent.FutureTask@c07bacb
2013-10-17 18:20:45,737 INFO org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@4f786b98
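
The worker log above shows the ZooKeeper client giving up after roughly 40 seconds of silence, even though the job requested a 600000 ms session timeout. The sketch below only extracts that interval from the log text and prints it next to the requested value; the function name `silent_intervals_ms` is made up for illustration. ZooKeeper servers bound the negotiated session timeout (`minSessionTimeout` and `maxSessionTimeout`, which default to 2 and 20 times `tickTime`), so one thing that may be worth checking is whether the server capped the requested value. That is an assumption, not something the thread confirms.

```python
import re

# Matches the ZooKeeper client message seen in the worker log. \s+ is used
# between words because the archived log lines are hard-wrapped.
TIMEOUT_RE = re.compile(r"have\s+not\s+heard\s+from\s+server\s+in\s+(\d+)ms")

REQUESTED_TIMEOUT_MS = 600000  # -Dgiraph.zkSessionMsecTimeout from the options above


def silent_intervals_ms(log_text):
    """Return every 'have not heard from server in Nms' interval found, in ms."""
    return [int(ms) for ms in TIMEOUT_RE.findall(log_text)]


# Excerpt copied from the worker 30 log, including its line wrapping.
worker_log = (
    "2013-10-17 18:19:45,789 INFO org.apache.zookeeper.ClientCnxn: "
    "Client session timed out, have\n"
    "not heard from server in 40183ms for sessionid 0x341c716ad860073, "
    "closing socket connection\n"
    "and attempting reconnect\n"
)

for ms in silent_intervals_ms(worker_log):
    print(f"client timed out after {ms} ms; requested timeout was {REQUESTED_TIMEOUT_MS} ms")
```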





--
   Claudio Martella
   claudio.martella@gmail.com

