giraph-user mailing list archives

From Kristen Hardwick <khardw...@spryinc.com>
Subject Re: DataStreamer Exception - LeaseExpiredException
Date Mon, 20 Jan 2014 17:01:02 GMT
Sorry to bug everyone again, but does anyone have any ideas on this? Please
let me know if I'm leaving out any crucial information that could get me
some help.

Thanks!
Kristen


On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <khardwick@spryinc.com> wrote:

> Hi all,
>
> I had a very productive day today getting this stuff figured out.
> Unfortunately, it appears that I've stumbled onto a possible race condition
> during the application's cleanup step.
>
> I put some information that explains why I think it is a race condition
> here: http://pastebin.com/Qswb98dq
> Basically, I tried the exact same command twice, making no other changes;
> the first time it failed and the second time it succeeded.
>
> This makes me think that the LeaseExpiredException/DataStreamerException
> occurs because the files have been cleaned up just before they are needed,
> possibly inside the BspServiceMaster, but I am not at all sure about that.
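>
> To show the general mechanism I have in mind, here is a minimal,
> standalone HDFS sketch (this is not the actual Giraph code path; the class
> name, path, and block size are made up): one client keeps a part file open
> for writing while a second client deletes its directory, and the writer's
> next block allocation should then fail with the same "No lease on ... File
> does not exist" error.
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> /** Hypothetical demo of a writer losing its lease to a concurrent cleanup. */
> public class LeaseRaceDemo {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Two independent clients, standing in for the worker's DFSClient and
>     // whatever removes the _temporary output directory.
>     FileSystem writer = FileSystem.newInstance(conf);
>     FileSystem cleaner = FileSystem.newInstance(conf);
>
>     Path part = new Path("/tmp/lease-race/part-m-00001");
>     // Tiny (1 MB) block size so the writer has to go back to the NameNode
>     // for additional blocks while we interfere.
>     FSDataOutputStream out =
>         writer.create(part, true, 4096, (short) 1, 1024L * 1024L);
>     out.write(new byte[512 * 1024]);
>     out.hflush();
>
>     // "Cleanup" removes the directory while the writer still holds the lease.
>     cleaner.delete(part.getParent(), true);
>
>     // The next block allocation (or the final close) should fail with
>     // LeaseExpiredException: No lease on ... File does not exist.
>     out.write(new byte[2 * 1024 * 1024]);
>     out.close();
>   }
> }
>
> That is the pattern I think I'm seeing: the worker's saveVertices output
> stream is still open on a _temporary part file when something removes that
> directory.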
>
> Is anyone already aware of this? Should I log it as a bug? I do have
> access to (DEBUG) logs of both the successful and failed attempts if anyone
> wants to see them.
>
> Thanks,
> Kristen Hardwick
>
>
> On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <khardwick@spryinc.com> wrote:
>
>> Hi Avery (or anyone else that knows),
>>
>> Could you please give me some details that would help me find the past
>> threads that might address this issue? I searched Google with various
>> combinations of "giraph datastreamer exception yarn lease expired
>> zookeeper" and didn't really come up with anything that seemed relevant.
>>
>> Is it possible that it's just a memory issue on my end? I'm running
>> inside a VM - a single node cluster with 8 GB of memory allocated to it.
>> Could that have anything to do with it? Right now I'm investigating the
>> code to try to lower the amount of memory allocated to the containers.
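>>
>> In case it helps, this is the kind of change I'm experimenting with. From
>> my (possibly mistaken) reading of GiraphConstants, giraph.yarn.task.heap.mb
>> sets the per-task heap in the pure-YARN profile, so I'm adding it to the
>> GiraphRunner command from my first email and making sure it fits inside
>> what the NodeManager offers (yarn.nodemanager.resource.memory-mb and
>> yarn.scheduler.minimum-allocation-mb in yarn-site.xml). The 1024 below is
>> just a guess for an 8 GB VM:
>>
>> hadoop jar \
>>   giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
>>   org.apache.giraph.GiraphRunner \
>>   -Dgiraph.yarn.task.heap.mb=1024 \
>>   ... (rest of the options unchanged from my first email)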
>>
>> Thanks,
>> Kristen
>>
>>
>> On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <aching@apache.org> wrote:
>>
>>>  This looks more like the Zookeeper/YARN issues mentioned in the past.
>>> Unfortunately, I do not have a YARN instance to test this with.  Does
>>> anyone else have any insights here?
>>>
>>>
>>> On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>>>
>>>  Hi all, I'm asking for help again! I'm trying to get the
>>> SimpleShortestPathsComputation example working, but I'm stuck once more.
>>> Now the job starts and seems to work until the final step (it performs 3
>>> supersteps), but the overall job fails.
>>>
>>>  In the master, among other things, I see:
>>>
>>>  ...
>>>  14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>>> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took 0.708
>>> seconds.
>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158
>>> seconds.
>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344
>>> seconds.
>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064
>>> seconds.
>>> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162 seconds.
>>> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit
>>> final job output data.
>>> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the
>>> final job output data.
>>>  ...
>>>
>>>  To me, that looks promising - like the job was successful. However, in
>>> the WORKER_ONLY containers, I see the following:
>>>
>>>  ...
>>>  14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for
>>> WORKER_ONLY
>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>> event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions,
>>> type=NodeDeleted, state=SyncConnected)
>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>>> partitionExchangeChildrenChanged (at least one worker is done sending
>>> partitions)
>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>> event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished,
>>> type=NodeDeleted, state=SyncConnected)
>>> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait threshold,
>>> 1 connections closed, releasing NettyClient.bootstrap resources now.
>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state
>>> changed, checking to see if it needs to restart
>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>>> exists
>>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1]
>>> saveVertices: Starting to save 2 vertices using 1 threads
>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting
>>> to save 2 vertices using 1 threads
>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state
>>> changed, checking to see if it needs to restart
>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>>> exists
>>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is
>>> empty! -
>>> /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>>> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling watcher
>>> java.lang.NullPointerException
>>>         at java.io.StringReader.<init>(StringReader.java:50)
>>>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>>         at org.json.JSONObject.<init>(JSONObject.java:402)
>>>         at
>>> org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>>         at
>>> org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>>         at org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>         at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>> event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady,
>>> type=NodeDeleted, state=SyncConnected)
>>>  14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>> event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions,
>>> type=NodeDeleted, state=SyncConnected)
>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>>> partitionExchangeChildrenChanged (at least one worker is done sending
>>> partitions)
>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>> event
>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>>> type=NodeDeleted, state=SyncConnected)
>>> ...
>>> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>> No lease on
>>> /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001:
>>> File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not
>>> have any open files.
>>>         at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>>         at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>>         at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>>         at
>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>>         at
>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>>  ...
>>>
>>>  I apologize for the wall of error messages, but I tried to leave in at
>>> least some of the parts that might be useful. I put the entire YARN log
>>> here: http://tny.cz/af229738
>>>
>>>  Has anyone ever seen this before? This is the command I'm using to run the job:
>>>
>>>  hadoop jar
>>> giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
>>> org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false
>>> -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000
>>> -Dgiraph.useInputSplitLocality=false
>>> org.apache.giraph.examples.SimpleShortestPathsComputation -vif
>>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
>>> -vip /user/spry/input -vof
>>> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>>> /user/spry/Shortest -w 1
>>>
>>>  My setup is still the same as in my other email, if you saw it:
>>>
>>>  I compiled Giraph with this command, and everything built successfully
>>> except "Apache Giraph Distribution", which I don't seem to need:
>>>
>>> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>>>
>>> I am running with the following components:
>>>
>>>  Single node cluster
>>>  Giraph 1.1
>>>  Hadoop 2.2.0 (Hortonworks)
>>>  Java 1.7.0_45
>>>
>>>  Thanks in advance,
>>> -Kristen Hardwick
>>>
>>>
>>>
>>
>
