giraph-user mailing list archives

From Eli Reisman <apache.mail...@gmail.com>
Subject Re: DataStreamer Exception - LeaseExpiredException
Date Thu, 30 Jan 2014 02:16:13 GMT
Hi Kristen, thanks for posting this. During the port to YARN I encountered
some race problems with the output sequence. The YARN implementation has to
handle this a bit differently than the non-YARN one, and although we got it
figured out at the time, I haven't really looked at it in many months and
non-YARN Giraph has evolved quickly since then. It wouldn't shock me if there
is trouble here; as I recall, the solution seemed a bit delicate.

If you have some ideas for a patch I'd be happy to review it. I am pretty
strapped for time right now, but if you post a ticket to the Giraph JIRA and
no one else attempts a patch, I'm sure either Mohammed or I will take a
swipe at it eventually. Thanks!

Eli


On Mon, Jan 20, 2014 at 9:01 AM, Kristen Hardwick <khardwick@spryinc.com> wrote:

> Sorry to bug everyone again, but does anyone have any ideas on this?
> Please let me know if I'm leaving out any crucial information that could
> get me some help.
>
> Thanks!
> Kristen
>
>
> On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <khardwick@spryinc.com> wrote:
>
>> Hi all,
>>
>> I had a very productive day today getting this stuff figured out.
>> Unfortunately, it appears that I've stumbled onto a possible race condition
>> during the cleanup step of the code for the application.
>>
>> I put some information here that explains why I think it is a race
>> condition. http://pastebin.com/Qswb98dq Basically, I tried the exact
>> same command twice, making no other changes - the first time it failed and
>> the second time it succeeded.
>>
>> This makes me think that the LeaseExpiredException/DataStreamerException
>> occurs because the files are cleaned up just before they are needed,
>> possibly inside the BspServiceMaster, but I am not at all sure about
>> that.
>>
>> Is anyone already aware of this? Should I log it as a bug? I do have
>> access to (DEBUG) logs of both the successful and failed attempts if anyone
>> wants to see them.
>>
>> Thanks,
>> Kristen Hardwick
>>
>>
>> On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <khardwick@spryinc.com> wrote:
>>
>>> Hi Avery (or anyone else that knows),
>>>
>>> Could you please give me some details that would help me find the past
>>> threads that might address this issue? I searched Google with various
>>> combinations of "giraph datastreamer exception yarn lease expired
>>> zookeeper" and didn't really come up with anything that seemed relevant.
>>>
>>> Is it possible that it's just a memory issue on my end? I'm running
>>> inside a VM - a single node cluster with 8 GB of memory allocated to it.
>>> Could that have anything to do with it? Right now I'm investigating the
>>> code to try to lower the amount of memory allocated to the containers.
>>>
>>> Thanks,
>>> Kristen
>>>
>>>
>>> On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <aching@apache.org> wrote:
>>>
>>>>  This looks more like the Zookeeper/YARN issues mentioned in the
>>>> past.  Unfortunately, I do not have a YARN instance to test this with.
>>>> Does anyone else have any insights here?
>>>>
>>>>
>>>> On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>>>>
>>>>  Hi all, I'm requesting help again! I'm trying to get this
>>>> SimpleShortestPathsComputation example working, but I'm stuck again. Now
>>>> the job begins to run and seems to work until the final step (it performs
>>>> 3 supersteps), but the overall job is failing.
>>>>
>>>>  In the master, among other things, I see:
>>>>
>>>>  ...
>>>>  14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took 0.708
>>>> seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158
>>>> seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344
>>>> seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064
>>>> seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162
>>>> seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>>>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit
>>>> final job output data.
>>>> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the
>>>> final job output data.
>>>>  ...
>>>>
>>>>  To me, that looks promising - like the job was successful. However,
>>>> in the WORKER_ONLY containers, I see these things:
>>>>
>>>>  ...
>>>>  14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for
>>>> WORKER_ONLY
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>>> event
>>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions,
>>>> type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>>>> partitionExchangeChildrenChanged (at least one worker is done sending
>>>> partitions)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>>> event
>>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished,
>>>> type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait threshold,
>>>> 1 connections closed, releasing NettyClient.bootstrap resources now.
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state
>>>> changed, checking to see if it needs to restart
>>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>>>> exists
>>>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1]
>>>> saveVertices: Starting to save 2 vertices using 1 threads
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting
>>>> to save 2 vertices using 1 threads
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state
>>>> changed, checking to see if it needs to restart
>>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>>>> exists
>>>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is
>>>> empty! -
>>>> /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>>>> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling
>>>> watcher
>>>> java.lang.NullPointerException
>>>>         at java.io.StringReader.<init>(StringReader.java:50)
>>>>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>>>         at org.json.JSONObject.<init>(JSONObject.java:402)
>>>>         at
>>>> org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>>>         at
>>>> org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>>>         at
>>>> org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>>>         at
>>>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>>         at
>>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>>> event
>>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady,
>>>> type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>>> event
>>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions,
>>>> type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>>>> partitionExchangeChildrenChanged (at least one worker is done sending
>>>> partitions)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>>>> event
>>>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>>>> type=NodeDeleted, state=SyncConnected)
>>>> ...
>>>> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>>>> No lease on
>>>> /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001:
>>>> File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not
>>>> have any open files.
>>>>         at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>>>         at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>>>         at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>>>         at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>>>         at
>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>>>  ...
>>>>
>>>>  I apologize for the wall of error messages, but I tried to leave in at
>>>> least some of the parts that might be useful. I put the entire YARN log
>>>> here: http://tny.cz/af229738
>>>>
>>>>  Has anyone ever seen this before? This is the command I'm using to
>>>> run:
>>>>
>>>>  hadoop jar
>>>> giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
>>>> org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false
>>>> -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000
>>>> -Dgiraph.useInputSplitLocality=false
>>>> org.apache.giraph.examples.SimpleShortestPathsComputation -vif
>>>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
>>>> -vip /user/spry/input -vof
>>>> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>>>> /user/spry/Shortest -w 1
>>>>
>>>>  My setup is still the same as the other email if you saw it:
>>>>
>>>>  I compiled Giraph with this command, and everything built
>>>> successfully except "Apache Giraph Distribution", which I don't seem to
>>>> need:
>>>>
>>>> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>>>>
>>>> I am running with the following components:
>>>>
>>>>  Single node cluster
>>>>  Giraph 1.1
>>>>  Hadoop 2.2.0 (Hortonworks)
>>>>  Java 1.7.0_45
>>>>
>>>>  Thanks in advance,
>>>> -Kristen Hardwick
>>>>
>>>>
>>>>
>>>
>>
>
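
A note on the NullPointerException in the worker log above: the trace runs
from BspService.getJobState through JSONObject and JSONTokener into
StringReader, right after the "Job state path is empty!" message. That
suggests a null string reached the JSON parser, possibly because the job
state znode held no data at that moment. The following is only a minimal
sketch of a null guard placed in front of such a parse step. It is not the
actual Giraph code or a proposed patch; the class JobStateGuard, its decode
method, and the sample payload are hypothetical names made up for
illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Giraph. */
public final class JobStateGuard {
    private JobStateGuard() { }

    /**
     * Turns the raw bytes read from a znode into a string.
     * Returns Optional.empty() when the znode held no data (null or
     * zero-length), so the caller can skip JSON parsing instead of
     * passing a null string to the parser.
     */
    public static Optional<String> decode(byte[] znodeData) {
        if (znodeData == null || znodeData.length == 0) {
            return Optional.empty();
        }
        return Optional.of(new String(znodeData, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Empty znode: nothing to parse, so the caller would return early.
        System.out.println(decode(null).isPresent());

        // Made-up payload; the real job state format is not shown in this thread.
        byte[] sample = "{\"state\":\"FINISHED\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(sample).orElse("<no job state>"));
    }
}
```

A guard like this would only silence the watcher error. It would not address
the LeaseExpiredException, which points at the output file disappearing while
a worker is still writing to it.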
