giraph-user mailing list archives

From Kristen Hardwick <khardw...@spryinc.com>
Subject Re: DataStreamer Exception - LeaseExpiredException
Date Thu, 30 Jan 2014 21:29:23 GMT
Eli, Chuan,

Thanks for taking a look into my issue! GIRAPH-747 definitely seems to
address the exact issue I'm running into, even down to the class I thought
was causing the problem. I created a bug ticket a few days ago (GIRAPH-828)
which has the details of my environment, including the command I'm running
and the full logs where the problem occurs. I just linked my ticket to
GIRAPH-747, but if it makes sense for me to delete mine instead, please let
me know.

I will definitely put a comment in there so that people watching it are
aware of Chuan's patch. Avery Ching was asking me for more information in
the comments, so he might be able to help validate the solution.

Thanks again,
Kristen


On Wed, Jan 29, 2014 at 9:35 PM, Eli Reisman <apache.mailbox@gmail.com> wrote:

> Sorry, I do think this will solve it, and it makes sense that people are
> encountering the problem when using -w 1. I'll get this reviewed and
> committed (patch 747).
>
> Mohammed, any objections?
>
>
>
> On Wed, Jan 29, 2014 at 6:22 PM, Chuan Lei <leichuan@gmail.com> wrote:
>
>> Hi Kristen,
>>
>> I had this problem before and submitted a Jira ticket (GIRAPH-747) with a
>> patch. You may want to take a look at it. Hope that can solve your problem.
>>
>> Thanks,
>> Chuan
>>
>> On Jan 29, 2014, at 9:16 PM, Eli Reisman <apache.mailbox@gmail.com>
>> wrote:
>>
>> > Hi Kristen, thanks for posting this. During the port to YARN I
>> encountered some race problems with the output sequence. The YARN
>> implementation has to handle this a bit differently than the non-YARN and
>> although we got it figured out at the time, I haven't really looked at it
>> in many months and non-YARN Giraph has evolved quickly since then. It wouldn't
>> shock me if there is trouble here; if I recall, the solution seemed a bit
>> delicate.
>> >
>> > If you have some ideas for a patch, I'd be happy to review. I am pretty
>> strapped for time right now, but if you post a ticket to the Giraph JIRA and
>> no one else attempts a patch, I'm sure either Mohammed or I will take a
>> swipe at it eventually. Thanks!
>> >
>> > Eli
>> >
>> >
>> > On Mon, Jan 20, 2014 at 9:01 AM, Kristen Hardwick <
>> khardwick@spryinc.com> wrote:
>> > Sorry to bug everyone again, but does anyone have any ideas on this?
>> Please let me know if I'm leaving out any crucial information that could
>> get me some help.
>> >
>> > Thanks!
>> > Kristen
>> >
>> >
>> > On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <
>> khardwick@spryinc.com> wrote:
>> > Hi all,
>> >
>> > I had a very productive day today getting this stuff figured out.
>> Unfortunately, it appears that I've stumbled onto a possible race condition
>> during the cleanup step of the code for the application.
>> >
>> > I put some information here that explains why I think it is a race
>> condition. http://pastebin.com/Qswb98dq Basically, I tried the exact
>> same command twice, making no other changes - the first time it failed and
>> the second time it succeeded.
>> >
>> > This makes me think that the
>> LeaseExpiredException/DataStreamerException is caused because the files
>> have been cleaned up just before they are needed. Possibly inside the
>> BspServiceMaster, but I am not at all sure about that.
>> >
>> > Is anyone already aware of this? Should I log it as a bug? I do have
>> access to (DEBUG) logs of both the successful and failed attempts if anyone
>> wants to see them.
>> >
>> > Thanks,
>> > Kristen Hardwick
>> >
>> >
>> > On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <
>> khardwick@spryinc.com> wrote:
>> > Hi Avery (or anyone else that knows),
>> >
>> > Could you please give me some details that would help me find the past
>> threads that might address this issue? I searched Google with various
>> combinations of "giraph datastreamer exception yarn lease expired
>> zookeeper" and didn't really come up with anything that seemed relevant.
>> >
>> > Is it possible that it's just a memory issue on my end? I'm running
>> inside a VM - a single node cluster with 8 GB of memory allocated to it.
>> Could that have anything to do with it? Right now I'm investigating the
>> code to try to lower the amount of memory allocated to the containers.
>> >
>> > Thanks,
>> > Kristen
>> >
>> >
>> > On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <aching@apache.org> wrote:
>> > This looks more like the Zookeeper/YARN issues mentioned in the past.
>>  Unfortunately, I do not have a YARN instance to test this with.  Does
>> anyone else have any insights here?
>> >
>> >
>> > On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>> >> Hi all, I'm requesting help again! I'm trying to get this
>> SimpleShortestPathsComputation example working, but I'm stuck again. Now
>> the job begins to run and seems to work until the final step (it performs 3
>> supersteps), but the overall job is failing.
>> >>
>> >> In the master, among other things, I see:
>> >>
>> >> ...
>> >> 14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>> >> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took
>> 0.708 seconds.
>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158
>> seconds.
>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344
>> seconds.
>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064
>> seconds.
>> >> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162
>> seconds.
>> >> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>> >> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit
>> final job output data.
>> >> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the
>> final job output data.
>> >> ...
>> >>
>> >> To me, that looks promising - like the job was successful. However, in
>> the WORKER_ONLY containers, I see these things:
>> >>
>> >> ...
>> >> 14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for
>> WORKER_ONLY
>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>> unprocessed event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions,
>> type=NodeDeleted, state=SyncConnected)
>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>> partitionExchangeChildrenChanged (at least one worker is done sending
>> partitions)
>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>> unprocessed event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished,
>> type=NodeDeleted, state=SyncConnected)
>> >> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait
>> threshold, 1 connections closed, releasing NettyClient.bootstrap resources
>> now.
>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job
>> state changed, checking to see if it needs to restart
>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>> exists
>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>> >> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1]
>> saveVertices: Starting to save 2 vertices using 1 threads
>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting
>> to save 2 vertices using 1 threads
>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job
>> state changed, checking to see if it needs to restart
>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>> exists
>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is
>> empty! -
>> /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>> >> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling
>> watcher
>> >> java.lang.NullPointerException
>> >>         at java.io.StringReader.<init>(StringReader.java:50)
>> >>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>> >>         at org.json.JSONObject.<init>(JSONObject.java:402)
>> >>         at
>> org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>> >>         at
>> org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>> >>         at
>> org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>> >>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>> >>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>> unprocessed event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady,
>> type=NodeDeleted, state=SyncConnected)
>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>> unprocessed event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions,
>> type=NodeDeleted, state=SyncConnected)
>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>> partitionExchangeChildrenChanged (at least one worker is done sending
>> partitions)
>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and
>> unprocessed event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>> type=NodeDeleted, state=SyncConnected)
>> >> ...
>> >> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>> >>
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001:
>> File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not
>> have any open files.
>> >>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>> >>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>> >>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>> >>         at
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>> >>         at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>> >> ...
>> >>
>> >> I apologize for the wall of error messages, but I tried to leave in at
>> least some of the parts that might be useful. I put the entire YARN log
>> here: http://tny.cz/af229738
>> >>
>> >> Has anyone ever seen this before? This is the command I'm using to run:
>> >>
>> >> hadoop jar
>> giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
>> org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false
>> -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000
>> -Dgiraph.useInputSplitLocality=false
>> org.apache.giraph.examples.SimpleShortestPathsComputation -vif
>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
>> -vip /user/spry/input -vof
>> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>> /user/spry/Shortest -w 1
>> >>
>> >> My setup is still the same as the other email if you saw it:
>> >>
>> >> I compiled Giraph with this command, and everything built successfully
>> except "Apache Giraph Distribution" which it doesn't seem like I need:
>> >>
>> >> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>> >>
>> >> I am running with the following components:
>> >>
>> >> Single node cluster
>> >> Giraph 1.1
>> >> Hadoop 2.2.0 (Hortonworks)
>> >> Java 1.7.0_45
>> >>
>> >> Thanks in advance,
>> >> -Kristen Hardwick
>> >>
>> >
>> >
>> >
>> >
>> >
>>
>>
>
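
Note on the race hypothesis discussed in this thread: if the LeaseExpiredException really is caused by output files being cleaned up while a worker is still writing them, the general remedy is to make cleanup wait until every writer has finished. The following is only a minimal sketch of that ordering in plain Java on the local filesystem. It is not Giraph or HDFS code and it is not the GIRAPH-747 patch; the class name, method name, and file names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only: plain Java, local filesystem, no Giraph/HDFS APIs.
final class CleanupOrdering {

    // Starts `writers` threads that each write one part file, waits for all of
    // them, then deletes the temporary directory. Returns how many part files
    // existed at the moment cleanup began.
    static int writeThenCleanup(int writers) {
        try {
            Path tmpDir = Files.createTempDirectory("attempt_");
            CountDownLatch finished = new CountDownLatch(writers);
            for (int i = 0; i < writers; i++) {
                Path part = tmpDir.resolve("part-" + i);
                new Thread(() -> {
                    try {
                        Files.write(part, "id\tvalue\n".getBytes(StandardCharsets.UTF_8));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    } finally {
                        finished.countDown(); // signal even if the write failed
                    }
                }).start();
            }
            // Cleanup must not start before every writer has signalled completion.
            finished.await();
            List<Path> parts;
            try (Stream<Path> listing = Files.list(tmpDir)) {
                parts = listing.collect(Collectors.toList());
            }
            for (Path p : parts) {
                Files.delete(p);
            }
            Files.delete(tmpDir);
            return parts.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("part files present at cleanup: " + writeThenCleanup(3));
    }
}
```

Without the `finished.await()` call, the delete loop could run while a writer is still working. That is the kind of ordering problem the thread suspects, although the sketch does not reproduce the HDFS lease behaviour itself.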

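Note on the NullPointerException in the worker log quoted above: the trace shows it thrown from the `java.io.StringReader` constructor, reached through `org.json.JSONObject` from `BspService.getJobState`, immediately after the "Job state path is empty!" message. That is consistent with a null string being handed to the JSON parser, though the thread does not confirm it. A minimal sketch of a null guard, assuming that reading; `JobStateGuard` and `openJobState` are hypothetical names, the JSON payload is a placeholder, and none of this is Giraph's actual code.

```java
import java.io.StringReader;
import java.util.Optional;

// Hypothetical illustration only: shows why a null string fails in the
// StringReader constructor and how a guard avoids the exception.
final class JobStateGuard {

    // Returns a reader only when there is data to parse; null or empty input
    // (for example, a node whose data was already removed) yields Optional.empty().
    static Optional<StringReader> openJobState(String rawJobState) {
        if (rawJobState == null || rawJobState.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new StringReader(rawJobState));
    }

    public static void main(String[] args) {
        String missing = null;
        try {
            new StringReader(missing); // unguarded: same top frame as in the trace
        } catch (NullPointerException e) {
            System.out.println("unguarded call threw NullPointerException");
        }
        System.out.println("guarded, null input present? "
                + openJobState(missing).isPresent());
        System.out.println("guarded, placeholder JSON present? "
                + openJobState("{\"k\":\"v\"}").isPresent());
    }
}
```

A caller using such a guard can treat the empty case as "job state no longer available" instead of letting the exception escape from the watcher callback.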