giraph-user mailing list archives

From Chuan Lei <leich...@gmail.com>
Subject Re: DataStreamer Exception - LeaseExpiredException
Date Thu, 30 Jan 2014 02:22:23 GMT
Hi Kristen,

I had this problem before and submitted a JIRA ticket (GIRAPH-747) with a patch. You may want
to take a look at it. I hope that solves your problem.

Thanks,
Chuan

On Jan 29, 2014, at 9:16 PM, Eli Reisman <apache.mailbox@gmail.com> wrote:

> Hi Kristen, thanks for posting this. During the port to YARN I encountered some race
> problems with the output sequence. The YARN implementation has to handle this a bit differently
> than the non-YARN one, and although we got it figured out at the time, I haven't really looked
> at it in many months, and non-YARN Giraph has evolved quickly since then. It wouldn't shock me
> if there is trouble here; if I recall, the solution seemed a bit delicate.
> 
> If you have some ideas for a patch, I'd be happy to review it. I am pretty strapped for time
> right now, but if you post a ticket to the Giraph JIRA and no one else attempts a patch, I'm
> sure either Mohammed or I will take a swipe at it eventually. Thanks!
> 
> Eli
> 
> 
> On Mon, Jan 20, 2014 at 9:01 AM, Kristen Hardwick <khardwick@spryinc.com> wrote:
> Sorry to bug everyone again, but does anyone have any ideas on this? Please let me know
> if I'm leaving out any crucial information that could get me some help.
> 
> Thanks!
> Kristen
> 
> 
> On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <khardwick@spryinc.com> wrote:
> Hi all,
> 
> I had a very productive day today getting this stuff figured out. Unfortunately, it appears
> that I've stumbled onto a possible race condition during the cleanup step of the code for
> the application.
> 
> I put some information here that explains why I think it is a race condition:
> http://pastebin.com/Qswb98dq
> Basically, I tried the exact same command twice, making no other changes: the first time
> it failed and the second time it succeeded.
> 
> This makes me think that the LeaseExpiredException/DataStreamer Exception occurs because
> the files have been cleaned up just before they are needed, possibly inside the
> BspServiceMaster, but I am not at all sure about that.
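
The ordering problem suspected here (cleanup running while a worker is still writing its
output) can be illustrated with a minimal sketch. It assumes nothing about how Giraph actually
coordinates its master and workers: the class CleanupOrdering, the method run, and the event
strings are all hypothetical names chosen for illustration. The sketch only shows the general
technique of making the cleanup side wait on a java.util.concurrent.CountDownLatch until every
writer has signalled that its output is saved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; none of these names come from Giraph.
final class CleanupOrdering {

    // Runs `workers` writer threads and returns the order in which events occurred.
    static List<String> run(int workers) {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch outputsSaved = new CountDownLatch(workers);
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < workers; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                // Stand-in for a worker flushing its part file.
                events.add("worker-" + id + " saved output");
                // Signal only after the output is recorded.
                outputsSaved.countDown();
            });
            threads.add(t);
            t.start();
        }

        try {
            // The cleanup side blocks until every worker has signalled.
            outputsSaved.await();
            events.add("master cleanup");
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // "master cleanup" is always the last line printed.
        run(2).forEach(System.out::println);
    }
}
```

Without the await() call, "master cleanup" could be recorded before some worker has saved its
output, which is the kind of intermittent, run-to-run difference described above.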
> 
> Is anyone already aware of this? Should I log it as a bug? I do have access to (DEBUG)
> logs of both the successful and failed attempts if anyone wants to see them.
> 
> Thanks,
> Kristen Hardwick
> 
> 
> On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <khardwick@spryinc.com> wrote:
> Hi Avery (or anyone else that knows),
> 
> Could you please give me some details that would help me find the past threads that might
> address this issue? I searched Google with various combinations of "giraph datastreamer
> exception yarn lease expired zookeeper" and didn't really come up with anything that seemed
> relevant.
> 
> Is it possible that it's just a memory issue on my end? I'm running inside a VM - a single
> node cluster with 8 GB of memory allocated to it. Could that have anything to do with it?
> Right now I'm investigating the code to try to lower the amount of memory allocated to the
> containers.
> 
> Thanks,
> Kristen
> 
> 
> On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <aching@apache.org> wrote:
> This looks more like the ZooKeeper/YARN issues mentioned in the past. Unfortunately,
> I do not have a YARN instance to test this with. Does anyone else have any insights here?
> 
> 
> On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>> Hi all, I'm requesting help again! I'm trying to get this SimpleShortestPathsComputation
>> example working, but I'm stuck again. Now the job begins to run and seems to work until the
>> final step (it performs 3 supersteps), but the overall job is failing.
>> 
>> In the master, among other things, I see:
>> 
>> ...
>> 14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took 0.708 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit final job output data.
>> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the final job output data.
>> ...
>> 
>> To me, that looks promising - like the job was successful. However, in the WORKER_ONLY
>> containers, I see these things:
>> 
>> ...
>> 14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for WORKER_ONLY
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing NettyClient.bootstrap resources now.
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1] saveVertices: Starting to save 2 vertices using 1 threads
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting to save 2 vertices using 1 threads
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is empty! - /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling watcher
>> java.lang.NullPointerException
>>         at java.io.StringReader.<init>(StringReader.java:50)
>>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>         at org.json.JSONObject.<init>(JSONObject.java:402)
>>         at org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>         at org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>         at org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady, type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>> ...
>> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001: File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not have any open files.
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>> ...
>> 
>> I apologize for the wall of error messages, but I tried to leave in at least some
>> of the parts that might be useful. I put the entire YARN log here: http://tny.cz/af229738
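
On the NullPointerException in the worker log: the trace starts in StringReader.<init>, reached
through JSONObject.<init> from BspService.getJobState, immediately after the message "Job state
path is empty!". That ordering suggests a null string reached the JSON parser, although the
log alone does not prove it. The following is a minimal sketch of guarding against that case,
not Giraph's actual code: the class JobStateGuard, its methods, and the sample payload are
hypothetical, and java.io.StringReader stands in for the JSON parser so that the example needs
no external libraries.

```java
import java.io.StringReader;
import java.util.Optional;

// Hypothetical illustration only; this is not Giraph's BspService code.
final class JobStateGuard {

    // Reproduces the failure mode at the top of the trace:
    // StringReader's constructor dereferences its argument.
    static boolean unguardedParseThrows(String rawState) {
        try {
            new StringReader(rawState);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Returns the raw state only when there is something to parse.
    static Optional<String> readJobState(String rawState) {
        if (rawState == null || rawState.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(rawState);
    }

    public static void main(String[] args) {
        System.out.println(unguardedParseThrows(null));              // true
        System.out.println(readJobState(null).isPresent());          // false
        // The payload below is a made-up placeholder, not a real Giraph job state.
        System.out.println(readJobState("{\"k\":\"v\"}").isPresent()); // true
    }
}
```

A caller would then parse the JSON only when the Optional is present and treat an empty result
as "no job state yet" instead of letting the exception escape from the watcher callback.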
>> 
>> Has anyone ever seen this before? This is the command I'm using to run:
>> 
>> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000 -Dgiraph.useInputSplitLocality=false org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/spry/input -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/spry/Shortest -w 1
>> 
>> My setup is still the same as the other email if you saw it:
>> 
>> I compiled Giraph with this command, and everything built successfully except "Apache
>> Giraph Distribution", which I don't seem to need:
>> 
>> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>> 
>> I am running with the following components:
>> 
>> Single node cluster
>> Giraph 1.1
>> Hadoop 2.2.0 (Hortonworks)
>> Java 1.7.0_45
>> 
>> Thanks in advance,
>> -Kristen Hardwick
>> 
> 
> 
> 
> 
> 

