giraph-dev mailing list archives

From "Kristen Hardwick (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (GIRAPH-828) Race condition during Giraph cleanup phase
Date Wed, 29 Jan 2014 15:28:10 GMT

     [ https://issues.apache.org/jira/browse/GIRAPH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristen Hardwick updated GIRAPH-828:
------------------------------------

    Attachment: noChkpointCleanup2.txt
                noChkpointCleanup1.txt

No problem. I just attached the full debug logs of the first two runs where I removed the checkpoint cleanup flag. The error is the LeaseExpiredException thrown when the failing container tries to write whichever file does not end up in the output directory.

I can give you logs from other runs if you want them. They all have the same behavior. The
Application Master works fine, one container writes its output successfully, and the other
container can't write its output because of a "File does not exist" LeaseExpiredException
that seems to manifest after the first container has cleaned up.
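
For reference, here is a minimal sketch of how the HDFS side of this failure can show up in isolation. It is not Giraph code: the paths are made up, the "fast container finishes first" ordering is forced by doing the delete in the same process, and it assumes fs.defaultFS points at an HDFS cluster (against a local filesystem the close would not fail the same way). The shape matches the attached logs: a part file is open under a shared _temporary directory, the directory is recursively deleted by another task's cleanup, and the original writer's close fails with a LeaseExpiredException / "File does not exist".

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch only: approximates the HDFS-side symptom (lease lost
 * because another cleanup deleted the temporary tree). Not Giraph's code;
 * paths are hypothetical and fs.defaultFS is assumed to be an HDFS cluster.
 */
public class LeaseRaceSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path tmpRoot  = new Path("/tmp/lease-race/_temporary");
    Path partFile = new Path(tmpRoot, "attempt_0/part-m-00002");

    // "Slow container": opens its part file under _temporary and keeps the
    // stream (and therefore the HDFS lease on the path) open while writing.
    FSDataOutputStream out = fs.create(partFile, true);
    out.writeBytes("1\t0.24178880797750438\n");
    out.hflush();

    // "Fast container" finishing first: its job-level cleanup recursively
    // removes the shared _temporary tree while the other writer is active.
    fs.delete(tmpRoot, true);

    // The slow writer's close now finds no lease/file on the NameNode and
    // fails with a LeaseExpiredException ("No lease on ... File does not
    // exist"), the same error seen in noChkpointCleanup1.txt/2.txt.
    out.close();
  }
}
{code}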

> Race condition during Giraph cleanup phase
> ------------------------------------------
>
>                 Key: GIRAPH-828
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-828
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>         Environment: Giraph 1.1,
> Hadoop 2.2.0,
> Java 1.7.0_45
>            Reporter: Kristen Hardwick
>             Fix For: 1.1.0
>
>         Attachments: noChkpointCleanup1.txt, noChkpointCleanup2.txt
>
>
> Running the exact same launch command twice, with no other changes, produces different results: for example, the first time the application will fail and the second time it will succeed. As evidence, this is what happened when I tried to run the SimpleShortestPathsComputation example: [PasteBin Link|http://pastebin.com/Qswb98dq]. This behavior is consistently reproducible, although the job fails much more often than it succeeds.
> The PageRank example has the same issue; in fact, the timing problem is even more obvious there. I followed the directions [here|http://marsty5.com/2013/05/29/run-example-in-giraph-pagerank/] and ran the SimplePageRankComputation example with this command:
> {code}
> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
>   org.apache.giraph.GiraphRunner \
>   -Dgiraph.cleanupCheckpointsAfterSuccess=false \
>   -Dgiraph.logLevel=DEBUG \
>   -Dgiraph.SplitMasterWorker=false \
>   -Dgiraph.zkList="localhost:2181" \
>   -Dgiraph.zkSessionMsecTimeout=600000 \
>   -Dgiraph.useInputSplitLocality=false \
>   org.apache.giraph.examples.SimplePageRankComputation \
>   -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
>   -vip /user/spry/input \
>   -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>   -op /user/spry/PageRank \
>   -w 2 \
>   -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
> {code}
> The job technically failed, but I did get output from part file 1 (I expected to have
values printed for all vertices between 0 and 4).
> {code}
> 0    0.16682289373110673
> 4    0.17098446073203233
> 2    0.17098446073203233
> {code}
> I ran the exact same command again (with no changes to the environment except for deleting
the /user/spry/PageRank HDFS directory) and got no part files. I ran it one more time and
got only the data from part file 2:
> {code}
> 1    0.24178880797750438
> 3    0.24178880797750438
> {code}
> I tried a few more times, but I haven't been able to see both part files in the output
directory yet.
> In the logs, I see hopeful things like this:
> {code}
> 14/01/22 09:47:48 INFO master.MasterThread: setup: Took 3.144 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: input superstep: Took 2.582 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 0: Took 0.827 seconds.
> ...
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 30: Took 0.56 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: shutdown: Took 2.591 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: total: Took 30.18 seconds.
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master is ready to commit final job output
data.
> {code}
> and like this:
> {code}
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master has committed the final job output
data.
> 14/01/22 09:47:48 DEBUG ipc.Client: Stopping client
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to hadoop2.j7.master/127.0.0.1:8020
from yarn: closed
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to hadoop2.j7.master/127.0.0.1:8020
from yarn: stopped, remaining connections 0
> {code}
> Really, only one of the containers even fails, and it fails with a DataStreamer/LeaseExpiredException saying that the part file no longer exists. This log is from the run where part file 2 was not written out:
> {code}
> 14/01/22 09:47:48 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /user/spry/PageRank/_temporary/1/_temporary/attempt_1389643303411_0029_m_000002_1/part-m-00002:
File does not exist. Holder DFSClient_NONMAPREDUCE_1153765281_1 does not have any open files.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
> ...
> 14/01/22 09:47:48 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering
health on /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
on superstep 30
> 14/01/22 09:47:48 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x1438d139efc0039,
packet:: clientPath:null serverPath:null finished:false header:: 589,2  replyHeader:: 589,13968,-101
 request:: '/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2,-1
 response:: null
> 14/01/22 09:47:48 ERROR graph.GraphTaskManager: run: Worker failure failed on another
RuntimeException, original expection will be rethrown
> java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't delete
/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
>         at org.apache.giraph.worker.BspServiceWorker.unregisterHealth(BspServiceWorker.java:656)
> {code}
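> The secondary failure above ("run: Worker failure failed on another RuntimeException, original expection will be rethrown") follows a common error-handling pattern: the cleanup performed while reporting the first error (here, deleting the worker-health znode) fails as well, and the original exception is what gets rethrown. Below is a hedged, self-contained illustration of that pattern, not Giraph's actual source; writePartFile() and unregisterHealth() are hypothetical stand-ins for the HDFS write and the ZooKeeper delete shown in the log.
> {code}
> // Illustration only; method names are hypothetical stand-ins.
> public class FailureHandlingSketch {
>   static void writePartFile() {
>     throw new RuntimeException("LeaseExpiredException: File does not exist");
>   }
>   static void unregisterHealth() {
>     throw new IllegalStateException("unregisterHealth: KeeperException - Couldn't delete node");
>   }
>   public static void main(String[] args) {
>     try {
>       writePartFile();                    // primary failure (the HDFS lease error)
>     } catch (RuntimeException original) {
>       try {
>         unregisterHealth();               // secondary failure during cleanup
>       } catch (RuntimeException cleanupFailure) {
>         System.err.println("run: Worker failure failed on another RuntimeException, "
>             + "original exception will be rethrown: " + cleanupFailure);
>       }
>       throw original;                     // the first error is what surfaces
>     }
>   }
> }
> {code}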



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
