hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3606) Spark container fails to launch if spark-assembly.jar file has different timestamp
Date Mon, 11 May 2015 17:13:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538190#comment-14538190 ]

Steve Loughran commented on YARN-3606:
--------------------------------------

Looking at the timestamp is a strategy chosen on a key assumption: there is a single
artifact to localise, downloaded from a single shared filesystem. Using local filesystems,
each with its own cached copy of the artifact, isn't what the NM expects to be doing. If it is,
then the normal localisation checks aren't


I think the checksum is probably omitted because you have to read the whole file to see if it has
changed; plus there's the cost of actually recalculating that checksum prior to launching
every container. Timestamps aren't too great though: the check as it stands will reject the
same file with two different times *or* two differently sized files with the same timestamp.
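
To make the trade-off concrete, here is a minimal Java sketch contrasting the two kinds of check. It is not the FSDownload code: the class name LocalizationCheckSketch and its helpers are hypothetical, SHA-256 is an assumed choice of checksum, and the two temp files merely stand in for per-node copies of the same jar. The timestamp comparison is a single long comparison, while the checksum has to stream every byte of the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration only; this is not how FSDownload is implemented. */
final class LocalizationCheckSketch {

    /** Cheap check: compare the recorded modification time with the observed one. */
    static boolean timestampMatches(long expectedMillis, long actualMillis) {
        return expectedMillis == actualMillis;
    }

    /** Expensive check: stream the whole file through SHA-256 (an assumed algorithm). */
    static String sha256Hex(Path file) throws IOException {
        MessageDigest digest = newSha256();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    /** Same digest over an in-memory byte array. */
    static String sha256Hex(byte[] data) {
        return toHex(newSha256().digest(data));
    }

    private static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two stand-in "node-local" copies with identical bytes.
        Path a = Files.createTempFile("assembly-a", ".jar");
        Path b = Files.createTempFile("assembly-b", ".jar");
        byte[] content = "same bytes".getBytes(StandardCharsets.UTF_8);
        Files.write(a, content);
        Files.write(b, content);

        // Give them the two modification times from the reported error.
        Files.setLastModifiedTime(a, FileTime.fromMillis(1430944255000L));
        Files.setLastModifiedTime(b, FileTime.fromMillis(1430944249000L));

        boolean sameTime = timestampMatches(
                Files.getLastModifiedTime(a).toMillis(),
                Files.getLastModifiedTime(b).toMillis());
        boolean sameSum = sha256Hex(a).equals(sha256Hex(b));

        System.out.println("timestamps match: " + sameTime);
        System.out.println("checksums match:  " + sameSum);

        Files.delete(a);
        Files.delete(b);
    }
}
```

In this sketch the two copies fail the timestamp comparison even though their contents are identical, which is the situation the issue describes; the checksum comparison would accept them, but only after reading both files in full.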

> Spark container fails to launch if spark-assembly.jar file has different timestamp
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3606
>                 URL: https://issues.apache.org/jira/browse/YARN-3606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.6.0
>         Environment: YARN 2.6.0
> Spark 1.3.1
>            Reporter: Michael Le
>            Priority: Minor
>
> In a YARN cluster, when submitting a Spark job, the Spark job will fail to run because
YARN fails to launch containers on the other nodes (not the node where the job submission
took place).
> YARN checks whether the spark-assembly.jar file is the same by looking at the timestamps. This check
will fail when the spark-assembly.jar is the same but was copied to the location at a different
time.
> YARN throws this exception:
> 15/05/07 20:13:22 INFO yarn.ExecutorRunnable: Setting up executor with commands: List({{JAVA_HOME}}/bin/java,
-server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
'-Dspark.driver.port=52357', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend,
--driver-url, akka.tcp://sparkDriver@xxx:52357/user/CoarseGrainedScheduler, --executor-id,
4, --hostname, xxx, --cores, 1, --app-id, application_1431047540996_0001, --user-class-path,
file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
> 15/05/07 20:13:22 INFO impl.ContainerManagementProtocolProxy: Opening proxy : xxx:34165
> 15/05/07 20:13:27 INFO yarn.YarnAllocator: Completed container container_1431047540996_0001_02_000005
(state: COMPLETE, exit status: -1000)
> 15/05/07 20:13:27 INFO yarn.YarnAllocator: Container marked as failed: container_1431047540996_0001_02_000005.
Exit status: -1000. Diagnostics: Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
changed on src filesystem (expected 1430944255000, was 1430944249000
> java.io.IOException: Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
changed on src filesystem (expected 1430944255000, was 1430944249000
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>         at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>         at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>         at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> The problem can be easily replicated by setting up two nodes and copying the spark-assembly.jar
to each node but changing the timestamp of the file on one of the nodes. Then execute spark-shell
--master yarn-client. Observe the nodemanager log on the other node to find the error.
> The workaround is to make sure the jar file has the same timestamp on every node. But it looks like perhaps
the function that does the copy and check of the jar file (org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253))
should check for file similarity using a checksum rather than a timestamp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
