hadoop-yarn-issues mailing list archives

From "Michael Le (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-3606) Spark container fails to launch if spark-assembly.jar file has different timestamp
Date Fri, 08 May 2015 20:35:01 GMT
Michael Le created YARN-3606:
--------------------------------

             Summary: Spark container fails to launch if spark-assembly.jar file has different timestamp
                 Key: YARN-3606
                 URL: https://issues.apache.org/jira/browse/YARN-3606
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.6.0
         Environment: YARN 2.6.0, Spark 1.3.1
            Reporter: Michael Le
            Priority: Minor


In a YARN cluster, a submitted Spark job fails to run because YARN cannot launch containers
on the other nodes (any node other than the one where the job was submitted).

YARN decides whether the spark-assembly.jar on a node is the expected file by comparing
modification timestamps. This check fails when the jar's contents are identical but the file
was copied into place at a different time.
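
The check described above can be illustrated with a minimal sketch. This is not the
FSDownload source; the class and method names are hypothetical, and the sketch assumes only
what the diagnostics in this report show: a recorded modification time is compared with the
one observed on the node, and any difference rejects the resource. The path and the two
values are taken from the log in this report.

```java
import java.io.IOException;

// Illustrative only: not the actual FSDownload source.
final class TimestampCheckSketch {

    // Hypothetical helper: true when the recorded and observed mtimes agree.
    static boolean timestampsMatch(long expectedMillis, long actualMillis) {
        return expectedMillis == actualMillis;
    }

    // Hypothetical helper: rejects the resource on any mtime difference.
    // The message text mirrors the diagnostics quoted in this report.
    static void verifyTimestamp(String resource, long expectedMillis, long actualMillis)
            throws IOException {
        if (!timestampsMatch(expectedMillis, actualMillis)) {
            throw new IOException("Resource " + resource
                    + " changed on src filesystem (expected " + expectedMillis
                    + ", was " + actualMillis);
        }
    }

    public static void main(String[] args) {
        String jar = "file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/"
                + "spark-assembly-1.3.1-hadoop2.6.0.jar";
        try {
            // Values from the log: the two copies differ by six seconds.
            verifyTimestamp(jar, 1430944255000L, 1430944249000L);
            System.out.println("timestamps match");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the comparison is exact, a copy of the same jar made a few seconds later is treated
as a changed resource.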

YARN throws this exception:

15/05/07 20:13:22 INFO yarn.ExecutorRunnable: Setting up executor with commands: List({{JAVA_HOME}}/bin/java,
-server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
'-Dspark.driver.port=52357', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend,
--driver-url, akka.tcp://sparkDriver@xxx:52357/user/CoarseGrainedScheduler, --executor-id,
4, --hostname, xxx, --cores, 1, --app-id, application_1431047540996_0001, --user-class-path,
file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/05/07 20:13:22 INFO impl.ContainerManagementProtocolProxy: Opening proxy : xxx:34165
15/05/07 20:13:27 INFO yarn.YarnAllocator: Completed container container_1431047540996_0001_02_000005
(state: COMPLETE, exit status: -1000)
15/05/07 20:13:27 INFO yarn.YarnAllocator: Container marked as failed: container_1431047540996_0001_02_000005.
Exit status: -1000. Diagnostics: Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
changed on src filesystem (expected 1430944255000, was 1430944249000
java.io.IOException: Resource file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar
changed on src filesystem (expected 1430944255000, was 1430944249000
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


The problem is easy to reproduce: set up two nodes, copy the spark-assembly.jar to each node,
and change the file's timestamp on one of them. Then run spark-shell --master yarn-client and
check the NodeManager log on the other node for the error.

The workaround is to make sure the jar file has the same timestamp on every node. However, it
looks like the function that copies and checks the jar file
(org.apache.hadoop.yarn.util.FSDownload.copy, FSDownload.java:253) should perhaps check for
file similarity using a checksum rather than a timestamp.
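
As a rough sketch of what a checksum-based comparison could look like, the following uses
only the Java standard library. It is not a proposed patch to FSDownload: the class name,
the method names, and the choice of SHA-256 are assumptions made for illustration. It creates
two temporary files with identical bytes, gives them the two modification times from the log,
and compares them both ways.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: names and the SHA-256 choice are assumptions.
final class ChecksumCompareSketch {

    // Streams the file through SHA-256 so large jars are not loaded into memory.
    static byte[] sha256(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return digest.digest();
    }

    // True when both files hash to the same digest, regardless of mtime.
    static boolean sameContent(Path a, Path b) throws IOException {
        return MessageDigest.isEqual(sha256(a), sha256(b));
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("assembly-a", ".jar");
        Path b = Files.createTempFile("assembly-b", ".jar");
        try {
            byte[] payload = "identical jar bytes".getBytes(StandardCharsets.UTF_8);
            Files.write(a, payload);
            Files.write(b, payload);
            // Same content, modification times taken from the log in this report.
            Files.setLastModifiedTime(a, FileTime.fromMillis(1430944255000L));
            Files.setLastModifiedTime(b, FileTime.fromMillis(1430944249000L));

            boolean sameTimestamp =
                    Files.getLastModifiedTime(a).equals(Files.getLastModifiedTime(b));
            System.out.println("same timestamp: " + sameTimestamp);
            System.out.println("same content:   " + sameContent(a, b));
        } finally {
            Files.deleteIfExists(a);
            Files.deleteIfExists(b);
        }
    }
}
```

The trade-off is cost: hashing reads every byte of the jar, whereas the timestamp comparison
needs only file metadata.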




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
