spark-issues mailing list archives

From "Stavros Kontopoulos (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
Date Thu, 30 Nov 2017 19:03:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273182#comment-16273182
] 

Stavros Kontopoulos edited comment on SPARK-22657 at 11/30/17 7:02 PM:
-----------------------------------------------------------------------

[~stevel@apache.org] I guess it is possible to set up that config and bypass the issue. S3n
is just an example fs implementation used to demonstrate the issue.


was (Author: skonto):
[~stevel@apache.org] I guess it is possible to set up that config and bypass the issue. S3n
is just an example fs implementation.

> Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22657
>                 URL: https://issues.apache.org/jira/browse/SPARK-22657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Stavros Kontopoulos
>
> To reproduce this issue, run:
> ./bin/spark-submit --master mesos://leader.mesos:5050 \
> --packages com.github.scopt:scopt_2.11:3.5.0 \
> --conf spark.cores.max=8 \
> --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \
> --conf spark.mesos.executor.docker.forcePullImage=true \
> --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
> --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out
> within a container created with the mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image.
> You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n"
> This can be reproduced with local[*] as well; there is no need to use Mesos, and this is not a Mesos bug.
> The specific Spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
> It can be built with sbt assembly in that directory.
> Using this code: https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb at the beginning of the main method...
> you get the following output: https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
> (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to get the modified job)
> The job works fine if --packages is not used.
> The commit that introduced this issue is (before that things work as expected):
> 5800144a54f5c0180ccf67392f32c3e8a51119b1 - [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark (5 months ago) <jerryshao> Thu, 6 Jul 2017 15:32:49 +0800
> The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311
> In https://github.com/apache/spark/pull/18235/files, check line 950; this is where a filesystem is first created.
> The FileSystem class is initialized there, before the main method of the Spark job is launched. The reason is that the --packages logic uses Hadoop libraries to download files.
> Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n implementation to add to the static map (SERVICE_FILE_SYSTEMS), which is filled when the FileSystem static members are first initialized and the first FileSystem instance is created.
> Later, in the Spark job's main method, where we try to access the s3n filesystem (create a second filesystem), we get the exception. At this point the app jar has the s3n implementation in it and it is on the classpath, but that scheme is not loaded in the static map of the FileSystem class.
> hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the problem is with the static map, which is filled once and only once.
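
The load-once behaviour described above can be illustrated with a small, self-contained Java sketch. It does not use Hadoop at all: LoadOnceRegistry, visibleProviders and the implementation names are hypothetical stand-ins, and the lookup order (explicit "fs.<scheme>.impl" entry first, then the one-time snapshot) is a simplification of how the Hadoop lookup appears to behave. The explicit entry is assumed here to be the kind of config workaround mentioned in the comment at the top of this message.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a scheme registry that is populated exactly once.
// All names here are hypothetical; this is not Hadoop code.
class LoadOnceRegistry {
    // Stand-in for "implementations visible on the classpath right now".
    static final Map<String, String> visibleProviders = new HashMap<>();

    // Stand-in for the static scheme -> implementation map.
    private static final Map<String, String> serviceFileSystems = new HashMap<>();
    private static boolean loaded = false;

    // Snapshot whatever is visible on first use, then never look again.
    private static synchronized void loadOnce() {
        if (!loaded) {
            serviceFileSystems.putAll(visibleProviders);
            loaded = true;
        }
    }

    // An explicit config entry wins; otherwise fall back to the one-time snapshot.
    static String lookup(String scheme, Map<String, String> conf) {
        String explicit = conf.get("fs." + scheme + ".impl");
        if (explicit != null) {
            return explicit;
        }
        loadOnce();
        String impl = serviceFileSystems.get(scheme);
        if (impl == null) {
            throw new IllegalStateException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Early lookup (the --packages download step in the report) freezes the registry.
        visibleProviders.put("file", "LocalFs");
        System.out.println(lookup("file", conf));

        // The app jar becomes visible later, but the snapshot is not refreshed.
        visibleProviders.put("s3n", "S3nFs");
        try {
            lookup("s3n", conf);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Naming the implementation explicitly bypasses the stale snapshot.
        conf.put("fs.s3n.impl", "S3nFs");
        System.out.println(lookup("s3n", conf));
    }
}
```

In this model, disabling a per-scheme instance cache would change nothing, because the failure happens in the scheme-to-class lookup rather than in any cache of created instances.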
> That's why we see two prints of the map contents in the output (gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used, that map is empty before creating the s3n filesystem, since up to that point the FileSystem class has not yet been loaded by the classloader.
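
For readers who cannot open the gist, the reflection step might look roughly like the sketch below. This is not the code from the gist: StaticMapPeek, Registry and ENTRIES are hypothetical names, and the nested Registry class only stands in for a class that keeps a private static map. Against Hadoop, the owner class and field name would presumably be FileSystem and SERVICE_FILE_SYSTEMS, which is an assumption based on the description above.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of reading a private static Map through reflection.
// All names are hypothetical and chosen for illustration only.
class StaticMapPeek {
    // Stand-in for a class that keeps a private static registry.
    static class Registry {
        private static final Map<String, String> ENTRIES = new HashMap<>();
        static {
            ENTRIES.put("file", "LocalFs");
        }
    }

    // Looks up a static Map field by name and returns its current value.
    static Map<?, ?> readStaticMap(Class<?> owner, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            field.setAccessible(true);      // the field is private
            return (Map<?, ?>) field.get(null); // null receiver: the field is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readStaticMap(Registry.class, "ENTRIES"));
    }
}
```

Printing the map once before and once after creating the filesystem of interest gives the kind of before/after comparison the description refers to.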



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

