spark-issues mailing list archives

From "Stavros Kontopoulos (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or another jar when the --packages flag is used
Date Thu, 30 Nov 2017 11:06:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272444#comment-16272444 ]

Stavros Kontopoulos edited comment on SPARK-22657 at 11/30/17 11:05 AM:
------------------------------------------------------------------------

[~srowen] This came up in the tests we are running for DC/OS Spark. After the commit that
uses Hadoop libraries for file downloading, I guess it shouldn't work (previously it did; I
built Spark with the previous commit). Users just create their jars and submit them. This
means users now have to start thinking about classloading, which is not obvious unless it
is documented somewhere, which I might have missed. I don't see a workaround for now; the
easy thing, I guess, is to put all Hadoop dependencies in the packages list or on the
classpath so they are resolved at once. One option, though, would be for Spark not to use
Hadoop libraries for trivial tasks like downloading jars during dependency resolution. I
think there was a discussion about this in the past. Why use Hadoop libraries to fetch a
file over http(s) anyway?
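
A sketch of that alternative, assuming nothing beyond the JDK (this is not Spark's code,
just an illustration of what fetching a jar over http(s) without Hadoop libraries could
look like):

    import java.io.InputStream
    import java.net.URL
    import java.nio.file.{Files, Paths, StandardCopyOption}

    // Fetch a jar over http(s) with plain JDK I/O. No Hadoop FileSystem is
    // involved, so nothing ever touches FileSystem's static scheme map.
    def fetchJar(url: String, dest: String): Unit = {
      val in: InputStream = new URL(url).openStream()
      try Files.copy(in, Paths.get(dest), StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
    }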



> Hadoop fs implementation classes are not loaded if they are part of the app jar or another jar when the --packages flag is used
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22657
>                 URL: https://issues.apache.org/jira/browse/SPARK-22657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Stavros Kontopoulos
>
> To reproduce this issue run:
> ./bin/spark-submit --master mesos://leader.mesos:5050 \
>   --packages com.github.scopt:scopt_2.11:3.5.0 \
>   --conf spark.cores.max=8 \
>   --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \
>   --conf spark.mesos.executor.docker.forcePullImage=true \
>   --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
>   --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out
> inside a container created from the mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image.
> You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n"
> This can be reproduced with local[*] as well; there is no need for Mesos, and this is not a Mesos bug.
> The specific Spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
> It can be built with sbt assembly in that directory.
> Using this code https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb at the beginning of the main method,
> you get the following output: https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
> (use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to get the modified job).
> The job works fine if --packages is not used.
> The commit that introduced this issue (before it, things work as expected):
> 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark (5 months ago) <jerryshao> Thu, 6 Jul 2017 15:32:49 +0800
> The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311
> In https://github.com/apache/spark/pull/18235/files, check line 950; this is where a FileSystem is first created.
> The FileSystem class is initialized there, before the main method of the Spark job is launched; the reason is that the --packages logic uses Hadoop libraries to download files.
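> A minimal sketch of that timing (a standalone illustration, not Spark's actual code): the first FileSystem.get call anywhere in the JVM triggers Hadoop's one-time ServiceLoader scan, which fills the scheme map from whatever is on the classpath at that moment:
>
>     import java.net.URI
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.FileSystem
>
>     // Downloading a dependency through the Hadoop API, as the --packages
>     // path does, is enough to trigger the one-time scheme scan.
>     val fs = FileSystem.get(new URI("file:///tmp/some-dependency.jar"), new Configuration())
>     // From this point on, the set of known schemes is fixed for the JVM;
>     // implementations that appear on the classpath later are never scanned.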
> Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at the moment the FileSystem static members are first initialized and filled (SERVICE_FILE_SYSTEMS), triggered by the first FileSystem instance created, there is no s3n implementation available to add to the static map.
> Later, in the Spark job's main method, when we try to access the s3n filesystem (creating a second FileSystem), we get the exception: at this point the app jar containing the s3n implementation is on the classpath, but that scheme was never loaded into the FileSystem class's static map.
> hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the problem is the static map, which is filled once and only once.
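> One avenue that sidesteps the static map entirely (a sketch, untested here): FileSystem.getFileSystemClass consults the fs.<scheme>.impl configuration key before falling back to the service-loader map, and by job time the implementation class is on the classpath, so pointing the key at it explicitly should let it load:
>
>     import org.apache.hadoop.conf.Configuration
>
>     val hadoopConf = new Configuration()
>     // Resolve s3n through the config key instead of the static scheme map.
>     hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")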
> That's why we see two prints of the map contents in the output (gist) above when --packages is used; the first print is from before the s3n filesystem is created. We use reflection there to read the static map's entries. When --packages is not used, that map is empty before the s3n filesystem is created, since up to that point the FileSystem class has not yet been loaded by the classloader.
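> The reflection used for those prints looks roughly like this (a sketch; the field name SERVICE_FILE_SYSTEMS is the one from hadoop-common's FileSystem, the rest is reconstructed):
>
>     import org.apache.hadoop.fs.FileSystem
>     import scala.collection.JavaConverters._
>
>     // Read FileSystem's private static scheme -> implementation map.
>     val field = classOf[FileSystem].getDeclaredField("SERVICE_FILE_SYSTEMS")
>     field.setAccessible(true)
>     val schemes = field.get(null).asInstanceOf[java.util.Map[String, Class[_]]]
>     schemes.asScala.foreach { case (scheme, impl) => println(s"$scheme -> ${impl.getName}") }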


