hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws
Date Thu, 28 Jun 2018 14:17:00 GMT

    https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526341#comment-16526341

Steve Loughran commented on HADOOP-15559:

FWIW, there's one other workaround that someone other than hadoop-dev or spark-dev could take
on: build their own module equivalent to spark-hadoop-cloud and publish it on Maven independently;
this could even be done for already-shipped Spark releases. You'd then just depend on
org.example.something/spark-hadoop-cloud-binding and get the right artifacts. 

You just need to find a volunteer...
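
If someone did publish such a binding, wiring it in would be a one-liner for users. A hypothetical sketch, assuming the (nonexistent today) coordinates org.example:spark-hadoop-cloud-binding and an illustrative version — either on the command line:

{code}
pyspark --packages "org.example:spark-hadoop-cloud-binding:2.3.1"
{code}

or persistently in spark-defaults.conf:

{code}
spark.jars.packages  org.example:spark-hadoop-cloud-binding:2.3.1
{code}

The point of the design is that a single artifact per Spark release would transitively pull in a hadoop-aws version and AWS SDK consistent with that release's Hadoop build, so users never pick versions by hand.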

> Clarity on Spark compatibility with hadoop-aws
> ----------------------------------------------
>                 Key: HADOOP-15559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15559
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation, fs/s3
>            Reporter: Nicholas Chammas
>            Priority: Minor
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a command-line
tool for launching Apache Spark clusters on AWS. One of the things I try to do for my users
is make it straightforward to use Spark with {{s3a://}}. I do this by recommending that users
start Spark with the {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should work with
what versions of Spark.
> Spark releases are [built against Hadoop 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/].
At the same time, I've been told that I should be able to use newer versions of Hadoop and
Hadoop libraries with Spark, so for example, running Spark built against Hadoop 2.7 alongside HDFS
2.8 should work, and there is [no need to build Spark explicitly against Hadoop 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how to pair
Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with {{hadoop-aws:2.7.6}} but
not with {{hadoop-aws:2.8.4}}. Trying the latter yields the following error when I try to
access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V
from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR release
of Hadoop that Spark is built against. However, neither [this page|https://wiki.apache.org/hadoop/AmazonS3] nor
[this one|https://hortonworks.github.io/hdp-aws/s3-spark/] sheds any light on how to pair
the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of {{hadoop-aws}}
work with what versions and builds of Spark? It would help eliminate this kind of guesswork
and slow spelunking.
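
The matching rule Nicholas arrives at — pick hadoop-aws from the same Hadoop line that Spark was built against (matching the exact version is safest) — can be sketched in a few lines of Python. This is an illustrative helper, not an official tool; the function name and the idea of inspecting the jar names under $SPARK_HOME/jars are assumptions:

```python
import re

def matching_hadoop_aws(jar_names):
    """Given jar file names from a Spark distribution's jars/ directory,
    return the hadoop-aws coordinate whose version exactly matches the
    bundled hadoop-common jar.  A sketch of the matching rule described
    above; it ignores vendor-suffixed versions (e.g. HDP builds)."""
    for name in jar_names:
        m = re.match(r"hadoop-common-(\d+\.\d+\.\d+)\.jar$", name)
        if m:
            return "org.apache.hadoop:hadoop-aws:" + m.group(1)
    raise ValueError("no hadoop-common jar found")

# e.g. a distribution bundling hadoop-common-2.7.3.jar:
print(matching_hadoop_aws(["spark-core_2.11-2.3.1.jar",
                           "hadoop-common-2.7.3.jar"]))
# → org.apache.hadoop:hadoop-aws:2.7.3
```

Passing the resulting coordinate to {{--packages}} would then, per the observations above, avoid the IllegalAccessError that mixing Hadoop lines produces.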

This message was sent by Atlassian JIRA
