spark-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-22513) Provide build profile for hadoop 2.8
Date Mon, 26 Mar 2018 11:14:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413685#comment-16413685 ]

Steve Loughran edited comment on SPARK-22513 at 3/26/18 11:13 AM:
------------------------------------------------------------------

API-wise, everything compiled against 2.6 should compile against all subsequent 2.x versions.
 * The profiles are similar, except that in the cloud profiles later releases add more modules (hadoop-azure arrived in 2.7). No changes there between 2.7 and 2.8.

There's the hadoop-aliyun module in recent 2.x+ (forthcoming 2.9?). Hadoop 3 adds a new "hadoop-cloud-storage"
profile which is intended to give downstream projects a "cruft removed and up to date" set
of FS dependencies; as new things go in, it will be updated. And I have the task of shading
that soon.
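
For a downstream build, picking that up might look like the POM fragment below. This is a sketch: the coordinates are assumed from the module name above, and the version should match whichever Hadoop 3 release you target.
{code:xml}
<!-- Hypothetical: pull in the curated cloud-connector dependency set
     from Hadoop 3. Verify groupId/artifactId/version against the
     actual release before relying on this. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-cloud-storage</artifactId>
  <version>3.0.0</version>
</dependency>
{code}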
 * There are also the inevitable changes in dependency versions, jackson inevitably being
the most visible. For Hadoop 3 (and 2.9+?) we've moved to the shaded amazon-sdk-bundle,
so there's no need to worry about the version of jackson it builds against. As for Guava, that's
the same everywhere, but you can bump it up to at least Guava 19.0 without problems (AFAIK,
usual disclaimers, etc.).

 * We all strive to keep the semantics of things the same, but one person's "small improvement"
is always someone else's "fundamental regression in the way things behave": the eternal
losing battle of software engineering. Best strategy there: build and test against alpha and
beta releases, complain when things don't work, and make sure it's fixed in the final release.
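
The Guava bump mentioned above is usually done with a dependencyManagement override so every module agrees on one version. A sketch, with 19.0 taken from the comment rather than tested here:
{code:xml}
<!-- Force a single Guava version across the build. 19.0 is the
     version claimed safe above (usual disclaimers), not a guarantee. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>19.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
{code}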

FWIW, removing 2.6 support and setting 2.7 as the bare minimum would be a good move. Hadoop 2.6
is Java 6 only; the rest of branch-2 is Java 7.
 Hadoop 2.7 is the foundation of CDH, HDP, and Microsoft HDInsight, albeit with a fair amount
of backporting. Using 2.7 as the foundation means you don't have to worry about what was backported,
except to complain when someone breaks compatibility. As for ASF 2.8.x, I'd recommend it if
you want to use the ASF artifacts (bug fixes, way better S3 performance), and, if you work
with Azure outside HDP or HDInsight, 2.9. I don't know about CDH there; Sean will need to
git log --grep for HADOOP-14660 and HADOOP-14535, the big columnar storage speedups. AWS
EMR and Google Dataproc are both on 2.8.x; no idea what changes they've made.

You can build Spark against any Hadoop version you like on the 2.x line without problems.
{code:bash}
mvn package -Phadoop-2.7,hadoop-cloud,yarn -Dhadoop.version=2.9.0
{code}
Against 3.x, things compile but Hive is unhappy unless you have one of: a Spark Hive module
with a patch to Hive's version-check case statement, or Apache Hadoop trunk pretending to
be a branch-2 release via `-Ddeclared.hadoop.version=2.11`. The latter works OK for Spark build & test,
but it MUST NOT be deployed, as HDFS version checking will be unhappy.
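
For the second option, the build might be driven roughly as below. This is a sketch only: the property name comes from the comment above, the snapshot version string is an assumption, and again this is strictly for build & test, never deployment.
{code:bash}
# Build Hadoop trunk while declaring a branch-2-style version string,
# so Hive's version-check case statement accepts it (build/test only).
mvn install -DskipTests -Ddeclared.hadoop.version=2.11

# Then build Spark against that locally installed snapshot
# (3.0.0-SNAPSHOT is an assumed version; match your trunk checkout).
mvn package -Phadoop-cloud,yarn -Dhadoop.version=3.0.0-SNAPSHOT
{code}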

Clear :)?

ps: don't mention Java 9 (HADOOP-11123), 10 (HADOOP-11423), or 11 (HADOOP-15338). Thanks.



> Provide build profile for hadoop 2.8
> ------------------------------------
>
>                 Key: SPARK-22513
>                 URL: https://issues.apache.org/jira/browse/SPARK-22513
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.2.0
>            Reporter: Christine Koppelt
>            Priority: Major
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

