hive-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
Date Thu, 01 Jun 2017 05:56:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032497#comment-16032497 ]

Josh Rosen commented on HIVE-16391:
-----------------------------------

I tried to see whether Spark can consume the existing Hive 1.2.1 artifacts, but it looks like
neither the regular nor the {{core}} {{hive-exec}} artifact will work:

* We can't use the regular Hive uber-JAR artifacts because they bundle many transitive dependencies
without relocating those dependencies' classes into a private namespace, so multiple versions
of the same class end up on the classpath. To see this, note the long list of bundled artifacts
at https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml#L685 against only a single
relocation pattern, for Kryo (see the sketch after this list).
* We can't use the {{core}}-classified artifact:
** We actually need Kryo to be shaded in {{hive-exec}} because Spark now uses Kryo 3 (which
is needed by Chill 0.8.x, which is needed for Scala 2.12) while Hive uses Kryo 2.
** In addition, I think that Spark needs to shade Hive's {{com.google.protobuf:protobuf-java}}
dependency.
** The published {{hive-exec}} POM is a "dependency-reduced" POM which doesn't declare {{hive-exec}}'s
transitive dependencies. To see this, compare the declared dependencies in the published POM
in Maven Central (http://central.maven.org/maven2/org/apache/hive/hive-exec/1.2.1/hive-exec-1.2.1.pom)
to the dependencies in the source repo's POM: https://github.com/apache/hive/blob/release-1.2.1/ql/pom.xml.
The lack of declared dependencies creates an additional layer of pain when consuming the
{{core}} JAR: since the dependencies are no longer bundled in an uber JAR, we have to shoulder
the burden of declaring explicit dependencies on {{hive-exec}}'s transitive dependencies ourselves,
and it becomes harder to use tools like Maven's {{dependency:tree}} to spot potential dependency
conflicts.
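
For context, the one existing relocation looks roughly like the following (a minimal sketch
reconstructed from the shape of {{ql/pom.xml}}; the exact target package prefix there may differ):

{code:xml}
<!-- Sketch of a maven-shade-plugin relocation: classes under the
     Kryo package are rewritten into a Hive-private namespace so
     they cannot clash with a consumer's own Kryo version. -->
<relocation>
  <pattern>com.esotericsoftware</pattern>
  <!-- illustrative target prefix -->
  <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
</relocation>
{code}

Every other bundled dependency keeps its original package names, which is exactly why duplicate
classes can collide on a consumer's classpath.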

Spark's current custom Hive fork effectively makes three changes relative to Hive 1.2.1 in
order to work around the problems above, plus some legacy issues which are no longer relevant:

* Remove the shading/bundling of most non-Hive classes, with the exception of Kryo and Protobuf.
This has the effect of making the published POM non-dependency-reduced, easing the dependency-management
story in Spark's POMs, while still ensuring that classes which conflict with Spark are relocated
(see the sketch after this list).
* Package the {{hive-shims}} classes into the {{hive-exec}} JAR. I don't think that this is strictly necessary.
* Downgrade Kryo to 2.21. This isn't necessary anymore: there was an earlier time when we
purposely _unshaded_ Kryo and pinned Hive's version to match Spark's. The only reason this
change is still present today is to minimize the diff between versions 1 and 2 of Spark's
Hive fork.
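
As a rough sketch (not the literal diff; coordinates and target prefixes are illustrative),
the fork's shade configuration amounts to something like:

{code:xml}
<!-- Bundle and relocate only the two conflicting dependencies;
     everything else remains an ordinary declared dependency, so
     the published POM still lists it. -->
<artifactSet>
  <includes>
    <include>com.esotericsoftware.kryo:kryo</include>
    <include>com.google.protobuf:protobuf-java</include>
  </includes>
</artifactSet>
<relocations>
  <relocation>
    <pattern>com.esotericsoftware</pattern>
    <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
  </relocation>
  <relocation>
    <pattern>com.google.protobuf</pattern>
    <shadedPattern>org.apache.hive.com.google.protobuf</shadedPattern>
  </relocation>
</relocations>
{code}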

For the full details, see https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2,
which compares the current Version 2 of our Hive fork to stock Hive 1.2.1.

Maven does not allow an artifact to declare different dependencies depending on the classifier,
so if we wanted to publish a {{hive-exec core}}-like artifact which declares its transitive
dependencies then this would need to be done under a new Maven artifact name or a new version
(e.g. Hive 1.2.2-spark).
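
To make that limitation concrete: a classifier only selects a different attached file, while
dependency resolution always goes through the one shared POM. For example:

{code:xml}
<!-- Both declarations resolve the *same* hive-exec-1.2.1.pom and
     therefore inherit identical declared dependencies; the
     classifier only switches which attached JAR is downloaded. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>1.2.1</version>
  <classifier>core</classifier>
</dependency>
{code}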

That said, proper declaration of transitive dependencies isn't a hard blocker for us: a long,
long, long time ago, I think Spark may have actually built against a stock {{core}} artifact
and explicitly declared the transitive deps, so if we've handled that dependency declaration
before then we can do it again, at the cost of some pain down the road if we want to bump
to Hive 2.x.

Therefore, I think the minimal change needed in Hive's build is to add a new classifier, say
{{core-spark}}, which behaves like {{core}} except that it shades and relocates Kryo and Protobuf.
If this artifact existed then I think Spark could use that classified artifact, declare an
explicit dependency on the shim artifacts (assuming Kryo and Protobuf don't need to be shaded
there) and explicitly pull in all of {{hive-exec}}'s transitive dependencies. This avoids
the need to publish separate _versions_ for Spark: instead, Spark would just consume a
differently-packaged, differently-classified artifact from a stock Hive release.
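
Mechanically, I'd expect this to be a second shade execution in {{ql/pom.xml}} that attaches
an extra classified artifact rather than replacing the main one; a minimal sketch, assuming
the Kryo/Protobuf {{artifactSet}} and relocations shown earlier (the execution id is hypothetical):

{code:xml}
<execution>
  <id>build-exec-core-spark</id>  <!-- hypothetical id -->
  <phase>package</phase>
  <goals>
    <goal>shade</goal>
  </goals>
  <configuration>
    <!-- attach as hive-exec-<version>-core-spark.jar instead of
         overwriting the main artifact -->
    <shadedArtifactAttached>true</shadedArtifactAttached>
    <shadedClassifierName>core-spark</shadedClassifierName>
    <!-- plus the restricted artifactSet and the Kryo/Protobuf
         relocations sketched above -->
  </configuration>
</execution>
{code}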

If we go with this classifier-based approach, then I guess Hive would need to publish 1.2.3
or 1.2.2.1 in order to introduce the new classified artifact.

Does this sound like a reasonable approach? Or would it make more sense to have a separate
Hive branch and versioning scheme for Spark (e.g. {{branch-1.2-spark}} and Hive {{1.2.1-spark}})?
I lean towards the former approach (releasing 1.2.3 with an additional Spark-specific classifier),
especially if we want to fix bugs or make functional / non-packaging changes later down the
road (I think [~stevel@apache.org] had a few changes / fixes he wanted to make).

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-16391
>                 URL: https://issues.apache.org/jira/browse/HIVE-16391
>             Project: Hive
>          Issue Type: Task
>          Components: Build Infrastructure
>            Reporter: Reynold Xin
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the only change
> in the fork is to work around the issue that Hive publishes only two sets of jars: one set
> with no dependencies declared, and another with all the dependencies included in the published
> uber jar. That is to say, Hive doesn't publish a set of jars with the proper dependencies
> declared.
> There is general consensus on both sides that we should remove the forked Hive.
> The change in the forked version is recorded here: https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes, but those have all become unnecessary.



