Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Date: Mon, 2 Mar 2015 18:54:06 +0000 (UTC)
From: "Colin Patrick McCabe (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12778594.1425232036000.54283.1425322446653@Atlassian.JIRA>
In-Reply-To: <JIRA.12778594.1425232036000@Atlassian.JIRA>
References: <JIRA.12778594.1425232036000@Atlassian.JIRA>
 <JIRA.12778594.1425232036932@arcas>
Subject: [jira] [Commented] (HADOOP-11656) Classpath isolation for
 downstream clients
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343565#comment-14343565 ]

Colin Patrick McCabe commented on HADOOP-11656:
-----------------------------------------------

Thank you for filing this, [~busbey].  +1000 for fixing this... it is a huge pain point in Hadoop deployments.

bq. Steve wrote: There's another strategy, which is pure-REST-client. Why do we need an HDFS client using Hadoop IPC when we have webHDFS? Same for YARN? Even a YARN app shouldn't need to pull in yarn-*.jar, though there's enough IPC there & other things you probably would have to.

A pure REST client is slower than a pure Java client, and can't do things like zero-copy reads, short-circuit reads, and so forth.  Another way to see this: HttpFS and WebHDFS have been around for a long time, and haven't solved this problem for our users.

bq. the other strategy "ultra-lean client" is appealing, though we're fairly contaminated with Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. The notion of "single client JAR" is going to be hard to pull off without embracing Shading, and the wrongness that comes from that.

Guava is a really nice library.  It's nice on the server, and it's just as nice on the client.  We had this discussion earlier when someone attempted to remove Guava from the client... "that dog won't hunt."  And even if it did, we have Jackson, Protobuf, the AWS SDK, ZooKeeper, Jersey, GlassFish, Avro, Jetty, and on and on.

We *can't* solve this problem by minimizing dependencies.  Even if we did a huge amount of code-worsening wheel-reinvention to get rid of our nice utility libraries, we would still be stuck with dependencies like Protobuf and Jetty.  The Protobuf 2.4.1 -> 2.5.0 transition caused a huge amount of pain for users and developers.  And we all know about the security implications of using old libraries.  In a larger sense, good software architecture should involve code reuse and libraries when appropriate.  Treating dependencies as "contamination" will just result in more "not invented here" syndrome.  It doesn't scale.

bq. ps, we don't really make dependency promises. If you look at the Hadoop compatibility document, you can see we explicitly say "no guarantees". That's not an accident. We're just being somewhat cautious about updating things. If, say, HBase, accumulo & Oozie all wanted a co-ordinated update, we could try.

"Not making dependency promises" is just kicking the problem out to our users.  It makes people unwilling to upgrade because they don't know if their code will be broken by the removal or alteration of a jar they need.  Case in point: Jackson 1.8.8 -> 1.9 broke a lot of user code because it removed {{defaultPrettyPrintingWriter}} and replaced it with a function called {{writerWithDefaultPrettyPrinter}}.  This is why some enterprise distros didn't pick up the change.
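
To illustrate the burden a rename like this pushes onto users, here is a minimal sketch of the kind of reflection shim downstream code may end up carrying so that it runs against either library version. It does not use Jackson itself: {{OldStyleMapper}}, {{NewStyleMapper}}, and {{prettyWriter}} are hypothetical stand-ins invented for this example, and only the two method names come from the case above.

```java
import java.lang.reflect.Method;

class JacksonRenameShimSketch {

    /** Hypothetical stand-in for a mapper that only has the old method name. */
    public static class OldStyleMapper {
        public String defaultPrettyPrintingWriter() {
            return "old-api writer";
        }
    }

    /** Hypothetical stand-in for a mapper that only has the new method name. */
    public static class NewStyleMapper {
        public String writerWithDefaultPrettyPrinter() {
            return "new-api writer";
        }
    }

    /**
     * Calls whichever zero-argument factory method the given mapper offers,
     * preferring the newer name and falling back to the older one.
     */
    static Object prettyWriter(Object mapper) {
        String[] candidates = {
            "writerWithDefaultPrettyPrinter", // newer name
            "defaultPrettyPrintingWriter"     // older name
        };
        for (String name : candidates) {
            try {
                Method m = mapper.getClass().getMethod(name);
                return m.invoke(mapper);
            } catch (NoSuchMethodException e) {
                // This version does not have that name; try the next candidate.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("call to " + name + " failed", e);
            }
        }
        throw new IllegalStateException(
            "no known pretty-printer method on " + mapper.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(prettyWriter(new OldStyleMapper()));
        System.out.println(prettyWriter(new NewStyleMapper()));
    }
}
```

Every downstream project that needs to straddle both versions would have to write and maintain something like this for each changed method, which is the cost being described.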

We have tried dependency harmonization in the past.  It doesn't work, because different projects have different release schedules and different needs.  Not to mention different communities.  Also, projects like HBase want to support multiple versions of Hadoop.  This means that they either have to live with mixed versions of things like Guava, Jetty, etc. or agree to never update dependencies.

bq. Do you propose writing your own classloader? If so, we're in trouble, based on my experience with every single classloader I have encountered. The consensus has gathered around OSGi not because it is any better than other people's attempts, it is simply no worse, and with "a standard", you the individual don't take the hit for: security problems, .class leakage, object equality breakage, classloader leakage, etc etc. Simple example, UGI relies on being a singleton for its identity management. Embrace classloaders and you have >1 UGI singleton, so had better be confident that their doAs identities worked as required.

Hadoop is a big project, and managing our own CLASSPATH is worth the effort.  If there are problems, we can work through them.  I am not opposed to OSGi, but I think that is a separate discussion.
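
To make the singleton concern quoted above concrete, here is a minimal Java sketch; it is not Hadoop code. {{Identity}} is a hypothetical stand-in for a singleton such as UGI, and {{IsolatingLoader}} is an illustrative child-first loader invented for this example. It assumes the compiled class file is reachable as a resource through the parent loader. Under that assumption, each isolating loader defines its own copy of the class, so each copy has its own "singleton" instance.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ClassLoaderSingletonSketch {

    /** Hypothetical stand-in for a process-wide singleton such as UGI. */
    public static class Identity {
        public static final Object INSTANCE = new Object();
    }

    /** Defines exactly one class itself; everything else is delegated to the parent. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolated;

        IsolatingLoader(ClassLoader parent, String isolated) {
            super(parent);
            this.isolated = isolated;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(isolated)) {
                return super.loadClass(name, resolve);
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the class file through the parent loader (an assumption of this sketch). */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads Identity through a fresh isolating loader and returns that copy's INSTANCE. */
    static Object isolatedInstance() {
        String name = Identity.class.getName();
        ClassLoader parent = ClassLoaderSingletonSketch.class.getClassLoader();
        try {
            Class<?> copy = Class.forName(name, true, new IsolatingLoader(parent, name));
            return copy.getField("INSTANCE").get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not load isolated copy of " + name, e);
        }
    }

    public static void main(String[] args) {
        Object a = isolatedInstance();
        Object b = isolatedInstance();
        System.out.println("two loaders, distinct singletons: " + (a != b));
        System.out.println("distinct from the parent's copy: " + (a != Identity.INSTANCE));
    }
}
```

One reading of this is that any isolation scheme would presumably need to keep identity-bearing classes in a single shared loader and isolate only the third-party dependencies.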

> Classpath isolation for downstream clients
> ------------------------------------------
>
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
>
> Currently, Hadoop exposes downstream clients to a variety of third-party libraries. As our code base grows and matures we increase the set of libraries we rely on. At the same time, as our user base grows we increase the likelihood that some downstream project will run into a conflict while attempting to use a different version of some library we depend on. This has already happened with e.g. Guava several times for HBase, Accumulo, and Spark (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to off and they don't do anything to help dependency conflicts on the driver side or for folks talking to HDFS directly. This should serve as an umbrella for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when executing user provided code, whether client side in a launcher/driver or on the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want to run substantially ahead or behind the versions we need and the project is freer to change our own dependency versions because they'll no longer be in our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases written in the comments.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)