Date: Mon, 2 Mar 2015 18:54:06 +0000 (UTC)
From: "Colin Patrick McCabe (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-11656) Classpath isolation for downstream clients

    [ https://issues.apache.org/jira/browse/HADOOP-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343565#comment-14343565 ]

Colin Patrick McCabe commented on HADOOP-11656:
-----------------------------------------------

Thank you for filing this, [~busbey].  +1000 for fixing this... it is a huge pain point in Hadoop deployments.

bq. Steve wrote: There's another strategy, which is pure-REST-client. Why do we need an HDFS client using Hadoop IPC when we have webHDFS? Same for YARN?
Even a YARN app shouldn't need to pull in yarn-*.jar, though there's enough IPC there & other things you probably would have to.

A pure REST client is slower than a pure java client, and can't do things like zero-copy reads, short circuit reads, and so forth.  Another way of realizing this is to see that httpfs and webfs have been around for a long time, and haven't solved this problem for our users.

bq. the other strategy "ultra-lean client" is appealing, though we're fairly contaminated with Guava, commons-logging, SLF4J, httpclient, commons-lang, etc. The notion of "single client JAR" is going to be hard to pull off without embracing Shading, and the wrongness that comes from that.

Guava is a really nice library.  It's nice on the server, and it's just as nice on the client.  We had this discussion earlier when someone attempted to remove Guava from the client... "that dog won't hunt."  And even if it did, we have Jackson, Protobuf, AmazonAWS, zookeeper, jersey, glassfish, avro, jetty, and on and on.

We *can't* solve this problem by minimizing dependencies.  Because even if we do a huge amount of code-worsening wheel-reinvention to get rid of our nice utility libraries, we still are stuck with dependencies like Protobuf and Jetty.  The Protobuf 2.4.1 -> 2.5.0 transition caused a huge amount of pain for users and developers.  And we all know about the security implications of using old libraries.  In a larger sense, good software architecture should involve code reuse and libraries when appropriate.  Treating dependencies as "contamination" will just result in more "not invented here" syndrome.  It doesn't scale.

bq. ps, we don't really make dependency promises. If you look at the Hadoop compatibility document, you can see we explicitly say "no guarantees". That's not an accident. We're just being somewhat cautious about updating things. If, say, HBase, accumulo & Oozie all wanted a co-ordinated update, we could try.

"Not making dependency promises" is just kicking the problem out to our users.  It makes people unwilling to upgrade because they don't know if their code will be broken by the removal or alteration of a jar they need.  Case in point: Jackson 1.8.8 -> 1.9 broke a lot of user code because it removed {{defaultPrettyPrintingWriter}} and replaced it with a function called {{writerWithDefaultPrettyPrinter}}.  This is why some enterprise distros didn't pick up the change.

We have tried dependency harmonization in the past.  It doesn't work, because different projects have different release schedules and different needs.  Not to mention different communities.  Also, projects like HBase want to support multiple versions of Hadoop.  This means that they either have to live with mixed versions of things like Guava, Jetty, etc. or agree to never update dependencies.

bq. Do you propose writing your own classloader? If so, we're in trouble -- based on my experience with every single classloader I have encountered. The consensus has gathered around OSGi not because it is any better than other people's attempts, it is simply no worse, and with "a standard", you the individual don't take the hit for: security problems, .class leakage, object equality breakage, classloader leakage, etc etc. Simple example, UGI relies on being a singleton for its identity management. Embrace classloaders and you have >1 UGI singleton, so had better be confident that their doAs identities worked as required.

Hadoop is a big project and worth the effort to manage our own CLASSPATH.  If there are problems we can work through them.  I am not opposed to OSGi but I think that is a separate discussion.
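
Editorial illustration (not part of the original comment): the Jackson rename cited above, from {{defaultPrettyPrintingWriter}} to {{writerWithDefaultPrettyPrinter}}, is the kind of break that downstream code sometimes papers over with reflection when it cannot control which jar version ends up on the classpath. The following is a minimal sketch of that workaround, not something proposed in this thread. It does not use Jackson itself: OldApi and NewApi are hypothetical stand-ins for one library class at two versions, and PrettyWriterCompat, invokeFirstAvailable, and prettyWriter are names invented for this example.

```java
import java.lang.reflect.Method;

public class PrettyWriterCompat {

    /** Hypothetical stand-in for a library class at the older version. */
    public static class OldApi {
        public String defaultPrettyPrintingWriter() {
            return "old-writer";
        }
    }

    /** Hypothetical stand-in for the same class after the method was renamed. */
    public static class NewApi {
        public String writerWithDefaultPrettyPrinter() {
            return "new-writer";
        }
    }

    /**
     * Invokes the first public no-argument method in candidateNames that
     * exists on the target's class, and returns its result.
     */
    public static Object invokeFirstAvailable(Object target, String... candidateNames) {
        for (String name : candidateNames) {
            Method method;
            try {
                method = target.getClass().getMethod(name);
            } catch (NoSuchMethodException e) {
                continue; // absent in the version that was loaded; try the next name
            }
            try {
                return method.invoke(target);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Call to " + name + " failed", e);
            }
        }
        throw new UnsupportedOperationException(
                "No candidate method found on " + target.getClass().getName());
    }

    /** Prefers the newer method name and falls back to the older one. */
    public static Object prettyWriter(Object mapper) {
        return invokeFirstAvailable(mapper,
                "writerWithDefaultPrettyPrinter",
                "defaultPrettyPrintingWriter");
    }

    public static void main(String[] args) {
        System.out.println(prettyWriter(new NewApi()));
        System.out.println(prettyWriter(new OldApi()));
    }
}
```

A shim like this trades compile-time checking for tolerance of whichever version is present, and every downstream project has to write its own. That cost is what the comment means by kicking the problem out to users.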

> Classpath isolation for downstream clients
> ------------------------------------------
>
>                 Key: HADOOP-11656
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11656
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>              Labels: classloading, classpath, dependencies
>
> Currently, Hadoop exposes downstream clients to a variety of third party libraries. As our code base grows and matures we increase the set of libraries we rely on. At the same time, as our user base grows we increase the likelihood that some downstream project will run into a conflict while attempting to use a different version of some library we depend on. This has already happened with, e.g., Guava several times for HBase, Accumulo, and Spark (and I'm sure others).
> While YARN-286 and MAPREDUCE-1700 provided an initial effort, they default to off and they don't do anything to help dependency conflicts on the driver side or for folks talking to HDFS directly. This should serve as an umbrella for changes needed to do things thoroughly on the next major version.
> We should ensure that downstream clients
> 1) can depend on a client artifact for each of HDFS, YARN, and MapReduce that doesn't pull in any third party dependencies
> 2) only see our public API classes (or as close to this as feasible) when executing user provided code, whether client side in a launcher/driver or on the cluster in a container or within MR.
> This provides us with a double benefit: users get less grief when they want to run substantially ahead or behind the versions we need and the project is freer to change our own dependency versions because they'll no longer be in our compatibility promises.
> Project specific task jiras to follow after I get some justifying use cases written in the comments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)