Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-issues@hadoop.apache.org
Date: Thu, 5 Mar 2015 20:31:39 +0000 (UTC)
From: "Sean Busbey (JIRA)" <jira@apache.org>
To: common-issues@hadoop.apache.org
Message-ID: <JIRA.12779871.1425585410000.14612.1425587499289@Atlassian.JIRA>
In-Reply-To: <JIRA.12779871.1425585410000@Atlassian.JIRA>
References: <JIRA.12779871.1425585410000@Atlassian.JIRA>
 <JIRA.12779871.1425585410012@arcas>
Subject: [jira] [Commented] (HADOOP-11680) Deduplicate jars in convenience
 binary distribution
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349431#comment-14349431 ] 

Sean Busbey commented on HADOOP-11680:
--------------------------------------

{quote}
IMO, it might be easier to start with just making sure the sub-projects (HDFS, YARN, MAPRED), don't bundle the jars that are already required by common, since those subprojects also have common as a dependency.
{quote}

I'm not familiar with how the Hadoop assemblies work yet, but would doing this be as simple as having those components declare common and its dependencies with "provided" scope? If I use Maven to say a dependency should already be present at runtime, typical Maven assemblies will skip including those artifacts in the bundle.
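
For illustration, a minimal sketch of what that declaration could look like in a sub-project's pom.xml. The placement (e.g. in hadoop-hdfs) and the version property are assumptions for the example, and this has not been checked against how the Hadoop assemblies are actually configured:

{code:xml}
<!-- Hypothetical fragment for a sub-project pom.xml (e.g. hadoop-hdfs). -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <!-- Assumed: the sub-project tracks the same version as common. -->
  <version>${project.version}</version>
  <!-- provided: on the compile classpath, expected to be supplied at runtime -->
  <scope>provided</scope>
</dependency>
{code}

Under Maven's scope rules, the compile- and runtime-scoped transitive dependencies of a provided dependency are themselves treated as provided, unless the module also declares them directly with a different scope.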

Presuming that also works for the Hadoop assemblies, is it reasonable to expect everything will work as long as the libraries still show up in the output of the "foo classpath" commands?

> Deduplicate jars in convenience binary distribution
> ---------------------------------------------------
>
>                 Key: HADOOP-11680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11680
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>
> Pulled from discussion on HADOOP-11656, where Colin wrote:
> {quote}
> bq. Andrew wrote: One additional note related to this, we can spend a lot of time right now distributing 100s of MBs of jar dependencies when launching a YARN job. Maybe this is ameliorated by the new shared distributed cache, but I've heard this come up quite a bit as a complaint. If we could meaningfully slim down our client, it could lead to a nice win.
> I'm frustrated that nobody responded to my earlier suggestion that we de-duplicate jars. This would drastically reduce the size of our install, and without rearchitecting anything.
> In fact I was so frustrated that I decided to write a program to do it myself and measure the delta. Here it is:
> Before:
> {code}
> du -h /h
> 249M    /h
> {code}
> After:
> {code}
> du -h /h
> 140M    /h
> {code}
> Seems like deduplicating jars would be a much better project than splitting into a client jar, if we really cared about this.
> <snip>
> {quote}

