hadoop-common-issues mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11680) Deduplicate jars in convenience binary distribution
Date Thu, 05 Mar 2015 20:31:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349431#comment-14349431 ]

Sean Busbey commented on HADOOP-11680:

IMO, it might be easier to start by just making sure the sub-projects (HDFS, YARN, MAPRED)
don't bundle the jars that are already required by common, since those sub-projects also have
common as a dependency.
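
To get a rough picture of which jars are bundled more than once, something like the following could be run against an unpacked distribution. This is only a sketch: `find_duplicate_jars` is a hypothetical helper (not the program mentioned in the issue description), it treats two jars as identical when their POSIX `cksum` CRC and size match, and it assumes jar paths contain no whitespace.

```shell
# Hypothetical helper: print every jar under a directory (default: the
# current directory) whose contents also appear in at least one other jar.
# Identity is approximated by the CRC and byte count that cksum reports.
# Assumes jar paths contain no whitespace.
find_duplicate_jars() {
  find "${1:-.}" -type f -name '*.jar' -exec cksum {} + |
    LC_ALL=C sort |
    awk '{
      key = $1 " " $2                 # CRC and size identify the content
      if (key == prev) {
        if (!shown) print first       # first member of the group, printed once
        print $3
        shown = 1
      } else {
        prev = key; first = $3; shown = 0
      }
    }'
}
```

Running `find_duplicate_jars /path/to/hadoop` lists the paths group by group; jars bundled only once are not printed.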

I'm not familiar with how the Hadoop assemblies work yet, but would doing this be as simple
as having those components list common and its dependencies with provided scope? If I use Maven to say
a dependency should already be present at runtime, typical Maven assemblies will skip including
those artifacts in the bundle.
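
As a minimal sketch of that idea, a sub-project could declare its dependency on common as below. The fragment is hypothetical: the placement in a sub-project pom.xml and the use of `${project.version}` are assumptions, not taken from the Hadoop build.

```xml
<!-- Hypothetical fragment of a sub-project pom.xml (for example HDFS). -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <!-- Assumes the sub-project shares the parent project's version. -->
  <version>${project.version}</version>
  <!-- provided: available at compile time, expected to be supplied at runtime. -->
  <scope>provided</scope>
</dependency>
```

With provided scope, Maven also treats the compile and runtime dependencies that common brings in transitively as provided, so a library would still be bundled if the sub-project reaches it through some other, non-provided dependency.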

Presuming that also works for the Hadoop assemblies, if the libraries still show up in the output of the
"foo classpath" commands, is it reasonable to expect everything will work?

> Deduplicate jars in convenience binary distribution
> ---------------------------------------------------
>                 Key: HADOOP-11680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11680
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
> Pulled from discussion on HADOOP-11656. Colin wrote:
> {quote}
> bq. Andrew wrote: One additional note related to this, we can spend a lot of time right
> now distributing 100s of MBs of jar dependencies when launching a YARN job. Maybe this is
> ameliorated by the new shared distributed cache, but I've heard this come up quite a bit as
> a complaint. If we could meaningfully slim down our client, it could lead to a nice win.
> I'm frustrated that nobody responded to my earlier suggestion that we de-duplicate jars.
> This would drastically reduce the size of our install, and without rearchitecting anything.
> In fact I was so frustrated that I decided to write a program to do it myself and measure
> the delta. Here it is:
> Before:
> {code}
> du -h /h
> 249M    /h
> {code}
> After:
> {code}
> du -h /h
> 140M    /h
> {code}
> Seems like deduplicating jars would be a much better project than splitting into a client
> jar, if we really cared about this.
> <snip>
> {quote}

This message was sent by Atlassian JIRA