hadoop-common-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-12363) Hadoop binary distributions contain many copies of the same jars
Date Fri, 28 Aug 2015 16:32:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720159#comment-14720159 ]

Allen Wittenauer edited comment on HADOOP-12363 at 8/28/15 4:32 PM:

Pretty sure this is HADOOP-11680 and HADOOP-10115. This was fixed only in trunk, since removing
jars from the distribution in key locations was considered an incompatible change.

was (Author: aw):
Pretty sure this is HADOOP-11680.

> Hadoop binary distributions contain many copies of the same jars
> ----------------------------------------------------------------
>                 Key: HADOOP-12363
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12363
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Benoit Sigoure
>            Priority: Minor
> [I noticed this 2 years ago|https://twitter.com/tsunanet/status/384917643162972161] but this is bugging me again so I'm finally filing a bug ;o
> The Hadoop binary distribution is insanely redundant.  Over 80% of the size of the ~200MB tarballs distributed both by Apache upstream and by Cloudera is made of duplicate files.
> Back when I was complaining about CDH 4.4.0, the Hadoop tarball contained [3477 duplicate files, some of which had 98 copies in the tarball|http://tsunanet.net/~tsuna/cdh440-dup-files.txt]!
> Now I'm looking at the official {{hadoop-2.7.1.tar.gz}} and I'm seeing 7 copies of {{jackson-mapper-asl-1.9.13.jar}}, {{jersey-server-1.9.jar}}, {{protobuf-java-2.5.0.jar}}, etc, 6 copies of {{guava-11.0.2.jar}}, {{xz-1.0.jar}}, {{commons-logging-1.1.3.jar}}, etc, 5 copies of {{snappy-java-}}, etc etc etc.  All in all there are well over 200 files that appear at least twice in the tarball, and that account for 118MB worth of files that could just be replaced with a symlink (assuming you don't want to change the structure of the tarball at all).
> This is really not necessary :)
> Can we fix the distribution?  I'm sure Cloudera and others will fix their distributions as well once this is fixed upstream (their distros exhibit a substantially more acute version of this problem).
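
The duplication described in the report can be checked independently. The following is a minimal Python sketch, not part of the original report or of any Hadoop tooling. It assumes a local copy of a tarball such as hadoop-2.7.1.tar.gz and treats two regular files as duplicates when their SHA-256 digests match. The function names `find_duplicates` and `redundant_bytes` are hypothetical, chosen only for this illustration.

```python
import hashlib
import sys
import tarfile
from collections import defaultdict


def find_duplicates(tar_path):
    """Return {sha256 hex digest: [(member name, size), ...]} for content stored more than once."""
    by_digest = defaultdict(list)
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks and hard links
            stream = tar.extractfile(member)
            digest = hashlib.sha256()
            # Hash in 1 MiB chunks so large jars are never held fully in memory.
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
            by_digest[digest.hexdigest()].append((member.name, member.size))
    return {d: files for d, files in by_digest.items() if len(files) > 1}


def redundant_bytes(duplicates):
    """Bytes that would be saved if every copy except the first became a link."""
    return sum(size for files in duplicates.values() for _, size in files[1:])


if __name__ == "__main__" and len(sys.argv) == 2:
    dups = find_duplicates(sys.argv[1])
    # List the most frequently repeated files first.
    for files in sorted(dups.values(), key=len, reverse=True):
        print(f"{len(files)} copies, {files[0][1]} bytes each:")
        for name, _ in files:
            print(f"    {name}")
    print(f"{len(dups)} distinct files are duplicated; "
          f"{redundant_bytes(dups)} redundant bytes")
```

Run as a script with the tarball path as its only argument, this prints each group of identical files and an estimate of the space that symlinks (or a restructured layout) would reclaim. The figures it reports depend on the tarball examined and are not claimed to match the numbers quoted in the report.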
