Date: Thu, 5 Mar 2015 20:31:39 +0000 (UTC)
From: "Sean Busbey (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-11680) Deduplicate jars in convenience binary distribution

    [ https://issues.apache.org/jira/browse/HADOOP-11680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349431#comment-14349431 ]

Sean Busbey commented on HADOOP-11680:
--------------------------------------

{quote}
IMO, it might be easier to start with just making sure the sub-projects (HDFS, YARN, MAPRED), don't bundle the jars that are already required by common, since those subprojects also have common as a dependency.
{quote}

I'm not familiar with how the Hadoop assemblies work yet, but would doing this be as simple as having those components list common and its dependencies as provided?
If I use Maven to say a dependency should already be present at runtime, common Maven assemblies will skip including those artifacts in the bundle. Presuming that also works for the Hadoop assemblies, if the libraries still show up in the "foo classpath" commands, is it reasonable to expect everything will work?

> Deduplicate jars in convenience binary distribution
> ---------------------------------------------------
>
>                 Key: HADOOP-11680
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11680
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: build
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>
> Pulled from discussion on HADOOP-11656.
> Colin wrote:
> {quote}
> bq. Andrew wrote: One additional note related to this, we can spend a lot of time right now distributing 100s of MBs of jar dependencies when launching a YARN job. Maybe this is ameliorated by the new shared distributed cache, but I've heard this come up quite a bit as a complaint. If we could meaningfully slim down our client, it could lead to a nice win.
> I'm frustrated that nobody responded to my earlier suggestion that we de-duplicate jars. This would drastically reduce the size of our install, and without rearchitecting anything.
> In fact I was so frustrated that I decided to write a program to do it myself and measure the delta. Here it is:
> Before:
> {code}
> du -h /h
> 249M /h
> {code}
> After:
> {code}
> du -h /h
> 140M /h
> {code}
> Seems like deduplicating jars would be a much better project than splitting into a client jar, if we really cared about this.
>
> {quote}
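
Editor's sketch: the program Colin mentions is not included in the thread, so the following is only a hypothetical illustration of how duplicate jars in an install tree could be found and the potential saving estimated. It assumes that "duplicate" means byte-identical content (matched by SHA-256), and the function names `sha256_of`, `find_duplicate_jars`, and `reclaimable_bytes` are invented for this example. It only reports duplicates; it does not remove or link anything.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_jars(root):
    """Group .jar files under root by content hash.

    Returns only the groups that contain more than one file.
    Symlinks are skipped so an already-linked copy is not counted twice.
    """
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".jar") and not os.path.islink(path):
                by_hash[sha256_of(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}


def reclaimable_bytes(duplicates):
    """Bytes that would be saved if each group kept a single copy."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in duplicates.values()
    )
```

Calling `find_duplicate_jars("/h")` on the install directory from the quoted `du` output would list each set of identical jars, and `reclaimable_bytes` gives a rough upper bound on the space a de-duplication pass could recover. Whether the copies are then replaced by symlinks, hard links, or a shared lib directory is a separate design choice that this sketch does not make.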