Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Reply-To: dev@mahout.apache.org
Date: Tue, 30 Dec 2014 02:22:13 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@mahout.apache.org
Subject: [jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient

    [ https://issues.apache.org/jira/browse/MAHOUT-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260688#comment-14260688 ]

ASF GitHub Bot commented on MAHOUT-1636:
----------------------------------------

Github user pferrel commented on a diff in the pull request:

    https://github.com/apache/mahout/pull/69#discussion_r22337057

    --- Diff: spark/src/main/assembly/dependencies.xml ---
    @@ -38,9 +38,34 @@
           /
           true
    +
    +      org.apache.hadoop:hadoop-core
    +      org.apache.spark:spark-core_${scala.major}
    +      org.scala-lang:scala-library
    +      jackson-core-asl
    +      jackson-mapper-asl
    +      xstream
    +      lucene-core
    +      lucene-analyzers-common
    --- End diff --

    @dlyubimov I'm open to that argument, but maybe you can explain. Using excludes allows us to add dependencies by putting them in any dependent module pom, without dealing with the assembly xml. Doing it with includes means we have to remember to add each dependency to the module pom as well as to this assembly every time. Also, the excludes will only be things in the environment, which I would think change seldom, even with new versions.

    The reverse argument is also true. If we add new parts of Spark or Scala, we'll have to add them to the excludes, since they are already in the environment. I'm not sure what those would be, but maybe you have some examples.


> Class dependencies for the spark module are put in a job.jar, which is very inefficient
> ---------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1636
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1636
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
>
>
> Using a Maven plugin and an assembly job.xml, a job.jar is created with all dependencies, including transitive ones. This job.jar is in mahout/spark/target and is included in the classpath when a Spark job is run. This allows dependency classes to be found at runtime, but the job.jar includes a great deal of things that are not needed and that duplicate classes found in the main mrlegacy job.jar. If the job.jar is removed, drivers will not find needed classes. A better way needs to be implemented for including class dependencies.
> I'm not sure what that better way is, so I am leaving the assembly alone for now. Whoever picks up this Jira will have to remove it after deciding on a better method.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
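
For readers unfamiliar with the two approaches weighed in the comment above, the following is a minimal sketch of how they might look in a Maven assembly descriptor. It is not the content of the pull request: the two dependencySet elements are alternatives shown side by side, and the coordinate com.example:some-needed-lib is a hypothetical placeholder. The two excluded coordinates are taken from the diff; everything else is assumed for illustration.

```xml
<!-- Sketch only. The two dependencySet elements below are alternatives,
     not meant to be used together in one descriptor. -->
<dependencySets>

  <!-- Option A (excludes): bundle every dependency declared in the poms,
       minus the artifacts the runtime environment already provides.
       A new dependency only needs a pom entry. -->
  <dependencySet>
    <outputDirectory>/</outputDirectory>
    <excludes>
      <exclude>org.apache.hadoop:hadoop-core</exclude>
      <exclude>org.scala-lang:scala-library</exclude>
    </excludes>
  </dependencySet>

  <!-- Option B (includes): bundle only the artifacts listed here.
       A new dependency needs a pom entry and a new include line.
       com.example:some-needed-lib is a hypothetical coordinate. -->
  <dependencySet>
    <outputDirectory>/</outputDirectory>
    <includes>
      <include>com.example:some-needed-lib</include>
    </includes>
  </dependencySet>

</dependencySets>
```

With Option A the list to maintain tracks what the environment provides. With Option B it tracks what the module itself needs. That is the maintenance trade-off the comment describes.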