mahout-dev mailing list archives

From "Pat Ferrel (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient
Date Thu, 25 Dec 2014 02:36:13 GMT


Pat Ferrel commented on MAHOUT-1636:

I think the job.jar is just an assembly of other jars as specified in job.xml, and it could
be called anything. AFAIK there is nothing specific about the format, and the JVM certainly
does recognize and load classes from a job.jar in the front-end part of the driver.

Can you point me to something that shows how the backend classes are loaded differently?

The shell depends on the spark module, which depends on mrlegacy. Since the only place transitive
dependencies are assembled is the mrlegacy job jar, I suspect the DSL+shell will have holes
if you do away with the job jar.

Seems like we have two cases: Hadoop MapReduce, which is covered, and Spark, which does need
an all-deps jar (minus HDFS, Spark, and Scala). This means at least two classpaths, which
we have. What we are missing is a Spark assembly that we can agree on. I'm terrible at POMs but will
see if I can figure out a way to exclude HDFS, Spark, and Scala from the current job jar in the
spark module. This should get us most of the way to an agreeable solution.
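
For discussion, here is a rough sketch of what such an exclusion might look like in an
assembly descriptor. It is not the actual job.xml from the spark module; the id, the scope,
and the exclude patterns are assumptions, and the group IDs would need to be checked
against what the build really pulls in.

```xml
<!-- Hypothetical assembly descriptor, not Mahout's actual job.xml. -->
<assembly>
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <!-- Unpack dependency jars so their classes end up in one flat jar. -->
      <unpack>true</unpack>
      <scope>runtime</scope>
      <!-- Also apply the excludes to artifacts that are only pulled in
           transitively by an excluded artifact. -->
      <useTransitiveFiltering>true</useTransitiveFiltering>
      <excludes>
        <!-- Assumed patterns (groupId:artifactId) for what the cluster
             already provides: Hadoop/HDFS, Spark, and Scala. -->
        <exclude>org.apache.hadoop:*</exclude>
        <exclude>org.apache.spark:*</exclude>
        <exclude>org.scala-lang:*</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>
```

Another route that might work is marking those dependencies as provided scope in the
POM, since a runtime-scoped dependencySet should then leave them out without explicit excludes.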

> Class dependencies for the spark module are put in a job.jar, which is very inefficient
> ---------------------------------------------------------------------------------------
>                 Key: MAHOUT-1636
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
> Using a maven plugin and an assembly job.xml, a job.jar is created with all dependencies,
> including transitive ones. This job.jar is in mahout/spark/target and is included in the classpath
> when a Spark job is run. This allows dependency classes to be found at runtime, but the job.jar
> includes a great deal of things not needed that are duplicates of classes found in the main
> mrlegacy job.jar. If the job.jar is removed, drivers will not find needed classes. A better
> way needs to be implemented for including class dependencies.
> I'm not sure what that better way is so am leaving the assembly alone for now. Whoever
> picks up this Jira will have to remove it after deciding on a better method.
