mahout-dev mailing list archives

From "Pat Ferrel (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1636) Class dependencies for the spark module are put in a job.jar, which is very inefficient
Date Mon, 29 Dec 2014 17:01:13 GMT


Pat Ferrel commented on MAHOUT-1636:

The "front-end" gets the classpath created in the mahout script. 

The "back-end" gets the classpath created in mahoutSparkContext, which uses "mahout classpath
-spark" and allows a list of special-purpose jars (not used internally by Mahout) to be added.

The output of "mahout classpath" and "mahout classpath -spark" is identical in my case. For
the back-end, mahoutSparkContext has the chance to modify the classpath or add jars, but currently it does not.
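One quick way to confirm the two classpaths really are identical is to diff them entry by entry. A minimal sketch, using placeholder jar names; in practice cp1 and cp2 would come from `mahout classpath` and `mahout classpath -spark`:

```shell
# Placeholder classpaths standing in for the output of
# "mahout classpath" and "mahout classpath -spark".
cp1="mahout-mrlegacy.jar:guava.jar:hadoop-common.jar"
cp2="mahout-mrlegacy.jar:guava.jar:hadoop-common.jar:spark-extra.jar"

# Split each classpath on ':', sort the entries, and print only the
# entries that appear in the -spark classpath but not the plain one.
comm -13 <(tr ':' '\n' <<<"$cp1" | sort) <(tr ':' '\n' <<<"$cp2" | sort)
# prints: spark-extra.jar
```

If this prints nothing for the real classpaths, the two are in fact identical and the -spark variant is adding no value.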

So there may be some refactoring to do here. For instance, why are the two classpaths identical?
Surely there are more things we can exclude when running hadoop drivers. Also, should we be
using something like spark-submit to launch spark drivers?
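For reference, launching a driver through spark-submit rather than a hand-built classpath might look roughly like the following. This is only a sketch: the driver class, master URL, jar names, and arguments are all hypothetical placeholders, not Mahout's actual entry points.

```shell
# Hypothetical invocation; class and jar names are placeholders.
spark-submit \
  --class org.apache.mahout.drivers.SomeDriver \
  --master spark://master:7077 \
  --jars mahout-spark-dependencies.jar \
  mahout-spark.jar \
  --input hdfs://... --output hdfs://...
```

spark-submit would then handle shipping the listed jars to the executors, which is exactly the job the classpath plumbing does by hand today.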

As a first step I'll try creating a new "dependencies.jar" assembly that contains all transitive
dependencies for the spark drivers, excluding anything that seems unneeded or is already guaranteed by the environment.
I believe the only way to test this will be to run all drivers from the CLI, since scalatest
during the build uses a different method for finding classes. See
for further discussion.
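A dependencies-only assembly of that sort could start from a descriptor along these lines. This is a sketch, not the actual job.xml: the id and the exclusion patterns are assumptions that would need tuning against what the environment really guarantees.

```xml
<!-- Sketch of a dependencies-only assembly descriptor; the id and
     exclusions are illustrative, not the actual Mahout job.xml. -->
<assembly>
  <id>dependencies</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>true</unpack>
      <scope>runtime</scope>
      <!-- leave out what the environment already provides -->
      <excludes>
        <exclude>org.apache.hadoop:*</exclude>
        <exclude>org.apache.spark:*</exclude>
        <exclude>org.scala-lang:*</exclude>
      </excludes>
      <!-- skip the module's own classes so only dependencies are bundled -->
      <useProjectArtifact>false</useProjectArtifact>
    </dependencySet>
  </dependencySets>
</assembly>
```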

[~tdunning] I assume this doesn't duplicate anything you are doing on this ticket?

> Class dependencies for the spark module are put in a job.jar, which is very inefficient
> ---------------------------------------------------------------------------------------
>                 Key: MAHOUT-1636
>                 URL:
>             Project: Mahout
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.0-snapshot
>            Reporter: Pat Ferrel
>            Assignee: Ted Dunning
>             Fix For: 1.0-snapshot
> Using a maven plugin and an assembly job.xml, a job.jar is created with all dependencies,
> including transitive ones. This job.jar is in mahout/spark/target and is included in the classpath
> when a Spark job is run. This allows dependency classes to be found at runtime, but the job.jar
> includes a great deal of unneeded things that are duplicates of classes found in the main
> mrlegacy job.jar. If the job.jar is removed, drivers will not find needed classes. A better
> way needs to be implemented for including class dependencies.
> I'm not sure what that better way is, so am leaving the assembly alone for now. Whoever
> picks up this Jira will have to remove it after deciding on a better method.

This message was sent by Atlassian JIRA
