mahout-dev mailing list archives

From Andrew Palumbo <>
Subject Re: using spark-submit to launch CLI jobs
Date Sun, 29 Nov 2015 20:45:04 GMT
Pat, that seems like a good approach; my only ask would be that you keep mahoutSparkContext

From: Pat Ferrel <>
Sent: Sunday, November 29, 2015 1:33 PM
Subject: Re: using spark-submit to launch CLI jobs

BTW, I agree with a later reply from Dmitriy that real use of Mahout will generally employ
spark-submit, so the motivation here is primarily about launching app/driver-level things in
Mahout. But these have broken several times now, partly because Mahout does not follow the
spark-submit conventions (ever-changing though they may be).

One other motivation is that the Spark bindings' mahoutSparkContext function calls the mahout
script to get a classpath and then creates a SparkContext. It might be good to make this
private to Mahout (used only in the test suites) so users don't see it as the only or
preferred way to create a SparkMahoutContext, which seems better constructed directly from a
SparkContext the user creates:

    implicit val sc = <Spark Context creation code>
    implicit val mc = new SparkDistributedContext( sc )

Since the drivers are sometimes used as examples of employing Mahout with Spark, we could
change them to use the above method; for the same reasons, employing spark-submit to launch
them is the right example to give.
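A self-contained sketch of what such a driver might look like, assuming Mahout's spark bindings are on the classpath (the object name and app name are hypothetical; SparkDistributedContext follows the sparkbindings module, but treat the exact construction as illustrative rather than authoritative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings.SparkDistributedContext

// Hypothetical driver object: spark-submit supplies the master, deploy
// mode, and any config overrides, so the driver only names the app and
// builds its own contexts.
object ExampleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mahout-example")
    implicit val sc: SparkContext = new SparkContext(conf)

    // Wrap the user-created SparkContext instead of calling
    // mahoutSparkContext, which shells out to the mahout script
    // for a classpath.
    implicit val mc = new SparkDistributedContext(sc)

    // ... Mahout DRM work would go here ...

    sc.stop()
  }
}
```

This keeps context creation in ordinary Spark idiom, so the same driver works whether it is launched locally or handed to a cluster by spark-submit.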

If no one is particularly interested in this bit of refactoring, and no one has contrary
opinions to the above, I'm inclined to do it as I have time.

On Nov 28, 2015, at 10:55 AM, Pat Ferrel <> wrote:

I also use spark-submit to launch apps that use Mahout, so I'm not sure what assumptions you
are talking about. The first step is to use spark-submit in our own launch script.

The current code calls the CLI mahout script to get classpath info; that info should be passed
to spark-submit instead, so if we launch with spark-submit I think the call to the mahout
script would be unnecessary. This also makes Yarn cluster mode more straightforward, where
the client/driver is launched on some cluster machine that has no mahout script to call.
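As a sketch of what that launch might look like (the flags are standard spark-submit options; the jar names and the driver class are hypothetical placeholders, not actual Mahout artifacts):

```shell
# Hypothetical launch of a Mahout-based driver via spark-submit.
# --jars ships the Mahout jars to the cluster, replacing the classpath
# lookup previously done by calling the mahout script.
spark-submit \
  --class org.example.ExampleDriver \
  --master yarn \
  --deploy-mode cluster \
  --jars mahout-math-scala.jar,mahout-spark.jar \
  example-driver.jar
```

In Yarn cluster mode the driver runs on a cluster node, which is exactly the case where no local mahout script is available to call.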

If the SparkMahoutContext is a hard requirement that’s fine. As I said, I don’t understand
all of those ramifications.

On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <> wrote:

I do submits all the time, don't see any problem. It is part of my standard
stress test harness.

Mahout context is conceptual and cannot be removed, nor is it required to
be removed in order to run submitted jobs. Submission and contexts are two
completely separate concepts. One can submit a job that, for example, doesn't
set up a Spark job at all and instead runs an MR job, or just
manipulates some HDFS directories, or sets up multiple jobs or combinations
of all of the above. All submission means is sending an uber jar to an
application server and launching a main class there, instead of doing the
same locally. Not sure where all these assumptions are coming from.
On Nov 27, 2015 11:33 AM, "Pat Ferrel" <> wrote:

> Currently we create a SparkMahoutContext, and use “mahout -spark
> classpath” to create the SparkContext. The SparkConf is also directly
> accessed. If we move to using spark-submit for launching the Mahout Shell
> and other drivers we would need to refactor some of this and change the
> mahout script. It seems desirable to have the driver code create the Spark
> context and rely on spark-submit for any config overrides and params. This
> implies the possible removal (not sure about this) of SparkMahoutContext.
> In general it would be nice if this were done outside of Mahout, or limited
> to the drivers and shell. Mahout has become a library that is designed to
> be backend independent. This code was designed before this became a goal
> and is beyond my understanding to fully grasp how much work would be
> involved and what would replace it.
> The code refactoring needed is not well understood, by me at least. But
> intuition says that with a growing number of backends it might be good to
> clean up the Spark dependencies for context management. This has also been
> a bit of a problem in creating apps that use Mahout, since typical
> spark-submit use cannot be relied on to make config changes; they must be
> made in environment variables only. This arguably non-standard
> manipulation of the context puts limitations and hidden assumptions into
> using Mahout as a library.
> Doing all of this implies a fairly large bit of work, I think. The benefit
> is that it will be clearer how to use Mahout as a library, and it will
> clean up some unneeded code. I’m not sure I have enough time to do all
> of this myself.
> This isn’t so much a proposal as a call for discussion.
