mahout-dev mailing list archives

From Dmitriy Lyubimov <>
Subject Re: using spark-submit to launch CLI jobs
Date Mon, 30 Nov 2015 04:58:58 GMT
On Sat, Nov 28, 2015 at 10:55 AM, Pat Ferrel <> wrote:

> I use spark-submit also to launch apps that use Mahout so not sure what
> assumptions you are talking about.

OK, so if it works, what's the problem? I am lost.
I am talking about the assumption that anything dealing with context needs
to be changed or even removed.

> The first thing is to use spark-submit in our own launch script.
What script would that be?

> The current code calls the CLI mahout script to get classpath info, this
> should be passed in to the

Which code? Mahout context creation? As I said, you can customize that
behavior: you can tell it not to look for the standard jars and put your own
jars on the classpath. It should be flexible enough to handle any startup
situation.
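
A minimal sketch of that kind of customization, for illustration only. The
function and parameter names (`resolve_jars`, `add_standard_jars`,
`find_standard_jars`) are hypothetical and are not Mahout's API; the point is
only that the standard-jar lookup can be switched off and the caller's own
jars supplied instead.

```python
from typing import Callable, Iterable, List


def resolve_jars(
    custom_jars: Iterable[str] = (),
    add_standard_jars: bool = True,
    find_standard_jars: Callable[[], List[str]] = lambda: [],
) -> List[str]:
    """Build the jar list handed to the backend when a context is created.

    All names here are hypothetical. `find_standard_jars` stands in for
    whatever discovers the library's own jars (for example, a launch script).
    """
    jars: List[str] = []
    if add_standard_jars:
        # Only consult the lookup when the caller asked for the standard jars.
        jars.extend(find_standard_jars())
    for jar in custom_jars:
        # Append the caller's jars, skipping any that are already listed.
        if jar not in jars:
            jars.append(jar)
    return jars
```

With `add_standard_jars=False` the lookup is never called, which is the case
that matters when the driver starts on a cluster machine that has no launch
script to consult.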

> spark-submit so if we launch with spark-submit I think the call of the
> mahout script would be unnecessary. This makes it more straightforward to
> use with Yarn cluster mode where the client/driver is launched on some
> cluster machine where there would be no script to call.

Again, see the comment above. Yes, I have done submits to YARN and
standalone, you name it. It all works.

> If the SparkMahoutContext is a hard requirement that’s fine.

Every single operation uses the context (which essentially wraps the backend
context). It is not passed in explicitly; it is implied by the dataset
parameter. No physical operator can work without it.

For the most part, the context is required because the backend engines
require a session equivalent of it (SparkContext in Spark's case). This is
more of a hard requirement on the backend side.
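
The "implied by the dataset parameter" idea can be sketched with a toy model.
This is an illustration under stated assumptions, not Mahout code: the class
and function names (`BackendSession`, `DistributedContext`, `Dataset`,
`elementwise_add`) are hypothetical stand-ins.

```python
class BackendSession:
    """Hypothetical stand-in for a backend session (e.g. a SparkContext)."""

    def __init__(self, name: str):
        self.name = name


class DistributedContext:
    """Hypothetical library-level context that wraps the backend session."""

    def __init__(self, session: BackendSession):
        self.session = session


class Dataset:
    """Toy distributed dataset that remembers the context that created it."""

    def __init__(self, context: DistributedContext, rows):
        self.context = context
        self.rows = list(rows)


def elementwise_add(a: Dataset, b: Dataset) -> Dataset:
    """Toy operator: takes no context argument, reads it from its inputs."""
    if a.context is not b.context:
        raise ValueError("operands belong to different contexts")
    if len(a.rows) != len(b.rows):
        raise ValueError("operands have different lengths")
    # The result inherits the context of its operands.
    return Dataset(a.context, [x + y for x, y in zip(a.rows, b.rows)])
```

Callers never hand a context to `elementwise_add`; it travels with the
datasets, so removing the context would leave the operator with nothing to
run against.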

> As I said, I don’t understand all of those ramifications.
> On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <> wrote:
> I do submits all the time and don't see any problem. It is part of my
> standard stress test harness.
> Mahout context is conceptual and cannot be removed, nor is it required to
> be removed in order to run submitted jobs. Submission and contexts are two
> completely separate concepts. One can submit a job that, for example,
> doesn't set up a Spark job at all and runs an MR job, or just
> manipulates some HDFS directories, or sets up multiple jobs or combinations
> of all of the above. All submission means is sending an uber jar to an
> application server and launching a main class there, instead of doing the
> same locally. Not sure where all these assumptions are coming from.
> On Nov 27, 2015 11:33 AM, "Pat Ferrel" <> wrote:
> > Currently we create a SparkMahoutContext, and use “mahout -spark
> > classpath” to create the SparkContext. The SparkConf is also directly
> > accessed. If we move to using spark-submit for launching the Mahout Shell
> > and other drivers we would need to refactor some of this and change the
> > mahout script. It seems desirable to have any driver code create the
> > Spark context and rely on spark-submit for any config overrides and
> > params. This implies the possible removal (not sure about this) of
> > SparkMahoutContext. In general it would be nice if this were done outside
> > of Mahout, or limited to the drivers and shell. Mahout has become a
> > library that is designed to be backend independent. This code was
> > designed before this became a goal, and it is beyond my understanding to
> > fully grasp how much work would be involved and what would replace it.
> >
> > The code refactoring needed is not well understood, by me at least. But
> > intuition says that with a growing number of backends it might be good to
> > clean up the Spark dependencies for context management. This has also
> > been a bit of a problem in creating apps that use Mahout since typical
> > spark-submit use cannot be relied on to make config changes; they must be
> > made in environment variables only. This arguably non-standard
> > manipulation of the context puts limitations and hidden assumptions into
> > using Mahout as a library.
> >
> > Doing all of this implies a fairly large bit of work, I think. The
> > benefit is that it will be clearer how to use Mahout as a library, and
> > some unneeded code will be cleaned up. I’m not sure I have enough time to
> > do all of this myself.
> >
> > This isn’t so much a proposal as a call for discussion.
> >
> >
> >
