mahout-dev mailing list archives

From Pat Ferrel <>
Subject Re: Spark options
Date Wed, 12 Nov 2014 01:49:04 GMT
The submit code is the only place that documents which options are needed by clients, AFAICT. It is
pretty complicated and heavily laden with checks for which cluster manager is being used.
I’d feel a lot better if we were using it; there is no way any of us are going to be able
to test on all those configurations. spark-env is mostly for launching the cluster, not the client, but there seem
to be exceptions like executor memory.

On Nov 11, 2014, at 2:18 PM, Dmitriy Lyubimov <> wrote:

These files, if I read them correctly, are for spawning yet another process. I
don't see how that would work for the shell.

I am also not convinced that spark-env is important for the client.

On Tue, Nov 11, 2014 at 2:09 PM, Pat Ferrel <> wrote:

> I was thinking -Dx=y too, seems like a good idea.
> But we should also support setting them the way Spark documents, and the two
> links Andrew found may solve that in a maintainable way. Maybe we get the
> SparkConf from a new mahoutSparkConf function, which handles all env-supplied
> setup. For the drivers it can be done in the base class, allowing CLI
> overrides later. Then the SparkConf is finally passed in to
> mahoutSparkContext, where as little as possible is changed in the conf.
> I’ll look at this for the drivers. Should be easy to add to the shell.
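[Editor's note: a minimal sketch of the mahoutSparkConf idea described above. The function name comes from the thread; the env-variable-to-conf-key table and the Map-based stand-in for SparkConf are illustrative assumptions, not Mahout's actual API.]

```scala
// Sketch: gather env-supplied settings first, then let CLI options override
// them, before the result is handed to mahoutSparkContext.
// The env var names and conf keys below are examples, not an exhaustive list.
def mahoutSparkConf(env: Map[String, String] = sys.env,
                    cliOverrides: Map[String, String] = Map.empty): Map[String, String] = {
  // translate the environment variables Spark documents into conf keys
  val fromEnv = Seq(
    "SPARK_EXECUTOR_MEMORY" -> "spark.executor.memory",
    "SPARK_DRIVER_MEMORY"   -> "spark.driver.memory"
  ).flatMap { case (envKey, confKey) => env.get(envKey).map(confKey -> _) }.toMap

  // CLI flags win over the environment
  fromEnv ++ cliOverrides
}
```

The point is the precedence order: environment defaults are applied first, and per-driver CLI overrides are layered on afterwards, so mahoutSparkContext itself changes as little as possible.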
> On Nov 11, 2014, at 12:36 PM, Dmitriy Lyubimov <> wrote:
> IMO you just need to modify `mahout spark-shell` to propagate -Dx=y
> parameters to the java startup call and all should be fine.
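[Editor's note: a sketch of what picking up -Dx=y parameters looks like on the JVM side, assuming the launch script passes them through to the java call as suggested above. Any spark.* property set with -D then shows up in system properties, where SparkConf(loadDefaults = true) also reads them.]

```scala
// Sketch: collect spark.* system properties (set via -Dx=y on the java
// startup call) so they can be applied to a SparkConf before the context
// is created. This helper name is illustrative, not Mahout's API.
def sparkSysProps(): Map[String, String] =
  sys.props.toMap.filter { case (k, _) => k.startsWith("spark.") }
```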
> On Tue, Nov 11, 2014 at 12:23 PM, Andrew Palumbo <>
> wrote:
>> I've run into this problem starting `$ mahout spark-shell`, i.e. needing
>> to set spark.kryoserializer.buffer.mb and spark.akka.frameSize. I've
>> been temporarily hard-coding them for now while developing.
>> I'm just getting familiar with what you've done with the CLI drivers. For
>> #2, could we borrow option-parsing code/methods from Spark [1] [2] at each
>> (Spark) release and somehow add this to
>> MahoutOptionParser.parseSparkOptions?
>> I'll hopefully be doing some CLI work soon and have a better
>> understanding.
>> [1]
>> [2]
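[Editor's note: a sketch of what a MahoutOptionParser.parseSparkOptions-style helper could do, per the question above: pull generic key=value Spark properties out of the CLI args so arbitrary options survive without Mahout tracking each one. The -D:key=value syntax is an assumption for illustration, not necessarily Mahout's actual convention.]

```scala
// Sketch: split CLI args into Spark properties (given as -D:key=value)
// and the remaining driver-specific args. The extracted map can then be
// applied to the SparkConf wholesale, avoiding per-option parsing code.
def parseSparkOptions(args: Seq[String]): (Map[String, String], Seq[String]) = {
  val SparkOpt = """-D:([^=]+)=(.+)""".r
  val (sparkArgs, rest) = args.partition(a => SparkOpt.pattern.matcher(a).matches)
  val props = sparkArgs.collect { case SparkOpt(k, v) => k -> v }.toMap
  (props, rest)
}
```

The attraction over borrowing Spark's own parsing code is that a pass-through like this never has to change when Spark adds or renames a property.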
>>> From:
>>> Subject: Spark options
>>> Date: Wed, 5 Nov 2014 09:48:59 -0800
>>> To:
>>> Spark has a launch script as Hadoop does. We use the Hadoop launcher
>>> script but not the Spark one. When starting up your Spark cluster there is
>>> a script that can set a bunch of environment variables. In our own
>>> mahoutSparkContext function, which takes the place of the Spark submit
>>> script and launcher, we don’t account for most of the environment
>>> variables.
>>> Unless I missed something, this means most of the documented options will
>>> be ignored unless a user of Mahout parses and sets them in their own
>>> SparkConf. The Mahout CLI drivers don’t do this for all possible options,
>>> only supporting a few like job name and spark.executor.memory.
>>> The question is how best to handle these Spark options. There seem to be
>>> two options:
>>> 1) use Spark's launch mechanism for drivers but allow some options to be
>>> overridden in the CLI
>>> 2) parse the env for options and set up the SparkConf defaults in
>>> mahoutSparkContext with those variables.
>>> The downside of #2 is that as variables change we’ll have to reflect
>>> those in our code. I forget why #1 is not an option, but Dmitriy has been
>>> consistently against this; in any case it would mean a fair bit of
>>> refactoring, I believe.
>>> Any opinions or corrections?
