mahout-dev mailing list archives

From Pat Ferrel <>
Subject Re: Mahout without a CLI?
Date Tue, 15 Apr 2014 18:57:14 GMT
Quite happy to have you live in the shell and do the arcane math that most end users don’t
want to be required to know. That’s why Apache pays you the big bucks ;-)

In my experience the pipeline-customization problem is one of import and export. That, and having to write Java to do it. Pig+UDFs is an example of a solution to the import/export problem, though you do have to learn Pig. Everyone uses Solr (or Elasticsearch). Both of these are language agnostic and have extremely flexible integration methods and formats. Let's target their users.

Language-agnostic data formats and blackbox boundaries will make Mahout far easier to use for production engineers. The rest will dive into the Scala shell, and maybe more will do so over time. But let's not dismiss a potentially huge number of users by saying they don't exist; that will be self-fulfilling, as it has been in the past. If Mahout has made a misstep, it is in moving away from these users. We have a clean slate here; if we target a broad user base well, they will come.

On Apr 15, 2014, at 11:31 AM, Dmitriy Lyubimov <> wrote:

Finally, the whole point of an ML environment is to enable pipeline
customization. Mahout's major criticism is mostly that: "we can't
integrate and customize pipelines using Mahout's methods because Mahout
throws us into a bash environment (only) to do that, and that's silly".

So the question is always about how we connect building blocks, how we do
customized (cross-)validation rounds, etc. I think we have consistently heard
that. So the main successful argument here is that the programming environment
is primary, and everything else is secondary. Supporting notions are that the
environment is an existing, accepted one with a sufficient third-party
following rather than a new one (i.e. Scala in our case), and that there's no
mix of environments (such as in the Pig/Pig-UDF conundrum).

So sure, just to try things out, one wants to call a method with
predefined input and output locations. But as soon as the "kicking the
tires" stage ends, one wants to do tons of other things before and after
the method (e.g. grabbing the latest time-stamped HDFS input rather than a
predefined hardcoded constant), or even combine a bunch of methods
(e.g. an LSA pipeline).
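Even the simple "grab the latest time-stamped input" step above is a one-liner in a script shell. A minimal plain-Scala sketch (the path naming scheme here is an illustrative assumption, not a Mahout convention):

```scala
// Pick the latest time-stamped input path instead of a hardcoded constant.
// Illustrative sketch: paths are assumed to embed a sortable yyyyMMdd stamp,
// so the lexicographic maximum is also the chronologically latest one.
def latestInput(paths: Seq[String]): Option[String] =
  if (paths.isEmpty) None else Some(paths.max)

val latest = latestInput(Seq(
  "/data/events-20140401",
  "/data/events-20140415",
  "/data/events-20140408"))
// latest: Some("/data/events-20140415")
```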

Assuming we operate on a constrained resource schedule, I'd just go after the
top priorities first. I would not oppose somebody spending time building
CLIs and a CLI-based tutorial, of course; I just don't think we
realistically have people willing to do that.

On Tue, Apr 15, 2014 at 11:14 AM, Dmitriy Lyubimov <> wrote:

> On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <> wrote:
>> Sorry you are sick. Thanks for the tip. Spark has a client launcher
>> method "spark-class …Client launch ..." but I’m not having much success
>> with that.
> This will not work because you need Mahout's classpath too. And Spark's.
> The complexity here is the damn jar dependencies. Anything the Spark (or
> Hadoop, for that matter) CLI does assumes the application is so simple it can
> fit into a single jar and will have zero external dependencies. I could do my
> own rant about that for ages.
> So, the task here is to collect all Spark jars and their dependencies, merge
> them with Mahout's, perhaps filtering in only what is really
> needed in Spark-based pipelines, and then run it. That is what the specialized
> mahoutContext() API does, and there's a crapload of Scala code devoted just
> to this single issue of deducing and grabbing dependencies and making sure
> Spark takes them.
> Hope this clarifies why Spark helpers' ways of starting standalone Spark
> applications just are not helpful for us (or anyone, to be frank: I have
> participated in a good dozen Spark-based projects, and none of them
> could use helpers like Client, for the same reason --
> they had to do their own bootstrap routine).
> So... we will have to have our own helpers to do that. I wonder if
> there's already a similar syntax for Mahout, something like "mahout
> run-class <class-name>". Since I never used that, I don't know for sure,
> but Hadoop subordinate projects usually all have one (e.g. there's an
> "hbase <class-name>" command to run any class in the HBase code base with
> the proper classpath dependencies taken care of).
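The jar-collection step described above (take the union of Spark's and Mahout's dependency jars, filter to what the pipelines actually need, and deduplicate) can be sketched in a few lines of plain Scala. This is an illustration of the idea only; the names are hypothetical, and this is not Mahout's actual mahoutContext() code:

```scala
// Hypothetical sketch of classpath assembly -- not Mahout's actual code.
// Union two dependency lists, keep only entries the predicate accepts,
// and drop duplicates while preserving order.
def assembleClasspath(sparkJars: Seq[String],
                      mahoutJars: Seq[String],
                      wanted: String => Boolean): Seq[String] =
  (sparkJars ++ mahoutJars).filter(wanted).distinct

val cp = assembleClasspath(
  Seq("spark-core.jar", "guava.jar"),
  Seq("mahout-math-scala.jar", "guava.jar"),
  _.endsWith(".jar"))
// cp: Seq("spark-core.jar", "guava.jar", "mahout-math-scala.jar")
```

The real work, as the message notes, is deciding which jars are "wanted" and making sure Spark actually ships them to the workers.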
>> As for the statement "There is not, nor do I think there will be, a way to
>> run this stuff with a CLI" -- that seems unduly misleading. Really, does
>> anyone second this?
>> There will be Scala scripts to drive this stuff, yes, even from the
>> CLI. But do you imagine that every Mahout USER will be a Scala + Mahout DSL
>> programmer? That may be fine for committers, but users will be PHP devs, Ruby
>> devs, Python or Java devs, maybe even a few C# devs. I think you are
>> confusing Mahout DEVS with USERS. Few users are R devs moving into
>> production work; they are production engineers moving into ML who want a
>> blackbox. They will need a language-agnostic way to drive Mahout. Making
>> statements like this only confuses potential users and drives them away to no
>> purpose. I'm happy for the nascent Mahout-Scala shell, but it's not in the
>> typical user's world view.
>> Sorry, end of rant.
>> On Apr 15, 2014, at 10:14 AM, Dmitriy Lyubimov (JIRA) <>
>> wrote:
>> Dmitriy Lyubimov commented on MAHOUT-1464:
>> ------------------------------------------
>> [My] Silence indicates I've been pretty sick :)
>> I thought I explained in my email that we are not planning a CLI. We are
>> planning a script shell instead. There is not, nor do I think there will be,
>> a way to run this stuff with a CLI, just as there's no way to invoke a
>> particular method in R without writing a short script.
>> That said, yes, you can try to run it as a Java application, i.e.
>> [java|scala] -cp <cp> <class-name>
>> where <cp> is what `mahout classpath` returns.
>>> Cooccurrence Analysis on Spark
>>> ------------------------------
>>>               Key: MAHOUT-1464
>>>               URL:
>>>           Project: Mahout
>>>        Issue Type: Improvement
>>>        Components: Collaborative Filtering
>>>       Environment: hadoop, spark
>>>          Reporter: Pat Ferrel
>>>          Assignee: Sebastian Schelter
>>>           Fix For: 1.0
>>>       Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
>> a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>> has several applications including cross-action recommendations.
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
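As background to the ticket quoted above: the LLR score at the heart of RowSimilarityJob is Dunning's log-likelihood ratio over a 2x2 table of cooccurrence counts. A self-contained sketch of the standard formula, for illustration only (Mahout ships its own LogLikelihood utility):

```scala
// Dunning's log-likelihood ratio for a 2x2 contingency table:
//   k11 = count of both events together, k12/k21 = one event without the other,
//   k22 = count of neither event.
def xLogX(x: Double): Double = if (x == 0.0) 0.0 else x * math.log(x)
def entropy(counts: Double*): Double = xLogX(counts.sum) - counts.map(xLogX).sum
def llr(k11: Double, k12: Double, k21: Double, k22: Double): Double =
  2.0 * (entropy(k11 + k12, k21 + k22)    // row entropy
       + entropy(k11 + k21, k12 + k22)    // column entropy
       - entropy(k11, k12, k21, k22))     // matrix entropy

// Strongly associated counts score high; independent counts score ~0.
```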
