mahout-dev mailing list archives

From Andrew Musselman <>
Subject Re: Mahout 1.0 goals
Date Fri, 28 Feb 2014 03:01:07 GMT
Thanks for starting the conversation, Ted.  I'm relatively new to the
project though I've been using Mahout for a couple years in production, and
am happy to see things move forward in whatever way makes sense.

I think Mahout needs to ship a production-ready version if it's going to be
called 1.0, otherwise we ought to call the next release 0.10.

In that vein, I think Sean, Dmitriy, and Ted have some good points in that
Mahout is still a very rough draft.  I think we all have used some portion
of Mahout in production and are surprised when we find out how dodgy things
are in certain spots when we look around further after learning how our
favorite things work.

I'd like to see several of the things you mention, Ted, including
decoupling from Hadoop and map-reduce where possible, working on the speed
competition, exporting to PMML, and clarifying the programming approach.

And I'm not sure if this is what Dmitriy meant in his comments (3), but I'd
love to be able to do Mathematica-style work in an interactive shell and/or
symbolic system where I could do A*B' and it just worked.  That would crush
everything on the market, though it could be a lot of work to build a DSL
that supports it.

I also think Dmitriy's (5) for having up-front data assessment stuff is
really valuable.  I'm building things like that internally at work and I
can confirm that there is high demand for it.

Along with the up-front pipelining, something I'd like back in Mahout is a
feature that I think was in there and got removed: shipping results in a web
service, without writing your own.

So I'd like a free machine-learning library I can count on to make sense
when I use the Java/Scala API or command-line programs, take raw data and
do the necessary "first whack" at it, prepare vectors for jobs, run jobs,
and then build a jar file I can put into Jetty or Tomcat, and for bonus points
do that "real-time" solr-recommender-style recalculation and results.

The end-to-end part is where I think Mahout could sprint to the front of the
pack and do well.


On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
> - runs with or without Hadoop
> - runs with or without map-reduce
> - includes (at least) regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
> - reasonably competitive speed against other implementations including
> GraphLab, MLlib and R.
> - interactive model building
> - models can be exported as code or data
> - simple programming model
> - programmable via Java or R
> - runs clustered or not
> What does everybody think?
