mahout-dev mailing list archives

From Suneel Marthi <>
Subject Re: Mahout 1.0 goals
Date Sat, 08 Mar 2014 22:55:37 GMT

On Saturday, March 8, 2014 5:41 PM, Pat Ferrel <> wrote:
Ah, now back to freely babbling on the dev list.

Mahout wishlist:
1) scaling: I don’t get the need for R integration or running without Hadoop or Spark.
You can run Hadoop in local mode on your native file system, even under a debugger, then run
the exact same code on a cluster. If you don’t care about scaling there are plenty of great
libs for R already, why worry about Mahout? One project I worked on started with the in-memory
recommender but within months had hopelessly outgrown it. If there hadn’t been at least a path
to scaling we would never have started with Mahout. Non-scalable code is fine and solves
many applications but I hope it’s not the primary design point.
2) speed: see below. Hadoop now (speed means buying more computers), more Spark later (buy
fewer computers).
3) ease of data input/output. The conversion of external ids into Mahout sequential integers
is deceptively difficult and has to be re-created with every project. I’m trying to submit
an example, which includes an input/output pipeline that is mostly scalable. It takes delimited
logfiles with external ids, creates Mahout input, then takes the output of Mahout and converts
back to external ids. It is not worthy of core inclusion but is at least a prototype or example
of how to do this. 
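
To make the id round trip in item 3 concrete, here is a minimal sketch. It is not the pipeline described above: it is a single-process, in-memory illustration, so it does not scale, and every name in it (`IdDictionary`, `encode_log`, the tab-delimited user/item/preference layout) is a hypothetical chosen for the example rather than anything from Mahout.

```python
from typing import Dict, Iterable, List, Tuple


class IdDictionary:
    """Hypothetical two-way map: external string ids <-> sequential ints."""

    def __init__(self) -> None:
        self._to_internal: Dict[str, int] = {}
        self._to_external: List[str] = []

    def internal(self, external_id: str) -> int:
        """Return the integer for an external id, assigning the next one if new."""
        idx = self._to_internal.get(external_id)
        if idx is None:
            idx = len(self._to_external)
            self._to_internal[external_id] = idx
            self._to_external.append(external_id)
        return idx

    def external(self, internal_id: int) -> str:
        """Translate an integer id from the output back to the external id."""
        return self._to_external[internal_id]


def encode_log(
    lines: Iterable[str],
    users: IdDictionary,
    items: IdDictionary,
    delimiter: str = "\t",
) -> List[Tuple[int, int, float]]:
    """Turn delimited log lines into (user_int, item_int, preference) triples.

    Assumed layout: user id, item id, optional preference (defaults to 1.0).
    """
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        pref = float(fields[2]) if len(fields) > 2 else 1.0
        triples.append((users.internal(fields[0]), items.internal(fields[1]), pref))
    return triples
```

The two dictionaries have to be kept alongside the job output, since the integer ids in the results are only meaningful when passed back through `external()`. On a cluster the dictionary itself becomes the hard part, which is why this is deceptively difficult.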

My $0.02 worth about the future of Mahout:
1) the future will be in moving lots of the current code to Spark and that may not be the
end of it. If yet another faster platform emerges Mahout will have to go there too. If Mahout
doesn’t move (pretty quickly) someone will fill the gap and Mahout will be left behind.
2) the future of Mahout is tied to big data, at least I hope so.

Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge algorithms or is
Mahout a scalable, performant ML library that is targeted for production environments?

>> Agree with the latter, and given that the future is moving existing implementations
to Spark, all the more reason to make Mahout less of an experimental sandbox. 

I hope most people think it is the latter.