mahout-dev mailing list archives

From Grant Ingersoll <>
Subject Re: Mahout feedback
Date Fri, 15 Apr 2011 14:25:55 GMT

On Apr 15, 2011, at 8:59 AM, Sean Owen wrote:

> I had a chance to get feedback last night from a few Old Street
> startups using Mahout. The overall comments were of course positive --
> it provides a solution that's at least 80% ready-to-go and saves a
> great deal of trial and error in getting towards something working.

Very cool.  Thanks for bringing this up.

> The problems I heard were similar to last time. The jobs are uneven
> and not standard, so each has its own peculiar learning curve. There
> are evidently still a number of invisible assumptions baked into the
> code about the file structure and environment too -- I heard again
> that repeated use of "new Configuration()" around the code breaks
> things. The experience of Mahout seemed to be one of weeks of trial
> and error, some of which has to do with understanding the machinery of
> Hadoop of course.

And machine learning, I suspect.  There seems to be a fair amount of T & E in ML no matter
what, given the need to find good parameters and to do feature selection.
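The "new Configuration()" complaint quoted above is worth illustrating. Below is a minimal, self-contained sketch; the Configuration class here is a tiny stand-in for org.apache.hadoop.conf.Configuration (just enough API to show the problem), and the methods are invented for illustration. It shows why constructing a fresh configuration inside library code silently drops the caller's settings:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration -- just enough to show the issue.
class Configuration {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key, String dflt) { return props.getOrDefault(key, dflt); }
}

public class ConfigDemo {
    // Anti-pattern: ignores the caller's configuration entirely.
    static String badRead(Configuration ignored) {
        Configuration conf = new Configuration();      // fresh instance: user settings lost
        return conf.get("fs.default.name", "file:///");
    }

    // Fix: thread the caller's configuration through instead of constructing a new one.
    static String goodRead(Configuration conf) {
        return conf.get("fs.default.name", "file:///");
    }

    public static void main(String[] args) {
        Configuration userConf = new Configuration();
        userConf.set("fs.default.name", "hdfs://cluster:9000");
        System.out.println(badRead(userConf));   // prints file:/// -- setting silently dropped
        System.out.println(goodRead(userConf));  // prints hdfs://cluster:9000
    }
}
```

The fix pattern is simply to accept and reuse (or copy) the configuration handed in by the job driver rather than instantiating a new one deep inside library code.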

> Finally there was a group using the LDA
> implementation but had abandoned it over scalability concerns --
> didn't get more detail on that.

Yeah, we've heard the LDA concerns before.  I actually think all of our clustering other than
K-Means needs a good hard look in terms of performance.  From the tests that Tim and Szymon
and I did for MAHOUT-588, Dirichlet, Fuzzy K-Means, Mean Shift and Canopy don't look good.
In fact, K-Means is the only one that scaled.  We've got a repeatable framework set up for
them, so they should be runnable by others.

Now, having said that, I believe our approach with them is correct (in other words, they should
be able to scale), so it points to either the way we were running them or the implementation.
I hope it's the former, but am pretty sure it's the latter.

Hopefully the convergence work that Jeff is doing will make performance improvements easier
to get, since there is less code to debug and the pathways get exercised more.

> I do reiterate that there is, at heart, a significant and eager
> developer audience who is finding all this really useful, that are
> burning up a lot of energy just getting started. That's just the
> nature of this beast at version 0.x, but, I think it just once again
> underscores that the need is not for new algorithms, but cleaning up,
> fixing, documenting, streamlining what's already there.

Yeah, I agree.  I had really hoped that someone would put in for a benchmarking GSOC project,
but I didn't see it (if you did submit one, please point me at it, as I missed the title!)

I also think it points to the fact that we aren't going to go from 0.5 to 1.0 like we had
thought, but instead will have at least a 0.6.

I think in order to do this scrubbing, we should focus on some real-world data that we can
run our primary algorithms on.  I would propose the 6.5M ASF mail archives that I have up
on S3 (see utils/bin/ on trunk).  From this, I think we could test/demo
our 3 C's (clustering, classifiers, collab filtering) along w/ Freq. Patternset mining.  Doing
this will give us a consistent, easily repeated set of examples across real content and should
help us flesh out the performance issues as well as exercise many of the dark areas of Mahout.
This would also let us put together some recipes around how to do things in Mahout, especially
feature selection.  The bonus is that all the data is freely redistributable.

I think we would also benefit from something similar to Lucene's randomized testing framework.
I'm not sure how to incorporate it just yet, but it would massively expand our test capabilities.
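For anyone who hasn't seen it, here is a rough sketch of the core idea behind seed-based randomized testing: pick a random seed, report it so any failure is reproducible, then hammer the code under test with random inputs.  The method under test (clampNonNegative) is invented purely for illustration, not anything in Mahout or Lucene:

```java
import java.util.Random;

// Minimal sketch of seed-based randomized testing: every run uses (and prints)
// a seed, so a failing run can be replayed exactly by passing that seed back in.
public class RandomizedTestSketch {
    // Hypothetical code under test: must never return a negative value.
    static int clampNonNegative(int x) { return Math.max(0, x); }

    public static void main(String[] args) {
        long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        System.out.println("test seed: " + seed);   // re-run with this seed to reproduce
        Random rnd = new Random(seed);
        for (int i = 0; i < 1000; i++) {
            int input = rnd.nextInt();              // random input, not a hand-picked case
            if (clampNonNegative(input) < 0) {
                throw new AssertionError("failed for input " + input + " (seed " + seed + ")");
            }
        }
        System.out.println("1000 randomized cases passed");
    }
}
```

The payoff is that each CI run explores a different slice of the input space, while the printed seed keeps failures deterministic and debuggable.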

I will also go back to my REST service layer.  I think if we had a service layer (a la Solr,
etc.) where you could start jobs, get status, add content, get results, etc., all in a scalable
way, it would really help people get started and running.  This is probably longer term, however.
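As a strawman for what such a service layer could look like, here is a hypothetical sketch using only the JDK's built-in com.sun.net.httpserver.  The endpoints (/jobs, /jobs/{id}) and the JSON shapes are invented for illustration; this is not an actual or proposed Mahout API, just the start-a-job / poll-status shape described above:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical job-control REST sketch: POST /jobs "starts" a job,
// GET /jobs/{id} reports its status. Endpoint names are invented.
public class JobServiceSketch {
    static final Map<Long, String> jobs = new ConcurrentHashMap<>();
    static final AtomicLong nextId = new AtomicLong(1);

    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/jobs", JobServiceSketch::handle);
        server.start();
        return server;
    }

    static void handle(HttpExchange ex) throws IOException {
        String path = ex.getRequestURI().getPath();
        String body;
        if ("POST".equals(ex.getRequestMethod()) && path.equals("/jobs")) {
            long id = nextId.getAndIncrement();        // "start" a job (no real work here)
            jobs.put(id, "RUNNING");
            body = "{\"id\":" + id + ",\"status\":\"RUNNING\"}";
        } else if ("GET".equals(ex.getRequestMethod()) && path.startsWith("/jobs/")) {
            long id = Long.parseLong(path.substring("/jobs/".length()));
            body = "{\"id\":" + id + ",\"status\":\"" + jobs.getOrDefault(id, "UNKNOWN") + "\"}";
        } else {
            body = "{\"error\":\"unsupported\"}";
        }
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(200, bytes.length);
        try (OutputStream os = ex.getResponseBody()) { os.write(bytes); }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0);                  // port 0 = pick any free port
        System.out.println("sketch listening on port " + server.getAddress().getPort());
        server.stop(0);                                // a real service would keep running
    }
}
```

The point of the sketch is the interaction shape, not the implementation: clients never touch Hadoop configuration or the filesystem directly, which is exactly the class of invisible-assumption problem reported above.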
