mahout-dev mailing list archives

From Robin Anil
Subject Making Mahout Leaner
Date Tue, 08 May 2012 15:11:46 GMT
Based on some discussion on the private group about where Mahout is
faltering in the real world, a stream of thought bubbled up: make Mahout
leaner, i.e. push the best stuff we have to the top and prune out algorithms
that are underperforming. The main issue here is that the iterative nature of
many of the algorithms makes them inefficient to implement on top of
current Hadoop. The summary of the state of the discussion so far:

1) Focus on large-scale data (not medium-scale) and on algorithms that
run at *almost* O(n).
2) Focus on deployability and less on making it an analysis tool for data.
3) Prune, prune, prune things that are not being maintained.

The following is one way of looking at Mahout and the state of its
algorithms. Let us know if you would like something to be in the keeper list.

1. Recommenders -- clearly a keeper
2. SGD
3. LDA
4. Some clustering (with upgrades)
5. Math + collections
6. Hadoop Utilities + Integration -- I know it's silly, but things like
sequence file dumper, the iterators, etc. are handy in a number of places.
7. SVD and related
8. RowSimilarity
9. Some of the upfront preprocessing tools (Lucene, Text, etc.)


- Bayes + Random Forest - Seems a shame on Bayes, since it gives a
baseline, but I don't know that it actually works, and then there's the
whole split-personality nature of it (text-based and vector-based)
- Collocations - I'd say keep for now, even if just for selfish reasons
- Minhash - every time I look at it, it seems broken, and the original author
doesn't respond to requests for explanation.
- Freq. Item Set - Tom's done some work to clean up and I've tried it on
search logs and the results looked OK, but no formal evaluation. I've seen
others say why not just do simpler co-occurrence stuff...

Drop for sure:

1. Watchmaker
2. Unused/poor examples
3. Probably a lot more that escapes me at the moment.
4. PageRank
Robin Anil
