mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: MapReduce Stats calculations
Date Fri, 06 May 2011 13:58:21 GMT
Hadoop has something like this:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html

I find there's a very strong and unfortunate tension between
reusability and performance in some cases. Having a discrete stage to
compute something like this is good for reuse; but if it can be computed
inline in a prior stage and output on the side, that avoids a separate
pass over the data, which is a big performance savings.
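
To make the "computed inline" idea concrete for the statistics discussed in this thread (mean, variance, standard deviation), here is a minimal sketch. It assumes the values can be summarized by three running totals: count, sum, and sum of squares. Because those totals simply add, each task could keep one while doing its main work and the partial results could be merged later. The class name RunningStats and its methods are hypothetical; this is not an existing Hadoop or Mahout API.

```java
// Hypothetical helper, not part of Hadoop or Mahout: a mergeable summary
// built from count, sum, and sum of squares.
final class RunningStats {
    private long count;
    private double sum;
    private double sumOfSquares;

    /** Fold one observation into the summary, e.g. inside an existing map or reduce call. */
    void add(double x) {
        count++;
        sum += x;
        sumOfSquares += x * x;
    }

    /** Combine with a partial summary produced elsewhere (another task, another split). */
    void merge(RunningStats other) {
        count += other.count;
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
    }

    long getCount() {
        return count;
    }

    double getMean() {
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Population variance, computed as E[x^2] - (E[x])^2 and clamped at zero. */
    double getVariance() {
        if (count == 0) {
            return Double.NaN;
        }
        double mean = sum / count;
        return Math.max(0.0, sumOfSquares / count - mean * mean);
    }

    double getStdDev() {
        return Math.sqrt(getVariance());
    }
}
```

One caveat on this design: the sum-of-squares formula is simple and merges trivially, but it can lose precision when the mean is large relative to the spread, so a more careful implementation might prefer a numerically stable update.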

I also find myself tempted to construct a bunch of M/R primitives. For
now I am trying to restrict my thinking to refactoring pieces that can
come out easily, and that are used already in at least one place.

I suppose I mean: if you want to write primitive X and can't find one
good use for it yet in Mahout, I'd hold off, but otherwise would
surely add it and use it.


On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> MAHOUT-688 has an M/R job to calculate std. deviation for document
> frequencies so that it can prune noisy words.  I'm thinking of making it
> a bit more generic and adding a stats package to
> org.apache.mahout.math.hadoop that contains this and other basic stats
> calculations (mean, variance, sum of squares, etc.) that operate in M/R.
>
> Is that useful or am I re-inventing the wheel here or wasting time?
> Seems like such a beast should already exist, but a quick search didn't
> turn up much.
>
> -Grant
