mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: MapReduce Stats calculations
Date Fri, 06 May 2011 15:23:57 GMT
yeah... un-re-used re-usable primitives are of little help, but a Mahout big
data equivalent of the R summary function would be handy to have.  The fact is,
we already have the re-usable bits anyway.  It is common to want column-wise
summaries of big matrices.  Useful summaries include:

a) moment based statistics like average and standard deviation

b) rank based statistics like min, max, and the 1st, 5th, 25th, 50th, 75th,
95th, and 99th percentiles.

c) counts of positive, negative and all entries

d) for word or text-like data, the total number of unique items with
frequency greater than 0, 1, 5 and the top 5-10 most common items.
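
The moment-based statistics in (a), the counts in (c), and min and max from
(b) can all be kept in a small per-column accumulator that merges across
splits, which is what makes them a natural fit for a map/combine/reduce job.
The following is a minimal Python sketch of that idea, not Mahout code: the
names ColumnSummary, summarize_split and merge_splits are hypothetical, the
variance uses the standard pairwise merge of running means and squared
deviations, and the standard deviation is assumed to be the sample (n - 1)
version. Percentiles and the top-item counts in (d) are left out because they
need either a sort or an approximate sketch rather than a fixed-size
accumulator.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass
class ColumnSummary:
    """Mergeable summary of one matrix column (hypothetical, not a Mahout class)."""
    n: int = 0                 # count of all entries
    mean: float = 0.0          # running mean
    m2: float = 0.0            # running sum of squared deviations from the mean
    lo: float = math.inf       # minimum seen so far
    hi: float = -math.inf      # maximum seen so far
    positive: int = 0          # count of entries > 0
    negative: int = 0          # count of entries < 0

    def add(self, x: float) -> None:
        # Single-value (Welford-style) update of mean and m2.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        self.positive += int(x > 0)
        self.negative += int(x < 0)

    def merge(self, other: "ColumnSummary") -> "ColumnSummary":
        # Combine two partial summaries, e.g. from two mappers.
        if other.n == 0:
            return self
        if self.n == 0:
            return other
        n = self.n + other.n
        delta = other.mean - self.mean
        return ColumnSummary(
            n=n,
            mean=self.mean + delta * other.n / n,
            m2=self.m2 + other.m2 + delta * delta * self.n * other.n / n,
            lo=min(self.lo, other.lo),
            hi=max(self.hi, other.hi),
            positive=self.positive + other.positive,
            negative=self.negative + other.negative,
        )

    @property
    def std(self) -> float:
        # Sample standard deviation (assumption: n - 1 in the denominator).
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def summarize_split(rows):
    """Map side: one pass over the rows of a split, one summary per column."""
    summaries = []
    for row in rows:
        if not summaries:
            summaries = [ColumnSummary() for _ in row]
        for summary, x in zip(summaries, row):
            summary.add(x)
    return summaries


def merge_splits(a, b):
    """Reduce side: merge two lists of per-column summaries."""
    if not a:
        return b
    if not b:
        return a
    return [x.merge(y) for x, y in zip(a, b)]


if __name__ == "__main__":
    splits = [[[1.0, -2.0]], [[2.0, 0.0], [3.0, 2.0], [4.0, 4.0]]]
    total = reduce(merge_splits, (summarize_split(s) for s in splits))
    for i, col in enumerate(total):
        print(i, col.n, col.mean, col.std, col.lo, col.hi,
              col.positive, col.negative)
```

Because merge is associative, the same function could serve as both combiner
and reducer, and the accumulator could also be updated inline in an earlier
stage and written out as a side output.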


On Fri, May 6, 2011 at 6:58 AM, Sean Owen <srowen@gmail.com> wrote:

> Hadoop has something like this:
>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html
>
> I find there's a very strong and unfortunate tension between
> reusability and performance in some cases. Having a discrete stage to
> compute something like this is good; if it can be computed inline in a
> prior stage and output on the side, that's a big performance savings.
>
> I also find myself tempted to construct a bunch of M/R primitives. For
> now I am trying to restrict my thinking to refactoring pieces that can
> come out easily, and that are used already in at least one place.
>
> I suppose I mean: if you want to write primitive X and can't find one
> good use for it yet in Mahout, I'd hold off, but otherwise would
> surely add it and use it.
>
>
> On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <gsingers@apache.org>
> wrote:
> > MAHOUT-688 has a M/R job to calculate std. deviation for document
> > frequencies so that it can prune noisy words.  I'm thinking of making it a
> > bit more generic and adding a stats package to org.apache.mahout.math.hadoop
> > that contains this and other basic stats calculations (mean, variance, sum
> > of squares, etc.) that operate in M/R.
> >
> > Is that useful or am I re-inventing the wheel here or wasting time?
> > Seems like such a beast should already exist, but a quick search didn't
> > turn up much.
> >
> > -Grant
>
