commons-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: [math] MATH-224 - need a better idea
Date Sun, 19 Apr 2009 21:51:44 GMT
That is a fine answer for some things, but the parallel cases fail.

My feeling is that there are a few cases where there are nice aggregatable
summary statistics like moments, and many cases where this just
doesn't work well (such as rank statistics).  For the latter case, I usually
make do with a surrogate such as a random sub-sample or a recency-weighted
random sub-sample, combined with a few aggregatable stats such as total
samples, max, min, sum and second moment.  That gives me most of what I want,
and if the sub-sample is reasonably large, I can sometimes estimate a few
parameters such as total uniques.  The sub-sampled data streams can be
combined trivially, so I now have an aggregatable approximation of
non-aggregatable statistics.  For descriptive quantiles this is generally
just fine.
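
To make that concrete, here is a rough sketch of the idea, not code from
commons-math or from any system I run.  The class name StreamSummary, its
method names and the default capacity are all made up for illustration.  It
keeps the exactly aggregatable stats (count, min, max, sum, sum of squares)
next to a uniform reservoir sample, and merges two summaries by drawing from
each reservoir in proportion to how many values it stands for.  That weighted
draw is my assumption about one reasonable way to combine sub-samples; the
recency-weighted variant is not shown.

```python
import random


class StreamSummary:
    """Hypothetical mergeable summary: exact stats plus a uniform sub-sample."""

    def __init__(self, capacity=1000, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.count = 0
        self.total = 0.0
        self.sum_sq = 0.0  # raw second moment (sum of squares)
        self.min = float("inf")
        self.max = float("-inf")
        self.sample = []

    def add(self, x):
        # Exactly aggregatable statistics.
        self.count += 1
        self.total += x
        self.sum_sq += x * x
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        # Reservoir sampling: each value seen so far stays in the sample
        # with probability capacity / count.
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.sample[j] = x

    def merge(self, other):
        if self.capacity != other.capacity:
            raise ValueError("this sketch only merges equal capacities")
        merged = StreamSummary(self.capacity, self.rng)
        merged.count = self.count + other.count
        merged.total = self.total + other.total
        merged.sum_sq = self.sum_sq + other.sum_sq
        merged.min = min(self.min, other.min)
        merged.max = max(self.max, other.max)
        # Draw without replacement from the two reservoirs, weighting each
        # draw by the number of stream values still "unclaimed" on each side.
        a, b = list(self.sample), list(other.sample)
        self.rng.shuffle(a)
        self.rng.shuffle(b)
        a_rem, b_rem = self.count, other.count
        while len(merged.sample) < merged.capacity and a_rem + b_rem > 0:
            if self.rng.randrange(a_rem + b_rem) < a_rem:
                merged.sample.append(a.pop())
                a_rem -= 1
            else:
                merged.sample.append(b.pop())
                b_rem -= 1
        return merged

    def quantile(self, q):
        # Approximate quantile read off the sub-sample (nearest rank).
        if not self.sample:
            raise ValueError("no data")
        ordered = sorted(self.sample)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]
```

Each partition would feed its own StreamSummary through add(), and the
partitions would then be combined pairwise with merge().  The count, min, max
and sums of the result are exact; quantile() is only as good as the
sub-sample, which is usually enough for descriptive purposes.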

On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <thinman42@yahoo.com> wrote:

> The key would be to generate the aggregate statistics at the same time as
> the per-partition ones, instead of aggregating them after the fact.




-- 
Ted Dunning, CTO
DeepDyve
