commons-dev mailing list archives

From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] MATH-224 - need a better idea
Date Tue, 21 Apr 2009 10:20:08 GMT
John Bollinger wrote:
> The same approach could certainly be applied for DescriptiveStatistics, but the variable
> window complicates things: if a finite window is selected for the aggregate statistics then
> they will be sensitive to the order in which values are added to the contributing per-partition
> statistics.  That problem exists no matter when the aggregation is performed, however, and
> I guess the order we would get is reasonably likely to be the desired one.  Also, the removeMostRecentValue()
> and replaceMostRecentValue() methods are a bit tricky if they need to cascade to the aggregate
> statistics because the most recent value for one contributor may not be the most recent value
> for the aggregate.  Anyway, I'll prepare an AggregateDescriptiveStatistics along the same
> line as my AggregateSummaryStatistics, and then at least we'll have something concrete to
> discuss.  Shall I post it as an additional patch for MATH-224?
>   
> DescriptiveStatistics does provide an opportunity for aggregating after the fact that
> SummaryStatistics doesn't, because each contributing statistic remembers (some of) the values
> provided to it.  On the other hand, users already can manually aggregate DescriptiveStatistics
> objects.  What they cannot easily do after the fact is duplicate the overall order in which
> values were added to the set of DescriptiveStatistics, and that is exactly what AggregateDescriptiveStatistics
> will provide.  I think I'm rambling now, so I'll stop and write some code.
>   
Always a good idea ^

I was thinking initially of post-hoc aggregation, using the backing 
data, but it is worth investigating the approach above.  Thanks!
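
To make the difference between the two approaches concrete, here is a minimal sketch. It is not code from this thread or from the MATH-224 patch, and it does not use the Commons Math classes: `WindowedStats`, its `parent` link and `aggregatePostHoc` are hypothetical names invented for illustration. It assumes a simple "keep the last N values" window and shows that cascading each value to the aggregate as it arrives preserves the overall insertion order, while replaying the backing data afterwards does not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical windowed store: keeps at most `window` most recent values (window <= 0 means unlimited). */
class WindowedStats {
    private final int window;
    private final Deque<Double> values = new ArrayDeque<>();
    private final WindowedStats parent; // aggregate to cascade to, or null

    WindowedStats(int window) {
        this(window, null);
    }

    WindowedStats(int window, WindowedStats parent) {
        this.window = window;
        this.parent = parent;
    }

    /** Stores the value locally, evicts the oldest one if the window is full, then cascades. */
    void addValue(double x) {
        values.addLast(x);
        if (window > 0 && values.size() > window) {
            values.removeFirst();
        }
        if (parent != null) {
            parent.addValue(x);
        }
    }

    /** Returns a copy of the retained values, oldest first. */
    List<Double> getValues() {
        return new ArrayList<>(values);
    }

    double getMean() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Post-hoc aggregation: replays each contributor's retained values, one contributor at a time. */
    static WindowedStats aggregatePostHoc(int window, List<WindowedStats> parts) {
        WindowedStats result = new WindowedStats(window);
        for (WindowedStats p : parts) {
            for (double v : p.getValues()) {
                result.addValue(v);
            }
        }
        return result;
    }
}
```

For example, with an aggregate window of 3, suppose contributor B receives 10 and 20 before contributor A receives 1, 2 and 3. The cascading aggregate ends up holding 1, 2, 3, whereas post-hoc replay in the order A, B ends up holding 3, 10, 20.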

Phil
>
> Regards,
>
> John
>
>
>
>
> ________________________________
> From: Phil Steitz <phil.steitz@gmail.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Monday, April 20, 2009 7:01:20 AM
> Subject: Re: [math] MATH-224 - need a better idea
>
> Ted Dunning wrote:
>   
>> That is a fine answer for some things, but the parallel cases fail.
>>
>> My feeling is that there are a few cases where there are nice aggregatable
>> summary statistics like moments and there are many cases where this just
>> doesn't work well (such as rank statistics). 
>>     
> Yes, this is why not all statistics are "storeless."  We have another "summary" class
> that maintains its data in storage and supports "rolling" behavior in DescriptiveStatistics.
> The discussion here is focussed on the "storeless" case, which is limited to those stats
> that are computable in this way.  The cases of interest are stats that can be computed in
> one pass through the data but which can't be "aggregated" post hoc.  John's approach provides
> a simple solution to this problem.
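
For illustration only, a minimal sketch of the aggregate-as-you-go idea for the storeless case. The classes `StorelessStats` and `AggregatingStats` and the method `createContributor` are hypothetical names, not the classes in the MATH-224 patch or in Commons Math, and the choice of Welford's one-pass update for mean and variance is an assumption made to keep the example short. Each contributor updates its own statistics and forwards the same value to a shared aggregate, so nothing has to be combined after the fact.

```java
/** Hypothetical one-pass ("storeless") statistics: count, mean and sample variance. */
class StorelessStats {
    private long n;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** Welford's update: no values are stored. */
    void addValue(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long getN() {
        return n;
    }

    double getMean() {
        return n == 0 ? Double.NaN : mean;
    }

    double getVariance() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }
}

/** Hypothetical factory: hands out contributors that also feed a shared aggregate. */
class AggregatingStats {
    private final StorelessStats aggregate = new StorelessStats();

    /** Returns a per-partition statistic whose addValue also updates the aggregate. */
    StorelessStats createContributor() {
        return new StorelessStats() {
            @Override
            void addValue(double x) {
                super.addValue(x);
                // Contributors may live on different threads, so guard the shared state.
                synchronized (aggregate) {
                    aggregate.addValue(x);
                }
            }
        };
    }

    StorelessStats getAggregate() {
        return aggregate;
    }
}
```

The trade-off in this sketch is that every `addValue` call pays for two updates and a lock on the aggregate, in exchange for never needing a post-hoc merge formula.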
>
> For completeness, we should probably similarly implement aggregation in the sense defined
> in MATH-224 for DescriptiveStatistics as well.
> Phil
>   
>> For the latter case, I usually
>> make do with a surrogate such as a random sub-sample or a recency weighted
>> random sub-sample combined with a few aggregatable stats such as total
>> samples, max, min, sum and second moment.  That gives me most of what I want
>> and if the sub-sample is reasonably large, I can sometimes estimate a few
>> parameters such as total uniques.  The sub-sampled data streams can be
>> combined trivially so I now have an aggregatable approximation of
>> non-aggregatable statistics.  For descriptive quantiles this is generally
>> just fine.
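
As an illustration of the surrogate described above, here is a minimal sketch that pairs a uniform random sub-sample with the exactly aggregatable statistics (count, min, max, sum, sum of squares). The thread does not say how the sub-sample is drawn or merged, so the technique here is an assumption: "bottom-k" sampling, where every value gets a random key and the k values with the smallest keys are kept, which makes merging two samples a matter of taking the k smallest keys of the union. The class `MergeableSummary` and all of its members are hypothetical, and recency weighting is not shown.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical mergeable summary: exact moments plus a bottom-k uniform sub-sample. */
class MergeableSummary {
    private static final class Tagged {
        final double key;   // random priority; the k smallest keys are retained
        final double value;
        Tagged(double key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    private final int capacity;
    private final Random rng;
    // Max-heap on key, so the entry with the largest key is evicted first.
    private final PriorityQueue<Tagged> sample =
        new PriorityQueue<>(Comparator.comparingDouble((Tagged t) -> t.key).reversed());

    private long n;
    private double sum;
    private double sumSq;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Use a different seed per summary so the random keys are independent. */
    MergeableSummary(int capacity, long seed) {
        this.capacity = capacity;
        this.rng = new Random(seed);
    }

    void addValue(double x) {
        n++;
        sum += x;
        sumSq += x * x;
        min = Math.min(min, x);
        max = Math.max(max, x);
        offer(new Tagged(rng.nextDouble(), x));
    }

    private void offer(Tagged t) {
        sample.add(t);
        if (sample.size() > capacity) {
            sample.poll(); // drop the largest key
        }
    }

    /** Folds another summary into this one; `other` is left unchanged. */
    void merge(MergeableSummary other) {
        n += other.n;
        sum += other.sum;
        sumSq += other.sumSq;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        for (Tagged t : other.sample) {
            offer(t);
        }
    }

    /** Nearest-rank quantile of the sub-sample; approximate once values have been evicted. */
    double quantile(double p) {
        if (sample.isEmpty()) {
            return Double.NaN;
        }
        double[] v = new double[sample.size()];
        int i = 0;
        for (Tagged t : sample) {
            v[i++] = t.value;
        }
        Arrays.sort(v);
        int rank = (int) Math.ceil(p * v.length);
        return v[Math.min(v.length - 1, Math.max(0, rank - 1))];
    }

    long getN() { return n; }
    double getSum() { return sum; }
    double getSumSq() { return sumSq; }
    double getMin() { return min; }
    double getMax() { return max; }
    int sampleSize() { return sample.size(); }
}
```

The counts and moments merge exactly; only the quantile is an estimate, and its quality depends on the sub-sample capacity chosen.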
>>
>> On Sun, Apr 19, 2009 at 2:44 PM, John Bollinger <thinman42@yahoo.com> wrote:
>>
>>  
>>     
>>> The key would be to generate the aggregate statistics at the same time as
>>> the per-partition ones, instead of aggregating them after the fact.
>>>    
>>>       
>>
>>
>>  
>>     
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>
>       
>   




