commons-dev mailing list archives

From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] MATH-224 - need a better idea
Date Mon, 20 Apr 2009 10:50:24 GMT
John Bollinger wrote:
> I'm looking at commons-math for the first time, but I don't think the feature can be
implemented as requested in a manner that is suitably generic.  On the other hand, I think
the same objective could be achieved a different way without changing the base API at all.
 The key would be to generate the aggregate statistics at the same time as the per-partition
ones, instead of aggregating them after the fact.  That does require knowing beforehand that
you're going to want the aggregate stats, but I think that's a fair tradeoff.  This could
be done without making client programs update two sets of statistics with each datum, by wrapping
each StorelessUnivariateStatistic with an implementation that forwards the data to two
StorelessUnivariateStatistics -- the wrapped one and one for the aggregate.  Almost all the
work of setting that up can be automated.
>
> I'll see whether I can whip up a proof of concept for you to check out.
>   
I like this approach.  As you point out, it avoids entirely the issues 
raised above and is actually quite flexible in terms of when streams 
start and end, etc.  The only downsides are a) cost of all the 
"forwarded" increment calls (not likely to be a real practical issue in 
most cases) and b) ease of use.  I mention b) only because I had to 
think for 5 seconds before anticipating how the test case was going to 
be coded.  I would appreciate feedback from others on this - especially 
those requesting the feature.
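
For readers following the thread, the forwarding wrapper John describes might look roughly like the sketch below. It is an illustration only: Stat is a hypothetical stand-in for the increment(double) and getResult() methods of StorelessUnivariateStatistic mentioned in the thread, and RunningMean and ForwardingStat are made-up names, not commons-math classes.

```java
// Hypothetical stand-in for the two StorelessUnivariateStatistic
// methods named in the thread; not the real commons-math interface.
interface Stat {
    void increment(double d);
    double getResult();
}

// Simple storeless mean, used here only to have something to wrap.
class RunningMean implements Stat {
    private long n;
    private double mean;

    public void increment(double d) {
        n++;
        mean += (d - mean) / n;
    }

    public double getResult() {
        return n == 0 ? Double.NaN : mean;
    }
}

// Forwards every datum to the wrapped per-partition statistic and to a
// shared aggregate statistic, so both are updated in a single call.
class ForwardingStat implements Stat {
    private final Stat wrapped;
    private final Stat aggregate;

    ForwardingStat(Stat wrapped, Stat aggregate) {
        this.wrapped = wrapped;
        this.aggregate = aggregate;
    }

    public void increment(double d) {
        wrapped.increment(d);
        aggregate.increment(d);
    }

    // Reports the per-partition value; the aggregate is read from the
    // shared instance that was passed to the constructor.
    public double getResult() {
        return wrapped.getResult();
    }
}
```

A client would create one aggregate statistic, wrap each partition's statistic together with it, and then feed data to the wrappers as usual. Each increment costs one extra forwarded call, which is downside a) above.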

Thanks!

Phil
> John
>
>
>
>
> ________________________________
> From: Phil Steitz <phil.steitz@gmail.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Sunday, April 19, 2009 11:34:24 AM
> Subject: [math] MATH-224 - need a better idea
>
> We should be able to find a clean way to do what this enhancement request is asking for.
 I am feeling stupid because even when I consider breaking compatibility / refactoring to
use generics, I can't find a simple way to do it.  Here is a description of the current API
and some failed ideas that I have considered so far.   As usual, I would like to minimize
pain for current users in addressing this, but at this point I am starting to think that wholesale
refactoring is necessary and I would appreciate ideas on the best way to do this.
>
> SummaryStatistics provides "storeless" computation of summary statistics - min, max,
mean, variance, etc.  Here "storeless" means that the class does not hold the stream of data
in memory.  It was designed to support pluggable implementations of the statistics that it
computes.  It does this in a way that looks smelly in the new world of type-safe Java (well,
maybe it always smelled ;)  The injectable implementation classes in SummaryStatistics are
typed as "StorelessUnivariateStatistic" which is an interface that includes things like getResult()
and increment(double).  There is nothing preventing, for example, a variance implementation
from being "plugged in" to implement the mean.
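
To make the type-safety concern concrete, here is a small sketch. All names are hypothetical (UniStat, SimpleMean, SimpleVariance, Summary and its setMeanImpl method stand in for the real classes), and the point is only that a setter typed against the common interface compiles happily with the wrong kind of statistic.

```java
// Hypothetical stand-in for the shared statistic interface.
interface UniStat {
    void increment(double d);
    double getResult();
}

class SimpleMean implements UniStat {
    private long n;
    private double mean;

    public void increment(double d) {
        n++;
        mean += (d - mean) / n;
    }

    public double getResult() {
        return n == 0 ? Double.NaN : mean;
    }
}

// Sample variance via Welford's update.
class SimpleVariance implements UniStat {
    private long n;
    private double mean;
    private double m2;

    public void increment(double d) {
        n++;
        double delta = d - mean;
        mean += delta / n;
        m2 += delta * (d - mean);
    }

    public double getResult() {
        return n < 2 ? Double.NaN : m2 / (n - 1);
    }
}

// Hypothetical summary class with a pluggable "mean" implementation.
class Summary {
    private UniStat meanImpl = new SimpleMean();

    // Nothing in this signature says the argument must compute a mean.
    void setMeanImpl(UniStat impl) {
        this.meanImpl = impl;
    }

    void addValue(double d) {
        meanImpl.increment(d);
    }

    double getMean() {
        return meanImpl.getResult();
    }
}
```

Calling setMeanImpl(new SimpleVariance()) compiles without complaint, after which getMean() silently returns a variance.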
>
> The request in MATH-224 is to support aggregation in the following sense:  SummaryStatistics
instance 1 gets a stream of values and instance 2 gets another stream of values and we want
to create a new instance or replace instance 1 with an instance that behaves as though it
got all the data from both streams.  The simplest way to do this would be to add an "aggregate"
method to the StorelessUnivariateStatistic interface and then just implement aggregation in
SummaryStatistics by delegation to the implementation instances.  This is essentially what
the patch attached to MATH-224 does.  The problem with this approach is that supporting aggregation
is a fairly strong requirement in general, stronger than just requiring that the statistic
be computable without storing the data.  Stronger still is the requirement that an implementation
of a statistic be "aggregatable" with a possibly different implementation (since then it would
have access only to the value of the other statistic).
>
> So the challenge is whether we can find a clean way to achieve these four objectives:
>
> 0) Maintain pluggability of statistics implementations
> 1) Support aggregation
> 2) Improve type safety
> 3) Minimize trauma for current users
>
> Dropping 0) makes things much simpler, but I would like to avoid that unless there is
really no way to accomplish 1) and 2) without taking that step.  Strictly speaking, 1) may
be impossible as I know of no way to support this for the higher moments.  I would be OK with
aggregation forcing these to NaN (documented, of course).
>
> My first thought was to define a parameterized Aggregatable interface that requires the
same types.  Then two SummaryStatistics instances are aggregatable iff their implementation
statistics match types.  I am OK with these restrictions, but am having trouble actually making
it work.
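
For a single statistic, the parameterized interface idea could be sketched as follows. This is an assumption-laden illustration, not the design under discussion: Aggregatable here is a hypothetical interface, MergeableMean is a made-up class, and the sketch does not address the harder problem of wiring such types into SummaryStatistics while keeping implementations pluggable.

```java
// Hypothetical interface: T is the only type this statistic can absorb.
interface Aggregatable<T> {
    void aggregate(T other);
}

// A storeless mean that can merge with another instance of the same
// class, because it can read the other instance's count as well as
// its value.
class MergeableMean implements Aggregatable<MergeableMean> {
    private long n;
    private double mean;

    void increment(double d) {
        n++;
        mean += (d - mean) / n;
    }

    double getResult() {
        return n == 0 ? Double.NaN : mean;
    }

    // Weighted update: afterwards this instance behaves as though it
    // had seen both streams.
    public void aggregate(MergeableMean other) {
        if (other.n == 0) {
            return;
        }
        long total = n + other.n;
        mean += (other.mean - mean) * other.n / total;
        n = total;
    }
}
```

The merge needs the other instance's count, not just its result, which is why aggregating with a different implementation that exposes only getResult() is a stronger requirement.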
>
> Suggestions / patches welcome!
>
> Phil
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>
>       
>   




