commons-dev mailing list archives

From Phil Steitz
Subject [math] MATH-224 - need a better idea
Date Sun, 19 Apr 2009 15:34:24 GMT
We should be able to find a clean way to do what this enhancement 
request is asking for.  I am feeling stupid because even when I consider 
breaking compatibility / refactoring to use generics, I can't find a 
simple way to do it.  Here is a description of the current API and some 
failed ideas that I have considered so far.   As usual, I would like to 
minimize pain for current users in addressing this, but at this point I 
am starting to think that wholesale refactoring is necessary and I would 
appreciate ideas on the best way to do this.

SummaryStatistics provides "storeless" computation of summary statistics 
- min, max, mean, variance, etc.  Here "storeless" means that the class 
does not hold the stream of data in memory.  It was designed to support 
pluggable implementations of the statistics that it computes.  It does 
this in a way that looks smelly in the new world of type-safe Java 
(well, maybe it always smelled ;).  The injectable implementation classes 
in SummaryStatistics are typed as "StorelessUnivariateStatistic" which 
is an interface that includes things like getResult() and 
increment(double).  There is nothing preventing, for example, a variance 
implementation from being "plugged in" to implement the mean.
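
To make the type-safety hole concrete, here is a minimal sketch.  The 
names Stat, RunningMean, RunningVariance and LooseSummary are 
hypothetical stand-ins, not the actual [math] classes; the only thing 
taken from the real API is the increment(double) / getResult() shape.

```java
// Simplified stand-in for the interface described above (assumption:
// only the two methods named in this message are modelled).
interface Stat {
    void increment(double d);
    double getResult();
}

// Hypothetical storeless mean.
final class RunningMean implements Stat {
    private long n;
    private double mean;
    public void increment(double d) { n++; mean += (d - mean) / n; }
    public double getResult() { return n == 0 ? Double.NaN : mean; }
}

// Hypothetical storeless sample variance (Welford-style update).
final class RunningVariance implements Stat {
    private long n;
    private double mean;
    private double m2;
    public void increment(double d) {
        n++;
        double delta = d - mean;
        mean += delta / n;
        m2 += delta * (d - mean);
    }
    public double getResult() { return n < 2 ? Double.NaN : m2 / (n - 1); }
}

// Hypothetical holder that mimics setter-based injection.
final class LooseSummary {
    private Stat meanImpl = new RunningMean();
    // Any Stat is accepted here, so the compiler cannot object
    // when a variance is installed in the "mean" slot.
    void setMeanImpl(Stat impl) { meanImpl = impl; }
    void addValue(double d) { meanImpl.increment(d); }
    double getMean() { return meanImpl.getResult(); }
}
```

With this setup, calling setMeanImpl(new RunningVariance()) compiles 
cleanly and getMean() silently reports a variance.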

The request in MATH-224 is to support aggregation in the following 
sense:  SummaryStatistics instance 1 gets a stream of values and 
instance 2 gets another stream of values and we want to create a new 
instance or replace instance 1 with an instance that behaves as though 
it got all the data from both streams.  The simplest way to do this 
would be to add an "aggregate" method to the 
StorelessUnivariateStatistic interface and then just implement 
aggregation in SummaryStatistics by delegation to the implementation 
instances.  This is essentially what the patch attached to MATH-224 
does.  The problem with this approach is that supporting aggregation is 
a fairly strong requirement in general, stronger than just requiring 
that the statistic be computable without storing the data.  Stronger 
still is the requirement that an implementation of a statistic be 
"aggregatable" with a possibly different implementation (since then it 
would have access only to the value of the other statistic).
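
As an illustration of why the result value alone is not enough, here is 
a rough sketch of merging the mean and variance of two streams using 
the standard pairwise update for the count, the mean and the sum of 
squared deviations.  RunningMoments is a hypothetical class, not part 
of [math], and this is not the approach taken by the MATH-224 patch.

```java
// Hypothetical accumulator holding the internal state needed to merge.
final class RunningMoments {
    private long n;       // number of values seen
    private double mean;  // running mean
    private double m2;    // sum of squared deviations from the mean

    void increment(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    // Merge another accumulator into this one.  Note that this needs
    // other's count and m2, not just its getMean()/getVariance() values.
    void aggregate(RunningMoments other) {
        if (other.n == 0) {
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        mean += delta * other.n / total;
        m2 += other.m2 + delta * delta * n * other.n / total;
        n = total;
    }

    long getN() { return n; }
    double getMean() { return n == 0 ? Double.NaN : mean; }
    double getVariance() { return n < 2 ? Double.NaN : m2 / (n - 1); }
}
```

After a.aggregate(b), a should report, up to rounding, the same mean 
and variance as a single instance fed both streams.  The merge reads 
private state of the other instance, which is exactly what a different 
implementation of the same statistic could not be assumed to expose.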

So the challenge is to find a clean way to achieve four objectives:

0) Maintain pluggability of statistics implementations
1) Support aggregation
2) Improve type safety
3) Minimize trauma for current users

Dropping 0) makes things much simpler, but I would like to avoid that 
unless there is really no way to accomplish 1) and 2) without taking 
that step.  Strictly speaking, 1) may be impossible as I know of no way 
to support this for the higher moments.  I would be OK with aggregation 
forcing these to NaN (documented, of course).

My first thought was to define a parameterized Aggregatable interface 
that requires the same types.  Then two SummaryStatistics instances are 
aggregatable iff their implementation statistics match types.  I am OK 
with these restrictions, but am having trouble actually making it work.
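
For discussion, one possible shape for that idea is sketched below.  
All names here (Aggregatable, Statistic, SumStatistic, MinStatistic, 
TypedSummary) are hypothetical, and only two statistics are modelled.  
It is a sketch under those assumptions, not a worked-out proposal.

```java
// Hypothetical: a statistic that can absorb another of the same type T.
interface Aggregatable<T> {
    void aggregate(T other);
}

// Simplified stand-in for the storeless statistic interface.
interface Statistic {
    void increment(double d);
    double getResult();
}

final class SumStatistic implements Statistic, Aggregatable<SumStatistic> {
    private double sum;
    public void increment(double d) { sum += d; }
    public double getResult() { return sum; }
    public void aggregate(SumStatistic other) { sum += other.sum; }
}

final class MinStatistic implements Statistic, Aggregatable<MinStatistic> {
    private double min = Double.NaN; // NaN until a value has been seen
    public void increment(double d) {
        if (Double.isNaN(min) || d < min) { min = d; }
    }
    public double getResult() { return min; }
    public void aggregate(MinStatistic other) {
        if (!Double.isNaN(other.min)) { increment(other.min); }
    }
}

// The implementation types become type parameters, so two summaries
// can be aggregated only when their implementation types match.
final class TypedSummary<S extends Statistic & Aggregatable<S>,
                         M extends Statistic & Aggregatable<M>> {
    private final S sumImpl;
    private final M minImpl;

    TypedSummary(S sumImpl, M minImpl) {
        this.sumImpl = sumImpl;
        this.minImpl = minImpl;
    }

    void addValue(double d) { sumImpl.increment(d); minImpl.increment(d); }
    double getSum() { return sumImpl.getResult(); }
    double getMin() { return minImpl.getResult(); }

    // Mismatched implementation types are rejected at compile time.
    void aggregate(TypedSummary<S, M> other) {
        sumImpl.aggregate(other.sumImpl);
        minImpl.aggregate(other.minImpl);
    }
}
```

In this sketch the matching-types restriction is enforced by the 
compiler, but every pluggable statistic adds another type parameter to 
the summary class, which gets unwieldy quickly and changes the public 
signature that current users see.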

Suggestions / patches welcome!

