commons-dev mailing list archives

From Phil Steitz <>
Subject Re: [math] UnivariateImpl statistical computation strategies
Date Tue, 17 Jun 2003 07:33:01 GMT

--- "Mark R. Diggory" <> wrote:
> I've got a design decision to make that I'd like to get others' opinions
> on. Currently, the strategy in UnivariateImpl is to calculate the
> rudimentary building blocks of the statistics and then calculate the
> statistics in the "getters" (getVariance, getSkewness, getKurtosis
> etc.). In some cases it's done in the getter, in some cases it's done in
> the addValue method itself. Often it's based on the implementor's opinion
> of where to put it, not on any hard logic.
> This presents a debate with the following arguments:
> (1) Bean etiquette suggests "getters" are for bean properties; it's
> usually recommended that they do nothing more than return the value
> for a property.

This is certainly not specified anywhere in the JavaBeans spec.  In fact, the
spec explicitly states (sect 7.1) "So properties need not just be simple data
fields, they can actually be computed values. Updates may have various
programmatic side effects."  If the "etiquette" above were in fact standard,
entity EJBs, for example, would be impossible.  The power of the JavaBeans
specification is that it is an interface specification, not an implementation
specification.  Beans can and should manage their internal state and the
mapping between their internals and their publicly exposed properties in the
most convenient and efficient way possible.
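
To make the point about computed properties concrete, here is a minimal sketch of a bean whose "mean" property has no backing field and is derived from internal state on each call. The class name RunningMeanBean and its fields are hypothetical and are not taken from UnivariateImpl.

```java
// Hypothetical bean for illustration only; not the commons-math code.
class RunningMeanBean {
    // Internal state; neither field is exposed directly as a property.
    private double sum = 0.0;
    private long n = 0;

    public void addValue(double v) {
        sum += v;
        n++;
    }

    public long getN() {
        return n;
    }

    // "mean" is a read-only property computed from internal state on demand.
    public double getMean() {
        return n == 0 ? Double.NaN : sum / n;
    }
}
```

From the caller's side, getMean() is an ordinary property accessor; whether the value is stored or computed is an implementation detail.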

> This is beneficial in our Univariate
> case when calling a getter many times without adding a new value (let's
> say you use "getKurtosis" a lot in a calculation before adding another
> value); then it's more logical to have the kurtosis calculated only once
> and put the code for calculating it in the addValue method.
Huh?  Kurtosis is only defined for the versions that store all values.  If and
when we implement the corrected two-pass formulas, these may benefit from some
running sum computations; but for now, all computations should be performed on
demand, using the vector of stored values.  There is no reason to keep updating
as the values are added for the stored case.
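
For reference, a sketch of what a corrected two-pass variance over the stored values might look like. It assumes the usual textbook form (first pass for the mean, second pass for the squared deviations, minus a correction term built from the sum of the deviations) and the n - 1 denominator. The class and method names are hypothetical, and returning NaN for fewer than two values is a choice made for this sketch.

```java
// Hypothetical helper; a sketch of the corrected two-pass approach.
final class TwoPassVariance {
    private TwoPassVariance() {}

    static double variance(double[] values) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN; // sample variance is undefined here (sketch's choice)
        }
        // Pass 1: the mean.
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;
        // Pass 2: squared deviations, plus the sum of deviations. The latter
        // is zero in exact arithmetic and captures rounding error in the mean.
        double sumSq = 0.0;
        double sumDev = 0.0;
        for (double v : values) {
            double d = v - mean;
            sumSq += d * d;
            sumDev += d;
        }
        return (sumSq - (sumDev * sumDev) / n) / (n - 1);
    }
}
```

Nothing here needs running sums maintained in addValue; both passes run over the stored array when the statistic is requested.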

> (2) However, if calling addValue many times (more likely the case) with
> only the interest of getting the "getMean" back, it's wasted
> computational time to calculate all the other stats (like kurtosis) in
> addValue when you just want the results of "getMean" back after each
> "addValue".

Yes.  The stored versions should use array-based computations, computing
statistics on demand in the getters.
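
A rough sketch of the stored case described here: addValue only stores the value, and each getter walks the stored values when called. The class name StoredUnivariateSketch is hypothetical, and the kurtosis shown is the plain population excess kurtosis (m4 / m2^2 - 3), which is an assumption made for illustration and not necessarily the estimator UnivariateImpl uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stored-values implementation; illustration only.
class StoredUnivariateSketch {
    private final List<Double> values = new ArrayList<>();

    // Nothing is computed here; the value is simply stored.
    public void addValue(double v) {
        values.add(v);
    }

    public int getN() {
        return values.size();
    }

    // Computed on demand from the stored values.
    public double getMean() {
        if (values.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Population excess kurtosis, m4 / m2^2 - 3 (assumed definition).
    public double getKurtosis() {
        int n = values.size();
        if (n == 0) {
            return Double.NaN;
        }
        double mean = getMean();
        double m2 = 0.0;
        double m4 = 0.0;
        for (double v : values) {
            double d = v - mean;
            double d2 = d * d;
            m2 += d2;
            m4 += d2 * d2;
        }
        m2 /= n;
        m4 /= n;
        return m2 == 0.0 ? Double.NaN : m4 / (m2 * m2) - 3.0;
    }
}
```

With this layout, a caller who only wants getMean after each addValue never pays for kurtosis, and a caller who wants kurtosis pays for it only when asking.
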
> I suspect this debate leads to a compromise similar to what I've done in 
> skew and kurt where all the rudimentary building blocks for all the 
> stats are built in addValue, and the detailed calculation specific to 
> that stat is done in the getter.
I see no reason to do anything in addValue (other than add the value) for the
stored case.  Computations should be vector based -- unless the modified
two-pass stuff can have reduced computational overhead by keeping lagged

> thoughts?
> Mark
> p.s. In a more complex approach the user might be able to tune the
> calculations given their specific need. But this would require the
> creation of a delegation framework and boolean switching to control the
> behavior of the implementation: a lot of added complexity that would
> need to be maintained. It could create more work than it's worth.


p.s.  let's try to keep subjects indicative of the content. (note change above)