commons-dev mailing list archives

From Phil Steitz <>
Subject Re: [MATH] Summary proposed changes
Date Mon, 30 Aug 2004 00:57:50 GMT
Kim van der Linde wrote:
> Well, I had a discussion with several colleagues (type science users, we 
> went snorkeling) on several of these issues. The score of the day was 
> the idea that the simple linear LS regression was considered a 
> multivariate statistic.

So it should stay where it is.
> Phil Steitz wrote:
>> Nothing has been "put aside."  We make decisions by consensus.  You 
>> have provided input and we are considering it.  To make sure I have it 
>> all right, you have proposed four changes:

>> 2) Change the name of "BivariateRegression" to "UnivariateRegression" 
>> (or something else)
> Put it in univariate, name it LSRegression. (Or better, 
> SimpleRegression, and build in the option for RMA and MA regressions.)

The placement in .univariate contradicts what you say both above and 
below. Even with just one independent variable, regression is a 
multivariate technique.
>> 3) Change Variance to be configurable to generate the population 
>> statistic.
> Yup, or even better, configurable bias reduction (denominator n = N-a, 
> default a = 1, but settable by constructor and by specific methods, to 
> maintain the option of getting both statistics from the same dataset 
> without doing things twice). The current situation actually introduces 
> fundamental errors.

Huh?  The formula provides unbiased estimates -- "fundamental error" would 
be to use the biased estimator for sample statistics. As I stated in an 
earlier post, the statistics in the univariate package are all designed to 
produce unbiased estimates for (unknown) population parameters based on 
sample data. The "population variance" that you want to add is either a 
biased (therefore inappropriate) estimator for the population variance 
based on a sample, or an exact expression of the population variance of 
the discrete distribution whose mass points are the data (i.e., assuming 
that the data values *are* the population and not a sample from it -- 
which is why it is called "population variance").  In either case it is a 
different statistic and to keep our design consistent, we should not use 
the same univariate to compute different statistics.
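To make the distinction concrete, here is a minimal sketch of the two formulas in question. This is illustrative only, not the Commons Math API; the class and method names are hypothetical:

```java
// Sketch of the two variance formulas under discussion. Class and
// method names are illustrative, not part of Commons Math.
public class VarianceSketch {

    // Unbiased sample variance: divides by (n - 1). This is what the
    // univariate Variance statistic computes.
    public static double sampleVariance(double[] x) {
        double mean = mean(x);
        double ss = 0.0;
        for (double v : x) {
            double d = v - mean;
            ss += d * d;
        }
        return ss / (x.length - 1);
    }

    // "Population" variance: divides by n. Exact only when the data
    // values *are* the entire population; biased as an estimator for
    // the population variance based on a sample.
    public static double populationVariance(double[] x) {
        double mean = mean(x);
        double ss = 0.0;
        for (double v : x) {
            double d = v - mean;
            ss += d * d;
        }
        return ss / x.length;
    }

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }
}
```

For the data {1, 2, 3, 4} the sum of squared deviations is 5.0, so the two statistics differ: 5/3 versus 5/4.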

>  From the JavaDoc for Variance and SD class:
> - double evaluate(double[] values, double mean, int begin, int length)
>     Returns the variance of the entries in the specified portion of the 
> input array, using the precomputed mean value.
> And in Variance only:
> - double evaluate(double[] values, double mean)
>     Returns the variance of the entries in the input array, using the 
> precomputed mean value.
> If you compute the variance based on an already existing mean obtained 
> differently from the sample you establish the variance on, the population 
> variance should be used, as there is no loss of "degrees of freedom" by 
> first establishing the mean of the sample. If the mean is based on the 
> same sample, then it is correct.

These methods, like Variance itself, assume that the mean and variance are 
being computed based on sample data.  This is why it says "precomputed" 
rather than "known population parameter". The methods are provided to save 
computation when the sample mean has already been computed.
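A sketch of the optimization being described, assuming (as the methods do) that the precomputed mean is the sample mean of the same data. Names are illustrative, not the actual Commons Math signatures:

```java
// Illustrative sketch of the "precomputed mean" optimization: the
// sample mean is computed once and reused, rather than recomputed
// inside the variance calculation. Names are hypothetical.
public class PrecomputedMeanSketch {

    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Assumes 'mean' is the sample mean of these same values, so the
    // (n - 1) denominator remains the correct unbiased divisor.
    public static double variance(double[] values, double mean) {
        double ss = 0.0;
        for (double v : values) {
            double d = v - mean;
            ss += d * d;
        }
        return ss / (values.length - 1);
    }
}
```

A caller that already needs the mean (say, for reporting) can pass it in and avoid a second pass over the data.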
>> 4) Combine the univariate and multivariate packages, since it is 
>> confusing to separate statistics that focus on one variable and 
>> sometimes the word "univariate" is used in the context of multivariate 
>> techniques (e.g. "Univariate Anova").
> No, keep them separate, but just locate things where they belong and not 
> reinvent that simple LS regressions should be within the multivariate 
> package.

Contradicts above -- assuming you mean that regression belongs in 
.multivariate, which it does.
> I have question for you. Where would you locate a Covariance class....?

I am not sure that we would define a covariance class; but if we did, it 
would certainly belong in .multivariate, since covariance is a property of 
the joint distribution of two variables rather than just one.  The basic 
idea is very simple: univariate is for statistics that characterize the 
distribution of just one random variable, multivariate is for analyses 
that involve joint distributions of multiple random variables.
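A hypothetical covariance sketch makes the point: the computation inherently takes *paired* observations of two variables, which is why it belongs with the joint-distribution analyses. This is not an existing Commons Math class:

```java
// Hypothetical sketch of sample covariance. The paired (x[i], y[i])
// inputs show why covariance is a property of the joint distribution
// of two variables, not of either one alone.
public class CovarianceSketch {

    // Unbiased sample covariance of paired observations.
    public static double covariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException(
                "need paired samples of length >= 2");
        }
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < x.length; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= x.length;
        my /= y.length;
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            s += (x[i] - mx) * (y[i] - my);
        }
        return s / (x.length - 1);
    }
}
```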

> Kim
