commons-dev mailing list archives

From Mauro Talevi <mauro.tal...@aquilonia.org>
Subject Re: [math] Improving numerics in OLSMultipleLinearRegression
Date Mon, 09 Jun 2008 07:58:21 GMT
Hi Phil,

thanks for reviewing the multiple linear regression implementations and 
setting up the R/NIST data tests.   I finally got around to installing R 
and can now run them too.

Phil Steitz wrote:
> While clear and elegant from a matrix algebra standpoint, the "naive" 
> implementation in OLSMultipleLinearRegression has bad numerical 
> qualities.  It is well known that solving the normal equations directly 
> does not give good numerics.  I just added some tests to actually verify 
> parameter values, using the classic "Longley" dataset, for which NIST 
> provides certified statistics.  This is a "hard" design matrix.  R was 
> able to get to within 1E-8 of the certified parameter values.  
> OLSMultipleLinearRegression can only get to within 1E-1.

The OLS implementation was added as a simple by-product of the GLS 
case - which is the main one I have needed for hypothesis testing - as 
it came "for free" with an identity covariance matrix.
True - the emphasis was on clarity and formulaic simplicity.  And also 
on following the old Donald Knuth maxim "premature optimization is the 
root of all evil".  But it seems there is a need to refine the 
implementation - the devil raised his head :-)
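
For reference, the relation between the two cases can be written out as follows. This is the standard textbook form rather than a transcription of the code; the symbols X (design matrix), y (observations) and \Omega (error covariance) are notation assumed here.

```latex
\hat{\beta}_{\mathrm{GLS}} = (X^{T} \Omega^{-1} X)^{-1} X^{T} \Omega^{-1} y
\qquad\xrightarrow{\;\Omega = I\;}\qquad
\hat{\beta}_{\mathrm{OLS}} = (X^{T} X)^{-1} X^{T} y

% Forming X^T X squares the 2-norm condition number of the problem:
\kappa_{2}(X^{T} X) = \kappa_{2}(X)^{2}
```

The second line is the usual explanation for the poor accuracy of the normal equations on an ill-conditioned design matrix.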

> We have talked in the past about providing an implementation based on QR 
> decomposition.   Anyone up for  using the QR decomposition that we now 
> have to do this?  I really think we need to do it (or something else to 
> improve numerics) before releasing this class.  I will get to it 
> eventually, but am a little pegged at the moment.  I will review and 
> apply patches if someone is willing to do the implementation.  I can 
> also explain here or offline how the R tests and NIST datasets work, as 
> these are useful in validating code.

I'd be happy to improve the impl.  I'm getting my head around R and 
NIST, but perhaps a chat offline would not hurt!
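
As a starting point for discussion, here is a minimal sketch of a QR-based least-squares solve. It assumes a dense design matrix with full column rank and at least as many rows as columns. The class and method names are hypothetical, and it deliberately does not use the QR decomposition class already in [math]; it only illustrates the Householder approach of solving R b = Q^T y instead of the normal equations.

```java
/** Hypothetical illustration: least squares via Householder QR (not the [math] API). */
public final class QrLeastSquares {

    private QrLeastSquares() { }

    /**
     * Returns b minimising ||X b - y||_2.
     * Assumes X is n x p with n >= p and full column rank.
     */
    public static double[] solve(double[][] x, double[] y) {
        int n = x.length;
        int p = x[0].length;
        // Work on copies so the caller's data is left untouched.
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = x[i].clone();
        }
        double[] qty = y.clone();
        double[] rDiag = new double[p];

        for (int k = 0; k < p; k++) {
            // Euclidean norm of column k from the diagonal down.
            double norm = 0.0;
            for (int i = k; i < n; i++) {
                norm = Math.hypot(norm, a[i][k]);
            }
            if (norm == 0.0) {
                throw new IllegalArgumentException("design matrix is rank deficient");
            }
            // Choose the sign that avoids cancellation when forming v.
            double alpha = a[k][k] > 0 ? -norm : norm;
            a[k][k] -= alpha; // column k (rows k..n-1) now holds the Householder vector v
            double vtv = 0.0;
            for (int i = k; i < n; i++) {
                vtv += a[i][k] * a[i][k];
            }
            // Apply H = I - 2 v v^T / (v^T v) to the remaining columns.
            for (int j = k + 1; j < p; j++) {
                double dot = 0.0;
                for (int i = k; i < n; i++) {
                    dot += a[i][k] * a[i][j];
                }
                double scale = 2.0 * dot / vtv;
                for (int i = k; i < n; i++) {
                    a[i][j] -= scale * a[i][k];
                }
            }
            // Apply the same reflection to the right-hand side.
            double dotY = 0.0;
            for (int i = k; i < n; i++) {
                dotY += a[i][k] * qty[i];
            }
            double scaleY = 2.0 * dotY / vtv;
            for (int i = k; i < n; i++) {
                qty[i] -= scaleY * a[i][k];
            }
            rDiag[k] = alpha; // diagonal entry of R
        }

        // Back substitution on the upper-triangular system R b = (Q^T y)[0..p-1].
        double[] beta = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double sum = qty[i];
            for (int j = i + 1; j < p; j++) {
                sum -= a[i][j] * beta[j];
            }
            beta[i] = sum / rDiag[i];
        }
        return beta;
    }
}
```

With a design matrix whose first column is all ones and whose second column is 0, 1, 2, 3, and observations 1, 3, 5, 7, this should recover an intercept of 1 and a slope of 2 up to rounding. Whether it reaches R's accuracy on Longley would need to be checked against the NIST tests.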

> Another thing that we should think about before releasing any of this 
> stuff is the completeness of the API.  Many standard regression 
> statistics are missing.  If we are going to stick with the Interface / 
> Implementation setup, we need to get the right stuff into the 
> interface.  It is also awkward to have to insert "1"s in the design 
> matrix to get an intercept term computed.  This is convenient for 
> implementation, but awkward for users.  A more natural setup (IMHO) 
> would be to expose a "noIntercept" or "hasIntercept" property for the 
> model.

No problem with adding other statistics - let's just decide what the 
standard regression API is.

And finally, how do you see the no/hasIntercept model working?
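
One possible reading, sketched below purely to make the question concrete: the user supplies only the regressors, and the model prepends the column of ones itself when the property is set. The class name, method name and the boolean parameter are all hypothetical; nothing like this exists in the current interface.

```java
/** Hypothetical helper: builds the design matrix from user-supplied regressors. */
public final class DesignMatrix {

    private DesignMatrix() { }

    /**
     * Copies the regressors into a new matrix, prepending a column of ones
     * when hasIntercept is true. The input array is not modified.
     */
    public static double[][] build(double[][] regressors, boolean hasIntercept) {
        int offset = hasIntercept ? 1 : 0;
        double[][] x = new double[regressors.length][];
        for (int i = 0; i < regressors.length; i++) {
            double[] row = regressors[i];
            x[i] = new double[row.length + offset];
            if (hasIntercept) {
                x[i][0] = 1.0; // intercept column
            }
            System.arraycopy(row, 0, x[i], offset, row.length);
        }
        return x;
    }
}
```

The implementation would keep working on the augmented matrix internally, so only the user-facing side changes.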

Cheers



