While clear and elegant from a matrix algebra standpoint, the "naive"
implementation in OLSMultipleLinearRegression has poor numerical
qualities. It is well known that solving the normal equations directly
does not give good numerics. I just added some tests to actually verify
parameter values, using the classic "Longley" dataset, for which NIST
provides certified statistics. This is a "hard" design matrix. R was
able to get to within 1E-8 of the certified parameter values.
OLSMultipleLinearRegression can only get to within 1E-1.
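
For anyone who wants the reason spelled out, here is a short sketch of the two routes. The symbols (X for the design matrix, y for the response, \hat\beta for the estimated parameters, \kappa_2 for the 2-norm condition number) are my own notation, and X is assumed to have full column rank.

```latex
% Normal equations: form X^T X explicitly, then solve.
% The condition number of the system is the square of that of X.
X^{\mathsf T} X \,\hat\beta = X^{\mathsf T} y,
\qquad
\kappa_2(X^{\mathsf T} X) = \kappa_2(X)^2

% QR route: factor X = QR (Q with orthonormal columns, R upper triangular),
% then solve a triangular system without ever forming X^T X.
X = QR
\;\Longrightarrow\;
R\,\hat\beta = Q^{\mathsf T} y,
\qquad
\kappa_2(R) = \kappa_2(X)
```

Squaring the condition number is what hurts on a nearly collinear design matrix such as Longley.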

We have talked in the past about providing an implementation based on QR
decomposition. Is anyone up for doing this with the QR decomposition that
we now have available? I really think we need to do it (or something else to
improve numerics) before releasing this class. I will get to it
eventually, but am a little pegged at the moment. I will review and
apply patches if someone is willing to do the implementation. I can
also explain here or offline how the R tests and NIST datasets work, as
these are useful in validating code.
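
To make the QR idea concrete, here is a minimal, self-contained sketch of a least-squares solve using Householder reflections. It is only an illustration: the class and method names (QrLeastSquaresSketch, solve) are hypothetical, it does not use our QR decomposition class, and it assumes a full-rank design matrix with at least as many rows as columns.

```java
import java.util.Arrays;

/** Illustrative Householder-QR least squares (hypothetical class, not an existing API). */
final class QrLeastSquaresSketch {

    /** Returns the b minimizing ||x b - y||_2; assumes x is n-by-p, n >= p, full column rank. */
    static double[] solve(double[][] x, double[] y) {
        int n = x.length;
        int p = x[0].length;
        if (y.length != n || n < p) {
            throw new IllegalArgumentException("need y.length == n and n >= p");
        }
        // Work on copies so the caller's arrays are left untouched.
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = x[i].clone();
        }
        double[] qty = y.clone();       // becomes Q^T y
        double[] rDiag = new double[p]; // diagonal of R

        for (int k = 0; k < p; k++) {
            // 2-norm of column k from the diagonal down.
            double norm = 0.0;
            for (int i = k; i < n; i++) {
                norm = Math.hypot(norm, a[i][k]);
            }
            if (norm == 0.0) {
                // Crude check: this only catches an exactly zero column.
                throw new IllegalArgumentException("design matrix is rank deficient");
            }
            // Sign opposite to the pivot, so forming v[k] involves no cancellation.
            double alpha = a[k][k] > 0 ? -norm : norm;
            a[k][k] -= alpha; // rows k..n-1 of column k now hold the Householder vector v
            double vtv = 0.0;
            for (int i = k; i < n; i++) {
                vtv += a[i][k] * a[i][k];
            }
            // Apply H = I - 2 v v^T / (v^T v) to the remaining columns.
            for (int j = k + 1; j < p; j++) {
                double dot = 0.0;
                for (int i = k; i < n; i++) {
                    dot += a[i][k] * a[i][j];
                }
                double f = 2.0 * dot / vtv;
                for (int i = k; i < n; i++) {
                    a[i][j] -= f * a[i][k];
                }
            }
            // Apply the same reflection to the right-hand side.
            double dot = 0.0;
            for (int i = k; i < n; i++) {
                dot += a[i][k] * qty[i];
            }
            double f = 2.0 * dot / vtv;
            for (int i = k; i < n; i++) {
                qty[i] -= f * a[i][k];
            }
            rDiag[k] = alpha; // H maps the original column to alpha * e_k
        }

        // Back substitution on R b = (Q^T y)[0..p-1]; R's strict upper part is stored in a.
        double[] beta = new double[p];
        for (int k = p - 1; k >= 0; k--) {
            double s = qty[k];
            for (int j = k + 1; j < p; j++) {
                s -= a[k][j] * beta[j];
            }
            beta[k] = s / rDiag[k];
        }
        return beta;
    }

    public static void main(String[] args) {
        // Toy data lying exactly on y = 1 + 2t, with an explicit column of ones.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        System.out.println(Arrays.toString(solve(x, y)));
    }
}
```

On the toy data in main the two coefficients should come out close to 1 and 2. A real implementation would want a proper rank test and would be validated against the NIST certified values rather than a toy line.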

Another thing that we should think about before releasing any of this
stuff is the completeness of the API. Many standard regression
statistics are missing. If we are going to stick with the Interface /
Implementation setup, we need to get the right stuff into the
interface. It is also awkward to have to insert a column of 1s into the
design matrix to get an intercept term computed. This is convenient for
implementation, but awkward for users. A more natural setup (IMHO)
would be to expose a "noIntercept" or "hasIntercept" property for the model.
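
As a rough sketch of what I mean, the model could take the raw regressors and add the column of ones itself when asked to. The names below (DesignMatrixSketch, build, the hasIntercept parameter) are hypothetical and only illustrate the idea; they are not part of the current interface.

```java
/** Illustrative helper that adds the intercept column on the user's behalf (hypothetical class). */
final class DesignMatrixSketch {

    /**
     * Builds the design matrix from raw regressor rows.
     * When hasIntercept is true, a leading column of ones is prepended.
     */
    static double[][] build(double[][] regressors, boolean hasIntercept) {
        int offset = hasIntercept ? 1 : 0;
        double[][] design = new double[regressors.length][];
        for (int i = 0; i < regressors.length; i++) {
            double[] row = new double[regressors[i].length + offset];
            if (hasIntercept) {
                row[0] = 1.0; // intercept term
            }
            // Copy the user's values after the optional intercept column.
            System.arraycopy(regressors[i], 0, row, offset, regressors[i].length);
            design[i] = row;
        }
        return design;
    }

    public static void main(String[] args) {
        double[][] raw = {{2, 3}, {4, 5}};
        double[][] design = build(raw, true);
        System.out.println(java.util.Arrays.deepToString(design));
    }
}
```

With something like this inside the implementation, users pass only their data and flip a flag, and the implementation keeps the convenient "column of ones" form internally.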

