From dev-return-110180-apmail-commons-dev-archive=commons.apache.org@commons.apache.org Tue Jun 10 01:35:35 2008
Return-Path:
Delivered-To: apmail-commons-dev-archive@www.apache.org
Received: (qmail 36509 invoked from network); 10 Jun 2008 01:35:34 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2)
by minotaur.apache.org with SMTP; 10 Jun 2008 01:35:34 -0000
Received: (qmail 58786 invoked by uid 500); 10 Jun 2008 01:35:36 -0000
Delivered-To: apmail-commons-dev-archive@commons.apache.org
Received: (qmail 58682 invoked by uid 500); 10 Jun 2008 01:35:36 -0000
Mailing-List: contact dev-help@commons.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: "Commons Developers List"
Delivered-To: mailing list dev@commons.apache.org
Received: (qmail 58671 invoked by uid 99); 10 Jun 2008 01:35:36 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 09 Jun 2008 18:35:36 -0700
X-ASF-Spam-Status: No, hits=1.2 required=10.0
tests=SPF_NEUTRAL
X-Spam-Check-By: apache.org
Received-SPF: neutral (athena.apache.org: local policy)
Received: from [64.233.184.229] (HELO wr-out-0506.google.com) (64.233.184.229)
by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 10 Jun 2008 01:34:46 +0000
Received: by wr-out-0506.google.com with SMTP id 36so1290027wra.24
for ; Mon, 09 Jun 2008 18:35:02 -0700 (PDT)
Received: by 10.90.98.12 with SMTP id v12mr4461340agb.29.1213061702694;
Mon, 09 Jun 2008 18:35:02 -0700 (PDT)
Received: from ?192.168.1.184? ( [70.91.8.169])
by mx.google.com with ESMTPS id 34sm11003797yxl.0.2008.06.09.18.34.56
(version=SSLv3 cipher=RC4-MD5);
Mon, 09 Jun 2008 18:35:01 -0700 (PDT)
Message-ID: <484DDA37.60004@gmail.com>
Date: Mon, 09 Jun 2008 21:34:47 -0400
User-Agent: Thunderbird 2.0.0.14 (X11/20080421)
MIME-Version: 1.0
To: Commons Developers List
Subject: Re: [math] Improving numerics in OLSMultipleLinearRegression
References: <484C929C.40004@gmail.com>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
From: Phil Steitz
X-Virus-Checked: Checked by ClamAV on apache.org
Mauro Talevi wrote:
> Hi Phil,
>
> thanks for reviewing the multiple linear regression implementations
> and setting up the R/NIST data tests. I finally got around to
> installing R and can now run them too.
>
> Phil Steitz wrote:
>> While clear and elegant from a matrix algebra standpoint, the
>> "nailve" implementation in OLSMultipleLinearRegression has bad
>> numerical qualities. It is well known that solving the normal
>> equations directly does not give good numerics. I just added some
>> tests to actually verify parameter values, using the classic "Longly"
>> dataset, for which NIST provides certified statistics. This is a
>> "hard" design matrix. R was able to get to within 1E-8 of the
>> certified parameter values. OLSMultipleLinearRegression can only get
>> 1E-1.
>
> The OLS implementation has been added as a simple by-product of the
> GLS case - which is the main one I have needed for hypothesis testing
> - as it came "for free" with unitary covariance.
> True - the emphasis was on clarity and formulaic simplicity. And also
> following the old Donald Knuth maxim "optimization is the root of all
> evil". But it seems like there is a need for refinement of the
> implementation - the devil raised his head :-)
>
Yes, and I would distinguish performance optimization from numerical
accuracy. From my perspective, we can release a ".0" with room for
performance improvement, but at least decent numerics are required.
>> We have talked in the past about providing an implementation based on
>> QR decomposition. Anyone up for using the QR decomposition that we
>> now have to do this? I really think we need to do it (or something
>> else to improve numerics) before releasing this class. I will get to
>> it eventually, but am a little pegged at the moment. I will review
>> and apply patches if someone is willing to do the implementation. I
>> can also explain here or offline how the R tests and NIST datasets
>> work, as these are useful in validating code.
>
> I'd be happy to improve the impl. I'm getting my head around R and
> NIST, but perhaps a chat offline would not hurt!
I may be hard to catch synchronously, as my day-and-night job is a
little demanding, but I would be happy to answer questions (with maybe a
little latency ;)
>
>> Another thing that we should think about before releasing any of this
>> stuff is the completeness of the API. Many standard regression
>> statistics are missing. If we are going to stick with the Interface
>> / Implementation setup, we need to get the right stuff into the
>> interface. It is also awkward to have to insert "1"'s in the design
>> matrix to get an intercept term computed. This is convenient for
>> implementation, but awkward for users. A more natural setup (IMHO)
>> would be to expose a "noIntercept" or "hasIntercept" property for the
>> model.
>
> No problem with adding other statistics - let's just decide on what is
> the stardard regression API.
Here are some initial ideas on what should be included in the multiple
regression API. Other suggestions welcome!
1. Coefficients should be accompanied by standard errors, t-statistics,
two-sided t probablilities (can get these using t distribution from
distributions package) and ideally confidence intervals.
2. F, R-square, adjusted R-square, F prob (again can use distributions
package to estimate)
3. ANOVA table (Regression sum of squares, residual sum of squares)
4. Residuals
R, SAS, SPSS and Excel all represent (or in the case of R, can
construct) these basic statistics in some way in their output. We
should model them in classes representing properties of the computed
model.
>
> And finally, how do you see the no/hasIntercept model working?
As a configurable property - noIntercept means the model is estimated
without an intercept. The point I was making was more how the data is
supplied via the API. It is awkward to have to fill in a column of 1's
to get the linear algebra to work to estimate a model with intercept
(which should be the default).
I would recommend that we have setData or "newData" provide a n x m
matrix, where n is the number of observations and m-1 is the number of
independent variables. Then either a) have the constructor take another
argument specifying which column holds the dependent variable b) assume
it is the first column c) support column labels and some form of model
specification such as what R provides (a lot of work) d) split off the y
vector, so setting data requires separate x and y vectors. Probably a)
is easiest for users, who will most often be starting with a rectangular
array of data with the dependent variable in one of the columns.
Phil
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org