commons-issues mailing list archives

From "greg sterijevski (JIRA)" <>
Subject [jira] [Commented] (MATH-607) Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
Date Wed, 06 Jul 2011 19:28:16 GMT


greg sterijevski commented on MATH-607:

Sorry for duplicating part of my response, but gmail has truncated it (maybe google is telling
me something about my ideas... ;0 )

My complete response is:

I agree on eliminating getRedundant() and isRedundant(int idx). If the underlying solver is
QR or Gaussian this info would exist. If the underlying method is SVD, then we would register
the rank reduction, but we would not be able to attribute it to a particular column in the
design matrix.

I am probably in agreement with making RegressionResults concrete, but there were a couple
of considerations which pushed me toward an interface.

Say that I begin with the following augmented matrix:
 | X'X    X'Y |
 | Y'X    Y'Y |
  where X is the design matrix ( nobs x nreg ), Y is the dependent variable (nobs x 1 )

On a copy of the cross products matrix (the thing above), I get the following via Gaussian
elimination:

 | inv(X'X)    -beta |
 | -beta'       e'e  |

inv(X'X) is the inverse of the X'X matrix, beta is the OLS vector of slopes (stored negated,
as -beta), and e'e is the sum of squared errors.
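
As a rough illustration of the elimination step described above, here is a minimal Java sketch
of a Gauss-Jordan pivot ("sweep") applied to each regressor's diagonal element of the augmented
matrix. The class and method names are hypothetical and are not part of Commons Math. Sign
conventions differ between sweep formulations: in this one the last column ends up holding
+beta and the last row -beta', so one off-diagonal block differs in sign from the layout above.
No rank or pivot checks are done.

```java
/** Sketch only: hypothetical names, not Commons Math API. */
final class SweepSketch {

    /** Gauss-Jordan pivot ("sweep") on diagonal element k, done in place. */
    static void sweep(double[][] m, int k) {
        int n = m.length;
        double d = m[k][k];               // assumed non-zero: no rank checks here
        for (int j = 0; j < n; j++) {
            m[k][j] /= d;                 // scale the pivot row
        }
        for (int i = 0; i < n; i++) {
            if (i == k) {
                continue;
            }
            double b = m[i][k];
            for (int j = 0; j < n; j++) {
                m[i][j] -= b * m[k][j];   // eliminate column k from row i
            }
            m[i][k] = -b / d;             // store the multiplier in the freed slot
        }
        m[k][k] = 1.0 / d;
    }

    /**
     * Sweeps every regressor pivot of a copy of [X'X X'Y; Y'X Y'Y].
     * The input matrix is left untouched.
     */
    static double[][] sweepRegressors(double[][] crossProducts) {
        int n = crossProducts.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) {
            m[i] = crossProducts[i].clone();
        }
        for (int k = 0; k < n - 1; k++) { // the last row/column belongs to Y
            sweep(m, k);
        }
        return m;
    }

    public static void main(String[] args) {
        // Cross products for x = (1,0), (1,1), (1,2) and y = 1, 3, 4.
        double[][] cp = {{3, 3, 8}, {3, 5, 11}, {8, 11, 26}};
        double[][] s = sweepRegressors(cp);
        System.out.println("beta0 = " + s[0][2] + ", beta1 = " + s[1][2]);
        System.out.println("e'e   = " + s[2][2]);
    }
}
```

The upper-left block of the returned matrix is inv(X'X) and the bottom-right element is e'e, so
everything mentioned above sits in a single (nreg+1) x (nreg+1) array.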

Getting most of the info (that RegressionResults surfaces) is simply a matter of indexing.
All I need to do in this case is write a wrapper around a symmetric matrix which implements
the interface.
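
To make the "matter of indexing" point concrete, here is a small Java sketch of such a wrapper.
The interface below is a hypothetical, cut-down stand-in for the proposed RegressionResults,
and its method names are assumptions rather than the attached interface. The wrapper reads
beta from the last row, which holds -beta in the layout above, and it assumes the usual OLS
standard error sqrt(e'e / (nobs - nreg) * inv(X'X)[i][i]).

```java
/** Hypothetical, cut-down stand-in for the proposed RegressionResults interface. */
interface ResultsSketch {
    double getParameterEstimate(int i);
    double getStdErrorOfEstimate(int i);
    double getErrorSumSquares();
    long getNumberOfObservations();
}

/** Read-only view over an already swept (nreg+1) x (nreg+1) matrix. */
final class SweptMatrixResults implements ResultsSketch {
    private final double[][] swept;  // shared with the caller, not copied
    private final int nreg;
    private final long nobs;

    SweptMatrixResults(double[][] swept, long nobs) {
        this.swept = swept;
        this.nreg = swept.length - 1;
        this.nobs = nobs;
    }

    @Override
    public double getParameterEstimate(int i) {
        return -swept[nreg][i];      // the last row stores -beta
    }

    @Override
    public double getErrorSumSquares() {
        return swept[nreg][nreg];    // e'e sits in the bottom-right corner
    }

    @Override
    public double getStdErrorOfEstimate(int i) {
        double sigma2 = getErrorSumSquares() / (nobs - nreg);
        return Math.sqrt(sigma2 * swept[i][i]);
    }

    @Override
    public long getNumberOfObservations() {
        return nobs;
    }

    public static void main(String[] args) {
        double[][] swept = {
            { 5.0 / 6.0, -0.5,  7.0 / 6.0},
            {-0.5,        0.5,  1.5},
            {-7.0 / 6.0, -1.5,  1.0 / 6.0}
        };
        ResultsSketch r = new SweptMatrixResults(swept, 3);
        System.out.println("slope = " + r.getParameterEstimate(1));
    }
}
```

Nothing is copied here, which is the attraction, but it also means the results are only as
immutable as the shared array: whoever owns the matrix must not update it afterwards.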

I suppose that there could be a constructor which took the matrix above and did the indexing,
but that seems too dirty. Furthermore, there are probably other optimized formats for OLS
which have similar aspects. I wanted to keep the door open to other schemes, without making
(potentially large) copies of variance matrices, standard errors and so forth a necessity.

On the name of the getter for number of observations, I am okay with whatever you feel is
a better name.

    Regarding the model interface, I would again suggest that we just define this as a class,
UpdatingOLSRegression.  I suppose that if we end up implementing a weighted or other non-OLS
version, we might want to factor out a common interface like what exists for MultipleLinearRegression,
but in retrospect, I am not sure that interface was worth much.  Note that all that we could
factor out is essentially what is in MultivariateRegression, which is analogous to your RegressionResults.

So you are saying that UpdatingOLSRegression should be an abstract class? There are not that
many methods in the interface. That would be okay if we were sure that subclasses always
overrode either the regress(...) methods or the addObservations(...) methods. I worry that you
might end up with a base class full of nothing but abstract functions.
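
For a sense of how small the contract is, here is a hedged Java sketch of a two-method
updating interface with one implementation that accumulates the normal equations one
observation at a time. The method names echo addObservations(...) and regress(...) from this
thread, but the signatures, the class names, and the choice of plain Gaussian elimination with
partial pivoting are all assumptions made for illustration. Rank deficiency is not handled.

```java
/** Hypothetical two-method contract; signatures are assumed, not the attached ones. */
interface UpdatingRegressionSketch {
    void addObservation(double[] x, double y);
    double[] regress();
}

/** Accumulates X'X and X'Y incrementally, so the data never has to sit in core. */
final class NormalEquationsRegression implements UpdatingRegressionSketch {
    private final int nreg;
    private final double[][] xtx;
    private final double[] xty;

    NormalEquationsRegression(int nreg) {
        this.nreg = nreg;
        this.xtx = new double[nreg][nreg];
        this.xty = new double[nreg];
    }

    @Override
    public void addObservation(double[] x, double y) {
        for (int i = 0; i < nreg; i++) {
            xty[i] += x[i] * y;
            for (int j = 0; j < nreg; j++) {
                xtx[i][j] += x[i] * x[j];
            }
        }
    }

    @Override
    public double[] regress() {
        // Work on copies so that further observations can still be added.
        double[][] a = new double[nreg][];
        for (int i = 0; i < nreg; i++) {
            a[i] = xtx[i].clone();
        }
        double[] b = xty.clone();
        for (int col = 0; col < nreg; col++) {
            int p = col;                  // partial pivoting: largest entry in column
            for (int r = col + 1; r < nreg; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[p][col])) {
                    p = r;
                }
            }
            double[] rowTmp = a[col]; a[col] = a[p]; a[p] = rowTmp;
            double rhsTmp = b[col]; b[col] = b[p]; b[p] = rhsTmp;
            for (int r = col + 1; r < nreg; r++) {
                double f = a[r][col] / a[col][col];
                b[r] -= f * b[col];
                for (int c = col; c < nreg; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] beta = new double[nreg]; // back substitution
        for (int r = nreg - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < nreg; c++) {
                s -= a[r][c] * beta[c];
            }
            beta[r] = s / a[r][r];
        }
        return beta;
    }

    public static void main(String[] args) {
        UpdatingRegressionSketch reg = new NormalEquationsRegression(2);
        reg.addObservation(new double[] {1, 0}, 1);
        reg.addObservation(new double[] {1, 1}, 3);
        reg.addObservation(new double[] {1, 2}, 4);
        double[] beta = reg.regress();
        System.out.println("intercept = " + beta[0] + ", slope = " + beta[1]);
    }
}
```

With a contract this small, an abstract base class would have little to share beyond the
method declarations themselves. Note also that normal equations are generally less
numerically stable than a QR-based update, so this sketch only illustrates the shape of the API.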

> Current Multiple Regression Object does calculations with all data incore. There are
non incore techniques which would be useful with large datasets.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: MATH-607
>                 URL:
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 3.0
>         Environment: Java
>            Reporter: greg sterijevski
>              Labels: Gentleman's, QR, Regression, Updating, decomposition, lemma
>             Fix For: 3.0
>         Attachments: updating_reg_ifaces
>   Original Estimate: 840h
>  Remaining Estimate: 840h
> The current multiple regression class does a QR decomposition on the complete data set.
This necessitates the loading incore of the complete dataset. For large datasets, or large
datasets and a requirement to do datamining or stepwise regression this is not practical.
There are techniques which form the normal equations on the fly, as well as ones which form
the QR decomposition on an update basis. I am proposing, first, the specification of an "UpdatingLinearRegression"
interface which defines basic functionality all such techniques must fulfill. 
> Related to this 'updating' regression, the results of running a regression on some subset
of the data should be encapsulated in an immutable object. This is to ensure that subsequent
additions of observations do not corrupt or render inconsistent parameter estimates. I am
calling this interface "RegressionResults".  
> Once the community has reached a consensus on the interface, work on the concrete implementation
of these techniques will take place.
> Thanks,
> -Greg

This message is automatically generated by JIRA.