commons-issues mailing list archives

From "greg sterijevski (JIRA)" <>
Subject [jira] [Commented] (MATH-607) Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
Date Thu, 21 Jul 2011 22:33:59 GMT


greg sterijevski commented on MATH-607:


1. Fix exceptions. I am not 100% sure what I needed to do in order to
correctly exclude this from the bug report. I did not want to commit
something half baked. I would appreciate your help here.

3. Yes, the rank getter is missing. I can put that in.

4. There are a couple of reasons I thought we should keep all that info in
an array.
     a.) Neater, all of the information on the fit is in one member
variable, as opposed to 5, 10 or 15 member variables. We really should have
a GlobalInfoEnum maybe? Then we could eliminate all the getters with:
     public double getGlobalFitInfo( GlobalInfoEnum gie );

     b.) Serialization is a bit easier should a hand coded serialization
routine need to be written.

     c.) Model Selection. If we use the regression results object in model
selection algorithms, then the criteria used to evaluate goodness of fit
could be accessible by an index (or enum) into that array. For example, I
might write a little app that runs a million regressions and chooses the top
1% by R-squared. (I know that this example is completely ad hoc.) You
might then decide that mean squared error is really the criterion you want
to use. Instead of recoding the objective function to call
getMeanSquaredError() instead of getRSquared(), you simply provide the index
or the enum.

     d.) Growth. While we have a few parameters of global fit now, I am sure
that number will grow. We might need to add a likelihood function value, an F
test of global applicability, and so on. In a simple beans interface setup we
would add many methods. I can't help but feel that this is messy and tedious
for the user.
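
To make points a.) and c.) concrete, here is a rough sketch of what I have in
mind. Everything in it is hypothetical: the class names, the two enum
constants, and the selectTop helper are made up for illustration and are not
part of any attached patch. The only piece taken from the discussion above is
the getGlobalFitInfo( GlobalInfoEnum gie ) getter.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; none of these names exist in Commons Math. */
final class GlobalFitSketch {

    /** Hypothetical enum indexing the global fit statistics. */
    enum GlobalInfoEnum { R_SQUARED, MEAN_SQUARED_ERROR }

    /** Hypothetical immutable holder that keeps all fit statistics in one array. */
    static final class FitResults {
        private final double[] globalFitInfo;

        FitResults(double rSquared, double meanSquaredError) {
            globalFitInfo = new double[GlobalInfoEnum.values().length];
            globalFitInfo[GlobalInfoEnum.R_SQUARED.ordinal()] = rSquared;
            globalFitInfo[GlobalInfoEnum.MEAN_SQUARED_ERROR.ordinal()] = meanSquaredError;
        }

        /** Single getter replacing getRSquared(), getMeanSquaredError(), and so on. */
        public double getGlobalFitInfo(GlobalInfoEnum gie) {
            return globalFitInfo[gie.ordinal()];
        }
    }

    /**
     * Returns the best n fits under the given criterion. Switching the
     * criterion means passing a different enum value, not recoding the
     * objective function.
     */
    static List<FitResults> selectTop(List<FitResults> fits, GlobalInfoEnum criterion,
                                      boolean higherIsBetter, int n) {
        Comparator<FitResults> cmp =
            Comparator.comparingDouble(f -> f.getGlobalFitInfo(criterion));
        if (higherIsBetter) {
            cmp = cmp.reversed();
        }
        return fits.stream().sorted(cmp).limit(n).collect(Collectors.toList());
    }
}
```

With something like this, moving from R-squared to mean squared error is a
change of arguments: selectTop(fits, GlobalInfoEnum.MEAN_SQUARED_ERROR, false, n)
instead of selectTop(fits, GlobalInfoEnum.R_SQUARED, true, n).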


> Current Multiple Regression Object does calculations with all data incore. There are
> non incore techniques which would be useful with large datasets.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: MATH-607
>                 URL:
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 3.0
>         Environment: Java
>            Reporter: greg sterijevski
>              Labels: Gentleman's, QR, Regression, Updating, decomposition, lemma
>             Fix For: 3.0
>         Attachments: RegressResults2, millerreg, millerreg_take2, millerregtest, regres_change1,
>                      updating_reg_cut2, updating_reg_ifaces
>   Original Estimate: 840h
>  Remaining Estimate: 840h
> The current multiple regression class does a QR decomposition on the complete data set.
> This necessitates the loading incore of the complete dataset. For large datasets, or large
> datasets and a requirement to do datamining or stepwise regression this is not practical.
> There are techniques which form the normal equations on the fly, as well as ones which form
> the QR decomposition on an update basis. I am proposing, first, the specification of an "UpdatingLinearRegression"
> interface which defines basic functionality all such techniques must fulfill.
> Related to this 'updating' regression, the results of running a regression on some subset
> of the data should be encapsulated in an immutable object. This is to ensure that subsequent
> additions of observations do not corrupt or render inconsistent parameter estimates. I am
> calling this interface "RegressionResults".
> Once the community has reached a consensus on the interface, work on the concrete implementation
> of these techniques will take place.
> Thanks,
> -Greg

