commons-issues mailing list archives

From "Phil Steitz (JIRA)" <>
Subject [jira] [Commented] (MATH-607) Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
Date Thu, 21 Jul 2011 23:54:00 GMT


Phil Steitz commented on MATH-607:

Don't worry about the exceptions.  The only thing remaining there is to fit into the hierarchy.
 I will fix that.  Feel free to weigh in on the ML thread, though.

I am still not sold on the globalstats array.  It is just not the Java way to use arrays with
static constants into them to represent properties.  I agree strongly that we are going to
want to add more fields to RegressionResults. In the public API, we are going to want them
to be properties, though.  Why would a user ever want to use getGlobalStats[THING_I_WANT]
instead of getThingIWant?  It is actually more code to maintain the enum plus the array rather
than just fields.  So I guess I disagree with a) and d) above.  I get b), but don't see it
as a big deal or enough to change API.  I also get c), but it frankly looks a little scary.
 I have been burned so many times over the years by indices into blocks of storage, variable
content hashmaps of attributes, etc. that I try to avoid these things in my code.  And think
about the change in c) in any case - from globalStats[OLD_INDEX] to globalStats[NEW_INDEX]
somehow through the API, when the same change is really s/oldProperty/newProperty likely at
the same entry point.
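
To make the comparison concrete, here is a minimal sketch of the two styles side by side. Only RegressionResults, globalStats and getGlobalStats come from the discussion; the class names, the index constants and the two statistics shown (error sum of squares, R-squared) are hypothetical and chosen just for illustration, not taken from the attached patches.

```java
// Style 1: statistics stored in an array, addressed through static constants.
// The constant names and their ordering are hypothetical.
final class ArrayBackedResults {
    static final int ERROR_SUM_SQUARES = 0;
    static final int R_SQUARED = 1;

    private final double[] globalStats;

    ArrayBackedResults(double[] globalStats) {
        // Defensive copy so the results object stays immutable.
        this.globalStats = globalStats.clone();
    }

    double[] getGlobalStats() {
        // Copy again on the way out; callers must know which index means what.
        return globalStats.clone();
    }
}

// Style 2: each statistic is a named, typed property.
final class PropertyBackedResults {
    private final double errorSumSquares;
    private final double rSquared;

    PropertyBackedResults(double errorSumSquares, double rSquared) {
        this.errorSumSquares = errorSumSquares;
        this.rSquared = rSquared;
    }

    double getErrorSumSquares() {
        return errorSumSquares;
    }

    double getRSquared() {
        return rSquared;
    }
}

final class ResultsStylesDemo {
    public static void main(String[] args) {
        ArrayBackedResults a = new ArrayBackedResults(new double[] {12.5, 0.93});
        PropertyBackedResults p = new PropertyBackedResults(12.5, 0.93);

        // Array style: a wrong or stale index still compiles.
        System.out.println(a.getGlobalStats()[ArrayBackedResults.R_SQUARED]);
        // Property style: a renamed or removed statistic fails at compile time.
        System.out.println(p.getRSquared());
    }
}
```

Both variants are immutable, so the choice here is only about how the statistics are exposed to callers.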

> Current Multiple Regression Object does calculations with all data incore. There are
> non incore techniques which would be useful with large datasets.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: MATH-607
>                 URL:
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 3.0
>         Environment: Java
>            Reporter: greg sterijevski
>              Labels: Gentleman's, QR, Regression, Updating, decomposition, lemma
>             Fix For: 3.0
>         Attachments: RegressResults2, millerreg, millerreg_take2, millerregtest, regres_change1,
>                      updating_reg_cut2, updating_reg_ifaces
>   Original Estimate: 840h
>  Remaining Estimate: 840h
> The current multiple regression class does a QR decomposition on the complete data set.
> This necessitates the loading incore of the complete dataset. For large datasets, or large
> datasets and a requirement to do datamining or stepwise regression this is not practical.
> There are techniques which form the normal equations on the fly, as well as ones which form
> the QR decomposition on an update basis. I am proposing, first, the specification of an
> "UpdatingLinearRegression" interface which defines basic functionality all such techniques
> must fulfill.
> Related to this 'updating' regression, the results of running a regression on some subset
> of the data should be encapsulated in an immutable object. This is to ensure that subsequent
> additions of observations do not corrupt or render inconsistent parameter estimates. I am
> calling this interface "RegressionResults".
> Once the community has reached a consensus on the interface, work on the concrete implementation
> of these techniques will take place.
> Thanks,
> -Greg
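
As a rough illustration of what "forming the normal equations on the fly" in the description above can look like, here is a minimal sketch for the simplest case: one regressor plus an intercept. It keeps only running sums, so no observation has to stay in memory. The class and method names are hypothetical; this is not the proposed UpdatingLinearRegression interface or the code attached to the issue, and plain running sums are numerically weaker than a QR-updating scheme.

```java
// Hypothetical example: simple linear regression y = intercept + slope * x,
// fitted from running sums instead of a stored data matrix.
final class RunningSimpleRegression {
    private long n;
    private double sumX;
    private double sumY;
    private double sumXX;
    private double sumXY;

    // Fold one observation into the accumulated sums, then discard it.
    void addObservation(double x, double y) {
        n++;
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    long getN() {
        return n;
    }

    // Solve the 2x2 normal equations for the slope.
    double slope() {
        double denominator = n * sumXX - sumX * sumX;
        if (n < 2 || denominator == 0.0) {
            throw new IllegalStateException("need at least two distinct x values");
        }
        return (n * sumXY - sumX * sumY) / denominator;
    }

    // The intercept follows from the slope and the two means.
    double intercept() {
        return (sumY - slope() * sumX) / n;
    }
}
```

A caller would stream rows through addObservation(x, y) and query slope() and intercept() at any point. A snapshot of those values copied into an immutable results object would then be unaffected by later additions.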

This message is automatically generated by JIRA.