commons-issues mailing list archives

From "greg sterijevski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-607) Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
Date Wed, 06 Jul 2011 20:13:18 GMT

    [ https://issues.apache.org/jira/browse/MATH-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060800#comment-13060800 ]

greg sterijevski commented on MATH-607:
---------------------------------------

On the results object:

There are vars * (vars + 1) / 2 elements in the covariance matrix, vars
parameter estimates, vars standard errors, and some other assorted stuff.
Not terribly large at first. However, consider doing panel regression via
dummy variables: the covariance matrix can get large very quickly. That
being said, I don't think making RegressionResults a concrete class is a
showstopper. Should I send a follow-up patch with results made concrete?
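
To make the size argument concrete, here is a minimal sketch of packed
storage for the symmetric covariance matrix. PackedCovariance and its
methods are hypothetical names used for illustration only; they are not
part of the attached patch or of Commons Math.

```java
// Hypothetical helper (an assumption, not an existing Commons Math class):
// stores only the upper triangle of a symmetric vars x vars matrix.
final class PackedCovariance {
    private final int vars;
    private final double[] data;

    PackedCovariance(int vars) {
        this.vars = vars;
        this.data = new double[(int) packedSize(vars)];
    }

    /** Number of stored elements: vars * (vars + 1) / 2. */
    static long packedSize(int vars) {
        return (long) vars * (vars + 1) / 2;
    }

    /** Row-major index of element (i, j) within the packed upper triangle. */
    static int index(int vars, int i, int j) {
        if (i > j) {            // symmetric: (i, j) and (j, i) share one slot
            int t = i;
            i = j;
            j = t;
        }
        return i * vars - i * (i - 1) / 2 + (j - i);
    }

    double get(int i, int j) {
        return data[index(vars, i, j)];
    }

    void set(int i, int j, double value) {
        data[index(vars, i, j)] = value;
    }
}
```

With 10,000 dummy variables, packedSize returns 50,005,000 elements, or
roughly 400 MB of doubles at 8 bytes each.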

On the regression object:

Are you concerned that we will be removing methods from any interface we
specify today? Or do you think the contract is too restrictive? The reason I
am pushing for an interface is that I have two candidates for a concrete
implementation of updating regression. The first implementation is based on
Gentleman's lemma and is detailed in the following article:

Algorithm AS 274: Least Squares Routines to Supplement Those of Gentleman
Author: Alan J. Miller
Source: Journal of the Royal Statistical Society, Series C (Applied
Statistics), Vol. 41, No. 2 (1992)
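
As a rough illustration of this first approach, the sketch below folds one
observation at a time into a triangular factorization using square-root-free
Givens rotations, so the cross products matrix is never formed. It is a
simplified sketch under assumptions: the class and method names are
hypothetical, every observation has weight 1, and there are no checks for
rank deficiency. It is not the code from the attached patch.

```java
// Hypothetical sketch of a Givens-based updating regression.
final class GivensUpdatingRegression {
    private final int p;        // number of regressors, including any constant column
    private final double[] d;   // diagonal scaling factors
    private final double[] r;   // strict upper triangle of unit-diagonal R, packed by rows
    private final double[] rhs; // right-hand side after the same rotations
    private double sse;         // residual sum of squares

    GivensUpdatingRegression(int p) {
        this.p = p;
        this.d = new double[p];
        this.r = new double[p * (p - 1) / 2];
        this.rhs = new double[p];
    }

    /** Rotates one observation into the factorization; only O(p^2) state is kept. */
    void addObservation(double[] xIn, double y) {
        double[] x = xIn.clone();
        double w = 1.0;
        int nextr = 0;
        for (int i = 0; i < p; i++) {
            if (w == 0.0) {
                return;                 // observation fully absorbed
            }
            double xi = x[i];
            if (xi == 0.0) {
                nextr += p - i - 1;     // skip this row of R
                continue;
            }
            double di = d[i];
            double wxi = w * xi;
            double dpi = di + wxi * xi;
            double cbar = di / dpi;
            double sbar = wxi / dpi;
            w = cbar * w;
            d[i] = dpi;
            for (int k = i + 1; k < p; k++) {
                double xk = x[k];
                x[k] = xk - xi * r[nextr];
                r[nextr] = cbar * r[nextr] + sbar * xk;
                nextr++;
            }
            double yk = y;
            y = yk - xi * rhs[i];
            rhs[i] = cbar * rhs[i] + sbar * yk;
        }
        sse += w * y * y;
    }

    /** Back-substitution on the unit upper-triangular system R * beta = rhs. */
    double[] solve() {
        double[] beta = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            int rowStart = i * (p - 1) - i * (i - 1) / 2;
            double b = rhs[i];
            for (int j = i + 1; j < p; j++) {
                b -= r[rowStart + (j - i - 1)] * beta[j];
            }
            beta[i] = b;
        }
        return beta;
    }

    double getSse() {
        return sse;
    }
}
```

Feeding rows {1, x} with y = 1 + 2x for x = 0..3 should recover estimates
close to (1, 2) with a residual sum of squares near zero.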

The second approach is detailed in this article by Goodnight:
A Tutorial on the SWEEP Operator
James H. Goodnight
The American Statistician, Vol. 33, No. 3. (Aug., 1979), pp. 149-158.
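
For comparison, a minimal sketch of this second approach: accumulate the
cross products of [x, y] one observation at a time, then apply the sweep
operator to the first p pivots. The class and method names are hypothetical,
and the sketch assumes nonzero pivots (no collinearity handling).

```java
import java.util.Arrays;

// Hypothetical sketch of regression via cross products and the sweep operator.
final class SweepRegression {
    private final int p;        // number of regressors, including any constant column
    private final double[][] a; // (p + 1) x (p + 1) cross products of [x, y]

    SweepRegression(int p) {
        this.p = p;
        this.a = new double[p + 1][p + 1];
    }

    /** Adds z * z' to the cross products matrix, where z = [x, y]. */
    void addObservation(double[] x, double y) {
        double[] z = Arrays.copyOf(x, p + 1);
        z[p] = y;
        for (int i = 0; i <= p; i++) {
            for (int j = 0; j <= p; j++) {
                a[i][j] += z[i] * z[j];
            }
        }
    }

    /** Sweeps matrix m in place on pivot k. */
    static void sweep(double[][] m, int k) {
        int n = m.length;
        double pivot = m[k][k];
        for (int j = 0; j < n; j++) {
            m[k][j] /= pivot;
        }
        for (int i = 0; i < n; i++) {
            if (i == k) {
                continue;
            }
            double b = m[i][k];
            for (int j = 0; j < n; j++) {
                m[i][j] -= b * m[k][j];
            }
            m[i][k] = -b / pivot;
        }
        m[k][k] = 1.0 / pivot;
    }

    /**
     * Returns a swept copy: rows 0..p-1 of the last column hold the
     * coefficient estimates, the bottom-right element holds the residual
     * sum of squares, and the top-left p x p block holds inverse(X'X).
     */
    double[][] solve() {
        double[][] m = new double[p + 1][];
        for (int i = 0; i <= p; i++) {
            m[i] = a[i].clone();
        }
        for (int k = 0; k < p; k++) {
            sweep(m, k);
        }
        return m;
    }
}
```

Because solve() works on a copy, more observations can be added afterwards
and the sweep repeated, which is what makes the approach usable for
stepwise work on data that does not fit in memory.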

The first approach never forms the cross products matrix; the second does.
They are significantly different approaches to dealing with large data sets.


How would I do this in the concrete class you propose?
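
To illustrate what I mean by an interface here, a rough sketch follows. The
names UpdatingLinearRegression and RegressionResults come from the issue
description; every method name and the ImmutableRegressionResults class are
assumptions made for illustration and may differ from the attached patch.

```java
// Hypothetical contract that both candidate implementations could satisfy.
interface UpdatingLinearRegression {
    /** Folds one observation into the running state. */
    void addObservation(double[] x, double y);

    /** Returns a snapshot of the estimates for the data seen so far. */
    RegressionResults regress();

    /** Discards all accumulated state. */
    void clear();
}

// Hypothetical read-only view of a regression snapshot.
interface RegressionResults {
    int getNumberOfParameters();

    double getParameterEstimate(int index);

    long getN();
}

// Immutable implementation: defensive copies mean later updates to the
// regression object cannot change a snapshot that was already handed out.
final class ImmutableRegressionResults implements RegressionResults {
    private final double[] parameters;
    private final long nobs;

    ImmutableRegressionResults(double[] parameters, long nobs) {
        this.parameters = parameters.clone();
        this.nobs = nobs;
    }

    @Override
    public int getNumberOfParameters() {
        return parameters.length;
    }

    @Override
    public double getParameterEstimate(int index) {
        return parameters[index];
    }

    @Override
    public long getN() {
        return nobs;
    }
}
```

An implementation that never forms the cross products and one that does
could each implement UpdatingLinearRegression while sharing the same
results type.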

Thanks,

-Greg





> Current Multiple Regression Object does calculations with all data incore. There are
> non incore techniques which would be useful with large datasets.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MATH-607
>                 URL: https://issues.apache.org/jira/browse/MATH-607
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 3.0
>         Environment: Java
>            Reporter: greg sterijevski
>              Labels: Gentleman's, QR, Regression, Updating, decomposition, lemma
>             Fix For: 3.0
>
>         Attachments: updating_reg_ifaces
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> The current multiple regression class does a QR decomposition on the complete
> data set. This necessitates the loading incore of the complete dataset. For
> large datasets, or large datasets and a requirement to do datamining or
> stepwise regression this is not practical. There are techniques which form the
> normal equations on the fly, as well as ones which form the QR decomposition
> on an update basis. I am proposing, first, the specification of an
> "UpdatingLinearRegression" interface which defines basic functionality all
> such techniques must fulfill.
> Related to this 'updating' regression, the results of running a regression on
> some subset of the data should be encapsulated in an immutable object. This is
> to ensure that subsequent additions of observations do not corrupt or render
> inconsistent parameter estimates. I am calling this interface
> "RegressionResults".
> Once the community has reached a consensus on the interface, work on the
> concrete implementation of these techniques will take place.
> Thanks,
> -Greg

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
