commons-issues mailing list archives

From "Phil Steitz (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (MATH-607) Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
Date Wed, 06 Jul 2011 18:32:16 GMT

    [ https://issues.apache.org/jira/browse/MATH-607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060744#comment-13060744 ]

Phil Steitz edited comment on MATH-607 at 7/6/11 6:31 PM:
----------------------------------------------------------

First, thanks for pushing this along and sorry to be slow to respond.

I like both of the abstractions, but I am not sure that defining interfaces is the best way
to go in either case.  The reporting interface (RegressionResults) could be a concrete class,
and it is probably best to define a base class that omits some of the reported stats (e.g.
isRedundant, getRedundant).  Making this a class gives us more flexibility.  It also makes
it a little more convenient for users who want to save intermediate results.
One thing that I would add to either the base or an extended version is adjusted R-squared.
I think it is also a good idea at this point to ask what else might be missing.  Your suggestions
on redundancy are a good example.  For now, I would suggest making RegressionResults a serializable
class while we finalize its contents.  One small quibble on naming:  s/getNobs/getNumberOfObservations,
or, if that is too onerous, getN (similar to other stats).
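
To make the suggestion concrete, here is a minimal sketch of what a concrete, serializable
results class with getN and adjusted R-squared might look like.  The class name
RegressionResultsSketch, its constructor, and its fields are hypothetical and chosen only for
illustration; this is not the attached interface.  It assumes the model includes an intercept
and that the parameter array counts it.

```java
import java.io.Serializable;

/** Hypothetical sketch of an immutable results holder; not the proposed API. */
final class RegressionResultsSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final double[] parameters; // estimates, intercept included (assumption)
    private final long n;              // number of observations
    private final double rSquared;

    RegressionResultsSketch(double[] parameters, long n, double rSquared) {
        if (n <= parameters.length) {
            throw new IllegalArgumentException("need more observations than parameters");
        }
        this.parameters = parameters.clone(); // defensive copy keeps the object immutable
        this.n = n;
        this.rSquared = rSquared;
    }

    long getN() {
        return n;
    }

    double[] getParameterEstimates() {
        return parameters.clone();
    }

    double getRSquared() {
        return rSquared;
    }

    /** Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p), p counting the intercept. */
    double getAdjustedRSquared() {
        int p = parameters.length;
        return 1.0 - (1.0 - rSquared) * (n - 1) / (double) (n - p);
    }
}
```

Because every field is final and arrays are copied on the way in and out, later observations
added to the model cannot change a results object that a user has already stored.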

Regarding the model interface, I would again suggest that we just define this as a class,
UpdatingOLSRegression.  I suppose that if we end up implementing a weighted or other non-OLS
version, we might want to factor out a common interface like what exists for MultipleLinearRegression,
but in retrospect, I am not sure that interface was worth much.  Note that all that we could
factor out is essentially what is in MultivariateRegression, which is analogous to your RegressionResults.
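
For discussion, a rough sketch of the shape such a class could take is below.  It uses the
simplest of the techniques mentioned in the issue, forming the normal equations on the fly,
rather than an updating QR decomposition, so it is less numerically stable than what we would
actually ship.  The name UpdatingOlsSketch and the methods addObservation and regress are
assumptions made for this example only.

```java
/** Hypothetical sketch: accumulates X'X and X'y one observation at a time. */
final class UpdatingOlsSketch {
    private final int p;          // number of parameters, including the constant term
    private final double[][] xtx; // running X'X
    private final double[] xty;   // running X'y
    private long n;               // observations seen so far

    UpdatingOlsSketch(int numberOfRegressors) {
        this.p = numberOfRegressors + 1;
        this.xtx = new double[p][p];
        this.xty = new double[p];
    }

    /** Adds one observation; only the p x p sums are kept in memory. */
    void addObservation(double[] x, double y) {
        if (x.length != p - 1) {
            throw new IllegalArgumentException("expected " + (p - 1) + " regressors");
        }
        double[] row = new double[p];
        row[0] = 1.0; // constant term
        System.arraycopy(x, 0, row, 1, p - 1);
        for (int i = 0; i < p; i++) {
            xty[i] += row[i] * y;
            for (int j = 0; j < p; j++) {
                xtx[i][j] += row[i] * row[j];
            }
        }
        n++;
    }

    long getN() {
        return n;
    }

    /** Solves (X'X) b = X'y by Gaussian elimination with partial pivoting. */
    double[] regress() {
        double[][] a = new double[p][p + 1]; // augmented copy, leaves the sums untouched
        for (int i = 0; i < p; i++) {
            System.arraycopy(xtx[i], 0, a[i], 0, p);
            a[i][p] = xty[i];
        }
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < 1e-12) {
                throw new IllegalStateException("singular or redundant design");
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] beta = new double[p]; // back substitution
        for (int i = p - 1; i >= 0; i--) {
            double s = a[i][p];
            for (int j = i + 1; j < p; j++) {
                s -= a[i][j] * beta[j];
            }
            beta[i] = s / a[i][i];
        }
        return beta;
    }
}
```

Since regress works on a copy of the accumulated sums, it can be called at any point and more
observations can be added afterwards.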

So, modulo the one name change, I propose to just change these to classes and get going on
the implementation.  Any other suggestions on what we should add / modify in the RegressionResults?
 


> Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MATH-607
>                 URL: https://issues.apache.org/jira/browse/MATH-607
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 3.0
>         Environment: Java
>            Reporter: greg sterijevski
>              Labels: Gentleman's, QR, Regression, Updating, decomposition, lemma
>             Fix For: 3.0
>
>         Attachments: updating_reg_ifaces
>
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> The current multiple regression class does a QR decomposition on the complete data set.
> This necessitates loading the complete dataset incore. For large datasets, or for large
> datasets combined with a requirement to do datamining or stepwise regression, this is not practical.
> There are techniques which form the normal equations on the fly, as well as ones which form
> the QR decomposition on an update basis. I am proposing, first, the specification of an "UpdatingLinearRegression"
> interface which defines basic functionality all such techniques must fulfill.
> Related to this 'updating' regression, the results of running a regression on some subset
> of the data should be encapsulated in an immutable object. This is to ensure that subsequent
> additions of observations do not corrupt parameter estimates or render them inconsistent. I am
> calling this interface "RegressionResults".
> Once the community has reached a consensus on the interface, work on the concrete implementation
> of these techniques will take place.
> Thanks,
> -Greg

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
