commons-dev mailing list archives

From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] Updating Regression questions..
Date Sat, 10 Sep 2011 21:33:36 GMT
On 9/10/11 10:15 AM, Greg Sterijevski wrote:
> Hi All,
>
> Another mostly exceptional question from me! In the
> interface UpdatingMultipleLinearRegression, we have regress() and
> regress(int[] variablesToInclude).
>
> 1. What is the appropriate exception to throw when there is a variable index
> in the array which does not exist in the data? For example I have regressors
> 0 to 4. The user requests variables 0, 2 and 7. Is it still a
> ModelSpecificationException? If so what is appropriate text? "Request for
> index {0} cannot be fulfilled because the data has only {1} independent
> variables"

I would say yes, it is a model specification problem and I would just
throw new ModelSpecificationException(
        LocalizedFormats.INDEX_LARGER_THAN_MAX, i, this.nvars);
>
> 2. Say that the request list (the integer array variablesToInclude) has a
> request that looks like int[]{ 1, 4, 2 }. In the Miller regression, I
> attempt to return things in the canonical order. Would it make more sense
> to have an element in the RegressionResults object which records the
> canonical position of each regressor and returns the results in the
> requested (arbitrary) order?

I would say avoid this complexity by requiring that the sequence of
indices be strictly increasing.  Document this precondition and if
it is violated, throw a ModelSpecificationException with the
LocalizedFormats.NOT_INCREASING_SEQUENCE message.
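
A minimal sketch of the two checks from answers 1 and 2, assuming the
indices are validated in a single pass before any computation starts. The
ModelSpecificationException class below is only a stand-in so the snippet
compiles on its own (the real one takes a LocalizedFormats pattern plus
arguments), and VariableIndexChecks with its check method are hypothetical
names, not part of Commons Math.

```java
import java.util.Arrays;

// Stand-in for the Commons Math exception so this sketch compiles alone.
class ModelSpecificationException extends IllegalArgumentException {
    ModelSpecificationException(String message) {
        super(message);
    }
}

// Hypothetical helper; the name and placement are assumptions.
final class VariableIndexChecks {
    private VariableIndexChecks() {}

    /**
     * Checks that every requested index refers to an existing regressor
     * and that the indices are strictly increasing.
     *
     * @param variablesToInclude requested regressor indices
     * @param nvars number of regressors available in the data
     * @throws ModelSpecificationException if either precondition fails
     */
    static void check(int[] variablesToInclude, int nvars) {
        for (int k = 0; k < variablesToInclude.length; k++) {
            int i = variablesToInclude[k];
            if (i < 0 || i >= nvars) {
                throw new ModelSpecificationException(
                    "index " + i + " is outside the valid range [0, "
                        + (nvars - 1) + "]");
            }
            if (k > 0 && i <= variablesToInclude[k - 1]) {
                throw new ModelSpecificationException(
                    "indices are not strictly increasing: "
                        + Arrays.toString(variablesToInclude));
            }
        }
    }
}
```

With five regressors, check(new int[]{0, 2, 4}, 5) returns normally, while
{0, 2, 7} fails the range test and {1, 4, 2} fails the ordering test.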
>
> 3. In the call to regress() what is the proper manner in which to handle a
> case where no result can be returned? Say that the user has supplied nothing
> but NaNs in the data. There is nothing that can be done. What is the proper
> exception? Is it fair to return just a null?

As Gilles pointed out, we do not have a hard and fast rule on this.
What is important is that we clearly document the contract.  I would
say that for the regression classes, throwing an
IllegalArgumentException (IAE) for insufficient data would be best.
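
One way the documented contract could look, as a rough sketch. The
threshold used here (strictly more observations than estimated
parameters) is an assumption for illustration, and RegressionPreconditions
and checkSufficientData are hypothetical names.

```java
// Hypothetical helper; not part of Commons Math.
final class RegressionPreconditions {
    private RegressionPreconditions() {}

    /**
     * Verifies that enough observations have been added to fit the model.
     * Callers get an exception instead of a null result.
     *
     * @param nobs number of usable observations seen so far
     * @param nparams number of parameters to estimate
     * @throws IllegalArgumentException if nobs is not greater than nparams
     */
    static void checkSufficientData(long nobs, int nparams) {
        // Assumed rule: at least one more observation than parameters.
        if (nobs <= nparams) {
            throw new IllegalArgumentException(
                "insufficient data: " + nobs + " observations for "
                    + nparams + " parameters");
        }
    }
}
```

A regress() implementation could call this first, so that the all-NaN
case reduces to "no usable observations" once bad rows are rejected or
skipped on input.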
>
> 4. Should any of these regression techniques (whether they implement
> UpdatingMultipleLinearRegression or not) check the (input) data for things
> like NaN or Inf? If so, what is the exception to throw? Is there a
> parallel in other classes in Math?

We have four reasonable choices here:

0) throw IAE whenever NaNs or INFs appear anywhere
1) omit observations including NaNs or INFs
2) get into the imputation business (develop some kind of pluggable
data imputation strategy framework and use this to "replace" NaNs
which we interpret as missing data)
3) don't worry, be happy, let GIGO rule (just let the computation
proceed and end up with NaNs, INFs or exceptions)

Unfortunately, we mostly do 3) now in the stats package.  Patrick
has gotten us started thinking about 1) and 2) via his work on
storeless covariance (MATH-449).  I think we will eventually end up
supporting 2), but we are not there yet, so for now, I think the
choice is really between 0) and 1).  Probably the safest choice at
this time is 0).  When we have developed a good approach for
representing and handling missing data, we can add configuration
parameters that allow NaNs to be used to signal missing data instead
of triggering IAE.
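
To make the difference between 0) and 1) concrete, here is a small
sketch. NonFiniteHandling, requireFinite and finiteRows are hypothetical
names, and the sketch assumes Java 8 or later for Double.isFinite.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper contrasting choices 0) and 1); not Commons Math API.
final class NonFiniteHandling {
    private NonFiniteHandling() {}

    /** Choice 0): reject an observation containing NaN or an infinity. */
    static void requireFinite(double[] x, double y) {
        if (!Double.isFinite(y)) {
            throw new IllegalArgumentException("response is not finite: " + y);
        }
        for (int j = 0; j < x.length; j++) {
            if (!Double.isFinite(x[j])) {
                throw new IllegalArgumentException(
                    "regressor " + j + " is not finite: " + x[j]);
            }
        }
    }

    /** Choice 1): return the indices of observations that are fully finite. */
    static List<Integer> finiteRows(double[][] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException(
                "x has " + x.length + " rows but y has " + y.length);
        }
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            boolean ok = Double.isFinite(y[i]);
            for (int j = 0; ok && j < x[i].length; j++) {
                ok = Double.isFinite(x[i][j]);
            }
            if (ok) {
                rows.add(i);
            }
        }
        return rows;
    }
}
```

Under 0) the caller learns about bad input immediately; under 1) the fit
silently uses fewer observations, so the number of rows actually used
would need to be reported back.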

Phil
>
> Thank you,
>
> -Greg
>



