commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greg Sterijevski <gsterijev...@gmail.com>
Subject Re: [math] SimpleRegression
Date Sat, 13 Aug 2011 05:59:43 GMT
Also, I was thinking that maybe it might be useful to bring SimpleRegression
into line with the other regression techniques and give the user the ability
to constrain the constant to zero?

-Greg

On Sat, Aug 13, 2011 at 12:13 AM, Greg Sterijevski
<gsterijevski@gmail.com>wrote:

> One more thing... (ala Detective Colombo).
>
> In add and remove observation there is a snippet which looks like:
>
>             sumXX += dx * dx * (double) n / (n + 1d);
>             sumYY += dy * dy * (double) n / (n + 1d);
>             sumXY += dx * dy * (double) n / (n + 1d);
>             xbar += dx / (n + 1.0);
>             ybar += dy / (n + 1.0);
>
> Would you be against pulling out the divisions by ( n+1.0)?  Maybe
> something like:
>            final double fact1 = 1.0/((double) n + 1.0);
>            final double fact2 = ((double) n) * fact1;
>             sumXX += dx * dx * fact2;
>             sumYY += dy * dy * fact2;
>             sumXY += dx * dy * fact2;
>             xbar += dx * fact1;
>             ybar += dy * fact1;
>
>
> I realize that most likely the compiler or runtime does this optimization,
> but just in case...
>
>
> -Greg
>
>
> On Fri, Aug 12, 2011 at 11:40 PM, Greg Sterijevski <gsterijevski@gmail.com
> > wrote:
>
>> +1 from me, on all counts! -Greg
>>
>>
>> On Fri, Aug 12, 2011 at 11:30 PM, Phil Steitz <phil.steitz@gmail.com>wrote:
>>
>>> On 8/12/11 7:16 PM, Greg Sterijevski wrote:
>>> > Hello All,
>>> >
>>> > Before I chum the water with more JIRA tickets, I thought I would see
>>> > whether people thought this was important.
>>> >
>>> > In the SimpleRegression you have two methods:
>>> >
>>> > public void addData(double x, double y) {
>>> >   ...some code that is not germane to discussion......
>>> >
>>> >         if (n > 2) {
>>> >             distribution = new TDistributionImpl(n - 2);
>>> >         }
>>> >     }
>>> >
>>> >     public void removeData(double x, double y) {
>>> >   ...some code that is not germane......
>>> >
>>> >             if (n > 2) {
>>> >                 distribution = new TDistributionImpl(n - 2);
>>> >             }
>>> >         }
>>> >     }
>>> >
>>> > >From the perspective of a user, you are likely to call add/remove
>>> repeatedly
>>> > before you ever need to check for statistical significance. Wouldn't it
>>> be
>>> > better to instantiate the TDistribution when it is needed?
>>> >
>>> > So you would have to make the following two methods a bit more
>>> complicated:
>>> >
>>> >  public double getSlopeConfidenceInterval(double alpha)
>>> >         throws MathException {
>>> >         if (alpha >= 1 || alpha <= 0) {
>>> >             throw new
>>> > OutOfRangeException(LocalizedFormats.SIGNIFICANCE_LEVEL,
>>> >                                           alpha, 0, 1);
>>> >         }
>>> >         if( distribution == null || distribution.getDegreesOfFreedom()
>>> !=
>>> > n-2){
>>> >           distribution = new TDistributionImpl(n - 2);
>>> > }
>>> >         return getSlopeStdErr() *
>>> >             distribution.inverseCumulativeProbability(1d - alpha / 2d);
>>> >     }
>>> >
>>> > Similarly getSignificance() would have to be modified with the check of
>>> the
>>> > degrees of freedom of the distribution.
>>> >
>>> > There is nothing wrong with the current code, but making the change
>>> means
>>> > faster updates.
>>>
>>> Slightly, yes.  There is not much code at all in the distribution
>>> constructor, but you are correct.  Moreover, I can see now that the
>>> "immutability-everywhere" changes in trunk have made the code in the
>>> class a little funny.  In versions <=2.2, the TDistribution used by
>>> the class was pluggable - i.e., there was a constructor that took at
>>> TDistribution as an argument, so if for some reason you wanted to
>>> use an impl different from TDistributionImpl, you could do that.
>>> There was also a setter for the distribution. In addition,
>>> TDistributionImpl itself was mutable, exposing a setDegreesOfFreedom
>>> method.  So the distribution member was set at construction time and
>>> the data update methods called the setter for DF on the
>>> distribution.  We decided to make the distributions immutable in 3.0
>>> (search the archives for discussion), so the current mods were done
>>> to basically accomplish that.  But the code should be cleaned up.
>>> The constructor taking an int is meaningless and should be
>>> deprecated or removed (unfortunately, we added that in 2.2 and
>>> advertised it as a deprecation replacement for the version that took
>>> a distribution as parameter.  We should have realized then that it
>>> was meaningless.  My bad for missing that. I would favor dropping it
>>> in 3.0, since even if anyone is using it, it isn't doing anything
>>> meaningful for them.)  Given that constructing a TDistributionImpl
>>> is trivial, we might as well eliminate the member field altogether
>>> and just create one when needed.  If you agree and no one else
>>> objects, I will make these changes.  Thanks for reviewing the code.
>>>
>>> Phil
>>> >
>>> > Thoughts?
>>> >
>>> > -Greg
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message