commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gilles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-1441) SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call
Date Sat, 27 Jan 2018 13:32:00 GMT

    [ https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342131#comment-16342131
] 

Gilles commented on MATH-1441:
------------------------------

{quote}it's the `inverseCumulativeProbability` calculation immediately after that's expensive.
{quote}
In your case, the use of a fixed "alpha" (0.05 by default) makes all those call always compute
the same value; so it seems that both "n" and "alpha" (and their expensive dependents) must
be cached.
{quote}I might be interested in contributing something like that if you're interested
{quote}
Sure.
{quote}what's the minimum JVM version you're targeting in Commons Math 4.0?
{quote}
24.3 ;)
 As it is we should rather focus on the new "Commons Statistics" component which I referred
to above. As a new project, it is targeted at Java 8.

It would be great if you could post on the ML with a proposal on how to move that code to
the new repository. If you intend to work on that, I can create a new module: {{commons-statistics-regression}}
(?).
{quote}sometimes a pleasant surprise, to work with people who care about their open-source
software.
{quote}
Some would say that I care too much. See ML archive for details...

Actually, we are looking for developers who'd like to care too (in the longer term).
 And "Commons Math" is too broad for that. As an example, I'm not a direct user of anything
in {{o.a.c.math4.stat}}, so I should not be the one to dig into unknown code figuring out
why things are the way they are (with the consequence that a review could uncover design problems
– cf. other JIRA reports about it).

In the new module, we should take the time to define a stable API.

> SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call
> -------------------------------------------------------------------------------------
>
>                 Key: MATH-1441
>                 URL: https://issues.apache.org/jira/browse/MATH-1441
>             Project: Commons Math
>          Issue Type: Improvement
>    Affects Versions: 3.6.1
>         Environment: Java 8, Linux x64.
>            Reporter: Max Aller
>            Priority: Minor
>              Labels: performance
>         Attachments: visualvm screenshot.png
>
>
> SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the other of 100k
or 1M times), is surprisingly slow - 3M calls, on my 3rd gen i7 machine, takes *75 seconds*.
This is primarily because recalculating the inverse cumulative probability, and reconstructing
the TDistribution object itself, is somewhat expensive, entailing a lot of `log` and `sqrt`
and iteration and all that stuff.
> The confidence interval is based on two values - *n* and *alpha*. I'd posit that alpha
will _usually_ be one of a very small set of values, and n, well, at least in my case, I'm
always calling this method with the same number of data points - a moving window of time-series
data. But I recognize n might be all over the place for some users.
> In any event, I strongly believe some level of caching would greatly benefit the users
of Commons Math. I've forked my own version of getSlopeConfidenceInterval that uses caching.
Here's a test case demonstrating that:
> {code:java}
> class SlowRegressionsTest {
>     @Test
>     void slow() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             simpleRegression.getSlopeConfidenceInterval();
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`slow` took %dms%n", duration);
>     }
>     @Test
>     void fast() {
>         SimpleRegression simpleRegression = new SimpleRegression();
>         simpleRegression.addData(0.0, 0.0);
>         simpleRegression.addData(1.0, 1.0);
>         simpleRegression.addData(2.0, 2.0);
>         long start = System.currentTimeMillis();
>         for (int i = 0; i < 3_000_000; i++) {
>             SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
>         }
>         long duration = System.currentTimeMillis() - start;
>         System.out.printf("`fast` took %dms%n", duration);
>     }
> }{code}
> which prints out
> {noformat}
> `fast` took 159ms
> `slow` took 75304ms{noformat}
> Nearly 500x faster - 53ns/call. My particular solution is written in Kotlin for Java
8, so not of direct relevance to you, but here it is:
> {code:java}
> package math
> import org.apache.commons.math3.distribution.TDistribution
> import org.apache.commons.math3.exception.OutOfRangeException
> import org.apache.commons.math3.exception.util.LocalizedFormats
> import org.apache.commons.math3.stat.regression.SimpleRegression
> import java.util.concurrent.ConcurrentHashMap
> @JvmOverloads
> fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
>     if (n < 3) {
>         return Double.NaN
>     }
>     if (alpha >= 1 || alpha <= 0) {
>         throw OutOfRangeException(
>             LocalizedFormats.SIGNIFICANCE_LEVEL,
>             alpha, 0, 1
>         )
>     }
>     // No advertised NotStrictlyPositiveException here - will return NaN above
>     // PATCH: use cached inverse cumulative probability
>     return slopeStdErr * getInverseCumulativeProbability(n, alpha)
> }
> private val cache = ConcurrentHashMap<Key, Double>()
> private data class Key(val n: Long, val alpha: Double)
> private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
>     cache.getOrPut(Key(n, alpha)) {
>         TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha /
2.0)
>     }
> {code}
> Limitations: 1. Kotlin, 2. ConcurrentHashMap is unbounded here.
> I don't know how/if Commons Math does caching elsewhere, but it'd sure be handy here,
I believe. What are your thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message