commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Phil Steitz (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MATH-1179) kolmogorovSmirnovTest poor performance in monteCarloP method
Date Mon, 29 Jun 2015 23:59:05 GMT

    [ https://issues.apache.org/jira/browse/MATH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606534#comment-14606534
] 

Phil Steitz edited comment on MATH-1179 at 6/29/15 11:58 PM:
-------------------------------------------------------------

I did a little more research.  The 2-sample case is different because  the test statistic
has a discrete distribution.  This makes exact computation possible, which we do for small
m,n.  We use a naively computed Smirnov approximation for large n*m and Monte Carlo for "moderate"
n*m by default.  There are two things we can do to improve this:

 1.  I can't find a freely available reference, but I am chasing down dead trees for a much
better exact computation algorithm that I have seen referenced [1].  What the exactP code
does now is simple and correct but ridiculously slow.  I conjecture that the code that I have
not been able to figure out in R may be using the algorithm in [1] for small n,m.  By implementing
a better exact algorithm, we can use it up to n * m <= 10000, which is the cut R uses.
 That basically eliminates the need for the Monte Carlo stuff.
 2.  The sum computed in approximateP has known bad numerical properties and our code does
not really do anything to correct for this.  The magic numbers in some of the code you reference
above have to do with continuity corrections.  We need to research this a little more and
apply the corrections to the computation of the sum.

[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two sample
Kolmogorov–Smirnov  criterion D_mn(m<=n) Selected  Tables  in  Mathematical  Statistics
80–129  American Mathematical Society


was (Author: psteitz):
I did a little more research.  The 2-sample case is different because  the test statistic
has a discrete distribution.  This makes exact computation possible, which we do for small
m,n.  We use a naively computed Smirnov approximation for large n*m and Monte Carlo for "moderate"
n*m by default.  There are two things we can do to improve this:

1.  I can't find a freely available reference, but I am chasing down dead trees for a much
better exact computation algorithm that I have seen referenced [1].  What the exactP code
does now is simple and correct but ridiculously slow.  I conjecture that the code that I have
not been able to figure out in R may be using that algorithm for small n,m.  By implementing
a better exact algorithm, we can use it up to n * m <= 10000, which is the cut R uses.
 That basically eliminates the need for the Monte Carlo stuff.
2.  The sum computed in approximateP has known bad numerical properties and our code does
not really do anything to correct for this.  The magic numbers in some of the code you reference
above have to do with continuity corrections.  We need to research this a little more and
apply the corrections to the computation of the sum.

[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two sample
Kolmogorov–Smirnov  criterion D_mn(m<=n) Selected  Tables  in  Mathematical  Statistics
80–129  American Mathematical Society

> kolmogorovSmirnovTest poor performance in monteCarloP method
> ------------------------------------------------------------
>
>                 Key: MATH-1179
>                 URL: https://issues.apache.org/jira/browse/MATH-1179
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Gilad
>             Fix For: 4.0
>
>         Attachments: KSTest-JavaAndR.txt, KSTestSnippet.txt
>
>
> I'm using the kolmogovSmirnovTest method to calculate pvalues.
> However, when i try running the test on two double[] of sizes 5 and 45 the results take
over 10 seconds to calculate.
> This seems very long, whereas in R it takes a few miliseconds for the same calculation.
> I'd be very happy to hear any comment you may have on the subject.
>    Gilad



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message