[ https://issues.apache.org/jira/browse/MATH1179?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=14606534#comment14606534
]
Phil Steitz edited comment on MATH1179 at 6/30/15 12:00 AM:

I did a little more research. The 2sample case is different because the test statistic
has a discrete distribution. This makes exact computation possible, which we do for small
m,n. We use a naively computed Smirnov approximation for large n*m and Monte Carlo for "moderate"
n*m by default. There are two things we can do to improve this:
# I can't find a freely available reference, but I am chasing down dead trees for a much
better exact computation algorithm that I have seen referenced [1]. What the exactP code
does now is simple and correct but ridiculously slow. I conjecture that the code that I have
not been able to figure out in R may be using the algorithm in [1] for small n,m. By implementing
a better exact algorithm, we can use it up to n * m <= 10000, which is the cut R uses.
That basically eliminates the need for the Monte Carlo stuff.
# The sum computed in approximateP has known bad numerical properties and our code does
not really do anything to correct for this. The magic numbers in some of the code you reference
above have to do with continuity corrections. We need to research this a little more and
apply the corrections to the computation of the sum.
[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two sample
Kolmogorov–Smirnov criterion D_mn(m<=n) Selected Tables in Mathematical Statistics
80–129 American Mathematical Society
was (Author: psteitz):
I did a little more research. The 2sample case is different because the test statistic
has a discrete distribution. This makes exact computation possible, which we do for small
m,n. We use a naively computed Smirnov approximation for large n*m and Monte Carlo for "moderate"
n*m by default. There are two things we can do to improve this:
1. I can't find a freely available reference, but I am chasing down dead trees for a much
better exact computation algorithm that I have seen referenced [1]. What the exactP code
does now is simple and correct but ridiculously slow. I conjecture that the code that I have
not been able to figure out in R may be using the algorithm in [1] for small n,m. By implementing
a better exact algorithm, we can use it up to n * m <= 10000, which is the cut R uses.
That basically eliminates the need for the Monte Carlo stuff.
2. The sum computed in approximateP has known bad numerical properties and our code does
not really do anything to correct for this. The magic numbers in some of the code you reference
above have to do with continuity corrections. We need to research this a little more and
apply the corrections to the computation of the sum.
[1] Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two sample
Kolmogorov–Smirnov criterion D_mn(m<=n) Selected Tables in Mathematical Statistics
80–129 American Mathematical Society
> kolmogorovSmirnovTest poor performance in monteCarloP method
> 
>
> Key: MATH1179
> URL: https://issues.apache.org/jira/browse/MATH1179
> Project: Commons Math
> Issue Type: Bug
> Reporter: Gilad
> Fix For: 4.0
>
> Attachments: KSTestJavaAndR.txt, KSTestSnippet.txt
>
>
> I'm using the kolmogovSmirnovTest method to calculate pvalues.
> However, when i try running the test on two double[] of sizes 5 and 45 the results take
over 10 seconds to calculate.
> This seems very long, whereas in R it takes a few miliseconds for the same calculation.
> I'd be very happy to hear any comment you may have on the subject.
> Gilad

This message was sent by Atlassian JIRA
(v6.3.4#6332)
