I am working on MATH437 (turning KS distribution into a proper KS
test impl) and have to decide how to implement 2sample tests.
Asymptotically, the 2sample D_n,m test statistic (see [1]) has a
KS distribution, so for large samples just using the cdf we already
have is appropriate. For small samples (actually for any size
sample), the test statistic distribution is discrete and can be
computed exactly. A brute force way to do that is to enumerate all
of the nm partitions of {0, ..., n+m1} and compute all the
possible D_n,m values. R seems to use a more clever way to do
this. Does anyone have a reference for an efficient way to compute
the exact distribution, or background on where R got their
implementation?
Absent a "clever" approach, I see three alternatives and would
appreciate some feedback:
0) just use the asymptotic distribution even for small samples
1) fully enumerate all nm partitions and compute the D_n,m as above
1) use a monte carlo approach  instead of full enumeration of the
D_n,m, randomly generate a large number of splits and compute the
pvalue for observed D_n,m by computing the number of random nm
splits generate a D value less than what is observed.
Thanks in advance for any feedback / pointers.
Phil
[1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

To unsubscribe, email: devunsubscribe@commons.apache.org
For additional commands, email: devhelp@commons.apache.org
