This depends on the data size. If it fits in memory, a single pass through
the sorted samples to find the biggest differences would suffice.
If the data doesn't fit, you probably need a storeless quantile estimator
such as QuantileBin1D from the Colt libraries. Then pick a resolution and
do the single-pass search.
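
For the in-memory case, the single pass could look something like the
sketch below (class and method names are just illustrative, not Commons
Math API): sort copies of both samples, walk them in merged order, and
track the largest gap between the two empirical CDFs.

import java.util.Arrays;

public final class KsSketch {

    /** Two-sample KS statistic D_{n,m} in one pass over sorted copies. */
    public static double dStatistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        final int n = a.length;
        final int m = b.length;
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < n && j < m) {
            // Step past all copies of the smaller value in both samples
            // so ties are handled before the CDFs are compared.
            double v = Math.min(a[i], b[j]);
            while (i < n && a[i] == v) { i++; }
            while (j < m && b[j] == v) { j++; }
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        return d;
    }
}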
Cheers,
Ajo
On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.steitz@gmail.com> wrote:
> I am working on MATH-437 (turning KS distribution into a proper KS
> test impl) and have to decide how to implement 2-sample tests.
> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
> KS distribution, so for large samples just using the cdf we already
> have is appropriate. For small samples (actually for any size
> sample), the test statistic distribution is discrete and can be
> computed exactly. A brute-force way to do that is to enumerate all
> of the n-m partitions of {0, ..., n+m-1} and compute all the
> possible D_n,m values. R seems to use a more clever way to do
> this. Does anyone have a reference for an efficient way to compute
> the exact distribution, or background on where R got their
> implementation?
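>
> For the large-sample case, using the cdf amounts to evaluating the
> Kolmogorov series; a rough sketch of that tail probability (plain
> series evaluation, not the existing cdf code):
>
> // P(D_n,m > d) ~ 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * t^2),
> // where t = sqrt(n*m/(n+m)) * d.
> static double asymptoticPValue(double d, int n, int m) {
>     double t = Math.sqrt((double) n * m / (n + m)) * d;
>     double sum = 0.0;
>     for (int k = 1; k <= 100; k++) {
>         double term = Math.exp(-2.0 * k * k * t * t);
>         sum += (k % 2 == 1) ? term : -term;
>         if (term < 1e-10) {
>             break; // series converges quickly for t > 0
>         }
>     }
>     return Math.min(1.0, Math.max(0.0, 2.0 * sum));
> }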
>
> Absent a "clever" approach, I see three alternatives and would
> appreciate some feedback:
>
> 0) just use the asymptotic distribution even for small samples
> 1) fully enumerate all n-m partitions and compute the D_n,m as above
> 2) use a Monte Carlo approach: instead of full enumeration of the
> D_n,m, randomly generate a large number of splits and compute the
> p-value for the observed D_n,m by counting the number of random n-m
> splits that generate a D value less than what is observed (rough
> sketch below).
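>
> Rough sketch of (2), with dStatistic standing in for whatever ends up
> computing D_n,m (all helper names are placeholders):
>
> import java.util.Arrays;
> import java.util.Random;
>
> static double monteCarloPValue(double[] x, double[] y, int iterations) {
>     final Random rng = new Random();
>     final double observed = dStatistic(x, y);
>     // Pool the two samples; each iteration shuffles the pool and
>     // re-splits it into groups of the original sizes n and m.
>     double[] pooled = new double[x.length + y.length];
>     System.arraycopy(x, 0, pooled, 0, x.length);
>     System.arraycopy(y, 0, pooled, x.length, y.length);
>     int atLeastObserved = 0;
>     for (int iter = 0; iter < iterations; iter++) {
>         for (int k = pooled.length - 1; k > 0; k--) { // Fisher-Yates
>             int s = rng.nextInt(k + 1);
>             double tmp = pooled[k];
>             pooled[k] = pooled[s];
>             pooled[s] = tmp;
>         }
>         double[] u = Arrays.copyOfRange(pooled, 0, x.length);
>         double[] v = Arrays.copyOfRange(pooled, x.length, pooled.length);
>         // Counting splits with D >= observed gives the p-value directly
>         // (i.e. 1 minus the fraction that fall below the observed D).
>         if (dStatistic(u, v) >= observed) {
>             atLeastObserved++;
>         }
>     }
>     return (double) atLeastObserved / iterations;
> }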
>
> Thanks in advance for any feedback / pointers.
>
> Phil
>
> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
>
