# commons-dev mailing list archives

##### Site index · List index
Message view
Top
From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] Kolmogorov-Smirnov 2-sample test
Date Sat, 10 Aug 2013 16:16:16 GMT
```On 8/10/13 8:59 AM, Ajo Fod wrote:
> This depends on data size. If it fits in memory, a single pass through the
> sorted array to find the biggest differences would suffice.
>
> If the data doesn't fit, you probably need a StorelessQuantile estimator
> like QuantileBin1D from the colt libraries. Then pick a resolution and do
> the single pass search.

Thanks, Ajo.  I have no problem computing the D_n,m statistics.  My
problem is in computing the exact p-values for the test.  For that,
I need to compute the exact distribution of D_n,m.  Brute-forcing
requires that you examine every element of n + m choose n.  R seems
to use a clever approach, but there is no documentation in the R
sources on how the algorithm works.  Moreover, my first attempts at
Monte Carlo simulation don't give the same results.  Most likely, I
have not set the simulation up correctly.  Any better ideas or
references on how to compute the exact distribution would be
appreciated.

Phil
>
> Cheers,
> -Ajo
>
>
> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.steitz@gmail.com> wrote:
>
>> I am working on MATH-437 (turning K-S distribution into a proper K-S
>> test impl) and have to decide how to implement 2-sample tests.
>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
>> K-S distribution, so for large samples just using the cdf we already
>> have is appropriate.  For small samples (actually for any size
>> sample), the test statistic distribution is discrete and can be
>> computed exactly.  A brute force way to do that is to enumerate all
>> of the n-m partitions of {0, ..., n+m-1} and compute all the
>> possible D_n,m values.  R seems to use a more clever way to do
>> this.  Does anyone have a reference for an efficient way to compute
>> the exact distribution, or background on where R got their
>> implementation?
>>
>> Absent a "clever" approach, I see three alternatives and would
>> appreciate some feedback:
>>
>> 0) just use the asymptotic distribution even for small samples
>> 1) fully enumerate all n-m partitions and compute the D_n,m as above
>> 1) use a monte carlo approach - instead of full enumeration of the
>> D_n,m, randomly generate a large number of splits and compute the
>> p-value for observed D_n,m by computing the number of random n-m
>> splits generate a D value less than what is observed.
>>
>> Thanks in advance for any feedback / pointers.
>>
>> Phil
>>
>> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org