commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <>
Subject Re: [math] Kolmogorov-Smirnov 2-sample test
Date Sat, 10 Aug 2013 17:06:39 GMT
On 8/10/13 9:35 AM, Ajo Fod wrote:
> In:
> Take a look at 2 sample KS stats and the relationship to the 1 sample ...
> page 88. You already have the 1 sample table.

The problem is that I don't have the *exact* 1 sample table, or more
precisely the means to generate it.  What our current machinery
allows us to compute is an asymptotically valid approximation to the
distribution of D_n,m.  Theorem 2 on page 87 of the reference above
justifies the large sample approach; but it is an asymptotic
result.   Using it is fine for large n, m, but not so good for small
samples.  For that we need the exact, discrete distribution of
D_n,m.  Like other math stat references I have consulted, the one
above states that the discrete distribution has been tabulated for
small n,m and those tables are available in most math stat texts. 
What I need is the algorithm used to generate those tables.

> Cheers,
> Ajo
> On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz <> wrote:
>> On 8/10/13 8:59 AM, Ajo Fod wrote:
>>> This depends on data size. If it fits in memory, a single pass through
>> the
>>> sorted array to find the biggest differences would suffice.
>>> If the data doesn't fit, you probably need a StorelessQuantile estimator
>>> like QuantileBin1D from the colt libraries. Then pick a resolution and do
>>> the single pass search.
>> Thanks, Ajo.  I have no problem computing the D_n,m statistics.  My
>> problem is in computing the exact p-values for the test.  For that,
>> I need to compute the exact distribution of D_n,m.  Brute-forcing
>> requires that you examine every element of n + m choose n.  R seems
>> to use a clever approach, but there is no documentation in the R
>> sources on how the algorithm works.  Moreover, my first attempts at
>> Monte Carlo simulation don't give the same results.  Most likely, I
>> have not set the simulation up correctly.  Any better ideas or
>> references on how to compute the exact distribution would be
>> appreciated.
>> Phil
>>> Cheers,
>>> -Ajo
>>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <>
>> wrote:
>>>> I am working on MATH-437 (turning K-S distribution into a proper K-S
>>>> test impl) and have to decide how to implement 2-sample tests.
>>>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
>>>> K-S distribution, so for large samples just using the cdf we already
>>>> have is appropriate.  For small samples (actually for any size
>>>> sample), the test statistic distribution is discrete and can be
>>>> computed exactly.  A brute force way to do that is to enumerate all
>>>> of the n-m partitions of {0, ..., n+m-1} and compute all the
>>>> possible D_n,m values.  R seems to use a more clever way to do
>>>> this.  Does anyone have a reference for an efficient way to compute
>>>> the exact distribution, or background on where R got their
>>>> implementation?
>>>> Absent a "clever" approach, I see three alternatives and would
>>>> appreciate some feedback:
>>>> 0) just use the asymptotic distribution even for small samples
>>>> 1) fully enumerate all n-m partitions and compute the D_n,m as above
>>>> 1) use a monte carlo approach - instead of full enumeration of the
>>>> D_n,m, randomly generate a large number of splits and compute the
>>>> p-value for observed D_n,m by computing the number of random n-m
>>>> splits generate a D value less than what is observed.
>>>> Thanks in advance for any feedback / pointers.
>>>> Phil
>>>> [1]
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message