commons-dev mailing list archives

From Phil Steitz <>
Subject Re: [math] Kolmogorov-Smirnov 2-sample test
Date Sat, 10 Aug 2013 19:46:38 GMT
On 8/10/13 11:00 AM, Phil Steitz wrote:
> On 8/10/13 10:41 AM, Ajo Fod wrote:
>> This paper provides some approximations and situations where you could use
>> exact computations:
>> ... they even refer to a java program at the end of page 11.
> He he.  This is the reference that our current K-S impl is based on
> :)  So our impl correctly handles small samples for the 1-sample test.
> This paper does provide some references for exact computation of
> D_n.  Unfortunately, what I need is D_n,m for the two-sample test
> and while asymptotically D_n,m is K-S distributed, I don't think I
> can adapt exact algorithms for D_n to compute exact D_n,m.  I could
> be wrong here - would love to be shown how to do it.

Sorry, everywhere above I meant "the probability distribution of
D_n/D_n,m" in place of "D_n/D_n,m"

> Phil
>> Cheers,
>> Ajo.
>> On Sat, Aug 10, 2013 at 10:06 AM, Phil Steitz <> wrote:
>>> On 8/10/13 9:35 AM, Ajo Fod wrote:
>>>> In:
>>>> Take a look at 2 sample KS stats and the relationship to the 1 sample ...
>>>> page 88. You already have the 1 sample table.
>>> The problem is that I don't have the *exact* 1 sample table, or more
>>> precisely the means to generate it.  What our current machinery
>>> allows us to compute is an asymptotically valid approximation to the
>>> distribution of D_n,m.  Theorem 2 on page 87 of the reference above
>>> justifies the large sample approach; but it is an asymptotic
>>> result.   Using it is fine for large n, m, but not so good for small
>>> samples.  For that we need the exact, discrete distribution of
>>> D_n,m.  Like other math stat references I have consulted, the one
>>> above states that the discrete distribution has been tabulated for
>>> small n,m and those tables are available in most math stat texts.
>>> What I need is the algorithm used to generate those tables.
>>> Phil
>>>> Cheers,
>>>> Ajo
>>>> On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz <> wrote:
>>>>> On 8/10/13 8:59 AM, Ajo Fod wrote:
>>>>>> This depends on data size. If it fits in memory, a single pass
>>>>>> through the sorted array to find the biggest differences would
>>>>>> suffice.
>>>>>> If the data doesn't fit, you probably need a StorelessQuantile
>>>>>> estimator like QuantileBin1D from the colt libraries. Then pick a
>>>>>> resolution and do the single pass search.
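The single-pass idea above can be sketched as follows (illustrative code only, not the commons-math API; `KSStatistic` and `dStatistic` are hypothetical names):

```java
import java.util.Arrays;

public final class KSStatistic {

    /** Two-sample K-S statistic D_n,m = sup_t |F_n(t) - G_m(t)|,
     *  computed in one pass over the two sorted samples. */
    public static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            // advance both ECDFs past all points equal to t (handles ties)
            while (i < sx.length && sx[i] <= t) i++;
            while (j < sy.length && sy[j] <= t) j++;
            double diff = Math.abs((double) i / sx.length - (double) j / sy.length);
            if (diff > d) d = diff;
        }
        return d;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = {2.5, 3.5, 4.5, 5.5};
        System.out.println(dStatistic(a, b)); // prints 0.5
    }
}
```

Since |F - G| can only change at a jump of one of the ECDFs, checking the difference just after each pooled data point is enough.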
>>>>> Thanks, Ajo.  I have no problem computing the D_n,m statistics.  My
>>>>> problem is in computing the exact p-values for the test.  For that,
>>>>> I need to compute the exact distribution of D_n,m.  Brute-forcing
>>>>> requires that you examine all n + m choose n possible splits.  R seems
>>>>> to use a clever approach, but there is no documentation in the R
>>>>> sources on how the algorithm works.  Moreover, my first attempts at
>>>>> Monte Carlo simulation don't give the same results.  Most likely, I
>>>>> have not set the simulation up correctly.  Any better ideas or
>>>>> references on how to compute the exact distribution would be
>>>>> appreciated.
>>>>> Phil
>>>>>> Cheers,
>>>>>> -Ajo
>>>>>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <> wrote:
>>>>>>> I am working on MATH-437 (turning K-S distribution into a proper
>>>>>>> test impl) and have to decide how to implement 2-sample tests.
>>>>>>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has
>>>>>>> K-S distribution, so for large samples just using the cdf we
>>>>>>> have is appropriate.  For small samples (actually for any size
>>>>>>> sample), the test statistic distribution is discrete and can be
>>>>>>> computed exactly.  A brute force way to do that is to enumerate
>>>>>>> all of the n-m partitions of {0, ..., n+m-1} and compute all the
>>>>>>> possible D_n,m values.  R seems to use a more clever way to do
>>>>>>> this.  Does anyone have a reference for an efficient way to compute
>>>>>>> the exact distribution, or background on where R got their
>>>>>>> implementation?
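For what it's worth, when the pooled sample has no ties, each n-m split of the pooled order statistics corresponds to a monotone lattice path from (0,0) to (n,m), so the brute force just described can be written as a path enumeration. A sketch, feasible only for tiny n and m (`KSExactBruteForce`, `exactP`, and `count` are hypothetical names, not commons-math API):

```java
public final class KSExactBruteForce {

    /** Count monotone lattice paths from (i, j) to (n, m) along which
     *  max |i/n - j/m| >= d is reached at some visited point. */
    private static long count(int i, int j, int n, int m, double d, boolean hit) {
        if (Math.abs((double) i / n - (double) j / m) >= d - 1e-12) {
            hit = true;
        }
        if (i == n && j == m) {
            return hit ? 1L : 0L;
        }
        long paths = 0;
        if (i < n) paths += count(i + 1, j, n, m, d, hit);
        if (j < m) paths += count(i, j + 1, n, m, d, hit);
        return paths;
    }

    /** Exact P(D_n,m >= d), assuming no ties in the pooled sample. */
    public static double exactP(int n, int m, double d) {
        long favorable = count(0, 0, n, m, d, false);
        long total = 1;
        for (int k = 1; k <= n; k++) {
            total = total * (m + k) / k; // running binomial, ends at C(n+m, n)
        }
        return (double) favorable / total;
    }

    public static void main(String[] args) {
        // n = m = 2: D >= 1 holds only for the 2 fully separated splits
        // out of C(4, 2) = 6, so the exact p-value is 1/3.
        System.out.println(exactP(2, 2, 1.0));
    }
}
```

The recursion still visits every one of the C(n+m, n) paths, so this is only a reference implementation to check a cleverer method against.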
>>>>>>> Absent a "clever" approach, I see three alternatives and would
>>>>>>> appreciate some feedback:
>>>>>>> 0) just use the asymptotic distribution even for small samples
>>>>>>> 1) fully enumerate all n-m partitions and compute the D_n,m
>>>>>>> values, giving the exact distribution
>>>>>>> 2) use a monte carlo approach - instead of full enumeration of
>>>>>>> D_n,m, randomly generate a large number of splits and compute the
>>>>>>> p-value for the observed D_n,m as the proportion of random splits
>>>>>>> that generate a D value at least as large as what is observed.
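Alternative 2) might be sketched like this (illustrative only; `KSMonteCarlo`, `dStatistic`, and `pValue` are hypothetical names, and the shuffle-and-resplit scheme relies on exchangeability of the pooled sample under the null):

```java
import java.util.Arrays;
import java.util.Random;

public final class KSMonteCarlo {

    /** Two-sample K-S statistic D_n,m, one pass over the sorted samples. */
    static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < sx.length && j < sy.length) {
            double t = Math.min(sx[i], sy[j]);
            while (i < sx.length && sx[i] <= t) i++;
            while (j < sy.length && sy[j] <= t) j++;
            d = Math.max(d, Math.abs((double) i / sx.length - (double) j / sy.length));
        }
        return d;
    }

    /** Monte Carlo estimate of P(D_n,m >= d): repeatedly shuffle the
     *  pooled sample, re-split into sizes n and m, and count splits at
     *  least as extreme as the observed statistic. */
    static double pValue(double[] x, double[] y, double d, int iterations, Random rng) {
        int n = x.length;
        int m = y.length;
        double[] pooled = new double[n + m];
        System.arraycopy(x, 0, pooled, 0, n);
        System.arraycopy(y, 0, pooled, n, m);
        int extreme = 0;
        for (int it = 0; it < iterations; it++) {
            for (int k = pooled.length - 1; k > 0; k--) { // Fisher-Yates shuffle
                int r = rng.nextInt(k + 1);
                double tmp = pooled[k];
                pooled[k] = pooled[r];
                pooled[r] = tmp;
            }
            double[] a = Arrays.copyOfRange(pooled, 0, n);
            double[] b = Arrays.copyOfRange(pooled, n, n + m);
            if (dStatistic(a, b) >= d) {
                extreme++;
            }
        }
        return (double) extreme / iterations;
    }
}
```

Counting splits with D at least as large as the observed value estimates the upper-tail probability directly, which is the p-value of the test.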
>>>>>>> Thanks in advance for any feedback / pointers.
>>>>>>> Phil
>>>>>>> [1]

To unsubscribe, e-mail:
For additional commands, e-mail:
