On 8/10/13 11:00 AM, Phil Steitz wrote:
> On 8/10/13 10:41 AM, Ajo Fod wrote:
>> This paper provides some approximations and situations where you could use
>> exact computations:
>> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
>> ... they even refer to a java program at the end of page 11.
> He he. This is the reference that our current KS impl is based on
> :) So our impl correctly handles small samples for the 1sample test.
>
> This paper does provide some references for exact computation of
> D_n. Unfortunately, what I need is D_n,m for the twosample test
> and while asymptotically D_n,m is KS distributed, I don't think I
> can adapt exact algorithms for D_n to compute exact D_n,m. I could
> be wrong here  would love to be shown how to do it.
Sorry, everywhere above I meant "the probability distribution of
D_n/D_n,m" in place of "D_n/D_n,m"
Phil
>
> Phil
>> Cheers,
>> Ajo.
>>
>>
>> On Sat, Aug 10, 2013 at 10:06 AM, Phil Steitz <phil.steitz@gmail.com> wrote:
>>
>>> On 8/10/13 9:35 AM, Ajo Fod wrote:
>>>> In:
>>>>
>>> http://ocw.mit.edu/courses/mathematics/18443statisticsforapplicationsfall2006/lecturenotes/lecture14.pdf
>>>> Take a look at 2 sample KS stats and the relationship to the 1 sample ...
>>>> page 88. You already have the 1 sample table.
>>> The problem is that I don't have the *exact* 1 sample table, or more
>>> precisely the means to generate it. What our current machinery
>>> allows us to compute is an asymptotically valid approximation to the
>>> distribution of D_n,m. Theorem 2 on page 87 of the reference above
>>> justifies the large sample approach; but it is an asymptotic
>>> result. Using it is fine for large n, m, but not so good for small
>>> samples. For that we need the exact, discrete distribution of
>>> D_n,m. Like other math stat references I have consulted, the one
>>> above states that the discrete distribution has been tabulated for
>>> small n,m and those tables are available in most math stat texts.
>>> What I need is the algorithm used to generate those tables.
>>>
>>> Phil
>>>> Cheers,
>>>> Ajo
>>>>
>>>>
>>>> On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz <phil.steitz@gmail.com>
>>> wrote:
>>>>> On 8/10/13 8:59 AM, Ajo Fod wrote:
>>>>>> This depends on data size. If it fits in memory, a single pass through
>>>>> the
>>>>>> sorted array to find the biggest differences would suffice.
>>>>>>
>>>>>> If the data doesn't fit, you probably need a StorelessQuantile
>>> estimator
>>>>>> like QuantileBin1D from the colt libraries. Then pick a resolution
and
>>> do
>>>>>> the single pass search.
>>>>> Thanks, Ajo. I have no problem computing the D_n,m statistics. My
>>>>> problem is in computing the exact pvalues for the test. For that,
>>>>> I need to compute the exact distribution of D_n,m. Bruteforcing
>>>>> requires that you examine every element of n + m choose n. R seems
>>>>> to use a clever approach, but there is no documentation in the R
>>>>> sources on how the algorithm works. Moreover, my first attempts at
>>>>> Monte Carlo simulation don't give the same results. Most likely, I
>>>>> have not set the simulation up correctly. Any better ideas or
>>>>> references on how to compute the exact distribution would be
>>>>> appreciated.
>>>>>
>>>>> Phil
>>>>>> Cheers,
>>>>>> Ajo
>>>>>>
>>>>>>
>>>>>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.steitz@gmail.com>
>>>>> wrote:
>>>>>>> I am working on MATH437 (turning KS distribution into a proper
KS
>>>>>>> test impl) and have to decide how to implement 2sample tests.
>>>>>>> Asymptotically, the 2sample D_n,m test statistic (see [1]) has
a
>>>>>>> KS distribution, so for large samples just using the cdf we
already
>>>>>>> have is appropriate. For small samples (actually for any size
>>>>>>> sample), the test statistic distribution is discrete and can
be
>>>>>>> computed exactly. A brute force way to do that is to enumerate
all
>>>>>>> of the nm partitions of {0, ..., n+m1} and compute all the
>>>>>>> possible D_n,m values. R seems to use a more clever way to do
>>>>>>> this. Does anyone have a reference for an efficient way to compute
>>>>>>> the exact distribution, or background on where R got their
>>>>>>> implementation?
>>>>>>>
>>>>>>> Absent a "clever" approach, I see three alternatives and would
>>>>>>> appreciate some feedback:
>>>>>>>
>>>>>>> 0) just use the asymptotic distribution even for small samples
>>>>>>> 1) fully enumerate all nm partitions and compute the D_n,m as
above
>>>>>>> 1) use a monte carlo approach  instead of full enumeration of
the
>>>>>>> D_n,m, randomly generate a large number of splits and compute
the
>>>>>>> pvalue for observed D_n,m by computing the number of random
nm
>>>>>>> splits generate a D value less than what is observed.
>>>>>>>
>>>>>>> Thanks in advance for any feedback / pointers.
>>>>>>>
>>>>>>> Phil
>>>>>>>
>>>>>>> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
>>>>>>>
>>>>>>> 
>>>>>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>>>>>> For additional commands, email: devhelp@commons.apache.org
>>>>>>>
>>>>>>>
>>>>> 
>>>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>>>> For additional commands, email: devhelp@commons.apache.org
>>>>>
>>>>>
>>> 
>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>> For additional commands, email: devhelp@commons.apache.org
>>>
>>>

To unsubscribe, email: devunsubscribe@commons.apache.org
For additional commands, email: devhelp@commons.apache.org
