commons-dev mailing list archives

Site index · List index
Message view
Top
From Ajo Fod <ajo....@gmail.com>
Subject Re: [math] Kolmogorov-Smirnov 2-sample test
Date Sat, 10 Aug 2013 17:41:46 GMT
```This paper provides some approximations and situations where you could use
exact computations:
http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf
... they even refer to a java program at the end of page 11.

Cheers,
Ajo.

On Sat, Aug 10, 2013 at 10:06 AM, Phil Steitz <phil.steitz@gmail.com> wrote:

> On 8/10/13 9:35 AM, Ajo Fod wrote:
> > In:
> >
> http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf
> >
> > Take a look at 2 sample KS stats and the relationship to the 1 sample ...
> > page 88. You already have the 1 sample table.
>
> The problem is that I don't have the *exact* 1 sample table, or more
> precisely the means to generate it.  What our current machinery
> allows us to compute is an asymptotically valid approximation to the
> distribution of D_n,m.  Theorem 2 on page 87 of the reference above
> justifies the large sample approach; but it is an asymptotic
> result.   Using it is fine for large n, m, but not so good for small
> samples.  For that we need the exact, discrete distribution of
> D_n,m.  Like other math stat references I have consulted, the one
> above states that the discrete distribution has been tabulated for
> small n,m and those tables are available in most math stat texts.
> What I need is the algorithm used to generate those tables.
>
> Phil
> >
> > Cheers,
> > Ajo
> >
> >
> > On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz <phil.steitz@gmail.com>
> wrote:
> >
> >> On 8/10/13 8:59 AM, Ajo Fod wrote:
> >>> This depends on data size. If it fits in memory, a single pass through
> >> the
> >>> sorted array to find the biggest differences would suffice.
> >>>
> >>> If the data doesn't fit, you probably need a StorelessQuantile
> estimator
> >>> like QuantileBin1D from the colt libraries. Then pick a resolution and
> do
> >>> the single pass search.
> >> Thanks, Ajo.  I have no problem computing the D_n,m statistics.  My
> >> problem is in computing the exact p-values for the test.  For that,
> >> I need to compute the exact distribution of D_n,m.  Brute-forcing
> >> requires that you examine every element of n + m choose n.  R seems
> >> to use a clever approach, but there is no documentation in the R
> >> sources on how the algorithm works.  Moreover, my first attempts at
> >> Monte Carlo simulation don't give the same results.  Most likely, I
> >> have not set the simulation up correctly.  Any better ideas or
> >> references on how to compute the exact distribution would be
> >> appreciated.
> >>
> >> Phil
> >>> Cheers,
> >>> -Ajo
> >>>
> >>>
> >>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz <phil.steitz@gmail.com>
> >> wrote:
> >>>> I am working on MATH-437 (turning K-S distribution into a proper K-S
> >>>> test impl) and have to decide how to implement 2-sample tests.
> >>>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a
> >>>> K-S distribution, so for large samples just using the cdf we already
> >>>> have is appropriate.  For small samples (actually for any size
> >>>> sample), the test statistic distribution is discrete and can be
> >>>> computed exactly.  A brute force way to do that is to enumerate all
> >>>> of the n-m partitions of {0, ..., n+m-1} and compute all the
> >>>> possible D_n,m values.  R seems to use a more clever way to do
> >>>> this.  Does anyone have a reference for an efficient way to compute
> >>>> the exact distribution, or background on where R got their
> >>>> implementation?
> >>>>
> >>>> Absent a "clever" approach, I see three alternatives and would
> >>>> appreciate some feedback:
> >>>>
> >>>> 0) just use the asymptotic distribution even for small samples
> >>>> 1) fully enumerate all n-m partitions and compute the D_n,m as above
> >>>> 1) use a monte carlo approach - instead of full enumeration of the
> >>>> D_n,m, randomly generate a large number of splits and compute the
> >>>> p-value for observed D_n,m by computing the number of random n-m
> >>>> splits generate a D value less than what is observed.
> >>>>
> >>>> Thanks in advance for any feedback / pointers.
> >>>>
> >>>> Phil
> >>>>
> >>>> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>>
> >>>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >> For additional commands, e-mail: dev-help@commons.apache.org
> >>
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

```
Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message