Return-Path: X-Original-To: apmail-commons-dev-archive@www.apache.org Delivered-To: apmail-commons-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 05C40104CD for ; Sat, 10 Aug 2013 19:47:08 +0000 (UTC) Received: (qmail 24794 invoked by uid 500); 10 Aug 2013 19:47:07 -0000 Delivered-To: apmail-commons-dev-archive@commons.apache.org Received: (qmail 24671 invoked by uid 500); 10 Aug 2013 19:47:06 -0000 Mailing-List: contact dev-help@commons.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Commons Developers List" Delivered-To: mailing list dev@commons.apache.org Received: (qmail 24663 invoked by uid 99); 10 Aug 2013 19:47:06 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 10 Aug 2013 19:47:06 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of phil.steitz@gmail.com designates 209.85.220.46 as permitted sender) Received: from [209.85.220.46] (HELO mail-pa0-f46.google.com) (209.85.220.46) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 10 Aug 2013 19:47:01 +0000 Received: by mail-pa0-f46.google.com with SMTP id fa1so4202507pad.5 for ; Sat, 10 Aug 2013 12:46:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:date:from:user-agent:mime-version:to:subject:references :in-reply-to:content-type:content-transfer-encoding; bh=M1uYqp/EQNZu465PuCFfbvBXzykzBikD6Cn9Adesik8=; b=jmlwjByqTAX5hJebyZS7cJx0N/YNOfUOb//l/ZzB6qgkquibYzHcNPl9g+HVgWTb8v Ac0kXDofZwVIyIOFFn8V+Dc3Va/t849iStgN3GLDOwDhzOPkyNxWuQx2sHnWsz+FJmir jSBc8ItnOv5Bkitf/2MfadpRxXudY+lZJpmQ52g+cnzfwo137AdnXyCzDkA4qQXjtlfg rW7qytadzCxsIvCVRmiRIjLkNvhpgQ0HLMXaH6rsqxU4hI/alArLbgF1YE2LuAgOgVDt dl0HPSi/Eey+Y/yPtu8p1Zff7Exc7Jsv47YDjylgC9ylrWY2g2kqEkoda9MZwqC6WQ5E 93kw== X-Received: by 10.66.187.34 with SMTP id fp2mr6622678pac.12.1376164001332; Sat, 10 Aug 2013 12:46:41 -0700 (PDT) Received: from [192.168.2.107] (ip72-208-109-243.ph.ph.cox.net. [72.208.109.243]) by mx.google.com with ESMTPSA id xl3sm27456519pbb.17.2013.08.10.12.46.39 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sat, 10 Aug 2013 12:46:40 -0700 (PDT) Message-ID: <5206989E.3060004@gmail.com> Date: Sat, 10 Aug 2013 12:46:38 -0700 From: Phil Steitz User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/20130801 Thunderbird/17.0.8 MIME-Version: 1.0 To: Commons Developers List Subject: Re: [math] Kolmogorov-Smirnov 2-sample test References: <51EAC276.3030808@gmail.com> <52066750.5060306@gmail.com> <5206731F.9090307@gmail.com> <52067FA3.3010609@gmail.com> In-Reply-To: <52067FA3.3010609@gmail.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org On 8/10/13 11:00 AM, Phil Steitz wrote: > On 8/10/13 10:41 AM, Ajo Fod wrote: >> This paper provides some approximations and situations where you could use >> exact computations: >> http://www.iro.umontreal.ca/~lecuyer/myftp/papers/ksdist.pdf >> ... they even refer to a java program at the end of page 11. > He he. This is the reference that our current K-S impl is based on > :) So our impl correctly handles small samples for the 1-sample test. > > This paper does provide some references for exact computation of > D_n. Unfortunately, what I need is D_n,m for the two-sample test > and while asymptotically D_n,m is K-S distributed, I don't think I > can adapt exact algorithms for D_n to compute exact D_n,m. I could > be wrong here - would love to be shown how to do it. Sorry, everywhere above I meant "the probability distribution of D_n/D_n,m" in place of "D_n/D_n,m" Phil > > Phil >> Cheers, >> Ajo. >> >> >> On Sat, Aug 10, 2013 at 10:06 AM, Phil Steitz wrote: >> >>> On 8/10/13 9:35 AM, Ajo Fod wrote: >>>> In: >>>> >>> http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf >>>> Take a look at 2 sample KS stats and the relationship to the 1 sample ... >>>> page 88. You already have the 1 sample table. >>> The problem is that I don't have the *exact* 1 sample table, or more >>> precisely the means to generate it. What our current machinery >>> allows us to compute is an asymptotically valid approximation to the >>> distribution of D_n,m. Theorem 2 on page 87 of the reference above >>> justifies the large sample approach; but it is an asymptotic >>> result. Using it is fine for large n, m, but not so good for small >>> samples. For that we need the exact, discrete distribution of >>> D_n,m. Like other math stat references I have consulted, the one >>> above states that the discrete distribution has been tabulated for >>> small n,m and those tables are available in most math stat texts. >>> What I need is the algorithm used to generate those tables. >>> >>> Phil >>>> Cheers, >>>> Ajo >>>> >>>> >>>> On Sat, Aug 10, 2013 at 9:16 AM, Phil Steitz >>> wrote: >>>>> On 8/10/13 8:59 AM, Ajo Fod wrote: >>>>>> This depends on data size. If it fits in memory, a single pass through >>>>> the >>>>>> sorted array to find the biggest differences would suffice. >>>>>> >>>>>> If the data doesn't fit, you probably need a StorelessQuantile >>> estimator >>>>>> like QuantileBin1D from the colt libraries. Then pick a resolution and >>> do >>>>>> the single pass search. >>>>> Thanks, Ajo. I have no problem computing the D_n,m statistics. My >>>>> problem is in computing the exact p-values for the test. For that, >>>>> I need to compute the exact distribution of D_n,m. Brute-forcing >>>>> requires that you examine every element of n + m choose n. R seems >>>>> to use a clever approach, but there is no documentation in the R >>>>> sources on how the algorithm works. Moreover, my first attempts at >>>>> Monte Carlo simulation don't give the same results. Most likely, I >>>>> have not set the simulation up correctly. Any better ideas or >>>>> references on how to compute the exact distribution would be >>>>> appreciated. >>>>> >>>>> Phil >>>>>> Cheers, >>>>>> -Ajo >>>>>> >>>>>> >>>>>> On Sat, Jul 20, 2013 at 10:01 AM, Phil Steitz >>>>> wrote: >>>>>>> I am working on MATH-437 (turning K-S distribution into a proper K-S >>>>>>> test impl) and have to decide how to implement 2-sample tests. >>>>>>> Asymptotically, the 2-sample D_n,m test statistic (see [1]) has a >>>>>>> K-S distribution, so for large samples just using the cdf we already >>>>>>> have is appropriate. For small samples (actually for any size >>>>>>> sample), the test statistic distribution is discrete and can be >>>>>>> computed exactly. A brute force way to do that is to enumerate all >>>>>>> of the n-m partitions of {0, ..., n+m-1} and compute all the >>>>>>> possible D_n,m values. R seems to use a more clever way to do >>>>>>> this. Does anyone have a reference for an efficient way to compute >>>>>>> the exact distribution, or background on where R got their >>>>>>> implementation? >>>>>>> >>>>>>> Absent a "clever" approach, I see three alternatives and would >>>>>>> appreciate some feedback: >>>>>>> >>>>>>> 0) just use the asymptotic distribution even for small samples >>>>>>> 1) fully enumerate all n-m partitions and compute the D_n,m as above >>>>>>> 1) use a monte carlo approach - instead of full enumeration of the >>>>>>> D_n,m, randomly generate a large number of splits and compute the >>>>>>> p-value for observed D_n,m by computing the number of random n-m >>>>>>> splits generate a D value less than what is observed. >>>>>>> >>>>>>> Thanks in advance for any feedback / pointers. >>>>>>> >>>>>>> Phil >>>>>>> >>>>>>> [1] http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org >>>>>>> For additional commands, e-mail: dev-help@commons.apache.org >>>>>>> >>>>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org >>>>> For additional commands, e-mail: dev-help@commons.apache.org >>>>> >>>>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org >>> For additional commands, e-mail: dev-help@commons.apache.org >>> >>> --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org For additional commands, e-mail: dev-help@commons.apache.org