commons-issues mailing list archives

From "Phil Steitz (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MATH-1246) Kolmogorov-Smirnov 2-sample test does not correctly handle ties
Date Tue, 07 Jul 2015 03:59:04 GMT

    [ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616134#comment-14616134 ]

Phil Steitz edited comment on MATH-1246 at 7/7/15 3:59 AM:
-----------------------------------------------------------

I think the current implementation can be fixed as follows.  If we move to a faster implementation,
the strategy below may not work.

What exactP does now is exhaustively compute the D-statistics for all m-set / n-set partitions
of a set of m+n elements and tally the number that exceed (strict) or are at least as large as
(not strict) the observed D.  If there are ties in the data, this is not correct: not all
partitions of an (m+n)-element multi-set with duplicates are distinct, and the set of possible
D values is different in the presence of ties.  I think we can correctly handle ties if we
compute and tally D-statistics based on a combined multi-set sample, with duplicates in the
positions corresponding to what is observed in the data.

For example, suppose that the two samples are x = [0, 3, 6, 9, 9, 10] and y = [1, 3, 4, 8, 11].
Then the multi-set universe is U = [0, 1, 3, 3, 4, 6, 8, 9, 9, 10, 11].  As before, we generate
partitions of the 11 positions into a 6-set and a 5-set, but instead of computing the D-statistic
on the generated subsets themselves, we use them as (0-based) indexes into U.  So if a generated
split is mSet = [0, 2, 3, 7, 8, 9], nSet = [1, 4, 5, 6, 10], we compute D for
[0, 3, 3, 9, 9, 10] and [1, 4, 6, 8, 11].

The rationale here is that the p-value is the probability that, if U is split randomly into
a 6-set and a 5-set, the resulting D-value exceeds (strict) or is at least as large as
(not strict) the observed D.
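
To make the proposal concrete, here is a minimal, self-contained sketch of the tally described
above.  It is not the Commons Math code: the class TiesExactPSketch and the methods ksNumerator
and exactPWithTies are hypothetical names chosen for illustration, the index splits are
enumerated with a plain bitmask loop rather than a combinations iterator, and D is kept as the
integer m*n*D so that "equal to the observed D" is an exact comparison.  It assumes small
samples (m+n <= 25) with no NaN values.

```java
import java.util.Arrays;

/** Illustrative sketch only; the names here are hypothetical, not Commons Math API. */
final class TiesExactPSketch {

    /**
     * Returns m*n*D for two ascending-sorted samples, where D = sup |F_a(t) - F_b(t)|.
     * Tied values are consumed from both samples before the ECDFs are compared.
     * Assumes neither array contains NaN.
     */
    static long ksNumerator(double[] a, double[] b) {
        final int m = a.length;
        final int n = b.length;
        int i = 0;
        int j = 0;
        long max = 0;
        while (i < m || j < n) {
            // Next distinct value in the merged order.
            final double t = (j >= n || (i < m && a[i] <= b[j])) ? a[i] : b[j];
            while (i < m && a[i] == t) {
                i++;
            }
            while (j < n && b[j] == t) {
                j++;
            }
            // |i/m - j/n| scaled by m*n, kept in integer arithmetic.
            max = Math.max(max, Math.abs((long) i * n - (long) j * m));
        }
        return max;
    }

    /**
     * Fraction of splits of the combined multi-set U into an m-set and an n-set whose
     * D exceeds (strict) or is at least (not strict) the observed D.
     */
    static double exactPWithTies(double[] x, double[] y, boolean strict) {
        final int m = x.length;
        final int n = y.length;
        final int total = m + n;
        if (m == 0 || n == 0 || total > 25) {
            throw new IllegalArgumentException("sketch supports 1 <= m, n and m + n <= 25");
        }
        final double[] xs = x.clone();
        final double[] ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        final long observed = ksNumerator(xs, ys);

        // Combined multi-set U, sorted, duplicates retained.
        final double[] u = new double[total];
        System.arraycopy(xs, 0, u, 0, m);
        System.arraycopy(ys, 0, u, m, n);
        Arrays.sort(u);

        final double[] a = new double[m];
        final double[] b = new double[n];
        long tally = 0;
        long splits = 0;
        for (int mask = 0; mask < (1 << total); mask++) {
            if (Integer.bitCount(mask) != m) {
                continue; // keep only index sets of size m
            }
            int ia = 0;
            int ib = 0;
            for (int k = 0; k < total; k++) {
                if ((mask & (1 << k)) != 0) {
                    a[ia++] = u[k]; // index k belongs to mSet
                } else {
                    b[ib++] = u[k]; // index k belongs to nSet
                }
            }
            final long cur = ksNumerator(a, b);
            if (cur > observed || (!strict && cur == observed)) {
                tally++;
            }
            splits++;
        }
        return (double) tally / splits;
    }

    public static void main(String[] args) {
        final double[] x = {0, 3, 6, 9, 9, 10};
        final double[] y = {1, 3, 4, 8, 11};
        System.out.println("m*n*D observed = " + ksNumerator(x, y));
        System.out.println("p (strict)     = " + exactPWithTies(x, y, true));
        System.out.println("p (not strict) = " + exactPWithTies(x, y, false));
    }
}
```

For the example samples the observed value is m*n*D = 9, i.e. D = 9/30 = 0.3, and the split
mSet = [0, 2, 3, 7, 8, 9], nSet = [1, 4, 5, 6, 10] gives the same value.  Because indexes are
taken in ascending order from the sorted U, each generated subsample is already sorted.
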



> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> ---------------------------------------------------------------
>
>                 Key: MATH-1246
>                 URL: https://issues.apache.org/jira/browse/MATH-1246
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution
> of a D-statistic for m-n sets with no ties.  No warning or special handling is delivered in
> the presence of ties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
