[ https://issues.apache.org/jira/browse/MATH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616134#comment-14616134
]
Phil Steitz edited comment on MATH-1246 at 7/7/15 3:59 AM:

I think the current implementation can be fixed as follows. If we move to a faster implementation,
the strategy below may not work.
What exactP does now is exhaustively compute all possible D-statistics for all m-set /
n-set partitions of m+n and simply tally the number that exceed (strict) or are at least as large as
(not strict) the observed D. If there are ties in the data, it is not correct to look at
partitions of m+n, since not all partitions of an m+n set with duplicates are distinct and
the set of possible D values is different in the presence of ties. I think we can correctly
handle ties in the data if we compute and tally D-statistics based on a combined multiset
sample with duplicates in the positions corresponding to what is observed in the data. For
example, suppose that the two samples are x = [0, 3, 6, 9, 9, 10] and y = [1, 3, 4, 8, 11].
Then the multiset universe is U = [0, 1, 3, 3, 4, 6, 8, 9, 9, 10, 11]. As before, we generate
partitions of 11 into a 6-set and a 5-set, but instead of computing the D-statistics on the
subsets of 11 themselves, we use them as indexes into U. So if a generated split is mSet = [0, 2, 3,
7, 8, 9], nSet = [1, 4, 5, 6, 10], we compute D for [0, 3, 3, 9, 9, 10] and [1, 4, 6, 8, 11].
The rationale here is that the p-value is the probability that, if U is split randomly into
a 5-set and a 6-set, the D-value exceeds the observed d.
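
A minimal Java sketch of the strategy described above, under stated assumptions. It is
not the Commons Math implementation: the class TiesExactPSketch and the helpers scaledD
and nextCombination are hypothetical names introduced only for illustration. As a further
assumption, D is compared in integer-scaled form (m * n * D) so that the strict and
non-strict tallies do not depend on floating-point equality.

```java
import java.util.Arrays;

final class TiesExactPSketch {

    /** Returns m * n * D for two SORTED samples, where D = sup |F_x(t) - F_y(t)|. */
    static long scaledD(double[] x, double[] y) {
        int i = 0, j = 0;
        long max = 0;
        while (i < x.length && j < y.length) {
            double t = Math.min(x[i], y[j]);
            // Step past every value equal to t in both samples, so tied
            // values move both empirical CDFs before the gap is measured.
            while (i < x.length && x[i] == t) i++;
            while (j < y.length && y[j] == t) j++;
            max = Math.max(max, Math.abs((long) i * y.length - (long) j * x.length));
        }
        return max; // once one sample is exhausted the gap can only shrink
    }

    /** Advances idx to the next m-subset of {0..size-1}; false when none remain. */
    static boolean nextCombination(int[] idx, int size) {
        int m = idx.length;
        int k = m - 1;
        while (k >= 0 && idx[k] == size - m + k) k--;
        if (k < 0) return false;
        idx[k]++;
        for (int j = k + 1; j < m; j++) idx[j] = idx[j - 1] + 1;
        return true;
    }

    /** Fraction of index splits of the pooled multiset U whose D exceeds (or reaches) the observed D. */
    static double exactP(double[] x, double[] y, boolean strict) {
        int m = x.length, n = y.length;
        double[] u = new double[m + n];          // pooled multiset U, duplicates kept
        System.arraycopy(x, 0, u, 0, m);
        System.arraycopy(y, 0, u, m, n);
        Arrays.sort(u);
        double[] xs = x.clone(), ys = y.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        long observed = scaledD(xs, ys);

        int[] idx = new int[m];                  // current m-set of indexes into U
        for (int k = 0; k < m; k++) idx[k] = k;
        long tail = 0, total = 0;
        do {
            double[] a = new double[m], b = new double[n];
            int ai = 0, bi = 0, next = 0;
            for (int k = 0; k < m + n; k++) {    // U is sorted, so a and b stay sorted
                if (next < m && idx[next] == k) { a[ai++] = u[k]; next++; }
                else b[bi++] = u[k];
            }
            long d = scaledD(a, b);
            if (strict ? d > observed : d >= observed) tail++;
            total++;
        } while (nextCombination(idx, m + n));
        return (double) tail / total;
    }

    public static void main(String[] args) {
        double[] x = {0, 3, 6, 9, 9, 10};
        double[] y = {1, 3, 4, 8, 11};
        System.out.println("strict p     = " + exactP(x, y, true));
        System.out.println("non-strict p = " + exactP(x, y, false));
    }
}
```

Like the exhaustive approach described in the comment, this enumerates every m-subset of
indexes, so it is only practical for small samples.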
was (Author: psteitz):
I think the current implementation can be fixed as follows. If we move to a faster implementation,
the strategy below may not work.
What exactP does now is exhaustively compute all possible D-statistics for all m-set /
n-set partitions of m+n and simply tally the number that exceed (strict) or are at least as large as
(not strict) the observed D. If there are ties in the data, it is not correct to look at
partitions of m+n, since not all partitions of an m+n set with duplicates are distinct and
the set of possible D values is different in the presence of ties. I think we can correctly
handle ties in the data if we compute and tally D-statistics based on a combined multiset
sample with duplicates in the positions corresponding to what is observed in the data. For
example, suppose that the two samples are x = [0, 3, 6, 9, 9, 10] and y = [1, 3, 4, 8, 11].
Then the multiset universe is U = {0, 1, 3, 3, 4, 6, 8, 9, 9, 10, 11}. As before, we generate
partitions of 11 into a 6-set and a 5-set, but instead of computing the D-statistics on the
subsets of 11 themselves, we use them as indexes into U. So if a generated split is mSet = {0, 2, 3,
7, 8, 9}, nSet = {1, 4, 5, 6, 10}, we compute D for [0, 3, 3, 9, 9, 10] and [1, 4, 6, 8, 11].
The rationale here is that the p-value is the probability that, if U is split randomly into
a 5-set and a 6-set, the D-value exceeds the observed d.
> Kolmogorov-Smirnov 2-sample test does not correctly handle ties
> 
>
> Key: MATH-1246
> URL: https://issues.apache.org/jira/browse/MATH-1246
> Project: Commons Math
> Issue Type: Bug
> Reporter: Phil Steitz
>
> For small samples, KolmogorovSmirnovTest(double[], double[]) computes the distribution
of a D-statistic for m-n sets with no ties. No warning or special handling is provided in
the presence of ties.

