commons-issues mailing list archives

From "Phil Steitz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-1131) Kolmogorov-Smirnov Tests takes 'forever' on 10,000 item dataset
Date Sat, 28 Jun 2014 22:56:24 GMT

    [ https://issues.apache.org/jira/browse/MATH-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046995#comment-14046995 ]

Phil Steitz commented on MATH-1131:
-----------------------------------

A few comments not directly related to the performance issue, but likely relevant to the
OP and anyone using KolmogorovSmirnovTest to evaluate the null hypothesis that a sample comes
from a normal (Gaussian) distribution:

1. The KS test using parameters estimated from the data is in general not the best test to
use to test normality.  We do not currently implement the Lilliefors or other tests.  Patches
welcome :)  (Discuss first on the mailing list, then open separate tickets for these if interested.)
2.  *No* classical frequentist test really works for large samples.  KS, Lilliefors, Shapiro-Wilk
et al. are uniformly too powerful to be meaningful for samples even as small as 5000 observations:
they reject normality for departures too small to matter in practice.  See, e.g., [1].
3.  An interesting alternative for large samples is [2].   Here again, patches welcome.  A
similar approach implementable using Commons Math version 3.x would be to bin the data in
standard deviation units and then apply a G-test with expected counts computed using quantiles
of the normal distribution.
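
The binning approach in item 3 can be sketched roughly as follows. This is only
an illustration under assumptions, not a Commons Math API: the class name
BinnedNormalGTest, the bin edges at -2..2 standard deviations, and the
hand-rolled normal CDF (the Abramowitz-Stegun 7.1.26 erf approximation) are all
choices made here so that the example runs on plain Java. With Commons Math 3.x
on the classpath, NormalDistribution.cumulativeProbability and the GTest class
in org.apache.commons.math3.stat.inference would be the natural replacements
for phi and gStatistic.

```java
import java.util.Arrays;
import java.util.Random;

public class BinnedNormalGTest {

    // Bin edges in standard-deviation units (an assumed, illustrative choice).
    public static final double[] EDGES = {-2.0, -1.0, 0.0, 1.0, 2.0};

    /** Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation. */
    public static double phi(double z) {
        double x = Math.abs(z) / Math.sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        double erf = 1.0 - poly * Math.exp(-x * x);
        return z >= 0 ? 0.5 * (1.0 + erf) : 0.5 * (1.0 - erf);
    }

    /** Counts observations per bin after standardizing with the sample mean and SD. */
    public static long[] observedCounts(double[] data) {
        double mean = Arrays.stream(data).average().orElse(Double.NaN);
        double ss = 0.0;
        for (double v : data) {
            ss += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(ss / (data.length - 1));
        long[] counts = new long[EDGES.length + 1];
        for (double v : data) {
            double z = (v - mean) / sd;
            int bin = 0;
            while (bin < EDGES.length && z >= EDGES[bin]) {
                bin++;
            }
            counts[bin]++;
        }
        return counts;
    }

    /** Expected counts per bin for n observations under a normal distribution. */
    public static double[] expectedCounts(int n) {
        double[] expected = new double[EDGES.length + 1];
        double lower = 0.0;
        for (int i = 0; i < EDGES.length; i++) {
            double upper = phi(EDGES[i]);
            expected[i] = n * (upper - lower);
            lower = upper;
        }
        expected[EDGES.length] = n * (1.0 - lower);
        return expected;
    }

    /** G = 2 * sum(O_i * ln(O_i / E_i)); empty bins contribute zero. */
    public static double gStatistic(double[] expected, long[] observed) {
        double g = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] > 0) {
                g += observed[i] * Math.log(observed[i] / expected[i]);
            }
        }
        return 2.0 * g;
    }

    public static void main(String[] args) {
        // Synthetic data standing in for the 10,000 item dataset.
        Random rng = new Random(42L);
        double[] data = new double[10000];
        for (int i = 0; i < data.length; i++) {
            data[i] = 5.0 + 2.0 * rng.nextGaussian();
        }
        long[] observed = observedCounts(data);
        double[] expected = expectedCounts(data.length);
        System.out.printf("G = %.4f%n", gStatistic(expected, observed));
    }
}
```

The resulting statistic would then be referred to a chi-square distribution;
how many degrees of freedom to subtract for the estimated mean and standard
deviation is left open in this sketch. The cost is linear in the sample size,
so a 10,000 item dataset is not a problem.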

[1] http://www.statisticalmisses.nl/index.php/frequently-asked-questions/77-what-is-wrong-with-tests-of-normality
[2] https://ideals.illinois.edu/bitstream/handle/2142/29878/largesamplenorma93171bera.pdf

> Kolmogorov-Smirnov Tests takes 'forever' on 10,000 item dataset
> ---------------------------------------------------------------
>
>                 Key: MATH-1131
>                 URL: https://issues.apache.org/jira/browse/MATH-1131
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.3
>         Environment: Java 8
>            Reporter: Schalk W. Cronjé
>         Attachments: 1.txt, MATH-1131.patch, ReproduceKsIssue.groovy, ReproduceKsIssue.java
>
>
> I have code simplified to the following:
>     KolmogorovSmirnovTest kst = new KolmogorovSmirnovTest();
>     NormalDistribution nd = new NormalDistribution(mean,stddev);
>     kst.kolmogorovSmirnovTest(nd,dataset);
> I find that for my dataset of 10,000 items, the call to kolmogorovSmirnovTest takes 'forever'.
> It has not returned after nearly 15 minutes, and in one of my tests it has gone over 150MB in
> memory usage.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
