commons-issues mailing list archives

From "Anders Conbere (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MATH-1140) Incorrect result from MannWhitneyUTest#mannWhitneyUTest with large datasets
Date Mon, 11 Aug 2014 18:28:13 GMT

    [ https://issues.apache.org/jira/browse/MATH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093104#comment-14093104 ]

Anders Conbere commented on MATH-1140:
--------------------------------------

I found the actual source of the issue I'm experiencing: an integer overflow when calculating
U1 in mannWhitneyU, where the two array lengths are multiplied together. Since array lengths
are ints, the product wraps around once it exceeds Integer.MAX_VALUE. For two arrays of equal
length, that caps the usable input size at about Math.sqrt(Integer.MAX_VALUE), roughly 46,340
elements, which is quite small. I would recommend casting the lengths to long or double before
multiplying to improve usability, or checking the maximum array length early on and failing fast.
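
To illustrate the overflow described above, here is a minimal sketch in plain Java. It is not
the Commons Math source: the class and method names are hypothetical, and it models only the
multiplication of two int lengths, which is the step I believe goes wrong.

```java
/** Hypothetical sketch of the length product; not the Commons Math implementation. */
final class U1OverflowSketch {

    /** Product in int arithmetic: silently wraps around past Integer.MAX_VALUE. */
    static int intProduct(int n1, int n2) {
        return n1 * n2;
    }

    /** Product in long arithmetic: one operand is widened before multiplying. */
    static long longProduct(int n1, int n2) {
        return (long) n1 * n2;
    }

    /** Fail-fast variant: throws ArithmeticException if the product does not fit in an int. */
    static int checkedProduct(int n1, int n2) {
        return Math.multiplyExact(n1, n2);
    }

    public static void main(String[] args) {
        int n = 50_000; // array length used in the report
        System.out.println("int product:  " + intProduct(n, n));
        System.out.println("long product: " + longProduct(n, n));
        try {
            checkedProduct(n, n);
        } catch (ArithmeticException e) {
            System.out.println("checked product failed: " + e.getMessage());
        }
    }
}
```

With two 50k-element arrays the int product wraps to a negative number, while the long product
is the expected 2,500,000,000. Widening keeps large inputs usable; the checked variant is the
"assert early" alternative if larger inputs are not meant to be supported.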

> Incorrect result from MannWhitneyUTest#mannWhitneyUTest with large datasets
> ---------------------------------------------------------------------------
>
>                 Key: MATH-1140
>                 URL: https://issues.apache.org/jira/browse/MATH-1140
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.3
>            Reporter: Anders Conbere
>            Priority: Minor
>
> On large datasets, MannWhitneyUTest#mannWhitneyUTest returns the double value 0.0 instead
> of the correct p-value. I suspect this is an overflow but haven't been able to track it down
> yet.
> I'm afraid I'm not very good at Java, but I'm including a link to a public repository
> where you can reproduce the issue. Unfortunately, my implementation is written in Clojure.
> https://github.com/aconbere/apache-commons-mann-whitney-bug
> In summary, calling MannWhitneyUTest#mannWhitneyUTest with two randomly generated
> arrays (50k elements with a max value of 300) reliably reproduces the result 0.0.
> Reducing the size to something more modest, like 2k, gives correct p-value calculations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
