spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)"
Subject [jira] [Created] (SPARK-6311) ChiSqTest should check for too few counts
Date Fri, 13 Mar 2015 01:19:38 GMT
Joseph K. Bradley created SPARK-6311:

             Summary: ChiSqTest should check for too few counts
                 Key: SPARK-6311
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.2.0
            Reporter: Joseph K. Bradley

ChiSqTest assumes that the entries of the contingency matrix are large enough (have enough counts)
for the central limit theorem to apply.  It would be reasonable to do one or more of the following:
* Add a note in the docs about making sure there are a reasonable number of instances being
used (or counts in the contingency table entries, to be more precise and account for skewed
category distributions).
* Add a check in the code which could:
** Log a warning message
** Alter the p-value to make sure it indicates the test result is insignificant
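
A minimal sketch of the "log a warning" option, written in plain Python rather than against the MLlib code base. It computes the expected cell counts under independence and warns when any of them falls below a threshold. The names expected_counts, low_count_cells, and check_counts are hypothetical, and the threshold of 5 is the common rule of thumb for chi-squared tests, an assumption here rather than something this issue specifies.

```python
import logging

logger = logging.getLogger("chisq_count_check")

# Assumed rule-of-thumb threshold; the issue does not specify a value.
MIN_EXPECTED_COUNT = 5.0


def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("contingency table has no counts")
    return [[r * c / total for c in col_totals] for r in row_totals]


def low_count_cells(table, min_expected=MIN_EXPECTED_COUNT):
    """Return (row, col, expected) for every cell whose expected count is too small."""
    expected = expected_counts(table)
    return [
        (i, j, e)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < min_expected
    ]


def check_counts(table, min_expected=MIN_EXPECTED_COUNT):
    """Log a warning if any expected count is below the threshold.

    Returns True when every cell passes the check, False otherwise.
    """
    cells = low_count_cells(table, min_expected)
    if cells:
        logger.warning(
            "%d contingency table cell(s) have expected count < %s; "
            "the chi-squared approximation may be unreliable",
            len(cells),
            min_expected,
        )
    return not cells
```

A caller could run check_counts on the contingency table before computing the test statistic and, for the second option in the list, choose to report the result as insignificant when the check fails.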
