spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-6311) ChiSqTest should check for too few counts
Date Fri, 13 Mar 2015 01:19:38 GMT
Joseph K. Bradley created SPARK-6311:
----------------------------------------

             Summary: ChiSqTest should check for too few counts
                 Key: SPARK-6311
                 URL: https://issues.apache.org/jira/browse/SPARK-6311
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.2.0
            Reporter: Joseph K. Bradley


ChiSqTest assumes that the entries of the contingency matrix are large enough (have enough
counts) for the central limit theorem to apply, so that the chi-squared approximation holds.
It would be reasonable to do one or more of the following:
* Add a note in the docs about making sure a reasonable number of instances is used (or, to
be more precise and to account for skewed category distributions, a reasonable number of
counts in each contingency table entry).
* Add a check in the code which could:
** Log a warning message
** Alter the p-value to make sure it indicates the test result is insignificant
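
A minimal sketch of what the proposed check could look like, written in plain Python rather than against the MLlib code. Everything named here is hypothetical: `expected_counts`, `has_too_few_counts`, and the threshold of 5 (a common rule of thumb for expected cell counts) are illustrative assumptions, not part of ChiSqTest and not specified in this issue.

```python
import logging
from typing import List

logger = logging.getLogger("chi_sq_sketch")

# Assumption: 5 is a conventional rule-of-thumb minimum expected cell count.
# The issue itself does not name a threshold.
MIN_EXPECTED_COUNT = 5.0


def expected_counts(observed: List[List[float]]) -> List[List[float]]:
    """Expected cell counts under independence: row_total * col_total / total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if total <= 0:
        raise ValueError("contingency table must have a positive total count")
    return [[r * c / total for c in col_totals] for r in row_totals]


def has_too_few_counts(observed: List[List[float]],
                       min_expected: float = MIN_EXPECTED_COUNT) -> bool:
    """Return True (and log a warning) if any expected cell count is too small."""
    smallest = min(min(row) for row in expected_counts(observed))
    if smallest < min_expected:
        logger.warning(
            "Smallest expected count %.2f is below %.2f; "
            "the chi-squared approximation may be unreliable.",
            smallest, min_expected)
        return True
    return False
```

This sketch covers only the "log a warning" option. A caller could use the boolean result to decide whether to report the p-value as insignificant, which is the other option listed above.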



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


