commons-commits mailing list archives

Subject svn commit: r1408174 - /commons/proper/math/trunk/src/site/xdoc/userguide/stat.xml
Date Mon, 12 Nov 2012 04:33:11 GMT
Author: psteitz
Date: Mon Nov 12 04:33:11 2012
New Revision: 1408174

Added G-test. JIRA: MATH-878.


Modified: commons/proper/math/trunk/src/site/xdoc/userguide/stat.xml
--- commons/proper/math/trunk/src/site/xdoc/userguide/stat.xml (original)
+++ commons/proper/math/trunk/src/site/xdoc/userguide/stat.xml Mon Nov 12 04:33:11 2012
@@ -810,6 +810,7 @@ new PearsonsCorrelation().correlation(ra
           Student's t</a>,
           <a href="">
+          <a href="">G Test</a>,
           <a href="">
           One-Way ANOVA</a>,
           <a href="">
@@ -818,12 +819,14 @@ new PearsonsCorrelation().correlation(ra
           Wilcoxon signed rank</a> test statistics as well as
           <a href="">
           p-values</a> associated with <code>t-</code>,
-          <code>Chi-Square</code>, <code>One-Way ANOVA</code>, <code>Mann-Whitney
+          <code>Chi-Square</code>, <code>G</code>, <code>One-Way ANOVA</code>, <code>Mann-Whitney U</code>
           and <code>Wilcoxon signed rank</code> tests. The respective test classes
           <a href="../apidocs/org/apache/commons/math3/stat/inference/TTest.html">
           <a href="../apidocs/org/apache/commons/math3/stat/inference/ChiSquareTest.html">
+          <a href="../apidocs/org/apache/commons/math3/stat/inference/GTest.html">
+          GTest</a>,
           <a href="../apidocs/org/apache/commons/math3/stat/inference/OneWayAnova.html">
           <a href="../apidocs/org/apache/commons/math3/stat/inference/MannWhitneyUTest.html">
@@ -864,14 +867,19 @@ new PearsonsCorrelation().correlation(ra
           <li>p-values returned by t-, chi-square and Anova tests are exact, based
            on numerical approximations to the t-, chi-square and F distributions in the
            <code>distributions</code> package. </li>
-           <li>p-values returned by t-tests are for two-sided tests and the boolean-valued
+          <li>The G test implementation provides two p-values:
+           <code>gTest(expected, observed)</code>, which is the tail probability beyond
+           <code>g(expected, observed)</code> in the ChiSquare distribution with degrees
+           of freedom one less than the common length of the input arrays, and
+           <code>gTestIntrinsic(expected, observed)</code>, which is the same tail
+           probability computed using a ChiSquare distribution with one less degree
+           of freedom. </li>
+          <li>p-values returned by t-tests are for two-sided tests and the boolean-valued
            methods supporting fixed significance level tests assume that the hypotheses
            are two-sided.  One sided tests can be performed by dividing returned p-values
            (resp. critical values) by 2.</li>
-           <li>Degrees of freedom for chi-square tests are integral values, based on
-           number of observed or expected counts (number of observed counts - 1)
-           for the goodness-of-fit tests and (number of columns -1) * (number of rows - 1)
-           for independence tests.</li>
+           <li>Degrees of freedom for g- and chi-square tests are integral values, based on the
+           number of observed or expected counts (number of observed counts - 1).</li>
@@ -1059,11 +1067,70 @@ TestUtils.chiSquareTest(counts, alpha);
           hypothesis can be rejected with confidence <code>1 - alpha</code>.
+          <dt><strong>g tests</strong></dt>
+          <br></br>
+          <dd>g tests are an alternative to chi-square tests that are recommended
+          when observed counts are small and/or incidence probabilities for
+          some cells are small. See Ted Dunning's paper,
+          <a href="">
+          Accurate Methods for the Statistics of Surprise and Coincidence</a> for
+          background and an empirical analysis showing how chi-square
+          statistics can be misleading in the presence of low incidence probabilities.
+          This paper also derives the formulas used in computing g statistics and the
+          root log likelihood ratio provided by the <code>GTest</code> class.</dd>
+          <dd>
+          <dd>To compute a g-test statistic measuring the agreement between a
+          <code>long[]</code> array of observed counts and a <code>double[]</code>
+          array of expected counts, use:
+          <source>
+double[] expected = new double[]{0.54d, 0.40d, 0.05d, 0.01d};
+long[] observed = new long[]{70, 79, 3, 4};
+System.out.println(TestUtils.g(expected, observed));
+          </source>
+          the value displayed will be
+          <code>2 * sum(observed[i] * log(observed[i]/expected[i]))</code>
+          </dd>
+          <dd> To get the p-value associated with the null hypothesis that
+          <code>observed</code> conforms to <code>expected</code>, use:
+          <source>
+TestUtils.gTest(expected, observed);
+          </source>
+          </dd>
+          <dd> To test the null hypothesis that <code>observed</code> conforms to
+          <code>expected</code> with <code>alpha</code> significance level
+          (equiv. <code>100 * (1-alpha)%</code> confidence) where <code>
+          0 &lt; alpha &lt; 1 </code> use:
+          <source>
+TestUtils.gTest(expected, observed, alpha);
+          </source>
+          The boolean value returned will be <code>true</code> iff the null hypothesis
+          can be rejected with confidence <code>1 - alpha</code>.
+          </dd>
+          <dd>To evaluate the hypothesis that two sets of counts come from the
+          same underlying distribution, use long[] arrays for the counts and
+          <code>gDataSetsComparison</code> for the test statistic
+          <source>
+long[] obs1 = new long[]{268, 199, 42};
+long[] obs2 = new long[]{807, 759, 184};
+System.out.println(TestUtils.gDataSetsComparison(obs1, obs2)); // g statistic
+System.out.println(TestUtils.gTestDataSetsComparison(obs1, obs2)); // p-value
+          </source>
+          </dd>
+          <dd>For 2 x 2 designs, the <code>rootLogLikelihoodRatio</code> method
+          computes the
+          <a href="">
+          signed root log likelihood ratio.</a>  For example, suppose that for two events
+          A and B, the observed count of AB (both occurring) is 5, not A and B (B without A)
+          is 1995, A not B is 0; and neither A nor B is 10000.  Then
+          <source>
+new GTest().rootLogLikelihoodRatio(5, 1995, 0, 100000);
+          </source>
+          returns the root log likelihood ratio associated with the null hypothesis that A
+          and B are independent.
+          </dd>
+          <br></br>
           <dt><strong>One-Way Anova tests</strong></dt>
-          <dd>To conduct a One-Way Analysis of Variance (ANOVA) to evaluate the
-          null hypothesis that the means of a collection of univariate datasets
-          are the same, start by loading the datasets into a collection, e.g.
 double[] classA =
    {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 };
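
For readers who want to see what the g statistic described in the added user guide text amounts to, the following is a minimal standalone sketch in plain Java. It is not the Commons Math implementation: the class name `GStatisticSketch` and its method are hypothetical, and two behaviors are assumptions of this sketch, namely that expected values are rescaled to sum to the observed total (so probabilities can be passed as expected values, as in the guide's example) and that a cell with an observed count of zero contributes nothing to the sum.

```java
// Hypothetical illustration only; not the code of org.apache.commons.math3.stat.inference.GTest.
class GStatisticSketch {

    /**
     * Computes 2 * sum(observed[i] * ln(observed[i] / expected[i])).
     * Assumption of this sketch: expected values are rescaled so that
     * they sum to the total observed count before the ratio is taken.
     */
    public static double g(double[] expected, long[] observed) {
        if (expected.length != observed.length || expected.length < 2) {
            throw new IllegalArgumentException("arrays must have the same length, at least 2");
        }
        double sumExpected = 0.0;
        double sumObserved = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (expected[i] <= 0.0 || observed[i] < 0) {
                throw new IllegalArgumentException("expected must be positive, observed non-negative");
            }
            sumExpected += expected[i];
            sumObserved += observed[i];
        }
        // Scale factor that turns expected proportions into expected counts.
        double ratio = sumObserved / sumExpected;
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            // Assumption of this sketch: a zero count contributes 0 (limit of x * ln x).
            if (observed[i] > 0) {
                sum += observed[i] * Math.log(observed[i] / (ratio * expected[i]));
            }
        }
        return 2.0 * sum;
    }

    public static void main(String[] args) {
        // Same inputs as the user guide example in the diff above.
        double[] expected = {0.54, 0.40, 0.05, 0.01};
        long[] observed = {70, 79, 3, 4};
        System.out.println(g(expected, observed));
    }
}
```

Under the null hypothesis this statistic is referred to a chi-square distribution with (number of cells - 1) degrees of freedom, which is the p-value that the text above attributes to `gTest(expected, observed)`.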
