From chiwanpark <...@git.apache.org>
Date Sun, 16 Aug 2015 09:43:40 GMT
Github user chiwanpark commented on a diff in the pull request:

+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data and computing
+ the mean, variance, Gini impurity, and entropy of data.
+
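As a point of reference for two of the statistics named in the description, here is a minimal sketch of Gini impurity and entropy on a plain Scala collection. It does not use the utility's own API: the object `ImpurityMeasures` and its methods are hypothetical names, and the base-2 logarithm for entropy is an assumption, since the text does not say which base the utility uses.

```scala
object ImpurityMeasures {
  // Relative frequency p_i of each distinct value in xs.
  def proportions(xs: Seq[Double]): Seq[Double] = {
    require(xs.nonEmpty, "data must not be empty")
    xs.groupBy(identity).values.map(_.size.toDouble / xs.size).toSeq
  }

  // Gini impurity: 1 - sum_i p_i^2
  def gini(xs: Seq[Double]): Double =
    1.0 - proportions(xs).map(p => p * p).sum

  // Shannon entropy: -sum_i p_i * log2(p_i) (base 2 is an assumption here)
  def entropy(xs: Seq[Double]): Double =
    proportions(xs).map(p => -p * math.log(p) / math.log(2.0)).sum
}
```

For a field that takes two values equally often, such as `Seq(0.0, 0.0, 1.0, 1.0)`, both measures reach their two-class maximum: a Gini impurity of 0.5 and an entropy of 1 bit.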
+## Methods
+
+ The Statistics utility provides two major functions: createHistogram and dataStats.
+
+### Creating a histogram
+
+ There are two types of histograms:
+   1. <strong>Continuous Histograms</strong>: These histograms are formed on a data set
+   X: DataSet[Double] when the values in X are from a continuous range. These histograms
+   support quantile and sum operations. Here quantile(q) refers to a value $x_q$ such that
+   $|\{x \in X : x \leq x_q\}| = q \cdot |X|$. Further, sum(s) refers to the number of elements
+   $x \leq s$, which can be construed as a cumulative probability value at $s$ (scaled by
+   $|X|$, so it is a count rather than a probability).
+   <br>
+    A continuous histogram can be formed by calling X.createHistogram(b) where b is the
+    number of bins.
+   2. <strong>Categorical Histograms</strong>: These histograms are formed on a data set
+    X: DataSet[Double] when the values in X are from a discrete distribution. These histograms
+    support the count(c) operation, which returns the number of elements associated with
+    category c.
+    <br>
+        A categorical histogram can be formed by calling X.createHistogram(0).
+
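The semantics of quantile(q), sum(s) and count(c) described above can be illustrated with exact computations on an in-memory collection. This is only a sketch of the definitions, not of the histogram implementation: the object `HistogramSemantics` and its method signatures are hypothetical, and a histogram with b bins would presumably return approximations of these values rather than the exact ones.

```scala
object HistogramSemantics {
  // quantile(q): smallest data value x_q such that at least q * |X|
  // elements of xs are <= x_q (exact, computed by sorting).
  def quantile(xs: Seq[Double], q: Double): Double = {
    require(xs.nonEmpty && q > 0.0 && q <= 1.0, "need data and 0 < q <= 1")
    val sorted = xs.sorted
    val rank = math.max(math.ceil(q * sorted.size).toInt, 1)
    sorted(rank - 1)
  }

  // sum(s): number of elements x with x <= s.
  def sum(xs: Seq[Double], s: Double): Int = xs.count(_ <= s)

  // count(c): number of elements equal to category c.
  def count(xs: Seq[Double], c: Double): Int = xs.count(_ == c)
}
```

With `xs = Seq(1.0, 2.0, ..., 8.0)`, `quantile(xs, 0.5)` is 4.0 and `sum(xs, 4.0)` is 4, which equals 0.5 * |X| as the definition requires. Dividing the result of sum by |X| gives the unscaled cumulative probability.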
+### Data Statistics
+
+ The dataStats function operates on a data set X: DataSet[Vector] and returns column-wise
+ statistics for X. Every field of X is allowed to be defined as either <i>discrete</i>
or
