spark-reviews mailing list archives

From josepablocam <>
Subject [GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...
Date Thu, 09 Jul 2015 01:13:09 GMT
Github user josepablocam commented on a diff in the pull request:
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -158,4 +158,47 @@ object Statistics {
       def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
    +  /**
    +   * Conduct the two-sided Kolmogorov-Smirnov test for data sampled from a
    +   * continuous distribution. By comparing the largest difference between the empirical
    +   * distribution of the sample data and the theoretical distribution, we can provide a test
    +   * for the null hypothesis that the sample data comes from that theoretical distribution.
    +   * For more information on the KS test:
    +   * @see [[]]
    +   *
    +   * Implementation note: We seek to implement the KS test with a minimal number of distributed
    +   * passes. We sort the RDD, and then on a per-partition basis we calculate an empirical
    +   * cumulative distribution value for each observation and a theoretical cumulative
    +   * distribution value. We know the latter to be correct, while the former will be off
    +   * by a constant (how large the constant is depends on how many values precede it in other
    +   * partitions). However, given that this constant simply shifts the ECDF upwards but does
    +   * not change its shape, and furthermore that the constant is the same within a given
    +   * partition, we can pick 2 values in each partition that can potentially resolve to the
    +   * largest global distance. Namely, we pick the minimum distance and the maximum distance.
    +   * Additionally, we keep track of how many elements are in each partition. Once these three
    +   * values have been returned for every partition, we can collect and operate locally.
    +   * Locally, we can now adjust each distance by the appropriate constant (the cumulative sum
    +   * of the number of elements in the prior partitions divided by the data set size). Finally,
    +   * we take the maximum absolute value, and this is the test statistic.
    +   * @param data an `RDD[Double]` containing the sample of data to test
    +   * @param cdf a `Double => Double` function to calculate the theoretical CDF at a given value
    +   * @return KSTestResult object containing test statistic, p-value, and null hypothesis.
    +   */
    +  def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = {
    +    KSTest.testOneSample(data, cdf)
    +  }
    +  /**
    +   * Convenience function to conduct a one-sample, two-sided Kolmogorov-Smirnov test for
    +   * probability distribution equality. Currently supports the normal distribution, taking
    +   * as parameters the mean and standard deviation.
    +   * (distName = "norm")
    +   * @param data an `RDD[Double]` containing the sample of data to test
    +   * @param distName a `String` name for a theoretical distribution
    +   * @param params `Double*` specifying the parameters to be used for the theoretical
    +   * distribution
    +   * @return KSTestResult object containing test statistic, p-value, and null hypothesis.
    +   */
    +  def ksTest(data: RDD[Double], distName: String, params: Double*): KSTestResult = {
    --- End diff --
    I was thinking about overloading the name (similar to how R's ks.test does a 1-sample test
when passed a single vector of data, and a 2-sample test when passed two). SciPy's implementation
breaks out the 2-sample test as ks_2samp. I think I prefer the R approach.
