# spark-reviews mailing list archives

##### Site index · List index
Message view
Top
From mengxr <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-2839][MLlib] Stats Toolkit documentatio...
Date Tue, 26 Aug 2014 04:52:58 GMT
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2123#discussion_r16695707

--- Diff: docs/mllib-stats.md ---
@@ -99,69 +99,336 @@ v = u.map(lambda x: 1.0 + 2.0 * x)

</div>

-## Stratified Sampling
+## Correlation Calculation
+
+Calculating the correlation between two series of data is a common operation in Statistics.
In MLlib
+we provide the flexibility to calculate correlation between many series. The supported
correlation
+methods are currently Pearson's and Spearman's correlation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[Statistics](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to +calculate correlations between series. Depending on the type of input, two RDD[Double]s or +an RDD[Vector], the output will be a Double or the correlation Matrix respectively. + +{% highlight scala %} +import org.apache.spark.SparkContext +import org.apache.spark.mllib.linalg._ +import org.apache.spark.mllib.stat.Statistics + +val sc: SparkContext = ... + +val seriesX: RDD[Double] = ... // a series +val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX + +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a +// method is not specified, Pearson's method will be used by default. +val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson") + +val data: RDD[Vector] = ... // note that each Vector is a row and not a column + +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method. +// If a method is not specified, Pearson's method will be used by default. +val correlMatrix: Matrix = Statistics.corr(data, "pearson") + +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +[Statistics](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to +calculate correlations between series. Depending on the type of input, two JavaDoubleRDDs or +a JavaRDD<Vector>, the output will be a Double or the correlation Matrix respectively. + +{% highlight java %} +import org.apache.spark.api.java.JavaDoubleRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.*; +import org.apache.spark.mllib.stat.Statistics; + +JavaSparkContext jsc = ... + +JavaDoubleRDD seriesX = ... // a series +JavaDoubleRDD seriesY = ... // must have the same number of partitions and cardinality as seriesX + +// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a +// method is not specified, Pearson's method will be used by default. +Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson"); + +JavaRDD<Vector> data = ... // note that each Vector is a row and not a column + +// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method. +// If a method is not specified, Pearson's method will be used by default. +Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson"); + +{% endhighlight %} +</div> + +<div data-lang="python" markdown="1"> +[Statistics](api/python/index.html#org.apache.spark.mllib.stat.Statistics$) provides
methods to
+calculate correlations between series. Depending on the type of input, two RDD[Double]s
or
+an RDD[Vector], the output will be a Double or the correlation Matrix respectively.
+
+Support for RowMatrix operations in python currently don't exist, but will be added
in future
+releases.
+
+{% highlight python %}
+from pyspark.mllib.stat import Statistics
+
+sc = ... # SparkContext
+
+seriesX = ... # a series
+seriesY = ... # must have the same number of partitions and cardinality as seriesX
+
+# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
If a
+# method is not specified, Pearson's method will be used by default.
+print Statistics.corr(seriesX, seriesY, method="pearson")
+
+data = ... # an RDD of Vectors
+# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's
method.
+# If a method is not specified, Pearson's method will be used by default.
+print Statistics.corr(data, method="pearson")
+
+{% endhighlight %}
+</div>
+
+</div>
+
+## Stratified Sampling
+
+Unlike the other statistics functions, which reside in MLLib, stratified sampling methods,

+sampleByKey and sampleByKeyExact, can be accessed in PairRDDFunctions in core,
as stratified
+sampling is tightly coupled with the PairRDD data type, and the function signature conforms
to the
+other *ByKey* methods in PairRDDFunctions. A separate method for exact sample size support
exists
+as it requires significant more resources than the per-stratum simple random sampling
used in
+sampleByKey.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[sampleByKeyExact()](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows
users to
+sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is
the desired
+fraction for key $k$, and $n_k$ is the number of key-value pairs for key $k$.
+Sampling without replacement requires one additional pass over the RDD to guarantee sample

+size, whereas sampling with replacement requires two additional passes.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.rdd.PairRDDFunctions
+
+val sc: SparkContext = ...
+
+val data = ... // an RDD[(K,V)] of any key value pairs
+val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key
+
+// Get an exact sample from each stratum
+val sample = data.sampleByKeyExact(withReplacement=false, fractions)
--- End diff --

add spaces around =

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Mime
View raw message