spark-commits mailing list archives

From m...@apache.org
Subject spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Date Tue, 18 Aug 2015 00:57:44 GMT
Repository: spark
Updated Branches:
  refs/heads/master 0b6b01761 -> f9d1a92aa


[SPARK-7707] User guide and example code for KernelDensity

Author: Sandy Ryza <sandy@cloudera.com>

Closes #8230 from sryza/sandy-spark-7707.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9d1a92a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9d1a92a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9d1a92a

Branch: refs/heads/master
Commit: f9d1a92aa1bac4494022d78559b871149579e6e8
Parents: 0b6b017
Author: Sandy Ryza <sandy@cloudera.com>
Authored: Mon Aug 17 17:57:51 2015 -0700
Committer: Xiangrui Meng <meng@databricks.com>
Committed: Mon Aug 17 17:57:51 2015 -0700

----------------------------------------------------------------------
 docs/mllib-statistics.md | 77 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/f9d1a92a/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index be04d0b..80a9d06 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 1000000L, 10)
 v = u.map(lambda x: 1.0 + 2.0 * x)
 {% endhighlight %}
 </div>
+</div>
+
+## Kernel density estimation
+
+[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique
+useful for visualizing empirical probability distributions without requiring assumptions about the
+particular distribution that the observed samples are drawn from. It computes an estimate of the
+probability density function of a random variable, evaluated at a given set of points. It achieves
+this estimate by expressing the PDF of the empirical distribution at a particular point as the
+mean of PDFs of normal distributions centered around each of the samples.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight scala %}
+import org.apache.spark.mllib.stat.KernelDensity
+import org.apache.spark.rdd.RDD
+
+val data: RDD[Double] = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+val kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0)
+
+// Find density estimates for the given values
+val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight java %}
+import org.apache.spark.mllib.stat.KernelDensity;
+import org.apache.spark.rdd.RDD;
+
+RDD<Double> data = ... // an RDD of sample data
+
+// Construct the density estimator with the sample data and a standard deviation for the Gaussian
+// kernels
+KernelDensity kd = new KernelDensity()
+  .setSample(data)
+  .setBandwidth(3.0);
+
+// Find density estimates for the given values
+double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0});
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods
+to compute kernel density estimates from an RDD of samples. The following example demonstrates how
+to do so.
+
+{% highlight python %}
+from pyspark.mllib.stat import KernelDensity
+
+data = ... # an RDD of sample data
+
+# Construct the density estimator with the sample data and a standard deviation for the Gaussian
+# kernels
+kd = KernelDensity()
+kd.setSample(data)
+kd.setBandwidth(3.0)
+
+# Find density estimates for the given values
+densities = kd.estimate([-1.0, 2.0, 5.0])
+{% endhighlight %}
+</div>
 
 </div>
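For reference, the estimate the new guide section describes (the PDF at a point, taken as the mean of Gaussian PDFs centered at each sample) can be sketched in plain Python without Spark. The bandwidth (3.0) and evaluation points (-1.0, 2.0, 5.0) mirror the examples in the patch; the sample values themselves are illustrative stand-ins for the RDD of data, not taken from the patch:

```python
import math

def kernel_density_estimate(samples, points, bandwidth):
    """Estimate the PDF at each point as the mean of Gaussian PDFs
    centered at each sample, with standard deviation = bandwidth."""
    def gaussian_pdf(x, mean, std):
        return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))
    return [
        sum(gaussian_pdf(p, s, bandwidth) for s in samples) / len(samples)
        for p in points
    ]

# Hypothetical sample data standing in for the RDD in the guide
samples = [1.0, 1.5, 2.0, 2.5, 3.0]
densities = kernel_density_estimate(samples, [-1.0, 2.0, 5.0], 3.0)
```

Points near the bulk of the samples receive higher density estimates than points in the tails, which is the behavior the distributed `KernelDensity.estimate` call computes over an RDD.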


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

