spark-reviews mailing list archives

From viirya <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-7556][ML][Doc] Add user guide for spark...
Date Fri, 15 May 2015 01:48:25 GMT
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6116#discussion_r30380052
  
    --- Diff: docs/ml-features.md ---
    @@ -183,6 +183,90 @@ for words_label in wordsDataFrame.select("words", "label").take(3):
     </div>
     </div>
     
    +## Binarizer
    +
    +Binarization is the process of thresholding numerical features to binary features. Since
some probabilistic estimators assume that the input data follows a
[Bernoulli distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution), a binarizer is
useful for pre-processing input data with continuous numerical features.
    +
    +A simple [Binarizer](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) class
provides this functionality. Besides the common `inputCol` and `outputCol` parameters,
`Binarizer` takes a `threshold` parameter used for binarizing continuous numerical features:
feature values greater than the threshold are binarized to 1.0, and values equal to or less
than the threshold are binarized to 0.0, as illustrated by the sketch below. The full example
that follows shows how to binarize numerical features with `Binarizer`.
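    +
    +Conceptually, the rule applied to each feature value can be sketched as follows (an
    +illustrative snippet with a hypothetical `binarize` helper, not part of the `Binarizer`
    +API):
    +
    +{% highlight scala %}
    +// Hypothetical helper mirroring the thresholding rule described above.
    +def binarize(value: Double, threshold: Double): Double =
    +  if (value > threshold) 1.0 else 0.0
    +
    +binarize(0.8, 0.5)  // 1.0: value is strictly greater than the threshold
    +binarize(0.5, 0.5)  // 0.0: values equal to the threshold map to 0.0
    +{% endhighlight %}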
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.Binarizer
    +import org.apache.spark.sql.DataFrame
    +
    +val data = Array(
    +  (0, 0.1),
    +  (1, 0.8),
    +  (2, 0.2)
    +)
    +val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
    +
    +val binarizer: Binarizer = new Binarizer()
    +  .setInputCol("feature")
    +  .setOutputCol("binarized_feature")
    +  .setThreshold(0.5)
    +
    +val binarizedDataFrame = binarizer.transform(dataFrame)
    +val binarizedFeatures = binarizedDataFrame.select("binarized_feature")
    +binarizedFeatures.collect().foreach(println)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import com.google.common.collect.Lists;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.ml.feature.Binarizer;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.Metadata;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +
    +JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
    +  RowFactory.create(0, 0.1),
    +  RowFactory.create(1, 0.8),
    +  RowFactory.create(2, 0.2)
    +));
    +StructType schema = new StructType(new StructField[]{
    +  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    +  new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
    +});
    +DataFrame continuousDataFrame = jsql.createDataFrame(jrdd, schema);
    +Binarizer binarizer = new Binarizer()
    +  .setInputCol("feature")
    +  .setOutputCol("binarized_feature")
    +  .setThreshold(0.5);
    +DataFrame binarizedDataFrame = binarizer.transform(continuousDataFrame);
    +DataFrame binarizedFeatures = binarizedDataFrame.select("binarized_feature");
    +for (Row r : binarizedFeatures.collect()) {
    +  Double binarized_value = r.getDouble(0);
    +  System.out.println(binarized_value);
    +}
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +from pyspark.ml.feature import Binarizer
    +
    +continuousDataFrame = sqlContext.createDataFrame([
    +  (0, 0.1),
    +  (1, 0.8),
    +  (2, 0.2)
    +], ["label", "feature"])
    +binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
    +binarizedDataFrame = binarizer.transform(continuousDataFrame)
    +binarizedFeatures = binarizedDataFrame.select("binarized_feature")
    +for binarized_feature in binarizedFeatures.collect():
    +  print binarized_feature
    --- End diff --
    
    Good point. Thanks!

