From: mlnick@apache.org
To: commits@spark.apache.org
Subject: spark git commit: [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer and CountVectorizer
Date: Fri, 24 Jun 2016 11:26:41 +0000 (UTC)

Repository: spark
Updated Branches:
  refs/heads/branch-2.0 201d5e8db -> 76741b570


[SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileDiscretizer and CountVectorizer

## What changes were proposed in this pull request?

Made changes to HashingTF, QuantileDiscretizer and CountVectorizer

Author: GayathriMurali

Closes #13745 from GayathriMurali/SPARK-15997.

(cherry picked from commit be88383e15a86d094963de5f7e8792510bc990de)
Signed-off-by: Nick Pentreath

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76741b57
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76741b57
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76741b57

Branch: refs/heads/branch-2.0
Commit: 76741b570e20eb7957ada28ad3c5babc0abb738f
Parents: 201d5e8
Author: GayathriMurali
Authored: Fri Jun 24 13:25:40 2016 +0200
Committer: Nick Pentreath
Committed: Fri Jun 24 13:26:28 2016 +0200

----------------------------------------------------------------------
 docs/ml-features.md                              | 29 ++++++++++++--------
 .../ml/JavaQuantileDiscretizerExample.java       |  7 ++++-
 .../python/ml/quantile_discretizer_example.py    | 11 ++++++--
 .../ml/QuantileDiscretizerExample.scala          |  9 ++++--
 4 files changed, 38 insertions(+), 18 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/76741b57/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 3cb2644..88fd291 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -46,14 +46,18 @@ In MLlib, we separate TF and IDF to make them flexible.
 
 `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
 fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
 `HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
-A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+A raw feature is mapped into an index (term) by applying a hash function. The hash function
+used here is [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash). Then term frequencies
 are calculated based on the mapped indices. This approach avoids the need to compute a global
 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
 collisions, where different raw features may become the same term after hashing. To reduce the
 chance of collision, we can increase the target feature dimension, i.e. the number of buckets
 of the hash table. Since a simple modulo is used to transform the hash function to a column index,
 it is advisable to use a power of two as the feature dimension, otherwise the features will
-not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+An optional binary toggle parameter controls term frequency counts. When set to true all nonzero
+frequency counts are set to 1. This is especially useful for discrete probabilistic models that
+model binary, rather than integer, counts.
 
 `CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
 ](ml-features.html#countvectorizer) for more details.
@@ -145,9 +149,11 @@ for more details on the API.
 passed to other algorithms like LDA.
 
 During the fitting process, `CountVectorizer` will select the top `vocabSize` words ordered by
-term frequency across the corpus. An optional parameter "minDF" also affects the fitting process
+term frequency across the corpus. An optional parameter `minDF` also affects the fitting process
 by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
-included in the vocabulary.
+included in the vocabulary. Another optional binary toggle parameter controls the output vector.
+If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic
+models that model binary, rather than integer, counts.
 
 **Examples**
@@ -1096,14 +1102,13 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
-categorical features.
-The bin ranges are chosen by taking a sample of the data and dividing it into roughly equal parts.
-The lower and upper bin bounds will be `-Infinity` and `+Infinity`, covering all real values.
-This attempts to find `numBuckets` partitions based on a sample of the given input data, but it may
-find fewer depending on the data sample values.
-
-Note that the result may be different every time you run it, since the sample strategy behind it is
-non-deterministic.
+categorical features. The number of bins is set by the `numBuckets` parameter.
+The bin ranges are chosen using an approximate algorithm (see the documentation for
+[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
+detailed description). The precision of the approximation can be controlled with the
+`relativeError` parameter. When set to zero, exact quantiles are calculated
+(**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds
+will be `-Infinity` and `+Infinity` covering all real values.
 
 **Examples**
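----------------------------------------------------------------------

As a quick illustration of the binary toggle described in the ml-features.md
changes above, here is a minimal sketch (not part of this commit) using the
Spark 2.0 Scala API, where the toggle is set via `setBinary(true)` on both
`HashingTF` and `CountVectorizer`. The toy sentences, column names, feature
dimension, and the object name BinaryToggleSketch are made up for illustration:

import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object BinaryToggleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BinaryToggleSketch").getOrCreate()

    // Toy corpus; "spark spark" repeats a term so the toggle's effect is visible.
    val df = spark.createDataFrame(Seq(
      (0, "spark spark hashing"),
      (1, "binary counts example")
    )).toDF("id", "text")

    val tokenized = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .transform(df)

    // With binary = true, the repeated "spark" contributes 1.0 instead of 2.0.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1 << 10) // a power of two, as the guide advises
      .setBinary(true)
    hashingTF.transform(tokenized).select("features").show(truncate = false)

    // CountVectorizer exposes the same toggle; it applies to the fitted model's output.
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setBinary(true)
      .fit(tokenized)
    cvModel.transform(tokenized).select("features").show(truncate = false)

    spark.stop()
  }
}

Both vectors then contain only 0/1 entries, which matches the use case the doc
text calls out: discrete probabilistic models that expect binary rather than
integer counts.

----------------------------------------------------------------------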
http://git-wip-us.apache.org/repos/asf/spark/blob/76741b57/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java
----------------------------------------------------------------------
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java
index 16f58a8..dd20cac 100644
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java
@@ -54,7 +54,12 @@ public class JavaQuantileDiscretizerExample {
     });
     Dataset<Row> df = spark.createDataFrame(data, schema);
-
+    // $example off$
+    // Output of QuantileDiscretizer for such small datasets can depend on the number of
+    // partitions. Here we force a single partition to ensure consistent results.
+    // Note this is not necessary for normal use cases
+    df = df.repartition(1);
+    // $example on$
     QuantileDiscretizer discretizer = new QuantileDiscretizer()
       .setInputCol("hour")
       .setOutputCol("result")

http://git-wip-us.apache.org/repos/asf/spark/blob/76741b57/examples/src/main/python/ml/quantile_discretizer_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py b/examples/src/main/python/ml/quantile_discretizer_example.py
index 6ae7bb1..5444cac 100644
--- a/examples/src/main/python/ml/quantile_discretizer_example.py
+++ b/examples/src/main/python/ml/quantile_discretizer_example.py
@@ -28,11 +28,16 @@ if __name__ == "__main__":
 
     # $example on$
     data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)]
-    dataFrame = spark.createDataFrame(data, ["id", "hour"])
-
+    df = spark.createDataFrame(data, ["id", "hour"])
+    # $example off$
+    # Output of QuantileDiscretizer for such small datasets can depend on the number of
+    # partitions. Here we force a single partition to ensure consistent results.
+    # Note this is not necessary for normal use cases
+    df = df.repartition(1)
+    # $example on$
     discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
 
-    result = discretizer.fit(dataFrame).transform(dataFrame)
+    result = discretizer.fit(df).transform(df)
     result.show()
     # $example off$

http://git-wip-us.apache.org/repos/asf/spark/blob/76741b57/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
----------------------------------------------------------------------
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
index 1a16515..2f7e217 100644
--- a/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala
@@ -32,8 +32,13 @@ object QuantileDiscretizerExample {
 
     // $example on$
     val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
-    val df = spark.createDataFrame(data).toDF("id", "hour")
-
+    val df = spark.createDataFrame(data).toDF("id", "hour")
+    // $example off$
+    // Output of QuantileDiscretizer for such small datasets can depend on the number of
+    // partitions. Here we force a single partition to ensure consistent results.
+    // Note this is not necessary for normal use cases
+      .repartition(1)
+    // $example on$
     val discretizer = new QuantileDiscretizer()
       .setInputCol("hour")
       .setOutputCol("result")
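
----------------------------------------------------------------------

For completeness, a minimal sketch (again, not part of this commit) of the
approximation behaviour the updated QuantileDiscretizer doc describes:
`setRelativeError` controls the precision of the underlying quantile
computation, with 0.0 requesting exact (but expensive) quantiles, and the
same machinery is exposed directly as `approxQuantile` on `df.stat`. The
data mirrors the example files; the object name RelativeErrorSketch and the
chosen error values are made up for illustration:

import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

object RelativeErrorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RelativeErrorSketch").getOrCreate()

    val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
    // Single partition for deterministic output on a tiny dataset, as in the examples.
    val df = spark.createDataFrame(data).toDF("id", "hour").repartition(1)

    // relativeError = 0.0 requests exact quantiles; larger values trade
    // accuracy for speed.
    val discretizer = new QuantileDiscretizer()
      .setInputCol("hour")
      .setOutputCol("result")
      .setNumBuckets(3)
      .setRelativeError(0.0)
    discretizer.fit(df).transform(df).show()

    // Querying a quantile directly, here the median with 10% relative error.
    val medianApprox = df.stat.approxQuantile("hour", Array(0.5), 0.1)
    println(s"approximate median: ${medianApprox.mkString(",")}")

    spark.stop()
  }
}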