From reviews-return-624773-archive-asf-public=cust-asf.ponee.io@spark.apache.org  Wed Mar 14 23:09:30 2018
Return-Path: <reviews-return-624773-archive-asf-public=cust-asf.ponee.io@spark.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 5B496180654
	for <archive-asf-public@cust-asf.ponee.io>; Wed, 14 Mar 2018 23:09:30 +0100 (CET)
Received: (qmail 71341 invoked by uid 500); 14 Mar 2018 22:09:29 -0000
Mailing-List: contact reviews-help@spark.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:reviews-help@spark.apache.org>
List-Unsubscribe: <mailto:reviews-unsubscribe@spark.apache.org>
List-Post: <mailto:reviews@spark.apache.org>
List-Id: <reviews.spark.apache.org>
Delivered-To: mailing list reviews@spark.apache.org
Received: (qmail 71313 invoked by uid 99); 14 Mar 2018 22:09:28 -0000
Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 14 Mar 2018 22:09:28 +0000
Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33)
	id A7EF5E96E4; Wed, 14 Mar 2018 22:09:28 +0000 (UTC)
From: BryanCutler <git@git.apache.org>
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
References: <git-pr-20777-spark@git.apache.org>
In-Reply-To: <git-pr-20777-spark@git.apache.org>
Subject: [GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...
Content-Type: text/plain
Message-Id: <20180314220928.A7EF5E96E4@git1-us-west.apache.org>
Date: Wed, 14 Mar 2018 22:09:28 +0000 (UTC)

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20777#discussion_r174624206
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
       def getMinDF: Double = $(minDF)
     
       /**
    -   * Specifies the maximum number of different documents a term must appear in to be included
    -   * in the vocabulary.
    -   * If this is an integer greater than or equal to 1, this specifies the number of documents
    -   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
    -   * documents.
    +   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
    +   * of different documents a term could appear in to be included in the vocabulary.
    +   * If this is an integer greater than or equal to 1, this specifies the maximum number of
    +   * documents the term could appear in; if this is a double in [0,1), then this specifies the
    +   * maximum fraction of documents the term could appear in. A term appears more frequently
    +   * than maxDF will be removed.
    --- End diff --
    
    This sounds much better, but it should probably say "ignored" instead of "removed", and it might be good to reorder the sentences like this:
    
    ```
    Specifies the maximum number of different documents a term could appear in to be included
    in the vocabulary. A term that appears in more documents than the threshold will be ignored.
    If this is an integer greater than or equal to 1, this specifies the maximum number of
    documents the term could appear in; if this is a double in [0,1), then this specifies the
    maximum fraction of documents the term could appear in.
    ```
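
To make the two readings of the threshold concrete, here is a minimal sketch in plain Python (no Spark involved). The helper names `resolve_threshold` and `build_vocab` are hypothetical and exist only for illustration. The sketch assumes the comparison is inclusive, so a term whose document frequency equals the threshold is kept, which is how I read "appears more than the threshold will be ignored".

```python
from collections import Counter


def resolve_threshold(value, num_docs):
    """Turn a maxDF-style setting into an absolute document count.

    Hypothetical helper: a value >= 1 is read as a number of documents,
    a value in [0, 1) as a fraction of the corpus.
    """
    return value if value >= 1.0 else value * num_docs


def build_vocab(docs, max_df):
    """Return the sorted terms whose document frequency is within max_df."""
    num_docs = len(docs)
    # Document frequency: count each term at most once per document.
    df = Counter(term for doc in docs for term in set(doc))
    limit = resolve_threshold(max_df, num_docs)
    # Assumption: the bound is inclusive, so only terms that appear in
    # more documents than the limit are ignored.
    return sorted(term for term, count in df.items() if count <= limit)


docs = [["a", "b", "c"], ["a", "b"], ["a"], ["a", "d"]]

vocab_abs = build_vocab(docs, 2)     # absolute: at most 2 documents
vocab_frac = build_vocab(docs, 0.5)  # fraction: at most 50% of 4 documents
```

In this toy corpus "a" appears in all four documents, so both settings ignore it, while "b" (two documents) sits exactly on the threshold and stays in.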

