From: BryanCutler
To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...
Date: Wed, 14 Mar 2018 22:09:28 +0000 (UTC)

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20777#discussion_r174624206

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
       def getMinDF: Double = $(minDF)

       /**
    -   * Specifies the maximum number of different documents a term must appear in to be included
    -   * in the vocabulary.
    -   * If this is an integer greater than or equal to 1, this specifies the number of documents
    -   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
    -   * documents.
    +   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
    +   * of different documents a term could appear in to be included in the vocabulary.
    +   * If this is an integer greater than or equal to 1, this specifies the maximum number of
    +   * documents the term could appear in; if this is a double in [0,1), then this specifies the
    +   * maximum fraction of documents the term could appear in. A term appears more frequently
    +   * than maxDF will be removed.
    --- End diff --

    This sounds much better, but it should probably say "ignored" instead of "removed", and it
    might be good to change the order of the sentences like this:
    ```
    Specifies the maximum number of different documents a term could appear in to be included
    in the vocabulary. A term that appears more than the threshold will be ignored. If this is
    an integer greater than or equal to 1, this specifies the maximum number of documents the
    term could appear in; if this is a double in [0,1), then this specifies the maximum
    fraction of documents the term could appear in.
    ```
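
    To make the wording above concrete, here is a minimal plain-Python sketch of the
    threshold semantics the doc comment describes. It is not Spark's implementation;
    `resolve_max_df` and `build_vocabulary` are hypothetical helper names, and treating a
    term whose document frequency equals the threshold as kept is an assumption based on
    "appears more than the threshold will be ignored".

```python
from collections import Counter


def resolve_max_df(max_df, num_docs):
    """Turn max_df into an absolute document-count threshold (hypothetical helper)."""
    # A value >= 1 is read as a document count; a value in [0, 1) is read as a
    # fraction of the total number of documents.
    return max_df if max_df >= 1.0 else max_df * num_docs


def build_vocabulary(docs, max_df):
    """Return the sorted terms whose document frequency does not exceed max_df."""
    # Document frequency: the number of distinct documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    threshold = resolve_max_df(max_df, len(docs))
    # Terms that appear in more documents than the threshold are ignored.
    return sorted(term for term, df in doc_freq.items() if df <= threshold)


docs = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a"]]
print(build_vocabulary(docs, 2))     # absolute threshold: at most 2 documents
print(build_vocabulary(docs, 0.25))  # fractional threshold: at most 25% of 4 documents
```

    With these four toy documents, "a" appears in every document and is ignored in both
    calls, while "b" (two documents) survives only under the absolute threshold of 2.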