spark-reviews mailing list archives

From BryanCutler <...@git.apache.org>
Subject [GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...
Date Wed, 14 Mar 2018 22:09:28 GMT
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20777#discussion_r174624206
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
       def getMinDF: Double = $(minDF)
     
       /**
    -   * Specifies the maximum number of different documents a term must appear in to be included
    -   * in the vocabulary.
    -   * If this is an integer greater than or equal to 1, this specifies the number of documents
    -   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
    -   * documents.
    +   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
    +   * of different documents a term could appear in to be included in the vocabulary.
    +   * If this is an integer greater than or equal to 1, this specifies the maximum number of
    +   * documents the term could appear in; if this is a double in [0,1), then this specifies the
    +   * maximum fraction of documents the term could appear in. A term appears more frequently
    +   * than maxDF will be removed.
    --- End diff --
    
    This sounds much better, but it should probably say "ignored" instead of "removed", and it might be good to change the order of the sentences like this:
    
    ```
    Specifies the maximum number of different documents a term could appear in to be included
    in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
    integer greater than or equal to 1, this specifies the maximum number of documents the term
    could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
    documents the term could appear in.
    ```
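
To make the two readings of the parameter concrete, here is a minimal sketch of the semantics the doc comment describes. It is plain Scala with no Spark dependency, and it is not the code in this PR: `MaxDfSketch`, `resolveMaxDf`, and `filterVocab` are hypothetical names used only for illustration. It assumes a value of 1 or more is an absolute document count, a value in [0,1) is a fraction of the corpus, and a term is kept only when its document frequency does not exceed the resulting threshold.

```scala
// Hypothetical helper for illustration only; not Spark's implementation.
object MaxDfSketch {

  // Turn a maxDF setting into an absolute document-count threshold.
  // Assumption: values >= 1 are counts, values in [0,1) are fractions.
  def resolveMaxDf(maxDF: Double, numDocs: Long): Double =
    if (maxDF >= 1.0) maxDF else maxDF * numDocs

  // docFreq maps each term to the number of documents it appears in.
  // Terms whose document frequency exceeds the threshold are ignored.
  def filterVocab(
      docFreq: Map[String, Long],
      maxDF: Double,
      numDocs: Long): Set[String] = {
    val threshold = resolveMaxDf(maxDF, numDocs)
    docFreq.collect { case (term, df) if df <= threshold => term }.toSet
  }
}
```

With a corpus of 4 documents, `maxDF = 2.0` and `maxDF = 0.5` both resolve to a threshold of 2 documents under these assumptions, so a term that appears in all 4 documents would be ignored either way.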

