spark-reviews mailing list archives

From hhbyyh <...@git.apache.org>
Subject [GitHub] spark pull request: [Spark-9028] [ML] Add CountVectorizer as an es...
Date Fri, 14 Aug 2015 13:23:38 GMT
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/7388#issuecomment-131104897
  
    Thanks @jkbradley for the update. Copying Joseph's comment from https://github.com/hhbyyh/spark/pull/3 here:
    
    Updates:
    
    Renamed "minCount" to "minTokenCount"
    Added "minTermFreq" back, including unit test
    Moved all Params so that they are included in both the Estimator and the Model and can be viewed in either.
    
    New thoughts:
    
    Previously, we had decided to call "minTermFreq" a "frequency" even though it is a count. That was to follow existing conventions, such as sklearn and "TF-IDF."
    Now, I would like to rename it to "minTermCount." Yuhao got me thinking along these lines,
and it actually matches other libraries to some extent:
    sklearn calls it "min_tf," but it accepts both integers (counts) and doubles (fractions or frequencies). Since a single Param cannot accept both types given the way Params are set up, the best we could do is have a minTermCount Param and a minTermFreq Param. Frequency might actually be a better way to specify this threshold.
    TF-IDF arguably uses fractional frequencies (not counts)...depending on whether "TF-IDF"
refers to the value or the normalizer.
    Proposal: I propose we do one of the following:
    
    (A) Rename "minTermFreq" to "minTermCount"
    (B) Change "minTermFreq" to take a Double fraction
    (C) Remove "minTermFreq" for now pending further discussion
    CC: @hhbyyh @mengxr What do you think?
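
    To make the difference between options (A) and (B) concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from this PR: the function names and the sample document are hypothetical, and it assumes the threshold is applied to term counts within a single document.

```python
from collections import Counter


def filter_by_min_count(tokens, min_term_count):
    """Option (A): keep terms that occur at least min_term_count times (hypothetical helper)."""
    counts = Counter(tokens)
    return {term: n for term, n in counts.items() if n >= min_term_count}


def filter_by_min_fraction(tokens, min_term_freq):
    """Option (B): keep terms whose share of the document's tokens is at least min_term_freq."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # An empty document has no terms to keep.
        return {}
    return {term: n for term, n in counts.items() if n / total >= min_term_freq}


# Hypothetical six-token document: "a" appears 3 times, "b" twice, "c" once.
doc = ["a", "b", "a", "c", "a", "b"]
by_count = filter_by_min_count(doc, 2)          # integer threshold on raw counts
by_fraction = filter_by_min_fraction(doc, 0.5)  # fractional threshold on count / total
```

    With an integer threshold of 2, both "a" and "b" survive; with a fractional threshold of 0.5, only "a" does, because the same document is judged relative to its length rather than by raw counts.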
    
    @hhbyyh Feel free to merge, take parts, or modify as you see fit. It may be easiest if
you go ahead and merge/modify so that @mengxr can view your full PR.
    
    Thanks!

