spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-25441) calculate term frequency in CountVectorizer()
Date Sun, 03 Mar 2019 19:51:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-25441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-25441.
-------------------------------
    Resolution: Won't Fix

What you have there is already term frequency. If you want to normalize it to some kind of
term fraction, you can just make that transformation yourself.
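
As a rough sketch of that transformation (not part of the original message): dividing each count by the document's total count is an L1 normalization. The helper below, term_fractions, is a hypothetical name and uses plain Python on a (size, indices, values) triple that mimics the sparse vectors printed in the issue. The counts for the second row are worked out by hand from the example input. In Spark ML itself, pyspark.ml.feature.Normalizer with p=1.0 applied to the CountVectorizer output column should give the same result, since the counts are non-negative.

```python
def term_fractions(size, indices, counts):
    """Divide each count by the document's total count (L1 normalization).

    Hypothetical helper for illustration; it is not part of Spark.
    """
    total = sum(counts)
    if total == 0:
        # An empty document has no terms; return it unchanged
        # rather than divide by zero.
        return size, list(indices), list(counts)
    return size, list(indices), [c / total for c in counts]


# Sparse vectors written as (size, indices, values).
# doc0 is the row shown in the issue: ["a", "b", "c"].
doc0 = (3, [0, 1, 2], [1.0, 1.0, 1.0])
# doc1 is derived by hand from ["a", "b", "b", "c", "a"]:
# "a" and "b" occur twice each, "c" once.
doc1 = (3, [0, 1, 2], [2.0, 2.0, 1.0])

for doc in (doc0, doc1):
    size, indices, fractions = term_fractions(*doc)
    print(size, indices, [round(f, 2) for f in fractions])
```

After this step each row's values sum to 1, which is the output the reporter asks for.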

> calculate term frequency in CountVectorizer()
> ---------------------------------------------
>
>                 Key: SPARK-25441
>                 URL: https://issues.apache.org/jira/browse/SPARK-25441
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Xinyong Tian
>            Priority: Major
>
> Currently, CountVectorizer() cannot output TF (term frequency). I hope such an option
> can be added.
> TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf
>  
> example,
> >>> df = spark.createDataFrame(
> ...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
> ...     ["label", "raw"])
> >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> >>> model = cv.fit(df)
> >>> model.transform(df).limit(1).show(truncate=False)
> label        raw             vectors
> 0            [a, b, c]       (3,[0,1,2],[1.0,1.0,1.0])
>  
> Instead, I want:
> 0            [a, b, c]       (3,[0,1,2],[0.33,0.33,0.33])
> That is, each vector is divided by its sum (here 3), so that the new vector sums
> to 1 for every row (document).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


