spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-5888) Add OneHotEncoder as a Transformer
Date Tue, 05 May 2015 19:35:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-5888.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.0

Issue resolved by pull request 5500
[https://github.com/apache/spark/pull/5500]

> Add OneHotEncoder as a Transformer
> ----------------------------------
>
>                 Key: SPARK-5888
>                 URL: https://issues.apache.org/jira/browse/SPARK-5888
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Sandy Ryza
>             Fix For: 1.4.0
>
>
> `OneHotEncoder` takes a categorical column and outputs a vector column, which stores the
> category info as binary values.
> {code}
> val ohe = new OneHotEncoder()
>   .setInputCol("countryIndex")
>   .setOutputCol("countries")
> {code}
> It should read the category info from the metadata and assign feature names properly
> in the output column. We need to discuss the default naming scheme and whether we should let
> it process multiple categorical columns at the same time.
> One category (the most frequent one) should be removed from the output to make the output
> columns linearly independent. Alternatively, this could be an option turned on by default.
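
For illustration only, the behavior described in the issue can be sketched in plain Scala without any Spark dependency. This is not the code from pull request 5500. The object `OneHotSketch`, its method names, and the `<inputCol>_<label>` naming scheme are all assumptions made for this sketch, since the issue leaves the default naming scheme open. It shows how a category index maps to a binary vector, and how dropping the most frequent category shortens the vector by one slot so the output columns no longer always sum to 1.

```scala
// Plain-Scala sketch of one-hot encoding. All names here are hypothetical;
// this is not Spark's OneHotEncoder implementation.
object OneHotSketch {

  /** Index of the most frequent category; ties go to the lowest index. */
  def mostFrequent(indices: Seq[Int]): Int =
    indices
      .groupBy(identity)
      .toSeq
      .map { case (category, occurrences) => (category, occurrences.size) }
      .minBy { case (category, count) => (-count, category) }
      ._1

  /** Output slot names, using an assumed "<inputCol>_<label>" scheme. */
  def featureNames(inputCol: String, labels: Seq[String], dropped: Option[Int]): Seq[String] =
    labels.indices
      .filter(i => !dropped.contains(i))
      .map(i => s"${inputCol}_${labels(i)}")

  /** Encode one category index as a 0/1 vector, skipping the dropped slot if any. */
  def encode(index: Int, numCategories: Int, dropped: Option[Int]): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    (0 until numCategories)
      .filter(i => !dropped.contains(i))
      .map(i => if (i == index) 1.0 else 0.0)
      .toArray
  }
}
```

With labels `Seq("US", "CN", "FR")` and observed indices `Seq(0, 2, 2, 1, 2)`, the most frequent index is 2, so passing `dropped = Some(2)` leaves two output slots and encodes every `FR` row as the all-zero vector.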



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

