spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-20949) Is there another reason for the onehotencoder is different from scikit learn than specified in scaladoc?
Date Thu, 01 Jun 2017 09:02:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-20949.
-------------------------------
    Resolution: Invalid

Questions belong on the mailing list. Intuitively, the last encoding column is knowable
from the others, so it is redundant. The column vectors (not the rows) end up linearly
dependent because of the intercept term: the encoding columns sum to the all-ones column
that the intercept contributes. See a good explanation at http://www.algosome.com/articles/dummy-variable-trap-regression.html
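
To make the column argument concrete, here is a minimal sketch in plain Python. It is not Spark code. The `rank`, `one_hot`, and `design` helpers and the three-level `categories` feature are hypothetical, introduced only for illustration. It compares the rank of a small design matrix with and without the last encoding column, and with and without an intercept column of ones.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        # Find a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Eliminate column c from every row below the pivot row.
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def one_hot(index, size, drop_last):
    """One-hot vector for `index`; optionally drop the last column."""
    vec = [1 if j == index else 0 for j in range(size)]
    return vec[:-1] if drop_last else vec


# Hypothetical categorical feature with 3 levels, one value per data row.
categories = [0, 1, 2, 0, 1, 2]


def design(drop_last, intercept):
    """Design matrix: optional intercept column of ones, then the encoding."""
    return [([1] if intercept else []) + one_hot(c, 3, drop_last)
            for c in categories]


full_with_intercept = design(drop_last=False, intercept=True)    # 4 columns
dropped_with_intercept = design(drop_last=True, intercept=True)  # 3 columns
full_no_intercept = design(drop_last=False, intercept=False)     # 3 columns

for name, mat in [("full + intercept", full_with_intercept),
                  ("drop last + intercept", dropped_with_intercept),
                  ("full, no intercept", full_no_intercept)]:
    print(name, "columns:", len(mat[0]), "rank:", rank(mat))
```

With the intercept and the full encoding, the matrix has four columns but rank 3, so the columns are linearly dependent. Dropping the last encoding column leaves three columns of rank 3. Without an intercept, the full encoding is also full rank, which matches the reporter's observation about [1.0, 0.0] and [0.0, 1.0].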

> Is there another reason for the onehotencoder is different from scikit learn than specified
in scaladoc?
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20949
>                 URL: https://issues.apache.org/jira/browse/SPARK-20949
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Sungjun Kim
>            Priority: Minor
>
> Spark's OneHotEncoder behaves differently from scikit-learn's.
> It encodes one category as the zero vector.
> The scaladoc gives a reason for this. It says that "it makes the vector entries sum
up to one, and hence linearly dependent." But I don't think this is correct. Consider the vectors
[1.0, 0.0] and [0.0, 1.0]. Their entries sum to 1, but they are obviously linearly independent.
Am I missing something, or is there another reason?



