spark-reviews mailing list archives

From actuaryzhang <...@git.apache.org>
Subject [GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...
Date Mon, 22 May 2017 15:11:47 GMT
Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @yanboliang I understand your points. The issue is `OneHotEncoder` only supports `dropLast`.

    The ideal solution to match R exactly (both the category dropped and the ordering of feature
columns) would be to use `alphabetAsc` in `StringIndexer` and `dropFirst` in `OneHotEncoder`.
    
    Without changing `OneHotEncoder`, the best I can do in this PR is to match only the category
that is dropped in R. This ensures that the model interpretation and the magnitude of the coefficients
are consistent with R, but the ordering of the feature columns is still different, which
is a minor issue. That's also why I sorted the coefficients first in the example above to
compare GLM results.
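
    To make the difference concrete, here is a minimal plain-Python sketch (not Spark code).
The helpers `index_labels` and `encoded_columns` are hypothetical names used only for
illustration, and ordering the labels in descending alphabetical order is an assumption about
one way to make a drop-last encoder drop the same category that R drops.

```python
def index_labels(values, descending=False):
    # Distinct labels in alphabetical order; position in the list is the index.
    return sorted(set(values), reverse=descending)


def encoded_columns(labels, drop):
    # Labels that keep a dummy column after dropping the reference category.
    if drop == "first":
        return labels[1:]
    if drop == "last":
        return labels[:-1]
    raise ValueError("drop must be 'first' or 'last'")


values = ["b", "a", "c", "a"]

# R-style: ascending order, drop the first level ("a").
r_style = encoded_columns(index_labels(values), drop="first")

# Workaround with drop-last only: descending order puts "a" last.
workaround = encoded_columns(index_labels(values, descending=True), drop="last")

print(r_style)      # ['b', 'c']
print(workaround)   # ['c', 'b']
```

    Both variants drop "a", so the coefficients refer to the same reference category, but the
remaining columns come out in opposite order. Sorting by label name lines them up again.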
    
    Please let me know if this is clear, and your thoughts on `OneHotEncoder`. If adding a `dropFirst`
option is preferred, I can also update `OneHotEncoder`, but that may cause some disruption. Thanks.

    




