spark-reviews mailing list archives

From actuaryzhang <>
Subject [GitHub] spark issue #17879: [SPARK-20619][ML]StringIndexer supports multiple ways of...
Date Sat, 06 May 2017 07:57:27 GMT
Github user actuaryzhang commented on the issue:
    @holdenk The main motivation for this PR is that the behavior of StringIndexer affects
OneHotEncoder, RFormula, and any models estimated on top of these transformers. There have been
a few desired improvements in RFormula that could not be made without this change to StringIndexer.
    One use case for alphabetical ordering is comparing Spark model results with those from
R, which drops the first alphabetical value in one-hot encoding. Right now, even though
we do lots of comparisons between Spark and R, we lack comparisons involving String features
because the encoding is different. There is already a JIRA for this.

    Another motivation for this PR is to support ascending order by label frequency. This
is also related to one-hot encoding. In practical applications of regression-type models,
it is almost always better to set the most frequent label as the reference level (i.e., drop
the most frequent label in OneHotEncoder) for better interpretability. Right now, the behavior
is the opposite (the least frequent label is dropped), which makes results very difficult to interpret.
    I think the flexibility of different orderings will greatly benefit the downstream feature
transformers and model estimators. Does this make sense?
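
    To make the effect of ordering concrete, here is a minimal plain-Python sketch (not Spark
code). It assumes the one-hot encoder drops the category with the last index, and the helper
names `index_labels` and `reference_level`, the ordering names, and the alphabetical tie-break
for equal frequencies are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def index_labels(values, order_type="frequencyDesc"):
    """Return the distinct labels in index order (list position = index).

    The order_type names are illustrative assumptions, not a confirmed API.
    Frequency ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    if order_type == "frequencyDesc":
        return sorted(counts, key=lambda label: (-counts[label], label))
    if order_type == "frequencyAsc":
        return sorted(counts, key=lambda label: (counts[label], label))
    if order_type == "alphabetAsc":
        return sorted(counts)
    if order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    raise ValueError(f"unknown order_type: {order_type}")


def reference_level(labels):
    """Label that becomes the reference level.

    Assumption: the encoder drops the category with the last index.
    """
    return labels[-1]


# Toy column: "b" appears 3 times, "a" twice, "c" once.
values = ["b", "a", "b", "c", "b", "a"]

for order_type in ("frequencyDesc", "frequencyAsc", "alphabetAsc", "alphabetDesc"):
    labels = index_labels(values, order_type)
    print(order_type, labels, "dropped:", reference_level(labels))
```

    Under these assumptions, descending frequency drops the least frequent label, ascending
frequency drops the most frequent one, and descending alphabetical order drops the first
alphabetical value, which is the level R would drop.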
