spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21005) VectorIndexerModel does not prepare output column field correctly
Date Mon, 09 Apr 2018 19:12:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431079#comment-16431079 ]

Joseph K. Bradley commented on SPARK-21005:
-------------------------------------------

I don't actually see why this is a problem: If a feature is categorical, we should not silently
convert it to continuous.  To use a high-arity categorical feature in a decision tree, one
should convert it to a different representation first, such as hashing to a set of bins with
HashingTF.
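
As a rough illustration of the "hash to a set of bins" idea, here is a minimal pure-Python
sketch. It is not what HashingTF does internally: the function names hash_to_bin and
hash_feature are hypothetical, and MD5 from hashlib is used only as a stand-in for a stable
hash function.

```python
import hashlib


def hash_to_bin(value, num_bins):
    """Map a categorical value to a bin index in [0, num_bins).

    hashlib is used instead of the built-in hash() so that the mapping
    is stable across interpreter runs.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def hash_feature(values, num_bins=32):
    """Replace every category in a column with its bin index."""
    return [hash_to_bin(v, num_bins) for v in values]


# A feature with 1000 distinct categories collapses to at most 32 bins,
# so its arity can no longer exceed a tree's maxBins setting of 32.
user_ids = ["user_%d" % i for i in range(1000)]
binned = hash_feature(user_ids, num_bins=32)
```

The trade-off is that unrelated categories can collide in the same bin, so the number of
bins should be chosen with that loss of resolution in mind.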

That said, I do think we should clarify this behavior in the VectorIndexer docstring.  I know
it's been a long time since you sent your PR, but would you be willing to change it so that it
only updates the docs?  If you're busy now, I'd be happy to take it over.  Thanks!

> VectorIndexerModel does not prepare output column field correctly
> -----------------------------------------------------------------
>
>                 Key: SPARK-21005
>                 URL: https://issues.apache.org/jira/browse/SPARK-21005
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Chen Lin
>            Priority: Major
>
> From my understanding of the documentation, VectorIndexer decides which
> features should be categorical based on the number of distinct values: features with
> at most maxCategories distinct values are declared categorical, while features that exceed
> maxCategories are declared continuous.
> Currently, VectorIndexerModel works correctly on a dataset which has an empty schema.
> However, when VectorIndexerModel transforms a dataset with `ML_ATTR` metadata, it
> may not output the expected result. For example, a feature with a nominal attribute whose
> number of distinct values exceeds maxCategories is not treated as a continuous feature, as we
> expected, but is still treated as a categorical feature. This may cause all the tree-based
> algorithms (like Decision Tree, Random Forest, GBDT, etc.) to throw errors such as
> "DecisionTree requires maxBins (= $maxPossibleBins)
> to be at least as large as the number of values in each categorical feature, but categorical
> feature $maxCategory has $maxCategoriesPerFeature values. Considering remove this and other
> categorical features with a large number of values, or add more training examples.".
> Correct me if my understanding is wrong.
> I will submit a PR soon to resolve this issue.
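
The rule described in the report, and the tree-side check quoted in the error message, can be
sketched in plain Python as follows. This is only an illustration under stated assumptions, not
Spark's implementation: decide_feature_types and check_max_bins are hypothetical names, and the
sketch ignores any pre-existing `ML_ATTR` metadata.

```python
def decide_feature_types(rows, max_categories):
    """Return {feature_index: sorted distinct values} for categorical features.

    A feature is treated as categorical when it has at most
    max_categories distinct values; all other features are continuous.
    """
    num_features = len(rows[0])
    categorical = {}
    for j in range(num_features):
        distinct = {row[j] for row in rows}
        if len(distinct) <= max_categories:
            categorical[j] = sorted(distinct)
    return categorical


def check_max_bins(arity_by_feature, max_bins):
    """Return the features whose number of categories exceeds max_bins.

    This mirrors the condition in the quoted error message: max_bins must
    be at least as large as the arity of every categorical feature.
    """
    return [j for j, arity in arity_by_feature.items() if arity > max_bins]


rows = [
    (0.0, 1.5, 10.0),
    (1.0, 2.5, 20.0),
    (0.0, 3.5, 30.0),
    (1.0, 4.5, 10.0),
]
# Features 0 and 2 have at most 3 distinct values; feature 1 has 4.
categorical = decide_feature_types(rows, max_categories=3)

# A nominal feature declared with 100 values trips the check for max_bins=32.
offending = check_max_bins({0: 2, 5: 100}, max_bins=32)
```

In this sketch, a feature that arrives already marked as nominal with 100 values would be
flagged by check_max_bins regardless of max_categories, which is the situation the report
describes.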



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
