spark-issues mailing list archives

From "Barry Becker (JIRA)" <>
Subject [jira] [Commented] (SPARK-6162) Handle missing values in GBM
Date Tue, 27 Mar 2018 13:21:00 GMT


Barry Becker commented on SPARK-6162:

If we all agree that this is something that would be very nice to have, why is it closed as
Won't Fix instead of just being deferred to a future release?

This seems like a big limitation of tree models in Spark.

> Handle missing values in GBM
> ----------------------------
>                 Key: SPARK-6162
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Devesh Parekh
>            Priority: Major
> We build a lot of predictive models over data combined from multiple sources, where some
> entries may not have all sources of data and so some values are missing in each feature vector.
> Another place this might come up is if you have features from slightly heterogeneous items
> (or items composed of heterogeneous subcomponents) that share many features in common but
> may have extra features for different types, and you don't want to manually train models for
> every different type.
> R's GBM library, which is what we are currently using, deals with this type of data nicely
> by making "missing" nodes in the decision tree (a surrogate split) for features that can have
> missing values. We'd like to do the same with MLlib, but LabeledPoint would need to support
> missing values, and GradientBoostedTrees would need to be modified to deal with them.
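
For readers unfamiliar with the "missing" node idea described in the issue, here is a minimal
sketch in plain Python of a tree whose splits carry a third branch for missing values. It is
not Spark or MLlib code, and it is not taken from R's GBM source; the names Leaf, Split,
is_missing and predict are hypothetical, and treating None or NaN as "missing" and sending
values below the threshold to the left are assumptions made only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Leaf:
    """Terminal node holding a prediction value (hypothetical structure)."""
    value: float


@dataclass
class Split:
    """Internal node with three children: left, right, and missing."""
    feature: int          # index into the feature vector
    threshold: float      # values below the threshold go left (assumed convention)
    left: "Node"
    right: "Node"
    missing: "Node"       # taken when the feature value is absent


Node = Union[Leaf, Split]


def is_missing(x: Optional[float]) -> bool:
    """Assumption: a value is missing if it is None or NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def predict(node: Node, features: Sequence[Optional[float]]) -> float:
    """Walk the tree, routing missing values down the dedicated branch."""
    while isinstance(node, Split):
        x = features[node.feature]
        if is_missing(x):
            node = node.missing
        elif x < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# A one-split tree on feature 0 with a separate prediction for missing values.
tree = Split(feature=0, threshold=5.0,
             left=Leaf(-1.0), right=Leaf(1.0), missing=Leaf(0.25))
```

With a tree like this, a row whose feature 0 is absent gets its own prediction instead of
being dropped or imputed before training, which is the behaviour the issue asks for.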
