spark-issues mailing list archives

From Yan Facai (颜发才) (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller
Date Mon, 06 Nov 2017 13:29:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284
] 

Yan Facai (颜发才) edited comment on SPARK-3383 at 11/6/17 1:28 PM:
-----------------------------------------------------------------

[~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that
unordered features can benefit a lot from your work.


was (Author: facai):
[~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that
unordered features can benefit a lot from the PR.

> DecisionTree aggregate size could be smaller
> --------------------------------------------
>
>                 Key: SPARK-3383
>                 URL: https://issues.apache.org/jira/browse/SPARK-3383
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).  The savings
would be significant for datasets with many low-arity categorical features (binary features,
or unordered categorical features).  Savings would be negligible for continuous features.
> DecisionTree stores a vector of sufficient statistics for each (node, feature, bin).  We
could store 1 fewer bin per (node, feature): For a given (node, feature), if we store these
vectors for all but the last bin, and also store the total statistics for each node, then
we could compute the statistics for the last bin.  For binary and unordered categorical features,
this would cut in half the number of bins to store and communicate.
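
A minimal sketch of the idea described in the issue, not Spark's actual implementation. It assumes the sufficient statistics are additive per-class label counts and that every instance falls into exactly one bin of each feature, so any feature's bins sum to the same node total. The feature names and the helpers `compress_node` and `recover_last_bin` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Tuple

Stats = List[int]  # sufficient statistics for one bin, e.g. per-class label counts


def compress_node(
    bins_by_feature: Dict[str, List[Stats]],
) -> Tuple[Dict[str, List[Stats]], Stats]:
    """Drop the last bin of every feature and keep one total per node."""
    # Assumption: each instance lands in exactly one bin of each feature,
    # so the bins of any single feature sum to the node total.
    first = next(iter(bins_by_feature.values()))
    total = [sum(col) for col in zip(*first)]
    stored = {name: bins[:-1] for name, bins in bins_by_feature.items()}
    return stored, total


def recover_last_bin(stored_bins: List[Stats], total: Stats) -> Stats:
    """Last bin = node total minus the sum of the stored bins."""
    last = list(total)
    for stats in stored_bins:
        last = [t - s for t, s in zip(last, stats)]
    return last


# Hypothetical node with two features; each inner list is one bin's class counts.
node = {
    "is_member": [[3, 1], [2, 4]],           # binary feature: 2 bins
    "age_bucket": [[1, 2], [2, 2], [2, 1]],  # 3 bins
}
stored, total = compress_node(node)
```

For the binary feature, only one of the two bins is stored, and the other is rebuilt from the node total on demand with `recover_last_bin(stored["is_member"], total)`. With many low-arity features the single extra total vector per node is amortized across all of them, which is where the savings in the description come from.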



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

