spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-3160) Simplify DecisionTree data structure for training
Date Wed, 10 Sep 2014 04:02:29 GMT

     [ https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-3160:
-------------------------------------
    Description: 
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.

This would let us eliminate the flat array of nodes, thus saving storage when we do not grow
a full tree.  It would also potentially make it easier to pass subtrees to compute nodes for
local training.
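
As a rough illustration of the storage argument above, here is a minimal Python sketch. It is not the MLlib code: the `Node` class, the `2 * id + 1` / `2 * id + 2` child indexing, the helper `flat_array_slots`, and the impurity values are all assumptions made for the example. It compares the number of nodes a growing tree allocates with the slots a flat array for a full binary tree would reserve, and keeps the impurity on the node instead of in a separate parentImpurities array.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical training-time node; not the MLlib class."""
    node_id: int
    impurity: float                 # kept on the node, not in a side array
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def split(self, left_impurity: float, right_impurity: float) -> None:
        # Children are allocated only when a node is actually split.
        self.left = Node(2 * self.node_id + 1, left_impurity)
        self.right = Node(2 * self.node_id + 2, right_impurity)

    def count(self) -> int:
        # Number of nodes in the subtree rooted here.
        children = (self.left, self.right)
        return 1 + sum(c.count() for c in children if c is not None)


def flat_array_slots(max_depth: int) -> int:
    # A flat array for a full binary tree reserves one slot per possible node.
    return 2 ** (max_depth + 1) - 1


# Grow a lopsided tree to depth 3: only the left child is split each time.
root = Node(0, impurity=0.5)
node = root
for _ in range(3):
    node.split(0.4, 0.0)            # arbitrary illustrative impurities
    node = node.left

allocated = root.count()            # nodes actually created
reserved = flat_array_slots(3)      # slots a flat array would reserve
```

In this sketch the lopsided tree allocates 7 nodes, while a flat array sized for depth 3 would reserve 15 slots. A subtree is also just a reference to one `Node`, which is the property that could make handing subtrees to other machines simpler.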

Note:
* This JIRA used to have this item as well: We could have a “LearningNode extends Node”
setup where the LearningNode holds metadata for learning (such as impurities).  The test-time
model could be extracted from this training-time model, so that extra information (such as
impurities) does not have to be kept after training.
* However, this is really a separate issue, so I removed it.
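
For reference, the separate “LearningNode extends Node” idea mentioned in the note could look roughly like the following Python sketch. All names other than `Node` and `LearningNode` (the fields, `predict`, `to_node`) are hypothetical, and the split rule is a simple threshold test chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical test-time node: only what prediction needs."""
    prediction: float
    feature: Optional[int] = None      # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def predict(self, x):
        if self.feature is None:
            return self.prediction
        child = self.left if x[self.feature] <= self.threshold else self.right
        return child.predict(x)


@dataclass
class LearningNode(Node):
    """Hypothetical training-time node: adds learning metadata."""
    impurity: float = 0.0

    def to_node(self) -> Node:
        # Copy the structure, dropping the training-only fields.
        return Node(
            prediction=self.prediction,
            feature=self.feature,
            threshold=self.threshold,
            left=self.left.to_node() if self.left is not None else None,
            right=self.right.to_node() if self.right is not None else None,
        )


# Training-time tree with impurities attached to every node.
training_root = LearningNode(
    prediction=0.5, feature=0, threshold=1.0, impurity=0.5,
    left=LearningNode(prediction=0.0, impurity=0.0),
    right=LearningNode(prediction=1.0, impurity=0.0),
)

# Test-time model extracted from the training-time model.
model = training_root.to_node()
```

The extracted `model` predicts the same way as the training tree but carries no impurity field, which is the saving the note describes.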

  was:
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.  For this, we could have
a “LearningNode extends Node” setup where the LearningNode holds metadata for learning
(such as impurities).  The test-time model could be extracted from this training-time model,
so that extra information (such as impurities) does not have to be kept after training.

This would let us eliminate the flat array of nodes, thus saving storage when we do not grow
a full tree.  It would also potentially make it easier to pass subtrees to compute nodes for
local training.



> Simplify DecisionTree data structure for training
> -------------------------------------------------
>
>                 Key: SPARK-3160
>                 URL: https://issues.apache.org/jira/browse/SPARK-3160
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Minor
>
> Improvement: code clarity
> Currently, we maintain a tree structure, a flat array of nodes, and a parentImpurities
> array.
> Proposed fix: Maintain everything within a growing tree structure.
> This would let us eliminate the flat array of nodes, thus saving storage when we do not
> grow a full tree.  It would also potentially make it easier to pass subtrees to compute nodes
> for local training.
> Note:
> * This JIRA used to have this item as well: We could have a “LearningNode extends Node”
> setup where the LearningNode holds metadata for learning (such as impurities).  The test-time
> model could be extracted from this training-time model, so that extra information (such as
> impurities) does not have to be kept after training.
> * However, this is really a separate issue, so I removed it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

