spark-issues mailing list archives

From "Qiping Li (JIRA)" <>
Subject [jira] [Commented] (SPARK-3155) Support DecisionTree pruning
Date Tue, 26 Aug 2014 02:42:57 GMT


Qiping Li commented on SPARK-3155:

Hi Joseph, glad to see you are considering support for pruning in MLlib's decision tree.
Is someone already working on this issue? If not, could you assign it to me?
I'm ready to help with this module.

> Support DecisionTree pruning
> ----------------------------
>                 Key: SPARK-3155
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision trees. 
A smart implementation can prune the tree during training in order to avoid training parts
of the tree which would be pruned eventually anyway.  DecisionTree does not currently support
pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but with zero
or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) based on the
training set.
> (3) Take a held-out validation set, and use the tree to make predictions for each validation
example.  This allows one to compute the validation error made at each node in the tree (based
on the predictions computed in step (2)).
> (4) For each pair of leaves with the same parent, compare the total error on the validation
set made by the leaves' predictions with the error made by the parent's prediction.  Remove
the leaves if the parent has lower error.
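As a concrete illustration, the naive procedure could look like the following. This is a hedged, self-contained Python sketch, not MLlib code: the `Node` class and `prune` function are illustrative names. It covers steps (3) and (4), assuming step (2) has already stored each node's optimal training-set prediction:

```python
# Hypothetical sketch (not Spark MLlib API) of steps (3)-(4) above, assuming
# step (2) already stored each node's optimal training-set prediction.

class Node:
    """Toy binary decision-tree node; `feature is None` marks a leaf."""
    def __init__(self, prediction, feature=None, threshold=None,
                 left=None, right=None):
        self.prediction = prediction  # optimal prediction from step (2)
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.feature is None

def prune(node, X, y):
    """Recursively prune `node` against validation data (X, y) and return
    the number of validation errors the (possibly pruned) subtree makes."""
    as_leaf = sum(1 for t in y if t != node.prediction)
    if node.is_leaf():
        return as_leaf
    # route each validation example to the child that would predict it
    lmask = [x[node.feature] <= node.threshold for x in X]
    left_err = prune(node.left,
                     [x for x, m in zip(X, lmask) if m],
                     [t for t, m in zip(y, lmask) if m])
    right_err = prune(node.right,
                      [x for x, m in zip(X, lmask) if not m],
                      [t for t, m in zip(y, lmask) if not m])
    if as_leaf <= left_err + right_err:
        # the parent alone does no worse than its children: remove them
        node.feature = node.left = node.right = None
        return as_leaf
    return left_err + right_err
```

For example, a split whose right leaf predicts 1 gets collapsed into its parent when every held-out label routed through that split is 0, since the parent's single prediction then makes fewer validation errors than its two children combined.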
> A smarter implementation prunes during training, computing the error on the validation
set made by each node as it is trained.  Whenever two children increase the validation error,
they are pruned, and no more training is required on that branch.
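A minimal sketch of this training-time variant, again hypothetical and not MLlib code: a toy trainer over a single numeric feature that evaluates each candidate split on the held-out set and stops expanding a branch as soon as the children would not reduce validation error. The names `train_pruned` and `majority`, the dict-based nodes, and the midpoint split rule are all illustrative choices:

```python
# Hedged sketch of pruning during training: before recursing into a split's
# children, compare the validation error the children would make against the
# parent's; if they are no better, emit a leaf so the branch is never trained.
from collections import Counter

def majority(labels):
    # optimal prediction at a node: the most common training label
    return Counter(labels).most_common(1)[0][0]

def n_errors(pred, labels):
    return sum(1 for t in labels if t != pred)

def train_pruned(X, y, Xv, yv, max_depth=3):
    """Grow a tree on (X, y) over a single numeric feature (toy setting),
    using held-out data (Xv, yv) to cut off unhelpful branches early.
    Nodes are plain dicts; a dict without a "thr" key is a leaf."""
    pred = majority(y)
    node = {"pred": pred}
    if max_depth == 0 or len(set(y)) <= 1:
        return node
    thr = sorted(x[0] for x in X)[len(X) // 2]  # toy midpoint split rule
    tl = [(x, t) for x, t in zip(X, y) if x[0] <= thr]
    tr = [(x, t) for x, t in zip(X, y) if x[0] > thr]
    if not tl or not tr:
        return node
    vl = [(x, t) for x, t in zip(Xv, yv) if x[0] <= thr]
    vr = [(x, t) for x, t in zip(Xv, yv) if x[0] > thr]
    child_err = (n_errors(majority([t for _, t in tl]), [t for _, t in vl]) +
                 n_errors(majority([t for _, t in tr]), [t for _, t in vr]))
    if child_err >= n_errors(pred, yv):
        return node  # children would not help on held-out data: stop here
    node["thr"] = thr
    node["left"] = train_pruned([x for x, _ in tl], [t for _, t in tl],
                                [x for x, _ in vl], [t for _, t in vl],
                                max_depth - 1)
    node["right"] = train_pruned([x for x, _ in tr], [t for _, t in tr],
                                 [x for x, _ in vr], [t for _, t in vr],
                                 max_depth - 1)
    return node
```

The saving over the naive version is that a rejected split's subtrees are never built at all, which is the point of pruning during training rather than after it.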
> It is common to use about 1/3 of the data for pruning.  Note that pruning is important
when using a tree directly for prediction.  It is less important when combining trees via
ensemble methods.

This message was sent by Atlassian JIRA
