spark-dev mailing list archives

From Alessandro Solimando <alessandro.solima...@gmail.com>
Subject Re: redundant decision tree model
Date Tue, 13 Feb 2018 13:39:57 GMT
Thanks for your feedback, Sean. I agree with you.

I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-23409).
I will take a closer look at the code and see if I can come up with a
PR to handle this.
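
Conceptually, the pruning I have in mind is a bottom-up collapse like
the following (just a sketch on simplified stand-in node classes, since
the constructors of ml.tree's LeafNode/InternalNode are private[ml]; an
actual PR would apply the same recursion to the real Node hierarchy):

// Simplified stand-ins for ml.tree's node classes; only what the
// recursion needs. These are illustrative, not Spark's classes.
sealed trait SimpleNode { def prediction: Double }
case class SimpleLeaf(prediction: Double) extends SimpleNode
case class SimpleInternal(prediction: Double,
                          left: SimpleNode,
                          right: SimpleNode) extends SimpleNode

// Prune children first, then collapse an internal node whose two
// (pruned) children are leaves sharing the same prediction.
def prune(node: SimpleNode): SimpleNode = node match {
  case leaf: SimpleLeaf => leaf
  case SimpleInternal(p, l, r) =>
    (prune(l), prune(r)) match {
      case (SimpleLeaf(lp), SimpleLeaf(rp)) if lp == rp => SimpleLeaf(lp)
      case (pl, pr) => SimpleInternal(p, pl, pr)
    }
}

Applied to the example tree from this thread, the whole "feature 1 <=
0.5" subtree would collapse into a single leaf predicting 0.0.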

On 13 February 2018 at 12:00, Sean Owen <srowen@gmail.com> wrote:

> I think the simple pruning you have in mind was just never implemented.
>
> That sort of pruning wouldn't help much if the nodes maintained a
> distribution over classes, as those are rarely identical, but they just
> maintain a single class prediction. After training, I see no value in
> keeping those nodes. Whatever impurity gain the split managed on the
> training data is 'lost' when the prediction is collapsed to a single class
> anyway.
>
> Whether it's easy to implement in the code I don't know, but it's
> straightforward conceptually.
>
> On Tue, Feb 13, 2018 at 4:21 AM Alessandro Solimando <
> alessandro.solimando@gmail.com> wrote:
>
>> Hello Nick,
>> thanks for the pointer, that's interesting.
>>
>> However, there seems to be a major difference from what I was discussing.
>>
>> The JIRA issue relates to overfitting and considerations about information
>> gain, while what I propose is a much simpler "syntactic" pruning.
>>
>> Consider a fragment of the example above, the leftmost subtree in
>> particular:
>>
>> If (feature 1 <= 0.5)
>>>    If (feature 2 <= 0.5)
>>>     If (feature 0 <= 0.5)
>>>      Predict: 0.0
>>>     Else (feature 0 > 0.5)
>>>      Predict: 0.0
>>>    Else (feature 2 > 0.5)
>>>     If (feature 0 <= 0.5)
>>>      Predict: 0.0
>>>     Else (feature 0 > 0.5)
>>>      Predict: 0.0
>>
>>
>> Which corresponds to the following "objects":
>>
>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>> --InternalNode(prediction = 0.0, impurity = 0.345679012345679, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@ffe00000)
>>> ---InternalNode(prediction = 0.0, impurity = 0.4444444444444445, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>> ---InternalNode(prediction = 0.0, impurity = 0.2777777777777777, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>
>>
>> A more comprehensive node-splitting policy based on impurity might
>> admittedly prevent this situation (by splitting node "ffe00000" you get
>> an impurity gain on one child and a loss on the other), but independently
>> of this, once the tree is built, I would cut the redundant subtree and
>> obtain the following:
>>
>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>> --LeafNode(prediction = 0.0, impurity = ...)
>>
>>
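>> Detecting these redundant subtrees only needs the public Node API; a
>> minimal (untested) sketch of the check:
>>
>> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
>>
>> // Some(p) iff every leaf below `node` predicts p, i.e. the whole
>> // subtree could be collapsed into a single leaf.
>> def uniformPrediction(node: Node): Option[Double] = node match {
>>   case leaf: LeafNode => Some(leaf.prediction)
>>   case n: InternalNode =>
>>     for {
>>       l <- uniformPrediction(n.leftChild)
>>       r <- uniformPrediction(n.rightChild)
>>       if l == r
>>     } yield l
>> }
>>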
>> I cannot say whether this is relevant for all the tree ensemble methods,
>> but it certainly is for RF, even more than for DT, as the leverage effect
>> will be even higher (and the code generating them is the same: DT calls
>> RF with numTrees = 1, as far as I can see).
>>
>> Being an optimization aimed at reducing the model's memory footprint and
>> invocation time, it is independent of any consideration about the
>> statistical treatment of overfitting, which your reply seems to imply.
>>
>> Am I missing something?
>>
>> Best regards,
>> Alessandro
>>
>>
>>
>> On 13 February 2018 at 10:57, Nick Pentreath <nick.pentreath@gmail.com>
>> wrote:
>>
>>> There is a long outstanding JIRA issue about it:
>>> https://issues.apache.org/jira/browse/SPARK-3155.
>>>
>>> It is probably still a useful feature to have for trees, but the
>>> priority is not that high, since it may not be that useful for the tree
>>> ensemble models.
>>>
>>>
>>> On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <
>>> alessandro.solimando@gmail.com> wrote:
>>>
>>>> Hello community,
>>>> I have recently manually inspected some decision trees computed with
>>>> Spark (2.2.1, but the behavior is the same with the latest code on the
>>>> repo).
>>>>
>>>> I have observed that the trees are always complete, even when an entire
>>>> subtree leads to the same prediction in all of its leaves.
>>>>
>>>> In such case, the root of the subtree, instead of being an
>>>> InternalNode, could simply be a LeafNode with the (shared) prediction.
>>>>
>>>> I know that decision trees computed by scikit-learn show the same
>>>> behavior; I understand that this happens by construction, because the
>>>> redundancy only becomes apparent at the end.
>>>>
>>>> So my question is, why is this "post-pruning" missing?
>>>>
>>>> Three hypotheses:
>>>>
>>>> 1) It is not suitable (for a reason I fail to see)
>>>> 2) Such an addition to the code is considered not worth it (in terms of
>>>> code complexity, maybe)
>>>> 3) It has been overlooked, but could be a favorable addition
>>>>
>>>> For clarity, I have managed to isolate a small case that reproduces
>>>> this, shown in what follows.
>>>>
>>>> This is the dataset:
>>>>
>>>>> +-----+-------------+
>>>>> |label|features     |
>>>>> +-----+-------------+
>>>>> |1.0  |[1.0,0.0,1.0]|
>>>>> |1.0  |[0.0,1.0,0.0]|
>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>> |0.0  |[0.0,0.0,0.0]|
>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>> |1.0  |[0.0,0.0,0.0]|
>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>> |0.0  |[1.0,0.0,0.0]|
>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>> |0.0  |[1.0,1.0,1.0]|
>>>>> |0.0  |[1.0,1.0,0.0]|
>>>>> |1.0  |[1.0,1.0,1.0]|
>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>> +-----+-------------+
>>>>
>>>>
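>>>> A snippet along these lines reproduces it (a sketch; variable names
>>>> are mine, the classifier uses default parameters, and an active
>>>> SparkSession named "spark" is assumed):
>>>>
>>>> import org.apache.spark.ml.classification.DecisionTreeClassifier
>>>> import org.apache.spark.ml.linalg.Vectors
>>>> import spark.implicits._
>>>>
>>>> val data = Seq(
>>>>   (1.0, Vectors.dense(1.0, 0.0, 1.0)),
>>>>   (1.0, Vectors.dense(0.0, 1.0, 0.0)),
>>>>   // ... the remaining rows, exactly as in the table above ...
>>>>   (0.0, Vectors.dense(1.0, 0.0, 1.0))
>>>> ).toDF("label", "features")
>>>>
>>>> val model = new DecisionTreeClassifier().fit(data)
>>>> println(model.toDebugString)
>>>>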
>>>> Which generates the following model:
>>>>
>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with
>>>>> 15 nodes
>>>>>   If (feature 1 <= 0.5)
>>>>>    If (feature 2 <= 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 0.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 0.0
>>>>>    Else (feature 2 > 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 0.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 0.0
>>>>>   Else (feature 1 > 0.5)
>>>>>    If (feature 2 <= 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 1.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 1.0
>>>>>    Else (feature 2 > 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 0.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 0.0
>>>>
>>>>
>>>> As you can see, the following model would be equivalent, but smaller
>>>> and shallower:
>>>>
>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with
>>>>> 5 nodes
>>>>>   If (feature 1 <= 0.5)
>>>>>    Predict: 0.0
>>>>>   Else (feature 1 > 0.5)
>>>>>    If (feature 2 <= 0.5)
>>>>>     Predict: 1.0
>>>>>    Else (feature 2 > 0.5)
>>>>>     Predict: 0.0
>>>>
>>>>
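>>>> The size difference is easy to quantify over the public Node API
>>>> (again just a sketch):
>>>>
>>>> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
>>>>
>>>> // Counts the nodes of a (sub)tree: 15 for the model above, 5 for
>>>> // the collapsed version.
>>>> def numNodes(node: Node): Int = node match {
>>>>   case _: LeafNode => 1
>>>>   case n: InternalNode =>
>>>>     1 + numNodes(n.leftChild) + numNodes(n.rightChild)
>>>> }
>>>>
>>>> numNodes(model.rootNode)
>>>>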
>>>> This happens pretty often in real cases, and despite the small gain for
>>>> a single invocation of the "optimized" model, the saving can become
>>>> non-negligible when the number of calls is massive, as one can expect
>>>> in a Big Data context.
>>>>
>>>> I would appreciate your opinion on this matter (if relevant for a PR or
>>>> not, pros/cons etc).
>>>>
>>>> Best regards,
>>>> Alessandro
>>>>
>>>
>>
