mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From deneche abdelhakim <adene...@gmail.com>
Subject Re: default tree builder
Date Thu, 07 Oct 2010 04:13:01 GMT
> Does this refer to the random forest stuff?

Yes, and the DefaultTreeBuilder is used by all implementations
(sequential, InMem and Partial).

> Andrey has a question about the Random Forest code.  (hurray for him for
> being the first outsider to beat on it)

Hurray =D

I will make the fix and create a patch as soon as possible.

I wonder if Andrey good give us the data that lead to this Bug. Just
to make sure it's not another hidden bug that's causing the infinite
recursion.

On Wed, Oct 6, 2010 at 10:22 PM, Ted Dunning <tdunning@apache.org> wrote:
> Andrey has a question about the Random Forest code.  (hurray for him for
> being the first outsider to beat on it)
>
> This looks like a simple and fairly obvious fix, but I can't afford enough
> attention to verify.
>
> Deneche?
>
> ---------- Forwarded message ----------
> From: Andrey Gusev <andrey.gusev@gmail.com>
> Date: Wed, Oct 6, 2010 at 1:41 PM
> Subject: default tree builder
> To: Ted Dunning <tdunning@apache.org>
>
>
> Hi Ted,
>
> Unrelated to KCCA discussion. I have some pretty positive results with
> bagging of DefaultTreeBuilder-id3 like implementation. However, there is
> seems to be an issue with building possible running into infinite recursion,
> i.e. stack overflow. Basically when you have few features, it is possible
> that all results left to split can not be split according to any attribute
> so in my cases line 107 of DefaultTreeBuilder keeps being called with
> identical set. More precisely loSubset is empty and hiSubset contain few
> instances which can not be split. I modified this into LimitDepthTreeBuilder
> with something like:
>
>
>    public Node build(Random rng, Data data) {
>        return inner_build(rng, data, 0);
>    }
>
>    public Node inner_build(Random rng, Data data, int depth) {
>
>        if (selected == null) {
>            selected = new boolean[data.getDataset().nbAttributes()];
>        }
>
>        if (data.isEmpty()) {
>            return new Leaf(-1);
>        }
>        if (isIdentical(data)) {
>            return new Leaf(data.majorityLabel(rng));
>        }
>        if (data.identicalLabel()) {
>            return new Leaf(data.get(0).label);
>        }
>        // SFDC change to prevent stack overflow
>        if (depth > MAX_DEPTH) {
>            return new Leaf(data.majorityLabel(rng));
>        }
> ...
>
> }
>
> I know mahout-0.4 is pretty close to freeze (or past it). But this is simple
> enough change and with high enough MAX_DEPTH (say 100) should not have much
> functional impact unless there is actually infinite recursion. Or it could
> default to not use it but have a setter for that. Let me know what you
> think.
>
> Thanks,
>
> Andrey
>

Mime
View raw message