mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <tdunn...@apache.org>
Subject Fwd: default tree builder
Date Wed, 06 Oct 2010 21:22:44 GMT
Andrey has a question about the Random Forest code.  (hurray for him for
being the first outsider to beat on it)

This looks like a simple and fairly obvious fix, but I can't afford enough
attention to verify.

Deneche?

---------- Forwarded message ----------
From: Andrey Gusev <andrey.gusev@gmail.com>
Date: Wed, Oct 6, 2010 at 1:41 PM
Subject: default tree builder
To: Ted Dunning <tdunning@apache.org>


Hi Ted,

Unrelated to KCCA discussion. I have some pretty positive results with
bagging of DefaultTreeBuilder-id3 like implementation. However, there is
seems to be an issue with building possible running into infinite recursion,
i.e. stack overflow. Basically when you have few features, it is possible
that all results left to split can not be split according to any attribute
so in my cases line 107 of DefaultTreeBuilder keeps being called with
identical set. More precisely loSubset is empty and hiSubset contain few
instances which can not be split. I modified this into LimitDepthTreeBuilder
with something like:


    public Node build(Random rng, Data data) {
        return inner_build(rng, data, 0);
    }

    public Node inner_build(Random rng, Data data, int depth) {

        if (selected == null) {
            selected = new boolean[data.getDataset().nbAttributes()];
        }

        if (data.isEmpty()) {
            return new Leaf(-1);
        }
        if (isIdentical(data)) {
            return new Leaf(data.majorityLabel(rng));
        }
        if (data.identicalLabel()) {
            return new Leaf(data.get(0).label);
        }
        // SFDC change to prevent stack overflow
        if (depth > MAX_DEPTH) {
            return new Leaf(data.majorityLabel(rng));
        }
...

}

I know mahout-0.4 is pretty close to freeze (or past it). But this is simple
enough change and with high enough MAX_DEPTH (say 100) should not have much
functional impact unless there is actually infinite recursion. Or it could
default to not use it but have a setter for that. Let me know what you
think.

Thanks,

Andrey

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message