mahout-user mailing list archives

From Anca Leuca <ancaleuca2...@gmail.com>
Subject Re: The function of the parameter complemented in DecisionTreeBuilder
Date Fri, 02 Nov 2012 17:07:05 GMT
Hi,

> However, when complemented = true, the split is still based on the same
> possible values of C from the data that is passed to the method.


Yes. The split is indeed based on a subset of the data.


> As said by
> the code  from line 278 to line 280, if a value of C is contained in the
> entire dataset, but not the data that is passed to the method, the continue
> statement is executed. So those values of C that are not contained in the
> data passed to the method do not affect the method.
>

Not sure what you mean by 'affect the method'. I think the datapoints that
refer to values of C not contained in the data passed are not meant to
change the calculations.
Also, *continue* is being called twice: in the loop 277-285 and the loop
303-317, under the same conditions. So technically I don't think there's a
bug there, although admittedly it's not a very clean/obvious solution :).


> In a word, whether complemented is true or false, the result after
> executing the code from line 267 to line 285 is the same.
>

Again, I am not sure what you mean by 'result'. If you mean the variable
*subsets*, yes, that one will have the same value, regardless of
complemented. The interesting stuff, however, happens in lines 302-332,
where the 'complementing' leaves are being built.

That being said, I think the best approach would be to just give the tree
builder a test and see what it spits out, for a simple dataset that you can
eyeball. Or have a look at the unit tests (if any), they should also give a
clue on what was meant.
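
For a dataset small enough to eyeball, the behaviour described above can be
mimicked with a toy model. This is only a sketch, not the Mahout code: the
class ComplementedSplitSketch, its split and majority helpers, and the colour
data are all made up for illustration. It covers only the categorical-label
case (no sum / data.size() branch), and ties in the majority vote are broken
alphabetically here, whereas Mahout's majorityLabel takes a Random.

```java
import java.util.*;

// Toy stand-in for a categorical split; NOT Mahout's DecisionTreeBuilder.
class ComplementedSplitSketch {

  // Each row is {value of C, label}. Returns branch value -> leaf label.
  static Map<String, String> split(List<String[]> nodeData,
                                   Set<String> allValues,
                                   boolean complemented) {
    // Group the labels of the rows that reached this node by their value of C.
    Map<String, List<String>> byValue = new TreeMap<>();
    List<String> allLabels = new ArrayList<>();
    for (String[] row : nodeData) {
      byValue.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
      allLabels.add(row[1]);
    }

    // Label used for the 'complementing' leaves: majority over the parent data.
    String parentLabel = majority(allLabels);

    // complemented = false: branch only on values seen at this node.
    // complemented = true:  branch on every value of C in the entire dataset.
    Set<String> branchValues =
        complemented ? new TreeSet<>(allValues) : byValue.keySet();

    Map<String, String> children = new TreeMap<>();
    for (String value : branchValues) {
      List<String> labels = byValue.get(value);
      if (labels == null) {
        // Value exists in the entire dataset but not in the data passed in.
        children.put(value, parentLabel);
      } else {
        // A real builder would recurse here; the toy just takes the majority.
        children.put(value, majority(labels));
      }
    }
    return children;
  }

  // Most frequent label; ties go to the alphabetically first label.
  static String majority(List<String> labels) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String label : labels) {
      counts.merge(label, 1, Integer::sum);
    }
    String best = null;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (best == null || e.getValue() > counts.get(best)) {
        best = e.getKey();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Data that reached this node: 'green' never occurs here.
    List<String[]> node = Arrays.asList(
        new String[] {"red", "yes"},
        new String[] {"red", "yes"},
        new String[] {"blue", "no"});
    // Values of C in the entire dataset.
    Set<String> all = new TreeSet<>(Arrays.asList("red", "blue", "green"));

    System.out.println(split(node, all, false)); // {blue=no, red=yes}
    System.out.println(split(node, all, true));  // {blue=no, green=yes, red=yes}
  }
}
```

With complemented = true the extra 'green' branch gets the parent's majority
label, while the 'red' and 'blue' branches come out the same either way.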

Anca


> On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <ancaleuca2005@gmail.com>
> wrote:
>
> > Hi Yang,
> >
> > I think I understand it better now, as well. So this is what I think it
> > does:
> >
> > First of all, I think it only affects the categorical node splits. It
> > works as follows in this scenario:
> > Let us consider a dataset D we want to build a decision tree from.
> > Let's say the tree has been partially built, and we've reached a
> > categorical attribute C that we want to split on.
> >
> > As I understand it, when complemented = false, on that node we might only
> > branch on a subset of possible values of C.
> >
> > When complemented = true, however, we will 'force' branching on all
> > possible values of C from the entire dataset, and replace the missing
> > data with leaves having a label computed from the parent data (line 307):
> >
> > if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
> >   label = sum / data.size();
> > } else {
> >   label = data.majorityLabel(rng);
> > }
> >
> >
> > I hope this is correct and helps with understanding it better.
> >
> >
> > Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>,
> > it's the Jira task that introduced the DecisionTreeBuilder, take a
> > look at the comments, maybe it'll help you as well.
> >
> >
> >
> > Anca
> >
>
