mahout-user mailing list archives

From Vijay Santhanam <vijay.santha...@gmail.com>
Subject Re: 20news
Date Mon, 04 Jul 2011 15:08:10 GMT
Hi Sean,

Thanks for responding.

I would expect the sequential classifier's tokenizer to be identical to the
one used by the parallel classifier.

If that's not possible, then NGrams should perhaps be configurable in
how it finds its first token (i.e. the label).
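
To illustrate what I mean, here is a minimal sketch of how the delimiter
choice changes the first token. This is not Mahout's actual code: the class
name, the method name and the sample line are all made up for illustration,
and I'm assuming the input has the form "label<TAB>text".

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Mahout's NGrams implementation.
public class LabelSplitSketch {

    // Splits on a single space only.
    static final Pattern SPACE = Pattern.compile(" ");

    // Splits on a single space or a single tab.
    static final Pattern SPACE_OR_TAB = Pattern.compile("[ \t]");

    // Returns everything before the first delimiter match (the "label").
    static String firstToken(String line, Pattern delimiter) {
        return delimiter.split(line, 2)[0];
    }

    public static void main(String[] args) {
        // Made-up test line: label, then a tab, then the document text.
        String line = "alt.atheism\tfrom mathew subject alt atheism faq";

        // Tabs are shown as \t so the difference is visible on a console.
        System.out.println(firstToken(line, SPACE).replace("\t", "\\t"));
        // prints: alt.atheism\tfrom
        System.out.println(firstToken(line, SPACE_OR_TAB).replace("\t", "\\t"));
        // prints: alt.atheism
    }
}
```

With the space-only delimiter the tab stays inside the first token, which
matches the bad label I saw. A configurable delimiter would let the
sequential and parallel code paths agree on what the label is.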

I'm very new to Hadoop and this world, so I'm not sure what I'm looking at
when the classifier goes into MapReduce execution.

-V


On Tue, Jul 5, 2011 at 12:46 AM, Sean Owen <srowen@gmail.com> wrote:

> This could be my doing. I noticed that various bits of code split
> input files in different ways: StringTokenizer, Pattern, Splitter. And
> using different delimiters: space, space/tab, or the weird collection
> of delimiters from StringTokenizer. (BTW StringTokenizer is all but
> deprecated for this reason.) So I tried to move towards Splitter, or
> Pattern where that made more sense.
>
> So I have recently tried to standardize how things like NGrams
> tokenize, to make it all behave more consistently. I tried to guess and
> preserve the intent of the tokenization, but it did change in several
> places as a result, and this could be the issue here.
>
> So: what class is tokenizing, and what do you expect it to tokenize on? We
> can easily add "tab" to what NGrams tokenizes on, for instance.
>
> Sean
>
> On Mon, Jul 4, 2011 at 1:23 PM, Robin Anil <robin.anil@gmail.com> wrote:
> > Are you using some non-standard Java character encoding?
> >
> >
> > On Mon, Jul 4, 2011 at 5:23 PM, Vijay Santhanam
> > <vijay.santhanam@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Okay, I replaced all the tab characters with space characters in each
> >> file in the bayes-test-input folder, and now the classifier completes
> >> without error.
> >>
> >> Tomorrow I'll investigate why the trainer parses the tab-separated
> >> label correctly but the classifier does not. Actually, I know why the
> >> classifier doesn't extract the correct label: because
> >> org.apache.mahout.common.nlp.NGrams tokenizes on spaces only.
> >>
> >> The other mystery is why it works for everyone else except poor me :(
> >>
> >> If anyone has any ideas I'd love to hear them.
> >>
> >> Cheers,
> >> Vijay
> >>
> >>
> >>
> >> On Mon, Jul 4, 2011 at 9:16 PM, Vijay Santhanam
> >> <vijay.santhanam@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I got the debugger running with Eclipse so I could watch what was
> >> > happening under the hood.
> >> >
> >> > Here's the exception again
> >> > Exception in thread "main" java.lang.IllegalArgumentException: Label not found: alt.atheism from
> >> >   at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
> >> >   at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:93)
> >> >   at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:113)
> >> >   at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
> >> >   at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:85)
> >> >   at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:67)
> >> >   at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:244)
> >> >   at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:177)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >   at java.lang.reflect.Method.invoke(Method.java:597)
> >> >   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >> >   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >> >   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
> >> >
> >> > Notice the "Label not found: alt.atheism\tfrom"
> >> >
> >> > That's an invalid label in the confusion matrix. I think it SHOULD be
> >> > just alt.atheism. I'm not sure how the \tfrom is getting in there, but
> >> > it is. Perhaps it has something to do with the way my test data was
> >> > formatted.
> >> >
> >> > I'll keep digging....
> >> >
> >> > Thanks,
> >> > Vijay
> >> >
> >> >
>



-- 
 Vijay Santhanam
 Software Engineer
 http://au.linkedin.com/in/vijaysanthanam
 0407525087
