mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: 20news
Date Mon, 04 Jul 2011 14:46:36 GMT
This could be my doing. I noticed that various bits of code split
input files in different ways: StringTokenizer, Pattern, Splitter. And
using different delimiters: space, space/tab, or the weird collection
of delimiters from StringTokenizer. (BTW StringTokenizer is all but
deprecated for this reason.) So I tried to move towards Splitter, or
Pattern where that made more sense.
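
To illustrate how the results can differ, here is a minimal sketch using
only JDK classes. It is not Mahout code: the class name, method names and
sample line are made up for the example, and Guava's Splitter is left out
to keep it dependency-free.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitComparison {

  // Splits on a single space only; a tab stays inside the token.
  static final Pattern SPACE = Pattern.compile(" ");
  // Splits on either a space or a tab.
  static final Pattern SPACE_TAB = Pattern.compile("[ \t]");

  // StringTokenizer with its default delimiters:
  // space, tab, newline, carriage return and form feed.
  static List<String> withStringTokenizer(String line) {
    StringTokenizer tokenizer = new StringTokenizer(line);
    List<String> tokens = new ArrayList<>();
    while (tokenizer.hasMoreTokens()) {
      tokens.add(tokenizer.nextToken());
    }
    return tokens;
  }

  // Splits the line on whatever the given pattern matches.
  static List<String> withPattern(Pattern delimiter, String line) {
    return Arrays.asList(delimiter.split(line));
  }

  public static void main(String[] args) {
    // Hypothetical input: a label, a tab, then space-separated text.
    String line = "alt.atheism\tfrom the list";
    System.out.println(withStringTokenizer(line));
    System.out.println(withPattern(SPACE, line));
    System.out.println(withPattern(SPACE_TAB, line));
  }
}
```

On that sample line, the StringTokenizer and space/tab versions produce
four tokens, while the space-only version produces three, with the tab
left embedded in the first token.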

So I recently tried to standardize how classes like NGrams tokenize, so
that they all behave the same way. I tried to guess and preserve the
intent of each tokenization, but the behavior did change in several
places as a result, and that could be the issue here.

So: what class is tokenizing, and what do you expect it to tokenize on?
We can easily add "tab" to what NGrams tokenizes on, for instance.
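
To make the question concrete, here is a rough sketch of what I suspect is
going on. It assumes the label is the first token on each line; the class,
the extractLabel helper, the label set and the sample line are all
hypothetical and are not taken from the Mahout source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class LabelExtraction {

  // Space only: a tab after the label is not treated as a delimiter.
  static final Pattern SPACE = Pattern.compile(" ");
  // Space or tab: the delimiter set with "tab" added.
  static final Pattern SPACE_TAB = Pattern.compile("[ \t]");

  // Hypothetical helper: treats the first token of the line as the label.
  static String extractLabel(Pattern delimiter, String line) {
    // A limit of 2 stops splitting after the first delimiter.
    return delimiter.split(line, 2)[0];
  }

  public static void main(String[] args) {
    // Made-up set of labels that were seen during training.
    Set<String> knownLabels =
        new HashSet<>(Arrays.asList("alt.atheism", "sci.space"));

    // Made-up test line with a tab between the label and the text.
    String line = "alt.atheism\tfrom the list";

    String spaceOnly = extractLabel(SPACE, line);
    String spaceOrTab = extractLabel(SPACE_TAB, line);

    // The space-only label still carries "\tfrom", so the lookup fails.
    System.out.println(knownLabels.contains(spaceOnly));
    // With tab as a delimiter the label is just "alt.atheism".
    System.out.println(knownLabels.contains(spaceOrTab));
  }
}
```

If that matches what you see, then adding tab to the delimiters would
make the extracted label match the trained one.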

Sean

On Mon, Jul 4, 2011 at 1:23 PM, Robin Anil <robin.anil@gmail.com> wrote:
> Are you using some non-standard Java character encoding?
>
>
> On Mon, Jul 4, 2011 at 5:23 PM, Vijay Santhanam
> <vijay.santhanam@gmail.com> wrote:
>
>> Hi,
>>
>> Okay, I replaced all the tab characters with space characters for each file
>> in the bayes-test-input folder and now the classifier completes without
>> error.
>>
>> Tomorrow I'll investigate why the trainer parses the tab-separated label
>> correctly but the classifier does not. Actually, I know why the classifier
>> doesn't extract the correct label: org.apache.mahout.common.nlp.NGrams
>> tokenizes on spaces only.
>>
>> The other mystery is why it works for everyone else except poor me :(
>>
>> If anyone has any ideas I'd love to hear them.
>>
>> Cheers,
>> Vijay
>>
>>
>>
>> On Mon, Jul 4, 2011 at 9:16 PM, Vijay Santhanam
>> <vijay.santhanam@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I got the debugger running with Eclipse so I could watch what was
>> > happening under the hood.
>> >
>> > Here's the exception again
>> > Exception in thread "main" java.lang.IllegalArgumentException: Label not found: alt.atheism from
>> >   at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>> >   at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:93)
>> >   at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:113)
>> >   at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
>> >   at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:85)
>> >   at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:67)
>> >   at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:244)
>> >   at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:177)
>> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >   at java.lang.reflect.Method.invoke(Method.java:597)
>> >   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>> >   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>> >   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>> >
>> > Notice the "Label not found: alt.atheism\tfrom"
>> >
>> > That's an invalid label in the confusion matrix. I think it SHOULD be
>> > just alt.atheism. I'm not sure how the \tfrom is getting in there, but
>> > it is.
>> > Perhaps it has something to do with the way my test data was formatted.
>> >
>> > I'll keep digging....
>> >
>> > Thanks,
>> > Vijay
>> >
>> >
