mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Problems with the Bayesian classifiers.
Date Sun, 20 Jul 2008 14:08:00 GMT
I haven't done a lot of testing w/ M-9 yet, so it is more than likely  
there are bugs ;-)

-Grant

On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:

> i think it would also be useful to cross-check your results against a text
> classification system which is known to work.  look at rainbow:
>
> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
>
> if you get the correct results here then either you have somehow messed-up
> with Mahout or else there really is a bug
>
> Miles
>
> 2008/7/20 Robin Anil <robin.anil@gmail.com>:
>
>> Can you upload your split somewhere?
>>
>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
>> philippe.lamarche@gmail.com> wrote:
>>
>>> Now, with the attachment.
>>> Sorry.
>>>
>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
>>> <philippe.lamarche@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I have been working for a little while with Mahout and the Bayesian
>>>> classifier for a school project.
>>>>
>>>> I am using the Enron email corpus and the UC Berkeley classified
>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
>>>> seem to make it work. I wonder if I am doing something wrong.
>>>>
>>>> For example, I am getting correct predictions under 10% with Bayes and
>>>> around 1% with CBayes. The problem seems to lie in the fact that all
>>>> instances of a class will be predicted as another class, or that they
>>>> will all be predicted as the class containing the most features.
>>>>
>>>> I also tested with the 20News corpus and I get similar results, where
>>>> all instances of a class will be predicted as another class (e.g. all
>>>> 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
>>>> Attached are two confusion matrices displaying results for Bayes and
>>>> CBayes. Both used the same division into training and testing sets.
>>>>
>>>> Am I doing something wrong?
>>>>
>>>> Thanks,
>>>>
>>>> Philippe Lamarche.
>>>>
>>>
>>
>>
>> Thanks
>> Robin
>>
>
>
>
> -- 
> The University of Edinburgh is a charitable body, registered in  
> Scotland,
> with registration number SC005336.
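
For anyone comparing runs (Mahout vs. rainbow, say), here is a rough Python
sketch of how the symptom described above could be checked from a confusion
matrix: it flags every class whose test instances nearly all land in one
other class, and reports overall accuracy. The function names, the 0.9
threshold and the second row of the example matrix are assumptions made up
for illustration; only the 421 "rec.motorcycles" figure comes from the
original report. It is not part of Mahout.

```python
def collapsed_classes(confusion, labels, threshold=0.9):
    """Find classes that are almost entirely predicted as one other class.

    confusion[i][j] is the number of test instances whose true class is
    labels[i] and whose predicted class is labels[j].
    Returns a list of (true_label, predicted_label, share) tuples.
    """
    findings = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue  # no test instances for this class
        # Index of the most frequently predicted class for this row.
        j = max(range(len(row)), key=row.__getitem__)
        share = row[j] / total
        if j != i and share >= threshold:
            findings.append((labels[i], labels[j], share))
    return findings


def accuracy(confusion):
    """Fraction of instances on the diagonal (correct predictions)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total if total else 0.0


# Hypothetical two-class excerpt: the first row mirrors the reported case,
# the second row (400 correct) is invented purely for the example.
labels = ["rec.motorcycles", "talk.politics.mideast"]
confusion = [
    [0, 421],
    [0, 400],
]

for true_label, predicted, share in collapsed_classes(confusion, labels):
    print(f"{true_label} -> {predicted} ({share:.0%} of instances)")
print(f"accuracy: {accuracy(confusion):.3f}")
```

If the same split gives a clean diagonal in another classifier but collapsed
rows like this in Mahout, that would point towards a bug rather than a
problem with the data.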

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ