mahout-user mailing list archives

From "Philippe Lamarche" <philippe.lamar...@gmail.com>
Subject Re: Problems with the Bayesian classifiers.
Date Sun, 20 Jul 2008 15:23:19 GMT
 Hi,

I uploaded my split here:

http://www.2shared.com/file/3624998/e9330a64/news-train-testtar.html

(the download link is after all the ads, at the bottom of the page)

The file contains the "news_test_1" and "news_train_1" folders, with
the original file/folder structure. The "news_ha_train_1" folder
contains the collapsed version of "news_train_1".

The training files are not evenly distributed across the classes (some
classes contain fewer training files than others). This was done to
mirror the UC Berkeley Enron corpus.
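
To see how uneven the classes are, a small script along these lines can
count the files per class. This is only a sketch: it assumes each
immediate subfolder of the training folder is one class, and the function
names (count_files_per_class, print_distribution) are made up for
illustration. It is not part of Mahout.

```python
import os
from collections import Counter


def count_files_per_class(root):
    """Count files under each immediate subdirectory of `root`.

    Assumption: every immediate subdirectory is one class, and all files
    anywhere beneath it are documents of that class.
    """
    counts = Counter()
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files sitting next to the class folders
        n = 0
        for _dirpath, _dirnames, filenames in os.walk(class_dir):
            n += len(filenames)
        counts[entry] = n
    return counts


def print_distribution(counts):
    """Print classes from largest to smallest with their share of the total."""
    total = sum(counts.values())
    for name, n in counts.most_common():
        share = 100.0 * n / total if total else 0.0
        print(f"{name:30s} {n:6d} {share:5.1f}%")
```

For example, print_distribution(count_files_per_class("news_train_1"))
would list the largest classes first, which makes it easy to check
whether the predictions pile up in the biggest class.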

Thanks,
Philippe.


On Sun, Jul 20, 2008 at 10:08 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> I haven't done a lot of testing w/ M-9 yet, so it is more than likely there
> are bugs ;-)
>
> -Grant
>
> On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:
>
>> i think it would also be useful to cross-check your results against a text
>> classification system which is known to work.  look at rainbow:
>>
>> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
>>
>> if you get the correct results here then either you have somehow messed-up
>> with Mahout or else there really is a bug
>>
>> Miles
>>
>> 2008/7/20 Robin Anil <robin.anil@gmail.com>:
>>
>>> Can you upload your split somewhere?
>>>
>>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
>>> philippe.lamarche@gmail.com> wrote:
>>>
>>>> Now, with the attachment.
>>>> Sorry.
>>>>
>>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
>>>> <philippe.lamarche@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have been working for a little while with Mahout and the Bayesian
>>>>> classifier for a school project.
>>>>>
>>>>> I am using the Enron email corpus and the UC Berkeley classified
>>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I
>>>>> can't seem to make it work. I wonder if I am doing something wrong.
>>>>>
>>>>> For example, I am getting under 10% correct predictions with Bayes
>>>>> and around 1% with CBayes. The problem seems to be that all
>>>>> instances of a class are predicted as another class, or that they
>>>>> are all predicted as the class containing the most features.
>>>>>
>>>>> I also tested with the 20News corpus and I get similar results,
>>>>> where all instances of a class are predicted as another class (e.g.
>>>>> all 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
>>>>> Attached are two confusion matrices showing the results for Bayes
>>>>> and CBayes. Both used the same split into training and test sets.
>>>>>
>>>>> Am I doing something wrong?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Philippe Lamarche.
>>>>>
>>>>
>>>
>>>
>>> Thanks
>>> Robin
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in Scotland,
>> with registration number SC005336.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
