mahout-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [jira] [Updated] (MAHOUT-939) ASF Email SGD Examples don't produce good results
Date Sun, 08 Jan 2012 12:47:44 GMT
I've always had decent performance (~70%+) on just commons vs. cocoon.
The bigger problem is when you add in more labels. I just did a run with
24 labels and the results still stink. I tend to agree that the primary
issue is in the preprocessing of the content, or there is a bug.

To avoid blocking 0.6, I think it would make sense to mark the
classification examples as experimental. They certainly show the flow
one has to go through, but they don't currently produce good results.
Longer term, it would be good to get these fixed.
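
For the preprocessing side, here is a minimal sketch of the "negative regex" idea raised in the thread quoted below: drop a message outright when its subject line matches a reject rule. It is written in Python for brevity and is not MailProcessor code. The patterns, the function names, and the dict-with-a-"subject"-key message shape are all assumptions made for illustration; the real build-report subjects would need to be checked against the archives.

```python
import re

# Hypothetical reject rules. These patterns are guesses at what build
# reports look like, not rules taken from Mahout or the ASF archives.
REJECT_SUBJECT_PATTERNS = [
    re.compile(r"\bbuild (failed|successful|report)\b", re.IGNORECASE),
    re.compile(r"^\[(continuum|gump|jenkins|hudson)\]", re.IGNORECASE),
]


def should_reject(subject):
    """Return True if the subject matches any negative rule."""
    return any(p.search(subject) for p in REJECT_SUBJECT_PATTERNS)


def filter_messages(messages):
    """Keep only messages whose subject passes every negative rule.

    Each message is assumed to be a dict with a 'subject' key.
    """
    return [m for m in messages if not should_reject(m.get("subject", ""))]
```

Running the filter before vectorization would keep repeated build-report terms out of both the training and the test sets.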



On Jan 4, 2012, at 10:06 PM, Lance Norskog wrote:

> I tested Apache Commons vs. Cocoon. They use two different build
> systems, with different message formats.  I believe the repetitive
> messages have the effect of spamming terms, both in the subject lines
> and body. In fact, the subject lines are probably bigger offenders
> than the bodies. But, we shall see.
> 
> On Wed, Jan 4, 2012 at 6:29 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> On Wed, Jan 4, 2012 at 4:59 PM, Lance Norskog <goksron@gmail.com> wrote:
>> 
>>> The last step before posting was to test it on SGD :) My results on
>>> ASF mails (two labels) are around 80%, but both failure boxes get about
>>> 20% of the messages. This seems more realistic.
>>> 
>> 
>> Was this 80% on time-separated test data?  Or the training data?
>> 
>> 
>>> There is another leakage/spam problem in the dev mails: build reports.
>>> 
>> 
>> Why are these a problem?  Too easy?
>> 
>> They are emails sent to the group and should be reasonable to classify
>> unless they inflate the accuracy.
>> 
>> 
>>> The MailProcessor has positive regex rules to find header entries &
>>> subject lines. It does not do negative regex rules to reject a
>>> message; this is the right way to nuke (the first) build message.
>>> 
>> 
>> Yes.  They should be easy to nuke.  But I am not sure why.
>> 
>> 
>>> Is it worthwhile to clamp the training data so that there are similar
>>> numbers of documents for each label? Or does Naive Bayes work well
>>> with a bell curve?
>>> 
>> 
>> Shouldn't matter much for any of our classifiers.
>> 
>> The only strong reason to do this is to speed up training but this data set
>> is pretty small.
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com
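
On the question above about time-separated test data, here is a minimal sketch of such a split: sort the messages by date and cut once, so that no test message is older than any training message. It is in Python and is not part of the Mahout examples. The function name, the 0.8 default, and the dict-with-a-"date"-key message shape are assumptions made for illustration.

```python
from datetime import datetime


def time_separated_split(messages, train_fraction=0.8):
    """Split messages into (train, test) by date.

    Messages are sorted by their 'date' field and cut once, so every
    test message is at least as recent as every training message.
    Each message is assumed to be a dict with a 'date' key (datetime).
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Accuracy measured on the later slice is a fairer estimate than accuracy on a random holdout, since near-duplicate messages from the same thread cannot land on both sides of the cut as easily.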

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com



