mahout-dev mailing list archives

From Lance Norskog
Subject Re: [jira] [Updated] (MAHOUT-939) ASF Email SGD Examples don't produce good results
Date Wed, 04 Jan 2012 22:24:02 GMT
I have a separate solution: strip the quoted text. Quoted text in the
emails spams the term vectors; just plain TF-IDF is not enough to
combat this. Lucene has a lot of tools besides TF-IDF.
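
A minimal sketch of the quote-stripping idea, in Python for brevity. It is
not the patch mentioned in this message and does not use Mahout or Lucene
APIs; the function name `strip_quoted_text` and the attribution pattern are
assumptions made for illustration.

```python
import re

# Assumed pattern for a single-line reply attribution such as
# "On Wed, Jan 4, 2012 at 6:57 AM, Someone wrote:".
ATTRIBUTION = re.compile(r"^On .*wrote:\s*$")


def strip_quoted_text(body: str) -> str:
    """Drop '>'-quoted lines and single-line attributions from an email body."""
    kept = []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            continue  # quoted text from an earlier message
        if ATTRIBUTION.match(stripped):
            continue  # "On <date>, <someone> wrote:" header
        kept.append(line)
    return "\n".join(kept).strip()
```

The filter would run on each message body before tokenization, so only the
author's own words reach the term vectors. It misses attributions that wrap
across two lines and quoted lines whose ">" prefix was lost in re-wrapping;
a real implementation would need to handle both.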

I have a patch, gotta start the JIRA. Also added more measurements to
the confusion matrix. I want to get a good measurement of the
performance on each producer and consumer, not just a global ratio.
'testnb' gives 80% but one of the false boxes has a 1. This is bogus.
(I'm using your complete corpus of commons vs. cocoon, classifying
dev vs. user.)
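
To illustrate why a global ratio can hide a useless class, here is a small
Python sketch that derives per-class precision and recall from a confusion
matrix. It is not the measurement code from the patch; the function names
and the row/column convention (rows are true labels, columns are predicted
labels) are assumptions.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of items with true label i predicted as label j."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                    # row total: items truly in class k
        predicted = sum(row[k] for row in matrix)  # column total: items predicted as k
        metrics[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return metrics


def accuracy(matrix):
    """Global ratio of correct predictions over all items."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0
```

With made-up counts (not results from this corpus) such as
`[[80, 0], [19, 1]]` for labels `["dev", "user"]`, `accuracy` comes out
around 81% while the recall for "user" is only 5%, which is the kind of
imbalance a single global number hides.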

On Wed, Jan 4, 2012 at 6:57 AM, Grant Ingersoll (Updated) (JIRA) wrote:
>     [
> Grant Ingersoll updated MAHOUT-939:
> -----------------------------------
>    Attachment: MAHOUT-939.patch
> Here's a start on this.  Added some more construction options to the AdaptiveLogisticRegression
> class.  Still testing what values to use in TrainASFEmail, but thought I would put this up
> for now.
>> ASF Email SGD Examples don't produce good results
>> -------------------------------------------------
>>                 Key: MAHOUT-939
>>                 URL:
>>             Project: Mahout
>>          Issue Type: Bug
>>    Affects Versions: 0.6
>>            Reporter: Grant Ingersoll
>>            Assignee: Grant Ingersoll
>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>             Fix For: 0.7
>>         Attachments: MAHOUT-939.patch
>> The SGD examples for the ASF email don't work all that well currently in terms of
>> quality.  Also, need to determine how much memory is required for vectors of cardinality
>> size 100K.
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators:!default.jspa
> For more information on JIRA, see:

Lance Norskog
