mahout-user mailing list archives

From Miles Osborne <mi...@inf.ed.ac.uk>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 20:13:40 GMT
It is probably good to benchmark against standard datasets.  For text
classification this tends to be the Reuters set:

http://www.daviddlewis.com/resources/testcollections/

This way you know if you are doing a good job.

Miles

2009/7/22 Grant Ingersoll <gsingers@apache.org>

> The model size is much smaller with unigrams.  :-)
>
> I'm not quite sure what constitutes good just yet, but I can report the
> following using the commands I reported earlier, w/ the exception that I am
> using unigrams:
>
> I have two categories:  History and Science
>
> 0. Splitter:
> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir
> /PATH/wikipedia/chunks -c 64
>
> Then prep:
> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
> --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> (also do this for the training set)
>
> 1. Train set:
> ls ../chunks
> chunk-0001.xml  chunk-0002.xml  chunk-0003.xml  chunk-0004.xml  chunk-0005.xml
> chunk-0006.xml  chunk-0007.xml  chunk-0008.xml  chunk-0009.xml  chunk-0010.xml
> chunk-0011.xml  chunk-0012.xml  chunk-0013.xml  chunk-0014.xml  chunk-0015.xml
> chunk-0016.xml  chunk-0017.xml  chunk-0018.xml  chunk-0019.xml  chunk-0020.xml
> chunk-0021.xml  chunk-0022.xml  chunk-0023.xml  chunk-0024.xml  chunk-0025.xml
> chunk-0026.xml  chunk-0027.xml  chunk-0028.xml  chunk-0029.xml  chunk-0030.xml
> chunk-0031.xml  chunk-0032.xml  chunk-0033.xml  chunk-0034.xml  chunk-0035.xml
> chunk-0036.xml  chunk-0037.xml  chunk-0038.xml  chunk-0039.xml
>
> 2. Test Set:
>  ls
> chunk-0101.xml  chunk-0102.xml  chunk-0103.xml  chunk-0104.xml
> chunk-0105.xml  chunk-0107.xml  chunk-0108.xml  chunk-0109.xml
> chunk-0130.xml  chunk-0131.xml  chunk-0132.xml  chunk-0133.xml
> chunk-0134.xml  chunk-0135.xml  chunk-0137.xml  chunk-0139.xml
>
> 3. Run the Trainer on the train set:
> --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
> --gramSize 1 --classifierType bayes
>
> 4. Run the TestClassifier.
>
> --model PATH/wikipedia/subjects/model --testDir
> PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
>
> Output is:
>
> <snip>
> 09/07/22 15:55:09 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       4143       74.0615%
> Incorrectly Classified Instances        :       1451       25.9385%
> Total Classified Instances              :       5594
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a       b       <--Classified as
> 3910    186      |  4096        a     = history
> 1265    233      |  1498        b     = science
> Default Category: unknown: 2
> </snip>
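
For reference, the summary can be read straight off the confusion matrix:
rows are the actual category, columns the predicted one, and the diagonal
holds the correctly classified instances.  A minimal sketch of that
arithmetic in plain Java (an illustration, not a Mahout class), using the
counts above:

  // A quick check of the numbers reported above.  Rows of the matrix are the
  // actual category, columns are the predicted category, so the diagonal is
  // the count of correctly classified instances.
  public class ConfusionMatrixCheck {
    public static void main(String[] args) {
      String[] labels = {"history", "science"};
      // matrix[i][j] = documents whose actual label is i, classified as j
      long[][] matrix = {
          {3910, 186},    // history row, 4096 instances in total
          {1265, 233}     // science row, 1498 instances in total
      };

      long correct = 0;
      long total = 0;
      for (int i = 0; i < matrix.length; i++) {
        for (int j = 0; j < matrix[i].length; j++) {
          total += matrix[i][j];
          if (i == j) {
            correct += matrix[i][j];
          }
        }
      }
      // (3910 + 233) / 5594 = 74.0615%, matching the TestClassifier summary.
      // Always guessing the larger class (history, 4096 / 5594) would give ~73.2%.
      System.out.printf("accuracy: %d / %d = %.4f%%%n",
          correct, total, 100.0 * correct / total);

      for (int i = 0; i < labels.length; i++) {
        long actual = 0;     // row total: instances that truly have label i
        long predicted = 0;  // column total: instances classified as label i
        for (int j = 0; j < labels.length; j++) {
          actual += matrix[i][j];
          predicted += matrix[j][i];
        }
        System.out.printf("%s: recall %.4f, precision %.4f%n",
            labels[i], (double) matrix[i][i] / actual, (double) matrix[i][i] / predicted);
      }
    }
  }

It also prints per-category recall and precision; the matrix shows that most
of the errors (1265 of the 1451) are science articles being classified as
history.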
>
> At least it's better than 50%, which is presumably a good thing ;-)  I have
> no clue what the state of the art is these days, but it doesn't seem
> _horrendous_ either.
>
> I'd love to see someone validate what I have done.  Let me know if you need
> more details.  I'd also like to know how I can improve it.
>
> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>
>> Indeed.  I hadn't snapped to the fact you were using trigrams.
>>
>> 30 million features is quite plausible for that.  To effectively use long
>> n-grams as features in classification of documents you really need to have
>> the following:
>>
>> a) good statistical methods for resolving what is useful and what is not.
>> Everybody here knows that my preference for a first hack is sparsification
>> with log-likelihood ratios (a rough sketch of that statistic follows below).
>>
>> b) some kind of smoothing using smaller n-grams
>>
>> c) some kind of smoothing over variants of n-grams.
>>
>> AFAIK, mahout doesn't have many or any of these in place.  You are likely
>> to do better with unigrams as a result.
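
Point (a) above usually means scoring each candidate feature (each n-gram)
against the class counts with the G^2 log-likelihood ratio and keeping only
the high scorers.  A rough sketch of that statistic in plain Java, as an
illustration of the test rather than of any Mahout code, with made-up counts
in main():

  // A sketch of Dunning's G^2 log-likelihood ratio for one candidate feature
  // (say, a particular n-gram) against one class, computed from a 2x2 table
  // of counts.  Illustration only; this is not Mahout's implementation.
  public class LlrSketch {

    // k11: feature present, class A    k12: feature present, class B
    // k21: feature absent,  class A    k22: feature absent,  class B
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      // Guard against tiny negative values from floating-point rounding.
      return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Unnormalized entropy: N*ln(N) - sum(k*ln(k)).
    static double entropy(long... counts) {
      long sum = 0;
      double sumXLogX = 0.0;
      for (long k : counts) {
        sum += k;
        sumXLogX += xLogX(k);
      }
      return xLogX(sum) - sumXLogX;
    }

    static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    public static void main(String[] args) {
      // Made-up counts.  A feature skewed toward one class scores high ...
      System.out.println(logLikelihoodRatio(100, 5, 10000, 10000));
      // ... while a feature spread evenly across classes scores near zero.
      System.out.println(logLikelihoodRatio(50, 50, 10000, 10000));
    }
  }

A score near zero says the feature occurs independently of the class and can
be dropped; large scores mark features worth keeping.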
>>
>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org>
>> wrote:
>>
>>> I suspect the explosion in the number of features, Ted, is due to the use
>>> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1,
>>> that will likely reduce the feature set quite a bit.
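
To make the feature-count point concrete: over a corpus, far more distinct
trigrams show up than distinct unigrams, because almost every three-word
window is unique.  A toy illustration in plain Java (the sample text is made
up; over the full Wikipedia chunks this gap is what pushes the feature count
into the tens of millions):

  import java.util.HashSet;
  import java.util.Set;

  // Toy illustration: almost every trigram in a piece of text is distinct,
  // while unigrams repeat, so trigram feature sets grow much faster.
  public class NgramCount {
    public static void main(String[] args) {
      // Made-up sample text.
      String text = "the quick brown fox jumps over the lazy dog "
                  + "the quick red fox jumps over the sleeping cat";
      String[] tokens = text.split("\\s+");

      System.out.println("unique 1-grams: " + distinctNgrams(tokens, 1).size());
      System.out.println("unique 3-grams: " + distinctNgrams(tokens, 3).size());
    }

    static Set<String> distinctNgrams(String[] tokens, int n) {
      Set<String> grams = new HashSet<String>();
      for (int i = 0; i + n <= tokens.length; i++) {
        StringBuilder gram = new StringBuilder();
        for (int j = 0; j < n; j++) {
          if (j > 0) {
            gram.append(' ');
          }
          gram.append(tokens[i + j]);
        }
        grams.add(gram.toString());
      }
      return grams;
    }
  }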
>>>
>>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
