It is probably good to benchmark against standard datasets. For text
classification this tends to be the Reuters collection:
http://www.daviddlewis.com/resources/testcollections/
That way you know whether you are doing a good job.
Miles
2009/7/22 Grant Ingersoll <gsingers@apache.org>
> The model size is much smaller with unigrams. :-)
>
> I'm not quite sure what constitutes good just yet, but I can report the
> following using the commands I reported earlier, with the exception that I
> am now using unigrams:
>
> I have two categories: History and Science
>
> 0. Splitter:
> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir
> /PATH/wikipedia/chunks -c 64
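>
> (Aside: if you'd rather drive this step from Java than from the shell, a
> minimal sketch is below. It just calls the tool's main() directly; the
> paths are the same placeholders as in the command above, and the comment
> on -c reflects my understanding that it is the chunk size in MB.)
>
> // Sketch only: programmatic invocation of the splitter. Paths are
> // placeholders copied from the command above, not real locations.
> import org.apache.mahout.classifier.bayes.WikipediaXmlSplitter;
>
> public class SplitDump {
>   public static void main(String[] args) throws Exception {
>     WikipediaXmlSplitter.main(new String[] {
>         "--dumpFile", "PATH/wikipedia/enwiki-20070527-pages-articles.xml",
>         "--outputDir", "/PATH/wikipedia/chunks",
>         "-c", "64"  // chunk size in MB (my assumption)
>     });
>   }
> }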
>
> Then prep:
> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
> --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> (also do this for the training set)
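>
> (subjects.txt is presumably just the category names, one per line; for
> the two categories here that would be:
>
>   history
>   science
>
> I'm inferring the format, so check the copy shipped under
> examples/src/test/resources before relying on it.)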
>
> 1. Train set:
> ls ../chunks
> chunk-0001.xml  chunk-0002.xml  chunk-0003.xml  chunk-0004.xml
> chunk-0005.xml  chunk-0006.xml  chunk-0007.xml  chunk-0008.xml
> chunk-0009.xml  chunk-0010.xml  chunk-0011.xml  chunk-0012.xml
> chunk-0013.xml  chunk-0014.xml  chunk-0015.xml  chunk-0016.xml
> chunk-0017.xml  chunk-0018.xml  chunk-0019.xml  chunk-0020.xml
> chunk-0021.xml  chunk-0022.xml  chunk-0023.xml  chunk-0024.xml
> chunk-0025.xml  chunk-0026.xml  chunk-0027.xml  chunk-0028.xml
> chunk-0029.xml  chunk-0030.xml  chunk-0031.xml  chunk-0032.xml
> chunk-0033.xml  chunk-0034.xml  chunk-0035.xml  chunk-0036.xml
> chunk-0037.xml  chunk-0038.xml  chunk-0039.xml
>
> 2. Test Set:
> ls
> chunk-0101.xml  chunk-0102.xml  chunk-0103.xml  chunk-0104.xml
> chunk-0105.xml  chunk-0107.xml  chunk-0108.xml  chunk-0109.xml
> chunk-0130.xml  chunk-0131.xml  chunk-0132.xml  chunk-0133.xml
> chunk-0134.xml  chunk-0135.xml  chunk-0137.xml  chunk-0139.xml
>
> 3. Run the Trainer on the train set:
> --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
> --gramSize 1 --classifierType bayes
>
> 4. Run the TestClassifier on the test set:
>
> --model PATH/wikipedia/subjects/model --testDir
> PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
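>
> (If someone wants to re-run this end to end from Java, here's a minimal
> sketch that just chains the two steps by calling the tools' main()
> methods. The trainer command above omits the class name; I'm assuming it
> is org.apache.mahout.classifier.bayes.TrainClassifier, and the paths are
> the placeholders from earlier.)
>
> // Sketch only: drive training and testing programmatically.
> import org.apache.mahout.classifier.bayes.TestClassifier;
> import org.apache.mahout.classifier.bayes.TrainClassifier; // assumed name
>
> public class TrainThenTest {
>   public static void main(String[] args) throws Exception {
>     TrainClassifier.main(new String[] {
>         "--input", "PATH/wikipedia/subjects/out",
>         "--output", "PATH/wikipedia/subjects/model",
>         "--gramSize", "1", "--classifierType", "bayes"
>     });
>     TestClassifier.main(new String[] {
>         "--model", "PATH/wikipedia/subjects/model",
>         "--testDir", "PATH/wikipedia/subjects/test",
>         "--gramSize", "1", "--classifierType", "bayes"
>     });
>   }
> }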
>
> Output is:
>
> <snip>
> 09/07/22 15:55:09 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances : 4143 74.0615%
> Incorrectly Classified Instances : 1451 25.9385%
> Total Classified Instances : 5594
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
>  a     b     <--Classified as
>  3910  186   |  4096  a = history
>  1265  233   |  1498  b = science
> Default Category: unknown: 2
> </snip>
>
> At least it's better than 50%, which is presumably a good thing ;-) I have
> no clue what the state of the art is these days, but it doesn't seem
> _horrendous_ either.
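>
> (Reading the matrix a bit further: history recall is 3910/4096 ~ 95.5%,
> but science recall is only 233/1498 ~ 15.6%, and always guessing history
> would already score 4096/5594 ~ 73.2%, so the 74.06% accuracy is close to
> the majority-class baseline. A small sketch that pulls per-class
> precision and recall out of a 2x2 matrix like the one above:)
>
> // Sketch: per-class precision/recall from the confusion matrix above.
> // Rows are actual classes, columns are predicted classes.
> public class ConfusionStats {
>   public static void main(String[] args) {
>     long[][] m = { { 3910, 186 },    // actual history
>                    { 1265, 233 } };  // actual science
>     String[] labels = { "history", "science" };
>     for (int c = 0; c < labels.length; c++) {
>       long tp = m[c][c];
>       long actual = m[c][0] + m[c][1];     // row total
>       long predicted = m[0][c] + m[1][c];  // column total
>       System.out.printf("%s: recall %.1f%%, precision %.1f%%%n",
>           labels[c], 100.0 * tp / actual, 100.0 * tp / predicted);
>     }
>   }
> }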
>
> I'd love to see someone validate what I have done. Let me know if you need
> more details. I'd also like to know how I can improve it.
>
> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>
>> Indeed. I hadn't snapped to the fact that you were using trigrams.
>>
>> 30 million features is quite plausible for that. To effectively use long
>> n-grams as features in classification of documents you really need to have
>> the following:
>>
>> a) good statistical methods for resolving what is useful and what is not.
>> Everybody here knows that my preference for a first hack is sparsification
>> with log-likelihood ratios (a small sketch of the LLR score follows below).
>>
>> b) some kind of smoothing using smaller n-grams
>>
>> c) some kind of smoothing over variants of n-grams.
>>
>> AFAIK, Mahout doesn't have many (or any) of these in place. You are
>> likely to do better with unigrams as a result.
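>>
>> For (a), a minimal sketch of the 2x2 log-likelihood ratio score (my own
>> illustration, not Mahout code): k11 = occurrences of the n-gram in the
>> target class, k12 = its occurrences elsewhere, k21/k22 = counts of all
>> other n-grams in and out of the class. Keep the n-gram as a feature only
>> if its score clears some threshold.
>>
>> public class Llr {
>>   // x * ln(x), with the 0 * ln(0) = 0 convention.
>>   static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }
>>
>>   // Unnormalized entropy of a set of counts.
>>   static double entropy(long... counts) {
>>     long sum = 0; double logSum = 0;
>>     for (long k : counts) { logSum += xLogX(k); sum += k; }
>>     return xLogX(sum) - logSum;
>>   }
>>
>>   // G^2 statistic for a 2x2 contingency table; 0 when the n-gram and
>>   // the class are independent, large when they are strongly associated.
>>   static double llr(long k11, long k12, long k21, long k22) {
>>     double rowEntropy = entropy(k11 + k12, k21 + k22);
>>     double colEntropy = entropy(k11 + k21, k12 + k22);
>>     double matEntropy = entropy(k11, k12, k21, k22);
>>     return 2.0 * (rowEntropy + colEntropy - matEntropy);
>>   }
>>
>>   public static void main(String[] args) {
>>     // Toy numbers: an n-gram seen 110 times in-class, 10 times elsewhere.
>>     System.out.println(llr(110, 10, 100000, 200000));
>>   }
>> }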
>>
>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>
>>> I suspect the explosion in the number of features, Ted, is due to the use
>>> of n-grams producing a lot of unique terms. I can try with gramSize = 1;
>>> that will likely reduce the feature set quite a bit.
>>>
>>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>
>
>