On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:
> it is probably good to benchmark against standard datasets. for text
> classification this tends to be the Reuters set:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> this way you know if you are doing a good job
Yeah, good point. The only problem is that, for my demo, I am doing it
all on Wikipedia, because I want coherent examples and don't want to
introduce another dataset. I know there are a few sources of error in
the process: we pick just a single category per document even though
documents carry multiple categories, and furthermore we pick the first
input category that matches, even though several input categories
might be present, or even both combined in one (e.g. History of
Science).
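
In sketch form, the labeling step amounts to something like this
(plain Java, illustrative names only, not the actual
WikipediaDatasetCreatorDriver internals):

    import java.util.List;
    import java.util.Set;

    final class FirstMatchLabeler {
      // Assign the first input category that matches any of the document's
      // categories; substring matching is why a combined category like
      // "History of Science" can resolve to whichever label is tried first.
      static String label(List<String> docCategories, Set<String> inputCategories) {
        for (String docCat : docCategories) {
          String lower = docCat.toLowerCase();
          for (String input : inputCategories) {
            if (lower.contains(input.toLowerCase())) {
              return input; // first match wins; all other matches are ignored
            }
          }
        }
        return null; // no match: the document contributes nothing
      }
    }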
Still, it's good to try it out with the Reuters collection as well.
Sigh, I'll put it on the to-do list.
>
> Miles
>
> 2009/7/22 Grant Ingersoll <gsingers@apache.org>
>
>> The model size is much smaller with unigrams. :-)
>>
>> I'm not quite sure what constitutes good just yet, but I can report
>> the following, using the commands I reported earlier, with the
>> exception that I am using unigrams:
>>
>> I have two categories: History and Science
>>
>> 0. Splitter:
>> org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
>>   --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
>>   --outputDir PATH/wikipedia/chunks -c 64
>>
>> Then prep:
>> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
>>   --input PATH/wikipedia/test-chunks/
>>   --output PATH/wikipedia/subjects/test
>>   --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
>> (also do this for the training set)
>>
>> 1. Train set:
>> ls ../chunks
>> chunk-0001.xml chunk-0005.xml chunk-0009.xml chunk-0013.xml chunk-0017.xml chunk-0021.xml chunk-0025.xml chunk-0029.xml chunk-0033.xml chunk-0037.xml
>> chunk-0002.xml chunk-0006.xml chunk-0010.xml chunk-0014.xml chunk-0018.xml chunk-0022.xml chunk-0026.xml chunk-0030.xml chunk-0034.xml chunk-0038.xml
>> chunk-0003.xml chunk-0007.xml chunk-0011.xml chunk-0015.xml chunk-0019.xml chunk-0023.xml chunk-0027.xml chunk-0031.xml chunk-0035.xml chunk-0039.xml
>> chunk-0004.xml chunk-0008.xml chunk-0012.xml chunk-0016.xml chunk-0020.xml chunk-0024.xml chunk-0028.xml chunk-0032.xml chunk-0036.xml
>>
>> 2. Test Set:
>> ls
>> chunk-0101.xml chunk-0103.xml chunk-0105.xml chunk-0108.xml chunk-0130.xml chunk-0132.xml chunk-0134.xml chunk-0137.xml
>> chunk-0102.xml chunk-0104.xml chunk-0107.xml chunk-0109.xml chunk-0131.xml chunk-0133.xml chunk-0135.xml chunk-0139.xml
>>
>> 3. Run the Trainer on the train set:
>>   --input PATH/wikipedia/subjects/out
>>   --output PATH/wikipedia/subjects/model
>>   --gramSize 1 --classifierType bayes
>>
>> 4. Run the TestClassifier:
>>   --model PATH/wikipedia/subjects/model
>>   --testDir PATH/wikipedia/subjects/test
>>   --gramSize 1 --classifierType bayes
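>>
>> (For the curious: the bayes classifierType scores each label in the
>> spirit of standard multinomial naive Bayes. A toy sketch with made-up
>> names and a fixed smoothing constant; not Mahout's actual internals:)
>>
>>   import java.util.Map;
>>
>>   final class ToyBayes {
>>     // logPrior: log P(class); logLik: class -> token -> log P(token|class)
>>     static String classify(Iterable<String> tokens,
>>                            Map<String, Double> logPrior,
>>                            Map<String, Map<String, Double>> logLik) {
>>       String best = null;
>>       double bestScore = Double.NEGATIVE_INFINITY;
>>       for (Map.Entry<String, Double> e : logPrior.entrySet()) {
>>         double score = e.getValue();                  // start from the prior
>>         Map<String, Double> ll = logLik.get(e.getKey());
>>         for (String t : tokens) {
>>           Double lp = ll.get(t);
>>           score += (lp != null) ? lp : -10.0;         // crude smoothing for unseen tokens
>>         }
>>         if (score > bestScore) { bestScore = score; best = e.getKey(); }
>>       }
>>       return best;                                    // highest-scoring label wins
>>     }
>>   }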
>>
>> Output is:
>>
>> <snip>
>> 09/07/22 15:55:09 INFO bayes.TestClassifier:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances   :  4143   74.0615%
>> Incorrectly Classified Instances :  1451   25.9385%
>> Total Classified Instances       :  5594
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>>    a     b   <--Classified as
>>  3910   186  | 4096  a = history
>>  1265   233  | 1498  b = science
>> Default Category: unknown: 2
>> </snip>
>>
>> At least it's better than 50%, which is presumably a good thing ;-)
>> I have no clue what the state of the art is these days, but it
>> doesn't seem _horrendous_ either.
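>>
>> Checking the arithmetic against the matrix above: accuracy is
>> 4143 / 5594 ≈ 74.06%, though given the class skew, always guessing
>> history would already get 4096 / 5594 ≈ 73.22%, so 50% is a fairly
>> generous baseline here.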
>>
>> I'd love to see someone validate what I have done. Let me know if
>> you need more details. I'd also like to know how I can improve it.
>>
>> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>>
>>> Indeed. I hadn't snapped to the fact you were using trigrams.
>>> 30 million features is quite plausible for that. To effectively
>>> use long n-grams as features in document classification you really
>>> need to have the following:
>>>
>>> a) good statistical methods for resolving what is useful and what
>>> is not. Everybody here knows that my preference for a first hack is
>>> sparsification with log-likelihood ratios (sketched below).
>>>
>>> b) some kind of smoothing using smaller n-grams
>>>
>>> c) some kind of smoothing over variants of n-grams.
>>>
>>> AFAIK, Mahout doesn't have many (or any) of these in place. You
>>> are likely to do better with unigrams as a result.
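>>>
>>> (For concreteness, by log-likelihood ratios I mean the G^2 statistic
>>> over a 2x2 table of feature-vs-class counts. A minimal self-contained
>>> sketch in the usual entropy form; the names are illustrative:)
>>>
>>>   final class Llr {
>>>     private static double xLogX(long x) {
>>>       return x == 0 ? 0.0 : x * Math.log(x);
>>>     }
>>>     // Unnormalized entropy of a count vector: N*ln(N) - sum(x*ln(x)).
>>>     private static double entropy(long... counts) {
>>>       long sum = 0;
>>>       double sumXLogX = 0.0;
>>>       for (long x : counts) {
>>>         sum += x;
>>>         sumXLogX += xLogX(x);
>>>       }
>>>       return xLogX(sum) - sumXLogX;
>>>     }
>>>     // k11 = feature with class, k12 = feature without class,
>>>     // k21 = class without feature, k22 = neither.
>>>     static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
>>>       double rowEntropy = entropy(k11 + k12, k21 + k22);
>>>       double colEntropy = entropy(k11 + k21, k12 + k22);
>>>       double matEntropy = entropy(k11, k12, k21, k22);
>>>       // Clamp tiny negatives caused by floating-point rounding.
>>>       return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
>>>     }
>>>   }
>>>
>>> Keep only the features whose score clears some threshold; that is
>>> the sparsification step.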
>>>
>>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
>>> <gsingers@apache.org> wrote:
>>>
>>>> I suspect the explosion in the number of features, Ted, is due to
>>>> the use of n-grams producing a lot of unique terms. I can try w/
>>>> gramSize = 1; that will likely reduce the feature set quite a bit.
>>>>
>>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>>
>>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene: http://www.lucidimagination.com/search