It is also reasonable to hand-judge 20 docs from each of the four cells of
the confusion matrix. That will give you a rough idea of what the error
processes are.
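
A minimal sketch of that bucketing-and-sampling step, assuming per-document
results with (id, actual label, predicted label) are available; the Result
type and method names here are hypothetical, not Mahout APIs:

  import java.util.*;

  /** Bucket classified docs by confusion-matrix cell; sample a few per cell. */
  public class CellSampler {

      /** One classified document (hypothetical shape). */
      public record Result(String docId, String actual, String predicted) {}

      public static Map<String, List<Result>> sample(List<Result> results,
                                                     int perCell, long seed) {
          // Group by cell, e.g. "science->history" = science docs called history.
          Map<String, List<Result>> cells = new HashMap<>();
          for (Result r : results) {
              cells.computeIfAbsent(r.actual() + "->" + r.predicted(),
                                    k -> new ArrayList<>()).add(r);
          }
          // Shuffle each cell and keep the first perCell docs for hand judging.
          Random rnd = new Random(seed);
          Map<String, List<Result>> picked = new LinkedHashMap<>();
          for (Map.Entry<String, List<Result>> e : cells.entrySet()) {
              List<Result> docs = new ArrayList<>(e.getValue());
              Collections.shuffle(docs, rnd);
              picked.put(e.getKey(), docs.subList(0, Math.min(perCell, docs.size())));
          }
          return picked;  // e.g. 20 docs from each of the four cells
      }
  }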
On Wed, Jul 22, 2009 at 1:13 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
> It is probably good to benchmark against standard datasets; for text
> classification this tends to be the Reuters collection:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> That way you know whether you are doing a good job.
>
> Miles
>
> 2009/7/22 Grant Ingersoll <gsingers@apache.org>
>
> > The model size is much smaller with unigrams. :-)
> >
> > I'm not quite sure what constitutes "good" just yet, but I can report the
> > following using the commands I reported earlier, with the exception that
> > I am now using unigrams:
> >
> > I have two categories: History and Science
> >
> > 0. Splitter:
> > org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> > --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
> > --outputDir PATH/wikipedia/chunks -c 64
> >
> > Then prep:
> > org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> > --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test
> > --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> > (also do this for the training set)
> >
> > 1. Train set:
> > ls ../chunks
> > chunk-0001.xml  chunk-0002.xml  chunk-0003.xml  chunk-0004.xml
> > chunk-0005.xml  chunk-0006.xml  chunk-0007.xml  chunk-0008.xml
> > chunk-0009.xml  chunk-0010.xml  chunk-0011.xml  chunk-0012.xml
> > chunk-0013.xml  chunk-0014.xml  chunk-0015.xml  chunk-0016.xml
> > chunk-0017.xml  chunk-0018.xml  chunk-0019.xml  chunk-0020.xml
> > chunk-0021.xml  chunk-0022.xml  chunk-0023.xml  chunk-0024.xml
> > chunk-0025.xml  chunk-0026.xml  chunk-0027.xml  chunk-0028.xml
> > chunk-0029.xml  chunk-0030.xml  chunk-0031.xml  chunk-0032.xml
> > chunk-0033.xml  chunk-0034.xml  chunk-0035.xml  chunk-0036.xml
> > chunk-0037.xml  chunk-0038.xml  chunk-0039.xml
> >
> > 2. Test Set:
> > ls
> > chunk-0101.xml  chunk-0102.xml  chunk-0103.xml  chunk-0104.xml
> > chunk-0105.xml  chunk-0107.xml  chunk-0108.xml  chunk-0109.xml
> > chunk-0130.xml  chunk-0131.xml  chunk-0132.xml  chunk-0133.xml
> > chunk-0134.xml  chunk-0135.xml  chunk-0137.xml  chunk-0139.xml
> >
> > 3. Run the Trainer on the train set:
> > --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
> > --gramSize 1 --classifierType bayes
> >
> > 4. Run the TestClassifier.
> >
> > --model PATH/wikipedia/subjects/model --testDir
> > PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
> >
> > Output is:
> >
> > <snip>
> > 09/07/22 15:55:09 INFO bayes.TestClassifier:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances : 4143 74.0615%
> > Incorrectly Classified Instances : 1451 25.9385%
> > Total Classified Instances : 5594
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> >    a     b    <--Classified as
> > 3910   186    |  4096  a = history
> > 1265   233    |  1498  b = science
> > Default Category: unknown: 2
> > </snip>
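
As a sanity check, the summary figures follow directly from the matrix; here
is a small self-contained sketch that recomputes them, with the counts
hard-coded from the output above:

  /** Recompute accuracy and per-class recall from the confusion matrix. */
  public class MatrixCheck {
      public static void main(String[] args) {
          // Rows = actual class, columns = predicted: [history, science]
          long[][] m = { {3910,  186},     // actual history
                         {1265,  233} };   // actual science
          long correct = m[0][0] + m[1][1];                      // 4143
          long total   = m[0][0] + m[0][1] + m[1][0] + m[1][1];  // 5594
          System.out.printf("accuracy       = %.4f%%%n", 100.0 * correct / total);
          // Recall shows where the errors live: history ~95%, science ~16%,
          // i.e. most mistakes are science articles being called history.
          System.out.printf("history recall = %.4f%%%n",
                            100.0 * m[0][0] / (m[0][0] + m[0][1]));
          System.out.printf("science recall = %.4f%%%n",
                            100.0 * m[1][1] / (m[1][0] + m[1][1]));
      }
  }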
> >
> > At least it's better than 50%, which is presumably a good thing ;-) I have
> > no clue what the state of the art is these days, but it doesn't seem
> > _horrendous_ either.
> >
> > I'd love to see someone validate what I have done. Let me know if you need
> > more details. I'd also like to know how I can improve it.
> >
> > On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
> >
> >> Indeed. I hadn't snapped to the fact that you were using trigrams.
> >>
> >> 30 million features is quite plausible for that. To use long n-grams
> >> effectively as features in document classification you really need
> >> the following:
> >>
> >> a) good statistical methods for resolving which features are useful and
> >> which are not. Everybody here knows that my preference for a first hack
> >> is sparsification with log-likelihood ratios (a rough sketch follows
> >> below).
> >>
> >> b) some kind of smoothing using smaller n-grams
> >>
> >> c) some kind of smoothing over variants of n-grams.
> >>
> >> AFAIK, Mahout doesn't have many (or any) of these in place. You are
> >> likely to do better with unigrams as a result.
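
For (a), a minimal self-contained sketch of log-likelihood-ratio (G^2)
scoring over the 2x2 n-gram/class contingency table; this illustrates the
technique and is not Mahout's actual code:

  /** G^2 score of one n-gram against one class; keep only high scorers. */
  public class Llr {

      private static double xLogX(long x) {
          return x == 0 ? 0.0 : x * Math.log(x);
      }

      /** Unnormalized Shannon entropy over raw counts. */
      private static double entropy(long... counts) {
          long sum = 0;
          double xlx = 0.0;
          for (long c : counts) { sum += c; xlx += xLogX(c); }
          return xLogX(sum) - xlx;
      }

      /**
       * k11 = docs in the class containing the n-gram
       * k12 = docs outside the class containing the n-gram
       * k21 = docs in the class without the n-gram
       * k22 = docs outside the class without the n-gram
       */
      public static double logLikelihoodRatio(long k11, long k12,
                                              long k21, long k22) {
          double row = entropy(k11 + k12, k21 + k22);
          double col = entropy(k11 + k21, k12 + k22);
          double mat = entropy(k11, k12, k21, k22);
          return Math.max(0.0, 2.0 * (row + col - mat));  // clamp rounding noise
      }

      public static void main(String[] args) {
          System.out.println(logLikelihoodRatio(110, 2442, 111, 29114)); // large: keep
          System.out.println(logLikelihoodRatio(1, 100, 100, 10000));    // ~0: drop
      }
  }

Sparsification then just means keeping the n-grams whose score clears some
threshold for at least one class and dropping the rest before training.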
> >>
> >> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
> >> <gsingers@apache.org> wrote:
> >>
> >>> I suspect the explosion in the number of features, Ted, is due to the
> >>> use of n-grams producing a lot of unique terms. I can try w/ gramSize
> >>> = 1; that will likely reduce the feature set quite a bit.
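
To see the blow-up concretely, a toy sketch that counts distinct n-grams over
synthetic uniform-random text (real text is Zipf-distributed and more
repetitive, but the growth pattern with n is the same):

  import java.util.*;

  /** Count distinct word n-grams to show feature growth with n-gram size. */
  public class NgramBlowup {

      static Set<String> distinctNgrams(List<String> tokens, int n) {
          Set<String> grams = new HashSet<>();
          for (int i = 0; i + n <= tokens.size(); i++) {
              grams.add(String.join(" ", tokens.subList(i, i + n)));
          }
          return grams;
      }

      public static void main(String[] args) {
          // 200k tokens drawn from a 1,000-word vocabulary.
          Random rnd = new Random(42);
          List<String> tokens = new ArrayList<>();
          for (int i = 0; i < 200_000; i++) {
              tokens.add("w" + rnd.nextInt(1_000));
          }
          for (int n = 1; n <= 3; n++) {
              System.out.println(n + "-grams: " + distinctNgrams(tokens, n).size());
          }
          // Unigrams stay near the vocabulary size; trigrams approach one
          // distinct feature per token position, which is how a trigram model
          // over a Wikipedia-sized corpus reaches tens of millions of features.
      }
  }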
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
--
Ted Dunning, CTO
DeepDyve