I'm taking a pretty naive (pun intended) approach to this, from the viewpoint of someone coming in new to Mahout and to ML, for that matter. (I'll also admit I haven't done a lot of practical classification myself, even if I've read many of the papers, so it isn't much of a stretch for me.) I just want to get started doing some basic classification that works reasonably well to demonstrate the idea. The code is all publicly available in Mahout.

The Wikipedia data set I'm using is at http://people.apache.org/~gsingers/wikipedia/ (ignore the small files; the big bz2 file is the one I've used).

I'm happy to share the commands I used (a consolidated shell sketch is at the end of this message):

1. WikipediaDataSetCreatorDriver: --input PATH/wikipedia/chunks/ --output PATH/wikipedia/subjects/out --categories PATH TO MAHOUT CODE/examples/src/test/resources/subjects.txt

2. TrainClassifier: --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 3 --classifierType bayes

3. TestClassifier: --model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 3 --classifierType bayes

The training data was produced by the Wikipedia Splitter (the first 60 chunks), and the test data was other chunks not among the first 60. (I haven't successfully completed a Test run yet, or at least not one that produced even decent results.)

I suspect the explosion in the number of features, Ted, is due to the use of n-grams producing a lot of unique terms. I can try with gramSize = 1; that will likely reduce the feature set quite a bit. I am using the WikipediaTokenizer from Lucene, which does a better job of removing cruft from Wikipedia than StandardAnalyzer.

This is all based on my piecing things together from the wiki and the code, not on any great insight on my end.

-Grant

On Jul 22, 2009, at 2:24 PM, Ted Dunning wrote:

> It is common to have more features than there are plausible words.
>
> If these features are common enough to provide some support for the
> statistical inferences, then they are fine to use as long as they aren't
> target leaks. If they are rare (a page URL, for instance), then they have
> little utility and should be pruned.
>
> Pruning will generally improve accuracy as well as speed and memory use.
>
> On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil wrote:
>
>> Yes, I agree. Maybe we can add a prune step or a minSupport parameter
>> to prune. But then again, a lot depends on the tokenizer used. Numeral
>> plus string-literal combinations like, say, 100-sanfrancisco-ugs show up
>> in the Wikipedia data a lot. They add more to the feature count than
>> English words do.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
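
Consolidated shell sketch of the three steps above (referenced earlier). This is rough and untested, not something I've run exactly as written: it assumes the drivers are launched with "hadoop jar" against a Mahout examples job jar, the jar location is a placeholder, the bare class names stand in for the fully qualified names in your own Mahout checkout, and all paths need to be adjusted locally.

#!/usr/bin/env bash
# Rough sketch of the Wikipedia Bayes workflow described above.
# Assumptions (adjust for your setup):
#   - drivers are launched with "hadoop jar" against a Mahout examples job jar
#   - the bare class names stand in for fully qualified names from your checkout
#   - PATH/wikipedia and "PATH TO MAHOUT CODE" are placeholders for local dirs
set -e

WIKI="PATH/wikipedia"                  # base dir holding the split Wikipedia chunks
MAHOUT_SRC="PATH TO MAHOUT CODE"       # your Mahout source checkout
JOB_JAR="$MAHOUT_SRC/examples/target/mahout-examples-job.jar"   # placeholder jar name

# 1. Turn the Wikipedia chunks into per-category training documents.
hadoop jar "$JOB_JAR" WikipediaDataSetCreatorDriver \
  --input "$WIKI/chunks/" \
  --output "$WIKI/subjects/out" \
  --categories "$MAHOUT_SRC/examples/src/test/resources/subjects.txt"

# 2. Train the Bayes model (trigrams, per the flags above).
hadoop jar "$JOB_JAR" TrainClassifier \
  --input "$WIKI/subjects/out" \
  --output "$WIKI/subjects/model" \
  --gramSize 3 \
  --classifierType bayes

# 3. Test against held-out chunks (ones not among the first 60).
hadoop jar "$JOB_JAR" TestClassifier \
  --model "$WIKI/subjects/model" \
  --testDir "$WIKI/subjects/test" \
  --gramSize 3 \
  --classifierType bayes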