From: Grant Ingersoll
To: mahout-user@lucene.apache.org
Subject: Re: Getting Started with Classification
Date: Wed, 22 Jul 2009 14:39:10 -0400

I'm taking a pretty naive (pun intended) approach to this, from the
viewpoint of someone coming in new to Mahout, and to ML for that matter
(I'll also admit I haven't done a lot of practical classification
myself, even if I've read many of the papers, so that viewpoint isn't a
stretch for me), who just wants to get started doing some basic
classification that works reasonably well and demonstrates the idea.
The code is all publicly available in Mahout.

The Wikipedia data set I'm using is at
http://people.apache.org/~gsingers/wikipedia/ (ignore the small files;
the big bz2 file is the one I used).

I'm happy to share the commands I used:

1. WikipediaDataSetCreatorDriver: --input PATH/wikipedia/chunks/
   --output PATH/wikipedia/subjects/out
   --categories PATH TO MAHOUT CODE/examples/src/test/resources/subjects.txt

2. TrainClassifier: --input PATH/wikipedia/subjects/out
   --output PATH/wikipedia/subjects/model --gramSize 3
   --classifierType bayes

3. TestClassifier: --model PATH/wikipedia/subjects/model
   --testDir PATH/wikipedia/subjects/test --gramSize 3
   --classifierType bayes

The training data was produced by the Wikipedia Splitter (the first 60
chunks), and the test data was some other chunks not in the first 60.
(I haven't successfully completed a Test run yet, or at least not one
that produced even decent results.)

I suspect the explosion in the number of features, Ted, is due to the
use of n-grams producing a lot of unique terms. I can try with
gramSize = 1; that should reduce the feature set quite a bit (see also
the pruning sketch at the bottom of this mail). I am using the
WikipediaTokenizer from Lucene, which does a better job of removing
cruft from Wikipedia markup than StandardAnalyzer does.
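In case it helps anyone reproduce that step, here is a minimal,
self-contained sketch of running WikipediaTokenizer by hand. Caveats:
the contrib package name (org.apache.lucene.wikipedia.analysis) and the
pre-2.9 next(Token) loop are what I'd expect on the Lucene 2.4 line,
but both have moved around between releases, so treat this as
illustrative rather than definitive.

  import java.io.StringReader;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

  public class WikiTokenizerDemo {
    public static void main(String[] args) throws Exception {
      // A scrap of wiki markup: WikipediaTokenizer understands
      // [[links]], '''bold''', templates, etc., and emits the words
      // without the surrounding syntax.
      String wikiText = "'''San Francisco''' is in [[California]].";

      WikipediaTokenizer tokenizer =
          new WikipediaTokenizer(new StringReader(wikiText));

      // Pre-Lucene-2.9 token iteration; newer releases use the
      // attribute-based API (incrementToken/TermAttribute) instead.
      Token token = new Token();
      while ((token = tokenizer.next(token)) != null) {
        System.out.println(
            new String(token.termBuffer(), 0, token.termLength()));
      }
      tokenizer.close();
    }
  }

Running that should print the bare terms (San, Francisco, is, in,
California) with the markup stripped, which is exactly the cruft
removal I'm relying on.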
This is all based on me piecing things together from the wiki and the
code, not on any great insight on my end.

-Grant

On Jul 22, 2009, at 2:24 PM, Ted Dunning wrote:

> It is common to have more features than there are plausible words.
>
> If these features are common enough to provide some support for the
> statistical inferences, then they are fine to use, as long as they
> aren't target leaks. If they are rare (a page URL, for instance),
> then they have little utility and should be pruned.
>
> Pruning will generally improve accuracy as well as speed and memory
> use.
>
> On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil wrote:
>
>> Yes, I agree. Maybe we can add a prune step or a minSupport
>> parameter to prune. But then again, a lot depends on the tokenizer
>> used. Numeral-plus-string-literal combinations like, say,
>> 100-sanfrancisco-ugs show up in the Wikipedia data a lot. They add
>> to the feature count more than English words do.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene: http://www.lucidimagination.com/search
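P.S. To make the minSupport idea above concrete, here is a
back-of-the-envelope sketch of the kind of prune Robin is describing:
count every feature across the training documents, then drop anything
seen fewer than minSupport times before training. To be clear, none of
this is existing Mahout API; MinSupportPruner and the threshold of 5
below are made up for illustration, and in the real Hadoop setting this
would be a counting pass plus a filter pass.

  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;

  public class MinSupportPruner {

    /** Count each feature across all documents, then remove the rare
        ones; returns the surviving vocabulary with its counts. */
    public static Map<String, Integer> prune(Iterable<String[]> docs,
                                             int minSupport) {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (String[] doc : docs) {
        for (String feature : doc) {
          Integer c = counts.get(feature);
          counts.put(feature, c == null ? 1 : c + 1);
        }
      }
      // One-off tokens like "100-sanfrancisco-ugs" carry little
      // statistical support and mostly bloat the model; drop them.
      Iterator<Map.Entry<String, Integer>> it =
          counts.entrySet().iterator();
      while (it.hasNext()) {
        if (it.next().getValue() < minSupport) {
          it.remove();
        }
      }
      return counts;
    }
  }

A vocabulary pruned with something like prune(docs, 5) would then be
used to filter the n-grams before they ever reach TrainClassifier,
which should help with both the feature explosion and memory use.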