From: Grant Ingersoll
To: mahout-user@lucene.apache.org
Subject: Re: Getting Started with Classification
Date: Wed, 22 Jul 2009 14:39:10 -0400

I'm taking a pretty naive (pun intended) approach to this, from the
viewpoint of someone coming in new to Mahout, and to ML for that matter
(I'll also admit I haven't done a lot of practical classification
myself, even if I've read many of the papers, so that viewpoint isn't a
stretch for me), who just wants to get started doing some basic
classification that works reasonably well and demonstrates the idea.
The code is all publicly available in Mahout.

The Wikipedia data set I'm using is at
http://people.apache.org/~gsingers/wikipedia/ (ignore the small files;
the big bz2 file is the one I used).

I'm happy to share the commands I used:

1. WikipediaDataSetCreatorDriver: --input PATH/wikipedia/chunks/
   --output PATH/wikipedia/subjects/out
   --categories PATH TO MAHOUT CODE/examples/src/test/resources/subjects.txt

2. TrainClassifier: --input PATH/wikipedia/subjects/out
   --output PATH/wikipedia/subjects/model --gramSize 3
   --classifierType bayes

3. TestClassifier: --model PATH/wikipedia/subjects/model
   --testDir PATH/wikipedia/subjects/test --gramSize 3
   --classifierType bayes

The training data was produced by the Wikipedia Splitter (the first 60
chunks), and the test data was some other chunks not in the first 60.
(I haven't successfully completed a Test run yet, or at least not one
that produced even decent results.)

I suspect the explosion in the number of features, Ted, is due to the
use of n-grams producing a lot of unique terms. I can try with
gramSize = 1; that should reduce the feature set quite a bit (see also
the pruning sketch at the bottom of this mail). I am using the
WikipediaTokenizer from Lucene, which does a better job of removing
cruft from Wikipedia markup than StandardAnalyzer does.
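In case it helps anyone reproduce that step, here is a minimal,
self-contained sketch of running WikipediaTokenizer by hand. Caveats:
the contrib package name (org.apache.lucene.wikipedia.analysis) and the
pre-2.9 next(Token) loop are what I'd expect on the Lucene 2.4 line,
but both have moved around between releases, so treat this as
illustrative rather than definitive.

  import java.io.StringReader;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

  public class WikiTokenizerDemo {
    public static void main(String[] args) throws Exception {
      // A scrap of wiki markup: WikipediaTokenizer understands
      // [[links]], '''bold''', templates, etc., and emits the words
      // without the surrounding syntax.
      String wikiText = "'''San Francisco''' is in [[California]].";

      WikipediaTokenizer tokenizer =
          new WikipediaTokenizer(new StringReader(wikiText));

      // Pre-Lucene-2.9 token iteration; newer releases use the
      // attribute-based API (incrementToken/TermAttribute) instead.
      Token token = new Token();
      while ((token = tokenizer.next(token)) != null) {
        System.out.println(
            new String(token.termBuffer(), 0, token.termLength()));
      }
      tokenizer.close();
    }
  }

Running that should print the bare terms (San, Francisco, is, in,
California) with the markup stripped, which is exactly the cruft
removal I'm relying on.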
This is all based on me piecing things together from the wiki and the
code, not on any great insight on my end.

-Grant

On Jul 22, 2009, at 2:24 PM, Ted Dunning wrote:

> It is common to have more features than there are plausible words.
>
> If these features are common enough to provide some support for the
> statistical inferences, then they are fine to use, as long as they
> aren't target leaks. If they are rare (a page URL, for instance),
> then they have little utility and should be pruned.
>
> Pruning will generally improve accuracy as well as speed and memory
> use.
>
> On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil wrote:
>
>> Yes, I agree. Maybe we can add a prune step or a minSupport
>> parameter to prune. But then again, a lot depends on the tokenizer
>> used. Numeral-plus-string-literal combinations like, say,
>> 100-sanfrancisco-ugs show up in the Wikipedia data a lot. They add
>> to the feature count more than English words do.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene: http://www.lucidimagination.com/search
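P.S. To make the minSupport idea above concrete, here is a
back-of-the-envelope sketch of the kind of prune Robin is describing:
count every feature across the training documents, then drop anything
seen fewer than minSupport times before training. To be clear, none of
this is existing Mahout API; MinSupportPruner and the threshold of 5
below are made up for illustration, and in the real Hadoop setting this
would be a counting pass plus a filter pass.

  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;

  public class MinSupportPruner {

    /** Count each feature across all documents, then remove the rare
        ones; returns the surviving vocabulary with its counts. */
    public static Map<String, Integer> prune(Iterable<String[]> docs,
                                             int minSupport) {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (String[] doc : docs) {
        for (String feature : doc) {
          Integer c = counts.get(feature);
          counts.put(feature, c == null ? 1 : c + 1);
        }
      }
      // One-off tokens like "100-sanfrancisco-ugs" carry little
      // statistical support and mostly bloat the model; drop them.
      Iterator<Map.Entry<String, Integer>> it =
          counts.entrySet().iterator();
      while (it.hasNext()) {
        if (it.next().getValue() < minSupport) {
          it.remove();
        }
      }
      return counts;
    }
  }

A vocabulary pruned with something like prune(docs, 5) would then be
used to filter the n-grams before they ever reach TrainClassifier,
which should help with both the feature explosion and memory use.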