mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Terminology Extraction
Date Mon, 14 Nov 2011 15:33:10 GMT
Look also at the ModelDissector class.

The idea is that all of the hashed vector encoders allow you to pass in a
so-called trace dictionary.  This records which terms were hashed to which
locations in the feature vector.  Then you can explain model weightings
using the ModelDissector.  Significantly, you can (with a bit of extra work)
pass the ModelDissector the internal state of the classifier *after*
multiplying by your input.  That will tell you which features contributed to
the classification of the current document.
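
To make the trace dictionary idea concrete, here is a minimal,
self-contained sketch in plain Python rather than Mahout's Java API.  The
names encode, dissect, NUM_FEATURES and NUM_PROBES, the CRC32 hashing, and
the hand-set weights are all assumptions made for illustration; this is not
how Mahout's encoders or the ModelDissector are implemented.

```python
from collections import defaultdict
import zlib

NUM_FEATURES = 32  # hypothetical vector size, kept small so collisions can occur
NUM_PROBES = 2     # hypothetical number of hashed locations per term


def encode(terms, trace=None):
    """Hash terms into a fixed-size vector.

    If a trace dictionary is supplied, record term -> set of indices used.
    """
    vec = [0.0] * NUM_FEATURES
    for term in terms:
        for probe in range(NUM_PROBES):
            key = f"{probe}:{term}".encode("utf-8")
            idx = zlib.crc32(key) % NUM_FEATURES
            vec[idx] += 1.0
            if trace is not None:
                trace[term].add(idx)
    return vec


def dissect(values, trace, top_n=5):
    """Rank traced terms by the summed values at the indices they hashed to."""
    scores = {term: sum(values[i] for i in idxs) for term, idxs in trace.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]


# Build a trace dictionary while encoding a toy vocabulary.
vocabulary = ["canine", "bone", "the"]
trace = defaultdict(set)
encode(vocabulary, trace)

# Stand-in for learned weights of a "dogs" class: only the locations
# of "canine" carry weight (an assumption for the sake of the example).
weights = [0.0] * NUM_FEATURES
for i in trace["canine"]:
    weights[i] = 3.0

# Explain the model as a whole: which terms map to heavy locations?
global_top = dissect(weights, trace)

# Explain one document: multiply the weights by the document vector
# element-wise, then dissect the product instead of the raw weights.
doc_trace = defaultdict(set)
doc_vec = encode(["canine", "the"], doc_trace)
contributions = [w * x for w, x in zip(weights, doc_vec)]
doc_top = dissect(contributions, doc_trace)

print(global_top)
print(doc_top)
```

Because several terms can hash to the same location, a term may pick up
credit that belongs to a colliding term; using more than one probe per term
reduces that risk but does not remove it.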

This will be a lot slower than normal classification, mostly due to the
overhead of tracing the hashed feature encoding, but it can be made to work.

On Sun, Nov 13, 2011 at 11:57 PM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:

> Try looking into Stochastic Gradient Descent (SGD). You could use
> AdaptiveLogisticRegression to simultaneously create multiple training
> models and try running your tests with the best model as spewed out by
> AdaptiveLogisticRegression.
>
>
>
> ________________________________
> From: Yuval Feinstein <yuvalf@citypath.com>
> To: user@mahout.apache.org
> Sent: Monday, November 14, 2011 2:11 AM
> Subject: Terminology Extraction
>
> Hi all.
> I am trying to use Mahout for terminology extraction:
> I have ~140 classes, each of which contains ~100 text documents.
> The class categories are distinct but may overlap a bit.
> I want to extract terms related to the label, for example if I have a
> "dogs" category,
> the terms "canine", "German Shepherd", "bone" may be related to the
> category.
> What I have come up with in the meantime was:
> 1. Learn a classifier using Mahout.
> 2. Look at term weights for the classifier - terms with high weights are
> candidates for representing the category.
> I currently only use Naive Bayes, with ng=1.
> My questions are:
> a. Is this a good setting for the problem at hand? Or does Mahout have a
> better algorithm for this?
> b. Which Mahout classifier is best for this? I chose Naive Bayes first
> because its parameters have a simple interpretation.
> Which other (stronger) classifiers also have this property?
> TIA,
> Yuval
>
