mahout-user mailing list archives

From Magesh Sarma <>
Subject Re: Document Classification - Recommended Algorithms?
Date Wed, 26 Dec 2012 22:54:34 GMT
Thanks for the helpful pointers.

> Do you have thousands of labeled documents for each category?
Yes, I have several years' worth of human-classified documents.  I can
get my hands on as many labeled documents as needed.

> Are the categories groupable into very similar clusters?
I don't understand what you mean by this.  Each "document" in my case
will have one or more pages - typically 1 to 3 pages.  When testing,
any page may be fed in for classification, and the label needs to be
correctly applied.  So, for training purposes, I split a multi-page
document into single-page ones, and give each page the same category.
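
A minimal sketch of that page-splitting step, assuming each document is
held as a list of page strings plus one label.  The function name, the
sample data, and the doc_id field are hypothetical; the doc_id is kept
only so that pages of one document can later be held out together.

```python
from typing import List, Tuple

# Hypothetical representation: (list of page texts, category label).
Document = Tuple[List[str], str]


def split_into_pages(documents: List[Document]) -> List[Tuple[int, str, str]]:
    """Return one (doc_id, page_text, label) example per page.

    Every page inherits the label of the document it came from.
    """
    examples = []
    for doc_id, (pages, label) in enumerate(documents):
        for page_text in pages:
            examples.append((doc_id, page_text, label))
    return examples


# Illustrative data only, not from the real corpus.
docs = [
    (["invoice page one", "invoice page two"], "invoice"),
    (["claim form"], "claim"),
]
page_examples = split_into_pages(docs)
for example in page_examples:
    print(example)
```

Keeping the doc_id makes it possible to put all pages of a document on
the same side of a train/test split, which may matter when sibling
pages share a lot of text.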

All documents belong to the same business domain and are very similar
in terms used.  However, I'm not sure if that answers your question.

> Do categories come and go?
Very rarely.  When this happens, it will be a highly controlled event.

> What is high accuracy to you?
With J48, I was able to get upwards of 99.5% accurate predictions on a
5,000-document test set.  It was as good as, if not better than, human
classification, assuming the human makes errors too.

> My first recommendation for text classification always is L_1 regularized
> logistic regression.  Since your training data is small, I would recommend
> that you start with glmnet on R with word level features.  If you have
> additional meta-data such as source of the text or time of day or whatnot,
> label that specially and see if including it helps.

There is no metadata - just OCR'd sheets of text.
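
To make the quoted suggestion concrete, here is a rough sketch of
L1-regularized logistic regression on word-level features.  It is not
glmnet and not a Mahout API: it is a pure-Python, two-class toy trained
by proximal gradient descent, and every name, hyperparameter, and sample
page in it is an assumption made for illustration.

```python
import math
from collections import Counter


def featurize(text, vocab):
    """Bag-of-words counts as a sparse dict {feature_index: count}."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    return {vocab[w]: float(c) for w, c in counts.items()}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, t):
    """Proximal operator of the L1 penalty: shrink value toward zero by t."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0


def train_l1_logreg(X, y, n_features, lam=0.05, lr=0.5, epochs=200):
    """Minimize mean log-loss + lam * ||w||_1 (the bias is not penalized)."""
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, label in zip(X, y):
            z = b + sum(w[j] * v for j, v in x.items())
            err = sigmoid(z) - label
            for j, v in x.items():
                grad_w[j] += err * v / n
            grad_b += err / n
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Return 1 if the linear score is positive, else 0."""
    z = b + sum(w[j] * v for j, v in x.items())
    return 1 if z > 0 else 0


# Illustrative pages only: label 1 = "invoice", label 0 = "claim".
pages = [
    ("invoice total amount due", 1),
    ("amount due invoice number", 1),
    ("claim form patient name", 0),
    ("patient claim number form", 0),
]
vocab = {}
for text, _ in pages:
    for word in text.lower().split():
        vocab.setdefault(word, len(vocab))

X = [featurize(text, vocab) for text, _ in pages]
y = [label for _, label in pages]
w, b = train_l1_logreg(X, y, len(vocab))
```

The shrinkage step is what sets uninformative weights to exactly zero;
in this toy data the word "number" occurs equally in both classes, so
its weight never leaves zero.  With hundreds of labels, one such model
per label (one-vs-rest) or a multinomial variant would be needed.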

> Whether you want a multinomial model or lots of binomial models is an open
> question.  Try each design if you can (glmnet will only do the binomial
> option).
> As an interesting tree-based alternative, I think that your data is small
> enough to use the standard random forest implementation.
> If you have usable category nesting, you might try training a top-level
> model, then taking the top few super-categories and trying a category
> specific model at that level.
> R should suffice as long as your data are less than hundreds of thousands.
>  Some algorithms in R work with larger data, most will not.

OK - I will give that a try.
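
For the nested-category idea quoted above, a small sketch of the
routing logic: score super-categories first, then run category-specific
models only for the top few.  The keyword-count scorers stand in for
real trained models, the category names are invented, and multiplying
the two scores is just one plausible way to combine them.

```python
from typing import Callable, Dict, List

# A scorer maps a page of text to {label: score}, scores summing to 1.
Scorer = Callable[[str], Dict[str, float]]


def keyword_scorer(keywords: Dict[str, List[str]]) -> Scorer:
    """Toy stand-in for a trained model: add-one smoothed keyword counts."""
    def score(text: str) -> Dict[str, float]:
        words = text.lower().split()
        raw = {
            label: 1.0 + sum(words.count(k) for k in kws)
            for label, kws in keywords.items()
        }
        total = sum(raw.values())
        return {label: value / total for label, value in raw.items()}
    return score


def classify_hierarchical(text: str, top_model: Scorer,
                          sub_models: Dict[str, Scorer], top_k: int = 2) -> str:
    """Pick the best leaf label among the top_k super-categories."""
    super_scores = top_model(text)
    candidates = sorted(super_scores, key=super_scores.get, reverse=True)[:top_k]
    best_label, best_score = None, float("-inf")
    for super_label in candidates:
        for label, score in sub_models[super_label](text).items():
            combined = super_scores[super_label] * score  # assumed combination rule
            if combined > best_score:
                best_label, best_score = label, combined
    return best_label


# Hypothetical two-level category tree.
top = keyword_scorer({"billing": ["invoice", "payment"],
                      "medical": ["patient", "claim"]})
subs = {
    "billing": keyword_scorer({"invoice": ["invoice"],
                               "receipt": ["receipt", "paid"]}),
    "medical": keyword_scorer({"claim_form": ["claim"],
                               "lab_report": ["lab", "result"]}),
}

print(classify_hierarchical("patient lab result attached", top, subs))
print(classify_hierarchical("invoice payment due", top, subs))
```

Only top_k sub-models run per page, so the per-page cost depends on
top_k and the size of those sub-models rather than on the total number
of labels.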

> On Wed, Dec 26, 2012 at 8:01 AM, Magesh Sarma <> wrote:
> > Hi:
> >
> > Coming from the Weka world, I have a newbie question.
> >
> > My problem is straightforward: I have to label a given document.  Each
> > document will have only one label.  I have hundreds of labels.  I have a
> > big training set (thousands of labeled documents).  Accuracy is important.
> > So is the ability to incrementally train, or alternatively rebuild the
> > model from scratch fast.
> >
> > I have used the J48 (based on C4.5) algorithm in Weka with a good degree of
> > success.  Accuracy is high, but training speed is very slow.  Plus, it does
> > not support incremental training.
> >
> > Any recommendation on what algorithm(s) would be a good fit if I switch to
> > Mahout?
> >
> > Cheers,
> > Magesh
> >
