mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 19:15:20 GMT
Indeed.  I hadn't snapped to the fact you were using trigrams.

30 million features is quite plausible for that.  To effectively use long
n-grams as features in classification of documents you really need to have
the following:

a) good statistical methods for resolving what is useful and what is not.
Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.

b) some kind of smoothing using smaller n-grams

c) some kind of smoothing over variants of n-grams.
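
For concreteness, here is a minimal sketch of what (a) could look like: score each feature against a class with the G^2 log-likelihood ratio on a 2x2 table, and keep only the features that clear a cutoff. The names llr and select_features, the document-level counting, and the threshold are assumptions made for illustration; this is not Mahout code.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table.

    k11: in-class docs with the feature     k12: in-class docs without it
    k21: out-of-class docs with the feature k22: out-of-class docs without it
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (k * log k -> 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def select_features(docs, labels, target, threshold):
    """Return {feature: score} for features scoring at least `threshold`.

    docs is a list of feature collections, one per document.
    The threshold is a hypothetical tuning knob, not a recommended value.
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos, neg = Counter(), Counter()
    for feats, y in zip(docs, labels):
        # Count each feature at most once per document.
        (pos if y == target else neg).update(set(feats))
    kept = {}
    for f in set(pos) | set(neg):
        score = llr(pos[f], n_pos - pos[f], neg[f], n_neg - neg[f])
        if score >= threshold:
            kept[f] = score
    return kept
```

Note that the statistic is symmetric: a feature that is strongly associated with the other class scores just as high as one associated with the target class, which is usually what you want for a discriminative feature set.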

AFAIK, Mahout doesn't have many, if any, of these in place.  You are likely to
do better with unigrams as a result.
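
To see how quickly the feature space grows with n-gram length on your own data, a rough sketch like the following counts the distinct n-grams of each order. It assumes word-level n-grams over pre-tokenized documents; ngrams and distinct_features are hypothetical helper names, not part of Mahout.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_features(docs, max_n):
    """Map each order n in 1..max_n to the number of distinct n-grams seen."""
    seen = {n: set() for n in range(1, max_n + 1)}
    for tokens in docs:
        for n in seen:
            seen[n].update(ngrams(tokens, n))
    return {n: len(grams) for n, grams in seen.items()}


if __name__ == "__main__":
    sample = ["the cat sat on the mat the cat ran".split()]
    print(distinct_features(sample, 3))
```

On a real corpus the unigram count tends to level off as more text is added, while the higher orders keep growing, and most of those long n-grams occur only once or twice.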

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <gsingers@apache.org> wrote:

> I suspect the explosion in the number of features, Ted, is due to the use
> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1, that
> will likely reduce the feature set quite a bit.
>



-- 
Ted Dunning, CTO
DeepDyve
