mahout-user mailing list archives

From Ted Dunning
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 18:24:43 GMT
It is common to have more features than there are plausible words.

If these features are common enough to provide some support for the
statistical inferences, then they are fine to use as long as they aren't
target leaks.  If they are rare (page URL for instance), then they have
little utility and should be pruned.

Pruning will generally improve accuracy as well as speed and memory use.
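
A minimal sketch of the kind of pruning described above, assuming each document has already been tokenized into a list of feature strings. The function name prune_by_min_support and the sample documents are made up for illustration, and min_support only borrows its name from the minSupport parameter suggested in the quoted message; none of this is Mahout's API. It counts how many documents each feature appears in and drops anything below the threshold. It does not check for target leaks, which still need a separate look.

```python
from collections import Counter


def prune_by_min_support(docs, min_support=2):
    """Drop features that occur in fewer than min_support documents.

    Hypothetical helper for illustration only, not part of Mahout.
    """
    # Document frequency: count each feature at most once per document.
    doc_freq = Counter()
    for features in docs:
        doc_freq.update(set(features))

    # Keep only features with enough support across the corpus.
    keep = {f for f, n in doc_freq.items() if n >= min_support}

    # Filter every document down to the surviving vocabulary.
    pruned = [[f for f in features if f in keep] for features in docs]
    return pruned, keep


# Made-up sample data: page URLs and an odd numeral-plus-string token
# each appear in a single document, so they fall below min_support.
docs = [
    ["mahout", "classifier", "url:example.org/a"],
    ["mahout", "bayes", "url:example.org/b"],
    ["classifier", "bayes", "100-sanfrancisco-ugs"],
]

pruned, vocab = prune_by_min_support(docs, min_support=2)
print(sorted(vocab))
```

On real data the threshold would need tuning, since the right value depends on corpus size and on what the tokenizer emits.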

On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil wrote:

> Yes, I agree. Maybe we can add a prune step or a minSupport parameter
> to prune. But then again, a lot depends on the tokenizer used.
> Combinations of numerals and string literals, like say
> 100-sanfrancisco-ugs, are found a lot in Wikipedia data. They add more
> to the feature count than English words do.
