mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Bayes/CBayes classification on a non-existing feature
Date Mon, 03 Oct 2011 14:31:57 GMT
For Bayes, a zero weight for an unseen feature makes a lot of sense.
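
As a rough illustration of what a zero weight means at scoring time, here is a minimal sketch. It is not Mahout's API: the `score` function, the per-label weight dictionaries, and the example numbers are all hypothetical. It only assumes that a document's score for a label is the sum of its per-feature weights.

```python
from typing import Dict, Iterable

def score(label_weights: Dict[str, float], tokens: Iterable[str]) -> float:
    """Sum per-feature weights for one label (hypothetical scoring rule).

    A feature that was never seen for this label contributes 0.0,
    so it neither helps nor hurts the label.
    """
    return sum(label_weights.get(tok, 0.0) for tok in tokens)

# Made-up weights for two labels; "zebra" appears in neither.
weights = {
    "spam": {"free": 2.0, "offer": 1.5},
    "ham": {"meeting": 1.8},
}

doc = ["free", "meeting", "zebra"]
scores = {label: score(w, doc) for label, w in weights.items()}
```

With these made-up weights, the unseen token "zebra" leaves both label scores unchanged.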

The only other reasonable option is to train the tokenizer on a separate
corpus and then return an "unknown-word" token during training.  That will
let the training figure out a good weight for the unknown word.
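
The unknown-word approach could be sketched roughly as follows. This is an illustration under assumptions, not Mahout code: `build_vocab`, `map_unknown`, the `<unk>` marker, and the toy corpora are all hypothetical names and data chosen for the example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical marker for out-of-vocabulary words

def build_vocab(corpus, min_count=1):
    """Collect the vocabulary from a corpus kept separate from the training set."""
    counts = Counter(tok for doc in corpus for tok in doc)
    return {tok for tok, c in counts.items() if c >= min_count}

def map_unknown(tokens, vocab):
    """Replace every token outside the vocabulary with the unknown-word token."""
    return [tok if tok in vocab else UNK for tok in tokens]

# The vocabulary comes from a separate corpus, not from the training documents.
separate_corpus = [["cheap", "offer"], ["meeting", "agenda"]]
vocab = build_vocab(separate_corpus)

# Training documents are mapped before the classifier sees them, so the
# trainer observes "<unk>" as an ordinary feature and learns a weight for it.
training_doc = ["cheap", "zebra", "meeting"]
mapped = map_unknown(training_doc, vocab)
```

At classification time the same mapping would be applied, so a never-seen word picks up whatever weight training assigned to the unknown-word token.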

For document classification, I doubt this is a great idea.

On Mon, Oct 3, 2011 at 7:03 AM, Isabel Drost <isabel@apache.org> wrote:

> On 29.09.2011 André-Philippe Paquet wrote:
> > After checking the CBayesAlgorithm class, I made my own subclass and
> > overrode the "featureWeight" function to return 0 if the weight of the
> > feature in the current label is 0, instead of returning the theta
> > normalized weight. It fixed the problem in my case.
> >
> > Should I file an issue?
>
> Yes, absolutely. Your fix sounds like a nice starting point.
>
> Robin, in a second iteration, should we allow users to plug in their own
> strategies for weighting so far unseen features, or can we come up with one
> that works for the most common cases?
>
> Isabel
>
