mahout-user mailing list archives

From Ted Dunning
Subject Re: Getting Started with Classification
Date Wed, 22 Jul 2009 23:32:34 GMT
Some learning algorithms deal with class imbalance better than others.  The
problem is particularly bad in information retrieval (negative examples
include almost the entire corpus, positives are a tiny fraction) and fraud
detection (typically less than 1% of the training data is fraud).

Down-sampling the over-represented class is the simplest answer where you
have lots of data.  Having more than about 3x as much data for one class as
for the other doesn't help much anyway (at least in binary decisions).
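
As a rough sketch of that down-sampling step (not Mahout code; the function
name, the 0/1 label encoding, and the default 3x cap are assumptions made for
illustration), the majority class can be randomly thinned until it is at most
max_ratio times the size of the minority class:

```python
import random


def downsample(examples, labels, max_ratio=3.0, seed=0):
    """Randomly drop majority-class examples so that the majority class
    is at most max_ratio times the size of the minority class.

    Assumes binary labels encoded as 0 and 1 (an illustrative choice).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Whichever class is smaller is kept whole.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)

    cap = int(max_ratio * len(minority))
    if len(majority) > cap:
        majority = rng.sample(majority, cap)

    # Preserve the original ordering of the surviving examples.
    keep = sorted(minority + majority)
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Usage: 10 fraud cases among 1000 transactions.
labels = [1] * 10 + [0] * 990
examples = list(range(1000))
sampled_x, sampled_y = downsample(examples, labels)
```

Note that a model trained on the down-sampled set sees a higher fraud rate
than the real one, so any probabilities it outputs are shifted accordingly.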

Another aspect of this is the cost of different errors.  For instance, in
fraud detection, verifying a transaction with a customer has a low (but
non-zero) cost, while failing to detect a fraud in progress can be very,
very bad.  False negatives are thus more of a problem than false positives,
and the models are tuned accordingly.
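
One common way to express that tuning, shown here only as a sketch, is to pick
the decision threshold from the two error costs.  It assumes the model outputs
a calibrated fraud probability p, and the cost figures and function names are
made up for illustration.  Flagging is worthwhile when the expected cost of
missing fraud, p * cost_fn, is at least the expected cost of a needless check,
(1 - p) * cost_fp:

```python
def decision_threshold(cost_fp, cost_fn):
    """Probability above which flagging minimizes expected cost.

    Derived from p * cost_fn >= (1 - p) * cost_fp.
    """
    return cost_fp / (cost_fp + cost_fn)


def should_flag(p_fraud, cost_fp, cost_fn):
    """Flag a transaction when its fraud probability reaches the threshold."""
    return p_fraud >= decision_threshold(cost_fp, cost_fn)


# Hypothetical costs: a customer call costs 1 unit, a missed fraud costs 99.
threshold = decision_threshold(cost_fp=1, cost_fn=99)
flag_unequal = should_flag(0.05, cost_fp=1, cost_fn=99)
flag_equal = should_flag(0.05, cost_fp=1, cost_fn=1)
```

With equal costs the threshold is 0.5; as missed fraud becomes more expensive
relative to a verification call, the threshold drops and more transactions get
checked.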

On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne wrote:

> this is the class imbalance problem  (ie you have many more instances for
> one class than another one).
> in this case, you could ensure that the training set was balanced (50:50);
> more interestingly, you can have a prior which corrects for this.  or, you
> could over-sample or even under-sample the training set, etc etc.

Ted Dunning, CTO
