mahout-user mailing list archives

From: Robin Anil <robin.a...@gmail.com>
Subject: Re: Getting Started with Classification
Date: Thu, 23 Jul 2009 01:50:14 GMT
Did you try CBayes? It's supposed to negate the class-imbalance effect
to some extent.
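
For anyone unfamiliar with CBayes: it estimates each class's parameters from
the complement of that class's training data, so every class's estimates are
based on roughly the same (large) amount of data, which dampens the bias
toward over-represented classes. Below is a minimal illustrative sketch of
the complement naive Bayes idea in plain Java (not Mahout's actual API; the
class and field names are made up, and the class-prior term is omitted for
brevity):

    import java.util.*;

    public class ComplementNB {
        private final int numClasses;
        private final int vocabSize;
        private final double[][] complementCounts; // term counts over classes != c
        private final double[] complementTotals;   // total counts over classes != c
        private static final double ALPHA = 1.0;   // Laplace smoothing

        public ComplementNB(int numClasses, int vocabSize) {
            this.numClasses = numClasses;
            this.vocabSize = vocabSize;
            this.complementCounts = new double[numClasses][vocabSize];
            this.complementTotals = new double[numClasses];
        }

        // Each training document's counts go to every class EXCEPT its own.
        public void train(int label, Map<Integer, Integer> termCounts) {
            for (int c = 0; c < numClasses; c++) {
                if (c == label) continue;
                for (Map.Entry<Integer, Integer> e : termCounts.entrySet()) {
                    complementCounts[c][e.getKey()] += e.getValue();
                    complementTotals[c] += e.getValue();
                }
            }
        }

        // Pick the class whose COMPLEMENT model fits the document worst.
        public int classify(Map<Integer, Integer> termCounts) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < numClasses; c++) {
                double score = 0.0;
                for (Map.Entry<Integer, Integer> e : termCounts.entrySet()) {
                    double theta = (complementCounts[c][e.getKey()] + ALPHA)
                            / (complementTotals[c] + ALPHA * vocabSize);
                    score -= e.getValue() * Math.log(theta);
                }
                if (score > bestScore) { bestScore = score; best = c; }
            }
            return best;
        }
    }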



On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Some learning algorithms deal with this better than others.  The problem is
> particularly bad in information retrieval (negative examples include almost
> the entire corpus, while positives are a tiny fraction) and fraud detection
> (typically less than 1% of the training data is fraudulent).
>
> Down-sampling the over-represented class is the simplest answer where you
> have lots of data.  It doesn't help much to have more than 3x as much data
> for one class as for the other anyway (at least in binary decisions).
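
A hedged sketch of that down-sampling step in plain Java (the 3x cap follows
Ted's rule of thumb; the class and method names are made up for
illustration):

    import java.util.*;

    public class DownSample {
        // Keep every minority-class example; cap the majority class at
        // 3x the minority size, chosen uniformly at random.
        public static <T> List<T> balance(List<T> minority, List<T> majority,
                                          long seed) {
            List<T> maj = new ArrayList<>(majority);
            Collections.shuffle(maj, new Random(seed));
            int cap = Math.min(maj.size(), 3 * minority.size());
            List<T> training = new ArrayList<>(minority);
            training.addAll(maj.subList(0, cap));
            Collections.shuffle(training, new Random(seed)); // mix before training
            return training;
        }
    }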
>
> Another aspect of this is the cost of different errors.  For instance, in
> fraud, verifying a transaction with a customer has a low (though non-zero)
> cost, while failing to detect a fraud in progress can be very, very bad.
> False negatives are thus more of a problem than false positives, and the
> models are tuned accordingly.
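
That tuning often amounts to moving the decision threshold. A small sketch
under assumed costs (the dollar figures are made up): if a false negative is
100x as expensive as a false positive, the cost-minimizing rule flags
anything scoring above about 1%, not 50%:

    public class CostSensitiveThreshold {
        static final double COST_FALSE_POSITIVE = 5.0;   // verification call
        static final double COST_FALSE_NEGATIVE = 500.0; // fraud completes

        // Flag when p * cFN > (1 - p) * cFP, i.e. when p exceeds
        // cFP / (cFP + cFN) -- here about 0.0099, far below 0.5.
        public static boolean flag(double pFraud) {
            double threshold = COST_FALSE_POSITIVE
                    / (COST_FALSE_POSITIVE + COST_FALSE_NEGATIVE);
            return pFraud > threshold;
        }
    }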
>
> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <miles@inf.ed.ac.uk> wrote:
>
>> This is the class-imbalance problem (i.e., you have many more instances of
>> one class than of another).
>>
>> In this case, you could ensure that the training set is balanced (50:50);
>> more interestingly, you can use a prior which corrects for the imbalance.
>> Or you could over-sample or even under-sample the training set, and so on.
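
One standard way to apply such a prior correction after training on a
balanced 50:50 sample is to re-weight the model's posterior by the ratio of
the true prior to the training prior. A minimal illustrative sketch (the
prevalence numbers are assumptions):

    public class PriorCorrection {
        // Adjust a posterior from a model trained at trainPrior so it
        // reflects the deployment prevalence truePrior (prior shift).
        public static double correct(double p, double trainPrior,
                                     double truePrior) {
            double pos = p * (truePrior / trainPrior);
            double neg = (1 - p) * ((1 - truePrior) / (1 - trainPrior));
            return pos / (pos + neg);
        }

        public static void main(String[] args) {
            // A 0.9 score from a balanced-trained model, true prevalence 1%:
            System.out.println(correct(0.9, 0.5, 0.01)); // ~0.083
        }
    }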
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
