Did you try CBayes? It's supposed to negate the class imbalance effect to some extent.

On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning wrote:
> Some learning algorithms deal with this better than others.  The problem is
> particularly bad in information retrieval (negative examples include almost
> the entire corpus, positives are a tiny fraction) and fraud (less than 1% of
> the training data is typically fraud).
>
> Down-sampling the over-represented case is the simplest answer where you
> have lots of data.  It doesn't help much to have more than 3x more data for
> one case than another anyway (at least in binary decisions).
>
> Another aspect of this is the cost of different errors.  For instance, in
> fraud, verifying a transaction with a customer has low cost (but not
> non-zero) while not detecting a fraud in progress can be very, very bad.
> False negatives are thus more of a problem than false positives and the
> models are tuned accordingly.
>
> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne wrote:
>
>> This is the class imbalance problem (i.e. you have many more instances of
>> one class than of another).
>>
>> In this case, you could ensure that the training set was balanced (50:50);
>> more interestingly, you can have a prior which corrects for this.  Or you
>> could over-sample or even under-sample the training set, etc.
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
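For anyone who wants to experiment with the down-sampling Ted describes, here is a minimal plain-Java sketch. It is not Mahout API; the generic helper, the example counts, and the 3:1 cap (following the rule of thumb above) are illustrative assumptions only.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch: down-sample the over-represented class before training so that
    // it is at most ratioCap times larger than the under-represented class.
    public class DownSampler {
      public static <T> List<T> downSample(List<T> majority, List<T> minority,
                                           int ratioCap, long seed) {
        // Shuffle a copy of the majority class and keep only the first
        // minority.size() * ratioCap examples.
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed));
        int keep = Math.min(shuffled.size(), minority.size() * ratioCap);

        // Combine the kept majority examples with all minority examples and
        // shuffle again so the training order is mixed.
        List<T> balanced = new ArrayList<>(shuffled.subList(0, keep));
        balanced.addAll(minority);
        Collections.shuffle(balanced, new Random(seed));
        return balanced;
      }
    }

For instance, with 100,000 non-fraud examples, 500 fraud examples, and ratioCap = 3, this keeps 1,500 randomly chosen non-fraud examples plus all 500 fraud examples.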