mahout-user mailing list archives

From: Stuart Smith
Subject: Diagnosing naive bayes results
Date: Fri, 27 Jan 2012 20:06:05 GMT

Does naive Bayes always assign a document to some category?
Or can it decline to classify a document it cannot place?

For example:

I'm working through the naive Bayes tutorial in Taming Text, using my own data.
I built a Lucene index, extracted the training data, split it 90/10, etc.

After looking at the seq dumper output for the trained model, I noticed I had made a mistake
when building the index:
The good/bad documents had a unique id field (in the terms) that didn't get filtered out because
of a typo in my little Java program that builds the index.

I went ahead and ran the test just to see what would happen, and the confusion matrix I got
was all zeros.
No document was classified correctly or incorrectly.

No document was classified at all.

I suspect this is because the model overfit to the unique id field in the training data, which
the test vectors would not have.

While this sounds rational, it only explains the results if naive Bayes can refuse to assign
a document to any category whatsoever.
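
To make the question concrete, here is a minimal sketch of a textbook multinomial naive Bayes
with add-one smoothing. It is not Mahout's implementation: the function names (train, classify)
and the toy documents are made up for illustration, and skipping unseen terms is an assumption
of this sketch. In this formulation the classifier takes the highest-scoring label, so every
document gets some label, even when none of its terms were seen in training.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (label, tokens). Returns (label_counts, term_counts, vocab)."""
    label_counts = Counter(label for label, _ in docs)
    term_counts = defaultdict(Counter)
    for label, tokens in docs:
        term_counts[label].update(tokens)
    vocab = {t for _, tokens in docs for t in tokens}
    return label_counts, term_counts, vocab


def classify(tokens, label_counts, term_counts, vocab):
    """Return (best_label, scores), where scores are log-probabilities up to a constant."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            if t not in vocab:
                continue  # assumption: terms never seen in training are ignored
            # add-one (Laplace) smoothing keeps every likelihood above zero
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        scores[label] = score
    # argmax always yields a label; there is no "refuse to classify" branch
    return max(scores, key=scores.get), scores


# Hypothetical toy data: each training document carries a unique id term,
# mimicking the indexing mistake described above.
docs = [
    ("good", ["id_1", "great", "product"]),
    ("good", ["id_2", "great", "service"]),
    ("bad", ["id_3", "awful", "product"]),
    ("bad", ["id_4", "awful", "service"]),
]
model = train(docs)

# A test document without any id term is still assigned a label.
label, _ = classify(["great", "product"], *model)
print(label)

# So is a document made up entirely of unseen terms (the prior decides).
label, _ = classify(["never_seen_term"], *model)
print(label)
```

Under these assumptions the unique id terms only dilute the per-class term probabilities; they
do not produce an "unclassified" outcome. If that also holds for the real classifier, an
all-zero confusion matrix would point to a problem elsewhere in the pipeline.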

So I'm just wondering whether this is true, or whether I should be looking for more mistakes.

I'm re-running it right now, but building the index takes a while, so I thought I'd ping the
list in the meantime.


Take care,
