lucene-general mailing list archives

From Sam Cunningham <>
Subject Mahout In Action - Bayes/CBayes Classification returns NaN
Date Wed, 02 Nov 2011 17:33:31 GMT
My objective is to classify news documents into classes such as Sports,
Entertainment, Politics, and Business. Here are the steps I took:

- Used the prepare20newsgroups command (page 277 - Mahout In Action) to
prepare the training data set (one long document, ~5MB, per class).
- Moved the training dataset to HDFS, ran the trainclassifier command
(page 278), and created the model
- Moved the model from HDFS to local FS and ran (at
on a sample document
- The result is NaN for all classes. It apparently can't assign any class
to this document, so it ends up labeling it with the default category: unknown.

I know the program works with the 20news dataset. I also know I am training
correctly and my dataset is pretty realistic. What might be the reason that
it cannot classify? I tried a few other documents and the result is the same:
NaN. One note: when I run the prepare20newsgroups command on my training
documents, it writes a single target variable and a single, very long line of
document per class (Sports - tab - one long single document). Could this be
the reason? I ask because I know the 20news dataset has each target variable
repeated many times, once for each of a number of documents.
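
To make the difference in training layout concrete, here is a minimal Python sketch, assuming the prepared file holds one "label - tab - text" example per line as described above. It is not part of Mahout: the helper names count_examples_per_label and split_into_examples, and the words_per_example parameter, are hypothetical and only for illustration. It counts how many lines each label has and shows one way to break a single long document into several lines for the same label. Whether doing so removes the NaN scores is not confirmed here.

```python
from collections import Counter


def count_examples_per_label(lines):
    """Count training lines per label in 'label<TAB>text' formatted lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, _text = line.partition("\t")
        counts[label] += 1
    return counts


def split_into_examples(label, text, words_per_example=200):
    """Split one long document into several 'label<TAB>text' lines.

    words_per_example is an arbitrary illustrative chunk size,
    not a value recommended by Mahout.
    """
    words = text.split()
    return [
        label + "\t" + " ".join(words[i:i + words_per_example])
        for i in range(0, len(words), words_per_example)
    ]


if __name__ == "__main__":
    # Hypothetical stand-in for one very long per-class training line.
    training = ["Sports\t" + "goal match team " * 100]
    print(count_examples_per_label(training))

    text = training[0].partition("\t")[2]
    rows = split_into_examples("Sports", text, words_per_example=50)
    print(count_examples_per_label(rows))
```

Running the count on the real training file would show whether each class really has only one line, which is the layout difference from 20news noted above.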

Please help. Thanks, 
