lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Fwd: Mahout In Action - Bayes/CBayes Classification returns NaN
Date Wed, 02 Nov 2011 18:24:16 GMT
Forwarded to mahout list instead of lucene.  Let's move the discussion
there.

---------- Forwarded message ----------
From: Sam Cunningham <sam_cunnin@yahoo.com>
Date: Wed, Nov 2, 2011 at 10:33 AM
Subject: Mahout In Action - Bayes/CBayes Classification returns NaN
To: general@lucene.apache.org


My objective is to classify news documents into these classes:
Sports, Entertainment, Politics, Business, etc. Here are the steps I took:

- Used the prepare20newsgroups command (page 277, Mahout In Action) to
prepare the training data set (one long document, ~5MB, per class).
- Moved the training data set to HDFS, ran the trainclassifier command
(page 278), and created the model.
- Moved the model from HDFS to the local FS and ran Classify.java
(http://search-lucene.com/c/Mahout:/core/src/main/java/org/apache/mahout/classifier/Classify.java%7C%7Clucene)
on a sample document.
- The result is NaN for all classes. It apparently can't assign any class
to the document, and it ends up labeling it with the default category:
unknown.
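
To illustrate the last step, here is a minimal Java sketch of why all-NaN scores would end in the default category. It is not Mahout's code: the class NanScoreSketch and the method pickLabel are hypothetical names, and the sketch assumes the classifier picks the label with the highest score. It relies only on the fact that in Java any comparison involving NaN is false.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NanScoreSketch {
    // Hypothetical helper, not Mahout's implementation: return the label with
    // the highest score, or defaultLabel when no score wins a comparison.
    public static String pickLabel(Map<String, Double> scores, String defaultLabel) {
        String best = defaultLabel;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            // Any comparison involving NaN is false, so a NaN score never wins.
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Sports", Double.NaN);
        scores.put("Politics", 0.0 / 0.0); // 0.0 / 0.0 evaluates to NaN
        System.out.println(pickLabel(scores, "unknown"));
    }
}
```

If every class score is NaN, no comparison succeeds and the default label is returned, which would match the behaviour described above. Where the NaN values come from in the first place is a separate question that this sketch does not answer.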

I know the program works with the 20news dataset. I also know I am training
correctly and my dataset is pretty realistic. What might be the reason that
it cannot classify? I tried a few other documents and the result is the
same: NaN. One note: when I run the prepare20newsgroups command on my
training documents, it writes a single target variable and a single, very
long line of document text per class (Sports - tab - one long document).
Could this be the reason? I ask because I know the 20news dataset has each
target variable repeated a number of times, with a number of documents in it.
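
One way to test that hypothesis would be to split each long line into several shorter training lines that repeat the label, as in the 20news output. The Java sketch below is only an illustration under that assumption: TrainingLineSplitter, split, and the wordsPerDoc parameter are hypothetical names, and the input is assumed to be exactly one label, a tab, and the text. Whether this removes the NaN results is not established here.

```java
import java.util.ArrayList;
import java.util.List;

class TrainingLineSplitter {
    // Hypothetical helper: turn one "label<TAB>long text" line into several
    // "label<TAB>chunk" lines of at most wordsPerDoc whitespace-separated words.
    public static List<String> split(String line, int wordsPerDoc) {
        if (wordsPerDoc <= 0) {
            throw new IllegalArgumentException("wordsPerDoc must be positive");
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected label<TAB>text");
        }
        String label = line.substring(0, tab);
        String text = line.substring(tab + 1).trim();
        List<String> out = new ArrayList<>();
        if (text.isEmpty()) {
            return out; // nothing to split
        }
        String[] words = text.split("\\s+");
        for (int start = 0; start < words.length; start += wordsPerDoc) {
            int end = Math.min(start + wordsPerDoc, words.length);
            StringBuilder sb = new StringBuilder(label).append('\t');
            for (int i = start; i < end; i++) {
                if (i > start) {
                    sb.append(' ');
                }
                sb.append(words[i]);
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints two lines, each starting with "Sports" and a tab.
        for (String l : split("Sports\tthe team won the final match", 3)) {
            System.out.println(l);
        }
    }
}
```

The chunk size is arbitrary; splitting on the original article boundaries, if they are still available, would be closer to how the 20news data is laid out.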

Please help. Thanks,

--
View this message in context:
http://lucene.472066.n3.nabble.com/Mahout-In-Action-Bayes-CBayes-Classification-returns-NaN-tp3474535p3474535.html
Sent from the Lucene - General mailing list archive at Nabble.com.
