mahout-user mailing list archives

From Brian Feeny
Subject How to use Naive Bayes Classifier to classify new data?
Date Tue, 16 Apr 2013 00:09:09 GMT
I am using Mahout version 0.7.

I have used the complementary naive Bayes classifier to classify basic spam/ham messages, as follows:

Copy easy_ham and spam directories into 20news-all:
 cp -R easy_ham/ spam/ 20news-all/

Copy 20news-all to HDFS:
hadoop fs -put 20news-all 20news-all

Prepare data by sequencing into vectors:
 mahout seqdirectory -i 20news-all -o 20news-seq
 mahout seq2sparse -i 20news-seq -o 20news-vectors  -lnorm -nv  -wt tfidf

Split the data into training and test sets, with 20% used for testing and 80% for training:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors \
  --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c

You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c

Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c

This all works fine, and I get good results in the confusion matrix output.
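
For reference, the steps above can be collected into a single script. This is only a sketch: the commands are the ones listed in this message, while DRY_RUN, run and pipeline are hypothetical helper names (not part of Mahout or Hadoop), and the explicit HDFS destination on the put command is an assumption. By default it only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch of the pipeline described in this message, collected into one script.
# DRY_RUN, run and pipeline are hypothetical names, not part of Mahout or Hadoop.
set -eu

# 1 (default) only prints each command; 0 actually executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

pipeline() {
  # Merge the ham and spam directories and copy the result to HDFS.
  # The explicit HDFS destination is an assumption.
  run cp -R easy_ham/ spam/ 20news-all/
  run hadoop fs -put 20news-all 20news-all

  # Convert the raw text to sequence files, then to TF-IDF vectors.
  run mahout seqdirectory -i 20news-all -o 20news-seq
  run mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

  # Hold out 20% of the vectors for testing.
  run mahout split -i 20news-vectors/tfidf-vectors \
    --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors \
    --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

  # Train the complementary model, then test it on both sets.
  run mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
  run mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
  run mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
}

pipeline
```

Saved under a hypothetical name such as pipeline.sh, running it as-is prints the eight commands; setting DRY_RUN=0 executes them and stops at the first failure.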

Now suppose I have a new message called message.txt. How would I pass it to my trained model to
see whether it is classified as spam or ham?
