h1. Intro
Mahout currently has two implementations of Bayesian classifiers. One is the traditional
Naive Bayes approach, and the other is called Complementary Naive Bayes.
h1. Implementations
[NaiveBayes] ([MAHOUT-9|http://issues.apache.org/jira/browse/MAHOUT-9])
[Complementary Naive Bayes] ([MAHOUT-60|http://issues.apache.org/jira/browse/MAHOUT-60])
The Naive Bayes implementations in Mahout follow the paper [http://people.csail.mit.edu/jrennie/papers/icml03nb.pdf].
Before we get to the actual algorithm, let's define the terminology.
Given
j = 0 to N features
k = 0 to L labels
in an input set of classified documents.
Normalized Frequency for a term (feature) in a document is calculated by dividing the term
frequency by the root mean square of the term frequencies in that document.
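The normalization step can be sketched as follows. This is a minimal illustration, not Mahout's code: the function name normalized_frequencies and the sample document are hypothetical, and "root mean square" is taken literally as the square root of the mean of the squared term frequencies. If the Euclidean norm (square root of the sum of squares) is intended instead, drop the division by n.

```python
import math

def normalized_frequencies(term_counts):
    """Divide each raw term frequency by the root mean square (RMS)
    of all term frequencies in the same document.

    term_counts: dict mapping term -> raw count in one document.
    """
    if not term_counts:
        return {}
    n = len(term_counts)
    # RMS = sqrt(mean of squared counts); an assumption based on the text's wording.
    rms = math.sqrt(sum(c * c for c in term_counts.values()) / n)
    if rms == 0.0:
        return {term: 0.0 for term in term_counts}
    return {term: c / rms for term, c in term_counts.items()}

# Hypothetical two-term document.
doc = {"apache": 3, "mahout": 4}
nf = normalized_frequencies(doc)
```

With this reading, the squared normalized frequencies of a document always sum to the number of distinct terms in it, so long and short documents contribute on a comparable scale.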
Weight Normalized Tf for a given feature in a given label is the sum of the Normalized Frequency of
the feature across all the documents in the label.
Weight Normalized TfIdf for a given feature in a label is the TfIdf calculated using the standard
idf multiplied by the Weight Normalized Tf.
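A small sketch of these two aggregation steps is shown here. The function names and the toy documents are hypothetical, and "standard idf" is assumed to mean log(number of documents / number of documents containing the feature); the text does not spell out the exact idf variant.

```python
import math
from collections import defaultdict

def weight_normalized_tf(docs):
    """Sum the normalized frequency of each feature over all documents of a label.

    docs: list of (label, {feature: normalized_frequency}) pairs.
    Returns {label: {feature: summed normalized frequency}}.
    """
    wntf = defaultdict(lambda: defaultdict(float))
    for label, frequencies in docs:
        for feature, value in frequencies.items():
            wntf[label][feature] += value
    return {label: dict(feats) for label, feats in wntf.items()}

def weight_normalized_tfidf(docs):
    """Multiply each Weight Normalized Tf by the feature's idf.

    Assumption: idf = log(num_docs / doc_freq).
    """
    num_docs = len(docs)
    doc_freq = defaultdict(int)
    for _, frequencies in docs:
        for feature in frequencies:
            doc_freq[feature] += 1
    wntf = weight_normalized_tf(docs)
    return {
        label: {f: tf * math.log(num_docs / doc_freq[f]) for f, tf in feats.items()}
        for label, feats in wntf.items()
    }

# Hypothetical training set: three documents, two labels.
docs = [
    ("sports", {"ball": 1.0, "game": 0.5}),
    ("sports", {"ball": 0.5}),
    ("tech", {"code": 1.0, "game": 0.5}),
]
wntf = weight_normalized_tf(docs)
wntfidf = weight_normalized_tfidf(docs)
```

Here "ball" appears in two of the three documents, so its Weight Normalized Tf of 1.5 in the "sports" label is scaled by log(3/2).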
Once the Weight Normalized TfIdf (WNTfIdf) is calculated, the final weight matrices for Bayes
and CBayes are calculated as follows.
We calculate the sum of WNTfIdf over all the features in a label. This is called Sigma_k or sumLabelWeight.
For Bayes
Weight = Log [ ( WNTfIdf + alpha_i ) / ( Sigma_k + N ) ]
where alpha_i is the smoothing parameter and N is the number of features.
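The Bayes weight can be sketched as below. This is an illustration under stated assumptions rather than Mahout's implementation: the function name bayes_weights and the toy WNTfIdf values are hypothetical, alpha_i is assumed to be 1.0 for every feature, and N is taken as the number of distinct features.

```python
import math

def bayes_weights(wntfidf, features, alpha_i=1.0):
    """Compute log((WNTfIdf + alpha_i) / (Sigma_k + N)) per (label, feature).

    wntfidf:  {label: {feature: WNTfIdf}}; missing features count as 0.
    features: list of all N features in the training set.
    alpha_i:  smoothing parameter (assumed to be 1.0 here).
    """
    n = len(features)
    weights = {}
    for label, feats in wntfidf.items():
        sigma_k = sum(feats.values())  # sumLabelWeight for this label
        weights[label] = {
            f: math.log((feats.get(f, 0.0) + alpha_i) / (sigma_k + n))
            for f in features
        }
    return weights

# Hypothetical WNTfIdf values.
wntfidf_toy = {"sports": {"ball": 2.0, "game": 1.0}, "tech": {"code": 3.0}}
features_toy = ["ball", "game", "code"]
bayes = bayes_weights(wntfidf_toy, features_toy)
```

With alpha_i = 1 the exponentiated weights of each label sum to 1 over all features, which is why N appears in the denominator.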
For CBayes
We calculate the sum of WNTfIdf across all labels for a given feature. We call this sumFeatureWeight
or Sigma_j.
We also sum the WNTfIdf weights over all (feature, label) pairs in the training set. We call
this Sigma_jSigma_k.
Final Weight is calculated as
Weight = Log [ ( Sigma_j - WNTfIdf + alpha_i ) / ( Sigma_jSigma_k - Sigma_k + N ) ]
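The complementary weight can be sketched in the same way. The sketch takes the complement reading of the formula, in which a label's own WNTfIdf is subtracted from the feature total and the label's own sum is subtracted from the grand total, so each weight describes all the other labels. The function name cbayes_weights and the toy values are hypothetical, and alpha_i is again assumed to be 1.0.

```python
import math

def cbayes_weights(wntfidf, features, alpha_i=1.0):
    """Compute log((Sigma_j - WNTfIdf + alpha_i) / (Sigma_jSigma_k - Sigma_k + N)).

    wntfidf:  {label: {feature: WNTfIdf}}; missing features count as 0.
    features: list of all N features in the training set.
    alpha_i:  smoothing parameter (assumed to be 1.0 here).
    """
    n = len(features)
    # Sigma_j (sumFeatureWeight): total weight of each feature across labels.
    sigma_j = {f: sum(feats.get(f, 0.0) for feats in wntfidf.values()) for f in features}
    # Sigma_k (sumLabelWeight): total weight of each label across features.
    sigma_k = {label: sum(feats.values()) for label, feats in wntfidf.items()}
    # Sigma_jSigma_k: total weight over every (feature, label) pair.
    sigma_jk = sum(sigma_k.values())
    weights = {}
    for label, feats in wntfidf.items():
        denominator = sigma_jk - sigma_k[label] + n
        weights[label] = {
            f: math.log((sigma_j[f] - feats.get(f, 0.0) + alpha_i) / denominator)
            for f in features
        }
    return weights

# Hypothetical WNTfIdf values.
wntfidf_c = {"sports": {"ball": 2.0, "game": 1.0}, "tech": {"code": 3.0}}
features_c = ["ball", "game", "code"]
cbayes = cbayes_weights(wntfidf_c, features_c)
```

In this toy data "code" only occurs under "tech", so it gets the largest complement weight for "sports": everything outside "sports" is dominated by that feature.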
h1. Examples
In Mahout's example code, there are two samples that can be used:
# [WikipediaBayesExample] - Classify Wikipedia data.
# [TwentyNewsGroups] - Classify the classic Twenty Newsgroups data.
