mahout-commits mailing list archives

From conflue...@apache.org
Subject [CONF] Apache Lucene Mahout > Bayesian
Date Sun, 07 Feb 2010 15:45:00 GMT
Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Bayesian (http://cwiki.apache.org/confluence/display/MAHOUT/Bayesian)


Edited by Robin Anil:
---------------------------------------------------------------------
h1. Intro

Mahout currently has two implementations of Bayesian classifiers.  One is the traditional
Naive Bayes approach, and the other is called Complementary Naive Bayes.

h1. Implementations

[NaiveBayes] ([MAHOUT-9|http://issues.apache.org/jira/browse/MAHOUT-9])

[Complementary Naive Bayes] ([MAHOUT-60|http://issues.apache.org/jira/browse/MAHOUT-60])

The Naive Bayes implementations in Mahout follow the paper [http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf].
Before we get to the actual algorithm, let's define the terminology.
{noformat}
Given 
j = 0 to N features 
k = 0 to L labels
in an input set of classified documents.
{noformat}
{noformat}
Normalized Frequency for a term (feature) in a document is calculated by dividing the term
frequency by the root mean square of the term frequencies in that document.
Weight Normalized Tf for a given feature in a given label = sum of the Normalized Frequency of
the feature across all the documents in the label.
Weight Normalized Tf-Idf for a given feature in a label is the Tf-Idf calculated using standard
idf multiplied by the Weight Normalized Tf.
{noformat}
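The three definitions above can be sketched in a few lines of Python. This is only an illustration, not Mahout's code: the function names are hypothetical, the root mean square is taken over the terms that actually occur in the document, and "standard idf" is assumed to mean log(number of documents / document frequency).

```python
import math
from collections import defaultdict


def normalized_frequencies(term_counts):
    """Divide each term frequency by the RMS of the document's term frequencies.

    Assumption: the mean is taken over the terms present in the document.
    """
    if not term_counts:
        return {}
    rms = math.sqrt(sum(c * c for c in term_counts.values()) / len(term_counts))
    return {term: c / rms for term, c in term_counts.items()}


def weight_normalized_tf(labeled_docs):
    """Sum the normalized frequency of each feature over all documents of a label."""
    totals = defaultdict(lambda: defaultdict(float))
    for label, term_counts in labeled_docs:
        for term, value in normalized_frequencies(term_counts).items():
            totals[label][term] += value
    return totals


def weight_normalized_tfidf(labeled_docs):
    """Multiply Weight Normalized Tf by idf = log(n_docs / doc_freq) (assumed form)."""
    n_docs = len(labeled_docs)
    doc_freq = defaultdict(int)
    for _, term_counts in labeled_docs:
        for term in term_counts:
            doc_freq[term] += 1
    return {
        label: {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in row.items()}
        for label, row in weight_normalized_tf(labeled_docs).items()
    }


# Hypothetical toy corpus: (label, raw term counts) per document.
docs = [
    ("sports", {"ball": 3, "goal": 4}),
    ("politics", {"vote": 2}),
]
tfidf = weight_normalized_tfidf(docs)
```

With this choice of idf, a term that occurs in every document gets a weight of zero, which is one reason real implementations may use a smoothed variant.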
Once the Weight Normalized Tf-Idf (W-N-Tf-Idf) is calculated, the final weight matrices for Bayes
and CBayes are calculated as follows.

We calculate the sum of W-N-Tf-Idf over all the features in a label; this is called Sigma_k or sumLabelWeight.

For Bayes
{noformat}
Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N  ) ]
{noformat}
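As a rough illustration of the Bayes formula above, the following Python sketch computes the weight for every (label, feature) pair from a small W-N-Tf-Idf matrix. The function name, the toy numbers, and the default alpha_i = 1.0 are assumptions made for the example; N is taken to be the number of distinct features.

```python
import math


def bayes_weights(tfidf, alpha_i=1.0):
    """Weight = log((W-N-Tf-Idf + alpha_i) / (Sigma_k + N)) per label and feature.

    tfidf maps label -> {feature: W-N-Tf-Idf}; missing entries count as 0.
    alpha_i must be positive, otherwise a zero entry would give log(0).
    """
    features = sorted({f for row in tfidf.values() for f in row})
    n = len(features)  # N in the formula: number of features
    weights = {}
    for label, row in tfidf.items():
        sigma_k = sum(row.values())  # sumLabelWeight for this label
        weights[label] = {
            f: math.log((row.get(f, 0.0) + alpha_i) / (sigma_k + n))
            for f in features
        }
    return weights


# Hypothetical W-N-Tf-Idf values for two labels and two features.
example = {
    "sports": {"ball": 2.0, "vote": 1.0},
    "politics": {"ball": 1.0, "vote": 4.0},
}
bayes = bayes_weights(example)
```

For the "sports" label here, Sigma_k is 3 and N is 2, so the weight of "ball" is log((2 + 1) / (3 + 2)).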
For CBayes

We calculate the sum of W-N-Tf-Idf across all labels for a given feature. We call this sumFeatureWeight
or Sigma_j.
We also sum the W-N-Tf-Idf weights over all (feature, label) pairs in the training set. We call
this Sigma_jSigma_k.

Final Weight is calculated as
{noformat}
Weight = Log [ ( Sigma_j - W-N-Tf-Idf + alpha_i ) / ( Sigma_jSigma_k - Sigma_k + N  ) ]
{noformat}
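The CBayes formula can be sketched the same way. Again, this is a minimal illustration under stated assumptions rather than Mahout's implementation: the function name, the toy matrix, and alpha_i = 1.0 are hypothetical, and N is the number of distinct features.

```python
import math


def cbayes_weights(tfidf, alpha_i=1.0):
    """Weight = log((Sigma_j - W-N-Tf-Idf + alpha_i) / (Sigma_jSigma_k - Sigma_k + N)).

    tfidf maps label -> {feature: W-N-Tf-Idf}; missing entries count as 0.
    """
    features = sorted({f for row in tfidf.values() for f in row})
    n = len(features)  # N in the formula: number of features
    # Sigma_k (sumLabelWeight): total weight of each label.
    sigma_k = {label: sum(row.values()) for label, row in tfidf.items()}
    # Sigma_j (sumFeatureWeight): total weight of each feature across labels.
    sigma_j = {f: sum(row.get(f, 0.0) for row in tfidf.values()) for f in features}
    # Sigma_jSigma_k: total weight over all (feature, label) pairs.
    sigma_jk = sum(sigma_k.values())
    return {
        label: {
            f: math.log(
                (sigma_j[f] - row.get(f, 0.0) + alpha_i)
                / (sigma_jk - sigma_k[label] + n)
            )
            for f in features
        }
        for label, row in tfidf.items()
    }


# Hypothetical W-N-Tf-Idf values for two labels and two features.
example = {
    "sports": {"ball": 2.0, "vote": 1.0},
    "politics": {"ball": 1.0, "vote": 4.0},
}
cbayes = cbayes_weights(example)
```

Here Sigma_jSigma_k is 8, so for label "sports" and feature "ball" the weight is log((3 - 2 + 1) / (8 - 3 + 2)). Each weight is built from the mass that lies outside the label, which is what makes the estimate complementary.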

h1. Examples

In Mahout's example code, there are two samples that can be used:

# [WikipediaBayesExample] - Classify Wikipedia data.

# [TwentyNewsGroups] - Classify the classic Twenty Newsgroups data.

