mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From s..@apache.org
Subject svn commit: r1590217 - /mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Date Sat, 26 Apr 2014 07:14:32 GMT
Author: ssc
Date: Sat Apr 26 07:14:32 2014
New Revision: 1590217

URL: http://svn.apache.org/r1590217
Log:
MAHOUT-1502 Update Naive Bayes Webpage to Current Implementation

Modified:
    mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext?rev=1590217&r1=1590216&r2=1590217&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext Sat Apr 26 07:14:32
2014
@@ -1,80 +1,85 @@
 # Naive Bayes
 
-<a name="Bayesian-Intro"></a>
-#### Intro</b>
+
+### Intro
 
 Mahout currently has two Naive Bayes implementations.  The first is standard Multinomial
Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive
Bayes as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
We refer to the former as Bayes and the latter as CBayes.
 
 Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes
that performs particularly well on datasets with skewed classes and has been shown to be competitive
with algorithms of higher complexity such as Support Vector Machines. 
 
-<a name="Bayesian-Implementations"></a>
-#### Implementations
+
+### Implementations
 Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and classification
can be done via a MapReduce Job or sequentially.  Mahout provides CLI drivers for preprocessing,
training and testing. A Spark implementation is currently in the works ([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
 
-#### Preprocessing and Algorithm
+### Preprocessing and Algorithm
 
 As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive
Bayes is broken down into the following steps:  
 
-* Let $\vec{d}=(\vec{d_1},...,\vec{d_n})$ be a set of documents; $d_{ij}$ is the count of
word $i$ in document $j$.
-* Let $\vec{y}=(\vec{y_1},...,\vec{y_n})$ be their labels.
-* <b>Preprocessing</b>: TF-IDF transformation and L2 length normalization of
$\vec{d}$
-1. $d_{ij} = \sqrt{d_{ij}}$
-2. $d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)$
-
-3. $d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}$ 
-* <b>Training: Bayes</b>$(\vec{d},\vec{y})$ calculate term weights $w_{ci}$ as:
-1. $\hat\theta_{ci}=\frac{d_{ic}+1}{\sum_k{d_{kc}+1}} $
-2. $w_{ci}=\log{\hat\theta_{ci}}$
-* <b>Training: CBayes</b>$(\vec{d},\vec{y})$ calculate term weights $w_{ci}$
as:
-1. $\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+1}{\sum_{j:y_j\neq c}\sum_k d_{kj}+1}$
-2. $w_{ci}=-\log{\hat\theta_{ci}}$
-3. $w_{ci} = \frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}$
-* <b>Label Assignment/Testing</b>:
-1. Let $\vec{t}= (t_1,...,t_n)$ be a test document
-2. Label the document according to $l(t)=\arg\max_c \sum_i t_i w_{ci}$
+- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the
count of word `\(i\)` in document `\(j\)`.
+- Let `\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)` be their labels.
+- **Preprocessing**: TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
+    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
+    2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
+    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
+- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci}=\frac{d_{ic}+1}{\sum_k{d_{kc}+1}}\)`
+    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
+- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+1}{\sum_{j:y_j\neq c}\sum_k d_{kj}+1}\)`
+    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
+    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
+- **Label Assignment/Testing:**
+    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document
+    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
 
-As we can see, the main difference between Bayes and CBayes is the weight calculation step.
 Where Bayes weighs terms more heavily based on the likelihood that they belong to class $c$,
CBayes seeks to maximize term weights on the likelihood that they do not belong to any other
class.  
+As we can see, the main difference between Bayes and CBayes is the weight calculation step.
 Where Bayes weighs terms more heavily based on the likelihood that they belong to class `\(c\)`,
CBayes seeks to maximize term weights on the likelihood that they do not belong to any other
class.  
 
-# Running from the command line
+### Running from the command line
 
 Mahout provides CLI drivers for all above steps.  Here we will give a simple overview of
Mahout CLI commands used to preprocess the data, train the model and assign labels to the
training set. An [example script](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh)
is given for the full process from data acquisition through classification of the classic
[20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
 
 
-* <b>Preprocessing</b>:
-For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the `mahout` [`seq2sparse`](https://mahout.apache.org/users/basics/creating-vectors-from-text.html)
command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization
(-n 2 option) as follows:
-`
-      $ mahout seq2sparse 
-	      -i ${PATH_TO_SEQUENCE_FILES} 
-	      -o ${PATH_TO_TFIDF_VECTORS} 
-	      -nv -n 2 -wt tfidf`
+- **Preprocessing:**
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html)
command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization
(-n 2 option) as follows:
 
-* <b>Training:</b>
+        mahout seq2sparse 
+          -i ${PATH_TO_SEQUENCE_FILES} 
+          -o ${PATH_TO_TFIDF_VECTORS} 
+          -nv 
+          -n 2
+          -wt tfidf
+
+- **Training:**
 The model is then trained using `mahout trainnb` .  The default is to train  a Bayes model.
The -c option is given to train a CBayes model:
-`
-      $ mahout trainnb 
-      -i ${PATH_TO_TFIDF_VECTORS} 
-      -el 
-      -o ${PATH_TO_MODEL}/model 
-      -li ${PATH_TO_MODEL}/labelindex 
-      -ow -c
-`
-* <b>Label Assignment/Testing:</b>
+
+        mahout trainnb
+          -i ${PATH_TO_TFIDF_VECTORS} 
+          -el 
+          -o ${PATH_TO_MODEL}/model 
+          -li ${PATH_TO_MODEL}/labelindex 
+          -ow 
+          -c
+
+- **Label Assignment/Testing:**
 Classification and testing on a holdout set can then be performed via `mahout testnb`. Again,
the -c option indicates that the model is CBayes.  The -seq option tells `mahout testnb` to
run sequentially:
-`
-    $ mahout testnb 
-    -i ${PATH_TO_TFIDF_TEST_VECTORS}
-    -m ${PATH_TO_MODEL}/model 
-    -l ${PATH_TO_MODEL}/labelindex 
-    -ow -o ${PATH_TO_OUTPUT} -c -seq
-`
-<a name="Bayesian-Examples"></a>
-# Examples
 
-Mahout provides 2 examples for Naive Nayes classification:
+        mahout testnb 
+          -i ${PATH_TO_TFIDF_TEST_VECTORS}
+          -m ${PATH_TO_MODEL}/model 
+          -l ${PATH_TO_MODEL}/labelindex 
+          -ow 
+          -o ${PATH_TO_OUTPUT} 
+          -c 
+          -seq
+
+### Examples
+
+Mahout provides 2 examples for Naive Bayes classification:
 
 1. [Classify 20 Newsgroups](twenty-newsgroups.html)
 2. [Wikipedia Naive Bayes Example](wikipedia-bayes-example.html)
  
-# References
+### References
 
 [1]: Jason D. M. Rennie, Lawerence Shih, Jamie Teevan, David Karger (2003). [Tackling the
Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
 
+



Mime
View raw message