mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [03/13] mahout git commit: WEBSITE Porting Old Website
Date Sun, 30 Apr 2017 03:24:07 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md
deleted file mode 100644
index 472aaf6..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md
+++ /dev/null
@@ -1,179 +0,0 @@
----
-layout: default
-title: Twenty Newsgroups
-theme:
-    name: retro-mahout
----
-
-
-<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
-## Twenty Newsgroups Classification Example
-
-<a name="TwentyNewsgroups-Introduction"></a>
-## Introduction
-
-The 20 newsgroups dataset is a collection of approximately 20,000
-newsgroup documents, partitioned (nearly) evenly across 20 different
-newsgroups. The 20 newsgroups collection has become a popular data set for
-experiments in text applications of machine learning techniques, such as
-text classification and text clustering. We will use the [Mahout CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html)
-classifier to create a model that would classify a new document into one of
-the 20 newsgroups.
-
-<a name="TwentyNewsgroups-Prerequisites"></a>
-### Prerequisites
-
-* Mahout has been downloaded ([instructions here](https://mahout.apache.org/general/downloads.html))
-* Maven is available
-* Your environment has the following variables:
-     * **HADOOP_HOME**: the directory where Hadoop is installed
-     * **MAHOUT_HOME**: the directory where Mahout is installed
-
-<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
-### Instructions for running the example
-
-1. If running Hadoop in cluster mode, start the Hadoop daemons by executing the following commands:
-
-            $ cd $HADOOP_HOME/bin
-            $ ./start-all.sh
-   
-    Otherwise:
-
-            $ export MAHOUT_LOCAL=true
-
-2. In the trunk directory of Mahout, compile and install Mahout:
-
-            $ cd $MAHOUT_HOME
-            $ mvn -DskipTests clean install
-
-3. Run the [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) by executing:
-
-            $ ./examples/bin/classify-20newsgroups.sh
-
-4. You will be prompted to select a classification algorithm:
-    
-            1. Complement Naive Bayes
-            2. Naive Bayes
-            3. Stochastic Gradient Descent
-
-Select 1 and the script will perform the following:
-
-1. Create a working directory for the dataset and all input/output.
-2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
-3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile.
-4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
-5. Split the preprocessed dataset into training and testing sets. 
-6. Train the classifier.
-7. Test the classifier.
-
-
-Output should look something like:
-
-
-    =======================================================
-    Confusion Matrix
-    -------------------------------------------------------
-     a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t <--Classified as
-    381 0  0  0  0  9  1  0  0  0  1  0  0  2  0  1  0  0  3  0 |398 a=rec.motorcycles
-     1 284 0  0  0  0  1  0  6  3  11 0  66 3  0  6  0  4  9  0 |395 b=comp.windows.x
-     2  0 339 2  0  3  5  1  0  0  0  0  1  1  12 1  7  0  2  0 |376 c=talk.politics.mideast
-     4  0  1 327 0  2  2  0  0  2  1  1  0  5  1  4  12 0  2  0 |364 d=talk.politics.guns
-     7  0  4  32 27 7  7  2  0  12 0  0  6  0 100 9  7  31 0  0 |251 e=talk.religion.misc
-     10 0  0  0  0 359 2  2  0  0  3  0  1  6  0  1  0  0  11 0 |396 f=rec.autos
-     0  0  0  0  0  1 383 9  1  0  0  0  0  0  0  0  0  3  0  0 |397 g=rec.sport.baseball
-     1  0  0  0  0  0  9 382 0  0  0  0  1  1  1  0  2  0  2  0 |399 h=rec.sport.hockey
-     2  0  0  0  0  4  3  0 330 4  4  0  5  12 0  0  2  0  12 7 |385 i=comp.sys.mac.hardware
-     0  3  0  0  0  0  1  0  0 368 0  0  10 4  1  3  2  0  2  0 |394 j=sci.space
-     0  0  0  0  0  3  1  0  27 2 291 0  11 25 0  0  1  0  13 18|392 k=comp.sys.ibm.pc.hardware
-     8  0  1 109 0  6  11 4  1  18 0  98 1  3  11 10 27 1  1  0 |310 l=talk.politics.misc
-     0  11 0  0  0  3  6  0  10 6  11 0 299 13 0  2  13 0  7  8 |389 m=comp.graphics
-     6  0  1  0  0  4  2  0  5  2  12 0  8 321 0  4  14 0  8  6 |393 n=sci.electronics
-     2  0  0  0  0  0  4  1  0  3  1  0  3  1 372 6  0  2  1  2 |398 o=soc.religion.christian
-     4  0  0  1  0  2  3  3  0  4  2  0  7  12 6 342 1  0  9  0 |396 p=sci.med
-     0  1  0  1  0  1  4  0  3  0  1  0  8  4  0  2 369 0  1  1 |396 q=sci.crypt
-     10 0  4  10 1  5  6  2  2  6  2  0  2  1 86 15 14 152 0  1 |319 r=alt.atheism
-     4  0  0  0  0  9  1  1  8  1  12 0  3  0  2  0  0  0 341 2 |390 s=misc.forsale
-     8  5  0  0  0  1  6  0  8  5  50 0  40 2  1  0  9  0  3 256|394 t=comp.os.ms-windows.misc
-    =======================================================
-    Statistics
-    -------------------------------------------------------
-    Kappa                                       0.8808
-    Accuracy                                   90.8596%
-    Reliability                                86.3632%
-    Reliability (standard deviation)            0.2131
-
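As a rough illustration of how summary statistics like the ones above are derived from a confusion matrix, the following Python sketch computes accuracy and Cohen's kappa. This is not Mahout code: the function name and the toy 2x2 matrix are made up for illustration, and we are assuming that the Kappa line reports Cohen's kappa.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[r][c] is the number of documents whose true class is r
    and which were classified as c (rows = truth, columns = prediction).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: the diagonal holds the correctly classified documents.
    correct = sum(matrix[i][i] for i in range(n))
    accuracy = correct / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, kappa


# Hypothetical two-class matrix, only to exercise the function.
toy = [[8, 2],
       [1, 9]]
acc, kappa = accuracy_and_kappa(toy)
```

The same function can be applied to a full 20x20 matrix by passing it as a list of 20 rows.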
-
-
-
-
-<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
-## End to end commands to build a CBayes model for 20 newsgroups
-The [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script:
-
-*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your **$PATH***
-
-1. Create a working directory for the dataset and all input/output.
-           
-            $ export WORK_DIR=/tmp/mahout-work-${USER}
-            $ mkdir -p ${WORK_DIR}
-
-2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
-
-            $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
-                -o ${WORK_DIR}/20news-bydate.tar.gz
-            $ mkdir -p ${WORK_DIR}/20news-bydate
-            $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
-            $ mkdir ${WORK_DIR}/20news-all
-            $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
-     * If you're running on a Hadoop cluster:
- 
-            $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
-
-3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
-          
-            $ mahout seqdirectory \
-                -i ${WORK_DIR}/20news-all \
-                -o ${WORK_DIR}/20news-seq \
-                -ow
-            
-4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
-            
-            $ mahout seq2sparse \
-                -i ${WORK_DIR}/20news-seq \
-                -o ${WORK_DIR}/20news-vectors \
-                -lnorm \
-                -nv \
-                -wt tfidf
-
-If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. `-ng 2` for bigrams or `-n 2` for L2 length normalization. See the [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) page for a list of all seq2sparse options.
-
-5. Split the preprocessed dataset into training and testing sets.
-
-            $ mahout split \
-                -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
-                --trainingOutput ${WORK_DIR}/20news-train-vectors \
-                --testOutput ${WORK_DIR}/20news-test-vectors \
-                --randomSelectionPct 40 \
-                --overwrite --sequenceFiles -xm sequential
- 
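To make the effect of the split step concrete, here is a minimal Python sketch of a random hold-out split. It is not how `mahout split` is implemented: the function name, the seed, and the shuffle-then-slice approach are assumptions for illustration, and we are assuming that `--randomSelectionPct 40` means roughly 40% of the vectors are held out for testing.

```python
import random


def random_split(items, test_pct, seed=0):
    """Shuffle items and hold out test_pct percent of them for testing.

    Returns (training_items, test_items). The seed makes the split
    reproducible; it is a choice made for this sketch only.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    return shuffled[n_test:], shuffled[:n_test]


# Ten hypothetical document ids, 40% held out for testing.
train_ids, test_ids = random_split(range(10), 40)
```

Every document ends up in exactly one of the two sets, so the classifier is never tested on a document it was trained on.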
-6. Train the classifier.
-
-            $ mahout trainnb \
-                -i ${WORK_DIR}/20news-train-vectors \
-                -el \
-                -o ${WORK_DIR}/model \
-                -li ${WORK_DIR}/labelindex \
-                -ow \
-                -c
-
-7. Test the classifier.
-
-            $ mahout testnb \
-                -i ${WORK_DIR}/20news-test-vectors \
-                -m ${WORK_DIR}/model \
-                -l ${WORK_DIR}/labelindex \
-                -ow \
-                -o ${WORK_DIR}/20news-testing \
-                -c
-
- 
-       
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/spark-naive-bayes.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/spark-naive-bayes.md b/website/old_site_migration/needs_work_priority/spark-naive-bayes.md
deleted file mode 100644
index 8823812..0000000
--- a/website/old_site_migration/needs_work_priority/spark-naive-bayes.md
+++ /dev/null
@@ -1,132 +0,0 @@
----
-layout: default
-title: Spark Naive Bayes
-theme:
-    name: retro-mahout
----
-
-# Spark Naive Bayes
-
-
-## Intro
-
-Mahout currently has two flavors of Naive Bayes. The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to the former as Bayes and the latter as CBayes.
-
-Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines.
-
-
-## Implementations
-The Mahout `math-scala` library has an implementation of both Bayes and CBayes, which is further optimized in the `spark` module. Currently the Spark-optimized version provides CLI drivers for training and testing. Mahout Spark-Naive-Bayes models can also be trained, tested and saved to the filesystem from the Mahout Spark Shell.
-
-## Preprocessing and Algorithm
-
-As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf), Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):
-
-- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
-- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
-- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`.
-- **Preprocessing** (via seq2sparse): TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
-    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
-    2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
-    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
-- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
-    1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
-    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
-- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
-    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
-    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
-    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
-- **Label Assignment/Testing:**
-    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)` in that document.
-    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
-
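The three preprocessing steps listed above can be sketched in plain Python as follows. This is only an illustration of the formulas, not the seq2sparse implementation: the function name and the tiny count matrix are made up, and the matrix is stored one row per document (so `counts[j][i]` plays the role of `\(d_{ij}\)`).

```python
import math


def tfidf_l2(counts):
    """Apply the three preprocessing steps to a list of count vectors.

    counts[j][i] is the raw count of word i in document j.
    Returns a new list of vectors with unit L2 length.
    """
    n_docs = len(counts)
    n_words = len(counts[0])
    # Document frequency: in how many documents word i occurs at least once.
    doc_freq = [sum(1 for doc in counts if doc[i] > 0) for i in range(n_words)]
    result = []
    for doc in counts:
        weighted = [
            # Step 1 (square-root TF) times step 2 (IDF factor).
            math.sqrt(c) * (math.log(n_docs / (doc_freq[i] + 1)) + 1)
            for i, c in enumerate(doc)
        ]
        # Step 3: divide by the L2 norm of the document vector.
        norm = math.sqrt(sum(w * w for w in weighted))
        result.append([w / norm if norm > 0 else 0.0 for w in weighted])
    return result


# Two hypothetical documents over a three-word vocabulary.
vectors = tfidf_l2([[2, 0, 1],
                    [0, 3, 1]])
```

After step 3 every non-empty document vector has length 1, so long documents do not dominate the per-class sums used in training.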
-As we can see, the main difference between Bayes and CBayes is the weight calculation step. Where Bayes weighs terms more heavily based on the likelihood that they belong to class `\(c\)`, CBayes seeks to maximize term weights on the likelihood that they do not belong to any other class.
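To see the difference in the weight calculation concretely, the following Python sketch implements the Bayes and CBayes training formulas and the label assignment rule given above. It is a minimal illustration under stated assumptions, not Mahout's implementation: the function names, the uniform smoothing value `alpha_i = 1.0`, and the toy data are all made up, and the vocabulary is assumed to contain at least two words.

```python
import math


def train(docs, labels, n_classes, alpha_i=1.0, complementary=False):
    """Return one weight vector per class.

    docs[j][i] is the (preprocessed) weight of word i in document j and
    labels[j] is its class index. alpha_i is a uniform smoothing value,
    so alpha = alpha_i * vocabulary size.
    """
    n_words = len(docs[0])
    alpha = alpha_i * n_words
    weights = []
    for c in range(n_classes):
        # Bayes sums over documents in class c, CBayes over all other classes.
        rows = [d for d, y in zip(docs, labels) if (y != c) == complementary]
        per_word = [sum(r[i] for r in rows) for i in range(n_words)]
        total = sum(per_word)
        theta = [(s + alpha_i) / (total + alpha) for s in per_word]
        if complementary:
            w = [-math.log(t) for t in theta]
            norm = sum(abs(x) for x in w)  # weight normalization step
            w = [x / norm for x in w]
        else:
            w = [math.log(t) for t in theta]
        weights.append(w)
    return weights


def classify(weights, t):
    """Return the class index c that maximizes sum_i t_i * w_ci."""
    scores = [sum(ti * wi for ti, wi in zip(t, w)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical data: class 0 favors word 0, class 1 favors word 1.
docs = [[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 0]]
labels = [0, 0, 1, 1]
bayes = train(docs, labels, n_classes=2)
cbayes = train(docs, labels, n_classes=2, complementary=True)
```

Calling `classify(bayes, [1, 0, 0])` or `classify(cbayes, [1, 0, 0])` then scores a test document against both classes. On data this small and balanced the two variants agree; the formulas differ only in which documents feed the per-class sums and in the sign and normalization of the log.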
-
-## Running from the command line
-
-Mahout provides CLI drivers for all above steps. Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
-
-- **Preprocessing:**
-For a set of SequenceFile-formatted documents in PATH_TO_SEQUENCE_FILES, the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
-
-        $ mahout seq2sparse \
-          -i ${PATH_TO_SEQUENCE_FILES} \
-          -o ${PATH_TO_TFIDF_VECTORS} \
-          -nv \
-          -n 2 \
-          -wt tfidf
-
-- **Training:**
-The model is then trained using `mahout spark-trainnb`. The default is to train a Bayes model. The -c option is given to train a CBayes model:
-
-        $ mahout spark-trainnb \
-          -i ${PATH_TO_TFIDF_VECTORS} \
-          -o ${PATH_TO_MODEL} \
-          -ow \
-          -c
-
-- **Label Assignment/Testing:**
-Classification and testing on a holdout set can then be performed via `mahout spark-testnb`. Again, the -c option indicates that the model is CBayes:
-
-        $ mahout spark-testnb \
-          -i ${PATH_TO_TFIDF_TEST_VECTORS} \
-          -m ${PATH_TO_MODEL} \
-          -c
-
-## Command line options
-
-- **Preprocessing:** *note: still reliant on MapReduce seq2sparse* 
-  
-  Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes. For a full list of `mahout seq2sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
-
-        $ mahout seq2sparse
-          --output (-o) output             The directory pathname for output.
-          --input (-i) input               Path to job input directory.
-          --weight (-wt) weight            The kind of weight to use. Currently TF
-                                               or TFIDF. Default: TFIDF
-          --norm (-n) norm                 The norm to use, expressed as either a
-                                               float or "INF" if you want to use the
-                                               Infinite norm.  Must be greater or equal
-                                               to 0.  The default is not to normalize
-          --overwrite (-ow)                If set, overwrite the output directory
-          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
-                                               be SequentialAccessVectors. If set true
-                                               else false
-          --namedVector (-nv)              (Optional) Whether output vectors should
-                                               be NamedVectors. If set true else false
-
-- **Training:**
-
-        $ mahout spark-trainnb
-          --input (-i) input               Path to job input directory.                 
-          --output (-o) output             The directory pathname for output.           
-          --trainComplementary (-c)        Train complementary? Default is false.
-          --master (-ma)                   Spark Master URL (optional). Default: "local".
-                                               Note that you can specify the number of 
-                                               cores to get a performance improvement, 
-                                               for example "local[4]"
-          --help (-h)                      Print out help                               
-
-- **Testing:**
-
-        $ mahout spark-testnb
-          --input (-i) input               Path to job input directory.
-          --model (-m) model               The path to the model built during training.
-          --testComplementary (-c)         Test complementary? Default is false.
-          --master (-ma)                   Spark Master URL (optional). Default: "local".
-                                               Note that you can specify the number of
-                                               cores to get a performance improvement,
-                                               for example "local[4]"
-          --help (-h)                      Print out help
-
-## Examples
-1. [20 Newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
-2. [Document classification with Naive Bayes in the Mahout shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
-        
- 
-## References
-
-[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). [Tackling the Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx b/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx
deleted file mode 100644
index ec1de04..0000000
Binary files a/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx and /dev/null differ

