mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [25/51] [partial] mahout git commit: New Website courtesy of startbootstrap.com
Date Sat, 02 Dec 2017 06:09:09 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/linear-algebra/d-spca.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/linear-algebra/d-spca.md b/website-old/docs/algorithms/linear-algebra/d-spca.md
new file mode 100644
index 0000000..d2bd3da
--- /dev/null
+++ b/website-old/docs/algorithms/linear-algebra/d-spca.md
@@ -0,0 +1,175 @@
+---
+layout: algorithm
+
+title: Distributed Stochastic PCA
+theme:
+    name: retro-mahout
+---
+
+
+## Intro
+
+Mahout has a distributed implementation of Stochastic PCA[1]. This algorithm computes the exact equivalent of Mahout's dssvd(`\(\mathbf{A-1\mu^\top}\)`) by modifying the `dssvd` algorithm so as to avoid forming `\(\mathbf{A-1\mu^\top}\)`, which would densify a sparse input. Thus, it is suitable for work with both dense and sparse inputs.
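+
+To make that claim concrete, here is what `dspca` is mathematically equivalent to, but deliberately avoids doing: explicitly mean-centering the input and running `dssvd` on the result. This is only an illustrative sketch in the same R-like DSL (it assumes an existing `drmA` and the usual `org.apache.mahout.math._`, `decompositions._` and `drm._` imports), not how `dspca` is implemented:
+
+    // A sketch only: explicit centering densifies sparse inputs, which is
+    // exactly what dspca is designed to avoid.
+    val mu = drmA.colMeans
+    val bcastMu = drmBroadcast(mu)
+
+    val drmCentered = drmA.mapBlock() {
+        case (keys, block) ⇒
+            val muLocal: Vector = bcastMu
+            // Subtracting the column means from every row turns sparse blocks dense.
+            for (row ← 0 until block.nrow) block(row, ::) -= muLocal
+            keys → block
+    }
+
+    // Same result as dspca(drmA, k), but at a much higher cost for sparse A.
+    val (drmU, drmV, s) = dssvd(drmCentered, k = 20)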
+
+## Algorithm
+
+Given an *m* `\(\times\)` *n* matrix `\(\mathbf{A}\)`, a target rank *k*, and an oversampling parameter *p*, this procedure computes a *k*-rank PCA by finding the unknowns in `\(\mathbf{A−1\mu^\top \approx U\Sigma V^\top}\)`:
+
+1. Create seed for random *n* `\(\times\)` *(k+p)* matrix `\(\Omega\)`.
+2. `\(\mathbf{s_\Omega \leftarrow \Omega^\top \mu}\)`.
+3. `\(\mathbf{Y_0 \leftarrow A\Omega − 1 {s_\Omega}^\top, Y \in \mathbb{R}^{m\times(k+p)}}\)`.
+4. Column-orthonormalize `\(\mathbf{Y_0} \rightarrow \mathbf{Q}\)` by computing thin decomposition `\(\mathbf{Y_0} = \mathbf{QR}\)`. Also, `\(\mathbf{Q}\in\mathbb{R}^{m\times(k+p)}, \mathbf{R}\in\mathbb{R}^{(k+p)\times(k+p)}\)`.
+5. `\(\mathbf{s_Q \leftarrow Q^\top 1}\)`.
+6. `\(\mathbf{B_0 \leftarrow Q^\top A: B \in \mathbb{R}^{(k+p)\times n}}\)`.
+7. `\(\mathbf{s_B \leftarrow B_0 \mu}\)`.
+8. For *i* in 1..*q* repeat (power iterations):
+    - For *j* in 1..*n* apply `\(\mathbf{(B_{i−1})_{∗j} \leftarrow (B_{i−1})_{∗j}−\mu_j s_Q}\)`.
+    - `\(\mathbf{Y_i \leftarrow A{B_{i−1}}^\top−1(s_B−\mu^\top \mu s_Q)^\top}\)`.
+    - Column-orthonormalize `\(\mathbf{Y_i} \rightarrow \mathbf{Q}\)` by computing thin decomposition `\(\mathbf{Y_i = QR}\)`.
+    - `\(\mathbf{s_Q \leftarrow Q^\top 1}\)`.
+    - `\(\mathbf{B_i \leftarrow Q^\top A}\)`.
+    - `\(\mathbf{s_B \leftarrow B_i \mu}\)`.
+9. Let `\(\mathbf{C \triangleq s_Q {s_B}^\top}\)`. `\(\mathbf{M \leftarrow B_q {B_q}^\top − C − C^\top + \mu^\top \mu s_Q {s_Q}^\top}\)`.
+10. Compute an eigensolution of the small symmetric `\(\mathbf{M = \hat{U} \Lambda \hat{U}^\top: M \in \mathbb{R}^{(k+p)\times(k+p)}}\)`.
+11. The singular values `\(\Sigma = \Lambda^{\circ 0.5}\)`, or, in other words, `\(\mathbf{\sigma_i= \sqrt{\lambda_i}}\)`.
+12. If needed, compute `\(\mathbf{U = Q\hat{U}}\)`.
+13. If needed, compute `\(\mathbf{V = B^\top \hat{U} \Sigma^{−1}}\)`.
+14. If needed, items converted to the PCA space can be computed as `\(\mathbf{U\Sigma}\)`.
+
+## Implementation
+
+Mahout's `dspca(...)` is implemented in the `math-scala` algebraic optimizer, which translates Mahout's R-like linear algebra operators into a physical plan for both the Spark and H2O distributed engines.
+
+    def dspca[K](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0): 
+    (DrmLike[K], DrmLike[Int], Vector) = {
+
+        // Some mapBlock() calls need it
+        implicit val ktag =  drmA.keyClassTag
+
+        val drmAcp = drmA.checkpoint()
+        implicit val ctx = drmAcp.context
+
+        val m = drmAcp.nrow
+        val n = drmAcp.ncol
+        assert(k <= (m min n), "k cannot be greater than smaller of m, n.")
+        val pfxed = safeToNonNegInt((m min n) - k min p)
+
+        // Actual decomposition rank
+        val r = k + pfxed
+
+        // Dataset mean
+        val mu = drmAcp.colMeans
+
+        val mtm = mu dot mu
+
+        // We represent Omega by its seed.
+        val omegaSeed = RandomUtils.getRandom().nextInt()
+        val omega = Matrices.symmetricUniformView(n, r, omegaSeed)
+
+        // This is done up front in a single-threaded fashion for now. Even though it doesn't
+        // require any memory beyond what is needed to keep xi around, it still might be
+        // parallelized over the backend for significantly big n and r. TODO
+        val s_o = omega.t %*% mu
+
+        val bcastS_o = drmBroadcast(s_o)
+        val bcastMu = drmBroadcast(mu)
+
+        var drmY = drmAcp.mapBlock(ncol = r) {
+            case (keys, blockA) ⇒
+                val s_o:Vector = bcastS_o
+                val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
+                for (row ← 0 until blockY.nrow) blockY(row, ::) -= s_o
+                keys → blockY
+        }
+                // Checkpoint Y
+                .checkpoint()
+
+        var drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()
+
+        var s_q = drmQ.colSums()
+        var bcastVarS_q = drmBroadcast(s_q)
+
+        // This actually should be optimized as identically partitioned map-side A'B since A and Q should
+        // still be identically partitioned.
+        var drmBt = (drmAcp.t %*% drmQ).checkpoint()
+
+        var s_b = (drmBt.t %*% mu).collect(::, 0)
+        var bcastVarS_b = drmBroadcast(s_b)
+
+        for (i ← 0 until q) {
+
+            // These closures don't seem to play well with outside-scope vars, since closure
+            // attributes aren't recorded correctly otherwise. So we create an additional set of
+            // vals for the broadcast vars to properly create read-only closure attributes in this very scope.
+            val bcastS_q = bcastVarS_q
+            val bcastMuInner = bcastMu
+
+            // Fix Bt as B' -= xi cross s_q
+            drmBt = drmBt.mapBlock() {
+                case (keys, block) ⇒
+                    val s_q: Vector = bcastS_q
+                    val mu: Vector = bcastMuInner
+                    keys.zipWithIndex.foreach {
+                        case (key, idx) ⇒ block(idx, ::) -= s_q * mu(key)
+                    }
+                    keys → block
+            }
+
+            drmY.uncache()
+            drmQ.uncache()
+
+            val bCastSt_b = drmBroadcast(s_b -=: mtm * s_q)
+
+            drmY = (drmAcp %*% drmBt)
+                // Fix Y by subtracting st_b from each row of the AB'
+                .mapBlock() {
+                case (keys, block) ⇒
+                    val st_b: Vector = bCastSt_b
+                    block := { (_, c, v) ⇒ v - st_b(c) }
+                    keys → block
+            }
+            // Checkpoint Y
+            .checkpoint()
+
+            drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()
+
+            s_q = drmQ.colSums()
+            bcastVarS_q = drmBroadcast(s_q)
+
+            // This on the other hand should be inner-join-and-map A'B optimization since A and Q_i are not
+            // identically partitioned anymore.
+            drmBt = (drmAcp.t %*% drmQ).checkpoint()
+
+            s_b = (drmBt.t %*% mu).collect(::, 0)
+            bcastVarS_b = drmBroadcast(s_b)
+        }
+
+        val c = s_q cross s_b
+        val inCoreBBt = (drmBt.t %*% drmBt).checkpoint(CacheHint.NONE).collect -=:
+            c -=: c.t +=: mtm *=: (s_q cross s_q)
+        val (inCoreUHat, d) = eigen(inCoreBBt)
+        val s = d.sqrt
+
+        // Since neither drmU nor drmV are actually computed until actually used, we don't need the flags
+        // instructing compute (or not compute) either of the U,V outputs anymore. Neat, isn't it?
+        val drmU = drmQ %*% inCoreUHat
+        val drmV = drmBt %*% (inCoreUHat %*% diagv(1 / s))
+
+        (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
+    }
+
+## Usage
+
+The scala `dspca(...)` method can easily be called in any Spark, Flink, or H2O application built with the `math-scala` library and the corresponding `Spark`, `Flink`, or `H2O` engine module as follows:
+
+    import org.apache.mahout.math._
+    import decompositions._
+    import drm._
+    
+    val (drmU, drmV, s) = dspca(drmA, k=200, q=1)
+
+Note that the `q` parameter is optional and its default value is zero.
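+
+If, per step 14 of the algorithm, you additionally want the input items expressed in the PCA space, one way to get them (a sketch, not part of the `dspca` API; `diagv` comes from `org.apache.mahout.math.scalabindings._`) is to scale the left singular vectors by the singular values:
+
+    // U * Sigma: the k-dimensional PCA-space coordinates of the input rows.
+    val drmPcaSpace = drmU %*% diagv(s)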
+ 
+## References
+
+[1]: Lyubimov and Palumbo, ["Apache Mahout: Beyond MapReduce; Distributed Algorithm Design"](https://www.amazon.com/Apache-Mahout-MapReduce-Dmitriy-Lyubimov/dp/1523775785)

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/linear-algebra/d-ssvd.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/linear-algebra/d-ssvd.md b/website-old/docs/algorithms/linear-algebra/d-ssvd.md
new file mode 100644
index 0000000..7a31e4d
--- /dev/null
+++ b/website-old/docs/algorithms/linear-algebra/d-ssvd.md
@@ -0,0 +1,140 @@
+---
+layout: algorithm
+title: Distributed Stochastic Singular Value Decomposition
+theme:
+    name: retro-mahout
+---
+
+## Intro
+
+Mahout has a distributed implementation of Stochastic Singular Value Decomposition [1] using the parallelization strategy comprehensively defined in Nathan Halko's dissertation ["Randomized methods for computing low-rank approximations of matrices"](http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf) [2].
+
+## Modified SSVD Algorithm
+
+Given an `\(m\times n\)`
+matrix `\(\mathbf{A}\)`, a target rank `\(k\in\mathbb{N}_{1}\)`
+, an oversampling parameter `\(p\in\mathbb{N}_{1}\)`, 
+and the number of additional power iterations `\(q\in\mathbb{N}_{0}\)`, 
+this procedure computes an `\(m\times\left(k+p\right)\)`
+SVD `\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`:
+
+  1. Create seed for random `\(n\times\left(k+p\right)\)`
+  matrix `\(\boldsymbol{\Omega}\)`. The seed defines matrix `\(\mathbf{\Omega}\)`
+  using Gaussian unit vectors per one of the suggestions in [Halko, Martinsson, Tropp] [3].
+
+  2. `\(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\)`
+ 
+  3. Column-orthonormalize `\(\mathbf{Y}\rightarrow\mathbf{Q}\)`
+  by computing thin decomposition `\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)`.
+  Also, `\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`; denoted as `\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)`
+
+  4. `\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}\in\mathbb{R}^{\left(k+p\right)\times n}\)`.
+ 
+  5. If `\(q>0\)`
+  repeat: for `\(i=1..q\)`: 
+  `\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)`
+  (power iterations step).
+
+  6. Compute Eigensolution of a small Hermitian `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\)`,
+  `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
+ 
+  7. Singular values `\(\mathbf{\boldsymbol{\Sigma}}=\boldsymbol{\Lambda}^{0.5}\)`,
+  or, in other words, `\(\sigma_{i}=\sqrt{\lambda_{i}}\)`.
+ 
+  8. If needed, compute `\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)`.
+
+  9. If needed, compute `\(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)`.
+Another way is `\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)`.
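+
+In brief, the reason steps 6-9 recover the SVD: after the projection, `\(\mathbf{A}\approx\mathbf{Q}\mathbf{Q}^{\top}\mathbf{A}=\mathbf{Q}\mathbf{B}_{q}\)`, so a thin SVD `\(\mathbf{B}_{q}=\hat{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)` gives `\(\mathbf{A}\approx\left(\mathbf{Q}\hat{\mathbf{U}}\right)\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`. Rather than computing that SVD directly, the eigensolution of the small matrix `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\hat{\mathbf{U}}\boldsymbol{\Lambda}\hat{\mathbf{U}}^{\top}\)` yields the same `\(\hat{\mathbf{U}}\)` together with `\(\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{0.5}\)`, from which `\(\mathbf{U}\)` and `\(\mathbf{V}\)` follow as in steps 8 and 9.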
+
+
+
+
+## Implementation
+
+Mahout's `dssvd(...)` is implemented in the `math-scala` algebraic optimizer, which translates Mahout's R-like linear algebra operators into a physical plan for both the Spark and H2O distributed engines.
+
+    def dssvd[K: ClassTag](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0):
+        (DrmLike[K], DrmLike[Int], Vector) = {
+
+        val drmAcp = drmA.checkpoint()
+
+        val m = drmAcp.nrow
+        val n = drmAcp.ncol
+        assert(k <= (m min n), "k cannot be greater than smaller of m, n.")
+        val pfxed = safeToNonNegInt((m min n) - k min p)
+
+        // Actual decomposition rank
+        val r = k + pfxed
+
+        // We represent Omega by its seed.
+        val omegaSeed = RandomUtils.getRandom().nextInt()
+
+        // Compute Y = A*Omega.  
+        var drmY = drmAcp.mapBlock(ncol = r) {
+            case (keys, blockA) =>
+                val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
+            keys -> blockY
+        }
+
+        var drmQ = dqrThin(drmY.checkpoint())._1
+
+        // Checkpoint Q if last iteration
+        if (q == 0) drmQ = drmQ.checkpoint()
+
+        var drmBt = drmAcp.t %*% drmQ
+        
+        // Checkpoint B' if last iteration
+        if (q == 0) drmBt = drmBt.checkpoint()
+
+        for (i <- 0  until q) {
+            drmY = drmAcp %*% drmBt
+            drmQ = dqrThin(drmY.checkpoint())._1            
+            
+            // Checkpoint Q if last iteration
+            if (i == q - 1) drmQ = drmQ.checkpoint()
+            
+            drmBt = drmAcp.t %*% drmQ
+            
+            // Checkpoint B' if last iteration
+            if (i == q - 1) drmBt = drmBt.checkpoint()
+        }
+
+        val (inCoreUHat, d) = eigen(drmBt.t %*% drmBt)
+        val s = d.sqrt
+
+        // Since neither drmU nor drmV are actually computed until actually used
+        // we don't need the flags instructing compute (or not compute) either of the U,V outputs 
+        val drmU = drmQ %*% inCoreUHat
+        val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s))
+
+        (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
+    }
+
+Note: As a side effect of checkpointing, U and V values are returned as logical operators (i.e. they are neither checkpointed nor computed).  Therefore there is no physical work actually done to compute `\(\mathbf{U}\)` or `\(\mathbf{V}\)` until they are used in a subsequent expression.
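+
+For example (a sketch, assuming the `drmU` returned by the decomposition), slicing only extends the logical plan; the physical work happens when a result is materialized:
+
+    // Still just a logical plan -- no distributed computation has run yet.
+    val drmU10 = drmU(::, 0 until 10)
+
+    // Materializing to an in-core matrix (or checkpointing) triggers the actual work.
+    val inCoreU10 = drmU10.collect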
+
+
+## Usage
+
+The scala `dssvd(...)` method can easily be called in any Spark or H2O application built with the `math-scala` library and the corresponding `Spark` or `H2O` engine module as follows:
+
+    import org.apache.mahout.math._
+    import decompositions._
+    import drm._
+    
+    
+    val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)
+
+ 
+## References
+
+[1]: [Mahout Scala and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
+
+[2]: [Randomized methods for computing low-rank approximations of matrices](http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf)
+
+[3]: [Halko, Martinsson, Tropp](http://arxiv.org/abs/0909.4061)
+
+[4]: [Mahout Spark and Scala Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/linear-algebra/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/linear-algebra/index.md b/website-old/docs/algorithms/linear-algebra/index.md
new file mode 100644
index 0000000..e42978a
--- /dev/null
+++ b/website-old/docs/algorithms/linear-algebra/index.md
@@ -0,0 +1,16 @@
+---
+layout: algorithm
+
+title: Distributed Linear Algebra
+theme:
+    name: retro-mahout
+---
+
+Mahout has a number of distributed linear algebra "algorithms" that, in concert with the mathematically expressive R-Like Scala DSL, make it possible for users to quickly "roll their own" distributed algorithms.
+ 
+[Distributed QR Decomposition](d-qr.html)
+
+[Distributed Stochastic Principal Component Analysis](d-spca.html)
+
+[Distributed Stochastic Singular Value Decomposition](d-ssvd.html)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/bayesian.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/bayesian.md b/website-old/docs/algorithms/map-reduce/classification/bayesian.md
new file mode 100644
index 0000000..5fd5f92
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/bayesian.md
@@ -0,0 +1,147 @@
+---
+layout: algorithm
+title: (Deprecated) 
+theme:
+    name: retro-mahout
+---
+
+# Naive Bayes
+
+
+## Intro
+
+Mahout currently has two Naive Bayes Map-Reduce implementations.  The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to the former as Bayes and the latter as CBayes.
+
+Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines. 
+
+
+## Implementations
+Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and classification can be done via a MapReduce Job or sequentially.  Mahout provides CLI drivers for preprocessing, training and testing. A Spark implementation is currently in the works ([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
+
+## Preprocessing and Algorithm
+
+As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):  
+
+- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`. 
+- **Preprocessing** (via seq2sparse): TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
+    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
+    2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
+    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
+- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
+    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
+- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
+    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
+    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
+- **Label Assignment/Testing:**
+    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)` in the test document.
+    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
+
+As we can see, the main difference between Bayes and CBayes is the weight calculation step. Where Bayes weighs terms more heavily based on the likelihood that they belong to class `\(c\)`, CBayes bases its term weights on the likelihood that they do not belong to any of the other classes.
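+
+As a purely illustrative sketch (plain Scala with made-up counts aggregated per class; this is not the MapReduce implementation), the two weighting schemes above can be computed like this:
+
+    // Hypothetical aggregated word counts: rows = classes, columns = vocabulary terms.
+    val counts = Array(Array(3.0, 0.0, 1.0), Array(0.0, 2.0, 2.0))
+    val alphaI = 1.0
+    val nTerms = counts(0).length
+    val alpha = alphaI * nTerms
+
+    // Bayes: w_ci = log((d_ic + alpha_i) / (sum_k d_kc + alpha))
+    val bayesW = counts.map { row =>
+        val total = row.sum
+        row.map(d => math.log((d + alphaI) / (total + alpha)))
+    }
+
+    // CBayes: use the counts of all classes other than c, negate, then L1-normalize.
+    val termTotals = Array.tabulate(nTerms)(i => counts.map(_(i)).sum)
+    val grandTotal = termTotals.sum
+    val cbayesW = counts.map { row =>
+        val w = Array.tabulate(nTerms) { i =>
+            -math.log((termTotals(i) - row(i) + alphaI) / (grandTotal - row.sum + alpha))
+        }
+        val norm = w.map(math.abs).sum
+        w.map(_ / norm)
+    }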
+
+## Running from the command line
+
+Mahout provides CLI drivers for all above steps.  Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).  
+
+- **Preprocessing:**
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
+
+        mahout seq2sparse 
+          -i ${PATH_TO_SEQUENCE_FILES} 
+          -o ${PATH_TO_TFIDF_VECTORS} 
+          -nv 
+          -n 2
+          -wt tfidf
+
+- **Training:**
+The model is then trained using `mahout trainnb` .  The default is to train a Bayes model. The -c option is given to train a CBayes model:
+
+        mahout trainnb
+          -i ${PATH_TO_TFIDF_VECTORS} 
+          -o ${PATH_TO_MODEL}/model 
+          -li ${PATH_TO_MODEL}/labelindex 
+          -ow 
+          -c
+
+- **Label Assignment/Testing:**
+Classification and testing on a holdout set can then be performed via `mahout testnb`. Again, the -c option indicates that the model is CBayes.  The -seq option tells `mahout testnb` to run sequentially:
+
+        mahout testnb 
+          -i ${PATH_TO_TFIDF_TEST_VECTORS}
+          -m ${PATH_TO_MODEL}/model 
+          -l ${PATH_TO_MODEL}/labelindex 
+          -ow 
+          -o ${PATH_TO_OUTPUT} 
+          -c 
+          -seq
+
+## Command line options
+
+- **Preprocessing:**
+  
+  Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes.  For a full list of `mahout seq2Sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
+
+        mahout seq2sparse                         
+          --output (-o) output             The directory pathname for output.        
+          --input (-i) input               Path to job input directory.              
+          --weight (-wt) weight            The kind of weight to use. Currently TF   
+                                               or TFIDF. Default: TFIDF                  
+          --norm (-n) norm                 The norm to use, expressed as either a    
+                                               float or "INF" if you want to use the     
+                                               Infinite norm.  Must be greater or equal  
+                                               to 0.  The default is not to normalize    
+          --overwrite (-ow)                If set, overwrite the output directory    
+          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should  
+                                               be SequentialAccessVectors. If set true   
+                                               else false                                
+          --namedVector (-nv)              (Optional) Whether output vectors should  
+                                               be NamedVectors. If set true else false   
+
+- **Training:**
+
+        mahout trainnb
+          --input (-i) input               Path to job input directory.                 
+          --output (-o) output             The directory pathname for output.                    
+          --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0
+          --trainComplementary (-c)        Train complementary? Default is false.                        
+          --labelIndex (-li) labelIndex    The path to store the label index in         
+          --overwrite (-ow)                If present, overwrite the output directory   
+                                               before running job                           
+          --help (-h)                      Print out help                               
+          --tempDir tempDir                Intermediate output directory                
+          --startPhase startPhase          First phase to run                           
+          --endPhase endPhase              Last phase to run
+
+- **Testing:**
+
+        mahout testnb   
+          --input (-i) input               Path to job input directory.                  
+          --output (-o) output             The directory pathname for output.            
+          --overwrite (-ow)                If present, overwrite the output directory    
+                                               before running job                                                
+
+      
+          --model (-m) model               The path to the model built during training   
+          --testComplementary (-c)         Test complementary? Default is false.                          
+          --runSequential (-seq)           Run sequential?                               
+          --labelIndex (-l) labelIndex     The path to the location of the label index   
+          --help (-h)                      Print out help                                
+          --tempDir tempDir                Intermediate output directory                 
+          --startPhase startPhase          First phase to run                            
+          --endPhase endPhase              Last phase to run  
+
+
+## Examples
+
+Mahout provides an example for Naive Bayes classification:
+
+1. [Classify 20 Newsgroups](twenty-newsgroups.html)
+ 
+## References
+
+[1]: Jason D. M. Rennie, Lawerence Shih, Jamie Teevan, David Karger (2003). [Tackling the Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/class-discovery.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/class-discovery.md b/website-old/docs/algorithms/map-reduce/classification/class-discovery.md
new file mode 100644
index 0000000..2afc519
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/class-discovery.md
@@ -0,0 +1,155 @@
+---
+layout: algorithm
+title: (Deprecated)  Class Discovery
+theme:
+    name: retro-mahout
+---
+<a name="ClassDiscovery-ClassDiscovery"></a>
+# Class Discovery
+
+See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf
+
+CDGA uses a Genetic Algorithm to discover a classification rule for a given
+dataset. 
+A dataset can be seen as a table:
+
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>row 1</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>row 2</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
+<tr><td>row M</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+An attribute can be numerical, for example a "temperature" attribute, or
+categorical, for example a "color" attribute. For classification purposes,
+one of the categorical attributes is designated as a *label*, which means
+that its value defines the *class* of the rows.
+A classification rule can be represented as follows:
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>weight</td><td>w1</td><td>w2</td><td>...</td><td>wN</td></tr>
+<tr><td>operator</td><td>op1</td><td>op2</td><td>...</td><td>opN</td></tr>
+<tr><td>value</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+For a given *target* class and a weight *threshold*, the classification
+rule can be read as:
+
+
+    for each row of the dataset
+      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
+         (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
+         ...
+         (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
+        row is part of the target class
+
+
+*Important:* The label attribute is not evaluated by the rule.
+
+The threshold parameter allows some conditions of the rule to be skipped if
+their weight is too small. The operators available depend on the attribute
+types:
+* for numerical attributes, the available operators are '<' and '>='
+* for categorical attributes, the available operators are '!=' and '=='
+
+The "threshold" and "target" are user defined parameters, and because the
+label is always a categorical attribute, the target is the (zero based)
+index of the class label value in all the possible values of the label. For
+example, if the label attribute can have the following values (blue, brown,
+green), then a target of 1 means the "brown" class.
+
+For example, we have the following dataset (the label attribute is "Eyes
+Color"):
+<table>
+<tr><th> </th><th>Age</th><th>Eyes Color</th><th>Hair Color</th></tr>
+<tr><td>row 1</td><td>16</td><td>brown</td><td>dark</td></tr>
+<tr><td>row 2</td><td>25</td><td>green</td><td>light</td></tr>
+<tr><td>row 3</td><td>12</td><td>blue</td><td>light</td></tr>
+</table>
+
+and a classification rule:
+
+<table>
+<tr><th> </th><th>Age</th><th>Hair Color</th></tr>
+<tr><td>weight</td><td>0</td><td>1</td></tr>
+<tr><td>operator</td><td>&lt;</td><td>!=</td></tr>
+<tr><td>value</td><td>20</td><td>light</td></tr>
+</table>
+
+and the following parameters: threshold = 1 and target = 0 (brown).
+
+This rule can be read as follows:
+
+    for each row of the dataset
+      if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
+         (1 < 1 || (1 >= 1 && row.value2 != light)) then
+        row is part of the "brown Eye Color" class
+
+
+Please note how the rule skipped the label attribute (Eye Color), and how
+the first condition is ignored because its weight is < threshold.
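+
+The same rule evaluation can be sketched in code. The following is illustrative Scala only; the types and names are hypothetical and are not CDGA's actual classes:
+
+    // One condition per non-label attribute: a weight, an operator and a comparison value.
+    sealed trait Attr
+    case class Num(v: Double) extends Attr
+    case class Cat(v: String) extends Attr
+    case class Condition(weight: Double, op: String, value: Attr)
+
+    // A condition holds when its weight is below the threshold (condition skipped)
+    // or when the row's attribute value passes the comparison.
+    def matches(cond: Condition, attr: Attr, threshold: Double): Boolean =
+        cond.weight < threshold || ((cond.op, attr, cond.value) match {
+            case ("<",  Num(a), Num(b)) => a < b
+            case (">=", Num(a), Num(b)) => a >= b
+            case ("==", Cat(a), Cat(b)) => a == b
+            case ("!=", Cat(a), Cat(b)) => a != b
+            case _                      => false
+        })
+
+    // A row (label attribute excluded) is part of the target class when every condition holds.
+    def ruleCovers(rule: Seq[Condition], row: Seq[Attr], threshold: Double): Boolean =
+        rule.zip(row).forall { case (cond, attr) => matches(cond, attr, threshold) }
+
+    // The example above: Age < 20 has weight 0 (skipped at threshold 1), Hair Color != light applies.
+    val rule = Seq(Condition(0, "<", Num(20)), Condition(1, "!=", Cat("light")))
+    val row1 = Seq(Num(16), Cat("dark"))
+    println(ruleCovers(rule, row1, threshold = 1.0))   // true: row 1 is in the "brown Eye Color" class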
+
+<a name="ClassDiscovery-Runningtheexample:"></a>
+# Running the example:
+NOTE: Substitute in the appropriate version for the Mahout JOB jar
+
+1. `cd <MAHOUT_HOME>/examples`
+1. `ant job`
+1. `<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc`
+1. `<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos`
+1. `<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10`
+
+CDGA needs 9 parameters:
+
+* param 1 : path of the directory that contains the dataset and its infos file
+* param 2 : target class
+* param 3 : threshold
+* param 4 : number of crossover points for the multi-point crossover
+* param 5 : mutation rate
+* param 6 : mutation range
+* param 7 : mutation precision
+* param 8 : population size
+* param 9 : number of generations before the program stops
+
+For more information about the 4th parameter, please see [Multi-point Crossover](http://www.geatbx.com/docu/algindex-03.html#P616_36571).
+For a detailed explanation of the 5th, 6th and 7th parameters, please see [Real Valued Mutation](http://www.geatbx.com/docu/algindex-04.html#P659_42386).
+
+*TODO*: Fill in where to find the output and what it means.
+
+# The info file
+
+To run properly, CDGA needs some information about the dataset. Each
+dataset should be accompanied by an .infos file that contains the needed
+information. For each attribute a corresponding line in the info file
+describes it; it can be one of the following:
+
+* IGNORED
+  if the attribute is ignored
+* LABEL, val1, val2,...
+  if the attribute is the label (class), followed by its possible values
+* CATEGORICAL, val1, val2,...
+  if the attribute is categorical (nominal), followed by its possible values
+* NUMERICAL, min, max
+  if the attribute is numerical, followed by its min and max values
+
+This file can be generated automatically using a special tool available with
+CDGA:
+
+* The tool searches for an existing infos file (*must be filled by the
+user*), in the same directory as the dataset with the same name and with
+the ".infos" extension, that contains the type of the attributes:
+  * 'N' numerical attribute
+  * 'C' categorical attribute
+  * 'L' label (this is also a categorical attribute)
+  * 'I' to ignore the attribute
+
+  Each attribute is on a separate line.
+* A Hadoop job is used to parse the dataset and collect the information.
+This means that *the dataset can be distributed over HDFS*.
+* The results are written back to the same .infos file, with the correct
+format needed by CDGA.
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/classifyingyourdata.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/classifyingyourdata.md b/website-old/docs/algorithms/map-reduce/classification/classifyingyourdata.md
new file mode 100644
index 0000000..53bc514
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/classifyingyourdata.md
@@ -0,0 +1,27 @@
+---
+layout: algorithm
+title: (Deprecated)  ClassifyingYourData
+theme:
+    name: retro-mahout
+---
+
+# Classifying data from the command line
+
+
+After you've done the [Quickstart](../basics/quickstart.html) and are familiar with the basics of Mahout, it is time to build a
+classifier from your own data. The following pieces *may* be useful in getting started:
+
+<a name="ClassifyingYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format: See [Creating Vectors](../basics/creating-vectors.html) as well as [Creating Vectors from Text](../basics/creating-vectors-from-text.html).
+
+<a name="ClassifyingYourData-RunningtheProcess"></a>
+# Running the Process
+
+* Logistic regression [background](logistic-regression.html)
+* [Naive Bayes background](naivebayes.html) and [commandline](bayesian-commandline.html) options.
+* [Complementary naive bayes background](complementary-naive-bayes.html), [design](https://issues.apache.org/jira/browse/mahout-60.html), and [c-bayes-commandline](c-bayes-commandline.html)
+* [Random Forests Classification](https://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests) comes with a [Breiman example](breiman-example.html). There is some really great documentation
+over at [Mark Needham's blog](http://www.markhneedham.com/blog/2012/10/27/kaggle-digit-recognizer-mahout-random-forest-attempt/). Also check out the description on [Xiaomeng Shawn Wan's](http://shawnwan.wordpress.com/2012/06/01/mahout-0-7-random-forest-examples/) blog.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/collocations.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/collocations.md b/website-old/docs/algorithms/map-reduce/classification/collocations.md
new file mode 100644
index 0000000..d4406f2
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/collocations.md
@@ -0,0 +1,385 @@
+---
+layout: algorithm
+title: (Deprecated)  Collocations
+theme:
+    name: retro-mahout
+---
+
+
+
+<a name="Collocations-CollocationsinMahout"></a>
+# Collocations in Mahout
+
+A collocation is defined as a sequence of words or terms which co-occur
+more often than would be expected by chance. Statistically relevant
+combinations of terms identify additional lexical units which can be
+treated as features in a vector-based representation of a text. A detailed
+discussion of collocations can be found on [Wikipedia](http://en.wikipedia.org/wiki/Collocation).
+
+See the [Reuters example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685) for a more detailed discussion of collocations in practice.
+
+<a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a>
+## Theory behind implementation: Log-Likelihood based Collocation Identification
+
+Mahout provides an implementation of a collocation identification algorithm
+which scores collocations using log-likelihood ratio. The log-likelihood
+score indicates the relative usefulness of a collocation with regards other
+term combinations in the text. Collocations with the highest scores in a
+particular corpus will generally be more useful as features.
+
+Calculating the LLR is very straightforward and is described concisely in
+[Ted Dunning's blog post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html).
+Ted describes the series of counts required to calculate the LLR for two
+events A and B in order to determine if they co-occur more often than pure
+chance. These counts include the number of times the events co-occur (k11),
+the number of times the events occur without each other (k12 and k21), and
+the number of times anything occurs. These counts are summarized in the
+following table:
+
+<table>
+<tr><td> </td><td> Event A </td><td> Everything but Event A </td></tr>
+<tr><td> Event B </td><td> A and B together (k11) </td><td>  B but not A (k12) </td></tr>
+<tr><td> Everything but Event B </td><td> A but not B (k21) </td><td> Neither B nor A (k22) </td></tr>
+</table>
+
+For the purposes of collocation identification, it is useful to begin by
+thinking in word pairs, bigrams. In this case the leading or head term from
+the pair corresponds to A from the table above, B corresponds to the
+trailing or tail term, while neither B nor A is the total number of word
+pairs in the corpus less those containing B, A or both B and A.
+
+Given the word pair of 'oscillation overthruster', the Log-Likelihood ratio
+is computed by looking at the number of occurrences of that word pair in the
+corpus, the number of word pairs that begin with 'oscillation' but end with
+something other than 'overthruster', the number of word pairs that end with
+'overthruster' but begin with something other than 'oscillation' and the number
+of word pairs in the corpus that contain neither 'oscillation' nor
+'overthruster'.
+
+This can be extended from bigrams to trigrams, 4-grams and beyond. In these
+cases, the current algorithm uses the first token of the ngram as the head
+of the ngram and the remaining n-1 tokens from the ngram, the n-1gram as it
+were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is
+treated as the head while 'kong cavaliers' is treated as the tail. Future
+versions of this algorithm will allow for variations in which tokens of the
+ngram are treated as the head and tail.
+
+Beyond ngrams, it is often useful to inspect cases where individual words
+occur around other interesting features of the text such as sentence
+boundaries.
+
+<a name="Collocations-GeneratingNGrams"></a>
+## Generating NGrams
+
+The tools that the collocation identification algorithm is embedded within
+either consume tokenized text as input or provide the ability to specify an
+implementation of the Lucene Analyzer class to perform tokenization in order
+to form ngrams. The tokens are passed through a Lucene ShingleFilter to
+produce NGrams of the desired length.
+
+Given the text "Alice was beginning to get very tired" as an example,
+Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get',
+'very' and 'tired', while the ShingleFilter with a max NGram size set to 3
+produces the shingles 'alice beginning', 'alice beginning get', 'beginning
+get', 'beginning get very', 'get very', 'get very tired' and 'very tired'.
+Note that both bigrams and trigrams are produced here. A future enhancement
+to the existing algorithm would involve limiting the output to a particular
+gram size as opposed to solely specifying a max ngram size.
+
+<a name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a>
+## Running the Collocation Identification Algorithm.
+
+There are a couple of ways to run the LLR-based collocation algorithm in
+Mahout:
+
+<a name="Collocations-Whencreatingvectorsfromasequencefile"></a>
+### When creating vectors from a sequence file
+
+The llr collocation identifier is integrated into the process that is used
+to create vectors from sequence files of text keys and values. Collocations
+are generated when the --maxNGramSize (-ng) option is not specified and
+defaults to 2 or is set to a number of 2 or greater. The --minLLR option
+can be used to control the cutoff that prevents collocations below the
+specified LLR score from being emitted, and the --minSupport argument can
+be used to filter out collocations that appear below a certain number of
+times. 
+
+
+    bin/mahout seq2sparse
+    
+    Usage:									    
+         [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
+          --output <output> --input <input> --minDF <minDF>
+          --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
+          --numReducers  <numReducers> --maxNGramSize <ngramSize> --overwrite --help		    
+          --sequentialAccessVector]
+    Options 								    
+
+      --minSupport (-s) minSupport	  (Optional) Minimum Support. Default Value: 2				    
+
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+
+      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000MB
+
+      --output (-o) output		 The output directory
+
+      --input (-i) input		   Input dir containing the documents in sequence file format
+
+      --minDF (-md) minDF		  The minimum document frequency. Default is 1
+
+      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF. Can be used to remove 
+                                          really high frequency terms. Expressed as an
+                                          integer between 0 and 100. Default is 99.     
+
+      --weight (-wt) weight 	      The kind of weight to use. Currently TF   
+    				      or TFIDF				    
+
+      --norm (-n) norm		      The norm to use, expressed as either a    
+    				      float or "INF" if you want to use the 
+                                      Infinite norm.  Must be greater or equal
+    				      to 0.  The default is not to normalize    
+
+      --minLLR (-ml) minLLR 	      (Optional)The minimum Log Likelihood  
+    				      Ratio(Float)  Default is 1.0
+	    
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.    
+    				      Default Value: 1			    
+
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to  
+    				      create (2 = bigrams, 3 = trigrams, etc)   
+    				      Default Value:2			 
+   
+      --overwrite (-w)		      If set, overwrite the output directory    
+      --help (-h)			      Print out help			    
+      --sequentialAccessVector (-seq)     (Optional) Whether output vectors should	
+    				      be SequentialAccessVectors If set true	
+    				      else false 
+
+
+<a name="Collocations-CollocDriver"></a>
+### CollocDriver
+
+
+    bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
+    
+    Usage:									    
+     [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite    
+    --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers>     
+    --analyzerName <analyzerName> --preprocess --unigram --help]
+
+    Options 								    
+
+      --input (-i) input		      The Path for input files. 	    
+
+      --output (-o) output		      The Path write output to		    
+
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                      create (2 = bigrams, 3 = trigrams, etc)
+                                      Default Value: 2
+
+      --overwrite (-w)                If set, overwrite the output directory
+
+      --minSupport (-s) minSupport	      (Optional) Minimum Support. Default   
+    				      Value: 2				    
+
+      --minLLR (-ml) minLLR 	      (Optional)The minimum Log Likelihood
+    				      Ratio(Float)  Default is 1.0	  
+  
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.    
+    				      Default Value: 1			    
+
+      --analyzerName (-a) analyzerName    The class name of the analyzer	    
+
+      --preprocess (-p)		      If set, input is SequenceFile<Text,Text>  
+    				      where the value is the document, which	
+    				      will be tokenized using the specified 
+    				      analyzer. 			
+    
+      --unigram (-u)                  If set, unigrams will be emitted in the
+                                      final output alongside collocations
+   
+      --help (-h)			      Print out help	      
+
+
+<a name="Collocations-Algorithmdetails"></a>
+## Algorithm details
+
+This section describes the implementation of the collocation identification
+algorithm in terms of the map-reduce phases that are used to generate
+ngrams and count the frequencies required to perform the log-likelihood
+calculation. Unless otherwise noted, classes that are indicated in
+CamelCase can be found in the mahout-utils module under the package
+org.apache.mahout.utils.nlp.collocations.llr
+
+The algorithm is implemented in two map-reduce passes:
+
+<a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a>
+### Pass 1: CollocDriver.generateCollocations(...)
+
+Generates NGrams and counts frequencies for ngrams, head and tail subgrams.
+
+<a name="Collocations-Map:CollocMapper"></a>
+#### Map: CollocMapper
+
+Input k: Text (documentId), v: StringTuple (tokens) 
+
+Each call to the mapper passes in the full set of tokens for the
+corresponding document using a StringTuple. The ShingleFilter is run across
+these tokens to produce ngrams of the desired length. Ngrams and
+frequencies are collected across the entire document.
+
+Once this is done, ngrams are split into head and tail portions. A key of type GramKey is generated which is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string n-gram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key will either be EMPTY in the case where we are collecting either the head or tail of an ngram as the value, or it will contain the byte[]
+form of the ngram when collecting an ngram as the value.
+
+
+    head_key(EMPTY) -> (head subgram, head frequency)
+
+    head_key(ngram) -> (ngram, ngram frequency) 
+
+    tail_key(EMPTY) -> (tail subgram, tail frequency)
+
+    tail_key(ngram) -> (ngram, ngram frequency)
+
+
+subgram and ngram values are packaged in Gram objects.
+
+For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When
+the pass is complete, this counter will hold the total number of ngrams
+encountered in the input which is used as a part of the LLR calculation.
+
+Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with
+frequency)
+
+<a name="Collocations-Combiner:CollocCombiner"></a>
+#### Combiner: CollocCombiner
+
+Input k: GramKey, v:Gram (as above)
+
+This phase merges the counts for unique ngrams or ngram fragments across
+multiple documents. The combiner treats the entire GramKey as the key and
+as such, identical tuples from separate documents are passed into a single
+call to the combiner's reduce method, their frequencies are summed and a
+single tuple is passed out via the collector.
+
+Output k: GramKey, v:Gram
+
+<a name="Collocations-Reduce:CollocReducer"></a>
+#### Reduce: CollocReducer
+
+Input k: GramKey, v: Gram (as above)
+
+The CollocReducer employs the Hadoop secondary sort strategy to avoid
+caching ngram tuples in memory in order to calculate total ngram and
+subgram frequencies. The GramKeyPartitioner ensures that tuples with the
+same primary key are sent to the same reducer while the
+GramKeyGroupComparator ensures that the iterator provided by the reduce method
+first returns the subgram and then returns ngram values grouped by ngram.
+This eliminates the need to cache the values returned by the iterator in
+order to calculate total frequencies for both subgrams and ngrams. The
+input will consist of multiple frequencies for each (subgram_key, subgram)
+or (subgram_key, ngram) tuple; one from each map task executed in which the
+particular subgram was found.
+The input will be traversed in the following order:
+
+
+    (head subgram, frequency 1)
+    (head subgram, frequency 2)
+    ... 
+    (head subgram, frequency N)
+    (ngram 1, frequency 1)
+    (ngram 1, frequency 2)
+    ...
+    (ngram 1, frequency N)
+    (ngram 2, frequency 1)
+    (ngram 2, frequency 2)
+    ...
+    (ngram 2, frequency N)
+    ...
+    (ngram N, frequency 1)
+    (ngram N, frequency 2)
+    ...
+    (ngram N, frequency N)
+
+
+Where all of the ngrams above share the same head. Data is presented in the
+same manner for the tail subgrams.
+
+As the values for a subgram or ngram are traversed, frequencies are
+accumulated. Once all values for a subgram or ngram are processed the
+resulting key/value pairs are passed to the collector as long as the ngram
+frequency is equal to or greater than the specified minSupport. When an
+ngram is skipped in this way, the Skipped.LESS_THAN_MIN_SUPPORT counter is
+incremented.
+
+Pairs are passed to the collector in the following format:
+
+
+    ngram, ngram frequency -> subgram subgram frequency
+
+
+In this manner, the output becomes an unsorted version of the following:
+
+
+    ngram 1, frequency -> ngram 1 head, head frequency
+    ngram 1, frequency -> ngram 1 tail, tail frequency
+    ngram 2, frequency -> ngram 2 head, head frequency
+    ngram 2, frequency -> ngram 2 tail, tail frequency
+    ngram N, frequency -> ngram N head, head frequency
+    ngram N, frequency -> ngram N tail, tail frequency
+
+
+Output is in the format k:Gram (ngram, frequency), v:Gram (subgram,
+frequency)
+
+<a name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a>
+### Pass 2: CollocDriver.computeNGramsPruneByLLR(...)
+
+Pass 1 has calculated full frequencies for ngrams and subgrams; Pass 2
+performs the LLR calculation.
+
+<a name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a>
+#### Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)
+
+This phase is a no-op. The data is passed through unchanged. The rest of
+the work for llr calculation is done in the reduce phase.
+
+<a name="Collocations-ReducePhase:LLRReducer"></a>
+#### Reduce Phase: LLRReducer
+
+Input is k:Gram, v:Gram (as above)
+
+This phase receives the head and tail subgrams and their frequencies for
+each ngram (with frequency) produced for the input:
+
+
+    ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency
+    ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency
+    ...
+    ngram N, frequency -> ngram N head, frequency; ngram N tail, frequency
+
+
+It also reads the full ngram count obtained from the first pass, passed in
+as a configuration option. The parameters to the LLR calculation are
+calculated as follows:
+
+    k11 = f_n
+    k12 = f_h - f_n
+    k21 = f_t - f_n
+    k22 = N - ((f_h + f_t) - f_n)
+
+Where f_n is the ngram frequency, f_h and f_t the frequency of head and
+tail and N is the total number of ngrams.
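+
+As a small, self-contained illustration (made-up frequencies, computed in-process rather than in the reducer), these counts can be plugged into the log-likelihood utility that ships with mahout-math:
+
+    import org.apache.mahout.math.stats.LogLikelihood
+
+    // Hypothetical pass-1 frequencies for one ngram, its head subgram and its tail subgram.
+    val fN = 80L             // ngram frequency
+    val fH = 1000L           // head subgram frequency
+    val fT = 500L            // tail subgram frequency
+    val total = 1000000L     // Count.NGRAM_TOTAL from pass 1
+
+    val k11 = fN
+    val k12 = fH - fN
+    val k21 = fT - fN
+    val k22 = total - ((fH + fT) - fN)
+
+    // Ngrams whose score falls below the configured minLLR are dropped by the LLRReducer.
+    val llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)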
+
+Ngrams with an LLR below the specified minimum LLR are dropped and
+the Skipped.LESS_THAN_MIN_LLR counter is incremented.
+
+Output is k: Text (ngram), v: DoubleWritable (llr score)
+
+<a name="Collocations-Unigrampass-through."></a>
+### Unigram pass-through.
+
+By default in seq2sparse, or if the -u option is provided to the
+CollocDriver, unigrams (single tokens) will be passed through the job and
+each token's frequency will be calculated. As with ngrams, unigrams are
+subject to filtering with minSupport and minLLR.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/gaussian-discriminative-analysis.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/gaussian-discriminative-analysis.md b/website-old/docs/algorithms/map-reduce/classification/gaussian-discriminative-analysis.md
new file mode 100644
index 0000000..d310145
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/gaussian-discriminative-analysis.md
@@ -0,0 +1,20 @@
+---
+layout: algorithm
+title: (Deprecated)  Gaussian Discriminative Analysis
+theme:
+    name: retro-mahout
+---
+
+<a name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a>
+# Gaussian Discriminative Analysis
+
+Gaussian Discriminative Analysis is a tool for multigroup classification
+based on extending linear discriminant analysis. The paper on the approach
+is located at http://citeseer.ist.psu.edu/4617.html (note: for some reason
+the paper is backwards, in that page 1 is at the end).
+
+<a name="GaussianDiscriminativeAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="GaussianDiscriminativeAnalysis-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/hidden-markov-models.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/hidden-markov-models.md b/website-old/docs/algorithms/map-reduce/classification/hidden-markov-models.md
new file mode 100644
index 0000000..3b11f12
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/hidden-markov-models.md
@@ -0,0 +1,102 @@
+---
+layout: algorithm
+title: (Deprecated)  Hidden Markov Models
+theme:
+    name: retro-mahout
+---
+
+# Hidden Markov Models
+
+<a name="HiddenMarkovModels-IntroductionandUsage"></a>
+## Introduction and Usage
+
+Hidden Markov Models are used in multiple areas of Machine Learning, such
+as speech recognition, handwritten letter recognition or natural language
+processing. 
+
+<a name="HiddenMarkovModels-FormalDefinition"></a>
+## Formal Definition
+
+A Hidden Markov Model (HMM) is a statistical model of a process consisting
+of two (in our case discrete) random variables O and Y, which change their
+state sequentially. The variable Y with states \{y_1, ... , y_n\} is called
+the "hidden variable", since its state is not directly observable. The
+state of Y changes sequentially with a so called - in our case first-order
+- Markov Property. This means that the state change probability of Y only
+depends on its current state and does not change in time. Formally we
+write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)).
+The variable O with states \{o_1, ... , o_m\} is called the "observable
+variable", since its state can be directly observed. O does not have a
+Markov Property, but its state probability depends statically on the
+current state of Y.
+
+Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing initial hidden state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A[i,j]
+=P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional "emission matrix"
+containing the observation probabilities such that B[i,j]=
+P(O=o_i|Y=y_j).
+
+<a name="HiddenMarkovModels-Problems"></a>
+## Problems
+
+Rabiner [1] defined three main problems for HMM models:
+
+1. Evaluation: Given a sequence O of observations and a model M, what is
+the probability P(O|M) that sequence O was generated by model M. The
+Evaluation problem can be efficiently solved using the Forward algorithm (a small sketch follows this list).
+2. Decoding: Given a sequence O of observations and a model M, what is
+the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to
+generate this sequence. The Decoding problem can be efficiently solved
+using the Viterbi algorithm.
+3. Learning: Given a sequence O of observations, what is the most likely
+model M*=argmax(M)P(O|M) to generate this sequence. The Learning problem
+can be efficiently solved using the Baum-Welch algorithm.
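+
+To make the Evaluation problem concrete, here is a minimal, self-contained sketch of the Forward algorithm in plain Scala. The model and observation sequence are made up for illustration, and Mahout's own HMM classes are not used:
+
+    // Model M = (n, m, P, A, B) following the definition above:
+    // A(i)(j) = P(Y(t)=y_i | Y(t-1)=y_j), B(o)(j) = P(O=o | Y=y_j).
+    val n = 2                              // number of hidden states
+    val P = Array(0.6, 0.4)                // initial hidden state probabilities
+    val A = Array(Array(0.7, 0.4),
+                  Array(0.3, 0.6))         // transition matrix, columns sum to 1
+    val B = Array(Array(0.5, 0.1),
+                  Array(0.4, 0.3),
+                  Array(0.1, 0.6))         // emission matrix, m = 3 observable states
+    val obs = Array(0, 2, 1)               // an observed sequence
+
+    // alpha(i) = P(O(1..t), Y(t)=y_i | M), updated left to right over the observations.
+    var alpha = Array.tabulate(n)(i => P(i) * B(obs(0))(i))
+    for (t <- 1 until obs.length) {
+        alpha = Array.tabulate(n) { i =>
+            B(obs(t))(i) * (0 until n).map(j => A(i)(j) * alpha(j)).sum
+        }
+    }
+    val pObsGivenModel = alpha.sum         // P(O | M), the Evaluation result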
+
+<a name="HiddenMarkovModels-Example"></a>
+## Example
+
+To build a Hidden Markov Model and use it to build some predictions, try a simple example like this:
+
+Create an input file to train the model.  Here we have a sequence drawn from the set of states 0, 1, 2, and 3, separated by space characters.
+
+    $ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input
+
+Now run the baumwelch job to train your model, after first setting MAHOUT_LOCAL to true, to use your local file system.
+
+    $ export MAHOUT_LOCAL=true
+    $ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000
+
+Output like the following should appear in the console.
+
+    Initial probabilities: 
+    0 1 2 
+    1.0 0.0 3.5659361683006626E-251 
+    Transition matrix:
+      0 1 2 
+    0 6.098919959130616E-5 0.9997275322964165 2.1147850399214744E-4 
+    1 7.404648706054873E-37 0.9086408633885092 0.09135913661149081 
+    2 0.2284374545687356 7.01786289571088E-11 0.7715625453610858 
+    Emission matrix: 
+      0 1 2 3 
+    0 0.9999997858591223 2.0536163836449762E-39 2.1414087769942127E-7 1.052441093535389E-27 
+    1 7.495656581383351E-34 0.2241269055449904 0.4510889999455847 0.32478409450942497 
+    2 0.815051477991782 0.18494852200821799 8.465660634827592E-33 2.8603899591778015E-36 
+    14/03/22 09:52:21 INFO driver.MahoutDriver: Program took 180 ms (Minutes: 0.003)
+
+The model trained on the input sequence is now stored in the file 'hmm-model', which we can use to generate a predicted sequence.
+
+    $ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10
+
+To see the predictions:
+
+    $ cat hmm-predictions 
+    0 1 3 3 2 2 2 2 1 2
+
+
+<a name="HiddenMarkovModels-Resources"></a>
+## Resources
+
+\[1\]
+ Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models
+and selected applications in speech recognition". Proceedings of the IEEE
+77 (2): 257-286. doi:10.1109/5.18626.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/independent-component-analysis.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/independent-component-analysis.md b/website-old/docs/algorithms/map-reduce/classification/independent-component-analysis.md
new file mode 100644
index 0000000..9216816
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/independent-component-analysis.md
@@ -0,0 +1,17 @@
+---
+layout: algorithm
+title: (Deprecated)  Independent Component Analysis
+theme:
+    name: retro-mahout
+---
+
+<a name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a>
+# Independent Component Analysis
+
+See also: Principal Component Analysis.
+
+<a name="IndependentComponentAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="IndependentComponentAnalysis-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/locally-weighted-linear-regression.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/locally-weighted-linear-regression.md b/website-old/docs/algorithms/map-reduce/classification/locally-weighted-linear-regression.md
new file mode 100644
index 0000000..fecfc62
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/locally-weighted-linear-regression.md
@@ -0,0 +1,25 @@
+---
+layout: algorithm
+title: (Deprecated)  Locally Weighted Linear Regression
+theme:
+    name: retro-mahout
+---
+
+<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a>
+# Locally Weighted Linear Regression
+
+Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians,
+use the data to build a parameterized model. After training, the model is
+used for predictions and the data are generally discarded. In contrast,
+"memory-based" methods are non-parametric approaches that explicitly retain
+the training data, and use them each time a prediction needs to be made.
+Locally weighted regression (LWR) is a memory-based method that performs a
+regression around a point of interest using only training data that are
+"local" to that point. Source:
+http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html
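+
+As a rough illustration of the idea (not Mahout code), the sketch below fits a
+weighted least-squares line around a single query point in one dimension, using a
+Gaussian kernel to weight the "local" training data; the class name and the
+bandwidth parameter `tau` are hypothetical.
+
+    // 1-D locally weighted linear regression: predict y at `query`.
+    public class LwrSketch {
+      static double predict(double[] x, double[] y, double query, double tau) {
+        double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
+        for (int i = 0; i < x.length; i++) {
+          double d = x[i] - query;
+          double w = Math.exp(-d * d / (2 * tau * tau));  // kernel weight: nearby points count more
+          sw += w;
+          swx += w * x[i];
+          swy += w * y[i];
+          swxx += w * x[i] * x[i];
+          swxy += w * x[i] * y[i];
+        }
+        // closed-form solution of the weighted 2x2 normal equations for y = b0 + b1 * x
+        double denom = sw * swxx - swx * swx;
+        double b1 = (sw * swxy - swx * swy) / denom;
+        double b0 = (swy - b1 * swx) / sw;
+        return b0 + b1 * query;
+      }
+    }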
+
+<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a>
+## Strategy for parallel regression
+
+<a name="LocallyWeightedLinearRegression-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/logistic-regression.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/logistic-regression.md b/website-old/docs/algorithms/map-reduce/classification/logistic-regression.md
new file mode 100644
index 0000000..762a391
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/logistic-regression.md
@@ -0,0 +1,129 @@
+---
+layout: algorithm
+title: (Deprecated)  Logistic Regression
+theme:
+    name: retro-mahout
+---
+
+<a name="LogisticRegression-LogisticRegression(SGD)"></a>
+# Logistic Regression (SGD)
+
+Logistic regression is a model used for predicting the probability of
+occurrence of an event. It makes use of several predictor variables that
+may be either numerical or categorical.
+
+Logistic regression is the standard industry workhorse that underlies many
+production fraud detection, advertising quality and targeting products.
+The Mahout implementation uses Stochastic Gradient Descent (SGD), which
+allows very large training sets to be used.
+
+For a more detailed analysis of the approach, have a look at the [thesis of
+Paul Komarek](http://repository.cmu.edu/cgi/viewcontent.cgi?article=1221&context=robotics) [1].
+
+See MAHOUT-228 for the main JIRA issue for SGD.
+
+A more detailed overview of the Mahout Logistic Regression classifier and a [detailed description of building a Logistic Regression classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/) for the classic [Iris flower dataset](http://en.wikipedia.org/wiki/Iris_flower_data_set) are also available [2]. 
+
+An example of training a Logistic Regression classifier for the [UCI Bank Marketing Dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing) can be found [on the Mahout website](http://mahout.apache.org/users/classification/bankmarketing-example.html) [3].
+
+An example of training and testing a Logistic Regression document classifier for the classic [20 newsgroups corpus](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) [4] is also available. 
+
+<a name="LogisticRegression-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+The bad news is that SGD is an inherently sequential algorithm.  The good
+news is that it is blazingly fast and thus it is not a problem for Mahout's
+implementation to handle training sets of tens of millions of examples. 
+With the down-sampling typical in many data-sets, this is equivalent to a
+dataset with billions of raw training examples.
+
+The SGD system in Mahout is an online learning algorithm which means that
+you can learn models in an incremental fashion and that you can do
+performance testing as your system runs.  Often this means that you can
+stop training when a model reaches a target level of performance.  The SGD
+framework includes classes to do on-line evaluation using cross validation
+(the CrossFoldLearner) and an evolutionary system to do learning
+hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). 
+The AdaptiveLogisticRegression system makes heavy use of threads to
+increase machine utilization.  The way it works is that it runs 20
+CrossFoldLearners in separate threads, each with slightly different
+learning parameters.  As better settings are found, these new settings are
+propagated to the other learners.
+
+<a name="LogisticRegression-Designofpackages"></a>
+## Design of packages
+
+There are three packages that are used in Mahout's SGD system.	These
+include
+
+* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)
+
+* The SGD learning package (found in org.apache.mahout.classifier.sgd)
+
+* The evolutionary optimization system (found in org.apache.mahout.ep)
+
+<a name="LogisticRegression-Featurevectorencoding"></a>
+## Feature vector encoding
+
+Because the SGD algorithms need to have fixed length feature vectors and
+because it is a pain to build a dictionary ahead of time, most SGD
+applications use the hashed feature vector encoding system that is rooted
+at FeatureVectorEncoder.
+
+The basic idea is that you create a vector, typically a
+RandomAccessSparseVector, and then you use various feature encoders to
+progressively add features to that vector.  The size of the vector should
+be large enough to avoid feature collisions as features are hashed.
+
+There are specialized encoders for a variety of data types.  You can
+normally encode either a string representation of the value you want to
+encode or you can encode a byte level representation to avoid string
+conversion.  In the case of ContinuousValueEncoder and
+ConstantValueEncoder, it is also possible to encode a null value and pass
+the real value in as a weight.	This avoids numerical parsing entirely in
+case you are getting your training data from a system like Avro.
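+
+As a rough sketch of how these encoders are typically used (the exact class and
+method signatures below are quoted from memory and should be verified against the
+org.apache.mahout.vectorizer.encoders Javadoc), a categorical field and a numeric
+field can be hashed into a single fixed-size vector like this:
+
+    import org.apache.mahout.math.RandomAccessSparseVector;
+    import org.apache.mahout.math.Vector;
+    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
+    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
+
+    public class EncoderSketch {
+      public static void main(String[] args) {
+        // a vector large enough to keep hash collisions rare
+        Vector v = new RandomAccessSparseVector(1000);
+
+        // one encoder per logical field; the field name is mixed into the hash
+        StaticWordValueEncoder wordEncoder = new StaticWordValueEncoder("color");
+        ConstantValueEncoder numberEncoder = new ConstantValueEncoder("price");
+
+        wordEncoder.addToVector("red", v);                   // categorical value as a string
+        numberEncoder.addToVector((String) null, 19.99, v);  // null form, real value passed as the weight
+
+        System.out.println(v);
+      }
+    }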
+
+Here is a class diagram for the encoders package:
+
+![class diagram](../../images/vector-class-hierarchy.png)
+
+<a name="LogisticRegression-SGDLearning"></a>
+## SGD Learning
+
+For the simplest applications, you can construct an
+OnlineLogisticRegression and be off and running.  Typically, though, it is
+nice to have running estimates of performance on held out data.  To do
+that, you should use a CrossFoldLearner which keeps a stable of five (by
+default) OnlineLogisticRegression objects.  Each time you pass a training
+example to a CrossFoldLearner, it passes this example to all but one of its
+children as training and passes the example to the last child to evaluate
+current performance.  The children are used for evaluation in a round-robin
+fashion so, if you are using the default 5 way split, all of the children
+get 80% of the training data for training and get 20% of the data for
+evaluation.
+
+To avoid the pesky need to configure learning rates, regularization
+parameters and annealing schedules, you can use the
+AdaptiveLogisticRegression.  This class maintains a pool of
+CrossFoldLearners and adapts learning rates and regularization on the fly
+so that you don't have to.
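+
+A minimal training loop with OnlineLogisticRegression might look like the sketch
+below. The fluent setters and method names are quoted from memory and should be
+checked against the org.apache.mahout.classifier.sgd Javadoc; the toy data is of
+course hypothetical.
+
+    import org.apache.mahout.classifier.sgd.L1;
+    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+
+    public class SgdSketch {
+      public static void main(String[] args) {
+        // binary classifier over 2 features with an L1 prior
+        OnlineLogisticRegression learner =
+            new OnlineLogisticRegression(2, 2, new L1()).lambda(1e-4).learningRate(0.1);
+
+        // toy training data: label is 1 iff the first feature exceeds the second
+        double[][] xs = {{1, 0}, {0, 1}, {2, 1}, {1, 2}};
+        int[] ys = {1, 0, 1, 0};
+        for (int pass = 0; pass < 100; pass++) {
+          for (int i = 0; i < xs.length; i++) {
+            learner.train(ys[i], new DenseVector(xs[i]));  // one example at a time (online SGD)
+          }
+        }
+
+        Vector query = new DenseVector(new double[] {3, 1});
+        System.out.println(learner.classifyScalar(query)); // estimated probability of class 1
+      }
+    }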
+
+Here is a class diagram for the classifiers.sgd package.  As you can see,
+the number of twiddlable knobs is pretty large.  For some examples, see the
+[TrainNewsGroups](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainNewsGroups.java) example code.
+
+![sgd class diagram](../../images/sgd-class-hierarchy.png)
+
+## References
+
+[1] [Thesis of
+Paul Komarek](http://repository.cmu.edu/cgi/viewcontent.cgi?article=1221&context=robotics)
+
+[2] [An Introduction To Mahout's Logistic Regression SGD Classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/)
+
+## Examples
+
+[3] [SGD Bank Marketing Example](http://mahout.apache.org/users/classification/bankmarketing-example.html)
+
+[4] [SGD 20 newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/mahout-collections.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/mahout-collections.md b/website-old/docs/algorithms/map-reduce/classification/mahout-collections.md
new file mode 100644
index 0000000..bed87a6
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/mahout-collections.md
@@ -0,0 +1,60 @@
+---
+layout: algorithm
+title: (Deprecated)  mahout-collections
+theme:
+    name: retro-mahout
+---
+
+# Mahout collections
+
+<a name="mahout-collections-Introduction"></a>
+## Introduction
+
+The Mahout Collections library is a set of container classes that address
+some limitations of the standard collections in Java. [This presentation](http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf)
+ describes a number of performance problems with the standard collections. 
+
+Mahout collections addresses two of the more glaring problems: the lack of
+support for primitive types and the lack of open hashing.
+
+<a name="mahout-collections-PrimitiveTypes"></a>
+## Primitive Types
+
+The most visible feature of Mahout Collections is the large collection of
+primitive type collections. Given Java's asymmetrical support for the
+primitive types, the only efficient way to handle them is with many
+classes. So, there are ArrayList-like containers for all of the primitive
+types, and hash maps for all the useful combinations of primitive type and
+object keys and values.
+
+These classes do not, in general, implement interfaces from *java.util*.
+Even when the *java.util* interfaces could be type-compatible, they tend
+to include requirements that are not consistent with efficient use of
+primitive types.
+
+<a name="mahout-collections-OpenAddressing"></a>
+## Open Addressing
+
+All of the sets and maps in Mahout Collections are open-addressed hash
+tables. Open addressing has a much smaller memory footprint than chaining.
+Since the purpose of these collections is to avoid the memory cost of
+autoboxing, open addressing is a consistent design choice.
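+
+To make the difference to chaining concrete, here is a toy open-addressed
+int-to-double map using linear probing. It is purely illustrative (fixed capacity,
+no resizing or removal) and is not how Mahout's own hash maps are coded.
+
+    // Every entry lives directly in the key/value arrays; collisions probe
+    // forward instead of allocating chain nodes, so no per-entry objects exist.
+    public class OpenIntDoubleSketch {
+      private static final int FREE = Integer.MIN_VALUE;  // sentinel for an empty slot
+      private final int[] keys;
+      private final double[] values;
+
+      OpenIntDoubleSketch(int capacity) {
+        keys = new int[capacity];
+        values = new double[capacity];
+        java.util.Arrays.fill(keys, FREE);
+      }
+
+      void put(int key, double value) {
+        int slot = Math.floorMod(key, keys.length);
+        while (keys[slot] != FREE && keys[slot] != key) {
+          slot = (slot + 1) % keys.length;               // linear probing
+        }
+        keys[slot] = key;
+        values[slot] = value;
+      }
+
+      double get(int key, double missing) {
+        int slot = Math.floorMod(key, keys.length);
+        while (keys[slot] != FREE) {
+          if (keys[slot] == key) {
+            return values[slot];
+          }
+          slot = (slot + 1) % keys.length;
+        }
+        return missing;
+      }
+    }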
+
+<a name="mahout-collections-Sets"></a>
+## Sets
+
+Mahout Collections includes open hash sets. Unlike *java.util*, a set is
+not a recycled hash table; the sets are separately implemented and do not
+have any additional storage usage for unused keys.
+
+<a name="mahout-collections-CreditwhereCreditisdue"></a>
+## Credit where Credit is due
+
+The implementation of Mahout Collections is derived from [Cern Colt](http://acs.lbl.gov/~hoschek/colt/).
+
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/mlp.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/mlp.md b/website-old/docs/algorithms/map-reduce/classification/mlp.md
new file mode 100644
index 0000000..85e47de
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/mlp.md
@@ -0,0 +1,172 @@
+---
+layout: algorithm
+title: (Deprecated)  Multilayer Perceptron
+theme:
+    name: retro-mahout
+---
+
+Multilayer Perceptron
+=====================
+
+A multilayer perceptron is a biologically inspired feed-forward network that can 
+be trained to represent a nonlinear mapping between input and output data. It 
+consists of multiple layers, each containing multiple artificial neuron units and
+can be used for classification and regression tasks in a supervised learning approach. 
+
+Command line usage
+------------------
+
+The MLP implementation is currently located in the MapReduce-Legacy package. It
+can be used with the following commands: 
+
+
+    # model training
+    $ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron
+    # model usage
+    $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron
+
+
+To train and use the model, a number of parameters can be specified. Parameters without default values have to be specified by the user. Note that not all parameters can be used for both training and running the model. We give an example of the usage below.
+
+### Parameters
+
+| Command | Default | Description | Type |
+|:---------|---------:|:-------------|:---------|
+| --input -i | | Path to the input data (currently, only .csv-files are allowed) | |
+| --skipHeader -sh | false | Skip first row of the input file (corresponds to the csv headers)| |
+|--update -u | false | Whether the model should be updated incrementally with every new training instance. If this parameter is not given, the model is trained from scratch. | training |
+| --labels -labels | | Instance labels separated by whitespaces. | training |
+| --model -mo | | Location where the model will be stored / is stored (if the specified location has an existing model, it will update the model through incremental learning). | |
+| --layerSize -ls | | Number of units per layer, including input, hidden and output layers. This parameter specifies the topology of the network (see [this image][mlp] for an example specified by `-ls 4 8 3`). | training |
+| --squashingFunction -sf| Sigmoid | The squashing function to use for the units. Currently only the sigmoid function is available. | training |
+| --learningRate -l | 0.5 | The learning rate that is used for weight updates. | training |
+| --momemtumWeight -m | 0.1 | The momentum weight that is used for gradient descent. Must be in the range between 0 ... 1.0 | training |
+| --regularizationWeight -r | 0 | Regularization value for the weight vector. Must be in the range between 0 ... 0.1 | training |
+| --format -f | csv | Input file format. Currently only csv is supported. | |
+|--columnRange -cr | | Range of the columns to use from the input file, starting with 0 (i.e. `-cr 0 5` for including the first six columns only) | testing |
+| --output -o | | Path to store the labeled results from running the model. | testing |
+
+Example usage
+-------------
+
+In this example, we will train a multilayer perceptron for classification on the iris data set. The iris flower data set contains data of three flower species where each datapoint consists of four features.
+The dimensions of the data set are given through some flower parameters (sepal length, sepal width, ...). All samples contain a label that indicates the flower species they belong to.
+
+### Training
+
+To train our multilayer perceptron model from the command line, we call the following command
+
+
+    $ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \
+                -i ./mrlegacy/src/test/resources/iris.csv -sh \
+                -labels setosa versicolor virginica \
+                -mo /tmp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001
+
+
+The individual parameters are explained in the following.
+
+- `-i ./mrlegacy/src/test/resources/iris.csv` use the iris data set as input data
+- `-sh` since the file `iris.csv` contains a header row, this row needs to be skipped 
+- `-labels setosa versicolor virginica` we specify which class labels should be learned (the flower species in this case)
+- `-mo /tmp/model.model` specify where to store the model file
+- `-ls 4 8 3` we specify the structure and depth of our layers. The actual network structure can be seen in the figure below.
+- `-l 0.2` we set the learning rate to `0.2`
+- `-m 0.35` the momentum weight is set to `0.35`
+- `-r 0.0001` regularization weight is set to `0.0001`
+ 
+|  |  |
+|---|---|
+| The picture shows the architecture defined by the above command. The topology of the network is completely defined through the number of layers and units, because in this implementation of the MLP every unit is fully connected to the units of the next and previous layers. Bias units are added automatically. | ![Multilayer perceptron network][mlp] |
+
+[mlp]: mlperceptron_structure.png "Architecture of a three-layer MLP"
+### Testing
+
+To test / run the multilayer perceptron classification on the trained model, we can use the following command
+
+
+    $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \
+                -i ./mrlegacy/src/test/resources/iris.csv -sh -cr 0 3 \
+                -mo /tmp/model.model -o /tmp/labelResult.txt
+                
+
+The individual parameters are explained in the following.
+
+- `-i ./mrlegacy/src/test/resources/iris.csv` use the iris data set as input data
+- `-sh` since the file `iris.csv` contains a header row, this row needs to be skipped
+- `-cr 0 3` we specify the column range of the input file
+- `-mo /tmp/model.model` specify where the model file is stored
+- `-o /tmp/labelResult.txt` specify where the labeled output file will be stored
+
+Implementation 
+--------------
+
+The Multilayer Perceptron implementation is based on a more general Neural Network class. Command line support was added later on and provides a simple way of using the MLP, as shown in the example. It is implemented to run on a single machine using stochastic gradient descent, where the weights are updated using one datapoint at a time, resulting in a weight update of the form:
+$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - n \nabla E_n(\vec{w}^{(t)}) $$
+
+where *n* is the learning rate and *E_n* is the error computed on the current training instance. More advanced training methods, such as adaptive learning rates, are not yet available.
+
+The number of layers and units per layer can be specified manually and determines the whole topology with each unit being fully connected to the previous layer. A bias unit is automatically added to the input of every layer. 
+Currently, the logistic sigmoid is used as a squashing function in every hidden and output layer. It is of the form:
+
+$$ \frac{1}{1 + \exp(-a)} $$
+
+The command line version **does not perform iterations**, which leads to bad results on small datasets. Another restriction is that the CLI version of the MLP only supports classification, since the labels have to be given explicitly when executing on the command line. 
+
+A learned model can be stored and updated with new training instances using the `--update` flag. The classification output is saved as a .txt-file and consists only of the assigned labels. Apart from the command-line interface, it is possible to construct and train more specialized neural networks using the API and interfaces in the mrlegacy package. 
+
+
+Theoretical Background
+-------------------------
+
+The *multilayer perceptron* was inspired by the biological structure of the brain where multiple neurons are connected and form columns and layers. Perceptual input enters this network through our sensory organs and is then further processed into higher levels. 
+The term multilayer perceptron is a little misleading since the *perceptron* is a special case of a single *artificial neuron* that can be used for simple computations [\[1\]][1]. The difference is that the perceptron uses a discontinuous nonlinearity, while the MLP neurons implemented in Mahout must use continuous nonlinearities. This is necessary for the implemented learning algorithm, where the error is propagated back from the output layer to the input layer and the weights of the connections are changed according to their contribution to the overall error. This algorithm is called backpropagation and uses gradient descent to update the weights. To compute the gradients we need continuous nonlinearities. But let's start from the beginning!
+
+The first layer of the MLP represents the input and has no other purpose than routing the input to every connected unit in a feed-forward fashion. The following layers are called hidden layers, and the last layer serves the special purpose of determining the output. The activation of a unit in a hidden layer is computed through a weighted sum of all its inputs, resulting in 
+$$ a_j = \sum_{i=1}^{D} w_{ji}^{(l)} x_i + w_{j0}^{(l)} $$
+This computes the activation *a* for neuron *j*, where *w* is the weight from neuron *i* to neuron *j* in layer *l*. The last part, where *i = 0*, is called the bias and can be used as an offset, independent of the input.
+
+The activation is then transformed by the aforementioned differentiable, nonlinear *activation function* and serves as the input to the next layer. The activation function is usually chosen from the family of sigmoidal functions such as *tanh* or *logistic sigmoidal* [\[2\]][2]. Often sigmoidal and logistic sigmoidal are used synonymously. Another word for the activation function is *squashing function* since the s-shape of this function class *squashes* the input.
+
+For different units or layers, different activation functions can be used to obtain different behaviors. Especially in the output layer, the activation function can be chosen to obtain the output value *y*, depending on the learning problem:
+$$ y_k = \sigma (a_k) $$
+
+If the learning problem is a linear regression task, sigma can be chosen to be the identity function. In case of classification problems, the choice of the squashing functions depends on the exact task at hand and often softmax activation functions are used. 
+
+The equation for a MLP with three layers (one input, one hidden and one output) is then given by
+
+$$ y_k(\vec{x}, \vec{w}) = h \left( \sum_{j=1}^{M} w_{kj}^{(2)} h \left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) $$ 
+
+where *h* indicates the respective squashing function that is used in the units of a layer. *M* and *D* specify the number of incoming connections to a unit, and we can see that the input to the first layer (hidden layer) is just the original input *x*, whereas the input to the second layer (output layer) is the transformed output of layer one. The output *y* of unit *k* is therefore given by the above equation and depends on the input *x* and the weight vector *w*. This shows us that the parameter we can optimize during learning is *w*, since we cannot do anything about the input *x*. To facilitate the following steps, we can include the bias terms in the weight vector and correct for the indices by adding another dimension with the value 1 to the input vector. The bias is a constant that is added to the weighted sum and serves as an offset for the nonlinear transformation. Including it in the weight vector leads to:
+
+$$ y_k(\vec{x}, \vec{w}) = h \left( \sum_{j=0}^{M} w_{kj}^{(2)} h \left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right) $$ 
+
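+The following sketch makes the forward pass of this three-layer equation concrete,
+with the bias folded into the weight matrices as an extra input fixed to 1. It is a
+hypothetical, self-contained illustration and not the Mahout implementation.
+
+    // Forward pass of a three-layer MLP with logistic sigmoid units.
+    public class MlpForwardSketch {
+      static double sigmoid(double a) {
+        return 1.0 / (1.0 + Math.exp(-a));              // squashing function h
+      }
+
+      // w1: hidden x (inputs + 1), w2: outputs x (hidden + 1)
+      static double[] forward(double[][] w1, double[][] w2, double[] x) {
+        double[] hidden = layer(w1, prependBias(x));
+        return layer(w2, prependBias(hidden));
+      }
+
+      static double[] prependBias(double[] v) {
+        double[] withBias = new double[v.length + 1];
+        withBias[0] = 1.0;                              // x_0 = 1 carries the bias weight
+        System.arraycopy(v, 0, withBias, 1, v.length);
+        return withBias;
+      }
+
+      static double[] layer(double[][] w, double[] in) {
+        double[] out = new double[w.length];
+        for (int j = 0; j < w.length; j++) {
+          double a = 0.0;                               // a_j = sum_i w_ji * in_i
+          for (int i = 0; i < in.length; i++) {
+            a += w[j][i] * in[i];
+          }
+          out[j] = sigmoid(a);                          // y_j = h(a_j)
+        }
+        return out;
+      }
+    }
+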
+The previous paragraphs described how the MLP transforms a given input into some output using a combination of different nonlinear functions. Of course what we really want is to learn the structure of our data so that we can feed data with unknown labels into the network and get estimated labels back. To achieve this, we have to train our network. In this context, training means optimizing some function such that the error between the real target labels *t* and the network output *y* becomes smallest. We have seen in the previous paragraph that our only knob to change is the weight vector *w*, making the function to be optimized a function of *w*. For simplicity, and because it is widely used, we choose the so-called *sum-of-squares* error function as an example, which is given by
+
+$$ E(\vec{w}) = \frac{1}{2} \sum_{n=1}^N \left( y(\vec{x}_n, \vec{w}) - t_n \right)^2 $$
+
+The goal is to minimize this function and thereby increase the performance of our model. A common method to achieve this is to use gradient descent together with the so-called technique of *backpropagation*, where the goal is to compute the contribution of every unit to the overall error and to change each weight according to this contribution, stepping against the direction of the gradient of the error function at this particular unit. In the following we try to give a short overview of model training with gradient descent and backpropagation. A more detailed treatment can be found in [\[3\]][3], from which much of this information is taken.
+
+The problem with minimizing the error function is that the error can only be computed at the output layer, where we observe *t*, but we want to update all the weights of all the units. Therefore we use the technique of backpropagation to propagate the error, which we first compute at the output layer, back to the units of the previous layers. For this approach we also need to compute the gradients of the activation function. 
+
+Weights are then updated with a small step in the direction of the negative gradient, regulated by the learning rate *n* such that we arrive at the formula for weight update:
+
+$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - n \nabla E(\vec{w}^{(t)}) $$
+
+A momentum weight can be set as a parameter of the gradient descent method to increase the probability of finding better local or global optima of the error function.
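+
+To make the update rule concrete, the following hypothetical sketch performs a single
+stochastic gradient step for one sigmoid output unit under the sum-of-squares error
+(no momentum term); it is illustrative only and not the Mahout implementation.
+
+    // One SGD step: w <- w - n * grad E_n(w) for a single sigmoid output unit.
+    public class SgdStepSketch {
+      public static void main(String[] args) {
+        double[] w = {0.1, -0.2, 0.05};  // weights, w[0] is the bias weight
+        double[] x = {1.0, 0.7, -1.3};   // input with x[0] = 1 for the bias
+        double t = 1.0;                  // target label
+        double n = 0.5;                  // learning rate
+
+        double a = 0.0;
+        for (int i = 0; i < w.length; i++) {
+          a += w[i] * x[i];              // activation a = w . x
+        }
+        double y = 1.0 / (1.0 + Math.exp(-a));     // network output
+
+        // dE/dw_i = (y - t) * y * (1 - y) * x_i for E = 1/2 * (y - t)^2
+        double delta = (y - t) * y * (1.0 - y);
+        for (int i = 0; i < w.length; i++) {
+          w[i] -= n * delta * x[i];      // step against the gradient
+        }
+        System.out.println(java.util.Arrays.toString(w));
+      }
+    }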
+
+
+
+
+
+[1]: http://en.wikipedia.org/wiki/Perceptron "The perceptron in wikipedia"
+[2]: http://en.wikipedia.org/wiki/Sigmoid_function "Sigmoid function on wikipedia"
+[3]: http://research.microsoft.com/en-us/um/people/cmbishop/prml/ "Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer 2009"
+
+References
+----------
+
+\[1\] http://en.wikipedia.org/wiki/Perceptron
+
+\[2\] http://en.wikipedia.org/wiki/Sigmoid_function
+
+\[3\] [Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer 2009](http://research.microsoft.com/en-us/um/people/cmbishop/prml/)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/naivebayes.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/naivebayes.md b/website-old/docs/algorithms/map-reduce/classification/naivebayes.md
new file mode 100644
index 0000000..bbe1a2b
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/naivebayes.md
@@ -0,0 +1,45 @@
+---
+layout: algorithm
+title: (Deprecated)  NaiveBayes
+theme:
+    name: retro-mahout
+---
+
+<a name="NaiveBayes-NaiveBayes"></a>
+# Naive Bayes
+
+Naive Bayes is an algorithm that can be used to classify objects into
+(usually binary) categories. It is one of the most common learning
+algorithms in spam filters. Despite its simplicity and rather naive
+assumptions it has proven to work surprisingly well in practice.
+
+Before applying the algorithm, the objects to be classified need to be
+represented by numerical features. In the case of e-mail spam, each feature
+might indicate whether some specific word is present or absent in the mail
+to classify. The algorithm comes in two phases: learning and application.
+During learning, a set of feature vectors is given to the algorithm, each
+vector labeled with the class of the object it represents. From that it is
+deduced which combinations of features appear with high probability in spam
+messages. Given this information, during application one can easily compute
+the probability that a new message is spam.
+
+The algorithm makes several assumptions that do not hold for most
+datasets, but make computation easier. Probably the worst is that all
+features of an object are considered independent. In practice, this means
+that having already found the phrase "Statue of Liberty" in a text does not
+influence the probability of seeing the phrase "New York" as well.
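+
+The two phases can be illustrated with a toy Bernoulli Naive Bayes over binary
+feature vectors, using Laplace smoothing. This is a hypothetical, self-contained
+sketch and not Mahout's distributed implementation.
+
+    // train() estimates class priors and per-feature probabilities;
+    // classify() applies Bayes' rule under the independence assumption.
+    public class NaiveBayesSketch {
+      double[] logPrior = new double[2];     // log P(class)
+      double[][] logP = new double[2][];     // log P(feature present | class)
+      double[][] logNotP = new double[2][];  // log P(feature absent  | class)
+
+      void train(int[][] x, int[] y) {
+        int features = x[0].length;
+        int[] classCount = new int[2];
+        int[][] featureCount = new int[2][features];
+        for (int i = 0; i < x.length; i++) {
+          classCount[y[i]]++;
+          for (int f = 0; f < features; f++) {
+            featureCount[y[i]][f] += x[i][f];
+          }
+        }
+        for (int c = 0; c < 2; c++) {
+          logPrior[c] = Math.log((double) classCount[c] / x.length);
+          logP[c] = new double[features];
+          logNotP[c] = new double[features];
+          for (int f = 0; f < features; f++) {
+            double p = (featureCount[c][f] + 1.0) / (classCount[c] + 2.0);  // Laplace smoothing
+            logP[c][f] = Math.log(p);
+            logNotP[c][f] = Math.log(1.0 - p);
+          }
+        }
+      }
+
+      int classify(int[] x) {
+        double best = Double.NEGATIVE_INFINITY;
+        int bestClass = -1;
+        for (int c = 0; c < 2; c++) {
+          double score = logPrior[c];
+          for (int f = 0; f < x.length; f++) {
+            score += (x[f] == 1) ? logP[c][f] : logNotP[c][f];  // features assumed independent
+          }
+          if (score > best) { best = score; bestClass = c; }
+        }
+        return bestClass;
+      }
+    }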
+
+<a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a>
+## Strategy for a parallel Naive Bayes
+
+See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9).
+
+
+<a name="NaiveBayes-Examples"></a>
+## Examples
+
+[20Newsgroups](20newsgroups.html) - example code showing how to train and use the
+Naive Bayes classifier using the 20 Newsgroups data available at
+[http://people.csail.mit.edu/jrennie/20Newsgroups/](http://people.csail.mit.edu/jrennie/20Newsgroups/)

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/classification/neural-network.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/classification/neural-network.md b/website-old/docs/algorithms/map-reduce/classification/neural-network.md
new file mode 100644
index 0000000..0cf09bd
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/classification/neural-network.md
@@ -0,0 +1,22 @@
+---
+layout: algorithm
+title: (Deprecated)  Neural Network
+theme:
+    name: retro-mahout
+---
+
+<a name="NeuralNetwork-NeuralNetworks"></a>
+# Neural Networks
+
+Neural Networks are a means for classifying multi-dimensional objects. We
+concentrate on implementing backpropagation networks with one hidden layer,
+as these networks have been covered by the [2006 NIPS map reduce paper](http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf).
+Those networks are capable of learning not only linear separating
+hyperplanes but arbitrary decision boundaries.
+
+<a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a>
+## Strategy for parallel backpropagation network
+
+
+<a name="NeuralNetwork-Designofimplementation"></a>
+## Design of implementation

