# mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [05/13] mahout git commit: WEBSITE Porting Old Website
Date Sun, 30 Apr 2017 03:24:09 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/sparkbindings/faq.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/sparkbindings/faq.md b/website/old_site_migration/completed/sparkbindings/faq.md
new file mode 100644
index 0000000..9649e3b
--- /dev/null
+++ b/website/old_site_migration/completed/sparkbindings/faq.md
@@ -0,0 +1,52 @@
+---
+layout: default
+title: FAQ
+theme:
+    name: retro-mahout
+---
+
+# FAQ for using Mahout with Spark
+
+**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various classpath problems.**
+
+**A:** As of this writing, all reported problems starting the Spark shell in Mahout have come down to
+classpath issues in one way or another.
+
+If you are getting errors that look like method signature mismatches, you most probably have a mismatch between Mahout's Spark dependency
+and the Spark version actually installed. (At the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml.)
+
+Troubleshooting general classpath issues is pretty straightforward. Since Mahout is using Spark's installation
+and its classpath as reported by Spark itself for Spark-related dependencies, it is important to make sure
+the classpath is sane and is made available to Mahout:
+
+1. Check Spark is of correct version (same as in Mahout's poms), is compiled and SPARK_HOME is set.
+2. Check Mahout is compiled and MAHOUT_HOME is set.
+3. Run $SPARK_HOME/bin/compute-classpath.sh and make sure it produces a sane result with no errors.
+If it outputs something other than a straightforward classpath string, most likely Spark is not compiled/set up correctly (later Spark versions require
+sbt/sbt assembly to be run; simply running sbt/sbt publish-local is no longer enough).
+4. Run $MAHOUT_HOME/bin/mahout -spark classpath and check that the path reported in step (3) is included.
+
+**Q: I am using the command line Mahout jobs that run on Spark or am writing my own application that uses
+Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or signature errors during serialization.
+What's wrong?**
+
+**A:** The Spark artifacts in the maven ecosystem may not match the exact binary you are running on your cluster. This may
+cause class name or version mismatches. In this case you may wish
+to build Spark yourself to guarantee that you are running exactly what you are building Mahout against. To do this follow these steps
+in order:
+
+1. Build Spark with maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead.
+Something like: "mvn clean install -Dhadoop1.2.1" or whatever your particular build options are. This will put the jars for Spark
+in the local maven cache.
+2. Deploy **your** Spark build to your cluster and test it there.
+3. Build Mahout. This will cause maven to pull the jars for Spark from the local maven cache and may resolve missing
+or mis-identified classes.
+4. If you are building your own code, do so against the local builds of Spark and Mahout.
+
+**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.**
+
+**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 'd' stands for distributed.
+
+
+
+

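The classpath checks in steps 3 and 4 of the FAQ above can be sketched as a small POSIX shell helper. This is an illustration only and not part of the commit: the function name `missing_entries` is hypothetical, and the sketch assumes that both commands print nothing but a single colon-separated classpath string.

```shell
# Hypothetical helper (not part of Mahout): print every entry of the first
# colon-separated classpath that does not appear in the second one.
# The body runs in a subshell so the IFS and globbing changes stay local.
missing_entries() (
  spark_cp=$1
  mahout_cp=$2
  set -f      # do not glob-expand entries such as lib/*
  IFS=:       # split the Spark classpath on colons
  for entry in $spark_cp; do
    [ -n "$entry" ] || continue
    case ":$mahout_cp:" in
      *":$entry:"*) ;;                 # entry is visible to Mahout
      *) printf '%s\n' "$entry" ;;     # entry is missing: report it
    esac
  done
)
```

With Spark and Mahout built, it could be called as `missing_entries "$("$SPARK_HOME/bin/compute-classpath.sh")" "$("$MAHOUT_HOME/bin/mahout" -spark classpath)"`. Empty output would mean that every entry reported by Spark is also on the classpath Mahout reports.
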
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/sparkbindings/home.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/sparkbindings/home.md b/website/old_site_migration/completed/sparkbindings/home.md
new file mode 100644
index 0000000..5075612
--- /dev/null
+++ b/website/old_site_migration/completed/sparkbindings/home.md
@@ -0,0 +1,101 @@
+---
+layout: default
+title: Spark Bindings
+theme:
+    name: retro-mahout
+---
+
+# Scala & Spark Bindings:
+*Bringing algebraic semantics*
+
+## What is Scala & Spark Bindings?
+
+In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this (actual formula from **(d)spca**):
+
+
+$\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}$
+
+bound to in-core and distributed computations (currently, on Apache Spark).
+
+
+Mahout Scala & Spark Bindings expression of the above:
+
+        val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
+
+The main idea is that a scientist writing algebraic expressions need not care about distributed
+operation plans and works **entirely on the logical level**, just like he or she would do with R.
+
+Another idea is decoupling logical expression from distributed back-end. As more back-ends are added,
+this implies **"write once, run everywhere"**.
+
+The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed
+Row Matrices (DRMs).
+
+The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%,
+colSums, nrow, length operating over vectors or matrices.
+
+An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole
+and figures out how it can be simplified, and which physical operators should be picked. For example,
+there are currently about 5 different physical operators performing DRM-DRM multiplication,
+picked based on matrix geometry, distributed dataset partitioning, orientation etc.
+If we count in DRM by in-core combinations, that would be another 4, i.e. 9 total -- all of it for just
+the simple x %*% y logical notation.
+
+
+
+Please refer to the documentation for details.
+
+## Status
+
+This environment addresses mostly R-like Linear Algebra optimizations for
+
+
+## Documentation
+
+* Scala and Spark bindings manual: [web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), [pdf](ScalaSparkBindings.pdf)
+* Overview blog on 0.10.x releases: [blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
+
+## Distributed methods and solvers using Bindings
+
+* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- see the bindings manual
+* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- see the bindings manual
+* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the bindings manual
+* [Current list of algorithms](https://mahout.apache.org/users/basics/algorithms.html)
+
+[ssvd]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
+[spca]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
+[dssvd]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
+[dspca]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
+[dqrThin]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
+
+
+## Related history of note
+
+* CLI and Driver for Spark version of item similarity -- [MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541)
+* Command line interface for generalizable Spark pipelines -- [MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569)
+* Cooccurrence Analysis / Item-based Recommendation -- [MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464)
+* Spark Bindings -- [MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346)
+* Scala Bindings -- [MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297)
+* Interactive Scala & Spark Bindings Shell & Script processor -- [MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489)
+* OLS tutorial using Mahout shell -- [MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542)
+* Full abstraction of DRM apis and algorithms from a distributed engine -- [MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529)
+* Port Naive Bayes -- [MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)
+
+## Work in progress
+* Text-delimited files for input and output -- [MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568)
+<!-- * Weighted (Implicit Feedback) ALS -- [MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) -->
+<!--* Data frame R-like bindings -- [MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) -->
+
+
+<!-- ## Stuff wanted:
+* Data frame R-like bindings (similarly to linalg bindings)
+* Stat R-like bindings (perhaps we can just adapt to commons.math stat)
+* **BYODMs:** Bring Your Own Distributed Method on SparkBindings!
+* In-core GPU matrix adapters -->
+
+
+
+
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/sparkbindings/play-with-shell.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/sparkbindings/play-with-shell.md b/website/old_site_migration/completed/sparkbindings/play-with-shell.md
new file mode 100644
index 0000000..3cdb8f7
--- /dev/null
+++ b/website/old_site_migration/completed/sparkbindings/play-with-shell.md
@@ -0,0 +1,199 @@
+---
+layout: default
+title: Playing with Mahout's Spark Shell
+theme:
+    name: retro-mahout
+---
+# Playing with Mahout's Spark Shell
+
+This tutorial will show you how to play with Mahout's Scala DSL for linear algebra and its Spark shell. **Please keep in mind that this code is still in a very early experimental stage**.
+
+_(Edited for 0.10.2)_
+
+## Intro
+
+We'll use an excerpt of a publicly available [dataset about cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset tells the protein, fat, carbohydrate and sugars (in milligrams) contained in a set of cereals, as well as a customer rating for the cereals. Our aim for this example is to fit a linear model which infers the customer rating from the ingredients.
+
+
+Name                    | protein | fat | carbo | sugars | rating
+:-----------------------|:--------|:----|:------|:-------|:---------
+Apple Cinnamon Cheerios | 2       | 2   | 10.5  | 10     | 29.509541
+Cap'n'Crunch            | 1       | 2   | 12    | 12     | 18.042851
+Cocoa Puffs             | 1       | 1   | 12    | 13     | 22.736446
+Froot Loops             | 2       | 1   | 11    | 13     | 32.207582
+Honey Graham Ohs        | 1       | 2   | 12    | 11     | 21.871292
+Wheaties Honey Gold     | 2       | 1   | 16    |  8     | 36.187559
+Cheerios                | 6       | 2   | 17    |  1     | 50.764999
+Clusters                | 3       | 2   | 13    |  7     | 40.400208
+Great Grains Pecan      | 3       | 3   | 13    |  4     | 45.811716
+
+
+## Installing Mahout & Spark on your local machine
+
+We describe how to do a quick toy setup of Spark & Mahout on your local machine, so that you can run this example and play with the shell.
+
+ 1. Change to the directory where you unpacked Spark and type sbt/sbt assembly to build it.
+ 1. Create a directory for Mahout somewhere on your machine, change to it and check out the master branch of Apache Mahout from GitHub: git clone https://github.com/apache/mahout mahout
+ 1. Change to the mahout directory and build Mahout using mvn -DskipTests clean install
+
+## Starting Mahout's Spark shell
+
+ 1. Go to the directory where you unpacked Spark and type sbin/start-all.sh to locally start Spark.
+ 1. Open a browser and point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with **spark://**).
+ 1. Define the following environment variables: <pre class="codehilite">export MAHOUT_HOME=[directory into which you checked out Mahout]
+export SPARK_HOME=[directory where you unpacked Spark]
+export MASTER=[url of the Spark master]
+</pre>
+ 1. Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell.
+You should see the shell starting and get the prompt mahout> . Check the
+[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting.
+
+## Implementation
+
+We'll use the shell to interactively play with the data and incrementally implement a simple [linear regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout unless we processed a large dataset stored in a distributed filesystem. But for the sake of this example, we'll use our tiny toy dataset and "pretend" it was too big to fit onto a single machine.
+
+*Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.*
+
+Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix (DRM)* which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use dense() to create a dense in-memory matrix from our toy dataset and use drmParallelize to load it into the cluster, "mimicking" a large, partitioned dataset.
+
+<div class="codehilite"><pre>
+val drmData = drmParallelize(dense(
+  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
+  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
+  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
+  (2, 1, 11,   13, 32.207582),  // Froot Loops
+  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
+  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
+  (6, 2, 17,   1,  50.764999),  // Cheerios
+  (3, 2, 13,   7,  40.400208),  // Clusters
+  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
+  numPartitions = 2);
+</pre></div>
+
+Have a look at this matrix. The first four columns represent the ingredients
+(our features) and the last column (the rating) is the target variable for
+our regression. [Linear regression](https://en.wikipedia.org/wiki/Linear_regression)
+assumes that the **target variable** $$\mathbf{y}$$ is generated by the
+linear combination of **the feature matrix** $$\mathbf{X}$$ with the
+**parameter vector** $$\boldsymbol{\beta}$$ plus the
+ **noise** $$\boldsymbol{\varepsilon}$$, summarized in the formula
+$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}$$.
+Our goal is to find an estimate of the parameter vector
+$$\boldsymbol{\beta}$$ that explains the data very well.
+
+As a first step, we extract $$\mathbf{X}$$ and $$\mathbf{y}$$ from our data matrix. We get *X* by slicing: we take all rows (denoted by ::) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet; it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
+
+<div class="codehilite"><pre>
+val drmX = drmData(::, 0 until 4)
+</pre></div>
+
+Next, we extract the target variable vector *y*, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using collect:
+
+<div class="codehilite"><pre>
+val y = drmData.collect(::, 4)
+</pre></div>
+
+Now we are ready to think about a mathematical way to estimate the parameter vector *β*. A simple textbook approach is [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes the sum of residual squares between the true target variable and the prediction of the target variable. In OLS, there is even a closed form expression for estimating $$\boldsymbol{\beta}$$ as
+$$\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$.
+
+The first thing we compute for this is $$\mathbf{X}^{\top}\mathbf{X}$$. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula. The operation .t() transposes a matrix and, analogous to R, %*% denotes matrix multiplication.
+
+<div class="codehilite"><pre>
+val drmXtX = drmX.t %*% drmX
+</pre></div>
+
+The same is true for computing $$\mathbf{X}^{\top}\mathbf{y}$$. We can simply type the math as Scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is a DRM again.
+<div class="codehilite"><pre>
+val drmXty = drmX.t %*% y
+</pre></div>
+
+We're nearly done. The next step we take is to fetch $$\mathbf{X}^{\top}\mathbf{X}$$ and
+$$\mathbf{X}^{\top}\mathbf{y}$$ into the memory of our driver machine (we are targeting
+feature matrices that are tall and skinny,
+so we can assume that $$\mathbf{X}^{\top}\mathbf{X}$$ is small enough
+to fit in). Then, we provide them to an in-memory solver (Mahout provides
+an analog to R's solve() for that) which computes beta, our
+OLS estimate of the parameter vector $$\boldsymbol{\beta}$$.
+
+<div class="codehilite"><pre>
+val XtX = drmXtX.collect
+val Xty = drmXty.collect(::, 0)
+
+val beta = solve(XtX, Xty)
+</pre></div>
+
+That's it! We have implemented a distributed linear regression algorithm
+on Apache Spark. I hope you agree that we didn't have to worry a lot about
+parallelization and distributed systems. The goal of Mahout's linear algebra
+DSL is to abstract away the ugliness of programming a distributed system
+as much as possible, while still retaining decent performance and
+scalability.
+
+We can now check how well our model fits its training data.
+First, we multiply the feature matrix $$\mathbf{X}$$ by our estimate of
+$$\boldsymbol{\beta}$$. Then, we look at the difference (via L2-norm) of
+the target variable $$\mathbf{y}$$ to the fitted target variable:
+
+<div class="codehilite"><pre>
+val yFitted = (drmX %*% beta).collect(::, 0)
+(y - yFitted).norm(2)
+</pre></div>
+
+We hope we have shown that Mahout's shell allows people to interactively and incrementally write algorithms. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. The definition of functions follows standard Scala syntax.
+
+We put all the commands for ordinary least squares into a function ols.
+
+<div class="codehilite"><pre>
+def ols(drmX: DrmLike[Int], y: Vector) =
+  solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0)
+
+</pre></div>
+
+Note that the DSL applies an implicit collect if coercion rules require an in-core argument. Hence, we can simply
+skip explicit collects.
+
+Next, we define a function goodnessOfFit that tells how well a model fits the target variable:
+
+<div class="codehilite"><pre>
+def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
+  val fittedY = (drmX %*% beta).collect(::, 0)
+  (y - fittedY).norm(2)
+}
+</pre></div>
+
+So far we have left out an important aspect of a standard linear regression
+model. Usually there is a constant bias term added to the model. Without
+that, our model always crosses through the origin and we only learn the
+right angle. An easy way to add such a bias term to our model is to add a
+column of ones to the feature matrix $$\mathbf{X}$$.
+The corresponding weight in the parameter vector will then be the bias term.
+
+Here is how we add a bias column:
+
+<div class="codehilite"><pre>
+val drmXwithBiasColumn = drmX cbind 1
+</pre></div>
+
+Now we can give the newly created DRM drmXwithBiasColumn to our model fitting method ols and see how well the resulting model fits the training data with goodnessOfFit. You should see a large improvement in the result.
+
+<div class="codehilite"><pre>
+val betaWithBiasTerm = ols(drmXwithBiasColumn, y)
+goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
+</pre></div>
+
+As a further optimization, we can make use of the DSL's caching functionality. We use drmXwithBiasColumn repeatedly as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling checkpoint(). In the end, we remove it from the cache with uncache:
+
+<div class="codehilite"><pre>
+val cachedDrmX = drmXwithBiasColumn.checkpoint()
+
+val betaWithBiasTerm = ols(cachedDrmX, y)
+val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y)
+
+cachedDrmX.uncache()
+
+goodness
+</pre></div>
+
+
+Liked what you saw? Check out Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html).
\ No newline at end of file
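
The closed-form estimate that the tutorial above computes follows from a short derivation. It is sketched here in the tutorial's own symbols as an aside that is not part of the commit, and it assumes that $$\mathbf{X}^{\top}\mathbf{X}$$ is invertible. OLS minimizes the squared residual norm

$$\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_{2}^{2}.$$

Setting its gradient with respect to $$\boldsymbol{\beta}$$ to zero gives the normal equations, and solving them gives the estimate:

$$-2\,\mathbf{X}^{\top}\left(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right)=\mathbf{0}\;\Longrightarrow\;\mathbf{X}^{\top}\mathbf{X}\,\boldsymbol{\beta}=\mathbf{X}^{\top}\mathbf{y}\;\Longrightarrow\;\hat{\boldsymbol{\beta}}=\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}.$$

In the tutorial's code, drmXtX and drmXty are the left-hand matrix and the right-hand vector of the normal equations, and solve(XtX, Xty) returns the $$\boldsymbol{\beta}$$ that satisfies them.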

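The environment setup in the "Starting Mahout's Spark shell" steps above can be guarded with a small check before launching the shell. This is a minimal sketch that is not part of the commit; the helper name `require_env` is hypothetical, and only the variable names MAHOUT_HOME, SPARK_HOME and MASTER come from the tutorial.

```shell
# Hypothetical helper (not part of Mahout): succeed only if every named
# environment variable is set to a non-empty value.
require_env() {
  for name in "$@"; do
    # Indirectly read the variable called $name; empty if it is unset.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

A possible use is `require_env MAHOUT_HOME SPARK_HOME MASTER && "$MAHOUT_HOME/bin/mahout" spark-shell`, which starts the shell only when all three variables are defined.
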
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/twenty-newsgroups.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/twenty-newsgroups.md b/website/old_site_migration/completed/twenty-newsgroups.md
new file mode 100644
index 0000000..472aaf6
--- /dev/null
+++ b/website/old_site_migration/completed/twenty-newsgroups.md
@@ -0,0 +1,179 @@
+---
+layout: default
+title: Twenty Newsgroups
+theme:
+    name: retro-mahout
+---
+
+
+<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
+## Twenty Newsgroups Classification Example
+
+<a name="TwentyNewsgroups-Introduction"></a>
+## Introduction
+
+The 20 newsgroups dataset is a collection of approximately 20,000
+newsgroup documents, partitioned (nearly) evenly across 20 different
+newsgroups. The 20 newsgroups collection has become a popular data set for
+experiments in text applications of machine learning techniques, such as
+text classification and text clustering. We will use the [Mahout CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html)
+classifier to create a model that would classify a new document into one of
+the 20 newsgroups.
+
+<a name="TwentyNewsgroups-Prerequisites"></a>
+### Prerequisites
+
+* Maven is available
+* Your environment has the following variables:
+     * **MAHOUT_HOME**: environment variable that refers to where Mahout lives
+
+<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
+### Instructions for running the example
+
+1. If running Hadoop in cluster mode, start the hadoop daemons by executing the following commands:
+
+            $ cd $HADOOP_HOME/bin
+            $ ./start-all.sh
+
+    Otherwise:
+
+            $ export MAHOUT_LOCAL=true
+
+2. In the trunk directory of Mahout, compile and install Mahout:
+
+            $ cd $MAHOUT_HOME
+            $ mvn -DskipTests clean install
+
+3. Run the [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) by executing:
+
+            $ ./examples/bin/classify-20newsgroups.sh
+
+4. You will be prompted to select a classification method algorithm:
+
+            1. Complement Naive Bayes
+            2. Naive Bayes
+
+Select 1 and the script will perform the following:
+
+1. Create a working directory for the dataset and all input/output.
+2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile.
+4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
+5. Split the preprocessed dataset into training and testing sets.
+6. Train the classifier.
+7. Test the classifier.
+
+
+Output should look something like:
+
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+     a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t <--Classified as
+    381 0  0  0  0  9  1  0  0  0  1  0  0  2  0  1  0  0  3  0 |398 a=rec.motorcycles
+     1 284 0  0  0  0  1  0  6  3  11 0  66 3  0  6  0  4  9  0 |395 b=comp.windows.x
+     2  0 339 2  0  3  5  1  0  0  0  0  1  1  12 1  7  0  2  0 |376 c=talk.politics.mideast
+     4  0  1 327 0  2  2  0  0  2  1  1  0  5  1  4  12 0  2  0 |364 d=talk.politics.guns
+     7  0  4  32 27 7  7  2  0  12 0  0  6  0 100 9  7  31 0  0 |251 e=talk.religion.misc
+     10 0  0  0  0 359 2  2  0  0  3  0  1  6  0  1  0  0  11 0 |396 f=rec.autos
+     0  0  0  0  0  1 383 9  1  0  0  0  0  0  0  0  0  3  0  0 |397 g=rec.sport.baseball
+     1  0  0  0  0  0  9 382 0  0  0  0  1  1  1  0  2  0  2  0 |399 h=rec.sport.hockey
+     2  0  0  0  0  4  3  0 330 4  4  0  5  12 0  0  2  0  12 7 |385 i=comp.sys.mac.hardware
+     0  3  0  0  0  0  1  0  0 368 0  0  10 4  1  3  2  0  2  0 |394 j=sci.space
+     0  0  0  0  0  3  1  0  27 2 291 0  11 25 0  0  1  0  13 18|392 k=comp.sys.ibm.pc.hardware
+     8  0  1 109 0  6  11 4  1  18 0  98 1  3  11 10 27 1  1  0 |310 l=talk.politics.misc
+     0  11 0  0  0  3  6  0  10 6  11 0 299 13 0  2  13 0  7  8 |389 m=comp.graphics
+     6  0  1  0  0  4  2  0  5  2  12 0  8 321 0  4  14 0  8  6 |393 n=sci.electronics
+     2  0  0  0  0  0  4  1  0  3  1  0  3  1 372 6  0  2  1  2 |398 o=soc.religion.christian
+     4  0  0  1  0  2  3  3  0  4  2  0  7  12 6 342 1  0  9  0 |396 p=sci.med
+     0  1  0  1  0  1  4  0  3  0  1  0  8  4  0  2 369 0  1  1 |396 q=sci.crypt
+     10 0  4  10 1  5  6  2  2  6  2  0  2  1 86 15 14 152 0  1 |319 r=alt.atheism
+     4  0  0  0  0  9  1  1  8  1  12 0  3  0  2  0  0  0 341 2 |390 s=misc.forsale
+     8  5  0  0  0  1  6  0  8  5  50 0  40 2  1  0  9  0  3 256|394 t=comp.os.ms-windows.misc
+    =======================================================
+    Statistics
+    -------------------------------------------------------
+    Kappa                                       0.8808
+    Accuracy                                   90.8596%
+    Reliability                                86.3632%
+    Reliability (standard deviation)            0.2131
+
+
+
+
+
+<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
+## End to end commands to build a CBayes model for 20 newsgroups
+The [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script:
+
+*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your **$PATH***
+
+1. Create a working directory for the dataset and all input/output.
+
+            $ export WORK_DIR=/tmp/mahout-work-${USER}
+            $ mkdir -p ${WORK_DIR}
+
+2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
+
+            $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
+                -o ${WORK_DIR}/20news-bydate.tar.gz
+            $ mkdir -p ${WORK_DIR}/20news-bydate
+            $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
+            $ mkdir ${WORK_DIR}/20news-all
+            $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
+     * If you're running on a Hadoop cluster:
+
+            $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
+
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile.
+
+            $ mahout seqdirectory \
+                -i ${WORK_DIR}/20news-all \
+                -o ${WORK_DIR}/20news-seq \
+                -ow
+
+4. Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
+
+            $ mahout seq2sparse \
+                -i ${WORK_DIR}/20news-seq \
+                -o ${WORK_DIR}/20news-vectors \
+                -lnorm \
+                -nv \
+                -wt tfidf
+
+If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization. See the [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) page for a list of all seq2sparse options.
+
+5. Split the preprocessed dataset into training and testing sets.
+
+            $ mahout split \
+                -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
+                --trainingOutput ${WORK_DIR}/20news-train-vectors \
+                --testOutput ${WORK_DIR}/20news-test-vectors \
+                --randomSelectionPct 40 \
+                --overwrite --sequenceFiles -xm sequential
+
+6. Train the classifier.
+
+            $ mahout trainnb \
+                -i ${WORK_DIR}/20news-train-vectors \
+                -el \
+                -o ${WORK_DIR}/model \
+                -li ${WORK_DIR}/labelindex \
+                -ow \
+                -c
+
+7. Test the classifier.
+
+            $ mahout testnb \
+                -i ${WORK_DIR}/20news-test-vectors \
+                -m ${WORK_DIR}/model \
+                -l ${WORK_DIR}/labelindex \
+                -ow \
+                -o ${WORK_DIR}/20news-testing \
+                -c
+
+
+
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/wikipedia-classifier-example.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/wikipedia-classifier-example.md b/website/old_site_migration/completed/wikipedia-classifier-example.md
new file mode 100644
index 0000000..9df07da
--- /dev/null
+++ b/website/old_site_migration/completed/wikipedia-classifier-example.md
@@ -0,0 +1,57 @@
+---
+layout: default
+title: Wikipedia XML parser and Naive Bayes Example
+theme:
+    name: retro-mahout
+---
+# Wikipedia XML parser and Naive Bayes Classifier Example
+
+## Introduction
+Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/) (the entire database, if desired). After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.
+
+You can run this script to build and test a Naive Bayes classifier for option (1) 10 arbitrary countries or option (2) 2 countries (United States and United Kingdom).
+
+## Overview
+
+To run the example, simply execute the $MAHOUT_HOME/examples/bin/classify-wikipedia.sh script.
+
+By default the script is set to run on a medium-sized Wikipedia XML dump. To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option 3).*
+
+The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups Classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4]. The only difference is that instead of running $ mahout seqdirectory on the unzipped 20 Newsgroups file, you'll run $ mahout seqwiki on the unzipped Wikipedia XML dump.
+
+    $ mahout seqwiki
+
+The above command launches WikipediaToSequenceFile.java, which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file. This process will seek to extract documents with a Wikipedia category tag which matches a line in the category file (exactly, if the -exactMatchOnly option is set). If no match is found and the -all option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a <Text,Text> sequence file of the form (K:/category/document_title , V: document).
+
+There are 3 different example category files available in the /examples/src/test/resources
+directory:  country.txt, country10.txt and country2.txt.  You can edit these categories to extract a different corpus from the Wikipedia dataset.
+
+The CLI options for seqwiki are as follows:
+
+    --input          (-i)         input pathname String
+    --output         (-o)         the output pathname String
+    --categories     (-c)         the file containing the Wikipedia categories
+    --exactMatchOnly (-e)         if set, then the Wikipedia category must match
+                                    exactly instead of simply containing the category string
+    --all            (-all)       if set select all categories
+    --removeLabels   (-rl)        if set, remove [[Category:labels]] from document text after extracting label.
+
+
+After seqwiki, the script runs seq2sparse, split, trainnb and testnb as in the [step by step 20newsgroups example](http://mahout.apache.org/users/classification/twenty-newsgroups.html).  When all of the jobs have finished, a confusion matrix will be displayed.
+
+# Resources
+
+[1] [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
+
+[2] [Document classification script for the Mahout Spark Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+
+[3] [Example category file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
+
+[4] [Step by step instructions for building a Naive Bayes classifier for 20newsgroups from the command line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
+
+[5] [Mahout MapReduce Naive Bayes](http://mahout.apache.org/users/classification/bayesian.html)
+
+[6] [Mahout Spark Naive Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
+
+[7] [Mahout Scala Spark and H2O Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
+
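
The category matching that the seqwiki step describes can be sketched in a few lines of Python. This is only an illustration of the rule stated in the text (exact match versus substring match, plus the "unknown" fallback when -all is set), not the code in WikipediaToSequenceFile.java. The function names are hypothetical, and the case-insensitive comparison is an assumption.

```python
def match_category(doc_categories, wanted, exact_match_only=False, select_all=False):
    """Return the matched category, "unknown", or None when the document is skipped.

    doc_categories: category tags found in one Wikipedia document.
    wanted: lines read from the category file.
    Comparison is lower-cased here; that is an assumption of this sketch.
    """
    wanted_lower = [w.strip().lower() for w in wanted if w.strip()]
    for cat in doc_categories:
        cat_lower = cat.strip().lower()
        for w in wanted_lower:
            if exact_match_only:
                # -exactMatchOnly: the tag must equal the category line
                if cat_lower == w:
                    return w
            elif w in cat_lower:
                # default: the tag only has to contain the category string
                return w
    # -all: unmatched documents go to the "unknown" category
    return "unknown" if select_all else None


def sequence_file_key(category, title):
    # Key layout described in the text: /category/document_title
    return "/" + category + "/" + title
```

For example, a document tagged "Cities in United Kingdom" matches the category line "united kingdom" by default, but is skipped when exact matching is requested.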

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/dont_migrate/algorithms.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/dont_migrate/algorithms.md b/website/old_site_migration/dont_migrate/algorithms.md
new file mode 100644
index 0000000..c3a7e4f
--- /dev/null
+++ b/website/old_site_migration/dont_migrate/algorithms.md
@@ -0,0 +1,58 @@
+---
+layout: default
+title: Algorithms
+theme:
+    name: retro-mahout
+---
+
+NOTE: As we move away from MapReduce, all MR algorithms are deprecated.  If anything, maybe move this to the MapReduce home page and drop the Spark, Flink and H2O columns.
+---
+*Mahout 0.12.0 Features by Engine*
+---
+
+---------------------------------------------|:----------------:|:-----------:|:------:|:---:|:----:|
+**Mahout Math-Scala Core Library and Scala DSL**|
+|   [Mahout Distributed BLAS. Distributed Row Matrix API with R and Matlab like operators. Distributed ALS, SPCA, SSVD, thin-QR. Similarity Analysis](http://mahout.apache.org/users/sparkbindings/home.html).    | |  | [x](https://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf) | [x](https://github.com/apache/mahout/tree/master/h2o) |[x](https://github.com/apache/mahout/tree/flink-binding/flink)
+||
+**Mahout Interactive Shell**|
+|   [Interactive REPL shell for Spark optimized Mahout DSL](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) | | | x |
+||
+**Collaborative Filtering** *with CLI drivers*|
+    User-Based Collaborative Filtering           | *deprecated* | *deprecated*|[x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html)
+    Item-Based Collaborative Filtering           | x | [x](https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html) | [x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) |
+    Matrix Factorization with ALS | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) |  |
+    Matrix Factorization with ALS on Implicit Feedback | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) |  |
+    Weighted Matrix Factorization, SVD++  | x | |
+||
+**Classification** *with CLI drivers*| | |
+    Logistic Regression - trained via SGD   | [*deprecated*](http://mahout.apache.org/users/classification/logistic-regression.html) |
+    Naive Bayes / Complementary Naive Bayes  | | [*deprecated*](https://mahout.apache.org/users/classification/bayesian.html) | [x](https://mahout.apache.org/users/algorithms/spark-naive-bayes.html) |
+    Hidden Markov Models   | [*deprecated*](https://mahout.apache.org/users/classification/hidden-markov-models.html) |
+||
+**Clustering** *with CLI drivers*||
+    Canopy Clustering  | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html)|
+    k-Means Clustering   | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) |
+    Fuzzy k-Means   | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html)|
+    Streaming k-Means   | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) |
+    Spectral Clustering   |  | [*deprecated*](https://mahout.apache.org/users/clustering/spectral-clustering.html) |
+||
+**Dimensionality Reduction** *note: most scala-based dimensionality reduction algorithms are available through the [Mahout Math-Scala Core Library for all engines](https://mahout.apache.org/users/sparkbindings/home.html)*||
+    Singular Value Decomposition | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html) |[x](http://mahout.apache.org/users/environment/h2o-internals.html) |   [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html)
+    Lanczos Algorithm  | *deprecated* | *deprecated* |
+    Stochastic SVD  | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html)|    [x](http://mahout.apache.org/users/algorithms/d-ssvd.html)
+    PCA (via Stochastic SVD) | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html)  |[x](http://mahout.apache.org/users/environment/h2o-internals.html) |   [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html)
+    QR Decomposition         | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/algorithms/d-qr.html) |[x](http://mahout.apache.org/users/algorithms/d-qr.html) |   [x](http://mahout.apache.org/users/algorithms/d-qr.html)
+||
+**Topic Models**||
+    Latent Dirichlet Allocation  | *deprecated* | *deprecated* |
+||
+**Miscellaneous**||
+    RowSimilarityJob   |  | *deprecated* | [x](https://github.com/apache/mahout/blob/master/spark/src/test/scala/org/apache/mahout/drivers/RowSimilarityDriverSuite.scala) |
+    Collocations  |  | [*deprecated*](https://mahout.apache.org/users/basics/collocations.html) |
+    Sparse TF-IDF Vectors from Text |  | [*deprecated*](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) |
+    XML Parsing|  | [*deprecated*](https://issues.apache.org/jira/browse/MAHOUT-1479?jql=text%20~%20%22wikipedia%20mahout%22) |
+    Email Archive Parsing |  | [*deprecated*](https://github.com/apache/mahout/tree/master/integration/src/main/java/org/apache/mahout/text) |
+    Evolutionary Processes | [x](https://github.com/apache/mahout/tree/master/mr/src/main/java/org/apache/mahout/ep) |
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/algorithms.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/algorithms.md b/website/old_site_migration/needs_work_convenience/algorithms.md
deleted file mode 100644
index 657efde..0000000
--- a/website/old_site_migration/needs_work_convenience/algorithms.md
+++ /dev/null
@@ -1,58 +0,0 @@
----
-layout: default
-title: Algorithms
-theme:
-    name: retro-mahout
----
-
-
----
-*Mahout 0.12.0 Features by Engine*
----
-
----------------------------------------------|:----------------:|:-----------:|:------:|:---:|:----:|
-**Mahout Math-Scala Core Library and Scala DSL**|
-|   [Mahout Distributed BLAS. Distributed Row Matrix API with R and Matlab like operators. Distributed ALS, SPCA, SSVD, thin-QR. Similarity Analysis](http://mahout.apache.org/users/sparkbindings/home.html).    | |  | [x](https://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf) | [x](https://github.com/apache/mahout/tree/master/h2o) |[x](https://github.com/apache/mahout/tree/flink-binding/flink)
-||
-**Mahout Interactive Shell**|
-|   [Interactive REPL shell for Spark optimized Mahout DSL](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) | | | x |
-||
-**Collaborative Filtering** *with CLI drivers*|
-    User-Based Collaborative Filtering           | *deprecated* | *deprecated*|[x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html)
-    Item-Based Collaborative Filtering           | x | [x](https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html) | [x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) |
-    Matrix Factorization with ALS | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) |  |
-    Matrix Factorization with ALS on Implicit Feedback | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) |  |
-    Weighted Matrix Factorization, SVD++  | x | |
-||
-**Classification** *with CLI drivers*| | |
-    Logistic Regression - trained via SGD   | [*deprecated*](http://mahout.apache.org/users/classification/logistic-regression.html) |
-    Naive Bayes / Complementary Naive Bayes  | | [*deprecated*](https://mahout.apache.org/users/classification/bayesian.html) | [x](https://mahout.apache.org/users/algorithms/spark-naive-bayes.html) |
-    Hidden Markov Models   | [*deprecated*](https://mahout.apache.org/users/classification/hidden-markov-models.html) |
-||
-**Clustering** *with CLI drivers*||
-    Canopy Clustering  | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html)|
-    k-Means Clustering   | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) |
-    Fuzzy k-Means   | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html)|
-    Streaming k-Means   | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) |
-    Spectral Clustering   |  | [*deprecated*](https://mahout.apache.org/users/clustering/spectral-clustering.html) |
-||
-**Dimensionality Reduction** *note: most scala-based dimensionality reduction algorithms are available through the [Mahout Math-Scala Core Library for all engines](https://mahout.apache.org/users/sparkbindings/home.html)*||
-    Singular Value Decomposition | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html) |[x](http://mahout.apache.org/users/environment/h2o-internals.html) |   [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html)
-    Lanczos Algorithm  | *deprecated* | *deprecated* |
-    Stochastic SVD  | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html)|    [x](http://mahout.apache.org/users/algorithms/d-ssvd.html)
-    PCA (via Stochastic SVD) | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html)  |[x](http://mahout.apache.org/users/environment/h2o-internals.html) |   [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html)
-    QR Decomposition         | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/algorithms/d-qr.html) |[x](http://mahout.apache.org/users/algorithms/d-qr.html) |   [x](http://mahout.apache.org/users/algorithms/d-qr.html)
-||
-**Topic Models**||
-    Latent Dirichlet Allocation  | *deprecated* | *deprecated* |
-||
-**Miscellaneous**||
-    RowSimilarityJob   |  | *deprecated* | [x](https://github.com/apache/mahout/blob/master/spark/src/test/scala/org/apache/mahout/drivers/RowSimilarityDriverSuite.scala) |
-    Collocations  |  | [*deprecated*](https://mahout.apache.org/users/basics/collocations.html) |
-    Sparse TF-IDF Vectors from Text |  | [*deprecated*](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) |
-    XML Parsing|  | [*deprecated*](https://issues.apache.org/jira/browse/MAHOUT-1479?jql=text%20~%20%22wikipedia%20mahout%22) |
-    Email Archive Parsing |  | [*deprecated*](https://github.com/apache/mahout/tree/master/integration/src/main/java/org/apache/mahout/text) |
-    Evolutionary Processes | [x](https://github.com/apache/mahout/tree/master/mr/src/main/java/org/apache/mahout/ep) |
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md b/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
deleted file mode 100644
index c72a7ae..0000000
--- a/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
+++ /dev/null
@@ -1,51 +0,0 @@
----
-layout: default
-title:
-theme:
-   name: retro-mahout
----
-
-# Introduction
-
-This document provides an overview of how the Mahout Samsara environment is implemented over the H2O backend engine. The document is aimed at Mahout developers, to give a high level description of the design so that one can explore the code inside h2o/ with some context.
-
-## H2O Overview
-
-H2O is a distributed scalable machine learning system. Internal architecture of H2O has a distributed math engine (h2o-core) and a separate layer on top for algorithms and UI. The Mahout integration requires only the math engine (h2o-core).
-
-## H2O Data Model
-
-The data model of the H2O math engine is a distributed columnar store (of primarily numbers, but also strings). A column of numbers is called a Vector, which is broken into Chunks (of a few thousand elements). Chunks are distributed across the cluster based on a deterministic hash. Therefore, any member of the cluster knows where a particular Chunk of a Vector is homed. Each Chunk is separately compressed in memory and elements are individually decompressed on the fly upon access with purely register operations (thereby achieving high memory throughput). An ordered set of similarly partitioned Vectors is composed into a Frame. A Frame is therefore a large two dimensional table of numbers. All elements of a logical row in the Frame are guaranteed to be homed on the same server of the cluster. Generally speaking, H2O works well on "tall skinny" data, i.e., lots of rows (100s of millions) and a modest number of columns (10s of thousands).
-
-
-## Mahout DRM
-
-The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are the Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping.
-
-
-## H2O Environment Engine
-
-The H2O backend implements the abstract DRM as an H2O Frame. Each logical column in the DRM is an H2O Vector. All elements of a logical DRM row are guaranteed to be homed on the same server. A set of rows stored on a server are presented as a read-only virtual in-core Matrix (i.e BlockMatrix) for the closure method in the mapBlock(...) API.
-
-H2O provides a flexible execution framework called MRTask. The MRTask framework typically executes over a Frame (or even a Vector), supports various types of map() methods, can optionally modify the Frame or Vector (though this never happens in the Mahout integration), and optionally create a new Vector or set of Vectors (to combine them into a new Frame, and consequently a new DRM).
-
-
-## Source Layout
-
-Within mahout.git, the top level directory h2o/ holds all the source code related to the H2O backend engine. Part of the code (that interfaces with the rest of the Mahout components) is in Scala, and part of the code (that interfaces with h2o-core and implements algebraic operators) is in Java. Here is a brief overview of what functionality can be found where within h2o/.
-
-  h2o/ - top level directory containing all H2O related code
-
-  h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical operator code for the various DSL algebra
-
-  h2o/src/main/java/org/apache/mahout/h2obindings/drm/*.java - DRM backing (onto Frame) and Broadcast implementation
-
-  h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java - Read / Write between DRM (Frame) and files on HDFS
-
-  h2o/src/main/java/org/apache/mahout/h2obindings/H2OBlockMatrix.java - A vertical block matrix of DRM presented as a virtual copy-on-write in-core Matrix. Used in mapBlock() API
-
-  h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java - A collection of various functionality and helpers, e.g. converting between an in-core Matrix and a DRM, and various summary statistics on a DRM/Frame.
-
-  h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine
-
-  h2o/src/main/scala/org/apache/mahout/h2obindings/* - Various abstract API implementations ("glue work")
\ No newline at end of file
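
The row co-location guarantee from the H2O Data Model section can be illustrated with a small Python sketch. It assumes, purely for illustration, that a Chunk's home is derived from a hash of its chunk index alone and that every Vector in a Frame uses the same chunk boundaries. The chunk size, the use of CRC32 and the function names are assumptions of the sketch, not H2O's actual placement scheme.

```python
import zlib

CHUNK_SIZE = 4  # tiny for illustration; the text says real Chunks hold a few thousand elements


def chunk_index(row, chunk_size=CHUNK_SIZE):
    # Every similarly partitioned Vector maps a given row to the same chunk index
    return row // chunk_size


def home_node(chunk_idx, num_nodes):
    # Deterministic: depends only on the chunk index, so any cluster member
    # can compute where a Chunk lives without asking anyone
    return zlib.crc32(str(chunk_idx).encode("utf-8")) % num_nodes


def row_home(row, num_nodes, chunk_size=CHUNK_SIZE):
    # Because all columns share chunk boundaries, this is the home of the
    # whole logical row, whichever column is being read
    return home_node(chunk_index(row, chunk_size), num_nodes)
```

Since `row_home` never looks at the column, all elements of a logical row land on the same node, which is the property the DRM mapping relies on.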

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/environment/spark-internals.md b/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
deleted file mode 100644
index f5d72a4..0000000
--- a/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
+++ /dev/null
@@ -1,25 +0,0 @@
----
-layout: default
-title:
-theme:
-   name: retro-mahout
----
-
-# Introduction
-
-This document provides an overview of how the Mahout Scala DSL (distributed algebraic operators) is implemented over the Spark back end engine. The document is aimed at Mahout developers, to give a high level description of the design.
-
-## Spark Overview
-
-## Spark Data Model
-
-
-## Mahout DRM
-
-Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. The DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are the Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping.
-
-
-## Spark DSL Engine
-
-
-## Source Layout

----------------------------------------------------------------------
deleted file mode 100644
index 8c8145a..0000000
+++ /dev/null
@@ -1,50 +0,0 @@
----
-layout: default
-title:
-theme:
-   name: retro-mahout
----
-
-#Introduction
-
-This document provides an overview of how the Mahout Samsara environment is implemented over the Apache Flink backend engine. This document gives an overview of the code layout for the Flink backend engine, the source code for which can be found under /flink directory in the Mahout codebase.
-
-Apache Flink is a distributed big data streaming engine that supports both Streaming and Batch interfaces. Batch processing is an extension of Flink’s Stream processing engine.
-
-The Mahout Flink integration presently supports Flink’s batch processing capabilities leveraging the DataSet API.
-
-The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. An example is the Spark backend engine. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping.
-
-
-Apache Flink is an open source, distributed Stream and Batch Processing Framework. At its core, Flink is a Stream Processing engine and Batch processing is an extension of Stream Processing.
-
-
- <ol>
-<li><b>DataSet API</b> for Batch data in Java, Scala and Python</li>
-<li><b>DataStream API</b> for Stream Processing in Java and Scala</li>
-<li><b>Table API</b> with SQL-like regular expression language in Java and Scala</li>
-<li><b>Gelly</b> Graph Processing API in Java and Scala</li>
-<li><b>CEP API</b>, a complex event processing library</li>
-</ol>
-
-The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job runs in the context of an ExecutionEnvironment (from the Flink Batch processing API).
-
-#Source Layout
-
-Within mahout.git, the top level directory, flink/ holds all the source code for the Flink backend engine. Sections of code that interface with the rest of the Mahout components are in Scala, and sections of the code that interface with Flink DataSet API and implement algebraic operators are in Java. Here is a brief overview of what functionality can be found within flink/ folder.
-
-
-
-
-
-
-

----------------------------------------------------------------------
deleted file mode 100644
index 4bbcd33..0000000
+++ /dev/null
@@ -1,111 +0,0 @@
----
-layout: default
-title:
-theme:
-   name: retro-mahout
----
-
-## Getting Started
-
-To get started, add the following dependency to the pom:
-
-    <dependency>
-      <groupId>org.apache.mahout</groupId>
-      <version>0.12.0</version>
-    </dependency>
-
-Here is how to use the Flink backend:
-
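-	// The two Flink-related imports and the object name are assumed; they are not in the original snippet.
-	import org.apache.flink.api.scala._
-	import org.apache.mahout.flinkbindings._
-	import org.apache.mahout.math.drm._
-	import org.apache.mahout.math.drm.RLikeDrmOps._
-
-	object ReadCsvExample {
-
-	  def main(args: Array[String]): Unit = {
-	    // Path to a tab-delimited input file
-	    val filePath = "path/to/the/input/file"
-
-	    // Wrap Flink's batch environment in a Mahout distributed context
-	    val env = ExecutionEnvironment.getExecutionEnvironment
-	    implicit val ctx = new FlinkDistributedContext(env)
-
-	    // Read the file as a DRM and compute A' * A
-	    val drm = readCsv(filePath, delim = "\t", comment = "#")
-	    val C = drm.t %*% drm
-	    println(C.collect)
-	  }
-
-	}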
-
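
The expression `drm.t %*% drm` above computes A'A for a matrix whose rows are spread across servers. One reason this suits a row-distributed matrix is that A'A equals the sum of B'B over any partition of the rows into blocks B, so each server can work on its own rows and only small partial results need to be combined. The Python sketch below shows that identity on plain lists. It is an illustration under that assumption, with hypothetical function names, and says nothing about how the Flink operator is actually implemented.

```python
def ata_block(block):
    """Compute B'B for one block of rows (a list of equal-length lists)."""
    n = len(block[0])
    out = [[0.0] * n for _ in range(n)]
    for row in block:
        # Each row contributes its outer product with itself
        for i in range(n):
            for j in range(n):
                out[i][j] += row[i] * row[j]
    return out


def add_matrices(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def distributed_ata(blocks):
    """Sum the per-block partial products, as a reduce step would."""
    partials = [ata_block(b) for b in blocks]
    total = partials[0]
    for p in partials[1:]:
        total = add_matrices(total, p)
    return total
```

Splitting the rows `[[1, 2], [3, 4], [5, 6]]` into two blocks or keeping them in one block gives the same 2 x 2 result.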
-## Current Status
-
-The top JIRA for Flink backend is [MAHOUT-1570](https://issues.apache.org/jira/browse/MAHOUT-1570) which has been fully implemented.
-
-### Implemented
-
-* [MAHOUT-1701](https://issues.apache.org/jira/browse/MAHOUT-1701) Mahout DSL for Flink: implement AtB ABt and AtA operators
-* [MAHOUT-1702](https://issues.apache.org/jira/browse/MAHOUT-1702) implement element-wise operators (like A + 2 or A + B)
-* [MAHOUT-1703](https://issues.apache.org/jira/browse/MAHOUT-1703) implement cbind and rbind
-* [MAHOUT-1709](https://issues.apache.org/jira/browse/MAHOUT-1709) implement slicing (like A(1 to 10, ::))
-* [MAHOUT-1710](https://issues.apache.org/jira/browse/MAHOUT-1710) implement right in-core matrix multiplication (A %*% B when B is in-core)
-* [MAHOUT-1712](https://issues.apache.org/jira/browse/MAHOUT-1712) implement operators At, Ax, Atx - Ax and At are implemented
-* [MAHOUT-1734](https://issues.apache.org/jira/browse/MAHOUT-1734) implement I/O - should be able to read results of Flink bindings
-* [MAHOUT-1747](https://issues.apache.org/jira/browse/MAHOUT-1747) add support for different types of indexes (String, long, etc) - now supports Int, Long and String
-* [MAHOUT-1748](https://issues.apache.org/jira/browse/MAHOUT-1748) switch to Flink Scala API
-* [MAHOUT-1749](https://issues.apache.org/jira/browse/MAHOUT-1749) Implement Atx
-* [MAHOUT-1750](https://issues.apache.org/jira/browse/MAHOUT-1750) Implement ABt
-* [MAHOUT-1751](https://issues.apache.org/jira/browse/MAHOUT-1751) Implement AtA
-* [MAHOUT-1755](https://issues.apache.org/jira/browse/MAHOUT-1755) Flush intermediate results to FS - Flink, unlike Spark, does not store intermediate results in memory.
-* [MAHOUT-1776](https://issues.apache.org/jira/browse/MAHOUT-1776) Refactor common Engine agnostic classes to Math-Scala module
-* [MAHOUT-1777](https://issues.apache.org/jira/browse/MAHOUT-1777) move HDFSUtil classes into the HDFS module
-* [MAHOUT-1804](https://issues.apache.org/jira/browse/MAHOUT-1804) Implement drmParallelizeWithRowLabels(..) in Flink
-* [MAHOUT-1805](https://issues.apache.org/jira/browse/MAHOUT-1805) Implement allReduceBlock(..) in Flink bindings
-* [MAHOUT-1809](https://issues.apache.org/jira/browse/MAHOUT-1809) Failing tests in flin-bindings: dals and dspca
-* [MAHOUT-1810](https://issues.apache.org/jira/browse/MAHOUT-1810) Failing test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing issue)
-* [MAHOUT-1812](https://issues.apache.org/jira/browse/MAHOUT-1812) Implement drmParallelizeWithEmptyLong(..) in flink bindings
-* [MAHOUT-1814](https://issues.apache.org/jira/browse/MAHOUT-1814) Implement drm2intKeyed in flink bindings
-* [MAHOUT-1815](https://issues.apache.org/jira/browse/MAHOUT-1815) dsqDist(X,Y) and dsqDist(X) failing in flink tests
-* [MAHOUT-1816](https://issues.apache.org/jira/browse/MAHOUT-1816) Implement newRowCardinality in CheckpointedFlinkDrm
-* [MAHOUT-1817](https://issues.apache.org/jira/browse/MAHOUT-1817) Implement caching in Flink Bindings
-* [MAHOUT-1818](https://issues.apache.org/jira/browse/MAHOUT-1818) dals test failing in Flink Bindings
-* [MAHOUT-1820](https://issues.apache.org/jira/browse/MAHOUT-1820) Add a method to generate Tuple<PartitionId, Partition elements count>> to support Flink backend
-* [MAHOUT-1821](https://issues.apache.org/jira/browse/MAHOUT-1821) Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink configuration
-* [MAHOUT-1824](https://issues.apache.org/jira/browse/MAHOUT-1824) Optimize FlinkOpAtA to use upper triangular matrices
-
-### Tests
-
-There is a set of standard tests that all engines should pass (see [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764)).
-
-* DistributedDecompositionsSuite
-* DrmLikeOpsSuite
-* DrmLikeSuite
-* RLikeDrmOpsSuite
-
-
-These are Flink-backend specific tests, e.g.
-
-* DrmLikeOpsSuite for operations like norm, rowSums, rowMeans
-* RLikeOpsSuite for basic LA like A.t %*% A, A.t %*% x, etc
-* LATestSuite tests for specific operators like AtB, Ax, etc
-* UseCasesSuite has more complex examples, like power iteration, ridge regression, etc
-
-## Environment
-
-For development the minimal supported configuration is
-
-* [Scala 2.10]
-
-When using mahout, please import the following modules:
-
-* mahout-math
-* mahout-math-scala
-* mahout-flink_2.10
-*
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
deleted file mode 100644
index 846a4ce..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-layout: default
-title:
-theme:
-    name: retro-mahout
----
-
-Notice:    Licensed to the Apache Software Foundation (ASF) under one
-           or more contributor license agreements.  See the NOTICE file
-           distributed with this work for additional information
-           to you under the Apache License, Version 2.0 (the
-           "License"); you may not use this file except in compliance
-           with the License.  You may obtain a copy of the License at
-           .
-           .
-           Unless required by applicable law or agreed to in writing,
-           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-           KIND, either express or implied.  See the License for the
-           specific language governing permissions and limitations
-
-#Bank Marketing Example
-
-### Introduction
-
-This page describes how to run Mahout's SGD classifier on the [UCI Bank Marketing dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing).
-The goal is to predict if the client will subscribe to a term deposit offered via a phone call. The features in the dataset consist
-of information such as age, job, marital status as well as information about the last contacts from the bank.
-
-### Code & Data
-
-The bank marketing example code lives under
-
-*mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing*
-
-The data can be found at
-
-*mahout-examples/src/main/resources/bank-full.csv*
-
-### Code details
-
-This example consists of 3 classes:
-
-  - BankMarketingClassificationMain
-  - TelephoneCall
-  - TelephoneCallParser
-
-When you run the main method of BankMarketingClassificationMain it parses the dataset using the TelephoneCallParser and trains
-a logistic regression model with 20 runs and 20 passes. The TelephoneCallParser uses Mahout's feature vector encoder
-to encode the features in the dataset into a vector. Afterwards the model is tested, and the learning rate, AUC and accuracy are printed to standard output.
\ No newline at end of file
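
To make the encoding step more concrete, here is a minimal Python sketch of hashed feature encoding, the general technique of mapping named features into a fixed-size vector. It assumes numeric features add their value at a hashed index and categorical features add 1.0 at an index hashed from name and value. The vector size, the CRC32 hash and the `encode` function are illustrative choices, not the behaviour of TelephoneCallParser or of Mahout's encoder classes.

```python
import zlib


def encode(features, size=16):
    """Encode a dict of named features into a fixed-size list of floats."""
    vec = [0.0] * size
    for name, value in features.items():
        if isinstance(value, (int, float)):
            # Numeric feature: hash the name, add the value at that slot
            idx = zlib.crc32(name.encode("utf-8")) % size
            vec[idx] += float(value)
        else:
            # Categorical feature: hash "name=value", add a unit weight
            idx = zlib.crc32((name + "=" + str(value)).encode("utf-8")) % size
            vec[idx] += 1.0
    return vec
```

A record such as `{"age": 30, "job": "admin", "marital": "married"}` always maps to the same 16-element vector, which is what lets a linear model such as logistic regression train on it without a precomputed dictionary.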

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
deleted file mode 100644
index 51a5c74..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
+++ /dev/null
@@ -1,147 +0,0 @@
----
-layout: default
-title:
-theme:
-    name: retro-mahout
----
-
-# Naive Bayes
-
-
-## Intro
-
-Mahout currently has two Naive Bayes implementations.  The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to the former as Bayes and the latter as CBayes.
-
-Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines.
-
-
-## Implementations
-Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and classification can be done via a MapReduce Job or sequentially.  Mahout provides CLI drivers for preprocessing, training and testing. A Spark implementation is currently in the works ([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
-
-## Preprocessing and Algorithm
-
-As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):
-
-- Let $$\vec{d}=(\vec{d_1},...,\vec{d_n})$$ be a set of documents; $$d_{ij}$$ is the count of word $$i$$ in document $$j$$.
-- Let $$\vec{y}=(y_1,...,y_n)$$ be their labels.
-- Let $$\alpha_i$$ be a smoothing parameter for all words in the vocabulary; let $$\alpha=\sum_i{\alpha_i}$$.
-- **Preprocessing**(via seq2Sparse) TF-IDF transformation and L2 length normalization of $$\vec{d}$$
-    1. $$d_{ij} = \sqrt{d_{ij}}$$
-    2. $$d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)$$
-    3. $$d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}$$
-- **Training: Bayes**$$(\vec{d},\vec{y})$$ calculate term weights $$w_{ci}$$ as:
-    1. $$\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}$$
-    2. $$w_{ci}=\log{\hat\theta_{ci}}$$
-- **Training: CBayes**$$(\vec{d},\vec{y})$$ calculate term weights $$w_{ci}$$ as:
-    1. $$\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}$$
-    2. $$w_{ci}=-\log{\hat\theta_{ci}}$$
-    3. $$w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}$$
-- **Label Assignment/Testing:**
-    1. Let $$\vec{t}= (t_1,...,t_n)$$ be a test document; let $$t_i$$ be the count of word $$i$$ in $$\vec{t}$$.
-    2. Label the document according to $$l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}$$
-
-As we can see, the main difference between Bayes and CBayes is the weight calculation step.  Where Bayes weighs terms more heavily based on the likelihood that they belong to class $$c$$, CBayes seeks to maximize term weights on the likelihood that they do not belong to any other class.
-
-## Running from the command line
-
-Mahout provides CLI drivers for all above steps.  Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
-
-- **Preprocessing:**
-For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
-
-        mahout seq2sparse
-          -i ${PATH_TO_SEQUENCE_FILES}
-          -o ${PATH_TO_TFIDF_VECTORS}
-          -nv
-          -n 2
-          -wt tfidf
-
-- **Training:**
-The model is then trained using mahout trainnb .  The default is to train a Bayes model. The -c option is given to train a CBayes model:
-
-        mahout trainnb
-          -i ${PATH_TO_TFIDF_VECTORS}
-          -o ${PATH_TO_MODEL}/model
-          -li ${PATH_TO_MODEL}/labelindex
-          -ow
-          -c
-
-- **Label Assignment/Testing:**
-Classification and testing on a holdout set can then be performed via mahout testnb. Again, the -c option indicates that the model is CBayes. The -seq option tells mahout testnb to run sequentially:
-
-        mahout testnb
-          -i ${PATH_TO_TFIDF_TEST_VECTORS}
-          -m ${PATH_TO_MODEL}/model
-          -l ${PATH_TO_MODEL}/labelindex
-          -ow
-          -o ${PATH_TO_OUTPUT}
-          -c
-          -seq
-
-## Command line options
-
-- **Preprocessing:**
-
-  Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by mahout seq2sparse and used as input to Bayes/CBayes.  For a full list of mahout seq2Sparse options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
-
-        mahout seq2sparse
-          --output (-o) output             The directory pathname for output.
-          --input (-i) input               Path to job input directory.
-          --weight (-wt) weight            The kind of weight to use. Currently TF
-                                               or TFIDF. Default: TFIDF
-          --norm (-n) norm                 The norm to use, expressed as either a
-                                               float or "INF" if you want to use the
-                                               Infinite norm.  Must be greater or equal
-                                               to 0.  The default is not to normalize
-          --overwrite (-ow)                If set, overwrite the output directory
-          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
-                                               be SequentialAccessVectors. If set true
-                                               else false
-          --namedVector (-nv)              (Optional) Whether output vectors should
-                                               be NamedVectors. If set true else false
-
-- **Training:**
-
-        mahout trainnb
-          --input (-i) input               Path to job input directory.
-          --output (-o) output             The directory pathname for output.
-          --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0
-          --trainComplementary (-c)        Train complementary? Default is false.
-          --labelIndex (-li) labelIndex    The path to store the label index in
-          --overwrite (-ow)                If present, overwrite the output directory
-                                               before running job
-          --help (-h)                      Print out help
-          --tempDir tempDir                Intermediate output directory
-          --startPhase startPhase          First phase to run
-          --endPhase endPhase              Last phase to run
-
-- **Testing:**
-
-        mahout testnb
-          --input (-i) input               Path to job input directory.
-          --output (-o) output             The directory pathname for output.
-          --overwrite (-ow)                If present, overwrite the output directory
-                                               before running job
-
-
-          --model (-m) model               The path to the model built during training
-          --testComplementary (-c)         Test complementary? Default is false.
-          --runSequential (-seq)           Run sequential?
-          --labelIndex (-l) labelIndex     The path to the location of the label index
-          --help (-h)                      Print out help
-          --tempDir tempDir                Intermediate output directory
-          --startPhase startPhase          First phase to run
-          --endPhase endPhase              Last phase to run
-
-
-## Examples
-
-Mahout provides an example for Naive Bayes classification:
-
-1. [Classify 20 Newsgroups](twenty-newsgroups.html)
-
-## References
-
-[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). [Tackling the Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
-
-
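The training and label-assignment formulas in the removed Naive Bayes page above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Mahout's implementation: the function names (train_bayes, train_cbayes, classify) are hypothetical, documents are assumed to be dense lists of already preprocessed term weights, and the smoothing parameter alpha_i is assumed to be the same for every term.

```python
import math


def _term_sums(docs, labels, num_terms, keep):
    """Sum term weights over the documents whose label satisfies keep(label)."""
    sums = [0.0] * num_terms
    for doc, label in zip(docs, labels):
        if keep(label):
            for i, value in enumerate(doc):
                sums[i] += value
    return sums


def train_bayes(docs, labels, alpha_i=1.0):
    """Bayes weights: w_ci = log((d_ic + alpha_i) / (sum_k d_kc + alpha))."""
    num_terms = len(docs[0])
    alpha = alpha_i * num_terms  # alpha = sum_i alpha_i (uniform alpha_i assumed)
    weights = {}
    for c in set(labels):
        sums = _term_sums(docs, labels, num_terms, lambda y: y == c)
        total = sum(sums)
        weights[c] = [math.log((s + alpha_i) / (total + alpha)) for s in sums]
    return weights


def train_cbayes(docs, labels, alpha_i=1.0):
    """CBayes weights: the same estimate over all documents NOT in class c,
    negated, then normalized by the sum of absolute weights."""
    num_terms = len(docs[0])
    alpha = alpha_i * num_terms
    weights = {}
    for c in set(labels):
        sums = _term_sums(docs, labels, num_terms, lambda y: y != c)
        total = sum(sums)
        raw = [-math.log((s + alpha_i) / (total + alpha)) for s in sums]
        norm = sum(abs(w) for w in raw) or 1.0  # guard against an all-zero vector
        weights[c] = [w / norm for w in raw]
    return weights


def classify(weights, t):
    """Label assignment: argmax over classes c of sum_i t_i * w_ci."""
    return max(sorted(weights),
               key=lambda c: sum(ti * wi for ti, wi in zip(t, weights[c])))


# Toy data: two terms, two classes (made up for illustration).
docs = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
labels = ["a", "a", "b", "b"]
print(classify(train_bayes(docs, labels), [1.0, 0.0]))
print(classify(train_cbayes(docs, labels), [0.0, 1.0]))
```

Both trainers share the same label-assignment step; only the weight calculation differs, which mirrors the distinction the page draws between Bayes and CBayes.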

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
deleted file mode 100644
index d8d049e..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
+++ /dev/null
@@ -1,67 +0,0 @@
----
-layout: default
-title: Breiman Example
-theme:
-    name: retro-mahout
----
-
-# Breiman Example
-
-#### Introduction
-
-This page describes how to run the Breiman example, which implements the test procedure described in [Leo Breiman's paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf). The basic algorithm is as follows :
-
- * repeat *I* iterations
- * in each iteration do
-  * keep 10% of the dataset apart as a testing set
-  * build two forests using the training set, one with *m = int(log2(M) + 1)* (called Random-Input) and one with *m = 1* (called Single-Input)
-  * choose the forest that gave the lowest oob error estimation to compute
-the test set error
-  * compute the test set error using the Single Input Forest (test error),
-this demonstrates that even with *m = 1*, Decision Forests give comparable
-results to greater values of *m*
-  * compute the mean testset error using every tree of the chosen forest
-(tree error). This should indicate how well a single Decision Tree performs
- * compute the mean test error for all iterations
- * compute the mean tree error for all iterations
-
-
-#### Running the Example
-
-The current implementation is compatible with the [UCI repository](http://archive.ics.uci.edu/ml/) file format. We'll show how to run this example on two datasets:
-
-First, we deal with [Glass Identification](http://archive.ics.uci.edu/ml/datasets/Glass+Identification): download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data) file called **glass.data** and store it onto your local machine. Next, we must generate the descriptor file **glass.info** for this dataset with the following command:
-
-    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L
-
-Substitute */path/to/* with the folder where you downloaded the dataset. The argument "I 9 N L" indicates the nature of the variables: here it means 1
-ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
-the label (L).
-
-Finally, we build and evaluate our random forest classifier as follows:
-
-    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100
-This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i
-argument).
-
-The example outputs the following results:
-
- * Selection error: mean test error for the selected forest on all iterations
- * Single Input error: mean test error for the single input forest on all
-iterations
- * One Tree error: mean single tree error on all iterations
- * Mean Random Input Time: mean build time for random input forests on all
-iterations
- * Mean Single Input Time: mean build time for single input forests on all
-iterations
-
-We can repeat this for a [Sonar](http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29) use case: download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data) file called **sonar.all-data** and store it onto your local machine. Generate the descriptor file **sonar.info** for this dataset with the following command:
-
-    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
-
-The argument "60 N L" means 60 numerical (N) attributes, followed by the label (L). Analogous to the previous case, we run the evaluation as follows:
-
-    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
-
-
-
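The Random-Input setting *m = int(log2(M) + 1)* from the removed Breiman example page above can be checked with a short Python sketch. It assumes that M is the number of input attributes (excluding ignored attributes and the label), which the page does not state explicitly; the function name random_input_m is hypothetical.

```python
import math


def random_input_m(num_attributes):
    """Attributes tried per split for the Random-Input forest: int(log2(M) + 1)."""
    return int(math.log2(num_attributes) + 1)


# Glass has 9 numerical attributes ("I 9 N L"); Sonar has 60 ("60 N L").
for name, m_total in [("glass", 9), ("sonar", 60)]:
    print(name, random_input_m(m_total))
```

Under that assumption the Random-Input forest would use m = 4 for Glass and m = 6 for Sonar, while the Single-Input forest always uses m = 1.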

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
deleted file mode 100644
index a24cc14..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
+++ /dev/null
@@ -1,155 +0,0 @@
----
-layout: default
-title: Class Discovery
-theme:
-    name: retro-mahout
----
-<a name="ClassDiscovery-ClassDiscovery"></a>
-# Class Discovery
-
-See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf
-
-CDGA uses a Genetic Algorithm to discover a classification rule for a given
-dataset.
-A dataset can be seen as a table:
-
-<table>
-<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
-<tr><td>row 1</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
-<tr><td>row 2</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
-<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
-<tr><td>row M</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
-</table>
-
-An attribute can be numerical, for example a "temperature" attribute, or
-categorical, for example a "color" attribute. For classification purposes,
-one of the categorical attributes is designated as a *label*, which means
-that its value defines the *class* of the rows.
-A classification rule can be represented as follows:
-<table>
-<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
-<tr><td>weight</td><td>w1</td><td>w2</td><td>...</td><td>wN</td></tr>
-<tr><td>operator</td><td>op1</td><td>op2</td><td>...</td><td>opN</td></tr>
-<tr><td>value</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
-</table>
-
-For a given *target* class and a weight *threshold*, the classification rule can be read as follows:
-
-
-    for each row of the dataset
-      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1
-rule.value1)) &&
-         (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2
-rule.value2)) &&
-         ...
-         (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN
-rule.valueN)) then
-        row is part of the target class
-
-
-*Important:* The label attribute is not evaluated by the rule.
-
-The threshold parameter allows some conditions of the rule to be skipped if
-their weight is too small. The operators available depend on the attribute
-types:
-* for numerical attributes, the available operators are '<' and '>='
-* for categorical attributes, the available operators are '!=' and '=='
-
-The "threshold" and "target" are user defined parameters, and because the
-label is always a categorical attribute, the target is the (zero based)
-index of the class label value in all the possible values of the label. For
-example, if the label attribute can have the following values (blue, brown,
-green), then a target of 1 means the "brown" class.
-
-For example, we have the following dataset (the label attribute is "Eyes
-Color"):
-<table>
-<tr><th> </th><th>Age</th><th>Eyes Color</th><th>Hair Color</th></tr>
-<tr><td>row 1</td><td>16</td><td>brown</td><td>dark</td></tr>
-<tr><td>row 2</td><td>25</td><td>green</td><td>light</td></tr>
-<tr><td>row 3</td><td>12</td><td>blue</td><td>light</td></tr>
-and a classification rule:
-<tr><td>weight</td><td>0</td><td>1</td></tr>
-<tr><td>operator</td><td><</td><td>!=</td></tr>
-<tr><td>value</td><td>20</td><td>light</td></tr>
-and the following parameters: threshold = 1 and target = 0 (brown).
-</table>
-
-This rule can be read as follows:
-
-    for each row of the dataset
-      if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
-         (1 < 1 || (1 >= 1 && row.value2 != light)) then
-        row is part of the "brown Eye Color" class
-
-
-Please note how the rule skipped the label attribute (Eye Color), and how
-the first condition is ignored because its weight is < threshold.
-
-<a name="ClassDiscovery-Runningtheexample:"></a>
-# Running the example:
-NOTE: Substitute in the appropriate version for the Mahout JOB jar
-
-1. cd <MAHOUT_HOME>/examples
-1. ant job
-<MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc{code}
-<MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos{code}
-<MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job
-org.apache.mahout.ga.watchmaker.cd.CDGA
-<MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10
-
-    CDGA needs 9 parameters:
-    * param 1 : path of the directory that contains the dataset and its infos
-file
-    * param 2 : target class
-    * param 3 : threshold
-    * param 4 : number of crossover points for the multi-point crossover
-    * param 5 : mutation rate
-    * param 6 : mutation range
-    * param 7 : mutation precision
-    * param 8 : population size
-    * param 9 : number of generations before the program stops
-
-.
-    For a detailed explanation about the 5th, 6th and 7th parameters, please
-see [Real Valued Mutation](http://www.geatbx.com/docu/algindex-04.html#P659_42386)
-.
-
-    *TODO*: Fill in where to find the output and what it means.
-
-    h1. The info file:
-    To run properly, CDGA needs some information about the dataset. Each
-dataset should be accompanied by an .infos file that contains the needed
-information. For each attribute, a corresponding line in the info file
-describes it; it can be one of the following:
-    * IGNORED
-      if the attribute is ignored
-    * LABEL, val1, val2,...
-      if the attribute is the label (class), and its possible values
-    * CATEGORICAL, val1, val2,...
-      if the attribute is categorical (nominal), and its possible values
-    * NUMERICAL, min, max
-      if the attribute is numerical, and its min and max values
-
-    This file can be generated automatically using a special tool available with
-CDGA.
-
-
-
-*  the tool searches for an existing infos file (*must be filled by the
-user*), in the same directory as the dataset, with the same name and with
-the ".infos" extension, that contains the type of each attribute:
-  ** 'N' numerical attribute
-  ** 'C' categorical attribute
-  ** 'L' label (this is also a categorical attribute)
-  ** 'I' to ignore the attribute
-  each attribute is in a separate
-* A Hadoop job is used to parse the dataset and collect the information.
-This means that *the dataset can be distributed over HDFS*.
-* the results are written back in the same .info file, with the correct
-format needed by CDGA.

