mahout-commits mailing list archives

From vans...@apache.org
Subject [23/52] [partial] mahout git commit: removed website directory- this folder should be empty except for the output directory
Date Tue, 27 Jun 2017 16:30:02 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/distributed/spark-bindings/faq.md
----------------------------------------------------------------------
diff --git a/website/docs/distributed/spark-bindings/faq.md b/website/docs/distributed/spark-bindings/faq.md
deleted file mode 100644
index 9649e3b..0000000
--- a/website/docs/distributed/spark-bindings/faq.md
+++ /dev/null
@@ -1,52 +0,0 @@
----
-layout: default
-title: FAQ
-theme:
-    name: retro-mahout
----
-
-# FAQ for using Mahout with Spark
-
-**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various classpath problems.**
-
-**A:** As of this writing, all reported problems starting the Spark shell in Mahout have revolved 
-around classpath issues of one kind or another. 
-
-If you are getting method-signature-like errors, you most probably have a mismatch between Mahout's Spark dependency 
-and the Spark version actually installed (at the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml).
-
-Troubleshooting general classpath issues is pretty straightforward. Since Mahout uses Spark's installation 
-and the classpath reported by Spark itself for Spark-related dependencies, it is important to make sure 
-the classpath is sane and is made available to Mahout:
-
-1. Check that Spark is the correct version (the same as in Mahout's poms), is compiled, and that SPARK_HOME is set.
-2. Check that Mahout is compiled and that MAHOUT_HOME is set.
-3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors. 
-If it outputs something other than a straightforward classpath string, most likely Spark is not compiled/set up correctly (later Spark versions require 
-`sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough).
-4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included.
-
-**Q: I am using the command-line Mahout jobs that run on Spark, or am writing my own application that uses 
-Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or signature errors during serialization. 
-What's wrong?**
- 
-**A:** The Spark artifacts in the Maven ecosystem may not match the exact binary you are running on your cluster. This may 
-cause class name or version mismatches. In this case you may wish 
-to build Spark yourself to guarantee that you are running exactly what you are building Mahout against. To do this, follow these steps 
-in order:
-
-1. Build Spark with Maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead. 
-Something like "mvn clean install -Dhadoop1.2.1", or whatever your particular build options are. This will put the jars for Spark 
-in the local Maven cache.
-2. Deploy **your** Spark build to your cluster and test it there.
-3. Build Mahout. This will cause Maven to pull the jars for Spark from the local Maven cache and may resolve missing 
-or mis-identified classes.
-4. If you are building your own code, do so against the local builds of Spark and Mahout.
-
-**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.**
-
-**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 'd' stands for distributed. 
-
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/distributed/spark-bindings/index.md
----------------------------------------------------------------------
diff --git a/website/docs/distributed/spark-bindings/index.md b/website/docs/distributed/spark-bindings/index.md
deleted file mode 100644
index 54324c7..0000000
--- a/website/docs/distributed/spark-bindings/index.md
+++ /dev/null
@@ -1,104 +0,0 @@
----
-layout: default
-title: Spark Bindings
-theme:
-    name: retro-mahout
----
-
-# Scala & Spark Bindings:
-*Bringing algebraic semantics*
-
-## What is Scala & Spark Bindings?
-
-In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like the following (an actual formula from **(d)spca**)
-        
-
-`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]`
-
-bound to in-core and distributed computations (currently, on Apache Spark).
-
-
-Mahout Scala & Spark Bindings expression of the above:
-
-        val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
-
-The main idea is that a scientist writing algebraic expressions should not have to care about distributed 
-operation plans and works **entirely on the logical level**, just as he or she would with R.
-
-Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added, 
-this implies **"write once, run everywhere"**.
-
-The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed
-Row Matrices (DRMs).
-
-The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%, 
-colSums, nrow, and length, operating over vectors or matrices. 
-
-An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole 
-and figures out how it can be simplified and which physical operators should be picked. For example,
-there are currently about 5 different physical operators performing DRM-DRM multiplication,
-picked based on matrix geometry, distributed dataset partitioning, orientation, etc. 
-If we count DRM-by-in-core combinations, that is another 4, i.e. 9 in total, all of it behind the 
-simple x %*% y logical notation.
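-
-For illustration, here is what this looks like in code. This is a minimal sketch, assuming the standard Samsara imports and a distributed context already in scope (e.g. in the Mahout shell); the matrices are illustrative only:
-
-    import org.apache.mahout.math._
-    import scalabindings._
-    import RLikeOps._
-    import drm._
-    import RLikeDRMOps._
-
-    // two small in-core matrices, parallelized into DRMs
-    val inCoreA = dense((1, 2), (3, 4))
-    val inCoreB = dense((1, 1), (2, 2))
-    val A = drmParallelize(inCoreA)
-    val B = drmParallelize(inCoreB)
-
-    // the same logical %*% notation; the optimizer picks the physical operator
-    val C  = A %*% B          // DRM times DRM
-    val C2 = A %*% inCoreB    // DRM times in-core matrix
-    val inCoreC = C.collect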
-
-Please refer to the documentation for details.
-
-## Status
-
-This environment mostly addresses R-like linear algebra optimizations for 
-Spark, Flink and H2O.
-
-
-## Documentation
-
-* Scala and Spark bindings manual: [web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), [pdf](ScalaSparkBindings.pdf), [pptx](MahoutScalaAndSparkBindings.pptx)
-* [Spark Bindings FAQ](faq.html)
-<!-- dead link* Overview blog on 0.10.x releases: [blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html) -->
-
-## Distributed methods and solvers using Bindings
-
-* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- see the bindings manual
-* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- see the bindings manual
-* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the bindings manual 
-* [Current list of algorithms](https://mahout.apache.org/users/basics/algorithms.html)
-
-[ssvd]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[spca]: https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[dssvd]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
-[dspca]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
-[dqrThin]: https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
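-
-Invocation of the distributed versions looks like the following sketch, lifted from the distributed DSL reference; it assumes the decompositions imports and a DRM `drmA` already in scope, and the parameter values shown are illustrative:
-
-    // distributed thin QR
-    val (drmQ, incoreR) = dqrThin(drmA)
-
-    // distributed stochastic SVD and PCA
-    val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)
-    val (drmU2, drmV2, s2) = dspca(drmA, k = 30, q = 1)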
-
-## Reading RDDs and DataFrames into DRMs
-TODO
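-
-Until this section is filled in, here is a minimal sketch of getting at the RDD behind a DRM and wrapping an RDD back into a DRM. It assumes the Mahout Spark shell (or an application with the same imports and a `SparkContext` named `sc`); `drmWrap`, `sc2sdc` and `.rdd` are used as shown elsewhere in these docs:
-
-    import org.apache.mahout.math._
-    import scalabindings._
-    import RLikeOps._
-    import drm._
-    import RLikeDRMOps._
-    import org.apache.mahout.sparkbindings._
-
-    // wrap the SparkContext into a Mahout distributed context
-    implicit val sdc: SparkDistributedContext = sc2sdc(sc)
-
-    // any RDD of (row key, row vector) tuples can back a DRM;
-    // here we just borrow the RDD behind a parallelized in-core matrix
-    val rddA = drmParallelize(dense((1, 2, 3), (4, 5, 6))).checkpoint().rdd
-
-    // wrap the RDD as a distributed row matrix (DRM)
-    val drmA = drmWrap(rddA)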
-
-
-TODO: Do we still want this? (I don't think so...)
-## Related history of note 
-
-* CLI and Driver for Spark version of item similarity -- [MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541)
-* Command line interface for generalizable Spark pipelines -- [MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569)
-* Cooccurrence Analysis / Item-based Recommendation -- [MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464)
-* Spark Bindings -- [MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346)
-* Scala Bindings -- [MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297)
-* Interactive Scala & Spark Bindings Shell & Script processor -- [MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489)
-* OLS tutorial using Mahout shell -- [MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542)
-* Full abstraction of DRM apis and algorithms from a distributed engine -- [MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529)
-* Port Naive Bayes -- [MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)
-
-## Work in progress 
-* Text-delimited files for input and output -- [MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568)
-<!-- * Weighted (Implicit Feedback) ALS -- [MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) -->
-<!--* Data frame R-like bindings -- [MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) -->
-
-* *Your issue here!*
-
-<!-- ## Stuff wanted: 
-* Data frame R-like bindings (similarly to linalg bindings)
-* Stat R-like bindings (perhaps we can just adapt to commons.math stat)
-* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! 
-* In-core jBlas matrix adapter
-* In-core GPU matrix adapters -->
-
-
-
-  
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/index.md
----------------------------------------------------------------------
diff --git a/website/docs/index.md b/website/docs/index.md
deleted file mode 100755
index 9d7f667..0000000
--- a/website/docs/index.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-layout: page
-title: Welcome to the Docs
-tagline: Apache Mahout from 30,000 feet (10,000 meters)
----
-
-
-You've probably already noticed Mahout has a lot of things going on at different levels, and it can be hard to know where
-to start.  Let's provide an overview to help you see how the pieces fit together. In general the stack is something like this:
-
-1. Application Code
-1. Samsara Scala-DSL (Syntactic Sugar)
-1. Logical/Physical DAG
-1. Engine Bindings
-1. Code runs in Engine
-1. Native Solvers 
-
-## Application Code
-
-You have a Java/Scala application (skip this if you're working from an interactive shell or Apache Zeppelin):
-
-    
-    def main(args: Array[String]) {
-
-      println("Welcome to My Mahout App")
-
-      if (args.isEmpty) {
-        // ... print usage, or fall through to your Spark ETL and Mahout code
-      }
-    }
-
-This may seem like a trivial part to call out, but the point is important: Mahout runs _inline_ with your regular application 
-code. E.g. if this is an Apache Spark app, then you do all your Spark things, including ETL and data prep, in the same 
-application, and then invoke Mahout's mathematically expressive Scala DSL when you're ready to do math on it.
-
-## Samsara Scala-DSL (Syntactic Sugar)
-
-So when you get to a point in your code where you're ready to math it up (in this example, on Spark), you can elegantly express 
-yourself mathematically.
-
-    implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc)
-    
-    val A = drmWrap(rddA)
-    val B = drmWrap(rddB) 
-    
-    val C = A.t %*% A + A %*% B.t
-    
-We've defined a `MahoutDistributedContext` (which is a wrapper around the Spark context), and two Distributed Row Matrices (DRMs),
-which are wrappers around RDDs (in Spark).  
-
-## Logical / Physical DAG
-
-At this point a bit of optimization happens.  For example, consider the expression
-    
-    A.t %*% A
-    
-Which is 
-<center>\(\mathbf{A^\intercal A}\)</center>
-
-Transposing a large matrix is a very expensive thing to do, and in this case we don't actually need to do it. There is a
-more efficient way to calculate \(\mathbf{A^\intercal A}\) that doesn't require a physical transpose. 
-
-(Image showing this)
-
-Mahout converts this code into something that looks like:
-
-    OpAtA(A) + OpABt(A, B) //  illustrative pseudocode with real functions called
-
-There's a little more magic that happens at this level, but the punchline is _Mahout translates the pretty Scala into
-a series of operators, which at the next level are implemented on the engine_.
-
-## Engine Bindings and Engine Level Ops
-
-When one creates new engine bindings, one is in essence defining:
-
-1. What the engine-specific underlying structure for a DRM is (in Spark it's an RDD).  The underlying structure also has 
-rows of `MahoutVector`s, so in Spark `RDD[(index, MahoutVector)]`.  This will be important when we get to the native solvers. 
-1. A set of BLAS (basic linear algebra) functions for working on the underlying structure; in Spark this means 
-implementing things like `AtA` on an RDD. See [the sparkbindings on github](https://github.com/apache/mahout/tree/master/spark/src/main/scala/org/apache/mahout/sparkbindings)
-
-Now your mathematically expressive Samsara Scala code has been translated into optimized engine-specific functions.
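-
-To make the shape of a binding concrete, here is a purely illustrative sketch; the trait and method names below are hypothetical, not the actual Mahout engine-binding API:
-
-    // hypothetical sketch of what an engine binding has to supply
-    trait EngineBinding {
-      // the engine-specific structure backing a DRM,
-      // e.g. RDD[(Int, MahoutVector)] on Spark
-      type DrmBacking
-
-      // engine-level BLAS on that structure
-      def atA(a: DrmBacking): DrmBacking                    // A' A
-      def aBt(a: DrmBacking, b: DrmBacking): DrmBacking     // A B'
-    }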
-
-## Native Solvers
-
-Recall how I said the rows of the DRMs are `org.apache.mahout.math.Vector`.  Here is where this becomes important. I'm going 
-to explain this in the context of Spark, but the principles apply to all distributed backends. 
-
-If you are familiar with how mapping and reducing work in Spark, then envision this RDD of `MahoutVector`s: each partition, 
-an indexed collection of vectors, is a _block_ of the distributed matrix. However, this _block_ is totally in-core, and therefore 
-is treated like an in-core matrix. 
-
-Now Mahout defines its own in-core BLAS packs and refers to them as _Native Solvers_.  The default native solver is just plain 
-old JVM, which is painfully slow, but works just about anywhere.  
-
-When the data gets to a node, an operation on the matrix block is called.  In the same way that Mahout converts abstract 
-operators on the DRM into implementations on the various distributed engines, it converts abstract operators on the in-core matrices 
-and vectors into implementations on the various native solvers. 
-
-The default "native solver" is the JVM, which isn't native at all, and if no actual native solvers are present, operations 
-will fall back to this. However, IF a native solver is present (its jar was added to the notebook), then the magic will happen.
-
-Imagine we still have our Spark executor: it has this block of a matrix sitting in memory. Now let's suppose the `ViennaCL-OMP`
-native solver is in use.  When Spark calls an operation on this in-core matrix, the matrix dumps out of the JVM and the 
-calculation is carried out on _all available CPUs_. 
-
-In a similar way, the `ViennaCL` native solver dumps the matrix out of the JVM and looks for a GPU to execute the operations on.
- 
-Once the operations are complete, the result is loaded back up into the JVM, and Spark (or whatever distributed engine) 
-ships it back to the driver. 
-
-The native solver operations are only defined on `org.apache.mahout.math.Vector` and `org.apache.mahout.math.Matrix`, which is 
-why it is critical that the underlying structure is composed row-wise of `Vector`s or `Matrix` blocks. 
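-
-For a sense of what gets dispatched, the in-core operations look like ordinary Samsara in-core algebra (a small sketch using the in-core DSL imports from the Samsara reference); whether it runs on the plain JVM, ViennaCL-OMP, or ViennaCL depends on which native solver jars are present:
-
-    import org.apache.mahout.math._
-    import scalabindings._
-    import RLikeOps._
-
-    // in-core matrices: the structures the native solvers actually operate on
-    val A = dense((1.0, 2.0), (3.0, 4.0))
-    val B = dense((5.0, 6.0), (7.0, 8.0))
-
-    // in-core multiply; the backing solver is chosen at runtime
-    val C = A %*% B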
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/mahout-samsara/faq.md
----------------------------------------------------------------------
diff --git a/website/docs/mahout-samsara/faq.md b/website/docs/mahout-samsara/faq.md
deleted file mode 100644
index af9d466..0000000
--- a/website/docs/mahout-samsara/faq.md
+++ /dev/null
@@ -1,51 +0,0 @@
----
-layout: page
-title: Mahout Samsara
-theme:
-    name: mahout2
----
-# FAQ for using Mahout with Spark
-
-**Q: Mahout Spark shell doesn't start; "ClassNotFound" problems or various classpath problems.**
-
-**A:** As of this writing, all reported problems starting the Spark shell in Mahout have revolved 
-around classpath issues of one kind or another. 
-
-If you are getting method-signature-like errors, you most probably have a mismatch between Mahout's Spark dependency 
-and the Spark version actually installed (at the time of this writing HEAD depends on Spark 1.1.0, but check mahout/pom.xml).
-
-Troubleshooting general classpath issues is pretty straightforward. Since Mahout uses Spark's installation 
-and the classpath reported by Spark itself for Spark-related dependencies, it is important to make sure 
-the classpath is sane and is made available to Mahout:
-
-1. Check that Spark is the correct version (the same as in Mahout's poms), is compiled, and that SPARK_HOME is set.
-2. Check that Mahout is compiled and that MAHOUT_HOME is set.
-3. Run `$SPARK_HOME/bin/compute-classpath.sh` and make sure it produces a sane result with no errors. 
-If it outputs something other than a straightforward classpath string, most likely Spark is not compiled/set up correctly (later Spark versions require 
-`sbt/sbt assembly` to be run; simply running `sbt/sbt publish-local` is no longer enough).
-4. Run `$MAHOUT_HOME/bin/mahout -spark classpath` and check that the path reported in step (3) is included.
-
-**Q: I am using the command-line Mahout jobs that run on Spark, or am writing my own application that uses 
-Mahout's Spark code. When I run the code on my cluster I get ClassNotFound or signature errors during serialization. 
-What's wrong?**
- 
-**A:** The Spark artifacts in the Maven ecosystem may not match the exact binary you are running on your cluster. This may 
-cause class name or version mismatches. In this case you may wish 
-to build Spark yourself to guarantee that you are running exactly what you are building Mahout against. To do this, follow these steps 
-in order:
-
-1. Build Spark with Maven, but **do not** use the "package" target as described on the Spark site. Build with the "clean install" target instead. 
-Something like "mvn clean install -Dhadoop1.2.1", or whatever your particular build options are. This will put the jars for Spark 
-in the local Maven cache.
-2. Deploy **your** Spark build to your cluster and test it there.
-3. Build Mahout. This will cause Maven to pull the jars for Spark from the local Maven cache and may resolve missing 
-or mis-identified classes.
-4. If you are building your own code, do so against the local builds of Spark and Mahout.
-
-**Q: The implicit SparkContext 'sc' does not work in the Mahout spark-shell.**
-
-**A:** In the Mahout spark-shell the SparkContext is called 'sdc', where the 'd' stands for distributed. 
-
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/mahout-samsara/in-core-reference.md
----------------------------------------------------------------------
diff --git a/website/docs/mahout-samsara/in-core-reference.md b/website/docs/mahout-samsara/in-core-reference.md
deleted file mode 100644
index a3f78dc..0000000
--- a/website/docs/mahout-samsara/in-core-reference.md
+++ /dev/null
@@ -1,303 +0,0 @@
----
-layout: page
-title: Mahout Samsara In Core
-theme:
-    name: mahout2
----
-## Mahout-Samsara's In-Core Linear Algebra DSL Reference
-
-#### Imports
-
-The following imports are used to enable Mahout-Samsara's Scala DSL bindings for in-core Linear Algebra:
-
-    import org.apache.mahout.math._
-    import scalabindings._
-    import RLikeOps._
-    
-#### Inline initialization
-
-Dense vectors:
-
-    val denseVec1: Vector = (1.0, 1.1, 1.2)
-    val denseVec2 = dvec(1, 0, 1.1, 1.2)
-
-Sparse vectors:
-
-    val sparseVec1: Vector = (5 -> 1.0) :: (10 -> 2.0) :: Nil
-    val sparseVec1 = svec((5 -> 1.0) :: (10 -> 2.0) :: Nil)
-
-    // to create a vector with specific cardinality
-    val sparseVec1 = svec((5 -> 1.0) :: (10 -> 2.0) :: Nil, cardinality = 20)
-    
-Inline matrix initialization, either sparse or dense, is always done row wise. 
-
-Dense matrices:
-
-    val A = dense((1, 2, 3), (3, 4, 5))
-    
-Sparse matrices:
-
-    val A = sparse(
-              (1, 3) :: Nil,
-              (0, 2) :: (1, 2.5) :: Nil
-                  )
-
-Diagonal matrix with constant diagonal elements:
-
-    diag(3.5, 10)
-
-Diagonal matrix with main diagonal backed by a vector:
-
-    diagv((1, 2, 3, 4, 5))
-    
-Identity matrix:
-
-    eye(10)
-    
-#### Slicing and Assigning
-
-Getting a vector element:
-
-    val d = vec(5)
-
-Setting a vector element:
-    
-    vec(5) = 3.0
-    
-Getting a matrix element:
-
-    val d = m(3,5)
-    
-Setting a matrix element:
-
-    M(3,5) = 3.0
-    
-Getting a matrix row or column:
-
-    val rowVec = M(3, ::)
-    val colVec = M(::, 3)
-    
-Setting a matrix row or column via vector assignment:
-
-    M(3, ::) := (1, 2, 3)
-    M(::, 3) := (1, 2, 3)
-    
-Setting a subslice of a matrix row or column:
-
-    a(0, 0 to 1) = (3, 5)
-   
-Setting a subslice of a matrix row or column via vector assignment:
-
-    a(0, 0 to 1) := (3, 5)
-   
-Getting a matrix from a contiguous block of another matrix:
-
-    val B = A(2 to 3, 3 to 4)
-   
-Assigning a contiguous block to a matrix:
-
-    A(0 to 1, 1 to 2) = dense((3, 2), (3 ,3))
-   
-Assigning a contiguous block to a matrix using the matrix assignment operator:
-
-    A(0 to 1, 1 to 2) := dense((3, 2), (3, 3))
-   
-Assignment operator used for copying between vectors or matrices:
-
-    vec1 := vec2
-    M1 := M2
-   
-Assignment operator using assignment through a functional literal for a matrix:
-
-    M := ((row, col, x) => if (row == col) 1 else 0)
-    
-Assignment operator using assignment through a functional literal for a vector:
-
-    vec := ((index, x) => sqrt(x))
-    
-#### BLAS-like operations
-
-Plus/minus either vector or numeric with assignment or not:
-
-    a + b
-    a - b
-    a + 5.0
-    a - 5.0
-    
-Hadamard (elementwise) product, either vector or matrix or numeric operands:
-
-    a * b
-    a * 0.5
-
-Operations with assignment:
-
-    a += b
-    a -= b
-    a += 5.0
-    a -= 5.0
-    a *= b
-    a *= 5
-   
-*Some nuanced rules*: 
-
-1/x in R (where x is a vector or a matrix) is the elementwise inverse.  In Scala it would be expressed as:
-
-    val xInv = 1 /: x
-
-and R's 5.0 - x would be:
-   
-    val x1 = 5.0 -: x
-    
-*note: All assignment operations, including :=, return the assignee just like in C++*:
-
-    a -= b 
-    
-assigns **a - b** to **a** (in-place) and returns **a**.  Similarly for **a /=: b** or **1 /=: v**, where the assignee is the right-hand operand. 
-    
-
-Dot product:
-
-    a dot b
-    
-Matrix and vector equivalency (or non-equivalency).  **Dangerous, exact equivalence is rarely useful, better to use norm comparisons with an allowance of small errors.**
-    
-    a === b
-    a !== b
-    
-Matrix multiply:    
-
-    a %*% b
-    
-Optimized Right Multiply with a diagonal matrix: 
-
-    diag(5, 5) :%*% b
-   
-Optimized Left Multiply with a diagonal matrix:
-
-    A %*%: diag(5, 5)
-
-Second norm, of a vector or matrix:
-
-    a.norm
-    
-Transpose:
-
-    val Mt = M.t
-    
-*note: Transposition is currently handled via view, i.e. updating a transposed matrix will be updating the original.*  Also computing something like `\(\mathbf{X^\top}\mathbf{X}\)`:
-
-    val XtX = X.t %*% X
-    
-will not therefore incur any additional data copying.
-
-#### Decompositions
-
-Matrix decompositions require an additional import:
-
-    import org.apache.mahout.math.decompositions._
-
-
-All arguments in the following are matrices.
-
-**Cholesky decomposition**
-
-    val ch = chol(M)
-    
-**SVD**
-
-    val (U, V, s) = svd(M)
-    
-**EigenDecomposition**
-
-    val (V, d) = eigen(M)
-    
-**QR decomposition**
-
-    val (Q, R) = qr(M)
-    
-**Rank**: Check for rank deficiency (runs rank-revealing QR)
-
-    M.isFullRank
-   
-**In-core SSVD**
-
-    val (U, V, s) = ssvd(A, k = 50, p = 15, q = 1)
-    
-**Solving linear equation systems and matrix inversion:** fully similar to R semantics; there are three forms of invocation:
-
-
-Solve `\(\mathbf{AX}=\mathbf{B}\)`:
-
-    solve(A, B)
-   
-Solve `\(\mathbf{Ax}=\mathbf{b}\)`:
-  
-    solve(A, b)
-   
-Compute `\(\mathbf{A^{-1}}\)`:
-
-    solve(A)
-   
-#### Misc
-
-Vector cardinality:
-
-    a.length
-    
-Matrix cardinality:
-
-    m.nrow
-    m.ncol
-    
-Means and sums:
-
-    m.colSums
-    m.colMeans
-    m.rowSums
-    m.rowMeans
-    
-Copy-By-Value:
-
-    val b = a cloned
-    
-#### Random Matrices
-
-`\(\mathcal{U}\)`(0,1) random matrix view:
-
-    val incCoreA = Matrices.uniformView(m, n, seed)
-
-    
-`\(\mathcal{U}\)`(-1,1) random matrix view:
-
-    val incCoreA = Matrices.symmetricUniformView(m, n, seed)
-
-`\(\mathcal{N}\)`(0,1) random matrix view:
-
-    val incCoreA = Matrices.gaussianView(m, n, seed)
-    
-#### Iterators 
-
-Mahout-Math already exposes a number of iterators.  Scala code just needs the following imports to enable implicit conversions to scala iterators.
-
-    import collection._
-    import JavaConversions._
-    
-Iterating over rows in a Matrix:
-
-    for (row <- m) {
-      ... do something with row
-    }
-    
-<!--Iterating over non-zero and all elements of a vector:
-*Note that Vector.Element also has some implicit syntatic sugar, e.g to add 5.0 to every non-zero element of a matrix, the following code may be used:*
-
-    for (row <- m; el <- row.nonZero) el = 5.0 + el
-    ... or 
-    for (row <- m; el <- row.nonZero) el := 5.0 + el
-    
-Similarly **row.all** produces an iterator over all elements in a row (Vector). 
--->
-
-For more information including information on Mahout-Samsara's out-of-core Linear algebra bindings see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/mahout-samsara/out-of-core-reference.md
----------------------------------------------------------------------
diff --git a/website/docs/mahout-samsara/out-of-core-reference.md b/website/docs/mahout-samsara/out-of-core-reference.md
deleted file mode 100644
index 3642e49..0000000
--- a/website/docs/mahout-samsara/out-of-core-reference.md
+++ /dev/null
@@ -1,317 +0,0 @@
----
-layout: page
-title: Mahout Samsara Out of Core
-theme:
-    name: mahout2
----
-# Mahout-Samsara's Distributed Linear Algebra DSL Reference
-
-**Note: this page is meant only as a quick reference to Mahout-Samsara's R-Like DSL semantics.  For more information, including information on Mahout-Samsara's Algebraic Optimizer please see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf).**
-
-The subjects of this reference are solely applicable to Mahout-Samsara's **DRM** (distributed row matrix).
-
-In this reference, DRMs will be denoted as e.g. `A`, and in-core matrices as e.g. `inCoreA`.
-
-#### Imports 
-
-The following imports are used to enable seamless in-core and distributed algebraic DSL operations:
-
-    import org.apache.mahout.math._
-    import scalabindings._
-    import RLikeOps._
-    import drm._
-    import RLikeDRMOps._
-    
-If working with mixed scala/java code:
-    
-    import collection._
-    import JavaConversions._
-    
-If you are working with Mahout-Samsara's Spark-specific operations e.g. for context creation:
-
-    import org.apache.mahout.sparkbindings._
-    
-The Mahout shell does all of these imports automatically.
-
-
-#### DRM Persistence operators
-
-**Mahout-Samsara's DRM persistence to HDFS is compatible with all Mahout-MapReduce algorithms such as seq2sparse.**
-
-
-Loading a DRM from (HD)FS:
-
-    drmDfsRead(path = hdfsPath)
-     
-Parallelizing from an in-core matrix:
-
-    val inCoreA = dense((1, 2, 3), (3, 4, 5))
-    val A = drmParallelize(inCoreA)
-    
-Creating an empty DRM:
-
-    val A = drmParallelizeEmpty(100, 50)
-    
-Collecting to driver's jvm in-core:
-
-    val inCoreA = A.collect
-    
-**Warning: The collection of distributed matrices happens implicitly whenever conversion to an in-core (o.a.m.math.Matrix) type is required. E.g.:**
-
-    val inCoreA: Matrix = ...
-    val drmB: DrmLike[Int] =...
-    val inCoreC: Matrix = inCoreA %*%: drmB
-    
-**implies (inCoreA %*%: drmB).collect**
-
-Collecting to (HD)FS as a Mahout's DRM formatted file:
-
-    A.dfsWrite(path = hdfsPath)
-    
-#### Logical algebraic operators on DRM matrices:
-
-A logical set of operators is defined for distributed matrices as a subset of those defined for in-core matrices.  In particular, since all distributed matrices are immutable, there are no assignment operators (e.g. **A += B**). 
-*Note: please see [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf) for information on Mahout-Samsara's Algebraic Optimizer and the translation from logical operations to a physical plan for the back end.*
- 
-    
-Cache a DRM and trigger an optimized physical plan: 
-
-    drmA.checkpoint(CacheHint.MEMORY_AND_DISK)
-   
-Other valid caching instructions:
-
-    drmA.checkpoint(CacheHint.NONE)
-    drmA.checkpoint(CacheHint.DISK_ONLY)
-    drmA.checkpoint(CacheHint.DISK_ONLY_2)
-    drmA.checkpoint(CacheHint.MEMORY_ONLY)
-    drmA.checkpoint(CacheHint.MEMORY_ONLY_2)
-    drmA.checkpoint(CacheHint.MEMORY_ONLY_SER)
-    drmA.checkpoint(CacheHint.MEMORY_ONLY_SER_2)
-    drmA.checkpoint(CacheHint.MEMORY_AND_DISK_2)
-    drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER)
-    drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER_2)
-
-*Note: Logical DRM operations are lazily computed.  Currently the actual computations and optional caching will be triggered by dfsWrite(...), collect(...) and blockify(...).*
-
-
-
-Transposition:
-
-    A.t
- 
-Elementwise addition *(Matrices of identical geometry and row key types)*:
-  
-    A + B
-
-Elementwise subtraction *(Matrices of identical geometry and row key types)*:
-
-    A - B
-    
-Elementwise multiplication (Hadamard) *(Matrices of identical geometry and row key types)*:
-
-    A * B
-    
-Elementwise division *(Matrices of identical geometry and row key types)*:
-
-    A / B
-    
-**Elementwise operations involving one in-core argument (int-keyed DRMs only)**:
-
-    A + inCoreB
-    A - inCoreB
-    A * inCoreB
-    A / inCoreB
-    A :+ inCoreB
-    A :- inCoreB
-    A :* inCoreB
-    A :/ inCoreB
-    inCoreA +: B
-    inCoreA -: B
-    inCoreA *: B
-    inCoreA /: B
-
-Note the Spark associativity change (e.g. `A *: inCoreB` means `B.leftMultiply(A)`, the same as when both arguments are in-core). Whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.
-    
-**Matrix-matrix multiplication %*%**:
-
-`\(\mathbf{M}=\mathbf{AB}\)`
-
-    A %*% B
-    A %*% inCoreB
-    A %*% inCoreDiagonal
-    A %*%: B
-
-
-*Note: same as above, whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.*
- 
-**Matrix-vector multiplication %*%**
-Currently we support a right-multiply product of a DRM and an in-core Vector (`\(\mathbf{Ax}\)`), resulting in a single-column DRM, which can then be collected to the front end (usually the desired outcome):
-
-    val Ax = A %*% x
-    val inCoreX = Ax.collect(::, 0)
-    
-
-**Matrix-scalar +,-,*,/**
-Elementwise operations of every matrix element and a scalar:
-
-    A + 5.0
-    A - 5.0
-    A :- 5.0
-    5.0 -: A
-    A * 5.0
-    A / 5.0
-    5.0 /: A
-    
-Note that `5.0 -: A` means `\(m_{ij} = 5 - a_{ij}\)` and `5.0 /: A` means `\(m_{ij} = \frac{5}{a_{ij}}\)` for all elements of the result.
-    
-    
-#### Slicing
-
-General slice:
-
-    A(100 to 200, 100 to 200)
-    
-Horizontal Block:
-
-    A(::, 100 to 200)
-    
-Vertical Block:
-
-    A(100 to 200, ::)
-    
-*Note: if the row range is not the all-range (::) then the DRM must be `Int`-keyed.  General-case row slicing is not supported by DRMs with key types other than `Int`*.
-
-
-#### Stitching
-
-Stitch side by side (cbind R semantics):
-
-    val drmAnextToB = drmA cbind drmB
-    
-Stitch side by side (Scala):
-
-    val drmAnextToB = drmA.cbind(drmB)
-    
-Analogously, vertical concatenation is available via **rbind**
-
-#### Custom pipelines on blocks
-Internally, Mahout-Samsara's DRM is represented as a distributed set of vertical (Key, Block) tuples.
-
-**drm.mapBlock(...)**:
-
-The DRM operator `mapBlock` provides transformational access to the distributed vertical blockified tuples of a matrix (Row-Keys, Vertical-Matrix-Block).
-
-Using `mapBlock` to add 1.0 to a DRM:
-
-    val inCoreA = dense((1, 2, 3), (2, 3 , 4), (3, 4, 5))
-    val drmA = drmParallelize(inCoreA)
-    val drmB = drmA.mapBlock() {
-        case (keys, block) => keys -> (block += 1.0)
-    }
-    
-#### Broadcasting Vectors and matrices to closures
-Generally we can create and use one-way closure attributes to be used on the back end.
-
-Scalar matrix multiplication:
-
-    val factor: Int = 15
-    val drm2 = drm1.mapBlock() {
-        case (keys, block) => block *= factor
-        keys -> block
-    }
-
-**Closure attributes must be java-serializable. Currently Mahout's in-core Vectors and Matrices are not java-serializable, and must be broadcast to the closure using `drmBroadcast(...)`**:
-
-    val v: Vector ...
-    val bcastV = drmBroadcast(v)
-    val drm2 = drm1.mapBlock() {
-        case (keys, block) =>
-            for(row <- 0 until block.nrow) block(row, ::) -= bcastV
-        keys -> block    
-    }
-
-#### Computations providing ad-hoc summaries
-
-
-Matrix cardinality:
-
-    drmA.nrow
-    drmA.ncol
-
-*Note: depending on the stage of optimization, these may trigger a computational action.  I.e. if one calls `nrow()` n times, then the back end will actually recompute `nrow` n times.*
-    
-Means and sums:
-
-    drmA.colSums
-    drmA.colMeans
-    drmA.rowSums
-    drmA.rowMeans
-    
- 
-*Note: These will always trigger a computational action.  I.e. if one calls `colSums()` n times, then the back end will actually recompute `colSums` n times.*
-
-#### Distributed Matrix Decompositions
-
-To import the decomposition package:
-    
-    import org.apache.mahout.math._
-    import decompositions._
-    
-Distributed thin QR:
-
-    val (drmQ, incoreR) = dqrThin(drmA)
-    
-Distributed SSVD:
- 
-    val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)
-    
-Distributed SPCA:
-
-    val (drmU, drmV, s) = dspca(drmA, k = 30, q = 1)
-
-Distributed regularized ALS:
-
-    val (drmU, drmV, i) = dals(drmA,
-                            k = 50,
-                            lambda = 0.0,
-                            maxIterations = 10,
-                            convergenceThreshold = 0.10)
-                            
-#### Adjusting parallelism of computations
-
-Set the minimum parallelism to 100 for computations on `drmA`:
-
-    drmA.par(min = 100)
- 
-Set the exact parallelism to 100 for computations on `drmA`:
-
-    drmA.par(exact = 100)
-
-
-Set the engine specific automatic parallelism adjustment for computations on `drmA`:
-
-    drmA.par(auto = true)
-
-#### Retrieving the engine specific data structure backing the DRM:
-
-**A Spark RDD:**
-
-    val myRDD = drmA.checkpoint().rdd
-    
-**An H2O Frame and Key Vec:**
-
-    val myFrame = drmA.frame
-    val myKeys = drmA.keys
-    
-**A Flink DataSet:**
-
-    val myDataSet = drmA.ds
-    
-For more information including information on Mahout-Samsara's Algebraic Optimizer and in-core Linear algebra bindings see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
-
-
-
-    
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/native-solvers/cuda.md
----------------------------------------------------------------------
diff --git a/website/docs/native-solvers/cuda.md b/website/docs/native-solvers/cuda.md
deleted file mode 100644
index 1ec7807..0000000
--- a/website/docs/native-solvers/cuda.md
+++ /dev/null
@@ -1,6 +0,0 @@
----
-layout: page
-title: Native Solvers- CUDA
-theme:
-    name: mahout2
----

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/native-solvers/viennacl-omp.md
----------------------------------------------------------------------
diff --git a/website/docs/native-solvers/viennacl-omp.md b/website/docs/native-solvers/viennacl-omp.md
deleted file mode 100644
index 7540ad3..0000000
--- a/website/docs/native-solvers/viennacl-omp.md
+++ /dev/null
@@ -1,6 +0,0 @@
----
-layout: page
-title: Native Solvers- ViennaCL-OMP
-theme:
-    name: mahout2
----
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/native-solvers/viennacl.md
----------------------------------------------------------------------
diff --git a/website/docs/native-solvers/viennacl.md b/website/docs/native-solvers/viennacl.md
deleted file mode 100644
index d41e0f7..0000000
--- a/website/docs/native-solvers/viennacl.md
+++ /dev/null
@@ -1,6 +0,0 @@
----
-layout: page
-title: Native Solvers- ViennaCL
-theme:
-    name: mahout2
----
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/quickstart.md
----------------------------------------------------------------------
diff --git a/website/docs/quickstart.md b/website/docs/quickstart.md
deleted file mode 100644
index 2a947ad..0000000
--- a/website/docs/quickstart.md
+++ /dev/null
@@ -1,63 +0,0 @@
----
-layout: default
-title: Quickstart
-theme: 
-    name: mahout2
----
-# Mahout Quick Start 
-# TODO : Fill this in with the bare essential basics
-
-
-
-# Mahout MapReduce Overview
-
-## Getting Mahout
-
-#### Download the latest release
-
-Download the latest release [here](http://www.apache.org/dyn/closer.cgi/mahout/).
-
-Or checkout the latest code from [here](http://mahout.apache.org/developers/version-control.html)
-
-#### Alternatively: Add Mahout 0.13.0 to a maven project
-
-Mahout is also available via a [maven repository](http://mvnrepository.com/artifact/org.apache.mahout) under the group id *org.apache.mahout*.
-If you would like to import the latest release of Mahout into a Java project, add the following dependency to your *pom.xml*:
-
-    <dependency>
-        <groupId>org.apache.mahout</groupId>
-        <artifactId>mahout-mr</artifactId>
-        <version>0.13.0</version>
-    </dependency>
- 
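-If your project uses sbt rather than Maven, the equivalent dependency (assuming the same coordinates) would be:
-
-    libraryDependencies += "org.apache.mahout" % "mahout-mr" % "0.13.0"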
-
-## Features
-
-For a full list of Mahout's features see our [Features by Engine](http://mahout.apache.org/users/basics/algorithms.html) page.
-
-    
-## Using Mahout
-
-Mahout provides a number of examples and tutorials to help users quickly learn how to use its machine learning algorithms.
-
-#### Recommendations
-
-Check the [Recommender Quickstart](/users/recommender/quickstart.html) or the tutorial on [creating a user-based recommender in 5 minutes](/users/recommender/userbased-5-minutes.html).
-
-If you are building a recommender system for the first time, please also refer to a list of [Dos and Don'ts](/users/recommender/recommender-first-timer-faq.html) that might be helpful.
-
-#### Clustering
-
-Check the [Synthetic data](/users/clustering/clustering-of-synthetic-control-data.html) example.
-
-#### Classification
-
-If you are interested in how to train a **Naive Bayes** model, look at the [20 newsgroups](/users/classification/twenty-newsgroups.html) example.
-
-If you plan to build a **Hidden Markov Model** for speech recognition, the example [here](/users/classification/hidden-markov-models.html) might be instructive. 
-
-Or you could build a **Random Forest** model by following this [quick start page](/users/classification/partial-implementation.html).
-
-#### Working with Text 
-
-If you need to convert raw text into word vectors as input to clustering or classification algorithms, please refer to this page on [how to create vectors from text](/users/basics/creating-vectors-from-text.html).

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/screenshots/landing.png
----------------------------------------------------------------------
diff --git a/website/docs/screenshots/landing.png b/website/docs/screenshots/landing.png
deleted file mode 100644
index d879e46..0000000
Binary files a/website/docs/screenshots/landing.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/screenshots/mr-algos.png
----------------------------------------------------------------------
diff --git a/website/docs/screenshots/mr-algos.png b/website/docs/screenshots/mr-algos.png
deleted file mode 100644
index 34b4f53..0000000
Binary files a/website/docs/screenshots/mr-algos.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/screenshots/tutorials.png
----------------------------------------------------------------------
diff --git a/website/docs/screenshots/tutorials.png b/website/docs/screenshots/tutorials.png
deleted file mode 100644
index 500187a..0000000
Binary files a/website/docs/screenshots/tutorials.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/tutorials/cco-lastfm/cco-lastfm.scala
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/cco-lastfm/cco-lastfm.scala b/website/docs/tutorials/cco-lastfm/cco-lastfm.scala
deleted file mode 100644
index 6ba46a9..0000000
--- a/website/docs/tutorials/cco-lastfm/cco-lastfm.scala
+++ /dev/null
@@ -1,83 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
-*/
-
-/*
- * Download data from: http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
- * then run this in the mahout shell.
- */
-
-import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
-
-// We need to turn our raw text files into RDD[(String, String)] 
-val userTagsRDD = sc.textFile("/path/to/lastfm/user_taggedartists.dat").map(line => line.split("\t")).map(a => (a(0), a(2))).filter(_._1 != "userID")
-val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)
-
-val userArtistsRDD = sc.textFile("/path/to/lastfm/user_artists.dat").map(line => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID")
-val userArtistsIDS = IndexedDatasetSpark.apply(userArtistsRDD)(sc)
-
-val userFriendsRDD = sc.textFile("/path/to/data/lastfm/user_friends.dat").map(line => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID")
-val userFriendsIDS = IndexedDatasetSpark.apply(userFriendsRDD)(sc)
-
-import org.apache.mahout.math.cf.SimilarityAnalysis
-
-val artistReccosLlrDrmListByArtist = SimilarityAnalysis.cooccurrencesIDSs(Array(userArtistsIDS, userTagsIDS, userFriendsIDS), maxInterestingItemsPerThing = 20, maxNumInteractions = 500, randomSeed = 1234)
-
-// Anonymous User
-
-val artistMap = sc.textFile("/path/to/lastfm/artists.dat").map(line => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != "name").collect.toMap
-val tagsMap = sc.textFile("/path/to/lastfm/tags.dat").map(line => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != "tagValue").collect.toMap
-
-// Watch your skin- you're not wearing armour. (This will fail on misspelled artists.)
-// This is necessary because the ids are integer-strings already, and for this demo I didn't want to change them to Integer types (because more often you'll have strings).
-val kilroyUserArtists = svec( (userArtistsIDS.columnIDs.get(artistMap("Beck")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("David Bowie")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Gary Numan")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Less Than Jake")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Lou Reed")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Parliament")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Radiohead")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Seu Jorge")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("The Skatalites")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Reverend Horton Heat")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Talking Heads")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Tom Waits")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Waylon Jennings")).get, 1) ::
-  (userArtistsIDS.columnIDs.get(artistMap("Wu-Tang Clan")).get, 1) :: Nil, cardinality = userArtistsIDS.columnIDs.size
-)
-
-val kilroyUserTags = svec(
-  (userTagsIDS.columnIDs.get(tagsMap("classical")).get, 1) ::
-  (userTagsIDS.columnIDs.get(tagsMap("skacore")).get, 1) ::
-  (userTagsIDS.columnIDs.get(tagsMap("why on earth is this just a bonus track")).get, 1) ::
-  (userTagsIDS.columnIDs.get(tagsMap("punk rock")).get, 1) :: Nil, cardinality = userTagsIDS.columnIDs.size)
-
-val kilroysRecs = (artistReccosLlrDrmListByArtist(0).matrix %*% kilroyUserArtists + artistReccosLlrDrmListByArtist(1).matrix %*% kilroyUserTags).collect
-
-
-import org.apache.mahout.math.scalabindings.MahoutCollections._
-import collection._
-import JavaConversions._
-
-// Which Users I should Be Friends with.
-println(kilroysRecs(::, 0).toMap.toList.sortWith(_._2 > _._2).take(5))
-
-/**
-  * So there you have it- the basis for a new dating/friend finding app based on musical preferences which
-  * is actually a pretty dope idea.
-  *
-  * Solving for which bands a user might like is left as an exercise to the reader.
-  */
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/tutorials/cco-lastfm/index.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/cco-lastfm/index.md b/website/docs/tutorials/cco-lastfm/index.md
deleted file mode 100644
index ca95f9d..0000000
--- a/website/docs/tutorials/cco-lastfm/index.md
+++ /dev/null
@@ -1,151 +0,0 @@
----
-layout: tutorial
-title: CCOs with Last.fm
-theme:
-    name: mahout2
----
-
-Most recommender examples use the MovieLens dataset, but that relies only on ratings, which makes the recommender being demonstrated look fairly trivial.  Right next to the MovieLens dataset is the LastFM dataset.  The LastFM dataset has ratings by user, friends of the user, bands listened to by user, and tags by user.  This is the kind of exciting dataset we’d like to work with!
-
-Start by downloading the LastFM dataset from 
-http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
-
-I’m going to assume you’ve unzipped them to /path/to/lastfm/*
-We’re going to use a new trick for creating our IndexedDatasets, the `apply` function.  `apply` takes an `RDD[(String, String)]`, that is, an RDD of tuples where both elements are strings. We load RDDs, and use Spark to manipulate the RDDs into this form.  The files from LastFM are tab separated; it should be noted that this could just as easily be done from log files, it would just take a touch more Spark-Fu.  
-
-The second important thing to note is that the first element in each tuple is going to be the row in the resulting matrix, the second element will be the column, and at that position there will be a one.  The BiDictionary will automatically be created from the strings. 
-For those following along at home- [the full Scala worksheet](cco-lastfm.scala) might be easier than copying and pasting 
-from this page.
-
-```
-import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
-
-val userTagsRDD = sc.textFile("/path/to/lastfm/user_taggedartists.dat")
-.map(line => line.split("\t"))
-.map(a => (a(0), a(2)))
-.filter(_._1 != "userID")
-val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)
-
-val userArtistsRDD = sc.textFile("/path/to/lastfm/user_artists.dat")
-.map(line => line.split("\t"))
-.map(a => (a(0), a(1)))
-.filter(_._1 != "userID")
-val userArtistsIDS = IndexedDatasetSpark.apply(userArtistsRDD)(sc)
-
-val userFriendsRDD = sc.textFile("/path/to/lastfm/user_friends.dat")
-.map(line => line.split("\t"))
-.map(a => (a(0), a(1)))
-.filter(_._1 != "userID")
-val userFriendsIDS = IndexedDatasetSpark.apply(userFriendsRDD)(sc)
-```
-
-How much easier was that?! In each RDD creation we:
-
-Load our data using sc.textFile
-    
-    sc.textFile("/path/to/lastfm/user_taggedartists.dat")
-
-Split the data into an array based on tabs (\t)
-
-    .map(line => line.split("\t"))
-
-Pull the userID column into the first position of the tuple, and the other attribute we want into the second position.
-
-    .map(a => (a(0), a(1)))
-
-Remove the header (the only line that will have “userID” in that position)
-
-    .filter(_._1 != "userID")
-
-Then we easily create an IndexedDataset using the `apply` method:
-
-    val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc)
-
-Note the `(sc)` at the end. You may or may not need that.  `sc` is the SparkContext and should be passed as an implicit parameter; however, the REPL environment (e.g. the Mahout shell or notebooks) has a hard time with the implicits, so I had to pass it explicitly.  
-
-Now we compute our co-occurrence matrices:
-```scala
-import org.apache.mahout.math.cf.SimilarityAnalysis
-
-val artistReccosLlrDrmListByArtist = SimilarityAnalysis.cooccurrencesIDSs(
-Array(userArtistsIDS, userTagsIDS, userFriendsIDS), 
-maxInterestingItemsPerThing = 20,
-maxNumInteractions = 500, 
-randomSeed = 1234)
-```
-
-
-Let’s see an example of how this would work-
-
-First we have a small problem. If you look at our original input files, the userIDs, artistIDs, and tags were all integers. We loaded them as strings, and if you look at the BiDictionaries associated with each IDS, you’ll see they map the original integers-as-strings to the integer indices of our matrix. Not super helpful.  There are other files which contain mappings from LastFM ID to human-readable band and tag names.  I could have sorted this out in the beginning, but I chose to do it on the back side because it is a bit of clever Spark/Scala only needed to work around a quirk in this particular dataset.  We have to reverse map a few things if we want to input ‘human readable’ attributes, which I did.  If this doesn’t make sense, please don’t be discouraged; the important part was above, and this is just some magic for working with this dataset in a pretty way. 
-
-First I load, and create incore maps from the mapping files:
-
-```scala
-val artistMap = sc.textFile("/path/to/lastfm/artists.dat")
-  .map(line => line.split("\t"))
-  .map(a => (a(1), a(0)))
-  .filter(_._1 != "name")
-  .collect
-  .toMap
-
-val tagsMap = sc.textFile("/path/to/lastfm/tags.dat")
-  .map(line => line.split("\t"))
-  .map(a => (a(1), a(0)))
-  .filter(_._1 != "tagValue")
-  .collect
-  .toMap
-
-```
-
-This will create some `Map`s that I can use to type readable names for the artist and tags to create my ‘history’.
-
-```scala
-val kilroyUserArtists = svec( (userArtistsIDS.columnIDs.get(artistMap("Beck")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("David Bowie")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Gary Numan")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Less Than Jake")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Lou Reed")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Parliament")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Radiohead")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Seu Jorge")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("The Skatalites")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Reverend Horton Heat")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Talking Heads")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Tom Waits")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Waylon Jennings")).get, 1) ::
- (userArtistsIDS.columnIDs.get(artistMap("Wu-Tang Clan")).get, 1) :: Nil, 
- cardinality = userArtistsIDS.columnIDs.size
-)
-
-
-
-val kilroyUserTags = svec(
- (userTagsIDS.columnIDs.get(tagsMap("classical")).get, 1) ::
- (userTagsIDS.columnIDs.get(tagsMap("skacore")).get, 1) ::
- (userTagsIDS.columnIDs.get(tagsMap("why on earth is this just a bonus track")).get, 1) ::
- (userTagsIDS.columnIDs.get(tagsMap("punk rock")).get, 1) :: Nil,
- cardinality = userTagsIDS.columnIDs.size)
-```
-
-So what we have, then, is me typing a name into `artistMap`, where the keys are human-readable names of my favorite bands, which returns the LastFM ID, which in turn is the key in the BiDictionary map, which returns the matrix position.  I’m making a sparse vector where the index I just fetched (which in a roundabout way refers to the artist I specified) has the value 1.  
-
-Same idea for the tags. 
-
-I now have two history vectors.  I didn’t make one for the users table, because I don’t have any friends on LastFM yet. That’s about to change though, because I’m about to have some friends recommended to me. 
-
-    val kilroysRecs = (artistReccosLlrDrmListByArtist(0).matrix %*% kilroyUserArtists + artistReccosLlrDrmListByArtist(1).matrix %*% kilroyUserTags).collect
-
-Finally, let’s sort that vector out and get some user IDs and strengths. 
-```scala
-import org.apache.mahout.math.scalabindings.MahoutCollections._
-import collection._
-import JavaConversions._
-
-// Which Users I should Be Friends with.
-println(kilroysRecs(::, 0).toMap.toList.sortWith(_._2 > _._2).take(5))
-
-```
-
-`kilroysRecs` is actually a one-column matrix, so we take that and then convert it into something we can sort. We then take the top 5 suggestions.  Keep in mind, this will return the Mahout user ID, which you would also have to reverse map back to the LastFM userID.  The LastFM userID is just another Integer, and not particularly exciting, so I left that out. 
-
-If you wanted to recommend artists like a normal recommendation engine- you would change the first position in all of the input matrices to be “artistID”. This is left as an exercise to the user. 
-
-[Full Scala Worksheet](cco-lastfm.scala)

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/tutorials/eigenfaces/eigenfaces.png
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/eigenfaces/eigenfaces.png b/website/docs/tutorials/eigenfaces/eigenfaces.png
deleted file mode 100644
index b388575..0000000
Binary files a/website/docs/tutorials/eigenfaces/eigenfaces.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/tutorials/eigenfaces/index.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/eigenfaces/index.md b/website/docs/tutorials/eigenfaces/index.md
deleted file mode 100644
index 08f3bb6..0000000
--- a/website/docs/tutorials/eigenfaces/index.md
+++ /dev/null
@@ -1,128 +0,0 @@
----
-layout: tutorial
-title: Eigenfaces Demo
-theme:
-   name: mahout3
----
-
-*Credit: [original blog post by rawkintrevo](https://rawkintrevo.org/2016/11/10/deep-magic-volume-3-eigenfaces/). This will be maintained through version changes, blog post will not.*
-
-*Eigenfaces* are an image equivalent(ish) of *eigenvectors*, if you recall your high school linear algebra classes. If you don't recall, [read Wikipedia](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors); otherwise, they are a set of 'faces' whose linear combinations can be used to represent other faces.
-
-There are lots of "image recognition" approaches out there right now, and deep learning is the popular one everyone is talking about.
-Deep learning will admittedly do better at recognizing and correctly classifying faces, however it does so at a price.
-1. Neural networks are very costly to train in the first place
-1. Every time a new person is added, the neural network must be retrained to recognize the new person
-
-The advantage/use-case for the eigenfaces approach is when new faces are being added regularly. Even when building a 
-production-grade eigenfaces-based system, neural networks still have a place: _identifying faces_ in images and creating _centered and scaled_ images around 
-the face.  This is scalable because we only need to train our neural network to detect, center, and scale faces once.  E.g. 
-a neural network would be deployed as one microservice, and eigenfaces would be deployed as another microservice.
-
-A production version ends up looking something like this:
-- Image comes in and is fed to the 'detect faces, center, scale' neural-network-based microservice.
-- The neural network microservice detects faces, centers and scales them, and passes each face to the eigenfaces microservice.
-- For each face:<br>
-    a. Decompose the face into a linear combination of eigenfaces<br>
-    b. Determine whether that linear-combination vector is close enough to any existing vector to declare a match<br>
-    c. If there is no match, add a "new person" to the face corpus. (A rough sketch of steps a and b follows below.)
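-
-The following is only a hedged sketch of steps a and b, written against the matrices produced later in this tutorial and run inside the Mahout spark-shell (where the math scalabindings are already imported). `newFaceVec` (a new, mean-centered face as a Mahout vector), `knownPeople` (a map of person ID to previously stored weight vector), and `threshold` are all assumed names, not part of the tutorial's code.
-
-```scala
-// Sketch only: project a new centered face onto the eigenfaces, then do a
-// nearest-neighbor check in eigenspace.
-// eigenFaces: k x pixels in-core matrix (collected from drmV.t, as done below)
-// newFaceVec: the new face, mean-centered the same way as the training images
-val weights = eigenFaces %*% newFaceVec            // k-dimensional representation
-
-val (bestId, bestDist) = knownPeople
-  .map { case (id, w) => (id, (w - weights).norm(2)) }   // Euclidean distance
-  .minBy(_._2)
-
-val matchedPerson = if (bestDist < threshold) Some(bestId) else None
-```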
-
-### Get the data
-
-The first thing we're going to do is collect a set of 13,232 face images (250x250 pixels) from the <a href="http://vis-www.cs.umass.edu/lfw/">Labeled Faces in the Wild</a> data set.
-
-    cd /tmp
-    mkdir eigenfaces
-    wget http://vis-www.cs.umass.edu/lfw/lfw-deepfunneled.tgz
-    tar -xzf lfw-deepfunneled.tgz
-
-### Load dependencies
-
-    cd $MAHOUT_HOME/bin
-    ./mahout spark-shell \
-        --packages com.sksamuel.scrimage:scrimage-core_2.10:2.1.0,com.sksamuel.scrimage:scrimage-io-extra_2.10:2.1.0,com.sksamuel.scrimage:scrimage-filters_2.10:2.1.0
-    
-
-
-### Create a DRM of Vectorized Images
-
-```scala
-import com.sksamuel.scrimage._
-import com.sksamuel.scrimage.filter.GrayscaleFilter
-
-// Read the raw image bytes, grayscale each image, and flatten its pixels into a
-// DenseVector; zipWithIndex assigns each image an integer row key.
-val imagesRDD: DrmRdd[Int] = sc.binaryFiles("/tmp/lfw-deepfunneled/*/*", 500)
-  .map(o => new DenseVector(Image.apply(o._2.toArray)
-    .filter(GrayscaleFilter)
-    .pixels
-    .map(p => p.toInt.toDouble / 10000000)))  // scale the packed pixel ints down
-  .zipWithIndex
-  .map(o => (o._2.toInt, o._1))
-
-// Wrap the RDD as a distributed row matrix (DRM) and checkpoint it.
-val imagesDRM = drmWrap(rdd = imagesRDD).par(min = 500).checkpoint()
-
-println(s"Dataset: ${imagesDRM.nrow} images, ${imagesDRM.ncol} pixels per image")
-```
-
-### Mean Center the Images
-
-```scala
-import org.apache.mahout.math.algorithms.preprocessing.MeanCenter
-
-
-// Fit the column means on the full dataset, then subtract them from every image.
-val scaler: MeanCenterModel = new MeanCenter().fit(imagesDRM)
-
-val centeredImages = scaler.transform(imagesDRM)
-```
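-
-As an optional sanity check (not part of the original walkthrough), you can confirm the centering worked: after `transform`, the column means should be approximately zero.
-
-```scala
-// Column means of the centered DRM should be ~0 (up to floating-point noise).
-val residual = centeredImages.colMeans.norm(2)
-println(s"L2 norm of post-centering column means: $residual")
-```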
-
-
-### Calculate the Eigenimages via DS-SVD
-
-```scala
-import org.apache.mahout.math._
-import decompositions._
-import drm._
-
-val(drmU, drmV, s) = dssvd(centeredImages, k= 20, p= 15, q = 0)
-```
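-
-If you are curious how much structure each eigenface captures, `s` (the in-core vector of singular values returned by `dssvd`) can be inspected directly; this peek is optional and not part of the original post.
-
-```scala
-// Singular values, largest first; larger values correspond to eigenfaces that
-// explain more of the variation across the face images.
-println(s)
-```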
-
-### Write the Eigenfaces to Disk
-
-```scala
-import java.io.File
-import javax.imageio.ImageIO
-
-val sampleImagePath = "/tmp/lfw-deepfunneled/Aaron_Eckhart/Aaron_Eckhart_0001.jpg"
-val sampleImage = ImageIO.read(new File(sampleImagePath))  
-val w = sampleImage.getWidth
-val h = sampleImage.getHeight
-
-val eigenFaces = drmV.t.collect(::,::)
-val colMeans = scaler.colCentersV
-
-for (i <- 0 until 20) {
-    // Un-center and un-scale the eigenface so it is back in pixel range.
-    val v = (eigenFaces(i, ::) + colMeans) * 10000000
-    val output = new Array[com.sksamuel.scrimage.Pixel](v.size)
-    // Inner index renamed to j so it doesn't shadow the outer i.
-    for (j <- 0 until v.size) {
-        output(j) = Pixel(v.get(j).toInt)
-    }
-    val image = Image(w, h, output)
-    image.output(new File(s"/tmp/eigenfaces/${i}.png"))
-}
-```
-
-### View the Eigenfaces
-
-If using Zeppelin, the following can be used to generate a fun table of the Eigenfaces:
-
-```python
-%python
- 
-r = 4  # rows in the HTML table
-c = 5  # columns in the HTML table
-print '%html\n<table style="width:100%">' + "".join(["<tr>" + "".join([ '<td><img src="/tmp/eigenfaces/%i.png"></td>' % (i + j) for j in range(0, c) ]) + "</tr>" for i in range(0, r * c, c) ]) + '</table>'
-
-```
-
-![Eigenfaces](eigenfaces.png)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/978d4467/website/docs/tutorials/intro-cooccurrence-spark/index.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/intro-cooccurrence-spark/index.md b/website/docs/tutorials/intro-cooccurrence-spark/index.md
deleted file mode 100644
index d7d0185..0000000
--- a/website/docs/tutorials/intro-cooccurrence-spark/index.md
+++ /dev/null
@@ -1,446 +0,0 @@
----
-layout: algorithm
-title: Intro to Cooccurrence Recommenders with Spark
-theme:
-    name: retro-mahout
----
-
-# Intro to Cooccurrence Recommenders with Spark
-
-Mahout provides several important building blocks for creating recommendations using Spark. *spark-itemsimilarity* can 
-be used to create "other people also liked these things" type recommendations and paired with a search engine can 
-personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content based 
-recommendations and when paired with a search engine can be used to personalize content based recommendations.
-
-![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
-
-This is a simplified Lambda architecture with Mahout's *spark-itemsimilarity* playing the batch model building role and a search engine playing the realtime serving role.
-
-You will create two collections, one for user history and one for item "indicators". Indicators are built from user interactions that lead to the wished-for interaction. For example, if you want users to purchase something and you collect all users' purchase interactions, *spark-itemsimilarity* will create a purchase indicator from them. But you can also use other user interactions, in a cross-cooccurrence calculation, to create purchase indicators. 
-
-User history is used as a query on the item collection with its cooccurrence and cross-cooccurrence indicators (there may be several indicators). The primary interaction or action is picked to be the thing you want to recommend; other actions are believed to be correlated but may not indicate exactly the same user intent. For instance, in an ecom recommender a purchase is a very good primary action, but you may also have recorded product detail-views or additions-to-wishlists. These can be considered secondary actions, which may all be used to calculate cross-cooccurrence indicators. The user history that forms the recommendations query will contain recorded primary and secondary actions, all targeted at the correct indicator fields.
-
-## References
-
-1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
-2. A slide deck, which talks about mixing actions or other indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
-3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
-and  [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
-4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
-
-Below are the command line jobs but the drivers and associated code can also be customized and accessed from the Scala APIs.
-
-## 1. spark-itemsimilarity
-*spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have a userID, an itemID, and optionally a value. It will produce one or more indicator matrices created by comparing every user's interactions with every other user's. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version there were several possible similarity measures, but these are being deprecated in favor of LLR because in practice it performs the best.
-
-Mahout's mapreduce version of itemsimilarity takes a text file that is expected to have user and item IDs that conform to 
-Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix.
-
-*spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will 
-account for multi-modal interactions and create cross-cooccurrence indicator matrices allowing the use of much more data in 
-creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights. 
-For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's
-cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want
-to recommend.   
-
-
-    spark-itemsimilarity Mahout 1.0
-    Usage: spark-itemsimilarity [options]
-    
-    Input, output options
-      -i <value> | --input <value>
-            Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
-      -i2 <value> | --input2 <value>
-            Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty.
-      -o <value> | --output <value>
-            Path for output, any local or HDFS supported URI (required)
-    
-    Algorithm control options:
-      -mppu <value> | --maxPrefs <value>
-            Max number of preferences to consider per user (optional). Default: 500
-      -m <value> | --maxSimilaritiesPerItem <value>
-            Limit the number of similarities per item to this number (optional). Default: 100
-    
-    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
-    
-    Input text file schema options:
-      -id <value> | --inDelim <value>
-            Input delimiter character (optional). Default: "[,\t]"
-      -f1 <value> | --filter1 <value>
-            String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used
-      -f2 <value> | --filter2 <value>
-            String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected
-      -rc <value> | --rowIDColumn <value>
-            Column number (0 based Int) containing the row ID string (optional). Default: 0
-      -ic <value> | --itemIDColumn <value>
-            Column number (0 based Int) containing the item ID string (optional). Default: 1
-      -fc <value> | --filterColumn <value>
-            Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter
-    
-    Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." and all rows will be used
-    
-    File discovery options:
-      -r | --recursive
-            Searched the -i path recursively for files that match --filenamePattern (optional), Default: false
-      -fp <value> | --filenamePattern <value>
-            Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory
-    
-    Output text file schema options:
-      -rd <value> | --rowKeyDelim <value>
-            Separates the rowID key from the vector values list (optional). Default: "\t"
-      -cd <value> | --columnIdStrengthDelim <value>
-            Separates column IDs from their values in the vector values list (optional). Default: ":"
-      -td <value> | --elementDelim <value>
-            Separates vector element values in the values list (optional). Default: " "
-      -os | --omitStrength
-            Do not write the strength to the output files (optional), Default: false.
-    This option is used to output indexable data for creating a search engine recommender.
-    
-    Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
-    
-    Spark config options:
-      -ma <value> | --master <value>
-            Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
-      -sem <value> | --sparkExecutorMem <value>
-            Max Java heap available as "executor memory" on each node (optional). Default: 4g
-      -rs <value> | --randomSeed <value>
-            
-      -h | --help
-            prints this usage text
-
-This looks daunting, but the defaults are fairly sane, it takes exactly the same input as the legacy code, and it is pretty flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or a filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations.
-
-See ItemSimilarityDriver.scala in Mahout's spark module if you want to customize the code. 
-
-### Defaults in the _**spark-itemsimilarity**_ CLI
-
-If all defaults are used the input can be as simple as:
-
-    userID1,itemID1
-    userID2,itemID2
-    ...
-
-With the command line:
-
-
-    bash$ mahout spark-itemsimilarity --input in-file --output out-dir
-
-
-This will use the "local" Spark context and will output the standard text version of a DRM
-
-    itemID1<tab>itemID2:value2<space>itemID10:value10...
-
-### <a name="multiple-actions">How To Use Multiple User Actions</a>
-
-Often we record various actions the user takes for later analytics. These can now be used to make recommendations. 
-The idea of a recommender is to recommend the action you want the user to make. For an ecom app this might be 
-a purchase action. It is usually not a good idea to just treat other actions the same as the action you want to recommend. 
-For instance a view of an item does not indicate the same intent as a purchase and if you just mixed the two together you 
-might even make worse recommendations. It is tempting though since there are so many more views than purchases. With *spark-itemsimilarity*
-we can now use both actions. Mahout will use cross-action cooccurrence analysis to limit the views to ones that do predict purchases.
-We do this by treating the primary action (purchase) as data for the indicator matrix and use the secondary action (view) 
-to calculate the cross-cooccurrence indicator matrix.  
-
-*spark-itemsimilarity* can read separate actions from separate files or from a mixed action log by filtering certain lines. For a mixed 
-action log of the form:
-
-    u1,purchase,iphone
-    u1,purchase,ipad
-    u2,purchase,nexus
-    u2,purchase,galaxy
-    u3,purchase,surface
-    u4,purchase,iphone
-    u4,purchase,galaxy
-    u1,view,iphone
-    u1,view,ipad
-    u1,view,nexus
-    u1,view,galaxy
-    u2,view,iphone
-    u2,view,ipad
-    u2,view,nexus
-    u2,view,galaxy
-    u3,view,surface
-    u3,view,nexus
-    u4,view,iphone
-    u4,view,ipad
-    u4,view,galaxy
-
-### Command Line
-
-
-Use the following options:
-
-    bash$ mahout spark-itemsimilarity \
-    	--input in-file \     # where to look for data
-        --output out-path \   # root dir for output
-        --master masterUrl \  # URL of the Spark master server
-        --filter1 purchase \  # word that flags input for the primary action
-        --filter2 view \      # word that flags input for the secondary action
-        --itemIDColumn 2 \    # column that has the item ID
-        --rowIDColumn 0 \     # column that has the user ID
-        --filterColumn 1      # column that has the filter word
-
-
-
-### Output
-
-The output of the job will be the standard text version of two Mahout DRMs. This is a case where we are calculating 
-cross-cooccurrence, so a primary indicator matrix and a cross-cooccurrence indicator matrix will be created:
-
-    out-path
-      |-- similarity-matrix - TDF part files
-      \-- cross-similarity-matrix - TDF part-files
-
-The similarity-matrix will contain the lines:
-
-    galaxy\tnexus:1.7260924347106847
-    ipad\tiphone:1.7260924347106847
-    nexus\tgalaxy:1.7260924347106847
-    iphone\tipad:1.7260924347106847
-    surface
-
-The cross-similarity-matrix will contain:
-
-    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
-    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
-    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
-    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
-    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
-
-**Note:** You can run this multiple times to use more than two actions or you can use the underlying 
-SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators.
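-
-For orientation only, a hedged Scala sketch of the IndexedDataset flavor of that API is below. It assumes `purchasesIDS` and `viewsIDS` are `IndexedDataset`s already built from the two action logs (the primary action goes first); exact signatures and defaults can vary between Mahout versions, so check SimilarityAnalysis in the spark module before relying on it.
-
-    import org.apache.mahout.math.cf.SimilarityAnalysis
-
-    // purchasesIDS, viewsIDS: IndexedDatasets of (userID, itemID) interactions,
-    // assumed to have been constructed earlier; the primary action comes first.
-    val indicators = SimilarityAnalysis.cooccurrencesIDSs(Array(purchasesIDS, viewsIDS))
-
-    val purchaseIndicator  = indicators(0)  // cooccurrence (purchase x purchase)
-    val viewCrossIndicator = indicators(1)  // cross-cooccurrence (purchase x view)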
-
-### Log File Input
- 
-A common method of storing data is in log files. If they are written using some delimiter they can be consumed directly by spark-itemsimilarity. For instance input of the form:
-
-    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
-    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
-    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
-    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
-    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
-    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
-
-This can be parsed with the following CLI and run on the cluster, producing the same output as the above example.
-
-    bash$ mahout spark-itemsimilarity \
-        --input in-file \
-        --output out-path \
-        --master spark://sparkmaster:4044 \
-        --filter1 purchase \
-        --filter2 view \
-        --inDelim "\t" \
-        --itemIDColumn 4 \
-        --rowIDColumn 1 \
-        --filterColumn 2
-
-## 2. spark-rowsimilarity
-
-*spark-rowsimilarity* is the companion to *spark-itemsimilarity*; the primary difference is that it takes a text-file version of 
-a matrix of sparse vectors, with optional application-specific IDs, and it finds similar rows rather than items (columns). Its use is
-not limited to collaborative filtering. The input is in text-delimited form using three delimiters. By 
-default it reads (rowID&lt;tab>columnID1:strength1&lt;space>columnID2:strength2...). Since this job only supports LLR similarity,
- which does not use the input strengths, they may be omitted in the input. It writes 
-(rowID&lt;tab>rowID1:strength1&lt;space>rowID2:strength2...). 
-The output is sorted by strength, descending. The output can be interpreted as a row ID from the primary input followed 
-by a list of the most similar rows.
-
-The command line interface is:
-
-    spark-rowsimilarity Mahout 1.0
-    Usage: spark-rowsimilarity [options]
-    
-    Input, output options
-      -i <value> | --input <value>
-            Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
-      -o <value> | --output <value>
-            Path for output, any local or HDFS supported URI (required)
-    
-    Algorithm control options:
-      -mo <value> | --maxObservations <value>
-            Max number of observations to consider per row (optional). Default: 500
-      -m <value> | --maxSimilaritiesPerRow <value>
-            Limit the number of similarities per item to this number (optional). Default: 100
-    
-    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
-    
-    Output text file schema options:
-      -rd <value> | --rowKeyDelim <value>
-            Separates the rowID key from the vector values list (optional). Default: "\t"
-      -cd <value> | --columnIdStrengthDelim <value>
-            Separates column IDs from their values in the vector values list (optional). Default: ":"
-      -td <value> | --elementDelim <value>
-            Separates vector element values in the values list (optional). Default: " "
-      -os | --omitStrength
-            Do not write the strength to the output files (optional), Default: false.
-    This option is used to output indexable data for creating a search engine recommender.
-    
-    Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
-    
-    File discovery options:
-      -r | --recursive
-            Searched the -i path recursively for files that match --filenamePattern (optional), Default: false
-      -fp <value> | --filenamePattern <value>
-            Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory
-    
-    Spark config options:
-      -ma <value> | --master <value>
-            Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
-      -sem <value> | --sparkExecutorMem <value>
-            Max Java heap available as "executor memory" on each node (optional). Default: 4g
-      -rs <value> | --randomSeed <value>
-            
-      -h | --help
-            prints this usage text
-
-See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. 
-
-## 3. Using *spark-rowsimilarity* with Text Data
-
-Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance, given the tags associated with 
-a blog post, you can find which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is 
-the only similarity method supported, this is not the optimal way to determine general "bag-of-words" document similarity; 
-LLR is used more as a quality filter than as a similarity measure. However, *spark-rowsimilarity* will produce 
-lists of similar docs for every doc if the input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents.
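-
-As a hedged illustration of that last point, the sketch below turns raw text into the "rowID&lt;tab>term term term..." rows this job expects, using Lucene's StandardAnalyzer. `docID` and `docText` are assumed inputs, Lucene must be on the classpath wherever you do this preprocessing, and depending on your Lucene version the analyzer constructor may require a `Version` argument.
-
-    import org.apache.lucene.analysis.standard.StandardAnalyzer
-    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
-    import scala.collection.mutable.ArrayBuffer
-
-    // Tokenize a document into terms with Lucene's standard analyzer.
-    def tokenize(text: String): Seq[String] = {
-      val analyzer = new StandardAnalyzer()
-      val stream = analyzer.tokenStream("text", text)
-      val term = stream.addAttribute(classOf[CharTermAttribute])
-      val terms = ArrayBuffer[String]()
-      stream.reset()
-      while (stream.incrementToken()) terms += term.toString
-      stream.end(); stream.close(); analyzer.close()
-      terms.toList
-    }
-
-    // One input row for spark-rowsimilarity: docID<tab>space-delimited terms
-    val row = s"$docID\t${tokenize(docText).mkString(" ")}"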
-
-## <a name="unified-recommender">4. Creating a Multimodal Recommender</a>
-
-Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a multimodal cooccurrence and content-based
- recommender that can be used in either or both modes depending on the indicators available and the history available at 
-runtime for a user. Some slides describing this method can be found [here](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/).
-
-## Requirements
-
-1. Mahout SNAPSHOT-1.0 or later
-2. Hadoop
-3. Spark, the correct version for your version of Mahout and Hadoop
-4. A search engine like Solr or Elasticsearch
-
-## Indicators
-
-Indicators come in 3 types:
-
-1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
-2. **Content**: calculated from item metadata or content using *spark-rowsimilarity*
-3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item.
-
-The query for recommendations will be a mix of values meant to match one of your indicators. The query can be constructed 
-from user history and values derived from context (the category being viewed, for instance) or special precalculated data 
-(popularity rank, for instance). This blending of indicators allows for creating many flavors of recommendations to fit 
-a very wide variety of circumstances.
-
-With the right mix of indicators developers can construct a single query that works for completely new items and new users 
-while working well for items with lots of interactions and users with many recorded actions. In other words by adding in content and intrinsic 
-indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history
-and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes 
-recommendations.
-
-## Example with 3 Indicators
-
-You will need to decide how you store user action data so it can be processed by the item and row similarity jobs; 
-this is most easily done by using text files as described above. The data that is processed by these jobs is considered the 
-training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history, 
-but it need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user 
-history for query purposes could be done with a database by storing it in a users table. In the example above the two 
-collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other 
-descriptive metadata). 
-
-We will need to create one cooccurrence indicator from the primary action (purchase), one cross-action cooccurrence indicator 
-from the secondary action (view), 
-and one content indicator (tags). We'll have to run *spark-itemsimilarity* once and *spark-rowsimilarity* once.
-
-We have described how to create the collaborative filtering indicators for purchase and view (the [How to use Multiple User 
-Actions](#multiple-actions) section) but tags will be a slightly different process. We want to use the fact that 
-certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator 
-but rather a "content" or "metadata" type indicator, since you are not using other users' history, only that of the 
-individual you are making recs for. This means that this method will make recommendations for items that have 
-no collaborative filtering data, as happens with new items in a catalog. New items may have tags assigned but no one
- has purchased or viewed them yet. In the final query we will mix all 3 indicators.
-
-## Content Indicator
-
-To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find 
-items with the most similar tags. Notice that other users' behavior is not considered--only other items' tags. This defines a 
-content or metadata indicator. They are used when you want to find items that are similar to other items by using their 
-content or metadata, not by which users interacted with them.
-
-**Note**: It may be advisable to treat tags as cross-cooccurrence indicators but for the sake of an example they are treated here as content only.
-
-For this we need input of the form:
-
-    itemID<tab>list-of-tags
-    ...
-
-The full collection will look like the tags column from a catalog DB. For our ecom example it might be:
-
-    3459860b<tab>men long-sleeve chambray clothing casual
-    9446577d<tab>women tops chambray clothing casual
-    ...
-
-We'll use *spark-rowsimilarity* because we are looking for similar rows, which encode items in this case. As with the 
-collaborative filtering indicators we use the --omitStrength option. The strengths created are 
-probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling 
-is finished we no longer need the strengths. We will get an indicator matrix of the form:
-
-    itemID<tab>list-of-item IDs
-    ...
-
-This is a content indicator since it has found other items with similar content or metadata.
-
-    3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a
-    9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e
-    ...  
-    
-We now have three indicators, two collaborative filtering type and one content type.
-
-##  Multimodal Recommender Query
-
-The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. 
-
-We have 3 indicators; these are indexed by the search engine into 3 fields, which we'll call "purchase", "view", and "tags". 
-We take the user's history that corresponds to each indicator and create a query of the form:
-
-    Query:
-      field: purchase; q:user's-purchase-history
-      field: view; q:user's view-history
-      field: tags; q:user's-tags-associated-with-purchases
-      
-The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to 
-the ones the user has already purchased. 
-
-This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be 
-translated into recommendations. This technique can be used to skew recommendations towards intrinsic indicators also. 
-For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator 
-by tagging items with some category of popularity (hot, warm, cold for instance) then
-index that as a new indicator field and include the corresponding value in a query 
-on the popularity field. If we use the ecom example but use the query to get "hot" recommendations it might look like this:
-
-    Query:
-      field: purchase; q:user's-purchase-history
-      field: view; q:user's view-history
-      field: popularity; q:"hot"
-
-This will return recommendations favoring ones that have the intrinsic indicator "hot".
-
-## Notes
-1. Use as much user action history as you can gather. Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-cooccurrence indicators. Using more data in this fashion will almost always produce better recommendations.
-2. Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. They can be used alone or mixed with other indicators.
-3. Most search engines support "boost" factors so you can favor one or more indicators. In the example query, if you want tags to only have a small effect you could boost the CF indicators.
-4. In the examples we have used space delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays.

