From rawkintr...@apache.org
Subject [3/9] mahout git commit: WEBSITE Triage of Old Site Migration
Date Sat, 29 Apr 2017 23:24:52 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
b/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
new file mode 100644
index 0000000..2a157f6
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
@@ -0,0 +1,446 @@
+---
+layout: default
+title: Dimensional Reduction
+theme:
+   name: retro-mahout
+---
+
+# Support for dimensional reduction
+
+Matrix algebra underpins the way many Big Data algorithms and data
+structures are composed. Full-text search can be viewed as multiplying the
+term-document matrix by the query vector, giving a vector over documents
+whose components are the relevance scores. Computing co-occurrences in a
+collaborative filtering context (people who viewed X also viewed Y, or
+ratings-based CF as in the Netflix Prize contest) amounts to squaring the
+user-item interaction matrix. Finding users who are k degrees separated
+from each other in a social network or web graph comes down to the k-fold
+product of the graph adjacency matrix. The list goes on, and these are all
+cases where the linear structure of the matrix is preserved!
+
+Each of these examples deals with matrices which tend to be tremendously
+large (often millions to tens of millions to hundreds of millions of rows
+or more, by sometimes a comparable number of columns), but also rather
+sparse. Sparse matrices are nice in some respects: dense matrices which are
+10^7 on a side would have 100 trillion non-zero entries! But the sparsity
+is often problematic, because any two given rows (or columns) of the matrix
+may have zero overlap. Additionally, any machine-learning work done on the
+data which comprises the rows has to deal with what is known as "the curse
+of dimensionality": for example, there are too many columns to train most
+regression or classification models on them directly.
+
+One of the more useful approaches to dealing with such huge sparse data
+sets is the concept of dimensionality reduction, where a lower dimensional
+subspace (or sub-manifold) of the original column (feature) space of your
+data is found or constructed, and your rows are mapped into it.  In this
+reduced dimensional space, "important" components of the distance between
+points are exaggerated and unimportant ones washed away, and the sparsity
+of your rows is traded for dense "signatures" of drastically reduced
+dimension. While this loss of sparsity can lead to its own complications, a
+proper dimensionality reduction can help reveal the most important features
+of your data, expose correlations among your supposedly independent
+original variables, and smooth over the zeroes in your correlation matrix.
+
+One of the most straightforward techniques for dimensionality reduction is
+the matrix decomposition: singular value decomposition, eigen
+decomposition, non-negative matrix factorization, etc. In their truncated
+form these decompositions are an excellent first approach toward linearity
+preserving unsupervised feature selection and dimensional reduction. Of
+course, sparse matrices which don't fit in RAM need special treatment as
+far as decomposition is concerned. Parallelizable and/or stream-oriented
+algorithms are needed.
+
+<a name="DimensionalReduction-SingularValueDecomposition"></a>
+# Singular Value Decomposition
+
+Currently implemented in Mahout (as of 0.3, the first release with
+MAHOUT-180 applied) are two scalable implementations of SVD: a
+stream-oriented implementation using the Asymmetric Generalized Hebbian
+Algorithm outlined in Genevieve Gorrell & Brandyn Webb's paper ([Gorrell
+and Webb 2005](http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf)), and
+a [Lanczos](http://en.wikipedia.org/wiki/Lanczos_algorithm) implementation,
+available both single-threaded in the o.a.m.math.decomposer.lanczos package
+(math module) and as a Hadoop map-reduce (series of) job(s) in the
+o.a.m.math.hadoop.decomposer package (core module).  Coming soon:
+stochastic decomposition.
+
+See also: [SVD - Singular Value Decomposition](https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition)
+
+<a name="DimensionalReduction-Lanczos"></a>
+## Lanczos
+
+The Lanczos algorithm is designed for eigen-decomposition, but like any
+such algorithm, getting singular vectors out of it is immediate (the
+singular vectors of matrix A are just the eigenvectors of A^t * A or
+A * A^t).  Lanczos works by taking a starting seed vector *v* (with
+cardinality equal to the number of columns of the matrix A), and repeatedly
+multiplying A by the result: *v'* = A.times(*v*) (and then subtracting off
+what is proportional to previous *v'*'s, and building up an auxiliary
+matrix of projections).  In the case where A is not square (in general: not
+symmetric), you actually want to repeatedly multiply A^t * A by *v*:
+*v'* = (A^t * A).times(*v*), or equivalently, in Mahout,
+A.timesSquared(*v*) (timesSquared is merely an optimization: by changing
+the order of summation in (A^t * A).times(*v*), you can do the same
+computation in one pass over the rows of A instead of two).
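+
+To make the timesSquared optimization concrete, here is a small R sketch
+(illustrative only, not Mahout code) verifying that accumulating
+r * (r . v) over the rows r of A reproduces (A^t * A).times(*v*) in a
+single pass:
+
+    # one-pass computation of (A^t * A) %*% v, summing r * (r . v) over rows r
+    set.seed(1)
+    A <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)
+    v <- rnorm(4)
+
+    two.pass <- t(A) %*% (A %*% v)            # multiply by A, then by A^t
+
+    one.pass <- rep(0, 4)
+    for (i in 1:nrow(A)) {
+      r <- A[i, ]
+      one.pass <- one.pass + r * sum(r * v)   # accumulate r * (r . v)
+    }
+
+    max(abs(two.pass - one.pass))             # effectively zero: they agree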
+
+After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*-by-*k*
+tridiagonal matrix has been created (the auxiliary matrix mentioned above),
+out of which a good (often extremely good) approximation to *k* of the
+singular values of A may be efficiently extracted (and, with the basis
+spanned by the *v_i*, the *k* singular *vectors* as well).  Which *k*?
+It's actually a spread across the entire spectrum: the first few will
+most certainly be the largest singular values, and the bottom few will be
+the smallest, but you have no guarantee that just because you have the n'th
+largest singular value of A, you also have the (n-1)'st as well.  A good
+rule of thumb is to extract the top 3k singular vectors via Lanczos, and
+then discard the bottom two thirds, if you want primarily the largest
+singular values (which is the case when using Lanczos for dimensional
+reduction).
+
+<a name="DimensionalReduction-ParallelizationStragegy"></a>
+### Parallelization Strategy
+
+Lanczos is "embarassingly parallelizable": matrix multiplication of a
+matrix by a vector may be carried out row-at-a-time without communication
+until at the end, the results of the intermediate matrix-by-vector outputs
+are accumulated on one final vector.  When it's truly A.times(*v*), the
+final accumulation doesn't even have collision / synchronization issues
+(the outputs are individual separate entries on a single vector), and
+multicore approaches can be very fast, and there should also be tricks to
+speed things up on Hadoop.  In the asymmetric case, where the operation is
+A.timesSquared(*v*), the accumulation does require synchronization (the
+vectors to be summed have nonzero elements all across their range), but
+delaying writing to disk until Mapper close(), and remembering that having
+a Combiner be the same as the Reducer, the bottleneck in accumulation is
+nowhere near a single point.
+
+<a name="DimensionalReduction-Mahoutusage"></a>
+### Mahout usage
+
+The Mahout DistributedLanczosSolver is invoked by the
+<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
+arguments (which can be reproduced by just entering the command with no
+arguments):
+
+
+    Job-Specific Options:
+      --input (-i) input                      Path to job input directory.
+      --output (-o) output                    The directory pathname for output.
+      --numRows (-nr) numRows                 Number of rows of the input matrix
+      --numCols (-nc) numCols                 Number of columns of the input matrix
+      --rank (-r) rank                        Desired decomposition rank (note:
+                                              only roughly 1/4 to 1/3 of these will
+                                              have the top portion of the spectrum)
+      --symmetric (-sym) symmetric            Is the input matrix square and
+                                              symmetric?
+      --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
+                                              the eigenvectors after SVD
+      --maxError (-err) maxError              Maximum acceptable error
+      --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector for
+      --inMemory (-mem) inMemory              Buffer eigen matrix into memory (if you
+                                              have enough!)
+      --help (-h)                             Print out help
+      --tempDir tempDir                       Intermediate output directory
+      --startPhase startPhase                 First phase to run
+      --endPhase endPhase                     Last phase to run
+
+
+The short form invocation may be used to perform the SVD on the input data: 
+
+      <MAHOUT_HOME>/bin/mahout svd \
+      --input (-i) <Path to input matrix> \   
+      --output (-o) <The directory pathname for output> \	
+      --numRows (-nr) <Number of rows of the input matrix> \   
+      --numCols (-nc) <Number of columns of the input matrix> \
+      --rank (-r) <Desired decomposition rank> \
+      --symmetric (-sym) <Is the input matrix square and symmetric>    
+
+
+The --input argument is the location on HDFS of a
+SequenceFile<Writable,VectorWritable> (preferably containing
+SequentialAccessSparseVector instances) which you wish to decompose.  Each
+vector in it has --numCols entries.  --numRows is the number of input rows
+and is used to properly size the matrix data structures.
+
+After execution, the --output directory will have a file named
+"rawEigenvectors" containing the raw eigenvectors. As the
+DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
+eigenvalues aren't valid, and also scales all of the eigenvalues down by
+the max eigenvalue (to avoid floating point overflow), there is an
+additional step which emits the correctly scaled (and non-spurious)
+eigenvector/value pairs. This is done by the "cleansvd" shell
+script step (c.f. EigenVerificationJob).
+
+If you have run the short form svd invocation above and require this
+"cleaning" of the eigen/singular output, you can run "cleansvd" as a
+separate command:
+
+      <MAHOUT_HOME>/bin/mahout cleansvd \
+      --eigenInput <path to raw eigenvectors> \
+      --corpusInput <path to corpus> \
+      --output <path to output directory> \
+      --maxError <maximum allowed error. Default is 0.5> \
+      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
+      --inMemory <true if the eigenvectors can all fit into memory. Default false>
+
+
+The --corpusInput is the input path from the previous step, --eigenInput is
+the output from the previous step (<output>/rawEigenvectors), and --output
+is the desired output path (same as the svd argument). The two "cleaning"
+params are --maxError - the maximum allowed 1-cosAngle(v,
+A.timesSquared(v)) - and --minEigenvalue.  Eigenvectors which have too
+large an error, or too small an eigenvalue, are discarded.  Optional
+argument: --inMemory; if you have enough memory on your local machine (not
+on the hadoop cluster nodes!) to load all eigenvectors into memory at once
+(at least 8 bytes/double * rank * numCols), then you will see some speedups
+on this cleaning process.
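+
+As a rough, back-of-the-envelope check (using the rank and numCols from the
+EMR example later on this page, purely for illustration), that estimate
+comes out to only a few tens of megabytes:
+
+    # memory needed to buffer the eigen matrix for --inMemory (illustrative)
+    rank    <- 100
+    numCols <- 20444
+    rank * numCols * 8 / 2^20   # 8 bytes per double: roughly 15.6 MiB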
+
+After execution, the --output directory will have a file named
+"cleanEigenvectors" containing the clean eigenvectors. 
+
+These two steps can also be invoked together by the svd command by using
+the long form svd invocation:
+
+      <MAHOUT_HOME>/bin/mahout svd \
+      --input (-i) <Path to input matrix> \   
+      --output (-o) <The directory pathname for output> \	
+      --numRows (-nr) <Number of rows of the input matrix> \   
+      --numCols (-nc) <Number of columns of the input matrix> \
+      --rank (-r) <Desired decomposition rank> \
+      --symmetric (-sym) <Is the input matrix square and symmetric> \  
+      --cleansvd "true"   \
+      --maxError <maximum allowed error. Default is 0.5> \
+      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
+      --inMemory <true if the eigenvectors can all fit into memory. Default false>
+
+
+After execution, the --output directory will contain two files: the
+"rawEigenvectors" and the "cleanEigenvectors".
+
+TODO: also allow exclusion based on improper orthogonality (currently
+computed, but not checked against constraints).
+
+<a name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a>
+#### Example: SVD of ASF Mail Archives on Amazon Elastic MapReduce
+
+This section walks you through a complete example of running the Mahout SVD
+job on an Amazon Elastic MapReduce cluster and then preparing the output to
+be used for clustering. This example was developed as part of the effort to
+benchmark Mahout's clustering algorithms using a large document set (see
+[MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)).
+Specifically, we use the ASF mail archives located at
+http://aws.amazon.com/datasets/7791434387204566.  You will likely need to
+run seq2sparse on these first.  See
+$MAHOUT_HOME/examples/bin/build-asf-email.sh (on trunk) for examples of
+processing this data.
+
+At a high level, the steps we're going to perform are:
+
+    bin/mahout svd (original -> svdOut)
+    bin/mahout cleansvd ...
+    bin/mahout transpose svdOut -> svdT
+    bin/mahout transpose original -> originalT
+    bin/mahout matrixmult originalT svdT -> newMatrix
+    bin/mahout kmeans newMatrix
+
+The bulk of the content for this section was extracted from the Mahout user
+mailing list, see: [Using SVD with Canopy/KMeans](http://search.lucidimagination.com/search/document/6e5889ee6f0f253b/using_svd_with_canopy_kmeans#66a50fe017cebbe8)
+and [Need a little help with using SVD](http://search.lucidimagination.com/search/document/748181681ae5238b/need_a_little_help_with_using_svd#134fb2771fd52928)
+
+Note: Some of this work is due in part to credits donated by the Amazon
+Elastic MapReduce team.
+
+<a name="DimensionalReduction-1.LaunchEMRCluster"></a>
+##### 1. Launch EMR Cluster
+
+For a detailed explanation of the steps involved in launching an Amazon
+Elastic MapReduce cluster for running Mahout jobs, please read the
+"Building Vectors for Large Document Sets" section of
+[Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce).
+
+In the remaining steps below, remember to replace JOB_ID with the Job ID of
+your EMR cluster.
+
+<a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a>
+##### 2. Load Mahout 0.5+ JAR into S3
+
+These steps were created with the mahout-0.5-SNAPSHOT because they rely on
+the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639)
+
+<a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a>
+##### 3. Copy TFIDF Vectors into HDFS
+
+Before running your SVD job on the vectors, you need to copy them from S3
+to your EMR cluster's HDFS.
+
+
+    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
+      --arg s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors\
+      --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      -j JOB_ID
+
+
+<a name="DimensionalReduction-4.RuntheSVDJob"></a>
+##### 4. Run the SVD Job
+
+Now you're ready to run the SVD job on the vectors stored in HDFS:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg svd \
+      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors\
+      --arg -o --arg /asf-mail-archives/mahout/svd \
+      --arg --rank --arg 100 \
+      --arg --numCols --arg 20444 \
+      --arg --numRows --arg 6076937 \
+      --arg --cleansvd --arg "true" \
+      -j JOB_ID
+
+
+This will run 100 iterations of the LanczosSolver SVD job to produce 87
+eigenvectors in:
+
+
+    /asf-mail-archives/mahout/svd/cleanEigenvectors
+
+
+Only 87 eigenvectors were produced because of the cleanup step, which
+removes any duplicate eigenvectors caused by convergence issues and numeric
+overflow, and any that don't appear to be "eigen" enough (i.e., they don't
+satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix
+
+<a name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a>
+##### 5. Transform your TFIDF Vectors into Mahout Matrix
+
+The tfidf vectors created by the seq2sparse job are
+SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
+vectors into a matrix form that is a
+SequenceFile<IntWritable,VectorWritable> and a
+SequenceFile<IntWritable,Text> (where the original one is the join of these
+new ones, on the new int key).
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg rowid \
+      --arg -Dmapred.input.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      --arg -Dmapred.output.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix \
+      -j JOB_ID
+
+
+This is not a distributed job and will only run on the master server in
+your EMR cluster. The job produces the following output:
+
+
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/docIndex
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix
+
+
+where docIndex is the SequenceFile<IntWritable,Text> and matrix is
+SequenceFile<IntWritable,VectorWritable>.
+
+<a name="DimensionalReduction-6.TransposetheMatrix"></a>
+##### 6. Transpose the Matrix
+
+Our ultimate goal is to multiply the TFIDF vector matrix by our SVD
+eigenvectors. For the mathematically inclined: from the rowid job, we now
+have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
+is p x n (p=87, n=20444). To multiply these two matrices, we need to
+transpose E so that the number of columns in T equals the number of rows in
+E (i.e. E^T is n x p); the result of the matrixmult then gives us an m x p
+matrix (m=6076937, p=87).
+
+However, in practice, computing the matrix product of two matrices as a
+map-reduce job is efficiently done as a map-side join on two row-based
+matrices with the same number of rows, where only the columns differ.  In
+particular, if you take a matrix X which is represented as a set of
+numRowsX rows, each of which has numColsX columns, and another matrix Y
+with numRowsY == numRowsX rows, each of which has numColsY (!= numColsX)
+columns, then by summing the outer products of each of the numRowsX pairs
+of vectors, you get a matrix Z with numRowsZ == numColsX and numColsZ ==
+numColsY (if you instead take the reverse outer product of the vector
+pairs, you end up with the transpose of this result, with numRowsZ ==
+numColsY and numColsZ == numColsX). - Jake Mannix
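+
+To see why this row-wise, map-side join computes the product we want, here
+is a small R sketch (illustrative only, not Mahout code) showing that
+summing the per-row outer products of X and Y reproduces X^t * Y:
+
+    # sum of outer products of paired rows of X and Y equals t(X) %*% Y
+    set.seed(1)
+    X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)   # numRowsX = 5, numColsX = 3
+    Y <- matrix(rnorm(5 * 2), nrow = 5, ncol = 2)   # numRowsY = 5, numColsY = 2
+
+    Z <- matrix(0, nrow = ncol(X), ncol = ncol(Y))  # numColsX x numColsY
+    for (i in 1:nrow(X)) {
+      Z <- Z + outer(X[i, ], Y[i, ])                # one outer product per row pair
+    }
+
+    max(abs(Z - t(X) %*% Y))                        # effectively zero: same result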
+
+Thus, we need to transpose the matrix using Mahout's Transpose Job:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg transpose \
+      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix \
+      --arg --numRows --arg 6076937 \
+      --arg --numCols --arg 20444 \
+      --arg --tempDir --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
+      -j JOB_ID
+
+
+This job requires the patch to [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639)
+
+The job creates the following output:
+
+
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose
+
+
+<a name="DimensionalReduction-7.TransposeEigenvectors"></a>
+##### 7. Transpose Eigenvectors
+
+If you followed Jake's explanation in step 6 above, then you know that we
+also need to transpose the eigenvectors:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg transpose \
+      --arg -i --arg /asf-mail-archives/mahout/svd/cleanEigenvectors \
+      --arg --numRows --arg 87 \
+      --arg --numCols --arg 20444 \
+      --arg --tempDir --arg /asf-mail-archives/mahout/svd/transpose \
+      -j JOB_ID
+
+
+Note: You need to use the same number of reducers that was used for
+transposing the matrix you are multiplying the vectors with.
+
+The job creates the following output:
+
+
+    /asf-mail-archives/mahout/svd/transpose
+
+
+<a name="DimensionalReduction-8.MatrixMultiplication"></a>
+##### 8. Matrix Multiplication
+
+Lastly, we need to multiply the transposed vectors using Mahout's
+matrixmult job:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg matrixmult \
+      --arg --numRowsA --arg 20444 \
+      --arg --numColsA --arg 6076937 \
+      --arg --numRowsB --arg 20444 \
+      --arg --numColsB --arg 87 \
+      --arg --inputPathA --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
+      --arg --inputPathB --arg /asf-mail-archives/mahout/svd/transpose \
+      -j JOB_ID
+
+
+This job produces output such as:
+
+
+    /user/hadoop/productWith-189
+
+
+<a name="DimensionalReduction-Resources"></a>
+# Resources
+
+* [LSA tutorial](http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm)
+* [SVD tutorial](http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html)

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md
new file mode 100644
index 0000000..50ff7be
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md
@@ -0,0 +1,127 @@
+---
+layout: default
+title:     Stochastic SVD
+theme:
+   name: retro-mahout
+---
+
+# Stochastic Singular Value Decomposition #
+
+The Stochastic SVD method in Mahout produces a reduced-rank Singular Value
+Decomposition output in its strict mathematical definition:
+`\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`.
+
+## The benefits over other methods are:
+
+ - Reduced flops required compared to Krylov subspace methods.
+
+ - In the map-reduce world, a fixed number of MR iterations is required
+   regardless of the rank requested.
+
+ - The precision/speed balance can be tweaked with options.
+
+ - A is a Distributed Row Matrix where rows may be identified by any
+   Writable (such as a document path). As such, it works directly on the
+   output of seq2sparse.
+
+ - As of 0.7 trunk, includes a PCA and dimensionality reduction workflow
+   (EXPERIMENTAL! Feedback on performance and other PCA-related issues or
+   blogs is greatly appreciated.)
+
+### Map-Reduce characteristics:
+SSVD uses at most 3 sequential MR steps (a map-only job, a map-reduce job,
+and 2 optional map-reduce jobs that run in parallel) to produce the reduced
+rank approximation of the U, V and S matrices. Additionally, two more
+map-reduce steps are added for each power iteration step, if requested.
+
+## Potential drawbacks:
+
+Potentially less precise (but adding even one power iteration seems to fix
+that quite a bit).
+
+## Documentation
+
+[Overview and Usage][3]
+
+Note: Please use 0.6 or later! For the PCA workflow, please use 0.7 or later.
+
+## Publications
+
+[Nathan Halko's dissertation][1], "Randomized methods for computing
+low-rank approximations of matrices", contains a comprehensive definition
+of the parallelization strategy taken in the Mahout SSVD implementation,
+and also some precision/scalability benchmarks, especially w.r.t. the
+Mahout Lanczos implementation on a typical corpus data set.
+
+The [Halko, Martinsson, Tropp] paper discusses a family of random
+projection-based algorithms and contains theoretical error estimates.
+
+**R simulation**
+
+[Non-parallel SSVD simulation in R][2] with power iterations and PCA
+options. Note that this implementation is not optimal for a sequential
+solver; it is for demonstration purposes only.
+
+However, try this R code to simulate a meaningful input:
+
+
+
+**tests.R**
+
+
+
+    n<-1000
+    m<-2000
+    k<-10
+     
+    qi<-1
+     
+    #simulated input
+    svalsim<-diag(k:1)
+     
+    usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
+    vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))
+     
+     
+    x<- usim %*% svalsim %*% t(vsim)
+
+
+Then compare ssvd.svd(x) and stock svd(x) performance for the same rank k,
+and notice the difference in running time. Also play with power iterations
+(qiter) and compare the accuracies of standard svd and SSVD.
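+
+For instance, a quick comparison along those lines might look like the
+following sketch (it assumes the tests.R snippet above has been run to
+create x and k, and that ssvd.R below has been sourced to define
+ssvd.svd):
+
+    source("ssvd.R")                  # defines ssvd.svd(), see below
+
+    system.time(s1 <- ssvd.svd(x, k, p = 25, qiter = 1))   # randomized SVD
+    system.time(s2 <- svd(x, nu = k, nv = k))              # stock R SVD
+
+    # compare the leading k singular values
+    round(s1$svalues, 4)
+    round(s2$d[1:k], 4)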
+
+Note: the numerical stability of the R algorithms may differ from that of
+Mahout's distributed version. We haven't studied the accuracy of the R
+simulation. For a study of the accuracy of Mahout's version, please refer
+to Nathan's dissertation as referenced above.
+
+
+  [1]: http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
+  [2]: ssvd.page/ssvd.R
+  [3]: ssvd.page/SSVD-CLI.pdf
+
+
+#### Modified SSVD Algorithm.
+
+Given an `\(m\times n\)`
+matrix `\(\mathbf{A}\)`, a target rank `\(k\in\mathbb{N}_{1}\)`
+, an oversampling parameter `\(p\in\mathbb{N}_{1}\)`, 
+and the number of additional power iterations `\(q\in\mathbb{N}_{0}\)`, 
+this procedure computes an `\(m\times\left(k+p\right)\)`
+SVD `\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`:
+
+  1. Create seed for random `\(n\times\left(k+p\right)\)`
+  matrix `\(\boldsymbol{\Omega}\)`. The seed defines matrix `\(\mathbf{\Omega}\)`
+  using Gaussian unit vectors per one of suggestions in [Halko, Martinsson, Tropp].
+
+  2. `\(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\)`
+ 
+
+  3. Column-orthonormalize `\(\mathbf{Y}\rightarrow\mathbf{Q}\)`
+  by computing thin decomposition `\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)`.
+  Also, `\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
+  I denote this as `\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)`
+ 
+
+  4. `\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}\in\mathbb{R}^{\left(k+p\right)\times n}\)`.
+ 
+  5. If `\(q>0\)`
+  repeat: for `\(i=1..q\)`: 
+  `\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)`
+  (power iterations step).
+
+  6. Compute Eigensolution of a small Hermitian `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\)`,
+  `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
+ 
+
+  7. Singular values `\(\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{0.5}\)`,
+  or, in other words, `\(\sigma_{i}=\sqrt{\lambda_{i}}\)`.
+ 
+
+  8. If needed, compute `\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)`.
+ 
+
+  9. If needed, compute `\(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)`.
+Another way is `\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)`.
+
+[Halko, Martinsson, Tropp]: http://arxiv.org/abs/0909.4061
+ 

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
new file mode 100644
index 0000000..ab5999d
Binary files /dev/null and b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
new file mode 100644
index 0000000..fa5fa84
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
@@ -0,0 +1,181 @@
+
+# standard SSVD
+ssvd.svd <- function(x, k, p=25, qiter=0 ) { 
+
+a <- as.matrix(x)
+m <- nrow(a)
+n <- ncol(a)
+p <- min( min(m,n)-k,p)
+r <- k+p
+
+omega <- matrix ( rnorm(r*n), nrow=n, ncol=r)
+
+y <- a %*% omega
+
+q <- qr.Q(qr(y))
+
+b<- t(q) %*% a
+
+#power iterations
+for ( i in seq_len(qiter) ) {   # seq_len(): qiter = 0 means no power iterations
+  y <- a %*% t(b)
+  q <- qr.Q(qr(y))
+  b <- t(q) %*% a
+}
+
+bbt <- b %*% t(b)
+
+e <- eigen(bbt, symmetric=T)
+
+res <- list()
+
+res$svalues <- sqrt(e$values)[1:k]
+uhat=e$vectors[1:k,1:k]
+
+res$u <- (q %*% e$vectors)[,1:k]
+# V = B^t %*% Uhat %*% Sigma^-1 (algorithm step 9), where Sigma = sqrt(Lambda)
+res$v <- (t(b) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k]
+
+return(res)
+}
+
+#SSVD with Q=YR^-1 substitute.
+# this is just a simulation, because it is suboptimal to verify the actual result
+ssvd.svd1 <- function(x, k, p=25, qiter=0 ) { 
+
+a <- as.matrix(x)
+m <- nrow(a)
+n <- ncol(a)
+p <- min( min(m,n)-k,p)
+r <- k+p
+
+omega <- matrix ( rnorm(r*n), nrow=n, ncol=r)
+
+# in reality we of course don't need to form and persist y
+# but this is just verification
+y <- a %*% omega
+
+yty <- t(y) %*% y
+R <- chol(yty, pivot = T)
+q <- y %*% solve(R)
+
+b<- t( q ) %*% a   
+
+#power iterations
+for ( i in seq_len(qiter) ) {   # seq_len(): qiter = 0 means no power iterations
+  y <- a %*% t(b)
+
+  yty <- t(y) %*% y
+  R <- chol(yty, pivot = T)
+  q <- y %*% solve(R)
+  b <- t(q) %*% a
+}
+
+bbt <- b %*% t(b)
+
+e <- eigen(bbt, symmetric=T)
+
+res <- list()
+
+res$svalues <- sqrt(e$values)[1:k]
+uhat=e$vectors[1:k,1:k]
+
+res$u <- (q %*% e$vectors)[,1:k]
+# V = B^t %*% Uhat %*% Sigma^-1, where Sigma = sqrt(Lambda)
+res$v <- (t(b) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k]
+
+return(res)
+}
+
+
+#############
+## ssvd with pci options
+ssvd.cpca <- function ( x, k, p=25, qiter=0, fixY=T ) { 
+
+a <- as.matrix(x)
+m <- nrow(a)
+n <- ncol(a)
+p <- min( min(m,n)-k,p)
+r <- k+p
+
+
+# compute the column means xi
+xi<-colMeans(a)
+
+omega <- matrix ( rnorm(r*n), nrow=n, ncol=r)
+
+y <- a %*% omega
+
+#fix y
+if ( fixY ) { 
+  #debug
+  cat ("fixing Y...\n");
+
+  s_o = t(omega) %*% cbind(xi)
+  for (i in 1:r ) y[,i]<- y[,i]-s_o[i]
+}
+
+
+q <- qr.Q(qr(y))
+
+b<- t(q) %*% a
+
+# compute sum of q rows 
+s_q <- cbind(colSums(q))
+
+# compute B*xi
+# of course in MR implementation 
+# it will be collected as sums of ( B[,i] * xi[i] ) and reduced after.
+s_b <- b %*% cbind(xi)
+
+
+#power iterations
+for ( i in seq_len(qiter) ) {   # seq_len(): qiter = 0 means no power iterations
+
+  # fix b 
+  b <- b - s_q %*% rbind(xi) 
+
+  y <- a %*% t(b)
+
+  # fix y 
+  if ( fixY )  
+    for (i in 1:r ) y[,i]<- y[,i]-s_b[i]
+  
+
+  q <- qr.Q(qr(y))
+  b <- t(q) %*% a
+
+  # recompute s_{q}
+  s_q <- cbind(colSums(q))
+
+  #recompute s_{b}
+  s_b <- b %*% cbind(xi)
+
+}
+
+
+
+#C is the outer product of S_q and S_b per doc
+C <- s_q %*% t(s_b)
+
+# fixing BB'
+bbt <- b %*% t(b) -C -t(C) + sum(xi * xi)* (s_q %*% t(s_q))
+
+e <- eigen(bbt, symmetric=T)
+
+res <- list()
+
+res$svalues <- sqrt(e$values)[1:k]
+uhat=e$vectors[1:k,1:k]
+
+res$u <- (q %*% e$vectors)[,1:k]
+
+# V from the mean-corrected B, scaled by Sigma^-1 where Sigma = sqrt(Lambda)
+res$v <- (t(b - s_q %*% rbind(xi)) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k]
+
+return(res)
+
+}
+
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/spark-naive-bayes.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/spark-naive-bayes.md b/website/old_site_migration/needs_work_priority/spark-naive-bayes.md
new file mode 100644
index 0000000..8823812
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/spark-naive-bayes.md
@@ -0,0 +1,132 @@
+---
+layout: default
+title: Spark Naive Bayes
+theme:
+    name: retro-mahout
+---
+
+# Spark Naive Bayes
+
+
+## Intro
+
+Mahout currently has two flavors of Naive Bayes.  The first is standard Multinomial Naive
Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes
as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
We refer to the former as Bayes and the latter as CBayes.
+
+Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes
that performs particularly well on datasets with skewed classes and has been shown to be competitive
with algorithms of higher complexity such as Support Vector Machines. 
+
+
+## Implementations
+The Mahout `math-scala` library has an implementation of both Bayes and
+CBayes which is further optimized in the `spark` module. Currently the
+Spark-optimized version provides CLI drivers for training and testing.
+Mahout Spark-Naive-Bayes models can also be trained, tested and saved to
+the filesystem from the Mahout Spark Shell.
+
+## Preprocessing and Algorithm
+
+As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive
Bayes is broken down into the following steps (assignments are over all possible index values):
 
+
+- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the
count of word `\(i\)` in document `\(j\)`.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`.

+- **Preprocessing** (via seq2sparse): TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
+    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
+    2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
+    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
+- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
+    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
+- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
+    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
+    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
+- **Label Assignment/Testing:**
+    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)` in the test document.
+    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
+
+As we can see, the main difference between Bayes and CBayes is the weight
+calculation step.  Where Bayes weighs terms more heavily based on the
+likelihood that they belong to class `\(c\)`, CBayes estimates term weights
+from the complement of `\(c\)` (all classes other than `\(c\)`) and negates
+them, so a document is assigned to the class whose complement it matches
+least.
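+
+As a concrete, purely illustrative example, the following R sketch computes
+the Bayes and CBayes weights from the formulas above on a tiny toy
+term-count matrix, skipping the TF-IDF preprocessing for brevity (all names
+and numbers here are made up):
+
+    # d: word x document count matrix; y: label per document; alpha_i = 1
+    d <- matrix(c(3, 0, 1,
+                  0, 2, 4,
+                  1, 1, 0), nrow = 3, byrow = TRUE)
+    y <- c("sport", "tech", "tech")
+    alpha_i <- 1
+    alpha   <- alpha_i * nrow(d)
+    classes <- unique(y)
+
+    # Bayes: theta_ci estimated from the documents labeled c
+    w_bayes <- sapply(classes, function(c) {
+      dc <- rowSums(d[, y == c, drop = FALSE])
+      log((dc + alpha_i) / (sum(dc) + alpha))
+    })
+
+    # CBayes: theta_ci estimated from the documents NOT labeled c,
+    # negated and L1-normalized
+    w_cbayes <- sapply(classes, function(c) {
+      dnc <- rowSums(d[, y != c, drop = FALSE])
+      w   <- -log((dnc + alpha_i) / (sum(dnc) + alpha))
+      w / sum(abs(w))
+    })
+
+    # label a test document t by argmax_c of sum_i t_i * w_ci
+    t_vec <- c(2, 0, 1)
+    classes[which.max(t_vec %*% w_cbayes)]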
+
+## Running from the command line
+
+Mahout provides CLI drivers for all above steps.  Here we will give a simple overview of
Mahout CLI commands used to preprocess the data, train the model and assign labels to the
training set. An [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
is given for the full process from data acquisition through classification of the classic
[20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
 
+
+- **Preprocessing:**
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html)
command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization
(-n 2 option) as follows:
+
+        $ mahout seq2sparse \
+          -i ${PATH_TO_SEQUENCE_FILES} \
+          -o ${PATH_TO_TFIDF_VECTORS} \
+          -nv \
+          -n 2 \
+          -wt tfidf
+
+- **Training:**
+The model is then trained using `mahout spark-trainnb`.  The default is to train a Bayes
model. The -c option is given to train a CBayes model:
+
+        $ mahout spark-trainnb \
+          -i ${PATH_TO_TFIDF_VECTORS} \
+          -o ${PATH_TO_MODEL} \
+          -ow \
+          -c
+
+- **Label Assignment/Testing:**
+Classification and testing on a holdout set can then be performed via `mahout spark-testnb`.
Again, the -c option indicates that the model is CBayes:
+
+        $ mahout spark-testnb \
+          -i ${PATH_TO_TFIDF_TEST_VECTORS} \
+          -m ${PATH_TO_MODEL} \
+          -c
+
+## Command line options
+
+- **Preprocessing:** *note: still reliant on MapReduce seq2sparse* 
+  
+  Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other
transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes.
 For a full list of `mahout seq2Sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html)
page.
+
+        $ mahout seq2sparse
+          --output (-o) output             The directory pathname for output.
+          --input (-i) input               Path to job input directory.
+          --weight (-wt) weight            The kind of weight to use. Currently TF
+                                               or TFIDF. Default: TFIDF
+          --norm (-n) norm                 The norm to use, expressed as either a
+                                               float or "INF" if you want to use the
+                                               Infinite norm.  Must be greater or equal
+                                               to 0.  The default is not to normalize
+          --overwrite (-ow)                If set, overwrite the output directory
+          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
+                                               be SequentialAccessVectors. If set true
+                                               else false
+          --namedVector (-nv)              (Optional) Whether output vectors should
+                                               be NamedVectors. If set true else false
+
+- **Training:**
+
+        $ mahout spark-trainnb
+          --input (-i) input               Path to job input directory.                 
+          --output (-o) output             The directory pathname for output.           
+          --trainComplementary (-c)        Train complementary? Default is false.
+          --master (-ma)                   Spark Master URL (optional). Default: "local".
+                                               Note that you can specify the number of 
+                                               cores to get a performance improvement, 
+                                               for example "local[4]"
+          --help (-h)                      Print out help                               
+
+- **Testing:**
+
+        $ mahout spark-testnb
+          --input (-i) input               Path to job input directory.
+          --model (-m) model               The path to the model built during training.
+          --testComplementary (-c)         Test complementary? Default is false.
+          --master (-ma)                   Spark Master URL (optional). Default: "local".
+                                               Note that you can specify the number of
+                                               cores to get a performance improvement,
+                                               for example "local[4]"
+          --help (-h)                      Print out help
+
+## Examples
+1. [20 Newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
+2. [Document classification with Naive Bayes in the Mahout shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+        
+ 
+## References
+
+[1]: Jason D. M. Rennie, Lawerence Shih, Jamie Teevan, David Karger (2003). [Tackling the
Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx
b/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx
new file mode 100644
index 0000000..ec1de04
Binary files /dev/null and b/website/old_site_migration/needs_work_priority/sparkbindings/MahoutScalaAndSparkBindings.pptx
differ

