mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From conflue...@apache.org
Subject [CONF] Apache Mahout > Stochastic Singular Value Decomposition
Date Mon, 07 May 2012 19:00:01 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Stochastic Singular Value Decomposition (https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition)


Edited by Dmitriy Lyubimov:
---------------------------------------------------------------------
Stochastic SVD method in Mahout produces reduced rank Singular Value Decomposition output
in its strict mathematical definition: A=USV'.

h5. The benefits over other methods are: 
* reduced flops required compared to Krylov subspace methods
* In map-reduce world, a fixed number of MR iterations required regardless of rank requested
* Tweak precision/speed balance with options.
* A is a Distributed Row Matrix where rows may be identified by any Writable (such as a document
path). As such, it would work directly on the output of seq2sparse. 
* As of 0.7 trunk, includes PCA and dimensionality reduction workflow (EXPERIMENTAL! Feedback
on performance/other PCA related issues/ blogs is greatly appreciated.)

map-reduce characteristics: 
SSVD uses at most 3 MR _sequential_ steps (map-only + map-reduce + 2 optional parallel map-reduce
jobs) to produce reduced rank approximation of U, V and S matrices. Additionally, two more
map-reduce steps are added for each power iteration step if requested.

h5. Potential drawbacks: 
* potentially less precise (but adding even one power iteration seems to fix that quite a
bit).

h5. Documentation

[Overview and Usage|^SSVD-CLI.pdf]
Note: Please use 0.6 or later! for PCA workflow, please use 0.7 or later.

h5. Publications

[Nathan Halko's dissertation|http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf]
"Randomized methods for computing low-rank
approximations of matrices" contains comprehensive definition of parallelization strategy
taken in Mahout SSVD implementation and also some precision/scalability benchmarks, esp. w.r.t.
Mahout Lanczos implementation on a typical corpus data set.

h5. R simulation
[Non-parallel SSVD simulation in R with power iterations and PCA options|^ssvd.R]. Note that
this implementation is not most optimal for sequential flow solver, but it is for demonstration
purposes only. 

However, try this R code to simulate a meaningful input:
{code:title=tests.R}
n<-1000
m<-2000
k<-10

qi<-1

#simulated input
svalsim<-diag(k:1)

usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))


x<- usim %*% svalsim %*% t(vsim) 

{code}

and try to compare ssvd.svd\(x\) and stock svd\(x\) performance for the same rank k, notice
the difference in the running time. Also play with power iterations (qIter) and compare accuracies
of standard svd and SSVD.



Change your notification preferences: https://cwiki.apache.org/confluence/users/viewnotifications.action
   

Mime
View raw message