mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitriy Lyubimov (Confluence)" <>
Subject [CONF] Apache Mahout > Stochastic Singular Value Decomposition
Date Wed, 09 Oct 2013 18:29:00 GMT
    <base href="">
            <link rel="stylesheet" href="/confluence/s/en/2176/1/186/_/styles/combined.css?spaceKey=MAHOUT&amp;forWysiwyg=true"
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
    <h2><a href="">Stochastic
Singular Value Decomposition</a></h2>
    <h4>Page <b>edited</b> by             <a href="">Dmitriy
                         <h4>Changes (1)</h4>
<div id="page-diffs">
                    <table class="diff" cellpadding="0" cellspacing="0">
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" > <br>h5. R simulation <br></td></tr>
            <tr><td class="diff-changed-lines" >[Non-parallel SSVD simulation
in R with power iterations and PCA options|^ssvd.R]. Note that this implementation is not
most optimal for sequential flow solver, but it is for demonstration purposes only. <span
class="diff-added-words"style="background-color: #dfd;">Note: numerical stability of R
algorithms may differ from that of Mahout&#39;s distributed version. For study of precision
of Mahout&#39;s version, please refer back to Nathan&#39;s dissertation referenced
above.</span> <br></td></tr>
            <tr><td class="diff-unchanged" > <br>However, try this R code
to simulate a meaningful input: <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
    </div>                            <h4>Full Content</h4>
                    <div class="notificationGreySide">
        <p>Stochastic SVD method in Mahout produces reduced rank Singular Value Decomposition
output in its strict mathematical definition: A=USV'.</p>

<h5><a name="StochasticSingularValueDecomposition-Thebenefitsoverothermethodsare%3A"></a>The
benefits over other methods are: </h5>
	<li>reduced flops required compared to Krylov subspace methods</li>
	<li>In map-reduce world, a fixed number of MR iterations required regardless of rank
	<li>Tweak precision/speed balance with options.</li>
	<li>A is a Distributed Row Matrix where rows may be identified by any Writable (such
as a document path). As such, it would work directly on the output of seq2sparse.</li>
	<li>As of 0.7 trunk, includes PCA and dimensionality reduction workflow (EXPERIMENTAL!
Feedback on performance/other PCA related issues/ blogs is greatly appreciated.)</li>

<p>map-reduce characteristics: <br/>
SSVD uses at most 3 MR <em>sequential</em> steps (map-only + map-reduce + 2 optional
parallel map-reduce jobs) to produce reduced rank approximation of U, V and S matrices. Additionally,
two more map-reduce steps are added for each power iteration step if requested.</p>

<h5><a name="StochasticSingularValueDecomposition-Potentialdrawbacks%3A"></a>Potential
drawbacks: </h5>
	<li>potentially less precise (but adding even one power iteration seems to fix that
quite a bit).</li>

<h5><a name="StochasticSingularValueDecomposition-Documentation"></a>Documentation</h5>

<p><a href="/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&amp;modificationDate=1349984685000">Overview
and Usage</a><br/>
Note: Please use 0.6 or later! for PCA workflow, please use 0.7 or later.</p>

<h5><a name="StochasticSingularValueDecomposition-Publications"></a>Publications</h5>

<p><a href=""
class="external-link" rel="nofollow">Nathan Halko's dissertation</a> "Randomized
methods for computing low-rank<br/>
approximations of matrices" contains comprehensive definition of parallelization strategy
taken in Mahout SSVD implementation and also some precision/scalability benchmarks, esp. w.r.t.
Mahout Lanczos implementation on a typical corpus data set.</p>

<h5><a name="StochasticSingularValueDecomposition-Rsimulation"></a>R simulation</h5>
<p><a href="/confluence/download/attachments/27832158/ssvd.R?version=1&amp;modificationDate=1323358453000">Non-parallel
SSVD simulation in R with power iterations and PCA options</a>. Note that this implementation
is not most optimal for sequential flow solver, but it is for demonstration purposes only.
Note: numerical stability of R algorithms may differ from that of Mahout's distributed version.
For study of precision of Mahout's version, please refer back to Nathan's dissertation referenced

<p>However, try this R code to simulate a meaningful input:</p>
<div class="code panel" style="border-width: 1px;"><div class="codeHeader panelHeader"
style="border-bottom-width: 1px;"><b>tests.R</b></div><div class="codeContent
<pre class="theme: Default; brush: java; gutter: false" style="font-size:12px; font-family:


#simulated input

usim&lt;- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
vsim&lt;- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))

x&lt;- usim %*% svalsim %*% t(vsim) 


<p>and try to compare ssvd.svd&#40;x&#41; and stock svd&#40;x&#41; performance
for the same rank k, notice the difference in the running time. Also play with power iterations
(qIter) and compare accuracies of standard svd and SSVD.</p>

        <div id="commentsSection" class="wiki-content pageSection">
        <div style="float: right;" class="grey">
                        <a href="">Stop
watching space</a>
            <span style="padding: 0px 5px;">|</span>
                <a href="">Change
email notification preferences</a>
        <a href="">View
        <a href="">View

View raw message