mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Computing SVD Of "Large Sparse Data"
Date Mon, 06 Jun 2011 10:32:45 GMT
I would push for SSVD as well if you want a real SVD.

Also, I don't think that you lose information about which vectors are which
(or as Jake put it, "what they mean").  The stochastic decomposition gives a
very accurate estimate of the top-k singular vectors.  It does this by using
a random projection to capture the top singular vectors in a small sub-space,
decomposing there, and then mapping the results back into the original
space.  This is not the same as simply doing the decomposition on the random
projection and then using that decomposition.
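
To make the distinction above concrete, here is a minimal NumPy sketch of the
general randomized-SVD idea. It is not Mahout's SSVD implementation; the
function name stochastic_svd and the parameters k, oversample and seed are
made up for illustration, and the test matrix is synthetic and exactly
low-rank.

```python
import numpy as np

def stochastic_svd(A, k, oversample=10, seed=0):
    """Illustrative randomized SVD (hypothetical helper, not Mahout's API)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    p = min(n, k + oversample)           # width of the random sub-space
    omega = rng.standard_normal((n, p))  # random projection matrix
    Y = A @ omega                        # sample the column space of A
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for that sample
    B = Q.T @ A                          # A restricted to the sub-space
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                      # map back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

# Synthetic rank-5 matrix, so the top-5 factors describe it completely.
rng = np.random.default_rng(42)
m, n, r = 200, 50, 5
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

U, s, Vt = stochastic_svd(A, k=r)
s_exact = np.linalg.svd(A, compute_uv=False)[:r]

# The naive alternative: decompose the random projection itself.
omega = np.random.default_rng(0).standard_normal((n, r + 10))
s_naive = np.linalg.svd(A @ omega, compute_uv=False)[:r]

print("stochastic:", s)
print("exact:     ", s_exact)
print("naive:     ", s_naive)
```

The step U = Q @ U_small is the "correcting back" part: the singular vectors
returned live in the original row and column spaces of A, so they keep their
meaning. The naive variant only sees A @ omega, whose singular values are
distorted by the projection.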

On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar <
evijayakumar@mozilla.com> wrote:

> Hi Jake,
>  Thank you for your reply. Good to know that we can use Lanczos. I will
> have to look at the SSVD algorithm more closely to figure out whether the
> information loss is worth the gain in speed (and computational efficiency).
> I guess we will have to run more tests to see which works best before
> deciding which path to take.
>
>
> Esh
>
> On Jun 3, 2011, at 6:23 PM, Jake Mannix wrote:
>
> > With 50k columns, you're well within the "sweet spot" for traditional SVD
> > via Lanczos, so give it a try.
> >
> > SSVD will probably run faster, but you lose some information on what the
> > singular vectors "mean".  If you don't need this information, SSVD may be
> > better for you.
> >
> > What would be awesome for *us* is if you tried both and told us what you
> > found, in terms of performance and relevance.  :)
> >
> >  -jake
> >
> > On Jun 3, 2011 4:49 PM, "Eshwaran Vijaya Kumar"
> > <evijayakumar@mozilla.com> wrote:
> >
> > Hello all,
> > We are trying to build a clustering system which will have an SVD
> > component. I believe Mahout has two SVD solvers: DistributedLanczosSolver
> > and SSVD. Could someone give me some tips on which would be a better
> > choice of a solver given that the size of the data will be roughly 100
> > million rows with each row having roughly 50K dimensions (100 million x
> > 50,000). We will be working with text data so the resultant matrix should
> > be relatively sparse to begin with.
> >
> > Thanks
> > Eshwaran
>
>
