mahout-user mailing list archives

From Radim Rehurek <RadimRehu...@seznam.cz>
Subject SVD in Mahout (was: Mahout Lanczos SVD complexity)
Date Sun, 18 Dec 2011 17:39:46 GMT
By the way, are there similar experimental results for performance/accuracy of SVD in Mahout?


I believe there are several SVD implementations available in Mahout (Lanczos/randomized, distributed/local),
and I would be interested in seeing a comparison of their speed and accuracy on some
reasonably large public dataset.

Cheers,
Radim


> ------------ Original message ------------
> From: Radim Rehurek <RadimRehurek@seznam.cz>
> Subject: Re: Mahout Lanczos SVD complexity
> Date: 18.12.2011 09:41:23
> ----------------------------------------
> Hi Ted,
> 
> appreciate your comments!
> 
> > You are correct that the randomized projection methods are of limited
> > accuracy when the singular values are not decreasing rapidly, but I think
> > that you were asking for too many singular values (500) from a relatively
> > small matrix to make much sense.  My argument for saying this is that the
> > singular values at this point have become quite small and are likely to be
> > modeling the randomized variation in your corpus rather than the systematic
> > patterns of occurrence.  As such, the "accuracy" of the decomposition is of
> > little interest.
> 
> Agreed that the small factors are modelling random variations and could likely
> be replaced by random direction vectors (of comparable magnitude). 
> 
> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
> Indexing (one of the applications of SVD), and users explicitly ask for SVD
> accuracy -- so there it is, hard numbers :)
> Also note that a few million documents, with a vocabulary of a few tens to
> hundreds of thousands of terms, is by far the most common use-case for gensim
> users. That's why I picked the English wikipedia to test on. If the use-cases
> of Mahout SVD target millions of features on billions of documents, YMMV.
> 
> 
> > It may well be that using a high number of singular values increases
> > apparent accuracy in your application over using a smaller number, but it
> > is likely that augmenting a small decomposition with random vectors instead
> > of true singular vectors would provide virtually identical performance.  As
> > such, computing these extra singular vectors more or less accurately is
> > hardly of the essence.
> > 
> > In order to determine whether this is the case, the interesting comparison
> > to make is whether you get higher application accuracy on held out data
> > from, say, 100 singular vectors + 400 random vectors or your 500 singular
> > vectors.  It is common with LSI-like applications to test 100 singular
> > vectors against 500 singular vectors, but that does not really measure how
> > much information is being extracted in the singular decomposition.
> 
> But those tests didn't measure application accuracy! (though I very much urge
> users to do that). They simply measured SVD reconstruction error (not LSI
> retrieval error or anything else).
> 
> Again, I completely agree that real app accuracy beats some theoretical nuisance
> metric (such as reconstruction error by Frobenius norm).
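For reference, the reconstruction-error metric mentioned here is just the relative Frobenius norm of the residual. A minimal numpy sketch (my own illustration, not the actual gensim benchmark code):

```python
import numpy as np

def relative_reconstruction_error(A, U, s, Vt):
    """Relative Frobenius-norm error ||A - U diag(s) Vt||_F / ||A||_F
    of a (possibly truncated) SVD approximation."""
    return np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)

rng = np.random.default_rng(0)
A = rng.random((300, 200))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 50                                  # truncation rank
err = relative_reconstruction_error(A, U[:, :k], s[:k], Vt[:k])
print(err)                              # 0 would mean exact reconstruction
```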
> 
> 
> Best,
> Radim
> 
> 
> > Regarding the accuracy of the decomposition, your choice of k is well below
> > the bound suggested by the theoretical considerations in the Halko, et al,
> > paper.  For small matrices with the size measured in thousands or tens of
> > thousands and very tight desired accuracies, this bound is not all that
> > helpful and it isn't surprising that you may have some problems.
> > 
> > In any case, I think that your experience is divergent enough from the
> > applications being considered in Mahout that conclusions should be drawn
> > with some caution.
> > 
> > On Sat, Dec 17, 2011 at 10:00 PM, Radim Rehurek
> > <RadimRehurek@seznam.cz> wrote:
> > 
> > > Ted Dunning wrote:
> > > > Also, it isn't entirely clear yet whether power iterations are more
> > > > efficient than simply increasing the fudge factor p. Power iterations
> > > > are very effective, and increasing p increases costs in the cube, but
> > > > running MR passes is expensive enough that some increase in p might be
> > > > sufficient and still faster than a power iteration.
> > >
> > > Don't even think about it -- on large matrices (significant truncation),
> > > the randomized SVD without power iterations is complete rubbish, regardless
> > > of any sane value of `p`.
> > >
> > > (assuming your fudge factor p = oversampling parameter `p` from Halko et
> > > al.).
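For concreteness, the prototype algorithm from Halko et al. with oversampling p and q power iterations can be sketched as follows. This is a plain numpy illustration under my reading of the paper, not Mahout's implementation; the QR re-orthonormalization after every multiply is one common way to guard against the numeric overflow mentioned below:

```python
import numpy as np

def randomized_svd(A, k, p=10, q=2, seed=0):
    """Truncated rank-k SVD via the randomized range finder of
    Halko, Martinsson & Tropp, with oversampling p and q power
    iterations.  Re-orthonormalizing after each multiply keeps
    the power iterations numerically stable."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)            # initial range estimate
    for _ in range(q):                        # power iterations
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Project A onto the estimated range and take a small exact SVD.
    B = Q.T @ A                               # (k+p) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

With q=0 (no power iterations) the approximation degrades noticeably when the spectrum decays slowly, which matches the observation above.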
> > >
> > > You can see some accuracy experiments (wikipedia, images) here:
> > > http://groups.google.com/group/gensim/msg/1f3106b85837ce9c and here
> > > http://groups.google.com/group/gensim/msg/240a348b70f18b30 .
> > >
> > > Also see the end of that (longish) thread for some notes on accuracy in
> > > the face of many power iterations (~ beware of numeric overflows).
> > >
> > > Best,
> > > Radim
> > >
> > 
> > 
> > 
> 
> 
> 
