mahout-user mailing list archives

From Fernando Fernández <>
Subject Re: Lanczos Algorithm
Date Tue, 23 Nov 2010 23:01:15 GMT
That helps a lot, Jake. I want to thank you so much for your patience in
answering all of our questions.

Regarding the negative eigenvalues issue, today I obtained a similar
result: the smallest eigenvalue was negative (but very close to 0) and the
largest one was greater than 1 (~1.02). I must point out, though, that this
happened with artificially augmented data (I copied and pasted the same 4
rows hundreds of times to run some performance tests), so I think the
problem may be related to that. Pedro, is your data artificially generated?

Lastly, I think it would be helpful for other people if these questions
were moved to the Mahout wiki (maybe as some kind of FAQ), since they are
becoming so "frequent".

Thanks a lot again Jake.


2010/11/23 Jake Mannix <>

> On Tue, Nov 23, 2010 at 9:58 AM, Derek O'Callaghan <> wrote:
> > Hi Jake,
> >
> > I have some related questions about the usage of the eigenvectors and
> > eigenvalues generated by Lanczos, they're more or less on-topic so I
> > thought it'd be okay to post them here, but I can start a new thread if
> > you like. I've been going through some of the mails on the dev list
> > regarding the projection of a matrix onto an SVD basis which is generated
> > by Lanczos, in order to reduce the dimensionality of the matrix columns.
> > The new matrix is then passed to KMeans for clustering.
> >
> Ok, sounds good.
>
> > From Jeff's mail above, and the code in TestClusterDumper, it seems like
> > the second multiplication by S^-1 step is not performed/required, i.e.
> > the only step to project the original matrix A is:
> >
> > Reduced matrix X = A . V (or A . P using Jeff's notation)
> >
> Not sure about what is done in TestClusterDumper, but in general, to
> project the original rows of your matrix onto the reduced space defined by
> the decomposition, you do want to rescale by S^-1, or else you'll
> basically find that all of your rows seem to point in the direction of the
> largest eigenvector (that's why it's the largest eigenvector: most of the
> matrix points in its direction!).
>
> > and the reduced matrix X can then be passed to KMeans for clustering. I
> > wanted to confirm if this is correct, and that the S (derived from the
> > Lanczos-generated eigenvalues) diagonal matrix can be ignored when
> > projecting the original matrix? Is this the reason why Lanczos only
> > persists the eigenvectors, and discards the eigenvalues
> > (DistributedLanczosSolver.serializeOutput())?
> >
> I don't think so.  I think you do want the eigenvalues as well.  Because
> Lanczos can sometimes have stability issues, and end up with repeats of
> eigenvector/eigenvalue pairs, you need to do some checking on the output.
>  This is done in the EigenVerificationJob class, which takes your original
> corpus, and the supposed eigenvectors (doesn't need the eigenvalues), and
> throws away any duplicates or incorrect vectors, and recomputes the
> eigenvalues/singular values and indeed stores them as well as the vectors
> (see the method saveCleanEigens() ).
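
The kind of checking described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the EigenVerificationJob implementation: the function name `clean_eigens`, its thresholds, and the toy matrix are all assumptions made for this example. For each candidate vector it recomputes the eigenvalue as a Rayleigh quotient, drops candidates whose residual is too large, and drops candidates that are nearly parallel to one already kept.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def clean_eigens(m, candidates, max_error=1e-6, max_overlap=0.99):
    """Filter candidate eigenvectors of the symmetric matrix `m`.

    Hypothetical helper: returns (eigenvalue, unit_vector) pairs, largest
    eigenvalue first, after discarding incorrect and duplicate candidates.
    """
    kept = []
    for v in candidates:
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:
            continue  # a zero vector carries no direction
        u = [x / norm for x in v]
        mu = matvec(m, u)
        lam = dot(u, mu)  # Rayleigh quotient: recomputed eigenvalue
        residual = math.sqrt(sum((a - lam * b) ** 2 for a, b in zip(mu, u)))
        if residual > max_error * max(1.0, abs(lam)):
            continue  # not (close enough to) an eigenvector
        if any(abs(dot(u, w)) > max_overlap for _, w in kept):
            continue  # nearly parallel to a vector we already kept
        kept.append((lam, u))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Toy symmetric matrix with eigenvectors along (1, 1) and (-1, 1).
M = [[50.5, 49.5],
     [49.5, 50.5]]

# Candidates: a good vector, a repeat of it, a wrong vector, another good one.
candidates = [[1.0, 1.0], [2.0, 2.0], [1.0, 0.0], [-1.0, 1.0]]

cleaned = clean_eigens(M, candidates)
for eigenvalue, vector in cleaned:
    print(eigenvalue, vector)
```

If `m` stands for A^T A, the singular values of A are the square roots of the eigenvalues that survive this filter.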
>
> These recent discussions remind me that the EigenVerificationJob needs to
> be just folded into the DistributedLanczosSolver, because it's confusing
> and nobody sees that they typically need to use it.
>
>  -jake
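
To make the rescaling point in the quoted reply concrete, here is a minimal sketch in plain Python (it does not use Mahout). The helper names `matmul`, `project`, and `column_norms` are invented for this illustration, and the toy matrix is constructed so that its SVD is known in advance (singular values 10 and 1). It compares X = A . V with X = A . V . S^-1.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def project(a, v, singular_values, rescale=True):
    """Project the rows of `a` onto the columns of `v`.

    With rescale=True, coordinate k of every projected row is divided by
    singular_values[k], i.e. X = A . V . S^-1; otherwise X = A . V.
    """
    x = matmul(a, v)
    if rescale:
        x = [[val / s for val, s in zip(row, singular_values)] for row in x]
    return x


def column_norms(m):
    """Euclidean norm of each column of a dense matrix."""
    return [math.sqrt(sum(val * val for val in col)) for col in zip(*m)]


c = 1 / math.sqrt(2)

# Toy data: A = U . S . V^T with singular values 10 and 1 by construction.
A = [[4.5, 5.5],
     [5.5, 4.5],
     [0.0, 0.0]]
V = [[c, -c],
     [c, c]]            # columns are the right singular vectors
S = [10.0, 1.0]         # singular values, largest first

unscaled = project(A, V, S, rescale=False)   # X = A . V
scaled = project(A, V, S, rescale=True)      # X = A . V . S^-1

print("column norms without S^-1:", column_norms(unscaled))
print("column norms with S^-1:   ", column_norms(scaled))
```

Without the S^-1 step the first coordinate of every projected row is scaled by the largest singular value, so a distance-based method such as KMeans would be dominated by that one direction. After rescaling, each retained direction contributes on a comparable scale.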
