That helps a lot, Jake. Thank you so much for your patience in answering
all of our questions.
Regarding the negative eigenvalues issue, I obtained a similar result today:
the smallest eigenvalue is negative (but very close to 0) and the largest one
is greater than 1 (~1.02). I should point out that this happened with
artificially augmented data (I copied and pasted the same 4 rows hundreds of
times to run some performance tests), so I think the problem may be related
to that. Pedro, is your data artificially generated?
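In case it's useful to see the effect outside Mahout, here is a tiny sketch
with commons-math (not the distributed Lanczos solver; the 4 rows and the
number of copies are just made up) showing why copy-pasted rows lead to
trailing eigenvalues that are essentially round-off noise:

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public class DuplicateRowsEigenCheck {
  public static void main(String[] args) {
    // 4 distinct rows, copied over and over (as in the augmented test data).
    double[][] base = {
        {1.0, 2.0, 3.0, 4.0, 5.0},
        {2.0, 3.0, 5.0, 7.0, 11.0},
        {0.5, 0.5, 1.0, 1.5, 2.5},
        {4.0, 1.0, 0.0, 2.0, 3.0}};
    int copies = 100;
    double[][] rows = new double[4 * copies][];
    for (int i = 0; i < rows.length; i++) {
      rows[i] = base[i % 4];
    }
    RealMatrix a = new Array2DRowRealMatrix(rows);
    // rank(A) <= 4, so all but (at most) 4 eigenvalues of A^T A are exactly 0
    // in exact arithmetic; in floating point they come out as tiny values of
    // either sign, which is the "slightly negative" symptom.
    double[] eigs =
        new EigenDecomposition(a.transpose().multiply(a)).getRealEigenvalues();
    for (double e : eigs) {
      System.out.println(e);
    }
  }
}
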
Lastly, I think it would be helpful for other people if these questions
were moved to the Mahout wiki (maybe as some kind of FAQ), since they are
becoming so "frequent".
Thanks a lot again Jake.
Best,
Fernando.
2010/11/23 Jake Mannix <jake.mannix@gmail.com>
> On Tue, Nov 23, 2010 at 9:58 AM, Derek O'Callaghan
> <derek.ocallaghan@ucd.ie> wrote:
>
> > Hi Jake,
> >
> > I have some related questions about the usage of the eigenvectors and
> > eigenvalues generated by Lanczos; they're more or less on-topic, so I
> > thought it'd be okay to post them here, but I can start a new thread if
> > you like.
> > I've been going through some of the mails on the dev list regarding the
> > projection of a matrix onto an SVD basis which is generated by Lanczos, in
> > order to reduce the dimensionality of the matrix columns. The new matrix is
> > then passed to KMeans for clustering.
> >
>
> Ok, sounds good.
>
>
> > From Jeff's mail above, and the code in TestClusterDumper, it seems like
> > the second multiplication by S^-1 step is not performed/required, i.e. the
> > only step to project the original matrix A is:
> >
> > Reduced matrix X = A . V (or A . P using Jeff's notation)
> >
>
> Not sure about what is done in TestClusterDumper, but in general, to
> project
> the original rows of your matrix onto the reduced space defined by the
> decomposition, you do want to rescale by S^-1, or else you'll basically
> find
> that all of your rows seem to point in the direction of the largest
> eigenvector (that's why it's the largest eigenvector: most of the matrix
> points in its direction!).
>
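> To see the rescaling numerically, here is a quick sketch with commons-math
> rather than the Mahout classes (the toy matrix is made up):
>
> import org.apache.commons.math3.linear.*;
>
> public class RescaleProjectionSketch {
>   public static void main(String[] args) {
>     // Toy stand-in for the corpus A; A = U S V^T.
>     RealMatrix a = new Array2DRowRealMatrix(new double[][] {
>         {1.0, 0.0, 2.0}, {0.0, 3.0, 1.0}, {4.0, 1.0, 0.0}, {2.0, 2.0, 2.0}});
>     SingularValueDecomposition svd = new SingularValueDecomposition(a);
>     RealMatrix x = a.multiply(svd.getV());   // X = A.V = U.S
>     double[] sigma = svd.getSingularValues();
>     // Rescale column i by 1/sigma_i, i.e. compute X.S^-1 = U, so that the
>     // reduced dimensions are on a comparable scale before k-means.
>     RealMatrix u = x.copy();
>     for (int j = 0; j < u.getColumnDimension(); j++) {
>       for (int i = 0; i < u.getRowDimension(); i++) {
>         u.setEntry(i, j, x.getEntry(i, j) / sigma[j]);
>       }
>     }
>     // u matches svd.getU() up to column signs; without the rescaling, the
>     // first column of x (largest singular value) dominates every row.
>   }
> }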
>
> > and the reduced matrix X can then be passed to KMeans for clustering. I
> > wanted to confirm if this is correct, and that the S (derived from the
> > Lanczos-generated eigenvalues) diagonal matrix can be ignored when
> > projecting the original matrix? Is this the reason why Lanczos only
> > persists the eigenvectors, and discards the eigenvalues
> > (DistributedLanczosSolver.serializeOutput())?
> >
>
> I don't think so. I think you do want the eigenvalues as well. Because
> Lanczos can sometimes have stability issues, and end up with repeats of
> eigenvector/eigenvalue pairs, you need to do some checking on the output.
> This is done in the EigenVerificationJob class, which takes your original
> corpus, and the supposed eigenvectors (doesn't need the eigenvalues), and
> throws away any duplicates or incorrect vectors, and recomputes the
> eigenvalues/singular values and indeed stores them as well as the vectors
> (see the method saveCleanEigens()).
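>
> Just to sketch what that verification amounts to (this is not the
> EigenVerificationJob code, only a conceptual in-memory version with
> commons-math, and the 0.99 thresholds are made up): for each candidate
> vector v, check that C.v stays parallel to v for the corpus Gramian C,
> estimate the eigenvalue with the Rayleigh quotient, and drop anything that
> overlaps an already accepted vector.
>
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.commons.math3.linear.*;
>
> public class EigenVerificationSketch {
>   public static void main(String[] args) {
>     RealMatrix a = new Array2DRowRealMatrix(new double[][] {
>         {1.0, 0.0, 2.0}, {0.0, 3.0, 1.0}, {4.0, 1.0, 0.0}});
>     RealMatrix c = a.transpose().multiply(a);   // corpus Gramian
>
>     // Candidate vectors as they might come out of a Lanczos run: here the
>     // true eigenvectors plus one repeat, to have a duplicate to discard.
>     EigenDecomposition ed = new EigenDecomposition(c);
>     List<RealVector> candidates = new ArrayList<>();
>     for (int i = 0; i < c.getColumnDimension(); i++) {
>       candidates.add(ed.getEigenvector(i));
>     }
>     candidates.add(ed.getEigenvector(0));
>
>     List<RealVector> kept = new ArrayList<>();
>     for (RealVector v : candidates) {
>       RealVector cv = c.operate(v);
>       double eigenvalue = v.dotProduct(cv) / v.dotProduct(v);  // Rayleigh quotient
>       double cosine = cv.dotProduct(v) / (cv.getNorm() * v.getNorm());
>       boolean duplicate = false;
>       for (RealVector k : kept) {
>         duplicate |= Math.abs(k.dotProduct(v)) / (k.getNorm() * v.getNorm()) > 0.99;
>       }
>       if (!duplicate && Math.abs(cosine) > 0.99) {
>         kept.add(v);
>         System.out.println("kept vector, eigenvalue ~ " + eigenvalue);
>       }
>     }
>   }
> }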
>
> These recent discussions remind me that the EigenVerificationJob needs to
> just be folded into the DistributedLanczosSolver, because it's confusing
> and nobody realizes that they typically need to use it.
>
> jake
>
