mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Lanczos Algorithm
Date Tue, 23 Nov 2010 22:03:04 GMT
On Tue, Nov 23, 2010 at 9:58 AM, Derek O'Callaghan
<derek.ocallaghan@ucd.ie> wrote:

> Hi Jake,
>
> I have some related questions about the usage of the eigenvectors and
> eigenvalues generated by Lanczos, they're more or less on-topic so I thought
> it'd be okay to post them here, but I can start a new thread if you like.
> I've been going through some of the mails on the dev list regarding the
> projection of a matrix onto an SVD basis which is generated by Lanczos, in
> order to reduce the dimensionality of the matrix columns. The new matrix is
> then passed to KMeans for clustering.
>

Ok, sounds good.


> From Jeff's mail above, and the code in TestClusterDumper, it seems like
> the second multiplication by S^-1 step is not performed/required, i.e. the
> only step to project the original matrix A is:
>
> Reduced matrix X = A . V (or A . P using Jeff's notation)
>

I'm not sure what is done in TestClusterDumper, but in general, to project
the original rows of your matrix onto the reduced space defined by the
decomposition, you do want to rescale by S^-1. Otherwise you'll basically find
that all of your rows seem to point in the direction of the largest
eigenvector (that's why it's the largest eigenvector: most of the matrix
points in its direction!).
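
To make the effect of the rescaling concrete, here is a minimal sketch in
plain Python (not Mahout code). It assumes a decomposition A = U . S . V^T, so
that A . V = U . S and A . V . S^-1 = U. The helper names matmul and project,
and the toy matrices, are made up for illustration.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def project(A, V, singular_values=None):
    """Project the rows of A onto the columns of V.

    If singular_values is given, column k of the result is divided by
    singular_values[k], i.e. the result is A . V . S^-1.
    """
    X = matmul(A, V)
    if singular_values is None:
        return X
    return [[x / s for x, s in zip(row, singular_values)] for row in X]


# Toy data (hypothetical): build A = U . S . V^T from known factors.
U = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]  # orthonormal columns
S = [10.0, 1.0]                                          # singular values
V = [[0.6, -0.8], [0.8, 0.6]]                            # orthonormal columns

US = [[u * s for u, s in zip(row, S)] for row in U]
Vt = [list(col) for col in zip(*V)]
A = matmul(US, Vt)

# Without rescaling the result is U . S: the first coordinate dominates
# every row because the first singular value is ten times the second.
unscaled = project(A, V)

# With rescaling the result is U: both coordinates are on the same footing.
rescaled = project(A, V, S)
```

In the unscaled output every row looks almost identical, which is the "all
rows point along the largest eigenvector" effect; after dividing by the
singular values the rows differ clearly in the second coordinate.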


> and the reduced matrix X can then be passed to KMeans for clustering. I
> wanted to confirm if this is correct, and that the S (derived from the
> Lanczos-generated eigenvalues) diagonal matrix can be ignored when
> projecting the original matrix? Is this the reason why Lanczos only persists
> the eigenvectors, and discards the eigenvalues
> (DistributedLanczosSolver.serializeOutput())?
>

I don't think so. I think you do want the eigenvalues as well. Because
Lanczos can sometimes have stability issues and end up with repeats of
eigenvector/eigenvalue pairs, you need to do some checking on the output.
This is done in the EigenVerificationJob class, which takes your original
corpus and the supposed eigenvectors (it doesn't need the eigenvalues),
throws away any duplicates or incorrect vectors, recomputes the
eigenvalues/singular values, and stores them along with the vectors
(see the method saveCleanEigens()).
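
The kind of checking described above can be sketched roughly as follows. This
is a conceptual illustration in plain Python, not the actual
EigenVerificationJob code: the function name clean_eigens, its thresholds, and
the use of a small symmetric matrix M in place of a real corpus are all
assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def clean_eigens(M, candidates, max_error=1e-6, dup_cosine=0.99):
    """Filter candidate eigenvectors of a symmetric matrix M.

    Returns (eigenvalue, unit_vector) pairs, largest eigenvalue first.
    The eigenvalue is recomputed from M, so none needs to be passed in.
    """
    kept = []
    for v in candidates:
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:
            continue  # a zero vector cannot be an eigenvector
        v = [x / norm for x in v]
        Mv = mat_vec(M, v)
        eigenvalue = dot(v, Mv)  # Rayleigh quotient for a unit vector
        # Residual ||M v - lambda v|| is near zero for a true eigenvector.
        residual = math.sqrt(sum((a - eigenvalue * b) ** 2 for a, b in zip(Mv, v)))
        if residual > max_error * max(1.0, abs(eigenvalue)):
            continue  # not (close enough to) an eigenvector
        if any(abs(dot(v, w)) > dup_cosine for _, w in kept):
            continue  # nearly parallel to a vector we already kept
        kept.append((eigenvalue, v))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept


# Hypothetical input: one repeated vector and one vector that is not an
# eigenvector of M at all.
M = [[2.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cleaned = clean_eigens(M, candidates)
```

Here the repeated [1, 0] and the non-eigenvector [1, 1] are dropped, and the
two surviving vectors come back with freshly computed eigenvalues.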

These recent discussions remind me that the EigenVerificationJob should
just be folded into the DistributedLanczosSolver, because it's confusing and
nobody sees that they typically need to use it.

  -jake
