mahout-user mailing list archives

From Derek O'Callaghan <derek.ocallag...@ucd.ie>
Subject Re: Lanczos Algorithm
Date Wed, 24 Nov 2010 13:14:09 GMT
Hi Jake,

Thanks for the clarification regarding S^-1. FYI the EigenVerificationJob 
is already included in one of the DistributedLanczosSolver run() 
methods. I see the EigenVector instances being written in 
EigenVerificationJob.saveCleanEigens(); however, when they're read back 
in TestClusterDumper.testKmeansSVD(), the vectors are actually 
DenseVector instances, not EigenVectors, and so the associated 
eigenValue is lost as it's currently encapsulated in EigenVector.name. I 
think VectorWritable is just persisting DenseVectors and isn't aware of 
EigenVectors, but I'd need to dig a bit deeper to confirm.

I just wanted to confirm: should S be constructed using the square roots 
of the eigenvalues generated by Lanczos/EigenVerificationJob?
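
To make the question concrete, here is a minimal NumPy sketch of the 
projection under discussion. It is only an illustration, not Mahout code: 
the toy matrix A, the choice k = 2, and the use of numpy.linalg.eigh as a 
stand-in for the Lanczos output are all assumptions. It also assumes the 
eigenvalues are those of A^T A; if the solver already reports singular 
values, no square root would be needed.

```python
import numpy as np

# Hypothetical toy data: 6 rows (documents) by 4 columns (terms).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Eigen-decomposition of the symmetric matrix A^T A.
# This stands in for the Lanczos output; it is not Mahout's API.
eigenvalues, V = np.linalg.eigh(A.T @ A)

# eigh returns eigenvalues in ascending order; keep the k largest pairs.
k = 2
order = np.argsort(eigenvalues)[::-1][:k]
eigenvalues, V = eigenvalues[order], V[:, order]

# Singular values of A are the square roots of the eigenvalues of A^T A.
s = np.sqrt(eigenvalues)

# Without rescaling, the projected columns have norms equal to s, so the
# coordinate along the largest eigenvector dominates every row.
unscaled = A @ V
assert np.allclose(np.linalg.norm(unscaled, axis=0), s)

# Rescale by S^-1: divide each projected column by its singular value.
X = unscaled / s  # same as A @ V @ inv(diag(s))

# The rescaled columns are orthonormal (they are left singular vectors).
assert np.allclose(X.T @ X, np.eye(k))
assert np.allclose(s, np.linalg.svd(A, compute_uv=False)[:k])
```

In this sketch X is the reduced matrix that would be handed to KMeans.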

Thanks again,

Derek

On 23/11/10 22:03, Jake Mannix wrote:
> Not sure about what is done in TestClusterDumper, but in general, to project
> the original rows of your matrix onto the reduced space defined by the
> decomposition, you do want to rescale by S^-1, or else you'll basically find
> that all of your rows seem to point in the direction of the largest
> eigenvector (that's why it's the largest eigenvector: most of the matrix
> points in its direction!).
>
>
>    
>> and the reduced matrix X can then be passed to KMeans for clustering. I
>> wanted to confirm if this is correct, and that the S (derived from the
>> Lanczos-generated eigenvalues) diagonal matrix can be ignored when
>> projecting the original matrix? Is this the reason why Lanczos only persists
>> the eigenvectors, and discards the eigenvalues
>> (DistributedLanczosSolver.serializeOutput())?
>>
>>      
> I don't think so.  I think you do want the eigenvalues as well.  Because
> Lanczos can sometimes have stability issues, and end up with repeats of
> eigenvector/eigenvalue pairs, you need to do some checking on the output.
>   This is done in the EigenVerificationJob class, which takes your original
> corpus, and the supposed eigenvectors (doesn't need the eigenvalues), and
> throws away any duplicates or incorrect vectors, and recomputes the
> eigenvalues/singular values and indeed stores them as well as the vectors
> (see the method saveCleanEigens() ).
>
> These recent discussions remind me that the EigenVerificationJob needs to
> be just folded into the DistributedLanczosSolver, because it's confusing and
> nobody sees that they typically need to use it.
>
>    -jake
>
>    
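
The verification step Jake describes above (drop duplicate or incorrect 
vectors, then recompute the eigenvalues from the original matrix) can be 
sketched roughly as follows. This is a NumPy illustration under stated 
assumptions, not the EigenVerificationJob implementation: the function 
name clean_eigens, the thresholds max_error and dup_cosine, and the 
cosine-based error measure are all hypothetical choices.

```python
import numpy as np

def clean_eigens(A, candidates, max_error=1e-6, dup_cosine=0.99):
    """Return (eigenvalue, vector) pairs of A^T A that pass verification."""
    kept = []
    for v in candidates:
        v = v / np.linalg.norm(v)
        Mv = A.T @ (A @ v)            # apply A^T A without forming it
        norm_Mv = np.linalg.norm(Mv)
        if norm_Mv == 0.0:
            continue                  # direction carries no information
        eigenvalue = float(v @ Mv)    # recomputed from the data (Rayleigh quotient)
        # Error is 0 when Mv is parallel to v, i.e. v is a true eigenvector.
        error = 1.0 - abs(v @ Mv) / norm_Mv
        if error > max_error:
            continue                  # not an eigenvector: discard
        if any(abs(v @ u) > dup_cosine for _, u in kept):
            continue                  # repeat of a vector already kept
        kept.append((eigenvalue, v))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept

# Hypothetical toy input: two true eigenvectors, a repeat, and a bogus vector.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
_, V = np.linalg.eigh(A.T @ A)
candidates = [V[:, 4], V[:, 3], -V[:, 4], rng.standard_normal(5)]

clean = clean_eigens(A, candidates)
```

Note that only the candidate vectors and the original matrix are needed; 
the eigenvalues come out of the check itself, which matches the 
description above.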
