mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject Lanczos Algorithm
Date Fri, 19 Nov 2010 08:17:51 GMT

Dear Mahout developers,

I'm a Computer Science student from the National University of Distance Education in Spain.
I'm currently developing my final year project which is about Diffusion Maps.

This method is used for dimensionality reduction and it uses the Lanczos algorithm during
its operations. The method is already implemented in the last release version 
of Mahout in the LanczosSolver class but we foresee the need to use the algorithm with distributed
calculations. This implementation of Diffusion Maps has to deal 
with extremely large matrices and the distributed calculation is critical for me.

I have noticed that there is a DistributedLanzcosSolver class implemented in the Mahout library
but I can’t have access to the source code because it isn't in the 
last release version of Mahout.

Could you please let me know if I could have access to the source code of this class?Also
I would like to ask you about how the LanczosSolver implementation works. I have made some
test between this class and other 
program which has been implemented in R. This program is using a library called Arpack, which
also uses the Lanczos algorithm. When I calculate the eigenvalues and the eigenvectors of
a symmetric matrix. I haven’t the same results. For example:
For this matrix:

4.42282138 1.51744077 0.07690571 0.93650042 2.19609401 
1.51744077 1.73849477 -0.11856149 0.76555191 1.3673608 
0.07690571 -0.11856149 0.55065932 1.72163263 -0.2283693 
0.93650042 0.76555191 1.72163263 0.09470345 -1.16626194 
2.19609401 1.3673608 -0.2283693 -1.16626194 -0.37321311 
Results for R:


 -0.6442398  1.1084103  2.3946915  6.2018925

Eigenvectors            [,1]        [,2]         [,3]       [,4]

[1,] -0.17050824  0.46631043 -0.010360993 0.83660453
[2,] -0.06455473 -0.87762807 -0.008814402 0.40939079
[3,]  0.68602882  0.04706265 -0.666429293 0.02602181
[4,] -0.39567054 -0.07491643 -0.670834157 0.12161492
[5,]  0.58272541 -0.06705358  0.325066897 0.34208875

Results for Java:


0.0 0.007869004183962289 0.023293016691817894 0.10872358093523908 0.13087002850143611 
I never get the same eigenvalues, I think this is because the documentation of the class says:
To avoid floating point overflow problems which arise in power-methods like Lanczos, an initial
pass is made through the input matrix to generate a good 
starting seed vector by summing all the rows of the input matrix, and compute the trace(inputMatrixt*matrix)
This latter value, being the sum of all of the singular values, is used to rescale the entire
matrix, effectively forcing the largest singular value to be strictly 
less than one, and transforming floating point overflow problems into floating point underflow
(ie, very small singular values will become invisible, as they will 
appear to be zero and the algorithm will terminate). 
Is it possible to return the eigenvalues to theirs original value?

-0.83660453 0.23122937 0.010360993   0.46631043 -0.17050824
-0.40939079 0.24067227 0.008814402  -0.87762807 -0.06455473 
-0.02602181 0.28695718 0.666429293   0.04706265 0.68602882 
-0.12161492 -0.61075665 0.670834157  -0.07491643 -0.39567054 
-0.34208875 -0.65821099 -0.325066897  -0.06705358 0.58272541 

Always happens the same. I have to force the calculation of N vectors (with an N x N matrix)
to obtain the same values for the eigenvectors, 
except in the sign of some of the values, which is acceptable. I thought this implementation
of the algorithm should return the eigenvectors sorted but all the time I’m obtaining a
vector which I don’t want to calculate between them. 
In the example above it’s the second one starting from the left.Why is this happen?

Thanks in advance.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message