Hi ,
Sorry for my late response.
Thanks Dmitry and Ted for your suggestions about smaller value of k and statistical noise.
I have some knowledge about the problem I am dealing with and that’s why I expected that.
It is like this: there are some inherent groups (clusters) in my dataset and items in one
group
are essentially not at all similar to items from other groups. That’s the reason of the
sparsity.
For instance, if items 10,11,12,....,30 are in one group, for these rows in the affinity matrix
only
nonzero columns will be 10,....30 and rest all zero.
But I expected the clustering algorithms to detect these groups and in addition find some
subgroups
which is my main requirement. That is maybe 10,11,...20 and 21,22,...30 are two clusters from
that group.
The clustering process could be executed for separate groups sequentially but I wished to
exploit the
parallelism of HADOOP and perform the clustering process faster.
With that said, can u suggest me something for this problem.
Dan, I have tested the spectralkmeans code comparing with my matlab script and it is running
ok now !
Although, I did not test the last kmeans clustering step as it is initialization dependent
and I assumed that part
to be fine.
After my investigations I found two issues with the code:
1. I still could not find any reason why L = D^0.5 * W * D^0.5 ? I think it should be changed
to D^0.5 (which I did in the code
I am using).
2. Lanczos Solver: Other than the fact this is time consuming for large matrices, I think
there is another important issue.
For spectral clustering we need the largest "k" eigenvectors of the Laplacian. But the Lanczos
solver gives the eigenvectors
in ascending order of eigenvalues. Jake Mannix (http://lucene.472066.n3.nabble.com/LanczosAlgorithmtd1929299.html)
suggested to go for 4k5k eigenvectors when we need "k" ones. However, if we really need
the largest "k" we have to get
all eigenvectors of the Laplacian.
I will try the SSVD and let you know.
Thanks,
Aniruddha
Original Message
From: Dan Brickley [mailto:danbri@danbri.org]
Sent: Thursday, July 19, 2012 11:12 PM
To: Aniruddha Basak
Subject: Re: eigendecomposition of very large matrices
Hi there,
I started replying below, and then eventually (hey, it's early morning for me :) ... then
realised that you were the original poster here. I was going to say that we had been trying
to get spectral clustering working again.
What was the conclusion of your investigations? Did you get the original spectralkmeans code
running ok, even if it does not suit your problem?
Porting / adapting to use SSVD (and maybe Ted's new kmeans stuff) sounds useful...
Dan
On 20 July 2012 02:32, Aniruddha Basak <tabasak@expedia.com> wrote:
> Thanks Dmitriy for your reply.
> The matrix I am working on, has 1020 non zero entries per row. So its very sparse.
> I am trying to do spectral clustering which involves eigendecomposition.
> I am wondering whether anyone has tried to do spectral clustering
> using mahout for very large affinity matrix (input).
Mahout does ship with a spectral clustering component, built on Mahout's SVD/Lancszos facility.
It has suffered various forms of coderot. Shannon, Aniruddha, and I have been trying to get
it running reliably again. See 'bin/mahout spectralkmeans', https://cwiki.apache.org/MAHOUT/spectralclustering.html
and nearby
> Aniruddha
>
>
> Original Message
> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> Sent: Thursday, July 19, 2012 6:28 PM
> To: user@mahout.apache.org
> Subject: Re: eigendecomposition of very large matrices
>
> very significant sparsity may be a problem though for q >=1 parameters. Again, depends
on the hardware you have and the # of nonzero elements in the input. but q=1 is still the
most recommended setting here.
>
>
> On Thu, Jul 19, 2012 at 6:20 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> you may try SSVD.
>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singula
>> r
>> +Value+Decomposition
>>
>> but 4k eigenvectors (or, rather, singular values) is kind of still a
>> lot though and may push the precision out of the error estimates. I
>> don't we had precision study for that many. Also need quite a bit of
>> memory to compute that (not to mention flops). More realistically you
>> probably may try 1k singular values . You may try more if you have
>> access to more powerful hardware than we did in the studies but
>> distributed computation time will grow at about k^1.5, i.e. faster
>> than linear, even if you have enough nodes for the tasks.
>>
>> d
>>
>> On Thu, Jul 19, 2012 at 6:12 PM, Aniruddha Basak <tabasak@expedia.com> wrote:
>>> Hi,
>>> I am working on a clustering problem which involves determining the
>>> largest "k" eigenvectors of a very large matrix. The matrices, I
>>> work on, are typically of the order of 10^6 by 10^6.
>>> Trying to do this using the Lanczos solver available in Mahout, I
>>> found it is very slow and takes around 1.5 minutes to compute each eigenvectors.
>>> Hence to get 4000 eigenvectors, it takes 100 hours or 4 days !!
>>>
>>> So I am looking for something faster to solve the "Eigen decomposition"
>>> problem for very large sparse matrix. Please suggest me what should I use ?
>>>
>>>
>>> Thanks,
>>> Aniruddha
>>>
