Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Date: Tue, 23 Nov 2010 17:58:13 +0000
From: Derek O'Callaghan <derek.ocallaghan@ucd.ie>
Subject: Re: Lanczos Algorithm
In-reply-to: <AANLkTikFbM5fJDQkHAMftUBcF-fZVfBPuu4gY=hHwZZK@mail.gmail.com>
To: user@mahout.apache.org
Message-id: <4CEC00B5.2010601@ucd.ie>
MIME-version: 1.0
Content-type: text/plain; format=flowed; charset=ISO-8859-1
Content-transfer-encoding: 7BIT
References: <SNT112-W594CB357EBD257A1C6DA55AB390@phx.gbl>
 <SNT112-W105A5B61A02D781B3BAA0AB3A0@phx.gbl>
 <AANLkTikC8aqmA3-6GtauHbUBgwkWPz8++XVDBC4KipSB@mail.gmail.com>
 <AANLkTimj6+sJWmiR9nj_6oV2ozCOBFa6_OmA=vw4peor@mail.gmail.com>
 <SNT112-W1824CDDC27CA8B7F366CB6AB3A0@phx.gbl>
 <AANLkTimb-QK6vL1a=q7kLMhOpJ5GLGFfktYGHL=6BwDP@mail.gmail.com>
 <SNT112-W430449B11361F7158B4D0EAB3A0@phx.gbl>
 <AANLkTik=7tSOBYxRgPbR8KG0NWGk1viZcEEDiR0Q9pmo@mail.gmail.com>
 <SNT112-W54EA18D9FF27F3EA806013AB3D0@phx.gbl>
 <AANLkTikFbM5fJDQkHAMftUBcF-fZVfBPuu4gY=hHwZZK@mail.gmail.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15)
 Gecko/20101027 Thunderbird/3.0.10

Hi Jake,

I have some related questions about the usage of the eigenvectors and 
eigenvalues generated by Lanczos, they're more or less on-topic so I 
thought it'd be okay to post them here, but I can start a new thread if 
you like. I've been going through some of the mails on the dev list 
regarding the projection of a matrix onto an SVD basis which is 
generated by Lanczos, in order to reduce the dimensionality of the 
matrix columns. The new matrix is then passed to KMeans for clustering.

(see "Re: Using SVD with Canopy/KMeans" thread on the dev list)

On 03/09/10 04:28, Jeff Eastman wrote:

"...A P is the projection of the data set onto the eigenvector basis:

A = original data matrix ([15x39])
P = eigenvector column matrix  ([39x7])
D = eigenvalue diagonal matrix

Then:
A P = P D => A = P D P' "

where the projection of A is achieved with: A times P (see 
TestClusterDumper.testKmeansSVD())

On 12/09/10 04:21, Jake Mannix wrote:

"To adapt this to do SVD, you just substitute A = C' * C, for any
(non-square) matrix C. You get a set of eigenvectors of C' * C, which in
turn are the right-singular vectors of C (what is called V, if C = U D V')."

This seems fine so far, in that the P in A times P is the V from the SVD 
definition in Jake's mail.

However, from an older thread ("distributed svd" on the user list):

On 24/07/10 00:09, Jake Mannix wrote:

"What you want to do here is actually take the A matrix, represented by a
DistributedRowMatrix, and the V_k matrix (you actually get basically V_k's
transposed from the SVD job), represented by that same class, and then just
compute A . V . S^-1, where S is the diagonal matrix of singular values
(note that these are the sqrts of the eigenvalues you get out of Lancos)."


 From Jeff's mail above, and the code in TestClusterDumper, it seems 
like the second multiplication by S^-1 step is not performed/required, 
i.e. the only step to project the original matrix A is:

Reduced matrix X = A . V (or A . P using Jeff's notation)

and the reduced matrix X can then be passed to KMeans for clustering. I 
wanted to confirm if this is correct, and that the S (derived from the 
Lanczos-generated eigenvalues) diagonal matrix can be ignored when 
projecting the original matrix? Is this the reason why Lanczos only 
persists the eigenvectors, and discards the eigenvalues 
(DistributedLanczosSolver.serializeOutput())?

Thanks,

Derek


On 22/11/10 22:01, Jake Mannix wrote:
> Hi Pedro,
>
> I wrote the Lanczos stuff in Mahout, but I've been swamped with my new job
> the past few months, and haven't been able to interact on the list 
> much, but
> I've got a minute or two to try and answer your questions.
>
> First: the LanczosSolver does indeed leave you with eigenvalues in the
> opposite order than you expect (ascending order, instead of 
> descending). It
> should be noted that in general, the Lanczos method *does not* give 
> you the
> "largest N eigenvalues" - if you ask for a rank-N decomposition via 
> Lanczos,
> you'll get the smallest eigenvalue for sure, then somewhat less than 
> N/2 of
> the other small eigenvalues (with the last few being right in the 
> middle of
> the spectrum), then somewhat less than N/2 of the largest eigenvalues 
> (with
> the first few being in the middle of the spectrum), ending with the 
> largest
> eigenvalue.
>
> If you really want the top N eigenvalues, then you should ask for 
> somewhere
> between 3N and 4N rank decomposition, and take the last N of these. You
> won't be guaranteed to get *exactly all* of the top N eigenvalues,
> especially if you have a very rapidly decreasing eigenvalue spectrum 
> (as is
> usually the case with real data), and you reach the "plateau" of middling
> values before N. In this case, if there are K<N "large" eigenvalues, 
> you'll
> get all K of these, and then N-K's worth of a sampling of the middling
> eigenvalues.
>
> One other thing you need to remember about the LanczosSolver: if your 
> input
> matrix is symmetric, pass in the boolean isSymmetric=true to the solve()
> method, or else you'll get wrong values.
>
> -jake
>
> On Mon, Nov 22, 2010 at 1:07 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
> pmjimenez1983@hotmail.com> wrote:
>
>>
>> Hi Ted,
>>
>> I can't give you an exact amount but more or less it could be around 10^5
>> non-zero elements per row.
>>
>> Could you please let me know, why the lanzcos algorithm is not always
>> returning the values in a decreasing order?
>>
>> Thanks.
>>
>> Pedro.
>>
>> ----------------------------------------
>>> From: ted.dunning@gmail.com
>>> Date: Fri, 19 Nov 2010 13:34:19 -0800
>>> Subject: Re: Lanczos Algorithm
>>> To: user@mahout.apache.org
>>>
>>> How many non-zero elements?
>>>
>>> On Fri, Nov 19, 2010 at 12:34 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
>>> pmjimenez1983@hotmail.com> wrote:
>>>
>>>>
>>>>
>>>> I was talking about 10^9 rows and 10^9 columns
>>>>
>>>> ----------------------------------------
>>>>> From: ted.dunning@gmail.com
>>>>> Date: Fri, 19 Nov 2010 12:07:16 -0800
>>>>> Subject: Re: Lanczos Algorithm
>>>>> To: user@mahout.apache.org
>>>>>
>>>>> On Fri, Nov 19, 2010 at 11:17 AM, PEDRO MANUEL JIMENEZ RODRIGUEZ <
>>>>> pmjimenez1983@hotmail.com> wrote:
>>>>>
>>>>>> In this project I would have to work with matrix of 10^9, which
>> have a
>>>> very
>>>>>> sparse data.
>>>>>
>>>>>
>>>>> I think you mean 10^9 rows and 10^9 columns with much fewer 10^18
>>>> non-zero
>>>>> elements.
>>>>>
>>>>> Is that correct?
>>>>>
>>>>> Can you say how many non-zero elements?
>>>>
>>>>
>>
>>
>