Return-Path: Delivered-To: apmail-mahout-user-archive@www.apache.org Received: (qmail 43518 invoked from network); 23 Nov 2010 17:57:53 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 23 Nov 2010 17:57:53 -0000 Received: (qmail 63707 invoked by uid 500); 23 Nov 2010 17:58:24 -0000 Delivered-To: apmail-mahout-user-archive@mahout.apache.org Received: (qmail 63650 invoked by uid 500); 23 Nov 2010 17:58:24 -0000 Mailing-List: contact user-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@mahout.apache.org Delivered-To: mailing list user@mahout.apache.org Received: (qmail 63642 invoked by uid 99); 23 Nov 2010 17:58:24 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 23 Nov 2010 17:58:24 +0000 X-ASF-Spam-Status: No, hits=-2.3 required=10.0 tests=RCVD_IN_DNSWL_MED,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [193.1.169.34] (HELO dakota.ucd.ie) (193.1.169.34) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 23 Nov 2010 17:58:16 +0000 Received: from conversion-daemon.dakota.ucd.ie by dakota.ucd.ie (Sun Java System Messaging Server 6.2-2.05 (built Apr 28 2005)) id <0LCC00201N5FBN00@dakota.ucd.ie> (original mail from derek.ocallaghan@ucd.ie) for user@mahout.apache.org; Tue, 23 Nov 2010 17:57:54 +0000 (GMT) Received: from [137.43.154.252] by dakota.ucd.ie (Sun Java System Messaging Server 6.2-2.05 (built Apr 28 2005)) with ESMTPSA id <0LCC000U7N8HKB4B@dakota.ucd.ie> for user@mahout.apache.org; Tue, 23 Nov 2010 17:57:54 +0000 (GMT) Date: Tue, 23 Nov 2010 17:58:13 +0000 From: Derek O'Callaghan Subject: Re: Lanczos Algorithm In-reply-to: To: user@mahout.apache.org Message-id: <4CEC00B5.2010601@ucd.ie> MIME-version: 1.0 Content-type: text/plain; format=flowed; charset=ISO-8859-1 Content-transfer-encoding: 7BIT References: User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101027 Thunderbird/3.0.10 Hi Jake, I have some related questions about the usage of the eigenvectors and eigenvalues generated by Lanczos, they're more or less on-topic so I thought it'd be okay to post them here, but I can start a new thread if you like. I've been going through some of the mails on the dev list regarding the projection of a matrix onto an SVD basis which is generated by Lanczos, in order to reduce the dimensionality of the matrix columns. The new matrix is then passed to KMeans for clustering. (see "Re: Using SVD with Canopy/KMeans" thread on the dev list) On 03/09/10 04:28, Jeff Eastman wrote: "...A P is the projection of the data set onto the eigenvector basis: A = original data matrix ([15x39]) P = eigenvector column matrix ([39x7]) D = eigenvalue diagonal matrix Then: A P = P D => A = P D P' " where the projection of A is achieved with: A times P (see TestClusterDumper.testKmeansSVD()) On 12/09/10 04:21, Jake Mannix wrote: "To adapt this to do SVD, you just substitute A = C' * C, for any (non-square) matrix C. You get a set of eigenvectors of C' * C, which in turn are the right-singular vectors of C (what is called V, if C = U D V')." This seems fine so far, in that the P in A times P is the V from the SVD definition in Jake's mail. However, from an older thread ("distributed svd" on the user list): On 24/07/10 00:09, Jake Mannix wrote: "What you want to do here is actually take the A matrix, represented by a DistributedRowMatrix, and the V_k matrix (you actually get basically V_k's transposed from the SVD job), represented by that same class, and then just compute A . V . S^-1, where S is the diagonal matrix of singular values (note that these are the sqrts of the eigenvalues you get out of Lancos)." From Jeff's mail above, and the code in TestClusterDumper, it seems like the second multiplication by S^-1 step is not performed/required, i.e. the only step to project the original matrix A is: Reduced matrix X = A . V (or A . P using Jeff's notation) and the reduced matrix X can then be passed to KMeans for clustering. I wanted to confirm if this is correct, and that the S (derived from the Lanczos-generated eigenvalues) diagonal matrix can be ignored when projecting the original matrix? Is this the reason why Lanczos only persists the eigenvectors, and discards the eigenvalues (DistributedLanczosSolver.serializeOutput())? Thanks, Derek On 22/11/10 22:01, Jake Mannix wrote: > Hi Pedro, > > I wrote the Lanczos stuff in Mahout, but I've been swamped with my new job > the past few months, and haven't been able to interact on the list > much, but > I've got a minute or two to try and answer your questions. > > First: the LanczosSolver does indeed leave you with eigenvalues in the > opposite order than you expect (ascending order, instead of > descending). It > should be noted that in general, the Lanczos method *does not* give > you the > "largest N eigenvalues" - if you ask for a rank-N decomposition via > Lanczos, > you'll get the smallest eigenvalue for sure, then somewhat less than > N/2 of > the other small eigenvalues (with the last few being right in the > middle of > the spectrum), then somewhat less than N/2 of the largest eigenvalues > (with > the first few being in the middle of the spectrum), ending with the > largest > eigenvalue. > > If you really want the top N eigenvalues, then you should ask for > somewhere > between 3N and 4N rank decomposition, and take the last N of these. You > won't be guaranteed to get *exactly all* of the top N eigenvalues, > especially if you have a very rapidly decreasing eigenvalue spectrum > (as is > usually the case with real data), and you reach the "plateau" of middling > values before N. In this case, if there are K you'll > get all K of these, and then N-K's worth of a sampling of the middling > eigenvalues. > > One other thing you need to remember about the LanczosSolver: if your > input > matrix is symmetric, pass in the boolean isSymmetric=true to the solve() > method, or else you'll get wrong values. > > -jake > > On Mon, Nov 22, 2010 at 1:07 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ < > pmjimenez1983@hotmail.com> wrote: > >> >> Hi Ted, >> >> I can't give you an exact amount but more or less it could be around 10^5 >> non-zero elements per row. >> >> Could you please let me know, why the lanzcos algorithm is not always >> returning the values in a decreasing order? >> >> Thanks. >> >> Pedro. >> >> ---------------------------------------- >>> From: ted.dunning@gmail.com >>> Date: Fri, 19 Nov 2010 13:34:19 -0800 >>> Subject: Re: Lanczos Algorithm >>> To: user@mahout.apache.org >>> >>> How many non-zero elements? >>> >>> On Fri, Nov 19, 2010 at 12:34 PM, PEDRO MANUEL JIMENEZ RODRIGUEZ < >>> pmjimenez1983@hotmail.com> wrote: >>> >>>> >>>> >>>> I was talking about 10^9 rows and 10^9 columns >>>> >>>> ---------------------------------------- >>>>> From: ted.dunning@gmail.com >>>>> Date: Fri, 19 Nov 2010 12:07:16 -0800 >>>>> Subject: Re: Lanczos Algorithm >>>>> To: user@mahout.apache.org >>>>> >>>>> On Fri, Nov 19, 2010 at 11:17 AM, PEDRO MANUEL JIMENEZ RODRIGUEZ < >>>>> pmjimenez1983@hotmail.com> wrote: >>>>> >>>>>> In this project I would have to work with matrix of 10^9, which >> have a >>>> very >>>>>> sparse data. >>>>> >>>>> >>>>> I think you mean 10^9 rows and 10^9 columns with much fewer 10^18 >>>> non-zero >>>>> elements. >>>>> >>>>> Is that correct? >>>>> >>>>> Can you say how many non-zero elements? >>>> >>>> >> >> >