Date: Thu, 23 Jun 2011 14:39:08 -0400
Subject: Re: LanczosSVD and Eigenvalues
From: tra26@cs.drexel.edu
To: user@mahout.apache.org

I will; if it works I may have to make an m/r job for it. All the data we
have will be tall and dense (let's say 5000 columns, with millions of
rows). Now doing PCA on that will create a covariance matrix that is square
and dense.

Thanks again guys.

-Trevor

> Try the QR trick. It is amazingly effective.
>
> 2011/6/23
>
>> Alright, thanks guys.
>>
>> > The cases where Lanczos or the stochastic projection helps are cases
>> > where you have *many* columns but where the data are sparse. If you
>> > have a very tall dense matrix, the QR method is to be muchly preferred.
>> >
>> > 2011/6/23
>> >
>> >> Ok, then what would you think to be the minimum number of columns in
>> >> the dataset for Lanczos to give a reasonable result?
>> >>
>> >> Thanks,
>> >> -Trevor
>> >>
>> >> > A gazillion rows of 2-columned data is really much better suited to
>> >> > doing the following:
>> >> >
>> >> > if each row is of the form [a, b], then compute the matrix
>> >> >
>> >> > [[a*a, a*b], [a*b, b*b]]
>> >> >
>> >> > (the outer product of the vector with itself)
>> >> >
>> >> > Then take the matrix sum of all of these, from each row of your
>> >> > input matrix.
>> >> >
>> >> > You'll now have a 2x2 matrix, which you can diagonalize by hand. It
>> >> > will give you your eigenvalues, and also the right-singular vectors
>> >> > of your original matrix.
>> >> >
>> >> > -jake
>> >> >
>> >> > 2011/6/23
>> >> >
>> >> >> Yes, exactly why I asked it for only 2 eigenvalues.
>> >> >> So what is being said is, if I have, let's say, 50M rows of
>> >> >> 2-columned data, Lanczos can't do anything with it (assuming it
>> >> >> puts the 0 eigenvalue in the mix - of the 2 eigenvectors only 1 is
>> >> >> returned because of the 0 eigenvalue taking up a slot)?
>> >> >>
>> >> >> If the eigenvalue of 0 is invalid, then should it not be filtered
>> >> >> out so that it returns "rank" number of eigenvalues that could be
>> >> >> valid?
>> >> >>
>> >> >> -Trevor
>> >> >>
>> >> >> > Ah, if your matrix only has 2 columns, you can't go to rank 10.
>> >> >> > Try on some slightly less synthetic data of more than rank 10.
>> >> >> > You can't ask Lanczos for more reduced rank than that of the
>> >> >> > matrix itself.
>> >> >> >
>> >> >> > -jake
>> >> >> >
>> >> >> > 2011/6/23
>> >> >> >
>> >> >> >> Alright, I can reorder; that is easy. I just had to verify that
>> >> >> >> the ordering was correct. So when I increased the rank of the
>> >> >> >> results I get Lanczos bailing out, which incidentally causes a
>> >> >> >> NullPointerException:
>> >> >> >>
>> >> >> >> INFO: 9 passes through the corpus so far...
>> >> >> >> WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN. Bailing out early!
>> >> >> >> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
>> >> >> >> Exception in thread "main" java.lang.NullPointerException
>> >> >> >>   at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>> >> >> >>   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>> >> >> >>   at pca.PCASolver.solve(PCASolver.java:53)
>> >> >> >>   at pca.PCA.main(PCA.java:20)
>> >> >> >>
>> >> >> >> So I should probably note that my data only has 2 columns; the
>> >> >> >> real data will have quite a bit more.
>> >> >> >>
>> >> >> >> The failing happens with 10 and more for rank, with the last, and
>> >> >> >> therefore most significant eigenvector being .
>> >> >> >>
>> >> >> >> -Trevor
>> >> >> >>
>> >> >> >> > The 0 eigenvalue output is not valid, and yes, the output will
>> >> >> >> > list the results in *increasing* order, even though it is
>> >> >> >> > finding the largest eigenvalues/vectors first.
>> >> >> >> >
>> >> >> >> > Remember that convergence is gradual, so if you only ask for 3
>> >> >> >> > eigenvectors/values, you won't be very accurate. If you ask
>> >> >> >> > for 10 or more, the largest few will now be quite good. If you
>> >> >> >> > ask for 50, now the top 10-20 will be *extremely* accurate, and
>> >> >> >> > maybe the top 30 will still be quite good.
>> >> >> >> >
>> >> >> >> > Try out a non-distributed form of what is in the
>> >> >> >> > EigenverificationJob to re-order the output and collect how
>> >> >> >> > accurate your results are (it computes errors for you as well).
>> >> >> >> >
>> >> >> >> > -jake
>> >> >> >> >
>> >> >> >> > 2011/6/23
>> >> >> >> >
>> >> >> >> >> So, I know that MAHOUT-369 fixed a bug with the distributed
>> >> >> >> >> version of the LanczosSolver, but I am experiencing a similar
>> >> >> >> >> problem with the non-distributed version.
>> >> >> >> >>
>> >> >> >> >> I sent in a dataset of Gaussian-distributed numbers (testing
>> >> >> >> >> PCA stuff) and my eigenvalues are seemingly reversed. Below I
>> >> >> >> >> have the output given in the logs from LanczosSolver.
>> >> >> >> >>
>> >> >> >> >> Output:
>> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>> >> >> >> >> INFO: LanczosSolver finished.
>> >> >> >> >>
>> >> >> >> >> So it returns a vector with eigenvalue 0 before one with an
>> >> >> >> >> eigenvalue of 347? What's more interesting is that when I
>> >> >> >> >> increase the rank, I get a new eigenvector with a value
>> >> >> >> >> between 0 and 347:
>> >> >> >> >>
>> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>> >> >> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>> >> >> >> >>
>> >> >> >> >> Shouldn't the eigenvalues be in descending order? Also, is the
>> >> >> >> >> 0.0 eigenvalue even valid?
>> >> >> >> >>
>> >> >> >> >> Thanks,
>> >> >> >> >> Trevor
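
The outer-product recipe from earlier in this thread (for each row, form the
outer product of the row with itself, sum these over all rows, then
diagonalize the small square result) can be sketched in NumPy roughly as
below. This is only an illustration under stated assumptions, not Mahout
code: the function name gram_eigen, the synthetic data, and the random seed
are all made up for the example. It sorts the eigenvalues into descending
order and compares them with the squared singular values of the input. For
PCA proper you would presumably subtract the column means first; that step
is left out here.

```python
import numpy as np


def gram_eigen(rows, n_cols):
    """Sum the outer product of every row with itself, then diagonalize.

    `rows` is any iterable of length-`n_cols` vectors, so the input can be
    streamed one row at a time. (Hypothetical helper, not a Mahout API.)
    """
    gram = np.zeros((n_cols, n_cols))
    for row in rows:
        # For a row [a, b] this adds [[a*a, a*b], [a*b, b*b]].
        gram += np.outer(row, row)
    # eigh handles symmetric matrices and returns eigenvalues in
    # ascending order, so reorder to put the largest first.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # arbitrary seed, assumption
    # Synthetic tall, 2-column Gaussian data with unequal column scales.
    data = rng.normal(size=(10000, 2)) * np.array([3.0, 1.0])
    values, vectors = gram_eigen(data, n_cols=2)
    singular_values = np.linalg.svd(data, compute_uv=False)
    print("eigenvalues of summed outer products:", values)
    print("squared singular values of the data: ", singular_values ** 2)
```

The columns of the returned eigenvector matrix are the right-singular
vectors of the input (up to sign). Only an n_cols x n_cols matrix is ever
held in memory, which is why this suits data with a great many rows and few
columns.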