Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Reply-To: user@mahout.apache.org
MIME-Version: 1.0
Date: Wed, 3 Jul 2013 17:35:37 -0400
Subject: Re: PCA using Java Code
From: Chirag Lakhani
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=20cf300e4e6f18a30704e0a23ef4

--20cf300e4e6f18a30704e0a23ef4
Content-Type: text/plain; charset=ISO-8859-1

Thanks for pointing out the relevant code explicitly. I will try that out.
I am currently getting a java.lang.StackOverflowError, but according to a
previous comment I need to use the trunk version.

Chirag


On Wed, Jul 3, 2013 at 4:39 PM, Dmitriy Lyubimov wrote:

> Yeah. Specifically, this code computes the mean (it is called "xi" to
> conform to the notation used in the math solution for MAHOUT-817):
>
> // MAHOUT-817
> if (pca && xiPath == null) {
>   xiPath = new Path(tempPath, "xi");
>   if (overwrite) {
>     fs.delete(xiPath, true);
>   }
> ====> MatrixColumnMeansJob.run(conf, inputPaths[0], xiPath);
> }
>
> ... and then passing it all to the SSVD solver:
>
> SSVDSolver solver =
>   new SSVDSolver(conf,
>                  inputPaths,
>                  new Path(tempPath, "ssvd"),
>                  r,
>                  k,
>                  p,
>                  reduceTasks);
>
> solver.setMinSplitSize(minSplitSize);
> solver.setComputeU(computeU);
> solver.setComputeV(computeV);
> solver.setcUHalfSigma(cUHalfSigma);
> solver.setcVHalfSigma(cVHalfSigma);
> solver.setcUSigma(cUSigma);
> solver.setOuterBlockHeight(h);
> solver.setAbtBlockHeight(abh);
> solver.setQ(q);
> solver.setBroadcast(broadcast);
> solver.setOverwrite(overwrite);
>
> if (xiPath != null) {
> ====> solver.setPcaMeanPath(new Path(xiPath, "part-*"));
> }
>
> The essential pieces are marked with double arrows.
>
>
> On Wed, Jul 3, 2013 at 1:34 PM, Chirag Lakhani wrote:
>
> > okay thanks.
> > It looks like I have that part running so I will go back to the
> > SSVDCli to finish the rest. Thanks for your help.
> >
> > Chirag
> >
> >
> > On Wed, Jul 3, 2013 at 4:19 PM, Dmitriy Lyubimov wrote:
> >
> > > On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani wrote:
> > >
> > > > Okay thanks for that. After working on that issue I am still having
> > > > trouble running the SSVD solver. I know I have asked this before but
> > > > I still cannot initiate the SSVD solver when the input, called
> > > > inputFolder, is the location of the sequence files of DenseVectors.
> > > > Is there something I am missing with this code?
> > > >
> > > > String inputFolder = "/data_csv_for_pca/";
> > > > String pcaOutput = "/vectors/";
> > > > String column_type = "DenseVector";
> > > > Path input_vec = new Path(inputFolder);
> > > >
> > > > SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
> > > >     new Path(pcaOutput), 18, 5, 3, 10);
> > > >
> > >
> > > SSVDSolver does not encapsulate the entire PCA workflow on its own.
> > >
> > > You can use SSVDCli as an example to build the entire thing to embed.
> > > The SSVDSolver class does not compute the PCA offset on its own;
> > > SSVDCli uses another job from Distributed Matrix to compute that
> > > (again, see the SSVDCli code).
> > >
> > > Problems with not finding input -- about 1 million reasons in your
> > > case. Try to use absolute hdfs://-prefixed paths for all files.
> > >
> > >
> > > > On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov wrote:
> > > >
> > > > > There's probably confusion about options.
> > > > >
> > > > > (1) --pca=true enables the PCA flow in general. There's more to it
> > > > > than just taking a mean and re-centering.
> > > > > (2) --us=true enables computation of the U*Sigma flow, which is
> > > > > what approximates the dimensionality-reduced output with the
> > > > > original variances.
> > > > > This is what one usually wants from PCA, although in some cases
> > > > > it may be useful to just use U.
> > > > > (3) Optionally, one may supply an externally computed colmean by
> > > > > using --pcaOffset. The motivation behind this option is that PCA
> > > > > is usually never a standalone job in a pipeline. Usually there's
> > > > > an MR job that preps the PCA input, in which case it is very easy
> > > > > to take row averages in the reducers of the previous step (and do
> > > > > the final averaging in the front end). That saves one MR pass over
> > > > > the input, because in SSVD the average will require one additional
> > > > > MR pass over A.
> > > > >
> > > > > Bottom line, typically one wants something along the lines of
> > > > >
> > > > > ssvd --pca=true -u=false -v=false -us=true ...
> > > > >
> > > > >
> > > > > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > >
> > > > > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" wrote:
> > > > > > >
> > > > > > > So how does the column mean get calculated if the --pcaOffset
> > > > > > > option is not
> > > > > > By taking the average of all row vectors. See the code for details.
> > > > > >
> > > > > > > specified? I would think you are just doing SVD at that point.
> > > > > > This statement is incorrect. I know because I designed this code.
> > > > > >
> > > > > > >
> > > > > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <clakhani@zaloni.com> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am trying to use the Mahout/Java API to do PCA but I am
> > > > > > > > > confused about the right order to do things.
> > > > > > > > > To start, I have a list of DenseVectors that I am reading
> > > > > > > > > into the code and turning into a distributed matrix in the
> > > > > > > > > following form.
> > > > > > > > >
> > > > > > > > > DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > > > >     matrix_path, num_rows, num_cols);
> > > > > > > > >
> > > > > > > > > When I run this code, I would have thought it would output
> > > > > > > > > the result into the path called "matrix_path" so that I can
> > > > > > > > > then use something like MatrixColumnMeansJob.run to get the
> > > > > > > > > mean. When I run this bit of code I get no output. Is there
> > > > > > > > > something else I should do, or is there a better way to
> > > > > > > > > calculate the mean for my file?
> > > > > > > > >
> > > > > > > > > From what I understand about the SSVD CLI code, you need to
> > > > > > > > > calculate the column mean and then output it into a
> > > > > > > > > directory.
> > > > > > > >
> > > > > > > > No, you don't have to (although you have an _option_ to
> > > > > > > > calculate and substitute one yourself if for some reason it is
> > > > > > > > already known). Default use assumes it will calculate it for
> > > > > > > > you.
> > > > > > > >
> > > > > > > > > Is there a good way to do this if I am starting from a file
> > > > > > > > > which is a sequence file of DenseVectors?
> > > > > > > >
> > > > > > > > Yes. Just don't specify the --pcaOffset option.
> > > > > > > >
> > > > > > > > > --
> > > > > > > > > *Chirag Lakhani*
> > > > > > > > > Data Scientist
> > > > > > > > > Zaloni, Inc. | www.zaloni.com
> > > > > > > > > 633 Davis Dr., Suite 200
> > > > > > > > > Durham, NC 27713
> > > > > > > > > e: clakhani@zaloni.com
> > > > > > > > > p: 919.602.4965 x7020

--
*Chirag Lakhani*
Data Scientist
Zaloni, Inc. | www.zaloni.com
633 Davis Dr., Suite 200
Durham, NC 27713
e: clakhani@zaloni.com
p: 919.602.4965 x7020

--20cf300e4e6f18a30704e0a23ef4--
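
Note on the mean step discussed in this thread: Dmitriy states that the PCA mean ("xi") is obtained by averaging all row vectors, and that SSVDCli delegates this to MatrixColumnMeansJob. The following is a minimal sketch of that arithmetic on plain Java arrays, not Mahout code. The class ColumnMeanSketch and its methods columnMeans and center are hypothetical names introduced only for illustration, and the explicit centering step is an assumption shown to make the role of the mean visible; the thread does not describe how the solver applies the mean internally.

```java
import java.util.Arrays;

/** Illustrative only: a plain-array stand-in for the column-mean step, not Mahout code. */
final class ColumnMeanSketch {

    /** Returns the mean of each column, i.e. the average of all row vectors. */
    static double[] columnMeans(double[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("need at least one row");
        }
        int numCols = rows[0].length;
        double[] mean = new double[numCols];
        for (double[] row : rows) {
            if (row.length != numCols) {
                throw new IllegalArgumentException("all rows must have the same length");
            }
            for (int j = 0; j < numCols; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= rows.length;
        }
        return mean;
    }

    /** Returns a new matrix with the given mean vector subtracted from every row. */
    static double[][] center(double[][] rows, double[] mean) {
        double[][] centered = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            centered[i] = new double[mean.length];
            for (int j = 0; j < mean.length; j++) {
                centered[i][j] = rows[i][j] - mean[j];
            }
        }
        return centered;
    }

    public static void main(String[] args) {
        double[][] a = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}};
        double[] xi = columnMeans(a);
        System.out.println(Arrays.toString(xi));                    // [3.0, 4.0]
        System.out.println(Arrays.deepToString(center(a, xi)));     // [[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]]
    }
}
```

After centering, every column of the result averages to zero, which is the property that distinguishes PCA from a plain SVD of the raw input. On a real cluster the same average would be computed by a distributed job rather than in memory, as the thread describes.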