mahout-user mailing list archives

From Chirag Lakhani <clakh...@zaloni.com>
Subject Re: PCA using Java Code
Date Wed, 03 Jul 2013 21:35:37 GMT
Thanks for pointing out the relevant code explicitly.  I will try that
out, but I am currently getting a java.lang.StackOverflowError; according to a
previous comment, I need to use the trunk version.

Chirag


On Wed, Jul 3, 2013 at 4:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> yeah. specifically, this code computes the mean (it is called "xi" to
> conform to the notation used in the math solution for MAHOUT-817)
>
>     // MAHOUT-817
>     if (pca && xiPath == null) {
>       xiPath = new Path(tempPath, "xi");
>       if (overwrite) {
>         fs.delete(xiPath, true);
>       }
>    ====>   MatrixColumnMeansJob.run(conf, inputPaths[0], xiPath);
>     }
>
> ... and then passing it all to the SSVD solver:
>
> SSVDSolver solver =
>       new SSVDSolver(conf,
>                      inputPaths,
>                      new Path(tempPath, "ssvd"),
>                      r,
>                      k,
>                      p,
>                      reduceTasks);
>
>     solver.setMinSplitSize(minSplitSize);
>     solver.setComputeU(computeU);
>     solver.setComputeV(computeV);
>     solver.setcUHalfSigma(cUHalfSigma);
>     solver.setcVHalfSigma(cVHalfSigma);
>     solver.setcUSigma(cUSigma);
>     solver.setOuterBlockHeight(h);
>     solver.setAbtBlockHeight(abh);
>     solver.setQ(q);
>     solver.setBroadcast(broadcast);
>     solver.setOverwrite(overwrite);
>
>
>     if (xiPath != null) {
> ====>      solver.setPcaMeanPath(new Path(xiPath, "part-*"));
>     }
>
>
>
> The essential pieces are marked with double arrows.
>
>
> On Wed, Jul 3, 2013 at 1:34 PM, Chirag Lakhani <clakhani@zaloni.com>
> wrote:
>
> > okay thanks.  It looks like I have that part running so I will go back to
> > the SSVDCli to finish the rest.  Thanks for your help.
> >
> > Chirag
> >
> >
> > On Wed, Jul 3, 2013 at 4:19 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > wrote:
> >
> > > On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani <clakhani@zaloni.com>
> > > wrote:
> > >
> > > > Okay thanks for that.  After working on that issue I am still having
> > > > trouble running the SSVD solver.  I know I have asked this before but I
> > > > still can not initiate the SSVD solver when the input called inputFolder
> > > > is the location of the sequence files of DenseVectors.  Is there something
> > > > I am missing with this code?
> > > >
> > > >
> > > > String inputFolder = "/data_csv_for_pca/";
> > > > String pcaOutput = "/vectors/";
> > > > String column_type = "DenseVector";
> > > > Path input_vec = new Path(inputFolder);
> > > >
> > > > SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
> > > >     new Path(pcaOutput), 18, 5, 3, 10);
> > > >
> > >
> > >
> > > SSVDSolver does not encapsulate the entire PCA workflow on its own.
> > >
> > > You can use SSVDCli as an example of how to build the entire thing for
> > > embedding. The SSVDSolver class does not compute the PCA offset on its own;
> > > SSVDCli uses another job from the distributed matrix code to compute that
> > > (again, see the SSVDCli code).
> > >
> > > As for the input not being found: there are about a million possible reasons
> > > in your case. Try to use absolute, hdfs://-prefixed paths for all files.
> > >
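
A minimal sketch of a sanity check for the "absolute hdfs:// paths" advice, using only java.net.URI so it runs without Hadoop on the classpath. The class name PathCheckSketch, the method isAbsoluteHdfs, and the host "namenode:8020" are assumptions made up for illustration; they are not part of Mahout or of the original code.

```java
import java.net.URI;

public class PathCheckSketch {

    /** True when the location has an explicit hdfs scheme and an absolute path. */
    public static boolean isAbsoluteHdfs(String location) {
        URI uri = URI.create(location);
        String path = uri.getPath();
        return "hdfs".equalsIgnoreCase(uri.getScheme())
            && path != null
            && path.startsWith("/");
    }

    public static void main(String[] args) {
        // Scheme-less path, as in the quoted snippet: resolved against whatever
        // default file system the job configuration happens to point at.
        System.out.println(isAbsoluteHdfs("/data_csv_for_pca/"));
        // Fully qualified form; the name node address here is hypothetical.
        System.out.println(isAbsoluteHdfs("hdfs://namenode:8020/data_csv_for_pca/"));
    }
}
```

Running a check like this over every input and output location before constructing the solver rules out one of the many ways an input can go missing.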
> > >
> > > >
> > > >
> > > > On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > wrote:
> > > >
> > > > > There's probably confusion about options.
> > > > >
> > > > > (1) --pca=true enables the PCA flow in general. There's more to it than
> > > > > just taking a mean and re-centering.
> > > > > (2) --us=true enables computation of U*Sigma, which is what approximates
> > > > > the dimensionality-reduced output with the original variances. This is
> > > > > what one usually wants from PCA, although in some cases it may be useful
> > > > > to just use U.
> > > > > (3) Optionally, one may supply an externally computed column mean by
> > > > > using --pcaOffset. The motivation behind this option is that PCA is
> > > > > usually never a standalone job in a pipeline. Usually there's an MR job
> > > > > that preps the PCA input, in which case it is very easy to take row
> > > > > averages in the reducers of the previous step (and do the final averaging
> > > > > in the front end). That saves one MR pass over the input, because in SSVD
> > > > > the average requires one additional MR pass over A.
> > > > >
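
A rough sketch of the idea in item (3): each reducer of the upstream job emits a partial result, and the front end combines them into the final column mean. This is plain Java with no Hadoop or Mahout dependency. The names PartialMeanSketch, Partial, and combine are hypothetical, and carrying column sums plus a row count per reducer is an assumption of this sketch (it keeps the final average correct when reducers see different numbers of rows).

```java
import java.util.Arrays;
import java.util.List;

public class PartialMeanSketch {

    /** Hypothetical per-reducer summary: column sums plus the number of rows seen. */
    public static final class Partial {
        final double[] sum;
        final long rows;

        public Partial(double[] sum, long rows) {
            this.sum = sum;
            this.rows = rows;
        }
    }

    /** Front-end step: add up the partial sums and divide by the total row count. */
    public static double[] combine(List<Partial> partials) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no partial results");
        }
        double[] total = new double[partials.get(0).sum.length];
        long rows = 0;
        for (Partial p : partials) {
            for (int j = 0; j < total.length; j++) {
                total[j] += p.sum[j];
            }
            rows += p.rows;
        }
        if (rows == 0) {
            throw new IllegalArgumentException("no rows seen");
        }
        for (int j = 0; j < total.length; j++) {
            total[j] /= rows;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Partial> partials = Arrays.asList(
            new Partial(new double[] {4.0, 8.0}, 2),   // reducer 1 saw two rows
            new Partial(new double[] {5.0, 1.0}, 1));  // reducer 2 saw one row
        System.out.println(Arrays.toString(combine(partials))); // prints [3.0, 3.0]
    }
}
```

The resulting vector is what would then be written out and handed to the solver through --pcaOffset; writing it in the format the solver expects is not shown here.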
> > > > > Bottom line, typically one wants something along the lines of
> > > > >
> > > > > ssvd --pca=true -u=false -v=false -us=true ...
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <clakhani@zaloni.com> wrote:
> > > > > > >
> > > > > > > So how does the column mean get calculated if the --pcaOffset option
> > > > > > > is not
> > > > > > By taking the average of all row vectors. See code for details.
> > > > > >
> > > > > > > specified?  I would think you are just doing SVD at that point.
> > > > > > This statement is incorrect. I know because I designed this code.
> > > > > >
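
A minimal sketch of what "taking the average of all row vectors" amounts to, in plain Java with no Mahout dependency. The names ColumnMeanSketch and columnMean are made up for illustration; this is not the MatrixColumnMeansJob implementation, which computes the same quantity as a MapReduce pass over the input.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnMeanSketch {

    /** Returns the column-wise mean of the rows (the vector the thread calls "xi"). */
    public static double[] columnMean(List<double[]> rows) {
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("need at least one row");
        }
        int n = rows.get(0).length;
        double[] mean = new double[n];
        for (double[] row : rows) {
            if (row.length != n) {
                throw new IllegalArgumentException("rows must have equal length");
            }
            for (int j = 0; j < n; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < n; j++) {
            mean[j] /= rows.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(
            new double[] {1.0, 2.0},
            new double[] {3.0, 6.0});
        System.out.println(Arrays.toString(columnMean(rows))); // prints [2.0, 4.0]
    }
}
```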
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <clakhani@zaloni.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am trying to use the Mahout/Java API to do PCA but I am confused
> > > > > > > > > about the right order to do things.  To start, I have a list of
> > > > > > > > > DenseVectors that I am reading into the code and turning into a
> > > > > > > > > distributed matrix in the following form.
> > > > > > > > >
> > > > > > > > >  DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > > > >      matrix_path, num_rows, num_cols);
> > > > > > > > >
> > > > > > > > > When I run this code, I would have thought it would output the
> > > > > > > > > result into the path called "matrix_path" so that I can then use
> > > > > > > > > something like MatrixColumnMeansJob.run to get the mean. When I run
> > > > > > > > > this bit of code I get no output. Is there something else I should
> > > > > > > > > do, or is there a better way to calculate the mean for my file?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > From what I understand about the SSVDCli code, you need to
> > > > > > > > > calculate the column mean and then output it into a directory.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > No, you don't have to (although you have an _option_ to calculate
> > > > > > > > and substitute one yourself if for some reason it is already known).
> > > > > > > > By default, it is calculated for you.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Is there a good way to do
> > > > > > > > > this if I am starting from a file which is a sequence file of
> > > > > > > > > DenseVectors?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes. just don't specify --pcaOffset option.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > *Chirag Lakhani*
> > > > > > > > >
> > > > > > > > > Data Scientist
> > > > > > > > >
> > > > > > > > > Zaloni, Inc. | www.zaloni.com
> > > > > > > > >
> > > > > > > > > 633 Davis Dr., Suite 200
> > > > > > > > >
> > > > > > > > > Durham, NC 27713
> > > > > > > > > e: clakhani@zaloni.com
> > > > > > > > > p: 919.602.4965 x7020
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>



-- 

*Chirag Lakhani*

Data Scientist

Zaloni, Inc. | www.zaloni.com

633 Davis Dr., Suite 200

Durham, NC 27713
e: clakhani@zaloni.com
p: 919.602.4965 x7020
