Hmm... You make a good point. Thanks for the suggestion. I will try to code up
a sequential algorithm and check how it scales up.
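
As a starting point, a minimal sketch of such a sequential random-projection
SVD could look like the following. It is written in Python with NumPy rather
than R, and the function names (randomized_svd, pca) and the parameters
(oversample, n_iter, seed) are illustrative assumptions, not code from Mahout
or from the R test described below. It assumes the dense matrix fits in memory.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k singular values/vectors of a dense matrix A.

    Hypothetical helper for illustration only; the defaults are guesses,
    not tuned values.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(k + oversample, m, n)  # size of the random sketch

    # Random projection: sample the range of A with a Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))

    # A few power iterations sharpen the subspace; each one costs two
    # passes over A (one with A.T, one with A).
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Project A onto the small subspace and take an exact SVD there.
    B = Q.T @ A  # ell x n, cheap to decompose
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]


def pca(A, k, **kwargs):
    """PCA for dense data: subtract the column means explicitly, then SVD."""
    mean = A.mean(axis=0)
    _, s, Vt = randomized_svd(A - mean, k, **kwargs)
    return mean, Vt, s  # rows of Vt are the principal directions
```

Since the data is dense anyway, pca() simply forms A - mean in memory, which
needs room for roughly a second copy of the matrix. For data read from disk,
each product with A or A.T above would become one streaming pass over the file.
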
On Thu, May 5, 2011 at 1:38 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Vckay,
>
> You say you are doing SVD on image data. Why are you worrying about
> Mahout?
>
> I just did a quick test using R, and on my laptop random projections take
> about 5 seconds to extract the first 50 singular values and corresponding
> singular vectors of a 10,000 x 10,000 random dense matrix.
> With sufficient memory to store the original matrix roughly twice, you
> should be able to get very fast results on any reasonably sized image. Even
> if you have to read the matrix from disk, you only need to make a few passes
> over it to get the results. Thus, if you have a million rows and 10,000
> columns, I would expect that you would be able to do a full-on SVD in an
> hour or so.
>
> Because the sequential version is so fast, I would be surprised if you are
> able to get significant wins from any dense matrix I can imagine coming from
> an image source. Dimitriy's random projection code should be as good as it
> gets on this, but with dense data I am not so sure you will see a big win.
>
>
>
> On Thu, May 5, 2011 at 11:05 AM, Vckay <darkvckay@gmail.com> wrote:
>
> > On Thu, May 5, 2011 at 12:22 PM, Jake Mannix <jake.mannix@gmail.com>
> > wrote:
> >
> > > On Thu, May 5, 2011 at 8:24 AM, Vckay <darkvckay@gmail.com> wrote:
> > >
> > > > > So I am trying to build PCA. It was recommended to me in a previous
> > > > > thread that it would be better if my data were available at the
> > > > > start as a distributed row matrix. The workflow (already posted in
> > > > > a previous thread) would be:
> > > > 1. Get the data into distributed row matrix format.
> > > > 2. Compute empirical mean vector.
> > > >
> > >
> > > Note that as we've mentioned in other threads, this step:
> > >
> > >
> > >
> > I know what you guys were saying in the previous thread. I believe I did
> > mention that I would be working with image data that is overwhelmingly
> > dense, meaning that even if I did subtract the mean, I would essentially
> > get a sparse matrix. In fact, running SVD separately on the matrix and the
> > low-rank matrix (e*m') would probably be a bad idea in this case, because
> > you would end up having to run the code on a dense matrix.
> >
>
