mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Regarding PCA implementation
Date Thu, 28 Apr 2011 03:21:07 GMT
On Wed, Apr 27, 2011 at 6:41 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> > row matrix. I can then use "times" to produce the covariance matrix.
>
> Actually, this is liable to be a disaster, because the covariance matrix
> will be dense after you subtract the mean.
>

This is exactly what I was thinking.


> a) can you do the SVD of the original matrix rather than the eigen-value
> computation of the covariance?  I think that this is likely to be
> numerically better.
>
> b) is there some perturbation trick that you can use to avoid the mean-shift
> problem?  I know that you can deal with (A - \lambda I), but here you have
> (A - e m'), where e is the vector of all ones.
>

I would love to know the answer to this question.

Thinking on it a little further, this is not so bad. Say we had a finished
patch for the idea discussed in MAHOUT-672 (virtual distributed matrices),
where in this case we have (A - e m'), with e and m represented in a nice
compact fashion (they're just vectors, after all). Lanczos then operates by
repeated multiplication of this matrix against some dense vector. A . v is
fine, and (e m') . v = (v.dot(m)) e is also easy to compute, so repeated
iteration is not so bad at all.

I'm guessing that I've just reinvented sparse PCA, unless this is all crazy?

  -jake
