On Wed, Apr 27, 2011 at 6:41 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> > row matrix. I can then use "times" to produce the covariance matrix.
> >
>
> Actually, this is liable to be a disaster because the covariance matrix
> will be dense after you subtract the mean.
>
This is exactly what I was thinking.
> a) can you do the SVD of the original matrix rather than the eigenvalue
> computation of the covariance? I think that this is likely to be
> numerically better.
>
> b) is there some perturbation trick that you can do to avoid the mean
> shift problem? I know that you can deal with (A - \lambda I), but you
> have (A - e m'), where e is the vector with all ones.
>
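Re (a): agreed that the SVD route should be better numerically. The usual
argument (sketching from memory here): if A = U S V', then the covariance is
proportional to A'A = V S^2 V', so forming it explicitly squares the
condition number, and the small singular values lose roughly half their
accurate digits before the eigensolver even starts. Lanczos on A directly
sidesteps that.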
As for (b): I would love to know the answer to this question.
Thinking on it a little bit further, this is not so bad: let's say we had a
finished patch for the idea discussed in MAHOUT-672 (virtual distributed
matrices), where in this case we have (A - e m'), with e and m represented
in a nice compact fashion (just vectors, after all). Then Lanczos operates
by repeatedly multiplying this matrix by some dense vector. A . v is fine,
and (e m') . v = (v.dot(m)) e is also easy to compute, so repeated
iteration is not so bad at all.
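To make that concrete, here's a rough sketch of the multiply (untested,
plain Java over arrays rather than the actual DistributedRowMatrix / Vector
API, and the names are made up):

class CenteredTimes {
  /**
   * y = (A - e m') v, where e is the all-ones vector, without ever
   * materializing the dense centered matrix. Since (e m') v = (m . v) e,
   * the product is just A v with the scalar (m . v) subtracted from
   * every entry.
   */
  static double[] centeredTimes(double[][] a, double[] m, double[] v) {
    // rank-one correction: (e m') v = (m . v) e
    double mDotV = 0.0;
    for (int j = 0; j < v.length; j++) {
      mDotV += m[j] * v[j];
    }
    double[] y = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < a[i].length; j++) {
        sum += a[i][j] * v[j];  // if A is sparse, only touch nonzeros here
      }
      y[i] = sum - mDotV;  // apply the correction; A itself stays sparse
    }
    return y;
  }
}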
I'm guessing that I've just reinvented sparse PCA, unless this is all crazy?
jake
