mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: LanczosSVD and Eigenvalues
Date Thu, 23 Jun 2011 19:08:55 GMT
I think that you can do the covariance using Jakes old outer product trick.

Of course you need to do something clever to deal with mean subtraction.

2011/6/23 <tra26@cs.drexel.edu>

> Yes but a M/R job to create the covariance matrix would be required. With
> millions of rows that is, unless I am missing something.
>
> > Doing PCA on 5000 x 5000 matrix is still an in-memory thing.  That's
> > only  25M doubles, 200MB of memory.  Lots of techniques can run
> > on that set.  You could do Lanczos with whatever rank you want
> > on it (but don't worry about distributed lanczos).
> >
> > 2011/6/23 <tra26@cs.drexel.edu>
> >
> >> I will, if it works I may have to make an m/r job for it. All the data
> >> we
> >> have will be tall and dense (lets say 5000 columns, with millions of
> >> rows). Now doing PCA on that will create a covariance matrix that is
> >> square and dense. Thanks again guys.
> >>
> >> -Trevor
> >>
> >> > Try the QR trick.  It is amazingly effective.
> >> >
> >> > 2011/6/23 <tra26@cs.drexel.edu>
> >> >
> >> >> Alright, thanks guys.
> >> >>
> >> >> > The cases where Lanczos or the stochastic projection helps are
> >> cases
> >> >> where
> >> >> > you have *many* columns but where the data are sparse.  If you
have
> >> a
> >> >> very
> >> >> > tall dense matrix, the QR method is to be muchly preferred.
> >> >> >
> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
> >> >> >
> >> >> >> Ok, then what would you think to be the minimum number of
columns
> >> in
> >> >> the
> >> >> >> dataset for Lanczos to give a reasonable result?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> -Trevor
> >> >> >>
> >> >> >> > A gazillion rows of 2-columned data is really much better
suited
> >> to
> >> >> >> doing
> >> >> >> > the following:
> >> >> >> >
> >> >> >> > if each row is of the form [a, b], then compute the matrix
> >> >> >> >
> >> >> >> > [[a*a, a*b], [a*b, b*b]]
> >> >> >> >
> >> >> >> > (the outer product of the vector with itself)
> >> >> >> >
> >> >> >> > Then take the matrix sum of all of these, from each row
of your
> >> >> input
> >> >> >> > matrix.
> >> >> >> >
> >> >> >> > You'll now have a 2x2 matrix, which you can diagonalize
by hand.
> >> >> It
> >> >> >> will
> >> >> >> > give you your eigenvalues, and also the right-singular
vectors
> >> of
> >> >> your
> >> >> >> > original matrix.
> >> >> >> >
> >> >> >> >   -jake
> >> >> >> >
> >> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
> >> >> >> >
> >> >> >> >> Yes, exactly why I asked it for only 2 eigenvalues.
So what is
> >> >> being
> >> >> >> >> said,
> >> >> >> >> is if I have lets say 50M rows of 2 columned data,
Lanczos
> >> can't
> >> >> do
> >> >> >> >> anything with it (assuming it puts the 0 eigenvalue
in the mix
> >> -
> >> >> of
> >> >> >> the
> >> >> >> >> 2
> >> >> >> >> eigenvectors only 1 is returned because of the 0
eigenvalue
> >> taking
> >> >> up
> >> >> >> a
> >> >> >> >> slot)?
> >> >> >> >>
> >> >> >> >> If the eigenvalue of 0 is invalid, then should it
not be
> >> filtered
> >> >> out
> >> >> >> so
> >> >> >> >> that it returns "rank" number of eigenvalues that
could be
> >> valid?
> >> >> >> >>
> >> >> >> >> -Trevor
> >> >> >> >>
> >> >> >> >> > Ah, if your matrix only has 2 columns, you can't
go to rank
> >> 10.
> >> >> >> Try
> >> >> >> >> on
> >> >> >> >> > some slightly less synthetic data of more than
rank 10.  You
> >> >> can't
> >> >> >> >> > ask Lanczos for more reduced rank than that
of the matrix
> >> >> itself.
> >> >> >> >> >
> >> >> >> >> >   -jake
> >> >> >> >> >
> >> >> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
> >> >> >> >> >
> >> >> >> >> >> Alright I can reorder that is easy, just
had to verify that
> >> the
> >> >> >> >> ordering
> >> >> >> >> >> was correct. So when I increased the rank
of the results I
> >> get
> >> >> >> >> Lanczos
> >> >> >> >> >> bailing out. Which incidentally causes a
> >> NullPointerException:
> >> >> >> >> >>
> >> >> >> >> >> INFO: 9 passes through the corpus so far...
> >> >> >> >> >> WARNING: Lanczos parameters out of range:
alpha = NaN, beta
> >> =
> >> >> NaN.
> >> >> >> >> >> Bailing out early!
> >> >> >> >> >> INFO: Lanczos iteration complete - now to
diagonalize the
> >> >> >> >> tri-diagonal
> >> >> >> >> >> auxiliary matrix.
> >> >> >> >> >> Exception in thread "main" java.lang.NullPointerException
> >> >> >> >> >>        at
> >> >> >> >> >>
> org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
> >> >> >> >> >>        at
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
> >> >> >> >> >>        at pca.PCASolver.solve(PCASolver.java:53)
> >> >> >> >> >>        at pca.PCA.main(PCA.java:20)
> >> >> >> >> >>
> >> >> >> >> >> So I should probably note that my data only
has 2 columns,
> >> the
> >> >> >> real
> >> >> >> >> data
> >> >> >> >> >> will have quite a bit more.
> >> >> >> >> >>
> >> >> >> >> >> The failing happens with 10 and more for
rank, with the
> >> last,
> >> >> and
> >> >> >> >> >> therefore most significant eigenvector being
<NaN,NaN>.
> >> >> >> >> >>
> >> >> >> >> >> -Trevor
> >> >> >> >> >> > The 0 eigenvalue output is not valid,
and yes, the output
> >> >> will
> >> >> >> list
> >> >> >> >> >> the
> >> >> >> >> >> > results
> >> >> >> >> >> > in *increasing* order, even though
it is finding the
> >> largest
> >> >> >> >> >> > eigenvalues/vectors
> >> >> >> >> >> > first.
> >> >> >> >> >> >
> >> >> >> >> >> > Remember that convergence is gradual,
so if you only ask
> >> for
> >> >> 3
> >> >> >> >> >> > eigevectors/values, you won't be very
accurate.  If you
> >> ask
> >> >> for
> >> >> >> 10
> >> >> >> >> or
> >> >> >> >> >> > more,
> >> >> >> >> >> > the
> >> >> >> >> >> > largest few will now be quite good.
 If you ask for 50,
> >> now
> >> >> the
> >> >> >> top
> >> >> >> >> >> 10-20
> >> >> >> >> >> > will
> >> >> >> >> >> > be *extremely* accurate, and maybe
the top 30 will still
> >> be
> >> >> >> quite
> >> >> >> >> >> good.
> >> >> >> >> >> >
> >> >> >> >> >> > Try out a non-distributed form of what
is in the
> >> >> >> >> EigenverificationJob
> >> >> >> >> >> to
> >> >> >> >> >> > re-order the output and collect how
accurate your results
> >> are
> >> >> >> (it
> >> >> >> >> >> computes
> >> >> >> >> >> > errors for you as well).
> >> >> >> >> >> >
> >> >> >> >> >> >   -jake
> >> >> >> >> >> >
> >> >> >> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
> >> >> >> >> >> >
> >> >> >> >> >> >> So, I know that MAHOUT-369 fixed
a bug with the
> >> distributed
> >> >> >> >> version
> >> >> >> >> >> of
> >> >> >> >> >> >> the
> >> >> >> >> >> >> LanczosSolver but I am experiencing
a similar problem
> >> with
> >> >> the
> >> >> >> >> >> >> non-distributed version.
> >> >> >> >> >> >>
> >> >> >> >> >> >> I send a dataset of gaussian distributed
numbers (testing
> >> >> PCA
> >> >> >> >> stuff)
> >> >> >> >> >> and
> >> >> >> >> >> >> my eigenvalues are seemingly reversed.
Below I have the
> >> >> output
> >> >> >> >> given
> >> >> >> >> >> in
> >> >> >> >> >> >> the logs from LanczosSolver.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Output:
> >> >> >> >> >> >> INFO: Eigenvector 0 found with
eigenvalue 0.0
> >> >> >> >> >> >> INFO: Eigenvector 1 found with
eigenvalue
> >> 347.8703086831804
> >> >> >> >> >> >> INFO: LanczosSolver finished.
> >> >> >> >> >> >>
> >> >> >> >> >> >> So it returns a vector with eigenvalue
0 before one with
> >> an
> >> >> >> >> >> eigenvalue
> >> >> >> >> >> >> of
> >> >> >> >> >> >> 347?. Whats more interesting is
that when I increase the
> >> >> rank,
> >> >> >> I
> >> >> >> >> get
> >> >> >> >> >> a
> >> >> >> >> >> >> new
> >> >> >> >> >> >> eigenvector with a value between
0 and 347:
> >> >> >> >> >> >>
> >> >> >> >> >> >> INFO: Eigenvector 0 found with
eigenvalue 0.0
> >> >> >> >> >> >> INFO: Eigenvector 1 found with
eigenvalue
> >> 44.794928654801566
> >> >> >> >> >> >> INFO: Eigenvector 2 found with
eigenvalue
> >> 347.8286920203704
> >> >> >> >> >> >>
> >> >> >> >> >> >> Shouldn't the eigenvalues be in
descending order? Also is
> >> >> the
> >> >> >> 0.0
> >> >> >> >> >> >> eigenvalue even valid?
> >> >> >> >> >> >>
> >> >> >> >> >> >> Thanks,
> >> >> >> >> >> >> Trevor
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message