mahout-user mailing list archives

From tr...@cs.drexel.edu
Subject Re: LanczosSVD and Eigenvalues
Date Thu, 23 Jun 2011 18:58:09 GMT
Alright, thank you again, that makes it a little easier. Now to see about
implementing PCA as described in "Map-Reduce for Machine Learning on
Multicore" - once I can get it to yield the correct covariance matrix, of
course. If this gets done I may be able to contribute it back to Mahout
(if my employer agrees), assuming someone else doesn't get PCA done first.
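
As a sanity check for that step, here is a minimal single-pass sketch of
the covariance computation (plain Java, no Mahout types; the class and
method names are placeholders). It is the mean-centered variant of the
outer-product sum Jake describes further down - without the centering,
the same loop yields A^T A rather than the covariance matrix:

// Sketch: accumulate sum(x x^T) and sum(x) in one pass, then form
// cov = E[x x^T] - mean mean^T (population covariance).
// Assumes at least one row.
public class CovarianceSketch {
  public static double[][] covariance(double[][] rows) {
    int n = rows.length;
    int d = rows[0].length;
    double[][] outer = new double[d][d]; // running sum of x x^T
    double[] sum = new double[d];        // running sum of x
    for (double[] x : rows) {
      for (int i = 0; i < d; i++) {
        sum[i] += x[i];
        for (int j = 0; j < d; j++) {
          outer[i][j] += x[i] * x[j];
        }
      }
    }
    double[][] cov = new double[d][d];
    for (int i = 0; i < d; i++) {
      for (int j = 0; j < d; j++) {
        cov[i][j] = outer[i][j] / n - (sum[i] / n) * (sum[j] / n);
      }
    }
    return cov;
  }
}

The (outer, sum, n) triple is also what an m/r version could emit from each
mapper, with the reducer adding the partials element-wise before the final
division.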

-Trevor

> Btw.. the JIRA involved was
> https://issues.apache.org/jira/browse/MAHOUT-376
>
> On Thu, Jun 23, 2011 at 11:44 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
>> If you don't need all 5000 singular values, then you can directly use
>> the stochastic decomposition algorithms in Mahout.
>>
>> If you do want all 5000 singular values, then you can probably use all
>> but the first and last few steps of the stochastic decomposition
>> algorithm to get what you need.  If you look back at the pertinent JIRAs
>> you will see some PDF documents that describe my approach and Dmitriy's
>> approach.  His approach is what we have now since he did all the work.
>> Either should work for you.
>>
>>
>> 2011/6/23 <tra26@cs.drexel.edu>
>>
>>> I will, if it works I may have to make an m/r job for it. All the data
>>> we have will be tall and dense (let's say 5000 columns, with millions
>>> of rows). Now doing PCA on that will create a covariance matrix that is
>>> square and dense. Thanks again guys.
>>>
>>> -Trevor
>>>
>>> > Try the QR trick.  It is amazingly effective.
>>> >
>>> > 2011/6/23 <tra26@cs.drexel.edu>
>>> >
>>> >> Alright, thanks guys.
>>> >>
>>> >> > The cases where Lanczos or the stochastic projection helps are
>>> >> > cases where you have *many* columns but where the data are sparse.
>>> >> > If you have a very tall dense matrix, the QR method is much
>>> >> > preferred.
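
For reference, my reading of the QR trick Ted is recommending (standard
tall-skinny reasoning, not spelled out in this thread): for a tall dense
A (m x n with m >> n), take a thin QR factorization and then an SVD of
the small n x n triangular factor,

    A = QR,   R = U_R \Sigma V^T   =>   A = (Q U_R) \Sigma V^T

so the singular values and right-singular vectors of A are exactly those
of R, and the only dense decomposition ever performed is on an n x n
matrix.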
>>> >> >
>>> >> > 2011/6/23 <tra26@cs.drexel.edu>
>>> >> >
>>> >> >> Ok, then what would you think to be the minimum number of columns
>>> >> >> in the dataset for Lanczos to give a reasonable result?
>>> >> >>
>>> >> >> Thanks,
>>> >> >> -Trevor
>>> >> >>
>>> >> >> > A gazillion rows of 2-columned data is really much better
>>> >> >> > suited to doing the following:
>>> >> >> >
>>> >> >> > if each row is of the form [a, b], then compute the matrix
>>> >> >> >
>>> >> >> > [[a*a, a*b], [a*b, b*b]]
>>> >> >> >
>>> >> >> > (the outer product of the vector with itself)
>>> >> >> >
>>> >> >> > Then take the matrix sum of all of these, from each row of your
>>> >> >> > input matrix.
>>> >> >> >
>>> >> >> > You'll now have a 2x2 matrix, which you can diagonalize by hand.
>>> >> >> > It will give you your eigenvalues, and also the right-singular
>>> >> >> > vectors of your original matrix.
>>> >> >> >
>>> >> >> >   -jake
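
Since "diagonalize by hand" is doing some work here: for the symmetric
2x2 matrix [[p, q], [q, r]] that the summed outer products produce, the
eigenpairs have a closed form. A minimal sketch (plain Java; the class
name is a placeholder):

// Closed-form eigendecomposition of the symmetric 2x2 matrix
// [[p, q], [q, r]].
public class Eigen2x2 {
  // Returns {lambdaMax, lambdaMin}.
  public static double[] eigenvalues(double p, double q, double r) {
    double mean = (p + r) / 2.0;
    double radius = Math.sqrt((p - r) * (p - r) / 4.0 + q * q);
    return new double[] {mean + radius, mean - radius};
  }

  // Unnormalized eigenvector for eigenvalue lambda: the first row of
  // (A - lambda I) v = 0 gives v = (q, lambda - p) when q != 0.
  public static double[] eigenvector(double p, double q, double lambda) {
    if (q != 0.0) {
      return new double[] {q, lambda - p};
    }
    // q == 0: the matrix is diagonal, eigenvectors are the axes.
    return (lambda == p) ? new double[] {1.0, 0.0} : new double[] {0.0, 1.0};
  }
}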
>>> >> >> >
>>> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
>>> >> >> >
>>> >> >> >> Yes, exactly why I asked it for only 2 eigenvalues.  So what
>>> >> >> >> is being said is that if I have, let's say, 50M rows of
>>> >> >> >> 2-columned data, Lanczos can't do anything with it (assuming it
>>> >> >> >> puts the 0 eigenvalue in the mix - of the 2 eigenvectors only 1
>>> >> >> >> is returned because of the 0 eigenvalue taking up a slot)?
>>> >> >> >>
>>> >> >> >> If the eigenvalue of 0 is invalid, then should it not be
>>> >> >> >> filtered out so that it returns "rank" number of eigenvalues
>>> >> >> >> that could be valid?
>>> >> >> >>
>>> >> >> >> -Trevor
>>> >> >> >>
>>> >> >> >> > Ah, if your matrix only has 2 columns, you can't go to
>>> >> >> >> > rank 10.  Try on some slightly less synthetic data of more
>>> >> >> >> > than rank 10.  You can't ask Lanczos for more reduced rank
>>> >> >> >> > than that of the matrix itself.
>>> >> >> >> >
>>> >> >> >> >   -jake
>>> >> >> >> >
>>> >> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
>>> >> >> >> >
>>> >> >> >> >> Alright, I can reorder, that is easy; I just had to verify
>>> >> >> >> >> that the ordering was correct.  So when I increase the
>>> >> >> >> >> requested rank I get Lanczos bailing out, which incidentally
>>> >> >> >> >> causes a NullPointerException:
>>> >> >> >> >>
>>> >> >> >> >> INFO: 9 passes through the corpus so far...
>>> >> >> >> >> WARNING: Lanczos parameters out of range: alpha = NaN,
>>> >> >> >> >> beta = NaN.  Bailing out early!
>>> >> >> >> >> INFO: Lanczos iteration complete - now to diagonalize the
>>> >> >> >> >> tri-diagonal auxiliary matrix.
>>> >> >> >> >> Exception in thread "main" java.lang.NullPointerException
>>> >> >> >> >>        at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>>> >> >> >> >>        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>>> >> >> >> >>        at pca.PCASolver.solve(PCASolver.java:53)
>>> >> >> >> >>        at pca.PCA.main(PCA.java:20)
>>> >> >> >> >>
>>> >> >> >> >> So I should probably note that my data only has 2 columns;
>>> >> >> >> >> the real data will have quite a bit more.
>>> >> >> >> >>
>>> >> >> >> >> The failure happens with rank 10 and above, with the last,
>>> >> >> >> >> and therefore most significant, eigenvector being <NaN,NaN>.
>>> >> >> >> >>
>>> >> >> >> >> -Trevor
>>> >> >> >> >> > The 0 eigenvalue output is not valid, and yes, the output
>>> >> >> >> >> > will list the results in *increasing* order, even though
>>> >> >> >> >> > it is finding the largest eigenvalues/vectors first.
>>> >> >> >> >> >
>>> >> >> >> >> > Remember that convergence is gradual, so if you only ask
>>> >> >> >> >> > for 3 eigenvectors/values, you won't be very accurate.  If
>>> >> >> >> >> > you ask for 10 or more, the largest few will now be quite
>>> >> >> >> >> > good.  If you ask for 50, now the top 10-20 will be
>>> >> >> >> >> > *extremely* accurate, and maybe the top 30 will still be
>>> >> >> >> >> > quite good.
>>> >> >> >> >> >
>>> >> >> >> >> > Try out a non-distributed form of what is in the
>>> >> >> >> >> > EigenVerificationJob to re-order the output and collect
>>> >> >> >> >> > how accurate your results are (it computes errors for you
>>> >> >> >> >> > as well).
>>> >> >> >> >> >
>>> >> >> >> >> >   -jake
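
The re-ordering and filtering part of that is small enough to inline; a
minimal non-distributed sketch (plain Java; it only sorts by descending
eigenvalue and drops near-zero values, and does not compute the error
estimates the job computes):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class EigenReorder {
  // One (eigenvalue, eigenvector) pair as returned by the solver.
  public static final class EigenPair {
    public final double value;
    public final double[] vector;
    public EigenPair(double value, double[] vector) {
      this.value = value;
      this.vector = vector;
    }
  }

  // Drop near-zero (spurious) pairs, then sort descending by eigenvalue.
  public static List<EigenPair> cleanUp(List<EigenPair> raw, double minEigenvalue) {
    List<EigenPair> kept = new ArrayList<EigenPair>();
    for (EigenPair p : raw) {
      if (p.value > minEigenvalue) {
        kept.add(p);
      }
    }
    Collections.sort(kept, new Comparator<EigenPair>() {
      public int compare(EigenPair a, EigenPair b) {
        return Double.compare(b.value, a.value); // largest first
      }
    });
    return kept;
  }
}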
>>> >> >> >> >> >
>>> >> >> >> >> > 2011/6/23 <tra26@cs.drexel.edu>
>>> >> >> >> >> >
>>> >> >> >> >> >> So, I know that MAHOUT-369 fixed a bug with the
>>> >> >> >> >> >> distributed version of the LanczosSolver, but I am
>>> >> >> >> >> >> experiencing a similar problem with the non-distributed
>>> >> >> >> >> >> version.
>>> >> >> >> >> >>
>>> >> >> >> >> >> I send a dataset of Gaussian-distributed numbers (testing
>>> >> >> >> >> >> PCA stuff) and my eigenvalues are seemingly reversed.
>>> >> >> >> >> >> Below I have the output given in the logs from
>>> >> >> >> >> >> LanczosSolver.
>>> >> >> >> >> >>
>>> >> >> >> >> >> Output:
>>> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>>> >> >> >> >> >> INFO: LanczosSolver finished.
>>> >> >> >> >> >>
>>> >> >> >> >> >> So it returns a vector with eigenvalue 0 before one with
>>> >> >> >> >> >> an eigenvalue of 347?  What's more interesting is that
>>> >> >> >> >> >> when I increase the rank, I get a new eigenvector with a
>>> >> >> >> >> >> value between 0 and 347:
>>> >> >> >> >> >>
>>> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>>> >> >> >> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>>> >> >> >> >> >>
>>> >> >> >> >> >> Shouldn't the eigenvalues be in descending order?  Also,
>>> >> >> >> >> >> is the 0.0 eigenvalue even valid?
>>> >> >> >> >> >>
>>> >> >> >> >> >> Thanks,
>>> >> >> >> >> >> Trevor