mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: eigendecomposition of very large matrices
Date Fri, 20 Jul 2012 02:34:40 GMT
I don't see a direct linear connection between dimensionality and number
of clusters.  To wit, there are 2^d corners of a hypercube of dimension d.
Put differently, the quasi-orthogonal packing capacity of d-dimensional
space is considerably larger than d for d > 30 or so; in fact, the
capacity grows exponentially with d.
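
To make the quasi-orthogonality point concrete, here is a small,
self-contained illustration (plain Java, no Mahout dependency; the class
name and constants are mine): draw random unit vectors in d = 1000
dimensions and observe that every pair is nearly perpendicular.

import java.util.Random;

// Independent random unit vectors in high dimension are nearly
// orthogonal: their cosine similarity concentrates around 0 at a
// scale of roughly 1/sqrt(d), so you can pack far more than d
// "almost orthogonal" directions into d dimensions.
public class QuasiOrthogonality {
  public static void main(String[] args) {
    int d = 1000;          // dimensionality
    int n = 200;           // number of random directions to compare
    Random rnd = new Random(42);
    double[][] v = new double[n][d];
    for (int i = 0; i < n; i++) {
      double norm = 0;
      for (int j = 0; j < d; j++) {
        v[i][j] = rnd.nextGaussian();
        norm += v[i][j] * v[i][j];
      }
      norm = Math.sqrt(norm);
      for (int j = 0; j < d; j++) {
        v[i][j] /= norm;   // normalize to unit length
      }
    }
    double maxAbsCos = 0;
    for (int i = 0; i < n; i++) {
      for (int k = i + 1; k < n; k++) {
        double dot = 0;
        for (int j = 0; j < d; j++) {
          dot += v[i][j] * v[k][j];
        }
        maxAbsCos = Math.max(maxAbsCos, Math.abs(dot));
      }
    }
    // Typically prints a max around 0.14 for d=1000: all 200
    // directions are within about 8 degrees of perpendicular.
    System.out.println("max |cos| over all pairs: " + maxAbsCos);
  }
}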

Likewise, you can approximate pairwise distances among n points very well
with dimension proportional to log n.  This implies that an orthogonal
basis of size C log n for a small constant C should suffice to describe
your data.  The natural log of 10^6 is only about 14.
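
For reference, the standard Johnson-Lindenstrauss statement behind this
(the usual textbook bound, not something from this thread): n points can
be mapped into

  k = O(log n / eps^2)

dimensions while distorting every pairwise distance by at most a factor
of 1 +/- eps.  With n = 10^6, ln n ~= 13.8, so even with a generous
constant a few hundred dimensions suffice for moderate eps.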

These two arguments (quasi-orthogonality and Johnson-Lindenstrauss) come to
the same conclusion from opposite directions.  I have a hard time believing
that you need more than 200 dimensions here.  This is especially true since
you have such sparse vectors.

On Thu, Jul 19, 2012 at 7:25 PM, Aniruddha Basak <t-abasak@expedia.com> wrote:

> Hi Ted,
> Thanks for your reply.
> I am clustering 10^6 objects (hence an affinity matrix of that size) and
> expect 4,000-10,000 clusters. That's why I need that many eigenvectors.
>
> Will SVD be faster in this case?
>
> Aniruddha
>
>
>
> On Jul 19, 2012, at 7:20 PM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>
> > Folks have done SVD on very large matrices with Mahout, but not
> > necessarily for spectral clustering.
> >
> > Are you sure that you actually need 4000 vectors?  As sparse as your data
> > is, I would expect that no more than a few hundred are anything but
> > statistical noise.
> >
> > On Thu, Jul 19, 2012 at 6:32 PM, Aniruddha Basak <t-abasak@expedia.com> wrote:
> >
> >> Thanks Dmitriy for your reply.
> >> The matrix I am working on has 10-20 non-zero entries per row, so it's
> >> very sparse.
> >> I am trying to do spectral clustering, which involves
> >> eigen-decomposition. I am wondering whether anyone has tried to do
> >> spectral clustering with Mahout for a very large affinity matrix (input).
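> >>
> >> For reference, the standard normalized spectral clustering recipe
> >> (Ng-Jordan-Weiss style; a sketch only, not necessarily Mahout's exact
> >> pipeline):
> >>
> >>   1. Given the n x n sparse affinity matrix A, form D = diag(row sums of A).
> >>   2. Compute the normalized matrix L = D^{-1/2} A D^{-1/2}.
> >>   3. Take the top k eigenvectors of L as the columns of an n x k matrix U.
> >>   4. Normalize each row of U to unit length.
> >>   5. Run k-means on the rows of U; point i gets the cluster of row i.
> >>
> >> Step 3 is the eigendecomposition discussed in this thread.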
> >>
> >> Aniruddha
> >>
> >>
> >> -----Original Message-----
> >> From: Dmitriy Lyubimov [mailto:dlieu.7@gmail.com]
> >> Sent: Thursday, July 19, 2012 6:28 PM
> >> To: user@mahout.apache.org
> >> Subject: Re: eigendecomposition of very large matrices
> >>
> >> Very significant sparsity may be a problem, though, for -q >= 1. Again,
> >> it depends on the hardware you have and the number of non-zero elements
> >> in the input, but -q=1 is still the most recommended setting here.
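> >>
> >> For concreteness, a typical SSVD invocation might look like the
> >> following (a sketch only: the paths and values here are made up, and
> >> the option names should be verified against `mahout ssvd --help`):
> >>
> >>   # -k = decomposition rank, -p = oversampling, -q = power iterations
> >>   mahout ssvd -i hdfs:///path/to/affinity -o hdfs:///path/to/ssvd-out \
> >>     -k 200 -p 15 -q 1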
> >>
> >>
> >> On Thu, Jul 19, 2012 at 6:20 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> >> wrote:
> >>> You may try SSVD:
> >>> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> >>>
> >>> But 4k eigenvectors (or, rather, singular values) is still a lot and
> >>> may push the precision outside the error estimates; I don't think we
> >>> have a precision study for that many. You'd also need quite a bit of
> >>> memory to compute that (not to mention flops). More realistically, you
> >>> might try 1k singular values. You may try more if you have access to
> >>> more powerful hardware than we did in the studies, but the distributed
> >>> computation time grows at about k^1.5, i.e. faster than linearly, even
> >>> if you have enough nodes for the tasks.
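> >>>
> >>> To put numbers on that scaling: going from k = 1,000 to k = 4,000
> >>> multiplies the distributed running time by roughly
> >>> (4000/1000)^1.5 = 4^1.5 = 8, no matter how many nodes you add.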
> >>>
> >>> -d
> >>>
> >>> On Thu, Jul 19, 2012 at 6:12 PM, Aniruddha Basak <t-abasak@expedia.com> wrote:
> >>>> Hi,
> >>>> I am working on a clustering problem that involves determining the
> >>>> largest "k" eigenvectors of a very large matrix. The matrices I work
> >>>> on are typically of the order of 10^6 by 10^6.
> >>>> Trying to do this using the Lanczos solver available in Mahout, I
> >>>> found it is very slow: it takes around 1.5 minutes to compute each
> >>>> eigenvector. Hence, to get 4000 eigenvectors, it would take about
> >>>> 100 hours, or 4 days!
> >>>>
> >>>> So I am looking for something faster to solve the eigendecomposition
> >>>> problem for a very large sparse matrix. Please suggest what I should
> >>>> use.
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Aniruddha
> >>>>
> >>
>
