mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Methods for Naming Clusters
Date Mon, 04 Jan 2010 22:55:01 GMT
Hmm... in my experience, the degree to which I've found SVD useful lies
primarily in the degree to which the metric is *not* preserved. That's the
whole point, or else you get very little out of it: you trade a
high-dimensional sparse computation for a low-dimensional dense one, and if
you exactly preserved the metric you would basically gain nothing.

When you notice that for text, ngrams like "software engineer" are now
considerably closer to "c++ developer" than to other ngrams, this gives you
information.  You don't get that information from a random projection.
You'll get some of that information from A'AR, because you get second-order
correlations, but then you're still losing all correlations beyond
second-order (and a true eigenvector is getting you the full infinite series
of correlations, properly weighted).
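
To make the comparison concrete, here is a minimal NumPy sketch on a made-up
corpus. Everything in it is an assumption for illustration: the six terms,
the document counts, the convention that A is documents x terms, the Gaussian
random matrices, and k = 256 (chosen large only to keep the noise low, so for
A'AR it is not actually a reduction of this tiny vocabulary). The two job
titles never share a document, but both co-occur with "compiler".

```python
import numpy as np

rng = np.random.default_rng(0)

terms = ["software engineer", "c++ developer", "compiler",
         "pastry chef", "baker", "oven"]

def docs(term_ids, n):
    """Return n identical document rows containing the given terms."""
    row = np.zeros(len(terms))
    row[term_ids] = 1.0
    return np.tile(row, (n, 1))

# Hypothetical corpus, documents x terms.
A = np.vstack([
    docs([0, 2], 200),   # software engineer + compiler
    docs([1, 2], 200),   # c++ developer + compiler
    docs([3, 5], 150),   # pastry chef + oven
    docs([4, 5], 150),   # baker + oven
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

k = 256
n_docs, n_terms = A.shape

embeddings = {
    # Term vectors in the original document space.
    "raw": A.T,
    # Plain random projection of those term vectors (over the document axis).
    "random projection": A.T @ rng.standard_normal((n_docs, k)),
    # A'AR: random projection of the term-term co-occurrence matrix A'A.
    "A'AR": (A.T @ A) @ rng.standard_normal((n_terms, k)),
}

# Rank-2 truncated SVD: term embeddings are the top right singular
# vectors scaled by their singular values.
_, s, Vt = np.linalg.svd(A, full_matrices=False)
embeddings["rank-2 SVD"] = Vt[:2].T * s[:2]

# Cosine of "software engineer" against "c++ developer" and "pastry chef".
sims = {name: (cosine(E[0], E[1]), cosine(E[0], E[3]))
        for name, E in embeddings.items()}

for name, (se_cpp, se_chef) in sims.items():
    print(f"{name:18s} se~c++dev={se_cpp:+.2f}  se~chef={se_chef:+.2f}")
```

On this toy data the raw cosine between the two job titles is exactly zero,
the random projection keeps it near zero, A'AR lifts it to roughly one half
(the shared "compiler" co-occurrence), and the rank-2 SVD makes the two
titles essentially parallel while leaving "pastry chef" orthogonal.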

I mean, I guess you can use SVD purely for dimensionality reduction, but like
you say, reduction can be done in lots of other, more efficient ways.  Doing
it with a reduction which enhances co-occurrence relationships and distorts
the metric to produce better clusters than when you started is something
that SVD, NMF, and LDA were designed for.

Maybe I'm missing your point?

  -jake

On Mon, Jan 4, 2010 at 2:44 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> SVD is (approximately) metric-preserving while also dimensionality
> reducing.  If you use A'AR instead of the actual term eigenvectors you
> should get similar results.
>
> On Mon, Jan 4, 2010 at 2:21 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
>
> > Ted, how would just doing a random projection do the right thing?  It's a
> > basically metric-preserving technique, and one of the primary reasons to
> > *do* LSA is to use a *different* metric (one in which "similar" terms are
> > nearer to each other than would be otherwise imagined).
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
