mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: LSH notes on text documents
Date Tue, 26 Apr 2011 06:12:04 GMT
Yes.  That would be a random projection that would give zero mean.

You could map to a binary space as you suggest or use a continuous random
projection.  Since you are likely mapping to a lower dimensional space to
avoid disastrous expansion of the problem, I would be tempted to use the
continuous projection to preserve information leading into the LSH.
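To make the two options concrete, here is a minimal pure-Python sketch, not Mahout code. The names `term_vector`, `project`, and `lsh_bits`, the `density` parameter, and the per-term seeding scheme are all assumptions made for illustration. Each term gets a reproducible random vector, either Gaussian (continuous) or drawn from {0, +1, -1}, and a document is the count-weighted sum of its term vectors. Both variants have zero-mean coordinates, so plain sign-of-dot-product hashing can be applied afterwards.

```python
import random
from collections import Counter

def term_vector(term, dim, continuous=True, density=0.1, seed=0):
    """Pseudo-random vector for a term, reproducible from (seed, term)."""
    rng = random.Random(f"{seed}:{term}")
    if continuous:
        # Continuous option: i.i.d. standard normal entries (zero mean).
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Binary-style option: entries from {0, +1, -1}, 0 most probable,
    # +1 and -1 equally likely so the mean is still zero.
    vec = []
    for _ in range(dim):
        u = rng.random()
        vec.append(1.0 if u < density / 2 else -1.0 if u < density else 0.0)
    return vec

def project(doc_terms, dim, **kwargs):
    """Map a bag of terms to `dim` dimensions: count-weighted sum of term vectors."""
    acc = [0.0] * dim
    for term, count in Counter(doc_terms).items():
        for i, v in enumerate(term_vector(term, dim, **kwargs)):
            acc[i] += count * v
    return acc

def lsh_bits(vec, n_bits, seed=1):
    """Sign of the dot product with `n_bits` random Gaussian normals."""
    rng = random.Random(seed)
    bits = []
    for _ in range(n_bits):
        normal = [rng.gauss(0.0, 1.0) for _ in vec]
        bits.append(1 if sum(n * x for n, x in zip(normal, vec)) >= 0 else 0)
    return bits

doc = "the quick brown fox jumps over the lazy dog".split()
dense = project(doc, dim=32)                      # continuous projection
sparse = project(doc, dim=32, continuous=False)   # {0, +1, -1} projection
bits = lsh_bits(dense, n_bits=16)
```

Seeding the generator from the term itself means no term-to-vector table has to be stored, which is one way to keep the projection cheap for a large vocabulary.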

It would also be interesting to do one round of cooccurrence training a la
semantic indexing.  That would make the LSH vectors be a bit more semantic.
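One possible reading of a single cooccurrence round, sketched in the style of random indexing, is below. This is an assumption about what such a round might look like, not a description of any existing Mahout feature; `index_vector` and `cooccurrence_round` are hypothetical names. Each document vector is the sum of its terms' random index vectors, and each term's trained vector is then the sum of the vectors of the documents that contain it, so terms that share documents end up with similar vectors.

```python
import random
from collections import defaultdict

def index_vector(term, dim, seed=0):
    """Reproducible Gaussian index vector for a term (hypothetical helper)."""
    rng = random.Random(f"{seed}:{term}")
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def cooccurrence_round(docs, dim, seed=0):
    """One document-level cooccurrence pass over a list of tokenized docs."""
    trained = defaultdict(lambda: [0.0] * dim)
    for doc in docs:
        # Document vector: sum of the index vectors of its terms.
        doc_vec = [0.0] * dim
        for term in doc:
            for i, v in enumerate(index_vector(term, dim, seed)):
                doc_vec[i] += v
        # Each distinct term in the document absorbs the document vector.
        for term in set(doc):
            target = trained[term]
            for i in range(dim):
                target[i] += doc_vec[i]
    return dict(trained)

docs = [["cat", "pet"], ["dog", "pet"], ["stock", "bond"]]
trained = cooccurrence_round(docs, dim=16)
```

The trained term vectors could then replace the raw random vectors when building document vectors ahead of the LSH step.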

On Mon, Apr 25, 2011 at 10:38 PM, Randall McRee <randall.mcree@gmail.com> wrote:

> Ted,
> Seems like this is not a problem if you choose to map docs into an LSI-like
> vector space, namely instead of assigning each term its own dimension,
> assign a term to a sparse vector chosen from {0,1,-1} randomly (0 is most
> probable). Problem solved, I think?
>
> Randy
>
> On Mon, Apr 25, 2011 at 3:11 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > Btw... LSH came up recently (thanks Lance!).
> >
> > One wrinkle that should be mentioned that might catch somebody
> > implementing this unawares is that documents in a vector space model
> > have highly non-random distributions that make the default formulation
> > of LSH very bad.
> >
> > The problem is that document vectors are normally confined to the
> > positive orthant.  That means that a random hyper-plane has a very low
> > chance of splitting any two documents and thus picking random vectors
> > as normals is a really bad way to get hash functions.
> >
> > This problem can be solved easily enough by choosing separating planes
> > as follows: pick two points at random without replacement and use their
> > difference as the normal vector for the separating plane.  This can be
> > shown to give a hashing function that has the requisite 50% probability
> > of being positive for any document.
> >
>
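The construction quoted above (the difference of two randomly chosen documents as the plane normal) could be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not code from Mahout: `difference_normal`, `hash_bit`, and `signature` are hypothetical names, documents are dense lists of non-negative weights, and each plane is assumed to pass through the origin.

```python
import random

def difference_normal(docs, rng):
    """Normal vector a - b for two distinct documents sampled without replacement."""
    a, b = rng.sample(docs, 2)
    return [x - y for x, y in zip(a, b)]

def hash_bit(normal, doc):
    """1 if the document lies on the positive side of the plane, else 0."""
    return 1 if sum(n * x for n, x in zip(normal, doc)) > 0 else 0

def signature(doc, normals):
    """Concatenate one bit per separating plane."""
    return [hash_bit(n, doc) for n in normals]

# Toy corpus: non-negative term weights, i.e. vectors in the positive orthant.
rng = random.Random(42)
corpus = [[rng.random() if rng.random() < 0.3 else 0.0 for _ in range(20)]
          for _ in range(50)]

normals = [difference_normal(corpus, rng) for _ in range(16)]
signatures = [signature(doc, normals) for doc in corpus]
```

Swapping the two sampled points negates the normal, and both orderings are equally likely; that symmetry is what lies behind the 50% claim whenever the dot product is non-zero.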
