mahout-dev mailing list archives

Site index · List index
Message view
Top
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: LSH notes on text documents
Date Mon, 25 Apr 2011 23:16:18 GMT
Btw... for the less jargon inclined:

positive orthant = set of all points with only positive coordinate values.
The term is by analogy with the 2-dimensional phrase positive QUADrant and
3-dimensional phrase positive OCTant.

normal vector = the orientation of any plane in hyper-space can be defined
by the normal vector which is the vector at right angles to the plane.  If
the normal vector is v, then x \dot v = r defines the plane where r is a
real number that measures the closest approach of the plane to the origin in
the direction of v.

LSH = a way of encoding points as binary strings by using a sequence of
planes.  The bits for point x are defined by {... x \dot v_i > r_i ...} for
each plane i with normal vector v_i and threshold r_i.  If these bits are
independent and each have probability 0.5, then the expected number of bits
in common between two encoded points is proportional to the normalized dot
product of the vectors.  If you have lots of such bits, then the number of
common bits is an accurate way to estimate the normalized dot product.

On Mon, Apr 25, 2011 at 4:08 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I thought that for a moment as well, but now I think not.
>
> The trivial counter example consists of one element for each single
> component unit vector e_i.  The SVD of this has unit singular values and the
> eigenvectors are just the e_i themselves.  After transformation, all values
> are still in the positive orthant.
>
> I may have misunderstood what you are suggesting, but I don't think that
> LSH could distribute the values into other orthants.
>
> It may be that in practice that SVD will scatter data into many orthants,
> but I suspect it will not spread the data as widely as LSH assumptions would
> like.  On the other hand, the two point hack that I mentioned guarantees
> that it meets the LSH assumptions so it isn't a failure of LSH, just that a
> random selection of hash functions is sub-optimal for data consisting of
> only positive values.
>
>
> On Mon, Apr 25, 2011 at 3:45 PM, Jake Mannix <jake.mannix@gmail.com>wrote:
>
>> If you run LSH over the SVD-projections of your vectors, on the other
>> hand,
>> you have not only distributed them in all orthants, but made the
>> distribution mostly uniform on the large scale, on account of the singular
>> value rescaling.
>>
>> On Apr 25, 2011 3:12 PM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>>
>> Btw... LSH came up recently (thanks Lance!).
>>
>> One wrinkle that should be mentioned that might catch somebody
>> implementing
>> this unawares is
>> that documents in a vector space model have highly non-random
>> distributions
>> that make the default
>> formulation of LSH very bad.
>>
>> The problem is that document vectors are normally confined to the positive
>> orthant.  That means that
>> a random hyper-plane has a very low chance of splitting any to documents
>> and
>> thus picking random
>> vectors as normals is a really bad way to get hash functions.
>>
>> This problem can be solved easily enough by picking separating planes by
>> picking two points at random
>> without replacement and using their difference as the normal vector for
>> the
>> separating plane.  This
>> can be shown to give a hashing funcction that has the requisite 50%
>> probability of being positive for
>> any document.
>>
>
>


Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message