spark-issues mailing list archives

From "Karl Higley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
Date Thu, 26 Nov 2015 23:21:10 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029318#comment-15029318 ]

Karl Higley edited comment on SPARK-5992 at 11/26/15 11:20 PM:
---------------------------------------------------------------

I'm a bit confused by this section of the design doc:
{quote}
It is pretty hard to define a common interface. Because LSH algorithm has two types at least.
One is to calculate hash value. The other is to calculate a similarity between a feature(vector)
and another one. 

For example, random projection algorithm is a type of calculating a similarity. It is designed
to approximate the cosine distance between vectors. On the other hand, min hash algorithm
is a type of calculating a hash value. The hash function maps a d dimensional vector onto
a set of integers.
{quote}
Sign-random-projection LSH does calculate a hash value (essentially a BitSet) for each feature
vector, and the Hamming distance between two hash values is used to estimate the cosine similarity
between the corresponding feature vectors. The two "types" of LSH mentioned here seem more
like two kinds of operations that are sometimes applied sequentially. Maybe this distinction
makes more sense for other types of LSH?
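
To make that concrete, here's a rough Scala sketch of sign random projection showing the two
operations applied one after the other: hash each vector to a bit signature, then estimate the
cosine similarity from the Hamming distance between the signatures. This is purely illustrative
and not an MLlib API; all of the names are made up for the example.
{code:scala}
import scala.util.Random

// Illustrative sketch only -- not an MLlib API; names are made up for this example.
object SignRandomProjectionSketch {

  // numBits random hyperplanes in R^dim, one sign bit per plane
  def randomHyperplanes(numBits: Int, dim: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(numBits, dim)(rng.nextGaussian())
  }

  // Operation 1: hash a feature vector to a bit signature (the "BitSet")
  def hash(v: Array[Double], planes: Array[Array[Double]]): Array[Boolean] =
    planes.map(p => p.zip(v).map { case (a, b) => a * b }.sum >= 0.0)

  // Operation 2: estimate cosine similarity from the Hamming distance between signatures.
  // P(bits differ) ~= theta / pi, so theta ~= pi * hamming / numBits and cos(theta) ~= cosine similarity.
  def estimateCosine(x: Array[Boolean], y: Array[Boolean]): Double = {
    val hamming = x.zip(y).count { case (a, b) => a != b }
    math.cos(math.Pi * hamming / x.length)
  }

  def main(args: Array[String]): Unit = {
    val planes = randomHyperplanes(numBits = 256, dim = 3, seed = 42L)
    val u = Array(1.0, 2.0, 3.0)
    val w = Array(1.0, 2.0, 2.5)
    // Should come out close to the true cosine similarity of u and w
    println(estimateCosine(hash(u, planes), hash(w, planes)))
  }
}
{code}
If I understand correctly, MinHash follows the same pattern: the hash step maps a vector to a set
of integers, and Jaccard similarity is then estimated from collisions between those hash values.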


> Locality Sensitive Hashing (LSH) for MLlib
> ------------------------------------------
>
>                 Key: SPARK-5992
>                 URL: https://issues.apache.org/jira/browse/SPARK-5992
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be great to discuss
> some possible algorithms here, choose an API, and make a PR for an initial algorithm.


