datafu-dev mailing list archives

From "Matthew Hayes (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DATAFU-37) Add Locality Sensitive Hashing UDFs
Date Wed, 30 Apr 2014 04:08:14 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985142#comment-13985142 ]

Matthew Hayes edited comment on DATAFU-37 at 4/30/14 4:07 AM:
--------------------------------------------------------------

Something else I was wondering about while going through the code and reading the paper is
how to determine the parameters.

For CosineDistanceHash the important parameter is:
* sRepeat: Number of internal repetitions

For L1PStableHash and L2PStableHash the important parameters are:
* sW: A double representing the quantization parameter (also known as the projection width)
* sRepeat: Number of internal repetitions (generally this should be 1 as the p-stable hashes
have a larger range than one bit) 
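
To make the role of sW and sRepeat concrete, here is a minimal Python sketch of the two hash families as they are usually described in the LSH literature. It is not the code in DATAFU-37.patch: the function names, the seed argument, and the way repetitions are combined (bits packed into an integer for the cosine hash, a tuple of buckets for the p-stable hash) are assumptions made for illustration.

```python
import math
import random


def _stable_sample(rng, p):
    """Draw from a p-stable distribution: Cauchy for p=1, Gaussian for p=2."""
    if p == 1:
        return math.tan(math.pi * (rng.random() - 0.5))
    if p == 2:
        return rng.gauss(0.0, 1.0)
    raise ValueError("only p=1 and p=2 are sketched here")


def cosine_hash(vector, s_repeat, seed=0):
    """Sign random projection: each repetition contributes exactly one bit."""
    rng = random.Random(seed)
    bits = 0
    for _ in range(s_repeat):
        # Random hyperplane through the origin with Gaussian components.
        plane = [rng.gauss(0.0, 1.0) for _ in vector]
        dot = sum(a * x for a, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def p_stable_hash(vector, w, p=2, s_repeat=1, seed=0):
    """Quantized random projection: floor((a . v + b) / w) per repetition."""
    rng = random.Random(seed)
    buckets = []
    for _ in range(s_repeat):
        a = [_stable_sample(rng, p) for _ in vector]
        b = rng.uniform(0.0, w)  # random offset within one bucket width
        dot = sum(ai * xi for ai, xi in zip(a, vector))
        buckets.append(math.floor((dot + b) / w))
    return tuple(buckets)
```

In this sketch a single p-stable repetition already yields an integer bucket, whereas a single cosine repetition yields only one bit, which is why the cosine hash typically needs several repetitions and the p-stable hashes often need just one.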

You mention that the parameters should be determined empirically. I also came across a presentation
of yours in which you mention a tool that can assist in choosing the parameters. Do you think
we could estimate the parameters using a data sample and these UDFs, or do we need additional UDFs
to do that?
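
As a rough illustration of what estimating from a data sample could look like, the Python sketch below sweeps candidate projection widths and measures how often pairs of sampled vectors land in the same bucket. It is only a sketch under stated assumptions: the names l2_hash, collision_rate, and sweep_w are hypothetical, the near and far pairs are assumed to have been labelled beforehand from the sample, and this is not the tool mentioned in the presentation.

```python
import math
import random


def l2_hash(vector, w, seed):
    """One L2 p-stable hash: floor((a . v + b) / w), with a and b drawn from seed."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in vector]
    b = rng.uniform(0.0, w)
    dot = sum(ai * xi for ai, xi in zip(a, vector))
    return math.floor((dot + b) / w)


def collision_rate(pairs, w, trials=200):
    """Fraction of (pair, hash function) combinations that share a bucket."""
    hits = 0
    for seed in range(trials):
        for u, v in pairs:
            # Both vectors in a pair are hashed with the same random function.
            if l2_hash(u, w, seed) == l2_hash(v, w, seed):
                hits += 1
    return hits / (trials * len(pairs))


def sweep_w(near_pairs, far_pairs, candidates, trials=200):
    """Return (w, near collision rate, far collision rate) for each candidate w."""
    return [
        (w, collision_rate(near_pairs, w, trials), collision_rate(far_pairs, w, trials))
        for w in candidates
    ]
```

A width that keeps the near-pair rate high while keeping the far-pair rate low would be a reasonable starting point. Whether such a sweep can be expressed with the existing UDFs alone is exactly the open question above.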


> Add Locality Sensitive Hashing UDFs
> -----------------------------------
>
>                 Key: DATAFU-37
>                 URL: https://issues.apache.org/jira/browse/DATAFU-37
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>         Attachments: DATAFU-37.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Create a set of UDFs to implement [Locality Sensitive Hashing|http://en.wikipedia.org/wiki/Locality-sensitive_hashing]
> in support of finding k-nearest neighbors. Initially, hashes associated with L1, L2 and Cosine
> similarity should be supported.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
