Date: Wed, 30 Apr 2014 04:08:14 +0000 (UTC)
From: "Matthew Hayes (JIRA)"
To: dev@datafu.incubator.apache.org
Reply-To: dev@datafu.incubator.apache.org
Subject: [jira] [Comment Edited] (DATAFU-37) Add Locality Sensitive Hashing UDFs

[
https://issues.apache.org/jira/browse/DATAFU-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985142#comment-13985142 ]

Matthew Hayes edited comment on DATAFU-37 at 4/30/14 4:07 AM:
--------------------------------------------------------------

Something else I was wondering about when going through the code and reading the paper is how to determine the parameters.

For CosineDistanceHash the important parameter is:
* sRepeat: Number of internal repetitions

For L1PStableHash and L2PStableHash the important parameters are:
* sW: A double representing the quantization parameter (also known as the projection width)
* sRepeat: Number of internal repetitions (generally this should be 1 as the p-stable hashes have a larger range than one bit)

You mention that the parameters should be determined empirically. I also came across a presentation you did where you mention a tool that can assist in choosing the parameters. Do you think we could estimate parameters using a data sample and these UDFs or do we need additional UDFs to do that?

was (Author: matterhayes):

Something else I was wondering about when going through the code and reading the paper is how to determine the parameters.

For CosineDistanceHash the important parameter is:
* sRepeat: Number of internal repetitions

For L1PStableHash and L2PStableHash the important parameters are:
* sW: A double representing the quantization parameter (also known as the projection width)
* sRepeat: Number of internal repetitions (generally this should be 1 as the p-stable hashes have a larger range than one bit)

You mention that the parameters should be determined empirically. I also came across a presentation you did, file:///Users/mhayes/Downloads/presentation.pdf , where you mention a tool that can assist in choosing the parameters. Do you think we could estimate parameters using a data sample and these UDFs or do we need additional UDFs to do that?
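To make the role of the parameters concrete, here is a rough Python sketch of the textbook hash families, not the code in the DATAFU-37 patch. It assumes the standard L2 p-stable form h(v) = floor((a . v + b) / w), with w playing the role of sW, and a sign-of-random-projection hash where `repeat` plays the role of sRepeat. All function names are hypothetical. `collision_rate` shows one way the projection width might be examined empirically on a sample pair.

```python
import math
import random


def make_l2_hash(dim, w, rng):
    """One Gaussian (2-stable) hash: h(v) = floor((a . v + b) / w).

    `w` is the quantization / projection width (assumed analogous to sW).
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        return math.floor((sum(x * y for x, y in zip(a, v)) + b) / w)

    return h


def make_cosine_hash(dim, repeat, rng):
    """Pack `repeat` sign-of-random-projection bits into one integer.

    `repeat` is assumed analogous to sRepeat: each projection yields one bit.
    """
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(repeat)]

    def h(v):
        bits = 0
        for p in planes:
            dot = sum(x * y for x, y in zip(p, v))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    return h


def collision_rate(u, v, dim, w, trials, seed=0):
    """Fraction of independently drawn L2 hashes under which u and v collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_l2_hash(dim, w, rng)
        hits += h(u) == h(v)
    return hits / trials


if __name__ == "__main__":
    origin = [0.0] * 8
    near = [0.1] + [0.0] * 7   # L2 distance 0.1 from origin
    far = [10.0] + [0.0] * 7   # L2 distance 10 from origin
    # Sweep candidate widths and compare near-pair vs far-pair collision rates.
    for w in (0.25, 1.0, 4.0, 16.0):
        print(w, collision_rate(origin, near, 8, w, 2000),
              collision_rate(origin, far, 8, w, 2000))
```

Under these assumptions, a workable width is one where near pairs collide often and far pairs rarely. Running the same sweep over pairs drawn from a real data sample is one way parameters could be estimated without additional UDFs, though whether that is sufficient in practice is exactly the open question above.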
> Add Locality Sensitive Hashing UDFs
> -----------------------------------
>
>                 Key: DATAFU-37
>                 URL: https://issues.apache.org/jira/browse/DATAFU-37
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Casey Stella
>            Assignee: Casey Stella
>         Attachments: DATAFU-37.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Create a set of UDFs to implement [Locality Sensitive Hashing|http://en.wikipedia.org/wiki/Locality-sensitive_hashing] in support of finding k-near neighbors. Initially, hashes associated with L1, L2 and Cosine similarity should be supported.

--
This message was sent by Atlassian JIRA
(v6.2#6252)