From: "Drew Farris (JIRA)"
To: mahout-dev@lucene.apache.org
Date: Wed, 31 Mar 2010 01:30:27 +0000 (UTC)
Subject: [jira] Commented: (MAHOUT-344) Minhash based clustering
Message-ID: <1894000467.594611269999027165.JavaMail.jira@brutus.apache.org>
In-Reply-To: <1704831884.399351269243867184.JavaMail.jira@brutus.apache.org>

[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851686#action_12851686
]

Drew Farris commented on MAHOUT-344:
------------------------------------

Hi Cristi,

Sounds like a great start. Answers for a couple of your questions:

{quote}
Is there a standard formatting for the input on each clustering alg or the input format follows the same rules for all algorithms, and then the users write conversion tools which ?
{quote}

Take a look at the various Vector classes in the math module and the VectorWritable wrapper. Most of the clustering algorithms take vectors of one kind or another as input, and the assumption is that users will write tools to convert their data to these common formats. The wiki page http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html is a good place to start.

{quote}
would it be ok if I attach the code which does an example of running min-hash clustering in the examples dirs ? (it would first convert the dataset format accordingly)
{quote}

Go for it. Code is good, patches are even better; see http://cwiki.apache.org/MAHOUT/howtocontribute.html#HowToContribute-Creatingthepatchfile and simply attach the patch to this issue.

> Minhash based clustering
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high-dimensional data. The essence of the technique is to hash each item using multiple independent hash functions such that similar items collide with higher probability. Multiple such hash tables can then be constructed to answer near-neighbor queries efficiently.
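
As a rough illustration of the minhash idea described in the issue, here is a minimal, self-contained Java sketch. It is not the code in MAHOUT-344-v1.patch and it does not use Mahout's Vector or VectorWritable classes. The class name MinHashSketch, the linear hash family (a*x + b) mod p, and the seed value are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class; not part of Mahout.
final class MinHashSketch {
    // Mersenne prime 2^31 - 1, used as the modulus of the hash family (an assumption).
    private static final long PRIME = 2147483647L;

    private final long[] a; // multipliers, each in [1, PRIME - 1]
    private final long[] b; // offsets, each in [0, PRIME - 1]

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // For each hash function, keep the minimum hash value over all items in the set.
    long[] signature(Set<Integer> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = Math.floorMod((long) item, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions where two signatures agree; this estimates
    // the Jaccard similarity of the underlying sets.
    static double estimateSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                matches++;
            }
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(200, 42L);
        Set<Integer> first = new HashSet<>();
        Set<Integer> second = new HashSet<>();
        for (int i = 1; i <= 100; i++) {
            first.add(i);
        }
        for (int i = 11; i <= 110; i++) {
            second.add(i);
        }
        System.out.println(estimateSimilarity(mh.signature(first), mh.signature(second)));
    }
}
```

In this sketch, two sets that share most of their items agree on most signature positions, which is the "higher probability of collision for similar items" property. Grouping items by a few signature positions at a time would give the multiple hash tables the issue mentions for near-neighbor queries.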